Wednesday, August 9, 2023

Tuning Postgres and the Insert Benchmark, round 6

This continues the work from round 5, where I try to find a better configuration for Postgres 15.2 and 15.3 when running the Insert Benchmark. Once I settle on one or a few config files I can compare Postgres 15 with the version 16 beta releases.

tl;dr

  • This has been more confusing than I hoped. I number the configurations and am up to number 28, but it isn't clear that I found a significant improvement over the best config from round 5. However, I will probably select two configurations that combine the better config options, a27 for cached workloads and a28 for IO-bound workloads, even if they don't improve on a16.
  • For the IO-bound setup and the l.i1 benchmark step there is far too much variance in the per-client time to finish the benchmark step. The a24 config was the best, but most of the configs were so bad that I consider them a total failure of the benchmark, which makes much of the evaluation that follows pointless. I am starting to agree with the OtterTune claim about which storage engine has the worst implementation of MVCC GC when confronted with a write-heavy workload. I have put far more effort into tuning Postgres than into InnoDB and MyRocks, but the results just aren't good.
  • I am surprised that increasing autovacuum_work_mem from the default (64MB) to 256MB doesn't help for either the cached or IO-bound setups
  • I am surprised that the a3 config that adds full_page_writes=off hurts write throughput
  • Using wal_compression=lz4, the a1 config, helps on the cached setup but hurts on the IO-bound setup
  • Results for the a25 config that uses a smaller buffer pool (25% of RAM) are bad
MyRocks and InnoDB had serious perf problems with the insert benchmark (long ago for InnoDB -- too much variance, not so long ago for MyRocks -- large write stalls). After much back and forth the problems were fixed. It wasn't easy, and I might have made people unhappy by making noise about the bugs, but they were fixed. I feel like I am repeating that now with Postgres.

This problem doesn't occur for this benchmark run on the same HW with InnoDB and MyRocks. While I am wary of doing direct comparisons between Postgres and MySQL, I looked at the results for InnoDB and MyRocks again and confirmed that the insert rate for MyRocks is much larger than for PG (expected), while the insert rate for InnoDB is similar in some cases and better in others. So the issue isn't that I am spotting problems with PG because it is handling more write throughput. Blog posts are here for InnoDB and MyRocks.

Setups

Because I use configurations to refer to the Postgres config files, I use setups to describe how the benchmark is run. The two setups are:
  • Cached by Postgres - database is cached in the Postgres buffer pool
  • IO-bound - database is larger than memory
Configurations

All of the config files are here. The naming convention for the files is conf.diff.cx7${X}_gcp_c2s30 where $X is one of a, a1, a2, ... , a28.
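For example, the base config is in conf.diff.cx7a_gcp_c2s30 and the a27 config is in conf.diff.cx7a27_gcp_c2s30.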

The configs tested are:

  • base - see here
  • a1 - adds wal_compression=lz4
  • a3 - adds full_page_writes=off, not safe for production
  • a4 - adds checkpoint_timeout=1800s
  • a7 - adds max_wal_size=16GB
  • a10 - adds autovacuum_work_mem=256MB
  • a12 - adds wal_buffers=64MB
  • a15 - adds max_wal_size=32GB
  • a16 - adds autovacuum_vacuum_scale_factor=0.05, autovacuum_vacuum_insert_scale_factor=0.05
  • a23 - adds autovacuum_max_workers=8, autovacuum_work_mem=256MB, autovacuum_vacuum_scale_factor=0.05, autovacuum_vacuum_insert_scale_factor=0.05
  • a24 - adds autovacuum_vacuum_cost_limit=4000
  • a25 - adds shared_buffers=30GB
  • a26 - adds autovacuum_max_workers=8, autovacuum_vacuum_scale_factor=0.05, autovacuum_vacuum_insert_scale_factor=0.05
  • a27 - combines a1, a4, a12, a16, a26 as these were the best for the cached setup
  • a28 - combines a4, a12, a16, a26 as these were the best for the IO-bound setup
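To make the combined configs concrete, below is a sketch of what a27 adds on top of the base config, assembled from the list above; the conf.diff.cx7a27_gcp_c2s30 file linked above is authoritative. The a28 config is the same minus wal_compression.

wal_compression = lz4                          # from a1
checkpoint_timeout = 1800s                     # from a4
wal_buffers = 64MB                             # from a12
autovacuum_max_workers = 8                     # from a26
autovacuum_vacuum_scale_factor = 0.05          # from a16 and a26
autovacuum_vacuum_insert_scale_factor = 0.05   # from a16 and a26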

Benchmarks

The benchmark is the updated insert benchmark with the delete_per_insert option enabled so that the logical table size doesn't grow after the initial load. It was run for two setups (Cached by Postgres, IO-bound).

The test server is c2-standard-30 with 15 cores, 120G RAM, Ubuntu 22.04 and XFS with SW RAID 0 over 4 local NVMe devices. The benchmark is run for 8 clients with a table per client.

I used Postgres 15.3 with the rel_native_lto build (compiled with -march=native -mtune=native -flto).

The benchmark steps are described here:

  • l.i0 - insert X rows per table in PK order where X is 20M for Cached by Postgres and 500M for IO-bound
  • l.x - create 3 secondary indexes per table
  • l.i1 - insert and delete X rows/table where X is 200M for Cached by Postgres and 40M for IO-bound
  • q100, q500, q1000 - run for 3600 seconds each, delete_per_insert is enabled. Do range queries as fast as possible with background threads doing rate-limited inserts and deletes -- 100, 500 and 1000/s for q100, q500, q1000. 
Note that nothing is done to reduce MVCC GC debt when the l.i1 step finishes, and configs that enable more write throughput during l.i1 pass that debt on to the q100, q500 and q1000 steps that follow. So I risk apples to oranges comparisons between configs that don't sustain the same write rate during l.i1. However, I assume this is mostly an issue for the q100 step that immediately follows l.i1 and is resolved by the time the q500 and q1000 steps run.
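
To illustrate what the l.i1 step (and the background writes during q100, q500 and q1000) does, here is a hypothetical sketch in SQL -- not the actual benchmark client, and the table name and non-PK columns are made up -- of the insert+delete pattern that keeps the logical table size constant: inserts go to the head of the table (the next largest PK value) while deletes remove rows from the tail (the smallest PK values).

create table if not exists pi1 (transactionid bigint primary key, price float8, dat text);
-- insert at the head of the table: the next largest PK value
insert into pi1 (transactionid, price, dat)
select coalesce(max(transactionid), 0) + 1, 1.0, 'filler' from pi1;
-- delete from the tail of the table: the smallest PK value
delete from pi1 where transactionid = (select min(transactionid) from pi1);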

Results

Performance reports are here for Cached by Postgres and IO-bound. I analyze these in terms of:
  • average throughput
  • fairness between benchmark clients
  • SLA for background inserts and deletes
  • response time histograms
  • performance at 1-second intervals
  • The impact of MVCC GC on some queries
Results: average throughput

The summaries show average throughput for each benchmark step using absolute and relative values. The relative values are the throughput relative to the throughput for the base config. The summaries are here for Cached by Postgres and IO-bound. In the summaries green/red mean good/bad for the table with absolute values and blue/gold mean better/worse for the table with relative values. I focus on the l.i1, q100, q500 and q1000 benchmark steps.

For Cached by Postgres
  • For l.i1 the a16, a23, a26, a27 and a28 configs help throughput while the a3, a7, a15, a24 and a25 configs hurt it relative to the base config
  • For q100, q500 and q1000 none of the configs have a big impact on throughput
For IO-bound
  • For l.i1 the a4, a12, a15, a16 and a23 to a28 configs help throughput while the a1, a3 and a10 configs hurt it relative to the base config
  • For q100, q500 and q1000 the a1, a4, a7, a12, a15, a23, a24 and a26 configs help throughput while the a16, a25, a27 and a28 configs hurt it
At this point I favor the a27 config for Cached by Postgres and the a28 config for IO-bound, keeping in mind the warning above about apples to oranges comparisons.

Results: Fairness between benchmark clients

With MyRocks and InnoDB all of the benchmark clients finish at about the same point in time. That is much less true for Postgres for the l.i1 benchmark step, and the tables below show how much variance there is in the time to complete that step. The results for IO-bound are lousy.

All of the per-client times (in seconds) to finish the l.i1 benchmark step for the IO-bound setup are here. Using the a28 config results as an example the min time is 15038.1 seconds and the max time is 30505.9 seconds. I might agree with the OtterTune claim about Postgres not having a great MVCC implementation.

I looked at the results for InnoDB and MyRocks and in the worst case the clients finish a few minutes apart (for IO-bound InnoDB). But most of the time the gap is <= 60 seconds, while for Postgres there are many examples where the gap is measured in hours.

Legend:
* config - Postgres configuration
* abs - difference in seconds between the max and min runtime from
        the 8 clients
* pct - (abs / min_runtime)*100 where min_runtime is the min runtime
        from any of the 8 clients

For Cached by Postgres
config  abs     pct
a       1066    11.3
a1      580     6.0
a3      1085    8.5
a4      892     9.3
a7      248     1.9
a10     753     7.6
a12     756     7.8
a15     927     8.5
a16     488     8.9
a23     414     6.7
a24     446     3.6
a25     837     5.4
a26     691     11.0
a27     365     5.9
a28     775     10.3

For IO-bound
config  abs     pct
a       18873   87.7
a1      24447   92.2
a3      25111   125.0
a4      13996   66.2
a7      18689   81.9
a10     21996   95.8
a12     13264   61.7
a15     25102   317.4
a16     13228   142.0
a23     11040   70.5
a24     5464    18.3
a25     23811   200.5
a26     9749    54.4
a27     18242   186.9
a28     15468   102.9
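
As a worked example using the a28 row: from the per-client times linked above the fastest client finished l.i1 in 15038.1 seconds and the slowest in 30505.9 seconds, so abs = 30505.9 - 15038.1 ≈ 15468 and pct = (15468 / 15038.1) * 100 ≈ 102.9, meaning the slowest client took roughly twice as long as the fastest.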

Results: SLA for background inserts and deletes

Scroll to the end of the Summary section for Cached by Postgres and IO-bound. The target rates are 8x100 (800/s) for q100, 8x500 (4000/s) for q500 and 8x1000 (8000/s) for q1000. 
  • For Cached by Postgres all configs were able to sustain rates that were close enough to the targets
  • For IO-bound the a16 and a26 configs failed to sustain the target rate for q1000, while the a23, a27 and a28 configs almost failed.
Results: response time histograms

For response time I look at the distributions including the max response time (see the max column).

For Cached by Postgres
  • For l.i1 the a3 and a23 to a28 configs have better distributions -- most values are in the 4ms bucket and the max values are much less than one second. While the other configs have more responses in the 1ms bucket, they have worse outliers with a max response time of ~5 seconds.
  • For q1000
    • for queries the a3, a4 and a23 to a28 configs have better distributions with max values much less than 1 second
    • for inserts and deletes the a3, a4 and a23 to a28 configs have better distributions with max values much less than 1 second and most responses in the 4ms bucket
For IO-bound
  • For l.i1 all configs have a similar max response time (around 1 second). The worst distributions, with no more than 25% of responses in the 4ms bucket, are a1, a12, a15 and a28.
  • For q1000
    • for queries the a15 config has a bad max response time, 4.484 seconds, while all others are less than 1 second. Otherwise the distribution for a25 is worse than the others. Configs other than a15 and a25 have similar results.
    • for inserts and deletes the a15 config also has a bad max response time, ~4.5 seconds, while all others are less than 1 second. Otherwise the distribution for a25 is worse than the others. Configs other than a15 and a25 have similar results.
Results: performance at 1-second intervals

The charts are here. Note that these charts are from the first client and there are eight clients.

For Cached by Postgres
  • For l.i1 there is too much variance: the insert rate (IPS) jumps between ~50k/s (good) and less than 5k/s (bad) -- see the results for the base config. Charts for some configs, like a16 and a27, are better but still have too much variance.
  • For q1000
    • for queries with most configs there appears to be a regular pattern that looks like rolling hills (see the chart for the base config). The line for the a27 config is straighter.
    • for inserts and deletes the base config and many other configs have more variance than the a27 config
For IO-bound
  • For l.i1 all of the configs have an odd spike in the first 10000 seconds of the benchmark step followed by a flatline -- ignore the spike at the end. See the charts for the base config, a16 and a27
  • For q1000
    • for queries most of the configs show variance (either a declining line or rolling hills). The chart for a28 is better than most.
    • for inserts and deletes note that the a16 and a26 configs failed to sustain the target rates while a23, a27 and a28 almost failed. The chart for the base config has some variance and most configs have similar charts. The charts for a16 and a26 show more variance (a26 is worse).
Results: The impact of MVCC GC on some queries

There is a query done at the start of each benchmark step to determine the min value of transactionid (the PK column in the table). Inserts are done to the head of the table while deletes are done from the tail, and for InnoDB and MyRocks this query can be slow unless you change the values of a few options.
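
A minimal sketch of that query, assuming the table is named pi1 (the table name is only for illustration; transactionid is the PK column mentioned above):

select min(transactionid) from pi1;

For InnoDB and MyRocks this query has to skip over rows at the tail of the table that have been deleted but not yet removed by MVCC GC (purge for InnoDB, compaction for MyRocks), which is why it can be slow when GC falls behind.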

For Postgres this query isn't a problem and the response time was always less than 100 milliseconds.

A blog post with more details is here.
