Wednesday, August 9, 2023

Tuning Postgres and the Insert Benchmark, round 6

This continues the work from round 5 where I try to find a better configuration for Postgres 15.2 and 15.3 when running the Insert Benchmark. Once I settle on one or a few config files I can compare Postgres 15 with the version 16 beta releases.

tl;dr

  • This has been more confusing than I hoped. I number the configurations and am up to number 28, but it isn't clear that I found a significant improvement over the best config from round 5. However, I will probably select two configurations that combine the better options: a27 for cached workloads and a28 for IO-bound workloads, even if they don't improve on a16.
  • For the IO-bound setup and the l.i1 benchmark step there is far too much variance in the time each client needs to finish the step. The a24 config was the best, but most of the configs were so bad that I consider them a total failure of the benchmark, which makes much of the evaluation that follows pointless. I am starting to agree with the OtterTune claim about which storage engine has the worst implementation of MVCC GC when confronted with a write-heavy workload. I have put far more effort into tuning Postgres than into InnoDB and MyRocks but the results just aren't good.
  • I am surprised that increasing autovacuum_work_mem from the default (64MB) to 256MB doesn't help for either the cached or IO-bound setups.
  • I am surprised that the a3 config, which adds full_page_writes=off, hurts write throughput.
  • Using wal_compression=lz4 (the a1 config) helps on the cached setup but hurts on the IO-bound setup.
  • Results for the a25 config, which uses a smaller buffer pool (25% of RAM), are bad.
MyRocks and InnoDB had serious perf problems with the insert benchmark (long ago for InnoDB -- too much variance, not so long ago for MyRocks -- large write stalls). After much back and forth the problems were fixed. It wasn't easy, and I might have made people unhappy by making noise about the bugs, but the fixes arrived. I feel like I am repeating that experience now with Postgres.

This problem doesn't occur when the same benchmark is run on the same HW with InnoDB and MyRocks. While I am wary of direct comparisons between Postgres and MySQL, I looked at the results for InnoDB and MyRocks again and confirmed that the insert rate for MyRocks is much larger than for Postgres (expected) while the insert rate for InnoDB is similar to Postgres in some cases and better in others. So the issue isn't that I am spotting problems with Postgres just because it handles more write throughput. Blog posts are here for InnoDB and MyRocks.

Setups

Because I use configuration to describe the Postgres config files, I use setup to explain how the benchmark is run. The two setups are:
  • Cached by Postgres - database is cached in the Postgres buffer pool
  • IO-bound - database is larger than memory
Configurations

All of the config files are here. The naming convention for the files is conf.diff.cx7${X}_gcp_c2s30 where $X is one of a, a1, a2, ... , a28.

The configs tested are:

  • base - see here
  • a1 - adds wal_compression=lz4
  • a3 - adds full_page_writes=off, not safe for production
  • a4 - adds checkpoint_timeout=1800s
  • a7 - adds max_wal_size=16GB
  • a10 - adds autovacuum_work_mem=256MB
  • a12 - adds wal_buffers=64MB
  • a15 - adds max_wal_size=32GB
  • a16 - adds autovacuum_vacuum_scale_factor=0.05, autovacuum_vacuum_insert_scale_factor=0.05
  • a23 - adds autovacuum_max_workers=8, autovacuum_work_mem=256MB, autovacuum_vacuum_scale_factor=0.05, autovacuum_vacuum_insert_scale_factor=0.05
  • a24 - adds autovacuum_vacuum_cost_limit=4000
  • a25 - adds shared_buffers=30GB
  • a26 - adds autovacuum_max_workers=8, autovacuum_vacuum_scale_factor=0.05, autovacuum_vacuum_insert_scale_factor=0.05
  • a27 - combines a1, a4, a12, a16, a26 as these were the best for the cached setup
  • a28 - combines a4, a12, a16, a26 as these were the best for the IO-bound setup
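The conf.diff files linked above are authoritative. As a rough sketch of what the a27 combination amounts to, roughly equivalent settings could be applied on top of the base config with ALTER SYSTEM as shown below, where a28 is the same minus the wal_compression line. This is an approximation derived from the list above, not a copy of the conf.diff files.

  -- approximate a27 settings (a1 + a4 + a12 + a16 + a26) on top of the base config
  -- a28 is the same without the wal_compression line
  ALTER SYSTEM SET wal_compression = 'lz4';                       -- a1
  ALTER SYSTEM SET checkpoint_timeout = '1800s';                  -- a4
  ALTER SYSTEM SET wal_buffers = '64MB';                          -- a12, needs a restart
  ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;         -- a16, a26
  ALTER SYSTEM SET autovacuum_vacuum_insert_scale_factor = 0.05;  -- a16, a26
  ALTER SYSTEM SET autovacuum_max_workers = 8;                    -- a26, needs a restart
  SELECT pg_reload_conf();  -- reload for the settings that don't need a restart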

Benchmarks

The benchmark is the updated insert benchmark with the delete_per_insert option enabled so that the logical table size doesn't grow after the initial load. It was run for the two setups described above (Cached by Postgres, IO-bound).

The test server is c2-standard-30 with 15 cores, 120G RAM, Ubuntu 22.04 and XFS with SW RAID 0 over 4 local NVMe devices. The benchmark is run for 8 clients with a table per client.

I used Postgres 15.3 with the rel_native_lto build (compiled with -march=native -mtune=native -flto).

The benchmark steps are described here:

  • l.i0 - insert X rows per table in PK order where X is 20M for Cached by Postgres and 500M for IO-bound
  • l.x - create 3 secondary indexes per table
  • l.i1 - insert and delete X rows per table where X is 200M for Cached by Postgres and 40M for IO-bound
  • q100, q500, q1000 - run for 3600 seconds each with delete_per_insert enabled. Do range queries as fast as possible while background threads do rate-limited inserts and deletes -- 100, 500 and 1000 per second per client for q100, q500 and q1000.
Note that nothing is done to reduce MVCC GC debt when the l.i1 step finishes. Configs that sustain more write throughput during l.i1 pass on more of that debt to the q100, q500 and q1000 steps that follow, so I risk apples to oranges comparisons between configs that don't sustain the same write rate during l.i1. However, I assume this is mostly an issue for the q100 step that immediately follows l.i1 and is resolved by the time the q500 and q1000 steps run.
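To get a sense of how much of that debt a config carries into q100, something like the query below could be run between benchmark steps. This is just a way to inspect the debt -- the benchmark does not do this -- and the table names are whatever the benchmark clients created.

  -- sketch: inspect MVCC GC (vacuum) debt between benchmark steps
  SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC;
  -- a manual VACUUM per table would reduce the debt, at the cost of changing what is measured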

Results

Performance reports are here for Cached by Postgres and IO-bound. I analyze these in terms of:
  • average throughput
  • fairness between benchmark clients
  • SLA for background inserts and deletes
  • response time histograms
  • performance at 1-second intervals
  • the impact of MVCC GC on some queries
Results: average throughput

The summaries show average throughput for each benchmark step using absolute and relative values. The relative values are the throughput relative to the throughput for the base config. The summaries are here for Cached by Postgres and IO-bound. In the summaries green/red mean good/bad for the table with absolute values and blue/gold mean better/worse for the table with relative values. I focus on the l.i1, q100, q500 and q1000 benchmark steps.
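As a worked example with made-up numbers: if the base config averages 50000 inserts/s on l.i1 and another config averages 55000/s, the absolute values are 50000 and 55000 and the relative value for the second config is 55000 / 50000 = 1.10.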

For Cached by Postgres
  • For l.i1 the a16, a23, a26, a27 and a28 configs help throughput while the a3, a7, a15, a24 and a25 configs hurt it relative to the base config
  • For q100, q500 and q1000 none of the configs have a big impact on throughput
For IO-bound
  • For l.i1 the a4, a12, a15, a16 and a23 to a28 configs help throughput while the a1, a3 and a10 configs hurt it relative to the base config
  • For q100, q500 and q1000 the a1, a4, a7, a12, a15, a23, a24, a16 configs help throughput while the a16, a25, a27 and a28 configs hurt it
At this point I favor the a27 config for Cached by Postgres and the a28 config for IO-bound, keeping in mind the warning above about apples to oranges comparisons.

Results: Fairness between benchmark clients

With MyRocks and InnoDB all of the benchmark clients finish at about the same point in time. That is much less true for Postgres on the l.i1 benchmark step, and the tables below show how much variance there is in the time to complete that step. The results for IO-bound are lousy.

All of the per-client times (in seconds) to finish the l.i1 benchmark step for the IO-bound setup are here. Using the a28 config results as an example the min time is 15038.1 seconds and the max time is 30505.9 seconds. I might agree with the OtterTune claim about Postgres not having a great MVCC implementation.

I looked at the results for InnoDB and MyRocks and in the worst case the clients finish a few minutes apart (for IO-bound InnoDB). Most of the time the gap is <= 60 seconds, while for Postgres there are many examples where the gap is measured in hours.

Legend:
* config - Postgres configuration
* abs - difference in seconds between the max and min runtime from
        the 8 clients
* pct - (abs / min_runtime)*100 where min_runtime is the min runtime
        from any of the 8 clients
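
For example, using the a28 IO-bound times quoted above: abs = 30505.9 - 15038.1 ≈ 15468 seconds and pct = (15468 / 15038.1) * 100 ≈ 102.9, which matches the a28 row in the IO-bound table below.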

For Cached by Postgres
config  abs     pct
a       1066    11.3
a1      580     6.0
a3      1085    8.5
a4      892     9.3
a7      248     1.9
a10     753     7.6
a12     756     7.8
a15     927     8.5
a16     488     8.9
a23     414     6.7
a24     446     3.6
a25     837     5.4
a26     691     11.0
a27     365     5.9
a28     775     10.3

For IO-bound
config  abs     pct
a       18873   87.7
a1      24447   92.2
a3      25111   125.0
a4      13996   66.2
a7      18689   81.9
a10     21996   95.8
a12     13264   61.7
a15     25102   317.4
a16     13228   142.0
a23     11040   70.5
a24     5464    18.3
a25     23811   200.5
a26     9749    54.4
a27     18242   186.9
a28     15468   102.9

Results: SLA for background inserts and deletes

Scroll to the end of the Summary section for Cached by Postgres and IO-bound. The target rates are 8x100 (800/s) for q100, 8x500 (4000/s) for q500 and 8x1000 (8000/s) for q1000. 
  • For Cached by Postgres all configs were able to sustain rates that were close enough to the targets
  • For IO-bound the a16 and a26 configs failed to sustain the q1000 target while a23, a27 and a28 almost failed.
Results: response time histograms

For response time I look at the distributions including the max response time (see the max column).

For Cached by Postgres
  • For l.i1 the a3 and a23 to a28 configs have better distributions -- most values are in the 4ms bucket and the max values are much less than one second. The other configs have more responses in the 1ms bucket but worse outliers, with a max response time of ~5 seconds.
  • For q1000 
    • for queries the a3, a4, a23 to a28 configs have better distributions with max values much less than 1 second
    • for inserts and deletes the a3, a4, a23 to a28 configs have better distributions with max values much less than 1 second and most responses in the 4ms bucket
For IO-bound
  • For l.i1 all configs have a similar max response time (around 1 second). The worst distributions, with no more than 25% of responses in the 4ms bucket, are from the a1, a12, a15 and a28 configs.
  • For q1000
    • for queries the a15 config has a bad max response time, 4.484 seconds, while all others are less than 1 second. Otherwise the distribution for a25 is worse than the others. But configs other than a15 and a25 have similar results.
    • for inserts and deletes the a15 config has a bad max response time, ~4.5 seconds, while all others are less than 1 second. Otherwise the distribution for a25 is worse than the others. But configs other than a15 and a25 have similar results.
Results: performance at 1-second intervals

The charts are here for Cached by Postgres and IO-bound.
Note that these charts are from the first client and there are eight clients.

For Cached by Postgres
  • For l.i1 there is too much variance where the insert rate (IPS) jumps between ~50k/s (good) and less than 5k/s (bad) -- see results for the base config. Charts for some configs, like a16 and a27 are better but still have too much variance.
  • For q1000
    • for queries with most configs there appears to be a regular pattern that looks like rolling hills (see the chart for the base config). The line for the a27 config is straighter.
    • for inserts and deletes the base config and many other configs have more variance than the a27 config
For IO-bound
  • For l.i1 all of the configs have an odd spike in the first 10000 seconds of the benchmark step followed by a flatline -- ignore the spike at the end. See the charts for the base config, a16 and a27
  • For q1000
    • for queries most of the configs show variance (either a declining line or rolling hills). The chart for a28 is better than most.
    • for inserts and deletes note that the a16 and a26 configs failed to sustain the target rates while a23, a27 and a28 almost failed. The chart for the base config has some variance and most configs have similar charts. The charts for a16 and a26 show more variance (a26 is worse).
Results: The impact of MVCC GC on some queries

There is a query done at the start of benchmark steps to determine the min value of transactionid (the PK column in the table). Inserts are done to the head of the table while deletes are done from the tail, and for InnoDB and MyRocks this query can be slow unless you change the values of a few options.

For Postgres this query isn't a problem and the response time was always less than 100 milliseconds.
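For reference, the query has roughly this shape, where t1 stands in for one of the per-client benchmark tables (the real table names come from the benchmark client):

  -- sketch of the min-value query run at the start of benchmark steps
  -- t1 is a hypothetical name for a per-client table with PK column transactionid
  SELECT min(transactionid) FROM t1;
  -- with a btree index on transactionid this can be answered by reading one index entry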

A blog post with more details is here.
