A previous post shared results from the insert benchmark for Postgres 15.3 with a variety of configurations. This post has results for even more configurations to measure the impact of several tuning variables. The goal is to understand which config options improve average throughput and/or reduce throughput variance. The focus is on the l.i1 benchmark step (inserts with secondary index maintenance) as that is the largest source of stress for MVCC GC.
tl;dr (for this workload):
- the base config I have been using is good once I tuned autovacuum to be more aggressive
- adding wal_compression=lz4 helps a lot, while pglz did not help when I tried it in 2020 (see here)
- tuning autovacuum scale factors might help
- several of the changes made performance much worse
- increasing autovacuum_work_mem didn't help here but might help on IO-bound tests
Benchmark
This continues to use Postgres 15.3, a c2-standard-30 server with 15 cores, and the insert benchmark with delete_per_insert enabled. The deletes prevent the table from growing, so the write-heavy benchmark steps can run for a long time while keeping the working set in memory.
The configurations all use shared_buffers=4GB so that the database is cached by the OS but not by Postgres, which kind of simulates fast storage. Results where the database is cached by Postgres are pending.
All of the config files are here. The base config file is cx7a_4g_gcp_c2s30 and the other config files each make a small change to it. The biggest win is already in the base config: the tuning done to make autovacuum more aggressive (see here; a sketch of that tuning follows the list below). Grouping the configs by impact, and using abbreviated names:
- one of them has a significant benefit: a1
- a few of them have a small benefit: a9, a10, a11, a16
- one of them has no impact: a17
- many of them hurt performance: a2, a5, a6, a7, a12, a13, a14, a15
- a few of them can be ignored: a3, a4, a8
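For reference, here is a minimal sketch of the base-config lines that matter most for this post. The shared_buffers and checkpoint_timeout values are stated elsewhere in this post; the autovacuum lines are my assumptions about what "more aggressive than the defaults" typically looks like, not the exact contents of cx7a_4g_gcp_c2s30:

```
# sketch of cx7a_4g_gcp_c2s30 -- only shared_buffers and checkpoint_timeout
# are values stated in this post, the autovacuum lines are illustrative
shared_buffers = 4GB                  # small buffer pool, rely on the OS page cache
checkpoint_timeout = 1800s            # see the note on a4 below
# assumed aggressive-autovacuum tuning (Postgres defaults in parens):
autovacuum_max_workers = 8            # (3)
autovacuum_naptime = 10s              # (60s)
autovacuum_vacuum_cost_limit = 4000   # (200, inherited from vacuum_cost_limit)
autovacuum_vacuum_cost_delay = 1ms    # (2ms)
```

The per-config diffs, relative to the base: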
- conf.diff.cx7a1_4g_gcp_c2s30 (a1) - adds wal_compression=lz4
- conf.diff.cx7a2_4g_gcp_c2s30 (a2) - adds backend_flush_after=256kB
- conf.diff.cx7a3_4g_gcp_c2s30 (a3) - adds full_page_writes=off (not for production!)
- conf.diff.cx7a4_4g_gcp_c2s30 (a4) - adds checkpoint_timeout=1800s, which was already set to 1800s in the base config so this can be ignored
- conf.diff.cx7a5_4g_gcp_c2s30 (a5) - adds min_wal_size=16384
- conf.diff.cx7a6_4g_gcp_c2s30 (a6) - adds backend_flush_after=256kB and uses initdb --wal-segsize 256 (see the sketch after this list)
- conf.diff.cx7a7_4g_gcp_c2s30 (a7) - adds max_wal_size=16GB
- conf.diff.cx7a8_4g_gcp_c2s30 (a8) - adds shared_buffers=8GB; ignore this, I was just curious about the benefit of doubling shared_buffers from the base config
- conf.diff.cx7a9_4g_gcp_c2s30 (a9) - adds checkpoint_timeout=900s
- conf.diff.cx7a10_4g_gcp_c2s30 (a10) - adds autovacuum_work_mem=256MB
- conf.diff.cx7a11_4g_gcp_c2s30 (a11) - adds autovacuum_work_mem=1GB
- conf.diff.cx7a12_4g_gcp_c2s30 (a12) - adds wal_buffers=64MB
- conf.diff.cx7a13_4g_gcp_c2s30 (a13) - adds wal_buffers=64MB and uses initdb --wal-segsize 256
- conf.diff.cx7a14_4g_gcp_c2s30 (a14) - adds backend_flush_after=1MB
- conf.diff.cx7a15_4g_gcp_c2s30 (a15) - adds max_wal_size=32GB
- conf.diff.cx7a16_4g_gcp_c2s30 (a16) - adds autovacuum_vacuum_scale_factor=0.05, autovacuum_vacuum_insert_scale_factor=0.05
- conf.diff.cx7a17_4g_gcp_c2s30 (a17) - reduces bgwriter_lru_maxpages from 5000 to 1000
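The a6 and a13 configs also change the WAL segment size when the cluster is created, as noted above. A sketch of what that looks like, with a hypothetical data directory path:

```
# create the cluster with 256MB WAL segments (the default is 16MB);
# /data/pg15 is a hypothetical path
initdb --wal-segsize=256 -D /data/pg15
```

Fewer, larger segments mean fewer file creations and segment switches under a heavy WAL write rate.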
The summary is here. Relative to the base config, the a1 config provides the largest improvement to average throughput for l.i1 (from 110299 to 122963 inserts/s).
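Since a1 is the clear winner, here is a minimal sketch of what enabling it amounts to. Note that wal_compression=lz4 requires a server built with LZ4 support (--with-lz4), and what gets compressed is the full-page images written to the WAL after checkpoints:

```
# sketch of the a1 delta (conf.diff.cx7a1_4g_gcp_c2s30)
wal_compression = lz4    # compress full-page images in the WAL; needs --with-lz4
```

That this helps so much is consistent with l.i1 being WAL-heavy: smaller full-page images mean less WAL to write per insert.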
Finally, I look at the charts for per-second insert rates vs time (see here). From those charts, the a1 config makes the graph look much better and the a16 config has a minor benefit. The results are here for base, a1 and a16.
The conclusion is that I should definitely use a1 (wal_compression=lz4) and probably use a9, one of a10/a11 and a16 for this workload.
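To make the conclusion concrete, here is a hypothetical combined delta over the base config. This is my sketch of what "a1, a9, one of a10/a11 and a16" would look like, not a config file that was benchmarked:

```
# hypothetical: the base config plus the changes recommended above
wal_compression = lz4                           # a1: the big win
checkpoint_timeout = 900s                       # a9: down from 1800s in the base
autovacuum_work_mem = 256MB                     # a10 (a11 uses 1GB instead)
autovacuum_vacuum_scale_factor = 0.05           # a16
autovacuum_vacuum_insert_scale_factor = 0.05    # a16
```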