Small Datum: Tuning Postgres and the new insert benchmark, round 2

Wednesday, May 31, 2023

A previous post shared results from the insert benchmark for Postgres 15.3 with a variety of configurations. This post has results for even more configurations to understand the impact of several tuning variables. The goal is to understand which config options improve average throughput and/or reduce throughput variance. The focus is on the l.i1 benchmark step (inserts with secondary index maintenance) as that is the largest source of stress for MVCC GC.

tl;dr (for this workload):

the base config I have been using is good once I tuned autovacuum to be more aggressive
adding wal_compression=lz4 helps a lot while pglz didn't help (in 2020, see here)
tuning autovacuum scale factors might help
several of the changes made performance much worse
increasing autovacuum_work_mem didn't help here but might help on IO-bound tests

Benchmark

This continues to use Postgres 15.3 and a c2-standard-30 server with 15 cores. The insert benchmark runs with delete_per_insert enabled so that the table does not grow, which lets write-heavy benchmark steps run for a long time while keeping the working set in memory.

The configurations all use shared_buffers=4GB so that the database is cached by the OS but not by Postgres. This roughly simulates fast storage. Results where the database is cached by Postgres are pending.

All of the config files are here. The base config file is cx7a_4g_gcp_c2s30 and then the other config files make a small change to the base. The biggest win is already in the base config, and is the tuning done to make autovacuum more aggressive (see here). Of the configs, and using abbreviated names:

one of them has a significant benefit: a1
a few of them have a small benefit: a9, a10, a11, a16
one had no impact: a17
many of them hurt performance: a2, a5, a6, a7, a12, a13, a14, a15
a few of them can be ignored: a3, a4, a8

The configurations are:

conf.diff.cx7a1_4g_gcp_c2s30 (a1) - adds wal_compression=lz4
conf.diff.cx7a2_4g_gcp_c2s30 (a2) - adds backend_flush_after=256kB
conf.diff.cx7a3_4g_gcp_c2s30 (a3) - adds full_page_writes=off (not for production!)
conf.diff.cx7a4_4g_gcp_c2s30 (a4) - adds checkpoint_timeout=1800s, which was already set to 1800s in the base config so this can be ignored
conf.diff.cx7a5_4g_gcp_c2s30 (a5) - adds min_wal_size=16384
conf.diff.cx7a6_4g_gcp_c2s30 (a6) - adds backend_flush_after=256kB, initdb --wal-segsize 256
conf.diff.cx7a7_4g_gcp_c2s30 (a7) - adds max_wal_size=16GB
conf.diff.cx7a8_4g_gcp_c2s30 (a8) - adds shared_buffers=8GB, ignore this, I was just curious about the benefit of doubling shared buffers from the base config
conf.diff.cx7a9_4g_gcp_c2s30 (a9) - adds checkpoint_timeout=900s
conf.diff.cx7a10_4g_gcp_c2s30 (a10) - adds autovacuum_work_mem=256MB
conf.diff.cx7a11_4g_gcp_c2s30 (a11) - adds autovacuum_work_mem=1GB
conf.diff.cx7a12_4g_gcp_c2s30 (a12) - adds wal_buffers=64MB
conf.diff.cx7a13_4g_gcp_c2s30 (a13) - adds wal_buffers=64MB, initdb --wal-segsize 256
conf.diff.cx7a14_4g_gcp_c2s30 (a14) - adds backend_flush_after=1MB
conf.diff.cx7a15_4g_gcp_c2s30 (a15) - adds max_wal_size=32GB
conf.diff.cx7a16_4g_gcp_c2s30 (a16) - adds autovacuum_vacuum_scale_factor=0.05, autovacuum_vacuum_insert_scale_factor=0.05
conf.diff.cx7a17_4g_gcp_c2s30 (a17) - reduces bgwriter_lru_maxpages from 5000 to 1000
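
Each config above is the base plus a small override. A minimal Python sketch of that layering follows. It is not how the linked config files are actually generated; BASE, VARIANTS and render_config are hypothetical names, and BASE holds only the three base values this post mentions rather than the full base file.

```python
# Partial base settings: only values this post states for the base config.
BASE = {
    "shared_buffers": "4GB",
    "checkpoint_timeout": "1800s",
    "bgwriter_lru_maxpages": "5000",
}

# A few of the variants listed above, written as overrides of the base.
VARIANTS = {
    "a1": {"wal_compression": "lz4"},
    "a9": {"checkpoint_timeout": "900s"},
    "a10": {"autovacuum_work_mem": "256MB"},
    "a17": {"bgwriter_lru_maxpages": "1000"},
}


def render_config(variant):
    """Return postgresql.conf-style lines for the base plus one variant."""
    # Later keys win, so a variant value replaces the base value.
    settings = {**BASE, **VARIANTS[variant]}
    return [f"{name} = {value}" for name, value in sorted(settings.items())]


if __name__ == "__main__":
    for line in render_config("a9"):
        print(line)
```

Keeping each variant as a one-setting override makes it easier to attribute a change in throughput to a single option.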

Results

I start by comparing the impact of the config changes with the results for the base config.

The summary is here. Relative to the base config the a1 config provides the largest improvement to average throughput for l.i1 (from 110299 to 122963 inserts/s).
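
To put that difference in relative terms, here is a small Python sketch that uses only the two throughput numbers quoted above. The function name relative_change is just for illustration.

```python
base_rate = 110299  # inserts/s for the base config on l.i1
a1_rate = 122963    # inserts/s for the a1 config on l.i1


def relative_change(base, new):
    """Fractional change of new relative to base."""
    return (new - base) / base


print(f"a1 vs base: {relative_change(base_rate, a1_rate):.1%}")
```

That works out to an improvement of roughly 11.5% for a1.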

Next I consider the impact on worst-case response time for inserts (see here). About half of the configs have either no impact or a minor impact on that compared to the base config. Some of the configs (a6, a11, a12, a13, a14, a15) make the worst-case insert time much worse.

Finally, I look at the charts for per-second insert rates vs time (see here). In those charts the a1 config makes the graph look much better and the a16 config has a minor benefit. The results are here for base, a1 and a16.
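
The charts are judged by eye, but variance can also be summarized with a few numbers. The Python sketch below is one possible way to do that, not something the benchmark scripts are known to do. The two sample series are made up for illustration and are not benchmark results.

```python
import statistics


def summarize(rates):
    """Summarize a list of per-second insert rates."""
    mean = statistics.mean(rates)
    stdev = statistics.pstdev(rates)
    return {
        "mean": mean,
        "min": min(rates),
        # Coefficient of variation: lower means steadier throughput.
        "cv": stdev / mean,
    }


# Hypothetical series with the same average but different stability.
steady = [100, 100, 100, 100]
spiky = [160, 40, 160, 40]

for name, rates in (("steady", steady), ("spiky", spiky)):
    print(name, summarize(rates))
```

Two configs with the same average rate can differ a lot in the minimum and in the coefficient of variation, which is why average throughput alone is not enough here.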

The conclusion is that I should definitely use a1 (wal_compression=lz4) and probably use a9, one of a10/a11 and a16 for this workload.
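
For context on a16: in Postgres 15 autovacuum processes a table once the number of dead (or newly inserted) tuples exceeds a threshold plus a scale factor times the table's row count. The Python sketch below shows what lowering the scale factors from the default of 0.2 to 0.05 does to those trigger points. The 20M row count is an assumed table size for illustration, not the size used in this benchmark, and the function names are hypothetical.

```python
def vacuum_trigger(reltuples, threshold=50, scale_factor=0.2):
    """Dead tuples needed before autovacuum vacuums a table.

    Defaults mirror autovacuum_vacuum_threshold and
    autovacuum_vacuum_scale_factor.
    """
    return threshold + scale_factor * reltuples


def insert_trigger(reltuples, threshold=1000, scale_factor=0.2):
    """Inserted tuples needed before an insert-driven vacuum.

    Defaults mirror autovacuum_vacuum_insert_threshold and
    autovacuum_vacuum_insert_scale_factor.
    """
    return threshold + scale_factor * reltuples


rows = 20_000_000  # assumed table size, for illustration only

for label, sf in (("default 0.2", 0.2), ("a16 0.05", 0.05)):
    print(
        label,
        round(vacuum_trigger(rows, scale_factor=sf)),
        round(insert_trigger(rows, scale_factor=sf)),
    )
```

With the smaller scale factors vacuum starts after about a quarter as many changed rows, so each run has less garbage to clean up.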
