Tuesday, June 13, 2023

Tuning Postgres and the new insert benchmark on a small server

This has results for the updated insert benchmark with Postgres 15.2 on a small server. The goal is to search for a better configuration. Previous posts have results from a medium server: post 1, post 2, post 3. My focus is on the l.i1 benchmark step that is write-heavy -- inserts & deletes with a PK index and 3 secondary indexes in place. And the goal is to find a configuration that minimizes write stalls and response time variance.

tl;dr

  • the 1-client tests are more sensitive to config options than the 4-client tests, at least for the l.i1 benchmark step. Most of the configs get about 10k inserts/s on l.i1 for the 1-client tests, except for the a16 and a18 configs which get >= 17k/s. But on the 4-client tests all of the configs get ~22k inserts/s for cached by Postgres and ~14k inserts/s for cached by OS.
  • the largest improvement to the base config is from the a16 config that tunes autovacuum*scale_factor (see here). This is the second big improvement from tuning autovacuum. The previous one is here.

Benchmark

This benchmark used the Beelink server explained here that has 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04. Postgres version 15.2 was used, compiled from source as the o2_nofp build (-O2 -fno-omit-frame-pointer).

The insert benchmark was run with --delete_per_insert so that the tables do not grow after the initial benchmark step that loads them. The database was either cached by Postgres (shared_buffers=10GB) or by the OS (shared_buffers=1GB).
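In postgresql.conf terms the two cache setups differ in shared_buffers. A minimal sketch of the relevant lines, assuming everything else comes from the base config (the full config files are linked below):

  # cached by Postgres: most of the database fits in the buffer pool
  shared_buffers = 10GB

  # cached by OS: small buffer pool, rely on the OS page cache
  shared_buffers = 1GB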

The benchmark was repeated for 1 client and 4 clients with a table per client in each case. There were 20M inserts/table in the l.i0 step, 50M inserts/table in the l.i1 step, and the read+write steps (q100, q500, q1000) each ran for 1800 seconds. The benchmark steps are explained here.

There were four variants of the benchmark:

  • by Postgres, 1 client - database cached by Postgres, 1 client
  • by Postgres, 4 clients - database cached by Postgres, 4 clients
  • by OS, 1 client - database cached by OS, 1 client
  • by OS, 4 clients - database cached by OS, 4 clients

All of the config files are here. The base config files are here for cached by Postgres and by OS. The other config files make small changes to the base. The biggest win is already in the base config, and is the tuning done to make autovacuum more aggressive (see here). Of the configs, and using abbreviated names:

  • one of them has a significant benefit: a16, which tunes autovacuum*scale_factor
  • a few of them have a small benefit: a1, a9, a10, a12 (and a2, a15 on some of the workloads)
  • some of them hurt performance: a14, a17 (and a2, a15 on some of the workloads)
  • a few of them can be ignored: a3, a4
The configurations are:
  • a1 - adds wal_compression=lz4
  • a2 - adds backend_flush_after=256kB
  • a3 - adds full_page_writes=off (not for production!)
  • a4 - adds checkpoint_timeout=1800s, which was already set to 1800s in the base config so this can be ignored
  • a9 - adds checkpoint_timeout=900s
  • a10 - adds autovacuum_work_mem=256MB
  • a12 - adds wal_buffers=64MB
  • a14 - adds backend_flush_after=1MB
  • a15 - adds max_wal_size=32GB
  • a16 - adds autovacuum_vacuum_scale_factor=0.05, autovacuum_vacuum_insert_scale_factor=0.05
  • a17 - reduces bgwriter_lru_maxpages from 2000 to 500
  • a18 - combines the best configs: a1, a10, a12 and a16
The full names for the config files used for cached by Postgres have the pattern conf.diff.cx8${X}_bee where X is one of a1 to a18. And for cached by OS the full name has the pattern conf.diff.cx8${X}_1g_bee. Examples are conf.diff.cx8a18_bee and conf.diff.cx8a18_1g_bee.
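Putting the pieces together, this is a sketch of the delta that conf.diff.cx8a18_bee applies on top of the base config, based on the descriptions of a1, a10, a12 and a16 above (the actual file is linked above and may differ in details):

  wal_compression = lz4                          # from a1
  autovacuum_work_mem = 256MB                    # from a10
  wal_buffers = 64MB                             # from a12
  autovacuum_vacuum_scale_factor = 0.05          # from a16
  autovacuum_vacuum_insert_scale_factor = 0.05   # from a16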

Results

To understand the config changes I compare results from a given config with results from the base config.

The summaries are here for each of the four workloads: by Postgres & 1 client, by Postgres & 4 clients, by OS & 1 client and by OS & 4 clients.
First an analysis for by Postgres & 1 client:
  • Relative to the base config the a16 config provides the largest improvement to average throughput for l.i1.
  • Next I consider the impact on worst-case response time for inserts (see here).
  • Finally, I look at the charts for per-second insert rates vs time for l.i1 (see here). The a16 config has the best average throughput and the chart for a16 is a bit different than the others.
  • The conclusion is that I should definitely use a16 and probably use a1, a10, and a12 for this workload. I then created the a18 config that does just this.
Results from the other workloads are similar to by Postgres & 1 client: the a16 config was the best and a few other configs helped, which led to the a18 config that combines them. However, results from the a18 config are similar to the a16 config, not significantly better.

I then classified the impact of each config file for 4 workloads: by Postgres & 1 client, by Postgres & 4 clients, by OS & 1 client, by OS & 4 clients. The classification has 5 groups and uses relative throughput which is (throughput for config) / (throughput for base config). The 5 groups are:
  • significant benefit - relative throughput is more than 1.05
  • small benefit - relative throughput is >= 1.02 and <= 1.05
  • no impact - relative throughput is >= 0.99 and <= 1.01
  • hurt performance - relative throughput is <= 0.98
  • ignored - the result for this config can be ignored
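As a worked example of the classification: if the base config gets 10,000 inserts/s on l.i1 and another config gets 10,300 inserts/s then relative throughput is 10300 / 10000 = 1.03 and that config lands in the small benefit group. The numbers here are hypothetical and only show the arithmetic.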
From these results I chose to create the a18 config by combining a1, a10, a12 and a16.

Significant benefit
by Postgres, 1 client:  a16
by OS, 1 client:        a16
by Postgres, 4 clients: a16
by OS, 4 clients:

Small benefit:
by Postgres, 1 client:  a2 a10
by OS, 1 client:        a1 a9 a10 a12 a15
by Postgres, 4 clients:
by OS, 4 clients:       a16

No impact:
by Postgres, 1 client:  a1 a9 a12 a15 a17
by OS, 1 client:        
by Postgres, 4 clients: a1 a2 a9 a10 a12 a14 a17
by OS, 4 clients:       a1 a9 a10 a12 a15 a17

Hurt performance:
by Postgres, 1 client:  a14
by OS, 1 client:        a2 a14 a17
by Postgres, 4 clients: a15
by OS, 4 clients:       a2 a14

Ignored:
all:    a3 (not safe for prod), a4 (duplicate of base config)