Tuesday, July 11, 2023

Tuning Postgres and the new insert benchmark, round 5

Like round 4, this has results from a medium server, the insert benchmark and Postgres 15.3. I test 10 configuration files to determine which works best for three setups: Cached by Postgres, Cached by OS and IO-bound.

tl;dr

  • The best configs are a21_4g for Cached by OS, a21 for IO-bound, a22 for Cached by Postgres. The a21 & a21_4g configs include wal_compression=lz4. The a22 config does not.
  • There is still too much variance in the l.i1 benchmark step.
  • There are ~10 second write stalls for the l.i1 benchmark step with all configurations for Cached by OS.
  • Postgres does much better than InnoDB and MyRocks at the get-min query (see below).
Setups

Because I use configurations when describing the Postgres config files, I use setups to explain how the benchmark is run. The three setups are:
  • Cached by Postgres - database is cached in the Postgres buffer pool
  • Cached by OS - database fits in the OS page cache but not in the 4G Postgres buffer pool (see the sketch after this list)
  • IO-bound - database is larger than memory
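
A minimal sketch of the distinction in postgresql.conf terms, assuming the _4g suffix on the Cached by OS config names means shared_buffers=4GB (the values actually used are in the linked config files; the large value below is only illustrative):

# Cached by OS (the _4g configs): small buffer pool, so the database
# is cached by the OS page cache instead
shared_buffers = 4GB

# Cached by Postgres and IO-bound (the other configs): a large buffer
# pool that caches the 160m-row database but not the 4000m-row one
shared_buffers = 80GB    # illustrative value, not the tested one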
Configurations

All of the config files are here. The configs are:

  • base - see here (Cached by Postgres, IO-bound) and here (Cached by OS)
  • a1 - adds wal_compression=lz4
  • a4 - adds checkpoint_timeout=1800s, which was already set to 1800s in the base config, so this is a duplicate
  • a7 - adds max_wal_size=16GB
  • a10 - adds autovacuum_work_mem=256MB
  • a12 - adds wal_buffers=64MB
  • a15 - adds max_wal_size=32GB
  • a16 - adds autovacuum_vacuum_scale_factor=0.05, autovacuum_vacuum_insert_scale_factor=0.05
  • a21 - combines a1, a4, a10, a16 (sketched below)
  • a22 - combines a4, a10, a16
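
To make the combinations concrete, here is a minimal sketch of what a21 adds on top of the base config, in postgresql.conf form (a22 is the same minus the wal_compression line):

wal_compression = lz4                            # from a1
checkpoint_timeout = 1800s                       # from a4, already in base
autovacuum_work_mem = 256MB                      # from a10
autovacuum_vacuum_scale_factor = 0.05            # from a16
autovacuum_vacuum_insert_scale_factor = 0.05     # from a16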

Benchmarks

The previous post has more detail. The benchmark is the updated insert benchmark with the delete_per_insert option enabled so that the logical table size doesn't grow after the initial load. It was run for three setups (Cached by Postgres, Cached by OS, IO-bound).

The test server is c2-standard-30 with 15 cores, 120G RAM, Ubuntu 22.04 and XFS with SW RAID 0 over 4 local NVMe devices. The benchmark is run for 8 clients.

The benchmark steps are described here. I focus on the l.i1 benchmark step because it is a stress test for MVCC GC, and for it I want to minimize variance and maximize throughput.

Results

Results are here for Cached by Postgres, Cached by OS and IO-bound. With a somewhat scientific method I decided that the best config is a22 for Cached by Postgres, a21_4g for Cached by OS and a21 for IO-bound. 

This means that wal_compression=lz4 is good for Cached by OS and IO-bound but not for Cached by Postgres. That makes sense because WAL compression reduces the volume of WAL written, which delays the need for a checkpoint and thus reduces writeback, but that is less of an issue when the database is cached. From the wkbpi (KB written to storage per insert) column in the metrics tables there isn't a large difference between the base and a1 configs (see here for Cached by Postgres, Cached by OS and IO-bound). Looking at the results for l.i1, the reduction in wkbpi is from 11.367 to 10.673 (~6%) for Cached by OS and from 22.690 to 20.719 (~9%) for IO-bound. Perhaps the benefit would be larger for a config with a smaller WAL -- max_wal_size was 70GB for the base and a1 configs.
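
As an aside, wkbpi is derived from storage-level write counters, but a rough in-database proxy for the WAL portion is the pg_stat_wal view (added in Postgres 14). The wal_fpi column counts full-page images, which are what wal_compression=lz4 compresses; this sketch is not how the benchmark measures wkbpi:

-- wal_bytes is total WAL generated, wal_fpi is the number of
-- full-page images written to the WAL
select wal_records, wal_fpi, wal_bytes from pg_stat_wal;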

From the response time histograms:
  • Cached by Postgres - not much to see here
  • Cached by OS - for l.i1, all of the configs have ~10 second write stalls and from the charts (see below) these stalls were frequent
  • IO-bound - for l.i1 most of the configs have ~1 second write stalls, but a16 and a21 were worse. For l.i0 there are ~1 second write stalls for all configs except for a22, which has a 5.6 second stall
From charts with throughput and response time at 1-second intervals:
  • Cached by Postgres - from charts for l.i0, l.i1, q100, q500 and q1000 there are frequent response time spikes up to 100ms for l.i0.
  • Cached by OS - from charts for l.i0, l.i1, q100, q500 and q1000 all of the configs have too much variance for the l.i1 benchmark step.
  • IO-bound - from charts for l.i0, l.i1, q100, q500 and q1000 all of the configs have a lot of variance on the l.i0 benchmark step. The IPS (insert/s at 1-second intervals) charts are visually interesting: for a21 (the best config) there is a decline from 1000 to 2800 seconds and then results are more stable.
Queries vs MVCC GC (get-min query)

Several of the benchmark steps (l.i1, q100, q500, q1000) run a query when they start to determine the minimum transaction ID -- select min(transactionid) from $table (see here) -- and note that the PK column is transactionid. I call this the get-min query. Rows are continuously deleted from the tail of these tables -- there is a loop (with think time) that deletes the N rows with the smallest values for transactionid. For some DBMS this query can be slow if MVCC GC falls behind. Fortunately that is not the case for Postgres.
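
As a sketch of the pattern, assuming a table named t1 and a delete batch size of 1000 (both are benchmark parameters, not values taken from this post):

-- the get-min query, run at the start of a benchmark step
select min(transactionid) from t1;

-- shape of the delete loop: each iteration removes the N rows with
-- the smallest transactionid values, then sleeps (think time)
delete from t1
where transactionid in
  (select transactionid from t1
   order by transactionid
   limit 1000);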

The tables below show the response time in seconds for the get-min query run at the start of each benchmark step. While there is some variance, meaning the query is sometimes slowed because MVCC GC fell behind, the response time is always less than 1 second. Future posts will show that the response time here is much better than with InnoDB or MyRocks.

Cached by Postgres

config                                          l.i1    q100    q500    q1000
160m.pg153_o3_native_lto.cx7a_gcp_c2s30         0.017   0.009   0.027   0.001
160m.pg153_o3_native_lto.cx7a1_gcp_c2s30        0.002   0.001   0.033   0.001
160m.pg153_o3_native_lto.cx7a4_gcp_c2s30        0.011   0.019   0.028   0.007
160m.pg153_o3_native_lto.cx7a7_gcp_c2s30        0.009   0.007   0.028   0.001
160m.pg153_o3_native_lto.cx7a10_gcp_c2s30       0.018   0.006   0.026   0.001
160m.pg153_o3_native_lto.cx7a12_gcp_c2s30       0.017   0.032   0.026   0.001
160m.pg153_o3_native_lto.cx7a15_gcp_c2s30       0.001   0.001   0.026   0.001
160m.pg153_o3_native_lto.cx7a16_gcp_c2s30       0.018   0.005   0.026   0.012
160m.pg153_o3_native_lto.cx7a21_gcp_c2s30       0.001   0.004   0.033   0.017
160m.pg153_o3_native_lto.cx7a22_gcp_c2s30       0.018   0.008   0.026   0.013

Cached by OS

config                                          l.i1    q100    q500    q1000
160m.pg153_o3_native_lto.cx7a_4g_gcp_c2s30      0.015   0.001   0.040   0.001
160m.pg153_o3_native_lto.cx7a1_4g_gcp_c2s30     0.001   0.001   0.028   0.001
160m.pg153_o3_native_lto.cx7a4_4g_gcp_c2s30     0.021   0.001   0.022   0.001
160m.pg153_o3_native_lto.cx7a7_4g_gcp_c2s30     0.002   0.001   0.022   0.001
160m.pg153_o3_native_lto.cx7a10_4g_gcp_c2s30    0.018   0.001   0.024   0.002
160m.pg153_o3_native_lto.cx7a12_4g_gcp_c2s30    0.033   0.017   0.023   0.001
160m.pg153_o3_native_lto.cx7a15_4g_gcp_c2s30    0.001   0.020   0.023   0.001
160m.pg153_o3_native_lto.cx7a16_4g_gcp_c2s30    0.017   0.016   0.024   0.005
160m.pg153_o3_native_lto.cx7a21_4g_gcp_c2s30    0.001   0.032   0.047   0.017
160m.pg153_o3_native_lto.cx7a22_4g_gcp_c2s30    0.014   0.076   0.043   0.035

IO-bound

config                                          l.i1    q100    q500    q1000
4000m.pg153_o3_native_lto.cx7a_gcp_c2s30        0.004   0.023   0.028   0.165
4000m.pg153_o3_native_lto.cx7a1_gcp_c2s30       0.005   0.423   0.116   0.231
4000m.pg153_o3_native_lto.cx7a4_gcp_c2s30       0.004   0.126   0.033   0.192
4000m.pg153_o3_native_lto.cx7a7_gcp_c2s30       0.004   0.025   0.028   0.202
4000m.pg153_o3_native_lto.cx7a10_gcp_c2s30      0.004   0.022   0.028   0.168
4000m.pg153_o3_native_lto.cx7a12_gcp_c2s30      0.004   0.717   0.040   0.179
4000m.pg153_o3_native_lto.cx7a15_gcp_c2s30      0.062   0.022   0.028   0.169
4000m.pg153_o3_native_lto.cx7a16_gcp_c2s30      0.004   0.077   0.039   0.220
4000m.pg153_o3_native_lto.cx7a21_gcp_c2s30      0.005   0.139   0.049   0.202
4000m.pg153_o3_native_lto.cx7a22_gcp_c2s30      0.004   0.197   0.101   0.271
