Tuesday, July 11, 2023

Tuning Postgres and the new insert benchmark, round 5

Like round 4, this post has results for the insert benchmark with Postgres 15.3 on a medium server. I test 10 configuration files to determine which works best for each of three setups: Cached by Postgres, Cached by OS and IO-bound.

tl;dr

  • The best configs are a21_4g for Cached by OS, a21 for IO-bound, a22 for Cached by Postgres. The a21 & a21_4g configs include wal_compression=lz4. The a22 config does not.
  • There is still too much variance in the l.i1 benchmark step
  • There are ~10 second write stalls for the l.i1 benchmark step with all configurations for Cached by OS.
  • Postgres does much better than InnoDB and MyRocks at the get-min query 
Setups

Because I use configuration to refer to the Postgres config files, I use setup to describe how the benchmark is run. The three setups are:
  • Cached by Postgres - database is cached in the Postgres buffer pool
  • Cached by OS - database fits in the OS page cache but not the 4G Postgres buffer pool
  • IO-bound - database is larger than memory
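
For reference, the main difference between the cached setups is the size of the Postgres buffer pool relative to the database. A minimal sketch of how the 4G buffer pool for the Cached by OS setup could be set, written as an ALTER SYSTEM statement for illustration only (the real values come from the linked config files):

  -- Illustration: Cached by OS caps the buffer pool at 4GB so the database
  -- no longer fits in shared_buffers but still fits in the OS page cache
  ALTER SYSTEM SET shared_buffers = '4GB';
  -- shared_buffers only takes effect after a restart
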
Configurations

All of the config files are here. The configs are:

  • base - see here (Cached by Postgres, IO-bound) and here (Cached by OS)
  • a1 - adds wal_compression=lz4
  • a4 - adds checkpoint_timeout=1800s, which was already set to 1800s in the base config so this is a duplicate
  • a7 - adds max_wal_size=16GB
  • a10 - adds autovacuum_work_mem=256MB
  • a12 - adds wal_buffers=64MB
  • a15 - adds max_wal_size=32GB
  • a16 - adds autovacuum_vacuum_scale_factor=0.05, autovacuum_vacuum_insert_scale_factor=0.05
  • a21 - combines a1, a4, a10, a16
  • a22 - combines a4, a10, a16
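
To make the combinations concrete, here is a sketch of what the a21 config adds on top of the base config, written as ALTER SYSTEM statements for illustration (the actual config files are linked above; the benchmark uses config files rather than ALTER SYSTEM):

  ALTER SYSTEM SET wal_compression = 'lz4';                       -- from a1
  ALTER SYSTEM SET checkpoint_timeout = '1800s';                  -- from a4, duplicates the base config
  ALTER SYSTEM SET autovacuum_work_mem = '256MB';                 -- from a10
  ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;         -- from a16
  ALTER SYSTEM SET autovacuum_vacuum_insert_scale_factor = 0.05;  -- from a16

The a22 config is the same minus the wal_compression line.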

Benchmarks

The previous post has more detail. The benchmark is the updated insert benchmark with the delete_per_insert option enabled so that the logical table size doesn't grow after the initial load. It was run for three setups (Cached by Postgres, Cached by OS, IO-bound).

The test server is c2-standard-30 with 15 cores, 120G RAM, Ubuntu 22.04 and XFS with SW RAID 0 over 4 local NVMe devices. The benchmark is run for 8 clients.

The benchmark steps are described here. I focus on the l.i1 benchmark step, which is a stress test for MVCC GC, and for it I want to minimize variance and maximize throughput.

Results

Results are here for Cached by Postgres, Cached by OS and IO-bound. With a somewhat scientific method I decided that the best config is a22 for Cached by Postgres, a21_4g for Cached by OS and a21 for IO-bound. 

This means that wal_compression=lz4 is good for Cached by OS and IO-bound but not for Cached by Postgres. That makes sense because WAL compression reduces writeback (it delays the need for a checkpoint), but that is less of an issue when the database is cached. From the wkbpi (KB written to storage per insert) column in the metrics tables there isn't a large difference between the base and a1 configs (see here for Cached by Postgres, Cached by OS and IO-bound). Looking at the results for l.i1, the reduction in wkbpi is from 11.367 to 10.673 for Cached by OS and from 22.690 to 20.719 for IO-bound. Perhaps the benefit would be larger for a config with a smaller WAL -- max_wal_size was 70GB for the base and a1 configs.
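
For scale, the relative reductions from those wkbpi numbers work out to about 6% for Cached by OS and about 9% for IO-bound. A quick check, using Postgres itself as the calculator:

  -- percent reduction in wkbpi for l.i1, base config vs a1 (wal_compression=lz4)
  SELECT round(100 * (11.367 - 10.673) / 11.367, 1) AS cached_by_os_pct;  -- 6.1
  SELECT round(100 * (22.690 - 20.719) / 22.690, 1) AS io_bound_pct;      -- 8.7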

From the response time histograms:
  • Cached by Postgres - not much to see here
  • Cached by OS - for l.i1, all of the configs have ~10 second write stalls and from the charts (see below) these stalls were frequent
  • IO-bound - for l.i1 most of the configs have ~1 second write stalls, but a16 and a21 were worse. For l.i0 there are ~1 second write stalls for all configs except for a22, which has a 5.6 second stall.
From the charts with throughput and response time at 1-second intervals:
  • Cached by Postgres - from charts for l.i0, l.i1, q100, q500 and q1000 there are frequent response time spikes up to 100ms for l.i0.
  • Cached by OS - from charts for l.i0, l.i1, q100, q500 and q1000 all of the configs have too much variance for the l.i1 benchmark step.
  • IO-bound - from charts for l.i0, l.i1, q100, q500 and q1000 all of the configs have a lot of variance for the l.i0 benchmark step. The IPS (inserts/s at 1-second intervals) charts are visually interesting: for a21 (the best config) there is a decline from 1000 to 2800 seconds and then the results are more stable.
Queries vs MVCC GC (get-min query)

Several of the benchmark steps (l.i1, q100, q500, q1000) run a query when they start to determine the minimum transaction ID -- select min(transactionid) from $table (see here) -- and note that the PK column is transactionid. I call this the get-min query. Rows are continuously deleted from the tail of these tables -- there is a loop (with think time) that deletes the N rows with the smallest values for transactionid. For some DBMS this query can be slow if MVCC GC falls behind. Fortunately that is not the case for Postgres.
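
A sketch of the two statements involved, assuming a table named t with PK column transactionid and a delete batch size of 1000 -- both the table name and the batch size are placeholders here, the real values come from the benchmark client:

  -- the get-min query, run at the start of l.i1, q100, q500 and q1000
  SELECT min(transactionid) FROM t;

  -- one iteration of the delete loop: remove the N oldest rows from the tail
  DELETE FROM t
  WHERE transactionid IN (
      SELECT transactionid FROM t ORDER BY transactionid LIMIT 1000);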

The tables below show the response time in seconds for the get-min query run at the start of each benchmark step. While there is some variance, meaning the query is sometimes slowed because MVCC GC fell behind, the response time is always less than 1 second. Future posts will show that the response time here is much better than with InnoDB or MyRocks.

Cached by Postgres

config                                          l.i1    q100    q500    q1000
160m.pg153_o3_native_lto.cx7a_gcp_c2s30         0.017   0.009   0.027   0.001
160m.pg153_o3_native_lto.cx7a1_gcp_c2s30        0.002   0.001   0.033   0.001
160m.pg153_o3_native_lto.cx7a4_gcp_c2s30        0.011   0.019   0.028   0.007
160m.pg153_o3_native_lto.cx7a7_gcp_c2s30        0.009   0.007   0.028   0.001
160m.pg153_o3_native_lto.cx7a10_gcp_c2s30       0.018   0.006   0.026   0.001
160m.pg153_o3_native_lto.cx7a12_gcp_c2s30       0.017   0.032   0.026   0.001
160m.pg153_o3_native_lto.cx7a15_gcp_c2s30       0.001   0.001   0.026   0.001
160m.pg153_o3_native_lto.cx7a16_gcp_c2s30       0.018   0.005   0.026   0.012
160m.pg153_o3_native_lto.cx7a21_gcp_c2s30       0.001   0.004   0.033   0.017
160m.pg153_o3_native_lto.cx7a22_gcp_c2s30       0.018   0.008   0.026   0.013

Cached by OS

config                                          l.i1    q100    q500    q1000
160m.pg153_o3_native_lto.cx7a_4g_gcp_c2s30      0.015   0.001   0.040   0.001
160m.pg153_o3_native_lto.cx7a1_4g_gcp_c2s30     0.001   0.001   0.028   0.001
160m.pg153_o3_native_lto.cx7a4_4g_gcp_c2s30     0.021   0.001   0.022   0.001
160m.pg153_o3_native_lto.cx7a7_4g_gcp_c2s30     0.002   0.001   0.022   0.001
160m.pg153_o3_native_lto.cx7a10_4g_gcp_c2s30    0.018   0.001   0.024   0.002
160m.pg153_o3_native_lto.cx7a12_4g_gcp_c2s30    0.033   0.017   0.023   0.001
160m.pg153_o3_native_lto.cx7a15_4g_gcp_c2s30    0.001   0.020   0.023   0.001
160m.pg153_o3_native_lto.cx7a16_4g_gcp_c2s30    0.017   0.016   0.024   0.005
160m.pg153_o3_native_lto.cx7a21_4g_gcp_c2s30    0.001   0.032   0.047   0.017
160m.pg153_o3_native_lto.cx7a22_4g_gcp_c2s30    0.014   0.076   0.043   0.035

IO-bound

config                                          l.i1    q100    q500    q1000
4000m.pg153_o3_native_lto.cx7a_gcp_c2s30        0.004   0.023   0.028   0.165
4000m.pg153_o3_native_lto.cx7a1_gcp_c2s30       0.005   0.423   0.116   0.231
4000m.pg153_o3_native_lto.cx7a4_gcp_c2s30       0.004   0.126   0.033   0.192
4000m.pg153_o3_native_lto.cx7a7_gcp_c2s30       0.004   0.025   0.028   0.202
4000m.pg153_o3_native_lto.cx7a10_gcp_c2s30      0.004   0.022   0.028   0.168
4000m.pg153_o3_native_lto.cx7a12_gcp_c2s30      0.004   0.717   0.040   0.179
4000m.pg153_o3_native_lto.cx7a15_gcp_c2s30      0.062   0.022   0.028   0.169
4000m.pg153_o3_native_lto.cx7a16_gcp_c2s30      0.004   0.077   0.039   0.220
4000m.pg153_o3_native_lto.cx7a21_gcp_c2s30      0.005   0.139   0.049   0.202
4000m.pg153_o3_native_lto.cx7a22_gcp_c2s30      0.004   0.197   0.101   0.271
