Tuesday, June 13, 2023

Tuning Postgres and the new insert benchmark, round 4

After looking at the results from round 3 I reran the insert+delete benchmark with a new configuration that combines the best changes identified in the previous tests. The goal is to find a configuration that improves throughput and reduces variance. In this post I explain the impact of the new configuration (a18) that improves on the previous best configuration (a16).

tl;dr

  • The new config (a18) improves on the previous best one (a16)
  • There is still too much variance in the l.i1 benchmark step

Configurations

The new configuration is a18 and I use two variations of it: one with shared_buffers=70GB for the cached by Postgres and IO-bound workloads, and one with shared_buffers=4GB to simulate fast storage (reads are served from the OS page cache). It starts with the base config and then combines the changes from the a1, a10, a12 and a16 configs: wal_compression=lz4, autovacuum_work_mem=256MB, wal_buffers=64MB, autovacuum_vacuum_scale_factor=0.05 and autovacuum_vacuum_insert_scale_factor=0.05. The last two are the most important.
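For reference, a minimal sketch of that delta as it might be applied on top of the base config. ALTER SYSTEM is just one way to do it; editing postgresql.conf directly works as well:

    -- the a18 delta on top of the base config (values from the list above)
    ALTER SYSTEM SET wal_compression = 'lz4';
    ALTER SYSTEM SET autovacuum_work_mem = '256MB';
    ALTER SYSTEM SET wal_buffers = '64MB';
    ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;
    ALTER SYSTEM SET autovacuum_vacuum_insert_scale_factor = 0.05;
    -- 70GB for cached by Postgres, 4GB to simulate fast storage
    ALTER SYSTEM SET shared_buffers = '70GB';
    -- shared_buffers and wal_buffers only take effect after a restart;
    -- the rest can be picked up with: SELECT pg_reload_conf();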

Benchmarks

The previous post has more detail. The benchmark is the updated insert benchmark with the delete_per_insert option enabled so that the logical table size doesn't grow after the initial load (a sketch of that pattern follows below). It was run for two workloads:

  • cached by Postgres - database fits in the Postgres buffer pool
  • cached by OS - database is too big for the Postgres buffer pool but fits in the OS page cache

The test server is a c2-standard-30 with 15 cores and 120G RAM. The benchmark is run with 8 clients.
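To make the delete_per_insert behavior concrete, here is a rough SQL sketch of one unit of work. The table name, schema and batch size are hypothetical; the real benchmark table has a PK plus several secondary indexes, and the benchmark client generates the rows and tracks the delete boundary itself:

    -- hypothetical, simplified benchmark table
    CREATE TABLE IF NOT EXISTS ib (pk bigserial PRIMARY KEY, k int, v text);

    -- one unit of work with delete_per_insert enabled: insert a batch of
    -- rows, then delete an equally sized batch of the oldest rows so the
    -- logical table size stays constant after the initial load
    BEGIN;
    INSERT INTO ib (k, v)
    SELECT (random() * 1000000)::int, repeat('x', 100)
    FROM generate_series(1, 100);
    DELETE FROM ib
    WHERE pk <= (SELECT min(pk) FROM ib) + 99;  -- assumes contiguous pk values
    COMMIT;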

The benchmark steps are described here. I focus on the l.i1 benchmark step because it is a stress test for MVCC GC, and for it I want to minimize variance and maximize throughput.
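One way to watch the MVCC GC pressure that l.i1 creates is to poll the per-table autovacuum counters while the step runs; a sketch, reusing the hypothetical table name from above:

    -- dead-tuple count and autovacuum activity for the benchmark table;
    -- sample this periodically during l.i1 to see whether vacuum keeps up
    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, autovacuum_count
    FROM pg_stat_user_tables
    WHERE relname = 'ib';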

Results

The reports are here for cached by Postgres and cached by OS. Things that I see:
  • From the Summary section it is clear that throughput with the a18 config is slightly better than with the a16 config. 
  • I have yet to explain why QPS for the read+write tests (q100, q500, q1000) on the cached by Postgres workload is slightly worse with a16 and a18 than with the base config. One possible explanation is that the read+write tests inherit more writeback debt with a16/a18 because the l.i1 benchmark step finishes sooner with them; a query for checking this is sketched after the list.
  • I missed this in my previous post, but the max insert response time for cached by Postgres with q1000 is 15.661 seconds with the a11 config that adds autovacuum_work_mem=1GB. That is a bad write stall. The max for queries with the same config was 15.654 seconds, again much too long.
  • For cached by Postgres
  • For cached by OS
    • for l.i1 the response time distributions are similar for a16 and a18, except that the max insert response time is 1.998 seconds with a18, which is too long.
    • for l.i1 write stalls are much worse (longer in duration) with a18 than with a16. This is harder to see from the response time distribution because (sadly) the benchmark client suffers from coordinated omission. But for q1000 the insert rate has more variance with a16 than with a18.
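One way to test the writeback-debt hypothesis from the list above is to sample the checkpoint and background-writer counters between benchmark steps; a sketch using pg_stat_bgwriter (columns as of Postgres 15):

    -- buffers written at checkpoints, by the background writer, and by
    -- backends; a spike in buffers_backend at the start of q100 would be
    -- consistent with writeback debt inherited from l.i1
    SELECT checkpoints_timed, checkpoints_req,
           buffers_checkpoint, buffers_clean, buffers_backend
    FROM pg_stat_bgwriter;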
