Tuesday, May 23, 2023

Tuning Postgres and the new insert benchmark

This post describes my work in progress to tune the configurations I use for Postgres. I can support a DBMS that needs some tuning, but I am wary of a DBMS that needs too much tuning. I assume Postgres is the former, and at least for me so is MyRocks/RocksDB. I wish it were possible to do tuning in two parts, but I am not aware of a DBMS that supports such an approach today:
  1. Describe the HW capacity the DBMS can use (RAM, number of cores, IO throughput)
  2. Describe the desired SLA (favor reads vs writes, what matters WRT read, write & space amp)
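If such a two-part approach existed, the configuration might look something like this. This is an imaginary sketch -- no DBMS supports it today and every name below is invented:

```
# Part 1: hardware capacity the DBMS may use
capacity.ram = 100GB
capacity.cores = 15
capacity.io_read_mbps = 2000
capacity.io_write_mbps = 1000

# Part 2: desired SLA
sla.favor = writes
sla.minimize = write_amp, space_amp
```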
I am a big fan of the Postgres community. I write about my good and bad experiences with Postgres. I do my best to avoid snark and in return I get advice from experts. 

  • Update: once again I made mistakes with huge pages. I need to repeat the cached by OS benchmarks and will replace the results I shared today.
My latest round of this is for the insert benchmark. It has 6 steps and they used to be:
  1. Insert in PK order with a PK index and without secondary indexes
  2. Create 3 secondary indexes
  3. Insert in PK order, but in random order relative to the secondary indexes, with a PK index and 3 secondary indexes. All indexes require maintenance per insert.
  4. Do queries as fast as possible. Do rate-limited inserts (100/s/client).
  5. Do queries as fast as possible. Do rate-limited inserts (500/s/client).
  6. Do queries as fast as possible. Do rate-limited inserts (1000/s/client).
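The rate-limited insert steps need a pacer on the insert side. This is not the insert benchmark's actual code, just a minimal fixed-interval pacer in Python to show the idea:

```python
import time

def run_rate_limited(do_insert, target_rate, duration):
    """Call do_insert() at roughly target_rate calls/second for duration seconds."""
    interval = 1.0 / target_rate
    start = time.monotonic()
    next_at = start
    issued = 0
    while True:
        now = time.monotonic()
        if now - start >= duration:
            break
        if now < next_at:
            # Sleep until the next scheduled insert time
            time.sleep(next_at - now)
        do_insert()
        issued += 1
        next_at += interval
    return issued

# Pace a no-op "insert" at ~100/s for 0.2s of wall-clock time
count = run_rate_limited(lambda: None, target_rate=100, duration=0.2)
```

Each client thread runs its own pacer, which is why the rates above are per client.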
The initial usage was for data sets larger than memory and I could control the benchmark step in which that occurred based on how I set up the benchmark. Eventually I started to use it for in-memory workloads to search for CPU regressions. Unfortunately, when an insert-only workload runs long enough the database eventually stops fitting in memory. And on a modern server I usually can't run benchmark steps 1 and 3 for more than a few minutes before the database is larger than memory.

This is a problem when I want to check that b-tree writeback/checkpoint and LSM compaction provide stable (minimal variance) throughput over time. So I finally did it -- I added an option to the insert benchmark (--delete_per_insert) that optionally does a delete per insert to keep the database from growing, and I use this for benchmark steps 3, 4, 5 and 6 above. The inserts done in those steps use a separate thread (and a separate database connection). The deletes do the same and match the rate of the inserts. The deletes are done from the tail of the table: the table has a PK on transactionid, inserts use increasing transactionid values, and deletes remove the N rows with the smallest transactionid values.
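The delete-from-the-tail scheme can be sketched as follows. In-memory SQLite stands in for Postgres so the sketch is self-contained, and the table and column names are made up for illustration:

```python
import sqlite3

# SQLite stands in for Postgres here so the sketch is self-contained
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (transactionid INTEGER PRIMARY KEY, data TEXT)")

next_id = 0

def insert_one():
    # Inserts use increasing transactionid values
    global next_id
    next_id += 1
    conn.execute("INSERT INTO t VALUES (?, 'x')", (next_id,))

def delete_tail(n):
    # Remove the n rows with the smallest transactionid values
    conn.execute(
        "DELETE FROM t WHERE transactionid IN "
        "(SELECT transactionid FROM t ORDER BY transactionid LIMIT ?)",
        (n,))

for _ in range(100):          # load phase: rows 1..100
    insert_one()
for _ in range(50):           # steady state: one delete per insert
    insert_one()
    delete_tail(1)

count, min_id = conn.execute(
    "SELECT COUNT(*), MIN(transactionid) FROM t").fetchone()
# The row count stays flat while the transactionid range slides forward
```

In the benchmark the inserts and deletes run on separate connections; they are combined here only to keep the sketch short.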

First up with the new and improved insert benchmark, or insert & delete benchmark, is Postgres. I am using the workload to improve the Postgres configuration that I use with much help from Postgres experts.


I ran two variants of the benchmark on a c2-standard-30 server that has 15 cores, hyperthreads disabled, 120G of RAM and XFS on SW RAID 0 across 4 NVMe devices. The variants are:
  • cached by Postgres - database is cached by Postgres
  • cached by OS - database is cached by the OS page cache but not by Postgres
The config files that I use are here; for cached by OS I use shared_buffers=4G. The benchmark uses 8 tables with a client per table. The benchmark steps are:
  1. Insert 20M rows/table in parallel
  2. Create 3 secondary indexes per table (I ignore perf from this step)
  3. Insert 100M rows/table in parallel with a delete per insert
  4. Do queries as fast as possible with 100 inserts/s/client and a delete per insert
  5. Do queries as fast as possible with 500 inserts/s/client and a delete per insert
  6. Do queries as fast as possible with 1000 inserts/s/client and a delete per insert

The benchmark reports are here for cached by Postgres and cached by OS.

While I have yet to get results for the insert+delete benchmark from InnoDB or MyRocks, I see more variance than desired with the Postgres configurations that I tested.

Things that interest me are:
  • Average throughput for l.i1 (benchmark step 3) for cached by Postgres and by OS
  • Response time distribution for inserts during l.i1 for cached by Postgres and by OS
  • Insert rate at 1-second intervals during l.i1 for cached by Postgres and by OS
From the above I rank the impact of changes to the base configuration file I used. The base configs are here for cached by Postgres and cached by OS. They already benefit from changes I made to increase the rate at which autovacuum runs (see here). With those changes the database doesn't get larger than ~50G; without them it grew to be much larger than memory.
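The actual autovacuum changes are behind the link above. For illustration, making autovacuum wake up more often and do more work per cycle typically involves settings like these -- the values shown are assumptions, not the ones from my config:

```
# postgresql.conf -- illustrative values, not the ones I used
autovacuum_max_workers = 8
autovacuum_naptime = 10s
autovacuum_vacuum_cost_limit = 4000
autovacuum_vacuum_scale_factor = 0.05
```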

But there were other changes suggested by Postgres experts and I tried many of them. The base config for cached by Postgres is named conf.diff.cx7a_gcp_c2s30. There are 9 variations that I tried -- 8 suggested by others, one I added. Using the cached by Postgres names they are:
  • conf.diff.cx7a1_gcp_c2s30 - adds wal_compression=lz4
  • conf.diff.cx7a2_gcp_c2s30 - adds backend_flush_after=256kb
  • conf.diff.cx7a3_gcp_c2s30 - adds full_page_writes=off (not for production!)
  • conf.diff.cx7a4_gcp_c2s30 - adds checkpoint_timeout=1800s
  • conf.diff.cx7a5_gcp_c2s30 - adds min_wal_size=16384
  • conf.diff.cx7a6_gcp_c2s30 - adds backend_flush_after=256kb, initdb --wal-segsize 256, inherits larger value for wal_buffers
  • conf.diff.cx7a7_gcp_c2s30 - adds max_wal_size=16GB
  • conf.diff.cx7a8_gcp_c2s30 - adds shared_buffers=8GB (hurts perf for cached by Postgres, improves it for cached by OS because by OS base config has shared_buffers=4GB)
  • conf.diff.cx7a9_gcp_c2s30 - adds checkpoint_timeout=900s
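Each variant is the base config plus the listed override. Assuming the diffs are plain postgresql.conf overrides, a variant file adds no more than a line or two; for example:

```
# conf.diff.cx7a1_gcp_c2s30: base config plus
wal_compression = lz4

# conf.diff.cx7a9_gcp_c2s30: base config plus
checkpoint_timeout = 900s
```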
I judged the impact of each of the config file variants listed above. The a1, a3, a4 and a9 configs help performance. The a2, a5 and a7 configs hurt performance. The a6 config hurts perf in 3 cases and helps it in one.

Legend for header:
* config - the config file variant (a1 ... a9)
* avg.tp - impact on average insert rate
* rt - impact on response time distribution

Legend for symbols:
+     -> some improvement
++    -> much improvement
-     -> some regression
--    -> much regression
blank -> not much impact
NA    -> ignored for now

        avg.tp                  rt
config  byOS    byPG    ::      byOS    byPG
a1      +       +
a2      --                      --      --
a3      ++      ++              +       +
a4      +
a5      -       -
a6      --      +               --
a7      --      --              --      --
a8      NA      NA              NA      NA
a9      +       +
