Friday, July 14, 2023

Tuning Postgres and the new insert benchmark on a small server, part 2

This follows my previous post on Postgres vs the insert benchmark on a small server. New in this post: results for Postgres 16 beta1, results for an IO-bound setup and results from more configuration options. I number the config files I test and have reached 22.

tl;dr

  • There is too much variance for the l.i1 benchmark step (inserts and deletes without rate limits, the table has a PK and 3 secondary indexes) with the IO-bound setup. I have yet to find config options that make this better. I am repeating the benchmarks but will run the write-heavy benchmark step for ~4 hours rather than ~2 hours. This problem also reproduces on a larger server -- see here.
  • The biggest improvement comes from the configs that decrease the values for autovacuum_vacuum_scale_factor and autovacuum_vacuum_insert_scale_factor.

Benchmark

This benchmark used the Beelink server explained here: it has 8 cores, 16G of RAM and a 1TB NVMe SSD with XFS, and runs Ubuntu 22.04. Postgres 15.2 and 16 beta1 were used, compiled from source with the o2_nofp build (-O2 -fno-omit-frame-pointer).

There were three setups for the benchmark and for each the benchmark was repeated for 1 and 4 clients with a table per client:

  • Cached by Postgres - database cached by Postgres, shared_buffers=10GB
  • Cached by OS - database cached by OS, but not by Postgres, shared_buffers=1GB
  • IO-bound - database is larger than memory, shared_buffers=10GB
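
The setups are defined by shared_buffers; the base config files linked below set it. As a rough SQL equivalent of what those files do (an illustration only -- the benchmark uses conf.diff files, and shared_buffers takes effect only after a restart):

    -- Cached by Postgres and IO-bound: 10GB buffer pool
    ALTER SYSTEM SET shared_buffers = '10GB';
    -- Cached by OS: 1GB buffer pool so the OS page cache, not Postgres, holds the data
    -- ALTER SYSTEM SET shared_buffers = '1GB';
    SHOW shared_buffers;  -- verify after the restart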

The benchmark steps are explained here.

  • l.i0 - insert 160M rows for Cached by Postgres, 160M rows for Cached by OS and 800M rows for IO-bound
  • l.x - create 3 secondary indexes per table
  • l.i1 - insert and delete 50M rows/table for Cached by Postgres, 50M rows/table for Cached by OS and 10M rows/table for IO-bound.
  • q100, q500, q1000 - run for 1800 seconds each, delete_per_insert is enabled
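
To make delete_per_insert concrete, below is a rough sketch using a hypothetical table t (the real insert benchmark schema and client differ; see the benchmark steps linked above). Each round of l.i1 inserts a batch of rows at the head of the PK range and deletes the same number of the oldest rows, so the table size stays roughly constant:

    -- hypothetical schema: a PK plus 3 secondary indexes
    CREATE TABLE t (id bigint PRIMARY KEY, k1 bigint, k2 bigint, k3 bigint, pad text);
    CREATE INDEX t_k1 ON t (k1);
    CREATE INDEX t_k2 ON t (k2);
    CREATE INDEX t_k3 ON t (k3);

    -- one round: insert 1000 rows with new, larger PK values ...
    INSERT INTO t
    SELECT g, g % 1000, g % 100, g % 10, md5(g::text)
    FROM generate_series(1000001, 1001000) g;
    -- ... then delete the 1000 oldest rows (assumes the table was loaded by l.i0
    -- and that PK values are contiguous)
    DELETE FROM t WHERE id < (SELECT min(id) + 1000 FROM t);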

All of the config files are here. The base config files are here for cached by Postgres and by OS. The other config files make small changes to the base.

The configurations are:
  • a1 - adds wal_compression=lz4
  • a2 - adds backend_flush_after=256kB
  • a3 - adds full_page_writes=off (not for production!)
  • a4 - adds checkpoint_timeout=1800s, which was already set to 1800s in the base config so this can be ignored
  • a9 - adds checkpoint_timeout=900s
  • a10 - adds autovacuum_work_mem=256MB
  • a12 - adds wal_buffers=64MB
  • a14 - adds backend_flush_after=1MB
  • a15 - adds max_wal_size=32GB
  • a16 - adds autovacuum_vacuum_scale_factor=0.05, autovacuum_vacuum_insert_scale_factor=0.05 (a worked example follows this list)
  • a17 - reduces bgwriter_lru_maxpages from 2000 to 500
  • a18 - combines a1, a10, a12 and a16
  • a20 - combines a10, a12, a16
  • a21 - combines a1, a4, a10, a16
  • a22 - combines a4, a10, a16
The full names for the config files are longer. For example, a1 is conf.diff.cx8a1_bee for shared_buffers=10GB and conf.diff.cx8a1_1g_bee for shared_buffers=1GB.
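
The worked example for a16: autovacuum starts a vacuum when dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples, and an insert-triggered vacuum when rows inserted since the last vacuum exceed autovacuum_vacuum_insert_threshold + autovacuum_vacuum_insert_scale_factor * reltuples. With the default scale factors (0.2) a 50M-row table accumulates ~10M dead or newly inserted rows before vacuum starts; with 0.05 that drops to ~2.5M, so vacuum runs more often and has less work per run. A sketch of the a16 change as SQL (an illustration only -- the benchmark applies it via a conf.diff file):

    -- both settings can be changed with a reload, no restart needed
    ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;
    ALTER SYSTEM SET autovacuum_vacuum_insert_scale_factor = 0.05;
    SELECT pg_reload_conf();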

Results

Performance reports are here.
First I consider the impact on average throughput (see links above) for the l.i1 benchmark step because the configuration options have the largest impact there:
  • Cached by Postgres
    • The biggest improvement is from the a16 config that reduces the values of autovacuum_vacuum_*scale_factor (a monitoring query for this is sketched after this list). The a18, a20, a21 and a22 configs also have that change.
  • Cached by OS
    • For the 1 client case results are similar to Cached by Postgres. For the 4 client case the config doesn't make a significant difference.
  • IO-bound
    • Results for the l.i1 benchmark step are hard to explain and almost appear random. I can only guess at the cause. Perhaps 4 clients is too much for the CPU and/or SSD. Perhaps the small server overheats, which is more likely with the arrival of hot weather (these servers are in my home). I am repeating the benchmark for IO-bound so that the l.i1 step will run for ~4 hours rather than ~2 hours.
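
One way to watch the effect of the a16 change during l.i1 is to sample the per-table autovacuum counters. A sketch (n_ins_since_vacuum requires Postgres 13+):

    SELECT relname, n_live_tup, n_dead_tup, n_ins_since_vacuum,
           autovacuum_count, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC;
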
Next I consider the impact on response time:
  • Cached by Postgres for 1 client and 4 clients
    • nothing interesting
  • Cached by OS for 1 client and 4 clients
    • nothing interesting
  • IO-bound for 1 client and 4 clients
    • For l.i1 the a1 and a17 configs have a worse response time histogram. That is expected for a17. The a1 result fits a pattern: I have many results where wal_compression=lz4 hurts performance more than I expect.
Finally I consider the charts for insert/delete/query rates and max response time at 1-second intervals. Note that the charts are from the first client: that doesn't matter for the 1-client benchmarks, but for the 4-client benchmarks they show results from only one of the four clients.
  • Cached by Postgres
    • 1 client: l.i0, l.i1, q100, q500, q1000
      • For l.i1 the insert rate (IPS) charts show too much variance and look like this prior to the a16 config. There is less variance, but still too much, starting with the a16 config (see here).
      • In many cases the max delete response time shows a steady increase and then a sharp drop. Perhaps that lines up with the checkpoint frequency but I don't understand the root cause (see here; a query for checking this is sketched after these charts).
      • QPS has some variance. For q1000 here is an example where the range is between 8000/s and 12000/s (see here).
    • 4 clients: l.i0, l.i1, q100, q500, q1000
      • For l.i1 the insert rate (IPS) charts show a lot of variance. It is between 5000/s and 30000/s for configs prior to a16 (see here) and the range shrinks to between 10000/s and 30000/s with the a16 config (see here).
  • Cached by OS
    • 1 client: l.i0, l.i1, q100, q500, q1000
      • For l.i1 the insert rate (IPS) charts show a range between ~1000/s and ~20000/s for configs prior to a16 (see here) and the range shrinks to between ~5000/s and ~20000/s with a16 (see here).
    • 4 clients: l.i0, l.i1, q100, q500, q1000
      • For l.i1 the benefit from the a16 config isn't as significant here as it was above -- see here for the base config and the a16 config.
  • IO-bound
    • 1 client: l.i0, l.i1, q100, q500, q1000
      • For l.i1 the insert rate (IPS) charts show a steady decline and then a sudden increase while the max delete response time shows a steady increase and then a sudden drop (see here). It takes ~8000 seconds to get through this cycle and the l.i1 benchmark step only runs for ~9000 seconds so I don't yet know whether this repeats.
    • 4 clients: l.i0, l.i1, q100, q500, q1000
      • For l.i1 there are two types of insert rate charts -- some have a slow decline followed by a sudden jump and then some stability (see here) while others have an extended flatline (see here). Both are far from what I seek. These patterns also occur on the medium server (c2-standard-30 on GCP) -- see here and here.
      • For q1000 the query rate (QPS) charts show a regular pattern of a slow increase followed by a sudden drop (see here).
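
To check the checkpoint hypothesis from the Cached by Postgres charts above, one option is to sample the checkpoint counters and correlate them with the response time spikes. A sketch; these columns are in pg_stat_bgwriter through Postgres 16 (they move to pg_stat_checkpointer in 17), and checkpoint details also go to the server log when log_checkpoints is on:

    -- sample periodically and diff the counters
    SELECT checkpoints_timed, checkpoints_req,
           checkpoint_write_time, checkpoint_sync_time
    FROM pg_stat_bgwriter;
    -- checkpoint logging is on by default starting in Postgres 15
    ALTER SYSTEM SET log_checkpoints = on;
    SELECT pg_reload_conf();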