Monday, September 18, 2023

Postgres 16.0 vs the Insert Benchmark on a medium server

This post has results for Postgres 16.0 and 16 rc1 vs the Insert Benchmark on a medium server. Previous posts have results for this benchmark with Postgres 16.0 on a small server and with Postgres 16 beta on a medium server.

tl;dr

  • Postgres 16.0 looks good
  • All versions of Postgres (old and new) have too much variance on this benchmark with the IO-bound setups. See rant here. I like Postgres but not so much for write-heavy workloads.
  • All versions of Postgres were unable to sustain the target insert rate (8000/s) for the IO-bound setup during q1000.
  • All versions of Postgres suffer from fairness problems during l.i1 -- some clients finish much sooner than others, and in the IO-bound setups the slowest client takes from ~2X to ~2.5X longer than the fastest one.

Builds

I compiled Postgres 15.2, 15.3, 15.4, 16 beta1, 16 beta2, 16 beta3, 16 rc1 and 16.0 from source. The builds are named o3_native_lto, which is shorthand for compiling with: -O3 -march=native -mtune=native -flto.
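
One way to double-check the flags in a given build is to ask pg_config from that build. A small Python sketch, where the install path is just a placeholder:

    # Print the CFLAGS a Postgres build was compiled with, using pg_config
    # from that build (the install path below is a placeholder).
    import subprocess

    pg_config = "/path/to/o3_native_lto/bin/pg_config"
    out = subprocess.run([pg_config, "--cflags"], capture_output=True, text=True)
    print(out.stdout.strip())  # should include -O3 -march=native -mtune=native -flto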

Benchmark

The insert benchmark was run in two setups.

  • Cached by Postgres - all tables are cached by Postgres
  • IO-bound - the database is larger than memory

The benchmark used a c2-standard-30 server from GCP with Ubuntu 22.04, 15 cores, hyperthreads disabled, 120G of RAM and 1.5T of storage from RAID 0 over 4 local NVMe devices with XFS.

The benchmark is run with 8 clients and 8 tables and is a sequence of steps:

  • l.i0
    • insert X million rows per table, where X is 20 for Cached and 500 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another X million rows per table with secondary index maintenance, where X is 300 for Cached and 60 for IO-bound. The number of rows per table at the end of this step matches the number at the start because inserts are done at the head of the table while deletes are done from the tail (a sketch of this pattern follows this list).
  • q100, q500, q1000
    • do queries as fast as possible while background threads do 100, 500 and 1000 inserts/s/client, with deletes done at the same rate to keep the table size constant. Each step runs for 7200 seconds.
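
The l.i1 step and the background writes during the q* steps use this insert at the head, delete from the tail pattern. Below is a minimal sketch of the rate-limited form used during the q* steps -- it is not the real benchmark client, and the DSN, table name and schema (t with columns k and v) are placeholders:

    # Insert rate_per_sec rows/s at the head of the table and delete the same
    # number from the tail so the table size stays roughly constant.
    # Not the real benchmark client; table t(k bigint primary key, v text)
    # and the DSN are hypothetical.
    import time
    import psycopg2

    def write_loop(dsn, rate_per_sec, duration_secs, next_key):
        conn = psycopg2.connect(dsn)
        conn.autocommit = True
        cur = conn.cursor()
        deadline = time.time() + duration_secs
        while time.time() < deadline:
            start = time.time()
            for _ in range(rate_per_sec):
                # inserts go to the head: keys are assigned in increasing order
                cur.execute("INSERT INTO t (k, v) VALUES (%s, 'x')", (next_key,))
                next_key += 1
            # deletes come from the tail: drop the rows with the smallest keys
            cur.execute(
                "DELETE FROM t WHERE k IN (SELECT k FROM t ORDER BY k LIMIT %s)",
                (rate_per_sec,))
            # sleep off the rest of the 1-second interval to hold the target rate
            elapsed = time.time() - start
            if elapsed < 1.0:
                time.sleep(1.0 - elapsed)
        cur.close()
        conn.close()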

Configurations

I used the a27 and a28 configurations. From previous work, a27 was good for Cached by Postgres and a28 was good for IO-bound. The diff between them is:
  • wal_compression is lz4 for a27 and off for a28
  • autovacuum_vacuum_cost_limit is 2000 for a27 and 4000 for a28
  • max_wal_size is 70GB for a27 and 32GB for a28
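
Since the a27 and a28 configs differ in just those three settings, it is easy to confirm which one a server is running by reading them from pg_settings. A small Python sketch, with a placeholder connection string:

    # Print the three settings that differ between the a27 and a28 configs,
    # read from pg_settings (the connection string is a placeholder).
    import psycopg2

    conn = psycopg2.connect("dbname=postgres")
    cur = conn.cursor()
    cur.execute(
        "SELECT name, setting, unit FROM pg_settings WHERE name IN "
        "('wal_compression', 'autovacuum_vacuum_cost_limit', 'max_wal_size')")
    for name, setting, unit in cur.fetchall():
        print(name, setting, unit or '')
    cur.close()
    conn.close()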

Results

Reports are here for Cached by Postgres (a27 config, a28 config) and IO-bound (a27 config, a28 config).

The results for average throughput are interesting and confusing. The tables linked below use absolute and relative throughput where relative throughput is (QPS for my version / QPS for base version) and the base version is Postgres 15.2.
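
As a made-up example of how to read the relative throughput: if a benchmark step averages 40k inserts/s with 15.2 and 38k/s with another version, then the relative throughput for that version is 38/40 = 0.95, and values less than 1.0 suggest a regression.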
  • Cached, a27 config (see here)
    • For l.i1
      • The average throughput gradually declines by ~7% from 15.2 to 16.0. I don't know whether that is a regression or just noise (see here). 
      • From HW metrics the CPU overhead (cpupq is CPU/operation) does not change.
      • From the charts with per-second insert rates there is much variance for all versions (see the charts for 15.2 and 16.0) and the insert rate frequently swings between ~10k/s and ~40k/s.
      • All versions suffer from fairness problems. Across the 8 benchmark clients the average time to finish the step is ~10,000 seconds, but the slowest client takes ~1000 seconds longer than the fastest client.
    • For q100, q500 and q1000 the average throughput is the same for all versions.
  • Cached, a28 config (see here)
    • For l.i1
      • The average throughput is the same from 15.2 to 16.0.
      • From the charts with per-second insert rates there is much variance for all versions (see the charts for 15.2 and 16.0) and the insert rate frequently swings between ~10k/s and ~40k/s.
      • All versions suffer from fairness problems. Across the 8 benchmark clients the average time to finish the step is ~10,000 seconds, but the slowest client takes ~1000 seconds longer than the fastest client.
    • For q100, q500 and q1000 the average throughput is the same for all versions.
  • IO-bound, a27 config (see here)
    • For l.i1
      • I am wary of interpreting results because there is too much variance. 
      • The per-second insert rate charts are similar for 15.2 and 16.0 -- there are spikes.
      • All versions suffer from fairness problems. From the 8 benchmark clients the times to finish the l.i1 benchmark step, in seconds, are: 17866, 19057, 34004, 35698, 36535, 38123, 44844, 45365. The slowest client takes about 2.5X longer than the fastest (see the calculation after this list).
    • For q100 and q500 results are probably similar across versions but there is noise.
    • For q1000 none of the versions were able to sustain the target insert rate (1000/s/client for a total of 8000/s). See the third table in the Summary section.
  • IO-bound, a28 config (see here)
    • For l.i1 
      • I am wary of interpreting results because there is too much variance.
      • The per-second insert rate charts are similar for 15.2 and 16.0 -- there are spikes.
      • All versions suffer from fairness problems. From the 8 benchmark clients the times to finish the l.i1 benchmark step, in seconds, are: 18387, 25221, 25764, 26830, 29587, 30214, 35562, 37566. The slowest client takes about 2X longer than the fastest.
    • For q100 and q500 results are probably similar across versions but there is noise.
    • For q1000 none of the versions were able to sustain the target insert rate (1000/s/client for a total of 8000/s). See the third table in the Summary section.
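
To make the fairness gap concrete, the slowest-to-fastest ratio can be computed from the per-client times quoted above. A quick Python sketch:

    # Ratio of the slowest to the fastest client for the l.i1 step, computed
    # from the per-client times (in seconds) listed above.
    a27 = [17866, 19057, 34004, 35698, 36535, 38123, 44844, 45365]
    a28 = [18387, 25221, 25764, 26830, 29587, 30214, 35562, 37566]
    for name, times in (("IO-bound, a27", a27), ("IO-bound, a28", a28)):
        print(name, round(max(times) / min(times), 2))
    # prints 2.54 for IO-bound, a27 and 2.04 for IO-bound, a28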
