Wednesday, August 16, 2023

Postgres 16 beta3 and the Insert Benchmark on a medium server

Earlier this year I spent time comparing Postgres 16 beta1 with Postgres 15 to check for regressions, then realized I needed to confirm the configurations I use were good and that took longer than expected. Now I return to testing the beta releases. A post about Postgres 15 on the same hardware is here.

In this post I compare Postgres 15.3, 15.4, 16 beta1, 16 beta2 and 16 beta3 using the Insert Benchmark on a medium server (15 cores, 120G RAM).

tl;dr

  • I am repeating the tests to get more results
  • Create index is ~8% faster in 16 beta vs 15.x
  • For the write heavy benchmark steps the lack of fairness during the l.i1 (inserts+deletes) benchmark step is a big problem for all versions
  • For the read+write benchmark steps pg16 beta3 struggles or fails to sustain the target insert rates in the IO-bound setups. With each release from 15.4 to 16 beta3 the results get worse.
Builds

I compiled Postgres 15.3, 15.4, 16 beta1, 16 beta2 and 16 beta3 from source. The builds are named o3_native_lto which is shorthand for using: -O3 -march=native -mtune=native -flto.

Benchmark

The insert benchmark was run in two setups.

  • cached by Postgres - all tables are cached by Postgres
  • IO-bound - the database is larger than memory

The benchmark used a c2-standard-30 server from GCP with Ubuntu 22.04, 15 cores, hyperthreads disabled, 120G of RAM and 1.5T of storage from RAID 0 over 4 local NVMe devices with XFS.

The benchmark is run with 8 clients and 8 tables (client per table). The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows per table where X is 20 for cached and 500 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another X million rows per table with secondary index maintenance where X is 200 for cached by InnoDB and 40 for IO-bound. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail. This step took 6000+ seconds for Cached by Postgres and 25,000+ seconds for IO-bound.
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.
Configurations

I used the a27 and a28 configurations. From previous work, a27 was good for Cached by Postgres and a28 was good for IO-bound. The diff between them is:
  • wal_compression is lz4 for a27 and off for a28
  • autovacuum_vacuum_cost_limit is 2000 for a27 and 4000 for a28
  • max_wal_size is 70G for a27 and 32GB for a28
Results


I analyze these in terms of:
  • average throughput
  • fairness between benchmark clients
  • SLA for background inserts and deletes
Results: average throughput

Cached by Postgres (see here)
  • For l.i1 there is a ~5% regression from 15.3 to 16 beta3. From the HW perf metrics I see ...
    • small increase (48 -> 49) in CPU/operation (see cpupq column) and average CPU utilization (vmstat us + sy, see cpups column) is 78.3 for 15.3 vs 76.5 for 16 beta3, which might be from doing more IO or an increase in IO latency.
    • But from the wkbpi (KB written to storage/operation) I see that 16 beta3 does slightly less write IO than 15.3.
    • The wmbps (MB/sec written to storage) numbers are correlated with the throughput implies that either storage write latency varies or some versions are able to do writeback faster.
  • Results for the other benchmark steps are similar across versions
For l.i1 I also computed average values writes from iostat and I don't see an increase in average latency (see w_await) for the cases where versions were slower. I do see a decrease in wMB/s so for some reason writeback wasn't happening as fast. The table below has the average values from the l.i1 step.

version w/s     wMB/s   wrqm/s  w_await wareq-sz  aqu-sz  %util
15.3    1175    196.9   0.0     0.935   165.3     2.329   66.6
15.4    1107    184.9   0.0     0.860   163.8     2.059   64.3
16beta1 1182    200.9   0.0     1.041   166.7     2.673   68.0
16beta2 1089    179.8   0.0     0.864   162.9     2.070   62.6
16beta3 1103    186.1   0.0     0.883   164.3     2.103   64.6

For IO-bound with config a27 (see here)
  • For l.x (create index) 16 beta releases are ~8% faster than 15.x perhaps because writeback is faster. From the HW perf metrics the write rate to storage (wmbps is MB/s written to storage) is higher for 16 beta
  • For l.i1 (inserts+deletes) 16 beta1 and beta2 are 11% to 15% slower than 15.3, but 16 beta3 is only 2% slower. From HW perf metrics the problem is more CPU overhead (cpupq is CPU/operation) but it is hard to tell whether that is CPU consumed by user threads or by background (vacuum). Given the lack of fairness (see section below) it is harder to reason about the root causes.
  • For read+write the results for q100 and q500 are similar across versions. For q1000 versions 15.4 and 16 beta3 do much better perhaps because they failed to sustain the target insert+delete rates -- they get 7209/s and 7486/s when the target is 8000/s.
For IO-bound with config a28 (see here)
  • For l.x (create index) 16 beta releases are ~8% faster than 15.x
  • For l.i1 (inserts+deletes) results for all versions are similar excluding 16 beta2
  • For read+write the results for q100 and q500 results for all versions are similar excluding 16 beta2 (again, what happened there?). But for q1000 only 15.3 was able to sustain the target insert rate of ~8000/s and the rate that can be sustained drops in each release following 15.3. The HW perf metrics for q1000 can be confusing. I see a large reduction in rmbps (read MB/s from storage) and wmbps (write MB/s from storage). Did storage or Postgres get slower?
It is hard to compare some of the HW perf metrics for IO-bound with config a28 because the versions are not doing work (queries, inserts, deletes) at the same rate, but I checked the iostat results to confirm that storage latency isn't the problem and it doesn't appear to be based on the values for w_await and r_await below.

For IO-bound, config a28 and the l.i1 benchmark step

version w/s     wMB/s   wrqm/s  w_await wareq-sz  aqu-sz  %util
15.3    20157   317947  0.0     0.418   22.5      10.4    79.16
15.4    16543   278533  0.0     0.337   25.4       7.9    66.11
16beta1 18233   305163  0.0     0.405   26.1       9.4    81.82
16beta2 19296   349306  0.0     0.517   30.0      11.7    83.76
16beta3 17284   310738  0.0     0.430   32.9       9.2    73.27

version r/s     rMB/s   rrqm/s  r_await rareq-sz
15.3    11439   132618  0.0     0.207   15.0
15.4     8309    93010  0.0     0.200   12.2
16beta1 10297   116141  0.0     0.208   14.6
16beta2  9646   117296  0.0     0.219   15.1
16beta3  8813   107172  0.0     0.211   15.6

For IO-bound, config a28 and the q1000 benchmark step

version w/s     wMB/s   wrqm/s  w_await wareq-sz  aqu-sz  %util
15.3    20645   280606  0.0     0.275   17.4      8.715   97.57
15.4    18735   242598  0.0     0.152   17.3      4.738   91.35
16beta1 18375   239554  0.0     0.203   17.2      5.796   83.97
16beta2 18140   232740  0.0     0.247   16.9      5.383   91.77
16beta3 15730   203518  0.0     0.315   19.2      7.035   79.99

version r/s     rMB/s   rrqm/s  r_await rareq-sz
15.3    11499   115412  0.0     0.195   9.9
15.4    10799    89337  0.0     0.186   8.3
16beta1 10589    87671  0.0     0.190   8.3
16beta2 10414    86414  0.0     0.183   8.3
16beta3  9165    76209  0.0     0.182   8.5

Results: fairness between benchmark clients

Fairness is a big problem for all versions during the l.i1 benchmark step as show by the following tables. I expect all benchmark clients to finish at about the same time as each does the same number of inserts and deletes. But with Postgres some clients run much faster than others and the problem is much worse for the IO-bound setups.

Legend:
* min_time:  minimum time in seconds for a client to finish
* max_time: maximum time in seconds for a client to finish
* diff: max_time - min_time
* pct: (diff / min_time) * 100

cached by Postgres
version diff    pct
15.3    365     5.9
15.4    536     8.5
16beta1 378     6.3
16beta2 661     10.3
16beta3 596     9.5

io-bound, a27
version diff    pct
15.3    18242   186.9
15.4    16743   138.6
16beta1 18559   142.2
16beta2 18773   127.9
16beta3 15960   127.7

io-bound, a28
version diff    pct
15.3    15468   102.9
15.4    18808   151.4
16beta1 15723   110.7
16beta2 13275   107.3
16beta3 18398   159.7

Results: SLA for background inserts and deletes

From the last table in the Summary sections
  • Cached by Postgres - all versions sustain the target insert rates
  • IO-bound with a27 - 15.3 and 16 beta3 don't sustain the target insert rates for q1000 and 16 beta 2 almost fails for q1000
  • IO-bound with a28 - all versions after 15.3 fail to sustain the target insert rates for q1000 and the failure gets worse from 15.4 to 16 beta3
















No comments:

Post a Comment

How efficient is RocksDB for IO-bound, point-query workloads?

How efficient is RocksDB for workloads that are IO-bound and read-only? One way to answer this is to measure the CPU overhead from RocksDB a...