Small Datum: Postgres 16 beta3 and the Insert Benchmark on a medium server

Earlier this year I spent time comparing Postgres 16 beta1 with Postgres 15 to check for regressions, then realized I needed to confirm the configurations I use were good and that took longer than expected. Now I return to testing the beta releases. A post about Postgres 15 on the same hardware is here.

In this post I compare Postgres 15.3, 15.4, 16 beta1, 16 beta2 and 16 beta3 using the Insert Benchmark on a medium server (15 cores, 120G RAM).

tl;dr

I am repeating the tests to get more results
Create index is ~8% faster in 16 beta vs 15.x
For the write heavy benchmark steps the lack of fairness during the l.i1 (inserts+deletes) benchmark step is a big problem for all versions
For the read+write benchmark steps pg16 beta3 struggles or fails to sustain the target insert rates in the IO-bound setups. With each release from 15.4 to 16 beta3 the results get worse.

Builds

I compiled Postgres 15.3, 15.4, 16 beta1, 16 beta2 and 16 beta3 from source. The builds are named o3_native_lto which is shorthand for using: -O3 -march=native -mtune=native -flto.

Benchmark

The insert benchmark was run in two setups.

cached by Postgres - all tables are cached by Postgres
IO-bound - the database is larger than memory

The benchmark used a c2-standard-30 server from GCP with Ubuntu 22.04, 15 cores, hyperthreads disabled, 120G of RAM and 1.5T of storage from RAID 0 over 4 local NVMe devices with XFS.

The benchmark is run with 8 clients and 8 tables (client per table). The benchmark is a sequence of steps.

l.i0

insert X million rows per table where X is 20 for cached and 500 for IO-bound

create 3 secondary indexes. I usually ignore performance from this step.

l.i1

insert and delete another X million rows per table with secondary index maintenance where X is 200 for cached by InnoDB and 40 for IO-bound. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail. This step took 6000+ seconds for Cached by Postgres and 25,000+ seconds for IO-bound.

q100, q500, q1000

do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.

Configurations

I used the a27 and a28 configurations. From previous work, a27 was good for Cached by Postgres and a28 was good for IO-bound. The diff between them is:

wal_compression is lz4 for a27 and off for a28
autovacuum_vacuum_cost_limit is 2000 for a27 and 4000 for a28
max_wal_size is 70G for a27 and 32GB for a28

Results

Reports are here for Cached by Postgres, IO-bound with a27, IO-bound with a28.

I analyze these in terms of:

average throughput
fairness between benchmark clients
SLA for background inserts and deletes

Results: average throughput

Cached by Postgres (see here)

For l.i1 there is a ~5% regression from 15.3 to 16 beta3. From the HW perf metrics I see ...

A small increase (48 -> 49) in CPU/operation (see cpupq column) and average CPU utilization (vmstat us + sy, see cpups column) is 78.3 for 15.3 vs 76.5 for 16 beta3, which might be from doing more IO or an increase in IO latency.
But from the wkbpi (KB written to storage/operation) I see that 16 beta3 does slightly less write IO than 15.3.
The wmbps (MB/sec written to storage) numbers are correlated with the throughput implies that either storage write latency varies or some versions are able to do writeback faster.

Results for the other benchmark steps are similar across versions

For l.i1 I also computed average values writes from iostat and I don't see an increase in average latency (see w_await) for the cases where versions were slower. I do see a decrease in wMB/s so for some reason writeback wasn't happening as fast. The table below has the average values from the l.i1 step.

version w/s wMB/s wrqm/s w_await wareq-sz aqu-sz %util

15.3 1175 196.9 0.0 0.935 165.3 2.329 66.6

15.4 1107 184.9 0.0 0.860 163.8 2.059 64.3

16beta1 1182 200.9 0.0 1.041 166.7 2.673 68.0

16beta2 1089 179.8 0.0 0.864 162.9 2.070 62.6

16beta3 1103 186.1 0.0 0.883 164.3 2.103 64.6

For IO-bound with config a27 (see here)

For l.x (create index) 16 beta releases are ~8% faster than 15.x perhaps because writeback is faster. From the HW perf metrics the write rate to storage (wmbps is MB/s written to storage) is higher for 16 beta
For l.i1 (inserts+deletes) 16 beta1 and beta2 are 11% to 15% slower than 15.3, but 16 beta3 is only 2% slower. From HW perf metrics the problem is more CPU overhead (cpupq is CPU/operation) but it is hard to tell whether that is CPU consumed by user threads or by background (vacuum). Given the lack of fairness (see section below) it is harder to reason about the root causes.
For read+write the results for q100 and q500 are similar across versions. For q1000 versions 15.4 and 16 beta3 do much better perhaps because they failed to sustain the target insert+delete rates -- they get 7209/s and 7486/s when the target is 8000/s.

For IO-bound with config a28 (see here)

For l.x (create index) 16 beta releases are ~8% faster than 15.x
For l.i1 (inserts+deletes) results for all versions are similar excluding 16 beta2
For read+write the results for q100 and q500 results for all versions are similar excluding 16 beta2 (again, what happened there?). But for q1000 only 15.3 was able to sustain the target insert rate of ~8000/s and the rate that can be sustained drops in each release following 15.3. The HW perf metrics for q1000 can be confusing. I see a large reduction in rmbps (read MB/s from storage) and wmbps (write MB/s from storage). Did storage or Postgres get slower?

It is hard to compare some of the HW perf metrics for IO-bound with config a28 because the versions are not doing work (queries, inserts, deletes) at the same rate, but I checked the iostat results to confirm that storage latency isn't the problem and it doesn't appear to be based on the values for w_await and r_await below.

For IO-bound, config a28 and the l.i1 benchmark step

version w/s wMB/s wrqm/s w_await wareq-sz aqu-sz %util

15.3 20157 317947 0.0 0.418 22.5 10.4 79.16

15.4 16543 278533 0.0 0.337 25.4 7.9 66.11

16beta1 18233 305163 0.0 0.405 26.1 9.4 81.82

16beta2 19296 349306 0.0 0.517 30.0 11.7 83.76

16beta3 17284 310738 0.0 0.430 32.9 9.2 73.27

version r/s rMB/s rrqm/s r_await rareq-sz

15.3 11439 132618 0.0 0.207 15.0

15.4 8309 93010 0.0 0.200 12.2

16beta1 10297 116141 0.0 0.208 14.6

16beta2 9646 117296 0.0 0.219 15.1

16beta3 8813 107172 0.0 0.211 15.6

For IO-bound, config a28 and the q1000 benchmark step

version w/s wMB/s wrqm/s w_await wareq-sz aqu-sz %util

15.3 20645 280606 0.0 0.275 17.4 8.715 97.57

15.4 18735 242598 0.0 0.152 17.3 4.738 91.35

16beta1 18375 239554 0.0 0.203 17.2 5.796 83.97

16beta2 18140 232740 0.0 0.247 16.9 5.383 91.77

16beta3 15730 203518 0.0 0.315 19.2 7.035 79.99

version r/s rMB/s rrqm/s r_await rareq-sz

15.3 11499 115412 0.0 0.195 9.9

15.4 10799 89337 0.0 0.186 8.3

16beta1 10589 87671 0.0 0.190 8.3

16beta2 10414 86414 0.0 0.183 8.3

16beta3 9165 76209 0.0 0.182 8.5

Results: fairness between benchmark clients

Fairness is a big problem for all versions during the l.i1 benchmark step as show by the following tables. I expect all benchmark clients to finish at about the same time as each does the same number of inserts and deletes. But with Postgres some clients run much faster than others and the problem is much worse for the IO-bound setups.

Legend:

* min_time: minimum time in seconds for a client to finish

* max_time: maximum time in seconds for a client to finish

* diff: max_time - min_time

* pct: (diff / min_time) * 100

cached by Postgres

version diff pct

15.3 365 5.9

15.4 536 8.5

16beta1 378 6.3

16beta2 661 10.3

16beta3 596 9.5

io-bound, a27

version diff pct

15.3 18242 186.9

15.4 16743 138.6

16beta1 18559 142.2

16beta2 18773 127.9

16beta3 15960 127.7

io-bound, a28

version diff pct

15.3 15468 102.9

15.4 18808 151.4

16beta1 15723 110.7

16beta2 13275 107.3

16beta3 18398 159.7

Results: SLA for background inserts and deletes

From the last table in the Summary sections

Cached by Postgres - all versions sustain the target insert rates
IO-bound with a27 - 15.3 and 16 beta3 don't sustain the target insert rates for q1000 and 16 beta 2 almost fails for q1000
IO-bound with a28 - all versions after 15.3 fail to sustain the target insert rates for q1000 and the failure gets worse from 15.4 to 16 beta3

Small Datum

Wednesday, August 16, 2023

Postgres 16 beta3 and the Insert Benchmark on a medium server

No comments:

Post a Comment

How efficient is RocksDB for IO-bound, point-query workloads?