This post has results for the Insert Benchmark on a small server with a cached workload. The goal is to compare new Postgres releases with older ones to determine whether they get better or worse over time. The results here are from the newest small servers in my test cluster -- an ASUS PN53.
This work was done by Small Datum LLC.
The workload here has low concurrency (1 or 4 clients) and the database is cached. The results might be different when the workload is IO-bound or has more concurrency. Results were recently shared from tests run on an older small server.
tl;dr
- There are no regressions from Postgres 16.3 to 17beta1 for this benchmark
- All Postgres versions have too much variance on the insert rate for the l.i1 and l.i2 benchmark steps. While tuning fixes that is a too frequent response, I spent much time trying to fix this via tuning last year without success. This has not been a problem for modern MySQL and one example is here (equivalent Postgres graphs are here). The issue is explained below.
- The builtin provider might have improved performance on two of the benchmark steps (l.i1, l.i2) when comparing Postgres 17beta1 with and without it. But I need to repeat the benchmark to confirm that the result isn't just noise.
- cx9a2_c8r32 - --data-checksums
- cx9a2lc_c8r32 - --data-checksums --locale-provider=builtin --builtin-local=C.UTF-8
The benchmark steps are:
- l.i0
- insert 50 million rows in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
- l.x
- create 3 secondary indexes per table. There is one connection per client.
- l.i1
- use 2 connections/client. One inserts 40M rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
- l.i2
- like l.i1 but each transaction modifies 5 rows (small transactions) and 10M rows are inserted and deleted.
- Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
- qr100
- use 3 connections/client. One does range queries and performance is reported for this. The second does does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
- qp100
- like qr100 except uses point queries on the PK index
- qr500
- like qr100 but the insert and delete rates are increased from 100/s to 500/s
- qp500
- like qp100 but the insert and delete rates are increased from 100/s to 500/s
- qr1000
- like qr100 but the insert and delete rates are increased from 100/s to 1000/s
- qp1000
- like qp100 but the insert and delete rates are increased from 100/s to 1000/s
- insert/s for l.i0, l.i1, l.i2
- indexed rows/s for l.x
- range queries/s for qr100, qr500, qr1000
- point queries/s for qp100, qp500, qp1000
- Postgres 16.3 with the cx9a2_c8r32 config (pg163_def.cx9a2_c8r32)
- Postgres 17beta1 with the cx9a2_c8r32 config (pg17beta1_def.cx9a2_c8r32)
- Postgres 17beta1 with the cx9a2lc_c8r32 config (pg17beta1_def.cx9a2lc_c8r32) that uses the builtin locale provider
- Postgres 16.3 and 17beta1 have similar performance
- It isn't clear that the builtin provider helps this benchmark
- The results for l.i1 and l.i2 should be revisited to explain why 17beta1 without the builtin locale provider didn't do better
- l.i0
- relative QPS is 1.14 in PG 16.3
- relative QPS is 1.18 in PG 17beta1 without the builtin locale
- relative QPS is 1.18 in PG 17beta1 with the builtin locale
- l.x - I ignore this for now
- l.i1, l.i2
- relative QPS is 2.23, 2.27 in PG 16.3
- relative QPS is 1.99, 2.00 in PG 17beta1 without the builtin locale
- relative QPS is 2.29, 2.22 in PG 17beta1 with the builtin locale
- qr100, qr500, qr1000
- relative QPS is 1.03, 1.08, 1.08 in PG 16.3
- relative QPS is 1.04, 1.07, 1.08 in PG 17beta1 without the builtin locale
- relative QPS is 1.05, 1.07, 1.07 in PG 17beta1 with the builtin locale
- qp100, qp500, qp1000
- relative QPS is 0.98, 0.98, 0.97 in PG 16.3
- relative QPS is 0.97, 0.98, 0.97 in PG 17beta1 without the builtin locale
- relative QPS is 0.96, 0.97, 0.96 in PG 17beta1 with the builtin locale
Metrics for each of the benchmark steps starts here. By metrics I mean mostly counters from iostat and vmstat normalized by the insert or query rates. These help to explain differences in performance and spot where something is using more or less HW per operation. Here I focus on the metrics for the l.i1 and l.i2 benchmark steps because that is where there is a difference for Postgres 17beta1 with and without the builtin locale provider.
The other interesting result is that the number of context switches per operation (cspq) is about 15% smaller in Postgres 17beta1 vs 16.3. This is a good change, but I want to see more results before I am certain about it.
The response time histograms for l.i1 makes this easier to understand. The histograms are great for inserts but lousy for deletes. The problems is explained in this Twitter thread and blog post. The issue is that the Postgres optimizer probes an index to figure out the min value for a column in some cases, and while this part of the benchmark isn't written in the best way, I still prefer that Postgres do better here and get_actual_variable_range is the culprit.