This has results for Postgres versions 12.22 through 18.3 with the Insert Benchmark on a small server. My previous post for the same hardware with results up to Postgres 18.1 is here. This post also has results for:
- all 17.x releases from 17.0 through 17.9
- 18.2 with and without full page writes enabled
- both 1 and 4 users
Postgres continues to be boring in a good way. It is hard to find performance regressions. Performance wasn't always stable, but I am reluctant to expect it to show no changes because there are sources of variance beyond the DBMS, especially HW (a too-hot SSD or CPU will run slower). Sometimes perf changes because there are obvious perf bugs, sometimes it changes for other reasons.
tl;dr for a CPU-bound workload
- performance is stable from Postgres 12 through 18
- performance is stable from Postgres 17.0 through 17.9
- disabling full-page writes improves throughput on write-heavy benchmark steps
- performance is mostly stable from Postgres 12 through 18
- performance is stable from Postgres 17.0 through 17.9
- disabling full-page writes improves throughput on write-heavy benchmark steps
- in a few cases there are large improvements to point-query throughput on the qp1000 benchmark step. I will try to explain that soon.
The server is an Beelink SER7 with a Ryzen 7 7840HS CPU with 8 cores and AMD SMT disabled, 32G of RAM. Storage is one SSD for the OS and an NVMe SSD for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04.
- cx10b_fpw0
- this adds full_page_writes=off while cx10b has =on
- cx10b_wallz4
- this adds wal_compression=lz4 while cx10b has =off
- cx10b_fpw0_wallz4
- this adds full_page_writes=off and wal_compression=lz4
- CPU-bound
- for 1 user the values for X, Y, Z are 30M, 40M, 10M
- for 4 users the values for X, Y, Z are 10M, 16M, 4M
- IO-bound
- for 1 user the values for X, Y, Z are 800M, 4M, 1M
- for 4 users the values for X, Y, Z are 200M, 4M, 1M
- l.i0
- insert X rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
- l.x
- create 3 secondary indexes per table. There is one connection per client.
- l.i1
- use 2 connections/client. One inserts Y rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
- l.i2
- like l.i1 but each transaction modifies 5 rows (small transactions) and Z rows are inserted and deleted per table.
- Wait for S seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
- qr100
- use 3 connections/client. One does range queries and performance is reported for this. The second does does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload.
- qp100
- like qr100 except uses point queries on the PK index
- qr500
- like qr100 but the insert and delete rates are increased from 100/s to 500/s
- qp500
- like qp100 but the insert and delete rates are increased from 100/s to 500/s
- qr1000
- like qr100 but the insert and delete rates are increased from 100/s to 1000/s
- qp1000
- like qp100 but the insert and delete rates are increased from 100/s to 1000/s
When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures:
- insert/s for l.i0, l.i1, l.i2
- indexed rows/s for l.x
- range queries/s for qr100, qr500, qr1000
- point queries/s for qp100, qp500, qp1000
- 1 user: latest point releases, all 17.x releases and full-page writes
- 4 users: latest point releases, all 17.x releases and full-page writes
- there is either no change or a small improvement for l.i0 (load in PK order), l.x (create indexes) and the read-write tests (qr*, qp*).
- for l.i1 and l.i2 (random write-only) throughput drops by 5% to 10% from 12.22 to 13.23 and has been stable since then (throughput in 18.2 is similar to 13.23. The CPU per operation overhead (cpupq here) increases after 12.22 for the l.i2 step but there wasn't an obvious increase for the l.i1 step - but the way I measure this is far from perfect. The results I share here are worse than what I measured in December 2025.
- there might be a small (3%) regression for l.i0 (load in PK order) in 18.2 vs 12.22. Perhaps this is noise. From vmstat and iostat metrics there aren't obvious changes.
- throughput in 18.2 is better than 12.22 for all other benchmark steps
- throughput is stable from 17.0 to 17.9 for all benchmark steps except l.i1 and l.i2 (random writes) where there might be a 5% regression late in 17.x. This might be from new CPU overhead - see cpupq here.
- throughput is stable with small improvements from 17.0 to 17.9
- throughput improves by ~5% for l.i1 and l.i2 (random writes) when full-page writes are disabled and KB written to storage per commit drops by ~20% -- see wkbpi here.
- enabling wal_compression=lz4 decreases write throughput for all write-heavy steps when full-page writes are enabled. The impact is smaller when full page writes are disabled.
- throughput improves by <= 5% for all write-heavy steps when full-page writes are disabled
- the impact from wal_compression=lz4 isn't obvious
- 1 user: latest point releases, all 17.x releases and full-page writes
- 4 users: latest point releases, all 17.x releases and full-page writes
- there are small (<= 10%) improvements for l.i0 (load in PK order) and l.x (create index). I don't see anything obvious in vmstat and iostat metrics to explain this.
- there are small (<= 10%) regressions for l.i1 and l.i2 (random writes) that might be from a sequence of small regressions from 13.x through 18.x. I don't see anything obvious in vmstat and iostat metrics to explain this.
- throughput is unchanged for the range-query read+write tests (qr*)
- throughput improves by ~1.4X for the point-query read+write tests (qp*). This improvement arrived in 13.x. This can be explained by large drops in CPU overhead (cpupq) and context switch rates (cspq) -- see here.
- the results here are similar to what I measured in December 2025
- there are small (~10%) regressions for l.i0 (load in PK order) that arrived in 17.x. The context switch rate (cspq) increases in 17.x.
- there are small (<= 20%) improvements for l.x (create index) that arrived in 13.x
- there are large regressions for l.i1 and l.i2 (random writes) that arrive in 15.x through 18.x. There are large increases in CPU overhead (cpupq) -- see here.
- throughput is unchanged for the range-query read+write tests (qr*)
- throughput improves for the point-query read+write tests (qp*) at higher write rates (qp500, qp1000).
- throughput is stable with a few exceptions
- for qp1000 (point-query, read+write) it improves by ~5% in 17.1 and is then stable to 17.9
- in 17.9 there are large (~1.4x) improvements for all of the point-query, read+write tests
- the changes in throughput for qp1000 might be explained by a small drop in CPU overhead per query (cpupq) that arrived in 17.1 and a large drop that arrived in 17.9 -- see here.
- throughput for most steps (l.i0, l.x, qr*, qp100, qp500) is stable
- throughput for l.i1 and l.i2 (random writes) has more variance
- throughput for qp1000 drops by up to 10% from 17.3 through 17.8 and in those cases the CPU overhead increased -- see cpupq here.
- throughput improves by 6% for l.i1 (random writes) when full-page writes are disabled
- throughput improved for qp* tests when either full-page writes were disabled or lz4 was used for log_compression. That is harder to explain, perhaps it is noise.
- throughput improves by 20% for l.i1 (random writes) when full-page writes are disabled
- throughput improved for qp* tests when either full-page writes were disabled or lz4 was used for log_compression. That is harder to explain, perhaps it is noise.