Monday, February 19, 2024

Perf regressions in Postgres from 9.0 to 16 with sysbench and a small server

This has results for sysbench vs Postgres on a small server. I have results for versions from 9.0 through 16. My last report only went back to Postgres 11. The goal is to document where things get faster or slower over time for a low-concurrency and CPU-bound workload. The focus is on CPU regressions. 

My results here aren't universal, but you have to start somewhere:

  • The microbenchmarks here mostly measure CPU overheads
  • Things won't look the same with an IO-bound workload
  • Things won't look the same with a workload that has more concurrency 
  • Things won't look the same with a workload that has complex queries
Summaries

Sections after this explain how the microbenchmark results are grouped.

Comparing Postgres 16.2 with 9.0.23:
  • point query, part 1
    • Postgres 16.2 is faster than 9.0.23 for all but one microbenchmark
  • point query, part 2
    • Postgres 16.2 is faster than 9.0.23 for all microbenchmarks
  • range query, part 1 & part 2
    • About half of the microbenchmarks are ~20% slower in 16.2 vs 9.0.23
    • The big regression occurs between 9.0 and 9.1
    • For part 2, where aggregation is done, the problem is worse for shorter range scans
  • writes
    • Postgres 16.2 is faster than 9.0.23 for all microbenchmarks

Comparing Postgres 16.2 with 10.23
  • Postgres 16.2 is faster than 10.23 for all microbenchmarks

Comparing Postgres 16.2 with 14.10
  • point query, part 1
    • Postgres 16.2 is at most 4% slower than 14.10
  • point query, part 2
    • Postgres 16.2 is at most 1% slower than 14.10
  • range query, part 1
    • Postgres 16.2 is at most 5% slower than 14.10
  • range query, part 2
    • Postgres 16.2 is as fast or faster than 14.10
  • writes
    • Postgres 16.2 is at most 1% slower than 14.10
Build + Configuration

I used these versions: 9.0.23, 9.1.24, 9.2.24, 9.3.25, 9.4.26, 9.5.25, 9.6.24, 10.23, 11.22, 12.17, 13.13, 14.10, 14.11, 15.5, 15.6, 16.1 and 16.2.

The configuration files are in the subdirectories named pg9, pg10, pg11, pg12, pg13, pg14, pg15 and pg16 from here. They are named conf.diff.cx9a2_bee.

Benchmarks

I used sysbench and my usage is explained here. There are 42 microbenchmarks and each tests ~1 type of SQL statement and is run for 1200 seconds.

Tests were run on a small server I have at home (see here). The server is an SER4 from Beelink with 8 cores, 16G of RAM and 1 m.2 storage device with XFS and Ubuntu 22.04. The test tables are cached by Postgres.

The benchmark is run with:
  • one connection
  • 30M rows and a database cached by Postgres
  • each microbenchmark runs for 1200 seconds
  • prepared statements were enabled
The command line was: bash r.sh 1 30000000 1200 1200 nvme0n1 1 1 1

Results

For the results below I split the microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. Unfortunately, I included the full scan microbenchmark (scan_range=100) in part 2 but it doesn't do aggregation. The spreadsheet with all data and charts is here and is easier to read.

All of the charts have relative throughput on the y-axis where that is (QPS for $me) / (QPS for $base), where $me is a version (for example 15.5) and $base is the base version. The base version is specified below and is one of 9.0.23, 10.23 and 14.10 depending on what I am comparing. The y-axis doesn't start at 0 to improve readability.
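
For example, with hypothetical numbers: if the base version gets 4000 QPS on a microbenchmark and the version being compared gets 4400 QPS, then the relative throughput is 4400 / 4000 = 1.10, so the newer version is 10% faster. A value below 1.0 indicates a regression.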

The legend under the x-axis truncates the names I use for the microbenchmarks and I don't know how to fix that other than sharing the link to the Google Sheet I used. Files I used to create the spreadsheets are here.

Results: from 9.0 through 16.2

Summary:
  • point query, part 1
    • Postgres 16.2 is faster than 9.0.23 for all but one microbenchmark
  • point query, part 2
    • Postgres 16.2 is faster than 9.0.23 for all microbenchmarks
  • range query, part 1 & part 2
    • About half of the microbenchmarks are ~20% slower in 16.2 vs 9.0.23
    • The big regression occurs between 9.0 and 9.1
    • For part 2, where aggregation is done, the problem is worse for shorter range scans
  • writes
    • Postgres 16.2 is faster than 9.0.23 for all microbenchmarks
This table has summary statistics from Postgres 16.2 for each microbenchmark group. The numbers represent the relative QPS (relative to 9.0.23) and a value > 1 means that 16.2 is faster than 9.0.23.

        min     max     avg     median  stdev
point-1 0.92    1.32    1.18    1.20    0.12
point-2 1.07    1.18    1.12    1.13    0.04
range-1 0.77    1.68    1.09    1.00    0.37
range-2 0.78    1.34    1.01    0.85    0.25
writes  1.11    4.64    2.21    1.97    1.10

Results: from 10.23 through 16.2

Summary
  • Postgres 16.2 is faster than 10.23 for all microbenchmarks
This table has summary statistics from Postgres 16.2 for each microbenchmark group. The numbers represent the relative QPS (relative to 10.23) and a value > 1 means that 16.2 is faster than 10.23.

        min     max     avg     median  stdev
point-1 1.02    1.11    1.07    1.08    0.03
point-2 1.04    1.08    1.06    1.06    0.01
range-1 1.07    1.13    1.10    1.10    0.02
range-2 1.04    1.09    1.06    1.05    0.02
writes  1.02    1.15    1.08    1.06    0.04

Results: 14.10, 14.11, 15.5, 15.6, 16.1, 16.2

Summary
  • point query, part 1
    • Postgres 16.2 is at most 4% slower than 14.10
  • point query, part 2
    • Postgres 16.2 is at most 1% slower than 14.10
  • range query, part 1
    • Postgres 16.2 is at most 5% slower than 14.10
  • range query, part 2
    • Postgres 16.2 is as fast or faster than 14.10
  • writes
    • Postgres 16.2 is at most 1% slower than 14.10
This table has summary statistics from Postgres 16.2 for each microbenchmark group. The numbers represent the relative QPS (relative to 14.10) and a value > 1 means that 16.2 is faster than 14.10.

        min     max     avg     median  stdev
point-1 0.96    1.07    1.00    1.00    0.03
point-2 0.99    1.00    1.00    1.00    0.01
range-1 0.95    1.00    0.98    0.99    0.02
range-2 1.00    1.07    1.02    1.00    0.03
writes  0.99    1.04    1.01    1.02    0.01

Friday, February 16, 2024

Perf regressions in MySQL from 5.6.21 to 8.0.36 using sysbench and a small server

This has results for sysbench vs upstream MySQL on a small server. I have results for some 5.6, 5.7 and 8.0 releases up to 8.0.36. My last report stopped at 8.0.34. The goal is to document where things get faster or slower over time for a low-concurrency and CPU-bound workload. The focus is on CPU regressions. 

My results here aren't universal. 

  • The microbenchmarks here mostly measure CPU overheads
  • Things won't look the same with an IO-bound workload. If nothing else that will make many of the CPU regressions less significant.
  • Things won't look the same with a workload that has more concurrency. While MySQL tends to get slower over time from more CPU overhead it also gets faster over time on concurrent workloads from improvements to synchronization code. Results from a few months ago on a larger server are here and the regressions are much smaller.
  • Things won't look the same with a workload that has complex queries. Most of the queries used by sysbench are simple and short running. This amplifies the impact of perf regressions in parse, semantic analysis and query optimization. 

tl;dr

  • Upstream MySQL would benefit from changepoint detection as provided by Nyrkiö.
  • MySQL 8.0 is the worst for perf regressions, while 5.7 and 5.6 are better at avoiding them. Also, there tend to be large regressions between the last point release in one major version and the first point release in the following major version, for instance from 5.6.51 to 5.7.10.
  • The scan_range=100 microbenchmark that does a full table scan has a large regression from 8.0.28 to 8.0.36 and bug 111538 is open for this
Comparing 8.0.36 with 5.6.21
  • For point queries, 8.0.36 gets 19% to 39% less QPS than 5.6.21
  • For range queries that don't do aggregation (part 1), 8.0.36 gets 29% to 39% less QPS than 5.6.21
  • For range queries that do aggregation, 8.0.36 gets 3% to 45% less QPS than 5.6.21. The difference depends on the length of the range scan, where shorter scan == larger regression.
  • Full scan (scan_range=100) has the largest regression (5.6.21 is ~2X faster than 8.0.36)
  • For most writes (ignoring the update-index microbenchmark), 8.0.36 gets about half of the throughput compared to 5.6.21
Builds

It isn't easy to build older code on newer systems, compilers, etc. Notes on that are here for 5.6, for 5.7 and for 8.0. A note on using cmake is here. The rel builds were used -- everything was compiled using CMAKE_BUILD_TYPE=Release.

Tests were done for:
  • 5.6 - 5.6.21, 5.6.31, 5.6.41, 5.6.51
  • 5.7 - 5.7.10, 5.7.20, 5.7.30, 5.7.44
  • 8.0 - 8.0.13, 8.0.14, 8.0.20, 8.0.28, 8.0.35, 8.0.36
I used the cz10a_bee config and it is here for 5.6, for 5.7 and for 8.0. For 8.0 releases older than 8.0.19 I changed innodb_idle_flush_pct=1 to loose_innodb_idle_flush_pct=1.

Benchmarks

I used sysbench and my usage is explained here. There are 42 microbenchmarks and each tests ~1 type of SQL statement and is run for 1200 seconds.

Tests were run on a small server I have at home (see here). The server is an SER4 from Beelink with 8 cores, 16G of RAM and 1 m.2 storage device with XFS and Ubuntu 22.04. The test tables are cached by InnoDB.

The benchmark is run with:
  • one connection
  • 30M rows and a database cached by InnoDB
  • each microbenchmark runs for 1200 seconds
  • prepared statements were enabled
The command line was: bash r.sh 1 30000000 1200 1200 nvme0n1 1 1 1

Results

For the results below I split the microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. Unfortunately, I included the full scan microbenchmark (scan_range=100) in part 2 but it doesn't do aggregation. The spreadsheet with all data and charts is here and is easier to read.

All of the charts have relative throughput on the y-axis where that is (QPS for $me) / (QPS for $base), $me is a version (for example 5.7.20) and $base is the base version. The base version is specified below and one of 5.6.21, 5.7.10 and 8.0.13 depending on what I am comparing. The y-axis doesn't start at 0 to improve readability.

The legend under the x-axis truncates the names I use for the microbenchmarks and I don't know how to fix that other than sharing the link to the Google Sheet I used. Files I used to create the spreadsheets are here.

From 5.6.21 to 8.0.36

This section uses 5.6.21 as the base version and then compares that with 5.6.51, 5.7.10, 5.7.44, 8.0.13, 8.0.14, 8.0.20, 8.0.28, 8.0.35 and 8.0.36 to show how performance has changed from oldest tested (5.6.21) to newest tested (8.0.36).

  • The largest regressions might occur between the last point release in one major version and the first point release in the next major version.
  • For point queries, 8.0.36 gets 19% to 39% less QPS vs 5.6.21
  • For range queries that don't do aggregation (part 1), 8.0.36 gets 29% to 39% less QPS vs 5.6.21
  • For range queries that do aggregation, 8.0.36 gets 3% to 45% less QPS vs 5.6.21. The difference depends on the length of the range scan -- shorter scan == larger regression. And full scan (scan_range=100) has the largest regression.
  • For most writes (ignoring the update-index microbenchmark), 8.0.36 gets about half of the throughput compared to 5.6.21
Summary statistics for each of the benchmark groupings:

        min     max     avg     median  stdev
point-1 0.63    0.78    0.71    0.69    0.05
point-2 0.61    0.81    0.71    0.70    0.09
range-1 0.61    0.71    0.64    0.62    0.04
range-2 0.55    0.97    0.76    0.74    0.16
writes  0.44    1.08    0.63    0.56    0.19

MySQL 8.0: some point releases

This section uses 8.0.13 as the base version and then compares that with 8.0.14, 8.0.20, 8.0.28, 8.0.35 and 8.0.36 to show how performance has changed from 8.0.13 to 8.0.36.

There was a perf bug in 8.0.28 (bug 102037) from the optimizer for queries with large in-lists that explains the two results below in Point query, part 2 that are close to 0.40.
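
As a rough illustration of what "large in-lists" means here -- the table and column names below assume the stock sysbench schema and may not match the exact microbenchmark statements:

  -- hypothetical example assuming the standard sysbench table (sbtest1: id, k, c, pad)
  SELECT id, k, c, pad FROM sbtest1
  WHERE id IN (1, 17, 42, 99, 137, 256, 311, 498 /* ... many more values ... */);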

From MySQL 8.0.13 to 8.0.36
  • Point queries are ~5% slower in 8.0.36
  • Range queries without aggregation are between 6% and 15% slower in 8.0.36 and for a few microbenchmarks there is a big regression after 8.0.28 (possibly bug 111538)
  • Range queries with aggregation are mostly ~15% slower in 8.0.36
  • Full scan is ~32% slower in 8.0.36 with a big regression after 8.0.28 (possibly bug 111538)
  • Writes are ~20% slower in 8.0.36 with a big regression after 8.0.20
MySQL 5.7: some point releases

This section uses 5.7.10 as the base version and then compares that with 5.7.20, 5.7.30 and 5.7.44 to show how performance has changed from 5.7.10 to 5.7.44.

For most microbenchmarks the throughput in 5.7.44 is no more than 5% less than in 5.7.10. For two microbenchmarks (update-index and update-inlist) the throughput in 5.7.44 is larger than in 5.7.10.
MySQL 5.6: some point releases

This section uses 5.6.21 as the base version and then compares that with 5.6.31, 5.6.41 and 5.6.51 to show how performance has changed from 5.6.21 to 5.6.51.

For most microbenchmarks the throughput in 5.6.51 is no more than 5% less than in 5.6.21. The largest regression is ~10% from full scan (scan_range=100) and 5.6.51 is faster than 5.6.21 for the update-inlist microbenchmark.

Monday, February 12, 2024

It wasn't a performance regression in Postgres 14

With help from a Postgres expert (Peter Geoghegan) I was able to confirm there wasn't a performance regression for Postgres 14 in a few of the benchmark steps with the Insert Benchmark as I started to report on in a previous blog post. The results here are from a small server for both cached and IO-bound workloads and replace my previous blog posts (cached, IO-bound).

The reason for the false alarm is that index cleanup was skipped during vacuum starting with Postgres 14 and the impact is that the optimizer had more work to do (more not-visible index entries to skip) in the get_actual_variable_range function. Output like this from the vacuum command makes that obvious:

table "pi1": index scan bypassed: 48976 pages from table (0.62% of total) have 5000000 dead item identifiers

The problem is solved by adding INDEX_CLEANUP ON to the vacuum command.
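
For example, using the table name from the vacuum output above (VERBOSE is optional and just reports what cleanup did):

  -- pi1 is the table name from the vacuum output above
  VACUUM (INDEX_CLEANUP ON, VERBOSE) pi1;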

tl;dr

  • All results here are from a low-concurrency workload on a small server (1 client, <= 3 connections, 8 cores). Results from a bigger server are pending. 
  • For cached workloads throughput improves a lot on all benchmark steps except point query (qp*), which from Postgres 9.0 to 16 is slightly better on the SER4 server and stable to slightly slower on the SER7 server.
  • For IO-bound workloads on the SER4 server from Postgres 9.0 through 16 the results are similar to the cached workload -- things improve a lot for all benchmark steps except point query (qp*).
  • For IO-bound workloads on the SER7 server from Postgres 12 through 16 the throughput for range and point queries (qr*, qp*) are stable while for write-heavy there are some regressions. Also there is a ~5% increase in CPU/operation from Postgres 12 through 16 on the random-write benchmark steps (l.i1, l.i2).
  • For IO-bound workloads with the SER4 server, many of the Postgres versions were unable to sustain the target write rates during the qr1000 and qp1000 benchmark steps (1000 inserts/s + 1000 deletes/s). This was not an issue on the SER7 server, which has a faster CPU and a better RAM/data ratio.
  • The delete/s rate is between 5X and 20X larger for the l.i1 benchmark step vs the l.i2 step. The issue is that l.i1 deletes 10X more rows/statement and the impact of the optimizer CPU overhead is much worse during l.i2 -- see my comments below about the weird workload.
Update - I have claimed that InnoDB and MyRocks don't have this problem and that is more truthy than true because they had a problem from MVCC GC getting behind, but the problem shows up on a SELECT statement while Postgres has the problem with a DELETE statement. See here for details.

Editorial

At a high-level there are several issues:

  1. The workload that triggers the issue is weird (uncommon)
  2. With MVCC GC in Postgres, garbage can remain in indexes for a while
  3. The Postgres optimizer can use too much CPU time in get_actual_variable_range to help figure out index selectivity even when there is only one good index for a SQL statement
First, the workload that triggers the issue is weird. I hope to rewrite the Insert Benchmark later this year to be less weird. The weirdness is that last year I enhanced the Insert Benchmark to optionally delete from tables at the same rate as inserts to keep the tables from growing too big. A too-big table might make my tests fail when a disk is full, and it means I can't run the cached (in-memory) variant of the test for longer periods of time. But the problem is that the enhancement added statements like DELETE FROM foo WHERE pk_column > $a AND pk_column < $b. The constants $a and $b are usually in, or even less than, the histogram bucket with the smallest value for the column, which means that get_actual_variable_range then tries to read from the index to determine the current minimum value for that index.
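
A minimal sketch of the statement shape (the constants are hypothetical and foo / pk_column are just the placeholder names used above):

  -- the deleted range sits at or below the smallest values in the PK index, so the
  -- get_actual_variable_range probe has to skip the dead index entries left behind
  -- by earlier deletes before it finds the current minimum
  DELETE FROM foo WHERE pk_column > 1000 AND pk_column < 2000;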

Second, with MVCC GC in Postgres, garbage can remain in indexes for longer than I want it to.
  • MVCC GC in InnoDB is called purge and is running all of the time -- although it can be slowed by read IO latency and temporarily halted when there is a long-open snapshot. But most of the time InnoDB will clean up (remove non-visible index entries) soon after transaction commit.
  • Cleanup with Postgres has more lag. It can be done by vacuum with a lag of many hours. It can also be done by simple index deletion but only on page splits and my workload doesn't trigger page splits on DELETE. Perhaps the existing code can be updated to also trigger simple index deletion when an index leaf page is mostly or all non-visible entries. I risk writing nonsense about Postgres in this regard, and for better information see this video from Peter and this document page.

Third, the Postgres optimizer can use too much CPU time in get_actual_variable_range. I don't mind that get_actual_variable_range exists because it is useful for cases where index statistics are not current. But the problem is that for the problematic SQL statement (see the DELETE above and this blog post) there is only one good index for the statement. So I prefer the optimizer not do too much work in that case. I have experienced this problem a few times with MySQL. One of the fixes from upstream MySQL was to change the optimizer to do less work when there was a FORCE INDEX hint. And with some OLTP workloads where the same statements are so frequent I really don't want the optimizer to use extra CPU time. For the same reason, I get much better throughput from Postgres when prepared statements are enabled and now I always enable them for the range and point queries with Postgres during the insert benchmark, but not for MySQL (because they don't help much with MySQL).
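
As a minimal sketch of what that looks like at the SQL level (not the benchmark's actual statements; it reuses the hypothetical foo / pk_column names from above) -- the statement is parsed and planned by PREPARE, and later EXECUTE calls can skip most of that per-statement CPU overhead:

  -- hypothetical statement name and schema; parse/plan work is done once here
  PREPARE get_row (bigint) AS SELECT * FROM foo WHERE pk_column = $1;
  -- later executions reuse the prepared statement
  EXECUTE get_row(12345);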

Build + Configuration

See the previous report for more details. In all but one case (IO-bound on SER7) I tested these versions: 9.0.23, 9.1.24, 9.2.24, 9.3.25, 9.4.26, 9.5.25, 9.6.24, 10.23, 11.22, 12.17, 13.13, 14.10, 15.5, 16.1. For IO-bound on SER7 tests are from Postgres 12 through 16.

The configuration files are in subdirectories from here. Search for files named conf.diff.cx9a2_bee and conf.diff.cx9a2_ser7 which exist for each major version of Postgres.

The Benchmark

The benchmark is run with one client.

There are two test servers and the SER7 has a faster CPU. More info on the servers is here:
  • Beelink SER4 with 8 AMD cores, 16G RAM, Ubuntu 22.04 and XFS using 1 m.2 device
  • Beelink SER7 with 8 AMD cores, 32G RAM, Ubuntu 22.04 and XFS using 1 m.2 device
The benchmark steps are:

  • l.i0
    • insert X rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. With a cached workload the value of X is 30M for SER4 and 60M for SER7. With an IO-bound workload X is 800M for both because I forgot to make it larger for SER7.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts X rows and the other deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. With a cached workload the value of X is 40M. With an IO-bound workload X is 4M.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and inserts X rows total. With a cached workload the value of X is 10M. With an IO-bound workload X is 1M.
    • Vacuum the test table, do a checkpoint and wait ~Y seconds to reduce variance during the read-write benchmark steps that follow. The value of Y is based on the size of the table.
  • qr100
    • use 3 connections/client. One does range queries for Z seconds and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for a fixed amount of time. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. For cached workloads the value of Z was 1800. For IO-bound on SER4 it was 1800 and on SER7 it was 7200.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results

The performance reports are here for SER4 and SER7, for both cached and IO-bound workloads.

The summary has 3 tables. The first shows absolute throughput by DBMS version tested and benchmark step. The second has throughput relative to the version on the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and whether all systems sustained the target rates. The second table makes it easy to see how performance changes over time.

Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

From the summary for SER4 with a cached workload
  • The base case is pg9023_def which means Postgres 9.0.23
  • For the benchmark steps
    • l.i0 - improves in Postgres 9.4 and 11.22 and then is stable
    • l.x - improves in Postgres 9.6 and then is stable
    • l.i1, l.i2 - improves in Postgres 12 through 14
    • qr100, qr500, qr1000 - slow but steady improvements from Postgres 9.2 through 16
    • qp100, qp500 - slow but steady improvements from Postgres 9.2 through 16
    • qp1000 - stable from Postgres 9 through 16. Perhaps this is most affected by the CPU overhead from get_actual_variable_range.
  • Comparing throughput in Postgres 16.2 to 9.0.23
    • Write-heavy
      • l.i0, l.x, l.i1, l.i2 - relative QPS is 1.22, 1.79, 3.38, 2.36
    • Range queries
      • qr100, qr500, qr1000 - relative QPS is 1.20, 1.19, 1.23
    • Point queries
      • qp100, qp500, qp1000 - relative QPS is 1.08, 1.10, 1.01
From the summary for SER7 with a cached workload
  • The base case is pg9023_def which means Postgres 9.0.23
  • For the benchmark steps
    • l.i0 - improves in Postgres 9.4 and 11.22 and then is stable
    • l.x - improves in Postgres 9.6 and 10 and then is stable
    • l.i1, l.i2 - improves in Postgres 12 through 14
    • qr100, qr500, qr1000 - improves in Postgres 9.2 and then is stable or slowly improving
    • qp100, qp500, qp1000 - improves in Postgres 9.2 through 9.5 and then slowly gets worse
  • Comparing throughput in Postgres 16.2 to 9.0.23
    • Write-heavy
      • l.i0, l.x, l.i1, l.i2 - relative QPS is 1.34, 1.65, 3.12, 2.42
    • Range queries
      • qr100, qr500, qr1000 - relative QPS is 1.44, 1.62, 1.62
    • Point queries
      • qp100, qp500, qp1000 - relative QPS is 1.01, 0.92, 0.99
From the summary for SER4 with an IO-bound workload
  • The base case is pg9023_def which means Postgres 9.0.23
  • For the benchmark steps
    • l.i0 - improves in Postgres 11.22 and then is stable
    • l.x - improves in Postgres 9.4 through 10 and then is stable
    • l.i1, l.i2 - improves in Postgres 12 and then is stable
    • qr100, qr500, qr1000 - slowly improves from Postgres 9.2 through 11 and then is stable
    • qp100, qp500, qp1000 - same as qr100, qr500, qr1000
  • Comparing throughput in Postgres 16.2 to 9.0.23
    • Write-heavy
      • l.i0, l.x, l.i1, l.i2 - relative QPS is 1.21, 2.29, 1.85, 1.85
    • Range queries
      • qr100, qr500, qr1000 - relative QPS is 1.15, 1.23, 1.36
    • Point queries
      • qp100, qp500, qp1000 - relative QPS is 0.99, 1.00, 0.98
From the summary for SER7 with an IO-bound workload
  • I only have results from Postgres 12 through 16
  • The read-write benchmark steps were run for 7200s vs 1800s above
  • Looking at write rates over time for the l.i2 benchmark step, where writes are inserts/s and deletes/s, the rates are ~195/s at the start of the benchmark step and ~155/s at the end. I assume the issue is there is more garbage (non-visible index entries) in the PK index over time so there is more CPU overhead from get_actual_variable_range having to read and skip them while figuring out the minimum visible value in the index during DELETE statements.
  • The base case is pg1217_def which means Postgres 12.17. The improvements shown here don't match the results above for IO-bound on the SER4 server because that comparison uses an older (9.0) base case
  • For the benchmark steps
    • l.i0 - throughput is stable
    • l.x - throughput slowly improves from Postgres 13 through 16
    • l.i1, l.i2 - with some variance, throughput gets worse from Postgres 12 through 16. From vmstat results normalized by write rates I see a 4% to 7% increase in CPU/operation on SER7. If I limit myself to Postgres 12.17 through 16 then I also see a 5% to 8% increase on SER4.
    • qr100, qr500, qr1000 - throughput is stable
    • qp100, qp500, qp1000 - throughput is stable
  • Comparing throughput in Postgres 16.2 to 12.17
    • Write-heavy
      • l.i0, l.x, l.i1, l.i2 - relative QPS is 1.00, 1.14, 0.93, 0.88
    • Range queries
      • qr100, qr500, qr1000 - relative QPS is 1.02, 1.00, 1.00
    • Point queries
      • qp100, qp500, qp1000 - relative QPS is 0.98, 0.98, 0.98
Target write rates

The third table in the summaries linked above shows the write rates sustained during the read-write benchmark steps. The target write rates are 100/s for qr100 and qp100, 500/s for qr500 and qp500 and then 1000/s for qr1000 and qp1000. Note that X/s means X inserts/s and X deletes/s. When the value is close enough to the target then I assume the target has been sustained. The table cells in red indicate the cases where the target has not been sustained.

For cached workloads all versions sustained the target write rates.

For IO-bound workloads
  •  Note that SER4 and SER7 had the same amount of data, but SER7 has twice as much RAM so it was less IO-bound. And SER7 has a faster CPU.
  • With SER4
    • Postgres 9.x, 10 and 11 did not sustain the target write rates during qr1000 and qp1000
    • Postgres 13, 14, 15.5 and 16 did not sustain the target write rates during qp1000 but they were close to the target
  • With SER7
    • All versions sustained the target write rates with SER7.