This has results for Postgres 16 beta1 using sysbench on a small server with low-concurrency and a cached database. The goal is to determine whether there are CPU regressions from Postgres 15.3 to 16-beta1.
tl;dr
- There were no regressions
- A few queries are ~1.7X faster in PG 16-beta1
Benchmark
A description of how I run sysbench is here. The sysbench microbenchmarks were run for 20 clients, 300 (600) seconds per read (write) microbenchmark using 1 tables with 20M rows per table. The test database was <= 6G and fits in Postgres buffer pool.
The test server is a Beelink SER 4700u with 8 cores, 16G of RAM, 1T of NVMe SSD with XFS running Ubuntu 22.04. I compiled Postgres from source using gcc 11.3.
The config file, conf.diff.cx8_bee, is here for Postgres 15.3 and 16-beta1.
The previous post explains the builds that I used where each build uses different compiler optimizations. I used the def, o2_nofp and o3_native_lto builds and ran the benchmark 6 times -- once per build for each of Postgres 15.3 and 16-beta1.
I use sysbench to run 42 microbenchmarks and each microbenchmark is put in one of three groups based on the dominant operation: point query, range query, writes.
Results
The table below lists the relative throughput which is: (QPS for PG 16-beta1 / QPS for PG 15.3). When the relative throughput is less than 1 then PG 16-beta1 is slower than PG 15.3 and there might be a CPU regression.
With a few exceptions the relative throughput is between 0.97 and 1.04. I consider values between 0.97 and 1.0 to be noise rather than a regression so I will declare there are no regressions. The exceptions are:
- point-query.pre_range=100 which has a noisy relative throughput of 0.89, 0.97 and 1.07. This benchmark step runs early in the benchmark and I assume this is just noise.
- scan_range=100 which has a noisy relative throughput of 1.00, 0.92, 0.98. I have yet to explain it but the scan benchmark step seems to have more noise so I ignore this.
- read-only.pre_range=10000 which has a relative throughput of 1.60, 1.46, 1.63
- read-only.range=10000 which has a a relative throughput of 1.50, 1.39, 1.50
The results show that PG 16-beta1 does ~1.5X better for the read-only.*range=10000 benchmark step which is a big improvement. That is discussed in the Follow Up section below.
Legend:
* def - uses the def build
* o2 - uses the o2_nofp build
* o3 - uses the o3_native_lto build
------ build ------ --- benchmark step ---
def o2 o3
1.00 0.97 1.01 hot-points_range=100
0.89 0.97 1.07 point-query.pre_range=100
1.00 1.02 1.01 point-query_range=100
1.00 1.03 1.03 points-covered-pk.pre_range=100
0.99 1.04 1.03 points-covered-pk_range=100
1.00 1.03 1.04 points-covered-si.pre_range=100
0.99 1.01 1.03 points-covered-si_range=100
1.00 1.00 1.01 points-notcovered-pk.pre_range=100
1.00 1.00 1.02 points-notcovered-pk_range=100
1.00 1.00 1.02 points-notcovered-si.pre_range=100
1.00 1.00 1.01 points-notcovered-si_range=100
0.99 0.99 1.00 random-points.pre_range=1000
1.00 1.00 0.99 random-points.pre_range=100
1.01 0.99 1.02 random-points.pre_range=10
0.99 1.00 1.01 random-points_range=1000
1.00 1.00 1.01 random-points_range=100
1.02 0.99 1.02 random-points_range=10
0.99 0.98 1.01 range-covered-pk.pre_range=100
1.00 0.99 1.01 range-covered-pk_range=100
0.98 0.98 1.01 range-covered-si.pre_range=100
1.00 0.97 1.01 range-covered-si_range=100
0.97 0.98 0.99 range-notcovered-pk.pre_range=100
0.98 0.97 1.00 range-notcovered-pk_range=100
0.98 1.00 1.00 range-notcovered-si.pre_range=100
0.98 0.99 1.00 range-notcovered-si_range=100
1.60 1.46 1.63 read-only.pre_range=10000
1.00 0.99 1.04 read-only.pre_range=100
1.01 0.98 0.99 read-only.pre_range=10
1.50 1.39 1.50 read-only_range=10000
1.01 0.99 1.02 read-only_range=100
1.00 1.00 0.97 read-only_range=10
1.00 0.92 0.98 scan_range=100
1.02 1.01 1.04 delete_range=100
1.01 1.02 0.99 insert_range=100
1.02 1.00 1.03 read-write_range=100
1.00 0.98 0.98 read-write_range=10
1.00 0.99 0.98 update-index_range=100
1.00 0.98 1.00 update-inlist_range=100
1.01 1.00 0.98 update-nonindex_range=100
1.01 0.97 0.98 update-one_range=100
1.02 0.99 0.97 update-zipf_range=100
1.01 1.02 1.00 write-only_range=10000
Follow up
The read-only.pre_range=10000 and read-only.range=10000 benchmark steps run a read-only transaction that has 5 types of queries. They use the same queries but read-only.pre_range=10000 runs before a large number of write-heavy benchmark steps while read-only.range=10000 runs after them. The range=10000 in the name indicates that the range queries in the read-only transaction each scan 10,000 rows.
The Lua for the benchmark step is oltp_read_only.lua which relies on oltp_common.lua. The loop that does the queries is here.
There are five types of queries in the read-only*range=10000 benchmark step. The first (q1) is run 10 times per read-only transaction while the others are run once. For q2, q3, q4, q5 the value of N is 10,000. The read-only transaction is run in a loop and sysbench reports the number of transactions/second.
- q1
- fetch one row by PK
- SELECT c FROM sbtest WHERE id=?
- q2
- scan N rows via PK index, return 1 column/row
- SELECT c FROM sbtest WHERE id BETWEEN ? AND ?
- q3
- scan N rows via an index, return sum of one column
- SELECT SUM(k) FROM sbtest WHERE id BETWEEN ? AND ?"
- q4
- like q2 but sorts the result
- SELECT c FROM sbtest WHERE id BETWEEN ? AND ? ORDER BY c
- q5
- like q4 but adds DISINCT
- SELECT DISTINCT c FROM sbtest WHERE id BETWEEN ? AND ? ORDER BY c
To explain the perf improvement from PG 15.3 to 16-beta1 I tried the following:
- Looked at query plans - they did not change. Plans are here.
- Ran each of the queries separately
- performance for q1, q2 and q3 did not change
- q4 is ~1.7X faster in PG 16-beta1
- q5 is ~1.5X faster in PG 16-beta1
- Started to look at CPU profiles from the perf tool. Unfortunately, my work on this was incomplete so I might revisit this in a few days and for now will mumble something about PG 16 using less CPU thanks to ICU and less time doing sorts.
No comments:
Post a Comment