Saturday, May 27, 2023

Postgres 16beta1 looks good vs sysbench

This has results for Postgres 16 beta1 using sysbench on a small server with low-concurrency and a cached database. The goal is to determine whether there are CPU regressions from Postgres 15.3 to 16-beta1.

tl;dr

  • There were no regressions
  • A few queries are ~1.7X faster in PG 16-beta1
Benchmark

A description of how I run sysbench is here. The sysbench microbenchmarks were run for 20 clients, 300 (600) seconds per read (write) microbenchmark using 1 tables with 20M rows per table. The test database was <= 6G and fits in Postgres buffer pool.

The test server is a Beelink SER 4700u with 8 cores, 16G of RAM, 1T of NVMe SSD with XFS running Ubuntu 22.04. I compiled Postgres from source using gcc 11.3.

The config file, conf.diff.cx8_bee, is here for Postgres 15.3 and 16-beta1.

The previous post explains the builds that I used where each build uses different compiler optimizations. I used the def, o2_nofp and o3_native_lto builds and ran the benchmark 6 times -- once per build for each of Postgres 15.3 and 16-beta1.

I use sysbench to run 42 microbenchmarks and each microbenchmark is put in one of three groups based on the dominant operation: point query, range query, writes.

Results

The table below lists the relative throughput which is: (QPS for PG 16-beta1 / QPS for PG 15.3). When the relative throughput is less than 1 then PG 16-beta1 is slower than PG 15.3 and there might be a CPU regression. 

With a few exceptions the relative throughput is between 0.97 and 1.04. I consider values between 0.97 and 1.0 to be noise rather than a regression so I will declare there are no regressions. The exceptions are:
  • point-query.pre_range=100 which has a noisy relative throughput of 0.89, 0.97 and 1.07. This benchmark step runs early in the benchmark and I assume this is just noise.
  • scan_range=100 which has a noisy relative throughput of 1.00, 0.92, 0.98. I have yet to explain it but the scan benchmark step seems to have more noise so I ignore this.
  • read-only.pre_range=10000 which has a relative throughput of 1.60, 1.46, 1.63
  • read-only.range=10000 which has a a relative throughput of 1.50, 1.39, 1.50
The results show that PG 16-beta1 does ~1.5X better for the read-only.*range=10000 benchmark step which is a big improvement. That is discussed in the Follow Up section below.
 
Legend:
* def - uses the def build
* o2 - uses the o2_nofp build
* o3 - uses the o3_native_lto build

------ build ------     --- benchmark step ---
def     o2      o3
1.00    0.97    1.01    hot-points_range=100
0.89    0.97    1.07    point-query.pre_range=100
1.00    1.02    1.01    point-query_range=100
1.00    1.03    1.03    points-covered-pk.pre_range=100
0.99    1.04    1.03    points-covered-pk_range=100
1.00    1.03    1.04    points-covered-si.pre_range=100
0.99    1.01    1.03    points-covered-si_range=100
1.00    1.00    1.01    points-notcovered-pk.pre_range=100
1.00    1.00    1.02    points-notcovered-pk_range=100
1.00    1.00    1.02    points-notcovered-si.pre_range=100
1.00    1.00    1.01    points-notcovered-si_range=100
0.99    0.99    1.00    random-points.pre_range=1000
1.00    1.00    0.99    random-points.pre_range=100
1.01    0.99    1.02    random-points.pre_range=10
0.99    1.00    1.01    random-points_range=1000
1.00    1.00    1.01    random-points_range=100
1.02    0.99    1.02    random-points_range=10
0.99    0.98    1.01    range-covered-pk.pre_range=100
1.00    0.99    1.01    range-covered-pk_range=100
0.98    0.98    1.01    range-covered-si.pre_range=100
1.00    0.97    1.01    range-covered-si_range=100
0.97    0.98    0.99    range-notcovered-pk.pre_range=100
0.98    0.97    1.00    range-notcovered-pk_range=100
0.98    1.00    1.00    range-notcovered-si.pre_range=100
0.98    0.99    1.00    range-notcovered-si_range=100
1.60    1.46    1.63    read-only.pre_range=10000
1.00    0.99    1.04    read-only.pre_range=100
1.01    0.98    0.99    read-only.pre_range=10
1.50    1.39    1.50    read-only_range=10000
1.01    0.99    1.02    read-only_range=100
1.00    1.00    0.97    read-only_range=10
1.00    0.92    0.98    scan_range=100
1.02    1.01    1.04    delete_range=100
1.01    1.02    0.99    insert_range=100
1.02    1.00    1.03    read-write_range=100
1.00    0.98    0.98    read-write_range=10
1.00    0.99    0.98    update-index_range=100
1.00    0.98    1.00    update-inlist_range=100
1.01    1.00    0.98    update-nonindex_range=100
1.01    0.97    0.98    update-one_range=100
1.02    0.99    0.97    update-zipf_range=100
1.01    1.02    1.00    write-only_range=10000

Follow up

The read-only.pre_range=10000 and read-only.range=10000 benchmark steps run a read-only transaction that has 5 types of queries. They use the same queries but read-only.pre_range=10000 runs before a large number of write-heavy benchmark steps while read-only.range=10000 runs after them. The range=10000 in the name indicates that the range queries in the read-only transaction each scan 10,000 rows.

The Lua for the benchmark step is oltp_read_only.lua which relies on oltp_common.lua. The loop that does the queries is here

There are five types of queries in the read-only*range=10000 benchmark step. The first (q1) is run 10 times per read-only transaction while the others are run once. For q2, q3, q4, q5 the value of N is 10,000. The read-only transaction is run in a loop and sysbench reports the number of transactions/second.
  • q1
    • fetch one row by PK
    • SELECT c FROM sbtest WHERE id=?
  • q2
    • scan N rows via PK index, return 1 column/row
    • SELECT c FROM sbtest WHERE id BETWEEN ? AND ?
  • q3
    • scan N rows via an index, return sum of one column
    • SELECT SUM(k) FROM sbtest WHERE id BETWEEN ? AND ?"
  • q4
    • like q2 but sorts the result
    • SELECT c FROM sbtest WHERE id BETWEEN ? AND ? ORDER BY c
  • q5
    • like q4 but adds DISINCT
    • SELECT DISTINCT c FROM sbtest WHERE id BETWEEN ? AND ? ORDER BY c
To explain the perf improvement from PG 15.3 to 16-beta1 I tried the following:
  • Looked at query plans - they did not change. Plans are here.
  • Ran each of the queries separately
    • performance for q1, q2 and q3 did not change
    • q4 is ~1.7X faster in PG 16-beta1
    • q5 is ~1.5X faster in PG 16-beta1
  • Started to look at CPU profiles from the perf tool. Unfortunately, my work on this was incomplete so I might revisit this in a few days and for now will mumble something about PG 16 using less CPU thanks to ICU and less time doing sorts.




























No comments:

Post a Comment

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...