Monday, December 29, 2025

IO-bound sysbench vs Postgres on a 48-core server

This post has results for an IO-bound sysbench benchmark on a 48-core server for Postgres versions 12 through 18. Results from a CPU-bound sysbench benchmark on the 48-core server are here.

tl;dr - for Postgres 18.1 relative to 12.22

  • QPS for the IO-bound point-query tests in 18.1 is similar to 12.22, while there is a large improvement for the one CPU-bound test (hot-points)
  • QPS for range queries without aggregation is similar
  • QPS for range queries with aggregation is between 1.05X and 1.25X larger in 18.1
  • QPS for the write-heavy tests suggests there might be a few large regressions in 18.1

tl;dr - for Postgres 18.1 using different values for the io_method option

  • for tests that do long range queries without aggregation
    • the best QPS is from io_method=io_uring
    • the second best QPS is from io_method=worker with a large value for io_workers
  • for tests that do long range queries with aggregation
    • when using io_method=worker, a larger value for io_workers hurts QPS, in contrast to the result for range queries without aggregation
    • for most tests the best QPS is from io_method=io_uring

Builds, configuration and hardware

I compiled Postgres from source for versions 12.22, 13.23, 14.20, 15.15, 16.10, 16.11, 17.6, 17.7, 18.0 and 18.1.

I used a 48-core server from Hetzner
  • an AX162-S with an AMD EPYC 9454P 48-Core Processor with SMT disabled
  • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
  • 128G RAM
  • Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic)
Configuration files for the 48-core server are here.

Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post and most of them test only one type of SQL statement. Benchmarks are run with a database that is larger than memory, so the workload is IO-bound.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. The benchmark is run with 40 clients and 8 tables with 250M rows per table. With 250M rows per table this is IO-bound. I normally use 10M rows per table for CPU-bound workloads.
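
To see why 250M rows per table makes this IO-bound, here is a back-of-the-envelope sketch in Python. The bytes-per-row figure is my assumption for illustration, not a measured value:

# Rough check that 8 tables x 250M rows exceeds 128G of RAM. The real
# footprint depends on the sysbench schema, indexes and Postgres overheads.
TABLES = 8
ROWS_PER_TABLE = 250_000_000
BYTES_PER_ROW = 200  # assumed: row data + index entries + per-tuple overhead
RAM_GB = 128

total_gb = TABLES * ROWS_PER_TABLE * BYTES_PER_ROW / 1024**3
print(f"approx database size: {total_gb:.0f} GB vs {RAM_GB} GB RAM")
# -> approx database size: 373 GB vs 128 GB RAM, so the database cannot be
#    cached and the benchmark is IO-bound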

The purpose is to search for regressions from new CPU overhead and mutex contention. I use small servers with low concurrency to find regressions from new CPU overhead, and larger servers with high concurrency, like the one here, to also find regressions from mutex contention.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries and 1 for writes. For the range query microbenchmarks, part 1 has queries without aggregation while part 2 has queries with aggregation.

I provide charts below with relative QPS, computed as:
(QPS for some version) / (QPS for the base version)
When the relative QPS is > 1 the version is faster than the base version. When it is < 1 there might be a regression. When the relative QPS is 1.2 the version is about 20% faster than the base version.
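
As a concrete sketch of that computation, with QPS values that are invented for illustration only:

# Relative QPS: (QPS for some version) / (QPS for the base version).
# The QPS values below are invented for illustration only.
qps = {"12.22": 1000.0, "17.7": 980.0, "18.1": 1200.0}
base = "12.22"

for version, value in qps.items():
    rel = value / qps[base]
    print(f"{version}: relative QPS = {rel:.2f}")
# 18.1 -> 1.20, about 20% faster than the base version; values < 1.00
# suggest a possible regression.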

I provide two comparisons and each uses a different base version. They are:
  • base version is Postgres 12.22
    • compare 12.22, 13.23, 14.20, 15.15, 16.11, 17.7 and 18.1
    • the goal for this is to see how performance changes over time
    • per-test results from vmstat and iostat are here
  • base version is Postgres 18.1
    • compare 18.1 using the x10b_c32r128, x10c_c32r128, x10cw8_c32r128, x10cw16_c32r128, x10cw32_c32r128 and x10d_c32r128 configs
    • the goal for this is to understand the impact of the io_method option
    • per-test results from vmstat and iostat are here
The per-test results from vmstat and iostat can help to explain why something is faster or slower because they show how much hardware is used per operation, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o), which are often a proxy for mutex contention.
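
A minimal sketch of how such per-operation metrics can be derived, assuming the per-second rates from vmstat and iostat are averaged over the run. This is my reconstruction, not the exact scripts used here:

# Convert per-second rates from vmstat and iostat into per-operation costs
# by dividing by QPS. Column names follow the conventions used in this post.
def per_operation_metrics(qps, cpu_secs_per_sec, cs_per_sec,
                          reads_per_sec, read_kb_per_sec):
    return {
        "cpu/o": cpu_secs_per_sec / qps,  # CPU time per operation
        "cs/o": cs_per_sec / qps,         # context switches per operation,
                                          # often a proxy for mutex contention
        "r/o": reads_per_sec / qps,       # storage reads per operation
        "rKB/o": read_kb_per_sec / qps,   # KB read from storage per operation
    }

# Example with invented numbers:
print(per_operation_metrics(qps=5000, cpu_secs_per_sec=20, cs_per_sec=80000,
                            reads_per_sec=15000, read_kb_per_sec=120000))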

The spreadsheet and charts are here and in some cases are easier to read than the charts below. Converting the Google Sheets charts to PNG files mangles some of the test names listed at the bottom of the charts below.

Results: Postgres 12.22 through 18.1

All charts except the first have the y-axis start at 0.7 rather than 0.0 to improve readability.

There are two charts for point queries. The second truncates the y-axis to improve readability.
  • a large improvement for the hot-points test arrives in 17.x. While most tests are IO-bound, this test is CPU-bound because all queries fetch the same N rows.
  • for other tests there are small changes, both improvements and regressions, and the regressions are too small to investigate
For range queries without aggregation:
  • QPS for Postgres 18.1 is within 5% of 12.22, sometimes better and sometimes worse
  • for Postgres 17.7 there might be a large regression on the scan test, and that also occurs with 17.6 (not shown). But the scan test can be prone to variance, especially with Postgres, and I don't expect to spend time debugging this. Note that the config I use for 18.1 here uses io_method=sync, which is similar to what Postgres does in releases prior to 18.x. From the vmstat and iostat metrics, what I see is:
    • a small reduction in CPU overhead (cpu/o) in 18.1
    • a large reduction in the context switch rate (cs/o) in 18.1
    • small reductions in read IO (r/o and rKB/o) in 18.1
For range queries with aggregation:
  • QPS for 18.1 is between 1.05X and 1.25X better than for 12.22
For write-heavy tests
  • there might be large regressions for several tests: read-write, update-zipf and write-only. The read-write tests do all of the writes done by write-only and then add read-only statements.
  • from the vmstat and iostat results for the read-write tests I see
    • CPU (cpu/o) is up by ~1.2X in PG 16.x through 18.x
    • storage reads per query (r/o) have been increasing from PG 16.x through 18.x and are up by ~1.1X in PG 18.1
    • the increase in storage KB read per query (rKB/o) started in PG 16.x and rKB/o is between 1.16X and 1.44X larger in PG 18.x
  • from the vmstat and iostat results for the update-zipf test
    • results are similar to the read-write tests above
  • from the vmstat and iostat results for the write-only test
    • results are similar to the read-write tests above
Results: Postgres 18.1 and io_method

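As a reminder of what these configs change, below is a hypothetical sketch of the relevant postgresql.conf settings. The mapping of config names to settings follows the descriptions in this post, and the io_workers values for the cw* configs are inferred from the config name suffixes:

# x10b_c32r128    -> io_method = sync     (like releases prior to 18.x)
# x10c_c32r128    -> io_method = worker   with io_workers = 3 (the default)
# x10cw8_c32r128  -> io_method = worker   with io_workers = 8
# x10cw16_c32r128 -> io_method = worker   with io_workers = 16
# x10cw32_c32r128 -> io_method = worker   with io_workers = 32
# x10d_c32r128    -> io_method = io_uring
io_method = worker
io_workers = 16
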
For point queries
  • results are similar for all configurations and this is expected, likely because point queries do single-block reads that don't benefit from the new AIO code paths
For range queries without aggregation
  • there are two charts, the y-axis is truncated in the second to improve readability
  • all configs get similar QPS for all tests except scan
  • for the scan test
    • the x10c_c32r128 config has the worst result. This is expected given there are 40 concurrent connections and it uses the default for io_workers (=3)
    • QPS improves for io_method=worker with larger values for io_workers
    • io_method=io_uring has the best QPS (the x10d_c32r128 config)
For range queries with aggregation
  • when using io_method=worker, a larger value for io_workers hurts QPS, in contrast to the result for range queries without aggregation
  • io_method=io_uring gets the best QPS on all tests except for the read-only tests with range=10 and 10,000. There isn't an obvious problem based on the vmstat and iostat results. From the r_await column in iostat output (not shown) the differences are mostly explained by a change in IO latency. Perhaps variance in storage latency is the issue.
For writes
  • the best QPS occurs with the x10b_c32r128 config (io_method=sync). I am not sure if that option matters here and perhaps there is too much noise in the results.
