Sunday, October 5, 2025

Measuring scaleup for Postgres 18.0 with sysbench

This post has results to measure scaleup for Postgres 18.0 on a 48-core server.

tl;dr

  • Postgres continues to be boring (in a good way)
  • Results are mostly excellent
  • A few of the range query tests have a scaleup that is less than great, but I need more time to debug that

Builds, Configuration & Hardware

The server has an AMD EPYC 9454P 48-Core Processor with AMD SMT disabled, 128G of RAM and SW RAID 0 with 2 NVMe devices. The OS is Ubuntu 22.04.

I compiled Postgres 18.0 from source and the configuration file is here.

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks 
and most test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres. Each microbenchmark is run for 300 seconds.

The benchmark is run with 1, 2, 4, 8, 12, 16, 20, 24, 32, 40 and 48 clients. The purpose is to determine how well Postgres scales up.
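
As a rough sketch of how such a sweep can be driven (this is not my actual helper script -- it uses the stock oltp_point_select Lua test, and the database name, table count and table size below are placeholders):

# Hypothetical driver for one microbenchmark across the client counts used here.
# Assumes sysbench 1.x with the PostgreSQL driver; the test name and settings
# are the stock ones, not the custom Lua scripts used for these results.
import subprocess

CLIENTS = [1, 2, 4, 8, 12, 16, 20, 24, 32, 40, 48]

for n in CLIENTS:
    cmd = [
        "sysbench", "oltp_point_select", "run",
        "--db-driver=pgsql",
        "--pgsql-db=sbtest",       # placeholder database name
        "--tables=8",              # placeholder table count
        "--table-size=1000000",    # placeholder table size
        "--time=300",              # 300 seconds per microbenchmark
        f"--threads={n}",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(f"{n} clients:\n{result.stdout}")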

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I still use relative QPS here, but in a different way. The relative QPS here is:
(QPS at X clients) / (QPS at 1 client)

The goal is to determine scaleup efficiency for Postgres. When the relative QPS at X clients is a value near X, then things are great. But sometimes things aren't great and the relative QPS is much less than X. One issue is data contention for some of the write-heavy microbenchmarks. Another issue is mutex and rw-lock contention.
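
To make that arithmetic concrete, here is a small sketch of the relative QPS calculation. The QPS numbers in it are invented for illustration -- only the formula matters:

# Relative QPS (scaleup) = (QPS at X clients) / (QPS at 1 client).
def relative_qps(qps_by_clients):
    base = qps_by_clients[1]
    return {c: q / base for c, q in sorted(qps_by_clients.items())}

# Invented numbers, not measured results.
qps = {1: 1000.0, 24: 23000.0, 48: 41000.0}
for clients, scaleup in relative_qps(qps).items():
    print(f"{clients:>2} clients: scaleup {scaleup:.2f} (perfect would be {clients})")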

Perf debugging via vmstat and iostat

I use normalized results from vmstat and iostat to help explain why things aren't as fast as expected. By normalized I mean I divide the average values from vmstat and iostat by QPS to see things like how much CPU is used per query or how many context switches occur per write. And note that a high context switch rate is often a sign of mutex contention.
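
As a sketch of what normalized means here -- the cs/o and cpu/o names match the columns in my result files, but the exact formulas in my scripts may differ and the numbers below are invented:

# Divide average vmstat counters over the measurement interval by QPS from the
# same interval to get per-query values.
def per_query(avg_cs_per_sec, avg_cpu_util, qps):
    return {
        "cs/o": avg_cs_per_sec / qps,    # context switches per query
        "cpu/o": avg_cpu_util / qps,     # CPU utilization per query
    }

print(per_query(avg_cs_per_sec=120000.0, avg_cpu_util=65.0, qps=40000.0))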

Those results are here but can be difficult to read.

Charts: point queries

The spreadsheet with all of the results is here.

While results aren't perfect, they are excellent. A perfect result would be a scaleup of 48 at 48 clients, and here the result is between 40 and 42 in most tests. The worst case is hot-points, where the scaleup is 32.57 at 48 clients. Note that the hot-points test has the most data contention of the point-query tests, as all queries fetch the same rows.

From the vmstat metrics (see here) I don't see an increase in mutex contention (which would show up as more context switches, see the cs/o column), but I do see an increase in CPU per query (cpu/o). Compared to a test with better scaleup, like points-covered-pk, that test also shows no increase in mutex contention and does show an increase in CPU overhead (cpu/o), but the CPU increase is smaller (see here).

Charts: range queries without aggregation

The spreadsheet with all of the results is here.

The results again are great, but not perfect. The worst case is range-notcovered-pk, where the scaleup is 32.92 at 48 clients. The best case is scan, where the scaleup is 46.56 at 48 clients.

From the vmstat metrics for range-notcovered-pk I don't see any obvious problems. The CPU overhead (cpu/o, CPU per query) increases by about 8% (a relative increase of 1.08) from 1 to 48 clients, while context switches per query (cs/o) decrease (see here).

Charts: range queries with aggregation

The spreadsheet with all of the results is here.

Results for range queries with aggregation are worse than for range queries without aggregation. I hope to explain that later. A perfect result is a scaleup equal to 48. Here, 3 of the 8 tests have a scaleup less than 30, 4 have a scaleup between 30 and 40, and the best case is read-only_range=10 with a scaleup of 43.35.

The worst case was read-only-count with a scaleup of 21.38. From the vmstat metrics I see that the CPU overhead (cpu/o, CPU per query) increases by 2.08x at 48 clients vs 1 client, while context switches per query (cs/o) decrease (see here). I am curious about that CPU increase, as it isn't as bad for the other range query tests -- for example, see here where it is no larger than 1.54. The query for read-only-count is here.

Later I hope to explain why read-only-count, read-only-simple and read-only-sum don't do better.

Charts: writes

The spreadsheet with all of the results is here.

The worst case is update-one, where the scaleup is 2.86 at 48 clients. The bad result is expected, as having many concurrent clients update the same row is an anti-pattern for Postgres. The scaleup for Postgres on that test is a lot worse than for MySQL, where it was ~8 with InnoDB. But I am not here for Postgres vs InnoDB arguments.
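
To illustrate the shape of that anti-pattern (this is not the exact SQL used by the update-one microbenchmark -- the connection string, table and column names below are placeholders), every client ends up doing something like the following, so the commits serialize on the row lock for one row:

# Illustrative only: many clients updating the same row serialize on the
# row-level lock, which limits scaleup no matter how many cores are available.
import threading
import psycopg2

DSN = "dbname=sbtest"  # placeholder connection string

def hot_row_updater(iters):
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    for _ in range(iters):
        # Every client targets the same row, so each commit waits for the
        # previous holder of the row lock to finish.
        cur.execute("UPDATE sbtest1 SET k = k + 1 WHERE id = 1")
        conn.commit()
    cur.close()
    conn.close()

threads = [threading.Thread(target=hot_row_updater, args=(1000,)) for _ in range(48)]
for t in threads:
    t.start()
for t in threads:
    t.join()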

Excluding the tests that mix reads and writes (read-write-*), the scaleup is between 13 and 21. That is far from great but isn't horrible. I run with fsync-on-commit disabled, which highlights problems but is less realistic. So for now I am happy with these results.


