This post has performance results for Postgres 17.4, 17.5, 18 beta1 and 18 beta2 on a large server with sysbench microbenchmarks. Results like this from me are usually boring because Postgres has done a great job of avoiding performance regressions over time. This work was done by Small Datum LLC and was not sponsored. My previous work on Postgres 17.4 and 18 beta1 is here.
The workload here is cached by Postgres and my focus is on regressions from new CPU overhead or mutex contention.
tl;dr
- there might be small regressions (~2%) for range queries on the benchmark with 1 client. One cause is more CPU in BuildCachedPlan.
- there might be small regressions (~2%) for range queries on the benchmark with 40 clients. One cause is more CPU in PortalRunSelect.
- otherwise things look great
The benchmark used three configuration files that differ in how IO is done (a sketch of the differences follows this list):
- conf.diff.cx10b_c8r32
- uses io_method='sync' to match Postgres 17 behavior
- conf.diff.cx10c_c8r32
- uses io_method='worker' and io_workers=32 to do async IO via a pool of IO worker processes. I eventually learned that 32 is too large, but I don't think it matters much on this workload.
- conf.diff.cx10d_c8r32
- uses io_method='io_uring' to do async IO via io_uring
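For reference, here is a minimal sketch of how the three config files might differ. This is inferred from the descriptions above and is not the contents of the actual conf.diff files, which also share many other settings.

# conf.diff.cx10b_c8r32 -- synchronous IO, matches Postgres 17 behavior
io_method = 'sync'
# conf.diff.cx10c_c8r32 -- async IO via a pool of IO worker processes
io_method = 'worker'
io_workers = 32
# conf.diff.cx10d_c8r32 -- async IO via io_uring
io_method = 'io_uring'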
The tests are run using two workloads. For both, the read-heavy microbenchmarks run for 300 seconds and the write-heavy microbenchmarks run for 600 seconds.
- 1-client
- run with 1 client and 1 table with 50M rows
- bash r.sh 1 50000000 300 600 $deviceName 1 1 1
- 40-clients
- run with 40 clients and 8 tables with 10M rows per table
- bash r.sh 8 10000000 300 600 $deviceName 1 1 40
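Based only on the values used above, the r.sh arguments appear to map as follows. This mapping is my assumption and not documentation for r.sh, and the arguments I don't recognize are left unlabeled.

#          tables  rows/table  read_secs  write_secs  device       ?  ?  clients
bash r.sh  1       50000000    300        600         $deviceName  1  1  1     # the 1-client workload
bash r.sh  8       10000000    300        600         $deviceName  1  1  40    # the 40-clients workload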
The tables below use relative QPS, which is:
(QPS for some version) / (QPS for PG 17.4)
A relative QPS less than 1.0 means that version is slower than PG 17.4.
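A hypothetical worked example with made-up numbers: if PG 17.4 gets 1000 QPS on a microbenchmark and PG 18 beta2 gets 980 QPS, then:

relative QPS = 980 / 1000 = 0.98

which is a 2% regression; values greater than 1.0 mean the new version is faster than PG 17.4.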
Tables with absolute and relative QPS per microbenchmark are here. All of the files I saved for this workload are here.
- for point queries with 1 client, QPS is mostly ~2% better in PG 18 beta2 relative to 17.4 (see here), but the same is true for 17.5. Regardless, this is good news.
- for range queries without aggregation, full table scan is ~6% faster in PG 18 beta2 and ~4% faster in 17.5, both relative to 17.4
- but for the other microbenchmarks, PG 18 beta2, 18 beta1 and 17.5 are 1% to 5% slower than 17.4.
- From vmstat and iostat metrics for range-[not]covered-pk and range-[not]covered-si, this is explained by an increase in CPU/query (see the cpu/o column in the previous links). I also see a few cases where CPU/query is much larger, but only for 18 beta2 with configs that use io_method=worker and io_method=io_uring.
- I measure CPU using vmstat, which counts all CPU on the host, so perhaps something odd happens with other Postgres processes or some rogue process is burning CPU (a sketch of this measurement follows this list). I checked more results from vmstat and iostat and don't see storage IO during the tests.
- Code that does the vacuum and checkpoint is here, output from the vacuum work is here, and the Postgres logfiles are here. This work is done prior to the range query tests.
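As referenced above, here is a sketch of how a cpu/o style metric can be derived from vmstat. This is my assumption of the method, not the benchmark's actual scripts: average the host-wide user+system CPU over the test interval, then divide by QPS.

vmstat 1 > vmstat.out &      # sample host-wide CPU once per second during the test
# ... run the microbenchmark and record its QPS ...
kill %1
# average user+system CPU over the samples; divide the result by QPS for CPU/query
awk 'NR > 2 { us += $13; sy += $14; n++ } END { printf "avg us+sy: %.1f%%\n", (us + sy) / n }' vmstat.out

Because vmstat counts every process on the host, a rogue process would inflate this metric, which is the caveat mentioned above.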
- for range queries with aggregation, there are regressions (see here), but they are smaller than what I see above for range queries without aggregation
- the interesting result is for the same query run with different selectivity, going from a larger to a smaller range: the regression increases as the range gets smaller (see here, and the sketch below). To me this implies the likely problem is the fixed cost per query -- either in the optimizer or in query setup (allocating memory, etc).
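To make the fixed-cost argument concrete, here is a hypothetical version of the selectivity experiment. The table and column names follow the standard sysbench schema, but the exact queries are my guess, not the benchmark's. If per-query fixed cost (parse, plan, executor setup) grows, the short range regresses more because the fixed cost is a larger share of its total cost.

psql -c "SELECT COUNT(c) FROM sbtest1 WHERE id BETWEEN 1000 AND 1100"     # short range: fixed cost dominates
psql -c "SELECT COUNT(c) FROM sbtest1 WHERE id BETWEEN 1000 AND 11000"    # long range: per-row cost dominates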
- for writes, there are small regressions, mostly from 1% to 3% (see here).
- the regressions are largest for the 18 beta configs that use io_method=io_uring, which might be expected given the benefits it provides
- for point queries with 40 clients, QPS is similar from PG 17.4 through 18 beta1 (see here).
- for range queries without aggregation, full table scan is mostly ~2% faster after 17.4 (see here)
- for the other microbenchmarks, 3 of the 4 have small regressions of ~2% (see here). The worst is range-covered-pk and the problem appears to be more CPU per query (see here). Unlike above, where the new overhead was in BuildCachedPlan, here it is in the stack with PortalRunSelect (a sketch of how to check this follows).
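A sketch of one way to compare where the CPU time goes across versions; these are standard Linux perf commands and not necessarily what was used here.

perf record -a -g -- sleep 60      # profile all CPUs with call graphs while the microbenchmark runs
perf report --stdio > report.txt
grep -E 'BuildCachedPlan|PortalRunSelect' report.txt    # compare the samples attributed to each function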
- for range queries with aggregation, QPS is similar from PG 17.4 through 18 beta2 (see here)
- for writes, QPS drops by 1% to 5% for many microbenchmarks, but this problem starts in 17.5 (see here)
- From vmstat and iostat metrics for update-one (which suffers the most, see here), the CPU per operation does not increase (see the cpu/o column), and the number of context switches per operation also does not increase (see the cs/o column).
- Also from iostat, the amount of data written to storage doesn't change much.