Wednesday, January 14, 2026

Debugging regressions with Postgres in IO-bound sysbench

I explained in this post that there is a possible performance regression for Postgres with IO-bound sysbench. It arrived in Postgres 16 and remains in Postgres 18. I normally run sysbench with a cached database, but I had a spare server so I repeated tests with an IO-bound workload.

The bad news for me is that I need to spend more time explaining the problem. The good news for me is that I learn more about Postgres by doing this. And to be clear, I have yet to explain the regression but this post documents my debugging efforts.

sysbench

This post explains how I use sysbench. Note that modern sysbench is a framework for running benchmarks and it comes with many built-in tests. By framework I mean that it includes much plumbing for several DBMS and you can write tests in Lua with support for the Lua JIT so the client side of the tests use less CPU.

The sysbench framework has been widely used in the MySQL community for a long time. Originally it hard-coded one (or perhaps a few tests) and in too many cases by sysbench people mean the classic (original) sysbench transaction that is a mix of point queries, range queries and writes. The classic transaction is now implemented by oltp_read_write.lua with help from oltp_common.lua.

The oltp_read_write.lua test is usually not good at detecting regressions but in this case (the problem motivating me to write this blog post) it has detected the regression.

How to find regressions

I have opinions about how to run DBMS benchmarks. For example, be careful about running read-only benchmarks for an LSM because the state (shape) of the LSM tree can have a large impact on CPU overhead and the LSM state (shape) might be in the same (bad or good) state for the duration of the test. With read-write tests the LSM state should cycle between better and worse states.

But traditional b-tree storage engines also have states. One aspect of that is MVCC debt -- write-heavy tests increase the debt and the engine doesn't always do a great job of limiting that debt. So I run sysbench tests in a certain order to both create and manage MVCC debt while trying to minimize noise.

The order of the tests is listed here and the general sequence is:

  1. Load and index the tables
  2. Run a few read-only tests to let MVCC debt get reduced if it exists
  3. Optionally run more read-only tests (I usually skip these)
  4. Run write-heavy tests
  5. Do things to reduce MVCC debt (see here for Postgres and MySQL)
  6. Run read-only tests (after the tables+indexes have been subject to random writes)
  7. Run delete-only and then insert-only tests
Results

The possible regressions are:
  • update-zipf in Postgres 16, 17 and 18
  • write-only in Postgres 16, 17 and 18
  • read-write in Postgres 16, 17 and 18
  • insert in Postgres 18 - this reproduces in 18.0 and 18.1

Legend:
* relative QPS for Postgres 16.11, 17.7 and 18.1
* relative QPS is (QPS for my version / QPS for Postgres 15.15)

16.11   17.7    18.1
1.01    1.00    1.01    update-inlist
1.06    1.03    1.04    update-index
1.04    1.04    1.04    update-nonindex
1.00    1.08    1.07    update-one
0.85    0.72    0.71    update-zipf
0.88    0.86    0.84    write-only_range=10000
0.71    0.82    0.81    read-write_range=100
0.74    0.78    0.82    read-write_range=10
1.06    1.03    1.00    delete
1.02    1.00    0.80    insert

Note that when the insert test has a relative QPS of 0.80 for Postgres 18.1, then 18.1 gets 80% of the throughput vs Postgres 15.5. So this is a problem to figure out.

Explaining the insert test for Postgres 18.1

I wrote useful but messy Bash scripts to make it easier to run and explain sysbench results. One of the things I do is collect results from vmstat and iostat per test and then summarize average and normalized values from them where normalized values are: (avg from iostat or vmstat / QPS). 

And then I compute relative values for them which is the following and the base case here is Postgres 15.15: (value from my version / value from the base case) 

From the results below I see that from Postgres 17.7 to 18.1
  • throughput decreases by ~20% in 18.1
  • cpu/o increased by ~17% in 18.1
  • cs/o increased by ~2.6% in 18.1
  • r/o increased by ~14% in 18.1
  • rKB/o increased by ~22% in 18.1
  • wKB/o increased by ~27% in 18.1
So Postgres 18.1 does a lot more IO during the insert test. Possible reasons are that either insert processing changed to be less efficient or there is more MVCC debt at the start of the insert test from the write-heavy tests that precede it and thus there is more IO from write-back and vacuum during the insert test which reduces throughput. At this point I don't know. A next step for me is to repeat the benchmark with the delete test removed as that immediately precedes the insert test (see here).

Legend:
* cpu/o - CPU per operation. This includes us and sy from vmstat but
          it isn't CPU microseconds.
* cs/o - context switches per operation
* r/o - reads from storage per operation
* rKB/o - KB read from storage per operation
* wKB/o - KB written to storage per operation
* o/s - operations/s, QPS, throughput
* dbms - Postgres version

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000780        5.017   0.407   8.937   24.2    34164   15.15
0.000755        5.090   0.409   9.002   24.648  34833   16.11
0.000800        5.114   0.418   9.146   24.626  34195   17.7
0.000939        5.251   0.477   11.186  31.197  27304   18.1

--- relative to first result
0.97            1.01    1.00    1.01    1.02    1.02    16.11
1.03            1.02    1.03    1.02    1.02    1.00    17.7
1.20            1.05    1.17    1.25    1.29    0.80    18.1

I also have results from pg_stat_all_tables collected at the end of the delete and insert tests for Postgres 17.7 and 18.1. Then I computed the difference as (results after insert test - results before insert test). From the results below I don't see a problem. Note that the difference for n_dead_tup is zero. While my analysis was limited to one of the 8 tables used for the benchmark, the values for the other 7 tables are similar.

The columns with a non-zero difference for Postgres 17.7 are:
  3996146   n_tup_ins
  3996146   n_live_tup
  3996146   n_mod_since_analyze
  3996146   n_ins_since_vacuum

The columns with a non-zero difference for Postgres 18.1 are:
  3191278   n_tup_ins
  3191278   n_live_tup
  3191278   n_mod_since_analyze
  3191278   n_ins_since_vacuum

Explaining the other tests for Postgres 16, 17 and 18

Above I listed the order in which the write-heavy tests are run. The regressions occur on the update-zipf, write-only and read-write tests. All of these follow the update-one test. 
  • There is no regression for write-only and read-write when I change the test order so these run prior to update-one and update-zipf
  • There is a regression if either or both of update-one and update-zip are run prior to write-only and read-write
I assume that either the amount of MVCC debt created by update-one and update-zipf is larger starting with Postgres 16 or that starting in Postgres 16 something changed so that Postgres is less effective at dealing with that MVCC debt.

Note that the update-one and update-zipf tests are a bit awkward. But this workload hasn't been a problem for InnoDB so I assume this is specific to Postgres (and vacuum). For the update-one test all updates are limited to one row per table (the first row in the table). And for the update-zipf test a zipfian distribution is used to select the rows to update. So in both cases a small number of rows receive most of the updates.

Results from pg_stat_all_tables collected immediately prior to update-one and then after update-zipf are here for Postgres 15.15 and 16.11. Then I computed the difference as (results after update-zip - results before update-one). 

The columns with a non-zero difference for Postgres 15.15 are:
  23747485   idx_scan
  23747485   idx_tup_fetch
  23747485   n_tup_upd
  22225868   n_tup_hot_upd
    -69576   n_live_tup
   3273433   n_dead_tup
  -1428177   n_mod_since_analyze

The columns with a non-zero difference for Postgres 16.11 are:
  23102012   idx_scan
  23102012   idx_tup_fetch
  23102012   n_tup_upd
  21698107   n_tup_hot_upd
   1403905   n_tup_newpage_upd
     -3568   n_live_tup
   2983730   n_dead_tup
  -2064095   n_mod_since_analyze

I also have the vmstat and iostat data for each of the tests. In all cases, after update-one, the amount of CPU and IO per operation increases after Postgres 15.15.

For update-one results don't change much across versions

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000281        12.492  0.001   0.011   0.741   162046  15.15
0.000278        12.554  0.001   0.012   0.746   162415  16.11
0.000253        11.028  0.002   0.029   0.764   174715  17.7
0.000254        10.960  0.001   0.011   0.707   172790  18.1

--- relative to first result
0.99            1.00    1.00    1.09    1.01    1.00    16.11
0.90            0.88    2.00    2.64    1.03    1.08    17.7
0.90            0.88    1.00    1.00    0.95    1.07    18.1

For update-zipf the amount of CPU and IO per operation increases after Postgres 15.15

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000721        6.517   0.716   6.583   18.811  41531   15.15
0.000813        7.354   0.746   7.838   22.746  35497   16.11
0.000971        5.679   0.796   10.492  27.858  29700   17.7
0.000966        5.718   0.838   10.354  28.965  29289   18.1

--- relative to first result
1.13            1.13    1.04    1.19    1.21    0.85    16.11
1.35            0.87    1.11    1.59    1.48    0.72    17.7
1.34            0.88    1.17    1.57    1.54    0.71    18.1

For write-only_range=10000 the amount of CPU and IO per operation increases after Postgres 15.15

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000784        5.881   0.799   7.611   26.835  30174   15.15
0.000920        6.315   0.85    10.033  32      26444   16.11
0.001003        5.883   0.871   11.147  33.142  25889   17.7
0.000999        5.905   0.891   10.988  34.443  25232   18.1

--- relative to first result
1.17            1.07    1.06    1.32    1.19    0.88    16.11
1.28            1.00    1.09    1.46    1.24    0.86    17.7
1.27            1.00    1.12    1.44    1.28    0.84    18.1

For read-write_range=100 the amount of CPU and IO per operation increases after Postgres 15.15

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.001358        7.756   1.758   25.654  22.615  31574   15.15
0.001649        8.111   1.86    30.99   35.413  22264   16.11
0.001629        7.609   1.856   29.901  28.681  25906   17.7
0.001646        7.484   1.978   29.456  29.138  25573   18.1

--- relative to first result
1.21            1.05    1.06    1.21    1.57    0.71    16.11
1.20            0.98    1.06    1.17    1.27    0.82    17.7
1.21            0.96    1.13    1.15    1.29    0.81    18.1

For read-write_range=10 the amount of CPU and IO per operation increases after Postgres 15.15

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000871        6.441   1.096   10.238  21.426  37141   15.15
0.001064        6.926   1.162   13.761  29.738  27540   16.11
0.001074        6.461   1.181   14.759  29.338  28839   17.7
0.001045        6.497   1.222   14.726  28.632  30431   18.1

--- relative to first result
1.22            1.08    1.06    1.34    1.39    0.74    16.11
1.23            1.00    1.08    1.44    1.37    0.78    17.7
1.20            1.01    1.11    1.44    1.34    0.82    18.1

No comments:

Post a Comment

Debugging regressions with Postgres in IO-bound sysbench

I explained in this post that there is a possible performance regression for Postgres with IO-bound sysbench. It arrived in Postgres 16 and...