Wednesday, February 18, 2026

Explaining why throughput varies for Postgres with a CPU-bound Insert Benchmark

Throughput for the write-heavy steps of the Insert Benchmark looks like a distorted sine wave with Postgres on CPU-bound workloads but not on IO-bound workloads. For the CPU-bound workloads the chart for max response time at N-second intervals is flat for inserts but looks like the distorted sine wave for deletes. To see the chart for deletes, scroll down from here. So this looks like a problem with deletes and this post starts to explain it.

tl;dr

  • Once again, blame vacuum

History of the Insert Benchmark

Long ago (prior to 2010) the Insert Benchmark was published by Tokutek to highlight things that the TokuDB storage engine was great at. I was working on MySQL at Google at the time and the benchmark was useful to me; however, it was written in C++. While the Insert Benchmark is great at showing the benefits of an LSM storage engine, this was years before MyRocks and I was only doing InnoDB at the time, on spinning disks. So I rewrote it in Python to make it easier to modify, then the Tokutek team improved a few things about my rewrite, and I have been enhancing it slowly since then.

Until a few years ago the steps of the benchmark were:

  • load - insert in PK order
  • create 3 secondary indexes
  • do more inserts as fast as possible
  • do rate-limited inserts concurrent with range and point queries
The problem with this approach is that the database size grows forever and that limited how long I could run the benchmark before running out of storage. So I changed it and the new approach keeps the database at a fixed size after the load. The new workflow is:
  • load - insert in PK order
  • create 3 secondary indexes
  • do inserts+deletes at the same rate, as fast as possible
  • do rate-limited inserts+deletes at the same rate concurrent with range and point queries
The insert and delete statements run at the same rate to keep the table from changing size. The Insert Benchmark client uses Python multiprocessing: one process runs the Insert statements, another runs the Delete statements, and both get their work from queues. A third process populates those queues, and by controlling what goes onto each queue it keeps inserts and deletes running at the same rate.
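
A minimal sketch of that structure is below. The names and batch sizes are hypothetical and the real client does much more (builds the SQL, enforces rate limits, reports response times), but it shows the queue-per-worker pattern:

    import multiprocessing as mp

    def coordinator(insert_q, delete_q, batches, batch_size=100):
        # Feed both queues one batch per iteration so inserts and deletes are
        # issued at the same rate; bounded queues keep the workers in step.
        for _ in range(batches):
            insert_q.put(batch_size)
            delete_q.put(batch_size)
        insert_q.put(None)   # sentinel: no more work
        delete_q.put(None)

    def run_batch(nrows):
        pass                 # stand-in for running the Insert or Delete statements

    def worker(q):
        while True:
            nrows = q.get()
            if nrows is None:
                break
            run_batch(nrows)

    if __name__ == "__main__":
        insert_q, delete_q = mp.Queue(maxsize=4), mp.Queue(maxsize=4)
        procs = [mp.Process(target=worker, args=(q,)) for q in (insert_q, delete_q)]
        for p in procs:
            p.start()
        coordinator(insert_q, delete_q, batches=1000)
        for p in procs:
            p.join()

Because the coordinator adds one batch to each queue per iteration and the queues are bounded, neither worker can get far ahead of the other.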

The benchmark treats the table like a queue, and when ordered by PK (transactionid) there are inserts at the high end and deletes at the low end. The delete statement currently looks like:
    delete from %s where transactionid in
        (select transactionid from %s where transactionid >= %d order by transactionid asc limit %d)

The delete statement is written like that because it must delete the oldest rows -- the ones that have the smallest value for transactionid. While the process that does deletes has some idea of what that smallest value is, it doesn't know it for sure, thus the query. To improve performance it maintains a guess for the value that will be <= the real minimum and it updates that guess over time.

I encountered other performance problems with Postgres while figuring out how to maintain that guess, and get_actual_variable_range() in Postgres was the problem. Maintaining the guess requires a resync query every N seconds where the resync query is: select min(transactionid) from %s. The problem for this query in general is that it scans the low end of the PK index on transactionid, and when vacuum hasn't run recently it will scan and skip many entries that aren't visible (wasting much CPU and some IO) before finding visible rows. There will always be some time between consecutive vacuums of the same table, so this problem can't be avoided. The result is that the response time for the query increases a lot in between vacuums. For more on how get_actual_variable_range() contributes to this problem, see this post.
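
Put together, the delete side might look roughly like the sketch below. The names, batch size and resync interval are hypothetical (the real client differs); the point is that the guess only moves forward when the resync query runs, and between resyncs the subquery may have to skip over rows that were deleted but not yet vacuumed.

    import time

    def make_delete(table, guess_min, nrows):
        # guess_min must be <= the real minimum transactionid so the subquery
        # starts its scan at the low end of the PK index and never misses rows
        return ("delete from %s where transactionid in "
                "(select transactionid from %s where transactionid >= %d "
                "order by transactionid asc limit %d)"
                % (table, table, guess_min, nrows))

    def deleter(run_sql, table, nrows=50, resync_secs=60, batches=1000):
        guess_min, last_resync = 0, 0.0
        for _ in range(batches):
            run_sql(make_delete(table, guess_min, nrows))
            if time.time() - last_resync >= resync_secs:
                # resync: learn the current minimum so the guess does not fall far
                # behind, which would make the subquery scan (and skip) many index
                # entries for rows that were deleted but not yet vacuumed
                guess_min = run_sql("select min(transactionid) from %s" % table)
                last_resync = time.time()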

I assume the sine wave for delete response time is caused by one or both of:
  • get_actual_variable_range() CPU overhead while planning the delete statement
  • CPU overhead from scanning and skipping tombstones while executing the select subquery
The structure of the delete statement above reduces the number of tombstones that the select subquery might encounter by specifying where transactionid >= %d. Perhaps that isn't sufficient. Perhaps the Postgres query planner still has too much CPU overhead from get_actual_variable_range() while planning that delete statement. I have yet to figure that out. But I have figured out that vacuum is a frequent source of problems.
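
One way to split the cost between those two suspects is to run the delete under EXPLAIN ANALYZE, which reports Planning Time (where get_actual_variable_range() runs) separately from Execution Time (where the subquery scans and skips dead index entries). Below is a sketch using psycopg2; the DSN, table name and constants are hypothetical, and the rollback keeps the delete from changing the table.

    import psycopg2

    conn = psycopg2.connect("dbname=ib host=127.0.0.1")   # hypothetical DSN
    cur = conn.cursor()
    cur.execute("explain (analyze, buffers) "
                "delete from pi1 where transactionid in "
                "(select transactionid from pi1 where transactionid >= 1000 "
                "order by transactionid asc limit 50)")
    for (line,) in cur.fetchall():
        # "Planning Time" is where get_actual_variable_range() shows up and
        # "Execution Time" is where the subquery skips not-yet-vacuumed entries
        if "Planning Time" in line or "Execution Time" in line:
            print(line)
    conn.rollback()   # EXPLAIN ANALYZE executes the delete, so undo it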


    Tuesday, February 17, 2026

    MariaDB innovation: binlog_storage_engine, small server, Insert Benchmark

     MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB).

    My previous post had results for sysbench with a small server. This post has results for the Insert Benchmark with a similar small server. Both servers use an SSD that has high fsync latency. This is probably a best-case comparison for the feature. If you really care, then get enterprise SSDs with power loss protection. But you might encounter high fsync latency on public cloud servers.

    tl;dr for a CPU-bound workload

    • Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2.
    • When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 1.75X larger for l.i0 (load) and 4X or more larger on the random write steps (l.i1, l.i2).

    tl;dr for an IO-bound workload

    • Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2. It also has a large impact on qp1000, which is the most write-heavy of the query+write steps.
    • When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 4.74X larger for l.i0 (load), 1.50X larger for l.i1 (random writes) and 2.99X larger for l.i2 (random writes).

    Builds, configuration and hardware

    I compiled MariaDB 12.3.0 from source.

    The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here. The storage device has high fsync latency.

    I used 4 my.cnf files:
    • z12b
      • my.cnf.cz12b_c8r32 is my default configuration. Sync-on-commit is disabled for both the binlog and InnoDB so that write-heavy benchmarks create more stress.
    • z12c
      • my.cnf.cz12c_c8r32 is like cz12b except it enables binlog_storage_engine so the binlog is stored in InnoDB.
    • z12b_sync
      • my.cnf.cz12b_sync_c8r32 is like cz12b except it enables sync-on-commit for both the binlog and InnoDB, so there are 2 fsyncs per commit.
    • z12c_sync
      • my.cnf.cz12c_sync_c8r32 is like cz12c except it enables sync-on-commit for InnoDB. Note that InnoDB is used to store the binlog so there is nothing else to sync on commit.
    The Benchmark

    The benchmark is explained here. It was run with 1 client for two workloads:
    • CPU-bound - the database is cached by InnoDB, but there is still much write IO
    • IO-bound - most, but not all, benchmark steps are IO-bound
    The benchmark steps are:

    • l.i0
      • insert XM rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 30M for CPU-bound and 800M for IO-bound.
    • l.x
      • create 3 secondary indexes per table. There is one connection per client.
    • l.i1
      • use 2 connections/client. One inserts XM rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. X is 40M for CPU-bound and 4M for IO-bound.
    • l.i2
      • like l.i1 but each transaction modifies 5 rows (small transactions) and YM rows are inserted and deleted per table. Y is 10M for CPU-bound and 1M for IO-bound.
      • Wait for S seconds after the step finishes to reduce MVCC GC debt and perf variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
    • qr100
      • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload. This step runs for 1800 seconds. A sketch of how the rate-limited writes might be paced follows this list.
    • qp100
      • like qr100 except uses point queries on the PK index
    • qr500
      • like qr100 but the insert and delete rates are increased from 100/s to 500/s
    • qp500
      • like qp100 but the insert and delete rates are increased from 100/s to 500/s
    • qr1000
      • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
    • qp1000
      • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
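
    This is a minimal sketch of how the rate-limited writer connections might pace themselves and how the SLA check could work. The batch size and tolerance are hypothetical and the real client differs.

        import time

        def paced_writes(run_batch, target_rate, duration_secs, batch=10):
            # issue `batch` writes, then sleep until the next batch is due, so the
            # average rate stays near target_rate writes/second when the server keeps up
            start = time.time()
            next_due = start
            done = 0
            while time.time() - start < duration_secs:
                run_batch(batch)
                done += batch
                next_due += batch / target_rate
                delay = next_due - time.time()
                if delay > 0:
                    time.sleep(delay)
            achieved = done / (time.time() - start)
            # SLA check: the step fails if the target rate was not sustained
            return achieved >= 0.95 * target_rate   # hypothetical 5% tolerance

        # the qr100/qp100 steps would use target_rate=100 and duration_secs=1800
        ok = paced_writes(lambda n: None, target_rate=100, duration_secs=5)
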
    Results: summary

    The performance reports are here for:
    • CPU-bound
      • all-versions - results for z12b, z12c, z12b_sync and z12c_sync
      • sync-only - results for z12b_sync vs z12c_sync
    • IO-bound
      • all-versions - results for z12b, z12c, z12b_sync and z12c_sync
      • sync-only - results for z12b_sync vs z12c_sync
    The summary sections from the performance reports have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

    I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some config and $base is the result for the base config -- z12b for the all-versions reports and z12b_sync for the sync-only reports.

    When relative QPS is > 1.0 then performance is better than the base config. When it is < 1.0 then it is worse. The Q in relative QPS measures:
    • insert/s for l.i0, l.i1, l.i2
    • indexed rows/s for l.x
    • range queries/s for qr100, qr500, qr1000
    • point queries/s for qp100, qp500, qp1000
    Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements.

    I often use context switch rates as a proxy for mutex contention.

    Results: CPU-bound

    The summaries are here for all-versions and sync-only.
    • Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2.
    • When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 1.75X larger for l.i0 (load) and 4X or more larger on the random write steps (l.i1, l.i2)
    The second table from the summary section has been inlined below. That table shows relative throughput which is:
    • all-versions: (QPS for my config / QPS for z12b)
    • sync-only: (QPS for z12c_sync / QPS for z12b_sync)
    For all-versions
    dbms                                     l.i0    l.x     l.i1    l.i2    qr100   qp100   qr500   qp500   qr1000  qp1000
    ma120300_rel_withdbg.cz12b_c8r32         1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
    ma120300_rel_withdbg.cz12c_c8r32         1.03    1.01    1.00    1.03    1.00    0.99    1.00    1.00    1.01    1.00
    ma120300_rel_withdbg.cz12b_sync_c8r32    0.04    1.02    0.07    0.01    1.01    1.01    1.00    1.01    1.00    1.00
    ma120300_rel_withdbg.cz12c_sync_c8r32    0.08    1.03    0.28    0.06    1.02    1.01    1.01    1.02    1.02    1.01

    For sync-only
    dbms                                     l.i0    l.x     l.i1    l.i2    qr100   qp100   qr500   qp500   qr1000  qp1000
    ma120300_rel_withdbg.cz12b_sync_c8r32    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
    ma120300_rel_withdbg.cz12c_sync_c8r32    1.75    1.01    3.99    6.83    1.01    1.01    1.01    1.01    1.03    1.01

    Results: IO-bound

    The summaries are here for all-versions and sync-only.
    • Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2. It also has a large impact on qp1000, which is the most write-heavy of the query+write steps.
    • When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 4.74X larger for l.i0 (load), 1.50X larger for l.i1 (random writes) and 2.99X larger for l.i2 (random writes)
    The second table from the summary section has been inlined below. That table shows relative throughput which is:
    • all-versions: (QPS for my config / QPS for z12b)
    • sync-only: (QPS for z12c_sync / QPS for z12b_sync)
    For all-versions
    dbms                                     l.i0    l.x     l.i1    l.i2    qr100   qp100   qr500   qp500   qr1000  qp1000
    ma120300_rel_withdbg.cz12b_c8r32         1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
    ma120300_rel_withdbg.cz12c_c8r32         1.01    0.99    0.99    1.01    1.01    1.01    1.01    1.07    1.01    1.04
    ma120300_rel_withdbg.cz12b_sync_c8r32    0.04    1.00    0.55    0.10    1.02    0.97    1.00    0.80    0.95    0.55
    ma120300_rel_withdbg.cz12c_sync_c8r32    0.18    1.00    0.83    0.31    1.02    1.01    1.02    0.96    1.02    0.86

    For sync-only
    dbms                                     l.i0    l.x     l.i1    l.i2    qr100   qp100   qr500   qp500   qr1000  qp1000
    ma120300_rel_withdbg.cz12b_sync_c8r32    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
    ma120300_rel_withdbg.cz12c_sync_c8r32    4.74    1.00    1.50    2.99    1.00    1.04    1.02    1.20    1.08    1.57

    Monday, February 16, 2026

    MariaDB innovation: binlog_storage_engine

    MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB).

    In this post I have results for the performance benefit from this when using storage that has a high fsync latency. This is probably a best-case comparison for the feature. A future post will cover the benefit on servers that don't have high fsync latency.

    tl;dr

    • the performance benefit from this is excellent when storage has a high fsync latency
    • there is a small improvement (up to 6%) for write throughput when binlog_storage_engine is enabled but sync-on-commit is not enabled
    • my mental performance model needs to be improved. I guessed that throughput would increase by ~2X when using binlog_storage_engine relative to not using it while still using sync_binlog=1 and innodb_flush_log_at_trx_commit=1. However the improvement is larger than 4X.
    Some history

    MongoDB has done this for years -- the replication log is stored in WiredTiger. 

    Long ago there were requests for this feature from the Galera team, and I wonder if they will benefit from this now. I have been curious about the benefit of the feature, but long ago I was also wary of it because it can increase stress on InnoDB and back in the day InnoDB already struggled with high-concurrency workloads.

    Long ago group commit didn't work for the binlog. The Facebook MySQL team did some work to fix that, and eventually it was fixed upstream. A Google search describes our work as the first, and I found an old Facebook note that I probably wrote about the effort.

    Builds, configuration and hardware

    I compiled MariaDB 12.3.0 from source.

    The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here. The storage device has high fsync latency.

    I used 4 my.cnf files:
    • z12b
      • my.cnf.cz12b_c8r32 is my default configuration. Sync-on-commit is disabled for both the binlog and InnoDB so that write-heavy benchmarks create more stress.
    • z12c
      • my.cnf.cz12c_c8r32 is like cz12b except it enables binlog_storage_engine so the binlog is stored in InnoDB.
    • z12b_sync
      • my.cnf.cz12b_sync_c8r32 is like cz12b except it enables sync-on-commit for both the binlog and InnoDB, so there are 2 fsyncs per commit.
    • z12c_sync
      • my.cnf.cz12c_sync_c8r32 is like cz12c except it enables sync-on-commit for InnoDB. Note that InnoDB is used to store the binlog so there is nothing else to sync on commit.

    Benchmark

    I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by InnoDB.

    The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.

    The benchmark is run with 1 client, 1 table and 50M rows. 

    Results

    The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation.  

    But here I only report results for the write-heavy tests.

    I provide charts below with relative QPS. The relative QPS is the following:
    (QPS for some version) / (QPS for base version)
    When the relative QPS is > 1 then some version is faster than base version.  When it is < 1 then there might be a regression. 

    I present results for:
    • z12b, z12c, z12b_sync and z12c_sync with z12b as the base version
    •  z12b_sync and z12c_sync with z12b_sync as the base version
    Results: z12b, z12c, z12b_sync, z12c_sync

    Summary:
    • z12c gets up to 6% more throughput than z12b but the CPU overhead per operation is similar for z12b and z12c
    • z12b_sync has the worst performance thanks to 2 fsyncs per commit
    • z12c_sync gets more than 4X the throughput vs z12b_sync. If fsync latency were the only thing that determined performance then I would expect the difference to be ~2X. There is more going on here and in the next section I mention that enabling binlog_storage_engine also reduces the CPU overhead.
    • some per-test data from iostat and vmstat is here
    • a representative sample of iostat collected at 1-second intervals during the update-inlist test is here. When comparing z12b_sync with z12c_sync
      • the fsync rate (f/s) is ~2.5X larger for z12c_sync vs z12b_sync (~690/s vs ~275/s) but fsync latency (f_await) is similar. So with binlog_storage_engine enabled MySQL is more efficient, and perhaps thanks to a lower CPU overhead, there is less work to do in between calls to fsync
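
    A back-of-envelope sketch of why I expected ~2X rather than 4X or more. The latencies below are made up; the point is that if commit time were just fsyncs plus a fixed amount of other work, halving the fsync count can at most double throughput, so the larger gains imply the other per-commit work shrinks too (less CPU, or more commits per fsync via group commit).

        # hypothetical per-commit costs in milliseconds
        fsync_ms = 4.0     # one fsync on this slow-fsync SSD
        other_ms = 2.0     # CPU and other non-fsync work per commit

        t_two_sync = 2 * fsync_ms + other_ms   # separate binlog and InnoDB fsyncs
        t_one_sync = 1 * fsync_ms + other_ms   # binlog stored in InnoDB
        print("expected speedup: %.2f" % (t_two_sync / t_one_sync))   # ~1.67, at most 2
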
    Relative to: z12b
    col-1 : z12c
    col-2 : z12b_sync
    col-3 : z12c_sync

    col-1   col-2   col-3
    1.06    0.01    0.05    delete
    1.05    0.01    0.05    insert
    1.01    0.12    0.47    read-write_range=100
    1.01    0.10    0.44    read-write_range=10
    1.03    0.01    0.11    update-index
    1.02    0.02    0.12    update-inlist
    1.05    0.01    0.06    update-nonindex
    1.05    0.01    0.06    update-one
    1.05    0.01    0.06    update-zipf
    1.01    0.03    0.20    write-only

    Results: z12b_sync, z12c_sync

    Summary:
    • z12c_sync gets more than 4X the throughput vs z12b_sync. If fsync latency were the only thing that determined performance then I would expect the difference to be ~2X. There is more going on here and below I mention that enabling binlog_storage_engine also reduces the CPU overhead.
    • some per-test data from iostat and vmstat is here and the CPU overhead per operation is much smaller with binlog_storage_engine -- see here for the update-inlist test. In general, when sync-on-commit is enabled then the CPU overhead with binlog_storage_engine enabled is between 1/3 and 2/3 of the overhead without it enabled.
    Relative to: z12b_sync
    col-1 : z12c_sync

    col-1
    6.40    delete
    5.64    insert
    4.06    read-write_range=100
    4.40    read-write_range=10
    7.64    update-index
    7.17    update-inlist
    5.73    update-nonindex
    5.82    update-one
    5.80    update-zipf
    6.61    write-only

    Sunday, February 15, 2026

    HammerDB tproc-c on a large server, Postgres and MySQL

    This has results for HammerDB tproc-c on a large server using MySQL and Postgres. I am new to HammerDB and still figuring out how to explain and present results so I will keep this simple and just share graphs without explaining the results.

    The comparison might favor Postgres for the IO-bound workloads because I used smaller buffer pools than normal to avoid OOM. I had to do this because RSS for the HammerDB client grows over time as it buffers more response time stats. And while I used buffered IO for Postgres, I used O_DIRECT for InnoDB. So Postgres might have avoided some read IO thanks to the OS page cache while InnoDB did not.

    tl;dr for MySQL

    • With vu=40 MySQL 8.4.8 uses about 2X more CPU per transaction and does more than 2X more context switches per transaction compared to Postgres 18.1. I will get CPU profiles soon.
    • Modern MySQL brings us great improvements to concurrency and too many new CPU overheads
      • MySQL 5.6 and 8.4 have similar throughput at the lowest concurrency (vu=10)
      • MySQL 8.4 is a lot faster than 5.6 at the highest concurrency (vu=40)
    tl;dr for Postgres
    • Modern Postgres has regressions relative to old Postgres
    • The regressions increase with the warehouse count; at wh=4000 the NOPM drops by between 3% and 13% depending on the virtual user count (vu).
    tl;dr for Postgres vs MySQL
    • Postgres and MySQL have similar throughput for the largest warehouse count (wh=4000)
    • Otherwise Postgres gets between 1.4X and 2X more throughput (NOPM)

    Builds, configuration and hardware

    I compiled Postgres versions from source: 12.22, 13.23, 14.20, 15.15, 16.11, 17.7 and 18.1.

    I compiled MySQL versions from source: 5.6.51, 5.7.44, 8.0.45, 8.4.8, 9.4.0 and 9.6.0.

    I used a 48-core server from Hetzner
    • an ax162s with an AMD EPYC 9454P 48-Core Processor with SMT disabled
    • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
    • 128G RAM
    • Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic)
    Postgres configuration files:
    • prior to v18 the config file is named conf.diff.cx10a50g_c32r128 (x10a_c32r128) and is here for versions 12, 13, 14, 15, 16 and 17.
    • for Postgres 18 I used conf.diff.cx10b_c32r128 (x10b_c32r128) with io_method=sync to be similar to the config used for versions 12 through 17.
    MySQL configuration files
    • prior to 9.6 the config file is named my.cnf.cz12a50g_c32r128 (z12a50g_c32r128 or z12a50g) and is here for versions 5.6, 5.7, 8.0 and 8.4
    • for 9.6 it is named my.cnf.cz13a50g_c32r128 (z13a50g_c32r128 or z13a50g) and is here
    For both Postgres and MySQL fsync on commit is disabled to avoid turning this into an fsync benchmark. The server has 2 SSDs with SW RAID and low fsync latency.

    Benchmark

    The benchmark is tproc-c from HammerDB. The tproc-c benchmark is derived from TPC-C.

    The benchmark was run for several workloads:
    • vu=10, wh=1000 - 10 virtual users, 1000 warehouses
    • vu=20, wh=1000 - 20 virtual users, 1000 warehouses
    • vu=40, wh=1000 - 40 virtual users, 1000 warehouses
    • vu=10, wh=2000 - 10 virtual users, 2000 warehouses
    • vu=20, wh=2000 - 20 virtual users, 2000 warehouses
    • vu=40, wh=2000 - 40 virtual users, 2000 warehouses
    • vu=10, wh=4000 - 10 virtual users, 4000 warehouses
    • vu=20, wh=4000 - 20 virtual users, 4000 warehouses
    • vu=40, wh=4000 - 40 virtual users, 4000 warehouses
    The wh=1000 workloads are less heavy on IO. The wh=4000 workloads are more heavy on IO.

    The benchmark for Postgres is run by a variant of this script which depends on scripts here. The MySQL scripts are similar.
    • stored procedures are enabled
    • partitioning is used because the warehouse count is >= 1000
    • a 5 minute rampup is used
    • then performance is measured for 60 minutes
    Basic metrics: iostat

    I am still improving my helper scripts to report various performance metrics. The table here has average values from iostat during the benchmark run phase for MySQL 8.4.8 and Postgres 18.1. For these configurations the NOPM values for Postgres and MySQL were similar, so I won't present normalized values (average value / NOPM); NOPM is the throughput metric.
    • average wMB/s increases with the warehouse count for Postgres but not for MySQL
    • r/s increases with the warehouse count for Postgres and MySQL
    iostat metrics
    * r/s = average rate of reads/s from storage
    * wMB/s = average MB/s written to storage

    my8408
    r/s     wMB/s
    22833.0 906.2   vu=40, wh=1000
    63079.8 1428.5  vu=40, wh=2000
    82282.3 1398.2  vu=40, wh=4000

    pg181
    r/s     wMB/s
    30394.9 1261.9  vu=40, wh=1000
    59770.4 1267.8  vu=40, wh=2000
    78052.3 1272.9  vu=40, wh=4000

    Basic metrics: vmstat

    I am still improving my helper scripts to report various performance metrics. The table here has average values from vmstat during the benchmark run phase for MySQL 8.4.8 and Postgres 18.1. For these configurations the NOPM values for Postgres and MySQL were similar so I won't present normalized values (average value / NOPM).
    • CPU utilization is almost 2X larger for MySQL
    • Context switch rates are more than 2X larger for MySQL
    • In the future I hope to learn why MySQL uses almost 2X more CPU per transaction and has more than 2X more context switches per transaction relative to Postgres
    vmstat metrics
    * cs - average value for cs (context switches/s)
    * us - average value for us (user CPU)
    * sy - average value for sy (system CPU)
    * id - average value for id (idle)
    * wa - average value for wa (waiting for IO)
    * us+sy - sum of us and sy

    my8408
    cs      us      sy      id      wa      us+sy
    455648  61.9    8.2     24.2    5.7     70.1    vu=40, wh=1000
    484955  50.4    9.2     19.5    21.0    59.6    vu=40, wh=2000
    487410  39.5    8.4     19.4    32.6    48.0    vu=40, wh=4000

    pg181
    cs      us      sy      id      wa      us+sy
    127486  23.5    10.1    63.3    3.0     33.6    vu=40, wh=1000
    166257  17.2    11.1    62.5    9.1     28.3    vu=40, wh=2000
    203578  13.9    11.3    59.2    15.6    25.2    vu=40, wh=4000
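
    The sketch below normalizes the vmstat averages above by NOPM (using the vu=40, wh=1000 NOPM values from the results tables later in this post) to make the per-transaction comparison concrete. It is just arithmetic on averages, not a profiler-level measurement.

        # us+sy and cs are from the vmstat tables above; NOPM is from the results
        # tables below -- all for vu=40, wh=1000
        my8408 = dict(cpu=70.1, cs=455648, nopm=435095)
        pg181  = dict(cpu=33.6, cs=127486, nopm=431648)

        cpu_ratio = (my8408["cpu"] / my8408["nopm"]) / (pg181["cpu"] / pg181["nopm"])
        cs_ratio  = (my8408["cs"]  / my8408["nopm"]) / (pg181["cs"]  / pg181["nopm"])
        print("CPU per transaction, MySQL vs Postgres: %.2f" % cpu_ratio)   # ~2.1
        print("cs per transaction, MySQL vs Postgres:  %.2f" % cs_ratio)    # ~3.5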

    Results

    My analysis at this point is simple -- I only consider average throughput. Eventually I will examine throughput over time and efficiency (CPU and IO).

    On the charts that follow y-axis does not start at 0 to improve readability at the risk of overstating the differences. The y-axis shows relative throughput. There might be a regression when the relative throughput is less than 1.0. There might be an improvement when it is > 1.0. The relative throughput is:
    (NOPM for some-version / NOPM for base-version)

    I provide three charts below:

    • only MySQL - base-version is MySQL 5.6.51
    • only Postgres - base-version is Postgres 12.22
    • Postgres vs MySQL - base-version is Postgres 18.1, some-version is MySQL 8.4.8
    Results: MySQL 5.6 to 9.6

    Legend:

    • my5651.z12a is MySQL 5.6.51 with the z12a50g config
    • my5744.z12a is MySQL 5.7.44 with the z12a50g config
    • my8045.z12a is MySQL 8.0.45 with the z12a50g config
    • my8408.z12a is MySQL 8.4.8 with the z12a50g config
    • my9600.z13a is MySQL 9.6.0 with the z13a50g config

    Summary

    • At the lowest concurrency (vu=10) MySQL 8.4.8 has similar throughput as 5.6.51 because CPU regressions in modern MySQL offset the concurrency improvements.
    • At the highest concurrency (vu=40) MySQL 8.4.8 is much faster than 5.6.51 and the regressions after 5.7 are small. This matches what I have seen elsewhere -- while modern MySQL suffers from CPU regressions it benefits from concurrency improvements. Imagine if we could get those concurrency improvements without the CPU regressions.

    And the absolute NOPM values are here:

                    my5651  my5744  my8045  my8408  my9600
    vu=10, wh=1000  163059  183268  156039  155194  151748
    vu=20, wh=1000  210506  321670  283282  281038  279269
    vu=40, wh=1000  216677  454743  439589  435095  433618
    vu=10, wh=2000  107492  130229  111798  110161  108386
    vu=20, wh=2000  155398  225068  193658  190717  189847
    vu=40, wh=2000  178278  302723  297236  307504  293217
    vu=10, wh=4000   81242  103406   89414   89316   88458
    vu=20, wh=4000  131241  179112  155134  152998  152301
    vu=40, wh=4000  146809  228554  234922  229511  230557

    Results: Postgres 12 to 18

    Legend:

    • pg1222 is Postgres 12.22 with the x10a50g config
    • pg1323 is Postgres 13.23 with the x10a50g config
    • pg1420 is Postgres 14.20 with the x10a50g config
    • pg1515 is Postgres 15.15 with the x10a50g config
    • pg1611 is Postgres 16.11 with the x10a50g config
    • pg177 is Postgres 17.7 with the x10a50g config
    • pg181 is Postgres 18.1 with the x10b50g config

    Summary

    • Modern Postgres has regressions relative to old Postgres
    • The regressions increase with the warehouse count; at wh=4000 the NOPM drops by between 3% and 13% depending on the virtual user count (vu).


    The relative NOPM values are here:

                    pg1222  pg1323  pg1420  pg1515  pg1611  pg177   pg181
    vu=10, wh=1000  1.000   1.000   1.054   1.042   1.004   1.010   0.968
    vu=20, wh=1000  1.000   1.035   1.037   1.028   1.028   1.001   0.997
    vu=40, wh=1000  1.000   1.040   0.988   1.000   1.027   0.998   0.970
    vu=10, wh=2000  1.000   1.026   1.059   1.075   1.068   1.081   1.029
    vu=20, wh=2000  1.000   1.022   1.046   1.043   0.979   0.972   0.934
    vu=40, wh=2000  1.000   1.014   1.032   1.036   0.979   1.010   0.947
    vu=10, wh=4000  1.000   1.027   1.032   1.035   0.993   0.998   0.974
    vu=20, wh=4000  1.000   1.005   1.049   1.048   0.940   0.927   0.876
    vu=40, wh=4000  1.000   0.991   1.019   0.983   1.001   0.979   0.937

    The absolute NOPM values are here:

                    pg1222  pg1323  pg1420  pg1515  pg1611  pg177   pg181
    vu=10, wh=1000  353077  353048  372015  367933  354513  356469  341688
    vu=20, wh=1000  423565  438456  439398  435454  435288  423986  422397
    vu=40, wh=1000  445114  462851  439728  445144  457110  444364  431648
    vu=10, wh=2000  223048  228914  236231  239868  238117  241185  229549
    vu=20, wh=2000  314380  321380  328688  328044  307728  305452  293627
    vu=40, wh=2000  320347  324769  330444  331896  313553  323454  303403
    vu=10, wh=4000  162054  166461  167320  167761  160962  161716  157872
    vu=20, wh=4000  244598  245804  256593  256231  230037  226844  214309
    vu=40, wh=4000  252931  250634  257820  248584  253059  247610  236986

    Results: MySQL vs Postgres

    Legend:

    • pg181 is Postgres 18.1 with the x10b50g config
    • my8408 is MySQL 8.4.8 with the z12a50g config

    Summary

    • Postgres and MySQL have similar throughput for the largest warehouse count (wh=4000)
    • Otherwise Postgres gets between 1.4X and 2X more throughput (NOPM)
    The absolute NOPM values are here:

                    pg181   my8408
    vu=10, wh=1000  341688  155194
    vu=20, wh=1000  422397  281038
    vu=40, wh=1000  431648  435095
    vu=10, wh=2000  229549  110161
    vu=20, wh=2000  293627  190717
    vu=40, wh=2000  303403  307504
    vu=10, wh=4000  157872   89316
    vu=20, wh=4000  214309  152998
    vu=40, wh=4000  236986  229511