Monday, February 16, 2026

MariaDB innovation: binlog_storage_engine

MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB).

In this post I have results for the performance benefit from this when using storage that has a high fsync latency. This is probably a best-case comparison for the feature. A future post will cover the benefit on servers that don't have high fsync latency.

tl;dr

  • the performance benefit from this is excellent when storage has a high fsync latency
  • there is a small improvement (up to 6%) for write throughput when binlog_storage_engine is enabled but sync-on-commit is not enabled
  • my mental performance model needs to be improved. I guessed that throughput would increase by ~2X with binlog_storage_engine enabled relative to not using it while running with sync_binlog=1 and innodb_flush_log_at_trx_commit=1. However, the improvement is larger than 4X.
Some history

MongoDB has done this for years -- the replication log is stored in WiredTiger. 

Long ago there were requests for this feature from the Galera team, and I wonder if they will benefit from this now. I have been curious about the benefit of the feature, but long ago I was also wary of it because it can increase stress on InnoDB and back in the day InnoDB already struggled with high-concurrency workloads.

Long ago group commit didn't work for the binlog. The Facebook MySQL team did some work to fix that. A Google search describes our work as the first, and I found an old Facebook note that I probably wrote about the effort.

Builds, configuration and hardware

I compiled MariaDB 12.3.0 from source.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here. The storage device has high fsync latency.

I used 4 my.cnf files:
  • z12b
    • my.cnf.cz12b_c8r32 is my default configuration. Sync-on-commit is disabled for both the binlog and InnoDB so that write-heavy benchmarks create more stress.
  • z12c
    • like cz12b except binlog_storage_engine is enabled so the binlog is stored in InnoDB
  • z12b_sync
    • like cz12b except sync-on-commit is enabled for both the binlog (sync_binlog=1) and InnoDB (innodb_flush_log_at_trx_commit=1)
  • z12c_sync
    • my.cnf.cz12c_sync_c8r32 is like cz12c except it enables sync-on-commit for InnoDB. Note that InnoDB is used to store the binlog so there is nothing else to sync on commit.

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by InnoDB.

The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.

The benchmark is run with 1 client, 1 table and 50M rows. 

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation.  

But here I only report results for the write-heavy tests.

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for base version)
When the relative QPS is > 1 then some version is faster than the base version. When it is < 1 then there might be a regression.

I present results for:
  • z12b, z12c, z12b_sync and z12c_sync with z12b as the base version
  •  z12b_sync and z12c_sync with z12b_sync as the base version
Results: z12b, z12c, z12b_sync, z12c_sync

Summary:
  • z12c gets up to 6% more throughput than z12b but the CPU overhead per operation is similar for z12b and z12c
  • z12b_sync has the worst performance thanks to 2 fsyncs per commit
  • z12c_sync gets more than 4X the throughput vs z12b_sync. If fsync latency were the only thing that determined performance then I would expect the difference to be ~2X (a back-of-envelope sketch follows this list). There is more going on here and in the next section I mention that enabling binlog_storage_engine also reduces the CPU overhead.
  • some per-test data from iostat and vmstat is here
  • a representative sample of iostat collected at 1-second intervals during the update-inlist test is here. When comparing z12b_sync with z12c_sync
    • the fsync rate (f/s) is ~2.5X larger for z12c_sync vs z12b_sync (~690/s vs ~275/s) but fsync latency (f_await) is similar. So with binlog_storage_engine enabled MariaDB is more efficient, and perhaps thanks to a lower CPU overhead there is less work to do in between calls to fsync
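
To make that concrete, here is a back-of-envelope sketch (not a measurement) that models commit latency as the number of fsyncs per commit times fsync latency plus everything else. The millisecond values are assumptions for illustration, not values measured here.

# Model: commit latency = n_fsync * fsync_latency + other_work
# With 1 client, throughput is roughly 1 / commit latency.
def speedup(fsync_ms, other_ms):
    t_z12b_sync = 2 * fsync_ms + other_ms   # binlog fsync + InnoDB redo fsync
    t_z12c_sync = 1 * fsync_ms + other_ms   # binlog lives inside InnoDB
    return t_z12b_sync / t_z12c_sync

# fsync_ms and other_ms are assumed values, not measurements
print(round(speedup(fsync_ms=5.0, other_ms=0.5), 2))   # 1.91 -- the model caps the gain near 2X
print(round(speedup(fsync_ms=5.0, other_ms=5.0), 2))   # 1.5  -- more non-fsync work moves it further from 2X

Since the measured gains are between 4X and 7X, this simple model is missing something: with binlog_storage_engine enabled there is less CPU work per commit, and the higher fsync rate suggests less time is spent between fsyncs.
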
Relative to: z12b
col-1 : z12c
col-2 : z12b_sync
col-3 : z12c_sync

col-1   col-2   col-3
1.06    0.01    0.05    delete
1.05    0.01    0.05    insert
1.01    0.12    0.47    read-write_range=100
1.01    0.10    0.44    read-write_range=10
1.03    0.01    0.11    update-index
1.02    0.02    0.12    update-inlist
1.05    0.01    0.06    update-nonindex
1.05    0.01    0.06    update-one
1.05    0.01    0.06    update-zipf
1.01    0.03    0.20    write-only

Results: z12b_sync, z12c_sync

Summary:
  • z12c_sync gets more than 4X the throughput vs z12b_sync. If fsync latency were the only thing that determined performance then I would expect the difference to be ~2X. There is more going on here and below I mention that enabling binlog_storage_engine also reduces the CPU overhead.
  • some per-test data from iostat and vmstat is here and the CPU overhead per operation is much smaller with binlog_storage_engine -- see here for the update-inlist test. In general, when sync-on-commit is enabled then the CPU overhead with binlog_storage_engine enabled is between 1/3 and 2/3 of the overhead without it enabled.
Relative to: z12b_sync
col-1 : z12c_sync

col-1
6.40    delete
5.64    insert
4.06    read-write_range=100
4.40    read-write_range=10
7.64    update-index
7.17    update-inlist
5.73    update-nonindex
5.82    update-one
5.80    update-zipf
6.61    write-only

Sunday, February 15, 2026

HammerDB tproc-c on a large server, Postgres and MySQL

This has results for HammerDB tproc-c on a large server using MySQL and Postgres. I am new to HammerDB and still figuring out how to explain and present results so I will keep this simple and just share graphs without explaining the results.

The comparison might favor Postgres for the IO-bound workloads because I used smaller buffer pools than normal to avoid OOM. I had to do this because RSS for the HammerDB client grows over time as it buffers more response time stats. And while I used buffered IO for Postgres, I used O_DIRECT for InnoDB. So Postgres might have avoided some read IO thanks to the OS page cache while InnoDB did not.

tl;dr for MySQL

  • With vu=40 MySQL 8.4.8 uses about 2X more CPU per transaction and does more than 2X more context switches per transaction compared to Postgres 18.1. I will get CPU profiles soon.
  • Modern MySQL brings us great improvements to concurrency and too many new CPU overheads
    • MySQL 5.6 and 8.4 have similar throughput at the lowest concurrency (vu=10)
    • MySQL 8.4 is a lot faster than 5.6 at the highest concurrency (vu=40)
tl;dr for Postgres
  • Modern Postgres has regressions relative to old Postgres
  • The regressions increase with the warehouse count; at wh=4000 the NOPM drops by between 3% and 13% depending on the virtual user count (vu).
tl;dr for Postgres vs MySQL
  • Postgres and MySQL have similar throughput for the largest warehouse count (wh=4000)
  • Otherwise Postgres gets between 1.4X and 2X more throughput (NOPM)

Builds, configuration and hardware

I compiled Postgres versions from source: 12.22, 13.23, 14.20, 15.15, 16.11, 17.7 and 18.1.

I compiled MySQL versions from source: 5.6.51, 5.7.44, 8.0.45, 8.4.8, 9.4.0 and 9.6.0.

I used a 48-core server from Hetzner
  • an ax162s with an AMD EPYC 9454P 48-Core Processor with SMT disabled
  • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
  • 128G RAM
  • Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic)
Postgres configuration files:
  • prior to v18 the config file is named conf.diff.cx10a50g_c32r128 (x10a_c32r128) and is here for versions 12, 13, 14, 15, 16 and 17.
  • for Postgres 18 I used conf.diff.cx10b_c32r128 (x10b_c32r128) with io_method=sync to be similar to the config used for versions 12 through 17.
MySQL configuration files
  • prior to 9.6 the config file is named my.cnf.cz12a50g_c32r128 (z12a50g_c32r128 or z12a50g) and is here for versions 5.6, 5.7, 8.0 and 8.4
  • for 9.6 it is named my.cnf.cz13a50g_c32r128 (z13a50g_c32r128 or z13a50g) and is here
For both Postgres and MySQL fsync on commit is disabled to avoid turning this into an fsync benchmark. The server has 2 SSDs with SW RAID and low fsync latency.

Benchmark

The benchmark is tproc-c from HammerDB. The tproc-c benchmark is derived from TPC-C.

The benchmark was run for several workloads:
  • vu=10, wh=1000 - 10 virtual users, 1000 warehouses
  • vu=20, wh=1000 - 20 virtual users, 1000 warehouses
  • vu=40, wh=1000 - 40 virtual users, 1000 warehouses
  • vu=10, wh=2000 - 10 virtual users, 2000 warehouses
  • vu=20, wh=2000 - 20 virtual users, 2000 warehouses
  • vu=40, wh=2000 - 40 virtual users, 2000 warehouses
  • vu=10, wh=4000 - 10 virtual users, 4000 warehouses
  • vu=20, wh=4000 - 20 virtual users, 4000 warehouses
  • vu=40, wh=4000 - 40 virtual users, 4000 warehouses
The wh=1000 workloads are less heavy on IO. The wh=4000 workloads are more heavy on IO.

The benchmark for Postgres is run by a variant of this script which depends on scripts here. The MySQL scripts are similar.
  • stored procedures are enabled
  • partitioning is used because the warehouse count is >= 1000
  • a 5 minute rampup is used
  • then performance is measured for 60 minutes
Basic metrics: iostat

I am still improving my helper scripts to report various performance metrics. The table here has average values from iostat during the benchmark run phase for MySQL 8.4.8 and Postgres 18.1. For these configurations the NOPM values for Postgres and MySQL were similar so I won't present normalized values (average value / NOPM) and NOPM is throughput.
  • average wMB/s increases with the warehouse count for Postgres but not for MySQL
  • r/s increases with the warehouse count for Postgres and MySQL
iostat metrics
* r/s = average rate of reads/s from storage
* wMB/s = average MB/s written to storage

my8408
r/s     wMB/s
22833.0 906.2   vu=40, wh=1000
63079.8 1428.5  vu=40, wh=2000
82282.3 1398.2  vu=40, wh=4000

pg181
r/s     wMB/s
30394.9 1261.9  vu=40, wh=1000
59770.4 1267.8  vu=40, wh=2000
78052.3 1272.9  vu=40, wh=4000

Basic metrics: vmstat

I am still improving my helper scripts to report various performance metrics. The table here has average values from vmstat during the benchmark run phase for MySQL 8.4.8 and Postgres 18.1. For these configurations the NOPM values for Postgres and MySQL were similar so I won't present normalized values (average value / NOPM).
  • CPU utilization is almost 2X larger for MySQL
  • Context switch rates are more than 2X larger for MySQL
  • In the future I hope to learn why MySQL uses almost 2X more CPU per transaction and has more than 2X more context switches per transaction relative to Postgres (a rough normalization is sketched after the vmstat tables below)
vmstat metrics
* cs - average value for cs (context switches/s)
* us - average value for us (user CPU)
* sy - average value for sy (system CPU)
* id - average value for id (idle)
* wa - average value for wa (waiting for IO)
* us+sy - sum of us and sy

my8408
cs      us      sy      id      wa      us+sy
455648  61.9    8.2     24.2    5.7     70.1    vu=40, wh=1000
484955  50.4    9.2     19.5    21.0    59.6    vu=40, wh=2000
487410  39.5    8.4     19.4    32.6    48.0    vu=40, wh=4000

pg181
cs      us      sy      id      wa      us+sy
127486  23.5    10.1    63.3    3.0     33.6    vu=40, wh=1000
166257  17.2    11.1    62.5    9.1     28.3    vu=40, wh=2000
203578  13.9    11.3    59.2    15.6    25.2    vu=40, wh=4000
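
As a rough cross-check of the per-transaction claims, the sketch below normalizes the vmstat averages by throughput. It assumes us and sy are percentages of all 48 cores and uses the NOPM values from the Results section below. NOPM counts new orders, so these are per-new-order rather than per-transaction numbers, but the MySQL vs Postgres ratios are the same either way.

CORES = 48

def per_new_order(us_plus_sy, cs_per_sec, nopm):
    new_orders_per_sec = nopm / 60.0
    cpu_sec_per_sec = (us_plus_sy / 100.0) * CORES   # CPU-seconds consumed per second
    return cpu_sec_per_sec / new_orders_per_sec, cs_per_sec / new_orders_per_sec

my_cpu, my_cs = per_new_order(70.1, 455648, 435095)   # MySQL 8.4.8, vu=40, wh=1000
pg_cpu, pg_cs = per_new_order(33.6, 127486, 431648)   # Postgres 18.1, vu=40, wh=1000
print(f"CPU per new order, MySQL/Postgres: {my_cpu / pg_cpu:.2f}")   # ~2.07
print(f"cs per new order, MySQL/Postgres: {my_cs / pg_cs:.2f}")      # ~3.55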

Results

My analysis at this point is simple -- I only consider average throughput. Eventually I will examine throughput over time and efficiency (CPU and IO).

On the charts that follow the y-axis does not start at 0 to improve readability, at the risk of overstating the differences. The y-axis shows relative throughput. There might be a regression when the relative throughput is less than 1.0. There might be an improvement when it is > 1.0. The relative throughput is:
(NOPM for some-version / NOPM for base-version)
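
For example, using values from the absolute NOPM tables below (Postgres 14.20 vs the base version 12.22 at vu=10, wh=1000):

nopm_some = 372015   # Postgres 14.20, vu=10, wh=1000
nopm_base = 353077   # Postgres 12.22, vu=10, wh=1000
print(round(nopm_some / nopm_base, 3))   # 1.054, the value shown in the relative NOPM table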

I provide three charts below:

  • only MySQL - base-version is MySQL 5.6.51
  • only Postgres - base-version is Postgres 12.22
  • Postgres vs MySQL - base-version is Postgres 18.1, some-version is MySQL 8.4.8
Results: MySQL 5.6 to 9.6

Legend:

  • my5651.z12a is MySQL 5.6.51 with the z12a50g config
  • my5744.z12a is MySQL 5.7.44 with the z12a50g config
  • my8045.z12a is MySQL 8.0.45 with the z12a50g config
  • my8408.z12a is MySQL 8.4.8 with the z12a50g config
  • my9600.z13a is MySQL 9.6.0 with the z13a50g config

Summary

  • At the lowest concurrency (vu=10) MySQL 8.4.8 has similar throughput as 5.6.51 because CPU regressions in modern MySQL offset the concurrency improvements.
  • At the highest concurrency (vu=40) MySQL 8.4.8 is much faster than 5.6.51 and the regressions after 5.7 are small. This matches what I have seen elsewhere -- while modern MySQL suffers from CPU regressions it benefits from concurrency improvements. Imagine if we could get those concurrency improvements without the CPU regressions.

And the absolute NOPM values are here:

                my5651  my5744  my8045  my8408  my9600
vu=10, wh=1000  163059  183268  156039  155194  151748
vu=20, wh=1000  210506  321670  283282  281038  279269
vu=40, wh=1000  216677  454743  439589  435095  433618
vu=10, wh=2000  107492  130229  111798  110161  108386
vu=20, wh=2000  155398  225068  193658  190717  189847
vu=40, wh=2000  178278  302723  297236  307504  293217
vu=10, wh=4000   81242  103406   89414   89316   88458
vu=20, wh=4000  131241  179112  155134  152998  152301
vu=40, wh=4000  146809  228554  234922  229511  230557

Results: Postgres 12 to 18

Legend:

  • pg1222 is Postgres 12.22 with the x10a50g config
  • pg1323 is Postgres 13.23 with the x10a50g config
  • pg1420 is Postgres 14.20 with the x10a50g config
  • pg1515 is Postgres 15.15 with the x10a50g config
  • pg1611 is Postgres 16.11 with the x10a50g config
  • pg177 is Postgres 17.7 with the x10a50g config
  • pg181 is Postgres 18.1 with the x10b50g config

Summary

  • Modern Postgres has regressions relative to old Postgres
  • The regressions increase with the warehouse count; at wh=4000 the NOPM drops by between 3% and 13% depending on the virtual user count (vu).


The relative NOPM values are here:

                pg1222  pg1323  pg1420  pg1515  pg1611  pg177   pg181
vu=10, wh=1000  1.000   1.000   1.054   1.042   1.004   1.010   0.968
vu=20, wh=1000  1.000   1.035   1.037   1.028   1.028   1.001   0.997
vu=40, wh=1000  1.000   1.040   0.988   1.000   1.027   0.998   0.970
vu=10, wh=2000  1.000   1.026   1.059   1.075   1.068   1.081   1.029
vu=20, wh=2000  1.000   1.022   1.046   1.043   0.979   0.972   0.934
vu=40, wh=2000  1.000   1.014   1.032   1.036   0.979   1.010   0.947
vu=10, wh=4000  1.000   1.027   1.032   1.035   0.993   0.998   0.974
vu=20, wh=4000  1.000   1.005   1.049   1.048   0.940   0.927   0.876
vu=40, wh=4000  1.000   0.991   1.019   0.983   1.001   0.979   0.937

The absolute NOPM values are here:

                pg1222  pg1323  pg1420  pg1515  pg1611  pg177   pg181
vu=10, wh=1000  353077  353048  372015  367933  354513  356469  341688
vu=20, wh=1000  423565  438456  439398  435454  435288  423986  422397
vu=40, wh=1000  445114  462851  439728  445144  457110  444364  431648
vu=10, wh=2000  223048  228914  236231  239868  238117  241185  229549
vu=20, wh=2000  314380  321380  328688  328044  307728  305452  293627
vu=40, wh=2000  320347  324769  330444  331896  313553  323454  303403
vu=10, wh=4000  162054  166461  167320  167761  160962  161716  157872
vu=20, wh=4000  244598  245804  256593  256231  230037  226844  214309
vu=40, wh=4000  252931  250634  257820  248584  253059  247610  236986

Results: MySQL vs Postgres

Legend:

  • pg181 is Postgres 18.1 with the x10b50g config
  • my8408 is MySQL 8.4.8 with the z12a50g config

Summary

  • Postgres and MySQL have similar throughput for the largest warehouse count (wh=4000)
  • Otherwise Postgres gets between 1.4X and 2X more throughput (NOPM)
The absolute NOPM values are here:

                pg181   my8408
vu=10, wh=1000  341688  155194
vu=20, wh=1000  422397  281038
vu=40, wh=1000  431648  435095
vu=10, wh=2000  229549  110161
vu=20, wh=2000  293627  190717
vu=40, wh=2000  303403  307504
vu=10, wh=4000  157872   89316
vu=20, wh=4000  214309  152998
vu=40, wh=4000  236986  229511

Saturday, February 14, 2026

HammerDB tproc-c on a small server, Postgres and MySQL

This has results for HammerDB tproc-c on a small server using MySQL and Postgres. I am new to HammerDB and still figuring out how to explain and present results so I will keep this simple and just share graphs without explaining the results.

tl;dr

  • Modern Postgres is faster than old Postgres
  • Modern MySQL has large perf regressions relative to old MySQL, and they are worst at low concurrency for CPU-bound workloads. This is similar to what I see on other benchmarks.
  • Modern Postgres is about 2X faster than MySQL at low concurrency (vu=1) and when the workload isn't IO-bound (w=100). But with some concurrency (vu=6) or with more IO per transaction (w=1000, w=2000) they have similar throughput. Note that partitioning is used at w=1000 and 2000 but not at w=100.

Builds, configuration and hardware

I compiled Postgres versions from source: 12.22, 13.23, 14.20, 15.15, 16.11, 17.7 and 18.1.

I compiled MySQL versions from source: 5.6.51, 5.7.44, 8.0.44, 8.4.7, 9.4.0 and 9.5.0.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here.

For versions prior to 18, the config files are named conf.diff.cx10a_c8r32, they are as similar as possible and are here for versions 12, 13, 14, 15, 16 and 17.

For Postgres 18 the config file is named conf.diff.cx10b_c8r32 and adds io_method=sync which matches behavior in earlier Postgres versions.

For MySQL the config files are named my.cnf.cz12a_c8r32 and are here: 5.6.51, 5.7.44, 8.0.4x, 8.4.x and 9.x.0.

For both Postgres and MySQL fsync on commit is disabled to avoid turning this into an fsync benchmark. The server has an SSD with high fsync latency.

Benchmark

The benchmark is tproc-c from HammerDB. The tproc-c benchmark is derived from TPC-C.

The benchmark was run for several workloads:
  • vu=1, w=100 - 1 virtual user, 100 warehouses
  • vu=6, w=100 - 6 virtual users, 100 warehouses
  • vu=1, w=1000 - 1 virtual user, 1000 warehouses
  • vu=6, w=1000 - 6 virtual users, 1000 warehouses
  • vu=1, w=2000 - 1 virtual user, 2000 warehouses
  • vu=6, w=2000 - 6 virtual users, 2000 warehouses
The w=100 workloads are less heavy on IO. The w=1000 and w=2000 workloads are more heavy on IO.

The benchmark for Postgres is run by this script which depends on scripts here. The MySQL scripts are similar.
  • stored procedures are enabled
  • partitioning is used when the warehouse count is >= 1000
  • a 5 minute rampup is used
  • then performance is measured for 120 minutes
Results

My analysis at this point is simple -- I only consider average throughput. Eventually I will examine throughput over time and efficiency (CPU and IO).

On the charts that follow the y-axis does not start at 0 to improve readability, at the risk of overstating the differences. The y-axis shows relative throughput. There might be a regression when the relative throughput is less than 1.0. There might be an improvement when it is > 1.0. The relative throughput is:
(NOPM for some-version / NOPM for base-version)

I provide three charts below:

  • only MySQL - base-version is MySQL 5.6.51
  • only Postgres - base-version is Postgres 12.22
  • Postgres vs MySQL - base-version is Postgres 18.1, some-version is MySQL 8.4.7

Results: MySQL 5.6 to 9.5

Legend:

  • my5651.z12a is MySQL 5.6.51 with the z12a_c8r32 config
  • my5744.z12a is MySQL 5.7.44 with the z12a_c8r32 config
  • my8044.z12a is MySQL 8.0.44 with the z12a_c8r32 config
  • my8407.z12a is MySQL 8.4.7 with the z12a_c8r32 config
  • my9400.z12a is MySQL 9.4.0 with the z12a_c8r32 config
  • my9500.z12a is MySQL 9.5.0 with the z12a_c8r32 config

Summary

  • Perf regressions in MySQL 8.4 are smaller with vu=6 and w >= 1000 -- the cases where there is more concurrency (vu=6) and the workload does more IO per transaction (w=1000 and 2000). Note that partitioning is used at w=1000 and 2000 but not at w=100.
  • Perf regressions in MySQL 8.4 are larger with vu=1 and even more so with w=100 (low concurrency, less IO per transaction).
  • Performance has mostly been dropping from MySQL 5.6 to 8.4. From other benchmarks the problem is from new CPU overheads at low concurrency.
  • While perf regressions in modern MySQL at high concurrency have been less of a problem on other benchmarks, this server is too small to support high concurrency.

Results: Postgres 12 to 18

Legend:

  • pg1222.x10a is Postgres 12.22 with the x10a_c8r32 config
  • pg1323.x10a is Postgres 13.23 with the x10a_c8r32 config
  • pg1420.x10a is Postgres 14.20 with the x10a_c8r32 config
  • pg1515.x10a is Postgres 15.15 with the x10a_c8r32 config
  • pg1611.x10a is Postgres 16.11 with the x10a_c8r32 config
  • pg177.x10a is Postgres 17.7 with the x10a_c8r32 config
  • pg181.x10b is Postgres 18.1 with the x10b_c8r32 config

Summary

  • Modern Postgres is faster than old Postgres

Results: MySQL vs Postgres

Legend:

  • pg181.x10b is Postgres 18.1 with the x10b_c8r32 config
  • my8407.z12a is MySQL 8.4.7 with the z12a_c8r32 config

Summary

  • MySQL and Postgres have similar throughput for vu=6 at w=1000 and 2000. Note that partitioning is used at w=1000 and 2000 but not at w=100.
  • Otherwise Postgres is 2X faster than MySQL

Thursday, January 22, 2026

CPU-bound Insert Benchmark vs Postgres on 24-core and 32-core servers

This has results for Postgres versions 12 through 18 with a CPU-bound Insert Benchmark on 24-core and 32-core servers. A report for MySQL on the same setup is here.

tl;dr

  • good news
    • there are small improvements
    • with the exception of get_actual_variable_range I don't see new CPU overheads in Postgres 18
  • bad news
    • thanks to vacuum there is much variance for insert rates on the l.i1 and l.i2 steps, and for the l.i1 step there are several large write-stalls
Update 1: increasing min_wal_size

I repeated tests for Postgres 18.1 with new configurations, conf.diff.cx10b_walsize_c24r64 and conf.diff.cx10b_walsize_c32r128, that increase min_wal_size to be equal to max_wal_size. This was done based on feedback from a Postgres expert here.

Next, I will try with a new config that uses much smaller values for min_wal_size and max_wal_size, 8G on the 24-core server and 16G on the 32-core server.

On my 24-core/64G-RAM server:
  • conf.diff.cx10b_c24r64
    • shared_buffers = 48G, min_wal_size=32G, max_wal_size=64G
  • conf.diff.cx10b_walsize_c24r64
    • shared_buffers = 48G, min_wal_size=64G, max_wal_size=64G
On my 32-core/128G-RAM server:
  • conf.diff.cx10b_c32r128
    • shared_buffers = 96G, min_wal_size=48G, max_wal_size=96G
  • conf.diff.cx10b_walsize_c32r128
    • shared_buffers = 96G, min_wal_size=96G, max_wal_size=96G
The impact from this change:
  • average performance is a bit worse on several benchmark steps, see here for the 24-core and 32-core servers
  • response time distributions for the write-heavy steps (l.i1, l.i2) are a bit worse for the 24-core and 32-core servers
  • for the l.i1 step the insert rate vs time charts are better for the 24-core server (here and here) and similar for the 32-core server (here and here)
  • for the l.i2 step the insert rate vs time charts are similar for the 24-core (here and here) and 32-core (here and here) servers
  • for the write-heavy test steps the count of WAL files quickly reaches the max and stays there
Update 2: reducing min_wal_size and max_wal_size

I repeated the benchmark using a new config that sets min_wal_size equal to max_wal_size and reduces both. The config name is conf.diff.cx10b_walsize8G_c24r64 for the 24-core server and conf.diff.cx10b_walsize16G_c32r128 for the 32-core server.

On my 24-core/64G-RAM server:
  • conf.diff.cx10b_c24r64
    • shared_buffers = 48G, min_wal_size=32G, max_wal_size=64G
  • conf.diff.cx10b_walsize_c24r64
    • shared_buffers = 48G, min_wal_size=64G, max_wal_size=64G
  • conf.diff.cx10b_walsize8G_c24r64
    • shared_buffers = 48G, min_wal_size=8G, max_wal_size=8G
On my 32-core/128G-RAM server:
  • conf.diff.cx10b_c32r128
    • shared_buffers = 96G, min_wal_size=48G, max_wal_size=96G
  • conf.diff.cx10b_walsize_c32r128
    • shared_buffers = 96G, min_wal_size=96G, max_wal_size=96G
  • conf.diff.cx10b_walsize16G_c32r128
    • shared_buffers = 96G, min_wal_size=16G, max_wal_size=16G
The impact from this change:
  • average throughput drops for l.i1 but increases for l.i2. The increase for l.i2 is likely a side-effect of the decrease for l.i1 as there is less MVCC debt at the start of l.i2 -- see here for the 24-core and 32-core servers
  • response time distributions for the write-heavy steps (l.i1, l.i2) are a bit worse for the 24-core and 32-core servers
  • for the l.i1 step the insert rate vs time charts are worse for the 24-core server (here and here) as there are more frequent write-stalls because checkpoints are more frequent. For the 32-core server (here and here) there are more frequent write-stalls but they are less severe.
  • write-amplification is much larger because checkpoint IO is more frequent. KB written to storage per transaction is more than 4X larger on the 24-core server and more than 2X larger on the 32-core server -- see wkbpi here and here. A back-of-envelope for the checkpoint frequency follows this list.
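
Here is a back-of-envelope sketch for the checkpoint frequency. Postgres starts a checkpoint when the WAL written since the previous checkpoint approaches max_wal_size (or when checkpoint_timeout expires), so shrinking max_wal_size makes checkpoints more frequent at a steady WAL rate. The WAL rate used below is an assumption for illustration, not something measured in this post.

wal_mb_per_sec = 200.0   # assumed WAL write rate during l.i1, not measured here
for name, max_wal_gb in [("cx10b_c24r64 (max_wal_size=64G)", 64),
                         ("cx10b_walsize8G_c24r64 (max_wal_size=8G)", 8)]:
    seconds = max_wal_gb * 1024 / wal_mb_per_sec
    print(f"{name}: a WAL-triggered checkpoint roughly every {seconds:.0f} seconds")
# 64G -> ~328 seconds, 8G -> ~41 seconds: about 8X more checkpoints, so more
# dirty-page writes and more full-page writes, which shows up as the larger wkbpi above
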
Builds, configuration and hardware

I compiled Postgres from source for versions 12.22, 13.22, 13.23, 14.19, 14.20, 15.14, 15.15, 16.10, 16.11, 17.6, 17.7, 18.0 and 18.1.

The servers are:
  • 24-core
    • the server has 24-cores, 2-sockets and 64G of RAM. Storage is 1 NVMe device with ext-4 and discard enabled. The OS is Ubuntu 24.04. Intel HT is disabled.
    • the Postgres conf files are here for versions 12, 13, 14, 15, 16 and 17. These are named conf.diff.cx10a_c24r64 (or x10a).
    • For 18.0 I tried 3 configuration files: the cx10b, cx10c and cx10d variants of conf.diff for this server.
  • 32-core
    • the server has 32-cores and 128G of RAM. Storage is 1 NVMe device with ext-4 and discard enabled. The OS is Ubuntu 24.04. AMD SMT is disabled.
    • the Postgres config files are here for versions 12, 13, 14, 15, 16 and 17. These are named conf.diff.cx10a_c32r128 (or x10a).
    • I used several config files for Postgres 18 (the cx10b, cx10c and cx10d variants)
    The Benchmark

    The benchmark is explained here. It was run with 8 clients on the 24-core server and 12 clients on the 32-core server. The point query (qp100, qp500, qp1000) and range query (qr100, qr500, qr1000) steps are run for 1800 seconds each.

    The benchmark steps are:

    • l.i0
      • insert X rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 250M on the 24-core server and 300M on the 32-core server.
    • l.x
      • create 3 secondary indexes per table. There is one connection per client.
    • l.i1
      • use 2 connections/client. One inserts 4M rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
    • l.i2
      • like l.i1 but each transaction modifies 5 rows (small transactions) and 1M rows are inserted and deleted per table.
      • Wait for S seconds after the step finishes to reduce MVCC GC debt and perf variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
    • qr100
      • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload.
    • qp100
      • like qr100 except uses point queries on the PK index
    • qr500
      • like qr100 but the insert and delete rates are increased from 100/s to 500/s
    • qp500
      • like qp100 but the insert and delete rates are increased from 100/s to 500/s
    • qr1000
      • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
    • qp1000
      • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
    Results: overview

    For each server there are two performance reports
    • latest point releases
      • has results for the latest point release I tested from each major release
      • the base version is Postgres 12.22 when computing relative QPS
    • all releases
      • has results for all of the versions I tested
      • the base version is Postgres 12.22 when computing relative QPS
    Results: summary

    The performance reports are here for the 24-core and 32-core servers.
    The summary sections from the performance reports have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

    I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from the base version. The base version is Postgres 12.22.

    When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
    • insert/s for l.i0, l.i1, l.i2
    • indexed rows/s for l.x
    • range queries/s for qr100, qr500, qr1000
    • point queries/s for qp100, qp500, qp1000
    Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements.

    I often use context switch rates as a proxy for mutex contention.

    Results: latest point releases

    The summaries are here for the 24-core and 32-core servers.

    The tables have relative throughput: (QPS for my version / QPS for Postgres 12.22). Values less than 0.95 have a yellow background. Values greater than 1.05 have a blue background.

    From the 24-core server:

    • there are small improvements on the l.i1 (write-heavy) step. I don't see regressions.
    • thanks to vacuum, there is much variance for insert rates on the l.i1 and l.i2 steps. For the l.i1 step there are also several large write-stalls, with many stalls of up to 6 seconds.
    • the overhead from get_actual_variable_range increased by 10% from Postgres 14 to 18. Eventually that hurts performance.
    • with the exception of get_actual_variable_range I don't see new CPU overheads in Postgres 18

    dbms                          l.i0   l.x    l.i1   l.i2   qr100  qp100  qr500  qp500  qr1000 qp1000
    pg1222_o2nofp.cx10a_c24r64    1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
    pg1322_o2nofp.cx10a_c24r64    1.03   0.97   1.02   1.02   1.01   1.02   1.00   1.01   1.00   1.02
    pg1419_o2nofp.cx10a_c24r64    0.98   0.95   1.10   1.07   1.01   1.01   1.01   1.01   1.01   1.01
    pg1515_o2nofp.cx10a_c24r64    1.02   1.02   1.08   1.05   1.01   1.02   1.01   1.02   1.01   1.02
    pg1611_o2nofp.cx10a_c24r64    1.02   0.98   1.04   0.98   1.02   1.02   1.02   1.02   1.02   1.02
    pg177_o2nofp.cx10a_c24r64     1.02   0.98   1.07   0.99   1.02   1.02   1.02   1.02   1.02   1.02
    pg181_o2nofp.cx10b_c24r64     1.02   1.00   1.06   0.97   1.00   1.01   1.00   1.00   1.00   1.01

    From the 32-core server:

    • there are small improvements for the l.x (index create) step.
    • there might be small regressions for the l.i2 (random writes) step
    • thanks to vacuum, there is much variance for insert rates on the l.i1 and l.i2 steps. For the l.i1 step there are also several large write-stalls.
    • the overhead from get_actual_variable_range increased by 10% from Postgres 14 to 18. That might explain the small decrease in throughput for l.i2.
    • with the exception of get_actual_variable_range I don't see new CPU overheads in Postgres 18
    dbms                          l.i0   l.x    l.i1   l.i2   qr100  qp100  qr500  qp500  qr1000 qp1000
    pg1222_o2nofp.cx10a_c32r128   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
    pg1323_o2nofp.cx10a_c32r128   0.89   0.96   1.00   0.93   1.00   1.00   1.00   0.99   1.00   1.00
    pg1420_o2nofp.cx10a_c32r128   0.96   0.98   1.02   0.95   1.02   0.99   1.01   0.99   1.01   0.99
    pg1515_o2nofp.cx10a_c32r128   1.01   1.00   0.97   0.97   1.00   0.99   1.00   0.99   1.00   0.99
    pg1611_o2nofp.cx10a_c32r128   0.99   1.02   0.98   0.94   1.01   1.00   1.01   1.00   1.01   1.00
    pg177_o2nofp.cx10a_c32r128    0.98   1.06   1.00   0.98   1.02   1.00   1.02   0.99   1.02   0.99
    pg181_o2nofp.cx10b_c32r128    0.99   1.06   1.01   0.95   1.02   0.99   1.02   0.99   1.02   0.99


    Results: all releases

    The summaries are here for the 24-core and 32-core servers.

    From the 24-core server:
    • there are small improvements on the l.i1 (write-heavy) step. I don't see regressions.
    • io_method=worker and io_method=io_uring don't help here, and I don't expect them to help
    dbms                          l.i0   l.x    l.i1   l.i2   qr100  qp100  qr500  qp500  qr1000 qp1000
    pg1222_o2nofp.cx10a_c24r64    1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
    pg1322_o2nofp.cx10a_c24r64    1.03   0.97   1.02   1.02   1.01   1.02   1.00   1.01   1.00   1.02
    pg1419_o2nofp.cx10a_c24r64    0.98   0.95   1.10   1.07   1.01   1.01   1.01   1.01   1.01   1.01
    pg1514_o2nofp.cx10a_c24r64    1.02   0.98   1.02   0.88   1.01   1.01   1.01   1.01   1.01   1.01
    pg1515_o2nofp.cx10a_c24r64    1.02   1.02   1.08   1.05   1.01   1.02   1.01   1.02   1.01   1.02
    pg1610_o2nofp.cx10a_c24r64    1.02   1.00   1.05   0.93   1.02   1.02   1.02   1.02   1.01   1.02
    pg1611_o2nofp.cx10a_c24r64    1.02   0.98   1.04   0.98   1.02   1.02   1.02   1.02   1.02   1.02
    pg176_o2nofp.cx10a_c24r64     1.02   1.02   1.06   0.97   1.03   1.02   1.03   1.02   1.02   1.02
    pg177_o2nofp.cx10a_c24r64     1.02   0.98   1.07   0.99   1.02   1.02   1.02   1.02   1.02   1.02
    pg180_o2nofp.cx10b_c24r64     1.01   1.02   1.05   0.92   1.02   1.02   1.01   1.01   1.01   1.02
    pg180_o2nofp.cx10c_c24r64     1.00   1.02   1.06   0.89   1.01   1.01   1.01   1.01   1.01   1.01
    pg180_o2nofp.cx10d_c24r64     1.00   1.00   1.05   0.94   1.02   1.01   1.01   1.01   1.01   1.01
    pg181_o2nofp.cx10b_c24r64     1.02   1.00   1.06   0.97   1.00   1.01   1.00   1.00   1.00   1.01
    pg181_o2nofp.cx10d_c24r64     1.02   1.00   1.06   0.92   1.00   1.01   1.00   1.00   0.99   1.01


    From the 32-core server:
    • there are small improvements for the l.x (index create) step.
    • there might be small regressions for the l.i2 (random writes) step
    • io_method=worker and io_method=io_uring don't help here, and I don't expect them to help
    dbms                          l.i0   l.x    l.i1   l.i2   qr100  qp100  qr500  qp500  qr1000 qp1000
    pg1222_o2nofp.cx10a_c32r128   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
    pg1322_o2nofp.cx10a_c32r128   1.00   0.96   0.99   0.90   1.01   1.00   1.01   1.00   1.01   1.00
    pg1323_o2nofp.cx10a_c32r128   0.89   0.96   1.00   0.93   1.00   1.00   1.00   0.99   1.00   1.00
    pg1419_o2nofp.cx10a_c32r128   0.97   0.96   0.99   0.91   1.02   0.99   1.01   0.99   1.01   0.99
    pg1420_o2nofp.cx10a_c32r128   0.96   0.98   1.02   0.95   1.02   0.99   1.01   0.99   1.01   0.99
    pg1514_o2nofp.cx10a_c32r128   0.98   1.02   0.95   0.92   1.01   1.00   1.01   1.00   1.02   1.00
    pg1515_o2nofp.cx10a_c32r128   1.01   1.00   0.97   0.97   1.00   0.99   1.00   0.99   1.00   0.99
    pg1610_o2nofp.cx10a_c32r128   0.98   1.00   1.00   0.89   1.01   1.00   1.01   1.00   1.01   1.00
    pg1611_o2nofp.cx10a_c32r128   0.99   1.02   0.98   0.94   1.01   1.00   1.01   1.00   1.01   1.00
    pg176_o2nofp.cx10a_c32r128    1.00   1.06   1.02   0.91   1.02   1.00   1.01   1.00   1.02   1.00
    pg177_o2nofp.cx10a_c32r128    0.98   1.06   1.00   0.98   1.02   1.00   1.02   0.99   1.02   0.99
    pg180_o2nofp.cx10b_c32r128    1.00   1.06   1.04   0.92   1.00   0.99   1.00   0.99   1.00   0.99
    pg180_o2nofp.cx10c_c32r128    0.99   1.06   1.01   0.96   1.00   0.99   1.00   0.99   1.00   0.99
    pg180_o2nofp.cx10d_c32r128    0.99   1.06   1.00   0.94   1.00   0.99   1.00   0.99   1.00   0.99
    pg181_o2nofp.cx10b_c32r128    0.99   1.06   1.01   0.95   1.02   0.99   1.02   0.99   1.02   0.99
    pg181_o2nofp.cx10d_c32r128    0.98   1.06   1.01   0.93   1.00   0.99   1.00   0.99   1.00   0.99