Wednesday, March 18, 2026

MariaDB innovation: binlog_storage_engine, 32-core server, Insert Benchmark

MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB). This work was done by Small Datum LLC and sponsored by the MariaDB Foundation.

My previous post had results for sysbench with a small server. This post has results for the Insert Benchmark with a large (32-core) server. Both servers use an SSD that has high fsync latency. This is probably a best-case comparison for the feature. If you really care, then get enterprise SSDs with power loss protection. But you might encounter high fsync latency on public cloud servers.

While throughput improves with the InnoDB doublewrite buffer disabled, I am not suggesting people do that for production workloads without understanding the risks it creates.

tl;dr for a CPU-bound workload

  • throughput for write-heavy steps is larger with the InnoDB doublewrite buffer disabled
  • throughput for write-heavy steps is much larger with the binlog storage engine enabled
  • throughput for write-heavy steps is largest with both the binlog storage engine enabled and the InnoDB doublewrite buffer disabled. In this case it was up to 8.9X larger.
tl;dr for an IO-bound workload
  • see the tl;dr above
  • the best throughput comes from enabling the binlog storage engine and disabling the InnoDB doublewrite buffer. In this case it was up to 3.26X larger.
Builds, configuration and hardware

I compiled MariaDB 12.3.1 from source.

The server has 32-cores and 128G of RAM. Storage is 1 NVMe device with ext-4 and discard enabled. The OS is Ubuntu 24.04. AMD SMT is disabled. The SSD has high fsync latency.

I tried 4 my.cnf files: z12b_sync, z12c_sync, z12b_sync_dw0 and z12c_sync_dw0.
The Benchmark

The benchmark is explained here. It was run with 12 clients for two workloads:
  • CPU-bound - the database is cached by InnoDB, but there is still much write IO
  • IO-bound - most, but not all, benchmark steps are IO-bound
The benchmark steps are:

  • l.i0
    • insert XM rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 10M for CPU-bound and 300M for IO-bound.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts XM rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. X is 16M for CPU-bound and 4M for IO-bound.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and YM rows are inserted and deleted per table. Y is 4M for CPU-bound and 1M for IO-bound.
    • Wait for S seconds after the step finishes to reduce MVCC GC debt and perf variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload. This step runs for 1800 seconds.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: summary

The performance reports are here for CPU-bound and IO-bound.

The summary sections from the performance reports have 3 tables. The first shows absolute throughput per DBMS tested and benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes from config to config. The third table makes it easy to see which DBMS+configs failed to meet the SLA. And from the third table for the IO-bound workload I see that there were failures to meet the SLA for qp500, qr500, qp1000 and qr1000.

I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from the base version.

When relative QPS is > 1.0 then performance is better than the base case. When it is < 1.0 then there are regressions. The Q in relative QPS measures:
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
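The relative QPS computation above is just a ratio; as a trivial sketch with hypothetical numbers (not results from this report):

```python
# Relative QPS = QPS for $me / QPS for $base. Numbers here are
# hypothetical, not results from this report.
def relative_qps(qps_me: float, qps_base: float) -> float:
    return qps_me / qps_base

assert relative_qps(2800.0, 1000.0) == 2.8   # improvement
assert relative_qps(500.0, 1000.0) == 0.5    # regression
```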
Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements.

I often use context switch rates as a proxy for mutex contention.

Results: CPU-bound

The summary is here.

Some of the improvements here are huge courtesy of storage with high fsync latency.

Throughput is much better with the binlog storage engine enabled when the InnoDB doublewrite buffer is also enabled. Comparing z12b_sync and z12c_sync (z12c_sync uses the binlog storage engine):
  • throughput for l.i0 (load in PK order) is 3.63X larger for z12c_sync
  • throughput for l.i1 (write-only, larger transactions) is 2.80X larger for z12c_sync
  • throughput for l.i2 (write-only, smaller transactions) is 8.13X larger for z12c_sync
There is a smaller benefit from only disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12b_sync_dw0:
  • throughput for l.i0 (load in PK order) is the same for z12b_sync and z12b_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 1.14X larger for z12b_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 1.93X larger for z12b_sync_dw0
The largest benefits come from using the binlog storage engine and disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12c_sync_dw0:
  • throughput for l.i0 (load in PK order) is 3.61X larger for z12c_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 3.03X larger for z12c_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 8.90X larger for z12c_sync_dw0
Results: IO-bound

The summary is here.

For the read-write steps the insert SLA was not met for qr500, qp500, qr1000 and qp1000 as those steps needed more IOPs than the storage device can provide. So I ignore those steps.

Some of the improvements here are huge courtesy of storage with high fsync latency.

Throughput is much better with the binlog storage engine enabled when the InnoDB doublewrite buffer is also enabled. Comparing z12b_sync and z12c_sync (z12c_sync uses the binlog storage engine):
  • throughput for l.i0 (load in PK order) is 3.05X larger for z12c_sync
  • throughput for l.i1 (write-only, larger transactions) is 1.22X larger for z12c_sync
  • throughput for l.i2 (write-only, smaller transactions) is 1.58X larger for z12c_sync
There is a smaller benefit from only disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12b_sync_dw0:
  • throughput for l.i0 (load in PK order) is the same for z12b_sync and z12b_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 2.06X larger for z12b_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 1.59X larger for z12b_sync_dw0
The largest benefits come from using the binlog storage engine and disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12c_sync_dw0:
  • throughput for l.i0 (load in PK order) is 3.01X larger for z12c_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 3.26X larger for z12c_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 2.78X larger for z12c_sync_dw0









MariaDB innovation: binlog_storage_engine, 48-core server, Insert Benchmark

MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB). See this blog post for more details on the new feature. This work was done by Small Datum LLC and sponsored by the MariaDB Foundation.

My previous post had results for sysbench with a small server. This post has results for the Insert Benchmark with a large (48-core) server. Storage on this server has a low fsync latency while the small server has high fsync latency.

In this test throughput doesn't improve with the InnoDB doublewrite buffer disabled. Even if it did I am not suggesting people do that for production workloads without understanding the risks it creates.

tl;dr

  • binlog storage engine makes some things better without making other things worse
  • binlog storage engine doesn't make all write-heavy steps faster because the commit path isn't the bottleneck in all cases on a server with storage that has low fsync latency

tl;dr for a CPU-bound workload

  • the l.i0 step (load in PK order) is ~1.3X faster with binlog storage engine
  • the l.i2 step (write-only with smaller transactions) is ~1.5X faster with binlog storage engine
tl;dr for an IO-bound workload
  • the l.i0 step (load in PK order) is ~1.08X faster with binlog storage engine
Builds, configuration and hardware

I compiled MariaDB 12.3.1 from source.

The server has 48-cores and 128G of RAM. Storage is 2 NVMe devices with ext-4, discard enabled and RAID. The OS is Ubuntu 22.04. AMD SMT is disabled. The SSD has low fsync latency.

I tried 4 my.cnf files:
  • z12b_sync
  • z12c_sync
    • my.cnf.cz12c_sync_c32r128 (z12c_sync) is like cz12c except it enables sync-on-commit for InnoDB. Note that InnoDB is used to store the binlog so there is nothing else to sync on commit.
  • z12b_sync_dw0
  • z12c_sync_dw0
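As a sketch, the settings these four configs vary look roughly like the my.cnf fragment below. The option names are my assumptions (binlog_storage_engine per this post; the InnoDB options are standard), so verify them against the MariaDB documentation before copying:

```ini
# Sketch of a z12c_sync_dw0-style config (option names assumed):
[mysqld]
binlog_storage_engine = innodb      # store the binlog in InnoDB
innodb_flush_log_at_trx_commit = 1  # sync-on-commit for InnoDB
innodb_doublewrite = OFF            # the _dw0 configs disable doublewrite
```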
The Benchmark

The benchmark is explained here. It was run with 20 clients for two workloads:
  • CPU-bound - the database is cached by InnoDB, but there is still much write IO
  • IO-bound - most, but not all, benchmark steps are IO-bound
The benchmark steps are:

  • l.i0
    • insert XM rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 10M for CPU-bound and 200M for IO-bound.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts XM rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. X is 40M for CPU-bound and 4M for IO-bound.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and YM rows are inserted and deleted per table. Y is 10M for CPU-bound and 1M for IO-bound.
    • Wait for S seconds after the step finishes to reduce MVCC GC debt and perf variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload. This step runs for 3600 seconds.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: summary

The performance reports are here for CPU-bound and IO-bound.

The summary sections from the performance reports have 3 tables. The first shows absolute throughput per DBMS tested and benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes from config to config. The third table makes it easy to see which DBMS+configs failed to meet the SLA. And from the third table for the IO-bound workload I see that there were failures to meet the SLA for qp500, qr500, qp1000 and qr1000.

I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from the base version.

When relative QPS is > 1.0 then performance is better than the base case. When it is < 1.0 then there are regressions. The Q in relative QPS measures:
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements.

I often use context switch rates as a proxy for mutex contention.

Results: CPU-bound

The summary is here.
  • Disabling the InnoDB doublewrite buffer doesn't improve performance.
With and without the InnoDB doublewrite buffer enabled, enabling the binlog storage engine improves throughput a lot for two of the write-heavy steps while there are only small changes on the other two write-heavy steps:
  • l.i0, load in PK order, gets ~1.3X more throughput
    • when the binlog storage engine is enabled (see here)
      • storage writes per insert (wpi) are reduced by about 1/2
      • KB written to storage per insert (wkbpi) is a bit smaller
      • context switches per insert (cspq) are reduced by about 1/3
  • l.x, create secondary indexes, is unchanged
    • when the binlog storage engine is enabled (see here)
      • storage writes per insert (wpi) are reduced by about 4/5
      • KB written to storage per insert (wkbpi) are reduced almost in half
      • context switches per insert (cspq) are reduced by about 1/4
  • l.i1, write-only with larger transactions, is unchanged
  • l.i2, write-only with smaller transactions, gets ~1.5X more throughput
Results: IO-bound

The summary is here.
  • Disabling the InnoDB doublewrite buffer doesn't improve performance.
  • For the read-write steps the insert SLA was not met for qr500, qp500, qr1000 and qp1000 as those steps needed more IOPs than the storage devices can provide. So I ignore those steps.
  • Enabling the InnoDB doublewrite buffer improves throughput by ~1.25X on the l.i2 step (write-only with smaller transactions) but doesn't change performance on the other steps.
    • as expected there is a large reduction in KB written to storage (see wkbpi here)
  • Enabling the binlog storage engine improves throughput by 9% and 8% on the l.i0 step (load in PK order) but doesn't have a significant impact on other steps.
    • with the binlog storage engine there is a large reduction in storage writes per insert (wpi), a small reduction in KB written to storage per insert (wkbpi) and small increases in CPU per insert (cpupq) and context switches per insert (cspq) -- see here
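The normalized counters referenced above (wpi, wkbpi, cspq) are totals divided by the insert count. A minimal sketch with made-up totals (not measurements from this report):

```python
# Made-up totals (not measurements from this report) to show how the
# normalized counters are computed: totals divided by the insert count.
def per_insert(total_writes: int, total_kb_written: int,
               total_cs: int, inserts: int) -> dict:
    return {
        "wpi": total_writes / inserts,        # storage writes per insert
        "wkbpi": total_kb_written / inserts,  # KB written per insert
        "cspq": total_cs / inserts,           # context switches per insert
    }

m = per_insert(total_writes=5_000_000, total_kb_written=80_000_000,
               total_cs=2_500_000, inserts=10_000_000)
assert m == {"wpi": 0.5, "wkbpi": 8.0, "cspq": 0.25}
```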



Monday, March 16, 2026

CPU efficiency for MariaDB, MySQL and Postgres on TPROC-C with a small server

I started to use TPROC-C from HammerDB to test MariaDB, MySQL and Postgres and published results for MySQL and Postgres on small and large servers. This post provides more detail on CPU overheads for MariaDB, MySQL and Postgres on a small server.

tl;dr

  • Postgres gets the most throughput and the difference is large.
  • MariaDB gets more throughput than MySQL
  • Throughput improves for MariaDB and MySQL but not for Postgres when stored procedures are enabled. It is possible that the stored procedure support in MariaDB and MySQL is more CPU efficient than in Postgres. The HammerDB author explained that HammerDB uses server-side functions with Postgres when stored procs are disabled. That might explain why there isn't much of a benefit.
  • Postgres uses ~2X to ~4X more CPU for background tasks than InnoDB but it is doing between 1.5X and 3X more writes so were I to normalize that CPU overhead (from vacuum) it might be similar to MySQL and MariaDB. Regardless, the total amount of CPU for background tasks is not significant relative to other CPU consumers.
Builds, configuration and hardware

I compiled everything from source: MariaDB 11.8.6, MySQL 8.4.8 and Postgres 18.2.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here.

For Postgres 18 the config file is named conf.diff.cx10b_c8r32 and adds io_method='sync' which matches behavior in earlier Postgres versions.

For MySQL the config file is named my.cnf.cz12a_c8r32.

For MariaDB the config file is named my.cnf.cz12b_c8r32.

For all DBMS fsync on commit is disabled to avoid turning this into an fsync benchmark. The server has an SSD with high fsync latency.

Benchmark

The benchmark is tproc-c from HammerDB. The tproc-c benchmark is derived from TPC-C.

The benchmark was run for one workload, the working set is cached and there is only one user:
  • vu=1, w=100 - 1 virtual user, 100 warehouses
The test was repeated with stored procedure support in HammerDB enabled and then disabled. For my previous results it was always enabled. I did this to understand the impact of stored procedures. While they are great for workloads with much concurrency because they reduce lock-hold durations, the workload here did not have much concurrency. That helps me understand the CPU efficiency of stored procedures.

The benchmark for Postgres is run by this script which depends on scripts here. The MySQL scripts are similar.
  • stored procedures are enabled
  • partitioning is used when the warehouse count is >= 1000
  • a 5 minute rampup is used
  • then performance is measured for 120 minutes
Results: NOPM

The numbers in the table below are the NOPM (throughput) for TPROC-C.

Summary
  • Postgres sustains the most throughput with and without stored procedures
  • MariaDB sustains more throughput than MySQL
  • Stored procedures help MariaDB and MySQL, but do not improve Postgres throughput
Legend:
* sp0 - stored procedures disabled
* sp1 - stored procedures enabled

sp0     sp1
11975   19281   MariaDB 11.8.6
 9400   16874   MySQL 8.4.8
33261   33679   Postgres 18.2

Results: vmstat

The following is computed from a sample of ~1000 lines of vmstat output collected from the middle of the benchmark run. 
  • The ratio of us to sy is almost 2X larger in Postgres than in MariaDB and MySQL with stored procedures disabled. But the ratios are similar with stored procedures enabled.
  • The context switch rate is about 5X larger in MariaDB and MySQL vs Postgres with stored procedures disabled before normalizing by throughput; with normalization the difference would be even larger. But the difference is smaller with stored procedures enabled.
  • Postgres has better throughput because MariaDB and MySQL use more CPU per NOPM. The difference is larger with stored procedures disabled. Perhaps the stored procedure evaluator in MariaDB and MySQL is more efficient than in Postgres.
Legend:
* r - average value for the r column, runnable tasks
* cs - average value for the cs column, context switches/s
* us, sy - average value for the us and sy columns, user and system CPU utilization/s
* us+sy - average value for the sum of us and sy
* cpuPer - ((us+sy) / NOPM) * 1000, smaller is better

--- sp0
r       cs      us      sy      us+sy   cpuPer
1.112   54786   10.0    3.1     13.2    1.102   MariaDB 11.8.6
1.130   65413   10.8    2.9     13.7    1.457   MySQL 8.4.8
1.206   11266   12.2    1.9     14.1    0.423   Postgres 18.2

--- sp1
r       cs      us      sy      us+sy   cpuPer
1.079   11739   12.0    1.2     13.1    0.679   MariaDB 11.8.6
1.043   14698   12.0    1.0     13.0    0.770   MySQL 8.4.8
1.107    9776   12.4    1.4     13.8    0.409   Postgres 18.2
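The cpuPer column can be reproduced directly from the table values. A minimal sketch using the sp0 rows above (other rows may differ slightly from rounding):

```python
# cpuPer = ((us + sy) / NOPM) * 1000, smaller is better. Reproducing it
# from the us+sy and NOPM values reported above (sp0 rows).
def cpu_per(us_plus_sy: float, nopm: float) -> float:
    return (us_plus_sy / nopm) * 1000.0

assert round(cpu_per(13.2, 11975), 3) == 1.102  # MariaDB 11.8.6, sp0
assert round(cpu_per(13.7, 9400), 3) == 1.457   # MySQL 8.4.8, sp0
```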

Results: flamegraphs with stored procedures

The flamegraphs are here.

The following tables summarize CPU time based on the percentage of samples that can be mapped to various tasks and processes. Note that these are percentages of samples, not rates normalized by throughput. So MySQL and Postgres can have similar distributions of CPU time per area even when Postgres gets 2X or 3X more throughput.

Summary
  • the CPU distributions by area are mostly similar for MariaDB, MySQL and Postgres
  • Postgres uses 2X to 4X more CPU for background work (vacuum)
Legend
* client - time in the HammerDB benchmark client
* swap - time in kernel swap code
* db-fg - time running statements in the DBMS for the client
* db-bg - time doing background work in the DBMS

- Total
        MariaDB MySQL   Postgres
client   4.62    5.70    6.82
swap     5.15    7.09    5.55
db-fg   86.83   83.89   79.43
db-bg    1.43    3.00   ~6.x

- Limited to db-fg, excludes Postgres because the data is messy
        MariaDB MySQL
update  22.39   21.49
insert   6.17    5.82
select  23.43   18.04
commit   ~4.0    ~5.0
parse   ~10.0   ~10.0

Results: flamegraphs without stored procedures

The flamegraphs are here.

Summary
  • the CPU distributions by area are mostly similar for MariaDB and MySQL
  • Postgres uses 2X to 4X more CPU for background work (vacuum)
Legend
* client - time in the HammerDB benchmark client
* swap - time in kernel swap code
* db-fg - time running statements in the DBMS for the client
* db-bg - time doing background work in the DBMS

- Total
        MariaDB MySQL   Postgres
client  14.29   11.92    7.52
swap    13.18   15.58    5.80
db-fg   70.39   70.14   77.24
db-bg   ~1.0    ~2.0     6.55

- Limited to db-fg, excludes Postgres because the data is messy
        MariaDB MySQL
update  14.19   12.04
insert   4.29    3.57
select  18.14   11.96
prepare   NA     6.73
commit  ~2.0     2.64
parse    8.86    9.73
network 15.47   13.73

For MySQL parse, 2.5% was from pfs_digest_end_vc and children.












Friday, March 6, 2026

The first rule of database fight club: admit nothing

I am fascinated by tech marketing but would be lousy at it.

A common practice is to admit nothing -- my product, project, company, idea is perfect. And I get it because admitting something isn't perfect just provides fodder for marketing done by the other side, and that marketing is often done in bad faith.

But it is harder to fix things when you don't acknowledge the problems. I wrote about this in 2019 and this post builds on that one.

In the MySQL community we did a good job of acknowledging problems -- sometimes too good. For a long time as an external contributor I filed many bug reports, fixed some bugs myself and then spent much time marketing open bugs that I hoped would be fixed by upstream. Upstream wasn't always happy about my marketing, sometimes there was much snark, but snark was required because there was a large wall between upstream and the community. I amplified the message to be heard.

My take is that the MySQL community was more willing than the Postgres community to acknowledge problems. I have theories about that and I think several help to explain this:

  • Not all criticism is valid
    • While I spend much time with Postgres on benchmarks I don't use it in production. I try to be fair and limit my feedback to things where I have sweat equity, but my perspective is skewed. This doesn't mean my feedback is wrong but my context is different. And sometimes my feedback is wrong.
  • Bad faith
    • Some criticism is done in bad faith. By bad faith I mean that truth takes a back seat to scoring points. A frequent source of Postgres criticism is done to promote another DBMS. Recently I have seen much anti-Postgres marketing from MongoDB. I assume they encounter Postgres as competition more than they used to.
  • Good faith gone bad
    • Sometimes criticism given in good faith will be repackaged by others and used in bad faith. This happens with some of the content from my blog posts. I try to make this less likely by burying the lead in the details but it still happens.
  • MySQL was more popular than Postgres until recently. 
    • Perhaps people didn't like that MySQL was getting most of the attention and admitting flaws might not help with adoption. But today the attention has shifted to Postgres so this justification should end. I still remember my amusement at a Postgres conference long ago when the speaker claimed that MySQL doesn't do web-scale. Also amusing was being told that Postgres didn't need per-page checksums because you should just use ZFS to get similar protection.
  • Single-vendor vs community
    • MySQL is a single-vendor project currently owned by Oracle. At times that enables an us vs them mentality (community vs corporation). The corporation develops the product and it is often difficult for the community to contribute. So it was easy to complain about problems, because the corporation was responsible for fixing them.
    • Postgres is developed by the community. There is no us vs them here and the community is more reluctant to criticize the product (Postgres). This is human nature and I see variants of it at work -- my work colleagues are far more willing to be critical of open-source projects we used at work than they were to be critical of the many internally developed projects. 

Monday, February 23, 2026

MariaDB innovation: vector index performance

Last year I shared many posts documenting MariaDB performance for vector search using ann-benchmarks. Performance was great in MariaDB 11 and this blog post explains that it is even better in MariaDB 12. This work was done by Small Datum LLC and sponsored by the MariaDB Foundation. My previous posts were published in January and February 2025.

tl;dr

  • Vector search recall vs precision in MariaDB 12.3 is better than in MariaDB 11.8
  • Vector search recall vs precision in MariaDB 11.8 is better than in Postgres 18.2 with pgvector 0.8.1
  • The improvements in MariaDB 12.3 are more significant for larger datasets
  • MariaDB 12.3 has the best results because it uses less CPU per query. This is confirmed by running vmstat in the background.
Benchmark

This post has much more detail about my approach. I ran the benchmark for 1 session. I use ann-benchmarks via my fork of a fork of a fork at this commit.  The ann-benchmarks config files are here for MariaDB and for Postgres.

This time I used the dbpedia-openai-X-angular tests for X in 100k, 500k and 1000k.

For hardware I used a larger server (Hetzner ax162-s) with 48 cores, 128G of RAM, Ubuntu 22.04 and HW RAID 10 using 2 NVMe devices. 

For databases I used:
  • MariaDB versions 11.8.5 and 12.3.0 with this config file. Both were compiled from source. 
  • Postgres 18.2 with pgvector 0.8.1 with this config file. These were compiled from source. For Postgres tests were run with and without halfvec (float16).
I had ps and vmstat running during the benchmark and confirmed there weren't storage reads as the table and index were cached by MariaDB and Postgres.

The command lines to run the benchmark using my helper scripts are:
    bash rall.batch.sh v1 dbpedia-openai-100k-angular c32r128
    bash rall.batch.sh v1 dbpedia-openai-500k-angular c32r128
    bash rall.batch.sh v1 dbpedia-openai-1000k-angular c32r128

Results: dbpedia-openai-100k-angular

Summary
  • MariaDB 12.3 has the best results
  • the difference between MariaDB 12.3 and 11.8 is smaller here than it is below for 500k and 1000k
Results: dbpedia-openai-500k-angular

Summary
  • MariaDB 12.3 has the best results
  • the difference between MariaDB 12.3 and 11.8 is larger here than above for 100k
Results: dbpedia-openai-1000k-angular

Summary
  • MariaDB 12.3 has the best results
  • the difference between MariaDB 12.3 and 11.8 is larger here than it is above for 100k and 500k


Wednesday, February 18, 2026

Explaining why throughput varies for Postgres with a CPU-bound Insert Benchmark

Throughput for the write-heavy steps of the Insert Benchmark looks like a distorted sine wave with Postgres on CPU-bound workloads but not on IO-bound workloads. For the CPU-bound workloads the chart for max response time at N-second intervals for inserts is flat but for deletes it looks like the distorted sine wave. To see the chart for deletes, scroll down from here. So this looks like a problem for deletes and this post starts to explain that.

tl;dr

  • Once again, blame vacuum

History of the Insert Benchmark

Long ago (prior to 2010) the Insert Benchmark was published by Tokutek to highlight things that the TokuDB storage engine was great at. I was working on MySQL at Google at the time and the benchmark was useful to me, however it was written in C++. While the Insert Benchmark is great at showing the benefits of an LSM storage engine, this was years before MyRocks and I was only doing InnoDB at the time, on spinning disks. So I rewrote it in Python to make it easier to modify, and then the Tokutek team improved a few things about my rewrite, and I have been enhancing it slowly since then.

Until a few years ago the steps of the benchmark were:

  • load - insert in PK order
  • create 3 secondary indexes
  • do more inserts as fast as possible
  • do rate-limited inserts concurrent with range and point queries
The problem with this approach is that the database size grows forever and that limited how long I could run the benchmark before running out of storage. So I changed it and the new approach keeps the database at a fixed size after the load. The new workflow is:
  • load - insert in PK order
  • create 3 secondary indexes
  • do inserts+deletes at the same rate, as fast as possible
  • do rate-limited inserts+deletes at the same rate concurrent with range and point queries
The insert and delete statements run at the same rate to keep the table from changing size. The Insert Benchmark client uses Python multiprocessing: one process does Insert statements, another does Delete statements and both get their work from queues. Another process populates those queues, and by controlling what is put on each queue it keeps inserts and deletes running at the same rate.
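That rate-matching can be sketched with standard library queues. The real client uses Python multiprocessing; this single-process sketch only shows a coordinator keeping both queues in lockstep:

```python
from queue import Queue

# Single-process sketch (the real client uses multiprocessing): one
# coordinator puts matching batches on the insert and delete queues so
# the two workers consume work at the same rate.
insert_q: Queue = Queue()
delete_q: Queue = Queue()

def enqueue_batches(n_batches: int, rows_per_batch: int) -> None:
    for _ in range(n_batches):
        # One insert batch is always paired with one delete batch of
        # the same size, which keeps the table at a fixed size.
        insert_q.put(("insert", rows_per_batch))
        delete_q.put(("delete", rows_per_batch))

enqueue_batches(4, 50)
assert insert_q.qsize() == delete_q.qsize() == 4
```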

The benchmark treats the table like a queue, and when ordered by PK (transactionid) there are inserts at the high end and deletes at the low end. The delete statement currently looks like:
    delete from %s where transactionid in
        (select transactionid from %s where transactionid >= %d order by transactionid asc limit %d)

The delete statement is written like that because it must delete the oldest rows -- the ones that have the smallest value for transactionid. While the process that does deletes has some idea of what that smallest value is, it doesn't know it for sure, thus the query. To improve performance it maintains a guess for the value that will be <= the real minimum and it updates that guess over time.

I encountered other performance problems with Postgres while figuring out how to maintain that guess, and get_actual_variable_range() in Postgres was the problem. Maintaining that guess requires a resync query every N seconds where the resync query is: select min(transactionid) from %s. The problem for this query is that it scans the low end of the PK index on transactionid, and when vacuum hasn't been done recently it will scan and skip many entries that aren't visible (wasting much CPU and some IO) before finding visible rows. Unfortunately, there will be some time between consecutive vacuums of the same table and this problem can't be avoided. The result is that the response time for the query increases a lot in between vacuums. For more on how get_actual_variable_range() contributes to this problem, see this post.

I assume the sine wave for delete response time is caused by one or both of:
  • get_actual_variable_range() CPU overhead while planning the delete statement
  • CPU overhead from scanning and skipping tombstones while executing the select subquery
The structure of the delete statement above reduces the number of tombstones that the select subquery might encounter by specifying where transactionid >= %d. Perhaps that isn't sufficient. Perhaps the Postgres query planner still has too much CPU overhead from get_actual_variable_range() while planning that delete statement. I have yet to figure that out. But I have figured out that vacuum is a frequent source of problems.


    Tuesday, February 17, 2026

    MariaDB innovation: binlog_storage_engine, small server, Insert Benchmark

    MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB).

    My previous post had results for sysbench with a small server. This post has results for the Insert Benchmark with a similar small server. Both servers use an SSD that has high fsync latency. This is probably a best-case comparison for the feature. If you really care, then get enterprise SSDs with power loss protection. But you might encounter high fsync latency on public cloud servers.

    tl;dr for a CPU-bound workload

    • Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2.
    • When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 1.75X larger for l.i0 (load) and 4X or more larger on the random write steps (l.i1, l.i2)
    tl;dr for an IO-bound workload
    • Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2. It also has a large impact on qp1000, which is the most write-heavy of the query+write steps.
    • When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 4.74X larger for l.i0 (load), 1.50X larger for l.i1 (random writes) and 2.99X larger for l.i2 (random writes)
    Builds, configuration and hardware

    I compiled MariaDB 12.3.0 from source.

    The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here. The storage device has high fsync latency.

    I used 4 my.cnf files:
    • z12b
      • my.cnf.cz12b_c8r32 (z12b) is my default configuration. Sync-on-commit is disabled for both the binlog and InnoDB so that write-heavy benchmarks create more stress.
    • z12c
    • z12b_sync
    • z12c_sync
      • my.cnf.cz12c_sync_c8r32 (z12c_sync) is like cz12c except it enables sync-on-commit for InnoDB. Note that InnoDB is used to store the binlog so there is nothing else to sync on commit.
    The Benchmark

    The benchmark is explained here. It was run with 1 client for two workloads:
    • CPU-bound - the database is cached by InnoDB, but there is still much write IO
    • IO-bound - most, but not all, benchmark steps are IO-bound
    The benchmark steps are:

    • l.i0
      • insert XM rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 30M for CPU-bound and 800M for IO-bound.
    • l.x
      • create 3 secondary indexes per table. There is one connection per client.
    • l.i1
      • use 2 connections/client. One inserts XM rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. X is 40M for CPU-bound and 4M for IO-bound.
    • l.i2
      • like l.i1 but each transaction modifies 5 rows (small transactions) and YM rows are inserted and deleted per table. Y is 10M for CPU-bound and 1M for IO-bound.
      • Wait for S seconds after the step finishes to reduce MVCC GC debt and perf variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
    • qr100
      • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload. This step runs for 1800 seconds.
    • qp100
      • like qr100 except uses point queries on the PK index
    • qr500
      • like qr100 but the insert and delete rates are increased from 100/s to 500/s
    • qp500
      • like qp100 but the insert and delete rates are increased from 100/s to 500/s
    • qr1000
      • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
    • qp1000
      • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
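    The SLA check described for the query+write steps amounts to comparing the measured background insert rate against the target. A minimal sketch; the function name and the tolerance are my own assumptions, not from the benchmark:

```python
def meets_sla(inserts_done, run_seconds, target_rate):
    # A step like qp1000 runs for a fixed wall-clock time; if the
    # background writer kept up, inserts_done / run_seconds should be
    # close to target_rate. The 1% tolerance here is an assumption.
    return inserts_done / run_seconds >= 0.99 * target_rate

print(meets_sla(1_800_000, 1800, 1000))  # sustained 1000 inserts/s -> True
print(meets_sla(990_000, 1800, 1000))    # only 550 inserts/s -> False
```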
    Results: summary

    The performance reports are here for:
    • CPU-bound
      • all-versions - results for z12b, z12c, z12b_sync and z12c_sync
      • sync-only - results for z12b_sync vs z12c_sync
    • IO-bound
      • all-versions - results for z12b, z12c, z12b_sync and z12c_sync
      • sync-only - results for z12b_sync vs z12c_sync
    The summary sections from the performance reports have 3 tables. The first shows absolute throughput for each (DBMS tested, benchmark step) pair. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes across configurations. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

    I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from the base version.

    When relative QPS is > 1.0 then performance improved. When it is < 1.0 then there is a regression. The Q in relative QPS measures:
    • insert/s for l.i0, l.i1, l.i2
    • indexed rows/s for l.x
    • range queries/s for qr100, qr500, qr1000
    • point queries/s for qp100, qp500, qp1000
    Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements.
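    The arithmetic can be sketched as below. The absolute throughput numbers here are made up for illustration; only the ratios are meaningful:

```python
def relative_qps(qps_me, qps_base):
    # > 1.0 is an improvement over the base config, < 1.0 a regression
    return round(qps_me / qps_base, 2)

# Hypothetical absolute insert rates, e.g. z12b_sync vs z12c_sync
base = {"l.i0": 100000, "l.i1": 20000}
me   = {"l.i0": 175000, "l.i1": 79800}

rel = {step: relative_qps(me[step], base[step]) for step in base}
print(rel)  # {'l.i0': 1.75, 'l.i1': 3.99}
```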

    I often use context switch rates as a proxy for mutex contention.

    Results: CPU-bound

    The summaries are here for all-versions and sync-only.
    • Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2.
    • When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 1.75X larger for l.i0 (load) and 4X or more larger on the random write steps (l.i1, l.i2)
    The second table from the summary section has been inlined below. That table shows relative throughput which is: (QPS for my config / QPS for z12b).
    For all-versions
    dbms                                    l.i0  l.x   l.i1  l.i2  qr100  qp100  qr500  qp500  qr1000  qp1000
    ma120300_rel_withdbg.cz12b_c8r32        1.00  1.00  1.00  1.00  1.00   1.00   1.00   1.00   1.00    1.00
    ma120300_rel_withdbg.cz12c_c8r32        1.03  1.01  1.00  1.03  1.00   0.99   1.00   1.00   1.01    1.00
    ma120300_rel_withdbg.cz12b_sync_c8r32   0.04  1.02  0.07  0.01  1.01   1.01   1.00   1.01   1.00    1.00
    ma120300_rel_withdbg.cz12c_sync_c8r32   0.08  1.03  0.28  0.06  1.02   1.01   1.01   1.02   1.02    1.01

    And for sync-only the relative QPS is: (QPS for my config / QPS for z12b_sync).
    dbms                                    l.i0  l.x   l.i1  l.i2  qr100  qp100  qr500  qp500  qr1000  qp1000
    ma120300_rel_withdbg.cz12b_sync_c8r32   1.00  1.00  1.00  1.00  1.00   1.00   1.00   1.00   1.00    1.00
    ma120300_rel_withdbg.cz12c_sync_c8r32   1.75  1.01  3.99  6.83  1.01   1.01   1.01   1.01   1.03    1.01

    Results: IO-bound

    The summaries are here for all-versions and sync-only.
    • Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2. It also has a large impact on qp1000, which is the most write-heavy of the query+write steps.
    • When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 4.74X larger for l.i0 (load), 1.50X larger for l.i1 (random writes) and 2.99X larger for l.i2 (random writes)
    The second table from the summary section has been inlined below. That table shows relative throughput which is: (QPS for my config / QPS for z12b).
    For all-versions
    dbms                                    l.i0  l.x   l.i1  l.i2  qr100  qp100  qr500  qp500  qr1000  qp1000
    ma120300_rel_withdbg.cz12b_c8r32        1.00  1.00  1.00  1.00  1.00   1.00   1.00   1.00   1.00    1.00
    ma120300_rel_withdbg.cz12c_c8r32        1.01  0.99  0.99  1.01  1.01   1.01   1.01   1.07   1.01    1.04
    ma120300_rel_withdbg.cz12b_sync_c8r32   0.04  1.00  0.55  0.10  1.02   0.97   1.00   0.80   0.95    0.55
    ma120300_rel_withdbg.cz12c_sync_c8r32   0.18  1.00  0.83  0.31  1.02   1.01   1.02   0.96   1.02    0.86

    And for sync-only the relative QPS is: (QPS for my config / QPS for z12b_sync).
    dbms                                    l.i0  l.x   l.i1  l.i2  qr100  qp100  qr500  qp500  qr1000  qp1000
    ma120300_rel_withdbg.cz12b_sync_c8r32   1.00  1.00  1.00  1.00  1.00   1.00   1.00   1.00   1.00    1.00
    ma120300_rel_withdbg.cz12c_sync_c8r32   4.74  1.00  1.50  2.99  1.00   1.04   1.02   1.20   1.08    1.57
