Wednesday, April 8, 2026

The Insert Benchmark vs MariaDB 10.2 to 13.0 on a 32-core server

This has results for MariaDB versions 10.2 through 13.0 vs the Insert Benchmark on a 32-core server. The goal is to see how performance changes over time to find regressions or highlight improvements. My previous post has results from a 24-core server.  Differences between these servers include:

  • RAM - 32-core server has 128G, 24-core server has 64G
  • fsync latency - 32-core has an SSD with high fsync latency, while it is fast on the 24-core server
  • sockets - 32-core server has 1 CPU socket, 24-core server has two
  • CPU maker  - 32-core server uses an AMD Threadripper, 24-core server has an Intel Xeon
  • cores - obviously it is 32 vs 24, Intel HT and AMD SMT are disabled

The results here for modern MariaDB are great for the CPU-bound workload but not for the IO-bound workload.. They were great for both on the 24-core server. The regressions are likely caused by the extra fsync calls that are done because the equivalent of equivalent of innodb_flush_method =O_DIRECT_NO_FSYNC was lost with the new options that replace innodb_flush_method starting in MariaDB 11.4. I created MDEV-33545 to request support for it. The workaround is to use an SSD that doesn't have high fsync latency, which is always a good idea, but not always possible.

tl;dr

  • for a CPU-bound workload
    • the write-heavy steps are much faster in 13.0.0 than 10.2.30
    • the read-heavy steps get similar QPS in 13.0.0 and 10.2.30
    • this is similar to the results on the 24-core server
  • for an IO-bound workload
    • the initial load (l.i0) is much faster in 13.0.0 than 10.2.30
    • the random write step (l.i1) is slower in 13.0.0 than 10.2.30 because fsync latency
    • the range query step (qr100) gets similar QPS in 13.0.0 and 10.2.30
    • the point query step (qp100) is much slower in 13.0.0 than 10.2.30 because fsync latency

Builds, configuration and hardware

I compiled MariaDB from source for versions 10.2.30, 10.2.44, 10.3.39, 10.4.34, 10.5.29, 10.6.25, 10.11.16, 11.4.10, 11.8.6, 12.3.1 and 13.0.0.

The server has 24-cores, 2-sockets and 64G of RAM. Storage is 1 NVMe device with ext-4 and discard enabled. The OS is Ubuntu 24.04. Intel HT is disabled.

The my.cnf files are here for: 10.210.310.410.510.610.1111.411.812.3 and 13.0

For MariaDB 10.11.16 I used both the z12a config, as I did for all 10.x releases, and also used the z12b config. The difference is that the z12a config uses innodb_flush_method =O_DIRECT_NO_FSYNC while the z12b config uses =O_DIRECT. And the z12b config is closer to the configs used for MariaDB because with the new variables that replaced innodb_flush_method, we lose support for the equivalent of =O_DIRECT_NO_FSYNC.

And I write about this because the extra fsync calls that are done when the z12b config is used have a large impact on throughput on a server that uses an SSD with high fsync latency, which causes perf regressions for all DBMS versions that used the z12b config -- 10.11.16, 11.4, 11.8, 12.3 and 13.0.

The Benchmark

The benchmark is explained here and is run with 12 clients with a table per client. I repeated it with two workloads:
  • CPU-bound
    • the values for X, Y, Z are 10M, 16M, 4M
  • IO-bound
    • the values for X, Y, Z are 300M, 4M, 1M
The point query (qp100, qp500, qp1000) and range query (qr100, qr500, qr1000) steps are run for 1800 seconds each.

The benchmark steps are:

  • l.i0
    • insert X rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts Y rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and Z rows are inserted and deleted per table.
    • Wait for S seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: overview

The performance reports are here for the CPU-bound and IO-bound workloads.

The summary sections from the performances report have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version. The base version is MariaDB 10.2.30.

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
This statement doesn't apply to this blog post, but I keep it here for copy/paste into future posts. Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

Results: CPU-bound

The performance summary is here.

The summary per benchmark step, where rQPS means relative QPS.
  • l.i0
    • MariaDB 13.0.0 is faster than 10.2.30, rQPS is 1.47
    • CPU per insert (cpupq) and KB written to storage per insert (wKBpi) are much smaller in 13.0.0 than 10.2.30 (see here)
  • l.x
    • I will ignore this
  • l.i1, l.i2
    • MariaDB 13.0.0 is faster than 10.2.30, rQPS is 1.50 and 1.37
    • CPU per write (cpupq) is much smaller in 13.0.0 than 10.2.30 (see here)
  • qr100, qr500, qr1000
    • MariaDB 13.0.0 and 10.2.30 have similar QPS (rQPS is close to 1.0)
    • CPU per query (cqpq) is similar in 13.0.0 and 10.2.30 (see here)
  • qp100, qp500, qp1000
    • MariaDB 13.0.0 and 10.2.30 have similar QPS (rQPS is close to 1.0)
    • CPU per query (cqpq) is similar in 13.0.0 and 10.2.30 (see here)

Results: IO-bound

The performance summary is here.

The summary per benchmark step, where rQPS means relative QPS.
  • l.i0
    • MariaDB 13.0.0 is faster than 10.2.30, rQPS is 1.25
    • CPU per insert (cpupq) and KB written to storage per insert (wKBpi) are much smaller in 13.0.0 than 10.2.30 (see here)
  • l.x
    • I will ignore this
  • l.i1, l.i2
    • MariaDB 13.0.0 is slower than 10.2.30 for l.i1, rQPS is 0.68
    • MariaDB 13.0.0 is faster than 10.2.30 for l.i2, rQPS is 1.31. I suspect it is faster on l.i2 because it inherits less MVCC GC debt from l.i1 because it was slower on l.i1. So I won't celebrate this result and will focus on l.i1.
    • From the normalized vmstat and iostat metrics I don't see anything obvious. But I do see a reduction in storage reads/s (rps) and storage read MB/s (rMBps). And this reduction starts in 10.11.16 with the z12b config and continues to 13.0.0. This does not occur on the earlier releases that are eable to use the z12a config. So I am curious if the extra fsyncs are the root cause.
    • From the iostat summary for l.i1 that includes average values for all iostat columns, and these are not divided by QPS, what I see a much higher rate for fsyncs (f/s) as well as an increase in read latency. For MariaDB 10.11.16 the value for r_await is 0.640 with the z12a config vs 0.888 with the z12b config. I assume that more frequent fsync calls hurt read latency. The iostat results don't look great for either the z12a or z12b config and the real solution is to avoid using an SSD with high fsync latency, but that isn't always possible.
  • qr100, qr500, qr1000
    • no DBMS versions were able to sustain the target write rate for qr500 or qr1000 so I ignore them. This server needs more IOPs capacity -- a second SSD, and both SSDs needs power loss protection to reduce fsync latency.
    • MariaDB 13.0.0 and 10.2.30 have similar performance, rQPS is 0.96The qr100 step for MariaDB 13.0.0 might not suffer from fsync latency like the qp100 step because it does less read IO per query than qp100 (see rpq here).
  • qp100, qp500, qp1000
    • no DBMS versions were able to sustain the target write rate for qp500 or qp1000 so I ignore them. This server needs more IOPs capacity -- a second SSD, and both SSDs needs power loss protection to reduce fsync latency.
    • MariaDB 13.0.0 is slower than 10.2.30, rQPS is 0.62
    • From the normalized vmstat and iostat metrics there are increases in CPU per query (cpupq) and storage reads per query (rpq) for all DBMS versions that use the z12b config (see here).
    • From the iostat summary for qp100 that includes average values for all iostat columns the read latency increases for all DBMS versions that use the z12b config. I blame interference from the extra fsync calls.
























Tuesday, April 7, 2026

The Insert Benchmark vs MariaDB 10.2 to 13.0 on a 24-core server

This has results for MariaDB versions 10.2 through 13.0 vs the Insert Benchmark on a 24-core server. The goal is to see how performance changes over time to find regressions or highlight improvements.

MariaDB 13.0.0 is faster than 10.2.30 on most benchmark steps and otherwise as fast as 10.2.30. This is a great result.

tl;dr

  • for a CPU-bound workload
    • the write-heavy steps are much faster in 13.0.0 than 10.2.30
    • the read-heavy steps get similar QPS in 13.0.0 and 10.2.30
  • for an IO-bound workload
    • most of the write-heavy steps are much faster in 13.0.0 than 10.2.30
    • the point-query heavy steps get similar QPS in 13.0.0 and 10.2.30
    • the range-query heavy steps get more QPS in 13.0.0 than 10.2.30

Builds, configuration and hardware

I compiled MariaDB from source for versions 10.2.30, 10.2.44, 10.3.39, 10.4.34, 10.5.29, 10.6.25, 10.11.16, 11.4.10, 11.8.6, 12.3.1 and 13.0.0.

The server has 24-cores, 2-sockets and 64G of RAM. Storage is 1 NVMe device with ext-4 and discard enabled. The OS is Ubuntu 24.04. Intel HT is disabled.

The my.cnf files are here for: 10.2, 10.3, 10.4, 10.5, 10.6, 10.11, 11.4, 11.8, 12.3 and 13.0.

The Benchmark

The benchmark is explained here and is run with 8 clients with a table per client. I repeated it with two workloads:
  • CPU-bound
    • the values for X, Y, Z are 10M, 16M, 4M
  • IO-bound
    • the values for X, Y, Z are 250M, 4M, 1M
The point query (qp100, qp500, qp1000) and range query (qr100, qr500, qr1000) steps are run for 1800 seconds each.

The benchmark steps are:

  • l.i0
    • insert X rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts Y rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and Z rows are inserted and deleted per table.
    • Wait for S seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: overview

The performance reports are here for the CPU-bound and IO-bound workloads.

The summary sections from the performances report have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version. The base version is MariaDB 10.2.30.

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
This statement doesn't apply to this blog post, but I keep it here for copy/paste into future posts. Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

Results: CPU-bound

The performance summary is here.

The summary per benchmark step, where rQPS means relative QPS.
  • l.i0
    • MariaDB 13.0.0 is faster than 10.2.30 (rQPS is 1.22)
    • KB written to storage per insert (wKBpi) and CPU per insert (cpupq) are smaller in 13.0.0 than 10.2.30, see here
  • l.x
    • I will ignore this
  • l.i1, l.i2
    • MariaDB 13.0.0 is faster than 10.2.30 (rQPS is 1.21 and 1.45)
    • for l.i1, CPU per insert (cpupq) is smaller in 13.0.0 than 10.2.30 but KB written to storage per insert (wKBpi) and the context switch rate (cspq) are larger in 13.0.0 than 10.2.30, see here
    • for l.i2, CPU per insert (cpupq) and KB written to storage per insert (wKBpi) are smaller in 13.0.0 than 10.2.30 but the context switch rate (cspq) is larger in 13.0.0 than 10.2.30, see here
  • qr100, qr500, qr1000
    • MariaDB 13.0.0 and 10.2.30 have similar QPS (rQPS is close to 1.0)
    • the results from vmstat and iostat are less useful here because the write rate in 10.2 to 10.4 was much larger than 10.5+. While the my.cnf settings are as close as possible across all versions, it looks like furious flushing was enabled in 10.2 to 10.4 and I need to figure out whether it is possible to disable that.
  • qp100, qp500, qp1000
    • MariaDB 13.0.0 and 10.2.30 have similar QPS (rQPS is close to 1.0)
    • what I wrote above for vmstat and iostat with the qr* test also applies here
Results: IO-bound

The performance summary is here.

The summary per benchmark step, where rQPS means relative QPS.
  • l.i0
    • MariaDB 13.0.0 is faster than 10.2.30 (rQPS is 1.16)
    • KB written to storage per insert (wKBpi) and CPU per insert (cpupq) are smaller in 13.0.0 than 10.2.30, see here
  • l.x
    • I will ignore this
  • l.i1, l.i2
    • MariaDB 13.0.0 and 10.2.30 have the same QPS for l.i1 while 13.0.0 is faster for l.i2 (rQPS is 1.03 and 3.70). It is odd that QPS drops from 12.3.1 to 13.0.0 on the l.i1 step.
    • for l.i1, CPU per insert (cpupq) and the context switch rate (cspq) are larger in 13.0.0 than 12.3.1, see here. The flamegraphs, that I have not shared, look similar. From iostat results there is much more discard (TRIM, SSD GC) in progress with 13.0.0 than 12.3.1 and the overhead from that might explain the difference.
    • for l.i2, almost everything looks better in 13.0.0 than 10.2.30. Unlike what occurs for the l.i1 step, the results for 13.0.0 are similar to 12.3.1, see here.
  • qr100, qr500, qr1000
    • no DBMS versions were able to sustain the target write rate for qr1000 so I ignore that step
    • MariaDB 13.0.0 and 10.2.30 have similar QPS (rQPS is close to 1.0)
    • the results from vmstat and iostat are less useful here because the write rate in 10.2 to 10.4 was much larger than 10.5+. While the my.cnf settings are as close as possible across all versions, it looks like furious flushing was enabled in 10.2 to 10.4 and I need to figure out whether it is possible to disable that.
  • qp100, qp500, qp1000
    • no DBMS versions were able to sustain the target write rate for qr1000 so I ignore that step
    • MariaDB 13.0.0 is faster than 10.2.30 (rQPS is 1.17 and 1.56)
    • what I wrote above for vmstat and iostat with the qr* test also applies here






Sunday, April 5, 2026

Sysbench vs MariaDB on a small server: using the same charset for all versions

This has results for sysbench vs MariaDB on a small server. I repeated tests using the same charset (latin1) for all versions as explained here. In previous results I used a multi-byte charset for modern MariaDB (probably 11.4+) by mistake and that adds a 5% CPU overhead for many tests.

tl;dr

  • MariaDB has done much better than MySQL at avoid regressions from code bloat.
  • There are several performance improvements in MariaDB 12.3 and 13.0
  • For reads there are small regressions and frequent improvements.
  • For writes there are  regressions up to 10%, and the biggest contributor is MariaDB 11.4

Builds, configuration and hardware

I compiled MariaDB from source for versions 10.2.30, 10.2.44, 10.3.39, 10.4.34, 10.5.29, 10.6.25, 10.11.16, 11.4.10, 11.8.6, 12.3.1 and 13.0.0.

The server is an ASUS ExpertCenter PN53 with AMD Ryzen 7 7735HS, 32G RAM and an m.2 device for the database. More details on it are here. The OS is Ubuntu 24.04 and the database filesystem is ext4 with discard enabled.

The my.cnf files are here for 10.2, 10.3, 10.4, 10.5, 10.6, 10.11, 11.4, 11.8, 12.3 and 13.0.

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by InnoDB.

The tests are run using 1 table with 50M rows. The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 1800 seconds.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide tables below with relative QPS. When the relative QPS is > 1 then some version is faster than the base version. When it is < 1 then there might be a regression.  The relative QPS is:
(QPS for some version) / (QPS for MariaDB 10.2.30) 
Values from iostat and vmstat divided by QPS are hereThese can help to explain why something is faster or slower because it shows how much HW is used per request.

The spreadsheet with results and charts is here. Files with performance summaries are here.

Results: point queries

Summary
  • The y-axis starts at 0.8 to improve readability.
  • Modern MariaDB (13.0) is faster than old MariaDB (10.2) in 7 of 9 tests
    • There were regressions from 10.2 through 10.5
    • Performance has been improving from 10.6 through 13.0

Results: range queries without aggregation

Summary
  • The y-axis starts at 0.8 to improve readability.
  • Modern MariaDB (13.0) is faster than old MariaDB (10.2) in 2 of 5 tests
    • There were regressions from 10.2 through 10.5, then performance was stable from 10.6 though 11.8, and now performance has improved in 12.3 and 13.0.
Results: range queries with aggregation

Summary
  • The y-axis starts at 0.8 to improve readability.
  • Modern MariaDB (13.0) is faster than old MariaDB (10.2) in 1 of 8 tests and within 2% in 6 tests
Results: writes

Summary
  • The y-axis starts at 0.8 to improve readability.
  • Modern MariaDB (13.0) is about 10% slower than old MariaDB (10.2) in 5 of 10 tests and the largest regressions arrive in 11.4.

Friday, April 3, 2026

CPU-bound sysbench on a large server: Postgres, MySQL and MariaDB

This post has results for CPU-bound sysbench vs Postgres, MySQL and MariaDB on a large server using older and newer releases. 

The goal is to measure:

  • how performance changes over time from old versions to new versions
  • performance between modern MySQL, MariaDB and Postgres

The context here is a collection of microbenchmarks using a large server with high concurrency. Results on other workloads might be different. But you might be able to predict performance for a more complex workload using the data I share here.

tl;dr

  • for point queries
    • Postgres is faster than MySQL, MySQL is faster than MariaDB
    • modern MariaDB suffers from huge regressions that arrived in 10.5 and remain in 12.x
  • for range queries without aggregation
    • MySQL is about as fast as MariaDB, both are faster than Postgres (often 2X faster)
  • for range queries with aggregation
    • MySQL is about as fast as MariaDB, both are faster than Postgres (often 2X faster)
  • for writes
    • Postgres is much faster than MariaDB and MySQL (up to 4X faster)
    • MariaDB is between 1.3X and 1.5X faster than MySQL
  • on regressions
    • Postgres tends to be boring with few regressions from old to new versions
    • MySQL and MariaDB are exciting, with more regressions to debug
Hand-wavy summary

My hand-wavy summary about performance over time has been the following. It needs a revision, but also needs to be concise. 

Modern Postgres is about as fast as old Postgres, with some improvements. It has done great at avoiding perf regressions.

Modern MySQL at low concurrency has many performance regressions from new CPU overheads (code bloat). At high concurrency it is faster than old MySQL because the improvements for concurrency are larger than the regressions from code bloat.

Modern MariaDB at low concurrency has similar perf as old MariaDB. But at high concurrency it has large regressions for point queries, small regressions for range queries and some large improvements for writes. Note that many things use point queries internally - range scan on non-covering index, updates, deletes. The regressions arrive in 10.5, 10.6, 10.11 and 11.4.

For results on a small server with a low concurrency workload, I have many posts including:
Builds, configuration and hardware

I compiled:
  • Postgres from source for versions 12.22, 13.23, 14.21, 15.16, 16.12, 17.8 and 18.2.
  • MySQL from source for versions 5.6.51, 5.7.44, 8.0.44, 8.4.7 and 9.5.0
  • MariaDB from source for versions 10.2.30, 10.2.44, 10.3.39, 10.4.34, 10.5.29, 10.6.25, 10.11.15, 11.4.10, 11.8.6, 12.2.2 and 12.3.1
I used a 48-core server from Hetzner
  • an ax162s with an AMD EPYC 9454P 48-Core Processor with SMT disabled
  • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
  • 128G RAM
  • Ubuntu 22.04 running the non-HWE kernel (5.5.0-118-generic). The server has since been updated to Ubuntu 24.04 and I am repeating tests.
Configuration files for Postgres:
  • the config file is named conf.diff.cx10a_c32r128 (x10a_c32r128) and is here for versions 1213141516 and 17.
  • for Postgres 18 I used conf.diff.cx10b_c32r128 (x10b_c32r128) which is as close as possible to the Postgres 17 config and uses io_method=sync
The my.cnf files for MySQL are here: 5.6.515.7.448.0.4x8.4.x9.x.0

The my.cnf files for MariaDB are here: 10.2, 10.3, 10.4, 10.5, 10.6, 10.11, 11.4, 11.8, 12.2, 12.3.

I thought I was using the latin1 charset for all versions of MariaDB and MySQL but I recently learned I was using somehting like utf8mb4 on recent versions (maybe MariaDB 11.4+ and MySQL 8.0+). See here for details. I will soon repeat tests using latin1 for all versions. For some tests, the use of a multi-byte charset increases CPU overhead by up to 5%, which reduces throughput by a similar amount.

With Postgres I have been using a multi-byte charset for all versions.

Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement. Benchmarks are run with the database cached by Postgres.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. The benchmark is run with 40 clients and 8 tables with 10M rows per table. The database is cached.

The purpose is to search for regressions from new CPU overhead and mutex contention. I use the small server with low concurrency to find regressions from new CPU overheads and then larger servers with high concurrency to find regressions from new CPU overheads and mutex contention.

The tests can be called microbenchmarks. They are very synthetic. But microbenchmarks also make it easy to understand which types of SQL statements have great or lousy performance. Performance testing benefits from a variety of workloads -- both more and less synthetic.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries without aggregation while part 2 has queries with aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for base version)
When the relative QPS is > 1 then some version is faster than base version.  When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than base version.

The per-test results from vmstat and iostat can help to explain why something is faster or slower because it shows how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o) which are often a proxy for mutex contention.

The spreadsheet with charts is here and in some cases is easier to read than the charts below. Files with performance summaries are archived here.

The relative QPS numbers are also here for:
Files with HW efficiency numbers, average values from vmstat and iostat normalized by QPS, are here for:
Results: MySQL vs MariaDB vs Postgres

HW efficiency metrics are here. They have metrics from vmstat and iostat normalized by QPS.

Point queries
  • Postgres is faster than MySQL is faster than MariaDB
  • MySQL gets about 2X more QPS than MariaDB on 5 of the 9 tests
  • a table for relative QPS by test is here
  • from HW efficiency metrics for the random-points.range1000 test:
    • Postgres is 1.35X faster than MySQL, MySQL is more than 2X faster than MariaDB
    • MariaDB uses 2.28X more CPU and does 23.41X more context switches than MySQL
    • Postgres uses less CPU but does ~1.93X more context switches than MySQL
Range queries without aggregation
  • MySQL is about as fast as MariaDB, both are faster than Postgres (often 2X faster)
  • MariaDB has lousy results on the range-notcovered-si test because it must do many point lookups to fetch columns not in the index and MariaDB has problems with point queries at high concurrency
  • a table for relative QPS by test is here
  • from HW efficiency metrics for the scan:
    • MySQL is 1.2X faster than Postgres and 1.5X faster than MariaDB
    • MariaDB uses 1.19X more CPU and does ~1000X more context switches than MySQL
    • Postgres uses 1.55X more CPU but does few context switches than MySQL
Range queries with aggregation
  • MySQL is about as fast as MariaDB, both are faster than Postgres (often 2X faster)
  • a table for relative QPS by test is here
  • from HW efficiency metrics for read-only-count
    • MariaDB is 1.22X faster than MySQL, MySQL is 4.2X faster than Postgres
    • MariaDB uses 1.22X more CPU than MySQL but does ~2X more context switches
    • Postgres uses 4.11X more CPU than MySQL and does 1.08X more context switches
    • Query plans are here and MySQL + MariaDB benefit from the InnoDB clustered index
  • from HW efficiency metrics for read-only.range=10
    • MariaDB is 1.22X faster than MySQL, MySQL is 4.2X fasterMySQL is 1.2X faster than Postgres and 1.5X faster than MariaDB
    • MariaDB uses 1.19X more CPU and does ~1000X more context switches than MySQL
    • Postgres uses 1.55X more CPU but does few context switches than MySQL
Writes
  • Postgres is much faster than MariaDB and MySQL (up to 4X faster)
  • MariaDB is between 1.3X and 1.5X faster than MySQL
  • a table for relative QPS by test is here
  • from HW efficiency metrics for insert
    • Postgres is 3.03X faster than MySQL, MariaDB is 1.32X faster than MySQL
    • MySQL uses ~1.5X more CPU than MariaDB and ~2X more CPU than Postgres
    • MySQL does ~1.3X more context switches than MariaDB and ~2.9X more than Postgres
Results: MySQL

HW efficiency metrics are here. They have metrics from vmstat and iostat normalized by QPS.

Point queries
  • For 7 of 9 tests QPS is ~1.8X larger or more in 5.7.44 than in 5.6.51
  • For 2 tests there are small regressions after 5.6.51 -- points-covered-si & points-notcovered-si
  • a table for relative QPS by test is here
  • from HW efficiency metrics for points-covered-si:
    • the regression is explained by an increase in CPU
Range queries without aggregation
  • there is a small regression from 5.6 to 5.7 and a larger one from 5.7 to 8.0
  • a table for relative QPS by test is here
  • from HW efficiency metrics for range-covered-pk:
    • CPU overhead grows by up to 1.4X after 5.6.51, this is true for all of the tests
Range queries with aggregation
  • regressions after 5.6.51 here are smaller than in the other groups, but 5.7 tends to do better than 8.0, 8.4 and 9.5
  • a table for relative QPS by test is here
  • HW efficiency metrics are here for read-only_range=100
    • QPS changes because CPU/query changes
Writes
  • QPS improves after 5.6 by up to ~7X
  • a table for relative QPS by test is here
  • HW efficiency metrics are here insert
    • QPS improves after 5.6.51 because CPU per statement drops
Results: MariaDB

HW efficiency metrics are here. The have metrics from vmstat and iostat normalized by QPS.

Point queries
  • QPS for 6 of 9 tests drops in half (or more) from 10.2 to 12.3
  • a table for relative QPS is here
  • most of the regressions arrive in 10.5 and the root cause might be remove support for innodb_buffer_pool_intances and only support one buffer pool instance
  • HW efficiency metrics are here for points-covered-pk
    • there are large increases in CPU overhead and the context switch rate starting in 10.5
Range queries without aggregation
  • for range-covered-* and range-notcovered-pk there is a small regression in 10.4
  • for range-not-covered-si there is a large regression in 10.5 because this query does frequent point lookups on the PK to get missing columns
  • for scan there is a regression in 10.5 that goes away, but the regressions return in 10.11 and 11.4 
  • a table for relative QPS by test is here
  • HW efficiency metrics are here
Range queries with aggregation
  • for most tests there are small regressions in 10.4 and 10.5
  • a table for relative QPS by test is here
  • HW efficiency metrics are here
Writes
  • for most tests modern MariaDB is faster than 10.2
  • table for relative QPS by test is here
  • HW efficiency metrics are here
Results: Postgres

HW efficiency metrics are here. They have metrics from vmstat and iostat normalized by QPS.

Point queries
  • QPS for hot-points increased by ~2.5X starting in Postgres 17.x
  • otherwise QPS is stable from 12.22 through 18.2
  • a table for relative QPS by test is here
  • HW efficiency metrics for the hot-points test are here
    • CPU drops by more than half starting in 17.x
Range queries without aggregation
  • QPS is stable for the range-not-covered-* and scan tests
  • QPS drops almost in half for the range-covered-* tests
  • a table for relative QPS by test is here
  • all versions use the same query plan for the range-covered-pk test
  • HW efficiency metrics are here for range-covered-pk and for range-covered-si
    • An increase in CPU overhead explains the regressions for range-covered-*
    • I hope to get flamegraphs and thread stacks for these tests to explain what happens
Range queries with aggregation
  • QPS is stable from 12.22 through 18.2
  • a table for relative QPS by test is here
  • HW efficiency metrics are here
Writes
  • QPS is stable for 5 of 10 tests
  • QPS improves by up to 1.7X for the other 5 tests, most of that arrives in 17.x
  • a table for relative QPS by test is here
  • HW efficiency metrics are here for update-index




















    Tuesday, March 31, 2026

    gcc vs clang for sysbench on a small server with Postgres, MySQL and MariaDB

    This has results for sysbench on a small server and compares performanc for Postgres, MySQL and MariaDB compiled using clang vs using gcc.

    tl;dr

    • Throughput with clang and gcc is similar

    Builds, configuration and hardware

    I compiled Postgres 18.3, MySQL 8.4.8 and MariaDB 11.8.6 from source. The server has 8 AMD cores with SMT disabled and 32G of RAM. The OS is Ubuntu 24.04, gcc is version 13.3.0 and clang is version 18.1.3. Storage is ext-4 with discard enabled and an NVMe SSD.

    Benchmark

    I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by InnoDB.

    The tests are run using 1 client and 1 table with 50M rows. The read-heavy microbenchmarks run for 630 seconds and the write-heavy for 930 seconds.

    Results

    The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

    I provide tables below with relative QPS. When the relative QPS is > 1 then some version is faster than the base version. When it is < 1 then there might be a regression. The number below are the relative QPS computed as: (QPS with a gcc build / QPS with a clang build)

    Legend:
    * pg - for Postgres 18.3, (QPS with gcc / QPS with clang)
    * my - for MySQL 8.4.8, (QPS with gcc / QPS with clang)
    * ma - for MariaDB 11.8.6, (QPS with gcc / QPS with clang)

    -- point queries
    pg      my      ma
    1.02    1.00    0.99    hot-points
    1.02    0.98    1.02    point-query
    0.95    1.01    1.02    points-covered-pk
    0.96    1.02    1.02    points-covered-si
    0.97    1.00    1.02    points-notcovered-pk
    0.96    1.03    1.02    points-notcovered-si
    0.97    1.01    1.01    random-points_range=1000
    0.98    1.01    1.01    random-points_range=100
    1.00    0.99    1.00    random-points_range=10

    -- range queries without aggregation
    pg      my      ma
    1.01    0.98    1.03    range-covered-pk
    1.00    0.98    1.05    range-covered-si
    0.99    0.98    1.04    range-notcovered-pk
    0.99    1.02    0.97    range-notcovered-si
    1.02    1.06    1.03    scan

    -- range queries with aggregation
    pg      my      ma
    1.01    0.96    1.05    read-only-count
    0.99    0.99    1.01    read-only-distinct
    0.99    1.00    1.00    read-only-order
    0.99    1.00    1.01    read-only_range=10000
    1.00    0.98    1.00    read-only_range=100
    1.01    0.97    1.00    read-only_range=10
    0.99    0.97    1.03    read-only-simple
    1.02    0.98    1.02    read-only-sum

    -- writes
    pg      my      ma
    1.03    0.98    1.00    delete
    1.01    1.00    1.00    insert
    1.00    0.98    1.00    read-write_range=100
    1.00    0.98    1.00    read-write_range=10
    0.99    1.01    0.97    update-index
    0.96    1.01    0.99    update-inlist
    0.99    0.99    0.99    update-nonindex
    1.02    0.98    0.99    update-one
    0.98    0.98    0.99    update-zipf
    1.00    0.99    0.99    write-only

    Sunday, March 29, 2026

    The insert benchmark on a small server : Postgres 12.22 through 18.3

    This has results for Postgres versions 12.22 through 18.3 with the Insert Benchmark on a small server. My previous post for the same hardware with results up to Postgres 18.1 is here. This post also has results for:

    • all 17.x releases from 17.0 through 17.9 
    • 18.2 with and without full page writes enabled
    • both 1 and 4 users

    Postgres continues to be boring in a good way. It is hard to find performance regressions. Performance wasn't always stable, but I am reluctant to expect it to show no changes because there are sources of variance beyond the DBMS, especially HW (a too-hot SSD or CPU will run slower). Sometimes perf changes because there are obvious perf bugs, sometimes it changes for other reasons.

     tl;dr for a  CPU-bound workload

    • performance is stable from Postgres 12 through 18
    • performance is stable from Postgres 17.0 through 17.9
    • disabling full-page writes improves throughput on write-heavy benchmark steps
    tl;dr for an IO-bound workload
    • performance is mostly stable from Postgres 12 through 18
    • performance is stable from Postgres 17.0 through 17.9
    • disabling full-page writes improves throughput on write-heavy benchmark steps
    • in a few cases there are large improvements to point-query throughput on the qp1000 benchmark step. I will try to explain that soon.
    Builds, configuration and hardware

    I compiled Postgres from source using -O2 -fno-omit-frame-pointer for versions 12.22, 13.23, 14.22, 15.17, 16.13, 17.0 to 17.9, 18.2 and 18.3.

    The server is an Beelink SER7 with a Ryzen 7 7840HS CPU with 8 cores and AMD SMT disabled, 32G of RAM. Storage is one SSD for the OS and an NVMe SSD for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04.

    For versions prior to 18, the config file is named conf.diff.cx10a_c8r32 and they are as similar as possible and here for versions 1213141516 and 17.

    For Postgres 18 in most cases I used a config named conf.diff.cx10b_c8r32 (aka cx10b) which is as similar as possible to the configs for versions 17 and earlier. But for tests with full-page writes disabled I used additional configs to compare with results from the cx10b config.
    The Benchmark

    The benchmark is explained here and is run with 1 and 4 clients. In each case each client uses a separate table. I repeated it with two workloads:
    • CPU-bound
      • for 1 user the values for X, Y, Z are 30M, 40M, 10M
      • for 4 users the values for X, Y, Z are 10M, 16M, 4M
    • IO-bound
      • for 1 user the values for X, Y, Z are 800M, 4M, 1M
      • for 4 users the values for X, Y, Z are 200M, 4M, 1M
    The point query (qp100, qp500, qp1000) and range query (qr100, qr500, qr1000) steps are run for 1800 seconds each.

    The benchmark steps are:

    • l.i0
      • insert X rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
    • l.x
      • create 3 secondary indexes per table. There is one connection per client.
    • l.i1
      • use 2 connections/client. One inserts Y rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
    • l.i2
      • like l.i1 but each transaction modifies 5 rows (small transactions) and Z rows are inserted and deleted per table.
      • Wait for S seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
    • qr100
      • use 3 connections/client. One does range queries and performance is reported for this. The second does does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload.
    • qp100
      • like qr100 except uses point queries on the PK index
    • qr500
      • like qr100 but the insert and delete rates are increased from 100/s to 500/s
    • qp500
      • like qp100 but the insert and delete rates are increased from 100/s to 500/s
    • qr1000
      • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
    • qp1000
      • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
    Results: overview

    The performance reports are here for:
    The summary sections from the performances report have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

    Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version. The base version is Postgres 12.22 for the latest point releases comparison, 17.0 for the 17.x releases comparison and 18.2 with the cx10b config for the full-page writes comparison. 

    When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
    • insert/s for l.i0, l.i1, l.i2
    • indexed rows/s for l.x
    • range queries/s for qr100, qr500, qr1000
    • point queries/s for qp100, qp500, qp1000
    This statement doesn't apply to this blog post, but I keep it here for copy/paste into future posts. Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

    Results: CPU-bound

    The performance summaries are here for:
    For latest point releases at 1 user
    • there is either no change or a small improvement for l.i0 (load in PK order), l.x (create indexes) and the read-write tests (qr*, qp*).
    • for l.i1 and l.i2 (random write-only) throughput drops by 5% to 10% from 12.22 to 13.23 and has been stable since then (throughput in 18.2 is similar to 13.23. The CPU per operation overhead (cpupq here) increases after 12.22 for the l.i2 step but there wasn't an obvious increase for the l.i1 step - but the way I measure this is far from perfect. The results I share here are worse than what I measured in December 2025.
    For latest point releases at 4 users
    • there might be a small (3%) regression for l.i0 (load in PK order) in 18.2 vs 12.22. Perhaps this is noise. From vmstat and iostat metrics there aren't obvious changes.
    • throughput in 18.2 is better than 12.22 for all other benchmark steps
    For all 17.x releases at 1 user
    • throughput is stable from 17.0 to 17.9 for all benchmark steps except l.i1 and l.i2 (random writes) where there might be a 5% regression late in 17.x. This might be from new CPU overhead - see cpupq here.
    For all 17.x releases at 4 users
    • throughput is stable with small improvements from 17.0 to 17.9
    For full-page writes at 1 user
    • throughput improves by ~5% for l.i1 and l.i2 (random writes) when full-page writes are disabled  and KB written to storage per commit drops by ~20% -- see wkbpi here.
    • enabling wal_compression=lz4 decreases write throughput for all write-heavy steps when full-page writes are enabled. The impact is smaller when full page writes are disabled.
    For full-page writes at 4 users
    • throughput improves by <= 5% for all write-heavy steps when full-page writes are disabled
    • the impact from wal_compression=lz4 isn't obvious
    Results: IO-bound

    The performance summaries are here for:
    For latest point releases at 1 user
    • there are small (<= 10%) improvements for l.i0 (load in PK order) and l.x (create index). I don't see anything obvious in vmstat and iostat metrics to explain this.
    • there are small (<= 10%) regressions for l.i1 and l.i2 (random writes) that might be from a sequence of small regressions from 13.x through 18.x. I don't see anything obvious in vmstat and iostat metrics to explain this.
    • throughput is unchanged for the range-query read+write tests (qr*)
    • throughput improves by ~1.4X for the point-query read+write tests (qp*). This improvement arrived in 13.x. This can be explained by large drops in CPU overhead (cpupq) and context switch rates (cspq) -- see here.
    • the results here are similar to what I measured in December 2025
    For latest point releases at 4 users
    • there are small (~10%) regressions for l.i0 (load in PK order) that arrived in 17.x. The context switch rate (cspq) increases in 17.x
    • there are small (<= 20%) improvements for l.x (create index) that arrived in 13.x
    • there are large regressions for l.i1 and l.i2 (random writes) that arrive in 15.x through 18.x. There are large increases in CPU overhead (cpupq) -- see here.
    • throughput is unchanged for the range-query read+write tests (qr*)
    • throughput improves for the point-query read+write tests (qp*) at higher write rates (qp500, qp1000).
    For all 17.x releases at 1 user
    • throughput is stable with a few exceptions
    • for qp1000 (point-query, read+write) it improves by ~5% in 17.1 and is then stable to 17.9
    • in 17.9 there are large (~1.4x) improvements for all of the point-query, read+write tests
    • the changes in throughput for qp1000 might be explained by a small drop in CPU overhead per query (cpupq) that arrived in 17.1 and a large drop that arrived in 17.9 -- see here.
    For all 17.x releases at 4 users
    • throughput for most steps (l.i0, l.x, qr*, qp100, qp500) is stable
    • throughput for l.i1 and l.i2 (random writes) has more variance
    • throughput for qp1000 drops by up to 10% from 17.3 through 17.8 and in those cases the CPU overhead increased -- see cpupq here.
    For full-page writes at 1 user
    • throughput improves by 6% for l.i1 (random writes) when full-page writes are disabled
    • throughput improved for qp* tests when either full-page writes were disabled or lz4 was used for log_compression. That is harder to explain, perhaps it is noise.
    For full-page writes at 4 users
    • throughput improves by 20% for l.i1 (random writes) when full-page writes are disabled
    • throughput improved for qp* tests when either full-page writes were disabled or lz4 was used for log_compression. That is harder to explain, perhaps it is noise.

    n_dead_tup vs n_live_tup

    The tables below show the ratio: n_dead_tup / (n_dead_tup + n_live_tup) for the CPU-bound and IO-bound workloads using 1 user (and one table). These were measured at the end of each benchmark step.

    CPU-bound
            12.22   18.3
    l.i0    0.000   0.000
    l.x     0.000   0.000
    l.i1    0.065   0.035
    l.i2    0.045   0.020
    qr100   0.006   0.006
    qp100   0.012   0.012
    qr500   0.040   0.040
    qp500   0.021   0.024
    qr1000  0.031   0.036
    qp1000  0.040   0.003

    IO-bound
            12.22   18.3
    l.i0    0.000   0.000
    l.x     0.000   0.000
    l.i1    0.005   0.005
    l.i2    0.006   0.006
    qr100   0.000   0.000
    qp100   0.000   0.000
    qr500   0.002   0.002
    qp500   0.003   0.003
    qr1000  0.005   0.005
    qp1000  0.007   0.007

















     

    The Insert Benchmark vs MariaDB 10.2 to 13.0 on a 32-core server

    This has results for MariaDB versions 10.2 through 13.0 vs the  Insert Benchmark  on a 32-core server. The goal is to see how performance ch...