Sunday, March 29, 2026

The insert benchmark on a small server : Postgres 12.22 through 18.3

This has results for Postgres versions 12.22 through 18.3 with the Insert Benchmark on a small server. My previous post for the same hardware with results up to Postgres 18.1 is here. This post also has results for:

  • all 17.x releases from 17.0 through 17.9 
  • 18.2 with and without full page writes enabled
  • both 1 and 4 users

Postgres continues to be boring in a good way. It is hard to find performance regressions. Performance wasn't always stable, but I am reluctant to expect it to show no changes because there are sources of variance beyond the DBMS, especially HW (a too-hot SSD or CPU will run slower). Sometimes perf changes because there are obvious perf bugs, sometimes it changes for other reasons.

tl;dr for a CPU-bound workload

  • performance is stable from Postgres 12 through 18
  • performance is stable from Postgres 17.0 through 17.9
  • disabling full-page writes improves throughput on write-heavy benchmark steps
tl;dr for an IO-bound workload
  • performance is mostly stable from Postgres 12 through 18
  • performance is stable from Postgres 17.0 through 17.9
  • disabling full-page writes improves throughput on write-heavy benchmark steps
  • in a few cases there are large improvements to point-query throughput on the qp1000 benchmark step. I will try to explain that soon.
Builds, configuration and hardware

I compiled Postgres from source using -O2 -fno-omit-frame-pointer for versions 12.22, 13.23, 14.22, 15.17, 16.13, 17.0 to 17.9, 18.2 and 18.3.

The server is a Beelink SER7 with a Ryzen 7 7840HS CPU (8 cores, AMD SMT disabled) and 32G of RAM. Storage is one SSD for the OS and an NVMe SSD for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04.

For versions prior to 18, the config file is named conf.diff.cx10a_c8r32. The configs are as similar as possible and are here for versions 12, 13, 14, 15, 16 and 17.

For Postgres 18 in most cases I used a config named conf.diff.cx10b_c8r32 (aka cx10b) which is as similar as possible to the configs for versions 17 and earlier. But for tests with full-page writes disabled I used additional configs to compare with results from the cx10b config.
The Benchmark

The benchmark is explained here and is run with 1 client and with 4 clients. In each case each client uses a separate table. I repeated it with two workloads:
  • CPU-bound
    • for 1 user the values for X, Y, Z are 30M, 40M, 10M
    • for 4 users the values for X, Y, Z are 10M, 16M, 4M
  • IO-bound
    • for 1 user the values for X, Y, Z are 800M, 4M, 1M
    • for 4 users the values for X, Y, Z are 200M, 4M, 1M
The point query (qp100, qp500, qp1000) and range query (qr100, qr500, qr1000) steps are run for 1800 seconds each.

The benchmark steps are:

  • l.i0
    • insert X rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts Y rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and Z rows are inserted and deleted per table.
    • Wait for S seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
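The SLA check described for the qr*/qp* steps above can be sketched in a few lines. This is a minimal sketch of the idea, not the benchmark's actual code; the 0.95 tolerance is an assumption.

```python
# Hedged sketch: check whether the background writer in the qr*/qp* steps
# sustained its target insert rate. The tolerance value is assumed, not
# taken from the benchmark's implementation.

def sla_met(target_rate, inserts_done, elapsed_secs, tolerance=0.95):
    """Return True when the achieved insert rate is close enough to target."""
    achieved = inserts_done / elapsed_secs
    return achieved >= target_rate * tolerance

# qr1000 runs for 1800 seconds with a 1000 inserts/s target
assert sla_met(1000, 1800 * 1000, 1800)        # exactly on target
assert not sla_met(1000, 1500 * 1000, 1800)    # ~833/s, an SLA failure
```

When the SLA is met, every system tested does the same amount of background write work during the step, which keeps the query results comparable.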
Results: overview

The performance reports are here for:
The summary sections from the performance reports have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version. The base version is Postgres 12.22 for the latest point releases comparison, 17.0 for the 17.x releases comparison and 18.2 with the cx10b config for the full-page writes comparison. 

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
This statement doesn't apply to this blog post, but I keep it here for copy/paste into future posts. Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.
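The relative QPS arithmetic and the color buckets described above are simple enough to state as code. A minimal sketch, using the 0.95 and 1.05 thresholds from the text:

```python
# Relative QPS as defined above: (QPS for $me / QPS for $base).
# Color thresholds match the text: red <= 0.95, green >= 1.05, grey between.

def relative_qps(qps_me, qps_base):
    return qps_me / qps_base

def color(rel):
    if rel <= 0.95:
        return "red"    # likely regression
    if rel >= 1.05:
        return "green"  # likely improvement
    return "grey"       # within the noise range

assert color(relative_qps(940, 1000)) == "red"
assert color(relative_qps(1100, 1000)) == "green"
assert color(relative_qps(1000, 1000)) == "grey"
```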

Results: CPU-bound

The performance summaries are here for:
For latest point releases at 1 user
  • there is either no change or a small improvement for l.i0 (load in PK order), l.x (create indexes) and the read-write tests (qr*, qp*).
  • for l.i1 and l.i2 (random write-only) throughput drops by 5% to 10% from 12.22 to 13.23 and has been stable since then (throughput in 18.2 is similar to 13.23). The CPU per operation overhead (cpupq here) increases after 12.22 for the l.i2 step but there wasn't an obvious increase for the l.i1 step, though the way I measure this is far from perfect. The results I share here are worse than what I measured in December 2025.
For latest point releases at 4 users
  • there might be a small (3%) regression for l.i0 (load in PK order) in 18.2 vs 12.22. Perhaps this is noise. From vmstat and iostat metrics there aren't obvious changes.
  • throughput in 18.2 is better than 12.22 for all other benchmark steps
For all 17.x releases at 1 user
  • throughput is stable from 17.0 to 17.9 for all benchmark steps except l.i1 and l.i2 (random writes) where there might be a 5% regression late in 17.x. This might be from new CPU overhead - see cpupq here.
For all 17.x releases at 4 users
  • throughput is stable with small improvements from 17.0 to 17.9
For full-page writes at 1 user
  • throughput improves by ~5% for l.i1 and l.i2 (random writes) when full-page writes are disabled and KB written to storage per commit drops by ~20% -- see wkbpi here.
  • enabling wal_compression=lz4 decreases write throughput for all write-heavy steps when full-page writes are enabled. The impact is smaller when full page writes are disabled.
For full-page writes at 4 users
  • throughput improves by <= 5% for all write-heavy steps when full-page writes are disabled
  • the impact from wal_compression=lz4 isn't obvious
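The two settings varied in the full-page writes comparison are ordinary Postgres GUCs. An illustrative postgresql.conf fragment for the variant with full-page writes disabled and LZ4 WAL compression (not the exact cx10b config diffs) would be:

```
full_page_writes = off
wal_compression = lz4
```

Note that full_page_writes=off risks torn-page corruption after a crash unless the filesystem or storage guarantees atomic page writes, which is why the benchmark compares it rather than recommends it.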
Results: IO-bound

The performance summaries are here for:
For latest point releases at 1 user
  • there are small (<= 10%) improvements for l.i0 (load in PK order) and l.x (create index). I don't see anything obvious in vmstat and iostat metrics to explain this.
  • there are small (<= 10%) regressions for l.i1 and l.i2 (random writes) that might be from a sequence of small regressions from 13.x through 18.x. I don't see anything obvious in vmstat and iostat metrics to explain this.
  • throughput is unchanged for the range-query read+write tests (qr*)
  • throughput improves by ~1.4X for the point-query read+write tests (qp*). This improvement arrived in 13.x. This can be explained by large drops in CPU overhead (cpupq) and context switch rates (cspq) -- see here.
  • the results here are similar to what I measured in December 2025
For latest point releases at 4 users
  • there are small (~10%) regressions for l.i0 (load in PK order) that arrived in 17.x. The context switch rate (cspq) increases in 17.x
  • there are small (<= 20%) improvements for l.x (create index) that arrived in 13.x
  • there are large regressions for l.i1 and l.i2 (random writes) that arrive in 15.x through 18.x. There are large increases in CPU overhead (cpupq) -- see here.
  • throughput is unchanged for the range-query read+write tests (qr*)
  • throughput improves for the point-query read+write tests (qp*) at higher write rates (qp500, qp1000).
For all 17.x releases at 1 user
  • throughput is stable with a few exceptions
  • for qp1000 (point-query, read+write) it improves by ~5% in 17.1 and is then stable to 17.9
  • in 17.9 there are large (~1.4x) improvements for all of the point-query, read+write tests
  • the changes in throughput for qp1000 might be explained by a small drop in CPU overhead per query (cpupq) that arrived in 17.1 and a large drop that arrived in 17.9 -- see here.
For all 17.x releases at 4 users
  • throughput for most steps (l.i0, l.x, qr*, qp100, qp500) is stable
  • throughput for l.i1 and l.i2 (random writes) has more variance
  • throughput for qp1000 drops by up to 10% from 17.3 through 17.8 and in those cases the CPU overhead increased -- see cpupq here.
For full-page writes at 1 user
  • throughput improves by 6% for l.i1 (random writes) when full-page writes are disabled
  • throughput improved for qp* tests when either full-page writes were disabled or lz4 was used for wal_compression. That is harder to explain; perhaps it is noise.
For full-page writes at 4 users
  • throughput improves by 20% for l.i1 (random writes) when full-page writes are disabled
  • throughput improved for qp* tests when either full-page writes were disabled or lz4 was used for wal_compression. That is harder to explain; perhaps it is noise.

n_dead_tup vs n_live_tup

The tables below show the ratio: n_dead_tup / (n_dead_tup + n_live_tup) for the CPU-bound and IO-bound workloads using 1 user (and one table). These were measured at the end of each benchmark step.

CPU-bound
        12.22   18.3
l.i0    0.000   0.000
l.x     0.000   0.000
l.i1    0.065   0.035
l.i2    0.045   0.020
qr100   0.006   0.006
qp100   0.012   0.012
qr500   0.040   0.040
qp500   0.021   0.024
qr1000  0.031   0.036
qp1000  0.040   0.003

IO-bound
        12.22   18.3
l.i0    0.000   0.000
l.x     0.000   0.000
l.i1    0.005   0.005
l.i2    0.006   0.006
qr100   0.000   0.000
qp100   0.000   0.000
qr500   0.002   0.002
qp500   0.003   0.003
qr1000  0.005   0.005
qp1000  0.007   0.007
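The ratio in the tables above comes from the n_dead_tup and n_live_tup counters in pg_stat_user_tables. As a function (the sample values below are from the 12.22 column of the CPU-bound table):

```python
# The dead-tuple fraction shown in the tables above:
# n_dead_tup / (n_dead_tup + n_live_tup), both from pg_stat_user_tables.

def dead_fraction(n_dead_tup, n_live_tup):
    return n_dead_tup / (n_dead_tup + n_live_tup)

# e.g. the l.i1 row for 12.22 in the CPU-bound table is 0.065,
# which a table with 65 dead and 935 live tuples would produce
assert abs(dead_fraction(65, 935) - 0.065) < 1e-9
```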

Saturday, March 28, 2026

Selecting a character set for MySQL and MariaDB clients

MySQL and MariaDB have many character-set-related options, perhaps too many:

  1. character_set_client
  2. character_set_connection
  3. character_set_database
  4. character_set_filesystem
  5. character_set_results
  6. character_set_server
  7. character_set_system
This is a topic that I don't know much about and I am still far from an expert. My focus has been other DBMS topics. But I spent time recently on this topic while explaining what looked like a performance regression but was really just a new release of MySQL using a charset that is less CPU-efficient than the one used before.

Debugging

The initial sequence to understand what was going on was:
  1. mysql -e 'SHOW GLOBAL VARIABLES like "character_set_%"'
  2. mysql -e 'SHOW SESSION VARIABLES like "character_set_%"'
  3. run "SHOW SESSION VARIABLES" from my benchmark client
Note:
  • the output from steps 1 and 2 was different
    • with SHOW GLOBAL VARIABLES I got character_set_client=latin1 but with SHOW SESSION VARIABLES I got character_set_client=utf8mb3. This happens. One reason is that some MySQL client binaries autodetect the charset based on the value of LANG or LC_CTYPE from your Linux env. Another reason is that if autodetection isn't done then the clients can use the default charset that was set at compile time. That charset is then passed to the server during the connection handshake (see thd_init_client_charset). So it is likely that character_set_client as displayed by SHOW GLOBAL VARIABLES isn't what your client will use.
  • the output from steps 2 and 3 was different
    • autodetection is only done when mysql_options() is called with a certain flag (see below). And that is not done by the MySQL driver in sysbench, nor is it done by Python's MySQLdb. So my benchmark clients are likely selecting the default charset and don't do autodetection. And that default is determined by the version of the MySQL client library, meaning that default can change over the years. For the source that implements this, search for MYSQL_AUTODETECT_CHARSET_NAME and read sql-common/client.c.
The following enables autodetection and should be called before calling mysql_real_connect():
    mysql_options(..., 
                  MYSQL_SET_CHARSET_NAME,
                  MYSQL_AUTODETECT_CHARSET_NAME);

Note that adding the following into my.cnf isn't a workaround for clients that don't do autodetect.
    [client]
    default-character-set=...
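The autodetection described above can be sketched roughly: derive a charset from the locale environment variables and fall back to the compiled-in default when no locale is set. This is an illustrative sketch, not the client library's code; the codeset-to-charset mapping here is an assumed subset, and the real client maps OS codeset names via its charset library.

```python
# Hedged sketch of client-side charset autodetection: look at the locale
# env vars and map the codeset suffix (e.g. "UTF-8" in "en_US.UTF-8") to
# a MySQL charset name. The mapping below is illustrative, not complete.

_CODESET_TO_MYSQL = {"UTF-8": "utf8mb4", "ISO-8859-1": "latin1"}  # assumed subset

def autodetect_charset(env):
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        locale = env.get(var)
        if locale and "." in locale:
            codeset = locale.split(".", 1)[1]
            return _CODESET_TO_MYSQL.get(codeset, "utf8mb4")
    return None  # no locale: the compiled-in default charset is used

assert autodetect_charset({"LANG": "en_US.UTF-8"}) == "utf8mb4"
assert autodetect_charset({}) is None
```

This is why two clients on the same host can end up with different session charsets: one does autodetection from the environment, the other silently uses the library's compiled-in default.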

Notes

These are from my usage of MySQL 5.7.44, 8.0.45 and 8.4.8 along with MariaDB 10.6.25, 10.11.16 and 11.4.10. All were compiled from source as was sysbench. I installed MySQLdb and the MySQL client library via apt for Ubuntu 24.04.

The values for character_set_client, character_set_results and character_set_connection were measured via the MySQL command-line client running SHOW GLOBAL VARIABLES and SHOW SESSION VARIABLES and then the benchmark clients running SHOW SESSION VARIABLES.

The reason for sharing this is to explain the many possible values your session might use for character_set_client, character_set_results and character_set_connection. And using the wrong value might waste CPU.

What per-session values are used for character_set_client|results|connection?
* my.cnf has character_set_server=latin1
* per SHOW GLOBAL VARIABLES each is set to =latin1
* values below measured via SHOW SESSION VARIABLES
Values for character_set_client|results|connection
... with "mysql" command line client
... this is easy to change with --default-character-set command line option or equivalent option in my.cnf
dbms      charset
5.7.44    utf8
8.0.45    utf8mb4
8.4.8     utf8mb4
10.6.25   utf8mb3
10.11.16  utf8mb3
11.4.10   utf8mb3
Values for character_set_client|results|connection
... with sysbench

                    client library version
dbms      5.7     8.0      8.4      10.6     10.11    11.4
5.7.44    latin1  latin1   latin1   NA       NA       NA
8.0.45    latin1  utf8mb4  utf8mb4  NA       NA       NA
8.4.8     latin1  utf8mb4  utf8mb4  NA       NA       NA
10.6.25   latin1  latin1   latin1   utf8mb4  utf8mb4  utf8mb4
10.11.16  latin1  latin1   latin1   utf8mb4  utf8mb4  utf8mb4
11.4.10   latin1  utf8mb4  utf8mb4  utf8mb4  utf8mb4  utf8mb4
Values for character_set_client|results|connection
... with insert benchmark (Python MySQLdb and /lib/x86_64-linux-gnu/libmysqlclient.so.21)
... I am not sure which version libmysqlclient.so.21 is; this is on Ubuntu 24.04

dbms      charset
5.7.44    latin1
8.0.45    utf8mb4
8.4.8     utf8mb4
10.6.25   latin1
10.11.16  latin1
11.4.10   utf8mb4

Tuesday, March 24, 2026

Sysbench vs MySQL on a small server: no new regressions, many old ones

This has performance results for InnoDB from MySQL 5.6.51, 5.7.44, 8.0.X, 8.4.8 and 9.7.0 on a small server with sysbench microbenchmarks. The workload here is cached by InnoDB and my focus is on regressions from new CPU overheads. 

In many cases, MySQL 5.6.51 gets about 1.5X more QPS than modern MySQL (8.0.x thru 9.7). The root cause is new CPU overhead, possibly from code bloat.

tl;dr

  • There are too many performance regressions in MySQL 8.0.X
  • There are few performance regressions in MySQL 8.4 through 9.7.0
  • In many cases MySQL 5.6.51 gets ~1.5X more QPS than 9.7.0 because 9.7.0 uses more CPU
  • Large regressions arrived in MySQL 8.0.30 and 8.0.32, especially for full-table scans

Builds, configuration and hardware

I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.X, 8.4.8 and 9.7.0. For MySQL 8.0.X I used 8.0.28, 8.0.30, 8.0.31, 8.0.32, 8.0.33, 8.0.34, 8.0.35, 8.0.36 and 8.0.45.

The server is an ASUS ExpertCenter PN53 with AMD Ryzen 7 7735HS, 32G RAM and an m.2 device for the database. More details on it are here. The OS is Ubuntu 24.04 and the database filesystem is ext4 with discard enabled.

The my.cnf files are here for 5.6, 5.7, 8.4 and 9.7.

The my.cnf files are here for 8.0.28, 8.0.30, 8.0.31, 8.0.32, 8.0.33, 8.0.34, 8.0.35, 8.0.36 and 8.0.45.

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by InnoDB.

The tests are run using 1 table with 50M rows. The read-heavy microbenchmarks run for 630 seconds and the write-heavy for 930 seconds.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide tables below with relative QPS. When the relative QPS is > 1 then some version is faster than the base version. When it is < 1 then there might be a regression. The relative QPS is computed as below, where the base version is either MySQL 5.6.51 or 8.0.28:
(QPS for some version) / (QPS for base version) 
Values from iostat and vmstat divided by QPS are here for 5.6.51 as the base version and then here for 8.0.28 as the base version. These can help to explain why something is faster or slower because it shows how much HW is used per request.

Results: point queries

Summary:
  • there are large regressions from 5.6.51 to 5.7.44
  • there are larger regressions from 5.7.44 to 8.0.45
  • the regressions from 8.0.45 to 9.7.0 are small
  • the regressions in the random-points tests are larger for range=10 than range=1000 (larger when the range is smaller). So the regressions are more likely to be in places other than InnoDB. The problem is new CPU overhead (see cpu/o here) which is 1.55X larger in 9.7.0 vs 5.6.51 for random-points_range=10 but only 1.19X larger in 9.7.0 for random-points_range=1000.
Relative to: 5.6.51
col-1 : 5.7.44
col-2 : 8.0.45
col-3 : 8.4.8
col-4 : 9.7.0

col-1   col-2   col-3   col-4
0.87    0.65    0.65    0.64    hot-points
0.87    0.69    0.67    0.63    point-query
0.87    0.72    0.72    0.71    points-covered-pk
0.90    0.78    0.78    0.76    points-covered-si
0.89    0.73    0.72    0.71    points-notcovered-pk
0.89    0.77    0.76    0.75    points-notcovered-si
1.00    0.84    0.83    0.83    random-points_range=1000
0.89    0.72    0.72    0.72    random-points_range=100
0.87    0.69    0.68    0.66    random-points_range=10

Summary:
  • The large regressions in 8.0.x for point queries (see above) occur prior to 8.0.28
Relative to: 8.0.28
col-1 : 8.0.30
col-2 : 8.0.31
col-3 : 8.0.32
col-4 : 8.0.33
col-5 : 8.0.34
col-6 : 8.0.35
col-7 : 8.0.36
col-8 : 8.0.45

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8
0.92    1.14    1.14    1.12    1.17    1.16    1.16    1.16    hot-points
0.97    0.97    0.95    0.96    0.95    0.95    0.95    0.95    point-query
0.94    1.09    1.09    1.08    1.12    1.12    1.11    1.15    points-covered-pk
0.90    1.08    1.07    1.07    1.12    1.13    1.12    1.16    points-covered-si
0.91    1.04    1.04    1.03    1.07    1.07    1.06    1.11    points-notcovered-pk
0.88    0.96    0.96    0.95    1.00    1.01    1.00    1.06    points-notcovered-si
0.79    2.35    2.42    2.37    2.45    2.45    2.47    2.56    random-points_range=1000
0.94    1.07    1.06    1.06    1.09    1.08    1.10    1.12    random-points_range=100
0.93    0.94    0.93    0.93    0.94    0.94    0.93    0.95    random-points_range=10

Results: range queries without aggregation

Summary:
  • there are large regressions from 5.6.51 to 5.7.44
  • there are larger regressions from 5.7.44 to 8.0.45
  • the regressions from 8.0.45 to 9.7.0 are small
  • the problem is new CPU overhead and for the scan test the CPU overhead per query is about 1.5X larger in modern MySQL (8.0 thru 9.7) relative to MySQL 5.6.51 (see cpu/o here)
Relative to: 5.6.51
col-1 : 5.7.44
col-2 : 8.0.45
col-3 : 8.4.8
col-4 : 9.7.0

col-1   col-2   col-3   col-4
0.83    0.68    0.66    0.65    range-covered-pk
0.83    0.70    0.69    0.67    range-covered-si
0.84    0.66    0.65    0.64    range-notcovered-pk
0.88    0.74    0.73    0.73    range-notcovered-si
0.84    0.67    0.66    0.67    scan

Summary:
  • There is a large regression in 8.0.30 and a larger one in 8.0.32
  • The scan test is the worst case for the regression.
Relative to: 8.0.28
col-1 : 8.0.30
col-2 : 8.0.31
col-3 : 8.0.32
col-4 : 8.0.33
col-5 : 8.0.34
col-6 : 8.0.35
col-7 : 8.0.36
col-8 : 8.0.45

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8
0.95    0.94    0.92    0.92    0.92    0.93    0.93    0.96    range-covered-pk
0.96    0.96    0.94    0.93    0.93    0.94    0.93    0.95    range-covered-si
0.94    0.94    0.93    0.93    0.94    0.94    0.93    0.93    range-notcovered-pk
0.89    0.87    0.87    0.86    0.89    0.91    0.89    0.95    range-notcovered-si
0.93    0.92    0.79    0.82    0.83    0.77    0.82    0.80    scan

Results: range queries with aggregation

Summary:
  • there are large regressions from 5.6.51 to 5.7.44
  • there are larger regressions from 5.7.44 to 8.0.45
  • the regressions from 8.0.45 to 9.7.0 are small
Relative to: 5.6.51
col-1 : 5.7.44
col-2 : 8.0.45
col-3 : 8.4.8
col-4 : 9.7.0

col-1   col-2   col-3   col-4
0.86    0.70    0.69    0.68    read-only-count
1.42    1.27    1.24    1.23    read-only-distinct
0.91    0.75    0.74    0.73    read-only-order
1.23    1.01    1.01    1.01    read-only_range=10000
0.93    0.77    0.76    0.74    read-only_range=100
0.86    0.69    0.68    0.66    read-only_range=10
0.83    0.68    0.68    0.66    read-only-simple
0.83    0.67    0.67    0.66    read-only-sum

Summary:
  • There are significant regressions in 8.0.30 and 8.0.32
Relative to: 8.0.28
col-1 : 8.0.30
col-2 : 8.0.31
col-3 : 8.0.32
col-4 : 8.0.33
col-5 : 8.0.34
col-6 : 8.0.35
col-7 : 8.0.36
col-8 : 8.0.45

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8
0.95    0.94    0.87    0.87    0.88    0.87    0.89    0.91    read-only-count
0.97    0.96    0.94    0.95    0.96    0.95    0.95    0.96    read-only-distinct
0.97    0.96    0.93    0.95    0.95    0.94    0.95    0.95    read-only-order
0.96    0.95    0.93    0.94    0.95    0.95    0.96    0.98    read-only_range=10000
0.96    0.96    0.94    0.95    0.95    0.94    0.95    0.94    read-only_range=100
0.96    0.97    0.95    0.95    0.95    0.94    0.95    0.94    read-only_range=10
0.94    0.94    0.92    0.93    0.93    0.93    0.94    0.94    read-only-simple
0.94    0.94    0.89    0.91    0.92    0.90    0.93    0.91    read-only-sum

Results: writes

Summary:
  • there are large regressions from 5.6.51 to 5.7.44
  • there are larger regressions from 5.7.44 to 8.0.45
  • the regressions from 8.0.45 to 9.7.0 are small
  • the insert test is the worst case and a big part of that is new CPU overhead, see cpu/o here, where it is 2.13X larger in 9.7.0 than 5.6.51. But for update-one the problem is writing more to storage per commit (see wkbpi here) rather than new CPU overhead.
Relative to: 5.6.51
col-1 : 5.7.44
col-2 : 8.0.45
col-3 : 8.4.8
col-4 : 9.7.0

col-1   col-2   col-3   col-4
0.85    0.60    0.59    0.55    delete
0.81    0.55    0.54    0.52    insert
0.93    0.75    0.74    0.71    read-write_range=100
0.87    0.70    0.68    0.66    read-write_range=10
1.20    0.88    0.89    0.91    update-index
1.04    0.74    0.73    0.71    update-inlist
0.87    0.62    0.61    0.57    update-nonindex
0.87    0.62    0.60    0.57    update-one
0.87    0.63    0.61    0.58    update-zipf
0.93    0.69    0.68    0.66    write-only

Summary:
  • There are significant regressions in 8.0.30 and 8.0.32
Relative to: 8.0.28
col-1 : 8.0.30
col-2 : 8.0.31
col-3 : 8.0.32
col-4 : 8.0.33
col-5 : 8.0.34
col-6 : 8.0.35
col-7 : 8.0.36
col-8 : 8.0.45

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8
0.96    0.95    0.92    0.92    0.92    0.91    0.91    0.91    delete
0.94    0.93    0.91    0.91    0.91    0.90    0.90    0.90    insert
0.96    0.96    0.94    0.94    0.94    0.94    0.94    0.93    read-write_range=100
0.96    0.96    0.94    0.94    0.94    0.94    0.94    0.93    read-write_range=10
0.91    0.91    0.84    0.84    0.86    0.85    0.86    0.79    update-index
0.94    0.95    0.92    0.91    0.92    0.91    0.91    0.91    update-inlist
0.95    0.96    0.92    0.92    0.92    0.91    0.91    0.90    update-nonindex
0.96    0.96    0.93    0.92    0.92    0.91    0.92    0.91    update-one
0.96    0.96    0.92    0.92    0.92    0.91    0.91    0.90    update-zipf
0.94    0.94    0.91    0.91    0.91    0.91    0.91    0.89    write-only

Wednesday, March 18, 2026

MariaDB innovation: binlog_storage_engine, 32-core server, Insert Benchmark

MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB). This work was done by Small Datum LLC and sponsored by the MariaDB Foundation.
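A back-of-envelope model shows why going from 2 fsyncs per commit to 1 matters on storage with high fsync latency. This is a deliberately simplified sketch: single-threaded, fsync-bound, no group commit, and the 5 ms latency is an assumed example rather than a measurement from this post.

```python
# Simplified model: commit throughput when each commit is bound by
# fsync latency. Ignores group commit and all non-fsync work.

def commits_per_sec(fsync_latency_ms, fsyncs_per_commit):
    return 1000.0 / (fsync_latency_ms * fsyncs_per_commit)

# With an assumed 5 ms fsync latency, halving fsyncs per commit
# doubles the fsync-bound commit rate.
assert commits_per_sec(5, 2) == 100.0  # binlog + InnoDB
assert commits_per_sec(5, 1) == 200.0  # InnoDB only
```

Real gains differ because group commit amortizes fsyncs across concurrent transactions, but the direction of the effect is the same.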

My previous post had results for sysbench with a small server. This post has results for the Insert Benchmark with a large (32-core) server. Both servers use an SSD that has high fsync latency. This is probably a best-case comparison for the feature. If you really care, then get enterprise SSDs with power loss protection. But you might encounter high fsync latency on public cloud servers.

While throughput improves with the InnoDB doublewrite buffer disabled, I am not suggesting people do that for production workloads without understanding the risks it creates.

tl;dr for a CPU-bound workload

  • throughput for write-heavy steps is larger with the InnoDB doublewrite buffer disabled
  • throughput for write-heavy steps is much larger with the binlog storage engine enabled
  • throughput for write-heavy steps is largest with both the binlog storage engine enabled and the InnoDB doublewrite buffer disabled. In this case it was up to 8.9X larger.
tl;dr for an IO-bound workload
  • see the tl;dr above
  • the best throughput comes from enabling the binlog storage engine and disabling the InnoDB doublewrite buffer and was up to 3.26X larger.
Builds, configuration and hardware

I compiled MariaDB 12.3.1 from source.

The server has 32 cores and 128G of RAM. Storage is 1 NVMe device with ext-4 and discard enabled. The OS is Ubuntu 24.04. AMD SMT is disabled. The SSD has high fsync latency.

I tried 4 my.cnf files:
The Benchmark

The benchmark is explained here. It was run with 12 clients for two workloads:
  • CPU-bound - the database is cached by InnoDB, but there is still much write IO
  • IO-bound - most, but not all, benchmark steps are IO-bound
The benchmark steps are:

  • l.i0
    • insert X rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 10M for CPU-bound and 300M for IO-bound.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts X rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. X is 16M for CPU-bound and 4M for IO-bound.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and Y rows are inserted and deleted per table. Y is 4M for CPU-bound and 1M for IO-bound.
    • Wait for S seconds after the step finishes to reduce MVCC GC debt and perf variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload. This step runs for 1800 seconds.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: summary

The performance reports are here for CPU-bound and IO-bound.

The summary sections from the performance reports have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA. And from the third table for the IO-bound workload I see that there were failures to meet the SLA for qp500, qr500, qp1000 and qr1000.

I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from the base version.

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements.

I often use context switch rates as a proxy for mutex contention.

Results: CPU-bound

The summary is here.

Some of the improvements here are huge courtesy of storage with high fsync latency.

Throughput is much better with the binlog storage engine enabled when the InnoDB doublewrite buffer is also enabled. Comparing z12b_sync and z12c_sync (z12c_sync uses the binlog storage engine):
  • throughput for l.i0 (load in PK order) is 3.63X larger for z12c_sync
  • throughput for l.i1 (write-only, larger transactions) is 2.80X larger for z12c_sync
  • throughput for l.i2 (write-only, smaller transactions) is 8.13X larger for z12c_sync
There is a smaller benefit from only disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12b_sync_dw0:
  • throughput for l.i0 (load in PK order) is the same for z12b_sync and z12b_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 1.14X larger for z12b_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 1.93X larger for z12b_sync_dw0
The largest benefits come from using the binlog storage engine and disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12c_sync_dw0:
  • throughput for l.i0 (load in PK order) is 3.61X larger for z12c_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 3.03X larger for z12c_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 8.90X larger for z12c_sync_dw0
Results: IO-bound

The summary is here.

For the read-write steps the insert SLA was not met for qr500, qp500, qr1000 and qp1000 because those steps needed more IOPS than the storage devices can provide. So I ignore those steps.

Some of the improvements here are huge, courtesy of storage with high fsync latency.

Throughput is much better with the binlog storage engine enabled when the InnoDB doublewrite buffer is also enabled. Comparing z12b_sync and z12c_sync (z12c_sync uses the binlog storage engine):
  • throughput for l.i0 (load in PK order) is 3.05X larger for z12c_sync
  • throughput for l.i1 (write-only, larger transactions) is 1.22X larger for z12c_sync
  • throughput for l.i2 (write-only, smaller transactions) is 1.58X larger for z12c_sync
There is a smaller benefit from only disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12b_sync_dw0:
  • throughput for l.i0 (load in PK order) is the same for z12b_sync and z12b_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 2.06X larger for z12b_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 1.59X larger for z12b_sync_dw0
The largest benefits come from using the binlog storage engine and disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12c_sync_dw0:
  • throughput for l.i0 (load in PK order) is 3.01X larger for z12c_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 3.26X larger for z12c_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 2.78X larger for z12c_sync_dw0

MariaDB innovation: binlog_storage_engine, 48-core server, Insert Benchmark

MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB). See this blog post for more details on the new feature. This work was done by Small Datum LLC and sponsored by the MariaDB Foundation.
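A naive, single-threaded model shows why going from 2 fsyncs per commit to 1 matters most when fsync latency is high. The latencies and per-commit CPU cost below are illustrative assumptions, and the model ignores group commit, which batches many commits into one fsync:

```python
def commits_per_second(fsyncs_per_commit: int, fsync_latency_s: float,
                       other_work_s: float = 50e-6) -> float:
    """Naive model: commit time = CPU/other work + fsync time per commit."""
    return 1.0 / (other_work_s + fsyncs_per_commit * fsync_latency_s)

# High fsync latency (assume ~1 ms, as on the small server's SSD):
# halving the fsyncs per commit nearly doubles commit throughput.
slow_gain = commits_per_second(1, 1e-3) / commits_per_second(2, 1e-3)

# Low fsync latency (assume ~20 us, as on this server):
# the same change yields a much smaller gain.
fast_gain = commits_per_second(1, 20e-6) / commits_per_second(2, 20e-6)
```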

My previous post had results for sysbench with a small server. This post has results for the Insert Benchmark with a large (48-core) server. Storage on this server has a low fsync latency while the small server has high fsync latency.

In this test throughput doesn't improve with the InnoDB doublewrite buffer disabled. Even if it did I am not suggesting people do that for production workloads without understanding the risks it creates.

tl;dr

  • binlog storage engine makes some things better without making other things worse
  • binlog storage engine doesn't make all write-heavy steps faster because the commit path isn't the bottleneck in all cases on a server with storage that has low fsync latency

tl;dr for a CPU-bound workload

  • the l.i0 step (load in PK order) is ~1.3X faster with binlog storage engine
  • the l.i2 step (write-only with smaller transactions) is ~1.5X faster with binlog storage engine
tl;dr for an IO-bound workload
  • the l.i0 step (load in PK order) is ~1.08X faster with binlog storage engine
Builds, configuration and hardware

I compiled MariaDB 12.3.1 from source.

The server has 48 cores and 128G of RAM. Storage is two NVMe devices using ext-4 with discard enabled and RAID. The OS is Ubuntu 22.04. AMD SMT is disabled. The SSD has low fsync latency.

I tried 4 my.cnf files:
  • z12b_sync (binlog stored in raw files, as before this feature)
  • z12c_sync (uses the binlog storage engine)
    • my.cnf.cz12c_sync_c32r128 (z12c_sync) is like cz12c except it enables sync-on-commit for InnoDB. Note that InnoDB is used to store the binlog so there is nothing else to sync on commit.
  • z12b_sync_dw0 (like z12b_sync but with the InnoDB doublewrite buffer disabled)
  • z12c_sync_dw0 (like z12c_sync but with the InnoDB doublewrite buffer disabled)
The Benchmark

The benchmark is explained here. It was run with 20 clients for two workloads:
  • CPU-bound - the database is cached by InnoDB, but there is still much write IO
  • IO-bound - most, but not all, benchmark steps are IO-bound
The benchmark steps are:

  • l.i0
    • insert XM rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 10M for CPU-bound and 200M for IO-bound.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts XM rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. X is 40M for CPU-bound and 4M for IO-bound.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and YM rows are inserted and deleted per table. Y is 10M for CPU-bound and 1M for IO-bound.
    • Wait for S seconds after the step finishes to reduce MVCC GC debt and perf variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload. This step runs for 3600 seconds.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
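The SLA rule for the read-write steps can be sketched as a tiny check. The helper name, tolerance and sample numbers are assumptions, not the benchmark's actual code:

```python
def meets_insert_sla(inserts_done: int, elapsed_s: float,
                     target_rate: float, tolerance: float = 0.99) -> bool:
    """True if the background insert rate was sustained at (about) the target."""
    return inserts_done / elapsed_s >= target_rate * tolerance

# qr1000/qp1000 target 1000 inserts/s over the 3600-second step:
ok = meets_insert_sla(3_600_000, 3600, 1000)       # sustained the target
failed = meets_insert_sla(2_400_000, 3600, 1000)   # fell behind -> SLA failure
```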
Results: summary

The performance reports are here for CPU-bound and IO-bound.

The summary sections from the performance reports have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA. And from the third table for the IO-bound workload I see that there were failures to meet the SLA for qp500, qr500, qp1000 and qr1000.

I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from the base version.

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements.

I often use context switch rates as a proxy for mutex contention.

Results: CPU-bound

The summary is here.
  • Disabling the InnoDB doublewrite buffer doesn't improve performance.
With and without the InnoDB doublewrite buffer enabled, enabling the binlog storage engine improves throughput a lot for two of the write-heavy steps while there are only small changes on the other two write-heavy steps:
  • l.i0, load in PK order, gets ~1.3X more throughput
    • when the binlog storage engine is enabled (see here)
      • storage writes per insert (wpi) are reduced by about 1/2
      • KB written to storage per insert (wkbpi) is a bit smaller
      • context switches per insert (cspq) are reduced by about 1/3
  • l.x, create secondary indexes, is unchanged
    • when the binlog storage engine is enabled (see here)
      • storage writes per insert (wpi) are reduced by about 4/5
      • KB written to storage per insert (wkbpi) are reduced almost in half
      • context switches per insert (cspq) are reduced by about 1/4
  • l.i1, write-only with larger transactions, is unchanged
  • l.i2, write-only with smaller transactions, gets ~1.5X more throughput
Results: IO-bound

The summary is here.
  • Disabling the InnoDB doublewrite buffer doesn't improve performance.
  • For the read-write steps the insert SLA was not met for qr500, qp500, qr1000 and qp1000 because those steps needed more IOPS than the storage devices can provide. So I ignore those steps.
  • Enabling the InnoDB doublewrite buffer improves throughput by ~1.25X on the l.i2 step (write-only with smaller transactions) but doesn't change performance on the other steps.
    • as expected, KB written to storage per insert is much smaller with the doublewrite buffer disabled (see wkbpi here), but that doesn't show up as better throughput
  • Enabling the binlog storage engine improves throughput by 9% and 8% on the l.i0 step (load in PK order) but doesn't have a significant impact on other steps.
    • with the binlog storage engine there is a large reduction in storage writes per insert (wpi), a small reduction in KB written to storage per insert (wkbpi) and small increases in CPU per insert (cpupq) and context switches per insert (cspq) -- see here


