Thursday, April 18, 2024

Changes to managing turbo boost in Ubuntu 22.04 and Linux 6.5

I often use HWE kernels with Ubuntu and currently use Ubuntu 22.04. Until recently that meant I ran Linux 6.2 but after a recent update I am now on Linux 6.5.

I am far from an expert on this topic and what I write here might just be notes to myself. Be wary of following my advice.

Disabling turbo boost yesterday

I have been disabling turbo boost for many years on my home test servers to reduce performance variance from hardware, especially as the weather gets warm because I don't have a server room with AC.  The problem with turbo boost on some of my servers was cyclical behavior:

  1. CPU cools, turbo boost does its thing
  2. benchmark runs faster
  3. CPU gets hot
  4. turbo boost stops doing its thing
  5. benchmark runs slower
  6. repeat

On my Intel servers I disable turbo boost via BIOS settings. On my AMD servers that used to be done via a script because I was using acpi-cpufreq: echo 0 > /sys/devices/system/cpu/cpufreq/boost

My goal is repeatable performance and I am willing to sacrifice peak HW performance to get that. Avoiding the cycle described above helps to achieve that. Alas this is a spectrum -- I tolerate other things (CPU cache, database cache) that improve performance while adding variance. I assume that I want CPU frequency to stay within a narrow range. It isn't clear that even when using acpi-cpufreq that I was getting a narrow range, but it did help. 

For the Ryzen 7 7840HS CPU I use on these servers, the AMD specs state that the base speed is 3.8GHz and the max boost is up to 5.1GHz. With acpi-cpufreq the CPU cores can be in one of three frequency levels, and from cpupower frequency-info they are:

available frequency steps:  3.80 GHz, 2.20 GHz, 1.60 GHz

So even with turbo boost disabled (see the echo command above) there is still room for variance. But I don't know enough to determine whether I need to do more tuning.
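One way to check for that variance is to sample the per-core frequencies and compute the spread. A small sketch: the sample values below are the boost-disabled samples shown later in this post, while in practice they would be parsed from the output of cpupower -c all frequency-info.

```shell
# Summarize the spread of sampled per-core CPU frequencies (GHz)
printf '%s\n' 2.18 1.50 3.80 1.60 1.60 1.60 1.60 3.80 |
awk '{ f = $1 + 0
       if (NR == 1 || f < min) min = f
       if (f > max) max = f }
     END { printf "min=%.2f GHz max=%.2f GHz spread=%.2f GHz\n", min, max, max - min }'
```

A narrow spread suggests the tuning is working; a spread of 2+ GHz means frequency variance can still show up in benchmark results.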

Disabling turbo boost today

After a recent update on Ubuntu 22.04 with HWE kernels I now run 6.5.0-27-generic and acpi-cpufreq has been replaced by amd-pstate. I am sure there are many benefits from this change; alas, it also brings complexity and confusion for users who now have server cooling problems (because things run faster) and are trying to figure out how to fix them. Notes on setting up the server are here.

I noticed this change because with the default (amd-pstate in active mode) this file doesn't exist:

/sys/devices/system/cpu/cpufreq/boost

On a Ryzen 7 CPU I get the amd-pstate-epp driver in active mode. Output from /proc/cpuinfo and cpupower frequency-info for this state is below. Note that /sys/devices/system/cpu/cpufreq/boost doesn't exist when in active mode; it does exist in guided or passive mode. So I either need to switch to guided or passive mode or roll back to using the acpi-cpufreq driver. Which means I need to understand a bit more.
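As a sketch, the operating mode can also be switched at runtime through sysfs (this interface is described in the kernel's amd-pstate documentation; verify the paths on your kernel before relying on them):

```shell
# Show the current amd-pstate operating mode (active, passive or guided)
cat /sys/devices/system/cpu/amd_pstate/status

# Switch to passive mode; /sys/devices/system/cpu/cpufreq/boost should
# then exist, so boost can be disabled the old way (run as root)
echo passive > /sys/devices/system/cpu/amd_pstate/status
echo 0 > /sys/devices/system/cpu/cpufreq/boost
```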

There is a lot of documentation for the amd-pstate driver. It isn't meant for the casual user.

There is a big difference between acpi-cpufreq and amd-pstate; amd-pstate is the future, but perhaps not today (for me). While with acpi-cpufreq and turbo boost disabled I should only get one of three CPU frequencies, with amd-pstate I can get many more. From the cpupower frequency-info output:

analyzing CPU 7:
  driver: amd-pstate-epp
  CPUs which run at the same hardware frequency: 7
  CPUs which need to have their frequency coordinated by software: 7
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 400 MHz - 5.61 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 400 MHz and 5.61 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 2.97 GHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: no

For now I will just roll back to using the acpi-cpufreq driver while figuring this out, and possibly wait for Linux 6.6 to show up on Ubuntu 22.04. I am not sure how mature amd-pstate is, and I won't get support for cpupower set --turbo-boost 1 until 6.6 arrives.

I now have this in /etc/default/grub: 

GRUB_CMDLINE_LINUX_DEFAULT="pcie_aspm=off nosmt amd_pstate=disable"

  • pcie_aspm=off is there to avoid correctable PCI errors (maybe Beelink BIOS needs an update)
  • nosmt disables hyperthreads because BIOS doesn't have an option for that
  • amd_pstate=disable lets me use the acpi-cpufreq driver 
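A change to /etc/default/grub doesn't take effect until the grub config is regenerated and the host reboots; afterwards sysfs shows which driver is active. A quick sketch of those steps:

```shell
# Regenerate the grub config after editing /etc/default/grub, then reboot
sudo update-grub
sudo reboot

# After the reboot, confirm the cpufreq driver in use
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
```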

CPU frequencies with acpi-cpufreq

This shows the CPU frequencies I get from an idle server with the acpi-cpufreq driver. Note that I mostly get only 3 values when boost is disabled (set to 0).

With /sys/devices/system/cpu/cpufreq/boost set to 0

  current CPU frequency: 2.18 GHz (asserted by call to kernel)
  current CPU frequency: 1.50 GHz (asserted by call to kernel)
  current CPU frequency: 3.80 GHz (asserted by call to kernel)
  current CPU frequency: 1.60 GHz (asserted by call to kernel)
  current CPU frequency: 1.60 GHz (asserted by call to kernel)
  current CPU frequency: 1.60 GHz (asserted by call to kernel)
  current CPU frequency: 1.60 GHz (asserted by call to kernel)
  current CPU frequency: 3.80 GHz (asserted by call to kernel)

With /sys/devices/system/cpu/cpufreq/boost set to 1

  current CPU frequency: 1.60 GHz (asserted by call to kernel)
  current CPU frequency: 2.04 GHz (asserted by call to kernel)
  current CPU frequency: 2.11 GHz (asserted by call to kernel)
  current CPU frequency: 1.60 GHz (asserted by call to kernel)
  current CPU frequency: 1.60 GHz (asserted by call to kernel)
  current CPU frequency: 1.57 GHz (asserted by call to kernel)
  current CPU frequency: 1.60 GHz (asserted by call to kernel)
  current CPU frequency: 3.21 GHz (asserted by call to kernel)

Appendix

Note that cpupower frequency-info only shows frequencies for one core; to see them all use cpupower -c all frequency-info.

Output from cpupower frequency-info with active mode

analyzing CPU 7:
  driver: amd-pstate-epp
  CPUs which run at the same hardware frequency: 7
  CPUs which need to have their frequency coordinated by software: 7
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 400 MHz - 5.61 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 400 MHz and 5.61 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 2.97 GHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: no

Output from cpupower frequency-info with guided mode

analyzing CPU 7:
  driver: amd-pstate
  CPUs which run at the same hardware frequency: 7
  CPUs which need to have their frequency coordinated by software: 7
  maximum transition latency: 20.0 us
  hardware limits: 400 MHz - 5.61 GHz
  available cpufreq governors: conservative ondemand userspace powersave performance schedutil
  current policy: frequency should be within 400 MHz and 5.61 GHz.
                  The governor "schedutil" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 1.44 GHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: yes
    AMD PSTATE Highest Performance: 214. Maximum Frequency: 5.61 GHz.
    AMD PSTATE Nominal Performance: 145. Nominal Frequency: 3.80 GHz.
    AMD PSTATE Lowest Non-linear Performance: 42. Lowest Non-linear Frequency: 1.10 GHz.
    AMD PSTATE Lowest Performance: 16. Lowest Frequency: 400 MHz.

Output from cpupower frequency-info with passive mode

analyzing CPU 7:
  driver: amd-pstate
  CPUs which run at the same hardware frequency: 7
  CPUs which need to have their frequency coordinated by software: 7
  maximum transition latency: 20.0 us
  hardware limits: 400 MHz - 5.61 GHz
  available cpufreq governors: conservative ondemand userspace powersave performance schedutil
  current policy: frequency should be within 400 MHz and 5.61 GHz.
                  The governor "schedutil" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 2.74 GHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: yes
    AMD PSTATE Highest Performance: 214. Maximum Frequency: 5.61 GHz.
    AMD PSTATE Nominal Performance: 145. Nominal Frequency: 3.80 GHz.
    AMD PSTATE Lowest Non-linear Performance: 42. Lowest Non-linear Frequency: 1.10 GHz.
    AMD PSTATE Lowest Performance: 16. Lowest Frequency: 400 MHz.

Output from /proc/cpuinfo

processor : 7
vendor_id : AuthenticAMD
cpu family : 25
model : 116
model name : AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
stepping : 1
microcode : 0xa704103
cpu MHz : 3800.000
cache size : 1024 KB
physical id : 0
siblings : 8
core id : 7
cpu cores : 8
apicid : 14
initial apicid : 14
fpu : yes
fpu_exception : yes
cpuid level : 16
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca flush_l1d
bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso
bogomips : 7585.46
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14] [15]

Status as of April 18, 2024

The problem isn't resolved. I tried both the current HWE and non-HWE kernels with a variety of kernel boot params but the result was that one or both of the new SER7 servers were slower. I tried all of these:
GRUB_CMDLINE_LINUX_DEFAULT="pcie_aspm.policy=performance"
GRUB_CMDLINE_LINUX_DEFAULT="pcie_aspm=off nosmt amd_pstate=disable"
GRUB_CMDLINE_LINUX_DEFAULT="pcie_aspm=off"
GRUB_CMDLINE_LINUX_DEFAULT="amd_pstate=disable"
GRUB_CMDLINE_LINUX_DEFAULT="nosmt"
GRUB_CMDLINE_LINUX_DEFAULT=""

Results from tests are here. In many cases the CPU overhead (user+system) is significantly different on the new servers compared to the old one.

The selection of small servers (with small TDP) that don't have a mix of performance and efficiency cores is limited (the few available use AMD Ryzen CPUs). I might replace the Beelink SER7 with the ASUS PN53.

Saturday, April 13, 2024

The Insert Benchmark: MariaDB, MySQL, small server, cached workload

This post has results for the Insert Benchmark on a small server with a cached workload. The goal is to compare MariaDB and MySQL.

This work was done by Small Datum LLC and sponsored by the MariaDB Foundation.

The workload here has low concurrency and the database is cached. The results might be different when the workload is IO-bound or has more concurrency.

tl;dr

  • Modern MariaDB (11.4.1) is faster than modern MySQL (8.0.36) on all benchmark steps except for qr* (range query) and l.x (create index) where they have similar performance.
  • Modern MariaDB (11.4.1) was at most 15% slower than older MariaDB (10.2.44). It is nice to see that MariaDB has done a great job of avoiding performance regressions over time.
  • There are significant performance regressions from older MySQL (5.6) to modern MySQL (8.0)
  • Performance with innodb_flush_method=O_DIRECT_NO_FSYNC is better than with =O_DIRECT, which in turn is better than with =fsync.

Build + Configuration

This report has results for InnoDB from:
  • MySQL - versions 5.6.51, 5.7.44 and 8.0.36
  • MariaDB - versions 10.2.44, 10.3.39, 10.4.33, 10.5.24, 10.6.17, 10.11.7, 11.4.1. Versions 10.2, 10.3, 10.4, 10.5, 10.6 and 10.11 are the most recent LTS releases and 11.4 will be the next LTS release.
All of the my.cnf files are here. I tried to use similar configurations across releases, but that isn't always possible. And even when it was possible I favor setting fewer entries, especially for options where the default value changes between releases.

I started with the my.cnf.cz11a_bee config and then began to make small changes. For all configs I set these values to limit the size of the history list, which also keeps the database from growing larger than expected. I rarely did this in the past.
innodb_max_purge_lag=500000
innodb_max_purge_lag_delay=1000000

Some of the changes were challenging when trying to make things comparable. 
  • the InnoDB change buffer was removed in MariaDB 11.4. 
    • I disable it in all my.cnf files for all MariaDB versions except for the my.cnf.cz11d and my.cnf.cz11d1 configs.
    • I don't disable it for the MySQL configs named my.cnf.cz11[abc]_bee but I do disable it for the my.cnf.cz11d_bee config used by MySQL. The result is that for MariaDB the my.cnf.cz11d_bee config enables the change buffer while for MySQL it disables it. Sorry for the confusion.
  • innodb_buffer_pool_instances was removed in 10.5 (assume it is =1). 
    • I don't set it to =1 in the my.cnf.cz[abc]_bee configs for MariaDB 10.2, 10.3, 10.4
  • innodb_flush_method was removed in 11.4 and there is a new way to configure this.
    • In 11.4.1 there is an equivalent of =O_DIRECT but not of =O_DIRECT_NO_FSYNC
For MariaDB the typical my.cnf files were:
  • my.cnf.cz11a_bee - uses innodb_flush_method=O_DIRECT_NO_FSYNC
  • my.cnf.cz11a1_bee - like my.cnf.cz11a_bee but reduces the sizes of the buffer pool and redo log so that both fit in memory
  • my.cnf.cz11b_bee - uses innodb_flush_method=O_DIRECT
  • my.cnf.cz11b1_bee - like my.cnf.cz11b_bee but reduces the sizes of the buffer pool and redo log so that both fit in memory
  • my.cnf.cz11c_bee - uses innodb_flush_method=fsync
  • my.cnf.cz11d_bee - uses innodb_flush_method=O_DIRECT_NO_FSYNC and enables the InnoDB change buffer
  • my.cnf.cz11d1_bee - like my.cnf.cz11d_bee but reduces the sizes of the buffer pool and redo log so that both fit in memory
For MySQL the typical my.cnf files were:
  • my.cnf.cz11a_bee - uses innodb_flush_method=O_DIRECT_NO_FSYNC
  • my.cnf.cz11b_bee - uses innodb_flush_method=O_DIRECT
  • my.cnf.cz11c_bee - uses innodb_flush_method=fsync
  • my.cnf.cz11d_bee - uses innodb_flush_method=O_DIRECT_NO_FSYNC and disables the InnoDB change buffer. Note the my.cnf.cz11[abc]_bee configs for MySQL enabled it. This is the opposite of what is done for MariaDB.
The Benchmark

The benchmark is run with 1 client. It is explained here and was run in two setups with a cached workload. The initial load (l.i0) inserts 30M rows and the database fits in memory. I recently updated the Insert Benchmark to change how deletes are done and the new delete SQL is here.

The test server was named v4 here and is a Beelink SER4. It has 8 cores, 16G RAM, Ubuntu 22.04 and XFS using 1 m.2 device.

The benchmark steps are:

  • l.i0
    • insert 30 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts 40M rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and 10M rows are inserted and deleted.
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results

The performance reports are here for:
The summary in each performance report has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table, which makes it easy to see how performance changes over time. The third shows the background insert rate for benchmark steps with background inserts, which makes it easy to see which DBMS+configs failed to meet the SLA; here all systems sustained the target rates.

Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.
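The relative QPS and color bucketing above can be sketched as a small helper; the QPS values in the example calls are made up for illustration and are not from the reports:

```shell
# relative QPS = (QPS for $me / QPS for $base), bucketed as:
# red for <= 0.95, green for >= 1.05, grey in between
relative_qps() {
  awk -v me="$1" -v base="$2" 'BEGIN {
    r = me / base
    if (r <= 0.95)      color = "red"
    else if (r >= 1.05) color = "green"
    else                color = "grey"
    printf "%.2f %s\n", r, color
  }'
}
relative_qps 4200 5000   # a clear regression
relative_qps 5100 5000   # within the grey zone
```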

Results: MariaDB vs MySQL

Modern MariaDB (11.4.1) is faster than modern MySQL (8.0.36) on all benchmark steps except for qr* (range query) and l.x (create index) where they have similar performance.
  • The base case is MariaDB 10.11.7 with the cz11a_bee config (ma101107_rel.cz11a_bee). It is compared with
    • MariaDB 11.4.1 with the cz11b_bee config (ma110401_rel.cz11b_bee)
    • MySQL 8.0.36 with the cz11a_bee config (my8036_rel.cz11a_bee)
  • Relative throughput per benchmark step
    • l.i0
      • relative QPS is 0.98 in MariaDB 11.4.1
      • relative QPS is 0.68 in MySQL 8.0.36
    • l.x - I ignore this for now
    • l.i1, l.i2
      • relative QPS is 0.93, 0.97 in MariaDB 11.4.1
      • relative QPS is 0.88, 0.78 in MySQL 8.0.36
    • qr100, qr500, qr1000
      • relative QPS is 1.03, 1.02, 1.01 in MariaDB 11.4.1
      • relative QPS is 1.02, 1.04, 1.04 in MySQL 8.0.36
    • qp100, qp500, qp1000
      • relative QPS is 1.00, 1.00, 1.00 in MariaDB 11.4.1
      • relative QPS is 0.76, 0.74, 0.75 in MySQL 8.0.36

Results: MariaDB

Modern MariaDB (11.4.1) was at most 15% slower than older MariaDB (10.2.44). It is nice to see that MariaDB has done a great job of avoiding performance regressions over time.

  • The base case is MariaDB 10.2.44 with the cz11a_bee config (ma100244_rel.cz11a_bee). It is compared with more recent LTS releases from 10.3, 10.4, 10.5, 10.6, 10.11 and 11.4.
  • Throughput per benchmark step for 11.4.1 relative to 10.2.44
    • l.i0
      • relative QPS is 0.86 in MariaDB 11.4.1
    • l.x - I ignore this for now
    • l.i1, l.i2
      • relative QPS is 1.08, 1.07 in MariaDB 11.4.1
    • qr100, qr500, qr1000
      • relative QPS is 0.89, 0.86, 0.85 in MariaDB 11.4.1
    • qp100, qp500, qp1000
      • relative QPS is 0.91, 0.91, 0.88 in MariaDB 11.4.1
Results: MariaDB 10.11

  • The base case is MariaDB 10.11.7 with the cz11a_bee config (ma101107_rel.cz11a_bee) and 10.11.7 releases with other configs are compared to it.
  • Performance does not change with the cz11a1_bee config that uses a smaller size for the InnoDB redo log and buffer pool so that they fit in memory. In workloads that are more IO-bound I see that MariaDB 10.11+ does read IO for the redo log while MySQL does not. Shrinking the buffer pool and redo log isn't good for performance but avoids the read IO.
  • Performance is worse with the cz11b_bee config on the random write (l.i1, l.i2) benchmark steps. This indicates that setting innodb_flush_method to O_DIRECT_NO_FSYNC is better than setting it to O_DIRECT.
  • Performance is worse with the cz11c_bee config on the random write (l.i1, l.i2) benchmark steps. This indicates that setting innodb_flush_method to O_DIRECT_NO_FSYNC is better than setting it to fsync.
  • Performance does not change with the cz11d_bee config which enabled the InnoDB change buffer. Given that the working set fits in the InnoDB buffer pool the result is expected.
  • Performance is worse with the cz11d1_bee config on the random write (l.i1, l.i2) benchmark steps. See the comments above for cz11a1_bee.
Results: MariaDB 11.4
  • The base case is MariaDB 11.4.1 with the cz11b_bee config (ma110401.cz11b_bee) and 11.4.1 releases with other configs are compared to it. Note that 11.4.1 does not support the equivalent of O_DIRECT_NO_FSYNC for innodb_flush_method.
  • Performance is worse with the cz11b1_bee config on the random write (l.i1, l.i2) benchmark steps. See the comments in the previous section on the cz11a1_bee config.
  • Performance is slightly worse with the cz11c_bee config on the random write (l.i1, l.i2) benchmark steps. This indicates that setting innodb_flush_method to O_DIRECT_NO_FSYNC is better than setting it to fsync.
Results: MySQL

There are significant performance regressions from MySQL 5.6 to 8.0.

  • The base case is MySQL 5.6.51 with the cz11a_bee config (my5651_rel.cz11a_bee) and it is compared to MySQL 5.7.44 and 8.0.36.
  • Relative throughput per benchmark step
    • l.i0
      • relative QPS is 0.84 in MySQL 5.7.44
      • relative QPS is 0.57 in MySQL 8.0.36
    • l.x - I ignore this for now
    • l.i1, l.i2
      • relative QPS is 1.17, 0.89 in MySQL 5.7.44
      • relative QPS is 1.07, 0.73 in MySQL 8.0.36
    • qr100, qr500, qr1000
      • relative QPS is 0.73, 0.72, 0.71 in MySQL 5.7.44
      • relative QPS is 0.63, 0.63, 0.63 in MySQL 8.0.36
    • qp100, qp500, qp1000
      • relative QPS is 0.82, 0.79, 0.80 in MySQL 5.7.44
      • relative QPS is 0.62, 0.60, 0.62 in MySQL 8.0.36

Thursday, April 11, 2024

Sysbench on a (less) small server: MariaDB and MySQL

This has results from the sysbench benchmark for MariaDB and MySQL on a (less) small server with a cached and low-concurrency workload. For MariaDB I tested LTS releases from 10.2 through 11.4. For MySQL I tested 5.6, 5.7 and 8.0. The results from MySQL here are a good reason to use changepoint detection to spot regressions early, like that provided by Nyrkiö.

This work was done by Small Datum LLC and sponsored by the MariaDB Foundation. A previous post shared results from my smallest and oldest server. This has results from two newer and faster, but still small, servers. Regardless, the conclusions don't change.

My standard disclaimer is that sysbench with low-concurrency is great for spotting CPU regressions. However, a result with higher concurrency from a larger server is also needed to understand things. Results from IO-bound workloads and less synthetic workloads are also needed. But low-concurrency, cached sysbench is a great place to start.

tl;dr

  • MariaDB is great at avoiding CPU regressions over time
  • MySQL is less than great at avoiding CPU regressions over time
  • Modern MariaDB is 13% to 36% faster than modern MySQL (on this setup)
  • Enabling the InnoDB change buffer does not improve results here. It might help for IO-bound sysbench.
  • Modern MariaDB does reads from storage for the redo log (read-modify-write) while MySQL does not. I know the MariaDB redo log architecture has changed from MySQL -- MariaDB uses just one large redo log file. But I can't fully explain this.
Builds and configuration

For MariaDB I used the latest point releases from LTS versions: 10.2.44, 10.3.39, 10.4.33, 10.5.24, 10.6.17, 10.11.7 and the upcoming LTS 11.4.1. 

For MySQL I used 5.6.51, 5.7.44 and 8.0.36.

The first GA release for MariaDB 10.2 was in 2017. The first GA release for MySQL 5.6 was in 2013. So while I cannot claim that my testing covers MySQL and MariaDB from the same time period, I can claim that I am testing old versions of both.

Everything was compiled from source with similar CMake command lines and CMAKE_BUILD_TYPE set to Release. It is much easier to compile older MariaDB releases than older MySQL releases. For MariaDB I did not have to edit any source files. Notes on compiling MySQL are here for 5.6, for 5.7 and for 8.0. A note on using cmake is here.

The my.cnf files are here for the ser4 and the ser7 servers. Note that the ser7 server has a faster CPU and twice as much RAM.

Benchmarks

I used sysbench and my usage is explained here. There are 42 microbenchmarks and each tests ~1 type of SQL statement.

Tests were run on two variants of the small servers I have at home. Each has 8 cores and NVMe SSD with XFS and Ubuntu 22.04. The servers are described here.
  • ser4 - this is the v4 server here with 16G of RAM and has a Ryzen 7 4700U CPU. It is a Beelink SER 4700U so I call it the ser4 (or SER4).
  • ser7 - this is the v5 server here with 32G of RAM and has a Ryzen 7 7840HS CPU. It is a Beelink SER7 so I call it the ser7 (or SER7).
The benchmark is run with:
  • one connection
  • 30M rows and a database cached by InnoDB
  • each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise
  • prepared statements were enabled
The command line was: bash r.sh 1 30000000 300 600 nvme0n1 1 1 1

Results

For the results below I split the 42 microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. The spreadsheets with all data and charts are here for ser4 and for ser7.

All of the charts have relative throughput on the y-axis where that is (QPS for $me) / (QPS for $base), $me is some DBMS version (for example MariaDB 11.4.1) and $base is the DBMS version for the base case. The base version is specified below depending on what I am comparing. The y-axis doesn't start at 0 to improve readability. When the relative throughput is > 1 then the throughput on some DBMS version is greater than the throughput for the base case.

The legend under the x-axis truncates the names I use for the microbenchmarks and I don't know how to fix that other than sharing links (see above) to the Google Sheets I used.

Results: MariaDB from old to new

This section uses MariaDB 10.2.44 as the base version and then compares that with MariaDB versions 10.3.39, 10.4.33, 10.5.24, 10.6.17, 10.11.7 and 11.4.1. The goal is to determine how throughput (QPS) changes from older releases like 10.2 to the latest release (11.4). 

These tables have summary statistics from ser4 and ser7 of the relative throughput for MariaDB 11.4.1 vs 10.2.44 for each of the microbenchmark groups. A value greater than one means the throughput for MariaDB 11.4.1 is better than for 10.2.44. From the results here, new MariaDB (11.4.1) retains at least 91% on ser4 and 93% on ser7 of the throughput of old MariaDB (10.2.44), using the median relative throughput per microbenchmark group. New features added to MariaDB don't get in the way of performance because there aren't significant regressions over time.

from ser4   point, part 1   point, part 2   range, part 1   range, part 2   writes
average     0.96            0.95            0.99            0.96            0.91
median      0.96            0.97            0.99            0.94            0.91
min         0.90            0.91            0.93            0.88            0.77
max         0.98            0.98            1.11            1.07            1.03
stddev      0.02            0.03            0.05            0.09            0.07

from ser7   point, part 1   point, part 2   range, part 1   range, part 2   writes
average     0.94            0.94            0.95            0.98            0.92
median      0.96            0.96            0.93            0.96            0.95
min         0.87            0.88            0.90            0.93            0.76
max         0.97            0.99            1.10            1.04            1.02
stddev      0.03            0.05            0.06            0.04            0.08

There are two graphs per microbenchmark group - first for ser4 and second for ser7. The y-axis doesn't begin at zero to improve readability. The y-axis starts at 0.75 for ser4 and 0.60 for ser7.
Graphs for point queries, part 2.
Graphs for range queries, part 1.
Graphs for range queries, part 2.
Graphs for writes.
Results: MySQL from old to new

This section uses MySQL 5.6.51 as the base version and then compares that with MySQL versions 5.7.44 and 8.0.36. The goal is to determine how throughput (QPS) changes from older to newer releases. 

These tables have summary statistics from ser4 and ser7 of the relative throughput for MySQL 8.0.36 vs 5.6.51 for each of the microbenchmark groups. A value greater than one means the throughput for MySQL 8.0.36 is better than for 5.6.51. From the results here, new MySQL (8.0.36) gets between 56% and 89% of the throughput relative to old MySQL (5.6.51) using the median relative throughput per microbenchmark group. New features in modern MySQL come at the cost of much CPU overhead.

from ser4   point, part 1   point, part 2   range, part 1   range, part 2   writes
average     0.72            0.71            0.66            0.81            0.61
median      0.70            0.70            0.65            0.76            0.56
min         0.67            0.61            0.63            0.66            0.44
max         0.78            0.82            0.72            1.00            0.99
stddev      0.04            0.09            0.03            0.15            0.16

from ser7   point, part 1   point, part 2   range, part 1   range, part 2   writes
average     0.77            0.77            0.72            0.92            0.71
median      0.80            0.75            0.73            0.89            0.67
min         0.65            0.69            0.62            0.80            0.51
max         0.84            0.86            0.77            1.07            1.10
stddev      0.05            0.07            0.05            0.12            0.18

There are two graphs per microbenchmark group - first for ser4 and second for ser7. The y-axis doesn't begin at zero to improve readability. The y-axis starts at 0.40 for ser4 and 0.50 for ser7.
Graphs for point queries, part 2.
Graphs for range queries, part 1.
Graphs for range queries, part 2.
Graphs for writes.
Results: MariaDB vs MySQL

This section uses MySQL 8.0.36 as the base version and then compares that with MariaDB 11.4.1. The goal is to determine which DBMS gets more throughput (or uses less CPU/query).

These tables have summary statistics from ser4 and ser7 of the relative throughput for MariaDB 11.4.1 vs MySQL 8.0.36 for each of the microbenchmark groups. A value greater than one means the throughput for MariaDB 11.4.1 is better than for MySQL 8.0.36. From the results here, modern MariaDB (11.4.1) gets between 113% and 136% of the throughput relative to modern MySQL (8.0.36) using the median relative throughput per microbenchmark group. Modern MariaDB is faster than modern MySQL (on this setup) because MySQL has more performance regressions over time.

from ser4   point, part 1   point, part 2   range, part 1   range, part 2   writes
average     1.17            1.21            1.21            1.19            1.31
median      1.15            1.22            1.21            1.17            1.36
min         1.12            1.19            1.15            1.16            1.11
max         1.30            1.24            1.37            1.25            1.48
stddev      0.05            0.02            0.07            0.04            0.12

from ser7   point, part 1   point, part 2   range, part 1   range, part 2   writes
average     1.16            1.20            1.18            1.17            1.23
median      1.13            1.20            1.15            1.15            1.28
min         1.10            1.16            1.07            1.10            0.96
max         1.32            1.24            1.40            1.26            1.41
stddev      0.06            0.03            0.10            0.07            0.14

There are two graphs per microbenchmark group - first for ser4 and second for ser7. The y-axis doesn't begin at zero to improve readability. The y-axis starts at 1.00 for both ser4 and ser7.
Graphs for point queries, part 2.
Graphs for range queries, part 1.
Graphs for range queries, part 2.
Graphs for writes. The bar for update-index_range=1000 isn't visible on the second graph because the relative throughput is 0.96 but the y-axis starts at 1.00.
Results: change buffer

The InnoDB change buffer was removed from MariaDB 11.4. That feature has been good to me during the Insert Benchmark so I repeated tests in MariaDB 10.11.7 with it enabled and disabled. For the sysbench microbenchmarks and this workload (low concurrency, cached database) there is no benefit from it. It might have a benefit for some sysbench microbenchmarks when the database isn't cached.

This post is already too long so I won't provide charts but the raw data is here for ser4 and for ser7.

Results: disk read mystery

The database is cached by InnoDB for these tests but the sum of the sizes of the InnoDB buffer pool and InnoDB redo logs is larger than memory. And that is a normal (best practice) configuration. Unfortunately, starting in MariaDB 10.11 there is read IO during some of the microbenchmark steps. All of the configs use either O_DIRECT or O_DIRECT_NO_FSYNC for the InnoDB flush method and I confirmed from the output of SHOW ENGINE INNODB STATUS that database file reads aren't being done.

What might be the source of the reads? My first guess is read-modify-write (RMW) from the InnoDB redo log. The problem is that the redo log uses buffered IO (not O_DIRECT), redo log writes are done as a multiple of 512 bytes and writing to the first 512 bytes of a 4kb filesystem page will do a storage read if that 4kb page isn't in the OS page cache. Note that innodb_log_write_ahead_size in MySQL 8 reduces the chance of this happening. It was my first guess because we suffered from it long ago with web-scale MySQL.
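The RMW condition is just alignment arithmetic: a buffered write that does not cover whole filesystem pages may force the kernel to read the partially covered page first. A sketch of that check, where the page and sector sizes are my assumptions (4 KiB page cache pages, 512-byte redo writes), not values read from InnoDB:

```python
PAGE = 4096    # assumed page-cache / filesystem page size
SECTOR = 512   # assumed redo log write granularity

def needs_page_read(offset, length):
    """True if a buffered write at (offset, length) only partially covers
    some page, so the kernel may have to read that page first (RMW)
    when it isn't already in the OS page cache."""
    return offset % PAGE != 0 or length % PAGE != 0

# A 512-byte write leaves 3584 bytes of the 4 KiB page uncovered:
print(needs_page_read(0, SECTOR))   # True
print(needs_page_read(0, PAGE))     # False: whole page overwritten
```

This is also why padding writes out to a page boundary (the idea behind innodb_log_write_ahead_size in MySQL 8) reduces the chance of the read.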

This happens on the ser4 server but not the ser7 server. The ser7 server has twice as much RAM but uses the same size tables, so there is more spare RAM and perhaps that makes it easier for Linux to cache the InnoDB redo log.

I ran additional tests where the InnoDB redo log and the InnoDB data files used different storage devices. All of the read IO occurred on the storage device for the redo log, so my guess was correct, but why this happens for modern MariaDB and not for modern MySQL remains a mystery.

I also looked at strace output from MariaDB and MySQL and didn't see anything different, although the redo log architecture has changed in MariaDB: it now uses just one redo log file while MySQL uses as many as you configure. The my.cnf files I used are here for the ser4 and the ser7 servers.

The unexpected storage reads might not hurt performance here because I use innodb_flush_log_at_trx_commit=2 (no sync on commit), so user transactions might not have to wait for that read on commit.
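For reference, a minimal my.cnf fragment for that setting; this is illustrative, not my full config (those are linked above):

```ini
[mysqld]
# 2 = write the redo log at commit but fsync roughly once per second,
# so a commit does not wait for an fsync
innodb_flush_log_at_trx_commit = 2
```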

I have a script that provides iostat and vmstat metrics normalized by throughput (operations/s) to show how much hardware is used per request. And the problem is visible here for several microbenchmark steps including update-inlist.range100.pk1 and update-index.range100.pk1. The columns in the output are:
  • cpu/o - CPU per operation (from vmstat us + sy)
  • cs/o - context switches per operation (from vmstat cs)
  • r/o - storage reads per operation (from iostat r/s)
  • rKB/o - storage KB read per operation (from iostat rKB/s)
  • wKB/o - storage KB written per operation (from iostat wKB/s)
  • o/s - throughput, QPS, operations/s
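The normalization is simple division of the OS counter rates by throughput. A sketch of what the script computes, with hypothetical counter values (the real numbers come from iostat and vmstat during the benchmark):

```python
def per_operation(vmstat_us_sy, vmstat_cs, iostat_rs, iostat_rkbs,
                  iostat_wkbs, qps):
    """Normalize per-second OS counters by operations/s to get
    per-operation hardware costs."""
    return {
        "cpu/o": vmstat_us_sy / qps,   # CPU per operation
        "cs/o":  vmstat_cs / qps,      # context switches per operation
        "r/o":   iostat_rs / qps,      # storage reads per operation
        "rKB/o": iostat_rkbs / qps,    # KB read per operation
        "wKB/o": iostat_wkbs / qps,    # KB written per operation
    }

# Hypothetical sample: 31200 context switches/s at 3900 QPS -> 8 cs/o
m = per_operation(16.0, 31200.0, 300.0, 1200.0, 124000.0, 3900.0)
```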
In the first table below the value for r/o is 0 prior to MariaDB 10.11.7 (ma101107_rel) and then is 0.077 for MariaDB 10.11.7 and 0.034 for MariaDB 11.4.1; in the second table the values are larger (about 0.86). While I don't show it here, this problem also occurs for MySQL 5.6.51 but not for MySQL 5.7.44 or 8.0.36.

The data from the tables below is here for ser4 and for ser7.

sb.met.update-inlist.range100.pk1.dop1
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.006937        36.076  0       0       90.665  3966    x.ma100244_rel.z11a_bee.pk1
0.007467        35.345  0       0       89.534  3845    x.ma100339_rel.z11a_bee.pk1
0.007336        35.395  0       0       91.588  3722    x.ma100433_rel.z11a_bee.pk1
0.005755        37.579  0       0       7.408   4126    x.ma100524_rel.z11a_bee.pk1
0.004227        8.442   0       0       33.077  3890    x.ma100617_rel.z11a_bee.pk1
0.004253        8.595   0.077   0.306   33.099  3856    x.ma101107_rel.z11a_bee.pk1
0.004430        8.462   0.034   0.136   32.22   3650    x.ma110401_rel.z11b_bee.pk1
--- relative to first result
1.08            0.98    1       1       0.99    0.97    x.ma100339_rel.z11a_bee.pk1
1.06            0.98    1       1       1.01    0.94    x.ma100433_rel.z11a_bee.pk1
0.83            1.04    1       1       0.08    1.04    x.ma100524_rel.z11a_bee.pk1
0.61            0.23    1       1       0.36    0.98    x.ma100617_rel.z11a_bee.pk1
0.61            0.24    inf     inf     0.37    0.97    x.ma101107_rel.z11a_bee.pk1
0.64            0.23    inf     inf     0.36    0.92    x.ma110401_rel.z11b_bee.pk1

sb.met.update-index.range100.pk1.dop1
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.015003        34.740  0       0       110.865 2268    x.ma100244_rel.z11a_bee.pk1
0.016460        39.791  0       0       123.805 1715    x.ma100339_rel.z11a_bee.pk1
0.016282        39.626  0       0       124.773 1734    x.ma100433_rel.z11a_bee.pk1
0.010618        55.358  0       0       111.238 2277    x.ma100524_rel.z11a_bee.pk1
0.007834        12.873  0       0       53.445  1997    x.ma100617_rel.z11a_bee.pk1
0.006468        14.170  0.857   3.426   50.978  2453    x.ma101107_rel.z11a_bee.pk1
0.006733        14.250  0.859   3.434   51.687  2337    x.ma110401_rel.z11b_bee.pk1
--- relative to first result
1.10            1.15    1       1       1.12    0.76    x.ma100339_rel.z11a_bee.pk1
1.09            1.14    1       1       1.13    0.76    x.ma100433_rel.z11a_bee.pk1
0.71            1.59    1       1       1.00    1.00    x.ma100524_rel.z11a_bee.pk1
0.52            0.37    1       1       0.48    0.88    x.ma100617_rel.z11a_bee.pk1
0.43            0.41    inf     inf     0.46    1.08    x.ma101107_rel.z11a_bee.pk1
0.45            0.41    inf     inf     0.47    1.03    x.ma110401_rel.z11b_bee.pk1