Friday, July 26, 2024

Sysbench on a medium server: MariaDB is faster than MySQL

Modern MariaDB is significantly faster than modern MySQL on a medium server with cached sysbench because MySQL suffers from too many CPU performance regressions (code bloat, etc) over time.

This post has results for sysbench with a cached database and medium server (16 cores) using MariaDB and MySQL. The benchmark is run with 12 threads. A full writeup for MySQL on this HW is here and for MariaDB on a smaller server is here.

Note that the great results for MariaDB only occur when MDEV-33894 has been fixed in versions 10.11 or greater -- the bug doesn't exist prior to 10.11. I reported that bug a few months ago and then worked with the MariaDB core team to explain the problem and validate the fix.

The gap between MariaDB and MySQL is likely to grow. While MySQL 8.0.40 might have fixes that undo some of the regressions that arrived in InnoDB after 8.0.28 (see PS-8822 and MySQL 111538), I expect to see improvements in MariaDB as well. I opened three perf bugs for MariaDB earlier this year. The great results here are due in part to fixing MDEV-33894 and there will be even better results in the future when the two other bugs are fixed: MDEV-34178 and MDEV-33966.

tl;dr

  • The fix for MDEV-33894 is important
  • MariaDB has done great at avoiding CPU performance regressions over time, MySQL has not
  • Modern MariaDB is faster than modern MySQL
    • MariaDB was faster on 41 microbenchmarks and slower on one (update-inlist)
    • A typical result is MariaDB is ~11% faster than MySQL

Builds, configuration and hardware

For MySQL I compiled 8.0.38 from source.

For MariaDB I compiled these versions from source:
  • ma100244_rel_withdbg
    • MariaDB 10.2.44
  • ma100339_rel_withdbg
    • MariaDB 10.3.29
  • ma100434_rel_withdbg
    • MariaDB 10.4.34
  • ma100525_rel_withdbg
    • MariaDB 10.5.25
  • ma100618_rel_withdbg
    • MariaDB 10.6.18
  • ma101108_rel_withdbg
    • MariaDB 10.11.8
  • ma1011_240723_c944cd6e_ilwas
    • MariaDB 10.11 compiled on 24/07/23 at git hash c944cd6f with innodb_log_writeahead_size =4096. Note the git hash in ma1011_240723_c944cd6e_ilwas is a typo (should end in 6f)
  • ma110402_rel_withdbg
    • MariaDB 11.4.2
  • ma1104_240723_2ee061c2_ilwas
    • MariaDB 11.4 compiled on 24/07/23 at git hash 2ee061c2 with innodb_log_writeahead_size =4096
  • ma110501_rel_withdbg
    • MariaDB 11.5.1
  • ma1105_240723_2f4b0ba3_ilwas
    • MariaDB 11.5 compiled on 24/07/23 at git hash 2f4b0ba3 with innodb_log_writeahead_size =4096
  • ma110600_rel_withdbg
    • MariaDB 11.6.0
  • ma1106_240723_d8374262_ilwas
    • MariaDB 11.6 compiled on 24/07/23 at git hash d8374262 with innodb_log_writeahead_size =4096
The server is a c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU, 64G RAM and SMT disabled so there are 16 cores. It uses Ubuntu 22.04 and storage is ext4 (data=writeback) using SW RAID 0 over 2 locally attached NVMe devices.
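
For reference, here is a minimal sketch of how such a filesystem could be assembled, assuming the two NVMe devices are /dev/nvme0n1 and /dev/nvme1n1 and the mount point is /data (device names and the mount point are placeholders, not details from my setup):

# Assemble SW RAID 0 over the two locally attached NVMe devices
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
# Create ext4 and mount it with data=writeback
sudo mkfs.ext4 /dev/md0
sudo mount -o data=writeback /dev/md0 /data
# Disable SMT so the 32 vCPUs become 16 cores
echo off | sudo tee /sys/devices/system/cpu/smt/control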

The my.cnf file is here for MySQL 8.0.38.

The my.cnf files for MariaDB are here in per-version subdirectories. For 10.x the cz11a_c24r64 variant was used and for 11.x the cz11b_c24r64 variant was used. For the *_ilwas binaries above I added innodb_log_writeahead_size =4096 to the my.cnf file, although that wasn't needed because the value was already =4096 when I didn't set it in my.cnf.
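
To verify the effective value at runtime, the server can be queried directly. This is a minimal check and assumes the documented spelling of the variable, innodb_log_write_ahead_size:

mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_log_write_ahead_size'"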

Benchmark

I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by InnoDB.

The benchmark is run with 12 threads, 8 tables and 10M rows per table. Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

The command lines for my helper scripts were:
# 8 tables, 10M rows/table, 12 threads
bash r.sh 8 10000000 300 600 md0 1 1 12
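
The r.sh script is a wrapper around sysbench. For illustration, a direct invocation of one read-only microbenchmark with the same shape might look like the sketch below; the Lua script name and connection options are assumptions, not taken from r.sh:

sysbench oltp_point_select --tables=8 --table-size=10000000 \
    --threads=12 --time=300 --db-ps-mode=auto run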

Results

For the results below I split the 42 microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. The spreadsheet with all data is here. For each microbenchmark group there is a table with summary statistics. I don't have charts because that would use too much space, but the results per microbenchmark are in the spreadsheets.

The numbers in the spreadsheets and the tables below are the relative QPS which is (QPS for my version) / (QPS for base case). When the relative throughput is > 1 then that version is faster than the base case.
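
For example, if a version gets 110,000 QPS on a microbenchmark where the base case gets 100,000 QPS, then the relative QPS is 110000 / 100000 = 1.10 and that version is 10% faster.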

For the results below the base case varies and is named in each section.

Results: iostat and vmstat

This section has data from vmstat and iostat normalized by throughput to understand HW consumption per database operation. When MySQL is slower than MariaDB the usual culprit is CPU overhead (see the cpu/o column) and one example is the result for scans (see here) where cpu/o is 0.059403 for MySQL vs 0.043163 for MariaDB which is ~1.38X larger for MySQL.
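
The normalization is simple division. Below is a minimal sketch of how cpu/o and cs/o can be computed from vmstat output and the measured QPS; the column positions assume the standard vmstat layout (cs, us and sy in columns 12, 13 and 14), and the QPS value and sample count are placeholders, so my actual scripts may differ:

QPS=100000   # throughput reported by sysbench for the run
vmstat 1 300 | awk -v qps="$QPS" '
    NR > 2 { cs += $12; us += $13; sy += $14; n++ }
    END {
        printf "cpu/o = %.6f\n", ((us + sy) / n) / qps  # CPU per operation
        printf "cs/o  = %.3f\n", (cs / n) / qps         # context switches per operation
    }'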

But the results also show new problems for MariaDB and MDEV-34178 and MDEV-33966 are open for these problems. The usual signal is an increase in the context switch rate (cs/o). Examples of that occur on:
  • The problem is new in MariaDB 10.11
  • The problem is new in MariaDB 10.5
    • update-inlist (see here) - the problem arrives in 10.5
    • update-index (see here) - the problem arrives in 10.5
The full results are here for MariaDB vs MySQL and here for MariaDB.

Results: MariaDB vs MySQL, summary statistics

Each table has summary statistics per microbenchmark group. The numbers are the relative QPS.
  • Relative QPS is: (QPS for MariaDB) / (QPS for MySQL)
  • MariaDB is ma1104_240723_2ee061c2_ilwas (11.4 with MDEV-33894 fix)
  • MySQL is 8.0.38
  • MariaDB is faster than MySQL on all microbenchmarks except for update-inlist
MariaDB     min     max     avg     median
point-1     1.05    1.27    1.09    1.08
point-2     1.06    1.09    1.07    1.07
range-1     1.03    1.35    1.12    1.11
range-2     1.10    1.37    1.19    1.11
writes      0.83    2.06    1.36    1.37

Results: MariaDB, summary statistics

Each table has summary statistics per microbenchmark group. The numbers are the relative QPS.
  • Relative QPS is: (QPS for modern MariaDB) / (QPS for older MariaDB)
  • modern MariaDB is 11.4 without and with the MDEV-33894 fix -- ma110402_rel_withdbg and ma1104_240723_2ee061c2_ilwas
  • older MariaDB is 10.2.44 -- ma100244_rel_withdbg
Summary
  • Writes are faster in 11.4 than in 10.2 with the MDEV-33894 fix
  • Reads are between 13% slower and 23% faster in 11.4 than in 10.2
  • MariaDB has done great at avoiding regressions over time
For 11.4.2 (ma110402_rel_withdbg) relative to 10.2.44 (ma100244_rel_withdbg)

11.4.2      min     max     avg     median
point-1     0.86    0.98    0.92    0.93
point-2     0.87    0.91    0.89    0.88
range-1     0.87    1.18    0.95    0.93
range-2     0.92    1.04    0.97    0.94
writes      0.43    0.75    0.62    0.61

For 11.4 with the fix for MDEV-33894 (ma1104_240723_2ee061c2_ilwas) relative to 10.2.44 (ma100244_rel_withdbg)

11.4.ilwas  min     max     avg     median
point-1     0.86    0.96    0.92    0.93
point-2     0.87    0.91    0.89    0.88
range-1     0.87    1.25    0.95    0.93
range-2     0.92    1.04    0.97    0.94
writes      0.74    1.29    1.06    1.12

Results: MariaDB, charts

The y-axis starts at 0.85 for the read microbenchmark groups (point-1, point-2, range-1, range-2) to improve readability.

Summary:
  • MariaDB has done much better than MySQL at avoiding CPU perf regressions over time
  • For the scan microbenchmark modern MariaDB is faster than older MariaDB
  • For write microbenchmarks the MDEV-33894 fix is important as QPS is almost 2X larger with it

Monday, July 22, 2024

Searching for regressions in RocksDB with db_bench: part 2

In a recent post I shared results for RocksDB performance tests using versions from 6.0 through 9.0 and 3 different types of servers (small, medium, big). While there were few regressions over time, one regression arrived in version 8.6 (bug 12038) and the workaround is to do one of the following:

  • use O_DIRECT for compaction reads
  • set compaction_readahead_size to be <= max_sectors_kb for the database storage device. When SW RAID is used I don't know whether the value that matters is from the underlying storage devices or the SW RAID device.
In this post I have more results from tests done with compaction_readahead_size set to a value <= max_sectors_kb.
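
Both limits are visible in sysfs. A quick check for the SW RAID device and one underlying NVMe device (device names are examples):

# max_sectors_kb caps the size of a single I/O request to the device;
# max_hw_sectors_kb is the hardware limit that max_sectors_kb cannot exceed
cat /sys/block/md0/queue/max_sectors_kb /sys/block/md0/queue/max_hw_sectors_kb
cat /sys/block/nvme0n1/queue/max_sectors_kb /sys/block/nvme0n1/queue/max_hw_sectors_kb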

tl;dr
  • Setting compaction_readahead_size to be <= max_sectors_kb was good for performance on the small and medium servers for the cached workload and on all servers for the IO-bound workload. One effect of this is that the average read request size is large (tens of KB) when the value is correctly sized and ~4K (single-block reads) when it is not.
  • If you don't want to worry about this then use O_DIRECT for compaction reads
Read the Builds and Benchmark sections from my recent post for more context.
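
Both workarounds map to db_bench command-line options. This is a minimal sketch where --num is a placeholder and the many other options are omitted:

# Workaround 1: use O_DIRECT for compaction (and flush) I/O
./db_bench --benchmarks=overwrite --num=1000000 --use_direct_io_for_flush_and_compaction=true
# Workaround 2: cap compaction readahead at or below max_sectors_kb (96K = 98304 bytes)
./db_bench --benchmarks=overwrite --num=1000000 --compaction_readahead_size=98304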

Hardware

I tested on three servers:
  • Small server
    • SER4 - Beelink SER 4700u (see here) with 8 cores and a Ryzen 7 4700u CPU, ext4 with data=writeback and 1 NVMe device. The storage device has 128 for max_hw_sectors_kb and max_sectors_kb.
    • I set compaction_readahead_size to 96K
  • Medium server
    • C2D - a c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU and 16 cores, XFS with data=writeback, SW RAID 0 and 4 NVMe devices. The RAID device has 512 for max_hw_sectors_kb and max_sectors_kb while the storage devices have max_hw_sectors_kb =2048 and max_sectors_kb =1280.
    • I set compaction_readahead_size to 512K
  • Big server
    • BIG - Intel CPU, 1-socket, 48 cores with HT enabled, enough RAM (vague on purpose), xfs with SW RAID 0 and 3 devices. The RAID device has 128 for max_hw_sectors_kb and max_sectors_kb while the storage devices have max_hw_sectors_kb =2048 and max_sectors_kb =1280.
    • I set compaction_readahead_size to 96K

Workloads

There are two workloads:

  • byrx - the database is cached by RocksDB
  • iobuf - the database is larger than memory and RocksDB uses buffered IO
Results: byrx (cached)

For each server there are links to two sets of results.

The first set of results has 3 lines per test. The first line is from RocksDB 8.5.4, the second from 8.7.3 using the default (=2M) for compaction_readahead_size and the third from 8.7.3 with compaction_readahead_size =96K. An example is here.

The second set of results is similar to the first, except the second and third lines are from RocksDB 9.3.1 instead of 8.7.3.

Below I use CRS to mean compaction_readahead_size and compare the QPS from the overwriteandwait microbenchmark.

The results:
  • SER4 (small server)
    • Results for 8.5.4 vs 8.7.3 and then for 8.5.4 vs 9.3.1
    • Results for overwriteandwait are here for 8.7.3 and for 9.3.1
      • 8.7.3 and 9.3.1 with CRS =2M get ~12% less QPS than 8.5.4
      • 8.7.3 and 9.3.1 with CRS =96K get ~6% less QPS than 8.5.4
      • Setting CRS to be <= max_sectors_kb is good for perf
  • C2D (medium server)
    • Results for 8.5.4 vs 8.7.3 and then for 8.5.4 vs 9.3.1
    • Results for overwriteandwait are here for 8.7.3 and for 9.3.1
      • 8.7.3 and 9.3.1 with CRS =2M get ~2% more QPS than 8.5.4
      • 8.7.3 and 9.3.1 with CRS =96K get ~26% more QPS than 8.5.4
      • Setting CRS to be <= max_sectors_kb is good for perf
  • BIG (big server)
    • Results for 8.5.4 vs 8.7.3 and then for 8.5.4 vs 9.3.1
    • Results for overwriteandwait are here for 8.7.3 and for 9.3.1
      • 8.7.3 gets the same QPS as 8.5.4 with CRS set to either =2M or =96K
      • 9.3.1 gets ~4% less QPS than 8.5.4 with CRS set to either =2M or =96K
Summary
  • Setting compaction_readahead_size to be <= max_sectors_kb helps performance on the small and medium server but not on the big server. Note there are large differences on the big server between the value of max_sectors_kb for the RAID device and for the underlying storage devices -- it is much larger for the storage devices.
  • In the cases where reducing the value of compaction_readahead_size helped, QPS from overwriteandwait in RocksDB 8.5.4 is still better than the versions that follow
Results: iobuf (IO-bound with buffered IO)

For each server there are links to two sets of results.

The first set of results has 4 lines per test. The first line is from RocksDB 8.5.4, the second from 8.7.3 using the default (=2M) for compaction_readahead_size and the third from 8.7.3 with compaction_readahead_size =96K. O_DIRECT was not used for the first three lines. The fourth line is from 8.7.3 using O_DIRECT. An example is here.

The second set of results is similar to the first, except the second, third and fourth lines are from RocksDB 9.3.1 instead of 8.7.3.

Below I use CRS to mean compaction_readahead_size and compare the QPS from the overwriteandwait microbenchmark.

The results:
  • SER4 (small server)
    • Results for 8.5.4 vs 8.7.3 and then for 8.5.4 vs 9.3.1
    • Results for overwriteandwait are here for 8.7.3 and for 9.3.1
      • 8.7.3 and 9.3.1 with CRS =2M get 30% to 40% less QPS than 8.5.4
      • 8.7.3 and 9.3.1 with CRS =96K get ~10% less QPS than 8.5.4
      • 8.7.3 and 9.3.1 with O_DIRECT get ~2% more QPS than 8.5.4
      • Setting CRS to be <= max_sectors_kb is good for perf but O_DIRECT is better
      • Average read request size per iostat (see rareqsz here) is much larger with CRS =96K than =2M (84.5 vs 4.6)
  • C2D (medium server)
    • Results for 8.5.4 vs 8.7.3 and then for 8.5.4 vs 9.3.1
    • Results for overwriteandwait are here for 8.7.3 and for 9.3.1
      • 8.7.3 and 9.3.1 with CRS =2M get 4% to 7% less QPS than 8.5.4
      • 8.7.3 and 9.3.1 with CRS =512K get ~20% more QPS than 8.5.4
      • 8.7.3 and 9.3.1 with O_DIRECT get ~11% more QPS than 8.5.4
      • Setting CRS to be <= max_sectors_kb is good for perf and better than O_DIRECT
      • Average read request size per iostat (see rareqsz here) is similar with CRS =512K and 2M (185.9 vs 194.1)
  • BIG (big server)
    • Results for 8.5.4 vs 8.7.3 and then for 8.5.4 vs 9.3.1
    • Results for overwriteandwait are here for 8.7.3 and for 9.3.1
      • 8.7.3 and 9.3.1 with CRS =2M get ~28% less QPS than 8.5.4
      • 8.7.3 and 9.3.1 with CRS =96K get 4% to 7% less QPS than 8.5.4
      • 8.7.3 and 9.3.1 with O_DIRECT get 7% to 10% more QPS than 8.5.4
      • Setting CRS to be <= max_sectors_kb is good for perf but O_DIRECT is better
      • Average read request size per iostat (see rareqsz here) is much larger with CRS =96K than =2M (61.2 vs 5.1)
Summary
  • Setting compaction_readahead_size to be <= max_sectors_kb helps on all servers
  • On the small and big server, performance with O_DIRECT was better than without.

MyRocks vs InnoDB on cached sysbench: revised

A few weeks ago I shared results for sysbench with InnoDB and MyRocks on a variety of servers. The worst-case for MyRocks occurred on a 2-socket server with the write microbenchmarks. After some performance debugging I learned that changing the CPU frequency governor from schedutil to performance increased QPS by ~2X for the worst cases (see here) with MyRocks. Note that for Ubuntu 22.04 the default for the CPU frequency governor is schedutil.
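
For reference, a minimal sketch for checking and changing the governor via sysfs (cpupower from linux-tools also works):

# show the current governor for cpu0
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# use the performance governor on all CPUs
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor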

This blog post shares results for the 2-socket server after I repeated all tests with the performance CPU frequency governor.

tl;dr

  • MyRocks uses ~1.4X more CPU than InnoDB for this benchmark, which means that MyRocks gets ~70% of the QPS compared to InnoDB (1/1.4 ≈ 0.71) for CPU-bound benchmarks. Note that compression is enabled for MyRocks but not for InnoDB. That will increase the CPU for MyRocks on the microbenchmarks that do writes. Here I ignore the benefits from compression, but they are a big deal in production.
  • the largest regressions from early MyRocks 5.6.35 to modern MyRocks 8.0.32 occur on writes and range queries, and a typical regression is ~5%
  • MySQL with InnoDB is faster in 8.0.37 (on average) than 5.6.35 for writes and some queries, but also up to 28% slower on some point and range queries

Builds

I tested the following builds for FB MyRocks:
  • 5635-210407 - FB MyRocks 5.6.35 at git sha f896415f (as of 21/04/07) with RocksDB 6.19.0
  • 5635-231016 - FB MyRocks 5.6.35 at git sha 4f3a57a1 (as of 23/10/16) with RocksDB 8.7.0
  • 8032-231204 - FB MyRocks 8.0.32 at git sha e3a854e8 (as of 23/12/04) with RocksDB 8.7.0
  • 8032-240529 - FB MyRocks 8.0.32 at git sha 49b37dfe (as of 24/05/29) with RocksDB 9.2.1
  • 8032-240529-LTO - same as 8032-240529 except adds link-time optimization
I also compiled upstream MySQL 5.6.35, 5.6.51, 5.7.10, 5.7.44, 8.0.11 and 8.0.37 from source. For 8.0.37 I also created a binary for 8.0.37 with link-time optimization (LTO) enabled via -DWITH_LTO=ON.
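
For reference, a minimal sketch of a build with LTO enabled, where the paths and -j value are placeholders and WITH_LTO is the CMake option mentioned above:

cmake /path/to/mysql-8.0.37 -DCMAKE_BUILD_TYPE=Release -DWITH_LTO=ON -DCMAKE_INSTALL_PREFIX=/path/to/install
make -j24 && make install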

Hardware

I tested on one server that I call Socket2. It is a SuperMicro SuperWorkstation (see here) with 2 sockets, 12 cores/socket, 64G RAM and ext4 (SW RAID 0 over 2 NVMe devices). It uses Ubuntu 22.04 and Intel HT is disabled. As described above, the server uses the performance CPU frequency governor.

Benchmark

I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by MyRocks and InnoDB.

The benchmark is run with 16 threads, 8 tables and 10M rows per table. Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

The command line for my helper scripts was:
# Socket2 -> 16 clients
bash r.sh 8 10000000 300 600 md0 1 1 16

Results

For the results below I split the 42 microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. The spreadsheet with all data is here. For each microbenchmark group there is a table with summary statistics. I don't have charts because that would use too much space, but the results per microbenchmark are in the spreadsheets.

The numbers in the spreadsheets and the tables below are the relative QPS which is (QPS for my version) / (QPS for base case). When the relative throughput is > 1 then that version is faster than the base case.

Results: MyRocks 5.6.35 vs 8.0.32

The numbers in the tables are the relative QPS as explained above.

The base case is 5635-210407, FB MyRocks 5.6.35 as of 21/04/07. The tables below compare it with:
  • 5635-231016 - FB MyRocks 5.6.35 as of 23/10/16
  • 8032-231204 - FB MyRocks 8.0.32 as of 23/12/04
  • 8032-240529 - FB MyRocks 8.0.32 as of 24/05/29
  • 8032-240529-LTO - FB MyRocks 8.0.32 as of 24/05/29 with link-time optimization
This shows the relative QPS as: (QPS for 5635-231016) / (QPS for 5635-210407)
  • the max (1.24) occurs on points-covered-si.pre_range=100
  • the min (0.90) occurs on range-notcovered-pk.pre_range=100 
5635-231016   min     max     avg     median
point-1       0.93    1.24    1.02    0.98
point-2       0.94    0.98    0.96    0.95
range-1       0.90    0.98    0.94    0.94
range-2       0.92    0.99    0.95    0.96
writes        0.94    0.99    0.96    0.95

This shows the relative QPS as: (QPS for 8032-231204) / (QPS for 5635-210407)
  • the max (1.28) occurs on points-covered-si.pre_range=100
  • the min (0.74) occurs on scan_range=100
8032-231204   min     max     avg     median
point-1       0.86    1.28    0.98    0.92
point-2       0.91    1.08    0.98    0.94
range-1       0.74    1.02    0.95    0.96
range-2       0.97    0.99    0.98    0.98
writes        0.81    0.98    0.92    0.93

This shows the relative QPS as: (QPS for 8032-240529) / (QPS for 5635-210407)
  • the max (1.24) occurs on points-covered-si.pre_range=100
  • the min (0.67) occurs on scan_range=100
  • the min for writes (0.79) occurs on insert_range=100
    • from vmstat metrics the CPU overhead (cpu/o) grows and the context switch rate (cs/o) drops from 5635-210407 to 8032-240529
8032-240529   min     max     avg     median
point-1       0.86    1.24    0.98    0.93
point-2       0.89    1.08    0.97    0.94
range-1       0.67    1.01    0.93    0.94
range-2       0.96    0.99    0.97    0.97
writes        0.79    0.96    0.90    0.91

This shows the relative QPS as: (QPS for 8032-240529-LTO) / (QPS for 5635-210407)
  • the max (1.26) occurs on points-covered-si.pre_range=100
  • the min (0.67) occurs on scan_range=100
  • the min for writes (0.80) occurs on insert_range=100
    • from vmstat metrics the CPU overhead (cpu/o) grows and the context switch rate (cs/o) drops from 5635-210407 to 8032-240529-LTO
8032-240529-LTO   min     max     avg     median
point-1           0.79    1.26    0.99    0.93
point-2           0.91    1.10    0.98    0.94
range-1           0.67    1.02    0.94    0.95
range-2           0.97    1.01    0.99    1.00
writes            0.80    0.98    0.93    0.93

Results: InnoDB 5.6, 5.7 and 8.0

The numbers in the tables are the relative QPS as explained above.

The tables below show the relative QPS. The base case is InnoDB from MySQL 5.6.35. It is compared with InnoDB from 5.6.51, 5.7.10, 5.7.44, 8.0.11, 8.0.37 and 8.0.37 with link-time optimization.

This shows the relative QPS as: (QPS for InnoDB+MySQL 5.7.44) / (QPS for InnoDB+MySQL 5.6.35)
  • the max (3.16) occurs on update-index_range=100
  • the min (0.77) occurs on scan_range=100
5.7.44      min     max     avg     median
point-1     0.87    1.57    1.11    0.99
point-2     0.98    1.38    1.19    1.25
range-1     0.77    0.87    0.85    0.86
range-2     1.02    1.26    1.13    1.12
writes      1.07    3.16    1.43    1.25

This shows the relative QPS as: (QPS for InnoDB+MySQL 8.0.37) / (QPS for InnoDB+MySQL 5.6.35)
  • the max (2.05) occurs on update-index_range=100 which is much less than the 3.16 for 5.7.44
  • the min (0.65) occurs on scan_range=100
  • the median values here are much less than the values above for InnoDB 5.7.44 because there are significant performance regressions in 8.0
8.0.37      min     max     avg     median
point-1     0.74    1.26    0.91    0.84
point-2     0.78    1.12    0.96    0.99
range-1     0.65    0.75    0.72    0.72
range-2     0.88    1.08    0.97    0.96
writes      0.96    2.05    1.22    1.12

This shows the relative QPS as : (QPS for InnoDB 8.0.37+LTO) / (QPS for InnoDB 5.6.35)
  • the max (2.18) occurs on update-index_range=100 which is much less than the result in 5.7.44
  • the min (0.64) occurs on scan_range=100
  • LTO improves QPS by ~4%
8.0.37-LTO  min     max     avg     median
point-1     0.77    1.37    0.95    0.88
point-2     0.82    1.15    0.99    1.02
range-1     0.64    0.79    0.75    0.75
range-2     0.91    1.11    1.00    1.00
writes      0.99    2.18    1.27    1.17

Results: MyRocks vs InnoDB

The numbers in the tables are the relative QPS as explained above.

The base case is InnoDB from MySQL 8.0.37. The tables below show the relative QPS for MyRocks:
  • 5635-231016 - FB MyRocks 5.6.35 as of 23/10/16
  • 8032-240529 - FB MyRocks 8.0.32 as of 24/05/29
This shows the relative QPS as (QPS for FB MyRocks 5635-231016) / (QPS for InnoDB 8.0.37)
  • the max (1.06) occurs on update-index_range=100
  • the min (~0.50) occurs on many tests
  • the median is ~0.70 which means that MyRocks uses ~1.4X more CPU than InnoDB
5635-231016   min     max     avg     median
point-1       0.52    0.94    0.74    0.78
point-2       0.61    0.87    0.72    0.68
range-1       0.56    0.77    0.65    0.65
range-2       0.69    0.77    0.73    0.73
writes        0.55    1.06    0.71    0.64

This shows the relative QPS as: (QPS for FB MyRocks 8032-240529) / (QPS for InnoDB 8.0.37)
  • the max (1.02) occurs on update-index_range=100
  • the min (~0.50) occurs on many tests
  • the median is ~0.70 which means that MyRocks uses ~1.4X more CPU than InnoDB
  • the worst cases here are ~0.50 when using the performance CPU frequency governor. They were ~0.25 when using schedutil.
8032-240529   min     max     avg     median
point-1       0.51    0.84    0.71    0.78
point-2       0.61    0.85    0.73    0.74
range-1       0.50    0.77    0.64    0.63
range-2       0.70    0.80    0.75    0.75
writes        0.51    1.02    0.67    0.61

Friday, July 19, 2024

MySQL 8.0.38 vs cached Sysbench on a medium server

This has benchmark results for MySQL 8.0.38 and a few other 8.0 releases using Sysbench with a cached database on a medium server. By small, medium or large server I mean < 10 cores for small, 10 to 19 cores for medium, 20+ cores for large. Results from the Insert Benchmark in the same setup are here.

tl;dr

  • Performance for many range scans might be about 10% lower in 8.0.38 than in 8.0.26. The regression is much larger for the scan benchmark where the performance drop is about 22%. The regressions arrived in 8.0.30 and 8.0.31. Percona PS-8822 and MySQL 111538 are open for this.
  • Performance for writes is up to 9% slower in 8.0.38 vs 8.0.26 with one exception. Performance for the update-index microbenchmark is ~23% slower in 8.0.38. The update-index regression arrives in 8.0.30. I assume it is related to changes for the InnoDB redo log (see innodb_redo_log_capacity and the release notes). It is odd that the large regression is limited to the update-index microbenchmark, although this benchmark requires secondary index maintenance which might mean there is more stress on redo.
  • There is a huge improvement, almost 2X, for several microbenchmarks in the point-2 microbenchmark group. Bug 102037 was fixed in MySQL 8.0.31. I reported that bug against MySQL 8.0.22 thanks to sysbench.
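
For context, the random-points microbenchmarks in the point-2 group fetch rows via a large IN-list on the PK, which made them sensitive to that bug. A hypothetical query shape, with an illustrative table name and IN-list, is:

mysql -e "SELECT id, k, c, pad FROM sbtest1 WHERE id IN (127, 5482, 99311, 402117)" test
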
Builds, configuration and hardware

I compiled from source MySQL versions 8.0.26 through 8.0.38.

The server is a c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU, 64G RAM and SMT disabled so there are 16 cores. It uses Ubuntu 22.04 and storage is ext4 (data=writeback) using SW RAID 0 over 2 locally attached NVMe devices.

The my.cnf file is here for MySQL 8.0.30+ and is here for 8.0.26 through 8.0.28.

Benchmark

I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by InnoDB.

The benchmark is run with 12 threads, 8 tables and 10M rows per table. Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

The command lines for my helper scripts were:
# 8 tables, 10M rows/table, 12 threads
bash r.sh 8 10000000 300 600 md0 1 1 12

Results

For the results below I split the 42 microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. The spreadsheet with all data is here. For each microbenchmark group there is a table with summary statistics. I don't have charts because that would use too much space, but the results per microbenchmark are in the spreadsheets.

The numbers in the spreadsheets and the tables below are the relative QPS which is (QPS for my version) / (QPS for base case). When the relative throughput is > 1 then that version is faster than the base case.

For all results below the base case is InnoDB from MySQL 8.0.26

Results: summary statistics

Each table has summary statistics per microbenchmark group. The numbers are the relative QPS for MySQL 8.0.38 which is (QPS for 8.0.38 / QPS for 8.0.26).

The results are mixed using the median values.
  • Performance for many range scans might be about 10% lower in 8.0.38 than in 8.0.26. The regression is much larger for the scan benchmark where the performance drop is about 22%. The regressions arrived in 8.0.30 and 8.0.31. Percona PS-8822 and MySQL 111538 are open for this. From bug comments and browsing the code, the root cause might be a change in how function inlining is done for InnoDB.
  • Performance for writes is up to 9% slower in 8.0.38 vs 8.0.26 with one exception. Performance for the update-index microbenchmark is ~23% slower in 8.0.38. The update-index regression arrives in 8.0.30. I assume it is related to changes for the InnoDB redo log (see innodb_redo_log_capacity and the release notes).
  • There is a huge improvement, almost 2X, for several microbenchmarks in the point-2 microbenchmark group. Bug 102037 was fixed in MySQL 8.0.31. I reported that bug against MySQL 8.0.22 thanks to sysbench.
  • The releases at which regressions occur are visible in the spreadsheet
8.0.38      min     max     avg     median
point-1     0.93    1.06    1.00    1.02
point-2     0.95    1.97    1.31    1.02
range-1     0.78    0.91    0.89    0.90
range-2     0.91    0.93    0.92    0.92
writes      0.77    0.96    0.92    0.94

Results: charts

The y-axis starts at 0.70 instead of 0 to improve readability. The charts can make it easier to see trends and to see when regressions or improvements arrive.

I suspect the regressions for point queries are related to PS-8822 and MySQL 111538. The large improvement to the random-points microbenchmarks is from fixing bug 102037.

I suspect the regressions for range queries are related to PS-8822 and MySQL 111538 and the worst-case regression occurs for the scan microbenchmark.

The regressions for most write microbenchmarks are <= 10% with one exception -- update-index. The update-index regression arrives in 8.0.30. I assume it is related to changes for the InnoDB redo log (see innodb_redo_log_capacity and the release notes).
Debugging the regression in update-index

The regression for update-index arrives in 8.0.30. 

From vmstat metrics I see:
  • more CPU per operation in 8.0.30
    • The cpu/o column is CPU /operation and it increases from .002615 in 8.0.28 to .002979 in 8.0.30. The cpu in cpu/o is derived from the sum of vmstat us and sy.
  • more context switches per operation in 8.0.30
    • The cs/o column is context switches /operation and it increases from 10.726 in 8.0.28 to 12.258 in 8.0.30. There appears to be more mutex contention in 8.0.30.
I repeated tests with MySQL 8.0.28 with a my.cnf changed to use the same number and size of redo log files as used by 8.0.30+. The goal was to determine whether more+smaller redo log files were the issue for the update-index regression. Alas, they were not, and 8.0.28 with that alternate my.cnf was still much better at update-index than 8.0.30+. With 8.0.30+ the my.cnf has innodb_redo_log_capacity =50G and MySQL uses 32 files which are each ~1.6G (50G / 32). To match that in the repeated tests with 8.0.28 I used:
innodb_log_files_in_group=32
innodb_log_file_size=1638M
I then repeated tests and used PMP to collect and aggregate thread stacks. One stack I see in 8.0.30 that doesn't occur in 8.0.28 is below and I confirmed that log_free_check_wait exists in 8.0.28. Alas, this thread stack also doesn't show up with 8.0.31 or 8.0.32.
 __GI___clock_nanosleep
__GI___nanosleep
std::this_thread::sleep_for<long,,wait_for<log_free_check_wait(log_t&)::<lambda(bool)>
log_free_check_wait
log_free_check,log_free_check
log_free_check,row_upd
row_upd_step
row_update_for_mysql_using_upd_graph
ha_innobase::update_row
handler::ha_update_row
Sql_cmd_update::update_single_table
Sql_cmd_dml::execute
mysql_execute_command
Prepared_statement::execute
Prepared_statement::execute_loop
mysqld_stmt_execute
dispatch_command
do_command
handle_connection
pfs_spawn_thread
start_thread
clone
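
For reference, PMP is just gdb stack sampling plus aggregation. A minimal sketch of the classic approach, assuming gdb is installed and mysqld is the target process:

gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p "$(pidof mysqld)" |
awk 'BEGIN { s = "" }
     /^Thread/ { if (s != "") print s; s = "" }
     /^#/ { s = (s == "") ? $4 : s "," $4 }
     END { if (s != "") print s }' |
sort | uniq -c | sort -rnk1 | head
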
Debugging the regression in scan

From vmstat metrics for a range query I see that the CPU overhead increases ~5% from 8.0.28 to 8.0.30 while QPS drops ~5%. I don't have good vmstat data for the scan microbenchmark but in the past the problem has been more CPU. This looks like Percona PS-8822 and MySQL 111538 and I still have hope it will be fixed.

RocksDB on a big server: LRU vs hyperclock

This has benchmark results for RocksDB using a big (48-core) server. I ran tests to document the impact of the the block cache type (LRU vs ...