Wednesday, July 3, 2024

The journey to explain performance regressions in MySQL

I have spent much time documenting performance regressions from new CPU overheads in MySQL. Soon I will be spending much time trying to explain the source of the regressions with perf, flamegraphs and more. This post is one step in the journey and it shows how performance changes from MySQL 5.6.35 through 8.0.37 using 5 different servers. Previous posts showed the impact of LTO and PGO.

Questions I try to answer in this post:

  • Where in the release cycle do the biggest regressions occur?
  • Are the regressions smaller on the larger servers? One hypothesis is that the regressions are larger at low-concurrency and smaller at high-concurrency because of improvements to InnoDB and other parts of the server. However, it would be great to get the improvements at high-concurrency without regressions at low-concurrency.
tl;dr
  • The largest regressions occur between major versions (5.6.51 to 5.7.10, 5.7.44 to 8.0.11)
  • There are also large regressions from 8.0.11 to 8.0.37 but not from 5.7.10 to 5.7.44
  • In many cases, MySQL 8.0.37 is slower than 5.6.35 at high-concurrency and the problem is obvious on microbenchmarks that do simple SQL (point queries). The standard opinion that MySQL 8.0.x is faster at high-concurrency is far from true.
  • Regressions are larger with low-concurrency than with high-concurrency. Much work was done to improve MySQL for high-concurrency. Alas, regressions (some of code bloat and memory stalls) hurts low-concurrency performance and reduces the benefit from the high-concurrency improvements.

Builds

I used InnoDB from MySQL 5.6.35, 5.6.51, 5.7.10, 5.7.44, 8.0.11, 8.0.28 and 8.0.37. The compiler was gcc 11.4.0.

Hardware

I tested on five servers:
  • Small servers
    • SER4 - Beelink SER 4700u (see here) with 8 cores and a Ryzen 7 4700u CPU 
    • PN53 - ASUS ExpertCenter PN53 (see here) with 8 cores and an AMD Ryzen 7 7735HS CPU. The CPU on the PN53 is newer than the CPU on the SER4.
  • Medium server
    • C2D - a c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU and SMT disabled so there are 16 cores
  • Big servers
    • Socket2 - SuperMicro SuperWorkstation (see here) with 2-sockets, 12 cores/socket, 64G RAM and ext4 (SW RAID 0 over 2 NVMe devices), HT disabled.
    • Dell32 - Dell Precision 7865 Tower (see here) with a 32-corAMD Ryzen Threadripper PRO 5975WX, 128G RAM and ext4 (SW RAID 0 over 2 NVMe devices), SMT disabled.
All servers use Ubuntu 22.04 with ext4. 

Benchmark

I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by MyRocks and InnoDB.

The benchmark is run with:
  • SER4, PN53 - 1 thread, 1 table and 30M rows
  • C2D - 12 threads, 8 tables and 10M rows per table
  • Socket2 - 16 threads, 8 tables and 10M rows per table
  • Dell32 - 24 threads, 8 tables and 10M rows per table
  • each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise
  • prepared statements were enabled
The command lines for my helper scripts were:
# PN53, SER4
bash r.sh 1 30000000 300 600 nvme0n1 1 1 1
# C2D
bash r.sh 8 10000000 300 600 md0 1 1 12
# Socket2
bash r.sh 8 10000000 300 600 md0 1 1 16
# Dell32
bash r.sh 8 10000000 300 600 md0 1 1 24
Results

For the results below I split the 42 microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. The spreadsheets with all data is here. For each microbenchmark group there is a table with summary statistics. I don't have charts because that would use too much space, but the results per microbenchmark are in the spreadsheets.

The numbers in the spreadsheets are the relative QPS which is (QPS for my version) / (QPS for base case). When the relative throughput is > 1 then that version is faster than the base case.

For all results below the base case is MySQL 5.6.35.

Results: 5.7.10

Summary
  • The base case is 5.6.35 and slower or faster is in comparison to 5.6.35
  • Small servers
    • 5.7.10 is ~18% slower on SER4 except for range-2
    • 5.7.10 is ~12% slower on PN53 except for range-2
  • Medium server
    • 5.7.10 is slower and faster for reads and then much faster for writes
    • The update-index microbenchmark has the largest improvement (relative QPS is 3.57)
  • Large servers
    • 5.7.10 has similar perf or is faster except for range-1
    • The update-index microbenchmark has the largest improvement (relative QPS is > 4)
Educated guesses:
  • I don't see a large regression for range-2 on the small servers. Relative to range-1, the range-2 microbenchmark group spends more time in query execution vs parse or optimize. Perhaps the CPU regressions in the other microbenchmark groups is from new CPU overhead prior to query execution. Note that the range-2 group has microbenchmarks are the classic sysbench transaction minus writes (see here).

Results for SER4

5.7.10, SER4minmaxavgmedian
point-10.780.870.830.82
point-20.790.920.840.82
range-10.801.040.850.84
range-20.861.251.030.98
writes0.781.170.890.82

Results for PN53

5.7.10, PN53minmaxavgmedian
point-10.820.890.850.84
point-20.840.950.880.87
range-10.830.990.880.88
range-20.921.271.061.00
writes0.851.290.940.89

Results for C2D

5.7.10, C2Dminmaxavgmedian
point-10.821.441.000.94
point-20.901.201.061.08
range-10.831.120.910.90
range-21.161.301.231.22
writes1.643.572.342.40

Results for Socket2

5.7.10, Socket2minmaxavgmedian
point-10.861.481.060.99
point-20.991.161.101.15
range-10.900.920.910.91
range-21.041.321.171.15
writes1.054.971.611.27

Results for Dell32

5.7.10, Dell32minmaxavgmedian
point-10.892.151.291.06
point-21.071.561.391.55
range-10.890.970.940.95
range-20.961.201.111.16
writes1.234.521.801.61

Results: 5.7.44

Summary
  • The base case is 5.6.35 and slower or faster is in comparison to 5.6.35
  • Results for 5.7.44 are similar to results for 5.7.10 -- there are few regressions during 5.7. This is easier to see on the spreadsheet (scroll to the bottom of each tab).
  • Small servers
    • 5.7.44 is ~15% slower on SER4 except for range-2
    • 5.7.44 is ~12% slower on PN53 except for range-2
  • Medium server
    • 5.7.44 is slower and faster for reads and then much faster for writes
    • The update-index microbenchmark has the largest improvement (relative QPS is 3.96)
  • Large servers
    • 5.7.44 has similar perf or is faster except for range-1
    • The update-index microbenchmark has the largest improvement (relative QPS is > 5)
Educated guesses
  • For range-2 see the explanation in the 5.7.10 results above

Results for SER4

5.7.44, SER4minmaxavgmedian
point-10.850.890.860.85
point-20.780.990.880.85
range-10.800.930.820.81
range-20.851.211.010.96
writes0.741.290.900.85

Results for PN53

5.7.44, PN53minmaxavgmedian
point-10.860.920.890.88
point-20.861.020.920.88
range-10.830.900.860.84
range-20.901.231.040.99
writes0.821.170.920.89

Results for C2D

5.7.44, C2Dminmaxavgmedian
point-10.861.441.030.94
point-20.911.291.111.13
range-10.851.000.880.86
range-21.151.261.201.20
writes1.453.962.462.49

Results for Socket2

5.7.44, Socket2minmaxavgmedian
point-10.881.541.100.99
point-20.981.361.181.22
range-10.780.870.850.86
range-21.041.261.141.12
writes0.755.451.611.26

Results for Dell32

5.7.44, Dell32minmaxavgmedian
point-10.912.131.321.03
point-21.061.721.481.65
range-10.850.930.910.91
range-20.981.161.091.13
writes1.155.441.951.68

Results: 8.0.11

Summary
  • The base case is 5.6.35 and slower or faster is in comparison to 5.6.35
  • Small servers
    • 8.0.11 is ~20% slower on SER4 except for range-2
    • 8.0.11 is ~20% slower on PN53 except for range-2
    • 8.0.11 is 5% to 10% slower than 5.7.44. This is easier to see on the spreadsheet.
  • Medium server
    • 8.0.11 is slower and faster for reads and then much faster for writes
    • The update-index microbenchmark has the largest improvement (relative QPS is 3.28)
    • 8.0.11 is often 5% to 10% slower than 5.7.44. On the update-index benchmark the relative QPS here is 3.28 vs 3.96 in 5.7.10. This is easier to see on the spreadsheet.
  • Large servers
    • 8.0.11 is slower and faster for reads and then much faster for writes
    • The update-index microbenchmark has the largest improvement (relative QPS is 1.24 and 1.66)
    • There are big regressions from 5.7.44 to 8.0.11. The largest one might be for update-index which had a relative QPS of (5.45, 5.44) in 5.7.10 which drops to (3.58, 2.35) in 5.7.44. This is easier to see on the spreadsheet.

Results for SER4

8.0.11, SER4minmaxavgmedian
point-10.760.860.800.80
point-20.710.920.810.79
range-10.750.990.810.81
range-20.791.140.950.91
writes0.641.360.810.74

Results for PN53

8.0.11, PN53minmaxavgmedian
point-10.770.830.800.81
point-20.780.930.840.82
range-10.770.950.820.80
range-20.841.170.980.93
writes0.721.330.870.82

Results for C2D

8.0.11, C2Dminmaxavgmedian
point-10.801.330.960.86
point-20.831.191.021.04
range-10.791.100.850.84
range-21.081.211.141.13
writes1.273.282.092.14

Results for Socket2

8.0.11, Socket2minmaxavgmedian
point-10.841.501.050.94
point-20.911.251.081.12
range-10.810.900.860.87
range-20.981.221.071.07
writes1.053.581.481.24

Results for Dell32

8.0.11, Dell32minmaxavgmedian
point-10.862.031.230.95
point-20.971.601.361.50
range-10.850.920.880.90
range-20.961.111.051.07
writes1.162.351.631.66

Results: 8.0.37

Summary
  • The base case is 5.6.35 and slower or faster is in comparison to 5.6.35
  • Small servers
    • 8.0.37 is 31% to 41% slower on SER4 except for range-2
    • 8.0.37 is 29% to 34% slower on PN53 except for range-2
    • 8.0.37 is 10% to 15% slower than 8.0.11. This is easier to see on the spreadsheet.
  • Medium server
    • 8.0.37 is 11% to 29% slower on most reads (point-1, point-2, range-1), has similar perf on range-2 and much faster for writes
    • The update-index microbenchmark has the second the largest improvement (relative QPS is 2.75 vs 3.06 for update-inlist). Regressions are slowly reducing the improvements on the write-heavy microbenchmarks.
    • 8.0.37 is often 10% slower than 8.0.11. This is easier to see on the spreadsheet.
  • Large servers
    • For both
      • 8.0.37 is up to ~25% slower than 5.6.35
      • 8.0.37 is ~10% slower than 8.0.11
      • CPU regressions are slowly taking away the improvements for high-concurrency.
    • For Socket2
      • 8.0.37 is now slower than 5.6.35 on all microbenchmark groups except writes.
    • For Dell32
      • 8.0.37 is slower than 5.6.35 on some reads (point-1, range-1), faster on some reads (point-2) and similar on some reads (range-2). It remains faster on writes.
Results for SER4

8.0.37, SER4minmaxavgmedian
point-10.650.780.700.69
point-20.610.800.700.69
range-10.640.690.660.66
range-20.650.960.790.75
writes0.471.020.620.59

Results for PN53

8.0.37, PN53minmaxavgmedian
point-10.660.780.710.70
point-20.700.830.750.71
range-10.670.710.690.69
range-20.740.980.850.82
writes0.570.970.690.66

Results for C2D

8.0.37, C2Dminmaxavgmedian
point-10.711.100.830.75
point-20.731.040.880.89
range-10.670.790.710.71
range-20.931.050.990.99
writes1.253.061.911.91

Results for Socket2

8.0.37, Socket2minmaxavgmedian
point-10.741.230.900.83
point-20.771.070.930.96
range-10.660.740.710.72
range-20.871.070.960.95
writes0.973.461.401.14

Results for Dell32

8.0.37, Dell32minmaxavgmedian
point-10.761.611.040.85
point-20.861.391.171.28
range-10.670.800.750.75
range-20.951.081.000.99
writes1.083.031.661.55


No comments:

Post a Comment