Friday, October 11, 2024

What does "perf stat" tell us about MySQL performance regressions on an AMD Threadripper CPU

I used perf stat to collect a variety of HW performance counters and will share the results over several posts. I used 4 different servers with 4 different CPUs and there will be a post per CPU. This is the third post and it has results from a server class CPU. Previous posts that used laptop class CPUs (but in mini PCs, not laptops) are here and here, and a result from a server class CPU is here.

tl;dr

  • MySQL (InnoDB and MyRocks) gets slower over time from both memory system bloat (TLB and cache activity) and code bloat (more instructions/operation). For code bloat, InnoDB in MySQL 8.0.39 uses (1.43, 1.22, 1.28, 1.36) times more instructions per SQL operation for (scan, insert, point-query, update-index) microbenchmarks relative to 5.6.51.
  • the regressions in MyRocks and InnoDB are similar so this appears to be independent of the storage engine (perhaps code above the storage engine, perhaps just new code in general)
  • the results for InnoDB here are better than on the small servers. While memory system and code bloat hurt modern MySQL, there were many improvements that reduce mutex contention, and the workload here has more concurrency (24 threads) while the small server workloads used low concurrency (1 thread)
  • the update-index microbenchmark has a huge regression from 8.0.28 to 8.0.39. I need a few more days to explain that. I see increases in CPU utilization and the context switch rate.
Builds

I compiled upstream MySQL from source for versions 5.6.51, 5.7.10, 5.7.44, 8.0.11, 8.0.28 and 8.0.39.

I compiled FB MySQL from source and tested the following builds:
  • fbmy5635_rel_o2nofp_210407_f896415f_6190
    • MyRocks 5.6.35 at git sha f896415f (as of 21/04/07) with RocksDB 6.19
  • fbmy5635_rel_o2nofp_231016_4f3a57a1_870
    • MyRocks 5.6.35 at git sha 4f3a57a1 (as of 23/10/16) with RocksDB 8.7.0
  • fbmy8028_rel_o2nofp_220829_a35c8dfe_752
    • MyRocks 8.0.28 at git sha a35c8dfe (as of 22/08/29) with RocksDB 7.5.2
  • fbmy8028_rel_o2nofp_231202_4edf1eec_870
    • MyRocks 8.0.28 at git sha 4edf1eec (as of 23/12/02) with RocksDB 8.7.0
  • fbmy8032_rel_o2nofp_231204_e3a854e8_870
    • MyRocks 8.0.32 at git sha e3a854e8 (as of 23/12/04) with RocksDB 8.7.0
  • fbmy8032_rel_o2nofp_240529_49b37dfe_921
    • MyRocks 8.0.32 at git sha 49b37dfe (as of 24/05/29) with RocksDB 9.2.1
The my.cnf files for MyRocks are here: 5.6.35, 8.0.28 and 8.0.32.

The my.cnf files for InnoDB are in the subdirectories here. Given the amount of innovation in MySQL 8.0 I can't use one my.cnf file for all 8.0 versions.

Hardware

The server is a Dell Precision 7865 Tower Workstation with 1 socket, 128G of RAM and an AMD Ryzen Threadripper PRO 5975WX (32 cores). The OS is Ubuntu 22.04 with an HWE kernel. Storage is one M.2 SSD with ext4 (data=writeback, discard enabled).

Benchmark

I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by MyRocks and InnoDB.

The benchmark is run with 24 threads, 8 tables and 10M rows per table. Each microbenchmark runs for 330 seconds if read-only and 630 seconds otherwise. Prepared statements were enabled.

The command line for my helper scripts was:
    bash r.sh 8 10000000 330 630 nvme0n1 1 1 24

Using perf

I used perf stat with code that does:

  1. Sleep for 30 seconds
  2. Run perf stat 7 times for 10 seconds at a time. Each run collects different counters - see here.
So it takes ~100 seconds per loop and, given that I run the read-heavy tests for ~330 seconds and the write-heavy tests for ~630 seconds, I collect data from ~3 loops for read-heavy and ~6 loops for write-heavy.
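The loop arithmetic above can be sketched in Python. This is just a sketch: the event list passed to perf stat is a hypothetical placeholder, not the exact counter groups I use.

```python
# Sketch of the collection loop: sleep 30 seconds, then run perf stat
# 7 times for 10 seconds each, so one loop takes ~100 seconds.
SLEEP_S = 30
RUNS_PER_LOOP = 7
RUN_S = 10
LOOP_S = SLEEP_S + RUNS_PER_LOOP * RUN_S  # 30 + 7*10 = 100

def loops_collected(test_seconds):
    # Number of full collection loops that fit in one microbenchmark run.
    return test_seconds // LOOP_S

def perf_stat_cmd(pid, events):
    # Build one perf stat invocation (events is a hypothetical counter group).
    return ["perf", "stat", "-e", ",".join(events),
            "-p", str(pid), "--", "sleep", str(RUN_S)]

print(loops_collected(330))  # read-heavy: 3 loops
print(loops_collected(630))  # write-heavy: 6 loops
```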

I then compare results across DBMS versions for one (or a few) of the loops and generate two tables per microbenchmark -- one with absolute values, the other with values relative to the base version. The base version for InnoDB is 5.6.51 and for MyRocks is fbmy5635_rel_o2nofp_210407_f896415f_6190. By values relative I mean: (value for my version / value for base version). My focus is to compare MyRocks with MyRocks and InnoDB with InnoDB. I assume that most of the regressions for MyRocks are also in InnoDB and then try to confirm that assumption.

Results

For the results I normally split the 42 microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. But I don't do that here because I share 1 (or 2) tables of results from perf stat per microbenchmark and doing that for 42 microbenchmarks is too much.

So I focus on four that I think are representative:
  • scan - do full table scans in a loop for ~300 seconds
  • insert - do inserts
  • point-query - fetch 1 row by exact match on the PK index
  • update-index - do updates that require secondary index maintenance
For each microbenchmark I continue to use relative QPS (rQPS) which is:
     (QPS for my version / QPS for base version)

And the base version is explained above in the Using perf section.
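As a minimal sketch, the rQPS computation is just a ratio; the numbers below are made up for illustration.

```python
def relative_qps(qps_mine, qps_base):
    """rQPS = (QPS for my version / QPS for base version)."""
    return qps_mine / qps_base

# Illustrative numbers: if the base version does 100k QPS and my version
# does 60k QPS, then rQPS is 0.60 -- 60% of the base throughput.
print(round(relative_qps(60_000, 100_000), 2))  # 0.6
```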

The relative QPS is here for MyRocks. The QPS for the subset of microbenchmarks on which I focus is below. The results for a recent build of FB MyRocks 8.0.32 are in col-5 and the relative QPS values range from 0.60 for scan to 0.92 for update-index. When the relative QPS for scan is 0.60 with a recent MyRocks 8.0.32 build, the scan throughput in that build is only 60% of an old build of MyRocks 5.6.35 when the workload is CPU bound. I am not happy about that.

Relative to: x.fbmy5635_rel_o2nofp_210407_f896415f_6190
col-1 : x.fbmy5635_rel_o2nofp_231016_4f3a57a1_870
col-2 : x.fbmy8028_rel_o2nofp_220829_a35c8dfe_752
col-3 : x.fbmy8028_rel_o2nofp_231202_4edf1eec_870
col-4 : x.fbmy8032_rel_o2nofp_231204_e3a854e8_870
col-5 : x.fbmy8032_rel_o2nofp_240529_49b37dfe_921

col-1   col-2   col-3   col-4   col-5
0.97    0.64    0.71    0.62    0.60    scan_range=100
0.96    0.83    0.80    0.78    0.77    insert_range=100
0.97    0.87    0.87    0.85    0.85    point-query_range=100
0.98    0.97    0.94    0.92    0.92    update-index_range=100

The relative QPS is here for InnoDB. The QPS for the subset of microbenchmarks on which I focus is below. The results for MySQL 8.0.39 are in col-5 and the relative QPS values range from 0.75 for scan to 1.77 for update-index. InnoDB in MySQL 8.0.39 is slower than 5.6.51 on the read-only microbenchmarks (scan and point-query) but faster on the write-only microbenchmarks (insert and update-index). Also note that the values in col-5 (8.0.39) are smaller than the values in col-4 (8.0.28), especially for scan. I think this is bug 111538, which might get fixed in the upcoming 8.0.40 release. Even in the one case where 8.0.39 gets more throughput than 5.6.51 (update-index) there is a steady drop in relative QPS from col-2 (5.7.44) through col-5 (8.0.39), and even from col-4 (8.0.28) to col-5 (8.0.39).
 
Relative to: x.my5651_rel_o2nofp
col-1 : x.my5710_rel_o2nofp
col-2 : x.my5744_rel_o2nofp
col-3 : x.my8011_rel_o2nofp
col-4 : x.my8028_rel_o2nofp
col-5 : x.my8039_rel_o2nofp

col-1   col-2   col-3   col-4   col-5
0.98    0.90    1.03    0.92    0.75    scan_range=100
1.51    1.52    1.59    1.50    1.41    insert_range=100
1.04    1.01    0.94    0.89    0.83    point-query_range=100
3.12    3.62    1.44    3.21    1.77    update-index_range=100

Explaining regressions

While it would be great if most of the regressions were caused by a small number of diffs, that does not appear to be the case. When I look at QPS across all of the point releases I see a few things:
  • larger regressions across major releases (5.6.51 to 5.7.10 and 5.7.44 to 8.0.11)
  • many regressions across 8.0 releases (8.0.11 through 8.0.39)
  • some regressions across 5.7 releases (5.7.10 through 5.7.44)
  • not many regressions across 5.6 releases
At a high level there are two reasons for the regressions and I am working to distinguish between them when I look at results from perf record and perf stat.
  • code bloat - more instructions are executed per SQL command
  • memory system bloat - more cache and TLB activity per SQL command
With either tool (perf stat or perf record) I run it for a fixed number of seconds while the amount of work (QPS) completed during that interval is not fixed -- for low-concurrency workloads the QPS is larger for older MySQL and smaller for modern MySQL. I must keep this in mind when interpreting the results.

With flamegraphs from perf record a common case is that the distribution of time per function doesn't change that much even when there are large regressions and my hypothesis is that memory system bloat is the problem. 

When looking at perf stat output I start with the absolute values for the counters. When they are small, a 10X difference in the value for that counter between old and modern MySQL is frequently not significant because it won't have a large impact on performance. Otherwise, I focus on the relative values for these counters, which are (value for my version / value for base version), using the old version of MySQL or FB MyRocks as the base version, and try to focus on the cases where the relative value is large (perhaps larger than 1.15, as in a 15% increase).

Just as with perf record, the values from perf stat are measured over a fixed number of seconds while the QPS completed during that interval tends to be larger for older MySQL. So I then compute another value which is: (value for my version / value for base version / relative QPS for my version). This is the relative value for a counter per query (query == SQL command) and when that equals 1.5 for some version then that version does 1.5X more for that counter per query (1.5X more cache misses, etc).
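The per-query normalization described above can be sketched as follows, with made-up counter values for illustration.

```python
def relative_value(value_mine, value_base):
    """Relative counter value: (value for my version / value for base version)."""
    return value_mine / value_base

def per_query_value(value_mine, value_base, rqps):
    """Relative counter value per SQL command:
    (value for my version / value for base version) / relative QPS."""
    return relative_value(value_mine, value_base) / rqps

# Made-up example: a counter is 20% larger in the new version over the same
# interval, but the new version completed only 80% of the base QPS, so the
# new version does 1.5X more of that event per query.
print(round(per_query_value(1.2e9, 1.0e9, 0.80), 2))  # 1.5
```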

While I use the word significant above to mean that the value for some counter is interesting, it isn't easy to estimate how much that counter contributes to a performance regression.

Looking at perf stat output

This spreadsheet has the interesting counters from perf stat for the scan, insert, point-query and update-index microbenchmarks. The set of tables on the left has the absolute values for the counters and on the right are the normalized values -- (value for my version / value for base version / relative QPS).

The amd-rx tab has results for MyRocks and the amd-in tab has results for InnoDB on the server I write about here (the Threadripper server described above). And I will inline the tables with normalized values below.

For MyRocks the version names I use below are abbreviated from what I explained above:
  • fbmy5635.a - fbmy5635_rel_o2nofp_210407_f896415f_6190
  • fbmy5635.b - fbmy5635_rel_o2nofp_231016_4f3a57a1_870
  • fbmy8028.a - fbmy8028_rel_o2nofp_220829_a35c8dfe_752
  • fbmy8028.b - fbmy8028_rel_o2nofp_231202_4edf1eec_870
  • fbmy8032.a - fbmy8032_rel_o2nofp_231204_e3a854e8_870
  • fbmy8032.b - fbmy8032_rel_o2nofp_240529_49b37dfe_921
A few more things:
  • I use yellow to highlight counters for which there are large increases
  • I use red to highlight the values for cpu/o (relative CPU per SQL command) and relative QPS
  • values for cycles, GHz, ipc, cpu/o and relative QPS are not normalized as described above. For them the value is computed as (value for my version / value for base version) and I don't divide that by the relative QPS.
Results from perf stat: scan

The tl;dr here is that MyRocks and InnoDB have similar regressions.

Summary:
  • the relative QPS is 0.60 & 0.75 for modern MyRocks & InnoDB so they get 60% & 75% of the throughput vs older MyRocks & InnoDB.
  • the CPU overhead per scan (cpu/o) is 1.65X & 1.36X larger in modern MyRocks & InnoDB vs older MyRocks & InnoDB
  • in both MyRocks and InnoDB there are large increases for TLB and cache activity. My biggest concern is for L1-icache-loads-misses which is 4.63X & 12.61X larger in modern MyRocks & InnoDB.
  • the large values for iTLB-loads and iTLB-load-misses are a concern, but reducing them to near zero (via using huge pages for text) increased QPS on the scan microbenchmark by only ~5% (see here)
  • both MyRocks and InnoDB suffer from code bloat because the number of instructions per SQL operation increases by 1.61X & 1.43X in modern MyRocks & InnoDB
  • there are four cases where HW perf counters show a regression in fbmy5635.b and I did not expect that
MyRocks

                        fbmy5635.a  fbmy5635.b  fbmy8028.a  fbmy8028.b  fbmy8032.a  fbmy8032.b
branches                1.00        1.06        1.54        1.42        1.67        1.65
branch-misses           1.00        0.94        1.26        1.01        1.14        1.12
cache-misses            1.00        0.95        1.02        0.91        1.08        1.08
cache-references        1.00        0.70        1.14        1.32        1.92        1.76
cycles                  1.00        1.00        1.00        1.00        1.00        1.00
dTLB-load-misses        1.00        0.90        1.11        0.92        1.11        1.08
dTLB-loads              1.00        1.14        8.13        1.36        2.08        2.75
GHz                     1.00        1.00        1.00        1.00        1.00        1.00
instructions            1.00        1.07        1.51        1.38        1.63        1.61
ipc                     1.00        1.04        0.97        0.98        1.00        0.97
iTLB-load-misses        1.00        0.99        0.86        0.84        0.96        0.92
iTLB-loads              1.00        1.38        24.34       1.40        7.00        2.96
L1-dcache-load-misses   1.00        1.18        1.36        1.23        1.73        1.38
L1-dcache-loads         1.00        1.06        1.43        1.33        1.56        1.56
L1-icache-loads         1.00        1.31        2.18        2.07        2.51        2.26
L1-icache-loads-misses  1.00        0.73        1.75        2.62        6.52        4.23
stalled-cycles-backend  1.00        0.86        0.79        5.93        1.37        0.97
stalled-cycles-frontend 1.00        1.18        0.96        1.54        0.91        2.15
cpu/o                   1.00        1.04        1.56        1.42        1.62        1.65
QPS                     1.00        0.97        0.64        0.71        0.62        0.60

InnoDB

                        5.6.51  5.7.10  5.7.44  8.0.11  8.0.28  8.0.39
branches                1.00    1.06    1.12    1.06    1.15    1.45
branch-misses           1.00    1.00    0.97    0.22    0.25    0.41
cache-misses            1.00    0.93    1.03    0.70    0.97    0.94
cache-references        1.00    0.94    0.96    1.55    1.68    1.70
cycles                  1.00    1.00    1.00    1.00    1.00    1.00
dTLB-load-misses        1.00    0.95    0.93    0.95    1.17    1.19
dTLB-loads              1.00    0.77    1.00    1.06    3.80    4.74
GHz                     1.00    1.00    1.00    1.00    1.00    1.00
instructions            1.00    1.02    1.07    1.04    1.15    1.43
ipc                     1.00    1.01    0.95    1.07    1.05    1.07
iTLB-load-misses        1.00    1.17    1.21    1.65    1.62    2.75
iTLB-loads              1.00    2.22    70.58   2.12    728.81  5591.47
L1-dcache-load-misses   1.00    0.92    0.92    1.98    2.09    2.08
L1-dcache-loads         1.00    1.06    1.12    1.11    1.25    1.50
L1-icache-loads         1.00    1.06    1.55    0.61    0.59    1.17
L1-icache-loads-misses  1.00    1.92    2.46    2.17    11.04   12.61
stalled-cycles-backend  1.00    1.42    2.48    2.22    0.85    3.07
stalled-cycles-frontend 1.00    3.04    7.31    4.06    1.52    2.19
cpu/o                   1.00    1.03    1.14    1.00    1.12    1.36
QPS                     1.00    0.98    0.90    1.03    0.92    0.75

Results from perf stat: insert

The tl;dr here is that MyRocks and InnoDB have similar regressions.

Summary:
  • the relative QPS is 0.77 for modern MyRocks and 1.41 for modern InnoDB. So MyRocks gets slower over time. InnoDB in MySQL 8.0.39 is faster than 5.6.51 but has been getting slower since 5.7.10
  • the CPU overhead per insert (cpu/o) is 1.40X larger in modern MyRocks but smaller in modern InnoDB.
  • in both MyRocks and InnoDB there are large increases for TLB and cache activity. My biggest concern is for L1-icache-loads-misses which is 1.65X & 1.61X larger in modern MyRocks & InnoDB.
  • the large values for iTLB-loads and iTLB-load-misses are a concern, but reducing them to near zero (via using huge pages for text) increased QPS on the scan microbenchmark by only ~5% (see here)
  • modern MyRocks & InnoDB suffer from code bloat as the number of instructions per SQL operation increases by 1.22X for both and InnoDB has been getting worse since 5.7.10
  • there are three cases where HW perf counters show a regression in fbmy5635.b and I did not expect that
MyRocks

                        fbmy5635.a  fbmy5635.b  fbmy8028.a  fbmy8028.b  fbmy8032.a  fbmy8032.b
branches                1.00        1.02        1.58        1.18        1.55        1.23
branch-misses           1.00        1.03        1.49        1.41        1.59        1.49
cache-misses            1.00        1.03        1.48        1.48        1.59        1.58
cache-references        1.00        1.03        1.45        1.42        1.59        1.55
cycles                  1.00        0.97        1.20        1.06        1.18        1.07
dTLB-load-misses        1.00        0.87        1.25        1.08        1.21        1.16
dTLB-loads              1.00        0.99        1.42        1.34        1.47        1.51
GHz                     1.00        1.00        1.06        1.05        1.06        1.05
instructions            1.00        1.01        1.58        1.16        1.61        1.22
ipc                     1.00        0.99        1.09        0.87        1.06        0.87
iTLB-load-misses        1.00        1.13        1.99        2.20        2.43        2.44
iTLB-loads              1.00        1.11        1.55        1.47        1.62        1.65
L1-dcache-load-misses   1.00        0.95        1.32        1.21        1.41        1.25
L1-dcache-loads         1.00        0.83        1.39        1.22        1.45        1.16
L1-icache-loads         1.00        1.19        1.73        1.69        1.73        1.87
L1-icache-loads-misses  1.00        1.16        1.67        1.75        1.68        1.65
stalled-cycles-backend  1.00        2.71        3.01        2.75        3.13        3.03
stalled-cycles-frontend 1.00        0.96        0.99        0.97        1.02        0.98
cpu/o                   1.00        1.03        1.34        1.34        1.39        1.40
QPS                     1.00        0.96        0.83        0.80        0.78        0.77

InnoDB

                        5.6.51  5.7.10  5.7.44  8.0.11  8.0.28  8.0.39
branches                1.00    0.82    0.82    1.05    1.18    1.25
branch-misses           1.00    0.86    0.93    1.10    1.23    1.35
cache-misses            1.00    0.94    0.99    1.08    1.27    1.42
cache-references        1.00    0.94    0.98    1.09    1.25    1.38
cycles                  1.00    0.94    0.90    1.04    1.06    1.07
dTLB-load-misses        1.00    1.02    1.08    1.46    1.45    1.65
dTLB-loads              1.00    1.11    1.23    1.51    1.70    1.83
GHz                     1.00    0.98    0.97    0.98    0.99    1.00
instructions            1.00    0.83    0.84    1.04    1.15    1.22
ipc                     1.00    1.32    1.39    1.58    1.61    1.58
iTLB-load-misses        1.00    1.34    1.42    2.04    2.65    3.46
iTLB-loads              1.00    1.27    1.40    1.37    1.58    1.72
L1-dcache-load-misses   1.00    0.97    1.02    1.14    1.19    1.27
L1-dcache-loads         1.00    0.95    0.98    1.21    1.33    1.47
L1-icache-loads         1.00    0.92    0.96    1.12    1.25    1.40
L1-icache-loads-misses  1.00    1.05    1.09    1.21    1.43    1.61
stalled-cycles-backend  1.00    0.86    0.84    1.04    1.05    1.07
stalled-cycles-frontend 1.00    0.79    0.78    0.86    0.92    0.94
cpu/o                   1.00    0.67    0.61    0.71    0.74    0.80
QPS                     1.00    1.51    1.52    1.59    1.50    1.41

Results from perf stat: point-query

The tl;dr here is that MyRocks and InnoDB have similar regressions.

Summary:
  • the relative QPS is 0.85 & 0.83 for modern MyRocks & InnoDB so they get ~84% of the throughput vs older MyRocks & InnoDB.
  • the CPU overhead per query (cpu/o) is 1.18X & 1.22X larger in modern MyRocks & InnoDB vs older MyRocks & InnoDB
  • in both MyRocks and InnoDB there are large increases for TLB and cache activity. My biggest concern is for L1-icache-loads-misses which is 1.29X & 1.63X larger in modern MyRocks & InnoDB.
  • the large values for iTLB-loads and iTLB-load-misses are a concern, but reducing them to near zero (via using huge pages for text) increased QPS on the scan microbenchmark by only ~5% (see here)
  • both MyRocks and InnoDB suffer from code bloat because the number of instructions per SQL operation increases by 1.20X & 1.28X in modern MyRocks & InnoDB
MyRocks

                        fbmy5635.a  fbmy5635.b  fbmy8028.a  fbmy8028.b  fbmy8032.a  fbmy8032.b
branches                1.00        1.05        1.16        1.15        1.16        1.21
branch-misses           1.00        1.06        1.27        1.30        1.34        1.32
cache-misses            1.00        1.06        1.29        1.32        1.37        1.34
cache-references        1.00        1.05        1.27        1.28        1.34        1.32
cycles                  1.00        1.01        1.05        1.05        1.06        1.06
dTLB-load-misses        1.00        0.93        1.09        1.00        1.05        0.99
dTLB-loads              1.00        1.04        1.38        1.31        1.36        1.38
GHz                     1.00        1.00        1.01        1.01        1.02        1.02
instructions            1.00        1.04        1.15        1.14        1.14        1.20
ipc                     1.00        1.00        0.95        0.93        0.91        0.95
iTLB-load-misses        1.00        1.23        1.26        1.58        1.53        1.66
iTLB-loads              1.00        1.08        1.34        1.24        1.46        1.41
L1-dcache-load-misses   1.00        1.02        1.19        1.17        1.21        1.22
L1-dcache-loads         1.00        1.04        1.20        1.19        1.22        1.25
L1-icache-loads         1.00        1.05        1.27        1.30        1.36        1.33
L1-icache-loads-misses  1.00        0.92        1.21        1.30        1.39        1.29
stalled-cycles-backend  1.00        0.97        0.99        0.93        1.00        0.99
stalled-cycles-frontend 1.00        1.01        1.01        0.99        0.97        1.01
cpu/o                   1.00        1.03        1.15        1.15        1.17        1.18
QPS                     1.00        0.97        0.87        0.87        0.85        0.85

InnoDB

                        5.6.51  5.7.10  5.7.44  8.0.11  8.0.28  8.0.39
branches                1.00    1.06    1.10    1.16    1.21    1.27
branch-misses           1.00    1.02    1.06    1.21    1.26    1.40
cache-misses            1.00    1.08    1.13    1.28    1.42    1.57
cache-references        1.00    1.07    1.12    1.24    1.34    1.46
cycles                  1.00    1.06    1.07    1.10    1.12    1.15
dTLB-load-misses        1.00    0.85    0.94    0.94    1.04    1.09
dTLB-loads              1.00    1.32    1.49    1.65    1.71    1.84
GHz                     1.00    1.01    1.02    1.03    1.03    1.04
instructions            1.00    1.06    1.10    1.19    1.20    1.28
ipc                     1.00    1.04    1.04    1.02    0.96    0.92
iTLB-load-misses        1.00    1.66    1.30    1.69    2.57    2.99
iTLB-loads              1.00    1.20    1.38    1.49    1.52    1.63
L1-dcache-load-misses   1.00    1.09    1.15    1.25    1.27    1.34
L1-dcache-loads         1.00    1.06    1.12    1.21    1.24    1.33
L1-icache-loads         1.00    1.06    1.09    1.21    1.28    1.40
L1-icache-loads-misses  1.00    1.13    1.06    1.31    1.55    1.63
stalled-cycles-backend  1.00    0.90    0.86    0.91    0.93    0.98
stalled-cycles-frontend 1.00    0.92    0.95    0.95    0.97    1.00
cpu/o                   1.00    0.99    1.02    1.09    1.15    1.22
QPS                     1.00    1.04    1.01    0.94    0.89    0.83

Results from perf stat: update-index

The tl;dr here is that MyRocks and InnoDB have similar regressions.

Summary:
  • the relative QPS is 0.92 for modern MyRocks and 1.77 for modern InnoDB. So MyRocks gets slower over time. InnoDB in MySQL 8.0.39 is a lot faster than 5.6.51 but has been getting slower since 5.7.10 and there is a large regression from 8.0.28 to 8.0.39.
  • the CPU overhead per update (cpu/o) is 1.04X larger in modern MyRocks but for InnoDB it is smaller in 8.0.39 than in 5.6.51
  • in both MyRocks and InnoDB there are large increases for TLB and cache activity. My biggest concern is for L1-icache-loads-misses which is 1.30X & 1.75X larger in modern MyRocks & InnoDB.
  • the large values for iTLB-loads and iTLB-load-misses are a concern, but reducing them to near zero (via using huge pages for text) increased QPS on the scan microbenchmark by only ~5% (see here)
  • modern MyRocks suffers from code bloat because the number of instructions per SQL operation increases by 1.32X. For InnoDB this is less of an issue assuming the large increase from 8.0.28 to 8.0.39 is caused by mutex contention.
MyRocks

                        fbmy5635.a  fbmy5635.b  fbmy8028.a  fbmy8028.b  fbmy8032.a  fbmy8032.b
branches                1.00        1.05        1.12        1.13        1.23        1.30
branch-misses           1.00        1.02        1.03        1.04        1.09        1.09
cache-misses            1.00        1.01        1.12        1.12        1.19        1.18
cache-references        1.00        1.02        1.10        1.09        1.18        1.17
cycles                  1.00        1.00        0.94        0.92        0.96        0.96
dTLB-load-misses        1.00        0.96        0.33        0.31        0.31        0.32
dTLB-loads              1.00        0.98        1.15        1.10        1.17        1.17
GHz                     1.00        1.00        1.01        1.01        1.02        1.02
instructions            1.00        1.06        1.11        1.13        1.25        1.32
ipc                     1.00        1.05        1.16        1.17        1.21        1.26
iTLB-load-misses        1.00        1.06        0.64        0.67        0.71        0.71
iTLB-loads              1.00        1.23        2.54        2.41        2.72        2.68
L1-dcache-load-misses   1.00        1.00        1.03        1.00        1.04        1.04
L1-dcache-loads         1.00        1.03        1.11        1.08        1.13        1.02
L1-icache-loads         1.00        1.01        1.10        0.97        1.19        1.18
L1-icache-loads-misses  1.00        1.11        1.22        1.22        1.34        1.30
stalled-cycles-backend  1.00        0.49        0.77        0.42        0.76        0.75
stalled-cycles-frontend 1.00        0.96        0.41        0.41        0.43        0.42
cpu/o                   1.00        1.01        1.02        1.01        1.04        1.04
QPS                     1.00        0.98        0.97        0.94        0.92        0.92

InnoDB

                        5.6.51  5.7.10  5.7.44  8.0.11  8.0.28  8.0.39
branches                1.00    0.46    0.43    1.19    0.75    1.30
branch-misses           1.00    0.70    0.67    1.20    0.85    1.26
cache-misses            1.00    1.01    0.97    1.35    1.20    1.52
cache-references        1.00    0.89    0.86    1.39    1.11    1.52
cycles                  1.00    1.40    1.29    1.01    1.40    1.30
dTLB-load-misses        1.00    0.75    0.73    1.61    0.98    1.34
dTLB-loads              1.00    0.91    0.98    1.94    1.36    1.68
GHz                     1.00    1.00    1.00    0.99    0.99    1.00
instructions            1.00    0.45    0.43    1.25    0.77    1.36
ipc                     1.00    1.00    1.20    1.77    1.76    1.84
iTLB-load-misses        1.00    1.01    0.99    1.65    1.61    2.06
iTLB-loads              1.00    1.58    1.66    2.20    1.91    2.32
L1-dcache-load-misses   1.00    0.73    0.74    1.30    0.94    1.34
L1-dcache-loads         1.00    0.48    0.51    1.31    0.85    1.44
L1-icache-loads         1.00    0.78    0.73    1.29    0.98    1.36
L1-icache-loads-misses  1.00    0.98    1.03    1.62    1.34    1.75
stalled-cycles-backend  1.00    0.65    0.59    1.28    0.84    1.29
stalled-cycles-frontend 1.00    0.50    0.51    1.10    0.72    1.00
cpu/o                   1.00    0.41    0.33    0.59    0.40    0.62
QPS                     1.00    3.12    3.62    1.44    3.21    1.77
