Tuesday, October 8, 2024

What does "perf stat" tell us about MySQL performance regressions on a Ryzen 7 7735HS CPU

I used perf stat to collect a variety of HW performance counter and will share the results over several posts. I used 4 different servers with 4 different CPUs and there will be a post per CPU. This is my second post and my first post is here. The previous post and this one use a small server. Results here are similar to results on my other small server.

tl;dr

  • MySQL (InnoDB and MyRocks) gets slower over time from both memory system bloat (TLB and cache activity) and code bloat (more instructions/operation). For code bloat, InnoDB in MySQL 8.0.39 uses (1.44, 1.70, 1.30) times more instructions per SQL operation for (scan, insert, point-query) microbenchmarks relative to 5.6.51.
  • the regressions in MyRocks and InnoDB are similar so this appears to be independent of the storage engine (perhaps code above the storage engine, perhaps just new code in general)

Builds

I compiled upstream MySQL from source for versions 5.6.51, 5.7.10, 5.7.44, 8.0.11, 8.0.28 and 8.0.39.

I compiled FB MySQL from source and tested the following builds:
  • fbmy5635_rel_o2nofp_210407_f896415f_6190
    • MyRocks 5.6.35 at git sha f896415f (as of 21/04/07) with RocksDB 6.19
  • fbmy5635_rel_o2nofp_231016_4f3a57a1_870
    • MyRocks 5.6.35 at git sha 4f3a57a1 (as of 23/10/16) with RocksDB 8.7.0
  • fbmy8028_rel_o2nofp_220829_a35c8dfe_752
    • MyRocks 8.0.28 at git sha a35c8dfe (as of 22/08/29) with RocksDB 7.5.2
  • fbmy8028_rel_o2nofp_231202_4edf1eec_870
    • MyRocks 8.0.28 at git sha 4edf1eec (as of 23/12/02) with RocksDB 8.7.0
  • fbmy8032_rel_o2nofp_231204_e3a854e8_870
    • MyRocks 8.0.32 at git sha e3a854e8 (as of 23/12/04) with RocksDB 8.7.0
  • fbmy8032_rel_o2nofp_240529_49b37dfe_921
    • MyRocks 8.0.32 at git sha 49b37dfe (as of 24/05/29) with RocksDB 9.2.1
The my.cnf files are here for MyRocks: 5.6.358.0.28 and 8.0.32.

The my.cnf files for InnoDB are in the subdirectories here. Given the amount of innovation in MySQL 8.0 I can't use one my.cnf file for all 8.0 versions.

Updates:
  • I corrected how I display instructions in the tables below. Originally, I just used (instructions for my version / instructions for base version) which is the number of instructions per 10-second interval for some version relative to the base version. Now I use (instructions for my version / instructions for base version / relative QPS) which is the number of instructions per operation relative to the base version. With this change it is easy to spot code bloat, that occurs when the value is much greater than 1.0. When the value in the table is 2.0 that means that some version uses 2X more instructions per SQL operation which usually means code bloat, although it can also mean more mutex contention (see the results for insert with MySQL 8.0.11).
  • I updated the tables below to indicate cases where regressions from perf HW counters arrive in fbmy5635_rel_o2nofp_231016_4f3a57a1_870. I did not expect this to happen in MyRocks 5.6.35. I do expect it to happen in MyRocks 8.0.
Hardware

The server here is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU with 8 cores and SMT disabled, 32G of RAM and Ubuntu 22.04. The storage is 1 NVMe device and ext4 with data=writeback.

The CPU used here (AMD 7735HS) and the CPU from my previous post (AMD 4700u) are both described as laptop class CPUs. They might be more prone to suffer from code bloat than server CPUs. But have no fear, the next two CPUs for which I have results are server class.

Benchmark

I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by MyRocks and InnoDB.

The benchmark is run with 1 thread, 1 table and 50M rows. Each microbenchmark runs for 330 seconds if read-only and 630 seconds otherwise. Prepared statements were enabled.

The command line for my helper scripts was:
    bash r.sh 1 50000000 330 630 nvme0n1 1 1 1

Using perf

I used perf stat with code that does:

  1. Sleep for 30 seconds
  2. Run perf stat 7 times for 10 seconds at a time. Each run collects different counters - see here.
So it takes ~100 seconds per loop and given that I run the read-heavy tests for ~330 seconds and the write-heavy tests for 630 seconds I collect data from ~3 loops for read-heavy and ~6 loops for write-heavy.

I then compare results across DBMS versions for one (or a few) of the loops and generate two tables per microbenchmark -- one with absolute values, the other with values relative to the base version. The base version for InnoDB is 5.6.51 and for MyRocks is fbmy5635_rel_o2nofp_210407_f896415f_6190. By values relative I mean: (value for my version / value for base version). My focus is to compare MyRocks with MyRocks and InnoDB with InnoDB. I assume that most of the regressions for MyRocks are also in InnoDB and then try to confirm that assumption.

Results

For the results I normally split the 42 microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. But I don't do that here because I share 1 (or 2) tables of results from perf stat per microbenchmark and doing that for 42 microbenchmarks is too much.

So I focus on four that I think are representative:
  • scan - do full tables scans in a loop for ~300 seconds
  • insert - do inserts
  • point-query - fetch 1 row by exact match on the PK index
  • update-index - do updates that require secondary index maintenance
For each microbenchmark I continue to use relative QPS (rQPS) which is:
     (QPS for my version / QPS for base version)

And the base version is explained above in the Using perf section.

The relative QPS is here for MyRocks. The QPS for the subset of microbenchmarks on which I focus is below. The results for a recent build for FB MyRocks 8.0.32 are in col-5 and the relative QPS values range from 0.64 for insert to 0.81 for update-index. When the relative QPS for insert is 0.64 with a recent MyRocks 8.0.32 build, the insert throughput in that recent build is 64% of the the rate I get with an old build of MyRocks 5.6.35 -- when the workload is CPU bound. I am not happy about that.

Relative to: fbmy5635_rel_o2nofp_210407_f896415f_6190
col-1 : fbmy5635_rel_o2nofp_231016_4f3a57a1_870
col-2 : fbmy8028_rel_o2nofp_220829_a35c8dfe_752
col-3 : fbmy8028_rel_o2nofp_231202_4edf1eec_870
col-4 : fbmy8032_rel_o2nofp_231204_e3a854e8_870
col-5 : fbmy8032_rel_o2nofp_240529_49b37dfe_921

col-1   col-2   col-3   col-4   col-5
0.98    0.73    0.78    0.78    0.75    scan_range=100
0.94    0.69    0.68    0.64    0.64    insert_range=100
0.95    0.83    0.83    0.80    0.79    point-query_range=100
0.96    0.86    0.84    0.81    0.81    update-index_range=100

The relative QPS is here for InnoDB. The QPS for the subset of microbenchmarks on which I focus is below. The results for MySQL 8.0.39 are in col-5 and the relative QPS values range from 0.58 for insert to 1.19 for update-index. Again, the low values (0.58 for insert, 0.70 for point-query, 0.71 for scan) mean that MySQL 8.0.39 is much slower than 5.6.51. Also note that most of the values in col-5 (8.0.39) are a bit smaller than the values in col-4 (8.0.28), especially for scan. I think this is bug 111538 which might get fixed in the upcoming 8.0.40 release. The results for update-index here are odd, just as they were on my other small server.
 
Relative to: my5651_rel_o2nofp
col-1 : my5710_rel_o2nofp
col-2 : my5744_rel_o2nofp
col-3 : my8011_rel_o2nofp
col-4 : my8028_rel_o2nofp
col-5 : my8039_rel_o2nofp

col-1   col-2   col-3   col-4   col-5
0.98    0.89    0.97    0.84    0.71    scan_range=100
0.89    0.82    0.74    0.63    0.58    insert_range=100
0.92    0.87    0.80    0.74    0.70    point-query_range=100
1.57    1.17    1.57    1.12    1.19    update-index_range=100

Explaining regressions

While it would be great if most of the regressions were caused by a small number of diffs, that does not appear to be the case. When I look at QPS across all of the point releases I see a few things:
  • larger regressions across major releases (5.6.51 to 5.7.10 and 5.7.44 to 8.0.11)
  • many regressions across 8.0 releases (8.0.11 through 8.0.39)
  • some regressions across 5.7 releases (5.7.10 through 5.7.44)
  • not many regressions across 5.6 releases
At a high level there are two reasons for the regressions and I am working to distinguish between them when I look at results from perf record and perf stat.
  • code bloat - more instructions are executed per SQL command
  • memory system bloat - more cache and TLB activity per SQL command
With either tool (perf stat, perf record) I run it for a fixed number of seconds while the amount of work (QPS) completed during that interval is not fixed -- for low-concurrency workloads the QPS is larger for older MySQL and smaller for modern MySQL. I must keep this in mind when interpreting the results. 

With flamegraphs from perf record a common case is that the distribution of time per function doesn't change that much even when there are large regressions and my hypothesis is that memory system bloat is the problem. 

When looking at perf stat output I start with the absolute values for the counters. When they are small then a 10X difference in the value for that counter between old and modern MySQL is frequently not significant as it won't have a large impact on performance. Otherwise, I focus on the relative values for these counters which is (value for my version / value for base version). I then use the old version of MySQL or FB MyRocks as the base version and try to focus on the cases where the relative value is large (perhaps larger than 1.15, as in a 15% increase).

Just as with perf record, the values from perf stat are measured over a fixed number of seconds while the QPS completed during that interval tends to be larger for older MySQL. So I then compute another value which is: (value for my version / value for base version / relative QPS for my version).  This is the relative value for a counter per query (query == SQL command) and when that is equal to 1.5 for some version then some version does 1.5X more for that counter per query (1.5X more cache misses, etc).

While I use the word significant above to mean that the value for some counter is interesting it isn't easy to estimate how much that contributes to a performance regression.

Looking at perf stat output

This spreadsheet has the interesting counters from perf stat for the scan, insert, point-query and update-index microbenchmarks. The set of tables on the left has the absolute values for the counters and on the right are the normalized values -- (value for my version / value for base version / relative QPS).

The amd-rx tab has results for MyRocks and the amd-in tab has results for InnoDB on the server I write about here (the SER4 explained above). And I will inline the tables with normalized values below.

For MyRocks the version names I use below are abbreviated from what I explained above:
  • fbmy5635.a - fbmy5635_rel_o2nofp_210407_f896415f_6190
  • fbmy5635.b - fbmy5635_rel_o2nofp_231016_4f3a57a1_870
  • fbmy8028.a - fbmy8028_rel_o2nofp_220829_a35c8dfe_752
  • fbmy8028.b - fbmy8028_rel_o2nofp_231202_4edf1eec_870
  • fbmy8032.a - fbmy8032_rel_o2nofp_231204_e3a854e8_870
  • fbmy8032.b - fbmy8032_rel_o2nofp_240529_49b37dfe_921
A few more things:
  • I use yellow to highlight counters for which there are large increases
  • I use red to highlight the values for cpu/o (relative CPU per SQL command) and relative QPS
  • values for cycles, GHz, ipc, cpu/o and relative QPS are not normalized as described above. For them the value is computed as: (value for me / value for base version) and I don't divide that by the relative QPS.
Results from perf stat: scan

The tl;dr here is that MyRocks and InnoDB have similar regressions.

Summary:
  • the relative QPS is 0.75 & 0.71 for modern MyRocks & InnoDB so they get ~75% of the throughput vs older MyRocks & InnoDB.
  • the CPU overhead per scan (cpu/o) is 1.36X & 1.41X larger in modern MyRocks & InnoDB vs older MyRocks & InnoDB
  • in both MyRocks and InnoDB there are large increases for TLB and cache activity. My biggest concern is for L1-icache-loads-misses which is 3.24X & 33.34X larger in modern MyRocks & InnoDB.
  • the large values for iTLB-loads and iTLB-load-misses are a concern, but reducing them to near zero (via using huge pages for text) increased QPS on the scan microbenchmark by only ~5% (see here)
  • both MyRocks and InnoDB suffer from code bloat because the number of instructions per SQL operation increases by 1.30X & 1.44X in modern MyRocks & InnoDB
  • there is one case where HW perf counters show a regression in fbmy5635.b and I did not expect that
MyRocks

fbmy5635.afbmy5635.bfbmy8028.afbmy8028.bfbmy8032.afbmy8032.b
branches1.001.041.391.271.301.33
branch-misses1.000.720.920.800.880.81
cache-references1.000.891.381.681.433.29
cycles1.001.001.001.001.001.00
dTLB-load-misses1.000.811.040.790.830.87
dTLB-loads1.000.994.901.551.881.69
GHz1.001.001.001.001.001.00
instructions1.001.041.361.251.271.30
ipc1.001.020.990.980.990.97
iTLB-load-misses1.000.981.331.191.671.64
iTLB-loads1.001.6625.801.3212.686.07
L1-icache-loads1.001.071.601.561.371.45
L1-icache-loads-misses1.000.260.481.800.913.24
stalled-cycles-backend1.000.300.430.473.414.31
stalled-cycles-frontend1.000.941.891.811.343.74
cpu/o1.001.021.401.291.301.36
relative QPS1.000.980.730.780.780.75

InnoDB

5.6.515.7.105.7.448.0.118.0.288.0.39
branches1.001.051.141.061.191.47
branch-misses1.000.991.000.240.260.44
cache-references1.000.981.061.912.111.88
cycles1.001.001.001.001.001.00
dTLB-load-misses1.000.980.981.061.051.05
dTLB-loads1.000.760.611.103.934.40
GHz1.001.001.001.001.001.00
instructions1.001.021.081.041.191.44
ipc1.000.990.971.011.001.03
iTLB-load-misses1.001.391.384.212.346.11
iTLB-loads1.001.221.481.57237.592381.81
L1-icache-loads1.001.111.530.630.601.18
L1-icache-loads-misses1.003.8117.206.6041.4433.34
stalled-cycles-backend1.000.761.843.161.723.90
stalled-cycles-frontend1.001.191.981.061.341.80
cpu/o1.001.041.121.061.191.41
relative QPS1.000.980.890.970.840.71

Results from perf stat: insert

The tl;dr here is that MyRocks and InnoDB have similar regressions.

Summary:
  • the relative QPS is 0.64 & 0.58 for modern MyRocks & InnoDB so they get ~60% of the throughput vs older MyRocks & InnoDB.
  • the CPU overhead per insert (cpu/o) is 1.64X & 2.03X larger in modern MyRocks & InnoDB vs older MyRocks & InnoDB
  • in both MyRocks and InnoDB there are large increases for TLB and cache activity. My biggest concern is for L1-icache-loads-misses which is 1.53X & 2.57X larger in modern MyRocks & InnoDB.
  • the large values for iTLB-loads and iTLB-load-misses are a concern, but reducing them to near zero (via using huge pages for text) increased QPS on the scan microbenchmark by only ~5% (see here)
  • both MyRocks and InnoDB suffer from code bloat because the number of instructions per SQL operation increases by 1.57X & 1.70X in modern MyRocks & InnoDB
  • there are two cases where HW perf counters show a regression in fbmy5635.b and I did not expect that
MyRocks

fbmy5635.afbmy5635.bfbmy8028.afbmy8028.bfbmy8032.afbmy8032.b
branches1.001.111.481.501.581.61
branch-misses1.001.051.471.521.591.60
cache-references1.000.901.251.361.331.34
cycles1.000.991.021.021.041.03
dTLB-load-misses1.001.201.811.591.911.58
dTLB-loads1.001.211.501.451.481.52
GHz1.001.001.011.011.011.01
instructions1.001.081.441.461.541.57
ipc1.001.030.970.970.950.97
iTLB-load-misses1.001.072.422.643.313.50
iTLB-loads1.001.081.391.391.441.46
L1-icache-loads1.001.071.531.561.661.67
L1-icache-loads-misses1.001.021.341.441.521.53
stalled-cycles-backend1.000.901.140.701.081.10
stalled-cycles-frontend1.000.871.011.020.990.98
cpu/o1.001.091.411.541.601.64
relative QPS1.000.940.690.680.640.64

InnoDB

5.6.515.7.105.7.448.0.118.0.288.0.39
branches1.001.101.163.541.631.77
branch-misses1.001.161.311.981.761.95
cache-references1.001.241.371.651.801.95
cycles1.001.081.072.471.111.13
dTLB-load-misses1.000.941.281.461.431.42
dTLB-loads1.001.441.652.232.202.36
GHz1.001.031.031.011.041.04
instructions1.001.101.173.521.571.70
ipc1.000.910.881.050.880.87
iTLB-load-misses1.001.982.133.854.555.88
iTLB-loads1.001.441.781.712.062.12
L1-icache-loads1.001.261.401.951.862.09
L1-icache-loads-misses1.001.481.441.882.302.57
stalled-cycles-backend1.000.950.982.051.171.18
stalled-cycles-frontend1.000.830.892.650.971.00
cpu/o1.001.221.383.051.832.03
relative QPS1.000.890.820.740.630.58

Results from perf stat: point-query

The tl;dr here is that MyRocks and InnoDB have similar regressions.

Summary:
  • the relative QPS is 0.79 & 0.70 for modern MyRocks & InnoDB so they get ~75% of the throughput vs older MyRocks & InnoDB.
  • the CPU overhead per query (cpu/o) is 1.33X & 1.73X larger in modern MyRocks & InnoDB vs older MyRocks & InnoDB
  • in both MyRocks and InnoDB there are large increases for TLB and cache activity. My biggest concern is for L1-icache-loads-misses which is 1.46X & 1.37X larger in modern MyRocks & InnoDB.
  • the large values for iTLB-loads and iTLB-load-misses are a concern, but reducing them to near zero (via using huge pages for text) increased QPS on the scan microbenchmark by only ~5% (see here)
  • both MyRocks and InnoDB suffer from code bloat because the number of instructions per SQL operation increases by 1.18X & 1.30X in modern MyRocks & InnoDB
  • there are three cases where HW perf counters show a regression in fbmy5635.b and I did not expect that
MyRocks

fbmy5635.afbmy5635.bfbmy8028.afbmy8028.bfbmy8032.afbmy8032.b
branches1.001.041.121.111.161.19
branch-misses1.001.041.231.231.291.27
cache-references1.001.071.271.261.341.35
cycles1.001.001.021.031.031.03
dTLB-load-misses1.000.800.930.820.880.90
dTLB-loads1.001.041.411.361.421.47
GHz1.001.001.011.011.011.01
instructions1.001.031.111.101.151.18
ipc1.001.000.900.900.900.92
iTLB-load-misses1.001.161.631.782.082.22
iTLB-loads1.001.021.251.221.381.40
L1-icache-loads1.001.021.241.211.301.26
L1-icache-loads-misses1.001.171.261.251.621.46
stalled-cycles-backend1.002.182.611.002.793.00
stalled-cycles-frontend1.000.990.990.970.991.01
cpu/o1.001.061.271.271.341.33
relative QPS1.000.950.830.830.800.79

InnoDB

5.6.515.7.105.7.448.0.118.0.288.0.39
branches1.001.071.141.191.281.29
branch-misses1.001.061.141.241.321.46
cache-references1.001.121.201.331.451.51
cycles1.001.041.051.081.081.10
dTLB-load-misses1.001.111.151.221.341.05
dTLB-loads1.001.461.691.812.062.08
GHz1.001.021.031.041.041.05
instructions1.001.071.141.221.271.30
ipc1.000.950.930.900.870.82
iTLB-load-misses1.001.531.612.323.474.79
iTLB-loads1.001.271.421.511.631.60
L1-icache-loads1.001.271.241.361.471.56
L1-icache-loads-misses1.000.870.791.090.941.37
stalled-cycles-backend1.001.151.091.131.291.25
stalled-cycles-frontend1.000.930.980.980.980.95
cpu/o1.001.271.281.501.561.73
relative QPS1.000.920.870.800.740.70

Results from perf stat: update-index

The tl;dr here is that MyRocks and InnoDB have similar regressions.

Summary:
  • the relative QPS is 0.81 & 1.19 for modern MyRocks & InnoDB so MyRocks gets slower over time while InnoDB in 8.0.39 is faster than 5.6.51 but has been getting slower since 5.7.10
  • the CPU overhead per operation (cpu/o) is 1.23X larger in modern vs older MyRocks but smaller in modern vs older InnoDB
  • in both MyRocks and InnoDB there are large increases for TLB and cache activity. My biggest concern is for L1-icache-loads-misses which is 1.18X & 1.30X larger in modern MyRocks & InnoDB.
  • the large values for iTLB-loads and iTLB-load-misses are a concern, but reducing them to near zero (via using huge pages for text) increased QPS on the scan microbenchmark by only ~5% (see here)
  • MyRocks suffers from code bloat because the number of instructions per SQL operation increases by 1.39X. InnoDB does better in 8.0.39 than 5.6.51.
  • there are two cases where HW perf counters show a regression in fbmy5635.b and I did not expect that
MyRocks

fbmy5635.afbmy5635.bfbmy8028.afbmy8028.bfbmy8032.afbmy8032.b
branches1.001.111.275.001.381.41
branch-misses1.001.041.101.451.171.17
cache-references1.001.041.081.041.241.33
cycles1.001.021.011.611.011.01
dTLB-load-misses1.000.990.970.910.930.78
dTLB-loads1.001.031.321.281.361.34
GHz1.001.001.001.021.001.00
instructions1.001.121.255.011.361.39
ipc1.001.061.062.601.091.11
iTLB-load-misses1.001.151.771.872.222.34
iTLB-loads1.001.111.271.281.381.38
L1-icache-loads1.001.011.151.151.211.24
L1-icache-loads-misses1.001.331.161.211.481.18
stalled-cycles-backend1.000.680.730.300.710.90
stalled-cycles-frontend1.000.840.760.580.700.78
cpu/o1.001.051.101.201.241.23
relative QPS1.000.960.860.840.810.81

InnoDB

5.6.515.7.105.7.448.0.118.0.288.0.39
branches1.000.790.710.830.730.73
branch-misses1.000.840.530.560.470.43
cache-references1.000.830.960.981.060.97
cycles1.001.281.091.470.760.80
dTLB-load-misses1.000.750.890.860.770.69
dTLB-loads1.001.131.511.861.741.57
GHz1.001.021.031.011.031.04
instructions1.000.780.730.860.770.78
ipc1.000.960.790.921.131.15
iTLB-load-misses1.000.830.650.650.820.78
iTLB-loads1.001.342.101.622.061.82
L1-icache-loads1.000.910.680.720.660.64
L1-icache-loads-misses1.001.021.141.261.311.30
stalled-cycles-backend1.000.680.330.550.290.30
stalled-cycles-frontend1.000.620.280.750.310.29
cpu/o1.000.771.130.980.850.72
relative QPS1.001.571.171.571.121.19


No comments:

Post a Comment

Evaluating vector indexes in MariaDB and pgvector: part 2

This post has results from the ann-benchmarks with the   fashion-mnist-784-euclidean  dataset for MariaDB and Postgres (pgvector) with conc...