Small Datum: Why do table scans get slower with MySQL from 5.6 to 8.0?

As I search to explain why MySQL is getting (too much) slower over time I collect a lot of data from many servers and then get overwhelmed trying to summarize it. Here I focus on one microbenchmark (table scan) on one server (ax162-s from Hetzner) using upstream MySQL with InnoDB and FB MyRocks.

While my primary goal is to explain the regressions in MyRocks, many of them are inherited from upstream MySQL so I must first explain them in upstream MySQL with InnoDB.

My focus is on a test that does full table scans so the ratio of (time spent in storage engine / time spent elsewhere) is much larger for that test than many of the other tests I use with sysbench.

The basic problem is that the table scan done by my tests is

~1.5X faster for InnoDB in MySQL 5.6.51 vs 8.0.39
~1.5X faster for MyRocks in FB MySQL 5.6.35 vs 8.0.32

The problem is spread over many releases, but the summary is:

Code bloat - more instructions are executed per SQL statement

InnoDB 8.0.28 uses ~1.3X more instructions/query than InnoDB 5.6.51
MyRocks 8.0.32 uses ~1.3X more instructions/query than MyRocks 5.6.35

Memory system bloat - more TLB and cache misses per SQL statement

InnoDB 8.0.28 has a lot more cache & TLB loads & misses relative to InnoDB 5.6.51. The difference ranges from ~1.6X more cache-misses to ~1700X more iTLB-loads.
MyRocks 8.0.32 has a lot more cache & TLB loads & misses relative to MyRocks 5.6.35. The difference ranges from ~1.6X more cache-misses to ~22X more iTLB-loads.

The benchmark

I used sysbench but not as described here. Instead I just did load, a few point-query tests and then the full scan test. Here I focus on the full scan. I ran a subset of the tests to get more results in less time.

Tests are run with 1 thread, 1 table and 50M rows. While modern MySQL suffers from new CPU overheads it also benefits from a reduction in mutex contention. Alas, I don't highlight the improvements here. And modern MySQL would be even faster if it could reduce mutex contention without increasing CPU overheads.

I ran things 3 times:

plain - just run the test, don't use perf to get HW counters or flamegraphs
stat - run perf stat to collect HW counters
svg - run perf record -e cycles -F 1999 -g -p $pid to generate flame graphs

For stat the counters are samples over 10-second intervals for several groups of counters via:

perf stat -e cpu-clock,cycles,bus-cycles,instructions,branches,branch-misses -p $pid -- sleep 10

perf stat -e cache-references,cache-misses,stalled-cycles-backend,stalled-cycles-frontend -p $pid -- sleep 10

perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores -p $pid -- sleep 10

perf stat -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses -p $pid -- sleep 10

perf stat -e iTLB-load-misses,iTLB-loads,L1-icache-loads-misses,L1-icache-loads -p $pid -- sleep 10

perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses,LLC-prefetches -p $pid -- sleep 10

perf stat -e alignment-faults,context-switches,migrations,major-faults,minor-faults,faults -p $pid -- sleep 10

Database versions tested

For upstream MySQL I tested InnoDB from 5.6.51, 5.7.1, 5.7.3, 5.7.5, 5.7.7, 5.7.9, 5.7.10, 5.7.19, 5.7.35, 5.7.44, 8.0.0, 8.0.2, 8.0.4, 8.0.11, 8.0.13, 8.0.15, 8.0.17, 8.0.19, 8.0.20, 8.0.21, 8.0.23, 8.0.25, 8.0.27, 8.0.28, 8.0.39.

For MyRocks I tested:

5635.old - MyRocks 5.6.35 from Apr 7, 2021 at git hash f896415f with RocksDB 6.19.0
5635.new - MyRocks 5.6.35 from Oct 16, 2023 at git hash 4f3a57a1 with RocksDB 8.7.0
8028.old - MyRocks 8.0.28 from Aug 29 2022 at git hash a35c8dfe with RocksDB 7.5.2
8028.new - MyRocks 8.0.28 from Dec 2 2023 at git hash 4edf1eec with RocksDB 8.7.0
8032.old - MyRocks 8.0.32 from Dec 4 2023 at git hash e3a854e8 with RocksDB 8.7.0
8032.new - MyRocks 8.0.32 from May 29 2024 at git hash 49b37dfe with RocksDB 9.2.1

Charts: InnoDB

The charts use relative QPS which is: (QPS for my version / QPS for InnoDB 5.6.51). When it is less than zero then there is a performance regression. The spreadsheet with charts is here.

The regressions during 5.7 are small.

The regressions during 8.0 are larger. The worst occurs after 8.0.28 which hopefully is resolved in the soon-to-be released 8.0.40 -- see bug 111538..

Charts: MyRocks

The charts use relative QPS which is: (QPS for my version / QPS for MyRocks 5635.old). When it is less than zero then there is a performance regression. The spreadsheet with charts is here.

The large regression occurs during the 5.6.35 to 8.0.28 transition.

Hardware counters

Above I explained how perf stat is used to collect HW counters over 10 second intervals. It is important to remember that measurements are made over a fixed amount of time rather than a fixed amount of work (queries). I want know to know things like how many instructions are executed per query and for that I have to normalize the results from perf stat by QPS.

But before I spend too much time evaluating results from perf stat I confirm that the clock rate during the samples is similar. I learned this the hard way when using the schedutil CPU frequency governor with MyRocks (see here). The information for this comes from perf stat -e cycles via the line for cycles:

10,010.49 msec cpu-clock # 1.001 CPUs utilized

27,526,557,704 cycles # 2.750 GHz

<not supported> bus-cycles

122,334,672,524 instructions # 4.44 insn per cycle

23,595,569,084 branches # 2.357 G/sec

9,662,382 branch-misses # 0.04% of all branches

10.002641688 seconds time elapsed

I assume that the 10-second sample is representative and that is more true for read-heavy tests and less true for write-heavy tests because write-heavy has intermittent background work (b-tree write back, LSM compaction).

Using relative values

I focus on using relative values for the counters which is: (value for my version) / (value for base case). This makes it easy to see how things change over time from old to new releases but it doesn't show the significance of the change. For example, in many cases the value for context switches during a table scan grows by a large amount (as in 40X) but that isn't significant because the change might be from 10 context switches per second to 400. And I doubt such a change has any impact on the performance regressions.

Unfortunately, it isn't trivial to understand the significance of changes for other metrics. For example, when should I care more about cache vs TLB misses?

The data

At last, I can share some of the data collected by perf stat. The data is here for:

InnoDB 5.6.51 to 5.7.44

see here. The gist has two parts -- first is the counters that might explain the regression and the second is all of the counters. Everything uses relative values.

InnoDB 5.7.44 to 8.0.39

see here. The gist has three parts -- first is the counters (as relative values) that might explain the regressions, the second is the absolute values for those counters and the third is all of the counters as relative values.

MyRocks 5.6.35 to 8.0.32

see here. The gist has two parts -- first is the counters that might explain the regression and the second is all of the counters. Everything uses relative values.

First up is InnoDB:

MySQL 8.0.28 uses ~1.3 times more instructions/query than 5.6.51
MySQL 8.0.28 has 1.6X or more cache & TLB misses and loads per query than 5.6.51. The worst ratio is for iTLB-loads (over 1700X) but I don't know whether the absolute cost from that is more significant than for the other counters.

(value/query for MySQL 8.0.28) / (value/query for MySQL 5.6.51)

instructions (.991 / .109) / (1 / .135) = 1.227

cache-misses (1.301 / .109) / (1 / .135) = 1.611

cache-references (1.504 / .109) / (1 / .135) = 1.862

dTLB-loads (2.903 / .109) / (1 / .135) = 3.595

iTLB-load-misses (2.011 / .109) / (1 / .135) = 2.490

iTLB-loads (1412.0 / .109) / (1 / .135) = 1748.903

L1-dcache-load-misses (1.723 / .109) / (1 / .135) = 2.134

L1-icache-loads (3.101 / .109) / (1 / .135) = 3.840

L1-icache-loads-misses (15.607 / .109) / (1 / .135) = 19.330

stalled-cycles-backend (1.427 / .109) / (1 / .135) = 1.767

Next up is MyRocks:

MyRocks 8032.new uses ~1.3X more instructions/query vs 5635.old
MyRocks 8032.new has 1.6X or more cache & TLB misses and loads per query than 5635.old. The worst ratio is for iTLB-loads (over 22X) but I don't know whether the absolute cost from that is more significant than for the other counters.

(value/query for MyRocks 8032.new) / (value/query for 5635.old)

instructions (0.883 / .049) / (1 / .073) = 1.315

cache-references (1.410 / .049) / (1 / .073) = 2.100

dTLB-loads (1.253 / .049) / (1 / .073) = 1.866

iTLB-load-misses (1.104 / .049) / (1 / .073) = 1.644

iTLB-loads (15.068 / .049) / (1 / .073) = 22.449

L1-dcache-load-misses (1.275 / .049) / (1 / .073) = 1.899

L1-icache-loads (1.432 / .049) / (1 / .073) = 2.133

L1-icache-loads-misses (5.306 / .049) / (1 / .073) = 7.905

stalled-cycles-frontend (1.429 / .049) / (1 / .073) = 2.128

Small Datum

Monday, September 23, 2024

Why do table scans get slower with MySQL from 5.6 to 8.0?

No comments:

Post a Comment

Postgres 18 beta1: small server, IO-bound Insert Benchmark (v2)