Small Datum: Explaining changes in MySQL performance via hardware perf counters: part 1

I spend much time documenting how MySQL performance has changed over the years. After my latest round of benchmarks I looked at flamegraphs from MySQL/InnoDB during the insert benchmark. Unfortunately, I didn't see anything obvious when comparing flamegraphs for MySQL 5.6, 5.7 and 8.0. Mostly, the flamegraphs looked the same -- the percentage of time in various call stacks was similar, and the call stacks were similar.

So the flamegraphs look similar, but MySQL 8.0 gets less work done per second. A possible explanation is that everything gets slower, perhaps from more cache misses, so I used Linux perf to debug this and the results are interesting.

What happened in MySQL 5.7 and 8.0 to put so much more stress on the memory system?

tl;dr

It looks like someone sprinkled magic go slower dust across most of the MySQL code because the slowdown from MySQL 5.6 to 8.0 is not isolated to a few call stacks.
MySQL 8.0 uses ~1.5X more instructions/operation than 5.6. Cache activity (references, loads, misses are frequently up 1.5X or more. TLB activity (loads, misses) is frequently up 2X to 5X with the iTLB being a bigger problem than the dTLB.
innodb_log_writer_threads=ON is worse than I thought. I will soon have a post on that.

Too long to be a tl;dr

I don't have much experience using Linux perf counters to explain performance changes.
There are huge increases for data TLB, instruction TLB and L1 cache counters. From the Xeon CPU (socket2 below) the change from MySQL 5.6.21 to 8.0.34 measured as events/query are:

branches, branch misses: up ~1.8X, ~1.5X
cache references: up ~1.4X
instructions: up ~1.8X
dTLB loads, load-misses, stores, store-misses: up ~2.0X, ~4.7X, ~2.1X, ~3.5X
iTLB loads, load-misses: up ~6.5X, ~5.0X
L1 data cache loads, load-misses, stores: up ~2.0X, ~2.0X, ~2.1X
L1 instruction cache load-misses: up ~2.8X
LLC loads, load-misses, stores, store-misses: up ~1.3X, ~1.1X, ~1.1X, ~1.1X and it is interesting that the growth here is much less than the other counters
Context switches, CPU migrations: up ~1.9X, ~7.9X

Two open questions

Does the increase in context switches and CPU migrations explains the rise in the cache and TLB counters?
Is this change caused by the use of innodb_log_writer_threads=ON, which I set for the big server (socket2) but not for the small servers (beelink, ser7).

For many of the HW counters the biggest jumps occur between the last point release in 5.6 and the first in 5.7 and then again between the last in 5.7 and the first in 8.0. Perhaps this is good news because it means the problems are not spread across every point release.

The posts in this series are:

part 1 - (this post) introduction and results for the l.i0 (initial load) benchmark step
part 2 - results for the l.i1 (write-only) benchmark step
part 3 - results for the q100 (read+write) benchmark step
part 4 - results for the q1000 (read+write) benchmark step

Builds

It isn't easy to build older code on newer systems, compilers, etc. Notes on that are here for 5.6, for 5.6 and 5.7, for 5.7 and for 8.0. A note on using cmake is here. The rel builds were used as everything was compiled using CMAKE_BUILD_TYPE=Release.

Tests were done for:

5.6 - 5.6.21, 5.6.31, 5.6.41, 5.6.51. Note that 5.6.51 is the last release for 5.6.
5.7 - 5.7.10, 5.7.20, 5.7.30, 5.7.43. Note that 5.7.43 is the next-to-last release for 5.7 and 5.7.10 is the first.
8.0 - 8.0.14, 8.0.20, 8.0.28, 8.0.34. Note that 8.0.13 was the first GA for 8.0.

Servers

Tests were run on 3 servers:

beelink - the old server from Beelink explained here that has an AMD 4700u CPU with 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04
ser7 - the new server from Beelink (SER7 7840HS) that has an AMD 7840HS CPU with 8 cores, 32G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04
socket2 - my new 2-socket server with 2 Intel Xeon Silver CPU @ 2.40GHz, 64G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04

The servers named beelink and ser7 use a mobile CPU so I worried whether the results there are truthy and repeated tests on the Xeon. I am most interested in the results from socket2 as that uses a server CPU.

Configurations

The my.cnf files are here. Note that innodb_log_writer_threads=OFF for the small servers (beelink, ser7) but =ON for the big server (socket2). That likely explains some of the differences in the charts below.

Benchmarks

The Insert Benchmark was run in a cached setup and all tables were cached by InnoDB.

The benchmark is run with 12 clients/tables for the socket2 server and 1 client/table for the others. Each client uses a separate table. The benchmark is a sequence of steps and I often call each of these a benchmark step.

l.i0

insert 20 million rows per table

create 3 secondary indexes. I usually ignore performance from this step.

l.i1

insert and delete another 20 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail.

q100, q500, q1000

do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 1200 seconds.

Perf

I modified my benchmark helper scripts to run Linux perf and collect stats over 10 second intervals. One loop looks like the following and there was a sleep for 30+ seconds after each iteration of the loop. The data I share here is from the second-to-last sample collected per benchmark step.

perf stat -e cpu-clock,cycles,bus-cycles,instructions -p $pid -- sleep 10 ; sleep 2
perf stat -e cache-references,cache-misses,branches,branch-misses -p $pid -- sleep 10 ; sleep 2
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-icache-loads-misses -p $pid -- sleep 10 ; sleep 2
perf stat -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,dTLB-prefetch-misses -p $pid -- sleep 10 ; sleep 2
perf stat -e iTLB-load-misses,iTLB-loads -p $pid -- sleep 10 ; sleep 2
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses,LLC-prefetches -p $pid -- sleep 10 ; sleep 2
perf stat -e alignment-faults,context-switches,migrations,major-faults,minor-faults,faults -p $pid -- sleep 10 ; sleep 2

Performance

The charts below show average throughput (QPS, or really operations/s) for the l.i0 benchmark step.

The benchmark uses 1 client for the small servers (beelink, ser7) and 12 clients for the big server (socket2).
MySQL gets slower from 5.6 to 8.0 on the small servers (beelink, ser7) but slightly faster on the big server (socket2). Perhaps performance is dominated by CPU overhead on the small servers, but improvements for concurrent workloads offsets the new CPU overhead on the big server.

Results

The results are split into four parts because there are so many charts -- there is one part per benchmark step, and I ignore the l.x and q500 steps to save time. This post focuses on the l.i0 step that loads the database.

The purpose for this post is to document MySQL performance in terms of HW counters collected by Linux perf. The challenge is that throughput (QPS or operations/s) isn't the same across the versions of MySQL that I tested. As explained in the Perf section above I collected perf output over a 10-second interval. But the amount of work done (QPS or operations/s) in that interval was not constant. So rather than just graph the values of the HW counters, I graph:

($counter/query for my version) / ($counter/query for MySQL 5.6.21)

Assume:

cache-misses is 1M for MySQL 5.6.21 and 2M for 8.0.34, per 10-second interval
QPS is 10,000 for MySQL 5.6.21 and 5,000 for 8.0.34

Then cache-misses/query for MySQL 8.0.34 relative to 5.6.21 is:

(2M / 5000) / (1M / 10000) = 4

Were performance unchanged from MySQL 5.6.21 to 8.0.34 then I would expect the values of relative counter/query to be 1, but they tend to be a bit larger than 1 as shown here.

One things

the y-axis frequently does not start at zero to improve readability. But this also makes it harder to compare adjacent graphs
below when I write up 30% that is the same as up 1.3X. I switch from the former to the latter when the increase is larger than 99%.
CPI isn't scaled using the formula above because it is a rate, instead the charts show (CPI for my version) / (CPI for MySQL 5.6.21)

Spreadsheets are here for beelink, ser7 and socket2. See the Servers section above to understand the HW for beelink, ser7 and socket2.

The HW counter names used in the charts are the ones from perf output.

Results: branches and branch misses

Summary: