This blog post attempts to explain some of the performance regressions that landed in InnoDB with MySQL 8.0.30. I'd rather not try to debug perf regressions long after they arrived but here we are. I assume that most of the problems landing in MySQL 8.0.28+ are also in 8.4 and 9.X, so these are problems the community will be dealing with for a long time.
One blog post that documents the regressions is here. The regression that I am trying to explain occurs on the dell32 server (see below) where the throughput for the update-index microbenchmark drops in half after MySQL 8.0.28.
tl;dr
- I filed bug 116463 for this
- Behavior related to innodb_spin_wait_delay changed in 8.0.30 and I assume that was not intentional. One workaround is to reduce innodb_spin_wait_delay from 6 to 5
- On the server that has the largest regression for the update-index microbenchmark, several workarounds improve QPS by ~1.25X. Unfortunately, that only fixes about 1/3 of the regression
- I am curious whether there are other changes after 8.0.28 to InnoDB hash functions that contribute to regressions. I have no doubt that the new hash functions are better with respect to hashing things uniformly, but that might come at a cost in latency. Most of these are inlined and perf problems caused by inlined functions are harder to spot long after the fact.
The Plot
For CPU-bound sysbench on a variety of servers there are large regressions that arrived after 8.0.28 and one of the worst occurs for the update-index microbenchmark on only one of my servers (32-core AMD Threadripper) where throughput drops almost in half after 8.0.28 (in 8.0.30 through 8.0.40). From vmstat I see that the amount of CPU and number of context switches per update increase by ~1.5X when throughput drops.
While reading release notes and looking at some of the code I notice that code used by InnoDB busy-wait loops (ut_rnd_interval in 8.0.28, ut::random_from_interval_fast in 8.0.30+) has changed. Alas, understanding what has changed is harder because there is much noise (code updated to use more C++) hiding the signal (real changes).
There is a clue in the 8.0.30 release notes, alas the bugs are not in the public MySQL bug database.
InnoDB: Hash and random generator functions in the InnoDB sources were improved. (Bug #16739204, Bug #23584861)
The problems that I see are:
- default busy-wait loop time has a 20% increase in 8.0.30+
- code that determines the busy-wait loop time is slower in 8.0.30+
- ut::random_from_interval_fast is slower than ut::random_from_interval
- InnoDB makes this expensive to debug
- beelink
- Beelink SER4 with an AMD Ryzen 7 4700 CPU with SMT disabled, 8 cores, 16G of RAM, Ubuntu 22.04 and ext4 on 1 NVMe device. This is a laptop-class CPU in a mini PC.
- pn53
- ASUS ExpertCenter PN53 with AMD Ryzen 7 7735HS, with SMT disabled, 8 cores, 32G RAM, Ubuntu 22.04 and ext4 on 1 NVMe device. This is a laptop-class CPU in a mini PC.
- dell32
- Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32-Cores, 2 m.2 SSD (each 2TB, SW RAID 0, ext4). This is a server-class CPU in a server.
- ax162-s
- AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G RAM, Ubuntu 22.04 and ext4 on 2 NVMe devices with SW RAID 1. This is in the Hetzner cloud. This is a server-class CPU in a server.
- socket2
- SuperMicro SuperWorkstation 7049A-T with 2 sockets, 12 cores/socket, 64G RAM, Ubuntu 22.04 and ext4 on 1 m.2 SSD. The CPU is Intel Xeon Silver 4214R CPU @ 2.40GHz
InnoDB has custom mutexes and rw-locks that add value (no snark) beyond what you get from pthreads. The value is special semantics for the rw-lock and support for debugging and performance monitoring. While there are pthreads variants that include a busy-wait loop, the duration of the busy-wait is a bit of a black box with the pthreads versions. The busy-wait duration for the InnoDB versions is usually between 0 and 7 microseconds which mostly works great for InnoDB.
The busy-wait code is:
- MySQL 8.0.28 and earlier
- ut_delay(ut_rnd_interval(0, srv_spin_wait_delay))
- MySQL 8.0.30 to 8.0.33
- ut_delay(ut::random_from_interval(0, srv_spin_wait_delay))
- MySQL 8.0.34 to 8.0.40
- ut_delay(ut::random_from_interval_fast(0, srv_spin_wait_delay))
The problem is that with innodb_spin_wait_delay=6, the return values are between 0 and 5 in MySQL 8.0.28 and earlier releases but between 0 and 6 in 8.0.30+. The docs for 8.0, 8.4 and 9.0 still describe the old (8.0.28) behavior for innodb_spin_wait_delay, which I hope is restored to avoid the need to update docs or my.cnf files.
- for 8.0.28 and earlier the average value is 2.5 -> (0 + 1 + 2 + 3 + 4 + 5) / 6
- for 8.0.30+ the average value is 3.0 -> (0 + 1 + 2 + 3 + 4 + 5 + 6) / 7
- 3 / 2.5 = 1.2, so the average value is 20% larger in 8.0.30+
- in MySQL 8.0.28: return (low + (rnd % (high - low)));
- in MySQL 8.0.30: return low + (random_64() % (high - low + 1));
- in MySQL 8.0.40: return low + (random_64_func() % (high - low + 1));
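To make the off-by-one concrete, here is a minimal sketch of the two range computations quoted above. The function names (interval_exclusive, interval_inclusive, mean_over_cycle) are mine and are not from the MySQL source, and the rnd argument is a stand-in for whatever the PRNG returns. It only models the modulo arithmetic, not the PRNGs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the return statements quoted above.
// `rnd` plays the role of the raw PRNG output in both versions.

// 8.0.28 style: the result is in [low, high), so `high` is never returned.
// Like the quoted code, this requires high > low.
constexpr uint64_t interval_exclusive(uint64_t rnd, uint64_t low,
                                      uint64_t high) {
  return low + (rnd % (high - low));
}

// 8.0.30+ style: the result is in [low, high], so `high` can be returned.
constexpr uint64_t interval_inclusive(uint64_t rnd, uint64_t low,
                                      uint64_t high) {
  return low + (rnd % (high - low + 1));
}

// Average of f(rnd, low, high) for rnd = 0 .. period-1. Using a period equal
// to the modulus models a PRNG that is uniform over the residues.
template <typename F>
double mean_over_cycle(F f, uint64_t low, uint64_t high, uint64_t period) {
  assert(period > 0);
  uint64_t sum = 0;
  for (uint64_t rnd = 0; rnd < period; ++rnd) sum += f(rnd, low, high);
  return static_cast<double>(sum) / static_cast<double>(period);
}
```

With low=0 and high=6 (the default innodb_spin_wait_delay), mean_over_cycle(interval_exclusive, 0, 6, 6) gives the 2.5 average and mean_over_cycle(interval_inclusive, 0, 6, 7) gives the 3.0 average from the list above.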
Both ut::random_from_interval and ut::random_from_interval_fast are a lot slower than ut_rnd_interval on four different CPUs I used for benchmarks. All of these functions are PRNGs. The 8.0.28 implementation is the simplest. In MySQL 8.0.30 through 8.0.33 there is ut::random_from_interval and then ut::random_from_interval_fast is added in 8.0.34.
The ut::random_from_interval_fast implementation uses rdtsc (random_64_fast calls my_timer_cycles) to read the cycle timer.
Results are provided for 3 different builds. All use CMAKE_BUILD_TYPE=Release. The O2 build is compiled with -O2, the O3 build uses -O3 and the native build uses -O3 -march=native -mtune=native. The compiler is gcc 11.4 on Ubuntu 22.04.5.
I also tried clang with the -O3 build and the results were similar to gcc.
I also want InnoDB to log the latency in wall-clock time for ut_delay(X) for X from 0 to innodb_spin_wait_delay. I collected this via the patches I added for 8.0.28 and 8.0.40. From the following, the delay is ~X usecs for ut_delay(X) on 3 of the 4 CPUs. The outlier CPU is beelink which has the oldest + slowest CPU of the 4. With innodb_spin_wait_delay=6 the delay ranges from ~0 to ~5 usecs in 8.0.28 and from ~0 to ~6 usecs in 8.0.40. These were measured via gettimeofday with 1000 calls in a loop for each value of X.
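A rough sketch of that kind of measurement follows. This is not the patch I used. spin_delay is a hypothetical stand-in for ut_delay, the multiplier of 50 is my assumption about the default number of loop iterations per unit of delay, and the loop body is a plain busy loop where I assume the real code also issues a CPU pause hint, so the numbers it reports will not match the ones from InnoDB.

```cpp
#include <cassert>
#include <cstdint>
#include <sys/time.h>
#include <vector>

// Assumed number of loop iterations per unit of delay (not taken from the post).
const uint64_t kPauseMultiplier = 50;

// Hypothetical stand-in for ut_delay: a portable busy loop. The volatile
// accumulator keeps the compiler from removing the loop.
uint64_t spin_delay(uint64_t delay) {
  volatile uint64_t sink = 0;
  const uint64_t iterations = delay * kPauseMultiplier;
  for (uint64_t i = 0; i < iterations; ++i) sink = sink + i;
  return sink;
}

// Current wall-clock time in microseconds via gettimeofday.
double now_usecs() {
  timeval tv;
  gettimeofday(&tv, nullptr);
  return static_cast<double>(tv.tv_sec) * 1e6 + static_cast<double>(tv.tv_usec);
}

// For each X in 0..max_delay, call spin_delay(X) `calls` times in a loop and
// return the average wall-clock microseconds per call.
std::vector<double> mean_delay_usecs(uint64_t max_delay, int calls) {
  assert(calls > 0);
  std::vector<double> out;
  for (uint64_t x = 0; x <= max_delay; ++x) {
    const double start = now_usecs();
    for (int c = 0; c < calls; ++c) spin_delay(x);
    out.push_back((now_usecs() - start) / calls);
  }
  return out;
}
```

Calling mean_delay_usecs(6, 1000) returns 7 averages, one for each X from 0 to 6, which mirrors the 1000 calls per value of X described above.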
- beelink - 1 thread, 1 table, 30M rows
- pn53 - 1 thread, 1 table, 50M rows
- dell32 - 24 threads, 8 tables, 10M rows/table
- ax162-s - 40 threads, 8 tables, 10M rows/table
- socket2 - I did not run benchmarks on this server because it was busy
- my8028_rel_o2nofp.z11a_X
- MySQL 8.0.28 with -O2 and -fno-omit-frame-pointer
- my8040_rel_o2nofp.z11a_X
- MySQL 8.0.40 with -O2 and -fno-omit-frame-pointer
- my8040_rel.z11a_X
- MySQL 8.0.40 with -O3
- my8040_rel_o2nofp_norndfast.z11a_X
- MySQL 8.0.40 with -O2 and -fno-omit-frame-pointer.
- InnoDB was changed to use ut::random_from_interval rather than ut::random_from_interval_fast
- my8040_rel_o2nofp_norndfast.z11a_nops_X
- MySQL 8.0.40 with -O2 and -fno-omit-frame-pointer.
- InnoDB was changed to use ut::random_from_interval rather than ut::random_from_interval_fast.
- nops means that I added performance_schema=0 to my.cnf.
- my8040_rel_o2nofp_norndfast.z11a_spin5_X
- MySQL 8.0.40 with -O2 and -fno-omit-frame-pointer. InnoDB was changed to use ut::random_from_interval rather than ut::random_from_interval_fast.
- spin5 means that I added innodb_spin_wait_delay=5 to my.cnf to work around the issue described above.
- my8040_rel_o2nofp_norndfast.z11a_nops_spin5_X
- MySQL 8.0.40 with -O2 and -fno-omit-frame-pointer. InnoDB was changed to use ut::random_from_interval rather than ut::random_from_interval_fast.
- nops means that I added performance_schema=0 to my.cnf.
- spin5 means that I added innodb_spin_wait_delay=5 to my.cnf to work around the issue described above.
- my8040_rel_o2nofp_oldrnd.z11a_X
- MySQL 8.0.40 with -O2 and -fno-omit-frame-pointer.
- InnoDB was changed to use ut_rnd_interval copied from 8.0.28
But on the dell32 server where the update-index regression in 8.0.40 is much larger than it is on the other servers, I get ~1.25X more QPS with several of the builds that attempt to fix or work around the problems described in this post.
Do you think
- spinning longer
- more context switches
are somehow connected? If so, how?
I am far from certain. It isn't easy trying to explain what went wrong as an outsider. First I need to understand what has changed, which is much work and made harder by so many formatting diffs that hide the real changes. And then there are the C -> C++ changes that don't always change behavior (although the +1 issue in ut::random_from_interval_fast is a behavior change).