Wednesday, October 23, 2024

InnoDB busy-wait loops changed in MySQL 8.0.30

This blog post attempts to explain some of the performance regressions that landed in InnoDB with MySQL 8.0.30. I'd rather not try to debug perf regressions long after they arrived but here we are. I assume that most of the problems landing in MySQL 8.0.28+ are also in 8.4 and 9.X, so these are problems the community will be dealing with for a long time.

One blog post that documents the regressions is here. The regression that I am trying to explain occurs on the dell32 server (see below) where the throughput for the update-index microbenchmark drops in half after MySQL 8.0.28.

tl;dr

  • I filed bug 116463 for this
  • Behavior related to innodb_spin_wait_delay changed in 8.0.30 and I assume that was not intentional. One workaround is to reduce innodb_spin_wait_delay from 6 to 5.
  • On the server that has the largest regression for the update-index microbenchmark, several workarounds improve QPS by ~1.25X. Unfortunately, that only fixes about 1/3 of the regression.
  • I am curious whether there are other changes after 8.0.28 to InnoDB hash functions that contribute to regressions. I have no doubt that the new hash functions are better with respect to hashing things uniformly, but that might come at a cost in latency. Most of these are inlined and perf problems caused by inlined functions are harder to spot long after the fact.

The Plot

For CPU-bound sysbench on a variety of servers there are large regressions that arrived after 8.0.28 and one of the worst occurs for the update-index microbenchmark on only one of my servers (32-core AMD Threadripper) where throughput drops almost in half after 8.0.28 (in 8.0.30 through 8.0.40). From vmstat I see that the amount of CPU and number of context switches per update increase by ~1.5X when throughput drops.

While reading release notes and looking at some of the code I noticed that the code used by InnoDB busy-wait loops (ut_rnd_interval in 8.0.28, ut::random_from_interval in 8.0.30 to 8.0.33, ut::random_from_interval_fast in 8.0.34+) has changed. Alas, understanding what has changed is harder because there is much noise (code updated to use more C++) hiding the signal (real changes).

There is a clue in the 8.0.30 release notes but, alas, the bugs are not in the public MySQL bug database.

  • InnoDB: Hash and random generator functions in the InnoDB sources were improved. (Bug #16739204, Bug #23584861)

The problems that I see are:

  • default busy-wait loop time has a 20% increase in 8.0.30+
  • code that determines the busy-wait loop time is slower in 8.0.30+
  • ut::random_from_interval_fast is slower than ut::random_from_interval
  • InnoDB makes this expensive to debug

Hardware

I used 5 different servers for this blog post. The first 4 use AMD and the last uses Intel:

  • beelink
    • Beelink SER4 with an AMD Ryzen 7 4700 CPU with SMT disabled, 8 cores, 16G of RAM, Ubuntu 22.04 and ext4 on 1 NVMe device. This is a laptop-class CPU in a mini PC.
  • pn53
    • ASUS ExpertCenter PN53 with AMD Ryzen 7 7735HS, with SMT disabled, 8 cores, 32G RAM, Ubuntu 22.04 and ext4 on 1 NVMe device. This is a laptop-class CPU in a mini PC.
  • dell32
    • Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32-Cores, 2 m.2 SSD (each 2TB, SW RAID 0, ext4). This is a server-class CPU in a server.
  • ax162-s
    • AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G RAM, Ubuntu 22.04 and ext4 on 2 NVMe devices with SW RAID 1. This is in the Hetzner cloud. This is a server-class CPU in a server.
  • socket2
    • SuperMicro SuperWorkstation 7049A-T with 2 sockets, 12 cores/socket, 64G RAM, Ubuntu 22.04 and ext4 on 1 m.2 SSD. The CPU is Intel Xeon Silver 4214R CPU @ 2.40GHz.

Problem: default busy-wait loop time has a 20% increase in 8.0.30+

InnoDB has custom mutexes and rw-locks that add value (no snark) beyond what you get from pthreads. The value is special semantics for the rw-lock and support for debugging and performance monitoring. While there are pthreads variants that include a busy-wait loop, the duration of the busy-wait is a bit of a black box with the pthreads versions. The busy-wait duration for the InnoDB versions is usually between 0 and 7 microseconds which mostly works great for InnoDB.

The busy-wait code is:
  • MySQL 8.0.28 and earlier
    • ut_delay(ut_rnd_interval(0, srv_spin_wait_delay))
  • MySQL 8.0.30 to 8.0.33
    • ut_delay(ut::random_from_interval(0, srv_spin_wait_delay))
  • MySQL 8.0.34 to 8.0.40
    • ut_delay(ut::random_from_interval_fast(0, srv_spin_wait_delay))

The variable srv_spin_wait_delay has the value of the innodb_spin_wait_delay configuration variable which is 6 by default. The ut_rnd_interval, ut::random_from_interval and ut::random_from_interval_fast functions return a random integer between 0 and ~innodb_spin_wait_delay, and then ut_delay($X) does a busy-wait loop for approximately $X microseconds.

The problem is that with innodb_spin_wait_delay=6, the return values are between 0 and 5 in MySQL 8.0.28 and earlier releases but between 0 and 6 in 8.0.30+. The docs for 8.0, 8.4 and 9.0 still describe the old (8.0.28) behavior for innodb_spin_wait_delay, which I hope is restored to avoid the need to update docs or my.cnf files.

On average, the return value is 20% larger in 8.0.30+ thus the busy-wait time will be ~20% larger (~3.0 usecs vs ~2.5 usecs). 
  • for 8.0.28 and earlier the average value is 2.5 -> (0 + 1 + 2 + 3 + 4 + 5) / 6
  • for 8.0.30+ the average value is 3.0 -> (0 + 1 + 2 + 3 + 4 + 5 + 6) / 7
  • 3 / 2.5 = 1.2

The workaround for this is to set innodb_spin_wait_delay=5 in my.cnf when using 8.0.30+ but I hope this is fixed upstream. The root cause is the +1 added to this line of code.
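
The arithmetic above can be checked with a few lines of Python. This is only a sketch: mean_delay is a hypothetical helper, not InnoDB code, and it assumes the return values are uniformly distributed over the stated range.

```python
from fractions import Fraction


def mean_delay(max_value: int) -> Fraction:
    """Mean of a uniform random integer drawn from 0..max_value inclusive.

    Hypothetical helper for illustration, not part of InnoDB.
    """
    values = range(max_value + 1)
    return Fraction(sum(values), len(values))


spin_wait_delay = 6  # default value of innodb_spin_wait_delay

old_mean = mean_delay(spin_wait_delay - 1)  # 8.0.28 and earlier: values 0..5
new_mean = mean_delay(spin_wait_delay)      # 8.0.30+: values 0..6
workaround_mean = mean_delay(5)             # 8.0.30+ with innodb_spin_wait_delay=5

print(float(old_mean), float(new_mean), float(new_mean / old_mean))
print(workaround_mean == old_mean)
```

With innodb_spin_wait_delay=5 the 8.0.30+ range becomes 0 to 5, so the mean matches what 8.0.28 gets with the default of 6.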

Problem: code that determines the busy-wait loop time is slower in 8.0.30+

I doubt this is a big problem, but it is interesting. The reason I doubt this hurts performance is that after spending ~10 extra nanoseconds in ut::random_from_interval_fast the caller will then spend up to a few microseconds in a busy-wait loop.

Both ut::random_from_interval and ut::random_from_interval_fast are a lot slower than ut_rnd_interval on four different CPUs I used for benchmarks. All of these functions are PRNG. The 8.0.28 implementation is the most simple. In MySQL 8.0.30 through 8.0.33 there is ut::random_from_interval and then ut::random_from_interval_fast is added in 8.0.34.

The ut::random_from_interval implementation uses HW crc32 instructions (via hash_uint64) when they are available and I confirmed (via printfs added to mysqld) that they are available on the CPUs that I used.

The ut::random_from_interval_fast implementation uses rdtsc (random_64_fast calls my_timer_cycles) to read the cycle timer.

This table shows the time in nanoseconds per call, and ut::random_from_interval_fast is ~6X slower than ut_rnd_interval on 3 of the 4 AMD CPUs. This was measured by gettimeofday for a loop with 1000 calls. The patch for this is here for 8.0.28 and for 8.0.40. More time spent in ut::random_from_interval_fast means it takes longer to get through contended mutexes and rw-locks.

Results are provided for 3 different builds. All use CMAKE_BUILD_TYPE=Release. The O2 build is compiled with -O2, the O3 build uses -O3 and the native build uses -O3 -march=native -mtune=native. The compiler is gcc 11.4 on Ubuntu 22.04.5.

I also tried clang with the -O3 build and the results were similar to gcc.

        ------- 8.0.28 -------          ------- 8.0.40 -------
        ut_rnd_interval                 ut::random_from_interval_fast
        -------- nsecs -------          -------- nsecs -------
        O2      O3      native          O2      O3      native
beelink 22      22      23              43      44      43
pn53     3       3       3              23      22      23
ax162-s  3       4       4              25      26      25
dell32   4       3       4              26      27      27
socket2 13      12      11              17      18      17

Problem: ut::random_from_interval_fast is slower than ut::random_from_interval

Perhaps this is ironic, but ut::random_from_interval_fast is slower than ut::random_from_interval on 3 of the 4 AMD CPUs that I tested. However, ut::random_from_interval_fast is faster on the 1 Intel CPU I tested.

        ------- 8.0.40 -------          ------- 8.0.40 -------
        ut::random_from_interval        ut::random_from_interval_fast
        -------- nsecs -------          -------- nsecs -------
        O2      O3      native          O2      O3      native
beelink 36      37      42              43      44      43
pn53    11       8       8              23      22      23
ax162-s 13      10      12              25      26      25
dell32  12      17      16              26      27      27
socket2 22      26      29              17      18      17

And this query returns 40 for beelink, 32 for pn53, 28 for ax162-s, 36 for dell32 and 26 for socket2:
    select timer_overhead from performance_schema.performance_timers where timer_name='CYCLE'

Problem: InnoDB makes this expensive to debug

When I add log_error_verbosity=3 I get more useful information including:
    ... [Note] [MY-013546] [InnoDB] Atomic write enabled
    ... [Note] [MY-012932] [InnoDB] PUNCH HOLE support available
    ... [Note] [MY-012944] [InnoDB] Uses event mutexes
    ... [Note] [MY-012945] [InnoDB] GCC builtin __atomic_thread_fence() is used for memory barrier
    ... [Note] [MY-012948] [InnoDB] Compressed tables use zlib 1.3.1
    ... [Note] [MY-012951] [InnoDB] Using hardware accelerated crc32 and polynomial multiplication.

Alas, I had to add more instrumentation to learn the latency of ut_rnd_interval, ut::random_from_interval and ut::random_from_interval_fast. I would prefer that InnoDB run those tests and write the results to the error log.

I also want InnoDB to log the latency in wall-clock time for ut_delay(X) for X from 0 to innodb_spin_wait_delay. I collected this via the patches I added for 8.0.28 and 8.0.40. From the following, the delay is ~X usecs for ut_delay(X) on 3 of the 4 AMD CPUs. The outlier CPU is beelink which has the oldest and slowest CPU of the 4. With innodb_spin_wait_delay=6 the delay ranges from ~0 to ~5 usecs in 8.0.28 and from ~0 to ~6 usecs in 8.0.40. These were measured via gettimeofday with 1000 calls in a loop for each value of X.

I also measured this with clang and the results were similar to gcc.
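
The measurement method described above (time a loop of 1000 calls, then divide by the number of calls) can be sketched in Python. This is not the patch applied to mysqld, which is C++ and uses gettimeofday. The names nsecs_per_call and busy_wait_usecs are hypothetical, and busy_wait_usecs is only a stand-in for ut_delay that spins on a clock. Python call overhead means the absolute numbers from this sketch are not comparable to the results in the tables.

```python
import time


def nsecs_per_call(fn, calls=1000):
    """Average wall-clock nanoseconds per call of fn over a loop of `calls` calls."""
    start = time.perf_counter_ns()
    for _ in range(calls):
        fn()
    elapsed = time.perf_counter_ns() - start
    return elapsed / calls


def busy_wait_usecs(usecs):
    """Hypothetical stand-in for a busy-wait: spin until `usecs` microseconds pass."""
    deadline = time.perf_counter_ns() + usecs * 1000
    while time.perf_counter_ns() < deadline:
        pass


if __name__ == "__main__":
    # One measurement per value of X, as in the tables of ut_delay(X) results.
    for x in range(7):
        usecs = nsecs_per_call(lambda: busy_wait_usecs(x)) / 1000.0
        print(f"X={x} usecs_per_call={usecs:.2f}")
```

Timing the whole loop rather than each call keeps the cost of reading the clock out of the per-call result, which matters when a call takes only a few nanoseconds.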

Wall-clock time in microseconds for a call to ut_delay(X) in MySQL 8.0.28
        0       1       2       3       4       5       <- X
beelink 0.00    1.66    3.31    4.96    6.61    8.26
pn53    0.00    1.06    2.13    3.20    4.26    5.32
ax162-s 0.00    1.20    2.41    3.61    4.81    6.02
dell32  0.00    1.23    2.47    3.69    4.91    6.15
socket2 0.00    0.88    1.76    2.63    3.51    4.37

Wall-clock time in microseconds for a call to ut_delay(X) in MySQL 8.0.40
        0       1       2       3       4       5       6       <- X
beelink 0.00    1.66    3.32    4.96    6.61    8.26    9.91
pn53    0.00    1.06    2.13    3.19    4.27    5.35    6.42
ax162-s 0.00    1.20    2.41    3.60    4.81    6.02    7.21
dell32  0.00    1.23    2.46    3.69    4.91    6.14    7.40
socket2 0.00    0.88    1.76    2.63    3.50    4.37    5.24

Results

I used sysbench and my usage is explained here. A full run has 42 microbenchmarks but I ran a subset of that to save time. The database is cached by InnoDB.

The benchmark is run with ...
  • beelink - 1 thread, 1 table, 30M rows
  • pn53 - 1 thread, 1 table, 50M rows
  • dell32 - 24 threads, 8 tables, 10M rows/table
  • ax162-s - 40 threads, 8 tables, 10M rows/table
  • socket2 - I did not run benchmarks on this server because it was busy

I tried 8 combinations of builds and my.cnf where the A.B pattern below uses A for the build and B for the my.cnf. All builds use CMAKE_BUILD_TYPE=Release.
  • my8028_rel_o2nofp.z11a_X
    • MySQL 8.0.28 with -O2 and -fno-omit-frame-pointer
  • my8040_rel_o2nofp.z11a_X
    • MySQL 8.0.40 with -O2 and -fno-omit-frame-pointer
  • my8040_rel.z11a_X
    • MySQL 8.0.40 with -O3
  • my8040_rel_o2nofp_norndfast.z11a_X
    • MySQL 8.0.40 with -O2 and -fno-omit-frame-pointer. 
    • InnoDB was changed to use ut::random_from_interval rather than ut::random_from_interval_fast
  • my8040_rel_o2nofp_norndfast.z11a_nops_X
    • MySQL 8.0.40 with -O2 and -fno-omit-frame-pointer. 
    • InnoDB was changed to use ut::random_from_interval rather than ut::random_from_interval_fast.
    • nops means that I added performance_schema=0 to my.cnf.
  • my8040_rel_o2nofp_norndfast.z11a_spin5_X
    • MySQL 8.0.40 with -O2 and -fno-omit-frame-pointer. InnoDB was changed to use ut::random_from_interval rather than ut::random_from_interval_fast.
    • spin5 means that I added innodb_spin_wait_delay=5 to my.cnf to work around the issue described above.
  • my8040_rel_o2nofp_norndfast.z11a_nops_spin5_X
    • MySQL 8.0.40 with -O2 and -fno-omit-frame-pointer. InnoDB was changed to use ut::random_from_interval rather than ut::random_from_interval_fast.
    • nops means that I added performance_schema=0 to my.cnf.
    • spin5 means that I added innodb_spin_wait_delay=5 to my.cnf to work around the issue described above.
  • my8040_rel_o2nofp_oldrnd.z11a_X
    • MySQL 8.0.40 with -O2 and -fno-omit-frame-pointer. 
    • InnoDB was changed to use ut_rnd_interval copied from 8.0.28

The results I share are formatted as relative QPS (rQPS), which is defined below. When the rQPS is less than 1.0, that version gets less QPS than MySQL 8.0.28.
    (QPS for my version / QPS for my8028_rel_o2nofp.z11a_X)
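
As a minimal sketch of that formula, the QPS numbers below are made up for illustration and are not measurements from these servers. The function name relative_qps is also hypothetical.

```python
def relative_qps(qps_version: float, qps_base: float) -> float:
    """rQPS = (QPS for my version) / (QPS for the 8.0.28 base build)."""
    return qps_version / qps_base


# Hypothetical QPS values, chosen only to illustrate the calculation.
base_qps = 10000.0     # stands in for my8028_rel_o2nofp.z11a_X
version_qps = 5600.0   # stands in for one of the 8.0.40 builds

rqps = relative_qps(version_qps, base_qps)
print(f"rQPS = {rqps:.2f}")
```

An rQPS of 0.56 means the version gets 56% of the throughput that 8.0.28 gets on the same microbenchmark.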

On the beelink, pn53 and ax162-s servers the rQPS is similar for all of the MySQL 8.0.40 builds so I won't share those results here to save space. I am surprised by this, especially for the ax162-s server.

But on the dell32 server where the update-index regression in 8.0.40 is much larger than it is on the other servers, I get ~1.25X more QPS with several of the builds that attempt to fix or work around the problems described in this post.

Relative to: my8028_rel_o2nofp.z11a_c32r128
col-1 : my8040_rel_o2nofp.z11a_c32r128
col-2 : my8040_rel.z11a_c32r128
col-3 : my8040_rel_o2nofp_norndfast.z11a_c32r128
col-4 : my8040_rel_o2nofp_norndfast.z11a_nops_c32r128
col-5 : my8040_rel_o2nofp_norndfast.z11a_spin5_c32r128
col-6 : my8040_rel_o2nofp_norndfast.z11a_nops_spin5_c32r128
col-7 : my8040_rel_o2nofp_oldrnd.z11a_c32r128

col-1   col-2   col-3   col-4   col-5   col-6   col-7
0.96    0.95    0.94    0.99    0.94    0.99    0.94    point-query.pre_range=100
1.06    1.05    1.04    1.05    1.05    1.06    1.05    points-covered-pk_range=100
1.06    1.06    1.04    1.06    1.04    1.05    1.05    points-covered-si_range=100
1.06    1.05    1.04    1.06    1.05    1.06    1.04    points-notcovered-pk_range=100
1.04    1.03    1.02    1.03    1.02    1.03    1.01    points-notcovered-si_range=100
0.98    0.97    0.96    0.97    0.97    0.97    0.96    range-notcovered-si.pre_range=100
0.95    0.95    0.94    0.97    0.94    0.97    0.94    read-only_range=100
0.87    0.72    0.80    0.81    0.81    0.81    0.82    scan_range=100
0.94    0.94    0.94    0.97    0.93    0.97    0.94    delete_range=100
0.94    0.94    0.94    0.98    0.94    0.98    0.94    insert_range=100
0.95    0.95    0.94    0.98    0.94    0.97    0.94    read-write_range=100
0.56    0.56    0.67    0.70    0.71    0.70    0.69    update-index_range=100
1.03    1.02    1.01    1.02    1.01    1.03    1.01    update-inlist_range=100
0.95    0.95    0.94    0.96    0.95    0.97    0.95    update-nonindex_range=100
0.94    0.94    0.94    0.97    0.94    0.97    0.95    update-one_range=100
0.95    0.95    0.95    0.97    0.95    0.97    0.95    update-zipf_range=100
0.93    0.93    0.92    0.95    0.92    0.95    0.93    write-only_range=10000
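
The ~1.25X improvement and the "about 1/3 of the regression" estimate both follow from the update-index row above. A short sketch of that arithmetic, using the rQPS values copied from the table (the helper names speedup and recovered are hypothetical):

```python
# rQPS for update-index on dell32, copied from the table above.
update_index_rqps = {
    "col-1": 0.56,
    "col-2": 0.56,
    "col-3": 0.67,
    "col-4": 0.70,
    "col-5": 0.71,
    "col-6": 0.70,
    "col-7": 0.69,
}

baseline = update_index_rqps["col-1"]  # unmodified 8.0.40 with -O2


def speedup(col: str) -> float:
    """QPS of a build relative to the unmodified 8.0.40 build (col-1)."""
    return update_index_rqps[col] / baseline


def recovered(col: str) -> float:
    """Fraction of the gap between col-1 and 8.0.28 (rQPS = 1.0) that a build closes."""
    return (update_index_rqps[col] - baseline) / (1.0 - baseline)


for col in update_index_rqps:
    print(f"{col}: {speedup(col):.2f}X vs col-1, "
          f"recovers {recovered(col):.0%} of the regression")
```

For col-4 this gives 0.70 / 0.56 = 1.25X and (0.70 - 0.56) / (1.00 - 0.56), which is about 32% of the regression.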

2 comments:

  1. Do you think
    - spinning longer
    - more context switches
    are somehow connected? If so, how?

    1. I am far from certain. It isn't easy trying to explain what went wrong as an outsider. First I need to understand what has changed, which is much work and made harder by so many formatting diffs that hide the real changes. And then there are the C -> C++ changes that don't always change behavior (although the +1 issue in ut::random_from_interval_fast is a behavior change).
