Tuesday, October 29, 2024

Trying out Advanced MySQL

I recently learned of the Advanced MySQL project on github via a tweet. There is a book and a repo for an enhanced version of 8.0.40. I wish I had the time to read some of that book and learn more about the enhancements but for now I just ran my sysbench tests for it on a few large servers. Hopefully I will also run the Insert Benchmark for it on the same servers.

tl;dr

  • Advanced MySQL is good for performance
  • On the dell32 server that has a large regression for the update-index microbenchmark in MySQL 8.0.40, the Advanced MySQL project fixes most of the regression
  • On the ax162-s server for which the update-index regression is smaller, the Advanced MySQL project still does a lot better than upstream
Builds

I used upstream MySQL 8.0.28 and 8.0.40 as well as the advanced MySQL version of 8.0.40. All were compiled from source using CMAKE_BUILD_TYPE =Release, -O2 and -fno-omit-frame-pointer.

The builds are named:
  • my8028_rel_o2nofp - upstream MySQL 8.0.28
  • my8040_rel_o2nofp - upstream MySQL 8.0.40
  • my8040adv_rel_o2nofp - Advanced MySQL 8.0.40
The my.cnf files are here for 8.0.28 and for 8.0.40.

Hardware

The servers are
  • dell32
    • Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32-Cores, 2 m.2 SSD (each 2TB, RAID SW 0, ext4). 
  • ax162-s
    • AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G RAM, Ubuntu 22.04 and ext4 on 2 NVMe devices with SW RAID 1. This is in the Hetzner cloud.
Benchmark

I used sysbench and my usage is explained here. A full run has 42 microbenchmarks and most test only 1 type of SQL statement.

The benchmark is run with ...
  • dell32 - 24 threads, 8 tables, 10M rows/table
  • ax162-s - 40 threads, 8 tables, 10M rows/table
Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

Results: overview

All of the results use relative QPS (rQPS) where:
  • rQPS is: (QPS for my version / QPS for base version)
  • base version is the QPS from MySQL 8.0.28
  • my version is one of the other versions (8.0.40 from upstream, 8.0.40 from Advanced MySQL)
Results: dell32

Summary
  • 8.0.40 from upstream and Advanced MySQL have similar performance on most microbenchmarks
  • Advanced MySQL does a lot better on the update-index microbenchmark while upstream has a large regression that started in 8.0.30 (and perhaps in 8.0.29). 
Relative to: my8028_rel_o2nofp
col-1 : my8040_rel_o2nofp
col-2 : my8040adv_rel_o2nofp

col-1   col-2
1.06    1.07    hot-points_range=100
0.95    0.95    point-query.pre_range=100
0.95    0.95    point-query_range=100
1.05    1.05    points-covered-pk.pre_range=100
1.05    1.05    points-covered-pk_range=100
1.04    1.06    points-covered-si.pre_range=100
1.05    1.07    points-covered-si_range=100
1.05    1.05    points-notcovered-pk.pre_range=100
1.05    1.06    points-notcovered-pk_range=100
1.02    1.05    points-notcovered-si.pre_range=100
1.01    1.04    points-notcovered-si_range=100
1.89    1.89    random-points.pre_range=1000
1.05    1.05    random-points.pre_range=100
0.96    0.96    random-points.pre_range=10
1.91    1.92    random-points_range=1000
1.05    1.05    random-points_range=100
0.96    0.96    random-points_range=10
0.99    0.97    range-covered-pk.pre_range=100
0.99    0.97    range-covered-pk_range=100
0.99    0.96    range-covered-si.pre_range=100
0.99    0.97    range-covered-si_range=100
0.96    0.97    range-notcovered-pk.pre_range=100
0.96    0.96    range-notcovered-pk_range=100
0.97    0.99    range-notcovered-si.pre_range=100
0.97    0.98    range-notcovered-si_range=100
1.01    1.01    read-only.pre_range=10000
0.95    0.95    read-only.pre_range=100
0.95    0.95    read-only.pre_range=10
1.01    1.01    read-only_range=10000
0.95    0.95    read-only_range=100
0.95    0.95    read-only_range=10
0.91    0.85    scan_range=100
0.93    1.00    delete_range=100
0.94    0.96    insert_range=100
0.95    0.95    read-write_range=100
0.95    0.95    read-write_range=10
0.56    0.84    update-index_range=100
1.03    1.03    update-inlist_range=100
0.95    0.97    update-nonindex_range=100
0.94    0.94    update-one_range=100
0.96    0.97    update-zipf_range=100
0.94    0.98    write-only_range=10000

From vmstat metrics both the CPU overhead (cpu/o) and context switch rate (cs/o) are much lower in Advanced MySQL 8.0.40. Metrics from vmstat and iostat for all of the microbenchmarks are here.

sb.met.update-index.range100.pk1.dop40
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000550        12.157  0       0       8.354   83129   my8028_rel_o2nofp
0.000600        12.514  0       0       11.416  74917   my8040_rel_o2nofp
0.000469        8.063   0       0       7.437   98180   my8040adv_rel_o2nofp
--- relative to first result
1.09            1.03    1       1       1.37    0.90    my8040_rel_o2nofp
0.85            0.66    1       1       0.89    1.18    my8040adv_rel_o2nofp

sb.met.update-nonindex.range100.pk1.dop40
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000363        12.236  0       0       2.611   114970  my8028_rel_o2nofp
0.000383        12.235  0       0       2.794   109896  my8040_rel_o2nofp
0.000310        8.128   0       0       2.663   121196  my8040adv_rel_o2nofp
--- relative to first result
1.06            1.00    1       1       1.07    0.96    my8040_rel_o2nofp
0.85            0.66    1       1       1.02    1.05    my8040adv_rel_o2nofp

Results: ax162-s

Summary
  • 8.0.40 from upstream and Advanced MySQL have similar performance on most microbenchmarks
  • Advanced MySQL does a lot better on the update-index microbenchmark while upstream has a large regression that started in 8.0.30 (and perhaps in 8.0.29). 

Relative to: my8028_rel_o2nofp
col-1 : my8040_rel_o2nofp
col-2 : my8040adv_rel_o2nofp

col-1   col-2
1.06    1.07    hot-points_range=100
0.95    0.95    point-query.pre_range=100
0.95    0.95    point-query_range=100
1.05    1.05    points-covered-pk.pre_range=100
1.05    1.05    points-covered-pk_range=100
1.04    1.06    points-covered-si.pre_range=100
1.05    1.07    points-covered-si_range=100
1.05    1.05    points-notcovered-pk.pre_range=100
1.05    1.06    points-notcovered-pk_range=100
1.02    1.05    points-notcovered-si.pre_range=100
1.01    1.04    points-notcovered-si_range=100
1.89    1.89    random-points.pre_range=1000
1.05    1.05    random-points.pre_range=100
0.96    0.96    random-points.pre_range=10
1.91    1.92    random-points_range=1000
1.05    1.05    random-points_range=100
0.96    0.96    random-points_range=10
0.99    0.97    range-covered-pk.pre_range=100
0.99    0.97    range-covered-pk_range=100
0.99    0.96    range-covered-si.pre_range=100
0.99    0.97    range-covered-si_range=100
0.96    0.97    range-notcovered-pk.pre_range=100
0.96    0.96    range-notcovered-pk_range=100
0.97    0.99    range-notcovered-si.pre_range=100
0.97    0.98    range-notcovered-si_range=100
1.01    1.01    read-only.pre_range=10000
0.95    0.95    read-only.pre_range=100
0.95    0.95    read-only.pre_range=10
1.01    1.01    read-only_range=10000
0.95    0.95    read-only_range=100
0.95    0.95    read-only_range=10
0.91    0.85    scan_range=100
0.93    1.00    delete_range=100
0.94    0.96    insert_range=100
0.95    0.95    read-write_range=100
0.95    0.95    read-write_range=10
0.56    0.84    update-index_range=100
1.03    1.03    update-inlist_range=100
0.95    0.97    update-nonindex_range=100
0.94    0.94    update-one_range=100
0.96    0.97    update-zipf_range=100
0.94    0.98    write-only_range=10000

From vmstat metrics both the CPU overhead (cpu/o) and context switch rate (cs/o) are much lower in Advanced MySQL 8.0.40. Metrics from vmstat and iostat for all of the microbenchmarks are here.

sb.met.update-index.range100.pk1.dop24
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.001185        10.777  0       0       14.275  51319   my8028_rel_o2nofp
0.001781        15.242  0       0       33.139  28966   my8040_rel_o2nofp
0.001328        9.994   0       0       21.068  43145   my8040adv_rel_o2nofp
--- relative to first result
1.50            1.41    1       1       2.32    0.56    my8040_rel_o2nofp
1.12            0.93    1       1       1.48    0.84    my8040adv_rel_o2nofp

sb.met.update-nonindex.range100.pk1.dop24
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000610        9.890   0       0       2.788   84660   my8028_rel_o2nofp
0.000645        10.006  0       0       3.215   80750   my8040_rel_o2nofp
0.000590        7.445   0       0       3.123   81943   my8040adv_rel_o2nofp
--- relative to first result
1.06            1.01    1       1       1.15    0.95    my8040_rel_o2nofp
0.97            0.75    1       1       1.12    0.97    my8040adv_rel_o2nofp





















Thursday, October 24, 2024

RocksDB benchmarks: small server, leveled compaction

I shared benchmark results for RocksDB a few weeks ago and there was a suggestion for me to repeat tests using different (older) values for format_version. Then while replacing a failed SSD, I also updated the OS and changed a few kernel-related config options. Thus, I ended up repeating all tests.

This post has results from a small server with leveled compaction. Results from a large server and from universal compaction are in progress.

tl;dr - on a small server with a low concurrency workload

  • older values of format_version (2 thru 5) don't impact QPS
  • auto hyperclock cache makes read-heavy tests up to 15% faster
  • for a cached database
    • QPS drops by 5% to 15% from RocksDB 6.0.2 to 9.7.2
    • QPS has been stable since 8.0
  • for an IO-bound database with buffered IO
    •  bug 12038 hurts QPS for overwrite (will be fixed soon in 9.7)
    • QPS for fillseq has been stable
    • QPS for read-heavy tests is 15% to 20% better in RocksDB 9.7.2 vs 6.0.2
  • for an IO-bound database with O_DIRECT
    • QPS for fillseq is ~11% less in 9.7.2 vs 6.0.2 but has been stable since 7.0. My vague memory is that the issue is new CPU overhead from better error checking.
    • QPS for overwrite is stable
    • QPS for read-heavy tests is 16% to 38% better in RocksDB 9.7.1 vs 6.0.2

Hardware

The small server is named SER7 and is a Beelink SER7 7840HS (see here) with 8 cores, AMD SMT disabled, a Ryzen 7 7840HS CPU, Ubuntu 22.04. Storage is ext4 with data=writeback and 1 NVMe device. 

The storage device has 128 for max_hw_sectors_kb and max_sectors_kb. This is relevant for bug 12038 which will be fixed real soon in a 9.7 patch release.

Builds

I compiled db_bench from source on all servers. I used versions:
  • 6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
  • 7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
  • 8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
  • 9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1 and 9.7.2 at git sha b5cde68b8a
Benchmark

All tests used the default value for compaction_readahead_size. For all versions tested I used the default values for the block cache (LRU) and format_version. For 9.6.1 I repeated tests using the hyperclock cache (default. is LRU) and format_version =2, =3, =4 and =5 (default is =6). 

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 1 thread for the small server and 8 threads for the medium server. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is:

    # Small server, SER7: use 1 thread, 20M KV pairs for cached, 400M for IO-bound
    bash x3.sh 1 no 1800 c8r32 20000000 400000000 byrx iobuf iodir

The tests on the charts are named as:
  • fillseq -- load in key order with the WAL disabled
  • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
  • fwdrangeww -- like revrangeww except do short forward range scans
  • readww - like revrangeww except do point queries
  • overwrite - do overwrites (Put) as fast as possible
Workloads

There are three workloads, all of which use one client (thread):

  • byrx - the database is cached by RocksDB
  • iobuf - the database is larger than memory and RocksDB uses buffered IO
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT

A spreadsheet with all results is here and performance summaries with more details are linked below:
Relative QPS

The numbers in the spreadsheet and on the y-axis in the charts that follow are the relative QPS which is (QPS for $me) / (QPS for $base). When the value is greater than 1.0 then $me is faster than $base. When it is less than 1.0 then $base is faster (perf regression!).

The base version is RocksDB 6.0.2 for the all versions tests and 9.6.1 with my standard configuration for the 9.6 variations tests.

Results: byrx with 9.6 variations

The byrx tests use a cached database. The performance summary is here. This has results for RocksDB 9.6.1 using my standard configuration and the variations are:
  • fv2 - uses format_version=2 instead of the default (=6)
  • fv3 - uses format_version=3
  • fv4 - uses format_version=4
  • fv5 - uses formatio_version=5
  • ahcc - uses auto_hyper_clock_cache instead of the default (LRU)
This chart shows the relative QPS for RocksDB 9.6.1 with a given configuration relative to 9.6.1 with my standard configuration. The y-axis doesn't start at 0 to improve readability.

Summary:
  • Using different values of format_version don't have a large impact here
  • Using auto hyperclock instead of LRU improves read-heavy QPS by up to 15%
Results: byrx with all versions

The byrx tests use a cached database. The performance summary is here

This chart shows the relative QPS for a given version relative to RocksDB 6.0.2. The y-axis doesn't start at 0 to improve readability.

Summary:
  • QPS drops by 5% to 15% from RocksDB 6.0.2 to 9.7.2
  • Performance has been stable since 8.0
  • For overwrite the excellent result in RocksDB 6.0.2 comes at the cost of bad write stalls (see pmax here)

Results: iobuf with all versions

The iobuf tests use a database larger than memory with buffered IO. The performance summary is here.

This chart shows the relative QPS for a given version relative to RocksDB 6.0.2. The y-axis doesn't start at 0 to improve readability.

Summary:
  • bug 12038 explains the regression for overwrite (fixed soon in 9.7)
  • QPS for fillseq has been stable
  • QPS for read-heavy tests is 15% to 20% better in RocksDB 9.7.2 vs 6.0.2
Results: iodir with all versions

The iodir tests use a database larger than memory with O_DIRECT. The performance summary is here.

This chart shows the relative QPS for a given version relative to RocksDB 6.0.2. The y-axis doesn't start at 0 to improve readability.

Summary:
  • QPS for fillseq is ~11% less in 9.7.2 vs 6.0.2 but has been stable since 7.0. My vague memory is that the issue is new CPU overhead from better error checking.
  • QPS for overwrite is stable
  • QPS for read-heavy tests is 16% to 38% better in RocksDB 9.7.1 vs 6.0.2

Wednesday, October 23, 2024

InnoDB busy-wait loops changed in MySQL 8.0.30

This blog post attempts to explain some of the performance regressions that landed in InnoDB with MySQL 8.0.30. I'd rather not try to debug perf regressions long after they arrived but here we are. I assume that most of the problems landing in MySQL 8.0.28+ are also in 8.4 and 9.X, so these are problems the community will be dealing with for a long time.

One blog post that documents the regressions is here. The regression that I am trying to explain occurs on the dell32 server (see below) where the throughput for the update-index microbenchmark drops in half after MySQL 8.0.28.

tl;dr

  • I filed bug 116463 for this
  • Behavior related to innodb_spin_wait_delay changed in 8.0.30 and I assume that was not intentional. One workaround is to reduce innodb_spin_wait_delay from 6 to 5
  • On the server that has the largest regression for the update-index microbenchmark, several workarounds improve QPS by ~1.25X, unfortunately that only fixes about 1/3 of the regression
  • I am curious whether there are other changes after 8.0.28 to InnoDB hash functions that contribute to regressions. I have no doubt that the new hash functions are better with respect to hashing things uniformly, but that might come at a cost in latency. Most of these are inlined and perf problems caused by inlined functions are harder to spot longer after the fact.

The Plot

For CPU-bound sysbench on a variety of servers there are large regressions that arrived after 8.0.28 and one of the worst occurs for the update-index microbenchmark on only one of my servers (32-core AMD Threadripper) where throughput drops almost in half after 8.0.28 (in 8.0.30 through 8.0.40). From vmstat I see that the amount of CPU and number of context switches per update increase by ~1.5X when throughput drops.

While reading release notes and looking at some of the code I notice that code used by InnoDB busy-wait loops (ut_rnd_interval in 8.0.28, ut::random_from_interval_fast in 8.0.30+) has changed. Alas, understanding what has changed is harder because there is much noise (code updated to use more C++) hiding the signal (real changes).

There is a clue in the 8.0.30 release notes, alas the bugs are not in the public MySQL bug database.

  • InnoDB: Hash and random generator functions in the InnoDB sources were improved. (Bug #16739204, Bug #23584861)

The problems that I see are:

  • default busy-wait loop time has a 20% increase in 8.0.30+
  • code that determines the busy-wait loop time is slower in 8.0.30+
  • ut::random_from_interval_fast is slower than ut::random_from_interval
  • InnoDB makes this expensive to debug
Hardware

I used 5 different servers for this blog post, the first 4 use AMD and the last uses Intel:

  • beelink
    • Beelink SER4 with an AMD Ryzen 7 4700 CPU with SMT disabled, 8 cores, 16G of RAM, Ubuntu 22.04 and ext4 on 1 NVMe device. This is a laptop-class CPU in a mini PC.
  • pn53
    • ASUS ExpertCenter PN53 with AMD Ryzen 7 7735HS, with SMT disabled, 8 cores, 32G RAM, Ubuntu 22.04 and ext4 on 1 NVMe device. This is a laptop-class CPU in a mini PC.
  • dell32
    • Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32-Cores, 2 m.2 SSD (each 2TB, RAID SW 0, ext4). This is a server-class CPU in a server.
  • ax162-s
    • AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G RAM, Ubuntu 22.04 and ext4 on 2 NVMe devices with SW RAID 1. This is in the Hetzner cloud. This is a server-class CPU in a server.
  • socket2
    • SuperMicro SuperWorkstation 7049A-T with 2 sockets, 12 cores/socket, 64G RAM, Ubuntu 22.04 and ext4 on 1 m.2 SSD . The CPU is Intel Xeon Silver 4214R CPU @ 2.40GHz
Problem: default busy-wait loop time has a 20% increase in 8.0.30+

InnoDB has custom mutexes and rw-locks that add value (no snark) beyond what you get from pthreads - the value is special semantics for the rw-lock and support for debugging and performance monitoring. While there are pthreads variants that include a busy-wait loop the duration of the busy-wait is a bit of a black box with the pthreads versions. The busy-wait duration for the InnoDB versions is usually between 0 and 7 microseconds which mostly works great for InnoDB.

The busy-wait code is:
  • MySQL 8.0.28 and earlier
    • ut_delay(ut_rnd_interval(0, srv_spin_wait_delay)
  • MySQL 8.0.30 to 8.0.33
    • ut_delay(ut::random_from_interval(0, srv_spin_wait_delay)
  • MySQL 8.0.34 to 8.0.40
    • ut_delay(ut::random_from_interval_fast(0, srv_spin_wait_delay)
The variable srv_spin_wait_delay has the value of the innodb_spin_wait_delay configuration variable which is 6 by default. The ut_rnd_interval, ut::random_from_interval and ut::random_from_interval_fast functions return a random integer between 0 and ~innodb_spin_wait_delay and then ut_delay($X) does a busy-wait loop for approximately $X microseconds.

The problem is that with innodb_spin_wait_delay=6, the return values are between 0 and 5 in MySQL 8.0.28 and earlier releases but between 0 and 6 in  8.0.30+. The docs for 8.08.4 and 9.0 still describe the old (8.0.28) behavior for innodb_spin_wait_delay which I hope is restored to avoid the need to update docs or my.cnf files

On average, the return value is 20% larger in 8.0.30+ thus the busy-wait time will be ~20% larger (~3.0 usecs vs ~2.5 usecs). 
  • for 8.0.28 and earlier the average value is 2.5 -> (0 + 1 + 2 + 3 + 4 + 5) / 6
  • for 8.0.30+ the average value is 3.0 -> (0 + 1 + 2 + 3 + 4 + 5 + 6) /7
  • 3 / 2.5 = 1.2
The workaround for this is to set innodb_spin_wait_delay=5 in my.cnf when using 8.0.30+ but I hope this is fixed upstream. The root cause is the +1 added to this line of code
Problem: code that determines the busy-wait loop time is slower in 8.0.30+

I doubt this is a big problem, but it is interesting. The reason I doubt this hurts performance is that after spending ~10 extra nanonseconds in ut::random_from_interval_fast the caller will then spend up to a few microseconds in a busy-wait loop.

Both ut::random_from_interval and ut::random_from_interval_fast are a lot slower than ut_rnd_interval on four different CPUs I used for benchmarks. All of these functions are PRNG. The 8.0.28 implementation is the most simple. In MySQL 8.0.30 through 8.0.33 there is ut::random_from_interval and then ut::random_from_interval_fast is added in 8.0.34.

The ut::random_from_interval implementation uses HW crc32 instructions (via hash_uint64) when they are available and I confirmed (via printfs added to mysqld) that they are available on the CPUs that I used.

The ut::random_from_interval_fast implementation uses rdtsc (random_64_fast calls my_timer_cycles) to read the cycle timer.

This table shows the time in naonseconds per call and ut::random_from_interval_fast is ~6X slower than ut_rnd_interval on 3 of the 4 CPUs. This was measured by gettimeofday for a loop with 1000 calls. The patch for this is here for 8.0.28 and for 8.0.40. More time spent in ut::random_from_interval_fast means it takes longer to get through contended mutexes and rw-locks.

Results are provided for 3 different builds. All use CMAKE_BUILD_TYPE =Release. The O2 build is compiled with -O2, the O3 build uses -O3 and the native build uses -O3 -march=native -mtune=native. The compiler is gcc 11.4 on Ubuntu 22.04.5.

I also tried clang with the -O3 build and the results were similar to gcc.

        ------- 8.0.28 -------          ------- 8.0.40 -------
        ut_rnd_interval                 ut::random_from_interval_fast
        -------- nsecs -------          -------- nsecs -------
        O2      O3      native          O2      O3      native
beelink 22      22      23              43      44      43
pn53     3       3       3              23      22      23
ax162-s  3       4       4              25      26      25
dell32   4       3       4              26      27      27
socket2 13      12      11              17      18      17

Problem: ut::random_from_interval_fast is slower than ut::random_from_interval

Perhaps this is ironic, but ut::random_from_interval_fast is slower than ut::random_from_interval on 3 of the 4 AMD CPUs that I tested. However, ut::random_from_interval_fast is faster on the 1 Intel CPU I tested.

        ------- 8.0.40 -------          ------- 8.0.40 -------
        ut::random_from_interval        ut::random_from_interval_fast
        -------- nsecs -------          -------- nsecs -------
        O2      O3      native          O2      O3      native
beelink 36      37      42              43      44      43
pn53    11       8       8              23      22      23
ax162-s 13      10      12              25      26      25
dell32  12      17      16              26      27      27
socket  22      26      29              17      18      17

And this query returns 40 for beelink, 32 for pn53, 28 for ax162-s, 36 for dell32 and 26 for socket:
    select timer_overhead from performance_schema.performance_timers where timer_name='CYCLE'

Problem: InnoDB makes this expensive to debug

When I add log_error_verbosity=3 I get more use information including:
    ... [Note] [MY-013546] [InnoDB] Atomic write enabled
    ... [Note] [MY-012932] [InnoDB] PUNCH HOLE support available
    ... [Note] [MY-012944] [InnoDB] Uses event mutexes
    ... [Note] [MY-012945] [InnoDB] GCC builtin __atomic_thread_fence() is used for memory barrier
    ... [Note] [MY-012948] [InnoDB] Compressed tables use zlib 1.3.1
    ... [Note] [MY-012951] [InnoDB] Using hardware accelerated crc32 and polynomial multiplication.

Alas, I had to add more instrumentation to learn the latency of ut_rnd_interval, ut::random_from_interval and ut::random_from_interval_fast. I prefer that InnoDB do those tests and write that to the error log.

I also want InnoDB to log the latency in wall-clock time for ut_delay(X) for X from 0 to innodb_spin_wait_delay. I collected this via the patches I added for 8.0.28 and 8.0.40. From the following, the delay is ~X usecs for ut_delay(X) on 3 of the 4 CPUs.The outlier CPU is beelink which has the oldest + slowest CPU of the 4. With innodb_spin_wait_delay =6 the delay ranges from ~0 to ~5 usecs in 8.0.28 and from ~0 to ~6 usecs in 8.0.44. These were measured via getimeofday with 1000 calls in a loop for each value of X.

I also measured this with clang and the results were similar to gcc.

Wall-clock time in microseconds for a call to ut_delay(X) in MySQL 8.0.28
        0       1       2       3       4       5       <- X
beelink 0.00    1.66    3.31    4.96    6.61    8.26
pn53    0.00    1.06    2.13    3.20    4.26    5.32
ax162-s 0.00    1.20    2.41    3.61    4.81    6.02
dell32  0.00    1.23    2.47    3.69    4.91    6.15
socket2 0.00    0.88    1.76    2.63    3.51    4.37

Wall-clock time in microseconds for a call to ut_delay(X) in MySQL 8.0.40
        0       1       2       3       4       5       6       <- X
beelink 0.00    1.66    3.32    4.96    6.61    8.26    9.91
pn53    0.00    1.06    2.13    3.19    4.27    5.35    6.42
ax162-s 0.00    1.20    2.41    3.60    4.81    6.02    7.21
dell32  0.00    1.23    2.46    3.69    4.91    6.14    7.40
socket2 0.00    0.88    1.76    2.63    3.50    4.37    5.24

Results

I used sysbench and my usage is explained here. A full run has 42 microbenchmarks but I ran a subset of that to save time. The database is cached by InnoDB.

The benchmark is run with ...
  • beelink - 1 thread, 1 table, 30M rows
  • pn53 - 1 thread, 1 table, 50M rows
  • dell32 - 24 threads, 8 tables, 10M rows/table
  • ax162-s - 40 threads, 8 tables, 10M rows/table
  • socket2 - I did not run benchmarks on this server because it was busy
I tried 8 combinations of builds and my.cnf where the A.B pattern below uses A for the build and B for the my.cnf. All builds use CMAKE_BUILD_TYPE =Release.
  • my8028_rel_o2nofp.z11a_X
    • MySQL 8.0.28 with -O2 and -fno-omit-frame-pointer
  • my8040_rel_o2nofp.z11a_X
    • MySQL 8.0.40 with -O2 and fno-omit-frame-pointer
  • my8040_rel.z11a_X
    • MySQL 8.0.40 with -O3
  • my8040_rel_o2nofp_norndfast.z11a_X
    • MySQL 8.0.40 with -O2 and -fno-omit-frame-pointer. 
    • InnoDB was changed to use ut::random_from_interval rather than ut::random_from_interval_fast
  • my8040_rel_o2nofp_norndfast.z11a_nops_X
    • MySQL 8.0.40 with -O2 and -fno-omit-frame-pointer. 
    • InnoDB was changed to use ut::random_from_interval rather than ut::random_from_interval_fast.
    • nops means that I added performance_schema =0 to my.cnf.
  • my8040_rel_o2nofp_norndfast.z11a_spin5_X
    • MySQL 8.0.40 with -O2 and -fno-omit-frame-pointer. InnoDB was changed to use ut::random_from_interval rather than ut::random_from_interval_fast.
    • spin5 means that I added innodb_spin_wait_delay =5 to my.cnf to workaround the issue described above.
  • my8040_rel_o2nofp_norndfast.z11a_nops_spin5_X
    • MySQL 8.0.40 with -O2 and -fno-omit-frame-pointer. InnoDB was changed to use ut::random_from_interval rather than ut::random_from_interval_fast.
    • nops means that I added performance_schema =0 to my.cnf.
    • spin5 means that I added innodb_spin_wait_delay =5 to my.cnf to workaround the issue described above.
  • my8040_rel_o2nofp_oldrnd.z11a_X
    • MySQL 8.0.40 with -O2 and -fno-omit-frame-pointer. 
    • InnoDB was changed to use ut_rnd_interval copied from 8.0.28
The results I share are formatted as relative QPS (rQPS) which is below. When the rQPS is less than 1.0 then that version gets less QPS than MySQL 8.0.28.
    (QPS for my version / QPS for my8028_rel_o2nofp.z11a_X)

On the beelink, pn53 and ax162-s servers the rQPS is similar for all of the MySQL 8.0.40 builds so I won't share those results here to save space. I am surprised by this, especially for the ax162-s server.

But on the dell32 server where the update-index regression in 8.0.40 is much larger than it is on the other servers, I get ~1.25X more QPS with several of the builds that attempt to fix or workaround the problems described in this post.

Relative to: my8028_rel_o2nofp.z11a_c32r128
col-1 : my8040_rel_o2nofp.z11a_c32r128
col-2 : my8040_rel.z11a_c32r128
col-3 : my8040_rel_o2nofp_norndfast.z11a_c32r128
col-4 : my8040_rel_o2nofp_norndfast.z11a_nops_c32r128
col-5 : my8040_rel_o2nofp_norndfast.z11a_spin5_c32r128
col-6 : my8040_rel_o2nofp_norndfast.z11a_nops_spin5_c32r128
col-7 : my8040_rel_o2nofp_oldrnd.z11a_c32r128

col-1   col-2   col-3   col-4   col-5   col-6   col-7
0.96    0.95    0.94    0.99    0.94    0.99    0.94    point-query.pre_range=100
1.06    1.05    1.04    1.05    1.05    1.06    1.05    points-covered-pk_range=100
1.06    1.06    1.04    1.06    1.04    1.05    1.05    points-covered-si_range=100
1.06    1.05    1.04    1.06    1.05    1.06    1.04    points-notcovered-pk_range=100
1.04    1.03    1.02    1.03    1.02    1.03    1.01    points-notcovered-si_range=100
0.98    0.97    0.96    0.97    0.97    0.97    0.96    range-notcovered-si.pre_range=100
0.95    0.95    0.94    0.97    0.94    0.97    0.94    read-only_range=100
0.87    0.72    0.80    0.81    0.81    0.81    0.82    scan_range=100
0.94    0.94    0.94    0.97    0.93    0.97    0.94    delete_range=100
0.94    0.94    0.94    0.98    0.94    0.98    0.94    insert_range=100
0.95    0.95    0.94    0.98    0.94    0.97    0.94    read-write_range=100
0.56    0.56    0.67    0.70    0.71    0.70    0.69    update-index_range=100
1.03    1.02    1.01    1.02    1.01    1.03    1.01    update-inlist_range=100
0.95    0.95    0.94    0.96    0.95    0.97    0.95    update-nonindex_range=100
0.94    0.94    0.94    0.97    0.94    0.97    0.95    update-one_range=100
0.95    0.95    0.95    0.97    0.95    0.97    0.95    update-zipf_range=100
0.93    0.93    0.92    0.95    0.92    0.95    0.93    write-only_range=10000

Monday, October 21, 2024

The impact of PGO, LTO and more for MySQL on a small CPU: insert benchmark

This post has results for the impact of PGO, LTO and other compiler options for MySQL using the Insert Benchmark. I previously shared results like this for sysbench. The PGO builds here were created from profiles captured during sysbench and I am curious to understand whether the benefit from those builds extends to other workloads. 

tl;dr

  • PGO builds created using sysbench also help other workloads like the Insert Benchmark
  • results here are similar to results from sysbench
    • gcc does slightly better than clang without LTO or PGO
    • clang does slightly better than gcc with PGO
    • clang does more than slightly better than gcc with PGO + LTO
    • gcc with PGO has similar performance as with PGO + LTO

Builds

I compiled upstream MySQL 8.0.28 from source. The my.cnf file is here

All work (builds & tests) are done on a server that uses Ubuntu 22.04.5. The gcc version is 11.4.0 and the clang version is 14.0.0-1ubuntu1.1.

The CMake command lines for all of the builds are here. All builds use CMAKE_BUILD_TYPE =Release. The builds are listed below and the name of the CMake command line file per build is cmk.80.$val where $val is the build name listed below:
  • rel_o2nofp
    • uses gcc, -O2 instead of -O3, adds -fno-omit-frame-pointer, works great for flamegraphs
  • rel_o2nofp_clang
    • uses clang, -O2 instead of -O3, adds -fno-omit-frame-pointer, works great for flamegraphs
  • rel
    • uses gcc, -O3, doesn't work great for flamegraphs
  • rel_clang
    • uses clang, -O3, doesn't work great for flamegraphs
  • rel_o2nofp_lto
    • like rel_o2nofp but adds -DWITH_LTO=ON to get link-time optimization
  • rel_o2nofp_lto_clang
    • like rel_o2nofp_clang but adds -DWITH_LTO=ON to get link-time optimization
  • rel_lto
    • like rel but adds -DWITH_LTO=ON to get link-time optimization
  • rel_lto_clang
    • like rel_clang but adds -DWITH_LTO=ON to get link-time optimization
  • rel_o2nofp_native
    • like rel_o2nofp but adds -march=native -mtune=native
  • rel_o2nofp_native_clang
    • like rel_o2nofp_clang but adds -march=native -mtune=native
  • rel_native
    • like rel but adds -march=native -mtune=native
  • rel_native_clang
    • like rel_clang but adds -march=native -mtune=native
  • rel_o2nofp_native_lto
    • like rel_o2nofp but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_o2nofp_native_lto_clang
    • like rel_o2nofp_clang but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_native_lto
    • like rel but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_native_lto_clang
    • like rel_clang but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_pgo_use
    • like rel but uses PGO via -fprofile-generate. The profiles for PGO were captured while running sysbench.
  • rel_pgo_use_clang
    • like rel_clang but uses PGO via -fprofile-generate. The profiles for PGO were captured while running sysbench.
  • rel_lto_pgo_use
    • like rel but uses PGO via -fprofile-generate and -DWITH_LTO=ON. The profiles for PGO were captured while running sysbench.
  • rel_lto_pgo_use_clang
    • like rel_clang but uses PGO via -fprofile-generate and -DWITH_LTO=ON. The profiles for PGO were captured while running sysbench.
Hardware

The server here is a Beelink SER4 with an AMD Ryzen 7 4700 CPU with SMT disabled, 8 cores, 16G of RAM and Ubuntu 22.04. The storage is 1 NVMe device.

The CPU used here (AMD 4700u) is described as a laptop class CPU. The server is configured to use the performance frequency governor and acpi-cpufreq scaling driver.

The Benchmark

The benchmark is explained here and is run with 1 client and 1 table. The database is cached by InnoDB.The benchmark steps are:

  • l.i0
    • insert 20 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts 40M rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and 10M rows are inserted and deleted per table.
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: overview

The performance report is here.

The summary section in the performance report has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. The base case MySQL 8.0.28 with the rel_o2nofp build.

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

Normally I summarize performance here but I think the summary in the tl;dr above is sufficient.

MySQL 8.0.40 does not fix the regressions I hoped it would fix

Performance regressions arrived in InnoDB with MySQL 8.0.30. Eventually multiple bugs were filed. The worst regressions were from changes to the hash function (perhaps fixed in 8.0.36) and from changes to how functions are inlined for InnoDB (bug 111538). The problems are obvious if you run CPU-bound workloads, and my CPU-bound workload is sysbench with a cached database.

Bug 111538 is now closed and marked as fixed in 8.0.40. Alas, there are still significant CPU perf regressions in 8.0.40 relative to 8.0.28. My advice to upstream is to stop innovating if you don't have the CI setup to catch the new performance problems that your innovation creates. Using something like Nyrkio would help.

This post has results from sysbench on several servers using MySQL 8.0.28, 8.0.30, 8.0.33, 8.0.39 and 8.0.40 to show there are large regressions starting in 8.0.30, many of which are still there in 8.0.40. Tests were repeated on 5 different servers.

tl;dr

  • SELECT statements with a large in-list use much less CPU starting in MySQL 8.0.31 because bug 102037 was fixed. I found that via sysbench and filed a bug vs 8.0.22
  • bug 111538 should not have been closed as fixed
  • The scan microbenchmark still has regressions from 8.0.28 to 8.0.40
    • For all servers the QPS is less in 8.0.40 than in 8.0.28
    • On 4 of 5 servers the QPS is less in 8.0.40 than in 8.0.30
  • The update microbenchmark still has regressions from 8.0.28 to 8.0.40
    • In all cases the QPS is less in 8.0.40 than in 8.0.28
    • From 8.0.30 to 8.0.40 -- on 2 servers the QPS is less in 8.0.40, on 2 it is about the same and one 1 it has improved 
    • Regressions after 8.0.28 are bad for the small servers (beelink and pn53 below) and really bad (QPS drops in half) for one of the large servers (delll32 below). That is the subject of a pending blog post.
Updates
  • MySQL 8.0.40 fixes regressions for many of the read-only microbenchmarks. I failed to acknowledge that when I first posted this. But the large regression for long range scans remains.

Builds

I used MySQL 8.0.28, 8.0.30, 8.0.33, 8.0.39 and 8.0.40 compiled from source using CMAKE_BUILD_TYPE =Release, -O2 and -fno-omit-frame-pointer.

Hardware

The servers are
  • beelink
    • Beelink SER4 with an AMD Ryzen 7 4700 CPU with SMT disabled, 8 cores, 16G of RAM, Ubuntu 22.04 and ext4 on 1 NVMe device.
  • pn53
    • ASUS ExpertCenter PN53 with AMD Ryzen 7 7735HS, with SMT disabled, 8 cores, 32G RAM, Ubuntu 22.04 and ext4 on 1 NVMe device.
  • socket2
    • SuperMicro SuperWorkstation 7049A-T with 2 sockets, 12 cores/socket, 64G RAM, one m.2 SSD (2TB,  ext4). The CPUs are Intel Xeon Silver 4214R CPU @ 2.40GHz
  • dell32
    • Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32-Cores, 2 m.2 SSD (each 2TB, RAID SW 0, ext4)
  • ax162-s
    • AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G RAM, Ubuntu 22.04 and ext4 on 2 NVMe devices with SW RAID 1. This is in the Hetzner cloud.
Benchmark

I used sysbench and my usage is explained here. A full run has 42 microbenchmarks and most test only 1 type of SQL statement. In some cases I skip the read-only tests that run prior to writes to save time. The database is cached by InnoDB.

The benchmark is run with ...
  • beelink - 1 thread, 1 table, 30M rows
  • pn53 - 1 thread, 1 table, 50M rows
  • socket2 - 16 threads, 8 tables, 10M rows/table
  • dell32 - 24 threads, 8 tables, 10M rows/table
  • ax162-s - 40 threads, 8 tables, 10M rows/table
Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

Results 

All of the results se relative QPS (rQPS) where:
  • rQPS is: (QPS for my version / QPS for base version)
  • base version is the QPS from MySQL 8.0.28
  • my version is one of the other versions (8.0.30, 8.0.33, 8.0.39, 8.0.40)
The scan microbenchmark is the canary in the coal mine for bug 111538 as most of the CPU time is spent in InnoDB. Regressions that arrived after 8.0.28 remain unfixed in 8.0.40. The rQPS drops from MySQL 8.0.30 to 8.0.40 on 4 of the 5 servers.

                rQPS    rQPS
server          8.0.30  8.0.40
------          ------  ------
beelink         0.89    0.78
pn53            0.91    0.83
socket2         0.90    0.84
dell32          0.70    0.91
ax162-s         0.91    0.83

The update-index microbenchmark also has large regressions after 8.0.28. The QPS drops from 8.0.30 to 8.0.40 on 2 servers (beelink, pn53), remains about the same on 2 of them (socket2, dell32) and improves on 1 (ax162-s). The result for dell32 is lousy as update QPS drops almost in half after 8.0.28 and I will have more about that in a pending blog post.

                rQPS    rQPS
server          8.0.30  8.0.40
------          ------  ------
beelink         0.88    0.76
pn53            0.91    0.78
socket2         0.89    0.92
dell32          0.56    0.56
ax162-s         0.80    0.90

Results: beelink

Summary
  • the big improvement to random-points starting in 8.0.33 is from fixing bug 102037
  • for scan the problem is new CPU overhead after 8.0.28
  • for update-index the problem is more CPU overhead, (maybe) more context switches and (maybe) more KB written to storage per update
Relative to: x.my8028_rel_o2nofp.z11a_bee.pk1
col-1 : x.my8030_rel_o2nofp.z11a_bee.pk1
col-2 : x.my8033_rel_o2nofp.z11a_bee.pk1
col-3 : x.my8039_rel_o2nofp.z11a_bee.pk1
col-4 : x.my8040_rel_o2nofp.z11a_bee.pk1

col-1 col-2 col-3 col-4
0.91 1.06 1.09 1.13 hot-points_range=100
1.00 0.91 0.92 0.92 point-query.pre_range=100
1.00 0.91 0.93 0.92 point-query_range=100
0.91 1.01 1.03 1.06 points-covered-pk.pre_range=100
0.90 1.00 1.03 1.06 points-covered-pk_range=100
0.89 1.00 1.03 1.07 points-covered-si.pre_range=100
0.89 1.00 1.03 1.06 points-covered-si_range=100
0.89 1.00 1.02 1.04 points-notcovered-pk.pre_range=100
0.89 1.00 1.02 1.05 points-notcovered-pk_range=100
0.88 0.93 0.98 1.01 points-notcovered-si.pre_range=100
0.88 0.93 0.98 1.01 points-notcovered-si_range=100
0.96 2.20 2.30 2.35 random-points.pre_range=1000
0.90 1.00 1.03 1.05 random-points.pre_range=100
0.94 0.91 0.93 0.94 random-points.pre_range=10
0.94 2.20 2.30 2.35 random-points_range=1000
0.90 1.00 1.03 1.05 random-points_range=100
0.93 0.91 0.93 0.94 random-points_range=10
0.95 0.90 0.90 0.96 range-covered-pk.pre_range=100
0.95 0.90 0.90 0.96 range-covered-pk_range=100
0.96 0.91 0.91 0.97 range-covered-si.pre_range=100
0.95 0.90 0.90 0.98 range-covered-si_range=100
0.97 0.93 0.94 0.97 range-notcovered-pk.pre_range=100
0.96 0.92 0.93 0.97 range-notcovered-pk_range=100
0.88 0.86 0.91 0.94 range-notcovered-si.pre_range=100
0.87 0.86 0.91 0.94 range-notcovered-si_range=100
0.97 0.94 0.93 0.98 read-only.pre_range=10000
0.97 0.93 0.93 0.94 read-only.pre_range=100
0.99 0.93 0.94 0.94 read-only.pre_range=10
0.98 0.95 0.94 0.99 read-only_range=10000
1.00 0.95 0.95 0.97 read-only_range=100
0.98 0.94 0.94 0.96 read-only_range=10
0.89 0.82 0.80 0.78 scan_range=100
0.95 0.88 0.89 0.90 delete_range=100
0.94 0.88 0.88 0.88 insert_range=100
0.97 0.93 0.93 0.95 read-write_range=100
0.96 0.92 0.92 0.93 read-write_range=10
0.88 0.82 0.84 0.76 update-index_range=100
0.94 0.87 0.89 0.90 update-inlist_range=100
0.96 0.89 0.90 0.91 update-nonindex_range=100
0.97 0.91 0.91 0.91 update-one_range=100
0.95 0.90 0.90 0.91 update-zipf_range=100
0.93 0.88 0.89 0.87 write-only_range=10000

For scan the problem is CPU overhead (see cpu/o below) which is 1.28X larger in 8.0.40 vs 8.0.28.

sb.met.scan.range100.pk1.dop1
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.222553        2.534   0       0.001   0.035   55      x.my8028_rel_o2nofp
0.251913        6.640   0       0       0.052   49      x.my8030_rel_o2nofp
0.273314        7.107   0       0       0.039   45      x.my8033_rel_o2nofp
0.282176        15.578  0       0       0.053   44      x.my8039_rel_o2nofp
0.285792        7.622   0       0       0.041   43      x.my8040_rel_o2nofp
--- relative to first result
1.13            2.62    1       0.00    1.49    0.89    x.my8030_rel_o2nofp
1.23            2.80    1       0.00    1.11    0.82    x.my8033_rel_o2nofp
1.27            6.15    1       0.00    1.51    0.80    x.my8039_rel_o2nofp
1.28            3.01    1       0.00    1.17    0.78    x.my8040_rel_o2nofp

For update-index there are several changes after 8.0.28
  • CPU per update (cpu/o) is 1.33X larger in 8.0.40 vs 8.0.28
  • KB written per update (wKB/o) is 1.47X larger in 8.0.40 vs 8.0.28. 
    • A possible cause is that the writeback code path isn't slowed by regressions while the update codepath is slowed. But I am just waving my hands.
    • This one is confusing because I use the same redo log size for all versions. However, configuration for the redo log changes in 8.0.30 (see here) and as a result I used 15 X 1G log segments in 8.0.28 vs 32 X 480M log segments in 8.0.40. I am repeating tests for 8.0.28 with 32 X 480M log segments. Using the output of the Pages ..., written line from SHOW ENGINE INNODB STATUS the pages written per update value is 0.97 for 8.0.28 vs 1.31 for 8.0.40. Perhaps 8.0.30+ is more aggressive about doing dirty page writeback but I use the same values for innodb_max_dirty_pages_pct_lwm (=80) and innodb_max_dirty_pages_pct (=90) for 8.0.28 to 8.0.40. I repeated tests for 8.0.28 using 32 redo log segments, but results (QPS, HW metrics) doesn't change with that.
  • Context switches per update (cs/o) is 1.24X larger in 8.0.40 vs 8.0.28, but I am less sure this is a problem as the absolute rates (15.312 vs 18.917 per update) are not that big.
sb.met.update-index.range100.pk1.dop1
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.008211        15.312  0       0       47.066  2649    x.my8028_rel_o2nofp
0.009487        17.353  0       0       66.043  2337    x.my8030_rel_o2nofp
0.009875        17.636  0       0       66.695  2178    x.my8033_rel_o2nofp
0.010162        17.959  0       0       67.203  2216    x.my8039_rel_o2nofp
0.010903        18.917  0       0       68.954  2024    x.my8040_rel_o2nofp
--- relative to first result
1.16            1.13    1       1       1.40    0.88    x.my8030_rel_o2nofp
1.20            1.15    1       1       1.42    0.82    x.my8033_rel_o2nofp
1.24            1.17    1       1       1.43    0.84    x.my8039_rel_o2nofp
1.33            1.24    1       1       1.47    0.76    x.my8040_rel_o2nofp

Results: pn53

Summary
  • the big improvement to random-points starting in 8.0.33 is from fixing bug 102037
  • for scan the problem is new CPU overhead after 8.0.28
  • for update-index the problem is more CPU overhead, (maybe) more context switches and (maybe) more KB written to storage per update
  • the changes to HW overheads here are similar to the changes above for the beelink server
Relative to: x.my8028_rel_o2nofp.z11a_c8r32.pk1
col-1 : x.my8030_rel_o2nofp.z11a_c8r32.pk1
col-2 : x.my8033_rel_o2nofp.z11a_c8r32.pk1
col-3 : x.my8039_rel_o2nofp.z11a_c8r32.pk1
col-4 : x.my8040_rel_o2nofp.z11a_c8r32.pk1

col-1 col-2 col-3 col-4
0.89 1.12 1.13 1.18 hot-points_range=100
0.97 0.94 0.93 0.94 point-query.pre_range=100
0.98 0.95 0.94 0.94 point-query_range=100
0.89 1.03 1.05 1.10 points-covered-pk_range=100
0.87 1.02 1.04 1.09 points-covered-si_range=100
0.91 1.03 1.05 1.11 points-notcovered-pk_range=100
0.84 0.96 1.01 1.06 points-notcovered-si_range=100
0.95 2.14 2.21 2.30 random-points_range=1000
0.89 1.03 1.06 1.10 random-points_range=100
0.96 0.94 0.95 0.97 random-points_range=10
0.97 0.93 0.92 0.99 range-covered-pk_range=100
0.98 0.93 0.93 0.98 range-covered-si_range=100
0.98 0.92 0.94 0.97 range-notcovered-pk_range=100
0.91 0.89 0.92 0.99 range-notcovered-si.pre_range=100
0.92 0.89 0.92 0.99 range-notcovered-si_range=100
0.97 0.94 0.98 1.01 read-only_range=10000
0.98 0.94 0.95 0.96 read-only_range=100
0.98 0.94 0.94 0.96 read-only_range=10
0.91 0.83 0.83 0.83 scan_range=100
0.96 0.91 0.92 0.92 delete_range=100
0.94 0.90 0.90 0.91 insert_range=100
0.97 0.93 0.94 0.95 read-write_range=100
0.97 0.93 0.94 0.95 read-write_range=10
0.91 0.85 0.84 0.78 update-index_range=100
0.94 0.91 0.91 0.94 update-inlist_range=100
0.96 0.92 0.92 0.93 update-nonindex_range=100
0.96 0.92 0.92 0.92 update-one_range=100
0.96 0.92 0.92 0.93 update-zipf_range=100
0.94 0.90 0.91 0.90 write-only_range=10000

For scan the problem is CPU overhead (see cpu/o below) which is 1.22X larger in 8.0.40 vs 8.0.28.

sb.met.scan.range100.pk1.dop1
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.532922        6.523   0       0       0.092   23      x.my8028_rel_o2nofp
0.586497        15.751  0       0       0.122   21      x.my8030_rel_o2nofp
0.650201        17.147  0       0       0.105   19      x.my8033_rel_o2nofp
0.651073        17.145  0       0       0.098   19      x.my8039_rel_o2nofp
0.649427        17.975  0       0       0.128   19      x.my8040_rel_o2nofp
--- relative to first result
1.10            2.41    1       1       1.33    0.91    x.my8030_rel_o2nofp
1.22            2.63    1       1       1.14    0.83    x.my8033_rel_o2nofp
1.22            2.63    1       1       1.07    0.83    x.my8039_rel_o2nofp
1.22            2.76    1       1       1.39    0.83    x.my8040_rel_o2nofp

For update-index there are several changes after 8.0.28
  • the changes here are similar to the ones above for the beelink server
  • CPU per update (cpu/o) is 1.29X larger in 8.0.40 vs 8.0.28
  • KB written per update (wKB/o) is 1.48X larger in 8.0.40 vs 8.0.28
  • context switches per update (cs/o) is 1.22X larger in 8.0.40 vs 8.0.28
sb.met.update-index.range100.pk1.dop1
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.004972        11.052  0       0       43.246  4198    x.my8028_rel_o2nofp
0.005658        11.948  0       0       61.606  3810    x.my8030_rel_o2nofp
0.005927        12.288  0       0       62.306  3578    x.my8033_rel_o2nofp
0.005997        12.536  0       0       62.804  3544    x.my8039_rel_o2nofp
0.006401        13.486  0       0       63.98   3294    x.my8040_rel_o2nofp
--- relative to first result
1.14            1.08    1       1       1.42    0.91    x.my8030_rel_o2nofp
1.19            1.11    1       1       1.44    0.85    x.my8033_rel_o2nofp
1.21            1.13    1       1       1.45    0.84    x.my8039_rel_o2nofp
1.29            1.22    1       1       1.48    0.78    x.my8040_rel_o2nofp

Results: socket2

Summary
  • the big improvement to random-points starting in 8.0.33 is from fixing bug 102037
  • for scan the problem is new CPU overhead after 8.0.28
  • for update-index the problem is more CPU overhead, (maybe) more context switches and (maybe) more KB written to storage per update
  • the changes to HW overheads here and the regressions here are similar to but smaller than the changes above for the beelink and pn53 servers
Relative to: x.my8028_rel_o2nofp.z11a_c24r64.pk1
col-1 : x.my8030_rel_o2nofp.z11a_c24r64.pk1
col-2 : x.my8033_rel_o2nofp.z11a_c24r64.pk1
col-3 : x.my8039_rel_o2nofp.z11a_c24r64.pk1
col-4 : x.my8040_rel_o2nofp.z11a_c24r64.pk1

col-1 col-2 col-3 col-4
0.94 1.04 1.01 1.10 hot-points_range=100
0.97 0.94 0.97 0.96 point-query.pre_range=100
0.99 0.96 0.97 0.97 point-query_range=100
0.92 1.00 1.04 1.06 points-covered-pk_range=100
0.90 0.98 1.03 1.04 points-covered-si_range=100
0.92 1.00 1.04 1.06 points-notcovered-pk_range=100
0.89 0.94 0.98 1.00 points-notcovered-si_range=100
0.93 1.76 1.81 1.87 random-points_range=1000
0.92 1.00 1.03 1.06 random-points_range=100
0.94 0.93 0.95 0.97 random-points_range=10
0.98 0.94 0.95 1.00 range-covered-pk_range=100
0.97 0.94 0.94 1.00 range-covered-si_range=100
0.98 0.94 0.95 0.98 range-notcovered-pk_range=100
0.91 0.91 0.94 0.98 range-notcovered-si.pre_range=100
0.90 0.90 0.93 0.98 range-notcovered-si_range=100
0.98 0.97 0.97 1.00 read-only_range=10000
0.99 0.96 0.96 0.98 read-only_range=100
0.99 0.95 0.96 0.97 read-only_range=10
0.90 0.84 0.80 0.84 scan_range=100
0.95 0.93 0.94 0.94 delete_range=100
0.93 0.92 0.93 0.93 insert_range=100
0.97 0.94 0.95 0.96 read-write_range=100
0.97 0.94 0.95 0.96 read-write_range=10
0.89 0.90 0.88 0.92 update-index_range=100
0.96 0.96 0.98 0.98 update-inlist_range=100
0.95 0.94 0.94 0.94 update-nonindex_range=100
0.95 0.93 0.93 0.93 update-one_range=100
0.96 0.95 0.94 0.94 update-zipf_range=100
0.91 0.90 0.92 0.90 write-only_range=10000

For scan the problem is CPU overhead (see cpu/o below) which is 1.16X larger in 8.0.40 vs 8.0.28.

sb.met.scan.range100.pk1.dop8
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.180593        4.756   0       0       0.007   174     x.my8028_rel_o2nofp
0.199852        5.846   0       0       0.008   156     x.my8030_rel_o2nofp
0.213651        6.156   0       0       0.008   146     x.my8033_rel_o2nofp
0.223563        6.615   0       0       0.008   140     x.my8039_rel_o2nofp
0.209727        6.136   0       0       0.008   146     x.my8040_rel_o2nofp
--- relative to first result
1.11            1.23    1       1       1.14    0.90    x.my8030_rel_o2nofp
1.18            1.29    1       1       1.14    0.84    x.my8033_rel_o2nofp
1.24            1.39    1       1       1.14    0.80    x.my8039_rel_o2nofp
1.16            1.29    1       1       1.14    0.84    x.my8040_rel_o2nofp

For update-index there are several changes after 8.0.28
  • the changes here are similar to the ones above for the beelink and pn53 servers, but not as large. And the regression here is also not as large.
  • CPU per update (cpu/o) is 1.09X larger in 8.0.40 vs 8.0.28
  • KB written per update (wKB/o) is 1.30X larger in 8.0.40 vs 8.0.28
  • context switches per update (cs/o) is 1.05X larger in 8.0.40 vs 8.0.28
sb.met.update-index.range100.pk1.dop16
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.001677        11.125  0       0       16.578  38477   x.my8028_rel_o2nofp
0.001859        12.496  0       0       24.596  34256   x.my8030_rel_o2nof
0.001849        11.945  0       0       22.664  34786   x.my8033_rel_o2nofp
0.001907        12.115  0       0       24.136  33971   x.my8039_rel_o2nofp
0.001823        11.670  0       0       21.581  35336   x.my8040_rel_o2nofp
--- relative to first result
1.11            1.12    1       1       1.48    0.89    x.my8030_rel_o2nofp
1.10            1.07    1       1       1.37    0.90    x.my8033_rel_o2nofp
1.14            1.09    1       1       1.46    0.88    x.my8039_rel_o2nofp
1.09            1.05    1       1       1.30    0.92    x.my8040_rel_o2nofp

Results: dell32

Summary
  • the big improvement to random-points starting in 8.0.33 is from fixing bug 102037
  • for scan the problem is new CPU overhead after 8.0.28
  • for update-index the problem is more CPU overhead, (maybe) more context switches and (maybe) more KB written to storage per update. I still don't understand why the regression for update-index here is so much worse than on the other servers.
Relative to: x.my8028_rel_o2nofp.z11a_c32r128.pk1
col-1 : x.my8030_rel_o2nofp.z11a_c32r128.pk1
col-2 : x.my8033_rel_o2nofp.z11a_c32r128.pk1
col-3 : x.my8039_rel_o2nofp.z11a_c32r128.pk1
col-4 : x.my8040_rel_o2nofp.z11a_c32r128.pk1

col-1 col-2 col-3 col-4
0.91 1.01 1.04 1.06 hot-points_range=100
0.98 0.94 0.93 0.95 point-query.pre_range=100
0.98 0.94 0.92 0.95 point-query_range=100
0.89 0.96 1.00 1.05 points-covered-pk.pre_range=100
0.89 0.96 1.00 1.05 points-covered-pk_range=100
0.87 0.93 1.00 1.04 points-covered-si.pre_range=100
0.88 0.94 1.01 1.05 points-covered-si_range=100
0.89 0.96 1.00 1.05 points-notcovered-pk.pre_range=100
0.89 0.96 1.00 1.05 points-notcovered-pk_range=100
0.89 0.90 0.98 1.02 points-notcovered-si.pre_range=100
0.88 0.90 0.97 1.01 points-notcovered-si_range=100
0.92 1.73 1.80 1.89 random-points.pre_range=1000
0.89 0.96 1.00 1.05 random-points.pre_range=100
0.94 0.92 0.93 0.96 random-points.pre_range=10
0.92 1.75 1.82 1.91 random-points_range=1000
0.89 0.96 1.00 1.05 random-points_range=100
0.94 0.92 0.93 0.96 random-points_range=10
0.96 0.93 0.93 0.99 range-covered-pk.pre_range=100
0.96 0.93 0.93 0.99 range-covered-pk_range=100
0.97 0.93 0.93 0.99 range-covered-si.pre_range=100
0.97 0.93 0.93 0.99 range-covered-si_range=100
0.96 0.92 0.93 0.96 range-notcovered-pk.pre_range=100
0.96 0.92 0.93 0.96 range-notcovered-pk_range=100
0.88 0.86 0.92 0.97 range-notcovered-si.pre_range=100
0.88 0.86 0.91 0.97 range-notcovered-si_range=100
0.98 0.96 0.98 1.01 read-only.pre_range=10000
0.97 0.93 0.94 0.95 read-only.pre_range=100
0.97 0.94 0.93 0.95 read-only.pre_range=10
0.99 0.96 0.97 1.01 read-only_range=10000
0.97 0.93 0.94 0.95 read-only_range=100
0.97 0.94 0.94 0.95 read-only_range=10
0.70 0.87 0.85 0.91 scan_range=100
0.94 0.92 0.93 0.93 delete_range=100
0.95 0.94 0.94 0.94 insert_range=100
0.97 0.93 0.94 0.95 read-write_range=100
0.97 0.93 0.93 0.95 read-write_range=10
0.56 0.56 0.56 0.56 update-index_range=100
0.99 0.99 1.00 1.03 update-inlist_range=100
0.96 0.95 0.95 0.95 update-nonindex_range=100
0.96 0.93 0.94 0.94 update-one_range=100
0.96 0.95 0.95 0.96 update-zipf_range=100
0.95 0.93 0.93 0.94 write-only_range=10000

For scan the problem is CPU overhead (see cpu/o below) which is 1.13X larger in 8.0.40 vs 8.0.28.

sb.met.scan.range100.pk1.dop8
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.093496        3.256   0       0       0.006   246     x.my8028_rel_o2nofp
0.104925        5.301   0       0       0.007   172     x.my8030_rel_o2nofp
0.110480        4.195   0       0       0.006   215     x.my8033_rel_o2nofp
0.113259        4.306   0       0       0.006   210     x.my8039_rel_o2nofp
0.106105        4.065   0       0       0.006   225     x.my8040_rel_o2nofp
--- relative to first result
1.12            1.63    1       1       1.17    0.70    x.my8030_rel_o2nofp
1.18            1.29    1       1       1.00    0.87    x.my8033_rel_o2nofp
1.21            1.32    1       1       1.00    0.85    x.my8039_rel_o2nofp
1.13            1.25    1       1       1.00    0.91    x.my8040_rel_o2nofp

For update-index there are several changes after 8.0.28
  • the regression here is huge and I am not sure about the root causes
  • CPU per update (cpu/o) is 1.50X larger in 8.0.40 vs 8.0.28
  • KB written per update (wKB/o) is 2.32X larger in 8.0.40 vs 8.0.28
  • context switches per update (cs/o) is 1.41X larger in 8.0.40 vs 8.0.28
sb.met.update-index.range100.pk1.dop24
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.001185        10.777  0       0       14.275  51319   x.my8028_rel_o2nofp
0.001739        15.475  0       0       33.68   28515   x.my8030_rel_o2nofp
0.001794        15.192  0       0       32.953  28872   x.my8033_rel_o2nofp
0.001783        15.196  0       0       33.05   28983   x.my8039_rel_o2nofp
0.001781        15.242  0       0       33.139  28966   x.my8040_rel_o2nofp
--- relative to first result
1.47            1.44    1       1       2.36    0.56    x.my8030_rel_o2nofp
1.51            1.41    1       1       2.31    0.56    x.my8033_rel_o2nofp
1.50            1.41    1       1       2.32    0.56    x.my8039_rel_o2nofp
1.50            1.41    1       1       2.32    0.56    x.my8040_rel_o2nofp

Results: ax162-s

Summary
  • the big improvement to random-points starting in 8.0.33 is from fixing bug 102037
  • for scan the problem is new CPU overhead after 8.0.28
  • for update-index the problem is more CPU overhead, (maybe) more context switches and (maybe) more KB written to storage per update
  • the changes to HW overheads here and the regressions here are similar to but smaller than the changes above for the beelink and pn53 servers
Relative to: x.my8028_rel_o2nofp.z11a_c32r128.pk1
col-1 : x.my8030_rel_o2nofp.z11a_c32r128.pk1
col-2 : x.my8033_rel_o2nofp.z11a_c32r128.pk1
col-3 : x.my8039_rel_o2nofp.z11a_c32r128.pk1
col-4 : x.my8040_rel_o2nofp.z11a_c32r128.pk1

col-1 col-2 col-3 col-4
0.89 0.93 0.99 0.99 hot-points_range=100
0.98 0.94 0.94 0.94 point-query.pre_range=100
0.96 0.93 0.93 0.94 point-query_range=100
0.90 0.96 1.01 1.02 points-covered-pk.pre_range=100
0.89 0.95 1.00 1.02 points-covered-pk_range=100
0.87 0.89 0.95 0.98 points-covered-si.pre_range=100
0.88 0.89 0.96 0.99 points-covered-si_range=100
0.90 0.96 1.01 1.02 points-notcovered-pk.pre_range=100
0.89 0.95 1.00 1.02 points-notcovered-pk_range=100
0.86 0.88 0.94 0.96 points-notcovered-si.pre_range=100
0.87 0.88 0.94 0.96 points-notcovered-si_range=100
0.95 1.61 1.68 1.72 random-points.pre_range=1000
0.90 0.96 1.01 1.02 random-points.pre_range=100
0.92 0.91 0.95 0.94 random-points.pre_range=10
0.95 1.62 1.69 1.72 random-points_range=1000
0.89 0.95 1.00 1.02 random-points_range=100
0.91 0.90 0.94 0.94 random-points_range=10
0.92 0.90 0.92 0.95 range-covered-pk.pre_range=100
0.92 0.90 0.92 0.95 range-covered-pk_range=100
0.93 0.91 0.93 0.96 range-covered-si.pre_range=100
0.94 0.91 0.94 0.96 range-covered-si_range=100
0.93 0.90 0.93 0.94 range-notcovered-pk.pre_range=100
0.93 0.90 0.93 0.94 range-notcovered-pk_range=100
0.87 0.87 0.92 0.93 range-notcovered-si.pre_range=100
0.87 0.86 0.92 0.93 range-notcovered-si_range=100
0.99 0.97 0.98 1.01 read-only.pre_range=10000
0.95 0.92 0.93 0.94 read-only.pre_range=100
0.96 0.92 0.94 0.94 read-only.pre_range=10
0.99 0.97 0.98 1.00 read-only_range=10000
0.95 0.92 0.93 0.93 read-only_range=100
0.96 0.92 0.93 0.94 read-only_range=10
0.91 0.82 0.81 0.83 scan_range=100
0.93 0.91 0.94 0.93 delete_range=100
0.96 0.94 0.94 0.94 insert_range=100
0.95 0.94 0.95 0.95 read-write_range=100
0.96 0.94 0.95 0.95 read-write_range=10
0.80 0.85 0.83 0.90 update-index_range=100
0.99 1.04 1.03 1.02 update-inlist_range=100
0.98 0.97 0.96 0.96 update-nonindex_range=100
0.96 0.93 0.94 0.95 update-one_range=100
0.97 0.97 0.96 0.95 update-zipf_range=100
0.94 0.93 0.94 0.92 write-only_range=10000

For scan the problem is CPU overhead (see cpu/o below) which is 1.20X larger in 8.0.40 vs 8.0.28.

sb.met.scan.range100.pk1.dop8
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.018767        0.552   0       0       0.052   872     x.my8028_rel_o2nofp
0.020741        0.746   0       0       0.016   793     x.my8030_rel_o2nofp
0.022690        0.808   0       0       0.004   713     x.my8033_rel_o2nofp
0.023079        0.791   0       0       0.003   706     x.my8039_rel_o2nofp
0.022533        0.800   0       0       0.013   725     x.my8040_rel_o2nofp
--- relative to first result
1.11            1.35    1       1       0.31    0.91    x.my8030_rel_o2nofp
1.21            1.46    1       1       0.08    0.82    x.my8033_rel_o2nofp
1.23            1.43    1       1       0.06    0.81    x.my8039_rel_o2nofp
1.20            1.45    1       1       0.25    0.83    x.my8040_rel_o2nofp

For update-index there are several changes after 8.0.28
  • the changes here are similar to the ones above for the beelink and pn53 servers, but not as large. And the regression here is also not as large
  • CPU per update (cpu/o) is 1.09X larger in 8.0.40 vs 8.0.28
  • KB written per update (wKB/o) is 1.37X larger in 8.0.40 vs 8.0.28
  • context switches per update (cs/o) is 1.03X larger in 8.0.40 vs 8.0.28
sb.met.update-index.range100.pk1.dop40
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000550        12.157  0       0       8.354   83129   x.my8028_rel_o2nofp
0.000641        13.426  0       0       14.791  66432   x.my8030_rel_o2nofp
0.000623        12.988  0       0       13.155  70890   x.my8033_rel_o2nofp
0.000637        13.044  0       0       14.158  68996   x.my8039_rel_o2nofp
0.000600        12.514  0       0       11.416  74917   x.my8040_rel_o2nofp
--- relative to first result
1.17            1.10    1       1       1.77    0.80    x.my8030_rel_o2nofp
1.13            1.07    1       1       1.57    0.85    x.my8033_rel_o2nofp
1.16            1.07    1       1       1.69    0.83    x.my8039_rel_o2nofp
1.09            1.03    1       1       1.37    0.90    x.my8040_rel_o2nofp




How efficient is RocksDB for IO-bound, point-query workloads?

How efficient is RocksDB for workloads that are IO-bound and read-only? One way to answer this is to measure the CPU overhead from RocksDB a...