Small Datum: November 2024

Friday, November 29, 2024

RocksDB on a big server: LRU vs hyperclock, v2

This post shows that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a large server to show the speedup from the hyperclock block cache implementation for different concurrency levels with RocksDB 9.6. Here I share results from the same server and different (old and new) RocksDB releases.

Results are amazing on a large (48 cores) server with 40 client threads

~2X more QPS for range queries with hyperclock
~3X more QPS for point queries with hyperclock

Software

I used RocksDB versions 6.0.2, 6.29.5, 7.0.4, 7.6.0, 7.7.8, 8.5.4, 8.6.7, 9.0.1, 9.1.2, 9.3.2, 9.5.2, 9.7.4 and 9.9.0. Everything was compiled with gcc 11.4.0.

The --cache_type argument selected the block cache implementation:

lru_cache was used for versions 7.6 and earlier. Because some of the oldest releases don't support --cache_type I also used --undef_params=...,cache_type
hyper_clock_cache was used for versions 7.7 through 8.5
auto_hyper_clock_cache was used for versions 8.5+

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Benchmark

Overviews on how I use db_bench are here and here.

All of my tests here use a CPU-bound workload with a database that is cached by RocksDB and the benchmark is run for 40 threads.

I focus on the read-heavy benchmark steps:

revrangeww (reverse range while writing) - this does short reverse range scans
fwdrangeww (forward range while writing) - this does short forward range scans
readww (read while writing) - this does point queries

For each of these there is a fixed rate for writes done in the background and performance is reported for the reads. I prefer to measure read performance when there are concurrent writes because read-only benchmarks with an LSM suffer from non-determinism as the state (shape) of the LSM tree has a large impact on CPU overhead and throughput.

Results

All results are in this spreadsheet and the performance summary is here.

The graph below shows relative QPS which is: (QPS for my version / QPS for RocksDB 6.0.2) and the results are amazing:

~2X more QPS for range queries with hyperclock
~3X more QPS for point queries with hyperclock

The average values for vmstat metrics provide more detail on why hyperclock is so good for performance. The context switch rate drops dramatically when it is enabled because there is much less mutex contention. The user CPU utilization increases by ~1.6X because more useful work can get done when there is less mutex contention.

legend
* cs - context switches per second per vmstat
* us - user CPU utilization per vmstat
* sy - system CPU utilization per vmstat
* id - idle CPU utilization per vmstat
* wa - wait CPU utilization per vmstat
* version - RocksDB version

cs us sy us+sy id wa version
1495325 50.3 14.0 64.3 18.5 0.1 7.6.0
2360 82.7 14.0 96.7 16.6 0.1 9.9.0

Monday, November 25, 2024

RocksDB benchmarks: large server, universal compaction

This post has results from a large server with universal compaction from the same server for which I recently shared leveled compaction results. The results are boring (no large regressions) but a bit more exciting than the ones for leveled compaction because there is more variance. A somewhat educated guess is that variance more likely with universal.

tl;dr

there are some small regressions for cached workloads (see byrx below)
there are some small to medium improvements for IO-bound workloads (see iodir and iobuf)
modern RocksDB would look better were I to use the Hyper Clock block cache, but here I don't to test similar code across all versions

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Builds

I compiled db_bench from source on all servers. I used versions:

6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1 and 9.7.3

Benchmark

All tests used the default value for compaction_readahead_size and the block cache (LRU).

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 40 threads. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is: bash x3.sh 40 no 1800 c48r128 100000000 2000000000 byrx iobuf iodir

The tests on the charts are named as:

fillseq -- load in key order with the WAL disabled
revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
fwdrangeww -- like revrangeww except do short forward range scans
readww - like revrangeww except do point queries
overwrite - do overwrites (Put) as fast as possible

Workloads

There are three workloads, all of which use 40 threads:

byrx - the database is cached by RocksDB (100M KV pairs)
iobuf - the database is larger than memory and RocksDB uses buffered IO (2B KV pairs)
iodir - the database is larger than memory and RocksDB uses O_DIRECT (2B KV pairs)

A spreadsheet with all results is here and performance summaries with more details are here for byrx, iobuf and iodir.

Relative QPS

The numbers in the spreadsheet and on the y-axis in the charts that follow are the relative QPS which is (QPS for $me) / (QPS for $base). When the value is greater than 1.0 then $me is faster than $base. When it is less than 1.0 then $base is faster (perf regression!).

The base version is RocksDB 6.0.2.

Results: byrx

The byrx tests use a cached database. The performance summary is here.

The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

Summary:

fillseq has new CPU overhead in 7.0 from code added for correctness checks and QPS has been stable since then
QPS for other tests has been stable, with some variance, since late 6.x

Results: iobuf

The iodir tests use an IO-bound database with buffered. The performance summary is here.

The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

Summary:

fillseq has been stable since 7.6
readww has always been stable
overwrite improved in 7.6 and has been stable since then
fwdrangeww and revrangeww improved in late 6.0 and have been stable since then

Results: iodir

The iodir tests use an IO-bound database with O_DIRECT. The performance summary is here.

The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

Summary:

fillseq has been stable since 7.6
readww has always been stable
overwrite improved in 7.6 and has been stable since then
fwdrangeww and revrangeww have been stable but there is some variance

Saturday, November 9, 2024

Fixing some of the InnoDB scan perf regressions in a MySQL fork

I recently learned of Advanced MySQL, a MySQL fork, and ran my sysbench benchmarks for it. It fixed some, but not all, of the regressions for write heavy workloads that landed in InnoDB after MySQL 8.0.28.

In response to my results, the project lead filed a bug for performance regressions and then quickly came up with a diff. The bug in this case is for regressions that are most obvious during full table scans and the problems arrived in MySQL 8.0.29 and 8.0.30 -- see bug 111538 and this post. The bug is closed for upstream but the perf regressions remain so I am excited to see the community working to solve this problem.

tl;dr

Advanced MySQL with the fix removes much of the regression in scan performance

Builds

I tried 4 builds

my8028 - upstream MySQL 8.0.28
my8040 - upstream MySQL 8.0.40
my8040adv_pre - Advanced MySQL 8.0.40 without the fix (without d347cdb)
my8040adv_post - Advanced MySQL 8.0.40 with the fix (at d347cdb)

Hardware

The servers are

dell32

Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32-Cores, 2 m.2 SSD (each 2TB, RAID SW 0, ext4).

ax162-s

AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G RAM, Ubuntu 22.04 and ext4 on 2 NVMe devices with SW RAID 1. This is in the Hetzner cloud.

bee

Beelink SER 4700u with Ryzen 7 4700u, 16G RAM, Ubuntu 22.04 and ext4 on NVMe

Benchmark

I used sysbench and my usage is explained here. A full run has 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by InnoDB.

The benchmark is run with ...

dell32 - 8 tables, 10M rows per table and 24 threads
ax162-s - 8 tables, 10M rows per table and 40 threads
bee - 1 table, 30M rows and 1 thread

Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

Results: overview

All of the results use relative QPS (rQPS) where:

rQPS is: (QPS for my version / QPS for base version)
base version is the QPS from MySQL 8.0.28
my version is one of the other versions

Here I only share the results for the scan microbenchmark.

Results: dell32

Summary

Summary
- QPS with the fix in Advanced MySQL is ~9% better than without the fix
- QPS with the fix in Advanced MySQL is ~2% better than my8040.
- I am not sure why my8040adv_pre did much worse than my8040

From the relative QPS results the QPS with my8040adv_pre was ~15% less than my8028. But my8040adv_post is only ~7% slower than my8028 so it removes half of the regression.

Relative to: my8028

col-1 : my8040

col-2 : my8040adv_pre

col-3 : my8040adv_post

col-1 col-2 col-3

0.91 0.85 0.93 scan

From vmstat and iostat metrics CPU overhead for my8040adv_pre was ~22% larger than my8028. But with the fix the CPU overhead for my8040adv_post is only ~8% larger than my8028. This is great.

--- absolute

cpu/o cs/o r/o rKB/o wKB/o o/s dbms

0.093496 3.256 0 0 0.006 246 my8028

0.106105 4.065 0 0 0.006 225 my8040

0.113878 4.344 0 0 0.006 208 my8040adv_pre

0.101104 3.978 0 0 0.006 228 my8040adv_post

--- relative to first result

1.13 1.25 1 1 1.00 0.91 my8040

1.22 1.33 1 1 1.00 0.85 my8040adv_pre

1.08 1.22 1 1 1.00 0.93 my8040adv_post

Results: ax162-s

Summary

QPS is ~18% larger with the fix in Advanced MySQL
CPU overhead is ~15% smaller with the fix

From the relative QPS results the QPS with my8040adv_pre was the same as my8040 and both were ~17% slower than my8028. But my8040adv_post is only ~2% slower than my8028 which is excellent.

Relative to: my8028

col-1 : my8040

col-2 : my8040adv_pre

col-3 : my8040adv_post

col-1 col-2 col-3

0.83 0.83 0.98 scan

From vmstat and iostat metrics CPU overhead for my8040 and my8040adv_pre were ~20% larger than my8028. But with the fix the CPU overhead for my8040adv_post is only ~3% larger than my8028. This is great.

--- absolute

cpu/o cs/o r/o rKB/o wKB/o o/s dbms

0.018767 0.552 0 0 0.052 872 my8028

0.022533 0.800 0 0 0.013 725 my8040

0.022499 0.808 0 0.001 0.034 727 my8040adv_pre

0.019305 0.731 0 0 0.03 851 my8040adv_post

--- relative to first result

1.20 1.45 1 1 0.25 0.83 my8040

1.20 1.46 1 inf 0.65 0.83 my8040adv_pre

1.03 1.32 1 1 0.58 0.98 my8040adv_post

Results: bee

Summary:

QPS is ~17% larger with the fix in Advanced MySQL
CPU overhead is ~15% smaller with the fix

I did not test my8040adv_pre on this server.

From the relative QPS results the QPS with my8040 is ~22% less than my8028. But QPS from my8040adv_post is only ~9% less than my8028. This is great.

Relative to: my8028

col-1 : my8040

col-2 : my8040adv_post

col-1 col-2

0.78 0.91 scan

From vmstat and iostat metrics CPU overhead for my8040 was ~28% larger than my8028. But with the fix the CPU overhead for my8040adv_post is only ~3% larger than my8028. This is great.

--- absolute

cpu/o cs/o r/o rKB/o wKB/o o/s dbms

0.222553 2.534 0 0.001 0.035 55 my8028

0.285792 7.622 0 0 0.041 43 my8040

0.246404 6.475 0 0 0.036 50 my8040adv_post

--- relative to first result

1.28 3.01 1 0.00 1.17 0.78 my8040

1.11 2.56 1 0.00 1.03 0.91 my8040adv_post

RocksDB benchmarks: large server, leveled compaction

I recently shared benchmark results for RocksDB a few weeks ago for both leveled and universal compaction on a small server. This post has results from a large server with leveled compaction.

tl;dr

there are a few regressions from bug 12038
QPS for overwrite is ~1.5X to ~2X better in 9.x than 6.0 (ignoring bug 12038)
otherwise QPS in 9.x is similar to 6.x

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Builds

I compiled db_bench from source on all servers. I used versions:

6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1 and 9.7.3

Benchmark

All tests used the default value for compaction_readahead_size and the block cache (LRU).

The tests on the charts are named as:

fillseq -- load in key order with the WAL disabled
revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
fwdrangeww -- like revrangeww except do short forward range scans
readww - like revrangeww except do point queries
overwrite - do overwrites (Put) as fast as possible

Workloads

There are three workloads, all of which use 40 threads:

byrx - the database is cached by RocksDB (100M KV pairs)
iobuf - the database is larger than memory and RocksDB uses buffered IO (2B KV pairs)
iodir - the database is larger than memory and RocksDB uses O_DIRECT (2B KV pairs)

A spreadsheet with all results is here and performance summaries with more details are here for byrx, iobuf and iodir.

Relative QPS

The base version is RocksDB 6.0.2.

Results: byrx

The byrx tests use a cached database. The performance summary is here.

This chart shows the relative QPS for a given version relative to RocksDB 6.0.2. The y-axis doesn't start at 0 in the second chart to improve readability for some lines.

Summary:

fillseq is worse from 6.0 to 8.0 but stable since then
overwrite has large improvements late in 6.0 and small improvements since then
fwdrangeww has small improvements in early 7.0 and is stable since then
revrangeww and readww are stable from 6.0 through 9.

Results: iobuf

The iobuf tests use an IO-bound database with buffered IO. The performance summary is here.

This chart shows the relative QPS for a given version relative to RocksDB 6.0.2. The y-axis doesn't start at 0 in the second chart to improve readability for some lines.

Summary:

bug 12038 explains the drop in throughput for overwrite since 8.6.7
otherwise QPS in 9.x is similar to 6.0

Results: iodir

The iodir tests use an IO-bound database with O_DIRECT. The performance summary is here.

This chart shows the relative QPS for a given version relative to RocksDB 6.0.2. The y-axis doesn't start at 0 in the second chart to improve readability for some lines.

Summary:

the QPS drop for overwrite in 8.6.7 occurs because the db_bench client wasn't updated to use the new default value for compaction readahead size
QPS for overwrite is ~2X better in 9.x relative to 6.0
otherwise QPS in 9.x is similar to 6.0

todo

Tuesday, November 5, 2024

RocksDB on a big server: LRU vs hyperclock

This has benchmark results for RocksDB using a big (48-core) server. I ran tests to document the impact of the the block cache type (LRU vs hyperclock) and a few other configuration choices for a CPU-bound workload. A previous post with great results for the hyperclock block cache is here.

tl;dr

read QPS is up to ~3X better with auto_hyper_clock_cache vs LRU
read QPS is up to ~1.3X better with the per-level fanout set to 32 vs 8
read QPS drops by ~15% as the background write rate increases from 2 to 32 M/s

Software

I used RocksDB 9.6, compiled with gcc 11.4.0.

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Benchmark

Overviews on how I use db_bench are here and here.

All of my tests here use a CPU-bound workload with a database that is cached by RocksDB and are repeated for 1, 10, 20 and 40 threads.

I focus on the readwhilewriting benchmark where performance is reported for the reads (point queries) while there is a fixed rate for writes done in the background. I prefer to measure read performance when there are concurrent writes because read-only benchmarks with an LSM suffer from non-determinism as the state (shape) of the LSM tree has a large impact on CPU overhead and throughput.

To save time I did not run the fwdrangewhilewriting benchmark. Were I to repeat this work I would include it because the results from it would be interesting for a few of the configuration options I compared.

I did tests to understand the following:

LRU vs auto_hyper_clock_cache for the block cache implementation

LRU is the original implementation. The code was simple, which is nice. The implementation for LRU is sharded with a mutex per shard and that mutex can become a hot spot. The hyperclock implementation is much better at avoiding hot spots.

per level fanout (8 vs 32)

By per level fanout I mean the value of --max_bytes_for_level_multiplier which determines the target size difference between adjacent levels. By default I use 8, while 10 is also a common choice. Here I compare 8 vs 32. When the fanout is larger the LSM tree has fewer levels -- meaning there are fewer places to check for data which should reduce CPU overhead and increase QPS.

background write rate

I repeated tests with the background write rate (--benchmark_write_rate_limit) set to 2, 8 and 32 MB/s. With a higher write rate there is more chance for interference between reads and writes. The interference might be from mutex contention, compaction threads using more CPU, more L0 files to check or more data in levels L1 and larger.

target size for L0

By target size I mean the number of files in the L0 that trigger compaction. The db_bench option for this is --level0_file_num_compaction_trigger. When the value is larger there will be more L0 files on average that a query might have to check and that means there is more CPU overhead. Unfortunately, I configured RocksDB incorrectly so I don't have results to share. The issue is that when the L0 is configured to be larger, the L1 should be configured to be at least as large as the L0 (L1 target size should be >= sizeof(SST) * num(L0 files). If not, then L0->L1 compaction will happen sooner than expected.

All of the results are in this spreadsheet.

Results: LRU vs auto_hyper_clock_cache

These graphs have QPS from the readwhilewriting benchmark for the LRU and AHCC block cache implementations where LRU is the original version with a sharded hash table and a mutex per shard while AHCC is the hyper clock cache (--cache_type=auto_hyper_clock_cache).

Summary:

QPS is much better with AHCC than LRU (~3.3X faster at 40 threads)
QPS with AHCC scales linearly with the thread count
QPS with LRU does not scale linearly and suffers from mutex contention
There are some odd effects in the results for 1 thread

With a 2M/s background write rate AHCC is ~1.1X faster at 1 thread and ~3.3X faster at 40 threads relative to LRU.

With an 8M/s background write rate AHCC is ~1.1X faster at 1 thread and ~3.3X faster at 40 threads relative to LRU.

With a 32M/s background write rate AHCC is ~1.1X faster at 1 thread and ~2.9X faster at 40 threads relative to LRU.

Results: per level fanout

These graphs have QPS from the readwhilewriting benchmark to compare results with per-level fanout set to 8 and 32.

Summary

QPS is often 1.1X to 1.3X larger with fanout=32 vs fanout=8

With an 8M/s background write rate and LRU, fanout=8 is faster at 1 thread but then fanout=32 is from 1.1X to 1.3X faster at 10 to 40 threads.

With an 8M/s background write rate and AHCC, fanout=8 is faster at 1 thread but then fanout=32 is ~1.1X faster at 10 to 40 threads.

With a 32M/s background write rate and LRU, fanout=8 is ~2X faster at 1 thread but then fanout=32 is from 1.1X to 1.2X faster at 10 to 40 threads.

With a 32M/s background write rate and AHCC, fanout=8 is ~2X faster at 1 thread but then fanout=32 is ~1.1X faster at 10 to 40 threads.

Results: background write rate

Summary:

With LRU

QPS drops by up to ~15% as the background write rate grows from 2M/s to 32M/s
QPS does not scale linearly and suffers from mutex contention

With AHCC

QPS drops by up to 13% as the background write rate grows from 2M/s to 32M/s
QPS scales linearly with the thread count

There are some odd effects in the results for 1 thread

Results with LRU show that per-thread QPS doesn't scale linearly

Results with AHCC show that per-thread QPS scales linearly ignoring the odd results for 1 thread

Monday, November 4, 2024

RocksDB benchmarks: small server, universal compaction

I shared benchmark results for RocksDB a few weeks ago using leveled compaction and a small server. Here I have results for universal compaction and the same small server.

tl;dr

in general the there are some improvements and some small regressions with one exception (see bug 12038)
for a cached database

From RocksDB 6.0.2 to 9.x QPS drops by ~10% for fillseq and ~15% for other tests
Performance has been stable since 7.x

for an IO-bound database with buffered IO

bug 12038 hurts QPS for overwrite (will be fixed soon in 9.7)
QPS is otherwise stable

for an IO-bound database with O_DIRECT

QPS for fillseq and overwrite is ~10% less in 9.7 vs 6.0.2 and has been stable since 7.0
QPS for read-heavy tests is ~5% better in RocksDB 9.7.2 vs 6.0.2

Hardware

The small server is named SER7 and is a Beelink SER7 7840HS (see here) with 8 cores, AMD SMT disabled, a Ryzen 7 7840HS CPU, Ubuntu 22.04. Storage is ext4 with data=writeback and 1 NVMe device.

The storage device has 128 for max_hw_sectors_kb and max_sectors_kb. This is relevant for bug 12038 which will be fixed real soon in a 9.7 patch release.

Builds

I compiled db_bench from source on all servers. I used versions:

6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1, 9.6.2, 9.7.2, 9.7.4 and 9.8.1

Benchmark

All tests used the default value for compaction_readahead_size. For all versions tested I used the default values for the block cache (LRU) and format_version.

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 1 thread for the small server and 8 threads for the medium server. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is:

# Small server, SER7: use 1 thread, 20M KV pairs for cached, 400M for IO-bound

bash x3.sh 1 no 1800 c8r32 20000000 400000000 byrx iobuf iodir

The tests on the charts are named as:

fillseq -- load in key order with the WAL disabled
revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
fwdrangeww -- like revrangeww except do short forward range scans
readww - like revrangeww except do point queries
overwrite - do overwrites (Put) as fast as possible

Workloads

There are three workloads, all of which use one client (thread):

byrx - the database is cached by RocksDB
iobuf - the database is larger than memory and RocksDB uses buffered IO
iodir - the database is larger than memory and RocksDB uses O_DIRECT

A spreadsheet with all results is here and performance summaries with more details are here for byrx, for iobuf and for iodir.

Relative QPS

The base version is RocksDB 6.0.2.

Results: byrx

The byrx tests use a cached database. The performance summary is here.

The charts show the relative QPS for a given version of RocksDB 6.0.2. There are two charts with the same data and the y-axis on the second doesn't start at 0 to improve readability.

Summary:

From RocksDB 6.0.2 to 9.x QPS drops by ~10% for fillseq and ~15% for other tests
Performance has been stable since 7.x

Results: iobuf

The iobuf tests use a database larger than memory with buffered IO. The performance summary is here.

The charts show the relative QPS for a given version of RocksDB 6.0.2. There are two charts with the same data and the y-axis on the second doesn't start at 0 to improve readability.

Summary:

bug 12038 explains the regression for overwrite (fixed soon in 9.7)
QPS for fillseq has been stable
QPS for revrangeww, fwdrangeww and readww is stable. I am not sure about the variance in 9.6 and 9.7 releases. The cause might be that universal (tiered) is more prone to variance. I will revisit that when I run tests again in a few months.

Results: iodir

The iodir tests use a database larger than memory with O_DIRECT. The performance summary is here.

The charts show the relative QPS for a given version of RocksDB 6.0.2. There are two charts with the same data and the y-axis on the second doesn't start at 0 to improve readability.

Summary:

QPS for fillseq and overwrite is ~10% less in 9.7 vs 6.0.2 and has been stable since 7.0. My vague memory is that the issue is new CPU overhead from better error checking.
QPS for read-heavy tests is ~5% better in RocksDB 9.7.2 vs 6.0.2

Friday, November 1, 2024

Too many performance regressions for InnoDB in MySQL 8.0.29 and 8.0.30

There are many changes to InnoDB in MySQL 8.0.29 and 8.0.30. And many changes increases the chance of performance regressions. A recent report on this from me is here. In the worst-case the throughput drops almost in half for the update-index microbenchmark. This blog post identifies the diff that causes the regression.

tl;dr

this diff causes the update-index regression. The diff adds support for instant add/drop column to InnoDB and landed in 8.0.29, then 8.0.29 was retracted and the diff is new to most of us in 8.0.30.
MySQL 8.0.40 fixes regressions for many of the read-only microbenchmarks. I failed to acknowledge that in my previous post. But the large regression for long range scans remains, especially at lower concurrency levels. And the regression for update-index also remains in 8.0.40.
I remain confused about why this problem is so much worse on one of my servers (dell32) than on my other large servers

The problems that arrived in InnoDB after 8.0.28 include:

bug 111538 - compile-time inlining was changed leading to less-inlining for performance critical code and more CPU overhead. This bug has been closed as of 8.0.40 but the perf regression is still here.
bug 116463 - InnoDB busy-wait loops wait ~20% longer on average. This change wasn't intended but arrived as part of rewriting hash functions used by InnoDB. The impact from this is more CPU overhead and more context switches when there is contention. I filed this bug recently and expect it to be fixed soon.
new hash functions - the hash functions used by InnoDB were rewritten and while the new algorithms are likely better at hashing uniformly, they changes also brought a few bugs and bug fixes
I filed bug 116531 for the update-index regression

My list could have been longer but I always disable the InnoDB adaptive hash index and missed the bugs that arrived from all of the improvements to it.

Release notes

A few relevant comments from the release notes.

From the 8.0.29 release notes

InnoDB: To improve code quality and facilitate debugging, instances of #define in the InnoDB sources were replaced by constexpr specifiers or inline functions. (WL #14680)
InnoDB: InnoDB now supports ALTER TABLE ... DROP COLUMN operations using ALGORITHM=INSTANT.
Operations that support ALGORITHM=INSTANT only modify metadata in the data dictionary. Table data is unaffected, making the operations instantaneous. If not specified explicitly, ALGORITHM=INSTANT is used by default by DDL operations that support it.
Prior to MySQL 8.0.29, an instantly added column could only be added as the last column of the table. From MySQL 8.0.29, an instantly added column can be added to any position in the table.
Instantly added or dropped columns create a new row version. Up to 64 row versions are permitted. A new TOTAL_ROW_VERSIONS column was added to the INFORMATION_SCHEMA.INNODB_TABLES table to track the number of row versions.
For more information about DDL operations that support ALGORITHM=INSTANT, see Online DDL Operations. (WL #13899)

From the 8.0.30 release notes

InnoDB: Hash and random generator functions in the InnoDB sources were improved. (Bug #16739204, Bug #23584861)

From the 8.0.34 release notes

InnoDB: Changes in hashing functions made in MySQL 8.0.30 had an adverse effect on performance. (Bug #34870256)

From the 8.0.40 release notes

Performance; InnoDB: Several functions internal to InnoDB, which were defined as inline in MySQL 8.0.28, were found to be no longer inline in MySQL 8.0.33, due in part to refactoring which accompanied improvements made in MySQL 8.0.30 to improve the InnoDB adaptive hash index. This had an adverse effect on queries using joins on InnoDB tables. (Bug #111538, Bug #35531293)
References: This issue is a regression of: Bug #81814, Bug #16739204, Bug #23584861.

Builds

I compiled everything from source. For MySQL 8.0.28 and 8.0.40 I used the community download. For MySQL 8.0.29 and the the two mid-release builds I used the git repo. The 8.0.29 release was retracted because of a bug. The builds are:

my8028_rel_o2nofp

MySQL 8.0.28 community release

my8029_rel_o2nofp

MySQL 8.0.29 from the git repo

my8040_rel_o2nofp

MySQL 8.0.40 community release

my8029_rel_o2nofp_preidrop_155cf3d902

this is after 8.0.28 but prior to 8.0.29 and is the diff immediately prior to the diff that adds support for instant add/drop column. It is here in github.

my8029_rel_o2nofp_drop_e8f422de

this is after 8.0.28 but prior to 8.0.29 and is the diff that adds support for instant add/drop column. It is here in github.

The my.cnf files are here for 8.0.28/8.0.29 and for 8.0.40. I use different my.cnf files because innodb_redo_log_capacity arrived in 8.0.30.

Hardware

The servers are

dell32

Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32-Cores, 2 m.2 SSD (each 2TB, RAID SW 0, ext4).

ax162-s

AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G RAM, Ubuntu 22.04 and ext4 on 2 NVMe devices with SW RAID 1. This is in the Hetzner cloud.

Benchmark

I used sysbench and my usage is explained here. A full run has 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by InnoDB.

The benchmark is run with 8 tables and 10M rows per table. I used 24 threads for the dell32 server and 40 threads for the ax162-s server. Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

The command lines for my helper script were:

bash r.sh 8 10000000 300 600 nvm 1 1 24 # for dell32

bash r.sh 8 10000000 300 600 md2 1 1 40 # for ax162-s

Results: overview

All of the results use relative QPS (rQPS) where:

rQPS is: (QPS for my version / QPS for base version)
base version is the QPS from MySQL 8.0.28
my version is one of the other versions

The problem arrives in the diff that adds instant add/drop column support, which is in the my8029_rel_o2nofp_drop_e8f422de build.

Throughput relative to: my8028_rel_o2nofp

col-1 : my8029_rel_o2nofp_preidrop_155cf3d902

col-2 : my8029_rel_o2nofp_drop_e8f422de

col-3 : my8029_rel_o2nofp

col-4 : my8040_rel_o2nofp

1.01 0.55 0.56 0.56 update-index_range=100 -> dell32

0.99 0.78 0.80 0.90 update-index_range=100 -> ax162-s

Results: dell32

Summary

Throughput on update-index drops almost in half starting in 8.0.29. The problem arrives in the diff that adds instant add/drop column support. The problem remains in 8.0.40.
Metrics from vmstat and iostat are here and the results for update-index show a ~1.4X increase in both CPU overhead (cpu/o) and context switch rates (cs/o) per update statement.
Throughput per microbenchmark is here for relative (csv, tsv) and absolute (csv, tsv)

Throughput relative to: my8028_rel_o2nofp

col-1 : my8029_rel_o2nofp_preidrop_155cf3d902

col-2 : my8029_rel_o2nofp_drop_e8f422de

col-3 : my8029_rel_o2nofp

col-4 : my8040_rel_o2nofp

col-1 col-2 col-3 col-4

1.02 1.00 1.02 1.06 hot-points_range=100

0.98 1.00 0.98 0.95 point-query.pre_range=100

0.99 1.00 0.99 0.95 point-query_range=100

1.00 0.96 0.97 1.05 points-covered-pk.pre_range=100

0.99 0.96 0.97 1.05 points-covered-pk_range=100

0.99 0.94 0.96 1.04 points-covered-si.pre_range=100

1.00 0.95 0.97 1.05 points-covered-si_range=100

1.00 0.96 0.98 1.05 points-notcovered-pk.pre_range=100

1.00 0.96 0.97 1.05 points-notcovered-pk_range=100

1.00 0.96 0.97 1.02 points-notcovered-si.pre_range=100

1.00 0.94 0.96 1.01 points-notcovered-si_range=100

0.97 0.96 0.97 1.89 random-points.pre_range=1000

0.99 0.96 0.97 1.05 random-points.pre_range=100

1.00 0.98 0.98 0.96 random-points.pre_range=10

0.98 0.97 0.97 1.91 random-points_range=1000

0.99 0.96 0.97 1.05 random-points_range=100

1.00 0.98 0.98 0.96 random-points_range=10

0.98 0.97 0.98 0.99 range-covered-pk.pre_range=100

0.98 0.97 0.98 0.99 range-covered-pk_range=100

0.98 0.98 0.97 0.99 range-covered-si.pre_range=100

0.99 0.98 0.98 0.99 range-covered-si_range=100

0.98 0.99 0.98 0.96 range-notcovered-pk.pre_range=100

0.98 0.98 0.98 0.96 range-notcovered-pk_range=100

1.00 0.96 0.97 0.97 range-notcovered-si.pre_range=100

1.00 0.96 0.97 0.97 range-notcovered-si_range=100

0.99 0.99 0.99 1.01 read-only.pre_range=10000

0.99 0.99 0.98 0.95 read-only.pre_range=100

0.98 0.99 0.99 0.95 read-only.pre_range=10

1.00 1.00 1.00 1.01 read-only_range=10000

0.99 0.98 0.99 0.95 read-only_range=100

0.99 0.99 0.98 0.95 read-only_range=10

0.99 0.93 0.93 0.91 scan_range=100

1.00 0.99 0.99 0.93 delete_range=100

1.00 1.00 1.00 0.94 insert_range=100

0.99 0.99 0.99 0.95 read-write_range=100

0.99 0.99 0.99 0.95 read-write_range=10

1.01 0.55 0.56 0.56 update-index_range=100

1.00 0.99 0.99 1.03 update-inlist_range=100

1.00 1.00 1.00 0.95 update-nonindex_range=100

1.00 0.99 1.00 0.94 update-one_range=100

1.00 1.00 1.00 0.96 update-zipf_range=100

1.00 0.98 0.97 0.94 write-only_range=10000

Results: ax162-s

Summary

Throughput on update-index drops by 20% starting in 8.0.29. The problem arrives in the diff that adds instant add/drop column support. The regression isn't as bad in 8.0.40 where the drop is only ~10% relative to 8.0.28.
Metrics from vmstat and iostat are here and the results for update-index show a ~1.15X increase in both CPU overhead (cpu/o) and context switch rates (cs/o) per update statement in 8.0.29 and smaller (1.09X for CPU, 1.03X for context switches) in 8.0.40.
Throughput per microbenchmark is here for relative (csv, tsv) and absolute (csv, tsv)

Throughput relative to: x.my8028_rel_o2nofp

col-1 : x.my8029_rel_o2nofp_preidrop_155cf3d902

col-2 : x.my8029_rel_o2nofp_drop_e8f422de

col-3 : x.my8029_rel_o2nofp

col-4 : x.my8040_rel_o2nofp

col-1 col-2 col-3 col-4

0.97 0.99 1.00 0.99 hot-points_range=100

1.00 0.99 0.98 0.94 point-query.pre_range=100

0.98 0.99 0.98 0.94 point-query_range=100

1.00 1.01 1.00 1.02 points-covered-pk.pre_range=100

0.99 1.01 1.00 1.02 points-covered-pk_range=100

1.00 1.00 0.98 0.98 points-covered-si.pre_range=100

1.00 1.00 0.98 0.99 points-covered-si_range=100

1.00 1.02 1.00 1.02 points-notcovered-pk.pre_range=100

0.99 1.01 1.00 1.02 points-notcovered-pk_range=100

0.99 0.99 0.96 0.96 points-notcovered-si.pre_range=100

0.99 0.99 0.97 0.96 points-notcovered-si_range=100

1.00 1.02 0.99 1.72 random-points.pre_range=1000

0.99 1.01 1.00 1.02 random-points.pre_range=100

1.01 1.01 0.99 0.94 random-points.pre_range=10

0.99 1.01 0.98 1.72 random-points_range=1000

0.99 1.01 1.00 1.02 random-points_range=100

1.01 1.00 0.99 0.94 random-points_range=10

1.01 0.98 0.95 0.95 range-covered-pk.pre_range=100

1.00 0.97 0.95 0.95 range-covered-pk_range=100

1.00 0.99 0.96 0.96 range-covered-si.pre_range=100

1.00 0.99 0.97 0.96 range-covered-si_range=100

1.00 0.99 0.96 0.94 range-notcovered-pk.pre_range=100

1.00 0.99 0.97 0.94 range-notcovered-pk_range=100

1.00 1.00 0.98 0.93 range-notcovered-si.pre_range=100

0.99 0.99 0.98 0.93 range-notcovered-si_range=100

1.00 0.99 0.99 1.01 read-only.pre_range=10000

0.99 0.97 0.98 0.94 read-only.pre_range=100

1.00 0.98 0.98 0.94 read-only.pre_range=10

1.00 0.99 0.99 1.00 read-only_range=10000

0.99 0.97 0.97 0.93 read-only_range=100

1.00 0.99 0.98 0.94 read-only_range=10

0.93 0.81 0.86 0.83 scan_range=100

1.00 0.99 1.02 0.93 delete_range=100

1.00 1.00 1.01 0.94 insert_range=100

0.99 0.98 0.98 0.95 read-write_range=100

0.99 0.98 0.98 0.95 read-write_range=10

0.99 0.78 0.80 0.90 update-index_range=100

0.96 0.95 0.98 1.02 update-inlist_range=100

1.01 1.01 1.02 0.96 update-nonindex_range=100

1.00 0.99 1.00 0.95 update-one_range=100

1.00 1.00 1.01 0.95 update-zipf_range=100

0.99 0.97 0.98 0.92 write-only_range=10000