Tuesday, November 5, 2024

RocksDB on a big server: LRU vs hyperclock

This has benchmark results for RocksDB using a big (48-core) server. I ran tests to document the impact of the block cache type (LRU vs hyperclock) and a few other configuration choices for a CPU-bound workload. A previous post with great results for the hyperclock block cache is here.

tl;dr

  • read QPS is up to ~3X better with auto_hyper_clock_cache vs LRU
  • read QPS is up to ~1.3X better with the per-level fanout set to 32 vs 8
  • read QPS drops by ~15% as the background write rate increases from 2 MB/s to 32 MB/s
Software

I used RocksDB 9.6, compiled with gcc 11.4.0.

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Benchmark

Overviews on how I use db_bench are here and here.

All of my tests here use a CPU-bound workload with a database that is cached by RocksDB and are repeated for 1, 10, 20 and 40 threads. 

I focus on the readwhilewriting benchmark where performance is reported for the reads (point queries) while there is a fixed rate for writes done in the background. I prefer to measure read performance when there are concurrent writes because read-only benchmarks with an LSM suffer from non-determinism as the state (shape) of the LSM tree has a large impact on CPU overhead and throughput.
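
To make that concrete, here is a minimal sketch of a readwhilewriting invocation with a rate-limited writer. The flag names are from db_bench, but the values and the omitted tuning options are illustrative and not the exact command lines used for these tests.

    # Illustrative only: measure point-query QPS while writes happen at a fixed rate
    ./db_bench --benchmarks=readwhilewriting \
      --use_existing_db=1 --num=100000000 --threads=40 --duration=1800 \
      --benchmark_write_rate_limit=$((8 * 1024 * 1024))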

To save time I did not run the fwdrangewhilewriting benchmark. Were I to repeat this work I would include it because the results from it would be interesting for a few of the configuration options I compared.

I did tests to understand the following (a db_bench sketch with these options follows the list):

  • LRU vs auto_hyper_clock_cache for the block cache implementation
    • LRU is the original implementation. The code was simple, which is nice. The implementation for LRU is sharded with a mutex per shard and that mutex can become a hot spot. The hyperclock implementation is much better at avoiding hot spots.
  • per level fanout (8 vs 32)
    • By per level fanout I mean the value of --max_bytes_for_level_multiplier which determines the target size difference between adjacent levels. By default I use 8, while 10 is also a common choice. Here I compare 8 vs 32. When the fanout is larger the LSM tree has fewer levels -- meaning there are fewer places to check for data which should reduce CPU overhead and increase QPS.
  • background write rate
    • I repeated tests with the background write rate (--benchmark_write_rate_limit) set to 2, 8 and 32 MB/s. With a higher write rate there is more chance for interference between reads and writes. The interference might be from mutex contention, compaction threads using more CPU, more L0 files to check or more data in levels L1 and larger.
  • target size for L0
    • By target size I mean the number of files in the L0 that triggers compaction. The db_bench option for this is --level0_file_num_compaction_trigger. When the value is larger there will be more L0 files on average that a query might have to check, which means more CPU overhead. Unfortunately, I configured RocksDB incorrectly so I don't have results to share. The issue is that when the L0 is configured to be larger, the L1 should be configured to be at least as large as the L0 (L1 target size should be >= sizeof(SST) * num(L0 files)). If not, then L0->L1 compaction will happen sooner than expected.
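
The sketch below shows how these options appear on a db_bench command line and works through the L0/L1 sizing constraint mentioned above. The values are illustrative and not the exact settings used for these tests.

    # Illustrative fragment: with write_buffer_size=64M and an L0 trigger of 8,
    # the L0 can hold roughly 8 * 64M = 512M, so the L1 target
    # (max_bytes_for_level_base) should be at least 512M to avoid early L0->L1 compaction
    ./db_bench ... \
      --max_bytes_for_level_multiplier=8 \
      --level0_file_num_compaction_trigger=8 \
      --write_buffer_size=$((64 * 1024 * 1024)) \
      --max_bytes_for_level_base=$((512 * 1024 * 1024))
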
All of the results are in this spreadsheet.

Results: LRU vs auto_hyper_clock_cache

These graphs have QPS from the readwhilewriting benchmark for the LRU and AHCC block cache implementations. LRU is the original version with a sharded hash table and a mutex per shard, while AHCC is the hyperclock cache (--cache_type=auto_hyper_clock_cache).
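
For reference, this is how the two implementations are selected. The auto_hyper_clock_cache value appears above; I believe the LRU counterpart is lru_cache, but treat the exact flag values and the cache size as assumptions rather than the settings used here.

    # Illustrative: the same test run with each block cache implementation
    ./db_bench ... --cache_size=$((32 * 1024 * 1024 * 1024)) --cache_type=lru_cache
    ./db_bench ... --cache_size=$((32 * 1024 * 1024 * 1024)) --cache_type=auto_hyper_clock_cache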

Summary:
  • QPS is much better with AHCC than LRU (~3.3X faster at 40 threads)
  • QPS with AHCC scales linearly with the thread count
  • QPS with LRU does not scale linearly and suffers from mutex contention
  • There are some odd effects in the results for 1 thread
With a 2 MB/s background write rate AHCC is ~1.1X faster at 1 thread and ~3.3X faster at 40 threads relative to LRU.
With an 8 MB/s background write rate AHCC is ~1.1X faster at 1 thread and ~3.3X faster at 40 threads relative to LRU.
With a 32 MB/s background write rate AHCC is ~1.1X faster at 1 thread and ~2.9X faster at 40 threads relative to LRU.

Results: per level fanout

These graphs have QPS from the readwhilewriting benchmark to compare results with per-level fanout set to 8 and 32.

Summary
  • QPS is often 1.1X to 1.3X larger with fanout=32 vs fanout=8

With an 8 MB/s background write rate and LRU, fanout=8 is faster at 1 thread but then fanout=32 is from 1.1X to 1.3X faster at 10 to 40 threads.
With an 8 MB/s background write rate and AHCC, fanout=8 is faster at 1 thread but then fanout=32 is ~1.1X faster at 10 to 40 threads.

With a 32 MB/s background write rate and LRU, fanout=8 is ~2X faster at 1 thread but then fanout=32 is from 1.1X to 1.2X faster at 10 to 40 threads.
With a 32 MB/s background write rate and AHCC, fanout=8 is ~2X faster at 1 thread but then fanout=32 is ~1.1X faster at 10 to 40 threads.
Results: background write rate

Summary:
  • With LRU
    • QPS drops by up to ~15% as the background write rate grows from 2 MB/s to 32 MB/s
    • QPS does not scale linearly and suffers from mutex contention
  • With AHCC
    • QPS drops by up to 13% as the background write rate grows from 2 MB/s to 32 MB/s
    • QPS scales linearly with the thread count
  • There are some odd effects in the results for 1 thread
Results with LRU show that per-thread QPS doesn't scale linearly
Results with AHCC show that per-thread QPS scales linearly ignoring the odd results for 1 thread



Monday, November 4, 2024

RocksDB benchmarks: small server, universal compaction

I shared benchmark results for RocksDB a few weeks ago using leveled compaction and a small server. Here I have results for universal compaction and the same small server.

tl;dr
  • in general there are some improvements and some small regressions with one exception (see bug 12038)
  • for a cached database
    • From RocksDB 6.0.2 to 9.x QPS drops by ~10% for fillseq and ~15% for other tests
    • Performance has been stable since 7.x
  • for an IO-bound database with buffered IO
    •  bug 12038 hurts QPS for overwrite (will be fixed soon in 9.7)
    • QPS is otherwise stable 
  • for an IO-bound database with O_DIRECT
    • QPS for fillseq and overwrite is ~10% less in 9.7 vs 6.0.2 and has been stable since 7.0
    • QPS for read-heavy tests is ~5% better in RocksDB 9.7.2 vs 6.0.2
Hardware

The small server is named SER7 and is a Beelink SER7 7840HS (see here) with 8 cores, AMD SMT disabled, a Ryzen 7 7840HS CPU, Ubuntu 22.04. Storage is ext4 with data=writeback and 1 NVMe device. 

The storage device has 128 for max_hw_sectors_kb and max_sectors_kb. This is relevant for bug 12038 which will be fixed real soon in a 9.7 patch release.

Builds

I compiled db_bench from source on all servers. I used versions:
  • 6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
  • 7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
  • 8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
  • 9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1, 9.6.2, 9.7.2, 9.7.4 and 9.8.1
Benchmark

All tests used the default value for compaction_readahead_size. For all versions tested I used the default values for the block cache (LRU) and format_version.

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 1 thread for the small server and 8 threads for the medium server. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is:

    # Small server, SER7: use 1 thread, 20M KV pairs for cached, 400M for IO-bound
    bash x3.sh 1 no 1800 c8r32 20000000 400000000 byrx iobuf iodir

The tests on the charts are named as:
  • fillseq -- load in key order with the WAL disabled
  • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
  • fwdrangeww -- like revrangeww except do short forward range scans
  • readww - like revrangeww except do point queries
  • overwrite - do overwrites (Put) as fast as possible
Workloads

There are three workloads, all of which use one client (thread):

  • byrx - the database is cached by RocksDB
  • iobuf - the database is larger than memory and RocksDB uses buffered IO
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT

A spreadsheet with all results is here and performance summaries with more details are here for byrx, for iobuf and for iodir.

Relative QPS

The numbers in the spreadsheet and on the y-axis in the charts that follow are the relative QPS which is (QPS for $me) / (QPS for $base). When the value is greater than 1.0 then $me is faster than $base. When it is less than 1.0 then $base is faster (perf regression!).

The base version is RocksDB 6.0.2.
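
For example (with made-up numbers), if 6.0.2 sustains 100,000 QPS on a test and the version being evaluated sustains 90,000 QPS, then the relative QPS is 90000 / 100000 = 0.90, which is a 10% regression.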

Results: byrx

The byrx tests use a cached database. The performance summary is here.

The charts show QPS for a given RocksDB version relative to RocksDB 6.0.2. There are two charts with the same data and the y-axis on the second doesn't start at 0 to improve readability.

Summary:
  • From RocksDB 6.0.2 to 9.x QPS drops by ~10% for fillseq and ~15% for other tests
  • Performance has been stable since 7.x
Results: iobuf

The iobuf tests use a database larger than memory with buffered IO. The performance summary is here.

The charts show QPS for a given RocksDB version relative to RocksDB 6.0.2. There are two charts with the same data and the y-axis on the second doesn't start at 0 to improve readability.

Summary:
  • bug 12038 explains the regression for overwrite (fixed soon in 9.7)
  • QPS for fillseq has been stable
  • QPS for revrangeww, fwdrangeww and readww is stable. I am not sure about the variance in 9.6 and 9.7 releases. The cause might be that universal (tiered) is more prone to variance. I will revisit that when I run tests again in a few months.
Results: iodir

The iodir tests use a database larger than memory with O_DIRECT. The performance summary is here.

The charts show QPS for a given RocksDB version relative to RocksDB 6.0.2. There are two charts with the same data and the y-axis on the second doesn't start at 0 to improve readability.

Summary:
  • QPS for fillseq and overwrite is ~10% less in 9.7 vs 6.0.2 and has been stable since 7.0. My vague memory is that the issue is new CPU overhead from better error checking.
  • QPS for read-heavy tests is ~5% better in RocksDB 9.7.2 vs 6.0.2


Friday, November 1, 2024

Too many performance regressions for InnoDB in MySQL 8.0.29 and 8.0.30

There are many changes to InnoDB in MySQL 8.0.29 and 8.0.30, and many changes increase the chance of performance regressions. A recent report on this from me is here. In the worst case the throughput drops almost in half for the update-index microbenchmark. This blog post identifies the diff that causes the regression.

tl;dr
  • this diff causes the update-index regression. The diff adds support for instant add/drop column to InnoDB and landed in 8.0.29; because 8.0.29 was retracted, for most users the diff first arrives in 8.0.30.
  • MySQL 8.0.40 fixes regressions for many of the read-only microbenchmarks. I failed to acknowledge that in my previous post. But the large regression for long range scans remains, especially at lower concurrency levels. And the regression for update-index also remains in 8.0.40.
  • I remain confused about why this problem is so much worse on one of my servers (dell32) than on my other large servers
The problems that arrived in InnoDB after 8.0.28 include:
  • bug 111538 - compile-time inlining was changed leading to less-inlining for performance critical code and more CPU overhead. This bug has been closed as of 8.0.40 but the perf regression is still here.
  • bug 116463 - InnoDB busy-wait loops wait ~20% longer on average. This change wasn't intended but arrived as part of rewriting hash functions used by InnoDB. The impact from this is more CPU overhead and more context switches when there is contention. I filed this bug recently and expect it to be fixed soon.
  • new hash functions - the hash functions used by InnoDB were rewritten and while the new algorithms are likely better at hashing uniformly, the changes also brought a few bugs and bug fixes
  • I filed bug 116531 for the update-index regression
My list could have been longer but I always disable the InnoDB adaptive hash index and missed the bugs that arrived from all of the improvements to it.

Release notes

A few relevant comments from the release notes (a small usage sketch follows the list).


  • InnoDB: To improve code quality and facilitate debugging, instances of #define in the InnoDB sources were replaced by constexpr specifiers or inline functions. (WL #14680)

  • InnoDB: InnoDB now supports ALTER TABLE ... DROP COLUMN operations using ALGORITHM=INSTANT.

    Operations that support ALGORITHM=INSTANT only modify metadata in the data dictionary. Table data is unaffected, making the operations instantaneous. If not specified explicitly, ALGORITHM=INSTANT is used by default by DDL operations that support it.

    Prior to MySQL 8.0.29, an instantly added column could only be added as the last column of the table. From MySQL 8.0.29, an instantly added column can be added to any position in the table.

    Instantly added or dropped columns create a new row version. Up to 64 row versions are permitted. A new TOTAL_ROW_VERSIONS column was added to the INFORMATION_SCHEMA.INNODB_TABLES table to track the number of row versions.

    For more information about DDL operations that support ALGORITHM=INSTANT, see Online DDL Operations. (WL #13899)


  • InnoDB: Hash and random generator functions in the InnoDB sources were improved. (Bug #16739204, Bug #23584861)


  • InnoDB: Changes in hashing functions made in MySQL 8.0.30 had an adverse effect on performance. (Bug #34870256)


  • Performance; InnoDB: Several functions internal to InnoDB, which were defined as inline in MySQL 8.0.28, were found to be no longer inline in MySQL 8.0.33, due in part to refactoring which accompanied improvements made in MySQL 8.0.30 to improve the InnoDB adaptive hash index. This had an adverse effect on queries using joins on InnoDB tables. (Bug #111538, Bug #35531293)

    References: This issue is a regression of: Bug #81814, Bug #16739204, Bug #23584861.

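To make the instant DDL behavior concrete, here is a small usage sketch. The table and column names are hypothetical; the syntax and the TOTAL_ROW_VERSIONS column are from the release notes quoted above.

    # Hypothetical example: instant add then drop of a column (metadata-only changes)
    mysql -e "ALTER TABLE test.t1 ADD COLUMN c2 INT, ALGORITHM=INSTANT"
    mysql -e "ALTER TABLE test.t1 DROP COLUMN c2, ALGORITHM=INSTANT"
    # Each instant add/drop creates a new row version (at most 64 are permitted)
    mysql -e "SELECT NAME, TOTAL_ROW_VERSIONS FROM INFORMATION_SCHEMA.INNODB_TABLES WHERE NAME='test/t1'"
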
Builds

I compiled everything from source. For MySQL 8.0.28 and 8.0.40 I used the community download. For MySQL 8.0.29 and the two mid-release builds I used the git repo. The 8.0.29 release was retracted because of a bug. The builds are:
  • my8028_rel_o2nofp
    • MySQL 8.0.28 community release
  • my8029_rel_o2nofp
    • MySQL 8.0.29 from the git repo
  • my8040_rel_o2nofp
    • MySQL 8.0.40 community release
  • my8029_rel_o2nofp_preidrop_155cf3d902
    • this is after 8.0.28 but prior to 8.0.29 and is the diff immediately prior to the diff that adds support for instant add/drop column. It is here in github.
  • my8029_rel_o2nofp_drop_e8f422de
    • this is after 8.0.28 but prior to 8.0.29 and is the diff that adds support for instant add/drop column. It is here in github.
The my.cnf files are here for 8.0.28/8.0.29 and for 8.0.40. I use different my.cnf files because innodb_redo_log_capacity arrived in 8.0.30.
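
To illustrate why the files differ (the values below are made up and are not the settings from the my.cnf files linked above): before 8.0.30 the redo log is sized with two options, and from 8.0.30 a single option replaces them.

    # my.cnf sketch, illustrative values only
    # 8.0.28 and earlier: redo capacity = innodb_log_file_size * innodb_log_files_in_group
    innodb_log_file_size=16G
    innodb_log_files_in_group=2
    # 8.0.30 and later
    innodb_redo_log_capacity=32G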

Hardware

The servers are
  • dell32
    • Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32-Cores, 2 m.2 SSD (each 2TB, RAID SW 0, ext4). 
  • ax162-s
    • AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G RAM, Ubuntu 22.04 and ext4 on 2 NVMe devices with SW RAID 1. This is in the Hetzner cloud.
Benchmark

I used sysbench and my usage is explained here. A full run has 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by InnoDB.

The benchmark is run with 8 tables and 10M rows per table. I used 24 threads for the dell32 server and 40 threads for the ax162-s server. Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

The command lines for my helper script were:
    bash r.sh 8 10000000 300 600 nvm 1 1 24 # for dell32
    bash r.sh 8 10000000 300 600 md2 1 1 40 # for ax162-s

Results: overview

All of the results use relative QPS (rQPS) where:
  • rQPS is: (QPS for my version / QPS for base version)
  • base version is the QPS from MySQL 8.0.28
  • my version is one of the other versions
The problem arrives in the diff that adds instant add/drop column support, which is in the my8029_rel_o2nofp_drop_e8f422de build.

Throughput relative to: my8028_rel_o2nofp
col-1 : my8029_rel_o2nofp_preidrop_155cf3d902
col-2 : my8029_rel_o2nofp_drop_e8f422de
col-3 : my8029_rel_o2nofp
col-4 : my8040_rel_o2nofp

1.01    0.55    0.56    0.56    update-index_range=100 -> dell32
0.99    0.78    0.80    0.90    update-index_range=100 -> ax162-s

Results: dell32

Summary
  • Throughput on update-index drops almost in half starting in 8.0.29. The problem arrives in the diff that adds instant add/drop column support. The problem remains in 8.0.40.
  • Metrics from vmstat and iostat are here and the results for update-index show a ~1.4X increase in both CPU overhead (cpu/o) and context switch rates (cs/o) per update statement.
  • Throughput per microbenchmark is here for relative (csv, tsv) and absolute (csv, tsv)
Throughput relative to: my8028_rel_o2nofp
col-1 : my8029_rel_o2nofp_preidrop_155cf3d902
col-2 : my8029_rel_o2nofp_drop_e8f422de
col-3 : my8029_rel_o2nofp
col-4 : my8040_rel_o2nofp

col-1   col-2   col-3   col-4
1.02    1.00    1.02    1.06    hot-points_range=100
0.98    1.00    0.98    0.95    point-query.pre_range=100
0.99    1.00    0.99    0.95    point-query_range=100
1.00    0.96    0.97    1.05    points-covered-pk.pre_range=100
0.99    0.96    0.97    1.05    points-covered-pk_range=100
0.99    0.94    0.96    1.04    points-covered-si.pre_range=100
1.00    0.95    0.97    1.05    points-covered-si_range=100
1.00    0.96    0.98    1.05    points-notcovered-pk.pre_range=100
1.00    0.96    0.97    1.05    points-notcovered-pk_range=100
1.00    0.96    0.97    1.02    points-notcovered-si.pre_range=100
1.00    0.94    0.96    1.01    points-notcovered-si_range=100
0.97    0.96    0.97    1.89    random-points.pre_range=1000
0.99    0.96    0.97    1.05    random-points.pre_range=100
1.00    0.98    0.98    0.96    random-points.pre_range=10
0.98    0.97    0.97    1.91    random-points_range=1000
0.99    0.96    0.97    1.05    random-points_range=100
1.00    0.98    0.98    0.96    random-points_range=10
0.98    0.97    0.98    0.99    range-covered-pk.pre_range=100
0.98    0.97    0.98    0.99    range-covered-pk_range=100
0.98    0.98    0.97    0.99    range-covered-si.pre_range=100
0.99    0.98    0.98    0.99    range-covered-si_range=100
0.98    0.99    0.98    0.96    range-notcovered-pk.pre_range=100
0.98    0.98    0.98    0.96    range-notcovered-pk_range=100
1.00    0.96    0.97    0.97    range-notcovered-si.pre_range=100
1.00    0.96    0.97    0.97    range-notcovered-si_range=100
0.99    0.99    0.99    1.01    read-only.pre_range=10000
0.99    0.99    0.98    0.95    read-only.pre_range=100
0.98    0.99    0.99    0.95    read-only.pre_range=10
1.00    1.00    1.00    1.01    read-only_range=10000
0.99    0.98    0.99    0.95    read-only_range=100
0.99    0.99    0.98    0.95    read-only_range=10
0.99    0.93    0.93    0.91    scan_range=100
1.00    0.99    0.99    0.93    delete_range=100
1.00    1.00    1.00    0.94    insert_range=100
0.99    0.99    0.99    0.95    read-write_range=100
0.99    0.99    0.99    0.95    read-write_range=10
1.01    0.55    0.56    0.56    update-index_range=100
1.00    0.99    0.99    1.03    update-inlist_range=100
1.00    1.00    1.00    0.95    update-nonindex_range=100
1.00    0.99    1.00    0.94    update-one_range=100
1.00    1.00    1.00    0.96    update-zipf_range=100
1.00    0.98    0.97    0.94    write-only_range=10000

Results: ax162-s

Summary
  • Throughput on update-index drops by 20% starting in 8.0.29. The problem arrives in the diff that adds instant add/drop column support. The regression isn't as bad in 8.0.40 where the drop is only ~10% relative to 8.0.28.
  • Metrics from vmstat and iostat are here and the results for update-index show a ~1.15X increase in both CPU overhead (cpu/o) and context switch rates (cs/o) per update statement in 8.0.29 and smaller (1.09X for CPU, 1.03X for context switches) in 8.0.40.
  • Throughput per microbenchmark is here for relative (csv, tsv) and absolute (csv, tsv)
Throughput relative to: x.my8028_rel_o2nofp
col-1 : x.my8029_rel_o2nofp_preidrop_155cf3d902
col-2 : x.my8029_rel_o2nofp_drop_e8f422de
col-3 : x.my8029_rel_o2nofp
col-4 : x.my8040_rel_o2nofp

col-1   col-2   col-3   col-4
0.97    0.99    1.00    0.99    hot-points_range=100
1.00    0.99    0.98    0.94    point-query.pre_range=100
0.98    0.99    0.98    0.94    point-query_range=100
1.00    1.01    1.00    1.02    points-covered-pk.pre_range=100
0.99    1.01    1.00    1.02    points-covered-pk_range=100
1.00    1.00    0.98    0.98    points-covered-si.pre_range=100
1.00    1.00    0.98    0.99    points-covered-si_range=100
1.00    1.02    1.00    1.02    points-notcovered-pk.pre_range=100
0.99    1.01    1.00    1.02    points-notcovered-pk_range=100
0.99    0.99    0.96    0.96    points-notcovered-si.pre_range=100
0.99    0.99    0.97    0.96    points-notcovered-si_range=100
1.00    1.02    0.99    1.72    random-points.pre_range=1000
0.99    1.01    1.00    1.02    random-points.pre_range=100
1.01    1.01    0.99    0.94    random-points.pre_range=10
0.99    1.01    0.98    1.72    random-points_range=1000
0.99    1.01    1.00    1.02    random-points_range=100
1.01    1.00    0.99    0.94    random-points_range=10
1.01    0.98    0.95    0.95    range-covered-pk.pre_range=100
1.00    0.97    0.95    0.95    range-covered-pk_range=100
1.00    0.99    0.96    0.96    range-covered-si.pre_range=100
1.00    0.99    0.97    0.96    range-covered-si_range=100
1.00    0.99    0.96    0.94    range-notcovered-pk.pre_range=100
1.00    0.99    0.97    0.94    range-notcovered-pk_range=100
1.00    1.00    0.98    0.93    range-notcovered-si.pre_range=100
0.99    0.99    0.98    0.93    range-notcovered-si_range=100
1.00    0.99    0.99    1.01    read-only.pre_range=10000
0.99    0.97    0.98    0.94    read-only.pre_range=100
1.00    0.98    0.98    0.94    read-only.pre_range=10
1.00    0.99    0.99    1.00    read-only_range=10000
0.99    0.97    0.97    0.93    read-only_range=100
1.00    0.99    0.98    0.94    read-only_range=10
0.93    0.81    0.86    0.83    scan_range=100
1.00    0.99    1.02    0.93    delete_range=100
1.00    1.00    1.01    0.94    insert_range=100
0.99    0.98    0.98    0.95    read-write_range=100
0.99    0.98    0.98    0.95    read-write_range=10
0.99    0.78    0.80    0.90    update-index_range=100
0.96    0.95    0.98    1.02    update-inlist_range=100
1.01    1.01    1.02    0.96    update-nonindex_range=100
1.00    0.99    1.00    0.95    update-one_range=100
1.00    1.00    1.01    0.95    update-zipf_range=100
0.99    0.97    0.98    0.92    write-only_range=10000



Tuesday, October 29, 2024

Trying out Advanced MySQL

I recently learned of the Advanced MySQL project on github via a tweet. There is a book and a repo for an enhanced version of 8.0.40. I wish I had the time to read some of that book and learn more about the enhancements but for now I just ran my sysbench tests for it on a few large servers. Hopefully I will also run the Insert Benchmark for it on the same servers.

tl;dr

  • Advanced MySQL is good for performance
  • On the dell32 server that has a large regression for the update-index microbenchmark in MySQL 8.0.40, the Advanced MySQL project fixes most of the regression
  • On the ax162-s server for which the update-index regression is smaller, the Advanced MySQL project still does a lot better than upstream
Builds

I used upstream MySQL 8.0.28 and 8.0.40 as well as the Advanced MySQL version of 8.0.40. All were compiled from source using CMAKE_BUILD_TYPE=Release, -O2 and -fno-omit-frame-pointer.
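
A minimal sketch of such a build follows; the directory layout, parallelism and any extra cmake options are illustrative rather than the exact commands used here.

    # Illustrative build commands
    mkdir build && cd build
    cmake .. -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_C_FLAGS="-O2 -fno-omit-frame-pointer" \
      -DCMAKE_CXX_FLAGS="-O2 -fno-omit-frame-pointer"
    make -j$(nproc)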

The builds are named:
  • my8028_rel_o2nofp - upstream MySQL 8.0.28
  • my8040_rel_o2nofp - upstream MySQL 8.0.40
  • my8040adv_rel_o2nofp - Advanced MySQL 8.0.40
The my.cnf files are here for 8.0.28 and for 8.0.40.

Hardware

The servers are
  • dell32
    • Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32-Cores, 2 m.2 SSD (each 2TB, RAID SW 0, ext4). 
  • ax162-s
    • AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G RAM, Ubuntu 22.04 and ext4 on 2 NVMe devices with SW RAID 1. This is in the Hetzner cloud.
Benchmark

I used sysbench and my usage is explained here. A full run has 42 microbenchmarks and most test only 1 type of SQL statement.

The benchmark is run with ...
  • dell32 - 24 threads, 8 tables, 10M rows/table
  • ax162-s - 40 threads, 8 tables, 10M rows/table
Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

Results: overview

All of the results use relative QPS (rQPS) where:
  • rQPS is: (QPS for my version / QPS for base version)
  • base version is the QPS from MySQL 8.0.28
  • my version is one of the other versions (8.0.40 from upstream, 8.0.40 from Advanced MySQL)
Results: dell32

Summary
  • 8.0.40 from upstream and Advanced MySQL have similar performance on most microbenchmarks
  • Advanced MySQL does a lot better on the update-index microbenchmark while upstream has a large regression that started in 8.0.30 (and perhaps in 8.0.29). 
Relative to: my8028_rel_o2nofp
col-1 : my8040_rel_o2nofp
col-2 : my8040adv_rel_o2nofp

col-1   col-2
1.06    1.07    hot-points_range=100
0.95    0.95    point-query.pre_range=100
0.95    0.95    point-query_range=100
1.05    1.05    points-covered-pk.pre_range=100
1.05    1.05    points-covered-pk_range=100
1.04    1.06    points-covered-si.pre_range=100
1.05    1.07    points-covered-si_range=100
1.05    1.05    points-notcovered-pk.pre_range=100
1.05    1.06    points-notcovered-pk_range=100
1.02    1.05    points-notcovered-si.pre_range=100
1.01    1.04    points-notcovered-si_range=100
1.89    1.89    random-points.pre_range=1000
1.05    1.05    random-points.pre_range=100
0.96    0.96    random-points.pre_range=10
1.91    1.92    random-points_range=1000
1.05    1.05    random-points_range=100
0.96    0.96    random-points_range=10
0.99    0.97    range-covered-pk.pre_range=100
0.99    0.97    range-covered-pk_range=100
0.99    0.96    range-covered-si.pre_range=100
0.99    0.97    range-covered-si_range=100
0.96    0.97    range-notcovered-pk.pre_range=100
0.96    0.96    range-notcovered-pk_range=100
0.97    0.99    range-notcovered-si.pre_range=100
0.97    0.98    range-notcovered-si_range=100
1.01    1.01    read-only.pre_range=10000
0.95    0.95    read-only.pre_range=100
0.95    0.95    read-only.pre_range=10
1.01    1.01    read-only_range=10000
0.95    0.95    read-only_range=100
0.95    0.95    read-only_range=10
0.91    0.85    scan_range=100
0.93    1.00    delete_range=100
0.94    0.96    insert_range=100
0.95    0.95    read-write_range=100
0.95    0.95    read-write_range=10
0.56    0.84    update-index_range=100
1.03    1.03    update-inlist_range=100
0.95    0.97    update-nonindex_range=100
0.94    0.94    update-one_range=100
0.96    0.97    update-zipf_range=100
0.94    0.98    write-only_range=10000

From vmstat metrics both the CPU overhead (cpu/o) and context switch rate (cs/o) are much lower in Advanced MySQL 8.0.40. Metrics from vmstat and iostat for all of the microbenchmarks are here.

sb.met.update-index.range100.pk1.dop40
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000550        12.157  0       0       8.354   83129   my8028_rel_o2nofp
0.000600        12.514  0       0       11.416  74917   my8040_rel_o2nofp
0.000469        8.063   0       0       7.437   98180   my8040adv_rel_o2nofp
--- relative to first result
1.09            1.03    1       1       1.37    0.90    my8040_rel_o2nofp
0.85            0.66    1       1       0.89    1.18    my8040adv_rel_o2nofp

sb.met.update-nonindex.range100.pk1.dop40
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000363        12.236  0       0       2.611   114970  my8028_rel_o2nofp
0.000383        12.235  0       0       2.794   109896  my8040_rel_o2nofp
0.000310        8.128   0       0       2.663   121196  my8040adv_rel_o2nofp
--- relative to first result
1.06            1.00    1       1       1.07    0.96    my8040_rel_o2nofp
0.85            0.66    1       1       1.02    1.05    my8040adv_rel_o2nofp

Results: ax162-s

Summary
  • 8.0.40 from upstream and Advanced MySQL have similar performance on most microbenchmarks
  • Advanced MySQL does a lot better on the update-index microbenchmark while upstream has a large regression that started in 8.0.30 (and perhaps in 8.0.29). 

Relative to: my8028_rel_o2nofp
col-1 : my8040_rel_o2nofp
col-2 : my8040adv_rel_o2nofp

col-1   col-2
1.06    1.07    hot-points_range=100
0.95    0.95    point-query.pre_range=100
0.95    0.95    point-query_range=100
1.05    1.05    points-covered-pk.pre_range=100
1.05    1.05    points-covered-pk_range=100
1.04    1.06    points-covered-si.pre_range=100
1.05    1.07    points-covered-si_range=100
1.05    1.05    points-notcovered-pk.pre_range=100
1.05    1.06    points-notcovered-pk_range=100
1.02    1.05    points-notcovered-si.pre_range=100
1.01    1.04    points-notcovered-si_range=100
1.89    1.89    random-points.pre_range=1000
1.05    1.05    random-points.pre_range=100
0.96    0.96    random-points.pre_range=10
1.91    1.92    random-points_range=1000
1.05    1.05    random-points_range=100
0.96    0.96    random-points_range=10
0.99    0.97    range-covered-pk.pre_range=100
0.99    0.97    range-covered-pk_range=100
0.99    0.96    range-covered-si.pre_range=100
0.99    0.97    range-covered-si_range=100
0.96    0.97    range-notcovered-pk.pre_range=100
0.96    0.96    range-notcovered-pk_range=100
0.97    0.99    range-notcovered-si.pre_range=100
0.97    0.98    range-notcovered-si_range=100
1.01    1.01    read-only.pre_range=10000
0.95    0.95    read-only.pre_range=100
0.95    0.95    read-only.pre_range=10
1.01    1.01    read-only_range=10000
0.95    0.95    read-only_range=100
0.95    0.95    read-only_range=10
0.91    0.85    scan_range=100
0.93    1.00    delete_range=100
0.94    0.96    insert_range=100
0.95    0.95    read-write_range=100
0.95    0.95    read-write_range=10
0.56    0.84    update-index_range=100
1.03    1.03    update-inlist_range=100
0.95    0.97    update-nonindex_range=100
0.94    0.94    update-one_range=100
0.96    0.97    update-zipf_range=100
0.94    0.98    write-only_range=10000

From vmstat metrics both the CPU overhead (cpu/o) and context switch rate (cs/o) are much lower in Advanced MySQL 8.0.40. Metrics from vmstat and iostat for all of the microbenchmarks are here.

sb.met.update-index.range100.pk1.dop24
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.001185        10.777  0       0       14.275  51319   my8028_rel_o2nofp
0.001781        15.242  0       0       33.139  28966   my8040_rel_o2nofp
0.001328        9.994   0       0       21.068  43145   my8040adv_rel_o2nofp
--- relative to first result
1.50            1.41    1       1       2.32    0.56    my8040_rel_o2nofp
1.12            0.93    1       1       1.48    0.84    my8040adv_rel_o2nofp

sb.met.update-nonindex.range100.pk1.dop24
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000610        9.890   0       0       2.788   84660   my8028_rel_o2nofp
0.000645        10.006  0       0       3.215   80750   my8040_rel_o2nofp
0.000590        7.445   0       0       3.123   81943   my8040adv_rel_o2nofp
--- relative to first result
1.06            1.01    1       1       1.15    0.95    my8040_rel_o2nofp
0.97            0.75    1       1       1.12    0.97    my8040adv_rel_o2nofp



Thursday, October 24, 2024

RocksDB benchmarks: small server, leveled compaction

I shared benchmark results for RocksDB a few weeks ago and there was a suggestion for me to repeat tests using different (older) values for format_version. Then while replacing a failed SSD, I also updated the OS and changed a few kernel-related config options. Thus, I ended up repeating all tests.

This post has results from a small server with leveled compaction. Results from a large server and from universal compaction are in progress.

tl;dr - on a small server with a low concurrency workload

  • older values of format_version (2 thru 5) don't impact QPS
  • auto hyperclock cache makes read-heavy tests up to 15% faster
  • for a cached database
    • QPS drops by 5% to 15% from RocksDB 6.0.2 to 9.7.2
    • QPS has been stable since 8.0
  • for an IO-bound database with buffered IO
    •  bug 12038 hurts QPS for overwrite (will be fixed soon in 9.7)
    • QPS for fillseq has been stable
    • QPS for read-heavy tests is 15% to 20% better in RocksDB 9.7.2 vs 6.0.2
  • for an IO-bound database with O_DIRECT
    • QPS for fillseq is ~11% less in 9.7.2 vs 6.0.2 but has been stable since 7.0. My vague memory is that the issue is new CPU overhead from better error checking.
    • QPS for overwrite is stable
    • QPS for read-heavy tests is 16% to 38% better in RocksDB 9.7.1 vs 6.0.2

Hardware

The small server is named SER7 and is a Beelink SER7 7840HS (see here) with 8 cores, AMD SMT disabled, a Ryzen 7 7840HS CPU, Ubuntu 22.04. Storage is ext4 with data=writeback and 1 NVMe device. 

The storage device has 128 for max_hw_sectors_kb and max_sectors_kb. This is relevant for bug 12038 which will be fixed real soon in a 9.7 patch release.

Builds

I compiled db_bench from source on all servers. I used versions:
  • 6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
  • 7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
  • 8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
  • 9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1 and 9.7.2 at git sha b5cde68b8a
Benchmark

All tests used the default value for compaction_readahead_size. For all versions tested I used the default values for the block cache (LRU) and format_version. For 9.6.1 I repeated tests using the hyperclock cache (default is LRU) and format_version=2, =3, =4 and =5 (default is =6).

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 1 thread for the small server and 8 threads for the medium server. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is:

    # Small server, SER7: use 1 thread, 20M KV pairs for cached, 400M for IO-bound
    bash x3.sh 1 no 1800 c8r32 20000000 400000000 byrx iobuf iodir

The tests on the charts are named as:
  • fillseq -- load in key order with the WAL disabled
  • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
  • fwdrangeww -- like revrangeww except do short forward range scans
  • readww - like revrangeww except do point queries
  • overwrite - do overwrites (Put) as fast as possible
Workloads

There are three workloads, all of which use one client (thread):

  • byrx - the database is cached by RocksDB
  • iobuf - the database is larger than memory and RocksDB uses buffered IO
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT

A spreadsheet with all results is here and performance summaries with more details are linked below:
Relative QPS

The numbers in the spreadsheet and on the y-axis in the charts that follow are the relative QPS which is (QPS for $me) / (QPS for $base). When the value is greater than 1.0 then $me is faster than $base. When it is less than 1.0 then $base is faster (perf regression!).

The base version is RocksDB 6.0.2 for the all-versions tests and RocksDB 9.6.1 with my standard configuration for the 9.6-variations tests.

Results: byrx with 9.6 variations

The byrx tests use a cached database. The performance summary is here. This has results for RocksDB 9.6.1 using my standard configuration and the variations listed below (a sketch of the matching db_bench flags follows the list):
  • fv2 - uses format_version=2 instead of the default (=6)
  • fv3 - uses format_version=3
  • fv4 - uses format_version=4
  • fv5 - uses format_version=5
  • ahcc - uses auto_hyper_clock_cache instead of the default (LRU)
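The sketch below maps those variations to db_bench flags. Treat the flag names as assumptions (they match what I believe db_bench exposes); the rest of the command line is elided.

    # Illustrative fragments for the 9.6.1 variations
    ./db_bench ... --format_version=2                      # fv2 (likewise =3, =4, =5)
    ./db_bench ... --cache_type=auto_hyper_clock_cache     # ahcc (the default cache is LRU)
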
This chart shows the relative QPS for RocksDB 9.6.1 with a given configuration relative to 9.6.1 with my standard configuration. The y-axis doesn't start at 0 to improve readability.

Summary:
  • Using different values of format_version doesn't have a large impact here
  • Using auto hyperclock instead of LRU improves read-heavy QPS by up to 15%
Results: byrx with all versions

The byrx tests use a cached database. The performance summary is here.

This chart shows QPS for a given RocksDB version relative to RocksDB 6.0.2. The y-axis doesn't start at 0 to improve readability.

Summary:
  • QPS drops by 5% to 15% from RocksDB 6.0.2 to 9.7.2
  • Performance has been stable since 8.0
  • For overwrite the excellent result in RocksDB 6.0.2 comes at the cost of bad write stalls (see pmax here)

Results: iobuf with all versions

The iobuf tests use a database larger than memory with buffered IO. The performance summary is here.

This chart shows QPS for a given RocksDB version relative to RocksDB 6.0.2. The y-axis doesn't start at 0 to improve readability.

Summary:
  • bug 12038 explains the regression for overwrite (fixed soon in 9.7)
  • QPS for fillseq has been stable
  • QPS for read-heavy tests is 15% to 20% better in RocksDB 9.7.2 vs 6.0.2
Results: iodir with all versions

The iodir tests use a database larger than memory with O_DIRECT. The performance summary is here.

This chart shows QPS for a given RocksDB version relative to RocksDB 6.0.2. The y-axis doesn't start at 0 to improve readability.

Summary:
  • QPS for fillseq is ~11% less in 9.7.2 vs 6.0.2 but has been stable since 7.0. My vague memory is that the issue is new CPU overhead from better error checking.
  • QPS for overwrite is stable
  • QPS for read-heavy tests is 16% to 38% better in RocksDB 9.7.1 vs 6.0.2
