Monday, November 4, 2024

RocksDB benchmarks: small server, universal compaction

I shared benchmark results for RocksDB a few weeks ago using leveled compaction and a small server. Here I have results for universal compaction and the same small server.

tl;dr
  • in general the there are some improvements and some small regressions with one exception (see  bug 12038)
  • for a cached database
    • From RocksDB 6.0.2 to 9.x QPS drops by ~10% for fillseq and ~15% for other tests
    • Performance has been stable since 7.x
  • for an IO-bound database with buffered IO
    •  bug 12038 hurts QPS for overwrite (will be fixed soon in 9.7)
    • QPS is otherwise stable 
  • for an IO-bound database with O_DIRECT
    • QPS for fillseq and overwrite is ~10% less in 9.7 vs 6.0.2 and has been stable since 7.0
    • QPS for read-heavy tests is ~5% better in RocksDB 9.7.2 vs 6.0.2
Hardware

The small server is named SER7 and is a Beelink SER7 7840HS (see here) with 8 cores, AMD SMT disabled, a Ryzen 7 7840HS CPU, Ubuntu 22.04. Storage is ext4 with data=writeback and 1 NVMe device. 

The storage device has 128 for max_hw_sectors_kb and max_sectors_kb. This is relevant for bug 12038 which will be fixed real soon in a 9.7 patch release.

Builds

I compiled db_bench from source on all servers. I used versions:
  • 6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
  • 7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
  • 8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
  • 9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1, 9.6.2, 9.7.2, 9.7.4 and 9.8.1
Benchmark

All tests used the default value for compaction_readahead_size. For all versions tested I used the default values for the block cache (LRU) and format_version.

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 1 thread for the small server and 8 threads for the medium server. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is:

    # Small server, SER7: use 1 thread, 20M KV pairs for cached, 400M for IO-bound
    bash x3.sh 1 no 1800 c8r32 20000000 400000000 byrx iobuf iodir

The tests on the charts are named as:
  • fillseq -- load in key order with the WAL disabled
  • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
  • fwdrangeww -- like revrangeww except do short forward range scans
  • readww - like revrangeww except do point queries
  • overwrite - do overwrites (Put) as fast as possible
Workloads

There are three workloads, all of which use one client (thread):

  • byrx - the database is cached by RocksDB
  • iobuf - the database is larger than memory and RocksDB uses buffered IO
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT

A spreadsheet with all results is here and performance summaries with more details are here for byrx, for iobuf and for iodir.

Relative QPS

The numbers in the spreadsheet and on the y-axis in the charts that follow are the relative QPS which is (QPS for $me) / (QPS for $base). When the value is greater than 1.0 then $me is faster than $base. When it is less than 1.0 then $base is faster (perf regression!).

The base version is RocksDB 6.0.2.

Results: byrx

The byrx tests use a cached database. The performance summary is here

The charts show the relative QPS for a given version of RocksDB 6.0.2. There are two charts with the same data and the y-axis on the second doesn't start at 0 to improve readability.

Summary:
  • From RocksDB 6.0.2 to 9.x QPS drops by ~10% for fillseq and ~15% for other tests
  • Performance has been stable since 7.x
Results: iobuf

The iobuf tests use a database larger than memory with buffered IO. The performance summary is here.

The charts show the relative QPS for a given version of RocksDB 6.0.2. There are two charts with the same data and the y-axis on the second doesn't start at 0 to improve readability.

Summary:
  • bug 12038 explains the regression for overwrite (fixed soon in 9.7)
  • QPS for fillseq has been stable
  • QPS for revrangeww, fwdrangeww and readww is stable. I am not sure about the variance in 9.6 and 9.7 releases. The cause might be that universal (tiered) is more prone to variance. I will revisit that when I run tests again in a few months.
Results: iodir

The iodir tests use a database larger than memory with O_DIRECT. The performance summary is here.

The charts show the relative QPS for a given version of RocksDB 6.0.2. There are two charts with the same data and the y-axis on the second doesn't start at 0 to improve readability.

Summary:
  • QPS for fillseq and overwrite is ~10% less in 9.7 vs 6.0.2 and has been stable since 7.0. My vague memory is that the issue is new CPU overhead from better error checking.
  • QPS for read-heavy tests is ~5% better in RocksDB 9.7.2 vs 6.0.2


Friday, November 1, 2024

Too many performance regressions for InnoDB in MySQL 8.0.29 and 8.0.30

There are many changes to InnoDB in MySQL 8.0.29 and 8.0.30. And many changes increases the chance of performance regressions. A recent report on this from me is here. In the worst-case the throughput drops almost in half for the update-index microbenchmark. This blog post identifies the diff that causes the regression.

tl;dr
  • this diff causes the update-index regression. The diff adds support for instant add/drop column to InnoDB and landed in 8.0.29, then 8.0.29 was retracted and the diff is new to most of us in 8.0.30.
  • MySQL 8.0.40 fixes regressions for many of the read-only microbenchmarks. I failed to acknowledge that in my previous post. But the large regression for long range scans remains, especially at lower concurrency levels. And the regression for update-index also remains in 8.0.40.
  • I remain confused about why this problem is so much worse on one of my servers (dell32) than on my other large servers
The problems that arrived in InnoDB after 8.0.28 include:
  • bug 111538 - compile-time inlining was changed leading to less-inlining for performance critical code and more CPU overhead. This bug has been closed as of 8.0.40 but the perf regression is still here.
  • bug 116463 - InnoDB busy-wait loops wait ~20% longer on average. This change wasn't intended but arrived as part of rewriting hash functions used by InnoDB. The impact from this is more CPU overhead and more context switches when there is contention. I filed this bug recently and expect it to be fixed soon.
  • new hash functions - the hash functions used by InnoDB were rewritten and while the new algorithms are likely better at hashing uniformly, they changes also brought a few bugs and bug fixes
  • I filed bug 116531 for the update-index regression
My list could have been longer but I always disable the InnoDB adaptive hash index and missed the bugs that arrived from all of the improvements to it.

Release notes

A few relevant comments from the release notes.


  • InnoDB: To improve code quality and facilitate debugging, instances of #define in the InnoDB sources were replaced by constexpr specifiers or inline functions. (WL #14680)

  • InnoDB: InnoDB now supports ALTER TABLE ... DROP COLUMN operations using ALGORITHM=INSTANT.

    Operations that support ALGORITHM=INSTANT only modify metadata in the data dictionary. Table data is unaffected, making the operations instantaneous. If not specified explicitly, ALGORITHM=INSTANT is used by default by DDL operations that support it.

    Prior to MySQL 8.0.29, an instantly added column could only be added as the last column of the table. From MySQL 8.0.29, an instantly added column can be added to any position in the table.

    Instantly added or dropped columns create a new row version. Up to 64 row versions are permitted. A new TOTAL_ROW_VERSIONS column was added to the INFORMATION_SCHEMA.INNODB_TABLES table to track the number of row versions.

    For more information about DDL operations that support ALGORITHM=INSTANT, see Online DDL Operations. (WL #13899)


  • InnoDB: Hash and random generator functions in the InnoDB sources were improved. (Bug #16739204, Bug #23584861)


  • InnoDB: Changes in hashing functions made in MySQL 8.0.30 had an adverse effect on performance. (Bug #34870256)


  • Performance; InnoDB: Several functions internal to InnoDB, which were defined as inline in MySQL 8.0.28, were found to be no longer inline in MySQL 8.0.33, due in part to refactoring which accompanied improvements made in MySQL 8.0.30 to improve the InnoDB adaptive hash index. This had an adverse effect on queries using joins on InnoDB tables. (Bug #111538, Bug #35531293)

    References: This issue is a regression of: Bug #81814, Bug #16739204, Bug #23584861.

Builds

I compiled everything from source. For MySQL 8.0.28 and 8.0.40 I used the community download. For MySQL 8.0.29 and the the two mid-release builds I used the git repo. The 8.0.29 release was retracted because of a bug. The builds are:
  • my8028_rel_o2nofp
    • MySQL 8.0.28 community release
  • my8029_rel_o2nofp
    • MySQL 8.0.29 from the git repo
  • my8040_rel_o2nofp
    • MySQL 8.0.40 community release
  • my8029_rel_o2nofp_preidrop_155cf3d902
    • this is after 8.0.28 but prior to 8.0.29 and is the diff immediately prior to the diff that adds support for instant add/drop column. It is here in github.
  • my8029_rel_o2nofp_drop_e8f422de
    • this is after 8.0.28 but prior to 8.0.29 and is the diff that adds support for instant add/drop column. It is here in github.
The my.cnf files are here for 8.0.28/8.0.29 and for 8.0.40. I use different my.cnf files because innodb_redo_log_capacity arrived in 8.0.30.

Hardware

The servers are
  • dell32
    • Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32-Cores, 2 m.2 SSD (each 2TB, RAID SW 0, ext4). 
  • ax162-s
    • AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G RAM, Ubuntu 22.04 and ext4 on 2 NVMe devices with SW RAID 1. This is in the Hetzner cloud.
Benchmark

I used sysbench and my usage is explained here. A full run has 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by InnoDB.

The benchmark is run with 8 tables and 10M rows per table. I used 24 threads for the dell32 server and 40 threads for the ax162-s server. Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

The command lines for my helper script were:
    bash r.sh 8 10000000 300 600 nvm 1 1 24 # for dell32
    bash r.sh 8 10000000 300 600 md2 1 1 40 # for ax162-s

Results: overview

All of the results use relative QPS (rQPS) where:
  • rQPS is: (QPS for my version / QPS for base version)
  • base version is the QPS from MySQL 8.0.28
  • my version is one of the other versions
The problem arrives in the diff that adds instant add/drop column support, which is in the my8029_rel_o2nofp_drop_e8f422de build.

Throughput relative to: my8028_rel_o2nofp
col-1 : my8029_rel_o2nofp_preidrop_155cf3d902
col-2 : my8029_rel_o2nofp_drop_e8f422de
col-3 : my8029_rel_o2nofp
col-4 : my8040_rel_o2nofp

1.01    0.55    0.56    0.56    update-index_range=100 -> dell32
0.99 0.78 0.80 0.90 update-index_range=100 -> ax162-s

Results: dell32

Summary
  • Throughput on update-index drops almost in half starting in 8.0.29. The problem arrives in the diff that adds instant add/drop column support. The problem remains in 8.0.40.
  • Metrics from vmstat and iostat are here and the results for update-index show a ~1.4X increase in both CPU overhead (cpu/o) and context switch rates (cs/o) per update statement.
  • Throughput per microbenchmark is here for relative (csv, tsv) and absolute (csv, tsv)
Throughput relative to: my8028_rel_o2nofp
col-1 : my8029_rel_o2nofp_preidrop_155cf3d902
col-2 : my8029_rel_o2nofp_drop_e8f422de
col-3 : my8029_rel_o2nofp
col-4 : my8040_rel_o2nofp

col-1   col-2   col-3   col-4
1.02    1.00    1.02    1.06    hot-points_range=100
0.98    1.00    0.98    0.95    point-query.pre_range=100
0.99    1.00    0.99    0.95    point-query_range=100
1.00    0.96    0.97    1.05    points-covered-pk.pre_range=100
0.99    0.96    0.97    1.05    points-covered-pk_range=100
0.99    0.94    0.96    1.04    points-covered-si.pre_range=100
1.00    0.95    0.97    1.05    points-covered-si_range=100
1.00    0.96    0.98    1.05    points-notcovered-pk.pre_range=100
1.00    0.96    0.97    1.05    points-notcovered-pk_range=100
1.00    0.96    0.97    1.02    points-notcovered-si.pre_range=100
1.00    0.94    0.96    1.01    points-notcovered-si_range=100
0.97    0.96    0.97    1.89    random-points.pre_range=1000
0.99    0.96    0.97    1.05    random-points.pre_range=100
1.00    0.98    0.98    0.96    random-points.pre_range=10
0.98    0.97    0.97    1.91    random-points_range=1000
0.99    0.96    0.97    1.05    random-points_range=100
1.00    0.98    0.98    0.96    random-points_range=10
0.98    0.97    0.98    0.99    range-covered-pk.pre_range=100
0.98    0.97    0.98    0.99    range-covered-pk_range=100
0.98    0.98    0.97    0.99    range-covered-si.pre_range=100
0.99    0.98    0.98    0.99    range-covered-si_range=100
0.98    0.99    0.98    0.96    range-notcovered-pk.pre_range=100
0.98    0.98    0.98    0.96    range-notcovered-pk_range=100
1.00    0.96    0.97    0.97    range-notcovered-si.pre_range=100
1.00    0.96    0.97    0.97    range-notcovered-si_range=100
0.99    0.99    0.99    1.01    read-only.pre_range=10000
0.99    0.99    0.98    0.95    read-only.pre_range=100
0.98    0.99    0.99    0.95    read-only.pre_range=10
1.00    1.00    1.00    1.01    read-only_range=10000
0.99    0.98    0.99    0.95    read-only_range=100
0.99    0.99    0.98    0.95    read-only_range=10
0.99    0.93    0.93    0.91    scan_range=100
1.00    0.99    0.99    0.93    delete_range=100
1.00    1.00    1.00    0.94    insert_range=100
0.99    0.99    0.99    0.95    read-write_range=100
0.99    0.99    0.99    0.95    read-write_range=10
1.01    0.55    0.56    0.56    update-index_range=100
1.00    0.99    0.99    1.03    update-inlist_range=100
1.00    1.00    1.00    0.95    update-nonindex_range=100
1.00    0.99    1.00    0.94    update-one_range=100
1.00    1.00    1.00    0.96    update-zipf_range=100
1.00    0.98    0.97    0.94    write-only_range=10000

Results: ax162-s

Summary
  • Throughput on update-index drops by 20% starting in 8.0.29. The problem arrives in the diff that adds instant add/drop column support. The regression isn't as bad in 8.0.40 where the drop is only ~10% relative to 8.0.28.
  • Metrics from vmstat and iostat are here and the results for update-index show a ~1.15X increase in both CPU overhead (cpu/o) and context switch rates (cs/o) per update statement in 8.0.29 and smaller (1.09X for CPU, 1.03X for context switches) in 8.0.40.
  • Throughput per microbenchmark is here for relative (csvtsv) and absolute (csvtsv)
Throughput relative to: x.my8028_rel_o2nofp
col-1 : x.my8029_rel_o2nofp_preidrop_155cf3d902
col-2 : x.my8029_rel_o2nofp_drop_e8f422de
col-3 : x.my8029_rel_o2nofp
col-4 : x.my8040_rel_o2nofp

col-1 col-2 col-3 col-4
0.97 0.99 1.00 0.99 hot-points_range=100
1.00 0.99 0.98 0.94 point-query.pre_range=100
0.98 0.99 0.98 0.94 point-query_range=100
1.00 1.01 1.00 1.02 points-covered-pk.pre_range=100
0.99 1.01 1.00 1.02 points-covered-pk_range=100
1.00 1.00 0.98 0.98 points-covered-si.pre_range=100
1.00 1.00 0.98 0.99 points-covered-si_range=100
1.00 1.02 1.00 1.02 points-notcovered-pk.pre_range=100
0.99 1.01 1.00 1.02 points-notcovered-pk_range=100
0.99 0.99 0.96 0.96 points-notcovered-si.pre_range=100
0.99 0.99 0.97 0.96 points-notcovered-si_range=100
1.00 1.02 0.99 1.72 random-points.pre_range=1000
0.99 1.01 1.00 1.02 random-points.pre_range=100
1.01 1.01 0.99 0.94 random-points.pre_range=10
0.99 1.01 0.98 1.72 random-points_range=1000
0.99 1.01 1.00 1.02 random-points_range=100
1.01 1.00 0.99 0.94 random-points_range=10
1.01 0.98 0.95 0.95 range-covered-pk.pre_range=100
1.00 0.97 0.95 0.95 range-covered-pk_range=100
1.00 0.99 0.96 0.96 range-covered-si.pre_range=100
1.00 0.99 0.97 0.96 range-covered-si_range=100
1.00 0.99 0.96 0.94 range-notcovered-pk.pre_range=100
1.00 0.99 0.97 0.94 range-notcovered-pk_range=100
1.00 1.00 0.98 0.93 range-notcovered-si.pre_range=100
0.99 0.99 0.98 0.93 range-notcovered-si_range=100
1.00 0.99 0.99 1.01 read-only.pre_range=10000
0.99 0.97 0.98 0.94 read-only.pre_range=100
1.00 0.98 0.98 0.94 read-only.pre_range=10
1.00 0.99 0.99 1.00 read-only_range=10000
0.99 0.97 0.97 0.93 read-only_range=100
1.00 0.99 0.98 0.94 read-only_range=10
0.93 0.81 0.86 0.83 scan_range=100
1.00 0.99 1.02 0.93 delete_range=100
1.00 1.00 1.01 0.94 insert_range=100
0.99 0.98 0.98 0.95 read-write_range=100
0.99 0.98 0.98 0.95 read-write_range=10
0.99 0.78 0.80 0.90 update-index_range=100
0.96 0.95 0.98 1.02 update-inlist_range=100
1.01 1.01 1.02 0.96 update-nonindex_range=100
1.00 0.99 1.00 0.95 update-one_range=100
1.00 1.00 1.01 0.95 update-zipf_range=100
0.99 0.97 0.98 0.92 write-only_range=10000



RocksDB benchmarks: small server, universal compaction

I shared benchmark results for RocksDB  a few weeks ago  using leveled compaction and a small server. Here I have results for universal comp...