Thursday, December 26, 2024

Speedb vs RocksDB on a large server

I am happy to read about storage engines that claim to be faster than RocksDB. Sometimes the claims are true and might lead to ideas for making RocksDB better. I am wary about evaluating such claims because that takes a lot of time, and when a claim is bogus I am reluctant to blog about it because I don't want to punch down on a startup.

Here I share results from the RocksDB benchmark scripts to compare Speedb and RocksDB and I am happy to claim that Speedb does some things better than RocksDB.

tl;dr

  • RocksDB and Speedb have similar average throughput for ...
    • a cached database
    • an IO-bound database when using O_DIRECT
  • RocksDB 8.6+ is slower than Speedb for write-heavy workloads with an IO-bound database when O_DIRECT isn't used. 
    • This problem arrived in RocksDB 8.6 (see issue 12038), which introduced the use of the readahead system call to prefetch data that compaction will soon read. I am not sure what was used prior to that. The regression that arrived in 8.6 was partially fixed in release 9.9.
  • In general, Speedb has better QoS (less throughput variance) for write-heavy workloads
Updates
  • Update 1 - with a minor hack I can remove about 1/3 of the regression between RocksDB 8.5 and 9.9 in the overwrite benchmark for the iobuf workload

On issue 12038

RocksDB 8.6 switched to using the readahead system call to prefetch SSTs that will soon be read by compaction. The goal is to reduce the time that compaction threads must wait for data. But what I see with iostat is that a readahead call is ignored when the value of the count argument is larger than max_sectors_kb for the storage device. And this happens on one or both of ext4 and xfs. I am not a kernel guru and I have yet to read this nice writeup of readahead internals. I do read this great note from Jens Axboe every few years.
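
One way to see this behavior on your own filesystem and device is to issue readahead calls against a large, uncached file and watch rps and rareqsz in iostat. The sketch below is just an experiment harness, not RocksDB code, and the file path is whatever you point it at.

// readahead_demo.cc - a minimal sketch (not RocksDB code) to experiment with the
// readahead system call. Run it against a large file that is not in the page cache
// and watch iostat (rps, rareqsz) to see how the requests reach the device.
// Build: g++ -O2 readahead_demo.cc -o readahead_demo
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char **argv) {
  if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
  int fd = open(argv[1], O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }
  const size_t kCount = 2 * 1024 * 1024;  // 2MB, much larger than 128KB max_sectors_kb
  for (off_t off = 0; off < 1024 * kCount; off += kCount) {
    // Prefetch [off, off + 2MB). The question is whether the kernel issues
    // large reads for this or caps/ignores the request.
    if (readahead(fd, off, kCount) != 0) { perror("readahead"); break; }
  }
  close(fd);
  return 0;
}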

I opened issue 12038 for this and it was fixed in RocksDB 9.9 by adding code that reduces the value of compaction_readahead_size to be <= the value of max_sectors_kb for the database's storage device. However, the fix in 9.9 doesn't restore the performance that existed prior to the change (see the 8.5 results). I assume the real fix is to have code in RocksDB do the prefetches rather than rely on the readahead system call.

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

The values of max_sectors_kb and max_hw_sectors_kb for the database's storage devices are 128 (KB) for both the SW RAID device (md2) and the underlying storage devices (nvme0n1, nvme1n1).
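
These limits live in sysfs. The sketch below just prints them for the devices used on this server; the device names are specific to this host, so adjust them for other hardware.

// print_queue_limits.cc - prints max_sectors_kb and max_hw_sectors_kb from sysfs
// for the devices used on this server. Change the device names for other hosts.
#include <fstream>
#include <iostream>
#include <string>

int main() {
  const char *devs[] = {"md2", "nvme0n1", "nvme1n1"};
  const char *attrs[] = {"max_sectors_kb", "max_hw_sectors_kb"};
  for (const char *dev : devs) {
    for (const char *attr : attrs) {
      std::string path = std::string("/sys/block/") + dev + "/queue/" + attr;
      std::ifstream f(path);
      std::string val;
      if (f >> val) std::cout << path << " = " << val << " KB\n";
      else std::cout << path << " (not readable)\n";
    }
  }
  return 0;
}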

Builds

I compiled db_bench from source. I used RocksDB versions 7.3.2, 7.10.2, 8.5.4, 8.6.7, 9.7.4 and 9.9.3. 

For Speedb I used the latest diff in their github repo, which appears to be based on RocksDB 8.6.7, although I am confused that it doesn't suffer from issue 12038, which is present in RocksDB 8.6.7.

commit 8d850b666cce6f39fbd4064e80b85f9690eaf385 (HEAD -> main, origin/main, origin/HEAD)
Author: udi-speedb <106253580+udi-speedb@users.noreply.github.com>
Date:   Mon Mar 11 14:00:03 2024 +0200

    Support Speedb's Paired Bloom Filter in db_bloom_filter_test (#810)

Benchmark

All tests used 2MB for compaction_readahead_size and the hyper clock block cache.

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 40 threads. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is: 
    bash x3.sh 40 no 1800 c48r128 100000000 2000000000 byrx iobuf iodir

The tests on the charts are named as:
  • fillseq -- load in key order with the WAL disabled
  • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
  • fwdrangeww -- like revrangeww except do short forward range scans
  • readww - like revrangeww except do point queries
  • overwrite - do overwrites (Put) as fast as possible
For configuration options that Speedb and RocksDB have in common I set those options to the same values. I didn't experiment with Speedb-only options except for use_dynamic_delay.

Workloads

There are three workloads, all of which use 40 threads:

  • byrx - the database is cached by RocksDB (100M KV pairs)
  • iobuf - the database is larger than memory and RocksDB uses buffered IO (2B KV pairs)
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT (2B KV pairs)

Spreadsheets with charts are here and here. Performance summaries are here for byrx, iobuf and iodir.

Results: average throughput

These charts plot relative QPS by test where relative QPS is (QPS for me / QPS for speedb.udd0). When this value is less than 1.0 then the given version is slower than speedb.udd0. When the value is greater than 1.0 then the given version is faster than speedb.udd0. The base case is speedb.udd0 which is Speedb with use_dynamic_delay=0. The versions listed in the charts are:
  • speedb.udd1 - Speedb with use_dynamic_delay=1
  • rocksdb.7.3 - RocksDB 7.3.2
  • rocksdb.7.10 - RocksDB 7.10.2
  • rocksdb.8.5 - RocksDB 8.5.4
  • rocksdb.8.6 - RocksDB 8.6.7
  • rocksdb.9.7 - RocksDB 9.7.4
  • rocksdb.9.9 - RocksDB 9.9.3
For a cached workload (byrx):
  • RocksDB is faster at fillseq (load in key order), otherwise modern RocksDB and Speedb have similar average throughput
  • RocksDB 7.3 was much slower on the read while writing tests (revrangeww, fwdrangeww and readww) but that was fixed by 7.10 and might have been related to issue 9423 or it might be from improvements to the hyper clock cache.
For an IO-bound workload that doesn't use O_DIRECT (iobuf)
  • Modern RocksDB is faster than Speedb at fillseq, has similar perf on the read while writing tests and is slower on overwrite (write-only with keys in random order). The difference in overwrite perf is probably from issue 12038, which arrived in RocksDB 8.6 and was partially fixed in 9.9.
  • Similar to above, RocksDB 7.3 had a few perf problems that have since been fixed
For an IO-bound workload that uses O_DIRECT (iodir)
  • Modern RocksDB is much faster than Speedb at fillseq and then has similar perf on the other tests
  • Similar to above, RocksDB 7.3 had a few perf problems that have since been fixed.

Results: throughput over time

The previous section shows average throughput. And while more average QPS is nice, if that comes with more variance then it is less than great. The charts in this section show QPS at 1-second intervals for fillseq and overwrite for two releases:
  • speedb.udd1 - Speedb with use_dynamic_delay=1
  • rocksdb.9.9.3 - RocksDB 9.9.3

fillseq (load in key order) for byrx, iobuf and iodir
  • For the cached workload the better perf from RocksDB is obvious
  • For the IO-bound workloads the results are closer
  • Variance is less with Speedb than with RocksDB based on the thickness of the lines
With overwrite for a cached workload (byrx) there are two charts - one from the entire run and one from the last 300 seconds.
  • Using the eyeball method (much hand waving) the variance is similar for RocksDB and Speedb. Both suffer from write stalls.
  • Response time percentiles (see p50, p99, p99.9, p99.99 and pmax here) show that pmax is <= 57 milliseconds for everything but RocksDB 7.3. Note that the numbers in the gist are in usecs.
With overwrite for an IO-bound workload (iobuf) without O_DIRECT there are two charts - one from the entire run and one from the last 300 seconds.
  • Speedb has slightly better average throughput
  • Speedb has much better QoS (less variance). That is obvious based on the thickness of the red line vs the blue line. RocksDB also has more write stalls based on the number of times the blue lines drop to near zero.
  • Using the response time percentiles here the differences between Speedb and RocksDB are less obvious.
  • The percentage of time that writes are stalled is <= 9% for Speedb and >= 20% for modern RocksDB (see stall% here). This is a good way to measure QoS.
With overwrite for an IO-bound workload (iodir) with O_DIRECT there are two charts - one from the entire run and one from the last 300 seconds.
  • Average throughput is slightly better for RocksDB than for Speedb
  • QoS is slightly better for Speedb than for RocksDB (see the blue lines dropping to zero)
  • RocksDB does much better here with O_DIRECT than it does above without O_DIRECT
  • Speedb still does better than RocksDB at avoiding write stalls. See stall% here.
Sources of write stalls

The output from db_bench includes a summary of the sources of write stalls. However, that summary just shows the number of times each stall condition was triggered without telling you how much each contributes to the stall% (the total percentage of time that write stalls are in effect).

For the IO-bound workloads (iobuf and iodir) the number of write stalls is much lower with Speedb. It appears to be more clever about managing write throughput.

The summary for iobuf with Speedb (speedb.udd1) and RocksDB (9.9.3)

speedb  rocksdb
1429      285   cf-l0-file-count-limit-delays-with-ongoing-compaction
   0        0   cf-l0-file-count-limit-stops-with-ongoing-compaction
1429      285   l0-file-count-limit-delays
   0        0   l0-file-count-limit-stops
   0        0   memtable-limit-delays
   0        0   memtable-limit-stops
1131    12585   pending-compaction-bytes-delays
   0        0   pending-compaction-bytes-stops
2560    12870   total-delays
   0        0   total-stops

The summary for iodir with Speedb (speedb.udd1) and RocksDB (9.9.3)

speedb  rocksdb
   5      287   cf-l0-file-count-limit-delays-with-ongoing-compaction
   0        0   cf-l0-file-count-limit-stops-with-ongoing-compaction
   5      287   l0-file-count-limit-delays
   0        0   l0-file-count-limit-stops
  22        0   memtable-limit-delays
   0        0   memtable-limit-stops
   0      687   pending-compaction-bytes-delays
   0        0   pending-compaction-bytes-stops
  27      974   total-delays
   0        0   total-stops

Compaction efficiency

The final sample of compaction IO statistics at the end of the overwrite test is here for iobuf and iodir for speedb.udd1 and RocksDB 9.9.3.

For iobuf the wall clock time for which compaction threads are busy (the Comp(sec) column) is about 1.10X larger for RocksDB than for Speedb. This is likely because Speedb does larger reads from storage (see the next section), so RocksDB spends more time waiting on IO. But the CPU time for compaction (the CompMergeCPU(sec) column) is about 1.12X larger for Speedb (I am not sure why).

For iodir the values in the Comp(sec) and CompMergeCPU(sec) columns are not as different across Speedb and RocksDB as they were above for iobuf.

Understanding IO via iostat

The numbers below are the average values from iostat collected during overwrite.

For iobuf the average read size (rareqsz) drops to 4.7KB in RocksDB 8.6 courtesy of issue 12038, while it was >= 50KB prior to RocksDB 8.6. The value improves to 34.2KB in RocksDB 9.9.3 but it is still much less than what it was in RocksDB 8.5.

For iodir the average read size (rareqsz) is ~112KB in all cases.

The RocksDB compaction threads read from the compaction input a RocksDB block at a time. For these tests I use an 8kb RocksDB block, but in the IO-bound tests the blocks are compressed for the larger levels of the LSM tree and a compressed block is ~4kb. Thus some kind of prefetching that does large read requests is needed to improve read IO efficiency.
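
One way to get large read requests without relying on the readahead system call is for the reader to prefetch into its own buffer and serve the small block reads from that buffer. The sketch below is only a generic illustration of the idea, not the RocksDB implementation: one large pread fills the buffer, then the ~4kb block reads are served from memory, so the device sees large requests.

// prefetch_sketch.cc - generic illustration of application-level prefetch,
// not RocksDB code. Fill a large buffer with one pread() and serve small
// block reads from it so the device sees large requests instead of ~4KB reads.
#include <unistd.h>
#include <cstdint>
#include <cstring>
#include <vector>

class PrefetchReader {
 public:
  // fd is an open file descriptor, prefetch_bytes is the buffer size (e.g. 1MB).
  PrefetchReader(int fd, size_t prefetch_bytes)
      : fd_(fd), buf_(prefetch_bytes), buf_off_(0), buf_len_(0) {}

  // Read 'len' bytes at 'offset'; refill the buffer with one large pread
  // whenever the requested range is not already buffered.
  ssize_t Read(uint64_t offset, size_t len, char *out) {
    if (offset < buf_off_ || offset + len > buf_off_ + buf_len_) {
      ssize_t n = pread(fd_, buf_.data(), buf_.size(), offset);
      if (n < 0) return n;
      buf_off_ = offset;
      buf_len_ = static_cast<size_t>(n);
      if (len > buf_len_) len = buf_len_;  // short read near EOF
    }
    memcpy(out, buf_.data() + (offset - buf_off_), len);
    return static_cast<ssize_t>(len);
  }

 private:
  int fd_;
  std::vector<char> buf_;
  uint64_t buf_off_;
  size_t buf_len_;
};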

Legend:
  • rps - reads/s
  • rMBps - read MB/s
  • rawait - read wait latency in milliseconds
  • rareqsz - read request size in KB
  • wps - writes/s
  • wMBps - write MB/s
  • wawait - write wait latency in milliseconds
  • wareqsz - write request size in KB
iobuf (buffered IO)
rps     rMBps   rawait  rareqsz wps     wMBps   wawait  wareqsz
6336    364.4   0.43    52.9    14344   1342.0  0.40    95.5    speedb.udd0
6209    357.6   0.41    51.6    14177   1322.8  0.41    95.3    speedb.udd1
2133    164.0   0.19    26.1    8954    854.4   0.31    99.3    rocksdb.7.3
4542    333.5   0.49    71.8    14734   1361.6  0.39    94.5    rocksdb.7.10
6101    352.1   0.42    52.4    14878   1391.8  0.41    96.1    rocksdb.8.5
40471   184.7   0.10    4.7     8552    784.1   0.31    93.6    rocksdb.8.6
39201   178.8   0.10    4.7     8783    801.5   0.32    93.7    rocksdb.9.7
7733    268.7   0.29    34.2    12742   1156.0  0.30    93.2    rocksdb.9.9

iodir (O_DIRECT)
rps     rMBps   rawait  rareqsz wps     wMBps   wawait  wareqsz
12757   1415.9  0.82    112.8   16642   1532.6  0.40    93.5    speedb.udd0
12539   1392.9  0.83    112.7   16327   1507.9  0.41    93.6    speedb.udd1
8090    903.5   0.74    114.3   10036   976.6   0.30    100.5   rocksdb.7.3
12155   1346.2  0.90    112.8   15602   1462.4  0.40    95.7    rocksdb.7.10
13315   1484.5  0.85    113.6   17436   1607.1  0.40    94.0    rocksdb.8.5
12978   1444.3  0.84    112.9   16981   1563.0  0.46    93.4    rocksdb.8.6
12641   1411.9  0.84    113.9   16217   1535.6  0.43    96.7    rocksdb.9.7
12990   1450.7  0.83    113.8   16704   1576.0  0.41    96.3    rocksdb.9.9

Update 1

For the overwrite test and the iobuf (buffered IO, IO-bound) workload the QPS is 277795 for RocksDB 8.5.4 vs 227068 for 9.9.3, so 9.9.3 gets ~82% of the QPS relative to 8.5.4. Note that 9.8 only gets about 57% and the improvement from 9.8 to 9.9 comes from fixing issue 12038, but that fix isn't sufficient given the gap between 8.5 and 9.9.

With a hack I can improve the QPS for 9.9 from 227068/s to 241260/s. The problem, explained to me by the RocksDB team, is that while this code adjusts compaction readahead to be no larger than max_sectors_kb, the code that requests prefetch can request (X + Y) bytes where X is the adjusted compaction readahead amount and Y is the block size. The sum of these is likely to be larger than max_sectors_kb. So my hack was to reduce compaction readahead to (max_sectors_kb - 8kb) given that I am using block_size=8kb.
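
Here is a sketch of the arithmetic behind the hack, not the actual RocksDB patch: clamp to max_sectors_kb as the 9.9 fix does, then subtract the block size so that a prefetch of readahead plus one block stays at or below max_sectors_kb.

// Sketch of the adjustment, not the actual RocksDB code. With the values from
// this server: configured readahead = 2MB, max_sectors_kb = 128KB, block = 8KB.
#include <algorithm>
#include <cstddef>

size_t AdjustCompactionReadahead(size_t configured_readahead,  // 2MB here
                                 size_t max_sectors_bytes,     // 128KB here
                                 size_t block_size) {          // 8KB here
  // The 9.9 fix clamps readahead to max_sectors_kb, but a prefetch request can be
  // readahead + one block, which then exceeds max_sectors_kb again. Subtracting
  // the block size keeps the largest request at 128KB instead of 136KB.
  size_t clamped = std::min(configured_readahead, max_sectors_bytes);
  return clamped > block_size ? clamped - block_size : clamped;
}
// AdjustCompactionReadahead(2u << 20, 128u << 10, 8u << 10) returns 122880 (120KB).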

And with the hack the QPS for overwrite improves. The performance summary from the overwrite test is here and the stall% decreased from 25.9 to 22. Alas, the stall% was 9.0 with RocksDB 8.5.4 so there is still room for improvement.

The compaction reasons for 8.5.4, 9.9.3 without the hack (9.9orig) and 9.9.3 with the hack (9.9hack) provide a better idea of what has changed.

8.5.4   9.9orig 9.9hack
1089      285     211   cf-l0-file-count-limit-delays-with-ongoing-compaction
   0        0       0   cf-l0-file-count-limit-stops-with-ongoing-compaction
1089      285     211   l0-file-count-limit-delays
   0        0       0   l0-file-count-limit-stops
   0        0       0   memtable-limit-delays
   0        0       0   memtable-limit-stops
1207    12585   10735   pending-compaction-bytes-delays
   0        0       0   pending-compaction-bytes-stops
2296    12870   10946   total-delays
   0        0       0   total-stops

And finally, averages from iostat during the test show that 9.9.3 with the hack gets the largest average read request size (rareqsz is 56.4) but it still is slower than 8.5.4.

rps     rmbps   rawait  rareqsz wps     wmbps   wawait  wareqsz
6101    352.1   0.42    52.4    14878   1391.8  0.41    96.1    8.5.4
7733    268.7   0.29    34.2    12742   1156.0  0.30    93.2    9.9orig
5020    285.7   0.43    56.4    13467   1220.2  0.32    93.1    9.9hack

Friday, November 29, 2024

RocksDB on a big server: LRU vs hyperclock, v2

This post shows that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a large server to show the speedup from the hyperclock block cache implementation for different concurrency levels with RocksDB 9.6. Here I share results from the same server for different (old and new) RocksDB releases.

Results are amazing on a large (48 cores) server with 40 client threads

  • ~2X more QPS for range queries with hyperclock
  • ~3X more QPS for point queries with hyperclock

Software

I used RocksDB versions 6.0.2, 6.29.5, 7.0.4, 7.6.0, 7.7.8, 8.5.4, 8.6.7, 9.0.1, 9.1.2, 9.3.2, 9.5.2, 9.7.4 and 9.9.0. Everything was compiled with gcc 11.4.0.

The --cache_type argument selected the block cache implementation:

  • lru_cache was used for versions 7.6 and earlier. Because some of the oldest releases don't support --cache_type I also used --undef_params=...,cache_type
  • hyper_clock_cache was used for versions 7.7 through 8.5
  • auto_hyper_clock_cache was used for versions 8.5+

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Benchmark

Overviews on how I use db_bench are here and here.

All of my tests here use a CPU-bound workload with a database that is cached by RocksDB and the benchmark is run for 40 threads.

I focus on the read-heavy benchmark steps:

  • revrangeww (reverse range while writing) - this does short reverse range scans
  • fwdrangeww (forward range while writing) - this does short forward range scans
  • readww (read while writing) - this does point queries

For each of these there is a fixed rate for writes done in the background and performance is reported for the reads. I prefer to measure read performance when there are concurrent writes because read-only benchmarks with an LSM suffer from non-determinism as the state (shape) of the LSM tree has a large impact on CPU overhead and throughput.

Results

All results are in this spreadsheet and the performance summary is here.

The graph below shows relative QPS which is: (QPS for my version / QPS for RocksDB 6.0.2) and the results are amazing:

  • ~2X more QPS for range queries with hyperclock
  • ~3X more QPS for point queries with hyperclock

The average values for vmstat metrics provide more detail on why hyperclock is so good for performance. The context switch rate drops dramatically when it is enabled because there is much less mutex contention. The user CPU utilization increases by ~1.6X because more useful work can get done when there is less mutex contention.
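
To show where the contention comes from, below is a generic sketch of a sharded cache with one mutex per shard. This is not the RocksDB code, just the shape of the problem: even a read must take the shard mutex to update the LRU list, so with 40 threads doing point queries against a cached database those mutexes are hot and the context switch rate grows.

// sharded_lru_sketch.cc - generic sketch of a sharded LRU cache with a mutex per
// shard, not the RocksDB implementation. Insert and eviction are omitted; the
// point is that even Lookup takes the shard mutex to move the key to MRU position.
#include <functional>
#include <list>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class ShardedLRU {
 public:
  explicit ShardedLRU(size_t num_shards) : shards_(num_shards) {}

  bool Lookup(const std::string& key, std::string* value) {
    Shard& s = shards_[std::hash<std::string>{}(key) % shards_.size()];
    std::lock_guard<std::mutex> guard(s.mu);  // contended under high read QPS
    auto it = s.map.find(key);
    if (it == s.map.end()) return false;
    s.lru.splice(s.lru.begin(), s.lru, it->second.second);  // move key to MRU
    *value = it->second.first;
    return true;
  }

 private:
  struct Shard {
    std::mutex mu;
    std::list<std::string> lru;  // keys in MRU..LRU order
    std::unordered_map<std::string,
                       std::pair<std::string, std::list<std::string>::iterator>> map;
  };
  std::vector<Shard> shards_;
};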

legend
* cs - context switches per second per vmstat
* us - user CPU utilization per vmstat
* sy - system CPU utilization per vmstat
* id - idle CPU utilization per vmstat
* wa - wait CPU utilization per vmstat
* version - RocksDB version

cs      us      sy      us+sy   id      wa      version
1495325 50.3    14.0    64.3    18.5    0.1     7.6.0
2360    82.7    14.0    96.7    16.6    0.1     9.9.0

Monday, November 25, 2024

RocksDB benchmarks: large server, universal compaction

This post has results for universal compaction from the same large server for which I recently shared leveled compaction results. The results are boring (no large regressions) but a bit more exciting than the ones for leveled compaction because there is more variance. A somewhat educated guess is that variance is more likely with universal compaction.

tl;dr

  • there are some small regressions for cached workloads (see byrx below)
  • there are some small to medium improvements for IO-bound workloads (see iodir and iobuf)
  • modern RocksDB would look better were I to use the Hyper Clock block cache, but here I don't, in order to test similar code across all versions

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Builds

I compiled db_bench from source on all servers. I used versions:
  • 6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
  • 7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
  • 8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
  • 9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1 and 9.7.3
Benchmark

All tests used the default value for compaction_readahead_size and the block cache (LRU).

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 40 threads. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is: bash x3.sh 40 no 1800 c48r128 100000000 2000000000 byrx iobuf iodir

The tests on the charts are named as:
  • fillseq -- load in key order with the WAL disabled
  • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
  • fwdrangeww -- like revrangeww except do short forward range scans
  • readww - like revrangeww except do point queries
  • overwrite - do overwrites (Put) as fast as possible
Workloads

There are three workloads, all of which use 40 threads:

  • byrx - the database is cached by RocksDB (100M KV pairs)
  • iobuf - the database is larger than memory and RocksDB uses buffered IO (2B KV pairs)
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT (2B KV pairs)

A spreadsheet with all results is here and performance summaries with more details are here for byrx, iobuf and iodir.

Relative QPS

The numbers in the spreadsheet and on the y-axis in the charts that follow are the relative QPS which is (QPS for $me) / (QPS for $base). When the value is greater than 1.0 then $me is faster than $base. When it is less than 1.0 then $base is faster (perf regression!).

The base version is RocksDB 6.0.2.

Results: byrx

The byrx tests use a cached database. The performance summary is here

The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

Summary:
  • fillseq has new CPU overhead in 7.0 from code added for correctness checks and QPS has been stable since then
  • QPS for other tests has been stable, with some variance, since late 6.x
Results: iobuf

The iobuf tests use an IO-bound database with buffered IO. The performance summary is here

The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

Summary:
  • fillseq has been stable since 7.6
  • readww has always been stable
  • overwrite improved in 7.6 and has been stable since then
  • fwdrangeww and revrangeww improved in late 6.0 and have been stable since then
Results: iodir

The iodir tests use an IO-bound database with O_DIRECT. The performance summary is here

The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

Summary:
  • fillseq has been stable since 7.6
  • readww has always been stable
  • overwrite improved in 7.6 and has been stable since then
  • fwdrangeww and revrangeww have been stable but there is some variance

Saturday, November 9, 2024

Fixing some of the InnoDB scan perf regressions in a MySQL fork

I recently learned of Advanced MySQL, a MySQL fork, and ran my sysbench benchmarks for it. It fixed some, but not all, of the regressions for write heavy workloads that landed in InnoDB after MySQL 8.0.28.

In response to my results, the project lead filed a bug for performance regressions and then quickly came up with a diff. The bug in this case is for regressions that are most obvious during full table scans and the problems arrived in MySQL 8.0.29 and 8.0.30 -- see bug 111538 and this post. The bug is closed for upstream but the perf regressions remain so I am excited to see the community working to solve this problem.

tl;dr

  • Advanced MySQL with the fix removes much of the regression in scan performance
Builds

I tried 4 builds

  • my8028 - upstream MySQL 8.0.28
  • my8040 - upstream MySQL 8.0.40
  • my8040adv_pre - Advanced MySQL 8.0.40 without the fix (without d347cdb)
  • my8040adv_post - Advanced MySQL 8.0.40 with the fix (at d347cdb)
Hardware

The servers are

  • dell32
    • Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32-Cores, 2 m.2 SSD (each 2TB, RAID SW 0, ext4). 
  • ax162-s
    • AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G RAM, Ubuntu 22.04 and ext4 on 2 NVMe devices with SW RAID 1. This is in the Hetzner cloud.
  • bee
    • Beelink SER 4700u with Ryzen 7 4700u, 16G RAM, Ubuntu 22.04 and ext4 on NVMe

Benchmark

I used sysbench and my usage is explained here. A full run has 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by InnoDB.

The benchmark is run with ...
  • dell32 - 8 tables, 10M rows per table and 24 threads
  • ax162-s - 8 tables, 10M rows per table and 40 threads
  • bee - 1 table, 30M rows and 1 thread
Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

Results: overview

All of the results use relative QPS (rQPS) where:
  • rQPS is: (QPS for my version / QPS for base version)
  • base version is the QPS from MySQL 8.0.28
  • my version is one of the other versions
Here I only share the results for the scan microbenchmark.

Results: dell32

Summary
  • QPS with the fix in Advanced MySQL is ~9% better than without the fix
  • QPS with the fix in Advanced MySQL is ~2% better than my8040
  • I am not sure why my8040adv_pre did much worse than my8040
From the relative QPS results the QPS with my8040adv_pre was ~15% less than my8028. But my8040adv_post is only ~7% slower than my8028 so it removes half of the regression.

Relative to: my8028
col-1 : my8040
col-2 : my8040adv_pre
col-3 : my8040adv_post

col-1   col-2   col-3
0.91    0.85    0.93    scan

From vmstat and iostat metrics CPU overhead for my8040adv_pre was ~22% larger than my8028. But with the fix the CPU overhead for my8040adv_post is only ~8% larger than my8028. This is great.

--- absolute
cpu/o cs/o r/o rKB/o wKB/o o/s dbms
0.093496 3.256 0 0 0.006 246 my8028
0.106105 4.065 0 0 0.006 225 my8040
0.113878 4.344 0 0 0.006 208 my8040adv_pre
0.101104 3.978 0 0 0.006 228 my8040adv_post
--- relative to first result
1.13 1.25 1 1 1.00 0.91 my8040
1.22 1.33 1 1 1.00 0.85 my8040adv_pre
1.08 1.22 1 1 1.00 0.93 my8040adv_post

Results: ax162-s

Summary
  • QPS is ~18% larger with the fix in Advanced MySQL
  • CPU overhead is ~15% smaller with the fix
From the relative QPS results the QPS with my8040adv_pre was the same as my8040 and both were ~17% slower than my8028. But my8040adv_post is only ~2% slower than my8028 which is excellent.

Relative to: my8028
col-1 : my8040
col-2 : my8040adv_pre
col-3 : my8040adv_post

col-1   col-2   col-3
0.83    0.83    0.98    scan

From vmstat and iostat metrics CPU overhead for my8040 and my8040adv_pre were ~20% larger than my8028. But with the fix the CPU overhead for my8040adv_post is only ~3% larger than my8028. This is great.

--- absolute
cpu/o cs/o r/o rKB/o wKB/o o/s dbms
0.018767 0.552 0 0 0.052 872 my8028
0.022533 0.800 0 0 0.013 725 my8040
0.022499 0.808 0 0.001 0.034 727 my8040adv_pre
0.019305 0.731 0 0 0.03 851 my8040adv_post
--- relative to first result
1.20 1.45 1 1 0.25 0.83 my8040
1.20 1.46 1 inf 0.65 0.83 my8040adv_pre
1.03 1.32 1 1 0.58 0.98 my8040adv_post

Results: bee

Summary:
  • QPS is ~17% larger with the fix in Advanced MySQL
  • CPU overhead is ~15% smaller with the fix
I did not test my8040adv_pre on this server.

From the relative QPS results the QPS with my8040 is ~22% less than my8028. But QPS from my8040adv_post is only ~9% less than my8028. This is great.

Relative to: my8028
col-1 : my8040
col-2 : my8040adv_post

col-1   col-2
0.78    0.91    scan

From vmstat and iostat metrics CPU overhead for my8040 was ~28% larger than my8028. But with the fix the CPU overhead for my8040adv_post is only ~3% larger than my8028. This is great.

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.222553        2.534   0       0.001   0.035   55      my8028
0.285792        7.622   0       0       0.041   43      my8040
0.246404        6.475   0       0       0.036   50      my8040adv_post
--- relative to first result
1.28            3.01    1       0.00    1.17    0.78    my8040
1.11            2.56    1       0.00    1.03    0.91    my8040adv_post

RocksDB benchmarks: large server, leveled compaction

A few weeks ago I shared benchmark results for RocksDB with both leveled and universal compaction on a small server. This post has results from a large server with leveled compaction.

tl;dr

  • there are a few regressions from bug 12038
  • QPS for overwrite is ~1.5X to ~2X better in 9.x than 6.0 (ignoring bug 12038)
  • otherwise QPS in 9.x is similar to 6.x

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Builds

I compiled db_bench from source on all servers. I used versions:
  • 6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
  • 7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
  • 8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
  • 9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1 and 9.7.3
Benchmark

All tests used the default value for compaction_readahead_size and the block cache (LRU).

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 40 threads. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is: bash x3.sh 40 no 1800 c48r128 100000000 2000000000 byrx iobuf iodir

The tests on the charts are named as:
  • fillseq -- load in key order with the WAL disabled
  • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
  • fwdrangeww -- like revrangeww except do short forward range scans
  • readww - like revrangeww except do point queries
  • overwrite - do overwrites (Put) as fast as possible
Workloads

There are three workloads, all of which use 40 threads:

  • byrx - the database is cached by RocksDB (100M KV pairs)
  • iobuf - the database is larger than memory and RocksDB uses buffered IO (2B KV pairs)
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT (2B KV pairs)

A spreadsheet with all results is here and performance summaries with more details are here for byrx, iobuf and iodir.

Relative QPS

The numbers in the spreadsheet and on the y-axis in the charts that follow are the relative QPS which is (QPS for $me) / (QPS for $base). When the value is greater than 1.0 then $me is faster than $base. When it is less than 1.0 then $base is faster (perf regression!).

The base version is RocksDB 6.0.2.

Results: byrx

The byrx tests use a cached database. The performance summary is here

This chart shows the relative QPS for a given version relative to RocksDB 6.0.2. The y-axis doesn't start at 0 in the second chart to improve readability for some lines.

Summary:
  • fillseq is worse from 6.0 to 8.0 but stable since then
  • overwrite has large improvements late in 6.0 and small improvements since then
  • fwdrangeww has small improvements in early 7.0 and is stable since then
  • revrangeww and readww are stable from 6.0 through 9.x
Results: iobuf

The iobuf tests use an IO-bound database with buffered IO. The performance summary is here

This chart shows the relative QPS for a given version relative to RocksDB 6.0.2. The y-axis doesn't start at 0 in the second chart to improve readability for some lines.

Summary:
  • bug 12038 explains the drop in throughput for overwrite since 8.6.7
  • otherwise QPS in 9.x is similar to 6.0
Results: iodir

The iodir tests use an IO-bound database with O_DIRECT. The performance summary is here

This chart shows the relative QPS for a given version relative to RocksDB 6.0.2. The y-axis doesn't start at 0 in the second chart to improve readability for some lines.

Summary:
  • the QPS drop for overwrite in 8.6.7 occurs because the db_bench client wasn't updated to use the new default value for compaction readahead size
  • QPS for overwrite is ~2X better in 9.x relative to 6.0
  • otherwise QPS in 9.x is similar to 6.0

Tuesday, November 5, 2024

RocksDB on a big server: LRU vs hyperclock

This has benchmark results for RocksDB using a big (48-core) server. I ran tests to document the impact of the block cache type (LRU vs hyperclock) and a few other configuration choices for a CPU-bound workload. A previous post with great results for the hyperclock block cache is here.

tl;dr

  • read QPS is up to ~3X better with auto_hyper_clock_cache vs LRU
  • read QPS is up to ~1.3X better with the per-level fanout set to 32 vs 8
  • read QPS drops by ~15% as the background write rate increases from 2 to 32 M/s
Software

I used RocksDB 9.6, compiled with gcc 11.4.0.

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Benchmark

Overviews on how I use db_bench are here and here.

All of my tests here use a CPU-bound workload with a database that is cached by RocksDB and are repeated for 1, 10, 20 and 40 threads. 

I focus on the readwhilewriting benchmark where performance is reported for the reads (point queries) while there is a fixed rate for writes done in the background. I prefer to measure read performance when there are concurrent writes because read-only benchmarks with an LSM suffer from non-determinism as the state (shape) of the LSM tree has a large impact on CPU overhead and throughput.

To save time I did not run the fwdrangewhilewriting benchmark. Were I to repeat this work I would include it because the results from it would be interesting for a few of the configuration options I compared.

I did tests to understand the following:

  • LRU vs auto_hyper_clock_cache for the block cache implementation
    • LRU is the original implementation. The code was simple, which is nice. The implementation for LRU is sharded with a mutex per shard and that mutex can become a hot spot. The hyperclock implementation is much better at avoiding hot spots.
  • per level fanout (8 vs 32)
    • By per level fanout I mean the value of --max_bytes_for_level_multiplier which determines the target size difference between adjacent levels. By default I use 8, while 10 is also a common choice. Here I compare 8 vs 32. When the fanout is larger the LSM tree has fewer levels -- meaning there are fewer places to check for data, which should reduce CPU overhead and increase QPS. A sketch after this list makes the level-count effect concrete.
  • background write rate
    • I repeated tests with the background write rate (--benchmark_write_rate_limit) set to 2, 8 and 32 MB/s. With a higher write rate there is more chance for interference between reads and writes. The interference might be from mutex contention, compaction threads using more CPU, more L0 files to check or more data in levels L1 and larger.
  • target size for L0
    • By target size I mean the number of files in the L0 that trigger compaction. The db_bench option for this is --level0_file_num_compaction_trigger. When the value is larger there will be more L0 files on average that a query might have to check and that means there is more CPU overhead. Unfortunately, I configured RocksDB incorrectly so I don't have results to share. The issue is that when the L0 is configured to be larger, the L1 should be configured to be at least as large as the L0 (L1 target size should be >= sizeof(SST) * num(L0 files)). If not, then L0->L1 compaction will happen sooner than expected.
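
To make the fanout point above concrete, this small sketch estimates the number of levels beyond L0 for a given total database size, assuming the usual leveled shape where each level is fanout times larger than the previous one. The 140GB total and 1GB L1 are illustrative numbers, not the exact sizes from these tests.

// levels_sketch.cc - rough estimate of how many levels a leveled LSM needs for a
// given total size, assuming each level is 'fanout' times larger than the previous
// one and ignoring L0 and dynamic level sizing. The sizes are illustrative.
#include <cstdio>

int EstimateLevels(double total_gb, double l1_gb, double fanout) {
  int levels = 1;
  double level_size = l1_gb, cumulative = l1_gb;
  while (cumulative < total_gb) {
    level_size *= fanout;
    cumulative += level_size;
    ++levels;
  }
  return levels;
}

int main() {
  printf("fanout=8  -> %d levels beyond L0\n", EstimateLevels(140, 1, 8));   // 4
  printf("fanout=32 -> %d levels beyond L0\n", EstimateLevels(140, 1, 32));  // 3
  return 0;
}
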
All of the results are in this spreadsheet.

Results: LRU vs auto_hyper_clock_cache

These graphs have QPS from the readwhilewriting benchmark for the LRU and AHCC block cache implementations where LRU is the original version with a sharded hash table and a mutex per shard while AHCC is the hyper clock cache (--cache_type=auto_hyper_clock_cache).

Summary:
  • QPS is much better with AHCC than LRU (~3.3X faster at 40 threads)
  • QPS with AHCC scales linearly with the thread count
  • QPS with LRU does not scale linearly and suffers from mutex contention
  • There are some odd effects in the results for 1 thread
With a 2M/s background write rate AHCC is ~1.1X faster at 1 thread and ~3.3X faster at 40 threads relative to LRU.
With an 8M/s background write rate AHCC is ~1.1X faster at 1 thread and ~3.3X faster at 40 threads relative to LRU.
With a 32M/s background write rate AHCC is ~1.1X faster at 1 thread and ~2.9X faster at 40 threads relative to LRU.

Results: per level fanout

These graphs have QPS from the readwhilewriting benchmark to compare results with per-level fanout set to 8 and 32.

Summary
  • QPS is often 1.1X to 1.3X larger with fanout=32 vs fanout=8

With an 8M/s background write rate and LRU, fanout=8 is faster at 1 thread but then fanout=32 is from 1.1X to 1.3X faster at 10 to 40 threads.
With an 8M/s background write rate and AHCC, fanout=8 is faster at 1 thread but then fanout=32 is ~1.1X faster at 10 to 40 threads.

With a 32M/s background write rate and LRU, fanout=8 is ~2X faster at 1 thread but then fanout=32 is from 1.1X to 1.2X faster at 10 to 40 threads.
With a 32M/s background write rate and AHCC, fanout=8 is ~2X faster at 1 thread but then fanout=32 is ~1.1X faster at 10 to 40 threads.
Results: background write rate

Summary:
  • With LRU
    • QPS drops by up to ~15% as the background write rate grows from 2M/s to 32M/s
    • QPS does not scale linearly and suffers from mutex contention
  • With AHCC
    • QPS drops by up to 13% as the background write rate grows from 2M/s to 32M/s
    • QPS scales linearly with the thread count
  • There are some odd effects in the results for 1 thread
Results with LRU show that per-thread QPS doesn't scale linearly
Results with AHCC show that per-thread QPS scales linearly ignoring the odd results for 1 thread


