Thursday, December 26, 2024

Speedb vs RocksDB on a large server

I am happy to read about storage engines that claim to be faster than RocksDB. Sometimes the claims are true and might lead to ideas for making RocksDB better. I am wary about evaluating such claims because that takes a lot of time and when the claim is bogus I am reluctant to blog about that because I don't want to punch down on a startup.

Here I share results from the RocksDB benchmark scripts to compare Speedb and RocksDB and I am happy to claim that Speedb does some things better than RocksDB.

tl;dr

  • RocksDB and Speedb have similar average throughput for ...
    • a cached database
    • an IO-bound database when using O_DIRECT
  • RocksDB 8.6+ is slower than Speedb for write-heavy workloads with an IO-bound database when O_DIRECT isn't used. 
    • This problem arrived in RocksDB 8.6 (see issue 12038) which introduces the use of the readahead system call to prefetch data that compaction will soon read. I am not sure what was used prior. The regression that arrived in 8.6 was partially fixed in release 9.9. 
  • In general, Speedb has better QoS (less throughput variance) for write-heavy workloads
Updates
  • Update 1 - with a minor hack I can remove about 1/3 of the regression between RocksDB 8.5 and 9.9 in the overwrite benchmark for the iobuf workload

On issue 12038

RocksDB 8.6 switched to using the readahead system call to prefetch SSTs that will soon be read by compaction. The goal is to reduce the time that compaction threads must wait for data. But what I see with iostat is that a readahead call is ignored when the value of the count argument is larger than max_sectors_kb for the storage device. And this happens on one or both of ext4 and xfs. I am not a kernel guru and I have yet to read this nice writeup of readahead internals. I do read this great note from Jens Axboe every few years.

I opened issue 12038 for this issue and it was fixed in RocksDB 9.9 by adding code that reduces the value of compaction_readahead_size to be <= the value of max_sectors_kb for the database's storage device. However the fix in 9.9 doesn't restore the performance that existed prior to the change (see 8.5 results). I assume the real fix is to have code in RocksDB to do the prefetches rather than rely on the readahead system call.
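The fix in 9.9 as described above amounts to a clamp. Here is a minimal sketch in Python, not the RocksDB code. The function name is hypothetical, and the inputs are the 2MB compaction_readahead_size and the 128KB max_sectors_kb that apply to my tests.

```python
def clamp_compaction_readahead(requested_bytes: int, max_sectors_kb: int) -> int:
    """Return a readahead size that is no larger than the device limit.

    Hypothetical helper for illustration only. max_sectors_kb is in KB,
    as reported by the kernel, so it is converted to bytes first.
    """
    if requested_bytes < 0 or max_sectors_kb <= 0:
        raise ValueError("sizes must be positive")
    return min(requested_bytes, max_sectors_kb * 1024)


requested = 2 * 1024 * 1024  # compaction_readahead_size=2MB
effective = clamp_compaction_readahead(requested, max_sectors_kb=128)
print(effective // 1024, "KB")
```

A request that is already under the limit is returned unchanged, so the clamp only matters when compaction_readahead_size is larger than max_sectors_kb.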

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

The values of max_sectors_kb and max_hw_sectors_kb for the database's storage device are 128 (KB) for both the SW RAID device (md2) and the underlying storage devices (nvme0n1, nvme1n1).
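These limits can be read from sysfs on Linux. A small sketch, assuming the usual /sys/block/<device>/queue/ layout. The helper names are mine and the sysfs_root parameter exists only to make the code easy to test.

```python
from pathlib import Path


def read_queue_limit_kb(device: str, name: str, sysfs_root: str = "/sys/block") -> int:
    """Read one block-queue limit (in KB) for a device from sysfs."""
    path = Path(sysfs_root) / device / "queue" / name
    return int(path.read_text().strip())


def report_limits(devices, sysfs_root: str = "/sys/block") -> dict:
    """Return max_sectors_kb and max_hw_sectors_kb for each device."""
    names = ("max_sectors_kb", "max_hw_sectors_kb")
    return {
        dev: {name: read_queue_limit_kb(dev, name, sysfs_root) for name in names}
        for dev in devices
    }


# Device names are the ones from my server, they will differ elsewhere.
for dev in ("md2", "nvme0n1", "nvme1n1"):
    try:
        print(dev, report_limits([dev])[dev])
    except (OSError, ValueError) as err:
        print(dev, "not available:", err)
```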

Builds

I compiled db_bench from source. I used RocksDB versions 7.3.2, 7.10.2, 8.5.4, 8.6.7, 9.7.4 and 9.9.3. 

For Speedb I used the latest diff in their github repo which appears to be based on RocksDB 8.6.7, although I am confused that it doesn't suffer from issue 12038 which is in RocksDB 8.6.7.

commit 8d850b666cce6f39fbd4064e80b85f9690eaf385 (HEAD -> main, origin/main, origin/HEAD)
Author: udi-speedb <106253580+udi-speedb@users.noreply.github.com>
Date:   Mon Mar 11 14:00:03 2024 +0200

    Support Speedb's Paired Bloom Filter in db_bloom_filter_test (#810)

Benchmark

All tests used 2MB for compaction_readahead_size and the hyper clock block cache.

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 40 threads. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is: 
    bash x3.sh 40 no 1800 c48r128 100000000 2000000000 iobuf iobuf iodir

The tests on the charts are named as:
  • fillseq -- load in key order with the WAL disabled
  • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
  • fwdrangeww -- like revrangeww except do short forward range scans
  • readww -- like revrangeww except do point queries
  • overwrite -- do overwrites (Put) as fast as possible
For configuration options that Speedb and RocksDB have in common I set those options to the same values. I didn't experiment with Speedb-only options except for use_dynamic_delay.

Workloads

There are three workloads, all of which use 40 threads:

  • byrx - the database is cached by RocksDB (100M KV pairs)
  • iobuf - the database is larger than memory and RocksDB uses buffered IO (2B KV pairs)
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT (2B KV pairs)

Spreadsheets with charts are here and here. Performance summaries are here for byrx, iobuf and iodir.

Results: average throughput

These charts plot relative QPS by test where relative QPS is (QPS for a given version / QPS for speedb.udd0). When this value is less than 1.0 then the given version is slower than speedb.udd0. When the value is greater than 1.0 then the given version is faster than speedb.udd0. The base case is speedb.udd0 which is Speedb with use_dynamic_delay=0. The versions listed in the charts are:
  • speedb.udd1 - Speedb with use_dynamic_delay=1
  • rocksdb.7.3 - RocksDB 7.3.2
  • rocksdb.7.10 - RocksDB 7.10.2
  • rocksdb.8.5 - RocksDB 8.5.4
  • rocksdb.8.6 - RocksDB 8.6.7
  • rocksdb.9.7 - RocksDB 9.7.4
  • rocksdb.9.9 - RocksDB 9.9.3
For a cached workload (byrx):
  • RocksDB is faster at fillseq (load in key order), otherwise modern RocksDB and Speedb have similar average throughput
  • RocksDB 7.3 was much slower on the read while writing tests (revrangeww, fwdrangeww and readww) but that was fixed by 7.10 and might have been related to issue 9423 or it might be from improvements to the hyper clock cache.
For an IO-bound workload that doesn't use O_DIRECT (iobuf)
  • Modern RocksDB is faster than Speedb at fillseq, has similar perf on the read while writing tests and is slower on overwrite (write-only with keys in random order). The difference in overwrite perf is probably from issue 12038 which arrived in RocksDB 8.6 and then has a fix in 9.9.
  • Similar to above RocksDB 7.3 has a few perf problems that have since been fixed
For an IO-bound workload that uses O_DIRECT (iodir)
  • Modern RocksDB is much faster than Speedb at fillseq and then has similar perf on the other tests
  • Similar to above RocksDB 7.3 has a few perf problems that have since been fixed.

Results: throughput over time

The previous section shows average throughput. And while more average QPS is nice, if that comes with more variance then it is less than great. The charts in this section show QPS at 1-second intervals for fillseq and overwrite for two releases:
  • speedb.udd1 - Speedb with use_dynamic_delay=1
  • rocksdb.9.9.3 - RocksDB 9.9.3

fillseq (load in key order) for byrx, iobuf and iodir
  • For the cached workload the better perf from RocksDB is obvious
  • For the IO-bound workloads the results are closer
  • Variance is less with Speedb than with RocksDB based on the thickness of the lines
With overwrite for a cached workload (byrx) there are two charts - one from the entire run and one from the last 300 seconds.
  • Using the eyeball method (much hand waving) the variance is similar for RocksDB and Speedb. Both suffer from write stalls.
  • For response time percentiles (see p50, p99, p99.9, p99.99 and pmax here) pmax is <= 57 milliseconds for everything but RocksDB 7.3. Note the numbers in the gist are in usecs.
With overwrite for an IO-bound workload (iobuf) without O_DIRECT there are two charts - one from the entire run and one from the last 300 seconds.
  • Speedb has slightly better average throughput
  • Speedb has much better QoS (less variance). That is obvious based on the thickness of the red line vs the blue line. RocksDB also has more write stalls based on the number of times the blue lines drop to near zero.
  • Using the response time percentiles here the differences between Speedb and RocksDB are less obvious.
  • The percentage of time that writes are stalled is <= 9% for Speedb and >= 20% for modern RocksDB (see stall% here). This is a good way to measure QoS.
With overwrite for an IO-bound workload (iodir) with O_DIRECT there are two charts - one from the entire run and one from the last 300 seconds.
  • Average throughput is slightly better for RocksDB than for Speedb
  • QoS is slightly better for Speedb than for RocksDB (see the blue lines dropping to zero)
  • RocksDB does much better here with O_DIRECT than it does above without O_DIRECT
  • Speedb still does better than RocksDB at avoiding write stalls. See stall% here.
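The eyeball method can be backed by numbers computed from the per-second QPS samples. This is a rough sketch and not what db_bench reports: the function name, the 10% threshold and the sample data are all made up for illustration, and the near-zero percentage here is not the same thing as stall%.

```python
from statistics import mean, pstdev


def qos_summary(qps_per_second, near_zero_fraction=0.1):
    """Summarize throughput variance from QPS sampled at 1-second intervals.

    cv is the coefficient of variation (stddev / mean), lower is better.
    near_zero_pct is the percentage of seconds with QPS below
    near_zero_fraction * mean, a crude proxy for lines dropping to near zero.
    """
    avg = mean(qps_per_second)
    cv = pstdev(qps_per_second) / avg if avg > 0 else float("inf")
    near_zero = sum(1 for q in qps_per_second if q < near_zero_fraction * avg)
    return {
        "avg_qps": avg,
        "cv": cv,
        "near_zero_pct": 100.0 * near_zero / len(qps_per_second),
    }


# Hypothetical samples: same average QPS, very different QoS.
steady = [100, 100, 100, 100]
bursty = [200, 0, 200, 0]
print("steady", qos_summary(steady))
print("bursty", qos_summary(bursty))
```

Both samples have the same average, which is why average throughput alone hides the difference that the thickness of the lines shows.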
Sources of write stalls

The output from db_bench includes a summary of the sources of write stalls. However, that summary just shows the number of times each was invoked without telling you how much each contributes to the stall% (total percentage of time that write stalls are in effect).

For the IO-bound workloads (iobuf and iodir) the number of write stalls is much lower with Speedb. It appears to be more clever about managing write throughput.

The summary for iobuf with Speedb (speedb.udd1) and RocksDB (9.9.3)

speedb  rocksdb
1429      285   cf-l0-file-count-limit-delays-with-ongoing-compaction
   0        0   cf-l0-file-count-limit-stops-with-ongoing-compaction
1429      285   l0-file-count-limit-delays
   0        0   l0-file-count-limit-stops
   0        0   memtable-limit-delays
   0        0   memtable-limit-stops
1131    12585   pending-compaction-bytes-delays
   0        0   pending-compaction-bytes-stops
2560    12870   total-delays
   0        0   total-stops

The summary for iodir with Speedb (speedb.udd1) and RocksDB (9.9.3)

speedb  rocksdb
   5      287   cf-l0-file-count-limit-delays-with-ongoing-compaction
   0        0   cf-l0-file-count-limit-stops-with-ongoing-compaction
   5      287   l0-file-count-limit-delays
   0        0   l0-file-count-limit-stops
  22        0   memtable-limit-delays
   0        0   memtable-limit-stops
   0      687   pending-compaction-bytes-delays
   0        0   pending-compaction-bytes-stops
  27      974   total-delays
   0        0   total-stops
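To see which source dominates, the counts above can be turned into shares of total-delays. This sketch copies the non-zero delay rows from the two summaries. The cf-l0 row is left out because it repeats the l0-file-count-limit-delays count in these tables. As noted above these are counts of events and not time, so the shares say nothing about the contribution to stall%.

```python
# Delay counters copied from the two summaries above.
SUMMARIES = {
    "iobuf": {
        "speedb": {"l0-file-count-limit-delays": 1429, "memtable-limit-delays": 0,
                   "pending-compaction-bytes-delays": 1131, "total-delays": 2560},
        "rocksdb": {"l0-file-count-limit-delays": 285, "memtable-limit-delays": 0,
                    "pending-compaction-bytes-delays": 12585, "total-delays": 12870},
    },
    "iodir": {
        "speedb": {"l0-file-count-limit-delays": 5, "memtable-limit-delays": 22,
                   "pending-compaction-bytes-delays": 0, "total-delays": 27},
        "rocksdb": {"l0-file-count-limit-delays": 287, "memtable-limit-delays": 0,
                    "pending-compaction-bytes-delays": 687, "total-delays": 974},
    },
}


def delay_shares(counters):
    """Return each delay source as a fraction of total-delays."""
    total = counters["total-delays"]
    return {name: count / total
            for name, count in counters.items() if name != "total-delays"}


for workload, engines in SUMMARIES.items():
    for engine, counters in engines.items():
        shares = delay_shares(counters)
        text = ", ".join(f"{name}={share:.0%}" for name, share in shares.items())
        print(f"{workload} {engine}: {text}")
```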

Compaction efficiency

The final sample of compaction IO statistics at the end of the overwrite test is here for iobuf and iodir for speedb.udd1 and RocksDB 9.9.3.

For iobuf the wall clock time for which compaction threads are busy (the Comp(sec) column) is about 1.10X larger for RocksDB than Speedb. This is likely because Speedb is doing larger reads from storage (see the next section) so RocksDB has more IO waits. But the CPU time for compaction (the CompMergeCPU(sec) column) is about 1.12X larger for Speedb (I am not sure why).

For iodir the values in the Comp(sec) and CompMergeCPU(sec) columns are not as different across Speedb and RocksDB as they were above for iobuf.

Understanding IO via iostat

The numbers below are the average values from iostat collected during overwrite.

For iobuf the average read size (rareqsz) drops to 4.7KB in RocksDB 8.6 courtesy of issue 12038 while it was >= 50KB prior to RocksDB 8.6. The value improves to 34.2KB in RocksDB 9.9.3 but it is still much less than what it used to be in RocksDB 8.5.

For iodir the average read size (rareqsz) is ~112KB in all cases.

The RocksDB compaction threads read from the compaction input a RocksDB block at a time. For these tests I use an 8kb RocksDB block but in the IO-bound tests the blocks are compressed for the larger levels of the LSM tree, and a compressed block is ~4kb. Thus some kind of prefetching that does large read requests is needed to improve read IO efficiency.

Legend:
  • rps - reads/s
  • rMBps - read MB/s
  • rawait - read wait latency in milliseconds
  • rareqsz - read request size in KB
  • wps - writes/s
  • wMBps - write MB/s
  • wawait - write wait latency in milliseconds
  • wareqsz - write request size in KB
iobuf (buffered IO)
rps     rMBps   rawait  rareqsz wps     wMBps   wawait  wareqsz
6336    364.4   0.43    52.9    14344   1342.0  0.40    95.5    speedb.udd0
6209    357.6   0.41    51.6    14177   1322.8  0.41    95.3    speedb.udd1
2133    164.0   0.19    26.1    8954    854.4   0.31    99.3    rocksdb.7.3
4542    333.5   0.49    71.8    14734   1361.6  0.39    94.5    rocksdb.7.10
6101    352.1   0.42    52.4    14878   1391.8  0.41    96.1    rocksdb.8.5
40471   184.7   0.10    4.7     8552    784.1   0.31    93.6    rocksdb.8.6
39201   178.8   0.10    4.7     8783    801.5   0.32    93.7    rocksdb.9.7
7733    268.7   0.29    34.2    12742   1156.0  0.30    93.2    rocksdb.9.9

iodir (O_DIRECT)
rps     rMBps   rawait  rareqsz wps     wMBps   wawait  wareqsz
12757   1415.9  0.82    112.8   16642   1532.6  0.40    93.5    speedb.udd0
12539   1392.9  0.83    112.7   16327   1507.9  0.41    93.6    speedb.udd1
8090    903.5   0.74    114.3   10036   976.6   0.30    100.5   rocksdb.7.3
12155   1346.2  0.90    112.8   15602   1462.4  0.40    95.7    rocksdb.7.10
13315   1484.5  0.85    113.6   17436   1607.1  0.40    94.0    rocksdb.8.5
12978   1444.3  0.84    112.9   16981   1563.0  0.46    93.4    rocksdb.8.6
12641   1411.9  0.84    113.9   16217   1535.6  0.43    96.7    rocksdb.9.7
12990   1450.7  0.83    113.8   16704   1576.0  0.41    96.3    rocksdb.9.9
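The change in read request size for iobuf is easier to see relative to RocksDB 8.5. A short sketch that parses the iobuf rows from above (copied as-is) and prints rareqsz relative to rocksdb.8.5 along with how many reads of that size are needed per MB. The parsing code is mine and assumes whitespace-separated columns in the order shown in the header.

```python
COLUMNS = ["rps", "rMBps", "rawait", "rareqsz", "wps", "wMBps", "wawait", "wareqsz"]

IOBUF_IOSTAT = """
6336    364.4   0.43    52.9    14344   1342.0  0.40    95.5    speedb.udd0
6209    357.6   0.41    51.6    14177   1322.8  0.41    95.3    speedb.udd1
2133    164.0   0.19    26.1    8954    854.4   0.31    99.3    rocksdb.7.3
4542    333.5   0.49    71.8    14734   1361.6  0.39    94.5    rocksdb.7.10
6101    352.1   0.42    52.4    14878   1391.8  0.41    96.1    rocksdb.8.5
40471   184.7   0.10    4.7     8552    784.1   0.31    93.6    rocksdb.8.6
39201   178.8   0.10    4.7     8783    801.5   0.32    93.7    rocksdb.9.7
7733    268.7   0.29    34.2    12742   1156.0  0.30    93.2    rocksdb.9.9
"""


def parse_iostat(text):
    """Map each version name to a dict of its iostat averages."""
    rows = {}
    for line in text.strip().splitlines():
        *values, name = line.split()
        rows[name] = dict(zip(COLUMNS, map(float, values)))
    return rows


rows = parse_iostat(IOBUF_IOSTAT)
base = rows["rocksdb.8.5"]["rareqsz"]
for name, row in rows.items():
    size_kb = row["rareqsz"]
    print(f"{name:13s} rareqsz={size_kb:5.1f}KB "
          f"relative={size_kb / base:.2f} reads_per_MB={1024 / size_kb:.0f}")
```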

Update 1

For the overwrite test and the iobuf (buffered IO, IO-bound) workload the QPS is 277795 for RocksDB 8.5.4 vs 227068 for 9.9.3. So 9.9.3 gets ~82% of the QPS relative to 8.5.4. Note that 9.8 only gets about 57% and the improvement from 9.8 to 9.9 is from fixing issue 12038 but that fix isn't sufficient given the gap between 8.5 and 9.9.

With a hack I can improve the QPS for 9.9 from 227068/s to 241260/s. The problem, explained to me by the RocksDB team, is that while this code adjusts compaction readahead to be no larger than max_sectors_kb, the code that requests prefetch can request (X + Y) bytes where X is the adjusted compaction readahead amount and Y is the block size. The sum of these is likely to be larger than max_sectors_kb. So my hack was to reduce compaction readahead to (max_sectors_kb - 8kb) given that I am using block_size=8kb.
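The arithmetic behind the hack, as a sketch. The function names are hypothetical and this models the (X + Y) description above rather than the actual RocksDB prefetch code. The values are the ones for my setup: max_sectors_kb=128 and block_size=8kb.

```python
KB = 1024


def prefetch_request_bytes(readahead_bytes: int, block_size_bytes: int) -> int:
    """Model of a prefetch request: readahead amount (X) plus one block (Y)."""
    return readahead_bytes + block_size_bytes


def readahead_with_headroom(max_sectors_kb: int, block_size_bytes: int) -> int:
    """The hack: leave room for one block under the device limit."""
    return max_sectors_kb * KB - block_size_bytes


max_sectors_kb = 128
block_size = 8 * KB
limit = max_sectors_kb * KB

clamped = limit  # readahead adjusted to be no larger than max_sectors_kb
hacked = readahead_with_headroom(max_sectors_kb, block_size)

for label, readahead in (("clamped", clamped), ("hacked", hacked)):
    request = prefetch_request_bytes(readahead, block_size)
    print(f"{label}: readahead={readahead // KB}KB request={request // KB}KB "
          f"exceeds_limit={request > limit}")
```

With the clamp alone the request is 136KB, which is larger than the 128KB limit. With the hack the readahead is 120KB and the request is exactly 128KB.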

And with the hack the QPS for overwrite improves. The performance summary from the overwrite test is here and the stall% decreased from 25.9 to 22. Alas, the stall% was 9.0 with RocksDB 8.5.4 so there is still room for improvement.
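The QPS numbers from this update can be put on the same relative scale. A small sketch using only the three overwrite QPS values quoted above. The function names are mine, and the recovered fraction is defined here as (QPS with hack - QPS without) / (QPS for 8.5.4 - QPS without).

```python
QPS_854 = 277795      # RocksDB 8.5.4
QPS_99_ORIG = 227068  # RocksDB 9.9.3 without the hack
QPS_99_HACK = 241260  # RocksDB 9.9.3 with the hack


def relative_qps(qps: float, base: float) -> float:
    """QPS for a version divided by QPS for the base version."""
    return qps / base


def regression_recovered(base: float, regressed: float, improved: float) -> float:
    """Fraction of the gap between base and regressed that improved closes."""
    return (improved - regressed) / (base - regressed)


print(f"9.9orig vs 8.5.4: {relative_qps(QPS_99_ORIG, QPS_854):.3f}")
print(f"9.9hack vs 8.5.4: {relative_qps(QPS_99_HACK, QPS_854):.3f}")
print(f"gap recovered:    {regression_recovered(QPS_854, QPS_99_ORIG, QPS_99_HACK):.3f}")
```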

The write stall counters for 8.5.4, 9.9.3 without the hack (9.9orig) and 9.9.3 with the hack (9.9hack) provide a better idea of what has changed.

8.5.4   9.9orig 9.9hack
1089      285     211   cf-l0-file-count-limit-delays-with-ongoing-compaction
   0        0       0   cf-l0-file-count-limit-stops-with-ongoing-compaction
1089      285     211   l0-file-count-limit-delays
   0        0       0   l0-file-count-limit-stops
   0        0       0   memtable-limit-delays
   0        0       0   memtable-limit-stops
1207    12585   10735   pending-compaction-bytes-delays
   0        0       0   pending-compaction-bytes-stops
2296    12870   10946   total-delays
   0        0       0   total-stops

And finally, averages from iostat during the test show that 9.9.3 with the hack gets the largest average read request size (rareqsz is 56.4) but it still is slower than 8.5.4.

rps     rmbps   rawait  rareqsz wps     wmbps   wawait  wareqsz
6101    352.1   0.42    52.4    14878   1391.8  0.41    96.1    8.5.4
7733    268.7   0.29    34.2    12742   1156.0  0.30    93.2    9.9orig
5020    285.7   0.43    56.4    13467   1220.2  0.32    93.1    9.9hack
