I am happy to read about storage engines that claim to be faster than RocksDB. Sometimes the claims are true and might lead to ideas for making RocksDB better. I am wary about evaluating such claims because that takes a lot of time, and when a claim is bogus I am reluctant to blog about it because I don't want to punch down on a startup.
Here I share results from the RocksDB benchmark scripts to compare Speedb and RocksDB and I am happy to claim that Speedb does some things better than RocksDB.
tl;dr
- RocksDB and Speedb have similar average throughput for ...
- a cached database
- an IO-bound database when using O_DIRECT
- RocksDB 8.6+ is slower than Speedb for write-heavy workloads with an IO-bound database when O_DIRECT isn't used.
- This problem arrived in RocksDB 8.6 (see issue 12038), which introduced the use of the readahead system call to prefetch data that compaction will soon read. I am not sure what was used prior. The regression that arrived in 8.6 was partially fixed in release 9.9.
- In general, Speedb has better QoS (less throughput variance) for write-heavy workloads
- Update 1 - with a minor hack I can remove about 1/3 of the regression between RocksDB 8.5 and 9.9 in the overwrite benchmark for the iobuf workload
On issue 12038
RocksDB 8.6 switched to using the readahead system call to prefetch SSTs that will soon be read by compaction. The goal is to reduce the time that compaction threads must wait for data. But what I see with iostat is that a readahead call is ignored when the value of the count argument is larger than max_sectors_kb for the storage device, and this happens on one or both of ext4 and xfs. I am not a kernel guru and I have yet to read this nice writeup of readahead internals. I do read this great note from Jens Axboe every few years.
I opened issue 12038 for this problem and it was fixed in RocksDB 9.9 by adding code that reduces the value of compaction_readahead_size to be <= the value of max_sectors_kb for the database's storage device. However, the fix in 9.9 doesn't restore the performance that existed prior to the change (see 8.5 results). I assume the real fix is for RocksDB to do the prefetches itself rather than rely on the readahead system call.
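The adjustment described above can be sketched in a few lines. This is not the RocksDB code; the function name is hypothetical and the 2 MiB request is only an illustrative value. It shows the idea of capping the readahead size at the device limit.

```python
def clamp_readahead_bytes(requested_bytes: int, max_sectors_kb: int) -> int:
    """Cap a readahead size at the device's max_sectors_kb limit.

    Hypothetical helper for illustration; RocksDB's real logic may differ.
    """
    limit_bytes = max_sectors_kb * 1024
    return min(requested_bytes, limit_bytes)


# Illustrative request of 2 MiB on a device where max_sectors_kb is 128.
print(clamp_readahead_bytes(2 * 1024 * 1024, 128))  # prints 131072
```

A request that is already within the limit is returned unchanged.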
Hardware
The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.
The values of max_sectors_kb and max_hw_sectors_kb are 128 (KB) for both the SW RAID device (md2) and the underlying storage devices (nvme0n1, nvme1n1).
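One way to check these limits on Linux is to read them from sysfs. The sketch below assumes the usual /sys/block/<device>/queue/ layout; the helper name is hypothetical, and it returns None when a file is missing so it can run on machines that lack these devices.

```python
from pathlib import Path
from typing import Optional


def read_queue_limit_kb(device: str, name: str,
                        sysfs_root: str = "/sys/block") -> Optional[int]:
    """Read an integer queue limit (in KB) for a block device, or None."""
    path = Path(sysfs_root) / device / "queue" / name
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # The device or attribute does not exist, or is not readable.
        return None


# Device names are the ones from this server; adjust for other hosts.
for dev in ("md2", "nvme0n1", "nvme1n1"):
    soft = read_queue_limit_kb(dev, "max_sectors_kb")
    hard = read_queue_limit_kb(dev, "max_hw_sectors_kb")
    print(f"{dev}: max_sectors_kb={soft} max_hw_sectors_kb={hard}")
```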
- fillseq -- load in key order with the WAL disabled
- revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
- fwdrangeww -- like revrangeww except do short forward range scans
- readww -- like revrangeww except do point queries
- overwrite -- do overwrites (Put) as fast as possible
There are three workloads, all of which use 40 threads:
- byrx - the database is cached by RocksDB (100M KV pairs)
- iobuf - the database is larger than memory and RocksDB uses buffered IO (2B KV pairs)
- iodir - the database is larger than memory and RocksDB uses O_DIRECT (2B KV pairs)
These charts plot relative QPS by test where relative QPS is (QPS for a given version / QPS for speedb.udd0). When this value is less than 1.0 the given version is slower than speedb.udd0. When the value is greater than 1.0 the given version is faster than speedb.udd0. The base case is speedb.udd0, which is Speedb with use_dynamic_delay=0. The versions listed in the charts are:
- speedb.udd1 - Speedb with use_dynamic_delay=1
- rocksdb.7.3 - RocksDB 7.3.2
- rocksdb.7.10 - RocksDB 7.10.2
- rocksdb.8.5 - RocksDB 8.5.4
- rocksdb.8.6 - RocksDB 8.6.7
- rocksdb.9.7 - RocksDB 9.7.4
- rocksdb.9.9 - RocksDB 9.9.3
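The relative QPS used in the charts is a simple ratio against the base case. A minimal sketch follows; the function name is hypothetical and the QPS numbers are made up for illustration, they are not benchmark results.

```python
def relative_qps(qps: float, base_qps: float) -> float:
    """Return qps divided by the QPS of the base case (speedb.udd0 here)."""
    if base_qps <= 0:
        raise ValueError("base_qps must be positive")
    return qps / base_qps


# Made-up numbers, only to show how to read the ratio.
base = 200_000.0
for name, qps in (("slower.version", 100_000.0), ("faster.version", 250_000.0)):
    ratio = relative_qps(qps, base)
    verdict = "slower" if ratio < 1.0 else "faster or equal"
    print(f"{name}: {ratio:.2f} ({verdict} than the base case)")
```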
- RocksDB is faster at fillseq (load in key order), otherwise modern RocksDB and Speedb have similar average throughput
- RocksDB 7.3 was much slower on the read while writing tests (revrangeww, fwdrangeww and readww) but that was fixed by 7.10 and might have been related to issue 9423 or it might be from improvements to the hyper clock cache.
- Modern RocksDB is faster than Speedb at fillseq, has similar perf on the read while writing tests and is slower on overwrite (write-only with keys in random order). The difference in overwrite perf is probably from issue 12038, which arrived in RocksDB 8.6 and then got a fix in 9.9.
- Similar to above, RocksDB 7.3 has a few perf problems that have since been fixed
- Modern RocksDB is much faster than Speedb at fillseq and then has similar perf on the other tests
- Similar to above, RocksDB 7.3 has a few perf problems that have since been fixed
- speedb.udd1 - Speedb with use_dynamic_delay=1
- rocksdb.9.9.3 - RocksDB 9.9.3
- For the cached workload the better perf from RocksDB is obvious
- For the IO-bound workloads the results are closer
- Variance is less with Speedb than with RocksDB based on the thickness of the lines
- Using the eyeball method (much hand waving) the variance is similar for RocksDB and Speedb. Both suffer from write stalls.
- Response time percentiles (see p50, p99, p99.9, p99.99 and pmax here) where pmax is <= 57 milliseconds for everything but RocksDB 7.3. Note the numbers in the gist are in usecs.
- Speedb has slightly better average throughput
- Speedb has much better QoS (less variance). That is obvious based on the thickness of the red line vs the blue line. RocksDB also has more write stalls based on the number of times the blue lines drop to near zero.
- Using the response time percentiles here the differences between Speedb and RocksDB are less obvious.
- The percentage of time that writes are stalled is <= 9% for Speedb and >= 20% for modern RocksDB (see stall% here). This is a good way to measure QoS.
- Average throughput is slightly better for RocksDB than for Speedb
- QoS is slightly better for Speedb than for RocksDB (see the blue lines dropping to zero)
- RocksDB does much better here with O_DIRECT than it does above without O_DIRECT
- Speedb still does better than RocksDB at avoiding write stalls. See stall% here.
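The stall% metric used above is the percentage of elapsed time during which writes were stalled. A minimal sketch of that calculation follows; the function name is hypothetical and the durations are made up so that the results land on the 9% and 20% thresholds mentioned earlier.

```python
def stall_percent(stalled_seconds: float, elapsed_seconds: float) -> float:
    """Percentage of elapsed time during which writes were stalled."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return 100.0 * stalled_seconds / elapsed_seconds


# Made-up durations, not measurements from these tests.
print(stall_percent(90, 1000))   # prints 9.0
print(stall_percent(200, 1000))  # prints 20.0
```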
For iodir the values in the Comp(sec) and CompMergeCPU(sec) columns are not as different across Speedb and RocksDB as they were above for iobuf.
The numbers below are the average values from iostat collected during overwrite.
For iobuf the average read size (rareqsz) is ~112KB in all cases.
Legend:
- rps - reads/s
- rMBps - read MB/s
- rawait - read wait latency in milliseconds
- rareqsz - read request size in KB
- wps - writes/s
- wMBps - write MB/s
- wawait - write wait latency in milliseconds
- wareqsz - write request size in KB
With a hack I can improve the QPS for 9.9 from 227068/s to 241260/s. The problem, explained to me by the RocksDB team, is that while this code adjusts compaction readahead to be no larger than max_sectors_kb, the code that requests prefetch can request (X + Y) bytes where X is the adjusted compaction readahead amount and Y is the block size. The sum of these is likely to be larger than max_sectors_kb. So my hack was to reduce compaction readahead to (max_sectors_kb - 8KB) given that I am using block_size=8KB.
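The arithmetic behind the hack is easy to check. The sketch below uses the values from this setup (max_sectors_kb=128, block_size=8KB); the function names are hypothetical and this is not the RocksDB code, only the X + Y calculation described above.

```python
KB = 1024


def prefetch_request_bytes(readahead_bytes: int, block_size_bytes: int) -> int:
    """Size of a prefetch request: readahead amount (X) plus block size (Y)."""
    return readahead_bytes + block_size_bytes


def hacked_readahead_bytes(max_sectors_kb: int, block_size_kb: int) -> int:
    """Readahead reduced by one block so that X + Y fits in max_sectors_kb."""
    return (max_sectors_kb - block_size_kb) * KB


max_sectors_kb = 128
block_size_kb = 8
limit = max_sectors_kb * KB

# Readahead clamped to max_sectors_kb only: the request exceeds the limit.
orig = prefetch_request_bytes(limit, block_size_kb * KB)
print(orig, orig > limit)  # prints 139264 True

# With the hack: the request is exactly at the limit.
hack = prefetch_request_bytes(
    hacked_readahead_bytes(max_sectors_kb, block_size_kb), block_size_kb * KB)
print(hack, hack > limit)  # prints 131072 False
```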
And with the hack the QPS for overwrite improves. The performance summary from the overwrite test is here and the stall% decreased from 25.9 to 22. Alas, the stall% was 9.0 with RocksDB 8.5.4 so there is still room for improvement.
The compaction reasons for 8.5.4, 9.9.3 without the hack (9.9orig) and 9.9.3 with the hack (9.9hack) provide a better idea of what has changed.