Thursday, December 26, 2024

Speedb vs RocksDB on a large server

I am happy to read about storage engines that claim to be faster than RocksDB. Sometimes the claims are true and might lead to ideas for making RocksDB better. I am wary about evaluating such claims because that takes a lot of time, and when a claim is bogus I am reluctant to blog about it because I don't want to punch down on a startup.

Here I share results from the RocksDB benchmark scripts to compare Speedb and RocksDB and I am happy to claim that Speedb does some things better than RocksDB.

tl;dr

  • RocksDB and Speedb have similar average throughput for ...
    • a cached database
    • an IO-bound database when using O_DIRECT
  • RocksDB 8.6+ is slower than Speedb for write-heavy workloads with an IO-bound database when O_DIRECT isn't used. 
    • This problem arrived in RocksDB 8.6 (see issue 12038), which introduced the use of the readahead system call to prefetch data that compaction will soon read. I am not sure what was used prior. The regression that arrived in 8.6 was partially fixed in release 9.9.
  • In general, Speedb has better QoS (less throughput variance) for write-heavy workloads

On issue 12038

RocksDB 8.6 switched to using the readahead system call to prefetch SSTs that will soon be read by compaction. The goal is to reduce the time that compaction threads must wait for data. But what I see with iostat is that a readahead call is ignored when the value of the count argument is larger than max_sectors_kb for the storage device. This happens on one or both of ext4 and XFS. I am not a kernel guru and I have yet to read this nice writeup of readahead internals. I do read this great note from Jens Axboe every few years.

I opened issue 12038 for this and it was fixed in RocksDB 9.9 by adding code that reduces the value of compaction_readahead_size to be <= the value of max_sectors_kb for the database's storage device. However, the fix in 9.9 doesn't restore the performance that existed prior to the change (see the 8.5 results). I assume the real fix is to have code in RocksDB do the prefetches rather than rely on the readahead system call.
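To make that concrete, here is a minimal sketch (mine, not code from the RocksDB fix) that checks whether a configured compaction_readahead_size exceeds max_sectors_kb for the device that holds the database (md2 on this server, see the Hardware section below):
    dev=md2                                # SW RAID device that holds the database
    readahead_bytes=$((2 * 1024 * 1024))   # compaction_readahead_size = 2MB
    max_kb=$(cat /sys/block/$dev/queue/max_sectors_kb)
    if [ $(( readahead_bytes / 1024 )) -gt "$max_kb" ]; then
        echo "readahead of $readahead_bytes bytes exceeds max_sectors_kb ($max_kb KB) and may be ignored"
    fi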

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

The values of max_sectors_kb and max_hw_sectors_kb are 128 (KB) for both the SW RAID device (md2) and the underlying NVMe devices (nvme0n1, nvme1n1).
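Those values come from sysfs and are easy to check, assuming the same device names as on this server:
    for d in md2 nvme0n1 nvme1n1; do
        echo "$d: max_sectors_kb=$(cat /sys/block/$d/queue/max_sectors_kb) max_hw_sectors_kb=$(cat /sys/block/$d/queue/max_hw_sectors_kb)"
    done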

Builds

I compiled db_bench from source. I used RocksDB versions 7.3.2, 7.10.2, 8.5.4, 8.6.7, 9.7.4 and 9.9.3. 
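I don't list the exact build commands here, but a minimal sketch of building an optimized db_bench from a release tag looks like this:
    git clone https://github.com/facebook/rocksdb.git
    cd rocksdb
    git checkout v8.6.7                      # repeat for each version tested
    make DEBUG_LEVEL=0 db_bench -j$(nproc)   # optimized (non-debug) build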

For Speedb I used the latest diff in their github repo which appears to be based on RocksDB 8.6.7, although I am confused that it doesn't suffer from issue 12038 which is in RocksDB 8.6.7.

commit 8d850b666cce6f39fbd4064e80b85f9690eaf385 (HEAD -> main, origin/main, origin/HEAD)
Author: udi-speedb <106253580+udi-speedb@users.noreply.github.com>
Date:   Mon Mar 11 14:00:03 2024 +0200

    Support Speedb's Paired Bloom Filter in db_bloom_filter_test (#810)

Benchmark

All tests used 2MB for compaction_readahead_size and the hyper clock block cache.

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 40 threads. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is: 
    bash x3.sh 40 no 1800 c48r128 100000000 2000000000 iobuf iobuf iodir

The tests on the charts are named as:
  • fillseq -- load in key order with the WAL disabled
  • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
  • fwdrangeww -- like revrangeww except do short forward range scans
  • readww -- like revrangeww except do point queries
  • overwrite -- do overwrites (Put) as fast as possible
For configuration options that Speedb and RocksDB have in common, I set them to the same values. I didn't experiment with Speedb-only options except for use_dynamic_delay.
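As a sketch of what that means on the db_bench command line (the full option lists are in the benchmark scripts, the cache_type value varies by version, and I assume the db_bench flag name for the Speedb-only option matches the option name):
    # options set to the same values for RocksDB and Speedb
    --threads=40 --compaction_readahead_size=2097152 --cache_type=hyper_clock_cache
    # Speedb-only, tested with both 0 and 1
    --use_dynamic_delay=1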

Workloads

There are three workloads, all of which use 40 threads:

  • byrx - the database is cached by RocksDB (100M KV pairs)
  • iobuf - the database is larger than memory and RocksDB uses buffered IO (2B KV pairs)
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT (2B KV pairs)

Spreadsheets with charts are here and here. Performance summaries are here for byrx, iobuf and iodir.

Results: average throughput

These charts plot relative QPS by test where relative QPS is (QPS for a given version / QPS for speedb.udd0). When this value is less than 1.0 the given version is slower than speedb.udd0, and when it is greater than 1.0 the given version is faster than speedb.udd0. The base case is speedb.udd0, which is Speedb with use_dynamic_delay=0. The versions listed in the charts are:
  • speedb.udd1 - Speedb with use_dynamic_delay=1
  • rocksdb.7.3 - RocksDB 7.3.2
  • rocksdb.7.10 - RocksDB 7.10.2
  • rocksdb.8.5 - RocksDB 8.5.4
  • rocksdb.8.6 - RocksDB 8.6.7
  • rocksdb.9.7 - RocksDB 9.7.4
  • rocksdb.9.9 - RocksDB 9.9.3
For a cached workload (byrx):
  • RocksDB is faster at fillseq (load in key order), otherwise modern RocksDB and Speedb have similar average throughput
  • RocksDB 7.3 was much slower on the read while writing tests (revrangeww, fwdrangeww and readww) but that was fixed by 7.10 and might have been related to issue 9423 or it might be from improvements to the hyper clock cache.
For an IO-bound workload that doesn't use O_DIRECT (iobuf):
  • Modern RocksDB is faster than Speedb at fillseq, has similar perf on the read while writing tests and is slower on overwrite (write-only with keys in random order). The difference in overwrite perf is probably from issue 12038 which arrives in RocksDB 8.6 and then has a fix in 9.9.
  • Similar to the above, RocksDB 7.3 has a few perf problems that have since been fixed.
For an IO-bound workload that uses O_DIRECT (iodir):
  • Modern RocksDB is much faster than Speedb at fillseq and then has similar perf on the other tests.
  • Similar to above RocksDB 7.3 has a few perf problems that have since been fixed.

Results: throughput over time

The previous section shows average throughput. And while more average QPS is nice, if that comes with more variance then it is less than great. The charts in this section show QPS at 1-second intervals for fillseq and overwrite for two releases:
  • speedb.udd1 - Speedb with use_dynamic_delay=1
  • rocksdb.9.9.3 - RocksDB 9.9.3

fillseq (load in key order) for byrx, iobuf and iodir
  • For the cached workload the better perf from RocksDB is obvious
  • For the IO-bound workloads the results are closer
  • Variance is less with Speedb than with RocksDB based on the thickness of the lines
With overwrite for a cached workload (byrx) there are two charts - one from the entire run and one from the last 300 seconds.
  • Using the eyeball method (much hand waving) the variance is similar for RocksDB and Speedb. Both suffer from write stalls.
  • Response time percentiles (see p50, p99, p99.9, p99.99 and pmax here) show that pmax is <= 57 milliseconds for everything but RocksDB 7.3. Note the numbers in the gist are in usecs.
With overwrite for an IO-bound workload (iobuf) without O_DIRECT there are two charts - one from the entire run and one from the last 300 seconds.
  • Speedb has slightly better average throughput
  • Speedb has much better QoS (less variance). That is obvious based on the thickness of the red line vs the blue line. RocksDB also has more write stalls based on the number of times the blue lines drop to near zero.
  • Using the response time percentiles here the differences between Speedb and RocksDB are less obvious.
  • The percentage of time that writes are stalled is <= 9% for Speedb and >= 20% for modern RocksDB (see stall% here). This is a good way to measure QoS.
With overwrite for an IO-bound workload (iodir) with O_DIRECT there are two charts - one from the entire run and one from the last 300 seconds.
  • Average throughput is slightly better for RocksDB than for Speedb
  • QoS is slightly better for Speedb than for RocksDB (see the blue lines dropping to zero)
  • RocksDB does much better here with O_DIRECT than it does above without O_DIRECT
  • Speedb still does better than RocksDB at avoiding write stalls. See stall% here.
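The stall% metric is the percentage of time that writes were stalled. RocksDB reports the underlying number in its DB stats as the Cumulative stall line, so one way to see it yourself, assuming db_bench was run with periodic stats dumps, is to check the output or the LOG file in the database directory:
    # the line reports total stall time and the percentage of time stalled,
    # roughly: Cumulative stall: <H:M:S>, <pct> percent
    grep "Cumulative stall" /path/to/db/LOG | tail -1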

Sources of write stalls

The output from db_bench includes a summary of the sources of write stalls. However, that summary just shows the number of times each was invoked without telling you how much each contributes to the stall% (total percentage of time that write stalls are in effect).

For the IO-bound workloads (iobuf and iodir) the number of write stalls is much lower with Speedb. It appears to be more clever about managing write throughput.

The summary for iobuf with Speedb (speedb.udd1) and RocksDB (9.9.3)

speedb  rocksdb
1429      285   cf-l0-file-count-limit-delays-with-ongoing-compaction
   0        0   cf-l0-file-count-limit-stops-with-ongoing-compaction
1429      285   l0-file-count-limit-delays
   0        0   l0-file-count-limit-stops
   0        0   memtable-limit-delays
   0        0   memtable-limit-stops
1131    12585   pending-compaction-bytes-delays
   0        0   pending-compaction-bytes-stops
2560    12870   total-delays
   0        0   total-stops

The summary for iodir with Speedb (speedb.udd1) and RocksDB (9.9.3)

speedb  rocksdb
   5      287   cf-l0-file-count-limit-delays-with-ongoing-compaction
   0        0   cf-l0-file-count-limit-stops-with-ongoing-compaction
   5      287   l0-file-count-limit-delays
   0        0   l0-file-count-limit-stops
  22        0   memtable-limit-delays
   0        0   memtable-limit-stops
   0      687   pending-compaction-bytes-delays
   0        0   pending-compaction-bytes-stops
  27      974   total-delays
   0        0   total-stops

Compaction efficiency

The final sample of compaction IO statistics at the end of the overwrite test is here for iobuf and iodir for speedb.udd1 and RocksDB 9.9.3.

For iobuf the wall clock time for which compaction threads are busy (the Comp(sec) column) is about 1.10X larger for RocksDB than Speedb. This is likely because Speedb is doing larger reads from storage (see the next section) so RocksDB has more IO waits. But the CPU time for compaction (the CompMergeCPU(sec) column) is about 1.12X larger for Speedb (I am not sure why).

For iodir the values in the Comp(sec) and CompMergeCPU(sec) columns are not as different across Speedb and RocksDB as they were above for iobuf.

Understanding IO via iostat

The numbers below are the average values from iostat collected during overwrite.
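The fields in the legend and tables below map to the iostat -x columns (r/s, rMB/s, r_await, rareq-sz and their write-side equivalents). The collection was done by the benchmark scripts but is roughly equivalent to:
    # extended stats in MB, one sample per second, for the RAID device
    iostat -xm 1 md2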

For iobuf the average read size (rareqsz) drops to 4.7KB in RocksDB 8.6 courtesy of issue 12038 while it was >= 50KB prior to RocksDB 8.6. The value improves to 34.2KB in RocksDB 9.9.3 but it is still much less than what it used to be in RocksDB 8.5.

For iodir the average read size (rareqsz) is ~112KB in all cases.

The RocksDB compaction threads read from the compaction input a RocksDB block at a time. For these tests I use an 8KB RocksDB block but in the IO-bound tests the blocks are compressed for the larger levels of the LSM tree, and a compressed block is ~4KB. Thus some kind of prefetching that does large read requests is needed to improve read IO efficiency.
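For reference, the relevant db_bench options look something like this (the 8KB block size is from above; lz4 and the level at which compression starts are assumptions here, the exact values are in the benchmark scripts):
    # 8KB uncompressed blocks, compression only for the larger LSM levels
    --block_size=8192 --compression_type=lz4 --min_level_to_compress=3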

Legend:
  • rps - reads/s
  • rMBps - read MB/s
  • rawait - read wait latency in milliseconds
  • rareqsz - read request size in KB
  • wps - writes/s
  • wMBps - write MB/s
  • wawait - write wait latency in milliseconds
  • wareqsz - write request size in KB

iobuf (buffered IO)
rps     rMBps   rawait  rareqsz wps     wMBps   wawait  wareqsz
6336    364.4   0.43    52.9    14344   1342.0  0.40    95.5    speedb.udd0
6209    357.6   0.41    51.6    14177   1322.8  0.41    95.3    speedb.udd1
2133    164.0   0.19    26.1    8954    854.4   0.31    99.3    rocksdb.7.3
4542    333.5   0.49    71.8    14734   1361.6  0.39    94.5    rocksdb.7.10
6101    352.1   0.42    52.4    14878   1391.8  0.41    96.1    rocksdb.8.5
40471   184.7   0.10    4.7     8552    784.1   0.31    93.6    rocksdb.8.6
39201   178.8   0.10    4.7     8783    801.5   0.32    93.7    rocksdb.9.7
7733    268.7   0.29    34.2    12742   1156.0  0.30    93.2    rocksdb.9.9

iodir (O_DIRECT)
rps     rMBps   rawait  rareqsz wps     wMBps   wawait  wareqsz
12757   1415.9  0.82    112.8   16642   1532.6  0.40    93.5    speedb.udd0
12539   1392.9  0.83    112.7   16327   1507.9  0.41    93.6    speedb.udd1
8090    903.5   0.74    114.3   10036   976.6   0.30    100.5   rocksdb.7.3
12155   1346.2  0.90    112.8   15602   1462.4  0.40    95.7    rocksdb.7.10
13315   1484.5  0.85    113.6   17436   1607.1  0.40    94.0    rocksdb.8.5
12978   1444.3  0.84    112.9   16981   1563.0  0.46    93.4    rocksdb.8.6
12641   1411.9  0.84    113.9   16217   1535.6  0.43    96.7    rocksdb.9.7
12990   1450.7  0.83    113.8   16704   1576.0  0.41    96.3    rocksdb.9.9

