Friday, December 2, 2022

Benchmark(et)ing RocksDB vs SplinterDB

While I am a huge fan of research papers presenting storage engines that claim to be better than RocksDB, I am always wary of the performance results. A paper can be great despite an imperfect performance evaluation, so pointing out the imperfections doesn't take away from the interesting ideas in the paper. Also, as a believer in the (C)RUM Conjecture I want to know how the new thing is both better and worse, but papers mostly focus on the better parts and don't highlight what isn't better.

One factor that determines the truthiness of a database benchmark is the number of DBMS that are compared. It is hard enough to gain expertise in one DBMS, and more DBMS == more chances of making a mistake. Here I present a result for RocksDB and SplinterDB. I am definitely not an expert on SplinterDB, so perhaps my results are more truthy than true.

I read the SplinterDB paper and hope their research continues. However, the paper didn't have enough detail on how the benchmark was done, so I had to guess. I wish more papers published artifacts (scripts, etc.) that made reproduction easier. I expect reproduction to be infrequent and don't want researchers to spend their time doing that, but the artifacts are nice to have.

tl;dr

  • For insert performance
    • It is risky to compare write performance between SplinterDB and RocksDB because SplinterDB doesn't force data to storage via fsync, fdatasync or msync, and that is documented.
    • For IO-bound - RocksDB with universal was the fastest
    • For cached - performance is similar between RocksDB and SplinterDB
  • For point query performance
    • For IO-bound - RocksDB was faster because it does less IO/query
    • For cached - SplinterDB was faster because it uses less CPU/query
  • For range query performance
    • For IO-bound - RocksDB was a lot faster because it does less IO/query
    • For cached - RocksDB was a lot faster because it uses less CPU/query

SplinterDB

I patched SplinterDB as of git hash a1833060. My patched SplinterDB branch is here. The patch includes:

  • debug printfs I added to understand the code
  • changes to stop the lookup tests after 1200 seconds
  • a fix for a memory leak that I reported and that has since been fixed upstream (issue 440)
  • a change to run only the Small range test for range queries
  • a change to make the payload size a constant rather than a range, to match what I do for RocksDB

Details

I used my small servers (4-core NUC, 16G RAM, NVMe SSD) with patched SplinterDB and RocksDB 7.6.0. The benchmarks were run with 1 thread for inserts and then 1 thread for queries. My test scripts are here.

The benchmark was run for two workloads: cached and IO-bound. Cached means cached by the DBMS and IO-bound means the database is larger than memory. The workload is simple: insert in random order, do point queries, do range queries where a range query fetches 10 values. I used the benchmark client provided by each DBMS: db_bench for RocksDB and driver_test splinter_test for SplinterDB. The scripts that have the config details are here for RocksDB and SplinterDB.
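To make the workload concrete, here is a minimal sketch of the three phases against a generic ordered key-value API. This is not the actual db_bench or splinter_test driver; the kv object, the 8-byte key encoding and the put/get/seek/next methods are assumptions for illustration.

import random

def run_workload(kv, num_rows, value_size, range_len=10):
    # Phase 1: insert num_rows keys in random order
    keys = list(range(num_rows))
    random.shuffle(keys)
    value = b"x" * value_size
    for k in keys:
        kv.put(k.to_bytes(8, "big"), value)
    # Phase 2: point queries with a uniform key access distribution
    for _ in range(num_rows):
        k = random.randrange(num_rows)
        kv.get(k.to_bytes(8, "big"))
    # Phase 3: range queries - seek to a random key, then fetch 10 values
    for _ in range(num_rows):
        it = kv.seek(random.randrange(num_rows).to_bytes(8, "big"))
        for _ in range(range_len):
            it.next()  # read the next value and advance the cursor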

For RocksDB I repeated tests with leveled and universal compaction.

Tests were done for the following sizes:
  • cached: 50M rows, 100-byte value
  • cached: 25M rows, 200-byte value
  • IO-bound: 2B rows, 100-byte value
  • IO-bound: 1B rows, 200-byte value
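Rough sizing shows why the first two are cached and the last two are IO-bound on a 16G server. This is back-of-the-envelope arithmetic; the ~20 bytes per row of key plus per-key overhead is an assumption, not a measured number.

# logical size in GB, assuming ~20 bytes/row for key + per-key overhead
def logical_gb(rows, value_size, key_overhead=20):
    return rows * (value_size + key_overhead) / 2**30

print(logical_gb(50_000_000, 100))     # ~5.6 GB  -> fits in 16G RAM: cached
print(logical_gb(25_000_000, 200))     # ~5.1 GB  -> cached
print(logical_gb(2_000_000_000, 100))  # ~224 GB  -> much larger than RAM: IO-bound
print(logical_gb(1_000_000_000, 200))  # ~205 GB  -> IO-bound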
To get results for RocksDB with universal I had to repeat the IO-bound tests at 75% of the number of rows listed above (1.5B for 100-byte, 750M for 200-byte) because issue 10533 caused me to get unexpected disk full errors.

For RocksDB I ran fillrandom to measure insert performance, then reloaded with fillseq to make sure all keys from 1..N exist. Next time I might use filluniquerandom (see here). I measured performance immediately after the load (well, after waiting for compaction debt to be reduced), again after flushing the memtable and L0, and for leveled compaction once more after also flushing the L1. Here I report performance after flushing the memtable and L0 (see here).
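The fillseq reload matters because fillrandom samples keys with replacement, so after N inserts only about 63% of the keys in 1..N exist. A quick sketch of that expectation (my arithmetic, not output from db_bench):

import random

# Expected key coverage after n inserts drawn uniformly with replacement
# from 1..n is 1 - (1 - 1/n)**n, which approaches 1 - 1/e (~0.632).
def coverage(n):
    seen = {random.randrange(n) for _ in range(n)}
    return len(seen) / n

print(coverage(1_000_000))                   # ~0.632
print(1 - (1 - 1 / 1_000_000) ** 1_000_000)  # ~0.632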

The point and range query benchmark steps ran for 1200 seconds each with SplinterDB and 1800 seconds each with RocksDB. Were I to repeat this I would use the same duration for both.

Differences

Things that differ between RocksDB and SplinterDB:
  • A uniform key access distribution was used by both benchmark clients. However, RocksDB does this with an RNG while SplinterDB uses something like hash(x) for x in 1..N. I didn't confirm it, but I am curious whether the point and range query tests see the same key sequence; if they do and the tests run for a short period of time, then the point query tests warm the cache for the range query tests, making results less truthy for IO-bound workloads (see the sketch after this list).
  • SplinterDB does not have a redo log (WAL), so I disabled the WAL for RocksDB.
  • SplinterDB does not sync (fsync, msync, fdatasync) writes and I did not try to mimic that with RocksDB. This means that write amplification will be under-stated and write throughput will be over-stated for SplinterDB when compared to RocksDB.
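Here is a sketch of the difference between the two key generators. The exact hash SplinterDB uses is not shown here and md5 is just a stand-in; the point is that a hashed counter is deterministic, so a later test phase that restarts the counter replays the same key sequence (and may find those keys already cached), while an RNG-driven client produces a different sequence in each phase unless the seed is reused.

import hashlib
import random

def hashed_keys(n):
    # deterministic: the same sequence every run and every phase
    for x in range(1, n + 1):
        yield hashlib.md5(x.to_bytes(8, "big")).digest()[:8]

def rng_keys(n, num_keys):
    # a different sequence for each phase (unless the seed is reused)
    for _ in range(n):
        yield random.randrange(num_keys).to_bytes(8, "big")

print(list(hashed_keys(3)) == list(hashed_keys(3)))                  # True
print(list(rng_keys(3, 1_000_000)) == list(rng_keys(3, 1_000_000)))  # almost surely False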
Results for cached

Legend:
* insert - inserts/second
* point - point queries/second
* range - range queries/second
* wamp - write-amplification during the inserts (bytes written to storage / bytes of logical data inserted)

50M rows, 100-byte values
---- ops/second ----
insert  point   range   wamp    dbms
533584  246964  31236   1.3     splinterdb
483519  191592  76600   4.5     rocksdb, leveled
488971  180769  57394   2.9     rocksdb, universal

25M rows, 200-byte values
---- ops/second ----
insert  point   range   wamp    dbms
474446  261444  33538   1.1     splinterdb
495325  188851  75122   3.7     rocksdb, leveled
500862  201667  83686   2.8     rocksdb, universal

Results for IO-bound

The performance can be explained by the amount of read IO per query.

IO reads per point query for 2B rows, 100-byte values:
  • 1.29 - SplinterDB
  • 0.98 - RocksDB, leveled
  • 1.00 - RocksDB, universal, 1.5B rows

IO reads per range query for 2B rows, 100-byte values:
  • 7.42 - SplinterDB
  • 2.34 - RocksDB, leveled
  • 3.24 - RocksDB, universal, 1.5B rows
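As a sanity check, the throughput gap roughly tracks the inverse of the reads per query. This is my arithmetic, using the per-query IO numbers above and the 2B-row throughput numbers from the table that follows:

# SplinterDB vs RocksDB leveled, IO-bound, 2B rows, 100-byte values
point_io  = {"splinterdb": 1.29, "rocksdb-leveled": 0.98}
range_io  = {"splinterdb": 7.42, "rocksdb-leveled": 2.34}
point_qps = {"splinterdb": 6110, "rocksdb-leveled": 8428}
range_qps = {"splinterdb": 1060, "rocksdb-leveled": 3056}

print(point_io["splinterdb"] / point_io["rocksdb-leveled"])    # ~1.3x more read IO
print(point_qps["rocksdb-leveled"] / point_qps["splinterdb"])  # ~1.4x more QPS
print(range_io["splinterdb"] / range_io["rocksdb-leveled"])    # ~3.2x more read IO
print(range_qps["rocksdb-leveled"] / range_qps["splinterdb"])  # ~2.9x more QPS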
2B rows, 100-byte values
---- ops/second ----
insert  point   range   wamp    dbms
308740  6110    1060     3.7    2B rows, splinterdb
181556  8428    3056    14.1    2B rows, rocksdb, leveled
205736  8404    3029    12.7    1.5B rows, rocksdb, leveled
393873  8144    1889     6.0    1.5B rows, rocksdb, universal

1B rows, 200-byte values
---- ops/second ----
insert  point   range   wamp    dbms
221007  7175     908     3.7    1B rows, splinterdb
107692  8393    2436    13.0    1B rows, rocksdb, leveled
121519  8159    2519    11.7    750M rows, rocksdb, leveled
324555  8621    2763     5.3    750M rows, rocksdb universal

Details

For SplinterDB with 2B rows and 100-byte values the tree shape was:

Space used by level: trunk_tree_height=4
0:   191168 MiB
1:    34554 MiB
2:     6586 MiB
3:     1307 MiB
4:      206 MiB

And the LSM tree shape for RocksDB with 2B rows, 100-byte values and leveled compaction uses long lines, so that is in a gist.

