Friday, December 2, 2022

Benchmark(et)ing RocksDB vs SplinterDB

While I am a huge fan of research papers presenting storage engines that claim to be better than RocksDB, I am always wary of the performance results. A paper can be great despite an imperfect performance evaluation, so pointing out the imperfections doesn't take away from the interesting ideas in the paper. Also, as a believer in the (C)RUM Conjecture I want to know how the new thing is both better and worse, but papers mostly focus on the better parts and don't highlight what isn't better.

One factor that determines the truthiness of a database benchmark is the number of DBMS that are compared. It is hard enough to get expertise in one DBMS and more DBMS == more chance of making a mistake. Here I present a result for RocksDB and SplinterDB. I am definitely not an expert on SplinterDB. Perhaps my results are more truthy than true.

I read the SplinterDB paper and hope their research continues. However, the paper didn't have enough detail on how the benchmark was done so I had to guess. I wish more papers published artifacts (scripts, etc) that made reproduction easier. I expect reproduction to be infrequent and don't want researchers to spend their time doing that, but the artifacts are nice to have.

tl;dr

  • For insert performance
    • It is risky to compare write performance between SplinterDB and RocksDB because SplinterDB doesn't force data to storage via fsync, fdatasync or msync, and that is documented.
    • For IO-bound - RocksDB with universal was the fastest
    • For cached - performance is similar between RocksDB and SplinterDB
  • For point query performance
    • For IO-bound - RocksDB was faster because it does less IO/query
    • For cached - SplinterDB was faster because it uses less CPU/query
  • For range query performance
    • For IO-bound - RocksDB was a lot faster because it does less IO/query
    • For cached - RocksDB was a lot faster because it uses less CPU/query

SplinterDB

I patched SplinterDB as of git hash a1833060. My patched SplinterDB branch is here. The patch includes:

  • debug printfs I added to understand the code
  • changes to stop the lookup tests after 1200 seconds
  • a fix for a memory leak that I reported and that has since been fixed upstream (issue 440)
  • hardwired to only run the Small range test for range queries
  • make the payload size a constant rather than a range to match what I do for RocksDB

Details

I used my small servers (4-core NUC, 16G RAM, NVMe SSD) with the patched SplinterDB described above and RocksDB 7.6.0. The benchmarks were run with 1 thread for inserts and then 1 thread for queries. My test scripts are here.

The benchmark was run for two workloads: cached and IO-bound. Cached means the database is cached by the DBMS and IO-bound means the database is larger than memory. The workload is simple: insert in random order, then do point queries, then do range queries where each range query fetches 10 values. I used the benchmark client provided by each DBMS: db_bench for RocksDB and driver_test splinter_test for SplinterDB. The scripts that have the config details are here for RocksDB and SplinterDB.
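
To make the query steps concrete, here is a minimal sketch of what a point query and a range query do, written against the RocksDB C++ API. This is not the db_bench code, just an illustration, and MakeKey is a hypothetical helper that formats key number k the way the benchmark client would.

#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include "rocksdb/db.h"

// Hypothetical helper: format key number k as the benchmark client would.
std::string MakeKey(uint64_t k);

// Point query: a Get on one key.
void PointQuery(rocksdb::DB* db, uint64_t k) {
  std::string value;
  rocksdb::Status s = db->Get(rocksdb::ReadOptions(), MakeKey(k), &value);
  assert(s.ok() || s.IsNotFound());
}

// Range query: Seek to a key, then fetch 10 values.
void RangeQuery(rocksdb::DB* db, uint64_t k) {
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
  int fetched = 0;
  for (it->Seek(MakeKey(k)); it->Valid() && fetched < 10; it->Next()) {
    rocksdb::Slice v = it->value();  // consume the value
    (void)v;
    ++fetched;
  }
  assert(it->status().ok());
}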

For RocksDB I repeated tests with leveled and universal compaction.
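
For reference, a minimal sketch of how the two compaction styles are selected via RocksDB Options. The actual configurations in my scripts set many more options than this.

#include "rocksdb/options.h"

// Leveled compaction (the RocksDB default).
rocksdb::Options MakeLeveledOptions() {
  rocksdb::Options options;
  options.compaction_style = rocksdb::kCompactionStyleLevel;
  return options;
}

// Universal (tiered) compaction.
rocksdb::Options MakeUniversalOptions() {
  rocksdb::Options options;
  options.compaction_style = rocksdb::kCompactionStyleUniversal;
  return options;
}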

Tests were done for the following sizes:
  • cached: 50M rows, 100-byte value
  • cached: 25M rows, 200-byte value
  • IO-bound: 2B rows, 100-byte value
  • IO-bound: 1B rows, 200-byte value

To get results for RocksDB with universal I had to repeat the IO-bound tests at 75% of the number of rows listed above (1.5B for 100-byte, 750M for 200-byte) because issue 10533 caused me to get unexpected disk full errors.
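
As a back-of-envelope check on the cached vs IO-bound split, the logical database sizes work out roughly as follows. The key size and the lack of compression are my assumptions, so the on-disk sizes differ, but the order of magnitude vs 16G of RAM is the point.

#include <cstdio>

// Rough logical database sizes for the four configurations.
// Assumptions (mine): ~20-byte keys, no compression, metadata ignored.
int main() {
  struct Config { double rows; double value_bytes; };
  const Config configs[] = {{50e6, 100}, {25e6, 200}, {2e9, 100}, {1e9, 200}};
  const double key_bytes = 20;
  for (const Config& c : configs) {
    double gb = c.rows * (key_bytes + c.value_bytes) / 1e9;
    std::printf("%5.0fM rows x %3.0f-byte values -> ~%3.0f GB\n",
                c.rows / 1e6, c.value_bytes, gb);
  }
  // ~6 GB and ~5.5 GB fit in 16G of RAM (cached),
  // while ~240 GB and ~220 GB do not (IO-bound).
  return 0;
}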

For RocksDB I did fillrandom to measure insert performance, then reloaded with fillseq to make sure all keys from 1..N would exist. Next time I might use filluniquerandom (see here). I measured performance immediately after the load (well, after waiting for compaction debt to reduce), then again after flushing the memtable and L0, and for leveled compaction I then flushed the L1 and measured perf again. Here I report performance after flushing the memtable and L0 (see here).

The point and range query benchmark steps ran for 1200 seconds each with SplinterDB and 1800 seconds each with RocksDB. Were I to repeat this I would use the same duration for both.

Differences

Things that differ between RocksDB and SplinterDB:
  • A uniform key access distribution was used by both benchmark clients. However, RocksDB does this by using an RNG while SplinterDB uses something like hash(x) for x in 1 ... N (see the sketch after this list). I didn't confirm it, but I am curious whether the point and range query tests see the same key sequence; if they do and the tests run for a short period of time, then the point query tests warm the cache for the range query tests, making results less truthy for IO-bound workloads.
  • SplinterDB does not have redo/WAL so I disabled that for RocksDB. 
  • SplinterDB does not sync (fsync, msync, fdatasync) writes and I did not try to mimic that with RocksDB. This means that write amplification will be under-stated and write throughput will be over-stated for SplinterDB when compared to RocksDB.
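
A sketch of the two key-selection styles described in the first bullet above. This is made-up code, not from db_bench or splinter_test; the point is that a deterministic hash-of-counter sequence can repeat across test phases while a fresh RNG draw per operation does not.

#include <cstdint>
#include <random>

// RNG style (roughly the db_bench approach): a fresh pseudo-random key per operation.
uint64_t NextKeyRng(std::mt19937_64& rng, uint64_t num_keys) {
  return rng() % num_keys;
}

// Hash-of-counter style (roughly the splinter_test approach): the i-th operation
// of any phase maps to the same key, so phases can see the same key sequence.
uint64_t NextKeyHashed(uint64_t i, uint64_t num_keys) {
  // splitmix64-style mixer as a stand-in for whatever hash is actually used.
  uint64_t x = i + 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return (x ^ (x >> 31)) % num_keys;
}
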
Results for cached

Legend:
* insert - inserts/second
* point - point queries/second
* range - range queries/second
* wamp - write-amplification during the inserts

50M rows, 100-byte values
---- ops/second ----
insert  point   range   wamp    dbms
533584  246964  31236   1.3     splinterdb
483519  191592  76600   4.5     rocksdb, leveled
488971  180769  57394   2.9     rocksdb, universal

25M rows, 200-byte values
---- ops/second ----
insert  point   range   wamp    dbms
474446  261444  33538   1.1     splinterdb
495325  188851  75122   3.7     rocksdb, leveled
500862  201667  83686   2.8     rocksdb, universal

Results for IO-bound

Much of the difference in query performance can be explained by the amount of read IO done per query.

IO reads per point query for 2B rows, 100-byte values:
  • 1.29 - SplinterDB
  • 0.98 - RocksDB, leveled
  • 1.00 - RocksDB, universal, 1.5B rows
IO reads per range query for 2B rows, 100-byte values:
  • 7.42 - SplinterDB
  • 2.34 - RocksDB, leveled
  • 3.24 - RocksDB, universal, 1.5B rows

2B rows, 100-byte values
---- ops/second ----
insert  point   range   wamp    dbms
308740  6110    1060     3.7    2B rows, splinterdb
181556  8428    3056    14.1    2B rows, rocksdb, leveled
205736  8404    3029    12.7    1.5B rows, rocksdb, leveled
393873  8144    1889     6.0    1.5B rows, rocksdb, universal
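
A rough cross-check that read IO explains the range query results: multiplying range queries/second from the table above by the IO reads/query listed earlier gives a similar number of storage reads/second for each engine, which is what I'd expect with a single query thread bound by device read latency. The numbers are from this post; the interpretation is mine.

#include <cstdio>

// Range query cross-check for the 100-byte value IO-bound runs:
// queries/second * IO reads/query ~= storage reads/second.
int main() {
  struct Row { const char* dbms; double qps; double reads_per_query; };
  const Row rows[] = {
      {"splinterdb (2B rows)", 1060, 7.42},
      {"rocksdb leveled (2B rows)", 3056, 2.34},
      {"rocksdb universal (1.5B rows)", 1889, 3.24},
  };
  for (const Row& r : rows)
    std::printf("%-30s ~%5.0f reads/second\n", r.dbms, r.qps * r.reads_per_query);
  // ~7900, ~7200 and ~6100 reads/second -- the same order of magnitude.
  return 0;
}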

1B rows, 200-byte values
---- ops/second ----
insert  point   range   wamp    dbms
221007  7175     908     3.7    1B rows, splinterdb
107692  8393    2436    13.0    1B rows, rocksdb, leveled
121519  8159    2519    11.7    750M rows, rocksdb, leveled
324555  8621    2763     5.3    750M rows, rocksdb universal

Details

For SplinterDB with 2B rows and 100-byte values the tree shape was:

Space used by level: trunk_tree_height=4
0:   191168 MiB
1:    34554 MiB
2:     6586 MiB
3:     1307 MiB
4:      206 MiB

The LSM tree shape for RocksDB with 2B rows, 100-byte values and leveled compaction uses long lines, so it is in a gist.

