While I am a huge fan of research papers presenting storage engines that claim to be better than RocksDB, I am always wary of the performance results. A paper can be great despite an imperfect performance evaluation, so pointing out the imperfections doesn't take away from the interesting ideas in the paper. Also, as a believer in the (C)RUM Conjecture I want to know how the new thing is both better and worse, but papers mostly focus on the better parts and don't highlight what isn't better.
One factor that determines the truthiness of a database benchmark is the number of DBMS that are compared. It is hard enough to get expertise in one DBMS and more DBMS == more chance of making a mistake. Here I present a result for RocksDB and SplinterDB. I am definitely not an expert on SplinterDB. Perhaps my results are more truthy than true.
I read the SplinterDB paper and hope their research continues. However, the paper didn't have enough detail on how the benchmark was done so I had to guess. I wish more papers published artifacts (scripts, etc) that made reproduction easier. I expect reproduction to be infrequent and don't want researchers to spend their time doing that, but the artifacts are nice to have.
tl;dr
- For insert performance
  - It is risky to compare write performance between SplinterDB and RocksDB because SplinterDB doesn't force data to storage via fsync, fdatasync or msync, and that is documented.
  - For IO-bound - RocksDB with universal compaction was the fastest
  - For cached - performance is similar between RocksDB and SplinterDB
- For point query performance
  - For IO-bound - RocksDB was faster because it does less IO/query
  - For cached - SplinterDB was faster because it uses less CPU/query
- For range query performance
  - For IO-bound - RocksDB was a lot faster because it does less IO/query
  - For cached - RocksDB was a lot faster because it uses less CPU/query
SplinterDB
I patched SplinterDB as of git hash a1833060. My patched SplinterDB branch is here. The patch includes:
- debug printfs I added to understand the code
- changes to stop the lookup tests after 1200 seconds
- a fix for a memory leak that I reported and that has since been fixed upstream (issue 440)
- hardwiring the range query test to run only the Small range test
- making the payload size a constant rather than a range, to match what I do for RocksDB
- cached: 50M rows, 100-byte value
- cached: 25M rows, 200-byte value
- IO-bound: 2B rows, 100-byte value
- IO-bound: 1B rows, 200-byte value
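The row counts and value sizes above pair up so that each cached configuration and each IO-bound configuration stores the same number of value bytes. A minimal sketch of that arithmetic follows. It counts value payload only, which is my simplification: key bytes, per-entry metadata and compression are ignored, and the names `WORKLOADS` and `value_bytes` are made up for this example.

```python
# Value payload per workload, ignoring key bytes, metadata and compression
# (a simplifying assumption, not something measured).
WORKLOADS = [
    ("cached", 50_000_000, 100),
    ("cached", 25_000_000, 200),
    ("IO-bound", 2_000_000_000, 100),
    ("IO-bound", 1_000_000_000, 200),
]


def value_bytes(rows, value_size):
    """Total bytes of value payload for a workload."""
    return rows * value_size


def summarize(workloads):
    """Return (name, rows, value_size, total_value_bytes) per workload."""
    return [(name, rows, size, value_bytes(rows, size))
            for name, rows, size in workloads]


if __name__ == "__main__":
    for name, rows, size, total in summarize(WORKLOADS):
        print(f"{name:9s} rows={rows:>13,} value={size}B "
              f"values={total / 1e9:.0f} GB")
```

Under that assumption the cached configurations each hold about 5 GB of values and the IO-bound configurations each hold about 200 GB.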
Things that differ between RocksDB and SplinterDB:
- A uniform key access distribution was used by both benchmark clients. However, the RocksDB client does this with an RNG while the SplinterDB client uses something like: hash(x) for x in 1 ... N. I didn't confirm it, but I am curious whether the point and range query tests see the same key sequence. If they do, and the tests run for a short period of time, then the point query tests warm the cache for the range query tests, making results less truthy for IO-bound workloads.
- SplinterDB does not have redo/WAL so I disabled that for RocksDB.
- SplinterDB does not sync (fsync, msync, fdatasync) writes and I did not try to mimic that with RocksDB. This means that write amplification will be under-stated and write throughput will be over-stated for SplinterDB when compared to RocksDB.
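The key sequence concern in the first item above can be sketched in a few lines. This is only an illustration under assumptions: SHA-256 stands in for whatever hash SplinterDB really uses, the RNG-based client is assumed to use a different seed per test, and `KEY_SPACE`, `rng_keys`, `hashed_keys` and `overlap` are names invented for the example.

```python
import hashlib
import random

KEY_SPACE = 1 << 40  # hypothetical key-space size


def rng_keys(count, seed):
    """Uniform keys drawn from an RNG, one seed per test (assumed)."""
    rng = random.Random(seed)
    return [rng.randrange(KEY_SPACE) for _ in range(count)]


def hashed_keys(count):
    """Uniform keys from hash(x) for x in 1..count.

    SHA-256 is a stand-in hash. The sequence depends only on x, so every
    test that starts at x=1 visits the same keys in the same order.
    """
    keys = []
    for x in range(1, count + 1):
        digest = hashlib.sha256(x.to_bytes(8, "little")).digest()
        keys.append(int.from_bytes(digest[:8], "little") % KEY_SPACE)
    return keys


def overlap(first, second):
    """Fraction of distinct keys in `second` already touched by `first`."""
    seen = set(first)
    distinct = set(second)
    return len(seen & distinct) / len(distinct)


if __name__ == "__main__":
    point, rng_range = rng_keys(10_000, seed=1), rng_keys(10_000, seed=2)
    print("RNG, different seeds:", overlap(point, rng_range))
    print("hash(x), x from 1   :", overlap(hashed_keys(10_000), hashed_keys(10_000)))
```

With the hash-based scheme a short range query test would start from keys the point query test already pulled into cache, while two RNG-based tests with different seeds almost never collide in a key space this large.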
- 1.29 - SplinterDB
- 0.98 - RocksDB, leveled
- 1.00 - RocksDB, universal, 1.5B rows
- 7.42 - SplinterDB
- 2.34 - RocksDB, leveled
- 3.24 - RocksDB, universal, 1.5B rows
For SplinterDB with 2B rows and 100-byte values the tree shape was:
The LSM tree shape for RocksDB with 2B rows, 100-byte values and leveled compaction has long lines, so it is in a gist.