I made a few changes to LevelDB before running tests. My changes are on github and the commit message has the details. Adding the --seed option for read-heavy tests is important, otherwise LevelDB can overstate QPS because repeated runs read the same sequence of keys. The next step was to use the same compiler toolchain for RocksDB and LevelDB. I won't share the diff to the Makefile as that is specific to my work environment.
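For example, something like this uses a different seed per run so read-heavy tests don't fetch the same key sequence every time and benefit from cache hits. This is a rough sketch: RocksDB's db_bench already accepts --seed and my LevelDB patch adds an equivalent option.

  # use a new RNG seed per run so readrandom doesn't repeat the same keys
  SEED=$( date +%s )
  ./db_bench --benchmarks=readrandom --use_existing_db=1 --num=1000000 --seed=$SEED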
I used the following pattern for the tests. The pattern was repeated for N=1M, 10M, 100M and 1000M keys with 800-byte values and a 50% compression rate. The database sizes were approximately 512M, 5G, 50G and 500G. The test server has 40 hyperthread cores, 144G of RAM and fast storage. A sketch of the corresponding db_bench invocations follows the list.
- fillseq to load a database with N keys
- overwrite with 1 thread to randomize the database
- readwhilewriting with 1 reader thread and the writer limited to 1000 Puts/second. The rate limit is important to avoid starving the reader. Read performance is better when the memtable is empty and when queries are done immediately after fillseq, but those are not realistic conditions for most workloads, so overwrite was run before this test.
- readwhilewriting with 16 reader threads and the writer limited to 1000 Puts/second
- readrandom with 1 thread
- readrandom with 16 threads
- overwrite with 1 thread
- overwrite with 16 threads
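Putting that together, this is a rough sketch of the sequence as db_bench invocations. The paths are hypothetical, the writer rate limit and the tuning options discussed below are omitted, and these are not the exact command lines I used.

  #!/bin/sh
  # illustrative sequence for one database size; repeat for N=1M, 10M, 100M, 1000M
  N=1000000
  DB=/data/test   # hypothetical database directory
  OPTS="--db=$DB --num=$N --value_size=800 --compression_ratio=0.5"

  ./db_bench $OPTS --benchmarks=fillseq
  ./db_bench $OPTS --benchmarks=overwrite --use_existing_db=1 --threads=1
  # readwhilewriting: the writer was limited to 1000 Puts/second; the rate-limit
  # option is not shown because its name differs across db_bench versions
  ./db_bench $OPTS --benchmarks=readwhilewriting --use_existing_db=1 --threads=1
  ./db_bench $OPTS --benchmarks=readwhilewriting --use_existing_db=1 --threads=16
  ./db_bench $OPTS --benchmarks=readrandom --use_existing_db=1 --threads=1
  ./db_bench $OPTS --benchmarks=readrandom --use_existing_db=1 --threads=16
  ./db_bench $OPTS --benchmarks=overwrite --use_existing_db=1 --threads=1
  ./db_bench $OPTS --benchmarks=overwrite --use_existing_db=1 --threads=16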
Results
I ran the RocksDB tests twice, with statistics enabled and disabled. We added a lot of monitoring in RocksDB to make it easier to explain performance. But some of that monitoring needs to be more efficient for workloads with high throughput and high concurrency. I have a task open to make this better.
These are the command lines for LevelDB, for RocksDB with stats and for RocksDB without stats with 1M keys. There are many more options in the RocksDB command lines; once we decide on better defaults that number can be reduced. This post has more details on the differences in options between LevelDB and RocksDB. There are some differences between LevelDB and RocksDB that I did not try to avoid:
- LevelDB uses 2MB files and I chose not to change that in source before compiling. It tries to limit the LSM tree to 10M in L1, 100M in L2, 1000M in L3, etc. It also uses a 2M write buffer, which makes sense given that L0->L1 compaction is triggered when there are 4 files in L0. I configured RocksDB to use a 128M write buffer and to limit levels to 1G in L1, 8G in L2, 64G in L3, etc. (illustrated in the sketch after this list).
- For the 100M and 1000M key tests the value of --open_files wasn't large enough for LevelDB to cache all files in the database.
- Statistics reporting was enabled for RocksDB. That data has been invaluable for explaining good and bad performance, and the feature isn't in LevelDB. The compaction IO stats that RocksDB reports are one example.
- Flushing memtables and compaction are multithreaded in RocksDB. It was configured to use 7 threads for flushing memtables and 16 threads for background compaction. This is very important when the background work is slowed by IO and compression latency, and compression latency can be very high with zlib, although these tests used snappy. A smaller number of threads would have been sufficient, but one thread would have been too few, as the LevelDB results show. Even with many threads there were stalls in RocksDB. In the output from the overwrite test with 16 threads, look at the Stall(cnt) column for L0 and then the Stalls(count) line. The stalls occur because there were too many L0 files. It is a challenge to move data from the memtable to L2 with leveled compaction because L0->L1 compaction is single-threaded and usually cannot run concurrently with L1->L2 compaction. We have work in progress to make L0->L1 compaction much faster.
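To make the option differences concrete, here is a rough sketch of the extra RocksDB options implied by the list above. The values are illustrative and this is not my exact command line.

  # sizing: 128M write buffer, 1G in L1 growing 8x per level (8G in L2, 64G in L3, ...)
  # threads: 7 memtable flush threads and 16 background compaction threads
  # stats: enabled for the RocksDB.stats runs, disabled for RocksDB.nostats
  ./db_bench --benchmarks=overwrite --use_existing_db=1 --num=1000000 --threads=16 \
      --value_size=800 --compression_ratio=0.5 \
      --write_buffer_size=134217728 \
      --max_bytes_for_level_base=1073741824 \
      --max_bytes_for_level_multiplier=8 \
      --max_background_flushes=7 \
      --max_background_compactions=16 \
      --open_files=20000 \
      --statistics=1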
Details
The data below shows the QPS (ops/sec) and, for some tests, the ingest rate (MB/sec). I like to explain performance results but the lack of monitoring in LevelDB makes that difficult. My past experience is that LevelDB suffers from not having concurrent threads for compaction and memtable flushing, especially when the database doesn't fit in RAM, because compaction then gets more stalls from disk reads.
My conclusions are:
- read throughput is a bit higher with RocksDB
- write throughput is a lot higher with RocksDB and the advantage increases as the database size increases
- worst case overhead for stats in RocksDB is about 10% at high concurrency. It is much less at low concurrency.
--- 1M keys, ~512M of data
RocksDB.stats : RocksDB.nostats : LevelDB
ops/sec MB/sec : ops/sec MB/sec : ops/sec MB/sec : test
231641 181.1 : 243161 190.2 : 156299 121.6 : fillseq
145352 113.7 : 157914 123.5 : 21344 16.6 : overwrite, 1 thread
113814 : 116339 : 73062 : readwhilewriting, 1 thread
850609 : 891225 : 535906 : readwhilewriting, 16 threads
186651 : 192948 : 117716 : readrandom, 1 thread
771182 : 803999 : 686341 : readrandom, 16 threads
148254 115.9 : 152709 119.4 : 24396 19.0 : overwrite, 1 thread
109678 85.8 : 110883 86.7 : 18517 14.4 : overwrite, 16 threads
--- 10M keys, ~5G of data
RocksDB.stats : RocksDB.nostats : LevelDB
ops/sec MB/sec : ops/sec MB/sec : ops/sec MB/sec : test
226324 177.0 : 242528 189.7 : 140095 109.0 : fillseq
86170 67.4 : 86120 67.3 : 12281 9.6 : overwrite, 1 thread
102422 : 95775 : 54696 : readwhilewriting, 1 thread
687739 : 727981 : 513395 : readwhilewriting, 16 threads
143811 : 143809 : 95057 : readrandom, 1 thread
604278 : 676858 : 646517 : readrandom, 16 threads
83208 65.1 : 85342 66.7 : 13220 10.3 : overwrite, 1 thread
82685 64.7 : 83576 65.4 : 11421 8.9 : overwrite, 16 threads
--- 100M keys, ~50G of data
RocksDB.stats : RocksDB.nostats : LevelDB
ops/sec MB/sec : ops/sec MB/sec : ops/sec MB/sec : test
227738 178.1 : 238645 186.6 : 64599 50.3 : fillseq
72139 56.4 : 73602 57.6 : 6235 4.9 : overwrite, 1 thread
45467 : 47663 : 12981 : readwhilewriting, 1 thread
501563 : 509846 : 173531 : readwhilewriting, 16 threads
54345 : 57677 : 21743 : readrandom, 1 thread
572986 : 585050 : 339314 : readrandom, 16 threads
74292 56.7 : 72860 57.0 : 7026 5.5 : overwrite, 1 thread
74382 58.2 : 75865 59.3 : 5603 4.4 : overwrite, 16 threads
--- 1000M keys, ~500G of data
Tests are taking a long time...
RocksDB.stats : LevelDB
ops/sec MB/sec : ops/sec MB/sec : test
233126 182.3 : 7054 5.5 : fillseq
65169 51.0 : : overwrite, 1 thread
6790 : : readwhilewriting, 1 thread
72670 : : readwhilewriting, 16 threads
Is the db_bench tool available on github to be cloned and run against any key-value store?
Yes, https://github.com/facebook/rocksdb/blob/master/tools/db_bench.cc
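For example, something like this builds and runs it. This is a rough sketch; the build needs gflags and the steps can differ across versions.

  git clone https://github.com/facebook/rocksdb.git
  cd rocksdb
  make db_bench
  ./db_bench --benchmarks=fillseq,readrandom --num=1000000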
There has been a fork of the LevelDB version of db_bench in LMDB with ports to many key-value stores. I forgot where that was published.
Thanks a lot Mark. As far as I understand I should be able to run this against any key-value store (not just the ones in production) since this is no longer catering to just LevelDB. I will look for the link you mentioned.
It was ported to many but not all key-value stores. The author of those ports is very productive. That was a lot of work.
OK. That means if we want to try that tool with any new (not currently supported) key-value store, we have to write the interfaces.
Yes you do. Every NoSQL API is different. If you want something standard then SQL, JDBC and ODBC are a better choice. With MyRocks we are bringing the good things from RocksDB to a SQL DBMS.