Monday, April 27, 2015

Comparing LevelDB and RocksDB, take 2

I previously explained problems to avoid when comparing RocksDB and LevelDB. I am back with more details and results because someone is wrong on the Internet. The purpose of the test was to determine whether we had introduced any regressions, after reading a comparison published by someone else in which RocksDB had some problems. Note that the LevelDB and RocksDB projects have different goals. I expect RocksDB to be faster, but that comes at a cost in code and configuration complexity. I am also reluctant to compare different projects in public. The good news is that I didn't find any performance regressions in RocksDB: it is faster as expected, but the overhead from performance monitoring needs to be reduced.

I made a few changes to LevelDB before running tests. My changes are on GitHub and the commit message has the details. Adding the --seed option for read-heavy tests is important, or LevelDB can overstate QPS. The next step was to use the same compiler toolchain for RocksDB and LevelDB. I won't share the diff to the Makefile as that is specific to my work environment.
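As a sketch of how the new option is used, a read-heavy test would pass a different seed on each run. The flags below are from the stock LevelDB db_bench except for --seed, which my changes add; the path and seed value are hypothetical:

```shell
# Hypothetical db_bench invocation; --seed is the option added by my changes,
# varied per run so repeated tests don't read the same pseudo-random key sequence.
./db_bench \
  --db=/data/leveldb-test \
  --use_existing_db=1 \
  --benchmarks=readrandom \
  --num=1000000 \
  --threads=16 \
  --seed=1430150400
```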

I used the following pattern for tests. The pattern was repeated for N=1M, 10M, 100M and 1000M keys with 800-byte values and a 50% compression rate. The database sizes were approximately 512M, 5G, 50G and 500G. The test server has 40 hyperthread cores, 144G of RAM and fast storage.
  1. fillseq to load a database with N keys
  2. overwrite with 1 thread to randomize the database
  3. readwhilewriting with 1 reader thread and the writer limited to 1000 Puts/second. The rate limit is important to avoid starving the reader. Read performance is better when the memtable is empty and when queries are run immediately after fillseq, but for most workloads those are not realistic conditions, so overwrite was done prior to this test.
  4. readwhilewriting with 16 reader threads and the writer limited to 1000 Puts/second
  5. readrandom with 1 thread
  6. readrandom with 16 threads
  7. overwrite with 1 thread
  8. overwrite with 16 threads
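The steps above can be sketched as a sequence of db_bench invocations. This is illustrative only: it assumes RocksDB-style flag names, and the writer rate-limit flag in particular has varied across db_bench versions, so --writes_per_second is an assumption rather than the exact flag I used:

```shell
# Illustrative sketch of the 8-step test pattern for N=1M keys.
# Flag names are assumptions; see the linked command lines for the real ones.
N=1000000
DB=/data/bench

./db_bench --db=$DB --benchmarks=fillseq --num=$N --value_size=800 --compression_ratio=0.5
./db_bench --db=$DB --use_existing_db=1 --num=$N --benchmarks=overwrite --threads=1
# readwhilewriting: readers plus a writer limited to 1000 Puts/second
./db_bench --db=$DB --use_existing_db=1 --num=$N --benchmarks=readwhilewriting --threads=1  --writes_per_second=1000
./db_bench --db=$DB --use_existing_db=1 --num=$N --benchmarks=readwhilewriting --threads=16 --writes_per_second=1000
./db_bench --db=$DB --use_existing_db=1 --num=$N --benchmarks=readrandom --threads=1
./db_bench --db=$DB --use_existing_db=1 --num=$N --benchmarks=readrandom --threads=16
./db_bench --db=$DB --use_existing_db=1 --num=$N --benchmarks=overwrite --threads=1
./db_bench --db=$DB --use_existing_db=1 --num=$N --benchmarks=overwrite --threads=16
```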

I ran the RocksDB tests twice, with statistics enabled and disabled. We added a lot of monitoring in RocksDB to make it easier to explain performance. But some of that monitoring needs to be more efficient for workloads with high throughput and high concurrency. I have a task open to make this better.
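The two RocksDB configurations differ only in whether statistics are collected. With db_bench that is a single flag (--statistics is a real db_bench option; the rest of the command line here is a sketch):

```shell
# RocksDB.stats vs RocksDB.nostats: same test, statistics on or off.
./db_bench --benchmarks=overwrite --num=1000000 --threads=16 --statistics=1   # stats enabled
./db_bench --benchmarks=overwrite --num=1000000 --threads=16 --statistics=0   # stats disabled
```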

These are the command lines for LevelDB, for RocksDB with stats and for RocksDB without stats with 1M keys. There are many more options in the RocksDB command lines. When we decide on better defaults for RocksDB, that number can be reduced. This post has more details on the differences in options between LevelDB and RocksDB. There are some differences between LevelDB and RocksDB that I did not try to avoid.
  • LevelDB uses 2MB files and I chose not to change that in source before compiling. It tries to limit the LSM tree to 10M in L1, 100M in L2, 1000M in L3, etc. It also uses a 2M write buffer, which makes sense given that L0->L1 compaction is triggered when there are 4 files in L0. I configured RocksDB to use a 128M write buffer and to limit levels to 1G in L1, 8G in L2, 64G in L3, etc.
  • For the 100M and 1000M key tests the value of --open_files wasn't large enough in LevelDB to cache all files in the database.
  • Statistics reporting was enabled for RocksDB. This data has been invaluable for explaining good and bad performance. That feature isn't in LevelDB. This is an example of the compaction IO stats we provide in RocksDB.
  • Flushing memtables and compaction are multithreaded in RocksDB. It was configured to use 7 threads for flushing memtables and 16 threads for background compaction. This is very important when the background work is slowed by IO and compression latency, and compression latency can be very high with zlib, although these tests used snappy. A smaller number of threads would have been sufficient, but one thread would have been too few, as seen in the LevelDB results. Even with many threads there were stalls in RocksDB. In this output from the overwrite with 16 threads test, look at the Stall(cnt) column for L0 and then the Stalls(count) line. The stalls occur because there were too many L0 files. It is a challenge to move data from the memtable to L2 with leveled compaction because L0->L1 compaction is single-threaded and usually cannot run concurrently with L1->L2 compaction. We have work in progress to make L0->L1 compaction much faster.
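The RocksDB settings described above map to db_bench options roughly as follows. This is a sketch, not the exact command lines linked earlier; the flag names are RocksDB db_bench options and the values come from the description above:

```shell
# Sketch of the RocksDB configuration: 128M write buffer, 1G in L1 with
# 8x growth per level (1G, 8G, 64G, ...), 7 flush threads, 16 compaction threads.
./db_bench \
  --benchmarks=overwrite --num=1000000 --threads=16 \
  --write_buffer_size=$((128 * 1024 * 1024)) \
  --max_bytes_for_level_base=$((1024 * 1024 * 1024)) \
  --max_bytes_for_level_multiplier=8 \
  --max_background_flushes=7 \
  --max_background_compactions=16
```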

The data below shows the QPS (ops/sec) and for some tests also shows the ingest rate (MB/sec). I like to explain performance results, but the lack of monitoring in LevelDB makes that difficult. My experience in the past is that it suffers from not having concurrent threads for compaction and memtable flushing, especially when the database doesn't fit in RAM, because compaction gets more stalls from disk reads.

My conclusions are:
  • read throughput is a bit higher with RocksDB
  • write throughput is a lot higher with RocksDB and the advantage increases as the database size increases
  • worst-case overhead for stats in RocksDB is about 10% at high concurrency; it is much less at low concurrency
--- 1M keys, ~512M of data

  RocksDB.stats  :  RocksDB.nostats  :     LevelDB
ops/sec  MB/sec  :  ops/sec  MB/sec  :   ops/sec  MB/sec  : test
 231641   181.1  :   243161   190.2  :    156299   121.6  : fillseq
 145352   113.7  :   157914   123.5  :     21344    16.6  : overwrite, 1 thread
 113814          :   116339          :     73062          : readwhilewriting, 1 thread
 850609          :   891225          :    535906          : readwhilewriting, 16 threads
 186651          :   192948          :    117716          : readrandom, 1 thread
 771182          :   803999          :    686341          : readrandom, 16 threads
 148254   115.9  :   152709   119.4  :     24396    19.0  : overwrite, 1 thread
 109678    85.8  :   110883    86.7  :     18517    14.4  : overwrite, 16 threads

--- 10M keys, ~5G of data

  RocksDB.stats  :  RocksDB.nostats  :     LevelDB
ops/sec  MB/sec  :  ops/sec  MB/sec  :   ops/sec  MB/sec  : test
 226324   177.0  :   242528   189.7  :   140095   109.0   : fillseq
  86170    67.4  :    86120    67.3  :    12281     9.6   : overwrite, 1 thread
 102422          :    95775          :    54696           : readwhilewriting, 1 thread
 687739          :   727981          :   513395           : readwhilewriting, 16 threads
 143811          :   143809          :    95057           : readrandom, 1 thread
 604278          :   676858          :   646517           : readrandom, 16 threads
  83208    65.1  :    85342    66.7  :    13220    10.3   : overwrite, 1 thread
  82685    64.7  :    83576    65.4  :    11421     8.9   : overwrite, 16 threads

--- 100M keys, ~50G of data

  RocksDB.stats  :  RocksDB.nostats  :     LevelDB
ops/sec  MB/sec  :  ops/sec  MB/sec  :   ops/sec  MB/sec  : test
 227738   178.1  :   238645   186.6  :    64599    50.3   : fillseq
  72139    56.4  :    73602    57.6  :     6235     4.9   : overwrite, 1 thread
  45467          :    47663          :    12981           : readwhilewriting, 1 thread
 501563          :   509846          :   173531           : readwhilewriting, 16 threads
  54345          :    57677          :    21743           : readrandom, 1 thread
 572986          :   585050          :   339314           : readrandom, 16 threads
  74292    56.7  :    72860    57.0  :     7026     5.5   : overwrite, 1 thread
  74382    58.2  :    75865    59.3  :     5603     4.4   : overwrite, 16 threads

--- 1000M keys, ~500GB of data

Tests are taking a long time...

  RocksDB.stats  :    LevelDB
ops/sec  MB/sec  :  ops/sec  MB/sec  : test
 233126   182.3  :     7054     5.5  : fillseq
  65169    51.0  :                   : overwrite, 1 thread
   6790          :                   : readwhilewriting, 1 thread
  72670          :                   : readwhilewriting, 16 threads


  1. Is the db_bench tool available on GitHub to be cloned and run against any key-value store?

    1. Yes. There has been a fork of the LevelDB version of db_bench in LMDB with ports to many key-value stores. I forget where that was published.

    2. Thanks a lot Mark. As far as I understand, I should be able to run this against any key-value store (not just the ones in production) since this is no longer catering to just LevelDB. I will look for the link you mentioned.

    3. It was ported to many, but not all, key-value stores. The author of those ports is very productive. That was a lot of work.

    4. OK. That means if we want to try that tool with any new (not currently supported) KVS, we have to write interfaces for it.

    5. Yes, you do. Every NoSQL API is different. If you want something standard then SQL, JDBC or ODBC are a better choice. With MyRocks we are bringing the good things from RocksDB to a SQL DBMS.