I made a few changes to LevelDB before running tests. My changes are on github and the commit message has the details. Adding the --seed option for read-heavy tests is important, otherwise LevelDB can overstate QPS because repeated runs read the same sequence of keys. The next step was to use the same compiler toolchain for RocksDB and LevelDB. I won't share the diff to the Makefile as that is specific to my work environment.
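For example, something like this uses a different seed per run so read-heavy tests don't fetch the same key sequence every time and benefit from cache hits. This is a rough sketch: RocksDB's db_bench already accepts --seed and my LevelDB patch adds an equivalent option.

  # use a new RNG seed per run so readrandom doesn't repeat the same keys
  SEED=$( date +%s )
  ./db_bench --benchmarks=readrandom --use_existing_db=1 --num=1000000 --seed=$SEED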
I used the following pattern for the tests. The pattern was repeated for N=1M, 10M, 100M and 1000M keys with 800-byte values and a 50% compression rate. The database sizes were approximately 512M, 5G, 50G and 500G. The test server has 40 hyperthread cores, 144G of RAM and fast storage. A sketch of the corresponding db_bench invocations follows the list.
- fillseq to load a database with N keys
- overwrite with 1 thread to randomize the database
- readwhilewriting with 1 reader thread and the writer limited to 1000 Puts/second. The rate limit is important to avoid starving the reader. Read performance is better when the memtable is empty and when queries are done immediately after fillseq, but those are not realistic conditions for most workloads, so overwrite was run before this test.
- readwhilewriting with 16 reader threads and the writer limited to 1000 Puts/second
- readrandom with 1 thread
- readrandom with 16 threads
- overwrite with 1 thread
- overwrite with 16 threads
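Putting that together, this is a rough sketch of the sequence as db_bench invocations. The paths are hypothetical, the writer rate limit and the tuning options discussed below are omitted, and these are not the exact command lines I used.

  #!/bin/sh
  # illustrative sequence for one database size; repeat for N=1M, 10M, 100M, 1000M
  N=1000000
  DB=/data/test   # hypothetical database directory
  OPTS="--db=$DB --num=$N --value_size=800 --compression_ratio=0.5"

  ./db_bench $OPTS --benchmarks=fillseq
  ./db_bench $OPTS --benchmarks=overwrite --use_existing_db=1 --threads=1
  # readwhilewriting: the writer was limited to 1000 Puts/second; the rate-limit
  # option is not shown because its name differs across db_bench versions
  ./db_bench $OPTS --benchmarks=readwhilewriting --use_existing_db=1 --threads=1
  ./db_bench $OPTS --benchmarks=readwhilewriting --use_existing_db=1 --threads=16
  ./db_bench $OPTS --benchmarks=readrandom --use_existing_db=1 --threads=1
  ./db_bench $OPTS --benchmarks=readrandom --use_existing_db=1 --threads=16
  ./db_bench $OPTS --benchmarks=overwrite --use_existing_db=1 --threads=1
  ./db_bench $OPTS --benchmarks=overwrite --use_existing_db=1 --threads=16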
Results
I ran the RocksDB tests twice, with statistics enabled and disabled. We added a lot of monitoring in RocksDB to make it easier to explain performance. But some of that monitoring needs to be more efficient for workloads with high throughput and high concurrency. I have a task open to make this better.
These are the command lines for LevelDB, for RocksDB with stats and for RocksDB without stats with 1M keys. There are many more options in the RocksDB command lines; once we decide on better defaults that number can be reduced. This post has more details on the differences in options between LevelDB and RocksDB. There are some differences between LevelDB and RocksDB that I did not try to avoid:
- LevelDB uses 2MB files and I chose not to change that in source before compiling. It tries to limit the LSM tree to 10M in L1, 100M in L2, 1000M in L3, etc. It also uses a 2M write buffer, which makes sense given that L0->L1 compaction is triggered when there are 4 files in L0. I configured RocksDB to use a 128M write buffer and to limit levels to 1G in L1, 8G in L2, 64G in L3, etc. (illustrated in the sketch after this list).
- For the 100M and 1000M key tests the value of --open_files wasn't large enough for LevelDB to cache all files in the database.
- Statistics reporting was enabled for RocksDB. That data has been invaluable for explaining good and bad performance, and the feature isn't in LevelDB. The compaction IO stats that RocksDB reports are one example.
- Flushing memtables and compaction are multithreaded in RocksDB. It was configured to use 7 threads for flushing memtables and 16 threads for background compaction. This is very important when the background work is slowed by IO and compression latency, and compression latency can be very high with zlib, although these tests used snappy. A smaller number of threads would have been sufficient, but one thread would have been too few, as the LevelDB results show. Even with many threads there were stalls in RocksDB. In the output from the overwrite test with 16 threads, look at the Stall(cnt) column for L0 and then the Stalls(count) line. The stalls occur because there were too many L0 files. It is a challenge to move data from the memtable to L2 with leveled compaction because L0->L1 compaction is single-threaded and usually cannot run concurrently with L1->L2 compaction. We have work in progress to make L0->L1 compaction much faster.
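To make the option differences concrete, here is a rough sketch of the extra RocksDB options implied by the list above. The values are illustrative and this is not my exact command line.

  # sizing: 128M write buffer, 1G in L1 growing 8x per level (8G in L2, 64G in L3, ...)
  # threads: 7 memtable flush threads and 16 background compaction threads
  # stats: enabled for the RocksDB.stats runs, disabled for RocksDB.nostats
  ./db_bench --benchmarks=overwrite --use_existing_db=1 --num=1000000 --threads=16 \
      --value_size=800 --compression_ratio=0.5 \
      --write_buffer_size=134217728 \
      --max_bytes_for_level_base=1073741824 \
      --max_bytes_for_level_multiplier=8 \
      --max_background_flushes=7 \
      --max_background_compactions=16 \
      --open_files=20000 \
      --statistics=1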
Details
The data below shows the QPS (ops/sec) and, for some tests, the ingest rate (MB/sec). I like to explain performance results but the lack of monitoring in LevelDB makes that difficult. My past experience is that LevelDB suffers from not having concurrent threads for compaction and memtable flushing, especially when the database doesn't fit in RAM, because compaction then gets more stalls from disk reads.
My conclusions are:
- read throughput is a bit higher with RocksDB
- write throughput is a lot higher with RocksDB and the advantage increases as the database size increases
- worst case overhead for stats in RocksDB is about 10% at high concurrency. It is much less at low concurrency.
--- 1M keys, ~512M of data
RocksDB.stats : RocksDB.nostats : LevelDB
ops/sec MB/sec : ops/sec MB/sec : ops/sec MB/sec : test
231641 181.1 : 243161 190.2 : 156299 121.6 : fillseq
145352 113.7 : 157914 123.5 : 21344 16.6 : overwrite, 1 thread
113814 : 116339 : 73062 : readwhilewriting, 1 thread
850609 : 891225 : 535906 : readwhilewriting, 16 threads
186651 : 192948 : 117716 : readrandom, 1 thread
771182 : 803999 : 686341 : readrandom, 16 threads
148254 115.9 : 152709 119.4 : 24396 19.0 : overwrite, 1 thread
109678 85.8 : 110883 86.7 : 18517 14.4 : overwrite, 16 threads
--- 10M keys, ~5G of data
RocksDB.stats : RocksDB.nostats : LevelDB
ops/sec MB/sec : ops/sec MB/sec : ops/sec MB/sec : test
226324 177.0 : 242528 189.7 : 140095 109.0 : fillseq
86170 67.4 : 86120 67.3 : 12281 9.6 : overwrite, 1 thread
102422 : 95775 : 54696 : readwhilewriting, 1 thread
687739 : 727981 : 513395 : readwhilewriting, 16 threads
143811 : 143809 : 95057 : readrandom, 1 thread
604278 : 676858 : 646517 : readrandom, 16 threads
83208 65.1 : 85342 66.7 : 13220 10.3 : overwrite, 1 thread
82685 64.7 : 83576 65.4 : 11421 8.9 : overwrite, 16 threads
--- 100M keys, ~50G of data
RocksDB.stats : RocksDB.nostats : LevelDB
ops/sec MB/sec : ops/sec MB/sec : ops/sec MB/sec : test
227738 178.1 : 238645 186.6 : 64599 50.3 : fillseq
72139 56.4 : 73602 57.6 : 6235 4.9 : overwrite, 1 thread
45467 : 47663 : 12981 : readwhilewriting, 1 thread
501563 : 509846 : 173531 : readwhilewriting, 16 threads
54345 : 57677 : 21743 : readrandom, 1 thread
572986 : 585050 : 339314 : readrandom, 16 threads
74292 56.7 : 72860 57.0 : 7026 5.5 : overwrite, 1 thread
74382 58.2 : 75865 59.3 : 5603 4.4 : overwrite, 16 threads
--- 1000M keys, ~500G of data
Tests are taking a long time...
RocksDB.stats : LevelDB
ops/sec MB/sec : ops/sec MB/sec : test
233126 182.3 : 7054 5.5 : fillseq
65169 51.0 : : overwrite, 1 thread
6790 : : readwhilewriting, 1 thread
72670 : : readwhilewriting, 16 threads
Is the db_bench tool available on github to be cloned and run against any key-value store?
Yes, https://github.com/facebook/rocksdb/blob/master/tools/db_bench.cc
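For example, something like this builds and runs it. This is a rough sketch; the build needs gflags and the steps can differ across versions.

  git clone https://github.com/facebook/rocksdb.git
  cd rocksdb
  make db_bench
  ./db_bench --benchmarks=fillseq,readrandom --num=1000000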
There has been a fork of the LevelDB version of db_bench in LMDB with ports to many key-value stores. I forgot where that was published.
Thanks a lot Mark. As far as I understand I should be able to run this against any key-value store (not just the ones in production) since this is no longer catering to just LevelDB. I will look for the link you mentioned.
It was ported to many but not all key-value stores. The author of those ports is very productive. That was a lot of work.
OK. That means if we want to try that tool with any new (not currently supported) key-value store, we have to write the interfaces.
Yes you do. Every NoSQL API is different. If you want something standard then SQL, JDBC and ODBC are a better choice. With MyRocks we are bringing the good things from RocksDB to a SQL DBMS.