Monday, July 7, 2014

Benchmarking the leveldb family

In the beginning there was LevelDB. It is wonderful to see so much functionality provided by a small amount of clean code. The benchmark results show very high performance for a small, cached database without concurrency. AFAIK, this is a great solution for at least one widely-used and important workload. But it isn't a server DBMS and I don't think the project owners want it to become one as that will make the code more complex and even add a few bugs.

Editorial #1

While LevelDB isn't a server DBMS, it is a good basis for one. So we have forks or projects based on it like RocksDB and HyperLevelDB and we are starting to get benchmark results that compare the forks with each other and with non-LevelDB DBMS products. Where there is benchmarking there might be benchmarketing so it helps to understand the client, db_bench, that is frequently used to generate benchmark results.

For starters, my faith in a particular benchmark result is inversely related to the number of products evaluated in the test. It is hard to find someone with expertise in multiple systems to vouch that the result for a specific product wasn't lousy because of misconfiguration or misuse. Results should be explained and that takes more time as more systems are included in a result. While recent changes have made RocksDB better, it is much harder to configure than LevelDB so I won't be surprised when lousy results are caused by misconfiguration. Fortunately, it is easy to share db_bench command lines.

Editorial #2

The RocksDB team has done a lot of great work to improve in-memory performance. When I was helping (a long time ago), we spent a lot of time on IO-bound & write-heavy workloads. Prior to the addition of universal compaction by Dhruba we added monitoring and learned that if you have a write-heavy workload then you are not going to have a good time with LevelDB. We aren't unique in learning that.

These problems described below become visible when you run a random-write workload on a large database and get the LSM into a steady state. It is possible to not reproduce the problem by avoiding some of these steps.

  1. LevelDB suffers from high write-amplification. Add 1 for the redo log, 1 for the memtable flush to the L0, 1 for L0 -> L1 compaction assuming sizeof(L0) == sizeof(L1) and then ~10 for each level beyond the L1 assuming there is no skew in keys to be written (uniform, not zipfian distribution). The above means you might end up with a write-amplification of 50 for a large database. I don't believe a clever choice of compaction inputs leads to a significant reduction in write-amplification in the steady state unless you ignore space-amplification. And if we ignore space-amplification then disabling compaction is the best strategy as data is written once to the L0 and optionally once to the redo log and then we are done. RocksDB has monitoring to report the value and sources of write-amplification.
  2. LevelDB has but one thread to do compaction in the background. This thread does read, decompress, merge keys, write, compress in a loop. It is unlikely to keep up with a few or even one thread on a write-intensive workload. When write-amplification is 50 it must repeat that loop 50 times for each byte put into the memtable. It can also stall on disk reads and writes. RocksDB supports many background compaction threads which is a big improvement from LevelDB and universal compaction to reduce write-amplification.
  3. Storage can also fall behind when write-amplification is large. If storage can sustain 200MB/second of writes and write-amplification is 50 then the peak ingest rate is 4 MB/second (peak-write-rate / write-amplification) and that assumes storage reads are free or not done because the database is cached. A b-tree that does random writes might be able to beat that. 
  4. The result of the above is that L0 -> L1 compaction will fall behind and likely trigger many stalls on Put operations. Note that LevelDB doesn't have features to stall Put operations when levels beyond L0 (L1 -> Lmax) get too much data. All of this leads to more Put stalls and an LSM tree with too much data on the wrong levels. RocksDB has an option to stall writes when levels L1 -> Lmax-1 have too much data. But the best solution is to match compaction algorithm to the workload -- universal compaction has less write-amplification.

Notes on db_bench

With all of that said, here is a short and incomplete guide to understanding the LevelDB benchmark client, db_bench. I am a big fan of db_bench. We use it a lot to understand and improve RocksDB. Some of the problems described below have been fixed in RocksDB but others remain.

  1. The db_bench client doesn't report performance metrics per time interval. This has been fixed in RocksDB. LevelDB's db_bench reports average throughput at the end of a test run. It can report percentiles and histograms for response times. For some tests I want to confirm that response time and throughput were stable and did not degrade continuously or intermittently over the duration of the test. That is hard to judge unless throughput and response time are also reported per time interval as the test runs. This is enabled in RocksDB via the --statistics, --stats_per_interval and --stats_interval options.
  2. Client threads reuse the same RNG seed. See the ThreadState constructor in LevelDB's db_bench. There is a --seed option for RocksDB's db_bench to prevent this. The problem is that a given thread generates the same sequence of keys each time the test is run. While that might help with reproducibility it can also ruin results for IO-bound tests in a few cases.  By IO-bound I mean the database is much larger than RAM. When read performance is tested for an IO-bound database if the db_bench client is run for a short period of time and then restarted then during the second run all of the read data might still be in the OS filesystem cache. I use iostat during benchmarks to debug this problem and others. When write performance is tested for an IO-bound database and db_bench is run for a short time and then restarted, the same keys are written for each run and more likely to fit in the upper levels of the LSM leading to less write-amplification and better performance than might occur in production. I added compaction statistics to RocksDB to report how much IO is done (bytes read & written, read latencies) per level with leveled compaction and this data can debug the problem.
  3. The db_bench RNG has a range from 0 to 2^31-1. If your db_bench test database has more than 2B distinct key values then you are not going to have a good time. This has been fixed in RocksDB's db_bench.
  4. db_bench doesn't have an option to enable checksum verification on reads from storage. This excludes a CPU overhead that is likely to be enabled on production deployments. See the constructor for ReadOptions which sets verify_checksum to false. Given the use of the default constructor elsewhere in LevelDB I wonder if there are places where it never verifies checksums on read. That would be a bug and I think Table::Open is one example of the bug when ReadBlock is called for the index block. Possible bugs from use of the default constructor have been fixed in RocksDB. RocksDB's db_bench has the --verify_checksum option as well. I didn't ask the LevelDB team about this until I wrote this post.
  5. db_bench doesn't have an option to set the block size for database files, so the default of 4KB is used. RocksDB's db_bench has an option to set the size via the --block_size option.
  6. The statistics timer for Put operations is stopped before DB::Write is done. The timer is stopped in LevelDB's db_bench after the operation has been added to the write batch. It should be stopped after the write batch has been added to the memtable and optionally written to the redo log. See the call to FinishedOps in DoWrite. This means that response times don't measure the time to write the database. This was recently fixed in RocksDB. I didn't ask the LevelDB team about this until I wrote this post.
  7. Was a good malloc library used? I have measured significant differences in the overhead of db_bench source code between not using and using jemalloc. 
  8. There is no client-side processing of fetched data after a read. The read loop is very simple -- generate a key, call read, don't look at returned value, repeat. To make sure that the fetched value was really read by the client I think db_bench should compute the sum of the bytes in the returned value if only to generate a bit more load on the memory system. Otherwise a DBMS using mmap reads might avoid copying or reading the values. This has yet to be fixed in RocksDB. See the ReadRandom method in db_bench.
  9. LevelDB and RocksDB use buffered IO. A too large value for readahead (read_ahead_kb) can ruin your results on an IO-bound test. RocksDB's db_bench has the --advise_random_on_open option to use POSIX_FADV_RANDOM for non-compaction reads. It also has the --compaction_fadvice option to use POSIX_FADV_SEQUENTIAL on compaction reads & writes.
  10. When compression is enabled all files are compressed. I am not sure memtable flushes should be compressed because that can use a lot of CPU, especially with zlib, without providing a significant reduction in database size. Beyond the CPU overhead, compressing a memtable flush increases the chance that a Put operation will stall waiting for a memtable flush to finish (flush of the old memtable is done by a background thread when the current memtable is being written). There is an option in RocksDB to limit compression to the lower/larger levels of the LSM with leveled compaction. There is a similar option that works with universal compaction (called size-tiered by Cassandra). For a good overview of leveled and size-tiered compaction see the DataStax posts. The db_bench options are --universal_compression_size_percent and --min_level_to_compress. 
  11. Sequential insert tests can be very fast with LevelDB and with RocksDB when using leveled compaction because there is an optimization that avoids the need for compaction. Thus write-amplification is one (or two if the redo log is enabled). The optimization lets memtable flush output get pushed to a lower level of the LSM rather than first being written to the L0 and then slowly compacted from Ln to Ln+1. Querying a database after a sequential load can also benefit from this optimization. RocksDB (and I assume LevelDB) avoids checking a file for data when the desired key is less than the min key or greater than the max key stored in the file. This pruning is done before the bloom filter check. Thus is is likely that at most one file will be checked on a point query after a sequential load. Without the sequential load optimization one file per level would be checked.
  12. A random write test that puts 1M keys into an empty database is unlikely to result in a database with 1M distinct keys. If you do a random query test after this then some of the reads will be very fast assuming the bloom filter works because the desired data doesn't exist. 
  13. A short random write test done to a database after a sequential load might not get the database into a steady state. By steady state I mean that compaction should be in progress as it would be for a sustained write-heavy workload. For write-heavy tests I usually do a sequential load of X million rows, then do X million random writes, then consider the database to be ready for real tests.
  14. There are different types of writes. A write-heavy workload with db_bench usually means blind writes but we added the --benchmarks=updaterandom option to db_bench to match the read-modify-write pattern that is common with a SQL DBMS, except db_bench ignores the overhead from making the read-modify-write sequence atomic.
  15. Read QPS is also inflated in db_bench when the memtable is empty as no time must be spent checking an empty memtable.  So the pattern -- run db_bench to populate the database, re-run db_bench using a read-only test -- will get QPS much higher than is likely to occur in the real world. And if the load was done using leveled compaction (style used by LevelDB) then the read QPS will be inflated again (see point 11 above).
  16. When RocksDB universal compaction is used (called size-tiered by Cassandra) and a sequential insert is done, then read QPS after the load can also be inflated because of key range pruning. The files after the load will be range partitioned by the index key, so at most one file must be read, and possibly at most one bloom filter check must be done, per query.


  1. Some pretty good points there, triggered several responses from me:

    LevelDB is not small or clean code, by any definition. It is demonstrably complex and bug-ridden, causing no end of teeth-gnashing for its users. etc...

    RocksDB is interesting but the complexity takes it far out of the realm of what's expected in a "lightweight embedded database". Config complexity is one of the serious downsides to BerkeleyDB as well; this path is well-trodden and leads to disappointment.

    Editorial #2, or "all about write amplification" - I've not been shy about my disdain for LevelDB and LSM designs in general. The world is waking up to the fact that write amplification is a fact of life for LSMs and this is hugely detrimental for the direction storage is going.

    Notes on db_bench

    1) Good stuff. I've merged the statistics reporting from RocksDB's version into all of the other drivers that I've written.

    2) RNG seed - yeah, what's worse is that each individual test restarts with the same seed. So even if you do multiple tests in the same run, you're not getting new numbers. I fixed this latter aspect recently as well.

    3) I should probably merge the RocksDB 64-bit RNG soon...

    1. and re: (1) those stats are used to generate the throughput graphs in e.g. and

  2. 6) seems inconsequential since the FinishedSingleOp placement doesn't affect the final calculation of # Ops / Total Time.

    7) choice of malloc library is certainly important. I've studied this extensively. But for comparative benchmarks, this needs to be kept uniform. When we test DB engines we want to know what the engine itself can do. We don't want numbers obscured by the influence of different 3rd party components, especially when projects have conflicting recommendations - tcmalloc, jemalloc, etc... But at least in this case, we can go back again using LD_PRELOAD and test all of the engines with the malloc library of our choice. It's just that there's usually already plenty to test, and adding any variable increases testing time exponentially.

    8) client-side processing of data could be more "realistic", but then it becomes a benchmark of your calling application, not of the DB engine. For a microbenchmark you really want to test the DB engine in isolation, even if the results won't be directly applicable to the real world. This just means that microbenchmarking can't be your only activity, you also need to test in the full context of a real application.

    9) readahead - yes, in DB-larger-than-RAM workloads I've found that readahead just needs to be turned off completely.

  3. 10) I disable compression in my tests, for similar reasons of avoiding 3rd party malloc libraries. Most projects now support multiple compression libraries, but if they don't all support the same ones then you wind up benchmarking compressors, not DB engines. Also you can't reasonably benchmark a compressor without domain-relevant input data. "50% compressible random data" that db_bench claims to provide is really just a guess, and that 50% target is also compressor-sensitive. For comparative benchmarking you have to control and minimize variables.

    12) yeah, this was annoying because the original db_bench code doesn't give any indication that it wrote fewer keys than you requested. I noted this a while ago and the WiredTiger folks confirmed it here

    In combination with the RNG seed issue (2) you can get some ridiculously optimistic results. I provided a simple fix in my version (using Shuffle) and also added du output after each test so that we can tell when something fishy like this has happened. (I.e., if the disk use after a test is much smaller than the number of records would suggest, you probably got bitten by this.)

  4. By the way, the RocksDB readwhilewriting stats aren't reporting what the authors think it is. They set "exclude_from_stats" for the writer thread, to supposedly report only the reader statistics. But in fact, the final stats are generated by merging all of the threads' stats into thread 0's stats, and guess what, the writer is thread 0. I fixed this a while ago too:

  5. Stats aren't merged into thread 0 for RocksDB, so I don't think it has the bug. They are merged into thread 0 in your version of db_bench and in upstream.
    // Stats for some threads can be excluded.
    Stats merge_stats;
    for (int i = 0; i < n; i++) {

    Write-amp is a fact of life for db engines, not just ones that use an LSM. When an update-in-place tree writes a 4k page back to disk because a 100 byte row is dirty, then write-amp is ~40X. Forcing a redo log to disk after appending a 100 byte record writes back at least one disk sector of data (512 bytes today, 4k tomorrow), and that is more write-amp. We pay to get endurance with flash, so write-amp matters for some workloads.

    For #6 the placement of FinishedSingleOp is only an issue when there is more than 1 put per write batch. This doesn't hurt averages in any case. It does hurt response time histograms and percentiles. Response variance is important for some benchmarks and LevelDB is prone to it on write-heavy workloads.

    For #8 computing a sum of the returned value bytes isn't realistic client processing, it is done to confirm that the db engine really read the value bytes. An mmap-based engine can avoid any read (disk or memory) for the value when the client doesn't access it. If mmap is used and the index isn't clustered so the values are in a heap organized file separate from the index, the engine stores a pointer to the value in the index. Assuming the Get call returns that pointer without copying (zero-copy is good), then we have an optimization or a way to get misleading results because the disk pages for values never get faulted in.

    1. re: merge_stats - ok, missed that before.

      re: write-amp - the difference is that in those other two cases it's a one-time cost for a given write. With an LSM you pay the cost over and over again as you flush from one level to the next and when you do compactions.

      re: response variance - ok, no argument there. LevelDB response time is clearly a wildcard.

      re: zero-copy - fair enough. In that case it should be sufficient to just reference the first byte of a returned value. (Or every PAGESIZE'th byte, if the values are larger.) If you force a walk over all of the value bytes then you negate the zero-copy benefit and mislead in the other direction. Either way, it doesn't tell you what performance a real application will get, and you still need higher level benchmarking.

    2. I am all for application specific benchmarking, but I like to start with micro-benchmarks. I assume OpenLDAP also serves as such an application specific benchmark on which LMDB is a great fit. Your whitepaper on that was very interesting. LinkBench has been the benchmark that I need to care about.

      I have workloads for which leveled compaction as done by LevelDB/RocksDB does much worse than an update-in-place b-tree on write-amp. Switching to size-tiered, which is a RocksDB option, gets me write-amp that is much less than the update-in-place b-tree.

    3. Yes, OpenLDAP is the reason LMDB exists and the consumer of the majority of "special features" in LMDB. (The exception being nested transactions, which we added for SQLite/MySQL/etc.) In that context, all that matters is that we're superior to BerkeleyDB.

      Interesting link, it appears to spell out the tradeoffs of the two compaction approaches pretty well. Sounds like your LinkBench-style workload does few/no deletes.

    4. Glad you found the whitepaper interesting. I still lament the fact that it needed to be trimmed down so much; practically none of the assertions in the paper have any backing details included. (But I guess if you're reading it next to a web browser, you can get some of it through the linked references. And probably no one has the patience to read anything longer. But also if you were an attendee at the LDAPCon where it was debuted, you were probably already intimately familiar with the problems being addressed.)

      The LDAP use case dictated a lot of our testing direction too - e.g., most LDAP entries are 1K-4K in size; benchmarks with 100 byte values really don't tell us anything. (This is why our HyperDex benchmark initially used 4K records - we wanted to evaluate HyperDex for suitability as an OpenLDAP backend.)