Monday, February 8, 2016

Concurrent inserts and the RocksDB memtable

RocksDB started as a fork of LevelDB. While it will always be a fork, we describe it as derived from LevelDB given how much code has changed. It inherited a memtable that did not allow concurrent inserts because of a mutex: readers ignore the mutex, but writers must lock it. Writers waiting for the mutex linked their updates together, and the thread at the head of the convoy applied all of their changes after getting the mutex.
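The convoy described above can be sketched as a toy model. This is not the RocksDB write path: the class `ConvoyMemtable`, the helper `run_writers`, and the use of Python locks are all assumptions made for illustration, and waiting writers here simply block on the mutex. It only shows the idea that whichever writer gets the mutex applies every queued update, so the writers behind it find their work already done.

```python
import threading


class ConvoyMemtable:
    """Toy single-writer memtable (hypothetical, for illustration only)."""

    def __init__(self):
        self._data = {}
        self._pending = []                      # updates queued by writers
        self._pending_lock = threading.Lock()   # protects the queue
        self._mutex = threading.Lock()          # the single writer mutex
        self.batches_applied = 0

    def put(self, key, value):
        done = threading.Event()
        with self._pending_lock:
            self._pending.append((key, value, done))
        with self._mutex:
            if done.is_set():
                return  # an earlier leader already applied this update
            # This thread is the leader: take everything queued so far.
            with self._pending_lock:
                batch, self._pending = self._pending, []
            for k, v, event in batch:
                self._data[k] = v
                event.set()
            self.batches_applied += 1

    def get(self, key):
        # Readers do not take the writer mutex.
        return self._data.get(key)


def run_writers(table, num_threads, puts_per_thread):
    """Start writer threads that each insert their own distinct keys."""
    def worker(tid):
        for i in range(puts_per_thread):
            table.put((tid, i), i)

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Every call to `put` returns only after its update is in the table, but `batches_applied` can be smaller than the number of puts because one leader may apply updates for several writers.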

Insert performance for RocksDB and LevelDB did not improve beyond one thread when sync on commit was disabled because of contention on the mutex. It could improve with more threads when sync on commit was enabled courtesy of group commit. Write-only workloads are rare for a DBMS that uses a B-Tree because read-modify-write (RMW) of leaf nodes is done for updates and inserts. Write-only is more common with RocksDB thanks to the merge operator and because non-unique secondary index maintenance doesn't need RMW. We expect to make this more common via SQL features in MyRocks.
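To make the difference between RMW and a merge operator concrete, here is a minimal sketch using a toy in-memory store. `ToyStore`, `rmw_increment`, and `merge_increment` are hypothetical names and this is not the RocksDB API; the point is only that the merge path records an operand without reading the current value, and the operands are combined later when the key is read.

```python
class ToyStore:
    """Toy key-value store with a merge operation (illustration only)."""

    def __init__(self, merge_fn):
        self._base = {}          # last value written with put()
        self._operands = {}      # merge operands not yet folded in
        self._merge_fn = merge_fn
        self.reads = 0           # number of get() calls

    def get(self, key):
        self.reads += 1
        value = self._base.get(key)
        # Fold pending operands into the base value at read time.
        for operand in self._operands.get(key, []):
            value = self._merge_fn(value, operand)
        return value

    def put(self, key, value):
        self._base[key] = value
        self._operands.pop(key, None)

    def merge(self, key, operand):
        # Write-only: no read of the existing value is needed.
        self._operands.setdefault(key, []).append(operand)


def add(old, delta):
    """Merge function for a counter; a missing value counts as 0."""
    return (old or 0) + delta


def rmw_increment(store, key):
    """Read-modify-write: one read for every increment."""
    current = store.get(key) or 0
    store.put(key, current + 1)


def merge_increment(store, key):
    """Merge: the increment is a blind write."""
    store.merge(key, 1)
```

With three increments of the same key, the RMW version does three reads before any result is requested, while the merge version does none until the first `get`.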

We continue to make RocksDB better and today I show how the concurrent insert rate improves by 3X with sync on commit disabled and 2X with it enabled. Hopefully we will get a wiki entry to explain the code changes. Until then, thank you Nathan. The new behavior is enabled by setting two configuration options to true: allow_concurrent_memtable_write and enable_write_thread_adaptive_yield.

These tests were done using a server with 24 cores (48 HW threads) and enough RAM to cache the database. The test script is here. The workload is write-only with threads doing Put operations on randomly selected keys.
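For readers who want to try a similar comparison, here is a hedged sketch that builds db_bench command lines for the Old and New configurations. This is not the test script used for these results. It assumes that db_bench's fillrandom benchmark is an acceptable stand-in for the workload and that the db_bench flags carry the same names as the two options, which should be checked against your build. The helper `db_bench_args` is hypothetical.

```python
def db_bench_args(threads, sync, concurrent_memtable):
    """Build an argument list for a write-only db_bench run.

    Assumption: flag names mirror the RocksDB option names.
    """
    flag = "true" if concurrent_memtable else "false"
    return [
        "db_bench",
        "--benchmarks=fillrandom",          # Put on randomly selected keys
        f"--threads={threads}",
        f"--sync={1 if sync else 0}",       # sync on commit
        f"--allow_concurrent_memtable_write={flag}",
        f"--enable_write_thread_adaptive_yield={flag}",
    ]


# Old vs New with sync on commit disabled, at one concurrency level.
old_cmd = db_bench_args(threads=24, sync=False, concurrent_memtable=False)
new_cmd = db_bench_args(threads=24, sync=False, concurrent_memtable=True)
```

Each list can be passed to `subprocess.run` once other settings such as the database path and number of keys have been added.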

Results with sync on commit disabled

This section has results with sync on commit disabled. The first graph shows the rate in inserts/second for RocksDB with the concurrent memtable enabled (New) and disabled (Old). The second graph shows the ratio of (New/Old) and the speedup is about 3X at high concurrency.

Results with sync on commit enabled

This section has results with sync on commit enabled. The first graph shows the rate in inserts/second for RocksDB with the concurrent memtable enabled (New) and disabled (Old). The second graph shows the ratio of (New/Old) and the speedup is about 2X at high concurrency.

Speedup for MyRocks

I used a smaller server (20 cores instead of 24) to measure the benefit from this change for loading linkbench tables in parallel. The linkbench load step was run with generate_nodes=false to disable the single-threaded load of the node table and maxid1=30M. The database is small enough to fit in cache. The speedup is 1.4X at high concurrency. I will be vague but there is another configuration change that improves the speedup to 2.4X.


  1. Is the major win in performance coming from the lock free skiplist implementation?

    1. It is lock-free for readers. There is a mutex for writers.

  2. Has this made it into the product yet? If so what version?

    1. Yes, it has been there for months. But it is still disabled by default.

  3. Have you tried going > 40 threads (significantly higher) ?
    Use case: a hell of a lot of machines doing some bulk processing job and all trying to insert the results concurrently into RocksDB.
    Wondering if you already have any data on that kind of experiment.

    1. I have not, in part because there were intermittent errors from the Python MySQL client when trying to use more threads (well, processes, as I am using the Python multiprocessing module). I probably resolved that by adding short sleeps before each thread is started but have yet to try with more.

      In my setup the clients and mysqld run on the same host and there is no think time between requests in a client. So each client is always using CPU so I don't want more clients than CPU cores, and then I need to reserve a few CPU cores for background tasks like compaction.

  4. Hi Mark. If we improve memtable write speed this way, how can we guarantee that no stall or slowdown is triggered in Level 0? Are there any methods to avoid stalls or slowdowns? Thanks in advance!

    1. 1) configure so that sizeof(L0) and sizeof(L1) are similar (sizeof(L1) is set via max_bytes_for_level_base)
      2) no compression for L0, L1, L2 and then fast compression (LZ4) for L3, L4, ... and then LZ4, ZSTD or Zlib for the max level (set via bottommost_compression)
      3) if level0_file_num_compaction_trigger is 4 then set level0_slowdown_writes_trigger=20 and level0_stop_writes_trigger=30

      To determine sizeof(memtable)
      1) sizeof memtable determines size of L0 files
      2) level0_file_num_compaction_trigger determines number of L0 files when compaction starts
      3) for sizeof(L0) I use : sizeof(memtable) X level0_file_num_compaction_trigger

    2. Much clearer now. Thanks Mark.
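The sizing advice in the reply to question 4 can be sketched as a small calculation. This is an illustration under stated assumptions, not a tuning tool: the helper `suggest_l0_options` is hypothetical, sizeof(memtable) is mapped to write_buffer_size, the compaction trigger is fixed at 4 because that is the only case the reply gives slowdown and stop values for, the default of 7 levels is an assumption, and ZSTD is just one of the three choices named for the max level.

```python
def suggest_l0_options(write_buffer_size, num_levels=7):
    """Return option values following the advice in the reply above.

    write_buffer_size is the memtable size in bytes.
    """
    l0_trigger = 4  # the reply's slowdown/stop values assume this trigger
    # sizeof(L0) = sizeof(memtable) x level0_file_num_compaction_trigger
    l0_bytes = write_buffer_size * l0_trigger
    # No compression for L0, L1, L2 and fast compression (LZ4) after that.
    compression = ["kNoCompression"] * min(3, num_levels)
    compression += ["kLZ4Compression"] * max(0, num_levels - 3)
    return {
        "write_buffer_size": write_buffer_size,
        "level0_file_num_compaction_trigger": l0_trigger,
        # Make sizeof(L1) similar to sizeof(L0).
        "max_bytes_for_level_base": l0_bytes,
        "level0_slowdown_writes_trigger": 20,
        "level0_stop_writes_trigger": 30,
        "compression_per_level": compression,
        # LZ4 or Zlib are the other choices named for the max level.
        "bottommost_compression": "kZSTD",
    }


opts = suggest_l0_options(64 * 1024 * 1024)  # 64 MB memtable
```

With a 64 MB memtable this gives a 256 MB target for both L0 and L1.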