Small Datum: Concurrent inserts and the RocksDB memtable

Monday, February 8, 2016

Concurrent inserts and the RocksDB memtable

RocksDB started as a fork of LevelDB. While it will always be a fork, we describe it as derived from LevelDB given how much code has changed. It inherited a memtable that did not allow concurrent inserts because writes were serialized by a mutex: readers ignored the mutex, but writers had to lock it. Writers waiting for the mutex linked their updates together, and the thread at the head of the convoy applied all of their changes after getting the mutex.
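The convoy behavior can be sketched in a few lines. This is a minimal illustration and not RocksDB or LevelDB code: ConvoyMemtable, Writer and RunConvoyDemo are hypothetical names, and a std::map stands in for the skiplist.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical record for one pending update; it lives on the writer's stack.
struct Writer {
  std::string key;
  std::string value;
  bool done = false;  // set by the leader once the update has been applied
};

class ConvoyMemtable {
 public:
  void Put(const std::string& key, const std::string& value) {
    Writer w;
    w.key = key;
    w.value = value;
    std::unique_lock<std::mutex> lock(mu_);
    queue_.push_back(&w);
    // Sleep until a leader has applied our update or we reach the head.
    cv_.wait(lock, [&] { return w.done || queue_.front() == &w; });
    if (w.done) return;

    // We are the leader: take everything queued so far as one group.
    std::vector<Writer*> group(queue_.begin(), queue_.end());
    // The leader stays at the head of the queue, so later arrivals keep
    // waiting and only one thread modifies table_ at a time.
    lock.unlock();
    for (Writer* member : group) table_[member->key] = member->value;
    lock.lock();
    for (Writer* member : group) {
      queue_.pop_front();
      member->done = true;
    }
    ++groups_;
    cv_.notify_all();
  }

  // Call these only after all writers have finished.
  std::size_t Size() const { return table_.size(); }
  std::size_t Groups() const { return groups_; }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<Writer*> queue_;
  std::map<std::string, std::string> table_;
  std::size_t groups_ = 0;
};

// Runs num_threads writers with distinct keys and returns the table size.
std::size_t RunConvoyDemo(int num_threads, int puts_per_thread) {
  ConvoyMemtable table;
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&table, t, puts_per_thread] {
      for (int i = 0; i < puts_per_thread; ++i) {
        table.Put("t" + std::to_string(t) + "-" + std::to_string(i), "v");
      }
    });
  }
  for (std::thread& th : threads) th.join();
  // Every group holds at least one update, so groups never exceed puts.
  assert(table.Groups() <= table.Size());
  return table.Size();
}
```

However many threads call Put, only one thread applies updates at any moment, which is why this design cannot scale past a single writer.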

Insert performance for RocksDB and LevelDB did not improve beyond one thread when sync on commit was disabled because of contention on the mutex. It could improve with more threads when sync on commit was enabled, courtesy of group commit. Write-only workloads are rare for a DBMS that uses a B-Tree because updates and inserts do a read-modify-write (RMW) of leaf nodes. Write-only is more common with RocksDB thanks to the merge operator and because non-unique secondary index maintenance doesn't need RMW. We expect to make this more common via SQL features in MyRocks.

We continue to make RocksDB better, and today I show how the concurrent insert rate improves by 3X with sync on commit disabled and 2X with it enabled. Hopefully we will get a wiki entry to explain the code changes. Until then, thank you Nathan. The new behavior is enabled by setting two configuration options to true: allow_concurrent_memtable_write and enable_write_thread_adaptive_yield.
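For readers using the C++ API, the two options might be set as in the fragment below. This is a sketch that needs the RocksDB headers and library to build. The helper names MakeConcurrentInsertOptions and MakeWriteOptions are hypothetical, and create_if_missing is an assumption made for convenience. Defaults for these options may differ between releases, so check options.h for the version you run.

```cpp
#include <rocksdb/options.h>

// Hypothetical helper: Options with the two settings from this post enabled.
rocksdb::Options MakeConcurrentInsertOptions() {
  rocksdb::Options options;
  options.create_if_missing = true;  // assumption: start from an empty database
  options.allow_concurrent_memtable_write = true;
  options.enable_write_thread_adaptive_yield = true;
  return options;
}

// Hypothetical helper: sync on commit is chosen per write via WriteOptions.
rocksdb::WriteOptions MakeWriteOptions(bool sync_on_commit) {
  rocksdb::WriteOptions write_options;
  write_options.sync = sync_on_commit;
  return write_options;
}
```

The Options object would be passed to rocksdb::DB::Open and the WriteOptions object to each Put call.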

These tests were done using a server with 24 cores (48 HW threads) and enough RAM to cache the database. The test script is here. The workload is write-only with threads doing Put operations on randomly selected keys.
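The shape of such a workload can be sketched as below. This is not the test script mentioned above: ToyStore, WorkloadResult and RunWriteOnlyWorkload are hypothetical names, a mutex-protected hash map stands in for RocksDB, and the 100-byte value size is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <random>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the database: one mutex around a hash map.
class ToyStore {
 public:
  void Put(std::uint64_t key, const std::string& value) {
    std::lock_guard<std::mutex> guard(mu_);
    data_[key] = value;
  }
  std::size_t Size() {
    std::lock_guard<std::mutex> guard(mu_);
    return data_.size();
  }

 private:
  std::mutex mu_;
  std::unordered_map<std::uint64_t, std::string> data_;
};

struct WorkloadResult {
  std::uint64_t puts = 0;         // total Put calls issued
  double seconds = 0.0;           // wall-clock time for the run
  std::size_t distinct_keys = 0;  // keys present at the end
};

// Each thread issues puts_per_thread Puts on uniformly random keys in
// [0, key_space). Inserts/second is puts / seconds.
WorkloadResult RunWriteOnlyWorkload(int num_threads,
                                    std::uint64_t puts_per_thread,
                                    std::uint64_t key_space) {
  assert(num_threads > 0 && key_space > 0);
  ToyStore store;
  std::vector<std::thread> threads;
  const auto start = std::chrono::steady_clock::now();
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&store, t, puts_per_thread, key_space] {
      std::mt19937_64 rng(static_cast<std::uint64_t>(t) + 1);  // per-thread seed
      std::uniform_int_distribution<std::uint64_t> pick(0, key_space - 1);
      const std::string value(100, 'x');  // value size is an assumption
      for (std::uint64_t i = 0; i < puts_per_thread; ++i) {
        store.Put(pick(rng), value);
      }
    });
  }
  for (std::thread& th : threads) th.join();
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;

  WorkloadResult result;
  result.puts = static_cast<std::uint64_t>(num_threads) * puts_per_thread;
  result.seconds = elapsed.count();
  result.distinct_keys = store.Size();
  return result;
}
```

Running this for increasing thread counts and dividing puts by seconds gives the kind of rate curve that the graphs in the following sections report for RocksDB.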

Results with sync on commit disabled

This section has results with sync on commit disabled. The first graph shows the rate in inserts/second for RocksDB with the concurrent memtable enabled (New) and disabled (Old). The second graph shows the New/Old ratio; the speedup is about 3X at high concurrency.

Results with sync on commit enabled

This section has results with sync on commit enabled. The first graph shows the rate in inserts/second for RocksDB with the concurrent memtable enabled (New) and disabled (Old). The second graph shows the New/Old ratio; the speedup is about 2X at high concurrency.

Speedup for MyRocks

I used a smaller server (20 cores instead of 24) to measure the benefit from this change for loading linkbench tables in parallel. The linkbench load step was run with maxid1=30M and with generate_nodes=false, which disables the single-threaded load of the node table. The database is small enough to fit in cache. The speedup is 1.4X at high concurrency. I will be vague, but there is another configuration change that improves the speedup to 2.4X.

9 comments:

erbenmo, March 8, 2016 at 1:47 PM
Is the major win in performance coming from the lock free skiplist implementation?
Anonymous, November 3, 2016 at 12:02 PM
Has this made it into the product yet? If so what version?
Roy Reznik, February 15, 2017 at 12:03 PM
Have you tried going significantly higher than 40 threads?
The use case is a lot of machines doing some bulk processing job and all trying to insert the results concurrently into RocksDB.
Wondering if you already have any data on that kind of experiment.
Anonymous, March 1, 2017 at 4:47 AM
Hi Mark. If we improve memtable write speed this way, how can we guarantee that no stall or slowdown is triggered in Level 0? Are there any methods to avoid a stall or slowdown? Thanks in advance!