Friday, November 8, 2019

Jungle - LSM plus copy-on-write B-Tree

This is a review of Jungle, which is an LSM variant that uses a copy-on-write (CoW) B-Tree internally. One of the Jungle developers previously invented ForestDB. I am a fan of his work.

At a high level, Jungle is an LSM with leveled compaction, but thanks to the CoW B-Tree it has different read, write and space amplification tradeoffs. I don't understand Jungle well enough to explain where it is better, but I am happy to share this review and accept corrections. My summary of Jungle is:
  • One sorted run per level
  • The per-level file structure uses an index+log approach: the index is a CoW B-Tree, values are appended to a value log, and the B-Tree entries point into the log. There is also a bloom filter. A sketch of this structure follows the list.
  • Inter-level merge does compaction between adjacent levels. I think this is some-to-some as some data is moved from Ln to Ln+1 in a batch: values are appended to the value log, keys are inserted into the B-Tree, and modified B-Tree pages are persisted by appending to the end of the B-Tree files. Because of CoW, modifying a leaf page dirties the path from that leaf to the root.
  • In-place merge does GC within a level to reclaim space from the B-Tree files and value log. The B-Tree is scanned in order to write a new B-Tree and new value log. Space is wasted because updates were appended to the end of the B-Tree file and value log.
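Here is a minimal sketch of how I understand the index+log structure within one level. This is not Jungle's code: a Python dict stands in for the CoW B-Tree, the bloom filter is ignored, and everything lives in memory, but it shows the append-only value log, the index entries that point into it, and the in-place merge that reclaims space.

class Level:
    """Sketch of one index+log level: index entries point into a value log."""

    def __init__(self):
        self.index = {}         # stand-in for the CoW B-Tree: key -> (offset, length)
        self.log = bytearray()  # value log, append-only

    def merge_in(self, kv_pairs):
        # Inter-level merge: append values to the end of the value log and
        # point the index entries at them. Overwritten entries become garbage.
        for key, value in kv_pairs:
            offset = len(self.log)
            self.log.extend(value)
            self.index[key] = (offset, len(value))

    def get(self, key):
        offset, length = self.index[key]
        return bytes(self.log[offset:offset + length])

    def in_place_merge(self):
        # GC within the level: scan in key order and rewrite the index and
        # value log so that space from stale values is reclaimed.
        new_log = bytearray()
        new_index = {}
        for key in sorted(self.index):
            value = self.get(key)
            new_index[key] = (len(new_log), len(value))
            new_log.extend(value)
        self.index, self.log = new_index, new_log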
Purpose

The index+log approach is used to reduce write amplification from large values. The per-level write-amp from moving a KV pair from Ln to Ln+1 is the sum of the write-amp from adding keys to the B-Tree and from appending to the end of the value log. 

Assuming a batch of KV pairs is inserted, the write-amp for the value log is minimal -- close to 1. If 32 1kb values are moved then 32kb is written. I am uncertain about the average and worst-case write-amp for the B-Tree, even when I only consider write-amp for the leaf pages and ignore the non-leaf pages. For the worst case, assume that each key makes a leaf page dirty. Then for each KV pair with a small key and a 1kb value there is 4kb + 1kb written (4kb for the B-Tree, 1kb for the value log) and the per-level write-amp is ~5. That is a good worst case. I frequently see per-level write-amp of ~5 for production workloads with RocksDB so I wonder what the average case will be for Jungle.
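To make that arithmetic concrete, here is a back-of-the-envelope calculation. The 4kb leaf page, 1kb value and 16-byte index entry sizes are my assumptions rather than numbers from the paper, and it ignores non-leaf pages and the cost of in-place merges.

# Per-level write-amp estimate for the worst and best cases described above.
value_size = 1024       # bytes per value
leaf_page_size = 4096   # bytes per B-Tree leaf page
index_entry_size = 16   # assumed bytes per key + pointer in a leaf page

# Worst case: every moved KV pair dirties a distinct leaf page, so each pair
# costs one 4kb leaf write plus one value-log append.
worst_case_wa = (leaf_page_size + value_size) / value_size
print(worst_case_wa)    # 5.0

# Best case: a large batch fills leaf pages with index entries, so the leaf
# write cost is amortized over many keys and write-amp approaches 1.
entries_per_leaf = leaf_page_size // index_entry_size
best_case_wa = (leaf_page_size / entries_per_leaf + value_size) / value_size
print(best_case_wa)     # ~1.02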

There is additional write-amp from doing periodic in-place merges to reclaim space. I won't try to estimate the impact from that.

Comments
  • The CoW B-Tree in Jungle is CoW-S because writes are appended and GC must eventually be done.
  • While ordering values in RocksDB has a cost (more write-amp), it also has a benefit (less cache-amp). RocksDB needs a pointer (index entry) in memory per block to achieve a worst case of ~1 disk read per point query -- see the RocksDB data block index. With index+log the values are not in key order and this needs a pointer (index entry) in memory per KV pair. Assuming these pointers are ~8 bytes, there is a huge difference in memory overhead between 8 bytes per database page and 8 bytes per KV pair when KV pairs are much smaller than a page (see the arithmetic after this list). Per the CRUM Conjecture it is hard to be better in all dimensions -- read, write, disk space and memory.
  • It will be hard to do compression for the value log if only a few values are appended at a time. But if each inter-level merge adds many 4kb pages worth of data to the value log then this isn't a problem.
  • Range scans are more expensive with index+log because values are not in key order and each value might require a storage IO and the CPU overhead for decompression. This should only be a problem for the largest level of the LSM tree.
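A rough comparison of the cache-amp point above. The 1 TB database size, 4kb block size, KV sizes and 8-byte pointers are my assumptions for illustration, not numbers from the paper or from RocksDB.

# In-memory index entries needed to get ~1 disk read per point query.
db_size = 1 << 40    # 1 TB of data
pointer = 8          # bytes per in-memory index entry

per_block = db_size // 4096 * pointer   # ordered values: pointer per 4kb block
per_kv_1k = db_size // 1024 * pointer   # index+log with 1kb KV pairs
per_kv_128 = db_size // 128 * pointer   # index+log with 128-byte KV pairs

print(per_block >> 30, per_kv_1k >> 30, per_kv_128 >> 30)  # 2, 8, 64 (GB)

So with 1kb KV pairs the memory overhead is 4x larger for index+log, and with 128-byte KV pairs it is 32x larger.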

1 comment:

  1. Thanks Mark for the review of the paper.

    A few things I want to add/correct:

    * There are multiple CoW B-trees (separate "index+log"s) per level, partitioned by range. Similar to managing SSTables.

    * A CoW B-tree in here is an index+log, but logs are locally sorted in each "batch", which is the unit of append. Hence, a batch == a sorted run, and a CoW B-tree == an index for multiple batches (sorted runs).

    * Jungle is more like an LSM with tiered compaction. Each CoW B-tree works as an index among (tiered) sorted runs in the same range group, in a level. Log parts of CoW B-trees already existing in level N+1 will not be rewritten by an inter-level compaction from level N.

    * "One leaf page update per one key update" is obviously the worst case of CoW B-tree. But as you mentioned, an inter-level merge (between L_n and L_{n+1}) writes many KV pairs in batch, so per-level write-amp is close to 1 in real workloads.

