Friday, November 8, 2019

Jungle - LSM plus copy-on-write B-Tree

This is a review of Jungle, an LSM variant that uses a copy-on-write (CoW) B-Tree internally. One of the Jungle developers previously created ForestDB. I am a fan of his work.

At a high level, Jungle is an LSM with leveled compaction, but thanks to the CoW B-Tree it has different read, write and space amplification tradeoffs. I don't understand Jungle well enough to explain where it is better, so I am happy to share this review and accept corrections. My summary of Jungle is:
  • One sorted run per level
  • The per-level file structure uses an index+log approach where the index is a CoW B-Tree, values are appended to the value log and the B-Tree entries point into the log. There is also a bloom filter.
  • Inter-level merge does compaction between adjacent levels. I think this is some-to-some: a batch of data is moved from Ln to Ln+1, values are appended to the value log, keys are inserted into the B-Tree and modified B-Tree pages are persisted by appending them to the end of the B-Tree files. Because of CoW, the path from leaf to root is made dirty when a leaf page is modified.
  • In-place merge does GC within a level to reclaim space from the B-Tree files and value log. The B-Tree is scanned in key order to write a new B-Tree and a new value log. Space is wasted because updates were appended to the end of the B-Tree file and value log. A minimal sketch of this layout follows the list.
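
Here is a minimal Python sketch of that per-level layout, under loud assumptions: the CoW B-Tree is modeled as an in-memory sorted list plus a dict, the value log as a byte array, and the bloom filter is omitted. All names are made up for illustration -- this shows the index+log idea, not Jungle's actual implementation.

  import bisect

  class LevelFile:
      """One level: an index (stand-in for the CoW B-Tree) plus a value log."""

      def __init__(self):
          self.value_log = bytearray()  # values are appended, never rewritten in place
          self.keys = []                # sorted keys, a stand-in for the B-Tree
          self.index = {}               # key -> (offset, length) into the value log

      def append_batch(self, kv_pairs):
          # Inter-level merge: append values to the value log and insert keys
          # into the index (the CoW B-Tree in real Jungle).
          for key, value in kv_pairs:
              offset = len(self.value_log)
              self.value_log += value
              if key not in self.index:
                  bisect.insort(self.keys, key)
              self.index[key] = (offset, len(value))  # a superseded value becomes garbage

      def get(self, key):
          # Point query: index lookup, then one read from the value log.
          entry = self.index.get(key)
          if entry is None:
              return None
          offset, length = entry
          return bytes(self.value_log[offset:offset + length])

      def in_place_merge(self):
          # GC within a level: scan in key order, write a new value log and
          # index, dropping the garbage left behind by earlier appends.
          new_log, new_index = bytearray(), {}
          for key in self.keys:
              offset, length = self.index[key]
              new_index[key] = (len(new_log), length)
              new_log += self.value_log[offset:offset + length]
          self.value_log, self.index = new_log, new_index
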
Purpose

The index+log approach is used to reduce write amplification from large values. The per-level write-amp from moving a KV pair from Ln to Ln+1 is the sum of the write-amp from adding the key to the B-Tree and from appending the value to the end of the value log.

Assuming a batch of KV pairs is inserted, the write-amp for the value log is minimal -- close to 1. If 32 1kb values are moved then 32kb is written. I am uncertain about the average and worst-case write-amp for the B-Tree, even when I only consider write-amp for the leaf pages and ignore the non-leaf pages. For the worst case, assume that each key makes a leaf page dirty. Then for each KV pair with a small key and a 1kb value there is 4kb + 1kb written (4kb for the B-Tree, 1kb for the value log) and the per-level write-amp is ~5. That is a good worst case. I frequently see per-level write-amp of ~5 for production workloads with RocksDB so I wonder what the average case will be for Jungle.
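
To make the arithmetic above concrete, here it is as a few lines of Python. The 32-pair batch, 1kb values and 4kb leaf pages are the numbers from the example, not measurements:

  batch = 32                 # KV pairs moved from Ln to Ln+1
  value_size = 1024          # 1kb value per pair, small keys ignored
  leaf_page_size = 4096      # worst case: one dirty 4kb leaf per key

  log_bytes = batch * value_size        # 32kb appended to the value log
  btree_bytes = batch * leaf_page_size  # 128kb of dirty leaf pages
  write_amp = (log_bytes + btree_bytes) / log_bytes
  print(write_amp)                      # 5.0 -> per-level write-amp of ~5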

There is additional write-amp from doing periodic in-place merges to reclaim space. I won't try to estimate the impact from that.

Comments
  • The CoW B-Tree in Jungle is CoW-S (copy-on-write sequential) because writes are appended and GC must eventually be done.
  • While ordering values in RocksDB has a cost (more write-amp), it also has a benefit (less cache-amp). RocksDB needs a pointer (index entry) in memory per block to achieve a worst case of ~1 disk read per point query -- see the RocksDB data block index. With index+log the values are not in key order, so achieving the same requires a pointer (index entry) in memory per KV pair. Assuming these pointers are ~8 bytes, there is a huge difference in memory overhead between 8 bytes per database page and 8 bytes per KV pair when KV pairs are not too big (a rough comparison follows this list). Per the CRUM Conjecture it is hard to be better in all dimensions -- read, write, disk space and memory.
  • It will be hard to do compression for the value log if only a few values are appended at a time. But if each inter-level merge adds many 4kb pages worth of data to the value log then this isn't a problem.
  • Range scans are more expensive with index+log because values are not in key order, so each value might require a storage IO and incur CPU overhead for decompression. This should only be a problem for the largest level of the LSM tree.
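
To put rough numbers on the cache-amp point above, here is the comparison in Python for a hypothetical 100gb database with 4kb blocks, 1kb KV pairs and 8-byte pointers (all assumed values, not measurements):

  pointer = 8                # bytes per in-memory index entry
  block_size = 4096          # RocksDB: ~1 pointer per data block
  kv_size = 1024             # index+log: 1 pointer per KV pair
  db_size = 100 * 2**30      # hypothetical 100gb database

  rocksdb_mem = db_size // block_size * pointer  # ~200mb
  index_log_mem = db_size // kv_size * pointer   # ~800mb
  print(rocksdb_mem >> 20, index_log_mem >> 20)  # 200 800 (in mb)

The gap grows as KV pairs get smaller, which is the memory cost of giving up key order in the value log.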

1 comment:

  1. Thanks Mark for the review of the paper.

    A few things I want to add/correct:

    * There are multiple CoW B-trees (separate "index+log"s) per level, partitioned by range. Similar to managing SSTables.

    * A CoW B-tree here is an index+log, but logs are locally sorted in each "batch", which is the unit of append. Hence a batch == a sorted run, and a CoW B-tree == an index over multiple batches (sorted runs).

    * Jungle is more like an LSM with tiered compaction. Each CoW B-tree works as an index among (tiered) sorted runs in the same range group, in a level. Log parts of CoW B-trees already existing in level N+1 will not be rewritten by an inter-level compaction from level N.

    * "One leaf page update per one key update" is obviously the worst case of CoW B-tree. But as you mentioned, an inter-level merge (between L_n and L_{n+1}) writes many KV pairs in batch, so per-level write-amp is close to 1 in real workloads.
