At a high level, Jungle is an LSM with leveled compaction, but thanks to its CoW B-Tree it has different read, write, and space amplification tradeoffs. I don't understand Jungle well enough to explain where it is better, but I am happy to share this review and accept corrections. My summary of Jungle is:
- One sorted run per level
- The per-level file structure uses an index+log approach: the index is a CoW B-Tree, values are appended to a value log, and the B-Tree entries point into the log. There is also a bloom filter.
- Inter-level merge does compaction between adjacent levels. I think this is some-to-some: data is moved from Ln to Ln+1 in batches, values are appended to the value log, keys are inserted into the B-Tree, and modified B-Tree pages are persisted by appending them to the end of the B-Tree file. Because of CoW, modifying a leaf page dirties the entire path from leaf to root.
- In-place merge does GC within a level to reclaim space from the B-Tree files and value log. The B-Tree is scanned in key order to write a new B-Tree and a new value log. Space is wasted because updates are appended to the end of the B-Tree file and value log, leaving stale entries behind.
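To make the CoW behavior above concrete, here is a minimal sketch (hypothetical structures, not Jungle's code): rewriting a leaf forces every ancestor on the path to the root to be rewritten too, and all of the rewritten pages are appended to the file while the superseded pages become garbage for a later in-place merge.

```python
# Minimal sketch (not Jungle's actual code) of why a CoW B-Tree dirties
# the whole leaf-to-root path: each modified page gets a new location at
# the end of the file, so its parent must be rewritten to point at it,
# and so on up to the root.

class Page:
    def __init__(self, keys, children=None):
        self.keys = keys          # separator keys (or KV keys in a leaf)
        self.children = children  # None for leaf pages

def cow_update(path_to_leaf, new_leaf):
    """Rewrite pages along path_to_leaf bottom-up; return the new root
    and the list of dirty pages that must be appended to the file.

    path_to_leaf is a list of (page, child_index) pairs from the root
    down to the parent of the modified leaf."""
    dirty = [new_leaf]
    child = new_leaf
    for page, idx in reversed(path_to_leaf):
        # copy the parent so it can point at the rewritten child
        children = list(page.children)
        children[idx] = child
        child = Page(list(page.keys), children)
        dirty.append(child)
    return child, dirty           # child is now the new root

# Example: a 2-level tree; updating one leaf dirties leaf + root = 2 pages.
leaf_a, leaf_b = Page(["a"]), Page(["m"])
root = Page(["m"], [leaf_a, leaf_b])
new_root, dirty = cow_update([(root, 0)], Page(["a", "b"]))
print(len(dirty))  # 2 pages appended; the old leaf and root become garbage
```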
The index+log approach is used to reduce write amplification from large values. The per-level write-amp from moving a KV pair from Ln to Ln+1 is the sum of the write-amp from adding keys to the B-Tree and from appending to the end of the value log.
Assuming a batch of KV pairs is inserted, write-amp for the value log is minimal -- close to 1. If 32 1kb values are moved then 32kb is written. I am uncertain about the average and worst-case write-amp for the B-Tree, even when I only consider write-amp for the leaf pages and ignore the non-leaf pages. For the worst case, assume that each key makes a distinct leaf page dirty. Then for each KV pair with a small key and a 1kb value, 4kb + 1kb is written (4kb for the B-Tree, 1kb for the value log) and the per-level write-amp is ~5. That is a good worst case. I frequently see per-level write-amp of ~5 for production workloads with RocksDB so I wonder what the average case will be for Jungle.
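That arithmetic is easy to check. The batch-of-32 best case below is my own illustration of why the average depends on how many of the moved keys land on each dirty leaf page:

```python
# Worst-case per-level write-amp: every moved KV pair dirties one 4kb
# leaf page, keys are small, values are 1kb.

page_size = 4 * 1024       # B-Tree leaf page
value_size = 1 * 1024      # value appended to the value log

btree_write = page_size    # worst case: one dirty leaf per key
vlog_write = value_size    # value log write-amp is ~1
write_amp = (btree_write + vlog_write) / value_size
print(write_amp)  # 5.0

# Illustrative best case: all 32 keys in a batch land on one leaf page.
batch = 32
write_amp_best = (page_size + batch * value_size) / (batch * value_size)
print(round(write_amp_best, 3))  # 1.125
```

The average case falls somewhere between ~1.1 and ~5 depending on how clustered the batch's keys are.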
There is additional write-amp from doing periodic in-place merges to reclaim space. I won't try to estimate the impact from that.
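The in-place merge itself can be sketched in a few lines (hypothetical names, with a dict and list standing in for the real B-Tree and value log): scan the index in key order and copy only live values into a new log, dropping the stale versions that appends left behind.

```python
# Minimal sketch (not Jungle's code) of in-place merge / GC within a
# level: rewrite the index and value log, keeping only live entries.

def in_place_merge(index, value_log):
    """index: key -> offset of the live version in value_log (a list).
    Returns (new_index, new_log) with no dead or duplicate entries."""
    new_index, new_log = {}, []
    for key in sorted(index):           # in-order scan of the B-Tree
        new_index[key] = len(new_log)   # offset in the compact new log
        new_log.append(value_log[index[key]])
    return new_index, new_log

# 'a' was updated so offset 0 is dead; 'b' was deleted so offset 1 is dead.
old_log = ["a=v1", "b=v1", "a=v2"]
idx = {"a": 2}                          # only the latest version is live
new_idx, new_log = in_place_merge(idx, old_log)
print(new_log)   # ['a=v2'] -- 2 of 3 log slots reclaimed
```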
- The CoW B-Tree in Jungle is CoW-S (copy-on-write sequential) because writes are appended and GC must eventually be done.
- While ordering values in RocksDB has a cost (more write-amp), it also has a benefit (less cache-amp). RocksDB needs a pointer (index entry) in memory per block to achieve a worst case of ~1 disk read per point query -- see the RocksDB data block index. With index+log the values are not in key order, so this needs a pointer (index entry) in memory per KV pair. Assuming these pointers are ~8 bytes, and that KV pairs are much smaller than a page, there is a huge difference in memory overhead between 8 bytes per database page and 8 bytes per KV pair. Per the CRUM Conjecture it is hard to be better in all dimensions -- read, write, disk space and memory.
- It will be hard to do compression for the value log if only a few values are appended at a time. But if each inter-level merge adds many 4kb pages worth of data to the value log then this isn't a problem.
- Range scans are more expensive with index+log because values are not in key order, so each value might require a storage read plus the CPU overhead for decompression. This should only be a problem for the largest level of the LSM tree, as the smaller levels are more likely to be cached.
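To put numbers on the cache-amp point above: with 8-byte pointers, 4kb blocks, and 1 TB of data (my assumed sizes), the per-block vs per-KV-pair memory overhead works out as follows. The gap is only 4x for 1kb KV pairs but grows as KV pairs shrink:

```python
# Memory needed for ~1 disk read per point query: a pointer per block
# (RocksDB-style) vs a pointer per KV pair (index+log). Sizes assumed.

pointer = 8
block_size = 4 * 1024
db_size = 1 << 40                                  # 1 TB of data

per_block = db_size // block_size * pointer
print(per_block >> 20, "MB for 8 bytes per block")  # 2048 MB

for kv_size in (1024, 128):
    per_kv = db_size // kv_size * pointer
    print(kv_size, "byte KV pairs:", per_kv >> 20, "MB")
# 1kb KV pairs need 8192 MB; 128-byte KV pairs need 65536 MB
```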