I have read a few papers, including TRIAD, with ideas on reducing write-amp for the smaller levels of the LSM tree. I think this could be done for RocksDB by merging and remerging immutable memtables -- this is similar in spirit to subcompactions for the L0. With a large immutable memtable there would be one less level in the LSM tree. This is an alternative to having an L0, and maybe an L1, that are not made durable. In all cases the cost is a longer MTTR because WAL replay must be done. In all cases there is an assumption that the non-durable levels (large immutable memtables or L0/L1) are in memory.
This is a small complaint from me that I have made in the past. The paper states that an LSM eliminates random IO when making things durable. I prefer to claim that it reduces random IO. With leveled compaction each step merges N (~11) SSTs to generate one steam of output. So for each step there is likely a need to seek when reading the ~11 input streams and writing the output stream. Then compaction steps usually run concurrently when the ingest rate is high so there are more seeks. Then the WAL must be written -- one more stream and a chance for more seeks. Finally user queries are likely to read from storage causing even more seeks. Fortunately, there will be fewer seeks per insert/update/delete compared to a B-Tree.
The paper has a short history of compaction describing pure-tiered and pure-leveled. But these are rarely used in practice. The original LSM paper implemented pure-leveled. LevelDB and RocksDB use a hybrid approach with tiered for the L0 followed by leveled for the remaining levels. Pure-tiered was introduced by the Stepped Merge paper. Using tiered for all levels has a large space-amplification, much larger than 1, because the max level is tiered and that is too much wasted space for many workloads. Tiered in RocksDB and other popular LSM engines can be configured to use leveled compaction into the max level to get a space-amp less than 2, ignoring transient space-amp during compaction into the max level. Pure-tiered was a great choice for Stepped Merge because that was a cache for bulk-loading a data warehouse rather than a full copy of the database. While I think that RocksDB leveled and RocksDB tiered are examples of tiered+leveled, I don't want to rename them.
I appreciate that the paper makes clear that trade-offs must be considered when evaluating benchmarks. Many things can support higher write rates than RocksDB with leveled compaction, including RocksDB with tiered compaction. But that comes at a cost in memory, read and/or space amplification. Some papers could do a better job of documenting those costs.
The cost analysis in section 2.3 is limited to IO costs. I look forward to coverage of CPU costs in future LSM research. The read penalty for an LSM compared to a B-Tree is usually worse for CPU than for IO. The paper uses partitioned and non-partitioned where I use all-to-all and some-to-some to explain the compaction approaches. RocksDB implements some-to-some for leveled and all-to-all for tiered. The paper does a nice job explaining why the per-level write-amp should be less for all-to-all than some-to-some, ignoring write skew. Note that in production the per-level write-amp is almost always less than the per-level growth factor and this paper from Hyeontaek Lim explains why.
For the read IO costs, the paper counts logical IOs rather than physical IOs. Logical IOs are easier to estimate because caches mean that many logical IOs don't cause a physical IO and smaller levels in the LSM tree are usually in cache. There are two ways to consider the cost for a range query -- long vs short range queries or the cost of range seek vs range next. The paper uses the first, I use the second. Both are useful.
I appreciate that the author noticed this. I realize there is pressure to market research and I am not offering to try and reproduce benchmark results, but I have been skeptical about some of the comparisons I see where the base case is InnoDB or RocksDB.
These improvements have mainly been evaluated against a default (untuned) configuration of LevelDB or RocksDB, which use the leveling merge policy with size ratio 10. It is not clear how these improvements would compare against a well-tuned LSM-tree.The discussion in 3.3.1 on pipelining compaction is interesting but RocksDB already does pipelining. With buffered IO there is support for async read-ahead and async write-behind. Note that the read and write phases can also be CPU-heavy if the cost for decompression on read and compression on write are included, even when the wonderful zstd and lz4 algorithms are used.
A few more comments:
- RocksDB has limited support for fractional cascading (from SST to SST). See 3.4.2.
- With key-value separation, GC could merge log segments to generate longer ordered log segments over time. This would reduce the range read penalty. See 3.4.2.
- LHAM might be the first time-series optimized compaction strategy. See 3.5.
- Non-unique secondary index maintenance is already read-free in MyRocks. It has a copy of the row prior to index maintenance, because SQL semantics or because this was an insert. Write-optimized SQL engines can add support for read-free change statements in some cases but that usually means SQL semantics (like modified row count) will be broken. See 3.7.2.
- MyRocks already collects statistics during compaction. See 3.7.3.