The design goal was disk-based systems and insert/update/delete-intensive tables that get more writes than reads. Transactional logging (history tables) is one example. The LSM was shown to use less IO capacity than a B-Tree for making changes durable, which was a big deal when the cost of IOPs was a significant fraction of the system cost. The LSM uses less disk IO capacity than a B-Tree for two reasons:
- It does multi-page writes rather than single page writes so the seek latency is amortized over N pages (maybe N=64). Assume a disk can do either 32 1MB transfers/s or 128 4kb transfers/s. Then multi-page block IO throughput is 64x faster.
- When sized appropriately, it does a page write per M modified rows while a B-Tree in the worst case (for their example) does a page write per modified row.
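The 64x claim can be checked with quick arithmetic. The device numbers below are the illustrative ones from the text, not measurements:

```python
# Back-of-envelope check of the 64x throughput claim, using the
# assumed device numbers from the text (not real measurements).
PAGE = 4 * 1024                       # bytes per 4kb page
single_page_ios_per_sec = 128         # 4kb transfers/s, seek-dominated
multi_block_xfers_per_sec = 32        # 1MB transfers/s

single_page_bw = single_page_ios_per_sec * PAGE            # bytes/s
multi_block_bw = multi_block_xfers_per_sec * 1024 * 1024   # bytes/s

print(multi_block_bw / single_page_bw)  # 64.0
```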
Details
The paper explains two component and multi-component LSM Trees. A component is similar to a level in RocksDB: component 0 (C0) is the in-memory tree, component 1 (C1) is the smallest on-disk tree. For multi-component trees there can be C2 and C3. C0 is similar to the RocksDB memtable and Cn is similar to Ln in RocksDB. The original LSM didn't have something equivalent to the L0 in RocksDB. While it might have been good for LevelDB, I think the L0 is a performance bug in RocksDB as it makes reads less efficient.
The paper has math to explain that the best fanout between components is a constant, and the paper used r for that value. This is similar to RocksDB for levels >= 1. In the paper, r is the fanout between all components including C0 to C1, while in RocksDB the fanout for memtable:L0 and L0:L1 is usually smaller than r.
The paper doesn't mention the use of bloom filters for the LSM Tree, but does mention them as used by another index structure. They aren't needed for a two component LSM Tree, but can be useful when the LSM Tree has more than 2 components. I assume they could be implemented via a bloom filter per multi-page block.
The paper mentions a few features that have yet to arrive in RocksDB. Predicate deletion is similar to range deletion in RocksDB and is a way to quickly perform a bulk delete. A find note is a special record written to the LSM Tree that requests data. As the find note migrates from C0 to Cmax it encounters the keys that it requests, and the data is returned along the way.
C0 can use a variety of in-memory tree indexes. C1 and larger use a special B-Tree:
- Leaf nodes are packed (no free space)
- Key-order adjacent nodes are written in multi-page blocks. Assume a leaf node is 4kb and a multi-page block is 256kb, then 64 leaf nodes are written per multi-page block. This amortizes the seek latency when writing leaf nodes and when reading during long range scans.
- Point queries do single-page reads, long range scans do multi-page block reads.
- Leaf nodes are write once.
- Non-leaf nodes are also write once and might use multi-page blocks.
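As a toy illustration of the leaf-node layout (the helper, names, and sizes here are mine, not the paper's):

```python
# Toy sketch: pack write-once, key-order adjacent leaf nodes into
# multi-page blocks. Sizes match the example above: 4kb leaves,
# 256kb blocks, so 64 leaves per block.
LEAF = 4 * 1024
BLOCK = 256 * 1024
LEAVES_PER_BLOCK = BLOCK // LEAF  # 64

def pack_into_blocks(leaves):
    """Group adjacent leaf nodes into full multi-page blocks."""
    return [leaves[i:i + LEAVES_PER_BLOCK]
            for i in range(0, len(leaves), LEAVES_PER_BLOCK)]

blocks = pack_into_blocks([f"leaf{i}" for i in range(200)])
# 200 leaves -> 4 blocks holding 64 + 64 + 64 + 8 leaves
```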
The paper uses a rolling merge to move data from C0 to C1. RocksDB calls this compaction. A rolling merge is a sequence of merge steps. Each merge step reads some data from C0 and C1, merges it and then writes it out into a new multi-page block in C1. The rolling merge is in progress all of the time. The rolling merge is similar to compaction for levels L1 to Lmax in RocksDB, but the implementation is different.
- Compaction in RocksDB is some-to-some and atomic. By some-to-some I mean that RocksDB chooses 1 SST from level N to compact into the ~10 overlapping SSTs in level N+1. By atomic I mean that the output from a compaction step isn't visible until all of the compaction input has been consumed (and this used to be a big problem for universal/tiered compaction in RocksDB with large SSTs).
- The rolling merge is neither some-to-some nor atomic. I will call it all-to-all and incremental. By all-to-all I mean that a rolling merge is not limited to data from a range in each component (though a merge step is). By incremental I mean that the merge output is made visible as soon as possible, alas that is complicated.
- During a merge step component N is the inner component and component N+1 is the outer component.
- The merge step inputs from components N and N+1 are called the emptying nodes, and the outputs written into component N+1 are called the filling nodes.
- As soon as possible, the outcome from a merge step is made visible. This part is complicated. I assume this means that the B-Tree for component N is changed to remove that key range and the B-Tree for component N+1 is changed to reference the filling nodes in place of the emptying nodes. The non-leaf nodes of the B-Trees are also written sequentially, just like the leaf nodes. But I wasn't clear on some of the details on when these updates can be done.
A merge step consumes a range of data from the inner component, a range of data from the outer component and replaces that with new data in the outer component. While the paper has a section on concurrency (supporting reads concurrent with the rolling merge, supporting rolling merge in trees with more than 2 components) I don't think all of the details were provided. Some of the details are:
- for now a node is a b-tree leaf node
- nodes are locked in read mode for queries
- the nodes from the outer component, and maybe the inner component, are locked in write mode when data from the node(s) is being merged. I didn't understand all of the details about when the node(s) could be unlocked.
When a merge step fills a multi-page block (for now assume 64 leaf nodes) then that block will be written to the database file. It doesn't overwrite live multi-page blocks. Eventually the B-Tree will be updated to point to the new leaf nodes in this multi-page block. Eventually the space from the multi-page blocks for the emptying nodes can be reclaimed. This maintains one sequential write stream per component.
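A minimal single-threaded sketch of one merge step might look like the following. This is my simplification, not the paper's algorithm: locking, recovery, B-Tree pointer updates, and space reclamation are all omitted, and the node size is artificially small:

```python
import heapq

ENTRIES_PER_NODE = 4  # tiny for illustration; real leaf nodes hold far more

def merge_step(inner_entries, outer_entries):
    """Merge sorted (key, value) lists from the emptying nodes of the
    inner and outer components; on duplicate keys the inner (newer)
    entry wins. Returns the filling nodes for a new multi-page block."""
    merged = {}
    # heapq.merge is stable, so for equal keys the outer entry arrives
    # first and the inner entry overwrites it in the dict.
    for k, v in heapq.merge(outer_entries, inner_entries,
                            key=lambda kv: kv[0]):
        merged[k] = v
    rows = sorted(merged.items())
    # Pack the merged rows into filling nodes for the new block.
    return [rows[i:i + ENTRIES_PER_NODE]
            for i in range(0, len(rows), ENTRIES_PER_NODE)]

filling = merge_step(inner_entries=[(2, "new2"), (5, "v5")],
                     outer_entries=[(1, "v1"), (2, "old2"), (3, "v3")])
# filling -> [[(1, "v1"), (2, "new2"), (3, "v3"), (5, "v5")]]
```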
Performance
The paper has a nice example of the 5 Minute Rule to estimate what data should be in memory based on the cost of RAM and IOPs. Assume RAM is $10/GB and IOPs are $200 per 10k. Then one IO/second costs 2 cents, 4kb of RAM costs 0.0038 cents, and ~500 4kb pages of RAM have the same cost as 1 IO/second (equivalently, a 4kb page of RAM costs the same as ~1/500 IO/second). When that is true, data that is referenced more frequently than once per 500 seconds should be in RAM.
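The arithmetic can be reproduced directly; the prices are the example's assumptions, not current hardware prices:

```python
# 5 Minute Rule arithmetic with the assumed prices from the example.
ram_per_gb = 10.00                       # $ per GB of RAM
cost_per_io_per_sec = 200.0 / 10_000     # $200 buys 10k IO/s -> $0.02
pages_per_gb = (1024 ** 3) // 4096       # 262144 4kb pages per GB
ram_per_page = ram_per_gb / pages_per_gb # ~$0.000038 (0.0038 cents)

# Break-even reference interval: keep a page in RAM when it is read
# more often than once per this many seconds.
break_even_seconds = cost_per_io_per_sec / ram_per_page
print(round(break_even_seconds))  # 524, i.e. ~500 seconds
```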
Next is the performance model. I renamed a few of the variables. The variables include:
- IO-S - cost of a single page IO
- IO-M - cost of a page IO when done as part of a multi-page block IO
- Se - size of index entry in bytes
- Sp - size of page in bytes
- S0 - size in MB of C0
- S1 - size in MB of C1
- De - number of B-Tree levels not in cache
The IO cost of a B-Tree insert is: IO-S * (De + 1). There are De IO reads to get the leaf node and then 1 IO to write back the dirty leaf node. This is the worst-case for a B-Tree. Each of the IO operations is single-page with cost IO-S.
M is used for the LSM insert cost. M is the average number of C0 entries merged into a C1 page. One formula for it is (Sp/Se) * (S0 / (S0 + S1)) because:
- Sp/Se is the number of entries per page
- S0 / (S0 + S1) is fraction of database entries in C0
Then the IO cost of an LSM insert with a two component LSM Tree is: 2 * IO-M / M.
- 2 because a component is re-read when it is re-written
- IO-M because the IO operations are multi-page block reads and writes
- Divided by M because the cost (2 * IO-M) is amortized over M entries from C0
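A worked instance of the formulas above, with sizes I chose for illustration (16-byte entries, 4kb pages, C0 at 1% of the total):

```python
# Worked instance of the LSM insert cost model. All sizes are
# assumptions for illustration, not values from the paper.
Se = 16              # index entry size in bytes
Sp = 4096            # page size in bytes
S0, S1 = 1.0, 99.0   # component sizes, same units (e.g. GB)
IO_M = 1.0 / 64      # per-page cost of a block IO, in units of IO-S = 1

M = (Sp / Se) * (S0 / (S0 + S1))  # C0 entries merged per C1 page
lsm_insert_cost = 2 * IO_M / M    # in units of single-page IOs

print(M)                # 2.56
print(lsm_insert_cost)  # ~0.012, vs IO-S * (De + 1) >= 2 for the B-Tree
```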
The ratio of the costs determines when LSM inserts are cheaper:
Cost(LSM) / Cost(B-Tree) = K1 * (IO-M / IO-S) * (1/M)
- K1 is 2 / (De + 1), which =1 when De is 1 and =2/3 when De is 2
- IO-M / IO-S is the per-page cost reduction from multi-page block IO, which is << 1 for disks (the speedup, IO-S / IO-M, is >> 10)
Assuming disks are used, IO-S/IO-M is 64 and K1 = 1, then M must be >= 1/64 for LSM inserts to be cheaper.
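Plugging numbers into the ratio makes this concrete. The values are assumptions: De = 1 so K1 = 1, and the per-page cost of a block IO is 1/64 of a single-page IO:

```python
def cost_ratio(K1, io_m_over_io_s, M):
    """Cost(LSM) / Cost(B-Tree); LSM inserts are cheaper when < 1."""
    return K1 * io_m_over_io_s / M

# With K1 = 1 and IO-M/IO-S = 1/64, the LSM wins whenever M > 1/64:
assert cost_ratio(1, 1 / 64, M=2.56) < 1      # ratio ~0.006, LSM wins
assert cost_ratio(1, 1 / 64, M=1 / 128) > 1   # tiny C0 -> B-Tree wins
```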