Posts

Showing posts from October, 2018

Combining tiered and leveled compaction

There are simple optimization problems for LSM tuning. For example use leveled compaction to minimize space amplification and use tiered to minimize write amplification. But there are interesting problems that are harder to solve: maximize throughput given a constraint on write and/or space amplification minimize space and/or write amplification given a constraint on read amplification To solve the first problem use leveled compaction if it can satisfy the write amp constraint, else use tiered compaction if it can satisfy the space amp constraint, otherwise there is no solution. The lack of a solution might mean the constraints are unreasonable but it can also mean we need to enhance LSM implementations to support more diversity in LSM tree shapes . Even when there is a solution using leveled or tiered compaction there are solutions that would do much better were an LSM to support more varieties of tiered+leveled and leveled-N. When I mention solved above I leave out that th

Minimizing write amplification in an LSM

Write-amplification for an LSM with leveled compaction is minimized when the per-level growth factor (fanout) is the same between all levels. This is a result for an LSM tree using a given number of levels. To find the minimal write-amplification for any number of levels this result can be repeated for 2, 3, 4, ... up to a large value. You might find that a large number of levels is needed to get the least write-amp and that comes at price of more read-amp , as the RUM Conjecture predicts. In all cases below I assume that compaction into the smallest level (from a write buffer flush) has no write-amp. This is done to reduce the size of this blog post. tl;dr - for an LSM with L1, L2, L3 and L4 what values for per-level fanout minimizes write-amp when the total fanout is 1000? (10, 10, 10) for leveled (6.3, 12.6, 12.6) for leveled-N assuming two of the levels have 2 sorted runs (>1, >1, >1) for tiered Minimizing write-amp for leveled compaction For an LSM with 4 le

Describing tiered and leveled compaction

This is another attempt by me to define the shape of an LSM tree with more formality and this builds on previous posts here and here . My key point is that compaction is the property of a level in an LSM tree rather than the LSM tree. Some levels can use tiered and others can use leveled. This combination of tiered and leveled is already done in popular LSM implementations but it hasn't been called out as a feature. Stepped Merge The Stepped Merge paper might have been the first description of tiered compaction. It is a way to improve B-Tree insert performance. It looked like an LSM tree with a few sorted runs at each level. When a level was full the sorted runs at that level were merged to create a larger sorted run in the next level. The per-level write-amplification was 1 because compaction into level N+1 merged runs from level N but did not read/rewrite a run already on level N+1. This looks like tiered compaction. However it allows for N sorted runs on the max level w

Transaction Processing in NewSQL

This is a list of references for transaction processing in NewSQL systems. The work is exciting. I don't have much to add and wrote this to avoid losing interesting links. My focus is on OLTP, but some of these systems support more than that. By NewSQL I mean the following. I am not trying to define "NewSQL" for the world: Support for multiple nodes because the storage/compute on one node isn't sufficient. Support for SQL with ACID transactions. If there are shards then cross-shard operations can be consistent and isolated. Replication does not prevent properties listed above when you are wiling to pay the price in commit overhead. Alas synchronous geo-replication is slow and too-slow commit is another form of downtime. I hope NewSQL systems make this less of a problem (async geo-replication for some or all commits, commutative operations). Contention and conflict are common in OLTP and it is important to understand the minimal time between commits to a single r