**Stepped Merge**

The Stepped Merge paper might have been the first description of tiered compaction. It is a way to improve B-Tree insert performance. It looked like an LSM tree with a few sorted runs at each level. When a level was full the sorted runs at that level were merged to create a larger sorted run in the next level. The per-level write-amplification was 1 because compaction into level N+1 merged runs from level N but did not read/rewrite a run already on level N+1.

This looks like tiered compaction. However it allows for N sorted runs on the max level which means that space-amplification will be >= N. I assume that is too much for most users of tiered compaction in Cassandra, RocksDB and HBase. But this isn't a problem for Stepped Merge because it is an algorithm for buffering changes to a B-Tree, not for storing the entire database and it doesn't lead to a large space-amp for that workload. Note that the InnoDB change buffer is a B-Tree that buffers changes to other B-Trees for a similar reason.

**Compaction per level**

I prefer to define compaction as a property of a level in an LSM tree rather than a property of the LSM tree. Unfortunately this can hamper discussion because it takes more time and text to explain compaction per level.

I will start with definitions:

- When a level is full then compaction is done from it to the next larger level. For now I ignore compaction across many levels, but that is a thing (see "major compaction" in HBase).
- A sorted run is a sequence of key-value pairs stored in key order. It is stored using 1+ files.
- A level is tiered when compaction into it doesn't read/rewrite sorted runs already in that level.
- A level is leveled when compaction into that level reads/rewrites sorted runs already in that level.
- Levels are full when they have a configurable number of sorted runs. In classic leveled compaction a level has one sorted run. A tiered level is full when it has X sorted runs where X is some value >= 2.
- leveled-N uses leveled compaction which reads/rewrites an existing sorted run, but it allows N sorted runs (full when runs == N) rather than 1.
- The per level fanout is size(sorted-run in level N) / size(sorted-run in level N-1)
- The total fanout is the product of the per level fanouts. When the write buffer is 1G and the database is 1000G then the total fanout must be 1000.
- The runs-per-level is the number of sorted runs in a level when it is full.
- The per level write-amplification is the work done to compact from Ln to Ln+1. It is 1 for tiered, all-size(Ln+1) / all-size(Ln) for leveled and run-size(Ln+1) / all-size(Ln) for leveled-N where run-size is the size of a sorted run and all-size is the sum of the sizes of all sorted runs on a level.

A level can be described by a 3-tuple (c, f, r) where c is the type of compaction (T or L for tiered or leveled), f is the fanout and r is the runs-per-level. Unfortunately, now we have made the description of an LSM tree even more complex because there is a 3-tuple per level. For now I don't use 3-tuples to describe the write buffer (memory component). That is a topic for another post. Example 3-tuples include:

- T:1:4 - this is tiered with fanout=1 and runs-per-level=4. It is a common configuration for the RocksDB level 0 (L0) where the fanout is 1 because the compaction input is a write buffer flush so the size of a sorted run in L0 is similar to the size of a full write buffer. For now I ignore that RocksDB can merge write buffers on a flush.
- T:8:8 - this is tiered with fanout=8 and runs-per-level=8. When Ln and Ln+1 both use tiered then runs-per-level in Ln == fanout in Ln+1.
- T:8:4 - this is tiered with fanout=8 and runs-per-level=4. It might be used when the next larger level uses leveled and runs-per-level on this level can be smaller than fanout to reduce read-amp.
- L:10:1 - this is common in RocksDB with leveled compaction, fanout=10 and runs-per-level=1
- L:10:2 - this is leveled-N with runs-per-level=2

**Compaction per LSM tree**

An LSM tree can be described using the per level 3-tuples from the previous section. The following are examples for popular algorithms.

Classic LSM with total fanout = 1000 is:

- C0 is the write buffer
- C1, C2 and C3 are L:10:1

RocksDB leveled with total fanout = 1000 is:

- L0 is T:1:4
- L1 is L:1:1
- L2, L3, L4 are L:10:1

Stepped Merge with total fanout = 1000 is:

- L1 is T:1:10
- L2, L3, L4 are T:10:10

Tiered in HBase and Cassandra with total fanout = 1000 might be:

- L1 is T:1:10
- L2, L3 are T:10:10
- L4 is L:10:1

**Tiered+leveled**

Note that some smaller levels using tiered and some larger levels using leveled is done by both RocksDB leveled and Cassandra/HBase tiered. I think both of these are examples of an LSM variant that I call tiered+leveled but I won't ask any of the projects update their docs. My definition of tiered+leveled is the smallest (1 or more) levels use tiered compaction, then 0 or more levels use leveled-N, then the remaining levels use leveled. When tiered=T, leveled=L and leveled-N=N then the regex for this is T+N*L+.

I don't think it is a good idea for leveled levels to precede tiered levels in tiered+leveled (TTL is OK, LTL is not) but that is a topic for another post.

The largest level should use leveled compaction with runs-per-level=1 to avoid too much space amplification.

LSM trees with tiered+leveled can be described using 3-tuples and the previous section does that but here I provide one for a tree that uses leveled-N for L1 and L2 with total fanout = 1000:

- L0 is T:1:4
- L1 is L:1:2
- L2 is L:10:2
- L3, L4 are L:10:1

And another example to show that tiered levels don't have to use the same fanout or runs-per-level, but runs-per-level for Ln == fanout for Ln+1:

- L0 is T:1:20
- L1 is T:20:10
- L2 is T:10:2
- L3 is L:5:1

**Leveled-N**

Leveled-N can reduce the per level write-amp at the cost of increasing the per level read-amp.

The regex for an LSM tree that uses leveled-N is N+L+ (see the previous section). The largest level should use leveled compaction with runs-per-level=1 to avoid too much space amplification. An example 3-tuple for leveled-N with fanout=1000 that has runs-per-level=2 for L1 and L2 is:

- L1 is L:10:2
- L2 is L:10:2
- L3 is L:10:1

## No comments:

## Post a Comment