Tuesday, October 2, 2018

Describing tiered and leveled compaction

This is another attempt by me to define the shape of an LSM tree with more formality and this builds on previous posts here and here. My key point is that compaction is the property of a level in an LSM tree rather than the LSM tree. Some levels can use tiered and others can use leveled. This combination of tiered and leveled is already done in popular LSM implementations but it hasn't been called out as a feature.

Stepped Merge

The Stepped Merge paper might have been the first description of tiered compaction. It is a way to improve B-Tree insert performance. It looked like an LSM tree with a few sorted runs at each level. When a level was full the sorted runs at that level were merged to create a larger sorted run in the next level. The per-level write-amplification was 1 because compaction into level N+1 merged runs from level N but did not read/rewrite a run already on level N+1.

This looks like tiered compaction. However it allows for N sorted runs on the max level which means that space-amplification will be >= N. I assume that is too much for most users of tiered compaction in Cassandra, RocksDB and HBase. But this isn't a problem for Stepped Merge because it is an algorithm for buffering changes to a B-Tree, not for storing the entire database and it doesn't lead to a large space-amp for that workload. Note that the InnoDB change buffer is a B-Tree that buffers changes to other B-Trees for a similar reason.

Compaction per level

I prefer to define compaction as a property of a level in an LSM tree rather than a property of the LSM tree. Unfortunately this can hamper discussion because it takes more time and text to explain compaction per level.

I will start with definitions:
  1. When a level is full then compaction is done from it to the next larger level. For now I ignore compaction across many levels, but that is a thing (see "major compaction" in HBase).
  2. A sorted run is a sequence of key-value pairs stored in key order. It is stored using 1+ files.
  3. A level is tiered when compaction into it doesn't read/rewrite sorted runs already in that level. 
  4. A level is leveled when compaction into that level reads/rewrites sorted runs already in that level.
  5. Levels are full when they have a configurable number of sorted runs. In classic leveled compaction a level has one sorted run. A tiered level is full when it has X sorted runs where X is some value >= 2. 
  6. leveled-N uses leveled compaction which reads/rewrites an existing sorted run, but it allows N sorted runs (full when runs == N) rather than 1. 
  7. The per level fanout is size(sorted-run in level N) / size(sorted-run in level N-1)
  8. The total fanout is the product of the per level fanouts. When the write buffer is 1G and the database is 1000G then the total fanout must be 1000.
  9. The runs-per-level is the number of sorted runs in a level when it is full.
  10. The per level write-amplification is the work done to compact from Ln to Ln+1. It is 1 for tiered, all-size(Ln+1) / all-size(Ln) for leveled and run-size(Ln+1) / all-size(Ln) for leveled-N where run-size is the size of a sorted run and all-size is the sum of the sizes of all sorted runs on a level.
A level can be described by a 3-tuple (c, f, r) where c is the type of compaction (T or L for tiered or leveled), f is the fanout and r is the runs-per-level. Unfortunately, now we have made the description of an LSM tree even more complex because there is a 3-tuple per level. For now I don't use 3-tuples to describe the write buffer (memory component). That is a topic for another post. Example 3-tuples include:
  • T:1:4 - this is tiered with fanout=1 and runs-per-level=4. It is a common configuration for the RocksDB level 0 (L0) where the fanout is 1 because the compaction input is a write buffer flush so the size of a sorted run in L0 is similar to the size of a full write buffer. For now I ignore that RocksDB can merge write buffers on a flush.
  • T:8:8 - this is tiered with fanout=8 and runs-per-level=8. When Ln and Ln+1 both use tiered then runs-per-level in Ln == fanout in Ln+1. 
  • T:8:4 - this is tiered with fanout=8 and runs-per-level=4. It might be used when the next larger level uses leveled and runs-per-level on this level can be smaller than fanout to reduce read-amp.
  • L:10:1 - this is common in RocksDB with leveled compaction, fanout=10 and runs-per-level=1
  • L:10:2 - this is leveled-N with runs-per-level=2

Compaction per LSM tree

An LSM tree can be described using the per level 3-tuples from the previous section. The following are examples for popular algorithms.

Classic LSM with total fanout = 1000 is:
  • C0 is the write buffer
  • C1, C2 and C3 are L:10:1
RocksDB leveled with total fanout = 1000 is:
  • L0 is T:1:4
  • L1 is L:1:1
  • L2, L3, L4 are L:10:1
Stepped Merge with total fanout = 1000 is:
  • L1 is T:1:10
  • L2, L3, L4 are T:10:10
Tiered in HBase and Cassandra with total fanout = 1000 might be:
  • L1 is T:1:10
  • L2, L3 are T:10:10
  • L4 is L:10:1

Tiered+leveled

Note that some smaller levels using tiered and some larger levels using leveled is done by both RocksDB leveled and Cassandra/HBase tiered. I think both of these are examples of an LSM variant that I call tiered+leveled but I won't ask any of the projects update their docs. My definition of tiered+leveled is the smallest (1 or more) levels use tiered compaction, then 0 or more levels use leveled-N, then the remaining levels use leveled.  When tiered=T, leveled=L and leveled-N=N then the regex for this is T+N*L+.

I don't think it is a good idea for leveled levels to precede tiered levels in tiered+leveled (TTL is OK, LTL is not) but that is a topic for another post.

The largest level should use leveled compaction with runs-per-level=1 to avoid too much space amplification.

LSM trees with tiered+leveled can be described using 3-tuples and the previous section does that but here I provide one for a tree that uses leveled-N for L1 and L2 with total fanout = 1000:
  • L0 is T:1:4
  • L1 is L:1:2
  • L2 is L:10:2
  • L3, L4 are L:10:1

And another example to show that tiered levels don't have to use the same fanout or runs-per-level, but runs-per-level for Ln == fanout for Ln+1:
  • L0 is T:1:20
  • L1 is T:20:10
  • L2 is T:10:2
  • L3 is L:5:1

Leveled-N

Leveled-N can reduce the per level write-amp at the cost of increasing the per level read-amp.

The regex for an LSM tree that uses leveled-N is N+L+ (see the previous section). The largest level should use leveled compaction with runs-per-level=1 to avoid too much space amplification. An example 3-tuple for leveled-N with fanout=1000 that has runs-per-level=2 for L1 and L2 is:
  • L1 is L:10:2
  • L2 is L:10:2
  • L3 is L:10:1

No comments:

Post a Comment

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...