Thursday, July 26, 2018

Tiered or leveled compaction, why not both via adaptive compaction?

First there was leveled compaction and it was good, but it took a while for implementations to become popular. Then there was (size) tiered compaction and it was good too, but more confusing given the diversity in strategies for picking files to compact. RocksDB didn't help with the confusion by calling it universal compaction. Eventually compaction algorithms optimized for time series were added (see DTCS for Cassandra). Finally, Kudu and InfluxDB have specialized compaction algorithms that are also worth understanding.

This post is about adaptive compaction which is yet another compaction algorithm. The summary for adaptive compaction is:
  • LSM file structure is orthogonal to the use of tiered or leveled compaction. A given LSM instance can be viewed as using both tiered and leveled; it just depends on how you look at it.
  • Some levels in the LSM tree can be read optimized while others can be write optimized.
  • Each compaction step makes cost-based decisions to change a level between read and write optimized states, or remain as is. Compaction steps can also change the per-level fanout (thanks to Siying and Andrew for that suggestion).
  • Adaptive compaction can be configured to do strictly tiered or strictly leveled. So it can even adapt to become non-adaptive.

This is just an idea today. It has not been implemented. I think it will be good for RocksDB, but validation of ideas takes time.

Drawing the LSM file structure

The LSM file structure with leveled compaction is usually depicted vertically. The example below is from leveled compaction with 3 levels, per-level fanout set to 4, 4 SSTs in L1, 16 SSTs in L2 and 64 SSTs in L3. I ignore level 0 to simplify the picture.
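The diagram isn't reproduced here, so a minimal sketch of that structure follows. The SST names are made up for illustration; they are not real RocksDB file names.

```python
# Leveled view: 3 levels with per-level fanout 4.
# SST names are made up for illustration.
leveled = {
    "L1": [f"L1.{i}" for i in range(1, 5)],   # 4 SSTs
    "L2": [f"L2.{i}" for i in range(1, 17)],  # 16 SSTs
    "L3": [f"L3.{i}" for i in range(1, 65)],  # 64 SSTs
}

for level, ssts in leveled.items():
    print(level, "has", len(ssts), "SSTs")
```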


With tiered compaction the LSM files are usually depicted horizontally as a sequence of sorted runs. The compaction algorithm merges 2 or more adjacent sorted runs at a time, but it is hard to reason about the write amplification that will occur with such merging. Below, each sorted run is range partitioned into 4, 16 and 64 partitions. Each partition is an SST.
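Again the diagram isn't reproduced here. A sketch of the same SSTs viewed as a horizontal sequence of sorted runs, each range partitioned:

```python
# Tiered view of the same SSTs: a sequence of sorted runs, each range
# partitioned into SSTs. Names reuse the leveled sketch above.
sorted_runs = [
    [f"L1.{i}" for i in range(1, 5)],   # run 1: 4 partitions
    [f"L2.{i}" for i in range(1, 17)],  # run 2: 16 partitions
    [f"L3.{i}" for i in range(1, 65)],  # run 3: 64 partitions
]

for n, run in enumerate(sorted_runs, start=1):
    print("sorted run", n, "has", len(run), "SSTs")
```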


File Structure vs Compaction

The key point here is that the LSM file structure is orthogonal to the use of tiered or leveled compaction. For the examples above the same LSM files are depicted vertically for leveled compaction and horizontally for tiered compaction.

The same files from an LSM instance can be used with tiered or leveled compaction. An instance using leveled compaction can be stopped and switched to use tiered compaction. An instance using tiered compaction can be stopped and switched to use leveled compaction. Metadata might have to be changed during the switch but otherwise the switch is trivial (allow me to wave hands).

To make this claim I ignore the work done in leveled with RocksDB to limit overlap between SSTs on adjacent levels. That is useful when incremental compaction is done to compact 1 SST from level N with the ~fanout SSTs from level N+1 as done by LevelDB and RocksDB. Limiting overlap isn't needed with the classic LSM approach because compaction between levels is all-to-all rather than incremental.

To transform from leveled to tiered assume that with tiered compaction the LSM structure is a sequence of sorted runs and each sorted run is 1+ SSTs. Then the N SSTs in the L0 become the first N sorted runs in the tiered sequence. They are followed by the sorted runs from L1 to Lmax. In this case the large sorted runs from L1 to Lmax are range partitioned (split into many SSTs) in the tiered sequence.

To transform from tiered to leveled the sorted runs from the prefix of the tiered sequence that are each a single SST become the SSTs in L0. Each of the remaining sorted runs becomes a level from L1 to Lmax. This requires that large sorted runs are range partitioned into many SSTs with tiered compaction.
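A minimal sketch of those two transformations, assuming the leveled view is an L0 (a list of SSTs) plus a list of range-partitioned levels, and the tiered view is a list of sorted runs where each run is a list of SSTs:

```python
def leveled_to_tiered(l0_ssts, levels):
    """Leveled view -> tiered view.

    l0_ssts: list of L0 SST names, each one is its own sorted run.
    levels: list of lists; levels[i] holds the SSTs of level i+1.
    Returns a list of sorted runs, each a list of SSTs."""
    runs = [[sst] for sst in l0_ssts]             # each L0 SST is a sorted run
    runs.extend(list(level) for level in levels)  # L1..Lmax are one run each
    return runs


def tiered_to_leveled(runs):
    """Tiered view -> leveled view.

    The prefix of single-SST runs becomes the L0 and each remaining
    (range partitioned) run becomes a level from L1 to Lmax."""
    i = 0
    while i < len(runs) and len(runs[i]) == 1:
        i += 1
    l0_ssts = [run[0] for run in runs[:i]]
    levels = [list(run) for run in runs[i:]]
    return l0_ssts, levels
```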

Next is a specific example for transforming from leveled to tiered with fanout=8. The LSM with leveled has:
  • 4 SSTs in the L0 named L0.1, L0.2, L0.3 and L0.4
  • L1 is partitioned into 4 SSTs: L1.1 to L1.4
  • L2 is partitioned into 32 SSTs: L2.1 to L2.32
  • L3 is partitioned into 256 SSTs: L3.1 to L3.256
Viewed with tiered compaction that becomes 7 sorted runs. The sequence of sorted runs is:
  • sorted runs 1 to 4 are L0.1, L0.2, L0.3, L0.4
  • sorted run 5 is L1.1 to L1.4 (range partitioned)
  • sorted run 6 is L2.1 to L2.32 (range partitioned)
  • sorted run 7 is L3.1 to L3.256 (range partitioned)
A similar transformation can be done to go from tiered back to leveled. Assume the LSM with tiered compaction uses the file structure from above with 7 sorted runs: the first 4 are each one SST and runs 5, 6 and 7 are each range partitioned into many SSTs. This can be viewed as an LSM with leveled compaction where runs 1 to 4 are in the L0 and runs 5, 6 and 7 become levels 1, 2 and 3. As noted elsewhere this ignores the work done to limit overlap between adjacent levels with leveled compaction in RocksDB.
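As a standalone sanity check on the example (using the made-up SST names), a few lines confirm the count of 7 sorted runs:

```python
# The leveled example: 4 SSTs in the L0 plus range-partitioned L1, L2 and L3.
l0 = [f"L0.{i}" for i in range(1, 5)]     # L0.1 .. L0.4
l1 = [f"L1.{i}" for i in range(1, 5)]     # L1.1 .. L1.4
l2 = [f"L2.{i}" for i in range(1, 33)]    # L2.1 .. L2.32
l3 = [f"L3.{i}" for i in range(1, 257)]   # L3.1 .. L3.256

# Each L0 SST is its own sorted run; L1, L2 and L3 are one sorted run each.
tiered_sequence = [[sst] for sst in l0] + [l1, l2, l3]
assert len(tiered_sequence) == 7
print(len(tiered_sequence), "sorted runs")
```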

Vertical Depiction of Tiered Compaction

More recently I have seen the vertical depiction used to explain write amplification for tiered compaction (thanks DASLab for doing this). But that has unstated assumptions about how sorted runs are selected for compaction. With those assumptions the number of levels in the LSM tree is the same whether leveled or tiered is used, but the write-amplification is different. The number of levels is log_fanout(R) where R is size(Lmax) / size(L1). With leveled the worst-case per-level write-amp is equal to the fanout and with tiered it is 1 (a small sketch of this arithmetic follows the list below). The unstated assumptions are:
  1. Sorted runs to be merged all come from the same level.
  2. The fanout determines the number of sorted runs per level. When fanout=8 then each level is full with 8 sorted runs and a merge is started.
  3. When each level is full then each level has fanout times more data than the previous level.
  4. The output of a merge is written to the next larger level. Perhaps this rule can be broken when the size of the output is not large enough. For example assume the fanout is 8, the target size for sorted runs on L1 is 100 and the target size for sorted runs on the L2 is 800. When 8 sorted runs from L1 are merged, on which level should the output be written if the output has size 100 or size 200?
  5. The max level has fanout sorted runs, which implies that space amplification is ~fanout, and that is too large for me. I prefer to limit the max level to 1 sorted run, which also increases the total write-amp. The space-amp can be further reduced by increasing the fanout between the next-to-max and max levels. I am curious whether existing LSMs can be configured to limit the max level to 1 sorted run, which limits the worst-case space-amplification to 2 (ignoring transient space-amp during compaction). A recent paper, Dostoevsky by Dayan and Idreos, claims they cannot.
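Here is the small sketch of the arithmetic promised above, under those assumptions. It ignores the write buffer, the L0 and constant factors.

```python
import math

def lsm_costs(fanout, R):
    """R is size(Lmax) / size(L1). Returns (levels, leveled write-amp,
    tiered write-amp) assuming per-level write-amp is fanout for leveled
    and 1 for tiered."""
    levels = math.log(R) / math.log(fanout)  # number of levels: log_fanout(R)
    leveled_wamp = fanout * levels           # worst-case total write-amp, leveled
    tiered_wamp = 1 * levels                 # worst-case total write-amp, tiered
    return levels, leveled_wamp, tiered_wamp

# Example: fanout=8 and Lmax is 512x larger than L1 -> ~3 levels,
# total write-amp ~24 for leveled vs ~3 for tiered.
print(lsm_costs(8, 512))
```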
The example below is a vertical depiction of tiered compaction with 4 sorted runs per level. If fanout=4 then each level is full. The sorted runs in levels 1, 2 and 3 are L1.r1 to L1.r4, L2.r1 to L2.r4 and L3.r1 to L3.r4. Each sorted run can be an SST or a sequence of range partitioned SSTs. A sorted run in level N+1 is approximately fanout times larger than a sorted run in level N. The sizes of the boxes below don't imply the sizes of the sorted runs (my drawing skills are limited).
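The figure isn't reproduced here; a sketch of the structure it describes, with each level full at 4 sorted runs:

```python
# Vertical depiction of tiered compaction: each level is full with 4 sorted
# runs. Each named run could be one SST or many range-partitioned SSTs.
tiered_levels = {
    1: ["L1.r1", "L1.r2", "L1.r3", "L1.r4"],
    2: ["L2.r1", "L2.r2", "L2.r3", "L2.r4"],  # each run ~4x larger than in L1
    3: ["L3.r1", "L3.r2", "L3.r3", "L3.r4"],  # each run ~4x larger than in L2
}

for level, runs in tiered_levels.items():
    print(f"L{level}: {len(runs)} sorted runs")
```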


Adaptive Compaction

A quick summary for adaptive compaction is that it uses the vertical depiction for tiered but each level can have a different target for the number of sorted runs -- from 1 to fanout.

First, let me show the LSM tree when the max level is constrained to one sorted run so that the worst-case space-amplification is <= 2, ignoring temp space during compaction. Each level has <= fanout sorted runs. A sorted run is either an SST or range partitioned into many SSTs. A level with 2+ sorted runs is write optimized. A level with 0 or 1 sorted runs is read optimized. The sizes of the boxes below don't imply the sizes of the sorted runs (my drawing skills are limited).
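The figure isn't reproduced here, but one possible state of such a tree can be sketched as the number of sorted runs per level. The counts are made up for illustration.

```python
# Number of sorted runs per level, with the max level (L4) constrained to 1
# sorted run. A level with 2+ runs is write optimized; 0 or 1 is read optimized.
runs_per_level = {1: 3, 2: 4, 3: 2, 4: 1}   # made-up example with fanout=4

def state(nruns):
    return "write optimized" if nruns >= 2 else "read optimized"

for level, nruns in runs_per_level.items():
    print(f"L{level}: {nruns} sorted runs -> {state(nruns)}")

assert runs_per_level[max(runs_per_level)] == 1   # max level has 1 sorted run
```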


Adaptive compaction:
  1. Uses the vertical depiction of the LSM tree to constrain the compaction steps that can occur.
  2. Makes cost-based decisions during compaction to make the LSM tree more efficient for both the short-term and long-term workload. This is done one compaction step at a time. This is called adaptive compaction because it adapts the shape of the LSM tree to the workload.
  3. The decisions are 1) whether a level should be tiered or leveled and 2) the per-level fanout for the level. When a level is tiered then the per-level fanout determines the number of sorted runs on that level. When a level is leveled then the per-level fanout determines the size ratio between it and adjacent levels.
  4. Decisions respect constraints including the maximum space-amplification (both temporary, during compaction, and permanent, after compaction), write-amplification, number of levels and number of sorted runs. Constraints set limits on the worst-case read and write efficiency. A possible constraint is the max number of sorted runs for the max level. Constraints can also include hints that optimization should favor point reads, range reads or writes. A sketch of what such per-level decisions and constraints might look like follows this list.
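Here is a sketch of what the per-level decisions and constraints might look like as configuration. All of the names and default values are made up for illustration; this is not a RocksDB API.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class LevelConfig:
    tiered: bool   # True: the level is tiered, False: the level is leveled
    fanout: int    # if tiered: number of sorted runs on the level
                   # if leveled: size ratio between this level and adjacent levels

@dataclass
class Constraints:
    max_space_amp: float = 2.0        # permanent, after compaction
    max_temp_space_amp: float = 3.0   # temporary, during compaction
    max_write_amp: float = 30.0
    max_levels: int = 7
    max_sorted_runs: int = 20
    max_runs_in_max_level: int = 1    # keeps worst-case space-amp near 2
    favor: str = "point"              # hint: "point", "range" or "write"

@dataclass
class AdaptiveConfig:
    levels: Dict[int, LevelConfig] = field(default_factory=dict)
    constraints: Constraints = field(default_factory=Constraints)

# Example: L1 and L2 write optimized during a burst, larger levels leveled.
config = AdaptiveConfig(levels={
    1: LevelConfig(tiered=True, fanout=4),
    2: LevelConfig(tiered=True, fanout=4),
    3: LevelConfig(tiered=False, fanout=8),
    4: LevelConfig(tiered=False, fanout=8),
})
```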
Compaction is a sequence of compaction steps and at each compaction step adaptive compaction makes the LSM tree more or less read optimized to be more efficient for both the current and long-term workload. Note that workloads can have daily cycles so that optimizing for the current workload during daily backup or bulk load might not be the best decision when things return to normal in a few hours.

There are four types of compaction steps that can be scheduled. Some make the LSM tree more read optimized and some make it more write optimized. A sketch of the step types and their preconditions follows the list.
  1. tiered - this merges 2+ sorted runs from level N and writes the output to level N or level N+1 depending on the size of the output. Regardless, level N becomes read optimized. Level N+1 remains or becomes write optimized.
  2. tiered+leveled - this merges 2+ sorted runs from level N with 1 sorted run from level N+1 and writes the output to level N+1. For now I assume this cannot be used when level N+1 has more than 1 sorted run. When the compaction step finishes level N+1 remains read optimized and level N becomes read optimized.
  3. leveled.part - this merges one SST from a sorted run on level N with overlapping SSTs from a sorted run on level N+1. This cannot be used when levels N or N+1 have more than one sorted run. This leaves levels N and N+1 as read optimized. For now I ignore the details of minimizing overlaps between a sorted run in level N and sorted runs in level N+1.
  4. leveled.all - this merges the sorted run in level N with the sorted run in level N+1 and writes the output to level N+1. This cannot be used when levels N or N+1 have more than one sorted run. This leaves levels N and N+1 as read optimized. Unlike leveled.part this doesn't require minimizing overlap.
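A sketch of the four step types and their preconditions, using a list of per-level sorted-run counts. This is illustrative pseudologic for the rules above, not an implementation, and it assumes a tiered step merges all sorted runs on a level.

```python
def can_apply(step, runs, n):
    """runs[n] is the number of sorted runs on level n; level n+1 must exist."""
    if step == "tiered":
        return runs[n] >= 2
    if step == "tiered+leveled":
        return runs[n] >= 2 and runs[n + 1] <= 1
    if step in ("leveled.part", "leveled.all"):
        return runs[n] <= 1 and runs[n + 1] <= 1
    raise ValueError(step)

def apply_step(step, runs, n, output_to_next=True):
    """Returns the per-level run counts after one compaction step."""
    new = list(runs)
    if step == "tiered":
        if output_to_next:
            new[n] = 0            # level n becomes read optimized
            new[n + 1] += 1       # level n+1 remains or becomes write optimized
        else:
            new[n] = 1            # all runs merged into one run on level n
    elif step in ("tiered+leveled", "leveled.all"):
        new[n] = 0                # merged with the single run on level n+1
        new[n + 1] = 1            # level n+1 remains read optimized
    elif step == "leveled.part":
        pass                      # SST-at-a-time merge; run counts don't change
    return new
```
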
Examples of adaptive compaction

The smaller levels of the LSM tree have the most churn. Therefore they adapt faster as the current workload changes. They can also be fixed faster when the workload shifts from read to write heavy. Optimizations of persistent data structures carry risk: the time needed to undo those optimizations when things change. There is much more risk for optimizations done for Lmax than for L1.

A simple example of adaptive compaction is changing L1 and then L2 from having one sorted run to many sorted runs when the current workload shifts from read to write heavy. Assume that the LSM tree starts out with all levels read optimized (one sorted run per level) and has been using either leveled.part or leveled.all compaction steps. Compaction into L1 is special; I have yet to mention it, but assume there is a write buffer and that leveled.part compaction steps have been used to merge it with L1.


Then there is a burst of writes and L1 switches from read to write optimized with 4 sorted runs. Write buffer flushes stop using leveled compaction steps and just write new sorted runs into L1. Compaction steps into L2, L3 and L4 continue using leveled.all or leveled.part.


Then the burst of writes continues and L2 switches from read to write optimized with 4 sorted runs. Write buffer flushes continue to create new sorted runs in L1. Compaction into L2 begins to use tiered compaction steps.


Eventually the write burst ends and L2 changes from write to read optimized by using a tiered+leveled compaction step from L1 to L2. The result is 1 sorted run in L2 and 0 sorted runs in L1. Then a write buffer flush creates a sorted run in L1.
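The figures for this walk-through aren't reproduced here, so here is the sequence sketched as per-level sorted-run counts for L1 to L4. The exact counts are illustrative; a level with 2+ runs is write optimized.

```python
# Snapshots of the walk-through above as sorted-run counts for L1..L4 (fanout=4).
snapshots = [
    ("start: all levels read optimized",       [1, 1, 1, 1]),
    ("write burst: L1 write optimized",        [4, 1, 1, 1]),
    ("burst continues: L2 write optimized",    [2, 4, 1, 1]),
    ("burst ends: L2 read optimized again",    [0, 1, 1, 1]),
    ("write buffer flush creates a run in L1", [1, 1, 1, 1]),
]

for label, runs in snapshots:
    states = ["write" if r >= 2 else "read" for r in runs]
    print(f"{label:42s} runs={runs} states={states}")
```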


Open Problems

I am sure there are many; this section lists two.

One interesting problem is to optimize when there is skew in the workload. The workload is usually not uniform across the key space. Some ranges of the key space might need optimization for writes and point queries. Another range might benefit from optimization for range queries. The work proposed above doesn't handle this any better than leveled or tiered compaction today.

In some cases we can isolate different workloads to different column families, and with MyRocks that means a column family per index. That still doesn't handle the case where the workload has variance within one index, but it is a first step.
