Small Datum: Name that compaction algorithm

First there was leveled compaction and it was a great paper. Then tiered compaction arrived in BigTable, HBase and Cassandra. Eventually LevelDB arrived with leveled compaction and RocksDB emerged from that. Along the way a few interesting optimizations have been added including support for time series data. My summary is missing a few details because it is a summary.

Compaction algorithms constrain the LSM tree shape. They determine which sorted runs can be merged by it and which sorted runs need to be accessed for a read operation. I am not sure whether they have been formally defined, but I hope there can be agreement on the basics. I will try to do that now for a few - leveled, tiered, tiered+leveled, leveled-N and time-series. There are two new names on this list -- tiered+leveled and leveled-N.

LSM tree used to imply leveled compaction. I prefer to expand the LSM tree definition to include leveled, tiered and more.

I reference several papers below. All of them are awesome, even when not perfect -- they are major contributions to write-optimized databases and worth reading. One of the best things about my job is getting time to read papers like this and then speak with the authors.

There are many interesting details in academic papers and existing systems (RocksDB, Cassandra, HBase, ScyllaDB) that I ignore. I don't want to get lost in the details.

Leveled

Leveled compaction minimizes space amplification at the cost of read and write amplification.

The LSM tree is a sequence of levels. Each level is one sorted run that can be range partitioned into many files. Each level is many times larger than the previous level. The size ratio of adjacent levels is sometimes called the fanout and write amplification is minimized when the same fanout is used between all levels. Compaction into level N (Ln) merges data from Ln-1 into Ln. Compaction into Ln rewrites data that was previously merged into Ln. The per-level write amplification is equal to the fanout in the worst case, but it tends to be less than the fanout in practice as explained in this paper by Hyeontaek Lim et al. Compaction in the original LSM paper was all-to-all -- all data from Ln-1 is merged with all data from Ln. It is some-to-some for LevelDB and RocksDB -- some data from Ln-1 is merged with some (the overlapping) data in Ln.

While write amplification is usually worse with leveled than with tiered there are a few cases where leveled is competitive. The first is key-order inserts and a RocksDB optimization greatly reduces write-amp in that case. The second one is skewed writes where only a small fraction of the keys are likely to be updated. With the right value for compaction priority in RocksDB compaction should stop at the smallest level that is large enough to capture the write working set -- it won't go all the way to the max level. When leveled compaction is some-to-some then compaction is only done for the slices of the LSM tree that overlap the written keys, which can generate less write amplification than all-to-all compaction.

Tiered

Tiered compaction minimizes write amplification at the cost of read and space amplification.

The LSM tree can still be viewed as a sequence of levels as explained in the Dostoevsky paper by Niv Dayan and Stratos Idreos. Each level has N sorted runs. Each sorted run in Ln is ~N times larger than a sorted run in Ln-1. Compaction merges all sorted runs in one level to create a new sorted run in the next level. N in this case is similar to fanout for leveled compaction. Compaction does not read/rewrite sorted runs in Ln when merging into Ln. The per-level write amplification is 1 which is much less than for leveled where it was fanout.

Most implementations of tiered compaction don't behave exactly as described in the previous paragraph. I hope they are close enough, because the model above makes it easy to reason about performance and estimate the worst-case write amplification. A common approach for tiered is to merge sorted runs of similar size, without having the notion of levels (which imply a target for the number of sorted runs of specific sizes). Most include some notion of major compaction that includes the largest sorted run and conditions that trigger major and non-major compaction. Too many files and too many bytes are typical conditions.

The stepped merge paper is the earliest reference I found for tiered compaction. It reduces random IO for b-tree changes by buffering them in an LSM tree that uses tiered compaction. While the stepped merge algorithm is presented as different from an LSM, it is tiered compaction. The MaSM paper is similar but the SM in MaSM stands for sort merge. The paper uses an external sort rather than an LSM to reduce write amplification. It assumes that LSM implies leveled compaction but an external sort looks a lot like tiered compaction. The InnoDB change buffer has a similar goal of reducing random IO for changes to a b-tree but doesn't use an LSM. In what year did the InnoDB change buffer get designed or implemented?

I prefer that tiered not require N sorted runs at the max level because that means N copies of the database which is too much space amplification. I define it to allow K copies at the max level where K is between 2 and N. But it still does tiered compaction at the max level and when the max level is full (has K sorted runs) then the K runs are merged and the output (1 sorted run) replaces the K runs in the max level. One day I hope to learn whether HBase or Cassandra support 1, a few or N sorted runs at the max level -- although this can be confusing because they don't enforce the notion of levels. Tiered compaction in RocksDB has a configuration option to limit the worst-case space amplification which should prevent too many full copies (too many sorted runs at the max level) but I don't have much experience with tiered in RocksDB. I hope the RocksDB wiki gets updated to explain this.

There are a few challenges with tiered compaction:

Transient space amplification is large when compaction includes a sorted run from the max level.
The block index and bloom filter for large sorted runs will be large. Splitting them into smaller parts is a good idea.
Compaction for large sorted runs takes a long time. Multi-threading would help.
Compaction is all-to-all. When there is skew and most of the keys don't get updates, large sorted runs might get rewritten because compaction is all-to-all. In a traditional tiered algorithm there is no way to rewrite a subset of a large sorted run.

For tiered compaction the notion of levels are usually a concept to reason about the shape of the LSM tree and estimate write amplification. With RocksDB they are also an implementation detail. The levels of the LSM tree beyond L0 can be used to store the larger sorted runs. The benefit from this is to partition large sorted runs into smaller SSTs. This reduces the size of the largest bloom filter and block index chunks -- which is friendlier to the block cache -- and was a big deal before partitioned index/filter was supported. With subcompactions this enables multi-threaded compaction of the largest sorted runs. Note that RocksDB used the name universal rather than tiered. More docs on this are here.

Tiered+Leveled

Tiered+Leveled has less write amplification than leveled and less space amplification than tiered.

The tiered+leveled approach is a hybrid that uses tiered for the smaller levels and leveled for the larger levels. It is flexible about the level at which the LSM tree switches from tiered to leveled. For now I assume that if Ln is leveled then all levels that follow (Ln+1, Ln+2, ...) must be leveled.

SlimDB from VLDB 2018 is an example of tiered+leveled although it might allow Lk to be tiered when Ln is leveled for k > n. Fluid LSM is described as tiered+leveled but I think it is leveled-N.

Leveled compaction in RocksDB is also tiered+leveled, but we didn't explain it that way until now. There can be N sorted runs at the memtable level courtesy of the max_write_buffer_number option -- only one is active for writes, the rest are read-only waiting to be flushed. A memtable flush is similar to tiered compaction -- the memtable output creates a new sorted run in L0 and doesn't read/rewrite existing sorted runs in L0. There can be N sorted runs in level 0 (L0) courtesy of level0_file_num_compaction_trigger. So the L0 is tiered. Compaction isn't done into the memtable level so it doesn't have to be labeled as tiered or leveled. Subcompactions in the RocksDB L0 makes this even more interesting, but that is a topic for another post. I hope we get more docs on this interesting feature from Andrew Kryczka.

Leveled-N

Leveled-N compaction is like leveled compaction but with less write and more read amplification. It allows more than one sorted run per level. Compaction merges all sorted runs from Ln-1 into one sorted run from Ln, which is leveled. And then "-N" is added to the name to indicate there can be n sorted runs per level.

The Dostoevsky paper defined a compaction algorithm named Fluid LSM in which the max level has 1 sorted run but the non-max levels can have more than 1 sorted run. Leveled compaction is done into the max level. The paper states that tiered compaction is done into the smaller levels when they have more than 1 sorted run. But from my reading of the paper it uses leveled-N for the non-max levels.

In Fluid LSM each level is T times larger than the previous level (T == fanout), the max level has Z sorted runs and the non-max levels have K sorted runs. When Z=1 and K=1 then this is leveled compaction. When Z=1 and K>1 or Z>1 and K>1 then I claim this uses leveled-N.

Assuming K>1 for Ln-1 then compaction with Fluid LSM into Ln merges K runs from Ln-1 with 1 run from Ln. This doesn't match my definition of tiered compaction because compaction into Ln reads & rewrites a sorted run from Ln and per-level write amplification is likely to be larger than 1. Regardless I like the idea.

Examples of write amplification with Fluid LSM for compaction from Ln-1 to Ln:

T==K - there are T (or K) sorted runs in each of Ln-1 and Ln. When each run in Ln-1 has size 1, then each run in Ln has size T. Compaction into Ln merges T runs from Ln-1 with 1 run from Ln to create a new run in Ln. This reads T bytes from Ln-1 and T bytes from Ln and the new run has a size between T and 2T -- size T when all keys in Ln-1 are duplicates of keys in the run from Ln and size > T otherwise. When the new run has size 2T the per-level write amp is 2 because 2T bytes were written to move T bytes from Ln-1. When the new run has size T the per-level write amp is 1. Otherwise the per-level write-amp is between 1 and 2.
T > K - there are K sorted runs in each of Ln-1 and Ln. Each run in Ln-1 has size T/K and each run in Ln has size T^2/K. K runs in Ln-1 have size T. Compaction reads T bytes from Ln-1, T^2/K bytes from Ln and writes a new run in Ln that has a size between T^2/K and (T^2/K + T). The per-level write-amp is as small as T^2/K / T, which reduces to T/K, when all keys in Ln-1 are duplicates with the run in Ln. It can be as large as (T^2/K + T) / T, which reduces to T/K + 1, when there is no overlap. Otherwise it is between T/K and T/K + 1.

When K=2 and T=10 then the per-level write-amp is ~5 which is about half of the per-level write-amp from leveled compaction.

Time Series

There are compaction algorithms optimized for time series workloads. I have no experience with them but they are worth mentioning. Cassandra had DTCS and has TWCS. InfluxDB has or had TSM and TSI. I hope we eventually do something interesting for time series with RocksDB.

Other

There are other interesting LSM engines:

Tarantool - Sphia begat Vinyl and I lost track of it. But I have high hopes.
WiredTiger - has an LSM but they are busy making the CoW b-tree better
Kudu - didn't use RocksDB and I like the reasons for not using it

My summary of Sphia and Tarantool probably has bugs. My memory is that Sophia was a great design assuming the database : RAM ratio wasn't too large. It had a memtable and a sorted run on disk -- both were partitioned (not sure if range or hash). When a memtable partition became full then leveled compaction was done between it and its disk partition. Vinyl has changed enough from this design that I won't try to summarize it here. It has clever ideas for managing the partitions.

ScyllaDB

I briefly mentioned ScyllaDB at the start of the post. I have yet to use the product but their documentation on LSM efficiency and many other things is remarkable. Start with this post that compares the compaction strategies (algorithms) in ScyllaDB -- leveled, size-tiered, hybrid and time-window. From this attached slide deck I learned that Lucene implemented an LSM in 1999. They also have two posts that explain write amplification for tiered and leveled compaction.

Hybrid compaction is described in the embedded slide deck and it is interesting. Hybrid range partitions large sorted runs into many SSTs, similar to RocksDB. Hybrid then uses that to make compaction with large sorted runs incremental -- an input SST to the compaction can be deleted before the compaction is finished (slide 33). This reduces the worst-case space amplification that is transient when merges are in progress for large sorted runs. This isn't trivial to implement. It isn't clear to me but slide 34 suggests that hybrid can limit compaction to a subset (1 or a few SSTs) of a large sorted run when the writes are skewed. Maybe a ScyllaDB expert can confirm or deny my guess. Hybrid also has optimizations for tombstones (slide 44). I won't go into detail here, just as I ignored the SingleDelete optimization in RocksDB.

12 comments:

byteAugust 31, 2018 at 8:28 AM
Hey Mark! Great post. One minor correction I think it is Sophia not Sphia (Vinyl/tarantool).
SonnyApril 30, 2019 at 5:45 AM
Hello, Mark.
Could you elaborate why write amplification is minimized when the same fanout is used between all levels?
In Leveled section, "The size ratio of adjacent levels is sometimes called the fanout and write amplification is minimized when the same fanout is used between all levels. "
AnonymousFebruary 23, 2023 at 3:44 AM
Hello, Mark.
In Leveled section you mentioned "Leveled compaction minimizes space amplification at the cost of read and write amplification."
I can understand that the space amplification is better than others and the write amplification is worse than others. Could you explain why the read amplification is worse than others? Thank you!
Mark CallaghanFebruary 23, 2023 at 8:14 AM
Good question and I don't remember what I had in mind when I wrote it. But that statement is true relative to a B-Tree. However if the scope of the discussion is limited to LSM then it is less true.

Also, I am wary of claims about read amplification that don't also mention whether this is CPU read-amp, IO read-amp or both. In practice the LSM with leveled has worse CPU read-amp than a b-tree but similar or better IO read-amp.
AnonymousApril 13, 2024 at 1:46 AM
I'm curious how LSM-trees maintain the GDPR standard of deleting data after 30 days, especially if there isn't high throughput that would force a full compaction every 30 days. In tiered, when a tombstone enters the system, we have to maintain it until it reaches the max level, at which point we only remove it after a major compaction when we output a single sorted run to LMax. If we have to force this process to occur within 30 days for every tombstone, I wonder how that affects write amplification or if there is any better approach?
AnonymousApril 13, 2024 at 1:48 AM
I am curious what happens in lazy levelling (Leveled N) when the max level has 1 sorted run, and the levels before have say 4. What happens if that max sorted run reaches some massive size, orders of magnitude larger than the 4 sorted runs on the previous level (say L4 is the max and has a 500GB sorted run, and L3 has 4 10GB sorted runs). Then, we're merging 5 sorted runs to create a single sorted run for the max level, but one of the sorted run inputs is massive. This would cause massive write amplification. Do you know how they minimize this amplification?

Small Datum

Thursday, August 30, 2018

Name that compaction algorithm

12 comments:

CPU-bound sysbench on a large server: Postgres 12 to 19 beta1