Thursday, December 2, 2021

Summarizing the different implementations of tiered compaction

This is my first attempt to summarize tiered compaction as implemented by RocksDB, Cassandra, ScyllaDB and HBase. I have limited experience with tiered compaction (that will change) and limited knowledge of Cassandra, ScyllaDB and HBase (that won't change soon), so I will make mistakes but enjoy the learning process. To keep this brief I ignore the variants optimized for time-series workloads and I don't explain compaction basics.

Different DBMS use different names for tiered compaction:

  • RocksDB has universal compaction
  • HBase has ExploringCompactionPolicy and RatioBasedCompactionPolicy, see here
  • Cassandra has STCS (Size Tiered Compaction Strategy)
  • ScyllaDB has STCS and ICS (Incremental Compaction Strategy), see here
Context

The workload matters a lot when considering compaction algorithms. There are many but I list only three below. Note that many real workloads are a mixture of patterns, for example some overwrites mixed with some inserts.
  • Overwrite-mostly - all of my time in LSM-land has been focused on overwrite-mostly workloads. Compaction is useful for overwrites because it reclaims space by removing tombstones and old versions.
  • Insert-mostly - for now I assume that something removes data via TTL or explicit deletes, otherwise the database size grows forever. Insert-mostly can be further divided into:
    • Insert-only in key order - this is the easiest pattern to support. Bloom filters are less useful because pruning based on min/max keys per sorted run can eliminate many runs from being accessed on reads. Excluding implementation artifacts there are few benefits from using compaction in this case. While the sorted runs created by memtable flush will be small, they can be concatenated into larger, logical sorted runs that are range-partitioned although not all implementations do that.
    • Insert-only not in key order - compaction won't reclaim space (there is nothing to reclaim) but it still helps to reduce read-amplification.
STCS vs Prefix

While STCS is an accepted and good name for one approach to tiered compaction, HBase and RocksDB have a similar approach but use different names (RatioBasedCompactionPolicy, Universal). I prefer to call that Prefix compaction for reasons explained below.

While it would help to show examples (via diagrams or text) of the compactions that are done for STCS and Prefix assuming a sequence of memtable flushes, this post is already too long. I will do that in a future post.

Update - after reading more RocksDB code I see that it doesn't limit compactions to a prefix, so substring would be a better name, assuming you visualize the sequence of sorted runs as a string -- each sorted run is an entry in that string.

Properties

Definitions:
  • By compaction picker I mean the algorithm that decides which sorted runs to compact.
  • Major compaction merges all sorted runs, minor compaction does not. 
  • The metadata for a sorted run includes the min and max keys stored in the sorted run and the min and max commit timestamp used by a key in that sorted run.
I would prefer to have a geek code for tiered compaction implementations but I don't have the expertise yet to suggest one. Regardless, below is an incomplete list of things that distinguish the implementations:

  • What triggers major compaction?
    • Many implementations only do major compaction manually (when requested by an admin). The reasons for not doing it automatically include 1) it can take a long time and 2) there might be a transient 2X increase in space. Note that some implementations can merge the largest sorted runs without merging all sorted runs and this can cause the same problems as major compaction.
  • How are sorted runs ordered as viewed by the compaction picker?
    • They can be ordered by commit timestamp or size.
    • Commit timestamp order frequently implies size order, but sometimes explicit deletes, TTL, or compaction replacing a larger/older value with a smaller/newer value mean that older sorted runs are not larger.
  • Does the compaction picker (almost) always consume a prefix of the sorted runs or can it consume a substring?
    • This assumes there is an order imposed on the sorted runs (see the previous question).
  • Can the compaction picker merge sorted runs that aren't adjacent in commit timestamp order?
    • See the two previous questions. If this is done the sorted runs can overlap with respect to commit timestamp ranges. A side-effect of allowing sorted runs to overlap in commit timestamp order might be that merging iterators can't use a short-circuit optimization for point queries -- reading from iterators in commit timestamp order and stopping on the first matching key or tombstone. I don't know if that side-effect is a big deal. 
  • What read-amplification constraints trigger minor compaction?
    • The trigger can be global or local. Global is triggered when there are too many sorted runs. Local is triggered when there are too many sorted runs within a size bucket.
    • These are naive because they assume all sorted runs might have to be read for a query. For workloads where newly written keys are correlated with time, that assumption can be wrong: many sorted runs can be pruned based on their min/max keys, so the number of sorted runs in the LSM tree is an overly pessimistic estimate of the read-amp. Time-series workloads often break the naive assumption, but I am trying not to discuss LSM optimizations for time-series in this post.
  • Is there a minimum number of sorted runs to merge for minor compaction?
  • Is there read-triggered compaction?
    • LevelDB has a feature to trigger compaction for regions of the LSM tree with leveled compaction. Some tiered compaction implementations have a read hotness feature that prioritizes compaction for sorted runs that get more reads.
  • What space-amplification constraints trigger compaction?
    • This assumes there is an estimate for space-amp and there might not be one. If there is an estimate it is usually naive because a less naive estimate is expensive. The naive estimate is sizeof(all-sorted-runs) / sizeof(largest-sorted-run). I think this can be improved.
  • Do large sorted runs imply large filter and index blocks in memory?
    • If index and filter blocks are per sorted run, stored as a contiguous sequence of bytes and cannot be paged then a large sorted run implies there are large index and filter blocks. These are inefficient to cache and to read and decompress from storage. 
    • One alternative is to page them so they can be read from storage and cached in pieces. 
    • Another alternative is to have large logical sorted runs that are range partitioned into N sorted runs. Each of the N sorted runs has its own and smaller index and filter blocks.
  • Is compaction for large sorted runs incremental?
    • By incremental I mean there isn't a transient 2X increase in disk space during compaction and that disk space used by compaction input is reclaimed before a large compaction is complete.
  • Is compaction single-threaded?
    • Leveled compaction is usually single-threaded because each compaction step is limited to a small amount of data, and background threads can then do many small compaction steps concurrently. But a compaction step can be much larger with tiered compaction. Multi-threading can reduce this problem.
    • One approach to multi-threading is to have N threads divide the compaction input into N key ranges and work on them independently.
    • Another approach to multi-threading is to have one thread do the merge, 1+ threads read and decompress compaction input and 1+ threads post-process (collect stats, create bloom filters, compress) compaction output. The separate threads for reading and writing can also do async read-ahead and write-behind.
  • Are there optimizations for inserts in key order?
  • What is the space-amp vs read-amp vs write-amp tradeoff?
    • Implementations have different tradeoffs for read, write and space amplification. I am not aware of an attempt to quantify the differences. My answer is left for a future blog post.
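To make the space-amp question above concrete, here is a minimal sketch of the naive estimate, sizeof(all-sorted-runs) / sizeof(largest-sorted-run). The function name is mine, not from any implementation; it shows why the estimate is naive -- it ignores how much the smaller runs actually overlap the largest one.

```python
def naive_space_amp(run_sizes):
    # run_sizes: size in bytes (or GB) of each sorted run, any order.
    # The naive estimate assumes the largest run holds the live data and
    # everything else is potentially redundant.
    assert run_sizes
    return sum(run_sizes) / max(run_sizes)

# One 100 GB run plus four 1 GB runs -> estimate of 1.04, even though the
# small runs might be 100% redundant with the large run, or 0% redundant.
print(naive_space_amp([100, 1, 1, 1, 1]))
```

A less naive estimate needs per-run key overlap information (for example the HyperLogLog approach mentioned in the Cassandra section below), which is why it is expensive.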

RocksDB

There are plans for improving universal compaction and some details are here.

Answers to the geek code questions:
  • The compaction picker is here and options are here
  • Major compaction is triggered automatically when the space-amp constraint is exceeded. That is set by the option max_size_amplification_percent
  • Sorted runs are ordered by commit timestamp (seqno)
  • The compaction picker usually picks a prefix of the sorted runs. But it skips the first N if they are currently being compacted so it doesn't always pick a prefix. The next sorted run is added to the prefix when sizeof(next run) <= sizeof(selected runs) X size_ratio. When the size_ratio check fails or max_merge_width is reached then the minor compaction input is known. The stop_style option can also be used but I won't describe that here.
  • Compaction input sorted runs are always adjacent in commit timestamp order.
  • The min and max number of sorted runs to merge are set by min_merge_width and max_merge_width
  • The read-amp constraint that triggers minor compaction is set by the option level0_file_num_compaction_trigger which is borrowed from leveled compaction. This is global.
  • There is no read-triggered compaction. While it was inherited from LevelDB it is currently disabled in RocksDB.
  • The space-amp constraint triggers major compaction as mentioned above. The naive estimate is used for space-amp. The estimate is reasonable for overwrite-mostly workloads. I assume the space-amp constraint should be disabled for insert-mostly workloads.
  • Large sorted runs can lead to large index and filter blocks but that is unlikely because the largest sorted runs are logical and range-partitioned with an index and filter per partition (see here). RocksDB also optionally supports partitioned index and filter blocks.
  • Compaction for large sorted runs is not incremental (yet)
  • Large compactions are multi-threaded via subcompactions that split compaction input into N ranges (see here)
  • The allow_trivial_move option can use trivial moves in some cases when inserts are in key order to avoid re-writing sorted runs.
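The prefix-picking rule described above can be sketched as follows. This is my illustrative simplification, not RocksDB code: names are mine, I treat size_ratio as a percentage (as the RocksDB option docs describe), and I omit the skipped-because-compacting runs and the stop_style option.

```python
def pick_prefix(run_sizes, size_ratio_pct=1, min_merge_width=2, max_merge_width=10):
    # run_sizes: sorted runs ordered from newest to oldest commit timestamp.
    # Grow the prefix while the next run is not much larger than the runs
    # already selected: next <= sizeof(selected) * (100 + size_ratio) / 100.
    selected = [run_sizes[0]]
    for size in run_sizes[1:]:
        if len(selected) >= max_merge_width:
            break
        if size <= sum(selected) * (100 + size_ratio_pct) / 100:
            selected.append(size)
        else:
            break
    # Not worth compacting unless min_merge_width runs were selected.
    return selected if len(selected) >= min_merge_width else []

print(pick_prefix([10, 10, 25, 100]))  # the two newest runs qualify
```

Note that with this rule a large old run is only included once the newer runs selected so far are (nearly) as large in total, which is what bounds write-amp for tiered.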
HBase

Answers to the geek code questions:
  • Compaction options are here. Search for hstore.compaction.
  • The compaction policies (ExploringCompactionPolicy, RatioBasedCompactionPolicy) are explained here.
  • The min and max number of sorted runs to merge are set by hbase.hstore.compaction.{min,max}
  • Time-based major compaction can be enabled or disabled. Search for hstore.hregion.majorcompaction
  • RatioBasedCompactionPolicy seems similar to universal compaction in RocksDB but ExploringCompactionPolicy might be closer to STCS. However, I didn't spend enough time to figure them out.
Cassandra

Answers to the geek code questions for STCS:
  • The compaction picker is here and options are here
  • Major compaction is triggered manually
  • Sorted runs are ordered by size and the compaction picker then groups them into size classes, then looks at size classes that have at least min_threshold sorted runs and chooses the one that gets the most reads (read hotness). The bucket_high and bucket_low options determine the range of sorted run sizes that can be in one size class (bucket). More detail is here and here.
  • Sorted runs that are not adjacent in commit timestamp order can be merged
  • The min and max number of sorted runs to merge are determined by the options min_threshold and max_threshold
  • The read-amp constraint is min_threshold and triggers minor compaction when there is a size class with at least that many sorted runs
  • There isn't support for read-triggered minor compaction but read hotness is used to prioritize which size class to compact next.
  • I am not sure whether large index and filter blocks are a problem
  • Compaction is single-threaded. I don't know whether the sharding built into Cassandra makes it easier to avoid large shards and the problems related to large compactions.
  • Does not do incremental compaction
  • I am not aware of optimizations for key-order inserts, similar to the allow_trivial_move feature in RocksDB universal compaction.
  • Cassandra might use HyperLogLog to estimate overlap between sorted runs. This can help estimate whether there would be a benefit from merging sorted runs by predicting how many redundant keys would be removed. But all I have today is a DataStax blog post.
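The size-class grouping described above can be sketched as follows. This is an illustrative simplification of STCS bucketing, not Cassandra code: names are mine, and I omit the special handling of very small sstables and the read-hotness ranking.

```python
def bucket_runs(run_sizes, bucket_low=0.5, bucket_high=1.5, min_threshold=4):
    # Group sorted runs into size classes (buckets). A run joins a bucket
    # when its size is within [bucket_low, bucket_high] times the bucket's
    # average size; otherwise it starts a new bucket.
    buckets = []
    for size in sorted(run_sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if bucket_low * avg <= size <= bucket_high * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    # Only buckets with at least min_threshold runs are compaction candidates.
    return [b for b in buckets if len(b) >= min_threshold]

# Four ~10 GB runs form a candidate bucket; the two ~100 GB runs do not
# reach min_threshold so they are left alone.
print(bucket_runs([10, 11, 9, 10, 100, 110]))
```

Among the candidate buckets, the picker would then choose the one with the highest read hotness, per the description above.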

ScyllaDB

Answers to the geek code questions for STCS (size tiered compaction strategy) with a few answers for ICS (incremental compaction strategy). Most of the answers above for Cassandra are also true here:
  • The compaction picker for STCS is here and options are here
  • Major compaction is triggered manually for STCS. ICS has an option to trigger it based on a SAG (space amplification goal), which they explain is useful for overwrite workloads.
  • The min and max number of sorted runs to merge are determined by the options min_threshold and max_threshold
  • Compaction is single-threaded but ScyllaDB explains how the shard-per-core model reduces the problem from large/slow compactions.
  • STCS does not do incremental compaction but ScyllaDB Enterprise includes ICS
Updates
  • Added a link to the use of HyperLogLog by DataStax Cassandra to estimate sorted run overlap

3 comments:

  1. Interesting post. Out of curiosity, I dug up the answers for SinglestoreDB (S2DB). Some of these are from our older ICDE paper (https://www.singlestore.com/resources/research-paper_columnstore-engine-for-real-time-streaming-analytics/). We're working on a newer paper with a more up to date description of our storage design.

    - Major Compaction: The `OPTIMIZE TABLE FULL` command merges the table to a single sorted run. We never automatically do this otherwise (too expensive).

    - The compaction picker sees sorted runs ordered by their size

    - S2DB can compact a substring of the sorted runs. This can happen in two cases
    1. The longer sorted runs are compacted separately from the shorter ones to allow the shorter sorted runs to be compacted without waiting for the longer ones (we have a compaction thread dedicated to dealing with the shorter sorted runs). The merger responsible for compacting the longer sorted runs will always be compacting a substring rather than a prefix since it skips the shorter sorted runs.
    2. If the number of sorted runs in the selected prefix is larger than the maximum number of sorted runs allowed to be compacted at a time (10), only the suffix of the longer sorted runs within the selected prefix gets compacted.

    - S2DB has no constraint on merging sorted runs not overlapping in commit timestamp order. S2DB tracks deletes using a bit vector in the metadata rather than using tombstones, so it’s not necessary to use a merging iterator to find the latest versions of the rows.

    - S2DB triggers a minor compaction when the sum of the sizes of a prefix of smaller sorted runs grows larger than 1/8 of the next sorted run. This is similar to a "local" trigger, since it only considers sorted runs with similar sizes.

    - There isn't a minimum number of sorted runs to trigger compaction. There is no read triggered compaction (though we should likely add this at some point).

    - S2DB tracks deletes using a bit vector in metadata, so the space amplification can be estimated from the metadata as #non-deleted-rows / #all-rows. Compaction is triggered when more than 1/8 of the rows are deleted in a segment.

    - All on-disk data can be read incrementally. Each sorted run is partitioned into segments so that it can be read incrementally. The index structures are split into two parts - the local index structures stored with each segment, and the global index structures stored in a separate LSM tree (decoupled from the sorted runs). The global index structures are paged and cached in pieces.

    - Compaction is incremental by writing out one merged segment (up to 1M rows) at a time. Disk space from a segment can be reclaimed as soon as all of the rows in the segment are merged, without waiting for the entire sorted run to finish merging.

    - Compaction is single-threaded per shard/partition (so we do get parallelism due to sharding). Due to the need to keep compaction transparent we only use 2 threads per host today unless the compaction is manually triggered (in which case we use 1 thread per shard).

    - The sorted runs are computed dynamically, so that new data inserted in key order can be appended to existing large sorted runs without merging.

    - Without going into too much detail, we limit the amount of the compaction work (by using only two threads), so that we do something close to leveled compaction (less point read amp, more write amp) on low write rates, but fall back to tiered compaction (more point read amp, less write amp) on higher write rates.

    1. Thank you for the summary and for a link to the paper. I wasn't aware of it and will read it soon.

