Thursday, January 27, 2022

RocksDB internals: intra-L0 compaction

There are some details on intra-L0 compaction on the public RocksDB wiki. This post has more details.

The purpose for intra-L0 is to reduce read-amplification when compaction is behind by reducing the number of SSTs in the L0. Intra-L0 compaction merges smaller L0 SSTs into a larger SST that remains in the L0. It can be done when L0->L1 compaction is needed but blocked. From the discussion of DBPCF below I suspect that intra-L0 increases write-amplification by <= 1 as SSTs in the L0 are likely to be subject to intra-L0 at most once.

  • Needed means the L0 compaction score is >= 1 either because there are too many SSTs in L0 or the L0 SSTs are too large. The compaction score is computed here.
  • Blocked means that L0->L1 wasn't selected because compaction input (from L0) or output (from L1) SSTs are already being compacted.

There is no option to disable intra-L0. For write-heavy workloads I am trying to quantify whether there is a benefit. For write-heavy the cost is more CPU and IO from compaction. While there is less or no need to reduce read-amp for write-heavy workloads, intra-L0 can still help by reducing write stalls by reducing the number of SSTs in L0. But I need to repeat benchmarks with intra-L0 disabled (by editing code and compiling a custom db_bench) to better understand this. 

Disclaimer

I have a deep understanding of LSM performance in general but not enough understanding of RocksDB internals. That is a risky combination that I am fixing by reading source code. I will share what I learn via blog posts. My approach here is to document as I read and the result might be a lot of detail without a high-level overview.

Code

PickIntraL0Compaction is called by SetupInitialFiles when the L0 compaction score is >= 1 but an L0->L1 compaction was not selected. Both are methods in class LevelCompactionBuilder. PickIntraL0Compaction calls FindIntraL0Compaction if there are least 2 SSTs in L0 and the first isn’t being compacted. When FindIntraL0Compaction is called the arg min_files_to_compact has the value kMinFilesforIntraL0Compaction (=4) and the arg max_compaction_bytes has the value from the max_compaction_bytes option. The default for the max_compaction_bytes option is 0 and when 0 it is set to 25 * target_file_size_base. Note that with intra-L0, SSTs in the L0 can be as large as max_compaction_bytes (ignoring compression).

FindIntraL0Compaction tries to find SSTs in L0 for intra-L0 compaction. It starts by iterating the list of L0 SSTs from newest to oldest. If one is found that is being compacted, then it returns immediately. Otherwise it stops on the first SST with an old enough largest_seqno and this is the first of the SSTs to include in the intra-L0 compaction. It then adds successive SSTs as long as:

  • the SST is not being compacted and
  • Current CBPDF is less than previous CBPDF where CBPDF is Compacted Bytes Per Deleted File and computed as TB / NF, TB (Total Bytes) is the sum of the sizes of SST files and NF is the number of SST files less one. If the SST file sizes were [5, 5, 5, x] then the CBPDF for [5, 5, 5] is 7.5 (15/2) and the CBPDF for [5, 5, 5, x] is (15+x)/3. The loop stops and doesn't include the SST of size x when the CBPDF is greater than 7.5 and (15+x)/3 is > 7.5 when x > 7.5. An intra-L0 compaction would not be picked if only 3 SSTs were selected because 3 is less than min_files_to_compact
  • the sum of the sizes of the SSTs is <= max_compaction_bytes

When it stops an intra-L0 compaction will be done if the number of SSTs selected is >= min_files_to_compact (kMinFilesForIntraL0Compaction).








No comments:

Post a Comment

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...