Wednesday, January 26, 2022

RocksDB internals: bytes pending compaction

I have a deep understanding of LSM performance in general but not enough understanding of RocksDB internals. That is a risky combination that I am fixing by reading source code. I will share what I learn via blog posts. My approach here is to document as I read and the result might be a lot of detail without a high-level overview.

This post is about soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit. Their comments in the header are brief (I expect to send a diff to improve them), so reading code is required. Before reading the code I misunderstood how bytes pending compaction was computed. Below I use BPC for bytes pending compaction. EstimateCompactionBytesNeeded is the function that computes BPC, and I assume it is called each time a compaction or memtable flush finishes.

While this and a few other leveled compaction features have confused me, they also make leveled more adaptive to bursts of write-heavy activity and I have been promoting the benefits of adaptive LSM tree tuning for a long time. So perhaps I should embrace and document the complexity.

The soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit options bound how large bytes pending compaction can get before writes are slowed or stalled. A write slowdown occurs when BPC is >= the soft limit and < the hard limit. The duration of a write slowdown is usually 1 millisecond although it can be longer. A write stall occurs when BPC is >= the hard limit, and by write stall I mean that the write is blocked until BPC becomes < the hard limit. BPC is computed by EstimateCompactionBytesNeeded as the sum of the bytes pending compaction across all levels of the LSM tree. Pseudo-code for this is:

For L0:
  if num_files(L0) >= level0_file_num_compaction_trigger
      or size(L0) >= target_size(L1):
    increment BPC by size(L0) + size(L1)

For Ln, n > 0:
  if size(Ln) > target_size(Ln):
    increment BPC by [ size(Ln) - target_size(Ln) ] *
                    [ (size(Ln+1) / size(Ln)) + 1 ]
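
To make the arithmetic concrete, here is a minimal standalone sketch that follows the pseudo-code above. It is not RocksDB code -- the struct, the function name and the level sizes in main are made up for illustration, and it ignores details like the bytes carried forward between levels.

// Standalone sketch of the BPC estimate, following the pseudo-code above.
// Not RocksDB code; the names and numbers are made up.
#include <cstdint>
#include <cstdio>
#include <vector>

struct LsmState {
  int l0_files;                       // number of SSTs in L0
  int l0_trigger;                     // like level0_file_num_compaction_trigger
  std::vector<uint64_t> level_size;   // level_size[n] = bytes in Ln
  std::vector<uint64_t> target_size;  // target_size[n] = target for Ln, n >= 1
};

uint64_t EstimateBytesPendingCompaction(const LsmState& s) {
  uint64_t bpc = 0;
  // L0: once L0 qualifies for compaction into L1, all of L0 plus all of L1
  // counts as pending.
  if (s.l0_files >= s.l0_trigger || s.level_size[0] >= s.target_size[1]) {
    bpc += s.level_size[0] + s.level_size[1];
  }
  // Ln, n > 0: the excess over the target counts, scaled by an estimate of
  // the write-amp for compacting that excess into Ln+1.
  for (size_t n = 1; n + 1 < s.level_size.size(); n++) {
    if (s.level_size[n] > s.target_size[n]) {
      uint64_t excess = s.level_size[n] - s.target_size[n];
      double fanout =
          static_cast<double>(s.level_size[n + 1]) / s.level_size[n];
      bpc += static_cast<uint64_t>(excess * (fanout + 1));
    }
  }
  return bpc;
}

int main() {
  const uint64_t GB = 1ull << 30;
  // 4 SSTs and 1G in L0, L1 is 1G over its 1G target, L2 is at its 10G target.
  LsmState s = {4, 4, {1 * GB, 2 * GB, 10 * GB}, {0, 1 * GB, 10 * GB}};
  printf("BPC = %.1f GB\n",
         static_cast<double>(EstimateBytesPendingCompaction(s)) / GB);
  return 0;
}

For that made-up state the estimate is 3G from L0 (all of L0 plus all of L1) plus 6G from L1 (1G of excess times a fanout of 5, plus 1), or 9G in total.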

Prior to reading the code, I assumed that BPC was incremented on each write (or each memtable flush) by something like sizeof(write) * write-amp and then decremented by compaction. Alas, while it is easy to estimate write-amp, it is unlikely that the per-compaction decrements would match the per-write increments, so BPC would quickly get out of sync. So I get why the current approach is used.

Challenges

The current approach has challenges:

  • What are good values for soft and hard_pending_compaction_bytes_limit?
    • The defaults are 64G and 256G.
  • The value of BPC changes suddenly.
    • If L0->L1 compaction is triggered at 4 files in L0, then L0 contributes nothing to BPC until it has 4 SSTs. At that point it increments BPC by size(L0) + size(L1). A plot of BPC over time will have many discontinuous changes (see the sketch after this list).
  • All of the debt can be spent at one level
    • Long ago, RocksDB constrained each level to be no more than X times larger than its target size. Whether or not this worked well in production, it made it easier to reason about the shape of the LSM tree. In papers, all leveled compaction LSM trees look like pyramids and Ln+1 always has more data than Ln. With reasonable settings for per-level fanout and the size constraints this was also true in practice.
    • The enforcement today is global and all of the BPC debt can be spent at one level. If the soft and hard limits are 64G and 256G then L1 can have ~64G of compaction debt before the soft limit starts to delay writes. If the per-level fanout were 8 then size(L1) could be ~8G larger than target_size(L1). I don't know whether this is a problem beyond being a source of confusion.
  • There is monitoring debt.
    • It would help to have details on stalls, per-level target sizes, the total BPC and the per-level contribution to BPC -- written to LOG and/or summarized in compaction IO statistics.
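
To illustrate the discontinuity mentioned in the list above, here is another small sketch (again not RocksDB code) that assumes a 4-file L0 compaction trigger, 64M memtable flushes and 1G in L1. The L0 contribution to BPC is zero for the first three flushes and then jumps on the fourth.

// Sketch of the discontinuous L0 contribution to BPC. Not RocksDB code;
// the trigger, flush size and L1 size are made-up values.
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t MB = 1ull << 20;
  const uint64_t flush_size = 64 * MB;  // size of each flushed memtable
  const uint64_t l1_size = 1024 * MB;   // bytes in L1
  const int l0_trigger = 4;             // like level0_file_num_compaction_trigger
  for (int l0_files = 1; l0_files <= l0_trigger; l0_files++) {
    uint64_t l0_size = l0_files * flush_size;
    // L0 contributes nothing until it qualifies for compaction, then it
    // contributes all of L0 plus all of L1 at once.
    uint64_t contribution =
        (l0_files >= l0_trigger) ? l0_size + l1_size : 0;
    printf("L0 files=%d  L0 contribution=%4llu MB\n", l0_files,
           static_cast<unsigned long long>(contribution / MB));
  }
  return 0;
}

The contribution goes 0, 0, 0 and then 1280 MB, which is the kind of jump that makes a plot of BPC over time look discontinuous.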

Summary

The current approach allows the LSM tree to get very much out of shape. I don't know whether that is a problem, beyond my confusion about this feature, which might have led me to use bad values for the soft and hard limits while running benchmarks. Regardless, I need to enhance the comments for those options to help other users.

I am still figuring out a good way to set values for the hard and soft limits, but my suggestion for now is to use the defaults.
