Monday, January 31, 2022

RocksDB internals: write rate limiter

The RocksDB wiki has a page on the rate limiter. This blog post is focused on rate limits for writes to manage compaction debt. WriteController and WriteControllerToken (source, header) are used to throttle write rates when compaction gets behind.

WriteControllerToken is extended by StopWriteToken, DelayWriteToken and CompactionPressureToken.

  • StopWriteToken is used when writes must be stopped until compaction catches up
  • DelayWriteToken is used when writes must be delayed (slowed) to <= the value of the delayed_write_rate option. This allows compaction debt to be reduced at a faster rate.
  • CompactionPressureToken is used when the number of background compaction threads should be increased.
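
The tokens follow an RAII pattern: each Get*Token call returns a std::unique_ptr and the token's destructor removes the throttling it requested when the token is released. Here is a minimal sketch of that pattern with simplified names; it is not the actual RocksDB declaration:

#include <atomic>
#include <memory>

// Minimal sketch of the token pattern: each Get*Token call increments a
// counter and the token's destructor decrements it, so a token going out
// of scope ends the throttling it requested. Simplified, not RocksDB code.
class WriteControllerToken {
 public:
  explicit WriteControllerToken(std::atomic<int>* count) : count_(count) {
    count_->fetch_add(1);
  }
  ~WriteControllerToken() { count_->fetch_sub(1); }

 private:
  std::atomic<int>* count_;
};

class WriteController {
 public:
  std::unique_ptr<WriteControllerToken> GetStopToken() {
    return std::make_unique<WriteControllerToken>(&total_stopped_);
  }
  std::unique_ptr<WriteControllerToken> GetDelayToken() {
    return std::make_unique<WriteControllerToken>(&total_delayed_);
  }
  bool IsStopped() const { return total_stopped_.load() > 0; }
  bool NeedsDelay() const { return total_delayed_.load() > 0; }

 private:
  std::atomic<int> total_stopped_{0};
  std::atomic<int> total_delayed_{0};
};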

Important methods for class WriteController are:

  • Get{Stop, Delay, CompactionPressure}Token. There is a count of the number of each type of token that currently exists. When GetDelayToken is called, the caller provides a value for the desired delayed write rate.
  • IsStopped - true if stop tokens exist
  • NeedsDelay - true if delay tokens exist
  • NeedSpeedupCompaction - true if IsStopped() or NeedsDelay() return true or compaction pressure tokens exist. It is interesting that the counts for each token type all have type std::atomic<int> but different techniques are used to read them -- see "total_compaction_pressure_ > 0", total_stopped_.load(std::memory_order_relaxed) and total_delayed_.load().
  • GetDelay - see below

GetDelay

WriteController::GetDelay is interesting and is the method that shapes the write rate. It is called with the number of bytes that will be written and returns the number of microseconds (>=0) that the write must be delayed. The caller must impose the delay. The minimum non-zero delay is 1 millisecond.

GetDelay manages the write rate over 1 millisecond intervals (kMicrosPerRefill=1000). When the delayed_write_rate is X bytes/second, the target rate over the next millisecond is X / 1000 and that is the value for credit_in_bytes_. Each write then decrements credit_in_bytes_ by the number of bytes written until credit_in_bytes_ is less than 0 or the next time interval is reached (>= 1 millisecond in the future).
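
To make the credit arithmetic concrete, here is a sketch of the pacing as I understand it; DelayState and GetDelaySketch are hypothetical simplifications and the real GetDelay also handles rate changes and clock edge cases:

#include <cstdint>

const uint64_t kMicrosPerRefill = 1000;  // 1 millisecond intervals

// Hypothetical simplification of credit-based pacing, not the real code.
struct DelayState {
  int64_t credit_in_bytes = 0;    // bytes still allowed in this interval
  uint64_t next_refill_time = 0;  // microseconds; start of next interval
};

// Returns microseconds the caller must wait before writing num_bytes.
uint64_t GetDelaySketch(DelayState* s, uint64_t delayed_write_rate,
                        uint64_t num_bytes, uint64_t time_now) {
  if (s->credit_in_bytes >= static_cast<int64_t>(num_bytes)) {
    // Still within this interval's budget: no delay.
    s->credit_in_bytes -= static_cast<int64_t>(num_bytes);
    return 0;
  }
  // Out of credit: wait until the next interval starts, then grant a fresh
  // budget of delayed_write_rate / 1000 bytes (the per-millisecond target).
  uint64_t wake_time =
      s->next_refill_time > time_now ? s->next_refill_time : time_now;
  s->next_refill_time = wake_time + kMicrosPerRefill;
  s->credit_in_bytes = static_cast<int64_t>(delayed_write_rate / 1000) -
                       static_cast<int64_t>(num_bytes);
  return wake_time - time_now;
}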

I am curious about:

  • What happens when there is a large duration between consecutive calls to GetDelay and time_now is >> next_refill_time_? In this code elapsed can be large because it includes (time_now - next_refill_time_) and elapsed is used to compute credit_in_bytes_. So I wonder if that can lead to the next interval having an unusually large value for credit_in_bytes_. Although in the worst case that just means the next millisecond interval has no effective write-rate limit.

Applying Delays

DBImpl::PreprocessWrite calls DelayWrite. DBImpl::DelayWrite calls GetDelay and then enforces the delay.

For write slowdowns:

  • The delay should be no more than a few milliseconds with "normal" sized writes and usually a thread just has to wait until the next millisecond interval.
  • The delay is implemented as a sequence of calls to SleepForMicroseconds(kDelayInterval) where kDelayInterval = 1001.

For write stalls:

  • There is a loop that calls bg_cv_.Wait() while the IsStopped() check returns true; bg_cv_ is a condition variable, as the name suggests.
  • Nothing is logged here. I am curious if long waits for write stalls result in anything getting written to LOG. That might help with debugging. I will soon figure this out.

For both write stalls and slowdowns the time delayed is added to the STALL_MICROS timer.
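
Here is a sketch of how the two cases might be enforced; EnforceWriteThrottle and is_stopped are hypothetical names, and DBImpl::DelayWrite does more bookkeeping (stats, error checks) than shown:

#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>

const uint64_t kDelayInterval = 1001;  // microseconds per sleep slice

// Hypothetical sketch of slowdown + stall enforcement, loosely modeled on
// DBImpl::DelayWrite; not the actual RocksDB code.
void EnforceWriteThrottle(uint64_t delay_micros, std::mutex& mu,
                          std::condition_variable& bg_cv,
                          const std::function<bool()>& is_stopped) {
  // Slowdown: sleep in kDelayInterval slices until the delay is served.
  for (uint64_t slept = 0; slept < delay_micros; slept += kDelayInterval) {
    std::this_thread::sleep_for(std::chrono::microseconds(kDelayInterval));
  }
  // Stall: block on the condition variable until a background thread
  // signals and the stop condition has cleared.
  std::unique_lock<std::mutex> lock(mu);
  bg_cv.wait(lock, [&] { return !is_stopped(); });
}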

Friday, January 28, 2022

RocksDB internals: compaction stall counters

The compaction IO statistics in RocksDB include two lines with information on write throttling. Writes are throttled to slow the ingest (new writes) so that compaction (outgest) can keep up. This is a feedback mechanism, and it is needed with an LSM because an LSM lacks the natural feedback that a b-tree has when all pages in the buffer pool are dirty. The first line has counters for the causes of write slowdowns and write stalls. The second line has the total time and percentage of time (relative to uptime) that write stalls have been in effect for the RocksDB instance.

Stalls(count): 713 level0_slowdown, 706 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 1160 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 351 total count

Cumulative stall: 00:03:31.992 H:M:S, 65.1 percent

A write stall is when RocksDB blocks writes until compaction debt has been reduced. A write slowdown is when RocksDB delays writes so that the write rate is <= a target rate to give compaction a chance to catch up and avoid write stalls. The causes for write stalls and slowdowns are listed here although some enums in there are for things other than write stalls and slowdowns.

Detecting write stalls

From the sample compaction IO stats output listed above, the first line (stall counters) is printed by this code and the second line (stall timers) by this code. The reason for a stall is computed by GetWriteStallConditionAndCause and that result is processed, including incrementing stall counters, in RecalculateWriteStallConditions. Note that GetWriteStallConditionAndCause makes checks in a fixed sequence. While multiple conditions might be true at any point in time, such as too many L0 files and too many bytes pending compaction, only the first one in that sequence is returned as the reason for the stall.

The stall timer is incremented in DBImpl::DelayWrite, although I am not sure whether it is called for both write stalls and write slowdowns.

The following explains the names used in the stall counter example text above. They are listed in the order in which they are checked by GetWriteStallConditionAndCause:

  • The first 4 are write stalls in which case write_controller_token_ is assigned the return value from GetStopToken.
    • memtable_compaction - true when the number of unflushed memtables is >= max_write_buffer_number. A message is written to LOG with the text "Stopping writes because we have %d immutable memtables ..."
    • level0_numfiles - true when the number of L0 files is >= level0_stop_writes_trigger. A message is written to LOG with the text "Stopping writes because we have %d level-0 files ..."
    • level0_numfiles_with_compaction - true when level0_numfiles is true and L0 compaction is in progress. Thus this counter is <= the counter for level0_numfiles. The LOG message is explained in the previous point.
    • stop for pending_compaction_bytes - true when the number of pending compaction bytes is >= hard_pending_compaction_bytes_limit. A message is written to LOG with the text "Stopping writes because of estimated pending compaction bytes ...".
  • The remaining are write slowdowns in which case write_controller_token_ is assigned the return value from SetupDelay. The delayed write rate that is computed by SetupDelay is printed in the LOG messages mentioned below.
    • memtable_slowdown - true when max_write_buffer_number > 3 and the number of unflushed memtables is >= both (max_write_buffer_number - 1) and (min_write_buffer_number_to_merge - 1). A message is written to LOG with the text "Stalling writes because we have %d immutable memtables (waiting for flush) ..."
    • level0_slowdown - true when the number of L0 files is >= level0_slowdown_writes_trigger. A message is written to LOG with the text "Stalling writes because we have %d level-0 files rate ..."
    • level0_slowdown_with_compaction - true when level0_slowdown is true and L0 compaction is in progress. Thus this counter is <= the counter for level0_slowdown. See the previous point for the LOG message.
    • slowdown for pending_compaction_bytes - true when the number of bytes pending compaction is >= soft_pending_compaction_bytes_limit. A message is written to LOG with the text "Stalling writes because of estimated pending compaction bytes ..."
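
Because the checks run in a fixed sequence, the reported cause is the first match even when several conditions hold. Here is a compressed sketch of that ordering; GetStallSketch is a hypothetical stand-in that flattens the column-family state into plain arguments, and the *_with_compaction variants, the LOG writes and the min_write_buffer_number_to_merge part of the memtable check are omitted:

#include <cstdint>
#include <utility>

// Sketch of the fixed check order in GetWriteStallConditionAndCause;
// simplified, not the real signature or the complete set of conditions.
enum class Condition { kStopped, kDelayed, kNormal };
enum class Cause { kMemtableLimit, kL0FileCountLimit,
                   kPendingCompactionBytes, kNone };

std::pair<Condition, Cause> GetStallSketch(
    int num_unflushed_memtables, int num_l0_files, uint64_t pending_bytes,
    int max_write_buffer_number, int l0_stop_trigger, int l0_slowdown_trigger,
    uint64_t hard_limit, uint64_t soft_limit) {
  // Stop conditions are checked first, slowdowns second; the first match
  // wins even when several conditions hold at once.
  if (num_unflushed_memtables >= max_write_buffer_number)
    return {Condition::kStopped, Cause::kMemtableLimit};
  if (num_l0_files >= l0_stop_trigger)
    return {Condition::kStopped, Cause::kL0FileCountLimit};
  if (hard_limit > 0 && pending_bytes >= hard_limit)
    return {Condition::kStopped, Cause::kPendingCompactionBytes};
  if (max_write_buffer_number > 3 &&
      num_unflushed_memtables >= max_write_buffer_number - 1)
    return {Condition::kDelayed, Cause::kMemtableLimit};
  if (num_l0_files >= l0_slowdown_trigger)
    return {Condition::kDelayed, Cause::kL0FileCountLimit};
  if (soft_limit > 0 && pending_bytes >= soft_limit)
    return {Condition::kDelayed, Cause::kPendingCompactionBytes};
  return {Condition::kNormal, Cause::kNone};
}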

When SetupDelay is called, the penalize_stop argument gets the value of write_controller->IsStopped() for the memtable_slowdown case. For level0_slowdown it is also true when the number of L0 files is within 2 of level0_stop_writes_trigger. For the pending_compaction_bytes slowdown it is also true if the number of bytes pending compaction is more than 3/4 of the way between the soft and hard limits.

If GetWriteStallConditionAndCause returned WriteStallCondition::kNormal then

  • write_controller_token_ = GetCompactionPressureToken() if either of these are true
    • vstorage->l0_delay_trigger_count() >= GetL0ThresholdSpeedupCompaction(). The l0_delay_trigger_count is the number of files in the L0. GetL0Threshold...() returns the smaller of 2*level0_file_num_compaction_trigger and X where X is a value 1/4 of the way from the L0 compaction trigger to the L0 slowdown trigger. A message is written to LOG with the text "Increasing compaction threads because we have %d level-0 files"
    • vstorage->estimated_compaction_needed_bytes() >= soft_pending_compaction_bytes_limit/4. A message is written to LOG with the text "Increasing compaction threads because of estimated pending compaction bytes"
  • otherwise write_controller_token_.reset() is called. Nothing is written to LOG.
  • if write_controller->NeedsDelay() was true the delayed write rate is set to kDelayRecoverSlowdownRatio (=1.4) times the configured delayed_write_rate and the low-pri delayed write rate is set to 1/4 that. Nothing is written to LOG.

I will save the write controller for a future note and will end with a description of SetupDelay. This function determines the value for the applied delayed write rate which is <= the value of the delayed_write_rate option.

  • if write_controller->NeedsDelay() is not true then the applied delayed write rate is not changed. I am curious why SetupDelay() would be called in that case.
  • else if penalize_stop is true (see above) then the applied delayed write rate is multiplied by kNearStopSlowdownRatio (=0.6) which reduces it.
  • else if bytes pending compaction is increasing then the applied delayed write rate is multiplied by kIncSlowdownRatio (=0.8) which reduces it.
  • else if bytes pending compaction is decreasing then the applied delayed write rate is multiplied by kDecSlowdownRatio (=1/kIncSlowdownRatio) which increases it.
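
In code form, the adjustments above might look like this sketch; AdjustDelayedWriteRate is a hypothetical stand-in using the ratios named in this post, and the real SetupDelay also creates the delay token:

#include <cstdint>

const double kNearStopSlowdownRatio = 0.6;
const double kIncSlowdownRatio = 0.8;
const double kDecSlowdownRatio = 1.0 / kIncSlowdownRatio;

// Sketch of the rate adjustment described above, not the real SetupDelay.
uint64_t AdjustDelayedWriteRate(uint64_t current_rate, bool penalize_stop,
                                uint64_t pending_bytes,
                                uint64_t prev_pending_bytes) {
  if (penalize_stop) {
    // Close to a stop condition: cut the rate harder to avoid a full stall.
    return static_cast<uint64_t>(current_rate * kNearStopSlowdownRatio);
  }
  if (pending_bytes > prev_pending_bytes) {
    // Compaction debt is growing: slow writes further.
    return static_cast<uint64_t>(current_rate * kIncSlowdownRatio);
  }
  if (pending_bytes < prev_pending_bytes) {
    // Compaction debt is shrinking: let the write rate recover.
    return static_cast<uint64_t>(current_rate * kDecSlowdownRatio);
  }
  return current_rate;  // debt unchanged: keep the current rate
}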

There is a reference count on the number of current CompactionPressureToken instances and when this count is non-zero the NeedSpeedupCompaction() function can return true.

Monitoring

I listed the relevant LOG messages above related to rate-limiting writes. I have two questions:

  • Will it be easier to grep for these from LOG if they all contain a common string unique to this feature?
  • Are all of the transitions covered? Note that in some cases described above nothing is written to LOG.

Thursday, January 27, 2022

RocksDB internals: intra-L0 compaction

There are some details on intra-L0 compaction on the public RocksDB wiki. This post has more details.

The purpose for intra-L0 is to reduce read-amplification when compaction is behind by reducing the number of SSTs in the L0. Intra-L0 compaction merges smaller L0 SSTs into a larger SST that remains in the L0. It can be done when L0->L1 compaction is needed but blocked. From the discussion of CBPDF below I suspect that intra-L0 increases write-amplification by <= 1 as SSTs in the L0 are likely to be subject to intra-L0 at most once.

  • Needed means the L0 compaction score is >= 1 either because there are too many SSTs in L0 or the L0 SSTs are too large. The compaction score is computed here.
  • Blocked means that L0->L1 wasn't selected because compaction input (from L0) or output (from L1) SSTs are already being compacted.

There is no option to disable intra-L0. For write-heavy workloads I am trying to quantify whether there is a benefit. The cost is more CPU and IO from compaction. While there is less or no need to reduce read-amp for write-heavy workloads, intra-L0 can still help by reducing the number of SSTs in L0 and thus reducing write stalls. But I need to repeat benchmarks with intra-L0 disabled (by editing code and compiling a custom db_bench) to better understand this.

Disclaimer

I have a deep understanding of LSM performance in general but not enough understanding of RocksDB internals. That is a risky combination that I am fixing by reading source code. I will share what I learn via blog posts. My approach here is to document as I read and the result might be a lot of detail without a high-level overview.

Code

PickIntraL0Compaction is called by SetupInitialFiles when the L0 compaction score is >= 1 but an L0->L1 compaction was not selected. Both are methods in class LevelCompactionBuilder. PickIntraL0Compaction calls FindIntraL0Compaction if there are at least 2 SSTs in L0 and the first isn’t being compacted. When FindIntraL0Compaction is called the arg min_files_to_compact has the value kMinFilesForIntraL0Compaction (=4) and the arg max_compaction_bytes has the value from the max_compaction_bytes option. The default for the max_compaction_bytes option is 0 and when 0 it is set to 25 * target_file_size_base. Note that with intra-L0, SSTs in the L0 can be as large as max_compaction_bytes (ignoring compression).

FindIntraL0Compaction tries to find SSTs in L0 for intra-L0 compaction. It starts by iterating the list of L0 SSTs from newest to oldest. If one is found that is being compacted, then it returns immediately. Otherwise it stops on the first SST with an old enough largest_seqno and this is the first of the SSTs to include in the intra-L0 compaction. It then adds successive SSTs as long as:

  • the SST is not being compacted and
  • the current CBPDF is less than the previous CBPDF, where CBPDF is Compacted Bytes Per Deleted File and is computed as TB / NF: TB (Total Bytes) is the sum of the sizes of the SST files and NF is the number of SST files less one. If the SST file sizes were [5, 5, 5, x] then the CBPDF for [5, 5, 5] is 7.5 (15/2) and the CBPDF for [5, 5, 5, x] is (15+x)/3. The loop stops before the SST of size x when (15+x)/3 > 7.5, which happens when x > 7.5. In that case an intra-L0 compaction would not be picked because only 3 SSTs were selected and 3 is less than min_files_to_compact.
  • the sum of the sizes of the SSTs is <= max_compaction_bytes

When it stops an intra-L0 compaction will be done if the number of SSTs selected is >= min_files_to_compact (kMinFilesForIntraL0Compaction).
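
Here is a sketch of that selection rule; PickIntraL0Sketch is a hypothetical simplification that skips the largest_seqno and being_compacted checks:

#include <cstdint>
#include <vector>

// Sketch of the CBPDF rule described above, not the real
// FindIntraL0Compaction. Like the real code it uses integer division.
// Returns the number of newest L0 SSTs to compact, or 0 if too few qualify.
size_t PickIntraL0Sketch(const std::vector<uint64_t>& l0_sizes,
                         size_t min_files_to_compact,  // 4 in RocksDB
                         uint64_t max_compaction_bytes) {
  if (l0_sizes.empty()) return 0;
  uint64_t compact_bytes = l0_sizes[0];
  uint64_t cbpdf = UINT64_MAX;  // compacted bytes per deleted file so far
  size_t limit = 1;
  for (; limit < l0_sizes.size(); limit++) {
    compact_bytes += l0_sizes[limit];
    // TB / NF where NF is the number of files less one.
    uint64_t new_cbpdf = compact_bytes / limit;
    // Stop when adding this SST makes the ratio worse or the total too big.
    if (new_cbpdf > cbpdf || compact_bytes > max_compaction_bytes) break;
    cbpdf = new_cbpdf;
  }
  return limit >= min_files_to_compact ? limit : 0;
}

For the [5, 5, 5, 10] example above this returns 0: the fourth SST makes the CBPDF worse, the loop stops with 3 SSTs selected and 3 is less than min_files_to_compact.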

Wednesday, January 26, 2022

RocksDB internals: bytes pending compaction

I have a deep understanding of LSM performance in general but not enough understanding of RocksDB internals. That is a risky combination that I am fixing by reading source code. I will share what I learn via blog posts. My approach here is to document as I read and the result might be a lot of detail without a high-level overview.

This post is about soft and hard_pending_compaction_bytes_limit. Their comments in the header are brief (I expect to send a diff to improve them), so reading code is required. Before reading code I misunderstood how bytes pending compaction was computed. Below I use BPC for bytes pending compaction. EstimateCompactionBytesNeeded is the function that computes BPC. I assume this is called each time compaction or memtable flush finishes.

While this and a few other leveled compaction features have confused me, they also make leveled more adaptive to bursts of write-heavy activity and I have been promoting the benefits of adaptive LSM tree tuning for a long time. So perhaps I should embrace and document the complexity.

The soft and hard_pending_compaction_bytes_limit options set bounds on how large bytes pending compaction can get before writes are slowed or stalled. A write slowdown occurs when BPC is >= the soft limit and < the hard limit. The duration for a write slowdown is usually 1 millisecond although it can be longer. A write stall occurs when BPC is >= the hard limit. And by write stall I mean that the write is blocked until BPC becomes < the hard limit. The value for BPC is computed by EstimateCompactionBytesNeeded. The value is the sum of the bytes pending compaction across all levels in the LSM tree. Pseudo-code for this is:

For L0:
  if num_files(L0) >= level0_file_num_compaction_trigger or size(L0) > target_size(L1):
    increment BPC by size(L0) + size(L1)

For Ln, n > 0:
  if size(Ln) > target_size(Ln):
    increment BPC by [ size(Ln) - target_size(Ln) ] * [ (size(Ln+1) / size(Ln)) + 1 ]
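
Here is the same computation as a sketch in C++ with made-up inputs; EstimateBpcSketch is a hypothetical simplification, while the real EstimateCompactionBytesNeeded reads this state from VersionStorageInfo:

#include <cstdint>
#include <vector>

// Sketch of the pseudo-code above, not the real function.
// size[i] and target[i] describe level i+1 (so size[0] is size(L1)).
uint64_t EstimateBpcSketch(bool l0_compaction_needed, uint64_t l0_size,
                           const std::vector<uint64_t>& size,
                           const std::vector<uint64_t>& target) {
  uint64_t bpc = 0;
  if (l0_compaction_needed) {
    // An L0->L1 compaction rewrites all of L0 plus all of L1.
    bpc += l0_size + size[0];
  }
  // For Ln (n > 0) the excess over the target is rewritten along with the
  // overlapping data in Ln+1, approximated by the size ratio of the levels.
  for (size_t i = 0; i + 1 < size.size(); i++) {
    if (size[i] > target[i]) {
      uint64_t excess = size[i] - target[i];
      double ratio = static_cast<double>(size[i + 1]) / size[i];
      bpc += static_cast<uint64_t>(excess * (ratio + 1));
    }
  }
  return bpc;
}

For example, if size(L1) = 12G against a 10G target and size(L2) = 80G, the L1 term is (12G - 10G) * (80/12 + 1), which is roughly 15.3G.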

Prior to reading the code I assumed that BPC was incremented on each write (or each memtable flush) by something like sizeof(write) * write-amp and then decremented by compaction. Alas, while it is easy to estimate write-amp, it is unlikely that the per-compaction decrements would match the per-write increments and BPC would quickly get out of sync. So I get why the current approach is used.

Challenges

The current approach has challenges:

  • What are good values for soft and hard_pending_compaction_bytes_limit?
    • The defaults are 64G and 256G.
  • The value of BPC changes suddenly.
    • When L0->L1 compaction is triggered at 4 files in L0, then it contributes nothing to bytes pending compaction until the number of SSTs in L0 reaches 4. At that point it increments bytes pending compaction by size(L0) + size(L1). If you plot BPC over time there will be many discontinuous changes.
  • All of the debt can be spent at one level
    • Long ago RocksDB used to constrain each level to be no more than X times larger than its target size. Whether or not this worked well in production, it made it easier to reason about the shape of the LSM tree. In papers all leveled compaction LSM trees look like pyramids and Ln+1 always has more data than Ln. With reasonable settings for per-level fanout and the size constraints this was also true in practice.
    • The enforcement today is global and all of the BPC debt can be spent at one level. If the soft and hard limits are 64G and 256G then the L1 can have ~64G of compaction debt before the soft limit starts to delay writes. If the per-level fanout were 8 then size(L1) could be 8G larger than target_size(L1). I don't know whether this is a problem beyond being a source of confusion.
  • There is monitoring debt.
    • It would help to have details on stalls, per-level target sizes, the total BPC and the per-level contribution to BPC -- written to LOG and/or summarized in compaction IO statistics.

Summary

The current approach allows the LSM tree to get very much out of shape. I don't know whether that is a problem, beyond my confusion about this feature, which might have led me to use bad values for the soft and hard limits while running benchmarks. Regardless, I need to enhance the comments for those options to help other users.

I am still figuring out a good way to set values for the hard and soft limits but my suggestion for now is to use the defaults.