Small Datum: RocksDB internals: compaction stall counters

Compaction IO statistics in RocksDB includes two lines with information on write throttling. Writes are throttled to slow the ingest (new writes) so that compaction (outgest) can keep up. This is a feedback mechanism and needed with an LSM because an LSM lacks the natural feedback that a b-tree has, which is when all pages in the buffer pool are dirty. The first line has counters for the causes of write slowdowns and write stalls. The second line has the total time and percentage of time (relative to uptime) that write stalls have been in effect for the RocksDB instance.

Stalls(count): 713 level0_slowdown, 706 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 1160 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 351 total count

Cumulative stall: 00:03:31.992 H:M:S, 65.1 percent

A write stall is when RocksDB blocks writes until compaction debt has been reduced. A write slowdown is when RocksDB delays writes so that the write rate is <= a target rate to give compaction a chance to catch up and avoid write stalls. The causes for write stalls and slowdowns are listed here although some enums in there are for things other than write stalls and slowdowns.

Detecting write stalls

From the sample compaction IO stats output listed above, the first line (stall counters) is printed by this code and the second line (stall timers) by this code. The reason for a stall is computed by GetWriteStallConditionAndCause and that result is processed, including incrementing stall counters, in RecalculateWriteStallConditions. Note that GetWriteStallConditionAndCause makes checks in a fixed sequence. While multiple conditions might be true at any point in time, such as too many L0 files and too many bytes pending compaction, only the first one in that sequence is returned as the reason for the stall.

The stall timer is incremented in DBImpl::DelayWrite. Although I am not sure if this is called for both write stalls and write slowdowns.

The following explains the names used in the stall counter example text above. They are listed in the order in which they are checked by GetWriteStallConditionAndCause:

The first 4 are write stalls in which case write_controller_token_ is assigned the return value from GetStopToken.
- memtable_compaction - true when the number of unflushed memtables is >= max_write_buffer_number. A message is written to LOG with the text "Stopping writes because we have %d immutable memtables ..."
- level0_numfiles - true when the number of L0 files is >= level0_stop_writes_trigger. A message is written to LOG with the text "Stopping writes because we have %d level-0 files ..."
- level0_numfiles_with_compaction - true when level0_numfiles is true and L0 compaction is in progress. Thus this counter is <= the counter for leve0_numfiles. The LOG message is explained in the previous point.
- stop for pending_compaction_bytes - true when the number of pending compaction bytes is >= hard_pending_compaction_bytes_limit. A message is written to LOG with the text "Stopping writes because of estimated pending compaction bytes ...".
The remaining are write slowdowns in which case write_controller_token_ is assigned the return value from SetupDelay. The delayed write rate that is computed by SetupDelay is printed in the LOG messages mentioned below.
- memtable_slowdown - true when max_write_buffer_number > 3 and the number of unflushed memtables is >= both (max_write_buffer_number - 1) and (min_write_buffer_number_to_merge - 1). A message is written to LOG with the text "Stalling writes because we have %d immutable memtables (waiting for flush) ..."
- level0_slowdown - true when the number of L0 files is >= level0_slowdown_writes_trigger. A message is written to LOG with the text "Stalling writes because we have %d level-0 files rate ..."
- level0_slowdown_with_compaction - true when level0_slowdown is true and L0 compaction is in progress. Thus this counter is <= the counter for level0_slowdown. See the previous point for the LOG message.
- slowdown for pending_compaction_bytes - true when the number of bytes pending compaction is >= soft_pending_compaction_bytes_limit. A message is written to LOG with the text "Stalling writes because of estimated pending compaction bytes ..."

When SetupDelay is called, the penalize_stopped argument normally gets the value of write_controller->IsStopped() for the memtable_slowdown case. For level0_slowdown it is also true when the number of L0 files is within 2 of level0_stop_writes_trigger. For pending_compaction_bytes slowdown it is also true if the number of bytes pending compaction is 3/4 of the way between the slow and hard limits.

If GetWriteStallConditionAndCause returned WriteStallCondition::kNormal then

write_controller_token_ = GetCompactionPressureToken() if either of these are true
- vstorage->l0_delay_trigger_count() >= GetL0ThresholdSpeedupCompaction(). The l0_delay_trigger_count is the number of files in the L0. GetL0Threshold...() returns the smaller of 2*level0_file_num_compaction_trigger and X where X is a value 1/4 of the way from the L0 compaction trigger to the L0 slowdown trigger. A message is written to LOG with the text "Increasing compaction threads because we have %d level-0 files"
- vstorage->estimated_compaction_needed_bytes() >= soft_pending_compaction_bytes_limit/4. A message is written to LOG with the text "Increasing compaction threads because of estimated pending compaction bytes"
otherwise write_control_token_->reset() is called. Nothing is written to LOG.
if write_controller->NeedsDelay() was true the delayed write rate is set to kDelayRecoverSlowdownRatio (=1.4) times the configured delayed_write_rate and the low-pri delayed write rate is set to 1/4 that. Nothing is written to LOG.

I will save the write controller for a future note and will end with a description of SetupDelay. This function determines the value for the applied delayed write rate which is <= the value of the delayed_write_rate option.

if write_controller->NeedsDelay() is not true then the applied delayed write is not changed. I am curious why SetupDelay() would be called in that case.
else if penalize_stop is true (see above) then the applied delayed write rate is multiplied by kNearStopSlowdownRatio (=0.6) which reduces it.
else if bytes pending compaction is increasing then the applied delayed write rate is multiplied by kIncSlowDownRatio (=0.8) which reduces it.
else if bytes pending compaction is decreasing then the applied delayed write rate is multiplied by kDecSlowdownRatio (=1/kIncSlowdownRatio) which increases it.

There is a reference count on the number of current CompactionPressureToken instances and when this count is non-zero then the NeedSpeedupCompaction() function can return true and NeedSpeedupCompaction is called by:

GetBGJobLimits - computes values for max_flushes and max_compactions
ThrottleLowPriWritesIfNeeded - optionally delays low-pri writes

Monitoring

I listed the relevant LOG messages above related to rate-limiting writes. I have two questions:

Will it be easier to grep for these from LOG if they all contain a common string unique to this feature?
Are all of the transitions covered? Note that in some cases described above nothing is written to LOG.

Small Datum

Friday, January 28, 2022

RocksDB internals: compaction stall counters

No comments:

Post a Comment

Postgres 18 beta2: large server, Insert Benchmark, part 2