Monday, January 31, 2022

RocksDB internals: write rate limiter

The RocksDB wiki has a page on the rate limiter. This blog post is focused on rate limits for writes to manage compaction debt. WriteController and WriteControllerToken (source, header) are used to throttle write rates when compaction gets behind.

WriteControllerToken is extended by StopWriteToken, DelayWriteToken and CompactionPressureToken.

  • StopWriteToken is used when writes must be stopped until compaction catches up
  • DelayWriteToken is used when writes must be delayed (slowed) to <= the value of the delayed_write_rate option. This allows compaction debt to be reduced at a faster rate.
  • CompactionPressureToken is used when the number of background compaction threads should be increased.

Important methods for class WriteController are:

  • Get{Stop, Delay, CompactionPressure}Token. There is a count of the number of each type of token that currently exists. When GetDelayToken is called, the caller provides a value for the desired delayed write rate.
  • IsStopped - true if stop tokens exist
  • NeedsDelay - true if delay tokens exist
  • NeedSpeedupCompaction - true if IsStopped(), NeedsDelay() or compaction pressure tokens exist. It is interesting that the data type used for the counts of each token type are all std::atomic<int> but different techniques are used to read them -- see "total_compaction_pressure_ > 0", total_stopped_.load(std::memory_order_relaxed) and total_delayed_.load().
  • GetDelay - see below

GetDelay

WriteController::GetDelay is interesting and the method that shapes the write rate. It is called with the number of bytes that will be written and returns the number of microseconds (>=0) the write must be delayed. The caller must impose the delay. The minimum delay is 1 millisecond.

GetDelay manages the write rate over 1 millisecond intervals (kMicrosPerRefill=1000). When the delayed_write_rate is X bytes/second, then the target rate over the next millisecond is X / 1000 and that is the value for credit_in_bytes_. Each write then decrements credit_in_bytes_ by the size number of bytes written until credit_in_bytes_ is less than 0 or the next time interval is reached (>= 1 microsecond in the future).

I am curious about:

  • What happens when there is a large duration between consecutive calls to GetDelay and time_now is >> next_refill_time_? In this code elapsed can be large because it includes (time_now - next_refill_time_) and elapsed is used to compute credit_in_bytes_. So I wonder if that can lead to the next interval having an unusually large value for credit_in_bytes_. Although in the worst-case that just means the next millisecond interval has no write rate.

Applying Delays

DBImpl::PreprocessWrite calls DelayWrite. DbImpl::DelayWrite calls GetDelay and then enforces it.

For write slowdowns:

  • The delay should be no more than a few milliseconds with "normal" sized writes and usually a thread just has to wait until the next millisecond interval.
  • The delay is implemented as a sequence of calls to SleepForMicroseconds(kDelayInterval) where kDelayInterval = 1001.

For write stalls:

  • There is a loop that calls bg_cv_.Wait() if the IsStopped() check returns true and bg_cv_ is a condition variable, as the name suggests.
  • Nothing is logged here. I am curious if long waits for write stalls result in anything getting written to LOG. That might help with debugging. I will soon figure this out.

For both write stalls and slowdowns the time delayed is added to the STALL_MICROS timer.













No comments:

Post a Comment

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...