Tuesday, May 10, 2022

RocksDB internals: LRU

RocksDB uses LRU to manage the size of the block cache. The LRU can suffer from mutex contention so I describe the implementation in detail here.

An interesting question is whether clock is a better choice than LRU assuming it suffers less from mutex contention. Until that is answered I hope to determine how much improvement can be made by changing the code to not move a block to the head of the LRU after each reference.

Updates:

  • fix a bogus claim about cache_index_and_filter_blocks

The big picture

The LRU is sharded and the number of shards is 2**num_shard_bits where num_shard_bits is an argument to the LRUCacheOptions constructor. With db_bench you can set the --cache_numshardbits option and the default is 6 (64 shards). The use of the LRUCacheOptions constructor in db_bench is here. The cache is created by a call to NewLRUCache.

When cache_index_and_filter_blocks is true then index and filter blocks are stored in the block cache. Doing this provides better control over the amount of memory used by RocksDB. The cache can do midpoint insertion to guard frequently read blocks from being evicted by long range scans. This is enabled by setting cache_index_and_filter_blocks_with_high_priority=true and possibly setting high_pri_pool_ratio in the LRUCacheOptions constructor. From db_bench and the readrandom benchmark with 32 client threads and uniform key access I get 5% more throughput when pin_l0_index_and_filter_blocks_in_cache is true.

From a comment for NewLRUCache, the total capacity is divided and evenly assigned to each shard. Each shard also gets a mutex, a hash table in which blocks are stored and an LRU list. The mutex is locked when the cache is searched for a block, a block is inserted into the cache or blocks are evicted from the cache. 

The per-shard mutex and per-shard LRU pointers can suffer from contention. On a search the found block is removed from the LRU list and later, when the searcher is done with the block, the block is inserted to the head of the LRU list. Thus, using a block means going through the hot spot twice -- lock/unlock the per shard mutex each time, update the LRU hot pointer when inserting the block at the LRU head.

Finally, the block cache stores data blocks, index blocks and filter blocks. With leveled compaction a point lookup (Get operation) the work done per level of the LSM tree for levels > 0 is below. A typical search accesses the block cache once per level for levels on which the key does not exist and then 3 times for the level on which the key exists.

  1. Search to find the SST for which the searched for key is between the min and max key for the SST
  2. Get the filter block for that SST from the block cache. Stop if the bloom filter check means the key isn't in that SST and go to step 1 for the next level of the LSM tree.
  3. Get the index block for that SST from the block cache. Search the index block to find the data block that might contain the key.
  4. Get the data block for that SST from the block cache. If they key is found there then the search is done else go to step 1 for the next level of the LSM tree.

Implementation overview

These are my notes from reading the code after looking at PMP stack traces that show contention in the 3 places where the block cache is searched - for data, index and filter blocks. Reformatted traces are here.

The code is mostly in 4 classes/structs:

LRUHandle

LRUHandle has great comments that I summarize here:
  • each entry is a variable length heap-allocated structure
  • entries are referenced by cache and/or by any external entity
  • the cache keeps all its entries in a hash table. Some elements are also stored on LRU list.

LRUHandle in one of three states:
  1. Referenced externally AND in hash table
    • the entry is *not* in the LRU list
    • refs >= 1 && in_cache == true
  2. Not referenced externally AND in hash table
    • the entry is in the LRU list and can be freed
    • refs == 0 && in_cache == true
  3. Referenced externally AND not in hash table
    • entry is not in the LRU list and not in hash table
    • entry can be freed when refs becomes 0
    • refs >= 1 && in_cache == false
All newly created LRUHandles are in state 1.
  • If you call LRUCacheShard::Release on entry in state 1, it will go into state 2.
  • To move from state 1 to state 3, either call LRUCacheShard::Erase or LRUCacheShard::Insert with the same key (but possibly different value).
  • To move from state 2 to state 1, use LRUCacheShard::Lookup.
  • Before destruction, make sure that no handles are in state 1. This means that any successful LRUCacheShard::Lookup/LRUCacheShard::Insert have a matching LRUCache::Release (to move into state 2) or LRUCacheShard::Erase (to move into state 3).
Explaining the stack traces

The rest of this post explains what I saw in the stack traces that show contention for LRUCacheShard::Lookup and LRUCacheShard::Release. A block search is done in Lookup and the block is pinned (referenced) if found. The block is unpinned by Release when the search is finished with it. Both require a lock/unlock of the per-shard mutex. And Release can also write to the (contended) head of the LRU list. I also have edited PMP output for those traces.

A cache search starts with ShardedCache::Lookup. That computes hash to determine the shard and then calls LRUCacheShard::Lookup on that shard. When the searcher is done with the block it calls 

  lock LRUCacheShard::mutex_
  if LRUHandleTable::Lookup returns an entry (e)
    if not LRUHandle::HasRefs():
      LRU_Remove(e) // entry was in LRU since it has references
    LRUHandle::Ref (for e)    // does refs++
    LRUHandle::SetHit (for e) // does flags |= ...
  unlock LRUCacheShard::mutex_


  removes an LRUHandle from double linked LRU list
  calls LRUCacheHandle::CalcTotalCharge
  updates memory accounting

  if (metadata_charge_policy == kFullChargeCacheMetadata)
    call malloc_usable_size to update meta_charge

I was curious about the overhead from frequent calls to malloc_usable_size which can be done when the per-shard mutex is locked. Results are here. Preventing these calls doesn't solve the mutex contention problem.

  lock LRUCacheShard::mutex_
  call e->Unref()
  
  if ref count is 0 and e->InCache()
    if cache_full
      LRUHandleTable::Remove(e)
      e->SetInCache(False)
    else
      LRU_Insert(e)
  
  if (no longer referenced):
    call e->CalcTotalCharge
  unlock LRUCacheShard::mutex_
  
  if (no longer referenced):
    // This is done after the mutex has been unlocked
    e->Free()

  // LRUCacheShard::mutex_ held by caller
  call e->CalcTotalCharge()
  if high_pri_pool enabled && (e->IsHighPri() || e->HasHit())
    move e to head of LRU list
    MaintainPoolSize()
  else
    move e to head of low-pri pool

  while high pri pool too big
    move entry from tail of high pri LRU to low-pri pool

  call FindPointer
  remove from double linked list

  walk a list to find the LRUHandle for a key

The number of entries on the list that FindPointer will encounter should be ~1 on average and  LRUHandleTable::Insert calls LRUHandleTable::Resize to make that likely. I am curious whether calls to Resize are a source of stalls because a per-shard mutex is locked when that happens. That won't be a common stall.

Performance bugs: p50 vs p99

One way to classify performance bugs is p50 vs p99. 

  • A p50 performance bug occurs for all workloads
  • A p99 performance bug only happens for some workloads or in some contexts. By some workloads I mean it might only occur during hot backup or load. By some contexts I mean it might only happen under high concurrency or load spikes which don't represent the steady state.
The impact from fixing a p50 perf bug is easy to predict. Everything gets slightly faster and/or more efficient. But p99 perf bugs are outliers -- they don't make everything slower so fixing them doesn't make everything faster.

The impact from fixing a p99 bug is less obvious. On average there might be little impact with respect to average throughput and efficiency. There can be a huge impact with respect to response time goals. The challenge is to quantify the benefit and I have learned over the years that motivating the fix for p99 perf bugs is a hard problem. For MySQL I wrote many blog posts to market open bug reports.

One aspect is that there are easy workarounds for p50 perf bugs. If a bug makes something 10% less efficient then buy 10% more hardware. There might be no workarounds for a p99 perf bug. If something causes a long stall once/day or ruins performance for one infrequent part of your workload then the choices are to either tolerate it (someone suffers) or replace the system.

I spent a few years iterating on p99 bugs in MySQL back in my web-scale MySQL+InnoDB days. It was interesting work and I benefited from production which was a great source of p99 bugs and a colleague named Domas who was great at documenting such bugs.

Monday, May 9, 2022

Using mmap with RocksDB

RocksDB inherited support for mmap from LevelDB. I was curious how performance with mmap compared buffered IO (pread) and ran benchmarks with db_bench. I haven't had great experiences with mmap for database workloads in the past but tried to be fair.

tl;dr

  • Don't use mmap for IO-bound workloads until issue 9931 is fixed
  • For in-memory databases performance with mmap can be better than pread if db_bench is run with --cache_index_and_filter_blocks=false. Setting --verify_checksum=false provides a smaller performance boost.
Longer than tl;dr

If the database doesn't fit in memory then issue 9931 means there will be significant overfetching on reads from storage and this will ruin performance. I assume the fix is to add madvise calls similar to the posix_fadvise calls used for buffered IO (pread).

With --cache_index_and_filter_blocks=false the RocksDB process might use more memory than you expect so be careful while chasing performance.  When it is true and --mmap_read=true then index and filter blocks will still be copied into the RocksDB block cache.

While --verify_checksum=false improves performance in some cases, it also prevents the use of checksums to detect file corruption on data blocks. Checksums for filter and index blocks read into the block cache are always checked regardless of the value of --verify_checksum because the BlockFetcher uses a ReadOptions struct with default values and the default for ReadOptions::verify_checksums is true (see here and here). 

Configuration

I used benchmark.sh to run db_bench on a server with more than 32 CPU cores and then hyperthreading was enabled. The database was run in three configurations: cached by RocksDB, cached by OS and IO-bound. For cached by RocksDB the block cache was larger than the database files. For cached by OS the block cache was 1G but the database fits in the OS page cache. For IO-bound the database is larger than memory. I don't share graphs for cached by OS but the results were similar to cached by RocksDB.

The benchmarks were run with 32 client threads (--threads=32) and WAL sync disabled. The background writer was limited to 2MB/s.

The benchmarks were run in this order:

  • fillseq - load the database in key order
  • readrandom - point lookups via Get
  • multireadrandom - point lookups via MultiGet
  • revrangeww - reverse range scans with a background writer
  • fwdrangeww - forward range scans with a background writer
  • readww - point lookups via Get with a background writer
  • overwrite - writes via Put run as fast as possible.

IO-bound

The database was loaded with 4B key-value pairs by fillseq. The throughput graphs show that results were lousy for mmap on the read-heavy benchmarks. On my server RocksDB with mmap was fetching ~100X more data per query compared to pread. While tuning filesystem options might have reduced the excessive fetching I am not sure it would prevent it and I am reluctant to change global options on a server to benefit one of the applications that use the server. The real fix and more stats on the excessive fetching are explained in issue 9931.

For pread and mmap the benchmarks used --verify_checksum=true and --cache_index_and_filter_blocks=true.

The graph shows throughput on the y-axis by benchmark + configuration on the x-axis.

Cached by RocksDB

The database was loaded with 40M key-value pairs by fillseq. Results are provided for several configurations. Font size reduced to avoid spanning lines.

  • pread-ver1-cm1: -mmap_read=false --verify_checksum=true --cache_index_and_filter_blocks=true
  • mmap-ver1-cm1: --mmap_read=true --verify_checksum=true --cache_index_and_filter_blocks=true
  • pread-ver0 -cm1: --mmap_read=false --verify_checksum=false --cache_index_and_filter_blocks=true
  • pread-ver0-cm0: --mmap_read=false --verify_checksum=false --cache_index_and_filter_blocks=false
  • pread-ver1-cm0: --mmap_read=false --verify_checksum=true --cache_index_and_filter_blocks=false
  • mmap-ver0-cm1: --mmap_read=true --verify_checksum=false --cache_index_and_filter_blocks=true
  • mmap-ver0-cm0: --mmap_read=true --verify_checksum=false --cache_index_and_filter_blocks=false
  • mmap-ver1-cm0: --mmap_read=true --verify_checksum=true --cache_index_and_filter_blocks=false

Command lines are here for pread-ver0-cm1. To use the *-cm0 variants see here. To run the IO-bound variants see here.

The most important db_bench option for performance is --cache_index_and_filter_blocks=false as shown in the results for pread-ver*-cm0 and mmap-ver*-cm0.

Setting --verify_checksum=false helps for mmap but not for pread. See pread-ver0-cm0 vs pread-ver1-cm0 and mmap-ver0-cm0 vs mmap-ver1-cm0. This might be workload specific.

Setting --mmap_read also helps compared to pread but only when --cache_index_and_filter_blocks=false. It is possible that also enabling the option for compaction to pre-populate the block cache would allow pread to be as fast as mmap but I did not try it.

The graph shows throughput on the y-axis by benchmark + configuration on the x-axis.

Thursday, May 5, 2022

To compress or not compress

What are the benefits and costs from compressing the smaller levels of an LSM tree? You won't save much space from compressing them unless the LSM tree doesn't have data beyond L2. I usually suggest no compression for levels 0, 1 and 2. Here I share results from a benchmark to show the impact from using compression for those levels.

tl;dr - if levels 0, 1 and 2 are compressed:

  • Benefits
    • less write-amplification
  • Costs
    • there might be more write stalls
    • write throughput might be lower if there are more write stalls
    • time per compaction/flush job is 2X larger for flush to L0 and compaction to L1/L2 
    • compaction CPU overhead is increased
By might be more write stalls I mean that this is only an issue when the write rate is large. For the results here I ran one test with the write-rate limited to 10 MB/s and there were no write stalls whether or not L0/L1/L2 were compressed. Then I ran a test without a limit on the write rate and stalls were worse when L0/L1/L2 were compressed.

Quantifying the tl;dr based on the overwrite benchmark:
  • Write-amplification was 1.36X larger when L0/L1/L2 were not compressed
  • Write throughput was 1.32X larger when L0/L1/L2 were not compressed
  • Compaction CPU was 1.36X larger when L0/L1/L2 were compressed via lz4

Overview

The benchmark command lines are here and compaction IO statistics are here. Tests were run on a server with more than 32 CPU cores and hyperthreading was enabled. The server has fast SSD and enough DRAM.

I used the db_bench benchmark client for RocksDB and RocksDB version 6.28.2 to run a sequence of benchmarks. Here I explain results from two of them -- readwhilewriting and overwrite. The readwhilewriting benchmark ran with the write limited to 10MB/second. The overwrite benchmark was run without a constraint on the write rate. WAL fsync was disabled to allow the writes to happen as fast as possible.

Levels L3 and larger were compressed with lz4. Tests were then repeated with L0/L1/L2 not compressed vs compressed with lz4.

Data used in this benchmark compresses in half as the db_bench command line included --compression_ratio=0.5.

Results, part 1

Notes:

  • Write rates (wps) are the same for readwhilewriting because the writer had a 10M/s rate limit
  • Write rates (wps) are ~1.3X larger for overwrite when L0/L1/L2 are not compressed because there were fewer write stalls
  • Bytes flushed per write are ~2X larger when flushes (L0) are not compressed
  • Write-amp is ~1.3X larger when L0/L1/L2 are not compressed. The LSM tree in this workload has 7 levels, the write rate into each level is similar despite the amount of data per level varying greatly. Thus, whether or not you compress 3 of the 7 levels has a significant impact on write-amp. 

Legend:
* wps = writes/second
* ingest - GB of data written into RocksDB
* flush - GB of memtable flushes
* compact - GB of compaction

wps     ingest  flush   compact w-amp  L0/L1/L2-compression
--- readwhilewriting
220555  18.05   17.81   313     18.3   none
224320  18.05    9.84   245     14.1   lz4
--- overwrite
117383  167.25  169.42  2627    16.7   none
 89177  127.07   71.12  1486    12.3   lz4

Results, part 2

Notes:

  • Write rates are ~1.3X larger when L0/L1/L2 are not compressed
  • Write stalls are more common when L0/L1/L2 are compressed
  • The compaction CPU overhead per write is ~1.3X larger when L0/L1/L2 are compressed

Legend:
* wps = writes/second
* c.csecs - CPU seconds of compaction
* cpu/w - c.csecs / wps = compaction CPU / write
* stall% - percentage of time writes are stopped or stalled

wps     c.wsecs c.csecs cpu/w   stall%  L0/L1/L2-compression
--- readwhilewriting
220555  2867    2798    NA      0       none
224320  2737    2693    NA      0       lz4
--- overwrite
117383  17084   16464   0.140   49.7    none
 89177  14554   14209   0.159   62.9    lz4

Results, part 3

This shows the time per flush or compaction job.  A flush job is the work done to flush one memtable into an L0 SST. A compaction job is the time to do a compaction into L1+. A compaction job into L1 usually merges all data in L0 with all data in L1. A compaction job into Ln (n > 1) usually merges ~10 files in Ln-1 with 1 file in Ln. Therefore, compaction jobs into L1 usually process more data than jobs into Ln (n > 1). This data is from compaction IO statistics output.

Notes:

  • Compaction/flush jobs into L0/L1/L2 take up to 3X longer when L0/L1/L2 is compressed
  • Compaction jobs into L1 are the longest running. This happens with overwrite because it was run without rate limit for the writer, compaction fell behind and RocksDB allows L0 to grow large in that case. One side-effect is that the L0->L1 compaction job takes longer because it merges all data from L0 with all data from L1.

readwhilewriting - wall clock seconds per compaction/flush job

        compression
level   none    lz4
L0       0.05    0.08
L2       0.53    1.03
L3       0.17    0.47
L4       0.86    0.92
L5       0.86    0.88
L6       0.87    0.88
L7       0.58    0.55

overwrite - wall clock seconds per compaction/flush job


        compression
level   none    lz4
L0       0.11    0.18
L2      13.38   28.86
L3       0.14    0.42
L4       0.31    0.44
L5       0.37    0.40
L6       0.59    0.58
L7       0.77    0.79