Tuesday, May 10, 2022

RocksDB internals: LRU

RocksDB uses LRU to manage the size of the block cache. The LRU can suffer from mutex contention so I describe the implementation in detail here.

An interesting question is whether clock is a better choice than LRU assuming it suffers less from mutex contention. Until that is answered I hope to determine how much improvement can be made by changing the code to not move a block to the head of the LRU after each reference.

Updates:

  • fix a bogus claim about cache_index_and_filter_blocks

The big picture

The LRU is sharded and the number of shards is 2**num_shard_bits where num_shard_bits is an argument to the LRUCacheOptions constructor. With db_bench you can set the --cache_numshardbits option and the default is 6 (64 shards). The use of the LRUCacheOptions constructor in db_bench is here. The cache is created by a call to NewLRUCache.

When cache_index_and_filter_blocks is true then index and filter blocks are stored in the block cache. Doing this provides better control over the amount of memory used by RocksDB. The cache can do midpoint insertion to guard frequently read blocks from being evicted by long range scans. This is enabled by setting cache_index_and_filter_blocks_with_high_priority=true and possibly setting high_pri_pool_ratio in the LRUCacheOptions constructor. From db_bench and the readrandom benchmark with 32 client threads and uniform key access I get 5% more throughput when pin_l0_index_and_filter_blocks_in_cache is true.

From a comment for NewLRUCache, the total capacity is divided and evenly assigned to each shard. Each shard also gets a mutex, a hash table in which blocks are stored and an LRU list. The mutex is locked when the cache is searched for a block, a block is inserted into the cache or blocks are evicted from the cache. 

The per-shard mutex and per-shard LRU pointers can suffer from contention. On a search the found block is removed from the LRU list and later, when the searcher is done with the block, the block is inserted to the head of the LRU list. Thus, using a block means going through the hot spot twice -- lock/unlock the per shard mutex each time, update the LRU hot pointer when inserting the block at the LRU head.

Finally, the block cache stores data blocks, index blocks and filter blocks. With leveled compaction a point lookup (Get operation) the work done per level of the LSM tree for levels > 0 is below. A typical search accesses the block cache once per level for levels on which the key does not exist and then 3 times for the level on which the key exists.

  1. Search to find the SST for which the searched for key is between the min and max key for the SST
  2. Get the filter block for that SST from the block cache. Stop if the bloom filter check means the key isn't in that SST and go to step 1 for the next level of the LSM tree.
  3. Get the index block for that SST from the block cache. Search the index block to find the data block that might contain the key.
  4. Get the data block for that SST from the block cache. If they key is found there then the search is done else go to step 1 for the next level of the LSM tree.

Implementation overview

These are my notes from reading the code after looking at PMP stack traces that show contention in the 3 places where the block cache is searched - for data, index and filter blocks. Reformatted traces are here.

The code is mostly in 4 classes/structs:

LRUHandle

LRUHandle has great comments that I summarize here:
  • each entry is a variable length heap-allocated structure
  • entries are referenced by cache and/or by any external entity
  • the cache keeps all its entries in a hash table. Some elements are also stored on LRU list.

LRUHandle in one of three states:
  1. Referenced externally AND in hash table
    • the entry is *not* in the LRU list
    • refs >= 1 && in_cache == true
  2. Not referenced externally AND in hash table
    • the entry is in the LRU list and can be freed
    • refs == 0 && in_cache == true
  3. Referenced externally AND not in hash table
    • entry is not in the LRU list and not in hash table
    • entry can be freed when refs becomes 0
    • refs >= 1 && in_cache == false
All newly created LRUHandles are in state 1.
  • If you call LRUCacheShard::Release on entry in state 1, it will go into state 2.
  • To move from state 1 to state 3, either call LRUCacheShard::Erase or LRUCacheShard::Insert with the same key (but possibly different value).
  • To move from state 2 to state 1, use LRUCacheShard::Lookup.
  • Before destruction, make sure that no handles are in state 1. This means that any successful LRUCacheShard::Lookup/LRUCacheShard::Insert have a matching LRUCache::Release (to move into state 2) or LRUCacheShard::Erase (to move into state 3).
Explaining the stack traces

The rest of this post explains what I saw in the stack traces that show contention for LRUCacheShard::Lookup and LRUCacheShard::Release. A block search is done in Lookup and the block is pinned (referenced) if found. The block is unpinned by Release when the search is finished with it. Both require a lock/unlock of the per-shard mutex. And Release can also write to the (contended) head of the LRU list. I also have edited PMP output for those traces.

A cache search starts with ShardedCache::Lookup. That computes hash to determine the shard and then calls LRUCacheShard::Lookup on that shard. When the searcher is done with the block it calls 

  lock LRUCacheShard::mutex_
  if LRUHandleTable::Lookup returns an entry (e)
    if not LRUHandle::HasRefs():
      LRU_Remove(e) // entry was in LRU since it has references
    LRUHandle::Ref (for e)    // does refs++
    LRUHandle::SetHit (for e) // does flags |= ...
  unlock LRUCacheShard::mutex_


  removes an LRUHandle from double linked LRU list
  calls LRUCacheHandle::CalcTotalCharge
  updates memory accounting

  if (metadata_charge_policy == kFullChargeCacheMetadata)
    call malloc_usable_size to update meta_charge

I was curious about the overhead from frequent calls to malloc_usable_size which can be done when the per-shard mutex is locked. Results are here. Preventing these calls doesn't solve the mutex contention problem.

  lock LRUCacheShard::mutex_
  call e->Unref()
  
  if ref count is 0 and e->InCache()
    if cache_full
      LRUHandleTable::Remove(e)
      e->SetInCache(False)
    else
      LRU_Insert(e)
  
  if (no longer referenced):
    call e->CalcTotalCharge
  unlock LRUCacheShard::mutex_
  
  if (no longer referenced):
    // This is done after the mutex has been unlocked
    e->Free()

  // LRUCacheShard::mutex_ held by caller
  call e->CalcTotalCharge()
  if high_pri_pool enabled && (e->IsHighPri() || e->HasHit())
    move e to head of LRU list
    MaintainPoolSize()
  else
    move e to head of low-pri pool

  while high pri pool too big
    move entry from tail of high pri LRU to low-pri pool

  call FindPointer
  remove from double linked list

  walk a list to find the LRUHandle for a key

The number of entries on the list that FindPointer will encounter should be ~1 on average and  LRUHandleTable::Insert calls LRUHandleTable::Resize to make that likely. I am curious whether calls to Resize are a source of stalls because a per-shard mutex is locked when that happens. That won't be a common stall.

4 comments:

  1. Thank you for great post explaining the internals of block cache. If a process manages multiple rocksdb instances, does the block cache is shares across all the instances or a separate block cache is created per instance?

    ReplyDelete
    Replies
    1. There are options to share resources (cache and thread pool) between instances in the same process. But I am far from an expert on that.

      Delete
  2. Great info Mark! Two mutexes per read lookup seems bad. Would it be possible to have a lookup_read that does the lookup and moves the entry to the head of the LRU list, and lookup_delete that removes the entry? And refcounts using atomic add/sub for when done using the entry?
    I'm surprised malloc_usable_size would be expensive. Typically malloc implementations store the size in the 4 bytes preceding the start off the space so just a few machine instructions. Here is glibc code: https://fossies.org/linux/glibc/malloc/malloc.c

    ReplyDelete
    Replies
    1. I think it is possible to change the LRU code to only lock once per access but someone else will focus on this soon and might do something even better.

      Delete

Evaluating vector indexes in MariaDB and pgvector: part 2

This post has results from the ann-benchmarks with the   fashion-mnist-784-euclidean  dataset for MariaDB and Postgres (pgvector) with conc...