RocksDB uses LRU to manage the size of the block cache. The LRU can suffer from mutex contention so I describe the implementation in detail here.
An interesting question is whether clock is a better choice than LRU assuming it suffers less from mutex contention. Until that is answered I hope to determine how much improvement can be made by changing the code to not move a block to the head of the LRU after each reference.
- Update: fixed a bogus claim about cache_index_and_filter_blocks
The big picture
The LRU is sharded and the number of shards is 2**num_shard_bits where num_shard_bits is an argument to the LRUCacheOptions constructor. With db_bench you can set the --cache_numshardbits option and the default is 6 (64 shards). The use of the LRUCacheOptions constructor in db_bench is here. The cache is created by a call to NewLRUCache.
When cache_index_and_filter_blocks is true, index and filter blocks are stored in the block cache. Doing this provides better control over the amount of memory used by RocksDB. The cache can do midpoint insertion to guard frequently read blocks from being evicted by long range scans. This is enabled by setting cache_index_and_filter_blocks_with_high_priority=true and possibly setting high_pri_pool_ratio in the LRUCacheOptions constructor. From db_bench and the readrandom benchmark with 32 client threads and uniform key access I get 5% more throughput when pin_l0_filter_and_index_blocks_in_cache is true.
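A sketch of how these options fit together using the public RocksDB API (the field names are real, the values are only illustrative, and this fragment assumes the RocksDB headers and library are available):

```cpp
#include <memory>
#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

// Illustrative configuration; the values are not recommendations.
rocksdb::Options MakeOptions() {
  rocksdb::LRUCacheOptions cache_opts;
  cache_opts.capacity = 1u << 30;          // 1 GB total, split across shards
  cache_opts.num_shard_bits = 6;           // 64 shards (db_bench default)
  cache_opts.high_pri_pool_ratio = 0.5;    // reserve space for high-pri blocks

  rocksdb::BlockBasedTableOptions table_opts;
  table_opts.block_cache = rocksdb::NewLRUCache(cache_opts);
  table_opts.cache_index_and_filter_blocks = true;
  table_opts.cache_index_and_filter_blocks_with_high_priority = true;
  table_opts.pin_l0_filter_and_index_blocks_in_cache = true;

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_opts));
  return options;
}
```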
From a comment for NewLRUCache, the total capacity is divided and evenly assigned to each shard. Each shard also gets a mutex, a hash table in which blocks are stored and an LRU list. The mutex is locked when the cache is searched for a block, a block is inserted into the cache or blocks are evicted from the cache.
The per-shard mutex and per-shard LRU list pointers can suffer from contention. On a search the found block is removed from the LRU list and later, when the searcher is done with the block, the block is inserted at the head of the LRU list. Thus, using a block means going through the hot spot twice: the per-shard mutex is locked and unlocked each time, and the contended LRU head pointer is updated when the block is re-inserted.
Finally, the block cache stores data blocks, index blocks and filter blocks. With leveled compaction, a point lookup (Get operation) does the work listed below for each level > 0 of the LSM tree. A typical search accesses the block cache once per level for levels on which the key does not exist and then 3 times on the level where the key exists.
- Find the SST file on that level whose min and max keys bound the searched-for key
- Get the filter block for that SST from the block cache. If the bloom filter check shows the key can't be in that SST, skip it and go to step 1 for the next level of the LSM tree.
- Get the index block for that SST from the block cache. Search the index block to find the data block that might contain the key.
- Get the data block for that SST from the block cache. If the key is found there then the search is done, else go to step 1 for the next level of the LSM tree.
These are my notes from reading the code after looking at PMP stack traces that show contention in the 3 places where the block cache is searched - for data, index and filter blocks. Reformatted traces are here.
The code is mostly in 4 classes/structs: LRUCache, LRUCacheShard, LRUHandleTable and LRUHandle. For LRUHandle:
- each entry is a variable length heap-allocated structure
- entries are referenced by the cache and/or by any external entity
- the cache keeps all its entries in a hash table. Some entries are also stored on the LRU list.
An LRUHandle is in one of three states:
1. Referenced externally AND in hash table
  - the entry is *not* in the LRU list
  - refs >= 1 && in_cache == true
2. Not referenced externally AND in hash table
  - the entry is in the LRU list and can be freed
  - refs == 0 && in_cache == true
3. Referenced externally AND not in hash table
  - the entry is not in the LRU list and not in the hash table
  - the entry can be freed when refs becomes 0
  - refs >= 1 && in_cache == false
All newly created LRUHandles are in state 1.
- If you call LRUCacheShard::Release on an entry in state 1, it will go into state 2.
- To move from state 1 to state 3, either call LRUCacheShard::Erase or LRUCacheShard::Insert with the same key (but possibly different value).
- To move from state 2 to state 1, use LRUCacheShard::Lookup.
- Before destruction, make sure that no handles are in state 1. This means that any successful LRUCacheShard::Lookup/LRUCacheShard::Insert has a matching LRUCache::Release (to move into state 2) or LRUCacheShard::Erase (to move into state 3).
Explaining the stack traces
The rest of this post explains what I saw in the stack traces that show contention for LRUCacheShard::Lookup. A block search is done in Lookup and the block is pinned (referenced) if found. The block is unpinned by Release when the search is finished with it. Both require a lock/unlock of the per-shard mutex, and Release can also write to the (contended) head of the LRU list. I also have edited PMP output for those traces.
LRUCacheShard::Lookup does:
  if LRUHandleTable::Lookup returns an entry (e)
    if not LRUHandle::HasRefs():
      LRU_Remove(e) // entry was in the LRU list since it had no external references
    LRUHandle::Ref (for e) // does refs++
    LRUHandle::SetHit (for e) // does flags |= ...

LRU_Remove does:
  remove the LRUHandle from the doubly linked LRU list
  update memory accounting
  if (metadata_charge_policy == kFullChargeCacheMetadata)
    call malloc_usable_size to update meta_charge
I was curious about the overhead from frequent calls to malloc_usable_size, which can happen while the per-shard mutex is locked. Results are here. Preventing these calls doesn't solve the mutex contention problem.
LRUCacheShard::Release does:
  LRUHandle::Unref (for e) // does refs--
  if ref count is 0 and e->InCache()
    if the cache is full:
      remove e from the hash table
    else:
      LRU_Insert(e) // put e back on the LRU list
  if (no longer referenced):
    free e // This is done after the mutex has been unlocked

LRU_Insert does: // LRUCacheShard::mutex_ held by caller
  if high_pri_pool enabled && (e->IsHighPri() || e->HasHit())
    move e to head of LRU list and of the high-pri pool
    MaintainPoolSize()
  else
    move e to head of low-pri pool

MaintainPoolSize does:
  while high pri pool too big
    move entry from tail of high pri LRU to low-pri pool

LRUHandleTable::FindPointer does:
  walk a bucket's list to find the LRUHandle for a key
The number of entries on the list that FindPointer will encounter should be ~1 on average and LRUHandleTable::Insert calls LRUHandleTable::Resize to make that likely. I am curious whether calls to Resize are a source of stalls because a per-shard mutex is locked when that happens. That won't be a common stall.