RocksDB uses LRU to manage the size of the block cache. The LRU can suffer from mutex contention so I describe the implementation in detail here.
An interesting question is whether clock is a better choice than LRU assuming it suffers less from mutex contention. Until that is answered I hope to determine how much improvement can be made by changing the code to not move a block to the head of the LRU after each reference.
Updates:
- fix a bogus claim about cache_index_and_filter_blocks
The big picture
The LRU is sharded and the number of shards is 2**num_shard_bits where num_shard_bits is an argument to the LRUCacheOptions constructor. With db_bench you can set the --cache_numshardbits option and the default is 6 (64 shards). The use of the LRUCacheOptions constructor in db_bench is here. The cache is created by a call to NewLRUCache.
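For example, this is how a cache like the db_bench default can be created. A minimal sketch; the NewLRUCache signature here is from include/rocksdb/cache.h and can differ between RocksDB versions:

    #include <memory>
    #include "rocksdb/cache.h"

    // 1G of capacity split across 2^6 = 64 shards. Each shard gets its
    // own mutex, hash table and LRU list.
    std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(
        1024 * 1024 * 1024,  // capacity in bytes
        6,                   // num_shard_bits -> 64 shards
        false,               // strict_capacity_limit
        0.5);                // high_pri_pool_ratio, used for midpoint insertion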
When cache_index_and_filter_blocks is true, index and filter blocks are stored in the block cache. Doing this provides better control over the amount of memory used by RocksDB. The cache can do midpoint insertion to guard frequently read blocks from being evicted by long range scans. This is enabled by setting cache_index_and_filter_blocks_with_high_priority=true and possibly setting high_pri_pool_ratio in the LRUCacheOptions constructor. From db_bench with the readrandom benchmark, 32 client threads and uniform key access, I get 5% more throughput when pin_l0_filter_and_index_blocks_in_cache is true.
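Continuing the sketch above, these are the table options mentioned in this section (again, option names can differ between RocksDB versions):

    #include "rocksdb/options.h"
    #include "rocksdb/table.h"

    rocksdb::BlockBasedTableOptions table_opts;
    table_opts.block_cache = cache;  // the cache created above
    // Store index and filter blocks in the block cache instead of the heap.
    table_opts.cache_index_and_filter_blocks = true;
    // Insert them into the high priority pool to guard them from scans.
    table_opts.cache_index_and_filter_blocks_with_high_priority = true;
    // Pin L0 filter and index blocks so they are not evicted.
    table_opts.pin_l0_filter_and_index_blocks_in_cache = true;

    rocksdb::Options options;
    options.table_factory.reset(
        rocksdb::NewBlockBasedTableFactory(table_opts));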
From a comment for NewLRUCache, the total capacity is divided and evenly assigned to each shard. Each shard also gets a mutex, a hash table in which blocks are stored and an LRU list. The mutex is locked when the cache is searched for a block, a block is inserted into the cache or blocks are evicted from the cache.
The per-shard mutex and per-shard LRU pointers can suffer from contention. On a search the found block is removed from the LRU list and later, when the searcher is done with the block, the block is inserted at the head of the LRU list. Thus, using a block means going through the hot spot twice: the per-shard mutex is locked and unlocked each time, and the LRU head pointer is updated when the block is reinserted.
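The pattern looks roughly like the sketch below. This is my simplified model of one shard, not the actual RocksDB code (the real entry, LRUHandle, is an intrusive variable-length struct, and eviction is omitted here):

    #include <mutex>
    #include <string>
    #include <unordered_map>

    struct Handle {
      std::string value;
      int refs = 0;            // external references
      bool in_cache = true;    // still in the shard's hash table
      Handle* next = nullptr;  // LRU list links
      Handle* prev = nullptr;
    };

    class ShardSketch {
     public:
      ShardSketch() { lru_.next = &lru_; lru_.prev = &lru_; }

      Handle* Lookup(const std::string& key) {
        std::lock_guard<std::mutex> guard(mutex_);  // lock #1 per access
        auto it = table_.find(key);
        if (it == table_.end()) return nullptr;
        Handle* e = it->second;
        if (e->refs == 0) LruRemove(e);  // referenced entries leave the list
        e->refs++;
        return e;
      }

      void Release(Handle* e) {
        std::lock_guard<std::mutex> guard(mutex_);  // lock #2 per access
        if (--e->refs == 0 && e->in_cache) {
          LruAppend(e);  // reinsert at the head: the LRU pointer hot spot
        }
      }

     private:
      void LruRemove(Handle* e) {
        e->next->prev = e->prev;
        e->prev->next = e->next;
      }
      void LruAppend(Handle* e) {  // insert at the head (hottest end)
        e->next = &lru_;
        e->prev = lru_.prev;
        e->prev->next = e;
        e->next->prev = e;
      }

      std::mutex mutex_;
      Handle lru_;  // dummy head of a circular doubly-linked list
      std::unordered_map<std::string, Handle*> table_;
    };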
Finally, the block cache stores data blocks, index blocks and filter blocks. With leveled compaction, a point lookup (Get operation) does the work listed below for each level > 0 of the LSM tree (see the sketch after the list). A typical search accesses the block cache once per level for levels on which the key does not exist and then 3 times for the level on which the key exists.
- Search to find the SST for which the searched-for key is between the min and max key for the SST
- Get the filter block for that SST from the block cache. If the bloom filter check shows the key can't be in that SST then go to step 1 for the next level of the LSM tree.
- Get the index block for that SST from the block cache. Search the index block to find the data block that might contain the key.
- Get the data block for that SST from the block cache. If the key is found there then the search is done, else go to step 1 for the next level of the LSM tree.
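Put together, the per-level flow looks like the sketch below. The helper names (FindSstForKey, GetFilterBlock and friends) are hypothetical stand-ins for the real call paths through Version::Get and the table reader:

    // A sketch of the per-level search; all helper types and functions
    // here are hypothetical stand-ins for the real RocksDB call paths.
    bool GetFromLevels(const std::string& key, std::string* value,
                       int num_levels) {
      for (int level = 1; level < num_levels; level++) {
        // Step 1: binary search the level for the SST whose
        // [min, max] key range contains the key.
        Sst* sst = FindSstForKey(level, key);
        if (sst == nullptr) continue;
        // Step 2: block cache access #1, the filter block.
        FilterBlock* filter = GetFilterBlock(sst);
        if (!filter->KeyMayMatch(key)) continue;  // bloom says no
        // Step 3: block cache access #2, the index block.
        IndexBlock* index = GetIndexBlock(sst);
        BlockHandle bh = index->Seek(key);
        // Step 4: block cache access #3, the data block.
        DataBlock* data = GetDataBlock(sst, bh);
        if (data->Find(key, value)) return true;  // found it
      }
      return false;  // not found on any level > 0
    }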
Implementation overview
These are my notes from reading the code after looking at PMP stack traces that show contention in the 3 places where the block cache is searched - for data, index and filter blocks. Reformatted traces are here.
The code is mostly in 4 classes/structs:
- LRUHandle - wraps the objects stored on the LRU
- LRUHandleTable - the hash table for the LRU+cache
- ShardedCache - the cache that is sharded into LRUCacheShard
- LRUCacheShard - a shard of ShardedCache
From comments in the code, each entry (LRUHandle) is a variable-length heap-allocated structure. Entries are referenced by the cache and/or by an external entity. The cache keeps all of its entries in a hash table and some entries are also on the LRU list. An entry is in one of three states:
- State 1: referenced externally AND in the hash table
  - the entry is *not* in the LRU list
  - refs >= 1 && in_cache == true
- State 2: not referenced externally AND in the hash table
  - the entry is in the LRU list and can be freed
  - refs == 0 && in_cache == true
- State 3: referenced externally AND not in the hash table
  - the entry is not in the LRU list and not in the hash table
  - the entry can be freed when refs becomes 0
  - refs >= 1 && in_cache == false
- If you call LRUCacheShard::Release on an entry in state 1, it will go into state 2.
- To move from state 1 to state 3, either call LRUCacheShard::Erase or LRUCacheShard::Insert with the same key (but possibly different value).
- To move from state 2 to state 1, use LRUCacheShard::Lookup.
- Before destruction, make sure that no handles are in state 1. This means that any successful LRUCacheShard::Lookup/LRUCacheShard::Insert has a matching LRUCache::Release (to move into state 2) or LRUCacheShard::Erase (to move into state 3).
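The transitions are easier to see via the public Cache API, which routes to the per-shard methods. A sketch, assuming the Insert/Lookup/Release/Erase signatures from include/rocksdb/cache.h at the time I read the code:

    #include <string>
    #include "rocksdb/cache.h"

    void DeleteString(const rocksdb::Slice& /*key*/, void* value) {
      delete static_cast<std::string*>(value);
    }

    void StateWalkthrough(rocksdb::Cache* cache) {
      rocksdb::Cache::Handle* h = nullptr;
      // Insert while keeping a handle: referenced AND in the hash table (state 1).
      cache->Insert("k", new std::string("v"), /*charge=*/8, DeleteString, &h);
      cache->Release(h);       // state 1 -> state 2: on the LRU list, evictable
      h = cache->Lookup("k");  // state 2 -> state 1: removed from the LRU list
      cache->Erase("k");       // state 1 -> state 3: gone from the hash table
      cache->Release(h);       // refs reaches 0 in state 3: the entry is freed
    }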
A cache search starts with ShardedCache::Lookup. That computes the hash to determine the shard and then calls LRUCacheShard::Lookup on that shard, which moves the entry to state 1. When the searcher is done with the block it calls LRUCacheShard::Release to move the entry back to state 2. e->Free() is only called once an entry has been evicted or erased and its refs reaches 0.
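The routing itself is short. Paraphrasing the code (not an exact copy):

    // Paraphrased from ShardedCache: hash the key once, then use the
    // high bits of the hash to pick the shard.
    Cache::Handle* ShardedCache::Lookup(const Slice& key, Statistics* stats) {
      uint32_t hash = HashSlice(key);
      return GetShard(Shard(hash))->Lookup(key, hash);
    }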
Comments
Thank you for a great post explaining the internals of the block cache. If a process manages multiple RocksDB instances, is the block cache shared across all the instances or is a separate block cache created per instance?
There are options to share resources (cache and thread pool) between instances in the same process. But I am far from an expert on that.
Great info Mark! Two mutexes per read lookup seems bad. Would it be possible to have a lookup_read that does the lookup and moves the entry to the head of the LRU list, and a lookup_delete that removes the entry? And refcounts using atomic add/sub for when done using the entry?
I'm surprised malloc_usable_size would be expensive. Typically malloc implementations store the size in the 4 bytes preceding the start of the space, so it is just a few machine instructions. Here is glibc code: https://fossies.org/linux/glibc/malloc/malloc.c
I think it is possible to change the LRU code to only lock once per access but someone else will focus on this soon and might do something even better.