Friday, March 9, 2018

Cache amplification

How much of the database must be in cache so that a point-query does at most one read from storage? I call this cache-amplification or cache amplification. The answer depends on the index structure (b-tree, LSM, something else). Cache amplification can join read, write and space amplification. Given that RWS was renamed RUM by the excellent RUM Conjecture now we have CRUM which is close to crummy. I briefly wrote about this in a previous post.

To do at most 1 storage read for a point query:
  • clustered b-tree - everything above the leaf level must be in cache. This is a key/pointer pair per leaf block. The InnoDB primary key index is an example.
  • non-clustered b-tree - the entire index must be in cache. This is a key/pointer pair per row which is much more memory than the cache-amplification for a clustered-btree. Non-covering secondary indexes with InnoDB are an example, although in that case everything you must also consider the cache-amplification for the PK index.
  • LSM - I assume there is a bloom filter per SST. Bloom filters for all levels but the max level should be in cache. Block indexes for all levels should be in cache. Data blocks don't have to be in cache. I assume there are no false positives from the bloom filter so at most one data block will be read. Note that with an LSM, more space amplification means more cache amplification. So cache-amp is worse (higher) for tiered compaction than for leveled.
  • something else - there have a been a few interesting variants on the same theme that I call index+log -- BitCask, ForestDB and WiscKey. These are similar to a non-clustered b-tree in that the entire index must be in cache so that the storage read can be spent on reading the data from the log.
I have ignored hash-based solutions for now but eventually they will be important. SILT is a great example of a solution with excellent cache-amplification.

Updated to correct what should be in cache for the LSM.


  1. Interesting concept.

    In the case of LSM I wonder if it wouldn't be better to keep only the most accessed indexes blocks in the cache, instead of all of them. This would leave more memory to cache additional data blocks (potentially hotter than other index blocks). While some point-queries would require more than 1 I/O (index block + data block), I would expect the average I/O per lookup to be lower if there is some skew in the access.

    On the other hand, maybe soon we get to a point where storage devices are so fast that doing I/O is not so bad and we could pay that price for a simpler architectural design, but I digress :)

    1. For a while we have not had to worry about which index/filter blocks to keep in cache. The database:RAM ratio was small enough that everything fit. But that ratio is growing and soon we will have to worry about that.

      One project that helps RocksDB is partitioned index/filter blocks -

      The block cache uses LRU and when index/filter blocks are kept in the block cache then only the most recently accessed blocks are there. However, to reduce CPU overhead and mutex contention from looking up blocks in the block cache there is an option to pin index/filter blocks in memory for L0 files so lookups are not needed for them:
      * see pin_l0_filter_and_index_blocks_in_cache

      Note that an LSM searches for data in more places so it does more block cache lookups for filter, index and data blocks.