To do at most 1 storage read for a point query:
- clustered b-tree - everything above the leaf level must be in cache. This is a key/pointer pair per leaf block. The InnoDB primary key index is an example.
- non-clustered b-tree - the entire index must be in cache. This is a key/pointer pair per row, which is much more memory than the cache-amplification for a clustered b-tree. Non-covering secondary indexes with InnoDB are an example, although in that case you must also consider the cache-amplification for the PK index.
- LSM - I assume there is a bloom filter per SST. Bloom filters for all levels but the max level should be in cache. Block indexes for all levels should be in cache. Data blocks don't have to be in cache. I assume there are no false positives from the bloom filter so at most one data block will be read. Note that with an LSM, more space amplification means more cache amplification. So cache-amp is worse (higher) for tiered compaction than for leveled.
- something else - there have been a few interesting variants on the same theme that I call index+log -- BitCask, ForestDB and WiscKey. These are similar to a non-clustered b-tree in that the entire index must be in cache so that the storage read can be spent on reading the data from the log. A back-of-envelope comparison of the cache footprints is sketched after this list.
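To make the comparison concrete, here is a back-of-envelope sketch of the cache footprint per structure. All of the sizes (1 billion rows, 32-byte key/pointer pairs, 100 entries per leaf or data block, 10 bloom filter bits per key, and roughly 10% of keys outside the max LSM level with leveled compaction) are illustrative assumptions of mine, not measurements.

```cpp
#include <cstdio>

int main() {
  // Illustrative assumptions, not measurements -- adjust for your workload.
  const double rows = 1e9;              // 1 billion rows
  const double entries_per_block = 100; // rows per b-tree leaf / LSM data block
  const double key_ptr_bytes = 32;      // bytes per key/pointer pair
  const double bloom_bits_per_key = 10; // bloom filter size
  const double nonmax_fraction = 0.1;   // keys outside the max LSM level
                                        // (leveled compaction, fanout ~10)

  // Clustered b-tree: cache one key/pointer pair per leaf block.
  const double clustered_mb = rows / entries_per_block * key_ptr_bytes / 1e6;

  // Non-clustered b-tree and index+log: cache one key/pointer pair per row.
  const double nonclustered_mb = rows * key_ptr_bytes / 1e6;

  // LSM (leveled): bloom filters for all levels but the max level, plus
  // block indexes (one key/pointer pair per data block) for all levels.
  const double lsm_mb = rows * nonmax_fraction * bloom_bits_per_key / 8 / 1e6 +
                        rows / entries_per_block * key_ptr_bytes / 1e6;

  printf("clustered b-tree     : %8.0f MB\n", clustered_mb);
  printf("non-clustered b-tree : %8.0f MB\n", nonclustered_mb);
  printf("LSM (leveled)        : %8.0f MB\n", lsm_mb);
  return 0;
}
```

With these assumptions the non-clustered b-tree needs about 100x the cache of the clustered b-tree, and the LSM lands in between but much closer to the clustered b-tree.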
I have ignored hash-based solutions for now but eventually they will be important. SILT is a great example of a solution with excellent cache-amplification.
Updated to correct what should be in cache for the LSM.
Interesting concept.
In the case of LSM I wonder if it wouldn't be better to keep only the most frequently accessed index blocks in the cache, instead of all of them. This would leave more memory to cache additional data blocks (potentially hotter than other index blocks). While some point queries would require more than 1 I/O (index block + data block), I would expect the average I/O per lookup to be lower if there is some skew in the access pattern.
On the other hand, maybe soon we get to a point where storage devices are so fast that doing I/O is not so bad and we could pay that price for a simpler architectural design, but I digress :)
For a while we have not had to worry about which index/filter blocks to keep in cache. The database:RAM ratio was small enough that everything fit. But that ratio is growing and soon we will have to worry about that.
One project that helps RocksDB is partitioned index/filter blocks - http://rocksdb.org/blog/2017/05/12/partitioned-index-filter.html
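As a sketch of how that might be enabled, assuming the RocksDB options described in that post (kTwoLevelIndexSearch for the index, partition_filters for the filters); the values here are illustrative, not tuning advice:

```cpp
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

// Sketch: partition the index and filter blocks so only the small top-level
// pieces need to stay hot while partitions are demand-loaded into the cache.
// Values are illustrative.
rocksdb::Options PartitionedIndexFilterOptions() {
  rocksdb::BlockBasedTableOptions t;
  t.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10, false));
  t.index_type = rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
  t.partition_filters = true;
  t.metadata_block_size = 4096;  // target size of index/filter partitions
  t.cache_index_and_filter_blocks = true;

  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(t));
  return options;
}
```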
The block cache uses LRU, and when index/filter blocks are kept in the block cache, only the most recently accessed blocks stay there. However, to reduce the CPU overhead and mutex contention from looking up blocks in the block cache, there is an option to pin the index/filter blocks for L0 files in memory so lookups are not needed for them (a configuration sketch follows the links):
* https://github.com/facebook/rocksdb/blob/master/include/rocksdb/table.h#L78
* see pin_l0_filter_and_index_blocks_in_cache
Note that an LSM searches for data in more places, so it does more block cache lookups for filter, index and data blocks.
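A minimal sketch of the options mentioned above, assuming cache_index_and_filter_blocks together with pin_l0_filter_and_index_blocks_in_cache; the cache size and bloom filter bits are illustrative:

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

// Sketch: keep index/filter blocks in the LRU block cache, but pin the
// index/filter blocks of L0 files so point queries skip the block cache
// lookup for them. Sizes are illustrative.
rocksdb::Options CacheIndexAndFilterOptions() {
  rocksdb::BlockBasedTableOptions t;
  t.block_cache = rocksdb::NewLRUCache(8ULL << 30);  // 8 GB block cache
  t.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10, false));
  t.cache_index_and_filter_blocks = true;
  t.pin_l0_filter_and_index_blocks_in_cache = true;

  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(t));
  return options;
}
```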