Wednesday, March 28, 2018

Missing documentation

In my time with Linux there are some things that would benefit from better documentation.
  1. The use of per-inode mutexes for buffered IO writes. This prevents concurrent writes per file between the block layer and storage. The problem is most easy to see with a hard disk when the drive write cache is disabled. XFS with O_DIRECT avoids this problem. I am not sure about other file systems with O_DIRECT. TIL the per-inode mutex is now the per-inode rwsem. An old FB note on this is here. A paper on filesystem scalability is here.
  2. PTHREAD_MUTEX_ADAPTIVE_NP - this enables some busy-waiting when trying to lock a mutex. It isn't mentioned in man pages. Interesting details are here.

Friday, March 9, 2018

Cache amplification

How much of the database must be in cache so that a point-query does at most one read from storage? I call this cache-amplification or cache amplification. The answer depends on the index structure (b-tree, LSM, something else). Cache amplification can join read, write and space amplification. Given that RWS was renamed RUM by the excellent RUM Conjecture now we have CRUM which is close to crummy. I briefly wrote about this in a previous post.

To do at most 1 storage read for a point query:
  • clustered b-tree - everything above the leaf level must be in cache. This is a key/pointer pair per leaf block. The InnoDB primary key index is an example.
  • non-clustered b-tree - the entire index must be in cache. This is a key/pointer pair per row which is much more memory than the cache-amplification for a clustered-btree. Non-covering secondary indexes with InnoDB are an example, although in that case everything you must also consider the cache-amplification for the PK index.
  • LSM - I assume there is a bloom filter per SST. Bloom filters for all levels but the max level should be in cache. Block indexes for all levels should be in cache. Data blocks don't have to be in cache. I assume there are no false positives from the bloom filter so at most one data block will be read. Note that with an LSM, more space amplification means more cache amplification. So cache-amp is worse (higher) for tiered compaction than for leveled.
  • something else - there have a been a few interesting variants on the same theme that I call index+log -- BitCask, ForestDB and WiscKey. These are similar to a non-clustered b-tree in that the entire index must be in cache so that the storage read can be spent on reading the data from the log.
I have ignored hash-based solutions for now but eventually they will be important. SILT is a great example of a solution with excellent cache-amplification.

Updated to correct what should be in cache for the LSM.

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...