Showing posts from March, 2018

Missing documentation

In my time with Linux I have found some things that would benefit from better documentation.

The use of per-inode mutexes for buffered IO writes - this prevents concurrent writes per file between the block layer and storage. The problem is easiest to see with a hard disk when the drive write cache is disabled. XFS with O_DIRECT avoids this problem. I am not sure about other file systems with O_DIRECT. TIL the per-inode mutex is now the per-inode rwsem. An old FB note on this is here. A paper on filesystem scalability is here.

PTHREAD_MUTEX_ADAPTIVE_NP - this enables some busy-waiting when trying to lock a mutex. It isn't mentioned in man pages. Interesting details are here.
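Since PTHREAD_MUTEX_ADAPTIVE_NP is hard to find in the man pages, here is a minimal sketch of how it can be requested. It assumes Linux with glibc, where the constant is declared in pthread.h, and it is built with -pthread. The helper name init_adaptive_mutex is hypothetical and used only for illustration.

```c
#define _GNU_SOURCE
#include <pthread.h>

/*
 * Hypothetical helper: initialize *m as an adaptive mutex.
 * PTHREAD_MUTEX_ADAPTIVE_NP is a non-portable (glibc) mutex type,
 * so this sketch assumes Linux with glibc.
 * Returns 0 on success or a pthread error number on failure.
 */
static int init_adaptive_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;

    /* Ask for the adaptive type instead of the default mutex type. */
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);

    /* The attribute object is no longer needed once the mutex exists. */
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

After a successful call the mutex is used with the normal pthread_mutex_lock, pthread_mutex_unlock and pthread_mutex_destroy calls. Only the initialization differs. Whether the busy-waiting helps depends on how long the lock is held, so it is worth measuring on the workload in question.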

Cache amplification

How much of the database must be in cache so that a point query does at most one read from storage? I call this cache-amplification or cache amplification. The answer depends on the index structure (b-tree, LSM, something else). Cache amplification can join read, write and space amplification. Given that RWS was renamed RUM by the excellent RUM Conjecture, we now have CRUM, which is close to crummy. I briefly wrote about this in a previous post.

To do at most 1 storage read for a point query:

clustered b-tree - everything above the leaf level must be in cache. This is a key/pointer pair per leaf block. The InnoDB primary key index is an example.

non-clustered b-tree - the entire index must be in cache. This is a key/pointer pair per row, which is much more memory than the cache amplification for a clustered b-tree. Non-covering secondary indexes with InnoDB are an example, although in that case you must also consider the cache amplification for the PK index.

LSM - I as