Monday, May 18, 2020

Avoiding reads in write-optimized index structures

This is a summary of features that avoid storage read IO in OSS write-optimized index structures. My scope is limited to MySQL and Tarantool -- I know there are similar features in other DBMS and am happy for comments on that:
  • RocksDB merge operator logically does read-modify-write but physically does a Put into the memtable and the read is deferred until user queries or compaction. I hope that MyRocks eventually takes advantage of the merge operator.
  • Read free replication in TokuDB and MyRocks.
  • Fast Updates in TokuDB do something similar to the merge operator for update and insert
  • Non-unique secondary index maintenance is read free in MyRocks
  • Deferred secondary index update in Tarantool is a new approach to avoiding reads for write-heavy workloads. It is worth reading. I wish it had been published by VLDB.
  • MyRocks also avoids some storage reads when validating unique constraints (for PK and unique secondary indexes) on inserts and some updates because the Get() that it does can use a bloom filter. This benefit is larger when there is a bloom filter on the max level of the LSM tree, but that has other costs is is frequently not done.
  • MyRocks uses the SingleDelete feature in RocksDB to allow tombstones to be removed earlier which decreases the CPU overhead for reads after a batch of writes.
  • A b-tree can also reduce index maintenance related storage read IO. See the InnoDB change buffer.
Updates

An incomplete list of relevant research:
  • DiffIndex - this is more about distributed systems, but still interesting
  • Techniques for reducing index maintenance overhead in an LSM by Luo & Carey (VLDB 2019)
  • Deferred Lightweight Indexing (DELI)

Battle of the Mallocators

If you use RocksDB and want to avoid OOM then use jemalloc or tcmalloc and avoid glibc malloc. That was true in 2015 and remains true in 202...