Wednesday, July 22, 2020

Sustaining high insert rates despite secondary indexes

If you want to sustain high insert rates in the presence of secondary indexes then there are two approaches at a high level. The first is to buy enough RAM to cache the entire database but that gets expensive. The second is to avoid storage reads during secondary index maintenance.

So now the problem shifts to methods for avoiding storage reads during secondary index maintenance and there are a few popular ways to do that.
  1. Partition the database so that only a few of the partitions get inserts at any point in time and then use local (per-partition) secondary indexes. If the partitions are small enough they will fit in memory and then traditional indexing (B-Tree) can be used and the per-partition secondary indexes will fit it memory for the partitions that are getting inserts. TimescaleDB is an example of this.
  2. Use an LSM, or something like an LSM, that supports read-free index maintenance for non-unique secondary indexes. MyRocks is an example of that.
  3. Don't use secondary indexes because the target workloads expect to scan large amounts of data
The interesting thing about the first approach is that you end up with a tree per partition and then a query that uses the per-partition secondary index might have to search many trees (fanout or broadcast queries) unless there are other query predicates that enable enough partitions to be pruned.

Searching trees of trees sounds a lot like what an LSM does, which is a nice way of stating that the CPU overhead of an LSM for reads can be reproduced by the first approach. In an LSM with leveled compaction there is a sorted run (or tree) per level. However an LSM benefits from a bloom filter for point queries and might benefit from a prefix bloom filter for range queries. There are also several interesting research papers that show how to use something like a bloom filter for range queries -- see papers on SuRF and Rosetta. I am not sure whether systems that use the first approach (partition with local secondary indexes) also provide something like a bloom filter to match what an LSM can do. But if most partitions eventually become read-only then there are opportunities for being clever.

Updates
  • this post has more detail on avoiding reads in a write-optimized index structure -- RocksDB merge operator, InnoDB change buffer and more
  • I would put things like Postgres BRIN (min/max values per block) into the third category (do scans but more efficiently)

No comments:

Post a Comment

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...