So now the problem shifts to methods for avoiding storage reads during secondary index maintenance and there are a few popular ways to do that.
- Partition the database so that only a few of the partitions get inserts at any point in time and then use local (per-partition) secondary indexes. If the partitions are small enough they will fit in memory and then traditional indexing (B-Tree) can be used and the per-partition secondary indexes will fit it memory for the partitions that are getting inserts. TimescaleDB is an example of this.
- Use an LSM, or something like an LSM, that supports read-free index maintenance for non-unique secondary indexes. MyRocks is an example of that.
- Don't use secondary indexes because the target workloads expect to scan large amounts of data
Searching trees of trees sounds a lot like what an LSM does, which is a nice way of stating that the CPU overhead of an LSM for reads can be reproduced by the first approach. In an LSM with leveled compaction there is a sorted run (or tree) per level. However an LSM benefits from a bloom filter for point queries and might benefit from a prefix bloom filter for range queries. There are also several interesting research papers that show how to use something like a bloom filter for range queries -- see papers on SuRF and Rosetta. I am not sure whether systems that use the first approach (partition with local secondary indexes) also provide something like a bloom filter to match what an LSM can do. But if most partitions eventually become read-only then there are opportunities for being clever.