I started by dividing the index structures into tree-based and hash-based. Tree-based support range queries while hash-based do not. Most of my experience is with tree-based approaches. There was a lot of research on hash-based in the 80s (extendible, dynamic and linear hashing), they are in production today but not as prominent as b-trees, and there is recent research on them. This paper by Seltzer and Yigit is a great overview on hash-based index structures.
The next classification is by algorithm - page-based, index+log and LSM. I use this for both tree-based and hash-based so LSM means Log Structured Merge rather than Log Structured Merge Tree. Most DBMS implement one or two of these. Tarantool has more algorithmic diversity, but I don't have much experience with it.
- page-based - I used to call this update-in-place but think that page-based is a better name because it includes update-in-place (UiP) and copy-on-write-random (CoW-R) b-trees. Great implementations include InnoDB for UiP and WiredTiger for CoW-R.
- LSM - leveled and tiered compaction are examples (RocksDB, LevelDB, Cassandra). Note that the original LSM design by O'Neil et al didn't have an L0 and I suspect that the L0 might be a mistake but that is for another post. I don't want to insult LevelDB authors, the L0 makes sense when you want to limit memory consumption for database embedded in client apps, I am not sure it makes sense for serious OLTP with RocksDB. O'Neil also did fundamental work on bitmap indexes. I worked on both so he made my career possible (thank you).
- index+log - use a tree-based index with data in log segments. The index points into the log segments. GC is required. It scans the log segments and probes the index to determine whether a record is live or can be deleted. There are fewer examples of this approach but interesting systems include WiscKey and ForestDB. Range queries will suffer unless the index is covering (this needs more discussion, but not here).
- page-based - see dynamic, extendible and linear hashing. TIL that ZFS implements extendible hashing. BerkeleyDB supports linear hashing. The dbm implementation on many Linux/Unix systems also implements some variant of update-in-place persistent hash tables.
- LSM - today I only know of one example of this - SILT. That is a great paper (read it). I include it as an example of hash-based even though one of the levels is ordered.
- index+log - BitCask is one example but the index wasn't durable and it took a long time (scan all log segments) to rebuild it on process start. Faster might be another example, but I am still reading the paper. I hope someone names a future system Efficienter or Greener.
Finally, I have a few comments on performance. I will be brief, maybe there will be another post with a lot more detail on my opinions:
- b-tree - provides better read efficiency at the cost of write efficiency. The worst case for write efficiency is writing back a page for every modified row. Cache efficiency is better for a clustered index than a non-clustered index -- for a clustered index you should cache at least one key/pointer per leaf block but for a non-clustered index you need the entire index in RAM or there will be extra storage reads. Space efficiency for a b-tree is good (not great, not bad) -- the problem is fragmentation.
- LSM - provides better write efficiency at the cost of read efficiency. Leveled compaction provides amazing space efficiency. Tiered compaction gets better write efficiency at the cost of space efficiency. Compaction does many key comparisons and this should be considered as part of the CPU overhead for an insert (maybe 4X more comparisons/insert than a b-tree).
- index+log - provides better write efficiency. Depending on the choice of index structure this doesn't sacrifice read efficiency like an LSM. But the entire index must remain in RAM (just like a non-clustered b-tree) or GC will fall behind and/or do too many storage reads. GC does many index probes and the cost of this is greatly reduced by using a hash-based solution. These comparisons should be considered as part of the CPU overhead of an insert. There is also a write vs space efficiency tradeoff. By increasing the amount of dead data that can be tolerated in log segments then GC is less frequent, write-amplification is improved but space-amplification suffers. There are variants that don't need GC, but they are not general purpose.