Friday, October 23, 2020

LSM local secondary indexes (LSM LSI)

This expands on my previous post about RID-lists for RocksDB. RocksDB doesn't provide secondary indexes, nor does it know about schemas but applications that use RocksDB can add both and MyRocks is an example of that.

Many DBMS use the concept of local vs global secondary indexes and partitioned tables are one place where this matters. Assuming a B-Tree is used for the index, then a local secondary index with a partitioned table has a B-Tree index per-partition. The benefit of this is that DROP PARTITION is fast -- delete the table and index files for that partition. The cost from this is that a secondary index query might have to probe indexes per-partition and this can use more CPU and IO.

Without defining a local secondary index for an LSM I assert that secondary indexes with MyRocks are global. Global secondary index entries for an LSM use the PK value to reference the base row. Side effects of this include larger secondary index entries and CPU overhead when an LSM point read must be done to find the base row. Any DBMS with a clustered PK index, like InnoDB, also has these costs.

For the following I haven't done an extensive literature search to understand whether this idea has been proposed. Regardless, I hope that LSM LSI (see below) is the name to be be used.

Local secondary indexes for an LSM

The concept of local secondary indexes also applies to a Log Structured Merge tree. An LSM global secondary index (LSM GSI) uses the PK value because the base row is in the primary key index and is likely to be relocated over time by compaction. Therefore, either secondary index entries must be updated with a new location (rowid) during compaction of the PK index or the secondary index entries must use the PK value. There are more complex solutions that I will ignore.

There are other costs when secondary index entries uses the PK value rather than a rowid to find the base row -- it is harder to implement RID-lists and bitmap indexes. But I think this can be fixed by an LSM local secondary index (LSM LSI).

I define an LSM LSI to be local to a sorted run. When compaction merges sorted runs into a larger sorted run, then a new LSM LSI will be created for that new sorted run. The cost from this is fanout for queries that have a predicate on the secondary index and cannot be pruned to a few sorted runs. Pruning to a few sorted runs is most likely to be enabled via predicates on the PK.

The lifetime of the LSM LSI is the same as the lifetime of the data it indexes. Therefore, LSM LSI entries can use something like a rowid rather than the PK to reference the base row. This makes it much easier to implement RID-lists and bitmap indexes for an LSM. The rowid has two parts -- (block#, offset) where block# is a value from 0 to N-1 when an SST has N blocks and the offset is the position of the row in the block. The rowid usually needs <= 4 bytes.

Fanout is less of a concern for a bitmap index because many analytic queries expect to scan a large part of the index. But there will be more merging of bitmap index entries when LSM LSI is used, just as there would be with local bitmap indexes on a partitioned table with a famous SQL DBMS.

References

  • Cassandra SAI (Storage Attached Index) is an LSM local secondary index. It uses a trie for text keys and a kd-tree for numeric keys. I think the index entries use the row offset within the SST as the row pointer.
  • Cuckoo Index might be relevant, although it is a lighter weight index
  • There is a paper in SIGMOD 2018 on LSM secondary indexes
  • There is a paper from the AsterixDB project on LSM secondary indexes
  • Another paper from AsterixDB and Chen Luo on LSM secondary indexes


2 comments:

  1. Good post. It's been a bit of a surprise to me that secondary indexes still seem to be a work in progress for LSM storage engines. So it's a good topic to read and write about.
    Your definitions for local and global indexes make sense, however I think your names have a risk of being overloaded. In a distributed database I'd expect a global index to be sharded separately from the record it is pointing to. Meaning that a point query only needs to access 1 node in the cluster to read the secondary index.

    ReplyDelete
    Replies
    1. Yes, it is overloaded but distributed databases might not have been the first use. Used for partitioned tables (local to partition, global across partitions. Then used for distributed databases -- local to shard, global across shards. Now used for LSM -- local to sorted run, global across sorted runs.

      Delete

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...