Wednesday, October 21, 2020

Self-aware DBMS, RID-lists for LSM trees

This is a short writeup of two ideas I might never have the time to implement -- self-aware DBMS and RID-lists for LSM tree.

Self-aware DBMS

Tuning a B-Tree index structure is hard and tuning an LSM tree is usually harder. Fortunately it will get easier over time. We tune a DBMS today via knobs and there is clever work in progress to automate changing of those knobs. But we can do even better. I hope one day for self-aware DBMS algorithms that understand the impact of their knobs, accept declarative goals & priorities from a user and are able to improve performance with respect to an objective function. Lets move the cleverness into the DBMS.

RID-lists for LSM trees

tl;dr - can I use the merge operator to implement RID-lists for an LSM?

I worked on bitmap indexes at Oracle. It would be fair to call them compressed RID-lists. Many smart people worked on that code in Oracle prior to me and the inventor, Gennady Antoshenkov, has a strong record of innovation. IBM DB2 has RID-lists, but I don't know whether they can be compressed. All of these are useful for analytics whether this is a bitmap index, compressed RID-list or uncompressed RID-list.

Postgres 13 has index deduplication which looks like an uncompressed RID-list and I look forward to new query execution features to do the equivalent of bitmap-AND and bitmap-OR.

But this is about a RID-list for an LSM tree. A RID-list is useful for an index that can have many duplicates for a key -- meaning some secondary indexes and not for a primary index. The LSMs I care about don't know much about schema and secondary index maintenance requires knowledge of schema but I will ignore that for now. MyRocks uses RocksDB for primary and secondary indexes, so this is something that can be solved by an application. WiredTiger supports schemas and secondary indexes without also providing a query language and someone could add the same to RocksDB.

I am curious whether the merge operator can be used to implement RID-lists? Disclaimer - I am far from an expert on the merge operator.

  • the rowid is the value of the primary key. Secondary index entries for an LSM and for Oracle IOT use the primary key value as the rowid because the base row can be relocated (during compaction with an LSM, during B-Tree leaf node splits and merges with IOT). Bitmap indexes for Oracle IOT solves this via a mapping table. If that is not done then the problem is that the primary key value might be large and a list of them might not be compression friendly.
  • the delta that is written by the merge operator specifies whether a rowid is added or removed for a given secondary key value. I wonder if something like SingleDelete is needed.
  • the callback function that combines merge operator entries during compaction will merge entries to create larger RID-lists


  1. Looks like technically doable, as long as the RID list doesn't get too large.

    But RID list can get large - there's nothing that prevents the user from doing something like

    UPDATE tbl SET tbl.key='foo' WHERE 1

    and the RID list for the key='foo' will be the entire dataset.

    I guess one can workaround this by using "lossy" a RID list that describes a superset (e.g. "min_value < $ROWID < max_value" in the most pathological case)...

    What will happen with read amplification? A single read will have to A) Read all levels and B) Combine the Merge operators. This looks expensive...

    1. Large RID lists aren't a problem. Index entries can be limited to be no larger than 1/2 of an index block. Assuming the user creates an index on (a,b) then the real key for an index entry is (a,b,s) where s is the start or min rowID in the RID list for that index entry.

      Oracle bitmap indexes work like this. Naive bitmap indexes do not work like this.

      Read-amp for point queries can be worse, as you mention. I expect that to be reduced by using LSM trees with fewer levels because this is for analytics workloads. But this is something that needs more investigation.

      Another area to be researched is a strategy for doing merge operator rewrites during queries, because the query forced operators to be merged for a given key. Sometimes it is useful to rewrite that result into the LSM tree and not wait for compaction.

  2. So, individual row updates will write Merge records:

    insert into tbl values (... key='foo', pk=rowid1);
    # this will write Merge(key='foo', value='insert rowid1')

    delete from tbl where pk=rowid1
    # this will write Merge(key='what-the-row-had', value='remove rowid1')

    delete from tbl where key='foo'
    # this command can write a Delete(key='foo').

    But it seems there's no operation that will write a Put(key='foo', new_rid_list). Is that correct?

    1. Good question. I need to revisit: