Wednesday, November 20, 2019

The delta+main index structure

A reader of my LSM for analytics blog post suggested I read Fast Scans on Key-Value Stores from VLDB 2017. It is an interesting paper that was part of the effort to build the Tell DBMS, and that effort also produced a great PhD thesis.

Tell might not be an answer for my problem because it is an in-memory DBMS and might have write amplification that is too high for my needs. Regardless, the paper is worth reading and discussing.

Index structure drama

I don't publish reviews for papers that aren't worth reading, so keep that in mind for any criticism that follows. The index structure world tends to be low on drama, perhaps more so while we wait for the next round of learned index papers.

The big questions for me are:
  1. Is delta+main a new index structure?
  2. Did SAP Hana pioneer delta+main?

Hana and delta+main

I will start with question 2. The paper states that delta+main was pioneered by SAP Hana (see the Hana paper in SIGMOD Record from 2012) and then cites a SIGMOD 2015 paper about the Analytics in Motion project. I am not an expert in this area but I suspect that C-Store/Vertica (see the VLDB 2005 paper) was another pioneer in this space.

I started to browse the cited papers. There are too many for me to read or even cite, including Fast Updates from VLDB 2012 and Positional Update Handling in Column Stores from SIGMOD 2010. The earliest online paper in this space might be Differential Files from ACM TODS in 1976, and that paper cites even earlier work -- delta+main is great for data stored on tape.

Delta+main index structure

At this point I am only talking about TellStore-Col. I would classify TellStore-Log as index+log.

I am not sure that delta+main is a new index structure. It might be an LSM variant that I have called memtable+L1, where delta is the memtable and main is the L1. Or perhaps it is memtable+L0+L1, where delta is the memtable and the L0 while main is the L1. I have encountered memtable+L1 systems at least twice: once at work and then again with Sophia. This is a great approach when the database:RAM ratio isn't larger than 10.

However, the index structures used in Tell are more complicated than an LSM. They do clever things to make scans faster, things that you might not do with an LSM -- some fields on data pages aren't write-once.

So I won't update the index structures definition just yet, but that doesn't take away from the paper.

Back to the paper

Tell is an in-memory DBMS with three index structures -- TellStore-Log, TellStore-Col and TellStore-Row. I ignore TellStore-Row here. Tell supports mixed workloads -- let's call this HTAP for now -- and must be efficient for scans. It solves some of the problems I described in my LSM for Analytics post.

The paper only describes the use of a hash index but cites another Tell paper that explains how a B+Tree can be used. I assume the hash index is used only (or mostly) for primary key indexes.

TellStore-Log

TellStore-Log is index+log with a hash table for the index. Rows in the log are self-contained. The hash index doesn't have to be checked to determine whether a row in the log is live or visible to a transaction. Thus GC and scans don't have to probe the index which saves CPU overhead.

Rows in the log have fields for valid-from, valid-to and previous. The index points to the latest version of a key and previous versions can be found by following the previous field. The valid-from and valid-to fields are for MVCC visibility. As part of an update, the valid-to field of the previously live version of the row is updated. This confused me until I remembered that this is an in-memory DBMS, the log is in memory and it is possible to update that field in a previously written log page. But this also means that the append-only log isn't write-once because the valid-to field can be updated long after the append.
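
To make the layout concrete, here is a minimal sketch of a log entry and its visibility check. The field names, the types and the use of a max-value sentinel for a still-live version are my assumptions, not the actual TellStore format.

    #include <atomic>
    #include <cstdint>
    #include <limits>

    // Assumption: a still-live version has valid-to set to "infinity".
    constexpr uint64_t kInfinity = std::numeric_limits<uint64_t>::max();

    struct LogEntry {
      uint64_t key;
      uint64_t valid_from;             // commit timestamp of the txn that wrote this version
      std::atomic<uint64_t> valid_to;  // patched in place when a newer version is appended
      LogEntry* previous;              // older version of the same key, or nullptr
      // column values follow in the log record
    };

    // MVCC visibility at snapshot timestamp ts: the version was created at or
    // before ts and had not been superseded as of ts.
    bool IsVisible(const LogEntry& e, uint64_t ts) {
      return e.valid_from <= ts && ts < e.valid_to.load(std::memory_order_acquire);
    }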

Each DBMS table uses a separate log to improve scan efficiency. TellStore-Log also does clever things to piggyback GC on scans.

TellStore-Col

TellStore-Col is delta+main. It uses two logs for delta (update-log and insert-log; a delete is an update) and then PAX for the main store. There is still one hash index that points to rows in the update-log, insert-log and the main store.
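
A rough sketch of those pieces, with made-up names; the point is that a single per-table hash index can point into any of the three structures.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct LogEntry;  // versions appended to the update-log and insert-log
    struct PaxPage;   // columnar (PAX) pages that make up the main store

    struct TellStoreColTable {
      // One hash index per table; a value points into the update-log, the
      // insert-log or the main store, wherever the newest version lives.
      std::unordered_map<uint64_t, void*> index;
      std::vector<LogEntry*> update_log;  // new versions of existing keys
      std::vector<LogEntry*> insert_log;  // rows for keys not yet in the main store
      std::vector<PaxPage*> main;         // rebuilt via copy-on-write during GC
    };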

Row visibility can be treated as a predicate to evaluate. Assume there are N rows on a page and K predicates. There can be K bitmaps, one per predicate, to indicate matching rows and then one more bitmap to indicate the rows that are visible.
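
A sketch of that idea, assuming at most 64 rows per chunk so that a uint64_t can serve as the bitmap; the paper's actual representation may differ.

    #include <cassert>
    #include <cstdint>
    #include <functional>
    #include <vector>

    using RowPredicate = std::function<bool(uint32_t row)>;

    // Build one bitmap per predicate, AND them together, then AND with the
    // visibility bitmap. A row matches only if its bit survives every AND.
    uint64_t MatchingRows(uint32_t n_rows,
                          const std::vector<RowPredicate>& predicates,
                          const RowPredicate& is_visible) {
      assert(n_rows <= 64);  // one uint64_t bitmap covers the chunk
      auto bitmap_for = [n_rows](const RowPredicate& p) {
        uint64_t bm = 0;
        for (uint32_t r = 0; r < n_rows; r++)
          if (p(r)) bm |= (1ULL << r);
        return bm;
      };
      uint64_t result = (n_rows == 64) ? ~0ULL : ((1ULL << n_rows) - 1);
      for (const auto& p : predicates) result &= bitmap_for(p);
      return result & bitmap_for(is_visible);
    }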

Multiple versions of a row can be in the main store and they would be adjacent in a page. There is also a newest pointer per key in the main store that can point into the update-log. The newest pointer gets changed each time a row with that key is appended to the update-log. Rows in the main store also have valid-from and valid-to fields to determine MVCC visibility. When the latest version of a key is in the main store and the key gets updated (appended to the update-log), then the valid-to field in the main store page also gets updated. I assume these changes to main store pages are done in place given that this is an in-memory DBMS.

Multiple versions of a key in the update log are linked via the previous field. But the oldest version of a key in the update log does not point to versions in the main store.
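
Putting the last few paragraphs together, the update path might look like the sketch below. The struct layout, the infinity sentinel and the function are my guesses at how the in-place changes fit together, not TellStore code.

    #include <atomic>
    #include <cstdint>
    #include <limits>

    constexpr uint64_t kInfinity = std::numeric_limits<uint64_t>::max();

    struct UpdateLogEntry {
      uint64_t valid_from;
      uint64_t valid_to;
      UpdateLogEntry* previous;  // older version in the update-log, or nullptr
      // column values follow
    };

    struct MainStoreEntry {
      std::atomic<UpdateLogEntry*> newest{nullptr};  // changed on each update-log append
      uint64_t valid_from = 0;
      std::atomic<uint64_t> valid_to{kInfinity};     // patched in place on update
      // column values live in the PAX page
    };

    // Append a new version for a key whose older versions may be in the
    // update-log and in the main store.
    void UpdateKey(MainStoreEntry& m, UpdateLogEntry* nv, uint64_t commit_ts) {
      UpdateLogEntry* prior = m.newest.load();
      nv->valid_from = commit_ts;
      nv->valid_to = kInfinity;  // the new version is live
      nv->previous = prior;      // link to older update-log versions, if any
      if (prior != nullptr)
        prior->valid_to = commit_ts;  // in-place change to a previously written log entry
      else
        m.valid_to.store(commit_ts);  // in-place change to the main store page
      m.newest.store(nv);             // the index and scans now find the new version
    }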

Eventually GC is done for pages in the main store and that uses copy-on-write, in contrast to the changes mentioned above that are done in place.

The structure of pages in the main store supports efficient scans. Keys that are not visible per MVCC can be excluded via the valid-from and valid-to fields. Keys whose visible version is in the update-log can be found by following the newest pointer from the main store page to the update-log. There is no need to do a merge, like an LSM requires, between rows from the main store and update-log. There is no need to probe the update-log for every key read from the main store. There is no need to scan the update-log as it gets probed on demand.
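
A sketch of the per-key lookup a scan would do, using the same hypothetical layout as the update-path sketch above: probe the update-log only when the newest pointer is set, otherwise let the main store version decide.

    #include <atomic>
    #include <cstdint>

    // Same hypothetical layout as the update-path sketch above.
    struct UpdateLogEntry {
      uint64_t valid_from, valid_to;
      UpdateLogEntry* previous;
    };
    struct MainStoreEntry {
      std::atomic<UpdateLogEntry*> newest;
      uint64_t valid_from;
      std::atomic<uint64_t> valid_to;
    };

    enum class Source { kNone, kMainStore, kUpdateLog };

    // Per-key lookup during a scan: walk the update-log chain only if the
    // newest pointer is set, otherwise the main store version decides. There
    // is no merge of two sorted streams.
    Source VisibleVersion(const MainStoreEntry& m, uint64_t ts,
                          const UpdateLogEntry** out) {
      for (const UpdateLogEntry* e = m.newest.load(); e != nullptr; e = e->previous) {
        if (e->valid_from <= ts && ts < e->valid_to) {
          *out = e;
          return Source::kUpdateLog;
        }
      }
      // The oldest update-log version does not point back to the main store,
      // so the main store version is checked directly.
      if (m.valid_from <= ts && ts < m.valid_to.load()) return Source::kMainStore;
      return Source::kNone;  // older adjacent versions on the page would be checked here
    }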

A scan would also have to read from the insert-log. But such rows are new, so there is no need to merge that data with another data structure.

