How do you use an LSM for analytics? I haven't thought much about it because my focus has been small data -- web-scale OLTP since 2006. It is a great question given that other RocksDB users (Yugabyte, TiDB, Rockset, CockroachDB) support analytics.
This post isn't a list of everything that is needed for an analytics engine. That has been described elsewhere. It is a description of problems that must be solved when trying to use well-known techniques with an LSM. I explain the LSM challenges for a vectorized query engine, columnar storage, analytics indexes and bitmap indexes.
There is much speculation in this post. Some of it is informed -- I used to maintain bitmap indexes at Oracle. There is also a big opportunity for research and R&D in this space. I am happy to be corrected and to be told of papers that I should read.
Vectorized query engine
See MonetDB/X100 to understand the remarkable performance a vectorized query engine can provide. CockroachDB recently explained the performance improvement from a vectorized engine even while remaining on row-wise storage.
Anything for a rowid
Columnar encoding and vectorized processing benefit from an engine that uses rowids, and even more so when the rowid space isn't sparse. In the most compact form a vector of column values has a start rowid, and the offset of a value in the vector is added to the start rowid to compute the rowid for that value. It is less efficient, but feasible, to have a vector of values and a parallel vector of rowids when the rowid space is sparse.
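A minimal sketch of the two layouts (the names are mine, not from any engine):

```python
# Dense layout: the vector stores only values; rowid is implicit.
def rowid_dense(start_rowid, offset):
    """Rowid of the value at `offset` in a dense vector."""
    return start_rowid + offset

# Sparse layout: a parallel vector of explicit rowids is needed.
def rowids_sparse(values, rowids):
    """Pair each value with its explicit rowid when the space is sparse."""
    assert len(values) == len(rowids)
    return list(zip(rowids, values))
```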
But an LSM doesn't use rowids because each row has a unique key, and that key can be variable length and longer than 8 bytes. Rowids are easier to do with heap storage, where the rowid can be <pageID>.<pageOffset>. For an LSM it might be possible to use the WAL LSN as the rowid. I propose something different: per-SST rowids, which are unique within an SST. Rows in an SST are static and ordered, so the rowid is the offset of the row in the SST. When there are N rows in the SST, the per-SST rowid is a value between 1 and N (or 0 and N-1). A rowid that works across SSTs might be <SST number>.<per-SST rowid>.
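A hypothetical packing of <SST number>.<per-SST rowid> into one 64-bit integer might look like this; the field widths are an assumption, not a proposal for specific limits:

```python
SST_BITS = 24          # assumption: up to ~16M SSTs
ROW_BITS = 40          # assumption: up to ~1T rows per SST
ROW_MASK = (1 << ROW_BITS) - 1

def encode_rowid(sst_number, per_sst_rowid):
    """Pack an SST number and a per-SST rowid into one 64-bit rowid."""
    assert 0 <= sst_number < (1 << SST_BITS)
    assert 0 <= per_sst_rowid <= ROW_MASK
    return (sst_number << ROW_BITS) | per_sst_rowid

def decode_rowid(rowid):
    """Recover (SST number, per-SST rowid) from a packed rowid."""
    return rowid >> ROW_BITS, rowid & ROW_MASK
```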
To use the per-SST rowid there must be an efficient way to get the key for a given rowid within an SST. That can be solved by a block index in the SST that stores the minimum rowid per block.
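A sketch of that lookup, assuming the block index is a sorted array of per-block minimum rowids:

```python
import bisect

def find_block(min_rowids, rowid):
    """Map a per-SST rowid to the data block that contains it.
    min_rowids is sorted; return the index of the last block whose
    minimum rowid is <= rowid."""
    i = bisect.bisect_right(min_rowids, rowid) - 1
    assert i >= 0, "rowid precedes the first block"
    return i
```

Once the block is located, the key for the rowid is read at offset (rowid - block minimum rowid) within the block.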
Columnar storage

The benefits from columnar storage are compression and processing. Datatype-specific compression over a sequence of column values provides excellent compression ratios. There have been many interesting papers on that, including Facebook Gorilla and C-Store. The processing benefit comes from iterating over possibly compressed vectors of column values while minimizing data movement.
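As a toy example of datatype-specific encoding, delta encoding of an integer column; the schemes in Gorilla and C-Store are far more elaborate, this only shows the shape of the idea:

```python
def delta_encode(values):
    """Store each value as the difference from its predecessor.
    Small deltas compress well for sorted or slowly-changing columns."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    """Invert delta_encode by accumulating the deltas."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out
```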
I assume a vectorized query engine is easier to implement than write-optimized columnar storage, but Vertica and Kudu are proof that it is possible to do both. However neither Vertica nor Kudu uses an LSM, and this blog post is about LSMs. First I explain how to do columnar storage and then how to do vectorized processing.
While fully columnar storage can be done, I explain a PAX approach that is columnar within an SST -- all columns are in the SST but stored separately. Each column gets its own data blocks and data block index. The block index in this case has the minimum rowid per data block. Similar to Kudu, the primary key (whether composite or not) is also stored in its own data blocks with a block index. That can also be used to map a PK to a rowid. Optionally, the PK data blocks can store the full rows. Otherwise a row can be reconstructed from the per-column data blocks. The LSM must know the schema, and WiredTiger shows how to do that.
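A toy in-memory model of this PAX-within-SST layout; the names and the tiny block size are illustrative only:

```python
BLOCK_ROWS = 4  # tiny for illustration; real data blocks hold many rows

def build_pax_sst(schema, rows):
    """rows are tuples matching `schema`; the per-SST rowid is the
    position of the row in the SST. Every column gets its own data
    blocks plus a block index of per-block minimum rowids."""
    sst = {}
    for col_pos, col_name in enumerate(schema):
        blocks, min_rowids = [], []
        for start in range(0, len(rows), BLOCK_ROWS):
            chunk = rows[start:start + BLOCK_ROWS]
            min_rowids.append(start)              # minimum rowid in block
            blocks.append([r[col_pos] for r in chunk])
        sst[col_name] = {"blocks": blocks, "min_rowids": min_rowids}
    return sst
```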
The solution above shows how to get the compression benefits from columnar storage. But an LSM range query scans each level of the LSM tree, and a merge iterator processes those scans to filter rows that have been deleted while also respecting visibility. For example, with the same key on levels 1, 3 and 5 of the LSM tree it might be correct to return the value from level 1, level 3 or level 5 depending on the query timestamp and the per-key timestamps. It is hard to decide which version of the key to keep in isolation -- the merge iterator needs to see all of the versions of a key at the same time.
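The visibility rule can be sketched as follows, assuming each version of a key carries a write timestamp and a tombstone is represented by None:

```python
def visible_version(versions, query_ts):
    """versions: (write_ts, value) pairs for one key, gathered from all
    levels of the tree; value None marks a tombstone. Return the value
    of the newest version with write_ts <= query_ts, or None when the
    key is deleted or absent as of query_ts."""
    best = None
    for write_ts, value in versions:
        if write_ts <= query_ts and (best is None or write_ts > best[0]):
            best = (write_ts, value)
    return best[1] if best else None
```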
While the merge iterator output can be put back into a columnar format, a lot of work has been done by that point. It would be better to push some processing below (or before) the merge iterators. It is easier to push filtering and projection. It is harder to push anything else such as aggregation.
A recent blog post from CockroachDB shows how to get significant benefits from vectorized processing without using columnar storage. I don't want to diminish the value of their approach, but I am curious if it can be enhanced by columnar storage.
Filters are safe to push below a merge iterator. Assume there is a predicate like column < 3. It can be evaluated by scanning the column in an SST to find the rowids that satisfy the predicate, using the found-rowid set to reconstruct the rows (or the projected columns from the rows) and returning that data. Multiple filters on the same column or on multiple columns can also be evaluated in this fashion.
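A sketch of that per-SST evaluation, assuming columns are plain vectors indexed by per-SST rowid:

```python
def filter_rowids(column, predicate):
    """Scan one column within an SST and return the per-SST rowids of
    the values that satisfy the predicate (the found-rowid set)."""
    return [rowid for rowid, v in enumerate(column) if predicate(v)]

def project(columns, rowids):
    """columns: dict of column name -> vector for one SST. Reconstruct
    the projected columns for the found-rowid set."""
    return [{name: vec[r] for name, vec in columns.items()}
            for r in rowids]
```

Filters on several columns compose by intersecting the found-rowid sets before projection.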
MySQL has options to push non-index predicates down to the storage engine. While the original use case was MySQL Cluster, where storage is across the network from the SQL/compute nodes, this feature can also be used for analytics with MySQL.
Aggregation is harder to push below a merge iterator because you need to know whether a given key would be visible to the query while processing an SST and that requires knowing whether there was a tombstone for that key at a smaller level of the LSM tree. That might use too much state and CPU.
Userland columnar

By userland columnar I mean doing columnar encoding above the engine. In this approach encoded column chunks are returned by point and range queries. Filtering isn't pushed down, but the overhead of the LSM merge is amortized because each operation returns a chunk (vector) of columns, and datatype-specific encoding can be done to get excellent compression ratios.
There are a few challenges with this: updates and rowids. First, updates mean there will be small deltas that must be merged with the compressed row chunks. This problem isn't unique to userland columnar, as updates are always an issue for columnar analytics engines. Part of the solution might be to treat an update as a delete followed by an insert, but vectorized processing still has to know which rows have been updated or deleted. With RocksDB the merge operator might be used to associate updates and deletes with the encoded chunk of rows.
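A hypothetical merge-operator shape for this; the operand encoding is my invention, not the RocksDB API, and the chunk is modeled as a dict keyed by PK rather than a real encoded column chunk:

```python
def full_merge(chunk, operands):
    """Apply delete/update deltas to an encoded chunk at read time,
    in the spirit of a RocksDB full merge.
    chunk: dict pk -> row (the base value).
    operands: oldest-first list of ('del', pk) or ('put', pk, row)."""
    merged = dict(chunk)          # leave the base value untouched
    for op in operands:
        if op[0] == 'del':
            merged.pop(op[1], None)
        else:  # 'put' records an insert or an update
            merged[op[1]] = op[2]
    return merged
```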
Given that the rowid must be specified by the user, there must be a way to map a PK to a rowid when inserts, updates and deletes are processed. Assuming all rows come from a log, the log offset might serve as the rowid on insert. Another option that has more overhead is to maintain a PK -> rowid index within the LSM to assign rowids to new PK values and to look up the rowid for existing ones. During an update or delete there must be a way to map the PK of the row back to a rowid. If a PK -> rowid index is used then this is easy. Otherwise I will limit my speculation for now.
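A sketch of the two options together, with a dict standing in for the PK -> rowid index that would live inside the LSM:

```python
class RowidAssigner:
    """Toy model: the log offset serves as the rowid on insert, and a
    PK -> rowid index lets update/delete map a PK back to its rowid."""

    def __init__(self):
        self.pk_to_rowid = {}    # stands in for a PK -> rowid LSM index

    def on_insert(self, pk, log_offset):
        """Assign the log offset as the rowid for a new PK."""
        self.pk_to_rowid[pk] = log_offset
        return log_offset

    def rowid_for(self, pk):
        """Used by update/delete to find the rowid for an existing PK."""
        return self.pk_to_rowid.get(pk)
```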
If the rowid management issue can be resolved then this is an interesting idea. In the end it comes down to efficiency. How many columns per second per CPU core can be processed by a query engine? Faster is better gets too much press. I want to make "more efficient is better" and "greener is better" popular.
Rockset uses userland columnar, so they are in a position to correct my speculation and to document the efficiency of this approach. Perhaps MonetDB-Lite or DuckDB can be the baseline.
Analytics indexes

Analytics indexes are another way to prune the amount of data that must be searched for scan-heavy queries. There was an interesting paper at SIGMOD 2018 that evaluated this approach for LSM: A Comparative Study of Secondary Indexing Techniques in LSM-based NoSQL Databases. The BRIN in Postgres is another example. These can be described as local secondary indexes (local to an SST or data block), as opposed to a global secondary index as provided by MyRocks. The idea is to maintain some attributes per data block or per SST about columns that aren't in the LSM index. This could be the min and max value of that column, or a bloom filter. These attributes can be used to prune that data block or SST during a scan.
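A toy version of the min/max (BRIN-style) pruning:

```python
def prune_blocks(zone_maps, lo, hi):
    """zone_maps: list of (min, max) per data block for a column that
    isn't in the LSM index. Return the indexes of blocks that may
    contain a value in [lo, hi]; the rest are skipped by the scan."""
    return [i for i, (mn, mx) in enumerate(zone_maps)
            if mx >= lo and mn <= hi]
```

A block that survives pruning still has to be scanned; the zone map only proves absence, not presence.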
Bitmap indexes

Has there been any work on bitmap indexes for an LSM -- a paper, a proof-of-concept or a real system? Once per-SST rowids are implemented as described above, per-SST bitmap indexes can also be provided. These would index columns that are not in the LSM index. Use of the bitmap indexes would determine the found-rowid set within each SST.
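A sketch of a per-SST bitmap index, using Python integers as bitmaps over per-SST rowids; a real system would use compressed bitmaps:

```python
def build_bitmaps(column):
    """One bitmap per distinct value; bit r is set when the row with
    per-SST rowid r has that value."""
    bitmaps = {}
    for rowid, v in enumerate(column):
        bitmaps[v] = bitmaps.get(v, 0) | (1 << rowid)
    return bitmaps

def rowids(bitmap):
    """Expand a bitmap into the found-rowid set."""
    out, rowid = [], 0
    while bitmap:
        if bitmap & 1:
            out.append(rowid)
        bitmap >>= 1
        rowid += 1
    return out
```

Predicates combine bitmaps with bitwise AND/OR before expanding them into rowids.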
The cost of per-SST indexes
There is much speculation in this blog post. In several places above I mention that special indexes can be maintained per SST. How many SSTs, and thus how many per-SST indexes, will there be in a typical setup? I assume that an LSM for analytics would use larger SSTs. There are 2^12 (4096) SSTs per 1TB of data when the SST size is 256MB. I am not sure whether that would be a problem.
Updates

I added the Userland columnar section to explain how Rockset might be avoiding the merge iterator overhead.
Someone suggested I read Fast scans on key-value stores. It looks interesting.