Tuesday, November 12, 2019

Using an LSM for analytics

How do you use an LSM for analytics? I haven't thought much about it because my focus has been small data -- web-scale OLTP since 2006. It is a great question given that other RocksDB users (Yugabyte, TiDB, Rockset, CockroachDB) support analytics.

This post isn't a list of everything that is needed for an analytics engine. That has been described elsewhere. It is a description of problems that must be solved when trying to use well-known techniques with an LSM. I explain the LSM challenges for a vectorized query engine, columnar storage, analytics indexes and bitmap indexes.

There is much speculation in this post. Some of it is informed -- I used to maintain bitmap indexes at Oracle. There is also big opportunity for research and for R&D in this space. I am happy to be corrected and be told of papers that I should read.

Vectorized query engine

See MonetDB and X100 to understand the remarkable performance a vectorized query engine can provide. CockroachDB has recently explained the performance improvement from a vectorized engine even when remaining on row-wise storage.

Anything for a rowid

Columnar encoding and vectorized processing benefit from an engine that uses rowids and even more so when the rowid space isn't sparse. In the most compact form a vector of columns has a start rowid and the offset of a value in the vector is added to the start rowid to compute the rowid for a value. It is less efficient but feasible to have a vector of values and a vector of rowids when the rowid space is sparse.

But an LSM doesn't use rowids as each row has a unique key and that key can be variable length and more than 8 bytes. Rowids are easier to do with heap storage as the rowid can be <pageID>.<pageOffset>. For an LSM it might be possible to use the WAL LSN as the rowid. I propose something different for per-SST rowids, rowids that are unique within an SST. Rows in an SST are static and ordered so the rowid is the offset of the row in the SST. When there are N rows in the SST then the per-SST rowid is a value between 1 and N (or 0 and N-1). A rowid that works across SSTs might be <SST number>.<per-SST rowid>.

To use the per-SST rowid there must be an efficient way to get the key for a given rowid within an SST. That can be solved by a block index in the SST that stores the minimum rowid per block.

Columnar storage

The benefits from columnar storage are compression and processing. Datatype specific compression over a sequence of column values provides excellent compression ratios. There have been many interesting papers on that including Facebook Gorilla and C-Store. The processing benefit comes from iterating over possibly compressed vectors of column values while minimizing data movement.

I assume a vectorized query engine is easier to implement than write-optimized columnar storage but Vertica and Kudu are proof that it is possible to do both. However neither Vertica nor Kudu use an LSM and this blog post is about LSM. First I explain how to do columnar storage and then how to do vectorized processing.

While fully columnar storage can be done I explain a PAX approach that is columnar within an SST -- all columns are in the SST but stored separately. Each column gets its own data blocks and data block index. The block index in this case has the minimum rowid per data block. Similar to Kudu, the primary key (whether composite or not) is also stored in its own data blocks with a block index. That can also be used to map a PK to a rowid. Optionally, the PK data blocks can store the full rows. Otherwise a row can be reconstructed from the per-column data blocks. The LSM must know the schema and WiredTiger shows how to do that.

Columnar processing

The solution above shows how to get the compression benefits from columnar storage but an LSM range query does a scan of each level of the LSM tree that is processed by a merge iterator to filter rows that have been deleted while also respecting visibility. For example with the same key on levels 1, 3 and 5 of the LSM tree it might be correct to return the value from level 1, level 3 or level 5 depending on the query timestamp and per-key timestamps. But it is hard to decide which version of the key to keep in isolation -- the merge iterator needs to see all of the keys at the same time.

While the merge iterator output can be put back into a columnar format, a lot of work has been done by that point. It would be better to push some processing below (or before) the merge iterators. It is easier to push filtering and projection. It is harder to push anything else such as aggregation.

A recent blog post from CockroachDB shows how to get significant benefits from vectorized processing without using columnar storage. I don't want to diminish the value of their approach, but I am curious if it can be enhanced by columnar storage.

Filters are safe to push below a merge iterator. Assume there is a predicate like column < 3 then that can be evaluated by scanning the column in an SST to find the rowids that satisfy the predicate, using the found rowid set to reconstruct the rows, or projected columns from the rows, and returning that data. Multiple filters on the same or multiple columns can also be evaluated in this fashion.

MySQL has options to push non-index predicates to the storage engine. While the original use case was Cluster where storage is across the network from the SQL/compute this feature can also be used for analytics with MySQL.

Aggregation is harder to push below a merge iterator because you need to know whether a given key would be visible to the query while processing an SST and that requires knowing whether there was a tombstone for that key at a smaller level of the LSM tree. That might use too much state and CPU.

Analytics indexes

Analytics indexes are another way to prune the amount of data that must be searched for scan-heavy queries. There was an interesting paper at SIGMOD 2018 that evaluated this approach for LSM. The BRIN in Postgres is another example of this. These can be described as local secondary indexes (local to an SST or data block) as opposed to a global secondary index as provided by MyRocks. The idea is to maintain some attributes per data block or per SST about columns that aren't in the LSM index. This could be the min and max value of that column or a bloom filter. This can be used to prune that data block or SST during a scan.

Bitmap indexes

Has there been any work on bitmap indexes for LSM as in a paper, proof-of-concept or real system? Once per-SST rowids are implemented as described above then per-SST bitmap indexes can also be provided. These would index columns not in the LSM index. Use of the bitmap indexes would determine the found rowid set within each SST.

The cost of per-SST

There is much speculation in this blog post. In several places above I mention that special indexes can be maintained per SST. How many SSTs, and thus how many per-SST indexes, will there be in a typical setup? I assume that an LSM for analytics would use larger SSTs. There are 2^12 SSTs per 1TB of data when the SST size is 256GB. I am not sure whether that would be a problem.

Monday, November 11, 2019

Linux syscall performance regressions explained

There is a paper in SOSP 2019 that explains Linux system call performance for recent kernels. Explaining performance is what I do so I appreciate the effort that went into this paper:
  • Linux syscall perf mostly improved from 3.0 to 4.0 then went bad starting with 4.7 thanks to 5 security changes - Spectre, Meltdown, harden usercopy, randomize slab freelist, user pagefault handling.
  • Linux has a lot of variance in syscall performance across releases, more than I am used to with MySQL. But MySQL still has regressions across release just without so much variance. Interesting posts for Postgres perf across versions are here, here and here. AFAIK Linux can't move as fast as we need it to and avoid the variance, so the it is caught and fixed after the fact.
  • Results are for Intel CPUs. It would be interesting to see results for AMD and for Intel with and without HT enabled.

Friday, November 8, 2019

Jungle - LSM plus copy-on-write B-Tree

This is a review of Jungle which is an LSM variant that uses a copy-on-write (CoW) B-Tree internally. One of the Jungle developers previously invented ForestDB. I am a fan of his work.

At a high level Jungle is an LSM with leveled compaction but thanks to the CoW B-Tree it has different read, write and space amplification tradeoffs. I don't understand Jungle well enough to explain where it is better, but I am happy to share this review and accept corrections. My summary of Jungle is:
  • One sorted run per level
  • The per level file structure uses an index+log approach where the index is a CoW B-Tree, values are appended to the value log and the B-Tree entries point into the log. There is also a bloom filter.
  • Inter-level merge does compaction between adjacent levels. I think this is some-to-some as some data is moved from Ln to Ln+1 in a batch, values are appended to the value log, keys are inserted into the B-Tree and modified B-Tree pages are persisted by appending to the end of the B-Tree files. Because of CoW the path from leaf to root is made dirty when a leaf page is modified.
  • In-place merge does GC within a level to reclaim space from the B-Tree files and value log. The B-Tree is scanned in order to write a new B-Tree and new value log. Space is wasted because updates were appended to the end of the B-Tree file and value log.

The index+log approach is used to reduce write amplification from large values. The per-level write-amp from moving a KV pair from Ln to Ln+1 is the sum of the write-amp from adding keys to the B-Tree and from appending to the end of the value log. 

Assuming a batch of KV pairs is inserted then write-amp for the value log is minimal -- close to 1. If 32 1kb values are moved then 32kb is written. I am uncertain about the average and worst case write-amp for the B-Tree even when I only consider write-amp for the leaf pages and ignore the non-leaf pages. For the worst-case assume that each key makes a leaf page dirty.  Then for each KV pair with a small key and 1kb value there is 4kb + 1kb written (4kb for B-Tree, 1kb for value log) and the per-level write-amp is ~5. That is a good worst-case. I frequently see per-level write-amp of ~5 for production workloads with RocksDB so I wonder what the average case will be for Jungle.

There is additional write-amp from doing periodic in-place merges to reclaim space. I won't try to estimate the impact from that.

  • The CoW B-Tree in Jungle is CoW-S because writes are appended and GC must eventually be done.
  • While ordering values in RocksDB has a cost, more write-amp, it also has a benefit, less cache-amp. RocksDB needs a pointer (index entry) in memory per block to achieve a worst-case of ~1 disk read per point query -- see the RocksDB data block index. With index+log the values are not in key order and this needs a pointer (index entry) in memory per KV pair. Assuming these pointers are ~8 bytes there is a huge difference in memory overhead between 8 bytes / database page and 8 bytes per KV pair assuming KV pairs are not too big. Per the CRUM Conjecture it is hard to be better in all dimensions -- read, write, disk space and memory. 
  • It will be hard to do compression for the value log if only a few values are appended at a time. But if each inter-level merge adds many 4kb pages worth of data to the value log then this isn't a problem.
  • Range scans are more expensive with index+log because values are not in key order and each value might require a storage IO and the CPU overhead for decompression. This should only be a problem for the largest level of the LSM tree.

Thursday, November 7, 2019

Revisiting Kudu

I read the Kudu paper many years ago. It is worth reading but I never published my review so I re-read the paper this week. My review is limited to the paper and I ignore improvements since then. My focus is on storage so I also ignore most of the distributed system features.

I don't have access to my old review of the paper when I read it years ago. I might have considered it to be an LSM variant. Now I consider it to be an example of an index+log index structure.

Kudu is a scale-out index structure for analytics engines like Impala. The goals for Kudu are fast columnar scans, low latency updates and low performance variance. I assume that Kudu satisfied those goals. The features include:
  • Data is stored in tables and a table has a fixed schema. I am curious about the demand for flexible schemas.
  • It is primary key only. It uses Raft to make sure replicas stay in sync. If global secondary indexes were supported then something like XA across shards would also be needed. That is much harder, but feasible -- see Spanner, CockroachDB and Yugabyte.
  • Supports add/drop column but PK columns can't be dropped. I am not sure whether the PK can be changed.
  • Supports insert, update, delete and range query. It might consider a point query to be a simple range scan. Update and delete must fully specify the PK values for the rows to be changed.
  • Supports range and hash partitioning but that paper section confused me
  • Supports single-row transactions. I assume a batch can be submitted but I am not sure how per-row outcomes are reported to a client.
  • Compaction uses cost-based decisions to merge data where the objective function is to improve read efficiency. I am a big fan of that.


I think this is an example of an index+log index structure where the DiskRowSets are log segments. GC is clever and there are two parts to it. First, there is compaction between DeltaMemStore and DiskRowSet. This removes deleted rows and merges long update chains. Second, there is compaction between DiskRowSets. This reduces the number of places that must be checked for a given key range.

In the standard index+log approach only the first type of compaction is done for a log segment. That usually requires all of the index to be in memory because each row from a log segment needs an index probe to determine whether the row is live. For Kudu a search of the DeltaFile determines liveness. It is not clear to me whether DeltaMemFiles are clustered per DiskRowSet to reduce the amount of data that should be in memory when such compaction is done.

I briefly cover the second type of compaction at the end of my blog post. The benefit from merging log segments with index+log is larger sorted runs. Write-optimized index structures impose a CPU and IO read efficiency penalty. Merging log segments like this reduces that penalty.


Tables are horizontally partitioned into tablets and each tablet consists of a MemRowSet, many DiskRowSets, a DeltaMemStore and many DeltaFiles. Inserts are written into a MemRowSet, when full a MemRowSet is flushed to disk creating a DiskRowSet. A DiskRowSet is limited to ~32MB. Only inserts are directly written into DiskRowSets. Because there is a PK check on insert a row will exist in at most one DiskRowSet.

Deletes and updates are first written into a DeltaMemStore and when full that is flushed to disk to create a DeltaFile. While only inserts are directly written to a DiskRowSet. Eventually a DeltaMemStore will be merged with a DiskRowSet so the effect of update and delete eventually reach a DiskRowSet.

Storage is columnar for the rows in a DiskRowSet. Encoding takes advantage of data types to use less space and block compression is optional. This output is written to a sequence of pages for each column. Then there is a B-Tree index that maps ID to page where ID is the ordinal offset of the row within the DiskRowSet (when a DiskRowSet has 1M rows then the ID is from 1 to 1M assuming it starts at 1). Although I am not sure if that index has an entry per page or per row.

There is also a PK index column that has the encoded PK values (encoded because the key is a string even when the PK has more than one column). It wasn't clear to me whether there was an index built on that or if binary search was used. I assume the value in this case is the ID (ordinal offset) for the row. A paged bloom filter is created for the PK index column.

Range scans will start by searching the PK index column to determine the row IDs that must be retrieved and then for each column that must be returned search the per-column indexes using those IDs.

Friday, November 1, 2019

Fun setting up Ubuntu on Win 10 Home

I feel like I have gone back to Linux from 25 years ago when it was quite a challenge to install, configure storage, setup networking with a modem and then getting X working. My setup is latest Windows 10 Home, new-ish HP Omen laptop, Ubuntu 18.04 and VMWare Player 15.5.0. Some useful advice is here.

I lost a few hours today setting up Ubuntu on a laptop running Win 10 Home. The magic that fixed this for VirtualBox and VMWare was to run the following command in the Command Prompt app, run as an admin. It does not work if you try to run it in PowerShell. Reboot after running this:
bcdedit /set hypervisorlaunchtype off
This was a not fun experience shared by many others trying to use VirtualBox and VMWare. I assume the problem is Win 10 and not the VM apps. As much as I don't like the Mac keyboard, each time I try to switch to Windows I quickly realize it is a bad idea. While WSL and WSL2 sound OK, I really want the full Linux experience and don't want to deal with running an X server separately in Windows to get the UI for Linux apps.

For those running Ubuntu natively on their laptops with great keyboards, I salute you. You have every right to snicker at me.

Shared folders

Shared folders work except for automount. This command fixes that:
vmhgfs-fuse .host:/ /mnt/hgfs -o subtype=vmhgfs-fuse,allow_other

The alternative is to use the VMWare Player UI to disable/reenable shared folders each time I reboot the VM. The best solution is to add something to /etc/fstab and I will do that soon.