Tuesday, November 26, 2019

iostat output changes in Ubuntu 18.04

The output format has changed for iostat -kx 1 in Ubuntu 18.04. The changes include reordered columns, renamed columns and new columns. My benchmark scripts scrape this output and use it to compute efficiency metrics so I need to update the scripts to catch up to Ubuntu. I should also update them to deal with future output format changes, or better yet, just scrape /proc/diskstats directly.

Next up - I should determine whether bytes trimmed are still included in bytes written. It will be nice when the bytes trimmed column shows up in iostat but that requires kernel 4.18 or 4.19 per this, and with Ubuntu 18.04 (the latest LTS) I use 4.15.

This bug report explains the differences.

  • rrqm/s and wrqm/s were moved from columns 2, 3 to 6, 7
  • avgqu-sz was renamed to aqu-sz
  • await was removed - r_await and w_await remain
  • avgrq-sz was replaced by rareq-sz and wareq-sz
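Scraping /proc/diskstats directly avoids the format churn. A minimal sketch (field layout per the kernel docs; the sample line and helper names are mine):

```python
# Parse /proc/diskstats; field layout per Documentation/admin-guide/iostats.rst.
# Sectors in this file are always 512 bytes, regardless of device block size.

FIELDS = ("reads", "reads_merged", "sectors_read", "ms_reading",
          "writes", "writes_merged", "sectors_written", "ms_writing",
          "io_in_progress", "ms_doing_io", "weighted_ms")

def parse_diskstats(text):
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 14:
            continue
        name = parts[2]
        stats[name] = dict(zip(FIELDS, (int(v) for v in parts[3:14])))
    return stats

def kb_written(dev):
    return dev["sectors_written"] * 512 // 1024

sample = "   8       0 sda 1000 10 80000 500 2000 20 160000 900 0 700 1400"
stats = parse_diskstats(sample)
print(kb_written(stats["sda"]))  # 160000 sectors * 512B = 80000 KB
```

This stays correct across sysstat releases because the kernel file format is stable; new trailing fields (like discard counters in 4.18+) are simply ignored by the slice.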

Monday, November 25, 2019

Throttling writes: LSM vs B-Tree

Reducing response time variance is important for some workloads. This post explains sources of variance for workloads with high write rates when the index structure is an LSM or a B-Tree. I previously wrote about this in my post on durability debt.

Short summary:
  1. For a given write rate stalls are more likely with a B-Tree than an LSM
  2. Many RocksDB write stalls can be avoided via configuration
  3. Write stalls with a B-Tree are smaller but more frequent versus an LSM
  4. Write stalls are more likely when the redo log isn't forced on commit
  5. The worst case difference between an LSM and B-Tree is larger when the working set isn't cached
  6. Life is easier but more expensive when the working set fits in cache
  7. Less write amplification saves IO for other uses
Less short summary:
  1. Write stalls for an LSM occur when compaction has trouble keeping up with the incoming write rate. The worst stalls occur at write rates that a B-Tree could not sustain. One way to mitigate stalls is to reduce the write rate. Another way is to use an index structure that doesn't support or is inefficient for range scans (see index+log).
  2. The cost from configuring RocksDB to avoid write stalls is more CPU overhead on reads as there will be more data in the upper levels of the LSM. I am partly to blame for the default configuration in RocksDB that throttles writes when the LSM tree gets too much data in the L0, L1 and L2. But that configuration can be changed.
  3. SQLite4 has a clever LSM designed for systems that don't allow background threads. It implements a pay as you go approach to durability debt. A traditional LSM takes the opposite approach -- it defers the IO cost to the background. RocksDB has optional write throttling and work has been done to smooth the impact from it but it is not solved. A B-Tree in the worst case (buffer pool full and mostly dirty, working set not cached) also implements a pay as you go approach.
  4. I almost always disable sync-on-commit for benchmarks because I want to observe how the DBMS behaves under stress, and less commit latency means more writes/second and more IO stress.
  5. See item #6 where I argue that it is good to not have the working set cached.
  6. A common rule of thumb has been to keep all indexes in cache or all of the working set in cache. That simplifies tuning and makes it easier to avoid performance problems. But that also might force a deployment to use 2X more HW than it needs because NAND flash SSDs are everywhere and the response time difference between reading from RAM and reading from NAND flash might not matter for many applications. But if you are using a DBMS in the cloud that charges by the IO, then keeping the working set in RAM might be a good idea.
  7. An LSM usually has less write-amp than a B-Tree. So the IO capacity it saves from that can be used elsewhere to support more read or write transactions.
Worst case behavior

I am wary of faster is better. I prefer nuance but I also know that people don't have time to read long blog posts like this or long performance reports. Here I explain worst case behavior in terms of IO overheads. Worst case behavior isn't the only way to judge an index structure but it helps me to explain performance. Another way is to measure the average amount of IO per transaction (in operations and KB) and treat IO efficiency as important.

I describe worst case behavior for a write operation under a few scenarios. By worst case I mean the largest amount of IO done in the foreground (the thread handling the write) as that determines the response time. I ignore the work done in the background which favors an LSM because that defers more work to the background. For a B-Tree I ignore undo and page splits. The write is a SQL update which is read-modify-write, as opposed to a blind-write like a Put with RocksDB. Finally, I assume the update isn't to an indexed column. The scenarios are:
  1. Cached, PK only - working set cached, PK index only
  2. Not cached, PK only - working set not cached, PK index only
  3. Cached, PK and secondary index - working set cached, PK and non-unique secondary index
  4. Not cached, PK and secondary index - working set not cached, PK and non-unique secondary index 
PK only

For the cached, PK only scenario neither an LSM nor a B-Tree do IO in the foreground with the exception of the redo log fsync. Stalls are unlikely for both but more likely with a B-Tree especially when the DBMS storage uses a spinning disk.
  • An LSM writes the redo log buffer, optionally syncs the redo log and then does an insert into the memtable. Both memtable flush and Ln:Ln+1 compaction are deferred to background threads. If memtable flush were too slow then there are write stalls until flush catches up to avoid too many memtables wasting memory.
  • A B-Tree modifies a page in the buffer pool, writes the redo log buffer and optionally syncs the redo log. If checkpoint were too slow a full redo log can't be rotated until checkpoint catches up and there are write stalls.
For the not cached, PK only scenario the work done in the foreground is 1 IO/update for an LSM and 2 IO/update for a B-Tree. Here a B-Tree uses a pay as you go model.
  • An LSM reads a page into the block cache and then repeats the work described in cached, PK only
  • A B-Tree finds a dirty page to evict, writes that page back to storage, then reads the desired page into that slot in the buffer pool and repeats the work described in cached, PK only.

PK and secondary index

For the cached, PK and secondary index scenario there is approximately twice as much work to be done per update compared to the cached, PK only scenario. Thus stalls are more likely here. But other than the optional redo fsync there is no foreground IO for the LSM and B-Tree.
  • An LSM repeats the work explained in the cached, PK only scenario. For the secondary index it does an additional insert to the memtable which is also logged as redo. This can double the demand for compaction.
  • A B-Tree repeats the work explained in the cached, PK only scenario. For the secondary index it makes an additional page dirty in the buffer pool. This can double the demand for page write back.
For the not cached, PK and secondary index scenario the foreground IO difference between an LSM and B-Tree is more significant -- 1 IO for the LSM vs 4 IO for the B-Tree -- ignoring the redo log overhead. The IO difference is reduced from 1:4 to approximately 1:2 for a B-Tree like InnoDB that implements a change buffer.
  • An LSM does the union of the work described in not cached, PK only and cached, PK and secondary index scenarios. Ignoring the optional redo fsync the cost is 1 read IO for the PK index and no reads for the secondary index because non-unique secondary index maintenance is read-free.
  • A B-Tree repeats the work explained in the cached, PK only scenario but this is done for both the PK and secondary indexes. Thus the cost is 2 IOs to write back dirty pages and then 2 IOs to read pages from the PK and secondary indexes into the buffer pool and then make them dirty -- which then requires redo log writes. So the cost for this is 4 IOs ignoring the redo log.
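The foreground IO counts from the four scenarios can be tabulated in a toy function (the model is simplified exactly as above -- redo log ignored, non-unique secondary index, read-modify-write update; the parameter names are mine):

```python
# Foreground read/write IOs per update, ignoring the redo log, per the
# scenarios above. change_buffer models an InnoDB-style change buffer that
# makes non-unique secondary index maintenance read-free for a B-Tree too.

def foreground_io(index, cached, secondary, change_buffer=False):
    if cached:
        return 0  # only the optional redo fsync remains
    if index == "lsm":
        return 1  # read the PK page; LSM secondary maintenance is read-free
    # B-Tree: write back one dirty page + read one page, per index touched
    indexes = 1 + (1 if secondary and not change_buffer else 0)
    return 2 * indexes

assert foreground_io("lsm", cached=False, secondary=True) == 1
assert foreground_io("btree", cached=False, secondary=True) == 4
# the change buffer reduces 1:4 to approximately 1:2
assert foreground_io("btree", cached=False, secondary=True, change_buffer=True) == 2
```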

Make writes fast: LSM

Writes can be fast with an LSM because most of the IO cost is deferred but that also increases the need to throttle writes. Life is good as long as that deferred cost can be repaid fast enough, otherwise there will be more response time variance.

Flush and compaction are the deferred cost for an LSM write. Flush means writing the memtable to an SST on storage. Compaction means merging SSTs to move flushed data from the root to leaf of the LSM tree. Compaction costs more than flush. RocksDB can stall writes when compaction doesn't keep up with ingest. Ingest creates durability debt, compaction reduces it and write stalls are there to bound the debt. Write stalls are enabled by default but can be disabled by configuration. Putting a bound on durability debt also puts a bound on read latency by reducing the number of SSTs that can exist in the L0, L1 and L2. So if you want to support extremely high write rates then choose one of: read stalls, write stalls.
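As a sketch of how debt bounds become stalls (a toy throttle with invented thresholds, not RocksDB's actual code, though the shape resembles its L0-file and pending-compaction-bytes triggers):

```python
# Toy write throttle: slow or stop writes when durability debt (L0 files
# waiting for compaction, or bytes flushed but not yet compacted) exceeds
# configured bounds. All thresholds here are made up for illustration.

def write_state(l0_files, pending_compaction_bytes,
                slowdown_l0=20, stop_l0=36,
                slowdown_bytes=64 << 30, stop_bytes=256 << 30):
    if l0_files >= stop_l0 or pending_compaction_bytes >= stop_bytes:
        return "stop"      # write stall: writers block until debt shrinks
    if l0_files >= slowdown_l0 or pending_compaction_bytes >= slowdown_bytes:
        return "slowdown"  # writes are rate limited
    return "ok"

assert write_state(4, 1 << 30) == "ok"
assert write_state(25, 1 << 30) == "slowdown"
assert write_state(40, 1 << 30) == "stop"
```

Raising the thresholds trades write stalls for read stalls: more data lingers in the L0, L1 and L2 and reads must check more SSTs.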

Make writes fast: B-Tree

Writes can also be fast with a B-Tree as there are no page reads/writes to/from storage when the working set is cached and background page write back is fast enough. In that case the only IO work in the foreground is the optional redo log fsync.

Page write back is the primary deferred cost for a B-Tree write. Most of my B-Tree experience is with InnoDB which does fuzzy checkpoint. The goal is to flush dirty pages before the current redo log segment gets full. Using larger redo log segments lets InnoDB defer write back for a longer time increasing the chance that more transactions will modify the page -- reducing write amplification and helping performance.

Purge can be an additional deferred cost for a B-Tree write. I use the InnoDB name here as Postgres calls this vacuum. This is the process of reclaiming space from deleted rows that are no longer visible by open MVCC snapshots. The LSM equivalent of purge is checking the open snapshot list during compaction for KV pairs that are not the latest version of a given key to determine whether that version is still needed.

When write back and purge are fast enough then write stalls should be infrequent with a B-Tree. But write back isn't always fast enough. A B-Tree write stall occurs when a write transaction must read a page into the buffer pool prior to modifying that page but 1) the buffer pool is full and 2) write back must be done for a dirty page before the memory can be reused.
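That stall condition can be sketched with a toy buffer pool (not InnoDB's eviction code; the LRU choice and names are invented):

```python
# Toy buffer pool page read: a foreground write stall happens when the pool
# is full and the eviction victim is dirty, forcing a synchronous write back
# before the read can reuse the memory.

def read_page(pool, capacity, page_id, is_dirty):
    io = []
    if page_id not in pool:
        if len(pool) >= capacity:
            victim = next(iter(pool))         # pretend this is the LRU victim
            if is_dirty(victim):
                io.append(("write", victim))  # foreground write back: the stall
            del pool[victim]
        io.append(("read", page_id))
        pool[page_id] = True
    return io

pool = {"p1": True, "p2": True}
io = read_page(pool, capacity=2, page_id="p3", is_dirty=lambda p: True)
assert io == [("write", "p1"), ("read", "p3")]  # 2 foreground IOs, not 1
```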


A few other comments that didn't have a place above:
  • In this post I assume the B-Tree uses no-force, but there is at least one nice OSS B-Tree that uses force.
  • Making commit slower is another way to throttle writes and reduce the chance of stalled writes. Examples of this include redo log fsync, semisync or synchronous replication.
  • The InnoDB change buffer is a wonderful feature that reduces the IO overhead for write-heavy workloads.
  • NAND flash GC stalls are another source of write stalls. I wish more were written about this topic.
  • Stalls during TRIM when using an LSM with NAND flash are another source of stalls. I wish there were more TRIM benchmarks. Smart friends tell me that NAND flash devices vary widely in their ability to handle TRIM. And they display different stall behavior when their TRIM capacity has been exceeded. Some of us were spoiled by FusionIO.

Wednesday, November 20, 2019

The delta+main index structure

A reader of my LSM for analytics blog post suggested I read Fast Scans on Key-Value Stores from VLDB 2017. It is an interesting paper that was part of the effort to build the Tell DBMS and a great PhD thesis.

Tell might not be an answer for my problem as this was an in-memory DBMS and might have write amplification that is too high for my needs. Regardless, the paper is worth reading and discussing.

Index structure drama

I don't publish reviews for papers that aren't worth reading, so keep that in mind for any criticism that follows. The index structure world tends to be low on drama, perhaps more so while waiting for the next round of learned index papers.

The big questions for me are:
  1. Is delta+main a new index structure?
  2. Did SAP Hana pioneer delta+main?
Hana and delta+main

I will start with question 2. The paper states that delta+main was pioneered by SAP Hana (see Hana paper SIGMOD Record from 2012) and then cites a SIGMOD 2015 paper about the Analytics in Motion project. I am not an expert in this area but I suspect that C-Store/Vertica (see VLDB 2005 paper) was another pioneer in this space. 

I started to browse the cited papers. There are too many for me to read or even cite including Fast Updates from VLDB 2012 and Positional Update Handling in Column Stores from SIGMOD 2010. The earliest online paper in this space might be Differential Files from ACM TODS in 1976 and that paper cites even earlier work -- delta+main is great for data stored on tape.

Delta+main index structure

At this point I am only talking about TellStore-Col. I would classify TellStore-Log as index+log.

I am not sure that delta+main is a new index structure. It might be an LSM variant that I have called memtable+L1 where delta is the memtable and main is the L1. Or perhaps it is memtable+L0+L1 where delta is memtable and the L0 while main is the L1. I encountered memtable+L1 systems at least twice. Once at work and then again with Sophia. This is a great approach when database : RAM ratios aren't larger than 10.

However, the index structures used in Tell are more complicated than an LSM. They do clever things to make scans faster, things that you might not do with an LSM -- some fields on data pages aren't write once.

So I won't update the index structures definition just yet, but that doesn't take away from the paper.

Back to the paper

Tell is an in-memory DBMS with three index structures -- TellStore-Log, TellStore-Col and TellStore-Row. I ignore TellStore-Row. Tell supports mixed workloads -- lets call this HTAP for now -- and must be efficient for scans. It solves some of the problems I described in my LSM for Analytics post.

The paper only describes the use of a hash index but cites another Tell paper that explains how a B+Tree can be used. I assume the hash index is only (mostly) for primary key indexes.


TellStore-Log is index+log with a hash table for the index. Rows in the log are self-contained. The hash index doesn't have to be checked to determine whether a row in the log is live or visible to a transaction. Thus GC and scans don't have to probe the index which saves CPU overhead.

Rows in the log have fields for valid-from, valid-to and previous. The index points to the latest version of a key and then previous versions can be found by following the previous field. The valid-from and valid-to fields are for MVCC visibility. As part of an update the valid-to field of the previously live version of the row is updated. This confused me until I remembered that this is an in-memory DBMS, the log is in memory and it is possible to update that field in a previously written log page. But this also means that the append-only log isn't write once because the valid-to field can be updated long after the append.

Each DBMS table uses a separate log to improve scan efficiency. TellStore-Log also does clever things to piggyback GC with scans.


TellStore-Col is delta+main. It uses two logs for delta (update-log and insert-log, delete is an update) and then PAX for the main store. There is still one hash index that points to rows in update-log, insert-log and the main store.

Row visibility can be treated as a predicate to evaluate. Assume there are N rows on a page and K predicates. There can be K bitmaps, one per predicate, to indicate matching rows and then one more bitmap to indicate the rows that are visible.
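With Python ints as bitmaps this looks like the sketch below (an illustration of the idea, not Tell's implementation):

```python
# Evaluate K predicates over N rows on a page: build one bitmap per
# predicate, then AND them with the visibility bitmap to get matching rows.

def eval_page(rows, predicates, visible):
    result = visible
    for pred in predicates:
        bm = 0
        for i, row in enumerate(rows):
            if pred(row):
                bm |= 1 << i
        result &= bm  # intersect with this predicate's bitmap
    return [i for i in range(len(rows)) if result >> i & 1]

rows = [{"a": 1, "b": 5}, {"a": 2, "b": 7}, {"a": 3, "b": 9}]
visible = 0b011  # row 2 is not visible to this snapshot
match = eval_page(rows, [lambda r: r["a"] >= 2, lambda r: r["b"] < 9], visible)
assert match == [1]
```

The appeal is that each predicate is evaluated in a tight loop over one column and visibility costs one more AND, rather than a per-row branch.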

Multiple versions of a row can be in the main store and would be adjacent in a page. There is also a newest pointer per key in the main store that can point into the update log. The newest pointer gets changed each time a row with that key is appended to the update-log. Rows in the main store also have valid-from and valid-to fields to determine MVCC visibility. When the latest version of a key is in the main store and the key gets updated (appended to the update-log) then the valid-to field in the main store page would also get updated. I assume these changes to main store pages are done in place given that this is an in-memory DBMS.

Multiple versions of a key in the update log are linked via the previous field. But the oldest version of a key in the update log does not point to versions in the main store.

Eventually GC is done for pages in the main store and that uses copy-on-write in contrast to the changes mentioned above that are done in place.

The structure of pages in the main store support efficient scans. Keys that are not visible per MVCC can be excluded via the valid-from and valid-to fields. Keys that have the visible version in the update-log can be found by following the newest pointer from the main store page to the update-log. There is no need to do a merge, like an LSM requires, between rows from the main store and update-log. There is no need to probe the update-log for every key read from the main store. There is no need to scan the update-log as it gets probed on demand.

A scan would also have to read from the insert-log. But such rows are new, so there is no need to merge that data with another data structure.

Monday, November 18, 2019

Always be upgrading

On the production DBMS side I want to provide a high QoS and a stable environment helps to achieve that. Upgrades are done slowly with much testing. Upgrades include moving to a new compiler toolchain or new library version.

On the security side people don't want libraries with known security problems to be used in production. This leads to a push to always be upgrading (ABU) and this is justified when the expected value of ABU exceeds the expected cost. Alas this will be hard to figure out. I assume the expected cost can be estimated. I am skeptical there can be an estimate of the expected value.

The expected value of ABU is:
cost(security-event) X probability(security-event)
I assume it is possible to estimate the cost of a security event but it will be much harder to estimate the probability. It would help if more vendors were to publish full incident reports. My guess is that old library versions won't be near the top of the list of root causes, but this is not my area of expertise.
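With made-up numbers the comparison looks like this (every constant below is hypothetical, which is the point -- the probability is the number nobody has):

```python
# Toy ABU decision: upgrade aggressively only when the expected value
# (avoided loss) exceeds the expected cost. All numbers are invented.

cost_event = 2_000_000       # dollars per security event, hypothetical
p_event_per_year = 0.001     # the hard number to estimate, hypothetical
abu_cost_per_year = 50_000   # debug + test + downtime across teams, hypothetical

expected_value = cost_event * p_event_per_year  # ~2000 dollars/year
assert expected_value < abu_cost_per_year       # with these numbers, ABU loses
```

Change the probability by an order of magnitude and the conclusion flips, which is why the lack of published incident data matters.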

The cost of ABU is easier to quantify -- count debugging time, testing time and down time from all teams that will be upgrading frequently. Examples of things that can go wrong include Postgres depends on glibc locale, ext-4 performance changes for O_DIRECT and many more, but I will be lazy and many of the problems I experienced weren't reported in public.

Saturday, November 16, 2019

My theory on technical debt and OSS

This is a hypothesis, perhaps it is true. I am biased given that I spent 15 years acknowledging tech debt on a vendor-based project (MySQL) and not much time on community-based projects.

My theory on tech debt and OSS is that there is more acknowledgement of tech debt in vendor-based projects than community-based ones. This is an advantage for vendor-based projects assuming you are more likely to fix acknowledged problems. Of course there are other advantages for community-based projects.

I think there is more acknowledgement of tech debt in vendor-based projects because the community is criticizing someone else's effort rather than their own. This is human nature, even if the effect and behavior aren't always kind. I spent many years marketing bugs that needed to be fixed -- along with many years leading teams working to fix those bugs.

Friday, November 15, 2019

The joy of database configuration

I am wary of papers with performance results for too many products. Too many means including results from systems for which you lack expertise. Wary means I have less faith in the comparison even when the ideas in the paper are awesome. I have expertise in MySQL, MongoDB, RocksDB, WiredTiger and InnoDB but even for them I have made and acknowledged ridiculous mistakes.

Database configuration is too hard. There are too many options, most of them aren't significant and the approach is bottom-up. I am an expert on this -- in addition to years of tuning I have added more than a few options to RocksDB and MySQL.

This post was motivated by PostgreSQL. I want to run the insert benchmark for it and need a good configuration. I have nothing against PG with the exception of a few too many why not Postgres comments. The community is strong, docs are great and the product is still improving. But I think PostgreSQL configuration has room to improve -- just like RocksDB (here, here) and MySQL/InnoDB.

Too many options

A non-expert user lacks both the ability to choose good values for options and the ability to understand which options might be useful to set. My solution to too many options and most aren't significant is to use good defaults and split the option name space into two parts -- regular and expert. Regular options are set by most users because they matter for performance and don't have good default values. The amount of memory the DBMS can use is one such option - the default will be small.

Everything else is an expert option. These include options for which the default is great and options that rarely impact performance. There is a reason for expert options -- some workloads benefit from their existence and being able to set that option at runtime might avoid downtime. Options are also added early in the lifecycle of new features to allow developers to evaluate the new feature and choose good default values. But such options don't need to be exposed to all users.

The benefit from doing this is to avoid presenting a new user with tens or hundreds of options to consider. That is a lousy experience. And while X is too hard isn't always a valid complaint -- languages (human and database query) are complex because they let us express complex ideas -- I don't think we gain much from the current approach.

RocksDB has added functions that simplify configuration and even split the option namespace into two parts -- regular and advanced. This is a step in the right direction but I hope for more. I confirmed that most RocksDB options either have good defaults or aren't significant for my workloads and then published advice on tuning RocksDB.

The performance configurations I use for MongoDB/WiredTiger and MySQL/InnoDB are similar to my experience with RocksDB. I don't have to set too many options to get great performance. Alas, it took a long time to figure that out.

Top-down configuration

Top-down configuration is another approach that can help. The idea is simple - tell the DBMS about the hardware it can use and optionally state a few constraints.

The basic hardware configuration is empty which implies the DBMS gets everything it can find -- all memory, all CPU cores, all IO capacity. When a host does more than run a DBMS it should be easy to enforce that limit with one option for memory consumption, one for CPU, etc. The user shouldn't have to set ten options for ten different memory consumers. It is even worse when these limits are per instance -- limiting how much memory each sort buffer gets is a lousy way to manage total memory usage. IO capacity is interesting. AFAIK there was a tool included in RethinkDB that characterized IO capacity, PostgreSQL has a tool for fsync performance and we can't forget fio. But it is easy to be misled about SSD performance.
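A top-down interface might look like the sketch below (the option names and split ratios are invented; a real DBMS would derive much more, but the shape is one budget in, many limits out):

```python
# Top-down configuration sketch: derive per-consumer limits from one
# host-level budget instead of asking the user to set ten options.

def derive_config(total_memory_gb=None, cpu_cores=None):
    total_memory_gb = total_memory_gb or 64   # pretend these were autodetected
    cpu_cores = cpu_cores or 16
    return {
        "buffer_pool_gb": round(total_memory_gb * 0.70, 1),   # invented split
        "total_sort_mem_gb": round(total_memory_gb * 0.15, 1),
        "misc_mem_gb": round(total_memory_gb * 0.15, 1),
        "worker_threads": cpu_cores,
    }

# empty config -> use the whole host; one option caps a shared host
cfg = derive_config(total_memory_gb=32, cpu_cores=8)
assert cfg["buffer_pool_gb"] == 22.4
assert cfg["worker_threads"] == 8
```

Note the sort memory limit is a total, not per-operator -- per-instance limits are exactly the bottom-up trap described above.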

The constraints cover things that are subjective. What is the max recovery time objective? How do you rank read, write, space and memory efficiency?

 A great example of this is SQL Memory Management in Oracle 9i -- tell the DBMS how much memory it can use and let it figure out the best way to use it.

What about ML

I hope that ML makes it easier to discover the options that aren't significant and can be moved into the expert options namespace. But I prefer a solution with fewer tuning knobs, or at least fewer visible tuning knobs. I hope to avoid too many knobs (status quo) combined with ML. Let's make smarter database algorithms. If nothing else this should be a source of research funding, interesting PhDs and many papers worth reading.


While I appreciate that someone made the MySQL memory calculator available I wish this weren't needed. Setting memory limits based on peak concurrency means you will under-allocate memory in the normal case or instead you can over-allocate at peak concurrency and get OOM.

Tuesday, November 12, 2019

Using an LSM for analytics

How do you use an LSM for analytics? I haven't thought much about it because my focus has been small data -- web-scale OLTP since 2006. It is a great question given that other RocksDB users (Yugabyte, TiDB, Rockset, CockroachDB) support analytics.

This post isn't a list of everything that is needed for an analytics engine. That has been described elsewhere. It is a description of problems that must be solved when trying to use well-known techniques with an LSM. I explain the LSM challenges for a vectorized query engine, columnar storage, analytics indexes and bitmap indexes.

There is much speculation in this post. Some of it is informed -- I used to maintain bitmap indexes at Oracle. There is also big opportunity for research and for R&D in this space. I am happy to be corrected and be told of papers that I should read.

Vectorized query engine

See MonetDB and X100 to understand the remarkable performance a vectorized query engine can provide. CockroachDB has recently explained the performance improvement from a vectorized engine even when remaining on row-wise storage.

Anything for a rowid

Columnar encoding and vectorized processing benefit from an engine that uses rowids and even more so when the rowid space isn't sparse. In the most compact form a vector of columns has a start rowid and the offset of a value in the vector is added to the start rowid to compute the rowid for a value. It is less efficient but feasible to have a vector of values and a vector of rowids when the rowid space is sparse.

But an LSM doesn't use rowids as each row has a unique key and that key can be variable length and more than 8 bytes. Rowids are easier to do with heap storage as the rowid can be <pageID>.<pageOffset>. For an LSM it might be possible to use the WAL LSN as the rowid. I propose something different for per-SST rowids, rowids that are unique within an SST. Rows in an SST are static and ordered so the rowid is the offset of the row in the SST. When there are N rows in the SST then the per-SST rowid is a value between 1 and N (or 0 and N-1). A rowid that works across SSTs might be <SST number>.<per-SST rowid>.

To use the per-SST rowid there must be an efficient way to get the key for a given rowid within an SST. That can be solved by a block index in the SST that stores the minimum rowid per block.
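A toy version of the per-SST rowid and block index (the layout is invented for illustration, not any real SST format):

```python
import bisect

# Toy SST: rows sorted by key, per-SST rowid = offset of the row in the SST.
# The block index stores the minimum rowid per block so rowid -> key lookups
# touch one block instead of scanning the SST.

class SST:
    def __init__(self, rows, block_size=4):
        self.blocks = [rows[i:i + block_size]
                       for i in range(0, len(rows), block_size)]
        self.min_rowid = [i * block_size for i in range(len(self.blocks))]

    def key_for_rowid(self, rowid):
        b = bisect.bisect_right(self.min_rowid, rowid) - 1
        return self.blocks[b][rowid - self.min_rowid[b]][0]

rows = [(f"k{i:02d}", f"v{i}") for i in range(10)]
sst = SST(rows)
assert sst.key_for_rowid(0) == "k00"
assert sst.key_for_rowid(7) == "k07"  # block 1, offset 3 within the block
```

A cross-SST rowid is then just (SST number, per-SST rowid), as described above.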

Columnar storage

The benefits from columnar storage are compression and processing. Datatype specific compression over a sequence of column values provides excellent compression ratios. There have been many interesting papers on that including Facebook Gorilla and C-Store. The processing benefit comes from iterating over possibly compressed vectors of column values while minimizing data movement.

I assume a vectorized query engine is easier to implement than write-optimized columnar storage but Vertica and Kudu are proof that it is possible to do both. However neither Vertica nor Kudu uses an LSM and this blog post is about LSMs. First I explain how to do columnar storage and then how to do vectorized processing.

While fully columnar storage can be done I explain a PAX approach that is columnar within an SST -- all columns are in the SST but stored separately. Each column gets its own data blocks and data block index. The block index in this case has the minimum rowid per data block. Similar to Kudu, the primary key (whether composite or not) is also stored in its own data blocks with a block index. That can also be used to map a PK to a rowid. Optionally, the PK data blocks can store the full rows. Otherwise a row can be reconstructed from the per-column data blocks. The LSM must know the schema and WiredTiger shows how to do that.

Columnar processing

The solution above shows how to get the compression benefits from columnar storage but an LSM range query does a scan of each level of the LSM tree that is processed by a merge iterator to filter rows that have been deleted while also respecting visibility. For example with the same key on levels 1, 3 and 5 of the LSM tree it might be correct to return the value from level 1, level 3 or level 5 depending on the query timestamp and per-key timestamps. But it is hard to decide which version of the key to keep in isolation -- the merge iterator needs to see all of the keys at the same time.

While the merge iterator output can be put back into a columnar format, a lot of work has been done by that point. It would be better to push some processing below (or before) the merge iterators. It is easier to push filtering and projection. It is harder to push anything else such as aggregation.

A recent blog post from CockroachDB shows how to get significant benefits from vectorized processing without using columnar storage. I don't want to diminish the value of their approach, but I am curious if it can be enhanced by columnar storage.

Filters are safe to push below a merge iterator. Assume there is a predicate like column < 3 then that can be evaluated by scanning the column in an SST to find the rowids that satisfy the predicate, using the found rowid set to reconstruct the rows, or projected columns from the rows, and returning that data. Multiple filters on the same or multiple columns can also be evaluated in this fashion.
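A sketch of that pushdown over a toy columnar SST (illustrative, not RocksDB code; the column-vector layout continues the per-SST rowid idea above):

```python
# Push `pred(column)` into the SST scan: scan one column vector to build the
# found rowid set, then reconstruct only the projected columns for those
# rowids instead of materializing full rows first.

def scan_with_filter(columns, filter_col, pred, project):
    rowids = [i for i, v in enumerate(columns[filter_col]) if pred(v)]
    return [{c: columns[c][r] for c in project} for r in rowids]

columns = {"a": [5, 1, 2, 9, 0], "b": ["v", "w", "x", "y", "z"]}
out = scan_with_filter(columns, "a", lambda v: v < 3, project=["b"])
assert out == [{"b": "w"}, {"b": "x"}, {"b": "z"}]
```

Conjunctions over multiple columns work the same way: intersect the rowid sets before reconstruction.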

MySQL has options to push non-index predicates to the storage engine. While the original use case was Cluster where storage is across the network from the SQL/compute this feature can also be used for analytics with MySQL.

Aggregation is harder to push below a merge iterator because you need to know whether a given key would be visible to the query while processing an SST and that requires knowing whether there was a tombstone for that key at a smaller level of the LSM tree. That might use too much state and CPU.

Userland columnar

By userland columnar I mean doing columnar encoding above the engine. In this approach encoded column chunks are returned by point and range queries. Filtering isn't pushed down but the overhead of LSM merge is amortized because each operation returns a chunk or vector of columns and datatype specific encoding can be done to get excellent compression ratios.

There are a few challenges with this -- updates and rowids. First, updates mean there will be small deltas that must be merged with the compressed row chunks. This problem isn't unique to userland columnar as updates are always an issue for columnar analytics engines. Part of the solution might be to treat an update as delete followed by insert, but vectorized processing still has to know which rows have been updated or deleted. With RocksDB the merge operator might be used to associate update/delete with the encoded chunk of rows.

Given that rowid must be specified by the user there must be a way to map a PK to a rowid when insert, update and delete are processed. Assuming all rows come from a log then the log offset might serve as the rowid on insert. Another option that has more overhead is to maintain a PK -> rowid index within the LSM to assign rowids to new PK values and access the rowid for existing ones. During update or delete there must be a way to map the PK of the row back to a rowid. If a PK -> rowid index is used then this is easy. Otherwise I will limit my speculation for now.

If the rowid management issue can be resolved then this is an interesting idea. In the end it comes down to efficiency. How many columns per second per CPU core can be processed by a query engine? Faster is better gets too much press. I want to make "efficienter is better" and "greener is better" popular.

Rockset uses userland columnar, so they are in a position to correct my speculation and document the efficiency of this approach. Perhaps MonetDB-Lite or DuckDB can serve as the baseline.

Analytics indexes

Analytics indexes are another way to prune the amount of data that must be searched by scan-heavy queries. There was an interesting paper at SIGMOD 2018 that evaluated this approach for LSM. The paper's title is A Comparative Study of Secondary Indexing Techniques in LSM-based NoSQL Databases. BRIN indexes in Postgres are another example. These can be described as local secondary indexes (local to an SST or data block), as opposed to a global secondary index as provided by MyRocks. The idea is to maintain some attributes per data block or per SST for columns that aren't in the LSM index. This could be the min and max value of the column or a bloom filter, and it can be used to prune that data block or SST during a scan.
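A sketch of min/max pruning (block layout and values are made up):

```python
# Each data block stores min/max for a column that isn't in the LSM index.
blocks = [
    {"rows": [3, 7, 9]},
    {"rows": [15, 22, 31]},
    {"rows": [40, 41, 55]},
]
for b in blocks:
    b["min"], b["max"] = min(b["rows"]), max(b["rows"])

def scan(blocks, lo, hi):
    """Return matching values and the count of blocks pruned via min/max."""
    hits, pruned = [], 0
    for b in blocks:
        if b["max"] < lo or b["min"] > hi:
            pruned += 1               # skip the block without reading its rows
            continue
        hits += [v for v in b["rows"] if lo <= v <= hi]
    return hits, pruned

hits, pruned = scan(blocks, 20, 45)   # first block is pruned outright
```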

Bitmap indexes

Has there been any work on bitmap indexes for LSM as in a paper, proof-of-concept or real system? Once per-SST rowids are implemented as described above then per-SST bitmap indexes can also be provided. These would index columns not in the LSM index. Use of the bitmap indexes would determine the found rowid set within each SST.
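To make the speculation concrete, here is a sketch of a per-SST bitmap index, with Python ints serving as bitmaps (everything here is hypothetical):

```python
# Column values by per-SST rowid for a column not in the LSM index.
sst_rows = ["red", "blue", "red", "green", "blue"]

# One bitmap per distinct value: bit i is set when rowid i has that value.
bitmaps = {}
for rowid, value in enumerate(sst_rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << rowid)

def rowids(bitmap):
    """Decode a bitmap back into the sorted list of rowids."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

# Predicate: color IN ('red', 'green') -> OR the bitmaps, then decode.
found = bitmaps["red"] | bitmaps["green"]
```

The found rowid set within each SST is then used to fetch the matching rows, and AND/OR over bitmaps handles conjunctions and disjunctions of predicates cheaply.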

The cost of per-SST

There is much speculation in this blog post. In several places above I mention that special indexes can be maintained per SST. How many SSTs, and thus how many per-SST indexes, will there be in a typical setup? I assume that an LSM for analytics would use larger SSTs. There are 2^12 SSTs per 1TB of data when the SST size is 256MB. I am not sure whether that would be a problem.


Update: I added the Userland columnar section to explain how Rockset might be avoiding the merge iterator overhead.

Someone suggested I read Fast scans on key-value stores. It looks interesting.

Monday, November 11, 2019

Linux syscall performance regressions explained

There is a paper in SOSP 2019 that explains Linux system call performance for recent kernels. Explaining performance is what I do so I appreciate the effort that went into this paper:
  • Linux syscall perf mostly improved from 3.0 to 4.0 and then went bad starting with 4.7, thanks to 5 security changes: Spectre and Meltdown mitigations, hardened usercopy, randomized slab freelist and userspace page fault handling.
  • Linux has a lot of variance in syscall performance across releases, more than I am used to with MySQL. MySQL still has regressions across releases, just without as much variance. Interesting posts for Postgres perf across versions are here, here and here. AFAIK Linux can't move as fast as we need it to and avoid the variance, so it is caught and fixed after the fact.
  • Results are for Intel CPUs. It would be interesting to see results for AMD and for Intel with and without HT enabled.

Friday, November 8, 2019

Jungle - LSM plus copy-on-write B-Tree

This is a review of Jungle which is an LSM variant that uses a copy-on-write (CoW) B-Tree internally. One of the Jungle developers previously invented ForestDB. I am a fan of his work.

At a high level Jungle is an LSM with leveled compaction but thanks to the CoW B-Tree it has different read, write and space amplification tradeoffs. I don't understand Jungle well enough to explain where it is better, but I am happy to share this review and accept corrections. My summary of Jungle is:
  • One sorted run per level
  • The per level file structure uses an index+log approach where the index is a CoW B-Tree, values are appended to the value log and the B-Tree entries point into the log. There is also a bloom filter.
  • Inter-level merge does compaction between adjacent levels. I think this is some-to-some as some data is moved from Ln to Ln+1 in a batch, values are appended to the value log, keys are inserted into the B-Tree and modified B-Tree pages are persisted by appending to the end of the B-Tree files. Because of CoW the path from leaf to root is made dirty when a leaf page is modified.
  • In-place merge does GC within a level to reclaim space from the B-Tree files and value log. The B-Tree is scanned in order to write a new B-Tree and new value log. Space is wasted because updates were appended to the end of the B-Tree file and value log.

The index+log approach is used to reduce write amplification from large values. The per-level write-amp from moving a KV pair from Ln to Ln+1 is the sum of the write-amp from adding keys to the B-Tree and from appending to the end of the value log. 

Assuming a batch of KV pairs is inserted then write-amp for the value log is minimal -- close to 1. If 32 1kb values are moved then 32kb is written. I am uncertain about the average and worst case write-amp for the B-Tree even when I only consider write-amp for the leaf pages and ignore the non-leaf pages. For the worst-case assume that each key makes a leaf page dirty.  Then for each KV pair with a small key and 1kb value there is 4kb + 1kb written (4kb for B-Tree, 1kb for value log) and the per-level write-amp is ~5. That is a good worst-case. I frequently see per-level write-amp of ~5 for production workloads with RocksDB so I wonder what the average case will be for Jungle.
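The arithmetic behind that estimate, using the same assumptions (1kb values, small keys, 4kb B-Tree leaf pages, and in the worst case one dirty leaf page per moved key):

```python
# Back-of-envelope per-level write-amp for moving a batch from Ln to Ln+1.
value_kb, leaf_kb = 1, 4
n = 32                                   # KV pairs moved in the batch

log_write_kb = n * value_kb              # appending values: write-amp ~1
btree_write_kb = n * leaf_kb             # worst case: one dirty 4kb leaf per key
write_amp = (log_write_kb + btree_write_kb) / (n * value_kb)   # -> 5.0
```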

There is additional write-amp from doing periodic in-place merges to reclaim space. I won't try to estimate the impact from that.

Other comments:
  • The CoW B-Tree in Jungle is CoW-S because writes are appended and GC must eventually be done.
  • While ordering values in RocksDB has a cost, more write-amp, it also has a benefit, less cache-amp. RocksDB needs a pointer (index entry) in memory per block to achieve a worst-case of ~1 disk read per point query -- see the RocksDB data block index. With index+log the values are not in key order and this needs a pointer (index entry) in memory per KV pair. Assuming these pointers are ~8 bytes there is a huge difference in memory overhead between 8 bytes / database page and 8 bytes per KV pair assuming KV pairs are not too big. Per the CRUM Conjecture it is hard to be better in all dimensions -- read, write, disk space and memory. 
  • It will be hard to do compression for the value log if only a few values are appended at a time. But if each inter-level merge adds many 4kb pages worth of data to the value log then this isn't a problem.
  • Range scans are more expensive with index+log because values are not in key order and each value might require a storage IO and the CPU overhead for decompression. This should only be a problem for the largest level of the LSM tree.
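To put numbers on the cache-amp point, assuming 1TB of data, ~8 byte pointers, 4kb database pages and 1kb KV pairs (all sizes chosen for illustration):

```python
# Index memory needed to get a worst case of ~1 disk read per point query.
data_bytes = 1 << 40                      # 1TB
ptr = 8                                   # bytes per in-memory pointer

per_block_mem = data_bytes // 4096 * ptr  # one pointer per 4kb block (RocksDB-style)
per_kv_mem = data_bytes // 1024 * ptr     # one pointer per 1kb KV pair (index+log)
# per_block_mem = 2GB vs per_kv_mem = 8GB for these sizes
```

The gap widens as values shrink: with 128-byte KV pairs the index+log side would need 64GB of pointers for the same 1TB.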

Thursday, November 7, 2019

Revisiting Kudu

I read the Kudu paper many years ago. It is worth reading but I never published my review so I re-read the paper this week. My review is limited to the paper and I ignore improvements since then. My focus is on storage so I also ignore most of the distributed system features.

I don't have access to my old review of the paper when I read it years ago. I might have considered it to be an LSM variant. Now I consider it to be an example of an index+log index structure.

Kudu is a scale-out index structure for analytics engines like Impala. The goals for Kudu are fast columnar scans, low latency updates and low performance variance. I assume that Kudu satisfied those goals. The features include:
  • Data is stored in tables and a table has a fixed schema. I am curious about the demand for flexible schemas.
  • It is primary key only -- there are no secondary indexes. It uses Raft to make sure replicas stay in sync. If global secondary indexes were supported then something like XA across shards would also be needed. That is much harder, but feasible -- see Spanner, CockroachDB and Yugabyte.
  • Supports add/drop column but PK columns can't be dropped. I am not sure whether the PK can be changed.
  • Supports insert, update, delete and range query. It might consider a point query to be a simple range scan. Update and delete must fully specify the PK values for the rows to be changed.
  • Supports range and hash partitioning, but that section of the paper confused me
  • Supports single-row transactions. I assume a batch can be submitted but I am not sure how per-row outcomes are reported to a client.
  • Compaction uses cost-based decisions to merge data where the objective function is to improve read efficiency. I am a big fan of that.


I think this is an example of an index+log index structure where the DiskRowSets are log segments. GC is clever and there are two parts to it. First, there is compaction between DeltaMemStore and DiskRowSet. This removes deleted rows and merges long update chains. Second, there is compaction between DiskRowSets. This reduces the number of places that must be checked for a given key range.

In the standard index+log approach only the first type of compaction is done for a log segment. That usually requires all of the index to be in memory because each row from a log segment needs an index probe to determine whether the row is live. For Kudu a search of the DeltaFiles determines liveness. It is not clear to me whether DeltaFiles are clustered per DiskRowSet to reduce the amount of data that must be in memory when such compaction is done.
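A sketch of the standard index+log liveness check during GC (names are hypothetical): each row in the segment costs an index probe, which is why the index usually must be in memory.

```python
# pk -> (segment, offset) of the live version of each row.
index = {"a": ("seg1", 0), "b": ("seg2", 0), "c": ("seg1", 2)}

def gc_segment(seg_name, seg_rows, index):
    """Copy live rows to a new segment; drop rows the index no longer points at."""
    survivors = []
    for offset, (pk, value) in enumerate(seg_rows):
        if index.get(pk) == (seg_name, offset):   # index probe per row
            survivors.append((pk, value))
    return survivors

seg1 = [("a", 1), ("b", 2), ("c", 3)]   # "b" was superseded by a write in seg2
live = gc_segment("seg1", seg1, index)  # "b" is dead, "a" and "c" survive
```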

I briefly cover the second type of compaction at the end of my blog post. The benefit from merging log segments with index+log is larger sorted runs. Write-optimized index structures impose a CPU and IO read efficiency penalty. Merging log segments like this reduces that penalty.


Tables are horizontally partitioned into tablets and each tablet consists of a MemRowSet, many DiskRowSets, a DeltaMemStore and many DeltaFiles. Inserts are written into a MemRowSet; when full, a MemRowSet is flushed to disk, creating a DiskRowSet. A DiskRowSet is limited to ~32MB. Only inserts are directly written into DiskRowSets. Because there is a PK check on insert, a row will exist in at most one DiskRowSet.

Deletes and updates are first written into a DeltaMemStore and when full that is flushed to disk to create a DeltaFile. Eventually a DeltaMemStore will be merged with a DiskRowSet, so the effects of updates and deletes eventually reach a DiskRowSet.
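A toy sketch of that write path (flush thresholds and structures are simplified stand-ins, not Kudu's actual code):

```python
class Tablet:
    """Inserts go to the MemRowSet; updates/deletes go to the DeltaMemStore."""
    def __init__(self, flush_at=2):
        self.mem_rowset, self.disk_rowsets = {}, []
        self.delta_mem, self.delta_files = {}, []
        self.flush_at = flush_at

    def insert(self, pk, row):
        self.mem_rowset[pk] = row
        if len(self.mem_rowset) >= self.flush_at:   # flush -> new DiskRowSet
            self.disk_rowsets.append(self.mem_rowset)
            self.mem_rowset = {}

    def update(self, pk, row):
        self.delta_mem[pk] = row
        if len(self.delta_mem) >= self.flush_at:    # flush -> new DeltaFile
            self.delta_files.append(self.delta_mem)
            self.delta_mem = {}

t = Tablet()
t.insert("k1", {"v": 1})
t.insert("k2", {"v": 2})        # MemRowSet full -> flushed to a DiskRowSet
t.update("k1", {"v": 10})       # lands in the DeltaMemStore, not the DiskRowSet
```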

Storage is columnar for the rows in a DiskRowSet. Encoding takes advantage of data types to use less space and block compression is optional. This output is written to a sequence of pages for each column. Then there is a B-Tree index that maps ID to page, where ID is the ordinal offset of the row within the DiskRowSet (when a DiskRowSet has 1M rows then the IDs run from 1 to 1M, assuming they start at 1), although I am not sure whether that index has an entry per page or per row.

There is also a PK index column that has the encoded PK values (encoded because the key is a string even when the PK has more than one column). It wasn't clear to me whether there was an index built on that or if binary search was used. I assume the value in this case is the ID (ordinal offset) for the row. A paged bloom filter is created for the PK index column.

Range scans will start by searching the PK index column to determine the row IDs that must be retrieved, and then for each column to be returned, search the per-column indexes using those IDs.
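A sketch of that scan flow, with simplified stand-ins for the encoded columns (binary search plays the role of the PK index here, since the paper left the index structure unclear):

```python
import bisect

# Sorted encoded PKs; a row's ID is its position in this column.
pk_col = ["a", "c", "f", "k", "p"]
columns = {"x": [10, 11, 12, 13, 14], "y": [5, 6, 7, 8, 9]}

def range_scan(lo, hi, wanted):
    """Find the row ID range via the PK column, then fetch only wanted columns."""
    first = bisect.bisect_left(pk_col, lo)
    last = bisect.bisect_right(pk_col, hi)
    ids = range(first, last)
    return {c: [columns[c][i] for i in ids] for c in wanted}

res = range_scan("c", "k", ["x"])   # PKs c, f, k -> IDs 1..3 -> x values
```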

Friday, November 1, 2019

Fun setting up Ubuntu on Win 10 Home

I feel like I have gone back to the Linux of 25 years ago, when it was quite a challenge to install it, configure storage, set up networking with a modem and get X working. My setup is the latest Windows 10 Home, a new-ish HP Omen laptop, Ubuntu 18.04 and VMWare Player 15.5.0. Some useful advice is here.

I lost a few hours today setting up Ubuntu on a laptop running Win 10 Home. The magic that fixed this for VirtualBox and VMWare was to run the following command in the Command Prompt app, run as an administrator. It does not work if you run it in PowerShell. Reboot after running it:
bcdedit /set hypervisorlaunchtype off
This not-fun experience is shared by many others trying to use VirtualBox and VMWare. I assume the problem is Win 10 and not the VM apps. As much as I don't like the Mac keyboard, each time I try to switch to Windows I quickly realize it is a bad idea. While WSL and WSL2 sound OK, I really want the full Linux experience and don't want to deal with running an X server separately in Windows to get a UI for Linux apps.

For those running Ubuntu natively on their laptops with great keyboards, I salute you. You have every right to snicker at me.

Shared folders

Shared folders work except for automount. This command fixes that:
vmhgfs-fuse .host:/ /mnt/hgfs -o subtype=vmhgfs-fuse,allow_other

The alternative is to use the VMWare Player UI to disable and reenable shared folders each time I reboot the VM. The best solution is to add an entry to /etc/fstab and I will do that soon.
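One possible /etc/fstab entry, a sketch I have not yet tested on this laptop -- the mount options mirror the vmhgfs-fuse command above:

```
# mount VMWare shared folders at boot
.host:/  /mnt/hgfs  fuse.vmhgfs-fuse  defaults,allow_other  0  0
```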