Monday, November 25, 2019

Throttling writes: LSM vs B-Tree

Reducing response time variance is important for some workloads. This post explains sources of variance for workloads with high write rates when the index structure is an LSM or a B-Tree. I previously wrote about this in my post on durability debt.

Short summary:
  1. For a given write rate stalls are more likely with a B-Tree than an LSM
  2. Many RocksDB write stalls can be avoided via configuration
  3. Write stalls with a B-Tree are smaller but more frequent versus an LSM
  4. Write stalls are more likely when the redo log isn't forced on commit
  5. The worst case difference between an LSM and B-Tree is larger when the working set isn't cached
  6. Life is easier but more expensive when the working set fits in cache
  7. Less write amplification saves IO for other uses
Less short summary:
  1. Write stalls for an LSM occur when compaction has trouble keeping up with the incoming write rate. The worst stalls occur at write rates that a B-Tree could not sustain. One way to mitigate stalls is to reduce the write rate. Another way is to use an index structure that doesn't support or is inefficient for range scans (see index+log).
  2. The cost from configuring RocksDB to avoid write stalls is more CPU overhead on reads as there will be more data in the upper levels of the LSM. I am partly to blame for the default configuration in RocksDB that throttles writes when the LSM tree gets too much data in the L0, L1 and L2. But that configuration can be changed.
  3. SQLite4  has a clever LSM designed for systems that don't allow background threads. It implements a pay as you go approach to durability debt. A traditional LSM takes the opposite approach - it defers the IO cost to the background. RocksDB has optional write throttling and work has been done to smooth the impact from it but it is not solved. A B-Tree in the worst-case (buffer pool full & mostly dirty, working set not cached) also implements pay as you go approach.
  4. I almost always disable sync-on-commit for benchmarks because I want to observe how the DBMS observes under stress and less commit latency means more writes/second and more IO stress.
  5. See item #6 where I argue that it is good to not have the working set cached.
  6. A common rule of thumb has been to keep all indexes in cache or all of the working set in cache. That simplifies tuning and makes it easier to avoid performance problems. But that also might force a deployment to use 2X more HW than it needs because NAND flash SSDs are everywhere and the response time difference between reading from RAM and reading from NAND flash might not matter for many applications. But if you are using a DBMS in the cloud that charges by the IO, then keeping the working set in RAM might be a good idea.
  7. An LSM usually has less write-amp than a B-Tree. So the IO capacity it saves from that can be used elsewhere to support more read or write transactions.
Worst case behavior

I am wary of faster is better. I prefer nuance but I also know that people don't have time to read long blog posts like this or long performance reports. Here I explain worst case behavior in terms of IO overheads. Worst case behavior isn't the only way to judge an index structure but it helps me to explain performance. Another way is to measure the average amount of IO per transaction (in operations and KB) and treat IO efficiency as important.

I describe worst case behavior for a write operation under a few scenarios. By worst case I mean the largest amount of IO done in the foreground (the thread handling the write) as that determines the response time. I ignore the work done in the background which favors an LSM because that defers more work to the background. For a B-Tree I ignore undo and page splits. The write is a SQL update which is read-modify-write, as opposed to a blind-write like a Put with RocksDB. Finally, I assume the update isn't to an indexed column. The scenarios are:
  1. Cached, PK only - working set cached, PK index only
  2. Not cached, PK only - working set not cached, PK index only
  3. Cached, PK and secondary index - working set cached, PK and non-unique secondary index
  4. Not cached, PK and secondary index - working set not cached, PK and non-unique secondary index 
PK only

For the cached, PK only scenario neither an LSM nor a B-Tree do IO in the foreground with the exception of the redo log fsync. Stalls are unlikely for both but more likely with a B-Tree especially when the DBMS storage uses a spinning disk.
  • An LSM writes the redo log buffer, optionally syncs the redo log and then does an insert into the memtable. Both memtable flush and Ln:Ln+1 compaction are deferred to background threads. If memtable flush were too slow then there are write stalls until flush catches up to avoid too many memtables wasting memory.
  • A B-Tree modifies a page in the buffer pool, writes the redo log buffer and optionally syncs the redo log. If checkpoint were too slow a full redo log can't be rotated until checkpoint catches up and there are write stalls.
For the not cached, PK only scenario the work done in the foreground is 1 IO/update for an LSM and 2 IO/update for a B-Tree. Here a B-Tree uses a pay as you go model.
  • An LSM reads a page into the block cache and then repeats the work described in cached, PK only
  • A B-Tree finds a dirty page to evict, writes that page back to storage, then reads the desired page into that slot in the buffer pool and repeats the work described in cached, PK only.

PK and secondary index

For the cached, PK and secondary index scenario there is approximately twice as much work to be done per update compared to the cached, PK only scenario. Thus stalls are more likely here. But other than the optional redo fsync there is no foreground IO for the LSM and B-Tree.
  • An LSM repeats the work explained in the cached, PK only scenario. For the secondary index it does an additional insert to the memtable which is also logged as redo. This can double the demand for compaction.
  • A B-Tree repeats the work explained in the cached, PK only scenario. For the secondary index it makes an additional page dirty in the buffer pool. This can double the demand for page write back.
For the not cached, PK and secondary index scenario the foreground IO difference between an LSM and B-Tree is more significant -- 1 IO for the LSM vs 4 IO for the B-Tree -- ignoring the redo log overhead. The IO difference is reduced from 1:4 to approximately 1:2 for a B-Tree like InnoDB that implements a change buffer.
  • An LSM does the union of the work described in not cached, PK only and cached, PK and secondary index scenarios. Ignoring the optional redo fsync the cost is 1 read IO for the PK index and no reads for the secondary index because non-unique secondary index maintenance is read-free.
  • A B-Tree repeats the work explained in the cached, PK only scenario but this is done for both the PK and secondary indexes. Thus the cost is 2 IOs to write back dirty pages and then 2 IOs to read pages from the PK and secondary indexes into the buffer pool and then make them dirty -- which then requires redo log writes. So the cost for this is 4 IOs ignoring the redo log.

Make writes fast: LSM

Writes can be fast with an LSM because most of the IO cost is deferred but that also increases the need to throttle writes. Life is good as long as that deferred cost can be repaid fast enough, otherwise there will be more response time variance.

Flush and compaction are the deferred cost for an LSM write. Flush means writing the memtable to an SST on storage. Compaction means merging SSTs to move flushed data from the root to leaf of the LSM tree. Compaction costs more than flush. RocksDB can stall writes when compaction doesn't keep up with ingest. Ingest creates durability debt, compaction reduces it and write stalls are there to bound the debt. Write stalls are enabled by default but can be disabled by configuration. Putting a bound on durability debt also puts a bound on read latency by reducing the number of SSTs that can exist in the L0, L1 and L2. So if you want to support extremely high write rates than choose one of: read stalls, write stalls.

Make writes fast: B-Tree

Writes can also be fast with a B-Tree as there are no page reads/writes to/form storage when the working set is cached and background page write back is fast enough. In that case the only IO work in the foreground is the optional redo log fsync.

Page write back is the primary deferred cost for a B-Tree write. Most of my B-Tree experience is with InnoDB which does fuzzy checkpoint. The goal is to flush dirty pages before the current redo log segment gets full. Using larger redo log segments lets InnoDB defer write back for a longer time increasing the chance that more transactions will modify the page -- reducing write amplification and helping performance.

Purge can be an additional deferred cost for a B-Tree write. I use the InnoDB name here as Postgres calls this vacuum. This is the process of reclaiming space from deleted rows that are no longer visible by open MVCC snapshots. The LSM equivalent of purge is checking the open snapshot list during compaction for KV pairs that are not the latest version of a given key to determine whether that version is still needed.

When write back and purge are fast enough then write stalls should be infrequent with a B-Tree. But write back isn't always fast enough. A B-Tree write stall occurs when a write transaction must read a page into the buffer pool prior to modifying that page but 1) the buffer pool is full and 2) write back must be done for a dirty page before the memory can be reused.


A few other comments that didn't have a place above:
  • In this post I assume the B-Tree uses no-force, but there is at least one nice OSS B-Tree that uses force.
  • Making commit slower is another way to throttle writes and reduce the chance of stalled writes. Examples of this include redo log fsync, semisync or synchronous replication.
  • The InnoDB change buffer is a wonderful feature that reduces the IO overhead for write-heavy workloads.
  • NAND flash GC stalls are another source of write stalls. I wish more were written about this topic.
  • Stalls during TRIM when using an LSM with NAND flash are another source of stalls. I wish there were more TRIM benchmarks. Smart friends tell me that NAND flash devices vary widely in their ability to handle TRIM. And they display different stall behavior when their TRIM capacity has been exceeded. Some of us were spoiled by FusionIO.


  1. How much of write variance in LSM's is due to overcommitted throughput? For every insert into an LSM, one must perform about log N inserts at higher levels of the LSM, where N is the size of the data set. I have a feeling (unsupported by any evidence) that many LSM implementations simply blast as much as they can into L0, and don't worry about the write debt that they have accumulated. How much of the variance would be solved if for every insert one really did those log N operations at the other layers of the LSM: that is pay the debt as you go?

    That still leaves GC stalls and TRIM stalls.

    1. RocksDB has strong throttling enabled by default. That creates a few bad experiences for some users who encounter write stalls. Other users quietly benefit, perhaps unaware, by getting less variance on reads. I assume that RocksDB can do better and make throttling more smooth. If nothing else, this would lead to a few interesting research papers.

      AFAIK the SQLite4 LSM has a solution for this. SQLite doesn't allow background threads so the foreground threads help with compaction. But I need to revisit their design docs.

    2. "Strong throttling" is too vague. This post explains it:

    3. It seems like the write throttling in RocksDB is kind of ad-hoc. Here is what I would consider a principled way to do write throttling:

      Suppose that the write amplification is W. (This covers the general case: for example for an LSM that doubles the size of each run and does merging of equal-sized runs to make larger runs, and whenever there are two runs of the same size, they get merged, then W is log n. For the LSM that has the size factor go up by a factor of 10 and you merge each time, then W is 10 log_{10} N. We parameterize by W.)

      Now every time the application writes K bytes into the LSM, we do KW worth of writes of rebalancing. It turns out for all these LSM variants you can find a schedule that does it. For the simple power-of-two LSM, you simply merge K bytes in each level.

      You might want to support a higher burst rate, feeling that paying log N on every write is too much. I'm not sure how much value there really is to supporting a higher burst rate. Anyway, to support a higher burst rate, you pick a size S that you are willing to get behind. For example, you might set S to 1GB, and you are allowed to write 1GB without doing all the work.

      Now S gives you a straightforward tuning parameter: Bigger S means you need to allocate more storage and tolerate longer bursts.

  2. I sometimes explain Innodb change buffer as "it's kind of like a 2 level LSM". How right or wrong do you think such a characterisation is?

    1. I like it, but it is also like the merge operator in RocksDB. A 2-level LSM is a great fit when data:RAM ratios won't be too big (maybe <= 5:1). Sophia might have been a 2 level LSM, and I know of a successful in-house implementation.