Tuesday, November 24, 2015

Read, write & space amplification - B-Tree vs LSM

This post compares a B-Tree and LSM for read, write and space amplification. The comparison is done in theory and practice so expect some handwaving mixed with data from iostat and vmstat collected while running the Linkbench workload. For the LSM I consider leveled compaction rather than size-tiered compaction. For the B-Tree I consider a clustered index like InnoDB.

The comparison in practice provides values for read, write and space amplification on real workloads. The comparison in theory attempts to explain those values.

B-Tree vs LSM in theory

Read Amplification

Most comparisons should be done for a specific context including the hardware and workload. For now I am only specific about the cache hit rate. For the B-Tree I assume that all non-leaf levels are in cache. For the LSM I assume that everything but the data blocks of the largest LSM level are in cache. While an LSM with leveled compaction has more things to keep in the cache (bloom filters) it also benefits from a better compression rate and the cache requirements are similar to a clustered B-Tree.

Worst-case disk read-amp for point queries is 1 for the B-Tree and the LSM as one block is read from the B-Tree leaf level and largest LSM level. Disk read-amp for range queries is 1 or 2 for a short range scan assuming that 1 or 2 blocks from the B-Tree leaf level and LSM max level are read. Note the impact of my assumption for cached data. While many files might be accessed for a short range query with an LSM everything but the max level data blocks are in cache.

The number of key comparisons can be used as the in-memory read-amp. For a B-Tree with 1M keys there are about 20 key comparisons on a point query. For a range query with a B-Tree there is one additional comparison for each row fetched.

It is harder to reason about the number of comparisons for an LSM. Bloom filters can be used for a point query to avoid comparisons but when there are too many files in level 0 then there will be too many bloom filter checks. Bloom filters don't work for range queries, ignoring prefix bloom filters. When query processing is IO-bound I don't expect key comparison overhead to make a difference between an LSM and B-Tree. So I will ignore this for now.

If you want to maximize the ratio of the database to cache sizes while doing at most one disk read per point query then an LSM with leveled compaction or a clustered B-Tree are the best choices. For a clustered B-Tree the things that must be in memory are one key per leaf block and all non-leaf levels of the index. An LSM with leveled compaction has similar requirements, although it also needs some of the bloom filters to be in memory.

The cache requirement is much larger for an LSM with size-tiered compaction. First, the max level has ~50% of the data compared to ~90% with leveled compaction and it less likely that all data except the max file are in cache. Second, there are more old versions of key-value pairs, space-amp is larger, so there is more data that needs to be in the cache.

An unclustered B-Tree index also requires more memory to keep the important bits in cache. The important bits are all keys, which is much more memory than one key per leaf block for a clustered B-Tree.

Write Amplification

For now I assume that flash storage is used so I can focus on bytes written and ignore disk seeks when explaining write-amp. For a B-Tree a change is first recorded in the redo log and the page is eventually written back. The worst case occurs when the buffer pool is full with dirty pages and reading the to-be-modified page into the buffer pool forces a dirty page to be evicted and written back. In this case there is a redo log write and a page write back per row change. If the row is 128 bytes and the page is 4096 bytes then 4096+128 bytes are written to storage per 128 byte row change. The write-amp is 33 -- (4096 + 128) / 128. The write-amp is reduced when there is more one changed row on a page or when one row is changed many times before write back.

For the LSM the redo log is written immediately on a row change. When the memtable is full and flushed to level 0 then the row is written again. When level N is full and compaction is done from level N to level N+1 then one SST file is read from level N, ~10 SST files are ready from level N+1 and ~10 files are written back to level N+1. The write-amp to move rows from level N to N+1 is ~10 given my handwaving but in practice it is ~7 and I am waiting for a paper to be published to explain that. The total write-amp is computed from the writes required to move a row change from the memtable to the max level. The write-amp is 1 for the redo log, 1 for the memtable flush and usually ~1 for compacting to level 1. Assuming the LSM has levels 0 to 4 and the per-level write-amp is 7 for levels 2 to 4 then the total write-amp is 24 -- 1 + 1 + 1 + 7 + 7 + 7.

From the examples above the LSM has less write-amp than the B-Tree but those examples were not meant to be compared. An LSM tends to have less write-amp than a B-Tree. When using flash storage this means the device will last longer. When using disk storage this is likely to save more IO capacity for reads leading to higher QPS.

The IO pattern for a busy LSM is concurrent streams of IO. Each stream writes files sequentially, but the writes from different streams can end up in the same logical erase block (logical means it is striped across many NAND chips). The level of the leveled compaction LSM predicts the lifetime of the write. Writes to level 0 have a short lifetime. Writes to level 4 have a much longer lifetime. The write rates per level are similar -- there might be 10 MB/second of writes to levels 0 and 1 and then 20 MB/second of writes to levels 2 through 4. This means that logical erase blocks will end up with a mix of long and short lived data and the long-lived data will get copied out during flash garbage collection. Does your flash device use logical erase blocks? If it does then there will be write-amp from flash GC even with an LSM. Flash devices that support multi-stream will help a lot.

Space Amplification

A B-Tree gets space-amp from fragmentation, per-row metadata and fixed page sizes on disk. The leaf pages in a B-Tree are between 50% and 70% full when subject to random updates. When they are 2/3 full then space-amp is 1.5 and when they are 1/2 full then space-amp is 2. An update-in-place B-Tree like InnoDB uses ~20 bytes/row for metadata to support consistent read and transactions. The metadata overhead is much smaller for an LSM like MyRocks. Finally, when compression is done for InnoDB there will be wasted space because page sizes are fixed on disk. When a 16kb in-memory page compressed to 5kb for a table that uses 8kb pages on disk, then 3kb of the 8kb page on disk is wasted.

An LSM gets space-amp from old versions of key-value pairs. Leveled and size-tiered compaction differ significantly in this regard. With leveled compaction you are likely to get space-amp of 1.1 or 1.2 and with size-tiered compaction a more common result is space-amp of 2. Size-tiered compaction can suffer even more from additional but temporary space-amp when the max file is compacted and disk space is required for the old and new version of that file.

Compression reduces space-amp and for this reason I claim that space-amp of less than 1 is possible.

B-Tree vs LSM in practice

This post is longer than I expected, so I will write less here. This is a good spot for a joke about space-amp and write-amp. I have begun reporting on read, write and space amplification by normalizing the server's IO and CPU rates by QPS during benchmarks. I use iostat to get data for on-disk read-amp and write-amp by measuring reads/second, MB read/second and MB written/second. I frequently ignore writes/second because that mixes fast and slow writes (redo log writes are fast, page writes are slow). I use vmstat to measure the CPU utilization and that is a proxy for the in-memory read-amp and write-amp. Finally I look at the size of the database on disk to compare space-amp. The data is usually measured over 1 hour intervals to make it easy to detect when metrics get worse as a database ages. I try to run workloads for at least 12 hours to give things time to go bad.

Percona has begun doing this for some benchmark reports. I hope this becomes a common practice.

This is an example from running Linkbench for MongoDB with RocksDB and WiredTiger. I will soon have more results for MyRocks. I am thrilled that we have a copy-on-write B-Tree (WiredTiger) and an LSM (RocksDB) available as storage engines in MongoDB. We are also bringing RocksDB to MySQL via the MyRocks effort. The big deal for MyRocks compared to InnoDB is half the space-amp and half the write-amp. This has been measured on Linkbench and on the real workload. This is a big deal.


  1. I wonder why there isn't more data on over provisioning ssds. Some manufacturers overprovision certain models and 4k write performance on those tends to be more stable/higher than expected. We can overprovision drives too to address write amp.

    1. I think more details are not shared for business and marketing reasons.

      The user or the vendor can overprovision. That also means you pay more per usable GB of storage. This is trrading more space-amp for less write-amp.

    2. Did you use the leveldb to do the benchmark for LSM-tree? I may know the why the write amplification of compaction in each level is 7 instead of 10.

    3. Did you use leveldb to do the benchmarks for LSM-tree? I may know why the write amplification of compaction in each level is 7 instead 10.

    4. Can you share with me?

      I used RocksDB. I rarely use LevelDB because it isn't performant for high throughput workloads. This paper has some math that explains whey the per-level write amplification will be less than ten -- https://www.cs.cmu.edu/~hl/papers/msls-fast2016.pdf

  2. https://www.percona.com/blog/2018/05/10/why-weve-deprecated-mongorocks-in-percona-server-for-mongodb-3-6/

    Is MongoRocks completely done away?

    1. Probably. I am not aware of anyone who plans to maintain it going forward. We need an LSM in MongoDB and I hope that WiredTiger eventually officially supports it.

  3. Should we also include the bytes read during compaction into the read amplification for LSM?

    1. Good question. Answer requires nuance. Write-amp is about the overhead for writes, and I would include it there, but the model as currently expressed doesn't have much room for nuance.

      A b-tree storage engine also has some reads that occur during writes, so an LSM isn't alone in needing such nuance. The read overhead in this case is that a page can't be modified unless it is in the buffer pool. The same goes for secondary index maintenance, although the InnoDB change buffer can reduce the reads needed for that.

  4. Re: "When level N is full and compaction is done from level N to level N+1 then one SST file is read from level N, ~10 SST files are ready from level N+1 and ~10 files are written back to level N+1. "

    Shouldn't we take the worst case as 11? In case there were absolutely no duplicates between the files being compacted.

    1. The worst case can be more than 11. But I don't want to use too much detail in the description.