Wednesday, September 25, 2019

Reducing write and space amplification in a b-tree

This started as a review of FineLine: Log-structured Transactional Storage and Recovery from VLDB 2019. The paper is wonderful but dense so I didn't take the time to understand all of it. It uses a log-structured approach to avoid the complexity of ARIES for recovery. While it can support a variety of index structures a b-tree is the obvious target.

While it wasn't a stated goal for the paper, the FineLine approach reduces space and write amplification for a b-tree. Maybe a focus on that is motivation for another paper from the authors. An LSM with leveled compaction has much less space and write amplification than a b-tree at the cost of more CPU. Reducing space and write amplification for a b-tree makes it more competitive when storage efficiency is a priority. The paper has good performance results for databases less than 10X the size of RAM. I am curious what happens when the database is 100X the size of RAM.

I look forward to corrections and apologize for not distinguishing between page and block. Fortunately I don't have to write about lock vs latch here. I also look forward to the next paper from the authors.

Improving b-tree storage efficiency

I previously explained how the InnoDB b-tree can use 2X more space and write 10X more bytes per transaction to storage compared to the MyRocks LSM. This is a big deal if you prefer to use less SSD and worry about endurance (because QLC is coming). But there are several things that can be done to make a b-tree better at storage efficiency.
  1. Variable length pages
  2. Block and index prefix compression
  3. Defer dirtying pages and page write-back
A traditional b-tree subject to random updates will have leaf pages that are <= 70% full. This wastes at least 30% of storage and space amplification (space-amp) is at least 1.42X (100 / 70). Variable length pages can avoid that waste. By variable length pages I mean that when the compressed page is X bytes then it uses at most X+Y bytes on disk where Y is <= 512. One reason for such a small value for Y is that a common page size for OLTP is 4kb or 8kb. The alternatives to supporting variable length on-disk pages has been to rely on file system hole punch (see here and here for my bad experience with that in InnoDB) or a proprietary file system. Variable length pages will cause space-amp from external fragmentation as there will be wasted space that has yet to be reclaimed.

Both index prefix compression and block compression are needed. Index prefix compression reduces the size of pages before and after compression. These should be optional because some workloads don't want the CPU overhead from block compression (even with fast lz4 and zstd) or from undoing prefix compression. Block compression usually leads to variable length pages.

Variable length pages and compression help mostly with space-amp. To reduce write amplification (write-amp) a b-tree can defer making pages dirty and the write-back that must follow. The FineLine paper has new ideas for this. The old ideas have expensive side-effects (slow crash recovery) including using a large buffer pool with 99% dirty pages and using a huge redo log. I wonder what the side-effects will be for FineLine.

The OSS b-tree engines that I have used (InnoDB, WiredTiger) can do more to reduce space-amp. InnoDB supports block compression but not variable length pages (by my definition) so the implementation is complex and some space-amp savings are lost. WiredTiger supports variable length pages, block compression and index prefix compression. From the WiredTiger docs the on-disk page size is a multiple of allocation_size which is >= 4kb which doesn't meet my requirement for variable length above. Regardless I am still a huge fan of InnoDB and WiredTiger. There is also more to be done to reduce write-amp in OSS engines. InnoDB fuzzy checkpoint combined with a large redo log can reduce write back. The InnoDB change buffer defers dirtying non-unique secondary index pages. WiredTiger doesn't support fuzzy checkpoint. I hope that changes. But neither have anything that can defer write-back like FineLine.


The goals for FineLine include avoiding the complexity of ARIES, instant recovery, larger than memory databases and a compromise between read and write optimized solutions -- better read performance than an LSM, better write performance than a b-tree.

There are a few logical concepts to grasp when reading the paper -- node and indexed log. FineLine uses node rather than page. A node has an ID and variable length size. Nodes can reference other nodes -- for a b-tree leaf nodes will be referenced by interior nodes. Log entries are either node format (a node image) or describe a change to a node. The indexed log is the database and supports append and fetch. The indexed log is unlikely to be implemented by a log file.
FineLine uses a single-storage approach. The indexed log is the database. For the paper the indexed log was implemented using a Foster b-tree, forked from ShoreMT, that might be a variant of a copy-on-write b-tree. Single-storage means there is only a log to write -- redo log records are written on commit and compaction runs in the background to merge redo log records and occasionally write page images (node-format log records) when there are many redo records for a given page. The paper describes a simple API for the indexed log -- append and fetch node.

Group commit is supported and all log records to be written are assigned a partition number and then ordered by node ID. I assume the partition number is monotonically increasing per commit group. A group commit creates a new partition in the indexed log. A bloom filter might also be created to make it faster to determine whether a partition includes records for a node. Only redo is written to the indexed log. There are no undo records. One reason for better write performance is less logging.

Adjacent partitions are merged by background threads. This has something in common with LSM compaction. The paper was vague but there are times where node images (node format log records) are written back to the indexed log when there are too many log records for that node. This can be done via cost-based decisions that consider read vs write efficiency. There was some work in Hekaton to figure out when to collapse per-node redo log records.

On a buffer cache miss the fetch operation must first find the most recent node image (node format log record) and then find/apply all of the redo log records for that node. In the worst case this means searching many partitions. Bloom filters can make this search faster.

A buffer manager caches node images. Some of the node images might be expensive to recreate as explained in the previous paragraph. While the paper claims FineLine supports databases larger than memory I wonder how much larger because of the cost of recreating node images. The paper provide tpc-b results where the database is no more than 10X the size of memory. I am curious about 100X larger.


  1. Thank you for this great review of our paper! The benefits on write and space amplification are also very important, and you are right that we should have mentioned them more prominently.

    Our system was so far not optimized for the 100x case, and in that case it is hard to beat a B-tree with ARIES. The tricks required to do so would deviate too much from the main point of the paper, so we left them out. Nevertheless, is is a great suggestion for future work (which is in progress already).