Friday, November 2, 2018

Converting an LSM to a B-Tree and back again

I wonder if it is possible to convert an LSM to a B-Tree. The goal is to do it online and in-place -- so I don't want two copies of the database while the conversion is in progress. I am interested in data structures for data management that adapt dynamically to improve performance and efficiency for a given workload. 

Workloads change in the short and long term. I hope that data structures can adapt to the change, and converting between an LSM and a B-Tree is one way to adapt. This is more likely to be useful when the data structure supports some kind of partitioning, in the hope that different workloads can be isolated to different partitions -- and then some partitions can use an LSM while others use a B-Tree.

LSM to B-Tree

A B-Tree is one tree. An LSM is a sequence of trees. Each sorted run in the LSM is a tree. With leveled compaction in RocksDB there are a few sorted runs in level 0 (L0) and then one sorted run in each of L1, L2 up to the max level (Lmax). 

A B-Tree persists changes by writing back pages -- either update-in-place or copy-on-write (UiP or CoW). An LSM persists changes by writing and then re-writing rows. I assume that page write back is required if you want to limit the database to one tree, and that row write back implies there will be more than one tree. 
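
To make the contrast concrete, here is a minimal sketch in C++ of the two persistence models. None of this is RocksDB or WiredTiger code and all names are illustrative: std::map stands in for an ordered tree.

#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

using Key = std::string;
using Value = std::string;
using PageId = uint64_t;

// Row write back: a change is buffered in the memtable and later re-written
// into new trees (sorted runs) by flush and compaction, so the database is
// always more than one tree.
struct LsmWritePath {
  std::map<Key, Value> memtable;                  // the in-memory tree
  std::vector<std::map<Key, Value>> sorted_runs;  // on-disk trees, newest first

  void Put(const Key& k, const Value& v) { memtable[k] = v; }

  void Flush() {
    // Rows are re-written into a new tree; no existing page is overwritten.
    sorted_runs.insert(sorted_runs.begin(), memtable);
    memtable.clear();
  }
};

// Page write back: a change dirties a copy of a page in the page cache and the
// page is later written back to the single tree -- UiP writes it to its home
// location, CoW writes it to a new location.
struct BTreeWritePath {
  std::map<PageId, std::string> page_cache;  // page id -> page bytes
  std::set<PageId> dirty;

  void ModifyPage(PageId id, const std::string& new_bytes) {
    page_cache[id] = new_bytes;
    dirty.insert(id);
  }

  void WriteBack() {
    // Write each dirty page back (UiP or CoW); either way there is one tree.
    dirty.clear();
  }
};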

There are two things that must be done online and in-place:
  1. Convert the LSM from many trees to one tree
  2. Convert from row write back to page write back
Note that my goal has slightly changed. I want to move from an LSM to a data structure with one tree. For the one-tree solution a B-Tree is preferred but not required.

The outline of a solution:
  1. Reconfigure the LSM to use 2 levels -- L0 and L1 -- and 3 trees -- memtable, L0, L1.
  2. Disable the L0. At this point the LSM has two trees -- memtable and L1.
  3. Flush the memtable and merge it into the L1. Now there is one tree.
  4. After the flush, disable the memtable and switch to a page cache. A change is now made to a copy of the L1 block in the page cache, and that copy eventually gets written back via UiP or CoW.
The outline above doesn't explain how to maintain indexes for the L1. Note that after step 2 there is one tree on disk and the layout isn't that different from the leaf level of a B-Tree. The interior levels of the B-Tree could be created by reading/rewriting the block indexes stored in the SSTs.
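
Here is a minimal sketch of that index rebuild, in C++ with hypothetical names and no real RocksDB API: treat the block index of the one remaining SST as a sorted list of (first key, block offset) pairs, then pack those entries into interior nodes bottom-up until a single root remains.

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct IndexEntry {
  std::string first_key;  // first key covered by the child
  uint64_t child;         // data block offset (leaf level) or node id (interior)
};

using Node = std::vector<IndexEntry>;  // one interior node, at most `fanout` entries
using Level = std::vector<Node>;       // one interior level of the B-Tree

// Build interior levels bottom-up from the SST block index until one node
// (the root) remains. Returns levels ordered from just above the data blocks
// up to the root. Writing the nodes back to disk is not shown.
std::vector<Level> BuildInteriorLevels(std::vector<IndexEntry> entries,
                                       size_t fanout) {
  std::vector<Level> levels;
  uint64_t next_node_id = 0;
  while (entries.size() > 1) {
    Level level;
    std::vector<IndexEntry> parent_entries;  // entries for the level above
    for (size_t i = 0; i < entries.size(); i += fanout) {
      Node node(entries.begin() + i,
                entries.begin() + std::min(entries.size(), i + fanout));
      // The parent routes by the first key of this node and points at its id.
      parent_entries.push_back({node.front().first_key, next_node_id++});
      level.push_back(std::move(node));
    }
    levels.push_back(std::move(level));
    if (levels.back().size() == 1) break;  // that single node is the root
    entries = std::move(parent_entries);
  }
  return levels;
}

The fanout would come from the target interior page size, and the leaf-level entries don't have to be recomputed -- they come from the block indexes the SSTs already store.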

B-Tree to LSM

The conversion can also be done in the opposite direction (B-Tree to LSM):
  1. Treat the current B-Tree as the max level of the LSM tree. While it might help to flush the page cache, I don't think that is required. This is easier to do when your LSM uses a B-Tree per level, as done by WiredTiger.
  2. Record new changes for insert, update, delete in a memtable rather than a page cache.
  3. When the memtable is full, flush it to create a new tree (sorted run, SST) on disk.
  4. Eventually start to do compaction.
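
As a rough sketch of what steps 1 to 3 imply for the read and write paths -- assuming nothing about a specific engine, with std::map standing in for both the memtable and the trees on disk -- the old B-Tree becomes the max level, new changes land in a memtable, and a full memtable is flushed as a new sorted run that reads must check before the max level.

#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

using Tree = std::map<std::string, std::string>;  // stand-in for an ordered tree

struct ConvertedStore {
  Tree memtable;                 // step 2: new changes land here
  std::vector<Tree> runs;        // step 3: flushed memtables, newest first
  Tree lmax;                     // step 1: the old B-Tree, now the max level
  size_t memtable_limit = 1000;  // illustrative flush threshold

  void Put(const std::string& k, const std::string& v) {
    memtable[k] = v;
    if (memtable.size() >= memtable_limit) Flush();
  }

  std::optional<std::string> Get(const std::string& k) const {
    // Newest data wins: memtable, then newer runs, then the old B-Tree.
    if (auto it = memtable.find(k); it != memtable.end()) return it->second;
    for (const Tree& run : runs)
      if (auto it = run.find(k); it != run.end()) return it->second;
    if (auto it = lmax.find(k); it != lmax.end()) return it->second;
    return std::nullopt;
  }

  void Flush() {
    // Step 3: the memtable becomes a new immutable tree (sorted run).
    runs.insert(runs.begin(), std::move(memtable));
    memtable.clear();
    // Step 4 (not shown): compaction eventually merges runs into lmax.
  }
};

Deletes are omitted: they would need tombstones in the memtable and the runs so that a key removed after the switch does not reappear from the old B-Tree.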

2 comments:

  1. https://www.eecs.harvard.edu/~margo/cs165/papers/gp-lsm.pdf

    1. Rusty is awesome but this bLSM doesn't approach B-Tree read performance/efficiency if you consider IO and CPU (table 1 only considers IO). I have seen more than one interesting academic paper ignore CPU and only consider IO when making this claim.

      The good news is that considering both IO and CPU efficiency is an area where we need a few more papers.

