Tuesday, July 24, 2018

Index+log - an alternative to LSM

While there are many algorithms for index structures, I think most of them can be put into one of three families -- page-based, LSM or index+log.

I have not spent much time with index+log implementations, but I think they are interesting. These include BitCask, ForestDB, WiscKey, Faster, HashKV and RocksDB BlobDB. With the index+log approach keys are stored separately from values, values are written to log segments and GC is required. There is variety in the index algorithms. A hash index is used by BitCask and Faster while the others use persistent tree indexes -- a trie for ForestDB, an LSM for WiscKey, HashKV and RocksDB BlobDB.
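To make the layout concrete, here is a minimal sketch in Python of the index+log idea. It assumes an in-memory dict as the index, which is closer to BitCask's hash index than to the tree indexes above, and the names (LogStore, SEGMENT_BYTES) are made up for illustration. A real implementation adds persistence for the index, crash recovery, checksums and GC of stale log space.

SEGMENT_BYTES = 1 << 20  # hypothetical segment size

class LogStore:
    def __init__(self):
        self.index = {}                # key -> (segment id, offset, length)
        self.segments = [bytearray()]  # append-only value log segments

    def put(self, key, value):
        seg_id = len(self.segments) - 1
        if len(self.segments[seg_id]) + len(value) > SEGMENT_BYTES:
            self.segments.append(bytearray())  # roll to a new log segment
            seg_id += 1
        seg = self.segments[seg_id]
        offset = len(seg)
        seg.extend(value)                               # value goes to the log
        self.index[key] = (seg_id, offset, len(value))  # key goes to the index
        # an overwrite leaves the old value behind as garbage in its segment,
        # which is why GC is required

    def get(self, key):
        seg_id, offset, length = self.index[key]
        return bytes(self.segments[seg_id][offset:offset + length])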

How might index+log do in terms of read, write, space and cache amplification? I will focus on implementations that use a tree index, but some of what I write is true for a hash index. My summary for page-based and LSM is in a previous post.

An important part of the answer is that it depends on the index algorithm and index+log allows for diversity. The implementations above use an LSM or trie. There might be an implementation in the future that uses a b-tree although tries are getting more popular (see an interesting SIGMOD paper).

My overview of index+log:
  • Space and write amplification for the value log have an inverse relationship. Space amplification is reduced by doing GC more frequently, but that increases write amplification. A simple model that assumes a uniform distribution explains the relationship. When pct_full is the percentage of storage that is used, then space-amp = 100 / pct_full and write-amp = 100 / (100 - pct_full). When storage is 80% full then space-amp is 100/80 or 1.25 and write-amp is 100/20 or 5. There is a short sketch of this model after the list.
  • The write-amp in the previous bullet point is IO write-amp. There is also CPU write-amp, and for tree indexes that is the number of comparisons per insert. This includes the comparisons done upfront when the insert is processed and the comparisons done over time during GC on behalf of that insert. All implementations have an upfront cost, and all except HashKV must search the index during GC. The upfront cost is equivalent to a point query because it requires a tree search. When IO write-amp is 5, as above, GC does 5 tree searches, so the total CPU write-amp is equivalent to 6 point queries. The CPU cost of a point query depends on the index algorithm -- for an LSM it costs more than for a b-tree (sometimes 2X more, even with bloom filters). But in general the CPU write-amp is 1 + IO-write-amp point queries (except for HashKV).
  • The CPU write-amp for index+log is usually much worse than for an LSM when index+log uses an LSM for the index -- again, HashKV is the exception. With an LSM, if write-amp is 20 then each KV pair is rewritten 20 times by compaction to reach the max level of the LSM tree. The CPU cost is one comparison per rewrite plus N comparisons to insert the KV pair into the memtable, and a common value for N is 20.
  • I ignore write-amp and space-amp for the persistent index. That depends on the index algorithm -- page-based or LSM. I wonder if index+log could be used for the index as well.
  • Cache amplification can be bad for index+log for algorithms that require an index probe per KV pair during GC -- HashKV doesn't probe the index during GC. The index probe must be efficient to support high GC rates that follow from high insert rates. This means that the index must be in memory with an index entry in memory for every key. That is a lot of memory unless the value/key size ratio is large.
  • Read amplification is different for point and range queries. First, the CPU component of read-amp depends on the algorithm used for the index (page-based vs LSM). Assuming the index fits in memory (and except for HashKV, the index must be in memory) then the IO component of read-amp comes when reading the value log. For a point query the worst case is one storage read. For a range query the worst case is one storage read per KV pair in the range and that can be expensive (more seeks with disk, more request processing with SSD, decompression per storage read). 
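The amplification model from the first two bullet points, written out as code. This is just the uniform-distribution model restated in Python; it ignores the cost of maintaining the index itself and, like the bullets, assumes one tree search per insert plus one per GC rewrite (HashKV skips the GC searches).

def space_amp(pct_full):
    # live data uses pct_full% of storage, so total/live = 100 / pct_full
    return 100.0 / pct_full

def io_write_amp(pct_full):
    # GC rewrites the live values in each segment it cleans; when segments
    # are mostly live (high pct_full) more bytes are rewritten per byte freed
    return 100.0 / (100.0 - pct_full)

def cpu_write_amp_point_queries(pct_full):
    # one tree search upfront at insert time plus one tree search per GC
    # rewrite (HashKV avoids the searches during GC)
    return 1.0 + io_write_amp(pct_full)

for pct in (50, 80, 90):
    print(pct, space_amp(pct), io_write_amp(pct), cpu_write_amp_point_queries(pct))
# at 80% full: space-amp = 1.25, IO write-amp = 5, CPU write-amp ~= 6 point queries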
In summary, the potential problems with index+log are bad cache-amp (index must fit in memory), bad CPU write-amp (too many compares/insert) and bad IO read-amp for range queries. Note that HashKV might avoid the first two problems. But all index structures have potential problems and for some workloads index+log is a great choice.

There is an optimization for index+log that might not have been described in previous work. If the log segments are ordered then multiple ordered log segments can be merged over time to create longer runs of ordered data. This can reduce the random IO penalty for range queries. A log segment isn't ordered when first written, but it can be sorted the first time GC is done for it.
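Below is a minimal sketch, in Python, of how that optimization might work. It assumes GC operates on (key, value) pairs and is free to rewrite segments; the function names are made up for illustration and this is not how any of the implementations above do it.

import heapq

def sort_segment(segment):
    # first GC pass for a segment: order its (key, value) pairs
    return sorted(segment)

def merge_segments(sorted_segments):
    # later GC passes: k-way merge of already-sorted segments into one longer
    # ordered run, so a range query does fewer random reads
    return list(heapq.merge(*sorted_segments))

run = merge_segments([sort_segment([(3, 'c'), (1, 'a')]),
                      sort_segment([(2, 'b'), (4, 'd')])])
# run == [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]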
