Friday, May 17, 2019

index+log: implementations

My take on index+log systems like WiscKey is that they are neither better nor worse than an LSM - it all depends on your workload. But I am certain that we know much more about an LSM than about the index+log systems. Hopefully that changes over time as some of them are thriving.

The first index+log system that I read about was Berkeley DB Java Edition. The design paper is worth reading. Since then there have been a few more implementations and papers that I describe here. This list is probably incomplete: Bitcask, ForestDB, WiscKey, HashKV, TitanDB and RocksDB BlobDB.

At this point the systems that are getting into production, TitanDB and BadgerDB, use an LSM for the index. I wonder if an index structure that supports update-in-place would be better especially when the index must be cached because I expect the CPU read-amp for an LSM to be about 2X larger than for a b-tree and a b-tree supports update-in-place which makes it easier to relocate values during GC.

While I like index+log systems I think that papers and marketing tend to overstate LSM write-amp. For production RocksDB I usually see write-amp between 10 and 20. I expect that index+log could achieve something closer to 5. This paper from CMU explains one reason why per-level write-amp in an LSM is less than the per-level fanout (less than 10). Write skew is another reason.

The systems

Bitcask was part of the Riak effort.
  • The index is an in-memory hash table. The index isn't durable and the entire log has to be scanned to rebuild it on startup -- whether or not this was after a crash or a clean shutdown. The slow startup is a problem.
  • The value log is circular and GC copies live values from the tail to the head of the log. Liveness is determined by an index search. 

ForestDB was a finalist in the SIGMOD 2011 student programming contest. Eventually the project and creator moved to Couchbase. It is worth reading about here and on the github page. I published blog posts that compare ForestDB and RocksDB: 1, 2, 3 and 4. Google finds more interesting things to read.
  • The index is a space-efficient trie.
  • The value log might have log segments. GC copies live values to the head of the log. Liveness is determined by an index search.

WiscKey is described as an LSM with key-value separation and made popular the term key-value separation. I put it in the index+log family of index structures.
  • The index is an LSM. There is no redo log for the index as it can be recovered from the head of the value log.
  • Kudos for many references to amplification factors. The paper uses bytes read for read-amp. I prefer to consider both IO and CPU for read-amp with key comparisons for CPU and storage reads for IO.
  • It doesn't mention that it has more cache-amp than an LSM, but few papers mention that problem. Shrinking the LSM by keeping large values separate doesn't make the index parts of it (filter and index blocks) easier to cache as they are already separate from the data blocks. There is more to cache with index+log as I describe here.
  • It claims to overcome the (worst-case) problem of one storage IO per KV pair on a range scan by fetching in parallel. Assuming the storage device has enough spare IO this might hide the problem but it doesn't fix it. With many workloads there isn't spare IO and extra IO for reads also means extra CPU for decompression.
  • The value log is circular and single-threaded GC copies live values to the head of the log. Liveness is determined by an index search. I assume that multi-threaded GC is feasible.
  • The paper isn't clear about the total write-amp that might occur from both the first write to the value log and GC that follows.
  • Compression isn't explained.

BadgerDB is a golang implementation, and much more, of the WiscKey paper.
  • It has many features and many production use cases. This is impressive. 
  • GC is scheduled by the user. Based on Options.NumCompactors I assume it can be multi-threaded.
  • The docs state that the LSM can be served from RAM because the values are elsewhere. That is true but I don't consider it a feature. It must be in RAM to avoid IO from liveness queries done by GC. An LSM isn't a monolithic thing. There are index blocks, data blocks and filter blocks and most of the LSM, data blocks from the max level, don't have to be cached. 
  • There is extra work on reads to find values that have been moved by GC. See the comments about BadgerDB here.

HashKV is an interesting paper that avoids index queries during GC.
  • Hash-based data grouping distributes KV pairs by hash into one of N logs. GC is probably done by scanning a log twice -- once to get the keys and the second time to relocate the live values. A value is live when the most recent key is not a tombstone. A value might be live when needed for a snapshot. GC doesn't do index searches so the index doesn't have to be cached to make GC efficient but you might want to cache it to avoid doing index IO on queries -- and this index is still much larger than the block index for an LSM.
  • Hotness awareness copies cold values to a cold segment to avoid repeated GC for a value that doesn't get updated or deleted. A header for the value is kept in the non-cold log.
  • Small values are stored inline in the LSM.
  • I am curious if more log groups means more write-amp. See my comment about fsync in a previous post.
  • I am curious whether managing the hash buckets is easy. The goal is to make sure that keys for a segment group fit in memory. The range and number of buckets must change over time. Does this have anything in common with linear and extendible hashing?

TitanDB is part of TiDB and TiDB is thriving.
  • A WAL is used for new writes. This might make it easier to compress data on the first write to the value logs.
  • GC appears to do index searches to determine liveness.
  • Compression is per-record. I hope this does per-block in the future.
  • It might let the user tune between space-amp and write-amp via discardable_ratio.
  • This is compatible with most of the RocksDB API.k

RocksDB BlobDB is an extension to RocksDB that uses log segments for large values and stores small values in the LSM. GC copies live values and liveness is determined by an index search.

Future work

Future work for index+log systems includes:
  • Determine whether a b-tree is better than an LSM for the index structure
  • Determine whether the HashKV solution is the best way to avoid liveness queries during GC.
  • If an LSM is used for the index structure determine efficient ways to support relocating values during GC without extra overheads and complexity during read processing.
  • Determine whether clever things can be done during GC.
    • Block compression is easier than on first write.
    • Hot/cold value separation is possible (see HashKV). This is an example of generational GC even if we rarely mention that for index structures.
    • Values in a log segment can be ordered by key rather than by time of write during GC. GC can also merge multiple ordered segments to create longer sorted runs. I wonder if it is then possible to use block indexes (key+pointer per block rather than per row) to reduce cache-amp for such log segments.


  1. Thanks for sharing this series of index+log! I noticed one problem with index + log approach when I was writing the survey is that the reported range query performance by those papers is often too good to be true. As noted in this blog, they all use parallel disk I/Os to improve range queries, but I believe that's unfair to the baseline method and is kind of cheating.

    Moreover, when the record size is large, e.g., it matches the disk page size (4KB), index + log seems to be a better approach, especially it suffers less from range queries. But I wonder do you have any experience about how common large records are in practical workloads?

    1. Range queries I cared about were from 1 to 10,000 rows. When running MySQL with disk arrays many years ago random IO on such queries was huge problem and OSC (online schema change for MySQL) was created to make possible a schema change that gave us a covering index for such queries.

      I don't think parallel IO is cheating, but I do find it odd to call that a feature or optimization.

  2. Thanks for another great post.This is so amazing.