Wednesday, May 15, 2019

Index+log, v2

I put most index structures into one of three categories -- page-based, LSM or index+log. My focus is on databases larger than memory and I might be ignoring categories used for main memory DBMS. Over the past decade index+log has gotten more attention and this is my second attempt at explaining it.

My definition of index+log is simple -- data is appended to a log on each write, index entries point into the log and GC scans the log to copy live values and discard others. The log can be one file or many log segments. GC might have to search the index to determine whether a value is live. The value written to the log might include the key.

Bitcask was the first index+log system that I noticed but I assume there are earlier examples and just found one -- Berkeley DB Java Edition. While WiscKey made popular the term key-value separation and is presented as an LSM variant, I put it in the index+log category. Other interesting index+log systems include RocksDB BlobDBTitanDBForestDB and Faster. While many of the tree-based solutions use an LSM for the index that is not required by index+log and an LSM is not used by Berkeley DB Java Edition or ForestDB.

For an index+log solution that must cache all of the index and query the index during GC I suspect that an LSM is not the best choice for the index. Although if you already have an LSM (see RocksDB BlobDB) then I get it.

The summary of my LSM vs index+log comparison is:
  1. index+log has less write-amp but more space-amp
  2. index+log has much more cache-amp, maybe 10X more
  3. index+log might have more deferred CPU write-amp
  4. I have yet to validate my claims by running benchmarks with index+log implementations.
Note that #1 is a feature, #2 is not a feature and for #3 it depends. The key point is that the cost of faster writes from index+log is more cache-amp (more memory) and more IO for range queries. In production with RocksDB I frequently see write-amp=15 with space-amp=1.1. I assume that could be reduced with index+log to write-amp ~=5 and space-amp ~= 1.3. It might be possible to avoid or reduce the impact of #2 and #3 in future index+log implementations.

Amplification Factors

I am speculating on read, write, space and cache amplification for index+log because I have little hands on experience with index+log implementations. Another reason for speculation is that index+log allows for different structures for the index (b-tree, LSM, etc) which affects some of the estimates.

The amplification factors are:
  • cache - the cache-amp for index+log is likely to be much larger than for an LSM. To achieve at most one IO per point query index+log might need 10X (or more) memory versus an LSM. Clustering values in key order doesn't just make range queries faster, it also means an LSM only needs block indexes in memory (key+pointer per block) while index+log needs a key+pointer in memory for every KV pair in the database. When there are 10 KV pairs per block then this is 10X more memory. Even when the database has hot and cold keys they are likely to be interleaved on the same index leaf pages --  so all of those leaf pages must be in memory to avoid doing two IOs (one for the leaf page, one for the value) on a point query.
    • There is additional data that an LSM and b-tree need in memory to satisfy the one IO per query constraint and I described that in previous posts (mostly things from the non leaf/max level).
    • It might be valid to debate whether my one IO per point query constraint is valid, but this blog post is already long.
    • Another reason for the index to be cached is to avoid doing IO during GC when index searches are done to determine whether KV pairs are live.
  • space - I ignore space-amp for the index and focus on the log because the log is usually larger. With index+log the user can trade between write-amp and space-amp. With the variable pct_full representing the percentage of live data in the database (a value between 1 and 100) then:
    • space-amp = 100 / pct_full
    • write-amp = 100 / (100 - pct_full)
    • Just like with an LSM, previously written KV pairs are rewritten during GC with the index+log approach. Fortunately this is done less often with index+log. 
    • I assumed that block compression is used but that is harder to implement for index+log. The WiscKey paper doesn't describe a solution and the HashKV paper suggests using per-record compression, which will have a lower compression rate versus block as used by an LSM. I assume block compression can be done for index+log but it isn't trivial.
    • To explain the estimates when pct_full=X assume that all log segments have X% live data (yes, this is naive). When GC is done on a log segment X% is copied leaving (100-X)% free space in the newly written log segment. So in total 100% of a log segment is written for each (100 - pct_full)% of new data, which is the formula above.
    • Thus with pct_full=90 then space-amp is 1.1 while write-amp is 10. Comparing these with a leveled LSM the space-amp is similar while the write-amp is slightly better than what I see in production. To get a write-amp that is significantly better the cost is more space-amp. For example with pct-full=75 then write-amp=4 and space-amp=1.33. 
  • read (CPU) - see here for range seek/next. The summary is that when an LSM is used for index+log then the costs are similar to an LSM. When a b-tree is used for index+log then the costs are smaller.
  • read (IO) - see here for range seek/next. In the cache-amp estimate I assume that the index is cached so the only IO to be done is for the log. Therefore the index structure (LSM vs b-tree) doesn't matter.
    • point query - the IO read-amp is <= 1 because the log is not cached.
    • range seek - range seek doesn't do IO when the index is cached
    • range next - this is much larger for index+log than for an LSM because it might do one IO per call to range next because rows are not clustered by key in the log. When data is compressed then there also is the CPU overhead for decompression per KV pair.
  • write - by write I assume update (read-modify-write) rather than a blind write.
    • immediate CPU - the cost of an index search. See the section on read CPU for point queries above.
    • immediate IO - the cost of an optional redo log write for the index structure and then writing the (value) log. Note that the minimum size of a write done by the storage device might be 4kb even if the data written is much smaller. Doing an fsync per 128 byte value might have a write-amp of 32 if that write is really forced to storage and doesn't just linger in a capacitor backed write cache.
    • deferred CPU - the deferred CPU write-amp is the cost of index searches done during GC, unless the HashKV approach is used. With pct_full=75, write-amp=4 and space-amp=1.33 then GC is done ~4 times for each key and the deferred CPU cost is 4 index searches. When live KV pairs are copied by GC then there is also the CPU and IO overhead from updating the index entry to point to the new location in the log.
    • deferred IO - this is determined by the percentage of live data in the database. With pct_full=75 it is 4.


  1. This is a very interesting topic! One minor addition to this blog: the earliest system that uses index + log might be Postgres (, but this idea was abandoned because the disk was too slow back then.

    Also, Log-structured File System (LFS) may also be the early version of this idea (and now there is an in-memory version of LFS called RAMCloud).

    1. Thanks. I read that Postgres paper a long time ago and then forgot about it. Note that it is a page-based b-tree. The difference is that overwrite is not done, page write-back appends to the end of the log, and in practice GC is needed. I called that a CoW-S b-tree in past blog post where CoW-S stands for copy-on-write sequential:

      I enjoyed reading and re-reading the LFS paper as well as the log-structured FS research by Margo Seltzer.

      For index+log my definition requires changed rows to be written to the log, rather than changed pages.

      If early Postgres wasn't crash safe (I don't know for sure) maybe that problem is an artifact of the paper you cite.

    2. By the same blog post WiredTiger is CoW-R -- copy-on-write with random writeback to reclaim dead space,so GC isn't needed.