Tuesday, September 21, 2021

Review of DiffKV

This is a review of Differentiated Key-Value Storage Management for Balanced I/O Performance that was published in ATC 2021. This is a wonderful paper that resolves several of my concerns about key-value separation (here and here). The authors have experience with key-value separation via Titan which is part of TiDB.

The key idea in the paper is to use leveled compaction for keys and tiered compaction for values. Write-amplification is reduced by not using leveled compaction for values. Read-amplification is reduced, relative to classic key-value separation (see WiscKey), by using tiered compaction for values rather than logs. With classic key-value separation the worst case read-amp for a scan is the need to do a random read from the log for every qualifying key as that random read can include a block decompression and a storage IO. With DiffKV the use of tiered compaction for values provides some ordering to avoid that worst case.

While I enjoyed the paper, it didn't have performance results with compression enabled and I am curious about the impact of decompression overhead on point and range read latency.

A summary of the approach:

  • small values are stored inline in the LSM tree
  • large values are stored in logs, called vLogs. The paper explains a few improvements to make GC faster in this case. But mostly this is classic key-value separation.
  • medium values use the vTree (tiered compaction for values)
  • the user defines the two size limits that determines whether a value is small, medium or large
The vTree

Medium values are stored in the vTree which resembles an LSM that uses tiered compaction. A vTable is the unit of storage in the vTree. A vTable is 8MB by default and stores KV pairs in key order. A sorted group is a sequence vTables for which the keys don't overlap -- (vTable, sorted group) are similar to an (SST, sorted run) in RocksDB. A vTree has multiple levels where each level has one or more sorted groups and the levels have exponentially increasing size.

This is similar to tiered compaction for two reasons. First, there are multiple sorted groups per level. Second, when merges are done to move values from level N to N+1, the merge process only reads data from level N and then writes it to level N+1. It doesn't read and rewrite data already on level N+1. 

To reduce the number of times that values get rewritten the vTree has fewer levels than the LSM tree that stores keys. The smallest N levels in the key's LSM tree share the smallest vTree level. I assume the remaining levels in the key's LSM tree each have their own level in the vTree.

vTree GC

Keys in the LSM tree point into the vTree. When a value is moved between levels in the vTree the key's entry in the LSM tree must be updated to point to the new location in the vTree. DiffKV couples GC in the vTree with GC in the key's LSM tree to avoid extra writes. By coupling I mean that vTree GC is extra work added to compaction done on the key's LSM tree. When a key is moved from level N to level N+1 in the LSM tree then its value might be moved to the next level in the vTree.

By coupling it like this, DiffKV avoids the need to probe the index (key's LSM tree) to determine whether a value is live -- which is an extra overhead that occurs with classic key-value separation. It also avoids generating extra writes to change the point into the vTree that is stored with the key.

This above is called compaction-triggered merge and driven by compaction of the key's LSM tree. Another reason for doing vTree GC is called scan-triggered and triggered by scans that encounter regions of the vTree that need GC.

I am curious whether there is a need to also trigger GC based on vTables that have excessive space-amplification.

vTree rewrites

Leveled compaction has more write-amp than tiered because it rewrites previously written KV pairs. While vTree GC frequently avoids the need to do rewrites, there is one case where a rewrite is needed and I didn't learn enough about how DiffKV handles this.

There can be space wasted from vTables that have low utilization rates because most of the values in the vTable have been deleted. In this case something should be done to copy out (rewrite) the values to a new vTable. And when values are moved the key LSM tree must be updated to reference their new location. For workloads with updates and deletes this will be needed in the largest level of the vTree and might be needed for smaller levels.

No comments:

Post a Comment

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...