Monday, November 19, 2018

Review of TRIAD: Creating Synergies Between Memory, Disk and Log in Log Structured Key-Value Stores

This is a review of TRIAD, which was published in USENIX ATC 2017. It explains how to reduce write amplification for RocksDB leveled compaction, although the ideas are useful for many LSM implementations. I share a review here because the paper has good ideas. It isn't easy to keep up with all of the LSM research, even when limiting the search to papers that reference RocksDB, and I didn't notice this paper until recently.

TRIAD reduces write amplification for an LSM with leveled compaction. Across a variety of workloads it gets up to 193% more throughput, up to 4X less write amplification, and spends up to 77% less time doing compaction and flush. Per the RUM Conjecture, improvements usually come at a cost, and the cost in this case is more cache amplification (more memory overhead per key) and possibly more read amplification. I assume this is a good tradeoff in many cases.

The paper explains the improvements via 3 components -- TRIAD-MEM, TRIAD-DISK and TRIAD-LOG -- that combine to reduce write amplification.

TRIAD-MEM

TRIAD-MEM reduces write-amp by keeping frequently updated keys (hot keys) in the memtable. It divides keys in the memtable into two classes: hot and cold. On flush the cold keys are written into a new L0 SST while the hot keys are copied over to the new memtable. The hot keys must be written again to the new WAL so that the old WAL can be dropped. TRIAD-MEM tries to keep the K hottest keys in the memtable and there is work in progress to figure out a good value for K without being told by the DBA.
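
To make the flush-time split concrete, here is a minimal sketch in Python. It assumes a per-key write counter (the extra 4 bytes/key described in the next paragraph) and a hotness cutoff K; the names (Entry, flush_split) are mine, not the paper's or RocksDB's.

    from collections import namedtuple

    # Illustrative memtable entry: the value plus the per-key write counter
    # (the extra 4 bytes/key that TRIAD-MEM adds).
    Entry = namedtuple("Entry", ["value", "write_count"])

    def flush_split(memtable, k):
        """Split a memtable (dict of key -> Entry) into the k hottest keys,
        which stay in memory, and the cold rest, which go to a new L0 SST."""
        by_heat = sorted(memtable.items(), key=lambda kv: kv[1].write_count, reverse=True)
        hot = dict(by_heat[:k])    # copied to the new memtable and re-logged to the new WAL
        cold = dict(by_heat[k:])   # written to a new L0 SST
        return hot, cold

    # Example: with k=2 the two most-updated keys stay in memory.
    mt = {"a": Entry("1", 10), "b": Entry("2", 1), "c": Entry("3", 7)}
    hot, cold = flush_split(mt, k=2)   # hot has a and c, cold has b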

An extra 4 bytes/key is used in the memtable to track write frequency and identify hot keys. Note that RocksDB already uses 8 bytes/key for metadata. So TRIAD-MEM has a cost in cache-amp but I don't think that is a big deal.

Assuming the per-level write-amp from the memtable flush is 1, this reduces it to 0 in the best case where all keys are hot.

TRIAD-DISK

TRIAD-DISK reduces write-amp by delaying L0:L1 compaction until there is sufficient overlap between keys to be compacted. TRIAD continues to use an L0:L1 compaction trigger based on the number of files in the L0 but can trigger compaction earlier when there is probably sufficient overlap between the L0 and L1 SSTs.

Overlap is estimated via HyperLogLog (HLL), which requires 4kb/SST. The estimate is the following, where file-i is the i-th SST under consideration, UniqueKeys is the estimated number of distinct keys across all of the SSTs, and Keys(file-i) is the number of keys in the i-th SST. The paper states that both UniqueKeys and Keys are approximated using HLL, but I assume that per-SST metadata already has an estimate or exact count of the number of keys in an SST. The formula for overlap is:
    UniqueKeys(file-1, file-2, ... file-n) / sum( Keys( file-i))
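
A small sketch of that estimate, using exact Python sets in place of the 4kb/SST HLL sketches, so the arithmetic matches even though the data structure does not:

    # Overlap estimate from the paper, with exact sets standing in for the
    # per-SST HLL sketches. In TRIAD both UniqueKeys and Keys(file-i) come from HLL.

    def overlap_ratio(sst_key_sets):
        """UniqueKeys(file-1 .. file-n) / sum(Keys(file-i)).
        Close to 1 means the files barely overlap; close to 1/n means heavy overlap."""
        unique_keys = len(set().union(*sst_key_sets))
        total_keys = sum(len(keys) for keys in sst_key_sets)
        return unique_keys / total_keys

    # Two L0 SSTs that share half of their keys:
    print(overlap_ratio([{1, 2, 3, 4}, {3, 4, 5, 6}]))   # 6 unique / 8 total = 0.75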

The benefit from early L0:L1 compaction is less read-amp, because there will be fewer sorted runs to search on a query. The cost from always doing early compaction is more per-level write-amp, which is estimated by size(L1 input) / size(L0 input). TRIAD-DISK provides the benefit with less cost.

In RocksDB today you can manually schedule early compaction by setting the trigger to 1 or 2 files, or you can always schedule it to be less early with a trigger set to 8 or more files. But this setting is static. TRIAD-DISK uses a cost-based approach to do early compaction when it won't hurt the per-level write-amp. This is an interesting idea.
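
A sketch of what such a cost-based trigger might look like, reusing overlap_ratio from the sketch above. The threshold and file-count values are illustrative, not TRIAD's actual constants.

    # Illustrative cost-based L0:L1 trigger. The thresholds are made up, not TRIAD's,
    # and overlap_ratio is the function from the previous sketch.

    def should_compact_l0(l0_key_sets, max_l0_files=8, overlap_threshold=0.6):
        """Compact early only when the estimated overlap is high (low ratio),
        so the duplicate keys removed by the merge keep per-level write-amp low;
        otherwise wait for the usual file-count trigger."""
        if len(l0_key_sets) >= max_l0_files:
            return True                     # normal trigger: too many L0 files
        if len(l0_key_sets) < 2:
            return False                    # nothing to merge yet
        return overlap_ratio(l0_key_sets) <= overlap_threshold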

TRIAD-LOG

TRIAD-LOG changes memtable flush to reduce write-amp. Data in an L0 SST has recently been written to the WAL, so TRIAD uses the WAL in place of writing a new L0 SST. Something extra, an index into the WAL, is written on memtable flush because everything in the L0 must be indexed. The WAL used as an SST (called the CL-SST, for commit-log SST) will be deleted when it is compacted into the L1.

There is cache-amp from TRIAD-LOG. Each key in the CL-SST (L0) and maybe in the memtable needs 8 extra bytes -- 4 bytes for CL-SST ID, 4 bytes for the WAL offset.
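
A sketch of the index that a TRIAD-LOG flush might write instead of a full L0 SST; the (CL-SST id, WAL offset) pair per key corresponds to the extra 8 bytes just described, and the function name and layout are mine, not the paper's.

    # Sketch of the TRIAD-LOG idea: on flush, write a small index into the WAL
    # (forming the commit-log SST) instead of rewriting the cold key-value pairs.
    # The (cl_sst_id, offset) pair per key is the extra 8 bytes; names are mine.

    def build_cl_sst_index(cl_sst_id, wal_entries):
        """wal_entries: list of (offset, key) pairs in WAL order.
        Returns key -> (cl_sst_id, offset), keeping the latest offset per key,
        sorted by key so the L0 entry can be searched like a regular SST."""
        latest = {}
        for offset, key in wal_entries:
            latest[key] = (cl_sst_id, offset)   # later writes win
        return dict(sorted(latest.items()))

    # The values stay in the WAL file; only this index is written at flush time.
    index = build_cl_sst_index(cl_sst_id=7, wal_entries=[(0, "b"), (64, "a"), (128, "b")])
    # index == {"a": (7, 64), "b": (7, 128)}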

Assuming the per-level write-amp from the memtable flush is 1, this reduces it to 0 for cold keys.

Reducing write amplification

The total write-amp for an LSM tree with leveled compaction is the sum of:
  • writing the WAL = 1
  • memtable flush = 1
  • L0:L1 compaction ~= size(L1) / size(L0)
  • Ln compaction for n>1 ~= fanout, the per-level growth factor, usually 8 or 10. Note that this paper explains why it is usually a bit less than fanout.
TRIAD avoids the write-amp from memtable flush thanks to TRIAD-MEM for hot keys and TRIAD-LOG for cold keys. I will wave my hands and suggest that TRIAD-DISK reduces the L0:L1 write-amp from 3 to 1 based on the typical LSM configuration I use. So TRIAD reduces the total write-amp by 1 + 2 = 3.

Reducing total write-amp by 3 is a big deal when the total write-amp for the LSM tree is small, for example <= 10. But that only happens when there are few levels beyond the L1. Assuming you accept my estimates above, the per-level write-amp is ~8 for both L1:L2 and L2:L3. The total write-amp for an LSM tree without TRIAD would be 1+1+3+8 = 13 if the max level is L2 and 1+1+3+8+8 = 21 if the max level is L3. TRIAD then reduces that from 13 to 10 or from 21 to 18.
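
The arithmetic above, spelled out: per-level write-amp of 1 for the WAL, 1 for flush, 3 for L0:L1, 8 per level beyond L1, and my hand-wavy TRIAD adjustments.

    # The write-amp arithmetic above: WAL=1, flush=1, L0:L1=3, and 8 per level past L1.

    def total_write_amp(max_level, triad=False):
        wal, flush, l0_l1 = 1, 1, 3
        deeper = 8 * (max_level - 1)      # L1:L2, L2:L3, ... up to max_level
        if triad:
            flush = 0                     # TRIAD-MEM (hot keys) + TRIAD-LOG (cold keys)
            l0_l1 = 1                     # hand-wavy TRIAD-DISK estimate: 3 -> 1
        return wal + flush + l0_l1 + deeper

    print(total_write_amp(2), total_write_amp(2, triad=True))   # 13 10
    print(total_write_amp(3), total_write_amp(3, triad=True))   # 21 18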

But my write-amp estimate above is more true for workloads without skew and less true for workloads with skew. Many of the workloads tested in the paper have a large amount of skew. So while I have some questions about the paper, I am not claiming they are doing it wrong. What I am claiming is that the benefit from TRIAD is significant when total write-amp is small and less significant otherwise. Whether this matters is workload dependent. It would help to know more about the LSM tree for each benchmark. How many levels were in the LSM tree per benchmark? What was the per-level write-amp with and without TRIAD? Most of this can be observed from the compaction statistics provided by RocksDB. The paper has some details on the workloads but that isn't sufficient to answer the questions above.

Questions

The paper documents the memory overhead, but limits the definition of read amplification to IO and measured none. I am interested in both IO and CPU, and suspect there might be some CPU read-amp from using the commit-log SST in the L0, both for searches and during compaction, because logically adjacent data is no longer physically adjacent in the commit-log SST.

Another question is the impact of more levels and how far down the LSM tree compaction occurs. For example, if the write working set fits in the L2, should compaction stop at the L2? It might with some values of compaction priority in RocksDB, but it doesn't for all. When the workload has significant write skew then the write working set is likely to fit into one of the smaller levels of the LSM tree.

An interesting variant on this is a workload with N streams of inserts where each stream is appending (right-growing). When N=1 there is an optimization in RocksDB that limits write-amp to 2 (one for the WAL, one for the SST). I am not aware of optimizations in RocksDB for N>1 but am curious whether we could do something better.

Comments:

  1. One idea that I can't take credit for is to use HLL for picking SSTs to compact beyond L0:L1.

  2. Hi Mark,

    Thank you for taking the time to review our paper. It’s great to see that there is interest in our work! I’ll try to answer the questions that you raise in the review.

    - In our experiments the LSM tree was configured to maximum 5 levels. This value was chosen to be similar to the values seen in the RocksDB tuning guide (https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks — 6 levels, https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide — 4 levels) and from our experience in production.

    - Read amplification was not extensively evaluated in this paper. We do, however, show a read amplification breakdown for one of our workloads. The lower-right plot in Figure 11 shows the read amplification breakdown for a uniform workload with 10% reads, compared to RocksDB. Like you point out, TRIAD-DISK does increase the read amplification, as it’s keeping on average more files on L0. TRIAD-LOG does not have any effect concerning read-amp. This is to be expected since it does not change the number of SSTs, only affecting the number of written bytes when flushing. TRIAD-MEM on the other hand, reduces read-amp, because it can serve part of the requests from memory. Overall, TRIAD’s read-amp in this experiment is slightly higher than RocksDB (3.6, compared to 3.5 in RocksDB).

    - There is increased use of CPU because of commit-log SSTs on level 0. This penalty is only paid on compaction from level 0 to level 1. As soon as the commit-log SSTs on level 0 are compacted into level 1, they are converted into regular SSTs. This is a trade-off that we choose in order to avoid flushing almost completely.

    - TRIAD’s techniques target the memory component, the commit log and level 0 to level 1 compaction. The rest of the compaction (i.e., compaction on levels > 1) proceeds as it would normally proceed in RocksDB. However, depending on the workload, the effects of our techniques can be felt on the higher levels as well. For instance, if the memory component is large enough to capture the workload skew, little data ends up being written to disk, leading to less activity in the LSM tree on all levels.

    Cheers,
    Oana Balmau

    Replies
    1. Thanks for the reply and for advancing the state of the art for LSM. I look forward to more results.

