TRIAD reduces write amplification for an LSM with leveled compaction and with a variety of workloads gets up to 193% more throughput, up to 4X less write amplification and spends up to 77% less time doing compaction and flush. Per the RUM Conjecture improvements usually come at a cost and the cost in this case is more cache amplification (more memory overhead/key) and possibly more read amplification. I assume this is a good tradeoff in many cases.
The paper explains the improvements via 3 components -- TRIAD-MEM, TRIAD-DISK and TRIAD-LOG -- that combine to reduce write amplification.
TRIAD-MEM
TRIAD-MEM reduces write-amp by keeping frequently updated keys (hot keys) in the memtable. It divides keys into the memtable into two classes: hot and cold. On flush the cold keys are written into a new L0 SST while the hot keys are copied over to the new memtable. The hot keys must be written again to the new WAL so that the old WAL can be dropped. TRIAD-MEM tries to keep the K hottest keys in the memtable and there is work in progress to figure out a good value for K without being told by the DBA.
An extra 4-bytes/key is used for the memtable to track write frequency and identify hot keys. Note that RocksDB already 8 bytes/key for metadata. So TRIAD-MEM has a cost in cache-amp but I don't think that is a big deal.
Assuming the per-level write-amp is 1 from the memtable flush this reduces it to 0 in the best case where all keys are hot.
TRIAD-DISK
TRIAD-DISK reduces write-amp by delaying L0:L1 compaction until there is sufficient overlap between keys to be compacted. TRIAD continues to use an L0:L1 compaction trigger based on the number of files in the L0 but can trigger compaction earlier when there is probably sufficient overlap between the L0 and L1 SSTs.
Overlap is estimated via Hyperloglog (HLL) which requires 4kb/SST and is estimated as the following where file-i is the i-th SST under consideration, UniqueKeys is the estimated number of distinct keys across all of the SSTs and Keys(file-i) is the number of keys in the i-th SST. The paper states that both UniqueKeys and Keys are approximated using HLL. But I assume that per-SST metadata already has an estimate or exact value for the number of keys in the SST. The formula for overlap is:
UniqueKeys(file-1, file-2, ... file-n) / sum( Keys( file-i))
The benefit from early L0:L1 compaction is less read-amp, because there will be fewer sorted runs to search on a query. The cost from always doing early compaction is more per-level write-amp which is etimated by size(L1 input) / size(L0 input). TRIAD-DISK provides the benefit with less cost.
In RocksDB today you can manually schedule early compaction by setting the trigger to 1 or 2 files, or you can always schedule it to be less early with a trigger set to 8 or more files. But this setting is static. TRIAD-DISK uses a cost-based approach to do early compaction when it won't hurt the per-level write-amp. This is an interesting idea.
TRIAD-LOG
TRIAD-LOG explains improvements to memtable flush that reduce write-amp. Data in an L0 SST has recently been written to the WAL. So they use the WAL in place of writing the L0 SST. But something extra, an index into the WAL, is written on memtable flush because everything in the L0 must have an index. The WAL in the SST (called the CL-SST for commit log SST) will be deleted when it is compacted into the L1.
There is cache-amp from TRIAD-LOG. Each key in the CL-SST (L0) and maybe in the memtable needs 8 extra bytes -- 4 bytes for CL-SST ID, 4 bytes for the WAL offset.
Assuming the per-level write-amp is one from the memtable flush for cold keys this reduces that to 0.
Reducing write amplification
The total write-amp for an LSM tree with leveled compaction is the sum of:
- writing the WAL = 1
- memtable flush = 1
- L0:L1 compaction ~= size(L1) / size(L0)
- Ln compaction for n>1 ~= fanout, the per-level growth factor, usually 8 or 10. Note that this paper explains why it is usually a bit less than fanout.
TRIAD avoids the write-amp from memtable flush thanks to TRIAD-MEM for hot keys and TRIAD-LOG for cold keys. I will wave my hands and suggest that TRIAD-DISK reduces write-amp for L0:L1 from 3 to 1 based on the typical LSM configuration I use. So TRIAD reduces the total write-amp by 1+2 or 3.
Reducing total write-amp by 3 is a big deal when the total write-amp for the LSM tree is small, for example <= 10. But that only happens when there are few levels beyond the L1. Assuming you accept my estimate for total write-amp above then per-level write-amp is ~8 for both L1:L2 and L2:L3. The total write-amp for an LSM tree without TRIAD would be 1+1+3+8 = 13 if the max level is L2 and 1+1+3+8+8 = 21 if the max level is L3. And then TRIAD reduces that from 13 to 10 or from 21 to 18.
But my write-amp estimate above is more true for workloads without skew and less true for workloads with skew. Many of the workloads tested in the paper have a large amount of skew. So while I have some questions about the paper I am not claiming they are doing it wrong. What I am claiming is that the benefit from TRIAD is significant when total write-amp is small and less significant otherwise. Whether this matters is workload dependent. It would help to know more about the LSM tree from each benchmark. How many levels were in the LSM tree per benchmark? What is the per-level write-amp with and without TRIAD? Most of this can be observed from compaction statistics provided by RocksDB. The paper has some details on the workloads but that isn't sufficient to answer the questions above.
Reducing total write-amp by 3 is a big deal when the total write-amp for the LSM tree is small, for example <= 10. But that only happens when there are few levels beyond the L1. Assuming you accept my estimate for total write-amp above then per-level write-amp is ~8 for both L1:L2 and L2:L3. The total write-amp for an LSM tree without TRIAD would be 1+1+3+8 = 13 if the max level is L2 and 1+1+3+8+8 = 21 if the max level is L3. And then TRIAD reduces that from 13 to 10 or from 21 to 18.
But my write-amp estimate above is more true for workloads without skew and less true for workloads with skew. Many of the workloads tested in the paper have a large amount of skew. So while I have some questions about the paper I am not claiming they are doing it wrong. What I am claiming is that the benefit from TRIAD is significant when total write-amp is small and less significant otherwise. Whether this matters is workload dependent. It would help to know more about the LSM tree from each benchmark. How many levels were in the LSM tree per benchmark? What is the per-level write-amp with and without TRIAD? Most of this can be observed from compaction statistics provided by RocksDB. The paper has some details on the workloads but that isn't sufficient to answer the questions above.
Questions
The paper documents the memory overhead, but limits the definition of read amplification to IO and measured none. I am interested in IO and CPU and suspect there might be some CPU read-amp from using the commit-log SST in the L0 both for searches and during compaction as logically adjacent data is no longer physically adjacent in the commit-log SST.
impact of more levels?
Another question is how far down the LSM compaction occurs. For example if the write working set fits in the L2, should compaction stop at the L2. It might with some values of compaction priority in RocksDB but it doesn't for all. When the workload has significant write skew then the write working set is likely to fit into one of the smaller levels of the LSM tree.
An interesting variant on this is a workload with N streams of inserts that are each appending (right growing). When N=1 there is an optimization in RocksDB that limits write-amp to 2 (one for WAL, one for SST). I am not aware of optimizations in RocksDB for N>2 but am curious if we could do something better.