Wednesday, September 19, 2018

Durability debt

I define durability debt to be the amount of work that must be done to persist changes that have been applied to a database. Dirty pages must be written back for a b-tree. Compaction must be done for an LSM. Durability debt has IO and CPU components. The common IO overhead is from writing something back to the database. The common CPU overhead is from computing a checksum and optionally from compressing data.

From an incremental perspective (pending work per modified row) an LSM usually has less IO and more CPU durability debt than a B-Tree. From an absolute perspective the maximum durability debt can be much larger for an LSM than a B-Tree which is one reason why tuning can be more challenging for an LSM than a B-Tree.

In this post by LSM I mean LSM with leveled compaction.

B-Tree

The maximum durability debt for a B-Tree is limited by the size of the buffer pool. If the buffer pool has N pages then there will be at most N dirty pages to write back. If the buffer pool is 100G then there will be at most 100G to write back. The IO is more random or less random depending on whether the B-Tree is update-in-place, copy-on-write random or copy-on-write sequential. I prefer to describe this as small writes (page at a time) or large writes (many pages grouped into a larger block) rather than random or sequential. InnoDB uses small writes and WiredTiger uses larger writes. The distinction between small writes and large writes is more important with disks than with SSD.

There is a small CPU overhead from computing the per-page checksum prior to write back. There can be a larger CPU overhead from compressing the page. Compression isn't popular with InnoDB but is popular with WiredTiger.

There can be an additional IO overhead when torn-write protection is enabled as provided by the InnoDB double write buffer.

LSM

The durability debt for an LSM is the work required to compact all data into the max level (Lmax). A byte in the write buffer causes more debt than a byte in the L1 because more work is needed to move the byte from the write buffer to Lmax than from L1 to Lmax.

The maximum durability debt for an LSM is limited by the size of the storage device. Users can configure RocksDB such that the level 0 (L0) is huge. Assume that the database needs 1T of storage were it compacted into one sorted run and the write-amplification to move data from the L0 to the max level (Lmax) is 30. Then the maximum durability debt is 30 * sizeof(L0). The L0 is usually configured to be <= 1G in which case the durability debt from the L0 is <= 30G. But were the L0 configured to be <= 1T then the debt from it could grow to 30T.

I use the notion of per-level write-amp to explain durability debt in an LSM. Per-level write-amp is defined in the next section. Per-level write-amp is a proxy for all of the work done by compaction, not just the data to be written. When the per-level write-amp is X then for compaction from Ln to Ln+1 for every key-value pair from Ln there are ~X key-value pairs from Ln+1 for which work is done including:
  • Read from Ln+1. If Ln is a small level then the data is likely to be in the LSM block cache or OS page cache. Otherwise it is read from storage. Some reads will be cached, all writes go to storage. So the write rate to storage is > the read rate from storage.
  • If the level is compressed then the key-value pairs are decompressed for each block that is not in the LSM block cache.
  • The key-value pairs from Ln+1 are merged with Ln. Note that this is a merge, not a merge sort because the inputs are ordered. The number of comparisons might be less than you expect because one iterator is ~X times larger than the other and there are optimizations for that.
The output from the merge is then compressed and written back to Ln+1. Some of the work above (reads, decompression) is also done for Ln but most of the work comes from Ln+1 because it is many times larger than Ln. I stated above that an LSM usually has less IO and more CPU durability debt per modified row. The extra CPU overheads come from decompression and the merge. I am not sure whether to count the compression overhead as extra.
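
A small sketch of the merge step above -- my own toy code, not RocksDB code. Both inputs are already sorted, so this is a merge rather than a merge sort, and when a key appears in both inputs the version from Ln (the newer level) wins.

def merge_ln_into_ln1(ln, ln1):
    # ln and ln1 are lists of (key, value) pairs, each sorted by key
    out, i, j = [], 0, 0
    while i < len(ln) and j < len(ln1):
        if ln[i][0] < ln1[j][0]:
            out.append(ln[i]); i += 1
        elif ln[i][0] > ln1[j][0]:
            out.append(ln1[j]); j += 1
        else:                              # same key: keep the newer value from Ln
            out.append(ln[i]); i += 1; j += 1
    out.extend(ln[i:]); out.extend(ln1[j:])
    return out

print(merge_ln_into_ln1([("b", 2), ("d", 9)], [("a", 1), ("b", 1), ("c", 1)]))
# [('a', 1), ('b', 2), ('c', 1), ('d', 9)]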

Assuming the per-level growth factor is 10 and f is 0.7 (see below) then the per-level write-amp is 7 for L1 and larger levels. If sizeof(L1) == sizeof(L0) then the per-level write-amp is 2 for the L0. And the per-level write-amp is always 1 for the write buffer.

From this we can estimate the pending write-amp for data at any level in the LSM tree.
  1. Key-value pairs in the write buffer have the most pending write-amp because they are furthest from the max level. Key-value pairs in the max level (L5 in this case) have none. 
  2. Starting with the L2 there is more durability debt from a full Ln+1 than a full Ln -- while there is more pending write-amp for Ln, there is more data in Ln+1.
  3. Were I given the choice of L1, L2, L3 and L4 when first placing a write in the LSM tree then I would choose L4 as that has the least pending write-amp.
  4. Were I to make one level 10% larger then I would prefer to do that for a smaller level given the values in the rel size X pend column.

legend:
w-amp per-lvl   : per-level write-amp
w-amp pend      : write-amp to move byte to Lmax from this level
rel size        : size of level relative to write buffer
rel size X pend : write-amp to move all data from that level to Lmax

        w-amp   w-amp   rel     rel size 
level   per-lvl pend    size    X pend
-----   ------- -----   -----   --------
wbuf    1       31          1      31      
L0      2       30          4     120     
L1      7       28          4     112     
L2      7       21         40     840     
L3      7       14        400    5600    
L4      7       7        4000   28000   
L5      0       0       40000       0  

Per-level write-amp in an LSM

The per-level write-amplification is the work required to move data between adjacent levels. The per-level write-amp for the write buffer is 1 because a write buffer flush creates a new SST in L0 without reading/re-writing an SST already in L0.

I assume that any key in Ln is already in Ln+1 so that merging Ln into Ln+1 does not make Ln+1 larger. This isn't true in real life, but this is a model.

The per-level write-amp for Ln is approximately sizeof(Ln+1) / sizeof(Ln). For n=0 this is 2 with a typical RocksDB configuration. For n>0 this is the per-level growth factor and the default is 10 in RocksDB. Assume that the per-level growth factor is equal to X. In reality the per-level write-amp is f*X rather than X where f ~= 0.7. See this excellent paper or examine the compaction IO stats from a production RocksDB instance. Too many excellent conference papers assume it is X rather than f*X in practice.

The per-level write-amp for Lmax is 0 because compaction stops at Lmax.
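
The table above can be reproduced from these per-level values. Below is a short Python sketch -- my own, using the assumptions from this post (growth factor 10, f = 0.7, sizeof(L0) == sizeof(L1) == 4x the write buffer) -- that computes pending write-amp as the sum of per-level write-amp from a level up to Lmax.

per_level_wamp = {"wbuf": 1, "L0": 2, "L1": 7, "L2": 7, "L3": 7, "L4": 7, "L5": 0}
rel_size = {"wbuf": 1, "L0": 4, "L1": 4, "L2": 40, "L3": 400, "L4": 4000, "L5": 40000}

levels = list(per_level_wamp)      # insertion order: wbuf, L0, ..., L5 (Lmax)
for i, lvl in enumerate(levels):
    # pending write-amp: work left to move a byte from this level to Lmax
    pend = sum(per_level_wamp[l] for l in levels[i:])
    print(lvl, per_level_wamp[lvl], pend, rel_size[lvl], rel_size[lvl] * pend)
# wbuf 1 31 1 31, L0 2 30 4 120, ..., L4 7 7 4000 28000, L5 0 0 40000 0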

Tuesday, September 18, 2018

Bloom filter and cuckoo filter

The multi-level cuckoo filter (MLCF) in SlimDB builds on the cuckoo filter (CF) so I read the cuckoo filter paper. The big deal about the cuckoo filter is that it supports delete and a bloom filter does not. As far as I know the MLCF is updated when sorted runs arrive and depart a level -- so delete is required. A bloom filter in an LSM is per sorted run and delete is not required because the filter is created when the sorted run is written and dropped when the sorted run is unlinked.

I learned of the blocked bloom filter from the cuckoo filter paper (see here or here). RocksDB uses this but I didn't know it had a name. The benefit of it is to reduce the number of cache misses per probe. I was curious about the cost and while the math is complicated, the paper estimates a 10% increase in the false positive rate for a bloom filter with 8 bits/key and a 512-bit block which is similar to a typical setup for RocksDB.

Space Efficiency

I am always interested in things that use less space for filters and block indexes with an LSM so I spent time reading the paper. It is a great paper and I hope that more people read it. The cuckoo filter (CF) paper claims better space-efficiency than a bloom filter and the claim is repeated in the SlimDB paper as:
However, by selecting an appropriate fingerprint size f and bucket size b, it can be shown that the cuckoo filter is more space-efficient than the Bloom filter when the target false positive rate is smaller than 3%
The tl;dr for me is that the space savings from a cuckoo filter is significant when the false positive rate (FPR) is sufficiently small. But when the target FPR is 1% then a cuckoo filter uses about the same amount of space as a bloom filter.

The paper has a lot of interesting math that I was able to follow. It provides formulas for the number of bits/key for a bloom filter, cuckoo filter and semisorted cuckoo filter. The semisorted filter uses 1 less bit/key than a regular cuckoo filter. The formulas, assuming E is the target false positive rate, b is the bucket size (b=4 here), and A is the load factor, are:
  • bloom filter: ceil(1.44 * log2(1/E))
  • cuckoo filter: ceil(log2(1/E) + log2(2b)) / A == ceil(log2(1/E) + 3) / A
  • semisorted cuckoo filter: ceil(log2(1/E) + 2) / A

The target load factor is 0.95 (A = 0.95) and that comes at a cost in CPU overhead when creating the CF. Assuming A=0.95 then a bloom filter uses 10 bits/key, a cuckoo filter uses 10.53 and a semisorted cuckoo filter uses 9.47. So the cuckoo filter uses either 5% more or 5% less space than a bloom filter when the target FPR is 1% which is a different perspective from the quote I listed above. Perhaps my math is wrong and I am happy for an astute reader to explain that.

When the target FPR is 0.1% then a bloom filter uses 15 bits/key, a cuckoo filter uses 13.7 and a semisorted cuckoo filter uses 12.7. The savings from a cuckoo filter are larger here but the common configuration for a bloom filter in an LSM has been to target a 1% FPR. I won't claim that we have proven that FPR=1% is the best rate and recent research on Monkey has shown that we can do better when allocating space to bloom filters.
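
The arithmetic above is easy to check. Here is a short Python sketch -- my own, using the formulas listed earlier with b=4 and A=0.95 -- that prints the bits/key for both target FPRs.

from math import ceil, log2

def bloom_bits(E):
    return ceil(1.44 * log2(1 / E))

def cuckoo_bits(E, b=4, A=0.95):
    return ceil(log2(1 / E) + log2(2 * b)) / A

def semisorted_bits(E, b=4, A=0.95):
    return ceil(log2(1 / E) + 2) / A

for E in (0.01, 0.001):
    print(E, bloom_bits(E), round(cuckoo_bits(E), 2), round(semisorted_bits(E), 2))
# 0.01  -> 10, 10.53, 9.47
# 0.001 -> 15, 13.68, 12.63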

The first graph shows the number of bits/key as a function of the FPR for a bloom filter (BF) and cuckoo filter (CF). The second graph shows the ratio for bits/key from BF versus bits/key from CF. The results for semisorted CF, which uses 1 less bit/key, are not included.  For the second graph a CF uses less space than a BF when the value is greater than one. The graph covers FPR from 0.00001 to 0.09 which is 0.001% to 9%. R code to generate the graphs is here.


CPU Efficiency

From the paper there is more detail on CPU efficiency in table 3, figure 5 and figure 7. Table 3 has the speed to create a filter, but the filter is much larger (192MB) than a typical per-run filter with an LSM and there will be more memory system stalls in that case. Regardless the blocked bloom filter has the least CPU overhead during construction.

Figure 5 shows the lookup performance as a function of the hit rate. Fortunately performance doesn't vary much with the hit rate. The cuckoo filter is faster than the blocked bloom filter and the blocked bloom filter is faster than the semisorted cuckoo filter.

Figure 7 shows the insert performance as a function of the cuckoo filter load factor. The CPU overhead per insert grows significantly when the load factor exceeds 80%.

Thursday, September 13, 2018

Review of SlimDB from VLDB 2018

SlimDB is a paper worth reading from VLDB 2018. The highlights from the paper are that it shows:
  1. How to use less memory for filters and indexes with an LSM
  2. How to reduce the CPU penalty for queries with tiered compaction
  3. The benefit of more diversity in LSM tree shapes

Overview

Cache amplification has become more important as database:RAM ratios increase. With SSD it is possible to attach many TB of usable data to a server for OLTP. By usable I mean that the SSD has enough IOPs to access the data. But it isn't possible to grow the amount of RAM per server at that rate. Many of the early RocksDB workloads used database:RAM ratios that were about 10:1 and everything but the max level (Lmax) of the LSM tree was in memory. As the ratio grows that won't be possible unless filters and block indexes use less memory. SlimDB does that via three-level block indexes and multi-level cuckoo-filters.

Tiered compaction uses more CPU and IO for point and range queries because there are more places to check for data when compared to level compaction. The multi-level cuckoo filter in SlimDB reduces the CPU overhead for point queries as there is only one filter to check per level rather than one per sorted run per level.

The SlimDB paper shows the value of hybrid LSM tree shapes, combinations of tiered and leveled, and then how to choose the best combination based on IO costs. Prior to this year, hybrid didn't get much discussion -- the choices were usually tiered or leveled. While RocksDB and LevelDB with the L0 have always been hybrids of tiered (L0) and leveled (L1 to Lmax), we rarely discuss that. But more diversity in LSM tree shape means more complexity in tuning and the SlimDB solution is to make a cost-based decision (cost == IO overhead) subject to a constraint on the amount of memory to use.

This has been a great two years for storage engine efficiency. First we had several papers from Harvard DASLab that have begun to explain cost-based algorithm design and engine configuration and SlimDB continues in that tradition. I have much more reading to do starting with The Periodic Table of Data Structures.

Below I review the paper. Included with that is some criticism. Papers can be great without being perfect. This paper is a major contribution and worth reading.

Semi-sorted

The paper starts by explaining the principle of semi-sorted data. When the primary key can be split into two parts -- prefix and suffix -- there are some workloads that don't need data ordered over the entire primary key (prefix + suffix). Semi-sorted supports queries that fetch all data that matches the prefix of the PK while still enforcing uniqueness for the entire PK. For example, the PK can be on (a,b,c,d) with (a,b) as the prefix, and queries are like "a=X and b=Y" without predicates on (c,d) that require index ordering. SlimDB takes advantage of this to use less space for the block index.

There are many use cases for this, but the paper cites Linkbench which isn't correct. See the Linkbench and Tao papers for queries that do an exact match on the prefix but only want the top-N rows in the result. So ordering on the suffix is required to satisfy query response time goals when the total number of rows that match the prefix is much larger than N. I assume this issue with top-N is important for other social graph workloads because some graph nodes are popular. Alas, things have changed with the social graph workload since those papers were published and I hope the changes are explained one day.

Note that MyRocks can use a prefix bloom filter to support some range queries with composite indexes. Assume the index is on (a,b,c) and the query has a=X and b=Y order by c limit 10. A prefix bloom on (a,b) can be used for such a query.

Stepped Merge

The paper implements tiered compaction but calls it stepped merge. I didn't know about the stepped merge paper prior to reading the SlimDB paper. I assume that people who chose the name tiered might also have missed that paper.

LSM compaction algorithms haven't been formally defined. I tried to advance the definitions in a previous post. One of the open issues for tiered is whether it requires only one sorted run at the max level or allows for N runs at the max level. With N runs at the max level the space-amplification is at least N which is too much for many workloads. With 1 run at the max level compaction into the max level is always leveled rather than tiered -- the max level is read/rewritten and the per-level write-amplification from that is larger than 1 (while the per-level write-amp from tiered == 1). With N runs at the max level many of the compaction steps into the max level can be tiered, but some will be leveled -- when the max level is full (has N runs) then something must be done to reduce the number of runs.

3-level block index

Read the paper. It is complex and a summary by me here won't add value. It uses an Entropy Coded Trie (ECT) that builds on ideas from SILT -- another great paper from CMU.

ECT uses ~2 bits/key versus at least 8 bits/key for LevelDB for the workloads they considered. This is a great result. ECT also uses 5X to 7X more CPU per lookup than LevelDB which means you might limit its use to the largest levels of the LSM tree -- because those use the most memory and are the place where we are willing to spend CPU to save memory.

Multi-level cuckoo filter

SlimDB can use a cuckoo filter for leveled levels of the LSM tree and a multi-level cuckoo filter for tiered levels. Note that leveled levels have one sorted run and tiered levels have N sorted runs. SlimDB and the Stepped Merge paper use the term sub-levels, but I prefer N sorted runs.

The cuckoo filter is used in place of a bloom filter to save space given target false positive rates of less than 3%. The paper has examples where the cuckoo filter uses 13 bits/key (see Table 1) and a bloom filter with 10 bits/key (RocksDB default) has a false positive rate of much less than 3%. It is obvious that I need to read another interesting CMU paper cited by SlimDB -- Cuckoo Filter Practically Better than Bloom.

The multi-level cuckoo filter (MLCF) extends the cuckoo filter by using a few bits/entry to name the sub-level (sorted run) in the level that might contain the search key. With tiered and a bloom filter per sub-level (sorted run) a point query must search a bloom filter per sorted run. With the MLCF there is only one search per level (if I read the paper correctly).
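
A toy model of the idea, assuming I read the paper correctly -- this is my own illustration, not SlimDB code. One probe per level returns the sub-level (sorted run) that might contain the key, instead of one bloom filter probe per sorted run. A real MLCF stores fingerprints in cuckoo buckets and supports delete; the dict here is only a stand-in.

class ToyMLCF:
    def __init__(self):
        self.table = {}     # fingerprint -> sub-level id (a few bits in the real filter)

    def insert(self, key, run_id):
        self.table[hash(key) & 0xffff] = run_id   # 16-bit fingerprint stand-in

    def probe(self, key):
        # None means the key is (probably) not in this level; otherwise only one run is read
        return self.table.get(hash(key) & 0xffff)

mlcf = ToyMLCF()
mlcf.insert("user:42", run_id=3)
print(mlcf.probe("user:42"))   # 3 -> read only sorted run 3 in this level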

The MLCF might go a long way to reduce the point-query CPU overhead when using many sub-levels which is a big deal. While a filter can't be used for general range queries, SlimDB doesn't support general range queries. Assuming the PK is on (a,b,c,d) and the prefix is (a,b) then SlimDB supports range queries like fetch all rows where a=X and b=Y. It wasn't clear to me whether the MLCF could be used in that case. But many sub-levels can create more work for range queries because iterators must be positioned in each sub-level in the worst case.

This statement from the end of the paper is tricky. SlimDB allows for an LSM tree to use leveled compaction on all levels, tiered on all levels or a hybrid. When all levels are leveled, then performance should be similar to RocksDB with leveled, when all or some levels are tiered then write-amplification will be reduced at the cost of read performance and the paper shows that range queries are slower when some levels are tiered. Lunch isn't free as the RUM Conjecture asserts.
In contrast, with the support of dynamic use of a stepped merge algorithm and optimized in-memory indexes, SlimDB minimizes write amplification without sacrificing read performance.
The memory overhead for the MLCF is ~2 bits/key. I am not sure this was explained by the paper but that might be to name the sub-level, in which case there can be at most 4 sub-levels per level and the cost would be larger with more sub-levels.

The paper didn't explain how the MLCF is maintained. With a bloom filter per sorted run the bloom filter is created when SST files are created during compaction and memtable flush. This is an offline or batch computation. But the MLCF covers all the sub-levels (sorted runs) in a level. And the sub-levels in a level arrive and depart one at a time, not all at the same time. They arrive as output from compaction and depart when they are used as compaction input. The arrival or departure of a sub-level requires incremental changes to the MLCF.

LSM tree shapes

For too long there has not been much diversity in LSM tree shapes. The usual choice was all tiered or all leveled. RocksDB leveled is really a hybrid -- tiered for L0, leveled for L1 to Lmax. But the SlimDB paper makes the case for more diversity. It explains that some levels (smaller ones) can be tiered while the larger levels can be leveled. And the use of multi-level cuckoo filters, three-level indexes and cuckoo filters is also a decision to make per-level.

Even more interesting is the use of a cost-model to choose the best configuration subject to a constraint -- the memory budget. They enumerate a large number of LSM tree configurations, generate estimated IO-costs per operation (write-amp, IO per point query that returns a row, IO per point query that doesn't return a row, memory overhead) and then the total IO cost is computed for a workload -- where a workload specifies the frequency of each operation (for example - 30% writes, 40% point hits, 30% point misses).

The Dostoevsky paper also makes the case for more diversity and uses rigorous models to show how to choose the best LSM tree shape.

I think this work is a big step in the right direction. However, cost models must be expanded to include CPU overheads and constraints must be expanded to include the maximum write and space amplification that can be tolerated.

I disagree with a statement from the related work section. We can already navigate some of the read, write and space amplification space but I hope there is more flexibility in the future. RocksDB tuning is complex in part to support this via changing the number of levels (or growth factor per level), enabling/disabling the bloom filter, using different compression (or none) on different levels, changing the max space amplification allowed, changing the max number of sorted runs in the L0 or max number of write buffers, changing the L0:L1 size ratio, changing the number of bloom filter bits/key. Of course I want more flexibility in the future while also making RocksDB easier to tune.

Existing LSM-tree based key-value stores do not allow trading among read cost, write cost and main memory footprint. 

Performance Results


Figuring out why X was faster than Y in academic papers is not my favorite task. I realize that space constraints are a common reason for the lack of details but I am wary of results that have not been explained and I know that mistakes can be made (note: don't use serializable with InnoDB). I make many mistakes myself. I am willing to provide advice for MyRocks, MySQL and RocksDB. AFAIK most authors who hack on RocksDB or compare with it for research are not reaching out to us. We are happy to help in private.

SlimDB was faster than RocksDB on their evaluation except for range queries. There were few details about the configurations used, so I will guess. First I assume that SlimDB used stepped merge with MLCF for most levels. I am not sure why point queries were faster with SlimDB than RocksDB. Maybe RocksDB wasn't configured to use bloom filters. Writes were about 4X faster with SlimDB because stepped merge (tiered) compaction was used, write-amplification was 4X less and when IO is the bottleneck then an approach that has less write-amp will go faster.



Wednesday, September 5, 2018

5 things to set when configuring RocksDB and MyRocks

The 5 options to set for RocksDB and MyRocks are:
  1. block cache size
  2. number of background threads
  3. compaction priority
  4. dynamic leveled compaction
  5. bloom filters
I have always wanted to do a "10 things" post but prefer to keep this list small. It is unlikely that RocksDB can provide a great default for the block cache size and number of background threads because they depend on the amount of RAM and number of CPU cores in a server. But I hope RocksDB or MyRocks are changed to get better defaults for the other three which would shrink this list from 5 to 2.

Options

My advice on setting the size of the RocksDB block cache has not changed assuming it is configured to use buffered IO (the default). With MyRocks this option is rocksdb_block_cache_size and with RocksDB you will write a few lines of code to set up the LRU.

The number of background threads for flushing memtables and doing compaction is set by the option rocksdb_max_background_jobs in MyRocks and max_background_jobs in RocksDB. There used to be two options for this. While RocksDB can use async read-ahead and write-behind during compaction, it still uses synchronous reads and a potentially slow fsync/fdatasync call. Using more than 1 background job helps to overlap CPU and IO. A common configuration for me is number-of-CPU-cores / 4. With too few threads there will be more stalls from throttling. With too many threads the threads handling user queries might suffer.

There are several strategies for choosing the next data to compact with leveled compaction in RocksDB. The strategy is selected via the compaction_pri option in RocksDB. This is harder to set for MyRocks -- see compaction_pri in rocksdb_default_cf_options. The default value is kByCompensatedSize but the better choice is kMinOverlappingRatio. With MyRocks the default is 0 and the better value is 3 (3 == kMinOverlappingRatio). I first wrote about compaction_pri prior to the arrival of kMinOverlappingRatio. Throughput is better and write amplification is reduced with kMinOverlappingRatio. An awesome paper by Hyeontaek Lim et al explains this.

Leveled compaction in RocksDB limits the amount of data per level of the LSM tree. A great description of this is here. There is a target size per level and this is enforced top down (smaller to larger levels) or bottom up (larger to smaller levels). With the bottom up approach the largest level has ~10X (or whatever the fanout is set to) more data than the next to last level. With the top down approach the largest level frequently has less data than the next to last level. I strongly prefer the bottom up approach to reduce space amplification. This is enabled via the level_compaction_dynamic_level_bytes option in RocksDB. It is harder to set for MyRocks -- see rocksdb_default_cf_options.
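
A rough sketch of the difference, using my own model rather than RocksDB code. With the top down approach the targets grow from a fixed L1 size and the largest level often holds less than its target; with the bottom up approach the targets are derived from the actual data size so the largest level always has ~fanout times more data than the level before it.

def top_down_targets(l1_target_gb, num_levels, fanout=10):
    # L1 target is fixed; Lmax gets whatever is left and may be smaller than Lmax-1
    return [l1_target_gb * fanout**i for i in range(num_levels)]

def bottom_up_targets(data_gb, num_levels, fanout=10):
    # Lmax target equals the data size; smaller levels are derived by dividing by fanout
    return list(reversed([data_gb / fanout**i for i in range(num_levels)]))

print(top_down_targets(1, 4))     # [1, 10, 100, 1000] -- targets, not actual sizes
print(bottom_up_targets(300, 4))  # [0.3, 3.0, 30.0, 300.0]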

Bloom filters are disabled by default for MyRocks and RocksDB. I prefer to use a bloom filter on all but the largest level. This is set via rocksdb_default_cf_options with MyRocks. The reason for not using it with the max level is to consume less memory (reduce cache amplification). The bloom filter is skipped for the largest level in MyRocks via the optimize_filters_for_hits option. The example at the end of this post has more information on enabling bloom filters. All of this is set via rocksdb_default_cf_options.

Examples

A previous post explained how to set rocksdb_default_cf_options for compression with MyRocks. Below I share an example my.cnf for MyRocks to set the 5 options I listed above. I set transaction isolation because read committed is a better choice for MyRocks today. Repeatable read will be a great choice after gap locks are added to match InnoDB semantics. In rocksdb_default_cf_options block_based_table_factory is used to enable the bloom filter, level_compaction_dynamic_level_bytes enables bottom up management of level sizes, optimize_filters_for_hits disables the bloom filter for the largest level of the LSM tree and compaction_pri sets the compaction priority.

transaction-isolation=READ-COMMITTED
default-storage-engine=rocksdb
rocksdb

rocksdb_default_cf_options=block_based_table_factory={filter_policy=bloomfilter:10:false};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio
rocksdb_block_cache_size=2g
rocksdb_max_background_jobs=4

Thursday, August 30, 2018

Name that compaction algorithm

First there was leveled compaction and it was a great paper. Then tiered compaction arrived in BigTable, HBase and Cassandra. Eventually LevelDB arrived with leveled compaction and RocksDB emerged from that. Along the way a few interesting optimizations have been added including support for time series data. My summary is missing a few details because it is a summary.

Compaction algorithms constrain the LSM tree shape. They determine which sorted runs can be merged and which sorted runs need to be accessed for a read operation. I am not sure whether they have been formally defined, but I hope there can be agreement on the basics. I will try to do that now for a few -- leveled, tiered, tiered+leveled, leveled-N and time-series. There are two new names on this list -- tiered+leveled and leveled-N.

LSM tree used to imply leveled compaction. I prefer to expand the LSM tree definition to include leveled, tiered and more.

I reference several papers below. All of them are awesome, even when not perfect -- they are major contributions to write-optimized databases and worth reading. One of the best things about my job is getting time to read papers like this and then speak with the authors.

There are many interesting details in academic papers and existing systems (RocksDB, Cassandra, HBase, ScyllaDB) that I ignore. I don't want to get lost in the details.

Leveled

Leveled compaction minimizes space amplification at the cost of read and write amplification.

The LSM tree is a sequence of levels. Each level is one sorted run that can be range partitioned into many files. Each level is many times larger than the previous level. The size ratio of adjacent levels is sometimes called the fanout and write amplification is minimized when the same fanout is used between all levels. Compaction into level N (Ln) merges data from Ln-1 into Ln. Compaction into Ln rewrites data that was previously merged into Ln. The per-level write amplification is equal to the fanout in the worst case, but it tends to be less than the fanout in practice as explained in this paper by Hyeontaek Lim et al. Compaction in the original LSM paper was all-to-all -- all data from Ln-1 is merged with all data from Ln. It is some-to-some for LevelDB and RocksDB -- some data from Ln-1 is merged with some (the overlapping) data in Ln.

While write amplification is usually worse with leveled than with tiered there are a few cases where leveled is competitive. The first is key-order inserts and a RocksDB optimization greatly reduces write-amp in that case. The second one is skewed writes where only a small fraction of the keys are likely to be updated. With the right value for compaction priority in RocksDB compaction should stop at the smallest level that is large enough to capture the write working set -- it won't go all the way to the max level. When leveled compaction is some-to-some then compaction is only done for the slices of the LSM tree that overlap the written keys, which can generate less write amplification than all-to-all compaction.

Tiered

Tiered compaction minimizes write amplification at the cost of read and space amplification.

The LSM tree can still be viewed as a sequence of levels as explained in the Dostoevsky paper by Niv Dayan and Stratos Idreos. Each level has N sorted runs. Each sorted run in Ln is ~N times larger than a sorted run in Ln-1. Compaction merges all sorted runs in one level to create a new sorted run in the next level. N in this case is similar to fanout for leveled compaction. Compaction does not read/rewrite sorted runs in Ln when merging into Ln. The per-level write amplification is 1 which is much less than for leveled where it was fanout.
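
A back-of-envelope comparison using my own simple model, not measurements: leveled pays roughly f*fanout per level (the f ~= 0.7 observation from the Lim et al paper) while tiered pays roughly 1 per level, plus 1 for the write buffer flush in both cases.

def leveled_write_amp(num_levels, fanout=10, f=0.7):
    return 1 + num_levels * f * fanout   # 1 for the write buffer flush

def tiered_write_amp(num_levels):
    return 1 + num_levels                # each level is written once, not rewritten

print(leveled_write_amp(5))   # 36.0
print(tiered_write_amp(5))    # 6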

Most implementations of tiered compaction don't behave exactly as described in the previous paragraph. I hope they are close enough, because the model above makes it easy to reason about performance and estimate the worst-case write amplification. A common approach for tiered is to merge sorted runs of similar size, without having the notion of levels (which imply a target for the number of sorted runs of specific sizes). Most include some notion of major compaction that includes the largest sorted run and conditions that trigger major and non-major compaction. Too many files and too many bytes are typical conditions.

The stepped merge paper is the earliest reference I found for tiered compaction. It reduces random IO for b-tree changes by buffering them in an LSM tree that uses tiered compaction. While the stepped merge algorithm is presented as different from an LSM, it is tiered compaction. The MaSM paper is similar but the SM in MaSM stands for sort merge. The paper uses an external sort rather than an LSM to reduce write amplification. It assumes that LSM implies leveled compaction but an external sort looks a lot like tiered compaction. The InnoDB change buffer has a similar goal of reducing random IO for changes to a b-tree but doesn't use an LSM. In what year did the InnoDB change buffer get designed or implemented?

I prefer that tiered not require N sorted runs at the max level because that means N copies of the database which is too much space amplification. I define it to allow K copies at the max level where K is between 2 and N. But it still does tiered compaction at the max level and when the max level is full (has K sorted runs) then the K runs are merged and the output (1 sorted run) replaces the K runs in the max level. One day I hope to learn whether HBase or Cassandra support 1, a few or N sorted runs at the max level -- although this can be confusing because they don't enforce the notion of levels. Tiered compaction in RocksDB has a configuration option to limit the worst-case space amplification which should prevent too many full copies (too many sorted runs at the max level) but I don't have much experience with tiered in RocksDB. I hope the RocksDB wiki gets updated to explain this.

There are a few challenges with tiered compaction:
  • Transient space amplification is large when compaction includes a sorted run from the max level.
  • The block index and bloom filter for large sorted runs will be large. Splitting them into smaller parts is a good idea.
  • Compaction for large sorted runs takes a long time. Multi-threading would help.
  • Compaction is all-to-all. When there is skew and most of the keys don't get updates, large sorted runs might get rewritten because compaction is all-to-all. In a traditional tiered algorithm there is no way to rewrite a subset of a large sorted run. 
For tiered compaction the notion of levels is usually a concept to reason about the shape of the LSM tree and estimate write amplification. With RocksDB they are also an implementation detail. The levels of the LSM tree beyond L0 can be used to store the larger sorted runs. The benefit from this is to partition large sorted runs into smaller SSTs. This reduces the size of the largest bloom filter and block index chunks -- which is friendlier to the block cache -- and was a big deal before partitioned index/filter was supported. With subcompactions this enables multi-threaded compaction of the largest sorted runs. Note that RocksDB uses the name universal rather than tiered. More docs on this are here.

Tiered+Leveled

Tiered+Leveled has less write amplification than leveled and less space amplification than tiered.

The tiered+leveled approach is a hybrid that uses tiered for the smaller levels and leveled for the larger levels. It is flexible about the level at which the LSM tree switches from tiered to leveled. For now I assume that if Ln is leveled then all levels that follow (Ln+1, Ln+2, ...) must be leveled.

SlimDB from VLDB 2018 is an example of tiered+leveled although it might allow Lk to be tiered when Ln is leveled for k > n. Fluid LSM is described as tiered+leveled but I think it is leveled-N.

Leveled compaction in RocksDB is also tiered+leveled, but we didn't explain it that way until now. There can be N sorted runs at the memtable level courtesy of the max_write_buffer_number option -- only one is active for writes, the rest are read-only waiting to be flushed. A memtable flush is similar to tiered compaction -- the memtable output creates a new sorted run in L0 and doesn't read/rewrite existing sorted runs in L0. There can be N sorted runs in level 0 (L0) courtesy of level0_file_num_compaction_trigger. So the L0 is tiered. Compaction isn't done into the memtable level so it doesn't have to be labeled as tiered or leveled. Subcompactions in the RocksDB L0 make this even more interesting, but that is a topic for another post. I hope we get more docs on this interesting feature from Andrew Kryczka.

Leveled-N

Leveled-N compaction is like leveled compaction but with less write and more read amplification. It allows more than one sorted run per level. Compaction merges all sorted runs from Ln-1 into one sorted run in Ln, which is leveled. And then "-N" is added to the name to indicate there can be N sorted runs per level.

The Dostoevsky paper defined a compaction algorithm named Fluid LSM in which the max level has 1 sorted run but the non-max levels can have more than 1 sorted run. Leveled compaction is done into the max level. The paper states that tiered compaction is done into the smaller levels when they have more than 1 sorted run. But from my reading of the paper it uses leveled-N for the non-max levels.

In Fluid LSM each level is T times larger than the previous level (T == fanout), the max level has Z sorted runs and the non-max levels have K sorted runs. When Z=1 and K=1 then this is leveled compaction. When Z=1 and K>1 or Z>1 and K>1 then I claim this uses leveled-N.

Assuming K>1 for Ln-1 then compaction with Fluid LSM into Ln merges K runs from Ln-1 with 1 run from Ln. This doesn't match my definition of tiered compaction because compaction into Ln reads & rewrites a sorted run from Ln and per-level write amplification is likely to be larger than 1. Regardless I like the idea.

Examples of write amplification with Fluid LSM for compaction from Ln-1 to Ln:
  • T==K - there are T (or K) sorted runs in each of Ln-1 and Ln. When each run in Ln-1 has size 1, then each run in Ln has size T. Compaction into Ln merges T runs from Ln-1 with 1 run from Ln to create a new run in Ln. This reads T bytes from Ln-1 and T bytes from Ln and the new run has a size between T and 2T -- size T when all keys in Ln-1 are duplicates of keys in the run from Ln and size > T otherwise. When the new run has size 2T the per-level write amp is 2 because 2T bytes were written to move T bytes from Ln-1. When the new run has size T the per-level write amp is 1. Otherwise the per-level write-amp is between 1 and 2. 
  • T > K - there are K sorted runs in each of Ln-1 and Ln. Each run in Ln-1 has size T/K and each run in Ln has size T^2/K. K runs in Ln-1 have size T. Compaction reads T bytes from Ln-1, T^2/K bytes from Ln and writes a new run in Ln that has a size between T^2/K and (T^2/K + T). The per-level write-amp is as small as T^2/K / T, which reduces to T/K, when all keys in Ln-1 are duplicates with the run in Ln. It can be as large as (T^2/K + T) / T, which reduces to T/K + 1, when there is no overlap. Otherwise it is between T/K and T/K + 1.
When K=2 and T=10 then the per-level write-amp is ~5 which is about half of the per-level write-amp from leveled compaction.
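
The bounds above are easy to compute. Here is a small Python sketch -- my own, following the model in this section -- for compaction from Ln-1 into Ln with fanout T and K runs per non-max level.

def fluid_per_level_write_amp(T, K):
    moved = T                          # bytes moved out of Ln-1: K runs that total size T
    rewritten = T * T / K              # the one run from Ln that is merged and rewritten
    lo = rewritten / moved             # all keys in Ln-1 are duplicates of keys in Ln
    hi = (rewritten + moved) / moved   # no overlap between Ln-1 and Ln
    return lo, hi

print(fluid_per_level_write_amp(10, 10))  # (1.0, 2.0) -- the T == K case
print(fluid_per_level_write_amp(10, 2))   # (5.0, 6.0) -- T=10, K=2 as above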

Time Series

There are compaction algorithms optimized for time series workloads. I have no experience with them but they are worth mentioning. Cassandra had DTCS and has TWCS. InfluxDB has or had TSM and TSI. I hope we eventually do something interesting for time series with RocksDB.

Other

There are other interesting LSM engines:
  • Tarantool - Sophia begat Vinyl and I lost track of it. But I have high hopes.
  • WiredTiger - has an LSM but they are busy making the CoW b-tree better
  • Kudu - didn't use RocksDB and I like the reasons for not using it
My summary of Sophia and Tarantool probably has bugs. My memory is that Sophia was a great design assuming the database:RAM ratio wasn't too large. It had a memtable and a sorted run on disk -- both were partitioned (not sure if range or hash). When a memtable partition became full then leveled compaction was done between it and its disk partition. Vinyl has changed enough from this design that I won't try to summarize it here. It has clever ideas for managing the partitions.

ScyllaDB

I briefly mentioned ScyllaDB at the start of the post. I have yet to use the product but their documentation on LSM efficiency and many other things is remarkable. Start with this post that compares the compaction strategies (algorithms) in ScyllaDB -- leveled, size-tiered, hybrid and time-window. From this attached slide deck I learned that Lucene implemented an LSM in 1999. They also have two posts that explain write amplification for tiered and leveled compaction.

Hybrid compaction is described in the embedded slide deck and it is interesting. Hybrid range partitions large sorted runs into many SSTs, similar to RocksDB. Hybrid then uses that to make compaction with large sorted runs incremental -- an input SST to the compaction can be deleted before the compaction is finished (slide 33). This reduces the worst-case space amplification that is transient when merges are in progress for large sorted runs. This isn't trivial to implement. It isn't clear to me but slide 34 suggests that hybrid can limit compaction to a subset (1 or a few SSTs) of a large sorted run when the writes are skewed. Maybe a ScyllaDB expert can confirm or deny my guess. Hybrid also has optimizations for tombstones (slide 44). I won't go into detail here, just as I ignored the SingleDelete optimization in RocksDB. 

Monday, August 27, 2018

Review of "Concurrent Log-Structured Memory" from VLDB 2018.

Space-amplification matters for in-memory stores too.

This is a review of Concurrent Log-Structured Memory for Many-Core Key-Value Stores from VLDB 2018 and the engine is named Nibble. The paper is worth reading -- the ideas are interesting and the performance results are thorough. Nibble is an example of index+log. Their focus is on huge many-core servers. I wonder how RocksDB would do on a server with 240 cores and many TB of DRAM. I assume there might be a few interesting performance problems to make better. Start with table 2 for a condensed summary. The design overview is:

  • partitioned, resizable hash index - the hash index uses open addressing and linear probing. Each bucket has 15 entries, 64-bits/entry and a version counter. The counter is incremented twice per update -- at start and end. Readers retry if the counter changed during the read or is odd, which means an update is in flight (see the sketch after this list).
  • per-socket log-structured memory allocators with per-core log heads. There is a log instance per core and each log instance has multiple heads (locations where log inserts are done). They call this write local, read global because reads might be from a log instance written by another core. A cost-based approach is used to select the next log segment for GC.
  • thread-local epochs - after GC there might still be threads reading from a log segment. Epochs are used to determine when that is safe. The CPU time stamp counter is used for the epoch value and each thread writes its epoch value, using a cache line per thread, to a fixed memory location.
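
An illustrative sketch of that read protocol -- my own toy code, not Nibble's. Writers bump the version before and after an update so readers observe an odd version while an update is in flight, and readers retry when the version changed underneath them. Real code needs atomic loads/stores and memory fences; Python here only shows the retry logic.

class Bucket:
    def __init__(self):
        self.version = 0
        self.entries = {}            # stand-in for the 15 fixed 64-bit slots

    def update(self, key, value):    # assume one writer per bucket at a time
        self.version += 1            # now odd: an update is in flight
        self.entries[key] = value
        self.version += 1            # even again: the update is visible

    def lookup(self, key):
        while True:
            v1 = self.version
            if v1 % 2:               # update in progress, retry
                continue
            value = self.entries.get(key)
            if self.version == v1:   # nothing changed while we read
                return value
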
Things that are interesting to me:
  • When needed, a partition of the hash index is doubled in size. Rather than allocate 2X more memory, they use the VM to extend the current allocation so only ~1/2 of the objects in the partition must be relocated. I am curious whether there are stalls during resize.
  • What is the range of load factors that the Nibble hash index can sustain? A b-tree with leaf pages 2/3 full is one source of fragmentation, a hash index with a load factor less than 100% is another form of fragmentation. Fortunately, with a load factor of 80%, 20% of memory isn't wasted because the log segments should use much more memory than the hash index. Figure 14 shows throughput as a function of memory utilization.
  • index+log has a large CPU cost during GC from tree index probes, but Nibble uses a hash index which has a lower probe cost. Nibble maintains a live bytes counter per log segment. Segments with zero live bytes can be collected without probing the index to find live entries. Otherwise an index probe per entry is required to determine whether an entry is live.
  • I am curious about the results in Figures 1 and 9b on memory fragmentation per allocator. The results for jemalloc and tcmalloc are similar while ptmalloc2 does the best.  The microbenchmarks are from the Rumble paper. I don't think much can be concluded from such simple allocator workloads -- but I still like the paper. RocksDB or LevelDB with db_bench would be a better test for fragmentation. It would also be good to know which versions of the allocators (jemalloc, tcmalloc, etc) were used and whether any tuning was done.
I have published posts on the benefits from using jemalloc or tcmalloc compared to glibc malloc for RocksDB. A RocksDB/MyRocks instance has ~2X larger RSS with glibc malloc because jemalloc and tcmalloc are better at avoiding fragmentation. See posts one, two, three, four and five. RocksDB does an allocation per block read and puts a lot of stress on the allocator especially when using a fast storage device. An allocation remains cached until it reaches the LRU end in the block cache or the block's SST gets unlinked. I expect blocks in the block cache to have vastly different lifetimes.

Tuesday, August 7, 2018

Default configuration benchmarks

Default configuration benchmarks are an interesting problem. Most storage engines require some configuration tuning to get good performance and efficiency. We configure an engine to do the right thing for the expected workload and hardware. Unfortunately the configuration is done in the language of the engine (innodb_write_io_threads, rocksdb_default_cf_options) which requires a significant amount of time to understand.

Hardware comes in many sizes and engines frequently don't have code to figure out the size -- how many CPUs, how much RAM, how many GB of storage, how many IOPs from storage. Even when that code exists the engine might not be able to use everything it finds:
  • HW can be shared and the engine is only allowed a fraction of it. 
  • It might be running on a VM that gets more CPU when other VMs on the host are idle.
  • SSDs get slower when more full. It can take a long time to reach that state.

Minimal configuration

I assume there is a market for storage engines that have better performance with the default configuration, but it will take time to get there. A step in the right direction is to enhance engines to get great performance and efficiency with minimal configuration (minimal != default). I am still figuring out what minimal means. I prefer to use the language of the engine user (HW capacity and performance/efficiency goals) rather than the language of the engine. I'd rather not set engine-specific options, even easy to understand ones like innodb_buffer_pool_size. I want the engine to figure out its configuration given the minimal tuning. For now I have two levels for minimal:
  • HW-only - tell the engine how much HW it can use -- number of CPU cores, GB of RAM, storage capacity and IOPs. Optionally you can ask it to use all that it finds.
  • HW + goals - in addition to HW-only this supports goals for read, write, space and cache amplification. For now I will be vague about the goals. 

Things change

Another part of the configuration challenge is that database workloads change while configurations tend to be static. I prefer that the engine does the right thing, while respecting the advice provided via minimal configuration. I want the engine to adapt to the current workload without ruining performance for the future workload. Adapting by deferring index maintenance can make loads faster, but might hurt the queries that follow.

Types of change include:
  • The working set no longer fits in memory and the workload shifts from CPU to IO bound.
  • Daily maintenance (vacuum, reorg, defrag, DDL, reporting) runs during off-peak hours.
  • Web-scale workloads have daily peak cycles as people wake and sleep.
  • New features get popular, old features get deprecated. Their tables and indexes arrive, grow large, become read-only, get dropped and more. Some deprecated features get un-deprecated.
  • Access patterns to data change. Rows might be written once, N times or forever, and write-once/write-N rows eventually become read-only. Rows might be read never, once, a few times or forever.
  • Different types of data (see previous point) can live within the same index. Even if you were willing to tune per-index (some of us are) this isn't sufficient when there is workload diversity within an index.
Real workloads include the types of change listed above but benchmarks rarely include them. Any benchmark that includes such change is likely to need more than 24 hours to run, which will limit its popularity -- but maybe that isn't a bad thing. I hope we see a few new benchmarks that include such types of change. I might even try to write one.