Wednesday, October 3, 2018

Minimizing write amplification in an LSM

Write-amplification for an LSM with leveled compaction is minimized when the per-level growth factor (fanout) is the same between all levels. This is a result for an LSM tree using a given number of levels. To find the minimal write-amplification for any number of levels this result can be repeated for 2, 3, 4, ... up to a large value. You might find that a large number of levels is needed to get the least write-amp and that comes at price of more read-amp, as the RUM Conjecture predicts.

In all cases below I assume that compaction into the smallest level (from a write buffer flush) has no write-amp. This is done to reduce the size of this blog post.

tl;dr - for an LSM with L1, L2, L3 and L4 what values for per-level fanout minimizes write-amp when the total fanout is 1000?
  • (10, 10, 10) for leveled
  • (6.3, 12.6, 12.6) for leveled-N assuming two of the levels have 2 sorted runs
  • (>1, >1, >1) for tiered

Minimizing write-amp for leveled compaction

For an LSM with 4 levels (L1, L2, L3, L4) there is a per-level fanout between L1:L2, L2:L3 and L3:L4. Assume this uses classic leveled compaction so the total fanout is size(L4) / size(L1). The product of the per-level fanouts must equal the total fanout. The total write-amp is the sum of the per-level write-amp. I assume that the per-level write amp is the same as the per-level fanout although in practice and in theory it isn't that simple. Lets use a, b and c as the variables for the per-level fanout (write-amp) then the math problem is:
  1. minimize a+b+c
  2. such that a*b*c=k and a, b, c > 1
While I have been working on my math skills this year they aren't great and corrections are welcome. This is a constrained optimization problem that can be solved using Lagrange Multipliers. From above #1 is the sum of per-level write-amp and #2 means that the product of per-level fanout must equal the total fanout. The last constraint is that a, b and c must (or should) all be > 1.

This result uses Lagrange Multipliers for an LSM tree with 4 levels do there are 3 variables: a, b, c. But the math holds for an LSM tree with fewer levels or with more levels. If there are N levels then there are N-1 variables.

L(a, b, c) = a + b + c - lambda * (a*b*c - k)
dL/da = 1 - lambda * bc
dL/db = 1 - lambda * ac
dL/dc = 1 - lambda * ab
then
lambda = 1/bc = 1/ac = 1/ab
bc == ac == ab
and a == b == c to minimize the sum in #1

I wrote a Python script to discover the (almost) best values and the results match the math above.

Minimizing write-amp for tiered compaction

Assuming you can reason about tiered compaction using the notion of levels then the math changes a bit because the per-level write-amp with tiered equals 1 regardless of the per-level fanout. For tiered with 4 levels and 3 variables the problem is:
  1. minimize 1+1+1
  2. such that a*b*c = k and a, b, c > 1
Any values for a, b and c are sufficient as long they satisfy the constraints in #2. But it still helps to minimize a+b+c if that is predicts read-amp because a, b and c are also the number of sorted runs in L2, L3 and L4. So my advice is to use a == b == c in most cases.

Minimizing write-amp for leveled-N compaction
I explain leveled-N compaction here and here. It is like leveled compaction but allows a level to have more than one sorted run. This reduces the per-level write-amp at the cost of more read-amp. Sometimes that is a good trade.

The math above can also be used to determine how to configure per-level fanout to minimize write-amp for leveled-N. Assume an LSM tree with 4 levels (L1, L2, L3, L4) and 2 sorted runs in L2 and L3. The problem is:
  1. minimize a + b/2 + c/2
  2. such that a*b*c = k and a, b, c > 1
For leveled compaction I assume that per-level write-amp is all-size(Ln+1) / all-size(Ln) for compaction from Ln into Ln+1. For leveled-N I assume it is run-size(Ln+1) / all-size(Ln) where all-size is the size of all sorted runs on that level and run-size is the size of one sorted run. The astute reader might notice that all-size(Ln) == run-size(Ln) for traditional leveled. For leveled-N I assume that fanout continues to be run-size(Ln+1) / run-size(Ln).

Therefore with leveled-N the per-level write-amp is b/2 for L2 to L3 and c/2 for L3 to L4 because there are 2 sorted runs in the compaction input (twice as much data) in those cases. Were there 3 sorted runs then the values would be b/3 and c/3.

Lagrange Multipliers can be used to solve this assuming we want to minimize a + b/2 + c/2.

L(a, b, c) = a + b/2 + c/2 - lambda * (a*b*c - k)
dL/da = 1   - lambda * bc
dL/db = 1/2 - lambda * ac
dL/dc = 1/2 - lambda * ab
then
lambda = 1/bc = 1/2ac = 1/2ab
bc == 2ac -> b == 2a
bc == 2ab -> c == 2a
2ac == 2ab -> c == b 
and a == 2b == 2c to minimize the sum

If the total fanout is 1000 then the per-level fanout values that minimize write-amp are 10, 10, 10 for leveled and 6.30, 12.60, 12.60 for this example with leveled-N and can be computed by "bc -l"
# for leveled-N
e(l(1000/4)/3)
6.29960524947436582381

e(l(1000/4)/3) * 2
12.59921049894873164762

# and for leveled
e(l(1000)/3)

9.99999999999999999992

One way to think of this result is that with leveled compaction the goal is to use the same per-level fanout between levels. This also uses the same per-level write-amp between levels because per-level write-amp == the per-level fanout for leveled.

But with leveled-N compaction we need to adjust the per-level fanout for levels to continue to get the same per-level write-amp between levels.


Tuesday, October 2, 2018

Describing tiered and leveled compaction

This is another attempt by me to define the shape of an LSM tree with more formality and this builds on previous posts here and here. My key point is that compaction is the property of a level in an LSM tree rather than the LSM tree. Some levels can use tiered and others can use leveled. This combination of tiered and leveled is already done in popular LSM implementations but it hasn't been called out as a feature.

Stepped Merge

The Stepped Merge paper might have been the first description of tiered compaction. It is a way to improve B-Tree insert performance. It looked like an LSM tree with a few sorted runs at each level. When a level was full the sorted runs at that level were merged to create a larger sorted run in the next level. The per-level write-amplification was 1 because compaction into level N+1 merged runs from level N but did not read/rewrite a run already on level N+1.

This looks like tiered compaction. However it allows for N sorted runs on the max level which means that space-amplification will be >= N. I assume that is too much for most users of tiered compaction in Cassandra, RocksDB and HBase. But this isn't a problem for Stepped Merge because it is an algorithm for buffering changes to a B-Tree, not for storing the entire database and it doesn't lead to a large space-amp for that workload. Note that the InnoDB change buffer is a B-Tree that buffers changes to other B-Trees for a similar reason.

Compaction per level

I prefer to define compaction as a property of a level in an LSM tree rather than a property of the LSM tree. Unfortunately this can hamper discussion because it takes more time and text to explain compaction per level.

I will start with definitions:
  1. When a level is full then compaction is done from it to the next larger level. For now I ignore compaction across many levels, but that is a thing (see "major compaction" in HBase).
  2. A sorted run is a sequence of key-value pairs stored in key order. It is stored using 1+ files.
  3. A level is tiered when compaction into it doesn't read/rewrite sorted runs already in that level. 
  4. A level is leveled when compaction into that level reads/rewrites sorted runs already in that level.
  5. Levels are full when they have a configurable number of sorted runs. In classic leveled compaction a level has one sorted run. A tiered level is full when it has X sorted runs where X is some value >= 2. 
  6. leveled-N uses leveled compaction which reads/rewrites an existing sorted run, but it allows N sorted runs (full when runs == N) rather than 1. 
  7. The per level fanout is size(sorted-run in level N) / size(sorted-run in level N-1)
  8. The total fanout is the product of the per level fanouts. When the write buffer is 1G and the database is 1000G then the total fanout must be 1000.
  9. The runs-per-level is the number of sorted runs in a level when it is full.
  10. The per level write-amplification is the work done to compact from Ln to Ln+1. It is 1 for tiered, all-size(Ln+1) / all-size(Ln) for leveled and run-size(Ln+1) / all-size(Ln) for leveled-N where run-size is the size of a sorted run and all-size is the sum of the sizes of all sorted runs on a level.
A level can be described by a 3-tuple (c, f, r) where c is the type of compaction (T or L for tiered or leveled), f is the fanout and r is the runs-per-level. Unfortunately, now we have made the description of an LSM tree even more complex because there is a 3-tuple per level. For now I don't use 3-tuples to describe the write buffer (memory component). That is a topic for another post. Example 3-tuples include:
  • T:1:4 - this is tiered with fanout=1 and runs-per-level=4. It is a common configuration for the RocksDB level 0 (L0) where the fanout is 1 because the compaction input is a write buffer flush so the size of a sorted run in L0 is similar to the size of a full write buffer. For now I ignore that RocksDB can merge write buffers on a flush.
  • T:8:8 - this is tiered with fanout=8 and runs-per-level=8. When Ln and Ln+1 both use tiered then runs-per-level in Ln == fanout in Ln+1. 
  • T:8:4 - this is tiered with fanout=8 and runs-per-level=4. It might be used when the next larger level uses leveled and runs-per-level on this level can be smaller than fanout to reduce read-amp.
  • L:10:1 - this is common in RocksDB with leveled compaction, fanout=10 and runs-per-level=1
  • L:10:2 - this is leveled-N with runs-per-level=2

Compaction per LSM tree

An LSM tree can be described using the per level 3-tuples from the previous section. The following are examples for popular algorithms.

Classic LSM with total fanout = 1000 is:
  • C0 is the write buffer
  • C1, C2 and C3 are L:10:1
RocksDB leveled with total fanout = 1000 is:
  • L0 is T:1:4
  • L1 is L:1:1
  • L2, L3, L4 are L:10:1
Stepped Merge with total fanout = 1000 is:
  • L1 is T:1:10
  • L2, L3, L4 are T:10:10
Tiered in HBase and Cassandra with total fanout = 1000 might be:
  • L1 is T:1:10
  • L2, L3 are T:10:10
  • L4 is L:10:1

Tiered+leveled

Note that some smaller levels using tiered and some larger levels using leveled is done by both RocksDB leveled and Cassandra/HBase tiered. I think both of these are examples of an LSM variant that I call tiered+leveled but I won't ask any of the projects update their docs. My definition of tiered+leveled is the smallest (1 or more) levels use tiered compaction, then 0 or more levels use leveled-N, then the remaining levels use leveled.  When tiered=T, leveled=L and leveled-N=N then the regex for this is T+N*L+.

I don't think it is a good idea for leveled levels to precede tiered levels in tiered+leveled (TTL is OK, LTL is not) but that is a topic for another post.

The largest level should use leveled compaction with runs-per-level=1 to avoid too much space amplification.

LSM trees with tiered+leveled can be described using 3-tuples and the previous section does that but here I provide one for a tree that uses leveled-N for L1 and L2 with total fanout = 1000:
  • L0 is T:1:4
  • L1 is L:1:2
  • L2 is L:10:2
  • L3, L4 are L:10:1

And another example to show that tiered levels don't have to use the same fanout or runs-per-level, but runs-per-level for Ln == fanout for Ln+1:
  • L0 is T:1:20
  • L1 is T:20:10
  • L2 is T:10:2
  • L3 is L:5:1

Leveled-N

Leveled-N can reduce the per level write-amp at the cost of increasing the per level read-amp.

The regex for an LSM tree that uses leveled-N is N+L+ (see the previous section). The largest level should use leveled compaction with runs-per-level=1 to avoid too much space amplification. An example 3-tuple for leveled-N with fanout=1000 that has runs-per-level=2 for L1 and L2 is:
  • L1 is L:10:2
  • L2 is L:10:2
  • L3 is L:10:1

Monday, October 1, 2018

Transaction Processing in NewSQL

This is a list of references for transaction processing in NewSQL systems. The work is exciting. I don't have much to add and wrote this to avoid losing interesting links. My focus is on OLTP, but some of these systems support more than that.

By NewSQL I mean the following. I am not trying to define "NewSQL" for the world:
  1. Support for multiple nodes because the storage/compute on one node isn't sufficient.
  2. Support for SQL with ACID transactions. If there are shards then cross-shard operations can be consistent and isolated.
  3. Replication does not prevent properties listed above when you are wiling to pay the price in commit overhead. Alas synchronous geo-replication is slow and too-slow commit is another form of downtime. I hope NewSQL systems make this less of a problem (async geo-replication for some or all commits, commutative operations). Contention and conflict are common in OLTP and it is important to understand the minimal time between commits to a single row or the max number of commits/second to a single row.
NewSQL Systems
  • MySQL Cluster - this was NewSQL before NewSQL was a thing. There is a nice book that explains the internals. There is a company that uses it to make HDFS better. Cluster seems to be more popular for uses other than web-scale workloads.
  • VoltDB - another early NewSQL system that is still getting better. It was after MySQL Cluster but years before Spanner and came out of the H-Store research effort.
  • Spanner - XA across-shards, Paxos across replicas, special hardware to reduce clock drift between nodes. Sounds amazing, but this is Google so it just works. See the papers that explain the system and support for SQL. This got the NewSQL movement going.
  • CockroachDB - the answer to implementing Spanner without GPS and atomic clocks. From that URL they explain it as "while Spanner always waits after writes, CockroachDB sometimes waits before reads". It uses RocksDB and they help make it better.
  • FaunaDB - FaunaDB is inspired by Calvin and Daniel Abadi explains the difference between it and Spanner -- here and here. Abadi is great at explaining distributed systems, see his work on PACELC (and the pdf). A key part of Calvin is that "Calvin uses preprocessing to order transactions. All transactions are inserted into a distributed, replicated log before being processed." This approach might limit the peak TPS on a large cluster, but I assume that doesn't matter for a large fraction of the market.
  • YugaByte - another user of RocksDB. There is much discussion about it in the recent Abadi post. Their docs are amazing -- slides, transaction IO path, single-shard write IO path, distributed ACID and single-row ACID.
  • TiDB - I don't know much about it but they are growing fast and are part of the MySQL community. It uses RocksDB (I shouldn't have forgotten that).
Other relevant systems