Tuesday, January 15, 2019

Geek code for LSM trees

This is a link to slides from my 5-minute talk at the CIDR 2019 Gong Show. The slides are a brief overview of the geek code for LSM trees. If you click on the settings icon in the slide show you can view the speaker notes which have links to blog posts that have more details. I also pasted the links below. Given time I might add to this post, but most of the content is in my past blog posts. Regardless I think there is more to be discovered about performant, efficient and manageable LSM trees.

The key points are there are more compaction algorithms to discover, we need to make it easier to describe them and compaction is a property of a level, not of the LSM tree.

Links to posts with more details:

Thursday, January 10, 2019

LSM math: fixing mistakes in my last post

My last post explained the number of levels in an LSM that minimizes write amplification using 3 different estimates for the per-level write-amp. Assuming the per-level growth factor is w then the 3 estimates were approximately w, w+1 and w-1 and named LWA-1, LWA-2 and LWA-3 in the post.

I realized there was a mistake in that post for the analysis of LWA-3. The problem is that the per-level write-amp must be >= 1 (and really should be > 1) but the value of w-1 is <= 1 when the per-level growth factor is <= 2. By allowing the per-level write-amp to be < 1 it easy to incorrectly show that a huge number of levels reduces write-amp as I do for curve #3 in this graph. While I don't claim that (w-1) or (w-1)/2 can't be a useful estimate for per-level write-amp in some cases, it must be used with care.

Explaining LWA-3

The next challenge is to explain how LWA-3 is derived. That comes from equation 12 on page 9 of the Dostoevsky paper. Start with the (T-1)/(K+1) term and with K=1 then this is (T-1)/2. T in the paper is the per-level growth factor so this is the same as (w-1)/2. The paper mentions that this is derived using an arithmetic series but does not show the work. I show my work but was not able to reproduce that result.

Assume that the per-level growth factor is w, all-to-all compaction is used and the LSM tree has at least 3 levels. When full L1 has size 1, L2 has size w and L3 has size w*w. There are four derivations below - v1, v2, v3, v4. The results are either w/2 or (w+1)/2 which doesn't match (w-1)/2 from the paper. Fortunately, my previous post shows how to minimize total write-amp assuming the per-level write-amp is w/2 or (w+1)/2. I will contact the author to figure out what I am missing.

The analysis below is for merges from L1 to L2, but it holds for merges from Ln to Ln+1. I think that v1 and v2 are correct and their estimate for per-level write-amp is (w+1)/2. As explained below I don't think that v3 or v4 are correct, their estimate for per-level write-amp is w/2.

I have yet to explain how to get (w-1)/2.

v1

Assume that merges are triggered from Ln to Ln+1 when a level is full -- L1 has size 1, L2 has size w, L3 has size w*w. A level is empty immediately after it is merged into the next level. So L2 gets full, then is merged into L3 and becomes empty, then slowly gets larger as L1 is merged into it w times. The per-level write-amp from this is (w+1)/2.

* merges into L2 write output of size 1, 2, ..., w
* then L2 is full
* sum of that sequence -> w*(w+1)/2
* average value is sum/w -> (w+1)/2

1) Moving data of size 1 from L1 to L2 writes (w+1)/2 on average
2) Therefore per-level write-amp for L1 -> L2 is (w+1)/2

Note that per-level write-amp is (avg merge output to Ln / size of Ln-1)
* avg merge output to L2 is (w+1)/2
* size of Ln-1 is 1


v2

Assume that merges are triggered from Ln to Ln+1 when a level is almost full -- L1 has size 1 * (w-1)/w, L2 has size w * (w-1)/w, L3 has size (w*w) * (w-1)/w. The trigger conditions can be reduced to L1 has size (w-1)/w, L2 has size (w-1) and L3 has size w*(w-1).

This assumes that w merges are done from L1 to L2 for L2 to go from empty to full. Each merge adds data of size (w-1)/w because L1:L2 merge is triggered when L1 has that much data. Thus L2 has size (w-1) after w merges into it at which point L2:L3 merge can be done. The per-level write-amp from this is the same as it was for v1.

* merges into L2 write output of size (w-1)/w * [1, 2, ..., w]
* then L2 is full
* sum of that sequence -> (w-1)/w * w*(w+1)/2 = (w-1)(w+1)/2
* average value is sum/w -> (w-1)(w+1)/(2*w)

As from v1, per-level write-amp is (avg merge output to Ln / size of Ln-1)

* avg merge output to L2 = (w-1)(w+1)/(2*w)
* size of L1 = (w-1)/w


start with: ( (w-1)(w+1)/(2*w) ) / ( (w-1)/w )
simplify to: (w+1)/2


v3

Merges are triggered the same as for v1 but I assume that only w-1 merges are done from Ln to Ln+1 rather than w. Ln+1 won't be full at the end of that, for example L2 would have size w-1 rather than the expected size w. But I was curious about the math. The per-level write-amp is w/2.

* merges into L2 write output of size 1, 2, ..., w-1
* sum of that sequence -> (w-1)*w/2
* average value is sum/(w-1) -> w/2

1) Moving data of size 1 from L1 to L2 writes w/2 on average
2) Therefore per-level write-amp for L1 -> L2 is w/2

v4

Merges are triggered the same as for v2. But as with v3, only w-1 merges are done into a level. Again I don't think this is correct because a level won't have enough data to trigger compaction at that point. The per-level write-amp here is the same as for v3.

* merges into L2 write output of size (w-1)/w * [1, 2, ..., w-1]
* sum of that sequence -> (w-1)/w * (w-1)*w/2 = (w-1)(w-1)/2
* average value is sum/(w-1) -> (w-1)/2

As from v1, per-level write-amp is (avg merge output to Ln / size of Ln-1)

* avg merge output to L2 = (w-1)/2
* size of L1 = (w-1)/w


start with: ( (w-1)/2 ) / ( (w-1)/w )
simplify to: w/2




Wednesday, January 9, 2019

LSM math: revisiting the number of levels that minimizes write amplification

I previously used math to explain the number of levels that minimizes write amplification for an LSM tree with leveled compaction. My answer was one of ceil(ln(T)) or floor(ln(T)) assuming the LSM tree has total fanout = T where T is size(database) / size(memtable).

Then I heard from a coworker that the real answer is less than floor(ln(T)). Then I heard from Niv Dayan, first author of the Dostoevsky paper, that the real answer is larger than ceil(ln(T)) and the optimal per-level growth factor is ~2 rather than ~e.

All of our answers are correct. We have different answers because we use different functions to estimate the per-level write-amp. The graph of the functions for total write-amp using the different cost functions is here and you can see that the knee in the curve occurs at a different x value for two of the curves and the third curve doesn't appear to have a minimum.

While working on this I learned to love the Lambert W function. But I wonder whether I made the math below for LWA-2 harder than necessary. I am happy to be corrected. I appreciate the excellent advice on Quora: here, here and here. The online graphing calculator Desmos is another great resource.

Math

I use differentiable functions to express the total write-amp as a function of the number of levels, then determine the value (number of levels) at which the first derivative is zero as that might be the global minimum. Constants, variables and functions below include:
  • T - total fanout, = size(database) / size(memtable)
  • n - number of levels in the LSM tree
  • LWA, LWA-x - function for the per-level write-amp
  • TWA, TWA-x - function for the total write-amp, = n * LWA
  • w - per-level growth factor, = T^(1/n) for all levels to minimize write-amp
The function for total write-amp has the form: TWA = n * LWA where n is the number of levels and LWA is the per-level write-amp. LWA is a function of T and n. The goal is determine the value of n at which TWA is minimized. While n must be an integer the math here doesn't enforce that and the result should be rounded up or down to an integer. T is a constant as I assume a given value for total fanout. Here I use T=1024.

I wrote above that the 3 different answers came from using 3 different estimates for the per-level write-amp and I label these LWA-1, LWA-2 and LWA-3. When w is the per-level growth factor then the per-level write-amp functions are:
  • LWA-1 = w -- I used this to find that the best n = ceil(ln(T)) or floor(ln(T))
  • LWA-2 = w + 1 -- with this the best n is less than that found with LWA-1
  • LWA-3 = (w - 1) / 2 -- with this the best n is greater than that found with LWA-1
I can also state the per-level write-amp functions directly with T and n. I didn't above to make it easier to see the differences.
  • LWA-1 = T^(1/n)
  • LWA-2 = T^(1/n) + 1
  • LWA-3 = (T^(1/n) - 1) / 2
Explaining LWA

First I explain LWA-1 and LWA-2. Compacting 1 SST from Ln to Ln+1 requires merging 1 SST from Ln with ~w SSTs from Ln+1 where w=10 by default with RocksDB. The output will be between w and w+1 SSTs. If the output is closer to w then LWA-1 is correct. If the output is closer to w+1 then LWA-2 is correct. This paper explains why the per level write-amp is likely to be less than w. Were I to use f*w where f < 1 for LWA-1 then the math still holds. Maybe that is a future blog post.

LWA-3 assumes that all-to-all compaction is used rather than some-to-some. I explain the difference here. RocksDB/LevelDB leveled uses some-to-some but all-to-all is interesting. With all-to-all when compaction from Ln to Ln+1 finishes then Ln is empty and slowly gets full after each merge into it. Assume the per-level growth factor is w and Ln-1, Ln and Ln+1 are full at sizes 1, w and w*w. Then Ln becomes full after w merges from Ln-1 and those write output of size 1, 2, ..., w-1, w. The sum of the first w integers is w(w+1)/2. Divide this by w to get the averge -- (w+1)/2. However above LWA-3 is (w-1)/2 not (w+1)/2. I will explain that in another blog post. Note that in LWA-3 the numerator, w-1, is more interesting than the denominator, 2. Dividing by any constant doesn't change where the minimum occurs assuming there is a minimum and that is visible on this graph that shows the impact of dividing by 2 on the total write-amp.

Read on to understand the impact of using w-1, w or w+1 as the function for per-level write-amp. The difference might be more significant than you expect. It surprised me.

Minimizing TWA

This graph shows the total write-amp for LWA-1, LWA-2 and LWA-3. I call the total write-amp TWA-1, TWA-2 and TWA-3. Two of the curves, for TWA-1 and TWA-2, appear to have a minimum. One occurs for x between 4 and 6, the other for x between 6 and 8. The third curve, for TWA-3, doesn't appear to have a minimum and is decreasing as x (number of levels) grows.

The next graph uses the first derivative for the total write-amp functions, so it is for TWA-1', TWA-2' and TWA-3'. A global minimum for TWA-x can occur when TWA-x' = 0 and from the graph TWA-1'=0 when x=6.931 and TWA-2'=0 when x=5.422 which matches the estimate from the previous paragraph. From the graph it appears that TWA-3' approaches zero as x gets large but is never equal to zero.

The next step is to use math to confirm what is visible on the graphs.

Min write-amp for LWA-1

See my previous post where I show that n = ln(T) minimizes total write-amp if n isn't limited to an integer and then the per-level growth factor is e. Since the number of levels must be an integer then one of ceil(ln(T)) or floor(ln(T)) minimized total write-amp.

Min write-amp for LWA-2

I can reuse some of the math from my previous post. But this one is harder to solve.

# wa is the total write-amp
# n is the number of levels

# t is the total fanout
wa = n * ( t^(1/n) + 1 )
wa = n*t^(1/n) + n

# the difference between this and the previous post is '+1'

wa' = t^(1/n) + n * ln(t) * t^(1/n) * (-1) * (1/n^2) + 1
wa' = t^(1/n) - (1/n) * ln(t) * t^(1/n) + 1

At this point the difference between this and the previous post is '+1'. But wait this starts to get interesting.

# critical point for this occurs when wa' = 0
t^(1/n) - (1/n) * ln(t) * t^(1/n) + 1 = 0

# multiply by t^(-1/n)
1 - (1/n) * ln(t) + t^(-1/n) = 0

# move some terms to RHS
t^(-1/n) = (1/n) ln(t) - 1

# use ln on LHS and RHS to get rid of '^(1/n)'
ln ( t^(-1/n) ) = ln( (1/n) * ln(t) - 1 )
(-1/n) ln(t) =  ln( (1/n) * ln(t) - 1

I got stuck here but eventually made progress.

# let a = (1/n) ln(t) and rewrite
-a = ln(a - 1)

# let x=a-1, a=x+1 and rewrite
-(x+1) = ln(x)

# do e^LHS = e^RHS
e^-(x+1) = e^ln(x)
e^-x * e^-1 = x

# multiply LHS and RHS by e^x
e^-1 = e^x * x

# e^-1 -> (1/e)
(1/e)  =  e^x * x


At last I can use Lambert W function!

# Given: e^x * x = K, then x = W(K)
x = W(e^-1) ~= 0.27846


# because a=x+1
a ~= 1.27846


# a = (1/n) ln(t) -> n = (1/a) ln(t), t=1024
n = 1/1.27846 * ln(1024)

# The value for n that minimizes total write-amp
# from the graph I claimed that n=5.422. this is close
n = 5.4217


Min write-amp for LWA-3

Update-1 - I think I made a few mistakes here. So you can stop reading until update-2 arrives.

Update-2 - this post explains my mistake and uses math to estimate that per-level write-amp = (w+1)/2 when all-to-all compaction is used. I am still unable to derive (w-1)/2.

I started to work on this without paying attention to the curve for LWA-3'. From the graph it appears to converge to 0 but is always less than 0, TWA-3 is decreasing as x, number of levels, gets large. Therefore make the number of levels as large as possible, 2M or 2B, to minimize total write-amp as visible in this graph.

But more levels in the LSM tree comes at a cost -- more read-amp. And the reduction in write-amp is small when the number of levels increases from 20 to 200 to 2000 to 2M. Again, this is visible in the graph. Besides, if you really want less write-amp then use tiered compaction rather than leveled with too many levels.

The other consideration is the minimal per-level growth factor that should be allowed. If the min per-level growth factor is 2. Then then that occurs when the number of levels, n, is:

# assume total fanout is 1024

2^n = 1024
log2(2^n) = log2(1024)
n = log2(1024) = 10


Alas the total fanout isn't always a power of 2. Given that the number of levels must be an integer then the goal is to use the smallest number of levels such that the per-level growth factor >= 2. Therefore when x isn't limited to an integer there is no answer -- just make x as large as possible (1M, 1B, etc) in which case the per-level growth factor converges to 1 but is always greater than 1.

The above can be repeated where the constraint is either the max number of levels or a different value for the min per-level growth factor (either <2 or >2). Regardless, if LWA-3 is the cost function then total write-amp is minimized by using as many levels as possible subject to these constraints.

Below is some math for LWA-3 and LWA-3'.

# wa is the total write-amp
# n is the number of levels

# t is the total fanoutwa = n * ( t^(1/n) - 1 ) / 2
wa = (n*t^(1/n) - n ) / 2

# the big difference between this and the previous post is '+1'

wa' = [ t^(1/n) + n * ln(t) * t^(1/n) * (-1) * (1/n^2) - 1 ] / 2
wa' = [ t^(1/n) - (1/n) * ln(t) * t^(1/n) - 1 ] / 2

# determine when wa' = 0
[ t^(1/n) - (1/n) * ln(t) * t^(1/n) - 1 ] / 2 = 0

# multiply LHS and RHS by 2
t^(1/n) - (1/n) * ln(t) * t^(1/n) - 1 = 0# multiply LHS and RHS by t^(-1/n)
1 - (1/n) * ln(t) - t^(-1/n) = 0

# move last term to RHS
1 - (1/n) * ln(t) = t^(-1/n)

# probably a good idea to stop here
# LHS is likely to be <0 so can't use ln(LHS) = ln(RHS)


Monday, January 7, 2019

Define "better"

Welcome to my first rant of 2019, although I have written about this before. While I enjoy benchmarketing from a distance it is not much fun to be in the middle of it. The RocksDB project has been successful and thus becomes the base case for products and research claiming that something else is better. While I have no doubt that other things can be better I am wary about the definition of better.

There are at least 3 ways to define better when evaluating database performance. The first, faster is better, ignores efficiency, the last two do not. I'd rather not ignore efficiency. The marginal return of X more QPS eventually becomes zero while the benefit of using less hardware is usually greater than zero.
  1. Optimize for throughput and ignore efficiency (faster is better)
  2. Get good enough performance and then optimize for efficiency
  3. Get good enough efficiency and then optimize for throughput
Call to action

I forgot to include this before publishing. Whether #1, #2 or #3 is followed I hope that more performance results include details on the HW consumed to create that performance. How much memory and disk space were used? What was the CPU utilization? How many bytes were read from and written to storage? How much random IO was used? I try to report both absolute and relative values where relative values are normalized by the transaction rate.

Thursday, January 3, 2019

Review of LSM-based Storage Techniques: A Survey

Chen Luo and Mike Carey published a wonderful survey of research on LSM algorithms. They know about LSM because the AsterixDB project includes an LSM. They did a great job explaining the LSM space, telling a coherent story and summarizing relevant papers. Reading this paper was a good use of my time and I found a few more papers to read in their references.

I have read a few papers, including TRIAD, with ideas on reducing write-amp for the smaller levels of the LSM tree. I think this could be done for RocksDB by merging and remerging immutable memtables -- this is similar in spirit to subcompactions for the L0. With a large immutable memtable there would be one less level in the LSM tree. This is an alternative to having an L0, and maybe an L1, that are not made durable. In all cases the cost is a longer MTTR because WAL replay must be done. In all cases there is an assumption that the non-durable levels (large immutable memtables or L0/L1) are in memory.

This is a small complaint from me that I have made in the past. The paper states that an LSM eliminates random IO when making things durable. I prefer to claim that it reduces random IO. With leveled compaction each step merges N (~11) SSTs to generate one steam of output. So for each step there is likely a need to seek when reading the ~11 input streams and writing the output stream. Then compaction steps usually run concurrently when the ingest rate is high so there are more seeks. Then the WAL must be written -- one more stream and a chance for more seeks. Finally user queries are likely to read from storage causing even more seeks.  Fortunately, there will be fewer seeks per insert/update/delete compared to a B-Tree.

The paper has a short history of compaction describing pure-tiered and pure-leveled. But these are rarely used in practice. The original LSM paper implemented pure-leveled. LevelDB and RocksDB use a hybrid approach with tiered for the L0 followed by leveled for the remaining levels. Pure-tiered was introduced by the Stepped Merge paper. Using tiered for all levels has a large space-amplification, much larger than 1, because the max level is tiered and that is too much wasted space for many workloads. Tiered in RocksDB and other popular LSM engines can be configured to use leveled compaction into the max level to get a space-amp less than 2, ignoring transient space-amp during compaction into the max level. Pure-tiered was a great choice for Stepped Merge because that was a cache for bulk-loading a data warehouse rather than a full copy of the database. While I think that RocksDB leveled and RocksDB tiered are examples of tiered+leveled, I don't want to rename them.

I appreciate that the paper makes clear that trade-offs must be considered when evaluating benchmarks. Many things can support higher write rates than RocksDB with leveled compaction, including RocksDB with tiered compaction. But that comes at a cost in memory, read and/or space amplification. Some papers could do a better job of documenting those costs.

The cost analysis in section 2.3 is limited to IO costs. I look forward to coverage of CPU costs in future LSM research. The read penalty for an LSM compared to a B-Tree is usually worse for CPU than for IO. The paper uses partitioned and non-partitioned where I use all-to-all and some-to-some to explain the compaction approaches. RocksDB implements some-to-some for leveled and all-to-all for tiered. The paper does a nice job explaining why the per-level write-amp should be less for all-to-all than some-to-some, ignoring write skew. Note that in production the per-level write-amp is almost always less than the per-level growth factor and this paper from Hyeontaek Lim explains why.

For the read IO costs, the paper counts logical IOs rather than physical IOs. Logical IOs are easier to estimate because caches mean that many logical IOs don't cause a physical IO and smaller levels in the LSM tree are usually in cache. There are two ways to consider the cost for a range query -- long vs short range queries or the cost of range seek vs range next. The paper uses the first, I use the second. Both are useful.

I appreciate that the author noticed this. I realize there is pressure to market research and I am not offering to try and reproduce benchmark results, but I have been skeptical about some of the comparisons I see where the base case is InnoDB or RocksDB.
These improvements have mainly been evaluated against a default (untuned) configuration of LevelDB or RocksDB, which use the leveling merge policy with size ratio 10. It is not clear how these improvements would compare against a well-tuned LSM-tree.
The discussion in 3.3.1 on pipelining compaction is interesting but RocksDB already does pipelining. With buffered IO there is support for async read-ahead and async write-behind. Note that the read and write phases can also be CPU-heavy if the cost for decompression on read and compression on write are included, even when the wonderful zstd and lz4 algorithms are used.

A few more comments:
  • RocksDB has limited support for fractional cascading (from SST to SST). See 3.4.2.
  • With key-value separation, GC could merge log segments to generate longer ordered log segments over time. This would reduce the range read penalty. See 3.4.2.
  • LHAM might be the first time-series optimized compaction strategy. See 3.5.
  • Non-unique secondary index maintenance is already read-free in MyRocks. It has a copy of the row prior to index maintenance, because SQL semantics or because this was an insert. Write-optimized SQL engines can add support for read-free change statements in some cases but that usually means SQL semantics (like modified row count) will be broken. See 3.7.2.
  • MyRocks already collects statistics during compaction. See 3.7.3.

Monday, December 17, 2018

New small servers for performance testing

My old NUC cluster found a new home and I downsized to 2 new NUC servers. The new server is NUC8i7beh with 16g RAM, 500g Samsung 860 EVO for the OS and 500g Samsung 970 EVO for performance. The Samsung 860 is SATA and the Samsung 970 is an m.2 device. I expect to wear out the performance devices as I have done that in the past. With the OS on a separate device I avoid the need to reinstall the OS when that happens.

The new NUC has a post-Skylake CPU (i7-8559u), provides 4 cores (8 HW threads) compared to 2 cores (4 HW threads) in the old NUCs. I disabled turbo boost again to avoid performance variance as mentioned in the old post. I am not sure these have sufficient cooling for sustained boost and when boost isn't sustained there are frequent changed in CPU performance. I also disabled hyperthreads out of concern for both the impact from Spectre fixes and to avoid a different syscall overhead each time I update the kernel.

I might use these servers to examine the impact of the ~10x increase in PAUSE times on InnoDB with and without HT enabled. I might also use them for another round of MySQL performance testing when 8.0.14 is release.

I am a big fan of Intel NUC servers. But maybe I am not a fan of the SATA cables they use. I already had one of my old NUCs replaced under warranty after one of the SATA wires was bare. In the new NUCs I just setup a few of the SATA cables appear to be cut and I wonder if that eventually becomes bare.

Friday, December 14, 2018

LSM math - size of search space for LSM tree configuration

I have written before and will write again about using 3-tuples to explain the shape of an LSM tree. This makes it easier to explain the configurations supported today and configurations we might want to support tomorrow in addition to traditional tiered and leveled compaction. The summary is that n LSM tree has N levels labeled from L1 to Ln and Lmax is another name for L1. There is one 3-tuple per level and the components of the 3-tuple are (type, fanout, runs) for Lk (level k) where:
  • type is Tiered or Leveled and explains compaction into that level
  • fanout is the size of a sorted run in Lk relative to a sorted run from Lk-1, a real and >= 1
  • runs is the number of sorted runs in that level, an integer and >= 1
Given the above how many valid configurations exist for an LSM tree? There are additional constraints that can be imposed on the 3-tuple but I will ignore most of them except for limiting fanout and runs to be <= 20. The answer is easy - there are an infinite number of configurations because fanout is a real.

The question is more interesting when fanout is limited to an integer and the number of levels is limited to between 1 and 10. I am doing this to explain the size of the search space but I don't think that fanout should be limited to an integer.

There are approximately 2^11 configurations only considering compaction type, which has 2 values, and 1 to 10 levels because there are 2^N configurations of compaction types for a tree with N levels and the sum of 2^1 + 2^2 + ... + 2^9 + 2^10 = 2^11 - 1

But when type, fanout and runs are considered then there are 2 x 20 x 20 = 800 choices per level and 800^N combinations for an LSM tree with N levels. Considering LSM trees with 1 to 10 levels then the number of valid configurations is the sum 800^1 + 800^2 + ... + 800^9 + 800^10. That is a large number of configurations if exhaustive search were to be used to find the best configuration. Note that I don't think exhaustive search should be used.