Small Datum: LSM math: revisiting the number of levels that minimizes write amplification

I previously used math to explain the number of levels that minimizes write amplification for an LSM tree with leveled compaction. My answer was one of ceil(ln(T)) or floor(ln(T)) assuming the LSM tree has total fanout = T where T is size(database) / size(memtable).

Then I heard from a coworker that the real answer is less than floor(ln(T)). Then I heard from Niv Dayan, first author of the Dostoevsky paper, that the real answer is larger than ceil(ln(T)) and the optimal per-level growth factor is ~2 rather than ~e.

All of our answers are correct. We have different answers because we use different functions to estimate the per-level write-amp. The graph of the functions for total write-amp using the different cost functions is here and you can see that the knee in the curve occurs at a different x value for two of the curves and the third curve doesn't appear to have a minimum.

While working on this I learned to love the Lambert W function. But I wonder whether I made the math below for LWA-2 harder than necessary. I am happy to be corrected. I appreciate the excellent advice on Quora: here, here and here. The online graphing calculator Desmos is another great resource.

Math

I use differentiable functions to express the total write-amp as a function of the number of levels, then determine the value (number of levels) at which the first derivative is zero as that might be the global minimum. Constants, variables and functions below include:

T - total fanout, = size(database) / size(memtable)
n - number of levels in the LSM tree
LWA, LWA-x - function for the per-level write-amp
TWA, TWA-x - function for the total write-amp, = n * LWA
w - per-level growth factor, = T^(1/n) for all levels to minimize write-amp

The function for total write-amp has the form: TWA = n * LWA where n is the number of levels and LWA is the per-level write-amp. LWA is a function of T and n. The goal is determine the value of n at which TWA is minimized. While n must be an integer the math here doesn't enforce that and the result should be rounded up or down to an integer. T is a constant as I assume a given value for total fanout. Here I use T=1024.

I wrote above that the 3 different answers came from using 3 different estimates for the per-level write-amp and I label these LWA-1, LWA-2 and LWA-3. When w is the per-level growth factor then the per-level write-amp functions are:

LWA-1 = w -- I used this to find that the best n = ceil(ln(T)) or floor(ln(T))
LWA-2 = w + 1 -- with this the best n is less than that found with LWA-1
LWA-3 = (w - 1) / 2 -- with this the best n is greater than that found with LWA-1

I can also state the per-level write-amp functions directly with T and n. I didn't above to make it easier to see the differences.

LWA-1 = T^(1/n)
LWA-2 = T^(1/n) + 1
LWA-3 = (T^(1/n) - 1) / 2

Explaining LWA

First I explain LWA-1 and LWA-2. Compacting 1 SST from Ln to Ln+1 requires merging 1 SST from Ln with ~w SSTs from Ln+1 where w=10 by default with RocksDB. The output will be between w and w+1 SSTs. If the output is closer to w then LWA-1 is correct. If the output is closer to w+1 then LWA-2 is correct. This paper explains why the per level write-amp is likely to be less than w. Were I to use f*w where f < 1 for LWA-1 then the math still holds. Maybe that is a future blog post.

LWA-3 assumes that all-to-all compaction is used rather than some-to-some. I explain the difference here. RocksDB/LevelDB leveled uses some-to-some but all-to-all is interesting. With all-to-all when compaction from Ln to Ln+1 finishes then Ln is empty and slowly gets full after each merge into it. Assume the per-level growth factor is w and Ln-1, Ln and Ln+1 are full at sizes 1, w and w*w. Then Ln becomes full after w merges from Ln-1 and those write output of size 1, 2, ..., w-1, w. The sum of the first w integers is w(w+1)/2. Divide this by w to get the averge -- (w+1)/2. However above LWA-3 is (w-1)/2 not (w+1)/2. I will explain that in another blog post. Note that in LWA-3 the numerator, w-1, is more interesting than the denominator, 2. Dividing by any constant doesn't change where the minimum occurs assuming there is a minimum and that is visible on this graph that shows the impact of dividing by 2 on the total write-amp.

Read on to understand the impact of using w-1, w or w+1 as the function for per-level write-amp. The difference might be more significant than you expect. It surprised me.

Minimizing TWA

This graph shows the total write-amp for LWA-1, LWA-2 and LWA-3. I call the total write-amp TWA-1, TWA-2 and TWA-3. Two of the curves, for TWA-1 and TWA-2, appear to have a minimum. One occurs for x between 4 and 6, the other for x between 6 and 8. The third curve, for TWA-3, doesn't appear to have a minimum and is decreasing as x (number of levels) grows.

The next graph uses the first derivative for the total write-amp functions, so it is for TWA-1', TWA-2' and TWA-3'. A global minimum for TWA-x can occur when TWA-x' = 0 and from the graph TWA-1'=0 when x=6.931 and TWA-2'=0 when x=5.422 which matches the estimate from the previous paragraph. From the graph it appears that TWA-3' approaches zero as x gets large but is never equal to zero.

The next step is to use math to confirm what is visible on the graphs.

Min write-amp for LWA-1

See my previous post where I show that n = ln(T) minimizes total write-amp if n isn't limited to an integer and then the per-level growth factor is e. Since the number of levels must be an integer then one of ceil(ln(T)) or floor(ln(T)) minimized total write-amp.

Min write-amp for LWA-2

I can reuse some of the math from my previous post. But this one is harder to solve.

# wa is the total write-amp
# n is the number of levels
# t is the total fanout
wa = n * ( t^(1/n) + 1 )
wa = n*t^(1/n) + n

# the difference between this and the previous post is '+1'
wa' = t^(1/n) + n * ln(t) * t^(1/n) * (-1) * (1/n^2) + 1
wa' = t^(1/n) - (1/n) * ln(t) * t^(1/n) + 1

At this point the difference between this and the previous post is '+1'. But wait this starts to get interesting.
# critical point for this occurs when wa' = 0
t^(1/n) - (1/n) * ln(t) * t^(1/n) + 1 = 0

# multiply by t^(-1/n)
1 - (1/n) * ln(t) + t^(-1/n) = 0

# move some terms to RHS
t^(-1/n) = (1/n) ln(t) - 1
# use ln on LHS and RHS to get rid of '^(1/n)'
ln ( t^(-1/n) ) = ln( (1/n) * ln(t) - 1 )
(-1/n) ln(t) = ln( (1/n) * ln(t) - 1

I got stuck here but eventually made progress.

# let a = (1/n) ln(t) and rewrite
-a = ln(a - 1)

# let x=a-1, a=x+1 and rewrite
-(x+1) = ln(x)

# do e^LHS = e^RHS
e^-(x+1) = e^ln(x)
e^-x * e^-1 = x

# multiply LHS and RHS by e^x
e^-1 = e^x * x

# e^-1 -> (1/e)
(1/e) = e^x * x

At last I can use Lambert W function!

# Given: e^x * x = K, then x = W(K)
x = W(e^-1) ~= 0.27846

# because a=x+1
a ~= 1.27846

# a = (1/n) ln(t) -> n = (1/a) ln(t), t=1024
n = 1/1.27846 * ln(1024)

# The value for n that minimizes total write-amp
# from the graph I claimed that n=5.422. this is close
n = 5.4217

Min write-amp for LWA-3

Update-1 - I think I made a few mistakes here. So you can stop reading until update-2 arrives.

Update-2 - this post explains my mistake and uses math to estimate that per-level write-amp = (w+1)/2 when all-to-all compaction is used. I am still unable to derive (w-1)/2.

I started to work on this without paying attention to the curve for LWA-3'. From the graph it appears to converge to 0 but is always less than 0, TWA-3 is decreasing as x, number of levels, gets large. Therefore make the number of levels as large as possible, 2M or 2B, to minimize total write-amp as visible in this graph.

But more levels in the LSM tree comes at a cost -- more read-amp. And the reduction in write-amp is small when the number of levels increases from 20 to 200 to 2000 to 2M. Again, this is visible in the graph. Besides, if you really want less write-amp then use tiered compaction rather than leveled with too many levels.

The other consideration is the minimal per-level growth factor that should be allowed. If the min per-level growth factor is 2. Then then that occurs when the number of levels, n, is:

# assume total fanout is 1024
2^n = 1024
log2(2^n) = log2(1024)
n = log2(1024) = 10

Alas the total fanout isn't always a power of 2. Given that the number of levels must be an integer then the goal is to use the smallest number of levels such that the per-level growth factor >= 2. Therefore when x isn't limited to an integer there is no answer -- just make x as large as possible (1M, 1B, etc) in which case the per-level growth factor converges to 1 but is always greater than 1.

The above can be repeated where the constraint is either the max number of levels or a different value for the min per-level growth factor (either <2 or >2). Regardless, if LWA-3 is the cost function then total write-amp is minimized by using as many levels as possible subject to these constraints.

Below is some math for LWA-3 and LWA-3'.

# wa is the total write-amp
# n is the number of levels
# t is the total fanoutwa = n * ( t^(1/n) - 1 ) / 2
wa = (n*t^(1/n) - n ) / 2

# the big difference between this and the previous post is '+1'
wa' = [ t^(1/n) + n * ln(t) * t^(1/n) * (-1) * (1/n^2) - 1 ] / 2
wa' = [ t^(1/n) - (1/n) * ln(t) * t^(1/n) - 1 ] / 2

# determine when wa' = 0[ t^(1/n) - (1/n) * ln(t) * t^(1/n) - 1 ] / 2 = 0

# multiply LHS and RHS by 2t^(1/n) - (1/n) * ln(t) * t^(1/n) - 1 = 0# multiply LHS and RHS by t^(-1/n)
1 - (1/n) * ln(t) - t^(-1/n) = 0

# move last term to RHS
1 - (1/n) * ln(t) = t^(-1/n)

# probably a good idea to stop here
# LHS is likely to be <0 so can't use ln(LHS) = ln(RHS)

Small Datum

Wednesday, January 9, 2019

LSM math: revisiting the number of levels that minimizes write amplification

No comments:

Post a Comment

RocksDB on a big server: LRU vs hyperclock, v2