Friday, December 14, 2018

LSM math - size of search space for LSM tree configuration

I have written before and will write again about using 3-tuples to explain the shape of an LSM tree. This makes it easier to explain the configurations supported today and configurations we might want to support tomorrow in addition to traditional tiered and leveled compaction. The summary is that n LSM tree has N levels labeled from L1 to Ln and Lmax is another name for L1. There is one 3-tuple per level and the components of the 3-tuple are (type, fanout, runs) for Lk (level k) where:
  • type is Tiered or Leveled and explains compaction into that level
  • fanout is the size of a sorted run in Lk relative to a sorted run from Lk-1, a real and >= 1
  • runs is the number of sorted runs in that level, an integer and >= 1
Given the above how many valid configurations exist for an LSM tree? There are additional constraints that can be imposed on the 3-tuple but I will ignore most of them except for limiting fanout and runs to be <= 20. The answer is easy - there are an infinite number of configurations because fanout is a real.

The question is more interesting when fanout is limited to an integer and the number of levels is limited to between 1 and 10. I am doing this to explain the size of the search space but I don't think that fanout should be limited to an integer.

There are approximately 2^11 configurations only considering compaction type, which has 2 values, and 1 to 10 levels because there are 2^N configurations of compaction types for a tree with N levels and the sum of 2^1 + 2^2 + ... + 2^9 + 2^10 = 2^11 - 1

But when type, fanout and runs are considered then there are 2 x 20 x 20 = 800 choices per level and 800^N combinations for an LSM tree with N levels. Considering LSM trees with 1 to 10 levels then the number of valid configurations is the sum 800^1 + 800^2 + ... + 800^9 + 800^10. That is a large number of configurations if exhaustive search were to be used to find the best configuration. Note that I don't think exhaustive search should be used.

Thursday, December 13, 2018

LSM math - how many levels minimizes write amplification?

How do you configure an LSM tree with leveled compaction to minimize write amplification? For a given number of levels write-amp is minimal when the same fanout (growth factor) is used between all levels, but that does not explain the number of levels to use. In this post I answer that question.
  1. The number of levels that minimizes write-amp is one of ceil(ln(T)) or floor(ln(T)) where T is the total fanout -- sizeof(database) / sizeof(memtable)
  2. When #1 is done then the per-level fanout is e when the number of levels is ln(t) and a value close to e when the number of levels is an integer.

I don't recall reading this result elsewhere, but I am happy to update this post with a link to such a result. I was encouraged to answer this after a discussion with the RocksDB team and thank Siying Dong for stating #2 above while leaving the math to me. I assume the original LSM paper didn't address this problem because that system used a fixed number of levels.

One result from the original LSM paper and updated by me is that write-amp is minimized when the per-level growth factor is constant. Sometimes I use fanout or per-level fanout rather than per-level growth factor. In RocksDB the option name is max_bytes_for_level_multiplier. Yes, this can be confusing. The default fanout in RocksDB is 10.


I solve this for pure-leveled compaction which differs from what RocksDB calls leveled. In pure-leveled all levels used leveled compaction. In RocksDB leveled the first level, L0, uses tiered and the other levels used leveled. I started to explain this here where I claim that RocksDB leveled is really tiered+leveled. But I am not asking for them to change the name.

  • LSM tree uses pure-leveled compaction and compaction from memtable flushes into the first level of the LSM tree uses leveled compaction
  • total fanout is T and is size(Lmax) / size(memtable) where Lmax is the max level of the LSM tree
  • workload is update-only so the number of keys in the database is fixed
  • workload has no write skew and all keys are equally likely to be updated
  • per-level write-amp == per-level growth factor. In practice and in theory the per-level write-amp tends to be less than the per-level growth factor.
  • total write-amp is the sum of per-level write-amp. I ignore write-amp from the WAL. 

Specify function for write-amp and determine critical points

# wa is the total write-amp
# n is the number of levels
# per-level fanout is the nth root of the total fanout

# per-level fanout = per-level write-amp
# therefore wa = number of levels * per-level fanout

wa = n * t^(1/n)

# given the function for write-amp as wa = a * b
# ... then below is a' * b + a * b'
a = n, b = t^(1/n)

wa' = t^(1/n) + n * ln(t) * t^(1/n) * (-1) * (1/n^2)

# which simplifies to

wa' = t^(1/n) - (1/n) * ln(t) * t^(1/n)

# critical point for this occurs when wa' = 0
t^(1/n) - (1/n) * ln(t) * t^(1/n) = 0

t^(1/n) = (1/n) * ln(t) * t^(1/n)
1 = (1/n) * ln(t)

n = ln(t)

When t = 1024 then n = ln(1024) ~= 6.93. In this case write-amp is minimized when 7 levels are used although 6 isn't a bad choice.

Assuming the cost function is convex (see below) the critical point is the minimum for write-amp. However, n must be an integer so the number of levels that minimizes write-amp is one of: ceil(ln(t)) or floor(ln(t)).

The graph for wa when t=1024 can be viewed thanks to Desmos. The function looks convex and I show below that it is.

Determine whether critical point is a min or max

The critical point found above is a minimum for wa if wa is convex so we must show that the second derivative is positive.

wa = n * t ^ (1/n)
wa' = t^(1/n) - (1/n) * ln(t) * t^(1/n)
wa' = t^(1/n) * (1 - (1/n) * ln(t))

# assuming wa' is a * b then wa'' is a' * b + a * b' 

a  = t^(1/n)
a' = ln(t) * t^(1/n) * -1 * (1/n^2)
a' = - ln(t) * t^(1/n) * (1/n^2)

b  = 1 - (1/n) * ln(t)
b' = (1/n^2) * ln(t)

# a' * b 
- ln(t) * t^(1/n) * (1/n^2)         --> called x below
+ ln(t) * ln(t) * (1/n^3) * t^(1/n) --> called y below

# b' * a
t^(1/n) * (1/n^2) * ln(t)           --> called z below

# therefore wa'' = x + y + z
# note that x, y and z all contain: t^(1/n), 1/n and ln(t)
wa'' = t^(1/n) * (1/n) * ln(t) * (-(1/n) + (ln(t) * 1/n^2) + (1/n))
wa'' = t^(1/n) * (1/n) * ln(t) * ( ln(t) * 1/n^2 )'
wa'' = t^(1/n) * 1/n^3 * ln(t)^2

Therefore wa'' is positive, wa is convex and the critical point is a minimum value for wa

Solve for per-level fanout

The next step is to determine the value of the per-level fanout when write-amp is minimized. If the number of levels doesn't have to be an integer then this occurs when ln(t) levels are used and below I show that the per-level fanout is e in that case. When the number of levels is limited to an integer then the per-level fanout that minimizes write-amp is a value that is close to e.

# total write-amp is number of levels * per-level fanout
wa = n * t^(1/n)

# The per-level fanout is t^(1/n) and wa is minimized when n = ln(t)

# Therefore we show that t^(1/n) = e when n = ln(t)
Assume t^(1 / ln(t)) = e
ln (t^(1 / ln(t))) = ln e
(1 / ln(t)) * ln(t) = 1

When the t=1024 then ln(t) ~= 6.93. With 7 levels the per-level fanout is t^(1/7) ~= 2.69 while e ~= 2.72.

Saturday, December 1, 2018

Pixelbook review

This has nothing to do with databases. This is a review of a Pixelbook (Chromebook laptop) that I got on sale last month. This one has a core i5, 8gb RAM and 128gb storage. It runs Linux too but I haven't done much with that. I expected a lot from this given that my 2013 Nexus 7 tablet is still awesome. I have been mostly happy with the laptop but if you care about keyboards and don't like the new Macs thanks to the butterfly keyboard then this might not be the laptop for you. My 3 complaints:

  1. keyboard is hard to read. It is grey on grey and too hard to read when there is light on my back even with the backlight (backlit?) turned all the way up. I don't get it -- grey on grey. So this is a great device for using in a dark room or for improving your touch typing skills.
  2. touchpad control is too coarse grained so it is either too fast or too slow. The settings has 5 values via a slider (1=slowest, 5=fastest). I have been using it at 3 which is a bit too fast for me while 2 is a bit too slow. I might go back to 2 but that means picking up my finger more frequently when moving a pointer across the screen.
  3. no iMessage - my family uses Apple devices and I can't run that here as I can on a Mac laptop