## Thursday, December 13, 2018

### LSM math - how many levels minimizes write amplification?

How do you configure an LSM tree with leveled compaction to minimize write amplification? For a given number of levels write-amp is minimal when the same fanout (growth factor) is used between all levels, but that does not explain the number of levels to use. In this post I answer that question.
1. The number of levels that minimizes write-amp is one of ceil(ln(T)) or floor(ln(T)) where T is the total fanout -- sizeof(database) / sizeof(memtable)
2. When #1 is done then the per-level fanout is e when the number of levels is ln(t) and a value close to e when the number of levels is an integer.
Introduction

I don't recall reading this result elsewhere, but I am happy to update this post with a link to such a result. I was encouraged to answer this after a discussion with the RocksDB team and thank Siying Dong for stating #2 above while leaving the math to me. I assume the original LSM paper didn't address this problem because that system used a fixed number of levels.

One result from the original LSM paper and updated by me is that write-amp is minimized when the per-level growth factor is constant. Sometimes I use fanout or per-level fanout rather than per-level growth factor. In RocksDB the option name is max_bytes_for_level_multiplier. Yes, this can be confusing. The default fanout in RocksDB is 10.

Math

I solve this for pure-leveled compaction which differs from what RocksDB calls leveled. In pure-leveled all levels used leveled compaction. In RocksDB leveled the first level, L0, uses tiered and the other levels used leveled. I started to explain this here where I claim that RocksDB leveled is really tiered+leveled. But I am not asking for them to change the name.

Assumptions:
• LSM tree uses pure-leveled compaction and compaction from memtable flushes into the first level of the LSM tree uses leveled compaction
• total fanout is T and is size(Lmax) / size(memtable) where Lmax is the max level of the LSM tree
• workload is update-only so the number of keys in the database is fixed
• workload has no write skew and all keys are equally likely to be updated
• per-level write-amp == per-level growth factor. In practice and in theory the per-level write-amp tends to be less than the per-level growth factor.
• total write-amp is the sum of per-level write-amp. I ignore write-amp from the WAL.

Specify function for write-amp and determine critical points

# wa is the total write-amp
# n is the number of levels
# per-level fanout is the nth root of the total fanout

# per-level fanout = per-level write-amp
# therefore wa = number of levels * per-level fanout

wa = n * t^(1/n)

# given the function for write-amp as wa = a * b
# ... then below is a' * b + a * b'
a = n, b = t^(1/n)

wa' = t^(1/n) + n * ln(t) * t^(1/n) * (-1) * (1/n^2)

# which simplifies to

wa' = t^(1/n) - (1/n) * ln(t) * t^(1/n)

# critical point for this occurs when wa' = 0
t^(1/n) - (1/n) * ln(t) * t^(1/n) = 0

t^(1/n) = (1/n) * ln(t) * t^(1/n)
1 = (1/n) * ln(t)

n = ln(t)

When t = 1024 then n = ln(1024) ~= 6.93. In this case write-amp is minimized when 7 levels are used although 6 isn't a bad choice.

Assuming the cost function is convex (see below) the critical point is the minimum for write-amp. However, n must be an integer so the number of levels that minimizes write-amp is one of: ceil(ln(t)) or floor(ln(t)).

The graph for wa when t=1024 can be viewed thanks to Desmos. The function looks convex and I show below that it is.

Determine whether critical point is a min or max

The critical point found above is a minimum for wa if wa is convex so we must show that the second derivative is positive.

wa = n * t ^ (1/n)
wa' = t^(1/n) - (1/n) * ln(t) * t^(1/n)
wa' = t^(1/n) * (1 - (1/n) * ln(t))

# assuming wa' is a * b then wa'' is a' * b + a * b'

a  = t^(1/n)
a' = ln(t) * t^(1/n) * -1 * (1/n^2)
a' = - ln(t) * t^(1/n) * (1/n^2)

b  = 1 - (1/n) * ln(t)
b' = (1/n^2) * ln(t)

# a' * b
- ln(t) * t^(1/n) * (1/n^2)         --> called x below
+ ln(t) * ln(t) * (1/n^3) * t^(1/n) --> called y below

# b' * a
t^(1/n) * (1/n^2) * ln(t)           --> called z below

# therefore wa'' = x + y + z
# note that x, y and z all contain: t^(1/n), 1/n and ln(t)
wa'' = t^(1/n) * (1/n) * ln(t) * (-(1/n) + (ln(t) * 1/n^2) + (1/n))
wa'' = t^(1/n) * (1/n) * ln(t) * ( ln(t) * 1/n^2 )'
wa'' = t^(1/n) * 1/n^3 * ln(t)^2

Therefore wa'' is positive, wa is convex and the critical point is a minimum value for wa

Solve for per-level fanout

The next step is to determine the value of the per-level fanout when write-amp is minimized. If the number of levels doesn't have to be an integer then this occurs when ln(t) levels are used and below I show that the per-level fanout is e in that case. When the number of levels is limited to an integer then the per-level fanout that minimizes write-amp is a value that is close to e.

# total write-amp is number of levels * per-level fanout
wa = n * t^(1/n)

# The per-level fanout is t^(1/n) and wa is minimized when n = ln(t)

# Therefore we show that t^(1/n) = e when n = ln(t)
Assume t^(1 / ln(t)) = e
ln (t^(1 / ln(t))) = ln e
(1 / ln(t)) * ln(t) = 1
1=1

When the t=1024 then ln(t) ~= 6.93. With 7 levels the per-level fanout is t^(1/7) ~= 2.69 while e ~= 2.72.

1. This seems to be a very interesting problem, and there were several posts to address it. However, the problem definition seems to be not clear to me and I would appreciate further clarifications.

If the goal is to compute the number of levels to minimize the total write amp (without any other constraints), then the number of levels should be set to infinite. That is, there should not be any merge at all. In this case, write amp is definitely minimized (actually 0 ignoring flush cost). However, the number of levels is directly related to space amplification/query performance, and in my mind it should be treated as a constraint instead of an optimization goal.

The original LSM-tree solves the problem that how the size ratio should be set given a fixed number of components (i.e., levels, which are treated as constraint) to minimize the write amplification. When designing an LSM-tree, the number of levels (in my mind) should be set based on the requirement of query performance/space amplification, and based on that we can optimize the size ratio to minimize the total write cost by letting the largest level slightly larger than the database size.

2. I appreciate the feedback.

The goal is to explain the number of levels that minimizes write-amp for an LSM that implements pure-leveled. I am not suggesting that doing this is the best configuration -- that depends on many things.

I distinguish between pure-leveled and RocksDB-leveled. RocksDB-leveled has an L0 with many sorted runs, while in pure-leveled the L0 is not allowed as each level only has one sorted run. So while I agree that writing new sorted runs to the L0 and never compacting minimizes write-amp, that can only happen with RocksDB-leveled but not with pure-leveled.
-- see http://smalldatum.blogspot.com/2018/12/lsm-math-how-many-levels-minimizes.html

I explained that using the same per-level growth factor for all levels is required. Explaining that isn't a new result, but I extended it.
-- see http://smalldatum.blogspot.com/2018/12/lsm-math-how-many-levels-minimizes.html