## Thursday, January 10, 2019

### LSM math: fixing mistakes in my last post

My last post explained the number of levels in an LSM that minimizes write amplification using 3 different estimates for the per-level write-amp. Assuming the per-level growth factor is w then the 3 estimates were approximately w, w+1 and w-1 and named LWA-1, LWA-2 and LWA-3 in the post.

I realized there was a mistake in that post for the analysis of LWA-3. The problem is that the per-level write-amp must be >= 1 (and really should be > 1) but the value of w-1 is <= 1 when the per-level growth factor is <= 2. By allowing the per-level write-amp to be < 1 it easy to incorrectly show that a huge number of levels reduces write-amp as I do for curve #3 in this graph. While I don't claim that (w-1) or (w-1)/2 can't be a useful estimate for per-level write-amp in some cases, it must be used with care.

Explaining LWA-3

The next challenge is to explain how LWA-3 is derived. That comes from equation 12 on page 9 of the Dostoevsky paper. Start with the (T-1)/(K+1) term and with K=1 then this is (T-1)/2. T in the paper is the per-level growth factor so this is the same as (w-1)/2. The paper mentions that this is derived using an arithmetic series but does not show the work. I show my work but was not able to reproduce that result.

Assume that the per-level growth factor is w, all-to-all compaction is used and the LSM tree has at least 3 levels. When full L1 has size 1, L2 has size w and L3 has size w*w. There are four derivations below - v1, v2, v3, v4. The results are either w/2 or (w+1)/2 which doesn't match (w-1)/2 from the paper. Fortunately, my previous post shows how to minimize total write-amp assuming the per-level write-amp is w/2 or (w+1)/2. I will contact the author to figure out what I am missing.

The analysis below is for merges from L1 to L2, but it holds for merges from Ln to Ln+1. I think that v1 and v2 are correct and their estimate for per-level write-amp is (w+1)/2. As explained below I don't think that v3 or v4 are correct, their estimate for per-level write-amp is w/2.

I have yet to explain how to get (w-1)/2.

v1

Assume that merges are triggered from Ln to Ln+1 when a level is full -- L1 has size 1, L2 has size w, L3 has size w*w. A level is empty immediately after it is merged into the next level. So L2 gets full, then is merged into L3 and becomes empty, then slowly gets larger as L1 is merged into it w times. The per-level write-amp from this is (w+1)/2.

* merges into L2 write output of size 1, 2, ..., w
* then L2 is full
* sum of that sequence -> w*(w+1)/2
* average value is sum/w -> (w+1)/2

1) Moving data of size 1 from L1 to L2 writes (w+1)/2 on average
2) Therefore per-level write-amp for L1 -> L2 is (w+1)/2

Note that per-level write-amp is (avg merge output to Ln / size of Ln-1)
* avg merge output to L2 is (w+1)/2
* size of Ln-1 is 1

v2

Assume that merges are triggered from Ln to Ln+1 when a level is almost full -- L1 has size 1 * (w-1)/w, L2 has size w * (w-1)/w, L3 has size (w*w) * (w-1)/w. The trigger conditions can be reduced to L1 has size (w-1)/w, L2 has size (w-1) and L3 has size w*(w-1).

This assumes that w merges are done from L1 to L2 for L2 to go from empty to full. Each merge adds data of size (w-1)/w because L1:L2 merge is triggered when L1 has that much data. Thus L2 has size (w-1) after w merges into it at which point L2:L3 merge can be done. The per-level write-amp from this is the same as it was for v1.

* merges into L2 write output of size (w-1)/w * [1, 2, ..., w]
* then L2 is full
* sum of that sequence -> (w-1)/w * w*(w+1)/2 = (w-1)(w+1)/2
* average value is sum/w -> (w-1)(w+1)/(2*w)

As from v1, per-level write-amp is (avg merge output to Ln / size of Ln-1)

* avg merge output to L2 = (w-1)(w+1)/(2*w)
* size of L1 = (w-1)/w

simplify to: (w+1)/2

v3

Merges are triggered the same as for v1 but I assume that only w-1 merges are done from Ln to Ln+1 rather than w. Ln+1 won't be full at the end of that, for example L2 would have size w-1 rather than the expected size w. But I was curious about the math. The per-level write-amp is w/2.

* merges into L2 write output of size 1, 2, ..., w-1
* sum of that sequence -> (w-1)*w/2
* average value is sum/(w-1) -> w/2

1) Moving data of size 1 from L1 to L2 writes w/2 on average
2) Therefore per-level write-amp for L1 -> L2 is w/2

v4

Merges are triggered the same as for v2. But as with v3, only w-1 merges are done into a level. Again I don't think this is correct because a level won't have enough data to trigger compaction at that point. The per-level write-amp here is the same as for v3.

* merges into L2 write output of size (w-1)/w * [1, 2, ..., w-1]
* sum of that sequence -> (w-1)/w * (w-1)*w/2 = (w-1)(w-1)/2
* average value is sum/(w-1) -> (w-1)/2

As from v1, per-level write-amp is (avg merge output to Ln / size of Ln-1)

* avg merge output to L2 = (w-1)/2
* size of L1 = (w-1)/w