tag:blogger.com,1999:blog-91495239278647510872019-01-16T03:44:20.061-08:00Small DatumMark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.comBlogger267125tag:blogger.com,1999:blog-9149523927864751087.post-13133214627514312522019-01-15T20:00:00.003-08:002019-01-15T20:01:16.348-08:00Geek code for LSM trees<a href="https://docs.google.com/presentation/d/e/2PACX-1vQ9AStYSOmJzrcmXA-nkQk0kwoJPpgZAAHcxfrL0TApxbBLVdu0Cszit-EQ11yY1Kpbri5vhJFAssLh/pub?start=false&loop=false&delayms=3000">This is a link to slides</a> from my 5-minute talk at the <a href="http://cidrdb.org/cidr2019/program.html">CIDR 2019</a> Gong Show. The slides are a brief overview of the geek code for LSM trees. If you click on the settings icon in the slide show you can view the speaker notes which have links to blog posts that have more details. I also pasted the links below. Given time I might add to this post, but most of the content is in my past blog posts. Regardless I think there is more to be discovered about performant, efficient and manageable LSM trees.<br /><br />The key points are there are more compaction algorithms to discover, we need to make it easier to describe them and compaction is a property of a level, not of the LSM tree.<br /><br />Links to posts with more details:<br /><ul><li><a href="http://smalldatum.blogspot.com/2018/10/describing-tiered-and-leveled-compaction.html">Describing tiered and leveled compaction</a></li><li><a href="http://smalldatum.blogspot.com/2018/12/lsm-math-how-many-levels-minimizes.html">Number of levels that minimized write amplification</a></li><li><a href="http://smalldatum.blogspot.com/2018/10/combining-tiered-and-leveled-compaction.html">Combining tiered and leveled compaction</a></li><li><a href="http://smalldatum.blogspot.com/2018/07/tiered-or-leveled-compaction-why-not.html">Tiered vs leveled, why not both</a></li><li><a href="http://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html">Name that compaction 
algorithm</a></li><li><a href="http://smalldatum.blogspot.com/2018/06/the-original-lsm-paper.html">Original LSM paper that got this started</a></li><li><a href="http://smalldatum.blogspot.com/2018/09/review-of-slimdb-from-vldb-2018.html">Review of SlimDB</a> with references to the first tiered compaction, Stepped Merge</li></ul><div><br /></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-71052640453851549572019-01-10T10:30:00.000-08:002019-01-10T10:30:27.695-08:00LSM math: fixing mistakes in my last post<a href="http://smalldatum.blogspot.com/2019/01/lsm-math-revisiting-number-of-levels.html">My last post</a> explained the number of levels in an LSM that minimizes write amplification using 3 different estimates for the per-level write-amp. Assuming the per-level growth factor is w, the 3 estimates were approximately w, w+1 and w-1, named LWA-1, LWA-2 and LWA-3 in the post.<br /><br />I realized there was a mistake in that post in the analysis of LWA-3. The problem is that the per-level write-amp must be >= 1 (and really should be > 1) but the value of w-1 is <= 1 when the per-level growth factor is <= 2. By allowing the per-level write-amp to be < 1 it is easy to incorrectly show that a huge number of levels reduces write-amp, as I do for curve #3 <a href="https://www.desmos.com/calculator/ap4fa0okcq">in this graph</a>. While I don't claim that (w-1) or (w-1)/2 can't be a useful estimate for per-level write-amp in some cases, it must be used with care.<br /><br /><b>Explaining LWA-3</b><br /><br />The next challenge is to explain how LWA-3 is derived. That comes from equation 12 on page 9 of the <a href="https://stratos.seas.harvard.edu/files/stratos/files/dostoevskykv.pdf">Dostoevsky paper</a>. Start with the (T-1)/(K+1) term; with K=1 this is (T-1)/2. T in the paper is the per-level growth factor so this is the same as (w-1)/2. 
The paper mentions that this is derived using an arithmetic series but does not show the work. I show my work but was not able to reproduce that result.<br /><br />Assume that the per-level growth factor is w, <a href="http://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html">all-to-all compaction</a> is used and the LSM tree has at least 3 levels. When full L1 has size 1, L2 has size w and L3 has size w*w. There are four derivations below - v1, v2, v3, v4. The results are either w/2 or (w+1)/2 which doesn't match (w-1)/2 from the paper. Fortunately, my previous post shows how to minimize total write-amp assuming the per-level write-amp is w/2 or (w+1)/2. I will contact the author to figure out what I am missing.<br /><br />The analysis below is for merges from L1 to L2, but it holds for merges from Ln to Ln+1. I think that v1 and v2 are correct and their estimate for per-level write-amp is (w+1)/2. As explained below I don't think that v3 or v4 are correct, their estimate for per-level write-amp is w/2.<br /><br />I have yet to explain how to get (w-1)/2.<br /><br /><b>v1</b><br /><br />Assume that merges are triggered from Ln to Ln+1 when a level is full -- L1 has size 1, L2 has size w, L3 has size w*w. A level is empty immediately after it is merged into the next level. So L2 gets full, then is merged into L3 and becomes empty, then slowly gets larger as L1 is merged into it w times. 
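The accounting described above can be checked with a few lines of Python. This is my sketch, not from the post; the helper name is made up and w=10 is just an example value:

```python
# Sketch of the v1 accounting: L2 becomes full after w merges from L1,
# whose outputs have size 1, 2, ..., w. The per-level write-amp is the
# average merge output divided by the size of L1 (which is 1 here).
def per_level_write_amp_v1(w):
    outputs = list(range(1, w + 1))  # merge outputs into L2
    avg_output = sum(outputs) / len(outputs)
    return avg_output / 1  # size of L1 is 1

print(per_level_write_amp_v1(10))  # (w+1)/2 = 5.5
```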
The per-level write-amp from this is (w+1)/2.<br /><br /><span style="font-family: Courier New, Courier, monospace;">* merges into L2 write output of size 1, 2, ..., w<br />* then L2 is full<br />* sum of that sequence -> w*(w+1)/2<br />* average value is sum/w -> (w+1)/2<br /><br />1) Moving data of size 1 from L1 to L2 writes (w+1)/2 on average<br />2) Therefore per-level write-amp for L1 -> L2 is (w+1)/2<br /><br />Note that per-level write-amp is (avg merge output to Ln / size of Ln-1)<br />* avg merge output to L2 is (w+1)/2<br />* size of Ln-1 is 1</span><br /><br /><b>v2</b><br /><br />Assume that merges are triggered from Ln to Ln+1 when a level is almost full -- L1 has size 1 * (w-1)/w, L2 has size w * (w-1)/w, L3 has size (w*w) * (w-1)/w. The trigger conditions can be reduced to L1 has size (w-1)/w, L2 has size (w-1) and L3 has size w*(w-1).<br /><br />This assumes that w merges are done from L1 to L2 for L2 to go from empty to full. Each merge adds data of size (w-1)/w because L1:L2 merge is triggered when L1 has that much data. Thus L2 has size (w-1) after w merges into it at which point L2:L3 merge can be done. 
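The v2 accounting can be checked the same way. The sketch below is mine (hypothetical name); it uses exact fractions because the merge outputs are multiples of (w-1)/w:

```python
from fractions import Fraction

# Sketch of the v2 accounting: a merge is triggered when L1 holds (w-1)/w,
# so the w merge outputs into L2 have size (w-1)/w * i for i = 1..w.
def per_level_write_amp_v2(w):
    trigger = Fraction(w - 1, w)  # size of L1 when a merge is triggered
    outputs = [trigger * i for i in range(1, w + 1)]
    avg_output = sum(outputs) / w
    return avg_output / trigger  # per-level write-amp = avg output / size(L1)

print(per_level_write_amp_v2(10))  # (w+1)/2 = 11/2
```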
The per-level write-amp from this is the same as it was for v1.<br /><br /><span style="font-family: Courier New, Courier, monospace;">* merges into L2 write output of size (w-1)/w * [1, 2, ..., w]<br />* then L2 is full<br />* sum of that sequence -> (w-1)/w * w*(w+1)/2 = (w-1)(w+1)/2<br />* average value is sum/w -> (w-1)(w+1)/(2*w)<br /><br />As from v1, per-level write-amp is (avg merge output to Ln / size of Ln-1)</span><br /><span style="font-family: Courier New, Courier, monospace;">* avg merge output to L2 = (w-1)(w+1)/(2*w)<br />* size of L1 = (w-1)/w</span><br /><span style="font-family: Courier New, Courier, monospace;"><br /></span><span style="font-family: Courier New, Courier, monospace;">start with: ( (w-1)(w+1)/(2*w) ) / ( (w-1)/w )<br />simplify to: (w+1)/2</span><br /><br /><b>v3</b><br /><br />Merges are triggered the same as for v1 but I assume that only w-1 merges are done from Ln to Ln+1 rather than w. Ln+1 won't be full at the end of that, for example L2 would have size w-1 rather than the expected size w. But I was curious about the math. 
The per-level write-amp is w/2.<br /><br /><span style="font-family: "Courier New", Courier, monospace;">* merges into L2 write output of size 1, 2, ..., w-1</span><br style="font-family: "Courier New", Courier, monospace;" /><span style="font-family: "Courier New", Courier, monospace;">* sum of that sequence -> (w-1)*w/2</span><br style="font-family: "Courier New", Courier, monospace;" /><span style="font-family: "Courier New", Courier, monospace;">* average value is sum/(w-1) -> w/2</span><br style="font-family: "Courier New", Courier, monospace;" /><br style="font-family: "Courier New", Courier, monospace;" /><span style="font-family: "Courier New", Courier, monospace;">1) Moving data of size 1 from L1 to L2 writes w/2 on average</span><br style="font-family: "Courier New", Courier, monospace;" /><span style="font-family: "Courier New", Courier, monospace;">2) </span><span style="font-family: "Courier New", Courier, monospace;">Therefore per-level write-amp for L1 -> L2 is w/2</span><br /><br /><b>v4</b><br /><br />Merges are triggered the same as for v2. But as with v3, only w-1 merges are done into a level. Again I don't think this is correct because a level won't have enough data to trigger compaction at that point. 
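Both v3 and v4 can be checked with one generalized helper. This is my sketch (the name and parameters are assumptions): merges is how many merges are done into L2 before it is considered full and trigger is the size of L1 when a merge starts.

```python
from fractions import Fraction

# Sketch: outputs into L2 have size trigger * i for i = 1..merges.
# v1 and v2 correspond to merges=w with trigger 1 and (w-1)/w;
# v3 and v4 use merges=w-1 with the same two trigger sizes.
def per_level_write_amp(merges, trigger):
    outputs = [trigger * i for i in range(1, merges + 1)]
    return (sum(outputs) / merges) / trigger  # avg output / size(L1)

w = 10
v3 = per_level_write_amp(w - 1, Fraction(1))            # w/2
v4 = per_level_write_amp(w - 1, Fraction(w - 1, w))     # w/2
print(v3, v4)  # 5 5
```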
The per-level write-amp here is the same as for v3.<br /><br /><span style="font-family: Courier New, Courier, monospace;">* merges into L2 write output of size (w-1)/w * [1, 2, ..., w-1]<br />* sum of that sequence -> (w-1)/w * (w-1)*w/2 = (w-1)(w-1)/2<br />* average value is sum/(w-1) -> (w-1)/2<br /><br />As from v1, per-level write-amp is (avg merge output to Ln / size of Ln-1)</span><br /><span style="font-family: Courier New, Courier, monospace;">* avg merge output to L2 = (w-1)/2<br />* size of L1 = (w-1)/w</span><br /><span style="font-family: Courier New, Courier, monospace;"><br /></span><span style="font-family: Courier New, Courier, monospace;">start with: ( (w-1)/2 ) / ( (w-1)/w )<br />simplify to: w/2</span><br /><br /><br /><br />Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-1759749492812850582019-01-09T15:58:00.000-08:002019-01-10T10:37:58.560-08:00LSM math: revisiting the number of levels that minimizes write amplificationI <a href="http://smalldatum.blogspot.com/2018/12/lsm-math-how-many-levels-minimizes.html">previously used math</a> to explain the number of levels that minimizes write amplification for an LSM tree with leveled compaction. My answer was one of ceil(ln(T)) or floor(ln(T)) assuming the LSM tree has total fanout = T where T is size(database) / size(memtable).<br /><br />Then I heard from a coworker that the real answer is less than floor(ln(T)). Then I heard from Niv Dayan, first author of <a href="https://stratos.seas.harvard.edu/publications/dostoevsky-better-space-time-trade-offs-lsm-tree-based-key-value-stores">the Dostoevsky paper</a>, that the real answer is larger than ceil(ln(T)) and the optimal per-level growth factor is ~2 rather than ~e.<br /><br />All of our answers are correct. We have different answers because we use different functions to estimate the per-level write-amp. 
The graph of the functions for total write-amp using the different cost functions <a href="https://www.desmos.com/calculator/ap4fa0okcq">is here</a> and you can see that the knee in the curve occurs at a different x value for two of the curves and the third curve doesn't appear to have a minimum.<br /><br />While working on this I learned to love the <a href="https://en.m.wikipedia.org/wiki/Lambert_W_function">Lambert W function</a>. But I wonder whether I made the math below for LWA-2 harder than necessary. I am happy to be corrected. I appreciate the excellent advice on Quora: <a href="https://www.quora.com/How-can-I-learn-to-use-the-Lambert-W-function/answer/Awnon-Bhowmik">here</a>, <a href="https://www.quora.com/What-is-the-Lambert-W-function">here</a> and <a href="http://o/">here</a>. The online graphing calculator <a href="https://www.desmos.com/">Desmos</a> is another great resource.<br /><br /><b>Math</b><br /><br />I use differentiable functions to express the total write-amp as a function of the number of levels, then determine the value (number of levels) at which the first derivative is zero as that might be the global minimum. Constants, variables and functions below include:<br /><ul><li>T - total fanout, = size(database) / size(memtable)</li><li>n - number of levels in the LSM tree</li><li>LWA, LWA-x - function for the per-level write-amp</li><li>TWA, TWA-x - function for the total write-amp, = n * LWA</li><li>w - per-level growth factor, = T^(1/n) for all levels <a href="http://smalldatum.blogspot.com/2018/10/minimizing-write-amplification-in-lsm_3.html">to minimize write-amp</a></li></ul>The function for total write-amp has the form: TWA = n * LWA where n is the number of levels and LWA is the per-level write-amp. LWA is a function of T and n. The goal is determine the value of n at which TWA is minimized. While n must be an integer the math here doesn't enforce that and the result should be rounded up or down to an integer. 
T is a constant as I assume a given value for total fanout. Here I use T=1024.<br /><br />I wrote above that the 3 different answers came from using 3 different estimates for the per-level write-amp and I label these LWA-1, LWA-2 and LWA-3. When w is the per-level growth factor then the per-level write-amp functions are:<br /><ul><li>LWA-1 = w -- I used this to find that the best n = ceil(ln(T)) or floor(ln(T))</li><li>LWA-2 = w + 1 -- with this the best n is less than that found with LWA-1</li><li>LWA-3 = (w - 1) / 2 -- with this the best n is greater than that found with LWA-1</li></ul><div>I can also state the per-level write-amp functions directly with T and n. I didn't above to make it easier to see the differences.<br /><ul><li>LWA-1 = T^(1/n)</li><li>LWA-2 = T^(1/n) + 1</li><li>LWA-3 = (T^(1/n) - 1) / 2</li></ul></div><b>Explaining LWA</b><br /><br />First I explain LWA-1 and LWA-2. Compacting 1 SST from Ln to Ln+1 requires merging 1 SST from Ln with ~w SSTs from Ln+1 where w=10 by default with RocksDB. The output will be between w and w+1 SSTs. If the output is closer to w then LWA-1 is correct. If the output is closer to w+1 then LWA-2 is correct. <a href="https://hyeontaek.com/papers/msls-fast2016.pdf">This paper explains</a> why the per level write-amp is likely to be less than w. Were I to use f*w where f < 1 for LWA-1 then the <a href="http://smalldatum.blogspot.com/2018/12/lsm-math-how-many-levels-minimizes.html">math still holds</a>. Maybe that is a future blog post.<br /><br />LWA-3 assumes that all-to-all compaction is used rather than some-to-some. I <a href="http://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html">explain the difference here</a>. RocksDB/LevelDB leveled uses some-to-some but all-to-all is interesting. With all-to-all when compaction from Ln to Ln+1 finishes then Ln is empty and slowly gets full after each merge into it. 
Assume the per-level growth factor is w and Ln-1, Ln and Ln+1 are full at sizes 1, w and w*w. Then Ln becomes full after w merges from Ln-1 and those write output of size 1, 2, ..., w-1, w. The <a href="https://en.wikipedia.org/wiki/1_%2B_2_%2B_3_%2B_4_%2B_%E2%8B%AF#Partial_sums">sum of the first w</a> integers is w(w+1)/2. Divide this by w to get the average -- (w+1)/2. However, LWA-3 above is (w-1)/2, not (w+1)/2. I will explain that in another blog post. Note that in LWA-3 the numerator, w-1, is more interesting than the denominator, 2. Dividing by any constant doesn't change where the minimum occurs assuming there is a minimum and that is visible on <a href="https://www.desmos.com/calculator/vy9f2k7oyi">this graph</a> that shows the impact of dividing by 2 on the total write-amp.<br /><br />Read on to understand the impact of using w-1, w or w+1 as the function for per-level write-amp. The difference might be more significant than you expect. It surprised me.<br /><br /><b>Minimizing TWA</b><br /><br /><a href="https://www.desmos.com/calculator/ap4fa0okcq">This graph</a> shows the total write-amp for LWA-1, LWA-2 and LWA-3. I call the total write-amp TWA-1, TWA-2 and TWA-3. Two of the curves, for TWA-1 and TWA-2, appear to have a minimum. One occurs for x between 4 and 6, the other for x between 6 and 8. The third curve, for TWA-3, doesn't appear to have a minimum and is decreasing as x (number of levels) grows.<br /><br /><a href="https://www.desmos.com/calculator/kt7zwwe6lj">The next graph</a> uses the first derivative for the total write-amp functions, so it is for TWA-1', TWA-2' and TWA-3'. A global minimum for TWA-x can occur when TWA-x' = 0 and from the graph TWA-1'=0 when x=6.931 and TWA-2'=0 when x=5.422 which matches the estimate from the previous paragraph. 
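The x values read off the graphs can be confirmed numerically. The sketch below is mine (the grid search and the fixed-point iteration are assumptions, not from the post); it uses T=1024 as above:

```python
import math

T = 1024.0

# Total write-amp for each per-level estimate: TWA-x(n) = n * LWA-x(w)
# where w = T^(1/n) is the per-level growth factor.
def twa1(n): return n * (T ** (1 / n))
def twa2(n): return n * (T ** (1 / n) + 1)
def twa3(n): return n * (T ** (1 / n) - 1) / 2

grid = [i / 1000 for i in range(2000, 30001)]  # n in [2, 30]
n1 = min(grid, key=twa1)  # ~6.931, which is ln(1024)
n2 = min(grid, key=twa2)  # ~5.422

# TWA-3 just keeps decreasing as the number of levels grows.
assert twa3(30) < twa3(10) < twa3(5)

# Check the Lambert W value: x = W(1/e) solves x * e^x = 1/e, and the
# fixed-point iteration x <- e^-(x+1) converges to it.
x = 0.5
for _ in range(100):
    x = math.exp(-(x + 1))
print(round(n1, 2), round(n2, 2))          # minima of TWA-1 and TWA-2
print(round(math.log(T) / (x + 1), 4))     # ~5.4217, agrees with TWA-2
```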
From the graph it appears that TWA-3' approaches zero as x gets large but is never equal to zero.<br /><br />The next step is to use math to confirm what is visible on the graphs.<br /><br /><b>Min write-amp for LWA-1</b><br /><br />See my <a href="http://smalldatum.blogspot.com/2018/12/lsm-math-how-many-levels-minimizes.html">previous post</a> where I show that n = ln(T) minimizes total write-amp if n isn't limited to an integer and then the per-level growth factor is e. Since the number of levels must be an integer then one of ceil(ln(T)) or floor(ln(T)) minimized total write-amp.<br /><br /><b>Min write-amp for LWA-2</b><br /><div><br /></div><div>I can reuse some of the math from my <a href="http://smalldatum.blogspot.com/2018/12/lsm-math-how-many-levels-minimizes.html">previous post</a>. But this one is harder to solve.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;"># wa is the total write-amp<br /># n is the number of levels</span><br /><span style="font-family: "courier new" , "courier" , monospace;"># t is the total fanout</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa = n * ( t^(1/n) + 1 )</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa = n*t^(1/n) + n<br /><br /># the difference between this and the previous post is '+1'</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa' = t^(1/n) + n * ln(t) * t^(1/n) * (-1) * (1/n^2) + 1</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa' = t^(1/n) - (1/n) * ln(t) * t^(1/n) + 1</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /><span style="font-family: "times" , "times new roman" , serif;">At this point the difference between this and the previous post is '+1'. 
But wait, this starts to get interesting.</span></span><br /><span style="font-family: "courier new" , "courier" , monospace;"># critical point for this occurs when wa' = 0<br />t^(1/n) - (1/n) * ln(t) * t^(1/n) + 1 = 0<br /><br /># multiply by t^(-1/n)<br />1 - (1/n) * ln(t) + t^(-1/n) = 0<br /><br /># move some terms to RHS<br />t^(-1/n) = (1/n) ln(t) - 1</span><br /><span style="font-family: "courier new" , "courier" , monospace;"># use ln on LHS and RHS to get rid of '^(1/n)'</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;">ln ( t^(-1/n) ) = ln( (1/n) * ln(t) - 1 )</span><br /><span style="font-family: "courier new" , "courier" , monospace;">(-1/n) ln(t) = ln( (1/n) * ln(t) - 1 )</span><br /><br /><span style="font-family: "times" , "times new roman" , serif;">I got stuck here but eventually made progress.</span><br /><br /><span style="font-family: "courier new" , "courier" , monospace;"># let a = (1/n) ln(t) and rewrite<br />-a = ln(a - 1)<br /><br /># let x=a-1, a=x+1 and rewrite<br />-(x+1) = ln(x)<br /><br /># do e^LHS = e^RHS<br />e^-(x+1) = e^ln(x)<br />e^-x * e^-1 = x<br /><br /># multiply LHS and RHS by e^x<br />e^-1 = e^x * x<br /><br /># e^-1 -> (1/e)<br />(1/e) = e^x * x</span><br /><br /><span style="font-family: "times" , "times new roman" , serif;">At last I can use the <a href="https://en.m.wikipedia.org/wiki/Lambert_W_function">Lambert W function</a>!</span><br /><br /><span style="font-family: "courier new" , "courier" , monospace;"># Given: e^x * x = K, then x = W(K)<br />x = W(e^-1) ~= 0.27846</span></span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"># because a=x+1<br />a ~= 1.27846</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"># a = (1/n) 
ln(t) -> n = (1/a) ln(t), t=1024<br />n = 1/1.27846 * ln(1024)<br /><br /># The value for n that minimizes total write-amp<br /># from the graph I claimed that n=5.422 and this is close<br />n = 5.4217</span><br /><br /></div><div><b>Min write-amp for LWA-3</b></div><div><br /><b>Update-1</b> - I think I made a few mistakes here. So you can stop reading until update-2 arrives.<br /><br /><b>Update-2</b> - <a href="http://smalldatum.blogspot.com/2019/01/lsm-math-fixing-mistakes-in-my-last-post.html">this post</a> explains my mistake and uses math to estimate that per-level write-amp = (w+1)/2 when all-to-all compaction is used. I am still unable to derive (w-1)/2.<br /><br /></div><div>I started to work on this without paying attention to the <a href="https://www.desmos.com/calculator/kt7zwwe6lj">curve for LWA-3'</a>. From the graph it appears to converge to 0 but is always less than 0, so TWA-3 is decreasing as x, the number of levels, gets large. Therefore make the number of levels as large as possible, 2M or 2B, to minimize total write-amp as visible <a href="https://www.desmos.com/calculator/vheojxcczt">in this graph</a>.<br /><br />But more levels in the LSM tree come at a cost -- more read-amp. And the reduction in write-amp is small when the number of levels increases from 20 to 200 to 2000 to 2M. Again, this is visible <a href="https://www.desmos.com/calculator/vheojxcczt">in the graph</a>. Besides, if you really want less write-amp then use tiered compaction rather than leveled with too many levels.<br /><br />The other consideration is the minimal per-level growth factor that should be allowed. Suppose the min per-level growth factor is 2. 
Then that occurs when the number of levels, n, is:<br /><span style="font-family: "courier new" , "courier" , monospace;"><br /># assume total fanout is 1024</span><br /><span style="font-family: "courier new" , "courier" , monospace;">2^n = 1024<br />log2(2^n) = log2(1024)<br />n = log2(1024) = 10</span><br /><br />Alas the total fanout isn't always a power of 2. Given that the number of levels must be an integer, the goal is to use the largest number of levels such that the per-level growth factor >= 2. Therefore when x isn't limited to an integer there is no answer -- just make x as large as possible (1M, 1B, etc) in which case the per-level growth factor converges to 1 but is always greater than 1.<br /><br />The above can be repeated where the constraint is either the max number of levels or a different value for the min per-level growth factor (either <2 or >2). Regardless, if LWA-3 is the cost function then total write-amp is minimized by using as many levels as possible subject to these constraints.<br /><br />Below is some math for LWA-3 and LWA-3'.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;"># wa is the total write-amp<br /># n is the number of levels</span><br /><span style="font-family: "courier new" , "courier" , monospace;"># t is the total fanout<br />wa = n * ( t^(1/n) - 1 ) / 2</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa = (n*t^(1/n) - n ) / 2<br /><br /># the differences between this and the previous post are '- 1' and '/ 2'</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa' = [ t^(1/n) + n * ln(t) * t^(1/n) * (-1) * (1/n^2) - 1 ] / 2</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa' = [ t^(1/n) - (1/n) * ln(t) * t^(1/n) - 1 ] / 2<br /><br /># determine when wa' = 0</span><br /><span style="font-family: "courier new" , "courier" , monospace;">[ t^(1/n) - (1/n) * ln(t) * t^(1/n) - 1 ] / 2 = 0<br /><br /># multiply LHS and RHS by 
2</span><br /><span style="font-family: "courier new" , "courier" , monospace;">t^(1/n) - (1/n) * ln(t) * t^(1/n) - 1 = 0</span><br /><br /><span style="font-family: "courier new" , "courier" , monospace;"># multiply LHS and RHS by t^(-1/n)</span><span style="font-family: "courier new" , "courier" , monospace;"><br />1 - (1/n) * ln(t) - t^(-1/n) = 0<br /><br /># move last term to RHS<br />1 - (1/n) * ln(t) = t^(-1/n)<br /><br /># probably a good idea to stop here<br /># LHS is likely to be <0 so can't use ln(LHS) = ln(RHS)</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-47330692257427692792019-01-07T12:35:00.000-08:002019-01-07T13:15:18.507-08:00Define "better"Welcome to my first rant of 2019, although I have <a href="http://smalldatum.blogspot.com/2015/11/define-better-for-small-data-dbms.html">written about this</a> before. While I enjoy <a href="http://smalldatum.blogspot.com/2014/06/benchmarketing.html">benchmarketing</a> from a distance it is not much fun to be in the middle of it. The RocksDB project has been successful and thus becomes the base case for products and research claiming that something else is better. While I have no doubt that other things can be better I am wary about the definition of <i><b>better</b></i>.<br /><br />There are at least 3 ways to define better when evaluating database performance. The first, faster is better, ignores efficiency; the last two do not. I'd rather not ignore efficiency. 
The marginal return of X more QPS eventually becomes zero while the benefit of using less hardware is usually greater than zero.<br /><ol><li>Optimize for throughput and ignore efficiency (faster is better)</li><li>Get good enough performance and then optimize for efficiency</li><li>Get good enough efficiency and then optimize for throughput</li></ol><div><b>Call to action</b></div><div><br /></div><div>I forgot to include this before publishing. Whether #1, #2 or #3 is followed I hope that more performance results include details on the HW consumed to create that performance. How much memory and disk space were used? What was the CPU utilization? How many bytes were read from and written to storage? How much random IO was used? I try to report both absolute and relative values where relative values are normalized by the transaction rate.</div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-41563271290807086672019-01-03T11:06:00.001-08:002019-01-03T11:06:25.101-08:00Review of LSM-based Storage Techniques: A SurveyChen Luo and Mike Carey published a wonderful <a href="https://arxiv.org/abs/1812.07527">survey of research on LSM algorithms</a>. They know about LSM because the <a href="http://asterix.ics.uci.edu/">AsterixDB project</a> includes an LSM. They did a great job explaining the LSM space, telling a coherent story and summarizing relevant papers. Reading this paper was a good use of my time and I found a few more papers to read in their references.<br /><br />I have read a few papers, including <a href="http://smalldatum.blogspot.com/2018/11/review-of-triad-creating-synergies.html">TRIAD</a>, with ideas on reducing write-amp for the smaller levels of the LSM tree. I think this could be done for RocksDB by merging and remerging immutable memtables -- this is similar in spirit to subcompactions for the L0. 
With a large immutable memtable there would be one less level in the LSM tree. This is an alternative to having an L0, and maybe an L1, that are not made durable. In all cases the cost is a longer MTTR because WAL replay must be done. In all cases there is an assumption that the non-durable levels (large immutable memtables or L0/L1) are in memory.<br /><br />This is a small complaint that I have made in the past. The paper states that an LSM eliminates random IO when making things durable. I prefer to claim that it reduces random IO. With leveled compaction each step merges N (~11) SSTs to generate one stream of output. So for each step there is likely a need to seek when reading the ~11 input streams and writing the output stream. Then compaction steps usually run concurrently when the ingest rate is high so there are more seeks. Then the WAL must be written -- one more stream and a chance for more seeks. Finally user queries are likely to read from storage causing even more seeks. Fortunately, there will be fewer seeks per insert/update/delete compared to a B-Tree.<br /><br />The paper has a short history of compaction describing pure-tiered and pure-leveled. But these are rarely used in practice. The <a href="http://smalldatum.blogspot.com/2018/06/the-original-lsm-paper.html">original LSM paper</a> implemented pure-leveled. LevelDB and RocksDB use a hybrid approach with tiered for the L0 followed by leveled for the remaining levels. Pure-tiered was introduced by the Stepped Merge paper. Using tiered for all levels has a large space-amplification, much larger than 1, because the max level is tiered and that is too much wasted space for many workloads. Tiered in RocksDB and other popular LSM engines can be configured to use leveled compaction into the max level to get a space-amp less than 2, ignoring transient space-amp during compaction into the max level. 
Pure-tiered was a great choice for Stepped Merge because that was a cache for bulk-loading a data warehouse rather than a full copy of the database. While I think that RocksDB leveled and RocksDB tiered are examples of <a href="http://smalldatum.blogspot.com/2018/10/describing-tiered-and-leveled-compaction.html">tiered+leveled</a>, I don't want to rename them.<br /><br />I appreciate that the paper makes clear that trade-offs must be considered when evaluating benchmarks. Many things can support higher write rates than RocksDB with leveled compaction, including RocksDB with tiered compaction. But that comes at a cost in memory, read and/or space amplification. Some papers could do a better job of documenting those costs.<br /><br />The cost analysis in section 2.3 is limited to IO costs. I look forward to coverage of CPU costs in future LSM research. The read penalty for an LSM compared to a B-Tree is usually worse for CPU than for IO. The paper uses partitioned and non-partitioned where I use <a href="http://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html">all-to-all and some-to-some</a> to explain the compaction approaches. RocksDB implements some-to-some for leveled and all-to-all for tiered. The paper does a nice job explaining why the per-level write-amp should be less for all-to-all than some-to-some, ignoring write skew. Note that in production the per-level write-amp is almost always less than the per-level growth factor and <a href="https://hyeontaek.com/papers/msls-fast2016.pdf">this paper from Hyeontaek Lim</a> explains why.<br /><br />For the read IO costs, the paper counts logical IOs rather than physical IOs. Logical IOs are easier to estimate because caches mean that many logical IOs don't cause a physical IO and smaller levels in the LSM tree are usually in cache. There are two ways to consider the cost for a range query -- long vs short range queries or the cost of range seek vs range next. 
The paper uses the first, I use the second. Both are useful.<br /><br />I appreciate that the author noticed this. I realize there is pressure to market research and I am not offering to try and reproduce benchmark results, but I have been skeptical about some of the comparisons I see where the base case is InnoDB or RocksDB.<br /><blockquote class="tr_bq"><span style="font-family: Times, Times New Roman, serif;"><span style="background-color: white; color: #202124; font-size: 16px; font-variant-ligatures: none; letter-spacing: 0.1px; white-space: pre-wrap;">These improvements have mainly </span><span style="background-color: white; color: #202124; font-size: 16px; font-variant-ligatures: none; letter-spacing: 0.1px; white-space: pre-wrap;">been evaluated against a default (untuned) configuration of </span><span style="background-color: white; color: #202124; font-size: 16px; font-variant-ligatures: none; letter-spacing: 0.1px; white-space: pre-wrap;">LevelDB or RocksDB, which use the leveling merge policy </span><span style="background-color: white; color: #202124; font-size: 16px; font-variant-ligatures: none; letter-spacing: 0.1px; white-space: pre-wrap;">with size ratio 10. It is not clear how these improvements </span><span style="background-color: white; color: #202124; font-size: 16px; font-variant-ligatures: none; letter-spacing: 0.1px; white-space: pre-wrap;">would compare against a well-tuned LSM-tree.</span></span></blockquote>The discussion in 3.3.1 on pipelining compaction is interesting but RocksDB already does pipelining. With buffered IO there is support for async read-ahead and async write-behind. Note that the read and write phases can also be CPU-heavy if the cost for decompression on read and compression on write are included, even when the wonderful zstd and lz4 algorithms are used.<br /><br />A few more comments:<br /><ul><li>RocksDB has limited support for fractional cascading (from SST to SST). 
See 3.4.2.</li><li>With key-value separation, GC could merge log segments to generate longer ordered log segments over time. This would reduce the range read penalty. See 3.4.2.</li><li>LHAM might be the first time-series optimized compaction strategy. See 3.5.</li><li>Non-unique secondary index maintenance is already read-free in MyRocks. It has a copy of the row prior to index maintenance, either because of SQL semantics or because this was an insert. Write-optimized SQL engines can add support for read-free change statements in some cases but that usually means SQL semantics (like modified row count) will be broken. See 3.7.2.</li><li>MyRocks already collects statistics during compaction. See 3.7.3.</li></ul><br />Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com2tag:blogger.com,1999:blog-9149523927864751087.post-69071944690750933402018-12-17T12:05:00.002-08:002018-12-17T12:06:34.446-08:00New small servers for performance testingMy <a href="http://smalldatum.blogspot.com/2017/05/small-servers-for-database-performance.html">old NUC cluster</a> found a new home and I downsized to 2 new NUC servers. The new server is <a href="https://ark.intel.com/products/126140/Intel-NUC-Kit-NUC8i7BEH">NUC8i7beh</a> with <a href="https://www.amazon.com/gp/product/B01BIWMWVS">16g RAM</a>, 500g <a href="https://www.amazon.com/gp/product/B0781Z7Y3S">Samsung 860</a> EVO for the OS and 500g <a href="https://www.amazon.com/gp/product/B07BN4NJ2J">Samsung 970</a> EVO for performance. The Samsung 860 is SATA and the Samsung 970 is an m.2 device. I expect to wear out the performance devices as I <a href="http://smalldatum.blogspot.com/2017/10/wearing-out-ssd.html">have done that</a> in the past. 
With the OS on a separate device I avoid the need to reinstall the OS when that happens.<br /><br />The new NUC has a post-Skylake CPU (<a href="https://ark.intel.com/products/137979/Intel-Core-i7-8559U-Processor-8M-Cache-up-to-4-50-GHz-">i7-8559u</a>) and provides 4 cores (8 HW threads) compared to 2 cores (4 HW threads) in the old NUCs. I disabled turbo boost again to avoid performance variance as mentioned in the old post. I am not sure these have sufficient cooling for sustained boost and when boost isn't sustained there are frequent changes in CPU performance. I also disabled hyperthreads out of concern for both the impact from Spectre fixes and to avoid a different syscall overhead each time I update the kernel.<br /><br />I might use these servers to examine the impact of the <a href="https://twitter.com/markcallaghan/status/1074375153650266112">~10x increase in PAUSE</a> times on InnoDB with and without HT enabled. I might also use them for another round of MySQL performance testing when 8.0.14 is released.<br /><br />I am a big fan of Intel NUC servers. But maybe I am not a fan of the SATA cables they use. I already had one of my old NUCs replaced under warranty after one of the SATA wires was bare. In the new NUCs that I just set up, a few of the SATA cables appear to be cut and I wonder if they will eventually become bare.Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-87931951296277077552018-12-14T10:17:00.001-08:002018-12-14T10:17:36.774-08:00LSM math - size of search space for LSM tree configurationI have <a href="http://smalldatum.blogspot.com/2018/10/describing-tiered-and-leveled-compaction.html">written before</a> and will write again about using 3-tuples to explain the shape of an LSM tree. This makes it easier to explain the configurations supported today and configurations we might want to support tomorrow in addition to traditional tiered and leveled compaction. 
The summary is that an LSM tree has n levels labeled from L1 to Ln and Lmax is another name for Ln. There is one 3-tuple per level and the components of the 3-tuple are (type, fanout, runs) for Lk (level k) where:<br /><ul><li>type is Tiered or Leveled and explains compaction into that level</li><li>fanout is the size of a sorted run in Lk relative to a sorted run from Lk-1, a real and >= 1</li><li>runs is the number of sorted runs in that level, an integer and >= 1</li></ul><div>Given the above how many valid configurations exist for an LSM tree? There are additional constraints that can be imposed on the 3-tuple but I will ignore most of them except for limiting fanout and runs to be <= 20. The answer is easy - there are an infinite number of configurations because fanout is a real.</div><div><br /></div><div>The question is more interesting when fanout is limited to an integer and the number of levels is limited to between 1 and 10. I am doing this to explain the size of the search space but I don't think that fanout should be limited to an integer.</div><div><br /></div><div>There are approximately 2^11 configurations only considering compaction type, which has 2 values, and 1 to 10 levels because there are 2^n configurations of compaction types for a tree with n levels and the sum of 2^1 + 2^2 + ... + 2^9 + 2^10 = 2^11 - 2</div><div><br /></div><div>But when type, fanout and runs are considered then there are 2 x 20 x 20 = 800 choices per level and 800^n combinations for an LSM tree with n levels. Considering LSM trees with 1 to 10 levels then the number of valid configurations is the sum 800^1 + 800^2 + ... + 800^9 + 800^10. That is a large number of configurations if exhaustive search were to be used to find the best configuration. 
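The counts above can be verified with a short script. This is a sketch of mine, not code from any LSM implementation, and it assumes the constraints stated above: 2 compaction types per level, integer fanout and runs each limited to <= 20, and trees with 1 to 10 levels.

```python
# Count the LSM tree configurations described by per-level 3-tuples.

def type_only_configs(max_levels=10):
    # 2^n configurations of compaction types for a tree with n levels
    return sum(2 ** n for n in range(1, max_levels + 1))

def full_configs(max_levels=10, choices_per_level=2 * 20 * 20):
    # (type, fanout, runs) allows 2 x 20 x 20 = 800 choices per level
    return sum(choices_per_level ** n for n in range(1, max_levels + 1))

print(type_only_configs())  # 2046 = 2^11 - 2
print(full_configs())       # roughly 10^29, far too many to search exhaustively
```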
Note that I don't think exhaustive search should be used.</div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-74227521885973079772018-12-13T13:18:00.001-08:002018-12-13T13:39:27.718-08:00LSM math - how many levels minimizes write amplification?How do you configure an LSM tree with leveled compaction to minimize write amplification? For a given number of levels write-amp is minimal when the same fanout (growth factor) is used between all levels, but that does not explain the number of levels to use. In this post I answer that question.<br /><ol><li>The number of levels that minimizes write-amp is one of ceil(ln(T)) or floor(ln(T)) where T is the total fanout -- sizeof(database) / sizeof(memtable)</li><li>When #1 is done then the per-level fanout is e when the number of levels is ln(T) and a value close to e when the number of levels is limited to an integer.</li></ol><div><b>Introduction</b></div><div><br />I don't recall reading this result elsewhere, but I am happy to update this post with a link to such a result. I was encouraged to answer this after a discussion with the RocksDB team and thank Siying Dong for stating #2 above while leaving the math to me. I assume the <a href="http://smalldatum.blogspot.com/2018/06/the-original-lsm-paper.html">original LSM paper</a> didn't address this problem because that system used a fixed number of levels.</div><br />One result from the <a href="http://smalldatum.blogspot.com/2018/06/the-original-lsm-paper.html">original LSM paper</a> and <a href="http://smalldatum.blogspot.com/2018/10/minimizing-write-amplification-in-lsm_3.html">updated by me</a> is that write-amp is minimized when the per-level growth factor is constant. Sometimes I use fanout or per-level fanout rather than per-level growth factor. 
In RocksDB the option name is <a href="https://github.com/facebook/rocksdb/blob/master/include/rocksdb/advanced_options.h#L489">max_bytes_for_level_multiplier</a>. Yes, this can be confusing. The default fanout in RocksDB is 10.<br /><br /><b>Math</b><br /><br />I solve this for pure-leveled compaction which differs from what RocksDB calls leveled. In pure-leveled all levels use leveled compaction. In RocksDB leveled the first level, L0, uses tiered and the other levels use leveled. I started to <a href="http://smalldatum.blogspot.com/2018/10/describing-tiered-and-leveled-compaction.html">explain this here</a> where I claim that RocksDB leveled is really tiered+leveled. But I am not asking for them to change the name.<br /><br />Assumptions:<br /><ul><li>LSM tree uses pure-leveled compaction and compaction from memtable flushes into the first level of the LSM tree uses leveled compaction</li><li>total fanout is T and is size(Lmax) / size(memtable) where Lmax is the max level of the LSM tree</li><li>workload is update-only so the number of keys in the database is fixed</li><li>workload has no write skew and all keys are equally likely to be updated</li><li>per-level write-amp == per-level growth factor. In practice and <a href="https://hyeontaek.com/papers/msls-fast2016.pdf">in theory</a> the per-level write-amp tends to be less than the per-level growth factor.</li><li>total write-amp is the sum of per-level write-amp. I ignore write-amp from the WAL. 
</li></ul><br /><b>Specify function for write-amp and determine critical points</b><br /><br /><span style="font-family: "courier new" , "courier" , monospace;"># wa is the total write-amp<br /># n is the number of levels<br /># per-level fanout is the nth root of the total fanout</span><br /><span style="font-family: "courier new" , "courier" , monospace;"># per-level fanout = per-level write-amp<br /># therefore wa = number of levels * per-level fanout</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa = n * t^(1/n)<br /><br /># given the function for write-amp as wa = a * b<br /># ... then below is a' * b + a * b'<br />a = n, b = t^(1/n)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa' = t^(1/n) + n * ln(t) * t^(1/n) * (-1) * (1/n^2)<br /><br /># which simplifies to</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa' = t^(1/n) - (1/n) * ln(t) * t^(1/n)<br /><br /># critical point for this occurs when wa' = 0<br />t^(1/n) - (1/n) * ln(t) * t^(1/n) = 0</span><br /><span style="font-family: "courier new" , "courier" , monospace;">t^(1/n) = (1/n) * ln(t) * t^(1/n)<br />1 = (1/n) * ln(t)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">n = ln(t)</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: Times, Times New Roman, serif;">When t = 1024 then n = ln(1024) ~= 6.93. In this case write-amp is minimized when 7 levels are used although 6 isn't a bad choice.</span><br /><br />Assuming the cost function is convex (see below) the critical point is the minimum for write-amp. However, n must be an integer so the number of levels that minimizes write-amp is one of: ceil(ln(t)) or floor(ln(t)).<br /><br />The graph for wa when t=1024 can be viewed <a href="https://www.desmos.com/calculator/dyqkf7irep">thanks to Desmos</a>. 
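The result can also be checked numerically with a few lines of Python. This is my sketch, separate from the derivation above, using the same write-amp model (per-level write-amp equals per-level fanout):

```python
import math

def write_amp(n, t):
    # total write-amp for n levels with total fanout t, assuming the
    # per-level write-amp equals the per-level fanout t^(1/n)
    return n * t ** (1.0 / n)

t = 1024
print(math.log(t))  # ~6.93, the minimizer when n need not be an integer
for n in range(5, 10):
    print(n, round(write_amp(n, t), 2))
# among integers, n = 7 gives the least write-amp (~18.84)
```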
The function looks convex and I show below that it is.<br /><br /><b>Determine whether critical point is a min or max</b><br /><br />The critical point found above is a minimum for wa if wa is convex so we must show that the second derivative is positive.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">wa = n * t ^ (1/n)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa' = t^(1/n) - (1/n) * ln(t) * t^(1/n)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa' = t^(1/n) * (1 - (1/n) * ln(t))</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /># assuming wa' is a * b then wa'' is a' * b + a * b' </span><br /><span style="font-family: "courier new" , "courier" , monospace;">a = t^(1/n)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">a' = ln(t) * t^(1/n) * -1 * (1/n^2)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">a' = - ln(t) * t^(1/n) * (1/n^2)</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;">b = 1 - (1/n) * ln(t)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">b' = (1/n^2) * ln(t)</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"># a' * b </span><br /><span style="font-family: "courier new" , "courier" , monospace;">- ln(t) * t^(1/n) * (1/n^2) --> called x below</span><br /><span style="font-family: "courier new" , "courier" , monospace;">+ ln(t) * ln(t) * (1/n^3) * t^(1/n) --> called y below</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"># b' * a</span><br /><span style="font-family: "courier new" , "courier" , monospace;">t^(1/n) * 
(1/n^2) * ln(t) --> called z below</span><br /><span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;"><br /></span><span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;"># therefore wa'' = x + y + z</span><br /><span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;"># note that x, y and z all contain: t^(1/n), 1/n and ln(t)</span><br /><span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">wa'' = t^(1/n) * (1/n) * ln(t) * (-(1/n) + (ln(t) * 1/n^2) + (1/n))</span><br /><span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;"># the -(1/n) and +(1/n) terms cancel</span><br /><span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">wa'' = t^(1/n) * (1/n) * ln(t) * ( ln(t) * 1/n^2 )</span><br /><span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">wa'' = t^(1/n) * 1/n^3 * ln(t)^2</span><br /><span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;"><br /></span><span style="font-family: Times, Times New Roman, serif;">Therefore wa'' is positive, wa is convex and the critical point is a minimum value for wa.</span><br /><br /><b>Solve for per-level fanout</b><br /><br />The next step is to determine the value of the per-level fanout when write-amp is minimized. If the number of levels doesn't have to be an integer then this occurs when ln(t) levels are used and below I show that the per-level fanout is e in that case. 
When the number of levels is limited to an integer then the per-level fanout that minimizes write-amp is a value that is close to e.<br /><br /><span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;"># total write-amp is number of levels * per-level fanout<br />wa = n * t^(1/n)</span><br /><span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;"><br /># The per-level fanout is t^(1/n) and wa is minimized when n = ln(t)</span><br /><span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;"># Therefore we show that t^(1/n) = e when n = ln(t)</span><br /><span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">Assume t^(1 / ln(t)) = e</span><br /><span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">ln (t^(1 / ln(t))) = ln e</span><br /><span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">(1 / ln(t)) * ln(t) = 1</span><br /><span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">1 = 1</span><br /><br />When t=1024 then ln(t) ~= 6.93. 
With 7 levels the per-level fanout is t^(1/7) ~= 2.69 while e ~= 2.72.<br />Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com2tag:blogger.com,1999:blog-9149523927864751087.post-13803072068986634382018-12-01T08:39:00.000-08:002018-12-01T08:39:01.390-08:00Pixelbook reviewThis has nothing to do with databases. This is a review of a Pixelbook (Chromebook laptop) that I got on sale last month. This one has a core i5, 8gb RAM and 128gb storage. It runs Linux too but I haven't done much with that. I expected a lot from this given that my 2013 Nexus 7 tablet is still awesome. I have been mostly happy with the laptop but if you care about keyboards and don't like the new Macs thanks to the butterfly keyboard then this might not be the laptop for you. My 3 complaints:<br /><br /><ol><li>keyboard is hard to read. It is grey on grey and too hard to read when there is light on my back even with the backlight (backlit?) turned all the way up. I don't get it -- grey on grey. So this is a great device for using in a dark room or for improving your touch typing skills.</li><li>touchpad control is too coarse grained so it is either too fast or too slow. The setting has 5 values via a slider (1=slowest, 5=fastest). 
I have been using it at 3 which is a bit too fast for me while 2 is a bit too slow. I might go back to 2 but that means picking up my finger more frequently when moving a pointer across the screen.</li><li>no iMessage - my family uses Apple devices and I can't run that here as I can on a Mac laptop</li></ol><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-_qqlkbJTjxo/XAK4Tpc2RSI/AAAAAAAAYqU/hWjw_XHYVx01Cz_SwF1XU9i75S-x56C4QCLcBGAs/s1600/IMG-0067.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1200" data-original-width="1600" height="240" src="https://4.bp.blogspot.com/-_qqlkbJTjxo/XAK4Tpc2RSI/AAAAAAAAYqU/hWjw_XHYVx01Cz_SwF1XU9i75S-x56C4QCLcBGAs/s320/IMG-0067.JPG" width="320" /></a></div><div><br /></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-68062792429821627322018-11-19T12:32:00.000-08:002018-11-19T12:32:44.822-08:00Review of TRIAD: Creating Synergies Between Memory, Disk and Log in Log Structured Key-Value StoresThis is a review of TRIAD which <a href="https://www.usenix.org/conference/atc17/technical-sessions/presentation/balmau">was published</a> in USENIX ATC 2017. It explains how to reduce write amplification for RocksDB leveled compaction although the ideas are useful for many LSM implementations. I share a review here because the paper has good ideas. It isn't easy to keep up with all of the LSM research, even when limiting the search to <a href="https://scholar.google.com/scholar?as_vis=1&q=rocksdb+lsm&hl=en&as_sdt=1,5">papers that reference RocksDB</a>, and I didn't notice this paper until recently.<br /><br />TRIAD reduces write amplification for an LSM with leveled compaction and with a variety of workloads gets up to 193% more throughput, up to 4X less write amplification and spends up to 77% less time doing compaction and flush. 
Per the <a href="http://daslab.seas.harvard.edu/rum-conjecture/">RUM Conjecture</a> improvements usually come at a cost and the cost in this case is more <a href="http://smalldatum.blogspot.com/2018/03/cache-amplification.html">cache amplification</a> (more memory overhead/key) and possibly more <a href="http://smalldatum.blogspot.com/2018/07/query-cpu-overheads-in-rocksdb.html">read amplification</a>. I assume this is a good tradeoff in many cases.<br /><br />The paper explains the improvements via 3 components -- TRIAD-MEM, TRIAD-DISK and TRIAD-LOG -- that combine to reduce write amplification.<br /><br /><b>TRIAD-MEM</b><br /><br />TRIAD-MEM reduces write-amp by keeping frequently updated keys (hot keys) in the memtable. It divides the keys in the memtable into two classes: hot and cold. On flush the cold keys are written into a new L0 SST while the hot keys are copied over to the new memtable. The hot keys must be written again to the new WAL so that the old WAL can be dropped. TRIAD-MEM tries to keep the K hottest keys in the memtable and there is work in progress to figure out a good value for K without being told by the DBA.<br /><br />An extra 4-bytes/key is used for the memtable to track write frequency and identify hot keys. Note that RocksDB already uses 8 bytes/key for metadata. So TRIAD-MEM has a cost in cache-amp but I don't think that is a big deal.<br /><br />Assuming the per-level write-amp is 1 from the memtable flush this reduces it to 0 in the best case where all keys are hot.<br /><br /><b>TRIAD-DISK</b><br /><br />TRIAD-DISK reduces write-amp by delaying L0:L1 compaction until there is sufficient overlap between keys to be compacted. 
TRIAD continues to use an L0:L1 compaction trigger based on the number of files in the L0 but can trigger compaction earlier when there is probably sufficient overlap between the L0 and L1 SSTs.<br /><br />Overlap is estimated via <a href="https://en.wikipedia.org/wiki/HyperLogLog">Hyperloglog</a> (HLL), which requires 4kb/SST, using the formula below where file-i is the i-th SST under consideration, UniqueKeys is the estimated number of distinct keys across all of the SSTs and Keys(file-i) is the number of keys in the i-th SST. The paper states that both UniqueKeys and Keys are approximated using HLL. But I assume that per-SST metadata already has an estimate or exact value for the number of keys in the SST. The formula for overlap is:<br /> <span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">UniqueKeys(file-1, file-2, ... file-n) / sum( Keys( file-i))</span><br /><br />The benefit from early L0:L1 compaction is less read-amp, because there will be fewer sorted runs to search on a query. The cost from always doing early compaction is more per-level write-amp which is estimated by size(L1 input) / size(L0 input). TRIAD-DISK provides the benefit with less cost.<br /><br />In RocksDB today you can manually schedule early compaction by setting the trigger to 1 or 2 files, or you can always schedule it to be less early with a trigger set to 8 or more files. But this setting is static. TRIAD-DISK uses a cost-based approach to do early compaction when it won't hurt the per-level write-amp. This is an interesting idea.<br /><br /><b>TRIAD-LOG</b><br /><br />TRIAD-LOG explains improvements to memtable flush that reduce write-amp. Data in an L0 SST has recently been written to the WAL. So they use the WAL in place of writing the L0 SST. But something extra, an index into the WAL, is written on memtable flush because everything in the L0 must have an index. 
The WAL in the SST (called the CL-SST for commit log SST) will be deleted when it is compacted into the L1.<br /><br />There is cache-amp from TRIAD-LOG. Each key in the CL-SST (L0) and maybe in the memtable needs 8 extra bytes -- 4 bytes for CL-SST ID, 4 bytes for the WAL offset.<br /><br />Assuming the per-level write-amp is one from the memtable flush for cold keys this reduces that to 0.<br /><br /><b>Reducing write amplification</b><br /><br />The total write-amp for an LSM tree with leveled compaction is the sum of:<br /><ul><li>writing the WAL = 1</li><li>memtable flush = 1</li><li>L0:L1 compaction ~= size(L1) / size(L0)</li><li>Ln compaction for n>1 ~= fanout, the per-level growth factor, usually 8 or 10. Note that <a href="https://hyeontaek.com/papers/msls-fast2016.pdf">this paper</a> explains why it is usually a bit less than fanout.</li></ul><div>TRIAD avoids the write-amp from memtable flush thanks to TRIAD-MEM for hot keys and TRIAD-LOG for cold keys. I will wave my hands and suggest that TRIAD-DISK reduces write-amp for L0:L1 from 3 to 1 based on the typical LSM configuration I use. So TRIAD reduces the total write-amp by 1+2 or 3.<br /><br />Reducing total write-amp by 3 is a big deal when the total write-amp for the LSM tree is small, for example <= 10. But that only happens when there are few levels beyond the L1. Assuming you accept my estimate for total write-amp above then per-level write-amp is ~8 for both L1:L2 and L2:L3. The total write-amp for an LSM tree without TRIAD would be 1+1+3+8 = 13 if the max level is L2 and 1+1+3+8+8 = 21 if the max level is L3. And then TRIAD reduces that from 13 to 10 or from 21 to 18.<br /><br />But my write-amp estimate above is more true for workloads without skew and less true for workloads with skew. Many of the workloads tested in the paper have a large amount of skew. So while I have some questions about the paper I am not claiming they are doing it wrong. 
What I am claiming is that the benefit from TRIAD is significant when total write-amp is small and less significant otherwise. Whether this matters is workload dependent. It would help to know more about the LSM tree from each benchmark. How many levels were in the LSM tree per benchmark? What is the per-level write-amp with and without TRIAD? Most of this can be observed from <a href="https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide">compaction statistics</a> provided by RocksDB. The paper has some details on the workloads but that isn't sufficient to answer the questions above.</div><br /><b>Questions</b><br /><br />The paper documents the memory overhead, but limits the definition of read amplification to IO and measured none. I am interested in IO and CPU and suspect there might be some CPU read-amp from using the commit-log SST in the L0 both for searches and during compaction as logically adjacent data is no longer physically adjacent in the commit-log SST. I also wonder about the impact of using more levels.<br /><br />Another question is how far down the LSM compaction occurs. For example if the write working set fits in the L2, should compaction stop at the L2? It might with some values of <a href="http://smalldatum.blogspot.com/2016/02/compaction-priority-in-rocksdb.html">compaction priority</a> in RocksDB but it doesn't for all. When the workload has significant write skew then the write working set is likely to fit into one of the smaller levels of the LSM tree.<br /><br />An interesting variant on this is a workload with N streams of inserts that are each appending (right growing). When N=1 there is an optimization in RocksDB that limits write-amp to 2 (one for WAL, one for SST). 
I am not aware of optimizations in RocksDB for N>2 but am curious if we could do something better.Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com4tag:blogger.com,1999:blog-9149523927864751087.post-50873921815814846002018-11-02T10:18:00.000-07:002018-11-02T10:18:45.344-07:00Converting an LSM to a B-Tree and back againI wonder if it is possible to convert an LSM to a B-Tree. The goal is to do it online and in-place -- so I don't want two copies of the database while the conversion is in progress. I am interested in data structures for data management that adapt dynamically to improve performance and efficiency for a given workload. <div><br /></div><div>Workloads change in the short and long term. I hope that data structures can adapt to the change and converting between an LSM and a B-Tree is one way to adapt. This is more likely to be useful when the data structure supports some kind of partitioning in the hope that different workloads can be isolated to different partitions -- and then some can use an LSM while others use a B-Tree.<br /><br /><b>LSM to B-Tree</b></div><div><br /></div><div>A B-Tree is one tree. An LSM is a sequence of trees. Each sorted run in the LSM is a tree. With leveled compaction in RocksDB there are a few sorted runs in level 0 (L0) and then one sorted run in each of L1, L2 up to the max level (Lmax). <div><br /></div><div>A B-Tree persists changes by writing back pages -- either in-place or copy-on-write (<a href="http://smalldatum.blogspot.com/2015/08/different-kinds-of-copy-on-write-for-b.html">UiP or CoW</a>). An LSM persists changes by writing and then re-writing rows. I assume that page write back is required if you want to limit the database to one tree and row write back implies there will be more than one tree. 
</div><div><br /></div><div>There are two things that must be done online and in-place:</div><div><ol><li>Convert the LSM from many trees to one tree</li><li>Convert from row write back to page write back</li></ol></div><div>Note that my goal has slightly changed. I want to move from an LSM to a data structure with one tree. For the one-tree solution a B-Tree is preferred but not required.</div><div><br /></div><div>The outline of a solution:<br /><ol><li>Reconfigure the LSM to use 2 levels -- L0 and L1 -- and 3 trees -- memtable, L0, L1.</li><li>Disable the L0. At this point the LSM has two trees -- memtable and L1.</li><li>Flush the memtable and merge it into the L1. Now there is one tree.</li><li>After the flush disable the memtable and switch to a page cache. Changes now require a copy of the L1 block in the page cache that eventually gets written back via UiP or CoW.</li></ol><div>The outline above doesn't explain how to maintain indexes for the L1. Note that after step 2 there is one tree on disk and the layout isn't that different from the leaf level of a B-Tree. The interior levels of the B-Tree could be created by reading/rewriting the block indexes stored in the SSTs.</div></div><div><br /></div><div><b>B-Tree to LSM</b></div><div><br />The conversion can also be done in the opposite direction (B-Tree to LSM):</div><div><ol><li>Treat the current B-Tree as the max level of the LSM tree. While it might help to flush the page cache I don't think that is required. 
This is easier to do when your LSM uses a B-Tree per level, as done by WiredTiger.</li><li>Record new changes for insert, update, delete in a memtable rather than a page cache.</li><li>When the memtable is full then flush it to create a new tree (sorted run, SST) on disk.</li><li>Eventually start to do compaction.</li></ol></div></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com2tag:blogger.com,1999:blog-9149523927864751087.post-22596677300676814792018-10-19T16:34:00.000-07:002018-10-19T16:34:21.871-07:00Combining tiered and leveled compactionThere are simple optimization problems for LSM tuning. For example use leveled compaction to minimize space amplification and use tiered to minimize write amplification. But there are interesting problems that are harder to solve:<br /><ol><li>maximize throughput given a constraint on write and/or space amplification</li><li>minimize space and/or write amplification given a constraint on read amplification</li></ol><div>To solve the first problem use leveled compaction if it can satisfy the write amp constraint, else use tiered compaction if it can satisfy the space amp constraint, otherwise there is no solution. The lack of a solution might mean the constraints are unreasonable but it can also mean we need to enhance LSM implementations to support more <a href="http://smalldatum.blogspot.com/2018/10/describing-tiered-and-leveled-compaction.html">diversity in LSM tree shapes</a>. Even when there is a solution using leveled or tiered compaction there are solutions that would do much better were an LSM to support more varieties of tiered+leveled and leveled-N.</div><div><br /></div><div>When I mention <i>solved</i> above I leave out that there is more work to find a solution even when tiered or leveled compaction is used. For both there are decisions about the number of levels and per-level fanout. 
If <a href="http://smalldatum.blogspot.com/2018/10/minimizing-write-amplification-in-lsm_3.html">minimizing write amp</a> is the goal then that is a solved problem. But there are usually more things to consider.</div><div><br /><b>Tiered+leveled</b></div><div><br />I defined tiered+leveled and leveled-N in a <a href="http://smalldatum.blogspot.com/2018/10/describing-tiered-and-leveled-compaction.html">previous post</a>. They occupy the middle ground between tiered and leveled compaction with better read efficiency than tiered and better write efficiency than leveled. They are not supported today by popular LSM implementations but I think they can and should be supported. </div><div><br />While we tend to explain compaction as a property of an LSM tree (all tiered or all leveled) it is really a property of a level of an LSM tree and RocksDB already supports hybrids, combinations of tiered and leveled. For tiered compaction in RocksDB all levels except the largest use tiered. The largest level is usually configured to use leveled to reduce space amp. For leveled compaction in RocksDB all levels except the smallest use leveled and the smallest (L0) uses tiered.</div><div><br />So tiered+leveled isn't new but I think we need more flexibility. When a string of T and L is created from the per-level compaction choices then the regex for the strings that RocksDB supports is T+L or TL+. I want to support T+L+. I don't want to support cases where leveled is used for a smaller level and tiered for a larger level. So I like TTLL but not LTTL. My reasons for not supporting LTTL are:</div><div><ol><li>The benefit from tiered is less write amp and is independent of the level on which it is used. The reduction in write amp is the same whether tiered is used for L1, L2 or L3.</li><li>The cost from tiered is more read and space amp and that is dependent on the level on which it is used. The cost is larger for larger levels. 
When space amp is 2, more space is wasted on larger levels than on smaller levels. More IO read amp is worse for larger levels because they have a lower hit rate than smaller levels and more IO will be done. More IO implies more CPU cost from decompression and the CPU overhead of performing IO.</li></ol><div>From the above, the benefit from using T is the same for all levels but the cost increases for larger levels, so when T and L are both used then T (tiered) should be used on the smaller levels and L (leveled) on the larger levels.</div></div><div><br /></div><div><b>Leveled-N</b></div><div><br /></div><div>I defined leveled-N in a <a href="http://smalldatum.blogspot.com/2018/10/describing-tiered-and-leveled-compaction.html">previous post</a>. Since then a co-worker, <a href="https://twitter.com/MaysamYabandeh">Maysam Yabandeh</a>, explained to me that a level that uses leveled-N can also be described as two levels where the smaller uses leveled and the larger uses tiered. So leveled-N might be syntactic sugar in the LSM tree configuration language.</div><div><br /></div><div>For example, with an LSM defined using the triple syntax <a href="https://twitter.com/MaysamYabandeh">from here</a> as (compaction type, fanout, runs-per-level), this is valid: (T,1,8) (T,8,2) (L,8,2) (L,8,1) and has total fanout of 512 (8 * 8 * 8). The third level (L,8,2) uses leveled-N with N=2. Assuming we allow LSM trees where T follows L then the leveled-N level can be replaced with two levels: (L,8,1) (T,1,8). Then the LSM tree is defined as (T,1,8) (T,8,2) (L,8,1) (T,1,8) (L,8,1). These LSM trees have the same total fanout and total read/write/space amp. Compaction from (L,8,1) to (T,1,8) is special. It has zero write amp because it is done by a file move rather than merging/writing data so all that must be updated is LSM metadata to record the move.</div><div><br />So in general I don't support T after L but I do support it in the special case. 
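The equivalence between the two triple configurations above can be checked mechanically. A minimal sketch (the tuple encoding follows the post's notation; nothing here comes from a real implementation):

```python
from math import prod

# (compaction type, fanout, runs-per-level) triples, smallest level first.
with_leveled_n = [("T", 1, 8), ("T", 8, 2), ("L", 8, 2), ("L", 8, 1)]
# The leveled-N level (L,8,2) rewritten as a leveled level plus a tiered level:
rewritten = [("T", 1, 8), ("T", 8, 2), ("L", 8, 1), ("T", 1, 8), ("L", 8, 1)]

# Total fanout is the product of the per-level fanouts.
fanout_a = prod(fanout for _c, fanout, _r in with_leveled_n)
fanout_b = prod(fanout for _c, fanout, _r in rewritten)
print(fanout_a, fanout_b)  # 512 512
```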
Of course we can pretend the special case doesn't exist if we use the syntactic sugar provided by leveled-N. But I appreciate that Maysam discovered this.<br /></div><div><br /></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-90342570708996548822018-10-03T11:47:00.000-07:002018-10-30T13:35:17.825-07:00Minimizing write amplification in an LSMWrite-amplification for an LSM with leveled compaction is minimized when the per-level growth factor (fanout) is the same between all levels. This is a result for an LSM tree using a given number of levels. To find the minimal write-amplification for any number of levels this result can be repeated for 2, 3, 4, ... up to a large value. You might find that a large number of levels is needed to get the least write-amp and that comes at the <a href="http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-pick-2_23.html">price of more read-amp</a>, as the <a href="http://daslab.seas.harvard.edu/rum-conjecture/">RUM Conjecture</a> predicts.<br /><br />In all cases below I assume that compaction into the smallest level (from a write buffer flush) has no write-amp. This is done to reduce the size of this blog post.<br /><br />tl;dr - for an LSM with L1, L2, L3 and L4 what values for per-level fanout minimize write-amp when the total fanout is 1000?<br /><ul><li>(10, 10, 10) for leveled</li><li>(6.3, 12.6, 12.6) for leveled-N assuming two of the levels have 2 sorted runs</li><li>(>1, >1, >1) for tiered</li></ul><br /><b>Minimizing write-amp for leveled compaction</b><br /><br />For an LSM with 4 levels (L1, L2, L3, L4) there is a per-level fanout between L1:L2, L2:L3 and L3:L4. Assume this uses <a href="http://smalldatum.blogspot.com/2018/06/the-original-lsm-paper.html">classic leveled</a> compaction so the total fanout is size(L4) / size(L1). The product of the per-level fanouts must equal the total fanout. 
The total write-amp is the sum of the per-level write-amp. I assume that the per-level write amp is the same as the per-level fanout although in practice and <a href="https://hyeontaek.com/papers/msls-fast2016.pdf">in theory</a> it isn't that simple. Let's use a, b and c as the variables for the per-level fanout (write-amp); then the math problem is:<br /><ol><li>minimize a+b+c</li><li>such that a*b*c=k and a, b, c > 1</li></ol>While I have been working on my math skills this year they aren't great and corrections are welcome. This is a <a href="https://en.wikipedia.org/wiki/Constrained_optimization">constrained optimization</a> problem that can be solved using <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange Multipliers</a>. From above #1 is the sum of per-level write-amp and #2 means that the product of per-level fanout must equal the total fanout. The last constraint is that a, b and c must (or should) all be > 1.<br /><div><br /></div><div>This result uses Lagrange Multipliers for an LSM tree with 4 levels so there are 3 variables: a, b, c. But the math holds for an LSM tree with fewer levels or with more levels. 
If there are N levels then there are N-1 variables.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">L(a, b, c) = a + b + c - lambda * (a*b*c - k)</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">dL/da = 1 - lambda * bc</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">dL/db = 1 - lambda * ac</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">dL/dc = 1 - lambda * ab</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">then</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">lambda = 1/bc = 1/ac = 1/ab</span><br /><span style="font-family: "courier new" , "courier" , monospace;">bc == ac == ab</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">and a == b == c to minimize the sum in #1</span></div><div><br /></div><div>I wrote a <a href="https://github.com/mdcallag/mytools/blob/master/scripts/index_structures/minwa.py">Python script</a> to discover the (almost) best values and the results match the math above.<br /><br /><b>Minimizing write-amp for tiered compaction</b><br /><br />Assuming you can reason about tiered compaction using the notion of levels then the math changes a bit because the per-level write-amp with tiered equals 1 regardless of the per-level fanout. For tiered with 4 levels and 3 variables the problem is:<br /><ol><li>minimize 1+1+1</li><li>such that a*b*c = k and a, b, c > 1</li></ol><div>Any values for a, b and c are sufficient as long as they satisfy the constraints in #2. But it still helps to minimize a+b+c if that predicts read-amp because a, b and c are also the number of sorted runs in L2, L3 and L4. 
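Whether the goal is leveled write-amp or read-amp from the number of sorted runs, the sum a+b+c is minimized at equal fanouts, and that can be confirmed by brute force. A coarse grid-search sketch (separate from the author's linked script) for total fanout 1000:

```python
# Brute-force check that equal per-level fanouts minimize a + b + c
# subject to a * b * c == k. Coarse grid, so this is a sanity check,
# not an exact solver.
k = 1000
best = None
grid = [x / 10 for x in range(11, 400)]  # candidate fanouts 1.1 .. 39.9
for a in grid:
    for b in grid:
        c = k / (a * b)
        if c <= 1:
            continue  # enforce the constraint that each fanout is > 1
        total = a + b + c
        if best is None or total < best[0]:
            best = (total, a, b, c)

print(best)  # minimum is at a == b == c == 10 with sum 30
```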
So my advice is to use a == b == c in most cases.<br /><br /><b>Minimizing write-amp for leveled-N compaction</b><br />I explain leveled-N compaction <a href="http://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html">here</a> and <a href="http://smalldatum.blogspot.com/2018/10/describing-tiered-and-leveled-compaction.html">here</a>. It is like leveled compaction but allows a level to have more than one sorted run. This reduces the per-level write-amp at the cost of more read-amp. Sometimes that is a good trade.<br /><br />The math above can also be used to determine how to configure per-level fanout to minimize write-amp for leveled-N. Assume an LSM tree with 4 levels (L1, L2, L3, L4) and 2 sorted runs in L2 and L3. The problem is:<br /><ol><li>minimize a + b/2 + c/2</li><li>such that a*b*c = k and a, b, c > 1</li></ol>For leveled compaction I assume that per-level write-amp is all-size(Ln+1) / all-size(Ln) for compaction from Ln into Ln+1. For leveled-N I assume it is run-size(Ln+1) / all-size(Ln) where all-size is the size of all sorted runs on that level and run-size is the size of one sorted run. The astute reader might notice that all-size(Ln) == run-size(Ln) for traditional leveled. For leveled-N I assume that fanout continues to be run-size(Ln+1) / run-size(Ln).<br /><br />Therefore with leveled-N the per-level write-amp is b/2 for L2 to L3 and c/2 for L3 to L4 because there are 2 sorted runs in the compaction input (twice as much data) in those cases. 
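For two sorted runs in L2 and L3 the optimum works out to 2a == b == c (the Lagrange derivation follows below). As a quick check with Python rather than bc:

```python
# Closed-form optimum for: minimize a + b/2 + c/2 with a * b * c == k.
# Substituting b == c == 2*a gives a * 2a * 2a == k, so a == (k/4) ** (1/3).
k = 1000
a = (k / 4) ** (1 / 3)   # cube root of 250
b = c = 2 * a

print(round(a, 2), round(b, 2), round(c, 2))  # 6.3 12.6 12.6
print(round(a * b * c, 6))                    # 1000.0, so the constraint holds
print(round(a + b / 2 + c / 2, 2))            # 18.9, vs 30 for leveled
```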
Were there 3 sorted runs then the values would be b/3 and c/3.<br /><br />Lagrange Multipliers can be used to solve this assuming we want to minimize a + b/2 + c/2.<br /><br /><div><span style="font-family: "courier new" , "courier" , monospace;">L(a, b, c) = a + b/2 + c/2 - lambda * (a*b*c - k)</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">dL/da = 1 - lambda * bc</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">dL/db = 1/2 - lambda * ac</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">dL/dc = 1/2 - lambda * ab</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">then</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">lambda = 1/bc = 1/2ac = 1/2ab</span><br /><span style="font-family: "courier new" , "courier" , monospace;">bc == 2ac -> b == 2a</span><br /><span style="font-family: "courier new" , "courier" , monospace;">bc == 2ab -> c == 2a</span><br /><span style="font-family: "courier new" , "courier" , monospace;">2ac == 2ab -> c == b </span></div><div><span style="font-family: "courier new" , "courier" , monospace;">and 2a == b == c to minimize the sum</span></div><br />If the total fanout is 1000 then the per-level fanout values that minimize write-amp are 10, 10, 10 for leveled and 6.30, 12.60, 12.60 for this example with leveled-N and can be computed by "bc -l"<br /><div class="p1"><span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: small;"># for leveled-N<br />e(l(1000/4)/3)</span></span></div><div class="p1"><span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: small;">6.29960524947436582381</span></span></div><div class="p2"><span style="font-family: "courier new" , "courier" , monospace; font-size: small;"><span class="s1"></span><br /></span></div><div class="p1"><span class="s1"><span style="font-family: 
"courier new" , "courier" , monospace; font-size: small;">e(l(1000/4)/3) * 2</span></span><br /><span style="font-family: "courier new" , "courier" , monospace;">12.59921049894873164762</span></div><span style="font-family: "courier new" , "courier" , monospace;"><br /># and for leveled<br />e(l(1000)/3)</span><br /><span style="background-color: white; font-family: "courier new" , "courier" , monospace;">9.99999999999999999992</span><br /><br />One way to think of this result is that with leveled compaction the goal is to use the same per-level fanout between levels. This also uses the same per-level write-amp between levels because per-level write-amp == the per-level fanout for leveled.<br /><br />But with leveled-N compaction we need to adjust the per-level fanout for levels to continue to get the same per-level write-amp between levels.</div><br /><br /></div><style type="text/css">p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 13.0px Menlo; color: #000000; background-color: #ffffff} span.s1 {font-variant-ligatures: no-common-ligatures} </style>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-50472360374591024912018-10-02T16:27:00.000-07:002018-10-02T16:27:15.687-07:00Describing tiered and leveled compactionThis is another attempt by me to define the shape of an LSM tree with more formality and this builds on previous posts <a href="http://smalldatum.blogspot.com/2018/07/tiered-or-leveled-compaction-why-not.html">here</a> and <a href="http://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html">here</a>. My key point is that compaction is the property of a level in an LSM tree rather than the LSM tree. Some levels can use tiered and others can use leveled. 
This combination of tiered and leveled is already done in popular LSM implementations but it hasn't been called out as a feature.<br /><br /><b>Stepped Merge</b><br /><br />The <a href="https://www.cse.iitb.ac.in/~sudarsha/Pubs-dir/indexbuffering-vldb97.pdf">Stepped Merge paper</a> might have been the first description of tiered compaction. It is a way to improve B-Tree insert performance. It looked like an LSM tree with a few sorted runs at each level. When a level was full the sorted runs at that level were merged to create a larger sorted run in the next level. The per-level <a href="http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html">write-amplification</a> was 1 because compaction into level N+1 merged runs from level N but did not read/rewrite a run already on level N+1.<br /><br />This <a href="http://smalldatum.blogspot.com/2018/09/review-of-slimdb-from-vldb-2018.html">looks like</a> tiered compaction. However it allows for N sorted runs on the max level which means that space-amplification will be >= N. I assume that is too much for most users of tiered compaction in Cassandra, RocksDB and HBase. But this isn't a problem for Stepped Merge because it is an algorithm for buffering changes to a B-Tree, not for storing the entire database and it doesn't lead to a large space-amp for that workload. Note that the <a href="https://dev.mysql.com/doc/refman/5.5/en/innodb-insert-buffering.html">InnoDB change buffer</a> is a B-Tree that buffers changes to other B-Trees for a similar reason.<br /><br /><b>Compaction per level</b><br /><br />I prefer to define compaction as a property of a level in an LSM tree rather than a property of the LSM tree. Unfortunately this can hamper discussion because it takes more time and text to explain compaction per level.<br /><br />I will start with definitions:<br /><ol><li>When a level is full then compaction is done from it to the next larger level. 
For now I ignore compaction across many levels, but that is a thing (see "major compaction" in HBase).</li><li>A sorted run is a sequence of key-value pairs stored in key order. It is stored using 1+ files.</li><li>A level is tiered when compaction into it doesn't read/rewrite sorted runs already in that level. </li><li>A level is leveled when compaction into that level reads/rewrites sorted runs already in that level.</li><li>Levels are full when they have a configurable number of sorted runs. In <a href="http://smalldatum.blogspot.com/2018/06/the-original-lsm-paper.html">classic leveled compaction</a> a level has one sorted run. A tiered level is full when it has X sorted runs where X is some value >= 2. </li><li><a href="http://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html">leveled-N</a> uses leveled compaction which reads/rewrites an existing sorted run, but it allows N sorted runs (full when runs == N) rather than 1. </li><li>The per level fanout is size(sorted-run in level N) / size(sorted-run in level N-1)</li><li>The total fanout is the product of the per level fanouts. When the write buffer is 1G and the database is 1000G then the total fanout must be 1000.</li><li>The runs-per-level is the number of sorted runs in a level when it is full.</li><li>The per level write-amplification is the work done to compact from Ln to Ln+1. It is 1 for tiered, all-size(Ln+1) / all-size(Ln) for leveled and run-size(Ln+1) / all-size(Ln) for leveled-N where run-size is the size of a sorted run and all-size is the sum of the sizes of all sorted runs on a level.</li></ol><div>A level can be described by a 3-tuple (c, f, r) where c is the type of compaction (T or L for tiered or leveled), f is the fanout and r is the runs-per-level. Unfortunately, now we have made the description of an LSM tree even more complex because there is a 3-tuple per level. For now I don't use 3-tuples to describe the write buffer (memory component). 
That is a topic for another post. Example 3-tuples include:</div><div><ul><li>T:1:4 - this is tiered with fanout=1 and runs-per-level=4. It is a common configuration for the RocksDB level 0 (L0) where the fanout is 1 because the compaction input is a write buffer flush so the size of a sorted run in L0 is similar to the size of a full write buffer. For now I ignore that RocksDB can merge write buffers on a flush.</li><li>T:8:8 - this is tiered with fanout=8 and runs-per-level=8. When Ln and Ln+1 both use tiered then runs-per-level in Ln == fanout in Ln+1. </li><li>T:8:4 - this is tiered with fanout=8 and runs-per-level=4. It might be used when the next larger level uses leveled and runs-per-level on this level can be smaller than fanout to reduce read-amp.</li><li>L:10:1 - this is common in RocksDB with leveled compaction, fanout=10 and runs-per-level=1</li><li>L:10:2 - this is leveled-N with runs-per-level=2</li></ul></div><div><b><br /></b><b>Compaction per LSM tree</b></div><div><br />An LSM tree can be described using the per level 3-tuples from the previous section. The following are examples for popular algorithms.<br /><br /><a href="http://smalldatum.blogspot.com/2018/06/the-original-lsm-paper.html">Classic LSM</a> with total fanout = 1000 is:<br /><ul><li>C0 is the write buffer</li><li>C1, C2 and C3 are L:10:1</li></ul></div><div>RocksDB leveled with total fanout = 1000 is:</div><div><ul><li>L0 is T:1:4</li><li>L1 is L:1:1</li><li>L2, L3, L4 are L:10:1</li></ul><div>Stepped Merge with total fanout = 1000 is:<br /><ul><li>L1 is T:1:10</li><li>L2, L3, L4 are T:10:10</li></ul><div>Tiered in HBase and Cassandra with total fanout = 1000 might be:<br /><ul><li>L1 is T:1:10</li><li>L2, L3 are T:10:10</li><li>L4 is L:10:1</li></ul></div></div></div><div><b><br /></b><b>Tiered+leveled</b><br /><br />Note that some smaller levels using tiered and some larger levels using leveled is done by both RocksDB leveled and Cassandra/HBase tiered. 
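The example trees above can be written down directly with the 3-tuple notation and checked programmatically. A sketch using the post's encoding (the names and list layout are mine, nothing here comes from RocksDB or HBase themselves):

```python
# Example trees from the post as (compaction, fanout, runs-per-level)
# triples, smallest level first. The write buffer is omitted, as in the post.
trees = {
    "rocksdb_leveled": [("T", 1, 4), ("L", 1, 1), ("L", 10, 1),
                        ("L", 10, 1), ("L", 10, 1)],
    "stepped_merge":   [("T", 1, 10), ("T", 10, 10), ("T", 10, 10),
                        ("T", 10, 10)],
    "hbase_tiered":    [("T", 1, 10), ("T", 10, 10), ("T", 10, 10),
                        ("L", 10, 1)],
}

def total_fanout(tree):
    """Total fanout is the product of the per-level fanouts."""
    product = 1
    for _compaction, fanout, _runs in tree:
        product *= fanout
    return product

for name, tree in trees.items():
    print(name, total_fanout(tree))  # each example has total fanout 1000
```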
I think both of these are examples of an LSM variant that I call tiered+leveled but I won't ask any of the projects to update their docs. My definition of tiered+leveled is that the smallest (1 or more) levels use tiered compaction, then 0 or more levels use leveled-N, then the remaining levels use leveled. When tiered=T, leveled=L and leveled-N=N then the regex for this is T+N*L+.<br /><br />I don't think it is a good idea for leveled levels to precede tiered levels in tiered+leveled (TTL is OK, LTL is not) but that is a topic for another post.<br /><br />The largest level should use leveled compaction with runs-per-level=1 to avoid too much space amplification.<br /><br />LSM trees with tiered+leveled can be described using 3-tuples and the previous section does that but here I provide one for a tree that uses leveled-N for L1 and L2 with total fanout = 1000:<br /><ul><li>L0 is T:1:4</li><li>L1 is L:1:2</li><li>L2 is L:10:2</li><li>L3, L4 are L:10:1</li></ul><br />And another example to show that tiered levels don't have to use the same fanout or runs-per-level, but runs-per-level for Ln == fanout for Ln+1:<br /><ul><li>L0 is T:1:20</li><li>L1 is T:20:10</li><li>L2 is T:10:2</li><li>L3 is L:5:1</li></ul><b><br /></b><b>Leveled-N</b><br /><br />Leveled-N can reduce the per level write-amp at the cost of increasing the per level read-amp.<br /><br />The regex for an LSM tree that uses leveled-N is N+L+ (see the previous section). The largest level should use leveled compaction with runs-per-level=1 to avoid too much space amplification. 
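The shape rule above can be enforced with that regex directly. A minimal sketch, where a string encodes one level per letter with the smallest level first:

```python
import re

# T = tiered, N = leveled-N, L = leveled, matching the post's regex T+N*L+:
# tiered levels first, then optional leveled-N levels, then leveled levels.
SHAPE = re.compile(r"T+N*L+")

def valid_shape(shape):
    """True when the per-level compaction string is an allowed tree shape."""
    return SHAPE.fullmatch(shape) is not None

for shape in ("TTLL", "TNLL", "TTL", "LTL", "LTTL"):
    print(shape, valid_shape(shape))  # the L-before-T shapes are rejected
```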
An example 3-tuple for leveled-N with fanout=1000 that has runs-per-level=2 for L1 and L2 is:<br /><ul><li>L1 is L:10:2</li><li>L2 is L:10:2</li><li>L3 is L:10:1</li></ul></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-8269597228709268102018-10-01T10:44:00.000-07:002018-10-02T16:31:20.680-07:00Transaction Processing in NewSQLThis is a list of references for transaction processing in NewSQL systems. The work is exciting. I don't have much to add and wrote this to avoid losing interesting links. My focus is on OLTP, but some of these systems support more than that.<br /><br />By NewSQL I mean the following. I am not trying to define "NewSQL" for the world:<br /><ol><li>Support for multiple nodes because the storage/compute on one node isn't sufficient.</li><li>Support for SQL with ACID transactions. If there are shards then cross-shard operations can be consistent and isolated.</li><li>Replication does not prevent the properties listed above when you are willing to pay the price in commit overhead. Alas, synchronous geo-replication is slow and too-slow commit is another form of downtime. I hope NewSQL systems make this less of a problem (async geo-replication for some or all commits, commutative operations). Contention and conflict are common in OLTP and it is important to understand the minimal time between commits to a single row or the max number of commits/second to a single row.</li></ol>NewSQL Systems<br /><ul><li><a href="https://en.wikipedia.org/wiki/MySQL_Cluster">MySQL Cluster</a> - this was NewSQL before NewSQL was a thing. There is a <a href="https://www.amazon.com/MySQL-Cluster-7-5-Inside-Out/dp/9176998142">nice book</a> that explains the internals. There is <a href="https://www.hops.io/">a company</a> that uses it to <a href="https://medium.com/@jim_dowling/introducing-hops-hadoop-120c30d02676">make HDFS better</a>. 
Cluster seems to be more popular for uses other than web-scale workloads.</li><li><a href="https://en.wikipedia.org/wiki/VoltDB">VoltDB</a> - another early NewSQL system that is still <a href="http://voltdb.com/">getting better</a>. It was after MySQL Cluster but years before Spanner and came out of the <a href="http://hstore.cs.brown.edu/">H-Store research effort</a>.</li><li><a href="https://en.wikipedia.org/wiki/Spanner_(database)">Spanner</a> - XA across-shards, Paxos across replicas, special hardware to reduce clock drift between nodes. Sounds amazing, but this is Google so it just works. See the papers that explain <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf">the system</a> and <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46103.pdf">support for SQL</a>. This got the NewSQL movement going.</li><li>CockroachDB - the answer to implementing Spanner <a href="https://www.cockroachlabs.com/blog/living-without-atomic-clocks/">without GPS and atomic clocks</a>. From that URL they explain it as "while Spanner always waits after writes, CockroachDB sometimes waits before reads". It uses RocksDB and they help make it better.</li><li>FaunaDB - FaunaDB is inspired by Calvin and Daniel Abadi explains the difference between it and Spanner -- <a href="https://fauna.com/blog/distributed-consistency-at-scale-spanner-vs-calvin">here</a> and <a href="http://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html">here</a>. Abadi is great at explaining distributed systems, see his work <a href="http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html">on PACELC</a> (and <a href="http://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf">the pdf</a>). A key part of Calvin is that "Calvin uses preprocessing to order transactions. All transactions are inserted into a distributed, replicated log before being processed." 
This approach might limit the peak TPS on a large cluster, but I assume that doesn't matter for a large fraction of the market.</li><li>YugaByte - another <a href="https://docs.yugabyte.com/latest/architecture/concepts/persistence/">user of RocksDB</a>. There is much discussion about it in the <a href="http://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html">recent Abadi post</a>. Their docs are amazing -- <a href="https://www.slideshare.net/YugaByte/yugabyte-db-architecture-storage-engine-and-transactions">slides</a>, <a href="https://docs.yugabyte.com/latest/architecture/transactions/transactional-io-path/">transaction IO path</a>, <a href="https://docs.yugabyte.com/latest/architecture/core-functions/write-path/">single-shard write IO path</a>, <a href="https://docs.yugabyte.com/latest/architecture/transactions/distributed-txns/">distributed ACID</a> and <a href="https://docs.yugabyte.com/latest/architecture/transactions/single-row-transactions/">single-row ACID</a>.</li><li><a href="https://github.com/pingcap/tidb">TiDB</a> - I don't know much about it but they are <a href="https://techcrunch.com/2018/09/11/tidb-developer-pingcap-wants-to-expand-in-north-america-after-raising-50m-series-c/">growing fast</a> and are part of the <a href="https://www.percona.com/live/17/sessions/tidb-newsql-database-compatible-mysql">MySQL community</a>. It uses RocksDB (I shouldn't have forgotten that).</li></ul>Other relevant systems<br /><ul><li><a href="https://www.foundationdb.org/">FoundationDB</a> - I am curious where this goes given the competition explained above.</li><li><a href="https://en.wikipedia.org/wiki/Amazon_Aurora">Aurora</a> - not NewSQL yet because this doesn't scale across nodes. It does support large nodes and that might be sufficient for a large part of the market. 
But Amazon moves fast (see the <a href="https://aws.amazon.com/blogs/aws/new-parallel-query-for-amazon-aurora/">new parallel query feature</a>) so I wouldn't be surprised if this became NewSQL one day. I appreciate that they have begun to explain the internals -- <a href="https://dl.acm.org/citation.cfm?id=3056101">here</a> and <a href="https://dl.acm.org/citation.cfm?id=3196937">here</a>.</li><li>MongoDB - not SQL, but starting to get interesting with the new features for <a href="http://smalldatum.blogspot.com/2015/10/losing-it.html">read and write concerns</a>. There is also new support for <a href="https://docs.mongodb.com/manual/core/read-isolation-consistency-recency/#causal-consistency">causal consistency</a> and <a href="https://docs.mongodb.com/manual/core/retryable-writes/">retryable writes</a>.</li><li>Clustrix - a NewSQL system that is now <a href="https://techcrunch.com/2018/09/20/mariadb-acquires-clusterix/">part of MariaDB</a>. Maybe this becomes open source.</li><li><a href="http://kudu.apache.org/">Kudu</a> - <a href="https://kudu.apache.org/kudu.pdf">awesome paper</a>, interesting <a href="http://users.ece.utexas.edu/~garg/pdslab/david/hybrid-time-tech-report-01.pdf">research on HybridTime</a>, useful docs <a href="https://kudu.apache.org/docs/transaction_semantics.html#1">on the internals</a>.</li><li><a href="https://vitess.io/overview/">Vitess</a> - was created to scale MySQL for Youtube. Now is part of CNCF, backed by a startup and used by many companies. Cross-shard <a href="https://vitess.io/user-guide/twopc/">writes are atomic</a>, but isolation is weaker.</li><li><a href="https://www.splicemachine.com/">Splice Machine</a> - SQL on HBase. Summary is "100% ACID via snapshot isolation with optimistic concurrency via write-write conflicts" and details <a href="https://doc.splicemachine.com/developers_fundamentals_transactions.html">are here</a>. 
Has integration to use Spark for OLAP, so <a href="https://www.splicemachine.com/defining-htap/">this is HTAP</a>.</li></ul>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-545027977452958192018-09-19T15:01:00.002-07:002018-09-19T15:01:43.963-07:00Durability debtI define <i>durability debt</i> to be the amount of work that must be done to persist changes that have been applied to a database. Dirty pages must be written back for a B-Tree. Compaction must be done for an LSM. Durability debt has IO and CPU components. The common IO overhead is from writing something back to the database. The common CPU overhead is from computing a checksum and optionally from compressing data.<br /><br />From an incremental perspective (pending work per modified row) an LSM usually has less IO and more CPU durability debt than a B-Tree. From an absolute perspective the maximum durability debt can be much larger for an LSM than a B-Tree, which is one reason why tuning can be more challenging for an LSM than a B-Tree.<br /><br />In this post by LSM I mean LSM with leveled compaction.<br /><br /><b>B-Tree</b><br /><br />The maximum durability debt for a B-Tree is limited by the size of the buffer pool. If the buffer pool has N pages then there will be at most N dirty pages to write back. If the buffer pool is 100G then there will be at most 100G to write back. The IO is more random or less random depending on whether the B-Tree is update-in-place, <a href="http://smalldatum.blogspot.com/2015/08/different-kinds-of-copy-on-write-for-b.html">copy-on-write random or copy-on-write sequential</a>. I prefer to describe this as small writes (page at a time) or large writes (many pages grouped into a larger block) rather than random or sequential. InnoDB uses small writes and WiredTiger uses larger writes. 
The distinction between small writes and large writes is more important with disks than with SSD.<br /><br />There is a small CPU overhead from computing the per-page checksum prior to write back. There can be a larger CPU overhead from compressing the page. Compression isn't popular with InnoDB but is popular with WiredTiger.<br /><br />There can be an additional IO overhead when torn-write protection is enabled as provided by the InnoDB <a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-doublewrite-buffer.html">double write buffer</a>.<br /><br /><b>LSM</b><br /><br />The durability debt for an LSM is the work required to compact all data into the max level (Lmax). A byte in the write buffer causes more debt than a byte in the L1 because more work is needed to move the byte from the write buffer to Lmax than from L1 to Lmax.<br /><br />The maximum durability debt for an LSM is limited by the size of the storage device. Users can configure RocksDB such that the level 0 (L0) is huge. Assume that the database needs 1T of storage were it compacted into one sorted run and the write-amplification to move data from the L0 to the max level (Lmax) is 30. Then the maximum durability debt is 30 * sizeof(L0). The L0 is usually configured to be <= 1G in which case the durability debt from the L0 is <= 30G. But were the L0 configured to be <= 1T then the debt from it could grow to 30T.<br /><br />I use the notion of per-level write-amp to explain durability debt in an LSM. Per-level write-amp is defined in the next section. Per-level write-amp is a proxy for all of the work done by compaction, not just the data to be written. When the per-level write-amp is X then for compaction from Ln to Ln+1 for every key-value pair from Ln there are ~X key-value pairs from Ln+1 for which work is done including:<br /><ul><li>Read from Ln+1. If Ln is a small level then the data is likely to be in the LSM block cache or OS page cache. Otherwise it is read from storage. 
Some reads will be cached, all writes go to storage. So the write rate to storage is > the read rate from storage.</li><li>For each block not in the LSM block cache, the key-value pairs are decompressed if the level is compressed.</li><li>The key-value pairs from Ln+1 are merged with Ln. Note that this is a merge, not a merge sort, because the inputs are ordered. The number of comparisons might be less than you expect because one iterator is ~X times larger than the other and there are optimizations for that.</li></ul><div>The output from the merge is then compressed and written back to Ln+1. Some of the work above (reads, decompression) is also done for Ln but most of the work comes from Ln+1 because it is many times larger than Ln. I stated above that an LSM usually has less IO and more CPU durability debt per modified row. The extra CPU overheads come from decompression and the merge. I am not sure whether to count the compression overhead as extra.<br /><br />Assuming the per-level growth factor is 10 and f is 0.7 (see below) then the per-level write-amp is 7 for L1 and larger levels. If sizeof(L1) == sizeof(L0) then the per-level write-amp is 2 for the L0. And the per-level write-amp is always 1 for the write buffer.<br /><br />From this we can estimate the pending write-amp for data at any level in the LSM tree.</div><div><ol><li>Key-value pairs in the write buffer have the most pending write-amp. Key-value pairs in the max level (L5 in this case) have none. Key-value pairs in the write buffer are further from the max level. 
</li><li>Starting with the L2 there is more durability debt from a full Ln+1 than a full Ln -- while there is more pending write-amp for Ln, there is more data in Ln+1.</li><li>Were I given the choice of L1, L2, L3 and L4 when first placing a write in the LSM tree then I would choose L4 as that has the least pending write-amp.</li><li>Were I to choose to make one level 10% larger then I prefer to do that for a smaller level given the values in the <i>rel size X pend</i> column.</li></ol></div><br /><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace; font-size: small;">legend:</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace; font-size: small;">w-amp per-lvl : per-level write-amp</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace; font-size: small;">w-amp pend<span class="Apple-converted-space"> </span>: write-amp to move byte to Lmax from this level</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace; font-size: small;">rel size<span class="Apple-converted-space"> </span>: size of level relative to write buffer</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace; font-size: small;">rel size X pend : write-amp to move all data from that level to Lmax</span></span></div><div class="p2"><span style="font-family: Courier New, Courier, monospace; font-size: small;"><span class="s1"></span><br /></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace; font-size: small;"><span class="Apple-converted-space"> </span>w-amp <span class="Apple-converted-space"> </span>w-amp <span class="Apple-converted-space"> </span>rel <span class="Apple-converted-space"> </span>rel size<span class="Apple-converted-space"> </span></span></span></div><div class="p1"><span 
class="s1"><span style="font-family: Courier New, Courier, monospace; font-size: small;">level <span class="Apple-converted-space"> </span>per-lvl pend<span class="Apple-converted-space"> </span>size<span class="Apple-converted-space"> </span>X pend</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace; font-size: small;">----- <span class="Apple-converted-space"> </span>------- ----- <span class="Apple-converted-space"> </span>----- <span class="Apple-converted-space"> </span>--------</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace; font-size: small;">wbuf<span class="Apple-converted-space"> </span>1 <span class="Apple-converted-space"> </span>31<span class="Apple-converted-space"> </span>1<span class="Apple-converted-space"> </span>31 <span class="Apple-converted-space"> </span></span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace; font-size: small;">L0<span class="Apple-converted-space"> </span>2 <span class="Apple-converted-space"> </span>30<span class="Apple-converted-space"> </span>4 <span class="Apple-converted-space"> </span>120<span class="Apple-converted-space"> </span></span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace; font-size: small;">L1<span class="Apple-converted-space"> </span>7 <span class="Apple-converted-space"> </span>28<span class="Apple-converted-space"> </span>4 <span class="Apple-converted-space"> </span>112<span class="Apple-converted-space"> </span></span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace; font-size: small;">L2<span class="Apple-converted-space"> </span>7 <span class="Apple-converted-space"> </span>21 <span class="Apple-converted-space"> </span>40 <span class="Apple-converted-space"> </span>840<span class="Apple-converted-space"> 
</span></span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace; font-size: small;">L3<span class="Apple-converted-space"> </span>7 <span class="Apple-converted-space"> </span>14<span class="Apple-converted-space"> </span>400<span class="Apple-converted-space"> </span>5600 <span class="Apple-converted-space"> </span></span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace; font-size: small;">L4<span class="Apple-converted-space"> </span>7 <span class="Apple-converted-space"> </span>7<span class="Apple-converted-space"> </span>4000 <span class="Apple-converted-space"> </span>28000<span class="Apple-converted-space"> </span></span></span></div><div class="p1"> <style type="text/css">p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 13.0px Menlo; color: #000000; background-color: #ffffff} p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 13.0px Menlo; color: #000000; background-color: #ffffff; min-height: 15.0px} span.s1 {font-variant-ligatures: no-common-ligatures} </style> </div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace; font-size: small;">L5<span class="Apple-converted-space"> </span>0 <span class="Apple-converted-space"> </span>0 <span class="Apple-converted-space"> </span>40000 <span class="Apple-converted-space"> </span>0 <span class="Apple-converted-space"> </span></span></span></div><br /><b>Per-level write-amp in an LSM</b><br /><br />The per-level write-amplification is the work required to move data between adjacent levels. The per-level write-amp for the write buffer is 1 because a write buffer flush creates a new SST in L0 without reading/re-writing an SST already in L0.<br /><br />I assume that any key in Ln is already in Ln+1 so that merging Ln into Ln+1 does not make Ln+1 larger. 
This isn't true in real life, but this is a model.<br /><br />The per-level write-amp for Ln is approximately sizeof(Ln+1) / sizeof(Ln). For n=0 this is 2 with a typical RocksDB configuration. For n>0 this is the per-level growth factor and the default is 10 in RocksDB. Assume that the per-level growth factor is X. In reality the per-level write-amp is f*X rather than X, where f ~= 0.7. See this <a href="https://hyeontaek.com/papers/msls-fast2016.pdf">excellent paper</a> or examine the compaction IO stats from a production RocksDB instance. Too many excellent conference papers assume it is X when it is f*X in practice.<br /><br />The per-level write-amp for Lmax is 0 because compaction stops at Lmax.<br /><style type="text/css">p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 13.0px Menlo; color: #000000; background-color: #ffffff} span.s1 {font-variant-ligatures: no-common-ligatures} </style>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-1180114107383240152018-09-18T12:15:00.003-07:002018-09-18T12:15:59.271-07:00Bloom filter and cuckoo filterThe multi-level cuckoo filter (MLCF) <a href="http://smalldatum.blogspot.com/2018/09/review-of-slimdb-from-vldb-2018.html">in SlimDB</a> builds on the cuckoo filter (CF) so I read the <a href="https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf">cuckoo filter paper</a>. The big deal about the cuckoo filter is that it supports delete and a bloom filter does not. As far as I know the MLCF is updated when sorted runs arrive and depart a level -- so delete is required. 
A bloom filter in an LSM is per sorted run and delete is not required because the filter is created when the sorted run is written and dropped when the sorted run is unlinked.<br /><br />I learned of the blocked bloom filter from the cuckoo filter paper (see <a href="https://dl.acm.org/citation.cfm?id=1768582">here</a> or <a href="http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf">here</a>). RocksDB uses this but I didn't know it had a name. The benefit of it is to reduce the number of cache misses per probe. I was curious about the cost and while the math is complicated, the paper estimates a 10% increase in the false positive rate for a bloom filter with 8 bits/key and a 512-bit block, which is similar to a typical setup for RocksDB.<br /><br /><b>Space Efficiency</b><br /><br />I am always interested in things that use less space for filters and block indexes with an LSM, so I spent time reading <a href="https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf">the paper</a>. It is a great paper and I hope that more people read it. The cuckoo filter (CF) paper claims better space-efficiency than a bloom filter and the claim is repeated in the SlimDB paper as:<br /><blockquote class="tr_bq"><span style="background-color: white; color: #1d2129; white-space: pre-wrap;"><span style="font-family: "times" , "times new roman" , serif;"><i>However, by selecting an appropriate fingerprint size f and bucket size b, it can be shown that the cuckoo filter is more space-efficient than the Bloom filter when the target false positive rate is smaller than 3%</i></span></span></blockquote>The tl;dr for me is that the space savings from a cuckoo filter are significant when the false positive rate (FPR) is sufficiently small. But when the target FPR is 1% then a cuckoo filter uses about the same amount of space as a bloom filter.<br /><br />The paper has a lot of interesting math that I was able to follow. 
It provides formulas for the number of bits/key for a bloom filter, cuckoo filter and semisorted cuckoo filter. The semisorted filter uses 1 less bit/key than a regular cuckoo filter. The formulas, assuming E is the target false positive rate, b=4, and A is the load factor, are:<br /><ul><li>bloom filter: ceil(1.44 * log2(1/E))</li><li>cuckoo filter: ceil(log2(1/E) + log2(2b)) / A == ceil(log2(1/E) + 3) / A</li><li>semisorted cuckoo filter: ceil(log2(1/E) + 2) / A</li></ul><br />The target load factor is 0.95 (A = 0.95) and that comes at a cost in CPU overhead when creating the CF. Assuming A=0.95 then a bloom filter uses 10 bits/key, a cuckoo filter uses 10.53 and a semisorted cuckoo filter uses 9.47. So the cuckoo filter uses either 5% more or 5% less space than a bloom filter when the target FPR is 1%, which is a different perspective from the quote I listed above. Perhaps my math is wrong and I am happy for an astute reader to explain that.<br /><br />When the target FPR is 0.1% then a bloom filter uses 15 bits/key, a cuckoo filter uses 13.7 and a semisorted cuckoo filter uses 12.7. The savings from a cuckoo filter are larger here but the common configuration for a bloom filter in an LSM has been to target a 1% FPR. I won't claim that we have proven that FPR=1% is the best rate and recent <a href="http://daslab.seas.harvard.edu/monkey/">research on Monkey</a> has shown that we can do better when allocating space to bloom filters.<br /><br />The first graph shows the number of bits/key as a function of the FPR for a bloom filter (BF) and cuckoo filter (CF). The second graph shows the ratio of bits/key from BF versus bits/key from CF. The results for semisorted CF, which uses 1 less bit/key, are not included. For the second graph a CF uses less space than a BF when the value is greater than one. The graph covers FPR from 0.00001 to 0.09, which is 0.001% to 9%. 
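The bits/key formulas above can be checked with a short Python sketch (the function names are mine; the E=0.01 case reproduces the 10, 10.53 and 9.47 values quoted above):

```python
# Bits/key formulas from the cuckoo filter paper, as quoted above.
# E = target false positive rate, b = bucket size, A = load factor.
import math

def bloom_bits(E):
    return math.ceil(1.44 * math.log2(1 / E))

def cuckoo_bits(E, b=4, A=0.95):
    # ceil applies to the numerator; log2(2b) == 3 when b == 4
    return math.ceil(math.log2(1 / E) + math.log2(2 * b)) / A

def semisorted_cuckoo_bits(E, A=0.95):
    return math.ceil(math.log2(1 / E) + 2) / A

for E in (0.01, 0.001):
    print(E, bloom_bits(E),
          round(cuckoo_bits(E), 2), round(semisorted_cuckoo_bits(E), 2))
```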
R code to generate the graphs <a href="https://gist.github.com/mdcallag/ae8c23a64b038026f9634a123a221104">is here</a>.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-3Bk6bQY9WnA/W6FLQe0LY4I/AAAAAAAAXxs/AJoPYXqsNS4qFVk-3yvJeazddYkq6ePQQCEwYBhgL/s1600/bf%2Bvs%2Bcf.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="570" data-original-width="628" height="362" src="https://2.bp.blogspot.com/-3Bk6bQY9WnA/W6FLQe0LY4I/AAAAAAAAXxs/AJoPYXqsNS4qFVk-3yvJeazddYkq6ePQQCEwYBhgL/s400/bf%2Bvs%2Bcf.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-PqU7NMgid0c/W6FL4PJ0aSI/AAAAAAAAXxw/FUzz1uSalOkXFflrDeCHU8y_CzXAzledQCEwYBhgL/s1600/bf2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="570" data-original-width="628" height="362" src="https://2.bp.blogspot.com/-PqU7NMgid0c/W6FL4PJ0aSI/AAAAAAAAXxw/FUzz1uSalOkXFflrDeCHU8y_CzXAzledQCEwYBhgL/s400/bf2.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"></div><br /><br /><b>CPU Efficiency</b><br /><br />From the paper there is more detail on CPU efficiency in table 3, figure 5 and figure 7. Table 3 has the speed to create a filter, but the filter is much larger (192MB) than a typical per-run filter with an LSM and there will be more memory system stalls in that case. Regardless the blocked bloom filter has the least CPU overhead during construction.<br /><br />Figure 5 shows the lookup performance as a function of the hit rate. Fortunately performance doesn't vary much with the hit rate. 
The cuckoo filter is faster than the blocked bloom filter and the blocked bloom filter is faster than the semisorted cuckoo filter.<br /><br />Figure 7 shows the insert performance as a function of the cuckoo filter load factor. The CPU overhead per insert grows significantly when the load factor exceeds 80%.Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-15408781323155603302018-09-13T14:35:00.000-07:002018-09-13T14:35:12.193-07:00Review of SlimDB from VLDB 2018SlimDB is a paper worth reading from <a href="http://www.vldb.org/pvldb/vol10/p2037-ren.pdf">VLDB 2018</a>. The highlights from the paper are that it shows:<br /><ol><li>How to use less memory for filters and indexes with an LSM</li><li>How to reduce the CPU penalty for queries with tiered compaction</li><li>The benefit of more diversity in LSM tree shapes</li></ol><div><b>Overview</b></div><div><br /><a href="http://smalldatum.blogspot.com/2018/03/cache-amplification.html">Cache amplification</a> has become more important as database:RAM ratios increase. With SSD it is possible to attach many TB of usable data to a server for OLTP. By usable I mean that the SSD has enough IOPs to access the data. But it isn't possible to grow the amount of RAM per server at that rate. Many of the early RocksDB workloads used database:RAM ratios that were about 10:1, and everything but the max level (Lmax) of the LSM tree was in memory. As the ratio grows that won't be possible unless filters and block indexes use less memory. SlimDB does that via three-level block indexes and multi-level cuckoo filters.</div><div><br /></div><div><a href="http://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html">Tiered compaction</a> uses more CPU and IO for point and range queries because there are more places to check for data when compared to leveled compaction. 
The multi-level cuckoo filter in SlimDB reduces the CPU overhead for point queries as there is only one filter to check per level rather than one per sorted run per level.</div><div><br />The SlimDB paper shows the value of hybrid LSM tree shapes, combinations of tiered and leveled, and then how to choose the best combination based on IO costs. Prior to this year, hybrid didn't get much discussion -- the choices were usually tiered or leveled. While RocksDB and LevelDB with the L0 have always been hybrids of tiered (L0) and leveled (L1 to Lmax), we rarely discuss that. But more diversity in LSM tree shape means more complexity in tuning, and the SlimDB solution is to make a cost-based decision (cost == IO overhead) subject to a constraint on the amount of memory to use.<br /><br />This has been a great two years for storage engine efficiency. First we had several papers from <a href="https://stratos.seas.harvard.edu/">Harvard DASLab</a> that have begun to explain cost-based algorithm design and engine configuration, and SlimDB continues in that tradition. I have much more reading to do starting with <a href="https://stratos.seas.harvard.edu/publications/periodic-table-data-structures">The Periodic Table of Data Structures</a>.<br /><br />Below I review the paper. Included with that is some criticism. Papers can be great without being perfect. This paper is a major contribution and worth reading.</div><div><br /></div><div><b>Semi-sorted</b></div><div><br /></div><div>The paper starts by explaining the principle of semi-sorted data. When the primary key can be split into two parts -- prefix and suffix -- there are some workloads that don't need data ordered over the entire primary key (prefix + suffix). 
Semi-sorted supports queries that fetch all data that matches the prefix of the PK while still enforcing uniqueness for the entire PK. The PK can be on (a,b,c,d) and (a,b) is prefix and queries are like "a=X and b=Y" without predicates on (c,d) that require index ordering. SlimDB takes advantage of this to use less space for the block index. <br /><br />There are many use cases for this, but the paper cites Linkbench which isn't correct. See the <a href="https://research.fb.com/publications/linkbench-a-database-benchmark-based-on-the-facebook-social-graph/">Linkbench</a> and <a href="https://research.fb.com/publications/tao-facebooks-distributed-data-store-for-the-social-graph-2/">Tao</a> papers for queries that do an exact match on the prefix but only want the top-N rows in the result. So ordering on the suffix is required to satisfy query response time goals when the total number of rows that match the prefix is much larger than N. I assume this issue with top-N is important for other social graph workloads because some graph nodes are popular. Alas, things have changed with the social graph workload since those papers were published and I hope the changes are explained one day.<br /><br />Note that MyRocks can use a prefix bloom filter to support some range queries with composite indexes. Assume the index is on (a,b,c) and the query has a=X and b=Y order by c limit 10. A prefix bloom on (a,b) can be used for such a query.</div><div><br /></div><div><b>Stepped Merge</b></div><div><br /></div><div>The paper implements tiered compaction but calls it stepped merge. I didn't know about the stepped merge paper prior to reading the SlimDB paper. I assume that people who chose the name tiered might also have missed <a href="https://www.cse.iitb.ac.in/~sudarsha/Pubs-dir/indexbuffering-vldb97.pdf">that paper</a>.<br /><br />LSM compaction algorithms haven't been formally defined. 
I tried to advance the definitions in a <a href="http://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html">previous post</a>. One of the open issues for tiered is whether it requires only one sorted run at the max level or allows for N runs at the max level. With N runs at the max level the space-amplification is at least N, which is too much for many workloads. With 1 run at the max level compaction into the max level is always leveled rather than tiered -- the max level is read/rewritten and the per-level write-amplification from that is larger than 1 (while the per-level write-amp from tiered == 1). With N runs at the max level many of the compaction steps into the max level can be tiered, but some will be leveled -- when the max level is full (has N runs) then something must be done to reduce the number of runs.</div><div><br /></div><div><b>3-level block index</b><br /><br />Read the paper. It is complex and a summary by me here won't add value. It uses an Entropy Coded Trie (ECT) that builds on ideas from <a href="https://www.cs.cmu.edu/~dga/papers/silt-sosp2011.pdf">SILT</a> -- another great paper from CMU.<br /><br />ECT uses ~2 bits/key versus at least 8 bits/key for LevelDB for the workloads they considered. This is a great result. ECT also uses 5X to 7X more CPU per lookup than LevelDB, which means you might limit the use of it to the largest levels of the LSM tree -- because those use the most memory and are the place where we are willing to spend CPU to save memory.</div><div><br /></div><div><b>Multi-level cuckoo filter</b></div><div><br />SlimDB can use a cuckoo filter for leveled levels of the LSM tree and a multi-level cuckoo filter for tiered levels. Note that leveled levels have one sorted run and tiered levels have N sorted runs. 
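The per-level write-amp contrast between tiered and leveled can be made concrete with a small sketch. This is my own illustration, reusing the growth factor 10 and f ~= 0.7 estimates from the durability debt post above; the SlimDB paper's cost model differs in details:

```python
# Total write-amp for a hybrid LSM shape: each level past the write
# buffer is 'tiered' (per-level write-amp == 1) or 'leveled'
# (per-level write-amp ~= f * growth_factor). Illustrative only.

def total_write_amp(shape, growth_factor=10, f=0.7):
    wamp = 1.0  # write buffer flush has per-level write-amp of 1
    for kind in shape:
        wamp += 1 if kind == 'tiered' else f * growth_factor
    return wamp

print(total_write_amp(['leveled'] * 4))                             # all leveled
print(total_write_amp(['tiered'] * 4))                              # all tiered
print(total_write_amp(['tiered', 'tiered', 'leveled', 'leveled']))  # hybrid
```

The all-tiered shape has the least write-amp but the most sorted runs to check per query, which is where the multi-level cuckoo filter helps.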
SlimDB and the Stepped Merge paper use the term sub-levels, but I prefer N sorted runs.<br /><br />The cuckoo filter is used in place of a bloom filter to save space given target false positive rates of less than 3%. The paper has examples where the cuckoo filter uses 13 bits/key (see Table 1) and a bloom filter with 10 bits/key (RocksDB default) has a false positive rate of much less than 3%. It is obvious that I need to read another interesting CMU paper cited by SlimDB -- <a href="https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf">Cuckoo Filter Practically Better than Bloom</a>.<br /><br />The multi-level cuckoo filter (MLCF) extends the cuckoo filter by using a few bits/entry to name the sub-level (sorted run) in the level that might contain the search key. With tiered and a bloom filter per sub-level (sorted run) a point query must search a bloom filter per sorted run. With the MLCF there is only one search per level (if I read the paper correctly).<br /><br />The MLCF might go a long way to reduce the point-query CPU overhead when using many sub-levels which is a big deal. While a filter can't be used for general range queries, SlimDB doesn't support general range queries. Assuming the PK is on (a,b,c,d) and the prefix is (a,b) then SlimDB supports range queries like fetch all rows where a=X and b=Y. It wasn't clear to me whether the MLCF could be used in that case. But many sub-levels can create more work for range queries as iterators must be positioned in each sub-level in the worst case and that is more work.<br /><br />This statement from the end of the paper is tricky. SlimDB allows for an LSM tree to use leveled compaction on all levels, tiered on all levels or a hybrid. When all levels are leveled, then performance should be similar to RocksDB with leveled, when all or some levels are tiered then write-amplification will be reduced at the cost of read performance and the paper shows that range queries are slower when some levels are tiered. 
Lunch isn't free as the <a href="http://daslab.seas.harvard.edu/rum-conjecture/">RUM Conjecture</a> asserts.<blockquote class="tr_bq"><span class="s1"><span style="font-family: Times, Times New Roman, serif; font-size: small;"><i>In contrast, with the support of dynamic use of a stepped merge algorithm and optimized in-memory indexes, SlimDB minimizes write amplification without sacrificing read performance.</i></span></span></blockquote>The memory overhead for MLCF is ~2 bits. I am not sure this was explained by the paper but that might be to name the sub-level, in which case there can be at most 4 sub-levels per level and the cost would be larger with more sub-levels.</div><div><br /></div><div>The paper didn't explain how the MLCF is maintained. With a bloom filter per sorted run the bloom filter is created when SST files are created during compaction and memtable flush. This is an offline or batch computation. But the MLCF covers all the sub-levels (sorted runs) in a level. And the sub-levels in a level arrive and depart one at a time, not at the same time. They arrive as output from compaction and depart when they were compaction input. The arrival or departure of a new sub-level requires incremental changes to the MLCF. </div><div><br /><b>LSM tree shapes</b></div><div><br />For too long there has not been much diversity in LSM tree shapes. The usual choice was all tiered or all leveled. RocksDB leveled is really a hybrid -- tiered for L0, leveled for L1 to Lmax. But the SlimDB paper makes the case for more diversity. It explains that some levels (smaller ones) can be tiered while the larger levels can be leveled. And the use of multi-level cuckoo filters, three-level indexes and cuckoo filters is also a decision to make per-level.<br /><br />Even more interesting is the use of a cost-model to choose the best configuration subject to a constraint -- the memory budget. 
They enumerate a large number of LSM tree configurations, generate estimated IO-costs per operation (write-amp, IO per point query that returns a row, IO per point query that doesn't return a row, memory overhead) and then the total IO cost is computed for a workload -- where a workload specifies the frequency of each operation (for example - 30% writes, 40% point hits, 30% point misses).<br /><br />The <a href="https://stratos.seas.harvard.edu/publications/dostoevsky-better-space-time-trade-offs-lsm-tree-based-key-value-stores">Dostoevsky paper</a> also makes the case for more diversity and uses rigorous models to show how to choose the best LSM tree shape.<br /><br />I think this work is a big step in the right direction, although cost models must be expanded to include CPU overheads, and constraints expanded to include the maximum write and space amplification that can be tolerated.<br /><br />I disagree with a statement from the related work section. We can already navigate some of the read, write and space amplification space, but I hope there is more flexibility in the future. RocksDB tuning is complex in part to support this via changing the number of levels (or growth factor per level), enabling/disabling the bloom filter, using different compression (or none) on different levels, changing the max space amplification allowed, changing the max number of sorted runs in the L0 or max number of write buffers, changing the L0:L1 size ratio, changing the number of bloom filter bits/key. 
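The cost-based selection described above can be sketched in a few lines of Python. The per-operation costs below are invented placeholders (the paper derives them from the LSM shape and filter choices, and also applies a memory-budget constraint that this sketch omits); only the search structure mirrors the paper's approach:

```python
# Pick the cheapest per-level tiered/leveled assignment for a workload.
# Cost numbers are made up for illustration.
from itertools import product

def costs(shape):
    n_tiered = shape.count('tiered')
    n_leveled = len(shape) - n_tiered
    write_amp = 1 + n_tiered + 7 * n_leveled  # tiered ~1, leveled ~7 per level
    point_hit = 1.0                           # ~1 IO to read the row
    point_miss = 0.1 * n_tiered               # filter misses from extra sorted runs
    return write_amp, point_hit, point_miss

def total_cost(shape, workload):
    fw, fh, fm = workload                     # fractions: writes, hits, misses
    w, h, m = costs(shape)
    return fw * w + fh * h + fm * m

workload = (0.30, 0.40, 0.30)                 # 30% writes, 40% hits, 30% misses
shapes = list(product(('tiered', 'leveled'), repeat=4))
best = min(shapes, key=lambda s: total_cost(s, workload))
print(best, total_cost(best, workload))
```

With these placeholder costs the write-heavy workload favors tiered everywhere; a workload dominated by point misses would shift the choice toward leveled levels.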
Of course I want more flexibility in the future while also making RocksDB easier to tune.<br /> <style type="text/css">p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 13.0px Menlo; color: #000000; background-color: #ffffff} span.s1 {font-variant-ligatures: no-common-ligatures} </style> </div><br /><div class="p1"></div><blockquote class="tr_bq"><span class="s1"><span style="font-family: Times, Times New Roman, serif; font-size: small;"><i>Existing LSM-tree based key-value stores do not allow trading among read cost, write cost and main memory footprint. </i></span></span></blockquote><span class="s1"><br /><span style="font-family: Times, Times New Roman, serif; font-size: small;"><b>Performance Results</b></span></span><br /><div class="p1"><span class="s1"><span style="font-family: Times, Times New Roman, serif; font-size: small;"><br /></span></span></div><div class="p1"><span style="font-family: Times, Times New Roman, serif; font-size: small;">Figuring out why X was faster than Y in academic papers is not my favorite task. I realize that space constraints are a common reason for the lack of details but I am wary of results that have not been explained and I know that mistakes can be made (note: don't use serializable with InnoDB). I make many mistakes myself. I am willing to provide advice for MyRocks, MySQL and RocksDB. AFAIK most authors who hack on RocksDB or compare with it for research are not reaching out to us. We are <a href="http://smalldatum.blogspot.com/2014/09/get-help-from-mba-mysql-benchmark.html">happy to help</a> in private.<br /><br />SlimDB was faster than RocksDB on their evaluation except for range queries. There were few details about the configurations used, so I will guess. First I assume that SlimDB used stepped merge with MLCF for most levels. I am not sure why point queries were faster with SlimDB than RocksDB. Maybe RocksDB wasn't configured to use bloom filters. 
Writes were about 4X faster with SlimDB because stepped merge (tiered) compaction was used: write-amplification was 4X less, and when IO is the bottleneck an approach with less write-amp will go faster.</span></div><br /><div><br /></div><br /><div></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-72116662933639721672018-09-05T06:57:00.003-07:002018-09-05T06:58:12.277-07:005 things to set when configuring RocksDB and MyRocksThe 5 options to set for RocksDB and MyRocks are:<br /><ol><li>block cache size</li><li>number of background threads</li><li>compaction priority</li><li>dynamic leveled compaction</li><li>bloom filters</li></ol><div>I have always wanted to do a "10 things" post but prefer to keep this list small. It is unlikely that RocksDB can provide a great default for the block cache size and number of background threads because they depend on the amount of RAM and number of CPU cores in a server. But I hope RocksDB or MyRocks are changed to get better defaults for the other three, which would shrink this list from 5 to 2.</div><div><br /><b>Options</b><br /><br /><a href="http://smalldatum.blogspot.com/2016/09/tuning-rocksdb-block-cache.html">My advice</a> on setting the size of the RocksDB block cache has not changed, assuming it is configured to use buffered IO (the default). With MyRocks this option is <a href="https://github.com/facebook/mysql-5.6/wiki/my.cnf-tuning">rocksdb_block_cache_size</a> and with RocksDB you will write a few lines of code to set up the LRU.</div><div><br />The number of background threads for flushing memtables and doing compaction is set by the option <a href="https://github.com/facebook/mysql-5.6/wiki/my.cnf-tuning">rocksdb_max_background_jobs</a> in MyRocks and <a href="https://github.com/facebook/rocksdb/blob/master/include/rocksdb/options.h#L498">max_background_jobs</a> in RocksDB. There used to be two options for this. 
While RocksDB can use async read-ahead and write-behind during compaction, it still uses synchronous reads and a potentially slow fsync/fdatasync call. Using more than 1 background job helps to overlap CPU and IO. A common configuration for me is number-of-CPU-cores / 4. With too few threads there will be more stalls from throttling. With too many threads the threads handling user queries might suffer.<br /><br />There are several strategies for choosing the next data to compact with leveled compaction in RocksDB. The strategy is selected via the <a href="https://github.com/facebook/rocksdb/blob/master/include/rocksdb/advanced_options.h#L499">compaction_pri</a> option in RocksDB. This is harder to set for MyRocks -- see compaction_pri in <a href="https://github.com/facebook/mysql-5.6/wiki/my.cnf-tuning">rocksdb_default_cf_options</a>. The default value is kByCompensatedSize, but the better choice is <a href="https://github.com/facebook/rocksdb/blob/master/include/rocksdb/advanced_options.h#L53">kMinOverlappingRatio</a>. With MyRocks the default is 0 and the better value is 3 (3 == kMinOverlappingRatio). I first wrote about <a href="http://smalldatum.blogspot.com/2016/02/compaction-priority-in-rocksdb.html">compaction_pri prior to</a> the arrival of kMinOverlappingRatio. Throughput is better and write amplification is reduced <a href="http://smalldatum.blogspot.com/2017/05/innodb-myrocks-and-tokudb-on-insert.html">with kMinOverlappingRatio</a>. An <a href="https://hyeontaek.com/papers/msls-fast2016.pdf">awesome paper</a> by Hyeontaek Lim et al. explains this.<br /><br />Leveled compaction in RocksDB limits the amount of data per level of the LSM tree. A great description of this <a href="https://rocksdb.org/blog/2015/07/23/dynamic-level.html">is here</a>. There is a target size per level, and this is enforced top down (smaller to larger levels) or bottom up (larger to smaller levels). 
With the bottom up approach the largest level has ~10X (or whatever the fanout is set to) more data than the next to last level. With the top down approach the largest level frequently has less data than the next to last level. I strongly prefer the bottom up approach to reduce <a href="http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-pick-2_23.html">space amplification</a>. This is enabled via the <a href="https://github.com/facebook/rocksdb/blob/master/include/rocksdb/advanced_options.h#L460">level_compaction_dynamic_level_bytes</a> option in RocksDB. It is harder to set for MyRocks -- see <a href="https://github.com/facebook/mysql-5.6/wiki/my.cnf-tuning">rocksdb_default_cf_options</a>.</div><div><br />Bloom filters are disabled by default for MyRocks and RocksDB. I prefer to use a bloom filter on all but the largest level. This is set via <a href="https://github.com/facebook/mysql-5.6/wiki/my.cnf-tuning">rocksdb_default_cf_options</a> with MyRocks. The reason for not using it with the max level is to consume less memory (reduce <a href="http://smalldatum.blogspot.com/2018/03/cache-amplification.html">cache amplification</a>). The bloom filter is skipped for the largest level in MyRocks via the optimize_filters_for_hits option. The example at the end of this post has more information on enabling bloom filters. All of this is set via <a href="https://github.com/facebook/mysql-5.6/wiki/my.cnf-tuning">rocksdb_default_cf_options</a>.<br /><br /><b>Examples</b><br /><br />A <a href="http://smalldatum.blogspot.com/2018/07/default-options-in-myrocks.html">previous post</a> recently explained how to <a href="http://smalldatum.blogspot.com/2018/07/default-options-in-myrocks.html">set rocksdb_default_cf_options</a> for compression with MyRocks. Below I share an example my.cnf for MyRocks to set the 5 options I listed above. I set transaction isolation because read committed is a better choice for MyRocks today. 
Repeatable read will be a great choice after gap locks are added to match InnoDB semantics. In rocksdb_default_cf_options block_based_table_factory is used to enable the bloom filter, level_compaction_dynamic_level_bytes enables bottom up management of level sizes, optimize_filters_for_hits disables the bloom filter for the largest level of the LSM tree and compaction_pri sets the compaction priority.<br /><span style="background-color: white; font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span><span style="background-color: white; font-family: "courier new" , "courier" , monospace; font-size: x-small;">transaction-isolation=READ-COMMITTED</span><br /><div class="p1"><span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">default-storage-engine=rocksdb</span></span></div><div class="p1"><span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">rocksdb</span></span></div><div class="p2"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><span class="s1"></span><br /></span></div><div class="p1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><span class="s1">rocksdb_default_cf_options=block_based_table_factory={filter_policy=bloomfilter:10:false};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio</span></span></div><div class="p1"><span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">rocksdb_block_cache_size=2g</span></span></div><div class="p1"><span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">rocksdb_max_background_jobs=4</span></span></div><br /></div><style 
type="text/css">p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 13.0px Menlo; color: #000000; background-color: #ffffff} p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 13.0px Menlo; color: #000000; background-color: #ffffff; min-height: 15.0px} span.s1 {font-variant-ligatures: no-common-ligatures} span.s2 {font-variant-ligatures: no-common-ligatures; background-color: #999900} </style>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-54542592817480224222018-08-30T08:07:00.000-07:002018-08-30T08:09:19.985-07:00Name that compaction algorithmFirst there was <a href="http://smalldatum.blogspot.com/2018/06/the-original-lsm-paper.html">leveled compaction</a> and it came from a great paper. Then tiered compaction arrived in BigTable, HBase and <a href="https://docs.datastax.com/en/articles/cassandra/cassandrathenandnow.html">Cassandra</a>. Eventually LevelDB arrived with leveled compaction and RocksDB emerged from that. Along the way a few interesting optimizations have been added including support for time series data. My summary is missing a few details because it is a summary.<br /><br />Compaction algorithms constrain the <a href="http://smalldatum.blogspot.com/2018/07/tiered-or-leveled-compaction-why-not.html">LSM tree shape</a>. They determine which sorted runs can be merged and which sorted runs need to be accessed for a read operation. I am not sure whether they have been formally defined, but I hope there can be agreement on the basics. I will try to do that now for a few - leveled, tiered, tiered+leveled, leveled-N and time-series. There are two new names on this list -- tiered+leveled and leveled-N.<br /><br />LSM tree used to imply leveled compaction. I prefer to expand the LSM tree definition to include leveled, tiered and more.<br /><br />I reference several papers below. 
All of them are awesome, even when not perfect -- they are major contributions to write-optimized databases and worth reading. One of the best things about my job is getting time to read papers like this and then speak with the authors.<br /><br />There are many interesting details in academic papers and existing systems (RocksDB, Cassandra, HBase, ScyllaDB) that I ignore. I don't want to get lost in the details.<br /><br /><b>Leveled</b><br /><br />Leveled compaction minimizes space amplification at the cost of read and write amplification.<br /><br />The LSM tree is a sequence of levels. Each level is one sorted run that can be range partitioned into many files. Each level is many times larger than the previous level. The size ratio of adjacent levels is sometimes called the fanout and write amplification is minimized when the same fanout is used between all levels. Compaction into level N (Ln) merges data from Ln-1 into Ln. Compaction into Ln rewrites data that was previously merged into Ln. The per-level write amplification is equal to the fanout in the worst case, but it tends to be less than the fanout in practice as explained in <a href="https://hyeontaek.com/papers/msls-fast2016.pdf">this paper</a> by Hyeontaek Lim et al. Compaction in the original LSM paper was all-to-all -- all data from Ln-1 is merged with all data from Ln. It is some-to-some for LevelDB and RocksDB -- some data from Ln-1 is merged with some (the overlapping) data in Ln.<br /><br />While write amplification is usually worse with leveled than with tiered there are a few cases where leveled is competitive. The first is key-order inserts and a RocksDB optimization greatly reduces write-amp in that case. The second one is skewed writes where only a small fraction of the keys are likely to be updated. 
With the right value for <a href="http://smalldatum.blogspot.com/2016/02/compaction-priority-in-rocksdb.html">compaction priority</a> in RocksDB compaction should stop at the smallest level that is large enough to capture the write working set -- it won't go all the way to the max level. When leveled compaction is some-to-some then compaction is only done for the slices of the LSM tree that overlap the written keys, which can generate less write amplification than all-to-all compaction.<br /><br /><b>Tiered</b><br /><br />Tiered compaction minimizes <a href="http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-pick-2_23.html">write amplification at the cost of read and space amplification</a>.<br /><br />The LSM tree can still be viewed as a sequence of levels as explained in the <a href="https://stratos.seas.harvard.edu/publications/dostoevsky-better-space-time-trade-offs-lsm-tree-based-key-value-stores">Dostoevsky paper</a> by Niv Dayan and Stratos Idreos. Each level has N sorted runs. Each sorted run in Ln is ~N times larger than a sorted run in Ln-1. Compaction merges all sorted runs in one level to create a new sorted run in the next level. N in this case is similar to fanout for leveled compaction. Compaction does not read/rewrite sorted runs in Ln when merging into Ln. The per-level write amplification is 1 which is much less than for leveled where it was fanout.<br /><br />Most implementations of tiered compaction don't behave exactly as described in the previous paragraph. I hope they are close enough, because the model above makes it easy to reason about performance and estimate the worst-case write amplification. A common approach for tiered is to merge sorted runs of similar size, without having the notion of levels (which imply a target for the number of sorted runs of specific sizes). Most include some notion of major compaction that includes the largest sorted run and conditions that trigger major and non-major compaction. 
Too many files and too many bytes are typical conditions.<br /><br />The <a href="https://www.cse.iitb.ac.in/~sudarsha/Pubs-dir/indexbuffering-vldb97.pdf">stepped merge paper</a> is the earliest reference I found for tiered compaction. It reduces random IO for b-tree changes by buffering them in an LSM tree that uses tiered compaction. While the stepped merge algorithm is presented as different from an LSM, it is tiered compaction. The <a href="https://infoscience.epfl.ch/record/163406/files/sigmod135-athanassoulis.pdf">MaSM paper</a> is similar but the SM in MaSM stands for sort merge. The paper uses an external sort rather than an LSM to reduce write amplification. It assumes that LSM implies leveled compaction but an external sort looks a lot like tiered compaction. The <a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-insert-buffering.html">InnoDB change buffer</a> has a similar goal of reducing random IO for changes to a b-tree but doesn't use an LSM. In what year did the InnoDB change buffer get designed or implemented?<br /><br />I prefer that tiered not require N sorted runs at the max level because that means N copies of the database which is too much space amplification. I define it to allow K copies at the max level where K is between 2 and N. But it still does tiered compaction at the max level and when the max level is full (has K sorted runs) then the K runs are merged and the output (1 sorted run) replaces the K runs in the max level. One day I hope to learn whether HBase or Cassandra support 1, a few or N sorted runs at the max level -- although this can be confusing because they don't enforce the notion of levels. 
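To see why N sorted runs at the max level is too much space amplification, a sketch of the worst case helps. This is my own arithmetic based on the model above, not from RocksDB or any paper:

```python
# Worst-case space amplification for tiered compaction as a function of the
# number of sorted runs allowed at the max level. Each run at the max level
# can approach a full copy of the live data, so allowing K runs there allows
# ~K copies, plus one more transient copy while the K runs are merged into
# a new run (the inputs can't be deleted until the merge finishes).

def tiered_space_amp(k_runs_at_max):
    steady = k_runs_at_max
    transient = k_runs_at_max + 1
    return steady, transient

# Classic tiered with N=10 runs per level vs capping the max level at K=2
print(tiered_space_amp(10))
print(tiered_space_amp(2))
```

The cap trades a little write amplification at the max level for a much smaller worst-case database size.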
Tiered compaction in RocksDB has a <a href="https://github.com/facebook/rocksdb/blob/master/include/rocksdb/universal_compaction.h#L38">configuration option</a> to limit the worst-case space amplification which should prevent too many full copies (too many sorted runs at the max level) but I don't have much experience with tiered in RocksDB. I hope the RocksDB wiki gets updated to explain this.<br /><br />There are a few challenges with tiered compaction:<br /><ul><li>Transient space amplification is large when compaction includes a sorted run from the max level.</li><li>The block index and bloom filter for large sorted runs will be large. Splitting them into smaller parts is a good idea.</li><li>Compaction for large sorted runs takes a long time. Multi-threading would help.</li><li>Compaction is all-to-all. When there is skew and most of the keys don't get updates, large sorted runs might get rewritten because compaction is all-to-all. In a traditional tiered algorithm there is no way to rewrite a subset of a large sorted run. </li></ul>For tiered compaction the notion of levels is usually a concept to reason about the shape of the LSM tree and estimate write amplification. With RocksDB they are also an implementation detail. The levels of the LSM tree beyond L0 can be used to store the larger sorted runs. The benefit from this is to partition large sorted runs into smaller SSTs. This reduces the size of the largest bloom filter and block index chunks -- which is friendlier to the block cache -- and was a big deal before partitioned index/filter was supported. With subcompactions this enables multi-threaded compaction of the largest sorted runs. Note that RocksDB used the name <i>universal</i> rather than <i>tiered</i>. 
More docs on this <a href="https://github.com/facebook/rocksdb/wiki/Universal-Compaction">are here</a>.<br /><ul></ul><b><br /></b><b>Tiered+Leveled</b><br /><br />Tiered+Leveled has less write amplification than leveled and less space amplification than tiered.<br /><br />The <a href="http://smalldatum.blogspot.com/2018/07/tiered-or-leveled-compaction-why-not.html">tiered+leveled</a> approach is a hybrid that uses tiered for the smaller levels and leveled for the larger levels. It is flexible about the level at which the LSM tree switches from tiered to leveled. For now I assume that if Ln is leveled then all levels that follow (Ln+1, Ln+2, ...) must be leveled.<br /><br /><a href="http://www.vldb.org/pvldb/vol10/p2037-ren.pdf">SlimDB</a> from VLDB 2018 is an example of tiered+leveled although it might allow Lk to be tiered when Ln is leveled for k > n. <a href="https://stratos.seas.harvard.edu/publications/dostoevsky-better-space-time-trade-offs-lsm-tree-based-key-value-stores">Fluid LSM</a> is described as tiered+leveled but I think it is leveled-N.<br /><br />Leveled compaction in RocksDB is also tiered+leveled, but we didn't explain it that way until now. There can be N sorted runs at the memtable level courtesy of the <a href="https://github.com/facebook/rocksdb/blob/master/include/rocksdb/advanced_options.h#L165">max_write_buffer_number</a> option -- only one is active for writes, the rest are read-only waiting to be flushed. A memtable flush is similar to tiered compaction -- the memtable output creates a new sorted run in L0 and doesn't read/rewrite existing sorted runs in L0. There can be N sorted runs in level 0 (L0) courtesy of <a href="https://github.com/facebook/rocksdb/blob/master/include/rocksdb/options.h#L232">level0_file_num_compaction_trigger</a>. So the L0 is tiered. Compaction isn't done into the memtable level so it doesn't have to be labeled as tiered or leveled. 
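One way to see the hybrid shape is to count the sorted runs a point lookup might have to check. This is a sketch using the option names mentioned above; the values are examples, not RocksDB defaults:

```python
# Count sorted runs in RocksDB's hybrid shape: the memtable level and L0
# are tiered (many runs each), L1 and beyond are leveled (one run per level).
# Example values below, not RocksDB defaults.

def sorted_runs(max_write_buffer_number, l0_files, leveled_levels):
    # each memtable, each L0 file and each leveled level is one sorted run
    return max_write_buffer_number + l0_files + leveled_levels

# 2 memtables, 4 L0 files and 6 leveled levels => 12 sorted runs that a
# point read without bloom filters might have to check
print(sorted_runs(2, 4, 6))
```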
<a href="https://github.com/facebook/rocksdb/wiki/Leveled-Compaction">Subcompactions</a> in the RocksDB L0 make this even more interesting, but that is a topic for another post. I hope we get more docs on this interesting feature from Andrew Kryczka.<br /><br /><b>Leveled-N</b><br /><br />Leveled-N compaction is like leveled compaction but with less write and more read amplification. It allows more than one sorted run per level. Compaction merges all sorted runs from Ln-1 into one sorted run from Ln, which is leveled. And then "-N" is added to the name to indicate there can be N sorted runs per level.<br /><br />The <a href="https://stratos.seas.harvard.edu/publications/dostoevsky-better-space-time-trade-offs-lsm-tree-based-key-value-stores">Dostoevsky paper</a> defined a compaction algorithm named Fluid LSM in which the max level has 1 sorted run but the non-max levels can have more than 1 sorted run. Leveled compaction is done into the max level. The paper states that tiered compaction is done into the smaller levels when they have more than 1 sorted run. But from my reading of the paper it uses leveled-N for the non-max levels.<br /><br />In Fluid LSM each level is T times larger than the previous level (T == fanout), the max level has Z sorted runs and the non-max levels have K sorted runs. When Z=1 and K=1 then this is leveled compaction. When Z=1 and K>1 or Z>1 and K>1 then I claim this uses leveled-N.<br /><br />Assuming K>1 for Ln-1 then compaction with Fluid LSM into Ln merges K runs from Ln-1 with 1 run from Ln. This doesn't match my definition of tiered compaction because compaction into Ln reads & rewrites a sorted run from Ln and per-level write amplification is likely to be larger than 1. Regardless I like the idea.<br /><br />Examples of write amplification with Fluid LSM for compaction from Ln-1 to Ln:<br /><ul><li>T==K - there are T (or K) sorted runs in each of Ln-1 and Ln. When each run in Ln-1 has size 1, then each run in Ln has size T. 
Compaction into Ln merges T runs from Ln-1 with 1 run from Ln to create a new run in Ln. This reads T bytes from Ln-1 and T bytes from Ln and the new run has a size between T and 2T -- size T when all keys in Ln-1 are duplicates of keys in the run from Ln and size > T otherwise. When the new run has size 2T the per-level write amp is 2 because 2T bytes were written to move T bytes from Ln-1. When the new run has size T the per-level write amp is 1. Otherwise the per-level write-amp is between 1 and 2. </li><li>T > K - there are K sorted runs in each of Ln-1 and Ln. Each run in Ln-1 has size T/K and each run in Ln has size T^2/K. The K runs in Ln-1 have a combined size of T. Compaction reads T bytes from Ln-1, T^2/K bytes from Ln and writes a new run in Ln that has a size between T^2/K and (T^2/K + T). The per-level write-amp is as small as T^2/K / T, which reduces to T/K, when all keys in Ln-1 are duplicates with the run in Ln. It can be as large as (T^2/K + T) / T, which reduces to T/K + 1, when there is no overlap. Otherwise it is between T/K and T/K + 1.</li></ul><div>When K=2 and T=10 then the per-level write-amp is ~5 which is about half of the per-level write-amp from leveled compaction.</div><br /><b>Time Series</b><br /><br />There are compaction algorithms optimized for time series workloads. I have no experience with them but they are worth mentioning. Cassandra had <a href="https://www.datastax.com/dev/blog/datetieredcompactionstrategy">DTCS</a> and has <a href="http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html">TWCS</a>. InfluxDB has or had <a href="https://docs.influxdata.com/influxdb/v1.2/concepts/storage_engine/">TSM</a> and <a href="https://www.influxdata.com/blog/path-1-billion-time-series-influxdb-high-cardinality-indexing-ready-testing/">TSI</a>. 
I hope we eventually do something interesting for time series with RocksDB.<br /><br /><b>Other</b><br /><br />There are other interesting LSM engines:<br /><ul><li>Tarantool - Sophia begat <a href="https://github.com/tarantool/tarantool/wiki/Vinyl-Architecture">Vinyl</a> and I lost track of it. But I have high hopes.</li><li>WiredTiger - has an LSM but they are busy making the <a href="http://smalldatum.blogspot.com/2015/08/different-kinds-of-copy-on-write-for-b.html">CoW b-tree</a> better</li><li>Kudu - didn't use RocksDB and I like <a href="https://kudu.apache.org/kudu.pdf">the reasons</a> for not using it</li></ul><div>My summary of Sophia and Tarantool probably has bugs. My memory is that Sophia was a great design assuming the database : RAM ratio wasn't too large. It had a memtable and a sorted run on disk -- both were partitioned (not sure if range or hash). When a memtable partition became full then leveled compaction was done between it and its disk partition. Vinyl has changed enough from this design that I won't try to summarize it here. It has clever ideas for managing the partitions.<br /><br /><b>ScyllaDB</b><br /><br />I briefly mentioned ScyllaDB at the start of the post. I have yet to use the product but their documentation on LSM efficiency and many other things is remarkable. Start with <a href="https://www.scylladb.com/2017/12/28/compaction-strategy-scylla/">this post</a> that compares the compaction strategies (algorithms) in ScyllaDB -- leveled, size-tiered, hybrid and time-window. From this attached slide deck I learned that Lucene implemented an LSM in 1999. 
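Comparing strategies is easier with the per-level write-amp models from earlier in this post. A sketch of that arithmetic -- my own numbers from the definitions above, not measurements from any of these systems:

```python
# Per-level write amplification under the models described in this post:
# leveled is ~fanout per level (worst case), tiered is ~1 per level, and
# leveled-N with K runs per level falls in between (T/K to T/K + 1).

def per_level_write_amp(strategy, fanout=10, runs_per_level=1):
    if strategy == "tiered":
        return 1.0
    if strategy == "leveled":
        return float(fanout)  # worst case; often less in practice
    if strategy == "leveled-N":
        return fanout / runs_per_level  # lower bound; upper bound adds 1
    raise ValueError(strategy)

# With T=10: leveled ~10 per level, tiered ~1, leveled-N with K=2 ~5
for s, k in [("leveled", 1), ("tiered", 1), ("leveled-N", 2)]:
    print(s, per_level_write_amp(s, fanout=10, runs_per_level=k))
```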
They also have two posts that explain write amplification for <a href="https://www.scylladb.com/2018/01/17/compaction-series-space-amplification/">tiered</a> and <a href="https://www.scylladb.com/2018/01/31/compaction-series-leveled-compaction/">leveled</a> compaction.<br /><br />Hybrid compaction is described in the <a href="https://www.scylladb.com/2017/12/28/compaction-strategy-scylla/">embedded slide deck</a> and it is interesting. Hybrid range partitions large sorted runs into many SSTs, similar to RocksDB. Hybrid then uses that to make compaction with large sorted runs incremental -- an input SST to the compaction can be deleted before the compaction is finished (slide 33). This reduces the worst-case space amplification that is transient when merges are in progress for large sorted runs. This isn't trivial to implement. It isn't clear to me but slide 34 suggests that hybrid can limit compaction to a subset (1 or a few SSTs) of a large sorted run when the writes are skewed. Maybe a ScyllaDB expert can confirm or deny my guess. Hybrid also has optimizations for tombstones (slide 44). I won't go into detail here, just as I ignored the <a href="https://github.com/facebook/rocksdb/wiki/Single-Delete">SingleDelete optimization</a> in RocksDB. </div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com2tag:blogger.com,1999:blog-9149523927864751087.post-44386233688986460422018-08-27T13:30:00.000-07:002018-08-27T14:08:00.246-07:00Review of "Concurrent Log-Structured Memory" from VLDB 2018.<span style="font-family: "times" , "times new roman" , serif;"><a href="http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-pick-2_23.html">Space-amplification</a> matters for in-memory stores too. 
<br /><br />This is a review of <span style="background-color: white; color: #1c1e21; white-space: pre-wrap;"><a href="http://www.vldb.org/pvldb/vol11/p458-merritt.pdf">Concurrent Log-Structured Memory for Many-Core Key-Value Stores</a> from VLDB 2018 and the engine is named Nibble. The paper is worth reading - the ideas are interesting and the performance results are thorough. Nibble is an example of <a href="http://smalldatum.blogspot.com/2018/07/indexlog-alternative-to-lsm.html">index+log</a>.</span><span style="background-color: white; color: #1c1e21; white-space: pre-wrap;"> Their focus is on huge many-core servers. I wonder how RocksDB would do on a server with 240 cores and many TB of DRAM. I assume there might be a few interesting performance problems to fix. Start with table 2 for a condensed summary. The design overview is:</span></span><br /><ul><li><span style="font-family: "times" , "times new roman" , serif;">partitioned, resizable hash index - the hash index uses <a href="https://en.wikipedia.org/wiki/Open_addressing">open addressing and linear probing</a>. Each bucket has 15 entries, 64-bits/entry and a version counter. The counter is incremented twice/update -- at start and end. Readers retry if the counter changed during the read, or if it was unchanged but odd (an update was in progress).</span></li><li><span style="font-family: "times" , "times new roman" , serif;">per-socket log-structured memory allocators with per-core log heads. There is a log instance per core and each log instance has multiple heads (locations where log inserts are done). They call this write local, read global because reads might be from a log instance written by another core. A cost-based approach is used to select the next log segment for GC. </span></li><li><span style="font-family: "times" , "times new roman" , serif;">thread-local epochs - after GC there might still be threads reading from a log segment. Epochs are used to determine when that is safe. 
The CPU time stamp counter is used for the epoch value and each thread writes its epoch value, using a cache line per thread, to a fixed memory location.</span></li></ul><span style="background-color: white; color: #1c1e21; font-family: "times" , "times new roman" , serif; white-space: pre-wrap;">Things that are interesting to me:</span><br /><ul><li><span style="font-family: "times" , "times new roman" , serif;">When needed, a partition of the hash index is doubled in size. Rather than allocate 2X more memory, they use the VM to extend the current allocation so only ~1/2 of the objects in the partition must be relocated. I am curious whether there are stalls during resize.</span></li><li><span style="font-family: "times" , "times new roman" , serif;">What is the range of load factors that the Nibble hash index can sustain? A b-tree with leaf pages 2/3 full is one source of fragmentation, a hash index with a load factor less than 100% is another form of fragmentation. Fortunately, with a load factor of 80%, 20% of memory isn't wasted because the log segments should use much more memory than the hash index. Figure 14 shows throughput as a function of memory utilization.</span></li><li><span style="font-family: "times" , "times new roman" , serif;">index+log has a large CPU cost during GC from tree index probes, but Nibble uses a hash index which has a lower probe cost. Nibble maintains a live bytes counter per log segment. Segments with zero live bytes can be collected without probing the index to find live entries. Otherwise an index probe per entry is required to determine whether an entry is live.</span></li><li><span style="font-family: "times" , "times new roman" , serif;">I am curious about the results in Figures 1 and 9b on memory fragmentation per allocator. The results for jemalloc and tcmalloc are similar while ptmalloc2 does the best. The microbenchmarks are from the <a href="https://www.usenix.org/node/179822">Rumble paper</a>. 
I don't think much can be concluded from such simple allocator workloads -- but I still like the paper. RocksDB or LevelDB with db_bench would be a better test for fragmentation. It would also be good to know which versions of the allocators (jemalloc, tcmalloc, etc) were used and whether any tuning was done.</span></li></ul><span style="font-family: "times" , "times new roman" , serif;">I have published posts on the benefits from using jemalloc or tcmalloc compared to glibc malloc for RocksDB. A RocksDB/MyRocks instance has ~2X larger RSS with glibc malloc because jemalloc and tcmalloc are better at avoiding fragmentation. See posts <a href="http://smalldatum.blogspot.com/2018/04/myrocks-malloc-and-fragmentation-strong.html">one</a>, <a href="http://smalldatum.blogspot.com/2015/10/myrocks-versus-allocators-glibc.html">two</a>, <a href="http://smalldatum.blogspot.com/2014/12/malloc-and-mongodb-performance.html">three</a>, <a href="http://smalldatum.blogspot.com/2017/11/concurrent-large-allocations-glibc.html">four</a> and <a href="https://twitter.com/markcallaghan/status/888414002044436482">five</a>. RocksDB does an allocation per block read and <span style="background-color: white; color: #1c1e21; white-space: pre-wrap;">puts a lot of stress on the allocator especially when using a fast storage device. </span><span style="background-color: white; color: #1c1e21; white-space: pre-wrap;">An allocation remains cached until it reaches the LRU end in the block cache or the block's SST gets unlinked. 
I expect blocks in the block cache to have vastly different lifetimes.</span></span><br />Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com2tag:blogger.com,1999:blog-9149523927864751087.post-62983933035389359342018-08-07T07:55:00.002-07:002018-08-07T08:11:01.258-07:00Default configuration benchmarksDefault configuration benchmarks are an interesting problem. Most storage engines require some configuration tuning to get good performance and efficiency. We configure an engine to do the right thing for the expected workload and hardware. Unfortunately the configuration is done in the language of the engine (innodb_write_io_threads, rocksdb_default_cf_options) which requires a significant amount of time to understand.<br /><br />Hardware comes in many sizes and engines frequently don't have code to figure out the size -- how many CPUs, how much RAM, how many GB of storage, how many IOPs from storage. Even when that code exists the engine might not be able to use everything it finds:<br /><ul><li>HW can be shared and the engine is only allowed a fraction of it. </li><li>It might be running on a VM that gets more CPU when other VMs on the host are idle.</li><li>SSDs get slower when more full. It can take a long time to reach that state.</li></ul><div><b><br /></b><b>Minimal configuration</b><br /><br />I assume there is a market for storage engines that have better performance with the default configuration, but it will take time to get there. A step in the right direction is to enhance engines to get great performance and efficiency with minimal configuration (minimal != default). I am still figuring out what <i>minimal</i> means. 
I prefer to use the language of the engine user (HW capacity and performance/efficiency goals) rather than the language of the engine. I'd rather not set engine-specific options, even easy to understand ones like innodb_buffer_pool_size. I want the engine to figure out its configuration given the minimal tuning. For now I have two levels for <i>minimal</i>:</div><div><ul><li>HW-only - tell the engine how much HW it can use -- number of CPU cores, GB of RAM, storage capacity and IOPs. Optionally you can ask it to use all that it finds.</li><li>HW + goals - in addition to HW-only this supports goals for <a href="http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-pick-2_23.html">read, write, space</a> and <a href="http://smalldatum.blogspot.com/2018/03/cache-amplification.html">cache amplification</a>. For now I will be vague about the goals. </li></ul><div><b><br /></b></div><div><b>Things change</b></div><div><br /></div><div>Another part of the configuration challenge is that database workloads change while configurations tend to be static. I prefer that the engine does the right thing, while respecting the advice provided via minimal configuration. I want the engine to adapt to the current workload without ruining performance for the future workload. Adapting by deferring index maintenance can make loads faster, but might hurt the queries that follow.</div><div><br /></div><div>Types of change include:</div><div><ul><li>The working set no longer fits in memory and the workload shifts from CPU to IO bound.</li><li>Daily maintenance (vacuum, reorg, defrag, DDL, reporting) runs during off-peak hours.</li><li>Web-scale workloads have daily peak cycles as people wake and sleep.</li><li>New features get popular, old features get deprecated. Their tables and indexes arrive, grow large, become read-only, get dropped and more. Some deprecated features get un-deprecated.</li><li>Access patterns to data changes. 
Rows might be write once, N times or forever and write once/N rows eventually become read-only. Rows might be read never, once, a few-times or forever.</li><li>Different types of data (see previous point) can live within the same index. Even if you were willing to tune per-index (some of us are) this isn't sufficient when there is workload diversity within an index.</li></ul><div>Real workloads include the types of change listed above but benchmarks rarely include them. Any benchmark that includes such change is likely to need more than 24-hours to run which will limit its popularity -- but maybe that isn't a bad thing. I hope we see a few new benchmarks that include such types of change. I might even try to write one.</div></div></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com2tag:blogger.com,1999:blog-9149523927864751087.post-53750706017337656852018-08-01T10:26:00.000-07:002018-08-01T10:26:38.717-07:00Lock elision, pthreads and MySQLYesterday I learned that <a href="https://queue.acm.org/detail.cfm?id=2579227">lock elision</a> is supported in <a href="https://lwn.net/Articles/534758/">recent versions of glibc</a> for pthread mutex and rw-lock. I am curious if anyone has results for MySQL with it. My memory is that InnoDB can suffer from contention on a rw-lock, but that is a custom rw-lock not the one included with glibc. But code above the storage engine uses mutex and maybe rw-lock from glibc.<br /><br />A rw-lock where reads dominate can suffer from contention because it has at least twice the memory writes per lock/unlock pair compared to a mutex. So when the lock hold time is short a mutex wins even when exclusive access isn't required. This can often be seen in PMP output where there are convoys and the worst-case is when a thread gets stuck trying to get the internal latch during unlock, but the InnoDB custom rw-lock might not have that problem. 
Lock elision for the rw-lock might be a big deal in this case.<br /><br />RocksDB might also benefit from this change.<br /><br />One of the challenges with glibc pthreads is documentation. I previously wrote about the difficulty of finding <a href="http://smalldatum.blogspot.com/2018/03/missing-documentation.html">documentation for PTHREAD_MUTEX_ADAPTIVE_NP</a>. The problem continues. There isn't much about pthreads in a recent version of the <a href="https://www.gnu.org/software/libc/manual/html_node/POSIX-Threads.html#POSIX-Threads">glibc manual</a>. From Google searches I wasn't able to find recent docs elsewhere, except for man pages. But man pages don't document PTHREAD_MUTEX_ADAPTIVE_NP. With lock elision we get new options -- <a href="https://www.google.com/search?q=PTHREAD_MUTEX_ELISION_NP">PTHREAD_MUTEX_ELISION_NP</a> and <a href="https://www.google.com/search?q=PTHREAD_MUTEX_NO_ELISION_NP">PTHREAD_MUTEX_NO_ELISION_NP</a>. Google searches will take you to bits of source code and email list discussions. I hope this can be improved. Given the lack of docs you might need to <a href="https://sourceware.org/git/?p=glibc.git;a=tree;f=nptl;hb=HEAD">read the source</a>. I hope that the community (web-scale companies) can sponsor a tech writer to provide the missing docs.<br /><br />There has been drama because the introduction of this feature failed when it encountered buggy microcode on certain CPUs. Then there was more drama when it <a href="https://codeandbitters.com/2016/04/18/fun-with-lock-elision/">broke buggy</a> software that worked despite the bugs, until lock elision made the bugs serious. <a href="https://www.google.com/search?q=glibc+lock+elision">Google searches</a> find many of the stories.<br /><br />One of my favorite perks at work is getting answers from experts. In this case the expert is Nathan Bronson (thank you). 
A summary of the glibc 2.23 implementation per the expert is:<br /><ul><li>NPTL lock elision is performed using TSX's RTM (Restricted Transactional Memory) instructions XBEGIN, XEND, and XABORT, rather than TSX's HLE (H<span class="text_exposed_show" style="display: inline; font-family: inherit; white-space: pre-wrap;">ardware Lock Elision) instructions XACQUIRE and XRELEASE</span></li><li>On x86, elision support is always present when detected by HAS_CPU_FEATURE(RTM)</li><li>pthread_rwlock_t <span class="_4yxo" style="color: #4b4f56; font-family: inherit; font-weight: 600; white-space: pre-wrap;">always</span><span style="font-family: inherit; white-space: pre-wrap;"> attempts elision if the hardware has it (both for .._rdlock and .._wrlock)</span></li><li>pthread_rwlock_t uses an adaptive strategy for falling back to the non-TSX implementation. If the lock is held in a non-TSX mode, there is a transaction conflict, or the transaction exceeds TSX's (undocumented) capacity, then the current lock acquisition and the 3 following use the non-TXN code path. This means that once a lock falls off the elision path it needs several uncontended acquisitions before a transaction will be attempted again. This seems quite conservative.</li><li>pthread_rwlock_rdlock -> pthread_rwlock_unlock with a successful transaction is about twice as fast as the non-TSX implementation under no contention, and massively better under contention</li></ul>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com1tag:blogger.com,1999:blog-9149523927864751087.post-13382441420153426242018-07-27T10:10:00.001-07:002018-07-27T10:10:07.739-07:00Fast shutdown, fast startupManaging a web-scale database service is easier when shutdown and startup are fast, or at least fast enough. 
Fast enough to me means a few seconds, and too slow means tens of seconds.<br /><br />When these operations are too slow then:<br /><ul><li>scripts might time out - one of the MySQL scripts used to do that, see <a href="https://bugs.mysql.com/bug.php?id=25341">bug 25341</a></li><li>uncertainty increases - storage engines rarely provide progress indicators for shutdown. Most provide 2 to a few lines in the error log, 1 for shutdown starting, 1 for shutdown ending and maybe a few more. Alas, you have to ssh to the host to tail the error log to see them. When startup for InnoDB does crash recovery there is a useful progress indicator in the error log, but again, you need to ssh to the host to see that. Note that "ssh to host to tail error log" is not a best practice for web-scale.</li><li>downtime increases - restart (shutdown/startup), shutdown and startup can be sources of downtime. They happen for many reasons -- changing a configuration option that isn't dynamic, upgrading to a new binary, upgrading the kernel, etc. When they take 60 seconds your service might incur 60 seconds of downtime.<br /></li></ul><div>The work done by shutdown/startup also depends on the index structure (LSM vs b-tree) and on implementation details.<br /><br /><b>B-Tree</b></div><div><br /></div><div>For a b-tree either shutdown or startup will be slow. The choice is either to flush dirty pages on shutdown (one random write per dirty page) or to do crash recovery at startup (one random read per page that was dirty on shutdown, eventually those dirty pages must be written back). The <a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_fast_shutdown">innodb_fast_shutdown</a> option lets you control which one will be slow.<br /><br />When dirty page writeback is done on shutdown then the time for that is a function of storage performance and the number of dirty pages. Back in the day (InnoDB running with disks) shutdown was slower. 
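A back-of-envelope sketch of that function (every number below is an assumption picked for illustration, not a measurement from a real server):

```python
# Back-of-envelope: time to flush dirty pages at shutdown, assuming one
# random write per dirty page. All numbers are assumptions, not measurements.

def flush_seconds(buffer_pool_gb, dirty_pct, page_kb, write_iops):
    pages = buffer_pool_gb * 1024 * 1024 / page_kb
    dirty_pages = pages * dirty_pct / 100.0
    return dirty_pages / write_iops

# Disk era: 16 GB buffer pool, ~500 random writes/s from a disk array.
print(flush_seconds(16, 50, 16, 500))      # ~1049 seconds

# SSD era: 16x the buffer pool, but ~100x the write IOPs.
print(flush_seconds(256, 50, 16, 50000))   # ~168 seconds
```

In this made-up example faster storage more than compensates for the larger buffer pool, but shutdown is still minutes rather than seconds unless the dirty page percentage is driven down first.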
Storage is much faster today, but buffer pools are also larger because servers have more RAM. Shutdown can be made faster by reducing the value of <a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_max_dirty_pages_pct">innodb_max_dirty_pages_pct</a> a few minutes before shutdown. Alas, using a large value for innodb_max_dirty_pages_pct can be very good for performance -- less log IO, less page IO, less <a href="http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html">write-amplification</a>.<br /><br />Amazon Aurora is a hybrid, or mullet, with a b-tree up front and log-structured in the back. Shutdown for it is fast. It also doesn't need to warm up after startup because the buffer pool survives instance restart. Many years ago there was an option in Percona XtraDB to make the buffer pool survive restart; I wonder if that option will return. InnoDB also has an option to <a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_buffer_pool_load_at_startup">warm up the buffer pool</a> at startup, but that still does IO, which isn't as nice as preserving the buffer pool.<br /><br />Back in the day InnoDB startup was often too slow. My memory about this has faded. One part of the problem is that per-index statistics are computed the first time a table is opened and that did ~6 random reads per index. That was the first part of the delay. My memory about the second part of the delay has faded more but I think at times this was single-threaded. A <a href="https://www.percona.com/blog/2006/11/21/opening-tables-scalability/">post from Percona</a> explained some of this. Today InnoDB stats can be persistent, so they don't have to be sampled at startup. But InnoDB was also enhanced to avoid some of this problem long before persistent stats were added. 
I hope a reader provides a less vague description of this.</div><div><br /></div><div><b>LSM</b></div><div><br /></div><div>Shutdown for an LSM is fast -- flush the write buffer, no random writes. One thing that made shutdown slower for RocksDB was calling free for every page in the LRU. Note that RocksDB does malloc per page in the LRU rather than one huge malloc like InnoDB. With MyRocks the LRU isn't free'd on shutdown so there are no stalls from that.<br /><br />Startup for MyRocks should be fast but there is still at least one problem to solve. If you configure it with max_open_files=-1 then file descriptors are opened for all SSTs at startup. This helps performance by avoiding the need to search a hash table. The cost of this is 1) more open file descriptors and 2) more work at startup. See the description of the <a href="https://github.com/facebook/rocksdb/blob/master/include/rocksdb/options.h#L415">RocksDB option</a> and more details in the <a href="https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide">tuning guide</a> and <a href="https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ">FAQ</a>. Note that the work done to open all of the SSTs can be done by multiple threads and the number of threads is controlled by the <a href="https://github.com/facebook/rocksdb/blob/master/include/rocksdb/options.h#L423">max_file_opening_threads</a> RocksDB option. From looking at MyRocks code I don't think there is a way to change the value of max_file_opening_threads and the default is 16. The not-yet-solved problem is that RocksDB tries to precache some data from the end of every SST, by reading this data into the OS page cache, and that can be a lot of IO at startup, which also can make startup slower. 
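A rough model of that startup work (every rate and size here is an assumption for illustration; names like opens_per_sec_per_thread and precache_kb_per_sst are mine, not RocksDB options):

```python
# Rough model of startup cost when every SST is opened eagerly
# (max_open_files=-1). All rates and sizes are assumptions.

def startup_seconds(num_ssts, open_threads, opens_per_sec_per_thread,
                    precache_kb_per_sst, read_mb_per_sec):
    # Opens are parallelized across the file-opening threads.
    open_time = num_ssts / (open_threads * opens_per_sec_per_thread)
    # Precaching the tail of every SST is extra read IO.
    precache_mb = num_ssts * precache_kb_per_sst / 1024.0
    return open_time + precache_mb / read_mb_per_sec

# 100,000 SSTs, 16 opening threads (the default mentioned above), and an
# assumed ~256 KB precached per SST at 500 MB/s of read throughput.
print(startup_seconds(100000, 16, 1000, 256, 500))   # ~56 seconds
```

In this made-up configuration the precache reads (~25 GB) dominate the open() calls, which matches the concern above about startup IO.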
With MyRocks when <a href="https://github.com/facebook/mysql-5.6/blob/fb-mysql-5.6.35/storage/rocksdb/ha_rocksdb.cc#L4520">rocksdb_max_open_files</a> is set to -2 then the open files limit is auto-configured, when set to -1 then there is no limit, and when set to > 0 then that is the limit.</div><div><br /></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-48529640508917253752018-07-26T11:03:00.000-07:002018-07-26T11:03:58.235-07:00Tiered or leveled compaction, why not both via adaptive compaction?First there was <a href="http://smalldatum.blogspot.com/2018/06/the-original-lsm-paper.html">leveled compaction</a> and it was good, but it took a while for implementations to become popular. Then there was (size) tiered compaction and it was good too, but more confusing given the diversity in strategies for picking files to compact. RocksDB didn't help with the confusion by calling it <a href="https://github.com/facebook/rocksdb/wiki/Universal-Compaction">universal compaction</a>. Eventually compaction algorithms optimized for time series were added (see <a href="https://www.datastax.com/dev/blog/datetieredcompactionstrategy">DTCS</a> for Cassandra). Finally, Kudu and InfluxDB have specialized compaction algorithms that are also worth understanding.<br /><br />This post is about adaptive compaction which is yet another compaction algorithm. The summary for adaptive compaction is:<br /><ul><li>LSM file structure is orthogonal to the use of tiered or leveled compaction. A given LSM instance can be viewed as using tiered and leveled. It just depends on how you look at it.</li><li>Some levels in the LSM tree can be read optimized while others can be write optimized.</li><li>Each compaction step makes cost-based decisions to change a level between read and write optimized states, or remain as is. 
Compaction steps can also change the per-level fanout (thanks to Siying and Andrew for that suggestion).</li><li>Adaptive compaction can be configured to do strictly tiered or strictly leveled. So it can even adapt to become non-adaptive.</li></ul><br />This is just an idea today. It has not been implemented. I think it will be good for RocksDB, but validation of ideas takes time.<br /><div><br /></div><div><b>Drawing the LSM file structure</b></div><div><br /></div><div>The LSM file structure with leveled compaction is usually depicted vertically. The example below is from leveled compaction with 3 levels, per-level fanout set to 4, 4 SSTs in L1, 16 SSTs in L2 and 64 SSTs in L3. I ignore level 0 to simplify the picture.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-kB9mrc4zNeI/W1oEGOp8DYI/AAAAAAAAXa4/oLGrCPhPz-wUhF5iVHthc9l1XIp-imWHgCLcBGAs/s1600/image.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="125" data-original-width="454" height="110" src="https://1.bp.blogspot.com/-kB9mrc4zNeI/W1oEGOp8DYI/AAAAAAAAXa4/oLGrCPhPz-wUhF5iVHthc9l1XIp-imWHgCLcBGAs/s400/image.png" width="400" /></a></div><div><br /></div>With tiered compaction the LSM files are usually depicted horizontally as a sequence of sorted runs. The compaction algorithm merges 2 or more adjacent sorted runs at a time but it is hard to reason about the write amplification that will occur with such merging. Below each sorted run is range partitioned with 4, 16 and 64 partitions. 
Each partition is an SST.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-97uOiffGvjY/W1oETXR85xI/AAAAAAAAXa8/LrA7rp6cUusA4Zqqw4OT2HcujAXqo0KdwCLcBGAs/s1600/image%2B%25281%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="108" data-original-width="713" height="60" src="https://4.bp.blogspot.com/-97uOiffGvjY/W1oETXR85xI/AAAAAAAAXa8/LrA7rp6cUusA4Zqqw4OT2HcujAXqo0KdwCLcBGAs/s400/image%2B%25281%2529.png" width="400" /></a></div><br /><b>File Structure vs Compaction</b><br /><br />The key point here is that the LSM file structure is orthogonal to the use of tiered or leveled compaction. For the examples above the same LSM files are depicted vertically for leveled compaction and horizontally for tiered compaction. <br /><br />The same files from an LSM instance can be used with tiered or leveled compaction. An instance using leveled compaction can be stopped and switched to use tiered compaction. An instance using tiered compaction can be stopped and switched to use leveled compaction. Metadata might have to be changed during the switch but otherwise the switch is trivial (allow me to wave hands).<br /><br />To make this claim I ignore the work done in leveled with RocksDB to limit overlap between SSTs on adjacent levels. That is useful when incremental compaction is done to compact 1 SST from level N with the ~fanout SSTs from level N+1 as done by LevelDB and RocksDB. Limiting overlap isn't needed with the <a href="http://smalldatum.blogspot.com/2018/06/the-original-lsm-paper.html">classic LSM</a> approach because compaction between levels is all-to-all rather than incremental.<br /><br />To transform from leveled to tiered assume that with tiered compaction the LSM structure is a sequence of sorted runs and each sorted run is 1+ SSTs. Then the N SSTs in the L0 become the first N sorted runs in the tiered sequence. 
They are followed by the sorted runs from L1 to Lmax. In this case the large sorted runs from L1 to Lmax are range partitioned (split into many SSTs) in the tiered sequence.<br /><br />To transform from tiered to leveled the sorted runs from the prefix of the tiered sequence that are each a single SST become the SSTs in L0. Each of the remaining sorted runs becomes a level from L1 to Lmax. This requires that large sorted runs are range partitioned into many SSTs with tiered compaction.<br /><br />Next is a specific example for transforming from leveled to tiered with fanout=8. The LSM with leveled has:<br /><ul><li>4 SSTs in the L0 named L0.1, L0.2, L0.3 and L0.4</li><li>L1 is partitioned into 4 SSTs: L1.1 to L1.4</li><li>L2 is partitioned into 32 SSTs: L2.1 to L2.32</li><li>L3 is partitioned into 256 SSTs: L3.1 to L3.256</li></ul>That uses 7 sorted runs with tiered compaction. With tiered the sequence of sorted runs is:<br /><ul><li>sorted runs 1 to 4 are L0.1, L0.2, L0.3, L0.4</li><li>sorted run 5 is L1.1 to L1.4 (range partitioned)</li><li>sorted run 6 is L2.1 to L2.32 (range partitioned)</li><li>sorted run 7 is L3.1 to L3.256 (range partitioned)</li></ul>A similar transformation can be done to go from tiered back to leveled. Assume the LSM with tiered compaction uses the file structure from above with 7 sorted runs, the first 4 are each one SST and then runs 5, 6 and 7 are each range partitioned into many SSTs. This can be viewed as an LSM with leveled compaction where runs 1 to 4 are in the L0 and runs 5, 6 and 7 become levels 1, 2 and 3. As noted elsewhere this ignores the work done to limit overlap between adjacent levels with leveled and RocksDB.<br /><br /><b>Vertical Depiction of Tiered Compaction</b><br /><br />More recently I have seen the vertical depiction used to explain write amplification for tiered compaction (thanks <a href="http://daslab.seas.harvard.edu/">DASLab</a> for doing this). 
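As an aside, the leveled-to-tiered relabeling from the previous section is mechanical enough to sketch in code (a toy model; the function and variable names are mine, not from any real LSM implementation):

```python
# Toy model: the same SST files relabeled as leveled or tiered. A "sorted
# run" is a list of SSTs; SST names are invented for illustration.

def leveled_to_tiered(l0_ssts, levels):
    # Each L0 SST becomes its own sorted run, then each of L1..Lmax
    # becomes one range partitioned sorted run.
    runs = [[sst] for sst in l0_ssts]
    runs.extend(list(level) for level in levels)
    return runs

def tiered_to_leveled(runs):
    # The prefix of single-SST runs becomes L0; each remaining run
    # becomes one of L1..Lmax.
    i = 0
    while i < len(runs) and len(runs[i]) == 1:
        i += 1
    return [r[0] for r in runs[:i]], [list(r) for r in runs[i:]]

# The fanout=8 example from the text: 4 SSTs in L0, then L1, L2 and L3
# range partitioned into 4, 32 and 256 SSTs.
l0 = ['L0.%d' % i for i in range(1, 5)]
levels = [['L1.%d' % i for i in range(1, 5)],
          ['L2.%d' % i for i in range(1, 33)],
          ['L3.%d' % i for i in range(1, 257)]]
runs = leveled_to_tiered(l0, levels)
print(len(runs))                                 # 7 sorted runs
assert tiered_to_leveled(runs) == (l0, levels)   # a lossless relabeling
```

The round trip only relabels metadata, which is the sense in which the file structure is orthogonal to the choice of compaction algorithm.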
But that has unstated assumptions about how sorted runs are selected for compaction. With such assumptions the number of levels in the LSM tree is the same whether leveled or tiered is used, but the write-amplification is different. The number of levels is log<sub>fanout</sub>(R) where R is size(Lmax) / size(L1). With leveled the worst-case per-level write-amp is equal to fanout and with tiered it is 1. The unstated assumptions are:<br /><ol><li>Sorted runs to be merged all come from the same level.</li><li>The fanout determines the number of sorted runs per level. When fanout=8 then each level is full with 8 sorted runs and a merge is started.</li><li>When each level is full then each level has fanout times more data than the previous level.</li><li>The output of a merge is written to the next larger level. Perhaps this rule can be broken when the size of the output is not large enough. For example assume the fanout is 8, the target size for sorted runs on L1 is 100 and the target size for sorted runs on the L2 is 800. When 8 sorted runs from L1 are merged, on which level should the output be written if the output has size 100 or size 200?</li><li>The max level has fanout sorted runs, which implies that space amplification is fanout which is too large for me. I prefer to limit the max level to 1 sorted run which also increases the total write-amp. The space-amp can be further reduced by increasing the fanout between the next-to-max and max levels. I am curious whether existing LSMs can be configured to limit the max level to 1 sorted run to limit the worst-case space-amplification to 2 (ignoring transient space-amp during compaction) and a recent paper, <a href="https://stratos.seas.harvard.edu/publications/dostoevsky-better-space-time-trade-offs-lsm-tree-based-key-value-stores">Dostoevsky by Dayan and Idreos</a>, claims they cannot.</li></ol>The example below is a vertical depiction of tiered compaction with 4 sorted runs per level. 
If fanout=4 then each level is full. The sorted runs in levels 1, 2 and 3 are L1.r1 to L1.r4, L2.r1 to L2.r4 and L3.r1 to L3.r4. Each sorted run can be an SST or a sequence of range partitioned SSTs. A sorted run in level N+1 is approximately fanout times larger than a sorted run in level N. The sizes of the boxes below don't imply the sizes of the sorted runs (my drawing skills are limited).<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-I2EO5jkRrHQ/W1oIPY5Q0WI/AAAAAAAAXbM/fRAfmEoLPNIIDKePEXoOWGeCw98TKeiSgCLcBGAs/s1600/image%2B%25282%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="152" data-original-width="454" height="133" src="https://3.bp.blogspot.com/-I2EO5jkRrHQ/W1oIPY5Q0WI/AAAAAAAAXbM/fRAfmEoLPNIIDKePEXoOWGeCw98TKeiSgCLcBGAs/s400/image%2B%25282%2529.png" width="400" /></a></div><br /><b>Adaptive Compaction</b><br /><br />A quick summary for adaptive compaction is that it uses the vertical depiction for tiered but each level can have a different target for the number of sorted runs -- from 1 to fanout.<br /><br />First, let me show the LSM tree when the max level is constrained to one sorted run so that the worst-case space-amplification is <= 2, ignoring temp space during compaction. Each level has <= fanout sorted runs. A sorted run is either an SST or range partitioned into many SSTs. A level with 2+ sorted runs is write optimized. A level with 0 or 1 sorted runs is read optimized. 
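A rough calculation shows why constraining the max level matters (a toy model that assumes a run on level i is fanout times larger than a run on level i-1, and that in the worst case every run is a stale copy of the same data):

```python
# Rough worst-case space-amplification for tiered compaction. Assumes a
# sorted run on level i is fanout times larger than one on level i-1 and
# that, in the worst case, all runs are copies of the same data.

def space_amp(runs_per_level, fanout):
    run_size = [fanout ** i for i in range(len(runs_per_level))]
    total = sum(r * s for r, s in zip(runs_per_level, run_size))
    return total / run_size[-1]   # relative to one max-level run

# fanout=4, 3 levels: a full max level vs a max level with 1 sorted run.
print(space_amp([4, 4, 4], 4))   # 5.25 -- grows with the fanout
print(space_amp([4, 4, 1], 4))   # 2.25 -- close to the bound of 2
```

The smaller levels add a bit over the bound of 2; increasing the fanout into the max level shrinks that extra term, as noted earlier.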
The sizes of the boxes below don't imply the sizes of the sorted runs (my drawing skills are limited).<br /><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-xlORwh2tl7Q/W1oJB9FtMUI/AAAAAAAAXbU/cS2_7IK4ddE6o5qp3NOAQYqucn6gsSfmACLcBGAs/s1600/image%2B%25283%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="152" data-original-width="457" height="132" src="https://3.bp.blogspot.com/-xlORwh2tl7Q/W1oJB9FtMUI/AAAAAAAAXbU/cS2_7IK4ddE6o5qp3NOAQYqucn6gsSfmACLcBGAs/s400/image%2B%25283%2529.png" width="400" /></a></div><div><br /><div>Adaptive compaction:<br /><ol><li>Uses the vertical depiction of the LSM tree to constrain the compaction steps that can occur.</li><li>Makes cost-based decisions during compaction to make the LSM tree more efficient for both the short-term and long-term workload. This is done one compaction step at a time. This is called <i>adaptive compaction</i> because it adapts the shape of the LSM tree to the workload.</li><li>The decisions are 1) whether a level should be tiered or leveled and 2) the per-level fanout for the level. When a level is tiered then the per-level fanout determines the number of sorted runs on that level. When a level is leveled then the per-level fanout determines the size ratio between it and adjacent levels.</li><li>Decisions respect constraints including the maximum space-amplification (both temporary — during compaction, and permanent — after compaction), write-amplification, number of levels and number of sorted runs. Constraints allow limits for the worst-case read and write efficiency. A possible constraint is the max number of sorted runs for the max level. 
Constraints can also include hints that optimization should favor point reads, range reads or writes.</li></ol>Compaction is a sequence of compaction steps and at each compaction step adaptive compaction makes the LSM tree more or less read optimized to be more efficient for both the current and long-term workload. Note that workloads can have daily cycles so that optimizing for the current workload during daily backup or bulk load might not be the best decision when things return to normal in a few hours.<br /><br />There are four types of compaction steps that can be scheduled. Some make the LSM tree more read-optimized, some make the LSM more write-optimized.<br /><ol><li>tiered - this merges 2+ sorted runs from level N and writes the output to level N or level N+1 depending on the size of the output. Regardless, level N becomes read optimized. Level N+1 remains or becomes write optimized.</li><li>tiered+leveled - this merges 2+ sorted runs from level N with 1 sorted run from level N+1 and writes the output to level N+1. For now I assume this cannot be used when level N+1 has more than 1 sorted run. When the compaction step finishes level N+1 remains read optimized and level N becomes read optimized.</li><li>leveled.part - this merges one SST from a sorted run on level N with overlapping SSTs from a sorted run on level N+1. This cannot be used when levels N or N+1 have more than one sorted run. This leaves levels N and N+1 as read optimized. For now I ignore the details of minimizing overlaps between a sorted run in level N and sorted runs in level N+1.</li><li>leveled.all - this merges the sorted run in level N with the sorted run in level N+1 and writes the output to level N+1. This cannot be used when levels N or N+1 have more than one sorted run. This leaves levels N and N+1 as read optimized. 
Unlike leveled.part this doesn't require minimizing overlap.</li></ol><div><b>Examples of adaptive compaction</b></div></div></div><div><br /></div><div>The smaller levels of the LSM tree have the most churn. Therefore they adapt faster as the current workload changes. They can also be fixed faster when the workload shifts from read to write heavy. Optimizations of persistent data structures have risk — which is the time to undo those optimizations when things change. There is much more risk for optimizations done for Lmax than for L1.<br /><br />A simple example of adaptive compaction is changing L1 and then L2 from having one sorted run to many sorted runs when the current workload shifts from read to write heavy. Assume that the LSM tree starts out with all levels read-optimized (one sorted run per level) and has been using either leveled.part or leveled.all compaction steps. Compaction to level 1 is special; I have yet to mention it, but assume there is a write buffer and leveled.all compaction steps have been used to merge it with level 1.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-FTsSlLpeOR8/W1oKeEj8uiI/AAAAAAAAXbg/Y5d9xorFJWoT6uwopzi6lyaKY_aZcdkEwCLcBGAs/s1600/image%2B%25284%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="209" data-original-width="454" height="183" src="https://1.bp.blogspot.com/-FTsSlLpeOR8/W1oKeEj8uiI/AAAAAAAAXbg/Y5d9xorFJWoT6uwopzi6lyaKY_aZcdkEwCLcBGAs/s400/image%2B%25284%2529.png" width="400" /></a></div><div><br /></div><div>Then there is a burst of writes and L1 switches from read to write optimized with 4 sorted runs. Write buffer flushes stop using leveled.all and just write new sorted runs into L1. 
Compaction steps into L2, L3 and L4 continue using leveled.all or leveled.part.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-ZINO2yjbU-A/W1oKrMZHu5I/AAAAAAAAXbk/i6IfAs4FOTcHKUdo7xPnqcgzozLIgM7wACLcBGAs/s1600/image%2B%25285%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="209" data-original-width="454" height="183" src="https://3.bp.blogspot.com/-ZINO2yjbU-A/W1oKrMZHu5I/AAAAAAAAXbk/i6IfAs4FOTcHKUdo7xPnqcgzozLIgM7wACLcBGAs/s400/image%2B%25285%2529.png" width="400" /></a></div><div><br /></div><div>Then the burst of writes continues and L2 switches from read to write optimized with 4 sorted runs. Write buffer flushes continue to create new sorted runs in L1. Compaction into L2 begins to use tiered compaction steps.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-jRTQnqvy_ZE/W1oK8C20msI/AAAAAAAAXbw/hppH1tYcqwoy-Xtif9ESkRqwK_3gmh4PgCLcBGAs/s1600/image%2B%25286%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="209" data-original-width="454" height="183" src="https://4.bp.blogspot.com/-jRTQnqvy_ZE/W1oK8C20msI/AAAAAAAAXbw/hppH1tYcqwoy-Xtif9ESkRqwK_3gmh4PgCLcBGAs/s400/image%2B%25286%2529.png" width="400" /></a></div><div><br /></div><div>Eventually the write burst ends and L2 changes from write to read optimized by using a tiered+leveled compaction step from L1 to L2. The result is 1 sorted run in L2 and 0 sorted runs in L1. 
Then a write buffer flush creates a sorted run in L1.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-bBsseTGGtxw/W1oLIY9NNlI/AAAAAAAAXb0/YzmLLZrPNOUeipdVzWJU2jL-rzC4Q1ljwCLcBGAs/s1600/image%2B%25287%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="209" data-original-width="454" height="183" src="https://2.bp.blogspot.com/-bBsseTGGtxw/W1oLIY9NNlI/AAAAAAAAXb0/YzmLLZrPNOUeipdVzWJU2jL-rzC4Q1ljwCLcBGAs/s400/image%2B%25287%2529.png" width="400" /></a></div><div><br /></div><div><b>Open Problems</b></div><div><br />I am sure there are many; this section lists two.</div><div><br /></div><div>One interesting problem is to optimize when there is skew in the workload. The workload is usually not uniform across the key space. Some ranges of the key space might need optimization for writes and point queries. Another range might benefit from optimization for range queries. The work proposed above doesn't handle this any better than leveled or tiered compaction today. <br /><br />In some cases we can isolate different workloads to different column families, and with MyRocks that means a column family per index. But that still doesn't handle the case when the workload has variance within one index. Even so, this is a first step.</div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com2