My old NUC cluster found a new home and I downsized to 2 new NUC servers. The new server is the NUC8i7BEH with 16GB RAM, a 500GB Samsung 860 EVO for the OS and a 500GB Samsung 970 EVO for performance. The Samsung 860 is SATA and the Samsung 970 is an M.2 device. I expect to wear out the performance devices, as I have done in the past. With the OS on a separate device I avoid the need to reinstall the OS when that happens.
The new NUC has a post-Skylake CPU (i7-8559U) that provides 4 cores (8 HW threads) compared to 2 cores (4 HW threads) in the old NUCs. I disabled turbo boost again to avoid performance variance, as mentioned in the old post. I am not sure these have sufficient cooling for sustained boost, and when boost isn't sustained there are frequent changes in CPU performance. I also disabled hyperthreading, both out of concern for the impact from Spectre fixes and to avoid a different syscall overhead each time I update the kernel.
I might use these servers to examine the impact of the ~10x increase in PAUSE times on InnoDB with and without HT enabled. I might also use them for another round of MySQL performance testing when 8.0.14 is released.
I am a big fan of Intel NUC servers. But maybe I am not a fan of the SATA cables they use. I already had one of my old NUCs replaced under warranty after one of the SATA wires went bare. In the new NUCs I just set up, a few of the SATA cables appear to be cut and I wonder if they will eventually go bare too.
Monday, December 17, 2018
Friday, December 14, 2018
LSM math: size of search space for LSM tree configuration
I have written before and will write again about using 3-tuples to explain the shape of an LSM tree. This makes it easier to explain the configurations supported today and configurations we might want to support tomorrow, in addition to traditional tiered and leveled compaction. The summary is that an LSM tree has n levels labeled from L1 to Ln, and Lmax is another name for Ln. There is one 3-tuple per level and the components of the 3-tuple are (type, fanout, runs) for Lk (level k) where:
- type is Tiered or Leveled and explains compaction into that level
- fanout is the size of a sorted run in Lk relative to a sorted run from Lk-1, a real and >= 1
- runs is the number of sorted runs in that level, an integer and >= 1
Given the above, how many valid configurations exist for an LSM tree? There are additional constraints that can be imposed on the 3-tuple but I will ignore most of them, except for limiting fanout and runs to be <= 20. The answer is easy: there are an infinite number of configurations because fanout is a real.
The question is more interesting when fanout is limited to an integer and the number of levels is limited to between 1 and 10. I am doing this to explain the size of the search space but I don't think that fanout should be limited to an integer.
There are approximately 2^11 configurations when only considering compaction type, which has 2 values, and 1 to 10 levels, because there are 2^N configurations of compaction types for a tree with N levels and the sum 2^1 + 2^2 + ... + 2^9 + 2^10 = 2^11 - 2.
But when type, fanout and runs are considered then there are 2 x 20 x 20 = 800 choices per level and 800^N combinations for an LSM tree with N levels. Considering LSM trees with 1 to 10 levels then the number of valid configurations is the sum 800^1 + 800^2 + ... + 800^9 + 800^10. That is a large number of configurations if exhaustive search were to be used to find the best configuration. Note that I don't think exhaustive search should be used.
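To make the counting concrete, here is a short Python sketch of the arithmetic above. The function names are mine, not from any LSM implementation:

```python
# Count LSM tree configurations with 1 to 10 levels under the constraints
# above: compaction type has 2 values, fanout and runs are integers in [1, 20].

def per_level_choices(types=2, max_fanout=20, max_runs=20):
    # one 3-tuple (type, fanout, runs) per level
    return types * max_fanout * max_runs

def type_only_configs(max_levels=10):
    # 2^N compaction-type strings for a tree with N levels
    return sum(2 ** n for n in range(1, max_levels + 1))

def total_configs(max_levels=10):
    c = per_level_choices()  # 800 choices per level
    return sum(c ** n for n in range(1, max_levels + 1))

print(per_level_choices())   # 800
print(type_only_configs())   # 2046 = 2^11 - 2
print(total_configs())       # 800^1 + ... + 800^10, roughly 1.07e29
```

The last number shows why exhaustive search over this space is not practical, which is the point of the post.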
Thursday, December 13, 2018
LSM math: how many levels minimizes write amplification?
How do you configure an LSM tree with leveled compaction to minimize write amplification? For a given number of levels, write-amp is minimal when the same fanout (growth factor) is used between all levels, but that does not explain the number of levels to use. In this post I answer that question.
1. The number of levels that minimizes write-amp is one of ceil(ln(T)) or floor(ln(T)), where T is the total fanout: sizeof(database) / sizeof(memtable).
2. When #1 is done, the per-level fanout is e when the number of levels is ln(T) and a value close to e when the number of levels is an integer.
Introduction
I don't recall reading this result elsewhere, but I am happy to update this post with a link to such a result. I was encouraged to answer this after a discussion with the RocksDB team, and I thank Siying Dong for stating #2 above while leaving the math to me. I assume the original LSM paper didn't address this problem because that system used a fixed number of levels.
One result from the original LSM paper, updated by me, is that write-amp is minimized when the per-level growth factor is constant. Sometimes I use fanout or per-level fanout rather than per-level growth factor. In RocksDB the option name is max_bytes_for_level_multiplier. Yes, this can be confusing. The default fanout in RocksDB is 10.
Math
I solve this for pure-leveled compaction, which differs from what RocksDB calls leveled. In pure-leveled, all levels use leveled compaction. In RocksDB leveled, the first level, L0, uses tiered and the other levels use leveled. I started to explain this here, where I claim that RocksDB leveled is really tiered+leveled. But I am not asking for them to change the name.
Assumptions:
- LSM tree uses pure-leveled compaction, and compaction from memtable flushes into the first level of the LSM tree uses leveled compaction
- total fanout is T and is size(Lmax) / size(memtable), where Lmax is the max level of the LSM tree
- workload is update-only, so the number of keys in the database is fixed
- workload has no write skew and all keys are equally likely to be updated
- per-level write-amp == per-level growth factor. In practice and in theory the per-level write-amp tends to be less than the per-level growth factor.
- total write-amp is the sum of the per-level write-amps. I ignore write-amp from the WAL.
Specify the function for write-amp and determine critical points
# wa is the total write-amp
# n is the number of levels
# per-level fanout is the nth root of the total fanout
# per-level fanout = per-level write-amp
# therefore wa = number of levels * per-level fanout
wa = n * t^(1/n)
# given the function for write-amp as wa = a * b
# ... then below is a' * b + a * b'
a = n, b = t^(1/n)
wa' = t^(1/n) + n * ln(t) * t^(1/n) * (-1) * (1/n^2)
# which simplifies to
wa' = t^(1/n) - (1/n) * ln(t) * t^(1/n)
# critical point for this occurs when wa' = 0
t^(1/n) - (1/n) * ln(t) * t^(1/n) = 0
t^(1/n) = (1/n) * ln(t) * t^(1/n)
1 = (1/n) * ln(t)
n = ln(t)
When t = 1024, n = ln(1024) ~= 6.93. In this case write-amp is minimized when 7 levels are used, although 6 isn't a bad choice.
Assuming the cost function is convex (see below), the critical point is the minimum for write-amp. However, n must be an integer, so the number of levels that minimizes write-amp is one of ceil(ln(t)) or floor(ln(t)).
The graph for wa when t=1024 can be viewed thanks to Desmos. The function looks convex and I show below that it is.
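The minimum can also be checked numerically. This is my sketch, not code from the post:

```python
import math

def write_amp(t, n):
    # total write-amp = number of levels * per-level fanout, fanout = t^(1/n)
    return n * t ** (1.0 / n)

t = 1024
# search integer level counts and keep the one with the smallest write-amp
best_n = min(range(1, 21), key=lambda n: write_amp(t, n))
print(best_n, math.log(t))   # best_n is 7, closest integer to ln(1024) ~= 6.93
print(write_amp(t, 6), write_amp(t, 7), write_amp(t, 8))
```

The values for n = 6, 7 and 8 are close to each other, which matches the claim that 6 levels isn't a bad choice.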
Determine whether critical point is a min or max
The critical point found above is a minimum for wa if wa is convex so we must show that the second derivative is positive.
wa = n * t^(1/n)
wa' = t^(1/n) - (1/n) * ln(t) * t^(1/n)
wa' = t^(1/n) * (1 - (1/n) * ln(t))
# assuming wa' is a * b then wa'' is a' * b + a * b'
a = t^(1/n)
a' = ln(t) * t^(1/n) * (-1) * (1/n^2)
a' = -ln(t) * t^(1/n) * (1/n^2)
b = 1 - (1/n) * ln(t)
b' = (1/n^2) * ln(t)
# a' * b
-ln(t) * t^(1/n) * (1/n^2) -> called x below
+ln(t) * ln(t) * (1/n^3) * t^(1/n) -> called y below
# b' * a
t^(1/n) * (1/n^2) * ln(t) -> called z below
# therefore wa'' = x + y + z
# note that x, y and z all contain: t^(1/n), 1/n and ln(t)
wa'' = t^(1/n) * (1/n) * ln(t) * (-(1/n) + (ln(t) * 1/n^2) + (1/n))
# the -(1/n) and +(1/n) terms cancel (x and z cancel), leaving
wa'' = t^(1/n) * (1/n) * ln(t) * (ln(t) * 1/n^2)
wa'' = t^(1/n) * (1/n^3) * ln(t)^2
Therefore wa'' is positive, wa is convex, and the critical point is a minimum value for wa.
Solve for per-level fanout
The next step is to determine the value of the per-level fanout when write-amp is minimized. If the number of levels doesn't have to be an integer then this occurs when ln(t) levels are used, and below I show that the per-level fanout is e in that case. When the number of levels is limited to an integer, the per-level fanout that minimizes write-amp is a value close to e.
# total write-amp is number of levels * per-level fanout
wa = n * t^(1/n)
# The per-level fanout is t^(1/n) and wa is minimized when n = ln(t)
# Therefore we show that t^(1/n) = e when n = ln(t)
Assume t^(1 / ln(t)) = e
ln (t^(1 / ln(t))) = ln e
(1 / ln(t)) * ln(t) = 1
1=1
When t = 1024, ln(t) ~= 6.93. With 7 levels the per-level fanout is t^(1/7) ~= 2.69, while e ~= 2.72.
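A quick numeric check of that claim, again my sketch rather than code from the post:

```python
import math

t = 1024
n = round(math.log(t))    # ln(1024) ~= 6.93, so use 7 levels
fanout = t ** (1.0 / n)   # per-level fanout with an integer level count
print(round(fanout, 2))   # 2.69
print(round(math.e, 2))   # 2.72
```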
Saturday, December 1, 2018
Pixelbook review
This has nothing to do with databases. This is a review of a Pixelbook (Chromebook laptop) that I got on sale last month. This one has a Core i5, 8GB RAM and 128GB of storage. It runs Linux too, but I haven't done much with that. I expected a lot from this given that my 2013 Nexus 7 tablet is still awesome. I have been mostly happy with the laptop, but if you care about keyboards and don't like the new Macs thanks to the butterfly keyboard then this might not be the laptop for you. My complaints:
- keyboard is hard to read. It is grey on grey and too hard to read when there is light at my back, even with the backlight turned all the way up. I don't get it: grey on grey. So this is a great device for using in a dark room or for improving your touch typing skills.
- touchpad control is too coarse-grained, so it is either too fast or too slow. The settings have 5 values via a slider (1=slowest, 5=fastest). I have been using it at 3, which is a bit too fast for me, while 2 is a bit too slow. I might go back to 2, but that means picking up my finger more frequently when moving the pointer across the screen.
- no iMessage: my family uses Apple devices and I can't run that here as I can on a Mac laptop.
- the "" ey is flay -> the "k" key is flaky -> now the spacebar is flaky. Keys go bad for a few days, then get better, repeat. Ugh, this is one-off Google hardware. Maybe they don't want Apple and the butterfly keyboard to have all the fun. Fortunately I bought from an authorized reseller (Best Buy) so the 1-year warranty should apply.
- the charger failed; fortunately that is easy to replace.
Monday, November 19, 2018
Review of TRIAD: Creating Synergies Between Memory, Disk and Log in Log-Structured Key-Value Stores
This is a review of TRIAD, which was published in USENIX ATC 2017. It explains how to reduce write amplification for RocksDB leveled compaction, although the ideas are useful for many LSM implementations. I share a review here because the paper has good ideas. It isn't easy to keep up with all of the LSM research, even when limiting the search to papers that reference RocksDB, and I didn't notice this paper until recently.
TRIAD reduces write amplification for an LSM with leveled compaction, and with a variety of workloads it gets up to 193% more throughput, up to 4X less write amplification and spends up to 77% less time doing compaction and flush. Per the RUM Conjecture, improvements usually come at a cost, and the cost in this case is more cache amplification (more memory overhead per key) and possibly more read amplification. I assume this is a good tradeoff in many cases.
The paper explains the improvements via 3 components - TRIAD-MEM, TRIAD-DISK and TRIAD-LOG - that combine to reduce write amplification.
TRIAD-MEM
TRIAD-MEM reduces write-amp by keeping frequently updated keys (hot keys) in the memtable. It divides keys in the memtable into two classes: hot and cold. On flush the cold keys are written into a new L0 SST while the hot keys are copied over to the new memtable. The hot keys must be written again to the new WAL so that the old WAL can be dropped. TRIAD-MEM tries to keep the K hottest keys in the memtable and there is work in progress to figure out a good value for K without being told by the DBA.
An extra 4 bytes/key is used in the memtable to track write frequency and identify hot keys. Note that RocksDB already uses 8 bytes/key for metadata. So TRIAD-MEM has a cost in cache-amp, but I don't think that is a big deal.
Assuming the per-level write-amp from the memtable flush is 1, this reduces it to 0 in the best case where all keys are hot.
TRIAD-DISK
TRIAD-DISK reduces write-amp by delaying L0:L1 compaction until there is sufficient overlap between keys to be compacted. TRIAD continues to use an L0:L1 compaction trigger based on the number of files in the L0, but can trigger compaction earlier when there is probably sufficient overlap between the L0 and L1 SSTs.
Overlap is estimated via HyperLogLog (HLL), which requires 4kb/SST. It is estimated as the following, where file-i is the i-th SST under consideration, UniqueKeys is the estimated number of distinct keys across all of the SSTs and Keys(file-i) is the number of keys in the i-th SST. The paper states that both UniqueKeys and Keys are approximated using HLL, but I assume that per-SST metadata already has an estimate or exact value for the number of keys in the SST. The formula for overlap is:
UniqueKeys(file-1, file-2, ... file-n) / sum(Keys(file-i))
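Here is a sketch of that formula in Python, using exact key sets where TRIAD uses HLL sketches, so the counts are exact rather than approximate. The function and variable names are mine:

```python
def overlap(ssts):
    # ssts: one set of keys per SST under consideration
    unique_keys = len(set().union(*ssts))    # UniqueKeys(file-1, ... file-n)
    total_keys = sum(len(s) for s in ssts)   # sum(Keys(file-i))
    return unique_keys / total_keys

# identical SSTs: heavy overlap, ratio is 1/n
print(overlap([{1, 2, 3}, {1, 2, 3}]))   # 0.5
# disjoint SSTs: no overlap, ratio is 1.0
print(overlap([{1, 2, 3}, {4, 5, 6}]))   # 1.0
```

A ratio near 1/n suggests the inputs share most of their keys, so early compaction removes a lot of duplicates for the bytes it rewrites; a ratio near 1.0 means the inputs are mostly disjoint and early compaction mostly just rewrites data.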
The benefit from early L0:L1 compaction is less read-amp, because there will be fewer sorted runs to search on a query. The cost from always doing early compaction is more per-level write-amp, which is estimated by size(L1 input) / size(L0 input). TRIAD-DISK provides the benefit with less cost.
In RocksDB today you can manually schedule early compaction by setting the trigger to 1 or 2 files, or schedule it to be less early with a trigger set to 8 or more files. But this setting is static. TRIAD-DISK uses a cost-based approach to do early compaction when it won't hurt the per-level write-amp. This is an interesting idea.
TRIAD-LOG
TRIAD-LOG explains improvements to memtable flush that reduce write-amp. Data in an L0 SST has recently been written to the WAL, so they use the WAL in place of writing the L0 SST. But something extra, an index into the WAL, is written on memtable flush because everything in the L0 must have an index. The WAL used as an SST (called the CL-SST, for commit-log SST) will be deleted when it is compacted into the L1.
There is cache-amp from TRIAD-LOG. Each key in the CL-SST (L0), and maybe in the memtable, needs 8 extra bytes: 4 bytes for the CL-SST ID and 4 bytes for the WAL offset.
Assuming the per-level write-amp from the memtable flush is 1 for cold keys, this reduces that to 0.
Reducing write amplification
The total write-amp for an LSM tree with leveled compaction is the sum of:
- writing the WAL = 1
- memtable flush = 1
- L0:L1 compaction ~= size(L1) / size(L0)
- Ln compaction for n>1 ~= fanout, the per-level growth factor, usually 8 or 10. Note that this paper explains why it is usually a bit less than fanout.
TRIAD avoids the write-amp from memtable flush thanks to TRIAD-MEM for hot keys and TRIAD-LOG for cold keys. I will wave my hands and suggest that TRIAD-DISK reduces write-amp for L0:L1 from 3 to 1 based on the typical LSM configuration I use. So TRIAD reduces the total write-amp by 3 (1 from the flush plus 2 from L0:L1).
Reducing total write-amp by 3 is a big deal when the total write-amp for the LSM tree is small, for example <= 10. But that only happens when there are few levels beyond the L1. Assuming you accept my estimate for total write-amp above, the per-level write-amp is ~8 for both L1:L2 and L2:L3. The total write-amp for an LSM tree without TRIAD would be 1+1+3+8 = 13 if the max level is L2 and 1+1+3+8+8 = 21 if the max level is L3. TRIAD then reduces that from 13 to 10 or from 21 to 18.
But my write-amp estimate above is more true for workloads without skew and less true for workloads with skew. Many of the workloads tested in the paper have a large amount of skew. So while I have some questions about the paper, I am not claiming they are doing it wrong. What I am claiming is that the benefit from TRIAD is significant when total write-amp is small and less significant otherwise. Whether this matters is workload dependent. It would help to know more about the LSM tree from each benchmark. How many levels were in the LSM tree per benchmark? What is the per-level write-amp with and without TRIAD? Most of this can be observed from compaction statistics provided by RocksDB. The paper has some details on the workloads, but that isn't sufficient to answer the questions above.
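The arithmetic above, as a small Python check. The per-level estimates are mine, from the typical configuration mentioned earlier:

```python
# write-amp components: WAL, memtable flush, L0:L1, then ~8 per larger level
def total_write_amp(levels_beyond_l1, wal=1, flush=1, l0_l1=3, per_level=8):
    return wal + flush + l0_l1 + per_level * levels_beyond_l1

print(total_write_amp(1))   # 13 when the max level is L2
print(total_write_amp(2))   # 21 when the max level is L3
# TRIAD drops the flush (1) and cuts L0:L1 from 3 to 1, saving 3 in total
print(total_write_amp(1) - 3, total_write_amp(2) - 3)   # 10 18
```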
Questions
The paper documents the memory overhead, but limits the definition of read amplification to IO and measured none. I am interested in both IO and CPU, and suspect there might be some CPU read-amp from using the commit-log SST in the L0, both for searches and during compaction, as logically adjacent data is no longer physically adjacent in the commit-log SST.
Impact of more levels?
Another question is how far down the LSM tree compaction occurs. For example, if the write working set fits in the L2, should compaction stop at the L2? It might with some values of compaction priority in RocksDB, but it doesn't for all. When the workload has significant write skew then the write working set is likely to fit into one of the smaller levels of the LSM tree.
An interesting variant on this is a workload with N streams of inserts that are each appending (right growing). When N=1 there is an optimization in RocksDB that limits writeamp to 2 (one for WAL, one for SST). I am not aware of optimizations in RocksDB for N>2 but am curious if we could do something better.
Friday, November 2, 2018
Converting an LSM to a B-Tree and back again
I wonder if it is possible to convert an LSM to a B-Tree. The goal is to do it online and in place, so I don't want two copies of the database while the conversion is in progress. I am interested in data structures for data management that adapt dynamically to improve performance and efficiency for a given workload.
Workloads change in the short and long term. I hope that data structures can adapt to the change, and converting between an LSM and a B-Tree is one way to adapt. This is more likely to be useful when the data structure supports some kind of partitioning, in the hope that different workloads can be isolated to different partitions, and then some can use an LSM while others use a B-Tree.
LSM to B-Tree
A B-Tree is one tree. An LSM is a sequence of trees. Each sorted run in the LSM is a tree. With leveled compaction in RocksDB there are a few sorted runs in level 0 (L0) and then one sorted run in each of L1, L2 up to the max level (Lmax).
A B-Tree persists changes by writing back pages, either in place or copy-on-write (UiP or CoW). An LSM persists changes by writing and then rewriting rows. I assume that page write-back is required if you want to limit the database to one tree, and that row write-back implies there will be more than one tree.
There are two things that must be done online and inplace:
- Convert the LSM from many trees to one tree
- Convert from row write-back to page write-back
Note that my goal has slightly changed. I want to move from an LSM to a data structure with one tree. For the one-tree solution a B-Tree is preferred but not required.
The outline of a solution:
1. Reconfigure the LSM to use 2 levels (L0 and L1) and 3 trees: memtable, L0 and L1.
2. Disable the L0. At this point the LSM has two trees: memtable and L1.
3. Flush the memtable and merge it into the L1. Now there is one tree.
4. After the flush, disable the memtable and switch to a page cache. Changes now require a copy of the L1 block in the page cache that eventually gets written back via UiP or CoW.
The outline above doesn't explain how to maintain indexes for the L1. Note that after step 2 there is one tree on disk, and the layout isn't that different from the leaf level of a B-Tree. The interior levels of the B-Tree could be created by reading and rewriting the block indexes stored in the SSTs.
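The collapse from many trees to one can be illustrated with a toy Python sketch. Sorted runs are plain lists of (key, value) pairs here and a real implementation merges SST files and a memtable, so this only shows the logical step, with the newest version of a key winning as in an LSM read:

```python
def collapse(runs):
    # runs are ordered newest first, as in an LSM read path:
    # memtable, then L0, then L1
    merged = {}
    for run in reversed(runs):      # apply oldest first so newer overwrites
        merged.update(dict(run))
    return sorted(merged.items())   # the single remaining tree

memtable = [("b", 2), ("c", 3)]
l0 = [("a", 1), ("b", 0)]
l1 = [("a", 0), ("d", 4)]
print(collapse([memtable, l0, l1]))
# [('a', 1), ('b', 2), ('c', 3), ('d', 4)]
```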
B-Tree to LSM
The conversion can also be done in the opposite direction (B-Tree to LSM):
1. Treat the current B-Tree as the max level of the LSM tree. While it might help to flush the page cache, I don't think that is required. This is easier to do when your LSM uses a B-Tree per level, as done by WiredTiger.
2. Record new changes for insert, update and delete in a memtable rather than a page cache.
3. When the memtable is full, flush it to create a new tree (sorted run, SST) on disk.
4. Eventually start to do compaction.
Friday, October 19, 2018
Combining tiered and leveled compaction
There are simple optimization problems for LSM tuning. For example use leveled compaction to minimize space amplification and use tiered to minimize write amplification. But there are interesting problems that are harder to solve:
Tiered+leveled
I defined tiered+leveled and leveledN in a previous post. They occupy the middle ground between tiered and leveled compaction with better read efficiency than tiered and better write efficiency than leveled. They are not supported today by popular LSM implementations but I think they can and should be supported.
While we tend to explain compaction as a property of an LSM tree (all tiered or all leveled) it is really a property of a level of an LSM tree and RocksDB already supports hybrids, combinations of tiered and leveled. For tiered compaction in RocksDB all levels except the largest use tiered. The largest level is usually configured to use leveled to reduce space amp. For leveled compaction in RocksDB all levels except the smallest use leveled and the smallest (L0) uses tiered.
So tiered+leveled isn't new but I think we need more flexibility. When a string of T and L is created from the perlevel compaction choices then the regex for the strings that RocksDB supports is T+L or TL+. I want to support T+L+. I don't want to support cases where leveled is used for a smaller level and tiered for a larger level. So I like TTLL but not LTTL. My reasons for not supporting LTTL are:
So in general I don't support T after L but I do support it in the special case. Of course we can pretend the special case doesn't exist if we use the syntactic sugar provided by leveledN. But I appreciate that Maysam discovered this.
 maximize throughput given a constraint on write and/or space amplification
 minimize space and/or write amplification given a constraint on read amplification
To solve the first problem use leveled compaction if it can satisfy the write-amp constraint, else use tiered compaction if it can satisfy the space-amp constraint, otherwise there is no solution. The lack of a solution might mean the constraints are unreasonable, but it can also mean we need to enhance LSM implementations to support more diversity in LSM tree shapes. Even when there is a solution using leveled or tiered compaction, there are configurations that would do much better were an LSM to support more varieties of tiered+leveled and leveled-N.
When I wrote solved above I left out that there is more work to find a solution even when tiered or leveled compaction is used. For both there are decisions about the number of levels and the per-level fanout. If minimizing write-amp is the goal then that is a solved problem. But there are usually more things to consider.
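The decision procedure above can be sketched in Python. The cost models here (write-amp ~ levels * per-level fanout for leveled, write-amp ~ number of levels for tiered, and the specific space-amp values) are my illustrative assumptions, not numbers from any real system:

```python
# Sketch: pick a compaction type that satisfies write-amp and space-amp
# constraints, per the decision procedure in the text. Cost models are
# simplified and illustrative only.
def choose_compaction(total_fanout, levels, max_write_amp, max_space_amp):
    fanout = total_fanout ** (1.0 / levels)    # equal per-level fanout
    leveled_write_amp = levels * fanout        # ~fanout per level
    leveled_space_amp = 1.1                    # assumed small for leveled
    tiered_write_amp = levels                  # ~1 per level
    tiered_space_amp = fanout                  # runs on the max level
    if leveled_write_amp <= max_write_amp and leveled_space_amp <= max_space_amp:
        return "leveled"
    if tiered_write_amp <= max_write_amp and tiered_space_amp <= max_space_amp:
        return "tiered"
    return None  # no solution with plain leveled or tiered

# Leveled fits a generous write-amp budget; tiered fits a tight write-amp
# budget only when its larger space-amp is acceptable.
assert choose_compaction(1000, 3, 40, 2) == "leveled"
assert choose_compaction(1000, 3, 10, 20) == "tiered"
assert choose_compaction(1000, 3, 10, 2) is None
```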
Tiered+leveled
I defined tiered+leveled and leveled-N in a previous post. They occupy the middle ground between tiered and leveled compaction, with better read efficiency than tiered and better write efficiency than leveled. They are not supported today by popular LSM implementations but I think they can and should be supported.
While we tend to explain compaction as a property of an LSM tree (all tiered or all leveled), it is really a property of a level of an LSM tree, and RocksDB already supports hybrids, combinations of tiered and leveled. For tiered compaction in RocksDB all levels except the largest use tiered. The largest level is usually configured to use leveled to reduce space-amp. For leveled compaction in RocksDB all levels except the smallest use leveled and the smallest (L0) uses tiered.
So tiered+leveled isn't new but I think we need more flexibility. When a string of T and L is created from the per-level compaction choices, the regex for the strings that RocksDB supports is T+L or TL+. I want to support T+L+. I don't want to support cases where leveled is used for a smaller level and tiered for a larger level. So I like TTLL but not LTTL. My reasons for not supporting LTTL are:
 The benefit from tiered is less write-amp and that is independent of the level on which it is used. The reduction in write-amp is the same whether tiered is used for L1, L2 or L3.
 The cost from tiered is more read-amp and space-amp and that is dependent on the level on which it is used. The cost is larger for larger levels. When space-amp is 2, more space is wasted on larger levels than on smaller levels. More IO read-amp is worse for larger levels because they have a lower hit rate than smaller levels and more IO will be done. More IO implies more CPU cost from decompression and from the CPU overhead of performing IO.
From the above, the benefit from using T is the same for all levels but the cost increases for larger levels, so when T and L are both used, T (tiered) should be used on the smaller levels and L (leveled) on the larger levels.
Leveled-N
I defined leveled-N in a previous post. Since then a coworker, Maysam Yabandeh, explained to me that a level that uses leveled-N can also be described as two levels where the smaller uses leveled and the larger uses tiered. So leveled-N might be syntactic sugar in the LSM tree configuration language.
For example, with an LSM defined using the triple syntax from here as (compaction type, fanout, runs-per-level), this is valid: (T,1,8) (T,8,2) (L,8,2) (L,8,1), and it has a total fanout of 512 (8 * 8 * 8). The third level, (L,8,2), uses leveled-N with N=2. Assuming we allow LSM trees where T follows L, the leveled-N level can be replaced with two levels: (L,8,1) (T,1,8). Then the LSM tree is defined as (T,1,8) (T,8,2) (L,8,1) (T,1,8) (L,8,1). These LSM trees have the same total fanout and total read/write/space amp. Compaction from (L,8,1) to (T,1,8) is special: it has zero write-amp because it is done by a file move rather than merging/writing data, so all that must be updated is LSM metadata to record the move.
So in general I don't support T after L but I do support it in this special case. Of course we can pretend the special case doesn't exist if we use the syntactic sugar provided by leveled-N. But I appreciate that Maysam discovered this.
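The claim that the two shapes have the same total fanout can be checked with a few lines of Python (the tuple layout follows the (type, fanout, runs) triple syntax above):

```python
# Total fanout is the product of the per-level fanouts. Replacing the
# leveled-N level (L,8,2) with (L,8,1) followed by (T,1,8) preserves it.
def total_fanout(levels):
    product = 1
    for _type, fanout, _runs in levels:
        product *= fanout
    return product

with_leveled_n = [("T", 1, 8), ("T", 8, 2), ("L", 8, 2), ("L", 8, 1)]
rewritten      = [("T", 1, 8), ("T", 8, 2), ("L", 8, 1), ("T", 1, 8), ("L", 8, 1)]
assert total_fanout(with_leveled_n) == total_fanout(rewritten) == 512
```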
Wednesday, October 3, 2018
Minimizing write amplification in an LSM
Write-amplification for an LSM with leveled compaction is minimized when the per-level growth factor (fanout) is the same between all levels. This is a result for an LSM tree using a given number of levels. To find the minimal write-amplification for any number of levels this result can be repeated for 2, 3, 4, ... up to a large value. You might find that a large number of levels is needed to get the least write-amp and that comes at the price of more read-amp, as the RUM Conjecture predicts.
In all cases below I assume that compaction into the smallest level (from a write buffer flush) has no write-amp. This is done to reduce the size of this blog post.
tl;dr - for an LSM with L1, L2, L3 and L4, what values for per-level fanout minimize write-amp when the total fanout is 1000?
 (10, 10, 10) for leveled
 (6.3, 12.6, 12.6) for leveled-N, assuming two of the levels have 2 sorted runs
 (>1, >1, >1) for tiered
Minimizing write-amp for leveled compaction
For an LSM with 4 levels (L1, L2, L3, L4) there is a per-level fanout between L1:L2, L2:L3 and L3:L4. Assume this uses classic leveled compaction so the total fanout is size(L4) / size(L1). The product of the per-level fanouts must equal the total fanout. The total write-amp is the sum of the per-level write-amps. I assume that the per-level write-amp is the same as the per-level fanout, although in practice and in theory it isn't that simple. Let's use a, b and c as the variables for the per-level fanout (write-amp), then the math problem is:
 minimize a+b+c
 such that a*b*c = k and a, b, c > 1, where k is the total fanout
This result uses Lagrange Multipliers for an LSM tree with 4 levels, so there are 3 variables: a, b, c. But the math holds for an LSM tree with fewer levels or with more levels. If there are N levels then there are N-1 variables.
L(a, b, c) = a + b + c - lambda * (a*b*c - k)
dL/da = 1 - lambda * bc
dL/db = 1 - lambda * ac
dL/dc = 1 - lambda * ab
then
lambda = 1/bc = 1/ac = 1/ab
bc == ac == ab
and a == b == c to minimize the sum in #1
I wrote a Python script to discover the (almost) best values and the results match the math above.
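A minimal version of such a script might look like this; it grid-searches per-level fanouts with a*b*c == k and confirms the minimum of a + b + c is near a == b == c:

```python
# Grid search over per-level fanouts a <= b <= c with a*b*c == k,
# reporting the triple that minimizes a + b + c (total write-amp proxy).
def best_fanouts(k, step=0.1):
    best = None
    a = 1.0 + step
    while a ** 3 <= k:            # by symmetry assume a <= b <= c
        b = a
        while a * b * b <= k:     # ensures c = k/(a*b) >= b
            c = k / (a * b)
            total = a + b + c
            if best is None or total < best[0]:
                best = (total, a, b, c)
            b += step
        a += step
    return best

total, a, b, c = best_fanouts(1000)
# The minimum is near a == b == c == 10, matching the math above.
assert abs(a - 10) < 0.2 and abs(b - 10) < 0.2 and abs(c - 10) < 0.2
assert abs(total - 30) < 0.1
```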
Minimizing write-amp for tiered compaction
Assuming you can reason about tiered compaction using the notion of levels, the math changes a bit because the per-level write-amp with tiered equals 1 regardless of the per-level fanout. For tiered with 4 levels and 3 variables the problem is:
 minimize 1+1+1
 such that a*b*c = k and a, b, c > 1
Any values for a, b and c are sufficient as long as they satisfy the constraints in #2. But it still helps to minimize a+b+c if that predicts read-amp, because a, b and c are also the number of sorted runs in L2, L3 and L4. So my advice is to use a == b == c in most cases.
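A small sketch of why equal values still matter for tiered: a + b + c counts sorted runs in L2..L4, a proxy for read-amp, so a skewed split with the same total fanout checks more runs:

```python
# With tiered compaction any a, b, c with a*b*c == k gives the same
# write-amp, but a + b + c counts the sorted runs that a point read may
# have to check in L2..L4, so it is still worth minimizing.
def runs_to_check(fanouts):
    return sum(fanouts)

k = 1000
equal = (10, 10, 10)    # a == b == c
skewed = (2, 20, 25)    # same total fanout, worse read-amp
assert equal[0] * equal[1] * equal[2] == skewed[0] * skewed[1] * skewed[2] == k
assert runs_to_check(equal) < runs_to_check(skewed)  # 30 < 47
```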
Minimizing write-amp for leveled-N compaction
I explain leveled-N compaction here and here. It is like leveled compaction but allows a level to have more than one sorted run. This reduces the per-level write-amp at the cost of more read-amp. Sometimes that is a good trade.
The math above can also be used to determine how to configure per-level fanout to minimize write-amp for leveled-N. Assume an LSM tree with 4 levels (L1, L2, L3, L4) and 2 sorted runs in L2 and L3. The problem is:
 minimize a + b/2 + c/2
 such that a*b*c = k and a, b, c > 1
With leveled-N the per-level write-amp is b/2 for L2 to L3 and c/2 for L3 to L4 because there are 2 sorted runs in the compaction input (twice as much data) in those cases. Were there 3 sorted runs the values would be b/3 and c/3.
Lagrange Multipliers can be used to solve this, assuming we want to minimize a + b/2 + c/2.
L(a, b, c) = a + b/2 + c/2 - lambda * (a*b*c - k)
dL/da = 1 - lambda * bc
dL/db = 1/2 - lambda * ac
dL/dc = 1/2 - lambda * ab
then
lambda = 1/bc = 1/2ac = 1/2ab
bc == 2ac -> b == 2a
bc == 2ab -> c == 2a
2ac == 2ab -> c == b
and 2a == b == c to minimize the sum
If the total fanout is 1000 then the per-level fanout values that minimize write-amp are 10, 10, 10 for leveled and 6.30, 12.60, 12.60 for this example with leveled-N, and can be computed by "bc -l":
# for leveled-N
e(l(1000/4)/3)
6.29960524947436582381
e(l(1000/4)/3) * 2
12.59921049894873164762
# and for leveled
e(l(1000)/3)
9.99999999999999999992
One way to think of this result is that with leveled compaction the goal is to use the same per-level fanout between levels. This also gives the same per-level write-amp between levels because per-level write-amp == per-level fanout for leveled.
But with leveled-N compaction we need to adjust the per-level fanout between levels to continue to get the same per-level write-amp between levels.
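The arithmetic above can be verified in Python; a is (1000/4)^(1/3) per the bc output, and the optimal write-amp beats reusing the all-leveled fanouts:

```python
from math import isclose

# Check the leveled-N result: with total fanout 1000 and 2 sorted runs in
# L2 and L3, write-amp is a + b/2 + c/2, and the optimum (2a == b == c)
# beats reusing the all-leveled fanouts (10, 10, 10).
k = 1000.0
a = (k / 4) ** (1.0 / 3)                # 6.2996..., from e(l(1000/4)/3)
b = c = 2 * a                           # 12.599...
assert isclose(a * b * c, k, rel_tol=1e-9)  # constraint a*b*c == k holds
optimal_write_amp = a + b / 2 + c / 2   # ~18.9
naive_write_amp = 10 + 10 / 2 + 10 / 2  # 20.0
assert optimal_write_amp < naive_write_amp
```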
Tuesday, October 2, 2018
Describing tiered and leveled compaction
This is another attempt by me to define the shape of an LSM tree with more formality, building on previous posts here and here. My key point is that compaction is a property of a level in an LSM tree rather than of the LSM tree. Some levels can use tiered and others can use leveled. This combination of tiered and leveled is already done in popular LSM implementations but it hasn't been called out as a feature.
Stepped Merge
The Stepped Merge paper might have been the first description of tiered compaction. It is a way to improve B-Tree insert performance. It looked like an LSM tree with a few sorted runs at each level. When a level was full the sorted runs at that level were merged to create a larger sorted run in the next level. The per-level write-amplification was 1 because compaction into level N+1 merged runs from level N but did not read/rewrite a run already on level N+1.
This looks like tiered compaction. However it allows for N sorted runs on the max level, which means that space-amplification will be >= N. I assume that is too much for most users of tiered compaction in Cassandra, RocksDB and HBase. But this isn't a problem for Stepped Merge because it is an algorithm for buffering changes to a B-Tree, not for storing the entire database, and it doesn't lead to a large space-amp for that workload. Note that the InnoDB change buffer is a B-Tree that buffers changes to other B-Trees for a similar reason.
Compaction per level
I prefer to define compaction as a property of a level in an LSM tree rather than a property of the LSM tree. Unfortunately this can hamper discussion because it takes more time and text to explain compaction per level.
I will start with definitions:
 When a level is full then compaction is done from it to the next larger level. For now I ignore compaction across many levels, but that is a thing (see "major compaction" in HBase).
 A sorted run is a sequence of keyvalue pairs stored in key order. It is stored using 1+ files.
 A level is tiered when compaction into it doesn't read/rewrite sorted runs already in that level.
 A level is leveled when compaction into that level reads/rewrites sorted runs already in that level.
 Levels are full when they have a configurable number of sorted runs. In classic leveled compaction a level has one sorted run. A tiered level is full when it has X sorted runs where X is some value >= 2.
 Leveled-N uses leveled compaction, which reads/rewrites an existing sorted run, but it allows N sorted runs (full when runs == N) rather than 1.
 The per-level fanout is size(sorted run in level N) / size(sorted run in level N-1).
 The total fanout is the product of the per-level fanouts. When the write buffer is 1G and the database is 1000G then the total fanout must be 1000.
 The runs-per-level is the number of sorted runs in a level when it is full.
 The per-level write-amplification is the work done to compact from Ln to Ln+1. It is 1 for tiered, allsize(Ln+1) / allsize(Ln) for leveled and runsize(Ln+1) / allsize(Ln) for leveled-N, where runsize is the size of a sorted run and allsize is the sum of the sizes of all sorted runs on a level.
A level can be described by a 3-tuple (c, f, r) where c is the type of compaction (T or L for tiered or leveled), f is the fanout and r is the runs-per-level. Unfortunately, now we have made the description of an LSM tree even more complex because there is a 3-tuple per level. For now I don't use 3-tuples to describe the write buffer (memory component). That is a topic for another post. Example 3-tuples include:
 T:1:4 - this is tiered with fanout=1 and runs-per-level=4. It is a common configuration for the RocksDB level 0 (L0) where the fanout is 1 because the compaction input is a write buffer flush, so the size of a sorted run in L0 is similar to the size of a full write buffer. For now I ignore that RocksDB can merge write buffers on a flush.
 T:8:8 - this is tiered with fanout=8 and runs-per-level=8. When Ln and Ln+1 both use tiered then runs-per-level in Ln == fanout in Ln+1.
 T:8:4 - this is tiered with fanout=8 and runs-per-level=4. It might be used when the next larger level uses leveled and runs-per-level on this level can be smaller than the fanout to reduce read-amp.
 L:10:1 - this is common in RocksDB with leveled compaction, fanout=10 and runs-per-level=1.
 L:10:2 - this is leveled-N with runs-per-level=2.
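As a sketch, the 3-tuples can be parsed and the rule from the T:8:8 example (runs-per-level of a tiered Ln == fanout of a tiered Ln+1) can be checked mechanically. The parse format and helper names are mine, not from any LSM implementation:

```python
from collections import namedtuple

# A per-level 3-tuple as a record, plus one validation rule from the text:
# when Ln and Ln+1 both use tiered, runs-per-level of Ln must equal the
# fanout of Ln+1.
Level = namedtuple("Level", ["compaction", "fanout", "runs"])  # (c, f, r)

def parse(spec):
    """Parse a 'T:1:4' style string into a Level."""
    c, f, r = spec.split(":")
    return Level(c, int(f), int(r))

def valid_tiered_chain(levels):
    for smaller, larger in zip(levels, levels[1:]):
        if smaller.compaction == "T" and larger.compaction == "T":
            if smaller.runs != larger.fanout:
                return False
    return True

# T:1:8 then T:8:8 satisfies the rule (runs 8 == fanout 8);
# T:1:4 then T:8:8 does not (runs 4 != fanout 8).
assert valid_tiered_chain([parse("T:1:8"), parse("T:8:8")])
assert not valid_tiered_chain([parse("T:1:4"), parse("T:8:8")])
```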
Compaction per LSM tree
An LSM tree can be described using the per-level 3-tuples from the previous section. The following are examples for popular algorithms.
Classic LSM with total fanout = 1000 is:
 C0 is the write buffer
 C1, C2 and C3 are L:10:1
RocksDB leveled with total fanout = 1000 is:
 L0 is T:1:4
 L1 is L:1:1
 L2, L3, L4 are L:10:1
Stepped Merge with total fanout = 1000 is:
 L1 is T:1:10
 L2, L3, L4 are T:10:10
Tiered in HBase and Cassandra with total fanout = 1000 might be:
 L1 is T:1:10
 L2, L3 are T:10:10
 L4 is L:10:1
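The example trees above can be checked mechanically: the total fanout is the product of the per-level fanouts and should be 1000 for each. This small sketch uses the c:f:r string form from the previous section:

```python
# Total fanout is the product of the per-level fanouts.
# Levels are given as c:f:r strings, e.g. "T:1:4".
def total_fanout(levels):
    product = 1
    for spec in levels:
        _c, fanout, _r = spec.split(":")
        product *= int(fanout)
    return product

rocksdb_leveled = ["T:1:4", "L:1:1", "L:10:1", "L:10:1", "L:10:1"]
stepped_merge   = ["T:1:10", "T:10:10", "T:10:10", "T:10:10"]
tiered_hbase    = ["T:1:10", "T:10:10", "T:10:10", "L:10:1"]
for tree in (rocksdb_leveled, stepped_merge, tiered_hbase):
    assert total_fanout(tree) == 1000
```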
Tiered+leveled
Note that having some smaller levels use tiered and some larger levels use leveled is done by both RocksDB leveled and Cassandra/HBase tiered. I think both of these are examples of an LSM variant that I call tiered+leveled, but I won't ask any of the projects to update their docs. My definition of tiered+leveled is: the smallest (1 or more) levels use tiered compaction, then 0 or more levels use leveled-N, then the remaining levels use leveled. When tiered=T, leveled=L and leveled-N=N then the regex for this is T+N*L+.
I don't think it is a good idea for leveled levels to precede tiered levels in tiered+leveled (TTL is OK, LTL is not) but that is a topic for another post.
The largest level should use leveled compaction with runs-per-level=1 to avoid too much space amplification.
LSM trees with tiered+leveled can be described using 3-tuples as in the previous section. Here is one for a tree that uses leveled-N for L1 and L2 with total fanout = 1000:
 L0 is T:1:4
 L1 is L:1:2
 L2 is L:10:2
 L3, L4 are L:10:1
And another example to show that tiered levels don't have to use the same fanout or runs-per-level, but runs-per-level for Ln == fanout for Ln+1:
 L0 is T:1:20
 L1 is T:20:10
 L2 is T:10:2
 L3 is L:5:1
Leveled-N
Leveled-N can reduce the per-level write-amp at the cost of increasing the per-level read-amp.
The regex for an LSM tree that uses leveled-N is N+L+ (see the previous section). The largest level should use leveled compaction with runs-per-level=1 to avoid too much space amplification. An example 3-tuple list for leveled-N with total fanout=1000 that has runs-per-level=2 for L1 and L2 is:
 L1 is L:10:2
 L2 is L:10:2
 L3 is L:10:1
Monday, October 1, 2018
Transaction Processing in NewSQL
This is a list of references for transaction processing in NewSQL systems. The work is exciting. I don't have much to add and wrote this to avoid losing interesting links. My focus is on OLTP, but some of these systems support more than that.
By NewSQL I mean the following. I am not trying to define "NewSQL" for the world:
 Support for multiple nodes because the storage/compute on one node isn't sufficient.
 Support for SQL with ACID transactions. If there are shards then cross-shard operations can be consistent and isolated.
 Replication does not prevent the properties listed above when you are willing to pay the price in commit overhead. Alas, synchronous geo-replication is slow and too-slow commit is another form of downtime. I hope NewSQL systems make this less of a problem (async geo-replication for some or all commits, commutative operations). Contention and conflict are common in OLTP and it is important to understand the minimal time between commits to a single row or the max number of commits/second to a single row.
 MySQL Cluster - this was NewSQL before NewSQL was a thing. There is a nice book that explains the internals. There is a company that uses it to make HDFS better. Cluster seems to be more popular for uses other than web-scale workloads.
 VoltDB - another early NewSQL system that is still getting better. It was after MySQL Cluster but years before Spanner and came out of the H-Store research effort.
 Spanner - XA across shards, Paxos across replicas, special hardware to reduce clock drift between nodes. Sounds amazing, but this is Google so it just works. See the papers that explain the system and support for SQL. This got the NewSQL movement going.
 CockroachDB - the answer to implementing Spanner without GPS and atomic clocks. From that URL they explain it as "while Spanner always waits after writes, CockroachDB sometimes waits before reads". It uses RocksDB and they help make it better.
 FaunaDB - FaunaDB is inspired by Calvin and Daniel Abadi explains the difference between it and Spanner here and here. Abadi is great at explaining distributed systems, see his work on PACELC (and the pdf). A key part of Calvin is that "Calvin uses preprocessing to order transactions. All transactions are inserted into a distributed, replicated log before being processed." This approach might limit the peak TPS on a large cluster, but I assume that doesn't matter for a large fraction of the market.
 YugaByte - another user of RocksDB. There is much discussion about it in the recent Abadi post. Their docs are amazing - slides, transaction IO path, single-shard write IO path, distributed ACID and single-row ACID.
 TiDB - I don't know much about it but they are growing fast and are part of the MySQL community. It uses RocksDB (I shouldn't have forgotten that).
 FoundationDB - I am curious where this goes given the competition explained above.
 Aurora - not NewSQL yet because it doesn't scale across nodes. It does support large nodes and that might be sufficient for a large part of the market. But Amazon moves fast (see the new parallel query feature) so I wouldn't be surprised if this became NewSQL one day. I appreciate that they have begun to explain the internals - here and here.
 MongoDB - not SQL, but starting to get interesting with the new features for read and write concerns. There is also new support for causal consistency and retryable writes.
 Clustrix - a NewSQL system that is now part of MariaDB. Maybe this becomes open source.
 Kudu - awesome paper, interesting research on HybridTime, useful docs on the internals.
 Vitess - was created to scale MySQL for YouTube. Now it is part of CNCF, backed by a startup and used by many companies. Cross-shard writes are atomic, but isolation is weaker.
 Splice Machine - SQL on HBase. The summary is "100% ACID via snapshot isolation with optimistic concurrency via write-write conflicts" and details are here. It has integration to use Spark for OLAP, so this is HTAP.
Wednesday, September 19, 2018
Durability debt
I define durability debt to be the amount of work that must be done to persist changes that have been applied to a database. Dirty pages must be written back for a B-Tree. Compaction must be done for an LSM. Durability debt has IO and CPU components. The common IO overhead is from writing something back to the database. The common CPU overhead is from computing a checksum and optionally from compressing data.
From an incremental perspective (pending work per modified row) an LSM usually has less IO and more CPU durability debt than a B-Tree. From an absolute perspective the maximum durability debt can be much larger for an LSM than a B-Tree, which is one reason why tuning can be more challenging for an LSM than a B-Tree.
In this post, by LSM I mean LSM with leveled compaction.
B-Tree
The maximum durability debt for a B-Tree is limited by the size of the buffer pool. If the buffer pool has N pages then there will be at most N dirty pages to write back. If the buffer pool is 100G then there will be at most 100G to write back. The IO is more or less random depending on whether the B-Tree is update-in-place, copy-on-write random or copy-on-write sequential. I prefer to describe this as small writes (page at a time) or large writes (many pages grouped into a larger block) rather than random or sequential. InnoDB uses small writes and WiredTiger uses larger writes. The distinction between small writes and large writes is more important with disks than with SSD.
There is a small CPU overhead from computing the per-page checksum prior to write back. There can be a larger CPU overhead from compressing the page. Compression isn't popular with InnoDB but is popular with WiredTiger.
There can be an additional IO overhead when torn-write protection is enabled, as provided by the InnoDB doublewrite buffer.
LSM
The durability debt for an LSM is the work required to compact all data into the max level (Lmax). A byte in the write buffer causes more debt than a byte in L1 because more work is needed to move the byte from the write buffer to Lmax than from L1 to Lmax.
The maximum durability debt for an LSM is limited by the size of the storage device. Users can configure RocksDB such that the level 0 (L0) is huge. Assume that the database needs 1T of storage were it compacted into one sorted run, and the write-amplification to move data from the L0 to the max level (Lmax) is 30. Then the maximum durability debt is 30 * sizeof(L0). The L0 is usually configured to be <= 1G, in which case the durability debt from the L0 is <= 30G. But were the L0 configured to be <= 1T then the debt from it could grow to 30T.
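The bound above is simple arithmetic, but a tiny sketch makes the two scenarios concrete (the write-amp of 30 is the assumed value from the text):

```python
# Sketch: upper bound on LSM durability debt from the L0, using the
# assumed write-amp of 30 from L0 to Lmax.
def max_durability_debt(l0_bytes, write_amp):
    """Worst-case bytes of compaction work pending for data in the L0."""
    return write_amp * l0_bytes

GiB = 2 ** 30
TiB = 2 ** 40
# A typical L0 capped at 1G bounds the debt at ~30G ...
assert max_durability_debt(1 * GiB, 30) == 30 * GiB
# ... but an L0 allowed to grow to 1T bounds it at ~30T.
assert max_durability_debt(1 * TiB, 30) == 30 * TiB
```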
I use the notion of perlevel writeamp to explain durability debt in an LSM. Perlevel writeamp is defined in the next section. Perlevel writeamp is a proxy for all of the work done by compaction, not just the data to be written. When the perlevel writeamp is X then for compaction from Ln to Ln+1 for every keyvalue pair from Ln there are ~X keyvalue pairs from Ln+1 for which work is done including:
Perlevel writeamp in an LSM
The perlevel writeamplification is the work required to move data between adjacent levels. The perlevel writeamp for the write buffer is 1 because a write buffer flush creates a new SST in L0 without reading/rewriting an SST already in L0.
I assume that any key in Ln is already in Ln+1 so that merging Ln into Ln+1 does not make Ln+1 larger. This isn't true in real life, but this is a model.
The perlevel writeamp for Ln is approximately sizeof(Ln+1) / sizeof(Ln). For n=0 this is 2 with a typical RocksDB configuration. For n>0 this is the perlevel growth factor and the default is 10 in RocksDB. Assume that the perlevel growth factor is equal to X, in reality the perlevel writeamp is f*X rather than X where f ~= 0.7. See this excellent paper or examine the compaction IO stats from a production RocksDB instance. Too many excellent conference papers assume it is X rather than f*X in practice.
The perlevel writeamp for Lmax is 0 because compaction stops at Lmax.
From an incremental perspective (pending work per modified row) an LSM usually has less IO and more CPU durability debt than a BTree. From an absolute perspective the maximum durability debt can be much larger for an LSM than a BTree which is one reason why tuning can be more challenging for an LSM than a BTree.
In this post by LSM I mean LSM with leveled compaction.
BTree
The maximum durability debt for a BTree is limited by the size of the buffer pool. If the buffer pool has N pages then there will be at most N dirty pages to write back. If the buffer pool is 100G then there will be at most 100G to write back. The IO is more random or less random depending on whether the BTree is updateinplace, copyonwrite random or copyonwrite sequential. I prefer to describe this as small writes (page at a time) or large writes (many pages grouped into a larger block) rather than random or sequential. InnoDB uses small writes and WiredTiger uses larger writes. The distinction between small writes and large writes is more important with disks than with SSD.
There is a small CPU overhead from computing the perpage checksum prior to write back. There can be a larger CPU overhead from compressing the page. Compression isn't popular with InnoDB but is popular with WiredTiger.
There can be an additional IO overhead when tornwrite protection is enabled as provided by the InnoDB double write buffer.
LSM
The durability debt for an LSM is the work required to compact all data into the max level (Lmax). A byte in the write buffer causes more debt than a byte in the L1 because more work is needed to move the byte from the write buffer to Lmax than from L1 to Lmax.
The maximum durability debt for an LSM is limited by the size of the storage device. Users can configure RocksDB such that the level 0 (L0) is huge. Assume that the database needs 1T of storage were it compacted into one sorted run and the writeamplification to move data from the L0 to the max level (Lmax) is 30. Then the maximum durability debt is 30 * sizeof(L0). The L0 is usually configured to be <= 1G in which case the durability debt from the L0 is <= 30G. But were the L0 configured to be <= 1T then the debt from it could grow to 30T.
I use the notion of per-level write-amp to explain durability debt in an LSM. Per-level write-amp is defined in the next section. Per-level write-amp is a proxy for all of the work done by compaction, not just the data to be written. When the per-level write-amp is X then for compaction from Ln to Ln+1, for every key-value pair from Ln there are ~X key-value pairs from Ln+1 for which work is done including:
- Read from Ln+1. If Ln is a small level then the data is likely to be in the LSM block cache or OS page cache. Otherwise it is read from storage. Some reads will be cached, all writes go to storage. So the write rate to storage is > the read rate from storage.
- The key-value pairs are decompressed, if the level is compressed, for each block not in the LSM block cache.
- The key-value pairs from Ln+1 are merged with Ln. Note that this is a merge, not a merge sort, because the inputs are ordered. The number of comparisons might be less than you expect because one iterator is ~X times larger than the other and there are optimizations for that.
The output from the merge is then compressed and written back to Ln+1. Some of the work above (reads, decompression) is also done for Ln but most of the work comes from Ln+1 because it is many times larger than Ln. I stated above that an LSM usually has more IO and less CPU durability debt per modified row. The extra CPU overheads come from decompression and the merge. I am not sure whether to count the compression overhead as extra.
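The merge step above can be sketched in a few lines. This is a toy version, not the RocksDB implementation: it merges two already-sorted inputs (a merge, not a merge sort) and lets the newer level win on duplicate keys. A real compaction also handles tombstones, sequence numbers and compression.

```python
import heapq

def compact(ln, ln1):
    # ln (newer) and ln1 (older) are sorted lists of (key, value) pairs.
    # Tag each pair with a level rank so the newer entry sorts first per key.
    tagged = heapq.merge(
        ((k, 0, v) for k, v in ln),    # rank 0: newer level, wins ties
        ((k, 1, v) for k, v in ln1),   # rank 1: older level
    )
    out, last_key = [], object()
    for k, _, v in tagged:
        if k != last_key:              # keep only the newest entry per key
            out.append((k, v))
            last_key = k
    return out

print(compact([(2, "new")], [(1, "a"), (2, "old"), (3, "c")]))
```

Because both inputs are sorted, heapq.merge does a streaming k-way merge rather than a full sort of the combined data.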
Assuming the per-level growth factor is 10 and f is 0.7 (see below) then the per-level write-amp is 7 for L1 and larger levels. If sizeof(L1) == sizeof(L0) then the per-level write-amp is 2 for the L0. And the per-level write-amp is always 1 for the write buffer.
From this we can estimate the pending write-amp for data at any level in the LSM tree.
- Key-value pairs in the write buffer have the most pending write-amp. Key-value pairs in the max level (L5 in this case) have none. Key-value pairs in the write buffer are further from the max level.
- Starting with the L2 there is more durability debt from a full Ln+1 than a full Ln: while there is more pending write-amp for Ln, there is more data in Ln+1.
- Were I given the choice of L1, L2, L3 and L4 when first placing a write in the LSM tree then I would choose L4 as that has the least pending write-amp.
- Were I to choose to make one level 10% larger then I prefer to do that for a smaller level given the values in the rel size X pend column.
legend:
- wamp per-lvl : per-level write-amp
- wamp pend : write-amp to move byte to Lmax from this level
- rel size : size of level relative to write buffer
- rel size X pend : write-amp to move all data from that level to Lmax

        wamp      wamp   rel     rel size
level   per-lvl   pend   size    X pend
-----   -------   ----   -----   --------
wbuf    1         31     1       31
L0      2         30     4       120
L1      7         28     4       112
L2      7         21     40      840
L3      7         14     400     5600
L4      7         7      4000    28000
L5      0         0      40000   0
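The numbers in the table above can be reproduced with a short sketch. The per-level write-amp values follow the assumptions in the text: 1 for the write buffer, 2 for the L0, 7 (f*X with f=0.7, X=10) for L1 to L4 and 0 for Lmax (L5). Sizes are relative to the write buffer.

```python
order = ["wbuf", "L0", "L1", "L2", "L3", "L4", "L5"]
per_level_wamp = {"wbuf": 1, "L0": 2, "L1": 7, "L2": 7, "L3": 7, "L4": 7, "L5": 0}
rel_size = {"wbuf": 1, "L0": 4, "L1": 4, "L2": 40, "L3": 400, "L4": 4000, "L5": 40000}

# pending write-amp for a level = sum of per-level write-amp from there to Lmax
pending = {lvl: sum(per_level_wamp[l] for l in order[i:]) for i, lvl in enumerate(order)}

for lvl in order:
    print(lvl, per_level_wamp[lvl], pending[lvl], rel_size[lvl], rel_size[lvl] * pending[lvl])
```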
Per-level write-amp in an LSM
The per-level write-amplification is the work required to move data between adjacent levels. The per-level write-amp for the write buffer is 1 because a write buffer flush creates a new SST in L0 without reading/rewriting an SST already in L0.
I assume that any key in Ln is already in Ln+1 so that merging Ln into Ln+1 does not make Ln+1 larger. This isn't true in real life, but this is a model.
The per-level write-amp for Ln is approximately sizeof(Ln+1) / sizeof(Ln). For n=0 this is 2 with a typical RocksDB configuration. For n>0 this is the per-level growth factor and the default is 10 in RocksDB. Assume that the per-level growth factor is equal to X. In reality the per-level write-amp is f*X rather than X where f ~= 0.7. See this excellent paper or examine the compaction IO stats from a production RocksDB instance. Too many excellent conference papers assume it is X rather than f*X in practice.
The per-level write-amp for Lmax is 0 because compaction stops at Lmax.
Tuesday, September 18, 2018
Bloom filter and cuckoo filter
The multi-level cuckoo filter (MLCF) in SlimDB builds on the cuckoo filter (CF) so I read the cuckoo filter paper. The big deal about the cuckoo filter is that it supports delete and a bloom filter does not. As far as I know the MLCF is updated when sorted runs arrive at and depart from a level, so delete is required. A bloom filter in an LSM is per sorted run and delete is not required because the filter is created when the sorted run is written and dropped when the sorted run is unlinked.
I learned of the blocked bloom filter from the cuckoo filter paper (see here or here). RocksDB uses this but I didn't know it had a name. The benefit of it is to reduce the number of cache misses per probe. I was curious about the cost and while the math is complicated, the paper estimates a 10% increase in the false positive rate for a bloom filter with 8 bits/key and a 512-bit block, which is similar to a typical setup for RocksDB.
Space Efficiency
I am always interested in things that use less space for filters and block indexes with an LSM so I spent time reading the paper. It is a great paper and I hope that more people read it. The cuckoo filter (CF) paper claims better space-efficiency than a bloom filter and the claim is repeated in the SlimDB paper as:
"However, by selecting an appropriate fingerprint size f and bucket size b, it can be shown that the cuckoo filter is more space-efficient than the Bloom filter when the target false positive rate is smaller than 3%."
The tl;dr for me is that the space savings from a cuckoo filter are significant when the false positive rate (FPR) is sufficiently small. But when the target FPR is 1% then a cuckoo filter uses about the same amount of space as a bloom filter.
The paper has a lot of interesting math that I was able to follow. It provides formulas for the number of bits/key for a bloom filter, cuckoo filter and semisorted cuckoo filter. The semisorted filter uses 1 less bit/key than a regular cuckoo filter. The formulas assuming E is the target false positive rate, b=4, and A is the load factor:
- bloom filter: ceil(1.44 * log2(1/E))
- cuckoo filter: ceil(log2(1/E) + log2(2b)) / A == ceil(log2(1/E) + 3) / A
- semisorted cuckoo filter: ceil(log2(1/E) + 2) / A
The target load factor is 0.95 (A = 0.95) and that comes at a cost in CPU overhead when creating the CF. Assuming A=0.95 then a bloom filter uses 10 bits/key, a cuckoo filter uses 10.53 and a semisorted cuckoo filter uses 9.47. So the cuckoo filter uses either 5% more or 5% less space than a bloom filter when the target FPR is 1% which is a different perspective from the quote I listed above. Perhaps my math is wrong and I am happy for an astute reader to explain that.
When the target FPR is 0.1% then a bloom filter uses 15 bits/key, a cuckoo filter uses 13.7 and a semisorted cuckoo filter uses 12.7. The savings from a cuckoo filter are larger here but the common configuration for a bloom filter in an LSM has been to target a 1% FPR. I won't claim that we have proven that FPR=1% is the best rate and recent research on Monkey has shown that we can do better when allocating space to bloom filters.
The first graph shows the number of bits/key as a function of the FPR for a bloom filter (BF) and cuckoo filter (CF). The second graph shows the ratio for bits/key from BF versus bits/key from CF. The results for semisorted CF, which uses 1 less bit/key, are not included. For the second graph a CF uses less space than a BF when the value is greater than one. The graph covers FPR from 0.00001 to 0.09 which is 0.001% to 9%. R code to generate the graphs is here.
CPU Efficiency
From the paper there is more detail on CPU efficiency in table 3, figure 5 and figure 7. Table 3 has the speed to create a filter, but the filter is much larger (192MB) than a typical per-run filter with an LSM and there will be more memory system stalls in that case. Regardless, the blocked bloom filter has the least CPU overhead during construction.
Figure 5 shows the lookup performance as a function of the hit rate. Fortunately performance doesn't vary much with the hit rate. The cuckoo filter is faster than the blocked bloom filter and the blocked bloom filter is faster than the semisorted cuckoo filter.
Figure 7 shows the insert performance as a function of the cuckoo filter load factor. The CPU overhead per insert grows significantly when the load factor exceeds 80%.
Thursday, September 13, 2018
Review of SlimDB from VLDB 2018
SlimDB is a paper worth reading from VLDB 2018. The highlights from the paper are that it shows:
- How to use less memory for filters and indexes with an LSM
- How to reduce the CPU penalty for queries with tiered compaction
- The benefit of more diversity in LSM tree shapes
Overview
Cache amplification has become more important as database:RAM ratios increase. With SSD it is possible to attach many TB of usable data to a server for OLTP. By usable I mean that the SSD has enough IOPs to access the data. But it isn't possible to grow the amount of RAM per server at that rate. Many of the early RocksDB workloads used database:RAM ratios that were about 10:1 and everything but the max level (Lmax) of the LSM tree was in memory. As the ratio grows that won't be possible unless filters and block indexes use less memory. SlimDB does that via three-level block indexes and multi-level cuckoo filters.
Tiered compaction uses more CPU and IO for point and range queries because there are more places to check for data when compared to leveled compaction. The multi-level cuckoo filter in SlimDB reduces the CPU overhead for point queries as there is only one filter to check per level rather than one per sorted run per level.
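A toy count makes the difference concrete. The run counts below are invented for illustration, not taken from the paper: with a bloom filter per sorted run a point query probes one filter per run, while a multi-level cuckoo filter needs one probe per level.

```python
# Hypothetical LSM tree: 3 tiered levels with 4 sorted runs each, then 2
# leveled levels with 1 sorted run each.
runs_per_level = [4, 4, 4, 1, 1]

bloom_probes = sum(runs_per_level)   # one bloom filter probe per sorted run
mlcf_probes = len(runs_per_level)    # one multi-level filter probe per level
print(bloom_probes, mlcf_probes)     # 14 5
```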
The SlimDB paper shows the value of hybrid LSM tree shapes, combinations of tiered and leveled, and then how to choose the best combination based on IO costs. Prior to this year, hybrid didn't get much discussion: the choices were usually tiered or leveled. While RocksDB and LevelDB with the L0 have always been hybrids of tiered (L0) and leveled (L1 to Lmax), we rarely discuss that. But more diversity in LSM tree shape means more complexity in tuning and the SlimDB solution is to make a cost-based decision (cost == IO overhead) subject to a constraint on the amount of memory to use.
This has been a great two years for storage engine efficiency. First we had several papers from Harvard DASLab that have begun to explain cost-based algorithm design and engine configuration and SlimDB continues in that tradition. I have much more reading to do starting with The Periodic Table of Data Structures.
Below I review the paper. Included with that is some criticism. Papers can be great without being perfect. This paper is a major contribution and worth reading.
Semisorted
The paper starts by explaining the principle of semisorted data. When the primary key can be split into two parts, prefix and suffix, there are some workloads that don't need data ordered over the entire primary key (prefix + suffix). Semisorted supports queries that fetch all data that matches the prefix of the PK while still enforcing uniqueness for the entire PK. The PK can be on (a,b,c,d) where (a,b) is the prefix, and queries are like "a=X and b=Y" without predicates on (c,d) that require index ordering. SlimDB takes advantage of this to use less space for the block index.
There are many use cases for this, but the paper cites Linkbench which isn't correct. See the Linkbench and Tao papers for queries that do an exact match on the prefix but only want the top-N rows in the result. So ordering on the suffix is required to satisfy query response time goals when the total number of rows that match the prefix is much larger than N. I assume this issue with top-N is important for other social graph workloads because some graph nodes are popular. Alas, things have changed with the social graph workload since those papers were published and I hope the changes are explained one day.
Note that MyRocks can use a prefix bloom filter to support some range queries with composite indexes. Assume the index is on (a,b,c) and the query has a=X and b=Y order by c limit 10. A prefix bloom on (a,b) can be used for such a query.
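A minimal sketch of the prefix bloom idea: only the (a,b) prefix of a composite (a,b,c) key is inserted into the filter, so a query with a=X and b=Y can consult the filter before touching a sorted run. A Python set stands in for the bloom filter here; a real bloom filter has false positives but no false negatives.

```python
def prefix(key):
    # composite index on (a, b, c); only (a, b) goes into the filter
    a, b, c = key
    return (a, b)

keys = [(1, 10, "x"), (1, 10, "y"), (2, 20, "z")]
prefix_filter = {prefix(k) for k in keys}

print((1, 10) in prefix_filter)  # True: maybe present, must read the run
print((9, 9) in prefix_filter)   # False: the range query can skip this run
```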
Stepped Merge
The paper implements tiered compaction but calls it stepped merge. I didn't know about the stepped merge paper prior to reading the SlimDB paper. I assume that people who chose the name tiered might also have missed that paper.
LSM compaction algorithms haven't been formally defined. I tried to advance the definitions in a previous post. One of the open issues for tiered is whether it requires only one sorted run at the max level or allows for N runs at the max level. With N runs at the max level the space-amplification is at least N which is too much for many workloads. With 1 run at the max level compaction into the max level is always leveled rather than tiered: the max level is read/rewritten and the per-level write-amplification from that is larger than 1 (while the per-level write-amp from tiered == 1). With N runs at the max level many of the compaction steps into the max level can be tiered, but some will be leveled: when the max level is full (has N runs) then something must be done to reduce the number of runs.
3-level block index
Read the paper. It is complex and a summary by me here won't add value. It uses an Entropy Coded Trie (ECT) that builds on ideas from SILT, another great paper from CMU.
ECT uses ~2 bits/key versus at least 8 bits/key for LevelDB for the workloads they considered. This is a great result. ECT also uses 5X to 7X more CPU per lookup than LevelDB, which means you might limit the use of it to the largest levels of the LSM tree because those use the most memory and that is where we are willing to spend CPU to save memory.
Multi-level cuckoo filter
SlimDB can use a cuckoo filter for leveled levels of the LSM tree and a multi-level cuckoo filter for tiered levels. Note that leveled levels have one sorted run and tiered levels have N sorted runs. SlimDB and the Stepped Merge paper use the term sublevels, but I prefer N sorted runs.
The cuckoo filter is used in place of a bloom filter to save space given target false positive rates of less than 3%. The paper has examples where the cuckoo filter uses 13 bits/key (see Table 1) and a bloom filter with 10 bits/key (RocksDB default) has a false positive rate of much less than 3%. It is obvious that I need to read another interesting CMU paper cited by SlimDB: Cuckoo Filter: Practically Better than Bloom.
The multi-level cuckoo filter (MLCF) extends the cuckoo filter by using a few bits/entry to name the sublevel (sorted run) in the level that might contain the search key. With tiered and a bloom filter per sublevel (sorted run) a point query must search a bloom filter per sorted run. With the MLCF there is only one search per level (if I read the paper correctly).
The MLCF might go a long way to reduce the point-query CPU overhead when using many sublevels, which is a big deal. While a filter can't be used for general range queries, SlimDB doesn't support general range queries. Assuming the PK is on (a,b,c,d) and the prefix is (a,b) then SlimDB supports range queries like fetch all rows where a=X and b=Y. It wasn't clear to me whether the MLCF could be used in that case. But many sublevels can create more work for range queries as iterators must be positioned in each sublevel in the worst case and that is more work.
This statement from the end of the paper is tricky. SlimDB allows for an LSM tree to use leveled compaction on all levels, tiered on all levels or a hybrid. When all levels are leveled, then performance should be similar to RocksDB with leveled. When all or some levels are tiered then write-amplification will be reduced at the cost of read performance and the paper shows that range queries are slower when some levels are tiered. Lunch isn't free as the RUM Conjecture asserts.
"In contrast, with the support of dynamic use of a stepped merge algorithm and optimized in-memory indexes, SlimDB minimizes write amplification without sacrificing read performance."
The memory overhead for the MLCF is ~2 bits. I am not sure this was explained by the paper but that might be to name the sublevel, in which case there can be at most 4 sublevels per level and the cost would be larger with more sublevels.
The paper didn't explain how the MLCF is maintained. With a bloom filter per sorted run the bloom filter is created when SST files are created during compaction and memtable flush. This is an offline or batch computation. But the MLCF covers all the sublevels (sorted runs) in a level. And the sublevels in a level arrive and depart one at a time, not at the same time. They arrive as output from compaction and depart when they were compaction input. The arrival or departure of a new sublevel requires incremental changes to the MLCF.
LSM tree shapes
For too long there has not been much diversity in LSM tree shapes. The usual choice was all tiered or all leveled. RocksDB leveled is really a hybrid: tiered for L0, leveled for L1 to Lmax. But the SlimDB paper makes the case for more diversity. It explains that some levels (smaller ones) can be tiered while the larger levels can be leveled. And the use of multi-level cuckoo filters, three-level indexes and cuckoo filters is also a decision to make per-level.
Even more interesting is the use of a cost model to choose the best configuration subject to a constraint: the memory budget. They enumerate a large number of LSM tree configurations, generate estimated IO costs per operation (write-amp, IO per point query that returns a row, IO per point query that doesn't return a row, memory overhead) and then the total IO cost is computed for a workload, where a workload specifies the frequency of each operation (for example, 30% writes, 40% point hits, 30% point misses).
The Dostoevsky paper also makes the case for more diversity and uses rigorous models to show how to choose the best LSM tree shape.
I think this work is a big step in the right direction, although cost models must be expanded to include CPU overheads, and constraints expanded to include the maximum write and space amplification that can be tolerated.
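The cost-based choice described above can be sketched as follows. All configurations and per-operation costs here are invented for illustration; the paper derives real estimates per configuration. The idea is just: weight per-operation IO costs by workload frequencies and pick the cheapest configuration that fits the memory budget.

```python
# Hypothetical candidate configurations with made-up per-operation IO costs.
configs = [
    {"name": "all-leveled", "write": 30, "point_hit": 1.0, "point_miss": 0.1, "mem_gb": 4},
    {"name": "all-tiered",  "write": 8,  "point_hit": 3.0, "point_miss": 0.5, "mem_gb": 6},
    {"name": "hybrid",      "write": 15, "point_hit": 1.5, "point_miss": 0.2, "mem_gb": 5},
]
workload = {"write": 0.30, "point_hit": 0.40, "point_miss": 0.30}
mem_budget_gb = 6

def total_cost(cfg):
    # expected IO cost per operation = sum of frequency * per-op cost
    return sum(freq * cfg[op] for op, freq in workload.items())

feasible = [c for c in configs if c["mem_gb"] <= mem_budget_gb]
best = min(feasible, key=total_cost)
print(best["name"])
```

With this write-heavy workload the tiered configuration wins because its much lower write cost outweighs its higher point-query costs; a read-heavy workload would flip the result.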
I disagree with a statement from the related work section. We can already navigate some of the read, write and space amplification space but I hope there is more flexibility in the future. RocksDB tuning is complex in part to support this via changing the number of levels (or growth factor per level), enabling/disabling the bloom filter, using different compression (or none) on different levels, changing the max space amplification allowed, changing the max number of sorted runs in the L0 or max number of write buffers, changing the L0:L1 size ratio, changing the number of bloom filter bits/key. Of course I want more flexibility in the future while also making RocksDB easier to tune.
"Existing LSM-tree based key-value stores do not allow trading among read cost, write cost and main memory footprint."
Performance Results
Figuring out why X was faster than Y in academic papers is not my favorite task. I realize that space constraints are a common reason for the lack of details but I am wary of results that have not been explained and I know that mistakes can be made (note: don't use serializable with InnoDB). I make many mistakes myself. I am willing to provide advice for MyRocks, MySQL and RocksDB. AFAIK most authors who hack on RocksDB or compare with it for research are not reaching out to us. We are happy to help in private.
SlimDB was faster than RocksDB on their evaluation except for range queries. There were few details about the configurations used, so I will guess. First I assume that SlimDB used stepped merge with MLCF for most levels. I am not sure why point queries were faster with SlimDB than RocksDB. Maybe RocksDB wasn't configured to use bloom filters. Writes were about 4X faster with SlimDB because stepped merge (tiered) compaction was used, write-amplification was 4X less and when IO is the bottleneck then an approach that has less write-amp will go faster.
Wednesday, September 5, 2018
5 things to set when configuring RocksDB and MyRocks
The 5 options to set for RocksDB and MyRocks are:
- block cache size
- number of background jobs
- compaction priority (compaction_pri)
- level_compaction_dynamic_level_bytes
- bloom filter configuration
Options
My advice on setting the size of the RocksDB block cache has not changed assuming it is configured to use buffered IO (the default). With MyRocks this option is rocksdb_block_cache_size and with RocksDB you will write a few lines of code to setup the LRU.
The number of background threads for flushing memtables and doing compaction is set by the option rocksdb_max_background_jobs in MyRocks and max_background_jobs in RocksDB. There used to be two options for this. While RocksDB can use async read-ahead and write-behind during compaction, it still uses synchronous reads and a potentially slow fsync/fdatasync call. Using more than 1 background job helps to overlap CPU and IO. A common configuration for me is number-of-CPU-cores / 4. With too few threads there will be more stalls from throttling. With too many threads the threads handling user queries might suffer.
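As a worked example of the number-of-CPU-cores / 4 rule, a 16-core server gets 4 background jobs. This is a sketch of the my.cnf line, not a tuned recommendation:

```
# number-of-CPU-cores / 4, so 4 for a 16-core server
rocksdb_max_background_jobs=4
```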
There are several strategies for choosing the next data to compact with leveled compaction in RocksDB. The strategy is selected via the compaction_pri option in RocksDB. This is harder to set for MyRocks - see compaction_pri in rocksdb_default_cf_options. The default value is kByCompensatedSize but the better choice is kMinOverlappingRatio. With MyRocks the default is 0 and the better value is 3 (3 == kMinOverlappingRatio). I first wrote about compaction_pri prior to the arrival of kMinOverlappingRatio. Throughput is better and write amplification is reduced with kMinOverlappingRatio. An awesome paper by Hyeontaek Lim et al explains this.
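A minimal sketch of setting this via my.cnf for MyRocks, assuming no other column family options are needed (the full example at the end combines this with the other settings):

```
rocksdb_default_cf_options=compaction_pri=kMinOverlappingRatio
```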
Leveled compaction in RocksDB limits the amount of data per level of the LSM tree. A great description of this is here. There is a target size per level and this is enforced top down (smaller to larger levels) or bottom up (larger to smaller levels). With the bottom up approach the largest level has ~10X (or whatever the fanout is set to) more data than the next to last level. With the top down approach the largest level frequently has less data than the next to last level. I strongly prefer the bottom up approach to reduce space amplification. This is enabled via the level_compaction_dynamic_level_bytes option in RocksDB. It is harder to set for MyRocks - see rocksdb_default_cf_options.
Bloom filters are disabled by default for MyRocks and RocksDB. I prefer to use a bloom filter on all but the largest level. This is set via rocksdb_default_cf_options with MyRocks. The reason for not using it with the max level is to consume less memory (reduce cache amplification). The bloom filter is skipped for the largest level in MyRocks via the optimize_filter_for_hits option. The example at the end of this post has more information on enabling bloom filters. All of this is set via rocksdb_default_cf_options.
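As a sketch, the bloom filter pieces of rocksdb_default_cf_options look like this in my.cnf. The 10 is bits per key and, as I understand it, the final false selects the full filter rather than the block-based builder; optimize_filters_for_hits then skips the filter on the largest level:

```
rocksdb_default_cf_options=block_based_table_factory={filter_policy=bloomfilter:10:false};optimize_filters_for_hits=true
```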
Examples
A recent post explained how to set rocksdb_default_cf_options for compression with MyRocks. Below I share an example my.cnf for MyRocks that sets the 5 options I listed above. I set transaction isolation because read committed is a better choice for MyRocks today. Repeatable read will be a great choice after gap locks are added to match InnoDB semantics. In rocksdb_default_cf_options, block_based_table_factory is used to enable the bloom filter, level_compaction_dynamic_level_bytes enables bottom up management of level sizes, optimize_filters_for_hits disables the bloom filter for the largest level of the LSM tree and compaction_pri sets the compaction priority.
transaction-isolation=READ-COMMITTED
default-storage-engine=rocksdb
rocksdb
rocksdb_default_cf_options=block_based_table_factory={filter_policy=bloomfilter:10:false};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio
rocksdb_block_cache_size=2g
rocksdb_max_background_jobs=4
I have always wanted to do a "10 things" post but prefer to keep this list small. It is unlikely that RocksDB can provide a great default for the block cache size and number of background threads because they depend on the amount of RAM and number of CPU cores in a server. But I hope RocksDB or MyRocks are changed to get better defaults for the other three, which would shrink this list from 5 to 2.