Wednesday, October 30, 2019

USL, universal scalability law, is good to know

The USL is worth understanding. USL is short for universal scalability law and was created by Neil Gunther for use in capacity planning. I don't do much capacity planning but the USL is also valuable for explaining performance. Performance problems for the old InnoDB rw-lock (circa 2007) would have been easy to describe via the USL because "wake all" on unlock is an O(N²) overhead -- see the β parameter in the USL.

A longer version of the USL is here. The USL predicts how throughput will change (increase or decrease) with concurrency. One version of the formula where X(i) is throughput with i concurrent clients is below.

                  X(1) * N
     X(N) = ----------------------
             1 + α(N-1) + βN(N-1)

The denominator determines how throughput changes with concurrency. When it equals one there is linear scaleup. The best way to get linear scaleup is to cheat and choose an easy base case; otherwise the denominator is greater than one and the USL predicts less-than-linear scaleup. The terms in the denominator represent the overhead from contention and coherency: the α term represents contention and the β term represents coherency.
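
Here is a minimal sketch in Python of that formula. The α and β values are made-up examples, not measurements.

def usl_throughput(n, x1, alpha, beta):
    """USL: predicted throughput at concurrency n, given throughput x1 at n=1
    plus contention (alpha) and coherency (beta) coefficients."""
    return (x1 * n) / (1 + alpha * (n - 1) + beta * n * (n - 1))

# Made-up coefficients: 5% contention, 0.02% coherency
for n in (1, 2, 4, 8, 16, 32, 64):
    print(n, round(usl_throughput(n, x1=1000, alpha=0.05, beta=0.0002), 1))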

I write above that "wake all" on unlock has an O(N²) overhead. By this I mean that when N threads are trying to get through a synchronization object one at a time and all waiters wake on unlock then there are (N-1) + (N-2) + (N-3) + ... + 3 + 2 + 1 = N(N-1)/2 wakeups, which is O(N²). The β term in the denominator represents the coherency overhead and has an N² term as βN(N-1) = βN² - βN.

Amdahl's Law

The long post states that when β=0 and X(1)=1 the USL reduces to Amdahl's Law. I will derive the math for that here. The key point is that α=(1-p) where p is the fraction of the workload that is parallel in Amdahl.

# Amdahl's law where p is the fraction of the workload
# that is parallel, N is same as for USL
                    1
     Speedup = -----------
               (1-p) + p/N

# Start with USL
                    N
     X(N) = ----------------------
             1 + α(N-1) + βN(N-1)

# Update for X(1)=1, β=0, α=(1-p)
                    N
     X(N) = ----------------
             1 + (1-p)(N-1)

# Algebra for denominator, expand and then reduce
                    N
     X(N) = ----------------
              N - pN + p

# Multiply numerator and denominator by 1/N
                    1
     X(N) = ----------------
              (1-p) + p/N

# Now it looks like Amdahl where α=(1-p)
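
A quick numeric check of the reduction, a sketch in Python that assumes X(1)=1, β=0 and α=(1-p):

def usl(n, alpha, beta, x1=1.0):
    return (x1 * n) / (1 + alpha * (n - 1) + beta * n * (n - 1))

def amdahl(n, p):
    return 1.0 / ((1 - p) + p / n)

p = 0.9  # example: 90% of the workload is parallel
for n in (1, 2, 4, 8, 16):
    assert abs(usl(n, alpha=(1 - p), beta=0.0) - amdahl(n, p)) < 1e-9
    print(n, round(amdahl(n, p), 3))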

Predicted max throughput

The long post states that the max throughput occurs at N = sqrt((1-α)/β). That is the value of N for which X(N)' = 0. I won't derive it here but X(N)' is:

                    1 - α - βN²
     X(N)' = --------------------------
             (1 + α(N-1) + βN(N-1))²

Then X(N)' = 0 when the numerator is zero and that happens when N = sqrt((1-α)/β). That determines a critical point for X(N) which is either a min or max. In this case it is a max. I assume that X(N) is concave without deriving X(N)''.
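
A numeric check of that critical point, again with made-up α and β:

import math

def usl(n, alpha, beta):
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

alpha, beta = 0.05, 0.0002  # made-up coefficients
print("predicted peak at N =", round(math.sqrt((1 - alpha) / beta), 1))

# brute-force check: find the integer N with the largest predicted throughput
best = max(range(1, 500), key=lambda n: usl(n, alpha, beta))
print("numeric peak at N =", best)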

Tuesday, October 29, 2019

Review of "5 Minute rule - thirty years later"

The 5 Minute rule paper has been updated and it is worth reading. The paper explains how to make cost-based decisions for tiered storage, applies that using price and performance of devices over the past 30 years and then makes predictions based on current trends.

There is a prediction in the paper that will provoke discussion --  the price vs performance advantage for disk vs tape is getting smaller and soon we might need Hadoop for tape. That will be an interesting project.

5 minute rule math

The 5 minute rule math explains what data should be in a cache by considering the price of the cache and the price/performance of the storage device. The math has been used where cache+storage are DRAM+disk, DRAM+SSD and SSD+disk. Now it has been updated for NVM.

The equation is presented as two ratios: technology and economic. Below I explain how to derive it.

Technology ratio            Economic ratio

Pages/MBofCache             Price/StorageDevice
-----------------------  *  -------------------
OpsPerSec/StorageDevice     Price/MBofCache

If you reduce the equation above you end up with Pages/OpsPerSec. The unit for this is seconds and it is the inverse of the access rate. When Pages/OpsPerSec = X then the cache should be large enough to fit pages accessed more frequently than every X seconds (1/X accesses per second). The formula above doesn't tell you to buy X MB of cache. It does tell you to buy enough cache to fit all pages accessed more frequently than once per X seconds.

Now I derive the formula as the presentation above can confuse me. The cost of a cache page is:
Price/MBofCache * MBofCache/Pages

The cost of 1 OpsPerSec from a storage device is:
Price/StorageDevice * StorageDevice/OpsPerSec

Each cache page has a cost and a benefit. The benefit is the reduction in OpsPerSec demand for cached data. Assume there are K pages of cache and the access rate for each page is A. Then this cache reduces the OpsPerSec demand by K*A. From this we can determine the value of A for which the cache cost equals the benefit:

K * Price/MBofCache * MBofCache/Pages =
    K * A * Price/StorageDevice * StorageDevice/OpsPerSec

# And then solve for A, K cancels, move ratios to LHS


Price/MBofCache       MBofCache/Pages
------------------- * ----------------------- = A   
Price/StorageDevice   StorageDevice/OpsPerSec

# LHS can be reduced to 1/Pages / 1/OpsPerSec
# which is OpsPerSec / Pages while '5 minute rule'
# above solves for Pages / OpsPerSec, so invert LHS
# to get 1/A

Price/StorageDevice   StorageDevice/OpsPerSec
------------------- * ----------------------- = 1/A
Price/MBofCache       MBofCache/Pages

# Now reorder LHS

StorageDevice/OpsPerSec   Price/StorageDevice   
----------------------- * ------------------- = 1/A
MBofCache/Pages           Price/MBofCache

# Given that X/Y = 1/Y / 1/X when X, Y != 0 then
# first term in LHS can be rewritten to get

Pages/MBofCache           Price/StorageDevice
----------------------- * ------------------- = 1/A
OpsPerSec/StorageDevice   Price/MBofCache

# A is an access rate, OpsPerSec / Pages, and 1/A is
# Pages/OpsPerSec. Thus we have derived the formula.
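
A small sketch of the computation with hypothetical prices and rates (these numbers are made up to show the arithmetic, they are not from the paper):

# Break-even interval in seconds:
#   Pages/MBofCache / (OpsPerSec/StorageDevice) * Price/StorageDevice / (Price/MBofCache)
pages_per_mb_cache = 256      # 4kb pages per MB of cache
ops_per_sec_device = 10000    # read ops/sec supplied by one storage device
price_per_device   = 300.0    # $ per storage device (hypothetical)
price_per_mb_cache = 0.005    # $ per MB of cache (hypothetical)

technology_ratio = pages_per_mb_cache / ops_per_sec_device
economic_ratio   = price_per_device / price_per_mb_cache
break_even_secs  = technology_ratio * economic_ratio

# Cache pages accessed more frequently than once per break_even_secs seconds
print(round(break_even_secs), "seconds")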

Technology

The numerator for the technology ratio changes slowly. Assuming file system pages are cached, the page size is now a multiple of 4kb (ignoring compression). There is one case for smaller pages -- NVM on the memory bus might allow for the return of 512b pages. The denominator for the technology ratio is interesting. It has barely changed for disk but changed a lot for SSD.

For the economic ratio, the cost of DRAM, disk, SSD and NVM is not changing at the same rate, so the advice changes depending on the technology pair. In 2007 the advice for DRAM+SSD was to cache objects accessed every 15 minutes and in 2017 it is every 7 minutes, so less DRAM is needed. The advice for DRAM+disk is the opposite. DRAM cost dropped more than disk OpsPerSec improved so more DRAM cache is needed.

Cost

The paper has some discussion on cost. The cost can be the retail price from a device vendor or the all-in cost from a public cloud vendor. The retail price can be misleading as that isn't TCO. You need to consider power and more. In either case (retail, public cloud) capacities come in discrete sizes and you might not be able to deploy the optimal values.

Finally, the 5 minute rule doesn't solve for all constraints, nor should it. There are limits to power, space and throughput that must be considered. Motherboards are limited by memory slots, PCIe slots and throughput to storage. Power might not be unlimited. Constrained optimization is an interesting topic, but not for this post.

Friday, October 25, 2019

Nitpicking papers in the LSM space

Nits:
  • LSM compaction does merge, not merge sort. Compaction inputs are already sorted. Merge sort is O(N log N) for N records. Naive merge is O(N log K) for N records and K input streams and in some cases RocksDB can do better than naive merge -- see the reference to optimized binary heap. Compaction merges for RocksDB with leveled compaction have two forms -- L0 to L1 and Ln to Ln+1. For L0 to L1 there might be ~4 streams from the L0 and 1 stream from the L1 and size(L1) is usually >= size(L0). For Ln to Ln+1 there is one stream from each and size(Ln+1) is approximately 10 * size(Ln). A k-way merge sketch follows this list.
  • LSMs do large random writes, not sequential writes. Writes are sequential from the perspective of a single file but a busy LSM does compaction and memtable flush concurrently. So there are concurrent reads and writes from the perspective of the device -- each stream of reads and writes is sequential but they are interleaved at the device level.
  • The statement "write amplification is approximately 10 * num-levels" is some of: the worst case, a hypothesis, an unvalidated performance model. I am a frequent user of unvalidated performance models but they have their limits. There is an amazing paper that measures write-amp in practice and then provides a brilliant explanation. I wish there were more reporting of write-amp from production workloads. LevelDB overstates write-amp because it uses too many levels, which happens because the L0 and L1 are small (~10mb).
  • It is great to read about new index structures that don't support range queries in pursuit of less write-amp. If a workload doesn't need range queries then maybe an LSM isn't the best choice. The other reason to use an LSM with leveled compaction is low space-amp, so I hope that some alternatives consider ways to keep space-amp low while also getting low write-amp.
  • Consider CPU and IO overhead. That is, there are IO and CPU components for read and write amplification. IO gets a lot of attention. CPU needs more attention.
  • Benchmarking is hard. See here for an example of a horrible mistake I made that led to many bogus reports of lousy MySQL performance regressions. I explained some of the problems for LevelDB and RocksDB benchmarks here and here. LevelDB doesn't target high performance, RocksDB is hard to configure, papers rarely provide enough details to support reproduction and even if they did nobody is volunteering their time to reproduce results. I assume that benchmark result accuracy is inversely related to the number of systems that have been tested for a given paper -- expertise is in short supply. My offer for advice still stands. I have read about possible programs that would volunteer grad students to reproduce results. While I think that would be a great education, I wonder if it would advance their academic careers. 
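
The compaction merge in the first nit is a k-way merge of sorted streams. A minimal sketch in Python (this is not RocksDB code; a real LSM also resolves duplicate keys by sequence number):

import heapq

def kway_merge(*sorted_streams):
    """Merge K sorted input streams into one sorted output stream.
    Cost is O(N log K) for N records, not O(N log N) as in merge sort,
    because each input is already sorted."""
    return heapq.merge(*sorted_streams)

# One stream per sorted run: for example an L0 file and one L1 stream
l0_run = [1, 5, 9]
l1_run = [2, 5, 8, 10]
print(list(kway_merge(l0_run, l1_run)))  # [1, 2, 5, 5, 8, 9, 10]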

Thursday, October 24, 2019

Tuning space and write amplification to minimize cost

In a previous post I described how to tune for space and write amplification with an LSM or index+log index structure. In this post I explain how to use that to minimize the storage cost for a workload.

This is an optimization problem and the objective function is to minimize the number of storage devices. The model I use was described in a previous post but I summarize it here:
  • Storage device supplies r units of read IO, w units of write endurance and s units of capacity.
  • Workload demands R units of read IO, W units of write endurance and S units of capacity.
  • max(R/r, W/w, S/s) storage devices are required when amplification is ignored
  • max(R*ar/r, W*aw/w, S*as/s) storage devices are required when amplification is not ignored where ar, aw and as are read, write and space amplification
Here I assume that read IO is never the bottleneck and the goal is to minimize max(W*aw/w, S*as/s). That occurs when W*aw/w = S*as/s. I assume that W, w, S and s are constants so the goal is to tune aw and as to solve the equation. Note that W/w and S/s are the number of devices needed for the workload based on endurance and capacity.

I thought I previously published a blog on this topic but I could not find it.

Index+log

For index+log I assume that aw = 100/(100-pfull) and as = 100/pfull where pfull is the percentage of device capacity available to the user. With that I can determine the value of pfull that solves the equation.

W*aw/w = S*as/s
# reorder LHS and RHS
W/w*aw = S/s*as
# replace aw and as
W/w * 100/(100-pfull) = S/s * 100/pfull
# multiply LHS and RHS by pfull and then (100-pfull)
pfull * W/w * 100 = (100-pfull) * S/s * 100
# divide LHS and RHS by 100
pfull * W/w = (100-pfull) * S/s
# expand RHS
pfull * W/w = 100 * S/s - pfull * S/s
# add pfull * S/s to LHS and RHS
pfull * W/w + pfull * S/s = 100 * S/s
# factor LHS
pfull * (W/w + S/s) = 100 * S/s
# done
pfull = 100 * S/s / (W/w + S/s)

From the solution if I need 10 devices based on endurance (W/w = 10) and 30 devices based on capacity (S/s = 30) then using pfull = 100 * 30 / (10+30) = 75% minimizes the number of devices required. With pfull=75 then aw=4 (100/25), as=1.33 (100/75), W/w*aw=40, S/s*as=40 and the workload needs 40 storage devices. Were I to use pfull=50 then aw=2, as=2, W/w*aw=20, S/s*as=60 and the workload needs 60 devices. Were I to use pfull=80 then aw=5, as=1.25, W/w*aw=50, S/s*as=38 (rounded up) and the workload needs 50 devices. So pfull=75 looks like a good choice.
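
A small sketch that reproduces the example, using the W/w and S/s values from this post:

def devices_needed(w_ratio, s_ratio, pfull):
    """Devices needed for index+log at a given pfull, as a real number
    (round up to buy whole devices). w_ratio = W/w and s_ratio = S/s."""
    write_amp = 100.0 / (100 - pfull)
    space_amp = 100.0 / pfull
    return max(w_ratio * write_amp, s_ratio * space_amp)

w_ratio, s_ratio = 10, 30
best = min(range(1, 100), key=lambda p: devices_needed(w_ratio, s_ratio, p))
print("best pfull =", best)  # 75
for p in (50, 75, 80):
    print(p, round(devices_needed(w_ratio, s_ratio, p), 1))  # 60.0, 40.0, 50.0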

LSM

Space and write amplification are inversely related for both LSM and index+log but the math is easier for index+log. We can start with the equation to solve, but I won't solve it.

W*aw/w = S*as/s
# reorder LHS and RHS
W/w*aw = S/s*as
# replace aw with 0.8 * fo * L and as with 1 + 1/fo
# where fo is per-level fanout and L is number of levels 
W/w * 0.8 * fo * L = S/s * (1 + 1/fo)
# alas fo (per-level fanout) is a function of L (number of levels) and total fanout
# assume total fanout is t then fo = t^(1/L) where t is a constant
W/w * 0.8 * t^(1/L) * L = S/s * (1 + 1/t^(1/L))

I stopped at this point because the math isn't easy and L must be an integer >= 1, so an easy way to solve this is to compute the LHS and RHS for L=1, 2, ..., 10 and choose the L that minimizes the difference. For the example above with W/w=10 and S/s=30, LSM write-amp is always sufficiently larger than space-amp that write-amp determines the number of devices and L=6 or 7 minimizes write-amp and the number of storage devices. Were S/s increased to 200 then L=3 or 4 minimizes the number of storage devices. The sketch below computes the device counts per value of L.
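
This is how I would do that search, using the constants from this post (the 0.8 factor, total fanout t=1024, W/w=10, S/s=30):

def lsm_device_counts(w_ratio, s_ratio, total_fanout, levels):
    """Device counts implied by endurance (write-amp) and capacity (space-amp)
    for an LSM with leveled compaction and the given number of levels."""
    fo = total_fanout ** (1.0 / levels)   # per-level fanout
    write_amp = 0.8 * fo * levels
    space_amp = 1 + 1.0 / fo
    return w_ratio * write_amp, s_ratio * space_amp

w_ratio, s_ratio, t = 10, 30, 1024
for levels in range(1, 11):
    by_endurance, by_capacity = lsm_device_counts(w_ratio, s_ratio, t, levels)
    print(levels, round(by_endurance, 1), round(by_capacity, 1),
          "->", round(max(by_endurance, by_capacity), 1), "devices")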

Wednesday, October 23, 2019

Write vs space amplification for an LSM and index+log

There is an inverse relationship between write and space amplification for the LSM and index+log index structures -- with more space (write) there is less write (space) amplification. I expect that the tradeoff is better for index+log than for an LSM with leveled compaction which means that index+log has less write-amp for a given amount of space-amp. It will be interesting to determine whether my expectation is true.

My expectation about index+log is based on the following assumptions and all models are wrong, some are useful applies here.
  • For index+log write-amp is 100 / (100-pfull), space-amp is 100/pfull, pfull is the percentage of the device capacity that can be used and (100-pfull) is the percentage of device capacity set aside to reduce write-amp.
  • For an LSM with leveled compaction write-amp is f * L * total-fo^(1/L), space-amp is 1 + 1/total-fo^(1/L), f is ~0.8 (see this paper), L is the number of levels and total-fo is the total fanout -- size(database) / size(memtable)
The following chart displays write-amp as a function of space-amp assuming that total fanout is 1024 for the LSM. The numbers are based on the formulas above. It shows that write-amp for index+log is always less than write-amp for the leveled LSM for a similar amount of space-amp. Of course lunch isn't free and you usually pay elsewhere with index+log via more cache-amp and more expensive range scans.
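
The chart data can be regenerated from the formulas above; a short sketch that prints the same curves:

# Write-amp vs space-amp for a leveled LSM (varying the number of levels)
# and for index+log (varying pfull). Total fanout for the LSM is 1024.
total_fanout = 1024

print("leveled LSM: levels, space-amp, write-amp")
for levels in range(1, 11):
    fo = total_fanout ** (1.0 / levels)
    print(levels, round(1 + 1 / fo, 3), round(0.8 * fo * levels, 1))

print("index+log: pfull, space-amp, write-amp")
for pfull in range(10, 100, 10):
    print(pfull, round(100 / pfull, 2), round(100 / (100 - pfull), 2))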


Capacity vs Endurance

There is another way to illustrate the relationship between write-amp and space-amp for index+log. I use %cap and %end to indicate the percentage of device capacity and device endurance that is available to a user after accounting for the impact of pfull. When pfull is larger, write-amp increases and space-amp decreases.

It is interesting that %cap + %end = 100.

Assuming:
pfull is percentage of device capacity available to user
space-amp = 100/pfull
write-amp = 100/(100-pfull)
%cap = 100/space-amp
%end = 100/write-amp

Then:
%cap + %end 
  = 100/space-amp + 100/write-amp
  = 100/(100/pfull) + 100/(100/(100-pfull))
  = pfull + (100-pfull)
  = 100

pfull  write-amp  space-amp  %cap  %end
90     10         1.11       90    10
80     5          1.25       80    20
70     3.33       1.43       70    30
60     2.5        1.67       60    40
50     2          2          50    50
40     1.67       2.5        40    60
30     1.43       3.33       30    70
20     1.25       5          20    80
10     1.11       10         10    90

Tuesday, October 22, 2019

A review of uDepot - keeping up with fast storage

This is a review of Reaping the performance of fast NVM storage with uDepot which was published in FAST 2019. The paper is worth reading. uDepot is hash-based index+log using my index structure terminology. The goal is to create a database engine that can utilize all of the IO capacity of a fast storage device -- think millions of read IOPs and ~10 usec latency. As I wrote yesterday, stranding read IO is one way to waste a storage device but so are too much space and write amplification.

By hash-based index+log I mean that updates are appended to a log and an in-memory hash index points into the log. Log space is managed as segments and grains. Segments are large (1gb) and contain smaller grains (4kb). GC reclaims previously written segments by copying out live grains. The index must be updated during GC.
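
A toy sketch of the hash-based index+log idea -- my simplification, not uDepot's code, with no segments, grains or durability:

class ToyIndexPlusLog:
    """Toy hash-based index+log: an in-memory hash index points into an
    append-only log. Not uDepot."""

    def __init__(self):
        self.log = []     # append-only log of (key, value) records
        self.index = {}   # key -> offset of the live record in the log

    def put(self, key, value):
        self.index[key] = len(self.log)   # index points at the new record
        self.log.append((key, value))     # append, never overwrite

    def get(self, key):
        offset = self.index.get(key)
        return None if offset is None else self.log[offset][1]

    def gc(self):
        """Copy live records to a new log so dead records are dropped."""
        new_log, new_index = [], {}
        for key, offset in self.index.items():
            new_index[key] = len(new_log)
            new_log.append(self.log[offset])
        self.log, self.index = new_log, new_index

db = ToyIndexPlusLog()
db.put("a", 1)
db.put("a", 2)   # the old record for "a" becomes garbage in the log
db.gc()
print(db.get("a"), len(db.log))   # 2 1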

Evaluating this according to the CRUM conjecture:
  • read amplification - uDepot does one storage read per point query as there is no block cache. The cost of a block cache is complexity, memory and CPU cycles. The benefit of a block cache is a reduction in storage traffic. uDepot doesn't support range queries because it is hash-based.
  • write amplification - space and write amplification are inversely related with index+log. Doing GC more frequently reduces space-amp at the cost of more write-amp. The paper doesn't discuss this tradeoff and I don't know whether it is configurable (yet) in uDepot.
  • space amplification - see the comment for write-amp. From the paper it wasn't clear whether grains could be shared by small records. If not shared then there will be more space-amp.
  • cache amplification - the hash index needs at least 8 bytes in memory per record. There are hash-based approaches that use less memory per record - SkimpyStash and SILT need ~1 byte/record. The need for something in memory per record is common to index+log approaches because records are not clustered in the log. The memory requirements for uDepot are reduced because it doesn't use a block cache.

uDepot supports get, put and delete. It does not support a range scan because it is hash-based. While hash-based approaches can use much less CPU than tree-based approaches, and hash-based is sufficient if you don't need range scans, I am curious whether there is sufficient demand to justify the cost of building a production quality hash-based index structure. I hope there is.

Implementation details

The hash index is an array of hash tables. The array can grow dynamically by doubling in size as needed. The paper did not explain whether the array can be reduced in size. Growing is online and incremental. The reported worst-case blocks some operations for 1 millisecond. The hash tables use Hopscotch hashing to support a high fill factor. There is an array of mutexes per hash table and some benchmarks were run with 8192 mutexes/table. The hash index is eventually made durable in the log. The last N changes to the index might not be durable. The paper claims the index can be recovered after a crash in a few seconds but the process wasn't fully explained.

The log has large segments (1gb each) that contain smaller grains (4kb each). A segment stores either records or the index. I wrote above that uDepot might not share grains between small records which will waste space. GC copies live grains from a segment to make the segment free. The GC process -- how and when segments are selected for GC -- was not explained. uDepot expects a raw device. This avoids filesystem overhead but using a filesystem makes life easier in production. The paper did not explain the overhead saved by not using a filesystem.

More implementation details

The implementation raises two interesting questions. What is the best way to do fast IO? What is the best way to implement a thread per core server?

For fast IO uDepot uses SPDK or Linux AIO. I assume that it could work great with io_uring when io_uring becomes widely available. Linux has a habit of eventually catching up to modern hardware once said hardware is sufficiently available. It will be interesting if io_uring removes the need for SPDK. In figures 7 and 8 the paper has results that show a dramatic improvement from using async IO with thread/core compared to sync IO with many threads.

For thread per core uDepot uses TRT -- Task Run Time. This provides coroutines for systems programming. TRT uses cooperative multitasking so it must know when to reschedule tasks. IO and synchronization are done via TRT interfaces to help in that regard. Under the covers it can use async IO and switch tasks while the IO or sync call is blocked. One benefit from coroutines is reducing the number of context switches.

I am curious about the future of coroutines for systems programming in C and C++. RethinkDB used a thread per core model and started with callbacks, then realized that coroutines made development easier -- see here and here. Coroutines are coming, or have come, to Seastar. Boost supports fibers and coroutines. I assume they will eventually arrive in standard C++.

Monday, October 21, 2019

How many storage devices does a workload require?

I read an interesting paper that was motivated by the inability of existing storage engines to utilize all of the read IO provided by new and fast storage devices. That is excellent motivation but the story has more nuance.

A storage device provides capacity, endurance and reads. For now read means read operations and I ignore read throughput and latency. For a given device and workload none of the dimensions (capacity, endurance, reads) might be saturated. When saturation occurs it is usually limited to one of the dimensions. In that case if you want to reduce the cost of storage then change something to be more efficient in the saturated dimension. When capacity (endurance, read IO) is the bottleneck then reducing space (write, read) amplification is an answer.

It is hard to saturate a storage device in all dimensions so I am cautious when insufficient utilization in any dimension is cited as a problem. Too much saturation leads to lousy quality of service. Besides, workloads rarely have constant demand -- web-scale traffic varies by time of day.

When endurance or capacity are the bottleneck then it can be OK to not use all of the read IO provided by a new storage device. A simple model for this is:
  • Storage device supplies r units of read IO, w units of write endurance and s units of capacity.
  • The workload demands R units of read IO, W units of write endurance and S units of capacity.
  • max(R/r, W/w, S/s) storage devices are required when amplification is ignored
  • max(R*ar/r, W*aw/w, S*as/s) storage devices are required when amplification is not ignored where ar, aw and as are read, write and space amplification
  • To reduce the number of storage devices focus on the saturated dimension and tune or change the index structure
An example

For (R=10000, W=1000, S=100, r=10, w=10, s=10, ar=4, aw=10, as=2) then max(R*ar/r, W*aw/w, S*as/s) = max(10000*4/10, 1000*10/10, 100*2/10) = max(4000, 1000, 20) = 4000 and read IO is the bottleneck. But change S from 100 to 100000 and this becomes max(4000, 1000, 20000) = 20000 and capacity is the bottleneck.
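
The example as a small function (as is a Python keyword, so the space-amp argument is named a_space):

def devices_required(R, W, S, r, w, s, ar, aw, a_space):
    """Storage devices required after accounting for read, write and
    space amplification."""
    return max(R * ar / r, W * aw / w, S * a_space / s)

print(devices_required(R=10000, W=1000, S=100, r=10, w=10, s=10,
                       ar=4, aw=10, a_space=2))      # 4000.0, read IO bound
print(devices_required(R=10000, W=1000, S=100000, r=10, w=10, s=10,
                       ar=4, aw=10, a_space=2))      # 20000.0, capacity bound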

Friday, October 18, 2019

Just put the cold data over there

There are several ways to use less SSD for an OLTP workload: choose a database engine that has less space amplification, store less data, or move the cold data elsewhere. The first approach is a plan while the others are goals. A plan is something you can implement. A goal requires a plan to get done.

This matters when you want to decrease the cost of storage for a database workload but not everyone needs to do that. The first approach assumes your DBMS supports an LSM with leveled compaction and compression (MariaDB and Percona Server include MyRocks, ScyllaDB and Cassandra are also options). The second approach, store less data, assumes you can get everyone to agree to remove data and that is a hard conversation.

The third approach, move the cold data elsewhere, is a wonderful goal. I wonder how often that goal is achieved. To implement this you must find data that won't (well, almost never) be read or written again and then move it to less expensive storage. I assume this has been more successful when implemented in storage than in a DBMS. The obvious example is an OS page cache but there are also tiered storage servers. An LSM already classifies data that is hot vs cold for writes, data closer to the L0 was written more recently, but that might not imply anything about read likelihood.

I have read wonderful research papers that show how a DBMS might do this but I am still wary. Assume that data can be described by read and write likelihood attributes -- {read N, read forever} X {write N, write forever} -- then the goal is to find data that is read N, write N for which N has been reached. You frequently won't know the value of N and monitoring will be needed to identify data that is rarely read or written, along with more CPU, RAM and IO to perform that monitoring. This is easier to do when the granularity of hot vs cold is per table but that is rare in my experience. I assume the common case is a table with a mix of hot and cold data.

Don't forget that it is a lousy experience when cold data becomes hot again.

This post was inspired by something I read on a wonderful blog -- Blocks and Files (no snark, blog is great). My summary of the post is that SSD endurance isn't a problem, just evict cold data to cold storage. Just remember that is a goal not a plan.

Update - I can't believe I forgot to mention the RocksDB persistent read cache that can use a faster persistent device (like Optane) to cache data read from a slower persistent device. The RocksDB-cloud effort from Rockset makes RocksDB work on S3 and uses the read cache to benefit from local storage. This post explains RocksDB-cloud.

Update - I am not sure I reviewed this before, but there is an interesting paper that extends RocksDB to migrate cold data to cold storage -- see Mutant.

Monday, October 7, 2019

Learned indexes for an LSM?

The Learned Indexes paper opened a new area of research for storage engines. The idea is to use the distribution of the data to make searches faster and/or reduce the size of search structures. Whether ML will be used is an implementation artifact to me. I am in this for the efficiency gains -- reducing space and CPU read amplification. I prefer to call this topic learned search structures rather than learned indexes because the public isn't ready for learned recovery and learned consistency.

While the Learned Indexes paper has ideas for hash maps and bloom filters most research has focused on the B-Tree. I expect learned indexes to work better with an LSM because its search structures (block indexes) are created offline with respect to user operations. There is more time for analysis and the per-SST search structures are write-once so there is no need to deal with updates.

Other search optimizations

One of the goals for learned tree indexes is to avoid log2(N) comparisons on a search so I will mention a few things in InnoDB and RocksDB that help with that:
  • The adaptive hash index (AHI) in InnoDB is a non-persistent hash table built on demand that maps a primary key value to a block in the buffer pool. With a clustered index like InnoDB (and MyRocks) non-covering secondary index queries must probe the PK index per row to get the missing columns. The AHI can avoid that CPU overhead.
  • RocksDB uses bloom filters to avoid some of the search overhead per level of the LSM tree. They don't avoid all of the per-level overhead as a search must be done to find the SST before the bloom filter can be checked.
  • RocksDB optionally uses a hash index per data block to avoid binary search within the data block. I assume that learned index approaches can augment or replace some of the data block index, the per-block hash index and the bloom filter. A sketch of that lookup path follows this list.
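
This is not RocksDB code, just the shape of the idea: the SST must be found before its bloom filter can be checked, and the search within the SST could use a block index, a per-block hash index or a learned structure.

class SST:
    """Toy sorted run: a dict plus a key range and a stand-in bloom filter."""
    def __init__(self, records):
        self.records = dict(records)
        self.min_key, self.max_key = min(self.records), max(self.records)
        self.bloom = set(self.records)   # stand-in for a real bloom filter

    def may_contain(self, key):          # bloom check: no false negatives
        return key in self.bloom

    def search(self, key):               # stand-in for block index search
        return self.records.get(key)

def point_lookup(levels, key):
    """Search newer levels first. Within a level, find the one SST whose key
    range covers the key, check its bloom filter, then search the SST."""
    for level in levels:
        sst = next((t for t in level if t.min_key <= key <= t.max_key), None)
        if sst is None or not sst.may_contain(key):
            continue
        value = sst.search(key)
        if value is not None:
            return value
    return None

levels = [[SST({"b": 2})], [SST({"a": 1, "c": 3}), SST({"x": 9})]]
print(point_lookup(levels, "c"))  # 3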