
Showing posts from October, 2019

USL, universal scalability law, is good to know

The USL is worth understanding. USL is short for universal scalability law and was created by Neil Gunther for use in capacity planning. I don't do much capacity planning but the USL is also valuable for explaining performance. Performance problems for the old InnoDB rw-lock (circa 2007) would have been easy to describe via the USL because "wake all" on unlock is an N^2 overhead -- see the β parameter in the USL. A longer version of the USL is here. The USL predicts how throughput will change (increase or decrease) with concurrency. One version of the formula, where X(i) is throughput with i concurrent clients, is:

                 X(1) * N
    X(N) = ----------------------
           1 + α(N-1) + βN(N-1)

The denominator determines how throughput changes with concurrency. When it is one then there is linear scaleup. The best way to get linear scaleup is to cheat and choose an easy base case, but otherwise the denominator is greater than 1 and the USL ex
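To make the formula concrete, here is a minimal sketch in Python that evaluates X(N) for a few concurrency levels. The values for X(1), α and β are made up for illustration; with β > 0 throughput eventually peaks and then declines as N grows.

```python
# A minimal sketch of the USL formula. X(1), alpha and beta below are
# made-up example values chosen only to show the shape of the curve
# (throughput rises, flattens, then falls as N grows).
def usl_throughput(n, x1, alpha, beta):
    """X(N) = X(1) * N / (1 + alpha*(N-1) + beta*N*(N-1))"""
    return (x1 * n) / (1 + alpha * (n - 1) + beta * n * (n - 1))

if __name__ == "__main__":
    x1, alpha, beta = 1000.0, 0.02, 0.0005   # hypothetical values
    for n in (1, 2, 4, 8, 16, 32, 64, 128):
        print(f"N={n:4d}  X(N)={usl_throughput(n, x1, alpha, beta):8.1f}")
```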

Review of "5 Minute rule - thirty years later"

The 5 Minute rule paper has been updated and it is worth reading. The paper explains how to make cost-based decisions for tiered storage, applies that using price and performance of devices over the past 30 years and then makes predictions based on current trends. There is a prediction in the paper that will provoke discussion -- the price vs performance advantage for disk vs tape is getting smaller and soon we might need Hadoop for tape. That will be an interesting project.

5 minute rule math

The 5 minute rule math explains what data should be in a cache by considering the price of the cache and the price/performance of the storage device. The math has been used where cache+storage are DRAM+disk, DRAM+SSD and SSD+disk. Now it has been updated for NVM. The equation is presented as two ratios: technology and economic. Below I explain how to derive it.

    Technology ratio                Economic ratio

    Pages/MBofCache                 Price/StorageDevice
    --------------------------  *   -------------------
    Accesses/sec/StorageDevice      Price/MBofCache
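A minimal sketch of the arithmetic, not from the paper, with made-up prices and device numbers: the two ratios multiply into a break-even interval, and a page re-read more often than that interval should stay in cache.

```python
# A minimal sketch of the 5 minute rule arithmetic. All prices and device
# numbers below are made up for illustration, not taken from the paper.
def break_even_seconds(pages_per_mb_cache, accesses_per_sec_device,
                       price_per_device, price_per_mb_cache):
    technology_ratio = pages_per_mb_cache / accesses_per_sec_device
    economic_ratio = price_per_device / price_per_mb_cache
    return technology_ratio * economic_ratio

if __name__ == "__main__":
    # hypothetical DRAM + disk numbers
    interval = break_even_seconds(pages_per_mb_cache=256,       # 4kb pages
                                  accesses_per_sec_device=200,  # IOPs per disk
                                  price_per_device=50.0,        # $ per disk
                                  price_per_mb_cache=0.005)     # $ per MB of DRAM
    print(f"keep a page cached if it is re-read within ~{interval:.0f} seconds")
```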

Nitpicking papers in the LSM space

Nits:

LSM compaction does merge, not merge sort. Compaction inputs are already sorted. Merge sort is O(NlogN) for N records. Naive merge is O(NlogK) for N records and K input streams, and in some cases RocksDB can do better than naive merge -- see the reference to optimized binary heap. A sketch of a K-way merge is below. Compaction merges for RocksDB with leveled compaction have two forms -- L0 to L1 and Ln to Ln+1. For L0 to L1 there might be ~4 streams from the L0 and 1 stream from the L1, and size(L1) is usually >= size(L0). For Ln to Ln+1 there is one stream from each and size(Ln+1) is approximately 10 * size(Ln).

LSMs do large random writes, not sequential writes. Writes are sequential from the perspective of a single file but a busy LSM does compaction and memtable flush concurrently. So there are concurrent reads and writes from the perspective of the device -- each stream of reads and writes is sequential but they are interleaved at the device level.

The statement write amplification is approximately 10 * num-levels
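Here is a minimal sketch (not RocksDB code) of the naive K-way merge mentioned above, using a binary heap over the input streams; each heap operation costs O(log K), so merging N records is O(NlogK) rather than the O(NlogN) of a full sort.

```python
# A minimal sketch of K-way merge over already-sorted runs, as done by LSM
# compaction. The heap holds one record per input run, so each pop/push is
# O(log K) and merging N records is O(N log K).
import heapq

def k_way_merge(sorted_runs):
    """Yield records from K sorted input runs in sorted order."""
    heap = []
    for run_id, run in enumerate(sorted_runs):
        it = iter(run)
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, run_id, it))
    while heap:
        value, run_id, it = heapq.heappop(heap)
        yield value
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, run_id, it))

if __name__ == "__main__":
    runs = [[1, 4, 9], [2, 3, 8], [5, 6, 7]]   # e.g. a few L0 runs plus the L1
    print(list(k_way_merge(runs)))             # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```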

Tuning space and write amplification to minimize cost

In a previous post I described how to tune for space and write amplification with an LSM or index+log index structure. In this post I explain how to use that to minimize the storage cost for a workload. This is an optimization problem and the objective function is to minimize the number of storage devices. The model I use was described in a previous post but I summarize it here:

A storage device supplies r units of read IO, w units of write endurance and s units of capacity. A workload demands R units of read IO, W units of write endurance and S units of capacity. max(R/r, W/w, S/s) storage devices are required when amplification is ignored. max(R*ar/r, W*aw/w, S*as/s) storage devices are required when amplification is not ignored, where ar, aw and as are read, write and space amplification.

Here I assume that read IO is never the bottleneck and the goal is to minimize max(W*aw/w, S*as/s). That occurs when W*aw/w = S*as/s. I assume that W, w, S and s are cons
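A minimal sketch of the device-count arithmetic, with made-up workload and device numbers, showing how the required number of devices shifts as write-amp and space-amp trade off:

```python
# A minimal sketch of the device-count model described above, not the post's
# exact code. All workload and device numbers below are made up.
def devices_required(R, W, S, r, w, s, ar=1.0, aw=1.0, a_s=1.0):
    """max over (read IO, write endurance, capacity) of demand/supply,
    with demand inflated by read/write/space amplification."""
    return max(R * ar / r, W * aw / w, S * a_s / s)

if __name__ == "__main__":
    # hypothetical workload demand and per-device supply
    R, W, S = 50_000, 2_000, 30_000     # read IO, write endurance, capacity
    r, w, s = 100_000, 500, 4_000       # per device
    for aw, a_s in [(10, 1.1), (5, 1.5), (2, 3.0)]:   # write-amp vs space-amp
        n = devices_required(R, W, S, r, w, s, aw=aw, a_s=a_s)
        print(f"write-amp={aw:4.1f} space-amp={a_s:3.1f} -> devices={n:6.1f}")
```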

Write vs space amplification for an LSM and index+log

There is an inverse relationship between write and space amplification for the LSM and index+log index structures -- with more space (write) there is less write (space) amplification. I expect that the tradeoff is better for index+log than for an LSM with leveled compaction, which means that index+log has less write-amp for a given amount of space-amp. It will be interesting to determine whether my expectation is true. My expectation about index+log is based on the following assumptions, and "all models are wrong, some are useful" applies here.

For index+log: write-amp is 100 / (100 - pfull) and space-amp is 100 / pfull, where pfull is the percentage of the device capacity that can be used and (100 - pfull) is the percentage of device capacity set aside to reduce write-amp.

For an LSM with leveled compaction: write-amp is f * L * total-fo^(1/L) and space-amp is 1 + 1/total-fo^(1/L), where f is ~0.8 (see this paper), L is the number of levels and total-fo is the total fanout -- size(database) / size(memtab
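A minimal sketch of the two models above, with made-up parameter values, to compare write-amp at a similar space-amp. This is arithmetic on the model, not a benchmark.

```python
# A minimal sketch of the index+log and LSM-leveled models above.
# Parameter values are made up for illustration.
def index_log(pfull):
    """Returns (write-amp, space-amp) for a device that is pfull% usable."""
    return 100.0 / (100.0 - pfull), 100.0 / pfull

def lsm_leveled(total_fo, levels, f=0.8):
    """Returns (write-amp, space-amp) for L levels with the given total fanout."""
    per_level_fo = total_fo ** (1.0 / levels)
    return f * levels * per_level_fo, 1.0 + 1.0 / per_level_fo

if __name__ == "__main__":
    for pfull in (50, 80, 90):
        wa, sa = index_log(pfull)
        print(f"index+log pfull={pfull:3d}%  write-amp={wa:5.1f}  space-amp={sa:4.2f}")
    for levels in (2, 3, 4):
        wa, sa = lsm_leveled(total_fo=1000, levels=levels)
        print(f"LSM leveled L={levels}       write-amp={wa:5.1f}  space-amp={sa:4.2f}")
```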

A review of uDepot - keeping up with fast storage

This is a review of Reaping the performance of fast NVM storage with uDepot, which was published in FAST 2019. The paper is worth reading. uDepot is hash-based index+log using my index structure terminology. The goal is to create a database engine that can utilize all of the IO capacity of a fast storage device -- think millions of read IOPs and ~10 usec latency. As I wrote yesterday, stranding read IO is one way to waste a storage device but so are too much space and write amplification.

By hash-based index+log I mean that updates are appended to a log and an in-memory hash index points into the log. Log space is managed as segments and grains. Segments are large (1gb) and contain smaller grains (4kb). GC reclaims previously written segments by copying out live grains. The index must be updated during GC.

Evaluating this according to the CRUM conjecture: read amplification - uDepot does one storage read per point query as there is no block cache. The cost of a block cache is
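A minimal, in-memory sketch of the hash-based index+log idea: appends go to a log carved into segments and grains, and a hash index maps each key to the (segment, grain) holding its latest value. The names and sizes are illustrative and this is not uDepot's actual layout, which persists the log on NVM and must also update the index during GC.

```python
# Illustrative sketch of hash-based index+log; not uDepot code.
GRAIN = 4096                     # bytes per grain
GRAINS_PER_SEGMENT = 256 * 1024  # ~1gb segments at 4kb grains

class IndexPlusLog:
    def __init__(self):
        self.log = []            # list of segments; each segment is a list of grains
        self.index = {}          # key -> (segment_no, grain_no)

    def put(self, key, value):
        if not self.log or len(self.log[-1]) == GRAINS_PER_SEGMENT:
            self.log.append([])                    # start a new segment
        seg_no, grain_no = len(self.log) - 1, len(self.log[-1])
        self.log[-1].append((key, value))          # append-only write
        self.index[key] = (seg_no, grain_no)

    def get(self, key):
        # one lookup in the in-memory index, then one "storage read" of the grain
        seg_no, grain_no = self.index[key]
        return self.log[seg_no][grain_no][1]
```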

How many storage devices does a workload require?

I read an interesting paper that was motivated by the inability of existing storage engines to utilize all of the read IO provided by new and fast storage devices. That is excellent motivation but the story has more nuance. A storage device provides capacity, endurance and reads. For now read means read operations and I ignore read throughput and latency.

For a given device and workload none of the dimensions (capacity, endurance, reads) might be saturated. When saturation occurs it is usually limited to one of the dimensions. In that case, if you want to reduce the cost of storage, then change something to be more efficient in the saturated dimension. When capacity (endurance, read IO) is the bottleneck then reducing space (write, read) amplification is an answer. It is hard to saturate a storage device in all dimensions so I am cautious when insufficient utilization in any dimension is cited as a problem. Too much saturation leads to lousy quality of service. Besides, workloads ra
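A minimal sketch, with made-up numbers, of finding the saturated dimension: compute demand/supply per dimension and the largest ratio names the bottleneck, which is where reducing amplification pays off first.

```python
# Illustrative sketch: identify the saturated dimension of a storage device.
# The demand and supply numbers below are made up.
def bottleneck(demand, supply):
    """demand and supply are dicts keyed by dimension name."""
    utilization = {d: demand[d] / supply[d] for d in demand}
    worst = max(utilization, key=utilization.get)
    return worst, utilization

if __name__ == "__main__":
    demand = {"reads": 30_000, "endurance": 900, "capacity": 3_500}
    supply = {"reads": 100_000, "endurance": 1_000, "capacity": 4_000}
    worst, util = bottleneck(demand, supply)
    print(util)                               # per-dimension utilization
    print("saturated dimension:", worst)      # reduce amplification here first
```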

Just put the cold data over there

There are several ways to use less SSD for an OLTP workload: choose a database engine that has less space amplification, store less data, move the cold data elsewhere. The first approach is a plan while the others are goals. A plan is something you can implement. A goal requires a plan to get done. This matters when you want to decrease the cost of storage for a database workload, but not everyone needs to do that.

The first approach assumes your DBMS supports an LSM with leveled compaction and compression (MariaDB and Percona Server include MyRocks; ScyllaDB and Cassandra are also options). The second approach, store less data, assumes you can get everyone to agree to remove data and that is a hard conversation. The third approach, move the cold data elsewhere, is a wonderful goal. I wonder how often that goal is achieved. To implement this you must find data that won't (well, almost never) be read or written again and then move it to less expensive storage. I assume this has been

Learned indexes for an LSM?

The Learned Indexes paper opened a new area of research for storage engines. The idea is to use the distribution of the data to make searches faster and/or reduce the size of search structures. Whether ML will be used is an implementation artifact to me. I am in this for the efficiency gains -- reducing space and CPU read amplification. I prefer to call this topic learned search structures rather than learned indexes because the public isn't ready for learned recovery and learned consistency. While the Learned Indexes paper has ideas for hash maps and bloom filters, most research has focused on a B-Tree. I expect learned indexes to work better with an LSM because its search structures (block indexes) are created offline with respect to user operations. There is more time for analysis and the per-SST search structures are write once so there is no need to deal with updates.

Other search optimizations

One of the goals for learned tree indexes is to avoid log2(N) comparison
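As a toy illustration of the idea, and not the paper's method, here is a sketch that approximates the key-to-position mapping of a sorted run with a fitted line plus an error bound, then searches only within that bound. For an LSM this model could be built once when the SST is written since the file never changes. The function names and the linear model are mine, chosen only to show the shape of a learned search structure.

```python
# Illustrative learned-search sketch: fit a line over a sorted key list,
# predict the position of a key, then do a bounded local search.
import bisect

def fit_linear(keys):
    """Least-squares line mapping key to its position, plus the max error."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys) or 1.0
    slope = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(keys)) / var
    intercept = mean_p - slope * mean_k
    max_err = max(abs(p - (slope * k + intercept)) for p, k in enumerate(keys))
    return slope, intercept, int(max_err) + 1

def lookup(keys, model, key):
    slope, intercept, max_err = model
    guess = int(slope * key + intercept)
    lo = max(0, guess - max_err)
    hi = min(len(keys), guess + max_err + 1)
    i = bisect.bisect_left(keys, key, lo, hi)   # search only the error window
    return i if i < len(keys) and keys[i] == key else None

if __name__ == "__main__":
    keys = sorted(range(0, 100000, 7))          # a roughly uniform key set
    model = fit_linear(keys)
    print(lookup(keys, model, 70), lookup(keys, model, 71))  # position, None
```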