## Posts

Showing posts from October, 2019

### USL, universal scalability law, is good to know

The USL is worth understanding. USL is short for universal scalability law, which was created by Neil Gunther for use in capacity planning. I don't do much capacity planning, but the USL is also valuable for explaining performance. Performance problems for the old InnoDB rw-lock (circa 2007) would have been easy to describe via the USL because "wake all" on unlock is an N² overhead -- see the β parameter in the USL. A longer version of the USL is here. The USL predicts how throughput will change (increase or decrease) with concurrency. One version of the formula, where X(i) is throughput with i concurrent clients, is:

                   X(1) * N
    X(N) = -----------------------
           1 + α(N-1) + βN(N-1)

The denominator determines how throughput changes with concurrency. When it is one there is linear scaleup. The best way to get linear scaleup is to cheat and choose an easy base case, but otherwise the denominator is greater than 1 and the USL ex…
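As a quick sketch, the formula can be computed directly. The α (contention) and β (coherency) values below are hypothetical, chosen only to show the curve bending down at high concurrency:

```python
# Sketch of the USL formula; the alpha and beta values used here are
# made up for illustration, not measured from any system.
def usl_throughput(n, x1, alpha, beta):
    """Predicted throughput X(N) for N concurrent clients,
    given single-client throughput X(1)."""
    return (x1 * n) / (1 + alpha * (n - 1) + beta * n * (n - 1))

# With beta > 0, throughput peaks and then retrogrades as N grows.
curve = {n: usl_throughput(n, x1=1000.0, alpha=0.02, beta=0.0001)
         for n in (1, 8, 64, 512)}
```

The β term is the one that models "wake all" style overheads: it grows with N(N-1), so it eventually dominates and throughput declines.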

### Review of "5 Minute rule - thirty years later"

The 5 Minute rule paper has been updated and it is worth reading. The paper explains how to make cost-based decisions for tiered storage, applies that using the price and performance of devices over the past 30 years, and then makes predictions based on current trends. There is a prediction in the paper that will provoke discussion -- the price vs performance advantage for disk vs tape is getting smaller and soon we might need Hadoop for tape. That will be an interesting project.

5 minute rule math

The 5 minute rule math explains what data should be in a cache by considering the price of the cache and the price/performance of the storage device. The math has been used where cache+storage are DRAM+disk, DRAM+SSD and SSD+disk. Now it has been updated for NVM. The equation is presented as two ratios, technology and economic. Below I explain how to derive it.

        Technology ratio                Economic ratio

         Pages/MBofCache             Price/StorageDevice
    --------------------------  *  -----------------------
    Accesses/Sec/StorageDevice        Price/MBofCache

…
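To make the math concrete, here is the break-even interval in its classic Gray/Putzolu form; the page size, IOPS and prices below are invented for illustration, not taken from the paper:

```python
# 5 minute rule sketch: keep a page cached if it is re-referenced more
# often than once per break-even interval. Sample numbers are made up.
def break_even_seconds(pages_per_mb_cache, accesses_per_sec_device,
                       price_per_device, price_per_mb_cache):
    technology_ratio = pages_per_mb_cache / accesses_per_sec_device
    economic_ratio = price_per_device / price_per_mb_cache
    return technology_ratio * economic_ratio

# Example: 4KB pages (256 per MB), a $50 disk doing 100 IOPS,
# DRAM at $0.005 per MB.
interval = break_even_seconds(256, 100, 50.0, 0.005)
```

The intuition: the technology ratio says how much device IO a MB of cache can displace, and the economic ratio converts that into dollars, so the product is the re-reference interval at which cache and storage cost the same.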

### Nitpicking papers in the LSM space

Nits:

- LSM compaction does merge, not merge sort. Compaction inputs are already sorted. Merge sort is O(NlogN) for N records. A naive merge is O(NlogK) for N records and K input streams, and in some cases RocksDB can do better than a naive merge -- see the reference to the optimized binary heap. Compaction merges for RocksDB with leveled compaction have two forms -- L0 to L1 and Ln to Ln+1. For L0 to L1 there might be ~4 streams from the L0 and 1 stream from the L1, and size(L1) is usually >= size(L0). For Ln to Ln+1 there is one stream from each, and size(Ln+1) is approximately 10 * size(Ln).
- LSMs do large random writes, not sequential writes. Writes are sequential from the perspective of a single file, but a busy LSM does compaction and memtable flush concurrently. So there are concurrent reads and writes from the perspective of the device -- each stream of reads and writes is sequential but they are interleaved at the device level.
- The statement that write amplification is approximately 10 * num-levels…
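The merge vs merge-sort distinction is easy to see with a heap-based K-way merge over already-sorted streams, e.g. Python's heapq.merge:

```python
import heapq

# Three already-sorted input streams, as a compaction merge would see them.
runs = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]

# K-way merge via a binary heap: O(N log K) comparisons for N records
# across K streams. No O(N log N) sort is needed because each input
# stream is already in order.
merged = list(heapq.merge(*runs))
```

With K around 5 for an L0-to-L1 merge and K = 2 for an Ln-to-Ln+1 merge, log K is a small constant, which is why compaction is much cheaper than re-sorting.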

### Tuning space and write amplification to minimize cost

In a previous post I described how to tune for space and write amplification with an LSM or index+log index structure. In this post I explain how to use that to minimize the storage cost for a workload. This is an optimization problem and the objective function is to minimize the number of storage devices. The model I use was described in a previous post but I summarize it here:

- A storage device supplies r units of read IO, w units of write endurance and s units of capacity.
- A workload demands R units of read IO, W units of write endurance and S units of capacity.
- max(R/r, W/w, S/s) storage devices are required when amplification is ignored.
- max(R*ar/r, W*aw/w, S*as/s) storage devices are required when amplification is not ignored, where ar, aw and as are read, write and space amplification.

Here I assume that read IO is never the bottleneck and the goal is to minimize max(W*aw/w, S*as/s). That occurs when W*aw/w = S*as/s. I assume that W, w, S and s are cons…
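A minimal sketch of the objective function; the workload and device numbers below are hypothetical, chosen to show an endurance-bound versus a capacity-bound tuning:

```python
import math

# Devices needed when read IO is not the bottleneck. aw and as_ are the
# write and space amplification of the index structure; W, w, S, s follow
# the model in the text. All sample values are made up.
def devices_needed(W, aw, w, S, as_, s):
    return math.ceil(max(W * aw / w, S * as_ / s))

# Endurance-heavy tuning: low space-amp, high write-amp
d1 = devices_needed(W=100, aw=20, w=1000, S=500, as_=1.1, s=1000)
# Space-heavy tuning: high space-amp, low write-amp
d2 = devices_needed(W=100, aw=5, w=1000, S=500, as_=2.0, s=1000)
```

The two terms trade against each other: lowering write-amp usually raises space-amp and vice versa, which is why the minimum sits where W*aw/w = S*as/s.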

### Write vs space amplification for an LSM and index+log

There is an inverse relationship between write and space amplification for the LSM and index+log index structures -- with more space (write) amplification there is less write (space) amplification. I expect that the tradeoff is better for index+log than for an LSM with leveled compaction, which means that index+log has less write-amp for a given amount of space-amp. It will be interesting to determine whether my expectation is true. My expectation about index+log is based on the following assumptions, and all models are wrong, some are useful applies here.

- For index+log: write-amp is 100 / (100-pfull) and space-amp is 100/pfull, where pfull is the percentage of the device capacity that can be used and (100-pfull) is the percentage of device capacity set aside to reduce write-amp.
- For an LSM with leveled compaction: write-amp is f * L * total-fo^(1/L) and space-amp is 1 + 1/total-fo^(1/L), where f is ~0.8 (see this paper), L is the number of levels and total-fo is the total fanout -- size(database) / size(memtab…
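The two models are easy to compare numerically. A sketch using the formulas above, with pfull, L and total-fo values chosen only for illustration:

```python
import math

def index_log_amps(pfull):
    # pfull = percentage of device capacity that can be used
    # returns (write-amp, space-amp)
    return 100.0 / (100.0 - pfull), 100.0 / pfull

def lsm_leveled_amps(L, total_fo, f=0.8):
    # L levels, total fanout total-fo, per-level fanout total-fo^(1/L)
    # returns (write-amp, space-amp)
    per_level_fo = total_fo ** (1.0 / L)
    return f * L * per_level_fo, 1.0 + 1.0 / per_level_fo

# index+log at 80% full: write-amp 5, space-amp 1.25
il = index_log_amps(80.0)
# leveled LSM with 3 levels and total fanout 1000 (per-level fanout ~10):
# write-amp ~24, space-amp ~1.1
lsm = lsm_leveled_amps(3, 1000.0)
```

In this made-up configuration index+log pays 1.25x in space to get write-amp 5, while the leveled LSM pays write-amp ~24 to get space-amp ~1.1, which is the shape of the tradeoff the post describes.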

### A review of uDepot - keeping up with fast storage

This is a review of Reaping the performance of fast NVM storage with uDepot, which was published in FAST 2019. The paper is worth reading. uDepot is hash-based index+log, using my index structure terminology. The goal is to create a database engine that can utilize all of the IO capacity of a fast storage device -- think millions of read IOPS and ~10 usec latency. As I wrote yesterday, stranding read IO is one way to waste a storage device, but so are too much space and write amplification.

By hash-based index+log I mean that updates are appended to a log and an in-memory hash index points into the log. Log space is managed as segments and grains. Segments are large (1GB) and contain smaller grains (4KB). GC reclaims previously written segments by copying out live grains. The index must be updated during GC.

Evaluating this according to the CRUM conjecture:

- read amplification - uDepot does one storage read per point query as there is no block cache. The cost of a block cache is…
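A toy sketch of the hash-based index+log idea: an in-memory hash index maps each key to an offset in an append-only log. The class and method names are mine, not uDepot's API, and segment/grain management and GC are omitted:

```python
class IndexPlusLog:
    """Toy hash-based index+log: one point read per get, no block cache."""

    def __init__(self):
        self.log = []     # append-only log of (key, value) records
        self.index = {}   # in-memory hash index: key -> log offset

    def put(self, key, value):
        # New versions are appended; the old record becomes a dead grain
        # that GC would eventually reclaim by copying out live grains.
        self.index[key] = len(self.log)
        self.log.append((key, value))

    def get(self, key):
        # One log (storage) read per point query.
        off = self.index.get(key)
        return None if off is None else self.log[off][1]
```

The sketch also shows why GC must update the index: when a live grain is copied to a new segment, its offset changes and the hash entry must be repointed.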