Thursday, May 9, 2019

CRUM conjecture - read, write, space and cache amplification

The RUM Conjecture asserts that an index structure can't be optimal for all of read, write and space. I will ignore whether optimal is about performance or efficiency (faster is better vs efficient-er is better). I want to use CRUM in place of RUM where C stands for database cache.

The C in CRUM is the amount of memory per key-value pair (or row) the DBMS needs so that either a point query or the first row from a range query can be retrieved with at most X storage reads. The C can also be reported as the minimal database : memory ratio to achieve at most X storage reads per point query.
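
To make that arithmetic concrete, here is a minimal sketch of the database : memory math. The sizes are hypothetical and not from any benchmark.

    def min_memory_gb(database_gb, cache_amp):
        # cache_amp is the minimal database : memory ratio that keeps
        # point queries at no more than X storage reads
        return database_gb / cache_amp

    # A 1000 GB database with C=100 needs 10 GB of memory while one
    # with C=10 needs 100 GB for the same at-most-X-reads guarantee.
    print(min_memory_gb(1000, 100))  # 10.0
    print(min_memory_gb(1000, 10))   # 100.0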

My points here are:
  • There are 4 amplification factors - read, write, space and cache
  • CRUM is for comparing index structure efficiency and performance
  • Read and write amplification have CPU and IO parts
  • Write amplification has immediate and deferred parts

Many "X is faster than Y" papers and articles neglect to quantify the tradeoffs made in pursuit of performance. I hope that changes and that we develop better methods for quantifying the tradeoffs (a short rant on defining better).

Amplification factors (RUM -> CRUM) are used to compare index structures. Values for the factors are measured for real workloads and estimated for hypothetical ones. The comparison is the most valuable part. Knowing that the deferred CPU write-amp on inserts for a b-tree is 30 is not that useful. Knowing that it is 3X or 0.5X the value for an LSM is useful.

Workload matters. For estimates of amplification I usually assume uniform distribution because this simplifies the estimate. But there is much skew in production workloads and that impact can be measured to complement the estimates.
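
As a toy illustration of why skew matters, the sketch below compares cache hit rates for uniform and Zipfian access when only the hottest 1% of keys fit in cache. It assumes numpy is installed and all of the numbers are made up.

    import numpy as np

    rng = np.random.default_rng(42)
    n_keys, n_queries = 1_000_000, 100_000
    cached = n_keys // 100  # assume the hottest 1% of keys are cached

    uniform = rng.integers(0, n_keys, n_queries)
    # zipf samples start at 1; treat smaller values as hotter keys
    zipfian = np.minimum(rng.zipf(1.2, n_queries) - 1, n_keys - 1)

    for name, keys in (("uniform", uniform), ("zipfian", zipfian)):
        print(name, "cache hit rate ~", (keys < cached).mean())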


Read Amplification


An overview of read-amp is in a previous post, and another post explains it in detail for an LSM. There are two parts to read-amp -- CPU and IO. Thus for each of the three basic operations (point query, range seek, range next) there are two values for read-amp: CPU and IO. I have yet to consider deferred read-amp, so by read-amp here I mean immediate read-amp.

Metrics for CPU read-amp include CPU time, bytes/pages read and key comparisons. I use key comparisons when predicting performance and CPU time when running tests. I have not used bytes/pages read. While key comparisons are a useful metric, they ignore other CPU overheads including hash table search, bloom filter search, page reads and page decompression.
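
As a toy example of key comparisons as a metric, a point query into one sorted run of N keys costs about log2(N) comparisons. The sketch below just counts them; it stands in for a page or SST search, not for any real engine.

    def search_with_comparisons(keys, target):
        # binary search over a sorted list, counting key comparisons
        comparisons, lo, hi = 0, 0, len(keys) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            comparisons += 1
            if keys[mid] == target:
                return True, comparisons
            comparisons += 1
            if keys[mid] < target:
                lo = mid + 1
            else:
                hi = mid - 1
        return False, comparisons

    # at most ~2 * log2(1M) ~ 40 comparisons for 1M keys
    print(search_with_comparisons(list(range(0, 2_000_000, 2)), 123_456))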

Metrics for IO read-amp include bytes read and pages read. I use pages read for disk and bytes read for SSD because disks are IOPs limited for small reads. IO read-amp implies extra CPU read-amp when the database is compressed. Decompressing pages after storage reads can use a lot of CPU with fast storage devices, and even more with zlib, but you should be using zstd.

With estimates for hypothetical workloads I assume there is a cache benefit as explained in the Cache Amplification section. This is likely to mean that comparisons assume a different amount of memory for index structures that have more or less cache-amp. For real tests I mostly run with database >> memory but don't attempt to use the least memory that satisfies the at-most-X-reads cache-amp constraint.


Write Amplification


An overview of write-amp is in a previous post, and another post explains it in detail for an LSM. Write-amp has two dimensions: CPU vs IO, immediate vs deferred. For each operation (insert, delete, update) there are four values for write-amp: immediate CPU, deferred CPU, immediate IO and deferred IO.

The immediate write-amp occurs during the write. The deferred write-amp occurs after the write completes and includes writing back dirty pages in a b-tree and compaction in an LSM.
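
As a hypothetical illustration of that split for a b-tree, the sketch below treats the redo log write as the immediate IO and page writeback as the deferred IO. The row and page sizes are made up.

    def btree_io_write_amp(row_size, page_size, rows_per_writeback=1):
        immediate = 1.0  # redo log writes ~row_size bytes per modified row
        deferred = page_size / (row_size * rows_per_writeback)
        return immediate, deferred

    # 128-byte rows, 8 KB pages, one modified row per written-back page
    print(btree_io_write_amp(128, 8192))     # (1.0, 64.0)
    # more modified rows per written-back page reduces the deferred part
    print(btree_io_write_amp(128, 8192, 8))  # (1.0, 8.0)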

Possible metrics for CPU write-amp include bytes written, pages written, key comparisons and pages/bytes (de)compressed. Bytes and (in-memory) pages written are useful metrics for an in-memory DBMS, but my focus is on databases >> memory.

Possible metrics for IO write-amp include bytes written and pages written. These can be estimated for hypothetical workloads and measured for real ones. The choice between bytes or pages written might depend on whether disk or SSD is used, as one is limited by ops/s and the other by transfer rate. If you use iostat to measure this then figure out whether Linux still counts bytes trimmed as bytes written.

Examples of deferred and immediate write-amp:
  • The InnoDB change buffer is deferred IO and CPU. Checking the change buffer and applying changes is deferred CPU. The deferred IO is from reading pages from storage to apply changes.
  • For a b-tree: page writeback is deferred IO, compression and creating the page checksum are deferred CPU, finding the in-memory copy of a page is immediate CPU, reading the page on a cache miss is immediate IO.
  • An LSM insert has immediate/deferred IO/CPU (a back-of-the-envelope estimator follows this list).
    • Immediate CPU - key comparisons for memtable insert
    • Immediate IO - redo log write
    • Deferred IO - reading uncached pages for input SSTs and writing output SSTs during compaction
    • Deferred CPU - decompression, compression and key comparisons while merging input SSTs into output SSTs during compaction. Note that compaction does a merge, not a sort or merge+sort.
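
To estimate the deferred IO part of an LSM insert, a common approximation is that leveled compaction rewrites each key-value pair about per-level-growth times at each level. The level count and growth factor below are assumptions, not measurements.

    def lsm_deferred_io_write_amp(num_levels, per_level_growth=10):
        # each key-value pair is rewritten ~per_level_growth times
        # per level as that level is compacted into the next one
        return num_levels * per_level_growth

    # 6 levels with 10x growth per level -> deferred IO write-amp ~60
    print(lsm_deferred_io_write_amp(6))  # 60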

Space Amplification


Space-amp is the size of the database files versus the size of the data, or the ratio of the physical to logical database size. An estimate for the logical size is the size of the uncompressed database dump with some adjustment if secondary indexes are used. The space-amp is reduced by compression. It is increased by fragmentation in a b-tree and uncompacted data in an LSM.

It is best to measure this after the DBMS has reached a steady state to include the impact of fragmentation and uncompacted data.
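
Here is a minimal sketch of that measurement, assuming the physical size is everything under the DBMS data directory and the logical size is an uncompressed dump. The paths are placeholders.

    import os

    def dir_size_bytes(path):
        # sum the sizes of all files under the data directory
        total = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
        return total

    def space_amp(data_dir, dump_file):
        return dir_size_bytes(data_dir) / os.path.getsize(dump_file)

    # space_amp("/var/lib/db", "/backups/dump.sql") might be < 1 with
    # compression and > 1 with fragmentation or uncompacted data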


Cache Amplification


I briefly described cache-amp in a previous post. Cache-amp describes memory efficiency. It represents the minimal database : memory ratio such that a point query requires at most X storage reads. A DBMS with cache-amp=10 (C=10) needs 10 times more memory than one with C=100 to satisfy the at-most-X-reads constraint.

Cache-amp is harder to reason about for range seek and range next because processing them is more complicated for an LSM or index+log algorithm. Therefore I usually limit this to point queries.

For a few years I limited this to X=1 (at most 1 storage read), but it will be interesting to consider X=2 or 3. With X=1 (a sketch of the memory math follows this list):
  • For a b-tree all but the leaf level must be in cache
  • For an LSM the cache must include all bloom filter and index blocks, and all data blocks except those in the max level
  • For an index+log approach it depends (wait for another blog post)
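
A rough sketch of the X=1 memory math from the list above. All parameters (fanout, bloom filter bits/key, sizes) are hypothetical, and LSM index blocks are ignored to keep it short.

    def btree_min_memory_gb(db_gb, fanout=128):
        # all non-leaf levels must be cached; they are roughly the
        # database size divided by the per-page fanout
        return db_gb / fanout

    def lsm_min_memory_gb(db_gb, billion_keys, fanout=10, bloom_bits_per_key=10):
        # bloom filters for every key plus all data above the max level
        filters_gb = billion_keys * bloom_bits_per_key / 8
        non_max_data_gb = db_gb / fanout
        return filters_gb + non_max_data_gb

    print(btree_min_memory_gb(1024))   # 8.0 GB for a 1 TB b-tree
    print(lsm_min_memory_gb(1024, 8))  # 112.4 GB for 8B keys in 1 TB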

