Small Datum: May 2019

Friday, May 17, 2019

index+log: implementations

My take on index+log systems like WiscKey is that they are neither better nor worse than an LSM - it all depends on your workload. But I am certain that we know much more about an LSM than about the index+log systems. Hopefully that changes over time as some of them are thriving.

The first index+log system that I read about was Berkeley DB Java Edition. The design paper is worth reading. Since then there have been a few more implementations and papers that I describe here. This list is probably incomplete: Bitcask, ForestDB, WiscKey, HashKV, TitanDB and RocksDB BlobDB.

At this point the systems that are getting into production, TitanDB and BadgerDB, use an LSM for the index. I wonder if an index structure that supports update-in-place would be better especially when the index must be cached because I expect the CPU read-amp for an LSM to be about 2X larger than for a b-tree and a b-tree supports update-in-place which makes it easier to relocate values during GC.

While I like index+log systems I think that papers and marketing tend to overstate LSM write-amp. For production RocksDB I usually see write-amp between 10 and 20. I expect that index+log could achieve something closer to 5. This paper from CMU explains one reason why per-level write-amp in an LSM is less than the per-level fanout (less than 10). Write skew is another reason.

The systems

Bitcask was part of the Riak effort.

The index is an in-memory hash table. The index isn't durable and the entire log has to be scanned to rebuild it on startup -- whether or not this was after a crash or a clean shutdown. The slow startup is a problem.
The value log is circular and GC copies live values from the tail to the head of the log. Liveness is determined by an index search.

ForestDB was a finalist in the SIGMOD 2011 student programming contest. Eventually the project and creator moved to Couchbase. It is worth reading about here and on the github page. I published blog posts that compare ForestDB and RocksDB: 1, 2, 3 and 4. Google finds more interesting things to read.

The index is a space-efficient trie.
The value log might have log segments. GC copies live values to the head of the log. Liveness is determined by an index search.

WiscKey is described as an LSM with key-value separation and made popular the term key-value separation. I put it in the index+log family of index structures.

The index is an LSM. There is no redo log for the index as it can be recovered from the head of the value log.
Kudos for many references to amplification factors. The paper uses bytes read for read-amp. I prefer to consider both IO and CPU for read-amp with key comparisons for CPU and storage reads for IO.
It doesn't mention that it has more cache-amp than an LSM, but few papers mention that problem. Shrinking the LSM by keeping large values separate doesn't make the index parts of it (filter and index blocks) easier to cache as they are already separate from the data blocks. There is more to cache with index+log as I describe here.
It claims to overcome the (worst-case) problem of one storage IO per KV pair on a range scan by fetching in parallel. Assuming the storage device has enough spare IO this might hide the problem but it doesn't fix it. With many workloads there isn't spare IO and extra IO for reads also means extra CPU for decompression.
The value log is circular and single-threaded GC copies live values to the head of the log. Liveness is determined by an index search. I assume that multi-threaded GC is feasible.
The paper isn't clear about the total write-amp that might occur from both the first write to the value log and GC that follows.
Compression isn't explained.

BadgerDB is a golang implementation, and much more, of the WiscKey paper.

It has many features and many production use cases. This is impressive.
GC is scheduled by the user. Based on Options.NumCompactors I assume it can be multi-threaded.
The docs state that the LSM can be served from RAM because the values are elsewhere. That is true but I don't consider it a feature. It must be in RAM to avoid IO from liveness queries done by GC. An LSM isn't a monolithic thing. There are index blocks, data blocks and filter blocks and most of the LSM, data blocks from the max level, don't have to be cached.
There is extra work on reads to find values that have been moved by GC. See the comments about BadgerDB here.

HashKV is an interesting paper that avoids index queries during GC.

Hash-based data grouping distributes KV pairs by hash into one of N logs. GC is probably done by scanning a log twice -- once to get the keys and the second time to relocate the live values. A value is live when the most recent key is not a tombstone. A value might be live when needed for a snapshot. GC doesn't do index searches so the index doesn't have to be cached to make GC efficient but you might want to cache it to avoid doing index IO on queries -- and this index is still much larger than the block index for an LSM.
Hotness awareness copies cold values to a cold segment to avoid repeated GC for a value that doesn't get updated or deleted. A header for the value is kept in the non-cold log.
Small values are stored inline in the LSM.
I am curious if more log groups means more write-amp. See my comment about fsync in a previous post.
I am curious whether managing the hash buckets is easy. The goal is to make sure that keys for a segment group fit in memory. The range and number of buckets must change over time. Does this have anything in common with linear and extendible hashing?

TitanDB is part of TiDB and TiDB is thriving.

A WAL is used for new writes. This might make it easier to compress data on the first write to the value logs.
GC appears to do index searches to determine liveness.
Compression is per-record. I hope this does per-block in the future.
It might let the user tune between space-amp and write-amp via discardable_ratio.
This is compatible with most of the RocksDB API.k

RocksDB BlobDB is an extension to RocksDB that uses log segments for large values and stores small values in the LSM. GC copies live values and liveness is determined by an index search.

Future work

Future work for index+log systems includes:

Determine whether a b-tree is better than an LSM for the index structure
Determine whether the HashKV solution is the best way to avoid liveness queries during GC.
If an LSM is used for the index structure determine efficient ways to support relocating values during GC without extra overheads and complexity during read processing.
Determine whether clever things can be done during GC.

Block compression is easier than on first write.
Hot/cold value separation is possible (see HashKV). This is an example of generational GC even if we rarely mention that for index structures.
Values in a log segment can be ordered by key rather than by time of write during GC. GC can also merge multiple ordered segments to create longer sorted runs. I wonder if it is then possible to use block indexes (key+pointer per block rather than per row) to reduce cache-amp for such log segments.

Thursday, May 16, 2019

index+log: open issues

This post is about open issues for the index+log family of index structures. See past posts here and here for more details on index+log. The success of LSM has lead to several solutions that use an LSM for the index. I am curious if that is a good choice when implementing index+log from scratch. It is a great choice when starting with an LSM and then adding index+log.

Open Issues

The open issues for index+log include:

Is GC multi-threaded?
Does value log GC require index queries?
Must the index be cached?
Is block compression possible?
With an LSM as the index how is relocating an entry in the value log supported?

Is GC multi-threaded?

I hope so.
It hasn't been in some of the index+log papers/systems.

Does value log GC require index queries?

Yes, in most of the index+log papers/systems this is required to determine whether a value is live when scanning the value log during GC
HashKV proposed a solution that doesn't require such queries. It is a wonderful paper but I think there are more things to figure out. The general idea is to hash KV pairs into buckets such that all keys for a bucket will fit in memory during GC. GC then reads the value log segments for a bucket twice -- first to get the keys, second to copy the live KV pairs into a new log segment. I wonder if managing the hash bucket ranges has things in common with linear and extendible hashing.

Must the index be cached?

The index for index+log might be 10X larger than for an LSM because an LSM uses a block index while index+log uses a row index. A block index has a key+pointer per block and a row index has that per row. An LSM only needs a block index because rows are clustered by key.
To do at most one IO per point query for a database larger than memory the index must be cached for index+log to avoid index IO as the IO is spent reading the value log. If the <= 1 IO / query constraint is then cache-amp is larger for index+log compared to LSM because the index+log index is larger (see #1).
If value log GC queries the index to determine whether each value is live then this query shouldn't do IO or GC will be too slow and inefficient. This is another reason for caching the index+log index.
If the index must be cached then maybe an LSM isn't the best choice. Consider an index structure optimized for a main-memory DBMS.

Is block compression possible?

I hope so but this hasn't been explained in some of the index+log papers/systems.
Per-record compression can be done instead of block compression. That will have a lower compression rate but less CPU overhead when decompressing on read.
It might be hard to do block compression the first time a KV pair is written to a value log. One option is to write to a redo log until enough data is available to do block compression and then write to the value log. Another option is to defer block compression until KV pairs are rewritten during GC.

When the index is an LSM what happens when values are moved?

This is an issue that I learned about via CockroachDB. I should have figured it out long ago but mistakes happen.
The LSM tree is read in order -- top to bottom with leveled compaction and left to right with tiered compaction. This guarantees that the correct result is returned with respect to visibility. If the first entry for a key is a tombstone then the search can stop (ignoring snapshot reads).
Value log GC moves live KV pairs to new log segments. To find the KV pair after the move either the index entry must be updated or the index entry must have a logical value log key and then another index is needed to map the logical value log key to a physical value log (filename, offset).
Updating the LSM index entry to reference the new value log location (filename, offset) can be done by inserting a new KV pair into the LSM but that either breaks read consistency semantics or complicates read processing. It would break LSM read processing because inserting back into the LSM implies this is a new write, but it just a move for GC. Something other than an LSM that supports update-in-place makes this easier.
Details on using a logical value log key are explained in a CockroachDB github issue.

Wednesday, May 15, 2019

Index+log, v2

I put most index structures into one of three categories -- page-based, LSM or index+log. My focus is on databases larger than memory and I might be ignoring categories used for main memory DBMS. Over the past decade index+log has gotten more attention and this is my second attempt at explaining it.

My definition of index+log is simple -- data is appended to a log on each write, index entries point into the log and GC scans the log to copy live values and discard others. The log can be one file or many log segments. GC might have to search the index to determine whether a value is live. The value written to the log might include the key.

Bitcask was the first index+log system that I noticed but I assume there are earlier examples and just found one -- Berkeley DB Java Edition. While WiscKey made popular the term key-value separation and is presented as an LSM variant, I put it in the index+log category. Other interesting index+log systems include RocksDB BlobDB, TitanDB, ForestDB and Faster. While many of the tree-based solutions use an LSM for the index that is not required by index+log and an LSM is not used by Berkeley DB Java Edition or ForestDB.

For an index+log solution that must cache all of the index and query the index during GC I suspect that an LSM is not the best choice for the index. Although if you already have an LSM (see RocksDB BlobDB) then I get it.

The summary of my LSM vs index+log comparison is:

index+log has less write-amp but more space-amp
index+log has much more cache-amp, maybe 10X more
index+log might have more deferred CPU write-amp
I have yet to validate my claims by running benchmarks with index+log implementations.

Note that #1 is a feature, #2 is not a feature and for #3 it depends. The key point is that the cost of faster writes from index+log is more cache-amp (more memory) and more IO for range queries. In production with RocksDB I frequently see write-amp=15 with space-amp=1.1. I assume that could be reduced with index+log to write-amp ~=5 and space-amp ~= 1.3. It might be possible to avoid or reduce the impact of #2 and #3 in future index+log implementations.

Amplification Factors

I am speculating on read, write, space and cache amplification for index+log because I have little hands on experience with index+log implementations. Another reason for speculation is that index+log allows for different structures for the index (b-tree, LSM, etc) which affects some of the estimates.

The amplification factors are:

cache - the cache-amp for index+log is likely to be much larger than for an LSM. To achieve at most one IO per point query index+log might need 10X (or more) memory versus an LSM. Clustering values in key order doesn't just make range queries faster, it also means an LSM only needs block indexes in memory (key+pointer per block) while index+log needs a key+pointer in memory for every KV pair in the database. When there are 10 KV pairs per block then this is 10X more memory. Even when the database has hot and cold keys they are likely to be interleaved on the same index leaf pages -- so all of those leaf pages must be in memory to avoid doing two IOs (one for the leaf page, one for the value) on a point query.

There is additional data that an LSM and b-tree need in memory to satisfy the one IO per query constraint and I described that in previous posts (mostly things from the non leaf/max level).
It might be valid to debate whether my one IO per point query constraint is valid, but this blog post is already long.
Another reason for the index to be cached is to avoid doing IO during GC when index searches are done to determine whether KV pairs are live.

space - I ignore space-amp for the index and focus on the log because the log is usually larger. With index+log the user can trade between write-amp and space-amp. With the variable pct_full representing the percentage of live data in the database (a value between 1 and 100) then:

space-amp = 100 / pct_full
write-amp = 100 / (100 - pct_full)
Just like with an LSM, previously written KV pairs are rewritten during GC with the index+log approach. Fortunately this is done less often with index+log.
I assumed that block compression is used but that is harder to implement for index+log. The WiscKey paper doesn't describe a solution and the HashKV paper suggests using per-record compression, which will have a lower compression rate versus block as used by an LSM. I assume block compression can be done for index+log but it isn't trivial.
To explain the estimates when pct_full=X assume that all log segments have X% live data (yes, this is naive). When GC is done on a log segment X% is copied leaving (100-X)% free space in the newly written log segment. So in total 100% of a log segment is written for each (100 - pct_full)% of new data, which is the formula above.
Thus with pct_full=90 then space-amp is 1.1 while write-amp is 10. Comparing these with a leveled LSM the space-amp is similar while the write-amp is slightly better than what I see in production. To get a write-amp that is significantly better the cost is more space-amp. For example with pct-full=75 then write-amp=4 and space-amp=1.33.

read (CPU) - see here for range seek/next. The summary is that when an LSM is used for index+log then the costs are similar to an LSM. When a b-tree is used for index+log then the costs are smaller.

read (IO) - see here for range seek/next. In the cache-amp estimate I assume that the index is cached so the only IO to be done is for the log. Therefore the index structure (LSM vs b-tree) doesn't matter.

point query - the IO read-amp is <= 1 because the log is not cached.
range seek - range seek doesn't do IO when the index is cached
range next - this is much larger for index+log than for an LSM because it might do one IO per call to range next because rows are not clustered by key in the log. When data is compressed then there also is the CPU overhead for decompression per KV pair.

write - by write I assume update (read-modify-write) rather than a blind write.

immediate CPU - the cost of an index search. See the section on read CPU for point queries above.
immediate IO - the cost of an optional redo log write for the index structure and then writing the (value) log. Note that the minimum size of a write done by the storage device might be 4kb even if the data written is much smaller. Doing an fsync per 128 byte value might have a write-amp of 32 if that write is really forced to storage and doesn't just linger in a capacitor backed write cache.
deferred CPU - the deferred CPU write-amp is the cost of index searches done during GC, unless the HashKV approach is used. With pct_full=75, write-amp=4 and space-amp=1.33 then GC is done ~4 times for each key and the deferred CPU cost is 4 index searches. When live KV pairs are copied by GC then there is also the CPU and IO overhead from updating the index entry to point to the new location in the log.
deferred IO - this is determined by the percentage of live data in the database. With pct_full=75 it is 4.

Thursday, May 9, 2019

CRUM conjecture - read, write, space and cache amplification

The RUM Conjecture asserts that an index structure can't be optimal for all of read, write and space. I will ignore whether optimal is about performance or efficiency (faster is better vs efficient-er is better). I want to use CRUM in place of RUM where C stands for database cache.

The C in CRUM is the amount of memory per key-value pair (or row) the DBMS needs so that either a point query or the first row from a range query can be retrieved with at most X storage reads. The C can also be reported as the minimal database : memory ratio to achieve at most X storage reads per point query.

My points here are:

There are 4 amplification factors - read, write, space and cache
CRUM is for comparing index structure efficiency and performance
Read and write amplification have CPU and IO parts
Write amplification has immediate and deferred parts

Many X is faster than Y papers and articles neglect to quantify the tradeoffs made in pursuit of performance. I hope that changes and we develop better methods for quantifying the tradeoffs (a short rant on defining better).

Amplification factors (RUM -> CRUM) are used to compare index structures. Values for the factors are measured for real workloads and estimated for hypothetical ones. The comparison is the most valuable part. Knowing that the deferred CPU write-amp on inserts for a b-tree is 30 is not that useful. Knowing that it is 3X or 0.5X the value for an LSM is useful.

Workload matters. For estimates of amplification I usually assume uniform distribution because this simplifies the estimate. But there is much skew in production workloads and that impact can be measured to complement the estimates.

Read Amplification

This post is an overview for read-amp. This post explains it in detail for an LSM. There are two parts to read-amp -- CPU and IO. Thus for each of the three basic operations (point query, range seek, range next) there are 2 values for read-amp: CPU and IO. I have yet to consider deferred read-amp and by read-amp I mean immediate read-amp.

Metrics for CPU read-amp include CPU time, bytes/pages read and key comparisons. I use key comparisons when predicting performance and CPU time when running tests. I have not used bytes/pages read. While key comparisons are a useful metric they ignore other CPU overheads including hash table search, bloom filter search, page read and page decompress.

Metrics for IO read-amp include bytes read and pages read. I use pages read for disk and bytes read for SSD because disks are IOPs limited for small reads. IO read-amp implies extra CPU read-amp when the database is compressed. Decompressing pages after storage reads can use a lot of CPU with fast storage devices and even more with zlib but you should be using zstd.

With estimates for hypothetical workloads I assume there is a cache benefit as explained in the Cache Amplification section. This is likely to mean that comparisons assume a different amount of memory for index structures that have more or less cache-amp. For real tests I mostly run with database >> memory but don't attempt to use the least memory that satisfies the cache-amp X reads constraint.

Write Amplification

This post is an overview for write-amp. This post explains it in detail for an LSM. Write-amp has two dimensions: CPU vs IO, immediate vs deferred. For each operation (insert, delete, update) there are 4 values for write-amp: immediate CPU, deferred CPU, immediate IO and deferred IO.

The immediate write-amp occurs during the write. The deferred write-amp occurs after the write completes and includes writing back dirty pages in a b-tree and compaction in an LSM.

Possible metrics for CPU write-amp include bytes written, pages written, key comparisons and pages/bytes (de)compressed. Bytes and (in-memory) pages written are useful metrics for in-memory DBMS but my focus is on databases >> memory.

Possible metrics for IO write-amp include bytes written and pages written. These can be estimated for hypothetical workloads and measured for real ones. The choice between bytes or pages written might depend on whether disk or SSD is used as one is limited by ops/s and the other by transfer rate. If you use iostat to measure this then figure out whether Linux still counts bytes written as bytes trimmed.

Examples of deferred and immediate write-amp:

The InnoDB change buffer is deferred IO and CPU. Checking the change buffer and applying changes is deferred CPU. The deferred IO is from reading pages from storage to apply changes.
For a b-tree: page writeback for a b-tree is deferred IO, compression and creating the page checksum are deferred CPU, finding the in-memory copy of a page is immediate CPU, reading the page on a cache miss is immediate IO.
An LSM insert has immediate/deferred IO/CPU.

Immediate CPU - key comparisons for memtable insert
Immediate IO - redo log write
Deferred IO - reading uncached pages for input SSTs and writing output SSTs during compaction
Deferred CPU - decompression, compression and key comparisons while merging input SSTs into output SSTs during compaction. Note that compaction does a merge, not a sort or merge+sort.

Space Amplification

Space-amp is the size of the database files versus the size of the data, or the ratio of the physical to logical database size. An estimate for the logical size is the size of the uncompressed database dump with some adjustment if secondary indexes are used. The space-amp is reduced by compression. It is increased by fragmentation in a b-tree and uncompacted data in an LSM.

It is best to measure this after the DBMS has reached a steady state to include the impact of fragmentation and uncompacted data.

Cache Amplification

I briefly described cache-amp in this post. The cache-amp describes memory efficiency. It represents the minimal database : memory ratio such that a point query requires at most X storage reads. A DBMS with cache-amp=10 (C=10) needs 10 times more memory than one with C=100 to satisfy the at most X reads constraint.

It can be more complicated to consider cache-amp for range seek and range next because processing them is more complicated for an LSM or index+log algorithm. Therefore I usually limit this to point queries.

For a few years I limited this to X=1 (at most 1 storage read). But it will be interesting to consider X=2 or 3. With X=1:

For a b-tree all but the leaf level must be in cache
For an LSM the cache must include all bloom filter and index blocks, all data blocks but the max level
For an index+log approach it depends (wait for another blog post)

Small Datum