
Sunday, August 16, 2020

Review of PrismDB: Read-aware LSM Tree for Heterogeneous Storage

The PrismDB paper explains how to use heterogeneous storage with an LSM. This is my review of the paper and I don't review papers that aren't interesting and worth reading. I might be biased given that I worked on RocksDB and MyRocks and helped with the persistent read cache for tiered storage in RocksDB, for which there is an interesting paper. For the rest of this post I use the word tiered instead of heterogeneous as that is easier to type.

The paper is about one of my favorite topics, efficient performance, where the goal is to get good enough performance and then optimize for efficiency. In this case the goal is to use mostly lower-perf, lower-cost storage (QLC NAND flash) with some higher-perf, higher-cost storage (NVM, SLC/TLC NAND). The simple solution is to use NVM for the top levels (L0, L1), TLC for the middle levels and QLC for the max level of the LSM tree. Alas, that doesn't always work out great as the paper shows.

An LSM tree with leveled compaction organizes keys by write recency. When n < m then a key/value pair in Ln is likely to have been written more recently than a key in Lm. But it isn't clear that location in the LSM tree suggests something about read recency or frequency. Thus the simple tiered storage approach mentioned in the previous paragraph might not help much for reads. 

Paper summary

The idea in the paper is to pin hot key/value pairs higher in the LSM tree and then use higher-perf, higher-cost storage for those levels. Thus, storage read latency for such data will be reduced. This is implemented via three components: tracker, mapper, placer.

  • The tracker identifies keys that are frequently accessed. This is done via CLOCK with 2 bits/key and implemented via a concurrent hash map (see the sketch after this list).
  • The mapper decides which keys are hot. The mapper can achieve the goal of at most X% of keys on a given level getting pinned. Given such a constraint the mapper will determine which keys can get pinned in that level. The mapper uses the information from the concurrent hash map managed by the tracker.
  • The placer implements pinned compaction, explained below, to implement the decisions made by the mapper.
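
Here is a minimal sketch of the tracker idea in Python. It is single-threaded for clarity, the class and method names are mine, and the paper's implementation uses a concurrent hash map, so treat this as an illustration rather than what PrismDB does.

class ClockTracker:
    """Approximate access frequency via a 2-bit counter per key."""
    MAX = 3  # 2 bits -> counter values 0..3

    def __init__(self):
        self.counters = {}  # key -> counter
        self.hand = 0       # clock hand position

    def access(self, key):
        # saturating increment on each read
        self.counters[key] = min(self.counters.get(key, 0) + 1, self.MAX)

    def decay(self, steps):
        # sweep the clock hand so cold keys fade toward 0
        keys = list(self.counters)
        for _ in range(min(steps, len(keys))):
            k = keys[self.hand % len(keys)]
            if self.counters[k] > 0:
                self.counters[k] -= 1
            self.hand += 1

    def hot_keys(self, threshold=2):
        # the mapper would consume something like this
        return [k for k, c in self.counters.items() if c >= threshold]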

Pinned compaction is a new compaction strategy. For leveled compaction from Ln to Ln+1 pinned compaction will keep keys on Ln when such keys are listed for pinning on that level. Pinned compaction will also pull up keys from Ln+1 to Ln for the same reason.

Pinned compaction benefits from weighted scores that determine which SSTs are most likely to benefit from pinned compaction (have keys that should be pinned). The scores determine which SSTs to use as input for the next round of pinned compaction.

Access distributions

My biggest concern about the paper is whether production access patterns will benefit from this approach. Academic work here has been hampered because not enough has been published about production access distributions. It is hard to work on cache management without knowing the access distributions.

The paper has two claims on this point:

  1. Objects stored in the higher levels of the LSM are read and updated much more often than objects stored at the bottom layers.
  2. In addition, since they are updated much less frequently, these lower levels can meet the lower endurance requirements of cheaper flash storage.
I agree that objects in the higher levels are likely to be updated more often. I am less certain that this extends to reads. Most of the benchmarks from the paper use YCSB with Zipfian and the claims will definitely be true in that case. The claim about reads is more likely to be true for workloads with many range reads especially when prefix bloom filters cannot be used.

For point #2, my experience with production and benchmark workloads has been that per-level compaction write rates are similar (MB/s written to storage per level). But this can be true without contradicting claim 2 above as the larger levels have more objects. Regardless, the per-level write rates must be considered to make sure a device will meet the desired lifetime.

By similar write rates I mean they tend to be within a factor of 2 and here is an example from Linkbench -- see the Write(GB) column.

Deployment

I know that conference papers don't have to solve my production concerns, but I want to share more background on the topic.

How can I deploy this in production? If I am a cloud customer then my storage form factors are different from on-prem. For AWS I can choose from ephemeral, EBS disk, EBS SSD, EBS SSD with PIOPs and S3. I probably don't have to worry about SSD endurance in the cloud but if my workload does many writes then I have to pay for that except when using ephemeral storage. In the cloud there are still different price & performance tradeoffs and I assume work like PrismDB is relevant there. 

For on-prem it isn't easy to get multiple storage devices (NVM & TLC & QLC) into one server as web-scale operators try to reduce the number of items on the menu. Even if they allow this the question is whether I can get something with small amounts of NVM per server. Disaggregated storage might solve the small amounts of storage per server problem but with disagg I have to worry about new failure modes when all servers in a rack are sharing access to the same NVM device. I think the most likely solution is for storage vendors to provide a hybrid device that includes some NVM with more MLC/TLC NAND and a lot of QLC NAND (they already do this to some extent).

Assorted comments

  • Does the mapper create a list of keys (to be pinned) per level or is there a global list? 
  • Does pinned compaction risk creating SSTs with a small number of keys? I'd rather not have an LSM tree with too many small SSTs.
  • Do the decisions about pinning avoid the double caching problem? I am not sure I need to pin objects that are likely to be in the block cache.
  • Is the CPU overhead of searching the to-be-pinned data structure significant? This might depend on whether pinned compaction is the only type of compaction run. The paper suggests regular compaction is also run. 
  • For the two points above, using a bloom filter per level for the keys to be pinned might help.
  • Is there a mechanism for unpin? If regular compaction is still run then is that how pinned keys get unpinned? Or does recomputing the to-be-pinned list over time serve to unpin keys that used to be hot and are now cold?
  • Logical backup and long scans for batch queries are things that can interfere with the computation of the to-be-pinned list. In MyRocks hints can be used as part of a SELECT statement to suggest that data for the query should not be pulled into the block cache.


Thursday, March 12, 2020

Tuning space and write amplification to minimize cost

This post uses math to show how to tune space and write amplification to minimize storage costs for an index structure that uses index+log. The result, minimal cost, is true assuming my model is true. But at this point I will only claim that the model is truthy. I look forward to more results in this area from the smart people at DASlab and elsewhere. I wonder if database economics is a good name for this topic.

I explained the index+log index structure here and here and my truthy model assumes that write and space amplification are functions of PctUsed - the percent of available storage that is used by the index structure. The model is:
  • Space amplification = 100 / PctUsed
  • Write amplification = 100 / (100 - PctUsed)
In what follows I use SpaceAmp and WriteAmp for space and write amplification. When PctUsed is X then the remaining space on the storage device (100-X) is free, not used by anything. The formulas mean that a deployment can trade between SpaceAmp and WriteAmp by adjusting the value of PctUsed. When PctUsed is 80 the values of SpaceAmp and WriteAmp are 1.25 and 5. When PctUsed is 20 the values of SpaceAmp and WriteAmp are 5 and 1.25.

Math

Back in the day hardware was scarce and there was much math in systems papers. While there isn't as much math today the rush to ML means that more people are learning applied math (me included) which is a good thing. I regret not learning enough applied math while in college.

Here I derive a formula for the cost of storage in terms of the database size and the IO required to do compaction (GC, garbage collection) for an index structure that uses the index+log approach. The cost is a function of PctUsed.

Assume:
P = PctUsed
S = index structure size in GB
G = Cost/GB
N = Write rate to index structure in MB/s
f = value between 1 and 2, 1= no storage reads by GC, 2= all GC writes do storage reads first
I = Cost/IOPs for 4KB operations

Cost = Cost_space + Cost_io

# Formulas for the cost of space and IOPs
# 256 converts MB/s into 4KB IOPs

Cost_space = S * G * 100 * P^-1
Cost_io = N * 256 * f * I * 100 * (100-P)^-1

# Determine where Cost' = 0 to find the minimal cost
Cost' = Cost_space' + Cost_io'

Cost_space' = -1 * S * G * 100 * P^-2
Cost_io' = N * 256 * f * I * 100 * (100-P)^-2 * -1 * -1
Cost' = (-1 * S * G * 100 * P^-2) + (N * 256 * f * I * 100 * (100-P)^-2)
# And Cost' = 0 when
S * G * 100 * P^-2 = N * 256 * f * I * 100 * (100-P)^-2
# Skipping a few steps this reduces to
P^2 * ((256NfI/SG) - 1) + 200P - 10,000 = 0

# This can be solved by the quadratic equation with
# a=((256NfI/SG) - 1), b=200, c=-10,000
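
The quadratic is easy to solve numerically. Here is a sketch in Python where the values for S, G, f and I are placeholders I chose, not the ones from the spreadsheet below; with these placeholders the minima land at PctUsed = 67, 39 and 17 for N = 1, 10 and 100.

import math

def min_cost_pct_used(S, G, N, f, I):
    # solve P^2 * ((256NfI/SG) - 1) + 200P - 10,000 = 0 for 0 < P < 100
    a = (256.0 * N * f * I) / (S * G) - 1.0
    b, c = 200.0, -10000.0
    if abs(a) < 1e-12:  # degenerate case: the equation is linear
        return -c / b
    d = math.sqrt(b * b - 4.0 * a * c)
    return next(p for p in ((-b + d) / (2 * a), (-b - d) / (2 * a)) if 0 < p < 100)

# placeholder prices: 1 TB index, $0.10/GB, $0.065 per provisioned IOP, f=1.5
for N in (1, 10, 100):
    print(N, round(min_cost_pct_used(S=1024, G=0.10, N=N, f=1.5, I=0.065)))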

Graphs

Now I solve the equation above to determine the value of PctUsed that minimizes cost with prices from EBS provisioned IOPs. A Google Sheets spreadsheet with the solution is here. For the spreadsheet:
  • The graph uses log-scale for the y-axis and the y-axis doesn't start at 0. This makes it easier to see the impact of changing PctUsed, but can also be misleading. 
  • The solutions from the quadratic equation are quad1 and quad2
  • Cost is computed for PctUsed in (5, 10, 15, ..., 85, 90, 95)
  • The minimal value for Cost (quad1) is likely to be between these values
I then solve for 3 cases: N=1, 10 and 100 where N is the write rate to the index structure in MB/s. The minimal cost occurs at PctUsed = 67, 39 and 17 for N = 1, 10 and 100.

For N=1, a low write rate, the minimal cost is at PctUsed=67


For N=10, a moderate write rate, the minimal cost is at PctUsed=39


For N=100, an extreme write rate, the minimal cost is at PctUsed=17


Wednesday, January 1, 2020

From disk to flashcache to flash

The past decade in database storage was interesting whether you stayed with local attach storage, used block & object storage from cloud or on-prem vendors or moved to OSS scale-out storage like Ceph, GlusterFS and MinIO. I am writing about my experience and will focus on local attach.

Over the past decade the DBMS deployments I cared for went from disk to flashcache to flash on the HW side and then from MySQL+InnoDB to MySQL+MyRocks on the SW side. I assume that HW changes faster than DBMS software. DBMS algorithms that can adapt to such changes will do better in the next decade.

One comment I have heard a few too many times is that storage performance doesn't matter much because you can fit the database in RAM. More recently I hear the same except change RAM to Optane. I agree that this can be done for many workloads. I am less certain that it should be done for many workloads. That (all data in RAM/Optane) costs a lot in money, power and even space in the data center. Let's make web-scale DBMS green. Use enough RAM/Optane for cache to meet the performance SLA and then use SSD or disk arrays. At some point there is no return from cutting the DBMS query response time in half but cutting the DBMS HW cost in half is usually a big deal.

Priorities

With disk and flashcache I worried a lot about the IOPs demand because the supply was limited, usually less than 2000 operations/second. On moving to flash I stopped worrying about that and began worrying about write and space amplification (efficiency).

The context for this is small data (OLTP) workloads and deployments where reducing HW cost matters.  Overcoming the IOPs shortage was interesting at times and good for my career as there were always new problems that had to be fixed right now. Moving to flash made life easier for everyone. There was an opportunity cost from using disks -- time spent solving the latest IOPs demand crisis was time not spent on longer term projects. Moving to flash gave us time to build and deploy MyRocks.

MyRocks has better space and write efficiency than a b-tree. The cost of better space and write efficiency with an LSM is more CPU overhead for point and range queries. Sometimes that is a great trade. Better space and write efficiency means you buy less SSD and it lasts longer. Better write efficiency is a big deal with lower endurance (TLC and QLC) NAND flash. I wonder how this changes in the cloud. Cloud vendors might improve their profit margins with better space and write efficiency but they also have the ability to pass on some of the inefficiency costs to the user. A cloud user doesn't have to worry as much about write efficiency because they are renting the SSD.

Hardware

This is my history with storage for web-scale MySQL. The NUC servers I use today have similar RAM/CPU as the DBMS servers I started with in 2005 but the NUC servers have much more IO capacity.

First there were disk arrays with HW RAID and SW RAID. This was RAID 10 which was better for durability than availability. Data isn't lost on a single-disk failure but the server performance is unacceptable when a HW RAID cache battery fails (fsync is too slow) or a rebuild is in progress after a disk gets replaced.

Then there was flashcache and performance was wonderful when the read working set fit in the flash cache but there was an abrupt change in performance when it did not. Those were exciting years. Some of the performance critical parts of flashcache were in the Linux kernel. I lack kernel skills and it took us (really, Domas) a while to identify perf problems that were eventually fixed.

Then there was flash and the abundance of IOPs was wonderful. I look forward to the next decade.

Anecdotes

If you use disk arrays at scale then you will see corruption at scale. You are likely using multiple storage devices with multiple firmware revisions. It is interesting when 99% of corruption occurs on 1% of the deployment -- all on the same, new firmware revision. That result makes it easy to focus on the HW as the probable problem and stop blaming MySQL. I can't imagine doing web-scale DBMS without per-page checksums.

Performance variance with NAND flash is to be expected. I hope that more is done to explain and document it for OLTP workloads. The common problem is that NAND flash GC can stall reads on the device. I wish it were easier to characterize device performance for enthusiasts like myself. I am sure there is an interesting paper on this topic. How much concurrency does the device provide? How are writes done? How is GC done? What is the stall distribution? What can be done to reduce stalls (see multistream and LightNVM)?

Using TRIM (mount FS with discard) at scale is exciting. RocksDB and MyRocks do a lot of TRIM while InnoDB does not. How many GB/s and unlink/s of TRIM does the device support? TRIM performance varies greatly by vendor. I hope more is done to document these differences. Perhaps we need trimbench. People at web-scale companies have stories that never get shared because they don't want to throw their SSD vendors under the bus. I was spoiled by FusionIO. My memory is that FusionIO TRIM was a noop from a perf perspective.

Innosim is an InnoDB IO simulator that I wrote to help device vendors reproduce performance stalls we encountered with web-scale MySQL. It is easier to run than MySQL while able to generate similar IO patterns. I wrote it because InnoDB has a pattern of coordinated IO that fio wasn't able to reproduce. The pattern occurs during page write back -- first write the double write buffer (1MB or 2MB sequential write) and then do 64 or 128 random 16kb writes. Innosim also takes much less time to reach steady state -- just sequentially write out X GB of database pages versus load InnoDB and then run (for days) an update workload to fragment the indexes. Fragmentation takes time. I wish more DBMS benchmarks ran long enough to get sufficient fragmentation but that can be expensive.
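
As an illustration, here is a rough Python sketch of that coordinated write pattern -- one sequential 1MB doublewrite-buffer write, an fsync, then 64 random in-place 16kb page writes. The file name and sizes are made up, this is POSIX-only and innosim itself does much more.

import os, random

PAGE = 16 * 1024
DBLWR = 64 * PAGE             # 1MB doublewrite buffer at the start of the file
FILE_SIZE = 4 * 1024**3       # hypothetical 4GB database file

fd = os.open("testfile", os.O_RDWR | os.O_CREAT)
os.ftruncate(fd, FILE_SIZE)
page = os.urandom(PAGE)

def write_batch():
    # 1) sequential write of the doublewrite buffer, then force it to media
    os.pwrite(fd, page * 64, 0)
    os.fsync(fd)
    # 2) 64 random in-place page writes across the rest of the file
    for _ in range(64):
        page_no = random.randrange(DBLWR // PAGE, FILE_SIZE // PAGE)
        os.pwrite(fd, page, page_no * PAGE)
    os.fsync(fd)

for _ in range(100):
    write_batch()
os.close(fd)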

Perhaps one day I will write WTsim, the WiredTiger IO simulator. I wrote ldbsim, the LevelDB IO simulator, but it was rarely used because the RocksDB benchmark client, db_bench, was easy to use even if fragmenting the LSM tree still took a long time. I am not sure that fio would be able to reproduce the special IO patterns created by RocksDB compaction. I love fio but I am not sure it should try to solve this problem for me.

Monday, November 25, 2019

Throttling writes: LSM vs B-Tree

Reducing response time variance is important for some workloads. This post explains sources of variance for workloads with high write rates when the index structure is an LSM or a B-Tree. I previously wrote about this in my post on durability debt.

Short summary:
  1. For a given write rate stalls are more likely with a B-Tree than an LSM
  2. Many RocksDB write stalls can be avoided via configuration
  3. Write stalls with a B-Tree are smaller but more frequent versus an LSM
  4. Write stalls are more likely when the redo log isn't forced on commit
  5. The worst case difference between an LSM and B-Tree is larger when the working set isn't cached
  6. Life is easier but more expensive when the working set fits in cache
  7. Less write amplification saves IO for other uses
Less short summary:
  1. Write stalls for an LSM occur when compaction has trouble keeping up with the incoming write rate. The worst stalls occur at write rates that a B-Tree could not sustain. One way to mitigate stalls is to reduce the write rate. Another way is to use an index structure that doesn't support or is inefficient for range scans (see index+log).
  2. The cost from configuring RocksDB to avoid write stalls is more CPU overhead on reads as there will be more data in the upper levels of the LSM. I am partly to blame for the default configuration in RocksDB that throttles writes when the LSM tree gets too much data in the L0, L1 and L2. But that configuration can be changed.
  3. SQLite4 has a clever LSM designed for systems that don't allow background threads. It implements a pay as you go approach to durability debt. A traditional LSM takes the opposite approach - it defers the IO cost to the background. RocksDB has optional write throttling and work has been done to smooth the impact from it but the problem is not solved. A B-Tree in the worst-case (buffer pool full & mostly dirty, working set not cached) also implements a pay as you go approach.
  4. I almost always disable sync-on-commit for benchmarks because I want to observe how the DBMS behaves under stress and less commit latency means more writes/second and more IO stress.
  5. See item #6 where I argue that it is good to not have the working set cached.
  6. A common rule of thumb has been to keep all indexes in cache or all of the working set in cache. That simplifies tuning and makes it easier to avoid performance problems. But that also might force a deployment to use 2X more HW than it needs because NAND flash SSDs are everywhere and the response time difference between reading from RAM and reading from NAND flash might not matter for many applications. But if you are using a DBMS in the cloud that charges by the IO, then keeping the working set in RAM might be a good idea.
  7. An LSM usually has less write-amp than a B-Tree. So the IO capacity it saves from that can be used elsewhere to support more read or write transactions.
Worst case behavior

I am wary of faster is better. I prefer nuance but I also know that people don't have time to read long blog posts like this or long performance reports. Here I explain worst case behavior in terms of IO overheads. Worst case behavior isn't the only way to judge an index structure but it helps me to explain performance. Another way is to measure the average amount of IO per transaction (in operations and KB) and treat IO efficiency as important.

I describe worst case behavior for a write operation under a few scenarios. By worst case I mean the largest amount of IO done in the foreground (the thread handling the write) as that determines the response time. I ignore the work done in the background which favors an LSM because that defers more work to the background. For a B-Tree I ignore undo and page splits. The write is a SQL update which is read-modify-write, as opposed to a blind-write like a Put with RocksDB. Finally, I assume the update isn't to an indexed column. The scenarios are:
  1. Cached, PK only - working set cached, PK index only
  2. Not cached, PK only - working set not cached, PK index only
  3. Cached, PK and secondary index - working set cached, PK and non-unique secondary index
  4. Not cached, PK and secondary index - working set not cached, PK and non-unique secondary index 
PK only

For the cached, PK only scenario neither an LSM nor a B-Tree do IO in the foreground with the exception of the redo log fsync. Stalls are unlikely for both but more likely with a B-Tree especially when the DBMS storage uses a spinning disk.
  • An LSM writes the redo log buffer, optionally syncs the redo log and then does an insert into the memtable. Both memtable flush and Ln:Ln+1 compaction are deferred to background threads. If memtable flush were too slow then there are write stalls until flush catches up to avoid too many memtables wasting memory.
  • A B-Tree modifies a page in the buffer pool, writes the redo log buffer and optionally syncs the redo log. If checkpoint were too slow a full redo log can't be rotated until checkpoint catches up and there are write stalls.
For the not cached, PK only scenario the work done in the foreground is 1 IO/update for an LSM and 2 IO/update for a B-Tree. Here a B-Tree uses a pay as you go model.
  • An LSM reads a page into the block cache and then repeats the work described in cached, PK only
  • A B-Tree finds a dirty page to evict, writes that page back to storage, then reads the desired page into that slot in the buffer pool and repeats the work described in cached, PK only.

PK and secondary index

For the cached, PK and secondary index scenario there is approximately twice as much work to be done per update compared to the cached, PK only scenario. Thus stalls are more likely here. But other than the optional redo fsync there is no foreground IO for the LSM and B-Tree.
  • An LSM repeats the work explained in the cached, PK only scenario. For the secondary index it does an additional insert to the memtable which is also logged as redo. This can double the demand for compaction.
  • A B-Tree repeats the work explained in the cached, PK only scenario. For the secondary index it makes an additional page dirty in the buffer pool. This can double the demand for page write back.
For the not cached, PK and secondary index scenario the foreground IO difference between an LSM and B-Tree is more significant -- 1 IO for the LSM vs 4 IO for the B-Tree -- ignoring the redo log overhead. The IO difference is reduced from 1:4 to approximately 1:2 for a B-Tree like InnoDB that implements a change buffer.
  • An LSM does the union of the work described in not cached, PK only and cached, PK and secondary index scenarios. Ignoring the optional redo fsync the cost is 1 read IO for the PK index and no reads for the secondary index because non-unique secondary index maintenance is read-free.
  • A B-Tree repeats the work explained in the cached, PK only scenario but this is done for both the PK and secondary indexes. Thus the cost is 2 IOs to write back dirty pages and then 2 IOs to read pages from the PK and secondary indexes into the buffer pool and then make them dirty -- which then requires redo log writes. So the cost for this is 4 IOs ignoring the redo log.

Make writes fast: LSM

Writes can be fast with an LSM because most of the IO cost is deferred but that also increases the need to throttle writes. Life is good as long as that deferred cost can be repaid fast enough, otherwise there will be more response time variance.

Flush and compaction are the deferred cost for an LSM write. Flush means writing the memtable to an SST on storage. Compaction means merging SSTs to move flushed data from the root to leaf of the LSM tree. Compaction costs more than flush. RocksDB can stall writes when compaction doesn't keep up with ingest. Ingest creates durability debt, compaction reduces it and write stalls are there to bound the debt. Write stalls are enabled by default but can be disabled by configuration. Putting a bound on durability debt also puts a bound on read latency by reducing the number of SSTs that can exist in the L0, L1 and L2. So if you want to support extremely high write rates then choose one of: read stalls, write stalls.

Make writes fast: B-Tree

Writes can also be fast with a B-Tree as there are no page reads/writes to/from storage when the working set is cached and background page write back is fast enough. In that case the only IO work in the foreground is the optional redo log fsync.

Page write back is the primary deferred cost for a B-Tree write. Most of my B-Tree experience is with InnoDB which does fuzzy checkpoint. The goal is to flush dirty pages before the current redo log segment gets full. Using larger redo log segments lets InnoDB defer write back for a longer time increasing the chance that more transactions will modify the page -- reducing write amplification and helping performance.

Purge can be an additional deferred cost for a B-Tree write. I use the InnoDB name here as Postgres calls this vacuum. This is the process of reclaiming space from deleted rows that are no longer visible by open MVCC snapshots. The LSM equivalent of purge is checking the open snapshot list during compaction for KV pairs that are not the latest version of a given key to determine whether that version is still needed.

When write back and purge are fast enough then write stalls should be infrequent with a B-Tree. But write back isn't always fast enough. A B-Tree write stall occurs when a write transaction must read a page into the buffer pool prior to modifying that page but 1) the buffer pool is full and 2) write back must be done for a dirty page before the memory can be reused.

Other

A few other comments that didn't have a place above:
  • In this post I assume the B-Tree uses no-force, but there is at least one nice OSS B-Tree that uses force.
  • Making commit slower is another way to throttle writes and reduce the chance of stalled writes. Examples of this include redo log fsync, semisync or synchronous replication.
  • The InnoDB change buffer is a wonderful feature that reduces the IO overhead for write-heavy workloads.
  • NAND flash GC stalls are another source of write stalls. I wish more were written about this topic.
  • Stalls during TRIM when using an LSM with NAND flash are another source of stalls. I wish there were more TRIM benchmarks. Smart friends tell me that NAND flash devices vary widely in their ability to handle TRIM. And they display different stall behavior when their TRIM capacity has been exceeded. Some of us were spoiled by FusionIO.

Wednesday, November 20, 2019

The delta+main index structure

A reader of my LSM for analytics blog post suggested I read Fast Scans on Key-Value Stores from VLDB 2017. It is an interesting paper that was part of the effort to build the Tell DBMS and a great PhD thesis.

Tell might not be an answer for my problem as this was an in-memory DBMS and might have write amplification that is too high for my needs. Regardless, the paper is worth reading and discussing.

Index structure drama

I don't publish reviews for papers that aren't worth reading so keep that in mind for any criticism that follows. The index structure world tends to be low on drama, at least while waiting for the next round of learned index papers.

The big questions for me are:
  1. Is delta+main a new index structure?
  2. Did SAP Hana pioneer delta+main?
Hana and delta+main

I will start with question 2. The paper states that delta+main was pioneered by SAP Hana (see Hana paper SIGMOD Record from 2012) and then cites a SIGMOD 2015 paper about the Analytics in Motion project. I am not an expert in this area but I suspect that C-Store/Vertica (see VLDB 2005 paper) was another pioneer in this space. 

I started to browse the cited papers. There are too many for me to read or even cite including Fast Updates from VLDB 2012 and Positional Update Handling in Column Stores from SIGMOD 2010. The earliest online paper in this space might be Differential Files from ACM TODS in 1976 and that paper cites even earlier work -- delta+main is great for data stored on tape.

Delta+main index structure

At this point I am only talking about TellStore-Col. I would classify TellStore-Log as index+log.

I am not sure that delta+main is a new index structure. It might be an LSM variant that I have called memtable+L1 where delta is the memtable and main is the L1. Or perhaps it is memtable+L0+L1 where delta is memtable and the L0 while main is the L1. I encountered memtable+L1 systems at least twice. Once at work and then again with Sophia. This is a great approach when database : RAM ratios aren't larger than 10.

However, the index structures used in Tell are more complicated than an LSM. They do clever things to make scans faster, things that you might not do with an LSM -- some fields on data pages aren't write once.

So I won't update the index structures definition just yet, but that doesn't take away from the paper.

Back to the paper

Tell is an in-memory DBMS with three index structures -- TellStore-Log, TellStore-Col and TellStore-Row. I ignore TellStore-Row. Tell supports mixed workloads -- let's call this HTAP for now -- and must be efficient for scans. It solves some of the problems I described in my LSM for Analytics post.

The paper only describes the use of a hash index but cites another Tell paper that explains how a B+Tree can be used. I assume the hash index is only (mostly) for primary key indexes.

TellStore-Log

TellStore-Log is index+log with a hash table for the index. Rows in the log are self-contained. The hash index doesn't have to be checked to determine whether a row in the log is live or visible to a transaction. Thus GC and scans don't have to probe the index which saves CPU overhead.

Rows in the log have fields for valid-from, valid-to and previous. The index points to the latest version of a key and then previous versions can be found by following the previous field. The valid-from and valid-to fields are for MVCC visibility. As part of an update the valid-to field of the previously live version of the row is updated. This confused me until I remembered that this is an in-memory DBMS, the log is in memory and it is possible to update that field in a previously written log page. But this also means that the append-only log isn't write once because the valid-to field can be updated long after the append.

Each DBMS table uses a separate log to improve scan efficiency. TellStore-Log also does clever things to piggyback GC with scans.

TellStore-Col

TellStore-Col is delta+main. It uses two logs for delta (update-log and insert-log, delete is an update) and then PAX for the main store. There is still one hash index that points to rows in update-log, insert-log and the main store.

Row visibility can be treated as a predicate to evaluate. Assume there are N rows on a page and K predicates. There can be K bitmaps, one per predicate, to indicate matching rows and then one more bitmap to indicate the rows that are visible.

Multiple versions of a row can be in the main store and would be adjacent in a page. There is also a newest pointer per key in the main store that can point into the update-log. The newest pointer would get changed each time a row with that key is appended to the update-log. Rows in the main store also have valid-from and valid-to fields to determine MVCC visibility. When the latest version of a key is in the main store and the key gets updated (appended to the update-log) then the valid-to field in the main store page would also get updated. I assume these changes to main store pages are done in place given that this is an in-memory DBMS.

Multiple versions of a key in the update log are linked via the previous field. But the oldest version of a key in the update log does not point to versions in the main store.

Eventually GC is done for pages in the main store and that uses copy-on-write in contrast to the changes mentioned above that are done in place.

The structure of pages in the main store support efficient scans. Keys that are not visible per MVCC can be excluded via the valid-from and valid-to fields. Keys that have the visible version in the update-log can be found by following the newest pointer from the main store page to the update-log. There is no need to do a merge, like an LSM requires, between rows from the main store and update-log. There is no need to probe the update-log for every key read from the main store. There is no need to scan the update-log as it gets probed on demand.

A scan would also have to read from the insert-log. But such rows are new, so there is no need to merge that data with another data structure.


Friday, November 15, 2019

The joy of database configuration

I am wary of papers with performance results for too many products. Too many means including results from systems for which you lack expertise. Wary means I have less faith in the comparison even when the ideas in the paper are awesome. I have expertise in MySQL, MongoDB, RocksDB, WiredTiger and InnoDB but even for them I have made and acknowledged ridiculous mistakes.

Database configuration is too hard. There are too many options, most of them aren't significant and the approach is bottom-up. I am an expert on this -- in addition to years of tuning I have added more than a few options to RocksDB and MySQL.

This post was motivated by PostgreSQL. I want to run the insert benchmark for it and need a good configuration. I have nothing against PG with the exception of a few too many why not Postgres comments. The community is strong, docs are great and the product is still improving. But I think PostgreSQL configuration has room to improve -- just like RocksDB (here, here) and MySQL/InnoDB.

Too many options

A non-expert user lacks both the ability to choose good values for options and the ability to understand which options might be useful to set. My solution to too many options and most aren't significant is to use good defaults and split the option name space into two parts -- regular and expert. Regular options are set by most users because they matter for performance and don't have good default values. The amount of memory the DBMS can use is one such option - the default will be small.

Everything else is an expert option. These include options for which the default is great and options that rarely impact performance. There is a reason for expert options -- some workloads benefit from their existence and being able to set that option at runtime might avoid downtime. Options are also added early in the lifecycle of new features to allow developers to evaluate the new feature and choose good default values. But such options don't need to be exposed to all users.

The benefit from doing this is to avoid presenting a new user with tens or hundreds of options to consider. That is a lousy experience. And while X is too hard isn't always a valid complaint -- languages (human and database query) are complex because they let us express complex ideas -- I don't think we gain much from the current approach.

RocksDB has added functions that simplify configuration and even split the option namespace into two parts -- regular and advanced. This is a step in the right direction but I hope for more. I confirmed that most RocksDB options either have good defaults or aren't significant for my workloads and then published advice on tuning RocksDB.

The performance configurations I use for MongoDB/WiredTiger and MySQL/InnoDB are similar to my experience with RocksDB. I don't have to set too many options to get great performance. Alas, it took a long time to figure that out.

Top-down configuration

Top-down configuration is another approach that can help. The idea is simple - tell the DBMS about the hardware it can use and optionally state a few constraints.

The basic hardware configuration is empty which implies the DBMS gets everything it can find -- all memory, all CPU cores, all IO capacity. When a host does more than run a DBMS it should be easy to enforce that limit with one option for memory consumption, one for CPU, etc. The user shouldn't have to set ten options for ten different memory consumers. It is even worse when these limits are per instance -- limiting how much memory each sort buffer gets is a lousy way to manage total memory usage. IO capacity is interesting. AFAIK there was a tool included in RethinkDB that characterized IO capacity, PostgreSQL has a tool for fsync performance and we can't forget fio. But it is easy to be misled about SSD performance.

The constraints cover things that are subjective. What is the max recovery time objective? How do you rank read, write, space and memory efficiency?

A great example of this is SQL Memory Management in Oracle 9i -- tell the DBMS how much memory it can use and let it figure out the best way to use it.

What about ML

I hope that ML makes it easier to discover the options that aren't significant and can be moved into the expert options namespace. But I prefer a solution with fewer tuning knobs, or at least fewer visible tuning knobs. I hope to avoid too many knobs (status quo) combined with ML. Let's make smarter database algorithms. If nothing else this should be a source of research funding, interesting PhDs and many papers worth reading.

Update

While I appreciate that someone made the MySQL memory calculator available I wish this weren't needed. Setting memory limits based on peak concurrency means you will under-allocate memory in the normal case or instead you can over-allocate at peak concurrency and get OOM.

Friday, November 8, 2019

Jungle - LSM plus copy-on-write B-Tree

This is a review of Jungle which is an LSM variant that uses a copy-on-write (CoW) B-Tree internally. One of the Jungle developers previously invented ForestDB. I am a fan of his work.

At a high level Jungle is an LSM with leveled compaction but thanks to the CoW B-Tree it has different read, write and space amplification tradeoffs. I don't understand Jungle well enough to explain where it is better, but I am happy to share this review and accept corrections. My summary of Jungle is:
  • One sorted run per level
  • The per level file structure uses an index+log approach where the index is a CoW B-Tree, values are appended to the value log and the B-Tree entries point into the log. There is also a bloom filter.
  • Inter-level merge does compaction between adjacent levels. I think this is some-to-some as some data is moved from Ln to Ln+1 in a batch, values are appended to the value log, keys are inserted into the B-Tree and modified B-Tree pages are persisted by appending to the end of the B-Tree files. Because of CoW the path from leaf to root is made dirty when a leaf page is modified.
  • In-place merge does GC within a level to reclaim space from the B-Tree files and value log. The B-Tree is scanned in order to write a new B-Tree and new value log. Space is wasted because updates were appended to the end of the B-Tree file and value log.
Purpose

The index+log approach is used to reduce write amplification from large values. The per-level write-amp from moving a KV pair from Ln to Ln+1 is the sum of the write-amp from adding keys to the B-Tree and from appending to the end of the value log. 

Assuming a batch of KV pairs is inserted then write-amp for the value log is minimal -- close to 1. If 32 1kb values are moved then 32kb is written. I am uncertain about the average and worst case write-amp for the B-Tree even when I only consider write-amp for the leaf pages and ignore the non-leaf pages. For the worst-case assume that each key makes a leaf page dirty.  Then for each KV pair with a small key and 1kb value there is 4kb + 1kb written (4kb for B-Tree, 1kb for value log) and the per-level write-amp is ~5. That is a good worst-case. I frequently see per-level write-amp of ~5 for production workloads with RocksDB so I wonder what the average case will be for Jungle.

There is additional write-amp from doing periodic in-place merges to reclaim space. I won't try to estimate the impact from that.

Comments
  • The CoW B-Tree in Jungle is CoW-S because writes are appended and GC must eventually be done.
  • While ordering values in RocksDB has a cost, more write-amp, it also has a benefit, less cache-amp. RocksDB needs a pointer (index entry) in memory per block to achieve a worst-case of ~1 disk read per point query -- see the RocksDB data block index. With index+log the values are not in key order and this needs a pointer (index entry) in memory per KV pair. Assuming these pointers are ~8 bytes there is a huge difference in memory overhead between 8 bytes / database page and 8 bytes per KV pair assuming KV pairs are not too big. Per the CRUM Conjecture it is hard to be better in all dimensions -- read, write, disk space and memory. 
  • It will be hard to do compression for the value log if only a few values are appended at a time. But if each inter-level merge adds many 4kb pages worth of data to the value log then this isn't a problem.
  • Range scans are more expensive with index+log because values are not in key order and each value might require a storage IO and the CPU overhead for decompression. This should only be a problem for the largest level of the LSM tree.

Thursday, November 7, 2019

Revisiting Kudu

I read the Kudu paper many years ago. It is worth reading but I never published my review so I re-read the paper this week. My review is limited to the paper and I ignore improvements since then. My focus is on storage so I also ignore most of the distributed system features.

I don't have access to my old review of the paper when I read it years ago. I might have considered it to be an LSM variant. Now I consider it to be an example of an index+log index structure.

Kudu is a scale-out index structure for analytics engines like Impala. The goals for Kudu are fast columnar scans, low latency updates and low performance variance. I assume that Kudu satisfied those goals. The features include:
  • Data is stored in tables and a table has a fixed schema. I am curious about the demand for flexible schemas.
  • It is primary key only. It uses Raft to make sure replicas stay in sync. If global secondary indexes were supported then something like XA across shards would also be needed. That is much harder, but feasible -- see Spanner, CockroachDB and Yugabyte.
  • Supports add/drop column but PK columns can't be dropped. I am not sure whether the PK can be changed.
  • Supports insert, update, delete and range query. It might consider a point query to be a simple range scan. Update and delete must fully specify the PK values for the rows to be changed.
  • Supports range and hash partitioning but that paper section confused me
  • Supports single-row transactions. I assume a batch can be submitted but I am not sure how per-row outcomes are reported to a client.
  • Compaction uses cost-based decisions to merge data where the objective function is to improve read efficiency. I am a big fan of that.

Index+log

I think this is an example of an index+log index structure where the DiskRowSets are log segments. GC is clever and there are two parts to it. First, there is compaction between DeltaMemStore and DiskRowSet. This removes deleted rows and merges long update chains. Second, there is compaction between DiskRowSets. This reduces the number of places that must be checked for a given key range.

In the standard index+log approach only the first type of compaction is done for a log segment. That usually requires all of the index to be in memory because each row from a log segment needs an index probe to determine whether the row is live. For Kudu a search of the DeltaFile determines liveness. It is not clear to me whether DeltaMemFiles are clustered per DiskRowSet to reduce the amount of data that should be in memory when such compaction is done.

I briefly cover the second type of compaction at the end of my blog post. The benefit from merging log segments with index+log is larger sorted runs. Write-optimized index structures impose a CPU and IO read efficiency penalty. Merging log segments like this reduces that penalty.

Storage

Tables are horizontally partitioned into tablets and each tablet consists of a MemRowSet, many DiskRowSets, a DeltaMemStore and many DeltaFiles. Inserts are written into a MemRowSet, when full a MemRowSet is flushed to disk creating a DiskRowSet. A DiskRowSet is limited to ~32MB. Only inserts are directly written into DiskRowSets. Because there is a PK check on insert a row will exist in at most one DiskRowSet.

Deletes and updates are first written into a DeltaMemStore and when full that is flushed to disk to create a DeltaFile. Eventually a DeltaMemStore will be merged with a DiskRowSet so the effects of update and delete reach a DiskRowSet.

Storage is columnar for the rows in a DiskRowSet. Encoding takes advantage of data types to use less space and block compression is optional. This output is written to a sequence of pages for each column. Then there is a B-Tree index that maps ID to page where ID is the ordinal offset of the row within the DiskRowSet (when a DiskRowSet has 1M rows then the ID is from 1 to 1M assuming it starts at 1). Although I am not sure if that index has an entry per page or per row.

There is also a PK index column that has the encoded PK values (encoded because the key is a string even when the PK has more than one column). It wasn't clear to me whether there was an index built on that or if binary search was used. I assume the value in this case is the ID (ordinal offset) for the row. A paged bloom filter is created for the PK index column.

Range scans will start by searching the PK index column to determine the row IDs that must be retrieved and then for each column that must be returned search the per-column indexes using those IDs.

Wednesday, October 30, 2019

USL, universal scalability law, is good to know

The USL is worth understanding. USL is short for universal scalability law and was created by Neil Gunther for use in capacity planning. I don't do much capacity planning but the USL is also valuable for explaining performance. Performance problems for the old InnoDB rw-lock (circa 2007) would have been easy to describe via the USL because "wake all" on unlock is an N^2 overhead -- see the ꞵ parameter in the USL.

A longer version of the USL is here. The USL predicts how throughput will change (increase or decrease) with concurrency. One version of the formula where X(i) is throughput with i concurrent clients is below.

                 X(1) * N
     X(N) =  -------------------
             1 + α(N-1) + ꞵN(N-1)

The denominator determines how throughput changes with concurrency. When it is one then there is linear scaleup. The best way to get linear scaleup is to cheat and choose an easy base case but otherwise the denominator is greater than 1 and the USL explains less than linear scaleup. The components in the denominator represent the overhead from contention and coherency. The alpha term represents contention and the beta term represents coherency.

I write above that "wake all" on unlock has an N^2 overhead. By this I mean that when N threads are trying to get through a synchronization object one at a time and all waiters wake on unlock then there are O(N^2) wakeups -- (N-1) + (N-2) + (N-3) + ... + 3 + 2 + 1. The beta term in the denominator represents the coherency overhead and has an N^2 term as ꞵN(N-1) = ꞵN^2 - ꞵN.

Amdahl's Law

The long post states that when ꞵ=0 and X(1)=1 then the USL is reduced to Amdahl's Law. I will derive the math for that here. The key point is that α=(1-p) where p is the fraction of the workload that is parallel in Amdahl.

# Amdahl's law where p is the fraction of the workload
# that is parallel, N is same as for USL
                    1
     Speedup = -----------
               (1-p) + p/N

# Start with USL
                      N
     X(N) =  -------------------
             1 + α(N-1) + ꞵN(N-1)

# Update for X(1)=1, ꞵ=0, α=(1-p)
                    N
     X(N) =  ---------------
             1 + (1-p)(N-1)

# Algebra for denominator, expand and then reduce
                 N
     X(N) =  ----------
             N - pN + p

# Multiply numerator and denominator by 1/N
                   1
     X(N) =  -------------
             (1 - p) + p/N

# Now it looks like Amdahl where α=(1-p)

Predicted max throughput

The long post states that the max throughput occurs at N = sqrt((1-α)/ꞵ). That is the value of N for which X(N)' = 0. I won't derive it here but X(N)' is:

                  1 - α - ꞵN^2
     X(N)' = --------------------------
             (1 + α(N-1) + ꞵN(N-1))^2

Then X(N)'=0 when the numerator is zero and that happens when N=sqrt((1-α)/ꞵ). That determines a critical point for X(N) which is either a min or max. In this case it is a max. I assume that X(N) is concave without deriving X(N)''.
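
A quick numeric check in Python, with made-up values for α and ꞵ, shows throughput rising to the predicted peak and then falling:

import math

def usl_throughput(n, alpha, beta, x1=1.0):
    # X(N) = X(1)*N / (1 + alpha*(N-1) + beta*N*(N-1))
    return x1 * n / (1 + alpha * (n - 1) + beta * n * (n - 1))

alpha, beta = 0.02, 0.0005    # hypothetical contention and coherency costs
print("predicted peak at N =", round(math.sqrt((1 - alpha) / beta), 1))  # ~44.3
for n in (10, 20, 40, 44, 50, 80):
    print(n, round(usl_throughput(n, alpha, beta), 2))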

Tuesday, October 29, 2019

Review of "5 Minute rule - thirty years later"

The 5 Minute rule paper has been updated and it is worth reading. The paper explains how to make cost-based decisions for tiered storage, applies that using price and performance of devices over the past 30 years and then makes predictions based on current trends.

There is a prediction in the paper that will provoke discussion --  the price vs performance advantage for disk vs tape is getting smaller and soon we might need Hadoop for tape. That will be an interesting project.

5 minute rule math

The 5 minute rule math explains what data should be in a cache by considering the price of the cache and the price/performance of the storage device. The math has been used where cache+storage are DRAM+disk, DRAM+SSD and SSD+disk. Now it has been updated for NVM.

The equation is presented as two ratios: technology and economic. Below I explain how to derive it.

Technology ratio            Economic ratio

Pages/MBofCache             Price/StorageDevice
-----------------------  *  -------------------
OpsPerSec/StorageDevice     Price/MBofCache

If you reduce the equation above you end up with Pages/OpsPerSec. The unit for this is seconds and this is the inverse of the access rate. When Pages/OpsPerSec = X then the cache should be large enough to fit pages accessed more frequently than every X seconds (1/X accesses per second). The formula above doesn't tell you to buy X MB of cache. It does tell you to buy enough cache to fit all pages accessed more frequently than once per X seconds.

Now I derive the formula as the presentation above can confuse me. The cost of a cache page is:
Price/MBofCache * MBofCache/Pages

The cost of 1 OpsPerSec from a storage device is:
Price/StorageDevice * StorageDevice/OpsPerSec

Each cache page has a cost and a benefit. The benefit is the reduction in OpsPerSec demand for cached data. Assume there are K pages of cache and the access rate for each page is A. Then this cache reduces the OpsPerSec demand by K*A. From this we can determine the value of A for which the cache cost equals the benefit:

K * Price/MBofCache * MBofCache/Pages =
    K * A * Price/StorageDevice * StorageDevice/OpsPerSec

# And then solve for A, K cancels, move ratios to LHS


Price/MBofCache       MBofCache/Pages
------------------- * ----------------------- = A   
Price/StorageDevice   StorageDevice/OpsPerSec

# LHS can be reduced to 1/Pages / 1/OpsPerSec
# which is OpsPerSec / Pages while '5 minute rule'
# above solves for Pages / OpsPerSec, so invert LHS
# to get 1/A

Price/StorageDevice   StorageDevice/OpsPerSec
------------------- * ----------------------- = 1/A
Price/MBofCache       MBofCache/Pages

# Now reorder LHS

StorageDevice/OpsPerSec   Price/StorageDevice   
----------------------- * ------------------- = 1/A
MBofCache/Pages           Price/MBofCache

# Given that X/Y = 1/Y / 1/X when X, Y != 0 then
# first term in LHS can be rewritten to get

Pages/MBofCache           Price/StorageDevice
----------------------- * ------------------- = 1/A
OpsPerSec/StorageDevice   Price/MBofCache

# A is an access rate, OpsPerSec / Pages, and 1/A is
# Pages/OpsPerSec. Thus we have derived the formula.
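
As a worked example, here is the formula in Python with hypothetical numbers for a DRAM cache over an SSD: 4kb pages (256 pages/MB of cache), a $500 device that does 100,000 reads/second and DRAM at $0.005/MB. The break-even interval works out to 256 seconds, so pages accessed more often than every ~4 minutes belong in the cache.

def break_even_seconds(pages_per_mb_cache, ops_per_sec_per_device,
                       price_per_device, price_per_mb_cache):
    # (technology ratio) * (economic ratio) = Pages/OpsPerSec = seconds
    technology = pages_per_mb_cache / ops_per_sec_per_device
    economic = price_per_device / price_per_mb_cache
    return technology * economic

x = break_even_seconds(256, 100000, 500.0, 0.005)
print("cache pages accessed more often than every", round(x), "seconds")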

Technology

The numerator for the technology ratio changes slowly. Assuming file system pages are cached, the page size is now a multiple of 4kb (ignoring compression). There is one case for smaller pages -- NVM on the memory bus might allow for the return of 512b pages. The denominator for the technology ratio is interesting. It has barely changed for disk but changed a lot for SSD.

For the economic ratio the cost of DRAM, disk, SSD and NVM is not changing at the same rate. Thus advice changes depending on the technology. In 2007 the advice for DRAM+SSD was to cache objects accessed every 15 minutes and in 2017 it is every 7 minutes so less DRAM is needed. The advice for DRAM+disk is the opposite. DRAM cost dropped more than disk OpsPerSec improved so more DRAM cache is needed.

Cost

The paper has some discussion on cost. The cost can be the retail price from a device vendor or the all-in cost from a public cloud vendor. The retail price can be misleading as that isn't TCO. You need to consider power and more. In either case (retail, public cloud) capacities come in discrete sizes and you might not be able to deploy the optimal values.

Finally, the 5 minute rule doesn't solve for all constraints, nor should it. There  are limits to power, space and throughput that must be considered. Motherboards are limited by memory slots, PCIe slots and throughput to storage. Power might not be unlimited. Constrained optimization is an interesting topic, but not for this post.

Thursday, October 24, 2019

Tuning space and write amplification to minimize cost

In a previous post I described how to tune for space and write amplification with an LSM or index+log index structure. In this post I explain how to use that to minimize the storage cost for a workload.

This is an optimization problem and the objective function is to minimize the number of storage devices. The model I use was described in a previous post but I summarize it here:
  • Storage device supplies r units of read IO, w units of write endurance and s units of capacity.
  • Workload demands R units of read IO, W units of write endurance and S units of capacity.
  • max(R/r, W/w, S/s) storage devices are required when amplification is ignored
  • max(R*ar/r, W*aw/w, S*as/s) storage devices are required when amplification is not ignored where ar, aw and as are read, write and space amplification
Here I assume that read IO is never the bottleneck and the goal is to minimize max(W*aw/w, S*as/s). That occurs when W*aw/w = S*as/s. I assume that W, w, S and s are constants so the goal is to tune aw and as to solve the equation. Note that W/w and S/s are the number of devices needed for the workload based on endurance and capacity.

I thought I previously published a blog on this topic but I could not find it.

Index+log

For index+log I assume that aw = 100/(100-pfull) and as = 100/pfull where pfull is the percentage of device capacity available to the user. With that I can determine the value of pfull that solves the equation.

W*aw/w = S*as/s
# reorder LHS and RHS
W/w*aw = S/s*as
# replace aw and as
W/w * 100/(100-pfull) = S/s * 100/pfull
# multiply LHS and RHS by pfull and then (100-pfull)
pfull * W/w * 100 = (100-pfull) * S/s * 100
# divide LHS and RHS by 100
pfull * W/w = (100-pfull) * S/s
# expand RHS
pfull * W/w = 100 * S/s - pfull * S/s
# add pfull * S/s to LHS and RHS
pfull * W/w + pfull * S/s = 100 * S/s
# factor LHS
pfull * (W/w + S/s) = 100 * S/s
# done
pfull = 100 * S/s / (W/w + S/s)

From the solution if I need 10 devices based on endurance (W/w = 10) and 30 devices based on capacity (S/s = 30) then using pfull = 100 * 30 / (10+30) = 75% minimizes the number of devices required. With pfull=75 then aw=4 (100/25), as=1.33 (100/75), W/w*aw=40, S/s*as=40 and the workload needs 40 storage devices. Were I to use pfull=50 then aw=2, as=2, W/w*aw=20, S/s*as=60 and the workload needs 60 devices. Were I to use pfull=80 then aw=5, as=1.25, W/w*aw=50, S/s*as=38 (rounded up) and the workload needs 50 devices. So pfull=75 looks like a good choice.
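
The example above is easy to check with a short Python sketch, where w_devices = W/w and s_devices = S/s:

def devices_needed(w_devices, s_devices, pfull):
    write_amp = 100.0 / (100.0 - pfull)
    space_amp = 100.0 / pfull
    return max(w_devices * write_amp, s_devices * space_amp)

def best_pfull(w_devices, s_devices):
    return 100.0 * s_devices / (w_devices + s_devices)

p = best_pfull(10, 30)                       # -> 75.0
for pct in (50, p, 80):
    print(pct, devices_needed(10, 30, pct))  # -> 60, 40, 50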

LSM

Space and write amplification are inversely related for both LSM and index+log but the math is easier for index+log. We can start with the equation to solve, but I won't solve it.

W*aw/w = S*as/s
# reorder LHS and RHS
W/w*aw = S/s*as
# replace aw with 0.8 * fo * L and as with 1 + 1/fo
# where fo is per-level fanout and L is the number of levels
W/w * 0.8 * fo * L = S/s * (1 + 1/fo)
# alas fo (per-level fanout) is a function of L (number of levels) and total fanout
# assume total fanout is t, then fo = t^(1/L) where t is a constant
W/w * 0.8 * t^(1/L) * L = S/s * (1 + 1/t^(1/L))

I stopped at this point because the math isn't easy and L must be an integer >= 1, so an easy way to solve this is to compute the LHS and RHS for L=1, 2, ..., 10 and choose the L that minimizes the difference. For the example above with W/w=10 and S/s=30, LSM write-amp is always sufficiently larger than space-amp that write-amp determines the number of devices, and L=6 or 7 minimizes write-amp and thus the number of storage devices. Were S/s increased to 200 then L=3 or 4 minimizes the number of storage devices.
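A sketch of that brute-force search, using the formulas above and assuming total fanout t=1024:

def lsm_devices(endurance_devices, capacity_devices, t, levels):
    fo = t ** (1.0 / levels)   # per-level fanout
    aw = 0.8 * fo * levels     # write-amp
    a_s = 1 + 1 / fo           # space-amp
    return max(endurance_devices * aw, capacity_devices * a_s)

# W/w=10, S/s=30: the minimum is at L=7, with L=6 close behind
for levels in range(1, 11):
    print(levels, round(lsm_devices(10, 30, 1024, levels), 1))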

Wednesday, October 23, 2019

Write vs space amplification for an LSM and index+log

There is an inverse relationship between write and space amplification for the LSM and index+log index structures -- with more space (write) there is less write (space) amplification. I expect that the tradeoff is better for index+log than for an LSM with leveled compaction which means that index+log has less write-amp for a given amount of space-amp. It will be interesting to determine whether my expectation is true.

My expectation about index+log is based on the following assumptions, and "all models are wrong, some are useful" applies here.
  • For index+log write-amp is 100 / (100-pfull), space-amp is 100/pfull, pfull is the percentage of the device capacity that can be used and (100-pfull) is the percentage of device capacity set aside to reduce write-amp.
  • For an LSM with leveled compaction write-amp is f * L * total-fo^(1/L), space-amp is 1 + 1/total-fo^(1/L), f is ~0.8 (see this paper), L is the number of levels and total-fo is the total fanout -- size(database) / size(memtable)
The following chart displays write-amp as a function of space-amp assuming that total fanout is 1024 for the LSM. The numbers are based on the formulas above. It shows that write-amp for index+log is always less than write-amp for the leveled LSM for a similar amount of space-amp. Of course lunch isn't free and you usually pay elsewhere with index+log via more cache-amp and more expensive range scans.
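The chart data is easy to recompute from the formulas. A sketch that prints (space-amp, write-amp) pairs for both index structures, assuming total fanout 1024 for the LSM:

# index+log: sweep pfull
for pfull in range(10, 100, 10):
    print("index+log", round(100 / pfull, 2), round(100 / (100 - pfull), 2))

# leveled LSM with f=0.8 and total fanout 1024: sweep the number of levels
for levels in range(1, 11):
    fo = 1024 ** (1.0 / levels)
    print("lsm", round(1 + 1 / fo, 2), round(0.8 * fo * levels, 2))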


Capacity vs Endurance

There is another way to illustrate the relationship between write-amp and space-amp for index+log. I use %cap and %end to indicate the percentage of device capacity and device endurance that is available to a user after accounting for the impact of pfull. When pfull is larger, write-amp increases and space-amp decreases.

It is interesting that %cap + %end = 100.

Assuming:
pfull is the percentage of device capacity available to the user
space-amp = 100/pfull
write-amp = 100/(100-pfull)
%cap = 100/space-amp
%end = 100/write-amp

Then:
%cap + %end 
  = 100/space-amp + 100/write-amp
  = 100/(100/pfull) + 100/(100/(100-pfull))
  = pfull + (100-pfull)
  = 100

pfull  write-amp  space-amp  %cap  %end
90     10         1.11       90    10
80     5          1.25       80    20
70     3.33       1.43       70    30
60     2.5        1.67       60    40
50     2          2          50    50
40     1.67       2.5        40    60
30     1.43       3.33       30    70
20     1.25       5          20    80
10     1.11       10         10    90
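A sketch that regenerates the table from the formulas above:

print("pfull  write-amp  space-amp  %cap  %end")
for pfull in range(90, 0, -10):
    aw = 100 / (100 - pfull)
    a_s = 100 / pfull
    print(pfull, round(aw, 2), round(a_s, 2),
          round(100 / a_s), round(100 / aw))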

Tuesday, October 22, 2019

A review of uDepot - keeping up with fast storage

This is a review of Reaping the performance of fast NVM storage with uDepot, which was published in FAST 2019. The paper is worth reading. uDepot is hash-based index+log using my index structure terminology. The goal is to create a database engine that can utilize all of the IO capacity of a fast storage device -- think millions of read IOPs and ~10 usec latency. As I wrote yesterday, stranding read IO is one way to waste a storage device but so are too much space and write amplification.

By hash-based index+log I mean that updates are appended to a log and an in-memory hash index points into the log. Log space is managed as segments and grains. Segments are large (1gb) and contain smaller grains (4kb). GC reclaims previously written segments by copying out live grains. The index must be updated during GC.
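A toy sketch of hash-based index+log to make the terminology concrete -- this is my model, not uDepot's code:

GRAIN = 4096                     # 4kb grains
GRAINS_PER_SEGMENT = 256 * 1024  # 1gb segments

class IndexPlusLog:
    def __init__(self):
        self.index = {}       # key -> (segment id, grain offset), kept in memory
        self.segments = [[]]  # the log: a list of segments, each a list of grains

    def put(self, key, value):
        seg = len(self.segments) - 1
        if len(self.segments[seg]) == GRAINS_PER_SEGMENT:
            self.segments.append([])  # seal the active segment, open a new one
            seg += 1
        self.segments[seg].append((key, value))
        self.index[key] = (seg, len(self.segments[seg]) - 1)

    def get(self, key):
        seg, grain = self.index[key]  # one storage read per point query
        return self.segments[seg][grain][1]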

Evaluating this according to the CRUM conjecture:
  • read amplification - uDepot does one storage read per point query as there is no block cache. The cost of a block cache is complexity, memory and CPU cycles. The benefit of a block cache is a reduction in storage traffic. uDepot doesn't support range queries because it is hash-based.
  • write amplification - space and write amplification are inversely related with index+log. Doing GC more frequently reduces space-amp at the cost of more write-amp. The paper doesn't discuss this tradeoff and I don't know whether it is configurable (yet) in uDepot.
  • space amplification - see the comment for write-amp. From the paper it wasn't clear whether grains could be shared by small records. If not shared then there will be more space-amp.
  • cache amplification - the hash index needs at least 8 bytes in memory per record. There are hash-based approaches that use less memory per record - SkimpyStash and SILT need ~1 byte/record. The need for something in memory per record is common to index+log approaches because records are not clustered in the log. The memory requirements for uDepot are reduced because it doesn't use a block cache.

uDepot supports get, put and delete. It does not support range scans because it is hash-based. Hash-based approaches can use much less CPU than tree-based approaches, and hash-based is sufficient if you don't need range scans, but I am curious whether there is sufficient demand to justify the cost of building a production-quality hash-based index structure. I hope there is.

Implementation details

The hash index is an array of hash tables. The array can grow dynamically by doubling in size as needed; the paper did not explain whether the array can shrink. Growing is online and incremental and the reported worst case blocks some operations for 1 millisecond. The hash tables use Hopscotch hashing to support a high fill factor. There is an array of mutexes per hash table and some benchmarks were run with 8192 mutexes/table. The hash index is eventually made durable in the log, so the last N changes to the index might not be durable. The paper claims the index can be recovered after a crash in a few seconds but the process wasn't fully explained.
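A toy sketch of the lock striping -- the structure is my guess from the paper's description, not uDepot's code:

import threading

NUM_TABLES = 16           # assumed size for the array of hash tables
MUTEXES_PER_TABLE = 8192  # the value used in some benchmarks

tables = [dict() for _ in range(NUM_TABLES)]
mutexes = [[threading.Lock() for _ in range(MUTEXES_PER_TABLE)]
           for _ in range(NUM_TABLES)]

def put(key, value):
    h = hash(key)
    t = h % NUM_TABLES                         # pick a table
    m = (h // NUM_TABLES) % MUTEXES_PER_TABLE  # pick a stripe in that table
    with mutexes[t][m]:
        tables[t][key] = value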

The log has large segments (1gb each) that contain smaller grains (4kb each). A segment stores either records or the index. I wrote above that uDepot might not share grains between small records, which will waste space. GC copies live grains out of a segment to make the segment free. The GC process -- how and when segments are selected for GC -- was not explained. uDepot expects a raw device. This avoids filesystem overhead, but using a filesystem makes life easier in production. The paper did not explain the overhead saved by not using a filesystem.
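A toy GC sketch for the IndexPlusLog class above. Victim selection and timing are not in the paper, so this only shows the copy-out step:

def gc(db, victim):
    # assumes victim is a sealed segment, not the active (last) one;
    # how uDepot picks victims (e.g. fewest live grains) isn't explained
    for grain, (key, value) in enumerate(db.segments[victim]):
        if db.index.get(key) == (victim, grain):  # grain is still live
            db.put(key, value)                    # relocate; put updates the index
    db.segments[victim] = []                      # the segment is free again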

More implementation details

The implementation raises two interesting questions. What is the best way to do fast IO? What is the best way to implement a thread per core server?

For fast IO uDepot uses SPDK or Linux AIO. I assume that it could work great with io_uring when io_uring becomes widely available. Linux has a habit of eventually catching up to modern hardware once said hardware is sufficiently available. It will be interesting if io_uring removes the need for SPDK. In figures 7 and 8 the paper has results that show a dramatic improvement from using async IO with thread/core compared to sync IO with many threads.

For thread per core uDepot uses TRT -- Task Run Time -- which provides coroutines for systems programming. TRT uses cooperative multitasking so it must know when to reschedule tasks. IO and synchronization are done via TRT interfaces to help in that regard. Under the covers it can use async IO and switch tasks while the IO or sync call is blocked. One benefit from coroutines is reducing the number of context switches.
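TRT's interfaces aren't listed in the paper, but the cooperative model is easy to illustrate with Python coroutines -- tasks yield at IO points and the runtime switches to another task instead of blocking the thread:

import asyncio

async def storage_read(offset):
    await asyncio.sleep(0)  # stand-in for an async IO call; a reschedule point
    return offset

async def task(n):
    # the task blocks here logically, but the thread keeps running other tasks
    return await storage_read(n)

async def main():
    # one OS thread, many concurrent tasks, no context switches between them
    return await asyncio.gather(*(task(i) for i in range(1000)))

asyncio.run(main())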

I am curious about the future of coroutines for systems programming in C and C++. RethinkDB used a thread per core model that started with callbacks and then moved to coroutines because they made development easier -- see here and here. Coroutines are coming, or have come, to Seastar. Boost supports fibers and coroutines. I assume they will eventually arrive in standard C++.

Monday, October 21, 2019

How many storage devices does a workload require?

I read an interesting paper that was motivated by the inability of existing storage engines to utilize all of the read IO provided by new and fast storage devices. That is excellent motivation but the story has more nuance.

A storage device provides capacity, endurance and reads. For now read means read operations and I ignore read throughput and latency. For a given device and workload none of the dimensions (capacity, endurance, reads) might be saturated. When saturation occurs it is usually limited to one of the dimensions. In that case if you want to reduce the cost of storage then change something to be more efficient in the saturated dimension. When capacity (endurance, read IO) is the bottleneck then reducing space (write, read) amplification is an answer.

It is hard to saturate a storage device in all dimensions so I am cautious when insufficient utilization in any dimension is cited as a problem. Too much saturation leads to lousy quality of service. Besides, demand is rarely constant -- web-scale traffic varies by time of day.

When endurance or capacity are the bottleneck then it can be OK to not use all of the read IO provided by a new storage device. A simple model for this is:
  • Storage device supplies r units of read IO, w units of write endurance and s units of capacity.
  • The workload demands R units of read IO, W units of write endurance and S units of capacity.
  • max(R/r, W/w, S/s) storage devices are required when amplification is ignored
  • max(R*ar/r, W*aw/w, S*as/s) storage devices are required when amplification is not ignored where ar, aw and as are read, write and space amplification
  • To reduce the number of storage devices focus on the saturated dimension and tune or change the index structure
An example

For (R=10000, W=1000, S=100, r=10, w=10, s=10, ar=4, aw=10, as=2) then max(R*ar/r, W*aw/w, S*as/s) = max(10000*4/10, 1000*10/10, 100*2/10) = max(4000, 1000, 20) = 4000 and read IO is the bottleneck. But change S from 100 to 100000 and this becomes max(4000, 1000, 20000) = 20000 and capacity is the bottleneck.
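A sketch of the model and the example:

def devices_required(R, W, S, r, w, s, ar, aw, a_s):
    return max(R * ar / r, W * aw / w, S * a_s / s)

print(devices_required(10000, 1000, 100, 10, 10, 10, 4, 10, 2))
# 4000.0 -> read IO is the bottleneck
print(devices_required(10000, 1000, 100000, 10, 10, 10, 4, 10, 2))
# 20000.0 -> capacity is the bottleneck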
