Monday, October 7, 2019

Learned indexes for an LSM?

The Learned Indexes paper opened a new area of research for storage engines. The idea is to use the distribution of the data to make searches faster and/or reduce the size of search structures. Whether ML will be used is an implementation artifact to me. I am in this for the efficiency gains -- reducing space and CPU read amplification. I prefer to call this topic learned search structures rather than learned indexes because the public isn't ready for learned recovery and learned consistency.

While the Learned Indexes paper has ideas for hash maps and bloom filters most research has focused on a B-Tree. I expect learned indexes to work better with an LSM because its search structures (block indexes) are created off line with respect to user operations. There is more time for analysis and the per-SST search structures are write once so there is no need to deal with updates.

Other search optimizations

One of the goals for learned tree indexes is to avoid log2(N) comparisons on a search so I will mention a few things in InnoDB and RocksDB that help with that:
  • The adaptive hash index (AHI) in InnoDB is a non-persistent hash table built on demand that maps a primary key value to a block in the buffer pool. With a clustered index like InnoDB (and MyRocks) non-covering secondary index queries must probe the PK index per row to get the missing columns. The AHI can avoid that CPU overhead.
  • RocksDB uses bloom filters to avoid some of the search overhead per level of the LSM tree. They don't avoid all of the per-level overhead as a search must be done to find the SST before the bloom filter can be checked.
  • RocksDB optionally uses a hash index per data block to avoid binary search within the data block. I assume that learned index approaches can augment or replace some of the data block index, the per block hash index and the bloom filter.

Wednesday, September 25, 2019

Reducing write and space amplification in a b-tree

This started as a review of FineLine: Log-structured Transactional Storage and Recovery from VLDB 2019. The paper is wonderful but dense so I didn't take the time to understand all of it. It uses a log-structured approach to avoid the complexity of ARIES for recovery. While it can support a variety of index structures a b-tree is the obvious target.

While it wasn't a stated goal for the paper, the FineLine approach reduces space and write amplification for a b-tree. Maybe a focus on that is motivation for another paper from the authors. An LSM with leveled compaction has much less space and write amplification than a b-tree at the cost of more CPU. Reducing space and write amplification for a b-tree makes it more competitive when storage efficiency is a priority. The paper has good performance results for databases less than 10X the size of RAM. I am curious what happens when the database is 100X the size of RAM.

I look forward to corrections and apologize for not distinguishing between page and block. Fortunately I don't have to write about lock vs latch here. I also look forward to the next paper from the authors.

Improving b-tree storage efficiency

I previously explained how the InnoDB b-tree can use 2X more space and write 10X more bytes per transaction to storage compared to the MyRocks LSM. This is a big deal if you prefer to use less SSD and worry about endurance (because QLC is coming). But there are several things that can be done to make a b-tree better at storage efficiency.
  1. Variable length pages
  2. Block and index prefix compression
  3. Defer dirtying pages and page write-back
A traditional b-tree subject to random updates will have leaf pages that are <= 70% full. This wastes at least 30% of storage and space amplification (space-amp) is at least 1.42X (100 / 70). Variable length pages can avoid that waste. By variable length pages I mean that when the compressed page is X bytes then it uses at most X+Y bytes on disk where Y is <= 512. One reason for such a small value for Y is that a common page size for OLTP is 4kb or 8kb. The alternatives to supporting variable length on-disk pages has been to rely on file system hole punch (see here and here for my bad experience with that in InnoDB) or a proprietary file system. Variable length pages will cause space-amp from external fragmentation as there will be wasted space that has yet to be reclaimed.

Both index prefix compression and block compression are needed. Index prefix compression reduces the size of pages before and after compression. These should be optional because some workloads don't want the CPU overhead from block compression (even with fast lz4 and zstd) or from undoing prefix compression. Block compression usually leads to variable length pages.

Variable length pages and compression help mostly with space-amp. To reduce write amplification (write-amp) a b-tree can defer making pages dirty and the write-back that must follow. The FineLine paper has new ideas for this. The old ideas have expensive side-effects (slow crash recovery) including using a large buffer pool with 99% dirty pages and using a huge redo log. I wonder what the side-effects will be for FineLine.

The OSS b-tree engines that I have used (InnoDB, WiredTiger) can do more to reduce space-amp. InnoDB supports block compression but not variable length pages (by my definition) so the implementation is complex and some space-amp savings are lost. WiredTiger supports variable length pages, block compression and index prefix compression. From the WiredTiger docs the on-disk page size is a multiple of allocation_size which is >= 4kb which doesn't meet my requirement for variable length above. Regardless I am still a huge fan of InnoDB and WiredTiger. There is also more to be done to reduce write-amp in OSS engines. InnoDB fuzzy checkpoint combined with a large redo log can reduce write back. The InnoDB change buffer defers dirtying non-unique secondary index pages. WiredTiger doesn't support fuzzy checkpoint. I hope that changes. But neither have anything that can defer write-back like FineLine.


The goals for FineLine include avoiding the complexity of ARIES, instant recovery, larger than memory databases and a compromise between read and write optimized solutions -- better read performance than an LSM, better write performance than a b-tree.

There are a few logical concepts to grasp when reading the paper -- node and indexed log. FineLine uses node rather than page. A node has an ID and variable length size. Nodes can reference other nodes -- for a b-tree leaf nodes will be referenced by interior nodes. Log entries are either node format (a node image) or describe a change to a node. The indexed log is the database and supports append and fetch. The indexed log is unlikely to be implemented by a log file.
FineLine uses a single-storage approach. The indexed log is the database. For the paper the indexed log was implemented using a Foster b-tree, forked from ShoreMT, that might be a variant of a copy-on-write b-tree. Single-storage means there is only a log to write -- redo log records are written on commit and compaction runs in the background to merge redo log records and occasionally write page images (node-format log records) when there are many redo records for a given page. The paper describes a simple API for the indexed log -- append and fetch node.

Group commit is supported and all log records to be written are assigned a partition number and then ordered by node ID. I assume the partition number is monotonically increasing per commit group. A group commit creates a new partition in the indexed log. A bloom filter might also be created to make it faster to determine whether a partition includes records for a node. Only redo is written to the indexed log. There are no undo records. One reason for better write performance is less logging.

Adjacent partitions are merged by background threads. This has something in common with LSM compaction. The paper was vague but there are times where node images (node format log records) are written back to the indexed log when there are too many log records for that node. This can be done via cost-based decisions that consider read vs write efficiency. There was some work in Hekaton to figure out when to collapse per-node redo log records.

On a buffer cache miss the fetch operation must first find the most recent node image (node format log record) and then find/apply all of the redo log records for that node. In the worst case this means searching many partitions. Bloom filters can make this search faster.

A buffer manager caches node images. Some of the node images might be expensive to recreate as explained in the previous paragraph. While the paper claims FineLine supports databases larger than memory I wonder how much larger because of the cost of recreating node images. The paper provide tpc-b results where the database is no more than 10X the size of memory. I am curious about 100X larger.

Friday, September 20, 2019

Review of Spanner: Becoming a SQL System

This is a review of Spanner: Becoming a SQL System, at least the bits of it that I understand. It is worth reading and is from SIGMOD 2017 but I just noticed it. I am impressed by the ability of Google to build large-scale systems. Spanner + SQL is used within Google and by external customers for OLTP and OLAP (HTAP) although I am curious whether it is performant for OLAP - which is really a function of the time needed to improve the optimizer and execution engine for complex SQL. From the paper SQL and a common SQL across DBMS within Google has greatly increased Spanner usage.

Cloud Spanner provides serializable isolation. The paper states this is done with pessimistic concurrency control and timestamps so I assume that reads take locks in read-write transactions to prevent changes to rows that have been read. I wonder whether conflicts are an issue for workloads with contention. It isn't clear to me whether CockroachDB uses read locks.

There are several SQL systems within Google -- F1, Cloud Spanner and Dremel. They implement the same SQL language thanks to a common front-end (parse, semantic analysis), library of scalar functions, test framework, feature tests (coverage and correctness) and random query generator for testing. I wonder what is to become of F1 which first did SQL on Spanner for Google Ads.

Range extraction is done at compile time and run time to prune table shards, determine which fragments of a index should be read and determine key ranges to lock. All DBMS do this at compile time. Doing this analysis at run time is less common. Distributed SQL has more overhead for data access so the overhead from this analysis is worthwhile.

Restart tokens are used to allow the client library to hide transient failures even in cases where partial query results have been received. The token represents the internal state of query execution and works across upgrades. Intermediate query results are not persisted on the server side as the target workload is OLTP. The comdb2 DBMS also returns information to the client to hide transient failures.

Ressi is an LSM optimized for OLAP rather than OLTP. This uses the PAX idea within the LSM data blocks. Data within the LSM tree is ordered by key (row-wise) but within each data block is stored column-wise. This is a great compromise for HTAP -- one IO fetches all columns for a row but the vectorized format within the data block reduces CPU overhead during OLAP and the paper states that Ressi can operate directly on vectors in compressed form. Other features from Ressi include:
  • Large (multi-page) values are stored in separate files. I assume this is similar to the LOB indirection done by InnoDB.
  • There are active and inactive files. The active file has the current version for a key while the inactive file has older versions for a key. It wasn't clear to me how this is integrated into the LSM tree. My guess is that the split is done during LSM compaction and I am curious if this ever creates search overhead.
I am uncertain but think this means that a transaction can't read its own uncommitted writes. That seems like a source of confusion for external users.
Just like SQL syntax and semantics, transactional guarantees of database products vary widely from one system to another. Spanner supports very strong guarantees that establish a global order among transactions, called external consistency. Within a given read-modify-write transaction, however, Spanner imposes a constraint that all updates must be committed as the last step of the transaction. Overcoming this limitation requires supporting reading the uncommitted results within an active transaction. While we have seen comparatively little demand for this feature internally, supporting such semantics improves compatibility with other SQL systems and their ecosystems and is on our long-term radar. 
While the paper didn't cover pricing I wonder how Spanner compares to the alternatives from AWS and MSFT. Spanner charges per GB of storage while AWS Aurora charges for GB of storage and per IO. Is that a big deal? Do users pay for extra RAM to reduce their read IO with Aurora?

Wednesday, September 11, 2019

Optimal Tuning for an LSM

For a few years I have wanted to use math to determine optimal tuning for an LSM given a specific workload and constraints on efficiency. First I had to (re)learn math. I like calculus now while I didn't enjoy it in high school -- just like spinach.

By specific workload I mean the fraction of operations that are inserts, point queries and range reads. Note that if the workload has no range reads then an LSM might not be the best choice. By constraints on efficiency I mean the max write and space amplification that would be tolerated.

I decided to try constrained optimization and Lagrange Multipliers where the objective function was the average number of CPU comparisons per operation and the decision variables were the per-level fanouts for the LSM tree. In this case I am only optimizing for the CPU overhead, but I have to start somewhere. The eventual goal is to include CPU, memory and storage in the cost function.

I previously used Lagrange Multipliers to show how to set the per-level fanout to minimize write amplification. The result depends on assumptions about the per-level write amplification and this dependency is a new result but the previous result is that the per-level fanout should be the same for all levels. Here I show that CPU overhead can be reduced by using larger per-level fanouts for larger levels.

  • Books with titles like Math for Economists are a great resource for learning math
  • Choosing a good value for the number of levels in the LSM tree is more significant than figuring out the optimal per-level fanouts.
  • Take time to understand when a problem is, or is not, worth optimizing

Is this worth solving?

To minimize the CPU overhead of a workload it is much more important to choose the number of levels in the LSM tree and the per-level compaction algorithms (tiered, leveled, leveled-N) then it is to optimize the per-level fanout as I describe here. While the math I describe below isn't the best way to optimize an LSM, the math was interesting to me so this blog post lets me revisit my work.

In the cost function below the CPU cost of a read, Cr, is dominated by the size of the memtable. The per-level fanout values, f1 and f2, won't have a significant impact on the value of Cr. Therefore it isn't clear that it is worth figuring out the optimal values for the per-level fanout. By not significant assume that B=2^20 while f1 and f2 are likely to be <= 32. Then the contribution to Cr from B is 60 while the contribution from f1 and f2 are at most 15. For Ci, the cost of an insert, the contribution from B is likely to be similar to the contribution from f1 and f2.

The problem

The problem below has a nonlinear objective function, nonlinear equality constraint and simple inequality constraints.

* LSM memtable, L1 and L2 where there are B entries in memtable, f1 is fanout from memtable to L1 and f2 is fanout from L1 to L2. There are B * f1 * f2 rows in the LSM.

* Cost of read is Cr and cost of insert is Ci, both are the number of compares per operation. Bloom filter is not used and Ci includes compares for memtable insert and compaction.
* Fraction of reads and inserts in workloads is Pr and Pi where Pr + Pi = 1
* The value for X is somewhat arbitrary although X should be >= 2. B should be an integer but I ignore that for now.

Minimize Pr * Cr + Pi * Ci
such that B * f1 * f2 = K
and B >= X, B <= mK where m between 0 and 1
and f1 >= 2 and f2 >= 2

# compares per point-read without a bloom filter
# assume key is found in max level of LSM tree
Cr = memtable-cmp + l1-cmp + l2-cmp

Cr = log2(B) + log2(B*f1) + log2(B*f1*f2)
# Cr can also be expressed as
Cr = 3*log2(B) + 2*log2(f1) + log2(f2)

# compares per insert
Ci = memtable-insert-cmp + compaction-cmp
Ci = log2(B) + f1 + f2

Step 1

First, write the L function that includes the constraints:

# Note that I wrote u in place of lambda
L = Cr * Pr + Ci * Pi + u(K - B*f1*f2)

dL/dB = (3*Pr) / (B * ln(2)) + Pi/(B * ln(2)) - u*f1*f2

dL/df1 = (2*Pr) / (f1 * ln(2)) + Pi - u*B*f2
dL/df2 = Pr / (f2 * ln(2)) + Pi - u*f1*f2

Then realize that the inequality constraints make this a lot harder to solve and that complexity gets worse with more levels in the LSM tree. Then admit that you haven't done all of the work to confirm that the math techniques can be used on this problem (constraint qualifications, concave objective function).

Step 2

Without the constraints the goal is to find the min and max values for the objective function by finding values for which the partial derivatives (dL/dB, dL/df1 and dL/df2) are zero. I will do that here and then stop. Note that when the partial derivatives are zero the solution might violate the constraints and thus not be valid. Also note that when the partial derivatives are zero the extremum can be a min or a max. Regardless, the results suggest that fanout should increase from smaller to larger levels to minimize the cost.

First solve for u when dL/dB = 0.

dL/dB = (3*Pr) / (B * ln(2)) + Pi/(B * ln(2)) - u*f1*f2
dL/dB = (3*Pr + Pi) / (B * ln(2)) - u*f1*f2
# given Pr + Pi = 1 ...

dL/dB = (2*Pr + 1) / (B * ln(2)) - u*f1*f2

# now solve when dL/dB = 0
(2*Pr + 1) / (B * ln(2)) = u*f1*f2
# given f1*f2 = K/B
(2*Pr + 1) / (B * ln(2)) = u*K/B
# multiply LHS and RHS by B
(2*Pr + 1) / ln(2) = u*K

# solve for u
u = (2*Pr + 1) / (K * ln(2))

Then  solve for f1 when dL/df1 = 0

dL/df1 = (2*Pr) / (f1 * ln(2)) + Pi - u*B*f2
# when dL/df1 = 0 then
(2*Pr) / (f1 * ln(2)) + Pi = u*B*f2
# given B*f2 = K/f1

(2*Pr) / (f1 * ln(2)) + Pi = u*K/f1
# multiply LHS and RHS by f1
(2*Pr)/ln(2) + f1*Pi = u*K
f1*Pi = u*K - (2*Pr)/ln(2)

# substitute u = (2*Pr + 1) / (K * ln(2))

f1*Pi = (2*Pr + 1) / ln(2) - (2*Pr)/ln(2)
f1*Pi = (2*Pr + 1 - 2*Pr) / ln(2)
f1*Pi = 1 / ln(2)
f1 = 1 / (Pi * ln(2)) 

Then solve for f2 when dL/df2 = 0

# skip the math for now, trust me :)
f2 = (1 + Pr) / (Pi * ln(2)

# were there an f3 then the answer would be...
f3 = (1 + 2*Pr) / (Pi * ln(2) 

Then solve for B, f1 and f2 given values for Pi and Pr

B = K / (f1 * f2)
B = K / (1 / (Pi * ln(2)) * (1+Pr) / (Pi * ln(2)))
B = K / (1 + Pr) / (Pi * ln(2))^2
B = K * (Pi * ln(2))^2 / (1+Pr)

# with Pi=0.5 and Pr=0.5
# constraints on f1, f2 are OK
# constraints for B are probably OK
B = K * 0.080, f1=2.89, f2=4.33

# with Pi=0.1 and Pr=0.9# constraints on f1, f2 are OK
# constraints for B are probably OK
B = K * 0.003, f1=14.43, f2=27.41

# with Pi=0.9 and Pr=0.1
# note that f1, f2 are < 2 which violates a constraint

B = K * 0.353, f1=1.60, f2=1.76

# with Pi=0 and Pr=1
# --> flush the memtable, no need to search it

Comparing optimal solution with typical config

For the the 2 cases above for which an optimal solution was found that doesn't violate constraints I compare the value of the objective function (compares/operation) with for the optimal vs a typical config (f1=f2 in typical config) assuming K=1B (1B rows in LSM).

These results explain my comment about confirming that the impact from an optimal config is significant. It isn't in these cases.

# with Pi=0.5 and Pr=0.5
# f1=f2=3.54 in default config
B = K * 0.080, f1=2.89, f2=4.33

optimal and default config do ~47 compares/operation
# with Pi=0.1 and Pr=0.9
B = K * 0.003, f1=14.43, f2=27.41
# f1=f2=19.89 in default config
optimal and default config do ~63 compares/operation

Using solvers

I tried the Mathematica and Matlab and had more success with Matlab. The Minimize function in Mathematica can get unhappy with fractional coefficients in the cost function (values for Pr and Pi).

Below are two functions and the script I used with Matlab.

# costf.m is the objective function
function [f,g] = costf(x)
B = 1000000;
Pq = 0.99;
Pu = 0.01;

f = Pq * (4*log2(B) + 3*log2(x(1)) + 2*log2(x(2)) + log2(x(3))) + Pu * (log2(B) + x(1) + x(2) + x(3));

if nargout > 1
    g = [((Pq * 3) / (x(1) * log(2))) + Pu;
         ((Pq * 2) / (x(2) * log(2))) + Pu;
         ((Pq * 1) / (x(3) * log(2))) + Pu];

# nlcon.m is the nonlinear constraint

function [c, ceq] = nlcon(x)
c = [];
ceq = x(1) * x(2) * x(3) - 1000;

# the script
lb = [2,2,2];

x0 = [10,10,10];

S = fmincon(@costf, x0, [], [], [], [], lb, [], @nlcon);

Tuesday, September 10, 2019

FoundationDB Record Layer

This is a review of the FoundationDB Record Layer paper. My summary is that Record Layer is a wonderful solution when you need to support billions of OLTP databases where each database is small with a simple and low-concurrency workload. I am not an expert on FoundationDB so this review contains questions and might have mistakes. My focus is not on the distributed system properties.

The big question is whether Record Layer can be the solution for less simple workloads, larger databases and higher concurrency. There doesn't appear to be an OLTP SQL layer in development for FoundationDB and there are many interesting alternatives in the open-source and source-available scale-out OLTP DBMS space that already have query layers including CockroachDB, Yugabyte, FaunaDB, TiDB, MySQL Cluster, Comdb2 and MongoDB.

Apple's CloudKit is the star customer for FoundationDB. Other use cases include SnowflakeDB (metadata mgtmt) and Wavefront (primary store). FoundationDB is open source (Apache 2.0) and has an active community. At the upcoming summit you can learn about Redwood, the new storage engine that will replace SQLite.

Update - CouchDB might be moving to FoundationDB

Questions and Comments

An overview of the API is here.

  • Record Layer uses optimistic concurrency control to provide serializable isolation. Commit can fail when there are conflicts. Locks are not held prior to commit. There will be conflicts especially when there is a lag between the time at which the transactions does reads and commit (stalls for reads from storage make this worse). 
  • It provides atomic RMW operations that don't fail from conflicts at commit like min/max and increment because they never conflict (but what happens when the target row has been deleted). This is great but not enough to reduce my concern about support for high-contention workloads -- there will be conflicts.
  • Support for server-side logic would help reduce conflicts by shrinking transaction duration, alas there is neither a layer for SQL nor for server-side logic.
  • Clients can remove keys from a transactions read set to reduce the chance of conflict. I wonder if this supports weaker isolation like repeatable read. Search for conflict and conflict ranges in the paper.
  • Record Layer supports aggregation indexes to maintain things like sum, count, min and max. I assume this uses the atomic RMW operations. I wonder what trouble this could cause with hot spots.
  • SQLite implements the ssd engine. This isn't the best choice if you care more about performance or efficiency. This is a great choice if you care about manageability while scaling out to billions of databases.
  • Transactions can run for at most 5 seconds. With the ssd engine, implemented by SQLite, only committed versions are forced to disk and undo is buffered in memory to support reads at previous versions. The 5 second limit is there to reduce the memory demand. This restriction might go away with the upcoming Redwood engine.
  • Transactions are limited to 10mb of changes.
  • Redwood is expected to replace SQLite. It is a copy-on-write b-tree with prefix compression. Long common prefixes are frequent with Record Layer so prefix compression is in demand. It uses shadow pages for the leaf pages and then remaps page IDs for non-leaf pages to reduce write-amp. There was an awesome PhD on shadow paging many years ago.
  • The storage engine must be single-threaded which makes it harder to reuse engines like RocksDB. 
  • Order by is only supported via indexes to reduce the memory demand on the server.
  • Record Layer is stateless. Query results are streamed back to the client. I hope the client can get more than one row per network RPC. A continuation is returned to the client when it must fetch more from a cursor.
  • Protocol Buffers are used to encode data. Index keys are compared via memcmp and Record Layer has support for combining composite keys, and hopefully for charsets, into a string for which memcmp does the right thing.

Monday, September 9, 2019

Vinyl - the LSM in Tarantool

Tarantool started as an in-memory DBMS then added an LSM, named Vinyl. This is my review of the Vinyl overview. I have some questions and assume that I will make a few mistakes in describing it. Corrections are welcome.


Tarantool is fast and one reason is server-side logic implemented in Lua and made faster by LuaJIT.

There has been talk of running Linkbench on Tarantool. I am not sure whether that requires too much work but I expect great throughput results from it.

I know of Tarantool because the former tech lead for it previously did great work at MySQL and now will do the same at ScyllaDB. HighLoad is a great place to learn more about Tarantool, as I did on my two visits.

LSM Tree Shape

Vinyl uses tiered+leveled compaction with the memtable as L0, leveled for Lmax and tiered for the levels in between the memtable and Lmax. See LSM Geek Code for more on my terminology.

Configuration parameters for Vinyl are here. There are two parameters that determine the shape of the LSM tree. The max number of sorted runs per level is set by vinyl_run_count_per_level. The size ratio between adjacent levels is set by vinyl_run_size_ratio. The docs in the Filling an LSM tree section have a nice description of the impact from these options. When vinyl_run_count_per_level is 2 then there will be at most 2 sorted runs in a level. When vinyl_run_size_ratio is 5 then the max size of a run in L2 is 5X the max size of a run in L1.

Not all tiered compaction algorithms allow an LSM tree to be described in terms of levels like this, but Vinyl does and that makes it easier to reason about performance and efficiency. More than one sorted run per level will have a large, negative impact on range query CPU overhead. However the impact for point queries should be much less when there is a bloom filter per sorted run as used by Vinyl.

Tuple Range Cache

The range cache sounds interesting but I am still confused by it. The docs describe it with the following. I assume this stores ranges (less or more than one page) from the max level of the LSM tree based on "after having performed a compaction spanning all tree levels". How is cache invalidation managed? What is done in between all-level compaction? Is there also a block cache separate from this?
Unlike, say, RocksDB or MySQL, this cache doesn’t store pages, but rather ranges of index values obtained from disk, after having performed a compaction spanning all tree levels. This allows the use of caching for both single-key and key-range searches. Since this method of caching stores only hot data and not, say, pages (you may need only some data from a page), RAM is used in the most efficient way possible. The cache size is controlled by the vinyl_cache parameter.
Other highlights

  • Vinyl measures the memtable insert and flush rates to predict when to begin flushing the memtable so that the flush will finish before the memtable is full. 
  • The Vinyl memtable is a b-tree rather than a skip-list
  • Serializable isolation is provided
  • zstd compression is used
  • Multi-level compaction is mentioned a few times but not explained
  • Vinyl has gap locks but the overview didn't explain how they are used.
  • What happens when a transaction stalls on a read from disk with Vinyl? Does Vinyl run transactions twice -- once to prefetch reads without applying changes, then again to apply changes? Does it run transactions until they stall on a disk read and then what happens when the read is ready? This might explain the need for gap locks.
  • I assume that transactions are interleaved when there are read stalls. Locks are not held so commit can fail when conflicts are detected. The overview explains it implements the MVTO (multiversion timestamp ordering) class, whereby the winning transaction is the one that finished earlier. There are no locks and associated deadlocks. But now I am confused by the claim that there are no locks when the overview mentions that gap locks have been implemented.
  • Vinyl creates bloom filters for full and partial key searches. I assume that a bloom filter for partial key search is similar to a prefix bloom filter in RocksDB.
  • Vinyl uses at least one LSM tree per logical index. More than one LSM tree is used when the logical index is too big in which case the LSM trees range partition the logical index. This allows different optimizations per partition and can be a major win for right or left growing indices, in which case only one partition will get writes and the others can be read optimized. Creating partitions is done via a split. Too-small partitions can be combined via a coalesce. Some of the split/coaleasce work is deferred until compaction (think of virtual partitions).
  • Vinyl supports upsert (read-free updates are in RocksDB)

Thursday, September 5, 2019

Adapting TPC-C for MongoDB - reviewing a VLDB paper

This is a review of Adapting TPC-C Benchmark to Measure Performance of Multi-Document Transactions in MongoDB which was published in VLDB 2019. I appreciate that MongoDB and Asya Kamsky took the time to get this published. That can be a weekend and nights project when in industry. I also appreciate that this not a benchmarketing effort. The purpose wasn't to overstate performance. The purpose was to show how to get good performance on a TPC-C like workload with MongoDB and realistic hardware and configurations. I hope for a similar effort on MongoDB with Linkbench.

My comments:
  • Work was done to reduce write-write conflicts which will be more likely given the extra commit latency from using w:majority writeConcern on a 3-node cluster. That work included 1) moving conflicting writes early in the transaction 2) moving writes before reads 3) using findAndModify instead of select/update and 4) batching writes. I wonder if non-stored procedures will be useful.
  • A small amount of denormalization was done by storing order lines in the order document. Denormalize everything isn't feasible here or in Linkbench because that leads to too-big documents.
  • Code and details were shared that will allow you to reproduce results.
  • w:majority was used on a 3-node cluster. The goal was to get realistic results, not a benchmarketing special.
I was confused by two things. First, section 3.3 states that majority write concern guarantees that a write is flushed to disk by any replica that ack'd the write. I thought this was determined by the value of the j option in writeConcern. Second, section 3.5.2 is about causal consistency and that (causal reads feature and logical clocks) seems like overkill when shards aren't used. If you want to avoid going back in time when moving from a primary to a secondary isn't it sufficient to remember the point-in-time at which primary queries are done? But maybe that is just a simple logical clock.

Wednesday, September 4, 2019

Tunable Consistency in MongoDB - reviewing a VLDB paper

This is a review of Tunable Consistency in MongoDB from VLDB 2019. It is worth reading and I appreciate that MongoDB has recently published several conference papers. I am not an expert on this topic. For expertise see Daniel Abadi, Kyle Kingsbury and Peter Bailis. Henrik can be added to the list with a few more blog posts.

MongoDB vs MySQL

MongoDB is a NoSQL DBMS that makes it easy to run sharded replicasets. While the NoSQL part of it is nice the sharded replicaset part of it is amazing. I hope that MySQL eventually gets similar support for sharded replicasets including readConcern and writeConcern options.

I previously compared MySQL semisync replication with MongoDB. With MongoDB the behavior for reads can be tuned separate from writes while MySQL combines them. In the MySQL implementation for lossless semisync a write is not locally visible until a replica acks. In the MongoDB implementation the write is committed locally and then replicated. The replicas apply writes as soon as possible without waiting for a distributed commit protocol. The key to making all of this work is controlling read visibility via an MVCC database engine courtesy of point-in-time reads.

The Review


  • With many options comes a need for more documentation and tutorials. While I appreciate splitting read and write semantics into separate options, the feature space is large and figuring this out is harder.
  • The paper states that the gold standard is linearizability. Well, this is a DBMS paper so maybe that should be changed to serializability.
  • I was confused by the difference between majority and linearizable reads. AFAIK snapshot reads are similar to linearizable (both wait for confirmation that the master really is the master) while majority reads don't have that extra wait.
  • I was confused by "The transaction commit operation accepts a write concern, which determines ... and its constituent read and write operations" because commit occurs after the reads so how could it effect them. As the paper promises, that is explained later.
  • MongoDB is a great case study in adding stronger consistency to an async replication system. It continues to use async replication, yet it now provides stronger consistency on demand.
  • I think that propagated means applied in the description of the w option for writeConcern. This means that a replica acks after applying a change -- either before or after making redo durable depending on the j option. AFAIK the more common option is to ack after shipping the change to the replicas log but before applying the change. However, I prefer what MongoDB does. Maybe commenters will correct my perception of what is more common.
  • To reduce write-write conflicts MongoDB uses the latest locally committed transaction as the point-in-time to do reads for read/write operations that use w:majority and for multi-statement transactions that use snapshots. The alternative was to use the older latest majority commit point-in-time. See section 5 from the paper. Therefore there is a wait before returning to the user for that locally committed timestamp to be committed to a majority of the replica set. This is true even for read-only transactions. So MongoDB can make reads wait. Obviously it can make writes wait before returning to the user doing the write for w:majority. An excellent CockroachDB blog post explains that it too can make reads wait while Spanner can make writes wait. 
  • Consistency is tunable in MongoDB. With writeConcern you can determine whether a write might wait. With readConcern you can determine whether a read might wait.
  • Some of the wait for reads is to confirm that the master from which the read has been done is still the master. I wonder if a master lease could have been used to avoid that wait at the cost of making failover slower. Which cost do you prefer?
  • Replication remains async. Writes are committed locally (and visible to others with the appropriate readConcern options) regardless of the write concern. These are shipped to replicas ASAP and applied by replicas ASAP. This means that a replica can apply a change that has to be undone during master failover because it was never committed to a majority. MongoDB has two ways to rollback that replica -- rollback WiredTiger to an older point in time or undo the extra replication events. Rollback sounds easy while undo is complicated.
  • Strongly consistent writes don't delay weakly consistent writes. A strongly consistent write is done locally then releases row locks then waits for replicas to ack.
  • MongoDB doesn't like long running transactions because WiredTiger must keep undo in memory to satisfy all active transactions. MongoDB kills snapshot reads that have been running longer than one minute. For non-snapshot reads it can advance the snapshot in the middle of the read operation. One side-effect of this is that you will have a hard time implementing a consistent logical backup tool like mysqldump.

InnoDB repeatable read in a nutshell

InnoDB repeatable read (RR) is complicated.  This is my short description. It isn't exactly correct.


Processing of statements within a transaction for InnoDB RR:
  1. Statements that are read-only and x-lock free use snapshot isolation
  2. Other statements use read committed (RC) with gap locks
  3. All statements observe uncommitted changes from the transaction
It would have been simpler to use snapshot isolation as-is but that leads to more write-write conflicts.

Sources of Confusion

  • The repeatable read snapshot is created at the start of the first real statement (SELECT, INSERT, etc) and not by BEGIN or START TRANSACTION. It is created by START TRANSACTION WITH CONSISTENT SNAPSHOT.
  • Some RR transactions use SELECT to find rows to modify followed by UPDATE when they should have used SELECT ... FOR UPDATE. Without the FOR UPDATE a SELECT statement uses the snapshot from transaction start while the UPDATE uses RC and sees all committed changes. So SELECT might observe different rows from the UPDATE that follows.
  • I never tried to learn whether SELECT ... FOR UPDATE and UPDATE use 1) use a snapshot from statement start or 2) see any row that is committed at the time statement processing encounters that row. I assume that it is #2. The manual states "reads the latest available data".
  • The rules for gap locks and next-key locks are non-trivial. I don't claim they are more complicated than they need to be because concurrency is hard and they provide useful semantics by preventing phantom reads and more. While the academic community tends towards stronger isolation the common choices in popular systems are weaker (RC in Postgres and Oracle, RR in InnoDB, ? in SQL Server).
When does InnoDB RR implement RR?

I am not an expert on isolation. Experts have yet to review this post. I hope they do. The ANSI spec requires that RR prevent dirty read, non-repeatable read and phantom reads. Alas, InnoDB can do reads (evaluate the WHERE clause) at different points in time (see above) which can allow for some of non-repeatable reads and phantom reads unless you are careful to use the FOR UPDATE clause.

Thursday, June 20, 2019

Open Core Summit and OSS glue code

I am optimistic about the Open Core Summit. It can be something that benefits users, startups, the source-available community and the OSS crowd. The summit has many of the important people from the open core community. It is an opportunity for them to collaborate -- form a foundation, reduce open core license proliferation, discuss the next license to be reviewed by OSI and most importantly -- fix the open core marketing approach.

I appreciate that startups put so much time and money into building interesting system software that is shared either as OSS or source-available. I haven't enjoyed much of the marketing about the challenges that cloud creates for VC-funded OSS. I am sure cloud makes it harder but the approach has been too negative with too much snark directed towards OSI and AWS. There is a better way.

Glue code

It is all about the glue code.

Stateful services are a hard problem. It is not trivial to scale MySQL by enabling a small DBA team to support a growing database deployment. I am sure the same is true for other DBMS. This is done with some changes to MySQL and a huge amount of glue code. While the GPL doesn't require sharing of the diffs to MySQL my teams have always been happy to do that. But the glue code is the secret sauce and neither the GPL nor the AGPL require it to be shared. The SSPL would change that, although I am wary of the SSPL given uncertainty about how far down the stack the sharing requirement extends.

While the glue code is the secret sauce I wonder whether it has any value outside of a web-scale company.
  1. The glue code isn't portable as it depends on other systems internal to a web-scale company.
  2. Documentation for internal systems is frequently not good and success depends on the shared knowledge of the current teams. Even with access to the internal dependencies the glue code is unusable without the team that knows how to use it.
Therefore I am more interested in OSS glue code that is useful to a community and less interested in licenses that force the (unusable to me) glue code to be published. The glue code should assume environments open to the public -- public clouds and k8s.

What is the state of OSS glue code for MySQL? MySQL needs it to remain competitive. Much of the glue code is baked into MongoDB so that it is easier to scale MongoDB clusters even if you aren't an Atlas customer.

Thursday, June 13, 2019

Interesting indexing in Rockset and MongoDB

I was lazy today and asked about new indexing features in Rockset and MongoDB. They share a valuable goal which is better indexing for the document data model (think less schema, not schema-less). How do you index documents when you don't know all of the attributes that will be used? MongoDB now supports this via a wildcard index and Rockset via converged indexing.

Wildcard indexing in MongoDB lets you specify that an index should be maintained on all, most or some attributes in a document. By most I mean there are options to exclude attributes from a wildcard index. By some I mean there are options to limit this to attributes that start with certain prefixes. Read the docs for more.

Converged indexing in Rockset indexes all attributes. There are no options to include or exclude attributes. This makes the product easier to use at the cost of more IO for index maintenance and more storage for larger indexes. Note that Rockset uses the RocksDB LSM which reduces the cost of index maintenance and might also use the excellent ZStandard compression.

Wildcard and converged indexes do not support compound indexes. For the document { a:1, b:2 } there will be two index entries: a=1 and b=2. There is no way to get an index entry for (a=1, b=2) or (b=2, a=1). If you want a compound index with MongoDB the existing index features can be used. See below (editorial 1) for compound indexes and Rockset.

Implementation details

This section is an educated guess. I don't know enough MongoDB and Rockset internals to claim this with certainty. I ignore the complexity of support for datatypes. In the ideal world all values can be compared via memcmp.

For a traditional index limited to a specified attribute the index entries are of the form (value, pointer) where pointer points to the row and can be the primary key value or (file name, file offset).

This is more interesting for wildcard/converged indexes. I assume that the attribute name is the leading field in each index entry so that the entry is of the form (attribute name, value, pointer). The common way to use such an index is to have an equality predicate on attribute name which is satisfied when the index is queried with predicates like attributeName relOp value. Examples of such predicates are a=2, a>2 and a<=2.

A smart person (Dr Pavlo) mentioned the use of skip scan for these indexes. That could be used to query the index and find documents with any attribute equal to a specific value. That is a less likely use case but still interesting.

Wildcard/converged indexes aren't free. Putting the attribute name in every index entry makes index entries larger and consume more space in memory and on storage. Block compression reduces some of this overhead. Index prefix compression in WiredTiger and RocksDB also helps but at the cost of more CPU overhead.

Storage differences

Up to now I have been describing the search index. In this section I will describe the document storage.

MongoDB stores the document via the storage engine which will soon be WiredTiger only although I hope MongoRocks returns. I assume that WiredTiger with MongoDB is row-wise so that each document is (usually) a contiguous sequence of bytes on some disk page.

Rockset stores each document twice -- row-wise and column-wise. Alas, this gets complicated. The row-wise format is not the traditional approach with one thing in the storage engine per document. Instead there is one thing per attribute per document. This is similar to the CockroachDB approach. I prefer to still call this row-wise given that attributes from a document will be co-located in the LSM SSTs. I am also grateful for the many great blog posts from CockroachDB that I can reference.

With two copies of each document in the base storage there is more storage overhead. Fortunately that overhead is reduced courtesy of the write efficiency and compression friendliness of an LSM.

The Rockset blog post does a great job of explaining this with pictures. I do a worse job here without pictures. For the document { pk:1, a:7, b:3 } when the primary key is pk then the keys for row-wise are R.1.a and R.1.b and for column-wise are C.a.1 and C.b.1. The row-wise format clusters all attributes for a given document. The column-wise format clusters all values across documents for a given attribute. The row-wise format is efficient when most attributes for a document must be retrieved. The column-wise format is efficient for analytics when a given attribute across all documents must be retrieved.

Editorial 1

I interpret the MongoDB docs to mean that when a query uses a wildcard index it cannot use any other index and the wildcard index will only be used for a predicate on a single attribute. I expect that to limit the utility of wildcard indexes. I also expect MongoDB to fix that given how fast they reduce their tech debt. The limitations are listed below. The 1st won't be fixed. The 2nd and 3rd can be fixed.
  1. Compound wildcard indexes are not supported
  2. MongoDB cannot use a non-wildcard index to satisfy one part of a query predicate and a wildcard index to satisfy another.
  3. MongoDB cannot use one wildcard index to satisfy one part of a query predicate and another wildcard index to satisfy another.
I assume that Rockset can combine indexes during query evaluation given their focus on analytics. Thanks to the Rockset team I learned it supports index intersection. It also supports composite indexes via field mappings (functional indexes).

Editorial 2

An open question is whether an LSM can do clever things to support analytics. There has been some work to show the compression benefit from using column-wise storage within a RocksDB SST for the larger levels of the RocksDB LSM. Alas, the key RocksDB workloads have been OLTP. With the arrival of Rockset there is more reason to reconsider this work. There can be benefits in the compression ratio and reduced overhead during query processing. Vertica showed that it was useful to combine a write-optimized store for recent writes with a read-optimized store for older writes. An LSM already structures levels by write recency. Perhaps it is time to make the larger levels read-optimized especially when column-wise data is inserted to the LSM.

Update - read paper years ago then forgot that Kudu combines LSM + columnar.

Editorial 3

The previous section is mostly about being clever when storing column-wise data in an LSM to get better compression and use less CPU during query evaluation. This section is about being clever when storing the search index. 

The search index is likely to have many entries for some values a given attribute. Can an LSM be enhanced to take advantage of that for analytics workloads? In other storage engines there are two approaches -- bitmap indexes and RID-lists. Adapting these for an LSM is non-trivial but not impossible. It is likely that such an adaptation would only be done for the larger levels of the LSM tree.

Friday, May 17, 2019

index+log: implementations

My take on index+log systems like WiscKey is that they are neither better nor worse than an LSM - it all depends on your workload. But I am certain that we know much more about an LSM than about the index+log systems. Hopefully that changes over time as some of them are thriving.

The first index+log system that I read about was Berkeley DB Java Edition. The design paper is worth reading. Since then there have been a few more implementations and papers that I describe here. This list is probably incomplete: Bitcask, ForestDB, WiscKey, HashKV, TitanDB and RocksDB BlobDB.

At this point the systems that are getting into production, TitanDB and BadgerDB, use an LSM for the index. I wonder if an index structure that supports update-in-place would be better especially when the index must be cached because I expect the CPU read-amp for an LSM to be about 2X larger than for a b-tree and a b-tree supports update-in-place which makes it easier to relocate values during GC.

While I like index+log systems I think that papers and marketing tend to overstate LSM write-amp. For production RocksDB I usually see write-amp between 10 and 20. I expect that index+log could achieve something closer to 5. This paper from CMU explains one reason why per-level write-amp in an LSM is less than the per-level fanout (less than 10). Write skew is another reason.

The systems

Bitcask was part of the Riak effort.
  • The index is an in-memory hash table. The index isn't durable and the entire log has to be scanned to rebuild it on startup -- whether or not this was after a crash or a clean shutdown. The slow startup is a problem.
  • The value log is circular and GC copies live values from the tail to the head of the log. Liveness is determined by an index search. 

ForestDB was a finalist in the SIGMOD 2011 student programming contest. Eventually the project and creator moved to Couchbase. It is worth reading about here and on the github page. I published blog posts that compare ForestDB and RocksDB: 1, 2, 3 and 4. Google finds more interesting things to read.
  • The index is a space-efficient trie.
  • The value log might have log segments. GC copies live values to the head of the log. Liveness is determined by an index search.

WiscKey is described as an LSM with key-value separation and made popular the term key-value separation. I put it in the index+log family of index structures.
  • The index is an LSM. There is no redo log for the index as it can be recovered from the head of the value log.
  • Kudos for many references to amplification factors. The paper uses bytes read for read-amp. I prefer to consider both IO and CPU for read-amp with key comparisons for CPU and storage reads for IO.
  • It doesn't mention that it has more cache-amp than an LSM, but few papers mention that problem. Shrinking the LSM by keeping large values separate doesn't make the index parts of it (filter and index blocks) easier to cache as they are already separate from the data blocks. There is more to cache with index+log as I describe here.
  • It claims to overcome the (worst-case) problem of one storage IO per KV pair on a range scan by fetching in parallel. Assuming the storage device has enough spare IO this might hide the problem but it doesn't fix it. With many workloads there isn't spare IO and extra IO for reads also means extra CPU for decompression.
  • The value log is circular and single-threaded GC copies live values to the head of the log. Liveness is determined by an index search. I assume that multi-threaded GC is feasible.
  • The paper isn't clear about the total write-amp that might occur from both the first write to the value log and GC that follows.
  • Compression isn't explained.

BadgerDB is a golang implementation, and much more, of the WiscKey paper.
  • It has many features and many production use cases. This is impressive. 
  • GC is scheduled by the user. Based on Options.NumCompactors I assume it can be multi-threaded.
  • The docs state that the LSM can be served from RAM because the values are elsewhere. That is true but I don't consider it a feature. It must be in RAM to avoid IO from liveness queries done by GC. An LSM isn't a monolithic thing. There are index blocks, data blocks and filter blocks and most of the LSM, data blocks from the max level, don't have to be cached. 
  • There is extra work on reads to find values that have been moved by GC. See the comments about BadgerDB here.

HashKV is an interesting paper that avoids index queries during GC.
  • Hash-based data grouping distributes KV pairs by hash into one of N logs. GC is probably done by scanning a log twice -- once to get the keys and the second time to relocate the live values. A value is live when the most recent key is not a tombstone. A value might be live when needed for a snapshot. GC doesn't do index searches so the index doesn't have to be cached to make GC efficient but you might want to cache it to avoid doing index IO on queries -- and this index is still much larger than the block index for an LSM.
  • Hotness awareness copies cold values to a cold segment to avoid repeated GC for a value that doesn't get updated or deleted. A header for the value is kept in the non-cold log.
  • Small values are stored inline in the LSM.
  • I am curious if more log groups means more write-amp. See my comment about fsync in a previous post.
  • I am curious whether managing the hash buckets is easy. The goal is to make sure that keys for a segment group fit in memory. The range and number of buckets must change over time. Does this have anything in common with linear and extendible hashing?

TitanDB is part of TiDB and TiDB is thriving.
  • A WAL is used for new writes. This might make it easier to compress data on the first write to the value logs.
  • GC appears to do index searches to determine liveness.
  • Compression is per-record. I hope this does per-block in the future.
  • It might let the user tune between space-amp and write-amp via discardable_ratio.
  • This is compatible with most of the RocksDB API.k

RocksDB BlobDB is an extension to RocksDB that uses log segments for large values and stores small values in the LSM. GC copies live values and liveness is determined by an index search.

Future work

Future work for index+log systems includes:
  • Determine whether a b-tree is better than an LSM for the index structure
  • Determine whether the HashKV solution is the best way to avoid liveness queries during GC.
  • If an LSM is used for the index structure determine efficient ways to support relocating values during GC without extra overheads and complexity during read processing.
  • Determine whether clever things can be done during GC.
    • Block compression is easier than on first write.
    • Hot/cold value separation is possible (see HashKV). This is an example of generational GC even if we rarely mention that for index structures.
    • Values in a log segment can be ordered by key rather than by time of write during GC. GC can also merge multiple ordered segments to create longer sorted runs. I wonder if it is then possible to use block indexes (key+pointer per block rather than per row) to reduce cache-amp for such log segments.

Thursday, May 16, 2019

index+log: open issues

This post is about open issues for the index+log family of index structures. See past posts here and here for more details on index+log. The success of LSM has lead to several solutions that use an LSM for the index. I am curious if that is a good choice when implementing index+log from scratch. It is a great choice when starting with an LSM and then adding index+log.

Open Issues

The open issues for index+log include:
  • Is GC multi-threaded?
  • Does value log GC require index queries?
  • Must the index be cached?
  • Is block compression possible?
  • With an LSM as the index how is relocating an entry in the value log supported?

Is GC multi-threaded?
  1. I hope so.
  2. It hasn't been in some of the index+log papers/systems.
Does value log GC require index queries?
  1. Yes, in most of the index+log papers/systems this is required to determine whether a value is live when scanning the value log during GC
  2. HashKV proposed a solution that doesn't require such queries. It is a wonderful paper but I think there are more things to figure out. The general idea is to hash KV pairs into buckets such that all keys for a bucket will fit in memory during GC. GC then reads the value log segments for a bucket twice -- first to get the keys, second to copy the live KV pairs into a new log segment. I wonder if managing the hash bucket ranges has things in common with linear and extendible hashing.
Must the index be cached?
  1. The index for index+log might be 10X larger than for an LSM because an LSM uses a block index while index+log uses a row index. A block index has a key+pointer per block and a row index has that per row. An LSM only needs a block index because rows are clustered by key.
  2. To do at most one IO per point query for a database larger than memory the index must be cached for index+log to avoid index IO as the IO is spent reading the value log. If the <= 1 IO / query constraint is then cache-amp is larger for index+log compared to LSM because the index+log index is larger (see #1).
  3. If value log GC queries the index to determine whether each value is live then this query shouldn't do IO or GC will be too slow and inefficient. This is another reason for caching the index+log index.
  4. If the index must be cached then maybe an LSM isn't the best choice. Consider an index structure optimized for a main-memory DBMS.
Is block compression possible?
  1. I hope so but this hasn't been explained in some of the index+log papers/systems.
  2. Per-record compression can be done instead of block compression. That will have a lower compression rate but less CPU overhead when decompressing on read.
  3. It might be hard to do block compression the first time a KV pair is written to a value log. One option is to write to a redo log until enough data is available to do block compression and then write to the value log. Another option is to defer block compression until KV pairs are rewritten during GC.
When the index is an LSM what happens when values are moved?
  1. This is an issue that I learned about via CockroachDB. I should have figured it out long ago but mistakes happen.
  2. The LSM tree is read in order -- top to bottom with leveled compaction and left to right with tiered compaction. This guarantees that the correct result is returned with respect to visibility. If the first entry for a key is a tombstone then the search can stop (ignoring snapshot reads).
  3. Value log GC moves live KV pairs to new log segments. To find the KV pair after the move either the index entry must be updated or the index entry must have a logical value log key and then another index is needed to map the logical value log key to a physical value log (filename, offset). 
  4. Updating the LSM index entry to reference the new value log location (filename, offset) can be done by inserting a new KV pair into the LSM but that either breaks read consistency semantics or complicates read processing. It would break LSM read processing because inserting back into the LSM implies this is a new write, but it just a move for GC. Something other than an LSM that supports update-in-place makes this easier.
  5. Details on using a logical value log key are explained in a CockroachDB github issue.

Wednesday, May 15, 2019

Index+log, v2

I put most index structures into one of three categories -- page-based, LSM or index+log. My focus is on databases larger than memory and I might be ignoring categories used for main memory DBMS. Over the past decade index+log has gotten more attention and this is my second attempt at explaining it.

My definition of index+log is simple -- data is appended to a log on each write, index entries point into the log and GC scans the log to copy live values and discard others. The log can be one file or many log segments. GC might have to search the index to determine whether a value is live. The value written to the log might include the key.

Bitcask was the first index+log system that I noticed but I assume there are earlier examples and just found one -- Berkeley DB Java Edition. While WiscKey made popular the term key-value separation and is presented as an LSM variant, I put it in the index+log category. Other interesting index+log systems include RocksDB BlobDBTitanDBForestDB and Faster. While many of the tree-based solutions use an LSM for the index that is not required by index+log and an LSM is not used by Berkeley DB Java Edition or ForestDB.

For an index+log solution that must cache all of the index and query the index during GC I suspect that an LSM is not the best choice for the index. Although if you already have an LSM (see RocksDB BlobDB) then I get it.

The summary of my LSM vs index+log comparison is:
  1. index+log has less write-amp but more space-amp
  2. index+log has much more cache-amp, maybe 10X more
  3. index+log might have more deferred CPU write-amp
  4. I have yet to validate my claims by running benchmarks with index+log implementations.
Note that #1 is a feature, #2 is not a feature and for #3 it depends. The key point is that the cost of faster writes from index+log is more cache-amp (more memory) and more IO for range queries. In production with RocksDB I frequently see write-amp=15 with space-amp=1.1. I assume that could be reduced with index+log to write-amp ~=5 and space-amp ~= 1.3. It might be possible to avoid or reduce the impact of #2 and #3 in future index+log implementations.

Amplification Factors

I am speculating on read, write, space and cache amplification for index+log because I have little hands on experience with index+log implementations. Another reason for speculation is that index+log allows for different structures for the index (b-tree, LSM, etc) which affects some of the estimates.

The amplification factors are:
  • cache - the cache-amp for index+log is likely to be much larger than for an LSM. To achieve at most one IO per point query index+log might need 10X (or more) memory versus an LSM. Clustering values in key order doesn't just make range queries faster, it also means an LSM only needs block indexes in memory (key+pointer per block) while index+log needs a key+pointer in memory for every KV pair in the database. When there are 10 KV pairs per block then this is 10X more memory. Even when the database has hot and cold keys they are likely to be interleaved on the same index leaf pages --  so all of those leaf pages must be in memory to avoid doing two IOs (one for the leaf page, one for the value) on a point query.
    • There is additional data that an LSM and b-tree need in memory to satisfy the one IO per query constraint and I described that in previous posts (mostly things from the non leaf/max level).
    • It might be valid to debate whether my one IO per point query constraint is valid, but this blog post is already long.
    • Another reason for the index to be cached is to avoid doing IO during GC when index searches are done to determine whether KV pairs are live.
  • space - I ignore space-amp for the index and focus on the log because the log is usually larger. With index+log the user can trade between write-amp and space-amp. With the variable pct_full representing the percentage of live data in the database (a value between 1 and 100) then:
    • space-amp = 100 / pct_full
    • write-amp = 100 / (100 - pct_full)
    • Just like with an LSM, previously written KV pairs are rewritten during GC with the index+log approach. Fortunately this is done less often with index+log. 
    • I assumed that block compression is used but that is harder to implement for index+log. The WiscKey paper doesn't describe a solution and the HashKV paper suggests using per-record compression, which will have a lower compression rate versus block as used by an LSM. I assume block compression can be done for index+log but it isn't trivial.
    • To explain the estimates when pct_full=X assume that all log segments have X% live data (yes, this is naive). When GC is done on a log segment X% is copied leaving (100-X)% free space in the newly written log segment. So in total 100% of a log segment is written for each (100 - pct_full)% of new data, which is the formula above.
    • Thus with pct_full=90 then space-amp is 1.1 while write-amp is 10. Comparing these with a leveled LSM the space-amp is similar while the write-amp is slightly better than what I see in production. To get a write-amp that is significantly better the cost is more space-amp. For example with pct-full=75 then write-amp=4 and space-amp=1.33. 
  • read (CPU) - see here for range seek/next. The summary is that when an LSM is used for index+log then the costs are similar to an LSM. When a b-tree is used for index+log then the costs are smaller.
  • read (IO) - see here for range seek/next. In the cache-amp estimate I assume that the index is cached so the only IO to be done is for the log. Therefore the index structure (LSM vs b-tree) doesn't matter.
    • point query - the IO read-amp is <= 1 because the log is not cached.
    • range seek - range seek doesn't do IO when the index is cached
    • range next - this is much larger for index+log than for an LSM because it might do one IO per call to range next because rows are not clustered by key in the log. When data is compressed then there also is the CPU overhead for decompression per KV pair.
  • write - by write I assume update (read-modify-write) rather than a blind write.
    • immediate CPU - the cost of an index search. See the section on read CPU for point queries above.
    • immediate IO - the cost of an optional redo log write for the index structure and then writing the (value) log. Note that the minimum size of a write done by the storage device might be 4kb even if the data written is much smaller. Doing an fsync per 128 byte value might have a write-amp of 32 if that write is really forced to storage and doesn't just linger in a capacitor backed write cache.
    • deferred CPU - the deferred CPU write-amp is the cost of index searches done during GC, unless the HashKV approach is used. With pct_full=75, write-amp=4 and space-amp=1.33 then GC is done ~4 times for each key and the deferred CPU cost is 4 index searches. When live KV pairs are copied by GC then there is also the CPU and IO overhead from updating the index entry to point to the new location in the log.
    • deferred IO - this is determined by the percentage of live data in the database. With pct_full=75 it is 4.

Thursday, May 9, 2019

CRUM conjecture - read, write, space and cache amplification

The RUM Conjecture asserts that an index structure can't be optimal for all of read, write and space. I will ignore whether optimal is about performance or efficiency (faster is better vs efficient-er is better). I want to use CRUM in place of RUM where C stands for database cache.

The C in CRUM is the amount of memory per key-value pair (or row) the DBMS needs so that either a point query or the first row from a range query can be retrieved with at most X storage reads. The C can also be reported as the minimal database : memory ratio to achieve at most X storage reads per point query.

My points here are:
  • There are 4 amplification factors - read, write, space and cache
  • CRUM is for comparing index structure efficiency and performance
  • Read and write amplification have CPU and IO parts
  • Write amplification has immediate and deferred parts
Many X is faster than Y papers and articles neglect to quantify the tradeoffs made in pursuit of performance. I hope that changes and we develop better methods for quantifying the tradeoffs (a short rant on defining better).

Amplification factors (RUM -> CRUM) are used to compare index structures. Values for the factors are measured for real workloads and estimated for hypothetical ones. The comparison is the most valuable part. Knowing that the deferred CPU write-amp on inserts for a b-tree is 30 is not that useful. Knowing that it is 3X or 0.5X the value for an LSM is useful.

Workload matters. For estimates of amplification I usually assume uniform distribution because this simplifies the estimate. But there is much skew in production workloads and that impact can be measured to complement the estimates.

Read Amplification

This post is an overview for read-amp. This post explains it in detail for an LSM. There are two parts to read-amp -- CPU and IO. Thus for each of the three basic operations (point query, range seek, range next) there are 2 values for read-amp: CPU and IO. I have yet to consider deferred read-amp and by read-amp I mean immediate read-amp.

Metrics for CPU read-amp include CPU time, bytes/pages read and key comparisons. I use key comparisons when predicting performance and CPU time when running tests. I have not used bytes/pages read. While key comparisons are a useful metric they ignore other CPU overheads including hash table search, bloom filter search, page read and page decompress.

Metrics for IO read-amp include bytes read and pages read. I use pages read for disk and bytes read for SSD because disks are IOPs limited for small reads. IO read-amp implies extra CPU read-amp when the database is compressed. Decompressing pages after storage reads can use a lot of CPU with fast storage devices and even more with zlib but you should be using zstd.

With estimates for hypothetical workloads I assume there is a cache benefit as explained in the Cache Amplification section. This is likely to mean that comparisons assume a different amount of memory for index structures that have more or less cache-amp. For real tests I mostly run with database >> memory but don't attempt to use the least memory that satisfies the cache-amp X reads constraint.

Write Amplification

This post is an overview for write-amp. This post explains it in detail for an LSM. Write-amp has two dimensions: CPU vs IO, immediate vs deferred. For each operation (insert, delete, update) there are 4 values for write-amp: immediate CPU, deferred CPU, immediate IO and deferred IO.

The immediate write-amp occurs during the write. The deferred write-amp occurs after the write completes and includes writing back dirty pages in a b-tree and compaction in an LSM.

Possible metrics for CPU write-amp include bytes written, pages written, key comparisons and pages/bytes (de)compressed. Bytes and (in-memory) pages written are useful metrics for in-memory DBMS but my focus is on databases >> memory.

Possible metrics for IO write-amp include bytes written and pages written. These can be estimated for hypothetical workloads and measured for real ones. The choice between bytes or pages written might depend on whether disk or SSD is used as one is limited by ops/s and the other by transfer rate. If you use iostat to measure this then figure out whether Linux still counts bytes written as bytes trimmed.

Examples of deferred and immediate write-amp:
  • The InnoDB change buffer is deferred IO and CPU. Checking the change buffer and applying changes is deferred CPU. The deferred IO is from reading pages from storage to apply changes.
  • For a b-tree: page writeback for a b-tree is deferred IO, compression and creating the page checksum are deferred CPU, finding the in-memory copy of a page is immediate CPU, reading the page on a cache miss is immediate IO.
  • An LSM insert has immediate/deferred IO/CPU. 
    • Immediate CPU - key comparisons for memtable insert
    • Immediate IO - redo log write
    • Deferred IO - reading uncached pages for input SSTs and writing output SSTs during compaction
    • Deferred CPU - decompression, compression and key comparisons while merging input SSTs into output SSTs during compaction. Note that compaction does a merge, not a sort or merge+sort.

Space Amplification

Space-amp is the size of the database files versus the size of the data, or the ratio of the physical to logical database size. An estimate for the logical size is the size of the uncompressed database dump with some adjustment if secondary indexes are used. The space-amp is reduced by compression. It is increased by fragmentation in a b-tree and uncompacted data in an LSM.

It is best to measure this after the DBMS has reached a steady state to include the impact of fragmentation and uncompacted data.

Cache Amplification

I briefly described cache-amp in this post. The cache-amp describes memory efficiency. It represents the minimal database : memory ratio such that a point query requires at most X storage reads. A DBMS with cache-amp=10 (C=10) needs 10 times more memory than one with C=100 to satisfy the at most X reads constraint.

It can be more complicated to consider cache-amp for range seek and range next because processing them is more complicated for an LSM or index+log algorithm. Therefore I usually limit this to point queries.

For a few years I limited this to X=1 (at most 1 storage read). But it will be interesting to consider X=2 or 3. With X=1:
  • For a b-tree all but the leaf level must be in cache
  • For an LSM the cache must include all bloom filter and index blocks, all data blocks but the max level
  • For an index+log approach it depends (wait for another blog post)

Other posts

Related posts by me on this topic include: