Wednesday, September 25, 2019

Reducing write and space amplification in a b-tree

This started as a review of FineLine: Log-structured Transactional Storage and Recovery from VLDB 2019. The paper is wonderful but dense and I didn't take the time to understand all of it. It uses a log-structured approach to avoid the complexity of ARIES for recovery. While it can support a variety of index structures a b-tree is the obvious target.

While it wasn't a stated goal for the paper, the FineLine approach reduces space and write amplification for a b-tree. Maybe a focus on that is motivation for another paper from the authors. An LSM with leveled compaction has much less space and write amplification than a b-tree at the cost of more CPU. Reducing space and write amplification for a b-tree makes it more competitive when storage efficiency is a priority. The paper has good performance results for databases less than 10X the size of RAM. I am curious what happens when the database is 100X the size of RAM.

I look forward to corrections and apologize for not distinguishing between page and block. Fortunately I don't have to write about lock vs latch here. I also look forward to the next paper from the authors.

Improving b-tree storage efficiency

I previously explained how the InnoDB b-tree can use 2X more space and write 10X more bytes per transaction to storage compared to the MyRocks LSM. This is a big deal if you prefer to use less SSD and worry about endurance (because QLC is coming). But there are several things that can be done to make a b-tree better at storage efficiency.
  1. Variable length pages
  2. Block and index prefix compression
  3. Defer dirtying pages and page write-back
A traditional b-tree subject to random updates will have leaf pages that are <= 70% full. This wastes at least 30% of storage and space amplification (space-amp) is at least 1.42X (100 / 70). Variable length pages can avoid that waste. By variable length pages I mean that when the compressed page is X bytes then it uses at most X+Y bytes on disk where Y is <= 512. One reason for such a small value for Y is that a common page size for OLTP is 4kb or 8kb. The alternatives to supporting variable length on-disk pages have been to rely on file system hole punch (see here and here for my bad experience with that in InnoDB) or a proprietary file system. Variable length pages still cause some space-amp from external fragmentation as there will be wasted space that has yet to be reclaimed.
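A sketch of the space-amp arithmetic above. The page size, allocation unit and fill fraction are the values mentioned in this post; the 2867-byte compressed page is a made-up example:

```python
# Space-amp sketch: fixed-size pages waste the empty fraction of each page,
# while a variable length page rounded up to a 512b boundary wastes < 512b.

PAGE = 4096   # common OLTP page size
ALLOC = 512   # allocation unit Y for variable length pages

def space_amp_fixed(fill_fraction):
    # a 70% full page stores 0.7 * PAGE useful bytes in PAGE on-disk bytes
    return 1.0 / fill_fraction

def space_amp_variable(payload_bytes):
    # on-disk size is the payload rounded up to the next ALLOC boundary
    on_disk = ((payload_bytes + ALLOC - 1) // ALLOC) * ALLOC
    return on_disk / payload_bytes

print(space_amp_fixed(0.70))       # the 100/70 figure above
print(space_amp_variable(2867))    # 2867b compressed -> 3072b on disk
```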

Both index prefix compression and block compression are needed. Index prefix compression reduces the size of pages before and after compression. These should be optional because some workloads don't want the CPU overhead from block compression (even with fast lz4 and zstd) or from undoing prefix compression. Block compression usually leads to variable length pages.
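A minimal sketch of index prefix compression for the sorted keys in a page. The keys and the encoding are my invention; real engines use more compact on-page formats:

```python
# Each key is stored as (number of bytes shared with the previous key, suffix).
# Sorted keys share long prefixes, so the stored suffixes are short.

def prefix_compress(sorted_keys):
    out, prev = [], b""
    for k in sorted_keys:
        n = 0
        while n < min(len(prev), len(k)) and prev[n] == k[n]:
            n += 1
        out.append((n, k[n:]))
        prev = k
    return out

def prefix_decompress(entries):
    keys, prev = [], b""
    for n, suffix in entries:
        prev = prev[:n] + suffix
        keys.append(prev)
    return keys

keys = [b"user:1001", b"user:1002", b"user:1003", b"user:20"]
packed = prefix_compress(keys)
assert packed[1] == (8, b"2")             # shares "user:100" with prior key
assert prefix_decompress(packed) == keys  # lossless
```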

Variable length pages and compression help mostly with space-amp. To reduce write amplification (write-amp) a b-tree can defer making pages dirty and the write-back that must follow. The FineLine paper has new ideas for this. The old ideas have expensive side-effects (slow crash recovery) including using a large buffer pool with 99% dirty pages and using a huge redo log. I wonder what the side-effects will be for FineLine.

The OSS b-tree engines that I have used (InnoDB, WiredTiger) can do more to reduce space-amp. InnoDB supports block compression but not variable length pages (by my definition) so the implementation is complex and some space-amp savings are lost. WiredTiger supports variable length pages, block compression and index prefix compression. From the WiredTiger docs the on-disk page size is a multiple of allocation_size, which is >= 4kb, and that doesn't meet my requirement for variable length above. Regardless I am still a huge fan of InnoDB and WiredTiger. There is also more to be done to reduce write-amp in OSS engines. InnoDB fuzzy checkpoint combined with a large redo log can reduce write back. The InnoDB change buffer defers dirtying non-unique secondary index pages. WiredTiger doesn't support fuzzy checkpoint. I hope that changes. But neither have anything that can defer write-back like FineLine.


The goals for FineLine include avoiding the complexity of ARIES, instant recovery, larger than memory databases and a compromise between read and write optimized solutions -- better read performance than an LSM, better write performance than a b-tree.

There are a few logical concepts to grasp when reading the paper -- node and indexed log. FineLine uses node rather than page. A node has an ID and variable length size. Nodes can reference other nodes -- for a b-tree leaf nodes will be referenced by interior nodes. Log entries are either node format (a node image) or describe a change to a node. The indexed log is the database and supports append and fetch. The indexed log is unlikely to be implemented by a log file.
FineLine uses a single-storage approach. The indexed log is the database. For the paper the indexed log was implemented using a Foster b-tree, forked from ShoreMT, that might be a variant of a copy-on-write b-tree. Single-storage means there is only a log to write -- redo log records are written on commit and compaction runs in the background to merge redo log records and occasionally write page images (node-format log records) when there are many redo records for a given page. The paper describes a simple API for the indexed log -- append and fetch node.

Group commit is supported and all log records to be written are assigned a partition number and then ordered by node ID. I assume the partition number is monotonically increasing per commit group. A group commit creates a new partition in the indexed log. A bloom filter might also be created to make it faster to determine whether a partition includes records for a node. Only redo is written to the indexed log. There are no undo records. One reason for better write performance is less logging.

Adjacent partitions are merged by background threads. This has something in common with LSM compaction. The paper was vague but there are times where node images (node format log records) are written back to the indexed log when there are too many log records for that node. This can be done via cost-based decisions that consider read vs write efficiency. There was some work in Hekaton to figure out when to collapse per-node redo log records.

On a buffer cache miss the fetch operation must first find the most recent node image (node format log record) and then find/apply all of the redo log records for that node. In the worst case this means searching many partitions. Bloom filters can make this search faster.
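My mental model of fetch, as a sketch: scan partitions from newest to oldest, skip partitions that can't contain the node (the bloom filter probe, modeled here as a dict lookup), stop at the most recent node image and then replay the newer redo records in order. The names and record format are my invention, not the paper's API:

```python
# Partitions are appended oldest to newest; each maps node ID to log records.
# A record is ("image", node_state) or ("delta", fn) where fn applies a redo
# record to a node state.

class IndexedLog:
    def __init__(self):
        self.partitions = []

    def append_partition(self, records):
        self.partitions.append(records)  # one partition per group commit

    def fetch(self, node_id):
        deltas, image = [], None
        for part in reversed(self.partitions):       # newest partition first
            if node_id not in part:                  # bloom-filter style skip
                continue
            for rec in reversed(part[node_id]):
                if rec[0] == "image":
                    image = rec[1]                   # most recent node image
                    break
                deltas.append(rec[1])
            if image is not None:
                break
        state = image if image is not None else {}
        for apply_redo in reversed(deltas):          # replay oldest first
            state = apply_redo(state)
        return state

log = IndexedLog()
log.append_partition({7: [("image", {"a": 1})]})
log.append_partition({7: [("delta", lambda s: {**s, "b": 2})]})
assert log.fetch(7) == {"a": 1, "b": 2}
```

Compaction would merge adjacent partitions and occasionally replace a long chain of deltas with a fresh image, shortening this search.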

A buffer manager caches node images. Some of the node images might be expensive to recreate as explained in the previous paragraph. While the paper claims FineLine supports databases larger than memory I wonder how much larger because of the cost of recreating node images. The paper provides TPC-B results where the database is no more than 10X the size of memory. I am curious about 100X larger.

Friday, September 20, 2019

Review of Spanner: Becoming a SQL System

This is a review of Spanner: Becoming a SQL System, at least the bits of it that I understand. It is worth reading and is from SIGMOD 2017 but I just noticed it. I am impressed by the ability of Google to build large-scale systems. Spanner + SQL is used within Google and by external customers for OLTP and OLAP (HTAP) although I am curious whether it is performant for OLAP -- which is really a function of the time needed to improve the optimizer and execution engine for complex SQL. From the paper, SQL -- and a common SQL dialect across the DBMS within Google -- has greatly increased Spanner usage.

Cloud Spanner provides serializable isolation. The paper states this is done with pessimistic concurrency control and timestamps so I assume that reads take locks in read-write transactions to prevent changes to rows that have been read. I wonder whether conflicts are an issue for workloads with contention. It isn't clear to me whether CockroachDB uses read locks.

There are several SQL systems within Google -- F1, Cloud Spanner and Dremel. They implement the same SQL language thanks to a common front-end (parse, semantic analysis), library of scalar functions, test framework, feature tests (coverage and correctness) and random query generator for testing. I wonder what is to become of F1 which first did SQL on Spanner for Google Ads.

Range extraction is done at compile time and run time to prune table shards, determine which fragments of an index should be read and determine key ranges to lock. All DBMS do this at compile time. Doing this analysis at run time is less common. Distributed SQL has more overhead for data access so the overhead from this analysis is worthwhile.
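Shard pruning via range extraction can be sketched with sorted shard boundaries; the boundaries and query ranges here are made up:

```python
import bisect

# bounds[i] is the inclusive lower key of shard i; shard i covers
# [bounds[i], bounds[i+1]). Return the shards a key range must read.
def shards_for_range(bounds, lo, hi):
    first = max(bisect.bisect_right(bounds, lo) - 1, 0)
    last = min(bisect.bisect_right(bounds, hi) - 1, len(bounds) - 2)
    return list(range(first, last + 1))

bounds = [0, 100, 200, 300]                      # 3 shards
assert shards_for_range(bounds, 150, 250) == [1, 2]
assert shards_for_range(bounds, 0, 50) == [0]    # 2 of 3 shards pruned
```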

Restart tokens are used to allow the client library to hide transient failures even in cases where partial query results have been received. The token represents the internal state of query execution and works across upgrades. Intermediate query results are not persisted on the server side as the target workload is OLTP. The comdb2 DBMS also returns information to the client to hide transient failures.

Ressi is an LSM optimized for OLAP rather than OLTP. This uses the PAX idea within the LSM data blocks. Data within the LSM tree is ordered by key (row-wise) but within each data block is stored column-wise. This is a great compromise for HTAP -- one IO fetches all columns for a row but the vectorized format within the data block reduces CPU overhead during OLAP and the paper states that Ressi can operate directly on vectors in compressed form. Other features from Ressi include:
  • Large (multi-page) values are stored in separate files. I assume this is similar to the LOB indirection done by InnoDB.
  • There are active and inactive files. The active file has the current version for a key while the inactive file has older versions for a key. It wasn't clear to me how this is integrated into the LSM tree. My guess is that the split is done during LSM compaction and I am curious if this ever creates search overhead.
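The PAX idea described above, as a sketch: rows are assigned to a block in key order, but within the block each column's values are stored contiguously:

```python
# Rows are placed in the block in key order, but each column's values are
# stored contiguously within the block, which suits vectorized OLAP scans.

def build_pax_block(rows, columns):
    # rows: dicts with the same columns, already sorted by key
    return {c: [r[c] for r in rows] for c in columns}

def get_row(block, i):
    # one block read still recovers every column of a row (good for OLTP)
    return {c: vals[i] for c, vals in block.items()}

rows = [{"k": 1, "v": "a"}, {"k": 2, "v": "b"}]
block = build_pax_block(rows, ["k", "v"])
assert block["k"] == [1, 2]           # column-wise within the block
assert get_row(block, 1) == rows[1]   # row reassembly
```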
I am uncertain but think this means that a transaction can't read its own uncommitted writes. That seems like a source of confusion for external users.
Just like SQL syntax and semantics, transactional guarantees of database products vary widely from one system to another. Spanner supports very strong guarantees that establish a global order among transactions, called external consistency. Within a given read-modify-write transaction, however, Spanner imposes a constraint that all updates must be committed as the last step of the transaction. Overcoming this limitation requires supporting reading the uncommitted results within an active transaction. While we have seen comparatively little demand for this feature internally, supporting such semantics improves compatibility with other SQL systems and their ecosystems and is on our long-term radar. 
While the paper didn't cover pricing I wonder how Spanner compares to the alternatives from AWS and MSFT. Spanner charges per GB of storage while AWS Aurora charges for GB of storage and per IO. Is that a big deal? Do users pay for extra RAM to reduce their read IO with Aurora?

Wednesday, September 11, 2019

Optimal Tuning for an LSM

For a few years I have wanted to use math to determine optimal tuning for an LSM given a specific workload and constraints on efficiency. First I had to (re)learn math. I like calculus now while I didn't enjoy it in high school -- just like spinach.

By specific workload I mean the fraction of operations that are inserts, point queries and range reads. Note that if the workload has no range reads then an LSM might not be the best choice. By constraints on efficiency I mean the max write and space amplification that would be tolerated.

I decided to try constrained optimization and Lagrange Multipliers where the objective function was the average number of CPU comparisons per operation and the decision variables were the per-level fanouts for the LSM tree. In this case I am only optimizing for the CPU overhead, but I have to start somewhere. The eventual goal is to include CPU, memory and storage in the cost function.

I previously used Lagrange Multipliers to show how to set the per-level fanout to minimize write amplification. That result depends on assumptions about the per-level write amplification (the dependency is a new result) but the previous result is that the per-level fanout should be the same for all levels. Here I show that CPU overhead can be reduced by using larger per-level fanouts for larger levels.

  • Books with titles like Math for Economists are a great resource for learning math
  • Choosing a good value for the number of levels in the LSM tree is more significant than figuring out the optimal per-level fanouts.
  • Take time to understand when a problem is, or is not, worth optimizing

Is this worth solving?

To minimize the CPU overhead of a workload it is much more important to choose the number of levels in the LSM tree and the per-level compaction algorithms (tiered, leveled, leveled-N) than it is to optimize the per-level fanout as I describe here. While the math I describe below isn't the best way to optimize an LSM, the math was interesting to me so this blog post lets me revisit my work.

In the cost function below the CPU cost of a read, Cr, is dominated by the size of the memtable. The per-level fanout values, f1 and f2, won't have a significant impact on the value of Cr. Therefore it isn't clear that it is worth figuring out the optimal values for the per-level fanout. By not significant I mean: assume that B=2^20 while f1 and f2 are likely to be <= 32. Then the contribution to Cr from B is 60 while the contribution from f1 and f2 is at most 15. For Ci, the cost of an insert, the contribution from B is likely to be similar to the contribution from f1 and f2.

The problem

The problem below has a nonlinear objective function, nonlinear equality constraint and simple inequality constraints.

* LSM memtable, L1 and L2 where there are B entries in memtable, f1 is fanout from memtable to L1 and f2 is fanout from L1 to L2. There are B * f1 * f2 rows in the LSM.

* Cost of read is Cr and cost of insert is Ci, both are the number of compares per operation. Bloom filter is not used and Ci includes compares for memtable insert and compaction.
* Fraction of reads and inserts in workloads is Pr and Pi where Pr + Pi = 1
* The value for X is somewhat arbitrary although X should be >= 2. B should be an integer but I ignore that for now.

Minimize Pr * Cr + Pi * Ci
such that B * f1 * f2 = K
and B >= X, B <= mK where m between 0 and 1
and f1 >= 2 and f2 >= 2

# compares per point-read without a bloom filter
# assume key is found in max level of LSM tree
Cr = memtable-cmp + l1-cmp + l2-cmp

Cr = log2(B) + log2(B*f1) + log2(B*f1*f2)
# Cr can also be expressed as
Cr = 3*log2(B) + 2*log2(f1) + log2(f2)

# compares per insert
Ci = memtable-insert-cmp + compaction-cmp
Ci = log2(B) + f1 + f2
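The cost functions above in code, along with a check of the earlier claim that with B=2^20 the memtable dominates Cr (60 compares from B vs at most 15 from fanouts <= 32):

```python
from math import log2

# compares per point read (no bloom filter, key found in the max level)
def cr(B, f1, f2):
    return log2(B) + log2(B * f1) + log2(B * f1 * f2)

# compares per insert (memtable insert plus compaction)
def ci(B, f1, f2):
    return log2(B) + f1 + f2

# the two forms of Cr given above agree
assert abs(cr(2**20, 8, 16) - (3*log2(2**20) + 2*log2(8) + log2(16))) < 1e-9

# B=2^20 contributes 60 compares to Cr; f1=f2=32 contribute at most 15
assert 3 * log2(2**20) == 60
assert 2 * log2(32) + log2(32) == 15
```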

Step 1

First, write the L function that includes the constraints:

# Note that I wrote u in place of lambda
L = Cr * Pr + Ci * Pi + u(K - B*f1*f2)

dL/dB = (3*Pr) / (B * ln(2)) + Pi/(B * ln(2)) - u*f1*f2

dL/df1 = (2*Pr) / (f1 * ln(2)) + Pi - u*B*f2
dL/df2 = Pr / (f2 * ln(2)) + Pi - u*B*f1

Then realize that the inequality constraints make this a lot harder to solve and that complexity gets worse with more levels in the LSM tree. Then admit that you haven't done all of the work to confirm that the math techniques can be used on this problem (constraint qualifications, concave objective function).

Step 2

Without the constraints the goal is to find the min and max values for the objective function by finding values for which the partial derivatives (dL/dB, dL/df1 and dL/df2) are zero. I will do that here and then stop. Note that when the partial derivatives are zero the solution might violate the constraints and thus not be valid. Also note that when the partial derivatives are zero the extremum can be a min or a max. Regardless, the results suggest that fanout should increase from smaller to larger levels to minimize the cost.

First solve for u when dL/dB = 0.

dL/dB = (3*Pr) / (B * ln(2)) + Pi/(B * ln(2)) - u*f1*f2
dL/dB = (3*Pr + Pi) / (B * ln(2)) - u*f1*f2
# given Pr + Pi = 1 ...

dL/dB = (2*Pr + 1) / (B * ln(2)) - u*f1*f2

# now solve when dL/dB = 0
(2*Pr + 1) / (B * ln(2)) = u*f1*f2
# given f1*f2 = K/B
(2*Pr + 1) / (B * ln(2)) = u*K/B
# multiply LHS and RHS by B
(2*Pr + 1) / ln(2) = u*K

# solve for u
u = (2*Pr + 1) / (K * ln(2))

Then  solve for f1 when dL/df1 = 0

dL/df1 = (2*Pr) / (f1 * ln(2)) + Pi - u*B*f2
# when dL/df1 = 0 then
(2*Pr) / (f1 * ln(2)) + Pi = u*B*f2
# given B*f2 = K/f1

(2*Pr) / (f1 * ln(2)) + Pi = u*K/f1
# multiply LHS and RHS by f1
(2*Pr)/ln(2) + f1*Pi = u*K
f1*Pi = u*K - (2*Pr)/ln(2)

# substitute u = (2*Pr + 1) / (K * ln(2))

f1*Pi = (2*Pr + 1) / ln(2) - (2*Pr)/ln(2)
f1*Pi = (2*Pr + 1 - 2*Pr) / ln(2)
f1*Pi = 1 / ln(2)
f1 = 1 / (Pi * ln(2)) 

Then solve for f2 when dL/df2 = 0

# skip the math for now, trust me :)
f2 = (1 + Pr) / (Pi * ln(2))

# were there an f3 then the answer would be...
f3 = (1 + 2*Pr) / (Pi * ln(2))

Then solve for B, f1 and f2 given values for Pi and Pr

B = K / (f1 * f2)
B = K / ((1 / (Pi * ln(2))) * ((1 + Pr) / (Pi * ln(2))))
B = K / ((1 + Pr) / (Pi * ln(2))^2)
B = K * (Pi * ln(2))^2 / (1 + Pr)

# with Pi=0.5 and Pr=0.5
# constraints on f1, f2 are OK
# constraints for B are probably OK
B = K * 0.080, f1=2.89, f2=4.33

# with Pi=0.1 and Pr=0.9
# constraints on f1, f2 are OK
# constraints for B are probably OK
B = K * 0.003, f1=14.43, f2=27.41

# with Pi=0.9 and Pr=0.1
# note that f1, f2 are < 2 which violates a constraint

B = K * 0.353, f1=1.60, f2=1.76

# with Pi=0 and Pr=1
# --> flush the memtable, no need to search it
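A numeric check of the closed-form results above; the values match the cases listed:

```python
from math import log

LN2 = log(2)

# f1, f2 from the partial derivatives, then B from B * f1 * f2 = K
def optimal(Pi, Pr, K):
    f1 = 1 / (Pi * LN2)
    f2 = (1 + Pr) / (Pi * LN2)
    B = K * (Pi * LN2) ** 2 / (1 + Pr)
    return B, f1, f2

B, f1, f2 = optimal(0.5, 0.5, 1.0)
assert abs(f1 - 2.89) < 0.01 and abs(f2 - 4.33) < 0.01
assert abs(B - 0.080) < 0.001
assert abs(B * f1 * f2 - 1.0) < 1e-9      # the constraint holds

B, f1, f2 = optimal(0.1, 0.9, 1.0)
assert abs(f1 - 14.43) < 0.01 and abs(f2 - 27.41) < 0.01
assert abs(B - 0.003) < 0.001
```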

Comparing optimal solution with typical config

For the 2 cases above where the optimal solution doesn't violate the constraints I compare the value of the objective function (compares/operation) for the optimal config vs a typical config (f1=f2 in the typical config) assuming K=1B (1B rows in LSM).

These results explain my comment about confirming that the impact from an optimal config is significant. It isn't in these cases.

# with Pi=0.5 and Pr=0.5
# f1=f2=3.54 in default config
B = K * 0.080, f1=2.89, f2=4.33
optimal and default config do ~47 compares/operation

# with Pi=0.1 and Pr=0.9
# f1=f2=19.89 in default config
B = K * 0.003, f1=14.43, f2=27.41
optimal and default config do ~63 compares/operation
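A check in code of the comparison above; absolute compare counts depend on how B is chosen so this only asserts the relative gap, which is tiny:

```python
from math import log2, sqrt

# average compares/operation; Cr and Ci as defined in "The problem" section
def avg_compares(B, f1, f2, Pr, Pi):
    cr = 3 * log2(B) + 2 * log2(f1) + log2(f2)
    ci = log2(B) + f1 + f2
    return Pr * cr + Pi * ci

K = 1e9                       # 1B rows
Pr = Pi = 0.5
B = 0.080 * K                 # optimal B for this workload
opt = avg_compares(B, 2.89, 4.33, Pr, Pi)
f = sqrt(2.89 * 4.33)         # default config: f1 = f2 with the same product
dflt = avg_compares(B, f, f, Pr, Pi)

assert opt <= dflt
assert (dflt - opt) / dflt < 0.01   # optimal saves well under 1%
```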

Using solvers

I tried Mathematica and Matlab and had more success with Matlab. The Minimize function in Mathematica can get unhappy with fractional coefficients in the cost function (values for Pr and Pi).

Below are two functions and the script I used with Matlab.

# costf.m is the objective function
function [f,g] = costf(x)
B = 1000000;
Pq = 0.99;
Pu = 0.01;

f = Pq * (4*log2(B) + 3*log2(x(1)) + 2*log2(x(2)) + log2(x(3))) + Pu * (log2(B) + x(1) + x(2) + x(3));

if nargout > 1
    g = [((Pq * 3) / (x(1) * log(2))) + Pu;
         ((Pq * 2) / (x(2) * log(2))) + Pu;
         ((Pq * 1) / (x(3) * log(2))) + Pu];
end

# nlcon.m is the nonlinear constraint

function [c, ceq] = nlcon(x)
c = [];
ceq = x(1) * x(2) * x(3) - 1000;

# the script
lb = [2,2,2];

x0 = [10,10,10];

S = fmincon(@costf, x0, [], [], [], [], lb, [], @nlcon);

Tuesday, September 10, 2019

FoundationDB Record Layer

This is a review of the FoundationDB Record Layer paper. My summary is that Record Layer is a wonderful solution when you need to support billions of OLTP databases where each database is small with a simple and low-concurrency workload. I am not an expert on FoundationDB so this review contains questions and might have mistakes. My focus is not on the distributed system properties.

The big question is whether Record Layer can be the solution for less simple workloads, larger databases and higher concurrency. There doesn't appear to be an OLTP SQL layer in development for FoundationDB and there are many interesting alternatives in the open-source and source-available scale-out OLTP DBMS space that already have query layers including CockroachDB, Yugabyte, FaunaDB, TiDB, MySQL Cluster, Comdb2 and MongoDB.

Apple's CloudKit is the star customer for FoundationDB. Other use cases include SnowflakeDB (metadata mgmt) and Wavefront (primary store). FoundationDB is open source (Apache 2.0) and has an active community. At the upcoming summit you can learn about Redwood, the new storage engine that will replace SQLite.

Update - CouchDB might be moving to FoundationDB

Questions and Comments

An overview of the API is here.

  • Record Layer uses optimistic concurrency control to provide serializable isolation. Commit can fail when there are conflicts. Locks are not held prior to commit. There will be conflicts, especially when there is a lag between the time at which the transaction does its reads and the commit (stalls for reads from storage make this worse). 
  • It provides atomic RMW operations that don't fail from conflicts at commit like min/max and increment because they never conflict (but what happens when the target row has been deleted). This is great but not enough to reduce my concern about support for high-contention workloads -- there will be conflicts.
  • Support for server-side logic would help reduce conflicts by shrinking transaction duration, alas there is neither a layer for SQL nor for server-side logic.
  • Clients can remove keys from a transaction's read set to reduce the chance of conflict. I wonder if this supports weaker isolation like repeatable read. Search for conflict and conflict ranges in the paper.
  • Record Layer supports aggregation indexes to maintain things like sum, count, min and max. I assume this uses the atomic RMW operations. I wonder what trouble this could cause with hot spots.
  • SQLite implements the ssd engine. This isn't the best choice if you care more about performance or efficiency. This is a great choice if you care about manageability while scaling out to billions of databases.
  • Transactions can run for at most 5 seconds. With the ssd engine, implemented by SQLite, only committed versions are forced to disk and undo is buffered in memory to support reads at previous versions. The 5 second limit is there to reduce the memory demand. This restriction might go away with the upcoming Redwood engine.
  • Transactions are limited to 10mb of changes.
  • Redwood is expected to replace SQLite. It is a copy-on-write b-tree with prefix compression. Long common prefixes are frequent with Record Layer so prefix compression is in demand. It uses shadow pages for the leaf pages and then remaps page IDs for non-leaf pages to reduce write-amp. There was an awesome PhD on shadow paging many years ago.
  • The storage engine must be single-threaded which makes it harder to reuse engines like RocksDB. 
  • Order by is only supported via indexes to reduce the memory demand on the server.
  • Record Layer is stateless. Query results are streamed back to the client. I hope the client can get more than one row per network RPC. A continuation is returned to the client when it must fetch more from a cursor.
  • Protocol Buffers are used to encode data. Index keys are compared via memcmp and Record Layer has support for combining composite keys, and hopefully for charsets, into a string for which memcmp does the right thing.
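The commit-time conflict check from the first bullet can be sketched like this; a toy model of optimistic concurrency control, not FoundationDB's implementation:

```python
# Toy OCC: the store tracks, per key, the commit sequence of the last write.
# At commit, a transaction aborts if a key it read was overwritten after the
# transaction's read snapshot was taken.

class Store:
    def __init__(self):
        self.data = {}          # key -> value
        self.version = {}       # key -> commit seq of last write
        self.seq = 0

    def begin(self):
        return {"snap": self.seq, "reads": set(), "writes": {}}

    def read(self, txn, key):
        txn["reads"].add(key)
        return txn["writes"].get(key, self.data.get(key))

    def commit(self, txn):
        for key in txn["reads"]:
            if self.version.get(key, 0) > txn["snap"]:
                return False     # conflict: key changed after our snapshot
        self.seq += 1
        for key, val in txn["writes"].items():
            self.data[key] = val
            self.version[key] = self.seq
        return True

s = Store()
t1, t2 = s.begin(), s.begin()
s.read(t1, "x"); t1["writes"]["x"] = 1
s.read(t2, "x"); t2["writes"]["x"] = 2
assert s.commit(t1) is True      # first committer wins
assert s.commit(t2) is False     # t2 read "x" before t1's commit
```

Atomic RMW operations like increment avoid this abort path because they carry no read set to validate.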

Monday, September 9, 2019

Vinyl - the LSM in Tarantool

Tarantool started as an in-memory DBMS then added an LSM, named Vinyl. This is my review of the Vinyl overview. I have some questions and assume that I will make a few mistakes in describing it. Corrections are welcome.


Tarantool is fast and one reason is server-side logic implemented in Lua and made faster by LuaJIT.

There has been talk of running Linkbench on Tarantool. I am not sure whether that requires too much work but I expect great throughput results from it.

I know of Tarantool because the former tech lead for it previously did great work at MySQL and now will do the same at ScyllaDB. HighLoad is a great place to learn more about Tarantool, as I did on my two visits.

LSM Tree Shape

Vinyl uses tiered+leveled compaction with the memtable as L0, leveled for Lmax and tiered for the levels in between the memtable and Lmax. See LSM Geek Code for more on my terminology.

Configuration parameters for Vinyl are here. There are two parameters that determine the shape of the LSM tree. The max number of sorted runs per level is set by vinyl_run_count_per_level. The size ratio between adjacent levels is set by vinyl_run_size_ratio. The docs in the Filling an LSM tree section have a nice description of the impact from these options. When vinyl_run_count_per_level is 2 then there will be at most 2 sorted runs in a level. When vinyl_run_size_ratio is 5 then the max size of a run in L2 is 5X the max size of a run in L1.
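A sketch of how those two parameters shape the tree; the sizes are illustrative, not Vinyl defaults:

```python
# Max run size per level grows by vinyl_run_size_ratio; each level holds at
# most vinyl_run_count_per_level sorted runs.
def level_limits(l1_run_size, size_ratio, run_count, levels):
    return [{"level": i,
             "max_run_size": l1_run_size * size_ratio ** (i - 1),
             "max_runs": run_count}
            for i in range(1, levels + 1)]

limits = level_limits(l1_run_size=100, size_ratio=5, run_count=2, levels=3)
assert limits[1]["max_run_size"] == 500   # an L2 run is 5X the L1 max
assert all(l["max_runs"] == 2 for l in limits)
```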

Not all tiered compaction algorithms allow an LSM tree to be described in terms of levels like this, but Vinyl does and that makes it easier to reason about performance and efficiency. More than one sorted run per level will have a large, negative impact on range query CPU overhead. However the impact for point queries should be much less when there is a bloom filter per sorted run as used by Vinyl.

Tuple Range Cache

The range cache sounds interesting but I am still confused by it. The docs describe it with the following. I assume this stores ranges (less or more than one page) from the max level of the LSM tree based on "after having performed a compaction spanning all tree levels". How is cache invalidation managed? What is done in between all-level compaction? Is there also a block cache separate from this?
Unlike, say, RocksDB or MySQL, this cache doesn’t store pages, but rather ranges of index values obtained from disk, after having performed a compaction spanning all tree levels. This allows the use of caching for both single-key and key-range searches. Since this method of caching stores only hot data and not, say, pages (you may need only some data from a page), RAM is used in the most efficient way possible. The cache size is controlled by the vinyl_cache parameter.
Other highlights

  • Vinyl measures the memtable insert and flush rates to predict when to begin flushing the memtable so that the flush will finish before the memtable is full. 
  • The Vinyl memtable is a b-tree rather than a skip-list
  • Serializable isolation is provided
  • zstd compression is used
  • Multi-level compaction is mentioned a few times but not explained
  • Vinyl has gap locks but the overview didn't explain how they are used.
  • What happens when a transaction stalls on a read from disk with Vinyl? Does Vinyl run transactions twice -- once to prefetch reads without applying changes, then again to apply changes? Does it run transactions until they stall on a disk read and then what happens when the read is ready? This might explain the need for gap locks.
  • I assume that transactions are interleaved when there are read stalls. Locks are not held so commit can fail when conflicts are detected. The overview explains it implements the MVTO (multiversion timestamp ordering) class, whereby the winning transaction is the one that finished earlier. There are no locks and associated deadlocks. But now I am confused by the claim that there are no locks when the overview mentions that gap locks have been implemented.
  • Vinyl creates bloom filters for full and partial key searches. I assume that a bloom filter for partial key search is similar to a prefix bloom filter in RocksDB.
  • Vinyl uses at least one LSM tree per logical index. More than one LSM tree is used when the logical index is too big in which case the LSM trees range partition the logical index. This allows different optimizations per partition and can be a major win for right or left growing indices, in which case only one partition will get writes and the others can be read optimized. Creating partitions is done via a split. Too-small partitions can be combined via a coalesce. Some of the split/coalesce work is deferred until compaction (think of virtual partitions).
  • Vinyl supports upsert (read-free updates are in RocksDB)
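The flush scheduling from the first bullet can be sketched as: start the flush once the inserts expected to arrive during the flush would fill the remaining space. The rates and sizes here are made-up assumptions:

```python
# A flush of `used` bytes takes used/flush_rate seconds; during that time
# inserts add insert_rate * seconds more bytes. Start the flush once the
# projected total reaches the memtable capacity.

def should_start_flush(used, capacity, insert_rate, flush_rate):
    if insert_rate >= flush_rate:
        return True                  # flushing can't keep up; start now
    flush_secs = used / flush_rate
    return used + insert_rate * flush_secs >= capacity

# 60mb used of 100mb at 10mb/s in, 20mb/s out: 3s flush adds 30mb -> 90mb
assert should_start_flush(60, 100, 10, 20) is False
# at 80mb the 4s flush adds 40mb -> 120mb >= 100mb, so start now
assert should_start_flush(80, 100, 10, 20) is True
```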

Thursday, September 5, 2019

Adapting TPC-C for MongoDB - reviewing a VLDB paper

This is a review of Adapting TPC-C Benchmark to Measure Performance of Multi-Document Transactions in MongoDB which was published in VLDB 2019. I appreciate that MongoDB and Asya Kamsky took the time to get this published. That can be a weekend and nights project when in industry. I also appreciate that this is not a benchmarketing effort. The purpose wasn't to overstate performance. The purpose was to show how to get good performance on a TPC-C like workload with MongoDB and realistic hardware and configurations. I hope for a similar effort on MongoDB with Linkbench.

My comments:
  • Work was done to reduce write-write conflicts which will be more likely given the extra commit latency from using w:majority writeConcern on a 3-node cluster. That work included 1) moving conflicting writes early in the transaction 2) moving writes before reads 3) using findAndModify instead of select/update and 4) batching writes. I wonder if non-stored procedures will be useful.
  • A small amount of denormalization was done by storing order lines in the order document. Denormalize everything isn't feasible here or in Linkbench because that leads to too-big documents.
  • Code and details were shared that will allow you to reproduce results.
  • w:majority was used on a 3-node cluster. The goal was to get realistic results, not a benchmarketing special.
I was confused by two things. First, section 3.3 states that majority write concern guarantees that a write is flushed to disk by any replica that ack'd the write. I thought this was determined by the value of the j option in writeConcern. Second, section 3.5.2 is about causal consistency and that (causal reads feature and logical clocks) seems like overkill when shards aren't used. If you want to avoid going back in time when moving from a primary to a secondary isn't it sufficient to remember the point-in-time at which primary queries are done? But maybe that is just a simple logical clock.

Wednesday, September 4, 2019

Tunable Consistency in MongoDB - reviewing a VLDB paper

This is a review of Tunable Consistency in MongoDB from VLDB 2019. It is worth reading and I appreciate that MongoDB has recently published several conference papers. I am not an expert on this topic. For expertise see Daniel Abadi, Kyle Kingsbury and Peter Bailis. Henrik can be added to the list with a few more blog posts.

MongoDB vs MySQL

MongoDB is a NoSQL DBMS that makes it easy to run sharded replicasets. While the NoSQL part of it is nice the sharded replicaset part of it is amazing. I hope that MySQL eventually gets similar support for sharded replicasets including readConcern and writeConcern options.

I previously compared MySQL semisync replication with MongoDB. With MongoDB the behavior for reads can be tuned separate from writes while MySQL combines them. In the MySQL implementation for lossless semisync a write is not locally visible until a replica acks. In the MongoDB implementation the write is committed locally and then replicated. The replicas apply writes as soon as possible without waiting for a distributed commit protocol. The key to making all of this work is controlling read visibility via an MVCC database engine courtesy of point-in-time reads.
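My mental model of the read-visibility mechanism above, as a toy sketch (not MongoDB's implementation): writes commit locally with a timestamp, replication advances a majority commit point, and a majority read only sees versions at or below that point:

```python
# Each key keeps timestamped versions. "local" reads see the newest locally
# committed version; "majority" reads see only versions with timestamp <= the
# majority commit point, which trails local commits until replicas ack.

class ReplicaSet:
    def __init__(self):
        self.versions = {}    # key -> list of (ts, value), ts ascending
        self.local_ts = 0
        self.majority_ts = 0

    def write(self, key, value):
        self.local_ts += 1    # commit locally, then replicate
        self.versions.setdefault(key, []).append((self.local_ts, value))

    def replicate_to_majority(self):
        self.majority_ts = self.local_ts   # a majority has acked everything

    def read(self, key, concern):
        bound = self.local_ts if concern == "local" else self.majority_ts
        for ts, value in reversed(self.versions.get(key, [])):
            if ts <= bound:
                return value
        return None

rs = ReplicaSet()
rs.write("x", 1)
rs.replicate_to_majority()
rs.write("x", 2)                       # locally committed, not yet majority
assert rs.read("x", "local") == 2
assert rs.read("x", "majority") == 1   # majority read sees the older version
rs.replicate_to_majority()
assert rs.read("x", "majority") == 2
```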

The Review


  • With many options comes a need for more documentation and tutorials. While I appreciate splitting read and write semantics into separate options, the feature space is large and figuring this out is harder.
  • The paper states that the gold standard is linearizability. Well, this is a DBMS paper so maybe that should be changed to serializability.
  • I was confused by the difference between majority and linearizable reads. AFAIK snapshot reads are similar to linearizable (both wait for confirmation that the master really is the master) while majority reads don't have that extra wait.
  • I was confused by "The transaction commit operation accepts a write concern, which determines ... and its constituent read and write operations" because commit occurs after the reads, so how could it affect them? As the paper promises, that is explained later.
  • MongoDB is a great case study in adding stronger consistency to an async replication system. It continues to use async replication, yet it now provides stronger consistency on demand.
  • I think that propagated means applied in the description of the w option for writeConcern. This means that a replica acks after applying a change -- either before or after making redo durable depending on the j option. AFAIK the more common option is to ack after shipping the change to the replica's log but before applying the change. However, I prefer what MongoDB does. Maybe commenters will correct my perception of what is more common.
  • To reduce write-write conflicts MongoDB uses the latest locally committed transaction as the read point-in-time for read/write operations that use w:majority and for multi-statement transactions that use snapshots. The alternative was to use the older, latest majority-committed point-in-time. See section 5 from the paper. The cost is a wait before returning to the user for that locally committed timestamp to become majority committed, even for read-only transactions. So MongoDB can make reads wait. Obviously it can make writes wait before returning to the user for w:majority. An excellent CockroachDB blog post explains that it too can make reads wait while Spanner can make writes wait.
  • Consistency is tunable in MongoDB. With writeConcern you can determine whether a write might wait. With readConcern you can determine whether a read might wait.
  • Some of the wait for reads is to confirm that the master from which the read has been done is still the master. I wonder if a master lease could have been used to avoid that wait at the cost of making failover slower. Which cost do you prefer?
  • Replication remains async. Writes are committed locally (and visible to others with the appropriate readConcern options) regardless of the write concern. They are shipped to replicas ASAP and applied by replicas ASAP. This means that a replica can apply a change that has to be undone during master failover because it was never committed to a majority. MongoDB has two ways to roll back that replica -- roll WiredTiger back to an older point in time or undo the extra replication events. Rollback sounds easy while undo is complicated.
  • Strongly consistent writes don't delay weakly consistent writes. A strongly consistent write is done locally, releases row locks, and then waits for replicas to ack.
  • MongoDB doesn't like long running transactions because WiredTiger must keep undo in memory to satisfy all active transactions. MongoDB kills snapshot reads that have been running longer than one minute. For non-snapshot reads it can advance the snapshot in the middle of the read operation. One side-effect of this is that you will have a hard time implementing a consistent logical backup tool like mysqldump.
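The "reads can wait" behavior from section 5 can be modeled roughly like this. This is a sketch with invented names, not MongoDB code: an operation reads at the latest locally committed timestamp, then holds off returning to the user until that timestamp becomes majority committed.

```python
# Rough model (names invented) of a speculative majority read: read the
# freshest local data, but only return once the read timestamp is known
# to be majority committed -- so even a read-only operation can wait on
# replication.

from collections import deque

class Node:
    def __init__(self, last_local_commit, majority_commit):
        self.last_local_commit = last_local_commit  # newest local commit ts
        self.majority_commit = majority_commit      # newest majority ts
        self.waiters = deque()  # (ts, callback) pairs blocked on replication

    def speculative_majority_read(self, do_read, on_done):
        read_ts = self.last_local_commit    # read the freshest data...
        result = do_read(read_ts)
        if read_ts <= self.majority_commit:
            on_done(result)                 # ...but return only once that
        else:                               # timestamp is majority committed
            self.waiters.append((read_ts, lambda: on_done(result)))

    def advance_majority(self, ts):
        # Replication progress: release every waiter whose read timestamp
        # is now majority committed.
        self.majority_commit = max(self.majority_commit, ts)
        while self.waiters and self.waiters[0][0] <= self.majority_commit:
            _, cb = self.waiters.popleft()
            cb()

done = []
n = Node(last_local_commit=5, majority_commit=3)
n.speculative_majority_read(lambda ts: f"value@{ts}", done.append)
print(done)            # []: read ts 5 > majority commit 3, so the read waits
n.advance_majority(5)  # replication catches up
print(done)            # ['value@5']
```

The trade-off the paper describes falls out of the model: reading the latest local commit reduces write-write conflicts, but shifts a replication wait onto the operation that did the read.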

InnoDB repeatable read in a nutshell

InnoDB repeatable read (RR) is complicated.  This is my short description. It isn't exactly correct.


Processing of statements within a transaction for InnoDB RR:
  1. Statements that are read-only and x-lock free use snapshot isolation
  2. Other statements use read committed (RC) with gap locks
  3. All statements observe uncommitted changes from the transaction
It would have been simpler to use snapshot isolation as-is but that leads to more write-write conflicts.
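The three rules above can be sketched as a toy model. This is my own simplification, not InnoDB code: each statement picks its read view based on whether it is a plain read or a locking/write statement, and a transaction always sees its own uncommitted changes.

```python
# Toy model (my simplification, not InnoDB internals) of statement
# processing under InnoDB RR: rule 1 = snapshot isolation for plain reads,
# rule 2 = read committed for locking/write statements, rule 3 = always
# observe the transaction's own uncommitted changes.

class RRTransaction:
    def __init__(self, db):
        self.db = db                # key -> list of (commit_ts, value)
        self.snapshot_ts = None     # fixed by the first real statement
        self.own_writes = {}        # this transaction's uncommitted changes

    def _visible(self, key, ts):
        versions = [v for t, v in self.db.get(key, []) if t <= ts]
        return versions[-1] if versions else None

    def read(self, key, now_ts, locking=False):
        if key in self.own_writes:          # rule 3: see own changes
            return self.own_writes[key]
        if locking:                         # rule 2: RC-like, latest data
            return self._visible(key, now_ts)
        if self.snapshot_ts is None:        # rule 1: snapshot isolation,
            self.snapshot_ts = now_ts       # fixed at the first statement
        return self._visible(key, self.snapshot_ts)

    def write(self, key, value):
        self.own_writes[key] = value

db = {"x": [(1, "old")]}
txn = RRTransaction(db)
print(txn.read("x", now_ts=1))                # 'old': snapshot fixed at ts=1
db["x"].append((2, "new"))                    # another txn commits at ts=2
print(txn.read("x", now_ts=2))                # still 'old' (snapshot read)
print(txn.read("x", now_ts=2, locking=True))  # 'new' (like ... FOR UPDATE)
```

Note what the model leaves out: gap locks, next-key locks, and write-write conflict handling, which are where most of the real complexity lives.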

Sources of Confusion

  • The repeatable read snapshot is created at the start of the first real statement (SELECT, INSERT, etc) and not by BEGIN or START TRANSACTION. To create it at transaction start, use START TRANSACTION WITH CONSISTENT SNAPSHOT.
  • Some RR transactions use SELECT to find rows to modify followed by UPDATE when they should have used SELECT ... FOR UPDATE. Without the FOR UPDATE, a SELECT statement uses the snapshot from transaction start while the UPDATE uses RC and sees all committed changes. So the SELECT might observe different rows than the UPDATE that follows.
  • I never tried to learn whether SELECT ... FOR UPDATE and UPDATE 1) use a snapshot from statement start or 2) see any row that is committed at the time statement processing encounters that row. I assume that it is #2. The manual states "reads the latest available data".
  • The rules for gap locks and next-key locks are non-trivial. I don't claim they are more complicated than they need to be because concurrency is hard and they provide useful semantics by preventing phantom reads and more. While the academic community tends towards stronger isolation the common choices in popular systems are weaker (RC in Postgres and Oracle, RR in InnoDB, ? in SQL Server).
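The SELECT vs SELECT ... FOR UPDATE pitfall above can be demonstrated with a small simulation. This is my simplification, not InnoDB internals: the plain SELECT filters rows through the transaction snapshot, while the UPDATE that follows runs at read committed, so the two statements can disagree about which rows match the same predicate.

```python
# Simplified illustration of the SELECT-then-UPDATE anomaly under InnoDB
# RR: the plain SELECT reads from the transaction snapshot, the UPDATE
# sees all committed rows, so they can match different row sets.

SNAPSHOT_TS = 10               # our transaction's snapshot timestamp
rows = {                       # row id -> (commit_ts, qty)
    1: (5, 0),                 # committed before our snapshot
    2: (12, 0),                # committed by another txn after our snapshot
}

def plain_select(pred):
    # Plain SELECT: reads through the transaction snapshot.
    return [rid for rid, (ts, qty) in rows.items()
            if ts <= SNAPSHOT_TS and pred(qty)]

def update(pred, new_qty, now_ts=15):
    # UPDATE: runs at read committed and sees all committed rows.
    touched = [rid for rid, (ts, qty) in rows.items()
               if ts <= now_ts and pred(qty)]
    for rid in touched:
        rows[rid] = (now_ts, new_qty)
    return touched

seen = plain_select(lambda qty: qty == 0)   # [1]: row 2 is invisible
changed = update(lambda qty: qty == 0, 99)  # [1, 2]: the UPDATE sees row 2
print(seen, changed)                        # the two statements disagree
```

SELECT ... FOR UPDATE avoids the surprise by using the read-committed view (and taking locks), so the row set it sees matches what the following UPDATE would touch.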
When does InnoDB RR implement RR?

I am not an expert on isolation. Experts have yet to review this post. I hope they do. The ANSI spec requires that RR prevent dirty reads and non-repeatable reads (phantom reads are only prevented at serializable, though InnoDB's gap locks prevent many of them anyway). Alas, InnoDB can do reads (evaluate the WHERE clause) at different points in time (see above), which can allow some non-repeatable reads and phantom reads unless you are careful to use the FOR UPDATE clause.