Wednesday, September 11, 2019

Optimal Tuning for an LSM

For a few years I have wanted to use math to determine optimal tuning for an LSM given a specific workload and constraints on efficiency. First I had to (re)learn math. I like calculus now while I didn't enjoy it in high school -- just like spinach.

By specific workload I mean the fraction of operations that are inserts, point queries and range reads. Note that if the workload has no range reads then an LSM might not be the best choice. By constraints on efficiency I mean the max write and space amplification that would be tolerated.

I decided to try constrained optimization and Lagrange Multipliers where the objective function was the average number of CPU comparisons per operation and the decision variables were the per-level fanouts for the LSM tree. In this case I am only optimizing for the CPU overhead, but I have to start somewhere. The eventual goal is to include CPU, memory and storage in the cost function.

I previously used Lagrange Multipliers to show how to set the per-level fanout to minimize write amplification. The familiar part of that result is that the per-level fanout should be the same for all levels; the new part is that the result depends on assumptions about the per-level write amplification. Here I show that CPU overhead can be reduced by using larger per-level fanouts for larger levels.

Summary:
  • Books with titles like Math for Economists are a great resource for learning math
  • Choosing a good value for the number of levels in the LSM tree is more significant than figuring out the optimal per-level fanouts.
  • Take time to understand when a problem is, or is not, worth optimizing

Is this worth solving?

To minimize the CPU overhead of a workload it is much more important to choose the number of levels in the LSM tree and the per-level compaction algorithms (tiered, leveled, leveled-N) than it is to optimize the per-level fanout as I describe here. While the math I describe below isn't the best way to optimize an LSM, the math was interesting to me so this blog post lets me revisit my work.

In the cost function below the CPU cost of a read, Cr, is dominated by the size of the memtable. The per-level fanout values, f1 and f2, won't have a significant impact on the value of Cr, so it isn't clear that it is worth figuring out their optimal values. To see why the impact isn't significant, assume that B=2^20 while f1 and f2 are likely to be <= 32. Then the contribution to Cr from B is 3*20 = 60 compares while the contribution from f1 and f2 is at most 2*5 + 5 = 15. For Ci, the cost of an insert, the contribution from B is likely to be similar to the contribution from f1 and f2.

The problem

The problem below has a nonlinear objective function, nonlinear equality constraint and simple inequality constraints.

Assume:
* LSM memtable, L1 and L2 where there are B entries in memtable, f1 is fanout from memtable to L1 and f2 is fanout from L1 to L2. There are B * f1 * f2 rows in the LSM.

* Cost of read is Cr and cost of insert is Ci, both are the number of compares per operation. Bloom filter is not used and Ci includes compares for memtable insert and compaction.
* Fractions of reads and inserts in the workload are Pr and Pi where Pr + Pi = 1
* The value for X is somewhat arbitrary although X should be >= 2. B should be an integer but I ignore that for now.


Minimize Pr * Cr + Pi * Ci
such that B * f1 * f2 = K
and B >= X, B <= mK where m is between 0 and 1
and f1 >= 2 and f2 >= 2

# compares per point-read without a bloom filter
# assume key is found in max level of LSM tree
Cr = memtable-cmp + l1-cmp + l2-cmp

Cr = log2(B) + log2(B*f1) + log2(B*f1*f2)
# Cr can also be expressed as
Cr = 3*log2(B) + 2*log2(f1) + log2(f2)

# compares per insert
Ci = memtable-insert-cmp + compaction-cmp
Ci = log2(B) + f1 + f2
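
The cost model is easy to express in code. Here is a minimal Python sketch of it (mine, not from the original analysis) that computes the average number of compares per operation:

import math

def cost_per_operation(B, f1, f2, Pr, Pi):
    """Average compares/operation for the 3-level LSM cost model above."""
    # point read: search memtable, L1 and L2, no bloom filter
    Cr = math.log2(B) + math.log2(B * f1) + math.log2(B * f1 * f2)
    # insert: memtable insert plus compaction into L1 and L2
    Ci = math.log2(B) + f1 + f2
    return Pr * Cr + Pi * Ci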


Step 1

First, write the L function that includes the constraints:

# Note that I wrote u in place of lambda
L = Cr * Pr + Ci * Pi + u*(K - B*f1*f2)

dL/dB = (3*Pr) / (B * ln(2)) + Pi/(B * ln(2)) - u*f1*f2

dL/df1 = (2*Pr) / (f1 * ln(2)) + Pi - u*B*f2
dL/df2 = Pr / (f2 * ln(2)) + Pi - u*B*f1
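
The partial derivatives are easy to double-check with symbolic differentiation. A minimal sympy sketch (mine, not part of the original post):

import sympy as sp

B, f1, f2, u, Pr, Pi, K = sp.symbols('B f1 f2 u Pr Pi K', positive=True)
Cr = 3*sp.log(B, 2) + 2*sp.log(f1, 2) + sp.log(f2, 2)
Ci = sp.log(B, 2) + f1 + f2
L = Pr*Cr + Pi*Ci + u*(K - B*f1*f2)

print(sp.diff(L, B))   # should match dL/dB above
print(sp.diff(L, f1))  # should match dL/df1 above
print(sp.diff(L, f2))  # should match dL/df2 above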

Then realize that the inequality constraints make this a lot harder to solve and that complexity gets worse with more levels in the LSM tree. Then admit that you haven't done all of the work to confirm that the math techniques can be used on this problem (constraint qualifications, concave objective function).

Step 2

Without the constraints the goal is to find the min and max values for the objective function by finding values for which the partial derivatives (dL/dB, dL/df1 and dL/df2) are zero. I will do that here and then stop. Note that when the partial derivatives are zero the solution might violate the constraints and thus not be valid. Also note that when the partial derivatives are zero the extremum can be a min or a max. Regardless, the results suggest that fanout should increase from smaller to larger levels to minimize the cost.

First solve for u when dL/dB = 0.

dL/dB = (3*Pr) / (B * ln(2)) + Pi/(B * ln(2)) - u*f1*f2
dL/dB = (3*Pr + Pi) / (B * ln(2)) - u*f1*f2
# given Pr + Pi = 1 ...

dL/dB = (2*Pr + 1) / (B * ln(2)) - u*f1*f2

# now solve when dL/dB = 0
(2*Pr + 1) / (B * ln(2)) = u*f1*f2
# given f1*f2 = K/B
(2*Pr + 1) / (B * ln(2)) = u*K/B
# multiply LHS and RHS by B
(2*Pr + 1) / ln(2) = u*K

# solve for u
u = (2*Pr + 1) / (K * ln(2))

Then solve for f1 when dL/df1 = 0

dL/df1 = (2*Pr) / (f1 * ln(2)) + Pi - u*B*f2
# when dL/df1 = 0 then
(2*Pr) / (f1 * ln(2)) + Pi = u*B*f2
# given B*f2 = K/f1

(2*Pr) / (f1 * ln(2)) + Pi = u*K/f1
# multiply LHS and RHS by f1
(2*Pr)/ln(2) + f1*Pi = u*K
f1*Pi = u*K - (2*Pr)/ln(2)

# substitute u = (2*Pr + 1) / (K * ln(2))

f1*Pi = (2*Pr + 1) / ln(2) - (2*Pr)/ln(2)
f1*Pi = (2*Pr + 1 - 2*Pr) / ln(2)
f1*Pi = 1 / ln(2)
f1 = 1 / (Pi * ln(2)) 


Then solve for f2 when dL/df2 = 0

# skip the math for now, trust me :)
f2 = (1 + Pr) / (Pi * ln(2))

# were there an f3 then the answer would be...
f3 = (1 + 2*Pr) / (Pi * ln(2))


Then solve for B, f1 and f2 given values for Pi and Pr

B = K / (f1 * f2)
B = K / (1 / (Pi * ln(2)) * (1+Pr) / (Pi * ln(2)))
B = K / ((1 + Pr) / (Pi * ln(2))^2)
B = K * (Pi * ln(2))^2 / (1+Pr)
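
A minimal Python sketch (mine, not from the post) that plugs in the closed-form solution and checks that the partial derivatives from Step 1 vanish:

import math

def optimal_config(Pr, Pi, K):
    ln2 = math.log(2)
    f1 = 1 / (Pi * ln2)
    f2 = (1 + Pr) / (Pi * ln2)
    B = K * (Pi * ln2) ** 2 / (1 + Pr)
    u = (2 * Pr + 1) / (K * ln2)
    return B, f1, f2, u

# reproduces the first example below: B ~= 0.080*K, f1 ~= 2.89, f2 ~= 4.33
Pr, Pi, K = 0.5, 0.5, 1.0
B, f1, f2, u = optimal_config(Pr, Pi, K)
print(round(B, 3), round(f1, 2), round(f2, 2))

# the partial derivatives should all be ~0 at this point
ln2 = math.log(2)
print((3*Pr + Pi) / (B * ln2) - u * f1 * f2)  # dL/dB
print(2*Pr / (f1 * ln2) + Pi - u * B * f2)    # dL/df1
print(Pr / (f2 * ln2) + Pi - u * B * f1)      # dL/df2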

# with Pi=0.5 and Pr=0.5
# constraints on f1, f2 are OK
# constraints for B are probably OK
B = K * 0.080, f1=2.89, f2=4.33


# with Pi=0.1 and Pr=0.9
# constraints on f1, f2 are OK
# constraints for B are probably OK
B = K * 0.003, f1=14.43, f2=27.41


# with Pi=0.9 and Pr=0.1
# note that f1, f2 are < 2 which violates a constraint

B = K * 0.353, f1=1.60, f2=1.76

# with Pi=0 and Pr=1
# --> flush the memtable, no need to search it

Comparing optimal solution with typical config


For the 2 cases above where the optimal solution doesn't violate the constraints, I compare the value of the objective function (compares/operation) for the optimal config vs a typical config (f1=f2 in the typical config) assuming K=1B (1 billion rows in the LSM).

These results explain my comment about confirming that the impact from an optimal config is significant. It isn't in these cases.

# with Pi=0.5 and Pr=0.5
# f1=f2=3.54 in default config
B = K * 0.080, f1=2.89, f2=4.33
optimal and default config do ~47 compares/operation

# with Pi=0.1 and Pr=0.9
# f1=f2=19.89 in default config
B = K * 0.003, f1=14.43, f2=27.41
optimal and default config do ~63 compares/operation


Using solvers

I tried both Mathematica and Matlab and had more success with Matlab. The Minimize function in Mathematica can get unhappy with fractional coefficients in the cost function (the values for Pr and Pi).

Below are two functions and the script I used with Matlab.

# costf.m is the objective function
function [f,g] = costf(x)
% B is the number of rows in the memtable, Pq and Pu are the fractions of
% point queries and updates, x(1..3) are the per-level fanouts
B = 1000000;
Pq = 0.99;
Pu = 0.01;

% compares/operation for a 4-level LSM (memtable + 3 levels), no bloom filter
f = Pq * (4*log2(B) + 3*log2(x(1)) + 2*log2(x(2)) + log2(x(3))) + Pu * (log2(B) + x(1) + x(2) + x(3));

% gradient of f with respect to the fanouts
if nargout > 1
    g = [((Pq * 3) / (x(1) * log(2))) + Pu;
         ((Pq * 2) / (x(2) * log(2))) + Pu;
         ((Pq * 1) / (x(3) * log(2))) + Pu];
end

# nlcon.m is the nonlinear constraint

function [c, ceq] = nlcon(x)
% no inequality constraints; the product of the fanouts must equal 1000
c = [];
ceq = x(1) * x(2) * x(3) - 1000;

# the script
% lb is the lower bound on each fanout, x0 is the initial guess
lb = [2,2,2];
x0 = [10,10,10];
S = fmincon(@costf, x0, [], [], [], [], lb, [], @nlcon);
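
For comparison, the same problem can be set up in Python with scipy. This is a sketch (mine, not part of the original experiment) using the same cost function as costf.m:

import math
from scipy.optimize import minimize

B, Pq, Pu = 1000000, 0.99, 0.01

def costf(x):
    # compares/operation for a 4-level LSM (memtable + 3 levels), no bloom filter
    read = 4*math.log2(B) + 3*math.log2(x[0]) + 2*math.log2(x[1]) + math.log2(x[2])
    write = math.log2(B) + x[0] + x[1] + x[2]
    return Pq * read + Pu * write

# equality constraint: the product of the fanouts must equal 1000
cons = {"type": "eq", "fun": lambda x: x[0] * x[1] * x[2] - 1000}
res = minimize(costf, x0=[10, 10, 10], bounds=[(2, None)] * 3,
               constraints=[cons], method="SLSQP")
print(res.x, res.fun)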

Tuesday, September 10, 2019

FoundationDB Record Layer

This is a review of the FoundationDB Record Layer paper. My summary is that Record Layer is a wonderful solution when you need to support billions of OLTP databases where each database is small with a simple and low-concurrency workload. I am not an expert on FoundationDB so this review contains questions and might have mistakes. My focus is not on the distributed system properties.

The big question is whether Record Layer can be the solution for less simple workloads, larger databases and higher concurrency. There doesn't appear to be an OLTP SQL layer in development for FoundationDB and there are many interesting alternatives in the open-source and source-available scale-out OLTP DBMS space that already have query layers including CockroachDB, Yugabyte, FaunaDB, TiDB, MySQL Cluster, Comdb2 and MongoDB.

Apple's CloudKit is the star customer for FoundationDB. Other use cases include SnowflakeDB (metadata management) and Wavefront (primary store). FoundationDB is open source (Apache 2.0) and has an active community. At the upcoming summit you can learn about Redwood, the new storage engine that will replace SQLite.

Update - CouchDB might be moving to FoundationDB

Questions and Comments

An overview of the API is here.

  • Record Layer uses optimistic concurrency control to provide serializable isolation. Commit can fail when there are conflicts. Locks are not held prior to commit. There will be conflicts, especially when there is a lag between the time at which the transaction does its reads and the commit (stalls for reads from storage make this worse). 
  • It provides atomic RMW operations, like min/max and increment, that don't fail from conflicts at commit because they never conflict (but what happens when the target row has been deleted?). This is great but not enough to reduce my concern about support for high-contention workloads -- there will be conflicts.
  • Support for server-side logic would help reduce conflicts by shrinking transaction duration, alas there is neither a layer for SQL nor for server-side logic.
  • Clients can remove keys from a transaction's read set to reduce the chance of conflict. I wonder if this supports weaker isolation like repeatable read. Search for conflict and conflict ranges in the paper.
  • Record Layer supports aggregation indexes to maintain things like sum, count, min and max. I assume this uses the atomic RMW operations. I wonder what trouble this could cause with hot spots.
  • SQLite implements the ssd engine. This isn't the best choice if you care more about performance or efficiency. This is a great choice if you care about manageability while scaling out to billions of databases.
  • Transactions can run for at most 5 seconds. With the ssd engine, implemented by SQLite, only committed versions are forced to disk and undo is buffered in memory to support reads at previous versions. The 5 second limit is there to reduce the memory demand. This restriction might go away with the upcoming Redwood engine.
  • Transactions are limited to 10 MB of changes.
  • Redwood is expected to replace SQLite. It is a copy-on-write b-tree with prefix compression. Long common prefixes are frequent with Record Layer so prefix compression is in demand. It uses shadow pages for the leaf pages and then remaps page IDs for non-leaf pages to reduce write-amp. There was an awesome PhD on shadow paging many years ago.
  • The storage engine must be single-threaded which makes it harder to reuse engines like RocksDB. 
  • Order by is only supported via indexes to reduce the memory demand on the server.
  • Record Layer is stateless. Query results are streamed back to the client. I hope the client can get more than one row per network RPC. A continuation is returned to the client when it must fetch more from a cursor.
  • Protocol Buffers are used to encode data. Index keys are compared via memcmp and Record Layer has support for combining the fields of a composite key, and hopefully for charsets, into a string for which memcmp does the right thing. A toy sketch of that kind of encoding is below.
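
To illustrate the general idea of a memcmp-friendly encoding, here is a toy Python sketch. It is my own simplification, not the actual FoundationDB tuple format: strings escape embedded zero bytes and end with a zero byte, integers are fixed-width big-endian with the sign bit flipped, so byte-wise comparison of the concatenation matches the order of the original tuple.

import struct

def encode_str(s):
    # escape embedded 0x00 and terminate with 0x00 so "a" sorts before "ab"
    return s.encode("utf-8").replace(b"\x00", b"\x00\xff") + b"\x00"

def encode_int(n):
    # flip the sign bit so signed order matches unsigned byte order
    return struct.pack(">Q", (n + (1 << 63)) & ((1 << 64) - 1))

def encode_key(user, ts):
    # concatenating the encoded fields keeps memcmp order == tuple order
    return encode_str(user) + encode_int(ts)

assert encode_key("a", 5) < encode_key("ab", 1)
assert encode_key("a", -1) < encode_key("a", 2)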

Monday, September 9, 2019

Vinyl - the LSM in Tarantool

Tarantool started as an in-memory DBMS then added an LSM, named Vinyl. This is my review of the Vinyl overview. I have some questions and assume that I will make a few mistakes in describing it. Corrections are welcome.

Updates

Tarantool is fast and one reason is server-side logic implemented in Lua and made faster by LuaJIT.

There has been talk of running Linkbench on Tarantool. I am not sure whether that requires too much work but I expect great throughput results from it.

I know of Tarantool because the former tech lead for it previously did great work at MySQL and now will do the same at ScyllaDB. HighLoad is a great place to learn more about Tarantool, as I did on my two visits.

LSM Tree Shape

Vinyl uses tiered+leveled compaction with the memtable as L0, leveled for Lmax and tiered for the levels in between the memtable and Lmax. See LSM Geek Code for more on my terminology.

Configuration parameters for Vinyl are here. There are two parameters that determine the shape of the LSM tree. The max number of sorted runs per level is set by vinyl_run_count_per_level. The size ratio between adjacent levels is set by vinyl_run_size_ratio. The docs in the Filling an LSM tree section have a nice description of the impact from these options. When vinyl_run_count_per_level is 2 then there will be at most 2 sorted runs in a level. When vinyl_run_size_ratio is 5 then the max size of a run in L2 is 5X the max size of a run in L1.
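
To make the shape concrete, here is a toy Python calculation of per-level capacities under those two options. This is my own sketch; the real sizing logic in Vinyl is more involved:

def max_run_sizes(l1_bytes, run_size_ratio, num_levels):
    # max size of a sorted run per level when each level is run_size_ratio
    # times larger than the one before it (L1 is the smallest on-disk level)
    return [l1_bytes * run_size_ratio ** i for i in range(num_levels)]

# with vinyl_run_size_ratio=5 the max run size grows 5x per level ...
print(max_run_sizes(128 * 2**20, 5, 4))
# ... and with vinyl_run_count_per_level=2 each level may hold up to
# 2 sorted runs before compaction merges them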

Not all tiered compaction algorithms allow an LSM tree to be described in terms of levels like this, but Vinyl does and that makes it easier to reason about performance and efficiency. More than one sorted run per level will have a large, negative impact on range query CPU overhead. However the impact for point queries should be much less when there is a bloom filter per sorted run as used by Vinyl.

Tuple Range Cache

The range cache sounds interesting but I am still confused by it. The docs describe it with the following. I assume this stores ranges (less or more than one page) from the max level of the LSM tree based on "after having performed a compaction spanning all tree levels". How is cache invalidation managed? What is done in between all-level compaction? Is there also a block cache separate from this?
Unlike, say, RocksDB or MySQL, this cache doesn’t store pages, but rather ranges of index values obtained from disk, after having performed a compaction spanning all tree levels. This allows the use of caching for both single-key and key-range searches. Since this method of caching stores only hot data and not, say, pages (you may need only some data from a page), RAM is used in the most efficient way possible. The cache size is controlled by the vinyl_cache parameter.
Other highlights

  • Vinyl measures the memtable insert and flush rates to predict when to begin flushing the memtable so that the flush will finish before the memtable is full. 
  • The Vinyl memtable is a b-tree rather than a skip-list
  • Serializable isolation is provided
  • zstd compression is used
  • Multi-level compaction is mentioned a few times but not explained
  • Vinyl has gap locks but the overview didn't explain how they are used.
  • What happens when a transaction stalls on a read from disk with Vinyl? Does Vinyl run transactions twice -- once to prefetch reads without applying changes, then again to apply changes? Does it run transactions until they stall on a disk read and then what happens when the read is ready? This might explain the need for gap locks.
  • I assume that transactions are interleaved when there are read stalls. Locks are not held so commit can fail when conflicts are detected. The overview explains that it implements "the MVTO (multiversion timestamp ordering) class, whereby the winning transaction is the one that finished earlier. There are no locks and associated deadlocks." But now I am confused by the claim that there are no locks when the overview mentions that gap locks have been implemented.
  • Vinyl creates bloom filters for full and partial key searches. I assume that a bloom filter for partial key search is similar to a prefix bloom filter in RocksDB.
  • Vinyl uses at least one LSM tree per logical index. More than one LSM tree is used when the logical index is too big in which case the LSM trees range partition the logical index. This allows different optimizations per partition and can be a major win for right or left growing indices, in which case only one partition will get writes and the others can be read optimized. Creating partitions is done via a split. Too-small partitions can be combined via a coalesce. Some of the split/coalesce work is deferred until compaction (think of virtual partitions).
  • Vinyl supports upsert (read-free updates are in RocksDB)

Thursday, September 5, 2019

Adapting TPC-C for MongoDB - reviewing a VLDB paper

This is a review of Adapting TPC-C Benchmark to Measure Performance of Multi-Document Transactions in MongoDB which was published in VLDB 2019. I appreciate that MongoDB and Asya Kamsky took the time to get this published. That can be a weekend and nights project when in industry. I also appreciate that this is not a benchmarketing effort. The purpose wasn't to overstate performance. The purpose was to show how to get good performance on a TPC-C like workload with MongoDB and realistic hardware and configurations. I hope for a similar effort on MongoDB with Linkbench.

My comments:
  • Work was done to reduce write-write conflicts which will be more likely given the extra commit latency from using w:majority writeConcern on a 3-node cluster. That work included 1) moving conflicting writes early in the transaction, 2) moving writes before reads, 3) using findAndModify instead of select/update and 4) batching writes (a small sketch of the findAndModify pattern follows this list). I wonder if non-stored procedures will be useful.
  • A small amount of denormalization was done by storing order lines in the order document. Denormalize everything isn't feasible here or in Linkbench because that leads to too-big documents.
  • Code and details were shared that will allow you to reproduce results.
  • w:majority was used on a 3-node cluster. The goal was to get realistic results, not a benchmarketing special.
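
The findAndModify pattern from the first bullet looks roughly like this in PyMongo. This is my own sketch with made-up database, collection and field names, not code from the paper:

from pymongo import MongoClient, ReturnDocument

coll = MongoClient()["tpcc"]["district"]  # hypothetical database and collection

# select then update as two operations widens the window for write-write conflicts:
#   doc = coll.find_one({"d_id": 1, "w_id": 1})
#   coll.update_one({"d_id": 1, "w_id": 1}, {"$inc": {"d_next_o_id": 1}})
# findAndModify does the read and the write as one server-side operation:
doc = coll.find_one_and_update(
    {"d_id": 1, "w_id": 1},
    {"$inc": {"d_next_o_id": 1}},
    return_document=ReturnDocument.AFTER,
)
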
I was confused by two things. First, section 3.3 states that majority write concern guarantees that a write is flushed to disk by any replica that ack'd the write. I thought this was determined by the value of the j option in writeConcern. Second, section 3.5.2 is about causal consistency and that (causal reads feature and logical clocks) seems like overkill when shards aren't used. If you want to avoid going back in time when moving from a primary to a secondary isn't it sufficient to remember the point-in-time at which primary queries are done? But maybe that is just a simple logical clock.




Wednesday, September 4, 2019

Tunable Consistency in MongoDB - reviewing a VLDB paper

This is a review of Tunable Consistency in MongoDB from VLDB 2019. It is worth reading and I appreciate that MongoDB has recently published several conference papers. I am not an expert on this topic. For expertise see Daniel Abadi, Kyle Kingsbury and Peter Bailis. Henrik can be added to the list with a few more blog posts.

MongoDB vs MySQL

MongoDB is a NoSQL DBMS that makes it easy to run sharded replicasets. While the NoSQL part of it is nice the sharded replicaset part of it is amazing. I hope that MySQL eventually gets similar support for sharded replicasets including readConcern and writeConcern options.

I previously compared MySQL semisync replication with MongoDB. With MongoDB the behavior for reads can be tuned separately from writes while MySQL combines them. In the MySQL implementation of lossless semisync a write is not locally visible until a replica acks. In the MongoDB implementation the write is committed locally and then replicated. The replicas apply writes as soon as possible without waiting for a distributed commit protocol. The key to making all of this work is controlling read visibility via an MVCC database engine courtesy of point-in-time reads.

The Review

Highlights:

  • With many options comes a need for more documentation and tutorials. While I appreciate splitting read and write semantics into separate options, the feature space is large and figuring this out is harder.
  • The paper states that the gold standard is linearizability. Well, this is a DBMS paper so maybe that should be changed to serializability.
  • I was confused by the difference between majority and linearizable reads. AFAIK snapshot reads are similar to linearizable (both wait for confirmation that the master really is the master) while majority reads don't have that extra wait.
  • I was confused by "The transaction commit operation accepts a write concern, which determines ... and its constituent read and write operations" because commit occurs after the reads so how could it effect them. As the paper promises, that is explained later.
  • MongoDB is a great case study in adding stronger consistency to an async replication system. It continues to use async replication, yet it now provides stronger consistency on demand.
  • I think that propagated means applied in the description of the w option for writeConcern. This means that a replica acks after applying a change -- either before or after making redo durable depending on the j option. AFAIK the more common approach is to ack after shipping the change to the replica's log but before applying the change. However, I prefer what MongoDB does. Maybe commenters will correct my perception of what is more common.
  • To reduce write-write conflicts MongoDB uses the latest locally committed transaction as the point-in-time to do reads for read/write operations that use w:majority and for multi-statement transactions that use snapshots. The alternative was to use the older latest majority commit point-in-time. See section 5 from the paper. Therefore there is a wait before returning to the user for that locally committed timestamp to be committed to a majority of the replica set. This is true even for read-only transactions. So MongoDB can make reads wait. Obviously it can make writes wait before returning to the user doing the write for w:majority. An excellent CockroachDB blog post explains that it too can make reads wait while Spanner can make writes wait. 
  • Consistency is tunable in MongoDB. With writeConcern you can determine whether a write might wait. With readConcern you can determine whether a read might wait. A small client-side sketch follows this list.
  • Some of the wait for reads is to confirm that the master from which the read has been done is still the master. I wonder if a master lease could have been used to avoid that wait at the cost of making failover slower. Which cost do you prefer?
  • Replication remains async. Writes are committed locally (and visible to others with the appropriate readConcern options) regardless of the write concern. These are shipped to replicas ASAP and applied by replicas ASAP. This means that a replica can apply a change that has to be undone during master failover because it was never committed to a majority. MongoDB has two ways to rollback that replica -- rollback WiredTiger to an older point in time or undo the extra replication events. Rollback sounds easy while undo is complicated.
  • Strongly consistent writes don't delay weakly consistent writes. A strongly consistent write is done locally then releases row locks then waits for replicas to ack.
  • MongoDB doesn't like long running transactions because WiredTiger must keep undo in memory to satisfy all active transactions. MongoDB kills snapshot reads that have been running longer than one minute. For non-snapshot reads it can advance the snapshot in the middle of the read operation. One side-effect of this is that you will have a hard time implementing a consistent logical backup tool like mysqldump.
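
For reference, this is roughly what the tunable part looks like from a client, per the bullet above on readConcern and writeConcern. A PyMongo sketch with made-up names, not from the paper:

from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost/?replicaSet=rs0")  # hypothetical URI
db = client["test"]

# writes wait for a majority of the replica set (and the journal with j=True),
# reads only return majority-committed data
coll = db.get_collection(
    "events",
    write_concern=WriteConcern(w="majority", j=True),
    read_concern=ReadConcern("majority"),
)
coll.insert_one({"k": 1})
doc = coll.find_one({"k": 1})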

InnoDB repeatable read in a nutshell

InnoDB repeatable read (RR) is complicated.  This is my short description. It isn't exactly correct.

Summary

Processing of statements within a transaction for InnoDB RR:
  1. Statements that are read-only and x-lock free use snapshot isolation
  2. Other statements use read committed (RC) with gap locks
  3. All statements observe uncommitted changes from the transaction
It would have been simpler to use snapshot isolation as-is but that leads to more write-write conflicts.

Sources of Confusion

  • The repeatable read snapshot is created at the start of the first real statement (SELECT, INSERT, etc) and not by BEGIN or START TRANSACTION. To get the snapshot at transaction start use START TRANSACTION WITH CONSISTENT SNAPSHOT.
  • Some RR transactions use SELECT to find rows to modify followed by UPDATE when they should have used SELECT ... FOR UPDATE. Without the FOR UPDATE the SELECT statement uses the snapshot from transaction start while the UPDATE uses RC and sees all committed changes. So the SELECT might observe different rows than the UPDATE that follows (see the sketch after this list).
  • I never tried to learn whether SELECT ... FOR UPDATE and UPDATE 1) use a snapshot from statement start or 2) see any row that is committed at the time statement processing encounters that row. I assume that it is #2. The manual states "reads the latest available data".
  • The rules for gap locks and next-key locks are non-trivial. I don't claim they are more complicated than they need to be because concurrency is hard and they provide useful semantics by preventing phantom reads and more. While the academic community tends towards stronger isolation the common choices in popular systems are weaker (RC in Postgres and Oracle, RR in InnoDB, ? in SQL Server).
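
A toy two-session walkthrough of the SELECT vs SELECT ... FOR UPDATE confusion described above. This is my own sketch: it assumes a local MySQL server and a table t(id INT PRIMARY KEY, val INT), and the credentials and names are made up:

import mysql.connector

s1 = mysql.connector.connect(user="root", password="pw", database="test")
s2 = mysql.connector.connect(user="root", password="pw", database="test")
c1, c2 = s1.cursor(), s2.cursor()

s1.start_transaction(isolation_level="REPEATABLE READ")
c1.execute("SELECT val FROM t WHERE id = 1")  # the RR snapshot is created here
print("first read:", c1.fetchone())

# session 2 commits a change after session 1's snapshot was taken
c2.execute("UPDATE t SET val = val + 1 WHERE id = 1")
s2.commit()

# a plain SELECT in session 1 still reads from the old snapshot ...
c1.execute("SELECT val FROM t WHERE id = 1")
print("snapshot read:", c1.fetchone())

# ... but UPDATE (and SELECT ... FOR UPDATE) behaves like read committed plus
# locks, so it sees and modifies the row version committed by session 2
c1.execute("UPDATE t SET val = val + 1 WHERE id = 1")
s1.commit()
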
When does InnoDB RR implement RR?

I am not an expert on isolation. Experts have yet to review this post. I hope they do. The ANSI spec requires that RR prevent dirty reads, non-repeatable reads and phantom reads. Alas, InnoDB can do reads (evaluate the WHERE clause) at different points in time (see above) which can allow some non-repeatable reads and phantom reads unless you are careful to use the FOR UPDATE clause.