Tuesday, November 24, 2015

Read, write & space amplification - B-Tree vs LSM

This post compares a B-Tree and LSM for read, write and space amplification. The comparison is done in theory and practice so expect some handwaving mixed with data from iostat and vmstat collected while running the Linkbench workload. For the LSM I consider leveled compaction rather than size-tiered compaction. For the B-Tree I consider a clustered index like InnoDB.

The comparison in practice provides values for read, write and space amplification on real workloads. The comparison in theory attempts to explain those values.

B-Tree vs LSM in theory

Read Amplification

Most comparisons should be done for a specific context including the hardware and workload. For now I am only specific about the cache hit rate. For the B-Tree I assume that all non-leaf levels are in cache. For the LSM I assume that everything but the data blocks of the largest LSM level are in cache. While an LSM with leveled compaction has more things to keep in the cache (bloom filters) it also benefits from a better compression rate and the cache requirements are similar to a clustered B-Tree.

Worst-case disk read-amp for point queries is 1 for the B-Tree and the LSM as one block is read from the B-Tree leaf level and largest LSM level. Disk read-amp for range queries is 1 or 2 for a short range scan assuming that 1 or 2 blocks from the B-Tree leaf level and LSM max level are read. Note the impact of my assumption for cached data. While many files might be accessed for a short range query with an LSM everything but the max level data blocks are in cache.

The number of key comparisons can be used as the in-memory read-amp. For a B-Tree with 1M keys there are about 20 key comparisons on a point query. For a range query with a B-Tree there is one additional comparison for each row fetched.

It is harder to reason about the number of comparisons for an LSM. Bloom filters can be used for a point query to avoid comparisons but when there are too many files in level 0 then there will be too many bloom filter checks. Bloom filters don't work for range queries, ignoring prefix bloom filters. When query processing is IO-bound I don't expect key comparison overhead to make a difference between an LSM and B-Tree. So I will ignore this for now.

If you want to maximize the ratio of the database to cache sizes while doing at most one disk read per point query then an LSM with leveled compaction or a clustered B-Tree are the best choices. For a clustered B-Tree the things that must be in memory are one key per leaf block and all non-leaf levels of the index. An LSM with leveled compaction has similar requirements, although it also needs some of the bloom filters to be in memory.

The cache requirement is much larger for an LSM with size-tiered compaction. First, the max level has ~50% of the data compared to ~90% with leveled compaction and it less likely that all data except the max file are in cache. Second, there are more old versions of key-value pairs, space-amp is larger, so there is more data that needs to be in the cache.

An unclustered B-Tree index also requires more memory to keep the important bits in cache. The important bits are all keys, which is much more memory than one key per leaf block for a clustered B-Tree.

Write Amplification

For now I assume that flash storage is used so I can focus on bytes written and ignore disk seeks when explaining write-amp. For a B-Tree a change is first recorded in the redo log and the page is eventually written back. The worst case occurs when the buffer pool is full with dirty pages and reading the to-be-modified page into the buffer pool forces a dirty page to be evicted and written back. In this case there is a redo log write and a page write back per row change. If the row is 128 bytes and the page is 4096 bytes then 4096+128 bytes are written to storage per 128 byte row change. The write-amp is 33 -- (4096 + 128) / 128. The write-amp is reduced when there is more one changed row on a page or when one row is changed many times before write back.

For the LSM the redo log is written immediately on a row change. When the memtable is full and flushed to level 0 then the row is written again. When level N is full and compaction is done from level N to level N+1 then one SST file is read from level N, ~10 SST files are ready from level N+1 and ~10 files are written back to level N+1. The write-amp to move rows from level N to N+1 is ~10 given my handwaving but in practice it is ~7 and I am waiting for a paper to be published to explain that. The total write-amp is computed from the writes required to move a row change from the memtable to the max level. The write-amp is 1 for the redo log, 1 for the memtable flush and usually ~1 for compacting to level 1. Assuming the LSM has levels 0 to 4 and the per-level write-amp is 7 for levels 2 to 4 then the total write-amp is 24 -- 1 + 1 + 1 + 7 + 7 + 7.

From the examples above the LSM has less write-amp than the B-Tree but those examples were not meant to be compared. An LSM tends to have less write-amp than a B-Tree. When using flash storage this means the device will last longer. When using disk storage this is likely to save more IO capacity for reads leading to higher QPS.

The IO pattern for a busy LSM is concurrent streams of IO. Each stream writes files sequentially, but the writes from different streams can end up in the same logical erase block (logical means it is striped across many NAND chips). The level of the leveled compaction LSM predicts the lifetime of the write. Writes to level 0 have a short lifetime. Writes to level 4 have a much longer lifetime. The write rates per level are similar -- there might be 10 MB/second of writes to levels 0 and 1 and then 20 MB/second of writes to levels 2 through 4. This means that logical erase blocks will end up with a mix of long and short lived data and the long-lived data will get copied out during flash garbage collection. Does your flash device use logical erase blocks? If it does then there will be write-amp from flash GC even with an LSM. Flash devices that support multi-stream will help a lot.

Space Amplification

A B-Tree gets space-amp from fragmentation, per-row metadata and fixed page sizes on disk. The leaf pages in a B-Tree are between 50% and 70% full when subject to random updates. When they are 2/3 full then space-amp is 1.5 and when they are 1/2 full then space-amp is 2. An update-in-place B-Tree like InnoDB uses ~20 bytes/row for metadata to support consistent read and transactions. The metadata overhead is much smaller for an LSM like MyRocks. Finally, when compression is done for InnoDB there will be wasted space because page sizes are fixed on disk. When a 16kb in-memory page compressed to 5kb for a table that uses 8kb pages on disk, then 3kb of the 8kb page on disk is wasted.

An LSM gets space-amp from old versions of key-value pairs. Leveled and size-tiered compaction differ significantly in this regard. With leveled compaction you are likely to get space-amp of 1.1 or 1.2 and with size-tiered compaction a more common result is space-amp of 2. Size-tiered compaction can suffer even more from additional but temporary space-amp when the max file is compacted and disk space is required for the old and new version of that file.

Compression reduces space-amp and for this reason I claim that space-amp of less than 1 is possible.

B-Tree vs LSM in practice

This post is longer than I expected, so I will write less here. This is a good spot for a joke about space-amp and write-amp. I have begun reporting on read, write and space amplification by normalizing the server's IO and CPU rates by QPS during benchmarks. I use iostat to get data for on-disk read-amp and write-amp by measuring reads/second, MB read/second and MB written/second. I frequently ignore writes/second because that mixes fast and slow writes (redo log writes are fast, page writes are slow). I use vmstat to measure the CPU utilization and that is a proxy for the in-memory read-amp and write-amp. Finally I look at the size of the database on disk to compare space-amp. The data is usually measured over 1 hour intervals to make it easy to detect when metrics get worse as a database ages. I try to run workloads for at least 12 hours to give things time to go bad.

Percona has begun doing this for some benchmark reports. I hope this becomes a common practice.

This is an example from running Linkbench for MongoDB with RocksDB and WiredTiger. I will soon have more results for MyRocks. I am thrilled that we have a copy-on-write B-Tree (WiredTiger) and an LSM (RocksDB) available as storage engines in MongoDB. We are also bringing RocksDB to MySQL via the MyRocks effort. The big deal for MyRocks compared to InnoDB is half the space-amp and half the write-amp. This has been measured on Linkbench and on the real workload. This is a big deal.

Monday, November 23, 2015

Read, write & space amplification - pick 2

Good things come in threes, then reality bites and you must choose at most two. This choice is well known in distributed systems with CAPPACELC and FIT. There is a similar choice for database engines. An algorithm can optimize for at most two from readwrite and space amplification. These are metrics for efficiency and performance. This means one algorithm is unlikely to be better than another at all three. For example a B-Tree has less read amplification than an LSM while an LSM has less write amplification than a B-Tree. I abbreviate the metrics as read-amp, write-amp and space-amp. I also abbreviate this as the framework.

The framework assumes a database workload that consists of point-queries, range-queries of length N and writes. Were I to add a delete operation then this would match the RocksDB and LevelDB API. The write is a blind-write as it doesn't imply a read prior to the write.

This is part one of a topic that requires several blog posts. The second post will compare a B-Tree and LSM using the framework. The third post will argue that an algorithm cannot be optimal for all three metrics.


Read, write and space amplification explain performance and efficiency when evaluating algorithms for real and potential workloads. They aren't a replacement for Big O notation. They usually assume a specific workload and configuration including RAM size, database size and type of storage.

We began using the framework to compare InnoDB and RocksDB because better performance is an insufficient metric on which to choose an algorithm. Endurance (write amp) and capacity (space amp) matter when using flash. IOPs (read amp for point and range queries, write amp for writes) matters when using disk.

The framework is useful for understanding the compromises made in search of better QPS. It is easy to trade write for space or read efficiency in write-optimized algorithms but these trades should be disclosed because they are not free. New algorithms can show better write throughput than RocksDB by making range reads less efficient but the Linkbench workload needs efficient writes and efficient range reads.

The framework is useful because key comparisons aren't created equal. Traditional algorithm analysis is great for understanding in-memory performance via bounds on the number of key comparisons. But big-O notation is harder to use when some keys are read from cache, others from RAM and some from disk. Constant factors matter. The difference between 1.2 and 1.5 disk reads per query can be a big deal.

Read amplification

Read-amp is the amount of work done per logical read operation. This can be defined for in-memory databases, persistent databases assuming no cache (worst-case behavior) and persistent databases assuming some cache (average-case behavior). The work done in-memory can be the number of key comparisons and traditional algorithm analysis can be used. The work done on-disk includes the number of bytes transferred and seeks (seeks matter on disks, not on NVM). The work done can also include the cost of decompressing data read from storage which is a function of the read block size and compression algorithm.

Read-amp is defined separately for point and range queries. For range queries the range length matters (the number of rows to be fetched). In Linkbench the average range query fetches about 20 rows.

Read-amp can also be defined for point queries on keys that don't exist. Some algorithms use a bloom filter to avoid disk IO for keys that don't exist. Queries for non-existent keys is common in some workloads. Bloom filters can't be used for a range query. The most frequent query in Linkbench is a range query that includes an equality predicate on the first two columns of the range query index. With RocksDB we define a prefix bloom filter to benefit from that.

Write amplification

Write-amp is the amount of work done per write operation. This can include the number of bytes written to storage and disk seeks per logical write operation. This can be split into in-memory and on-disk write-amp but I frequently ignore in-memory write-amp.

There is usually a cost to pay in storage reads and writes following a logical write. With write-amp we are ignoring the read cost. The read cost is immediate for an update-in-place algorithm like a B-Tree as a page must be read to modify it. The read cost is deferred for a write-optimized algorithm like an LSM as compaction is done in the background and decoupled from the logical write. There is usually some write cost that is not deferred - updating in-memory structures and writing a redo log.

With flash storage there is usually additional write-amp from the garbage collection done by the FTL to provide flash blocks that can be rewritten. Be careful about assuming too much about the benefit of sequential and large writes from a write-optimized database engine. While the physical erase block size on a NAND chip is not huge, many storage devices have something that spans physical erase blocks when doing GC that I will call a logical erase block. When data with different lifetimes ends up in the same logical erase block then the long-lived data will be copied out and increase flash GC write-amp (WAF greater than 1).  I look forward to the arrival of multi-stream to reduce flash GC WAF.

Space amplification

Space-amp is the ratio of the size of the database to the size of the data in the database. Compression decreases space-amp. It is increased by fragmentation with a B-Tree and old versions of rows with an LSM. A low value for space-amp is more important with flash storage than disk because of the price per GB for storage capacity.

Efficiency & Performance

I work on small data systems. Small data is another name for OLTP. Small data workloads are highly concurrent and with concurrency better efficiency usually implies better performance. But performance and efficiency are not always strongly correlated. For example an algorithm with a high read-amp for range queries might hide the extra latency by doing disk reads in parallel. This improves response time but doesn't improve efficiency and the algorithm with less read-amp will sustain more QPS at higher concurrency.

Wednesday, November 18, 2015

How does MongoDB do on Linkbench with concurrency?

I started to study performance math with a focus on queueing theory and the USL. To get data for the USL I ran Linkbench for MongoDB with different levels of concurrency. I tested WiredTiger in MongoDB 3.2.0rc0 and 3.0.7 and then RocksDB in MongoDB 3.2.0. The performance summary is:
  • WiredTiger load scales much better in 3.2 compared to 3.0
  • RocksDB throughput is stable at high concurrency
  • WiredTiger 3.2 throughput is stable at high concurrency. For WiredTiger 3.0 the load rate drops significantly at high concurrency and the query rate also has an odd drop.


The test server has 2 sockets with 6 CPU cores/socket and hyperthreading enabled to get 24 HW threads. The server also has 6 SAS disks with HW RAID 0 and 1 SSD. I intended to use the disk array for all tests but ended up using the SSD for the WiredTiger 3.0.7 test. The server also has 144G of RAM and the test database was cached by mongod for all tests. The oplog was enabled but sync-on-commit was not done. For Linkbench I set maxid1 to 10M and the test pattern was load, run for 30 minutes, run for 30 minutes and the query rate is reported for the second 30 minute run. Snappy compression was used for all tests. The test was run for 1 to 48 concurrent clients with MongoDB 3.2.0 and stopped at 30 concurrent clients for MongoDB 3.0.


Results are on gist.github for the load and query tests.


WiredTiger  throughput for write-heavy workloads is much better in 3.2 than 3.0 as I previously reportedFrom PMP output it looks like WiredTiger in MongoDB 3.0 suffers from too frequent plan-cache invalidation that I wrote about this previously. I think RocksDB insert performance saturates earlier from mutex contention on the memtable writer mutex. 


WiredTiger performance is the same for 3.0 and 3.2 until 21 concurrent clients. At that point the performance for 3.0 drops and response time for all operations is slower (reads & writes). This is odd and I won't try to debug it. For RocksDB I have an educated guess at the problem (need to use SingleDelete optimization in MongoRocks to reduce overhead from tombstones) to explain why throughput for it is worse than WiredTiger.

Thursday, November 12, 2015

Define better for a small-data DBMS

There are many dimensions by which a DBMS can be better for small data workloads: performance, efficiency, manageability, usability and availability. By small data I mean OLTP. Performance gets too much attention from both industry and academia while the other dimensions are at least as important in the success of a product. Note that this discussion is about which DBMS is likely to get the majority of new workloads. The decision to migrate a solved problem is much more complex.

  • Performance - makes marketing happy
  • Efficiency - makes management happy
  • Manageability - makes operations happy
  • Usability - makes databased-backed application developers happy
  • Availability - makes users happy

Performance makes marketing happy because they can publish a whitepaper to show their product is faster than the competition and hope that the message morphs from X is faster than Y in this context into X is faster than Y. This is the art of benchmarketing. It can be risky to use average throughput as a performance metric unless there was a response time SLO. Otherwise good average throughput can hide lousy variance like the page writeback stalls in InnoDB that Percona spent a lot of time making better. When a benchmark result doesn't include an SLO then you need to see throughput over time or response time histograms (response time at the 95th percentile for example).

Efficiency makes management happy because less hardware is able to do more work. A simple way to report this is to list the amount of disk IO and CPU used per transaction by dividing rates from iostat and vmstat by the QPS or TPS as I have been doing in recent benchmark reports. Better efficiency might not imply better response time but it usually implies better throughput for highly concurrent workloads. Efficiency has several dimensions including read, space and write and one database engine is unlikely to be optimal for all of them, but that is for another blog post.

Manageability makes life easier for the operations team that supports the DBMS. They shouldn't waste time manually fixing problems when the fix can be automated. Oncall week shouldn't mean you wake up every few hours to replace failed servers. Replacement of failed masters for MySQL isn't automated at many web-scale MySQL deployments and I feel sorry for the oncall at those places. With MySQL 5.7 the pieces exist for a solution and I hope these are combined into an open-source solution similar to what we have

Usability makes life easier developers who write database-backed applications. Developers shouldn't spend time reinventing the wheel, thus my expectation of per-shard support for consistent read and transactions. A declarative query language and good-enough query optimizer also make life easier for everyone. I consider the MongoDB query API to be declarative. My feature list is longer but I won't include it here.

Availability - this makes users happy. There is big and small downtime. Both can significantly reduce availability. An example of big downtime is taking 30 minutes to replace a failed master. An example of small downtime is slow commit courtesy of synchronous geo-replication. Small downtime doesn't get enough attention. When commit to one row takes 100 milliseconds and the workload demands 100 commits/second to that row then there will be 90 commits/second that can't get done and the user experience for can't get done is similar to database down.

MySQL and MongoDB

It is interesting to compare the upcoming releases of MongoDB (3.2) and MySQL (5.7) by these metrics. MongoDB has a large lead in manageability for web-scale deployments (sharded replica sets) and this continues with the 3.2 release. MongoDB comes with support for failover automation, while MySQL does not. MongoDB 3.2 has the potential to provide much better availability than a MySQL deployment that lacks automated master failover (we have that, upstream MySQL does not) although the new code in MongoDB needs time to mature.

MySQL has a large lead in performance based on the benchmarks I have run. Better response time and throughput are important but won't determine whether MongoDB or MySQL is chosen for a new workload. MySQL also has a large lead in usability because many workloads need per-shard transactions, per-shard consistent read and per-shard joins. It isn't hard to add support for per-shard transactions and consistent read to MongoDB given they are already provided by WiredTiger and RocksDB and I expect the MongoDB to match MySQL in usability in the next few years.

MongoDB and MySQL are similar for efficiency. Both have read & write optimized engines including MongoRocks and MyRocks for write optimized.

How does this end? MongoDB becomes better by improving usability. MySQL becomes better by improving manageability and availability. Both have long track records of steady improvement although MongoDB is moving faster. The product that gets there first is likely to get the majority of new workloads.

Wednesday, November 4, 2015

MongoDB 3.2 vs Linkbench

I used LinkbenchX to compare performance and efficiency for MongoDB 3.2.0rc0 vs 3.0.7 with the RocksDB and WiredTiger engines. The Linkbench test has two phases: load and query. The test was run in three configurations: cached with data on disk, too big to cache with data on disk and too big to cache with data on SSD. My summary:

  • load rates are similar for disk and SSD with RocksDB and WiredTiger
  • load rate for WiredTiger is ~2X better in 3.2 versus 3.0
  • load rate for WiredTiger is more than 2X better than RocksDB
  • query rate for WiredTiger is ~1.3X better than RocksDB for cached database
  • query rate for RocksDB is ~1.5X better than WiredTiger for not cached database
  • disk space used is ~1.33X higher for WiredTiger vs RocksDB
  • disk bytes written per document during the load is ~5X higher for RocksDB
  • disk bytes written per query is ~3.5X higher for WiredTiger
  • RocksDB uses ~1.8X more CPU during the load
  • WiredTiger uses ~1.4X more CPU during the query phase
  • with a 32G block cache mongod RSS is ~42G with WiredTiger vs ~34G with RocksDB


LinkbenchX is a fork of Linkbench. Source for Linkbench and LinkbenchX is in github. LinkbenchX adds support for MongoDB and an option to sustain a fixed request arrival rate. For this test I use the MongoDB support but not the fixed request arrival rate by using bin/linkbench from Linkbench. I am grateful to Percona for porting Linkbench to MongoDB.

The Linkbench workload requires transactions to update the count collection when a document is added to or removed from the link collection. MongoDB doesn't support per-shard transactions and the Linkbench results will be incorrect. I understand that cross-shard transactions are hard, but per-shard transactions and per-shard consistent read are valuable for Linkbench and for selling to the enterprise. I hope they arrive in MongoDB 3.4.

Linkbench is run in two phases: load and query. I configured Linkbench to use 12 threads for both phases. The query phase was done as a sequence of 1-hour tests to measure whether performance and efficiency changed over time.

For the cached database test I set the value of maxid1 in the Linkbench file config/FBWorkload.properties to 50,000,001 and the database block cache to 32G. The database was cached by the OS but not WiredTiger or RocksDB in this setup as the compressed database was ~50G.

For the not cached database test I set the value of maxid1 to 250,000,001, the database block cache to 8G and started a background process to use mlock to leave at most 40G for mongod and the OS page cache. The database was at least 150G and the workload used lot of storage IO.

For all tests I changed the Linkbench file config/LinkConfigMongoDBv2.properties to use transaction_support_level=1, requesters=12, loaders=12, maxtime=3600, requestrate=0 and requests=100000000000. Fsync is not done on commit.

The test server has 24 cores with hyperthreading enabled, 144G of RAM and either one 400G SSD (Intel DC s3700) or 6 SAS disks with HW RAID 0. The server uses Fedora release 19, Linux 3.14.27-100, gcc 4.8.3 and MongoDB was linked with tcmalloc.

Snappy compression was used for both RocksDB and WiredTiger. For all tests this was in the mongod configuration file:
  fork: true
  destination: file
  logAppend: true
  syncPeriodSecs: 600
    enabled: true
storage.wiredTiger.collectionConfig.blockCompressor: snappy
storage.wiredTiger.engineConfig.journalCompressor: none
For the not cached database this was in the mongod configuration file:
replication.oplogSizeMB: 4000
storage.wiredTiger.engineConfig.cacheSizeGB: 8
storage.rocksdb.cacheSizeGB: 8
For the cached database this was in the mongod configuration file:
replication.oplogSizeMB: 8000
storage.wiredTiger.engineConfig.cacheSizeGB: 32
storage.rocksdb.cacheSizeGB: 32


The legend for the data in the following sections. The disk and CPU metrics are collected from iostat and vmstat. Most rates below are normalized by the rate for operations/second where operation is an insert during the load phase and query during the query phase. The real insert rate across all collections is reported for ops below but I use the number of inserts to the node collection (50M or 250M) when normalizing the iostat and vmstat rates.

  • ops - average rate for operations/second (inserts or queries per second)
  • db.gb - database size in GB (two numbers are from du without and with --apparent-size)
  • r/o - disk reads per operation
  • wKB/o - disk KB written per operation
  • cs/o - context switches per operation
  • cpu/o - CPU/operation from vmstat us+sy divided by ops multiplied by 1M 
  • rss - RSS from ps for the mongod process
  • setup - wt (wiredtiger), rx (rocksdb), 307 (mongo 3.0.7), 320 (mongo 3.2.0rc0), op0/op1 - oplog off/on

Cached database

For this test there were 50M, ~220M and X docs in the node, link and count collections after the load phase. In addition to the conclusions listed at the top of this post, WiredTiger 3.2 vs 3.0 uses less CPU and has fewer context switches per insert. It is good to see it become more efficient and the insert rate for WiredTiger has almost doubled from 3.0 to 3.2. 

The context switch rate per insert is much larger for RocksDB because of the global mutex that serializes inserts into the memtable. There are no disk reads during this test because the database fits in the OS page cache. The CPU rate for WiredTiger is also much higher during the load. That might be a side effect of more mutex contention.

The difference in database sizes for WiredTiger vs RocksDB is small after the load but grows quickly during the run phases. I did not try to debug it but the growth for WiredTiger could be a problem. WiredTiger also uses much more memory than RocksDB. But I don't know whether that is a fixed overhead (~8G) or a relative overhead (30% of the block cache size).

Using the oplog doubles the wKB/o rate because writes are done twice -- once to the oplog, once to the database. The internal write-amplification reported by RocksDB for rx.320.op1 is 6.1.

--- load
ops    db.gb   r/o     wKB/o   cs/o    cpu/o   setup
51416  53/36   0.0     1.355    2.8    2171    wt.320.op0
46156  44/40   0.0     2.316    4.0    2460    wt.320.op1
28171  47/41   0.0     1.358    0.9    3161    wt.307.op0
28080  46/35   0.0     2.304    1.8    3520    wt.307.op1
26654  31/31   0.0     5.318   16.0    3787    rx.320.op0
19033  36/36   0.0     11.643  18.4    4881    rx.320.op1

--- run, 2nd hour
ops     db.gb   r/o    wKB/o   cs/o    cpu/o   rss   setup
14483   86/72   0.0    2.486    3.6    2170    42G   wt.320.op1
14312   78/71   0.0    2.338    3.6    2259    43G   wt.307.op1
10794   38/38   0.0    1.357    3.9    2470    35G   rx.320.op1

--- run, 12th hour
ops     db.gb   r/o    wKB/o   cs/o    cpu/o   rss   setup
13042   100/90  0.0    2.588    4.1    2378    36G   wt.320.op1
12742   94/88   0.0    2.623    4.0    2414    43G   wt.307.op1
10550   45/45   0.0    1.491    4.1    2533    35G   rx.320.op1

Not cached, disk array

For this test there were 250M, 1B and X documents in the node, link and count collections after the load phase. The database did not fit in RAM and the disk array can do ~1200 IOPs. The query phase was run for 24 hours.

The new result here is that RocksDB sustained a much higher QPS rate during the query phase. From the response times listed at the end of this post the difference appears to be a better response time for the most frequent operation -- GET_LINKS_LIST -- which is a short range scan. RocksDB also benefits from a better cache hit rate because the database is smaller and the r/o rate is slightly smaller. It also uses less IO capacity for writes (wKB/o is smaller and writes are less random) leaving more IO capacity for reads.

-- load
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   setup
40667  191/177  0.0     2.319    2.8    2803    wt.320.op1
25149  191/178  0.0     2.306    1.8    4041    wt.307.op1
18725  153/153  0.0    11.568   18.7    4968    rx.320.op1

--- run, 2nd hour
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   rss   setup
504    199/188  0.842   5.326    8.0    5481    12G   wt.320.op1
507    206/187  0.798   5.171    7.5    8013    13G   wt.307.op1
850    153/153  0.746   1.743    5.3    3684    11G   rx.320.op1

--- run, 12th hour
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   rss   setup
491    196/195  0.831   5.173    8.1    5700    12G   wt.320.op1
488    195/195  0.794   5.273    7.6    8380    12G   wt.307.op1
864    155/155  0.725   1.588    5.4    3967    11G   rx.320.op1

--- run, 24th hour
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   rss   setup
494    199/197  0.799   5.404    8.1    5615    10G   wt.320.op1
471    197/197  0.814   5.303    7.8    8564    12G   wt.307.op1
833    156/156  0.738   1.721    5.5    4301    10G   rx.320.op1

Not cached, SSD

RocksDB sustained a higher QPS than WiredTiger for the query phase similar to the result for a disk array with a not cached database. I didn't expect that result here or for the disk array.

--- load
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   setup
40742  195/180  0.0     2.318    2.9    2798    wt.320.op1
25308  188/177  0.0     2.306    1.8    4007    wt.307.op1
18603  155/154  0.0    11.458   18.6    5005    rx.320.op1

--- run, 2nd hour
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   rss   setup
3870   229/213  0.821   4.869   6.3     4179    11G   wt.320.op1
3814   228/210  0.813   4.620   6.1     4163    13G   wt.307.op1
6146   155/155  0.735   1.344   5.3     3063    11G   rx.320.op1

--- run, 12th hour
ops    db.gb    r/o     wKB/o   cs/o    cpu/o   rss   setup
3715   232/221  0.855   4.810   6.6     4449    9G    wt.320.op1
3415   223/217  0.825   4.582   6.4     4332    11G   wt.307.op1
5827   162/162  0.776   1.356   5.6     3239    11G   rx.320.op1

Response time metrics

These are response time metrics for each configuration. These are printed by Linkbench at the end of each run. The most interesting result is for the GET_LINKS_LIST operation which uses a short range scan (~10 rows on average). For the cached database, the p99 for RocksDB is ~14ms vs ~3ms for WiredTiger. I think the problem for RocksDB is from too many tombstones. We recently came up with a more efficient way to remove them in MyRocks. The p99 for RocksDB in the not cached databases (disk & SSD) is better than WiredTiger and ~12ms for ssd, ~47ms for disk. 

RocksDB compaction stats

This is the compaction IO statistics from RocksDB at the end of the 24th 1-hour query run for the no cached, disk configuration.

Sunday, October 11, 2015

Losing it?

Many years ago the MySQL team at Google implemented semi-sync replication courtesy of Wei Li. The use case for it was limited and the community was disappointed that it did not provide the semantics they wanted. Eventually group commit was implemented for binlog+InnoDB: first by the MySQL team at Facebook, then much better by MariaDB, and finally in upstream MySQL. With group commit some magic was added to give us lossless semisync and now we have automated, lossless and fast failover in MySQL without using extra replicas. This feature is a big deal. I look forward to solutions for MariaDB (via binlog server and MaxScale) and MySQL (via Fabric and Proxy).

MongoDB is ahead of MySQL in features that make scale-out easier to manage and the next release (3.2) adds a few more features and more robust code for existing features. I hope that some of the features that have been sacrificed in the name of scale-out will eventually arrive in a MongoDB release: per-shard transactions, per-shard joins, per-shard consistent read.


Many times we describe algorithms when users need to understand behavior and then the user gets lost in the details. It is important for developers to understand when transactions can be lost and I will describe that with answers to two questions.
  1. After the changes from a commit are visible to a concurrent session under what conditions can that commit be lost?
  2. After a client has been informed that a commit has succeeded under what conditions can that commit be lost?


I list four combinations of behavior to consider and the features in MySQL and MongoDB that provide them:


MySQL uses one solution (lossless semisync) to protect against both loss of visible and acknowledged commits. With lossless semisync row locks are held on the master until at least one slave acknowledges the commit. This can reduce commit throughput as there is a network round trip between master and replica(s) before commit is finished and row locks are released. There is at one network round trip between commits to the same row. This overhead is reduced by moving the replica close to the master. In the solution described by Yoshinori the binlog archive provides the ack rather than using extra replicas in every data center and because the binlog archive doesn't require a lot of hardware it is easier to move it closer to the master.

In MongoDB total protection comes from two solutions -- majority read concern and write concern. The benefit is that majority write concern doesn't make commit slower assuming a workload has sufficient concurrency. It will add latency to each client doing a commit just like MySQL semisync. A different feature, majority read concern, protects against loss of a visible commit. However, there is the risk that a client that needs read-your-own write semantics will have to wait. At this point it isn't clear to me that MongoDB makes it easy to read your own writes. I wonder if applications that care more about performance will use majority read concern without using majority write concern. That isn't an option with MySQL.

It will take time to figure out the implications of the performance differences. With MySQL delays are added to writers as it takes them longer to get the through semi-sync commit protocol. With MongoDB delays might be added to readers as they wait for the majority read snapshot to be advanced.

Durable on a slave?

It can be important to understand how durable a change is on a replica when the replica acknowledges a transaction to the master. There are several options and I have not read enough recent MySQL or MongoDB documentation to determine whether there are options beyond durable in memory:
  1. durable in memory - the commit is buffered in memory on a replica before it acks. Many years ago when we implemented semi-sync for MySQL this was the only choice. I tend to think that durable in memory is an oxymoron, but I am a pessimist.
  2. durable in a log - the commit is durable in a log file on a replica before it acks. There has been talk that MySQL would 
  3. committed on the replica - the commit is applied on a replica before it acks. That guarantees read-your-writes semantics when that replica is queried soon after committing a change on a master. Alas this is also likely to create performance lag unless the replica uses many threads to apply changes concurrently, just like on the master. It also creates a window where a commit is visible on a replica before the master.

More editorial

MongoDB documentation has tended to be optimistic about the features provided by the software. I think this will be resolved as the community grows. There have been some interesting discoveries. Hopefully the gap between documented and actual behavior will be reduced over time.

The mmap engine releases the per-database or per-instance write lock before syncing the oplog even when durable writes are requested.  This is now described as read uncommitted, but read non-durable might be a better name because reads are still consistent but you can see changes from others before those changes are durable in the oplog. I wrote about this when reading code and the docs have been updated since then but I think their docs need more edits. This is only a problem for the mmap engine and multiple engines in MongoDB means they need to be clear about behavior for mmap versus WiredTiger.

There were too strong claims about the semantics of the majority write concern. It protects against the loss of an acknowledged commit but some docs suggested it protected against the loss of a visible commit. Aphyr, an expert in distributed systems testing, highlighted this problem in his Call Me Maybe series and a bug report. I wrote about part of the problem prior to that but I did not connect the problem with the too-strong claims in the documentation. Many years ago MySQL made a similar mistake when documenting semi-sync replication and fixed their docs after I filed a bug.

Documentation claimed that 2-phase commit was used to keep config servers in sync. That makes it more likely that commit is all-or-nothing for the 3 or 5 config servers hosting the same data. Alas it can lead to read-only mode when a server goes away. I read the code and the two phases were 1) ping all config servers and then if all responded 2) send the change to all config servers. If all servers did not respond with OK then manual intervention was required. This isn't 2 phase commit. Fortunately, something much better will be done for the 3.2 release and the docs have been updated.

Finally, read the excellent series of posts from Tokutek on replica set failover including the overview and posts one, two, three and four. There have been problems that haven't been widely known. Fortunately the 3.2 release of MongoDB should make things better.

Wednesday, October 7, 2015

Problems not worth fixing

I worked on a closed-source DBMS years ago and the development cycle was 1) code for 6 months 2) debug for 18 months. Part 2 was longer than part 1 because the customers have high expectations for quality and the product delivers on that. But it was also longer because some co-workers might not have been code complete after 6 months and were quietly extending part 1 into part 2.

I fixed many bugs during part 2. One that I remember was in the optimizer. Someone added code to detect and remove duplicate expressions (A and A and B --> A and B). Alas the algorithm was O(N*N). Analytics queries can have complex WHERE clauses and all queries were forced to pay the O(N*N) cost whether or not they had any duplicate expressions. 

The overhead for this feature was reasonable after I fixed the code but it raises an interesting question. Are some problems not worth the cost of prevention? Several times a year I see feature requests that make me think of this.

Forgot to add the bug that inspired this post. In bug 30342 there was a change in the MySQL 5.0 optimizer that added predicates which could increase the overhead for query optimization when more predicates means more calls to records_in_range.