Small Datum

Thursday, January 14, 2016

RocksDB vs InnoDB via Linkbench : performance and efficiency

MyRocks can reduce by half the hardware, or at least the storage hardware, required to run Linkbench compared to InnoDB. That is kind of a big deal.

A significant performance problem was recently fixed in MyRocks courtesy of the SingleDelete optimization. With this optimization RocksDB removes tombstones faster so that queries encounter fewer tombstones and waste less time on them. We hope to get the same feature into MongoRocks. I have been waiting a few months for this change and started another round of Linkbench tests when it arrived.

Performance and efficiency for MyRocks look great relative to InnoDB. We are far from done but I am amazed we reached this state so fast. The performance summary from my recent tests with IO-bound Linkbench and PCIe flash:

Uncompressed InnoDB loads faster than MyRocks and MyRocks loads faster than compressed InnoDB. I hope to figure out how to make MyRocks load faster than uncompressed InnoDB.
MyRocks uses about half the disk space compared to compressed InnoDB.
MyRocks writes much less to storage than InnoDB. This allows a workload to run on low-endurance SSD with MyRocks when it requires high-endurance SSD with InnoDB.
Average and p99 response times are much better for MyRocks
Maximum response times were usually better for InnoDB

Compression vs Device Endurance

Imagine a workload that uses 2 TB of storage and writes 100 MB/second to it. With 2X better compression it uses 1 TB of storage but might continue to write 100 MB/second. A side-effect of compression is that it increases the endurance required from a storage device and better endurance from SSD isn't free. Domas has been pointing out this side effect for many years.

This isn't a problem with MyRocks. It provides 2X better compression than compressed InnoDB for Linkbench. It also provides a write rate that is much less than half the InnoDB rate. When InnoDB uses 2 TB of storage and writes 100 MB/second to storage then MyRocks uses 1 TB of storage and writes less than 50 MB/second to storage. Alternatively we can put twice the number of databases on the server without increasing the storage or endurance requirements. For a server with 4 TB of storage we can either run 2 InnoDB databases that are 2 TB each and sustain 200 MB/second of writes or we can run 4 MyRocks databases that are 1 TB each and sustain less than 200 MB/second of writes. This assumes the server has sufficient CPU capacity.

Note that InnoDB writes more than 8X the number of bytes to storage per query compared to MyRocks. Many more details on that are below.

Configuration

I used my Linkbench repo to compare MyRocks with InnoDB in MySQL 5.6.26 and 5.7.10 with my.cnf files for MyRocks, 5.6.26 and 5.7.10. The test server has 2 sockets, 10 cores (20 HW threads) per socket, 144 GB of RAM and 3+ TB of PCIe flash. For Linkbench I configured loaders=20, requesters=20, maxid1=1B and maxtime=3600. The test pattern is first do the load and then run the query step 24 times for 1-hour each time. I collect performance data from MySQL, vmstat, iostat and the storage device during the tests. This is the schema for MyRocks, compressed InnoDB and uncompressed InnoDB.

I tested the following binaries:

rocksdb.zlib - MyRocks with zlib compression
orig5626.zlib - MySQL 5.6.26, InnoDB and zlib compression
orig5710.zlib - MySQL 5.7.10, InnoDB and zlib compression
orig5626.none - MySQL 5.6.26, InnoDB and no compression
orig5710.none - MySQL 5.7.10, InnoDB and no compression

Compression

This shows the database size at the end of each 1-hour query steps. The step function exists on the orig5710.none graph because the size is rounded to the nearest 100 GB once the size is >= 1 TB. Later in this post I show that MyRocks sustains a higher QPS during the query steps so it added the most data to the database. But the size growth is larger for InnoDB and likely caused by index fragmentation. This is the data for the graph. MyRocks uses less than half the space compared to compressed InnoDB.

Load

This displays the rate at which rows are inserted during the load with 20 concurrent clients. Uncompressed InnoDB has the best rate but I strongly prefer to use compression for this workload. Regardless I will debug this to see what can be done to help MyRocks have the best load rate on SSD. The data for the chart is here.

Query

This displays the average QPS from each 1-hour query step of Linkbench. The rate for compressed InnoDB increases significantly during the load because it suffers from contention as the b-tree becomes fragmented and page splits are done and this occurs more often immediately after the load. The data for the graph is here. The QPS for MyRocks is much higher than for all of the InnoDB configurations.

Efficiency

MyRocks is more IO efficient than InnoDB for Linkbench. It has the lowest per-query rates for disk reads and disk bytes written. I think it does fewer disk reads because it keeps more data in cache. One reason is that InnoDB wastes space in the buffer pool for uncompressed pages courtesy of b-tree fragmentation. There are other reasons I won't describe here. There are several benefits from doing fewer disk reads. First, you can get more throughput when the storage device is close to saturation. Second, you use less CPU for decompression because every page must be decompressed after the read. Finally, you waste less time managing the buffer pool -- page eviction has been a source of performance problems for InnoDB. The data for the chart and graph is here.

Uncompressed InnoDB writes 10.4X more data to storage per query than MyRocks. Compressed InnoDB writes 8.4X more data to storage per query than MyRocks. MyRocks enables workload consolidation because it has much better compression and a lower write-rate than InnoDB. The MyRocks rate for bytes read from storage per query includes reads done in the background for LSM compaction. But I am not certain why that rate is between the rates for uncompressed and compressed InnoDB.

Quality of Service

Linkbench has 10 database transactions and reports response time metrics per transaction type. The transactions are ADD_NODE, UPDATE_NODE, DELETE_NODE, GET_NODE, ADD_LINK, UPDATE_LINK, DELETE_LINK, COUNT_LINK, MULTIGET_LINK and GET_LINKS_LIST. By far the most frequent transaction is GET_LINKS_LIST which requires a short range scan on a covering secondary index. The workload is explained in a blog post and conference paper. For the 24th 1-hour run the metrics are listed here per engine (MyRocks, InnoDB) and then reordered with results per transaction type for all engines. From the latter it is clear that while maximum response times are usually better for InnoDB the average and p99 response times are much better for MyRocks.

Thursday, December 10, 2015

IO-bound linkbench for MongoDB 3.2

I previously shared Linkbench results for MongoDB 3.2.0 with a cached database. Here I provide results for a database larger than cache using SSD and a disk array to compare RocksDB with the WiredTiger B-Tree. The performance summary is:

the peak load rate is 2X better with WiredTiger in 3.2 vs 3.0
the load rate for WiredTiger is much better than for RocksDB
the load rate for WiredTiger and RocksDB does not get slower with disk vs SSD or with a cached database vs an uncached database. For RocksDB this occurs because secondary index maintenance doesn't require page reads. This might be true for WiredTiger only because the secondary index pages fit in cache.
the peak query rates were between 2X and 3X better for RocksDB vs WiredTiger

Configuration

The previous post explains the benchmark and test hardware. The test was repeated for 1, 4, 8, 16 and 24 concurrent clients for the disk array test and 1, 4, 8, 12, 16, 20 and 24 concurrent clients for the SSD test.

Load performance

I only show the insert rate graph for SSD. The results with the disk array are similar. The insert rate is better for WiredTiger because it supports more concurrency internally courtesy of extremely impressive engineering. We have work in progress to make this much better for RocksDB.

Query performance

These display the query rates for 1, 4, 8, 16 and 24 concurrent clients using a disk array and then SSD. RocksDB does better than WiredTiger on both disk and SSD. RocksDB uses less random IO when writing changes back to storage and the benefit from this is larger with disk than with an SSD so the speedup for RocksDB is larger with the disk array.

Efficiency

This includes absolute and relative efficiency metrics from the tests run with 16 concurrent clients and SSD. The values are from vmstat and iostat run for the duration of the test. The absolute metrics are the per-second rates. The relative metrics are the per-second rates divided by the operation rate which measures HW consumed per insert or query. The operation rate is either the load rate (IPS) or the query rate (QPS).

The columns are:

cs.sec - average context switch rate
cpu.sec - average CPU load (system + user CPU time)
cs.op - context switch rate / operation rate
cpu.Kop - (CPU load / operation rate) X 1000
r.sec - average rate for iostat r/s
rkb.sec - average rate for iostat rKB/s
wkb.sec - average rate for iostat wKB/s
r.op - r.sec / operation rate
rkb.op - rkb.sec / operation rate
wkb.op - w.sec / operation rate

Load Efficiency

Conclusions from efficiency on the load step:

The context switch and CPU overheads are larger with RocksDB. This might be from mutex contention
I need more precision to show this but the relative write rate is much better for WiredTiger
The relative read rate is much better for WiredTiger. I suspect that some data is being read during compaction by RocksDB.

cs.sec cpu.sec cs.op cpu.Kop   r.sec rkb.sec wkb.sec r.op rkb.op wkb.op engine
182210 43 21.8 5.176   7.7 41 100149 0.001 0.005 0.000 RocksDB
108230 68 4.3 2.721   0.5 4 58327 0.000 0.000 0.000 WiredTiger

Query Efficiency

Conclusions from efficiency on the query step:

The CPU overheads are similar
The read and write overheads are larger for WiredTiger. RocksDB sustains more QPS because it does less IO for an IO-bound workload.

cs.sec cpu.sec cs.op cpu.Kop   r.sec rkb.sec wkb.sec r.op rkb.op wkb.op engine
87075 53 4.6 2.772   6394.2 52516 27295 0.338 2.774 1.442 RocksDB
65889 40 5.5 3.320   6309.8 88675 50559 0.529 7.437 4.240 WiredTiger

Results for disk

Results at 1, 4, 8, 16 and 24 concurrent clients for RocksDB and WiredTiger. IPS is the average insert rate during the load phase and QPS is the average query rate during the query phase.

clients IPS QPS RocksDB

1 3781 281

4 12397 1244

8 16717 1946

16 19116 2259

24 17627 2458

clients IPS QPS WiredTiger

1 5080 227

4 18369 726

8 35272 843

16 55341 808

24 64577 813

Results for SSD

Results at 1, 4, 8, 12, 16, 20 and 24 concurrent clients for RocksDB and WiredTiger. IPS is the average insert rate during the load phase and QPS is the average query rate during the query phase.

clients IPS QPS RocksDB

1 3772 945

4 12475 3171

8 16689 6023

12 18079 8075

16 18248 9632

20 18328 10440

24 17327 10500

clients IPS QPS WiredTiger

1 5077 843

4 18511 2627

8 35471 4374

12 43105 5435

16 55108 6067

20 62380 5928

24 64190 5762

Response time

This has per-operation response time metrics that are printed by Linkbench at the end of a test run. These are from the SSD test with 16 clients. While the throughput is about 1.5X better for RocksDB the p99 latencies tend to be 2X better with it. It isn't clear whether the stalls are from WiredTiger or storage.

For RocksDB:

ADD_NODE count = 893692 p50 = [0.2,0.3]ms p99 = [3,4]ms max = 262.96ms mean = 0.37ms
UPDATE_NODE count = 2556755 p50 = [0.8,0.9]ms p99 = [10,11]ms max = 280.701ms mean = 1.199ms
DELETE_NODE count = 351389 p50 = [0.8,0.9]ms p99 = [11,12]ms max = 242.851ms mean = 1.303ms
GET_NODE count = 4484357 p50 = [0.5,0.6]ms p99 = [9,10]ms max = 262.863ms mean = 0.798ms
ADD_LINK count = 3119609 p50 = [1,2]ms p99 = [13,14]ms max = 271.504ms mean = 2.211ms
DELETE_LINK count = 1038625 p50 = [0.6,0.7]ms p99 = [13,14]ms max = 274.327ms mean = 1.789ms
UPDATE_LINK count = 2779251 p50 = [1,2]ms p99 = [13,14]ms max = 265.854ms mean = 2.354ms
COUNT_LINK count = 1696924 p50 = [0.3,0.4]ms p99 = [3,4]ms max = 262.514ms mean = 0.455ms
MULTIGET_LINK count = 182741 p50 = [0.7,0.8]ms p99 = [6,7]ms max = 237.901ms mean = 1.023ms
GET_LINKS_LIST count = 17592675 p50 = [0.8,0.9]ms p99 = [11,12]ms max = 26278.336ms mean = 1.631ms
REQUEST PHASE COMPLETED. 34696018 requests done in 3601 seconds. Requests/second = 9632

For WiredTiger:

ADD_NODE count = 562034 p50 = [0.2,0.3]ms p99 = [0.6,0.7]ms max = 687.348ms mean = 0.322ms
UPDATE_NODE count = 1609307 p50 = [1,2]ms p99 = [20,21]ms max = 1331.321ms mean = 1.761ms
DELETE_NODE count = 222067 p50 = [1,2]ms p99 = [20,21]ms max = 1116.159ms mean = 1.813ms
GET_NODE count = 2827037 p50 = [0.8,0.9]ms p99 = [19,20]ms max = 1119.06ms mean = 1.51ms
ADD_LINK count = 1963502 p50 = [2,3]ms p99 = [27,28]ms max = 1176.684ms mean = 3.324ms
DELETE_LINK count = 654387 p50 = [1,2]ms p99 = [21,22]ms max = 1292.405ms mean = 2.761ms
UPDATE_LINK count = 1752325 p50 = [2,3]ms p99 = [30,31]ms max = 4783.055ms mean = 3.623ms
COUNT_LINK count = 1068844 p50 = [0.3,0.4]ms p99 = [4,5]ms max = 1264.399ms mean = 0.705ms
MULTIGET_LINK count = 114870 p50 = [1,2]ms p99 = [17,18]ms max = 466.058ms mean = 1.717ms
GET_LINKS_LIST count = 11081761 p50 = [1,2]ms p99 = [21,22]ms max = 19840.669ms mean = 2.624ms

REQUEST PHASE COMPLETED. 21856135 requests done in 3602 seconds. Requests/second = 6067

Tuesday, November 24, 2015

Read, write & space amplification - B-Tree vs LSM

This post compares a B-Tree and LSM for read, write and space amplification. The comparison is done in theory and practice so expect some handwaving mixed with data from iostat and vmstat collected while running the Linkbench workload. For the LSM I consider leveled compaction rather than size-tiered compaction. For the B-Tree I consider a clustered index like InnoDB.

The comparison in practice provides values for read, write and space amplification on real workloads. The comparison in theory attempts to explain those values.

B-Tree vs LSM in theory

Read Amplification

Most comparisons should be done for a specific context including the hardware and workload. For now I am only specific about the cache hit rate. For the B-Tree I assume that all non-leaf levels are in cache. For the LSM I assume that everything but the data blocks of the largest LSM level are in cache. While an LSM with leveled compaction has more things to keep in the cache (bloom filters) it also benefits from a better compression rate and the cache requirements are similar to a clustered B-Tree.

Worst-case disk read-amp for point queries is 1 for the B-Tree and the LSM as one block is read from the B-Tree leaf level and largest LSM level. Disk read-amp for range queries is 1 or 2 for a short range scan assuming that 1 or 2 blocks from the B-Tree leaf level and LSM max level are read. Note the impact of my assumption for cached data. While many files might be accessed for a short range query with an LSM everything but the max level data blocks are in cache.

The number of key comparisons can be used as the in-memory read-amp. For a B-Tree with 1M keys there are about 20 key comparisons on a point query. For a range query with a B-Tree there is one additional comparison for each row fetched.

It is harder to reason about the number of comparisons for an LSM. Bloom filters can be used for a point query to avoid comparisons but when there are too many files in level 0 then there will be too many bloom filter checks. Bloom filters don't work for range queries, ignoring prefix bloom filters. When query processing is IO-bound I don't expect key comparison overhead to make a difference between an LSM and B-Tree. So I will ignore this for now.

If you want to maximize the ratio of the database to cache sizes while doing at most one disk read per point query then an LSM with leveled compaction or a clustered B-Tree are the best choices. For a clustered B-Tree the things that must be in memory are one key per leaf block and all non-leaf levels of the index. An LSM with leveled compaction has similar requirements, although it also needs some of the bloom filters to be in memory.

The cache requirement is much larger for an LSM with size-tiered compaction. First, the max level has ~50% of the data compared to ~90% with leveled compaction and it less likely that all data except the max file are in cache. Second, there are more old versions of key-value pairs, space-amp is larger, so there is more data that needs to be in the cache.

An unclustered B-Tree index also requires more memory to keep the important bits in cache. The important bits are all keys, which is much more memory than one key per leaf block for a clustered B-Tree.

Write Amplification

For now I assume that flash storage is used so I can focus on bytes written and ignore disk seeks when explaining write-amp. For a B-Tree a change is first recorded in the redo log and the page is eventually written back. The worst case occurs when the buffer pool is full with dirty pages and reading the to-be-modified page into the buffer pool forces a dirty page to be evicted and written back. In this case there is a redo log write and a page write back per row change. If the row is 128 bytes and the page is 4096 bytes then 4096+128 bytes are written to storage per 128 byte row change. The write-amp is 33 -- (4096 + 128) / 128. The write-amp is reduced when there is more one changed row on a page or when one row is changed many times before write back.

For the LSM the redo log is written immediately on a row change. When the memtable is full and flushed to level 0 then the row is written again. When level N is full and compaction is done from level N to level N+1 then one SST file is read from level N, ~10 SST files are ready from level N+1 and ~10 files are written back to level N+1. The write-amp to move rows from level N to N+1 is ~10 given my handwaving but in practice it is ~7 and I am waiting for a paper to be published to explain that. The total write-amp is computed from the writes required to move a row change from the memtable to the max level. The write-amp is 1 for the redo log, 1 for the memtable flush and usually ~1 for compacting to level 1. Assuming the LSM has levels 0 to 4 and the per-level write-amp is 7 for levels 2 to 4 then the total write-amp is 24 -- 1 + 1 + 1 + 7 + 7 + 7.

From the examples above the LSM has less write-amp than the B-Tree but those examples were not meant to be compared. An LSM tends to have less write-amp than a B-Tree. When using flash storage this means the device will last longer. When using disk storage this is likely to save more IO capacity for reads leading to higher QPS.

The IO pattern for a busy LSM is concurrent streams of IO. Each stream writes files sequentially, but the writes from different streams can end up in the same logical erase block (logical means it is striped across many NAND chips). The level of the leveled compaction LSM predicts the lifetime of the write. Writes to level 0 have a short lifetime. Writes to level 4 have a much longer lifetime. The write rates per level are similar -- there might be 10 MB/second of writes to levels 0 and 1 and then 20 MB/second of writes to levels 2 through 4. This means that logical erase blocks will end up with a mix of long and short lived data and the long-lived data will get copied out during flash garbage collection. Does your flash device use logical erase blocks? If it does then there will be write-amp from flash GC even with an LSM. Flash devices that support multi-stream will help a lot.

Space Amplification

A B-Tree gets space-amp from fragmentation, per-row metadata and fixed page sizes on disk. The leaf pages in a B-Tree are between 50% and 70% full when subject to random updates. When they are 2/3 full then space-amp is 1.5 and when they are 1/2 full then space-amp is 2. An update-in-place B-Tree like InnoDB uses ~20 bytes/row for metadata to support consistent read and transactions. The metadata overhead is much smaller for an LSM like MyRocks. Finally, when compression is done for InnoDB there will be wasted space because page sizes are fixed on disk. When a 16kb in-memory page compressed to 5kb for a table that uses 8kb pages on disk, then 3kb of the 8kb page on disk is wasted.

An LSM gets space-amp from old versions of key-value pairs. Leveled and size-tiered compaction differ significantly in this regard. With leveled compaction you are likely to get space-amp of 1.1 or 1.2 and with size-tiered compaction a more common result is space-amp of 2. Size-tiered compaction can suffer even more from additional but temporary space-amp when the max file is compacted and disk space is required for the old and new version of that file.

Compression reduces space-amp and for this reason I claim that space-amp of less than 1 is possible.

B-Tree vs LSM in practice

This post is longer than I expected, so I will write less here. This is a good spot for a joke about space-amp and write-amp. I have begun reporting on read, write and space amplification by normalizing the server's IO and CPU rates by QPS during benchmarks. I use iostat to get data for on-disk read-amp and write-amp by measuring reads/second, MB read/second and MB written/second. I frequently ignore writes/second because that mixes fast and slow writes (redo log writes are fast, page writes are slow). I use vmstat to measure the CPU utilization and that is a proxy for the in-memory read-amp and write-amp. Finally I look at the size of the database on disk to compare space-amp. The data is usually measured over 1 hour intervals to make it easy to detect when metrics get worse as a database ages. I try to run workloads for at least 12 hours to give things time to go bad.

Percona has begun doing this for some benchmark reports. I hope this becomes a common practice.

This is an example from running Linkbench for MongoDB with RocksDB and WiredTiger. I will soon have more results for MyRocks. I am thrilled that we have a copy-on-write B-Tree (WiredTiger) and an LSM (RocksDB) available as storage engines in MongoDB. We are also bringing RocksDB to MySQL via the MyRocks effort. The big deal for MyRocks compared to InnoDB is half the space-amp and half the write-amp. This has been measured on Linkbench and on the real workload. This is a big deal.

Monday, November 23, 2015

Read, write & space amplification - pick 2

Good things come in threes, then reality bites and you must choose at most two. This choice is well known in distributed systems with CAP, PACELC and FIT. There is a similar choice for database engines. An algorithm can optimize for at most two from read, write and space amplification. These are metrics for efficiency and performance. This means one algorithm is unlikely to be better than another at all three. For example a B-Tree has less read amplification than an LSM while an LSM has less write amplification than a B-Tree. I abbreviate the metrics as read-amp, write-amp and space-amp. I also abbreviate this as the framework.

The framework assumes a database workload that consists of point-queries, range-queries of length N and writes. Were I to add a delete operation then this would match the RocksDB and LevelDB API. The write is a blind-write as it doesn't imply a read prior to the write.

This is part one of a topic that requires several blog posts. The second post will compare a B-Tree and LSM using the framework. The third post will argue that an algorithm cannot be optimal for all three metrics.

Purpose

Read, write and space amplification explain performance and efficiency when evaluating algorithms for real and potential workloads. They aren't a replacement for Big O notation. They usually assume a specific workload and configuration including RAM size, database size and type of storage.

We began using the framework to compare InnoDB and RocksDB because better performance is an insufficient metric on which to choose an algorithm. Endurance (write amp) and capacity (space amp) matter when using flash. IOPs (read amp for point and range queries, write amp for writes) matters when using disk.

The framework is useful for understanding the compromises made in search of better QPS. It is easy to trade write for space or read efficiency in write-optimized algorithms but these trades should be disclosed because they are not free. New algorithms can show better write throughput than RocksDB by making range reads less efficient but the Linkbench workload needs efficient writes and efficient range reads.

The framework is useful because key comparisons aren't created equal. Traditional algorithm analysis is great for understanding in-memory performance via bounds on the number of key comparisons. But big-O notation is harder to use when some keys are read from cache, others from RAM and some from disk. Constant factors matter. The difference between 1.2 and 1.5 disk reads per query can be a big deal.

Read amplification

Read-amp is the amount of work done per logical read operation. This can be defined for in-memory databases, persistent databases assuming no cache (worst-case behavior) and persistent databases assuming some cache (average-case behavior). The work done in-memory can be the number of key comparisons and traditional algorithm analysis can be used. The work done on-disk includes the number of bytes transferred and seeks (seeks matter on disks, not on NVM). The work done can also include the cost of decompressing data read from storage which is a function of the read block size and compression algorithm.

Read-amp is defined separately for point and range queries. For range queries the range length matters (the number of rows to be fetched). In Linkbench the average range query fetches about 20 rows.

Read-amp can also be defined for point queries on keys that don't exist. Some algorithms use a bloom filter to avoid disk IO for keys that don't exist. Queries for non-existent keys is common in some workloads. Bloom filters can't be used for a range query. The most frequent query in Linkbench is a range query that includes an equality predicate on the first two columns of the range query index. With RocksDB we define a prefix bloom filter to benefit from that.

Write amplification

Write-amp is the amount of work done per write operation. This can include the number of bytes written to storage and disk seeks per logical write operation. This can be split into in-memory and on-disk write-amp but I frequently ignore in-memory write-amp.

There is usually a cost to pay in storage reads and writes following a logical write. With write-amp we are ignoring the read cost. The read cost is immediate for an update-in-place algorithm like a B-Tree as a page must be read to modify it. The read cost is deferred for a write-optimized algorithm like an LSM as compaction is done in the background and decoupled from the logical write. There is usually some write cost that is not deferred - updating in-memory structures and writing a redo log.

With flash storage there is usually additional write-amp from the garbage collection done by the FTL to provide flash blocks that can be rewritten. Be careful about assuming too much about the benefit of sequential and large writes from a write-optimized database engine. While the physical erase block size on a NAND chip is not huge, many storage devices have something that spans physical erase blocks when doing GC that I will call a logical erase block. When data with different lifetimes ends up in the same logical erase block then the long-lived data will be copied out and increase flash GC write-amp (WAF greater than 1). I look forward to the arrival of multi-stream to reduce flash GC WAF.

Space amplification

Space-amp is the ratio of the size of the database to the size of the data in the database. Compression decreases space-amp. It is increased by fragmentation with a B-Tree and old versions of rows with an LSM. A low value for space-amp is more important with flash storage than disk because of the price per GB for storage capacity.

Efficiency & Performance

I work on small data systems. Small data is another name for OLTP. Small data workloads are highly concurrent and with concurrency better efficiency usually implies better performance. But performance and efficiency are not always strongly correlated. For example an algorithm with a high read-amp for range queries might hide the extra latency by doing disk reads in parallel. This improves response time but doesn't improve efficiency and the algorithm with less read-amp will sustain more QPS at higher concurrency.

Wednesday, November 18, 2015

How does MongoDB do on Linkbench with concurrency?

I started to study performance math with a focus on queueing theory and the USL. To get data for the USL I ran Linkbench for MongoDB with different levels of concurrency. I tested WiredTiger in MongoDB 3.2.0rc0 and 3.0.7 and then RocksDB in MongoDB 3.2.0. The performance summary is:

WiredTiger load scales much better in 3.2 compared to 3.0
RocksDB throughput is stable at high concurrency
WiredTiger 3.2 throughput is stable at high concurrency. For WiredTiger 3.0 the load rate drops significantly at high concurrency and the query rate also has an odd drop.

Configuration

The test server has 2 sockets with 6 CPU cores/socket and hyperthreading enabled to get 24 HW threads. The server also has 6 SAS disks with HW RAID 0 and 1 SSD. I intended to use the disk array for all tests but ended up using the SSD for the WiredTiger 3.0.7 test. The server also has 144G of RAM and the test database was cached by mongod for all tests. The oplog was enabled but sync-on-commit was not done. For Linkbench I set maxid1 to 10M and the test pattern was load, run for 30 minutes, run for 30 minutes and the query rate is reported for the second 30 minute run. Snappy compression was used for all tests. The test was run for 1 to 48 concurrent clients with MongoDB 3.2.0 and stopped at 30 concurrent clients for MongoDB 3.0.

Data

Results are on gist.github for the load and query tests.

Load

WiredTiger throughput for write-heavy workloads is much better in 3.2 than 3.0 as I previously reported. From PMP output it looks like WiredTiger in MongoDB 3.0 suffers from too frequent plan-cache invalidation that I wrote about this previously. I think RocksDB insert performance saturates earlier from mutex contention on the memtable writer mutex.

Query

WiredTiger performance is the same for 3.0 and 3.2 until 21 concurrent clients. At that point the performance for 3.0 drops and response time for all operations is slower (reads & writes). This is odd and I won't try to debug it. For RocksDB I have an educated guess at the problem (need to use SingleDelete optimization in MongoRocks to reduce overhead from tombstones) to explain why throughput for it is worse than WiredTiger.

Thursday, November 12, 2015

Define better for a small-data DBMS

There are many dimensions by which a DBMS can be better for small data workloads: performance, efficiency, manageability, usability and availability. By small data I mean OLTP. Performance gets too much attention from both industry and academia while the other dimensions are at least as important in the success of a product. Note that this discussion is about which DBMS is likely to get the majority of new workloads. The decision to migrate a solved problem is much more complex.

Performance - makes marketing happy
Efficiency - makes management happy
Manageability - makes operations happy
Usability - makes databased-backed application developers happy
Availability - makes users happy

Performance makes marketing happy because they can publish a whitepaper to show their product is faster than the competition and hope that the message morphs from X is faster than Y in this context into X is faster than Y. This is the art of benchmarketing. It can be risky to use average throughput as a performance metric unless there was a response time SLO. Otherwise good average throughput can hide lousy variance like the page writeback stalls in InnoDB that Percona spent a lot of time making better. When a benchmark result doesn't include an SLO then you need to see throughput over time or response time histograms (response time at the 95th percentile for example).

Efficiency makes management happy because less hardware is able to do more work. A simple way to report this is to list the amount of disk IO and CPU used per transaction by dividing rates from iostat and vmstat by the QPS or TPS as I have been doing in recent benchmark reports. Better efficiency might not imply better response time but it usually implies better throughput for highly concurrent workloads. Efficiency has several dimensions including read, space and write and one database engine is unlikely to be optimal for all of them, but that is for another blog post.

Manageability makes life easier for the operations team that supports the DBMS. They shouldn't waste time manually fixing problems when the fix can be automated. Oncall week shouldn't mean you wake up every few hours to replace failed servers. Replacement of failed masters for MySQL isn't automated at many web-scale MySQL deployments and I feel sorry for the oncall at those places. With MySQL 5.7 the pieces exist for a solution and I hope these are combined into an open-source solution similar to what we have.

Usability makes life easier developers who write database-backed applications. Developers shouldn't spend time reinventing the wheel, thus my expectation of per-shard support for consistent read and transactions. A declarative query language and good-enough query optimizer also make life easier for everyone. I consider the MongoDB query API to be declarative. My feature list is longer but I won't include it here.

Availability - this makes users happy. There is big and small downtime. Both can significantly reduce availability. An example of big downtime is taking 30 minutes to replace a failed master. An example of small downtime is slow commit courtesy of synchronous geo-replication. Small downtime doesn't get enough attention. When commit to one row takes 100 milliseconds and the workload demands 100 commits/second to that row then there will be 90 commits/second that can't get done and the user experience for can't get done is similar to database down.

MySQL and MongoDB

It is interesting to compare the upcoming releases of MongoDB (3.2) and MySQL (5.7) by these metrics. MongoDB has a large lead in manageability for web-scale deployments (sharded replica sets) and this continues with the 3.2 release. MongoDB comes with support for failover automation, while MySQL does not. MongoDB 3.2 has the potential to provide much better availability than a MySQL deployment that lacks automated master failover (we have that, upstream MySQL does not) although the new code in MongoDB needs time to mature.

MySQL has a large lead in performance based on the benchmarks I have run. Better response time and throughput are important but won't determine whether MongoDB or MySQL is chosen for a new workload. MySQL also has a large lead in usability because many workloads need per-shard transactions, per-shard consistent read and per-shard joins. It isn't hard to add support for per-shard transactions and consistent read to MongoDB given they are already provided by WiredTiger and RocksDB and I expect the MongoDB to match MySQL in usability in the next few years.

MongoDB and MySQL are similar for efficiency. Both have read & write optimized engines including MongoRocks and MyRocks for write optimized.

How does this end? MongoDB becomes better by improving usability. MySQL becomes better by improving manageability and availability. Both have long track records of steady improvement although MongoDB is moving faster. The product that gets there first is likely to get the majority of new workloads.

Wednesday, November 4, 2015

MongoDB 3.2 vs Linkbench

I used LinkbenchX to compare performance and efficiency for MongoDB 3.2.0rc0 vs 3.0.7 with the RocksDB and WiredTiger engines. The Linkbench test has two phases: load and query. The test was run in three configurations: cached with data on disk, too big to cache with data on disk and too big to cache with data on SSD. My summary:

Performance:

load rates are similar for disk and SSD with RocksDB and WiredTiger
load rate for WiredTiger is ~2X better in 3.2 versus 3.0
load rate for WiredTiger is more than 2X better than RocksDB
query rate for WiredTiger is ~1.3X better than RocksDB for cached database
query rate for RocksDB is ~1.5X better than WiredTiger for not cached database

Efficiency:

disk space used is ~1.33X higher for WiredTiger vs RocksDB
disk bytes written per document during the load is ~5X higher for RocksDB
disk bytes written per query is ~3.5X higher for WiredTiger
RocksDB uses ~1.8X more CPU during the load
WiredTiger uses ~1.4X more CPU during the query phase
with a 32G block cache mongod RSS is ~42G with WiredTiger vs ~34G with RocksDB

LinkbenchX

LinkbenchX is a fork of Linkbench. Source for Linkbench and LinkbenchX is in github. LinkbenchX adds support for MongoDB and an option to sustain a fixed request arrival rate. For this test I use the MongoDB support but not the fixed request arrival rate by using bin/linkbench from Linkbench. I am grateful to Percona for porting Linkbench to MongoDB.

The Linkbench workload requires transactions to update the count collection when a document is added to or removed from the link collection. MongoDB doesn't support per-shard transactions and the Linkbench results will be incorrect. I understand that cross-shard transactions are hard, but per-shard transactions and per-shard consistent read are valuable for Linkbench and for selling to the enterprise. I hope they arrive in MongoDB 3.4.

Linkbench is run in two phases: load and query. I configured Linkbench to use 12 threads for both phases. The query phase was done as a sequence of 1-hour tests to measure whether performance and efficiency changed over time.

For the cached database test I set the value of maxid1 in the Linkbench file config/FBWorkload.properties to 50,000,001 and the database block cache to 32G. The database was cached by the OS but not WiredTiger or RocksDB in this setup as the compressed database was ~50G.

For the not cached database test I set the value of maxid1 to 250,000,001, the database block cache to 8G and started a background process to use mlock to leave at most 40G for mongod and the OS page cache. The database was at least 150G and the workload used lot of storage IO.

For all tests I changed the Linkbench file config/LinkConfigMongoDBv2.properties to use transaction_support_level=1, requesters=12, loaders=12, maxtime=3600, requestrate=0 and requests=100000000000. Fsync is not done on commit.

The test server has 24 cores with hyperthreading enabled, 144G of RAM and either one 400G SSD (Intel DC s3700) or 6 SAS disks with HW RAID 0. The server uses Fedora release 19, Linux 3.14.27-100, gcc 4.8.3 and MongoDB was linked with tcmalloc.

Snappy compression was used for both RocksDB and WiredTiger. For all tests this was in the mongod configuration file:

processManagement:
fork: true
systemLog:
destination: file
logAppend: true
storage:
syncPeriodSecs: 600
journal:
enabled: true
storage.wiredTiger.collectionConfig.blockCompressor: snappy
storage.wiredTiger.engineConfig.journalCompressor: none

For the not cached database this was in the mongod configuration file:

replication.oplogSizeMB: 4000
storage.wiredTiger.engineConfig.cacheSizeGB: 8
storage.rocksdb.cacheSizeGB: 8

For the cached database this was in the mongod configuration file:

replication.oplogSizeMB: 8000
storage.wiredTiger.engineConfig.cacheSizeGB: 32
storage.rocksdb.cacheSizeGB: 32

Legend

The legend for the data in the following sections. The disk and CPU metrics are collected from iostat and vmstat. Most rates below are normalized by the rate for operations/second where operation is an insert during the load phase and query during the query phase. The real insert rate across all collections is reported for ops below but I use the number of inserts to the node collection (50M or 250M) when normalizing the iostat and vmstat rates.

ops - average rate for operations/second (inserts or queries per second)
db.gb - database size in GB (two numbers are from du without and with --apparent-size)
r/o - disk reads per operation
wKB/o - disk KB written per operation
cs/o - context switches per operation
cpu/o - CPU/operation from vmstat us+sy divided by ops multiplied by 1M
rss - RSS from ps for the mongod process
setup - wt (wiredtiger), rx (rocksdb), 307 (mongo 3.0.7), 320 (mongo 3.2.0rc0), op0/op1 - oplog off/on

Cached database

For this test there were 50M, ~220M and X docs in the node, link and count collections after the load phase. In addition to the conclusions listed at the top of this post, WiredTiger 3.2 vs 3.0 uses less CPU and has fewer context switches per insert. It is good to see it become more efficient and the insert rate for WiredTiger has almost doubled from 3.0 to 3.2.

The context switch rate per insert is much larger for RocksDB because of the global mutex that serializes inserts into the memtable. There are no disk reads during this test because the database fits in the OS page cache. The CPU rate for WiredTiger is also much higher during the load. That might be a side effect of more mutex contention.

The difference in database sizes for WiredTiger vs RocksDB is small after the load but grows quickly during the run phases. I did not try to debug it but the growth for WiredTiger could be a problem. WiredTiger also uses much more memory than RocksDB. But I don't know whether that is a fixed overhead (~8G) or a relative overhead (30% of the block cache size).

Using the oplog doubles the wKB/o rate because writes are done twice -- once to the oplog, once to the database. The internal write-amplification reported by RocksDB for rx.320.op1 is 6.1.

--- load
ops db.gb r/o wKB/o cs/o cpu/o setup
51416 53/36 0.0 1.355 2.8 2171 wt.320.op0
46156 44/40 0.0 2.316 4.0 2460 wt.320.op1
28171 47/41 0.0 1.358 0.9 3161 wt.307.op0
28080 46/35 0.0 2.304 1.8 3520 wt.307.op1
26654 31/31 0.0 5.318 16.0 3787 rx.320.op0
19033 36/36 0.0 11.643 18.4 4881 rx.320.op1

--- run, 2nd hour
ops db.gb r/o wKB/o cs/o cpu/o rss setup
14483 86/72 0.0 2.486 3.6 2170 42G wt.320.op1
14312 78/71 0.0 2.338 3.6 2259 43G wt.307.op1
10794 38/38 0.0 1.357 3.9 2470 35G rx.320.op1

--- run, 12th hour
ops db.gb r/o wKB/o cs/o cpu/o rss setup
13042 100/90 0.0 2.588 4.1 2378 36G wt.320.op1
12742 94/88 0.0 2.623 4.0 2414 43G wt.307.op1
10550 45/45 0.0 1.491 4.1 2533 35G rx.320.op1

Not cached, disk array

For this test there were 250M, 1B and X documents in the node, link and count collections after the load phase. The database did not fit in RAM and the disk array can do ~1200 IOPs. The query phase was run for 24 hours.

The new result here is that RocksDB sustained a much higher QPS rate during the query phase. From the response times listed at the end of this post the difference appears to be a better response time for the most frequent operation -- GET_LINKS_LIST -- which is a short range scan. RocksDB also benefits from a better cache hit rate because the database is smaller and the r/o rate is slightly smaller. It also uses less IO capacity for writes (wKB/o is smaller and writes are less random) leaving more IO capacity for reads.

-- load
ops db.gb r/o wKB/o cs/o cpu/o setup
40667 191/177 0.0 2.319 2.8 2803 wt.320.op1
25149 191/178 0.0 2.306 1.8 4041 wt.307.op1
18725 153/153 0.0 11.568 18.7 4968 rx.320.op1

--- run, 2nd hour
ops db.gb r/o wKB/o cs/o cpu/o rss setup
504 199/188 0.842 5.326 8.0 5481 12G wt.320.op1
507 206/187 0.798 5.171 7.5 8013 13G wt.307.op1
850 153/153 0.746 1.743 5.3 3684 11G rx.320.op1

--- run, 12th hour
ops db.gb r/o wKB/o cs/o cpu/o rss setup
491 196/195 0.831 5.173 8.1 5700 12G wt.320.op1
488 195/195 0.794 5.273 7.6 8380 12G wt.307.op1
864 155/155 0.725 1.588 5.4 3967 11G rx.320.op1

--- run, 24th hour
ops db.gb r/o wKB/o cs/o cpu/o rss setup
494 199/197 0.799 5.404 8.1 5615 10G wt.320.op1
471 197/197 0.814 5.303 7.8 8564 12G wt.307.op1
833 156/156 0.738 1.721 5.5 4301 10G rx.320.op1

Not cached, SSD

RocksDB sustained a higher QPS than WiredTiger for the query phase similar to the result for a disk array with a not cached database. I didn't expect that result here or for the disk array.

--- load

ops db.gb r/o wKB/o cs/o cpu/o setup

40742 195/180 0.0 2.318 2.9 2798 wt.320.op1

25308 188/177 0.0 2.306 1.8 4007 wt.307.op1

18603 155/154 0.0 11.458 18.6 5005 rx.320.op1

--- run, 2nd hour

ops db.gb r/o wKB/o cs/o cpu/o rss setup

3870 229/213 0.821 4.869 6.3 4179 11G wt.320.op1

3814 228/210 0.813 4.620 6.1 4163 13G wt.307.op1

6146 155/155 0.735 1.344 5.3 3063 11G rx.320.op1

--- run, 12th hour

ops db.gb r/o wKB/o cs/o cpu/o rss setup

3715 232/221 0.855 4.810 6.6 4449 9G wt.320.op1

3415 223/217 0.825 4.582 6.4 4332 11G wt.307.op1

5827 162/162 0.776 1.356 5.6 3239 11G rx.320.op1

Response time metrics

These are response time metrics for each configuration. These are printed by Linkbench at the end of each run. The most interesting result is for the GET_LINKS_LIST operation which uses a short range scan (~10 rows on average). For the cached database, the p99 for RocksDB is ~14ms vs ~3ms for WiredTiger. I think the problem for RocksDB is from too many tombstones. We recently came up with a more efficient way to remove them in MyRocks. The p99 for RocksDB in the not cached databases (disk & SSD) is better than WiredTiger and ~12ms for ssd, ~47ms for disk.

RocksDB compaction stats

This is the compaction IO statistics from RocksDB at the end of the 24th 1-hour query run for the no cached, disk configuration.