Tuesday, September 20, 2016

MyRocks and InnoDB with large objects and compression

I ran tests to explain the difference between MyRocks and InnoDB when storing large objects and data with varying amounts of compressibility.

Compression in MyRocks is simpler than in InnoDB. You should expect the database to use about 1.1X times the size of the compressed output. When rows compress to 60% of their original size and are 10kb before compression, then each row should use about 6.6kb in the database. The 1.1X adjustment is for space-amplification from leveled compaction.

Predicting the space used for InnoDB is harder. First, large LOB column are not stored inline and overflow pages are not shared. Second, disk pages have a fixed size and you risk using too much space or getting too many page splits when searching for a good value for key_block size. More details are here.

I ran two tests for two types of data. The first test is an insert only workload in PK-order for data with varying amounts of compressibility. The second test determined how fast point queries could be done on that data while rate-limited inserts were in progress. By varying amounts of compressibility I mean that there was one large varchar column per row and that 20%, 45%, 75% or 95% of the data in the column was random and the remainder was constant and easily compressed. Both tests used one connection for inserts. The query test also used one connection for queries.

The test pattern was run twice. In both cases the large column was a varchar. In the first case it had a length between 10,000 and 20,000 characters. In the second case it had a length between 100 and 1000 characters. The database block size was 16kb for MyRocks and InnoDB.

Insert only

For the insert-only workload the space used for MyRocks can be predicted from the compressibility of the data. That is much less true for InnoDB. For example compressed InnoDB uses about the same amount of space for pctRand in 20, 45 and 75.

MyRocks used the least amount of space. InnoDB used much more space when the column was larger (10,000 to 20,000 vs 100 to 1000). Overflow pages are the root cause.

The insert rates are better for MyRocks than for InnoDB. They were also stable for MyRocks and uncompressed InnoDB independent of the compressibility. Rates for uncompressed InnoDB are better than compressed InnoDB. While this wasn't a performance benchmark, it matches many other results I get. It is hard to get performance and compression from InnoDB.  The CPU overhead per insert was similar between MyRocks and uncompressed InnoDB. CPU overheads were mostly larger for compressed InnoDB.

Legend for the data:
  • ips - inserts per second
  • size - database size in GB at test end
  • Mcpu - microseconds of CPU per insert
  • pctRand - percentage of random data in large column
  • engine - rx.zlib-6 is MyRocks with zlib level 6 compression. i6n is InnoDB in MySQL 5.6.26 without compression. i6c is InnoDB in MySQL 5.6.26 with compression.

column up to 20,000       column up to 1000
ips     size    Mcpu      ips     size    Mcpu    pctRand engine
5489      7.7    1090     34468   11      151     20      rx.zlib-6
5540     16      1127     34824   19      149     45
5532     24      1307     34517   27      166     75
5523     30      1467     34701   33      160     95

ips     size    Mcpu      ips     size    Mcpu    pctRand engine
3995     87       933     23470   66      173     20      i6n
3981     87       928     23704   66      174     45
3981     86       917     23487   66`     175     75
3995     88       914     23658   66      176     95

ips     size    Mcpu      ips     size    Mcpu    pctRand engine
3339     36      1064     13429   33      262     20      i6c
2779     32      1278     13124   33      271     45
2133     35      1750      8767   30      392     75
1757     50      2061      7228   38      461     95

Point queries

MyRocks provides the best compression, the best query throughput, and the east CPU overhead per query. My conclusions for InnoDB space consumption are similar to the results from the insert-only workload.

Legend for the data:
  • qps - queries per second
  • size - database size in GB at test end
  • Mcpu - microseconds of CPU per query
  • pctRand - percentage of random data in large column
  • engine - rx.zlib-6 is MyRocks with zlib level 6 compression. i6n is InnoDB in MySQL 5.6.26 without compression. i6c is InnoDB in MySQL 5.6.26 with compression.

qps     size    Mcpu      qps     size    Mcpu    pctRand engine
 984      9.3    4308     2214    11      1585    20      rx.zlib-6
 910     19      4532     2113    19      1627    45
 846     30      4952     2102    27      1601    75
 795     37      5598     2051    33      1691    95

qps     size    Mcpu      qps     size    Mcpu    pctRand engine
 628    113      6240     1302    62      2527    20      i6n
 624    110      6226     1300    63      2501    45
 624    114      6312     1302    63      2536    75
 628    115      6218     1305    66      2474    95

qps     size    Mcpu      qps     size    Mcpu    pctRand engine
 708     38      5560      770    34      4450    20      i6c
 629     39      6643      687    34      4895    45
 513     44      8494      589    30      6046    75
 418     57     10619      576    39      6599    95

Thursday, September 15, 2016

Peak benchmarketing season for MySQL

Maybe this is my XKCD week. With Oracle Open World and Percona Live Amsterdam we are approaching peak benchmarketing season for MySQL. I still remember when MySQL 4.0 was limited to about 10k QPS on 4 and 8 core servers back around 2005, so the 1M QPS results we see today are a reminder of the great progress that has been made thanks to investments by upstream and the community.

In General

But getting 1.5M QPS today compared to 1M QPS last year isn't at the top of the list for many (potential) users of MySQL. I use performance, usability, mangeability, availability and efficiency to explain what matters for web-scale DBMS users. My joke is that each of these makes a different group happy: performance -> marketing, usability -> developers, manageability -> operations, availability -> end users, efficiency -> management.

The benchmarketing results mostly focus on performance. Whether InnoDB does a bit more QPS than Amazon Aurora isn't going to make Aurora less popular. Aurora might have excellent performance but I assume people are deploying it for other reasons. I hope we make it easier to market usability, manageability, availability and efficiency in the MySQL community. MongoDB has gone a long way by marketing and then delivering usability and manageability.

Even when limited to performance we need to share more than peak QPS. Efficiency and quality-of-service (QoS) are equally important. QPS without regard to response time is frequently a bogus metric. I get more IOPs from a disk by using a too large queue depth. But more IOPs at the cost of 100 millisecond disk read response times is an expensive compromise. Even when great QPS is accompanied by a good average response time I want to know if there is lousy QoS from frequent stalls leading to lousy 99th percentile response times. Percona has built their business in part by being excellent at documenting and reducing stalls in InnoDB that occur on benchmarks and real workloads.

I have been guilty of sharing too many benchmark reports in the past that ignored efficiency and QoS. I have been trying to change that this year and hope that other providers of MySQL performance results do the same. This is an example of a result that includes performance, efficiency and QoS.

MyRocks and RocksDB

A lot of the RocksDB marketing message has been about performance. Database access is faster with an embedded database than client/server because you avoid network latency. The MyRocks message has been about efficiency. The target has been better compression and less write amplification than InnoDB so you can use less SSD and lower-endurance SSD. For a workload I care about we see 2X better compression and 1/10 the write rate to storage. This is a big deal.

When starting the project we had many discussions about the amount of performance loss (reduced QPS, higher response time) we could tolerate to get more efficiency. While we were vague the initial goal was to get similar QPS and response time to InnoDB for real workloads, but we were willing to accept some regressions. It turned out that there was no regression and similar performance with much better efficiency is a big deal.

But benchmarks aren't real workloads and there will soon be more benchmark results. Some of these will repeat what I have claimed, others will not. I don't expect to respond to every result that doesn't match my expectations. I will consult when possible.

One last disclaimer. If you care about read-mostly/in-memory workloads then InnoDB is probably an excellent choice. MyRocks can still be faster than InnoDB for in-memory workloads. That is more likely when the bottleneck for InnoDB is page write-back performance. So write-heavy/in-memory can still be a winner for MyRocks.

Seriously, this is the last disclaimer. While we are bickering about benchmark results others are focusing on usability and manageability and getting all of the new deployments.

Wednesday, September 14, 2016

Excited about Percona Live Amsterdam

I look forward to attending and speaking at Percona Live Amsterdam. There are presentations and people with expertise on MySQL, MongoDB and PostgreSQL. This is a great chance to learn from people with different experience and different expertise. I hope you make the most of it.

A great way to learn is to ask questions like:
  • Why are you using X?
  • What problems are you trying to solve?

A great way to end the conversation early and avoid learning anything is to ask questions like:
  • Why aren't you using Y? 
  • You should be using Z! (this isn't a question)

zlib vs zstd for MyRocks running Linkbench

I used an IO-heavy configuration to determine the impact of zstandard vs zlib compression for MyRocks. There was about 1 read from SSD per transaction and decompression is done after each page read from the OS page cache and storage.

The results are impressive. Zstandard compresses like zlib level 1 but uses much less CPU.
  • zstandard reduces CPU by 45% vs zlib level 1 for the load test
  • zstandard reduces CPU by 11% vs zlib level 1 for the query test
  • zstandard gets 8% more TPS vs zlib level 1 for the query test


Configuration for MyRocks is still complex. The templates for the MyRocks my.cnf files for Linkbench and general usage are explained on the wiki. I used no compression for L0, L1, L2, then lz4 for all but the max level and then one of zlib level 1, zlib level 6 or zstd for the max level. The tests used an Aug5 build of MyRocks, so this used kZSTDNotFinalCompression as the build preceded the 1.0 release of zstandard.

The test host has 50G of RAM available to userland, fast storage (5TB of NVMe MLC) and 24 CPU cores with 48 HW threads. The RocksDB block cache was set to 10G, the binlog was disabled but sync-on-commit was disabled for the binlog and RocksDB. Linkbench is run with maxid1=1B, the load test uses 2 clients and the query tests use 16 clients. Query tests are run as 24 1-hour loops and I report metrics from the 24th hour. I used my branch of linkbench and support scripts.


The results for zstandard are impressive. I look forward to using this in production. Thanks Yann.

  • ips/tps - inserts & transactions per second
  • r/i, r/t - iostat reads per insert and per transaction
  • wKB/i, wKB/t - iostat KB written per insert and per transaction
  • Mcpu/i, Mcpu/t - usecs of CPU time per insert and per transaction
  • size - database size in GB
  • rss - mysqld RSS size in GB
  • un, gn, ul, gl - p99 response time in milliseconds for the most frequent transactions (Update Node, Get Node, Update Link, Get Link List)

Results for the load

ips     r/i     rKB/i   wKB/i   Mcpu/i  size    rss     engine
61543   0       0       0.98     81     324     3.1     zstd
61504   0       0       0.98    146     331     2.0     zlib-1
61457   0       0       0.97    153     312     2.2     zlib-6

Results for the 24th hour of the query test

tps    r/t   rKB/t   wKB/t  Mcpu/t  size  rss   un    gn   ul  gl   engine
39366  1.00  10.38   2.36    878    377   12.2  0.6   0.6  1   0.8  zstd
36524  1.00  10.47   2.45    992    381   12.1  0.7   0.6  1   0.9  zlib-1
37233  0.97   9.76   2.30   1002    360   12.0  0.7   0.7  1   0.9  zlib-6

Wednesday, September 7, 2016

Tuning the RocksDB block cache

I spent many years using InnoDB with direct IO and I didn't worry about buffered IO performance. Well, I didn't worry until Domas told me to worry. My focus has switched to RocksDB and now I worry about buffered IO performance. Fortunately, another co-worker (Jens Axboe) promises to make buffered writeback much better.

With direct IO, InnoDB stores compressed and uncompressed pages in the InnoDB buffer pool. It has a clever algorithm to determine how much memory to use for each based on whether the workload appears to be IO or CPU bound. My vague memory is that we tune my.cnf to keep it from being too clever.

With buffered IO, RocksDB manages a block cache for uncompressed blocks and then depends on the OS page cache for compressed blocks. While I think there is an opportunity to be more efficient in that area, that is not the topic for today.

The question today is how to divide memory between the RocksDB block cache and the OS page cache. I have read tuning advice for other buffered IO databases that suggest giving as much RAM as possible to the database. I disagree and my advice is:
  1. If the uncompressed working set fits in the RocksDB block cache then give as much RAM as possible to the block cache.
  2. Else if the compressed working set fits in the OS page cache then give most RAM to the OS page cache by using a small RocksDB block cache.
  3. Else give the RocksDB block cache about 20% of host RAM.
This is a rule of thumb. Sometimes in rule 3 I suggest giving 25% or 30% to the block cache, but I hope you get the point. The goal is to avoid reads from storage by caching more data in RAM. I assume that decompressing a block is much faster than reading it from storage which is more likely when you use zstandard.

This isn't proven unless you accept proof by anecdote. I ran a test with Linkbench on a host with 50G of RAM and a ~350G database. The test was repeated with the RocksDB block cache set to 5G, 10G, 20G and 35G. Using a smaller block cache reduced the storage read cost per transaction by between 10% and 20% using iostat r/s and iostat rKB/s. My advice might not work for you, but might help you to consider your choices before following tuning advice you read on the web.

Tuesday, August 23, 2016

Web-scale MySQL back in the day

I spent a few years at Facebook where I was extremely busy helping to make MySQL better at web-scale. I worked a lot with Domas. He found so many problems and I helped fix them along with a few others (the MySQL db-eng team was small). Domas made it easy to understand what was broken and there was a lot of low-hanging fruit. This slide deck is one perspective on what we did. I doubt I have the energy to go through another few years like that, but it was a great time. The timing was also right as there were many people at Oracle/MySQL pushing to make MySQL scale on modern hardware.

Wednesday, June 8, 2016

RocksDB, InnoDB and the insert benchmark

Today I share results for RocksDB and InnoDB using MySQL 5.6 and the insert benchmark. I have many more results that I am currently sharing with engine vendors to help them improve (WiredTiger) or to show them the potential of RocksDB. Percona has embraced MongoRocks. I hope to see similar support for MyRocks in a widely used MySQL distribution.

tl;dr - RocksDB did better than InnoDB for a write-heavy workload and a range-scan heavy workload. The former is expected, the latter is a welcome surprise. I hope we get RocksDB into a supported MySQL distribution to make it easier for you to try it.

I have been running several database benchmarks on a variety of hardware, storage, database engines and configurations. The benchmarks include the insert benchmark, linkbench and sysbench. The hardware includes dual-socket, many-core servers with 144G or more of RAM and a 1-socket, few-core server.  The few-core server has 2 cores, 4 HW threads and is an Intel NUC 5i3ryh with 8G of RAM and a Samsung m.2 SSD. The storage includes m.2 SSD (Samsung 850 EVO 110G), older PCIe SSD and newer NVMe SSD. The engines include RocksDB, InnoDB, MyISAM and TokuDB for MySQL and RocksDB, WiredTiger and mmapv1 for MongoDB. The configurations include databases that fit in cache and databases that do not. Even when the database fits in cache the workload can be IO-bound from the writes required to persist a database.


If you just want the graphs here they are.

Test pattern

I used the small server (Intel NUC, 2 cores, 8G RAM, m.2 Samsung SSD) for this test.`

The workload is the insert benchmark using the pattern described below. I use this pattern to include a mix of workloads (write-only, reads with write-heavy, reads without write-heavy). I don't include a read-only step. The mixed workload (reads + writes) is very important especially when the database is larger than RAM:
  1. Insert 250M rows into a table in PK order. The table has 3 secondary indexes and the inserts are in random order for each of the secondary indexes. Measure insert performance.
  2. Insert 5M rows into the table with the writer limited to 1000/second and measure query performance for short range queries. The range queries use LIMIT 10 and randomly select the index (secondary or PK) and the starting offset for the range. The index is covering for the query. Assuming the writer is able to do 1000/second this takes 5000 seconds. For this step I care about throughput and response time for reads. For writes my main concern is whether the rate was able to sustain the rate of 1000/second.
  3. Same as #2 except 500k rows are inserted and the writer is limited to 100/second. Assuming the writer can do 100/second this also takes 5000 seconds.

Performance & Efficiency

I am interested in performance and efficiency. I include both. Sometimes better efficiency explains better performance. Sometimes we need better efficiency even if that comes at the cost of decreased performance.

The metrics I use for performance are throughput and response time. For performance I measure the average rate for queries and inserts. I also want to determine whether there are stalls in throughput so the benchmark client reports average throughput per interval (~10 seconds each) and at test end I compute the 50th, 75th, 90th, 95th and 99th percentile per-interval throughput. For many tests including the insert benchmark it is even better to show throughput over time as some database engines suffer from stalls or throughput degrades. Some of that is visible via the p95 and p99 throughput metric.

For efficiency I measure HW consumed per operation and including CPU microseconds, storage reads, storage KB read/written. For efficiency I also include the database size and the RSS of the database process. I don't distinguish storage reads and writes done for user inserts versus those done for user queries. If the storage read rate is 200/second, the query rate is 10/second and the insert rate is 5/second then the metrics below will show 20 reads/query and 40 reads/insert.

I usually ignore storage writes for 2 reasons. First, the penalty for random writes is small with SSD. Endurance is still an issue and KB written captures that. Second, it is hard to distinguish between slow and fast write operations. Some write operations are fast because they are mostly sequential or are small log writes that benefit from battery backed write cache on a RAID device.

I use vmstat and iostat to collect efficiency metrics. Note that Linux in mid-2016 still counts trim as bytes written. So with a database engine like RocksDB that deletes files as frequently as it creates them in the steady state and a kernel+filesystem+device setup to do trim on file delete then KB written as reported by iostat will be twice the real rate. You can confirm this by comparing iostat metrics with metrics reported by your SSD.


I used a 2G block cache for RocksDB which uses the rest of RAM for compressed blocks in the OS page cache. I used a 6G buffer pool for InnoDB which used direct IO.

For RocksDB I used Facebook MySQL with the git hash 2406e9. For InnoDB I used 5.6.29 and 5.7.10. All mysqld builds used jemalloc. The binlog was enabled for all tests but sync-on-commit was disabled for the database engine and the binlog. The server runs Ubuntu 14.04.

MyRocks is changing fast. I used a build from May 7 and since then several performance bugs have been fixed. My my.cnf settings for RocksDB are also stale compared to the latest & greatest values suggested by Yoshinori. It is fun to work on a project that improving so quickly.

The my.cnf files for RocksDB, InnoDB in MySQL 5.6, InnoDB in MySQL 5.7.

Results for insert-only

  • InnoDB writes between 20X and 30X more per insert than RocksDB
  • RocksDB does between 4X and 10X more inserts/second. The difference increases as the database:RAM ratio increases because the insert rate for RocksDB is almost flat over time while InnoDB drops dramatically once the database gets larger than RAM.

* ips.av - average insert rate
* ips.75, ips.95, ips.99 - 75th, 95th and 99th percentile insert rate over the 

                           per-interval (~10 seconds) average rates
* r/i - storage reads per insert via iostat
* rkb/i, wkb/i - storage KB read and written per insert via iostat. iostat
                 overstates the wkb/i result for RocksDB by 2X because it
                 counts trim as bytes written and the result here have
                 been corrected.
* Mcpu/i - usecs CPU per insert via vmstat
* size - database size in GB at test end
* rss - mysqld RSS in GB at test end
* engine - format is $e.$comp[.skip] and .skip means that unique constraints
           are not checked (only on RocksDB). For $e, "rx" is RocksDB,
           "i7" is InnoDB in MySQL 5.7.10 and "i6" is InnoDB in MySQL 5.6.29.
           For $comp "snap" is snappy, "zlib" is zlib level 1, "zlib-6" is zlib

           level 6 and "none" is no compression.

ips.av  ips.75  ips.95  ips.99  r/i     rkb/i   wkb/i   Mcpu/i  size    rss     engine
23129   24515   12276    9515   0.02    0.25     1.55    3354   19g     2.47    rx.snap.skip
18177   18356    7283    6245   0.02    0.23     1.46    3805   17g     2.52    rx.zlib.skip
17705   15976   15282   14766   0.02    0.25     1.75    3944   18g     2.59    rx.snap
15357   14945   11532    8114   0.02    0.23     1.57    4920   16g     2.67    rx.zlib
12800   14557    5681    4694   0.01    0.19     1.46    5433   16g     2.53    rx.zlib-6
12008   10525    9355    9007   0.02    0.41     2.01    4585   27g     2.64    rx.none
 3811    2854    2607    2533   0.06    1.76    30.69    2715   42g     6.98    i7.none
 2699    1995    1497    1281   0.29    5.97    36.17    4565   42g     6.78    i6.none
 2120    1619    1338    1278   0.09    0.79    51.70   11142   23g     7.06    i7.zlib
 2121    1604    1335    1275   0.09    0.81    51.34   11137   23g     6.85    i6.zlib

Results with inserts limited to 1000/second

RocksDB gets more QPS because it does fewer reads from storage per query. That is expected for uncompressed InnoDB because that database is about 2X the size compared to RocksDB. But the compressed InnoDB database isn't much larger than RocksDB so it should be able to keep a similar amount of hot data in cache. I wonder if key prefix compression as done by RocksDB allows it to get a better cache hit rate.

RocksDB also benefits from not doing reads for non-unique secondary index pages during index maintenance. That is a write-only operation for RocksDB.

* ips.av, qps.av - average insert and query rates
* ips.99, qps.99 - 99th percentile query rate over the 
per-interval average rates
* r/q - storage reads per insert and per query query via iostat
* rkb/q, wkb/q - storage KB read and written per query 
* Mcpu/i - usecs CPU per query via vmstat
* size - database size in GB at test end
* rss - mysqld RSS in GB at test end

* engine - format is $e.$comp[.skip] and .skip means that unique constraints\
           are not checked (only on RocksDB). For $e, "rx" is RocksDB,
           "i7" is InnoDB in MySQL 5.7.10 and "i6" is InnoDB in MySQL 5.6.29.
           For $comp "snap" is snappy, "zlib" is zlib level 1, "zlib-6" is zlib
           level 6 and "none" is no compression.

ips.av  ips.99  qps.av  qps.99  r/q     rkb/q   Mcpu/q  size    rss     engine
 998     997     2877   1579    0.27     3.40   12190   19g     2.53    rx.snap
 998     997     2635   1526    0.22     2.32   11544   16g     2.34    rx.zlib
 998     997     2652   1531    0.23     2.49   12396   15g     2.45    rx.zlib-6
 999     961      256     10    6.68   123.54   29334   42g     6.98    i7.none
 999     966      256     12    6.86   128.08   42332   42g     6.79    i6.none
 999     967      218    173    6.86    58.00  110277   23g     7.07    i7.zlib
 999     966      210    161    6.71    59.37  113406   23g     6.85    i6.zlib

Results with inserts limited to 100/second

InnoDB still does many more storage reads per user request and storage reads done for secondary index maintenance don't explain why RocksDB is more efficient.

ips.av  ips.99  qps.av  qps.99  r/q     rkb/q   Mcpu/q  size    rss     engine
 100     100    3059    2339    0.22    2.59    9754    19g     2.41    rx.snap
 100     100    2773    2312    0.18    1.61    9051    15g     2.33    rx.zlib
 100     100    2899    2158    0.18    1.60    8329    15g     2.42    rx.zlib-6
 100     100     728     438    5.78   93.25   11258    42g     6.98    i7.none
 100     100     637     463    5.82   94.04   16334    42g     6.79    i6.none
 100     100     456     439    4.64   37.10   42641    23g     7.08    i7.zlib
 100     100     457     354    4.72   37.98   43306    23g     6.85    i6.zlib