Wednesday, June 8, 2016

RocksDB, InnoDB and the insert benchmark

Today I share results for RocksDB and InnoDB using MySQL 5.6 and the insert benchmark. I have many more results that I am currently sharing with engine vendors to help them improve (WiredTiger) or to show them the potential of RocksDB. Percona has embraced MongoRocks. I hope to see similar support for MyRocks in a widely used MySQL distribution.

tl;dr - RocksDB did better than InnoDB for a write-heavy workload and a range-scan heavy workload. The former is expected, the latter is a welcome surprise. I hope we get RocksDB into a supported MySQL distribution to make it easier for you to try it.

I have been running several database benchmarks on a variety of hardware, storage, database engines and configurations. The benchmarks include the insert benchmark, linkbench and sysbench. The hardware includes dual-socket, many-core servers with 144G or more of RAM and a 1-socket, few-core server.  The few-core server has 2 cores, 4 HW threads and is an Intel NUC 5i3ryh with 8G of RAM and a Samsung m.2 SSD. The storage includes m.2 SSD (Samsung 850 EVO 110G), older PCIe SSD and newer NVMe SSD. The engines include RocksDB, InnoDB, MyISAM and TokuDB for MySQL and RocksDB, WiredTiger and mmapv1 for MongoDB. The configurations include databases that fit in cache and databases that do not. Even when the database fits in cache the workload can be IO-bound from the writes required to persist a database.


Graphs


If you just want the graphs here they are.


Test pattern

I used the small server (Intel NUC, 2 cores, 8G RAM, m.2 Samsung SSD) for this test.`

The workload is the insert benchmark using the pattern described below. I use this pattern to include a mix of workloads (write-only, reads with write-heavy, reads without write-heavy). I don't include a read-only step. The mixed workload (reads + writes) is very important especially when the database is larger than RAM:
  1. Insert 250M rows into a table in PK order. The table has 3 secondary indexes and the inserts are in random order for each of the secondary indexes. Measure insert performance.
  2. Insert 5M rows into the table with the writer limited to 1000/second and measure query performance for short range queries. The range queries use LIMIT 10 and randomly select the index (secondary or PK) and the starting offset for the range. The index is covering for the query. Assuming the writer is able to do 1000/second this takes 5000 seconds. For this step I care about throughput and response time for reads. For writes my main concern is whether the rate was able to sustain the rate of 1000/second.
  3. Same as #2 except 500k rows are inserted and the writer is limited to 100/second. Assuming the writer can do 100/second this also takes 5000 seconds.

Performance & Efficiency


I am interested in performance and efficiency. I include both. Sometimes better efficiency explains better performance. Sometimes we need better efficiency even if that comes at the cost of decreased performance.

The metrics I use for performance are throughput and response time. For performance I measure the average rate for queries and inserts. I also want to determine whether there are stalls in throughput so the benchmark client reports average throughput per interval (~10 seconds each) and at test end I compute the 50th, 75th, 90th, 95th and 99th percentile per-interval throughput. For many tests including the insert benchmark it is even better to show throughput over time as some database engines suffer from stalls or throughput degrades. Some of that is visible via the p95 and p99 throughput metric.

For efficiency I measure HW consumed per operation and including CPU microseconds, storage reads, storage KB read/written. For efficiency I also include the database size and the RSS of the database process. I don't distinguish storage reads and writes done for user inserts versus those done for user queries. If the storage read rate is 200/second, the query rate is 10/second and the insert rate is 5/second then the metrics below will show 20 reads/query and 40 reads/insert.

I usually ignore storage writes for 2 reasons. First, the penalty for random writes is small with SSD. Endurance is still an issue and KB written captures that. Second, it is hard to distinguish between slow and fast write operations. Some write operations are fast because they are mostly sequential or are small log writes that benefit from battery backed write cache on a RAID device.

I use vmstat and iostat to collect efficiency metrics. Note that Linux in mid-2016 still counts trim as bytes written. So with a database engine like RocksDB that deletes files as frequently as it creates them in the steady state and a kernel+filesystem+device setup to do trim on file delete then KB written as reported by iostat will be twice the real rate. You can confirm this by comparing iostat metrics with metrics reported by your SSD.

Configuration

I used a 2G block cache for RocksDB which uses the rest of RAM for compressed blocks in the OS page cache. I used a 6G buffer pool for InnoDB which used direct IO.

For RocksDB I used Facebook MySQL with the git hash 2406e9. For InnoDB I used 5.6.29 and 5.7.10. All mysqld builds used jemalloc. The binlog was enabled for all tests but sync-on-commit was disabled for the database engine and the binlog. The server runs Ubuntu 14.04.

MyRocks is changing fast. I used a build from May 7 and since then several performance bugs have been fixed. My my.cnf settings for RocksDB are also stale compared to the latest & greatest values suggested by Yoshinori. It is fun to work on a project that improving so quickly.

The my.cnf files for RocksDB, InnoDB in MySQL 5.6, InnoDB in MySQL 5.7.

Results for insert-only


Summary:
  • InnoDB writes between 20X and 30X more per insert than RocksDB
  • RocksDB does between 4X and 10X more inserts/second. The difference increases as the database:RAM ratio increases because the insert rate for RocksDB is almost flat over time while InnoDB drops dramatically once the database gets larger than RAM.

legend:
* ips.av - average insert rate
* ips.75, ips.95, ips.99 - 75th, 95th and 99th percentile insert rate over the 

                           per-interval (~10 seconds) average rates
* r/i - storage reads per insert via iostat
* rkb/i, wkb/i - storage KB read and written per insert via iostat. iostat
                 overstates the wkb/i result for RocksDB by 2X because it
                 counts trim as bytes written and the result here have
                 been corrected.
* Mcpu/i - usecs CPU per insert via vmstat
* size - database size in GB at test end
* rss - mysqld RSS in GB at test end
* engine - format is $e.$comp[.skip] and .skip means that unique constraints
           are not checked (only on RocksDB). For $e, "rx" is RocksDB,
           "i7" is InnoDB in MySQL 5.7.10 and "i6" is InnoDB in MySQL 5.6.29.
           For $comp "snap" is snappy, "zlib" is zlib level 1, "zlib-6" is zlib

           level 6 and "none" is no compression.

ips.av  ips.75  ips.95  ips.99  r/i     rkb/i   wkb/i   Mcpu/i  size    rss     engine
23129   24515   12276    9515   0.02    0.25     1.55    3354   19g     2.47    rx.snap.skip
18177   18356    7283    6245   0.02    0.23     1.46    3805   17g     2.52    rx.zlib.skip
17705   15976   15282   14766   0.02    0.25     1.75    3944   18g     2.59    rx.snap
15357   14945   11532    8114   0.02    0.23     1.57    4920   16g     2.67    rx.zlib
12800   14557    5681    4694   0.01    0.19     1.46    5433   16g     2.53    rx.zlib-6
12008   10525    9355    9007   0.02    0.41     2.01    4585   27g     2.64    rx.none
 3811    2854    2607    2533   0.06    1.76    30.69    2715   42g     6.98    i7.none
 2699    1995    1497    1281   0.29    5.97    36.17    4565   42g     6.78    i6.none
 2120    1619    1338    1278   0.09    0.79    51.70   11142   23g     7.06    i7.zlib
 2121    1604    1335    1275   0.09    0.81    51.34   11137   23g     6.85    i6.zlib


Results with inserts limited to 1000/second


RocksDB gets more QPS because it does fewer reads from storage per query. That is expected for uncompressed InnoDB because that database is about 2X the size compared to RocksDB. But the compressed InnoDB database isn't much larger than RocksDB so it should be able to keep a similar amount of hot data in cache. I wonder if key prefix compression as done by RocksDB allows it to get a better cache hit rate.

RocksDB also benefits from not doing reads for non-unique secondary index pages during index maintenance. That is a write-only operation for RocksDB.

legend:
* ips.av, qps.av - average insert and query rates
* ips.99, qps.99 - 99th percentile query rate over the 
per-interval average rates
* r/q - storage reads per insert and per query query via iostat
* rkb/q, wkb/q - storage KB read and written per query 
* Mcpu/i - usecs CPU per query via vmstat
* size - database size in GB at test end
* rss - mysqld RSS in GB at test end

* engine - format is $e.$comp[.skip] and .skip means that unique constraints\
           are not checked (only on RocksDB). For $e, "rx" is RocksDB,
           "i7" is InnoDB in MySQL 5.7.10 and "i6" is InnoDB in MySQL 5.6.29.
           For $comp "snap" is snappy, "zlib" is zlib level 1, "zlib-6" is zlib
           level 6 and "none" is no compression.

ips.av  ips.99  qps.av  qps.99  r/q     rkb/q   Mcpu/q  size    rss     engine
 998     997     2877   1579    0.27     3.40   12190   19g     2.53    rx.snap
 998     997     2635   1526    0.22     2.32   11544   16g     2.34    rx.zlib
 998     997     2652   1531    0.23     2.49   12396   15g     2.45    rx.zlib-6
 999     961      256     10    6.68   123.54   29334   42g     6.98    i7.none
 999     966      256     12    6.86   128.08   42332   42g     6.79    i6.none
 999     967      218    173    6.86    58.00  110277   23g     7.07    i7.zlib
 999     966      210    161    6.71    59.37  113406   23g     6.85    i6.zlib

Results with inserts limited to 100/second


InnoDB still does many more storage reads per user request and storage reads done for secondary index maintenance don't explain why RocksDB is more efficient.

ips.av  ips.99  qps.av  qps.99  r/q     rkb/q   Mcpu/q  size    rss     engine
 100     100    3059    2339    0.22    2.59    9754    19g     2.41    rx.snap
 100     100    2773    2312    0.18    1.61    9051    15g     2.33    rx.zlib
 100     100    2899    2158    0.18    1.60    8329    15g     2.42    rx.zlib-6
 100     100     728     438    5.78   93.25   11258    42g     6.98    i7.none
 100     100     637     463    5.82   94.04   16334    42g     6.79    i6.none
 100     100     456     439    4.64   37.10   42641    23g     7.08    i7.zlib
 100     100     457     354    4.72   37.98   43306    23g     6.85    i6.zlib

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...