tl;dr - RocksDB did better than InnoDB for a write-heavy workload and a range-scan heavy workload. The former is expected, the latter is a welcome surprise. I hope we get RocksDB into a supported MySQL distribution to make it easier for you to try it.
Graphs
If you just want the graphs, here they are.
Test pattern
I used the small server (Intel NUC, 2 cores, 8G RAM, M.2 Samsung SSD) for this test. The workload is the insert benchmark using the pattern described below. I use this pattern to include a mix of workloads: write-only, reads with a high write rate, and reads with a low write rate. I don't include a read-only step. The mixed workload (reads + writes) is very important, especially when the database is larger than RAM:
- Insert 250M rows into a table in PK order. The table has 3 secondary indexes and the inserts are in random order for each of the secondary indexes. Measure insert performance.
- Insert 5M rows into the table with the writer limited to 1000/second and measure query performance for short range queries (the statement shapes are sketched after this list). The range queries use LIMIT 10 and randomly select the index (secondary or PK) and the starting offset for the range. The index is covering for the query. Assuming the writer is able to do 1000/second this takes 5000 seconds. For this step I care about throughput and response time for reads. For writes my main concern is whether the writer was able to sustain the rate of 1000/second.
- Same as #2 except 500k rows are inserted and the writer is limited to 100/second. Assuming the writer can do 100/second this also takes 5000 seconds.
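To make the workload concrete, here is a minimal sketch in Python of the statement shapes the steps above use. The table, index and column names are illustrations of my own; the real insert benchmark client differs in details.

import random

create_table = """CREATE TABLE t (
  id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  k1 INT NOT NULL, k2 INT NOT NULL, k3 INT NOT NULL,
  val VARCHAR(100),
  INDEX ix1 (k1), INDEX ix2 (k2), INDEX ix3 (k3))"""

def insert_statement():
    # inserts arrive in PK order via auto-increment while the secondary
    # key values are random, so secondary index maintenance is random
    return ("INSERT INTO t (k1, k2, k3, val) VALUES (%d, %d, %d, 'x')"
            % tuple(random.randint(0, 1 << 30) for _ in range(3)))

def range_query():
    # randomly select the index and the starting offset; the secondary
    # indexes include the PK columns, so the index covers this query
    name, col = random.choice([("ix1", "k1"), ("ix2", "k2"), ("ix3", "k3")])
    return ("SELECT %s, id FROM t FORCE INDEX (%s) WHERE %s >= %d "
            "ORDER BY %s LIMIT 10"
            % (col, name, col, random.randint(0, 1 << 30), col))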
Performance & Efficiency
I am interested in performance and efficiency. I include both. Sometimes better efficiency explains better performance. Sometimes we need better efficiency even if that comes at the cost of decreased performance.
The metrics I use for performance are throughput and response time. For performance I measure the average rate for queries and inserts. I also want to determine whether there are stalls in throughput, so the benchmark client reports average throughput per interval (~10 seconds each) and at test end I compute the 50th, 75th, 90th, 95th and 99th percentile per-interval throughput. For many tests, including the insert benchmark, it is even better to show throughput over time as some database engines suffer from stalls or degrading throughput. Some of that is visible via the p95 and p99 throughput metrics.
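A minimal sketch of that computation, assuming one average rate per ~10 second interval. This is my own code, not the benchmark client's:

def pctl(rates, p):
    # the pN throughput is the rate that N% of intervals meet or exceed,
    # so larger N exposes stalls: p99 <= p95 <= p75 in the tables below
    s = sorted(rates, reverse=True)
    return s[min(len(s) - 1, int(p * len(s) / 100))]

interval_rates = [24500, 23800, 22100, 11900, 9400]  # hypothetical values
for p in (50, 75, 90, 95, 99):
    print("p%d = %d inserts/second" % (p, pctl(interval_rates, p)))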
For efficiency I measure the HW consumed per operation, including CPU microseconds, storage reads, and storage KB read and written. For efficiency I also include the database size and the RSS of the database process. I don't distinguish storage reads and writes done for user inserts versus those done for user queries. If the storage read rate is 200/second, the query rate is 10/second and the insert rate is 5/second then the metrics below will show 20 reads/query and 40 reads/insert.
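The arithmetic for that example:

# all storage reads are charged to both queries and inserts because the
# two can't be separated in iostat output
reads_per_sec, qps, ips = 200.0, 10.0, 5.0
print(reads_per_sec / qps)  # 20.0 reads/query
print(reads_per_sec / ips)  # 40.0 reads/insert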
I usually ignore storage writes for 2 reasons. First, the penalty for random writes is small with SSD. Endurance is still an issue and KB written captures that. Second, it is hard to distinguish between slow and fast write operations. Some write operations are fast because they are mostly sequential or are small log writes that benefit from a battery-backed write cache on a RAID device.
I use vmstat and iostat to collect efficiency metrics. Note that Linux in mid-2016 still counts trim as bytes written. So with a database engine like RocksDB, which deletes files as frequently as it creates them in the steady state, and a kernel+filesystem+device setup to do trim on file delete, the KB written reported by iostat will be twice the real rate. You can confirm this by comparing iostat metrics with metrics reported by your SSD.
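A sketch of the correction, assuming trimmed bytes match written bytes in the steady state:

# iostat wKB/s counts trim as bytes written on this kernel, and RocksDB
# deletes files about as fast as it creates them, so halve the value
iostat_wkb = 3000.0          # hypothetical wKB/s from iostat
real_wkb = iostat_wkb / 2.0
# cross-check against the device's own counters, for example the total
# bytes written reported by smartctl for the SSD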
Configuration
I used a 2G block cache for RocksDB, which uses the rest of RAM for compressed blocks in the OS page cache. I used a 6G buffer pool for InnoDB, which used direct IO.

For RocksDB I used Facebook MySQL with the git hash 2406e9. For InnoDB I used 5.6.29 and 5.7.10. All mysqld builds used jemalloc. The binlog was enabled for all tests but sync-on-commit was disabled for the database engine and the binlog. The server runs Ubuntu 14.04.
MyRocks is changing fast. I used a build from May 7 and since then several performance bugs have been fixed. My my.cnf settings for RocksDB are also stale compared to the latest & greatest values suggested by Yoshinori. It is fun to work on a project that is improving so quickly.
The my.cnf files I used: RocksDB, InnoDB in MySQL 5.6, and InnoDB in MySQL 5.7.
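The key settings described above, as a my.cnf fragment. This is a sketch built from the values in this post, not the full files:

[mysqld]
log_bin                             # binlog enabled
sync_binlog = 0                     # no sync-on-commit for the binlog
# InnoDB: 6G buffer pool with direct IO, no sync-on-commit
innodb_buffer_pool_size = 6G
innodb_flush_method = O_DIRECT
innodb_flush_log_at_trx_commit = 2
# RocksDB: 2G block cache; compressed blocks come from the OS page cache
rocksdb_block_cache_size = 2G
rocksdb_flush_log_at_trx_commit = 2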
Results for insert-only
Summary:
- InnoDB writes between 20X and 30X more per insert than RocksDB
- RocksDB does between 4X and 10X more inserts/second. The difference increases as the database:RAM ratio increases because the insert rate for RocksDB is almost flat over time while InnoDB drops dramatically once the database gets larger than RAM.
legend:
* ips.av - average insert rate
* ips.75, ips.95, ips.99 - 75th, 95th and 99th percentile insert rate over the
per-interval (~10 seconds) average rates
* r/i - storage reads per insert via iostat
* rkb/i, wkb/i - storage KB read and written per insert via iostat. iostat
overstates the wkb/i result for RocksDB by 2X because it
counts trim as bytes written; the results here have
been corrected.
* Mcpu/i - usecs CPU per insert via vmstat
* size - database size in GB at test end
* rss - mysqld RSS in GB at test end
* engine - format is $e.$comp[.skip] and .skip means that unique constraints
are not checked (only on RocksDB). For $e, "rx" is RocksDB,
"i7" is InnoDB in MySQL 5.7.10 and "i6" is InnoDB in MySQL 5.6.29.
For $comp "snap" is snappy, "zlib" is zlib level 1, "zlib-6" is zlib
level 6 and "none" is no compression.
"i7" is InnoDB in MySQL 5.7.10 and "i6" is InnoDB in MySQL 5.6.29.
For $comp "snap" is snappy, "zlib" is zlib level 1, "zlib-6" is zlib
level 6 and "none" is no compression.
ips.av ips.75 ips.95 ips.99 r/i rkb/i wkb/i Mcpu/i size rss engine
23129 24515 12276 9515 0.02 0.25 1.55 3354 19g 2.47 rx.snap.skip
18177 18356 7283 6245 0.02 0.23 1.46 3805 17g 2.52 rx.zlib.skip
17705 15976 15282 14766 0.02 0.25 1.75 3944 18g 2.59 rx.snap
15357 14945 11532 8114 0.02 0.23 1.57 4920 16g 2.67 rx.zlib
12800 14557 5681 4694 0.01 0.19 1.46 5433 16g 2.53 rx.zlib-6
12008 10525 9355 9007 0.02 0.41 2.01 4585 27g 2.64 rx.none
3811 2854 2607 2533 0.06 1.76 30.69 2715 42g 6.98 i7.none
2699 1995 1497 1281 0.29 5.97 36.17 4565 42g 6.78 i6.none
2120 1619 1338 1278 0.09 0.79 51.70 11142 23g 7.06 i7.zlib
2121 1604 1335 1275 0.09 0.81 51.34 11137 23g 6.85 i6.zlib
Results with inserts limited to 1000/second
RocksDB gets more QPS because it does fewer reads from storage per query. That is expected for uncompressed InnoDB because that database is about 2X the size compared to RocksDB. But the compressed InnoDB database isn't much larger than RocksDB so it should be able to keep a similar amount of hot data in cache. I wonder if key prefix compression as done by RocksDB allows it to get a better cache hit rate.
RocksDB also benefits from not doing reads for non-unique secondary index pages during index maintenance. That is a write-only operation for RocksDB.
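A conceptual sketch of why that is, with toy classes of my own (not MyRocks internals): a B-tree must read the leaf page before changing it, while an LSM can do a blind put for a non-unique index and defer merging to compaction.

class BTreeSecondaryIndex:
    def __init__(self):
        self.leaf_pages = {}   # page id -> entries, a stand-in for on-disk pages
        self.page_reads = 0

    def insert(self, key, rowid):
        page_id = hash(key) % 100000
        self.page_reads += 1   # read-modify-write: the leaf must be read first
        self.leaf_pages.setdefault(page_id, []).append((key, rowid))

class LsmSecondaryIndex:
    def __init__(self):
        self.memtable = []
        self.page_reads = 0    # stays 0: a non-unique index needs no read

    def insert(self, key, rowid):
        # blind put: no uniqueness check, so no read; duplicate and obsolete
        # entries are resolved later by compaction
        self.memtable.append((key, rowid))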
legend:
* ips.av, qps.av - average insert and query rates
* ips.99, qps.99 - 99th percentile insert and query rates over the per-interval average rates
* r/q - storage reads per query via iostat
* rkb/q, wkb/q - storage KB read and written per query
* Mcpu/q - usecs CPU per query via vmstat
* size - database size in GB at test end
* rss - mysqld RSS in GB at test end
* engine - format is $e.$comp[.skip] and .skip means that unique constraints
are not checked (only on RocksDB). For $e, "rx" is RocksDB,
"i7" is InnoDB in MySQL 5.7.10 and "i6" is InnoDB in MySQL 5.6.29.
For $comp "snap" is snappy, "zlib" is zlib level 1, "zlib-6" is zlib
level 6 and "none" is no compression.
ips.av ips.99 qps.av qps.99 r/q rkb/q Mcpu/q size rss engine
998 997 2877 1579 0.27 3.40 12190 19g 2.53 rx.snap
998 997 2635 1526 0.22 2.32 11544 16g 2.34 rx.zlib
998 997 2652 1531 0.23 2.49 12396 15g 2.45 rx.zlib-6
999 961 256 10 6.68 123.54 29334 42g 6.98 i7.none
999 966 256 12 6.86 128.08 42332 42g 6.79 i6.none
999 967 218 173 6.86 58.00 110277 23g 7.07 i7.zlib
999 966 210 161 6.71 59.37 113406 23g 6.85 i6.zlib
Results with inserts limited to 100/second
InnoDB still does many more storage reads per user request, and at this low insert rate storage reads done for secondary index maintenance don't explain why RocksDB is more efficient.
ips.av ips.99 qps.av qps.99 r/q rkb/q Mcpu/q size rss engine
100 100 3059 2339 0.22 2.59 9754 19g 2.41 rx.snap
100 100 2773 2312 0.18 1.61 9051 15g 2.33 rx.zlib
100 100 2899 2158 0.18 1.60 8329 15g 2.42 rx.zlib-6
100 100 728 438 5.78 93.25 11258 42g 6.98 i7.none
100 100 637 463 5.82 94.04 16334 42g 6.79 i6.none
100 100 456 439 4.64 37.10 42641 23g 7.08 i7.zlib
100 100 457 354 4.72 37.98 43306 23g 6.85 i6.zlib
Comments

Hi Mark,
Congratulations to the RocksDB team for the great results!
I want to start testing MyRocks with the Facebook MySQL distribution but I read in the documentation:
"Currently, it is not allowed to enable both InnoDB and RocksDB storage engines at the same time, because it is hard to make it work in production."
This is a big problem for building a system where I can use my normal production engine (InnoDB) while starting to try RocksDB.
Is there a distribution on the roadmap with InnoDB and RocksDB working together?
Thank you very much
I think both can be enabled now but this was closed without a link to the diff:
https://github.com/facebook/mysql-5.6/issues/101
That's a lot of benchmarking, Mark! Are you also using the 850 evo on the 2-socket server?

The dual socket server has 3 NVMe devices with much better performance than the 850 evo.

Better p99 is good, but I wasn't part of that testing effort.