Wednesday, June 8, 2016

RocksDB, InnoDB and the insert benchmark

Today I share insert benchmark results for RocksDB with MySQL 5.6 and for InnoDB with MySQL 5.6 and 5.7. I have many more results that I am currently sharing with engine vendors, either to help them improve (WiredTiger) or to show them the potential of RocksDB. Percona has embraced MongoRocks. I hope to see similar support for MyRocks in a widely used MySQL distribution.

tl;dr - RocksDB did better than InnoDB for a write-heavy workload and a range-scan heavy workload. The former is expected, the latter is a welcome surprise. I hope we get RocksDB into a supported MySQL distribution to make it easier for you to try it.

I have been running several database benchmarks on a variety of hardware, storage, database engines and configurations. The benchmarks include the insert benchmark, linkbench and sysbench. The hardware includes dual-socket, many-core servers with 144G or more of RAM and a 1-socket, few-core server. The few-core server is an Intel NUC 5i3RYH with 2 cores, 4 HW threads, 8G of RAM and a Samsung m.2 SSD. The storage includes m.2 SSD (Samsung 850 EVO 120G), older PCIe SSD and newer NVMe SSD. The engines include RocksDB, InnoDB, MyISAM and TokuDB for MySQL and RocksDB, WiredTiger and mmapv1 for MongoDB. The configurations include databases that fit in cache and databases that do not. Even when the database fits in cache the workload can be IO-bound from the writes required to persist the database.


Graphs


If you just want the graphs, here they are.


Test pattern

I used the small server (Intel NUC, 2 cores, 8G RAM, m.2 Samsung SSD) for this test.

The workload is the insert benchmark using the pattern described below. I use this pattern to get a mix of workloads: write-only, reads concurrent with a high insert rate, and reads concurrent with a low insert rate. I don't include a read-only step. The mixed workload (reads + writes) is very important, especially when the database is larger than RAM:
  1. Insert 250M rows into a table in PK order. The table has 3 secondary indexes and the inserts are in random order for each of the secondary indexes. Measure insert performance.
  2. Insert 5M rows into the table with the writer limited to 1000/second and measure query performance for short range queries (the query shape is sketched after this list). The range queries use LIMIT 10 and randomly select the index (secondary or PK) and the starting offset for the range. The index is covering for the query. Assuming the writer is able to do 1000/second this takes 5000 seconds. For this step I care about throughput and response time for reads. For writes my main concern is whether the writer was able to sustain the rate of 1000/second.
  3. Same as #2 except 500k rows are inserted and the writer is limited to 100/second. Assuming the writer can do 100/second this also takes 5000 seconds.
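
Below is a minimal sketch of the short range query shape. It uses a hypothetical table t with a covering secondary index x_price on (price, customerid); the real insert benchmark schema and column names differ.

# hypothetical example: a covering short range scan that fetches 10 rows
mysql -e "
  SELECT price, customerid
  FROM t FORCE INDEX (x_price)
  WHERE price >= 100.0
  ORDER BY price
  LIMIT 10" test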

Performance & Efficiency


I am interested in performance and efficiency. I include both. Sometimes better efficiency explains better performance. Sometimes we need better efficiency even if that comes at the cost of decreased performance.

The metrics I use for performance are throughput and response time. For performance I measure the average rate for queries and inserts. I also want to determine whether there are stalls in throughput, so the benchmark client reports average throughput per interval (~10 seconds each) and at test end I compute the 50th, 75th, 90th, 95th and 99th percentile per-interval throughput. For many tests including the insert benchmark it is even better to show throughput over time as some database engines suffer from stalls or degrading throughput. Some of that is visible via the p95 and p99 throughput metrics.
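
As a minimal sketch of how those percentiles can be computed, assuming the benchmark client wrote one per-interval rate per line to rates.txt (a hypothetical file name), with intervals sorted from fastest to slowest:

n=$(wc -l < rates.txt)
for p in 50 75 90 95 99; do
  # index of the interval at the p-th percentile, counting from the fastest
  i=$(( (n * p + 99) / 100 ))
  echo "p$p $(sort -rn rates.txt | sed -n "${i}p")"
done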

For efficiency I measure the HW consumed per operation, including CPU microseconds, storage reads and storage KB read and written. For efficiency I also include the database size and the RSS of the database process. I don't distinguish storage reads and writes done for user inserts from those done for user queries. If the storage read rate is 200/second, the query rate is 10/second and the insert rate is 5/second then the metrics below will show 20 reads/query and 40 reads/insert.

I usually ignore storage writes for 2 reasons. First, the penalty for random writes is small with SSD. Endurance is still an issue and KB written captures that. Second, it is hard to distinguish between slow and fast write operations. Some write operations are fast because they are mostly sequential or are small log writes that benefit from battery backed write cache on a RAID device.

I use vmstat and iostat to collect efficiency metrics. Note that Linux in mid-2016 still counts trim as bytes written. So for a database engine like RocksDB, which deletes files about as frequently as it creates them in the steady state, on a kernel+filesystem+device setup that does trim on file delete, the KB written reported by iostat will be about twice the real rate. You can confirm this by comparing iostat metrics with metrics reported by your SSD.
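
One way to confirm it, as a sketch: compare the bytes written that iostat reports over a test with the change in the SSD's own write counter. This assumes a SATA device at /dev/sdb that reports Total_LBAs_Written (SMART attribute 241) in 512-byte units; both are assumptions for your hardware.

before=$(sudo smartctl -A /dev/sdb | awk '/Total_LBAs_Written/ {print $NF}')
# ... run the write-heavy test and note the wKB/s reported by iostat ...
after=$(sudo smartctl -A /dev/sdb | awk '/Total_LBAs_Written/ {print $NF}')
# assumes 512-byte LBAs; check your device
echo "GB written per the device: $(( (after - before) * 512 / 1024 / 1024 / 1024 ))"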

Configuration

I used a 2G block cache for RocksDB; the rest of RAM is used by the OS page cache for compressed blocks. I used a 6G buffer pool for InnoDB, which used direct IO.

For RocksDB I used Facebook MySQL with the git hash 2406e9. For InnoDB I used 5.6.29 and 5.7.10. All mysqld builds used jemalloc. The binlog was enabled for all tests but sync-on-commit was disabled for the database engine and the binlog. The server runs Ubuntu 14.04.

MyRocks is changing fast. I used a build from May 7 and since then several performance bugs have been fixed. My my.cnf settings for RocksDB are also stale compared to the latest & greatest values suggested by Yoshinori. It is fun to work on a project that is improving so quickly.

The my.cnf files for RocksDB, InnoDB in MySQL 5.6 and InnoDB in MySQL 5.7.
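
As a rough sketch of the settings described above, expressed as mysqld options rather than the full my.cnf; the option names are my assumptions for MyRocks and InnoDB and the my.cnf files above have the real tunings.

# RocksDB: 2G uncompressed block cache, binlog on, no sync on commit
libexec/mysqld --default-storage-engine=rocksdb \
  --rocksdb_block_cache_size=2147483648 \
  --rocksdb_flush_log_at_trx_commit=2 \
  --log-bin --sync_binlog=0 ...

# InnoDB: 6G buffer pool, direct IO, binlog on, no sync on commit
libexec/mysqld --innodb_buffer_pool_size=6442450944 \
  --innodb_flush_method=O_DIRECT \
  --innodb_flush_log_at_trx_commit=2 \
  --log-bin --sync_binlog=0 ...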

Results for insert-only


Summary:
  • InnoDB writes between 20X and 30X more per insert than RocksDB
  • RocksDB does between 4X and 10X more inserts/second. The difference increases as the database:RAM ratio increases because the insert rate for RocksDB is almost flat over time while InnoDB drops dramatically once the database gets larger than RAM.

legend:
* ips.av - average insert rate
* ips.75, ips.95, ips.99 - 75th, 95th and 99th percentile insert rate over the
                           per-interval (~10 seconds) average rates
* r/i - storage reads per insert via iostat
* rkb/i, wkb/i - storage KB read and written per insert via iostat. iostat
                 overstates the wkb/i result for RocksDB by 2X because it
                 counts trim as bytes written; the results here have
                 been corrected.
* Mcpu/i - usecs CPU per insert via vmstat
* size - database size in GB at test end
* rss - mysqld RSS in GB at test end
* engine - format is $e.$comp[.skip] and .skip means that unique constraints
           are not checked (only on RocksDB). For $e, "rx" is RocksDB,
           "i7" is InnoDB in MySQL 5.7.10 and "i6" is InnoDB in MySQL 5.6.29.
           For $comp, "snap" is snappy, "zlib" is zlib level 1, "zlib-6" is zlib
           level 6 and "none" is no compression.

ips.av  ips.75  ips.95  ips.99  r/i     rkb/i   wkb/i   Mcpu/i  size    rss     engine
23129   24515   12276    9515   0.02    0.25     1.55    3354   19g     2.47    rx.snap.skip
18177   18356    7283    6245   0.02    0.23     1.46    3805   17g     2.52    rx.zlib.skip
17705   15976   15282   14766   0.02    0.25     1.75    3944   18g     2.59    rx.snap
15357   14945   11532    8114   0.02    0.23     1.57    4920   16g     2.67    rx.zlib
12800   14557    5681    4694   0.01    0.19     1.46    5433   16g     2.53    rx.zlib-6
12008   10525    9355    9007   0.02    0.41     2.01    4585   27g     2.64    rx.none
 3811    2854    2607    2533   0.06    1.76    30.69    2715   42g     6.98    i7.none
 2699    1995    1497    1281   0.29    5.97    36.17    4565   42g     6.78    i6.none
 2120    1619    1338    1278   0.09    0.79    51.70   11142   23g     7.06    i7.zlib
 2121    1604    1335    1275   0.09    0.81    51.34   11137   23g     6.85    i6.zlib


Results with inserts limited to 1000/second


RocksDB gets more QPS because it does fewer reads from storage per query. That is expected for uncompressed InnoDB because that database is about 2X the size compared to RocksDB. But the compressed InnoDB database isn't much larger than RocksDB so it should be able to keep a similar amount of hot data in cache. I wonder if key prefix compression as done by RocksDB allows it to get a better cache hit rate.

RocksDB also benefits from not doing reads for non-unique secondary index pages during index maintenance. That is a write-only operation for RocksDB.

legend:
* ips.av, qps.av - average insert and query rates
* ips.99, qps.99 - 99th percentile insert and query rates over the
                   per-interval average rates
* r/q - storage reads per query via iostat (reads done for inserts are not
        counted separately)
* rkb/q - storage KB read per query via iostat
* Mcpu/q - usecs CPU per query via vmstat
* size - database size in GB at test end
* rss - mysqld RSS in GB at test end
* engine - format is $e.$comp[.skip] and .skip means that unique constraints
           are not checked (only on RocksDB). For $e, "rx" is RocksDB,
           "i7" is InnoDB in MySQL 5.7.10 and "i6" is InnoDB in MySQL 5.6.29.
           For $comp, "snap" is snappy, "zlib" is zlib level 1, "zlib-6" is zlib
           level 6 and "none" is no compression.

ips.av  ips.99  qps.av  qps.99  r/q     rkb/q   Mcpu/q  size    rss     engine
 998     997     2877   1579    0.27     3.40   12190   19g     2.53    rx.snap
 998     997     2635   1526    0.22     2.32   11544   16g     2.34    rx.zlib
 998     997     2652   1531    0.23     2.49   12396   15g     2.45    rx.zlib-6
 999     961      256     10    6.68   123.54   29334   42g     6.98    i7.none
 999     966      256     12    6.86   128.08   42332   42g     6.79    i6.none
 999     967      218    173    6.86    58.00  110277   23g     7.07    i7.zlib
 999     966      210    161    6.71    59.37  113406   23g     6.85    i6.zlib

Results with inserts limited to 100/second


InnoDB still does many more storage reads per query, and at this lower insert rate reads done for secondary index maintenance don't explain why RocksDB is more efficient.

ips.av  ips.99  qps.av  qps.99  r/q     rkb/q   Mcpu/q  size    rss     engine
 100     100    3059    2339    0.22    2.59    9754    19g     2.41    rx.snap
 100     100    2773    2312    0.18    1.61    9051    15g     2.33    rx.zlib
 100     100    2899    2158    0.18    1.60    8329    15g     2.42    rx.zlib-6
 100     100     728     438    5.78   93.25   11258    42g     6.98    i7.none
 100     100     637     463    5.82   94.04   16334    42g     6.79    i6.none
 100     100     456     439    4.64   37.10   42641    23g     7.08    i7.zlib
 100     100     457     354    4.72   37.98   43306    23g     6.85    i6.zlib

Friday, May 6, 2016

smartctl and Samsung 850 EVO

I have 3 small servers at home for performance testing. Each is an Intel NUC 5i3RYH with a core i3 and 8G of RAM. These work quietly under my desk. I have been collecting results for MongoDB and MySQL to understand storage engine performance and efficiency. I use them for single-threaded workloads to learn when storage engines sacrifice too much performance at low concurrency to make things better at high concurrency.

Each NUC has one SATA disk and one SSD. Most tests use the SSD because the disk has the OS install and I don't want to lose the install when too much testing makes the device unhappy. My current SSD is a 120G Samsung 850 EVO and one of them became sick.
[3062127.595842] attempt to access beyond end of device
[3062127.595847] sdb1: rw=129, want=230697888, limit=230686720
[3062127.595850] XFS (sdb1): discard failed for extent [0x7200223,8192], error 5

Other error messages were amusing.
[2273399.254789] Uhhuh. NMI received for unknown reason 3d on CPU 3.
[2273399.254818] Do you have a strange power saving mode enabled?
[2273399.254840] Dazed and confused, but trying to continue

What does smartctl say? I am interested in Wear_Leveling_Count. The raw value is 1656. If that means what I think it means then this device can go to 2000 thanks to 3D TLC NAND (aka 3D V-NAND). The VALUE is 022 and that counts down from 100 to 0, so this device is about 80% done and Wear_Leveling_Count might reach 2000 (the arithmetic is sketched after the smartctl output below). I created a new XFS filesystem on the device, rebooted the server and restarted my test. I don't think I need to replace this SSD today.

sudo smartctl -a /dev/sdb1
...
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       4323
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       42
177 Wear_Leveling_Count     0x0013   022   022   000    Pre-fail  Always       -       1656
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   055   049   000    Old_age   Always       -       45
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       9
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       365781411804
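
The arithmetic behind that guess, as a minimal sketch; it assumes the normalized VALUE for Wear_Leveling_Count counts down from 100 as the rated P/E cycles are consumed.

raw=1656    # RAW_VALUE: average P/E cycles consumed so far
value=22    # normalized VALUE: about 100 - 22 = 78% of rated life used
echo "projected P/E cycles at end of life: $(( raw * 100 / (100 - value) ))"
# prints 2123, which is why 2000 looks like a reasonable guess for this device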

Wednesday, April 27, 2016

Fun with scons while building MongoDB from source

This post might not have a large audience as not many people build MongoDB from source. Fortunately MongoDB has a thriving developers email list where my questions get answered quickly. Some of my builds must specify non-default paths for the compiler toolchain including the location of binaries like gcc, include paths and library paths. Last year MongoDB added options to their scons build to make that possible and I appreciate that they move fast to make things better.

Build tools (scons, cmake, autoconf/automake) are like snowflakes. Each one is different and those differences are painful to me because I'd rather not invest the time to become an expert in all of them. Today's fun problem was figuring out how to specify multiple directories for include and library paths. I assumed this would be like LD_LIBRARY_PATH and I could use a colon as the path separator. Alas, I was wrong and the path separator is a space. The docs claim that a colon should work. I am still confused, but I have a working build for MongoRocks!
  • This is OK: scons CPPPATH="/path/to/include1 /path/to/include2" mongod
  • This is not OK: scons CPPPATH="/path/to/include1:/path/to/include2" mongod
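
A fuller sketch of the command line I use, with hypothetical paths; CC and CXX select the toolchain and LIBPATH takes the same space-separated form as CPPPATH:

# hypothetical paths for a MongoRocks build with a non-default toolchain
scons CC=/path/to/gcc CXX=/path/to/g++ \
      CPPPATH="/path/to/include1 /path/to/include2" \
      LIBPATH="/path/to/lib1 /path/to/lib2" \
      mongod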

Monday, April 25, 2016

TRIM, iostat and Linux

I use iostat and vmstat to measure how much CPU and storage is used during my performance tests. Many of the database engines have their own counters to report disk IO but it is good to use the same measurement across engines. I use the "-k" option with iostat so it reports KB written per second per device.

The rate of writes to storage can be overstated by a factor of two in one case and I don't think this is widely known. When TRIM is done for an SSD, the Linux kernels that I use report the trimmed bytes as bytes written. If I create an 8G file then I will see at least 8G of writes reported by iostat. If I then remove the file I will see an additional 8G of writes reported by iostat, assuming TRIM is used. But that second batch of 8G wasn't really written.

One of the database engines that I evaluate, RocksDB, frequently creates and removes files. When TRIM is counted as bytes written then this overstates the amount of storage writes done by RocksDB. The other engines that I evaluate do not create and remove files as frequently -- InnoDB, WiredTiger, TokuDB, mmapv1.

The best way to figure out whether this happens with your favorite SSD is to test it yourself.
  1. If TRIM is done on file delete then iostat reports the trimmed bytes as bytes written.
  2. If iostat reports TRIM as bytes written and your database engine frequently removes files then the iostat wKB/second might be overstated.

Testing this:


My test case is:

output=/path/to/many/GB/file/this/will/create

# write with direct IO; run this long enough to get a file that is many GB in size
dd if=/dev/zero of=$output bs=1M oflag=direct &
dpid=$!

sleep 30
kill $dpid

# capture per-second iostat while the file is removed
iostat -kx 1 >& o.io &
ipid=$!
sleep 3
rm -f $output; sync
sleep 10
kill $ipid
# look at the iostat data in o.io -- if wKB/s spikes after the rm then the
# kernel counts trim as bytes written

Example iostat output from a 4.0.9 Linux kernel after the rm command:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
md2               0.00     0.00    0.00 65528.00     0.00 8387584.00   256.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00 65538.00     0.00 8387632.00   255.96     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00 65528.00     0.00 8387584.00   256.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00 65528.00     0.00 8387584.00   256.00     0.00    0.00   0.00   0.00

Example iostat output from a 3.10.53 Linux kernel after the rm command:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00 31078.00     0.00 3977984.00   256.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00 283935.00     0.00 36343552.00   256.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00 288343.00     0.00 36907908.00   256.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00 208534.00     0.00 26692352.00   256.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Wednesday, March 2, 2016

Using jemalloc heap profiling with MySQL

I spent too much time figuring this out.

This works for me:
MALLOC_CONF="prof:true,prof_gdump:true,prof_prefix:/path/to/files/jez" \
libexec/mysqld ...

This does not work for me:
MALLOC_CONF="prof:true,prof_gdump:true" \
libexec/mysqld ...
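
To read the dumps I use jeprof, which ships with jemalloc 4.x (older releases call it pprof); a minimal sketch, assuming the dump files landed in /path/to/files:

jeprof --text libexec/mysqld /path/to/files/jez.*.heap
# or a call graph (needs graphviz)
jeprof --pdf libexec/mysqld /path/to/files/jez.*.heap > mysqld-heap.pdf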

MyRocks and sql-bench

MyRocks is now able to run sql-bench thanks to recently added support for tables that are missing a PK. I found one bug and two performance problems in MyRocks when running sql-bench.

While writing this I found a post that claims sql-bench will be removed from the MySQL repo. It is useful and I hope it remains in some repo.

I run sql-bench for MyRocks with this command line:
./run-all-tests --server=mysql --create-options="engine=rocksdb default collate latin1_bin"

Monday, February 22, 2016

Concurrent transaction performance in RocksDB

Support for transactions was added to RocksDB last year. Here I explain performance for the pessimistic transaction API with a concurrent workload. I used the db_bench client rather than MySQL but MyRocks reuses the RocksDB transaction API and gets the performance benefits. For pessimistic transactions:
  • there is not much overhead from the transaction API
  • throughput improves as the work done per transaction increases
  • throughput improves with the concurrent memtable

Configuration


The test server has 2 sockets, 24 CPU cores and 48 HW threads. The database used fast storage (tmpfs). The benchmark is db_bench --benchmarks=randomtransaction via this test script. It was run for 1 to 48 concurrent clients to understand the impact of concurrency. It was run without transactions and with pessimistic transactions to understand the impact of the transaction API. It was run with --batch_size=1 and --batch_size=4 to understand the impact of doing more work per transaction. The test database was cached by RocksDB.
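
This is a rough sketch of the db_bench invocation, not the exact test script; the flag names are my assumptions for db_bench as of early 2016.

# pessimistic transactions, 4 operations per transaction, 24 clients,
# concurrent memtable enabled, database on tmpfs
db_bench --benchmarks=randomtransaction --transaction_db=1 \
         --batch_size=4 --num=10000000 --threads=24 \
         --allow_concurrent_memtable_write=1 \
         --db=/dev/shm/rocksdbtest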

Results


The first two graphs show the transaction throughput with (concurMT=1) and without (concurMT=0) the concurrent memtable enabled. Throughput is much larger with the concurrent memtable enabled. The benefit is larger with batch_size=4 than batch_size=1 because there is more work done with a larger batch_size and more mutex contention to avoid. Throughput is larger with batch_size=1 because there is 4X more work done per transaction with batch_size=4.



The next two graphs show the efficiency of the transaction API as the ratio of throughput with pessimistic transactions to throughput without transactions. When the value is 1.0 the throughput with transactions matches the throughput without transactions. From the graphs below, efficiency is better with batch_size=1 than with batch_size=4 and it improves with concurrency.

Data for the graphs is here: