Friday, November 3, 2017

Insert benchmark, IO-bound, Intel NUC: MyRocks vs InnoDB

This post explains the IO-bound insert benchmark on Intel NUC servers with results for MyRocks and InnoDB. The previous post explained an in-memory insert benchmark.

tl;dr
  • MyRocks gets better insert rates while using less HW per insert: ~8X faster insert rates on the i3 NUC and ~5X faster on the i5 NUC.
  • InnoDB writes ~20X more to storage per insert compared to MyRocks.
  • Uncompressed InnoDB uses ~1.6X more space than uncompressed MyRocks
  • MyRocks is up to 1.3X slower than InnoDB for secondary index scans and up to 1.7X slower for PK scans. The gap is larger for PK indexes because InnoDB secondary indexes suffer from fragmentation in this test. MyRocks suffers from using more CPU per row, so it can't drive storage as fast as InnoDB. 
  • InnoDB suffers more write stalls on the read-write test
  • MyRocks does much better for reads on the read-write tests because it is able to cache the secondary indexes. This is a benefit of being more space efficient. In the future I need to make the database larger to avoid comparing cached MyRocks vs not-cached InnoDB on this benchmark.

Configuration

The tests used a MyRocks build from FB MySQL 5.6.35 and August 15 with git hash 0d76ae and then upstream InnoDB 5.6.35 and 5.7.17. All tests used jemalloc with mysqld. The i3 and i5 NUC servers are described here. The insert benchmark is described here. The my.cnf files are here for the i3 NUC and i5 NUC. The i5 NUC has more RAM, faster CPUs and faster storage than the i3 NUC. I tried to tune my.cnf for all engines including: disabled SSL for MySQL 5.7,  use the same charset and collation, set innodb_purge_threads=1 to reduce mutex contention on the small servers. For all tests the binlog was enabled but fsync was disabled for the binlog and I set flush_log_at_trx_commit=2 for both engines. I compiled all of the MySQL versions on the test servers and did not use binaries from upstream.

The database is larger than RAM. The i3 NUC has 8gb of RAM and the i5 NUC has 16gb. The insert benchmark used 1 table and loaded 250M rows on the i3 NUC and 500M rows on the i5 NUC, then did a full scan of each index on the table (PK, 3 secondary, PK again), then two read-write tests. The first read-write test tries to insert 1000 rows/second with one client while the other client does short range scans as fast as possible. The second read-write test is similar except the insert rate limit is 100/second.

The MyRocks tests were done without compression and with zstd compression for the max LSM level. These are labeled MyRocks.none and MyRocks.zstd below.

Results

All of the data for the graphs is here. Note that Linux iostat can overstate bytes written by RocksDB by 2X because it counts bytes trimmed as bytes written. The data in github does not adjust for that. The numbers here for wKB/i does by dividing the measured value by 2 -- which is an estimate. Note that even if my adjustment were exact this value is still missing bytes written by flash GC and when I measured that in the past it was much worse for InnoDB.

Results: load

This section has results for the load.

This graph has the insert rate for each engine relative to the rate for MyRocks. I am not surprised that MyRocks has a better insert rate than InnoDB because it does less work per insert than InnoDB. MyRocks secondary index maintenance is read free and MyRocks has much less write amplification. For the in-memory insert benchmark there is a regression in the insert rate for InnoDB from 5.6 to 5.7 but that does not occur here.

These tables have more data for the i3 NUC including the average insert rate (IPS), the CPU overhead per insert, CPU/i, KB written to and read from storage per insert, wKB/i & rKB/i, and the size of the database when the load finishes, size(GB). The first table has the absolute values and the second the values relative to MyRocks without compression (MyRocks.none). By all metrics MyRocks is more efficient with hardware:
  • InnoDB writes ~20X more to storage per insert which helps my SSD devices last longer.
  • InnoDB reads between 10X and 20X more from storage per insert because it reads leaf pages for secondary index maintenance while MyRocks reads for compaction but secondary index maintenance is read free
  • InnoDB uses more CPU per insert.
  • InnoDB uses ~1.6X more space than MyRocks when neither use compression. InnoDB suffers from B-Tree fragmentation for the secondary indexes and MyRocks benefits from key prefix compression.

                Absolute values
                IPS     CPU/i   wKB/i   rKB/i   size(GB)
MyRocks.none    22122   2699     1.79   0.25    28
MyRocks.zstd    20169   3243     1.36   0.14    13
InnoDB-5.6       2665   4934    39.08   6.10    46
InnoDB-5.7       3242   3526    34.24   2.54    46

                Values relative to MyRocks
                IPS     CPU/i   wKB/i   rKB/i   size(GB)
MyRocks.none    1.00    1.00     1.00    1.00   1.00
MyRocks.zstd    0.91    1.20     0.76    0.56   0.46
InnoDB-5.6      0.12    1.83    21.83   24.40   1.64
InnoDB-5.7      0.15    1.31    19.13   10.16   1.64

These tables have more data for the i5 NUC. The insert rate difference is smaller here than for the i3 NUC. InnoDB suffers more from the slower storage device on the i3 NUC. Similar to the i3 NUC, InnoDB writes ~20X more to storage per insert and ~1.6X more space compared to uncompressed MyRocks.

                Absolute values
                IPS     CPU/i   wKB/i   rKB/i   size(GB)
MyRocks.none    26837   2278     1.75   0.12    56
MyRocks.zstd    24811   2770     1.52   0.06    25
InnoDB-5.6       5978   4265    44.50   2.10    90
InnoDB-5.7       5634   4690    44.92   1.26    90

                Values relative to MyRocks
                IPS     CPU/i   wKB/i   rKB/i   size(GB)
MyRocks.none    1.00    1.00     1.00    1.00   1.00
MyRocks.zstd    0.92    1.22     0.87    0.50   0.45
InnoDB-5.6      0.22    1.87    25.43   17.50   1.61
InnoDB-5.7      0.21    2.06    25.67   10.50   1.61

Results: scan

This section has results for the index scans. There are 5 scans: 1 is the PK, 2/3/4 are the secondary indexes and then 5 is the PK again. The 2+3+4 column has the time to scan all secondary indexes. The gap between MyRocks and InnoDB is much smaller here than it was on the in-memory full index scan. I assume the CPU overhead from MyRocks is less significant here where the scan is IO-bound.

I wrote above that I used a MyRocks build from August 15. Unfortunately for the scan test I used one for June 16 with git hash 52e058.

The result below includes results (MyRocks.none.ra, MyRocks.zstd.ra) that used a RocksDB option to make it more likely that filesystem readahead would be enabled. The option is rocksdb_advise_random_on_open and when set to 0 RocksDB uses POSIX_FADV_NORMAL rather than POSIX_FADV_RANDOM when opening SST files for user reads. I don't think this option is ready for production, but it can be used on benchmarks.

Some notes from the results:
  • Full-index scans are faster with InnoDB. The gap is larger for the PK index because InnoDB secondary indexes suffer from fragmentation on this test. MyRocks suffers from using more CPU per row, so it can't drive storage as fast as InnoDB. MyRocks is up to 1.3X slower than InnoDB for secondary index scans and up to 1.7X slower for PK scans.
  • The 2+3+4 result for MyRocks is close to the time for InnoDB in MySQL 5.6 when MyRocks uses filesystem readahead. I assume we can do more to get better readahead. Index scans on a fragmented B-Tree generate a lot of random reads. A full index scan from an LSM can do large reads from each LSM level to avoid that.
  • Something was done for InnoDB in MySQL 5.7 to make range scans faster. Thank you to the InnoDB team.
#seconds for each index scan
 1      2       3       4       5       2+3+4   index/engine
 219    202     222     224     207      648    MyRocks.none
 182    189     203     197     168      589    MyRocks.none.ra
 241    206     230     228     230      664    MyRocks.zstd
 241    195     215     212     201      622    MyRocks.zstd.ra
 292    181     194     212     127      587    InnoDB-5.6
 419    148     157     203      99      508    InnoDB-5.7

Results for the i5 NUC are similar to the i3 NUC.

#seconds for each index scan
 1      2       3       4       5       2-4     index/engine
 394    345     381     366     342     1092    MyRocks.none
 331    318     344     331     285      993    MyRocks.none.ra
 385    346     383     386     372     1115    MyRocks.zstd
 359    327     360     356     341     1043    MyRocks.zstd.ra
 475    290     302     314     225      906    InnoDB-5.6
 417    250     267     249     164      766    InnoDB-5.7

Next are hardware efficiency metrics that help explain the performance differences. There are many interesting things in here. The columns are explained here.
  • InnoDB gets more read IO throughput than MyRocks. I need to understand this better but part of the problem is that MyRocks uses more CPU per row so it isn't able to consume data from storage as fast as InnoDB.
  • The InnoDB PK scan gets more read IO throughput than the InnoDB secondary index scan because the secondary index is fragmented and the PK is not courtesy of the workload (PK inserts are in PK order, secondary index inserts are not).
  • Filesystem readahead gets ~10% more read IO throughput for MyRocks.
  • InnoDB in MySQL 5.7 uses much less CPU than in MySQL 5.6 (thanks InnoDB team). I have yet to confirm whether the work done to make InnoDB scans faster in 5.7 are InnoDB-only or will benefit other engines.
  • The gap between InnoDB and MyRocks is larger for the PK scan than the secondary index scan. This might occur because of secondary index fragmentation for InnoDB.
  • The CPU overhead from zstd compression for MyRocks is small.

Metrics for q4: the scan of a secondary index
secs    rMB/s   rKB/o   rGB     MCPU/o  engine
366     32.3    0.024   12      16.872  MyRocks.none
331     35.7    0.024   12      17.044  MyRocks.none.ra
386     13.1    0.010    5      18.684  MyRocks.zstd
356     14.0    0.010    5      17.764  MyRocks.zstd.ra
314     44.2    0.028   13      10.748  InnoDB-5.6
249     56.0    0.028   14       7.592  InnoDB-5.7

Metrics for q5 - the second scan of the PK
secs    rMB/s   rKB/o   rGB     MCPU/o
342      72.7   0.051   24      16.064  MyRocks.none
285      87.3   0.051   24      14.612  MyRocks.none.ra
372      30.6   0.023   11      17.820  MyRocks.zstd
341      33.3   0.023   11      17.396  MyRocks.zstd.ra
225     147.0   0.067   32      14.732  InnoDB-5.6
164     198.1   0.067   32       8.900  InnoDB-5.6

And a graph of the results from the 2+3+4 column because some readers like graphs.

Results: read-write, 1000 writes/second

Disclaimer - MyRocks is able to cache the secondary indexes during this test while InnoDB is not. I didn't plan this and in the future I need to prevent that from happening. While being more efficient with space is a benefit of MyRocks, even without compression, there is a step function in performance when the database is cached that distorts performance differences. MyRocks needs less space for secondary indexes because it doesn't suffer fragmentation like a B-Tree and it uses key prefix compression.

This section has results for the read-write tests where the writer does 1000 inserts/second. The tables below have the most interesting data. The  metrics are explained here. All of the engines were able to sustain the target insert rate on average (ips.av). The value is 999 rather than 1000 because of implementation artifacts. There were more write stalls on InnoDB than MyRocks based on the 99th percentile insert rate as ips.99 is lower for InnoDB. QPS is much better for MyRocks but see the disclaimer above. InnoDB writes ~20X more to storage per insert compared to MyRocks.

                Absolute values for the i3 NUC
                IPS.av  IPS.99  wKB/i   QPS.av  QPS.99  rKB/q   CPU/q
MyRocks.none    999     998      1.30   3349    2377      0.37   7912
MyRocks.zstd    999     998      1.70   3450    2506      0.25   7986
InnoDB-5.6      999     977     26.20    258     156    101.61  37834
InnoDB-5.7      999     977     30.70    311     238     98.68  26829

                Absolute values for the i5 NUC
                IPS.av  IPS.99  wKB/i   QPS.av  QPS.99  rKB/q   CPU/q
MyRocks.none    999     998      2.23   3818    2395     0.47    6945
MyRocks.zstd    999     998      1.96   3986    2858     0.28    6788
InnoDB-5.6      999     987     73.09    992     626    59.45   16798
InnoDB-5.7      999     988     71.22   1042     738    61.07   16115

Results: read-write, 100 writes/second

See the disclaimer in the previous section.

This section has results for the read-write tests where the writer does 100 inserts/second. The metrics are explained here. The analysis from the previous section is relevant here.

                Absolute values for the i3 NUC
                IPS.av  IPS.99  QPS.av  QPS.99  rKB/q   CPU/q
MyRocks.none    100     100     3419    3184     0.01    7173
MyRocks.zstd    100     100     3549    3270     0.01    6969
InnoDB-5.6      100     100      827     534    66.49   13675
InnoDB-5.7      100     100      900     566    69.56   10269

                Absolute values for the i5 NUC
                IPS.av  IPS.99  QPS.av  QPS.99  rKB/q   CPU/q
MyRocks.none    100     100     3911    3740     0       6202
MyRocks.zstd    100     100     4090    3841     0.01    5938
InnoDB-5.6      100     100     1334     951    50.42   11546
InnoDB-5.7      100     100     1383     996    52.50   10807

I did not include a graph. See the disclaimer above.


No comments:

Post a Comment

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...