Thursday, May 11, 2017

InnoDB, MyRocks and TokuDB on the insert benchmark

This post shows some of the improvements we recently made to RocksDB to reduce response time variance for write-heavy workloads. This work helps RocksDB, MyRocks and MongoRocks.
This also extends the result I shared for the impact of the InnoDB redo log size on insert benchmark load throughput. Here I add results for MyRocks and TokuDB. In case you haven't heard, the goal for MyRocks is to provide similar performance to InnoDB with much better space and write efficiency. We have real workloads where InnoDB uses 4X more space than MyRocks.
tl;dr
  • Sorry for overusing the word much.
  • Write response time variance is much better in recent RocksDB builds.
  • All engines have room for improvement to reduce write response time variance.
  • The MyRocks advantage increases when moving from faster to slower storage.
  • MyRocks insert rates are not slowed when compression is enabled.
  • While the write rates I demonstrate here for all engines are impressive, the rates might not be sustainable if I expect the SSD devices to last for more than one year. Of course, better write-efficiency from MyRocks helps a lot with endurance.
Making RocksDB better

My summary of the recent improvements to RocksDB is too vague. I hope the authors write real posts soon to explain their contributions. The work includes:
  • perform L0 to L0 compaction to reduce the number of L0 files when the L1 is busy
  • add compaction_pri=kMinOverlappingRatio to be more clever about the key on which SST files start (or stop) to reduce write-amplification during compaction
  • increase delayed_write_rate from 2MB/second to 16MB/second to smooth the impact from throttling (a my.cnf sketch for these options follows this list)
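
To make the first two changes concrete, here is a rough sketch of how they can be expressed in a MyRocks my.cnf. The lines below are an illustration of the option names rather than the exact configuration used for these tests, and the L0 to L0 compaction change is internal to RocksDB, not a my.cnf setting.

# compaction_pri is a column family option, set via the default CF options string
rocksdb_default_cf_options=compaction_pri=kMinOverlappingRatio
# delayed_write_rate is in bytes/second: 16MB/s rather than the old 2MB/s default
rocksdb_delayed_write_rate=16777216
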
Configuration

While I usually run the insert benchmark in 3 steps (insert only, insert & queries, insert & queries) I only share results for the insert only step here. Be wary about drawing too many conclusions from an insert-only workload. I will soon share results for the insert & queries steps.

What I wrote in the Configuration section of my previous post is still valid. I tested InnoDB, MyRocks and TokuDB and the following names describe the engine and configuration:
  • Rocks.def10.kmin.mar - MyRocks from March 17 using the def10 my.cnf without compression and with kMinOverlappingRatio.
  • Rocks.def10.nokmin.mar - MyRocks from March 17 using the def10 my.cnf without compression and without kMinOverlappingRatio.
  • Rocks.def10.kmin.apr - MyRocks from April 14 using the def10 my.cnf without compression and with kMinOverlappingRatio.
  • Rocks.def10.nokmin.apr - MyRocks from April 14 using the def10 my.cnf without compression and without kMinOverlappingRatio.
  • Rocks.def10.kmin.apr.zstd - MyRocks from April 14 using the def10 my.cnf with zstd compression and with kMinOverlappingRatio.
  • Rocks.def10.nokmin.apr.zstd - MyRocks from April 14 using the def10 my.cnf with zstd compression and without kMinOverlappingRatio.
  • InnoDB-5.7.17.redoX - InnoDB from MySQL 5.7.17 with an Xgb redo log.
  • InnoDB-5.6.35.redoX - InnoDB from MySQL 5.6.35 with an Xgb redo log.
  • TokuDB-5.7.17-none - TokuDB without compression from Percona Server for MySQL 5.7.17
  • TokuDB-5.7.17-zlib - TokuDB with zlib compression from Percona Server for MySQL 5.7.17
I compiled MySQL from source for MyRocks (FB MySQL) and InnoDB (upstream MySQL). I used Percona Server for TokuDB. The TokuDB block cache was 150gb for in-memory workloads and 10gb for IO-bound workloads, and the my.cnf is here. The InnoDB buffer pool was 180gb for in-memory workloads and 35gb for IO-bound workloads. The my.cnf for InnoDB is here for 5.6.35 and here for 5.7.17. The base my.cnf for MyRocks is here. It needs to be edited (grep for TODO) because the MyRocks my.cnf files listed above differ in a few ways, sketched after this list:
  1. The value for RocksDB block_cache_size (150gb for in-memory, 10gb for IO-bound).
  2. Whether compaction_pri=kMinOverlappingRatio was set
  3. Whether compression was enabled. With compression enabled I used none for L0, L1 & L2, then LZ4 starting at L3 and finally bottommost_compression=kZSTDCompression.
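
Putting those three edits together, the edited portion of the MyRocks my.cnf might look roughly like the sketch below. The option string syntax, the number of levels and the compression value spellings are my assumptions, not a copy of the def10 files.

# 1. RocksDB block cache: 150gb for in-memory, 10gb for IO-bound
rocksdb_block_cache_size=150G
# 2 and 3. compaction_pri and per-level compression are column family options:
#          none for L0, L1 and L2, LZ4 from L3, zstd for the bottommost level
rocksdb_default_cf_options=compaction_pri=kMinOverlappingRatio;compression_per_level=kNoCompression:kNoCompression:kNoCompression:kLZ4Compression:kLZ4Compression:kLZ4Compression:kLZ4Compression;bottommost_compression=kZSTD
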
In-memory load

The database fits in the database cache for the in-memory load. There should be no reads to storage and many writes to storage.

For fast SSD, InnoDB in MySQL 5.7.17 has the best throughput and is much faster than in 5.6.35. A larger InnoDB redo log improves throughput. TokuDB is faster than MyRocks, but that changes when the workload switches from insert-only to inserts and queries.

Results are different for the server with slow SSD. InnoDB depends on fast random writes and the slow SSD provides less of that than the fast SSD. Here MyRocks is faster than InnoDB in 5.7.17 except when a large redo log (32gb) is used. It is faster than InnoDB in 5.6.35 in all cases. I did not test TokuDB on this hardware.

From the In-memory load metrics section below, InnoDB writes more to storage per insert (see wKB/i), uses more CPU per insert (see Mcpu/i) and uses more space than MyRocks (see size) with or without compression. A larger redo log reduces the InnoDB write rate per insert by more than half. The kMinOverlappingRatio option in MyRocks greatly reduces the write rate to storage.
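
For reference, the InnoDB redo log size comes from my.cnf settings like the ones below. The 8gb x 4 split is just one way to get a 32gb redo log; the file size and count in the my.cnf files used here may differ.

# a 32gb InnoDB redo log as 4 files of 8gb each (an assumed split)
innodb_log_file_size=8G
innodb_log_files_in_group=4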

Compared to MyRocks, TokuDB uses more than 2X the disk space in the database directory (see size) and uses more CPU/insert (see Mcpu/i). TokuDB might be trading more space amplification to get less write amplification and faster inserts as explained by the RUM Conjecture. MyRocks doesn't make that trade with leveled compaction. It does with universal compaction, which I did not test.

In-memory load metrics

See the In-memory load metrics section in my previous post for the legend for the tables below.

Fast SSD

ips.av  ips.99  wkb/i   Mcpu/i  size    rss  wmb/s   cpu  engine
120250   4386   0.78    157      61     2.0  186.1  18.9  Rocks.def10.kmin.mar
132979   7786   0.78    156      63    22.0  206.7  20.8  Rocks.def10.kmin.apr
232666   9925   5.31    176      97   104.4 1235.1  41.0  InnoDB-5.7.17.redo4
295683  18009   1.66    154      97   104.1  490.6  45.5  InnoDB-5.7.17.redo32
 92404   3504   5.11    201      97   106.2  472.5  18.6  InnoDB-5.6.35.redo4
171116  10165   1.74    175      97   106.2  297.9  30.0  InnoDB-5.6.35.redo32
207555  10852   0.78    212     146   176.3  162.3  44.0  TokuDB-5.7.17-none

Slow SSD

ips.av ips.99  wkb/i Mcpu/i size    rss   wmb/s   cpu  engine
115607   4315  0.72     131   61    1.9   163.2   15.2 Rocks.def10.kmin.mar
123793   5834  1.16     168   64    1.8   285.5   20.8 Rocks.def10.nokmin.mar
130548   7724  0.72     130   61   21.8   184.2   17.0 Rocks.def10.kmin.apr
128833   7555  1.09     154   69   20.4   280.9   19.8 Rocks.def10.nokmin.apr
 68672   3277  4.95     165   97  104.4   339.9   11.3 InnoDB-5.7.17.redo4
177179   7231  0.85     130   97  104.3   151.3   23.0 InnoDB-5.7.17.redo32
 38058   1691  5.01     176   97  106.3   190.6    6.7 InnoDB-5.6.35.redo4
 71317   2914  1.26     145   97  106.3    89.5   10.3 InnoDB-5.6.35.redo32

IO-bound load

Things are different for the IO-bound load compared to the in-memory load. MyRocks is strictly faster than InnoDB for the IO-bound load and the redo log size doesn't make a big difference for InnoDB. InnoDB in 5.7.17 does better than in 5.6.35. MyRocks insert rates don't drop when compression is enabled, while they do for TokuDB.

Secondary index maintenance for InnoDB is read-modify-write. Whether the reads are done as part of the insert or deferred to the change buffer, eventually those random reads must get done and they use IO capacity that is then not available for random writes. Statements can also stall on page writeback when the buffer pool is full and pages at the tail of the LRU are dirty (insert joke about single page flush here).

Results for fast SSD and slow SSD are similar but the difference between MyRocks and InnoDB is larger on slow SSD because InnoDB depends more on random IO performance. The slow SSD results also show the benefit from using kMinOverlappingRatio when compression is enabled -- in the IO-bound load metrics section, wKB/i and rKB/i are about 2/3 the rate compared to MyRocks without that option.

TokuDB has the fastest insert rate but it also uses much more space (see size) and CPU (see Mcpu/i) compared to MyRocks. I suspect it is trading more space amplification to get less write amplification as explained by the RUM Conjecture.

Compared to MyRocks, InnoDB writes more to storage per insert (see wKB/i), uses more CPU per insert (see Mcpu/i) and uses more space (see size) even compared to uncompressed MyRocks.


IO-bound load metrics

See the IO-bound load metrics section in my previous post for the legend for the tables below.

Fast SSD

ips.av ips.99  r/i   rkb/i wkb/i Mcpu/i size  rss   r/s  rmb/s  wmb/s   cpu  engine
126727   4671  0     0.23  2.07  175    226   4.9   266   29.6  523.2  22.2  Rocks.def10.kmin.mar
129879   7638  0     0.34  1.98  174    225  12.8   383   43.6  513.3  22.7  Rocks.def10.kmin.apr
128816   7687  0     0.13  1.64  209    101  14.1   147   16.6  421.0  27.0  Rocks.def10.kmin.apr.zstd
 61711   3353  0.09  1.48 14.65  562    402   4.0  5698   91.2  904.3  34.7  InnoDB-5.7.17.redo4
 63959   3419  0.09  1.43 10.11  535    402   4.0  5732   91.7  646.6  34.2  InnoDB-5.7.17.redo32
 45874   1790  0.11  1.71 11.25  585    402  43.2  4915   78.6  516.0  26.8  InnoDB-5.6.35.redo4
 59312   2473  0.10  1.56  8.36  593    403  43.2  5776   92.4  495.7  35.2  InnoDB-5.6.35.redo32
156250   4150  0.01  0.82  2.97  340    400  11.2  1854  128.1  464.5  53.1  TokuDB-5.7.17-none
123259   1535  0.01  0.25  1.00  540    143  11.2   854   30.4  123.5  66.6  TokuDB-5.7.17-zlib

Slow SSD

ips.av ips.99  r/i   rkb/i wkb/i Mcpu/i size  rss   r/s  rmb/s  wmb/s   cpu  engine
122963   5280  0.01  1.46  1.55  192    232  12.8  1443  179.6  380.0  23.7  Rocks.def10.nokmin.apr
127316   7283  0     0.32  0.91  148    225  12.9   334   40.8  230.1  18.8  Rocks.def10.kmin.apr
125842   7420  0     0.11  0.73  177    101  14.1   112   13.4  182.7  22.3  Rocks.def10.kmin.apr.zstd
 82771   1980  0.01  0.66  1.11  259    114  13.7   445   54.5  182.7  21.4  Rocks.def10.nokmin.apr.zstd
 34101   1975  0.10  1.60 10.91  402    400  39.7  3412   54.6  372.0  13.7  InnoDB-5.7.17.redo4
 49244   1750  0.10  1.56  6.39  454    403  40.0  4803   76.9  314.5  22.4  InnoDB-5.7.17.redo32
 17654    605  0.18  2.79 12.22  448    398  43.2  3083   49.3  215.7   7.9  InnoDB-5.6.35.redo4
 30734   1276  0.09  1.50  5.01  407    403  43.2  2882   46.1  153.9  12.5  InnoDB-5.6.35.redo32

In-memory throughput over time

Fast SSD

InnoDB variance and throughput are much worse with a 4gb vs 32gb redo log. MyRocks variance has improved significantly from the March 17 to the April 14 build thanks to support for L0 to L0 compaction. TokuDB has a few stalls, similar to the March 17 build of MyRocks.



Slow SSD

InnoDB suffers from too much variance regardless of redo log size. It needs random write IOPs and the slow SSD has less of that than the fast SSD. MyRocks with the April 14 build is much better than the March 17 build because of the change to perform L0 to L0 compaction to reduce the chance of stalls.





IO-bound throughput over time

Fast SSD

This is similar to the in-memory results. InnoDB variance and throughput are much worse with a 4gb vs 32gb redo log. MyRocks variance has improved significantly from the March 17 to the April 14 build thanks to support for L0 to L0 compaction. TokuDB has a few stalls.




Slow SSD

InnoDB has too much variance; the redo log size isn't significant, but 5.7.17 is better than 5.6.35. For MyRocks the April 14 build is better than the March 17 build and the kMinOverlappingRatio feature helps.




2 comments:

  1. Do you have the details of what was being inserted (i.e. was this **random** write performance?) - I didn't see that anywhere. Want to know if it is applicable to my usecase.

    Reply: Inserts are in PK order, but random for secondary indexes. So IO to the PK index is easy. But many random reads/writes for the secondary indexes.
