This post shows some of the improvements we recently made to RocksDB to reduce response time variance for write-heavy workloads. This work helps RocksDB, MyRocks and MongoRocks.
This also extends the
result I shared for the impact of the InnoDB redo log size on insert benchmark load throughout. Here I add results for MyRocks and TokuDB. In case you haven't heard, the goal for MyRocks is to provide similar performance to InnoDB with much better space and write efficiency. We have real workloads where InnoDB uses 4X more space than MyRocks.
tl;dr
- Sorry for overusing the word much.
- Write response time variance is much better in recent RocksDB builds
- All engines have room for improvement to reduce write response time variance
- The MyRocks advantage increases when moving from faster to slower storage
- MyRocks insert rates are not slowed when compression is enabled.
- While the write rates I demonstrate here for all engines are impressive, the rates might not be sustainable if I expect the SSD devices to last for more than one year. Of course, better write-efficiency from MyRocks helps a lot with endurance.
Making RocksDB better
My summary of the recent improvements to RocksDB is too vague. I hope the authors write real posts soon to explain their contributions. The work includes:
- perform L0 to L0 compaction to reduce the number of L0 files when the L1 is busy
- add compaction_pri=kMinOverlapping to be more clever about the key on which SST files start (or stop) to reduce write-amplification during compaction
- increase delayed_write_rate from 2MB to 16MB/second to smooth the impact from throttling
Configuration
While I usually run the insert benchmark in 3 steps (insert only, insert & queries, insert & queries) I only share results for the insert only step here. Be wary about drawing too many conclusions from an insert-only workload. I will soon share results for the insert & queries steps.
What I wrote in the
Configuration section of my previous post is still valid. I tested InnoDB, MyRocks and TokuDB and the following names describe the engine and configuration:
- Rocks.def10.kmin.mar - MyRocks from March 17 using the def10 my.cnf without compression and with kMinOverlappingRatio
- Rocks.def10.nokmin.mar - MyRocks from March 17 using the def10 my.cnf without compression and without kMinOverlappingRatio.
- Rocks.def10.kmin.apr - MyRocks from April 14 using the def10 my.cnf without compression and with kMinOverlappingRatio.
- Rocks.def10.nokmin.apr - MyRocks from April 14 using the def10 my.cnf withtout compression and without kMinOverlappingRatio.
- Rocks.def10.kmin.apr.zstd - MyRocks from April 14 using the def10 my.cnf with zstd compression and with kMinOverlappingRatio.
- Rocks.def10.nokmin.apr.zstd - MyRocks from April 14 using the def10 my.cnf with zstd compression and without kMinOverlappingRatio.
- InnoDB-5.7.17.redoX - InnoDB from MySQL 5.7.17 with an Xgb redo log.
- InnoDB-5.6.35.redoX - InnoDB from MySQL 5.6.35 with an Xgb redo log.
- TokuDB-5.7.17-none - TokuDB without compression from Percona Server for MySQL 5.7.17
- TokuDB-5.7.17-zlib - TokuDB with zlib compression from Percona Server for MySQL 5.7.17
I compiled MySQL from source for MyRocks (FB MySQL) and InnoDB (upstream MySQL). I used Percona Server for TokuDB. The TokuDB block cache was 150gb for in-memory workloads and 10gb for IO-bound workloads and the my.cnf
is here. The InnoDB buffer pool was 180gb for in-memory workloads and 35gb for IO-bound workloads. The my.cnf for InnoDB is
here for 5.6.35 and
here for 5.7.17. The base my.cnf for MyRocks
is here. It needs to be edited (grep for TODO) because the MyRocks my.cnf files listed above differ in a few ways
- The value for RocksDB block_cache_size (150gb for in-memory, 10gb for IO-bound).
- Whether compaction_pri=kMinOverlappingRatio was set
- Whether compression was enabled. With compression enabled I used none for L0, L1 & L2, then LZ4 starting at L3 and finally bottommost_compression=kZSTDCompression.
In-memory load
The database fits in the database cache for the in-memory load. There should be no reads to storage and many writes to storage.
For fast SSD InnoDB in MySQL 5.7.17 has the best throughput and is much faster than in 5.6.35. A larger InnoDB redo log improves throughput. TokuDB has is faster than MyRocks, but that changes when the workload switches from insert-only to inserts and queries.
Results are different for the server with slow SSD. InnoDB depends on fast random writes and slow SSD provides less of that than. Here MyRocks is faster than InnoDB in 5.7.17 except when a large redo log (32gb) is use. Is is faster than InnoDB in 5.6.35 in all cases. I did not test TokuDB on this hardware.
From the
In-memory load metrics section below, InnoDB writes more to storage per insert (see wKB/i), uses more CPU per insert (see Mcpu/i) and uses more space than MyRocks (see size) with or without compression. A larger redo log reduces the InnoDB write rate per insert by more than half. The kMinOverlappingRatio option in MyRocks greatly reduces the write rate to storage.
Compared to MyRocks, TokuDB uses more than 2X the disk space in the database directory (see size) and uses more CPU/insert (see Mcpu/i). TokuDB might be trading more space amplification to get less write amplification and faster inserts as explained by
the RUM Conjecture. MyRocks doesn't make that trade with leveled compaction. It does with universal compaction, which I did not test.
In-memory load metrics
See the
In-memory load metrics section in
my previous post for the legend for the tables below.
Fast SSD
ips.av ips.99 wkb/i Mcpu/i size rss wmb/s cpu engine
120250 4386 0.78 157 61 2.0 186.1 18.9 Rocks.def10.kmin.mar
132979 7786 0.78 156 63 22.0 206.7 20.8 Rocks.def10.kmin,aor
232666 9925 5.31 176 97 104.4 1235.1 41.0 InnoDB-5.7.17.redo4
295683 18009 1.66 154 97 104.1 490.6 45.5 InnoDB-5.7.17.redo32
92404 3504 5.11 201 97 106.2 472.5 18.6 InnoDB-5.6.35.redo4
171116 10165 1.74 175 97 106.2 297.9 30.0 InnoDB-5.6.35.redo32
207555 10852 0.78 212 146 176.3 162.3 44.0 TokuDB-5.7.17-none
Slow SSD
ips.av ips.99 wkb/i Mcpu/i size rss wmb/s cpu engine
115607 4315 0.72 131 61 1.9 163.2 15.2 Rocks.def10.kmin.mar
123793 5834 1.16 168 64 1.8 285.5 20.8 Rocks.def10.nokmin.mar
130548 7724 0.72 130 61 21.8 184.2 17.0 Rocks.def10.kmin.apr
128833 7555 1.09 154 69 20.4 280.9 19.8 Rocks.def10.nokmin.apr
68672 3277 4.95 165 97 104.4 339.9 11.3 InnoDB-5.7.17.redo4
177179 7231 0.85 130 97 104.3 151.3 23.0 InnoDB-5.7.17.redo32
38058 1691 5.01 176 97 106.3 190.6 6.7 InnoDB-5.6.35.redo4
71317 2914 1.26 145 97 106.3 89.5 10.3 InnoDB-5.6.35.redo32
IO-bound load
Things are different for the IO-bound load compared to the in-memory load. MyRocks is strictly faster than InnoDB for the IO-bound load and the redo log size doesn't make a big difference for InnoDB. InnoDB in 5.7.17 does better than in 5.6.35. MyRocks insert rates don't drop when compression is enabled, while they do for TokuDB.
Secondary index maintenance for InnoDB is read-modify-write. Whether the reads are done as part of the insert or deferred to the change buffer, eventually those random reads must get done and they use IO capacity that is then not available for random writes. Statements can also stall on page writeback when the buffer pool is full and pages at the tail of the LRU are dirty (insert joke about single page flush here).
Results for fast SSD and slow SSD are similar but the difference between MyRocks and InnoDB is larger on slow SSD because InnoDB depends more on random IO performance. The slow SSD results also show the benefit from using kMinOverlappingRatio when compression is enabled. The
IO-bound load metrics section shows the benefit from kMinOverlapping ratio -- wKB/i and rKB/i are about 2/3 the rate compared to MyRocks without that option.
TokuDB has the fastest insert rate but it also uses much more space (see wKB/i) and CPU (see Mcpu/i) compared to MyRocks. I suspect it is trading more space amplification to get less write amplification as explained by
the RUM Conjecture.
Compared to MyRocks, InnoDB writes more to storage per insert (see wKB/i), uses more CPU per insert (see Mcpu/i) and uses more space (see size) even compared to uncompress MyRocks.
IO-bound load metrics
See the
IO-bound load metrics section in
my previous post for the legend for the tables below.
Fast SSD
ips.av ips.99 r/i rkb/i wkb/i Mcpu/i size rss r/s rmb/s wmb/s cpu engine
126727 4671 0 0.23 2.07 175 226 4.9 266 29.6 523.2 22.2 Rocks.def10.kmin.mar
129879 7638 0 0.34 1.98 174 225 12.8 383 43.6 513.3 22.7 Rocks.def10.kmin.apr
128816 7687 0 0.13 1.64 209 101 14.1 147 16.6 421.0 27.0 Rocks.def10.kmin.apr.zstd
61711 3353 0.09 1.48 14.65 562 402 4.0 5698 91.2 904.3 34.7 InnoDB-5.7.17.redo4
63959 3419 0.09 1.43 10.11 535 402 4.0 5732 91.7 646.6 34.2 InnoDB-5.7.17.redo32
45874 1790 0.11 1.71 11.25 585 402 43.2 4915 78.6 516.0 26.8 InnoDB-5.6.35.redo4
59312 2473 0.10 1.56 8.36 593 403 43.2 5776 92.4 495.7 35.2 InnoDB-5.6.35.redo32
156250 4150 0.01 0.82 2.97 340 400 11.2 1854 128.1 464.5 53.1 TokuDB-5.7.17-none
123259 1535 0.01 0.25 1.00 540 143 11.2 854 30.4 123.5 66.6 TokuDB-5.7.17-zlib
Slow SSD
ips.av ips.99 r/i rkb/i wkb/i Mcpu/i size rss r/s rmb/s wmb/s cpu engine
122963 5280 0.01 1.46 1.55 192 232 12.8 1443 179.6 380.0 23.7 Rocks.def10.nokmin.apr
127316 7283 0 0.32 0.91 148 225 12.9 334 40.8 230.1 18.8 Rocks.def10.kmin.apr
125842 7420 0 0.11 0.73 177 101 14.1 112 13.4 182.7 22.3 Rocks.def10.kmin.apr.zstd
82771 1980 0.01 0.66 1.11 259 114 13.7 445 54.5 182.7 21.4 Rocks.def10.nokmin.apr.zstd
34101 1975 0.10 1.60 10.91 402 400 39.7 3412 54.6 372.0 13.7 InnoDB-5.7.17.redo4
49244 1750 0.10 1.56 6.39 454 403 40.0 4803 76.9 314.5 22.4 InnoDB-5.7.17.redo32
17654 605 0.18 2.79 12.22 448 398 43.2 3083 49.3 215.7 7.9 InnoDB-5.6.35.redo4
30734 1276 0.09 1.50 5.01 407 403 43.2 2882 46.1 153.9 12.5 InnoDB-5.6.35.redo32
In-memory throughput over time
Fast SSD
InnoDB variance and throughput are much worse with a 4gb vs 32gb redo log. MyRocks variance has improved significantly from the March 17 to the April 14 build thanks to support for L0 to L0 compaction. TokuDB has a few stalls, similar to the March 17 build of MyRocks.
Slow SSD
InnoDB suffers from too much variance regardless of redo log size. It needs random write IOPs and the slow SSD has less of that than the fast SSD. MyRocks with the April 14 build is much better than the March 17 build because of the change to perform L0 to L0 compaction to reduce the chance of stalls.
IO-bound throughput over time
Fast SSD
This is similar to the in-memory results. InnoDB variance and throughput are much worse with a 4gb vs 32gb redo log. MyRocks variance has improved significantly from the March 17 to the April 14 build thanks to support for L0 to L0 compaction. TokuDB has a few stalls.
Slow SSD
InnoDB has too much variance, redo log size isn't significant but 5.7.17 is better than 5.6.35. For MyRocks the April 14 build is better than March 17 and the kMinOverlappingRatio feature helps.