Tuesday, November 22, 2016

Why is MyRocks more write-efficient than InnoDB?

This year I shared results where InnoDB wrote between 10X and 20X more data to storage than MyRocks for the same workload. I use KB written to storage per transaction as a measure of write efficiency and I usually compute this with data from the benchmark client and iostat. I get KB written/second from iostat, average transaction/second from the benchmark client and divide the former by the latter to compute KB written/transaction. When using SSD this excludes the writes done by SSD firmware and I previously reported that the overhead was worse for InnoDB than for RocksDB on one vendor's device.

An engine that writes less to storage per transaction is more write efficient. It is a good thing if MyRocks writes 10X less to storage than InnoDB for the same workload. This might enable MyRocks to use lower-endurance SSD for workloads where InnoDB required higher-endurance SSD. This might enable MyRocks to use SSD for workloads in which the device would not last with InnoDB. This also means that MyRocks needs less overprovisioning on the SSD, which is another way of saying you get more capacity from the device.

This is an update on results I previously shared.

There are a few reasons why MyRocks is more write-efficient than InnoDB:
  1. Doublewrite buffer
  2. Configuration
  3. Page size
The InnoDB doublewrite buffer doubles the storage write rate. It does this for a good reason -- to protect against partial page writes. Perhaps one day SSD vendors will agree on an atomic-write solution that works across vendors and with popular file systems on Linux. Perhaps one day crash safe RAM will be a common thing in data centers. Perhaps one day we will have a copy-on-write filesystem that is widely used for InnoDB on Linux. Until then we are stuck with 2X write-amplification from the doublewrite buffer.

I might be overstating this to make a point. If you have fast storage (NAND flash) and the database working set fits in RAM then you have too much RAM. If you have fast storage then configure the database to use it. Or keep the working set in RAM and use a disk array instead of NAND flash. But if you have an in-memory workload and a database engine that does random IO (update-in-place b-tree) then you still need IOPs capacity or you should switch to a proper in-memory database engine like Tarantool.

Most of the servers that I care about are setup so that the database working set isn't in RAM. I configure benchmarks like Linkbench in the same manner. I am not promising that MyRocks will write 10X less to storage than InnoDB for all use cases - workload and configuration matter. But it tends to be better.

Page size
InnoDB page size has a big impact on write-efficiency when the working set isn't cached because dirty b-tree pages will be evicted from the tail of the LRU earlier and when evicted they must be written back to storage (twice with InnoDB thanks to the doublewrite buffer). In the worst case pages are written back with only dirty row and the write-amplification in that case is sizeof(page) / sizeof(row). In the best case all rows on the page are dirty but the best case isn't likely when the working set isn't cached.

The obvious way to reduce write-amplification (and improve write-efficiency) is to reduce the database page size. The default page size for InnoDB is 16kb, but it is possible to use 8kb or 4kb pages at initialization by setting innodb_page_size or when using compressed tables. Both of these have a cost. First, InnoDB performance is greatly reduced when using compressed tables. Second, rows must fit in half of a page, excluding LOB columns, so a smaller page also means a smaller max row size. A smaller page also reduces the max size of an index key. See the manual for more details.

I repeated Linkbench with an IO-bound configuration - 50G of RAM, database is ~400G for MyRocks with zlib compression and ~1.6T for InnoDB without compression. The data below shows throughput (TPS is transactions/second) and the storage write-rate per transaction (iostat wKB/t) for MyRocks versus InnoDB with different page sizes. When using a smaller page size for InnoDB the storage write rate and TPS is better. I assume TPS is better because when less IO capacity is used for writes then more IO capacity is available for reads.

          Page-size  TPS     iostat-wKB/t
MyRocks   16kb       28965    1.25
InnoDB    4kb        24845    6.13
InnoDB    8kb        24352   10.52
InnoDB    16kb       21414   19.70

And graphs for the same data.


  1. Tx Mark for this very informative benchmark , is it InnoDB page compression 1/2 ? (sorry if you answer that question in an other post) ? That can be nice to compare KB on disk vs Network Byte Receive and with or without binary logs, to get a big picture of binlog amplification !

    1. I did not use compression for InnoDB in this test