Friday, May 9, 2014

Write amplification from log writes

MongoDB, TokuMX and MySQL use log files with high-value data. For MongoDB this is the journal, which uses direct IO. For MySQL these are the binlog, relay log and InnoDB redo log, all of which use buffered IO by default. The InnoDB redo log uses 512 bytes as its "page size" and the replication logs have no notion of page size.

The minimum size for a write to a storage device is either the sector size or the filesystem page size. The sector size is either 512 or 4096 bytes today and in the future it will be 4096 bytes or larger. The filesystem page size on my servers is 4096 bytes. When a DBMS appends 309 bytes to the end of a log file, more than 309 bytes are written to the storage device. Depending on the filesystem and the choice of buffered or direct IO, either a disk sector or a filesystem page is written. This can explain why the write rate reported by iostat in the wsec/s column is higher than the rate reported by the DBMS.
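The rounding is easy to see with a bit of arithmetic. This is a minimal sketch, assuming a 4096-byte write granule (sector or filesystem page) and the 309-byte append from above; the function name is made up for illustration.
    # Physical bytes written when a small log append is rounded up to the
    # write granule (disk sector or filesystem page).
    def physical_write_bytes(append_bytes, granule=4096):
        granules = (append_bytes + granule - 1) // granule
        return granules * granule

    append = 309                            # logical bytes appended by the DBMS
    physical = physical_write_bytes(append) # 4096
    print(physical, physical / append)      # 4096, ~13.3x amplification for this append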

MongoDB avoids the uncertainty by padding journal writes to a multiple of 8 KB. Perhaps the padding will be reduced to a multiple of 4 KB (see JIRA 13344). But the good thing is that the counters reported by MongoDB are correct. The OS and storage device will report the same value for bytes written to the journal file. Of course, this ignores any write-amplification from flash garbage collection.

MySQL and TokuMX do not avoid the uncertainty. I spent a few hours today looking at a busy host to explain the difference between the write rates reported by MySQL, iostat and flash storage physical bytes written counters. The sources of writes on the host include the following. Most of the bytes written were from the doublewrite buffer followed by dirty page write back.
  • InnoDB dirty page writeback - The host uses mostly 2X compression with an in-memory page size of 16 KB. So most disk writes are 8 KB but some are 16 KB.
  • InnoDB doublewrite buffer - Even though most pages are 8 KB, any page in the doublewrite buffer uses 16 KB. Domas has described this as triple-writing pages in his bug report.
  • InnoDB redo log - logically these writes are done as a multiple of 512 bytes with buffered IO. In the WebScaleSQL patch we have an option to round the write up to 4 KB to avoid reads from the filesystem when the to-be-written page is not in the OS filesystem cache (see the sketch after this list). Alas, the InnoDB counter for bytes written to the log does not include the bytes added by the round-to-4KB.
  • MySQL binlog - the binlog makes no attempt to round up writes to the end of the log. The counter for bytes written to the binlog ignores the rounding up done by the filesystem when fsync is called.
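Here is a rough sketch of the round-to-4KB idea, assuming a simplified model where the write length is extended so the write ends on a filesystem page boundary; this is an illustration of the idea, not the WebScaleSQL implementation.
    # Round a buffered log write up so it ends on a 4 KB filesystem page
    # boundary. Writing only part of the trailing page can force the
    # filesystem to read that page first when it is not in the cache.
    FS_PAGE = 4096

    def padded_write_len(file_offset, write_len):
        end = file_offset + write_len
        rounded_end = ((end + FS_PAGE - 1) // FS_PAGE) * FS_PAGE
        return rounded_end - file_offset

    # A 700-byte redo log write starting at offset 8192 becomes a 4096-byte write.
    print(padded_write_len(8192, 700))      # 4096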
We are working on fixing the InnoDB redo log bytes-written counter to include the padding added when rounding writes up to 4 KB. Until then you can use existing counters to estimate how much rounding was needed. These counters show the number of bytes written to the redo log and the number of fsyncs (or fdatasyncs) done. The size of the average write per fsync is less than 2 KB. With a 4 KB filesystem page the real rate is more than 2X the value reported by the Innodb_os_log_written counter.
    Innodb_os_log_fsyncs 370923966
    Innodb_os_log_written 656878594048
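As a sanity check, the arithmetic on the counters above can be scripted. This sketch computes the average bytes written per fsync and a lower-bound estimate of the physical write rate, assuming each fsync writes at least one 4 KB filesystem page.
    # Estimate redo log write amplification from the counters above.
    innodb_os_log_written = 656878594048    # bytes reported by InnoDB
    innodb_os_log_fsyncs = 370923966        # fsync/fdatasync calls

    avg_per_fsync = innodb_os_log_written / innodb_os_log_fsyncs
    print(round(avg_per_fsync))             # ~1771 bytes, less than 2 KB

    # Lower bound on physical bytes: at least one 4 KB page per fsync.
    physical_estimate = innodb_os_log_fsyncs * 4096
    print(round(physical_estimate / innodb_os_log_written, 2))   # ~2.31x the reported rate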

Counters for binlog writes and fsyncs can also be used to estimate how much the reported write rate understates the real rate. The size of the average write per fsync is again less than 2 KB and with a 4 KB filesystem page the real rate is more than 2X the value reported by the Binlog_bytes_written counter.
    Binlog_bytes_written 699055440580
    Binlog_fsync_count 385104140

The final set of counters was used to estimate the average size of a page writeback. The ratio of the values below is about 10 KB. Most pages written were 8 KB but some were 16 KB.
    Innodb_data_async_write_requests 286202895
    Innodb_data_async_write_bytes 2764395651584
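The average can also be used to estimate the mix of page sizes. This is a rough sketch assuming every write was either 8 KB or 16 KB, which ignores other write sizes.
    # Estimate the 8 KB vs 16 KB mix from the async write counters above,
    # assuming every write was one of those two sizes.
    innodb_data_async_write_bytes = 2764395651584
    innodb_data_async_write_requests = 286202895

    avg = innodb_data_async_write_bytes / innodb_data_async_write_requests
    print(round(avg))                       # ~9659 bytes, close to 10 KB

    # Solve: frac_8k * 8192 + (1 - frac_8k) * 16384 = avg
    frac_8k = (16384 - avg) / (16384 - 8192)
    print(round(frac_8k, 2))                # ~0.82, so roughly 4 in 5 writes were 8 KB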

I am not sure whether all of the counters above are available in upstream MySQL. They are in WebScaleSQL. We have been adding extra monitoring for many years to support running MySQL at web scale.
