Monday, October 26, 2020

InnoDB, fsync and fdatasync - reducing commit latency

MySQL has an commit penalty compared to Postgres and MongoDB. If you want commit to be durable with the binlog enabled then MySQL does two fsyncs per commit -- one for the binlog, one for the InnoDB redo log. But Postgres and MongoDB only need one fsync per commit in the same setup. Note that I am ignoring the benefit of group commit for now, and all of these can benefit from that.

Better performance is the reason for considering fdatasync for the InnoDB redo log. The response time for fdatasync is likely to be less than for fsync because fdatasync does less work. The InnoDB redo log is pre-allocated so the size doesn't change. Therefore InnoDB and fdatasync seem like a good fit and can reduce the commit penalty of MySQL. 

Postgres supports fdatasync for redo (redo == WAL in Postgres) via the wal_sync_method option. InnoDB has the  innodb_flush_method option, but fdatasync isn't one of the choices and the choices are fsync or nothing.

I did a few simple tests to understand fsync vs fdatasync latency on my home NUC servers. The servers use XFS on a Samsung 970 EVO NVMe SSD. First I have results from the pg_test_fsync binary that is part of Postgres and fsync latency was about 2X larger than fdatasync. The average latency was 1353 & 3625 usecs for fdatasync & fsync for one 8kb write and then 1506 & 3346 usecs for two 8kb writes. Full results are here.

Next I did a simple benchmark with mysqlslap and MySQL 8.0.21 with the binlog disabled and a commit per update statement using the upstream binary and then one changed to use fdatasync. The average latency per update statement was 4040 usecs & 6520 usecs for fdatasync & fsync. On my hardware, fsync is clearly slower and that isn't a surprise. I might run this on ec2 hosts using local storage and EBS, but that won't happen today.

I hope InnoDB is updated to support fdatasync as an option. MDEV-21382 is open in MariaDB to support such a change. Percona has perf results for fsync vs fdatasync on different devices along with some history on the use of fsync by InnoDB. In their tests the latency for fsync is larger and is expected. I am confused by some of the Percona docs as they mention fdatasync, but that is tech docs debt as old versions of MySQL used fdatasync as the option value but that meant "use fsync".

I created a MySQL feature request for this -- see 101326.


  1. The implementation of InnoDB in MariaDB Server 10.5 should use fdatasync() instead of fsync() whenever it is available at compilation time. Apart from the MDEV-21382 that you mentioned, there was a further change in MDEV-22177, which was included in the 10.5.3 release. We concluded that the only advantage of fsync() over fdatasync() should be that the file time stamps are updated, and InnoDB does not care about the time stamps at all, except maybe for the update_time that is being reported for InnoDB tables, but that is inaccurate anyway.

    Your note about the multiple fsync() needed due to two logs has been my pet hate for several years. We have the open ticket MDEV-18959 for that, presenting two conflicting solutions. The first one is to guarantee that the binlog is never behind the engine redo log, and to have recovery roll-forward any changes from binlog to InnoDB.

    A simpler-sounding idea has been implemented by Alibaba: use the InnoDB redo log as a kind of a ‘doublewrite buffer’ for the binlog, and avoid fsync() on the binlog files except when that buffer would be discarded due to an InnoDB log checkpoint. So, recovery would append any missing content to the binlog, recover the InnoDB data pages as of the latest durably written log sequence number (LSN), and finally roll forward the binlog to InnoDB tables, in case some recent transactions were committed in the binlog but not yet in the InnoDB undo log.

    Coincidentally, I just pushed a latency improvement to MariaDB 10.5 yesterday, to be included in the upcoming 10.5.7 release. The page flushing was basically rewritten in MDEV-23399 and MDEV-23855.

    1. Thank you for more details. I hope you consider using the insert benchmark when making big changes to writeback because that benchmark can be a great source of stress for a DBMS.

      I like the idea of avoiding 2 syncs on commit. Using the redo log sounds similar in spirit to what we did so many years ago to make replication state crash proof by persisting it in InnoDB.

      The downside to using InnoDB redo is that redo log rotation will happen faster leading to more writeback and higher write-amp. Perhaps using much larger redo logs helps with that. But when redo uses buffered IO, then you have to worry about redo logs wasting memory in the OS page cache.

  2. You are right. Also, the suggested buffering technique would be unapplicable if the InnoDB log is fairly small and we are using row-level logging to cover a massive DELETE operation. In such cases, we had better durably write only to the binlog before committing the transaction in InnoDB. Also in this case, it could be possible to sync only one of the files per commit.

    We have been thinking of using asynchronous unbuffered writes for the redo log in a future MariaDB release.

  3. The writes to redo could already be unbuffered (on Linux), if Innodb could figure out (on Linux) whether 512 byte-aligned writes is going to be fine with O_DIRECT, on a given file.