PG and Mongo use buffered IO for database files, and Mongo takes that one step further by using mmap. Both have redo logs, and once a change is on disk in the redo log file it can be applied to the pages of the database file in the OS filesystem cache. Some time after applying those changes PG calls fsync and Mongo calls msync. This has the potential to schedule a huge number of page writeback requests to the storage system in a short period of time. Linux does not optimize for that workload, and other IO requests like page reads and log writes suffer during a writeback flood. See posts by Domas and by a PG expert, and the LSFMM summit summary, for more information. After reading the LSFMM summary I suspect the DBMS side must solve this problem, at least for the next few years. PG has improved over the years with changes to avoid the flood, but it isn't a solved problem. MongoDB has yet to start on the problem of avoiding the flood. I saw advice for one workload where the workaround was to use syncdelay=1, which means that all dirty pages are written back every second. That is a great way to wear out a flash device prematurely. The basic issue is that page reads and log writes need to have priority over dirty page writeback. This is much less of an issue for InnoDB when direct IO is used, although there are a few corner cases.
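To make the idea of avoiding the flood concrete, here is a minimal sketch of pacing writeback from the DBMS side: dirty pages are written in small batches with an fsync after each batch, so the amount of writeback queued at any moment stays bounded. This is not how PG or MongoDB implement it. The function name `paced_flush`, the `PAGE_SIZE` value, the batch size and the pause are all assumptions chosen for illustration.

```python
import os
import time

PAGE_SIZE = 4096  # assumed page size for this sketch


def paced_flush(fd, dirty_pages, batch_pages=64, pause_s=0.01, sleep=time.sleep):
    """Write dirty pages in small batches, calling fsync after each batch.

    dirty_pages maps page number -> bytes of length PAGE_SIZE.
    Returns the number of fsync calls issued.
    """
    syncs = 0
    page_numbers = sorted(dirty_pages)  # ascending offsets keep writes mostly sequential
    for start in range(0, len(page_numbers), batch_pages):
        batch = page_numbers[start:start + batch_pages]
        for page_no in batch:
            os.pwrite(fd, dirty_pages[page_no], page_no * PAGE_SIZE)
        os.fsync(fd)  # bound how much writeback is queued at once
        syncs += 1
        if start + batch_pages < len(page_numbers):
            sleep(pause_s)  # leave room for page reads and log writes
    return syncs
```

The tradeoff is more fsync calls in exchange for smaller bursts. Picking the batch size and pause is workload dependent, and a real implementation would adapt them to the dirty page rate rather than hard-code them.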
This post has several links to LWN.net, which is a great source of technical information on Linux. I am a subscriber and hope that a few more people consider subscribing. There were many more interesting posts in last week's issue of LWN.net:
- Facebook and the kernel - several of our DBMS solutions would benefit from giving priorities to page reads and log writes versus dirty page writeback.
- Memory mapping locking - mutexes used by mmap were a serious source of contention on the older 2.6 kernels I previously used for performance testing.
- Persistent memory - this is another way to make doublewrite buffer writes cheap. Write into the persistent memory buffer assuming it only writes to storage on unplanned shutdown.
- Shingled magnetic recording drives - to increase capacity these disks only support sequential writes to most of the data (a small fraction can be allocated for update-in-place). There are several ways to hide that. Can an update-in-place b-tree be competitive on it?
- NUMA - more changes are planned, more fun will be had with perf debugging (see Domas, jcole, jcole).
- Huge pages - if you run PG or TokuDB (and TokuMX?) then disable them.
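
The first item in the list, giving page reads and log writes priority over dirty page writeback, can be illustrated with a toy user-space request queue. This is only a sketch of the ordering and not how the Linux block layer or any DBMS schedules IO. The class `IoQueue`, the request kinds and the `PRIORITY` table are hypothetical names made up for this example.

```python
import heapq
import itertools

# Lower number = served first. Page reads and log writes share the top class.
PRIORITY = {"page_read": 0, "log_write": 0, "writeback": 1}


class IoQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a class

    def submit(self, kind, request):
        """Queue a request of the given kind (a key of PRIORITY)."""
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._seq), kind, request))

    def next_request(self):
        """Return the (kind, request) to issue next, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, kind, request = heapq.heappop(self._heap)
        return kind, request
```

With strict priorities like this, writeback can starve under a steady stream of reads and log writes, so a real scheduler would also need a deadline or a minimum share for writeback.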