Perhaps I agree with Dr. Stonebraker. This is my answer which might not be the correct answer. My reasons for O_DIRECT are performance, quality of service (QoS) and manageability and performance might get too much attention. I don't dislike Linux but the VM, buffered IO, readahead and page cache are there for all Linux use cases. They must be general purpose. Complex system software like a DBMS isn't general purpose and can do its own thing when needed. Also, I appreciate that kernel developers have done a lot to make Linux better for a DBMS. One of the perks at FB was easy access to many kernel developers.
Most of my web-scale MySQL/InnoDB experience is with O_DIRECT. While InnoDB can use buffered IO we always chose O_DIRECT. Eventually, RocksDB arrived and it only did buffered IO for a few years. Then O_DIRECT support was added and perhaps one day the web-scale MyRocks team will explain what they use.
I deal with readahead when running benchmarks and a common experience is using the wrong (too large) value and then repeating tests which means I spend more time and more SSD endurance thanks to buffered IO. I have many blog posts with performance results for readahead including at least one for MongoDB. Usually my goal was to find which small value is good enough. I learned that 0 is too small. Readahead can help scan-heavy workloads, but my focus is on OLTP where we avoided most scans except for logical backup.
I understand why buffered IO is used by some DBMS. Early in the product lifecycle it can be a placeholder until more features are added to make O_DIRECT performant. The benefits of the OS page cache include:
- Filesystem readahead can be used before the DBMS adds support for prefetching when doing scans. But filesystem readahead is a black box, might differ between filesystems, provides no metrics and will do the wrong thing for some workloads. InnoDB provides a prefetch feature which can help when O_DIRECT is used. I disabled it because OLTP. The Facebook MySQL team (thanks Nizam) added logical readahead to make logical backup faster and more efficient. Filesystem readahead is likely to struggle with index structure fragmentation, so it is best suited for heap-organized tables and will suffer with index scans.
- Doing writes to the OS page cache followed by fsync can be used before the DBMS adds support for async IO or background write threads. But Postgres suffered for so long from this approach because calling fsync with an unknown amount of dirty pages in the OS page cache can starve all other pending IO requests for many seconds. The situation is less dire today thanks to work by Jens Axboe to make writeback less annoying. There was much discussion in 2014 at a summit that included Postgres and Linux kernel developers. In addition to Linux improvements, features have been added to Postgres to reduce the impact from writeback storms -- read this to learn about spread checkpoints.
- For a DBMS that does compression it is easier to use the DBMS cache for uncompressed pages and the OS page cache for compressed pages. I am familiar with amazing work in InnoDB to manage both in the DBMS cache. We all agree the code is very complex. RocksDB also has an option to cache both in its block cache but I have little experience with the feature. It is hard to figure out the best way to divide the DBMS cache between compressed and uncompressed pages.
Performance advantages for O_DIRECT include:
- Does one memory copy to move data from storage to the DBMS while buffered needs two
- Avoids CPU and mutex contention overhead in the OS page cache
- Avoids wasting memory from double buffering between the DBMS cache and OS page cache
QoS advantages for O_DIRECT include:
- Filesystem readahead is frequently wrong and either wastes IO or favors the wrong user leading to worse IO response times for other users
- OS page cache will get trashed by other services sharing the host
- Writeback storms starve other IO requests. Writeback is usually a background task and can tolerate response time variance. Too much writeback makes user reads and log fsync slower and those operations don't want response time variance.
- Reduces stalls - this is a placeholder because my local expert has yet to explain this in public. But you will have a better time with Linux when moving less data through the OS page cache, especially with modern storage devices that can sustain many GB/sec of throughput. And when you aren't having a good time then you can fix the DBMS. The DBMS is involved whether or not it relies on the OS page cache so you always have to make it work.
- DBMS prefetch and writeback are documented, tunable and provide metrics. Such claims are less true for filesystem readahead and VM writeback. There is a lot of advice on the web and much disagreement especially on the topic of min_free_kbytes. Domas used to be my source on this but he doesn't blog enough about the Linux VM.