Monday, April 7, 2014

Notes on the storage stack

If you want high performance and quality of service from a DBMS then you need the same from the OS. The MySQL/Postgres/MongoDB crowd doesn't always speak with the Linux crowd. On the bright side, there is a good collection of experts from the Linux side of things at my employer and we have begun speaking. There were several long threads on the PG hackers list about PG+Linux and this led to a meeting at the LSFMM summit. I am very happy these groups met. We have a lot to learn from each other. DBMS people can explain our IO patterns and get motivated to write DBMS workload simulators (like innosim, innotsim and ldbsim) to make it easy for others to test our workloads. Linux people can explain the details that make storage performance easier to understand, especially things like the per-inode mutex (and this too), optimizations gone bad in sync_file_range, and IO scheduler behavior.
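
To illustrate the kind of IO pattern a DBMS presents to the storage stack, here is a minimal C sketch in the spirit of innosim and ldbsim, but it is only a toy, not those tools. It issues random page reads from a data file and sequential fsync'd appends to a log file, which is roughly the OLTP mix. The file names, sizes and op count are made up for illustration.

/*
 * Toy DBMS workload sketch: random 16KB page reads plus sequential
 * 4KB log appends followed by fsync. Assumes "datafile" already exists.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define PAGE_SIZE   16384      /* InnoDB-style page */
#define LOG_WRITE   4096
#define DATA_PAGES  65536      /* 16KB * 65536 = 1GB data file */
#define OPS         10000

int main(void)
{
    int data_fd = open("datafile", O_RDONLY);
    int log_fd  = open("logfile", O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (data_fd < 0 || log_fd < 0) {
        perror("open");
        return 1;
    }

    char *page = malloc(PAGE_SIZE);
    char *log_rec = calloc(1, LOG_WRITE);

    for (int i = 0; i < OPS; i++) {
        /* random page read: the latency-sensitive part of the workload */
        off_t off = (off_t)(rand() % DATA_PAGES) * PAGE_SIZE;
        if (pread(data_fd, page, PAGE_SIZE, off) != PAGE_SIZE)
            perror("pread");

        /* sequential log write followed by fsync: also latency-sensitive */
        if (write(log_fd, log_rec, LOG_WRITE) != LOG_WRITE)
            perror("write");
        if (fsync(log_fd) != 0)
            perror("fsync");
    }

    free(page);
    free(log_rec);
    close(data_fd);
    close(log_fd);
    return 0;
}
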

PG and Mongo use buffered IO for database files, and Mongo takes that one step further by using mmap. Both have redo logs: after a change is durable in the redo log file it can be applied to the pages of the database file in the OS filesystem cache. Some time after applying those changes PG will call fsync and Mongo will call msync. This has the potential to schedule a huge number of page writeback requests to the storage system in a short period of time. Linux does not optimize for that workload, and other IO requests like page reads and log writes will suffer during a writeback flood. See posts by Domas, a PG expert, and the LSFMM summit summary for more information. After reading the LSFMM summary I suspect the DBMS side must solve this problem, at least for the next few years. PG has made this better over the years with changes to avoid the flood but it isn't a solved problem. MongoDB has yet to start on the problem of avoiding the flood. I saw advice for one workload where the workaround was to use syncdelay=1, which means that all dirty pages are written back every second. That is a great way to wear out a flash device prematurely. The basic issue is that page reads and log writes need to have priority over dirty page writeback. This is much less of an issue for InnoDB when direct IO is used, although there are a few corner cases.
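
One way the DBMS side can avoid the flood is to spread writeback out rather than handing the kernel every dirty page at once. The C sketch below contrasts the two approaches. It is a simplified illustration, not what PG, MongoDB or InnoDB actually do; the function names, batch size and sleep interval are assumptions made for the example.

/*
 * flush_with_fsync() is what a buffered-IO engine effectively does today:
 * dirty many pages, then one fsync turns them all into writeback at once
 * and starves page reads and log writes. flush_incrementally() pushes
 * small ranges with sync_file_range() and pauses between them so other
 * IO gets a chance to run.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define PAGE_SIZE  8192
#define BATCH      128         /* pages pushed per sync_file_range call */

/* one big flood: every dirty page is queued for writeback at once */
void flush_with_fsync(int fd)
{
    fsync(fd);
}

/* spread writeback: start IO on a small range, then leave room for
 * page reads and log writes before the next range */
void flush_incrementally(int fd, off_t file_size)
{
    off_t chunk = (off_t)BATCH * PAGE_SIZE;

    for (off_t off = 0; off < file_size; off += chunk) {
        sync_file_range(fd, off, chunk, SYNC_FILE_RANGE_WRITE);
        usleep(10000);
    }
    /* a final fsync is still needed for durability (metadata, ordering) */
    fsync(fd);
}

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;
    int fd = open(argv[1], O_WRONLY);
    if (fd < 0)
        return 1;
    off_t size = lseek(fd, 0, SEEK_END);
    flush_incrementally(fd, size);
    close(fd);
    return 0;
}
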

This post has several links to LWN.net, which is a great source of technical information on Linux. I am a subscriber and hope that a few more people consider subscribing. There were many more relevant posts in last week's issue of LWN.net:

  • Facebook and the kernel - several of our DBMS solutions would benefit from giving priority to page reads and log writes over dirty page writeback.
  • Memory mapping locking - mutexes used by mmap were a serious source of contention on the older 2.6 kernels I previously used for performance testing.
  • Persistent memory - this is another way to make doublewrite buffer writes cheap. Write into the persistent memory buffer and assume it is only flushed to storage on unplanned shutdown.
  • Shingled magnetic recording drives - to increase capacity these disks only support sequential writes to most of the data (a small fraction can be allocated for update-in-place). There are several ways to hide that. Can an update-in-place b-tree be competitive on them?
  • NUMA - more changes are planned, more fun will be had with perf debugging (see Domas, jcole, jcole).
  • Huge pages - if you run PG, TokuDB (and TokuMX?) then disable them (transparent huge pages); see the sketch after this list.
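
For the huge pages item above, here is a sketch of disabling transparent huge pages from a small setup program. It assumes the usual sysfs paths and root privileges; older RHEL kernels use a redhat_transparent_hugepage directory instead, and setting transparent_hugepage=never on the kernel command line is another option.

/*
 * Disable transparent huge pages by writing "never" to the sysfs controls.
 * Must run as root. Paths may differ on older or vendor kernels.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_setting(const char *path, const char *value)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror(path);
        return -1;
    }
    ssize_t n = write(fd, value, strlen(value));
    close(fd);
    return n < 0 ? -1 : 0;
}

int main(void)
{
    /* stop new transparent huge page allocations and background defrag */
    write_setting("/sys/kernel/mm/transparent_hugepage/enabled", "never");
    write_setting("/sys/kernel/mm/transparent_hugepage/defrag", "never");
    return 0;
}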
