I just read an interesting paper: Optimizing Databases by Learning Hidden Parameters of Solid State Drives by Aarati Kakaraparthy, Jignesh Patel, Kwanghyun Park and Brian Kroth. Reading the paper was a good use of my time. I hope there are more papers on this topic. This isn't a review of the paper but a list of questions and comments on the paper. The paper is well written and easy to read, so a review by me wouldn't add much.
The reason for the paper is that the performance profile of an SSD is complicated and complicated in different ways than the performance profile of a disk. It might be easy to predict throughput from a disk given a transfer rate, rotational and seek latency. But that gets a more complicated when using the outer vs inner tracks of the disk. And it gets even more complicated with concurrency and the impact of the IO scheduler.
Predictions for a disk are easy but complicated (that might sound odd) but predictions for an SSD are less easy and more complicated and many of the details aren't explained by the SSD vendors. The paper explains how to derive some of the design decisions via testing and then uses that to improve performance for MariaDB and SQLite.
If you care about performance an SSD is a black-box. SSD vendors might feel the same way about the workloads that run on SSDs. I wrote innosim (InnoDB IO simulator) a long time ago to share a workload that I care about with SSD vendors. It is fun to see results from it on the web.
The paper is about direct attach SSD. I wonder if there is an opportunity for interesting research on cloud SSD. With EBS requests <= 256KB count as one IO so you will get more throughput if you can figure out how to use larger requests.
Comments on the paper
It is possible that I missed a few things in the paper, and I am happy to be corrected.
The desirable write request size is determined by creating files using writes of size X KB (X <= 512) and then computing the read latency for reads of size 1MB per the pseudo-code for Experiment 1. The value of X that provides the minimum read latency is the desirable write request size.
- The desirable size is likely to be much larger than 16kb but if you are using an update-in-place B-Tree for OLTP then you are unlikely to do that as leaf pages will be small (<= 16kb). Write requests can be larger with an LSM (SST files are > 1MB) or a copy-on-write B-Tree.
- I am not sure about using min latency for 1MB reads as the objective function to determine the best write request size. It is unlikely that 1MB reads will be done from a B-Tree. With an LSM that will occur during compaction, although I am not sure that optimizing for compaction reads is a goal. Large reads are common with a heap-organized table (Postgres), especially when doing analytics. So I think more nuance is needed and that could be included in another paper.
- What is the impact from concurrent writes? For InnoDB either page writeback or doublewrite IO will be concurrent with binlog or redo IO. For MyRocks compaction writes will be concurrent with binlog or WAL IO. Assuming it were possible to use the desirable write request size, how is this impacted by concurrent writes because requests can be interleaved by the block IO layer and by the SSD. Note that if the SSD erase block size is <= 64MB then it might be nice to let an LSM not share erase blocks between SST files as not sharing means there is nothing to copy-out of the erase block during GC (all or none of the erase block is in use). I probably should re-read this document once a year as Jens Axboe does a great job explaining what happens in the block IO layer for large requests.
- What is the impact from GC? Writes done today can be relocated tomorrow by flash GC.
IO patterns and optimization goals vary by file structure. For MySQL with InnoDB the important IO is listed below. Given that reading redo and binlogs from storage aren't (or shouldn't be) the common case the desirable write request size for them is one that minimizes write-amp:
- InnoDB database files - writes of size 16 KB (sometimes 8KB), uses either buffered IO or O_DIRECT, single-page reads and writes are frequent
- InnoDB undo - I will let someone with more expertise summarize this
- InnoDB redo - writes of size 512*x, frequent fsync, buffered IO, file is pre-allocated, rarely read. We suffered from redo read IO back in the day for InnoDB redo when it didn't stay in cache (large redo isn't free). The problem was that writing the first 512 bytes of a file system page not in cache turns into read-modify-write. I don't know the status of this today.
- InnoDB doublewrite buffer - large writes, probably a small multiple of 512 KB to the same location in the system tablespace, uses either buffered IO or O_DIRECT. This might have changed recently in MySQL 8.
- binlog - small writes of any size, frequent fsync, file grows with each write, reads are usually done from the OS page cache, not from storage