Thursday, October 8, 2020

Comments on: Optimizing Databases by Learning Hidden Parameters of Solid State Drives

I just read an interesting paper: Optimizing Databases by Learning Hidden Parameters of Solid State Drives by Aarati Kakaraparthy, Jignesh Patel, Kwanghyun Park and Brian Kroth. Reading the paper was a good use of my time. I hope there are more papers on this topic. This isn't a review of the paper, just a list of questions and comments. The paper is well written and easy to read, so a review by me wouldn't add much.

Storage Performance

The reason for the paper is that the performance profile of an SSD is complicated, and complicated in different ways than the performance profile of a disk. It might be easy to predict throughput from a disk given a transfer rate, rotational latency and seek latency. But that gets more complicated when using the outer vs inner tracks of the disk. And it gets even more complicated with concurrency and the impact of the IO scheduler.
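
To make that concrete, here is a back-of-the-envelope calculation in Python for random reads from a disk. The seek time, RPM and transfer rate are typical made-up numbers, not measurements from a real drive.

  # Back-of-the-envelope model for random disk reads: throughput follows from
  # seek time, rotational latency and transfer rate. Made-up typical values.
  seek_ms = 4.0                        # average seek time
  rotational_ms = 0.5 * 60000 / 7200   # half a rotation at 7200 RPM, ~4.17 ms
  transfer_mb_per_sec = 200.0
  io_kb = 16
  transfer_ms = io_kb / 1024.0 / transfer_mb_per_sec * 1000
  iops = 1000 / (seek_ms + rotational_ms + transfer_ms)
  print("~%.0f random %dKB reads/second, ~%.1f MB/s" % (iops, io_kb, iops * io_kb / 1024))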

Predictions for a disk are easy but complicated (that might sound odd), while predictions for an SSD are less easy, more complicated, and dependent on details that SSD vendors don't explain. The paper explains how to derive some of those design decisions via testing and then uses the results to improve performance for MariaDB and SQLite.

If you care about performance, an SSD is a black box. SSD vendors might feel the same way about the workloads that run on SSDs. I wrote innosim (an InnoDB IO simulator) a long time ago to share a workload that I care about with SSD vendors. It is fun to see results from it on the web.

The paper is about direct-attached SSDs. I wonder if there is an opportunity for interesting research on cloud SSDs. With EBS, requests <= 256KB count as one IO, so you will get more throughput if you can figure out how to use larger requests.
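
A quick example of why request size matters when IO is metered that way. The IOPS and bandwidth caps below are made-up numbers for illustration, not the limits of a real EBS volume type.

  # If every request <= 256KB counts as one IO, then throughput at the IOPS cap
  # grows with the request size until the bandwidth cap is reached.
  IOPS_CAP = 16000          # hypothetical per-volume IOPS limit
  BANDWIDTH_CAP_MB = 1000   # hypothetical per-volume MB/s limit

  for req_kb in (4, 16, 64, 256):
      mb_per_sec = min(IOPS_CAP * req_kb / 1024.0, BANDWIDTH_CAP_MB)
      print("%3dKB requests -> ~%.0f MB/s" % (req_kb, mb_per_sec))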

Comments on the paper

It is possible that I missed a few things in the paper, and I am happy to be corrected.

The desirable write request size is determined by creating files using writes of size X KB (X <= 512) and then computing the read latency for reads of size 1MB per the pseudo-code for Experiment 1. The value of X that provides the minimum read latency is the desirable write request size.
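
The paper has pseudo-code for this. What follows is my rough sketch of the procedure in Python, not the paper's code: the file size, the set of write sizes and the use of buffered IO are my assumptions, and a real measurement would need to bypass or drop the page cache (O_DIRECT or similar), which I omit here.

  import os, random, time

  def create_file(path, write_kb, total_mb=1024):
      # Create a file using writes of size write_kb KB.
      buf = os.urandom(write_kb * 1024)
      with open(path, "wb") as f:
          for _ in range((total_mb * 1024) // write_kb):
              f.write(buf)
          f.flush()
          os.fsync(f.fileno())

  def mean_read_latency_1mb(path, samples=100):
      # Mean latency of 1MB reads at random 1MB-aligned offsets.
      size = os.path.getsize(path)
      fd = os.open(path, os.O_RDONLY)
      try:
          total = 0.0
          for _ in range(samples):
              off = random.randrange(size // (1 << 20)) * (1 << 20)
              start = time.perf_counter()
              os.pread(fd, 1 << 20, off)
              total += time.perf_counter() - start
          return total / samples
      finally:
          os.close(fd)

  # The write size that minimizes 1MB read latency is the desirable one.
  results = {}
  for write_kb in (4, 16, 64, 128, 256, 512):
      path = "/data/test_%dkb" % write_kb   # hypothetical test directory
      create_file(path, write_kb)
      results[write_kb] = mean_read_latency_1mb(path)
  print("desirable write request size: %dKB" % min(results, key=results.get))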

  • The desirable size is likely to be much larger than 16KB, but if you are using an update-in-place B-Tree for OLTP then you are unlikely to write at that size because leaf pages are small (<= 16KB). Write requests can be larger with an LSM (SST files are > 1MB) or a copy-on-write B-Tree.
  • I am not sure about using min latency for 1MB reads as the objective function to determine the best write request size. It is unlikely that 1MB reads will be done from a B-Tree. With an LSM that will occur during compaction, although I am not sure that optimizing for compaction reads is a goal. Large reads are common with a heap-organized table (Postgres), especially when doing analytics. So I think more nuance is needed and that could be included in another paper.
  • What is the impact from concurrent writes? For InnoDB, either page writeback or doublewrite IO will be concurrent with binlog or redo IO. For MyRocks, compaction writes will be concurrent with binlog or WAL IO. Assuming it were possible to use the desirable write request size, how is that affected by concurrent writes, given that requests can be interleaved by the block IO layer and by the SSD? Note that if the SSD erase block size is <= 64MB then it might be nice to let an LSM not share erase blocks between SST files, as not sharing means there is nothing to copy out of the erase block during GC (all or none of the erase block is in use); a toy example of that arithmetic follows this list. I probably should re-read this document once a year as Jens Axboe does a great job explaining what happens in the block IO layer for large requests.
  • What is the impact from GC? Writes done today can be relocated tomorrow by flash GC.
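
To make the copy-out point concrete, here is a toy example. It pretends the logical file layout maps directly onto erase blocks, which a real FTL does not guarantee, and the sizes and offsets are made up.

  # Toy model: when a file is deleted, any erase block it shares with live data
  # from other files must have that live data copied out before the block can
  # be erased. An erase-block-aligned SST file has nothing to copy out.
  ERASE_BLOCK = 64 * 2**20      # assumed 64MB erase block
  SST_SIZE = 256 * 2**20        # assumed 256MB SST file

  def copy_out_bytes(file_offset, file_size, erase_block=ERASE_BLOCK):
      # Bytes of neighboring live data in the first and last erase blocks,
      # assuming those neighbors are fully live.
      start_slack = file_offset % erase_block
      end_slack = (-(file_offset + file_size)) % erase_block
      return start_slack + end_slack

  print(copy_out_bytes(0, SST_SIZE))             # aligned: 0 bytes to copy
  print(copy_out_bytes(10 * 2**20, SST_SIZE))    # unaligned: 64MB to copy
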
I hope more is written to explain the difference between the desirable write request size and the stripe size; for now I will try to read other papers to better understand stripe size. From Table 1 they are equal for 2 of the 4 devices tested. They aren't equal for SSD-S, and I think that was a judgement call based on the graphs in Figure 3.

The paper didn't try to discover the desirable concurrency level (number of channels) as that has been addressed by previous work. But I hope they include that in future work.

IO patterns and optimization goals vary by file structure. For MySQL with InnoDB the important IO is listed below. Given that reads of redo and binlogs from storage aren't (or shouldn't be) the common case, the desirable write request size for them is one that minimizes write-amp:
  • InnoDB database files - writes of size 16 KB (sometimes 8KB), uses either buffered IO or O_DIRECT, single-page reads and writes are frequent
  • InnoDB undo - I will let someone with more expertise summarize this
  • InnoDB redo - writes of size 512*x, frequent fsync, buffered IO, file is pre-allocated, rarely read (a sketch of this write pattern follows the list). We suffered from redo read IO back in the day when redo didn't stay in cache (large redo isn't free). The problem was that writing the first 512 bytes of a file system page that isn't in cache turns into read-modify-write. I don't know the status of this today.
  • InnoDB doublewrite buffer - large writes, probably a small multiple of 512 KB to the same location in the system tablespace, uses either buffered IO or O_DIRECT. This might have changed recently in MySQL 8.
  • binlog - small writes of any size, frequent fsync, file grows with each write, reads are usually done from the OS page cache, not from storage
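
To show what the redo pattern looks like at the syscall level, here is a minimal sketch: a pre-allocated file, writes sized as a multiple of 512 bytes and an fsync after each write. This is an illustration of the IO pattern, not InnoDB code, and the path and sizes are made up.

  import os

  LOG_PATH = "/data/redo_sketch"   # hypothetical path
  LOG_SIZE = 64 * 2**20            # pre-allocate 64MB
  WRITE_SIZE = 512 * 8             # 4KB, a multiple of 512

  fd = os.open(LOG_PATH, os.O_CREAT | os.O_RDWR, 0o644)
  os.posix_fallocate(fd, 0, LOG_SIZE)   # pre-allocate so writes don't extend the file
  offset = 0
  record = b"x" * WRITE_SIZE
  for _ in range(1000):
      os.pwrite(fd, record, offset)     # buffered write at the log head
      os.fsync(fd)                      # a commit needs the write to be durable
      offset = (offset + WRITE_SIZE) % LOG_SIZE   # redo is written circularly
  os.close(fd)
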
TRIM is another black box and a topic for future work. TRIM performance varies across devices. I was spoiled by early Fusion IO devices where TRIM was fast. Modern devices have differing capacities for the number of files and number of MB that can be trimmed per second. There will be stalls when those rates are exceeded. Alas, these capacities are not explained. As a workaround, RocksDB has an option to rate limit file deletion. Domas Mituzas has written about and shared code to manage file deletion done by value-added processes that run on DBMS HW to keep them from hurting DBMS QoS.
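
Here is a sketch of that workaround in the spirit of the RocksDB option and Domas' tooling: shrink a large file in chunks and pause between truncates so the device isn't asked to trim everything at once. The rate and chunk size are made-up knobs, and whether a truncate actually triggers TRIM depends on the file system and mount options.

  import os, time

  def slow_unlink(path, mb_per_sec=128, chunk_mb=64):
      # Shrink the file chunk by chunk, pausing so that at most mb_per_sec MB
      # are given back to the device per second, then remove it.
      size = os.path.getsize(path)
      chunk = chunk_mb * 2**20
      with open(path, "r+b") as f:
          while size > 0:
              size = max(0, size - chunk)
              f.truncate(size)
              f.flush()
              os.fsync(f.fileno())
              time.sleep(float(chunk_mb) / mb_per_sec)
      os.unlink(path)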

Comments:

  1. Hi Mark,

    Thanks for taking a look at our work, I'm glad that you found it useful! I've responded to your comments in my blog post: https://dbdrifter.blogspot.com/2020/10/response-to-comments-on-our-paper.html. I'd be happy to discuss more if you are interested.

    Best,
    RT

    Replies
    1. Thank you for the reply, and even more thanks for writing an interesting paper. I will reach out to you soon.

