The Unwritten Contract of Solid State Drives by Jun He, Sudarsun Kannan, Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau was published in EuroSys 2017. The paper is interesting and worth reading. I hope for more papers on this topic and recently shared a review on another paper from UW-Madison on this topic. This tries to be a brief review of the paper. I only review papers that I think are useful to me and others.
I didn't notice this paper until recently. While I wish I didn't miss interesting papers like that, it is hard for me to keep up with so many great papers getting published. Regardless, I assume that the RocksDB team is open to discussions with people doing research in this space. I am too and this post was half-serious.
The paper presents five rules that are frequently true for an SSD and explains that all rules aren't true for all SSDs. Following the rules often leads to better SSD performance.
The objective functions in the paper are throughput and write amplification. Quality of service (QoS) wasn't included and maximizing throughput can hurt QoS. Many applications want maximum throughput that respects QoS where QoS can be defined by p99 and worst-case response times. So I hope that a future paper includes QoS. The paper explains that it focuses on internal states of the SSD rather than end-to-end performance, but QoS can also be measured via internal states because that is where many stalls happen (read delayed because slower program or erase in progress).
There is also a difference between using all device throughput that a workload needs vs using all of the device throughput. Saturating device throughput isn't a goal. Getting as much throughput as needed for a workload is a goal. A device offers several things -- capacity, throughput and low-latency operations. It is hard to saturate both capacity and throughput while not hurting QoS.
The rules are:
- Request scale - issue large requests or concurrent small requests to get more throughput and benefit from internal parallelism of the SSD
- Locality - access with locality (temporal and/or spatial). Possible benefits include reduced translation cache misses, reduced data cache misses and reduced write-amp from flash GC
- Aligned sequentiality - write files sequentially using large write requests. For an FTL that can do hybrid block mapping this requires fewer entries in the translation table.
- Grouping by death time - do writes so that all data in an erase block is discarded at the same time.
- Uniform lifetime - structure data so that all writes have a similar lifetime. Note that this is stricter version of grouping by death time and unlikely to be followed by a DBMS except for workloads that are write-once and then read-only.
- Request scale
- Compaction reads with RocksDB are a (compressed) block at a time and the uncompressed block size is likely ~4k, ~8k or ~16k. The block sizes are approximate, usually < the target size and not aligned. The compaction_readahead_size option in RocksDB can be set to get large read requests otherwise you rely on filesystem readahead and it is risky to make read_ahead_kb large because that is per-device.
- Compaction writes with RocksDB are a page at a time. Each SST is written sequentially and fsync is called when the SST is full (usually ~64M). Async write-behind can be enabled via the bytes_per_sync to avoid a storm of writes on fsync and to get large write requests of size bytes_per_sync.
- WAL writes can use the wal_bytes_per_sync option
- For user reads, many concurrent users are the easy way to get concurrent small reads. I am less certain about options for doing prefetch or getting parallel IO requests for a user query. Back when I did small data at FB getting concurrent IO for one user request wasn't a priority because the focus was sharing a device across many users.
- Compaction reads & writes respect this as they read & write files sequentially.
- Whether user reads respect this depends on the workload.
- Aligned sequentiality - compaction writes respect this
- Grouping by death time
- I know this via the multistream effort and even collaborated on an evaluation. The results in my eval weren't significant so I should read papers by others to understand where it was a big deal.
- Grouping by space requires an API like multistream. With that it is easy to group for an LSM. The upper/smaller levels of the LSM tree have shorter lifetimes than the lower/larger levels.
- It is usually impossible to do this without an API. Because I don't know how many erase blocks are concurrently open. Nor do I know the size of an erase block. I risk using the wrong name as I am not an expert on this but I mean the logical erase block that is managed by the FTL not the physical erase block that is per NAND chip. The logical erase block is striped over many NAND chips for a few reasons, including to prevent the loss of data when a NAND chip fails. This structure is rarely explained by vendors.
- Uniform lifetime - I doubt any DBMS can follow this except for data that is write once.
- Observation 1 - Application log structure increases the scale of write size. This is expected with an LSM. I appreciate that this paper distinguishes between IO patterns for the application (LSM writes SST files sequentially) and for the device (concurrent compaction means that a device interleaves large write requests from multiple files).
- Observation 2 - The scale of read requests is low. It is hard to know whether this is a function of the system software (bottlenecks & missing features) or the workload (working set cached, not many concurrent users).
- Observation 3 - RocksDB doesn't consume enough device bandwidth. This is hard to appreciate without more details on the benchmark setup. Note that I don't care about saturating device throughput unless QoS is also a goal.
- Observation 4 - Frequent data barriers limit request scale. The obvious one is fsync/fdatasync although concurrency usually overcomes that. For reads, prefetching and readahead can overcome but this wasn't a priority for the workloads I cared about because they had many concurrent users. There are things that can be done, and might already be in RocksDB, to get more concurrent IO reads to a device when apps have low concurrency.
- Observation 5 - Linux buffered I/O limits request scale. Fortunately, RocksDB can also run with O_DIRECT. I wasn't clear on the claim that read_ahead_kb limits the size of a read. Perhaps this is true for reads smaller than read_ahead_kb. Perhaps I can ask the block IO expert. For compaction, the compaction_readahead_size option can get larger reads for compaction. For scans I am not sure if RocksDB has a way to get readahead (multi-page reads) when needed.
- Observation 8 - be wary of deferring discard. The variety of discard performance across devices is large and not well documented. I hope more research is done on this.
- Observation 9 - SSDs demand accurate and aggressive prefetching. I am wary of inaccurate and aggressive prefetching. An LSM can benefit from prefetching for compaction reads (RocksDB has an option for that). Prefetching for an LSM during user reads is interesting. For a point query prefetching the lower levels of the LSM tree can be wasted work when the data is found in the upper levels of the tree. Prefetching for a scan is easier because data from all levels must be read. However, the amount of readahead to use during a scan is more complicated because the LSM usually isn't told how large the scan will be (first 10 rows vs all rows). There might be options in RocksDB to get parallel reads on behalf of a single user query.
- Observation 17 - application log structuring does not reduce garbage collection. I am skeptical about this. Also, the paper shows that it doesn't prevent GC. It does not show that it doesn't reduce it. I have results from 2016 to show that flash GC write-amp from MyRocks was less than from InnoDB. Of course, that is specific to one device. Regardless I appreciate that the paper explains some of the issues including that a device interleaves large write requests from many files with concurrent compaction. I am wary of the misleading sequential writes message that is repeated in too many papers.