Saturday, March 13, 2021

Sequential IO and an LSM tree

I see statements that sequential IO is a benefit of using an LSM tree. That is a vague and truthy statement. It is correct that the writes done by compaction are sequential per-file. But I am interested in the IO from the perspective of the storage device. 

The truthier claim is that with an LSM there will be many streams of IO (read & write) that benefit from large IO requests. The reads and writes done by compaction are sequential per file, but there is much concurrency and the storage device will see many concurrent streams of IO.

The IO patterns for an LSM with a busy workload are:

  • N compaction threads where each thread is reading from ~10 files, merging the results and writing the output to create new SSTs. Each thread has one file open at a time that is written sequentially. The reads are sequential per-file. At any point in time there can be 10*N read streams and N write streams. The reads benefit from large IO requests and prefetch. The writes benefit from async write-back. The writes might benefit from clever placement on the storage device (see multistream). The writes are likely to generate large requests to the storage device.
  • The WAL, if enabled, is another write stream. If fsync is done on commit then stream gets a sequence of small writes.
  • User queries generate read requests. For OLTP these are mostly small (random) requests. But in some cases (logical backup, analytics) there can be scans.

1 comment:

  1. I think you want to talk to somebody at Teledyne/Oakgate, maybe Fred Tzeng, which greetings from Kris and Peter at Booking. And then make a painting, like I did for MySQL in https://blog.koehntopp.info/2021/02/25/mysql-from-below.html

    That's probably quite enlightening.

    ReplyDelete

How efficient is RocksDB for IO-bound, point-query workloads?

How efficient is RocksDB for workloads that are IO-bound and read-only? One way to answer this is to measure the CPU overhead from RocksDB a...