Saturday, March 13, 2021

Sequential IO and an LSM tree

I see statements that sequential IO is a benefit of using an LSM tree. That is a vague and truthy statement. It is correct that the writes done by compaction are sequential per-file. But I am interested in the IO from the perspective of the storage device. 

The truthier claim is that with an LSM there will be many streams of IO (read & write) that benefit from large IO requests. The reads and writes done by compaction are sequential per file, but there is much concurrency and the storage device will see many concurrent streams of IO.
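To make the device-level view concrete, here is a minimal sketch, not a model of any real IO scheduler. It assumes each file occupies its own contiguous region on the device and that the streams take turns in strict round-robin order. The function names and the file_span parameter are hypothetical and exist only for illustration.

```python
def interleave_round_robin(num_streams, requests_per_stream, request_size, file_span):
    """Return the (offset, size) requests the device sees when streams take turns.

    Assumption: stream s owns the region starting at s * file_span and
    issues its own requests at strictly increasing offsets.
    """
    device_requests = []
    for i in range(requests_per_stream):
        for s in range(num_streams):
            offset = s * file_span + i * request_size
            device_requests.append((offset, request_size))
    return device_requests


def sequential_fraction(requests):
    """Fraction of requests that start exactly where the previous one ended."""
    if len(requests) < 2:
        return 1.0
    hits = 0
    for (prev_off, prev_size), (off, _) in zip(requests, requests[1:]):
        if off == prev_off + prev_size:
            hits += 1
    return hits / (len(requests) - 1)


MB = 1 << 20
GB = 1 << 30

# One stream: every request follows the previous one on the device.
one_stream = sequential_fraction(interleave_round_robin(1, 8, MB, GB))

# Four streams: each is sequential per file, but no request is adjacent
# to the one the device handled just before it.
four_streams = sequential_fraction(interleave_round_robin(4, 8, MB, GB))

print(one_stream, four_streams)
```

Under these assumptions each stream is perfectly sequential within its file, yet the device never sees two adjacent requests in a row once there is more than one stream. That is why large requests matter more than "sequential" as a label: a larger request_size means fewer switches between streams per byte transferred.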

The IO patterns for an LSM with a busy workload are:

  • N compaction threads where each thread is reading from ~10 files, merging the results and writing the output to create new SSTs. Each thread has one file open at a time that is written sequentially. The reads are sequential per-file. At any point in time there can be 10*N read streams and N write streams. The reads benefit from large IO requests and prefetch. The writes benefit from async write-back. The writes might benefit from clever placement on the storage device (see multistream). The writes are likely to generate large requests to the storage device.
  • The WAL, if enabled, is another write stream. If fsync is done on commit then this stream gets a sequence of small writes.
  • User queries generate read requests. For OLTP these are mostly small (random) requests. But in some cases (logical backup, analytics) there can be scans.
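The arithmetic in the list above can be sketched as follows. This is a back-of-the-envelope count, not something RocksDB reports: the names StreamCounts and busy_lsm_streams are hypothetical, the default of 10 input files per compaction is the approximate figure from the text, and user query streams are left out because their number depends on the workload.

```python
from dataclasses import dataclass


@dataclass
class StreamCounts:
    compaction_reads: int
    compaction_writes: int
    wal_writes: int

    @property
    def total(self):
        # Background streams only; user reads are not counted here.
        return self.compaction_reads + self.compaction_writes + self.wal_writes


def busy_lsm_streams(compaction_threads, files_per_compaction=10, wal_enabled=True):
    """Estimate concurrent background IO streams for a busy LSM.

    Assumption: every compaction thread is active, reads from
    files_per_compaction input files, and writes one output file at a time.
    """
    return StreamCounts(
        compaction_reads=files_per_compaction * compaction_threads,
        compaction_writes=compaction_threads,
        wal_writes=1 if wal_enabled else 0,
    )


counts = busy_lsm_streams(compaction_threads=8)
print(counts, counts.total)
```

With 8 compaction threads this gives 80 read streams, 8 write streams and 1 WAL stream, so the storage device can see about 89 concurrent streams before any user queries arrive.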

Sunday, March 7, 2021

How to submit a PR for RocksDB

I submitted my first PR to RocksDB as an external contributor and have advice on the process.

  • Read the Contribution Guide
  • Run make format on your diffs. This depends on clang-format and you might need to install it.
  • Edit HISTORY.md if there are user-facing changes
  • Optionally run make check to confirm tests pass. Some tests failed for me because they use /dev/shm and needed more than the 8G that /dev/shm gets on my small servers, which have 16G of RAM.
  • Confirm that RocksDB LITE builds still work: LITE=1 make db_bench
  • Some of the CI tests are flaky now and have bogus failures (see here). To get CI tests to repeat I edit and push small changes to HISTORY.md. The team is fixing the intermittent test failures. Alas, one of the intermittent problems was from internal gcc failures on ARM -- something I can't reproduce on my x86 hardware. The good news is that after three attempts everything passes.
  • Google for advice on how to do this in git. I am not a git guru, but the advice I find is great.
