Friday, October 25, 2019

Nitpicking papers in the LSM space

  • LSM compaction does merge, not merge sort. Compaction inputs are already sorted, so merge sort, which is O(N log N) for N records, isn't needed. A naive merge is O(N log K) for N records and K input streams, and in some cases RocksDB does better than the naive merge -- see the reference to the optimized binary heap. Compaction merges for RocksDB with leveled compaction have two forms -- L0 to L1 and Ln to Ln+1. For L0 to L1 there might be ~4 streams from the L0 and 1 stream from the L1, and size(L1) is usually >= size(L0). For Ln to Ln+1 there is one stream from each level, and size(Ln+1) is approximately 10 * size(Ln).
  • LSMs do large random writes, not sequential writes. Writes are sequential from the perspective of a single file, but a busy LSM does compaction and memtable flush concurrently. So there are concurrent reads and writes from the perspective of the device -- each stream of reads and writes is sequential on its own, but they are interleaved at the device level.
  • The statement that write amplification is approximately 10 * num-levels is some of: the worst case, a hypothesis, an unvalidated performance model. I am a frequent user of unvalidated performance models, but they have their limits. There is an amazing paper that measures write-amp in practice and then provides a brilliant explanation. I wish there were more reporting of write-amp from production workloads. LevelDB overstates write-amp because its L0 and L1 are small (~10MB), which forces it to use too many levels.
  • It is great to read about new index structures that give up range queries in pursuit of less write-amp. But if a workload doesn't need range queries then maybe an LSM isn't the best choice. The other reason to use an LSM with leveled compaction is low space-amp, so I hope that some alternatives consider ways to keep space-amp low while also getting low write-amp.
  • Consider CPU and IO overhead. That is, there are both IO and CPU components to read and write amplification. IO gets a lot of attention; CPU needs more.
  • Benchmarking is hard. See here for an example of a horrible mistake I made that led to many bogus reports of lousy MySQL performance regressions. I explained some of the problems with LevelDB and RocksDB benchmarks here and here. LevelDB doesn't target high performance, RocksDB is hard to configure, papers rarely provide enough details to support reproduction, and even if they did, nobody is volunteering their time to reproduce results. I assume that benchmark result accuracy is inversely related to the number of systems tested for a given paper -- expertise is in short supply. My offer of advice still stands. I have read about possible programs that would have grad students volunteer to reproduce results. While I think that would be a great education, I wonder if it would advance their academic careers.
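The merge vs merge-sort point in the first bullet can be sketched as a K-way merge over already-sorted runs: each of the N output records costs O(log K) heap work, for O(N log K) total, with no O(N log N) sort anywhere. This is a minimal sketch in Python, not RocksDB's actual implementation (which uses a merging iterator over SST files and, as noted above, an optimized binary heap):

```python
import heapq

def k_way_merge(runs):
    """Merge K already-sorted input streams (like compaction inputs)
    into one sorted output. Each heap pop/push is O(log K), so merging
    N total records costs O(N log K) -- no sort of the inputs needed."""
    iters = [iter(r) for r in runs]
    heap = []
    # Seed the heap with the first record of each non-empty stream.
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, i))
    out = []
    while heap:
        val, i = heapq.heappop(heap)
        out.append(val)
        # Refill from the stream that just produced the smallest record.
        nxt = next(iters[i], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))
    return out
```

Python's standard library exposes the same idea as heapq.merge; for an L0-to-L1 compaction, K would be the ~4 L0 streams plus 1 L1 stream.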
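The write-amp bullet's point about level count can be made concrete with the same unvalidated model it criticizes: with a per-level fanout of 10, the worst-case model predicts ~fanout writes per level, so more levels means more predicted write-amp, and a tiny L1 (LevelDB's ~10MB) forces more levels for the same database size. The function names and the example sizes below are my own illustrative assumptions, not measurements:

```python
import math

def num_levels(db_mb, l1_mb, fanout=10):
    """Levels needed when each level is `fanout` times larger than the
    previous one, starting from an L1 of size l1_mb. A small L1 forces
    more levels for the same total database size."""
    return 1 + max(0, math.ceil(math.log(db_mb / l1_mb, fanout)))

def model_write_amp(db_mb, l1_mb, fanout=10):
    """Unvalidated worst-case model: ~fanout rewrites per level.
    This is the '10 * num-levels' hypothesis, not a measured value."""
    return fanout * num_levels(db_mb, l1_mb, fanout)

# A 1TB database with a 1GB L1 needs 5 levels (model: write-amp ~50),
# while the same database with a 10MB L1 needs 7 (model: write-amp ~70).
```

This is why the bullet says LevelDB overstates write-amp: the model's prediction grows with the level count, and measured production write-amp (per the paper referenced above) can differ from the model.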
