Monday, January 8, 2024

Explaining changes in RocksDB performance for IO-bound workloads

Two of my recent posts on RocksDB benchmarks (here and here) mention a possible regression in IO-bound workloads starting in version 8.6 when buffered IO is used, and another recent post started to explain the problem. The root cause is a change to the code that does readahead for compaction, and the problem is worse when the value of the compaction_readahead_size option is larger than the value of max_sectors_kb for the underlying storage device(s). This gets more complex when RAID is used. Some of my test servers use SW RAID 0 and I don't know whether the value for the underlying devices or for the SW RAID device takes precedence.


  • With RocksDB 8.6+ you might need to set compaction_readahead_size so that it isn't larger than max_sectors_kb. I opened RocksDB issue 12038 for this.
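
One way to check this on Linux is to read max_sectors_kb from sysfs and cap the readahead size you plan to use. This is a minimal sketch, not something from RocksDB: the function names are made up for illustration, and the sysfs_root parameter exists only so the code can be tried against a fake directory.

```python
from pathlib import Path


def max_sectors_bytes(device: str, sysfs_root: str = "/sys/block") -> int:
    """Return max_sectors_kb for a block device, converted to bytes.

    Reads <sysfs_root>/<device>/queue/max_sectors_kb, which Linux reports in KB.
    """
    path = Path(sysfs_root) / device / "queue" / "max_sectors_kb"
    return int(path.read_text().strip()) * 1024


def capped_readahead(requested_bytes: int, device: str,
                     sysfs_root: str = "/sys/block") -> int:
    """Cap a requested compaction_readahead_size at the device's limit.

    Hypothetical helper: it only applies the rule of thumb that the
    readahead size should not exceed max_sectors_kb.
    """
    return min(requested_bytes, max_sectors_bytes(device, sysfs_root))
```

For example, capped_readahead(2 * 1024 * 1024, "md2") would return the limit of the md2 device when that limit is below 2MB. With SW RAID it is still an open question whether to pass the RAID device or one of the underlying devices.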

The benchmark is described in a previous post. The test server has 40 cores, 80 HW threads, hyperthreads enabled, 256G of RAM and XFS with SW RAID 0 over 6 devices. The value of max_sectors_kb is 128 for the SW RAID device (md2) and 1280 for the underlying SSDs.

Tests were repeated for RocksDB versions 8.4.4, 8.5.4, 8.6.7, 8.7.3, 8.8.1, 8.9.2.

I repeated the IO-bound benchmark using buffered IO in 3 setups:
  • default - this uses the default for compaction_readahead_size which is 0 prior to RocksDB 8.7 and 2MB starting in RocksDB 8.7. 
  • crs.1MB - explicitly set compaction_readahead_size=1MB
  • crs.512KB - explicitly set compaction_readahead_size=512KB
Code for compaction readahead changed in both RocksDB 8.5 and 8.6. A side effect of this change is that using compaction_readahead_size=0 is bad for performance because it means there will be (almost) no readahead.
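
The three setups can be written down as db_bench arguments. This is a sketch that assumes the option is passed through db_bench's --compaction_readahead_size flag with a value in bytes. The SETUPS table and the readahead_args helper are names invented here, not part of the benchmark scripts.

```python
from typing import Dict, List, Optional

# Setup name -> readahead size in bytes. None means "do not pass the flag",
# so whatever default the db_bench binary for that version uses applies.
SETUPS: Dict[str, Optional[int]] = {
    "default": None,
    "crs.1MB": 1 * 1024 * 1024,
    "crs.512KB": 512 * 1024,
}


def readahead_args(setup: str) -> List[str]:
    """Return the extra db_bench arguments for one of the three setups."""
    size = SETUPS[setup]
    if size is None:
        return []
    return [f"--compaction_readahead_size={size}"]
```

Leaving the flag off for the default setup matters because the value then depends on the version being tested.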


Below there are three graphs. The first shows throughput, the second shows the average value for read MB/s per iostat and the third shows the average value for read request size (rareq-sz) per iostat. All of these are measured during the overwrite benchmark step which is write-only and suffers when compaction cannot keep up.
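
The per-device averages can be computed from saved iostat output along these lines. This is a rough sketch and not the script used for the benchmark: it assumes extended iostat output where each report starts with a header line beginning with "Device", it looks up columns by name, and the function name is made up.

```python
def average_iostat_column(text: str, device: str, column: str) -> float:
    """Average one named column (e.g. "rareq-sz") for one device.

    Column positions are taken from the most recent "Device" header line,
    so the result does not depend on the column order of a given iostat version.
    Every report in the text is included; nothing is skipped.
    """
    header = None
    values = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].rstrip(":") == "Device":
            header = fields
        elif header and fields[0] == device and column in header:
            values.append(float(fields[header.index(column)]))
    if not values:
        raise ValueError(f"no samples for {device} / {column}")
    return sum(values) / len(values)
```

Note that the first report printed by iostat covers the time since boot, so it may need to be dropped from the input before averaging.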

The performance summaries from the benchmark scripts are here and the iostat summary is here.

  • Throughput is lousy in 8.6.7 because the benchmark client (db_bench) hardwired the value for compaction_readahead_size to 0 rather than use the default of 2MB.
  • Throughput is best with compaction_readahead_size=1MB and worst with compaction_readahead_size=512KB
  • The IO rate (read MB/s) is best with compaction_readahead_size=2MB, but that doesn't translate to better throughput for the application.
  • The average read size from storage (rareq-sz) is best with compaction_readahead_size=1MB and worst with compaction_readahead_size=2MB
  • Note that better or worse here depends on context and a big part of the context is the value of max_sectors_kb. So changing the default for compaction_readahead_size from 2MB to 1MB might be good in some cases but probably not all cases.
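
To see why max_sectors_kb is part of the context, here is a back-of-the-envelope calculation. It assumes a very simple model in which one readahead request is split into pieces no larger than max_sectors_kb. It ignores RAID striping and kernel readahead, so it does not explain the measured results, and the function name is invented for this sketch.

```python
def requests_per_readahead(readahead_kb: int, max_sectors_kb: int) -> int:
    """Number of pieces needed if no piece may exceed max_sectors_kb.

    This is ceiling division written with integer arithmetic.
    """
    return -(-readahead_kb // max_sectors_kb)


if __name__ == "__main__":
    # 128 is max_sectors_kb for md2 and 1280 is the value for the SSDs.
    for limit_kb in (128, 1280):
        for readahead_kb in (2048, 1024, 512):
            pieces = requests_per_readahead(readahead_kb, limit_kb)
            print(f"max_sectors_kb={limit_kb} readahead={readahead_kb}KB pieces={pieces}")
```

Under this model a 2MB readahead becomes 16 pieces with the 128KB limit of md2 but only 2 pieces with the 1280KB limit of the SSDs, which is one reason the best setting can differ between servers.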
