I have two recent posts on RocksDB benchmarks (here and here) that mention a possible regression in IO-bound workloads starting in version 8.6 when buffered IO is used, and one recent post that started to explain the problem. The root cause is a change to the code that does readahead for compaction, and the problem is worse when the value of the compaction_readahead_size option is larger than the value of max_sectors_kb for the underlying storage device(s). This gets more complex when RAID is used: some of my test servers use SW RAID 0 and I don't know whether the value for the underlying devices or for the SW RAID device takes precedence.
tl;dr
- With RocksDB 8.6+ you might need to set compaction_readahead_size so that it isn't larger than max_sectors_kb (one way to check is sketched below). I opened RocksDB issue 12038 for this.
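Here is a minimal sketch of one way to do that check, reading sysfs from Python. The paths /sys/block/<dev>/queue/max_sectors_kb and /sys/block/<md>/slaves are standard Linux sysfs, but the device name (md0), the 1MB value and the assumption of whole-device RAID members are examples for illustration. For a SW RAID device it reports the limit for both the md device and each member, since I don't know which one takes precedence.

```python
# Minimal sketch: compare a candidate compaction_readahead_size (bytes) to
# max_sectors_kb (KB) as reported by Linux sysfs. Device names and the 1MB
# value are examples, not taken from the benchmark setup.
from pathlib import Path

def max_sectors_kb(device: str) -> int:
    """Read max_sectors_kb for a block device, e.g. nvme0n1 or md0."""
    return int(Path(f"/sys/block/{device}/queue/max_sectors_kb").read_text())

def raid_members(device: str) -> list[str]:
    """For a SW RAID (md) device, list its member devices from sysfs."""
    slaves = Path(f"/sys/block/{device}/slaves")
    return sorted(p.name for p in slaves.iterdir()) if slaves.is_dir() else []

def check(device: str, compaction_readahead_size: int) -> None:
    # Report the limit for the device itself and, for SW RAID, for each member.
    # Assumes whole-device RAID members; partitioned members would need to be
    # mapped back to their parent disk before reading queue/max_sectors_kb.
    limits = {device: max_sectors_kb(device)}
    for member in raid_members(device):
        limits[member] = max_sectors_kb(member)
    for dev, kb in limits.items():
        ok = compaction_readahead_size <= kb * 1024
        print(f"{dev}: max_sectors_kb={kb} -> compaction_readahead_size "
              f"{'fits' if ok else 'is larger than this limit'}")

check("md0", 1 * 1024 * 1024)  # example: 1MB readahead on a SW RAID device
```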
I repeated the IO-bound benchmark using buffered IO in 3 setups (a db_bench sketch follows the list):
- default - this uses the default for compaction_readahead_size, which is 0 prior to RocksDB 8.7 and 2MB starting in RocksDB 8.7.
- crs.1MB - explicitly set compaction_readahead_size=1MB
- crs.512KB - explicitly set compaction_readahead_size=512KB
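For reference, the setups above map to db_bench flags roughly as in this sketch. It is not the script used for these results: the workload, the --db path and the buffered-IO flags are placeholders, and it assumes a db_bench build that accepts --compaction_readahead_size.

```python
# Sketch of the three setups as db_bench command lines; the workload and --db
# path are placeholders, not the benchmark scripts used for these results.
SETUPS = {
    "default": None,             # use the built-in default (0 before 8.7, 2MB in 8.7+)
    "crs.1MB": 1 * 1024 * 1024,  # compaction_readahead_size in bytes
    "crs.512KB": 512 * 1024,
}

def db_bench_cmd(setup: str) -> list[str]:
    cmd = [
        "./db_bench",
        "--benchmarks=overwrite",                      # placeholder workload
        "--use_existing_db=1",
        "--db=/data/db",                               # placeholder path
        "--use_direct_io_for_flush_and_compaction=0",  # buffered IO for compaction
        "--use_direct_reads=0",                        # buffered IO for user reads
    ]
    crs = SETUPS[setup]
    if crs is not None:
        cmd.append(f"--compaction_readahead_size={crs}")
    return cmd

for name in SETUPS:
    print(" ".join(db_bench_cmd(name)))
```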
The performance summaries from the benchmark scripts are here and the iostat summary is here.
- Throughput is lousy in 8.6.7 because the benchmark client (db_bench) hardwired the value for compaction_readahead_size to 0 rather than use the default of 2MB.
- Throughput is best with compaction_readahead_size=1MB and worst with compaction_readahead_size=512KB.
- The IO rate (read MB/s) is best with compaction_readahead_size=2MB, but that doesn't translate to better throughput for the application.
- The average read size from storage (rareq-sz) is best with compaction_readahead_size=1MB and worst with compaction_readahead_size=2MB.
- Note that better or worse here depends on context, and a big part of that context is the value of max_sectors_kb. So changing the default for compaction_readahead_size from 2MB to 1MB might be good in some cases but probably not in all of them.