Friday, June 24, 2022

Fixing mmap performance for RocksDB

RocksDB inherited support for mmap from LevelDB. Performance was worse than expected because filesystem readahead fetched more data than needed, as I explained in a previous post. I am not a fan of the standard workaround, which is to tune kernel settings to reduce readahead, because that has an impact on everything running on that server. The DBMS knows more about its IO patterns and can use madvise to provide hints to the OS, just as RocksDB uses fadvise for POSIX IO.
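To illustrate the idea (this is a minimal sketch using standard POSIX calls, not the RocksDB code for issue 9931), the hint for an mmap'd file is madvise with MADV_RANDOM, and the analogous hint for buffered IO is posix_fadvise with POSIX_FADV_RANDOM:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

// Sketch: map a file read-only and tell the kernel the access pattern is
// random so it does not waste IO on readahead. The file path is hypothetical.
int main() {
  const char* path = "/tmp/example.sst";
  int fd = open(path, O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }

  struct stat st;
  if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

  void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
  if (base == MAP_FAILED) { perror("mmap"); return 1; }

  // Hint for mmap reads: point lookups are random, skip readahead.
  madvise(base, st.st_size, MADV_RANDOM);

  // The equivalent hint for POSIX (non-mmap) buffered reads.
  posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

  munmap(base, st.st_size);
  close(fd);
  return 0;
}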

Good news: issue 9931 has been fixed and the results are impressive.

Benchmark

I used db_bench with an IO-bound workload - the same as was used for my previous post. Two binaries were tested:

  • old - this binary was compiled at git hash ce419c0f and does not have the fix for issue 9931
  • fix - this binary was compiled at git hash 69a32ee and has the fix for issue 9931
Note that git hashes ce419c0f and 69a32ee are adjacent in the commit log.

The verify_checksums option was false for all tests. The CPU overhead would be much larger were it true because checksum verification would be done on each block access. Tests were repeated with cache_index_and_filter_blocks set to true and false. That didn't have a big impact on results.
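For reference, here is a sketch of how those settings map to the RocksDB C++ API (db_bench sets the equivalents via command-line flags; the DB path and key below are hypothetical):

#include <cassert>
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.allow_mmap_reads = true;  // use mmap instead of pread for reads

  rocksdb::BlockBasedTableOptions table_options;
  table_options.cache_index_and_filter_blocks = true;  // tested both true and false
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/dbbench", &db);
  assert(s.ok());

  rocksdb::ReadOptions read_options;
  read_options.verify_checksums = false;  // avoid checksum work on each block access

  std::string value;
  s = db->Get(read_options, "some_key", &value);  // NotFound is fine for this sketch

  delete db;
  return 0;
}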

Results

The graphs have results for these binary+config pairs:

  • cache0.old - cache_index_and_filter_blocks=false, does not have fix for issue 9931
  • cache0.fix - cache_index_and_filter_blocks=false, has fix for issue 9931
  • cache1.old - cache_index_and_filter_blocks=true, does not have fix for issue 9931
  • cache1.fix - cache_index_and_filter_blocks=true, has fix for issue 9931
The improvements from the fix are impressive for benchmark steps that do reads for user queries -- see the green and red bars. The average read request size (rareq-sz in iostat) is:
  • for readwhilewriting: 115KB without the fix, 4KB with the fix
  • for fwdrangewhilewriting: 79KB without the fix, 4KB with the fix

Tell me how you really feel about mmap + DBMS

It hasn't been great for me. Long ago I did some perf work with an mmap DBMS, and Linux 2.6 kernels suffered from severe mutex contention in VM code, so performance was lousy back then. But I didn't write this to condemn mmap, and IO-bound workloads where the read working set is much larger than memory might not be the best fit for mmap anyway.

For the results above, if you compare the improved mmap numbers with the POSIX/buffered IO numbers in my previous post, peak QPS for the IO-bound tests (everything but fillseq and overwrite) is ~100k/second with mmap vs ~250k/second with buffered IO.

From the vmstat results collected during the benchmark I see:
  • more mutex contention with mmap based on the cs column
  • more CPU overhead with mmap based on the us and sy columns

Legend:
* qps - throughput from the benchmark
* cs - context switches from the cs column in vmstat
* us - user CPU from the us column in vmstat
* sy - system CPU from the sy column in vmstat

Average values
IO      qps     cs      us      sy
mmap     91757  475279  15.0    7.0
bufio   248543  572470  13.8    7.0

Values per query (each column divided by QPS; for us and sy the result is then multiplied by 1M)
IO      qps     cs      us      sy
mmap    1       5.2     163     76
bufio   1       2.3      55     28
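To make that normalization concrete, here is a small sketch that derives the per-query numbers from the averages table above (values are copied from that table; rounding may differ slightly from the table shown):

#include <cstdio>

// Per-query values: cs is divided by QPS, us and sy are divided by QPS
// and then scaled by 1M.
int main() {
  struct Row { const char* io; double qps, cs, us, sy; };
  const Row rows[] = {
      {"mmap",  91757,  475279, 15.0, 7.0},
      {"bufio", 248543, 572470, 13.8, 7.0},
  };
  printf("%-6s %6s %8s %8s %8s\n", "IO", "qps", "cs", "us", "sy");
  for (const Row& r : rows) {
    printf("%-6s %6d %8.1f %8.1f %8.1f\n", r.io, 1,
           r.cs / r.qps, r.us / r.qps * 1e6, r.sy / r.qps * 1e6);
  }
  return 0;
}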
