RocksDB inherited support for mmap from LevelDB. Performance was worse than expected because filesystem readahead fetched more data than needed, as I explained in a previous post. I am not a fan of the standard workaround, which is to tune kernel settings to reduce readahead, because that affects everything running on the server. The DBMS knows more about its IO patterns and can use madvise to provide hints to the OS, just as RocksDB uses fadvise for POSIX IO.
Good news: issue 9931 has been fixed and the results are impressive.
I used db_bench with an IO-bound workload, the same one used in my previous post. Two binaries were tested:
- old - compiled at git hash ce419c0f, without the fix for issue 9931
- fix - compiled at git hash 69a32ee, with the fix for issue 9931
The verify_checksums option was false for all tests; the CPU overhead would be much larger were it true because checksums would be verified on each block access. Tests were repeated with cache_index_and_filter_blocks set to true and false, which did not have a big impact on results.
The graphs have results for these binary+config pairs:
- cache0.old - cache_index_and_filter_blocks=false, does not have fix for issue 9931
- cache0.fix - cache_index_and_filter_blocks=false, has fix for issue 9931
- cache1.old - cache_index_and_filter_blocks=true, does not have fix for issue 9931
- cache1.fix - cache_index_and_filter_blocks=true, has fix for issue 9931
The average read size shows the effect of the fix on readahead:
- for readwhilewriting: 115KB without the fix, 4KB with the fix
- for fwdrangewhilewriting: 79KB without the fix, 4KB with the fix
Comparing the improved mmap numbers above with the POSIX/buffered IO numbers from my previous post: peak QPS for the IO-bound tests (everything but fillseq and overwrite) is ~100k/second with mmap vs ~250k/second with buffered IO.
Based on vmstat output there is:
- more mutex contention with mmap, per the cs (context switch) column
- more CPU overhead with mmap, per the us (user) and sy (system) columns