Small Datum: Using mmap with RocksDB

RocksDB inherited support for mmap from LevelDB. I was curious how performance with mmap compared to buffered IO (pread) and ran benchmarks with db_bench. I haven't had great experiences with mmap for database workloads in the past but tried to be fair.

tl;dr

Don't use mmap for IO-bound workloads until issue 9931 is fixed
For in-memory databases performance with mmap can be better than pread if db_bench is run with --cache_index_and_filter_blocks=false. Setting --verify_checksum=false provides a smaller performance boost.

Longer than tl;dr

If the database doesn't fit in memory then issue 9931 means there will be significant overfetching on reads from storage and this will ruin performance. I assume the fix is to add madvise calls similar to the posix_fadvise calls used for buffered IO (pread).
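To make the madvise idea concrete, here is a minimal sketch in Python rather than RocksDB's C++. It is not the RocksDB code path or the fix for issue 9931. It only shows the same "random access" hint given to a mapping via madvise and to a file descriptor via posix_fadvise. The function names are hypothetical, and the hints are skipped on platforms that lack them.

```python
import mmap
import os


def read_block_mmap(path, offset, length):
    """Read `length` bytes at `offset` through a read-only mapping.

    Hypothetical helper: it hints random access so the kernel is less
    likely to read ahead far beyond the bytes that are needed.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # madvise and MADV_RANDOM are not available on every platform.
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_RANDOM"):
                m.madvise(mmap.MADV_RANDOM)
            return m[offset:offset + length]


def read_block_pread(path, offset, length):
    """Read the same bytes with buffered IO, using the analogous fadvise hint."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            # offset 0 and length 0 apply the hint to the whole file.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)
```

Both helpers return the same bytes. Any difference would show up in how much data the kernel fetches from storage per call, which is what the issue is about.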

With --cache_index_and_filter_blocks=false the RocksDB process might use more memory than you expect, so be careful while chasing performance. When that option is true and --mmap_read=true, index and filter blocks are still copied into the RocksDB block cache.

While --verify_checksum=false improves performance in some cases, it also prevents the use of checksums to detect file corruption on data blocks. Checksums for filter and index blocks read into the block cache are always checked regardless of the value of --verify_checksum because the BlockFetcher uses a ReadOptions struct with default values and the default for ReadOptions::verify_checksums is true (see here and here).
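The tradeoff can be illustrated with a toy example. This is a simplified sketch and not RocksDB's block format or checksum function: it assumes a block stored as a payload followed by a 4-byte CRC32 from zlib, and the names seal_block and read_block are hypothetical.

```python
import struct
import zlib


def seal_block(payload: bytes) -> bytes:
    """Append a little-endian CRC32 of the payload (toy format, an assumption)."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def read_block(stored: bytes, verify_checksum: bool = True) -> bytes:
    """Return the payload, optionally checking the trailing checksum first."""
    payload, trailer = stored[:-4], stored[-4:]
    if verify_checksum:
        (expected,) = struct.unpack("<I", trailer)
        if zlib.crc32(payload) != expected:
            raise IOError("block checksum mismatch")
    return payload
```

With verify_checksum=False a corrupted payload is returned silently, and with the default it raises. The time saved by skipping the check is the time spent computing the CRC over every block read.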

Configuration

I used benchmark.sh to run db_bench on a server with more than 32 CPU cores and hyperthreading enabled. The database was run in three configurations: cached by RocksDB, cached by OS and IO-bound. For cached by RocksDB the block cache was larger than the database files. For cached by OS the block cache was 1G but the database fit in the OS page cache. For IO-bound the database was larger than memory. I don't share graphs for cached by OS but the results were similar to cached by RocksDB.

The benchmarks were run with 32 client threads (--threads=32) and WAL sync disabled. The background writer was limited to 2MB/s.

The benchmarks were run in this order:

fillseq - load the database in key order
readrandom - point lookups via Get
multireadrandom - point lookups via MultiGet
revrangeww - reverse range scans with a background writer
fwdrangeww - forward range scans with a background writer
readww - point lookups via Get with a background writer
overwrite - writes via Put, run as fast as possible

IO-bound

The database was loaded with 4B key-value pairs by fillseq. The throughput graphs show that results were lousy for mmap on the read-heavy benchmarks. On my server RocksDB with mmap was fetching ~100X more data per query compared to pread. While tuning filesystem options might have reduced the excessive fetching I am not sure it would prevent it and I am reluctant to change global options on a server to benefit one of the applications that use the server. The real fix and more stats on the excessive fetching are explained in issue 9931.

For pread and mmap the benchmarks used --verify_checksum=true and --cache_index_and_filter_blocks=true.

The graph shows throughput on the y-axis by benchmark + configuration on the x-axis.

Cached by RocksDB

The database was loaded with 40M key-value pairs by fillseq. Results are provided for several configurations. In each name, ver0/ver1 is the value of --verify_checksum and cm0/cm1 is the value of --cache_index_and_filter_blocks.

pread-ver1-cm1: --mmap_read=false --verify_checksum=true --cache_index_and_filter_blocks=true
mmap-ver1-cm1: --mmap_read=true --verify_checksum=true --cache_index_and_filter_blocks=true
pread-ver0-cm1: --mmap_read=false --verify_checksum=false --cache_index_and_filter_blocks=true
pread-ver0-cm0: --mmap_read=false --verify_checksum=false --cache_index_and_filter_blocks=false
pread-ver1-cm0: --mmap_read=false --verify_checksum=true --cache_index_and_filter_blocks=false
mmap-ver0-cm1: --mmap_read=true --verify_checksum=false --cache_index_and_filter_blocks=true
mmap-ver0-cm0: --mmap_read=true --verify_checksum=false --cache_index_and_filter_blocks=false
mmap-ver1-cm0: --mmap_read=true --verify_checksum=true --cache_index_and_filter_blocks=false
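The naming scheme in the list above is regular enough to generate mechanically. The helper below is a hypothetical convenience and not part of benchmark.sh or db_bench. It only expands a configuration name into the three flags shown in the list.

```python
def db_bench_flags(config_name):
    """Translate a name such as 'mmap-ver1-cm0' into the three db_bench flags."""
    io, ver, cm = config_name.split("-")
    if io not in ("pread", "mmap") or ver not in ("ver0", "ver1") or cm not in ("cm0", "cm1"):
        raise ValueError(f"unrecognized configuration name: {config_name}")

    def as_flag(enabled):
        return "true" if enabled else "false"

    return [
        f"--mmap_read={as_flag(io == 'mmap')}",
        f"--verify_checksum={as_flag(ver == 'ver1')}",
        f"--cache_index_and_filter_blocks={as_flag(cm == 'cm1')}",
    ]
```

For example, " ".join(db_bench_flags("mmap-ver0-cm0")) gives the flag string for the last-but-one entry in the list.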

Command lines are here for pread-ver0-cm1. To use the *-cm0 variants see here. To run the IO-bound variants see here.

The most important db_bench option for performance is --cache_index_and_filter_blocks=false as shown in the results for pread-ver*-cm0 and mmap-ver*-cm0.

Setting --verify_checksum=false helps for mmap but not for pread. See pread-ver0-cm0 vs pread-ver1-cm0 and mmap-ver0-cm0 vs mmap-ver1-cm0. This might be workload specific.

Setting --mmap_read=true also helps compared to pread, but only when --cache_index_and_filter_blocks=false. It is possible that also enabling the option for compaction to pre-populate the block cache would allow pread to be as fast as mmap, but I did not try it.

The graph shows throughput on the y-axis by benchmark + configuration on the x-axis.

Monday, May 9, 2022
