Wednesday, September 7, 2016

Tuning the RocksDB block cache

I spent many years using InnoDB with direct IO and I didn't worry about buffered IO performance. Well, I didn't worry until Domas told me to worry. My focus has switched to RocksDB and now I worry about buffered IO performance. Fortunately, another co-worker (Jens Axboe) promises to make buffered writeback much better.

With direct IO, InnoDB stores compressed and uncompressed pages in the InnoDB buffer pool. It has a clever algorithm to determine how much memory to use for each, based on whether the workload appears to be IO-bound or CPU-bound. My vague memory is that we tune my.cnf to keep it from being too clever.

With buffered IO, RocksDB manages a block cache for uncompressed blocks and then depends on the OS page cache for compressed blocks. While I think there is an opportunity to be more efficient in that area, that is not the topic for today.

The question today is how to divide memory between the RocksDB block cache and the OS page cache. I have read tuning advice for other buffered IO databases that suggests giving as much RAM as possible to the database. I disagree, and my advice is:
  1. If the uncompressed working set fits in the RocksDB block cache then give as much RAM as possible to the block cache.
  2. Else if the compressed working set fits in the OS page cache then give most RAM to the OS page cache by using a small RocksDB block cache.
  3. Else give the RocksDB block cache about 20% of host RAM.
This is a rule of thumb. Sometimes in rule 3 I suggest giving 25% or 30% of RAM to the block cache, but I hope you get the point. The goal is to avoid reads from storage by caching more data in RAM. I assume that decompressing a block is much faster than reading it from storage, which is more likely to be true when you use a fast algorithm like zstandard.
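The three rules above can be written down directly. This is a sketch of the rule of thumb, not a RocksDB API; the function name, the 2G "small cache" size, and the 20% fraction are illustrative defaults, and in practice you would leave headroom for memory the process needs for other things.

```python
def block_cache_size(ram_gb, uncompressed_working_set_gb,
                     compressed_working_set_gb,
                     small_cache_gb=2, default_fraction=0.20):
    """Suggest a RocksDB block cache size in GB per the rule of thumb.

    Whatever is not given to the block cache is left for the OS page
    cache. The small-cache size and the 20% fraction are rules of
    thumb, not hard limits.
    """
    if uncompressed_working_set_gb <= ram_gb:
        # Rule 1: the uncompressed working set fits, so give the block
        # cache as much RAM as possible.
        return ram_gb
    if compressed_working_set_gb <= ram_gb:
        # Rule 2: the compressed working set fits in the OS page cache,
        # so keep the block cache small.
        return small_cache_gb
    # Rule 3: neither fits, so give the block cache ~20% of host RAM.
    return ram_gb * default_fraction
```

For the Linkbench host described below (50G of RAM, a ~350G database, so neither working set fits), rule 3 applies and the sketch suggests a 10G block cache.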

This isn't proven unless you accept proof by anecdote. I ran a test with Linkbench on a host with 50G of RAM and a ~350G database. The test was repeated with the RocksDB block cache set to 5G, 10G, 20G and 35G. Using a smaller block cache reduced the storage read cost per transaction, measured via iostat r/s and iostat rKB/s, by between 10% and 20%. My advice might not work for you, but it might help you consider your choices before following tuning advice you read on the web.


  1. Hi Mark, I have been reading your blog in the past weeks, nice work.

    I have some questions regarding caching in RocksDB. In the default LevelDB configuration, the SST files are mmapped with a default limit of 1000 files. This limit keeps the system from consuming too much memory with mmap (up to 2GB). However, assuming a skewed access pattern, even if only the portions of a file that are being accessed are mmapped (rather than the entire file), it seems that I am wasting memory. As an example, assume I have 1k mmapped files, but only a portion of each of these files is being read; I am consuming less than 2GB with mmap, but I cannot use the remaining memory anywhere else because I am already at the limit of mmapped files. I don't like this, and I think a unified cache seems a better way to go (maybe a block_cache for compressed data?).

    From your post and a quick look at the code, RocksDB does not use mmap by default and simply relies on the OS page cache, which seems to be a much cleaner solution. However, I was not able to find any option to set the amount of OS page cache memory I want to use. I think this is just another instance of the eternal debate between OS people and database people, with the latter demanding complete control over their systems.

    1. There is an option to use the RocksDB block cache for compressed pages. It is infrequently used. There are also options to reduce the amount of data that RocksDB might put into the OS page cache.

      We know we can do better here and expect to have something better, maybe in the next 6 months. It isn't clear whether that means using only direct IO or just getting better at managing the data that might be in the OS page cache.