Thursday, October 16, 2025

Why is RocksDB spending so much time handling page faults?

This week I was running benchmarks to understand how fast RocksDB could do IO, and then compared that to fio to understand the CPU overhead added by RocksDB. While looking at flamegraphs taken during the benchmark I was confused to see that about 20% of the samples were from page fault handling.

The lesson here is to run your benchmark long enough to reach a steady state before you measure, or the results will be confusing. I was definitely confused when I first saw this. Perhaps my post saves time for the next person who spots it.

The workload is db_bench with a database that is much larger than memory, running read-only microbenchmarks for point lookups and range scans.
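To make the workload concrete, here is a minimal sketch in C++ against the RocksDB API (rather than via db_bench) of what the point-lookup and range-scan microbenchmarks boil down to. The database path, key, and scan length are placeholders; db_bench generates the keys and drives many threads, but the read path it exercises is essentially this.

```cpp
#include <cassert>
#include <string>

#include <rocksdb/db.h>
#include <rocksdb/iterator.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  rocksdb::DB* db = nullptr;

  // Open an existing, much-larger-than-memory database read-only.
  // The path is a placeholder.
  rocksdb::Status s =
      rocksdb::DB::OpenForReadOnly(options, "/data/rocksdb", &db);
  assert(s.ok());

  rocksdb::ReadOptions ropts;

  // Point lookup: a block cache miss turns into a block read from storage.
  std::string value;
  s = db->Get(ropts, "a-placeholder-key", &value);
  assert(s.ok() || s.IsNotFound());

  // Range scan: seek to a key, then read a handful of entries.
  rocksdb::Iterator* it = db->NewIterator(ropts);
  int remaining = 10;  // scan length is arbitrary for this sketch
  for (it->Seek("a-placeholder-key"); it->Valid() && remaining-- > 0;
       it->Next()) {
    // it->key() and it->value() would be consumed here.
  }
  assert(it->status().ok());

  delete it;
  delete db;
  return 0;
}
```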

Then I wondered if this was a transient issue that occurs while RocksDB is warming up the block cache and growing process RSS until the block cache has been fully allocated.

While b-trees as used by Postgres and MySQL do a large allocation at process start, RocksDB does an allocation per block read, and when the block is evicted the allocation is freed. This can be a stress test for a memory allocator, which is why jemalloc and tcmalloc work better than glibc malloc for RocksDB. I revisit the mallocator topic every few years and my most recent post is here.
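For context, this is roughly how the block cache in question gets configured. It is a sketch, not my exact settings, though the 36G and 8kb values match the numbers mentioned below. Each cached block is a separate heap allocation owned by the cache.

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::Options MakeOptions() {
  rocksdb::BlockBasedTableOptions table_options;

  // Block cache entries are per-block heap allocations that are freed on
  // eviction, rather than one big allocation made at process start.
  table_options.block_cache = rocksdb::NewLRUCache(36ull << 30);  // 36G
  table_options.block_size = 8 * 1024;                            // 8kb

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```

db_bench exposes the same knobs as command-line flags (cache_size, block_size).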

In this case I use RocksDB with jemalloc. Even though per-block allocations are transient, the memory used by jemalloc is mostly not transient. While there are cases where jemalloc can return memory to the OS, with my usage that is unlikely to happen.

Were I to let the benchmark run long enough, eventually jemalloc would stop asking the OS for more memory. However, my tests ran for about 10 minutes and did about 10,000 block reads per second, while I had configured RocksDB to use a block cache that was at least 36G with an 8kb block size. So my tests weren't running long enough for the block cache to fill, which means that during the measurement period:

  • jemalloc was still asking for memory
  • block cache eviction wasn't needed, and after each block read a new entry was added to the block cache
Much of that memory is being touched by the process for the first time, and the first access to a newly allocated page triggers a (minor) page fault. The result in this example is that 22.69% of the samples are from page fault handling. That is the second large stack from the left in the flamegraph. The RocksDB code where it happens is rocksdb::BlockFetcher::ReadBlockContents.

When I run the benchmark for longer, the CPU overhead from page fault handling goes away.
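One way to see the warmup effect directly, and to confirm whether these are minor (soft) or major (hard) faults, is to sample the per-process fault counters around the measurement window. This is a sketch using getrusage; perf stat can report the same counters.

```cpp
#include <sys/resource.h>

#include <chrono>
#include <cstdio>
#include <thread>

// Report minor (soft) and major (hard) page faults taken while `work` runs,
// plus peak RSS. A sketch: in a real run `work` would wrap a stretch of the
// read-only benchmark.
template <typename Fn>
void CountFaults(Fn work) {
  struct rusage before, after;
  getrusage(RUSAGE_SELF, &before);
  work();
  getrusage(RUSAGE_SELF, &after);
  std::printf("minor faults: %ld, major faults: %ld, max RSS: %ld kb\n",
              after.ru_minflt - before.ru_minflt,
              after.ru_majflt - before.ru_majflt, after.ru_maxrss);
}

int main() {
  CountFaults([] {
    // Placeholder for the benchmark work.
    std::this_thread::sleep_for(std::chrono::seconds(1));
  });
  return 0;
}
```

The equivalent with perf is something like perf stat -e minor-faults,major-faults -p <pid>. During warmup the minor fault count should grow along with RSS, then mostly stop once the block cache is full.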




2 comments:

  1. Just to clarify, these page faults are minor (soft), not major (hard), right? They are not related to an IO (those would be major / hard), but they are related to the first access of RAM after growing RSS.

    It is unclear to me why this would be a noticeable overhead, because each of these soft page faults would happen in a pair with an actual IO (reading from disk to fill the block cache), and I would assume that the soft page fault latency is insignificant compared to the IO latency.

    Replies
    1. Good question and I will confirm via "perf stat" soon.

      Whether or not it is significant relative to the IO latency depends on how fast the storage is. Here my focus is on minimizing the CPU overhead, because 10 or 20 usecs of CPU per IO from RocksDB and the kernel becomes significant when you have a fast storage device.

      At least during warmup, the page fault overhead was more than 20% of total CPU. I needed to understand why that happened and am happy to learn that I can ignore it.
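
      For a rough sense of scale: 20 usecs of CPU per IO limits one core to about 50,000 IOPS, and a fast NVMe device can serve several times that, so the per-IO CPU cost is worth minimizing.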

