Thursday, October 16, 2025

Why is RocksDB spending so much time handling page faults?

This week I was running benchmarks to understand how fast RocksDB could do IO, and then compared that to fio to understand the CPU overhead added by RocksDB. While looking at flamegraphs taken during the benchmark I noticed that about 20% of the samples were from page fault handling, which confused me at first.

The lesson here is to run your benchmark long enough to reach a steady state before you measure, or the results will confuse you. I was definitely confused when I first saw this. Perhaps this post saves time for the next person who spots it.

The workload is db_bench with a database that is much larger than memory, running read-only microbenchmarks for point lookups and range scans.
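
The command lines were roughly like the following. The exact values here (num, threads) are illustrative rather than copied from my scripts, but the shape is right: an existing database, O_DIRECT reads, an 8kb block size and a large block cache.

    # point lookups (readrandom) against an existing, larger-than-memory database
    ./db_bench --benchmarks=readrandom --use_existing_db=1 \
        --num=2000000000 --duration=600 --threads=16 \
        --cache_size=38654705664 --block_size=8192 \
        --compression_type=none --use_direct_reads=true

    # range scans (seekrandom), reusing the same options
    ./db_bench --benchmarks=seekrandom --seek_nexts=10 --use_existing_db=1 \
        --num=2000000000 --duration=600 --threads=16 \
        --cache_size=38654705664 --block_size=8192 \
        --compression_type=none --use_direct_reads=true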

Then I wondered if this was a transient issue that occurs while RocksDB is warming up the block cache and growing process RSS until the block cache has been fully allocated.

While the b-trees used by Postgres and MySQL do a large allocation at process start, RocksDB does an allocation per block read, and when the block is evicted the allocation is freed. This can be a stress test for a memory allocator, which is why jemalloc and tcmalloc work better than glibc malloc for RocksDB. I revisit the mallocator topic every few years and my most recent post is here.

In this case I use RocksDB with jemalloc. Even though per-block allocations are transient, the memory used by jemalloc is mostly not transient. While there are cases where jemalloc can return memory to the OS, with my usage that is unlikely to happen.
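
For anyone reproducing this, there are a few ways to get db_bench using jemalloc. The library path and build option below are typical but your setup may differ:

    # preload jemalloc at run time (library path varies by distro)
    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./db_bench ...

    # or build RocksDB with jemalloc linked in
    cmake -DWITH_JEMALLOC=ON .. && make db_bench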

Were I to let the benchmark run long enough, jemalloc would eventually finish getting memory from the OS. However, my tests ran for about 10 minutes doing about 10,000 block reads per second, while RocksDB was configured with a block cache of at least 36G and a block size of 8kb. So my tests didn't run long enough for the block cache to fill, which means that during the measurement period:

  • jemalloc was still asking for memory
  • block cache eviction wasn't needed and after each block read a new entry was added to the block cache

The result in this example is that 22.69% of the samples are from page fault handling; that is the second large stack from the left in the flamegraph. The RocksDB code where it happens is rocksdb::BlockFetcher::ReadBlockContents.
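
The flamegraphs come from perf and Brendan Gregg's FlameGraph scripts. A recipe like this works, assuming the scripts are in ./FlameGraph:

    perf record -g -F 99 -p $(pidof db_bench) -- sleep 60
    perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > db_bench.svg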

When I run the benchmark for longer, the CPU overhead from page fault handling goes away.

5 comments:

  1. Just to clarify, these page faults are minor (soft), not major (hard), right? They are not related to an IO (those would be major/hard), but to the first access of RAM after growing RSS.

    It is unclear to me why this would be a noticeable overhead, because each of these soft page faults would happen paired with an actual IO (reading from disk to fill the block cache), and I would assume that the soft page fault latency is insignificant compared to the IO latency.

    Replies
    1. Good question and I will confirm via "perf stat" soon.

      Whether it is significant relative to the IO latency depends on how fast the device is. Here my focus is on minimizing the CPU overhead, because 10 or 20 usecs of CPU per IO from RocksDB and the kernel becomes significant when you have a fast storage device.

      At least during warmup, the page fault overhead was more than 20% of total CPU. I needed to understand why that happened and am happy to learn that I can ignore it.

    2. A latency of 10 usecs for a soft page fault looks like a lot, but it might be right, as updating the page table can require invalidating TLB entries on other cores, and those cores could be on distant sockets.

      How fast are your IOs?

      Maybe an explanation for soft page faults dominating the workload is that a large IO was done (imagine 8 MiB for 1024 blocks of 8 KiB) and a soft page fault occurs for each block being written to RAM, causing 1024 interruptions of the process as it loads those 1024 blocks after a single 8 MiB IO.

      The process might not be doing that large IO itself; the OS could be prefetching (the data would be in the page cache), and because the context switch for the read syscall is cheaper than the one for the soft page fault (which needs cache invalidation), all of this could be file caching side effects combined with the memory allocator relying on lazy memory mapping.

      Could we tell the allocator not to do lazy mapping, paying a single "large" soft page fault to map a large block of RAM instead of a "small" soft page fault on each 8 KiB block (assuming a large and a small soft page fault have similar cost)?

    3. I appreciate your comments and that you took the time to guess at things I didn't disclose.

      The workloads here are read-only, RocksDB compression is disabled and the RocksDB block size is 8kb.

      From iostat I see that rareq-sz is 8.41 (KB) because I didn't enable the option to align RocksDB blocks to filesystem page boundaries, as that wastes space.

      I configured RocksDB to use O_DIRECT, so there should not be filesystem readahead. Also, the db_bench tests I am running are all for point queries.

      r_await is 0.12 ms (120 usecs) on the home server I am using. But the issue repros on faster SSDs.
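
      Those numbers come from the extended device report, something like this (column names depend on the sysstat version):

        iostat -x 1 10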

      I am curious whether the allocator has an impact here and I might repeat tests with glibc malloc or tcmalloc.

      I will have a blog post soon with more details and try to answer more of your questions.

      Via "perf stat" I confirmed it is minor-faults.
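
      Something like this, with the process selection up to you:

        perf stat -e minor-faults,major-faults -p $(pidof db_bench) -- sleep 60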

