Thursday, October 23, 2025

How efficient is RocksDB for IO-bound, point-query workloads?

How efficient is RocksDB for workloads that are IO-bound and read-only? One way to answer this is to measure the CPU overhead from RocksDB as this is extra overhead beyond what libc and the kernel require to perform an IO. Here my focus is on KV pairs that are smaller than the typical RocksDB block size that I use -- 8kb.

By IO efficiency I mean: (storage read IOPs from RocksDB benchmark / storage read IOPs from fio)

And I measure this in a setup where RocksDB gets little benefit from block cache hits (database size > 400G, block cache size = 16G).

This value will be less than 1.0 in such a setup. But how much less than 1.0 will it be? On my hardware the IO efficiency was ~0.84 at 1 client and ~0.88 at 6 clients. Were I to use storage with a 2X larger read latency, the IO efficiency would be closer to 0.95.
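
As a small Python sketch of that arithmetic, using the IOPs numbers measured later in this post (copied from the fio and db_bench readrandom results):

# IO efficiency = storage read IOPs from the RocksDB benchmark / storage read IOPs from fio
# The IOPs values are copied from the fio and db_bench (readrandom) results below.
fio_iops = {1: 9884, 6: 43782}        # storage reads/s from fio, keyed by client count
rocksdb_iops = {1: 8350, 6: 38628}    # storage reads/s during db_bench readrandom

for clients in (1, 6):
    print(clients, round(rocksdb_iops[clients] / fio_iops[clients], 2))
# prints ~0.84 for 1 client and ~0.88 for 6 clients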

Note that:

  • IO efficiency increases (decreases) when SSD read latency increases (decreases)
  • IO efficiency increases (decreases) when the RocksDB CPU overhead decreases (increases)
  • RocksDB QPS increases by ~8% for IO-bound workloads when --block_align is enabled

The overheads per 8kb block read on my test hardware were:

  • about 11 microseconds from libc + kernel
  • between 6 and 10 microseconds from RocksDB
  • between 100 and 150 usecs of IO latency from SSD per iostat

A simple performance model

A simple model to predict the wall-clock latency for reading a block is:
    userland CPU + libc/kernel CPU + device latency

For fio I assume that userland CPU is zero. I measured libc/kernel CPU at ~11 usecs and estimate device latency at ~91 usecs. My device latency estimate comes from read-only fio benchmarks where fio reports the average latency as 102 usecs, which includes ~11 usecs of CPU from libc+kernel, and 91 = 102 - 11.

This model isn't perfect, as I will show below when reporting results for RocksDB, but it might be sufficient.
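
As a sketch, here is the model in Python using the fio numbers from above:

# Simple model: wall-clock latency per block read (usecs) =
#   userland CPU + libc/kernel CPU + device latency
def predicted_iops(userland_cpu_us, kernel_cpu_us, device_lat_us):
    return 1_000_000 / (userland_cpu_us + kernel_cpu_us + device_lat_us)

# fio: no userland work, ~11 usecs of libc+kernel CPU, ~91 usecs of device latency
print(round(predicted_iops(0, 11, 91)))    # ~9800/s, close to what fio reported at 1 client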

Q and A

Q: What is the CPU overhead from libc + kernel per 8kb read?
A: About 11 microseconds on this CPU.

Q: Can you write your own code that will be faster than RocksDB for such a workload?
A: Yes, you can.

Q: Should you write your own library for this?
A: It depends on how many features you need and the opportunity cost in spending time writing that code vs doing something else.

Q: Will RocksDB add features to make this faster?
A: That is for them to answer. But all projects have a complexity budget. Code can become too expensive to maintain when that budget is exceeded. There is also the opportunity cost to consider as working on this delays work on other features.

Q: Does this matter?
A: It matters more when storage is fast (read latency less than 100 usecs). As read response time grows the CPU overhead from RocksDB becomes much less of an issue.
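
A hedged sketch of that point using the simple model from above (the model is optimistic relative to the measured IO efficiency reported below, but the trend is what matters here):

# Per the simple model, IO efficiency for 1 client is roughly
#   (device latency + libc/kernel CPU) / (device latency + libc/kernel CPU + RocksDB CPU)
kernel_cpu_us = 11     # libc + kernel CPU per 8kb read
rocksdb_cpu_us = 8     # extra RocksDB CPU per point query (readrandom, 1 client)

for device_lat_us in (91, 182, 500):
    fio_lat = device_lat_us + kernel_cpu_us
    print(device_lat_us, round(fio_lat / (fio_lat + rocksdb_cpu_us), 2))
# ~0.93 at 91 usecs, ~0.96 at 182 usecs, ~0.98 at 500 usecs of device latency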

Benchmark hardware

I ran tests on a Beelink SER7 with a Ryzen 7 7840HS CPU that has 8 cores and 32G of RAM. The storage device is a Crucial CT1000P3PSSD8 (Crucial P3, 1TB) using ext4 with discard enabled. The OS is Ubuntu 24.04.

From fio, the average read latency for the SSD is 102 microseconds using O_DIRECT with iodepth=1 and the sync engine.

CPU frequency management makes it harder to claim that the CPU runs at X GHz, but the details are:

$ cpupower frequency-info

analyzing CPU 5:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 5
  CPUs which need to have their frequency coordinated by software: 5
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 1.60 GHz - 3.80 GHz
  available frequency steps:  3.80 GHz, 2.20 GHz, 1.60 GHz
  available cpufreq governors: conservative ... powersave performance schedutil
  current policy: frequency should be within 1.60 GHz and 3.80 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 3.79 GHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: no

Results from fio

I started with fio using a command-line like the following for NJ=1 and NJ=6 to measure average IOPs and the CPU overhead per IO.

fio --name=randread --rw=randread --ioengine=sync --numjobs=$NJ --iodepth=1 \
  --buffered=0 --direct=1 \
  --bs=8k \
  --size=400G \
  --randrepeat=0 \
  --runtime=600s --ramp_time=1s \
  --filename=G_1:G_2:G_3:G_4:G_5:G_6:G_7:G_8  \
  --group_reporting

Results are:

legend:
* iops - average reads/s reported by fio
* usPer, syPer - user, system CPU usecs per read
* cpuPer - usPer + syPer
* lat.us - average read latency in microseconds
* numjobs - the value for --numjobs with fio

iops    usPer   syPer   cpuPer  lat.us  numjobs
 9884   1.351    9.565  10.916  101.61  1
43782   1.379   10.642  12.022  136.35  6
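
One sanity check: with the sync engine and iodepth=1 each job has at most one read in flight, so IOPs should be close to numjobs * 1,000,000 / lat.us. A quick check in Python:

# With ioengine=sync and iodepth=1 each job has one read outstanding,
# so throughput is bounded by numjobs / average latency.
for numjobs, lat_us, reported_iops in ((1, 101.61, 9884), (6, 136.35, 43782)):
    print(numjobs, round(numjobs * 1_000_000 / lat_us), reported_iops)
# predicts ~9842 vs 9884 reported, and ~44004 vs 43782 reported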

Results from RocksDB

I used an edited version of my benchmark helper scripts that run db_bench. In this case the sequence of tests was:

  1. fillseq - loads the LSM tree in key order
  2. revrange - I ignore the results from this
  3. overwritesome - overwrites 10% of the KV pairs
  4. flush_mt_l0 - flushes the memtable, waits, compacts L0 to L1, waits
  5. readrandom - does random point queries when LSM tree has many levels
  6. compact - compacts LSM tree into one level
  7. readrandom2 - does random point queries when LSM tree has one level, bloom filters enabled
  8. readrandom3 - does random point queries when LSM tree has one level, bloom filters disabled

I use readrandom, readrandom2 and readrandom3 to vary the amount of work that RocksDB must do per query and to measure the CPU overhead of that work. The most work happens with readrandom because the LSM tree has many levels and there are bloom filters to check. The least work happens with readrandom3 because the LSM tree has only one level and there are no bloom filters to check.

Initially I ran tests with --block_align not set because that reduces space-amplification (less padding), but then 8kb reads are likely to cross file system page boundaries and become larger reads from storage. Given that the focus here is on IO efficiency, I enabled --block_align.

A summary of the results for db_bench with 1 user (thread) and 6 users (threads) is below. The legend for these columns is in the section on --block_align near the end of this post.

--- 1 user
qps     iops    reqsz   usPer   syPer   cpuPer  rx.lat  io.lat  test
8282     8350   8.5     11.643   7.602  19.246  120.74  101     readrandom
8394     8327   8.7      9.997   8.525  18.523  119.13  105     readrandom2
8522     8400   8.2      8.732   8.718  17.450  117.34  100     readrandom3

--- 6 users
38391   38628   8.1     14.645   7.291  21.936  156.27  134     readrandom
39359   38623   8.3     10.449   9.346  19.795  152.43  144     readrandom2
39669   38874   8.0      9.459   9.850  19.309  151.24  140     readrandom3

From the results above and the derived values below:

  • IO efficiency is approximately 0.84 at 1 client and 0.88 at 6 clients
  • With 1 user, RocksDB adds between 6.534 and 8.330 usecs of CPU time per query compared to fio, depending on the amount of work it has to do
  • With 6 users, RocksDB adds between 7.287 and 9.914 usecs of CPU time per query
  • IO latency as reported by RocksDB is ~20 usecs larger than as reported by iostat. But I have to re-read the RocksDB source code to understand where and how it is measured.

legend:
* io.eff - IO efficiency as (db_bench storage read IOPs / fio storage read IOPs)
* us.inc - incremental user CPU usecs per read as (db_bench usPer - fio usPer)
* cpu.inc - incremental total CPU usecs per read as (db_bench cpuPer - fio cpuPer)

--- 1 user

        io.eff          us.inc          cpu.inc         test
        ------          ------          ------
        0.844           10.292           8.330          readrandom
        0.842            8.646           7.607          readrandom2
        0.849            7.381           6.534          readrandom3

--- 6 users

        io.eff          us.inc          cpu.inc         test
        ------          ------          ------
        0.882           13.266           9.914          readrandom
        0.882            9.070           7.773          readrandom2
        0.887            8.080           7.287          readrandom3
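
The derived values above come straight from the fio and db_bench tables. A minimal Python sketch of the arithmetic for the 1-user case (small differences in the last digit versus the table come from rounding of the inputs):

# io.eff  = db_bench storage read IOPs / fio storage read IOPs
# us.inc  = db_bench usPer  - fio usPer
# cpu.inc = db_bench cpuPer - fio cpuPer
fio = {"iops": 9884, "usPer": 1.351, "cpuPer": 10.916}
db_bench = {
    "readrandom":  {"iops": 8350, "usPer": 11.643, "cpuPer": 19.246},
    "readrandom2": {"iops": 8327, "usPer":  9.997, "cpuPer": 18.523},
    "readrandom3": {"iops": 8400, "usPer":  8.732, "cpuPer": 17.450},
}
for test, r in db_bench.items():
    io_eff = r["iops"] / fio["iops"]
    us_inc = r["usPer"] - fio["usPer"]
    cpu_inc = r["cpuPer"] - fio["cpuPer"]
    print(f"{io_eff:.3f}  {us_inc:6.3f}  {cpu_inc:6.3f}  {test}")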

Evaluating the simple performance model

I described a simple performance model earlier in this blog post and now it is time to see how well it does for RocksDB. First I will use values from the 1 user/client/thread case:

  • IO latency is ~91 usecs per fio
  • libc+kernel CPU overhead is ~11 usecs per fio
  • RocksDB CPU overhead is 8.330, 7.607 and 6.534 usecs for readrandom, *2 and *3

The model is far from perfect as it predicts that RocksDB will sustain:

  • 9063 IOPs for readrandom, when it actually did 8350
  • 9124 IOPs for readrandom2, when it actually did 8327
  • 9214 IOPs for readrandom3, when it actually did 8400

Regardless, the model is a good way to think about the problem.
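
The predictions above are just 1,000,000 / (device latency + libc/kernel CPU + RocksDB CPU). In Python:

# Predicted IOPs = 1e6 / (device latency + libc/kernel CPU + RocksDB CPU), all in usecs
device_lat_us = 91
kernel_cpu_us = 11
rocksdb_cpu_us = {"readrandom": 8.330, "readrandom2": 7.607, "readrandom3": 6.534}
actual_iops   = {"readrandom": 8350,   "readrandom2": 8327,   "readrandom3": 8400}

for test, cpu in rocksdb_cpu_us.items():
    predicted = 1_000_000 / (device_lat_us + kernel_cpu_us + cpu)
    print(test, round(predicted), actual_iops[test])
# the predictions are roughly 8% to 10% higher than the measured IOPs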

The impact from --block_align

RocksDB QPS increases by between 7% and 9% when --block_align is enabled. Enabling it reduces read-amp and increases space-amp. But given the focus here is on IO efficiency I prefer to enable it. RocksDB QPS increases with it enabled because fewer storage read requests cross file system page boundaries, thus the average read size from storage is reduced (see the reqsz column below).

legend:
* qps - RocksDB QPS
* iops - average storage reads/s per iostat
* reqsz - average read request size in KB per iostat
* usPer, syPer, cpuPer - user, system and (user+system) CPU usecs per read
* rx.lat - average read latency in microseconds, per RocksDB
* io.lat - average read latency in microseconds, per iostat
* test - the db_bench test name

- block_align disabled
qps     iops    reqsz   usPer   syPer   cpuPer  rx.lat  io.lat  test
7629     7740   8.9     12.133   8.718  20.852  137.92  111     readrandom
7866     7813   9.1     10.094   9.098  19.192  127.12  115     readrandom2
7972     7862   8.6      8.931   9.326  18.257  125.44  110     readrandom3

- block_align enabled
qps     iops    reqsz   usPer   syPer   cpuPer  rx.lat  io.lat  test
8282     8350   8.5     11.643   7.602  19.246  120.74  101     readrandom
8394     8327   8.7      9.997   8.525  18.523  119.13  105     readrandom2
8522     8400   8.2      8.732   8.718  17.450  117.34  100     readrandom3
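
From the two tables above, a quick computation of the QPS improvement from enabling --block_align for the 1-user results shown here:

# QPS with --block_align disabled vs enabled (1 user), copied from the tables above
qps_disabled = {"readrandom": 7629, "readrandom2": 7866, "readrandom3": 7972}
qps_enabled  = {"readrandom": 8282, "readrandom2": 8394, "readrandom3": 8522}

for test in qps_disabled:
    gain = qps_enabled[test] / qps_disabled[test] - 1
    print(f"{test}: +{gain:.1%}")
# ~+8.6%, +6.7% and +6.9% for the 1-user results shown here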
