Thursday, October 23, 2025

How efficient is RocksDB for IO-bound, point-query workloads?

How efficient is RocksDB for workloads that are IO-bound and read-only? One way to answer this is to measure the CPU overhead from RocksDB as this is extra overhead beyond what libc and the kernel require to perform an IO. Here my focus is on KV pairs that are smaller than the typical RocksDB block size that I use -- 8kb.

By IO efficiency I mean: (storage read IOPs from RocksDB benchmark / storage read IOPs from fio)

And I measure this in a setup where RocksDB gets little benefit from block cache hits (database size > 400G, block cache size = 16G).

This value will be less than 1.0 in such a setup. But how much less than 1.0 will it be? On my hardware the IO efficiency was ~0.84 at 1 client and ~0.88 at 6 clients. Were I to use storage with a 2X larger read latency, the IO efficiency would be closer to 0.95.
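
As a small Python sketch of that arithmetic, using the IOPs numbers measured later in this post (copied from the fio and db_bench readrandom results):

# IO efficiency = storage read IOPs from the RocksDB benchmark / storage read IOPs from fio
# The IOPs values are copied from the fio and db_bench (readrandom) results below.
fio_iops = {1: 9884, 6: 43782}        # storage reads/s from fio, keyed by client count
rocksdb_iops = {1: 8350, 6: 38628}    # storage reads/s during db_bench readrandom

for clients in (1, 6):
    print(clients, round(rocksdb_iops[clients] / fio_iops[clients], 2))
# prints ~0.84 for 1 client and ~0.88 for 6 clients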

Note that:

  • IO efficiency increases (decreases) when SSD read latency increases (decreases)
  • IO efficiency increases (decreases) when the RocksDB CPU overhead decreases (increases)
  • RocksDB QPS increases by ~8% for IO-bound workloads when --block_align is enabled

The overheads per 8kb block read on my test hardware were:

  • about 11 microseconds from libc + kernel
  • between 6 and 10 microseconds from RocksDB
  • between 100 and 150 usecs of IO latency from SSD per iostat

A simple performance model

A simple model to predict the wall-clock latency for reading a block is:
    userland CPU + libc/kernel CPU + device latency

For fio I assume that userland CPU is zero. I measured libc/kernel CPU at ~11 usecs and estimate device latency at ~91 usecs. My device latency estimate comes from read-only fio benchmarks where fio reports the average latency as 102 usecs, which includes ~11 usecs of CPU from libc+kernel, and 91 = 102 - 11.

This model isn't perfect, as I will show below when reporting results for RocksDB, but it might be sufficient.
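
As a sketch, here is the model in Python using the fio numbers from above:

# Simple model: wall-clock latency per block read (usecs) =
#   userland CPU + libc/kernel CPU + device latency
def predicted_iops(userland_cpu_us, kernel_cpu_us, device_lat_us):
    return 1_000_000 / (userland_cpu_us + kernel_cpu_us + device_lat_us)

# fio: no userland work, ~11 usecs of libc+kernel CPU, ~91 usecs of device latency
print(round(predicted_iops(0, 11, 91)))    # ~9800/s, close to what fio reported at 1 client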

Q and A

Q: What is the CPU overhead from libc + kernel per 8kb read?
A: About 11 microseconds on this CPU.

Q: Can you write your own code that will be faster than RocksDB for such a workload?
A: Yes, you can.

Q: Should you write your own library for this?
A: It depends on how many features you need and the opportunity cost in spending time writing that code vs doing something else.

Q: Will RocksDB add features to make this faster?
A: That is for them to answer. But all projects have a complexity budget. Code can become too expensive to maintain when that budget is exceeded. There is also the opportunity cost to consider as working on this delays work on other features.

Q: Does this matter?
A: It matters more when storage is fast (read latency less than 100 usecs). As read response time grows the CPU overhead from RocksDB becomes much less of an issue.
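
A hedged sketch of that point using the simple model from above (the model is optimistic relative to the measured IO efficiency reported below, but the trend is what matters here):

# Per the simple model, IO efficiency for 1 client is roughly
#   (device latency + libc/kernel CPU) / (device latency + libc/kernel CPU + RocksDB CPU)
kernel_cpu_us = 11     # libc + kernel CPU per 8kb read
rocksdb_cpu_us = 8     # extra RocksDB CPU per point query (readrandom, 1 client)

for device_lat_us in (91, 182, 500):
    fio_lat = device_lat_us + kernel_cpu_us
    print(device_lat_us, round(fio_lat / (fio_lat + rocksdb_cpu_us), 2))
# ~0.93 at 91 usecs, ~0.96 at 182 usecs, ~0.98 at 500 usecs of device latency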

Benchmark hardware

I ran tests on a Beelink SER7 with a Ryzen 7 7840HS CPU that has 8 cores and 32G of RAM. The storage device is a Crucial CT1000P3PSSD8 (Crucial P3, 1TB) using ext4 with discard enabled. The OS is Ubuntu 24.04.

From fio, the average read latency for the SSD is 102 microseconds using O_DIRECT with iodepth=1 and the sync engine.

CPU frequency management makes it harder to claim that the CPU runs at X GHz, but the details are:

$ cpupower frequency-info

analyzing CPU 5:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 5
  CPUs which need to have their frequency coordinated by software: 5
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 1.60 GHz - 3.80 GHz
  available frequency steps:  3.80 GHz, 2.20 GHz, 1.60 GHz
  available cpufreq governors: conservative ... powersave performance schedutil
  current policy: frequency should be within 1.60 GHz and 3.80 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 3.79 GHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: no

Results from fio

I started with fio using a command-line like the following for NJ=1 and NJ=6 to measure average IOPs and the CPU overhead per IO.

fio --name=randread --rw=randread --ioengine=sync --numjobs=$NJ --iodepth=1 \
  --buffered=0 --direct=1 \
  --bs=8k \
  --size=400G \
  --randrepeat=0 \
  --runtime=600s --ramp_time=1s \
  --filename=G_1:G_2:G_3:G_4:G_5:G_6:G_7:G_8  \
  --group_reporting

Results are:

legend:
* iops - average reads/s reported by fio
* usPer, syPer - user, system CPU usecs per read
* cpuPer - usPer + syPer
* lat.us - average read latency in microseconds
* numjobs - the value for --numjobs with fio

iops    usPer   syPer   cpuPer  lat.us  numjobs
 9884   1.351    9.565  10.916  101.61  1
43782   1.379   10.642  12.022  136.35  6
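
One sanity check: with the sync engine and iodepth=1 each job has at most one read in flight, so IOPs should be close to numjobs * 1,000,000 / lat.us. A quick check in Python:

# With ioengine=sync and iodepth=1 each job has one read outstanding,
# so throughput is bounded by numjobs / average latency.
for numjobs, lat_us, reported_iops in ((1, 101.61, 9884), (6, 136.35, 43782)):
    print(numjobs, round(numjobs * 1_000_000 / lat_us), reported_iops)
# predicts ~9842 vs 9884 reported, and ~44004 vs 43782 reported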

Results from RocksDB

I used an edited version of my benchmark helper scripts that run db_bench. In this case the sequence of tests was:

  1. fillseq - loads the LSM tree in key order
  2. revrange - I ignore the results from this
  3. overwritesome - overwrites 10% of the KV pairs
  4. flush_mt_l0 - flushes the memtable, waits, compacts L0 to L1, waits
  5. readrandom - does random point queries when LSM tree has many levels
  6. compact - compacts LSM tree into one level
  7. readrandom2 - does random point queries when LSM tree has one level, bloom filters enabled
  8. readrandom3 - does random point queries when LSM tree has one level, bloom filters disabled

I use readrandom, readrandom2 and readrandom3 to vary the amount of work that RocksDB must do per query and to measure the CPU overhead of that work. The most work happens with readrandom because the LSM tree has many levels and there are bloom filters to check. The least work happens with readrandom3 because the LSM tree has only one level and there are no bloom filters to check.

Initially I ran tests with --block_align not set because that reduces space-amplification (less padding), but then 8kb reads are likely to cross file system page boundaries and become larger reads from storage. Given that the focus here is on IO efficiency, I enabled --block_align.

A summary of the results for db_bench with 1 user (thread) and 6 users (threads) is below. The legend for these columns is in the section on --block_align near the end of this post.

--- 1 user
qps     iops    reqsz   usPer   syPer   cpuPer  rx.lat  io.lat  test
8282     8350   8.5     11.643   7.602  19.246  120.74  101     readrandom
8394     8327   8.7      9.997   8.525  18.523  119.13  105     readrandom2
8522     8400   8.2      8.732   8.718  17.450  117.34  100     readrandom3

--- 6 users
38391   38628   8.1     14.645   7.291  21.936  156.27  134     readrandom
39359   38623   8.3     10.449   9.346  19.795  152.43  144     readrandom2
39669   38874   8.0      9.459   9.850  19.309  151.24  140     readrandom3

From the results above and the derived values below:

  • IO efficiency is approximately 0.84 at 1 client and 0.88 at 6 clients
  • With 1 user, RocksDB adds between 6.534 and 8.330 usecs of CPU time per query compared to fio, depending on the amount of work it has to do
  • With 6 users, RocksDB adds between 7.287 and 9.914 usecs of CPU time per query
  • IO latency as reported by RocksDB is ~20 usecs larger than as reported by iostat. But I have to re-read the RocksDB source code to understand where and how it is measured.

legend:
* io.eff - IO efficiency as (db_bench storage read IOPs / fio storage read IOPs)
* us.inc - incremental user CPU usecs per read as (db_bench usPer - fio usPer)
* cpu.inc - incremental total CPU usecs per read as (db_bench cpuPer - fio cpuPer)

--- 1 user

        io.eff          us.inc          cpu.inc         test
        ------          ------          ------
        0.844           10.292           8.330          readrandom
        0.842            8.646           7.607          readrandom2
        0.849            7.381           6.534          readrandom3

--- 6 users

        io.eff          us.inc          cpu.inc         test
        ------          ------          ------
        0.882           13.266           9.914          readrandom
        0.882            9.070           7.773          readrandom2
        0.887            8.080           7.287          readrandom3
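
The derived values above come straight from the fio and db_bench tables. A minimal Python sketch of the arithmetic for the 1-user case (small differences in the last digit versus the table come from rounding of the inputs):

# io.eff  = db_bench storage read IOPs / fio storage read IOPs
# us.inc  = db_bench usPer  - fio usPer
# cpu.inc = db_bench cpuPer - fio cpuPer
fio = {"iops": 9884, "usPer": 1.351, "cpuPer": 10.916}
db_bench = {
    "readrandom":  {"iops": 8350, "usPer": 11.643, "cpuPer": 19.246},
    "readrandom2": {"iops": 8327, "usPer":  9.997, "cpuPer": 18.523},
    "readrandom3": {"iops": 8400, "usPer":  8.732, "cpuPer": 17.450},
}
for test, r in db_bench.items():
    io_eff = r["iops"] / fio["iops"]
    us_inc = r["usPer"] - fio["usPer"]
    cpu_inc = r["cpuPer"] - fio["cpuPer"]
    print(f"{io_eff:.3f}  {us_inc:6.3f}  {cpu_inc:6.3f}  {test}")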

Evaluating the simple performance model

I described a simple performance model earlier in this blog post and now it is time to see how well it does for RocksDB. First I will use values from the 1 user/client/thread case:

  • IO latency is ~91 usecs per fio
  • libc+kernel CPU overhead is ~11 usecs per fio
  • RocksDB CPU overhead is 8.330, 7.607 and 6.534 usecs for readrandom, *2 and *3

The model is far from perfect as it predicts that RocksDB will sustain:

  • 9063 IOPs for readrandom, when it actually did 8350
  • 9124 IOPs for readrandom2, when it actually did 8327
  • 9214 IOPs for readrandom3, when it actually did 8400

Regardless, the model is a good way to think about the problem.
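
The predictions above are just 1,000,000 / (device latency + libc/kernel CPU + RocksDB CPU). In Python:

# Predicted IOPs = 1e6 / (device latency + libc/kernel CPU + RocksDB CPU), all in usecs
device_lat_us = 91
kernel_cpu_us = 11
rocksdb_cpu_us = {"readrandom": 8.330, "readrandom2": 7.607, "readrandom3": 6.534}
actual_iops   = {"readrandom": 8350,   "readrandom2": 8327,   "readrandom3": 8400}

for test, cpu in rocksdb_cpu_us.items():
    predicted = 1_000_000 / (device_lat_us + kernel_cpu_us + cpu)
    print(test, round(predicted), actual_iops[test])
# the predictions are roughly 8% to 10% higher than the measured IOPs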

The impact from --block_align

RocksDB QPS increases by between 7% and 9% when --block_align is enabled. Enabling it reduces read-amp and increases space-amp. But given the focus here is on IO efficiency I prefer to enable it. RocksDB QPS increases with it enabled because fewer storage read requests cross file system page boundaries, thus the average read size from storage is reduced (see the reqsz column below).

legend:
* qps - RocksDB QPS
* iops - average storage reads/s per iostat
* reqsz - average read request size in KB per iostat
* usPer, syPer, cpuPer - user, system and (user+system) CPU usecs per read
* rx.lat - average read latency in microseconds, per RocksDB
* io.lat - average read latency in microseconds, per iostat
* test - the db_bench test name

- block_align disabled
qps     iops    reqsz   usPer   syPer   cpuPer  rx.lat  io.lat  test
7629     7740   8.9     12.133   8.718  20.852  137.92  111     readrandom
7866     7813   9.1     10.094   9.098  19.192  127.12  115     readrandom2
7972     7862   8.6      8.931   9.326  18.257  125.44  110     readrandom3

- block_align enabled
qps     iops    reqsz   usPer   syPer   cpuPer  rx.lat  io.lat  test
8282     8350   8.5     11.643   7.602  19.246  120.74  101     readrandom
8394     8327   8.7      9.997   8.525  18.523  119.13  105     readrandom2
8522     8400   8.2      8.732   8.718  17.450  117.34  100     readrandom3
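
From the two tables above, a quick computation of the QPS improvement from enabling --block_align for the 1-user results shown here:

# QPS with --block_align disabled vs enabled (1 user), copied from the tables above
qps_disabled = {"readrandom": 7629, "readrandom2": 7866, "readrandom3": 7972}
qps_enabled  = {"readrandom": 8282, "readrandom2": 8394, "readrandom3": 8522}

for test in qps_disabled:
    gain = qps_enabled[test] / qps_disabled[test] - 1
    print(f"{test}: +{gain:.1%}")
# ~+8.6%, +6.7% and +6.9% for the 1-user results shown here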
