How efficient is RocksDB for workloads that are IO-bound and read-only? One way to answer this is to measure the CPU overhead from RocksDB as this is extra overhead beyond what libc and the kernel require to perform an IO. Here my focus is on KV pairs that are smaller than the typical RocksDB block size that I use -- 8kb.
By IO efficiency I mean: (storage read IOPs from RocksDB benchmark / storage read IOPs from fio)
And I measure this in a setup where RocksDB doesn't get much benefit from RocksDB block cache hits (database size > 400G, block cache size was 16G).
This value will be less than 1.0 in such a setup. But how much less than 1.0 will it be? On my hardware the IO efficiency was ~0.85 at 1 client and 0.88 at 6 clients. Were I to use storage with a 2X larger read latency, the IO efficiency would be closer to 0.95.
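To make the arithmetic concrete, this tiny sketch computes the 1-client ratio from the numbers reported below (9884 reads/s from fio, 8350 queries/s from db_bench readrandom):

# IO efficiency = (reads/s from the RocksDB benchmark) / (reads/s from fio)
# using the 1-client results reported later in this post
awk 'BEGIN { printf "IO efficiency = %.2f\n", 8350 / 9884 }'
# prints: IO efficiency = 0.84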
Note that:
- IO efficiency increases (decreases) when SSD read latency increases (decreases)
- IO efficiency increases (decreases) when the RocksDB CPU overhead decreases (increases)
- RocksDB QPS increases by ~8% for IO-bound workloads when --block_align is enabled
For the read-only, IO-bound workloads here the per-query cost is roughly (a rough model built from these numbers is sketched after this list):
- about 11 microseconds of CPU from libc + kernel
- between 6 and 10 microseconds of CPU from RocksDB
- between 100 and 150 usecs of IO latency from SSD per iostat
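A rough, first-order way to see the first two notes: treat fio's per-read time as SSD latency plus libc+kernel CPU, and the RocksDB per-query time as that plus the RocksDB CPU. The sketch below uses round numbers from the breakdown above and ignores everything else, so it shows the direction of the effect rather than reproducing the measured ratios.

# Rough model (an assumption for illustration, not how the results below were computed):
#   fio usecs per read      ~= ssd_lat + libc_kernel_cpu
#   RocksDB usecs per query ~= ssd_lat + libc_kernel_cpu + rocksdb_cpu
#   IO efficiency           ~= (fio usecs per read) / (RocksDB usecs per query)
awk 'BEGIN {
  libc_kernel = 11; rocksdb = 8;            # CPU usecs per operation
  for (ssd = 100; ssd <= 400; ssd *= 2) {   # SSD read latency in usecs
    printf "ssd_lat=%3d -> model IO efficiency = %.2f\n", ssd, (ssd + libc_kernel) / (ssd + libc_kernel + rocksdb);
  }
}'
# A fixed RocksDB CPU cost is a smaller fraction of the total when the SSD
# latency is larger, so IO efficiency moves toward 1.0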
Q and A
A: About 10 microseconds on this CPU.
A: Yes, you can.
A: It depends on how many features you need and the opportunity cost in spending time writing that code vs doing something else.
A: That is for them to answer. But all projects have a complexity budget. Code can become too expensive to maintain when that budget is exceeded. There is also the opportunity cost to consider as working on this delays work on other features.
I ran tests on a Beelink SER7 with a Ryzen 7 7840HS CPU that has 8 cores and 32G of RAM. The storage device is a Crucial CT1000P3PSSD8 (Crucial P3, 1TB) using ext4 with discard enabled. The OS is Ubuntu 24.04.
From fio, the average read latency for the SSD is 102 microseconds using O_DIRECT with iodepth=1 and the sync engine.
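A minimal fio job for that latency check might look like the following; the job name, file and runtime are placeholders, while the engine, direct IO and iodepth match the description above. The average latency is in the lat/clat lines of the output.

$ fio --name=latcheck --filename=fio.tst --size=10G \
      --rw=randread --bs=8k --direct=1 --ioengine=sync --iodepth=1 \
      --runtime=60s --time_based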
CPU frequency management makes it harder to claim that the CPU runs at X GHz, but the details are:
$ cpupower frequency-info
driver: acpi-cpufreq
CPUs which run at the same hardware frequency: 5
CPUs which need to have their frequency coordinated by software: 5
maximum transition latency: Cannot determine or is not supported.
hardware limits: 1.60 GHz - 3.80 GHz
available frequency steps: 3.80 GHz, 2.20 GHz, 1.60 GHz
available cpufreq governors: conservative ... powersave performance schedutil
current policy: frequency should be within 1.60 GHz and 3.80 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency: Unable to call hardware
current CPU frequency: 3.79 GHz (asserted by call to kernel)
boost state support:
Supported: yes
Active: no
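The performance governor was already in use here. If it isn't on your host, it can be selected with cpupower (assuming the cpupower/linux-tools package is installed):

$ sudo cpupower frequency-set -g performance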
Results from fio
I started with fio using a command line like the following for NJ=1 and NJ=6 to measure average IOPs and the CPU overhead per IO. The first line below is a sketch (the job name is a placeholder; the sync engine, O_DIRECT and iodepth=1 are as described above) and the rest is as I ran it.
fio --name=randread --rw=randread --ioengine=sync --iodepth=1 --numjobs=$NJ \
--buffered=0 --direct=1 \
--bs=8k \
--size=400G \
--randrepeat=0 \
--runtime=600s --ramp_time=1s \
--filename=G_1:G_2:G_3:G_4:G_5:G_6:G_7:G_8 \
--group_reporting
Results are:
* iops - average reads/s reported by fio
* usPer, syPer - user, system CPU usecs per read
* cpuPer - usPer + syPer, total CPU usecs per read
* lat.us - average read latency in microseconds
* numjobs - the value for --numjobs with fio
iops usPer syPer cpuPer lat.us numjobs
9884 1.351 9.565 10.916 101.61 1
43782 1.379 10.642 12.022 136.35 6
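One way to read the cpuPer column is to multiply it by the IOPs, which gives the number of CPU cores consumed by the IO path. A small sketch of that arithmetic using the numbers above:

# CPU cores used by the IO path = IOPs * (CPU usecs per IO) / 1,000,000
awk 'BEGIN {
  printf "numjobs=1: %.2f cores\n",  9884 * 10.916 / 1e6;
  printf "numjobs=6: %.2f cores\n", 43782 * 12.022 / 1e6;
}'
# prints about 0.11 and 0.53 cores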
Results from RocksDB
I used an edited version of my benchmark helper scripts that run db_bench; a hand-rolled sketch of the readrandom step is shown after this list. The sequence of tests was:
- fillseq - loads the LSM tree in key order
- revrange - I ignore the results from this
- overwritesome - overwrites 10% of the KV pairs
- flush_mt_l0 - flushes the memtable, waits, compacts L0 to L1, waits
- readrandom - does random point queries when LSM tree has many levels
- compact - compacts LSM tree into one level
- readrandom2 - does random point queries when LSM tree has one level, bloom filters enabled
- readrandom3 - does random point queries when LSM tree has one level, bloom filters disabled
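The exact flags come from my helper scripts, but a standalone readrandom run against an existing database would look roughly like the sketch below. The path, key count, duration and --bloom_bits value are placeholders; the 8k block size, the 16G block cache and --block_align come from values mentioned in this post.

db_bench \
  --benchmarks=readrandom \
  --use_existing_db=1 \
  --db=/data/rocksdb \
  --num=2000000000 \
  --duration=600 \
  --threads=1 \
  --block_size=8192 \
  --cache_size=17179869184 \
  --bloom_bits=10 \
  --block_align=true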
The results:
- IO efficiency is approximately 0.84 at 1 client and 0.88 at 6 clients
- With 1 client RocksDB adds between 6.534 and 8.330 usecs of CPU time per query compared to fio, depending on the amount of work it has to do
- With 6 clients RocksDB adds between 7.287 and 9.914 usecs of CPU time per query
- IO latency as reported by RocksDB is ~20 usecs larger than as reported by iostat. But I have to re-read the RocksDB source code to understand where and how it is measured.
- IO latency is ~91 usecs per fio
- libc+kernel CPU overhead is ~11 usecs per fio
- RocksDB CPU overhead is 8.330, 7.607 and 6.534 usecs for readrandom, readrandom2 and readrandom3
- 9063 IOPs for readrandom, when it actually did 8350
- 9124 IOPs for readrandom2, when it actually did 8327
- 9214 IOPs for readrandom3, when it actually did 8400