Tuesday, May 6, 2025

RocksDB 10.2 benchmarks: large server

This post has benchmark results for RocksDB 10.x, 9.x, 8.11, 7.10 and 6.29 on a large server.

tl;dr

  • There are several big improvements
  • There are no new regressions
  • For the block cache, hyperclock does much better than LRU on CPU-bound tests

Software

I used RocksDB versions 6.0.2, 6.29.5, 7.10.2, 8.11.4, 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.2, 9.7.4, 9.8.4, 9.9.3, 9.10.0, 9.11.2, 10.0.1, 10.1.3, 10.2.1. Everything was compiled with gcc 11.4.0.

For 8.x, 9.x and 10.x the benchmark was repeated using both the LRU block cache (older code) and hyperclock (newer code). That was done by setting the --cache_type argument:

  • lru_cache was used for versions 7.6 and earlier
  • hyper_clock_cache was used for versions 7.7 through 8.5
  • auto_hyper_clock_cache was used for versions 8.6+
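
Below is a minimal sketch, in Python, of how that version-to---cache_type mapping could be encoded. It is not the script I used; the helper name and the version parsing are just for illustration.

  # Sketch: map a RocksDB version to the db_bench --cache_type value,
  # following the list above. Not the actual test harness.
  def cache_type_for(version: str, want_hyperclock: bool) -> str:
      """Return the db_bench --cache_type value for a RocksDB version."""
      major, minor = (int(x) for x in version.split(".")[:2])
      if not want_hyperclock or (major, minor) <= (7, 6):
          return "lru_cache"
      if (major, minor) <= (8, 5):
          return "hyper_clock_cache"
      return "auto_hyper_clock_cache"

  print(cache_type_for("6.29.5", False))  # lru_cache
  print(cache_type_for("8.5.0", True))    # hyper_clock_cache
  print(cache_type_for("10.2.1", True))   # auto_hyper_clock_cache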

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Benchmark

Overviews on how I use db_bench are here and here.

All of my tests here are run with 36 threads, except for fillseq which uses 1 thread.

Tests were repeated for 3 workload+configuration setups:

  • byrx - database is cached by RocksDB
  • iobuf - database is larger than RAM and RocksDB uses buffered IO
  • iodir - database is larger than RAM and RocksDB uses O_DIRECT
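
The sketch below shows, under assumptions, how the three setups might differ as db_bench options. The key counts and cache size are placeholders rather than the values used for these tests; --use_direct_reads and --use_direct_io_for_flush_and_compaction are the standard db_bench flags for O_DIRECT.

  # Sketch: assumed db_bench options per setup. The key counts and cache
  # size are placeholders, not the values used for these tests.
  SETUP_FLAGS = {
      # byrx: database small enough to be cached by the RocksDB block cache
      "byrx": ["--num=40000000", "--cache_size=107374182400"],
      # iobuf: database larger than RAM, buffered IO (the db_bench default)
      "iobuf": ["--num=4000000000"],
      # iodir: database larger than RAM, O_DIRECT for reads, flush, compaction
      "iodir": ["--num=4000000000",
                "--use_direct_reads=true",
                "--use_direct_io_for_flush_and_compaction=true"],
  }

  def db_bench_args(setup: str, benchmark: str, threads: int) -> list[str]:
      """Assemble db_bench arguments for one benchmark step in one setup."""
      return ([f"--benchmarks={benchmark}", f"--threads={threads}"]
              + SETUP_FLAGS[setup])

  print(" ".join(["./db_bench"] + db_bench_args("iodir", "overwrite", 36)))
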
The benchmark steps named on the charts are:
  • fillseq
    • load RocksDB in key order with 1 thread
  • revrangeww, fwdrangeww
    • do reverse or forward range queries with a rate-limited writer. Report performance for the range queries.
  • readww
    • do point queries with a rate-limited writer. Report performance for the point queries.
  • overwrite
    • overwrite (via Put) random keys using many threads
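
The sketch below is a guess at how these step names might map to db_bench --benchmarks values; the mapping, the reverse iterator flag and the write rate limit are assumptions rather than the exact settings used for these results.

  # Sketch: assumed mapping from chart step names to db_bench benchmarks.
  # The *ww steps pair a read benchmark with a rate-limited writer via
  # --benchmark_write_rate_limit (bytes/sec; the value is a placeholder).
  STEP_TO_BENCHMARK = {
      "fillseq":    "fillseq",                 # load in key order, 1 thread
      "revrangeww": "seekrandomwhilewriting",  # reverse range queries + writer
      "fwdrangeww": "seekrandomwhilewriting",  # forward range queries + writer
      "readww":     "readwhilewriting",        # point queries + writer
      "overwrite":  "overwrite",               # random Puts from many threads
  }

  def step_flags(step: str) -> list[str]:
      flags = [f"--benchmarks={STEP_TO_BENCHMARK[step]}"]
      if step == "revrangeww":
          # assumed: reverse scans use the db_bench --reverse_iterator flag
          flags.append("--reverse_iterator=true")
      if step.endswith("ww"):
          flags.append("--benchmark_write_rate_limit=2097152")  # placeholder
      return flags

  print(step_flags("revrangeww"))
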
Results: byrx

Performance summaries are here for: LRU block cache, hyperclock and LRU vs hyperclock. A spreadsheet with relative QPS and charts is here.

The graphs below show relative QPS, which is: (QPS for a given version / QPS for the base case). When the relative QPS is greater than one, performance improved relative to the base case. The y-axis doesn't start at zero in most graphs to make it easier to see changes.
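
As a worked example of that arithmetic, with made-up numbers:

  # Relative QPS as plotted: QPS for a given version / QPS for the base case.
  # The numbers below are invented for illustration only.
  base_qps = 100000   # e.g. the base case (RocksDB 6.29.5)
  my_qps = 120000     # e.g. a modern RocksDB release
  print(my_qps / base_qps)  # 1.2 -> about 1.2X faster than the base case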

This chart has results for the LRU block cache and the base case is RocksDB 6.29.5:
  • overwrite
    • ~1.2X faster in modern RocksDB
  • revrangeww, fwdrangeww, readww
    • slightly faster in modern RocksDB
  • fillseq
    • ~15% slower in modern RocksDB most likely from new code added for correctness checks

This chart has results for the hyperclock block cache and the base case is RocksDB 8.11.4:
  • there are approximately zero regressions. The changes are small and might be normal variance.

This chart has results from RocksDB 10.2.1. The base case uses the LRU block cache and that is compared with hyperclock:
  • readww
    • almost 3X faster with hyperclock because it suffers the most from block cache contention
  • revrangeww, fwdrangeww
    • almost 2X faster with hyperclock
  • fillseq
    • no change with hyperclock because the workload uses only 1 thread
  • overwrite
    • no benefit from hyperclock because write stalls are the bottleneck

Results: iobuf

Performance summaries are here for: LRU block cache, hyperclock and LRU vs hyperclock. A spreadsheet with relative QPS and charts is here.

The graphs below show relative QPS, which is: (QPS for a given version / QPS for the base case). When the relative QPS is greater than one, performance improved relative to the base case. The y-axis doesn't start at zero in most graphs to make it easier to see changes.

This chart has results for the LRU block cache and the base case is RocksDB 6.29.5.
  • fillseq
    • ~1.6X faster since RocksDB 7.x
  • readww
    • ~6% faster in modern RocksDB
  • overwrite, revrangeww, fwdrangeww
    • ~5% slower since early 8.x

This chart has results for the hyperclock block cache and the base case is RocksDB 8.11.4.
  • overwrite
    • suffered from issue 12038 in versions 8.6 through 9.8. The line would be similar to what I show above had the base case been 8.5 or earlier
  • fillseq
    • ~7% faster in 10.2 relative to 8.11
  • revrangeww, fwdrangeww, readww
    • unchanged from 8.11 to 10.2

This chart has results from RocksDB 10.2.1. The base case uses the LRU block cache and that is compared with hyperclock.

  • readww
    • ~8% faster with hyperclock. The benefit here is smaller than above for byrx because the workload here is less CPU-bound
  • revrangeww, fwdrangeww, overwrite
    • slightly faster with hyperclock
  • fillseq
    • no change with hyperclock because the workload uses only 1 thread

Results: iodir

Performance summaries are here for: LRU block cache, hyperclock and LRU vs hyperclock. A spreadsheet with relative QPS and charts is here.

The graphs below show relative QPS, which is: (QPS for a given version / QPS for the base case). When the relative QPS is greater than one, performance improved relative to the base case. The y-axis doesn't start at zero in most graphs to make it easier to see changes.

This chart has results for the LRU block cache and the base case is RocksDB 6.29.5.

  • fillseq
    • ~1.6X faster since RocksDB 7.x (see results above for iobuf)
  • overwrite
    • ~1.2X faster in modern RocksDB
  • revrangeww, fwdrangeww, readww
    • unchanged from 6.29 to 10.2

This chart has results for the hyperclock block cache and the base case is RocksDB 8.11.4.

  • overwrite
    • might have a small regression (~3%) from 8.11 to 10.2
  • revrangeww, fwdrangeww, readww, fillseq
    • unchanged from 8.11 to 10.2

This chart has results from RocksDB 10.2.1. The base case uses the LRU block cache and that is compared with hyperclock.

  • the differences are small and might be small regressions, small improvements or normal variance


