This post has results for db_bench, a benchmark for RocksDB, compiled with gcc and clang. On one of my servers I saw a regression for one of the tests (fillseq) when compiling with gcc, and the result on that server didn't match what I measured on two other servers. So I repeated the tests after compiling with clang to see whether I could reproduce it.
tl;dr
- a common outcome is
- ~10% more QPS with clang+LTO than with gcc
- ~5% more QPS with clang than with gcc
- the performance gap between clang and gcc is larger in RocksDB 10.x than in earlier versions
Variance
I always worry about variance when I search for performance bugs. Variance can be misinterpreted as a performance regression and I strive to avoid that because I don't want to file bogus performance bugs.
Possible sources of variance are:
- the compiler toolchain
- a bad code layout might hurt performance by increasing cache and TLB misses
- RocksDB
- the overhead from compaction is intermittent and the LSM tree layout can help or hurt CPU overhead during reads
- hardware
- sources include noisy neighbors on public cloud servers, insufficient CPU cooling and CPU frequency management that is too clever
- benchmark client
Software
I used RocksDB versions 6.29.5, 7.10.2, 8.0, 8.4, 8.8, 8.11, 9.0, 9.4, 9.8, 9.11 and 10.0 through 10.8.
I compiled each version three times:
- gcc - using version 13.3.0
- clang - using version 18.3.1
- clang+LTO - using version 18.3.1, where LTO is link-time optimization
flags=( DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 )

# for gcc
make "${flags[@]}" static_lib db_bench

# for clang
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
make "${flags[@]}" static_lib db_bench

# for clang+LTO
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
make USE_LTO=1 "${flags[@]}" static_lib db_bench
For the block cache:
- lru_cache was used for versions 7.6 and earlier
- hyper_clock_cache was used for versions 7.7 through 8.5
- auto_hyper_clock_cache was used for versions 8.5+
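A minimal sketch of how the cache implementation can be selected on the db_bench command line. The --cache_type flag exists in recent RocksDB, but the benchmark name and cache size shown here are made-up examples, not the values I used:

```shell
# Pick the block cache implementation to match the RocksDB version:
#   7.6 and earlier -> lru_cache
#   7.7 through 8.5 -> hyper_clock_cache
#   8.5+            -> auto_hyper_clock_cache
./db_bench --benchmarks=readrandom \
  --cache_type=auto_hyper_clock_cache \
  --cache_size=$(( 8 * 1024 * 1024 * 1024 ))
```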
Hardware
I used two small servers and one large server, all running Ubuntu 22.04:
- pn53
- Ryzen 7 (AMD) CPU with 8 cores and 32G of RAM. It is v5 in the blog post
- benchmarks are run with 1 client (thread)
- arm
- an ARM server from the Google cloud -- c4a-standard-8-lssd with 8 cores and 32G of RAM, 2 local SSDs using RAID 0 and ext4
- benchmarks are run with 1 client (thread)
- hetzner
- an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G of RAM, 2 SSDs with RAID 1 (3.8T each) using ext4
- benchmarks are run with 36 clients (threads)
Benchmark
Overviews on how I use db_bench are here and here.
Tests were run for a workload with the database cached by RocksDB that I call byrx in my scripts.
- fillseq
- load RocksDB in key order with 1 thread
- revrangeww, fwdrangeww
- do reverse or forward range queries with a rate-limited writer. Report performance for the range queries.
- readww
- do point queries with a rate-limited writer. Report performance for the point queries.
- overwrite
- overwrite (via Put) random keys
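As a rough sketch of the db_bench invocations behind these steps -- the flag values are made up, and the mapping from my script names (readww) to db_bench benchmark names (readwhilewriting) is my assumption here:

```shell
# fillseq: load RocksDB in key order with 1 thread
./db_bench --benchmarks=fillseq --num=100000000 --threads=1

# readww-style step: point queries while a rate-limited writer runs;
# --benchmark_write_rate_limit caps the writer in bytes per second
./db_bench --benchmarks=readwhilewriting --use_existing_db=1 \
  --threads=1 --benchmark_write_rate_limit=$(( 2 * 1024 * 1024 ))
```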
Relative QPS
Many of the tables below (inlined and via URL) show the relative QPS, which is:
(QPS for a given version / QPS for the base version)
The base version varies and is listed below; unless stated otherwise it is RocksDB 6.29 compiled with gcc. When the relative QPS is > 1.0 the given version is faster than the base version. When it is < 1.0 there might be a performance regression or there might just be noise.
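The ratio is simple to compute; a tiny example with made-up QPS numbers (not measurements from this post):

```shell
# base: RocksDB 6.29 compiled with gcc; test: the version being compared
base_qps=100000
test_qps=110000

# relative QPS = test / base; > 1.0 means the tested version is faster
awk -v t="$test_qps" -v b="$base_qps" \
  'BEGIN { printf "relative QPS: %.2f\n", t / b }'
# prints: relative QPS: 1.10
```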
The spreadsheet with numbers and charts is here.
Results: fillseq
Results for the pn53 server
Results for the Arm server
- I am fascinated by how stable the QPS is here for clang and clang+LTO
- clang+LTO and clang provide ~3% more QPS than gcc in RocksDB 10.8
Results for the Hetzner server
- I don't show results for 6.29 or 7.x to improve readability
- the performance for RocksDB 10.8.3 with gcc is what motivated me to repeat tests with clang
- clang+LTO and clang provide ~20% more QPS than gcc in RocksDB 10.8
Results: revrangeww
Results for the pn53 server
- clang+LTO provides ~9% more QPS than gcc in RocksDB 10.8
- clang provides ~6% more QPS than gcc in RocksDB 10.8
Results for the Arm server
Results for the Hetzner server
Results: fwdrangeww
Results for the pn53 server
Results for the Arm server
Results for the Hetzner server
Results: readww
Results for the pn53 server
Results for the Arm server
Results for the Hetzner server
Results: overwrite
Results for the pn53 server
Results for the Arm server
- QPS is similar for gcc, clang and clang+LTO
Results for the Hetzner server