This post has results for db_bench, a benchmark for RocksDB, compiled with gcc and clang. On one of my servers I saw a regression for one of the tests (fillseq) when compiling with gcc, and the result on that server didn't match what I measured on two other servers. So I repeated the tests after compiling with clang to see whether I could reproduce it.
tl;dr
- a common outcome is
- ~10% more QPS with clang+LTO than with gcc
- ~5% more QPS with clang than with gcc
- the performance gap between clang and gcc is larger in RocksDB 10.x than in earlier versions
Variance
I always worry about variance when I search for performance bugs. Variance can be misinterpreted as a performance regression and I strive to avoid that because I don't want to file bogus performance bugs.
Possible sources of variance are:
- the compiler toolchain
- a bad code layout might hurt performance by increasing cache and TLB misses
- RocksDB
- the overhead from compaction is intermittent and the LSM tree layout can help or hurt CPU overhead during reads
- hardware
- sources include noisy neighbors on public cloud servers, insufficient CPU cooling and CPU frequency management that is too clever
- benchmark client
Software
I used RocksDB versions 6.29.5, 7.10.2, 8.0, 8.4, 8.8, 8.11, 9.0, 9.4, 9.8, 9.11 and 10.0 through 10.8.
I compiled each version three times:
- gcc - using version 13.3.0
- clang - using version 18.3.1
- clang+LTO - using version 18.3.1, where LTO is link-time optimization
flags=( DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 )

# for gcc
make "${flags[@]}" static_lib db_bench

# for clang
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
make "${flags[@]}" static_lib db_bench

# for clang+LTO
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
make USE_LTO=1 "${flags[@]}" static_lib db_bench
For the block cache:
- lru_cache was used for versions 7.6 and earlier
- hyper_clock_cache was used for versions 7.7 through 8.5
- auto_hyper_clock_cache was used for versions 8.5+
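A minimal sketch of how the cache implementation can be selected on the db_bench command line. The --cache_type flag exists in recent RocksDB, but the benchmark name and cache size shown here are made-up examples, not the values I used:

```shell
# Pick the block cache implementation to match the RocksDB version:
#   7.6 and earlier -> lru_cache
#   7.7 through 8.5 -> hyper_clock_cache
#   8.5+            -> auto_hyper_clock_cache
./db_bench --benchmarks=readrandom \
  --cache_type=auto_hyper_clock_cache \
  --cache_size=$(( 8 * 1024 * 1024 * 1024 ))
```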
Hardware
I used two small servers and one large server, all running Ubuntu 22.04:
- pn53
- Ryzen 7 (AMD) CPU with 8 cores and 32G of RAM. It is v5 in the blog post
- benchmarks are run with 1 client (thread)
- arm
- an ARM server from the Google cloud -- c4a-standard-8-lssd with 8 cores and 32G of RAM, 2 local SSDs using RAID 0 and ext4
- benchmarks are run with 1 client (thread)
- hetzner
- an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G of RAM, 2 SSDs with RAID 1 (3.8T each) using ext4
- benchmarks are run with 36 clients (threads)
Benchmark
Overviews on how I use db_bench are here and here.
Tests were run for a workload with the database cached by RocksDB that I call byrx in my scripts.
- fillseq
- load RocksDB in key order with 1 thread
- revrangeww, fwdrangeww
- do reverse or forward range queries with a rate-limited writer. Report performance for the range queries.
- readww
- do point queries with a rate-limited writer. Report performance for the point queries.
- overwrite
- overwrite (via Put) random keys
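As a rough sketch of the db_bench invocations behind these steps -- the flag values are made up, and the mapping from my script names (readww) to db_bench benchmark names (readwhilewriting) is my assumption here:

```shell
# fillseq: load RocksDB in key order with 1 thread
./db_bench --benchmarks=fillseq --num=100000000 --threads=1

# readww-style step: point queries while a rate-limited writer runs;
# --benchmark_write_rate_limit caps the writer in bytes per second
./db_bench --benchmarks=readwhilewriting --use_existing_db=1 \
  --threads=1 --benchmark_write_rate_limit=$(( 2 * 1024 * 1024 ))
```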
Relative QPS
Many of the tables below (inlined and via URL) show the relative QPS, which is:
(QPS for a given version / QPS for the base version)
The base version varies and is listed below; unless stated otherwise it is RocksDB 6.29 compiled with gcc. When the relative QPS is > 1.0 the given version is faster than the base version. When it is < 1.0 there might be a performance regression or there might just be noise.
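The ratio is simple to compute; a tiny example with made-up QPS numbers (not measurements from this post):

```shell
# base: RocksDB 6.29 compiled with gcc; test: the version being compared
base_qps=100000
test_qps=110000

# relative QPS = test / base; > 1.0 means the tested version is faster
awk -v t="$test_qps" -v b="$base_qps" \
  'BEGIN { printf "relative QPS: %.2f\n", t / b }'
# prints: relative QPS: 1.10
```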
The spreadsheet with numbers and charts is here.
Results: fillseq
Results for the pn53 server
Results for the Arm server
- I am fascinated by how stable the QPS is here for clang and clang+LTO
- clang+LTO and clang provide ~3% more QPS than gcc in RocksDB 10.8
Results for the Hetzner server
- I don't show results for 6.29 or 7.x to improve readability
- the performance for RocksDB 10.8.3 with gcc is what motivated me to repeat tests with clang
- clang+LTO and clang provide ~20% more QPS than gcc in RocksDB 10.8
Results: revrangeww
Results for the pn53 server
- clang+LTO provides ~9% more QPS than gcc in RocksDB 10.8
- clang provides ~6% more QPS than gcc in RocksDB 10.8
Results for the Arm server
Results for the Hetzner server
Results: fwdrangeww
Results for the pn53 server
Results for the Arm server
Results for the Hetzner server
Results: readww
Results for the pn53 server
Results for the Arm server
Results for the Hetzner server
Results: overwrite
Results for the pn53 server
Results for the Arm server
- QPS is similar for gcc, clang and clang+LTO
Results for the Hetzner server