Monday, December 1, 2025

Using db_bench to measure RocksDB performance with gcc and clang

This has results for db_bench, a benchmark for RocksDB, when compiling it with gcc and clang. On one of my servers I saw a regression on one of the tests (fillseq) when compiling with gcc. The result on that server didn't match what I measured on two other servers. So I repeated tests after compiling with clang to see if I could reproduce it.

tl;dr

  • a common outcome is
    • ~10% more QPS with clang+LTO than with gcc
    • ~5% more QPS with clang than with gcc
  • the performance gap between clang and gcc is larger in RocksDB 10.x than in earlier versions

Variance

I always worry about variance when I search for performance bugs. Variance can be misinterpreted as a performance regression and I strive to avoid that because I don't want to file bogus performance bugs.

Possible sources of variance are:

  • the compiler toolchain
    • a bad code layout might hurt performance by increasing cache and TLB misses
  • RocksDB
    • the overhead from compaction is intermittent and the LSM tree layout can help or hurt CPU overhead during reads
  • hardware
    • sources include noisy neighbors on public cloud servers, insufficient CPU cooling and CPU frequency management that is too clever
  • benchmark client
    • the way in which I run tests can create more or less variance and more information on that is here and here
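
One way to put a number on run-to-run variance is the coefficient of variation (stddev / mean) of the QPS from repeated runs of the same build on the same server. The sketch below is only an illustration: the helper name qps_cv and the one-QPS-value-per-line input format are assumptions, not something taken from the benchmark scripts.

```shell
# Hypothetical helper: read one QPS value per line on stdin and print the
# coefficient of variation (sample stddev / mean) as a percentage.
qps_cv() {
  awk '
    { n++; sum += $1; sumsq += $1 * $1 }
    END {
      if (n < 2) { print "need at least 2 samples"; exit 1 }
      mean = sum / n
      var = (sumsq - n * mean * mean) / (n - 1)   # sample variance
      if (var < 0) var = 0                        # guard against rounding
      printf "%.1f\n", 100 * sqrt(var) / mean
    }'
}
```

For example, `printf '%s\n' 90 100 110 | qps_cv` prints 10.0 (the three values are made up). A difference between two builds that is smaller than this number is hard to distinguish from noise.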

Software

I used RocksDB versions 6.29.5, 7.10.2, 8.0, 8.4, 8.8, 8.11, 9.0, 9.4, 9.8, 9.11 and 10.0 through 10.8.

I compiled each version three times:

  • gcc - using version 13.3.0
  • clang - using version 18.3.1
  • clang+LTO - using version 18.3.1, where LTO is link-time optimization

The build command lines are below.

flags=( DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 )

# for gcc
make "${flags[@]}" static_lib db_bench

# for clang
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
    make "${flags[@]}" static_lib db_bench

# for clang+LTO
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
    make USE_LTO=1 "${flags[@]}" static_lib db_bench

On the small servers I used the LRU block cache. On the large server I used the hyper clock cache when possible:
  • lru_cache was used for versions 7.6 and earlier
  • hyper_clock_cache was used for versions 7.7 through 8.5
  • auto_hyper_clock_cache was used for versions 8.5+
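
The version-to-cache mapping above can be sketched as a small shell function. This is an illustration rather than the script used for these tests, and cache_for_version is a hypothetical name. The list mentions 8.5 in two ranges, so the sketch assumes 8.5 belongs to hyper_clock_cache. That choice does not matter for the versions tested here because 8.5 is not one of them.

```shell
# Hypothetical helper: print the block cache type for a RocksDB version
# given as major.minor[.patch], following the ranges listed above.
# Assumption: version 8.5 itself is treated as hyper_clock_cache.
cache_for_version() {
  cfv_major=${1%%.*}           # text before the first dot
  cfv_rest=${1#*.}             # text after the first dot
  cfv_minor=${cfv_rest%%.*}    # minor version, patch level dropped
  cfv_num=$(( cfv_major * 100 + cfv_minor ))
  if [ "$cfv_num" -le 706 ]; then
    echo lru_cache
  elif [ "$cfv_num" -le 805 ]; then
    echo hyper_clock_cache
  else
    echo auto_hyper_clock_cache
  fi
}
```

For example, `cache_for_version 8.4` prints hyper_clock_cache and `cache_for_version 10.8` prints auto_hyper_clock_cache.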

Hardware

I used two small servers and one large server. All of them run Ubuntu 22.04:

  • pn53
    • Ryzen 7 (AMD) CPU with 8 cores and 32G of RAM. It is v5 in the blog post
    • benchmarks are run with 1 client (thread)
  • arm
    • an Arm server from Google Cloud (c4a-standard-8-lssd) with 8 cores and 32G of RAM, 2 local SSDs using RAID 0 and ext4
    • benchmarks are run with 1 client (thread)
  • hetzner
    • an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G of RAM, 2 SSDs with RAID 1 (3.8T each) using ext4
    • benchmarks are run with 36 clients (threads)

Benchmark

Overviews on how I use db_bench are here and here.

Tests were run for a workload with the database cached by RocksDB that I call byrx in my scripts.

The benchmark steps that I focus on are:
  • fillseq
    • load RocksDB in key order with 1 thread
  • revrangeww, fwdrangeww
    • do reverse or forward range queries with a rate-limited writer. Report performance for the range queries
  • readww
    • do point queries with a rate-limited writer. Report performance for the point queries.
  • overwrite
    • overwrite (via Put) random keys
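
The step names above are labels from my scripts rather than db_bench benchmark names. The sketch below shows one plausible mapping from each step to db_bench arguments. The mapping is an assumption modeled on how tools/benchmark.sh in the RocksDB repo runs similar steps, db_bench_args is a hypothetical helper, and the real command lines carry many more options (database path, key count, cache size, and a write rate limit for the steps that have a rate-limited writer).

```shell
# Hypothetical helper: print the core db_bench arguments for a benchmark step.
# Assumed mapping; options such as the writer rate limit are left out.
db_bench_args() {
  case "$1" in
    fillseq)    echo "--benchmarks=fillseq --threads=1" ;;
    fwdrangeww) echo "--benchmarks=seekrandomwhilewriting --reverse_iterator=false" ;;
    revrangeww) echo "--benchmarks=seekrandomwhilewriting --reverse_iterator=true" ;;
    readww)     echo "--benchmarks=readwhilewriting" ;;
    overwrite)  echo "--benchmarks=overwrite" ;;
    *)          echo "unknown step: $1" >&2; return 1 ;;
  esac
}
```

For example, `db_bench_args readww` prints --benchmarks=readwhilewriting, and an unrecognized step name returns a non-zero status.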

Relative QPS

Many of the tables below (inlined and via URL) show the relative QPS which is:
    (QPS for my version / QPS for RocksDB 6.29 compiled with gcc)

The base version varies and is listed below. When the relative QPS is > 1.0 then my version is faster than the base version. When it is < 1.0 then there might be a performance regression or there might just be noise.
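
The arithmetic can be sketched with two small helpers. The names relative_qps and pct_more are hypothetical, and the numbers in the usage note are made up for illustration rather than taken from the results. The second helper turns two QPS values (or two relative QPS values that share a base) into the "~N% more QPS" form used in the results sections.

```shell
# Hypothetical helper: relative QPS = QPS for my version / QPS for the base.
relative_qps() {
  awk -v mine="$1" -v base="$2" 'BEGIN { printf "%.2f\n", mine / base }'
}

# Hypothetical helper: percent by which the first value exceeds the second.
# A negative result means the first value is lower.
pct_more() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * (a / b - 1) }'
}
```

With made-up numbers, `relative_qps 110000 100000` prints 1.10 and `pct_more 1.10 1.00` prints 10.0, which would be reported as ~10% more QPS.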

The spreadsheet with numbers and charts is here.

Results: fillseq

Results for the pn53 server

  • clang+LTO provides ~15% more QPS than gcc in RocksDB 10.8
  • clang provides ~11% more QPS than gcc in RocksDB 10.8

Results for the Arm server

  • I am fascinated by how stable the QPS is here for clang and clang+LTO
  • clang+LTO and clang provide ~3% more QPS than gcc in RocksDB 10.8

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • the performance for RocksDB 10.8.3 with gcc is what motivated me to repeat tests with clang
  • clang+LTO and clang provide ~20% more QPS than gcc in RocksDB 10.8

Results: revrangeww

Results for the pn53 server

  • clang+LTO provides ~9% more QPS than gcc in RocksDB 10.8
  • clang provides ~6% more QPS than gcc in RocksDB 10.8

Results for the Arm server

  • clang+LTO provides ~11% more QPS than gcc in RocksDB 10.8
  • clang provides ~6% more QPS than gcc in RocksDB 10.8

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~8% more QPS than gcc in RocksDB 10.8
  • clang provides ~3% more QPS than gcc in RocksDB 10.8

Results: fwdrangeww

Results for the pn53 server

  • clang+LTO provides ~9% more QPS than gcc in RocksDB 10.8
  • clang provides ~4% more QPS than gcc in RocksDB 10.8

Results for the Arm server

  • clang+LTO provides ~13% more QPS than gcc in RocksDB 10.8
  • clang provides ~7% more QPS than gcc in RocksDB 10.8

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~4% more QPS than gcc in RocksDB 10.8
  • clang provides ~1% more QPS than gcc in RocksDB 10.8

Results: readww

Results for the pn53 server

  • clang+LTO provides ~6% more QPS than gcc in RocksDB 10.8
  • clang provides ~5% less QPS than gcc in RocksDB 10.8

Results for the Arm server

  • clang+LTO provides ~14% more QPS than gcc in RocksDB 10.8
  • clang provides ~2% more QPS than gcc in RocksDB 10.8

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~4% more QPS than gcc in RocksDB 10.8
  • clang provides ~1% more QPS than gcc in RocksDB 10.8

Results: overwrite

Results for the pn53 server

  • clang+LTO provides ~6% less QPS than gcc in RocksDB 10.8
  • clang provides ~8% less QPS than gcc in RocksDB 10.8
  • but for most versions there is similar QPS for gcc, clang and clang+LTO

Results for the Arm server

  • QPS is similar for gcc, clang and clang+LTO

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~2% more QPS than gcc in RocksDB 10.8
  • clang provides ~1% more QPS than gcc in RocksDB 10.8