By default RocksDB uses crc32c as a block checksum. The source is here and a benchmark is here (db_bench --benchmarks=crc32c). By default RocksDB is compiled with gcc on Linux but it is easy to switch to clang. I did that switch to compare benchmark performance between gcc and clang builds and the initial results were interesting.
The overhead for crc32c is apparent for benchmarks that do a lot of IO from fast storage (fast SSD or the OS page cache).
tl;dr for x86 HW
- For crc32c
  - gcc 9.4 throughput is more than 2X that of clang versions 10 through 13
  - the difference drops to 1.36X at clang 14
- For xxh3
  - clang versions 10 to 13 are ~1.08X faster than gcc 9.4
  - the difference drops to ~1.04X for clang 14
- clang versions 14 and 15 have similar performance, as do clang versions 10 through 13
Results
Legend:
- cc - compiler
- crc32c, xxhash, xxh3 - db_bench benchmark names
- others are abbreviated names for db_bench benchmarks: xxh64 = xxhash64, comp = compress, uncomp = uncompress
cc        crc32c  xxhash  xxh64   xxh3    comp  uncomp
gcc       19327   5036    9796    26805   661   5201
clang     8935    5043    9847    29327   658   5185
clang11   8367    5050    9849    29344   660   5184
clang12   7594    5024    9832    28392   658   4858
clang13   7571    5021    9832    29031   659   5234
clang14   14239   5008    9847    27946   660   4770
The variance is not shown here, but it was too high for the uncompress test so I ignore those results. Perhaps the test needs to run for more time.
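As a sanity check, the ratios in the tl;dr can be recomputed from the table. This is a small sketch using awk with the table pasted in as a here-document. The function name ratios is mine, and it assumes the columns are in the order given by the legend (crc32c is the first number, xxh3 is the fourth).

```shell
#!/bin/sh
# ratios: print gcc/cc for crc32c and cc/gcc for xxh3, one line per compiler.
# Assumed column order: cc crc32c xxhash xxh64 xxh3 comp uncomp
ratios() {
  awk '
    # The gcc row is the baseline, so it must be the first row.
    $1 == "gcc" { crc = $2; xxh3 = $5 }
    { printf "%-8s crc32c gcc/cc=%.2f  xxh3 cc/gcc=%.2f\n", $1, crc / $2, $5 / xxh3 }
  ' <<'EOF'
gcc 19327 5036 9796 26805 661 5201
clang 8935 5043 9847 29327 658 5185
clang11 8367 5050 9849 29344 660 5184
clang12 7594 5024 9832 28392 658 4858
clang13 7571 5021 9832 29031 659 5234
clang14 14239 5008 9847 27946 660 4770
EOF
}

ratios
```

The clang14 line should show the 1.36X and 1.04X figures quoted above.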
Setup
Most of my tests used an Intel NUC described here. It runs Ubuntu 20.04 with gcc 9.4.0 and clang 10.0.0-4ubuntu1. After noticing that gcc 9.4 was much faster than clang 10 I tried clang versions 11, 12, 13 and 14. The scripts here made it easy to install the newer versions of clang.
Other notes:
- RocksDB is compiled with -O2 and all of the needed flags/includes to get a fast crc32.
- From clang --version the versions were 11.1.0, 12.0.1, 13.0.1 and 14.0.1.
I then compiled RocksDB using gcc and clang:
make clean; make DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 -j4 static_lib db_bench; mv db_bench db_bench.gcc.use1
make clean; CC=/usr/bin/clang CXX=/usr/bin/clang++ USE_CLANG=1 make DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 -j4 static_lib db_bench; mv db_bench db_bench.clang.use1
for v in 11 12 13 14; do make clean; CC=/usr/bin/clang-${v} CXX=/usr/bin/clang++-${v} USE_CLANG=1 make DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 -j4 static_lib db_bench; mv db_bench db_bench.clang${v}.use1 ; done
And then I ran each of the CPU-intensive microbenchmarks 3 times and reported the median result:
./db_bench.$x --benchmarks=$bm --stats_per_interval=1 --stats_interval_seconds=600 2> /dev/null | grep ^"$bm"
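The run-3-times-and-take-the-median step can be scripted roughly like this. It is a sketch and not the exact script I used: the helper names (mbps, median3, median_for) are made up for illustration, and mbps assumes the db_bench result line reports throughput as a number followed by "MB/s".

```shell
#!/bin/sh
# mbps: pull the number in front of "MB/s" out of a db_bench result line.
# Assumes a line shaped like "... ops/sec; 19327.0 MB/s (4K per op)".
mbps() {
  sed -n 's/.* \([0-9.][0-9.]*\) MB\/s.*/\1/p'
}

# median3: print the middle of three numbers read from stdin, one per line.
median3() {
  sort -n | sed -n '2p'
}

# median_for: run benchmark $2 three times with binary ./db_bench.$1
# and print the median throughput.
median_for() {
  for i in 1 2 3; do
    ./db_bench."$1" --benchmarks="$2" --stats_per_interval=1 \
      --stats_interval_seconds=600 2> /dev/null | grep "^$2" | mbps
  done | median3
}

# Example (needs the binaries built above):
#   for x in gcc clang clang11; do
#     echo "$x $(median_for "$x.use1" crc32c)"
#   done
```

With only three samples the median is just the second value after a numeric sort, which is why median3 can be a one-liner.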
Compiler versions can be checked with:
$ /usr/bin/clang-12 --version
$ /usr/bin/clang-13 --version
$ /usr/bin/clang-14 --version
$ /usr/bin/clang-15 --version