This revisits my previous work to understand the impact of compilers and optimizer flags on the performance of RocksDB microbenchmarks. Benchmarks were run on Arm and x86 servers on both AWS and GCP using db_bench from RocksDB and benchHash from xxHash.
tl;dr
- Why won't AWS tell us whether Graviton3 is Neoverse N1 or N2?
- Much time will be spent figuring out which compiler flags to use for Arm
- clang on x86 has a known problem with crc32c
- Good advice appears to be: use -march=native on x86 and -mcpu=native on Arm. I don't have a strong opinion on -O2 vs -O3
- Relative to x86, Arm does worse on xxh3 than on lz4, zstd and crc32c. Using the 8kb input case to compare latency from the best results, the (c7g / c6i) ratios are: crc32c = 1.35, lz4 uncompress/compress = 1.05 / 1.17, zstd uncompress/compress = 1.15 / 1.13, and xxh3 = 2.38, so xxh3 is the outlier.
- On AWS with these single-threaded workloads, c6i (x86) was faster than c7g (Arm). I am not sure it is fair to compare the GCP CPUs (c2 is x86, t2a is Arm).
Hardware
For AWS I used c7g for Arm and c6i for x86. For GCP I used t2a for Arm and c2 for x86. See here for more info on the AWS instance types and GCP machine types. The GCP t2a is from the Arm Neoverse N1 family. There is speculation that the AWS c7g is from the Arm Neoverse N2 family but I can't find a statement from AWS on that.
All servers used Ubuntu 22.04.
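While waiting on an official answer, the guest OS does report an Arm CPU part number that can be decoded. A sketch (the part numbers are from Arm documentation as I recall them: 0xd0c is Neoverse N1, 0xd40 is Neoverse V1, 0xd49 is Neoverse N2; verify before trusting):
grep 'CPU part' /proc/cpuinfo | sort | uniq -c
lscpu | grep 'Model name'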
Benchmarks
The first set of tests were microbenchmarks from db_bench (part of RocksDB) that measure the latency per 4kb and per 8kb page for crc32c, xxh3, lz4 (de)compression and zstd (de)compression. A script for that is here.
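The invocations look roughly like these for the 8kb case (a sketch based on my reading of db_bench; verify the benchmark and flag names against the db_bench source):
./db_bench --benchmarks=crc32c --block_size=8192
./db_bench --benchmarks=xxh3 --block_size=8192
./db_bench --benchmarks=compress,uncompress --compression_type=lz4 --block_size=8192
./db_bench --benchmarks=compress,uncompress --compression_type=zstd --block_size=8192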
The second set of tests were microbenchmarks from benchHash (part of xxHash) that measure the time to do xxh3 and other hash functions for different input sizes including 4kb and 8kb. I share the xxh3 results at 4kb and 8kb.
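Building and running it is roughly this (a sketch, assuming benchHash still lives under tests/bench in the xxHash repo and that running it without arguments benchmarks the bundled hashes over its default input sizes):
cd xxHash/tests/bench
CFLAGS="-O2" make
./benchHash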
Compiling
db_bench and xxHash were compiled using clang and gcc with a variety of flags. On Ubuntu 22.04 clang is version 14.0.0-1ubuntu and gcc is version 11.3.0.
For RocksDB the make command lines are:
make DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 static_lib db_bench
make CC=/usr/bin/clang CXX=/usr/bin/clang++ DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 static_lib db_bench
For db_bench:
- to use -O3 rather than -O2 I edited the Makefile here
- to select -march or -mcpu for Arm I edited the Makefile here
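The edits amount to one-liners like these (illustrative only; a blind sed can over-match, and the exact strings in the Makefile differ across RocksDB releases, so check what is there first):
sed -i 's/-O2/-O3/g' Makefile
sed -i 's/-march=armv8-a+crc+crypto/-mcpu=native/g' Makefile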
For db_bench and x86:
- rx.gcc.march.o2 - gcc with the default, -O2 -march=native
- rx.clang.march.o2 - clang with the default, -O2 -march=native
- rx.gcc.march.o3 - gcc with -O3 -march=native
- rx.clang.march.o3 - clang with -O3 -march=native
For db_bench and Arm:
- rx.gcc.march.o2 - gcc with the default, -O2 -march=armv8-a+crc+crypto
- rx.gcc.mcpu.o2 - gcc with -O2 -mcpu=native
- rx.gcc.neo.o2 - gcc with -O2 -mcpu=neoverse-512tvb on c7g and -O2 -mcpu=neoverse-n1 on t2a
- rx.clang.march.o2 - clang with the default, -O2 -march=armv8-a+crc+crypto
- rx.clang.mcpu.o2 - clang with -O2 -mcpu=native
- rx.clang.neo.o2 - clang with -O2 -mcpu=neoverse-512tvb on c7g and -O2 -mcpu=neoverse-n1 on t2a
- rx.gcc.march.o3 - gcc with -O3 -march=armv8-a+crc+crypto
- rx.gcc.mcpu.o3 - gcc with -O3 -mcpu=native
- rx.gcc.neo.o3 - gcc with -O3 -mcpu=neoverse-512tvb on c7g and -O3 -mcpu=neoverse-n1 on t2a
- rx.clang.march.o3 - clang with -O3 -march=armv8-a+crc+crypto
- rx.clang.mcpu.o3 - clang with -O3 -mcpu=native
- rx.clang.neo.o3 - clang with -O3 -mcpu=neoverse-512tvb on c7g and -O3 -mcpu=neoverse-n1 on t2a
For xxHash and x86:
- xx.gcc.o2 - CFLAGS="-O2" make
- xx.gcc.o3 - CFLAGS="-O3" make, the default that matches what you get without CFLAGS
- xx.gcc.march.o2 - CFLAGS="-O2 -march=native" make
- xx.gcc.march.o3 - CFLAGS="-O3 -march=native" make
- xx.clang.o2 - same as xx.gcc.o2 except adds CC=/usr/bin/clang
- xx.clang.o3 - same as xx.gcc.o3 except adds CC=/usr/bin/clang
- xx.clang.march.o2 - same as xx.gcc.march.o2 except adds CC=/usr/bin/clang
- xx.clang.march.o3 - same as xx.gcc.march.o3 except adds CC=/usr/bin/clang
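Spelling one of those out, xx.clang.march.o3 is built from the xxHash source tree as:
CFLAGS="-O3 -march=native" CC=/usr/bin/clang make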
For xxHash and Arm:
- xx.gcc.o2 - gcc with CFLAGS="-O2" make
- xx.gcc.o3 - gcc with CFLAGS="-O3" make, the default that matches what you get without CFLAGS
- xx.gcc.mcpu.o2 - gcc with CFLAGS="-O2 -mcpu=native" make
- xx.gcc.mcpu.o3 - gcc with CFLAGS="-O3 -mcpu=native" make
- xx.gcc.neo.o2 - gcc with CFLAGS="-O2 -mcpu=neoverse-512tvb" make on c7g and CFLAGS="-O2 -mcpu=neoverse-n1" make on t2a
- xx.gcc.neo.o3 - gcc with CFLAGS="-O3 -mcpu=neoverse-512tvb" make on c7g and CFLAGS="-O3 -mcpu=neoverse-n1" make on t2a
- xx.clang.o2 - same as xx.gcc.o2 except adds CC=/usr/bin/clang
- xx.clang.o3 - same as xx.gcc.o3 except adds CC=/usr/bin/clang
- xx.clang.mcpu.o2 - same as xx.gcc.mcpu.o2 except adds CC=/usr/bin/clang
- xx.clang.mcpu.o3 - same as xx.gcc.mcpu.o3 except adds CC=/usr/bin/clang
- xx.clang.neo.o2 - same as xx.gcc.neo.o2 except adds CC=/usr/bin/clang
- xx.clang.neo.o3 - same as xx.gcc.neo.o3 except adds CC=/usr/bin/clang
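Likewise, for example, xx.clang.neo.o3 on c7g is:
CFLAGS="-O3 -mcpu=neoverse-512tvb" CC=/usr/bin/clang make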
Results: db_bench
The microbenchmarks and the names used for them here are:
- crc32c - time to compute crc32c for a 4kb or 8kb page
- uncomp.lz4 - time for lz4 decompression with a 4kb or 8kb page
- comp.lz4 - time for lz4 compression with a 4kb or 8kb page
- uncomp.zstd - time for zstd decompression with a 4kb or 8kb page
- comp.zstd - time for zstd compression with a 4kb or 8kb page
- c6i (x86, AWS) with 4kb and with 8kb inputs
  - clang is much worse than gcc on crc32c. Otherwise compilers and flags don't change results. The crc32c issue for clang is known and a fix is making its way upstream.
- c7g (Arm, AWS) with 4kb and with 8kb inputs
  - Compilers and flags don't change results. This one fascinates me.
  - The c6i CPU was between 1.1X and 1.25X faster than c7g.
- c2 (x86, GCP) with 4kb and with 8kb inputs
  - Same as c6i (clang is worse at crc32c, known problem)
- t2a (Arm, GCP) with 4kb and with 8kb inputs
  - Results for crc32c with rx.gcc.neo.o2 and rx.gcc.neo.o3 are ~1.15X slower than everything else. In this case neo means -mcpu=neoverse-n1.
Results: xxh3
- c6i (x86, AWS)
  - The best results come from -O3 -march=native for both clang and gcc.
- c7g (Arm, AWS)
  - The best results are from clang with -mcpu=native or -mcpu=neoverse-512tvb, and in that case -O2 vs -O3 doesn't matter. The second-best results (from gcc, or from clang with different flags) aren't far from the best.
  - The best results on c7g are more than 2X slower than the best on c6i. This difference is much larger than the differences for the db_bench microbenchmarks (crc32c, lz4, zstd). I don't know why Arm struggles more with xxh3.
- c2 (x86, GCP)
  - The best results are from gcc with -O3 -march=native, and gcc does better than clang. The benefit from adding -march=native is huge.
- t2a (Arm, GCP)
  - The best result is from xx.gcc.o3, which is odd (and reproduces). Ignoring that, the results for clang and gcc are similar.
Comments
easyaspi314 the NEON nerd here. XXH3 tends to have more trouble on ARM because of how sensitive it is to the pipeline. It is recommended to toy around with XXH3_NEON_LANES to get the best performance when optimizing for a specific target. By default it is set to 6 on generic ARM processors because that is best for the average mobile Cortex, but especially on higher-end chips this may vary.
My reply: Thank you. There will be updates from me once I follow the advice.
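For reference, XXH3_NEON_LANES is a compile-time macro, so experimenting with it should just be a matter of adding it to CFLAGS. A sketch (the value 8 is an arbitrary example, not a recommendation):
CFLAGS="-O3 -mcpu=native -DXXH3_NEON_LANES=8" CC=/usr/bin/clang make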