This revisits my previous work to understand the impact of compilers and optimizer flags on the performance of RocksDB microbenchmarks. Benchmarks were run on Arm and x86 servers on both AWS and GCP using db_bench from RocksDB and benchHash from xxHash.
tl;dr
- Why won't AWS tell us whether Graviton3 is Neoverse N1 or N2?
- Much time will be spent figuring out which compiler flags to use for Arm
- clang on x86 has a known problem with crc32c
- Good advice appears to be: use -march=native on x86 and -mcpu=native on Arm. I don't have a strong opinion on -O2 vs -O3
- Relative to x86, Arm does worse on xxh3 than on lz4, zstd and crc32c. Using the 8kb input case to compare latency from the best results, the (c7g / c6i) ratios are: crc32c = 1.35, lz4 uncompress/compress = 1.05 / 1.17, zstd uncompress/compress = 1.15 / 1.13, and xxh3 = 2.38, so xxh3 is the outlier.
- On AWS with these single-threaded workloads, c6i (x86) was faster than c7g (Arm). I am not sure it is fair to compare the GCP CPUs (c2 is x86, t2a is Arm).
Hardware
For AWS I used c7g for Arm and c6i for x86. For GCP I used t2a for Arm and c2 for x86. See here for more info on the AWS instance types and GCP machine types. The GCP t2a is from the Arm Neoverse N1 family. There is speculation that the AWS c7g is from the Arm Neoverse N2 family but I can't find a statement from AWS on that.
All servers used Ubuntu 22.04.
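While waiting on an official answer, the guest OS does report an Arm CPU part number that can be decoded. A sketch (the part numbers are from Arm documentation as I recall them: 0xd0c is Neoverse N1, 0xd40 is Neoverse V1, 0xd49 is Neoverse N2; verify before trusting):
grep 'CPU part' /proc/cpuinfo | sort | uniq -c
lscpu | grep 'Model name'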
Benchmarks
The first set of tests were microbenchmarks from db_bench (part of RocksDB) that measure the latency per 4kb and per 8kb page for crc32c, xxh3, lz4 (de)compression and zstd (de)compression. A script for that is here.
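The invocations look roughly like these for the 8kb case (a sketch based on my reading of db_bench; verify the benchmark and flag names against the db_bench source):
./db_bench --benchmarks=crc32c --block_size=8192
./db_bench --benchmarks=xxh3 --block_size=8192
./db_bench --benchmarks=compress,uncompress --compression_type=lz4 --block_size=8192
./db_bench --benchmarks=compress,uncompress --compression_type=zstd --block_size=8192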
The second set of tests were microbenchmarks from benchHash (part of xxHash) that measure the time to do xxh3 and other hash functions for different input sizes including 4kb and 8kb. I share the xxh3 results at 4kb and 8kb.
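Building and running it is roughly this (a sketch, assuming benchHash still lives under tests/bench in the xxHash repo and that running it without arguments benchmarks the bundled hashes over its default input sizes):
cd xxHash/tests/bench
CFLAGS="-O2" make
./benchHash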
Compiling
db_bench and xxHash were compiled using clang and gcc with a variety of flags. On Ubuntu 22.04 clang is version 14.0.0-1ubuntu and gcc is version 11.3.0.
For RocksDB the make command lines are:
make DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 static_lib db_bench
make CC=/usr/bin/clang CXX=/usr/bin/clang++ DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 static_lib db_bench
For db_bench:
- to use -O3 rather than -O2 I edited the Makefile here
- to select -march or -mcpu for Arm I edited the Makefile here
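The edits amount to one-liners like these (illustrative only; a blind sed can over-match, and the exact strings in the Makefile differ across RocksDB releases, so check what is there first):
sed -i 's/-O2/-O3/g' Makefile
sed -i 's/-march=armv8-a+crc+crypto/-mcpu=native/g' Makefile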
For db_bench and x86:
- rx.gcc.march.o2 - gcc with the default, -O2 -march=native
- rx.clang.march.o2 - clang with the default, -O2 -march=native
- rx.gcc.march.o3 - gcc with -O3 -march=native
- rx.clang.march.o3 - clang with -O3 -march=native
For db_bench and Arm:
- rx.gcc.march.o2 - gcc with the default, -O2 -march=armv8-a+crc+crypto
- rx.gcc.mcpu.o2 - gcc with -O2 -mcpu=native
- rx.gcc.neo.o2 - gcc with -O2 -mcpu=neoverse-512tvb on c7g and -O2 -mcpu=neoverse-n1 on t2a
- rx.clang.march.o2 - clang with the default, -O2 -march=armv8-a+crc+crypto
- rx.clang.mcpu.o2 - clang with -O2 -mcpu=native
- rx.clang.neo.o2 - clang with -O2 -mcpu=neoverse-512tvb on c7g and -O2 -mcpu=neoverse-n1 on t2a
- rx.gcc.march.o3 - gcc with -O3 -march=armv8-a+crc+crypto
- rx.gcc.mcpu.o3 - gcc with -O3 -mcpu=native
- rx.gcc.neo.o3 - gcc with -O3 -mcpu=neoverse-512tvb on c7g and -O3 -mcpu=neoverse-n1 on t2a
- rx.clang.march.o3 - clang with -O3 -march=armv8-a+crc+crypto
- rx.clang.mcpu.o3 - clang with -O3 -mcpu=native
- rx.clang.neo.o3 - clang with -O3 -mcpu=neoverse-512tvb on c7g and -O3 -mcpu=neoverse-n1 on t2a
For xxHash and x86:
- xx.gcc.o2 - CFLAGS="-O2" make
- xx.gcc.o3 - CFLAGS="-O3" make, the default that matches what you get without CFLAGS
- xx.gcc.march.o2 - CFLAGS="-O2 -march=native" make
- xx.gcc.march.o3 - CFLAGS="-O3 -march=native" make
- xx.clang.o2 - same as xx.gcc.o2 except adds CC=/usr/bin/clang
- xx.clang.o3 - same as xx.gcc.o3 except adds CC=/usr/bin/clang
- xx.clang.march.o2 - same as xx.gcc.march.o2 except adds CC=/usr/bin/clang
- xx.clang.march.o3 - same as xx.gcc.march.o3 except adds CC=/usr/bin/clang
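Spelling one of those out, xx.clang.march.o3 is built from the xxHash source tree as:
CFLAGS="-O3 -march=native" CC=/usr/bin/clang make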
For xxHash and Arm:
- xx.gcc.o2 - gcc with CFLAGS="-O2" make
- xx.gcc.o3 - gcc with CFLAGS="-O3" make, the default that matches what you get without CFLAGS
- xx.gcc.mcpu.o2 - gcc with CFLAGS="-O2 -mcpu=native" make
- xx.gcc.mcpu.o3 - gcc with CFLAGS="-O3 -mcpu=native" make
- xx.gcc.neo.o2 - gcc with CFLAGS="-O2 -mcpu=neoverse-512tvb" make on c7g and CFLAGS="-O2 -mcpu=neoverse-n1" make on t2a
- xx.gcc.neo.o3 - gcc with CFLAGS="-O3 -mcpu=neoverse-512tvb" make on c7g and CFLAGS="-O3 -mcpu=neoverse-n1" make on t2a
- xx.clang.o2 - same as xx.gcc.o2 except adds CC=/usr/bin/clang
- xx.clang.o3 - same as xx.gcc.o3 except adds CC=/usr/bin/clang
- xx.clang.mcpu.o2 - same as xx.gcc.mcpu.o2 except adds CC=/usr/bin/clang
- xx.clang.mcpu.o3 - same as xx.gcc.mcpu.o3 except adds CC=/usr/bin/clang
- xx.clang.neo.o2 - same as xx.gcc.neo.o2 except adds CC=/usr/bin/clang
- xx.clang.neo.o3 - same as xx.gcc.neo.o3 except adds CC=/usr/bin/clang
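Likewise, for example, xx.clang.neo.o3 on c7g is:
CFLAGS="-O3 -mcpu=neoverse-512tvb" CC=/usr/bin/clang make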
Results: db_bench
The microbenchmarks and the names used for them here are:
- crc32c - time to compute crc32c for a 4kb or 8kb page
- uncomp.lz4 - time for lz4 decompression with a 4kb or 8kb page
- comp.lz4 - time for lz4 compression with a 4kb or 8kb page
- uncomp.zstd - time for zstd decompression with a 4kb or 8kb page
- comp.zstd - time for zstd compression with a 4kb or 8kb page
- c6i (x86, AWS) with 4kb and with 8kb inputs
  - clang is much worse than gcc on crc32c. Otherwise compilers and flags don't change results. The crc32c issue for clang is known and a fix is making its way upstream.
- c7g (Arm, AWS) with 4kb and with 8kb inputs
  - Compilers and flags don't change results. This one fascinates me.
  - The c6i CPU was between 1.1X and 1.25X faster than c7g.
- c2 (x86, GCP) with 4kb and with 8kb inputs
  - Same as c6i (clang is worse at crc32c, known problem)
- t2a (Arm, GCP) with 4kb and with 8kb inputs
  - Results for crc32c with rx.gcc.neo.o2 and rx.gcc.neo.o3 are ~1.15X slower than everything else. In this case neo means -mcpu=neoverse-n1.
Results: xxh3
- c6i (x86, AWS)
  - The best results come from -O3 -march=native for both clang and gcc.
- c7g (Arm, AWS)
  - The best results are from clang with -mcpu=native or -mcpu=neoverse-512tvb, and in that case -O2 vs -O3 doesn't matter. The second-best results (from gcc, or from clang with different flags) aren't far from the best.
  - The best results on c7g are more than 2X slower than the best on c6i. This difference is much larger than the differences for the db_bench microbenchmarks (crc32c, lz4, zstd). I don't know why Arm struggles more with xxh3.
- c2 (x86, GCP)
  - The best results are from gcc with -O3 -march=native, and gcc does better than clang. The benefit from adding -march=native is huge.
- t2a (Arm, GCP)
  - The best result is from xx.gcc.o3, which is odd (and reproduces). Ignoring that, the results for clang and gcc are similar.
Comments
easyaspi314 the NEON nerd here. XXH3 tends to have more trouble on ARM because of how sensitive it is to the pipeline. It is recommended to toy around with XXH3_NEON_LANES to get the best performance when optimizing for a specific target. By default it is set to 6 on generic ARM processors because that is best for the average mobile Cortex, but especially on higher-end chips this may vary.
My reply: Thank you. There will be updates from me once I follow the advice.
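For reference, XXH3_NEON_LANES is a compile-time macro, so experimenting with it should just be a matter of adding it to CFLAGS. A sketch (the value 8 is an arbitrary example, not a recommendation):
CFLAGS="-O3 -mcpu=native -DXXH3_NEON_LANES=8" CC=/usr/bin/clang make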