Monday, April 25, 2022

clang vs crc32c in RocksDB

By default RocksDB uses crc32c as a block checksum. The source is here and a benchmark is here (db_bench --benchmarks=crc32c). By default RocksDB is compiled with gcc on Linux but it is easy to switch to clang. I did that switch to compare benchmark performance between gcc and clang builds and the initial results were interesting.

The overhead for crc32c is apparent for benchmarks that do a lot of IO from fast storage (fast SSD or the OS page cache).

tl;dr for x86 HW

  • For crc32
    • throughput with gcc 9.4 is more than 2X that of clang for clang versions < 14
    • the difference drops to 1.36X at clang version 14
  • For xxh3
    • clang versions 10 to 13 are ~1.08X faster than gcc 9.4
    • the difference drops to ~1.04X faster for clang 14
  • clang versions 14 and 15 have similar performance, as do clang versions 10 through 13

Update - issue 55153 filed for LLVM

Results

Legend:

  • cc - compiler
  • clang - the system default clang, version 10.0.0-4ubuntu1 (the other clang rows are named by version)
  • crc32c, xxhash, xxh3 - db_bench benchmark names
  • others are abbreviated names for db_bench benchmarks: xxh64 = xxhash64, comp = compress, uncomp = uncompress

The numbers in the table are the throughput in MB/s reported by db_bench.

cc      crc32c  xxhash  xxh64   xxh3    comp    uncomp
gcc     19327   5036    9796    26805   661     5201
clang    8935   5043    9847    29327   658     5185
clang11  8367   5050    9849    29344   660     5184
clang12  7594   5024    9832    28392   658     4858
clang13  7571   5021    9832    29031   659     5234
clang14 14239   5008    9847    27946   660     4770
clang15 14266   5027    9811    28020   660     5170

The results for the uncompress test had too much variance, so I ignore them. Perhaps the test needs to run for more time.

Setup

Most of my tests used an Intel NUC described here. This has Ubuntu 20.04 with gcc 9.4.0 and clang 10.0.0-4ubuntu1. After noticing that gcc 9.4 was much faster than clang 10 I tried clang versions 11, 12, 13, 14 and 15, and the scripts here made it easy to install the newer versions of clang.
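Assuming those are the apt.llvm.org helper scripts, the install looks roughly like:

# sketch assuming the apt.llvm.org helper script; adjust the version number as needed
wget https://apt.llvm.org/llvm.sh
chmod +x llvm.sh
sudo ./llvm.sh 14   # repeat with 11, 12, 13, 15 to get the other versions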

Other notes:

  • RocksDB is compiled with -O2 and all of the needed flags/includes to get a fast (hardware) crc32. A quick way to confirm that is sketched after this list.
  • From clang --version the versions were 11.1.0, 12.0.1, 13.0.1, 14.0.1 and 15.0.0.
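One way to confirm the fast crc32 path is compiled in (a rough check, and the binary name assumes the builds shown below):

# check that the CPU supports the SSE4.2 crc32 instruction
grep -m1 -o sse4_2 /proc/cpuinfo

# rough check: count crc32 occurrences in the disassembly; a build that fell
# back to the software crc32c path will show far fewer
objdump -d db_bench.gcc.use1 | grep -c crc32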

I then compiled RocksDB using gcc and clang:

# Compile for gcc 9.4
make clean; make DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 -j4 static_lib db_bench; mv db_bench db_bench.gcc.use1

# Compile for clang 10.0.0-4ubuntu1
make clean; CC=/usr/bin/clang CXX=/usr/bin/clang++ USE_CLANG=1 make DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 -j4 static_lib db_bench; mv db_bench db_bench.clang.use1

# Compile for clang versions 11, 12, 13, 14, 15
for v in 11 12 13 14 15; do make clean; CC=/usr/bin/clang-${v} CXX=/usr/bin/clang++-${v} USE_CLANG=1 make DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 -j4 static_lib db_bench; mv db_bench db_bench.clang${v}.use1 ; done

And then I ran each of the CPU-intensive microbenchmarks 3 times and reported the median result:


for bm in crc32c xxhash xxhash64 xxh3 compress uncompress ; do
  echo; echo $bm; echo
  for x in gcc.use1 clang.use1 clang11.use1 clang12.use1 clang13.use1 clang14.use1 clang15.use1 ; do
    echo; echo $x
    for z in 1 2 3; do
      ./db_bench.$x --benchmarks=$bm --stats_per_interval=1 --stats_interval_seconds=600 2> /dev/null | grep ^"$bm"
    done
  done
done
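To pick the median of the three MB/s values after extracting them, something like this works (the numbers are placeholders):

printf '%s\n' 100 105 98 | sort -n | sed -n '2p'   # prints 100, the median of the three values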

Clang versions

$ /usr/bin/clang-11 --version
Ubuntu clang version 11.1.0-++20211011094159+1fdec59bffc1-1~exp1~20211011214622.5
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ /usr/bin/clang-12 --version
Ubuntu clang version 12.0.1-++20211029101322+fed41342a82f-1~exp1~20211029221816.4
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ /usr/bin/clang-13 --version
Ubuntu clang version 13.0.1-++20220120110924+75e33f71c2da-1~exp1~20220120231001.58
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ /usr/bin/clang-14 --version
Ubuntu clang version 14.0.1-++20220423123024+9a3e81e1f91f-1~exp1~20220423003108.124
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ /usr/bin/clang-15 --version
Ubuntu clang version 15.0.0-++20220424052740+3f0f20366622-1~exp1~20220424172822.231
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

Monday, April 4, 2022

Becoming less confused about perf record

I previously wrote about generating bogus flamegraphs and now I can be more clear about what I didn't understand. The default behavior for perf record is frequency-based and on modern Linux you probably get perf record -F 4000 as the default. For now assume the command line also used -p $pid and did not use -e. So in that case perf record does sampling of one process for the cycles event.
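A concrete invocation of that default mode, with illustrative values for -F, the pid and the duration:

# frequency-based sampling of one process for the cycles event
perf record -F 4000 -g -p $pid -- sleep 60
perf report              # summarizes where the cycles were attributed
perf script > out.perf   # raw samples, including the per-sample period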

Disclaimer -- I am still waiting for an expert to review this.
Update - a fix was pushed

With frequency-based sampling (-F $frequency) perf record tries to generate $frequency samples per second. The perf tutorial wiki does a good job explaining what happens next. A sample (stack trace) is taken when the PMU cycles counter overflows. The challenge is that the number of cycles consumed per second by your process varies (more when it is CPU-bound, less when it is IO or mutex bound). So the sample period (the number of cycles at which overflow occurs) must be adjusted dynamically to provide $frequency samples/second. A key point here is that perf is sampling based on the number of cycles consumed by the process named by -p. If it were instead sampling on wall-clock time, for example running PMP once/second, then no adjustment would be needed (just wake up $frequency times/second and take samples).
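Some made-up arithmetic to illustrate the adjustment:

# illustrative arithmetic only, not measured values
echo $(( 2000000000 / 4000 ))   # -> 500000 cycles per sample for a process burning ~2B cycles/second
echo $((  200000000 / 4000 ))   # -> 50000 cycles per sample for a mostly idle process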

The sample periods are displayed in perf script output and here is an example where 1 cycles and 258 cycles are the sample periods. The sample periods also serve as the weight of a sample. For example, if there are 3 stack traces each with a sample period like the following, then perf report will report that g,h,i accounts for 50% of the time.

stack   period/cycles
a,b,c   1
d,e,f   1
g,h,i   2
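The arithmetic for that 50%, since the sample periods are the weights:

# weights sum to 1 + 1 + 2 = 4, and g,h,i contributes 2 of those 4 cycles
echo "scale=2; 2 / (1 + 1 + 2)" | bc   # prints .50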

The equivalence of sample period with what I called weight in my previous post confused me at first. If perf record is run with -c instead of -F then sample periods are not adjusted and all samples have the same sample period (same weight, a value likely to be much larger than 1) which is another workaround for FlameGraph issue 165 as suggested in the issue report.
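For contrast, a fixed-period run (the -c value is arbitrary):

# every sample now carries the same weight, so tools that ignore the period
# (like the unpatched stackcollapse-perf.pl) still get the ratios right
perf record -c 1000000 -g -p $pid -- sleep 60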

As I wrote in a LinkedIn discussion: this is interesting and complicated. Given perf record -g -p $pid, a stack trace taken after a sample period of X cycles is assumed to represent where CPU was consumed by that process for the entire X cycles of that sample period. When the sample period is constant for all stack traces (perf record -c) this doesn't matter because all stack traces count equally (have equal weight). But it does matter when the sample period is adjusted (perf record -F or plain perf record) and stack traces have different sample periods (the X in X cycles changes), as in this example.

In addition to the links above, these helped reduce my confusion. So did a co-worker from the kernel team:


Sunday, April 3, 2022

Examples of the RocksDB trivial move optimization

I recently wrote two posts (here and here) on the trivial move optimization in RocksDB and now have a third post that provides examples via db_bench to demonstrate where it works, the impact of using compression on some levels and the complexity introduced by dynamic leveled. All of the examples use db_bench and the fillseq benchmark that inserts keys in ascending order. The four examples are:

  • dynamic level enabled, compression disabled
  • dynamic level disabled, compression disabled
  • dynamic level enabled, compression enabled
  • dynamic level disabled, compression enabled

Configuration

Dynamic leveled is explained by this post, enabled by this configuration option and set on the db_bench command line by --level_compaction_dynamic_level_bytes.

Compression is enabled by the db_bench flags --min_level_to_compress=2 --compression_type=lz4.
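Putting those together, the fillseq load for the compressed + dynamic leveled case looks roughly like this. Aside from the three flags above the values are placeholders, not the exact command lines behind the links:

./db_bench --benchmarks=fillseq \
  --level_compaction_dynamic_level_bytes=true \
  --min_level_to_compress=2 \
  --compression_type=lz4 \
  --num=100000000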

Dynamic level enabled, compression disabled

The db_bench command line and output are here. This shows the best case as there are no writes for levels beyond L0 -- Write(GB)=0 and Moved(GB) > 0. Trivial move cannot be done for writes into L0 (memtable flush) because the LSM has to write data at least once.

Dynamic level disabled, compression disabled

The db_bench command line and output are here. As in the previous example this shows the best case as there are no writes for levels beyond L0 -- Write(GB)=0 and Moved(GB) > 0. Trivial move cannot be done for writes into L0 (memtable flush) because the LSM has to write data at least once.

Dynamic level enabled, compression enabled

The db_bench command line and output are here. This shows the impact of compression. When Ln and Ln+1 don't have the same style of compression then trivial move cannot be done from Ln to Ln+1. This is complicated by dynamic leveled because the LSM tree starts with L0 then grows into L0,L6 then grows into L0,L5,L6 and finally reaches L0,L4,L5,L6. Note I am using the level names displayed in compaction IO statistics and I consider these to be physical names. The logical names would be L0,L1,L2,L3. This is one source of complexity created by a great feature (dynamic leveled).

When the tree has data in L0,L5,L6 then the --min_level_to_compress option means that only L6 should be compressed. When the tree reaches L0,L4,L5,L6 then L5 and L6 should be compressed. This is another source of complexity from dynamic leveled.

This explains why both Write(GB) and Moved(GB) are greater than zero for L5 and L6. That doesn't occur with dynamic level disabled.

  • For L0 all bytes written are from memtable flush, Write(GB) > 0 & Moved(GB)=0.
  • For L4 only trivial move is done, Write(GB)=0 & Moved(GB) > 0
  • For L5 both Write(GB) > 0 & Moved(GB) > 0. 
    • With L0,L5,L6  trivial moves are done from L0 to L5 with uncompressed data
    • With L0,L4,L5,L6 writes are done from L4 to L5 because L4 is uncompressed and L5 uses lz4
  • For L6 both Write(GB) > 0 & Moved(GB) > 0. 
    • With L0,L6 trivial moves are done from L0 to L6 with uncompressed data
    • With L0,L5,L6 writes are done from L5 to L6 because L5 is uncompressed and L6 uses lz4
    • With L0,L4,L5,L6 trivial moves are done from L5 to L6 because both use lz4
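The Write(GB) and Moved(GB) numbers come from the compaction IO statistics printed by db_bench. A rough way to pull out just the per-level rows from saved output (the header string and row prefixes match what I see, but may vary by RocksDB version):

# keep only the per-level and Sum rows of the compaction stats table
grep -A 20 'Compaction Stats' db_bench_output.txt | grep -E '^ +(L[0-9]+|Sum)'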


Dynamic level disabled, compression enabled

The db_bench command line and output are here. When Ln and Ln+1 don't have the same style of compression then trivial move cannot be done from Ln to Ln+1. However dynamic leveled is disabled so the compaction IO statistics are easier to read.

  • For L0 all bytes written are from memtable flush, Write(GB) > 0 & Moved(GB)=0.
  • For L1 only trivial move is done, Write(GB)=0 & Moved(GB) > 0
  • For L2 only writes are done, Write(GB) > 0 & Moved(GB)=0, because L1 is uncompressed & L2 uses lz4
  • For L3 only trivial move is done, Write(GB)=0 & Moved(GB) > 0, because L2 and L3 use lz4

Friday, April 1, 2022

One problem with flamegraphs

In the past I used PMP a lot with some usage of perf. PMP was great for me because my focus was on stalls from IO and synchronization. PMP is simple but good for showing off-cpu stalls. As I catch up to modern performance tools I have begun to use flamegraphs with perf. However, I also started to share bogus flamegraphs because there is a bug and I did not validate that the flamegraphs matched the output from perf report. Fortunately, Domas pointed out the problem.

The bug is FlameGraph issue 165 and the problem is that stackcollapse-perf.pl assumes that all stack traces are weighted equally but the output from perf script does not match that assumption. In the perf script output the stack traces have weights like 1 cycles and 258 cycles (collected via a script like this). If you use those values to compute time spent in each stack trace then your results will match those provided by perf report (at least in the cases I checked). But stackcollapse-perf.pl doesn't parse that number and just assumes each stack trace has weight 1 (see here).
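For context, a typical pipeline looks roughly like this (script names are from the FlameGraph repo, paths and durations are placeholders). The folding step is where the assumption bites:

perf record -F 4000 -g -p $pid -- sleep 60
perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded   # issue 165: the per-sample period from perf script is ignored here
./flamegraph.pl out.folded > flame.svg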

I started with PR 250 but it is ~20 diffs behind and I am not sure it is correct. My variant of PR 250 that works for me is here. And now I get valid flamegraphs again that are extremely useful. Note that I am not an expert in Perl, perf or the FlameGraph repo.

Update