Thursday, October 17, 2024

The impact of PGO, LTO and more for MySQL on a small CPU

This has results for MySQL 8.0.28 vs sysbench on a laptop-class CPU (AMD 4700u) to understand the impact of various compile-time optimizations including LTO and PGO. I will have results from a server-class CPU next week. I previously published results on this but then retracted some of them as they appeared bogus.

tl;dr

  • clang and gcc have similar performance with a few exceptions
    • clang does ~7% better than gcc with PGO
    • clang does ~25% better than gcc with PGO+LTO
    • oddly, gcc with PGO is ~3% faster than gcc with PGO+LTO
  • LTO makes things ~6% faster
  • PGO or PGO+LTO makes things between 1.4X and 1.8X faster
  • gcc and clang without -march=native -mtune=native are ~2% faster than with those options

Builds

I compiled upstream MySQL 8.0.28 from source. The my.cnf file is here

All work (builds & tests) was done on a server that uses Ubuntu 22.04.5. The gcc version is 11.4.0 and the clang version is 14.0.0-1ubuntu1.1.

The CMake command lines for all of the builds are here. All builds use CMAKE_BUILD_TYPE=Release. The builds are listed below and the name of the CMake command line file per build is cmk.80.$val where $val is the build name listed below:
  • rel_o2nofp
    • uses gcc, -O2 instead of -O3, adds -fno-omit-frame-pointer, works great for flamegraphs
  • rel_o2nofp_clang
    • uses clang, -O2 instead of -O3, adds -fno-omit-frame-pointer, works great for flamegraphs
  • rel
    • uses gcc, -O3, doesn't work great for flamegraphs
  • rel_clang
    • uses clang, -O3, doesn't work great for flamegraphs
  • rel_o2nofp_lto
    • like rel_o2nofp but adds -DWITH_LTO=ON to get link-time optimization
  • rel_o2nofp_lto_clang
    • like rel_o2nofp_clang but adds -DWITH_LTO=ON to get link-time optimization
  • rel_lto
    • like rel but adds -DWITH_LTO=ON to get link-time optimization
  • rel_lto_clang
    • like rel_clang but adds -DWITH_LTO=ON to get link-time optimization
  • rel_o2nofp_native
    • like rel_o2nofp but adds -march=native -mtune=native
  • rel_o2nofp_native_clang
    • like rel_o2nofp_clang but adds -march=native -mtune=native
  • rel_native
    • like rel but adds -march=native -mtune=native
  • rel_native_clang
    • like rel_clang but adds -march=native -mtune=native
  • rel_o2nofp_native_lto
    • like rel_o2nofp but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_o2nofp_native_lto_clang
    • like rel_o2nofp_clang but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_native_lto
    • like rel but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_native_lto_clang
    • like rel_clang but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_pgo_use
    • like rel but uses PGO: compiled with -fprofile-use after profiles were collected from a -fprofile-generate build
  • rel_pgo_use_clang
    • like rel_clang but uses PGO: compiled with -fprofile-use after profiles were collected from a -fprofile-generate build
  • rel_lto_pgo_use
    • like rel_pgo_use but adds -DWITH_LTO=ON
  • rel_lto_pgo_use_clang
    • like rel_pgo_use_clang but adds -DWITH_LTO=ON
For PGO I do the following:
  • compile mysqld with -fprofile-generate
  • run sysbench to generate profiles
  • recompile mysqld with -fprofile-use using those profiles
  • rerun sysbench to get the results that I report.
For PGO with gcc I use -fprofile-generate rather than -fprofile-generate=$path, so when I run mysqld the gcda files are written into the subdirectories that hold the object files from when I compiled MySQL. That worked because I used the same host for compiling and testing.
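
The two compile phases can be sketched as a small helper that returns the extra flags per phase. This is only an illustration: the function names are hypothetical, and passing the flags via CMAKE_C_FLAGS and CMAKE_CXX_FLAGS is an assumption, since the real command lines are in the linked cmk.80.$val files. For clang the sketch assumes the merged profile file is passed as -fprofile-use=$path.

```python
from typing import List, Optional


def pgo_compiler_flags(compiler: str, phase: str,
                       profdata: Optional[str] = None) -> str:
    """Return the extra compiler flag for one PGO phase.

    phase is "generate" (instrumented build) or "use" (optimized rebuild).
    """
    if phase == "generate":
        return "-fprofile-generate"
    if phase != "use":
        raise ValueError("phase must be 'generate' or 'use'")
    if compiler == "clang":
        # Assumption: clang is given the merged profile file explicitly.
        if profdata is None:
            raise ValueError("clang needs the merged profdata file")
        return "-fprofile-use=" + profdata
    # gcc reads the gcda files that sit next to the object files.
    return "-fprofile-use"


def cmake_args(compiler: str, phase: str, profdata: Optional[str] = None,
               lto: bool = False) -> List[str]:
    """Build hypothetical CMake arguments for one PGO phase."""
    flags = pgo_compiler_flags(compiler, phase, profdata)
    args = [
        "-DCMAKE_BUILD_TYPE=Release",
        "-DCMAKE_C_FLAGS=" + flags,    # assumption: flags passed this way
        "-DCMAKE_CXX_FLAGS=" + flags,  # assumption: flags passed this way
    ]
    if lto:
        args.append("-DWITH_LTO=ON")
    return args
```

Calling cmake_args("gcc", "generate") gives the arguments for the instrumented build and cmake_args("gcc", "use", lto=True) gives the ones for the PGO+LTO rebuild.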

For PGO with clang I use -fprofile-generate rather than -fprofile-generate=$path and when I ran mysqld the profraw files (the raw profiling data) were empty. Then I prepended this to the mysqld command line, but the profraw files were still empty:
  LLVM_PROFILE_FILE="code-%p.profraw"

The problem appears to be that mysqld calls chdir($data-dir) at startup which confuses the clang profiling support. But making that path absolute fixes the problem:
    LLVM_PROFILE_FILE="$PWD/code-%p.profraw"

By mysqld command line I mean that I started mysqld via mysqld_safe, did ps to see the command line, and then used that command line in place of mysqld_safe.
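
The chdir issue can be illustrated with a small sketch. It does not reproduce the clang profiling runtime. It only shows that a relative pattern resolves against whatever the current directory is at the time of use, while a pattern made absolute up front keeps pointing at the original directory. The function names and the temporary directories standing in for the build and data directories are hypothetical.

```python
import os
import tempfile


def absolute_profile_pattern(pattern: str, base_dir: str) -> str:
    """Anchor a profile file pattern to base_dir so a later chdir cannot move it."""
    return os.path.join(os.path.abspath(base_dir), pattern)


def resolve_after_chdir():
    """Show where a relative and an absolute pattern point after a chdir."""
    pattern = "code-%p.profraw"
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as b, tempfile.TemporaryDirectory() as d:
        build_dir = os.path.realpath(b)  # stands in for $PWD at startup
        data_dir = os.path.realpath(d)   # stands in for the mysqld data dir
        try:
            os.chdir(build_dir)
            # Equivalent of LLVM_PROFILE_FILE="$PWD/code-%p.profraw"
            absolute = absolute_profile_pattern(pattern, os.getcwd())
            os.chdir(data_dir)           # stands in for chdir($data-dir)
            # The relative pattern now resolves inside the data directory.
            relative_now = os.path.abspath(pattern)
        finally:
            os.chdir(start)
    return (os.path.dirname(relative_now), os.path.dirname(absolute),
            build_dir, data_dir)
```

After the chdir the relative pattern lands in the data directory while the absolute one still names the directory where the process was started.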

Finally, once I had non-empty profraw files, I had to run llvm-profdata on them and the output (rel_pgo_gen_clang.profdata or rel_lto_pgo_gen_clang.profdata) was used as input when I compiled mysqld with clang and -fprofile-use.
    llvm-profdata merge -output=rel_pgo_gen_clang.profdata code-296681.profraw
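
When there is more than one profraw file (one per process ID because of %p), the merge step can take all of them at once. Below is a minimal sketch that only builds the argument list. The helper name is hypothetical and the glob pattern in the comment assumes the code-%p.profraw naming used above.

```python
from typing import List


def profdata_merge_command(profraw_files: List[str], output: str) -> List[str]:
    """Build the llvm-profdata merge command line for a set of profraw files."""
    if not profraw_files:
        raise ValueError("no profraw files to merge")
    # Sort the inputs so the command line is stable from run to run.
    return ["llvm-profdata", "merge", "-output=" + output] + sorted(profraw_files)


# Possible usage, not executed here:
#   import glob, subprocess
#   cmd = profdata_merge_command(glob.glob("code-*.profraw"),
#                                "rel_pgo_gen_clang.profdata")
#   subprocess.run(cmd, check=True)
```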

Hardware

The server here is a Beelink SER4 with an AMD Ryzen 7 4700U CPU with SMT disabled, 8 cores, 16G of RAM and Ubuntu 22.04. The storage is 1 NVMe device.

The CPU used here (AMD 4700u) is described as a laptop class CPU. The server is configured to use the performance frequency governor and acpi-cpufreq scaling driver.

Benchmark

I used sysbench and my usage is explained here. A full run has 42 microbenchmarks and most test only 1 type of SQL statement. But here I skip the read-only tests that run prior to writes to save time. The database is cached by InnoDB.

The benchmark is run with 1 thread, 1 table and 30M rows. Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

The command line for my helper script was:
    bash r.sh 1 30000000 300 600 nvme0n1 1 1 1

Results

All of the results are here using relative QPS (rQPS) where:
  • rQPS is: (QPS for my version / QPS for base version)
  • base version is the rel_o2nofp build
  • my version is one of the other builds
I then compute the median and average of the rQPS for each build. There is no result below for rel_o2nofp because that is the base version and the avg and median for it are 1.
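
The rQPS computation can be sketched as follows. The microbenchmark names and QPS numbers are made up for illustration and are not taken from my results.

```python
from statistics import mean, median
from typing import Dict, Tuple


def relative_qps(base: Dict[str, float], mine: Dict[str, float]) -> Dict[str, float]:
    """rQPS per microbenchmark: QPS for my version / QPS for base version."""
    return {name: mine[name] / base[name] for name in base if name in mine}


def summarize(rqps: Dict[str, float]) -> Tuple[float, float]:
    """Return (average, median) of the rQPS values, rounded to 2 digits."""
    values = list(rqps.values())
    return round(mean(values), 2), round(median(values), 2)


# Hypothetical QPS per microbenchmark for the base build and one other build.
base_qps = {"point-query": 10000.0, "range-scan": 2000.0, "update-index": 4000.0}
my_qps = {"point-query": 15000.0, "range-scan": 2400.0, "update-index": 6400.0}

avg, med = summarize(relative_qps(base_qps, my_qps))
```

An rQPS above 1 means the build is faster than rel_o2nofp on that microbenchmark.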

avg     median  build
0.98    0.98    rel_o2nofp_clang
...
1.00    1.00    rel
0.98    0.98    rel_clang
...
1.05    1.05    rel_o2nofp_lto
1.06    1.06    rel_o2nofp_lto_clang
...
1.07    1.07    rel_lto
1.07    1.07    rel_lto_clang
...
0.99    0.99    rel_o2nofp_native
0.97    0.97    rel_o2nofp_native_clang
...
0.99    0.99    rel_native
0.98    0.98    rel_native_clang
...
1.05    1.04    rel_o2nofp_native_lto
1.05    1.05    rel_o2nofp_native_lto_clang
...
1.06    1.06    rel_native_lto
1.06    1.06    rel_native_lto_clang
...
1.42    1.48    rel_pgo_use
1.52    1.59    rel_pgo_use_clang
...
1.39    1.43    rel_lto_pgo_use
1.77    1.76    rel_lto_pgo_use_clang
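
The compiler comparisons in the tl;dr can be checked against the PGO rows of this table. The sketch below copies the avg and median values from those rows and divides one build by another. The helper name is hypothetical.

```python
# (avg, median) rQPS copied from the PGO rows of the table above.
results = {
    "rel_pgo_use": (1.42, 1.48),
    "rel_pgo_use_clang": (1.52, 1.59),
    "rel_lto_pgo_use": (1.39, 1.43),
    "rel_lto_pgo_use_clang": (1.77, 1.76),
}


def speedup(build: str, versus: str) -> tuple:
    """Return the (avg, median) rQPS of build divided by that of versus."""
    a_avg, a_med = results[build]
    b_avg, b_med = results[versus]
    return round(a_avg / b_avg, 2), round(a_med / b_med, 2)


clang_vs_gcc_pgo = speedup("rel_pgo_use_clang", "rel_pgo_use")
clang_vs_gcc_pgo_lto = speedup("rel_lto_pgo_use_clang", "rel_lto_pgo_use")
gcc_pgo_vs_pgo_lto = speedup("rel_pgo_use", "rel_lto_pgo_use")
```

The ratio for clang vs gcc with PGO+LTO differs a bit between the avg and median columns, and the ~25% in the tl;dr sits between the two.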

1 comment:

  1. All good ideas, I just need to find spare time or SmallDatum LLC needs to find a client to sponsor this work on ARM. I have done some work on ARM in the past

    https://smalldatum.blogspot.com/2023/02/sysbench-arm-x86-public-cloud.html

    https://smalldatum.blogspot.com/2023/01/the-insert-benchmark-on-arm-and-x86.html
