Thursday, October 17, 2024

The impact of PGO, LTO and more for MySQL on a small CPU

This has results for MySQL 8.0.28 vs sysbench on a laptop-class CPU (AMD 4700u) to understand the impact of various compile-time optimizations including LTO and PGO. I will have results from a server-class CPU next weeok. I previously published results on this but then retracted some of them as they appeared bogus.

tl;dr

  • clang and gcc have similar performance with a few exceptions
    • clang does ~7% better than gcc with PGO
    • clang does ~25% better than gcc with PGO+LTO
    • oddly, gcc with PGO is ~3% faster than gcc with PGO+LTO
  • LTO makes things ~6% faster
  • PGO or PGO+LTO makes things between 1.4X and 1.8X faster
  • gcc and clang without -march=native -mtune=native are ~2% faster than with those options

Builds

I compiled upstream MySQL 8.0.28 from source. The my.cnf file is here

All work (builds & tests) are done on a server that uses Ubuntu 22.04.5. The gcc version is 11.4.0 and the clang version is 14.0.0-1ubuntu1.1.

The CMake command lines for all of the builds are here. All builds use CMAKE_BUILD_TYPE =Release. The builds are listed below and the name of the CMake command line file per build is cmk.80.$val where $val is the build name listed below:
  • rel_o2nofp
    • uses gcc, -O2 instead of -O3, adds -fno-omit-frame-pointer, works great for flamegraphs
  • rel_o2nofp_clang
    • uses clang, -O2 instead of -O3, adds -fno-omit-frame-pointer, works great for flamegraphs
  • rel
    • uses gcc, -O3, doesn't work great for flamegraphs
  • rel_clang
    • uses clang, -O3, doesn't work great for flamegraphs
  • rel_o2nofp_lto
    • like rel_o2nofp but adds -DWITH_LTO=ON to get link-time optimization
  • rel_o2nofp_lto_clang
    • like rel_o2nofp_clang but adds -DWITH_LTO=ON to get link-time optimization
  • rel_lto
    • like rel but adds -DWITH_LTO=ON to get link-time optimization
  • rel_lto_clang
    • like rel_clang but adds -DWITH_LTO=ON to get link-time optimization
  • rel_o2nofp_native
    • like rel_o2nofp but adds -march=native -mtune=native
  • rel_o2nofp_native_clang
    • like rel_o2nofp_clang but adds -march=native -mtune=native
  • rel_native
    • like rel but adds -march=native -mtune=native
  • rel_native_clang
    • like rel_clang but adds -march=native -mtune=native
  • rel_o2nofp_native_lto
    • like rel_o2nofp but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_o2nofp_native_lto_clang
    • like rel_o2nofp_clang but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_native_lto
    • like rel but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_native_lto_clang
    • like rel_clang but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_pgo_use
    • like rel but uses PGO via -fprofile-generate
  • rel_pgo_use_clang
    • like rel_clang but uses PGO via -fprofile-generate
  • rel_lto_pgo_use
    • like rel but uses PGO via -fprofile-generate and -DWITH_LTO=ON
  • rel_lto_pgo_use_clang
    • like rel_clang but uses PGO via -fprofile-generate and -DWITH_LTO=ON
For PGO I do the following:
  • compile mysqld with -fprofile-generate
  • run sysbench to generate profiles
  • recompile mysqld with -fprofile-use using those profiles
  • rerun sysbench to get the results that I report.
For PGO with gcc I use -fprofile-generate rather than -fprofile-generate=$path and when I run mysqld the gcda files are written into the subdirectories of the object files from when I compiled MySQL. That worked because I used the same host for compiling and testing.

For PGO with clang I use -fprofile-generate rather than -fprofile-generate=$path and when I ran mysqld the profdata files were empty (the profiling data). Then I prepended this to the mysqld command line, but the profdata files were still empty:
  LLVM_PROFILE_FILE="code-%p.profraw"

The problem appears to be that mysqld calls chdir($data-dir) at startup which confuses the clang profiling support. But making that path absolute fixes the problem:
    LLVM_PROFILE_FILE="$PWD/code-%p.profraw"

By mysqld command line I mean that I started mysqld via mysqld_safe, did ps to see the command line, and then used that command line in place of mysqld_safe.

Finally, once I had non-empty profdata files, I had to run llvm-profdata on them and the output (rel_pgo_gen_clang.profdata or rel_lto_pgo_gen_clang.profdata) was used as input when I compiled mysqld with clang and -fprofile-use.
    llvm-profdata merge -output=rel_pgo_gen_clang.profdata code-296681.profraw

Hardware

The server here is a Beelink SER4 with an AMD Ryzen 7 4700 CPU with SMT disabled, 8 cores, 16G of RAM and Ubuntu 22.04. The storage is 1 NVMe device.

The CPU used here (AMD 4700u) is described as a laptop class CPU. The server is configured to use the performance frequency governor and acpi-cpufreq scaling driver.

Benchmark

I used sysbench and my usage is explained here. A full run has 42 microbenchmarks and most test only 1 type of SQL statement. But here I skip the read-only tests that run prior to writes to save time. The database is cached by InnoDB.

The benchmark is run with 1 thread, 1 table and 30M rows. Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

The command lines for my helper script was:
    bash r.sh 1 30000000 300 600 nvme0n1 1 1 1

Results

All of the results are here using relative QPS (rQPS) where:
  • rQPS is: (QPS for my version / QPS for base version)
  • base version is the rel_o2nofp build
  • my version is one of the other builds
I then compute the median and average of the rQPS for each build. There is no result below for rel_o2nofp because that is the base version and the avg and median for it are 1.

avg     median  build
0.98    0.98    rel_o2nofp_clang
...
1.00    1.00    rel
0.98    0.98    rel_clang
...
1.05    1.05    rel_o2nofp_lto
1.06    1.06    rel_o2nofp_lto_clang
...
1.07    1.07    rel_lto
1.07    1.07    rel_lto_clang
...
0.99    0.99    rel_o2nofp_native
0.97    0.97    rel_o2nofp_native_clang
...
0.99    0.99    rel_native
0.98    0.98    rel_native_clang
...
1.05    1.04    rel_o2nofp_native_lto
1.05    1.05    rel_o2nofp_native_lto_clang
...
1.06    1.06    rel_native_lto
1.06    1.06    rel_native_lto_clang
...
1.42    1.48    rel_pgo_use
1.52    1.59    rel_pgo_use_clang
...
1.39    1.43    rel_lto_pgo_use
1.77    1.76    rel_lto_pgo_use_clang

1 comment:

  1. All good ideas, I just need to find spare time or SmallDatum LLC needs to find a client to sponsor this work on ARM. I have done some work on ARM in the past

    https://smalldatum.blogspot.com/2023/02/sysbench-arm-x86-public-cloud.html

    https://smalldatum.blogspot.com/2023/01/the-insert-benchmark-on-arm-and-x86.html

    ReplyDelete

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...