Wednesday, March 1, 2023

Adventures in compiling MySQL: RelWithDebInfo vs Release

I am a non-expert in many build tools -- CMake for MySQL, autoconf for Postgres, scons for MongoDB and Maven for Linkbench. While working to confirm my MySQL builds are OK I used sysbench to compare several of them and was confused by the results. This is part 3 of my adventure - parts 1 and 2 are here and here.

tl;dr

  • RelWithDebInfo uses link time optimization by default
  • Release does not use link time optimization by default
  • Performance is better with link time optimization
  • Link time optimization helped point queries more than range queries or writes
Compiling MySQL from source

The options for CMAKE_BUILD_TYPE with MySQL include RelWithDebInfo and Relase. On Ubuntu 22.04 with gcc the interesting differences include:
  • RelWithDebInfo uses -O2, link time optimization (-flto) and -fstack-protector-strong
  • Release uses -O3 but does not use link time optimization by default
In my benchmarks (CPU-bound sysbench) the RelWithDebInfo build gets better throughput because it used less CPU/query and the difference was up to 10%. Looking at statistics from perf I saw that insn per cycle was frequently ~10% better for the RelWithDebInfo build and memory system counters were also better for RelWithDebInfo.

My first guess at the root cause was the binary size from -O2 vs -O3, 54M for RelWithDebInfo vs 66M for Release using stripped mysqld binaries. But that was wrong.

Then I remembered that RelWithDebInfo used -flto while Release did not. I ignored that last week, but it turned out the explain the difference. 

Why don't Release builds use link time optimization? Yura Sorokin explained this to me. In CMakeLists.txt there is this code that sets WITH_PACKAGE_FLAGS_DEFAULT to ON for RelWithDebInfo and OFF for Release. When that is set to ON then the output from dpkg-buildflags is added to compile and linker command lines (dpkg-buildflags --get $X for X in CPPFLAGS, CFLAGS, CXXFLAGS, LDFLAGS). And on Ubuntu 22.04 I see:

$ dpkg-buildflags --get CPPFLAGS

-Wdate-time -D_FORTIFY_SOURCE=2


$ dpkg-buildflags --get CFLAGS

-g -O2 -ffile-prefix-map=/home/mdcallag/git/mytools/bench/sysbench.lua/r.1tab.1thr.feb23.repro=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security


$ dpkg-buildflags --get CXXFLAGS

-g -O2 -ffile-prefix-map=/home/mdcallag/git/mytools/bench/sysbench.lua/r.1tab.1thr.feb23.repro=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security


$ dpkg-buildflags --get LDFLAGS

-Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro

Thus, I get -flto by default with RelWithDebInfo but not with Release. To enable link time optimization for Release I can do one of:

Results: graphs

The full spreadsheet is here and has the throughput for builds relative to a RelWithDebInfo build (QPS for build / QPS for RelWithDebInfo). LTO means link time optimization which is enabled by default for RelWithDebInfo, disabled by default for Release and enabled via WITH_LTO=ON. The builds tested are:
  • RelWithDebInfo - CMAKE_BUILD_TYPE=RelWithDebInfo, LTO, -O2
  • Release - CMAKE_BUILD_TYPE=Release, -O3, does not use LTO
  • Release+LTO - CMAKE_BUILD_TYPE=Release, LTO, -O3
  • Release+LTO+O2 - uses CMAKE_BUILD_TYPE=Release, LTO, -O2
  • Release+LTO+Native - CMAKE_BUILD_TYPE=Release, LTO, -march=native, -mtune=native
The spreadsheet has results for all builds. The graphs only show QPS for Release and Release+LTO relative to RelWithDebInfo to improve readability. From Release+LTO I see that LTO benefits point queries more than range queries or writes and that is also clear in the summary statistics displayed in the following section. Also, range query and writes each have two outliers for which LTO had a large benefit:
The graphs show the relative QPS for each build which is (QPS for the build / QPS for RelWithDebInfo).

Results: summary statistics

This section has summary statistics: the average, median, min and max value for each of the benchmark types (point, range and write) for each of the builds using the relative QPS (QPS for the build / QPS for RelWithDebInfo build). From the Average column it is clear that LTO benefits point queries more than range queries or writes. It is also clear that the use of LTO improves performance.

Release
PointRangeWrite
Average0.970.950.93
Median0.980.960.93
Min0.910.910.90
Max1.000.990.98
Release+LTO
PointRangeWrite
Average1.131.061.13
Median1.151.041.07
Min1.020.971.02
Max1.201.211.45
Rlease+LTO+O2
PointRangeWrite
Average1.091.031.09
Median1.101.021.04
Min1.010.970.99
Max1.141.161.33
Release+LTO+Native
PointRangeWrite
Average1.131.041.14
Median1.151.031.07
Min1.040.931.03
Max1.191.201.45

CMake command lines

RelWithDebInfo
cmake .. \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
  -DMYSQL_MAINTAINER_MODE=0 \
  -DENABLED_LOCAL_INFILE=1 \
  -DCMAKE_INSTALL_PREFIX=$1 \
  -DWITH_BOOST=$PWD/../boost \
  -DWITH_NUMA=ON \
  -DWITH_ROUTER=OFF \
  -DWITH_MYSQLX=OFF \
  -DWITH_UNIT_TESTS=OFF
Release
BF=" -g1 "
CF=" $BF "
CXXF=" $BF "

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
  -DMYSQL_MAINTAINER_MODE=0 \
  -DENABLED_LOCAL_INFILE=1 \
  -DCMAKE_INSTALL_PREFIX=$1 \
  -DWITH_BOOST=$PWD/../boost \
  -DCMAKE_CXX_FLAGS="$CXXF" -DCMAKE_C_FLAGS="$CF" \
  -DWITH_NUMA=ON \
  -DWITH_ROUTER=OFF \
  -DWITH_MYSQLX=OFF \
  -DWITH_UNIT_TESTS=OFF
Release+LTO
BF=" -g1 "
CF=" $BF "
CXXF=" $BF "

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
  -DMYSQL_MAINTAINER_MODE=0 \
  -DENABLED_LOCAL_INFILE=1 \
  -DCMAKE_INSTALL_PREFIX=$1 \
  -DWITH_BOOST=$PWD/../boost \
  -DCMAKE_CXX_FLAGS="$CXXF" -DCMAKE_C_FLAGS="$CF" \
  -DWITH_LTO=ON \
  -DWITH_NUMA=ON \
  -DWITH_ROUTER=OFF -DWITH_MYSQLX=OFF -DWITH_UNIT_TESTS=OFF
Release+LTO+O2
BF=" -g1 "
CF=" $BF "
CXXF=" $BF "

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
  -DMYSQL_MAINTAINER_MODE=0 \
  -DENABLED_LOCAL_INFILE=1 \
  -DCMAKE_INSTALL_PREFIX=$1 \
  -DWITH_BOOST=$PWD/../boost \
  -DCMAKE_CXX_FLAGS="$CXXF" -DCMAKE_C_FLAGS="$CF" \
  -DCMAKE_C_FLAGS_RELEASE="-O2 -DNDEBUG" \
  -DCMAKE_CXX_FLAGS_RELEASE="-O2 -DNDEBUG" \
  -DWITH_LTO=ON \
  -DWITH_NUMA=ON \
  -DWITH_ROUTER=OFF -DWITH_MYSQLX=OFF -DWITH_UNIT_TESTS=OFF
Release+LTO+Native
BF=" -march=native -mtune=native -g1 "
CF=" $BF "
CXXF=" $BF "

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
  -DMYSQL_MAINTAINER_MODE=0 \
  -DENABLED_LOCAL_INFILE=1 \
  -DCMAKE_INSTALL_PREFIX=$1 \
  -DWITH_BOOST=$PWD/../boost \
  -DCMAKE_CXX_FLAGS="$CXXF" -DCMAKE_C_FLAGS="$CF" \
  -DWITH_LTO=ON \
  -DWITH_NUMA=ON \
  -DWITH_ROUTER=OFF -DWITH_MYSQLX=OFF -DWITH_UNIT_TESTS=OFF
Compiler command lines

This section lists the interesting diffs in the compiler command lines for mysqld.cc

RelWithDebInfo vs Release
< -D_FORTIFY_SOURCE=2
< -ffat-lto-objects
< -ffile-prefix-map=/foobar/build.rel_withdbg=.
< -flto=auto
< -fstack-protector-strong
< -g

< -O2
---
> -O3
RelWithDebInfo vs Release+LTO
< -D_FORTIFY_SOURCE=2
< -ffat-lto-objects
< -ffile-prefix-map=/foobar/build.rel_withdbg=.
< -fstack-protector-strong
< -g

< -flto=auto
---
> -flto


< -O2
---
> -O3
RelWithDebInfo vs Release+LTO+O2
< -D_FORTIFY_SOURCE=2
< -ffat-lto-objects
< -ffile-prefix-map=/foobar/build.rel_withdbg=.
< -fstack-protector-strong
< -g

< -flto=auto
---
> -flto
RelWithDebInfo vs Release+LTO+Native
< -D_FORTIFY_SOURCE=2
< -ffat-lto-objects
< -ffile-prefix-map=/foobar/build.rel_withdbg=.
< -fstack-protector-strong
< -g
---
> -march=native
> -mtune=native

< -flto=auto
---
> -flto

< -O2
---
> -O3

1 comment:

  1. Eventually I will consider PGO. But I have so many questions. I have also been burned by odd perf results from code compiled with the wrong PGO profiles and suspect I am not alone -- there is more complexity.

    One question is the training workload from which the PGO profiles are collected. I am extra reluctant to collect profiles from workload A for a benchmark of workload A. I might be willing to collect them from multiple workloads, and then create one binary used for all workloads.

    But if you have one workload that is widely deployed then doing a mysqld binary per workload might be OK.

    Would be nice if more people were to publish on this. And note that my focus isn't peak performance -- I am mostly looking for performance and efficiency changes over time.

    ReplyDelete