Small Datum: Adventures in compiling MySQL: RelWithDebInfo vs Release

I am a non-expert in many build tools -- CMake for MySQL, autoconf for Postgres, scons for MongoDB and Maven for Linkbench. While working to confirm my MySQL builds are OK I used sysbench to compare several of them and was confused by the results. This is part 3 of my adventure - parts 1 and 2 are here and here.

tl;dr

RelWithDebInfo uses link time optimization by default
Release does not use link time optimization by default
Performance is better with link time optimization
Link time optimization helped point queries more than range queries or writes

Compiling MySQL from source

The options for CMAKE_BUILD_TYPE with MySQL include RelWithDebInfo and Relase. On Ubuntu 22.04 with gcc the interesting differences include:

RelWithDebInfo uses -O2, link time optimization (-flto) and -fstack-protector-strong
Release uses -O3 but does not use link time optimization by default

In my benchmarks (CPU-bound sysbench) the RelWithDebInfo build gets better throughput because it used less CPU/query and the difference was up to 10%. Looking at statistics from perf I saw that insn per cycle was frequently ~10% better for the RelWithDebInfo build and memory system counters were also better for RelWithDebInfo.

My first guess at the root cause was the binary size from -O2 vs -O3, 54M for RelWithDebInfo vs 66M for Release using stripped mysqld binaries. But that was wrong.

Then I remembered that RelWithDebInfo used -flto while Release did not. I ignored that last week, but it turned out the explain the difference.

Why don't Release builds use link time optimization? Yura Sorokin explained this to me. In CMakeLists.txt there is this code that sets WITH_PACKAGE_FLAGS_DEFAULT to ON for RelWithDebInfo and OFF for Release. When that is set to ON then the output from dpkg-buildflags is added to compile and linker command lines (dpkg-buildflags --get $X for X in CPPFLAGS, CFLAGS, CXXFLAGS, LDFLAGS). And on Ubuntu 22.04 I see:

$ dpkg-buildflags --get CPPFLAGS

-Wdate-time -D_FORTIFY_SOURCE=2

$ dpkg-buildflags --get CFLAGS

-g -O2 -ffile-prefix-map=/home/mdcallag/git/mytools/bench/sysbench.lua/r.1tab.1thr.feb23.repro=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security

$ dpkg-buildflags --get CXXFLAGS

$ dpkg-buildflags --get LDFLAGS

-Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro

Thus, I get -flto by default with RelWithDebInfo but not with Release. To enable link time optimization for Release I can do one of:

Results: graphs

The full spreadsheet is here and has the throughput for builds relative to a RelWithDebInfo build (QPS for build / QPS for RelWithDebInfo). LTO means link time optimization which is enabled by default for RelWithDebInfo, disabled by default for Release and enabled via WITH_LTO=ON. The builds tested are:

RelWithDebInfo - CMAKE_BUILD_TYPE=RelWithDebInfo, LTO, -O2
Release - CMAKE_BUILD_TYPE=Release, -O3, does not use LTO
Release+LTO - CMAKE_BUILD_TYPE=Release, LTO, -O3
Release+LTO+O2 - uses CMAKE_BUILD_TYPE=Release, LTO, -O2
Release+LTO+Native - CMAKE_BUILD_TYPE=Release, LTO, -march=native, -mtune=native

The spreadsheet has results for all builds. The graphs only show QPS for Release and Release+LTO relative to RelWithDebInfo to improve readability. From Release+LTO I see that LTO benefits point queries more than range queries or writes and that is also clear in the summary statistics displayed in the following section. Also, range query and writes each have two outliers for which LTO had a large benefit:

range-notcovered-si.pre_range=100 - uses oltp_range_covered.lua on a non-covering secondary index, runs prior to write heavy tests
range-notcovered-si_range=100 - uses oltp_range_covered.lua on a non-covering secondary index, runs after write heavy tests
delete_range=100 - uses oltp_delete.lua, the delete statement is here
update-index_range=100 - uses oltp_update_index.lua, the update statement is here and requires secondary index maintenance

The graphs show the relative QPS for each build which is (QPS for the build / QPS for RelWithDebInfo).

Results: summary statistics

This section has summary statistics: the average, median, min and max value for each of the benchmark types (point, range and write) for each of the builds using the relative QPS (QPS for the build / QPS for RelWithDebInfo build). From the Average column it is clear that LTO benefits point queries more than range queries or writes. It is also clear that the use of LTO improves performance.

Release
	Point	Range	Write
Average	0.97	0.95	0.93
Median	0.98	0.96	0.93
Min	0.91	0.91	0.90
Max	1.00	0.99	0.98

Release+LTO
	Point	Range	Write
Average	1.13	1.06	1.13
Median	1.15	1.04	1.07
Min	1.02	0.97	1.02
Max	1.20	1.21	1.45

Rlease+LTO+O2
	Point	Range	Write
Average	1.09	1.03	1.09
Median	1.10	1.02	1.04
Min	1.01	0.97	0.99
Max	1.14	1.16	1.33

Release+LTO+Native
	Point	Range	Write
Average	1.13	1.04	1.14
Median	1.15	1.03	1.07
Min	1.04	0.93	1.03
Max	1.19	1.20	1.45

CMake command lines

RelWithDebInfo

cmake .. \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DWITH_SSL=system \
-DWITH_ZLIB=bundled \
-DMYSQL_MAINTAINER_MODE=0 \
-DENABLED_LOCAL_INFILE=1 \
-DCMAKE_INSTALL_PREFIX=$1 \
-DWITH_BOOST=$PWD/../boost \
-DWITH_NUMA=ON \
-DWITH_ROUTER=OFF \
-DWITH_MYSQLX=OFF \
-DWITH_UNIT_TESTS=OFF

Release

BF=" -g1 "
CF=" $BF "
CXXF=" $BF "

cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DWITH_SSL=system \
-DWITH_ZLIB=bundled \
-DMYSQL_MAINTAINER_MODE=0 \
-DENABLED_LOCAL_INFILE=1 \
-DCMAKE_INSTALL_PREFIX=$1 \
-DWITH_BOOST=$PWD/../boost \
-DCMAKE_CXX_FLAGS="$CXXF" -DCMAKE_C_FLAGS="$CF" \
-DWITH_NUMA=ON \
-DWITH_ROUTER=OFF \
-DWITH_MYSQLX=OFF \
-DWITH_UNIT_TESTS=OFF

Release+LTO

BF=" -g1 "
CF=" $BF "
CXXF=" $BF "

cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DWITH_SSL=system \
-DWITH_ZLIB=bundled \
-DMYSQL_MAINTAINER_MODE=0 \
-DENABLED_LOCAL_INFILE=1 \
-DCMAKE_INSTALL_PREFIX=$1 \
-DWITH_BOOST=$PWD/../boost \
-DCMAKE_CXX_FLAGS="$CXXF" -DCMAKE_C_FLAGS="$CF" \
-DWITH_LTO=ON \
-DWITH_NUMA=ON \
-DWITH_ROUTER=OFF -DWITH_MYSQLX=OFF -DWITH_UNIT_TESTS=OFF

Release+LTO+O2

BF=" -g1 "
CF=" $BF "
CXXF=" $BF "

cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DWITH_SSL=system \
-DWITH_ZLIB=bundled \
-DMYSQL_MAINTAINER_MODE=0 \
-DENABLED_LOCAL_INFILE=1 \
-DCMAKE_INSTALL_PREFIX=$1 \
-DWITH_BOOST=$PWD/../boost \
-DCMAKE_CXX_FLAGS="$CXXF" -DCMAKE_C_FLAGS="$CF" \
-DCMAKE_C_FLAGS_RELEASE="-O2 -DNDEBUG" \
-DCMAKE_CXX_FLAGS_RELEASE="-O2 -DNDEBUG" \
-DWITH_LTO=ON \
-DWITH_NUMA=ON \
-DWITH_ROUTER=OFF -DWITH_MYSQLX=OFF -DWITH_UNIT_TESTS=OFF

Release+LTO+Native

BF=" -march=native -mtune=native -g1 "
CF=" $BF "
CXXF=" $BF "

cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DWITH_SSL=system \
-DWITH_ZLIB=bundled \
-DMYSQL_MAINTAINER_MODE=0 \
-DENABLED_LOCAL_INFILE=1 \
-DCMAKE_INSTALL_PREFIX=$1 \
-DWITH_BOOST=$PWD/../boost \
-DCMAKE_CXX_FLAGS="$CXXF" -DCMAKE_C_FLAGS="$CF" \
-DWITH_LTO=ON \
-DWITH_NUMA=ON \
-DWITH_ROUTER=OFF -DWITH_MYSQLX=OFF -DWITH_UNIT_TESTS=OFF

Compiler command lines

This section lists the interesting diffs in the compiler command lines for mysqld.cc

RelWithDebInfo vs Release

< -D_FORTIFY_SOURCE=2
< -ffat-lto-objects
< -ffile-prefix-map=/foobar/build.rel_withdbg=.
< -flto=auto
< -fstack-protector-strong
< -g

< -O2
---
> -O3

RelWithDebInfo vs Release+LTO

< -D_FORTIFY_SOURCE=2
< -ffat-lto-objects
< -ffile-prefix-map=/foobar/build.rel_withdbg=.
< -fstack-protector-strong
< -g

< -flto=auto
---
> -flto

< -O2
---
> -O3

RelWithDebInfo vs Release+LTO+O2

< -D_FORTIFY_SOURCE=2
< -ffat-lto-objects
< -ffile-prefix-map=/foobar/build.rel_withdbg=.
< -fstack-protector-strong
< -g

< -flto=auto
---
> -flto

RelWithDebInfo vs Release+LTO+Native

< -D_FORTIFY_SOURCE=2
< -ffat-lto-objects
< -ffile-prefix-map=/foobar/build.rel_withdbg=.
< -fstack-protector-strong
< -g
---
> -march=native
> -mtune=native

< -flto=auto
---
> -flto

< -O2
---
> -O3

1 comment:

Mark CallaghanNovember 19, 2023 at 8:04 AM
Eventually I will consider PGO. But I have so many questions. I have also been burned by odd perf results from code compiled with the wrong PGO profiles and suspect I am not alone -- there is more complexity.

One question is the training workload from which the PGO profiles are collected. I am extra reluctant to collect profiles from workload A for a benchmark of workload A. I might be willing to collect them from multiple workloads, and then create one binary used for all workloads.

But if you have one workload that is widely deployed then doing a mysqld binary per workload might be OK.

Would be nice if more people were to publish on this. And note that my focus isn't peak performance -- I am mostly looking for performance and efficiency changes over time.

Small Datum

Wednesday, March 1, 2023

Adventures in compiling MySQL: RelWithDebInfo vs Release

1 comment:

Postgres 18 beta1: small server, IO-bound Insert Benchmark (v2)