Small Datum: InnoDB, compiler options and sysbench

In this post I have more results on the impact of compile-time options for MySQL, InnoDB and sysbench. I previously wrote about the impact of PGO and LTO on MySQL performance.

tl;dr

There is no silver bullet to undo the performance regressions
You can improve QPS by 10% to 20% with -O3, LTO and disabling most of the perf schema at compile time but I am reluctant to suggest running in production without the perf schema
You can improve QPS by 5% to 10% with -O3 and LTO
Compiling with -Os is bad for performance

Builds

I used InnoDB from MySQL 8.0.39. The compiler was gcc 11.4.0.

The base case was named my8039_rel_o2nofp.z11a_bee and was compiled with -O2 and frame pointers enabled. The CMake command line is here.

All of the CMake files are here and I tested the following variations where the names are in two parts as in build.config where build names the install directory (and the build) and config names the my.cnf file. While the suffix for config files here is _bee, that is only used for the amd server and the suffix was _c8r32 for the pn53 server and _c24r64 for the c2dhc server. For each of the following I link the CMake command line for that build:

my8039_rel_o2nofp.z11a_nops_bee

similar to the base case except performance_schema=0

my8039_rel_o2nofp_lto.z11a_bee

adds link-time optimization (WITH_LTO=ON)

my8039_rel.z11a_bee

compiled with -O3 and the default for frame pointers

my8039_rel.z11a_nops_bee

like my8039_rel.z11a_bee but with performance_schema=0

my8039_rel_lto.z11a_bee

like my8039_rel.z11a_bee and uses WITH_LTO=ON

my8039_rel_o2nofp_less.z11a_bee

like my8039_rel_o2nofp.z11a_bee but disables more things at compile time

my8039_rel_less.z11a_bee

like my8039_rel.z11a_bee but disables more things at compile time

my8039_rel_o2nofp_lessps.z11a_bee

like my8039_rel_o2nofp.z11a_bee but disables most of the perf schema at compile time

my8039_rel_lessps.z11a_bee

like my8039_rel.z11a_bee but disables most of the perf schema at compile time

my8039_rel_o2nofp_less_lessps.z11a_bee

like my8039_rel_o2nofp.z11a_bee but disables most of the perf schema at compile time and disables more things at compile time

my8039_rel_less_lessps.z11a_bee

like my8039_rel.z11a_bee but disables most of the perf schema at compile time and disables more things at compile time

my8039_rel_o2nofp_less_lessps_lto.z11a_bee

like my8039_rel_o2nofp.z11a_bee but disables most of the perf schema at compile time, disables more things at compile time and uses WITH_LTO=ON

my8039_rel_less_lessps_lto.z11a_bee

like my8039_rel.z11a_bee but disables most of the perf schema at compile time, disables more things at compile time and uses WITH_LTO=ON

my8039_rel_o2nofp_less_lessps_lto.z11a_nops_bee

like my8039_rel_o2nofp.z11a_bee but disables most of the perf schema at compile time, disables more things at compile time, uses WITH_LTO=ON and uses performance_schema=0

my8039_rel_less_lessps_lto.z11a_nops_bee

like my8039_rel.z11a_bee but disables most of the perf schema at compile time, disables more things at compile time, uses WITH_LTO=ON and uses performance_schema=0

my8039_rel_osnofp.z11a_bee

compiled with -Os instead of -O3 or -O2

The my.cnf files are:

my.cnf.cz11a_bee (z11a_bee) is here and z11a_nops_bee adds performance_schema=0

pn53

my.cnf.cz11a_c8r32 (z11a_c8r32) is here and z11a_nops_c8r32 adds performance_schema=0

c2dhc

my.cnf.cz11a_c24r64 (z11a_c24r64) is here and z11a_nops_c24r64 adds performance_schema=0

Hardware

I tested on three servers:

amd - Beelink SER 4700u (see here) with 8 cores and a Ryzen 7 4700u CPU
pn53 - ASUS ExpertCenter PN53 (see here) with 8 cores and an AMD Ryzen 7 7735HS CPU. The CPU on the PN53 is newer than the CPU on the SER4.
c2dhc - a c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU and SMT disabled so there are 16 cores

All servers use Ubuntu 22.04 with ext4.

Benchmark

I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by InnoDB.

The benchmark is run with:

amd - 1 thread, 1 table and 30M rows
pn53 - 1 thread, 1 table and 50M rows
C2D - 12 threads, 8 tables and 10M rows per table
each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise
prepared statements were enabled

The command lines for my helper scripts were:

# amd

bash r.sh 1 30000000 300 600 nvme0n1 1 1 1

# pn53

bash r.sh 1 50000000 300 600 nvme0n1 1 1 1

# c2dhc

bash r.sh 8 10000000 300 600 md0 1 1 12

Results

For the results below I split the 42 microbenchmarks into 3 groups -- 1 for point queries, 1 for range queries, 1 for writes.

The numbers in the tables linked below are the relative QPS which is (QPS for my version) / (QPS for base case) and the base case is my8039_rel_o2nofp with the z11a_bee, z11a_c8r32 and z11a_c24r64 configs. When the relative throughput is > 1 then that version is faster than the base case. When it is 1.0 then they have the same throughput.

Here I only share the average relative QPS per microbenchmark group. All of the data is here and I provide 3 views:

the first view is the least compressed and is hard to fit in one page
the second view uses much less space than the first view but still makes it hard to compare results across the server types
the third view is the most dense. It formats results as a-b-c in each column where a is the average relative QPS for point queries, b is the average relative QPS for range queries and c is the average relative QPS for writes

Data from the third view is also listed below. The last line for osnpfo has 0.77-0.69-0.76 in the amd column. That means the relative QPS with osnofp is 0.77 for point queries, 0.69 for range queries and 0.76 for writes. From the c2dhc column for rel,lto the entry is 1.07-1.06-1.04 which means the relative QPS is 1.07 for point queries, 1.06 for range queries and 1.04 for writes so this is 4% to 7% faster than the base case.

A summary from the results:

my8039_rel_less_lessps_lto.z11a_nops_$suffix has the best result. It improves QPS by up to 20% on the c2dhc server. While there isn't much benefit from less, there is a benefit from rel (-O3), lessps (disable most of the perf schema at compile time) and lto (link-time optimization). However, I don't suggest disabling most of the perf schema at compile time because it is useful.
my8039_rel_lto.z11_$suffix is the build.config pair that I would use. It is usually 5% to 10% faster than the base case on the c2dhc server.
the benefits are cumulative. For example, rel_lto (-O3 and link-time optimization) is better than rel alone.

Using rel_o2nofp as base amd pn53 c2dhc

nops 1.04-1.03-1.07 1.03-1.03-1.04 1.02-1.02-1.03

lto 1.08-1.06-1.08 1.09-1.07-1.06 1.07-1.05-1.03

rel 1.04-1.02-1.02 1.04-1.02-1.00 1.03-1.02-1.00

rel,nops 1.07-1.06-1.07 1.06-1.05-1.03 1.04-1.03-1.03

rel,lto 1.10-1.09-1.10 1.11-1.07-1.07 1.07-1.06-1.04

less 1.05-1.04-1.02 1.01-1.01-1.00 1.00-1.00-1.00

rel,less 1.05-1.05-1.02 1.05-1.02-1.00 1.02-1.01-1.00

lessps 1.06-1.06-1.11 1.05-1.05-1.06 1.06-1.06-1.04

rel,lessps 1.09-1.07-1.11 1.09-1.07-1.07 1.09-1.06-1.05

less,lessps 1.05-1.06-1.11 1.05-1.05-1.07 1.07-1.06-1.05

rel,less,lessps 1.09-1.07-1.11 1.07-1.06-1.07 1.09-1.07-1.05

less,lessps,lto 1.14-1.11-1.19 1.14-1.10-1.14 1.14-1.10-1.09

rel,less,lessps,lto 1.17-1.14-1.22 1.17-1.13-1.17 1.17-1.12-1.10

less,lessps,lto,nops 1.14-1.11-1.19 1.14-1.10-1.14 1.14-1.10-1.08

rel,less,lessps,lto,nops 1.17-1.14-1.21 1.17-1.13-1.16 1.17-1.13-1.11

osnofp 0.77-0.69-0.76 0.81-0.77-0.79 0.83-0.77-0.86

Small Datum

Wednesday, September 11, 2024

InnoDB, compiler options and sysbench

No comments:

Post a Comment

Postgres 18 beta2: large server, Insert Benchmark, part 2