Wednesday, September 11, 2024

InnoDB, compiler options and sysbench

In this post I have more results on the impact of compile-time options on MySQL performance, measured with InnoDB and sysbench. I previously wrote about the impact of PGO and LTO on MySQL performance.

tl;dr

  • There is no silver bullet to undo the performance regressions
  • You can improve QPS by 10% to 20% with -O3, LTO and disabling most of the perf schema at compile time, but I am reluctant to suggest running in production without the perf schema
  • You can improve QPS by 5% to 10% with -O3 and LTO
  • Compiling with -Os is bad for performance

Builds

I used InnoDB from MySQL 8.0.39. The compiler was gcc 11.4.0.

The base case was named my8039_rel_o2nofp.z11a_bee and was compiled with -O2 and frame pointers enabled. The CMake command line is here.
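
As a point of reference, below is a minimal sketch of what such a CMake invocation can look like. The exact command is in the linked file; the flags and paths here are assumptions for illustration.
# sketch only: -O2 with frame pointers enabled, install dir named for the build
cmake .. \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DCMAKE_C_FLAGS="-O2 -fno-omit-frame-pointer" \
  -DCMAKE_CXX_FLAGS="-O2 -fno-omit-frame-pointer" \
  -DDOWNLOAD_BOOST=1 -DWITH_BOOST=$HOME/boost \
  -DCMAKE_INSTALL_PREFIX=$HOME/my8039_rel_o2nofp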

All of the CMake files are here. I tested several build variations, with names in two parts as in build.config, where build names the install directory (and the build) and config names the my.cnf file. While the config suffix is _bee in the examples here, that is only used for the amd server; the suffix is _c8r32 for the pn53 server and _c24r64 for the c2dhc server. The CMake command line for each build is linked from the CMake files above.

The my.cnf files are:
  • amd
    • my.cnf.cz11a_bee (z11a_bee) is here and z11a_nops_bee adds performance_schema=0
  • pn53
    • my.cnf.cz11a_c8r32 (z11a_c8r32) is here and z11a_nops_c8r32 adds performance_schema=0
  • c2dhc
    • my.cnf.cz11a_c24r64 (z11a_c24r64) is here and z11a_nops_c24r64 adds performance_schema=0
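
Each _nops config should be identical to the matching base config except for that one line. A quick check, as a sketch assuming the files are in the current directory:
# the only difference should be the line that disables the perf schema
diff my.cnf.cz11a_bee my.cnf.cz11a_nops_bee
# expected output: > performance_schema=0
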
Hardware

I tested on three servers:
  • amd - Beelink SER 4700u (see here) with 8 cores and an AMD Ryzen 7 4700U CPU
  • pn53 - ASUS ExpertCenter PN53 (see here) with 8 cores and an AMD Ryzen 7 7735HS CPU. The CPU on the PN53 is newer than the CPU on the SER4.
  • c2dhc - a c2d-highcpu-32 instance type on GCP with 32 vCPUs and SMT disabled, so there are 16 cores
All servers use Ubuntu 22.04 with ext4. 

Benchmark

I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by InnoDB.

The benchmark is run with:
  • amd - 1 thread, 1 table and 30M rows
  • pn53 - 1 thread, 1 table and 50M rows
  • c2dhc - 12 threads, 8 tables and 10M rows per table
  • each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise
  • prepared statements were enabled
The command lines for my helper scripts were:
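# the positional arguments, as inferred from the setups listed above, are:
# number of tables, rows per table, seconds per read-only microbenchmark,
# seconds per other microbenchmark, storage device, and the last argument
# is the number of threads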
# amd
bash r.sh 1 30000000 300 600 nvme0n1 1 1 1
# pn53
bash r.sh 1 50000000 300 600 nvme0n1 1 1 1
# c2dhc
bash r.sh 8 10000000 300 600 md0 1 1 12
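
Each helper invocation loops over the 42 microbenchmarks. For readers who do not use my scripts, a single microbenchmark run with stock sysbench looks roughly like the sketch below; the connection options and Lua script name are assumptions, and my scripts set more options than this.
# prepared statements are on by default (--db-ps-mode=auto)
sysbench oltp_point_select --mysql-user=root --mysql-host=127.0.0.1 \
  --tables=1 --table-size=30000000 --time=300 --threads=1 run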

Results

For the results below I split the 42 microbenchmarks into 3 groups -- one for point queries, one for range queries and one for writes.

The numbers in the tables linked below are the relative QPS, which is (QPS for my version) / (QPS for the base case), where the base case is my8039_rel_o2nofp with the z11a_bee, z11a_c8r32 and z11a_c24r64 configs. When the relative QPS is greater than 1 then that version is faster than the base case; when it is 1.0 they have the same throughput.
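
For example, with made-up numbers: if the base case gets 9500 QPS on some microbenchmark and another build gets 10300 QPS, then the relative QPS is 10300 / 9500 = 1.08, so that build is about 8% faster.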

Here I only share the average relative QPS per microbenchmark group. All of the data is here and I provide 3 views:
  • the first view is the least compressed and is hard to fit on one page
  • the second view uses much less space than the first view but still makes it hard to compare results across the server types
  • the third view is the most dense. It formats results as a-b-c in each column where a is the average relative QPS for point queries, b is the average relative QPS for range queries and c is the average relative QPS for writes
Data from the third view is also listed below. The last line, for osnofp, has 0.77-0.69-0.76 in the amd column. That means the relative QPS with osnofp is 0.77 for point queries, 0.69 for range queries and 0.76 for writes. From the c2dhc column for rel,lto the entry is 1.07-1.06-1.04, which means the relative QPS is 1.07 for point queries, 1.06 for range queries and 1.04 for writes, so this is 4% to 7% faster than the base case.

A summary from the results:
  • my8039_rel_less_lessps_lto.z11a_nops_$suffix has the best result. It improves QPS by up to 20% on the c2dhc server. While there isn't much benefit from less, there is a benefit from rel (-O3), lessps (disable most of the perf schema at compile time) and lto (link-time optimization). However, I don't suggest disabling most of the perf schema at compile time because it is useful.
  • my8039_rel_lto.z11a_$suffix is the build.config pair that I would use. It is usually 5% to 10% faster than the base case, though closer to 4% to 7% on the c2dhc server.
  • the benefits are cumulative. For example, rel_lto (-O3 and link-time optimization) is better than rel alone.

Using rel_o2nofp as the base case; each entry is point-range-write relative QPS

build                     amd             pn53            c2dhc
nops                      1.04-1.03-1.07  1.03-1.03-1.04  1.02-1.02-1.03
lto                       1.08-1.06-1.08  1.09-1.07-1.06  1.07-1.05-1.03
rel                       1.04-1.02-1.02  1.04-1.02-1.00  1.03-1.02-1.00
rel,nops                  1.07-1.06-1.07  1.06-1.05-1.03  1.04-1.03-1.03
rel,lto                   1.10-1.09-1.10  1.11-1.07-1.07  1.07-1.06-1.04
less                      1.05-1.04-1.02  1.01-1.01-1.00  1.00-1.00-1.00
rel,less                  1.05-1.05-1.02  1.05-1.02-1.00  1.02-1.01-1.00
lessps                    1.06-1.06-1.11  1.05-1.05-1.06  1.06-1.06-1.04
rel,lessps                1.09-1.07-1.11  1.09-1.07-1.07  1.09-1.06-1.05
less,lessps               1.05-1.06-1.11  1.05-1.05-1.07  1.07-1.06-1.05
rel,less,lessps           1.09-1.07-1.11  1.07-1.06-1.07  1.09-1.07-1.05
less,lessps,lto           1.14-1.11-1.19  1.14-1.10-1.14  1.14-1.10-1.09
rel,less,lessps,lto       1.17-1.14-1.22  1.17-1.13-1.17  1.17-1.12-1.10
less,lessps,lto,nops      1.14-1.11-1.19  1.14-1.10-1.14  1.14-1.10-1.08
rel,less,lessps,lto,nops  1.17-1.14-1.21  1.17-1.13-1.16  1.17-1.13-1.11
osnofp                    0.77-0.69-0.76  0.81-0.77-0.79  0.83-0.77-0.86
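
For orientation, below is a sketch of how the rel,lto build might differ from the base case. WITH_LTO is the MySQL CMake switch for link-time optimization and the DISABLE_PSI_* family compiles out perf schema instrumentation; the exact commands are in the linked CMake files, so treat the flags below as assumptions.
# sketch only: -O3 plus link-time optimization
cmake .. \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DCMAKE_C_FLAGS="-O3" -DCMAKE_CXX_FLAGS="-O3" \
  -DWITH_LTO=ON \
  -DCMAKE_INSTALL_PREFIX=$HOME/my8039_rel_lto
# a lessps build would add switches like -DDISABLE_PSI_MUTEX=ON and
# -DDISABLE_PSI_RWLOCK=ON to remove most perf schema instrumentation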




