Small Datum: The impact of PGO for MySQL

This post explains the benefit from PGO (profile guided optimization) on MySQL. A previous post showed that LTO (link time optimization) reduces CPU (improves throughput) by about 5% for CPU-bound sysbench.

The goals here are:

Determine the impact of PGO
Determine the impact of PGO + LTO
Determine whether a PGO binary that used sysbench to generate the profile is useful when running other benchmarks (in short, yes, but the longer answer waits for another blog post).
Determine whether PGO helps MySQL 8.0 more than 5.6 and 5.7. A hypothesis is that the many perf regressions in MySQL 8.0 are from code bloat. Perhaps PGO helps more with more-bloated code than it does with less-bloated code.
Determine the impact from -Os vs -O2. A hypothesis is that -Os will undo some code bloat. Note that -Os includes the -O2 optimizations except the ones that increase code size.
Document the 5.6->5.7 and 5.7->8.0 regressions
Determine whether the regressions at low-concurrency are offset by improvements at high-concurrency

Just saw this great post on using BOLT with Postgres from Tomas Vondra.

tl;dr

PGO is good, PGO + LTO is better
PGO helps MySQL 8.0 more than 5.6
PGO does not undo the perf regressions that are new in 8.0
PGO improves results on the small servers more than on the medium server
Regressions from 5.7 to 8.0 are larger than from 5.6 to 5.7
Performance is worse with -Os compared to -O2

Updates:

I retracted the results for PGO+LTO because mistakes were made and I will redo that work. The updated results are here for a laptop-class CPU and pending for a server-class CPU.

Builds

I used InnoDB from MySQL 5.6.35, 5.6.51, 5.7.44 and 8.0.37. The compiler was gcc 11.4.0.

I used -fprofile-generate to get binaries that create PGO profiles
I used -fprofile-use -fprofile-correction to get binaries that used PGO profiles
I used -Os or -O2 (see CMake command lines here)
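A minimal sketch of the two-pass gcc workflow implied by the list above. Only the -fprofile-generate, -fprofile-use, -fprofile-correction and -O2 flags come from this post. The helper names (pgo_flags, pgo_cmake_cmd) and the use of CMAKE_C_FLAGS and CMAKE_CXX_FLAGS are assumptions for illustration and are not the linked CMake command lines. The script prints the commands rather than running them.

```shell
#!/bin/sh
# Dry-run sketch of a two-pass gcc PGO build.
# Hypothetical helpers: they print commands, they do not build anything.

# Compiler flags for each build stage.
pgo_flags() {
    case "$1" in
        generate) printf '%s\n' "-O2 -fprofile-generate" ;;
        use)      printf '%s\n' "-O2 -fprofile-use -fprofile-correction" ;;
        *)        echo "unknown stage: $1" >&2; return 1 ;;
    esac
}

# Print a CMake invocation that passes the stage flags to the C and C++ compilers.
pgo_cmake_cmd() {
    flags=$(pgo_flags "$1") || return 1
    echo "cmake .. -DCMAKE_C_FLAGS=\"$flags\" -DCMAKE_CXX_FLAGS=\"$flags\""
}

pgo_cmake_cmd generate   # pass 1: build, then run the workload to write profiles
pgo_cmake_cmd use        # pass 2: rebuild so the compiler reads those profiles
```

The -fprofile-correction flag matters for a multi-threaded server like mysqld because the counters are updated without synchronization and can be inconsistent when read back.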

I also did one test with MyRocks from FB MySQL 8.0.32 but my focus is upstream MySQL.

To get PGO for clang

compile with -fprofile-generate
start mysqld with LLVM_PROFILE_FILE=$PWD/code-%p.profraw <mysqld command line>. If you don't use an absolute path then things won't work because mysqld calls chdir($data-dir) at startup and that confuses the clang profiling code
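The clang steps above can be sketched as follows. The post only covers the compile flag and the LLVM_PROFILE_FILE setting. The merge step with llvm-profdata and the rebuild with -fprofile-use= are the usual follow-up for clang and are an assumption about what comes next, as are the helper name profile_pattern and the file name code.profdata. The script prints the commands and leaves the mysqld command line elided.

```shell
#!/bin/sh
# Dry-run sketch of the clang PGO steps. Names below are hypothetical.

# Build an absolute LLVM_PROFILE_FILE pattern. %p expands to the process ID
# at runtime. A relative path is rejected because mysqld changes directory
# at startup.
profile_pattern() {
    case "$1" in
        /*) printf '%s/code-%%p.profraw\n' "$1" ;;
        *)  echo "profile dir must be an absolute path: $1" >&2; return 1 ;;
    esac
}

PROFILE_DIR=$PWD
echo "LLVM_PROFILE_FILE=$(profile_pattern "$PROFILE_DIR") <mysqld command line>"

# After the workload finishes: merge the raw profiles, then rebuild with them.
echo "llvm-profdata merge -output=$PROFILE_DIR/code.profdata $PROFILE_DIR/code-*.profraw"
echo "rebuild with -fprofile-use=$PROFILE_DIR/code.profdata"
```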

Hardware

I tested on three servers:

SER4 - Beelink SER 4700u (see here) with 8 cores and a Ryzen 7 4700u CPU
PN53 - ASUS ExpertCenter PN53 (see here) with 8 cores and an AMD Ryzen 7 7735HS CPU. The CPU on the PN53 is newer than the CPU on the SER4.
C2D - a c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU and SMT disabled so there are 16 cores

All servers use Ubuntu 22.04 with ext4.

Benchmark

I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by MyRocks and InnoDB.

The benchmark is run with:

SER4, PN53 - 1 thread, 1 table and 30M rows
C2D - 12 threads, 8 tables and 10M rows per table
each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise
prepared statements were enabled

The command lines for my helper scripts were:

# PN53, SER4

bash r.sh 1 30000000 300 600 nvme0n1 1 1 1

# C2D

bash r.sh 8 10000000 300 600 md0 1 1 12

Results

For the results below I split the 42 microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. The spreadsheets with all data are here for SER4, PN53 and C2D. For each microbenchmark group there is a table with summary statistics.

The numbers in the spreadsheets are the relative QPS which is (QPS for my version) / (QPS for base case). When the relative QPS is > 1 then that version is faster than the base case.
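As a small illustration of that definition, the sketch below computes a relative QPS and the percent change it implies. The function names and the QPS inputs are made up for the example and are not results from these tests.

```shell
#!/bin/sh
# Relative QPS = (QPS for my version) / (QPS for base case).
rel_qps() {
    LC_ALL=C awk -v mine="$1" -v base="$2" 'BEGIN { printf "%.2f\n", mine / base }'
}

# Percent change implied by a relative QPS. Negative means slower than the base case.
pct_change() {
    LC_ALL=C awk -v r="$1" 'BEGIN { printf "%.0f\n", (r - 1) * 100 }'
}

# Hypothetical inputs: 7500 QPS for my version, 10000 QPS for the base case.
r=$(rel_qps 7500 10000)
echo "relative QPS: $r, change vs base: $(pct_change "$r")%"
```

This is the conversion behind statements like "~25% slower" in the sections that follow: a relative QPS of 0.75 is a 25% drop from the base case.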

Results: without PGO

I use summary statistics per microbenchmark group rather than charts because there would be too many charts. The numbers are the relative QPS. MySQL 5.6.35 is the base case. MySQL 5.7.44 or 8.0.37 are slower than 5.6.35 when the relative QPS is less than 1. I focus on the median value per microbenchmark group.
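The four summary statistics in each table can be computed with a short pipeline. This is a sketch, not the script used for the spreadsheets. The function name summarize and the input values are hypothetical.

```shell
#!/bin/sh
# Read one relative QPS value per line and print min, max, avg and median,
# tab-separated, in the same column order as the tables in this post.
summarize() {
    LC_ALL=C sort -n | LC_ALL=C awk '
        { v[NR] = $1; sum += $1 }
        END {
            if (NR == 0) exit 1
            # Input is sorted, so the median is the middle value, or the
            # mean of the two middle values when the count is even.
            if (NR % 2) med = v[(NR + 1) / 2]
            else        med = (v[NR / 2] + v[NR / 2 + 1]) / 2
            printf "%.2f\t%.2f\t%.2f\t%.2f\n", v[1], v[NR], sum / NR, med
        }'
}

# Hypothetical relative QPS values for the microbenchmarks in one group.
printf '%s\n' 0.80 0.93 0.81 | summarize
```

The median is less sensitive than the average to one outlier microbenchmark, which is one reason to focus on it when a group has a wide min to max range.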

Results from SER4 (older small server, low concurrency)

5.7.44 is ~15% slower than 5.6.35 for most reads and ~15% slower for writes
8.0.37 is ~30% slower than 5.6.35 for most reads and ~40% slower for writes
Regressions from 5.7->8.0 and 5.6->5.7 are similar

5.7.44	min	max	avg	median
point-1	0.85	0.89	0.86	0.85
point-2	0.78	0.99	0.88	0.85
range-1	0.80	0.93	0.82	0.81
range-2	0.85	1.21	1.01	0.96
writes	0.74	1.29	0.90	0.85

8.0.37	min	max	avg	median
point-1	0.65	0.78	0.70	0.69
point-2	0.61	0.80	0.70	0.69
range-1	0.64	0.69	0.66	0.66
range-2	0.65	0.96	0.79	0.75
writes	0.47	1.02	0.62	0.59

Results from PN53 (newer small server, low concurrency)

5.7.44 is ~12% slower than 5.6.35 for most reads and ~11% slower for writes
8.0.37 is ~30% slower than 5.6.35 for most reads and ~34% slower for writes
Regressions from 5.7->8.0 are larger than from 5.6->5.7

5.7.44	min	max	avg	median
point-1	0.86	0.92	0.89	0.88
point-2	0.86	1.02	0.92	0.88
range-1	0.83	0.90	0.86	0.84
range-2	0.90	1.23	1.04	0.99
writes	0.82	1.17	0.92	0.89

8.0.37	min	max	avg	median
point-1	0.66	0.78	0.71	0.70
point-2	0.70	0.83	0.75	0.71
range-1	0.67	0.71	0.69	0.69
range-2	0.74	0.98	0.85	0.82
writes	0.57	0.97	0.69	0.66
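Because both tables use 5.6.35 as the base case, dividing the 8.0.37 value by the 5.7.44 value gives an estimate of 8.0.37 relative to 5.7.44. The sketch below does that with the PN53 medians above. The function name step_ratio is made up, and a ratio of medians is only an approximation of the per-microbenchmark ratios.

```shell
#!/bin/sh
# Ratio of two relative QPS values that share the same base case.
# The result estimates the QPS of the newer version relative to the older one.
step_ratio() {
    LC_ALL=C awk -v newer="$1" -v older="$2" 'BEGIN { printf "%.2f\n", newer / older }'
}

# Median relative QPS on PN53: 8.0.37 first, then 5.7.44.
echo "point-1: $(step_ratio 0.70 0.88)"
echo "writes:  $(step_ratio 0.66 0.89)"
```

For point-1 the 5.6->5.7 step is 0.88 (about 12% slower) while the 5.7->8.0 step is about 0.80 (about 20% slower). For writes the steps are 0.89 and about 0.74. That is consistent with the claim that the regressions from 5.7->8.0 are larger than from 5.6->5.7 on this server.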

Results from C2D (medium server/concurrency)

5.7.44 is slower on some reads, faster on others and much faster on writes vs 5.6.35
8.0.37 is up to 29% slower on reads and much faster on writes vs 5.6.35
There are large regressions from 5.7 to 8.0

5.7.44	min	max	avg	median
point-1	0.86	1.44	1.03	0.94
point-2	0.91	1.29	1.11	1.13
range-1	0.85	1.00	0.88	0.86
range-2	1.15	1.26	1.20	1.20
writes	1.45	3.96	2.46	2.49

8.0.37	min	max	avg	median
point-1	0.71	1.10	0.83	0.75
point-2	0.73	1.04	0.88	0.89
range-1	0.67	0.79	0.71	0.71
range-2	0.93	1.05	0.99	0.99
writes	1.25	3.06	1.91	1.91

Results: with PGO

Results from SER4 (older small server, low concurrency)

5.7.44 is ~11% slower than 5.6.35 for most reads and ~6% slower for writes
8.0.37 is ~25% slower than 5.6.35 for most reads and ~34% slower for writes
Regressions from 5.7->8.0 are larger than 5.6->5.7
PGO helps 8.0 more than 5.6

5.7.44	min	max	avg	median
point-1	0.87	1.02	0.91	0.89
point-2	0.86	1.03	0.93	0.89
range-1	0.74	0.89	0.85	0.87
range-2	0.97	1.25	1.08	1.02
writes	0.84	1.13	0.96	0.94

8.0.37	min	max	avg	median
point-1	0.64	0.88	0.77	0.76
point-2	0.70	0.86	0.77	0.74
range-1	0.51	0.74	0.69	0.72
range-2	0.80	1.00	0.88	0.85
writes	0.55	1.02	0.70	0.66

Results from PN53 (newer small server, low concurrency)

5.7.44 is ~10% slower than 5.6.35 for most reads and ~4% slower for writes
8.0.37 is ~25% slower than 5.6.35 for most reads and ~26% slower for writes
Regressions from 5.7->8.0 are larger than 5.6->5.7
PGO helps 8.0 more than 5.6

5.7.44	min	max	avg	median
point-1	0.88	0.92	0.90	0.90
point-2	0.89	1.02	0.93	0.90
range-1	0.81	0.89	0.87	0.88
range-2	0.93	1.31	1.10	1.04
writes	0.85	1.23	1.00	0.96

8.0.37	min	max	avg	median
point-1	0.67	0.82	0.76	0.75
point-2	0.75	0.87	0.79	0.75
range-1	0.58	0.74	0.71	0.73
range-2	0.80	1.02	0.91	0.90
writes	0.61	1.02	0.76	0.74

Results from C2D (medium server/concurrency)

5.7.44 is slower on some reads, faster on others and much faster on writes vs 5.6.35
8.0.37 is slower on some reads, faster on others and much faster on writes vs 5.6.35
There are large regressions from 5.7 to 8.0
PGO helps 8.0 more than 5.6

5.7.44	min	max	avg	median
point-1	0.86	1.73	1.14	1.04
point-2	0.97	1.44	1.24	1.30
range-1	0.84	0.89	0.87	0.88
range-2	1.25	1.33	1.29	1.29
writes	1.38	4.29	2.68	2.72

8.0.37	min	max	avg	median
point-1	0.75	1.36	0.96	0.87
point-2	0.82	1.23	1.04	1.09
range-1	0.60	0.76	0.72	0.72
range-2	1.05	1.09	1.07	1.07
writes	1.27	3.09	2.14	2.14

Results: PGO + LTO

I retracted these results because mistakes were made and I will redo the work.

Results: gcc -Os

I use summary statistics per microbenchmark group rather than charts because there would be too many charts. The numbers are the relative QPS. The base case is MySQL 8.0.37 with -O2 and that is compared to 8.0.37 with -Os. When the relative QPS is less than 1 then the base case is faster. I focus on the median value per microbenchmark group.

Results from SER4 (older small server, low concurrency)

Performance is worse with -Os compared to -O2

-Os	min	max	avg	median
point-1	0.72	0.79	0.77	0.78
point-2	0.77	0.80	0.79	0.79
range-1	0.52	0.77	0.70	0.72
range-2	0.68	0.74	0.72	0.73
writes	0.49	0.80	0.75	0.78

Results from PN53 (newer small server, low concurrency)

Performance is worse with -Os compared to -O2

-Os	min	max	avg	median
point-1	0.77	0.87	0.83	0.84
point-2	0.81	0.85	0.84	0.85
range-1	0.64	0.84	0.76	0.75
range-2	0.78	0.81	0.80	0.81
writes	0.68	0.82	0.80	0.81

Results from C2D (medium server/concurrency)

Performance is worse with -Os compared to -O2

-Os	min	max	avg	median
point-1	0.80	0.84	0.82	0.83
point-2	0.82	0.83	0.83	0.83
range-1	0.61	0.84	0.76	0.77
range-2	0.67	0.83	0.77	0.81
writes	0.80	0.90	0.86	0.87

Tuesday, July 2, 2024
