Monday, October 21, 2024

MySQL 8.0.40 does not fix the regressions I hoped it would fix

Performance regressions arrived in InnoDB with MySQL 8.0.30. Eventually multiple bugs were filed. The worst regressions were from changes to the hash function (perhaps fixed in 8.0.36) and from changes to how functions are inlined for InnoDB (bug 111538). The problems are obvious if you run CPU-bound workloads, and my CPU-bound workload is sysbench with a cached database.

Bug 111538 is now closed and marked as fixed in 8.0.40. Alas, there are still significant CPU perf regressions in 8.0.40 relative to 8.0.28. My advice to upstream is to stop innovating if you don't have the CI setup to catch the new performance problems that your innovation creates. Using something like Nyrkio would help.

This post has results from sysbench on several servers using MySQL 8.0.28, 8.0.30, 8.0.33, 8.0.39 and 8.0.40 to show there are large regressions starting in 8.0.30, many of which are still there in 8.0.40. Tests were repeated on 5 different servers.

tl;dr

  • SELECT statements with a large in-list use much less CPU starting in MySQL 8.0.31 because bug 102037 was fixed. I found that via sysbench and filed a bug against 8.0.22
  • bug 111538 should not have been closed as fixed
  • The scan microbenchmark still has regressions from 8.0.28 to 8.0.40
    • For all servers the QPS is less in 8.0.40 than in 8.0.28
    • On 4 of 5 servers the QPS is less in 8.0.40 than in 8.0.30
  • The update microbenchmark still has regressions from 8.0.28 to 8.0.40
    • In all cases the QPS is less in 8.0.40 than in 8.0.28
    • From 8.0.30 to 8.0.40 -- on 2 servers the QPS is less in 8.0.40, on 2 it is about the same and on 1 it has improved
    • Regressions after 8.0.28 are bad for the small servers (beelink and pn53 below) and really bad (QPS drops in half) for one of the large servers (dell32 below). That is the subject of a pending blog post.

Builds

I used MySQL 8.0.28, 8.0.30, 8.0.33, 8.0.39 and 8.0.40 compiled from source using CMAKE_BUILD_TYPE=Release, -O2 and -fno-omit-frame-pointer.
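
A minimal sketch of such a build, assuming the -O2 and frame-pointer flags replace the default Release flags (my actual CMake command lines differ slightly per version):

    # sketch: override the default Release flags (-O3) with -O2 + frame pointers
    cmake .. -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_C_FLAGS_RELEASE="-O2 -fno-omit-frame-pointer -DNDEBUG" \
        -DCMAKE_CXX_FLAGS_RELEASE="-O2 -fno-omit-frame-pointer -DNDEBUG" \
        -DDOWNLOAD_BOOST=1 -DWITH_BOOST=$HOME/boost
    make -j8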

Hardware

The servers are
  • beelink
    • Beelink SER4 with an AMD Ryzen 7 4700U CPU with SMT disabled, 8 cores, 16G of RAM, Ubuntu 22.04 and ext4 on 1 NVMe device.
  • pn53
    • ASUS ExpertCenter PN53 with AMD Ryzen 7 7735HS, with SMT disabled, 8 cores, 32G RAM, Ubuntu 22.04 and ext4 on 1 NVMe device.
  • socket2
    • SuperMicro SuperWorkstation 7049A-T with 2 sockets, 12 cores/socket, 64G RAM, one m.2 SSD (2TB, ext4). The CPUs are Intel Xeon Silver 4214R @ 2.40GHz.
  • dell32
    • Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32 cores, 2 m.2 SSD (each 2TB, SW RAID 0, ext4)
  • ax162-s
    • AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G RAM, Ubuntu 22.04 and ext4 on 2 NVMe devices with SW RAID 1. This is in the Hetzner cloud.

Benchmark

I used sysbench and my usage is explained here. A full run has 42 microbenchmarks and most test only 1 type of SQL statement. In some cases I skip the read-only tests that run prior to writes to save time. The database is cached by InnoDB.

The benchmark is run with ...
  • beelink - 1 thread, 1 table, 30M rows
  • pn53 - 1 thread, 1 table, 50M rows
  • socket2 - 16 threads, 8 tables, 10M rows/table
  • dell32 - 24 threads, 8 tables, 10M rows/table
  • ax162-s - 40 threads, 8 tables, 10M rows/table
Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.
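
For illustration only, one microbenchmark at the beelink settings might be invoked like the sketch below; my actual runs use a helper script and customized Lua files, so the script name and options here are assumptions:

    # sketch: a point-query microbenchmark, 1 thread, 1 table, 30M rows
    sysbench oltp_point_select --mysql-user=root --mysql-db=test \
        --tables=1 --table-size=30000000 --threads=1 --time=300 run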

Results 

All of the results use relative QPS (rQPS) where:
  • rQPS is: (QPS for my version / QPS for base version)
  • base version is MySQL 8.0.28
  • my version is one of the other versions (8.0.30, 8.0.33, 8.0.39, 8.0.40)
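
For example, using the beelink scan results below: 8.0.28 does 55 scan operations/second and 8.0.40 does 43, so rQPS = 43 / 55 = 0.78, which means 8.0.40 sustains only 78% of the 8.0.28 throughput.
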
The scan microbenchmark is the canary in the coal mine for bug 111538 as most of the CPU time is spent in InnoDB. Regressions that arrived after 8.0.28 remain unfixed in 8.0.40. The rQPS drops from MySQL 8.0.30 to 8.0.40 on 4 of the 5 servers.

                rQPS    rQPS
server          8.0.30  8.0.40
------          ------  ------
beelink         0.89    0.78
pn53            0.91    0.83
socket2         0.90    0.84
dell32          0.70    0.91
ax162-s         0.91    0.83

The update-index microbenchmark also has large regressions after 8.0.28. The QPS drops from 8.0.30 to 8.0.40 on 2 servers (beelink, pn53), remains about the same on 2 of them (socket2, dell32) and improves on 1 (ax162-s). The result for dell32 is lousy, as update QPS drops by almost half after 8.0.28, and I will have more on that in a pending blog post.

                rQPS    rQPS
server          8.0.30  8.0.40
------          ------  ------
beelink         0.88    0.76
pn53            0.91    0.78
socket2         0.89    0.92
dell32          0.56    0.56
ax162-s         0.80    0.90

Results: beelink

Summary
  • the big improvement to random-points starting in 8.0.33 is from fixing bug 102037
  • for scan the problem is new CPU overhead after 8.0.28
  • for update-index the problem is more CPU overhead, (maybe) more context switches and (maybe) more KB written to storage per update
Relative to: x.my8028_rel_o2nofp.z11a_bee.pk1
col-1 : x.my8030_rel_o2nofp.z11a_bee.pk1
col-2 : x.my8033_rel_o2nofp.z11a_bee.pk1
col-3 : x.my8039_rel_o2nofp.z11a_bee.pk1
col-4 : x.my8040_rel_o2nofp.z11a_bee.pk1

col-1 col-2 col-3 col-4
0.91 1.06 1.09 1.13 hot-points_range=100
1.00 0.91 0.92 0.92 point-query.pre_range=100
1.00 0.91 0.93 0.92 point-query_range=100
0.91 1.01 1.03 1.06 points-covered-pk.pre_range=100
0.90 1.00 1.03 1.06 points-covered-pk_range=100
0.89 1.00 1.03 1.07 points-covered-si.pre_range=100
0.89 1.00 1.03 1.06 points-covered-si_range=100
0.89 1.00 1.02 1.04 points-notcovered-pk.pre_range=100
0.89 1.00 1.02 1.05 points-notcovered-pk_range=100
0.88 0.93 0.98 1.01 points-notcovered-si.pre_range=100
0.88 0.93 0.98 1.01 points-notcovered-si_range=100
0.96 2.20 2.30 2.35 random-points.pre_range=1000
0.90 1.00 1.03 1.05 random-points.pre_range=100
0.94 0.91 0.93 0.94 random-points.pre_range=10
0.94 2.20 2.30 2.35 random-points_range=1000
0.90 1.00 1.03 1.05 random-points_range=100
0.93 0.91 0.93 0.94 random-points_range=10
0.95 0.90 0.90 0.96 range-covered-pk.pre_range=100
0.95 0.90 0.90 0.96 range-covered-pk_range=100
0.96 0.91 0.91 0.97 range-covered-si.pre_range=100
0.95 0.90 0.90 0.98 range-covered-si_range=100
0.97 0.93 0.94 0.97 range-notcovered-pk.pre_range=100
0.96 0.92 0.93 0.97 range-notcovered-pk_range=100
0.88 0.86 0.91 0.94 range-notcovered-si.pre_range=100
0.87 0.86 0.91 0.94 range-notcovered-si_range=100
0.97 0.94 0.93 0.98 read-only.pre_range=10000
0.97 0.93 0.93 0.94 read-only.pre_range=100
0.99 0.93 0.94 0.94 read-only.pre_range=10
0.98 0.95 0.94 0.99 read-only_range=10000
1.00 0.95 0.95 0.97 read-only_range=100
0.98 0.94 0.94 0.96 read-only_range=10
0.89 0.82 0.80 0.78 scan_range=100
0.95 0.88 0.89 0.90 delete_range=100
0.94 0.88 0.88 0.88 insert_range=100
0.97 0.93 0.93 0.95 read-write_range=100
0.96 0.92 0.92 0.93 read-write_range=10
0.88 0.82 0.84 0.76 update-index_range=100
0.94 0.87 0.89 0.90 update-inlist_range=100
0.96 0.89 0.90 0.91 update-nonindex_range=100
0.97 0.91 0.91 0.91 update-one_range=100
0.95 0.90 0.90 0.91 update-zipf_range=100
0.93 0.88 0.89 0.87 write-only_range=10000

For scan the problem is CPU overhead (see cpu/o below) which is 1.28X larger in 8.0.40 vs 8.0.28.

sb.met.scan.range100.pk1.dop1
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.222553        2.534   0       0.001   0.035   55      x.my8028_rel_o2nofp
0.251913        6.640   0       0       0.052   49      x.my8030_rel_o2nofp
0.273314        7.107   0       0       0.039   45      x.my8033_rel_o2nofp
0.282176        15.578  0       0       0.053   44      x.my8039_rel_o2nofp
0.285792        7.622   0       0       0.041   43      x.my8040_rel_o2nofp
--- relative to first result
1.13            2.62    1       0.00    1.49    0.89    x.my8030_rel_o2nofp
1.23            2.80    1       0.00    1.11    0.82    x.my8033_rel_o2nofp
1.27            6.15    1       0.00    1.51    0.80    x.my8039_rel_o2nofp
1.28            3.01    1       0.00    1.17    0.78    x.my8040_rel_o2nofp

For update-index there are several changes after 8.0.28:
  • CPU per update (cpu/o) is 1.33X larger in 8.0.40 vs 8.0.28
  • KB written per update (wKB/o) is 1.47X larger in 8.0.40 vs 8.0.28. 
    • A possible cause is that the writeback code path isn't slowed by regressions while the update codepath is slowed. But I am just waving my hands.
    • This one is confusing because I use the same total redo log size for all versions. However, redo log configuration changed in 8.0.30 (see here) and as a result I used 15 X 1G log segments in 8.0.28 vs 32 X 480M log segments in 8.0.40. I repeated the 8.0.28 tests with 32 X 480M log segments and the results (QPS, HW metrics) do not change. Using the Pages ..., written line from SHOW ENGINE INNODB STATUS, the pages written per update value is 0.97 for 8.0.28 vs 1.31 for 8.0.40 (see the sketch after this list). Perhaps 8.0.30+ is more aggressive about dirty page writeback, but I use the same values for innodb_max_dirty_pages_pct_lwm (=80) and innodb_max_dirty_pages_pct (=90) from 8.0.28 to 8.0.40.
  • Context switches per update (cs/o) is 1.24X larger in 8.0.40 vs 8.0.28, but I am less sure this is a problem as the absolute rates (15.312 vs 18.917 per update) are not that big.
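
The pages written per update estimate comes from sampling SHOW ENGINE INNODB STATUS around a run. A rough sketch of that arithmetic, where the status line format and the $num_updates variable (the number of updates done between the two samples) are assumptions:

    # sketch: assumes the line looks like "Pages read X, created Y, written Z"
    # and takes the first match (the buffer pool totals section)
    w0=$(mysql -e 'SHOW ENGINE INNODB STATUS\G' | awk '/^Pages read/ {print $NF+0; exit}')
    # ... run the update-index microbenchmark here ...
    w1=$(mysql -e 'SHOW ENGINE INNODB STATUS\G' | awk '/^Pages read/ {print $NF+0; exit}')
    echo "pages written per update: $(echo "($w1 - $w0) / $num_updates" | bc -l)"
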
sb.met.update-index.range100.pk1.dop1
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.008211        15.312  0       0       47.066  2649    x.my8028_rel_o2nofp
0.009487        17.353  0       0       66.043  2337    x.my8030_rel_o2nofp
0.009875        17.636  0       0       66.695  2178    x.my8033_rel_o2nofp
0.010162        17.959  0       0       67.203  2216    x.my8039_rel_o2nofp
0.010903        18.917  0       0       68.954  2024    x.my8040_rel_o2nofp
--- relative to first result
1.16            1.13    1       1       1.40    0.88    x.my8030_rel_o2nofp
1.20            1.15    1       1       1.42    0.82    x.my8033_rel_o2nofp
1.24            1.17    1       1       1.43    0.84    x.my8039_rel_o2nofp
1.33            1.24    1       1       1.47    0.76    x.my8040_rel_o2nofp

Results: pn53

Summary
  • the big improvement to random-points starting in 8.0.33 is from fixing bug 102037
  • for scan the problem is new CPU overhead after 8.0.28
  • for update-index the problem is more CPU overhead, (maybe) more context switches and (maybe) more KB written to storage per update
  • the changes to HW overheads here are similar to the changes above for the beelink server
Relative to: x.my8028_rel_o2nofp.z11a_c8r32.pk1
col-1 : x.my8030_rel_o2nofp.z11a_c8r32.pk1
col-2 : x.my8033_rel_o2nofp.z11a_c8r32.pk1
col-3 : x.my8039_rel_o2nofp.z11a_c8r32.pk1
col-4 : x.my8040_rel_o2nofp.z11a_c8r32.pk1

col-1 col-2 col-3 col-4
0.89 1.12 1.13 1.18 hot-points_range=100
0.97 0.94 0.93 0.94 point-query.pre_range=100
0.98 0.95 0.94 0.94 point-query_range=100
0.89 1.03 1.05 1.10 points-covered-pk_range=100
0.87 1.02 1.04 1.09 points-covered-si_range=100
0.91 1.03 1.05 1.11 points-notcovered-pk_range=100
0.84 0.96 1.01 1.06 points-notcovered-si_range=100
0.95 2.14 2.21 2.30 random-points_range=1000
0.89 1.03 1.06 1.10 random-points_range=100
0.96 0.94 0.95 0.97 random-points_range=10
0.97 0.93 0.92 0.99 range-covered-pk_range=100
0.98 0.93 0.93 0.98 range-covered-si_range=100
0.98 0.92 0.94 0.97 range-notcovered-pk_range=100
0.91 0.89 0.92 0.99 range-notcovered-si.pre_range=100
0.92 0.89 0.92 0.99 range-notcovered-si_range=100
0.97 0.94 0.98 1.01 read-only_range=10000
0.98 0.94 0.95 0.96 read-only_range=100
0.98 0.94 0.94 0.96 read-only_range=10
0.91 0.83 0.83 0.83 scan_range=100
0.96 0.91 0.92 0.92 delete_range=100
0.94 0.90 0.90 0.91 insert_range=100
0.97 0.93 0.94 0.95 read-write_range=100
0.97 0.93 0.94 0.95 read-write_range=10
0.91 0.85 0.84 0.78 update-index_range=100
0.94 0.91 0.91 0.94 update-inlist_range=100
0.96 0.92 0.92 0.93 update-nonindex_range=100
0.96 0.92 0.92 0.92 update-one_range=100
0.96 0.92 0.92 0.93 update-zipf_range=100
0.94 0.90 0.91 0.90 write-only_range=10000

For scan the problem is CPU overhead (see cpu/o below) which is 1.22X larger in 8.0.40 vs 8.0.28.

sb.met.scan.range100.pk1.dop1
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.532922        6.523   0       0       0.092   23      x.my8028_rel_o2nofp
0.586497        15.751  0       0       0.122   21      x.my8030_rel_o2nofp
0.650201        17.147  0       0       0.105   19      x.my8033_rel_o2nofp
0.651073        17.145  0       0       0.098   19      x.my8039_rel_o2nofp
0.649427        17.975  0       0       0.128   19      x.my8040_rel_o2nofp
--- relative to first result
1.10            2.41    1       1       1.33    0.91    x.my8030_rel_o2nofp
1.22            2.63    1       1       1.14    0.83    x.my8033_rel_o2nofp
1.22            2.63    1       1       1.07    0.83    x.my8039_rel_o2nofp
1.22            2.76    1       1       1.39    0.83    x.my8040_rel_o2nofp

For update-index there are several changes after 8.0.28:
  • the changes here are similar to the ones above for the beelink server
  • CPU per update (cpu/o) is 1.29X larger in 8.0.40 vs 8.0.28
  • KB written per update (wKB/o) is 1.48X larger in 8.0.40 vs 8.0.28
  • context switches per update (cs/o) is 1.22X larger in 8.0.40 vs 8.0.28
sb.met.update-index.range100.pk1.dop1
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.004972        11.052  0       0       43.246  4198    x.my8028_rel_o2nofp
0.005658        11.948  0       0       61.606  3810    x.my8030_rel_o2nofp
0.005927        12.288  0       0       62.306  3578    x.my8033_rel_o2nofp
0.005997        12.536  0       0       62.804  3544    x.my8039_rel_o2nofp
0.006401        13.486  0       0       63.98   3294    x.my8040_rel_o2nofp
--- relative to first result
1.14            1.08    1       1       1.42    0.91    x.my8030_rel_o2nofp
1.19            1.11    1       1       1.44    0.85    x.my8033_rel_o2nofp
1.21            1.13    1       1       1.45    0.84    x.my8039_rel_o2nofp
1.29            1.22    1       1       1.48    0.78    x.my8040_rel_o2nofp

Results: socket2

Summary
  • the big improvement to random-points starting in 8.0.33 is from fixing bug 102037
  • for scan the problem is new CPU overhead after 8.0.28
  • for update-index the problem is more CPU overhead, (maybe) more context switches and (maybe) more KB written to storage per update
  • the changes to HW overheads here and the regressions here are similar to but smaller than the changes above for the beelink and pn53 servers
Relative to: x.my8028_rel_o2nofp.z11a_c24r64.pk1
col-1 : x.my8030_rel_o2nofp.z11a_c24r64.pk1
col-2 : x.my8033_rel_o2nofp.z11a_c24r64.pk1
col-3 : x.my8039_rel_o2nofp.z11a_c24r64.pk1
col-4 : x.my8040_rel_o2nofp.z11a_c24r64.pk1

col-1 col-2 col-3 col-4
0.94 1.04 1.01 1.10 hot-points_range=100
0.97 0.94 0.97 0.96 point-query.pre_range=100
0.99 0.96 0.97 0.97 point-query_range=100
0.92 1.00 1.04 1.06 points-covered-pk_range=100
0.90 0.98 1.03 1.04 points-covered-si_range=100
0.92 1.00 1.04 1.06 points-notcovered-pk_range=100
0.89 0.94 0.98 1.00 points-notcovered-si_range=100
0.93 1.76 1.81 1.87 random-points_range=1000
0.92 1.00 1.03 1.06 random-points_range=100
0.94 0.93 0.95 0.97 random-points_range=10
0.98 0.94 0.95 1.00 range-covered-pk_range=100
0.97 0.94 0.94 1.00 range-covered-si_range=100
0.98 0.94 0.95 0.98 range-notcovered-pk_range=100
0.91 0.91 0.94 0.98 range-notcovered-si.pre_range=100
0.90 0.90 0.93 0.98 range-notcovered-si_range=100
0.98 0.97 0.97 1.00 read-only_range=10000
0.99 0.96 0.96 0.98 read-only_range=100
0.99 0.95 0.96 0.97 read-only_range=10
0.90 0.84 0.80 0.84 scan_range=100
0.95 0.93 0.94 0.94 delete_range=100
0.93 0.92 0.93 0.93 insert_range=100
0.97 0.94 0.95 0.96 read-write_range=100
0.97 0.94 0.95 0.96 read-write_range=10
0.89 0.90 0.88 0.92 update-index_range=100
0.96 0.96 0.98 0.98 update-inlist_range=100
0.95 0.94 0.94 0.94 update-nonindex_range=100
0.95 0.93 0.93 0.93 update-one_range=100
0.96 0.95 0.94 0.94 update-zipf_range=100
0.91 0.90 0.92 0.90 write-only_range=10000

For scan the problem is CPU overhead (see cpu/o below) which is 1.16X larger in 8.0.40 vs 8.0.28.

sb.met.scan.range100.pk1.dop8
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.180593        4.756   0       0       0.007   174     x.my8028_rel_o2nofp
0.199852        5.846   0       0       0.008   156     x.my8030_rel_o2nofp
0.213651        6.156   0       0       0.008   146     x.my8033_rel_o2nofp
0.223563        6.615   0       0       0.008   140     x.my8039_rel_o2nofp
0.209727        6.136   0       0       0.008   146     x.my8040_rel_o2nofp
--- relative to first result
1.11            1.23    1       1       1.14    0.90    x.my8030_rel_o2nofp
1.18            1.29    1       1       1.14    0.84    x.my8033_rel_o2nofp
1.24            1.39    1       1       1.14    0.80    x.my8039_rel_o2nofp
1.16            1.29    1       1       1.14    0.84    x.my8040_rel_o2nofp

For update-index there are several changes after 8.0.28:
  • the changes here are similar to the ones above for the beelink and pn53 servers, but not as large, and the regression here is also smaller.
  • CPU per update (cpu/o) is 1.09X larger in 8.0.40 vs 8.0.28
  • KB written per update (wKB/o) is 1.30X larger in 8.0.40 vs 8.0.28
  • context switches per update (cs/o) is 1.05X larger in 8.0.40 vs 8.0.28
sb.met.update-index.range100.pk1.dop16
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.001677        11.125  0       0       16.578  38477   x.my8028_rel_o2nofp
0.001859        12.496  0       0       24.596  34256   x.my8030_rel_o2nofp
0.001849        11.945  0       0       22.664  34786   x.my8033_rel_o2nofp
0.001907        12.115  0       0       24.136  33971   x.my8039_rel_o2nofp
0.001823        11.670  0       0       21.581  35336   x.my8040_rel_o2nofp
--- relative to first result
1.11            1.12    1       1       1.48    0.89    x.my8030_rel_o2nofp
1.10            1.07    1       1       1.37    0.90    x.my8033_rel_o2nofp
1.14            1.09    1       1       1.46    0.88    x.my8039_rel_o2nofp
1.09            1.05    1       1       1.30    0.92    x.my8040_rel_o2nofp

Results: dell32

Summary
  • the big improvement to random-points starting in 8.0.33 is from fixing bug 102037
  • for scan the problem is new CPU overhead after 8.0.28
  • for update-index the problem is more CPU overhead, (maybe) more context switches and (maybe) more KB written to storage per update. I still don't understand why the regression for update-index here is so much worse than on the other servers.
Relative to: x.my8028_rel_o2nofp.z11a_c32r128.pk1
col-1 : x.my8030_rel_o2nofp.z11a_c32r128.pk1
col-2 : x.my8033_rel_o2nofp.z11a_c32r128.pk1
col-3 : x.my8039_rel_o2nofp.z11a_c32r128.pk1
col-4 : x.my8040_rel_o2nofp.z11a_c32r128.pk1

col-1 col-2 col-3 col-4
0.91 1.01 1.04 1.06 hot-points_range=100
0.98 0.94 0.93 0.95 point-query.pre_range=100
0.98 0.94 0.92 0.95 point-query_range=100
0.89 0.96 1.00 1.05 points-covered-pk.pre_range=100
0.89 0.96 1.00 1.05 points-covered-pk_range=100
0.87 0.93 1.00 1.04 points-covered-si.pre_range=100
0.88 0.94 1.01 1.05 points-covered-si_range=100
0.89 0.96 1.00 1.05 points-notcovered-pk.pre_range=100
0.89 0.96 1.00 1.05 points-notcovered-pk_range=100
0.89 0.90 0.98 1.02 points-notcovered-si.pre_range=100
0.88 0.90 0.97 1.01 points-notcovered-si_range=100
0.92 1.73 1.80 1.89 random-points.pre_range=1000
0.89 0.96 1.00 1.05 random-points.pre_range=100
0.94 0.92 0.93 0.96 random-points.pre_range=10
0.92 1.75 1.82 1.91 random-points_range=1000
0.89 0.96 1.00 1.05 random-points_range=100
0.94 0.92 0.93 0.96 random-points_range=10
0.96 0.93 0.93 0.99 range-covered-pk.pre_range=100
0.96 0.93 0.93 0.99 range-covered-pk_range=100
0.97 0.93 0.93 0.99 range-covered-si.pre_range=100
0.97 0.93 0.93 0.99 range-covered-si_range=100
0.96 0.92 0.93 0.96 range-notcovered-pk.pre_range=100
0.96 0.92 0.93 0.96 range-notcovered-pk_range=100
0.88 0.86 0.92 0.97 range-notcovered-si.pre_range=100
0.88 0.86 0.91 0.97 range-notcovered-si_range=100
0.98 0.96 0.98 1.01 read-only.pre_range=10000
0.97 0.93 0.94 0.95 read-only.pre_range=100
0.97 0.94 0.93 0.95 read-only.pre_range=10
0.99 0.96 0.97 1.01 read-only_range=10000
0.97 0.93 0.94 0.95 read-only_range=100
0.97 0.94 0.94 0.95 read-only_range=10
0.70 0.87 0.85 0.91 scan_range=100
0.94 0.92 0.93 0.93 delete_range=100
0.95 0.94 0.94 0.94 insert_range=100
0.97 0.93 0.94 0.95 read-write_range=100
0.97 0.93 0.93 0.95 read-write_range=10
0.56 0.56 0.56 0.56 update-index_range=100
0.99 0.99 1.00 1.03 update-inlist_range=100
0.96 0.95 0.95 0.95 update-nonindex_range=100
0.96 0.93 0.94 0.94 update-one_range=100
0.96 0.95 0.95 0.96 update-zipf_range=100
0.95 0.93 0.93 0.94 write-only_range=10000

For scan the problem is CPU overhead (see cpu/o below) which is 1.13X larger in 8.0.40 vs 8.0.28.

sb.met.scan.range100.pk1.dop8
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.093496        3.256   0       0       0.006   246     x.my8028_rel_o2nofp
0.104925        5.301   0       0       0.007   172     x.my8030_rel_o2nofp
0.110480        4.195   0       0       0.006   215     x.my8033_rel_o2nofp
0.113259        4.306   0       0       0.006   210     x.my8039_rel_o2nofp
0.106105        4.065   0       0       0.006   225     x.my8040_rel_o2nofp
--- relative to first result
1.12            1.63    1       1       1.17    0.70    x.my8030_rel_o2nofp
1.18            1.29    1       1       1.00    0.87    x.my8033_rel_o2nofp
1.21            1.32    1       1       1.00    0.85    x.my8039_rel_o2nofp
1.13            1.25    1       1       1.00    0.91    x.my8040_rel_o2nofp

For update-index there are several changes after 8.0.28:
  • the regression here is huge and I am not sure about the root causes
  • CPU per update (cpu/o) is 1.50X larger in 8.0.40 vs 8.0.28
  • KB written per update (wKB/o) is 2.32X larger in 8.0.40 vs 8.0.28
  • context switches per update (cs/o) is 1.41X larger in 8.0.40 vs 8.0.28
sb.met.update-index.range100.pk1.dop24
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.001185        10.777  0       0       14.275  51319   x.my8028_rel_o2nofp
0.001739        15.475  0       0       33.68   28515   x.my8030_rel_o2nofp
0.001794        15.192  0       0       32.953  28872   x.my8033_rel_o2nofp
0.001783        15.196  0       0       33.05   28983   x.my8039_rel_o2nofp
0.001781        15.242  0       0       33.139  28966   x.my8040_rel_o2nofp
--- relative to first result
1.47            1.44    1       1       2.36    0.56    x.my8030_rel_o2nofp
1.51            1.41    1       1       2.31    0.56    x.my8033_rel_o2nofp
1.50            1.41    1       1       2.32    0.56    x.my8039_rel_o2nofp
1.50            1.41    1       1       2.32    0.56    x.my8040_rel_o2nofp

Results: ax162-s

Summary
  • the big improvement to random-points starting in 8.0.33 is from fixing bug 102037
  • for scan the problem is new CPU overhead after 8.0.28
  • for update-index the problem is more CPU overhead, (maybe) more context switches and (maybe) more KB written to storage per update
  • the changes to HW overheads here and the regressions here are similar to but smaller than the changes above for the beelink and pn53 servers
Relative to: x.my8028_rel_o2nofp.z11a_c32r128.pk1
col-1 : x.my8030_rel_o2nofp.z11a_c32r128.pk1
col-2 : x.my8033_rel_o2nofp.z11a_c32r128.pk1
col-3 : x.my8039_rel_o2nofp.z11a_c32r128.pk1
col-4 : x.my8040_rel_o2nofp.z11a_c32r128.pk1

col-1 col-2 col-3 col-4
0.89 0.93 0.99 0.99 hot-points_range=100
0.98 0.94 0.94 0.94 point-query.pre_range=100
0.96 0.93 0.93 0.94 point-query_range=100
0.90 0.96 1.01 1.02 points-covered-pk.pre_range=100
0.89 0.95 1.00 1.02 points-covered-pk_range=100
0.87 0.89 0.95 0.98 points-covered-si.pre_range=100
0.88 0.89 0.96 0.99 points-covered-si_range=100
0.90 0.96 1.01 1.02 points-notcovered-pk.pre_range=100
0.89 0.95 1.00 1.02 points-notcovered-pk_range=100
0.86 0.88 0.94 0.96 points-notcovered-si.pre_range=100
0.87 0.88 0.94 0.96 points-notcovered-si_range=100
0.95 1.61 1.68 1.72 random-points.pre_range=1000
0.90 0.96 1.01 1.02 random-points.pre_range=100
0.92 0.91 0.95 0.94 random-points.pre_range=10
0.95 1.62 1.69 1.72 random-points_range=1000
0.89 0.95 1.00 1.02 random-points_range=100
0.91 0.90 0.94 0.94 random-points_range=10
0.92 0.90 0.92 0.95 range-covered-pk.pre_range=100
0.92 0.90 0.92 0.95 range-covered-pk_range=100
0.93 0.91 0.93 0.96 range-covered-si.pre_range=100
0.94 0.91 0.94 0.96 range-covered-si_range=100
0.93 0.90 0.93 0.94 range-notcovered-pk.pre_range=100
0.93 0.90 0.93 0.94 range-notcovered-pk_range=100
0.87 0.87 0.92 0.93 range-notcovered-si.pre_range=100
0.87 0.86 0.92 0.93 range-notcovered-si_range=100
0.99 0.97 0.98 1.01 read-only.pre_range=10000
0.95 0.92 0.93 0.94 read-only.pre_range=100
0.96 0.92 0.94 0.94 read-only.pre_range=10
0.99 0.97 0.98 1.00 read-only_range=10000
0.95 0.92 0.93 0.93 read-only_range=100
0.96 0.92 0.93 0.94 read-only_range=10
0.91 0.82 0.81 0.83 scan_range=100
0.93 0.91 0.94 0.93 delete_range=100
0.96 0.94 0.94 0.94 insert_range=100
0.95 0.94 0.95 0.95 read-write_range=100
0.96 0.94 0.95 0.95 read-write_range=10
0.80 0.85 0.83 0.90 update-index_range=100
0.99 1.04 1.03 1.02 update-inlist_range=100
0.98 0.97 0.96 0.96 update-nonindex_range=100
0.96 0.93 0.94 0.95 update-one_range=100
0.97 0.97 0.96 0.95 update-zipf_range=100
0.94 0.93 0.94 0.92 write-only_range=10000

For scan the problem is CPU overhead (see cpu/o below) which is 1.20X larger in 8.0.40 vs 8.0.28.

sb.met.scan.range100.pk1.dop8
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.018767        0.552   0       0       0.052   872     x.my8028_rel_o2nofp
0.020741        0.746   0       0       0.016   793     x.my8030_rel_o2nofp
0.022690        0.808   0       0       0.004   713     x.my8033_rel_o2nofp
0.023079        0.791   0       0       0.003   706     x.my8039_rel_o2nofp
0.022533        0.800   0       0       0.013   725     x.my8040_rel_o2nofp
--- relative to first result
1.11            1.35    1       1       0.31    0.91    x.my8030_rel_o2nofp
1.21            1.46    1       1       0.08    0.82    x.my8033_rel_o2nofp
1.23            1.43    1       1       0.06    0.81    x.my8039_rel_o2nofp
1.20            1.45    1       1       0.25    0.83    x.my8040_rel_o2nofp

For update-index there are several changes after 8.0.28:
  • the changes here are similar to the ones above for the beelink and pn53 servers, but not as large, and the regression here is also smaller.
  • CPU per update (cpu/o) is 1.09X larger in 8.0.40 vs 8.0.28
  • KB written per update (wKB/o) is 1.37X larger in 8.0.40 vs 8.0.28
  • context switches per update (cs/o) is 1.03X larger in 8.0.40 vs 8.0.28
sb.met.update-index.range100.pk1.dop40
--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000550        12.157  0       0       8.354   83129   x.my8028_rel_o2nofp
0.000641        13.426  0       0       14.791  66432   x.my8030_rel_o2nofp
0.000623        12.988  0       0       13.155  70890   x.my8033_rel_o2nofp
0.000637        13.044  0       0       14.158  68996   x.my8039_rel_o2nofp
0.000600        12.514  0       0       11.416  74917   x.my8040_rel_o2nofp
--- relative to first result
1.17            1.10    1       1       1.77    0.80    x.my8030_rel_o2nofp
1.13            1.07    1       1       1.57    0.85    x.my8033_rel_o2nofp
1.16            1.07    1       1       1.69    0.83    x.my8039_rel_o2nofp
1.09            1.03    1       1       1.37    0.90    x.my8040_rel_o2nofp




Thursday, October 17, 2024

The impact of PGO, LTO and more for MySQL on a small CPU

This has results for MySQL 8.0.28 vs sysbench on a laptop-class CPU (AMD 4700U) to understand the impact of various compile-time optimizations including LTO and PGO. I will have results from a server-class CPU next week. I previously published results on this but then retracted some of them because they appeared bogus.

tl;dr

  • clang and gcc have similar performance with a few exceptions
    • clang does ~7% better than gcc with PGO
    • clang does ~25% better than gcc with PGO+LTO
    • oddly, gcc with PGO is ~3% faster than gcc with PGO+LTO
  • LTO makes things ~6% faster
  • PGO or PGO+LTO makes things between 1.4X and 1.8X faster
  • gcc and clang without -march=native -mtune=native are ~2% faster than with those options

Builds

I compiled upstream MySQL 8.0.28 from source. The my.cnf file is here.

All work (builds & tests) was done on a server that uses Ubuntu 22.04.5. The gcc version is 11.4.0 and the clang version is 14.0.0-1ubuntu1.1.

The CMake command lines for all of the builds are here. All builds use CMAKE_BUILD_TYPE=Release. The builds are listed below; the name of the CMake command line file per build is cmk.80.$val where $val is the build name:
  • rel_o2nofp
    • uses gcc, -O2 instead of -O3, adds -fno-omit-frame-pointer, works great for flamegraphs
  • rel_o2nofp_clang
    • uses clang, -O2 instead of -O3, adds -fno-omit-frame-pointer, works great for flamegraphs
  • rel
    • uses gcc, -O3, doesn't work great for flamegraphs
  • rel_clang
    • uses clang, -O3, doesn't work great for flamegraphs
  • rel_o2nofp_lto
    • like rel_o2nofp but adds -DWITH_LTO=ON to get link-time optimization
  • rel_o2nofp_lto_clang
    • like rel_o2nofp_clang but adds -DWITH_LTO=ON to get link-time optimization
  • rel_lto
    • like rel but adds -DWITH_LTO=ON to get link-time optimization
  • rel_lto_clang
    • like rel_clang but adds -DWITH_LTO=ON to get link-time optimization
  • rel_o2nofp_native
    • like rel_o2nofp but adds -march=native -mtune=native
  • rel_o2nofp_native_clang
    • like rel_o2nofp_clang but adds -march=native -mtune=native
  • rel_native
    • like rel but adds -march=native -mtune=native
  • rel_native_clang
    • like rel_clang but adds -march=native -mtune=native
  • rel_o2nofp_native_lto
    • like rel_o2nofp but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_o2nofp_native_lto_clang
    • like rel_o2nofp_clang but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_native_lto
    • like rel but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_native_lto_clang
    • like rel_clang but adds -march=native -mtune=native and -DWITH_LTO=ON
  • rel_pgo_use
    • like rel but uses PGO: compiled with -fprofile-use from profiles collected via -fprofile-generate
  • rel_pgo_use_clang
    • like rel_clang but uses PGO: compiled with -fprofile-use from profiles collected via -fprofile-generate
  • rel_lto_pgo_use
    • like rel but uses PGO as above and adds -DWITH_LTO=ON
  • rel_lto_pgo_use_clang
    • like rel_clang but uses PGO as above and adds -DWITH_LTO=ON
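
As an illustration, the command line for a build like rel_native_lto might look like the sketch below, assuming the extra flags ride on CMAKE_C_FLAGS and CMAKE_CXX_FLAGS (my actual command lines are in the file linked above):

    cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_LTO=ON \
        -DCMAKE_C_FLAGS="-march=native -mtune=native" \
        -DCMAKE_CXX_FLAGS="-march=native -mtune=native"
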
For PGO I do the following:
  • compile mysqld with -fprofile-generate
  • run sysbench to generate profiles
  • recompile mysqld with -fprofile-use using those profiles
  • rerun sysbench to get the results that I report.
For PGO with gcc I use -fprofile-generate rather than -fprofile-generate=$path, so when I run mysqld the gcda files are written into the subdirectories that hold the object files from when I compiled MySQL. That works because I use the same host for compiling and testing.
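
A minimal sketch of that gcc cycle, assuming the flags are passed via CMAKE_C_FLAGS and CMAKE_CXX_FLAGS:

    # pass 1: instrumented build
    cmake .. -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_C_FLAGS=-fprofile-generate -DCMAKE_CXX_FLAGS=-fprofile-generate
    make -j8
    # start mysqld and run sysbench; .gcda files land next to the object files
    # pass 2: rebuild using those profiles
    cmake .. -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_C_FLAGS=-fprofile-use -DCMAKE_CXX_FLAGS=-fprofile-use
    make -j8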

For PGO with clang I use -fprofile-generate rather than -fprofile-generate=$path, but when I ran mysqld the profile data files (*.profraw) were empty. Then I prepended this to the mysqld command line, but the files were still empty:
  LLVM_PROFILE_FILE="code-%p.profraw"

The problem appears to be that mysqld calls chdir($data-dir) at startup, which confuses the clang profiling support. But making that path absolute fixes the problem:
    LLVM_PROFILE_FILE="$PWD/code-%p.profraw"

By mysqld command line I mean that I started mysqld via mysqld_safe, ran ps to see the full command line, and then used that command line in place of mysqld_safe.

Finally, once I had non-empty profraw files, I had to run llvm-profdata on them and the output (rel_pgo_gen_clang.profdata or rel_lto_pgo_gen_clang.profdata) was used as input when I compiled mysqld with clang and -fprofile-use.
    llvm-profdata merge -output=rel_pgo_gen_clang.profdata code-296681.profraw
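
The merged profile is then passed back to clang when recompiling, for example (a sketch, assuming the flag rides on CMAKE_C_FLAGS and CMAKE_CXX_FLAGS):

    cmake .. -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_C_FLAGS="-fprofile-use=$PWD/rel_pgo_gen_clang.profdata" \
        -DCMAKE_CXX_FLAGS="-fprofile-use=$PWD/rel_pgo_gen_clang.profdata"
    make -j8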

Hardware

The server here is a Beelink SER4 with an AMD Ryzen 7 4700U CPU with SMT disabled, 8 cores, 16G of RAM and Ubuntu 22.04. The storage is 1 NVMe device.

The CPU used here (AMD 4700U) is described as a laptop-class CPU. The server is configured to use the performance frequency governor and the acpi-cpufreq scaling driver.
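
For reference, a sketch of one way to select the performance governor on Ubuntu with acpi-cpufreq (sysfs paths can vary by kernel):

    # set the performance governor on all CPUs
    for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance | sudo tee $f
    done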

Benchmark

I used sysbench and my usage is explained here. A full run has 42 microbenchmarks and most test only 1 type of SQL statement. But here I skip the read-only tests that run prior to writes to save time. The database is cached by InnoDB.

The benchmark is run with 1 thread, 1 table and 30M rows. Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

The command line for my helper script was:
    bash r.sh 1 30000000 300 600 nvme0n1 1 1 1

Results

All of the results are here using relative QPS (rQPS) where:
  • rQPS is: (QPS for my version / QPS for base version)
  • base version is the rel_o2nofp build
  • my version is one of the other builds
I then compute the median and average of the rQPS for each build. There is no result below for rel_o2nofp because that is the base version and the avg and median for it are 1.
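
A sketch of that summary computation, assuming a file (rqps.txt, a name I made up for this example) with one rQPS value per line for a given build:

    sort -n rqps.txt | awk '
        { v[NR] = $1; sum += $1 }
        END {
            avg = sum / NR
            med = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
            printf "avg %.2f  median %.2f\n", avg, med
        }'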

avg     median  build
0.98    0.98    rel_o2nofp_clang
...
1.00    1.00    rel
0.98    0.98    rel_clang
...
1.05    1.05    rel_o2nofp_lto
1.06    1.06    rel_o2nofp_lto_clang
...
1.07    1.07    rel_lto
1.07    1.07    rel_lto_clang
...
0.99    0.99    rel_o2nofp_native
0.97    0.97    rel_o2nofp_native_clang
...
0.99    0.99    rel_native
0.98    0.98    rel_native_clang
...
1.05    1.04    rel_o2nofp_native_lto
1.05    1.05    rel_o2nofp_native_lto_clang
...
1.06    1.06    rel_native_lto
1.06    1.06    rel_native_lto_clang
...
1.42    1.48    rel_pgo_use
1.52    1.59    rel_pgo_use_clang
...
1.39    1.43    rel_lto_pgo_use
1.77    1.76    rel_lto_pgo_use_clang
