Sunday, October 13, 2024

Managing CPU frequency for AMD on Ubuntu 22.04

I need stable performance from the servers I use for benchmarks. I also need servers that don't run too hot because too-hot servers cause HW to fail and introduce variance when CPU speeds are throttled. This post is about my latest adventures managing CPU speed. My previous post is here.

At a high level my solution is:

  • disable turbo boost
  • (optional) cap the max frequency the CPU can use
Background reading

The AMD pstate scaling drivers (amd-pstate, amd-pstate-epp) make life interesting for some of us. The common advice today is to stay with acpi-cpufreq for servers. I agree and assume it is best to wait for things to settle, for the scaling drivers to be feature complete, for docs to catch up and for new AMD CPUs that support all of the new features to arrive in your servers. It is very confusing today -- the non-expert user experience isn't great and there is too much advice on the interweb that is wrong and/or out of date.

The amd-pstate-epp scaling driver used in active mode doesn't appear to have a way to disable or enable turbo boost. It also isn't obvious it has a way to limit the max CPU frequency. So expect your server to run fast and hot, then throttle the CPU, then repeat forever. That might be fine for a laptop but isn't good for my use case.
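
One way to check which scaling driver is active, and which mode amd-pstate is in, is to read sysfs. This is a minimal sketch; the amd_pstate status file only exists on kernels that include that driver.

    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
    # prints active, passive or guided when the amd_pstate driver is present
    cat /sys/devices/system/cpu/amd_pstate/status 2>/dev/null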

There is much to learn, but sometimes I prefer to focus on my problems (database storage engines) and not have to spend too much time on topics like this:
  • an overview on AMD pstate drivers from AMD (see here)
  • an overview of CPU frequency scaling from Arch Linux (see here)
  • good configuration advice on Reddit (see here)
  • benchmark results from Phoronix (see here)
  • kernel docs on frequency governors (see here)
  • a good user experience post (see here)
  • an overview from RedHat (see here)
The solution

The concrete steps are:
  • disable turbo boost
  • use the acpi-cpufreq scaling driver
  • use the performance frequency governor
  • (optionally) cap the max CPU frequency (this only works on some of my servers)
Run this to disable turbo boost. Note that the boost file exists when using the acpi-cpufreq scaling driver. If using amd-pstate-epp then the file isn't there with active mode and is there with guided mode.

    echo '0' | sudo tee /sys/devices/system/cpu/cpufreq/boost
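
To confirm the change, read the file back (it should print 0):

    cat /sys/devices/system/cpu/cpufreq/boost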

To get the acpi-cpufreq scaling driver, edit /etc/default/grub to add one of these lines (nosmt disables AMD SMT), then run sudo update-grub and then reboot.

    GRUB_CMDLINE_LINUX_DEFAULT="nosmt amd_pstate=disable"
    GRUB_CMDLINE_LINUX_DEFAULT="amd_pstate=disable"
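
After the reboot, one way to confirm that the kernel command line took effect and that acpi-cpufreq is now the scaling driver:

    cat /proc/cmdline
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver   # expect acpi-cpufreq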

Run this to use the performance frequency governor:

    sudo cpupower frequency-set --governor performance

And then run this to confirm you enabled the performance governor:

    cpupower -c all frequency-info | grep gov ; cpupower frequency-info

While disabling turbo boost goes a long way toward avoiding a too-hot CPU, sometimes you might want to reduce the CPU frequency even more, and sometimes that is possible via the cpupower command. However, this doesn't work on all of my servers. Fortunately, summer has passed and I don't have to worry as much about overheating for a few months. Here is one example of a user having a problem similar to mine (the max value is ignored).

    sudo cpupower frequency-set -u 2.40GHz
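
One way to check whether the cap was applied is to read scaling_max_freq for every CPU (values are in kHz):

    grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq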

Setting up the adventure

This adventure began when I had to replace the m.2 devices on several of my small servers because they were at or near their endurance limits. The servers run Ubuntu 22.04 and I decided to update the installs, which brought new kernel versions. While doing that I updated all of my servers to use an HWE kernel, which is now 6.8.something with Ubuntu 22.04.5.

Ubuntu 22.04 Server uses the schedutil frequency governor by default when using the acpi-cpufreq scaling driver. I noticed a few months back that schedutil caused odd performance results for MyRocks on some of my servers and switching to the performance governor gave me up to ~2X more QPS (see here). So I decided to switch all of my small servers to use the performance governor.
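
Note that a governor set via cpupower does not survive a reboot. One option for making it persistent -- an assumption on my part, not something I describe above -- is the cpufrequtils package, whose boot script reads the governor from /etc/default/cpufrequtils:

    # assumes the cpufrequtils package is installed
    echo 'GOVERNOR="performance"' | sudo tee /etc/default/cpufrequtils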

Servers

I have two types of small servers. Both use Ubuntu 22.04 and everything has the 6.8.0-45-generic HWE kernel today.

The older type has an AMD Ryzen 7 4700u CPU with 8 cores, 16G of RAM, no support for AMD SMT and no support for CPPC. I did not check whether CPPC support was disabled in the BIOS. By no support for CPPC I mean that these directories do not exist:

    /sys/devices/system/cpu/cpu*/acpi_cppc

However, on the same server this shows that CPPC is supported: lscpu | grep cppc
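
A quick sketch that runs both checks (the sysfs directory and the lscpu flag) on a given server:

    ls -d /sys/devices/system/cpu/cpu0/acpi_cppc 2>/dev/null || echo 'no acpi_cppc directory'
    lscpu | grep -qw cppc && echo 'lscpu reports cppc' || echo 'no cppc flag in lscpu'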

The newer small server has 8 cores with AMD SMT disabled, 32G RAM and an AMD Ryzen 7 CPU -- either 7840HS or 7735HS.  These have support for CPPC.

Results

I used this script to determine the behavior of the scaling driver (acpi-cpufreq, amd-pstate-epp in active mode, amd-pstate-epp in guided mode), the frequency governor (schedutil, performance, powersave) and the energy performance preference (the EPP in amd-pstate-epp).
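
For reference, this is a minimal sketch of the kind of state such a script inspects (it is not the linked script) -- the driver, governor, EPP and frequency limits for each cpufreq policy, plus the global boost flag:

    for d in /sys/devices/system/cpu/cpufreq/policy*; do
        echo "== $d"
        for f in scaling_driver scaling_governor energy_performance_preference \
                 scaling_min_freq scaling_max_freq; do
            [ -r "$d/$f" ] && echo "$f = $(cat $d/$f)"
        done
    done
    [ -r /sys/devices/system/cpu/cpufreq/boost ] && \
        echo "boost = $(cat /sys/devices/system/cpu/cpufreq/boost)"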

I decided not to share the results; perhaps I am grouchy after spending a few too many hours on this.

Update 1 - cpupower idle-set

One of my smart friends suggested I might need to use cpupower idle-set -D1 to avoid some sources of variance. While I would rather spend time on storage engines and less on tuning frequency management, I suppose I need to look at this.
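
For reference, this is roughly what that looks like. See cpupower-idle-set(1) for the exact semantics of -D (it disables idle states based on a latency threshold) and use -E to re-enable everything afterwards:

    sudo cpupower idle-set -D1    # disable idle states above a latency threshold
    cpupower idle-info            # confirm which states are now disabled
    sudo cpupower idle-set -E     # re-enable all idle states when done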

I ran cpupower idle-info on the CPUs in my home servers. The CPUs and per-state latencies are listed below. One interesting thing is that the gap between C1 and C2 is huge for the 4700u CPU (1 to 350) but much smaller on all of the other (and newer) CPUs.

  • AMD Ryzen 7 4700u (laptop class, the oldest & slowest of the bunch)
    • Latency for (Poll, C1, C2, C3) = (0, 1, 350, 400)
  • AMD Ryzen 7 7735HS (laptop class)
    • Latency for (Poll, C1, C2, C3) = (0, 1, 18, 350)
  • AMD Ryzen 7 7840HS (laptop class)
    • Latency for (Poll, C1, C2, C3) = (0, 1, 18, 350)
  • Intel Xeon Silver 4214R
    • Latency for (Poll, C1, C1E, C6) = (0, 2, 10, 133)
  • AMD Ryzen Threadripper PRO 5975WX
    • Latency for (Poll, C1, C2) = (0, 1, 18)
I get the following output from cpupower idle-info:

AMD Ryzen 7 4700u

CPUidle driver: acpi_idle
CPUidle governor: menu
analyzing CPU 6:

Number of idle states: 4
Available idle states: POLL C1 C2 C3
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 18496
Duration: 924722
C1:
Flags/Description: ACPI FFH MWAIT 0x0
Latency: 1
Usage: 89774
Duration: 31663862
C2:
Flags/Description: ACPI IOPORT 0x414
Latency: 350
Usage: 25768
Duration: 24190992
C3:
Flags/Description: ACPI IOPORT 0x415
Latency: 400
Usage: 405313
Duration: 85575022637

AMD Ryzen 7 7735HS

CPUidle driver: acpi_idle
CPUidle governor: menu
analyzing CPU 4:

Number of idle states: 4
Available idle states: POLL C1 C2 C3
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 5236511
Duration: 33151236
C1:
Flags/Description: ACPI FFH MWAIT 0x0
Latency: 1
Usage: 68356802
Duration: 1954121741
C2:
Flags/Description: ACPI IOPORT 0x414
Latency: 18
Usage: 25603592
Duration: 1986129966
C3:
Flags/Description: ACPI IOPORT 0x415
Latency: 350
Usage: 14551352
Duration: 77114754509

AMD Ryzen 7 7840HS

CPUidle driver: acpi_idle
CPUidle governor: menu
analyzing CPU 5:

Number of idle states: 4
Available idle states: POLL C1 C2 C3
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 2754712
Duration: 91604147
C1:
Flags/Description: ACPI FFH MWAIT 0x0
Latency: 1
Usage: 101206334
Duration: 5540660093
C2:
Flags/Description: ACPI IOPORT 0x414
Latency: 18
Usage: 17879457
Duration: 1736766115
C3:
Flags/Description: ACPI IOPORT 0x415
Latency: 350
Usage: 3887844
Duration: 73780112353

Intel Xeon Silver 4214R

CPUidle driver: intel_idle
CPUidle governor: menu
analyzing CPU 13:

Number of idle states: 4
Available idle states: POLL C1 C1E C6
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 1780886844
Duration: 8636888464
C1:
Flags/Description: MWAIT 0x00
Latency: 2
Usage: 16431990427
Duration: 337814322622
C1E:
Flags/Description: MWAIT 0x01
Latency: 10
Usage: 5084275890
Duration: 309067690957
C6:
Flags/Description: MWAIT 0x20
Latency: 133
Usage: 1368088856
Duration: 3588025308542

AMD Ryzen Threadripper PRO 5975WX

CPUidle driver: acpi_idle
CPUidle governor: menu
analyzing CPU 27:

Number of idle states: 3
Available idle states: POLL C1 C2
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 410444172
Duration: 1861530465
C1:
Flags/Description: ACPI FFH MWAIT 0x0
Latency: 1
Usage: 14490956799
Duration: 422501068392
C2:
Flags/Description: ACPI IOPORT 0x814
Latency: 18
Usage: 3995240933
Duration: 3752298780474

Friday, October 11, 2024

What does "perf stat" tell us about MySQL performance regressions on an AMD Threadripper CPU

I used perf stat to collect a variety of HW performance counters and will share the results over several posts. I used 4 different servers with 4 different CPUs and there will be a post per CPU. This is the third post with results from a server class CPU. Previous posts that used laptop class CPUs (but mini PCs, not laptops) are here and here and a result from a server class CPU is here.

tl;dr

  • MySQL (InnoDB and MyRocks) gets slower over time from both memory system bloat (TLB and cache activity) and code bloat (more instructions/operation). For code bloat, InnoDB in MySQL 8.0.39 uses (1.43, 1.22, 1.28, 1.36) times more instructions per SQL operation for (scan, insert, point-query, update-index) microbenchmarks relative to 5.6.51.
  • the regressions in MyRocks and InnoDB are similar so this appears to be independent of the storage engine (perhaps code above the storage engine, perhaps just new code in general)
  • the results for InnoDB here are better than on the small servers. While memory system and code bloat hurt modern MySQL, there were many improvements that reduce mutex contention, and the workload here has more concurrency (24 threads) while the small server workloads used low concurrency (1 thread)
  • the update-index microbenchmark has a huge regression from 8.0.28 to 8.0.39. I need a few more days to explain that. I see increases in CPU utilization and the context switch rate.
Builds

I compiled upstream MySQL from source for versions 5.6.51, 5.7.10, 5.7.44, 8.0.11, 8.0.28 and 8.0.39.

I compiled FB MySQL from source and tested the following builds:
  • fbmy5635_rel_o2nofp_210407_f896415f_6190
    • MyRocks 5.6.35 at git sha f896415f (as of 21/04/07) with RocksDB 6.19
  • fbmy5635_rel_o2nofp_231016_4f3a57a1_870
    • MyRocks 5.6.35 at git sha 4f3a57a1 (as of 23/10/16) with RocksDB 8.7.0
  • fbmy8028_rel_o2nofp_220829_a35c8dfe_752
    • MyRocks 8.0.28 at git sha a35c8dfe (as of 22/08/29) with RocksDB 7.5.2
  • fbmy8028_rel_o2nofp_231202_4edf1eec_870
    • MyRocks 8.0.28 at git sha 4edf1eec (as of 23/12/02) with RocksDB 8.7.0
  • fbmy8032_rel_o2nofp_231204_e3a854e8_870
    • MyRocks 8.0.32 at git sha e3a854e8 (as of 23/12/04) with RocksDB 8.7.0
  • fbmy8032_rel_o2nofp_240529_49b37dfe_921
    • MyRocks 8.0.32 at git sha 49b37dfe (as of 24/05/29) with RocksDB 9.2.1
The my.cnf files are here for MyRocks: 5.6.35, 8.0.28 and 8.0.32.

The my.cnf files for InnoDB are in the subdirectories here. Given the amount of innovation in MySQL 8.0 I can't use one my.cnf file for all 8.0 versions.

Hardware

The server is a Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX 32-Cores. The OS is Ubuntu 22.04 with an HWE kernel. Storage is 1 m.2 SSD with ext4 (data=writeback, discard enabled).

Benchmark

I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by MyRocks and InnoDB.

The benchmark is run with 24 threads, 8 tables and 10M rows per table. Each microbenchmark runs for 330 seconds if read-only and 630 seconds otherwise. Prepared statements were enabled.

The command line for my helper scripts was:
    bash r.sh 8 10000000 330 630 nvme0n1 1 1 24

Using perf

I used perf stat with code that does:

  1. Sleep for 30 seconds
  2. Run perf stat 7 times for 10 seconds at a time. Each run collects different counters - see here.
So it takes ~100 seconds per loop and, given that I run the read-heavy tests for ~330 seconds and the write-heavy tests for ~630 seconds, I collect data from ~3 loops for read-heavy tests and ~6 loops for write-heavy tests.
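
A sketch of that loop is below. The event lists are placeholders (the real groups, 7 of them, are in the code linked above) and the loop count would depend on the microbenchmark duration:

    sleep 30
    for loop in 1 2 3; do
        for ev in instructions,cycles,branches,branch-misses \
                  cache-references,cache-misses \
                  dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses; do
            # system-wide counters for a 10 second window
            perf stat -a -e "$ev" -- sleep 10
        done
    done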

I then compare results across DBMS versions for one (or a few) of the loops and generate two tables per microbenchmark -- one with absolute values, the other with values relative to the base version. The base version for InnoDB is 5.6.51 and for MyRocks is fbmy5635_rel_o2nofp_210407_f896415f_6190. By values relative I mean: (value for my version / value for base version). My focus is to compare MyRocks with MyRocks and InnoDB with InnoDB. I assume that most of the regressions for MyRocks are also in InnoDB and then try to confirm that assumption.

Results

For the results I normally split the 42 microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. But I don't do that here because I share 1 (or 2) tables of results from perf stat per microbenchmark and doing that for 42 microbenchmarks is too much.

So I focus on four that I think are representative:
  • scan - do full table scans in a loop for ~300 seconds
  • insert - do inserts
  • point-query - fetch 1 row by exact match on the PK index
  • update-index - do updates that require secondary index maintenance
For each microbenchmark I continue to use relative QPS (rQPS) which is:
     (QPS for my version / QPS for base version)

And the base version is explained above in the Using perf section.

The relative QPS for MyRocks is here. The relative QPS for the subset of microbenchmarks on which I focus is below. The results for a recent build of FB MyRocks 8.0.32 are in col-5 and the relative QPS values range from 0.60 for scan to 0.92 for update-index. A relative QPS of 0.60 for scan means that the scan throughput in that recent MyRocks 8.0.32 build is only 60% of the throughput from an old build of MyRocks 5.6.35 when the workload is CPU bound. I am not happy about that.

Relative to: x.fbmy5635_rel_o2nofp_210407_f896415f_6190
col-1 : x.fbmy5635_rel_o2nofp_231016_4f3a57a1_870
col-2 : x.fbmy8028_rel_o2nofp_220829_a35c8dfe_752
col-3 : x.fbmy8028_rel_o2nofp_231202_4edf1eec_870
col-4 : x.fbmy8032_rel_o2nofp_231204_e3a854e8_870
col-5 : x.fbmy8032_rel_o2nofp_240529_49b37dfe_921

col-1   col-2   col-3   col-4   col-5
0.97    0.64    0.71    0.62    0.60    scan_range=100
0.96    0.83    0.80    0.78    0.77    insert_range=100
0.97    0.87    0.87    0.85    0.85    point-query_range=100
0.98    0.97    0.94    0.92    0.92    update-index_range=100

The relative QPS is here for InnoDB. The QPS for the subset of microbenchmarks on which I focus is below. The results for MySQL 8.0.39 are in col-5 and the relative QPS values range from 0.75 for scan to 1.77 for update-index. InnoDB in MySQL 8.0.39 is slower than 5.6.51 for the read-only microbenchmarks, scan and point-query, and then faster on the write-only microbenchmarks, insert and update-index. Also note that the values in col-5 (8.0.39) are a bit smaller than the values in col-4 (8.0.28), especially for scan. I think this is bug 111538 which might get fixed in the upcoming 8.0.40 release. Even in the one case where 8.0.39 gets more throughput than 5.6.51 (update-index) there is a steady drop in relative QPS from col-2 (5.7.44) through col-5 (8.0.39), and even from col-4 (8.0.28) to col-5 (8.0.39).
 
Relative to: x.my5651_rel_o2nofp
col-1 : x.my5710_rel_o2nofp
col-2 : x.my5744_rel_o2nofp
col-3 : x.my8011_rel_o2nofp
col-4 : x.my8028_rel_o2nofp
col-5 : x.my8039_rel_o2nofp

col-1   col-2   col-3   col-4   col-5
0.98    0.90    1.03    0.92    0.75    scan_range=100
1.51    1.52    1.59    1.50    1.41    insert_range=100
1.04    1.01    0.94    0.89    0.83    point-query_range=100
3.12    3.62    1.44    3.21    1.77    update-index_range=100

Explaining regressions

While it would be great if most of the regressions were caused by a small number of diffs, that does not appear to be the case. When I look at QPS across all of the point releases I see a few things:
  • larger regressions across major releases (5.6.51 to 5.7.10 and 5.7.44 to 8.0.11)
  • many regressions across 8.0 releases (8.0.11 through 8.0.39)
  • some regressions across 5.7 releases (5.7.10 through 5.7.44)
  • not many regressions across 5.6 releases
At a high level there are two reasons for the regressions and I am working to distinguish between them when I look at results from perf record and perf stat.
  • code bloat - more instructions are executed per SQL command
  • memory system bloat - more cache and TLB activity per SQL command
With either tool (perf stat, perf record) I run it for a fixed number of seconds while the amount of work (QPS) completed during that interval is not fixed -- for low-concurrency workloads the QPS is larger for older MySQL and smaller for modern MySQL. I must keep this in mind when interpreting the results.

With flamegraphs from perf record a common case is that the distribution of time per function doesn't change that much even when there are large regressions and my hypothesis is that memory system bloat is the problem. 

When looking at perf stat output I start with the absolute values for the counters. When they are small, a 10X difference in the value for that counter between old and modern MySQL is frequently not significant because it won't have a large impact on performance. Otherwise, I focus on the relative values for these counters, which is (value for my version / value for base version). I use the old version of MySQL or FB MyRocks as the base version and try to focus on the cases where the relative value is large (perhaps larger than 1.15, as in a 15% increase).

Just as with perf record, the values from perf stat are measured over a fixed number of seconds while the QPS completed during that interval tends to be larger for older MySQL. So I then compute another value which is: (value for my version / value for base version / relative QPS for my version).  This is the relative value for a counter per query (query == SQL command) and when that is equal to 1.5 for some version then some version does 1.5X more for that counter per query (1.5X more cache misses, etc).
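
To make the arithmetic concrete, here is a tiny example with made-up numbers (these are not measurements from this post):

    # relative value per query = (counter for my version / counter for base version) / relative QPS
    base_ctr=1000000 ; my_ctr=1800000 ; rqps=0.75
    awk -v b="$base_ctr" -v m="$my_ctr" -v q="$rqps" \
        'BEGIN { printf "relative value per query = %.2f\n", (m / b) / q }'
    # prints 2.40, so this version does 2.4X more of that counter per SQL command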

While I use the word significant above to mean that the value for some counter is interesting, it isn't easy to estimate how much that counter contributes to a performance regression.

Looking at perf stat output

This spreadsheet has the interesting counters from perf stat for the scan, insert, point-query and update-index microbenchmarks. The set of tables on the left has the absolute values for the counters and on the right are the normalized values -- (value for my version / value for base version / relative QPS).

The amd-rx tab has results for MyRocks and the amd-in tab has results for InnoDB on the server I write about here (the Threadripper workstation described above). And I will inline the tables with normalized values below.

For MyRocks the version names I use below are abbreviated from what I explained above:
  • fbmy5635.a - fbmy5635_rel_o2nofp_210407_f896415f_6190
  • fbmy5635.b - fbmy5635_rel_o2nofp_231016_4f3a57a1_870
  • fbmy8028.a - fbmy8028_rel_o2nofp_220829_a35c8dfe_752
  • fbmy8028.b - fbmy8028_rel_o2nofp_231202_4edf1eec_870
  • fbmy8032.a - fbmy8032_rel_o2nofp_231204_e3a854e8_870
  • fbmy8032.b - fbmy8032_rel_o2nofp_240529_49b37dfe_921
A few more things:
  • I use yellow to highlight counters for which there are large increases
  • I use red to highlight the values for cpu/o (relative CPU per SQL command) and relative QPS
  • values for cycles, GHz, ipc, cpu/o and relative QPS are not normalized as described above. For them the value is computed as (value for my version / value for base version) and I don't divide that by the relative QPS.
Results from perf stat: scan

The tl;dr here is that MyRocks and InnoDB have similar regressions.

Summary:
  • the relative QPS is 0.60 & 0.75 for modern MyRocks & InnoDB so they get 60% & 75% of the throughput vs older MyRocks & InnoDB.
  • the CPU overhead per scan (cpu/o) is 1.65X & 1.36X larger in modern MyRocks & InnoDB vs older MyRocks & InnoDB
  • in both MyRocks and InnoDB there are large increases for TLB and cache activity. My biggest concern is for L1-icache-loads-misses which is 4.63X & 12.61X larger in modern MyRocks & InnoDB.
  • the large values for iTLB-loads and iTLB-load-misses are a concern, but reducing them to near zero (via using huge pages for text) increased QPS on the scan microbenchmark by only ~5% (see here)
  • both MyRocks and InnoDB suffer from code bloat because the number of instructions per SQL operation increases by 1.61X & 1.43X in modern MyRocks & InnoDB
  • there are four cases where HW perf counters show a regression in fbmy5635.b and I did not expect that
MyRocks

fbmy5635.a  fbmy5635.b  fbmy8028.a  fbmy8028.b  fbmy8032.a  fbmy8032.b
1.00        1.06        1.54        1.42        1.67        1.65        branches
1.00        0.94        1.26        1.01        1.14        1.12        branch-misses
1.00        0.95        1.02        0.91        1.08        1.08        cache-misses
1.00        0.70        1.14        1.32        1.92        1.76        cache-references
1.00        1.00        1.00        1.00        1.00        1.00        cycles
1.00        0.90        1.11        0.92        1.11        1.08        dTLB-load-misses
1.00        1.14        8.13        1.36        2.08        2.75        dTLB-loads
1.00        1.00        1.00        1.00        1.00        1.00        GHz
1.00        1.07        1.51        1.38        1.63        1.61        instructions
1.00        1.04        0.97        0.98        1.00        0.97        ipc
1.00        0.99        0.86        0.84        0.96        0.92        iTLB-load-misses
1.00        1.38        24.34       1.40        7.00        2.96        iTLB-loads
1.00        1.18        1.36        1.23        1.73        1.38        L1-dcache-load-misses
1.00        1.06        1.43        1.33        1.56        1.56        L1-dcache-loads
1.00        1.31        2.18        2.07        2.51        2.26        L1-icache-loads
1.00        0.73        1.75        2.62        6.52        4.23        L1-icache-loads-misses
1.00        0.86        0.79        5.93        1.37        0.97        stalled-cycles-backend
1.00        1.18        0.96        1.54        0.91        2.15        stalled-cycles-frontend
1.00        1.04        1.56        1.42        1.62        1.65        cpu/o
1.00        0.97        0.64        0.71        0.62        0.60        QPS

InnoDB

5.6.51  5.7.10  5.7.44  8.0.11  8.0.28  8.0.39
1.00    1.06    1.12    1.06    1.15    1.45    branches
1.00    1.00    0.97    0.22    0.25    0.41    branch-misses
1.00    0.93    1.03    0.70    0.97    0.94    cache-misses
1.00    0.94    0.96    1.55    1.68    1.70    cache-references
1.00    1.00    1.00    1.00    1.00    1.00    cycles
1.00    0.95    0.93    0.95    1.17    1.19    dTLB-load-misses
1.00    0.77    1.00    1.06    3.80    4.74    dTLB-loads
1.00    1.00    1.00    1.00    1.00    1.00    GHz
1.00    1.02    1.07    1.04    1.15    1.43    instructions
1.00    1.01    0.95    1.07    1.05    1.07    ipc
1.00    1.17    1.21    1.65    1.62    2.75    iTLB-load-misses
1.00    2.22    70.58   2.12    728.81  5591.47 iTLB-loads
1.00    0.92    0.92    1.98    2.09    2.08    L1-dcache-load-misses
1.00    1.06    1.12    1.11    1.25    1.50    L1-dcache-loads
1.00    1.06    1.55    0.61    0.59    1.17    L1-icache-loads
1.00    1.92    2.46    2.17    11.04   12.61   L1-icache-loads-misses
1.00    1.42    2.48    2.22    0.85    3.07    stalled-cycles-backend
1.00    3.04    7.31    4.06    1.52    2.19    stalled-cycles-frontend
1.00    1.03    1.14    1.00    1.12    1.36    cpu/o
1.00    0.98    0.90    1.03    0.92    0.75    QPS

Results from perf stat: insert

The tl;dr here is that MyRocks and InnoDB have similar regressions.

Summary:
  • the relative QPS is 0.77 for modern MyRocks and 1.41 for modern InnoDB. So MyRocks gets slower over time. InnoDB in MySQL 8.0.39 is faster than 5.6.51 but has been getting slower since 5.7.10
  • the CPU overhead per insert (cpu/o) is 1.40X larger in modern MyRocks but smaller in modern InnoDB.
  • in both MyRocks and InnoDB there are large increases for TLB and cache activity. My biggest concern is for L1-icache-loads-misses which is 1.65X & 1.61X larger in modern MyRocks & InnoDB.
  • the large values for iTLB-loads and iTLB-load-misses are a concern, but reducing them to near zero (via using huge pages for text) increased QPS on the scan microbenchmark by only ~5% (see here)
  • modern MyRocks & InnoDB suffer from code bloat as the number of instructions per SQL operation increases by 1.22X for both and InnoDB has been getting worse since 5.7.10
  • there are three cases where HW perf counters show a regression in fbmy5635.b and I did not expect that
MyRocks

fbmy5635.a  fbmy5635.b  fbmy8028.a  fbmy8028.b  fbmy8032.a  fbmy8032.b
1.00        1.02        1.58        1.18        1.55        1.23        branches
1.00        1.03        1.49        1.41        1.59        1.49        branch-misses
1.00        1.03        1.48        1.48        1.59        1.58        cache-misses
1.00        1.03        1.45        1.42        1.59        1.55        cache-references
1.00        0.97        1.20        1.06        1.18        1.07        cycles
1.00        0.87        1.25        1.08        1.21        1.16        dTLB-load-misses
1.00        0.99        1.42        1.34        1.47        1.51        dTLB-loads
1.00        1.00        1.06        1.05        1.06        1.05        GHz
1.00        1.01        1.58        1.16        1.61        1.22        instructions
1.00        0.99        1.09        0.87        1.06        0.87        ipc
1.00        1.13        1.99        2.20        2.43        2.44        iTLB-load-misses
1.00        1.11        1.55        1.47        1.62        1.65        iTLB-loads
1.00        0.95        1.32        1.21        1.41        1.25        L1-dcache-load-misses
1.00        0.83        1.39        1.22        1.45        1.16        L1-dcache-loads
1.00        1.19        1.73        1.69        1.73        1.87        L1-icache-loads
1.00        1.16        1.67        1.75        1.68        1.65        L1-icache-loads-misses
1.00        2.71        3.01        2.75        3.13        3.03        stalled-cycles-backend
1.00        0.96        0.99        0.97        1.02        0.98        stalled-cycles-frontend
1.00        1.03        1.34        1.34        1.39        1.40        cpu/o
1.00        0.96        0.83        0.80        0.78        0.77        QPS

InnoDB

5.6.51  5.7.10  5.7.44  8.0.11  8.0.28  8.0.39
1.00    0.82    0.82    1.05    1.18    1.25    branches
1.00    0.86    0.93    1.10    1.23    1.35    branch-misses
1.00    0.94    0.99    1.08    1.27    1.42    cache-misses
1.00    0.94    0.98    1.09    1.25    1.38    cache-references
1.00    0.94    0.90    1.04    1.06    1.07    cycles
1.00    1.02    1.08    1.46    1.45    1.65    dTLB-load-misses
1.00    1.11    1.23    1.51    1.70    1.83    dTLB-loads
1.00    0.98    0.97    0.98    0.99    1.00    GHz
1.00    0.83    0.84    1.04    1.15    1.22    instructions
1.00    1.32    1.39    1.58    1.61    1.58    ipc
1.00    1.34    1.42    2.04    2.65    3.46    iTLB-load-misses
1.00    1.27    1.40    1.37    1.58    1.72    iTLB-loads
1.00    0.97    1.02    1.14    1.19    1.27    L1-dcache-load-misses
1.00    0.95    0.98    1.21    1.33    1.47    L1-dcache-loads
1.00    0.92    0.96    1.12    1.25    1.40    L1-icache-loads
1.00    1.05    1.09    1.21    1.43    1.61    L1-icache-loads-misses
1.00    0.86    0.84    1.04    1.05    1.07    stalled-cycles-backend
1.00    0.79    0.78    0.86    0.92    0.94    stalled-cycles-frontend
1.00    0.67    0.61    0.71    0.74    0.80    cpu/o
1.00    1.51    1.52    1.59    1.50    1.41    QPS

Results from perf stat: point-query

The tl;dr here is that MyRocks and InnoDB have similar regressions.

Summary:
  • the relative QPS is 0.85 & 0.83 for modern MyRocks & InnoDB so they get ~84% of the throughput vs older MyRocks & InnoDB.
  • the CPU overhead per query (cpu/o) is 1.18X & 1.22X larger in modern MyRocks & InnoDB vs older MyRocks & InnoDB
  • in both MyRocks and InnoDB there are large increases for TLB and cache activity. My biggest concern is for L1-icache-loads-misses which is 1.29X & 1.63X larger in modern MyRocks & InnoDB.
  • the large values for iTLB-loads and iTLB-load-misses are a concern, but reducing them to near zero (via using huge pages for text) increased QPS on the scan microbenchmark by only ~5% (see here)
  • both MyRocks and InnoDB suffer from code bloat because the number of instructions per SQL operation increases by 1.20X & 1.28X in modern MyRocks & InnoDB
MyRocks

fbmy5635.a  fbmy5635.b  fbmy8028.a  fbmy8028.b  fbmy8032.a  fbmy8032.b
1.00        1.05        1.16        1.15        1.16        1.21        branches
1.00        1.06        1.27        1.30        1.34        1.32        branch-misses
1.00        1.06        1.29        1.32        1.37        1.34        cache-misses
1.00        1.05        1.27        1.28        1.34        1.32        cache-references
1.00        1.01        1.05        1.05        1.06        1.06        cycles
1.00        0.93        1.09        1.00        1.05        0.99        dTLB-load-misses
1.00        1.04        1.38        1.31        1.36        1.38        dTLB-loads
1.00        1.00        1.01        1.01        1.02        1.02        GHz
1.00        1.04        1.15        1.14        1.14        1.20        instructions
1.00        1.00        0.95        0.93        0.91        0.95        ipc
1.00        1.23        1.26        1.58        1.53        1.66        iTLB-load-misses
1.00        1.08        1.34        1.24        1.46        1.41        iTLB-loads
1.00        1.02        1.19        1.17        1.21        1.22        L1-dcache-load-misses
1.00        1.04        1.20        1.19        1.22        1.25        L1-dcache-loads
1.00        1.05        1.27        1.30        1.36        1.33        L1-icache-loads
1.00        0.92        1.21        1.30        1.39        1.29        L1-icache-loads-misses
1.00        0.97        0.99        0.93        1.00        0.99        stalled-cycles-backend
1.00        1.01        1.01        0.99        0.97        1.01        stalled-cycles-frontend
1.00        1.03        1.15        1.15        1.17        1.18        cpu/o
1.00        0.97        0.87        0.87        0.85        0.85        QPS

InnoDB

5.6.51  5.7.10  5.7.44  8.0.11  8.0.28  8.0.39
1.00    1.06    1.10    1.16    1.21    1.27    branches
1.00    1.02    1.06    1.21    1.26    1.40    branch-misses
1.00    1.08    1.13    1.28    1.42    1.57    cache-misses
1.00    1.07    1.12    1.24    1.34    1.46    cache-references
1.00    1.06    1.07    1.10    1.12    1.15    cycles
1.00    0.85    0.94    0.94    1.04    1.09    dTLB-load-misses
1.00    1.32    1.49    1.65    1.71    1.84    dTLB-loads
1.00    1.01    1.02    1.03    1.03    1.04    GHz
1.00    1.06    1.10    1.19    1.20    1.28    instructions
1.00    1.04    1.04    1.02    0.96    0.92    ipc
1.00    1.66    1.30    1.69    2.57    2.99    iTLB-load-misses
1.00    1.20    1.38    1.49    1.52    1.63    iTLB-loads
1.00    1.09    1.15    1.25    1.27    1.34    L1-dcache-load-misses
1.00    1.06    1.12    1.21    1.24    1.33    L1-dcache-loads
1.00    1.06    1.09    1.21    1.28    1.40    L1-icache-loads
1.00    1.13    1.06    1.31    1.55    1.63    L1-icache-loads-misses
1.00    0.90    0.86    0.91    0.93    0.98    stalled-cycles-backend
1.00    0.92    0.95    0.95    0.97    1.00    stalled-cycles-frontend
1.00    0.99    1.02    1.09    1.15    1.22    cpu/o
1.00    1.04    1.01    0.94    0.89    0.83    QPS

Results from perf stat: update-index

The tl;dr here is that MyRocks and InnoDB have similar regressions.

Summary:
  • the relative QPS is 0.92 for modern MyRocks and 1.77 for modern InnoDB. So MyRocks gets slower over time. InnoDB in MySQL 8.0.39 is a lot faster than 5.6.51 but has been getting slower since 5.7.10 and there is a large regression from 8.0.28 to 8.0.39.
  • the CPU overhead per update (cpu/o) is 1.04X larger in modern MyRocks but for InnoDB it is smaller in 8.0.39 than in 5.6.51
  • in both MyRocks and InnoDB there are large increases for TLB and cache activity. My biggest concern is for L1-icache-loads-misses which is 1.30X & 1.75X larger in modern MyRocks & InnoDB.
  • the large values for iTLB-loads and iTLB-load-misses are a concern, but reducing them to near zero (via using huge pages for text) increased QPS on the scan microbenchmark by only ~5% (see here)
  • modern MyRocks suffers from code bloat because the number of instructions per SQL operation increases by 1.32X. For InnoDB this is less of an issue assuming the large increase from 8.0.28 to 8.0.39 is caused by mutex contention.
MyRocks

fbmy5635.a  fbmy5635.b  fbmy8028.a  fbmy8028.b  fbmy8032.a  fbmy8032.b
1.00        1.05        1.12        1.13        1.23        1.30        branches
1.00        1.02        1.03        1.04        1.09        1.09        branch-misses
1.00        1.01        1.12        1.12        1.19        1.18        cache-misses
1.00        1.02        1.10        1.09        1.18        1.17        cache-references
1.00        1.00        0.94        0.92        0.96        0.96        cycles
1.00        0.96        0.33        0.31        0.31        0.32        dTLB-load-misses
1.00        0.98        1.15        1.10        1.17        1.17        dTLB-loads
1.00        1.00        1.01        1.01        1.02        1.02        GHz
1.00        1.06        1.11        1.13        1.25        1.32        instructions
1.00        1.05        1.16        1.17        1.21        1.26        ipc
1.00        1.06        0.64        0.67        0.71        0.71        iTLB-load-misses
1.00        1.23        2.54        2.41        2.72        2.68        iTLB-loads
1.00        1.00        1.03        1.00        1.04        1.04        L1-dcache-load-misses
1.00        1.03        1.11        1.08        1.13        1.02        L1-dcache-loads
1.00        1.01        1.10        0.97        1.19        1.18        L1-icache-loads
1.00        1.11        1.22        1.22        1.34        1.30        L1-icache-loads-misses
1.00        0.49        0.77        0.42        0.76        0.75        stalled-cycles-backend
1.00        0.96        0.41        0.41        0.43        0.42        stalled-cycles-frontend
1.00        1.01        1.02        1.01        1.04        1.04        cpu/o
1.00        0.98        0.97        0.94        0.92        0.92        QPS

InnoDB

5.6.51  5.7.10  5.7.44  8.0.11  8.0.28  8.0.39
1.00    0.46    0.43    1.19    0.75    1.30    branches
1.00    0.70    0.67    1.20    0.85    1.26    branch-misses
1.00    1.01    0.97    1.35    1.20    1.52    cache-misses
1.00    0.89    0.86    1.39    1.11    1.52    cache-references
1.00    1.40    1.29    1.01    1.40    1.30    cycles
1.00    0.75    0.73    1.61    0.98    1.34    dTLB-load-misses
1.00    0.91    0.98    1.94    1.36    1.68    dTLB-loads
1.00    1.00    1.00    0.99    0.99    1.00    GHz
1.00    0.45    0.43    1.25    0.77    1.36    instructions
1.00    1.00    1.20    1.77    1.76    1.84    ipc
1.00    1.01    0.99    1.65    1.61    2.06    iTLB-load-misses
1.00    1.58    1.66    2.20    1.91    2.32    iTLB-loads
1.00    0.73    0.74    1.30    0.94    1.34    L1-dcache-load-misses
1.00    0.48    0.51    1.31    0.85    1.44    L1-dcache-loads
1.00    0.78    0.73    1.29    0.98    1.36    L1-icache-loads
1.00    0.98    1.03    1.62    1.34    1.75    L1-icache-loads-misses
1.00    0.65    0.59    1.28    0.84    1.29    stalled-cycles-backend
1.00    0.50    0.51    1.10    0.72    1.00    stalled-cycles-frontend
1.00    0.41    0.33    0.59    0.40    0.62    cpu/o
1.00    3.12    3.62    1.44    3.21    1.77    QPS
