Make MyRocks 2X faster by changing the CPU frequency governor

Thank you for clicking on one of my better clickbait titles. Obviously this trick won't work everywhere, but it does work on one of my home servers.

tl;dr

  • Switching from the Ubuntu 22.04 Server default CPU frequency governor (schedutil) to the performance governor increases QPS by ~2X for many write-heavy microbenchmarks with MyRocks on one of my large home servers.
  • The problem is that Linux is under-clocking the CPU with MyRocks on the server with the schedutil frequency governor. Switching to the performance frequency governor fixes that -- I am not making MyRocks faster by over-clocking. 
  • The same change doesn't impact InnoDB performance.
  • The same change doesn't appear to make a difference on my other home servers.

Introduction

My home servers are listed here. Changing the CPU frequency governor from schedutil to performance improves QPS by up to 2X on the write-heavy benchmark steps of sysbench. An example of my sysbench usage is here.
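
Something like the following gives a similar update-index workload with the stock oltp_update_index script. This is only a sketch and not my exact invocation: the user, password, table count, table size, client count and run time below are placeholders, and my real runs use the helper scripts linked above.

# stock sysbench oltp_update_index script; all values below are placeholders
sysbench oltp_update_index \
  --mysql-user=... --mysql-password=... --mysql-db=sbtest \
  --tables=8 --table-size=10000000 --threads=16 --time=300 \
  run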

But first, a short diversion: I finally switched from XFS to ext4 on my benchmark servers because I got odd performance and odd warnings with XFS (grep for "hogged CPU" in syslog). That was my first odd problem in Ubuntu 22.04 Server with the HWE kernel (currently 6.5.0-41-generic) -- see here.

This blog post is about my second odd problem, and it is limited to one of my servers, which is named socket2 below.

I have been experimenting with this on 3 servers. It is only a problem on one -- socket2:

  • socket2
    • the large server with the problem. This is the v6 server here and is a SuperMicro SuperWorkstation with 2 sockets, 12 cores/socket, Intel Xeon CPUs with HT disabled, 64G RAM, Ubuntu 22.04 and ext4 using data=writeback and SW RAID 0 over 2 NVMe devices. I enabled HWE kernels and the kernel is 6.5.0-41-generic.
  • dell32
    • another large server that doesn't have this problem. This is the v7 server here and is a Dell Precision 7865 Tower with 32-cores, AMD Ryzen Threadripper PRO 5975WX, AMD SMT disabled, 128G RAM, Ubuntu 22.04 and ext4 using data=writeback and SW RAID 0 over 2 NVMe devices. I enabled HWE kernels and the kernel is 6.5.0-41-generic.
  • pn53
    • a small server that doesn't have this problem. This is the v8 server here and is an ASUS ExpertCenter PN53 with 8 cores, AMD SMT disabled, an AMD Ryzen 7 7735HS CPU, 32G RAM, Ubuntu 22.04 and ext4 using data=writeback over 1 NVMe device. I enabled HWE kernels and the kernel is 6.5.0-41-generic.

The symptoms

The relative QPS for FB MyRocks was worse than expected on the socket2 server. By relative QPS in this case I mean: (QPS for FB MyRocks) / (QPS for InnoDB) where:

  • FB MyRocks
    • FB MySQL 8.0.32 with source as of 24-05-29 built at git hash 49b37dfe using RocksDB 9.2.1
  • InnoDB
    • InnoDB from MySQL 8.0.37

Using a subset of the sysbench microbenchmarks, the relative QPS for FB MyRocks on socket2 has values that are bad (less than 0.5) for some write-heavy tests. When the relative QPS is 0.48 (see update-index below), FB MyRocks gets ~48% of the QPS vs InnoDB (or InnoDB gets ~2X more QPS). This result is much worse than what I see on my other large server, the dell32.

0.82    point-query.pre_range=100
0.78    range-notcovered-si.pre_range=100
0.56    range-notcovered-si_range=100
0.46    scan_range=100
0.31    delete_range=100
0.24    insert_range=100
0.70    read-write_range=100
0.71    read-write_range=10
0.48    update-index_range=100
0.31    update-inlist_range=100
0.27    update-nonindex_range=100
0.27    update-one_range=100
0.27    update-zipf_range=100
0.78    write-only_range=10000

For reference, here are the results from dell32. They show that InnoDB still does better than MyRocks, but the worst case below is a difference of ~2X while the worst case above is ~4X.

0.84    point-query.pre_range=100
0.73    range-notcovered-si.pre_range=100
0.49    range-notcovered-si_range=100
0.42    scan_range=100
0.45    delete_range=100
0.46    insert_range=100
0.68    read-write_range=100
0.68    read-write_range=10
1.22    update-index_range=100
0.44    update-inlist_range=100
0.53    update-nonindex_range=100
0.54    update-one_range=100
0.53    update-zipf_range=100
0.76    write-only_range=10000

Diagnosis: step 1

Why does MyRocks do so much worse on socket2? I started with flamegraphs, but they were similar between socket2 and dell32, so the distribution of time spent across the code is similar.

Then I repeated the benchmark while recording perf counters (see here) and the first clue is below where perf claims the CPU runs at 1.018 GHz for MyRocks vs 2.823 GHz for InnoDB.
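
The exact commands live in the linked scripts, but counters like these can be collected with something close to the following sketch -- the 60 second window is made up and it assumes a single mysqld process:

# attach to mysqld and collect the counters shown below for 60 seconds
perf stat -e task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses \
  -p $(pidof mysqld) -- sleep 60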

With update-index I get this for MyRocks on socket2:

       81,862.93 msec task-clock        #   8.182 CPUs utilized
         690,877      context-switches  #   8.439 K/sec
          12,119      cpu-migrations    # 148.040 /sec
          12,944      page-faults       # 158.118 /sec
  83,298,404,517      cycles            #   1.018 GHz
  50,239,502,774      instructions      #   0.60  insn per cycle
   9,318,574,550      branches          # 113.831 M/sec
     309,350,411      branch-misses     #   3.32% of all branches

And then with update-index for InnoDB on socket2:

      152,948.80 msec task-clock        #  15.290 CPUs utilized
       2,085,965      context-switches  #  13.638 K/sec
         270,671      cpu-migrations    #   1.770 K/sec
         167,039      page-faults       #   1.092 K/sec
 431,800,251,266      cycles            #   2.823 GHz
 503,680,176,190      instructions      #   1.17  insn per cycle
  86,250,048,599      branches          # 563.915 M/sec
     787,451,136      branch-misses     #   0.91% of all branches

Results on the dell32 server don't reproduce the problem, as perf claims the CPU runs at 3.038 GHz for MyRocks vs 3.269 GHz for InnoDB.

With update-index I get this for MyRocks on dell32:

      106,338.62 msec task-clock        #   10.629 CPUs utilized
       1,998,242      context-switches  #   18.791 K/sec
          66,738      cpu-migrations    #  627.599 /sec
          55,534      page-faults       #  522.237 /sec
 323,031,027,098      cycles            #    3.038 GHz                         (83.39%)
    2,209,382,673      stalled-cycles-frontend  #  0.68% frontend cycles idle (82.94%)
   41,816,695,185      stalled-cycles-backend   # 12.95% backend cycles idle (83.23%)
  208,542,471,854      instructions      #    0.65  insn per cycle
                                         #    0.20  stalled cycles per insn (83.37%)
   40,354,000,154      branches          # 379.486 M/sec (83.46%)
    3,275,786,898      branch-misses     #   8.12% of all branches (83.63%)

And then with update-index for InnoDB on dell32:

      174,166.34 msec task-clock        #  17.409 CPUs utilized
       2,202,343      context-switches  #  12.645 K/sec
         262,216      cpu-migrations    #   1.506 K/sec
              15      page-faults       #   0.086 /sec
 569,317,425,554      cycles            #   3.269 GHz (83.39%)
   3,105,643,611      stalled-cycles-frontend  #  0.55% frontend cycles idle (83.38%)
  57,983,678,763      stalled-cycles-backend   # 10.18% backend cycles idle (83.19%)
 728,678,166,055      instructions      #   1.28  insn per cycle
                                        #   0.08  stalled cycles per insn (83.29%)
 127,614,340,327      branches          # 732.715 M/sec (83.35%)
   3,729,114,189      branch-misses     #   2.92% of all branches (83.42%)

Diagnosis: step 2

I then edited my helper scripts to sample per-core speeds from /proc/cpuinfo every 5 seconds and then aggregate them into 100 MHz buckets.
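
The real logic lives in my helper scripts; a minimal sketch of the sampling and bucketing (the 60-sample run length is made up) looks like this:

# sample per-core frequencies every 5 seconds, 60 times, then count the
# samples in 100 MHz buckets; output is "count MHz"
for i in $(seq 1 60); do
  grep 'cpu MHz' /proc/cpuinfo
  sleep 5
done | awk '{ b[int($4/100)*100]++ } END { for (m in b) print b[m], m }' | sort -k2 -n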

Note that the socket2 server has 24 cores and the benchmark is run with 16 clients. From cpupower frequency-info I see the following for socket2 and learn that the CPU can range between 1 GHz and 3.5 GHz. Also, the schedutil CPU frequency governor is used.

  driver: intel_cpufreq
  CPUs which run at the same hardware frequency: 23
  CPUs which need to have their frequency coordinated by software: 23
  maximum transition latency: 20.0 us
  hardware limits: 1000 MHz - 3.50 GHz
  available cpufreq governors: conservative ondemand userspace powersave performance schedutil
  current policy: frequency should be within 1000 MHz and 3.50 GHz.
                  The governor "schedutil" may decide which speed to use
                  within this range.

The output for dell32 is slightly different as it uses the acpi-cpufreq driver, and at least in theory, with AMD the CPU frequency is limited to one of three values:

  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 31
  CPUs which need to have their frequency coordinated by software: 31
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 1.80 GHz - 7.01 GHz
  available frequency steps:  3.60 GHz, 2.70 GHz, 1.80 GHz
  available cpufreq governors: conservative ondemand userspace powersave performance schedutil
  current policy: frequency should be within 1.80 GHz and 3.60 GHz.
                  The governor "schedutil" may decide which speed to use
                  within this range.
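
The active governor can also be confirmed per core by reading sysfs directly:

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor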

With update-index I get this for MyRocks on socket2 when I aggregate the CPU speeds listed in /proc/cpuinfo into 100 MHz buckets. Note that the histogram is skewed towards the lower values. This means that most of the cores run at a low speed most of the time, which isn't expected because the workload is CPU-bound and most cores should be busy:

  count MHz
   1042 900
    970 1000
    161 1100
     94 1200
     68 1300
     95 1400
    167 1500
     67 1600
     46 1700
     29 1800
     12 1900
      6 2000
     13 2100
     14 2200
     64 2300
     38 2400
      2 2500
      4 2600
      9 2700
      2 2900
      1 3500

With update-index and InnoDB on socket2 the histogram is skewed towards the larger values. So the result with InnoDB is what I expect and much better than the result above for MyRocks.

  count MHz
      1 900
     22 1000
      4 1100
      5 1200
      6 1300
      1 1400
      4 1500
      3 1600
     10 1700
      5 1800
      8 1900
      9 2000
     12 2100
     16 2200
     34 2300
     31 2400
     37 2500
     58 2600
     69 2700
    102 2800
    554 2900
   1895 3000
      6 3100
      8 3200
      3 3300
      1 3500

The fix

Why are the CPUs run at low speed (often 1 GHz) for MyRocks, while with InnoDB they run at high speed (often 3 GHz)? Thermal throttling is one possible reason, but that wasn't happening here.
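
A quick way to check for throttling on an Intel server like socket2 is the sketch below. The sysfs path is Intel-specific and not every kernel exposes it:

# per-core throttle counters; non-zero means the core was thermally throttled
grep . /sys/devices/system/cpu/cpu*/thermal_throttle/core_throttle_count

# the kernel usually logs throttling events as well
dmesg | grep -i throttle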

I then changed the CPU frequency governor from schedutil to performance using these commands:

cpupower frequency-set --governor performance
cpupower frequency-set -u 2.5GHz
cpupower frequency-set -d 2.0GHz
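
For hosts without the cpupower tool, the governor part of that change can also be made by writing to sysfs. This sketch covers only the first command above, not the min/max limits, and neither approach persists across a reboot by default:

echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor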

And then I repeated the benchmark. The first group of numbers below are the results with schedutil and the second group are the results with the performance frequency governor. Note that the values for many of the write-heavy microbenchmarks are about 2X larger with the performance frequency governor, which means that MyRocks is ~2X faster on them after the change. Unfortunately, I might never learn why.

First, the relative QPS for MyRocks prior to the change when using schedutil:

0.82    point-query.pre_range=100
0.78    range-notcovered-si.pre_range=100
0.56    range-notcovered-si_range=100
0.46    scan_range=100
0.31    delete_range=100
0.24    insert_range=100
0.70    read-write_range=100
0.71    read-write_range=10
0.48    update-index_range=100
0.31    update-inlist_range=100
0.27    update-nonindex_range=100
0.27    update-one_range=100
0.27    update-zipf_range=100
0.78    write-only_range=10000

And then the values for MyRocks after the change from schedutil to performance:

0.82    point-query.pre_range=100
0.77    range-notcovered-si.pre_range=100
0.54    range-notcovered-si_range=100
0.48    scan_range=100
0.52    delete_range=100
0.53    insert_range=100
0.73    read-write_range=100
0.75    read-write_range=10
1.03    update-index_range=100
0.52    update-inlist_range=100
0.60    update-nonindex_range=100
0.62    update-one_range=100
0.60    update-zipf_range=100
0.86    write-only_range=10000

Appendix 

Random articles I found while debugging

Comments

  1. Old knowledge: https://clickhouse.com/docs/en/operations/tips#cpu-scaling-governor

    Reply: that old knowledge would be more valuable if there were more details that motivated the advice

  2. A shot in the dark: Schedutil prioritizes power consumption and so it has to sample the load. It's quite possible that the sampling falls into some kind of convoy pattern. The hypothesis being that LSM writes are batched and large and CPU consumption could be low during those periods? Maybe its sampling method is broken?

  3. Another candidate could be: https://www.reddit.com/r/linux/comments/15p4bfs/amd_pstate_and_amd_pstate_epp_scaling_driver/ apparently Intel CPUs have similar issues.

    Reply: I want everything to be simple & easy except for the thing in which I have the time and expertise to cope with the complexity.
