Thank you for clicking on one of my better clickbait titles. Obviously this trick won't work everywhere, but it does work on one of my home servers.
tl;dr
- Switching from the Ubuntu 22.04 Server default CPU frequency governor (schedutil) to the performance governor increases QPS by ~2X for many write-heavy microbenchmarks with MyRocks on one of my large home servers.
- The problem is that Linux is under-clocking the CPU with MyRocks on the server with the schedutil frequency governor. Switching to the performance frequency governor fixes that -- I am not making MyRocks faster by over-clocking.
- The same change doesn't impact InnoDB performance.
- The same change doesn't appear to make a difference on my other home servers.
Introduction
My home servers are listed here and changing the CPU frequency governor from schedutil to performance improves QPS by up to 2X on the write-heavy benchmark steps of sysbench. An example of my sysbench usage is here.
But first a short diversion: I finally switched from XFS to ext4 on my benchmark servers because I was getting odd performance and odd warnings (grep for "hogged CPU" in syslog). That was my first odd problem in Ubuntu 22.04 Server with the HWE kernel (currently 6.5.0-41-generic) -- see here.
This blog post is about my second odd problem and it is limited to one of my servers which is named socket2 below.
I have been experimenting with this on 3 servers. It is only a problem on one -- socket2:
- socket2
- the large server with the problem. This is the v6 server here and is a SuperMicro SuperWorkstation with 2 sockets, 12 cores/socket, Intel Xeon CPUs with HT disabled, 64G RAM, Ubuntu 22.04 and ext4 using data=writeback and SW RAID 0 over 2 NVMe devices. I enabled HWE kernels and the kernel is 6.5.0-41-generic.
- dell32
- another large server that doesn't have this problem. This is the v7 server here and is a Dell Precision 7865 Tower with 32-cores, AMD Ryzen Threadripper PRO 5975WX, AMD SMT disabled, 128G RAM, Ubuntu 22.04 and ext4 using data=writeback and SW RAID 0 over 2 NVMe devices. I enabled HWE kernels and the kernel is 6.5.0-41-generic.
- pn53
- a small server that doesn't have this problem. This is the v8 server here and is an ASUS ExpertCenter PN53 with 8 cores, AMD SMT disabled, an AMD Ryzen 7 7735HS CPU, 32G RAM, Ubuntu 22.04 and ext4 using data=writeback over 1 NVMe device. I enabled HWE kernels and the kernel is 6.5.0-41-generic.
The symptoms
The relative QPS for FB MyRocks was worse than expected on the socket2 server. By relative QPS in this case I mean: (QPS for FB MyRocks) / (QPS for InnoDB) where:
- FB MyRocks
- FB MySQL 8.0.32 with source as of 24-05-29 built at git hash 49b37dfe using RocksDB 9.2.1
- InnoDB
- InnoDB from MySQL 8.0.37
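For example (numbers made up for illustration), if MyRocks sustains 4,800 updates/second on a microbenchmark where InnoDB sustains 10,000, then the relative QPS is 0.48, and any value less than 1.0 means MyRocks is slower than InnoDB. The relative QPS per microbenchmark on socket2 is: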
0.82 point-query.pre_range=100
0.78 range-notcovered-si.pre_range=100
0.56 range-notcovered-si_range=100
0.46 scan_range=100
0.31 delete_range=100
0.24 insert_range=100
0.70 read-write_range=100
0.71 read-write_range=10
0.48 update-index_range=100
0.31 update-inlist_range=100
0.27 update-nonindex_range=100
0.27 update-one_range=100
0.27 update-zipf_range=100
0.78 write-only_range=10000
For reference, here are the results from dell32. They show that InnoDB still does better than MyRocks, but the worst case below has a difference of ~2X while the worst case above (on socket2) is ~4X.
0.84 point-query.pre_range=100
0.73 range-notcovered-si.pre_range=100
0.49 range-notcovered-si_range=100
0.42 scan_range=100
0.45 delete_range=100
0.46 insert_range=100
0.68 read-write_range=100
0.68 read-write_range=10
1.22 update-index_range=100
0.44 update-inlist_range=100
0.53 update-nonindex_range=100
0.54 update-one_range=100
0.53 update-zipf_range=100
0.76 write-only_range=10000
Diagnosis: step 1
Why does MyRocks do so much worse on socket2? I started with flamegraphs but they were similar between socket2 and dell32. So the distribution of time spent across the code is similar.
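A minimal sketch of one way to collect such a flamegraph (not the exact commands from my helper scripts), assuming a single mysqld is running, perf is installed, and Brendan Gregg's stackcollapse-perf.pl and flamegraph.pl scripts are on the PATH:
# sample on-CPU call stacks from the running mysqld for 60 seconds
perf record -F 99 -g -p $(pidof mysqld) -- sleep 60
# fold the stacks and render the flamegraph as an SVG
perf script | stackcollapse-perf.pl | flamegraph.pl > mysqld.svg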
Then I repeated the benchmark while recording perf counters (see here) and the first clue is below where perf claims the CPU runs at 1.018 GHz for MyRocks vs 2.823 GHz for InnoDB.
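A minimal sketch of one way to record those counters (again, not the exact commands from my helper scripts), assuming a 60-second sample taken while the microbenchmark is running is representative:
# the default perf stat counters include cycles, instructions, branches and context-switches
perf stat -p $(pidof mysqld) -- sleep 60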
With update-index I get this for MyRocks on socket2:
690,877 context-switches # 8.439 K/sec
12,119 cpu-migrations # 148.040 /sec
12,944 page-faults # 158.118 /sec
83,298,404,517 cycles # 1.018 GHz
50,239,502,774 instructions # 0.60 insn per cycle
9,318,574,550 branches # 113.831 M/sec
309,350,411 branch-misses # 3.32% of all branches
And then with update-index for InnoDB on socket2:
2,085,965 context-switches # 13.638 K/sec
270,671 cpu-migrations # 1.770 K/sec
167,039 page-faults # 1.092 K/sec
431,800,251,266 cycles # 2.823 GHz
503,680,176,190 instructions # 1.17 insn per cycle
86,250,048,599 branches # 563.915 M/sec
787,451,136 branch-misses # 0.91% of all branches
Results on the dell32 server don't reproduce the problem as perf claims the CPU runs at 3.038 GHz for MyRocks vs 3.269 GHz for InnoDB.
With update-index I get this for MyRocks on dell32:
1,998,242 context-switches # 18.791 K/sec
66,738 cpu-migrations # 627.599 /sec
55,534 page-faults # 522.237 /sec
323,031,027,098 cycles # 3.038 GHz (83.39%)
2,209,382,673 stalled-cycles-frontend
41,816,695,185 stalled-cycles-backend # 12.95% backend cycles idle (83.23%)
208,542,471,854 instructions # 0.65 insn per cycle
40,354,000,154 branches # 379.486 M/sec (83.46%)
3,275,786,898 branch-misses # 8.12% of all branches (83.63%)
And then with update-index for InnoDB on dell32:
2,202,343 context-switches # 12.645 K/sec
262,216 cpu-migrations # 1.506 K/sec
15 page-faults # 0.086 /sec
569,317,425,554 cycles # 3.269 GHz (83.39%)
3,105,643,611 stalled-cycles-frontend
57,983,678,763 stalled-cycles-backend
728,678,166,055 instructions # 1.28 insn per cycle
# 0.08 stalled cycles per insn (83.29%)
127,614,340,327 branches # 732.715 M/sec (83.35%)
3,729,114,189 branch-misses # 2.92% of all branches (83.42%)
Diagnosis: step 2
I then edited my helper scripts to sample per-core speeds from /proc/cpuinfo every 5 seconds and then aggregate them into 100 MHz buckets.
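A rough sketch of that kind of sampling (not exactly what my helper scripts do), assuming the "cpu MHz" lines in /proc/cpuinfo are a good enough proxy for per-core frequency and that 720 samples at 5-second intervals cover the benchmark step:
# sample per-core frequencies every 5 seconds, rounded down to 100 MHz buckets
for i in $(seq 1 720); do
  awk -F: '/^cpu MHz/ { print int($2 / 100) * 100 }' /proc/cpuinfo
  sleep 5
done > mhz.samples
# report one line per bucket: the sample count, then the bucket in MHz
sort -n mhz.samples | uniq -c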
Note that the socket2 server has 24 cores and the benchmark is run with 16 clients. From cpupower frequency-info I see the following for socket2 and learn that the CPU can range between 1 GHz and 3.5 GHz. Also, the schedutil CPU frequency governor is used.
CPUs which run at the same hardware frequency: 23
CPUs which need to have their frequency coordinated by software: 23
maximum transition latency: 20.0 us
hardware limits: 1000 MHz - 3.50 GHz
available cpufreq governors: conservative ondemand userspace powersave performance schedutil
current policy: frequency should be within 1000 MHz and 3.50 GHz.
The governor "schedutil" may decide which speed to use
within this range.
The output for dell32 is slightly different as it uses the acpi-cpufreq driver, and at least in theory, with AMD the CPU frequency is limited to one of three values:
CPUs which run at the same hardware frequency: 31
CPUs which need to have their frequency coordinated by software: 31
maximum transition latency: Cannot determine or is not supported.
hardware limits: 1.80 GHz - 7.01 GHz
available frequency steps: 3.60 GHz, 2.70 GHz, 1.80 GHz
available cpufreq governors: conservative ondemand userspace powersave performance schedutil
current policy: frequency should be within 1.80 GHz and 3.60 GHz.
The governor "schedutil" may decide which speed to use
within this range.
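The per-core governor and current frequency can also be read directly from sysfs, which is a quick way to confirm what cpupower reports:
# show the active cpufreq governor and the current frequency (in kHz) for each core
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq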
With update-index on socket2, I get the following for MyRocks when I aggregate the CPU speeds listed in /proc/cpuinfo into 100 MHz buckets (the first column is the number of samples, the second is the MHz bucket). Note that the histogram is skewed towards the lower values. This means that most of the cores run at a low speed most of the time, which isn't expected because the workload is CPU-bound and most cores should be busy:
970 1000
161 1100
94 1200
68 1300
95 1400
167 1500
67 1600
46 1700
29 1800
12 1900
6 2000
13 2100
14 2200
64 2300
38 2400
2 2500
4 2600
9 2700
2 2900
1 3500
With update-index and InnoDB on socket2 the histogram is skewed towards the larger values. So the result with InnoDB is what I expect and much better than the result above for MyRocks.
22 1000
4 1100
5 1200
6 1300
1 1400
4 1500
3 1600
10 1700
5 1800
8 1900
9 2000
12 2100
16 2200
34 2300
31 2400
37 2500
58 2600
69 2700
102 2800
554 2900
1895 3000
6 3100
8 3200
3 3300
1 3500
The fix
Why are the CPUs being run at low speed (often 1 GHz) for MyRocks, while with InnoDB they are run at high speed (often 3 GHz)? Thermal throttling is one possible reason, but that wasn't happening here.
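One way to rule out thermal throttling on an Intel server like socket2 is to watch the throttle counters in sysfs (a sketch; if the counts don't grow during the benchmark then throttling isn't the cause):
# per-core throttle counters; compare the values before and after a benchmark step
grep . /sys/devices/system/cpu/cpu*/thermal_throttle/core_throttle_count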
I then changed the CPU frequency governor from schedutil to performance using these commands:
cpupower frequency-set --governor performance
cpupower frequency-set -u 2.5GHz
cpupower frequency-set -d 2.0GHz
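The governor change can also be made without cpupower by writing to sysfs; a minimal sketch, run as root, and note that neither form persists across reboots:
# set the performance governor on every core
for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance > "$f"
done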
And then I repeated the benchmark. The first group of numbers below are the results with schedutil and the second group are the results with the performance frequency governor. Note that the values for many of the write-heavy microbenchmarks are about 2X larger with the performance frequency governor, which means that MyRocks is ~2X faster on them with the change. Unfortunately, I might never learn why.
First, the relative QPS for MyRocks prior to the change when using schedutil:
0.82 point-query.pre_range=100
0.78 range-notcovered-si.pre_range=100
0.56 range-notcovered-si_range=100
0.46 scan_range=100
0.31 delete_range=100
0.24 insert_range=100
0.70 read-write_range=100
0.71 read-write_range=10
0.48 update-index_range=100
0.31 update-inlist_range=100
0.27 update-nonindex_range=100
0.27 update-one_range=100
0.27 update-zipf_range=100
0.78 write-only_range=10000
And then the values for MyRocks after the change from schedutil to performance:
0.82 point-query.pre_range=100
0.77 range-notcovered-si.pre_range=100
0.54 range-notcovered-si_range=100
0.48 scan_range=100
0.52 delete_range=100
0.53 insert_range=100
0.73 read-write_range=100
0.75 read-write_range=10
1.03 update-index_range=100
0.52 update-inlist_range=100
0.60 update-nonindex_range=100
0.62 update-one_range=100
0.60 update-zipf_range=100
0.86 write-only_range=10000
Appendix
Random articles I found while debugging
- schedutil performance from Phoronix
- schedutil performance on Linux 5.1 from Phoronix
- much background from Linux docs
- my history with turbo boost
- CPU frequency governor user guide
- more kernel docs
- an LWN article
Comments
Comment: Old knowledge: https://clickhouse.com/docs/en/operations/tips#cpu-scaling-governor
Reply: that old knowledge would be more valuable if there were more details that motivated the advice
Comment: A shot in the dark: schedutil prioritizes power consumption and so it has to sample the load. It's quite possible that the sampling falls into some kind of convoy pattern. The hypothesis is that LSM writes are batched and large, so CPU consumption could be low during those periods. Maybe its sampling method is broken?
Comment: Another candidate could be: https://www.reddit.com/r/linux/comments/15p4bfs/amd_pstate_and_amd_pstate_epp_scaling_driver/ -- apparently Intel CPUs have similar issues.
Reply: I want everything to be simple & easy except for the thing in which I have the time and expertise to cope with the complexity.