I repeated the CPU-bound sysbench tests on my smallest server while using cpupower idle-set to disable some of the c-states, to understand their impact on performance.
With a lower-concurrency (1 thread) workload there was up to a 13% performance improvement when some of the c-states were disabled.
With a higher-concurrency (6 threads) workload there was up to a 14% performance improvement for one of the microbenchmarks, but the average and median benefits are much smaller than for the lower-concurrency tests.
I don't know whether that benefit is worth the impact (higher power consumption) so I don't have an opinion on whether this is a good thing to do. Be careful.
Builds
I compiled upstream MySQL 8.0.28 from source. The my.cnf file is here.
Hardware
The server here is a Beelink SER4 with an AMD Ryzen 7 4700U CPU (SMT disabled, 8 cores), 16G of RAM and Ubuntu 22.04. The storage is 1 NVMe device.
The 4700U is described as a laptop class CPU. The server is configured to use the performance frequency governor and the acpi-cpufreq scaling driver.
c-states
From cpupower idle-info, the c-states and their latencies (in microseconds) are listed below. On this CPU the latency gap between C1 and C2 is large:
- poll - latency=0
- C1 - latency=1
- C2 - latency=350
- C3 - latency=400
The output from cpupower idle-info:
CPUidle driver: acpi_idle
CPUidle governor: menu
analyzing CPU 1:
Number of idle states: 4
Available idle states: POLL C1 C2 C3
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 581127542
Duration: 35202301723
C1:
Flags/Description: ACPI FFH MWAIT 0x0
Latency: 1
Usage: 115404404
Duration: 20416804588
C2:
Flags/Description: ACPI IOPORT 0x414
Latency: 350
Usage: 563498
Duration: 336593281
C3:
Flags/Description: ACPI IOPORT 0x415
Latency: 400
Usage: 13242213
Duration: 240735087110
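The same per-state names and latencies can also be read directly from sysfs, without cpupower. A minimal sketch, assuming the standard Linux cpuidle sysfs layout:

```shell
# Print the name and wakeup latency (usec) for each idle state on CPU 0.
# Uses the standard cpuidle sysfs files; prints nothing if they are absent.
for d in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
  [ -d "$d" ] || continue
  printf '%s latency=%s\n' "$(cat "$d/name")" "$(cat "$d/latency")"
done
```

On this server that loop would print POLL, C1, C2 and C3 with the latencies shown above.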
Benchmark
I used sysbench and my usage is
explained here. A full run has 42 microbenchmarks and most test only 1 type of SQL statement. But here I skip the read-only tests that run prior to writes to save time. The database is cached by InnoDB.
The benchmark is run at two levels of concurrency -- 1 thread, 6 threads. In each case there is 1 table, with 30M rows. Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.
The command line for my helper script was:
bash r.sh 1 30000000 300 600 nvme0n1 1 1 1 6
The benchmark was run for 3 c-state configurations:
- with all c-states enabled
- with C1, C2 and C3 disabled via cpupower idle-set -D 1
- with C2 and C3 disabled via cpupower idle-set -D 10
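Switching between these configurations can be sketched as below. The -D option disables every idle state whose latency is greater than or equal to the given value in microseconds, which is why -D 1 disables C1, C2 and C3 while -D 10 leaves C1 enabled; changing idle states requires root:

```shell
# Disable all idle states with latency >= 1 usec: C1, C2 and C3 on this CPU
sudo cpupower idle-set -D 1

# Disable all idle states with latency >= 10 usec: C2 and C3 here, C1 stays enabled
sudo cpupower idle-set -D 10

# Re-enable all idle states to return to the base configuration
sudo cpupower idle-set -E
```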
Results: 1 thread
The numbers below are the relative QPS which is: (QPS for me / QPS for base) where base is the result with all c-states enabled.
- Disabling C1, C2, and C3 gives up to 13% more QPS
- Disabling C2 and C3 gives up to 10% more QPS
Relative to: x.my8028_rel.z11a_bee.pk1.cstate.all
col-1 : x.my8028_rel.z11a_bee.pk1.cstate.D1
col-2 : x.my8028_rel.z11a_bee.pk1.cstate.D10
col-1 col-2
1.10 1.09 hot-points_range=100
1.01 1.00 point-query.pre_range=100
1.00 0.99 point-query_range=100
1.12 1.09 points-covered-pk_range=100
1.13 1.10 points-covered-si_range=100
1.10 1.08 points-notcovered-pk_range=100
1.10 1.08 points-notcovered-si_range=100
1.00 1.00 random-points_range=1000
1.10 1.08 random-points_range=100
1.02 1.01 random-points_range=10
1.02 1.00 range-covered-pk_range=100
1.01 1.00 range-covered-si_range=100
1.02 1.01 range-notcovered-pk_range=100
1.12 1.10 range-notcovered-si.pre_range=100
1.12 1.10 range-notcovered-si_range=100
1.00 1.01 read-only_range=10000
1.05 1.04 read-only_range=100
1.05 1.04 read-only_range=10
0.98 0.98 scan_range=100
1.04 1.02 delete_range=100
1.04 1.03 insert_range=100
1.06 1.05 read-write_range=100
1.07 1.06 read-write_range=10
1.09 1.03 update-index_range=100
1.04 1.02 update-inlist_range=100
1.03 1.01 update-nonindex_range=100
1.03 1.02 update-one_range=100
1.03 1.02 update-zipf_range=100
1.06 1.03 write-only_range=10000
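The relative QPS values in the table above are simple ratios. A minimal sketch of the computation, using made-up QPS numbers rather than actual benchmark output:

```shell
# Hypothetical QPS values for illustration only (not from the actual runs)
base_qps=4000    # all c-states enabled
test_qps=4520    # some c-states disabled via cpupower idle-set
# relative QPS = (QPS for the test config) / (QPS for the base config)
awk -v t="$test_qps" -v b="$base_qps" 'BEGIN { printf "%.2f\n", t / b }'
# prints 1.13
```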
Results: 6 threads
The numbers below are the relative QPS which is: (QPS for me / QPS for base) where base is the result with all c-states enabled.
- Disabling C1, C2, and C3 gives up to 14% more QPS
- Disabling C2 and C3 gives up to 12% more QPS
- With the exception of the update-one microbenchmark, the benefit from disabling c-states here is less than it is above for the tests run with 1 client thread. My guess is that update-one is helped here because it suffers from the most contention (all updates are done to the same row).
Relative to: x.my8028_rel.z11a_bee.pk1.cstate.all
col-1 : x.my8028_rel.z11a_bee.pk1.cstate.D1
col-2 : x.my8028_rel.z11a_bee.pk1.cstate.D10
col-1 col-2
1.04 1.04 hot-points_range=100
1.02 1.00 point-query.pre_range=100
1.02 0.99 point-query_range=100
1.03 1.03 points-covered-pk_range=100
1.04 1.03 points-covered-si_range=100
1.04 1.03 points-notcovered-pk_range=100
1.04 1.04 points-notcovered-si_range=100
1.01 1.00 random-points_range=1000
1.04 1.03 random-points_range=100
1.02 1.01 random-points_range=10
1.00 1.00 range-covered-pk_range=100
1.01 1.01 range-covered-si_range=100
1.01 1.01 range-notcovered-pk_range=100
1.04 1.04 range-notcovered-si.pre_range=100
1.04 1.04 range-notcovered-si_range=100
1.01 1.01 read-only_range=10000
1.02 1.01 read-only_range=100
1.03 1.01 read-only_range=10
0.97 0.97 scan_range=100
1.01 1.00 delete_range=100
1.02 1.01 insert_range=100
1.04 1.02 read-write_range=100
1.04 1.03 read-write_range=10
1.03 1.02 update-index_range=100
1.02 1.00 update-inlist_range=100
1.00 0.99 update-nonindex_range=100
1.14 1.12 update-one_range=100
1.00 0.98 update-zipf_range=100
1.05 1.04 write-only_range=10000