Wednesday, October 16, 2024

Sysbench vs c-state on an AMD 4700u

I repeated the CPU-bound sysbench benchmark on my smallest server while using cpupower idle-set to disable some of the c-states, to understand the impact on performance.

With a lower-concurrency (1 thread) workload there was up to a 13% performance improvement when some of the c-states were disabled.

With a higher-concurrency (6 threads) workload there was up to a 14% performance improvement for one of the microbenchmarks, but the average and median benefit is much smaller than it is for the lower-concurrency tests.

I don't know whether that benefit is worth the impact (higher power consumption) so I don't have an opinion on whether this is a good thing to do. Be careful.

Builds

I compiled upstream MySQL 8.0.28 from source. The my.cnf file is here.

Hardware

The server here is a Beelink SER4 with an AMD Ryzen 7 4700u CPU (8 cores, SMT disabled), 16G of RAM and Ubuntu 22.04. The storage is 1 NVMe device.

The CPU used here (AMD 4700u) is described as a laptop class CPU. The server is configured to use the performance frequency governor and acpi-cpufreq scaling driver.
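
For reference, the scaling driver and governor can be checked and set with cpupower (part of linux-tools on Ubuntu); this is a sketch of the commands rather than a transcript from this server:

    # show the scaling driver, current governor and frequency limits
    cpupower frequency-info
    # use the performance governor on all CPUs
    sudo cpupower frequency-set -g performance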

c-states

For background reading start here.

From cpupower idle-info the c-states and their latencies are listed below. On this CPU the latency gap between C1 and C2 is large:
  • poll - latency=0
  • C1 - latency=1
  • C2 - latency=350
  • C3 - latency=400
The output from cpupower idle-info:

CPUidle driver: acpi_idle
CPUidle governor: menu
analyzing CPU 1:

Number of idle states: 4
Available idle states: POLL C1 C2 C3
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 581127542
Duration: 35202301723
C1:
Flags/Description: ACPI FFH MWAIT 0x0
Latency: 1
Usage: 115404404
Duration: 20416804588
C2:
Flags/Description: ACPI IOPORT 0x414
Latency: 350
Usage: 563498
Duration: 336593281
C3:
Flags/Description: ACPI IOPORT 0x415
Latency: 400
Usage: 13242213
Duration: 240735087110
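
Another way to see the same information, and I believe the interface that cpupower idle-set toggles, is sysfs (assuming cpu0 is representative of the other cores):

    # per c-state name, exit latency (usec) and disable flag (1 = disabled)
    grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
    grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
    grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/disable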

Benchmark

I used sysbench and my usage is explained here. A full run has 42 microbenchmarks, most of which test only one type of SQL statement, but here I skip the read-only tests that run prior to the write-heavy tests to save time. The database is cached by InnoDB.

The benchmark is run at two levels of concurrency: 1 thread and 6 threads. In each case there is 1 table with 30M rows. Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

The command line for my helper script was:
    bash r.sh 1 30000000 300 600 nvme0n1 1 1 1 6
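
As a rough standalone analogue using stock sysbench (the option values below are illustrative assumptions, not what my helper script passes), a 1-thread point-query run looks something like:

    # run the load phase first with "prepare" instead of "run"
    sysbench oltp_point_select --db-driver=mysql --mysql-user=root --mysql-db=test \
        --tables=1 --table-size=30000000 --threads=1 --time=300 --db-ps-mode=auto run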

The benchmark was run for 3 c-state configurations (the cpupower commands are shown after this list):
  • with all c-states enabled
    • named x.my8028_rel.z11a_bee.pk1.cstate.all below
  • with C1, C2 and C3 disabled via cpupower idle-set -D 1
    • named x.my8028_rel.z11a_bee.pk1.cstate.D1 below
  • with C2 and C3 disabled via cpupower idle-set -D 10
    • named x.my8028_rel.z11a_bee.pk1.cstate.D10 below
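
For reference, the idle-set commands behind the second and third configurations, plus the one that restores the default, are roughly as follows (run as root):

    cpupower idle-set -D 1     # on this server this disabled C1, C2 and C3
    cpupower idle-set -D 10    # on this server this disabled C2 and C3
    cpupower idle-set -E       # re-enable all c-states when done
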
Results: 1 thread

The numbers below are the relative QPS, which is: (QPS for the config / QPS for the base config), where base is the result with all c-states enabled.
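For example, with made-up numbers: if the base configuration gets 10,000 QPS on a microbenchmark and the D1 configuration gets 11,300 QPS, then the relative QPS is 11300 / 10000 = 1.13, or 13% more QPS.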
  • Disabling C1, C2, and C3 gives up to 13% more QPS
  • Disabling C1 and C2 gives up to 10% more QPS
Relative to: x.my8028_rel.z11a_bee.pk1.cstate.all
col-1 : x.my8028_rel.z11a_bee.pk1.cstate.D1
col-2 : x.my8028_rel.z11a_bee.pk1.cstate.D10

col-1   col-2
1.10    1.09    hot-points_range=100
1.01    1.00    point-query.pre_range=100
1.00    0.99    point-query_range=100
1.12    1.09    points-covered-pk_range=100
1.13    1.10    points-covered-si_range=100
1.10    1.08    points-notcovered-pk_range=100
1.10    1.08    points-notcovered-si_range=100
1.00    1.00    random-points_range=1000
1.10    1.08    random-points_range=100
1.02    1.01    random-points_range=10
1.02    1.00    range-covered-pk_range=100
1.01    1.00    range-covered-si_range=100
1.02    1.01    range-notcovered-pk_range=100
1.12    1.10    range-notcovered-si.pre_range=100
1.12    1.10    range-notcovered-si_range=100
1.00    1.01    read-only_range=10000
1.05    1.04    read-only_range=100
1.05    1.04    read-only_range=10
0.98    0.98    scan_range=100
1.04    1.02    delete_range=100
1.04    1.03    insert_range=100
1.06    1.05    read-write_range=100
1.07    1.06    read-write_range=10
1.09    1.03    update-index_range=100
1.04    1.02    update-inlist_range=100
1.03    1.01    update-nonindex_range=100
1.03    1.02    update-one_range=100
1.03    1.02    update-zipf_range=100
1.06    1.03    write-only_range=10000

Results: 6 threads

The numbers below are the relative QPS, which is: (QPS for the config / QPS for the base config), where base is the result with all c-states enabled.
  • Disabling C1, C2, and C3 gives up to 14% more QPS
  • Disabling C1 and C2 gives up to 12% more QPS
  • With the exception of the update-one microbenchmark, the benefit from disabling c-states here is less than it is above for the tests run with 1 client thread. My guess is that update-one is helped here because it suffers from the most contention (all updates are done to the same row).
Relative to: x.my8028_rel.z11a_bee.pk1.cstate.all
col-1 : x.my8028_rel.z11a_bee.pk1.cstate.D1
col-2 : x.my8028_rel.z11a_bee.pk1.cstate.D10

col-1   col-2
1.04    1.04    hot-points_range=100
1.02    1.00    point-query.pre_range=100
1.02    0.99    point-query_range=100
1.03    1.03    points-covered-pk_range=100
1.04    1.03    points-covered-si_range=100
1.04    1.03    points-notcovered-pk_range=100
1.04    1.04    points-notcovered-si_range=100
1.01    1.00    random-points_range=1000
1.04    1.03    random-points_range=100
1.02    1.01    random-points_range=10
1.00    1.00    range-covered-pk_range=100
1.01    1.01    range-covered-si_range=100
1.01    1.01    range-notcovered-pk_range=100
1.04    1.04    range-notcovered-si.pre_range=100
1.04    1.04    range-notcovered-si_range=100
1.01    1.01    read-only_range=10000
1.02    1.01    read-only_range=100
1.03    1.01    read-only_range=10
0.97    0.97    scan_range=100
1.01    1.00    delete_range=100
1.02    1.01    insert_range=100
1.04    1.02    read-write_range=100
1.04    1.03    read-write_range=10
1.03    1.02    update-index_range=100
1.02    1.00    update-inlist_range=100
1.00    0.99    update-nonindex_range=100
1.14    1.12    update-one_range=100
1.00    0.98    update-zipf_range=100
1.05    1.04    write-only_range=10000
