Wednesday, October 16, 2024

Sysbench vs c-state on an AMD 4700u

I repeated CPU-bound sysbench on my smallest server while using cpupower idle-set to disable some of the c-states to understand the impact on performance.

With a lower-concurrency (1 thread) workload there was up to a 13% performance improvement when some of the c-states were disabled. 

With a higher-concurrency (6 threads) workload there was up to a 14% performance improvement for one of the microbenchmarks, but the average and median benefits are much smaller than they are for the lower-concurrency tests.

I don't know whether that benefit is worth the impact (higher power consumption), so I don't have an opinion on whether this is a good thing to do. Be careful.

Builds

I compiled upstream MySQL 8.0.28 from source. The my.cnf file is here.

Hardware

The server here is a Beelink SER4 with an AMD Ryzen 7 4700u CPU (8 cores, SMT disabled), 16G of RAM and Ubuntu 22.04. The storage is 1 NVMe device.

The CPU used here (AMD 4700u) is described as a laptop-class CPU. The server is configured to use the performance frequency governor and the acpi-cpufreq scaling driver.
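
The frequency governor and scaling driver can be checked and set with cpupower and sysfs. This is a minimal sketch of the usual commands, not the exact setup script used for these tests:

    # show the scaling driver (acpi-cpufreq here) and current governor for cpu0
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

    # switch all CPUs to the performance governor (needs root)
    sudo cpupower frequency-set -g performance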

c-states

For background reading start here.

From cpupower idle-info the c-states and their latencies (in microseconds) are listed below. On this CPU the latency gap between C1 and C2 is large:
  • poll - latency=0
  • C1 - latency=1
  • C2 - latency=350
  • C3 - latency=400
The output from cpupower idle-info:

CPUidle driver: acpi_idle
CPUidle governor: menu
analyzing CPU 1:

Number of idle states: 4
Available idle states: POLL C1 C2 C3
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 581127542
Duration: 35202301723
C1:
Flags/Description: ACPI FFH MWAIT 0x0
Latency: 1
Usage: 115404404
Duration: 20416804588
C2:
Flags/Description: ACPI IOPORT 0x414
Latency: 350
Usage: 563498
Duration: 336593281
C3:
Flags/Description: ACPI IOPORT 0x415
Latency: 400
Usage: 13242213
Duration: 240735087110
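
Another way to see which idle states exist, and to confirm which ones cpupower idle-set has disabled, is the cpuidle sysfs interface. A minimal sketch, assuming the standard /sys/devices/system/cpu layout:

    # list each idle state for cpu0 with its exit latency (usec) and disable flag
    for s in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
      echo "$(basename $s): name=$(cat $s/name) latency=$(cat $s/latency) disabled=$(cat $s/disable)"
    done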

Benchmark

I used sysbench and my usage is explained here. A full run has 42 microbenchmarks and most test only 1 type of SQL statement. To save time I skip the read-only tests that normally run prior to the write-heavy tests. The database is cached by InnoDB.

The benchmark is run at two levels of concurrency -- 1 thread and 6 threads. In each case there is 1 table with 30M rows. Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

The command line for my helper script was:
    bash r.sh 1 30000000 300 600 nvme0n1 1 1 1 6
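
The helper script wraps sysbench. For readers who don't use it, the sketch below is a rough standalone equivalent of one point-query microbenchmark (1 table, 30M rows, 1 thread) using the oltp_point_select Lua script that ships with sysbench -- the script name, MySQL connection options and database name are assumptions, not what r.sh actually runs:

    # sketch only: approximates one microbenchmark, not the full 42-test run
    SB="--db-driver=mysql --mysql-user=root --mysql-db=test --tables=1 --table-size=30000000"
    sysbench oltp_point_select $SB prepare
    sysbench oltp_point_select $SB --threads=1 --time=300 run
    sysbench oltp_point_select $SB cleanup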

The benchmark was run for 3 c-state configurations (the corresponding cpupower commands are sketched after this list):
  • with all c-states enabled
    • named x.my8028_rel.z11a_bee.pk1.cstate.all below
  • with C1, C2 and C3 disabled via cpupower idle-set -D 1
    • named x.my8028_rel.z11a_bee.pk1.cstate.D1 below
  • with C2 and C3 disabled via cpupower idle-set -D 10
    • named x.my8028_rel.z11a_bee.pk1.cstate.D10 below
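
The configurations above map to the cpupower commands below -- run one of these before a benchmark pass. Note that cpupower idle-set -D disables all idle states whose exit latency is greater than or equal to the given value in microseconds, and -E restores the default with everything enabled:

    # cstate.all : all idle states enabled (the default)
    sudo cpupower idle-set -E
    # cstate.D1  : disable states with latency >= 1 usec, so C1, C2 and C3
    sudo cpupower idle-set -D 1
    # cstate.D10 : disable states with latency >= 10 usec, so C2 and C3
    sudo cpupower idle-set -D 10
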
Results: 1 thread

The numbers below are the relative QPS, which is: (QPS for a given c-state configuration / QPS for the base case) where the base case is the result with all c-states enabled.
  • Disabling C1, C2, and C3 gives up to 13% more QPS
  • Disabling C1 and C2 gives up to 10% more QPS
Relative to: x.my8028_rel.z11a_bee.pk1.cstate.all
col-1 : x.my8028_rel.z11a_bee.pk1.cstate.D1
col-2 : x.my8028_rel.z11a_bee.pk1.cstate.D10

col-1   col-2
1.10    1.09    hot-points_range=100
1.01    1.00    point-query.pre_range=100
1.00    0.99    point-query_range=100
1.12    1.09    points-covered-pk_range=100
1.13    1.10    points-covered-si_range=100
1.10    1.08    points-notcovered-pk_range=100
1.10    1.08    points-notcovered-si_range=100
1.00    1.00    random-points_range=1000
1.10    1.08    random-points_range=100
1.02    1.01    random-points_range=10
1.02    1.00    range-covered-pk_range=100
1.01    1.00    range-covered-si_range=100
1.02    1.01    range-notcovered-pk_range=100
1.12    1.10    range-notcovered-si.pre_range=100
1.12    1.10    range-notcovered-si_range=100
1.00    1.01    read-only_range=10000
1.05    1.04    read-only_range=100
1.05    1.04    read-only_range=10
0.98    0.98    scan_range=100
1.04    1.02    delete_range=100
1.04    1.03    insert_range=100
1.06    1.05    read-write_range=100
1.07    1.06    read-write_range=10
1.09    1.03    update-index_range=100
1.04    1.02    update-inlist_range=100
1.03    1.01    update-nonindex_range=100
1.03    1.02    update-one_range=100
1.03    1.02    update-zipf_range=100
1.06    1.03    write-only_range=10000

Results: 6 threads

The numbers below are the relative QPS, which is: (QPS for a given c-state configuration / QPS for the base case) where the base case is the result with all c-states enabled.
  • Disabling C1, C2, and C3 gives up to 14% more QPS
  • Disabling C1 and C2 gives up to 12% more QPS
  • With the exception of the update-one microbenchmark, the benefit from disabling c-states here is less than it is above for the tests run with 1 client thread. My guess is that update-one is helped here because it suffers from the most contention (all updates are done to the same row).
Relative to: x.my8028_rel.z11a_bee.pk1.cstate.all
col-1 : x.my8028_rel.z11a_bee.pk1.cstate.D1
col-2 : x.my8028_rel.z11a_bee.pk1.cstate.D10

col-1   col-2
1.04    1.04    hot-points_range=100
1.02    1.00    point-query.pre_range=100
1.02    0.99    point-query_range=100
1.03    1.03    points-covered-pk_range=100
1.04    1.03    points-covered-si_range=100
1.04    1.03    points-notcovered-pk_range=100
1.04    1.04    points-notcovered-si_range=100
1.01    1.00    random-points_range=1000
1.04    1.03    random-points_range=100
1.02    1.01    random-points_range=10
1.00    1.00    range-covered-pk_range=100
1.01    1.01    range-covered-si_range=100
1.01    1.01    range-notcovered-pk_range=100
1.04    1.04    range-notcovered-si.pre_range=100
1.04    1.04    range-notcovered-si_range=100
1.01    1.01    read-only_range=10000
1.02    1.01    read-only_range=100
1.03    1.01    read-only_range=10
0.97    0.97    scan_range=100
1.01    1.00    delete_range=100
1.02    1.01    insert_range=100
1.04    1.02    read-write_range=100
1.04    1.03    read-write_range=10
1.03    1.02    update-index_range=100
1.02    1.00    update-inlist_range=100
1.00    0.99    update-nonindex_range=100
1.14    1.12    update-one_range=100
1.00    0.98    update-zipf_range=100
1.05    1.04    write-only_range=10000
