Tuesday, August 22, 2023

The impact from hyperthreading for RocksDB db_bench on a medium server

This post has results from db_bench run on the same instance type configured with and without hyperthreading to measure the impact of hyperthreading.

tl;dr

  • I used the LRU block cache and will soon repeat this experiment with Hyper Clock Cache
  • When the CPU is oversubscribed hyperthreading improves QPS on read-heavy benchmark steps but hurts it on the high-concurrency, write-heavy step (overwrite)

Builds

I used RocksDB 8.5.0 compiled from source.

Benchmark

The benchmark was run with the LRU block cache and an in-memory workload. The test database was ~15GB. The RocksDB benchmark scripts were used (here and here).
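
As a rough sketch of how one db_bench step could be driven from a wrapper script (this is not the exact invocation that the benchmark scripts generate, and the flag values are placeholders rather than the settings used for these results):

import subprocess

def run_db_bench(step, threads, db_dir="/data/rocksdb"):
    # Run one db_bench step and return its stdout. The flag values below are
    # placeholders for illustration; the real runs used the RocksDB benchmark
    # scripts, which set many more options.
    cmd = [
        "./db_bench",
        "--benchmarks=" + step,
        "--threads=" + str(threads),
        "--db=" + db_dir,
        "--use_existing_db=1",                # reuse the ~15GB test database
        "--cache_size=" + str(32 * 1024**3),  # LRU block cache sized so the DB stays in memory
        "--duration=300",                     # fixed wall-clock run time in seconds
    ]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# Example: a read-heavy step at 50 client threads
# print(run_db_bench("readwhilewriting", 50))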

The test server is a c2-standard-60 server from GCP with 120G of RAM. The OS is Ubuntu 22.04. I repeated the tests with and without hyperthreading and name the servers ht0 and ht1 (a sketch for checking the SMT state follows the list):

  • ht0 - hyperthreads disabled, 30 HW threads and 30 cores
  • ht1 - hyperthreads enabled, 60 HW threads and 30 cores
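
As an aside, this is a minimal sketch, assuming a recent Linux kernel that exposes the sysfs SMT interface, for confirming whether hyperthreads are enabled on the host:

from pathlib import Path

def smt_enabled():
    # Returns True when SMT (hyperthreading) is active. Assumes a Linux kernel
    # that exposes the sysfs SMT interface; to disable SMT at runtime, root can
    # write "off" to /sys/devices/system/cpu/smt/control.
    return Path("/sys/devices/system/cpu/smt/active").read_text().strip() == "1"

print("hyperthreads enabled" if smt_enabled() else "hyperthreads disabled")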

The benchmark was repeated for 10, 20, 30, 40 and 50 threads. At 10 threads the CPU is undersubscribed for both ht0 and ht1. At 50 threads the CPU is oversubscribed for both ht0 and ht1. I want to see the impact on performance as the workload changes from an undersubscribed to an oversubscribed CPU.

Results

Results are here and charts for the results are below. The y-axis for the charts starts at 0.9 rather than 0 to improve readability. The charts show the relative QPS which is (QPS for ht1 / QPS for ht0). Hyperthreading helps when the relative QPS is greater than 1.
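
As a worked example of that metric, here is a small sketch with made-up numbers (the measured values are in the linked results):

# Relative QPS = (QPS for ht1) / (QPS for ht0), computed per benchmark step.
# The numbers below are made up for illustration only.
qps_ht0 = {"readwhilewriting": 1_000_000, "overwrite": 500_000}
qps_ht1 = {"readwhilewriting": 1_200_000, "overwrite": 450_000}

for step, base in qps_ht0.items():
    rel = qps_ht1[step] / base
    verdict = "hyperthreading helps" if rel > 1 else "hyperthreading hurts"
    print(f"{step:<18} relative QPS = {rel:.2f} ({verdict})")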

  • At 10 and 20 threads hyperthreading has a small, negative impact on QPS
  • At 30 threads hyperthreading has no impact on QPS
  • At 40 and 50 threads hyperthreading helps performance for read-heavy tests and hurts it for the concurrent, write-heavy test (overwrite)
  • Note that fillseq always runs with 1 thread regardless of what the other tests use

I abbreviated a few of the test names to fit on the chart -- revrangeww is revrangewhilewriting, fwdrangeww is fwdrangewhilewriting and readww is readwhilewriting. See the Benchmark section above that explains how I run db_bench.

What happens at 50 threads for fwdrangeww, where hyperthreading gets more QPS, vs overwrite, where it hurts QPS? The following sections attempt to explain it.

For both tests, the configuration with a much larger context switch rate (more than 2X larger) gets more QPS: ~1.2X more on fwdrangeww for ht1 and ~1.1X more on overwrite for ht0.

Explaining fwdrangeww

First are results where I divide user & system CPU seconds by QPS to compute CPU per operation for the test with 50 threads. The CPU time is measured by time db_bench ... and the result is multiplied by 1M, so the values below are CPU microseconds per operation.

From this, there might be more CPU time consumed per operation when hyperthreads are enabled. However, this metric can be misleading especially when the CPU isn't fully subscribed because a HW thread isn't a CPU core.

        user/q  sys/q   (user+sys)/q
ht0     139     23      163
ht1     183     35      219
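
Here is a minimal sketch of that arithmetic, assuming the metric is total (user + system) CPU seconds divided by the total number of operations completed (QPS times wall-clock seconds), scaled to microseconds; the inputs below are placeholders, not measured values:

def cpu_us_per_op(user_sec, sys_sec, wall_sec, qps):
    # CPU microseconds per operation: total CPU seconds (user + system, as
    # reported by time db_bench ...) divided by the total number of operations
    # completed (QPS * wall-clock seconds), scaled by 1M.
    total_ops = qps * wall_sec
    return (user_sec + sys_sec) * 1e6 / total_ops

# Placeholder inputs, not the measured values from this post.
print(round(cpu_us_per_op(user_sec=30000.0, sys_sec=5000.0, wall_sec=1200, qps=200_000)))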

Next are the average values for context switches (cs), user CPU time (us) and system CPU time (sy) from vmstat. The CPU utilization rates can be misleading as explained in the previous paragraph. The context switch rates for nt50 (50 threads) are more than 2X larger for ht1, the configuration that gets ~1.2X more QPS on this test.
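
A minimal sketch of how those averages can be computed, assuming vmstat was sampled at a fixed interval into a log file (for example, vmstat 1 > vmstat.log) with the default column layout:

def vmstat_averages(path):
    # Average the cs (context switches/sec), us (user CPU %) and sy (system
    # CPU %) columns from a "vmstat <interval>" log, assuming the default
    # column order: r b swpd free buff cache si so bi bo in cs us sy id wa ...
    cs_vals, us_vals, sy_vals = [], [], []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields or not fields[0].isdigit():
                continue  # skip the two header lines that vmstat repeats
            cs_vals.append(int(fields[11]))
            us_vals.append(int(fields[12]))
            sy_vals.append(int(fields[13]))
    n = len(cs_vals)
    return sum(cs_vals) / n, sum(us_vals) / n, sum(sy_vals) / n

# cs, us, sy = vmstat_averages("vmstat.log")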

10 threads
        cs      us      sy
ht0     132350  30.8    2.1
ht1     128609  15.1    1.1

20 threads
        cs      us      sy
ht0     515579  58.1    6.1
ht1     497088  28.8    3.1

30 threads
        cs      us      sy
ht0     1053293 82.6    11.8
ht1     1053209 41.1    5.9

40 threads
        cs      us      sy
ht0     900373  83.0    13.0
ht1     1338999 54.1    8.6

50 threads
        cs      us      sy
ht0     742973  83.1    14.1
ht1     1620549 66.0    12.7

Explaining overwrite

First are results where I divide user & system CPU seconds by QPS to compute CPU per operation for the test with 50 threads.

From this, there might be more CPU time consumed per operation when hyperthreads are enabled. However, this metric can be misleading because a HW thread isn't a CPU core.

        user/q  sys/q   (user+sys)/q
ht0     112     62      175
ht1     175     102     277

Next are the average values for context switches (cs), user CPU time (us) and system CPU time (sy) from vmstat. The average CPU utilization is higher with hyperthreads disabled, but that can also be misleading for the reason mentioned above. The context switch rate is much higher when hyperthreads are disabled for 20, 30, 40 and 50 threads. That can mean there is more mutex contention.

10 threads
        cs      us      sy
ht0     15878   21.3    8.4
ht1     16753   10.9    4.4

20 threads
        cs      us      sy
ht0     50776   30.9    12.8
ht1     28250   15.7    6.8

30 threads
        cs      us      sy
ht0     526622  31.3    14.5
ht1     102929  20.1    9.9

40 threads
        cs      us      sy
ht0     833918  29.1    15.9
ht1     248478  22.3    12.7

50 threads
        cs      us      sy
ht0     1107510 27.7    15.8
ht1     461239  19.6    11.7
