This post has results from db_bench on the same instance type configured with and without hyperthreading, to determine the impact of hyperthreading on RocksDB performance.
tl;dr
- I used the LRU block cache and will soon repeat this experiment with Hyper Clock Cache
- When the CPU is oversubscribed, hyperthreading improves QPS on read-heavy benchmark steps but hurts it on the high-concurrency, write-heavy step (overwrite)
Builds
I used RocksDB 8.5.0 compiled from source.
Benchmark
The benchmark was run with the LRU block cache and an in-memory workload. The test database was ~15GB. The RocksDB benchmark scripts were used (here and here).
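The exact invocations are in the linked scripts. As a rough sketch of what one step looks like, the snippet below launches a single read-heavy db_bench step; the flags shown (--benchmarks, --use_existing_db, --db, --num, --threads, --cache_size, --duration) are standard db_bench options, but the values are illustrative assumptions, not the settings used for these results:

```python
import subprocess

# Hypothetical settings -- the real values come from the linked
# benchmark scripts, not from this sketch.
DB_DIR = "/data/rocksdb"      # assumed database directory
CACHE_SIZE = 100 * 1024**3    # assumed LRU block cache size in bytes
NUM_KEYS = 100_000_000        # assumed key count
THREADS = 40                  # one of the client thread counts tested

cmd = [
    "./db_bench",
    "--benchmarks=readrandom",
    "--use_existing_db=1",    # reuse the previously loaded database
    f"--db={DB_DIR}",
    f"--num={NUM_KEYS}",
    f"--threads={THREADS}",
    f"--cache_size={CACHE_SIZE}",
    "--duration=300",         # run the step for a fixed number of seconds
]
subprocess.run(cmd, check=True)
```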
The test server is a c2-standard-60 server from GCP with 120GB of RAM. The OS is Ubuntu 22.04. I repeated the tests with and without hyperthreading and name the servers ht0 and ht1:
- ht0 - hyperthreads disabled, 30 HW threads and 30 cores
- ht1 - hyperthreads enabled, 60 HW threads and 30 cores
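For reference, on a recent Linux kernel SMT can be checked and toggled at runtime via sysfs. This is a minimal sketch, assuming the kernel exposes /sys/devices/system/cpu/smt and that the guest allows toggling it (some cloud platforms instead control SMT at the instance level):

```python
# Minimal sketch: check and toggle SMT via sysfs. Requires root to
# write the control file; on some cloud guests SMT is fixed by the
# instance configuration and cannot be changed here.
SMT_DIR = "/sys/devices/system/cpu/smt"

def smt_active() -> bool:
    # "1" when sibling hyperthreads are online, "0" when they are not
    with open(f"{SMT_DIR}/active") as f:
        return f.read().strip() == "1"

def set_smt(enabled: bool) -> None:
    # Writing "on"/"off" onlines/offlines all SMT sibling CPUs
    with open(f"{SMT_DIR}/control", "w") as f:
        f.write("on" if enabled else "off")

if __name__ == "__main__":
    set_smt(False)  # the ht0 configuration: hyperthreads disabled
    print("SMT active:", smt_active())
```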
The benchmark was repeated for 10, 20, 30, 40 and 50 threads. At 10 threads the CPU is undersubscribed for both ht0 and ht1; at 50 threads it is oversubscribed for both. I want to see how performance changes as the workload moves from an undersubscribed to an oversubscribed CPU.
Results
Results are here and charts for the results are below. The y-axis for the charts starts at 0.9 rather than 0 to improve readability. The charts show the relative QPS, which is (QPS for ht1 / QPS for ht0). Hyperthreading helps when the relative QPS is greater than 1 (a sketch of this computation follows the list below).
- At 10 and 20 threads hyperthreading has a small, negative impact on QPS
- At 30 threads hyperthreading has no impact on QPS
- At 40 and 50 threads hyperthreading helps performance for read-heavy tests and hurts it for the concurrent, write-heavy test (overwrite)
- Note that fillseq always runs with 1 thread regardless of what the other tests use
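The relative QPS metric is a simple per-step ratio. A small sketch of the computation, with made-up QPS numbers standing in for the real results:

```python
# Compute relative QPS (ht1 / ht0) per benchmark step. A ratio > 1
# means hyperthreading helped. The numbers below are placeholders,
# not the measured results.
qps_ht0 = {"readrandom": 1000.0, "fwdrangeww": 500.0, "overwrite": 800.0}
qps_ht1 = {"readrandom": 1200.0, "fwdrangeww": 600.0, "overwrite": 700.0}

for step, ht0 in qps_ht0.items():
    rel = qps_ht1[step] / ht0
    verdict = "helps" if rel > 1 else "hurts" if rel < 1 else "no impact"
    print(f"{step:12s} relative QPS = {rel:.2f} ({verdict})")
```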
For both tests, the configuration with a much larger context switch rate (more than 2X larger) gets more QPS: ~1.2X more QPS for ht1 on fwdrangeww and ~1.1X more for ht0 on overwrite.
ht0   112    62   175
ht1   175   102   277
Next are the average values for context switches (cs), user CPU time (us) and system CPU time (sy) from vmstat, one table per client thread count; a sketch of how such averages can be collected follows the tables. The average CPU utilization sustained is higher with hyperthreads disabled, but that can be misleading: utilization is reported relative to the number of HW threads, and ht0 has half as many. The context switch rate is much higher when hyperthreads are disabled for 20, 30, 40 and 50 threads. That can mean there is more mutex contention.
10 threads
      cs       us     sy
ht0   15878    21.3   8.4
ht1   16753    10.9   4.4

20 threads
      cs       us     sy
ht0   50776    30.9   12.8
ht1   28250    15.7   6.8

30 threads
      cs       us     sy
ht0   526622   31.3   14.5
ht1   102929   20.1   9.9

40 threads
      cs       us     sy
ht0   833918   29.1   15.9
ht1   248478   22.3   12.7

50 threads
      cs       us     sy
ht0   1107510  27.7   15.8
ht1   461239   19.6   11.7
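The averages above can be produced by sampling vmstat while each benchmark step runs. A sketch, assuming the default vmstat column layout where cs, us and sy are the 12th, 13th and 14th fields:

```python
# Average the cs, us and sy columns from "vmstat <interval>" output.
# Assumes the default layout (... in cs us sy id wa st). The first
# sample reports averages since boot, so it is dropped.
import subprocess

def vmstat_averages(interval: int = 1, samples: int = 60):
    out = subprocess.run(
        ["vmstat", str(interval), str(samples + 1)],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = []
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():  # skip the two header lines
            rows.append(fields)
    rows = rows[1:]                         # drop the since-boot sample
    cs = sum(int(r[11]) for r in rows) / len(rows)
    us = sum(int(r[12]) for r in rows) / len(rows)
    sy = sum(int(r[13]) for r in rows) / len(rows)
    return cs, us, sy

if __name__ == "__main__":
    cs, us, sy = vmstat_averages()
    print(f"cs={cs:.0f} us={us:.1f} sy={sy:.1f}")
```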