Monday, January 2, 2023

Variance in RocksDB benchmarks on cloud HW

I ran experiments on AWS servers to measure throughput variance across multiple runs of my performance regression benchmarks. I am searching for a setup with less QPS variance to reduce false positives during performance regression tests.

The goals were to determine:

  • Which instances have less variance
  • Whether there is less variance with hyperthreading enabled or disabled
  • Whether there is less variance with EBS (io2) or local SSD 

The results were good for most of the instance types that I tried. By good I mean that the max coefficient of variance (stddev/mean) of the QPS for all benchmark steps of a given (server type, number of client threads) pair is less than 3%. Scroll down to the Which to choose section to see the list.

Servers

My focus was on m5d because m5d.2xl.ht1.nvme is the current setup used for automated tests. In the names I use below ht0 means hyperthreading disabled and ht1 means hyperthreading enabled.

The servers that I tested are:

  • m5.2xl.ht0, m5.2xl.ht1
    • m5.2xlarge, 8 vcpu, 32G RAM, EBS (io2, 1T, 32K IOPs), hyperthread disabled/enabled
  • m5d.2xl.ht0.nvme, m5d.2xl.ht1.nvme
    • m5d.2xlarge, 8 vcpu, 32G RAM, local NVMe SSD (300G), hyperthread disabled/enabled
  • m5d.2xl.ht0.io2, m5d.2xl.ht1.io2
    • m5d.2xlarge, 8 vcpu, 32G RAM, EBS (io2, 1T, 32K IOPs), hyperthread disabled/enabled
  • m5d.4xl.ht0.nvme, m5d.4xl.ht1.nvme
    • m5d.4xlarge, 16 vcpu, 64G RAM, local NVMe SSD (300G), hyperthread disabled/enabled
  • m5d.4xl.ht0.io2, m5d.4xl.ht1.io2
    • m5d.4xlarge, 16 vcpu, 64G RAM, EBS(io2, 1T, 64K IOPs), hyperthread disabled/enabled
  • c6i.2xl.ht1
    • c6i.2xlarge, 8 vcpu, 16G RAM, EBS(io2, 1T, 32K IOPs), hyperthread enabled
  • c6i.4xl.ht0, c6i.4xl.ht1
    • c6i.4xlarge, 16 vcpu, 32G RAM, EBS(io2, 1T, 64K IOPs), hyperthread disabled/enabled
  • c6i.8xl.ht0, c6i.8xl.ht1
    • c6i.8xlarge, 32 vcpu, 64G RAM, EBS(io2, 2T, 64K IOPs), hyperthread disabled/enabled
  • c6i.12xl.ht0, c6i.12xl.ht1
    • c6i.12xlarge, 48 vcpu, 96G RAM, EBS(io2, 2T, 64K IOPs), hyperthread disabled/enabled

Benchmark

An overview of how I run RocksDB benchmarks is here. The benchmark was run 3 times per server type and I focus on QPS, summarized by the standard deviation and the coefficient of variance (stddev / mean).
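
To make the summary statistic concrete, here is a minimal sketch (not part of my benchmark scripts) that computes the coefficient of variance of QPS per benchmark step across runs. The input format and QPS values are made up for illustration.

# Sketch: coefficient of variance (stddev / mean) of QPS per benchmark step.
# One dict of {benchmark step: QPS} per run; the numbers are made-up examples.
import statistics

def coefficient_of_variance(samples):
    # population stddev here; the sample stddev is another reasonable choice
    return statistics.pstdev(samples) / statistics.mean(samples)

runs = [
    {"fillseq": 310000, "readrandom": 152000, "overwrite": 98000},
    {"fillseq": 305000, "readrandom": 149000, "overwrite": 101000},
    {"fillseq": 312000, "readrandom": 150000, "overwrite": 99000},
]

for step in runs[0]:
    qps = [r[step] for r in runs]
    print(f"{step}: cv = {coefficient_of_variance(qps):.3f}")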

I used a fork of the RocksDB benchmark scripts. My fork is here, and I then applied another diff to disable one of the benchmark steps to save time. My helper scripts make it easy to run the benchmarks with reasonable configuration options. The important things the helper scripts do are listed below (a sketch of this logic follows the list):

  • Configure the shape of the LSM tree so there will be enough levels even with a small database
  • Set max_background_jobs as a function of the number of CPU cores
  • Set the RocksDB block cache size as a function of the amount of RAM
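
A minimal sketch of that kind of sizing logic is below. The formulas are rough approximations fit to the values listed later in this post (mbj=4 on the smaller servers and mbj=8 on the larger ones, block cache at roughly 75% to 85% of RAM) and are not copied from the helper scripts.

# Sketch: derive max_background_jobs and block cache size from the HW.
# The formulas are assumptions, not the exact logic in the helper scripts.
import os

def max_background_jobs(vcpus):
    # more CPU cores -> allow more background compaction and flush jobs
    return 4 if vcpus <= 8 else 8

def block_cache_gb(ram_gb):
    # leave headroom for memtables, the OS and the benchmark client
    return int(ram_gb * 0.8)

vcpus = os.cpu_count()
ram_gb = 32  # in practice this would be read from /proc/meminfo
print(max_background_jobs(vcpus), block_cache_gb(ram_gb))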

The benchmark was started using x3.sh like this, where $nt is the number of client threads, $cfg is the RocksDB configuration, $num_mem is the number of KV pairs for a cached workload and $num_io is the number of KV pairs for an IO-bound workload:

CI_TESTS_ONLY=true bash x3.sh $nt no 1800 $cfg $num_mem $num_io byrx iobuf

Rather than name the config, I describe the values for max_background_jobs (mbj) and the block cache size (bc). The values per server type for nt, mbj, bc, num_mem and num_io were (an illustrative driver sketch follows this list):

  • m5.2xl.ht0, m5.2xl.ht1
    • mbj=4, bc=24g, num_mem=40M, num_io=1B, nt=1,4
  • m5d.2xl.ht0.nvme, m5d.2xl.ht1.nvme
    • mbj=4, bc=24g, num_mem=40M, num_io=750M, nt=1,2,4
  • m5d.2xl.ht0.io2, m5d.2xl.ht1.io2
    • mbj=4, bc=24g, num_mem=40M, num_io=750M, nt=1,2,4
  • m5d.4xl.ht0.nvme, m5d.4xl.ht1.nvme
    • mbj=8, bc=52g, num_mem=40M, num_io=750M, nt=1,4,8
  • m5d.4xl.ht0.io2, m5d.4xl.ht1.io2
    • mbj=8, bc=52g, num_mem=40M, num_io=750M, nt=1,4,8
  • c6i.2xl.ht1
    • mbj=4, bc=12G, num_mem=40M, num_io=1B, nt=4
    • With num_mem=40M the database is slightly larger than memory. I should have used 20M.
  • c6i.4xl.ht0
    • mbj=4, bc=24G, num_mem=40M, num_io=1B, nt=4
  • c6i.4xl.ht1
    • mbj=8, bc=24G, num_mem=40M, num_io=2B, nt=8
  • c6i.8xl.ht0
    • mbj=8, bc=52G, num_mem=40M, num_io=2B, nt=8
  • c6i.8xl.ht1
    • mbj=8, bc=52G, num_mem=40M, num_io=2B, nt=16
  • c6i.12xl.ht0
    • mbj=8, bc=80G, num_mem=40M, num_io=2B, nt=16
  • c6i.12xl.ht1
    • mbj=8, bc=80G, num_mem=40M, num_io=3B, nt=24 
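
As an illustration of how those values map back onto the x3.sh command above, a small driver like the following could launch the runs for one server type. This is a sketch, not my actual workflow, and the config file name is hypothetical.

# Sketch: run x3.sh once per client thread count for one server type,
# using the command shown above. The config file name is hypothetical.
import subprocess

cfg = "my_rocksdb.cnf"           # hypothetical config name
num_mem, num_io = "40M", "750M"  # values for m5d.2xl.* from the list above
for nt in (1, 2, 4):             # client thread counts for m5d.2xl.*
    cmd = (f"CI_TESTS_ONLY=true bash x3.sh {nt} no 1800 "
           f"{cfg} {num_mem} {num_io} byrx iobuf")
    subprocess.run(cmd, shell=True, check=True)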

Graphs for a cached workload

This section has results for a cached workload, described as byrx earlier in this post. The benchmark steps are described here. The graphs show the coefficient of variance (stddev/mean) of the QPS for these benchmark steps: fillseq, readrandom, fwdrangewhilewriting, readwhilewriting and overwrite. Note that to save space the graphs use fwdrangeww and readww in place of fwdrangewhilewriting and readwhilewriting. For reasons I have yet to fully explain, the range query tests (fwdrangewhilewriting here) tend to have more QPS variance regardless of the hardware.

The graph for 1 client thread is here. The worst variance occurs for m5.2xl.ht0, m5.2xl.ht1, m5d.2xl.ht0.io2 and m5d.2xl.ht1.io2.

The graph for 2 client threads is here. The worst variance occurs for m5d.2xl.ht1.nvme and m5d.2xl.ht0.io2.

The graph for 4 client threads is here. The worst variance occurs for m5d.2xl.ht?.nvme, m5d.2xl.ht0.io2, m5d.4xl.ht0.nvme, m5d.4xl.ht0.io2 and c6i.2xl.ht1. I wrote above that the database didn't fit in the DBMS cache for c6i.2xl.ht1, which might explain why it is an outlier.

The graph for 8 client threads is here. The worst results occur for m5d.4xl.ht?.nvme and m5d.4xl.ht0.io2.

The graph for 16 client threads is here. The worst results occur for c6i.12xl.ht0.

Graphs for an IO-bound workload

This section has results for an IO-bound workload, described as iobuf earlier in this post. The benchmark steps are described here. The graphs show the coefficient of variance (stddev/mean) of the QPS for these benchmark steps: fillseq, readrandom, fwdrangewhilewriting, readwhilewriting and overwrite. Note that to save space the graphs use fwdrangeww and readww in place of fwdrangewhilewriting and readwhilewriting.

The graph for 1 client thread is here. The worst results occur for m5d.2xl.ht0.nvme, m5d.2xl.ht0.io2 and m5d.4xl.*.

The graph for 2 client threads is here. The worst results occur for m5d.2xl.ht0.nvme.

The graph for 4 client threads is here. The worst results occur for m5.2xl.ht1, m5d.2xl.*.nvme, m5d.4xl.ht0.io2 and c6i.2xl.ht1.

The graph for 8 client threads is here. The worst results occur for m5d.4xl.ht0.io2 and c6i.4xl.ht1.

The graph for 16 client threads is here. The worst results occur for c6i.8xl.ht1.

Which to choose?

Which setup (server type + number of client threads) should be used to minimize variance? The tables below list the max coefficient of variance from all tests for a given setup, where dop=X is the number of client threads. I will wave my hands and claim that a good result is a max coefficient of variance that is <= .03 (.03 = 3%).
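
To make that rule concrete, here is a small sketch that flags a (server type, dop) setup as good when its max coefficient of variance is <= 0.03 for both workloads. The two example entries use values from the tables below.

# Sketch: a setup is "good" when the max coefficient of variance across all
# benchmark steps is <= 0.03 for both the byrx and iobuf workloads.
THRESHOLD = 0.03

max_cv = {
    # (server type, dop): {workload: max cv}, values from the tables below
    ("m5d.2xl.ht0.nvme", 1): {"byrx": 0.013, "iobuf": 0.025},
    ("m5d.2xl.ht0.nvme", 4): {"byrx": 0.031, "iobuf": 0.016},
}

for (server, dop), per_workload in max_cv.items():
    good = all(cv <= THRESHOLD for cv in per_workload.values())
    print(f"{server} dop={dop}: {'good' if good else 'too much variance'}")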

These setups have good results for both cached (byrx) and IO-bound (iobuf) workloads:

  • m5d.2xl.ht0.nvme - dop=1,2
  • m5d.2xl.ht1.nvme - dop=1
  • m5d.2xl.ht1.io2 - dop=2,4
  • m5d.4xl.ht0.nvme - dop=1,4,8
  • m5d.4xl.ht1.nvme - dop=4,8
  • m5d.4xl.ht0.io2 - dop=1,4,8
  • m5d.4xl.ht1.io2 - dop=4,8
  • c6i.4xl.ht0 - dop=4
  • c6i.4xl.ht1 - dop=8
  • c6i.8xl.ht0 - dop=8
  • c6i.8xl.ht1 - dop=16
  • c6i.12xl.ht0 - dop=16

These are results for byrx (database is cached by RocksDB).

Max coefficient of variance

                    dop=1   dop=2   dop=4   dop=8   dop=16
m5.2xl.ht0          0.149           0.010
m5.2xl.ht1          0.050           0.028
m5d.2xl.ht0.nvme    0.013   0.021   0.031
m5d.2xl.ht1.nvme    0.014   0.066   0.043
m5d.2xl.ht0.io2     0.087   0.048   0.039
m5d.2xl.ht1.io2     0.075   0.016   0.014
m5d.4xl.ht0.nvme    0.027           0.028   0.026
m5d.4xl.ht1.nvme    0.038           0.017   0.019
m5d.4xl.ht0.io2     0.013           0.028   0.018
m5d.4xl.ht1.io2     0.039           0.009   0.015
c6i.2xl.ht1                         0.115
c6i.4xl.ht0                         0.021
c6i.4xl.ht1                                 0.008
c6i.8xl.ht0                                 0.013
c6i.8xl.ht1                                         0.007
c6i.12xl.ht0                                        0.007

These are results for iobuf (database is larger than memory).

Max coefficient of variance

                    dop=1   dop=2   dop=4   dop=8   dop=16
m5.2xl.ht0          0.005           0.006
m5.2xl.ht1          0.007           0.014
m5d.2xl.ht0.nvme    0.025   0.027   0.016
m5d.2xl.ht1.nvme    0.005   0.011   0.030
m5d.2xl.ht0.io2     0.008   0.008   0.010
m5d.2xl.ht1.io2     0.005   0.008   0.004
m5d.4xl.ht0.nvme    0.011           0.010   0.007
m5d.4xl.ht1.nvme    0.010           0.008   0.004
m5d.4xl.ht0.io2     0.011           0.028   0.013
m5d.4xl.ht1.io2     0.011           0.011   0.004
c6i.2xl.ht1                         0.042
c6i.4xl.ht0                         0.007
c6i.4xl.ht1                                 0.012
c6i.8xl.ht0                                 0.010
c6i.8xl.ht1                                         0.017
c6i.12xl.ht0                                        0.007

Performance summaries

In the linked URLs below: byrx means cached by RocksDB and iobuf means IO-bound with buffered IO. The links are to the performance summaries generated by the benchmark scripts.

Results for dop=1
Results for dop=2
Results for dop=4
Results for dop=8
Results for dop=16
Results for dop=24
