Monday, November 6, 2023

Debugging perf changes in RocksDB 8.6 on a small server

This post has benchmark results for RocksDB 8.5 through 8.8 using a small server. I previously wrote about results for this benchmark on a large server. 

RocksDB 8.6 changes how prefetching is implemented for storage reads done by compaction. It now uses the readahead system call on Linux (see here and here) and the amount to prefetch is determined by the compaction_readahead_size option. There are/were two issues and RocksDB issue 12308 is open for this:

  • db_bench in 8.6 and earlier releases uses a default of 0 for compaction_readahead_size which clobbered the new default of 2MB, leading to too-small prefetch requests. This is fixed in 8.7.
  • db_bench in 8.6+ uses readahead and it appears that the performance on the overwrite benchmark step suffers with buffered IO when the value of compaction_readahead_size is larger than the value of max_sectors_kb for the underlying storage devices. When RAID is used I am not sure if compaction_readahead_size should be less than the smallest value of max_sectors_kb from any of the underlying devices or less than the sum.
  • While reducing compaction_readahead_size fixes the perf regression for the overwrite benchmark step with buffered IO it introduces a perf regression of up to 6% for other read+write benchmark steps. Reducing compaction_readahead_size also hurts perf with O_DIRECT, which doesn't use the readahead syscall and benefits from a larger prefetch size. I am curious if the large reads done by compaction when compaction_readahead_size is 480KB interfere (make slower) the user reads.
tl;dr, for now ...
  • Set compaction_readahead_size to be <= max_sectors_kb if using buffered IO
  • Keep the default for compaction_readahead_size if using O_DIRECT
Updates

Fixed a typo: s/min_sectors_kb/max_sectors_kb/g

Benchmark

The benchmark used the Beelink server explained here that has 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04 with the 5.15.0-79-generic kernel. There is just one storage device and no RAID. The value of max_sectors_kb is 512.

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was used 1 client thread. How I do benchmarks for RocksDB is explained here and here.

The benchmark was repeated in three setups:
  • cached - database fits in the RocksDB block cache
  • iobuf - IO-bound, working set doesn't fit in memory, uses buffered IO
  • iodir - IO-bound, working set doesn't fit in memory, uses O_DIRECT

Results

All of the charts below use QPS relative to 8.5.4 which is: (QPS for $me / QPS for RocksDB 8.5.4) where $me is some other version of RocksDB. Normalizing throughput like this makes it easy to see changes in performance.

The charts have results for the following. Note that max_sectors_kb is 512 so I reduced compaction_readahead_size to 480KB for some of the tests:
  • 8.6.7 - RocksDB 8.6.7 as-is
  • 8.7.2-2mb - RocksDB 8.7.2 with compaction_readahead_size=2MB (default)
  • 8.7.2-480kb - RocksDB 8.7.2 with compaction_readahead_size=480KB
  • 8.8.0-2mb - RocksDB 8.8.0 with compaction_readahead_size=2MB (default)
  • 8.8.0-480kb - RocksDB 8.8.0 with compaction_readahead_size=480KB

For a cached workload the QPS for overwrite ...
  • Drops in 8.6.7 because it uses compaction_readahead_size=0 (a bug, fixed in 8.7)
  • Drops by ~30% in 8.7 and 8.8 when compaction_readahead_size is larger than max_sectors_kb
Reducing compaction_readahead_size fixes performance for overwrite but reduces perf on the read+write benchmark steps by between 3% and 10%. The benchmark steps are revrangewhilewriting, fwdrangewhilewriting and readwhilewriting.

For the IO-bound workload with buffered IO the QPS for overwrite ...
  • Drops in 8.6.7 because it uses compaction_readahead_size=0 (a bug, fixed in 8.7)
  • Drops by ~10% in 8.7 and 8.8 when compaction_readahead_size is larger than max_sectors_kb
Reducing compaction_readahead_size fixes performance for overwrite but reduces perf on the read+write benchmark steps by between 3% and 5%. The benchmark steps are revrangewhilewriting, fwdrangewhilewriting and readwhilewriting.

For the IO-bound workload with O_DIRECT the QPS for overwrite ...
  • Drops in 8.6.7 because it uses compaction_readahead_size=0 (a bug, fixed in 8.7)
  • Drops by between 3% and 5% in 8.7 and 8.8 when compaction_readahead_size is reduced to be small than max_sectors_kb. The issue with a too-large value for compaction_readahead_size is only there when buffered IO is used.
Reducing compaction_readahead_size fixes performance for overwrite but reduces perf on the read+write benchmark steps by between 3% and 6%. The benchmark steps are revrangewhilewriting, fwdrangewhilewriting and readwhilewriting.

Debugging this for overwrite

This benchmark step only does Put operations, so the only storage reads are for compaction.

From iostat collected during the overwriteandwait benchmark step I see:
  • read request sizes (rareqsz) are too small when buffered IO is used and compaction_read_ahead_size is larger than max_sectors_kb. Below it is 10.5KB for the cached workload and 5.6KB for the IO-bound with buffered IO workload. The rareqsz increases to ~400KB when compaction_readahead_size is reduced to 480KB.
  • When O_DIRECT is used then the rareqsz increases by ~10% (from ~400KB to ~440KB) when compaction_readahead_size is reduced from 2MB to 480KB but throughput for overwrite, as mentioned above, also drops by ~5%.
Legend:

* rps - average value for iostat's r/s
* rmbps - average value for iostat's rMB/s
* rrqmps - average value for iostat's rrqm/s
* rawait - average value for iostat's r_await
* rareqsz - average value for iostat's rareq-sz, in KB
* version - RocksDB version

byrx
rps     rmbps   rrqmps  rawait  rareqsz version
8089    83.2    0.00    0.07    10.5    8.7.2-ra2mb
217     82.3    0.00    0.42    418.0   8.7.2-ra480kb

iobuf
rps     rmbps   rrqmps  rawait  rareqsz version
8436    48.8    0.00    0.04    5.6     8.7.2-ra2mb
148     62.9    0.00    0.33    390.7   8.7.2-ra480kb

iodir
rps     rmbps   rrqmps  rawait  rareqsz version
660     256.6   4.51    0.63    398.0   8.7.2-ra2mb
570     244.4   0.14    0.37    439.7   8.7.2-ra480kb

rps     rmbps   rrqmps  rawait  rareqsz version
668     258.1   4.89    0.63    395.7   8.8.0-ra2mb
569     244.0   0.15    0.37    439.4   8.8.0-ra480kb

From vmstat I see
  • Context switch rate is much smaller with compaction_readahead_size=480kb
  • User CPU utilization is larger with compaction_readahead_size=480kb
Legend:
* cs - average context switch rate from vmstat
* us - average user CPU utilization from vmstat
* sy - average system CPU utilization from vmstat
* us+sy - sum of us and sy
* id - average idle CPU utilization from vmstat
* wa - average wait CPU utilization from vmstat
* version - RocksDB version

byrx
cs      us      sy      us+sy   id      wa      version
17897   16.7    14.0    30.7    67.6    7.2     8.7.2-ra2mb
2190    23.9    14.0    37.9    64.0    1.0     8.7.2-480kb

iobuf
cs      us      sy      us+sy   id      wa      version
18652   17.3    14.0    31.3    72.3    4.4     8.7.2-ra2mb
2346    21.7    14.0    35.7    70.8    0.6     8.7.2-480kb

iodir
cs      us      sy      us+sy   id      wa      version
5660    23.1    14.0    37.1    67.2    5.2     8.7.2-ra2mb
6345    22.4    14.0    36.4    67.2    6.1     8.7.2-480kb

Debugging this for readwhilewriting

This benchmark step only rate-limited Put operations and then point queries as fast as possible. There will be some reads for compaction and many more (small) reads for the point queries. 

I assume that the larger compaction reads that happen with compaction_readahead_size=480kb interfere with the user reads and that would explain why QPS is slightly reduced. From the compaction IO stats for iobuf (long lines, see here) I see that compaction CPU overhead, see CompMergeCPU(sec), is similar at ~423s vs ~386s. The compaction wall clock time, see Comp(sec), is much smaller when compaction_readahead_size is 480kb, ~455s vs ~714s with it set to 2MB. This different doesn't repeat for iodir (see here).

From iostat collected during the readwhilewriting benchmark step I see:
  • Average read request size (rareqsz) increases when compaction_readahead_size is reduced to be smaller than max_sectors_kb
Legend:

* rps - average value for iostat's r/s
* rmbps - average value for iostat's rMB/s
* rrqmps - average value for iostat's rrqm/s
* rawait - average value for iostat's r_await
* rareqsz - average value for iostat's rareq-sz, in KB
* version - RocksDB version

byrx
rps     rmbps   rrqmps  rawait  rareqsz version
2577    29.1    0.00    0.07    12.1    8.7.2-ra2mb
2607    39.0    0.00    0.07    17.3    8.7.2-480kb

iobuf
rps     rmbps   rrqmps  rawait  rareqsz version
10081   76.9    0.00    0.06    7.9     8.7.2-ra2mb
7370    75.5    0.00    0.07    10.8    8.7.2-480kb

iodir
rps     rmbps   rrqmps  rawait  rareqsz version
11937   80.6    0.41    0.05    7.2     8.7.2-ra2mb
11170   76.7    0.02    0.05    7.5     8.7.2-480kb

From vmstat I see
  • Context switch rate is larger during iobuf (IO-bound with buffered IO) when compaction_readahead_size is 2MB.
Legend:
* cs - average context switch rate from vmstat
* us - average user CPU utilization from vmstat
* sy - average system CPU utilization from vmstat
* us+sy - sum of us and sy
* id - average idle CPU utilization from vmstat
* wa - average wait CPU utilization from vmstat
* version - RocksDB version

byrx
cs      us      sy      us+sy   id      wa      version
11305   6.0     14.0    20.0    82.8    7.5     8.7.2-ra2mb
11367   5.6     14.0    19.6    82.5    8.1     8.7.2-480kb

iobuf
cs      us      sy      us+sy   id      wa
24476   5.2     14.0    19.2    80.9    10.2    8.7.2-ra2mb
18744   5.0     14.0    19.0    82.7    8.9     8.7.2-480kb

iodir
cs      us      sy      us+sy   id      wa
25141   6.1     14.0    20.1    84.2    7.6     8.7.2-ra2mb
24524   5.9     14.0    19.9    83.9    8.0     8.7.2-480kb

No comments:

Post a Comment

Evaluating vector indexes in MariaDB and pgvector: part 2

This post has results from the ann-benchmarks with the   fashion-mnist-784-euclidean  dataset for MariaDB and Postgres (pgvector) with conc...