Thursday, July 18, 2024

Searching for regressions in RocksDB with db_bench

I used db_bench to check for performance regressions in RocksDB with leveled compaction on three different servers. A future post will have results for universal compaction. A recent report from me about db_bench on a large server is here.

tl;dr

  • if you use buffered IO for compaction (not O_DIRECT) then bug 12038 is an issue starting in RocksDB 8.6. The workaround is to set compaction_readahead_size to be <= max_sectors_kb for your storage device (see the sketch after this list). 
  • there is a large regression for overwriteandwait that arrives in RocksDB 8.6 (see the previous bullet point) when using buffered IO for compaction. This is related to changes for compaction readahead and I need to do more debugging.
  • otherwise there are some big improvements and some small regressions
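A minimal sketch of the workaround, assuming a single NVMe device named nvme0n1 (the device name, the value shown and the --db path are examples, not taken from the servers below):

# the limit is reported in KB
cat /sys/block/nvme0n1/queue/max_sectors_kb        # example output: 128

# set compaction_readahead_size (bytes) to <= that limit, here 128 KB = 131072
./db_bench --benchmarks=overwrite --db=/data/rocksdb --compaction_readahead_size=131072
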
To learn more about max_sectors_kb and max_hw_sectors_kb, start with this excellent doc from Jens Axboe and then read this blog post. Note that max_sectors_kb must be <= max_hw_sectors_kb and (AFAIK) max_hw_sectors_kb is read-only so you can't increase it.

I am working to determine whether the value for the storage device or the RAID device takes precedence when SW RAID is used. From a few results, it looks like the value for the RAID device matters and the storage devices' values are ignored (assuming the RAID device doesn't exceed what the storage devices support).

I also want to learn how max_hw_sectors_kb is set for a SW RAID device.
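
For reference, one way to check the values when SW RAID is used; md0, nvme0n1 and nvme1n1 are example device names, so substitute the RAID device and its members on your host:

# print max_sectors_kb and max_hw_sectors_kb for the RAID device and its members
for d in md0 nvme0n1 nvme1n1; do
    grep -H . /sys/block/$d/queue/max_sectors_kb /sys/block/$d/queue/max_hw_sectors_kb
done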

Hardware

I tested on three servers:
  • Small server
    • SER4 - Beelink SER 4700u (see here) with 8 cores and a Ryzen 7 4700u CPU, ext4 with data=writeback and 1 NVMe device. The storage device has 128 for max_hw_sectors_kb and max_sectors_kb.
  • Medium server
    • C2D - a c2d-highcpu-32 instance type on GCP with 32 vCPUs and 16 cores, ext4 with data=writeback, SW RAID 0 and 4 NVMe devices. The RAID device has 512 for max_hw_sectors_kb and max_sectors_kb while the storage devices have max_hw_sectors_kb=2048 and max_sectors_kb=1280.
  • Big server
    • BIG - Intel CPU, 1-socket, 48 cores with HT enabled, enough RAM (vague on purpose), XFS with SW RAID 0 and 3 devices. The RAID device has 128 for max_hw_sectors_kb and max_sectors_kb while the storage devices have max_hw_sectors_kb=2048 and max_sectors_kb=1280.
The small and medium servers use Ubuntu 22.04 with ext4. AMD SMT is disabled on them, while the big server has Intel HT enabled.

Builds

I compiled db_bench from source on all servers. On the small and medium servers I used versions 6.0.2, 6.10.4, 6.20.4, 6.29.5, 7.0.4, 7.3.2, 7.6.0, 7.10.2, 8.0.0, 8.1.1, 8.2.1, 8.3.3, 8.4.4, 8.5.4, 8.6.7, 8.7.3, 8.8.1, 8.9.2, 8.10.2, 8.11.4, 9.0.1, 9.1.2, 9.2.2, 9.3.1.

On the large server I used versions 8.1.1, 8.2.1, 8.3.3, 8.4.4, 8.5.4, 8.6.7, 8.7.3, 8.8.1, 8.9.2, 8.10.2, 8.11.4, 9.0.1, 9.1.2, 9.2.2, 9.3.1.

Benchmark

Everything used leveled compaction, the LRU block cache and the default value for compaction_readahead_size. Soon I will switch to using the hyper clock cache.

I used my fork of the RocksDB benchmark scripts, which are wrappers around db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The number of client threads is set per server in the command lines below (1 for the small server, 8 for the medium server and 16 for the big server). How I do benchmarks for RocksDB is explained here and here. The command lines to run them are: 
# Small server, SER4: use 1 thread, 20M KV pairs for cached, 400M for IO-bound
bash x3.sh 1 no 3600 c8r32 20000000 400000000 byrx iobuf iodir

# Medium server, C2D: use 8 threads, 40M KV pairs for cached, 2B for IO-bound 
bash x3.sh 8 no 3600 c16r64 40000000 2000000000 byrx iobuf iodir

# Big server, BIG: use 16 threads, 40M KV pairs for cached, 800M for IO-bound
bash x3.sh 16 no 3600 c16r64 40000000 800000000 byrx iobuf iodir

I should have used a value larger than 800M for IO-bound on the BIG server, but the results are still IO-bound with 800M. 

Workloads

There are three workloads:

  • byrx - the database is cached by RocksDB
  • iobuf - the database is larger than memory and RocksDB uses buffered IO
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT
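
In db_bench terms, the difference between iobuf and iodir is whether the direct IO flags are enabled. This is only a sketch of that distinction, not the full command lines my scripts generate:

# iobuf: buffered IO for user reads and for flush/compaction writes
./db_bench --benchmarks=readwhilewriting --use_direct_reads=false --use_direct_io_for_flush_and_compaction=false

# iodir: O_DIRECT for user reads and for flush/compaction writes
./db_bench --benchmarks=readwhilewriting --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true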

A spreadsheet with all results is here and performance summaries are linked below for byrx, iobuf and iodir.

Relative QPS

The numbers in the spreadsheet and on the y-axis in the charts that follow are the relative QPS, which is (QPS for $me) / (QPS for $base). When the value is greater than 1.0 then $me is faster than $base. When it is less than 1.0 then $base is faster (a perf regression!).
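
As a worked example using round numbers similar to the small server results further below (~125k ops/s for overwriteandwait before 8.6, ~85k ops/s after):

# relative QPS = (QPS for $me) / (QPS for $base)
awk 'BEGIN { printf "relative QPS = %.2f\n", 85000 / 125000 }'    # prints 0.68, so $me has a regression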

Results: byrx (cached)

The base case is RocksDB 6.0.2 for the small and medium servers. It is 8.1.1 for the big server. The numbers on the y-axis are the relative QPS.

Summary
  • Small server
    • most benchmarks are ~10% slower in 9.3 vs 6.0
    • overwrite(andwait) might be ~20% slower but the comparison isn't fair because the andwait step (--benchmarks=waitforcompaction) isn't done for old versions of RocksDB, including 6.0.
    • for overwriteandwait the relative QPS drops from 0.91 in 8.5.4 to 0.83 in 8.6.7, a drop of ~8%. This might be related to changes in compaction readahead that arrive in 8.6.
  • Medium server
    • most benchmarks are 5% to 10% slower in 9.3 vs 6.0
    • overwrite(andwait) is faster in 9.3 but see the disclaimer in the Small server section above
    • fillseq is slower after 6.0, but after repeating the benchmark for 6.0, 6.1, ..., 6.10 there is much variance. In this case the result for 6.0.2 was a best-case result that most versions might get, but usually don't
  • Big server
    • most benchmarks are ~4% slower in 9.3 vs 8.1
    • overwriteandwait is ~2% faster in 9.3

Results: iobuf (IO-bound, buffered IO)

The base case is RocksDB 6.0.2 for the small and medium servers. It is 8.1.1 for the big server. The numbers on the y-axis are the relative QPS.

Summary
  • Small server
    • the *whilewriting tests are 1.2X to 1.4X faster in 9.3 than 6.0
    • the overwriteandwait test is much slower in 9.3 and the regression arrives in 8.6.7. This is from changes in compaction readahead. The default value for the --compaction_readahead_size flag in 8.6.7 was 0 (a bug) and the fix in 8.7.3 makes it 2M, but the results don't improve in 8.7.3.
  • Medium server
    • most tests have no regressions from 6.0 to 9.3
    • overwriteandwait is ~1.5X faster in 9.3 and the big improvements arrived late in 7.x
    • fillseq was ~9% slower in 9.3 and I will try to reproduce that
  • Big server
    • most tests have no regressions from 8.1 to 9.3
    • overwriteandwait is much slower in 9.3 and the regression arrived in 8.6.7, probably from changes to compaction readahead
Results: iodir (IO-bound, O_DIRECT)

The base case is RocksDB 6.0.2 for the small and medium servers. It is 8.1.1 for the big server. The numbers on the y-axis are the relative QPS.

Summary
  • Small server
    • fillseq is ~11% slower in 9.3 vs 6.0 and the regression arrived late in 6.x
    • overwriteandwait is ~6% slower in 9.3 vs 6.0 and the regression might have arrived late in 6.x
    • the *whilewriting tests are 1.2X to 1.4X faster in 9.3 vs 6.0 and the improvements arrived around 6.20
  • Medium server
    • most tests have similar performance between 9.3 and 6.0
    • overwriteandwait is ~2X faster and the improvements arrived in 7.10 and 8.7
  • Big server
    • most tests have similar performance between 9.3 and 8.1
    • fillseq and overwriteandwait are ~5% faster in 9.3


Debugging overwriteandwait performance

In the cases where there is a regression for overwriteandwait, the summaries for the small and big servers show large changes for several metrics. These changes are less obvious on the medium server.
  • ops_sec - operations/second (obviously)
  • c_wsecs - compaction wall clock time (and c_csecs is compaction CPU time). The c_wsecs value increases slightly even though the total amount of writes has decreased (because ops_sec decreased). Also, the ratio (c_csecs / c_wsecs) decreases on the small and big servers, so more time is spent waiting for IO during compaction (again, probably related to changes in how compaction readahead is done).
  • stall% - the percentage of time there are write stalls. It increases after version 8.5, which is expected: compaction becomes less efficient after 8.5 so there are more stalls.
Values from iostat from the small server for overwriteandwait with iobuf show a big problem:
  • rps (reads/s or r/s) increases after 8.5 (from ~1500 to ~18000)
  • rmbps (read MB/s) decreases after 8.5 (from ~140 to ~85)
  • rareqsz (read request size) decreases after 8.5 (from ~100 to ~4)
c       rps     rmbps   rrqmps  rawait  rareqsz wps     wmbps   wrqmps  wawait  wareqsz
3716    1486    143.9   0.00    0.44    100.5   4882    585.8   10.03   0.43    122.7   -> 8.0
3717    1500    145.7   0.00    0.44    100.5   4905    586.9   9.96    0.44    122.3   -> 8.1
3716    1415    143.0   0.00    0.45    103.6   4912    589.2   9.41    0.45    122.6   -> 8.2
3717    1486    144.5   0.00    0.45    101.2   4931    590.9   9.10    0.43    122.5   -> 8.3
3717    1474    141.8   0.00    0.44    100.1   4884    584.9   9.51    0.44    122.4   -> 8.4
3716    1489    144.5   0.00    0.44    100.5   4934    591.1   9.94    0.44    122.5   -> 8.5
4098    18854   80.2    0.00    0.07    4.3     2935    351.3   6.99    0.35    122.3   -> 8.6
4061    18543   84.0    0.00    0.06    4.6     3074    367.7   7.34    0.36    122.4   -> 8.7
4079    18527   84.0    0.00    0.06    4.6     3053    365.4   7.23    0.36    122.5   -> 8.8
4064    18328   83.0    0.00    0.07    4.6     3036    363.3   7.18    0.36    122.4   -> 8.9
4062    18386   83.3    0.00    0.06    4.6     3054    365.8   7.37    0.35    122.6   -> 8.10
4029    17998   81.6    0.00    0.07    4.6     2980    356.8   7.29    0.35    122.5   -> 8.11
4022    18429   83.5    0.00    0.07    4.6     3045    364.6   7.08    0.35    122.6   -> 9.0
4028    18483   83.7    0.00    0.07    4.6     3052    365.5   7.35    0.35    122.6   -> 9.1
4020    17761   80.5    0.00    0.07    4.6     2947    352.6   7.27    0.35    122.5   -> 9.2
4011    16394   74.3    0.00    0.09    4.6     2713    325.1   6.31    0.33    122.6   -> 9.3
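
The metric names in these tables are shortened versions of the iostat extended statistics. A rough mapping, assuming a recent sysstat version (column names vary a bit across versions):

# collect extended per-device stats once per second with throughput in MB
iostat -xm 1
#   rps     -> r/s        reads per second
#   rmbps   -> rMB/s      read MB per second
#   rrqmps  -> rrqm/s     read requests merged per second
#   rawait  -> r_await    average read latency (ms)
#   rareqsz -> rareq-sz   average read request size (KB)
# the w* columns are the same metrics for writes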

Values from iostat on the medium server for overwriteandwait with iobuf
  • the results look much better except for the glitch in 8.6, when db_bench had a bad default value for compaction_readahead_size
  • there are changes in 8.7 but they are much smaller here than above for the small server
    • rrqmps (read requests merged/s) is larger
    • rps (reads/s) is smaller
    • rareqsz (read request size) is larger
    • rawait (read latency) is smaller
c       rps     rmbps   rrqmps  rawait  rareqsz wps     wmbps   wrqmps  wawait  wareqsz
3688    370     59.8    4.48    0.68    174.3   936     199.7   7.31    27.80   218.8   -> 8.0
3689    345     59.3    5.40    0.72    184.9   937     199.5   4.57    27.58   218.7   -> 8.1
3689    387     64.1    4.98    0.75    179.5   1013    215.5   3.35    23.90   218.4   -> 8.2
3679    362     58.4    4.58    0.73    175.1   920     196.7   4.63    29.25   219.6   -> 8.3
3690    375     61.5    5.01    0.76    178.6   976     208.0   3.54    25.11   218.6   -> 8.4
3689    361     58.5    4.78    0.76    176.6   930     198.2   3.49    27.58   218.6   -> 8.5
4168    5256    22.6    0.00    0.18    4.4     377     80.7    1.06    6.54    219.3   -> 8.6
3682    311     54.2    13.98   0.61    198.8   869     185.7   2.17    28.49   219.8   -> 8.7
3683    316     59.9    16.91   0.62    211.2   953     203.2   6.64    25.81   219.1   -> 8.8
3682    327     56.9    14.12   0.63    195.9   905     193.6   2.16    28.65   220.1   -> 8.9
3678    290     53.9    15.01   0.59    203.6   865     184.1   3.67    34.64   219.0   -> 8.10
3681    284     51.8    15.33   0.60    209.0   828     177.2   2.08    37.94   220.4   -> 8.11
3680    304     52.7    13.14   0.59    194.5   843     179.4   6.80    35.83   219.1   -> 9.0
3686    319     58.2    16.08   0.65    206.8   923     197.6   1.64    26.88   220.2   -> 9.1
3682    320     56.4    13.77   0.60    199.8   900     191.8   4.38    28.66   219.0   -> 9.2
3685    300     56.0    15.06   0.61    207.3   906     191.7   5.56    28.58   217.8   -> 9.3

And then I remembered that I have seen this before: starting with 8.6 the value for compaction_readahead_size should be <= the max_sectors_kb value for the underlying storage device(s).
  • see RocksDB issue 12038
  • when SW RAID 0 is used I am not sure whether the value that matters is from the storage devices or the RAID device
Then I repeated the benchmark on the small and medium servers with compaction_readahead_size (CRS) set to 96k, 512k and 1M.
  • small server
    • the QPS from overwriteandwait was ~125k/s prior to 8.6 and then drops to ~85k/s in 8.6 and more recent versions. Results are better when compaction_readahead_size is decreased: ~123k/s with CRS=96k, ~104k/s with CRS=512k and ~88k/s with CRS=1M.
  • medium server
    • the QPS from overwriteandwait was ~150k/s prior to 8.6 and ~145k/s in 8.6 and more recent releases. Results are better when compaction_readahead_size is decreased: ~172k/s with CRS=96k, ~187k/s with CRS=512k and ~161k/s with CRS=1M.
  • big server
    • tests are in progress