I used db_bench to check for performance regressions in RocksDB using leveled compaction and three different servers. A future post will have results for universal compaction. A recent report from me about db_bench on a large server is here.
tl;dr
- if you use buffered IO for compaction (not O_DIRECT) then bug 12038 is an issue starting in RocksDB 8.6. The workaround is to set compaction_readahead_size to be <= max_sectors_kb for your storage device (see the sketch after this list).
- there is a large regression for overwriteandwait that arrives in RocksDB 8.6 (see the previous bullet point) when using buffered IO for compaction. This is related to changes in compaction readahead and I need to do more debugging.
- otherwise there are some big improvements and some small regressions
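A minimal sketch of the workaround, assuming a device whose max_sectors_kb is 128 (the value on the small server below). The flags are real db_bench flags but the real benchmark scripts set many more options than this:

# Cap compaction readahead at max_sectors_kb (128KB here, an assumed value)
./db_bench --benchmarks=overwrite,waitforcompaction \
    --use_existing_db=1 --threads=1 --num=400000000 \
    --compaction_readahead_size=131072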
To learn more about max_sectors_kb and max_hw_sectors_kb start with this excellent doc from Jens Axboe and then read this blog post. Note that max_sectors_kb must be <= max_hw_sectors_kb and (AFAIK) max_hw_sectors_kb is read-only so you can't increase it.
I am working to determine whether the value for the storage device or RAID device takes precedence when SW RAID is used. From a few results, it looks like the value for the RAID device matters and the storage device's values are ignored (assuming the RAID device doesn't exceed what the storage devices support).
I also want to learn how max_hw_sectors_kb is set for a SW RAID device.
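These values can be checked, and max_sectors_kb changed, via sysfs. A sketch, assuming the device names nvme0n1 and md0 (yours will differ):

# per-device limits, values are in KB
cat /sys/block/nvme0n1/queue/max_hw_sectors_kb
cat /sys/block/nvme0n1/queue/max_sectors_kb
# with SW RAID the md device has its own queue limits
cat /sys/block/md0/queue/max_sectors_kb
# max_sectors_kb can be lowered (or raised, up to max_hw_sectors_kb) at runtime
echo 128 > /sys/block/nvme0n1/queue/max_sectors_kb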
Hardware
I tested on three servers:
- Small server
- SER4 - Beelink SER 4700u (see here) with 8 cores and a Ryzen 7 4700u CPU, ext4 with data=writeback and 1 NVMe device. The storage device has 128 for max_hw_sectors_kb and max_sectors_kb.
- Medium server
- C2D - a c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU and 16 cores, XFS with data=writeback, SW RAID 0 and 4 NVMe devices. The RAID device has 512 for max_hw_sectors_kb and max_sectors_kb while the storage devices have max_hw_sectors_kb = 2048 and max_sectors_kb = 1280.
- Big server
- BIG - Intel CPU, 1-socket, 48 cores with HT enabled, enough RAM (vague on purpose), XFS with SW RAID 0 and 3 devices. The RAID device has 128 for max_hw_sectors_kb and max_sectors_kb while the storage devices have max_hw_sectors_kb = 2048 and max_sectors_kb = 1280.
Builds
I compiled db_bench from source on all servers. On the small and medium servers I used versions 6.0.2, 6.10.4, 6.20.4, 6.29.5, 7.0.4, 7.3.2, 7.6.0, 7.10.2, 8.0.0, 8.1.1, 8.2.1, 8.3.3, 8.4.4, 8.5.4, 8.6.7, 8.7.3, 8.8.1, 8.9.2, 8.10.2, 8.11.4, 9.0.1, 9.1.2, 9.2.2, 9.3.1.
On the large server I used versions 8.1.1, 8.2.1, 8.3.3, 8.4.4, 8.5.4, 8.6.7, 8.7.3, 8.8.1, 8.9.2, 8.10.2, 8.11.4, 9.0.1, 9.1.2, 9.2.2, 9.3.1.
Benchmark
Everything used leveled compaction, the LRU block cache and the default value for compaction_readahead_size. Soon I will switch to using the hyper clock cache.
I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 1 thread on the small server, 8 threads on the medium server and 16 threads on the big server. How I do benchmarks for RocksDB is explained here and here. The command lines to run them are:
# Small server, SER4: use 1 thread, 20M KV pairs for cached, 400M for IO-bound
bash x3.sh 1 no 3600 c8r32 20000000 400000000 byrx iobuf iodir

# Medium server, C2D: use 8 threads, 40M KV pairs for cached, 2B for IO-bound
bash x3.sh 8 no 3600 c16r64 40000000 2000000000 byrx iobuf iodir

# Big server, BIG: use 16 threads, 40M KV pairs for cached, 800M for IO-bound
bash x3.sh 16 no 3600 c16r64 40000000 800000000 byrx iobuf iodir
I should have used a value larger than 800M for IO-bound on the BIG server, but the results are still IO-bound with 800M.
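Under the hood x3.sh runs db_bench several times in the sequence described above. A rough sketch of what that sequence can look like for the IO-bound runs on the medium server; the flag names are real db_bench flags, but the real scripts pass many more options than shown here:

NUM=2000000000
THREADS=8
./db_bench --benchmarks=fillseq --num=$NUM --threads=1                                                  # load in key order
./db_bench --benchmarks=readrandom --num=$NUM --threads=$THREADS --use_existing_db=1                    # read-only
./db_bench --benchmarks=overwrite,waitforcompaction --num=$NUM --threads=$THREADS --use_existing_db=1   # overwriteandwait
./db_bench --benchmarks=readwhilewriting --num=$NUM --threads=$THREADS --use_existing_db=1              # read-write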
Workloads
There are three workloads:
- byrx - the database is cached by RocksDB
- iobuf - the database is larger than memory and RocksDB uses buffered IO
- iodir - the database is larger than memory and RocksDB uses O_DIRECT (see the flag sketch after this list)
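A sketch of the db_bench flags that typically distinguish buffered IO from O_DIRECT; these are real flags but the actual scripts set more options than this:

# iobuf: buffered IO for user reads and for flush/compaction (the defaults)
./db_bench --benchmarks=readrandom --use_existing_db=1 \
    --use_direct_reads=0 --use_direct_io_for_flush_and_compaction=0

# iodir: O_DIRECT for user reads and for flush/compaction
./db_bench --benchmarks=readrandom --use_existing_db=1 \
    --use_direct_reads=1 --use_direct_io_for_flush_and_compaction=1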
A spreadsheet with all results is here and performance summaries linked below for byrx, iobuf and iodir.
- byrx - for the small, medium and big server
- iobuf - for the small, medium and big server
- iodir - for the small, medium and big server
Relative QPS
The numbers in the spreadsheet and on the y-axis in the charts that follow are the relative QPS, which is (QPS for $me) / (QPS for $base). When the value is greater than 1.0 then $me is faster than $base. When it is less than 1.0 then $base is faster (a performance regression).
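For example, with hypothetical numbers where $base gets 100000 QPS and $me gets 90000 QPS:

# hypothetical values: base = 100000 QPS, me = 90000 QPS
awk 'BEGIN { base=100000; me=90000; printf "relative QPS = %.2f\n", me/base }'
# prints: relative QPS = 0.90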
Results: byrx (cached)
The base case is RocksDB 6.0.2 for the small and medium servers. It is 8.1.1 for the big server. The numbers on the y-axis are the relative QPS.
Summary
- Small server
- most benchmarks are ~10% slower in 9.3 vs 6.0
- overwrite(andwait) might be ~20% slower but the comparison isn't fair because the andwait step (--benchmarks=waitforcompaction) is not done for old versions of RocksDB, including 6.0
- for overwriteandwait the relative QPS drops from 0.91 in 8.5.4 to 0.83 in 8.6.7, a decline of ~8%. This might be related to changes in compaction readahead that arrive in 8.6.
- Medium server
- most benchmarks are 5% to 10% slower in 9.3 vs 6.0
- overwrite(andwait) is faster in 9.3 but see the disclaimer in the Small server section above
- fillseq is slower after 6.0, but after repeating the benchmarks for 6.0, 6.1, ..., 6.10 I see a lot of variance, and in this case the result for 6.0.2 was a best-case result that most versions might get but usually don't
- Big server
- most benchmarks are ~4% slower in 9.3 vs 8.1
- overwriteandwait is ~2% faster in 9.3
Results: iobuf (IO-bound, buffered IO)
The base case is RocksDB 6.0.2 for the small and medium servers. It is 8.1.1 for the big server. The numbers on the y-axis are the relative QPS.
Summary
- Small server
- the *whilewriting tests are 1.2X to 1.4X faster in 9.3 than 6.0
- the overwriteandwait test is much slower in 9.3 and the regression occurs in 8.6.7. This is from changes in compaction readahead. The default value for the --compaction_readahead_size flag in 8.6.7 is 0 (a bug) that was fixed to 2M in 8.7.3, but the results don't improve in 8.7.3.
- Medium server
- most tests have no regressions from 6.0 to 9.3
- overwriteandwait is ~1.5X faster in 9.3 and the big improvements arrived late in 7.x
- fillseq was ~9% slower in 9.3 and I will try to reproduce that
- Big server
- most tests have no regressions from 8.1 to 9.3
- overwriteandwait is much slower in 9.3 and the regression arrived in 8.6.7, probably from changes to compaction readahead
Results: iodir (IO-bound, O_DIRECT)
The base case is RocksDB 6.0.2 for the small and medium servers. It is 8.1.1 for the big server. The numbers on the y-axis are the relative QPS.
Summary
- Small server
- fillseq is ~11% slower in 9.3 vs 6.0 and the regression arrived late in 6.x
- overwriteandwait is ~6% slower in 9.3 vs 6.0 and the regression might have arrived late in 6.x
- the *whilewriting tests are 1.2X to 1.4X faster in 9.3 vs 6.0 and the improvements arrived around 6.20
- Medium server
- most tests have similar performance between 9.3 and 6.0
- overwriteandwait is ~2X faster; the improvements arrived in 7.10 and 8.7
- Big server
- most tests have similar performance between 9.3 and 8.1
- fillseq and overwriteandwait are ~5% faster in 9.3
Debugging overwriteandwait performance
In the cases where there is a regression for overwriteandwait, the performance summaries for the small and big servers show large changes for several metrics. These changes are less obvious on the medium server.
- ops_sec - operations/second (obviously)
- c_wsecs - this is compaction wall clock time and c_csecs is compaction CPU time. The c_wsecs value increases slightly even though the total amount of writes has decreased (because ops_sec decreased). Also, the ratio (c_csecs / c_wsecs) decreases on the small and big servers, so there is more time spent waiting for IO during compaction (again, probably related to changes in how compaction readahead is done).
- stall% - this is the percentage of time there are write stalls. It increases after version 8.5 which is expected. Compaction becomes less efficient after 8.5 so there are more compaction stalls.
Values from iostat from the small server for overwriteandwait with iobuf show a big problem:
- rps (reads/s or r/s) increases after 8.5 (from ~1500 to ~18000)
- rmbps (read MB/s) decreases after 8.5 (from ~140 to ~85)
- rareqsz (read request size) decreases after 8.5 (from ~100 to ~4)
c rps rmbps rrqmps rawait rareqsz wps wmbps wrqmps wawait wareqsz
3716 1486 143.9 0.00 0.44 100.5 4882 585.8 10.03 0.43 122.7 -> 8.0
3717 1500 145.7 0.00 0.44 100.5 4905 586.9 9.96 0.44 122.3 -> 8.1
3716 1415 143.0 0.00 0.45 103.6 4912 589.2 9.41 0.45 122.6 -> 8.2
3717 1486 144.5 0.00 0.45 101.2 4931 590.9 9.10 0.43 122.5 -> 8.3
3717 1474 141.8 0.00 0.44 100.1 4884 584.9 9.51 0.44 122.4 -> 8.4
3716 1489 144.5 0.00 0.44 100.5 4934 591.1 9.94 0.44 122.5 -> 8.5
4098 18854 80.2 0.00 0.07 4.3 2935 351.3 6.99 0.35 122.3 -> 8.6
4061 18543 84.0 0.00 0.06 4.6 3074 367.7 7.34 0.36 122.4 -> 8.7
4079 18527 84.0 0.00 0.06 4.6 3053 365.4 7.23 0.36 122.5 -> 8.8
4064 18328 83.0 0.00 0.07 4.6 3036 363.3 7.18 0.36 122.4 -> 8.9
4062 18386 83.3 0.00 0.06 4.6 3054 365.8 7.37 0.35 122.6 -> 8.10
4029 17998 81.6 0.00 0.07 4.6 2980 356.8 7.29 0.35 122.5 -> 8.11
4022 18429 83.5 0.00 0.07 4.6 3045 364.6 7.08 0.35 122.6 -> 9.0
4028 18483 83.7 0.00 0.07 4.6 3052 365.5 7.35 0.35 122.6 -> 9.1
4020 17761 80.5 0.00 0.07 4.6 2947 352.6 7.27 0.35 122.5 -> 9.2
4011 16394 74.3 0.00 0.09 4.6 2713 325.1 6.31 0.33 122.6 -> 9.3
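For reference, the columns above come from iostat extended device statistics; a sketch of how they can be collected during a test (column names from recent sysstat versions):

# collect extended device stats, in KB, at 10-second intervals
iostat -kx 10
# relevant output columns: r/s (rps), rkB/s (converted to rmbps), rrqm/s (rrqmps),
# r_await (rawait), rareq-sz (rareqsz), w/s (wps), wkB/s (converted to wmbps),
# wrqm/s (wrqmps), w_await (wawait), wareq-sz (wareqsz)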
Values from iostat on the medium server for overwriteandwait with iobuf:
- the results look much better except for the glitch in 8.6 where db_bench had a bad default value for compaction_readahead_size
- there are changes in 8.7 but they are much smaller here than above for the small server
- rrqmps is larger, this is read requests merged/s
- rps is smaller, this is r/s
- rareqsz is larger, this is read request size
- rawait is smaller, this is read latency
c rps rmbps rrqmps rawait rareqsz wps wmbps wrqmps wawait wareqsz
3688 370 59.8 4.48 0.68 174.3 936 199.7 7.31 27.80 218.8 -> 8.0
3689 345 59.3 5.40 0.72 184.9 937 199.5 4.57 27.58 218.7 -> 8.1
3689 387 64.1 4.98 0.75 179.5 1013 215.5 3.35 23.90 218.4 -> 8.2
3679 362 58.4 4.58 0.73 175.1 920 196.7 4.63 29.25 219.6 -> 8.3
3690 375 61.5 5.01 0.76 178.6 976 208.0 3.54 25.11 218.6 -> 8.4
3689 361 58.5 4.78 0.76 176.6 930 198.2 3.49 27.58 218.6 -> 8.5
4168 5256 22.6 0.00 0.18 4.4 377 80.7 1.06 6.54 219.3 -> 8.6
3682 311 54.2 13.98 0.61 198.8 869 185.7 2.17 28.49 219.8 -> 8.7
3683 316 59.9 16.91 0.62 211.2 953 203.2 6.64 25.81 219.1 -> 8.8
3682 327 56.9 14.12 0.63 195.9 905 193.6 2.16 28.65 220.1 -> 8.9
3678 290 53.9 15.01 0.59 203.6 865 184.1 3.67 34.64 219.0 -> 8.10
3681 284 51.8 15.33 0.60 209.0 828 177.2 2.08 37.94 220.4 -> 8.11
3680 304 52.7 13.14 0.59 194.5 843 179.4 6.80 35.83 219.1 -> 9.0
3686 319 58.2 16.08 0.65 206.8 923 197.6 1.64 26.88 220.2 -> 9.1
3682 320 56.4 13.77 0.60 199.8 900 191.8 4.38 28.66 219.0 -> 9.2
3685 300 56.0 15.06 0.61 207.3 906 191.7 5.56 28.58 217.8 -> 9.3
And then I remembered that I saw this before and the issue is that starting with 8.6 the value for compaction_readahead_size should be <= max_sectors_kb value for the underlying storage device(s).
- see RocksDB issue 12038
- when SW RAID 0 is used I am not sure whether the value that matters is from the storage devices or the RAID device
Then I repeated the benchmark on the small and medium servers with compaction_readahead_size (CRS) set to 96k, 512k and 1M.
- small server
- the QPS from overwriteandwait was ~125k/s prior to 8.6 and then drops to ~85k/s in 8.6 and more recent versions. Results are better when compaction_readahead_size is decreased. It is ~123k/s with CRS = 96k, ~104k/s with CRS = 512k and ~88k/s with CRS = 1M.
- medium server
- the QPS from overwriteandwait was ~150k/s prior to 8.6 and ~145k/s in 8.6 and more recent releases. Results are better when compaction_readahead_size is decreased. It is ~172k/s with CRS = 96k, ~187k/s with CRS = 512k and ~161k/s with CRS = 1M.
- big server
- tests are in progress