Monday, July 22, 2024

Searching for regressions in RocksDB with db_bench: part 2

In a recent post I shared results for RocksDB performance tests using versions from 6.0 through 9.0 and 3 different types of servers (small, medium, big). While there were few regressions over time, there is one regression that arrived in version 8.6 (bug 12038) and the workaround is one of:

  • use O_DIRECT for compaction reads
  • set compaction_readahead_size to be <= max_sectors_kb for the database storage device. When SW RAID is used I don't know whether the value that matters is from the underlying storage devices or the SW RAID device.
In this post I have more results from tests done with compaction_readahead_size set to a value <= max_sectors_kb.
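
The limits can be read from sysfs. This is a minimal sketch and the device names (nvme0n1, md0) are examples, not the devices used in these tests:
# per-device request size limits, in KB
cat /sys/block/nvme0n1/queue/max_sectors_kb
cat /sys/block/nvme0n1/queue/max_hw_sectors_kb
# when SW RAID is used, also check the md device
cat /sys/block/md0/queue/max_sectors_kb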

tl;dr
  • Setting compaction_readahead_size to be <= max_sectors_kb was good for performance on the small and big servers. One effect of this is the average read request size is large (tens of KB) when the value is correctly sized and ~4K (single-block reads) when it is not.
  • If you don't want to worry about this then use O_DIRECT for compaction reads
Read the Builds and Benchmark sections from my recent post for more context.

Hardware

I tested on three servers:
  • Small server
    • SER4 - Beelink SER 4700u (see here) with 8 cores and a Ryzen 7 4700u CPU, ext4 with data=writeback and 1 NVMe device. The storage device has 128 for max_hw_sectors_kb and max_sectors_kb.
    • I set compaction_readahead_size to 96K
  • Medium server
    • C2D - a c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU and 16 cores, XFS with data=writeback, SW RAID 0 and 4 NVMe devices. The RAID device has 512 for max_hw_sectors_kb and max_sectors_kb while the storage devices have max_hw_sectors_kb =2048 and max_sectors_kb =1280.
    • I set compaction_readahead_size to 512K
  • Big server
    • BIG - Intel CPU, 1-socket, 48 cores with HT enabled, enough RAM (vague on purpose), xfs with SW RAID 0 and 3 devices. The RAID device has 128 for max_hw_sectors_kb and max_sectors_kb while the storage devices have max_hw_sectors_kb =2048 and max_sectors_kb =1280.
    • I set compaction_readahead_size to 96K
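
The compaction_readahead_size values above are passed to db_bench in bytes. A minimal sketch for a device where max_sectors_kb is 128, so 96K (98304 bytes) stays under the limit. This is not the full command used by my benchmark scripts:
# 96K = 98304 bytes, which is <= max_sectors_kb (128 KB) for this device
./db_bench --benchmarks=overwrite,waitforcompaction --num=20000000 \
  --compaction_readahead_size=98304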

Workloads

There are two workloads:

  • byrx - the database is cached by RocksDB
  • iobuf - the database is larger than memory and RocksDB uses buffered IO
Results: byrx (cached)

For each server there are links to two sets of results.

The first set of results has 3 lines per test. The first line is from RocksDB 8.5.4, the second from 8.7.3 using the default (=2M) for compaction_readahead_size and the third from 8.7.3 with compaction_readahead_size =96K. An example is here.

The second set of results is similar to the first, except the second and third lines are from RocksDB 9.3.1 instead of 8.7.3.

Below I use CRS to mean compaction_readahead_size and compare the QPS from the overwriteandwait microbenchmark.

The results:
  • SER4 (small server)
    • Results for 8.5.4 vs 8.7.3 and then for 8.5.4 vs 9.3.1
    • Results for overwriteandwait are here for 8.7.3 and for 9.3.1
      • 8.7.3 and 9.3.1 with CRS =2M get ~12% less QPS than 8.5.4
      • 8.7.3 and 9.3.1 with CRS =96K get ~6% less QPS than 8.5.4
      • Setting CRS to be <= max_sectors_kb is good for perf
  • C2D (medium server)
    • Results for 8.5.4 vs 8.7.3 and then for 8.5.4 vs 9.3.1
    • Results for overwriteandwait are here for 8.7.3 and for 9.3.1
      • 8.7.3 and 9.3.1 with CRS =2M get ~2% more QPS than 8.5.4
      • 8.7.3 and 9.3.1 with CRS =96K get ~26% more QPS than 8.5.4
      • Setting CRS to be <= max_sectors_kb is good for perf
  • BIG (big server)
    • Results for 8.5.4 vs 8.7.3 and then for 8.5.4 vs 9.3.1
    • Results for overwriteandwait are here for 8.7.3 and for 9.3.1
      • 8.7.3 gets the same QPS as 8.5.4 with CRS set to either =2M or =96K
      • 9.3.1 gets ~4% less QPS than 8.5.4 with CRS set to either =2M or =96K
Summary
  • Setting compaction_readahead_size to be <= max_sectors_kb helps performance on the small and medium server but not on the big server. Note there are large differences on the big server between the value of max_sectors_kb for the RAID device and for the underlying storage devices -- it is much larger for the storage devices.
  • In the cases where reducing the value of compaction_readahead_size helped, QPS from overwriteandwait in RocksDB 8.5.4 is still better than the versions that follow
Results: iobuf (IO-bound with buffered IO)

For each server there are links to two sets of results.

The first set of results has 4 lines per test. The first line is from RocksDB 8.5.4, the second from 8.7.3 using the default (=2M) for compaction_readahead_size and the third from 8.7.3 with compaction_readahead_size =96K. O_DIRECT was not used for the first three lines. The fourth line is from 8.7.3 using O_DIRECT. An example is here.

The second set of results is similar to the first, except the second, third and fourth lines are from RocksDB 9.3.1 instead of 8.7.3.

Below I use CRS to mean compaction_readahead_size and compare the QPS from the overwriteandwait microbenchmark.

The results:
  • SER4 (small server)
    • Results for 8.5.4 vs 8.7.3 and then for 8.5.4 vs 9.3.1
    • Results for overwriteandwait are here for 8.7.3 and for 9.3.1
      • 8.7.3 and 9.3.1 with CRS =2M get 30% to 40% less QPS than 8.5.4
      • 8.7.3 and 9.3.1 with CRS =96K get ~10% less QPS than 8.5.4
      • 8.7.3 and 9.3.1 with O_DIRECT get ~2% more QPS than 8.5.4
      • Setting CRS to be <= max_sectors_kb is good for perf but O_DIRECT is better
      • Average read request size per iostat (see rareqsz here) is much larger with CRS =96K than =2M (84.5 vs 4.6)
  • C2D (medium server)
    • Results for 8.5.4 vs 8.7.3 and then for 8.5.4 vs 9.3.1
    • Results for overwriteandwait are here for 8.7.3 and for 9.3.1
      • 8.7.3 and 9.3.1 with CRS =2M get 4% to 7% less QPS than 8.5.4
      • 8.7.3 and 9.3.1 with CRS =512K get ~20% more QPS than 8.5.4
      • 8.7.3 and 9.3.1 with O_DIRECT get ~11% more QPS than 8.5.4
      • Setting CRS to be <= max_sectors_kb is good for perf and better than O_DIRECT
      • Average read request size per iostat (see rareqsz here) is similar with CRS =512K and 2M (185.9 vs 194.1)
  • BIG (big server)
    • Results for 8.5.4 vs 8.7.3 and then for 8.5.4 vs 9.3.1
    • Results for overwriteandwait are here for 8.7.3 and for 9.3.1
      • 8.7.3 and 9.3.1 with CRS =2M get ~28% less QPS than 8.5.4
      • 8.7.3 and 9.3.1 with CRS =96K get 4% to 7% less QPS than 8.5.4
      • 8.7.3 and 9.3.1 with O_DIRECT get 7% to 10% more QPS than 8.5.4
      • Setting CRS to be <= max_sectors_kb is good for perf but O_DIRECT is better
      • Average read request size per iostat (see rareqsz here) is much larger with CRS =96K than =2M (61.2 vs 5.1)
Summary
  • Setting compaction_readahead_size to be <= max_sectors_kb helps on all servers
  • On the small and big server, performance with O_DIRECT was better than without.

MyRocks vs InnoDB on cached sysbench: revised

A few weeks ago I shared results for sysbench with InnoDB and MyRocks on a variety of servers. The worst-case for MyRocks occurred on a 2-socket server with the write microbenchmarks. After some performance debugging I learned that changing the CPU frequency governor from schedutil to performance increased QPS by ~2X for the worst cases (see here) with MyRocks. Note that for Ubuntu 22.04 the default for the CPU frequency governor is schedutil.
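
For reference, one way to switch the governor is via the standard sysfs interface or the cpupower utility. This is a sketch, not necessarily how it was done for these tests:
# check the current governor for each CPU
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# switch all CPUs to the performance governor (either command works)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sudo cpupower frequency-set -g performance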

This blog post shares results for the 2-socket server after I repeated all tests with the performance CPU frequency governor.

tl;dr

  • MyRocks uses ~1.4X more CPU than InnoDB for this benchmark which means that MyRocks gets ~70% of the QPS compared to InnoDB for CPU-bound benchmarks. Note that compression is enabled for MyRocks but not for InnoDB. That will increase the CPU for MyRocks on the microbenchmarks that do writes. Here I ignore the benefits from compression, but they are a big deal in production.
  • the largest regressions from early MyRocks 5.6.35 to modern MyRocks 8.0.32 occur on writes and range queries, and a typical regression is ~5%
  • MySQL with InnoDB is faster in 8.0.37 (on average) than 5.6.35 for writes and some queries, but also up to 28% slower on some point and range queries

Builds

I tested the following builds for FB MyRocks:
  • 5635-210407 - FB MyRocks 5.6.35 at git sha f896415f (as of 21/04/07) with RocksDB 6.19.0
  • 5635-231016 - FB MyRocks 5.6.35 at git sha 4f3a57a1 (as of 23/10/16) with RocksDB 8.7.0
  • 8032-231204 - FB MyRocks 8.0.32 at git sha e3a854e8 (as of 23/12/04) with RocksDB 8.7.0
  • 8032-240529 - FB MyRocks 8.0.32 at git sha 49b37dfe (as of 24/05/29) with RocksDB 9.2.1
  • 8032-240529-LTO - same as 8032-240529 except adds link-time optimization
I also compiled upstream MySQL 5.6.35, 5.6.51, 5.7.10, 5.7.44, 8.0.11 and 8.0.37 from source. For 8.0.37 I also created a binary with link-time optimization (LTO) enabled via -DWITH_LTO=ON.
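
A sketch of configuring such an LTO build follows. Only -DWITH_LTO=ON comes from this post; the build type, boost options and paths are typical values, not the exact ones used here:
cmake ../mysql-8.0.37 -DCMAKE_BUILD_TYPE=Release -DWITH_LTO=ON \
  -DDOWNLOAD_BOOST=1 -DWITH_BOOST=$HOME/boost
make -j$(nproc)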

Hardware

I tested on one server that I call Socket2. It is a SuperMicro SuperWorkstation (see here) with 2-sockets, 12 cores/socket, 64G RAM and ext4 (SW RAID 0 over 2 NVMe devices). It uses Ubuntu 22.04 with ext4 and Intel HT is disabled. As described above, the server uses the performance CPU frequency governor.

Benchmark

I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by MyRocks and InnoDB.

The benchmark is run with 16 threads, 8 tables and 10M rows per table. Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

The command lines for my helper scripts were:
# Socket2 -> 16 clients
bash r.sh 8 10000000 300 600 md0 1 1 16
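
For readers who don't use my helper scripts, this is a rough sketch of the kind of sysbench invocation they wrap. The Lua script name and connection options are examples and may not match the microbenchmarks used here:
sysbench oltp_point_select --mysql-user=root --mysql-db=test \
  --tables=8 --table_size=10000000 --threads=16 --time=300 run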

Results

For the results below I split the 42 microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. The spreadsheet with all data is here. For each microbenchmark group there is a table with summary statistics. I don't have charts because that would use too much space, but the results per microbenchmark are in the spreadsheets.

The numbers in the spreadsheets and the tables below are the relative QPS which is (QPS for my version) / (QPS for base case). When the relative throughput is > 1 then that version is faster than the base case.
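
As a worked example with made-up numbers: if the base case gets 50000 QPS and the other version gets 45000 QPS then the relative QPS is 45000 / 50000 = 0.90, a 10% regression:
awk 'BEGIN { printf "%.2f\n", 45000 / 50000 }'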

Results: MyRocks 5.6.35 vs 8.0.32

The numbers in the tables are the relative QPS as explained above.

The base case is 5635-210407, FB MyRocks 5.6.35 as of 21/04/07. The tables below compare it with:
  • 5635-231016 - FB MyRocks 5.6.35 as of 23/10/16
  • 8032-231204 - FB MyRocks 8.0.32 as of 23/12/04
  • 8032-240529 - FB MyRocks 8.0.32 as of 24/05/29
  • 8032-240529-LTO - FB MyRocks 8.0.32 as of 24/05/29 with link-time optimization
This shows the relative QPS as: (QPS for 5635-231016) / (QPS for 5635-210407)
  • the max (1.24) occurs on points-covered-si.pre_range=100
  • the min (0.90) occurs on range-notcovered-pk.pre_range=100 
5635-231016     min     max     avg     median
point-1         0.93    1.24    1.02    0.98
point-2         0.94    0.98    0.96    0.95
range-1         0.90    0.98    0.94    0.94
range-2         0.92    0.99    0.95    0.96
writes          0.94    0.99    0.96    0.95

This shows the relative QPS as: (QPS for 8032-231204) / (QPS for 5635-210407)
  • the max (1.28) occurs on points-covered-si.pre_range=100
  • the min (0.74) occurs on  scan_range=100
8032-231204     min     max     avg     median
point-1         0.86    1.28    0.98    0.92
point-2         0.91    1.08    0.98    0.94
range-1         0.74    1.02    0.95    0.96
range-2         0.97    0.99    0.98    0.98
writes          0.81    0.98    0.92    0.93

This shows the relative QPS as: (QPS for 8032-240529) / (QPS for 5635-210407)
  • the max (1.24) occurs on points-covered-si.pre_range=100
  • the min (0.67) occurs on scan_range=100
  • the min for writes (0.79) occurs on insert_range=100
    • from vmstat metrics the CPU overhead (cpu/o) grows and the context switch rate (cs/o) drops from 5635-210407 to 8032-240529
8032-240529     min     max     avg     median
point-1         0.86    1.24    0.98    0.93
point-2         0.89    1.08    0.97    0.94
range-1         0.67    1.01    0.93    0.94
range-2         0.96    0.99    0.97    0.97
writes          0.79    0.96    0.90    0.91

This shows the relative QPS as: (QPS for 8032-240529-LTO) / (QPS for 5635-210407)
  • the max (1.26) occurs on points-covered-si.pre_range=100
  • the min (0.67) occurs on scan_range=100
  • the min for writes (0.80) occurs on insert_range=100
    • from vmstat metrics the CPU overhead (cpu/o) grows and the context switch rate (cs/o) drops from 5635-210407 to 8032-240529-LTO
8032-240529-LTO min     max     avg     median
point-1         0.79    1.26    0.99    0.93
point-2         0.91    1.10    0.98    0.94
range-1         0.67    1.02    0.94    0.95
range-2         0.97    1.01    0.99    1.00
writes          0.80    0.98    0.93    0.93

Results: InnoDB 5.6, 5.7 and 8.0

The numbers in the tables are the relative QPS as explained above.

The tables below show the relative QPS. The base case is InnoDB from MySQL 5.6.35. It is compared with InnoDB from 5.6.51, 5.7.10, 5.7.44, 8.0.11, 8.0.37 and 8.0.37 with link-time optimization.

This shows the relative QPS as: (QPS for InnoDB+MySQL 5.7.44) / (QPS for InnoDB+MySQL 5.6.35)
  • the max (3.16) occurs on update-index_range=100
  • the min (0.77) occurs on scan_range=100
5.7.44          min     max     avg     median
point-1         0.87    1.57    1.11    0.99
point-2         0.98    1.38    1.19    1.25
range-1         0.77    0.87    0.85    0.86
range-2         1.02    1.26    1.13    1.12
writes          1.07    3.16    1.43    1.25

This shows the relative QPS as: (QPS for InnoDB+MySQL 8.0.37) / (QPS for InnoDB+MySQL 5.6.35)
  • the max (2.05) occurs on update-index_range=100 which is much less than the result in 5.7.44
  • the min (0.65) occurs on scan_range=100
  • the median values here are much less than the values above for InnoDB 5.7.44 because there are significant performance regressions in 8.0
8.0.37          min     max     avg     median
point-1         0.74    1.26    0.91    0.84
point-2         0.78    1.12    0.96    0.99
range-1         0.65    0.75    0.72    0.72
range-2         0.88    1.08    0.97    0.96
writes          0.96    2.05    1.22    1.12

This shows the relative QPS as: (QPS for InnoDB 8.0.37+LTO) / (QPS for InnoDB 5.6.35)
  • the max (2.18) occurs on update-index_range=100 which is much less than the result in 5.7.44
  • the min (0.64) occurs on scan_range=100
  • LTO improves QPS by ~4%
8.0.37-LTO      min     max     avg     median
point-1         0.77    1.37    0.95    0.88
point-2         0.82    1.15    0.99    1.02
range-1         0.64    0.79    0.75    0.75
range-2         0.91    1.11    1.00    1.00
writes          0.99    2.18    1.27    1.17

Results: MyRocks vs InnoDB

The numbers in the tables are the relative QPS as explained above.

The base case is InnoDB from MySQL 8.0.37. The tables below show the relative QPS for MyRocks:
  • 5635-231016 - FB MyRocks 5.6.35 as of 23/10/16
  • 8032-240529 - FB MyRocks 8.0.32 as of 24/05/29
This shows the relative QPS as (QPS for FB MyRocks 5635-231016) / (QPS for InnoDB 8.0.37)
  • the max (1.06) occurs on update-index_range=100
  • the min (~0.50) occurs on many tests
  • the median is ~0.70 which means that MyRocks uses ~1.4X more CPU than InnoDB
5635-231016     min     max     avg     median
point-1         0.52    0.94    0.74    0.78
point-2         0.61    0.87    0.72    0.68
range-1         0.56    0.77    0.65    0.65
range-2         0.69    0.77    0.73    0.73
writes          0.55    1.06    0.71    0.64

This shows the relative QPS as: (QPS for FB MyRocks 8032-240529) / (QPS for InnoDB 8.0.37)
  • the max (1.02) occurs on update-index_range=100
  • the min (~0.50) occurs on many tests
  • the median is ~0.70 which means that MyRocks uses ~1.4X more CPU than InnoDB
  • the worst cases here are ~0.50 when using the performance CPU frequency governor. It was ~0.25 when using schedutil.
8032-240529     min     max     avg     median
point-1         0.51    0.84    0.71    0.78
point-2         0.61    0.85    0.73    0.74
range-1         0.50    0.77    0.64    0.63
range-2         0.70    0.80    0.75    0.75
writes          0.51    1.02    0.67    0.61

Friday, July 19, 2024

MySQL 8.0.38 vs cached Sysbench on a medium server

This has benchmark results for MySQL 8.0.38 and a few other 8.0 releases using Sysbench with a cached database on a medium server. By small, medium or large server I mean < 10 cores for small, 10 to 19 cores for medium, 20+ cores for large. Results from the Insert Benchmark in the same setup are here.

tl;dr

  • Performance for many range scans might be about 10% lower in 8.0.38 than in 8.0.26. The regression is much larger for the scan benchmark where the performance drop is about 22%. The regressions arrived in 8.0.30 and 8.0.31. Percona PS-8822 and MySQL 111538 are open for this.
  • Performance for writes is up to 9% slower in 8.0.38 vs 8.0.26 with one exception. Performance for the update-index microbenchmark is ~23% slower in 8.0.38. The update-index regression arrives in 8.0.30. I assume it is related to changes for the InnoDB redo log (see innodb_redo_log_capacity and the release notes). It is odd that the large regression is limited to the update-index microbenchmark, although this benchmark requires secondary index maintenance which might mean there is more stress on redo.
  • There is a huge improvement, almost 2X, for several microbenchmarks in the point-2 microbenchmark group. Bug 102037 was fixed in MySQL 8.0.31. I reported that bug against MySQL 8.0.22 thanks to sysbench.
Builds, configuration and hardware

I compiled from source MySQL versions 8.0.26 through 8.0.38.

The server is a c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU, 64G RAM and SMT disabled so there are 16 cores. It uses Ubuntu 22.04 and storage is ext4 (data=writeback) using SW RAID 0 over 2 locally attached NVMe devices.

The my.cnf file is here for MySQL 8.0.30+ and is here for 8.0.26 through 8.0.28.

Benchmark

I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by InnoDB.

The benchmark is run with 12 threads, 8 tables and 10M rows per table. Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

The command lines for my helper scripts were:
# 8 tables, 10M rows/table, 12 threads
bash r.sh 8 10000000 300 600 md0 1 1 12

Results

For the results below I split the 42 microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. The spreadsheet with all data is here. For each microbenchmark group there is a table with summary statistics. I don't have charts because that would use too much space, but the results per microbenchmark are in the spreadsheets.

The numbers in the spreadsheets and the tables below are the relative QPS which is (QPS for my version) / (QPS for base case). When the relative throughput is > 1 then that version is faster than the base case.

For all results below the base case is InnoDB from MySQL 8.0.26

Results: summary statistics

Each table has summary statistics per microbenchmark group. The numbers are the relative QPS for MySQL 8.0.38 which is (QPS for 8.0.38 / QPS for 8.0.26).

The results are mixed using the median values.
  • Performance for many range scans might be about 10% lower in 8.0.38 than in 8.0.26. The regression is much larger for the scan benchmark where the performance drop is about 22%. The regressions arrived in 8.0.30 and 8.0.31. Percona PS-8822 and MySQL 111538 are open for this. From bug comments and browsing the code, the root cause might be a change in how function inlining is done for InnoDB.
  • Performance for writes is up to 9% slower in 8.0.38 vs 8.0.26 with one exception. Performance for the update-index microbenchmark is ~23% slower in 8.0.38. The update-index regression arrives in 8.0.30. I assume it is related to changes for the InnoDB redo log (see innodb_redo_log_capacity and the release notes).
  • There is a huge improvement, almost 2X, for several microbenchmarks in the point-2 microbenchmark group. Bug 102037 was fixed in MySQL 8.0.31. I reported that bug against MySQL 8.0.22 thanks to sysbench.
  • The releases at which regressions occur are visible in the spreadsheet
8.0.38          min     max     avg     median
point-1         0.93    1.06    1.00    1.02
point-2         0.95    1.97    1.31    1.02
range-1         0.78    0.91    0.89    0.90
range-2         0.91    0.93    0.92    0.92
writes          0.77    0.96    0.92    0.94

Results: charts

The y-axis starts at 0.70 instead of 0 to improve readability. The charts can make it easier to see trends and to see when regressions or improvements arrive.

I suspect the regressions for point queries are related to PS-8822 and MySQL 111538, and the large improvement for the random-points microbenchmarks is from fixing bug 102037.
I suspect the regressions for range queries are also related to PS-8822 and MySQL 111538, and the worst-case regression occurs for the scan microbenchmark.

The regressions for most write microbenchmarks are <= 10% with one exception -- update-index. The update-index regression arrives in 8.0.30. I assume it is related to changes for the InnoDB redo log (see innodb_redo_log_capacity and the release notes).
Debugging the regression in update-index

The regression for update-index arrives in 8.0.30. 

From vmstat metrics I see:
  • more CPU per operation in 8.0.30
    • The cpu/o column is CPU /operation and it increases from .002615 in 8.0.28 to .002979 in 8.0.30. The cpu in cpu/o is derived from the sum of vmstat us and sy.
  • more context switches per operation in 8.0.30
    • The cs/o column is context switches /operation and it increases from 10.726 in 8.0.28 to 12.258 in 8.0.30. There appears to be more mutex contention in 8.0.30.
I repeated tests with MySQL 8.0.28 with a my.cnf changed to use the same number and size of redo log files as used by 8.0.30+. The goal was to determine whether more+smaller redo log files were the issue for the update-index regression. Alas, it was not, and 8.0.28 with that alternate my.cnf was still much better at update-index than 8.0.30+. With 8.0.30+ the my.cnf has innodb_redo_log_capacity =50G and MySQL uses 32 files which are each ~1.6G (50G / 32). To match that in the repeated tests with 8.0.28 I used:
innodb_log_files_in_group=32
innodb_log_file_size=1638M
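For comparison, the 8.0.30+ my.cnf gets the same total redo capacity from a single option. This is a sketch based on the value mentioned above rather than a copy of the linked my.cnf:
innodb_redo_log_capacity=50G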
I then repeated tests and used PMP to collect and aggregate thread stacks. One stack I see in 8.0.30 that doesn't occur in 8.0.28 is below and I confirmed that log_free_check_wait exists in 8.0.28. Alas, this thread stack also doesn't show up with 8.0.31 or 8.0.32.
 __GI___clock_nanosleep
__GI___nanosleep
std::this_thread::sleep_for<long,,wait_for<log_free_check_wait(log_t&)::<lambda(bool)>
log_free_check_wait
log_free_check,log_free_check
log_free_check,row_upd
row_upd_step
row_update_for_mysql_using_upd_graph
ha_innobase::update_row
handler::ha_update_row
Sql_cmd_update::update_single_table
Sql_cmd_dml::execute
mysql_execute_command
Prepared_statement::execute
Prepared_statement::execute_loop
mysqld_stmt_execute
dispatch_command
do_command
handle_connection
pfs_spawn_thread
start_thread
clone
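
The stacks above were collected and aggregated with PMP. A sketch of that style of collection follows, closely based on the well-known poor man's profiler one-liner; the process name and the simplified awk aggregation are assumptions:
gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p $(pidof mysqld) | \
awk 'BEGIN { s = "" }
     /^Thread/ { if (s != "") print s; s = "" }
     /^#/      { if (s != "") s = s ","; s = s $4 }
     END       { if (s != "") print s }' | \
sort | uniq -c | sort -rnk1 | head -20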
Debugging the regression in scan

From vmstat metrics for a range query I see that the CPU overhead increases ~5% from 8.0.28 to 8.0.30 while QPS drops ~5%. I don't have good vmstat data for the scan microbenchmark but in the past the problem has been more CPU. This looks like Percona PS-8822 and MySQL 111538 and I still have hope it will be fixed.

Thursday, July 18, 2024

Searching for regressions in RocksDB with db_bench

I used db_bench to check for performance regressions in RocksDB using leveled compaction and three different servers. A future post will have results for universal compaction. A recent report from me about db_bench on a large server is here.

tl;dr

  • if you use buffered IO for compaction (not O_DIRECT) then bug 12038 is an issue starting in RocksDB 8.6. The workaround is to set compaction_readahead_size to be <= max_sectors_kb for your storage device. 
  • there is a large regression to overwriteandwait that arrives in RocksDB 8.6 (see the previous bullet point) when using buffered IO for compaction. This is related to changes for compaction readahead and I need to do more debugging.
  • otherwise there are some big improvements and some small regressions
To learn more about max_sectors_kb and max_hw_sectors_kb start with this excellent doc from Jens Axboe and then read this blog post. Note that max_sectors_kb must be <= max_hw_sectors_kb and (AFAIK) max_hw_sectors_kb is read-only so you can't increase it.

I am working to determine whether the value for the storage device or RAID device takes precedence when SW RAID is used. From a few results, it looks like the value for the RAID device matters and the storage device's values are ignored (assuming the RAID device doesn't exceed what the storage devices support).

I also want to learn how max_hw_sectors_kb is set for a SW RAID device.

Hardware

I tested on three servers:
  • Small server
    • SER4 - Beelink SER 4700u (see here) with 8 cores and a Ryzen 7 4700u CPU, ext4 with data=writeback and 1 NVMe device. The storage device has 128 for max_hw_sectors_kb and max_sectors_kb.
  • Medium server
    • C2D - a c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU and 16 cores, XFS with data=writeback, SW RAID 0 and 4 NVMe devices. The RAID device has 512 for max_hw_sectors_kb and max_sectors_kb while the storage devices have max_hw_sectors_kb =2048 and max_sectors_kb =1280.
  • Big server
    • BIG - Intel CPU, 1-socket, 48 cores with HT enabled, enough RAM (vague on purpose), xfs with SW RAID 0 and 3 devices. The RAID device has 128 for max_hw_sectors_kb and max_sectors_kb while the storage devices have max_hw_sectors_kb =2048 and max_sectors_kb =1280.
The small and medium server use Ubuntu 22.04 with ext4. AMD SMT and Intel HT are disabled.

Builds

I compiled db_bench from source on all servers. On the small and medium servers I used versions 6.0.2, 6.10.4, 6.20.4, 6.29.5, 7.0.4, 7.3.2, 7.6.0, 7.10.2, 8.0.0, 8.1.1, 8.2.1, 8.3.3, 8.4.4, 8.5.4, 8.6.7, 8.7.3, 8.8.1, 8.9.2, 8.10.2, 8.11.4, 9.0.1, 9.1.2, 9.2.2, 9.3.1.

On the large server I used versions 8.1.1, 8.2.1, 8.3.3, 8.4.4, 8.5.4, 8.6.7, 8.7.3, 8.8.1, 8.9.2, 8.10.2, 8.11.4, 9.0.1, 9.1.2, 9.2.2, 9.3.1.
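
Roughly, building db_bench from a release tag looks like the following. This is a sketch, not necessarily the exact steps or flags used for these tests, and the tag name is an example:
git clone https://github.com/facebook/rocksdb.git
cd rocksdb
git checkout v8.5.4
# DEBUG_LEVEL=0 builds an optimized binary
make DEBUG_LEVEL=0 db_bench -j$(nproc)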

Benchmark

Everything used leveled compaction, the LRU block cache and the default value for compaction_readahead_size. Soon I will switch to using the hyper clock cache.

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 24 threads. How I do benchmarks for RocksDB is explained here and here. The command lines to run them are: 
# Small server, SER4: use 1 thread, 20M KV pairs for cached, 400M for IO-bound
bash x3.sh 1 no 3600 c8r32 20000000 400000000 byrx iobuf iodir

# Medium server, C2D: use 8 threads, 40M KV pairs for cached, 2B for IO-bound 
bash x3.sh 8 no 3600 c16r64 40000000 2000000000 byrx iobuf iodir

# Big server, BIG: use 16 threads, 40M KV pairs for cached, 800M for IO-bound
bash x3.sh 16 no 3600 c16r64 40000000 800000000 byrx iobuf iodir

I should have used a value larger than 800M for IO-bound on the BIG server, but the results are still IO-bound with 800M. 

Workloads

There are three workloads:

  • byrx - the database is cached by RocksDB
  • iobuf - the database is larger than memory and RocksDB uses buffered IO
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT
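
The difference between iobuf and iodir comes down to the direct IO flags passed to db_bench. A minimal sketch, not the full commands used by the benchmark scripts:
# iobuf: buffered IO for user reads and for flush/compaction
./db_bench --benchmarks=overwrite --use_direct_reads=false \
  --use_direct_io_for_flush_and_compaction=false
# iodir: O_DIRECT for user reads and for flush/compaction
./db_bench --benchmarks=overwrite --use_direct_reads=true \
  --use_direct_io_for_flush_and_compaction=true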

A spreadsheet with all results is here and performance summaries linked below for byrx, iobuf and iodir.
Relative QPS

The numbers in the spreadsheet and on the y-axis in the charts that follow are the relative QPS which is (QPS for $me) / (QPS for $base). When the value is greater than 1.0 then $me is faster than $base. When it is less than 1.0 then $base is faster (perf regression!).

Results: byrx (cached)

The base case is RocksDB 6.0.2 for the small and medium servers. It is 8.1.1 for the big server. The numbers in y-axis are the relative QPS.

Summary
  • Small server
    • most benchmarks are ~10% slower in 9.3 vs 6.0
    • overwrite(andwait) might be ~20% slower but the comparison isn't fair because there is no andwait (--benchmarks=waitforcompaction) done for old versions of RocksDB including 6.0.
    • for overwriteandwait the relative QPS drops from 0.91 to 0.83 (~8%) between 8.5.4 and 8.6.7. This might be related to changes in compaction readahead that arrive in 8.6.
  • Medium server
    • most benchmarks are 5% to 10% slower in 9.3 vs 6.0
    • overwrite(andwait) is faster in 9.3 but see the disclaimer in the Small server section above
    • fillseq is slower after 6.0 but after repeating benchmarks for 6.0, 6.1, ..., 6.10 there is much variance and in this case the result for 6.0.2 was a best-case result that most versions might get, but usually don't
  • Big server
    • most benchmarks are ~4% slower in 9.3 vs 8.1
    • overwriteandwait is ~2% faster in 9.3

Results: iobuf (IO-bound, buffered IO)

The base case is RocksDB 6.0.2 for the small and medium servers. It is 8.1.1 for the big server. The numbers in y-axis are the relative QPS.

Summary
  • Small server
    • the *whilewriting tests are 1.2X to 1.4X faster in 9.3 than 6.0
    • the overwriteandwait test is much slower in 9.3 and the regression occurs in 8.6.7. This is from changes in compaction readahead. The default value for the --compaction_readahead_size flag in 8.6.7 is 0 (a bug) that was fixed to be 2M in 8.7.3, but the results don't improve in 8.7.3.
  • Medium server
    • most tests have no regressions from 6.0 to 9.3
    • overwriteandwait is ~1.5X faster in 9.3 and the big improvements arrived late in 7.x
    • fillseq was ~9% slower in 9.3 and I will try to reproduce that
  • Big server
    • most tests have no regressions from 6.0 to 9.3
    • overwriteandwait is much slower in 9.3 and the regression arrived in 8.6.7, probably from changes to compaction readahead
Results: iodir (IO-bound, O_DIRECT)

The base case is RocksDB 6.0.2 for the small and medium servers. It is 8.1.1 for the big server. The numbers in y-axis are the relative QPS.

Summary
  • Small server
    • fillseq is ~11% slower in 9.3 vs 6.0 and the regression arrived late in 6.x
    • overwriteandwait is ~6% slower in 9.3 vs 6.0 and the regression might have arrived late in 6.x
    • the *whilewriting tests are 1.2X to 1.4X faster in 9.3 vs 6.0 and the improvements arrived around 6.20
  • Medium server
    • most tests have similar performance between 9.3 and 6.0
    • overwriteandwait is ~2X faster and the improvements arrived in 7.10 and 8.7
  • Big server
    • most tests have similar performance between 9.3 and 6.0
    • fillseq and overwriteandwait are ~5% faster in 9.3


Debugging overwriteandwait performance

For the cases where there is a regression for overwriteandwait, the summaries for the small and the big servers show large changes for several metrics. These changes are less obvious on the medium server.
  • ops_sec - operations/second (obviously)
  • c_wsecs - this is compaction wall clock time and c_csecs is compaction CPU time. The c_wsecs value increases slightly even though the total amount of writes has decreased (because ops_secs decreased). Also the ratio (c_csecs / c_wsecs) decreases on the small and big servers. There is more time spent waiting for IO during compaction (again, probably related to changes in how compaction readahead is done).
  • stall% - this is the percentage of time there are write stalls. It increases after version 8.5 which is expected. Compaction becomes less efficient after 8.5 so there are more compaction stalls.
Values from iostat from the small server for overwriteandwait with iobuf show a big problem:
  • rps (reads/s or r/s) increases after 8.5 (from ~1500 to ~18000)
  • rmbps (read MB/s) decreases after 8.5 (from ~140 to ~85)
  • rareqsz (read request size) decreases after 8.5 (from ~100 to ~4)
c       rps     rmbps   rrqmps  rawait  rareqsz wps     wmbps   wrqmps  wawait  wareqsz
3716    1486    143.9   0.00    0.44    100.5   4882    585.8   10.03   0.43    122.7   -> 8.0
3717    1500    145.7   0.00    0.44    100.5   4905    586.9   9.96    0.44    122.3   -> 8.1
3716    1415    143.0   0.00    0.45    103.6   4912    589.2   9.41    0.45    122.6   -> 8.2
3717    1486    144.5   0.00    0.45    101.2   4931    590.9   9.10    0.43    122.5   -> 8.3
3717    1474    141.8   0.00    0.44    100.1   4884    584.9   9.51    0.44    122.4   -> 8.4
3716    1489    144.5   0.00    0.44    100.5   4934    591.1   9.94    0.44    122.5   -> 8.5
4098    18854   80.2    0.00    0.07    4.3     2935    351.3   6.99    0.35    122.3   -> 8.6
4061    18543   84.0    0.00    0.06    4.6     3074    367.7   7.34    0.36    122.4   -> 8.7
4079    18527   84.0    0.00    0.06    4.6     3053    365.4   7.23    0.36    122.5   -> 8.8
4064    18328   83.0    0.00    0.07    4.6     3036    363.3   7.18    0.36    122.4   -> 8.9
4062    18386   83.3    0.00    0.06    4.6     3054    365.8   7.37    0.35    122.6   -> 8.10
4029    17998   81.6    0.00    0.07    4.6     2980    356.8   7.29    0.35    122.5   -> 8.11
4022    18429   83.5    0.00    0.07    4.6     3045    364.6   7.08    0.35    122.6   -> 9.0
4028    18483   83.7    0.00    0.07    4.6     3052    365.5   7.35    0.35    122.6   -> 9.1
4020    17761   80.5    0.00    0.07    4.6     2947    352.6   7.27    0.35    122.5   -> 9.2
4011    16394   74.3    0.00    0.09    4.6     2713    325.1   6.31    0.33    122.6   -> 9.3
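
For reference, these columns correspond to iostat's extended device statistics (rps=r/s, rmbps=rMB/s, rrqmps=rrqm/s, rawait=r_await, rareqsz=rareq-sz, and the matching w* columns). A minimal sketch of collecting them, with an example device name:
iostat -xm 1 nvme0n1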

Values from iostat on the medium server for overwriteandwait with iobuf
  • the results look much better except for the glitch in 8.6 where db_bench had a bad default value for compaction_readahead_size
  • there are changes in 8.7 but they are much smaller here than above for the small server
    • rrqmps is larger, this is read requests merged/s
    • rps is smaller, this is r/s
    • rareqsz is larger, this is read request size
    • rawait is smaller, this is read latency
c       rps     rmbps   rrqmps  rawait  rareqsz wps     wmbps   wrqmps  wawait  wareqsz
3688    370     59.8    4.48    0.68    174.3   936     199.7   7.31    27.80   218.8   -> 8.0
3689    345     59.3    5.40    0.72    184.9   937     199.5   4.57    27.58   218.7   -> 8.1
3689    387     64.1    4.98    0.75    179.5   1013    215.5   3.35    23.90   218.4   -> 8.2
3679    362     58.4    4.58    0.73    175.1   920     196.7   4.63    29.25   219.6   -> 8.3
3690    375     61.5    5.01    0.76    178.6   976     208.0   3.54    25.11   218.6   -> 8.4
3689    361     58.5    4.78    0.76    176.6   930     198.2   3.49    27.58   218.6   -> 8.5
4168    5256    22.6    0.00    0.18    4.4     377     80.7    1.06    6.54    219.3   -> 8.6
3682    311     54.2    13.98   0.61    198.8   869     185.7   2.17    28.49   219.8   -> 8.7
3683    316     59.9    16.91   0.62    211.2   953     203.2   6.64    25.81   219.1   -> 8.8
3682    327     56.9    14.12   0.63    195.9   905     193.6   2.16    28.65   220.1   -> 8.9
3678    290     53.9    15.01   0.59    203.6   865     184.1   3.67    34.64   219.0   -> 8.10
3681    284     51.8    15.33   0.60    209.0   828     177.2   2.08    37.94   220.4   -> 8.11
3680    304     52.7    13.14   0.59    194.5   843     179.4   6.80    35.83   219.1   -> 9.0
3686    319     58.2    16.08   0.65    206.8   923     197.6   1.64    26.88   220.2   -> 9.1
3682    320     56.4    13.77   0.60    199.8   900     191.8   4.38    28.66   219.0   -> 9.2
3685    300     56.0    15.06   0.61    207.3   906     191.7   5.56    28.58   217.8   -> 9.3

And then I remembered that I saw this before: starting with 8.6 the value for compaction_readahead_size should be <= the max_sectors_kb value for the underlying storage device(s).
  • see RocksDB issue 12038
  • when SW RAID 0 is used I am not sure whether the value that matters is from the storage devices or the RAID device
Then I repeated the benchmark on the small and medium servers with compaction_readahead_size (CRS) set to 96k, 512k and 1M.
  • small server
    • the QPS from overwriteandwait was ~125k/s prior to 8.6 and then drops to ~85k/s in 8.6 and more recent versions. Results are better when compaction_readahead_size is decreased: it is ~123k/s with CRS =96k, ~104k/s with CRS =512k and ~88k/s with CRS =1M.
  • medium server
    • the QPS from overwriteandwait was ~150k/s prior to 8.6 and ~145k/s in 8.6 and more recent releases. Results are better when compaction_readahead_size is decreased: it is ~172k/s with CRS =96k, ~187k/s with CRS =512k and ~161k/s with CRS =1M.
  • big server
    • tests are in progress