This explains my attempt to improve the my.cnf settings that I use for InnoDB with the insert benchmark. Here I use the updated insert benchmark with deletes per insert to avoid growing tables. For MySQL 5.6, 5.7 & 8.0 I tested 3, 4 & 11 changes to the my.cnf options I have been using.
tl;dr
- In some benchmarks setting innodb_use_native_aio=OFF improves query while hurting insert performance. I assume the problem is that IO queues get overloaded with aio=ON. Too many concurrent IO requests is OK for background activity (page writeback) but bad for foreground activity (queries). In some cases I would accept this trade off and use aio=OFF but I hope InnoDB adds a limit to concurrent IO requests with aio=ON.
- Other than the a1 config (innodb_use_native_aio=OFF) none of the configs consistently help performance. The base config I have been using is good enough for now.
- While I don't focus on comparisons to other DBMS here, MySQL QPS on the read+write benchmark steps suffers a lot (and too much) from bug 109595.
- There were several long (~100 second) write stalls at the start of some benchmark steps. At last I can explain these, see the Updates section.
Updates
The following explains the ~100 second write stalls with InnoDB. I assume that Postgres and MyRocks also have this problem, but need to confirm that. The problem is:
- tables have a PK on trxid column
- background process intermittently deletes N rows at a time with smallest trxid value
- at start of q100, q500, q1000 benchmark step inserts and deletes will not be done until this query finishes: select min(trxid) from $table
- if InnoDB purge gets behind then that query encounters and skips a large number of delete-marked rows before finding the first live one. And in the tests here that can take ~100 seconds.
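The effect of purge lag on that query can be sketched with a toy model. This is not InnoDB code: the index is modeled as a sorted list of (trxid, delete_marked) pairs, and the names find_min_live and build_index are made up for illustration. It only shows that the work done by select min(trxid) grows with the number of delete-marked rows that purge has not yet removed.

```python
def find_min_live(index_entries):
    """Scan PK-ordered (trxid, delete_marked) pairs from the left.

    Returns (smallest live trxid or None, number of delete-marked entries skipped).
    """
    skipped = 0
    for trxid, delete_marked in index_entries:
        if not delete_marked:
            return trxid, skipped
        skipped += 1
    return None, skipped


def build_index(total_rows, purge_lag):
    """Toy index: the purge_lag smallest trxids are deleted but not yet purged."""
    return [(trxid, trxid < purge_lag) for trxid in range(total_rows)]


if __name__ == "__main__":
    # Hypothetical purge lags, chosen only to show the trend.
    for lag in (0, 1_000, 1_000_000):
        min_trxid, skipped = find_min_live(build_index(lag + 10, lag))
        print(f"purge lag={lag:>9,} rows: min(trxid)={min_trxid}, skipped={skipped:,}")
```

In the real server each skipped entry can also mean a page read, which is why the stall can reach ~100 seconds when purge is far behind.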
Benchmark
The insert benchmark was run in three configurations.
- cached by InnoDB - InnoDB buffer pool caches all tables
- cached by OS - the 4G InnoDB buffer pool does not cache all tables but the OS page cache does
- IO-bound - the database is larger than memory
I used the rel build for MySQL 5.6.51 and the rel_lto builds for MySQL 5.7.40 and 8.0.33. The builds are explained here. In all cases I build from source using upstream MySQL.
The test HW is a c2-standard-30 server from GCP with 15 cores, hyperthreads disabled, 120G RAM and 1.5TB of XFS via SW RAID 0 striped over four local NVMe devices. The OS is Ubuntu 22.04.
The benchmark is run with 8 clients and a client per table. For cached by InnoDB/OS the l.i0 step inserts 20M rows/table and for IO-bound it inserts 400M rows/table. The l.i1 step is run with --delete_per_insert and does 50M inserts/deletes per table. The read+write steps (q100, q500, q1000) also run with --delete_per_insert for 1800 seconds.
Configurations
The benchmark was run first using the base configurations. Then it was repeated for other configurations that changed or added a small number of options. I use a1 through a11 to name them here but the full name for a configuration file has the pattern my.cnf.cy10X_gcp_c2s30 or my.cnf.cy10X_4g_gcp_c2s30 where X has the value a1 ... a11. All of the config files are here for 5.6.51, 5.7.40 and 8.0.33.
For cached by InnoDB and IO-bound the configurations have an 80G InnoDB buffer pool and don't use buffered IO (innodb_flush_method = O_DIRECT_NO_FSYNC), with one exception -- the a4 config uses innodb_flush_method = fsync.
For cached by OS the configurations have a 4G InnoDB buffer pool and use buffered IO (innodb_flush_method = fsync). 
The base configurations are here:
The configurations tested are:
- a1 - adds innodb_use_native_aio=off
- a2 - increases innodb_io_capacity[_max] to 20k/40k
- a3 - adds innodb_flush_sync=OFF
- a4 - adds innodb_flush_method=fsync
- a5 - adds innodb_change_buffering=none
- a6 - adds innodb_max_dirty_pages_pct_lwm=40, innodb_max_dirty_pages_pct=50
- a7 - adds innodb_lru_scan_depth=2048
- a8 - adds innodb_lru_scan_depth=4096
- a9 - adds innodb_lru_scan_depth=8192
- a10 - adds innodb_purge_threads=8, innodb_page_cleaners=8
- a11 - reduces innodb_redo_log_capacity to 16G
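To make the list above easier to scan, here is a small sketch that maps each config name to the options it changes and prints them as a [mysqld] fragment. The option names and values come from the list. Reading 20k/40k as 20000/40000 is my assumption, and so is the idea of rendering the changes as a fragment: the real files are complete my.cnf files. CONFIG_OVERRIDES and render_overrides are names made up for this example.

```python
# Options changed by each config relative to the base config.
CONFIG_OVERRIDES = {
    "a1": {"innodb_use_native_aio": "OFF"},
    # Assumption: 20k/40k is read as 20000/40000.
    "a2": {"innodb_io_capacity": "20000", "innodb_io_capacity_max": "40000"},
    "a3": {"innodb_flush_sync": "OFF"},
    "a4": {"innodb_flush_method": "fsync"},
    "a5": {"innodb_change_buffering": "none"},
    "a6": {"innodb_max_dirty_pages_pct_lwm": "40", "innodb_max_dirty_pages_pct": "50"},
    "a7": {"innodb_lru_scan_depth": "2048"},
    "a8": {"innodb_lru_scan_depth": "4096"},
    "a9": {"innodb_lru_scan_depth": "8192"},
    "a10": {"innodb_purge_threads": "8", "innodb_page_cleaners": "8"},
    "a11": {"innodb_redo_log_capacity": "16G"},
}


def render_overrides(name):
    """Return the options changed by one config as a my.cnf fragment."""
    lines = [f"# changes made by the {name} config", "[mysqld]"]
    lines += [f"{opt}={val}" for opt, val in CONFIG_OVERRIDES[name].items()]
    return "\n".join(lines)


if __name__ == "__main__":
    for name in CONFIG_OVERRIDES:
        print(render_overrides(name))
        print()
```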
All of the configurations were tested for MySQL 8.0.33. Only the first three or four (a1 to a4) were tested for 5.6.51 and 5.7.40.
For each of the workloads below I analyze the performance in four parts:
- average throughput - I look at the average queries and inserts per second to see if these are much better or worse than the base config
- response time - I look at max response time and the response time distributions to see if a config makes them much better or much worse than the base config
- graphs - there are graphs for the per-second query, insert and delete rates. I look at the graphs to see if the rates degrade over time or have too many stalls.
- SLA - there is a target rate for the background inserts and deletes which is one of 100, 500 or 1000 per second per client. There is an SLA failure when the actual rate is too much smaller than the target.
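The SLA check can be sketched as below. The text above only says that the actual rate must not be too much smaller than the target, so the 5% tolerance is an assumption I picked for illustration, not the threshold the benchmark uses. The measured rates in the example are hypothetical as well.

```python
def sla_failed(actual_rate, target_rate, tolerance=0.05):
    """True when the measured rate falls too far below the target.

    tolerance is an assumed value, not the benchmark's real threshold.
    """
    return actual_rate < target_rate * (1.0 - tolerance)


if __name__ == "__main__":
    # (target per client, hypothetical measured rate per client)
    for target, actual in ((100, 100.0), (500, 492.0), (1000, 700.0)):
        status = "FAIL" if sla_failed(actual, target) else "ok"
        print(f"target={target:>4}/s actual={actual:>6.1f}/s -> {status}")
```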
Performance Problems
QPS on the read+write benchmark steps is worse than it should be courtesy of bug 109595. The optimizer is doing extra reads from the PK index to determine selectivity. There is no workaround and it is unfortunate that this bug exists because the queries hint the index to be used, so the optimizer should not need this info. It should just trust my hint.
The insert rate on the l.i1 benchmark steps is lower than I expect for the cached by OS workloads. By this I mean that inserts/s for l.i1 is much smaller than I expect relative to l.i0. I am not sure this needs to be explained because while the l.i1/l.i0 ratio for cached by OS is much worse with InnoDB than with Postgres, the ratios for InnoDB and Postgres are similar for the IO-bound workloads. 
Legend:
* IPS is average for inserts/second
* Numbers are (IPS for l.i1 / IPS for l.i0)
MySQL 5.6.51
  by InnoDB     82051 / 430108 = .190
  by OS         14388 / 422164 = .034
  IO-bound      17252 / 432199 = .039
MySQL 5.7.40
  by InnoDB     137368 / 812183 = .169
  by OS          14443 / 784314 = .018
  IO-bound       17433 / 844327 = .020
MySQL 8.0.33
  by InnoDB     129686 / 686695 = .188
  by OS          13098 / 675105 = .019
  IO-bound       18022 / 709534 = .025
Postgres 15.3
  by Postgres   249299 / 1000000 = .249
  by OS         110345 / 1012658 = .108
  IO-bound       20317 / 1104972 = .018
For several of the benchmark results below there is a ~100 second write stall at the start of a benchmark step. See the Updates section above for an explanation. 
Results: cached by InnoDB
Reports are here for cached by InnoDB, cached by OS and IO-bound.
Summary:
- average throughput
- For 5.6.51 the a1, a2, a4 configs have a small impact, both good and bad
- For 5.7.40 the a1 config reduces QPS on q100, q500 and q1000 by 10% to 20% and the a4 config reduces throughput by 1% to 4% for l.i0, l.i1, q100, q500 and q1000. The a2 and a3 configs have a small impact, both good and bad.
- For 8.0.33 the a1 config is bad for all benchmark steps. The a2 and a3 configs have a small impact, both good and bad. The a4 config hurts perf. The remaining configs (a5 to a11) generally helped QPS on q100 but hurt perf on all other benchmark steps. And a10 was bad for l.i1.
- response time
- distributions are similar for 5.6.51 and 5.7.40. For 8.0.33 they are similar except that the a10 config is bad for l.i0 and l.i1.
- graphs
- for 5.6.51 and 5.7.40 nothing interesting beyond the write stalls
- for 8.0.33 the a10 config has an odd graph for the l.i1 insert rate
- SLA
- the a1 config failed for q100 with 5.7.40. It didn't fail for q500 or q1000. This is odd. From the metrics the disk reads per query metric (rpq) is 2X the other configs. This benchmark step runs immediately after create index and the InnoDB buffer pool has to warm up after create index. Perhaps something wiped the OS page cache. I am happy to ignore this for now. From the graphs there is a write stall for ~100 seconds at the start of the benchmark step.
Results: cached by OS
Reports are here for cached by InnoDB, cached by OS and IO-bound.
Summary:
- average throughput
- For 5.6.51 the a1 and a2 configs have a small impact, both good and bad
- For 5.7.40 the a1 config is great for perf. Note that the base config here combines AIO with buffered IO, while for cached by InnoDB and IO-bound it combines AIO with O_DIRECT. Also note that the read IO done here is from the OS page cache. The a2 and a3 configs hurt perf.
- For 8.0.33 the a1 config is again great for perf. But see the disclaimers in the previous point. The a2, a3, a7, a8 and a11 configs have a small impact, both good and bad. The a5 config is lousy for perf. The a6 config boosts QPS for a small cost to the insert rates. The a9 config might be good, although it is odd that a7 and a8 were not because all increase lru_scan_depth. The a10 config hurts perf.
- response time
- distributions are similar for 5.6.51 and 5.7.40.
- For 8.0.33 the a5 config is bad for writes and the a9 config is bad for worst-case response time.
- graphs
- not much to see. For 5.7.40 and q1000 the a3 config does much better with the background inserts than the other configs.
- SLA
- the a5 config failed for q500 and q1000 with 8.0.33. The a5 config disables the InnoDB change buffer and because of that the KB written to disk per insert (see wkbpi) is 2X to 4X larger vs the other configs. This is one example of the benefit of the change buffer. Alas, the change buffer has significant costs from mutex contention that I will document later.
Results: IO-bound
Reports are here for cached by InnoDB, cached by OS and IO-bound.
Summary:
- average throughput
- For 5.6.51 the a1 config has a small impact, good and bad. The a2 config has a small impact, mostly bad. The a4 config is lousy for perf.
- For 5.7.40 the a1 config has a medium impact (bad for inserts, good for QPS). The a2 and a3 configs have a small impact, good and bad. The a4 config is lousy for perf.
- For 8.0.33 the a1 config has a medium impact (bad for inserts, good for QPS). The a2 and a7 configs have a small impact, good and bad. The a3 config has a small impact, mostly bad. The a4, a5, a6, a10 and a11 configs are bad for perf. The a8 and a9 configs have a small impact, mostly good.
- response time
- For 5.6.51 the a4 config is bad. All of the configs have 4 second write stalls with l.i1.
- For 5.7.40 the a4 config is bad.
- For 8.0.33 with l.i1 the a11 config has a 2 second write stall and the a4, a5 and a10 configs are bad. The a4, a5 and a10 configs are also bad for q1000.
- graphs
- for 5.6.51 the a4 config has a write stall and QPS variance with q1000
- for 5.7.40 the a4 config has much QPS variance with q1000
- SLA
- the a4 config failed for q100 and q1000 with 5.6.51. For q100 I see that KB read per query (rkbpq) is almost 4X larger with a1, but reads per query (rpq) is the same as other configs. I don't understand this. For the a4 config there is a similar problem with the rkbpq metric. The metrics are here. For q1000 there is a ~100 second write stall at the start of the benchmark step (see here). While a4 didn't fail for q500, the graph shows a similar write stall and too much variance.
- the a4 config failed for q100 with 5.7.40. Again, there is a 100-second write stall at the start of the benchmark step.
- the a5 config failed for q1000 with 8.0.33. From the metrics the reads per query (rpq) and KB written per insert (wkbpi) are much larger for the a5 config.