Friday, July 7, 2023

innodb_page_cleaners, part 2: big server, database not cached

This is part 2 of the Fun with innodb_page_cleaners series that documents my attempts to reduce write stalls with InnoDB during the insert+delete benchmark on a big server. The first post is here.

tl;dr

  • While I want to find a my.cnf that prevents the write stalls, I suspect I won't find one and the stalls can't be avoided without enhancing upstream InnoDB.
  • InnoDB has ~10 second write stalls under stress.
  • The best results are from the ua18 config, which sets both innodb_page_cleaners and innodb_buffer_pool_instances to 4. The defaults for them are 4 and 8 respectively, and I suspect it is better to set them to the same value.
  • From the comments by JFG on my previous post you will learn that this problem has been known for some time and that Percona Server might have the improvements I seek.

Benchmarks

The insert benchmark was run using the cached by OS setup: the buffer pool is 4G, so InnoDB doesn't cache all of the data but the OS page cache does. This simulates a setup with extra fast storage (reading from the OS page cache is fast) and high pressure on the buffer pool (the working set is much larger than it).
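
A minimal my.cnf sketch of that setup is below. The 4G buffer pool comes from the post; the use of buffered I/O (not O_DIRECT) is my assumption about how the data ends up in the OS page cache, and the real settings are in the linked my.cnf files.

  [mysqld]
  # small buffer pool so the working set does not fit in InnoDB
  innodb_buffer_pool_size = 4G
  # buffered I/O (assumed) so reads can be served from the OS page cache
  innodb_flush_method = fsync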

The benchmark is explained here. Only the first 3 benchmark steps were run (l.i0, l.x, l.i1). The benchmark is run with 24 clients and a client per table. The l.i0 step inserts 20M rows/table. The l.i1 step is run with --delete_per_insert and does 5M inserts/deletes per table. The read+write steps (q100, q500, q1000) were skipped.

I used the rel_native_lto build for MySQL 8.0.33. The builds are explained here. In all cases I build from source using upstream MySQL. The my.cnf files are here.

The server has 80 cores with hyperthreads enabled, 256G of RAM and fast NVMe storage. 

Results

The benchmark report is here. Some parts of it are broken because I skipped the read+write benchmark steps. The start of the report explains each of the my.cnf variations used. The my.cnf files are here.

I assume the stalls are mostly from a lack of free pages because the page cleaners can't keep up. Example stack traces for that from a small server (4 clients, 8 cores) are here. They show all (or most) user sessions stuck doing single-page flushing.
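
One way to watch for that while the benchmark runs (a suggestion, not something from the benchmark scripts) is to poll the per-instance buffer pool stats and look for free_buffers staying near zero:

  -- free_buffers near zero per instance is consistent with user sessions
  -- being forced into single-page flushing
  SELECT pool_id, free_buffers, database_pages, pending_flush_lru
  FROM information_schema.innodb_buffer_pool_stats;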

I focus on performance for the l.i1 benchmark step, which is where the 10+ second write stalls occur. From the summary, the ua17 my.cnf gets the best average throughput for l.i1. But note that ua17 disables the InnoDB doublewrite buffer, which is not safe for production; I tried it only to understand its impact.

From the response time histograms, the max response time for all of the configurations is ~10 seconds. However, the distribution for the ua18 config looks better than the others when also considering the fraction of responses that take more than 1 second.

Finally, look at the charts that show some values measured over 1-second intervals -- inserts/second, max insert response time, deletes/second, max delete response time. While there are 24 clients, these charts are only from the first client. I have two disclaimers. First, by chance, performance for the first client tends to be worse than for the others. Second, the insert rate (IPS) has a spike near the end of the test because the other clients finished earlier so the first client had the DBMS to itself.

  • Most of the insert response time charts look like this, with frequent spikes up to 10 seconds.
  • The chart for the ua10 config looks a bit better. That config increases innodb_page_cleaners from 4 (the default) to 8. I think the defaults for innodb_page_cleaners and innodb_buffer_pool_instances should be the same. Alas, they are not today.
  • The chart for the ua18 config is the best. It has innodb_page_cleaners=4 and innodb_buffer_pool_instances=4 (see the sketch after this list). While it has some spikes up to 10 seconds at the start, it then reaches a phase where the spikes don't exceed 1 second.
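
The relevant part of an ua18-style config looks roughly like this. Only the two options named above come from the post; everything else is in the linked my.cnf files.

  [mysqld]
  # match the number of page cleaners to the number of buffer pool instances
  # (the defaults are 4 and 8 respectively)
  innodb_page_cleaners = 4
  innodb_buffer_pool_instances = 4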

Details from innodb_metrics

I repeated the benchmark with innodb_monitor_enable = module_buffer set in my.cnf to enable more counters that can be read via IS.innodb_metrics. All of the buffer_LRU counters for each my.cnf are here.
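
For reference, the same counters can also be enabled at runtime and read with a query like the following. This is a generic illustration, not the exact commands used by the benchmark scripts.

  -- enable the extra buffer module counters (the post sets this in my.cnf)
  SET GLOBAL innodb_monitor_enable = 'module_buffer';

  -- read the LRU-related counters
  SELECT name, count
  FROM information_schema.innodb_metrics
  WHERE name LIKE 'buffer_LRU%' AND status = 'enabled';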

The buffer_LRU_single_flush counters are here. It isn't clear which counter matters most. The ua18 config has the smallest value for buffer_LRU_single_flush_num_scan. The ua9, ua10 and ua18 configs have the largest value for buffer_LRU_single_flush_scanned_per_call.

Limiting the metrics to buffer_LRU_batch_flush, I see variation in the value of buffer_LRU_batch_flush_avg_time_slot. It is <= 138 for u, ua10, ua17, ua18 and ua19 but much larger for ua7, ua8 and ua9. That is reasonable given that ua7, ua8 and ua9 use a larger value for innodb_lru_scan_depth.
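
A query like this (again, just an illustration) pulls only those counters for a quick comparison:

  SELECT name, count, avg_count
  FROM information_schema.innodb_metrics
  WHERE name LIKE 'buffer_LRU_batch_flush%';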

2 comments:

  1. You’ve mentioned Percona two or three times between these posts as possibly having the fixes you seek. But I don’t see Percona in your report. Are you planning on benching with Percona? If not, how come?

    Replies
    1. The primary reason I have yet to try Percona is a lack of time. I am already spending too many hours doing this for many releases of MyRocks, upstream InnoDB and Postgres. Explaining, debugging & fixing the odd results I see consumes much time.

      On the Postgres side, I have yet to try OrioleDB.

      And for all DBMS, I am sure there is one more config option that I should try, but again, I have to fit that into my time budget and my HW budget.
