Thursday, July 6, 2023

Fun with innodb_page_cleaners

I have been running the insert+delete benchmark for Postgres, InnoDB and MyRocks and then trying to improve the configurations I use for each DBMS. With InnoDB a common problem is write stalls when user sessions get stuck doing single-page flushing. 

This post is truthy rather than true. I have begun reading the InnoDB source but my expertise isn't what it used to be and I am still figuring things out during my latest round of benchmarking.

The single-page flushing problem occurs when background threads cannot keep enough clean pages at the tail of the LRU. Any user session that needs a clean page might have to do write back, including writes to the doublewrite buffer, to find a clean page that can be evicted and reused. From PMP stacks I frequently see most of the page cleaner threads stuck on writes and then most of the user sessions all stuck doing single-page flush writes -- and the extra work from the doublewrite buffer just makes this worse.
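
One way to check whether the problem is happening is to watch the free-page wait counter and the single-page flush counters. This is a sketch that assumes MySQL 8.0; the exact INNODB_METRICS counter names vary by version, and the buffer counters are not enabled by default (which is part of my first complaint below).

  -- Counts waits for a free page; a steadily increasing value suggests
  -- the page cleaners are not keeping up
  SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_wait_free';

  -- Enable the buffer counters, then look at the single-page flush counters
  SET GLOBAL innodb_monitor_enable = 'module_buffer';
  SELECT name, count
  FROM information_schema.innodb_metrics
  WHERE name LIKE 'buffer_LRU_single_flush%';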

Upstream has good docs on configuring the buffer pool and Percona has many great posts including here and here. But I have yet to discover how to avoid single-page flushes via my.cnf tuning.
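
For reference, here is a my.cnf fragment with the settings discussed in this post. This is a sketch rather than a recommendation -- the values shown are ones I have been experimenting with, and the right values depend on the workload and storage.

  # Sketch only -- illustrative values, not tuning advice
  innodb_buffer_pool_instances = 8   # the default; not dynamic
  innodb_page_cleaners = 8           # default is 4; set equal to buffer pool instances; not dynamic
  innodb_lru_scan_depth = 1024       # the default; I have also tried 2048 and 4096
  innodb_max_dirty_pages_pct = 50    # default is 90, which I think is too high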

Architecture

I have a few complaints and/or feature requests, because avoiding single-page flushes is a good idea:

  • Provide always-on counters that make the problem easy to spot
  • Use the same value by default for innodb_buffer_pool_instances and innodb_page_cleaners
  • Wake the page cleaner threads more often (more than once/second) so they can run more frequently but do less work per run.
  • For users, consider reducing the value for innodb_max_dirty_pages_pct. The default, 90%, is too high (reducing the default is another request). Perhaps 50% is a better default. The page cleaners will have an easier time keeping up when they only have to write back 50% of the pages at the tail of the LRU versus writing back 90% of them.
  • In theory using a larger value for innodb_lru_scan_depth means that more pages will be cleaned per second. Because the page cleaners wake once per second, the expected number of dirty pages cleaned per second is roughly innodb_max_dirty_pages_pct X innodb_lru_scan_depth X innodb_buffer_pool_instances (see the worked example after this list). However a larger value for lru_scan_depth means there will be more mutex contention.
  • Make page cleaning rates adaptive based on the workload rather than hardcoded based on innodb_lru_scan_depth
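
As a worked example of that estimate, using the defaults and treating innodb_max_dirty_pages_pct as a fraction: 0.90 X 1024 X 8 is about 7,400 pages per second, which with the default 16kb page size is roughly 115 MB/s of writeback before counting the doublewrite buffer. This is a rough upper bound because it assumes the tail of the LRU is 90% dirty and that every page cleaner wakeup scans its full quota.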

My first request is to make the default values the same for innodb_buffer_pool_instances (currently 8) and innodb_page_cleaners (currently 4). When there are more buffer pool instances than page cleaner threads (which is true by default today) then InnoDB is likely to have a harder time avoiding single-page flushes.
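
To see what a running server uses for these two options, a trivial check:

  SELECT @@innodb_buffer_pool_instances, @@innodb_page_cleaners;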

The page cleaner threads wake once per second, scan innodb_lru_scan_depth pages from the tail of the LRU, and write back the dirty pages to make them clean and faster to evict. A given page cleaner thread works on one buffer pool instance at a time. The page cleaner thread can also do some writeback from the flush list, but I didn't try to understand that code path (see here).

To browse the source, start with buf_flush_page_coordinator_thread, pc_request and pc_flush_slot.

4 comments:

  1. > The single-page flushing problem occurs when background threads cannot keep enough clean pages at the tail of the LRU.

    For a query thread to try finding a clean page at the tail of the LRU, the Free List first needs to be empty. Have you tried maintaining more free pages in the free list by increasing innodb_lru_scan_depth ? Many details about this in [1] (probably also relevant even if you are not using Percona Server).

    [1]: https://jfg-mysql.blogspot.com/2022/11/tail-latencies-in-percona-server-because-innodb-stalls-on-empty-free-list.html

    > Wake the page cleaner threads more often (more than once/second) so they can run more frequently but do less work per run.

    Percona has this, details in [1b] and [1].

    [1b]: https://docs.percona.com/percona-server/8.0/performance/xtradb_performance_improvements_for_io-bound_highly-concurrent_workloads.html#multi-threaded-lru-flusher

    > In theory using a larger value for innodb_lru_scan_depth means that more pages will be cleaned per second.

    This applies to the tail of the LRU, but has other effects, details in [1], [2] and [3].

    [2]: https://bugs.mysql.com/bug.php?id=108927

    [3]: https://bugs.mysql.com/bug.php?id=108928

    > However a larger value for lru_scan_depth means there will be more mutex contention.

    I do not understand this, could you elaborate ?

    > Make page cleaning rates adaptive based on the workload rather than hardcoded based on innodb_lru_scan_depth

    I think Percona Server has this, details in [1b] and [1].

    While writing [1], I thought about ways to improve Flushing. My idea would boil down to:
    - keep doing List Flushing once per second
    - split LRU Flushing in two: LRU Cleaning and Free List Refilling
    - do Free List Refilling very often, and LRU Cleaning not too often

    Basically, the above is because the "only" waste in flushing is scanning the LRU (scanning clean pages is wasted work), all the rest is "useful". So I would have LRU flushing waking up often to refill the Free List (probably in an adaptive way as implemented in Percona Server), but only scan / clean the LRU every 10 to 100 iterations of Free List Refilling, or when refilling the Free List finds a dirty page at the tail of the LRU.

    Also, I would like LRU Cleaning to go further than the size of the Free List, to be able to maintain a "long" tail of the LRU clean, to prevent Free List refilling from hitting a dirty page, or to prevent a query thread looking for a clean page in the LRU from having to scan too many dirty pages. This longer LRU cleaning is described in [3] (look for innodb_lru_scan_size; I changed my mind a few times in this feature request, you will have to read all the comments).

    Also, with a longer tail of the LRU clean, I can assume that a query thread starting to scan the LRU when the Free List is Empty will quickly find a clean page. This way, I can also aggressively reduce the Free List Size, because the pages in the Free List are wasted RAM.

    Also, while cleaning the tail of the LRU, a page cleaner should check that the Free List is still relatively full, because if it drains, query threads will try getting the LRU Mutex, which is held by the page cleaner, and this is a stall because the page cleaner was too busy cleaning and did not keep refilling.

    This is a complex subject, and I would be happy to bounce ideas with you on this if you want. Ping me on Messenger if you want to schedule a chat.

    Replies
    1. > Have you tried maintaining more free pages in the free list by increasing innodb_lru_scan_depth ?

      Yes and results were mixed.

      I have results, not yet shared, for lru_scan_depth=1024, 2048 and 4096 (1024 is the default). I have been trying to get better results before sharing but perhaps that isn't possible.

      > [1]: https://jfg-mysql.blogspot.com/2022/11/tail-latencies-in-percona-server-because-innodb-stalls-on-empty-free-list.html

      Will read your blog post soon. Thanks.

      > Percona has this, details in [1b] and [1].
      > [1b]: https://docs.percona.com/percona-server/8.0/performance/xtradb_performance_improvements_for_io-bound_highly-concurrent_workloads.html#multi-threaded-lru-flusher

      Reading some of this brings me to tears (almost). Not sure if from joy or from frustration that these features aren't upstream.

      > > However a larger value for lru_scan_depth means there will be more mutex contention.
      > I do not understand this, could you elaborate ?

      I have to read more current code, but back in the day the page cleaner held important "global" mutexes that others needed to access the LRU.

      > While writing [1], I thought about ways to improve Flushing. My idea would boil down to:

      I like your ideas. Although I also like frequent page cleaning (writeback of dirty pages). Doing it more than once/second might reduce bursts of write requests.

      > This is a complex subject, and I would be happy to bounce ideas with you on this if you want. Ping me on Messenger if you want to schedule a chat.

      I appreciate the offer. But my time devoted to making InnoDB better is much smaller than it used to be. For now I just want to know that my benchmark results are fair, and I am happy to file MySQL bugs & feature requests as part of that.





    2. > > Percona has this [...]

      Is it easy for you to run your tests with Percona Server? Maybe you would be able to get more throughput if their implementation allows for more Free Page production. Their backoff Empty Free List Algorithm might also be better suited for higher throughput for what you are testing. More about producer-consumer below.

      > I have to read more current code, but back in the day the page cleaner held important "global" mutexes that others needed to access the LRU.

      Right, while LRU Flushing is happening, the Page Cleaner (or LRU Manager Thread in Percona Server) is holding the LRU Mutex, which prevents a query thread from doing a Single Page Flush. But while this is happening, there is no point in a query thread trying to get the LRU Mutex, as the point of the Page Cleaner doing LRU Flushing is in part to generate free pages. So at this point, a Query Thread failing to get the LRU Mutex should go back to looking at the Free List (assuming it is not another query thread holding the LRU Mutex, but I discuss this further down below).

      Could you clarify what you mean by "Single Page Flush"? Flush here is ambiguous and could mean either getting a clean page at the tail of the LRU, or, after lru_scan_depth pages have been scanned without finding a clean page, doing a "Single Page 'Clean'". I assume it is a Single Page Clean as you mention IO in your post, but I am not sure the InnoDB Metrics Counters make this distinction.

      It is disappointing that a query thread is doing a "Single Page 'whatever'". It would be nice if it behaved in a way oriented toward the greater good of the system, and not just getting what it needs. For that, once it is holding the LRU Mutex, it would be good if it refilled the Free List, or if it ends up kicking off IO, it should do more than one write, as doing one or many on SSDs is probably the same cost.

      I think there is a flag that is set by the first query thread scanning lru_scan_depth dirty pages to tell the others to directly go into Single Page Clean mode. This flag is reset by the Page Cleaner. I think this prevents all query threads from scanning 1024 pages before doing a Single Page Clean. Still, a parameter to scan fewer pages might be interesting, as once we have scanned 16 dirty pages at the tail of the LRU, there is probably a lot of unclean stuff there.

      When I read the Tweet below [a], it made me rethink a few things...

      [a]: https://twitter.com/MarkCallaghanDB/status/1677109651668881408

      > If you run a DBMS on fast storage then that DBMS better be able to do page eviction quickly.

      It is interesting to think about the system as a producer-consumer with a buffer, all this around free pages. The consumers are the query threads, the producers are the Page Cleaners, and the buffer is the Free List. The other bottleneck of the consumer is doing a blocking read from disk. The other bottleneck of the producer is doing 2 writes to disk, but these can be done in parallel (many pages can be cleaned in parallel by a Page Cleaner). If the costs of reads and writes are the same, I think there will always be queuing. But if reads are significantly more expensive than writes, you might be able to avoid queuing. This might be possible in a setup with a write cache, including in a storage implementation where writes are buffered, like zfs (or other lsm-type storage). But as the current consumers end up "only thinking about themselves" when doing Single Page Clean, the system degrades badly (addressed by the Percona backoff Empty Free List Algorithm).

    3. We were very keen to use the Percona MT LRU flusher but we didn't notice any improvement. First we were told it was because of the DBLWR buffer that we weren't seeing the improvement. After fixing the DBLWR buffer we still didn't see any improvement and so we gave up.

      I chatted with Laurynas about this too since he was the author of the patch (IIRC). Perhaps Dimitri remembers the details.

      Regards,
      -sunny

