Tuesday, October 18, 2022

Reasons for writeback with an update-in-place B-Tree

Are there well-known names for the things that trigger dirty page writeback with an update-in-place B-Tree? Some of my performance discussions would be easier if I could point to those definitions.

I spent many years keeping InnoDB, an update-in-place B-Tree, happy in production and a shorter time with WiredTiger, a copy-on-write random (CoW-R) B-Tree. At work I recently revisited the topic of checkpointing and wondered if there were good names for the things that trigger dirty page writeback. Alas, my InnoDB expertise isn't what it used to be.

InnoDB implements fuzzy checkpointing. While the reference manual page for this is too brief, the advice on that page is excellent -- a larger redo log is good because it reduces write-amplification. A short pitch for fuzzy checkpointing is that it avoids writeback storms, the kind that used to plague Postgres. When you ask a storage device to write GBs of data in a short amount of time, a queue forms and requests (reads, log fsyncs) that encounter that queue will wait. By redo log here I mean the sum of the sizes of the N redo logs you will use with InnoDB. The write-amp is reduced because LSN triggered writeback (using the terminology defined below) is reduced.
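
To make the write-amp claim concrete, here is a toy simulation. It is my own sketch, not InnoDB code, and every number in it is made up: one hot page is re-dirtied on every update and must be written back before its oldest change falls out of the redo log.

```
#include <cstdint>
#include <cstdio>
#include <initializer_list>

int main() {
  const uint64_t total_redo = 1ULL << 34;  // 16 GiB of redo generated
  const uint64_t redo_per_update = 512;    // bytes of redo per update
  for (const auto log_size : {1ULL << 27, 1ULL << 30, 1ULL << 33}) {
    // Writeback is forced before the page's oldest change leaves the log.
    const uint64_t window = log_size * 8 / 10;
    uint64_t writes = 0;
    uint64_t oldest = 0;  // LSN of the page's oldest unflushed change
    for (uint64_t lsn = 0; lsn < total_redo; lsn += redo_per_update) {
      if (lsn - oldest > window) {  // LSN triggered writeback
        ++writes;
        oldest = lsn;  // the page is clean until the next update re-dirties it
      }
    }
    printf("log size %5llu MB -> %llu writebacks of the hot page\n",
           (unsigned long long)(log_size >> 20), (unsigned long long)writes);
  }
  return 0;
}
```

The writeback count for the hot page drops roughly in proportion to the log size, which is the write-amp reduction.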

Back to my point about the things that trigger writeback. I will suggest names here. Hopefully I will learn that these have already been named and I can use the existing names. Otherwise I will try to use the names in the future. 

The triggers are:

  • shutdown - for InnoDB writeback is done on shutdown when innodb_fast_shutdown = 0 or 1. Writeback is not done when it is set to 2, but that means crash recovery is needed on startup. Back in the day when we did web-scale MySQL with spinning disks there were many slow shutdowns and at times the MySQL-provided script that drove the shutdown would time out (we modified that script to avoid timeouts).
  • LRU - write back a dirty page when it reaches the tail of the LRU list, because its memory can't be reused for another block until the page has been written back. This can occur when a user query needs to read a block into the buffer pool but all pages are in use (this is single-page writeback with InnoDB, and it is bad for performance). But InnoDB also has a background thread that schedules writeback for dirty pages close to the LRU tail to avoid single-page writeback, and the innodb_lru_scan_depth option controls how far from the tail it searches.
  • capacity - write back dirty pages because the buffer pool has too many of them. InnoDB has innodb_max_dirty_pages_pct to define too many. A limit is needed because the number of dirty pages controls how long shutdown and crash recovery will take. With InnoDB these pages are found from the end of the flush (dirty page) list, but I want to distinguish this trigger from the previous one. Also, with InnoDB this is triggered by one of the many tasks done by background threads.
  • LSN - write back dirty pages whose oldest unflushed change is too old. For InnoDB, too old means that the redo log that has the changes for that page will soon be recycled. There are many great resources on this topic written by others. With InnoDB this is triggered by a background thread. A sketch of the decision logic for all four triggers follows this list.
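
Here is the decision logic for the four triggers as code. This is a minimal sketch -- the struct, the thresholds and the function are names I made up for this post, not InnoDB internals:

```
#include <cstdint>
#include <cstdio>

// Hypothetical page state: oldest_modification is the LSN of the oldest
// change to the page that has not yet been written back.
struct Page {
  bool dirty;
  uint64_t oldest_modification;
};

enum class Trigger { None, Shutdown, LRU, Capacity, LSN };

// Decide which trigger, if any, forces writeback of this page. The
// thresholds are stand-ins for InnoDB options (innodb_max_dirty_pages_pct,
// the redo log size).
Trigger writeback_trigger(const Page &p, bool shutting_down, bool at_lru_tail,
                          double dirty_pct, double max_dirty_pct,
                          uint64_t current_lsn, uint64_t checkpoint_window) {
  if (!p.dirty) return Trigger::None;
  if (shutting_down) return Trigger::Shutdown;  // innodb_fast_shutdown = 0 or 1
  if (at_lru_tail) return Trigger::LRU;  // frame can't be reused until clean
  if (dirty_pct > max_dirty_pct) return Trigger::Capacity;  // too many dirty
  if (current_lsn - p.oldest_modification > checkpoint_window)
    return Trigger::LSN;  // the redo that covers this change will be recycled
  return Trigger::None;
}

int main() {
  // A dirty page whose oldest change is far behind the current LSN.
  Page p{true, 1000};
  if (writeback_trigger(p, false, false, 0.10, 0.75, 90000, 50000) ==
      Trigger::LSN)
    printf("LSN triggered writeback\n");
  return 0;
}
```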
Updates:
  • Laurynas Biveinis reminded me about innodb_lru_scan_depth. It is funny that I forgot about it, because long ago I had a production-ready rewrite of that code (the older code suffered from mutex contention). But before I could deploy it, upstream had their own rewrite; I wasn't aware of their work-in-progress. The new tests I wrote for my version found at least one bug in their rewrite, so it wasn't a complete waste of time.
  • While I write about update-in-place B-Trees above, this is really about update-in-place in general and also includes heap-organized tables as used by Postgres, Oracle, etc.
  • From Inaam, a correction about usage of the flush list and names that I like for the writeback triggers: shutdown flushing, checkpoint flushing, dirty page flushing and LRU flushing
  • Much info from JFG about InnoDB internals related to writeback

8 comments:

  1. Thank you for the writeup, Mark.

    In MariaDB Server 10.5, we rewrote the page writeback so that it would scale nicely when using a single buffer pool instance, a single page write initiator thread (buf_flush_page_cleaner) and a single log file. While working on that (MDEV-23399, MDEV-23855), I came up with the terms "checkpoint flushing" and "page eviction flushing". The latter is what is often called "LRU flushing". I moved it out from the dedicated flushing thread, letting one of the threads that are requesting a free page to trigger it. I also made the buf_flush_page_cleaner update the log checkpoint after each page flushing batch.

    For the "LSN" trigger, I think that a more descriptive name might be "checkpoint age" (current LSN minus the latest checkpoint LSN).

    Usually, I would associate the term "capacity" with "log capacity", that is, the available space in the circular write-ahead log file (MariaDB 10.5 uses only 1) minus the checkpoint age and some safety margin. A log checkpoint will have to be triggered to prevent the tail of the redo log from overwriting the previous checkpoint, which would break crash recovery. This will trigger "furious flushing" and may lead to completely stalled SQL threads that are attempting to write log. In MariaDB, there is some logic to prevent such stalls by initiating a "flush-ahead" a little earlier.
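
    As a rough sketch of that arithmetic (hypothetical names and margins, not the MariaDB implementation):

    ```
    #include <cstdint>
    #include <cstdio>

    struct LogState {
      uint64_t file_size;       // capacity of the circular log file
      uint64_t current_lsn;     // head of the log
      uint64_t checkpoint_lsn;  // latest log checkpoint
    };

    uint64_t checkpoint_age(const LogState &s) {
      return s.current_lsn - s.checkpoint_lsn;
    }

    // Start flushing a little early so that writers are unlikely to stall.
    bool flush_ahead(const LogState &s) {
      return checkpoint_age(s) > s.file_size * 7 / 8;
    }

    // The tail of the log is about to overwrite the previous checkpoint, so
    // threads that want to write log must wait ("furious flushing").
    bool furious_flushing(const LogState &s) {
      return checkpoint_age(s) > s.file_size - s.file_size / 16;
    }

    int main() {
      LogState s{1ULL << 30, 1000000000, 50000000};
      printf("age=%llu flush_ahead=%d furious=%d\n",
             (unsigned long long)checkpoint_age(s), flush_ahead(s),
             furious_flushing(s));
      return 0;
    }
    ```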

    I cannot come up with a really good suggestion to replace your "capacity" trigger. I might call it "background flushing" because the flushing driven by the buffer pool cleanliness ratio is rate-limited (innodb_io_capacity). I now see that the parameter could be where your name came from.

    My colleague Vladislav Vaintroub identified one source of write stalls to be a conflict between fdatasync() and synchronous writes. Thanks to him, the infamous "doublewrite buffer" starting with MariaDB 10.5.7 is being written asynchronously, up to 128 pages at a time, while a second buffer is being filled in memory.
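
    The pattern is classic double buffering, roughly like this hypothetical sketch (the names are made up and the asynchronous write is stubbed):

    ```
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // Stub: a real implementation would submit an async write and return.
    static void start_async_write(const uint64_t *pages, size_t n) {
      printf("async write of %zu pages\n", n);
    }

    struct DoubleBuffer {
      static constexpr size_t kBatch = 128;  // up to 128 pages per write
      uint64_t buf[2][kBatch];               // stand-ins for the two batches
      size_t fill = 0;                       // pages queued in the active buffer
      int active = 0;                        // buffer producers currently fill
      bool write_in_progress = false;        // cleared by the completion callback
    };

    // Queue a page image; when the active buffer fills and no write is
    // running, hand it to the async writer and fill the other buffer.
    void add_page(DoubleBuffer &db, uint64_t page) {
      db.buf[db.active][db.fill++] = page;
      if (db.fill == DoubleBuffer::kBatch && !db.write_in_progress) {
        start_async_write(db.buf[db.active], db.fill);
        db.write_in_progress = true;
        db.active ^= 1;
        db.fill = 0;
      }
    }

    int main() {
      DoubleBuffer db;
      for (uint64_t i = 0; i < 2 * DoubleBuffer::kBatch; i++) add_page(db, i);
      return 0;
    }
    ```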

    We also experimented with initiating some log writes asynchronously (MDEV-26603), but that mostly resulted in performance regressions. Log writing speed was improved in MariaDB 10.8 thanks to a more flexible log format that allows multiple threads to copy log snippets to the global log_sys.buf concurrently. Only the allocation of log snippets is covered by an exclusive lock; the copying (simple memcpy) is covered by an rw-lock (exclusive lock during log checkpoint). We are still using a single circular ib_logfile0 for the log, supporting an arbitrary physical block size.
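
    As a minimal sketch of that lock split (hypothetical names, not the actual log_sys code):

    ```
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <mutex>
    #include <shared_mutex>

    // Reserving space in the log buffer takes a short exclusive section,
    // while copying the log snippet runs under a shared lock so many threads
    // can memcpy concurrently. A checkpoint takes copy_latch exclusively so
    // that it observes a buffer with no copies in flight.
    struct LogSysBuf {
      std::mutex alloc_mutex;        // serializes only the allocation of space
      std::shared_mutex copy_latch;  // shared to copy, exclusive at checkpoint
      uint8_t buf[1 << 20];
      size_t tail = 0;

      void append(const uint8_t *rec, size_t len) {
        size_t pos;
        {
          std::lock_guard<std::mutex> g(alloc_mutex);  // short critical section
          pos = tail;
          tail += len;
        }
        std::shared_lock<std::shared_mutex> s(copy_latch);  // concurrent copiers
        std::memcpy(buf + pos, rec, len);
      }
    };

    int main() {
      static LogSysBuf log;  // static: the 1 MiB buffer may not fit the stack
      const uint8_t rec[16] = {};
      log.append(rec, sizeof rec);
      return 0;
    }
    ```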

    Replies
    1. I frequently suggest to various MariaDB people that we need more content -- blog posts and tweets -- from the MariaDB community, including the developers. Your replies are always interesting and frequently long enough to be their own blog posts, so the next time I make such a suggestion I will include your name as someone who should write more.

      I definitely agree about "checkpoint age" vs "LSN". Perhaps I was trying to stick to one word names. I think two word names would be fine. But we just need to agree on them, and use them in docs to make future discussions easier.

      With respect to contention between fdatasync and sync writes, is the per-inode mutex that is locked during buffered IO (but not with O_DIRECT, at least not for XFS) an issue? Though the workaround for that is to use O_DIRECT and XFS.

      And for something to write about in the MariaDB/InnoDB space, I think adding support for io_uring is a great topic.

  2. Nice write-up, Mark!

    Naming is not my forte, but I'd call them shutdown_flushing, checkpoint_flushing, dirty_page_flushing and lru_flushing. In InnoDB code, Heikki named the dirty page list flush_list, which caused us a lot of noun/verb naming pain, e.g. buf_flush_list_flush.
    One minor correction: dirty_page_flushing (capacity in your post) happens from the tail of the flush_list (not the LRU). I think flushing from the LRU is a good idea. I even opened a bug (https://bugs.mysql.com/bug.php?id=74637) for that a few years ago but I don't think anyone got a chance to look into it.
    Also, though not in the direct write flushing path, flushing can happen in some special cases in InnoDB as well, e.g. when quiescing a table for export.

    Replies
    1. Thanks for the corrections and I like the names you suggested.

  3. Parallel write and flush in the context of redo log writes (the group commit) turned out to produce worse throughput numbers, also for unbuffered IO. One possible theory is that there is some contention in hardware (flushing the disk cache while populating it via writes at the same time perhaps does not work well on the disks I had at that time). An interesting alternative theory is that write+flush takes more time than either operation alone, and during that extra time more redo log is written to the spare memory buffer, so group batches become larger. In any case, in a benchmark comparison between parallel and serialized writes and flushes, serialized was winning.

    Replies
    1. I never spent much time on this. Standard questions - what file system and what kind of storage?

  4. These are the names used in code:
    ```
    /** Flags for flush types */
    enum buf_flush_t : uint8_t {
      /** Flush via the LRU list */
      BUF_FLUSH_LRU = 0,

      /** Flush via the flush list of dirty blocks */
      BUF_FLUSH_LIST,

      /** Flush via the LRU list but only a single page */
      BUF_FLUSH_SINGLE_PAGE,

      /** Index of last element + 1 */
      BUF_FLUSH_N_TYPES
    };
    ```
    Note that there is no separate concept of "flushing because of innodb_max_dirty_pages_pct", as the way it works is very... let's call it "complicated"... and innodb_max_dirty_pages_pct_lwm and innodb_max_dirty_pages_pct are among the inputs to the heuristic which tries to figure out how many pages BUF_FLUSH_LIST should write to disc per iteration (which runs roughly once per second). So, the useful way to think about it seems to be that there are two sources from which the pages to write back can be obtained, and they differ by ordering:

    a) start from the page which was not used for the longest time, because we need to evict something to obtain a free buffer from the BP (BUF_FLUSH_LRU and BUF_FLUSH_SINGLE_PAGE); in theory the best utilization of the cache is when you evict using LRU
    b) start from the page which has the oldest modifications not yet written to disc, because we want to be able to advance the checkpoint_lsn and thus shorten future recovery time (BUF_FLUSH_LIST); for that we need to be able to forget the prefix of the redo log which describes the oldest changes

    And then, on top of that, you need to decide how many of them you want to write to disc.

    Replies
    1. Thank you for the corrections. I don't read much InnoDB source these days.

