Tuesday, December 10, 2019

Historical - InnoDB IO Performance

This post was shared on code.google.com many years ago, but code.google.com has been shut down. It describes work done by my team at Google. I am interested in the history of technology and with some spare time have been able to republish it.

This is a collection of several posts about InnoDB IO performance.

Max dirty pages

InnoDB provides a my.cnf variable, innodb_max_dirty_pages_pct, to set the maximum percentage of buffer pool pages that should be dirty. It then appears to ignore that variable for IO bound workloads (see this post from DimitriK). It doesn't really ignore the value. The problem is that it does not try hard enough to flush dirty pages even when there is available IO capacity. Specific problems include:
  • one thread uses synchronous IO to write pages to disk. When write latency is significant (because O_DIRECT is used, the SATA write cache is disabled, network attached storage is used, or ext2 is used), this thread becomes a bottleneck. This was fixed in the v2 Google patch and is also fixed in MySQL 5.4 and Percona builds.
  • rate limits are too small. InnoDB has a background thread that schedules writes for 100 dirty pages at a time when there are too many dirty pages. The limit of 100 is reasonable for a single-disk server. It must be larger for a high IOPs server. The v2 Google patch, MySQL 5.4 (maybe) and Percona branches use the innodb_io_capacity my.cnf variable to determine the number of pages per second that should be written for this case and the amount of IO that should be done in other cases. All work is expressed as a fraction of this variable, rather than as a fixed number of IO operations (see the sketch after this list).
  • request arrays are too small. On my servers, each array has 256 slots. For a server that can do 1000 IOPs, this is too small. The v4 patch makes the size of the array a function of the value of innodb_io_capacity.
  • user sessions are not constrained. InnoDB delays user sessions when the purge thread gets too far behind. Otherwise, not much is done to delay a user session. The v4 patch adds code to force user sessions to stop and flush dirty pages when the maximum number of dirty pages has been exceeded. Hopefully, this code does nothing, as the background thread is more likely to keep up given other changes in the v4 patch.
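
Below is a minimal sketch in C of the batch-size idea described above: the number of pages flushed per second becomes a fraction of innodb_io_capacity rather than a fixed 100. The helper name and percentages are hypothetical, not the actual InnoDB symbols.

    #include <stdio.h>

    /* Hypothetical helper: size each background flush batch as a
       fraction of innodb_io_capacity instead of a hard-coded 100
       pages.  The name and percentages are made up for illustration. */
    static unsigned long
    flush_batch_size(unsigned long io_capacity, unsigned long pct)
    {
        return io_capacity * pct / 100; /* pages per second for this task */
    }

    int main(void)
    {
        /* Traditional behavior is roughly the first case: ~100 pages. */
        printf("single-disk server: %lu pages/s\n", flush_batch_size(100, 100));
        /* The same fraction scales on a high-IOPs server. */
        printf("high-IOPs server:   %lu pages/s\n", flush_batch_size(2000, 100));
        return 0;
    }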

IO Performance

This provides performance results for work done to improve InnoDB IO performance (TODO - fix the links).

It is one thing to publish performance results. It is another to understand them. The results here need more analysis and the code needs to be tested by others in the community.

This describes work to make InnoDB faster on IO bound workloads. The goal is to make it easy to use InnoDB on a server that can do 1000 to 10000 IOPs. Many problems must be fixed for that to be possible, but this is a big step towards that goal. These changes improve performance by 20% to more than 400% on several benchmarks. At a high level, these changes make InnoDB:
  • more efficient when processing IO requests
  • more likely to use available IO capacity
  • better at balancing different IO tasks
  • easier to monitor

One day, Heikki will write the Complete Guide to InnoDB (edit - Jeremy Cole did a lot to explain the internals). Until then you need to consult multiple sources to understand the internals, and it also helps to read the source code.

Features

  • Changes the computation of the percentage of dirty buffer pool pages. Before this change the percentage excluded pages borrowed from the buffer pool for other uses. While that may be more accurate, it also requires the caller to lock/unlock a hot mutex, and it made the percentage vary a bit too much as the insert buffer grew and shrank. The v4 patch doesn't exclude the borrowed pages (see the sketch after this list). As most of the borrowed pages should be used by the insert buffer and the insert buffer should be smaller (thanks to innodb_ibuf_max_pct_of_buffer), this is probably a good thing.
  • (edit removed many links to other project pages)
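
A minimal sketch of the dirty-page percentage change described in the first item, using made-up field names; the real counters live in InnoDB's buffer pool structures:

    #include <stdio.h>

    /* Made-up field names; in InnoDB the borrowed count sits behind a
       hot mutex. */
    struct buf_pool_stats {
        unsigned long total_pages;    /* all buffer pool pages */
        unsigned long dirty_pages;    /* pages waiting to be flushed */
        unsigned long borrowed_pages; /* e.g. insert buffer pages */
    };

    /* Before the v4 patch: borrowed pages are excluded, which requires
       locking to get an accurate count and makes the result swing as
       the insert buffer grows and shrinks. */
    static double dirty_pct_old(const struct buf_pool_stats *s)
    {
        return 100.0 * s->dirty_pages / (s->total_pages - s->borrowed_pages);
    }

    /* v4 patch: use the full pool size; no extra locking is needed. */
    static double dirty_pct_v4(const struct buf_pool_stats *s)
    {
        return 100.0 * s->dirty_pages / s->total_pages;
    }

    int main(void)
    {
        struct buf_pool_stats s = { 100000, 20000, 10000 };
        printf("old: %.1f%%  v4: %.1f%%\n", dirty_pct_old(&s), dirty_pct_v4(&s));
        return 0;
    }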

Background IO

InnoDB starts a thread, the main background IO thread, to perform background IO operations. This thread has tasks that run once per second, once per 10 seconds, and only when the server is idle. It is implemented with a for loop that iterates 10 times. Each time through the loop, the thread sleeps for 1 second unless too much work was done on the previous iteration of the loop. At the end of 10 iterations, the once-per-10-seconds tasks are run.

It is hard to understand the behavior of this loop because the sleep is optional, depending on the amount of work done on the previous iteration of the loop. And there are costs from this complexity. For example, one of the once-per-second tasks is to flush the transaction log to disk to match the expected behavior of innodb_flush_log_at_trx_commit=2. However, when the loop runs much more frequently than once per second there will be many more fsync calls than expected.

In the v4 patch, the sleep is not optional. Other changes to the main background IO thread make it possible for each loop iteration to do enough work that there is no need to skip the sleep.
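A minimal sketch of the loop structure described above, with stubbed task functions; the real srv_master_thread does much more:

    #include <stdbool.h>
    #include <unistd.h>

    /* Stubs standing in for the real tasks (log flushes, insert
       buffer merges, dirty page flushing, ...). */
    static bool do_once_per_second_tasks(void)      { return false; }
    static void do_once_per_ten_seconds_tasks(void) { }

    static void main_background_io_thread(void)
    {
        bool skip_sleep = false;
        for (;;) {
            for (int i = 0; i < 10; i++) {
                /* Pre-v4: the sleep is skipped when the previous
                   iteration was busy, so once-per-second tasks (such
                   as the log flush behind
                   innodb_flush_log_at_trx_commit=2) can run far more
                   often than once per second.  In the v4 patch the
                   sleep is unconditional: think of skip_sleep as
                   always false here. */
                if (!skip_sleep)
                    sleep(1);
                skip_sleep = do_once_per_second_tasks();
            }
            do_once_per_ten_seconds_tasks();
        }
    }

    int main(void)
    {
        main_background_io_thread();
        return 0;
    }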

In the v4 patch all of the code that submits a large number of async IO requests makes sure that the number of requests does not exceed the number of free slots in the array. Otherwise, the async IO requests block until there are free slots.
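A trivial sketch of that submission cap, with made-up names:

    #include <stdio.h>

    /* Cap a batch of async IO submissions at the number of free
       request array slots so the submitting thread does not block
       mid-batch; the caller retries the remainder later. */
    static int submit_async_batch(int n_requests, int free_slots)
    {
        int n = n_requests < free_slots ? n_requests : free_slots;
        /* ... submit n requests to the request array here ... */
        return n;
    }

    int main(void)
    {
        /* 300 prefetch requests but only 256 free slots. */
        printf("submitted %d now\n", submit_async_batch(300, 256));
        return 0;
    }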

CPU overhead from IO

There are several factors that consume CPU time during IO processing:
  • checksum computation and verification - the v4 patch does not make this faster. Using -O3 rather than -O2 with gcc makes this faster. On a server that does 10,000 IOPs, this will consume a lot of CPU time. Domas wrote about this. We may need to consider alternative checksum algorithms and machine-specific optimizations.
  • request array iteration - InnoDB maintains requests for IO in an array. It frequently iterates over the array and calls a function to get the next element. That has been changed to use pointer arithmetic, which makes a big difference when the array is large.
  • request merging - InnoDB merges requests for adjacent blocks so that one large IO operation is done instead of several page-sized operations. Up to 64 page requests can be merged into one large (1MB) request. The merging algorithm was O(N*N) on the size of the request array and has been changed to be O(N). This will merge fewer requests but use much less CPU (see the sketch after this list). A better change might be to replace each array with two lists: one that maintains requests in file order and the other in arrival order. But that must wait for another day.
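
The following sketch shows the single-pass merge idea from the last item, assuming requests sorted by page number and made-up structure names; the real code operates on InnoDB's request array:

    #include <stdio.h>

    #define MAX_MERGED 64 /* 64 16KB pages = one 1MB IO */

    struct req { unsigned long page_no; };

    /* Single pass over requests sorted by page number: extend the
       current run while pages stay contiguous, then issue one large
       IO per run.  Unlike the old O(N*N) scan, this only merges
       requests that are adjacent in the array. */
    static void merge_requests(const struct req *reqs, int n)
    {
        int i = 0;
        while (i < n) {
            int run = 1;
            while (i + run < n && run < MAX_MERGED &&
                   reqs[i + run].page_no == reqs[i].page_no + run)
                run++;
            printf("issue IO: pages %lu..%lu (%d pages)\n",
                   reqs[i].page_no, reqs[i].page_no + run - 1, run);
            i += run;
        }
    }

    int main(void)
    {
        struct req r[] = { {10}, {11}, {12}, {40}, {41}, {99} };
        merge_requests(r, (int)(sizeof(r) / sizeof(r[0])));
        return 0;
    }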

my.cnf options for IO performance

These InnoDB my.cnf variables are new in the Google patches (an example configuration follows the list):
  • innodb_max_merged_io - maximum number of IO requests merged to issue large IO from background IO threads
  • innodb_read_io_threads - number of background read I/O threads in InnoDB
  • innodb_write_io_threads - number of background write I/O threads in InnoDB
  • innodb_adaptive_checkpoint - makes the background IO thread flush dirty pages when there are old pages that will delay a checkpoint. OFF provides traditional behavior
  • innodb_check_max_dirty_foreground - make user sessions flush some dirty pages when innodb_max_dirty_pages_pct has been exceeded. OFF provides traditional behavior
  • innodb_file_aio_stats - compute and export per-file IO statistics for InnoDB
  • innodb_flush_adjacent_background - when background IO threads flush dirty pages, flush adjacent dirty pages from the same extent. ON provides traditional behavior.
  • innodb_flush_adjacent_foreground - when user sessions flush dirty pages, flush adjacent dirty pages from the same extent. ON provides traditional behavior
  • innodb_ibuf_flush_pct - percent of innodb_io_capacity that should be used for prefetch reads used to merge insert buffer entries
  • innodb_ibuf_max_pct_of_buffer - soft limit for the percent of buffer cache pages that can be used for the insert buffer. When this is exceeded background IO threads work harder to merge insert buffer entries. The hard limit is 50%. The traditional value is 50%.
  • innodb_ibuf_reads_sync - use sync IO to read blocks for insert buffer merges. ON provides traditional behavior. 
  • innodb_io_capacity - maximum number of concurrent IO requests that should be done to flush dirty buffer pool pages. CAUTION -- setting this too high will use a lot of CPU to schedule IO requests and more than 1000 might be too high. The traditional value is 100.
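
To make the list concrete, here is an illustrative my.cnf fragment. The values are examples only, not recommendations; tune them for your hardware.

    [mysqld]
    # Illustrative values only -- tune for your server.
    innodb_io_capacity = 1000
    innodb_read_io_threads = 4
    innodb_write_io_threads = 4
    innodb_max_merged_io = 64
    innodb_adaptive_checkpoint = ON
    innodb_check_max_dirty_foreground = ON
    innodb_ibuf_max_pct_of_buffer = 25
    innodb_ibuf_flush_pct = 10
    innodb_flush_adjacent_background = ON
    innodb_flush_adjacent_foreground = OFF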

Insert Buffer Improvements

InnoDB performance on many IO bound workloads is much better than expected because of the insert buffer. Unfortunately, InnoDB does not try hard enough to keep the insert buffer from getting full. And when it gets full, performance suffers: the insert buffer continues to use memory from the buffer pool but can no longer be used to defer IO for secondary index maintenance.

The v4 patch has several changes to fix this (a sketch of the limit checks follows the list):
  • the my.cnf variable innodb_ibuf_max_pct_of_buffer specifies a soft limit on the size of the insert buffer as a percentage of the buffer pool. The hard limit is 50%. When the hard limit is reached no more inserts are done to the insert buffer. When the soft limit is reached, the main background IO thread aggressively requests prefetch reads to merge insert buffer records.
  • the my.cnf variable innodb_ibuf_flush_pct specifies the number of prefetch reads that can be submitted at a time as a percentage of innodb_io_capacity. Prior to the v4 patch, InnoDB did 5 prefetch read requests at a time and this was usually done once per second.
  • the my.cnf variable innodb_ibuf_reads_sync determines whether async IO is used for the prefetch reads. Prior to the v4 patch, sync IO was used for the prefetch reads done to merge insert buffer records. This variable was added for testing as the default value (skip_innodb_ibuf_reads_sync) should be used in production.
  • code is added to delay user sessions and make them merge insert buffer records when the size of the insert buffer exceeds the soft limit.
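
A minimal sketch of those limit checks, with made-up function names; the 50% hard limit and the fraction-of-innodb_io_capacity batch come from the description above:

    #include <stdio.h>
    #include <stdbool.h>

    /* Hard limit: past 50% of the buffer pool, no more inserts are
       done to the insert buffer. */
    static bool ibuf_insert_allowed(double ibuf_pct_of_pool)
    {
        return ibuf_pct_of_pool < 50.0;
    }

    /* Soft limit: past innodb_ibuf_max_pct_of_buffer, the background
       thread requests prefetch reads as a fraction of
       innodb_io_capacity instead of the old fixed 5 reads. */
    static int ibuf_merge_batch(double ibuf_pct_of_pool, double soft_limit_pct,
                                int io_capacity, int ibuf_flush_pct)
    {
        if (ibuf_pct_of_pool < soft_limit_pct)
            return 0; /* below the soft limit: no extra merging */
        return io_capacity * ibuf_flush_pct / 100;
    }

    int main(void)
    {
        printf("insert allowed: %d\n", (int)ibuf_insert_allowed(30.0));
        printf("merge batch: %d prefetch reads\n",
               ibuf_merge_batch(30.0, 25.0, 1000, 10));
        return 0;
    }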

Freeze InnoDB IO

This feature wasn't useful in production. It added the commands:
  • set global innodb_disallow_writes=ON
  • set global innodb_disallow_writes=OFF

These enable and disable all InnoDB file system activity except for reads. If you want to take a database backup without stopping the server and you don't use LVM, ZFS or some other storage software that provides snapshots, then you can use this to halt all destructive file system activity from InnoDB and then back up the InnoDB data files. Note that it is not sufficient to run FLUSH TABLES WITH READ LOCK, as there are background IO threads used by InnoDB that may still do IO.

Async IO for InnoDB

InnoDB supports asynchronous IO on Windows. On Linux, it uses 4 threads to perform background IO tasks, and each thread uses synchronous IO. There is one thread for each of:
  • insert buffer merging
  • log IO
  • read prefetch requests
  • writing dirty buffer cache pages
InnoDB issues prefetch requests when it detects locality in random IO and when it detects a sequential scan. However, it only uses one thread to execute these requests. Multi-disk servers are best utilized when more IO requests can be issued concurrently.
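
One plausible way to use several read threads, sketched below with made-up names, is to assign each request to a thread's queue by hashing the page number; whether the patch dispatched exactly this way is not specified here.

    #include <stdio.h>

    struct io_request { unsigned long page_no; };

    /* Hash on the page number so requests for the same page always
       land on the same thread's queue, while different pages spread
       across all read threads (and thus all disks). */
    static int pick_read_thread(const struct io_request *req, int n_read_threads)
    {
        return (int)(req->page_no % (unsigned long)n_read_threads);
    }

    int main(void)
    {
        struct io_request reqs[] = { {7}, {8}, {9}, {100} };
        for (int i = 0; i < 4; i++)
            printf("page %lu -> read thread %d\n",
                   reqs[i].page_no, pick_read_thread(&reqs[i], 4));
        return 0;
    }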

For deployments that use buffered IO rather than direct IO or some type of remote disk (SAN, NFS, NAS), there is not much of a need for more write threads because writes complete quickly into the OS buffer cache. However, as servers with many GB of RAM are used, it is frequently better to use direct IO.

We changed InnoDB to support a configurable number of background IO threads for read and write requests. This is controlled by the parameters:
  • innodb_max_merged_io - Max number of IO requests merged to issue large IO from background IO threads
  • innodb_read_io_threads - the number of background IO threads for read prefetch requests
  • innodb_write_io_threads - the number of background IO threads for writing dirty pages from the buffer cache
