Monday, October 24, 2022

Early lock release and InnoDB

Early lock release has been in the news and I almost forgot that we prototyped this for InnoDB while figuring out how to do group commit for both the InnoDB redo log and MySQL replication log. A Facebook Note about that work is here, but formatting isn't great as Notes for Pages have been deprecated.

Had the feature made it into a release it would have been documented, because it isn't good to surprise users with a feature that can make visible commits disappear after the primary DBMS crashes and recovers. It is even worse when the race is as simple as the DBMS process doing crash/recover which is more common than the primary node's HW doing the same. In the former there is no protection, in the latter enabling fsync on commit prevents it.

Early lock release also made it into MariaDB and Percona via XtraDB. Again, that was documented.

Since I don't want to lose the content for the note I have republished it below. Several of the links no long work.

Content

Group commit has an interesting history in MySQL. Peter opened bug 13669 for this many years ago. That bug has been closed and InnoDB announced that the plugin uses group commit. That is only true when the binlog is disabled. Things are more complicated when the binlog is enabled and I am not certain this is clear from the documentation.

There are three log writes during commit when the binlog is enabled. I have written about this before and am repeating myself to help the reader understand how group commit has been fixed in the Facebook patch. The log writes are:

How did InnoDB fix group commit in the 1.0.4 plugin? They changed either step 1 or step 3 to use group commit and that can make performance much better when the binlog is disabled. Alas, most of us run with the binlog enabled. It might not be enabled on a slave, but there is much less opportunity for group commit on a slave as the SQL thread is single-threaded. The peak number of commits per second on a server can be limited by the peak number of fsync calls per second that a disk system can perform.

Vamsi Ponnekanti and Ryan McElroy implemented group commit for the binlog write/fsync and this is now part of the Facebook patch for MySQL. Kristian Nielsen has begun working on a fix for this as part of MariaDB. Mats from the MySQL replication team has begun describing a fix for this as part of official MySQL. As MySQL is reluctant to commit to claiming things will be fixed in future releases, I won't claim that MySQL has work in progress. I will claim that I think this won't be a problem in the future and that bug 49326 will be fixed.

Three solutions have been described and it is possible that three different solutions will be implemented. I don't mind this diversity. At least one of these solutions will be excellent. Work done by my team has different constraints than work done for MariaDB or official MySQL. They can do the right thing even when that requires changing a lot of code and taking a lot of time. Others must solve problems faster.

Group commit in the Facebook patch

Vamsi provided a lot of the content that follows.

Group commit isn't possible today in official MySQL and MariaDB for the binlog write/flush/fsync because prepare_commit_mutex is locked for the duration of that work. The first change we made was to add a dynamic server configuration variable, innodb_prepare_commit_mutex, to disable use of that mutex. The mutex is only required when hot backup is being done and most of the time we are not taking a hot backup. The mutex is not used when innodb_prepare_commit_mutex=OFF so that threads can concurrently run the code that does the write/flush/fsync for the binlog.

The next step is to make sure that the binlog and InnoDB transaction log are written in the same order. This would otherwise be possible when prepare_commit_mutex is not used.  We use tickets to force the same order. A thread is assigned a ticket number when it prepares an InnoDB transaction by writing changes for that transaction to the InnoDB log. Threads write transaction changes to the binlog in ticket number order. After writing the changes to the binlog a thread will wait a small amount of time. During that wait another thread might do the binlog fsync thus saving this thread and possibly others from also doing an fsync. A server configuration variable, force_binlog_order, was added to determine whether this feature is enabled.

Unfortunately, all of the changes described thus far do not improve performance when there are concurrent threads trying to modify the same rows. The results below are from a sysbench read-write test with group commit enabled. TPS degrades at high-concurrency:

  16   32   64  128  256  384  512   #concurrent clients
 203  376  297  156   93   86   71   transactions per second

The problem is that the row locks are not released until the commit step is done (step 3 above). This means that the locks are not released until 2 fsyncs have been done (one for the InnoDB prepare step and one for the binlog write/flush/fsync step). We then added an server configuration variable, innodb_release_locks_early, to determine whether row locks are released during the prepare step. By doing this it is possible for other sessions to read changes that are not committed (if there is a crash after the InnoDB prepare step and before the binlog fsync, then that transaction will be rolled back during crash recovery). You should determine whether this is an issue.

The tests were repeated after the server included the change described above. Results are much better:

 16   32   64  128  256   512   #concurrent clients
203  376  297  156   93    71   transactions per second without change
               645   621  631   transactions per second with change

The final task was to make sure that this performs as expected for real workloads with lots of concurrency. Initial performance was disappointing. Vamsi discovered that there was too much mutex contention from the code that made the binlog write order match the InnoDB prepare order. The problem was that a single condition variable was used and broadcast was done to wake all threads waiting on it rather than the one thread that should next write the binlog. Vamsi fixed this by using an array of condition variables, making threads wait on (ticket# mod array_size), and then doing a broadcast only for one element of the array. This causes far fewer spurious wakeups.

If you want to read the source code for the changes then check out the change log for the Facebook patch for MySQL.

There are other changes as part of this feature:

Use caution while turning on group commit

As there are 3 variables controlling the feature, the order in which they are changed becomes important. This will be enabled and disabled dynamically as it should not be enabled when hot backups are done. If they are to be enabled dynamically, the suggested order when enabling group commit is:

Turning OFF innodb_prepare_commit_mutex before turning ON force_binlog_order could potentially cause some transactions to write to binlog in a different order than their commit order in transaction log.

The suggested order for dynamically disabling group commit is:

We added several tests for group commit:

[the content was cut off at this point]

No comments:

Post a Comment

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...