Tuesday, December 10, 2019

Historical - Transactional Replication

This is a post about work done by Wei Li at Google to make MySQL replication state crash safe. Before this patch it was easy for a MySQL storage engine and replication state to disagree after a crash. Maybe it didn't matter as much for people running MyISAM because that too wasn't crash safe. But so many people had to wake up late at night to recover from this problem which would manifest as either a duplicate key error or silent corruption. The safe thing to do was to not restart a replica after a crash and instead restore a new replica from a backup.

Wei Li spent about 12 months fixing MySQL replication adding crash safety and several other features. That was an amazing year from him. I did the code reviews. My reviews were weak. MySQL replication code was difficult back then.

I got to know Domas Mituzas after he extracted this feature from the Google patch to use for Wikipedia. I was amazed he did this and he continued to make my life with MySQL so much better at Google and then Facebook. When I moved to Google I took too long to port this patch for them. My excuse is that Domas wasn't yelling enough -- there were many problems and my priority list was frequently changing.

This post was first shared at code.google.com which has since shutdown. This feature was in production around 2007, many years before something similar was provided by upstream. I can't imagine doing-web-scale MySQL without it. The big-3 problems for me back then were: lack of monitoring, replication and InnoDB on many-core.

Introduction

Replication state on the slave is stored in two files: relay-log.info and master.info. The slave SQL thread commits transactions to a storage engine and then updates these files to indicate the next event from the relay log to be executed. When the slave mysqld process is stopped between the commit and the file update, replication state is inconsistent and the slave SQL thread will duplicate the last transaction when the slave mysqld process is restarted.

Details

This feature prevents that failure for the InnoDB storage engine by storing replication state in the InnoDB transaction log. On restart, this state is used to make the replication state files consistent with InnoDB.

The feature is enabled by the configuration parameter rpl_transaction_enabled=1. Normally, this is added to the mysqld section in /etc/my.cnf. The state stored in the InnoDB transaction log can be cleared setting a parameter and then committing a transaction in InnoDB. For example:
set session innodb_clear_replication_status=1;
create table foo(i int) type=InnoDB;
insert into foo values (1);
commit;
drop table foo;
Replication state is updated in the InnoDB transaction log for every transaction that includes InnoDB. It is updated for some transactions that don't include InnoDB. When the replication SQL thread stops, it stores its offset in InnoDB.

The Dream

We would love to be able to kill the slave (kill -9) and have it always recover correctly. We are not there yet for a few reasons:
  • We don't update the state in InnoDB for some transactions that do not use InnoDB
  • DDL is not atomic in MySQL. For *drop table* and *create table* there are two steps: create or drop the table in the storage engine and create or drop the frm file that describes the table. A crash between these steps leaves the storage engine out of sync with the MySQL dictionary.
  • Other replication state is not updated atomically. When relay logs are purged, the files are removed and then the index file is updated. A crash before the index file update leaves references to files that don't exist. Replication cannot not be started in that case. Also, the index file is not updated in place rather than atomically (write temp file, sync, rename). 


No comments:

Post a Comment

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...