Sunday, October 11, 2015

Losing it?

Many years ago the MySQL team at Google implemented semi-sync replication courtesy of Wei Li. The use case for it was limited and the community was disappointed that it did not provide the semantics they wanted. Eventually group commit was implemented for binlog+InnoDB: first by the MySQL team at Facebook, then much better by MariaDB, and finally in upstream MySQL. With group commit some magic was added to give us lossless semisync and now we have automated, lossless and fast failover in MySQL without using extra replicas. This feature is a big deal. I look forward to solutions for MariaDB (via binlog server and MaxScale) and MySQL (via Fabric and Proxy).

MongoDB is ahead of MySQL in features that make scale-out easier to manage and the next release (3.2) adds a few more features and more robust code for existing features. I hope that some of the features that have been sacrificed in the name of scale-out will eventually arrive in a MongoDB release: per-shard transactions, per-shard joins, per-shard consistent read.

Behavior


Many times we describe algorithms when users need to understand behavior and then the user gets lost in the details. It is important for developers to understand when transactions can be lost and I will describe that with answers to two questions.
  1. After the changes from a commit are visible to a concurrent session under what conditions can that commit be lost?
  2. After a client has been informed that a commit has succeeded under what conditions can that commit be lost?


Details


I list four combinations of behavior to consider and the features in MySQL and MongoDB that provide them:


Editorial


MySQL uses one solution (lossless semisync) to protect against both loss of visible and acknowledged commits. With lossless semisync row locks are held on the master until at least one slave acknowledges the commit. This can reduce commit throughput as there is a network round trip between master and replica(s) before commit is finished and row locks are released. There is at one network round trip between commits to the same row. This overhead is reduced by moving the replica close to the master. In the solution described by Yoshinori the binlog archive provides the ack rather than using extra replicas in every data center and because the binlog archive doesn't require a lot of hardware it is easier to move it closer to the master.

In MongoDB total protection comes from two solutions -- majority read concern and write concern. The benefit is that majority write concern doesn't make commit slower assuming a workload has sufficient concurrency. It will add latency to each client doing a commit just like MySQL semisync. A different feature, majority read concern, protects against loss of a visible commit. However, there is the risk that a client that needs read-your-own write semantics will have to wait. At this point it isn't clear to me that MongoDB makes it easy to read your own writes. I wonder if applications that care more about performance will use majority read concern without using majority write concern. That isn't an option with MySQL.

It will take time to figure out the implications of the performance differences. With MySQL delays are added to writers as it takes them longer to get the through semi-sync commit protocol. With MongoDB delays might be added to readers as they wait for the majority read snapshot to be advanced.


Durable on a slave?


It can be important to understand how durable a change is on a replica when the replica acknowledges a transaction to the master. There are several options and I have not read enough recent MySQL or MongoDB documentation to determine whether there are options beyond durable in memory:
  1. durable in memory - the commit is buffered in memory on a replica before it acks. Many years ago when we implemented semi-sync for MySQL this was the only choice. I tend to think that durable in memory is an oxymoron, but I am a pessimist.
  2. durable in a log - the commit is durable in a log file on a replica before it acks. There has been talk that MySQL would 
  3. committed on the replica - the commit is applied on a replica before it acks. That guarantees read-your-writes semantics when that replica is queried soon after committing a change on a master. Alas this is also likely to create performance lag unless the replica uses many threads to apply changes concurrently, just like on the master. It also creates a window where a commit is visible on a replica before the master.

More editorial


MongoDB documentation has tended to be optimistic about the features provided by the software. I think this will be resolved as the community grows. There have been some interesting discoveries. Hopefully the gap between documented and actual behavior will be reduced over time.

The mmap engine releases the per-database or per-instance write lock before syncing the oplog even when durable writes are requested.  This is now described as read uncommitted, but read non-durable might be a better name because reads are still consistent but you can see changes from others before those changes are durable in the oplog. I wrote about this when reading code and the docs have been updated since then but I think their docs need more edits. This is only a problem for the mmap engine and multiple engines in MongoDB means they need to be clear about behavior for mmap versus WiredTiger.

There were too strong claims about the semantics of the majority write concern. It protects against the loss of an acknowledged commit but some docs suggested it protected against the loss of a visible commit. Aphyr, an expert in distributed systems testing, highlighted this problem in his Call Me Maybe series and a bug report. I wrote about part of the problem prior to that but I did not connect the problem with the too-strong claims in the documentation. Many years ago MySQL made a similar mistake when documenting semi-sync replication and fixed their docs after I filed a bug.

Documentation claimed that 2-phase commit was used to keep config servers in sync. That makes it more likely that commit is all-or-nothing for the 3 or 5 config servers hosting the same data. Alas it can lead to read-only mode when a server goes away. I read the code and the two phases were 1) ping all config servers and then if all responded 2) send the change to all config servers. If all servers did not respond with OK then manual intervention was required. This isn't 2 phase commit. Fortunately, something much better will be done for the 3.2 release and the docs have been updated.

Finally, read the excellent series of posts from Tokutek on replica set failover including the overview and posts one, two, three and four. There have been problems that haven't been widely known. Fortunately the 3.2 release of MongoDB should make things better.

3 comments:

  1. Nice summary Mark.
    I can see another implementation of "cannot lose visible, cannot lose acknowledged": shared storage (DRBD or filer with high durability enabled). This one comes with the drawback of suffering crash recovery on MySQL startup. When comparing "lossless semisync" to shared storage, the network round-trip on commit does not look that bad.

    ReplyDelete
    Replies
    1. I don't have any experience with DRBD. How far apart can the disk pair be so that I don't lose data when I lose a rack or consecutive racks?

      Delete
    2. DRBD has three modes. C requires a sync to disk on the remote node, and an ACK over the network. This is the slowest mode. B mode is like mysql semi-sync. The write has to make it into the buffer on B, and B has to send an ACK. In C mode, writes are async. There is no ack, and writes proceed immediately. This is normally a DR node.

      If using A or B, you probably want a serial cable (or whatever) for heartbeat, and be sure to configure STONITH.

      Only one side can mount the DRBD device at once, unless you use a cluster filesystem. Crash recovery is required after the system comes back up, so it can take awhile, which can really bites into the 9s

      Delete

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...