Wednesday, April 22, 2015

Much ado about nothing

Kyle does amazing work with Jepsen and I am happy that he devotes some of his skill and time to making MongoDB better. This week a new problem was reported: stale reads despite the use of the majority write concern. Let me compress the bug report and blog post for you. But first, this isn't a code bug. This is expected behavior for async replication.

  1. MongoDB implements asynchronous master-slave replication
  2. Commits can be visible to others on the master before oplog entries are sent to the slave
The problem occurs as follows. Transaction 1 commits a change on the master and transaction 2 views that change on the master. Then the master disappears and a slave is promoted to be the new master, but the oplog entries from transaction 1 never reached any slave. At this point transaction 1 didn't happen and won't happen on the remaining members of the replica set, yet transaction 2 viewed that change. A visible commit has been lost.
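
The sequence can be sketched with a toy model in Python. This is not MongoDB code: the Node class and lost_visible_commit function are hypothetical names used only for illustration, and the model assumes one master, one slave, and a failover that happens before any oplog entry is shipped.

```python
class Node:
    """Toy replica set member: a key-value store plus an oplog."""

    def __init__(self):
        self.data = {}
        self.oplog = []

    def apply(self, entry):
        # Applying an entry makes it visible to local readers immediately.
        key, value = entry
        self.data[key] = value
        self.oplog.append(entry)


def lost_visible_commit():
    master, slave = Node(), Node()

    # Transaction 1 commits on the master. The change is visible there
    # right away, before the oplog entry is sent to the slave.
    master.apply(("x", 1))

    # Transaction 2 reads the change on the master.
    seen_by_txn2 = master.data.get("x")

    # The master disappears before shipping its oplog and the slave is
    # promoted. The entry from transaction 1 never arrived.
    new_master = slave
    after_failover = new_master.data.get("x")

    return seen_by_txn2, after_failover
```

In this model transaction 2 sees the value, while the promoted node has no record of it, which is the lost visible commit described above.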

When reading MongoDB source I noticed this in early 2014. See my post on when MongoDB makes a transaction visible. I even included a request to update the docs for write concerns and included this statement:
I assume the race exists in that case too, meaning the update is visible on the primary to others before a slave ack has been received.
This isn't a bug; this is async replication. You can fix it by adding support for sync replication. The majority write concern doesn't fix it because it only determines when to acknowledge the commit. It does not determine when to make the commit visible to others. For now the problem might be in the documentation, if it wasn't clear about this behavior. The majority write concern is a lot like semisync replication in MySQL. Clever people then added lossless semisync replication so that commits aren't visible on the master until they have been received by a slave. Finally, really clever people got lossless semisync replication running in production and we were much happier.
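
To make the difference between acknowledging a commit and making it visible concrete, here is a minimal sketch of the lossless idea in Python. It is a toy model, not the MySQL or MongoDB implementation: LosslessMaster, commit, and replicate are hypothetical names, and the slave is modeled as a plain dict.

```python
class LosslessMaster:
    """Toy master that hides a commit until a slave has received it."""

    def __init__(self, slave):
        self.visible = {}   # what readers on the master can see
        self.pending = []   # committed locally, not yet on a slave
        self.slave = slave  # assumption: the slave is just a dict here

    def commit(self, key, value):
        # The change is recorded but not yet exposed to readers.
        self.pending.append((key, value))

    def replicate(self):
        # Ship pending entries; each becomes visible on the master
        # only after the slave has it.
        for key, value in self.pending:
            self.slave[key] = value
            self.visible[key] = value
        self.pending.clear()
```

With this ordering, a reader on the master cannot observe a change that would vanish if the master disappeared and the slave were promoted.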
