Saturday, March 1, 2014

When does MongoDB make a transaction visible?

Changes to a single document are atomic in MongoDB, but what does that mean? One thing it means is that when reads run concurrent with an update to a document the reads will see the document in either the pre-update or the post-update state. But there are several post-update states. There is the pre-fsync post-update state and the post-fsync post-update state.

In the pre-fsync post-update state the change record for the update has been written to the journal but the journal has not been forced to disk. If a client were to read the document in this state and then the server rebooted before the group commit fsync was done then the document might be restored to the pre-update state during crash recovery. Note that if the only failure is a mongod process failure then the document will be restored to the post-update state. But another client might have read the post-update state prior to the server reboot. Is it a problem for a client to observe a transaction that is undone? That is application dependent. AFAIK there is no way to avoid this race with MongoDB as the per-db write lock is released before waiting for the group commit fsync. There are other write concern options that block until the write is acknowledged by replicas. I assume the race exists in that case too, meaning the update is visible on the primary to others before a slave ack has been received.

I think this behavior should be described in the MongoDB documentation on write concerns or elsewhere. I am not sure it has been.

While reading the code might make this obvious (that the per-db write lock is released before group commit wait) I used a testcase to convince myself. This is easy to see in MongoDB with a version modified to allow huge values for  journalCommitInterval. I set it to 300000 and then used one client to do an insert with j:1 and w:1 so the insert (via pymongo) blocked until a journal fsync was done. MongoDB uses group commit and fsync is done every journalCommitInterval milliseconds (think of this as the fsync train leaving every X milliseconds). But when there is an operation that requested an fsync then an express train arrives in no more than journalCommitInterval/3 milliseconds. In another window I used the mongo client to query the data in the collection and the newly inserted document was there long before the fsync was done.

MySQL has two races like this but they can be avoided. InnoDB has a similar race when innodb_flush_log_at_trx_commit=2, but that can be avoided by setting the option to 1. Commit latency is reduced when the application can tolerate the race. Semi-sync replication for MySQL always has the race. Committed changes are visible on the primary to others before a slave ack has been received. But the race is avoided by using enhanced semi-sync.

1 comment:

  1. I have created a ticket to track this issue:

    https://jira.mongodb.org/browse/DOCS-2908

    Thanks for bringing it to our attention.

    ReplyDelete

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...