Changes to a single document are atomic in MongoDB, but what does that mean? One thing it means is that when reads run concurrently with an update to a document, the reads will see the document in either the pre-update or the post-update state. But there is more than one post-update state: the pre-fsync post-update state and the post-fsync post-update state.
In the pre-fsync post-update state the change record for the update has been written to the journal, but the journal has not been forced to disk. If a client reads the document in this state and the server then reboots before the group commit fsync is done, the document might be restored to the pre-update state during crash recovery, even though that client already read the post-update state. Note that if the only failure is a mongod process failure, then the document will be restored to the post-update state.
Is it a problem for a client to observe a transaction that is undone? That is application dependent. AFAIK there is no way to avoid this race with MongoDB, as the per-db write lock is released before waiting for the group commit fsync. There are other write concern options that block until the write is acknowledged by replicas. I assume the race exists in that case too, meaning the update is visible on the primary to others before a slave ack has been received.
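The race can be sketched with a toy model. This is not MongoDB code. `ToyServer` and its methods are hypothetical names I made up for illustration, and the model only assumes what is described above: the change becomes visible once the write lock is released, the journal record survives a process crash because it was already written, and it survives a host crash only after fsync.

```python
class ToyServer:
    """Toy model of visibility vs. durability (hypothetical, not MongoDB internals)."""

    def __init__(self):
        self.data = {}             # what readers see
        self.journal_buffer = []   # journal records written but not yet fsynced
        self.durable_journal = []  # journal records that survive a host crash

    def update(self, key, value):
        # Conceptually done while holding the write lock.
        self.data[key] = value
        self.journal_buffer.append((key, value))
        # The lock is released here, before any wait for the group commit
        # fsync, so readers can already see the new value.

    def read(self, key):
        return self.data.get(key)

    def group_commit_fsync(self):
        # Force everything written so far to disk.
        self.durable_journal.extend(self.journal_buffer)
        self.journal_buffer.clear()

    def crash_and_recover(self, host_crash):
        if not host_crash:
            # Process-only failure: the OS still has the written journal
            # records, so they eventually reach disk.
            self.group_commit_fsync()
        # Host failure: records that were never fsynced are lost.
        self.journal_buffer.clear()
        # Recovery rebuilds the data from the durable journal.
        self.data = {}
        for key, value in self.durable_journal:
            self.data[key] = value


server = ToyServer()
server.update("doc", "v1")
server.group_commit_fsync()
server.update("doc", "v2")       # pre-fsync post-update state
observed = server.read("doc")    # another client reads "v2"
server.crash_and_recover(host_crash=True)
after_recovery = server.read("doc")  # back to "v1"
```

In this model the reader observed `v2`, yet after the host crash the document is back at `v1`. With `host_crash=False` the same sequence recovers to `v2`.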
I think this behavior should be described in the MongoDB documentation on write concerns or elsewhere. I am not sure it has been.
While reading the code might make this obvious (that the per-db write lock is released before the group commit wait), I used a testcase to convince myself. This is easy to see with a version of MongoDB modified to allow huge values for journalCommitInterval. I set it to 300000 and then used one client to do an insert with j:1 and w:1, so the insert (via pymongo) blocked until a journal fsync was done.
MongoDB uses group commit and an fsync is done every journalCommitInterval milliseconds (think of this as the fsync train leaving every X milliseconds). But when there is an operation that requested an fsync, an express train arrives in no more than journalCommitInterval/3 milliseconds, which is 100000 milliseconds with my setting. In another window I used the mongo client to query the data in the collection, and the newly inserted document was there long before the fsync was done.
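A rough sketch of that testcase in Python follows. It is not the script I ran. The helper names `make_collections` and `observe_race`, the URI, and the database and collection names are all assumptions, and it uses the current pymongo API (`with_options`, `WriteConcern`, `insert_one`, `find_one`). It needs a running mongod, and the gap is only easy to notice when the journal commit interval is large. Results on other versions or storage engines may differ.

```python
import threading
import time


def make_collections(uri="mongodb://localhost:27017", db="test", coll="race_demo"):
    """Return (writer, reader) handles. Requires pymongo and a running mongod."""
    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    base = MongoClient(uri)[db][coll]
    # j=True asks the server to acknowledge only after the journal is synced.
    writer = base.with_options(write_concern=WriteConcern(w=1, j=True))
    return writer, base


def observe_race(writer, reader, doc_id, timeout=10.0, poll=0.01):
    """Insert with the writer in a thread while polling with the reader.

    Returns (seconds until the reader saw the document,
             seconds until the journaled insert was acknowledged).
    Either value is None if it did not happen.
    """
    start = time.monotonic()
    acked = {}

    def do_insert():
        writer.insert_one({"_id": doc_id})
        acked["at"] = time.monotonic() - start

    t = threading.Thread(target=do_insert)
    t.start()

    seen_at = None
    while time.monotonic() - start < timeout:
        if reader.find_one({"_id": doc_id}) is not None:
            seen_at = time.monotonic() - start
            break
        time.sleep(poll)

    t.join()  # wait for the journaled insert to be acknowledged
    return seen_at, acked.get("at")
```

Calling `observe_race(*make_collections(), doc_id=1)` against a server set up as described should return a first value that is much smaller than the second if the race is present, meaning the reader saw the document before the j:1 insert was acknowledged.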
MySQL has two races like this but they can be avoided. InnoDB has a similar race when innodb_flush_log_at_trx_commit=2, and that can be avoided by setting the option to 1. Commit latency is lower with the option set to 2, which is fine when the application can tolerate the race. Regular semi-sync replication for MySQL always has the race: committed changes are visible on the primary to others before a slave ack has been received. That race is avoided by using enhanced semi-sync.
I have created a ticket to track this issue:
https://jira.mongodb.org/browse/DOCS-2908
Thanks for bringing it to our attention.