Wednesday, September 4, 2019

Tunable Consistency in MongoDB - reviewing a VLDB paper

This is a review of Tunable Consistency in MongoDB from VLDB 2019. It is worth reading and I appreciate that MongoDB has recently published several conference papers. I am not an expert on this topic. For expertise see Daniel Abadi, Kyle Kingsbury and Peter Bailis. Henrik can be added to the list with a few more blog posts.

MongoDB vs MySQL

MongoDB is a NoSQL DBMS that makes it easy to run sharded replicasets. While the NoSQL part of it is nice the sharded replicaset part of it is amazing. I hope that MySQL eventually gets similar support for sharded replicasets including readConcern and writeConcern options.

I previously compared MySQL semisync replication with MongoDB. With MongoDB the behavior for reads can be tuned separate from writes while MySQL combines them. In the MySQL implementation for lossless semisync a write is not locally visible until a replica acks. In the MongoDB implementation the write is committed locally and then replicated. The replicas apply writes as soon as possible without waiting for a distributed commit protocol. The key to making all of this work is controlling read visibility via an MVCC database engine courtesy of point-in-time reads.

The Review

Highlights:

  • With many options comes a need for more documentation and tutorials. While I appreciate splitting read and write semantics into separate options, the feature space is large and figuring this out is harder.
  • The paper states that the gold standard is linearizability. Well, this is a DBMS paper so maybe that should be changed to serializability.
  • I was confused by the difference between majority and linearizable reads. AFAIK snapshot reads are similar to linearizable (both wait for confirmation that the master really is the master) while majority reads don't have that extra wait.
  • I was confused by "The transaction commit operation accepts a write concern, which determines ... and its constituent read and write operations" because commit occurs after the reads so how could it effect them. As the paper promises, that is explained later.
  • MongoDB is a great case study in adding stronger consistency to an async replication system. It continues to use async replication, yet it now provides stronger consistency on demand.
  • I think that propagated means applied in the description of the w option for writeConcern. This means that a replica acks after applying a change -- either before or after making redo durable depending on the j option. AFAIK the more common option is to ack after shipping the change to the replicas log but before applying the change. However, I prefer what MongoDB does. Maybe commenters will correct my perception of what is more common.
  • To reduce write-write conflicts MongoDB uses the latest locally committed transaction as the point-in-time to do reads for read/write operations that use w:majority and for multi-statement transactions that use snapshots. The alternative was to use the older latest majority commit point-in-time. See section 5 from the paper. Therefore there is a wait before returning to the user for that locally committed timestamp to be committed to a majority of the replica set. This is true even for read-only transactions. So MongoDB can make reads wait. Obviously it can make writes wait before returning to the user doing the write for w:majority. An excellent CockroachDB blog post explains that it too can make reads wait while Spanner can make writes wait. 
  • Consistency is tunable in MongoDB. With writeConcern you can determine whether a write might wait. With readConcern you can determine whether a read might wait.
  • Some of the wait for reads is to confirm that the master from which the read has been done is still the master. I wonder if a master lease could have been used to avoid that wait at the cost of making failover slower. Which cost do you prefer?
  • Replication remains async. Writes are committed locally (and visible to others with the appropriate readConcern options) regardless of the write concern. These are shipped to replicas ASAP and applied by replicas ASAP. This means that a replica can apply a change that has to be undone during master failover because it was never committed to a majority. MongoDB has two ways to rollback that replica -- rollback WiredTiger to an older point in time or undo the extra replication events. Rollback sounds easy while undo is complicated.
  • Strongly consistent writes don't delay weakly consistent writes. A strongly consistent write is done locally then releases row locks then waits for replicas to ack.
  • MongoDB doesn't like long running transactions because WiredTiger must keep undo in memory to satisfy all active transactions. MongoDB kills snapshot reads that have been running longer than one minute. For non-snapshot reads it can advance the snapshot in the middle of the read operation. One side-effect of this is that you will have a hard time implementing a consistent logical backup tool like mysqldump.

1 comment:

  1. Copying what I wrote elsewhere...

    Assuming that MariaDB will not pursue the GR approach as that adds a lot of complexity per-engine and async replication already works, perhaps MariaDB will pursue the MongoDB approach of layering support for stronger consistency above async replication.

    That might also be useful for Spider and Vitess and could make MariaDB the preferred DBMS for Vitess. Something is needed to do cross-shard consistent reads.

    ReplyDelete

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...