Saturday, March 22, 2014

A few comments on MongoDB and InnoDB replication

MongoDB replication has something like InnoDB fake changes built in. It prefetches all documents to be changed while holding a read lock before trying to apply any changes. I don't know whether the read prefetch extends to indexes. That question has now been added to my TODO list. Using fake changes to prefetch on MySQL replicas for InnoDB worked better than everything that came before it because it prefetched any index pages that were needed for index maintenance. Then we made it even better by making sure to prefetch the sibling pages of b-tree leaf pages when pessimistic changes (changes not limited to a single page) might be done (thanks Domas). Hopefully InnoDB fake changes can be retired with the arrival of parallel replication apply. This is an example of the challenge of running a DBMS at web scale: we can't always wait, and many of our solutions are in production 3 to 5 years before the proper fix arrives from the upstream vendor. I am not complaining as this gives us many interesting problems to help solve.
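To make the prefetch idea concrete, here is a minimal sketch in Python. It is a toy model, not MongoDB or InnoDB code: the `BufferPool` class, the `apply_batch` function and the dict standing in for disk are all hypothetical names I made up, and locking is left out. It only shows that touching every needed document first moves the disk reads out of the apply phase.

```python
class BufferPool:
    """Toy cache in front of a dict that stands in for disk (hypothetical)."""

    def __init__(self, disk):
        self.disk = disk
        self.cache = {}
        self.disk_reads = 0

    def get(self, key):
        # A cache miss costs one simulated disk read.
        if key not in self.cache:
            self.disk_reads += 1
            self.cache[key] = dict(self.disk[key])
        return self.cache[key]


def apply_batch(pool, batch, prefetch=True):
    """Apply (key, fields) changes; return disk reads done during the apply phase."""
    if prefetch:
        # Phase 1: read-only pass that pulls every needed document into cache.
        # A real server would do this while holding only a read lock.
        for key, _ in batch:
            pool.get(key)
    reads_before = pool.disk_reads
    # Phase 2: apply the changes. A real server would hold a write lock here,
    # so any disk read in this phase stalls other work.
    for key, fields in batch:
        pool.get(key).update(fields)
    return pool.disk_reads - reads_before


disk = {i: {"v": 0} for i in range(5)}
batch = [(1, {"v": 10}), (3, {"v": 30})]
cold = apply_batch(BufferPool(disk), batch, prefetch=False)
warm = apply_batch(BufferPool(disk), batch, prefetch=True)
```

With prefetch the apply phase does no disk reads in this model. Index pages are not modeled at all, which is the part that made fake changes work well for InnoDB.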

I was curious whether there are opportunities for inconsistency when multi-document operations are replayed on a replica. This states that the oplog records per-document changes rather than one entry representing the multi-document write. This makes the replay order on the replica match the order on the primary and avoids problems when concurrent multi-document changes are interleaved on the primary. This is similar in spirit to using row-based replication in MySQL. I expect it to also avoid inconsistencies when an error occurs halfway through a multi-document change on the primary, as the oplog would only contain changes made prior to the error.
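A small Python sketch of the per-document logging described above. This is my own toy model and not MongoDB's oplog format: the entry layout, `multi_update`, `replay` and the `fail_after` knob are hypothetical. It assumes each entry records the resulting field values for one document, and it simulates an error part-way through a multi-document update.

```python
import copy


def multi_update(collection, oplog, predicate, new_fields, fail_after=None):
    """Update every matching document, logging one oplog entry per document."""
    changed = 0
    for doc_id in sorted(collection):
        doc = collection[doc_id]
        if not predicate(doc):
            continue
        if fail_after is not None and changed == fail_after:
            # Simulated error half-way through the multi-document change.
            raise RuntimeError("simulated failure")
        doc.update(new_fields)
        # Only changes that were actually applied reach the log.
        oplog.append({"id": doc_id, "set": dict(new_fields)})
        changed += 1
    return changed


def replay(replica, oplog):
    """Apply the logged per-document changes in log order."""
    for entry in oplog:
        replica[entry["id"]].update(entry["set"])


primary = {i: {"status": "new"} for i in range(1, 5)}
replica = copy.deepcopy(primary)
oplog = []

try:
    multi_update(primary, oplog, lambda d: d["status"] == "new",
                 {"status": "done"}, fail_after=2)
except RuntimeError:
    pass  # the primary is left with a partial update

replay(replica, oplog)
```

In this model the replica ends up with the same partial result as the primary because it never re-runs the predicate; it only applies the two logged per-document changes.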

1 comment:

  1. There have been a lot of fixes to avoid non-idempotent replicated events, especially with the increment option and others [ https://jira.mongodb.org/browse/SERVER-6671 ]. Compare consistency (similar to pt-table-checksum) after tens of millions of documents are loaded and findandmodify operations are run; there should still be some edge cases, or you may find documents missing altogether on some members. TokuMX's approach is different and seeks to avoid these edge cases. findandmodify is also interesting because of DBclient.

    Impressive work on digging into the internals!

    ~Alexis

