There is an opportunity for confusion when describing replication including physical vs logical and synchronous vs asynchronous. I prefer describing replication as physical, statement or document/row rather than logical vs physical and briefly define them below.
In physical replication the log records the changes to the database pages (change bytes starting at offset X on page Y from this sequence to that sequence). A benefit of this approach is that the database pages will be the same between primary and replica. This approach also avoids the overhead (more CPU than IO) of evaluating the queries that lead to the page changes. A problem from this approach is that the pages need to be the same between the primary and replica. Changes to pages must be deterministic, long running reporting queries on a replica might block replication apply, MVCC GC requires coordination, etc.
In statement based replication (SBR) the log records the statements that modified the database. The benefit from this approach is that it is easier to implement. When one statement changes many rows then this also reduces the size of the log. But for OLTP statements are less likely to change many rows. Unfortunately long running write statements on the primary will be repeated on the slave and that can waste CPU and cause replication lag. It is also difficult to detect dependencies between transactions which makes it harder to replay in parallel on a replica. I did web-scale MySQL for many years with SBR. It worked, but the move to RBR is a good thing.
In document/row replication (aka RBR or DBR) the log has one entry per changed doc/row. This avoids the need for the replica to reevaluate the statement that generated the changes -- it can just fetch the docs/rows by ID and apply changes to them. When the doc/row entries in the log also include PK (_id) values then it is possible to determine which transactions can be replayed in parallel. It is possible to make these changes idempotent in contrast to SBR which reduces the burden of making replay crash safe.
MongoDB oplog
This is an example of the oplog contents for MongoDB. This is document based based on the terminology above:
db.bar.insertMany([{a:1}, {a:2}, {a:3}])
use local
db.oplog.rs.find({"o.msg": {$ne: "periodic noop"}},
{op:1, o:1}).sort({$natural:-1}).limit(3)
{ "op" : "i", "o" : { "_id" : ObjectId("..."), "a" : 3 } }
{ "op" : "i", "o" : { "_id" : ObjectId("..."), "a" : 2 } }
{ "op" : "i", "o" : { "_id" : ObjectId("..."), "a" : 1 } }
Subscribe to:
Post Comments (Atom)
Evaluating vector indexes in MariaDB and pgvector: part 2
This post has results from the ann-benchmarks with the fashion-mnist-784-euclidean dataset for MariaDB and Postgres (pgvector) with conc...
-
This provides additional results for Postgres versions 11 through 16 vs Sysbench on a medium server. My previous post is here . The goal is ...
-
I often use HWE kernels with Ubuntu and currently use Ubuntu 22.04. Until recently that meant I ran Linux 6.2 but after a recent update I am...
-
I am trying out a dedicated server from Hetzner for my performance work. I am trying the ax162-s that has 48 cores (96 vCPU), 128G of RAM a...
Hi Mark, thanks for this post that clarifies the different type of replications. There is one thing I can think of adding which is important for physical replication in the context of MySQL (and InnoDB), but may also apply to other database. In physical replication, the unit of replication is a "change", not a transaction. This means that a transaction that rolls-back needs to be replicated. Also, physical replication makes replicating some InnoDB optimization more complicated: thinking about the Change Buffer, there are many problems about replicating this as the state of the Merge is usually different on each nodes (I think there are similar problems with the purge).
ReplyDeleteCheers, JFG
Background work is an interesting problem. For InnoDB that is purge and the change buffer. For Postgres that is vacuum. For RocksDB that is compaction. Maybe page writeback is also on the list for InnoDB and Postgres.
DeleteI think Alibaba added shared storage for MyRocks so the primary and secondaries could share one copy of the database and they had to figure out how to schedule compaction without removing versions of data in use by some replicas.