Monday, October 26, 2020

Overloading the meaning of "copy-on-write" for index structures

I previously used copy-on-write to explain B-Tree index structures that don't overwrite pages during writeback. One benefit of copy-on-write is avoiding the risk of partial page writes as that has overhead including write-amplification. InnoDB uses a doublewrite buffer to recover from partial page writes. Postgres writes full pages to the redo log once per checkpoint. 

This can be called page-based copy-on-write (page-based CoW) and there are at least two types - CoW-R and CoW-S. CoW-R overwrites dead versions of pages and -R means it is likely to do small random writes. CoW-S writes pages to the tail of a log file, -S means sequential IO, and then background garbage collection is needed to reclaim space from old log segments. WiredTiger and LMDB are examples of CoW-R. Amazon Aurora for MySQL might be an example of CoW-S, although the B-Tree (InnoDB) isn't aware that it is CoW.

Row-based Copy-on-Write

But copy-on-write can also be used to explain what happens at the row level on inserts, updates and deletes. Lets call this row-based copy-on-write (row-based CoW). For InnoDB, changes to primary key index entries, where the base row is stored, are update-in-place (UiP) while changes to secondary indexes are CoW. For Postgres, changes to the base row and index entries are CoW except when the HOT optimization is used.

When row-based CoW is used then something must eventually remove the old versions of rows and/or index entries when they are no longer visible to an open or future snapshot. For an MVCC storage engine that something is called MVCC garbage collection (MVCC GC). For a log-structured merge tree (LSM) that something is called compaction. MVCC GC is called purge by InnoDB and vacuum by Postgres.

InnoDB uses update-in-place for updates to the base row stored in the PK index. But delete processing is special in that the base row is delete-marked (a flag is set in the row header) and purge will eventually remove delete-marked rows to reclaim space.


  1. MariaDB 10.5 includes some minor optimizations related to this.

    The doublewrite buffer will be bypassed for those pages that were fully initialized via redo log records since the last time the page was written out (MDEV-19738). For example, if a table were emptied, we would reinitialize the root page and we would directly write the page to the data file. On recovery, thanks to changes that were originally made in MDEV-12699 to make crash recovery more robust, we would avoid reading pages that can be initialized via the redo log records.

    Another optimization in MariaDB 10.5 is that pages that have been freed will not be written out (MDEV-15528). Optionally, if a ‘scrubbing’ feature is enabled or if page_compressed tables are being used, we would punch holes or zero out the contents in the data file. Also these writes will skip the doublewrite buffer. Recovery will avoid reading any pages that it determines to have been freed, based on the new FREE_PAGE record.

    When it comes to MVCC and locking in secondary indexes, I think that per-record transaction identifiers (MDEV-17598) would solve many problems in InnoDB. But the copy-on-write aspect would remain.

    1. It would be useful to have more MVCC related metrics. Two counters to understand the secondary index issue are:
      1) number of secondary index entries read
      2) number of times he PK index had to be consulted to determine visibility

      There are many interesting changes to InnoDB in MariaDB. It is unlikely to be me, but I hope someone does evaluations of IO efficiency including write-amplification using the "right" benchmark.