Friday, August 19, 2022

When is a secondary index-only query not index-only for InnoDB and Postgres?

For Postgres and MySQL/InnoDB an index-only query plan that uses a covering secondary index might still require access to the base row for some of the index entries read from the secondary index. Most of the time index-only really is index-only. But in some cases there will still be fetches of the base row, and power users should know when this can happen.

This doesn't happen with MySQL/MyRocks.


One reason for this is recent write activity. The issue for Postgres is well-documented and widely known -- the visibility map. I won't explain it because experts have done a better job than I can. I don't know whether Postgres has server-wide counters for this, but EXPLAIN ANALYZE does report the number of heap fetches done by an index-only scan.

One interesting variant of this with Postgres was insert-only workloads, which I encountered with the Insert Benchmark. The problem was that prior to PG 13 autovacuum wasn't triggered by insert-only workloads (PG 13 added autovacuum_vacuum_insert_threshold), thus visibility map bits were not set for pages full of new inserts and index-only scans weren't really index-only for deployments that relied on autovacuum.

The result is that for some index entries read from the secondary index for an index-only query plan, Postgres might have to use the tuple id (TID) to fetch the row from the heap-organized table. This can result in unexpected random IO.
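The visibility-map check described above can be sketched as follows. This is a toy model, not Postgres source code: the names (VisibilityMap, index_only_scan, heap_fetches) are invented for illustration, and real tuple visibility checks are far more involved.

```python
# Toy model of how an "index-only" scan decides whether it must fetch
# the heap tuple. All names are invented for illustration.

class VisibilityMap:
    """One all-visible bit per heap page."""
    def __init__(self):
        self.all_visible = {}              # page number -> bool

    def is_all_visible(self, page):
        return self.all_visible.get(page, False)

def index_only_scan(index_entries, vm, heap):
    """index_entries: list of (key, tid) where tid = (page, offset).
    heap: maps tid -> whether that tuple is visible to our snapshot."""
    heap_fetches = 0
    results = []
    for key, (page, offset) in index_entries:
        if vm.is_all_visible(page):
            # Every tuple on this page is visible to all snapshots,
            # so the key from the index can be returned directly.
            results.append(key)
        else:
            # Fall back to the heap to check tuple visibility --
            # this is the hidden cost of an "index-only" plan.
            heap_fetches += 1
            if heap[(page, offset)]:
                results.append(key)
    return results, heap_fetches
```

The returned heap_fetches count corresponds to what EXPLAIN ANALYZE reports as Heap Fetches: pages whose all-visible bit is not set (e.g. pages full of recent inserts that autovacuum hasn't processed) force a heap access per index entry.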


The issue for InnoDB is less widely known and not well documented. I don't think there are performance counters to monitor the frequency of this. Long ago, when I supported InnoDB in production, I added some to the FB patch.

The best doc I found in a quick search is this slide deck (thanks Marko from MariaDB) and I will quote parts from it:

  1. secondary indexes have a PAGE_MAX_TRX_ID
  2. PK index entries have a DB_TRX_ID per entry, secondary index entries do not
  3. See here for a description of DB_TRX_ID
  4. Secondary indexes may contain multiple (indexed_col, pk) records pointing to a single pk, one for each value of indexed_col
  5. All versions except at most one must be delete-marked
  6. If a secondary index record is delete-marked, MVCC must look up pk in the PRIMARY index and attempt to construct a version that matches indexed_col, to determine if (indexed_col, pk) exists in the read view
  7. If the PAGE_MAX_TRX_ID is too new, for each record in the secondary index leaf page, we must look up pk
  8. For old read views, secondary index scan can be very expensive!

InnoDB looks up the base row by searching the clustered PK index using the PK columns stored in the secondary index entry.
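Points 5 through 7 above can be sketched as a toy model. This is not InnoDB source code: the field and function names are invented, though InnoDB's read view really does have a low water mark below which all transaction ids are visible.

```python
# Illustrative model of the PAGE_MAX_TRX_ID check. A secondary index
# leaf page can be read without PK lookups only when every transaction
# that modified the page is visible to the read view. Names invented.

def page_is_safe_for_read_view(page_max_trx_id, read_view_low_limit):
    """True when all modifications to the page are older than the read
    view's low water mark, so no per-record visibility check is needed."""
    return page_max_trx_id < read_view_low_limit

def scan_leaf_page(records, page_max_trx_id, read_view_low_limit):
    """records: list of (indexed_col, pk, delete_marked)."""
    pk_lookups = 0
    results = []
    for indexed_col, pk, delete_marked in records:
        if delete_marked or not page_is_safe_for_read_view(
                page_max_trx_id, read_view_low_limit):
            # Must construct the row version via the clustered PK index
            # (and undo log) to decide whether (indexed_col, pk) is
            # visible to this read view.
            pk_lookups += 1
        else:
            results.append((indexed_col, pk))
    return results, pk_lookups
```

This shows why old read views make secondary index scans expensive: when PAGE_MAX_TRX_ID is newer than the read view, every record on the page pays a PK lookup, not just the delete-marked ones.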

While the InnoDB docs have some description of this, they aren't as clear as the slide deck above. 

When a secondary index column is updated, old secondary index records are delete-marked, new records are inserted, and delete-marked records are eventually purged. When a secondary index record is delete-marked or the secondary index page is updated by a newer transaction, InnoDB looks up the database record in the clustered index. In the clustered index, the record's DB_TRX_ID is checked, and the correct version of the record is retrieved from the undo log if the record was modified after the reading transaction was initiated.

If a secondary index record is marked for deletion or the secondary index page is updated by a newer transaction, the covering index technique is not used. Instead of returning values from the index structure, InnoDB looks up the record in the clustered index.

That page also explains another detail about secondary indexes that many InnoDB users might not know. Secondary index entries are not updated in place: an update is implemented as delete-mark the existing entry, insert a new entry. Eventually purge will physically remove the delete-marked entry once it is no longer visible to open snapshots. Thus InnoDB secondary indexes risk bloat and churn, which is worse with long-open snapshots that block purge, because the delete-marked entries linger and can lead to excessive page splits.
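The delete-mark, insert, purge life cycle can be sketched with a toy model. The class and method names are invented for illustration and the snapshot bookkeeping is greatly simplified relative to real InnoDB.

```python
# Toy model of InnoDB secondary index maintenance: an update
# delete-marks the old entry and inserts a new one; purge removes a
# delete-marked entry only after no open snapshot can still see it.

class SecondaryIndex:
    def __init__(self):
        self.entries = []          # [indexed_col, pk, delete_marked_by]

    def insert(self, col, pk):
        self.entries.append([col, pk, None])

    def update(self, pk, old_col, new_col, trx_id):
        # Not an in-place update: delete-mark the old entry, then
        # insert a new entry for the new column value.
        for e in self.entries:
            if e[0] == old_col and e[1] == pk and e[2] is None:
                e[2] = trx_id
        self.entries.append([new_col, pk, None])

    def purge(self, oldest_open_snapshot):
        # A delete-marked entry can only be removed when the oldest
        # open snapshot started after the delete-marking transaction;
        # a long-open snapshot blocks purge and the entries linger.
        self.entries = [e for e in self.entries
                        if e[2] is None or e[2] >= oldest_open_snapshot]
```

With a long-open snapshot, repeated updates accumulate delete-marked entries that purge cannot remove, which is the bloat/churn risk described above.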

I wrote above that MyRocks doesn't have this problem, and that is true for the fetch-the-base-row issue. But it can have the bloat/churn problem from secondary index maintenance that I describe above for InnoDB. Fortunately the SingleDelete optimization, described in the VLDB paper, helps a lot to avoid that. SingleDelete allows for faster removal of tombstones, as normal (Delete) tombstones are frequently not removed until they reach the max level of the LSM tree, and MyRocks uses SingleDelete for secondary index maintenance.

The reason that tombstones are frequently not removed until reaching the max level of the LSM tree is that the removal check chooses to be fast and has many false positives. During compaction when a tombstone is found, if the key for the tombstone is covered by the key ranges of SSTs in larger levels, then it is not dropped because the key might exist in those larger levels. The number of false positives could be reduced at the cost of more CPU, and possibly some IO, if after confirming overlap the bloom filters in those larger levels were probed.
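The fast-but-conservative removal check can be sketched as follows. This is a simplified model, not RocksDB source code; the function name and SST representation are invented for illustration.

```python
# Sketch of the conservative tombstone-drop check during LSM
# compaction: a tombstone can only be dropped when no SST in a larger
# level has a key range covering its key. The check is by key range
# only, so a key that falls inside a range but doesn't actually exist
# there is a false positive that keeps the tombstone alive.

def can_drop_tombstone(key, larger_level_ssts):
    """larger_level_ssts: list of (smallest_key, largest_key) pairs,
    one per SST in all larger levels."""
    for smallest, largest in larger_level_ssts:
        if smallest <= key <= largest:
            # The key *might* exist below; keep the tombstone. Probing
            # that SST's bloom filter here would trim false positives
            # at the cost of extra CPU (and possibly IO).
            return False
    return True
```

At the max level there are no larger levels to overlap, so the check always passes and the tombstone is finally dropped -- which is why ordinary Delete tombstones often survive until then.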
