For Postgres and MySQL/InnoDB, an index-only query plan that uses a covering secondary index might still have to fetch the base row for some of the entries read from that index. Most of the time index-only really is index-only, but in some cases there will still be fetches of the indexed row, and power users should know when that can happen.
This doesn't happen with MySQL/MyRocks.
In both cases the root cause is recent transactions: MVCC visibility cannot always be decided from the index entry alone. The issue for Postgres is well-documented and widely known -- the visibility map. I won't explain it because experts have done a better job than I can. I don't know whether Postgres has performance counters to monitor the frequency of this, although EXPLAIN ANALYZE does report a Heap Fetches count for index-only scans.
One interesting variant of this with Postgres was for insert-only workloads that I encountered with the Insert Benchmark. The problem was that prior to PG 13 autovacuum wasn't triggered by insert-only workloads, so visibility-map bits were not set for pages full of new inserts and index-only scans weren't really index-only for deployments that relied on autovacuum. PG 13 addressed this by triggering autovacuum on inserts (see autovacuum_vacuum_insert_threshold).
The result is that for some entries read from the secondary index during an index-only query plan, Postgres must use the tuple id to fetch the row from the heap-organized table. This can result in unexpected random IO.
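To show what a covering-index ("index-only") plan looks like, here is a minimal sketch using SQLite, chosen only because it runs anywhere. SQLite has no visibility map, so this does not reproduce the Postgres heap-fetch fallback described above; it just shows a plan that can be answered entirely from a secondary index. The table and index names are invented for illustration.

```python
import sqlite3

# Build a small table with a secondary index on (a, b) that covers
# queries touching only those two columns.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (pk INTEGER PRIMARY KEY, a INT, b INT)")
con.execute("CREATE INDEX t_a_b ON t (a, b)")
con.executemany("INSERT INTO t VALUES (?, ?, ?)",
                [(i, i % 10, i * 2) for i in range(100)])

# The query reads only columns (a, b), so the plan can stay in the index.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT b FROM t WHERE a = 5").fetchall()
print(plan[0][3])  # e.g. "SEARCH t USING COVERING INDEX t_a_b (a=?)"
```

If the query instead selected a column not in the index (say a third payload column), the plan would have to visit the base table for every matching index entry -- the non-covering case.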
The issue for InnoDB is less widely known and not well documented. I don't think there are performance counters to monitor the frequency of this. Long ago, when I supported InnoDB in production, I added some to the FB patch.
The best doc I found in a quick search is this slide deck (thanks Marko from MariaDB) and I will quote parts from it:
- secondary indexes have a PAGE_MAX_TRX_ID
- PK index entries have a DB_TRX_ID per entry, secondary index entries do not
- See here for a description of DB_TRX_ID
- Secondary indexes may contain multiple (indexed_col, pk) records pointing to a single pk, one for each value of indexed_col
- All versions except at most one must be delete-marked
- If a secondary index record is delete-marked, MVCC must look up pk in the PRIMARY index and attempt to construct a version that matches indexed_col, to determine if (indexed_col, pk) exists in the read view
- If the PAGE_MAX_TRX_ID is too new, for each record in the secondary index leaf page, we must look up pk
- For old read views, secondary index scan can be very expensive!
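The decision described in the slides can be sketched as a toy function. This is not InnoDB code -- all names and the read-view encoding here are invented for illustration -- but it captures the two triggers from the list above: a delete-marked record, or a leaf page whose PAGE_MAX_TRX_ID is too new for the reader.

```python
# Toy sketch (not InnoDB code) of when a secondary-index read can avoid
# a clustered-index (PK) lookup. Names are invented for illustration.

def needs_pk_lookup(page_max_trx_id, read_view_low_limit, delete_marked):
    """Return True if MVCC forces a clustered-index lookup.

    page_max_trx_id: newest transaction id that modified this leaf page.
    read_view_low_limit: transaction ids >= this are invisible to the reader.
    delete_marked: whether the secondary index record is delete-marked.
    """
    if delete_marked:
        # A delete-marked record may still be visible to an old read view;
        # only the clustered index (plus undo log) can decide.
        return True
    if page_max_trx_id >= read_view_low_limit:
        # The page was changed by a transaction the reader cannot see, so
        # every record on the page must be verified via the PK index.
        return True
    return False  # index-only read is safe

# A fresh read view (high low-limit) can use the index directly ...
print(needs_pk_lookup(page_max_trx_id=90, read_view_low_limit=100,
                      delete_marked=False))   # False
# ... but an old read view must fall back to PK lookups for the whole page.
print(needs_pk_lookup(page_max_trx_id=90, read_view_low_limit=50,
                      delete_marked=False))   # True
```

The page-level check is why old read views hurt so much: one new transaction touching a leaf page forces PK lookups for every record on that page, not just the modified one.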
While the InnoDB docs have some description of this, they aren't as clear as the slide deck above.
When a secondary index column is updated, old secondary index records are delete-marked, new records are inserted, and delete-marked records are eventually purged. When a secondary index record is delete-marked or the secondary index page is updated by a newer transaction, InnoDB looks up the database record in the clustered index. In the clustered index, the record's DB_TRX_ID is checked, and the correct version of the record is retrieved from the undo log if the record was modified after the reading transaction was initiated.

If a secondary index record is marked for deletion or the secondary index page is updated by a newer transaction, the covering index technique is not used. Instead of returning values from the index structure, InnoDB looks up the record in the clustered index.
As an aside on RocksDB (the engine beneath MyRocks): the reason that tombstones frequently are not removed until they reach the max level of the LSM tree is that the removal check chooses to be fast and has many false positives. During compaction, when a tombstone is found, if its key is covered by the key ranges of SSTs in larger levels then it is not dropped, because the key might exist in those larger levels. The number of false positives could be reduced, at the cost of more CPU and possibly some IO, by probing the bloom filters in those larger levels after confirming overlap.
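The fast check amounts to a key-range overlap test. A minimal sketch, with invented names and sorted string keys standing in for real SST metadata:

```python
# Toy sketch (invented names) of the fast tombstone-drop check described
# above: during compaction, a tombstone may be dropped only when its key
# cannot exist in any larger level, judged by SST key ranges alone.

def can_drop_tombstone(key, larger_level_sst_ranges):
    """larger_level_sst_ranges: (smallest_key, largest_key) pairs for SSTs
    in all levels below the compaction output level."""
    for smallest, largest in larger_level_sst_ranges:
        if smallest <= key <= largest:
            # The key *might* exist there -- a false positive is possible
            # because the range can cover the key even when the key itself
            # is absent -- so the tombstone must be kept.
            return False
    return True

ranges = [("a", "f"), ("m", "r")]
print(can_drop_tombstone("g", ranges))  # True: no larger-level SST covers "g"
print(can_drop_tombstone("c", ranges))  # False: covered, possibly falsely
```

The proposal in the text maps to adding a second step inside the overlap branch: probe the overlapping SSTs' bloom filters before giving up, trading CPU (and possibly IO to load filter blocks) for fewer retained tombstones.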