Wednesday, April 12, 2023

GC for MVCC - MyRocks, InnoDB and Postgres

MVCC GC is hard -- see the amount of content devoted to Postgres vacuum. Perhaps Oracle does it best. A long-open snapshot can bring about the worst-case behavior for MVCC GC as Postgres vacuum and InnoDB purge cannot touch rows that are reclaimable when said rows have a commit time more recent than the oldest open snapshot. And by reclaimable here I mean to make the space used by that row or index entry available for reuse.

An example of the performance problems from long-open snapshots and OLTP workloads on InnoDB is here. In this case MyRocks does much better.

When InnoDB deletes a row it isn't removed immediately. The delete operation sets a delete-mark flag on the row and purge will eventually (usually quickly) reclaim the space for that row once there are no open snapshots from a point-in-time older than the commit time on the delete. Purge is blocked by long-open snapshots -- it cannot do any work for committed deletes that are more recent than the oldest open snapshot.

MyRocks is more clever. RocksDB compaction will keep only the versions needed for long-open snapshots. There is an open feature request for InnoDB to do the same, but I don't expect it to be implemented. But there is a catch for RocksDB (see below).

An example, if:

  • versions for a row were created at timestamp 1, 2, 3, ..., 9, 10
  • current timestamp is 11
  • long-open snapshot created at timestamp 5
  • In theory, only the versions at timestamps 5 and 10 are needed. At timestamp 5 for the long-open snapshot and at timestamp 10 for current reads.
  • Purge/vacuum can reclaim versions for timestamps 1 to 4 but not for timestamps 6 to 9 because they are blocked by the long-open snapshot
  • MyRocks compaction can reclaim versions for timestamps 1 to 4 and 6 to 9
The catch

The catch means that tombstones won't be dropped as early as you want them to be dropped. But fortunately, space used by tombstones is less than space used by non-tombstone KV pairs and non-tombstone KV pairs can be dropped prior to reaching the max level of the LSM tree. RocksDB also has a SingleDelete operation, which is used in some cases by MyRocks, and which allows for early dropping of tombstones.

The catch for MyRocks with RocksDB, and probably for other LSM implementations, is that compaction is unlikely to drop tombstones unless they have reached the max level/file of the LSM structure. The reason is that the tombstone can only be dropped when the KV pair for which it indicates deletion is known to not exist in the LSM tree. The check for known to not exist is cheap (low CPU, no IO) and just confirms that the LSM levels/files lower in the LSM structure do not overlap the checked key. This cheap check has many false positives. For example if the tombstone is in level 3 of the LSM tree, the tombstone is for key=5 and an SST in level 4 has keys for [1, 10) then that tombstone cannot be dropped.

A more expensive check would also check bloom filters if the key range overlapped. But that might be too much CPU overhead from compaction. The most accurate check would then do a search for the key if the bloom filter check showed the key might exist. But that would be extremely expensive during compaction.


  1. Optimistic locking solves this problem .......

    1. How?
      Which problem? This blog post mentions several.

      Regardless, it creates many problems for high-contention workloads.

  2. Rocks has compaction filters and some other related functions. These can’t be used by any chance to purge tombstones earlier?

    1. Compaction drops tombstones. Today it favors efficiency at the cost of not dropping many tombstones until they reach the max level. In theory it could be more clever, at the cost of more CPU and IO overheads, and drop them sooner.

  3. the real thing here is difference between B+ Tree and LSM disk structures.