- Optionally archive all previous versions of rows on separate storage as WORM optical disks were likely to arrive on the market.
- Don't use a WAL (redo log). The benefits include instant recovery, less code complexity and better support for archiving.
- Use multiple processes to benefit from multi-processor systems that were likely to arrive and were in development at UC Berkeley and elsewhere.
- Use non-volatile RAM (NVRAM) to boost performance.
Modern PostgreSQL uses 32-bit transaction IDs and wraparound is a source of problems. Other difference are that modern PostgreSQL has a redo log, doesn't force modified pages to stable storage on commit and doesn't (yet) try to take advantage of NVRAM.
I expect that POSTGRES had worse write-amplification then a system that didn't force dirty pages on commit. But I am unlikely to run the insert benchmark to confirm this. Besides, LMDB does FORCE on commit and has many happy users.
The per-row metadata includes:
OID - system-assigned unique ID
Xmin, Xmax - the XID that starts, ends the version
Tmin, Tmax - commit time of XID from Xmin, Xmax
Cmin, Cmax - ID of command that starts, ends the version. This is 1-byte so there could be at most 256 commands (statements?) per transaction.
PTR - pointer to older or newer version of the row (explained below)
Modern PostgreSQL has similar per-row metadata. The differences are that in PostgreSQL the XID is 32 bits, there is only one field for command ID, there is a 6-byte tuple ID (TID) and the OID is usually not used for user tables.
Fields were set:
On insert the OID, Xmin and Cmin were set. Tmin was not set because commit had yet to occur.
On update Xmax and Cmax were set to end the row version and a new version of the row was inserted (hopefully to the same page). The new version reused the OID of the ended version and the PTR for the new version pointed to the ended version.
On delete Xmax and Cmax were set.
2.3 Time Management
This section shows the logic required to determine whether a version is visible to a query. The check is more complicated than what InnoDB and RocksDB require, but I assume the CPU overhead is not that different than what occurs in modern PostgreSQL and in my testing of modern PG this isn't an issue.
The logic includes a check of the transaction log to determine whether the transaction from Xmin or Xmax committed. That check wouldn't be needed if the commit timestamp were written into the row on commit -- but doing that is non-trivial and can hurt performance.
The need to check the transaction log also means that the searched parts of the log must remain in memory or there will be disk reads. The ability to keep that in memory is explained in section 2.1. I am wary of the ability to keep the log in memory for high TPS systems but this is a problem they didn't need to solve at the time.
2.4 Concurrency Control and Timestamp Management
POSTGRES contains a TIME relation that has the commit time for each transaction. This has 32 bits per XID and is updated on commit. The tail of TIME should be stored in stable main memory to avoid forcing a disk page on commit.
Relations are marked by the user as no-archive, light-archive or heavy-archive. Tmin and Tmax are never set for no-archive relations and I assume old versions for them are not moved to the archive. For light-archive, old versions are moved to the archive but Tmin/Tmax are not set to avoid the overhead of doing a search of the transaction log to determine their status. For heavy-archive the reader (a query) will lookup the commit time from the TIME relation and update Tmin/Tmax (thus making a page dirty). Vacuum sets Tmin/Tmax for heavy-archive when moving older versions to the archive. It is possible that the thing (query, vacuum) that searches TIME will be delayed by disk reads.
2.5 Record Access
On each page there is a line table with an entry per anchor point record. Secondary index entry points to line table entry. On update a secondary index only needs maintenance if the indexed columns have been changed.
Modern PostgreSQL uses the name line pointer. Also modern PostgreSQL does secondary index maintenance for all secondary indexes unless no indexed columns have changed. So if there 3 secondary indexes and an update changes a column used by 1 of them then maintenance is done for all of them -- unless HOT is used. If no indexed columns have changed then the Heap Only Tuples (HOT) optimization is used and the new version is added to the end of the update chain and secondary index entries reference the line pointer for the head of the update chain. Quoting from the HOT document:
Without HOT, every version of a row in an update chain has its own index entries, even if all indexed columns are the same. With HOT, a new tuple placed on the same page and with all indexed columns the same as its parent row version does not get new index entries.
3.1 Vacuuming the disk
POSTGRES had a command to trigger vacuum of a relation. The example was vacuum rel-name after "30 days". This reclaims space from aborted transactions and moves old versions to the archive. Old versions for relations marked as light-archive and heavy-archive are moved to archive storage. If heavy-archive is set for the relation then vacuum will set Tmin/Tmax if unset. Differences between POSTGRES and modern PostgreSQL include:
Vacuum in modern PostgreSQL doesn't move older versions to an archive. It does reclaim space for versions that have been deleted and are no longer visible. It also sets bits in the visibility map and does work to avoid transaction ID wraparound.
Vacuum did a full scan of the relation in POSTGRES while modern PostgreSQL only checks pages that require vacuum courtesy of the visibility map
Vacuum in modern PostgreSQL does a full index scan for every secondary index of the vacuumed relation when there are rows to remove.
3.2 Archival Medium
The target archival media was optical WORM. While WORM might not have been a huge hit CD-R and DVD-R were a big deal for a long time. Zip drives were a big deal for a shorter time and now we have USB thumb drives. Maybe WORM will return in the form of ultra-low-endurance NAND flash SSDs that support only one device write.
The paper also explained interesting ways to manage secondary indexes using both magnetic disk and archive devices with plans for R-trees to support efficient time-bounded queries.
3.3 Vacuum Process
Vacuum in POSTGRES did:
Write archive record and index entries
Write new anchor point in current database, insert new index entries
Reclaim space from old anchor point and delta records
This wasn't crash safe but POSTGRES did the right thing in spite of crashes. Crashes could leave duplicate records with a copy of the same version in both the archive and main store. But POSTGRES was relational and eliminated such duplicates.
I explain differences with modern PostgreSQL above in section 3.1.
5.1 [Performance Comparison] Assumptions
One nit I have with the paper is the argument that CPU is not a critical resource. It listed a few reasons for this -- CPUs were getting much faster than disk, multi-processors were coming, co-processors could be used and custom logic could be used. While the CPU-disk speed gap was growing the paper ignored that RAM density was growing quickly and many DBMS applications would be less IO-bound in the future.
Another nit is that the paper ignores the overhead from vacuum. Vacuum doesn't just use CPU. It reads from the vacuumed relations and dirties pages in them. Accounting for that overhead would be complicated and the focus of the paper was on simple performance models, which made it a nice paper to read.
No comments:
Post a Comment