Tuesday, July 14, 2020

Review: The Design of the Postgres Storage System

This is a review of The Design of the Postgres Storage System. The paper was published in VLDB 1987, although my review used a version of the paper that might differ from the one in VLDB (which is appropriate, given MVCC). This is also the first in a series of posts on MVCC GC. My standard disclaimer is that I don't review papers that aren't worth reading.

The paper is about POSTGRES, not modern PostgreSQL. POSTGRES was using POSTQUEL at the time and the paper has sample POSTQUEL statements. Something I like very much about the paper is the use of simple estimates to explain the design decisions.

1 Introduction

The abstract and introduction provide four reasons for the research:
  1. Optionally archive all previous versions of rows on separate storage, as WORM optical disks were likely to arrive on the market.
  2. Don't use a WAL (redo log). The benefits include instant recovery, less code complexity and better support for archiving.
  3. Use multiple processes to benefit from multi-processor systems that were likely to arrive and were in development at UC Berkeley and elsewhere.
  4. Use non-volatile RAM (NVRAM) to boost performance.
I assume that the first point (archive all previous versions) was the big deal. While this feature might not have been an immediate success, it has turned into a big deal in production systems today. Atlas Data Lake from MongoDB is one example. Another popular way to archive all versions is via change data capture (CDC), storing the older versions in a data warehouse separate from the OLTP system.

2.1 Transaction System

POSTGRES used 40-bit transaction IDs and stated that was sufficient for about 320 years of operation -- an estimate that implies roughly 100 TPS, as 2^40 XIDs at 1 TPS would last about 35,000 years. The POSTGRES transaction log uses 2 bits per transaction -- there is no redo log. Logically the log is a large array indexed by XID. The bits represent the status of a transaction: committed, aborted or in progress. The XID is assigned at transaction start. Commit is done by changing the bits in the log to committed, forcing the log page to stable storage and forcing modified database pages to stable storage. Stable storage is either magnetic disk or NVRAM.
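
The log is simple enough to sketch. Here is a minimal Python illustration of the 2-bit log; the three states and the commit rule are from the paper, while the bytearray packing and all of the names are mine:

    IN_PROGRESS, COMMITTED, ABORTED = 0, 1, 2  # 2 bits encode 3 states

    class TransactionLog:
        def __init__(self, max_xid):
            self.bits = bytearray(max_xid // 4 + 1)  # four 2-bit entries per byte

        def set_status(self, xid, status):
            byte, slot = divmod(xid, 4)
            mask = 0b11 << (slot * 2)
            self.bits[byte] = (self.bits[byte] & ~mask) | (status << (slot * 2))

        def status(self, xid):
            byte, slot = divmod(xid, 4)
            return (self.bits[byte] >> (slot * 2)) & 0b11

    # Commit flips the bits to committed, then forces the log page and all
    # modified database pages to stable storage (magnetic disk or NVRAM).
    log = TransactionLog(max_xid=1_000_000)
    log.set_status(42, COMMITTED)
    assert log.status(42) == COMMITTED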

The log tail is the part of the log from the XID of the oldest active transaction to the present and requires 2 bits per XID. The body is the rest of the log (transactions older than the oldest active transaction). As all transactions in the log body are either committed or aborted, only 1 bit per XID is needed for the log body.

The goal is to keep the log tail in NVRAM and the log body cached in memory. The log body is read-only, the log tail is not. Both can be searched to determine whether a row is visible to a snapshot and the goal is to avoid disk reads in that search. The paper also explains that a bloom filter can be created on the XIDs of aborted transactions to avoid keeping the log body in memory.
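
The bloom filter also lends itself to a sketch. Because nearly every transaction in the log body committed, a filter over just the aborted XIDs answers most checks from memory: a miss means committed for sure, a hit means the log body must be consulted. This is my own illustration rather than code from the paper, and the sizes and hash choice are arbitrary:

    import hashlib

    class AbortedXidFilter:
        def __init__(self, nbits=1 << 20, nhashes=3):
            self.nbits, self.nhashes = nbits, nhashes
            self.bits = bytearray(nbits // 8)

        def _positions(self, xid):
            for i in range(self.nhashes):
                digest = hashlib.sha1(f"{i}:{xid}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.nbits

        def add(self, xid):  # called when a transaction aborts
            for p in self._positions(xid):
                self.bits[p // 8] |= 1 << (p % 8)

        def maybe_aborted(self, xid):  # False means committed for sure
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(xid))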

Modern PostgreSQL uses 32-bit transaction IDs and wraparound is a source of problems. Other differences are that modern PostgreSQL has a redo log, doesn't force modified pages to stable storage on commit and doesn't (yet) try to take advantage of NVRAM.

I expect that POSTGRES had worse write-amplification than a system that didn't force dirty pages on commit. But I am unlikely to run the insert benchmark to confirm this. Besides, LMDB does FORCE on commit and has many happy users.

2.2 Relation Storage

The per-row metadata includes:

  • OID - system-assigned unique ID

  • Xmin, Xmax - the XID that starts, ends the version

  • Tmin, Tmax - commit time of XID from Xmin, Xmax

  • Cmin, Cmax - ID of command that starts, ends the version. This is 1-byte so there could be at most 256 commands (statements?) per transaction.

  • PTR - pointer to older or newer version of the row (explained below)
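
A rough Python rendering of that header is below. The field names are from the paper; the representation, including treating unset fields as None, is mine:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class RecordHeader:
        oid: int                     # system-assigned unique ID
        xmin: int                    # XID that created this version
        cmin: int                    # command ID within that transaction (1 byte)
        xmax: Optional[int] = None   # XID that ended this version, if any
        cmax: Optional[int] = None   # command ID that ended this version
        tmin: Optional[int] = None   # commit time of xmin, can be filled in lazily
        tmax: Optional[int] = None   # commit time of xmax
        ptr: Optional["RecordHeader"] = None  # link to an adjacent version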


Modern PostgreSQL has similar per-row metadata. The differences are that in PostgreSQL the XID is 32 bits, there is only one field for command ID, there is a 6-byte tuple ID (TID) and the OID is usually not used for user tables.


Fields were set as follows:

  • On insert the OID, Xmin and Cmin were set. Tmin was not set because commit had yet to occur.

  • On update Xmax and Cmax were set to end the row version and a new version of the row was inserted (hopefully to the same page). The new version reused the OID of the ended version and the PTR for the new version pointed to the ended version.

  • On delete Xmax and Cmax were set.

To use less space, updates only stored the fields that changed and the other fields were found by following the PTR chain (a singly-linked list). The oldest version of a row was called the anchor point. The notion of an anchor point and update (delta) chain is similar to the current support for Heap Only Tuples (HOT) in modern PostgreSQL. I wonder if that is a feature that was removed in early PostgreSQL and later brought back for a different reason.
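
The chain walk can be sketched as follows. The scheme (full row at the anchor point, changed columns only in newer versions) is from the paper; the structure and names are made up by me:

    class Version:
        def __init__(self, columns, ptr=None):
            self.columns = columns  # only the columns this version stored
            self.ptr = ptr          # next-older version, ending at the anchor point

    def materialize(version, needed):
        # Walk from a delta record toward the anchor point, taking the most
        # recent stored value of each needed column.
        row, remaining = {}, set(needed)
        while version is not None and remaining:
            for col in list(remaining):
                if col in version.columns:
                    row[col] = version.columns[col]
                    remaining.discard(col)
            version = version.ptr
        return row

    anchor = Version({"id": 1, "name": "alice", "balance": 100})
    v2 = Version({"balance": 90}, ptr=anchor)  # the update changed one column
    assert materialize(v2, ["name", "balance"]) == {"name": "alice", "balance": 90}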

2.3 Time Management


This section shows the logic required to determine whether a version is visible to a query. The check is more complicated than what InnoDB and RocksDB require, but I assume the CPU overhead is not that different from what occurs in modern PostgreSQL, and in my testing of modern PG this isn't an issue. The logic includes a check of the transaction log to determine whether the transaction from Xmin or Xmax committed. That check wouldn't be needed if the commit timestamp were written into the row on commit -- but doing that is non-trivial and can hurt performance. The need to check the transaction log also means that the searched parts of the log must remain in memory or there will be disk reads. The ability to keep that in memory is explained in section 2.1. I am wary of the ability to keep the log in memory for high-TPS systems but this is a problem they didn't need to solve at the time.
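
A simplified rendering of the visibility test, with my own names: a version is visible at query time T if its creator committed at or before T, and it was not ended by a transaction that also committed at or before T. Here status() is the 2-bit log lookup sketched in section 2.1 and commit_time() stands in for the TIME relation of section 2.4:

    COMMITTED = 1  # as in the transaction log sketch above

    def visible(hdr, T, txn_log, commit_time):
        if txn_log.status(hdr.xmin) != COMMITTED or commit_time(hdr.xmin) > T:
            return False             # creator aborted, in progress, or too new
        if hdr.xmax is None:
            return True              # the version was never ended
        if txn_log.status(hdr.xmax) != COMMITTED:
            return True              # the delete/update did not commit
        return commit_time(hdr.xmax) > T  # ended, but after our snapshot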


2.4 Concurrency Control and Timestamp Management


POSTGRES contains a TIME relation that has the commit time for each transaction. This has 32 bits per XID and is updated on commit. The tail of TIME should be stored in stable main memory to avoid forcing a disk page on commit.


Relations are marked by the user as no-archive, light-archive or heavy-archive. Tmin and Tmax are never set for no-archive relations and I assume old versions for them are not moved to the archive. For light-archive, old versions are moved to the archive but Tmin/Tmax are not set, to avoid the overhead of searching the transaction log to determine their status. For heavy-archive, the reader (a query) will look up the commit time in the TIME relation and update Tmin/Tmax (thus making a page dirty). Vacuum sets Tmin/Tmax for heavy-archive when moving older versions to the archive. It is possible that the thing (query, vacuum) that searches TIME will be delayed by disk reads.
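
My guess at the heavy-archive read path is sketched below (this is not code from the paper). The first reader pays for the TIME lookup and dirties the page so that later readers do not. RecordHeader is the sketch from section 2.2:

    def resolve_tmin(hdr, time_relation, page):
        if hdr.tmin is None:
            hdr.tmin = time_relation[hdr.xmin]  # may need a disk read of TIME
            page.dirty = True                   # the write-back dirties the page
        return hdr.tmin

    class Page: dirty = False

    page = Page()
    hdr = RecordHeader(oid=1, xmin=7, cmin=0)
    assert resolve_tmin(hdr, {7: 12345}, page) == 12345 and page.dirty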


2.5 Record Access


On each page there is a line table with an entry per anchor point record. Each secondary index entry points to a line table entry. On update, a secondary index only needs maintenance if the indexed columns have been changed.


Modern PostgreSQL uses the name line pointer. Also, modern PostgreSQL does maintenance for all secondary indexes unless no indexed columns have changed. So if there are 3 secondary indexes and an update changes a column used by 1 of them, then maintenance is done for all of them -- unless HOT is used. If no indexed columns have changed then the Heap Only Tuples (HOT) optimization is used: the new version is added to the end of the update chain and secondary index entries reference the line pointer for the head of the update chain. Quoting from the HOT document:

Without HOT, every version of a row in an update chain has its own index entries, even if all indexed columns are the same. With HOT, a new tuple placed on the same page and with all indexed columns the same as its parent row version does not get new index entries.
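
That rule is easy to express in code. A sketch with made-up names: HOT applies only when no indexed column changed, otherwise every index on the table gets a new entry:

    from collections import namedtuple

    Index = namedtuple("Index", ["name", "columns"])

    def indexes_to_update(changed_columns, indexes):
        if not any(changed_columns & idx.columns for idx in indexes):
            return []         # HOT: new tuple joins the update chain, no index work
        return list(indexes)  # otherwise all indexes get entries for the new version

    indexes = [Index("by_name", {"name"}), Index("by_email", {"email"})]
    assert indexes_to_update({"balance"}, indexes) == []    # HOT update
    assert len(indexes_to_update({"email"}, indexes)) == 2  # both maintained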

3.1 Vacuuming the disk


POSTGRES had a command to trigger vacuum of a relation. The example was vacuum rel-name after "30 days". This reclaims space from aborted transactions and moves old versions to the archive. Old versions for relations marked as light-archive and heavy-archive are moved to archive storage. If heavy-archive is set for the relation then vacuum will set Tmin/Tmax if unset. Differences between POSTGRES and modern PostgreSQL include:

  • Vacuum in modern PostgreSQL doesn't move older versions to an archive. It does reclaim space for versions that have been deleted and are no longer visible. It also sets bits in the visibility map and does work to avoid transaction ID wraparound.

  • Vacuum did a full scan of the relation in POSTGRES while modern PostgreSQL only checks pages that require vacuum, courtesy of the visibility map.

  • Vacuum in modern PostgreSQL does a full index scan for every secondary index of the vacuumed relation when there are rows to remove.

3.2 Archival Medium


The target archival media was optical WORM. While WORM might not have been a huge hit, CD-R and DVD-R were a big deal for a long time. Zip drives were a big deal for a shorter time and now we have USB thumb drives. Maybe WORM will return in the form of ultra-low-endurance NAND flash SSDs that support only one device write.


The paper also explained interesting ways to manage secondary indexes using both magnetic disk and archive devices with plans for R-trees to support efficient time-bounded queries.


3.3 Vacuum Process


Vacuum in POSTGRES did:

  1. Write archive record and index entries

  2. Write new anchor point in current database, insert new index entries

  3. Reclaim space from old anchor point and delta records
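
The ordering is what makes crashes tolerable, so here is my sketch of it, with plain lists standing in for the archive and the magnetic disk relation:

    archive, main_store = [], []  # stand-ins for archive and magnetic disk storage

    def vacuum_one(old_versions, new_anchor):
        archive.extend(old_versions)   # 1. write archive records and index entries
        main_store.append(new_anchor)  # 2. write new anchor point, insert index entries
        for v in old_versions:         # 3. reclaim old anchor point and deltas
            main_store.remove(v)

    # A crash after step 1 or 2 leaves a version in both stores, never in
    # neither, so correctness only requires eliminating duplicates.
    main_store.extend(["v1", "v2"])
    vacuum_one(["v1"], "v1-new-anchor")
    assert main_store == ["v2", "v1-new-anchor"] and archive == ["v1"]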


This sequence wasn't crash safe but POSTGRES did the right thing in spite of crashes. Crashes could leave duplicate records, with a copy of the same version in both the archive and the main store. But POSTGRES was relational and eliminated such duplicates. I explain differences with modern PostgreSQL above in section 3.1.


5.1 [Performance Comparison] Assumptions


One nit I have with the paper is the argument that CPU is not a critical resource. It listed a few reasons for this -- CPUs were getting much faster than disk, multi-processors were coming, co-processors could be used and custom logic could be used. While the CPU-disk speed gap was growing, the paper ignored that RAM density was also growing quickly, so many DBMS applications would become less IO-bound in the future.


Another nit is that the paper ignores the overhead from vacuum. Vacuum doesn't just use CPU. It reads from the vacuumed relations and dirties pages in them. Accounting for that overhead would be complicated, and the focus of the paper was on simple performance models, which is part of what made it a nice paper to read.

