Wednesday, July 22, 2020

Sustaining high insert rates despite secondary indexes

If you want to sustain high insert rates in the presence of secondary indexes then there are two approaches at a high level. The first is to buy enough RAM to cache the entire database but that gets expensive. The second is to avoid storage reads during secondary index maintenance.

So now the problem shifts to methods for avoiding storage reads during secondary index maintenance and there are a few popular ways to do that.
  1. Partition the database so that only a few of the partitions get inserts at any point in time and then use local (per-partition) secondary indexes. If the partitions are small enough to fit in memory then traditional indexing (B-Tree) can be used and the per-partition secondary indexes will fit in memory for the partitions that are getting inserts. TimescaleDB is an example of this.
  2. Use an LSM, or something like an LSM, that supports read-free index maintenance for non-unique secondary indexes. MyRocks is an example of that.
  3. Don't use secondary indexes because the target workloads expect to scan large amounts of data.
The interesting thing about the first approach is that you end up with a tree per partition and then a query that uses the per-partition secondary index might have to search many trees (fanout or broadcast queries) unless there are other query predicates that enable enough partitions to be pruned.
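
Here is a minimal sketch of the first approach using PostgreSQL declarative partitioning (version 11 or later, where an index created on the parent cascades to the partitions). The table, column and index names are hypothetical:

CREATE TABLE events (ts timestamptz NOT NULL, device_id int, val float8)
PARTITION BY RANGE (ts);

CREATE TABLE events_2020_07 PARTITION OF events
FOR VALUES FROM ('2020-07-01') TO ('2020-08-01');

-- this creates a local (per-partition) secondary index on each partition
CREATE INDEX events_device ON events (device_id);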

Searching trees of trees sounds a lot like what an LSM does, which is a nice way of stating that the CPU overhead of an LSM for reads can be reproduced by the first approach. In an LSM with leveled compaction there is a sorted run (or tree) per level. However an LSM benefits from a bloom filter for point queries and might benefit from a prefix bloom filter for range queries. There are also several interesting research papers that show how to use something like a bloom filter for range queries -- see papers on SuRF and Rosetta. I am not sure whether systems that use the first approach (partition with local secondary indexes) also provide something like a bloom filter to match what an LSM can do. But if most partitions eventually become read-only then there are opportunities for being clever.

Updates
  • this post has more detail on avoiding reads in a write-optimized index structure -- RocksDB merge operator, InnoDB change buffer and more
  • I would put things like Postgres BRIN (min/max values per block) into the third category (do scans but more efficiently)

Postgres, vacuum and the insert benchmark

I have been running benchmarks with Postgres for a few months and am slowly learning enough about Postgres to trust the results that I get. When I first read about Postgres vacuum I didn't think it would be important for the insert benchmark because the workload is insert-only. This post explains why I was wrong.

Vacuum arrived in the original POSTGRES storage system in the 1980s. Today it serves several purposes: 1) reclaim space for old versions of rows that are no longer visible, 2) update statistics for the query planner, 3) update bits in the visibility map to help index-only queries and 4) prevent transaction ID wraparound.
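
As a rough sketch, those purposes map to VACUUM options in modern PostgreSQL like this (the table name t is hypothetical):

VACUUM t;            -- 1) reclaim space and 3) set visibility map bits
VACUUM (ANALYZE) t;  -- also 2) update statistics for the query planner
VACUUM (FREEZE) t;   -- aggressively freeze rows for 4) wraparound prevention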

I didn't expect vacuum to help much with the insert benchmark because the workload doesn't do updates or deletes and there is no space for vacuum to reclaim. Also, query plans were OK and there weren't enough transactions for wraparound to be an issue.

What I missed was the impact of the visibility map on the index-only range scans run during the insert benchmark. These bits are only set by vacuum, and if the bit is not set for a heap page then the base row must be read from that heap page during an index-only scan for any index entries that point to it. The impact from reading the base row is significant when the working set is not cached and still visible when the working set is cached.

Fortunately, the insert pattern of the insert benchmark seems to be friendly to Postgres: a heap page becomes full after some inserts and then gets no more inserts, so once the visibility map bit is set for that heap page it remains set.
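
One way to observe this is via EXPLAIN. This is a sketch that assumes a hypothetical table pi1 with a secondary index on column k:

VACUUM pi1;  -- sets visibility map bits for all-visible heap pages

EXPLAIN (ANALYZE, BUFFERS)
SELECT k FROM pi1 WHERE k BETWEEN 1000 AND 2000;

-- the plan should show an Index Only Scan with a "Heap Fetches" count,
-- and that count stays near zero only when visibility map bits are set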

Details

Autovacuum is triggered for a table based on the number of updates and deletes done to the table. So while an insert-only workload can benefit from vacuum when it also does range scans, the inserts won't trigger vacuum. I haven't read the code to confirm that inserts don't trigger autovacuum. The main docs aren't clear: the routine vacuuming section states that "When enabled, autovacuum checks for tables that have had a large number of inserted, updated or deleted tuples." But the docs for autovacuum_vacuum_threshold state that it is only updates and deletes: "Specifies the minimum number of updated or deleted tuples needed to trigger a VACUUM in any one table."
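
The docs describe the trigger condition as a formula over updated and deleted tuples. This sketch shows the relevant settings and a hypothetical per-table override (PostgreSQL 13 adds autovacuum_vacuum_insert_threshold so that inserts can also trigger autovacuum):

-- autovacuum runs for a table when:
--   dead tuples > autovacuum_vacuum_threshold
--                 + autovacuum_vacuum_scale_factor * reltuples
SHOW autovacuum_vacuum_threshold;     -- default 50
SHOW autovacuum_vacuum_scale_factor;  -- default 0.2

-- per-table override for a hypothetical table pi1
ALTER TABLE pi1 SET (autovacuum_vacuum_scale_factor = 0.05);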

I now run the insert benchmark as a sequence of steps: load some data with only PK indexes, create secondary indexes, load more data, then read+write. The read+write step does short range scans as fast as possible while the writes are rate-limited inserts with a target insert rate per second. As part of this I optionally run vacuum after creating the secondary indexes. Each read+write test runs for 1 hour and vacuum is started at the end of each hour. My scripts wait for vacuum to finish after creating the indexes but don't wait for it during the read+write tests; that doesn't matter much because vacuum is fast during the insert benchmark.
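
In SQL the steps look roughly like this. This is a sketch; the table, column and index names are hypothetical and the benchmark client does this programmatically:

-- l.i0: load with only the PK index
CREATE TABLE pi1 (pk bigint PRIMARY KEY, k bigint, v varchar(32));
-- ... inserts ...

-- create secondary indexes, then vacuum and wait for it to finish
CREATE INDEX pi1_k ON pi1 (k);
VACUUM pi1;

-- load more data, then read+write rounds; a VACUUM is started
-- (without waiting) at the end of each 1 hour read+write test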

Vacuum internals

This is a brief description and hopefully it is truthy. The key point is that vacuum can require a full scan of every index of the vacuumed table, and that can take a long time. This isn't an issue for the insert benchmark because it is insert-only and doesn't create old versions of rows, so there are no full index scans during autovacuum.

The Internals of Postgres website is a great place to start after reading the online Postgres docs if you want to learn more -- see chapter 5 and chapter 6. Note that chapter 6 of the Internals site lists step 3 before step 2 and I need to figure that out before I write too many more blog posts on this topic. There are many useful blog posts on this topic including this blog post.

I don't think vacuum in original POSTGRES did full index scans. I have read some of the discussion that explains why a full index scan is done today but I didn't save links to that. My preference is for vacuum to do index probes rather than full index scans to reduce the time between consecutive vacuums for the same table.

Regardless, let me explain some of the work that vacuum can do to make it obvious that vacuum for a table can take a long time, as it does when I run Linkbench.
  1. Scan heap pages for which visibility map bits are not set. From each heap page, if a row is an old version (from a delete or update) and is not visible to any current or future transaction, copy the row's CTID (into an array?)
  2. If CTIDs were found in the previous step then scan all secondary indexes for this table and for each index entry determine whether its CTID is in the array from step 1. If yes then reclaim the space for that index entry.
  3. Repeat step 1 but this time reclaim the space for the dead rows.
Finally, the memory used to buffer CTIDs is determined by maintenance_work_mem. If that limit is reached then the steps above (including the full index scans) are repeated.
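
A recent PostgreSQL exposes vacuum progress via pg_stat_progress_vacuum. In this sketch, an index_vacuum_count larger than 1 means the dead-CTID array filled maintenance_work_mem and the index scans were repeated:

SELECT relid::regclass AS table_name, phase,
       index_vacuum_count, num_dead_tuples, max_dead_tuples
FROM pg_stat_progress_vacuum;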


Friday, July 17, 2020

Review of -- TimescaleDB: SQL made scalable for time-series data

This is a short review of TimescaleDB: SQL made scalable for time-series data. My focus is on indexing and I ignore many interesting things described in that paper. My goal in reading it was to understand possible optimizations for workloads like the insert benchmark v3. I also read some of the online TimescaleDB docs. This blog post is truthy as I am not an expert on TimescaleDB. Hopefully I haven't made too many mistakes. The paper is several years old and TimescaleDB has improved a lot since then.

The paper does a great job asserting what they know to be true about the workloads for which TimescaleDB is targeted.

Assertion 1:
Time-series data is largely immutable. New data continually arrives, typically corresponding to the latest time periods. In other words, writes primarily occur as new inserts, not as updates to existing rows. Further, while the database needs to be able to support backfill for delayed data, writes are made primarily to recent time intervals.
Assertion 2:
Workloads have a natural partitioning across both time and space. Writes typically are made to the latest time interval(s) and across the “partitioning key” in the space dimension (e.g., data sources, devices, users, etc.). Queries typically ask questions about a specific time series or data source, or across many data sources constrained to some time interval. Yet the queries might not be limited to a particular metric, but may regularly select multiple metrics at once (or use predicates that rely on multiple metrics).

Storage
Rows are stored in hypertables rather than tables and hypertables are composed of chunks. A hypertable must have a time column and can have a partition column. The paper states that the partition column is required for clustered TimescaleDB. But the online docs have more nuance.

Chunks are right-sized (their name for it) to fit in memory. Without a partition column the hypertable is range partitioned on time into chunks. With the partition column the data is distributed by hash on the partition column and then range on the time column. In SQL DBMS partitioning terminology this is composite partitioning with: hash(partition), range(time).
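
A minimal sketch of that composite scheme, assuming the TimescaleDB extension is installed. The table and column names are hypothetical:

CREATE TABLE metrics (time timestamptz NOT NULL, device_id int,
                      metric_id int, value float8);

-- range partition on time, hash partition on device_id into 4 space partitions
SELECT create_hypertable('metrics', 'time', 'device_id', 4);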

But this isn't traditional partitioning because it is dynamic and automatic. Nobody has to run DDL to add, drop and change partitions. That is one way they add value.

Compression has been added since the paper was published. It is described here. I have yet to read that post but assume that old chunks are compressed while chunks still getting inserts remain in row format.

Secondary indexes are local to a chunk. I assume that means fanout over (too) many chunks can happen when a query doesn't have a predicate on the partition column. But it also means that if the chunk fits in memory, then secondary index maintenance is not delayed by reading from storage and it is easier to sustain high ingest rates.

Challenges

What is a good indexing strategy for a simple time series workload where new data has four attributes: time, deviceID, metricID, metricValue? I write about this in my post on a replacement for the insert benchmark. Assume that I will use a partition column in addition to the time column. My choices with TimescaleDB are:
  1. partition on deviceID and create local secondary indexes on metricID
  2. partition on metricID and create local secondary indexes on deviceID
I am not sure what I would do to make the min/max query fast as explained in the section on Physical Schema - Read Optimized.
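
As a sketch, choice 1 would look like this, reusing the hypothetical metrics hypertable from above. The CREATE INDEX produces a local index on each chunk:

-- partition on deviceID, local secondary index on metricID
SELECT create_hypertable('metrics', 'time', 'device_id', 4);
CREATE INDEX metrics_metric_id ON metrics (metric_id, time DESC);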

Edits

I quickly read the post on columnar compression and I am impressed they made this work in the context of Postgres. They use TOAST to store compressed columnar chunks off-page so IO is only done for the target columns. In terms of being surprised that this was feasible, it reminds me of the work to support columnar in SQL Server.

Wednesday, July 15, 2020

Indexing and write-heavy workloads

When I see impressive numbers for the insert rate that a DBMS can sustain I wonder what indexes exist and whether the inserts are in sequential or random order with respect to each index. One way to explain this is in terms of the number of points in the index at which the inserts occur, although I use streams rather than insert points in what follows.

I am writing this in part so that I can reference this post in future performance reports when describing workloads. It isn't sufficient to state that inserts are in PK order. They can be in ascending or descending PK order. When ascending, the inserts can be done at the right end of the index (inserted keys > existing keys) or somewhere in the middle of the index. When descending, the inserts can be done at the left end of the index (inserted keys < existing keys) or somewhere in the middle of the index.

Explaining insert patterns

There are four attributes per index that can explain such insert patterns. The attributes are:
  • nAsc - number of streams for which inserts occur in ascending order WRT the index
  • nDesc - number of streams for which inserts occur in descending order WRT the index
  • nLHS - the number of descending streams that are at the left end of the index 
  • nRHS - the number of ascending streams that are at the right end of the index
Constraints:
  • nAsc >= 0, nDesc >= 0 and (nAsc + nDesc) >= 1
  • nLHS and nRHS must be 0 or 1
  • if nLHS is 1 then nDesc must be >= 1
  • if nRHS is 1 then nAsc must be >= 1
There is one exception. When the insert pattern is random WRT the index then inf is used instead of the four attributes.

Geek Code

This is a geek code for explaining insert patterns. The attributes are specified per index. When there is only a PK index, named pk, and inserts occur in PK order at the right end of the index (right growing) then the geek code is:
pk=(nAsc:1, nDesc:0, nLHS:0, nRHS:1)
When there is only a PK index but the inserts are in random order WRT the PK then the geek code is:
pk=(inf)
To improve readability I omit attributes for which the value is 0. So these mean the same thing:
pk=(nAsc:1, nDesc:0, nLHS:0, nRHS:1)
pk=(nAsc:1, nRHS:1)

Why?

I am interested in this for three reasons. First, index maintenance has a big impact on insert performance whether or not the working set is in memory. Second, there are optimizations that a DBMS can do for some insert patterns and I suspect there is room for even more optimizations. Many storage engines optimize for right-growing inserts. In that case RocksDB with leveled compaction will have write amplification of 2 -- write once for the WAL, write again for the memtable flush, no compaction. Finally, this makes it easier to explain write-heavy workloads.

Streams and insert points

I use ordered arrays rather than indexes to explain streams (insert points). Assume the array starts as [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]; it represents the keys in the PK index and there are no secondary indexes. Some of the examples show why I use streams to describe this.

Examples:
  • Random
    • pk=(inf)
    • insert sequence: 1.5, 6.5, 1.7, 8.1, 0.0, 4.5, ...
  • Right growing
    • pk=(nAsc:1, nRHS:1)
    • insert sequence: 10.0, 11.0, 12.0, ...
  • Left growing
    • pk=(nDesc:1, nLHS:1)
    • insert sequence: 0.0, -1.0, -2.0, ...
  • Left & right growing
    • pk=(nAsc:1, nDesc:1, nLHS:1, nRHS:1)
    • insert sequence: 10.0, 0.0, -1.0, 11.0, 12.0, -2.0
    • insert sequence as interleaved streams: [10.0, 11.0, 12.0] and [0.0, -1.0, -2.0]
  • 1 middle ascending
    • pk=(nAsc:1, nRHS:0)
    • insert sequence: 8.1, 8.11, 8.111, 8.1111, ...
  • 1 middle descending
    • pk=(nDesc:1, nLHS:0)
    • insert sequence: 7.9, 7.89, 7.889, 7.8889, ...
  • 1 middle ascending, 1 middle descending
    • pk=(nAsc:1, nDesc:1)
    • insert sequence: 8.1, 7.9, 8.11, 7.89, 8.111, 7.889, ... 
    • insert sequence as interleaved streams: [8.1, 8.11, 8.111] and [7.9, 7.89, 7.889]
  • 2 middle ascending:
    • pk=(nAsc:2)
    • insert sequence: 8.1, 6.1, 8.11, 6.11, 8.111, 6.111, ...
    • insert sequence as interleaved streams: [8.1, 8.11, 8.111] and [6.1, 6.11, 6.111]
  • N middle ascending
    • pk=(nAsc:N) for some finite value N

Explaining the insert benchmark

Until recently I ran the insert benchmark by first creating a PK index and 3 secondary indexes per table (or collection) and then doing inserts. Informally, the inserts were in PK order but random WRT each secondary index. More formally, the insert pattern is the following when the secondary indexes are named s1, s2 and s3:
pk=(nAsc:1, nRHS:1)
s1=(inf)
s2=(inf)
s3=(inf)
The insert benchmark can become extremely IO-bound because of the random insert patterns for each of the secondary indexes. In the worst case with a B-Tree there is one page read and one page written back per secondary index per insert (3 pages read, 3 pages written back with 3 secondary indexes).

I recently changed the way that I run the insert benchmark to create the PK index, load some data, create secondary indexes and then load more data. In this case the insert pattern during load some data is:
pk=(nAsc:1, nRHS:1)

And then during load more data (with secondary indexes in place) is:
pk=(nAsc:1, nRHS:1)
s1=(inf)
s2=(inf)
s3=(inf)

Tuesday, July 14, 2020

Review: The Design of the Postgres Storage System

This is a review of The Design of the Postgres Storage System. The paper was in VLDB 1987 although my review used a version of the paper that might differ from the one in VLDB (which is appropriate, given MVCC). This is also the first in a series of posts on MVCC GC. My standard disclaimer is that I don't review papers that aren't worth reading.

The paper is about POSTGRES, not modern PostgreSQL. POSTGRES was using POSTQUEL at the time and the paper has sample POSTQUEL statements. Something I like very much about the paper is the use of simple estimates to explain the design decisions.

1 Introduction

The abstract and introduction provide four reasons for the research:
  1. Optionally archive all previous versions of rows on separate storage, as WORM optical disks were likely to arrive on the market.
  2. Don't use a WAL (redo log). The benefits include instant recovery, less code complexity and better support for archiving.
  3. Use multiple processes to benefit from multi-processor systems that were likely to arrive and were in development at UC Berkeley and elsewhere.
  4. Use non-volatile RAM (NVRAM) to boost performance.
I assume that the first point (archive all previous versions) was the big deal. While this feature might not have been an immediate success it has turned into a big deal in production systems today. Atlas Data Lake from MongoDB is one example. Another popular way to archive all versions is via change data capture (CDC) to store the older versions in a data warehouse separate from the OLTP system.

2.1 Transaction System

POSTGRES used 40-bit transaction IDs and stated that was sufficient for 320 years of operation assuming 1 TPS. The POSTGRES transaction log uses 2 bits per transaction -- there is no redo log. Logically the log is a large array indexed by XID. The bits represent the status of a transaction: committed, aborted, in progress. The XID is assigned at transaction start. Commit is done by changing the bit in the log to committed, forcing the log page to stable storage and forcing modified database pages to stable storage. Stable storage is either magnetic disk or NVRAM.
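
A quick check of that arithmetic (mine, not from the paper):

$$2^{40} \approx 1.1 \times 10^{12} \qquad 320 \times 365 \times 86400 \approx 1.0 \times 10^{10}$$

So at 1 TPS the 40-bit XID space lasts far longer than 320 years. The 320-year figure corresponds to roughly 100 TPS, since $2^{40} / (100 \times 3.15 \times 10^{7}) \approx 349$ years.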

The log tail is the portion of the log from the XID of the oldest active transaction to the present and requires 2 bits per XID. The body is the rest of the log (transactions older than the oldest active transaction). As all transactions in the log body are either committed or aborted, only 1 bit per XID is needed for the log body.

The goal is to keep the log tail in NVRAM and the log body cached in memory. The log body is read-only, the log tail is not. Both can be searched to determine whether a row is visible to a snapshot and the goal is to avoid disk reads in that search. The paper also explains that a bloom filter can be created on the XIDs of aborted transactions to avoid keeping the log body in memory.

Modern PostgreSQL uses 32-bit transaction IDs and wraparound is a source of problems. Other differences are that modern PostgreSQL has a redo log, doesn't force modified pages to stable storage on commit and doesn't (yet) try to take advantage of NVRAM.

I expect that POSTGRES had worse write-amplification than a system that didn't force dirty pages on commit. But I am unlikely to run the insert benchmark to confirm this. Besides, LMDB does FORCE on commit and has many happy users.

2.2 Relation Storage

The per-row metadata includes:
  • OID - system-assigned unique ID
  • Xmin, Xmax - the XID that starts, ends the version
  • Tmin, Tmax - commit time of XID from Xmin, Xmax
  • Cmin, Cmax - ID of command that starts, ends the version. This is 1-byte so there could be at most 256 commands (statements?) per transaction.
  • PTR - pointer to older or newer version of the row (explained below)

Modern PostgreSQL has similar per-row metadata. The differences are that in PostgreSQL the XID is 32 bits, there is only one field for command ID, there is a 6-byte tuple ID (TID) and the OID is usually not used for user tables.

Fields were set:
  • On insert the OID, Xmin and Cmin were set. Tmin was not set because commit had yet to occur.
  • On update Xmax and Cmax were set to end the row version and a new version of the row was inserted (hopefully to the same page). The new version reused the OID of the ended version and the PTR for the new version pointed to the ended version.
  • On delete Xmax and Cmax were set.

To use less space, updates only stored fields that changed and the other fields were found by following the PTR chain (a singly-linked list). The oldest version of a row was called the anchor point. The notion of an anchor point and update (delta) chain is similar to the current support for Heap Only Tuples (HOT) in modern PostgreSQL. I wonder if that is a feature that was removed in early PostgreSQL and then brought back for a different reason.

2.3 Time Management

This section shows the logic required to determine whether a version is visible to a query. The check is more complicated than what InnoDB and RocksDB require, but I assume the CPU overhead is not that different from what occurs in modern PostgreSQL, and in my testing of modern PG this isn't an issue. The logic includes a check of the transaction log to determine whether the transaction from Xmin or Xmax committed. That check wouldn't be needed if the commit timestamp were written into the row on commit -- but doing that is non-trivial and can hurt performance. The need to check the transaction log also means that the searched parts of the log must remain in memory or there will be disk reads. The ability to keep that in memory is explained in section 2.1. I am wary of the ability to keep the log in memory for high TPS systems but this is a problem they didn't need to solve at the time.

2.4 Concurrency Control and Timestamp Management

POSTGRES contains a TIME relation that has the commit time for each transaction. This has 32 bits per XID and is updated on commit. The tail of TIME should be stored in stable main memory to avoid forcing a disk page on commit.

Relations are marked by the user as no-archive, light-archive or heavy-archive. Tmin and Tmax are never set for no-archive relations and I assume old versions for them are not moved to the archive. For light-archive, old versions are moved to the archive but Tmin/Tmax are not set to avoid the overhead of searching the transaction log to determine their status. For heavy-archive the reader (a query) will look up the commit time from the TIME relation and update Tmin/Tmax (thus making a page dirty). Vacuum sets Tmin/Tmax for heavy-archive when moving older versions to the archive. It is possible that the thing (query or vacuum) that searches TIME will be delayed by disk reads.

2.5 Record Access

On each page there is a line table with an entry per anchor point record. Each secondary index entry points to a line table entry. On update a secondary index only needs maintenance if the indexed columns have been changed.

Modern PostgreSQL uses the name line pointer. Also, modern PostgreSQL does maintenance for all secondary indexes unless no indexed columns have changed. So if there are 3 secondary indexes and an update changes a column used by 1 of them then maintenance is done for all of them -- unless HOT is used. If no indexed columns have changed then the Heap Only Tuples (HOT) optimization is used: the new version is added to the end of the update chain and secondary index entries reference the line pointer for the head of the update chain. Quoting from the HOT document:

Without HOT, every version of a row in an update chain has its own index entries, even if all indexed columns are the same. With HOT, a new tuple placed on the same page and with all indexed columns the same as its parent row version does not get new index entries.
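
One way to see how often updates use HOT is the n_tup_hot_upd counter. This is a sketch, assuming a hypothetical table pi1:

SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'pi1';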

3.1 Vacuuming the disk

POSTGRES had a command to trigger vacuum of a relation. The example was vacuum rel-name after "30 days". This reclaims space from aborted transactions and moves old versions to the archive. Old versions for relations marked as light-archive and heavy-archive are moved to archive storage. If heavy-archive is set for the relation then vacuum will set Tmin/Tmax if unset. Differences between POSTGRES and modern PostgreSQL include:
  • Vacuum in modern PostgreSQL doesn't move older versions to an archive. It does reclaim space for versions that have been deleted and are no longer visible. It also sets bits in the visibility map and does work to avoid transaction ID wraparound.
  • Vacuum did a full scan of the relation in POSTGRES while modern PostgreSQL only checks pages that require vacuum, courtesy of the visibility map.
  • Vacuum in modern PostgreSQL does a full index scan for every secondary index of the vacuumed relation when there are rows to remove.

3.2 Archival Medium

The target archival media was optical WORM. While WORM might not have been a huge hit, CD-R and DVD-R were a big deal for a long time. Zip drives were a big deal for a shorter time and now we have USB thumb drives. Maybe WORM will return in the form of ultra-low-endurance NAND flash SSDs that support only one device write.

The paper also explained interesting ways to manage secondary indexes using both magnetic disk and archive devices, with plans for R-trees to support efficient time-bounded queries.

3.3 Vacuum Process

Vacuum in POSTGRES did:
  1. Write the archive record and archive index entries
  2. Write a new anchor point in the current database, insert new index entries
  3. Reclaim the space from the old anchor point and delta records

This wasn't crash safe but POSTGRES did the right thing in spite of crashes. A crash could leave duplicate records with a copy of the same version in both the archive and the main store, but POSTGRES was relational and eliminated such duplicates. I explain differences with modern PostgreSQL above in section 3.1.

5.1 [Performance Comparison] Assumptions

One nit I have with the paper is the argument that CPU is not a critical resource. It listed a few reasons for this -- CPUs were getting much faster than disk, multi-processors were coming, co-processors could be used and custom logic could be used. While the CPU-disk speed gap was growing, the paper ignored that RAM density was also growing quickly and many DBMS applications would be less IO-bound in the future.

Another nit is that the paper ignores the overhead from vacuum. Vacuum doesn't just use CPU. It reads from the vacuumed relations and dirties pages in them. Accounting for that overhead would be complicated, and the focus of the paper was on simple performance models, which made it a nice paper to read.

Monday, July 13, 2020

Updates for the insert benchmark

I continue to run and improve the insert benchmark. This is an update to the overview. Details on how I share performance results are here.
 
Per interval results

The first change is that the benchmark client displays per-second performance results including IPS, QPS and max response time for inserts and queries. This makes it easier to understand stalls.

This is output from a load and the legend for the columns is:
  • i_sec - number of seconds for this interval
  • t_sec - cumulative number of seconds
  • i_ips, i_qps - average insert and query rate for that interval
  • t_ips, t_qps - average insert and query rate from test start until now
  • max_i, max_q - max insert and query response time for that interval, in microseconds
  • t_ins, t_query - total inserts and queries from test start until now

i_sec   t_sec   i_ips   t_ips   i_qps   t_qps   max_i   max_q   t_ins   t_query
1.0     1.0     61833   61833   0       0       2919    0       61900   0
1.0     2.0     63625   62729   0       0       1983    0       125600  0
1.0     3.0     63829   63095   0       0       2080    0       189500  0

Test steps

The second change is in how I run the benchmark. As described in the overview, it used to be run as: create tables/collections with secondary indexes, load X million rows/documents, do a full scan of each secondary index, do read+write tests with rate-limited writers.

I have changed it since then. The new pattern is:
  • create tables/collections without secondary indexes
  • load data (l.i0)
  • create secondary indexes (l.x)
  • load more data (l.i1)
  • read+write tests (q100.1, q100.2, q200.1, q200.2, ...)
The read+write tests use rate-limited writers and the tests are run for varying limits. The test starts with 100 insert/s per client (q100) and then 200, 400, 600, 800 and 1000. In some cases the DBMS is unable to sustain the target insert rates. Also, for each limit the test is run twice so the results might use the names q100.1, q100.2, q200.1, q200.2, ..., q1000.1, q1000.2.

Wednesday, July 1, 2020

Something changed for the better in create index between MySQL 8.0.18 and 8.0.20

I include MongoDB, Postgres, MySQL/InnoDB and MySQL/RocksDB (MyRocks) in the things I test via Linkbench and the insert benchmark. Hopefully I add another test later this year. I have been using MySQL 8.0.18 this year and recently started to use 8.0.20.

The new insert benchmark workflow is create tables with PK indexes, load tables, create 3 secondary indexes on each table, continue the load and then run several rounds of reads+inserts. For this test there were 8 clients with a table per client and 10M rows/table after the initial insert.

I think something changed for the better from MySQL 8.0.18 to 8.0.20 but I didn't spot the change in the release notes. I see:
  • Create index in 8.0.20 is 10% to 20% faster (nice, but not why I wrote this)
  • Create index in 8.0.18 uses ~7X more read IO with innodb_flush_method = O_DIRECT (14G vs 2G) and ~2X more read IO with it set to O_DIRECT_NO_FSYNC (4G vs 2G). The table was cached at the start of create index and the indexed table should be less than 2G. Tests used innodb_sort_buffer_size = 64M.
    • Something changed with the sort used by create index. Maybe it does fewer passes. I am not sure there is a way to monitor that.
    • I don't understand the impact from O_DIRECT vs O_DIRECT_NO_FSYNC in 8.0.18, while it had no impact in 8.0.20.
  • When the load was restarted after create index there was initially a lot of read IO with 8.0.20. Either the indexed table or the newly created indexes were not in cache and my bet is on the new indexes. This doesn't happen with 8.0.18. 
  • For 8.0.20, create index is faster with innodb_use_native_aio set to ON, but I am ignoring that topic for now.
Tests are run for 3 configurations (c10b40, c10b40a, c10b40b). The InnoDB options for those configs are here. Perhaps because of the way I compiled MySQL from source, innodb_use_native_aio is off by default for 8.0.18 but on for 8.0.20 -- so 8.0.18 and 8.0.20 differ at runtime only for that option with the c10b40 config. For the c10b40a and c10b40b configs, the options are the same at run time because innodb_use_native_aio is set. The differences between the configs are:
  • c10b40 - use O_DIRECT
  • c10b40a - start with c10b40, add innodb_use_native_aio=FALSE
  • c10b40b - start with c10b40, change to O_DIRECT_NO_FSYNC
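
The option names below are real InnoDB variables and this sketch just confirms what is in effect at runtime:

SHOW GLOBAL VARIABLES WHERE Variable_name IN
('innodb_flush_method', 'innodb_use_native_aio', 'innodb_sort_buffer_size');
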
Performance metrics are here. The slightly out of date legend for the metrics is here. Similar to my favorite InnoDB performance expert, the results are compressed so I can compare many configurations on one screen. The metrics don't include it but the time to create the index for 8.0.18 is (275, 281, 233) seconds and for 8.0.20 is (157, 228, 209) seconds for the (c10b40, c10b40a, c10b40b) configs. The ips metric in the tables for create index is indexed rows per second computed as (number of rows in table / time to create all indexes).