Small Datum: May 2023

Wednesday, May 31, 2023

Tuning Postgres and the new insert benchmark, round 2

A previous post shared results from the insert benchmark for Postgres 15.3 with a variety of configurations. This has results for even more configurations to understand the impact of several tuning variables. The goal is to understand which config options improve average throughput and/or reduce throughput variance. The focus in on the l.i1 benchmark step (inserts with secondary index maintenance) as that is the largest source of stress for MVCC GC.

tl;dr (for this workload):

the base config I have been using is good once I tuned autovacuum to be more agressive
adding wal_compression=lz4 helps a lot while pglz didn't help (in 2020, see here)
tuning autovacuum scale factors might help
several of the changes made performance much worse
increasing autovacuum_work_mem didn't help here but might help on IO-bound tests

Benchmark

This continues to use Postgres 15.3, a c2-standard-30 server with 15 cores and the insert benchmark with delete_per_insert enabled to avoid growing the table so that write-heavy benchmark steps can run for a long time while keeping the working set in memory.

The configurations all use shared_buffers=4G so that the database is cached by the OS but not by Postgres -- this kind-of simulates fast storage. Results where the database is cached by Postgres are pending.

All of the config files are here. The base config file is cx7a_4g_gcp_c2s30 and then the other config files make a small change to the base. The biggest win is already in the base config, and is the tuning done to make autovacuum more aggressive (see here). Of the configs, and using abbreviated names:

one of them has a significant benefit: a1
a few of them have a small benefit: a9, a10, a11
one had no impact: a17
many of them hurt performance: a2, a5, a6, a7, a12, a13, a14, a15
a few of them can be ignored: a3, a4, a8

The configurations are:

conf.diff.cx7a1_4g_gcp_c2s30 (a1) - adds wal_compression=lz4
conf.diff.cx7a2_4g_gcp_c2s30 (a2) - adds backend_flush_after=256kb
conf.diff.cx7a3_4g_gcp_c2s30 (a3) - adds full_page_writes=off (not for production!)
conf.diff.cx7a4_4g_gcp_c2s30 (a4) - adds checkpoint_timeout=1800s, which was already set to 1800s in the base config so this can be ignored
conf.diff.cx7a5_4g_gcp_c2s30 (a5) - adds min_wal_size=16384
conf.diff.cx7a6_4g_gcp_c2s30 (a6) - adds backend_flush_after=256kb, initdb --wal-segsize 256
conf.diff.cx7a7_4g_gcp_c2s30 (a7) - adds max_wal_size=16GB
conf.diff.cx7a8_4g_gcp_c2s30 (a8) - adds shared_buffers=8GB, ignore this, I was just curious about the benefit of doubling shared buffers from the base config
conf.diff.cx7a9_4g_gcp_c2s30 (a9) - adds checkpoint_timeout=900s
conf.diff.cx7a10_4g_gcp_c2s30 (a10) - adds autovacuum_work_mem=256MB
conf.diff.cx7a11_4g_gcp_c2s30 (a11) - adds autovacuum_work_mem=1GB
conf.diff.cx7a12_4g_gcp_c2s30 (a12) - adds wal_buffers=64MB
conf.diff.cx7a13_4g_gcp_c2s30 (a13) - adds wal_buffers=64MB, initdb --wal-segsize 256
conf.diff.cx7a14_4g_gcp_c2s30 (a14) - adds backend_flush_after=1MB
conf.diff.cx7a15_4g_gcp_c2s30 (a15) - adds max_wal_size=32GB
conf.diff.cx7a16_4g_gcp_c2s30 (a16) - adds autovacuum_scale_factor=0.05, autovacuum_insert_scale_factor=0.05
conf.diff.cx7a17_4g_gcp_c2s30 (a17) - reduces bgwriter_lru_maxpages from 5000 to 1000

Results

I start by comparing the impact of the config changes with the results for the base config.

The summary is here. Relative to the base config the a1 config provides the largest improvement to average throughput for l.i1 (from 110299 to 122963 inserts/s).

Next I consider the impact on worst-case response time for inserts (see here). About half of the configs have either no impact or a minor impact on that compared to the base config. Some of the configs (a6, a11, a12, a13, a14, a15) make the worst-case insert time much worse.

Finally, I look at the charts for per-second insert rates vs time (see here). From that the a1 config makes the graph look much better and the a16 config has a minor benefit. The results are here for base, a1 and a16.

The conclusion is that I should definitely use a1 (wal_compression=lz4) and probably use a9, one of a10/a11 and a16 for this workload.

Tuesday, May 30, 2023

Insert+delete benchmark, small server and MyRocks

This has results for an in-memory, low-concurrency insert benchmark on a small server. The previous report is here. This used the --delete_per_insert option so that the write-heavy steps ran for a long time while the working set remained in memory.

tl;dr

5.6.35 has better perf than 8.0.28 courtesy of new CPU overheads in upstream MySQL 8
Variance is visible, but not horrible.

Benchmarks

The small server is a Beelink SER 4700u with 8 AMD cores, 16G RAM and NVMe SSD and low-concurrency was 1 and 4 clients.

An overview of the insert benchmark is here, here and here. The insert benchmark was run for a cached database with both 1 and 4 clients. For 1 client the benchmark used 1 table. For 4 clients the benchmark was used 4 tables with a client per table. The read+write steps (q100.1, q500.1, q1000.1) were run for 1800 seconds each. The --delete_per_insert option was set for l.i1, q100.1, q500.1 and q1000.1.

Benchmarks were repeated for two configurations:

cached by RocksDB - all data fits in the RocksDB block cache
cached by OS - all data fits in the OS page cache but not the RocksDB block cache. For the small server RocksDB block cache size was set to 1G.

The my.cnf files are here for:

cached by RocksDB: 5.6.35 and 8.0.28
cached by OS: 5.6.35 and 8.0.28

The benchmark is a sequence of steps.

l.i0

insert 20 million rows across all tables without secondary indexes

create 3 secondary indexes. I usually ignore results from this step.

l.i1

insert another 50 million rows across all tables with secondary index maintenance. Also delete 50M rows across all tables. The number of rows/table at the end of the benchmark step matches the number at the start. The inserts are done to the table head and the deletes are done from the tail.

q100.1

do queries as fast as possible with 100 inserts/s/client and the same rate for deletes/s done in the background

q500.1

do queries as fast as possible with 500 inserts/s/client and the same rate for deletes/s done in the background

q1000.1

do queries as fast as possible with 1000 inserts/s/client and the same rate for deletes/s done in the background

MyRocks

Tests used MyRocks from FB MySQL 5.6.35 and 8.0.28 with different builds to test compiler optimizations. The builds are described in a previous post and I tested the rel build for MyRocks 5.6.35 and the rel_native_lto build for 8.0.28.

Reports

Reports for cached by RocksDB:

Reports for cached by OS:

The first thing to acknowledge is the performance regression. There is more CPU overhead in MySQL 8.0 than 5.6 (in non-RocksDB code) so 8.0 gets less throughput than 5.6 in many cases. The following table shows the throughput in 8.0.28 relative to 5.6.35 and a value < 1.0 means 8.0 is slower. These results are from the Summary tables in the reports linked below.

The largest regression is in l.i0
Regressions are slightly worse for the 1-client case than the 4-client case

Throughput from 8.0.28 / 5.6.35

l.i0 l.x l.i1 q100 q500 q1000 cached

-- 1 client

0.80 0.95 0.89 0.93 0.92 0.92 by RocksDB

0.77 0.94 0.88 0.91 0.92 0.92 by OS

-- 4 clients

0.82 0.93 1.02 0.99 0.96 0.95 by RocksDB

0.82 0.94 1.02 0.95 0.97 0.91 by OS

Other performance artifacts:

1 client, cached by RocksDB

l.i1 - for 8.0.28 but not for 5.6.35 the max insert/delete response time charts show two levels, one at ~10ms and the other at ~20ms. See here.
q1000 - the QPS vs time chart shows a sawtooth pattern. This might be from changes in the CPU overhead/query as memtable flushes and L0->L1 compaction are done. There is more variance in the 8.0.28 result. See here.

1 client, cached by OS

l.i0 & l.i1 - for 8.0.28 but not for 5.6.35 the max insert response time chart has 2 levels (one at ~10ms, the other at ~20ms). See here for l.i0 and for l.i1.
q1000 - the QPS vs time chart shows a sawtooth parttern. See here.

4 clients, cached by RocksDB

l.i0 - the insert/s rate tails off, but the test finishes in 90s and that is too short from which to draw strong conclusions. See here.
q100 - QPS has variance and perhaps a sawtooth pattern. See here.
q500 - there are extended (>= 100s) for delete response time but not for inserts. They are worse for 8.0.28 than 5.6.35. See here.
q1000 - there are extended periods during which delete and query response time jump from <= 5ms to ~10ms. See here.

4 clients, cached by OS

l.i0 - similar to 4 clients, cached by RocksDB. See here.
l.i1 - insert and delete response times have two levels. See here.
q100 - similar to 4 clients, cached by RocksDB. See here.
q500 - similar to 4 clients, cached by RocksDB plus QPS has a sawtooth pattern. See here.
q1000 - delete and insert response times for 8.0.28 have a permanent step halfway into the run. See here.

Updates to the insert benchmark

The insert benchmark can now be called the insert+delete benchmark.

While I continue to work on the replacement for the insert benchmark (see insert benchmark v3) progress on that is slow. I recently enhanced the insert benchmark to let me run the write-heavy insert benchmark steps while keeping the working set in-memory and this allows me to run the write-heavy steps for much longer than a few minutes and search for problems in MVCC GC implementations.

This is done via the --delete_per_insert option that works as advertised -- a delete is done per insert to avoid growing the number of rows in a benchmark table. Note that the size of the database files might grow if MVCC GC gets unhappy, but that is the point of the insert benchmark, it exists to make MVCC GC unhappy.

More information on the insert benchmark is here and here.

The steps in the insert benchmark used to be:

l.i0 - Inserts in PK order. The PK is on the transactionid column. The benchmark table has a PK index but no secondary indexes. This should be fast for a (right growing) b-tree regardless of the buffer pool size. Of course it is faster for an LSM.
l.x - Create 3 secondary indexes on each benchmark table.
l.i1 - More inserts in PK order. These are slower than l.i0 because there are 3 secondary indexes that require maintenance. The drop in performance is much larger for a b-tree than an LSM, especially when the working set is larger than memory because secondary index maintenance is read-modify-write for a b-tree but read-free for MyRocks.
q100 - Do queries as fast as possible and for each query client there is another client that does 100 inserts/s.
q500 - Do queries as fast as possible and for each query client there is another client that does 500 inserts/s.
q1000 - Do queries as fast as possible and for each query client there is another client that does 1000 inserts/s.

The steps l.i1, q100, q500 and q1000 have been modified for the insert+delete benchmark. In each case there is a connection that does deletes in addition to the connection that does inserts. The delete rate (delete/s) is <= the insert rate but the goal is for them to be equal.

Implementation notes:

Each connection runs in a separate thread (or process because I use Python multiprocessing).
The default way to run this is with a client per table. But sometimes I run it so that all clients share the same table. The client per table option means I don't encounter fewer perf problems from contention.
The --delete_per_insert option only works with a client per table config.
With --delete_per_insert the number of inserts/s done by a client is shared with the delete connection so that deletes run as fast as inserts, but no faster. They can run slower, but that with be a performance problem.
Deletes are done from the tail of the table. When N rows are to be deleted, the delete client deletes the N with the smallest transactionid value. So the benchmark table is like a queue -- inserts to the head, deletes from the tail.

Saturday, May 27, 2023

Postgres 16beta1 looks good vs sysbench

This has results for Postgres 16 beta1 using sysbench on a small server with low-concurrency and a cached database. The goal is to determine whether there are CPU regressions from Postgres 15.3 to 16-beta1.

tl;dr

There were no regressions
A few queries are ~1.7X faster in PG 16-beta1

Benchmark

A description of how I run sysbench is here. The sysbench microbenchmarks were run for 20 clients, 300 (600) seconds per read (write) microbenchmark using 1 tables with 20M rows per table. The test database was <= 6G and fits in Postgres buffer pool.

The test server is a Beelink SER 4700u with 8 cores, 16G of RAM, 1T of NVMe SSD with XFS running Ubuntu 22.04. I compiled Postgres from source using gcc 11.3.

The config file, conf.diff.cx8_bee, is here for Postgres 15.3 and 16-beta1.

The previous post explains the builds that I used where each build uses different compiler optimizations. I used the def, o2_nofp and o3_native_lto builds and ran the benchmark 6 times -- once per build for each of Postgres 15.3 and 16-beta1.

I use sysbench to run 42 microbenchmarks and each microbenchmark is put in one of three groups based on the dominant operation: point query, range query, writes.

Results

The table below lists the relative throughput which is: (QPS for PG 16-beta1 / QPS for PG 15.3). When the relative throughput is less than 1 then PG 16-beta1 is slower than PG 15.3 and there might be a CPU regression.

With a few exceptions the relative throughput is between 0.97 and 1.04. I consider values between 0.97 and 1.0 to be noise rather than a regression so I will declare there are no regressions. The exceptions are:

point-query.pre_range=100 which has a noisy relative throughput of 0.89, 0.97 and 1.07. This benchmark step runs early in the benchmark and I assume this is just noise.
scan_range=100 which has a noisy relative throughput of 1.00, 0.92, 0.98. I have yet to explain it but the scan benchmark step seems to have more noise so I ignore this.
read-only.pre_range=10000 which has a relative throughput of 1.60, 1.46, 1.63
read-only.range=10000 which has a a relative throughput of 1.50, 1.39, 1.50

The results show that PG 16-beta1 does ~1.5X better for the read-only.*range=10000 benchmark step which is a big improvement. That is discussed in the Follow Up section below.

Legend:

* def - uses the def build

* o2 - uses the o2_nofp build

* o3 - uses the o3_native_lto build

------ build ------ --- benchmark step ---

def o2 o3

1.00 0.97 1.01 hot-points_range=100

0.89 0.97 1.07 point-query.pre_range=100

1.00 1.02 1.01 point-query_range=100

1.00 1.03 1.03 points-covered-pk.pre_range=100

0.99 1.04 1.03 points-covered-pk_range=100

1.00 1.03 1.04 points-covered-si.pre_range=100

0.99 1.01 1.03 points-covered-si_range=100

1.00 1.00 1.01 points-notcovered-pk.pre_range=100

1.00 1.00 1.02 points-notcovered-pk_range=100

1.00 1.00 1.02 points-notcovered-si.pre_range=100

1.00 1.00 1.01 points-notcovered-si_range=100

0.99 0.99 1.00 random-points.pre_range=1000

1.00 1.00 0.99 random-points.pre_range=100

1.01 0.99 1.02 random-points.pre_range=10

0.99 1.00 1.01 random-points_range=1000

1.00 1.00 1.01 random-points_range=100

1.02 0.99 1.02 random-points_range=10

0.99 0.98 1.01 range-covered-pk.pre_range=100

1.00 0.99 1.01 range-covered-pk_range=100

0.98 0.98 1.01 range-covered-si.pre_range=100

1.00 0.97 1.01 range-covered-si_range=100

0.97 0.98 0.99 range-notcovered-pk.pre_range=100

0.98 0.97 1.00 range-notcovered-pk_range=100

0.98 1.00 1.00 range-notcovered-si.pre_range=100

0.98 0.99 1.00 range-notcovered-si_range=100

1.60 1.46 1.63 read-only.pre_range=10000

1.00 0.99 1.04 read-only.pre_range=100

1.01 0.98 0.99 read-only.pre_range=10

1.50 1.39 1.50 read-only_range=10000

1.01 0.99 1.02 read-only_range=100

1.00 1.00 0.97 read-only_range=10

1.00 0.92 0.98 scan_range=100

1.02 1.01 1.04 delete_range=100

1.01 1.02 0.99 insert_range=100

1.02 1.00 1.03 read-write_range=100

1.00 0.98 0.98 read-write_range=10

1.00 0.99 0.98 update-index_range=100

1.00 0.98 1.00 update-inlist_range=100

1.01 1.00 0.98 update-nonindex_range=100

1.01 0.97 0.98 update-one_range=100

1.02 0.99 0.97 update-zipf_range=100

1.01 1.02 1.00 write-only_range=10000

Follow up

The read-only.pre_range=10000 and read-only.range=10000 benchmark steps run a read-only transaction that has 5 types of queries. They use the same queries but read-only.pre_range=10000 runs before a large number of write-heavy benchmark steps while read-only.range=10000 runs after them. The range=10000 in the name indicates that the range queries in the read-only transaction each scan 10,000 rows.

The Lua for the benchmark step is oltp_read_only.lua which relies on oltp_common.lua. The loop that does the queries is here.

There are five types of queries in the read-only*range=10000 benchmark step. The first (q1) is run 10 times per read-only transaction while the others are run once. For q2, q3, q4, q5 the value of N is 10,000. The read-only transaction is run in a loop and sysbench reports the number of transactions/second.

fetch one row by PK
SELECT c FROM sbtest WHERE id=?

scan N rows via PK index, return 1 column/row
SELECT c FROM sbtest WHERE id BETWEEN ? AND ?

scan N rows via an index, return sum of one column
SELECT SUM(k) FROM sbtest WHERE id BETWEEN ? AND ?"

like q2 but sorts the result
SELECT c FROM sbtest WHERE id BETWEEN ? AND ? ORDER BY c

like q4 but adds DISINCT
SELECT DISTINCT c FROM sbtest WHERE id BETWEEN ? AND ? ORDER BY c

To explain the perf improvement from PG 15.3 to 16-beta1 I tried the following:

Looked at query plans - they did not change. Plans are here.
Ran each of the queries separately

performance for q1, q2 and q3 did not change
q4 is ~1.7X faster in PG 16-beta1
q5 is ~1.5X faster in PG 16-beta1

Started to look at CPU profiles from the perf tool. Unfortunately, my work on this was incomplete so I might revisit this in a few days and for now will mumble something about PG 16 using less CPU thanks to ICU and less time doing sorts.

Tuesday, May 23, 2023

Tuning Postgres and the new insert benchmark

This post explains my work in progress to tune the configurations I use for Postgres. I support a DBMS that needs some tuning, I am wary of a DBMS that needs too much tuning. I assume that Postgres is the former, and at least for me so is MyRocks/RocksDB. I wish it were possible to do tuning in two parts, but I am not aware of a DBMS that supports such an approach today:

Describe the HW capacity the DBMS can use (RAM, number of cores, IO throughput)
Describe the desired SLA (favor reads vs writes, what matters WRT read, write & space amp)

I am a big fan of the Postgres community. I write about my good and bad experiences with Postgres. I do my best to avoid snark and in return I get advice from experts.

Updates:

Once again I made mistakes with huge pages. I need to repeat the cached by OS benchmarks and will replace the results I shared today.

My latest round of this is for the insert benchmark. It has 6 steps and they used to be:

Insert in PK order with a PK index and without secondary indexes
Create 3 secondary indexes
Insert in PK order, but random relative to the secondary indexes, with a PK index and 3 secondary indexes. All indexes require maintenance per insert.
Do queries as fast as possible. Do rate-limited inserts (100/s/client).
Do queries as fast as possible. Do rate-limited inserts (500/s/client).
Do queries as fast as possible. Do rate-limited inserts (1000/s/client).

The initial usage was for data sets larger than memory and I could control the benchmark step in which that occurred based on how I setup the benchmark. Eventually I started to use it for in-memory workloads to search for CPU regressions. Unfortunately when an insert-only workload runs for enough time the workload eventually stops being in-memory. And on a modern server I usually can't run benchmark steps 1 and 3 for more than a few minutes before the database is larger than memory.

This is a problem when I want to check that b-tree writeback/checkpoint and LSM compaction provide stable (minimal variance) throughput over time. So I finally did it -- I added an option to the insert benchmark (--delete_per_insert) that optionally does a delete per insert to keep the database from growing and I use this for benchmark steps 3, 4, 5 and 6 above. The inserts done in those steps use a separate thread (and separate database connection). The deletes do the same, and just match the rate of inserts. The deletes are done from the tail of the table. The table has a PK on transactionid, inserts use increasing transactionid values and deletes delete the N rows with the smallest transactionid values.

First up with the new and improved insert benchmark, or insert & delete benchmark, is Postgres. I am using the workload to improve the Postgres configuration that I use with much help from Postgres experts.

Setup

I ran two variants of the benchmark on a c2-standard-30 server that has 15 cores, hyperthreads disabled, 120G of RAM and XFS on SW RAID 0 across 4 NVMe devices. The variants are:

cached by Postgres - database is cached by Postgres
cached by OS - database is cached by the OS page cache but not by Postgres

The config files that I use are here and for cached by OS I use shared_buffers=4G. The benchmark uses 8 tables with a client per table. The benchmark steps are:

Insert 20M rows/table in parallel
Create 3 secondary indexes per table (I ignore perf from this step)
Insert 100M rows/table in parallel with a delete per insert
Do queries as fast as possible with 100/inserts/s/client and a delete per insert
Do queries as fast as possible with 500/inserts/s/client and a delete per insert
Do queries as fast as possible with 1000/inserts/s/client and a delete per insert

Results

The benchmark reports are here for cached by Postgres and cached by OS.

While I have yet to get results for the insert+delete benchmark from InnoDB or MyRocks I see more variance than desired with the Postgres configurations that I tested.

Things that interest me are:

Average throughput for l.i1 (benchmark step 3) for cached by Postgres and by OS
Response time distribution for inserts during l.i1 for cached by Postgres and by OS
Insert rate at 1-second intervals during l.i1 for cached by Postgres and by OS

From the above I rank the impact of changes to the base configuration file I used. The base is here for by Postgres and by OS. It already benefits from changes I did to increase the rate at which autovacuum runs (see here). With those changes the database doesn't get larger than ~50G. Without them it grew to be much larger than memory.

But there were other changes suggested by Postgres experts and I tried many of them. The base config for cached by Postgres is named conf.diff.cx7a_gcp_c2s30. There are 9 variations that I tried -- 8 suggested by others, one I added. Using the cached by Postgres names they are:

conf.diff.cx7a1_gcp_c2s30 - adds wal_compression=lz4
conf.diff.cx7a2_gcp_c2s30 - adds backend_flush_after=256kb
conf.diff.cx7a3_gcp_c2s30 - adds full_page_writes=off (not for production!)
conf.diff.cx7a4_gcp_c2s30 - adds checkpoint_timeout=1800s
conf.diff.cx7a5_gcp_c2s30 - adds min_wal_size=16384
conf.diff.cx7a6_gcp_c2s30 - adds backend_flush_after=256kb, initdb --wal-segsize 256, inherits larger value for wal_buffers
conf.diff.cx7a7_gcp_c2s30 - adds max_wal_size=16GB
conf.diff.cx7a8_gcp_c2s30 - adds shared_buffers=8GB (hurts perf for cached by Postgres, improves it for cached by OS because by OS base config has shared_buffers=4GB)
conf.diff.cx7a9_gcp_c2s30 - adds checkpoint_timeout=900s

I judged the impact from each of the config file variants listed above. The a1, a3, a4 and a9 configs help performance. The a2, a5 and a7 configs hurt performance. The a6 config hurts perf in 3 cases and helps it in one.

Legend for header:

* config -

* avg.tp - impact on average insert rate

* rt - impact on response time distribution

Legend for symbols:

+ -> some improvement

++ -> much improvement

- -> some regression

-- -> much regression

blank -> not much impact

NA -> ignored for now

avg.tp rt

config byOS byPG :: byOS byPG

a1 + +

a2 -- -- --

a3 ++ ++ + +

a4 +

a5 - -

a6 -- + --

a7 -- -- -- --

a8 NA NA NA NA

a9 + +

Compiling all releases of MySQL 8.0

This explains my experiences compiling all GA releases from MySQL 8.0. The first GA release is 8.0.11, the latest is 8.0.33 and 8.0.29 doesn't exist. That means there are 22 point releases including 8.0.33.

My build platform is Ubuntu 22.04 which has gcc 11.3.0. The first challenge is OpenSSL -- MySQL 8.0.28 and prior releases need version 1, 8.0.30 and later need version 3 and Ubuntu 22.04 includes version 3. The workaround is to download, compile and install version 1.

./config --prefix=/foo/bar/d/ssl --openssldir=/foo/bar/d/ssl shared
make; make install

The second challenge was adding missing include files to avoid compiler failures. I always disabled compiling router because there were build failures early in the 8.0 cycle. Perhaps those have been fixed but I don't need the feature and prefer faster build times so I never checked.

I have no idea whether these build failures occurred on older versions of Ubuntu and/or gcc. Perhaps there is a story for the missing include file errors that linger across multiple releases.

MySQL 8.0.11

Changes:

added -fPIC to the CFLAGS and CXXFLAGS to avoid odd linker errors
added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
added include <stdexcept> to client/mysqltest.c
added include <cstdint> to include/my_hash_combine.h
added include <stdexcept> to plugin/x/ngs/src/thread.cc
added declarations to materialize templates in sql/dd/impl/cache/storage_adapter.cc

The CMake command line is here.

The diffs to upstream source are here.

MySQL 8.0.12

Changes:

added -fPIC to the CFLAGS and CXXFLAGS to avoid odd linker errors
added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
added include <stdexcept> to client/mysqltest.c
added include <cstdint> to include/my_hash_combine.h

The CMake command line is here:

The diffs to upstream source are here.

MySQL 8.0.13

Changes:

added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
added include <stdexcept> to client/mysqltest.c
added include <cstdint> to include/my_hash_combine.h
added include <stdexcept> to router/src/http/include/mysqlrouter/http_common.h
added include <stdexcept> to router/src/router/include/mysqlrouter/uri.h
disabled router in top-level CMakeLists.txt because of build failures and the CMake option to disable it didn't arrive until a later point release

The CMake command line is here.

The diffs to upstream source are here.

MySQL 8.0.14

Changes:

added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
added include <stdexcept> to client/mysqltest.c
added include <cstdint> to include/my_hash_combine.h
added include <limits> to sql-common/sql_string.cc
disabled router in top-level CMakeLists.txt because of build failures and the CMake option to disable it didn't arrive until a later point release

The CMake command line is here.

The diffs to upstream source are here.

MySQL 8.0.15

Changes:

added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
added include <stdexcept> to client/mysqltest.c
added include <cstdint> to include/my_hash_combine.h
added include <limits> to sql-common/sql_string.cc
disabled router in top-level CMakeLists.txt because of build failures and the CMake option to disable it didn't arrive until a later point release

The CMake command line is here.

The diffs to upstream source are here.

MySQL 8.0.16

Changes:

added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
added include <cstdint> to include/my_hash_combine.h
added include <limits> to sql-common/sql_string.cc
added include <cstdint> & <cstddef> to include/mysql/components/services/page_track_service.h
disabled router to avoid build errors via -DWITH_ROUTER=OFF

The CMake command line is here.

The diffs to upstream source are here.

MySQL 8.0.17

Changes:

added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
added include <cstdint> to include/my_hash_combine.h
added include <limits> to sql-common/sql_string.cc
added include <cstdint> & <cstddef> to include/mysql/components/services/page_track_service.h
disabled router to avoid build errors via -DWITH_ROUTER=OFF

The CMake command line is here.

The diffs to upstream source are here.

MySQL 8.0.18

Changes:

added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
added include <cstdint> to include/my_hash_combine.h
added include <limits> to sql-common/sql_string.cc
added include <cstdint> & <cstddef> to include/mysql/components/services/page_track_service.h
added include <string> to sql/rpl_utility.h
disabled router to avoid build errors via -DWITH_ROUTER=OFF

The CMake command line is here.

The diffs to upstream source are here.

MySQL 8.0.19

Changes:

added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
added include <limits> to sql-common/sql_string.cc
added include <cstdint> & <cstddef> to include/mysql/components/services/page_track_service.h
disabled router to avoid build errors via -DWITH_ROUTER=OFF

The CMake command line is here.

The diffs to upstream source are here.

MySQL 8.0.20

Changes:

added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
added include <limits> to sql-common/sql_string.cc
added include <cstdint> & <cstddef> to include/mysql/components/services/page_track_service.h
disabled router to avoid build errors via -DWITH_ROUTER=OFF

The CMake command line is here.

The diffs to upstream source are here.

MySQL 8.0.21

Changes:

added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
added include <limits> to sql-common/sql_string.cc
added include <cstddef> to include/mysql/components/services/page_track_service.h
disabled router to avoid build errors via -DWITH_ROUTER=OFF

The CMake command line is here.

The diffs to upstream source are here.

MySQL 8.0.22

Changes:

added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
added include <limits> to sql-common/sql_string.cc
added include <cstddef> to include/mysql/components/services/page_track_service.h
disabled router to avoid build errors via -DWITH_ROUTER=OFF

The CMake command line is here.

The diffs to upstream source are here.

MySQL 8.0.23

Changes:

added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
added include <limits> to xtra/robin-hood-hashing/robin_hood.h
added include <limits> to sql-common/sql_string.cc
added include <cstddef> to include/mysql/components/services/page_track_service.h
disabled router to avoid build errors via -DWITH_ROUTER=OFF

The CMake command line is here.

The diffs to upstream source are here.

MySQL 8.0.24

Changes:

added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
Disabled router to avoid build errors via -DWITH_ROUTER=OFF

The CMake command line is here.

Upstream source compiled as-is.

MySQL 8.0.25

Changes:

added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
disabled router to avoid build errors via -DWITH_ROUTER=OFF

The CMake command line is here.

Upstream source compiled as-is.

MySQL 8.0.26

Changes:

added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
disabled router to avoid build errors via -DWITH_ROUTER=OFF

The CMake command line is here.

Upstream source compiled as-is.

MySQL 8.0.27

Changes:

added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
disabled router to avoid build errors via -DWITH_ROUTER=OFF

The CMake command line is here.

Upstream source compiled as-is.

MySQL 8.0.28

Changes:

added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
disabled router to avoid build errors via -DWITH_ROUTER=OFF

The CMake command line is here.

Upstream source compiled as-is.

MySQL 8.0.30

Changes:

disabled router to avoid build errors via -DWITH_ROUTER=OFF

The CMake command line is here.

Upstream source compiled as-is.

MySQL 8.0.31

Changes:

disabled router to avoid build errors via -DWITH_ROUTER=OFF

The CMake command line is here.

Upstream source compiled as-is.

MySQL 8.0.32

Changes:

disabled router to avoid build errors via -DWITH_ROUTER=OFF

The CMake command line is here.

Upstream source compiled as-is.

MySQL 8.0.33

Changes:

disabled router to avoid build errors via -DWITH_ROUTER=OFF

The CMake command line is here.

Upstream source compiled as-is.

Thursday, May 18, 2023

Summary of another round of benchmarks

This has links for my latest round of benchmarks for MyRocks, InnoDB and Postgres. They were run on a small and big server for sysbench and the insert benchmark. The big server is a c2-standard-60 with 30 cores, hyperthreads disabled and 240G of RAM. The small server is a mini PC with 8 cores and 16G of RAM. For MyRocks I used builds from FB MySQL 5.6.35 and 8.0.28. For InnoDB I used upstream MySQL 5.6, 5.7 and 8.0. For Postgres I used versions 11 through 15.

I tried to explain changes. I made mistakes, especially with configuring huge pages, that forced me to repeat tests a few times. The configurations I use might need improvements, especially for Postgres and I am working on that now.

Insert Benchmark

Small server - MyRocks, InnoDB, Postgres
Big server - MyRocks, InnoDB, Postgres
Other - Comparison of all DBMS

Sysbench

Small server - MyRocks, InnoDB, Postgres
Big server - MyRocks, Postgres (part 1 & part 2), InnoDB (part 1 & part 2)
Other - MyRocks vs InnoDB, read perf after writes, note on running sysbench, note on huge pages

The Insert Benchmark, MySQL 5.6, 5.7, 8.0 and a small server

This has results for MySQL vs the insert benchmark on a small server. Results from a big server are here.

The goals are to look for performance changes from MySQL 5.6 to 8.0 and to determine the impact of compiler optimizations. The context here is a cached workload (reads from cache, writes to storage) with low concurrency and a small server.

tl;dr

The rel_native_lto build has the best performance which is usually <= 4% better than the base case except for the l.x (create index) benchmark step.
QPS on the read+write benchmark steps (q100.1, q500.1, q1000.1) suffer from bug 109595. Fixing b109595 would increase QPS by ...

~1.25X for the cached by InnoDB setups
(~2X, ~6X and ~6X) for q100.1, q500.1, q1000.1) for the cached by OS setups. To be clear, I mean that QPS for q500.1 and q1000.1 is about 1/6th of what it could be if this bug were fixed.

For the cached by InnoDB setup with one client (connection, thread) the throughput for 8.0.32 is between 56% and 63% of 5.6.51. Most of the perf loss is from new CPU overhead.

Benchmarks

An overview of the insert benchmark is here and here. The insert benchmark was run for a cached database with both 1 and 4 clients. For 1 client the benchmark used 1 table. For 4 clients the benchmark was first run with 4 tables (client per table) and then again with 1 table (all clients shared the table). The read+write steps (q100.1, q500.1, q1000.1) were run for 1800 seconds each.

The benchmark server is a Beelink SER 4700u described here with 8 AMD cores, 16G of RAM and 1T of NVMe SSD. The OS is Ubuntu 22.04 and the filesystem is XFS.

Benchmarks were repeated for two configurations:

cached by InnoDB - all data fits in the InnoDB buffer pool
cached by OS - all data fits in the OS page cache but not the InnoDB buffer pool. The buffer pool size is 1G and the database was ~10G at test end.

The configuration files are here for:

cached by InnoDB: versions 5.6, 5.7, 8.0
cached by OS: versions 5.6, 5.7, 8.0

The benchmark is a sequence of steps:

l.i0 - insert 20 million rows without secondary indexes
l.x - create 3 secondary indexes. I usually ignore results from this step.
l.i1 - insert another 20 million rows with the overhead of secondary index maintenance
q100.1 - do queries as fast as possible with 100 inserts/s/thread done in the background
q500.1 - do queries as fast as possible with 500 inserts/s/thread done in the background
q1000.1 - do queries as fast as possible with 1000 inserts/s/thread done in the background

Builds

I compiled MySQL from source with several build variations. Details are here. All builds are with upstream MySQL and gcc.

Reports

Performance summaries generated by shell scripts are below. A short guide to these results is here. There are three types of reports:

version 8.0.31 - results for MySQL 8.0.31 using different compiler optimizations
version 8.0 - results for MySQL 8.0.22, 8.0.28, 8.0.31 and sometimes 8.0.32
versions 5.6 to 8.0 - results for MySQL versions 5.6.51, 5.7.40, 8.0.22, 8.0.28, 8.0.28, 8.0.31 and 8.0.32

For each type of report there are two variations:

1 client, 1 table - used 1 client & 1 table
4 clients, 4 tables - used 4 clients & 4 tables with a client per table
4 clients, 1 table - used 4 clients & 1 table with all clients sharing one table

Reports for cached by InnoDB:

1 client, 1 table - version 8.0.31, version 8.0, versions 5.6 to 8.0
4 clients, 4 tables - version 8.0.31, version 8.0, versions 5.6 to 8.0
4 clients, 1 table - version 8.0.31, version 8.0, versions 5.6 to 8.0

Reports for cached by OS:

1 client, 1 table - version 8.0.31, version 8.0, versions 5.6 to 8.0
4 clients, 4 tables - version 8.0.31, version 8.0, versions 5.6 to 8.0
4 clients, 1 table - version 8.0.31, version 8.0, versions 5.6 to 8.0

Summaries:

There has been a gradual reduction in performance in each release because there has been a gradual increase in CPU overhead. The following excludes the l.x (create index) benchmark step.

1 client, 1 table

Cached by InnoDB

QPS for 8.0.32 is between 56% and 63% relative to 5.6.51

Cached by OS

QPS for 8.0.32 relative to 5.6.51 is (61%, 27%, 30%, 7%, 6%) for (l.i0, l.i1, q100.1, q500.1, q1000.1). See bug 109595.

4 clients, 4 tables

Cached by InnoDB

QPS for 8.0.32 is between 60% and 67% relative to 5.6.51

Cached by OS

QPS for 8.0.32 relative to 5.6.51 is (61%, 19%, 1%, 2%, 15%) for (l.i0, l.i1, q100.1, q500.1, q1000.1). See bug 109595.

4 clients, 1 table

Cached by InnoDB

QPS for 8.0.32 is between 62% and 75% relative to 5.6.51

Cached by OS

QPS for 8.0.32 relative to 5.6.51 is (60%, 18%, 2%, 2%, 9%) for (l.i0, l.i1, q100.1, q500.1, q1000.1). See bug 109595.

Tuesday, May 16, 2023

Understanding the insert benchmark results: variance and write stalls

Today I published a report for the insert benchmark with InnoDB and a big server and then updated reports for MyRocks and Postgres. One goal is to find things that can be made better. Another goal is to understand how performance and efficiency changes when context changes where context might be the version of the DBMS (newer releases fix some things, but add CPU overhead) or it might be whether the database is cached by the DBMS.

Most of my reports focus on one DBMS and don't compare MyRocks with Postgres and InnoDB, but I will do some of that here. I try to leave benchmarketing to others, but I am happy to do perf bug marketing via my posts.

The results are interesting, but before discussing them I have a few disclaimers:

I risk making strong conclusions from small samples. For example, I only ran the benchmark once per DBMS/configuration. My focus is on breadth, I tested many builds per DBMS. Had I tried to get 3 or 5 results per setup then I would have used 3X or 5X as much machine time.
Comparisons between setups can be hard because the work rates are not fixed. For the write-only benchmark steps (l.i0, l,i1) the inserts are done as fast as possible. For the read+write benchmark steps while the background write rate is fixed, the query clients run as fast as possible. Some comparisons would be easier if I were to fix the insert and query rates in all cases. It isn't unreasonable for a DBMS doing X inserts/second to have more stalls or response time variance than a DBMS doing X/10 inserts/second.
It is possible that better tuning will improve the results and I am open to feedback on that. However, I am wary of doing too much tuning, especially per-workload. I prefer to have a DBMS that offers robust tuning which does the right thing for many workloads.
Context matters. The context here is a workload with high-concurrency that fits in memory (either the OS page cache or the database) so there are writes to storage but few reads from storage.
InnoDB might be suffering from bug 109595. I hope it gets fixed soon.

Most important disclaimer (updated)

I should have included this up front. These are results for workloads expected to remain in memory (no reads from storage). The benchmark does inserts and range queries. If too many inserts are done then the working set will not fit in memory. And too long on many servers == ~5 minutes. It is risky to make strong judgements from such a short interval.

It is OK to make strong judgements on the CPU overhead of the foreground work done for inserts. But it is less OK to confirm that a storage engine doesn't get unhappy for an in-memory, write-heavy workload unless I am able to run that workload for a lot more time than 5 minutes.

While I have a replacement benchmark in progress that includes a fix for this problem, that isn't ready today. So I will revive a dormant feature in the insert benchmark to optionally do a delete per insert to keep the database at a fixed size.

Interesting results

A summary:

Max response time queries and inserts with MyRocks in all of the benchmark steps (l.i0, l.i1, q100.1, q500.1, q1000.1) is much smaller than it is for InnoDB and Postgres. I didn't expect this although I know that worst-case write-stalls have been made much smaller in recent RocksDB releases.
MyRocks QPS at per-second intervals has an interesting pattern in the read+write workloads (q100.1, q500.1, q1000.1). I assume this is the result of regular changes in the CPU overhead for queries -- memtable flush makes the memtable empty and the memtable search cost drops, L0 -> L1 compaction makes the L0 empty so there are fewer SSTs to check. But then the search costs for the memtable and L0 grows, repeat.
Some of the max response times for queries and inserts with Postgres were much worse with a larger buffer pool (database cached by Postgres) then a smaller buffer pool (cached by OS but not by Postgres). This isn't a shock but it would be interesting to explain that one day.
The InnoDB report has the longest tl;dr. Performance changes much more depending on context vs the same test for MyRocks and Postgres. That complexity isn't a feature, it means you have more things to worry about.
The difference in the l.i1 insert rate between cached by DBMS and cached by OS workloads is interesting.

How does perf change when the database isn't cached by the DBMS?

This table shows how the insert rate changes for both the l.i0 and l.i1 benchmark steps between the cached by OS and cached by DBMS setups. Note that l.i0 is the initial load in key order with only a PK index and l.i1 does inserts in random key order with 3 secondary indexes in place where index maintenance is required for each index. Also, the 20-tab and 1-tab columns are for the setups with 20 tables (table per client) and 1-table (all clients share 1 table).

The goal is to document how the ratios change between the l.i0 and l.i1 benchmark steps to understand whether either is more sensitive to the database fitting in the DBMS cache. It is reasonable to expect insert rates to be smaller with l.i1 than with l.i0 because l.i1 maintains 4 indexes while l.i0 only maintains 1 and the inserts are in key order for l.i0 but in random order for l.i1.

The summary is:

Performance for l.i0 is mostly cache oblivious. This is expected given the inserts are in key order.
Performance for l.i1 is mostly cache oblivious for the LSM (MyRocks) but not the b-trees (Postgres, InnoDB). This too is expected.

Performance doesn't change for MyRocks when it doesn't cache the database. Non-unique secondary index maintenance doesn't require reads from the secondary index.
Performance for Postgres with l.i1 drops, the number for l.i1 is less than for l.i0 with the same number of tables, but the drop is reasonable.
Performance for InnoDB with l.i1 drops too much. I have yet to explain this.

insert rates from: (cached by OS) / (cached by DBMS)
20-tab 20-tab 1-tab 1-tab
l.i0 l.i1 l.i0 l.i1
MyRocks 0.97 1.01 1.00 1.00
InnoDB 1.01 0.04 1.01 0.14
PG 1.04 0.36 1.04 0.50

How does perf change when index maintenance must be done?

This table shows the ratio of the insert rates for the l.i1 and l.i0 benchmark steps. See the previous section for a description of workloads.

The summary is:

The results for MyRocks show there is no difference between cached by DBMS and cached by OS because non-unique secondary index maintenance doesn't read from secondary indexes.
The results for InnoDB show that it is the most sensitive to not having the working set fit in the DBMS cache. The reduction for InnoDB from cached by DBMS to cached by OS is much larger than for Postgres.
The results for Postgres seem reasonable.

inserts rates from: l.i1 / l.i0
20-tab 20-tab 1-tab 1-tab
by-DBMS by-OS by-DBMS by-OS
MyRocks 0.29 0.30 0.22 0.22
InnoDB 0.41 0.02 0.68 0.09
PG 0.30 0.10 0.37 0.18

The Insert Benchmark, MySQL 5.6, 5.7, 8.0 and a big server

This has results for MySQL vs the insert benchmark on a big server. Results from a small server are here.

tl;dr (still long but the results vary a lot):

It is possible that QPS on the read+write benchmark steps (q100.1, q500.1, q1000.1) suffer from bug 109595.
I accept that tuning might be required, but I am also wary of a DBMS that requires too much per-workload tuning.
The insert rate for l.i1 drops from (~400k/s, 180k/s) when the database is cached by InnoDB to (~14k/s, ~25k/s) when the database is cached by the OS but not by InnoDB. The two numbers are for (20 tables, 1 table). While a reduction is expected when not cached by the database the drop is much larger for InnoDB than for Postgres.
The o3_rel_lto build has the best performance but that usually means <= 3% better than others. The benefit for link-time optimization here is much less than on the small server perhaps because the big server has different overheads from higher concurrency.
Max response times are mostly significantly worse in 8.0 than 5.6.
For a database cached by InnoDB

InnoDB was able to sustain the target background write rate for q100.1 (2000/s), q500.1 (10,000/s) and q1000.1 (20,000/s).
With a table per client

Write performance improved from 5.6 to 5.7 and again from 5.7 to 8.0 so that 8.0 gets 2X to 3X more throughput than 5.6. But within 8.0 there are regressions and 8.0.32 is up to 5% slower than 8.0.22.
Read performance gets worse from 5.6 to 5.7 and again from 5.7 to 8.0 so that 5.7 gets ~10% less QPS than 5.6 while 8.0 gets ~30% less QPS than 5.6.

With one table

Write - vs 5.6 version 5.7 gets ~1.25X more throughput and 8.0 gets ~1.15X more
Read - results are similar to table per client above

For a database not cached by InnoDB

InnoDB was unable to sustain the target background write rate in some cases and this means it failed the benchmark SLA. In all of these failures the background write rate for 8.0 is worse than for 5.6.

It was never able to sustain the rate for the q1000.1 benchmark step for both the 20 table and 1 table configurations.The target was 20,000/s and InnoDB got between ~10,000/s and ~15,000/s.
It was not able to sustain the rate for the q500.1 benchmark step for the 20 table case with 8.0. The target is 10,000/s and < ~9800/s is a failure so 5.7 passed and 5.6 almost passed. Note that with the 1 table test 5.7 and 8.0 pass but 5.6 fails.

With a table per client

Write - results are similar to database cached by InnoDB above
Read - with a low background write rate (20 X 100 /s, see q100.1) 8.0 is ~5X faster than 5.6 but when it is higher (20 X 1000/s, see q1000.1) then 8.0 is only ~1.2X faster than 5.6

With one table

Write - MySQL 8.0 throughput is 10% to 20% less than 5.6
Read - the comparison with 5.6 depends on the background write. For q100.1 (20 X 100 /s) then 8.0 gets ~3X more QPS vs 5.6. And then for q500.1 (20 X 500 / s) then 8.0 gets ~1.3X more QPS vs 5.6. But when it is highest (20 X 1000 / s) then 8.0 gets ~10% less QPS vs 5.6.

Benchmarks

The server is a c2-standard-60 from GCP with 30 cores, hyperthreads disabled, 240G RAM and 3T of NVMe (XFS, RAID 0 across 8 devices). Huge pages were enabled for all benchmark runs.

An overview of the insert benchmark is here and here. The insert benchmark was run for a cached database with 20 clients. The read+write steps (q100.1, q500.1, q1000.1) were run for 1800 seconds each.

Benchmarks were repeated for two configurations:

cached by InnoDB - all data fits in the InnoDB buffer pool and vm.nr_hugepages was 85000
cached by OS - all data fits in the OS page cache but not the InnoDB buffer pool. The buffer pool size is 4G and the database was ~180G at test end. For huge pages I set vm.nr_hugepages to 4000.

The configuration files are here for:

cached by InnoDB: versions 5.6, 5.7, 8.0.2x, 8.0.3x
cached by OS: versions 5.6, 5.7, 8.0.2x, 8.0.3x

The benchmark is a sequence of steps:

l.i0 - insert 20 million rows without secondary indexes
l.x - create 3 secondary indexes. I usually ignore results from this step.
l.i1 - insert another 20 million rows with the overhead of secondary index maintenance
q100.1 - do queries as fast as possible with 100 inserts/s/thread done in the background
q500.1 - do queries as fast as possible with 500 inserts/s/thread done in the background
q1000.1 - do queries as fast as possible with 1000 inserts/s/thread done in the background

Builds

I compiled MySQL from source with several build variations. Details are here. All builds are with upstream MySQL and gcc.

Reports

Performance summaries generated by shell scripts are below. A short guide to these results is here. There are three types of reports:

version 8.0.31 - results for MySQL 8.0.31 using different compiler optimizations
version 8.0 - results for MySQL 8.0.22, 8.0.28, 8.0.31 and 8.0.32
versions 5.6 to 8.0 - results for MySQL versions 5.6.51, 5.7.40, 8.0.22, 8.0.28, 8.0.28, 8.0.31 and 8.0.32

For each type of report there are two variations:

20 clients, 20 tables - used 20 client & 20 tables with a client per table.
20 clients, 1 table - used 20 clients & 1 table

Reports for cached by InnoDB:

20 clients, 20 tables: version 8.0.31, version 8.0, versions 5.6 to 8.0
20 clients, 1 table: version 8.0.31, version 8.0, versions 5.6 to 8.0

Reports for cached by OS:

20 clients, 20 tables: version 8.0.31, version 8.0, versions 5.6 to 8.0
20 clients, 1 table: version 8.0.31 (missing), version 8.0, versions 5.6 to 8.0

Response time

Charts with throughput and response time per 1-second intervals are here. First, the disclaimers:

The graphs are from just one of the 20 clients.
I risk drawing strong conclusions from small samples.
It is risky to compare the results for l.i0 and l.i1 between MySQL 5.6 and 8.0 because 8.0 sustains higher insert rates so that comparison wouldn't be fair.
In the cached by OS configurations MySQL 8.0 isn't able to sustain the target rate for background writes in q1000.1 (20 X 1000 /s) and sometimes in q500.1 (20 X 500 /s). This means that InnoDB failed the benchmark SLA and a comparison with MySQL 5.6 isn't fair.

The charts:

cached by InnoDB

20 clients, 20 tables: l.i0, l.i1, q100.1, q500.1, q1000.1

For l.i0 and l.i1 the variance in response time and insert rates is worse for 8.0 than 5.6, but note that 8.0 sustains higher insert rates so this isn't a fair comparison.

20 clients, 1 table: l.i0, l.i1, q100.1, q500.1, q1000.1

Results are similar to the 20 clients, 20 tables results above. The l.i1 results have more variance than l.i0.

cached by OS:

20 clients, 20 tables: l.i0, l.i1, q100.1, q500.1, q1000.1

Results are similar to the 20 clients, 20 tables results above. The l.i1 results have more variance than l.i0.
For q100.1 the insert rate with 8.0 has more variance than 5.6. For q500.1 the insert rate with 8.0 has a lot more variance than 5.6. And the difference is even more stark for q1000.1. See the disclaimers above, the 8.0 configs failed to sustain the target rate for background writes.

20 clients, 1 table: l.i0, l.i1, q100.1, q500.1, q1000.1

For l.i1 the per-second insert rate drops over time for 5.6, 5.7 and 8.0
See the disclaimers above, the 8.0 configs failed to sustain the target rate for background writes.

Up next are tables with response time details per benchmark step -- both histograms and the max. Note that ms means milliseconds, by insert I mean a multi-row insert statement and the use of brackets ([]) below indicates the range of the max response time over multiple results (1 result per build). It can be risky to draw strong inferences from a small sample (one run per build).

cached by InnoDB

20 clients, 20 tables:

l.i0 - the max is [218ms, 310ms]. The histogram improves from 5.6 to 8.0.
l.i1 - the max grows from ~500ms in 5.6 to ~1200ms in 8.0. The response time histogram improves from 5.6 to 8.0.
q100.1 - the max is [19ms, 28ms] for queries and [19ms, 30ms] for inserts
q500.1 - the max is [24ms, 30ms] for queries and [40ms, 54ms] for inserts
q1000.1 - the max is [26ms, 36ms] for queries and [49ms, 57ms] for inserts
For q100.1, q500.1 and q1000.1 the insert response time histogram improves from 5.6 to 8.0.

20 clients, 1 table:

l.i0 - the max is [228ms, 270ms].
l.i1 - the max drops from 773ms in 5.6.51 to 279ms in 8.0.32.
q100.1 - the max is [23ms, 28ms] for queries and [19ms, 31ms] for inserts.
q500.1 - the max is [24ms, 28ms] for queries and [46ms, 52ms] for inserts.
q1000.1 - the max is [24ms, 41ms] for queries and [50ms, 119ms] for inserts.
For q100.1, q500.1 and q1000.1 the insert response time histogram gets worse from MySQL 5.7 to 8.0.

cached by OS

20 clients, 20 tables:

l.i0 - the max is [238ms, 313ms]. The response time histogram improves from 5.6 to 8.0.
l.i1 - the max is [1160ms, 1297ms]. The response time histogram improves from 5.6 to 8.0.
q100.1 - the max grows from ~100ms in 5.6 to ~600ms+ in 8.0 for queries and grows from ~600ms in 5.6 to >= 1500ms for 8.0.
q500.1 - the max grows from ~200ms in 5.6 to >= ~1000ms in 8.0 for queries and from ~400ms in 5.6 to >= 1500ms in 8.0 for inserts. InnoDB in 8.0 was unable to sustain the target insert rate of 10,000/s.
q1000.1 - the max grows from ~200ms in 5.6 to >= ~1000ms in 8.0 for queries and from ~500ms in 5.6 to >= 1500ms in 8.0 for inserts. InnoDB in 5.6, 5.7 and 8.0 was unable to sustain the target insert rate of 20,000/s.

20 clients, 1 table:

l.i0 - the max is [231ms, 262ms].
l.i1 - the max grows from ~1000ms in 5.6 to ~1200ms in 8.0.
q100.1 - the max grows from 75ms in 5.6 to ~200ms in 8.0 for queries and grows from ~400ms in 5.6 to >= 600ms in 8.0 for inserts.
q500.1 - the max grows from ~100ms in 5.6 to ~270ms in 8.0 for queries and grows from ~500ms in 5.6 to ~1000ms in 8.0 for inserts. InnoDB in 5.6 was unable to sustain the target insert rate of 10,000/s
q1000.1 - the max grows from 90ms in 5.6 to ~400ms in 8.0 for queries and grows from 548ms in 5.6 to >= 1200ms in 8.0 for inserts. InnoDB in 5.6, 5.7 and 8.0 was unable to sustain the target insert rate of 20,000/s.