Thursday, June 1, 2023

Tuning Postgres and the new insert benchmark, round 3

This has results for the insert benchmark on a medium server when Postgres caches the database. The previous post is for a similar setup except the database was cached by the OS but not by Postgres. 

tl;dr

  • the base config I have been using is good once I tuned autovacuum to be more aggressive
  • the big improvement is from adding autovacuum_vacuum_scale_factor=0.05 and autovacuum_vacuum_insert_scale_factor=0.05 
  • adding wal_compression=lz4 has a small benefit
  • several of the changes made performance much worse
  • increasing autovacuum_work_mem didn't help here but will probably help on IO-bound tests

Benchmark

This continues to use Postgres 15.3, a c2-standard-30 server with 15 cores, and the insert benchmark with delete_per_insert enabled so that the tables do not grow and the write-heavy benchmark steps can run for a long time while keeping the working set in memory.

The configurations all use shared_buffers=70G and the database is cached by Postgres. All of the config files are here. The base config file is cx7a_gcp_c2s30 and then the other config files make a small change to the base. The biggest win is already in the base config, and is the tuning done to make autovacuum more aggressive (see here). Of the configs, and using abbreviated names:

  • one of them has a significant benefit: a16
  • a few of them have a small benefit: a1, a2, a6, a9
  • many of them hurt performance: a5, a7, a10, a11, a12, a13, a14, a15, a17
  • a few of them can be ignored: a3, a4, a8
The configurations are:

  • conf.diff.cx7a1_gcp_c2s30 (a1) - adds wal_compression=lz4
  • conf.diff.cx7a2_gcp_c2s30 (a2) - adds backend_flush_after=256kb
  • conf.diff.cx7a3_gcp_c2s30 (a3) - adds full_page_writes=off (not for production!)
  • conf.diff.cx7a4_gcp_c2s30 (a4) - adds checkpoint_timeout=1800s, which was already set to 1800s in the base config so this can be ignored
  • conf.diff.cx7a5_gcp_c2s30 (a5) - adds min_wal_size=16384
  • conf.diff.cx7a6_gcp_c2s30 (a6) - adds backend_flush_after=256kb, initdb --wal-segsize 256
  • conf.diff.cx7a7_gcp_c2s30 (a7) - adds max_wal_size=16GB
  • conf.diff.cx7a8_gcp_c2s30 (a8) - adds shared_buffers=8GB, ignore this, I was just curious about the benefit of doubling shared buffers from the base config
  • conf.diff.cx7a9_gcp_c2s30 (a9) - adds checkpoint_timeout=900s
  • conf.diff.cx7a10_gcp_c2s30 (a10) - adds autovacuum_work_mem=256MB
  • conf.diff.cx7a11_gcp_c2s30 (a11) - adds autovacuum_work_mem=1GB
  • conf.diff.cx7a12_gcp_c2s30 (a12) - adds wal_buffers=64MB
  • conf.diff.cx7a13_gcp_c2s30 (a13) - adds wal_buffers=64MB, initdb --wal-segsize 256
  • conf.diff.cx7a14_gcp_c2s30 (a14) - adds backend_flush_after=1MB
  • conf.diff.cx7a15_gcp_c2s30 (a15) - adds max_wal_size=32GB
  • conf.diff.cx7a16_gcp_c2s30 (a16) - adds autovacuum_vacuum_scale_factor=0.05, autovacuum_vacuum_insert_scale_factor=0.05 (a sketch of this change follows the list)
  • conf.diff.cx7a17_gcp_c2s30 (a17) - reduces bgwriter_lru_maxpages from 5000 to 1000
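
As a rough sketch, the a16 change amounts to the postgresql.conf fragment below. This is an illustration rather than the literal contents of the conf.diff file, and it uses the full GUC names:

    # a16 sketch: trigger autovacuum when ~5% of a table has been updated/deleted
    # or inserted (the Postgres 15 defaults for both are 0.2)
    autovacuum_vacuum_scale_factor = 0.05
    autovacuum_vacuum_insert_scale_factor = 0.05
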
Results

I start by comparing the impact of the config changes with the results for the base config.

The summary is here. Relative to the base config the a16 config provides the largest improvement to average throughput for l.i1 (from 154202 to 249299 inserts/s). Some configs made it much worse than the base case: a5, a7, a8, a10, a11, a12, a13, a14, a15, a17.

Next I consider the impact on worst-case response time for inserts (see here). Some configs made it worse than the base case: a7, a10, a12, a13.

Finally, I look at the charts for per-second insert rates vs time (see here). The a16 config has the best average throughput and the chart for a16 is a bit different than the others.

The conclusion is that I should definitely use a16 and probably use a1, a2, a6 and a9 for this workload.

Wednesday, May 31, 2023

Tuning Postgres and the new insert benchmark, round 2

A previous post shared results from the insert benchmark for Postgres 15.3 with a variety of configurations. This has results for even more configurations to understand the impact of several tuning variables. The goal is to understand which config options improve average throughput and/or reduce throughput variance. The focus is on the l.i1 benchmark step (inserts with secondary index maintenance) as that is the largest source of stress for MVCC GC.

tl;dr (for this workload):

  • the base config I have been using is good once I tuned autovacuum to be more aggressive
  • adding wal_compression=lz4 helps a lot while pglz didn't help (in 2020, see here)
  • tuning autovacuum scale factors might help
  • several of the changes made performance much worse
  • increasing autovacuum_work_mem didn't help here but might help on IO-bound tests

Benchmark

This continues to use Postgres 15.3, a c2-standard-30 server with 15 cores, and the insert benchmark with delete_per_insert enabled so that the tables do not grow and the write-heavy benchmark steps can run for a long time while keeping the working set in memory.

The configurations all use shared_buffers=4G so that the database is cached by the OS but not by Postgres -- this kind-of simulates fast storage. Results where the database is cached by Postgres are pending.

All of the config files are here. The base config file is cx7a_4g_gcp_c2s30 and then the other config files make a small change to the base. The biggest win is already in the base config, and is the tuning done to make autovacuum more aggressive (see here). Of the configs, and using abbreviated names:

  • one of them has a significant benefit: a1
  • a few of them have a small benefit: a9, a10, a11
  • one had no impact: a17
  • many of them hurt performance: a2, a5, a6, a7, a12, a13, a14, a15
  • a few of them can be ignored: a3, a4, a8
The configurations are:

  • conf.diff.cx7a1_4g_gcp_c2s30 (a1) - adds wal_compression=lz4 (sketched just after this list)
  • conf.diff.cx7a2_4g_gcp_c2s30 (a2) - adds backend_flush_after=256kb
  • conf.diff.cx7a3_4g_gcp_c2s30 (a3) - adds full_page_writes=off (not for production!)
  • conf.diff.cx7a4_4g_gcp_c2s30 (a4) - adds checkpoint_timeout=1800s, which was already set to 1800s in the base config so this can be ignored
  • conf.diff.cx7a5_4g_gcp_c2s30 (a5) - adds min_wal_size=16384
  • conf.diff.cx7a6_4g_gcp_c2s30 (a6) - adds backend_flush_after=256kb, initdb --wal-segsize 256
  • conf.diff.cx7a7_4g_gcp_c2s30 (a7) - adds max_wal_size=16GB
  • conf.diff.cx7a8_4g_gcp_c2s30 (a8) - adds shared_buffers=8GB, ignore this, I was just curious about the benefit of doubling shared buffers from the base config
  • conf.diff.cx7a9_4g_gcp_c2s30 (a9) - adds checkpoint_timeout=900s
  • conf.diff.cx7a10_4g_gcp_c2s30 (a10) - adds autovacuum_work_mem=256MB
  • conf.diff.cx7a11_4g_gcp_c2s30 (a11) - adds autovacuum_work_mem=1GB
  • conf.diff.cx7a12_4g_gcp_c2s30 (a12) - adds wal_buffers=64MB
  • conf.diff.cx7a13_4g_gcp_c2s30 (a13) - adds wal_buffers=64MB, initdb --wal-segsize 256
  • conf.diff.cx7a14_4g_gcp_c2s30 (a14) - adds backend_flush_after=1MB
  • conf.diff.cx7a15_4g_gcp_c2s30 (a15) - adds max_wal_size=32GB
  • conf.diff.cx7a16_4g_gcp_c2s30 (a16) - adds autovacuum_vacuum_scale_factor=0.05, autovacuum_vacuum_insert_scale_factor=0.05
  • conf.diff.cx7a17_4g_gcp_c2s30 (a17) - reduces bgwriter_lru_maxpages from 5000 to 1000
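
For reference, the a1 change (the big win in this round) is a one-line addition to the base config, roughly the following; the linked conf.diff file has the exact contents:

    # a1 sketch: compress full-page images written to the WAL with LZ4
    wal_compression = lz4
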
Results

I start by comparing the impact of the config changes with the results for the base config.

The summary is here. Relative to the base config the a1 config provides the largest improvement to average throughput for l.i1 (from 110299 to 122963 inserts/s).

Next I consider the impact on worst-case response time for inserts (see here). About half of the configs have either no impact or a minor impact on that compared to the base config. Some of the configs (a6, a11, a12, a13, a14, a15) make the worst-case insert time much worse.

Finally, I look at the charts for per-second insert rates vs time (see here). From that the a1 config makes the graph look much better and the a16 config has a minor benefit. The results are here for base, a1 and a16.

The conclusion is that I should definitely use a1 (wal_compression=lz4) and probably use a9, one of a10/a11 and a16 for this workload.

Tuesday, May 30, 2023

Insert+delete benchmark, small server and MyRocks

This has results for an in-memory, low-concurrency insert benchmark on a small server. The previous report is here. This used the --delete_per_insert option so that the write-heavy steps ran for a long time while the working set remained in memory.

tl;dr

  • 5.6.35 has better perf than 8.0.28 courtesy of new CPU overheads in upstream MySQL 8
  • Variance is visible, but not horrible.

Benchmarks

The small server is a Beelink SER 4700u with 8 AMD cores, 16G of RAM and an NVMe SSD; low concurrency here means 1 and 4 clients.

An overview of the insert benchmark is here, here and here. The insert benchmark was run for a cached database with both 1 and 4 clients. For 1 client the benchmark used 1 table. For 4 clients the benchmark used 4 tables with a client per table. The read+write steps (q100.1, q500.1, q1000.1) were run for 1800 seconds each. The --delete_per_insert option was set for l.i1, q100.1, q500.1 and q1000.1.

Benchmarks were repeated for two configurations:
  • cached by RocksDB - all data fits in the RocksDB block cache
  • cached by OS - all data fits in the OS page cache but not the RocksDB block cache. For the small server RocksDB block cache size was set to 1G.
The my.cnf files are here for:
The benchmark is a sequence of steps.

  • l.i0
    • insert 20 million rows across all tables without secondary indexes
  • l.x
    • create 3 secondary indexes. I usually ignore results from this step.
  • l.i1
    • insert another 50 million rows across all tables with secondary index maintenance. Also delete 50M rows across all tables. The number of rows/table at the end of the benchmark step matches the number at the start. The inserts are done to the table head and the deletes are done from the tail.
  • q100.1
    • do queries as fast as possible with 100 inserts/s/client and the same rate for deletes/s done in the background
  • q500.1
    • do queries as fast as possible with 500 inserts/s/client and the same rate for deletes/s done in the background
  • q1000.1
    • do queries as fast as possible with 1000 inserts/s/client and the same rate for deletes/s done in the background
MyRocks

Tests used MyRocks from FB MySQL 5.6.35 and 8.0.28 with different builds to test compiler optimizations. The builds are described in a previous post and I tested the rel build for MyRocks 5.6.35 and the rel_native_lto build for 8.0.28.

Reports

Reports for cached by RocksDB:
Reports for cached by OS:
The first thing to acknowledge is the performance regression. There is more CPU overhead in MySQL 8.0 than 5.6 (in non-RocksDB code) so 8.0 gets less throughput than 5.6 in many cases. The following table shows the throughput in 8.0.28 relative to 5.6.35 and a value < 1.0 means 8.0 is slower. These results are from the Summary tables in the reports linked above.
  • The largest regression is in l.i0
  • Regressions are slightly worse for the 1-client case than the 4-client case

Throughput from 8.0.28 / 5.6.35

l.i0    l.x     l.i1    q100    q500    q1000   cached
-- 1 client
0.80    0.95    0.89    0.93    0.92    0.92    by RocksDB
0.77    0.94    0.88    0.91    0.92    0.92    by OS
-- 4 clients
0.82    0.93    1.02    0.99    0.96    0.95    by RocksDB
0.82    0.94    1.02    0.95    0.97    0.91    by OS

Other performance artifacts:

  • 1 client, cached by RocksDB
    • l.i1 - for 8.0.28 but not for 5.6.35 the max insert/delete response time charts show two levels, one at ~10ms and the other at ~20ms. See here.
    • q1000 - the QPS vs time chart shows a sawtooth pattern. This might be from changes in the CPU overhead/query as memtable flushes and L0->L1 compaction are done. There is more variance in the 8.0.28 result. See here.
  • 1 client, cached by OS
    • l.i0 & l.i1 - for 8.0.28 but not for 5.6.35 the max insert response time chart has 2 levels (one at ~10ms, the other at ~20ms). See here for l.i0 and for l.i1.
    • q1000 - the QPS vs time chart shows a sawtooth pattern. See here.
  • 4 clients, cached by RocksDB
    • l.i0 - the insert/s rate tails off, but the test finishes in 90s and that is too short from which to draw strong conclusions. See here.
    • q100 - QPS has variance and perhaps a sawtooth pattern. See here.
    • q500 - there are extended (>= 100s) periods of elevated delete response time but not for inserts. They are worse for 8.0.28 than for 5.6.35. See here.
    • q1000 - there are extended periods during which delete and query response time jump from <= 5ms to ~10ms. See here.
  • 4 clients, cached by OS
    • l.i0 - similar to 4 clients, cached by RocksDB. See here.
    • l.i1 - insert and delete response times have two levels. See here.
    • q100 - similar to 4 clients, cached by RocksDB. See here.
    • q500 - similar to 4 clients, cached by RocksDB plus QPS has a sawtooth pattern. See here.
    • q1000 - delete and insert response times for 8.0.28 have a permanent step halfway into the run. See here.

Updates to the insert benchmark

The insert benchmark can now be called the insert+delete benchmark.

While I continue to work on the replacement for the insert benchmark (see insert benchmark v3), progress on that is slow. I recently enhanced the insert benchmark to let me run the write-heavy benchmark steps while keeping the working set in memory, and this allows me to run the write-heavy steps for much longer than a few minutes and search for problems in MVCC GC implementations.

This is done via the --delete_per_insert option that works as advertised -- a delete is done per insert to avoid growing the number of rows in a benchmark table. Note that the size of the database files might grow if MVCC GC gets unhappy, but that is the point of the insert benchmark: it exists to make MVCC GC unhappy.

More information on the insert benchmark is here and here.

The steps in the insert benchmark used to be:

  1. l.i0 - Inserts in PK order. The PK is on the transactionid column. The benchmark table has a PK index but no secondary indexes. This should be fast for a (right growing) b-tree regardless of the buffer pool size. Of course it is faster for an LSM.
  2. l.x - Create 3 secondary indexes on each benchmark table.
  3. l.i1 - More inserts in PK order. These are slower than l.i0 because there are 3 secondary indexes that require maintenance. The drop in performance is much larger for a b-tree than an LSM, especially when the working set is larger than memory because secondary index maintenance is read-modify-write for a b-tree but read-free for MyRocks.
  4. q100 - Do queries as fast as possible and for each query client there is another client that does 100 inserts/s.
  5. q500 - Do queries as fast as possible and for each query client there is another client that does 500 inserts/s.
  6. q1000 - Do queries as fast as possible and for each query client there is another client that does 1000 inserts/s.
The steps l.i1, q100, q500 and q1000 have been modified for the insert+delete benchmark. In each case there is a connection that does deletes in addition to the connection that does inserts. The delete rate (delete/s) is <= the insert rate but the goal is for them to be equal.

Implementation notes:
  • Each connection runs in a separate thread (or process because I use Python multiprocessing).
  • The default way to run this is with a client per table. But sometimes I run it so that all clients share the same table. The client per table option means I encounter fewer perf problems from contention.
  • The --delete_per_insert option only works with a client per table config.
  • With --delete_per_insert the number of inserts/s done by a client is shared with the delete connection so that deletes run as fast as inserts, but no faster. They can run slower, but that would be a performance problem.
  • Deletes are done from the tail of the table. When N rows are to be deleted, the delete client deletes the N rows with the smallest transactionid values. So the benchmark table is like a queue -- inserts to the head, deletes from the tail. A SQL sketch of this pattern follows the list.
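
Here is a minimal SQL sketch of that queue pattern. The table and column names are made up for illustration and are not the actual insert benchmark schema; MySQL syntax is used because DELETE ... ORDER BY ... LIMIT is supported there.

    -- assumed schema, not the real benchmark table
    CREATE TABLE pi1 (transactionid BIGINT PRIMARY KEY, price FLOAT, dateandtime DATETIME);
    -- inserts go to the head: each insert uses the next larger transactionid
    INSERT INTO pi1 VALUES (1000001, 1.5, NOW());
    -- deletes trim the tail: remove the N rows with the smallest transactionid values
    DELETE FROM pi1 ORDER BY transactionid ASC LIMIT 100;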

Saturday, May 27, 2023

Postgres 16beta1 looks good vs sysbench

This has results for Postgres 16 beta1 using sysbench on a small server with low-concurrency and a cached database. The goal is to determine whether there are CPU regressions from Postgres 15.3 to 16-beta1.

tl;dr

  • There were no regressions
  • A few queries are ~1.7X faster in PG 16-beta1
Benchmark

A description of how I run sysbench is here. The sysbench microbenchmarks were run for 20 clients, 300 (600) seconds per read (write) microbenchmark, using 1 table with 20M rows. The test database was <= 6G and fits in the Postgres buffer pool.

The test server is a Beelink SER 4700u with 8 cores, 16G of RAM, 1T of NVMe SSD with XFS running Ubuntu 22.04. I compiled Postgres from source using gcc 11.3.

The config file, conf.diff.cx8_bee, is here for Postgres 15.3 and 16-beta1.

The previous post explains the builds that I used where each build uses different compiler optimizations. I used the def, o2_nofp and o3_native_lto builds and ran the benchmark 6 times -- once per build for each of Postgres 15.3 and 16-beta1.

I use sysbench to run 42 microbenchmarks and each microbenchmark is put in one of three groups based on the dominant operation: point query, range query, writes.

Results

The table below lists the relative throughput which is: (QPS for PG 16-beta1 / QPS for PG 15.3). When the relative throughput is less than 1 then PG 16-beta1 is slower than PG 15.3 and there might be a CPU regression. 

With a few exceptions the relative throughput is between 0.97 and 1.04. I consider values between 0.97 and 1.0 to be noise rather than a regression so I will declare there are no regressions. The exceptions are:
  • point-query.pre_range=100 which has a noisy relative throughput of 0.89, 0.97 and 1.07. This benchmark step runs early in the benchmark and I assume this is just noise.
  • scan_range=100 which has a noisy relative throughput of 1.00, 0.92, 0.98. I have yet to explain it but the scan benchmark step seems to have more noise so I ignore this.
  • read-only.pre_range=10000 which has a relative throughput of 1.60, 1.46, 1.63
  • read-only.range=10000 which has a relative throughput of 1.50, 1.39, 1.50
The results show that PG 16-beta1 does ~1.5X better for the read-only.*range=10000 benchmark step which is a big improvement. That is discussed in the Follow Up section below.
 
Legend:
* def - uses the def build
* o2 - uses the o2_nofp build
* o3 - uses the o3_native_lto build

------ build ------     --- benchmark step ---
def     o2      o3
1.00    0.97    1.01    hot-points_range=100
0.89    0.97    1.07    point-query.pre_range=100
1.00    1.02    1.01    point-query_range=100
1.00    1.03    1.03    points-covered-pk.pre_range=100
0.99    1.04    1.03    points-covered-pk_range=100
1.00    1.03    1.04    points-covered-si.pre_range=100
0.99    1.01    1.03    points-covered-si_range=100
1.00    1.00    1.01    points-notcovered-pk.pre_range=100
1.00    1.00    1.02    points-notcovered-pk_range=100
1.00    1.00    1.02    points-notcovered-si.pre_range=100
1.00    1.00    1.01    points-notcovered-si_range=100
0.99    0.99    1.00    random-points.pre_range=1000
1.00    1.00    0.99    random-points.pre_range=100
1.01    0.99    1.02    random-points.pre_range=10
0.99    1.00    1.01    random-points_range=1000
1.00    1.00    1.01    random-points_range=100
1.02    0.99    1.02    random-points_range=10
0.99    0.98    1.01    range-covered-pk.pre_range=100
1.00    0.99    1.01    range-covered-pk_range=100
0.98    0.98    1.01    range-covered-si.pre_range=100
1.00    0.97    1.01    range-covered-si_range=100
0.97    0.98    0.99    range-notcovered-pk.pre_range=100
0.98    0.97    1.00    range-notcovered-pk_range=100
0.98    1.00    1.00    range-notcovered-si.pre_range=100
0.98    0.99    1.00    range-notcovered-si_range=100
1.60    1.46    1.63    read-only.pre_range=10000
1.00    0.99    1.04    read-only.pre_range=100
1.01    0.98    0.99    read-only.pre_range=10
1.50    1.39    1.50    read-only_range=10000
1.01    0.99    1.02    read-only_range=100
1.00    1.00    0.97    read-only_range=10
1.00    0.92    0.98    scan_range=100
1.02    1.01    1.04    delete_range=100
1.01    1.02    0.99    insert_range=100
1.02    1.00    1.03    read-write_range=100
1.00    0.98    0.98    read-write_range=10
1.00    0.99    0.98    update-index_range=100
1.00    0.98    1.00    update-inlist_range=100
1.01    1.00    0.98    update-nonindex_range=100
1.01    0.97    0.98    update-one_range=100
1.02    0.99    0.97    update-zipf_range=100
1.01    1.02    1.00    write-only_range=10000

Follow up

The read-only.pre_range=10000 and read-only.range=10000 benchmark steps run a read-only transaction that has 5 types of queries. They use the same queries but read-only.pre_range=10000 runs before a large number of write-heavy benchmark steps while read-only.range=10000 runs after them. The range=10000 in the name indicates that the range queries in the read-only transaction each scan 10,000 rows.

The Lua for the benchmark step is oltp_read_only.lua which relies on oltp_common.lua. The loop that does the queries is here.

There are five types of queries in the read-only*range=10000 benchmark step. The first (q1) is run 10 times per read-only transaction while the others are run once. For q2, q3, q4, q5 the value of N is 10,000. The read-only transaction is run in a loop and sysbench reports the number of transactions/second.
  • q1
    • fetch one row by PK
    • SELECT c FROM sbtest WHERE id=?
  • q2
    • scan N rows via PK index, return 1 column/row
    • SELECT c FROM sbtest WHERE id BETWEEN ? AND ?
  • q3
    • scan N rows via an index, return sum of one column
    • SELECT SUM(k) FROM sbtest WHERE id BETWEEN ? AND ?
  • q4
    • like q2 but sorts the result
    • SELECT c FROM sbtest WHERE id BETWEEN ? AND ? ORDER BY c
  • q5
    • like q4 but adds DISTINCT
    • SELECT DISTINCT c FROM sbtest WHERE id BETWEEN ? AND ? ORDER BY c
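
Putting the five queries together, one read-only transaction for range=10000 has roughly the shape below. The table name sbtest1 and the parameter values are examples; the Lua driver picks the actual values at random.

    BEGIN;
    -- q1, run 10 times per transaction
    SELECT c FROM sbtest1 WHERE id = 5000;
    -- q2: range scan returning one column per row
    SELECT c FROM sbtest1 WHERE id BETWEEN 1 AND 10000;
    -- q3: range scan returning an aggregate
    SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN 1 AND 10000;
    -- q4: like q2 plus a sort
    SELECT c FROM sbtest1 WHERE id BETWEEN 1 AND 10000 ORDER BY c;
    -- q5: like q4 plus DISTINCT
    SELECT DISTINCT c FROM sbtest1 WHERE id BETWEEN 1 AND 10000 ORDER BY c;
    COMMIT;
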
To explain the perf improvement from PG 15.3 to 16-beta1 I tried the following:
  • Looked at query plans - they did not change. Plans are here.
  • Ran each of the queries separately
    • performance for q1, q2 and q3 did not change
    • q4 is ~1.7X faster in PG 16-beta1
    • q5 is ~1.5X faster in PG 16-beta1
  • Started to look at CPU profiles from the perf tool. Unfortunately, my work on this was incomplete so I might revisit this in a few days and for now will mumble something about PG 16 using less CPU thanks to ICU and less time doing sorts.

Tuesday, May 23, 2023

Tuning Postgres and the new insert benchmark

This post explains my work in progress to tune the configurations I use for Postgres. I support a DBMS that needs some tuning, but I am wary of a DBMS that needs too much tuning. I assume that Postgres is the former, and at least for me so is MyRocks/RocksDB. I wish it were possible to do tuning in two parts, but I am not aware of a DBMS that supports such an approach today:
  1. Describe the HW capacity the DBMS can use (RAM, number of cores, IO throughput)
  2. Describe the desired SLA (favor reads vs writes, what matters WRT read, write & space amp)
I am a big fan of the Postgres community. I write about my good and bad experiences with Postgres. I do my best to avoid snark and in return I get advice from experts. 

Updates:
  • Once again I made mistakes with huge pages. I need to repeat the cached by OS benchmarks and will replace the results I shared today.
My latest round of this is for the insert benchmark. It has 6 steps and they used to be:
  1. Insert in PK order with a PK index and without secondary indexes
  2. Create 3 secondary indexes
  3. Insert in PK order, but random relative to the secondary indexes, with a PK index and 3 secondary indexes. All indexes require maintenance per insert.
  4. Do queries as fast as possible. Do rate-limited inserts (100/s/client).
  5. Do queries as fast as possible. Do rate-limited inserts (500/s/client).
  6. Do queries as fast as possible. Do rate-limited inserts (1000/s/client).
The initial usage was for data sets larger than memory and I could control the benchmark step in which that occurred based on how I setup the benchmark. Eventually I started to use it for in-memory workloads to search for CPU regressions. Unfortunately when an insert-only workload runs for enough time the workload eventually stops being in-memory. And on a modern server I usually can't run benchmark steps 1 and 3 for more than a few minutes before the database is larger than memory.

This is a problem when I want to check that b-tree writeback/checkpoint and LSM compaction provide stable (minimal variance) throughput over time. So I finally did it -- I added an option to the insert benchmark (--delete_per_insert) that optionally does a delete per insert to keep the database from growing and I use this for benchmark steps 3, 4, 5 and 6 above. The inserts done in those steps use a separate thread (and separate database connection). The deletes do the same, and just match the rate of inserts. The deletes are done from the tail of the table. The table has a PK on transactionid, inserts use increasing transactionid values and deletes delete the N rows with the smallest transactionid values.
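
Postgres does not support DELETE ... ORDER BY ... LIMIT, so the batched delete-from-the-tail can be expressed with a subquery. This is a minimal sketch with an assumed table name, not necessarily how the benchmark client issues the deletes:

    -- delete the 100 rows with the smallest transactionid values (the tail)
    DELETE FROM pi1
    WHERE transactionid IN
      (SELECT transactionid FROM pi1 ORDER BY transactionid ASC LIMIT 100);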

First up with the new and improved insert benchmark, or insert & delete benchmark, is Postgres. I am using the workload to improve the Postgres configuration that I use, with much help from Postgres experts.

Setup

I ran two variants of the benchmark on a c2-standard-30 server that has 15 cores, hyperthreads disabled, 120G of RAM and XFS on SW RAID 0 across 4 NVMe devices. The variants are:
  • cached by Postgres - database is cached by Postgres
  • cached by OS - database is cached by the OS page cache but not by Postgres
The config files that I use are here and for cached by OS I use shared_buffers=4G. The benchmark uses 8 tables with a client per table. The benchmark steps are:
  1. Insert 20M rows/table in parallel
  2. Create 3 secondary indexes per table (I ignore perf from this step)
  3. Insert 100M rows/table in parallel with a delete per insert
  4. Do queries as fast as possible with 100 inserts/s/client and a delete per insert
  5. Do queries as fast as possible with 500 inserts/s/client and a delete per insert
  6. Do queries as fast as possible with 1000 inserts/s/client and a delete per insert
Results

The benchmark reports are here for cached by Postgres and cached by OS.

While I have yet to get results for the insert+delete benchmark from InnoDB or MyRocks, I see more variance than desired with the Postgres configurations that I tested.

Things that interest me are:
  • Average throughput for l.i1 (benchmark step 3) for cached by Postgres and by OS
  • Response time distribution for inserts during l.i1 for cached by Postgres and by OS
  • Insert rate at 1-second intervals during l.i1 for cached by Postgres and by OS
From the above I rank the impact of changes to the base configuration file I used. The base config is here for cached by Postgres and cached by OS. It already benefits from changes I did to increase the rate at which autovacuum runs (see here). With those changes the database doesn't get larger than ~50G. Without them it grew to be much larger than memory.

But there were other changes suggested by Postgres experts and I tried many of them. The base config for cached by Postgres is named conf.diff.cx7a_gcp_c2s30. There are 9 variations that I tried -- 8 suggested by others, one I added. Using the cached by Postgres names they are:
  • conf.diff.cx7a1_gcp_c2s30 - adds wal_compression=lz4
  • conf.diff.cx7a2_gcp_c2s30 - adds backend_flush_after=256kb
  • conf.diff.cx7a3_gcp_c2s30 - adds full_page_writes=off (not for production!)
  • conf.diff.cx7a4_gcp_c2s30 - adds checkpoint_timeout=1800s
  • conf.diff.cx7a5_gcp_c2s30 - adds min_wal_size=16384
  • conf.diff.cx7a6_gcp_c2s30 - adds backend_flush_after=256kb, initdb --wal-segsize 256, inherits larger value for wal_buffers
  • conf.diff.cx7a7_gcp_c2s30 - adds max_wal_size=16GB
  • conf.diff.cx7a8_gcp_c2s30 - adds shared_buffers=8GB (hurts perf for cached by Postgres, improves it for cached by OS because by OS base config has shared_buffers=4GB)
  • conf.diff.cx7a9_gcp_c2s30 - adds checkpoint_timeout=900s
I judged the impact from each of the config file variants listed above. The a1, a3, a4 and a9 configs help performance. The a2, a5 and a7 configs hurt performance. The a6 config hurts perf in 3 cases and helps it in one.

Legend for header:
* config - the abbreviated config name (a1 ... a9)
* avg.tp - impact on average insert rate
* rt - impact on response time distribution

Legend for symbols:
+     -> some improvement
++    -> much improvement
-     -> some regression
--    -> much regression
blank -> not much impact
NA    -> ignored for now

        avg.tp                  rt
config  byOS    byPG    ::      byOS    byPG
a1      +       +
a2      --                      --      --
a3      ++      ++              +       +
a4      +
a5      -       -
a6      --      +               --
a7      --      --              --      --
a8      NA      NA              NA      NA
a9      +       +

Compiling all releases of MySQL 8.0

This explains my experiences compiling all GA releases of MySQL 8.0. The first GA release is 8.0.11, the latest is 8.0.33 and 8.0.29 doesn't exist. That means there are 22 point releases including 8.0.33.

My build platform is Ubuntu 22.04 which has gcc 11.3.0. The first challenge is OpenSSL -- MySQL 8.0.28 and prior releases need version 1, 8.0.30 and later need version 3, and Ubuntu 22.04 includes version 3. The workaround is to download, compile and install version 1.

./config --prefix=/foo/bar/d/ssl --openssldir=/foo/bar/d/ssl shared

make; make install

The second challenge was adding missing include files to avoid compiler failures. I always disabled compiling router because there were build failures early in the 8.0 cycle. Perhaps those have been fixed but I don't need the feature and prefer faster build times so I never checked.


I have no idea whether these build failures occurred on older versions of Ubuntu and/or gcc. Perhaps there is a story for the missing include file errors that linger across multiple releases.
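
The per-release CMake command lines are linked in the sections below. As a rough sketch of their common shape -- the paths are placeholders, CMAKE_BUILD_TYPE is my assumption, and the extra flags only apply to the releases that need them:

    # placeholder paths; -fPIC and -DWITH_ROUTER=OFF are only needed where noted below
    cmake /path/to/mysql-src \
      -DCMAKE_BUILD_TYPE=Release \
      -DWITH_SSL=/path/to/openssl1 \
      -DWITH_ROUTER=OFF \
      -DCMAKE_C_FLAGS="-fPIC" -DCMAKE_CXX_FLAGS="-fPIC"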


MySQL 8.0.11


Changes:

  • added -fPIC to the CFLAGS and CXXFLAGS to avoid odd linker errors
  • added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
  • added include <stdexcept> to client/mysqltest.c
  • added include <cstdint> to include/my_hash_combine.h
  • added include <stdexcept> to plugin/x/ngs/src/thread.cc
  • added declarations to materialize templates in sql/dd/impl/cache/storage_adapter.cc
The CMake command line is here.
The diffs to upstream source are here.

MySQL 8.0.12


Changes:

  • added -fPIC to the CFLAGS and CXXFLAGS to avoid odd linker errors
  • added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
  • added include <stdexcept> to client/mysqltest.c
  • added include <cstdint> to include/my_hash_combine.h
The CMake command line is here.
The diffs to upstream source are here.

MySQL 8.0.13

Changes:

  • added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
  • added include <stdexcept> to client/mysqltest.c
  • added include <cstdint> to include/my_hash_combine.h
  • added include <stdexcept> to router/src/http/include/mysqlrouter/http_common.h
  • added include <stdexcept> to router/src/router/include/mysqlrouter/uri.h
  • disabled router in top-level CMakeLists.txt because of build failures and the CMake option to disable it didn't arrive until a later point release
The CMake command line is here.
The diffs to upstream source are here.

MySQL 8.0.14

Changes:

  • added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
  • added include <stdexcept> to client/mysqltest.c
  • added include <cstdint> to include/my_hash_combine.h
  • added include <limits> to sql-common/sql_string.cc
  • disabled router in top-level CMakeLists.txt because of build failures and the CMake option to disable it didn't arrive until a later point release
The CMake command line is here.
The diffs to upstream source are here.

MySQL 8.0.15

Changes:

  • added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
  • added include <stdexcept> to client/mysqltest.c
  • added include <cstdint> to include/my_hash_combine.h
  • added include <limits> to sql-common/sql_string.cc
  • disabled router in top-level CMakeLists.txt because of build failures and the CMake option to disable it didn't arrive until a later point release
The CMake command line is here.
The diffs to upstream source are here.

MySQL 8.0.16

Changes:

  • added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
  • added include <cstdint> to include/my_hash_combine.h
  • added include <limits> to sql-common/sql_string.cc
  • added include <cstdint> & <cstddef> to include/mysql/components/services/page_track_service.h
  • disabled router to avoid build errors via -DWITH_ROUTER=OFF
The CMake command line is here.
The diffs to upstream source are here.

MySQL 8.0.17

Changes:

  • added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
  • added include <cstdint> to include/my_hash_combine.h
  • added include <limits> to sql-common/sql_string.cc
  • added include <cstdint> & <cstddef> to include/mysql/components/services/page_track_service.h
  • disabled router to avoid build errors via -DWITH_ROUTER=OFF
The CMake command line is here.
The diffs to upstream source are here.

MySQL 8.0.18

Changes:

  • added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
  • added include <cstdint> to include/my_hash_combine.h
  • added include <limits> to sql-common/sql_string.cc
  • added include <cstdint> & <cstddef> to include/mysql/components/services/page_track_service.h
  • added include <string> to sql/rpl_utility.h
  • disabled router to avoid build errors via -DWITH_ROUTER=OFF
The CMake command line is here.
The diffs to upstream source are here.

MySQL 8.0.19

Changes:

  • added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
  • added include <limits> to sql-common/sql_string.cc
  • added include <cstdint> & <cstddef> to include/mysql/components/services/page_track_service.h
  • disabled router to avoid build errors via -DWITH_ROUTER=OFF
The CMake command line is here.
The diffs to upstream source are here.

MySQL 8.0.20

Changes:
  • added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
  • added include <limits> to sql-common/sql_string.cc
  • added include <cstdint> & <cstddef> to include/mysql/components/services/page_track_service.h
  • disabled router to avoid build errors via -DWITH_ROUTER=OFF
The CMake command line is here.
The diffs to upstream source are here.

MySQL 8.0.21

Changes:
  • added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
  • added include <limits> to sql-common/sql_string.cc
  • added include <cstddef> to include/mysql/components/services/page_track_service.h
  • disabled router to avoid build errors via -DWITH_ROUTER=OFF
The CMake command line is here.
The diffs to upstream source are here.

MySQL 8.0.22

Changes:
  • added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
  • added include <limits> to sql-common/sql_string.cc
  • added include <cstddef> to include/mysql/components/services/page_track_service.h
  • disabled router to avoid build errors via -DWITH_ROUTER=OFF
The CMake command line is here.
The diffs to upstream source are here.

MySQL 8.0.23

Changes:
  • added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
  • added include <limits> to extra/robin-hood-hashing/robin_hood.h
  • added include <limits> to sql-common/sql_string.cc
  • added include <cstddef> to include/mysql/components/services/page_track_service.h
  • disabled router to avoid build errors via -DWITH_ROUTER=OFF
The CMake command line is here.
The diffs to upstream source are here.

MySQL 8.0.24

Changes:
  • added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
  • disabled router to avoid build errors via -DWITH_ROUTER=OFF
The CMake command line is here.
Upstream source compiled as-is.

MySQL 8.0.25

Changes:
  • added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
  • disabled router to avoid build errors via -DWITH_ROUTER=OFF
The CMake command line is here.
Upstream source compiled as-is.

MySQL 8.0.26

Changes:
  • added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
  • disabled router to avoid build errors via -DWITH_ROUTER=OFF
The CMake command line is here.
Upstream source compiled as-is.

MySQL 8.0.27

Changes:
  • added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
  • disabled router to avoid build errors via -DWITH_ROUTER=OFF
The CMake command line is here.
Upstream source compiled as-is.

MySQL 8.0.28

Changes:
  • added -DWITH_SSL=/foo/bar/... to get OpenSSL 1
  • disabled router to avoid build errors via -DWITH_ROUTER=OFF
The CMake command line is here.
Upstream source compiled as-is.

MySQL 8.0.30

Changes:
  • disabled router to avoid build errors via -DWITH_ROUTER=OFF
The CMake command line is here.
Upstream source compiled as-is.

MySQL 8.0.31

Changes:
  • disabled router to avoid build errors via -DWITH_ROUTER=OFF
The CMake command line is here.
Upstream source compiled as-is.

MySQL 8.0.32

Changes:
  • disabled router to avoid build errors via -DWITH_ROUTER=OFF
The CMake command line is here.
Upstream source compiled as-is.

MySQL 8.0.33

Changes:
  • disabled router to avoid build errors via -DWITH_ROUTER=OFF
The CMake command line is here.
Upstream source compiled as-is.