Tuesday, September 26, 2023

Trying out OrioleDB

I am curious about OrioleDB given my unsuccessful attempts to resolve some Postgres perf issues for the Insert Benchmark via tuning. 

The first step is to compile and install their Postgres fork from source.

  git clone https://github.com/orioledb/postgres.git oriole_pg
  cd oriole_pg
  git checkout patches16_18

  bash mk.pg.o3_native_lto ~/d/pg16oriole
  make install
  cd contrib/pgstattuple
  make install
  
The next step is to compile and install the OrioleDB extension from source. The build has to find pg_config so I updated my PATH for that.
  PATH=$PATH:~/d/pg16oriole/bin
  make USE_PGXS=1
  make USE_PGXS=1 install
  # installcheck requires testgres
  pip3 install testgres
  make USE_PGXS=1 installcheck

There is a test failure during make installcheck (see issue 275) but that doesn't matter to me today. I am not trying to use this in production.

The next step is to modify my Postgres init scripts so I can use OrioleDB. I lack many basic PG admin skills and stumbled my way through the locales issue. I hope OrioleDB updates their README to explain that.

  cd ~/d/pg16oriole

  # Add this to the Postgres configuration file
  # shared_preload_libraries = 'orioledb.so'

  # Add this to my init scripts. 
  #    First make the test database use the C locale because OrioleDB
  #        tables must use ICU, C or POSIX locales.
  #    For this to work I must use template0.
  bin/createdb ib -l "C" -T template0

  # Add this to my init scripts.
  bin/psql ib -c 'create extension orioledb'

  # Finally initialize Postgres
  bash ${insert-init-script-here}

The final step is to modify my benchmark clients to create OrioleDB tables, which is done by appending "using orioledb" to the end of the CREATE TABLE statements.
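For example, a minimal sketch with a made-up table (the real Insert Benchmark tables have more columns):

  bin/psql ib -c 'create table t1 (id bigint primary key, v bigint) using orioledb'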

Configuration

The next step is to tune some of the OrioleDB options based on the advice here.

I am using this on a small (8 cores, 16G RAM) server.
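I have not settled on values yet, but the settings involved look something like the sketch below. shared_preload_libraries comes from the setup above; orioledb.main_buffers is my assumption for the name of the option that sizes OrioleDB's own buffer pool (check the OrioleDB docs) and the value is a guess for a 16G server.

  # Postgres configuration file
  shared_preload_libraries = 'orioledb.so'
  # assumption: sizes OrioleDB's buffer pool; value is a guess for 16G of RAM
  orioledb.main_buffers = 4GB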

Checking MyRocks 5.6 for regressions with the Insert Benchmark and a large server, revisited

I previously shared results for MyRocks 5.6 and claimed there was a perf regression in the Insert Benchmark. I then shared a follow-up post as I searched for the source of the regression. Those results were bogus: there is no regression, and this post explains what happened.

The mistake I made is that all builds used FDO (feedback driven optimization), and while the builds tested span ~18 months (from early 2022 until mid 2023) they all used the same profile input for FDO. While I don't know much about FDO, I assume that is a bad idea. So I recompiled everything with FDO disabled and repeated the tests.

The results here use MyRocks builds from Feb 2022 to June 2023 which use RocksDB versions from 6.28 to 8.6. The goal is to determine whether there are perf regressions from the old to new versions. While MyRocks 5.6 did not change much in that time, RocksDB has changed significantly. So this is mostly a search for perf regressions in RocksDB.

tl;dr

  • There are no perf regressions for MyRocks from RocksDB 6.28.2 to 8.6.3
  • Throughput for the initial load (l.i0 benchmark step) is ~10% better starting with RocksDB 7.10.0
Builds

All builds used the Release build type and LTO (link time optimization) but disabled FDO (feedback driven optimization).

The builds are:

  • fbmy5635_20220307_6282_fdo0_lto1 - FB MySQL 5.6.35 at git hash e7d976ee (7 Mar 2022 tag) with RocksDB 6.28.2
  • fbmy5635_20220519_722_fdo0_lto1 - FB MySQL 5.6.35 at git hash d503bd77 (19 May 2022 tag) with RocksDB 7.2.2
  • fbmy5635_20220809_731_fdo0_lto1 - FB MySQL 5.6.35 at git hash 877a0e58 (9 Aug 2022 tag) with RocksDB 7.3.1
  • fbmy5635_20221011_731_fdo0_lto1 - FB MySQL 5.6.35 at git hash c691c716 (11 Oct 2022 tag) with RocksDB 7.3.1
  • fbmy5635_20230216_7100_fdo0_lto1 - FB MySQL 5.6.35 at git hash 21a2b0aa (16 Feb 2023 tag) with RocksDB 7.10.0
  • fbmy5635_20230412_7102_fdo0_lto1 - FB MySQL 5.6.35 at git hash 205c31dd (12 Apr 2023 tag) with RocksDB 7.10.2
  • fbmy5635_20230529_821_fdo0_lto1 - FB MySQL 5.6.35 at git hash b739eac1 (29 May 2023 tag) with RocksDB 8.2.1
  • fbmy5635_20230628_821_fdo0_lto1 - FB MySQL 5.6.35 at git hash 7e40af67 (28 Jun 2023 tag) with RocksDB 8.2.1
  • fbmy5635_20230628_833_fdo0_lto1 - FB MySQL 5.6.35 at git hash 7e40af67 (28 Jun 2023 tag) upgraded to RocksDB 8.3.3
  • fbmy5635_20230628_844_fdo0_lto1 - FB MySQL 5.6.35 at git hash 7e40af67 (28 Jun 2023 tag) upgraded to RocksDB 8.4.4
  • fbmy5635_20230628_853_fdo0_lto1 - FB MySQL 5.6.35 at git hash 7e40af67 (28 Jun 2023 tag) upgraded to RocksDB 8.5.3
  • fbmy5635_20230628_863_fdo0_lto1 - FB MySQL 5.6.35 at git hash 7e40af67 (28 Jun 2023 tag) upgraded to RocksDB 8.6.3
The c5 configuration file (my.cnf) was used.

Benchmark

The Insert Benchmark was run in two setups:

  • cached by RocksDB - all tables fit in the RocksDB block cache
  • IO-bound - the database is larger than memory
The server has 80 HW threads, 40 cores, 256G of RAM and fast NVMe storage with XFS.

The benchmark is run with 24 clients, 24 tables and a client per table. The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows across all tables without secondary indexes where X is 20 for cached and 500 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 50 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail.
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.
Results

Performance reports are here for Cached by RocksDB and for IO-bound.

From the Summary sections for cached and for IO-bound there are tables for absolute and relative QPS per benchmark step. The relative QPS is (QPS for my version / QPS for fbmy5635_20220307_6282_fdo0_lto1).

For the cached setup, throughput for the l.i0 benchmark step (initial load) in modern MyRocks is up to 10% better than the base case. Otherwise there are no regressions.

For the IO-bound setup, throughput for the l.i0 benchmark step (initial load) in modern MyRocks is up to 10% better than the base case. Most other benchmark steps get 1% or 2% more QPS in modern MyRocks than the base case.


Monday, September 25, 2023

Checking Postgres for perf regressions from 11.21 to 16.0 with sysbench and a small server

This has results for sysbench, a small server and Postgres 11, 12, 13, 14, 15 and 16. The goal is to see if there are perf regressions over time. This is a common problem: as code gets added over time, the cumulative impact is more CPU used to do the same amount of work. MySQL is an example where there are significant regressions over time.

Context - the workload here is a cached database with 1 client. I am searching for CPU regressions. I will repeat this on a medium server with more concurrency (8 clients).

tl;dr

  • Postgres avoids perf regressions over time. This is starting to get boring.
  • In many cases Postgres 16.0 is ~5% faster than 11.21
  • For full scans Postgres 16.0 is ~20% faster than 11.21
  • Postgres 16.0 is also ~20% faster than 11.21 on several of the update-only benchmark steps
  • See the results for range query, part 2 below. Postgres 16.0 does better than 11.21 for aggregation + short range scan, but then worse than 11.21 for aggregation + long range scan. These benchmark steps are read-only variants of the traditional sysbench workload. This might be worth investigating. I am not sure whether it is a regression, but it might be one.

Builds

I compiled Postgres 11.21, 12.16, 13.12, 14.9, 15.4 and 16.0 from source. The builds are named o3_native_lto which is shorthand for using: -O3 -march=native -mtune=native -flto.
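I build these via a script, but a rough sketch of an equivalent configure invocation (the install prefix is made up) is:

  ./configure --prefix=$HOME/d/pg1621_o3_native_lto \
      CFLAGS="-O3 -march=native -mtune=native -flto"
  make -j8
  make install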

Benchmark

I used sysbench and my usage is explained here. Postgres was configured to cache all tables.

This benchmark used the Beelink server explained here that has 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04. 

The benchmark is run with 1 client and 1 table with 20M rows. The read-only tests ran for 600 seconds each and the other tests ran for 1200 seconds each. The command line for my wrapper scripts is:

bash r.sh 1 20000000 600 1200 nvme0n1 1 1 1

The benchmark used the a2 config and I commented out the usage of autovacuum_vacuum_insert_scale_factor for Postgres 11.21 and 12.16 as they do not support that option.
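The edit is just a comment, sketched below with a made-up value. Versions before Postgres 13 don't know the option, so it has to be removed or commented out for them.

  # autovacuum_vacuum_insert_scale_factor was added in Postgres 13, so it is
  # commented out for 11.21 and 12.16 (the value shown here is made up)
  #autovacuum_vacuum_insert_scale_factor = 0.05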

Results

A spreadsheet with all results is here. It has two sheets: one with absolute QPS for each version tested, the other with the relative QPS per version. The relative QPS is: (QPS for me) / (QPS for 11.21). With relative QPS it is easy to quantify improvements and regressions.

There are ~42 tests and each can be called a benchmark step or microbenchmark. I will call them benchmark steps. These are put into one of 5 groups based on the workload:

  • point query, part 1 - point queries that don't use the random-points benchmark step
  • point query, part 2 - point query variants that use the random-points benchmark step
  • range query, part 1 - range queries without aggregation
  • range query, part 2 - read-only variants of the original sysbench workload that does range queries with aggregation
  • writes - benchmark steps that do insert, update and delete

The y-axis starts at 0.9 rather than 0 to make it easier to see the changes.

The results below for range queries, part 2 have an interesting pattern. There are two sets of results there - the first 3 groups of bars from the left are from tests run prior to subjecting Postgres to random writes. The other 3 groups of bars are the same tests but run after subjecting Postgres to random writes. All of the tests do range scans with some aggregation -- these are read-only variants of the traditional sysbench workload. The difference is the length of the range scans:

  • Groups 1 and 4 scan 10,000 rows
  • Groups 2 and 5 scan 1,000 rows
  • Groups 3 and 6 scan 10 rows
Postgres 16.0 looks best with groups 3 and 6 (only scan 10 rows) and worst with groups 1 and 4 (scan 10,000 rows). 
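For reference, these benchmark steps are read-only variants of the traditional sysbench transaction, so the queries look roughly like the sketch below. The exact SQL comes from the sysbench Lua scripts; what changes across the groups is the length of the range.

  -- groups 1 and 4 scan 10,000 rows
  SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN 1 AND 10000;
  -- groups 3 and 6 scan 10 rows
  SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN 1 AND 10;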

Variance in peak RSS with jemalloc 5.2.1

There is much variance in peak RSS with jemalloc 5.2.1 when running the Insert Benchmark with MyRocks. The variance is a function of how jemalloc is built and configured. The worst case (largest peak RSS) is the jemalloc 5.2.1 provided by Ubuntu 22.04 and I have yet to figure out how to reproduce that result using jemalloc 5.2.1 compiled from source.

I previously shared results to show that jemalloc and tcmalloc are better than glibc malloc for RocksDB. That was followed by a post that shows the peak RSS with different jemalloc versions. This post has additional results for jemalloc 5.2.1 using different jemalloc config options. 

tl;dr

  • Peak RSS has large spikes with jemalloc 4.4 and 4.5, and to a lesser extent with 5.0 and 5.1. Tobin Baker suggested these might be from changes to the usage of MADV_FREE and MADV_DONTNEED. The spikes start to show during the l.i1 benchmark step and then are obvious during q100, q500 and q1000.
  • For tests that use Hyper Clock cache there is a large peak RSS with Ubuntu-provided jemalloc 5.2.1 that is obvious during the l.x and l.i1 benchmark steps. I can't reproduce this using jemalloc 5.2.1 compiled from source despite my attempts to match the configuration.
  • Benchmark throughput is generally improving over time from old jemalloc (4.0) to modern jemalloc (5.3)

Builds

My previous post explains the benchmark and HW. 

To get the jemalloc config details I added malloc-conf="stats_print:true" to my.cnf which causes stats and the config details to get written to the MySQL error log on shutdown.

I compiled many versions of jemalloc from source -- 4.0.4, 4.1.1, 4.2.1, 4.3.1, 4.4.0, 4.5.0, 5.0.1, 5.1.0, 5.2.0, 5.2.1, 5.3.0. All of these used the default jemalloc config, shown below, and while it isn't listed there the default value for background_thread is false.

  config.cache_oblivious: true
  config.debug: false
  config.fill: true
  config.lazy_lock: false
  config.malloc_conf: ""
  config.opt_safety_checks: false
  config.prof: false
  config.prof_libgcc: false
  config.prof_libunwind: false
  config.stats: true
  config.utrace: false
  config.xmalloc: false

The config for Ubuntu-provided 5.2.1 is below. This is also the config used by je-5.2.1.prof (see below). It also gets background_thread=false. It differs from what I show above by:

  • uses config.prof: true
  • uses config.prof_libgcc: true

  config.cache_oblivious: true
  config.debug: false
  config.fill: true
  config.lazy_lock: false
  config.malloc_conf: ""
  config.opt_safety_checks: false
  config.prof: true
  config.prof_libgcc: true
  config.prof_libunwind: false
  config.stats: true
  config.utrace: false
  config.xmalloc: false

Finally, I tried one more config when compiling from source to match the config that is used at work. I get that via: 

configure --disable-cache-oblivious --enable-opt-safety-checks --enable-prof --disable-prof-libgcc --enable-prof-libunwind --with-malloc-conf="background_thread:true,metadata_thp:auto,abort_conf:true,muzzy_decay_ms:0"

With that the option values are the following, plus the background thread is enabled. The build that uses this is named je-5.2.1.prod below.


  config.cache_oblivious: false
  config.debug: false
  config.fill: true
  config.lazy_lock: false
  config.malloc_conf: "background_thread:true,metadata_thp:auto,abort_conf:true,muzzy_decay_ms:0"
  config.opt_safety_checks: true
  config.prof: true
  config.prof_libgcc: false
  config.prof_libunwind: true
  config.stats: true
  config.utrace: false
  config.xmalloc: false

Now I have results for variants of jemalloc 5.2.1 and the names here match the names I used on the spreadsheets that show peak RSS.

  • je-5.2.1.ub - Ubuntu-provided 5.2.1
  • je-5.2.1 - compiled from source with default options
  • je-5.2.1.prof - compiled with configure --enable-prof --enable-prof-libgcc to get a config that matches je-5.2.1.ub
  • je-5.2.1.prod - 5.2.1 compiled from source using the configure command listed above to match the config used at work

Benchmarks

I ran the Insert Benchmark using a 60G RocksDB block cache. The benchmark was repeated twice -- once using the (older) LRU block cache, once using the (newer) Hyper Clock cache.

The benchmark was run in the IO-bound setup and the database is larger than memory. The benchmark used a c2-standard-30 server from GCP with Ubuntu 22.04, 15 cores, hyperthreads disabled, 120G of RAM and 1.5T of storage from RAID 0 over 4 local NVMe devices with XFS.

The benchmark is run with 8 clients and 8 tables (client per table). The benchmark is a sequence of steps and the peak RSS problem is worst for the l.x benchmark step that creates indexes and allocates a lot of memory while doing so:

  • l.i0
    • insert 500 million rows per table
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 100 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail. 
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.

Configurations

The benchmark was run with 2 my.cnf files: c5 and c7, both edited to use a 40G RocksDB block cache. The difference between them is that c5 uses the LRU block cache (older code) while c7 uses the Hyper Clock cache.

Results: perf reports

My standard perf reports are here for both types of block caches: LRU and Hyper Clock.

  • Throughput is generally improving over time from old jemalloc (4.0) to modern jemalloc (5.3). See the tables with absolute and relative throughput in the Summary section for LRU and for Hyper Clock.
  • HW performance metrics are mostly similar regardless of the peak RSS spikes. See the tables for LRU and for Hyper Clock. The interesting columns include: cpupq has CPU per operation, cpups has the average value for vmstat's us + sy, csps has the average value for vmstat's cs and cspq has context switches per operation.
So the good news is that tests here don't find performance regressions, although the more interesting test would be on larger HW with more concurrency.

Results: peak RSS

I measured the peak RSS during each benchmark step. The spreadsheet is here.

Summary

  • The larger values for jemalloc 4.4, 4.5, 5.0 and 5.1 might be from changes in how MADV_FREE and MADV_DONTNEED were used.
  • The peak RSS is larger for je-5.2.1.ub during l.x and l.i1. I have been unable to reproduce that with jemalloc compiled from source despite matching the configuration.

Saturday, September 23, 2023

Postgres 16.0, sysbench and a medium server

This has results for Postgres 16.0 vs sysbench on a medium server and the news continues to be good. I previously shared results for this on a small server.

tl;dr

  • Postgres 16.0 is faster than 15.2
  • The improvement for reads is larger than for writes

Builds

I compiled Postgres 15.2, 15.3, 15.4, 16 beta1, 16 beta2, 16 beta3, 16 rc1 and 16.0 from source. The builds are named o3_native_lto which is shorthand for using: -O3 -march=native -mtune=native -flto.

Benchmark

I used sysbench and my usage is explained here. Postgres was configured to cache all tables.

This benchmark used a c2-standard-30 server from GCP with 15 cores, hyperthreads disabled, 120G of RAM, Ubuntu 22.04 and 1.5TB of NVMe SSD with XFS (SW RAID 0 over 4 local devices).

The benchmark is run with 8 clients and 4 tables with 20M rows per table. The read-only tests ran for 600 seconds each and the other tests ran for 1200 seconds each. The command line for my wrapper scripts is:

bash r.sh 4 20000000 600 1200 md127 1 1 8

The benchmark used the x7a28 config.

Results

A spreadsheet with the results is here. It has two sheets: one with absolute QPS for each version tested, the other with the relative QPS per version. The relative QPS is: (QPS for me) / (QPS for 15.2). With relative QPS it is easy to quantify improvements and regressions.

There are ~42 tests and each can be called a benchmark step or microbenchmark. I will call them benchmark steps. These are put into one of 5 groups based on the workload:

  • point query, part 1 - point queries that don't use the random-points benchmark step
  • point query, part 2 - point query variants that use the random-points benchmark step
  • range query, part 1 - range queries without aggregation
  • range query, part 2 - read-only variants of the original sysbench workload that does range queries with aggregation
  • writes - benchmark steps that do insert, update and delete

The y-axis starts at 0.8 rather than 0 to make it easier to see the changes. In a few cases the y-axis range excludes outliers, values too big or too small. But everything can be read on the spreadsheet.

  • point queries, part 1 - Postgres 16.0 is faster than 15.2
  • point queries, part 2 - Postgres 16.0 is faster than 15.2
  • range queries, part 1 - Postgres 16.0 is faster than 15.2
  • range queries, part 2 - Postgres 16.0 is faster than 15.2
  • writes - Postgres 16.0 is faster than 15.2, but the improvement here is smaller than for reads

Postgres 16.0, sysbench and a small server

I recently shared results for Postgres 16rc1 vs sysbench on a small server. Now I have results for Postgres 16.0 and the news continues to be good.

tl;dr

  • insert QPS might have dropped by ~5%, otherwise there are no regressions for writes
  • for everything else QPS in Postgres 16.0 is within ~2% of 15.2, which is a great result

Builds

I compiled Postgres 15.2, 15.3, 15.4, 16 beta1, 16 beta2, 16 beta3, 16 rc1 and 16.0 from source. The builds are named o3_native_lto which is shorthand for using: -O3 -march=native -mtune=native -flto.

Benchmark

I used sysbench and my usage is explained here. Postgres was configured to cache all tables.

This benchmark used the Beelink server explained here that has 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04. 

The benchmark is run with 1 client and 1 table with 20M rows. The read-only tests ran for 600 seconds each and the other tests ran for 1200 seconds each. The command line for my wrapper scripts is:

bash r.sh 1 20000000 600 1200 nvme0n1 1 1 1

The benchmark used the a1 config.

Results

A spreadsheet with the results is here. It has two sheets: one with absolute QPS for each version tested, the other with the relative QPS per version. The relative QPS is: (QPS for me) / (QPS for 15.2). With relative QPS it is easy to quantify improvements and regressions.

There are ~42 tests and each can be called a benchmark step or microbenchmark. I will call them benchmark steps. These are put into one of 5 groups based on the workload:

  • point query, part 1 - point queries that don't use the random-points benchmark step
  • point query, part 2 - point query variants that use the random-points benchmark step
  • range query, part 1 - range queries without aggregation
  • range query, part 2 - read-only variants of the original sysbench workload that does range queries with aggregation
  • writes - benchmark steps that do insert, update and delete

The y-axis starts at 0.8 rather than 0 to make it easier to see the changes.

  • point queries, part 1 - there are no regressions. There is too much variance for the second and third benchmark steps from the left -- point-query.pre_range=100 and point-query_range=100 -- and I have yet to explain the cause.
  • point queries, part 2 - there are no regressions
  • range queries, part 1 - QPS might have dropped by ~2% or there might be noise
  • range queries, part 2 - there are no regressions
  • writes - insert QPS might have dropped by ~5%, otherwise there are no regressions

Monday, September 18, 2023

Perf regressions in MySQL from 5.6.21 to 8.0.34 using the Insert Benchmark and a small server

This post has results from the Insert Benchmark for some MySQL 5.6 releases, all 5.7 releases and all 8.0 releases. I previously shared results like this using sysbench.

Disclaimer - these results are from a low concurrency workload (1 to 3 concurrent connections). Results on a bigger server with more concurrency won't look as bad.

tl;dr

  • From MySQL 5.6.21 to 8.0.34
    • Throughput for 4 of 6 benchmark steps drops by ~50%
    • Throughput for one of the benchmark steps drops by ~20%
    • Throughput for the remaining benchmark step increases by ~20%
    • The problem is new CPU overhead
  • With MySQL 5.6
    • It mostly avoids perf regressions from 5.6.21 to 5.6.51 as throughput drops by ~5%
  • With MySQL 5.7
    • There is a large perf regression from the last 5.6 release to the first 5.7 release
    • It mostly avoids regressions from 5.7.10 to 5.7.43 as throughput drops by ~5%
  • With MySQL 8.0
    • There is a large perf regression from the last 5.7 release to the first 8.0 release
    • There are perf regressions from 8.0.13 to 8.0.34 as throughput drops by up to 20%
In summary, the largest regressions occur across major versions. In theory that is good news because when regressions are not spread across many releases they are easier to fix. But I am not sure that theory holds when the adjacent releases span major versions. The largest regressions occur between these adjacent releases:
  • From 5.6.51 to 5.7.10
  • From 5.7.43 to 8.0.13

Builds

It isn't easy to build older code on newer systems, compilers, etc. Notes on that are here for 5.6, for 5.6 and 5.7, for 5.7 and for 8.0. A note on using cmake is here. The rel builds were used -- everything was compiled using CMAKE_BUILD_TYPE=Release.

Tests were done for:
  • 5.6 - 5.6.21, 5.6.31, 5.6.41, 5.6.51
  • 5.7 - all point releases from 5.7.10 to 5.7.43
  • 8.0 - all point releases from 8.0.13 to 8.0.34
I used the cy10a_bee config and it is here for 5.6, 5.7 and 8.0. For 8.0 releases older than 8.0.19 I changed innodb_idle_flush_pct=1 to loose_innodb_idle_flush_pct=1.
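The loose_ prefix makes a server ignore (with a warning) an option it does not recognize instead of refusing to start, so a sketch of the cnf line is:

  # releases older than 8.0.19 don't know innodb_idle_flush_pct; the loose_
  # prefix makes them ignore the unknown option instead of failing to start
  loose_innodb_idle_flush_pct=1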

Benchmark

The Insert Benchmark was run in one setup - the database was cached by InnoDB.

The benchmark used the Beelink server explained here that has 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04. 

The benchmark is run with 1 client and 1 table. The benchmark is a sequence of steps.

  • l.i0
    • insert 20 million rows per table
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 50 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail. 
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 1200 seconds.
Results

The Summary sections linked below have tables for absolute and relative throughput. Absolute throughput is just the QPS per version. The relative throughput is (QPS for my version / QPS for base version). The base version is 5.6.21 for the 5.6-only report, 5.7.10 for the 5.7-only report, 8.0.13 for the 8.0-only report and 5.6.21 for the report that compares 5.6, 5.7 and 8.0.

Reports are here for
  • Some MySQL 5.6 releases
    • Average throughput drops by ~5% from 5.6.21 to 5.6.51 for all benchmark steps. See the relative throughput table in the Summary.
    • CPU overhead doesn't change for write-heavy benchmark steps (l.i0, l.x, l.i1) and increases by ~5% for read-write steps (q100, q500, q1000). See the cpupq column (CPU/operation) in the Metrics tables.
  • All MySQL 5.7 releases
    • Average throughput drops by 5% to 7% for all benchmark steps. See the relative throughput table in the Summary.
    • CPU overhead doesn't change much for write-heavy benchmark steps (l.i0, l.x, l.i1) and increases by ~6% for read-write steps (q100, q500, q1000). See the cpupq column (CPU/operation) in the Metrics tables.
  • All MySQL 8.0 releases
    • Average throughput drops by 17% to 20% for some benchmark steps (l.i0, q100, q500, q1000) and increases by 1% for others (l.x, l.i1). See the relative throughput table in the Summary.
    • It is hard to explain the results for the write-heavy benchmark steps in terms of the iostat and vmstat results in the Metrics tables. But CPU overhead increases by 17% to 19% for the read-heavy benchmark steps. See the cpupq column (CPU/operation) in the Metrics tables.
  • Comparing MySQL 5.6, 5.7 and 8.0
    • Average throughput drops almost in half for some benchmark steps (l.i0, q100, q500, q1000), drops by ~17% for l.i1 and increases by 21% for l.x. See the relative throughput table in the Summary.
    • CPU overhead increases explain the performance regressions and for the read-write benchmark steps the CPU/operation overhead grows by 1.9X. See the cpupq column (CPU/operation) in the Metrics tables.


Postgres 16.0 vs the Insert Benchmark on a medium server

This has results for Postgres 16.0 and rc1 vs the Insert Benchmark on a medium server. Previous posts have results for this benchmark with Postgres 16.0 on a small server and with Postgres 16 beta on a medium server.

tl;dr

  • Postgres 16.0 looks good
  • All versions of Postgres (old and new) have too much variance on this benchmark with the IO-bound setups. See rant here. I like Postgres but not so much for write-heavy workloads.
  • All versions of Postgres were unable to sustain the target insert rate (8000/s) for the IO-bound setup during q1000.
  • All versions of Postgres suffer from fairness problems during l.i1 -- some clients finish much sooner than others and in the IO-bound setups the slowest clients take ~3X longer than the fastest ones.

Builds

I compiled Postgres 15.2, 15.3, 15.4, 16 beta1, 16 beta2, 16 beta3, 16rc1 and 16.0 from source. The builds are named o3_native_lto which is shorthand for using: -O3 -march=native -mtune=native -flto.

Benchmark

The insert benchmark was run in two setups.

  • cached by Postgres - all tables are cached by Postgres
  • IO-bound - the database is larger than memory

The benchmark used a c2-standard-30 server from GCP with Ubuntu 22.04, 15 cores, hyperthreads disabled, 120G of RAM and 1.5T of storage from RAID 0 over 4 local NVMe devices with XFS.

The benchmark is run with 8 clients and 8 tables. The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows per table where X is 20 for cached and 500 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another X million rows per table with secondary index maintenance where X is 300 for Cached and 60 for IO-bound. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail. 
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 7200 seconds.
Configurations

I used the a27 and a28 configurations. From previous work, a27 was good for Cached by Postgres and a28 was good for IO-bound. The diff between them is:
  • wal_compression is lz4 for a27 and off for a28
  • autovacuum_vacuum_cost_limit is 2000 for a27 and 4000 for a28
  • max_wal_size is 70GB for a27 and 32GB for a28
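Expressed as postgresql.conf settings, the diff between the two configs is:

  # a27
  wal_compression = lz4
  autovacuum_vacuum_cost_limit = 2000
  max_wal_size = 70GB

  # a28
  wal_compression = off
  autovacuum_vacuum_cost_limit = 4000
  max_wal_size = 32GB
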
Results

Reports are here for Cached by Postgres (a27 config, a28 config) and IO-bound (a27 config, a28 config).

The results for average throughput are interesting and confusing. The tables linked below use absolute and relative throughput where relative throughput is (QPS for my version / QPS for base version) and the base version is Postgres 15.2.
  • Cached, a27 config (see here)
    • For l.i1
      • The average throughput gradually declines by ~7% from 15.2 to 16.0. I don't know whether that is a regression or just noise (see here). 
      • From HW metrics the CPU overhead (cpupq is CPU/operation) does not change.
      • From the charts with per-second insert rates there is much variance for all versions (see for 15.2 and 16.0) and the insert rate frequently changes from ~10k/s to ~40k/s.
      • All versions suffer from fairness problems. Across the 8 benchmark clients the average time is ~10,000 seconds but the slowest client takes ~1000 more seconds than the fastest client.
    • For q100, q500 and q1000 the average throughput is the same for all versions.
  • Cached, a28 config (see here)
    • For l.i1
      • The average throughput is the same from 15.2 to 16.0.
      • From the charts with per-second insert rates there is much variance for all versions (see for 15.2 and 16.0) and the insert rate frequently changes from ~10k/s to ~40k/s.
      • All versions suffer from fairness problems. Across the 8 benchmark clients the average time is ~10,000 seconds but the slowest client takes ~1000 more seconds than the fastest client.
    • For q100, q500 and q1000 the average throughput is the same for all versions
  • IO-bound, a27 config (see here)
    • For l.i1
      • I am wary of interpreting results because there is too much variance. 
      • The per-second insert rate charts are similar for 15.2 and 16.0 -- there are spikes.
      • All versions suffer from fairness problems. Across the 8 benchmark clients the times to finish the l.i1 benchmark step in seconds are: 17866, 19057, 34004, 35698, 36535, 38123, 44844, 45365. The slowest client takes almost 3X longer than the fastest.
    • For q100 and q500 results are probably similar across versions but there is noise.
    • For q1000 none of the versions were able to sustain the target insert rate (1000/s/client for a total of 8000/s). See the third table in the Summary section.
  • IO-bound, a28 config (see here)
    • For l.i1 
      • I am wary of interpreting results because there is too much variance.
      • The per-second insert rate charts are similar for 15.2 and 16.0 - there are spikes.
      • All versions suffer from fairness problems. Across the 8 benchmark clients the times to finish the l.i1 benchmark step in seconds are: 18387, 25221, 25764, 26830, 29587, 30214, 35562, 37566. The slowest client takes almost 2X longer than the fastest.
    • For q100 and q500 results are probably similar across versions but there is noise.
    • For q1000 none of the versions were able to sustain the target insert rate (1000/s/client for a total of 8000/s). See the third table in the Summary section.


Postgres 16.0 vs the Insert Benchmark on a small server

This has results for Postgres 16.0 and rc1 vs the Insert Benchmark on a small server. My previous post had results for beta1, beta2 and beta3.

tl;dr

  • Postgres 16.0 looks good
  • All versions of Postgres (old and new) have too much variance on this benchmark with the IO-bound setups. See rant below.
Rant

There is too much variance with Postgres vs the Insert Benchmark in IO-bound setups. This isn't a new issue; it occurs for all Postgres versions I have tested -- from version 10 or 11 to 16. I spent months testing different configuration options to see if I could tune this problem away. While better tuning helped, it did not fix the problem. I think OtterTune said it best, and while each new Postgres version has incremental improvements to vacuum I don't think the problem will ever be fixed incrementally.

I hope that something significant is done because while Postgres is a great DBMS I would be reluctant to use it for a write-intensive workload and it risks falling far behind the closed-source vendor forks.

So I focus more on the Cached by Postgres setup and less on the IO-bound setup. The Cached by Postgres setup is great when searching for CPU regressions and I do not see any in Postgres 16.0.

Builds

I compiled Postgres 15.2, 15.3, 15.4, 16 beta1, 16 beta2, 16 beta3, 16rc1 and 16.0 from source. The builds are named o3_native_lto which is shorthand for using: -O3 -march=native -mtune=native -flto.

Benchmark

The Insert Benchmark was run in two setups.

  • cached by Postgres - all tables are cached by Postgres
  • IO-bound - the database is larger than memory

This benchmark used the Beelink server explained here that has 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04. 

The benchmark is run with 1 client and 1 table. The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows per table where X is 20 for cached and 800 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another X million rows per table with secondary index maintenance where X is 300 for Cached and 30 for IO-bound. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail. 
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 7200 seconds.
Configurations

I used the a1 and a2 configurations. The difference between them is that a1 has wal_compression set to lz4 while a2 has it set to off.

Results

Reports are here for Cached (a1 config, a2 config) and IO-bound (a1 config, a2 config).

The results for average throughput are interesting and confusing. The tables linked below use absolute and relative throughput where relative throughput is (QPS for my version / QPS for base version) and the base version is Postgres 15.2.
  • Cached, a1 config (see here)
    • From the summary tables with relative throughput I see that 16.0 does better on the l.i1 benchmark step and then worse on the q100 and q500 steps that follow -- relative throughput values are 1.06, 0.91 and 0.94. Perhaps 16.0 is able to process writes faster during l.i1 but then that means the benchmark steps that follow (q100, q500) inherit more MVCC GC debt. I will revisit this.
    • From the HW metrics the CPU overhead (cpupq is CPU/operation) with version 16.0 is less in l.i1 but more in q100 and q500 vs earlier versions. This explains the throughput differences. I assume the increases in q100 and q500 are from doing more MVCC GC (because l.i1 was faster, and given that cpupq in q1000 returns to normal, perhaps the GC debt was repaid by then).
    • During l.i1 all versions tested have much variance. The per-second insert rates regularly jump between 10k/s and 34k/s.
    • The max insert response time graphs have an interesting pattern for q100 and q500 which is less frequent in q1000. I assume that q100 and q500 have more MVCC GC debt to repay after l.i1 but that debt is gone by q1000.
    • For q100 the per-second QPS charts have less noise in 16.0 than in 15.2. This is somewhat true in q500 (see 16.0 and 15.2) but by q1000 they are similar (see 16.0 and 15.2).
  • Cached, a2 config (see here)
    • The comment above for the a1 config might apply here given that the relative throughput is 1.05 for l.i1 and 0.97 for q100. See the summary tables.
    • From the HW metrics results are similar to the a1 config except the MVCC GC debt was repaid by the end of q100.
    • During l.i1 all versions tested have much variance. The per-second insert rates regularly jump between 10k/s and 34k/s.
  • IO-bound, a1 config (see here)
    • I am reluctant to characterize these because there is too much variance -- especially during l.i1 (which suffers from vacuum-induced variance) and q100 (follows l.i1, inherits MVCC GC debt).
    • There is much variance for insert rates measured at 1-second intervals. Results for 15.2 show a gradual decline to ~500/s, then a sudden jump to 4000/s followed by another decline to ~500/s. The gradual declines are expected, the sudden jumps are not. This benchmark step ran for ~20,000s and I'd rather not run it for days to hope that the variance fades. Results for 16.0 also have sudden jumps but with 16.0 the rates decline to ~400/s which is worse than what 15.2 does.
  • IO-bound, a2 config (see here)
    • I am reluctant to characterize these because there is too much variance -- especially during l.i1 (which suffers from vacuum-induced variance) and q100 (follows l.i1, inherits MVCC GC debt).
    • With the l.i1 benchmark step the gradual fade, sudden jump pattern is similar to the results above for the a1 config (see 16.0 and 15.2).


Tuesday, September 12, 2023

Perf regressions in MySQL from 5.6.21 to 8.0.34 using sysbench and a small server

This has results for sysbench vs upstream MySQL on a small server. I have results for every release in 5.7 and 8.0 and some releases in 5.6. The goal is to document where things get faster or slower over time for a low-concurrency and CPU-bound workload. The focus is on CPU regressions. 

My results here aren't universal. 

  • Things won't look the same with an IO-bound workload. If nothing else that will make many of the CPU regressions less significant.
  • Things won't look the same with a workload that has more concurrency. While MySQL tends to get slower over time from more CPU overhead it also gets faster over time on concurrent workloads from improvements to synchronization code. Results from a few months ago on a larger server are here and the regressions are much smaller.
  • Things won't look the same with a workload that has complex queries. Most of the queries used by sysbench are simple and short running. This amplifies the impact of perf regressions in parse, semantic analysis and query optimization. 

tl;dr

  • Upstream MySQL would benefit from changepoint detection.
  • MySQL 8.0 is the worst for perf regressions, while 5.7 and 5.6 are better at avoiding them. In theory this is good news because newer regressions are easier to fix than older ones.
  • For most of the benchmark steps MySQL 8.0.34 gets between 20% and 50% less QPS than 5.6.21
  • There were large regressions for point query benchmark steps in 8.0.2x. Fortunately bug 102037 has been fixed in 8.0.3x (I found that bug via past usage of sysbench).
What about Postgres?
  • On the same setup the regressions from Postgres 11 to Postgres 16 are small while the regressions here are big. 
  • On a medium server they are also small.

Builds

It isn't easy to build older code on newer systems, compilers, etc. Notes on that are here for 5.6, for 5.6 and 5.7, for 5.7 and for 8.0. A note on using cmake is here. The rel builds were used -- everything was compiled using CMAKE_BUILD_TYPE=Release.

Tests were done for:
  • 5.6 - 5.6.21, 5.6.31, 5.6.41, 5.6.51
  • 5.7 - all point releases from 5.7.10 to 5.7.43
  • 8.0 - all point releases from 8.0.13 to 8.0.34
I used the cy10a_bee config and it is here for 5.6, 5.7 and 8.0. For 8.0 releases older than 8.0.19 I changed innodb_idle_flush_pct=1 to loose_innodb_idle_flush_pct=1.

Benchmarks

I used sysbench and my usage is explained here. Tests were run on a small server I have at home (see here). The test tables are cached by InnoDB.

Results

For the results below I split the benchmark steps into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query benchmark steps part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. Unfortunately, I included the full scan benchmark step (scan_range=100) in part 2 but it doesn't do aggregation. I struggled with the presentation because there are a large number of results per chart in the sections that test all releases from 5.7 and 8.0. The spreadsheet with all data and charts is here and is easier to read.

All of the charts have relative throughput on the y-axis where that is (QPS for $me) / (QPS for $base), $me is a version (for example 5.7.24) and $base is the base version. The base version is the oldest version being compared and the goal is to see whether QPS is better or worse over time.

The legend on the RHS of each chart truncates the names I use for the benchmark steps and I don't know how to fix that other than sharing the link to the Google Sheet I used.

From 5.6.21 to 8.0.34

This section uses 5.6.21 as the base version and then compares that with 5.6.51, 5.7.10, 5.7.28, 5.7.43, 8.0.13, 8.0.23 and 8.0.34 to show how performance has changed from oldest tested (5.6.21) to newest tested (8.0.34).

  • For point queries, 8.0.3x gets 20% to 35% less QPS vs 5.6.21. There were larger regressions in 8.0.2x but most were fixed in 8.0.3x.
  • For range queries that don't do aggregation (part 1), 8.0.3x gets 30% to 40% less QPS vs 5.6.21. For range queries that do aggregation, 8.0.3x gets 0% to 35% less QPS vs 5.6.21. The difference depends on the length of the range scan (shorter scan == larger regression). Also, full scan (scan_range=100) has a significant regression.
  • For most writes, 8.0.3x gets 25% to 50% less QPS vs 5.6.21
It isn't easy to see but some of the lines are on top of each other.

MySQL 8.0: all point releases

This section uses 8.0.13 as the base version and compares every other 8.0 release with it.

  • For point queries, 8.0.34 gets 5% to 12% less QPS than 8.0.13. For some benchmark steps large regressions in 8.0.2x were fixed in 8.0.3x - that might have been bug 102037, found by my past usage of sysbench and fixed in 8.0.31. For the point-query* benchmark steps there are slow but steady regressions from 8.0.2x to 8.0.3x. 
  • For range queries, 8.0.34 gets 5% to 20% less QPS than 8.0.13. There are slow but steady regressions from 8.0.2x to 8.0.3x and the impact is becoming significant. Full scan (scan_range=100) has the largest regression.
  • For writes, 8.0.34 does better than 8.0.13 on the update-index benchmark step but otherwise gets 15% to 25% less QPS than 8.0.13. The regressions are slow but steady.

MySQL 5.7: all point releases

This section uses 5.7.10 as the base version and compares every other 5.7 release with it.

  • For point queries, 5.7.43 gets 2% to 6% less QPS than 5.7.10
  • For range queries that don't do aggregation (part 1), 5.7.43 gets 5% to 12% less QPS than 5.7.10. For range queries that do aggregation, 5.7.43 gets about 5% less QPS than 5.7.10. Full scan is 15% slower in 5.7.43 vs 5.7.10.
  • For writes, 5.7.43 gets more QPS than 5.7.10 for update-index and update-inlist. Otherwise it gets between 3% and 8% less QPS.

MySQL 5.6: some point releases

This section uses 5.6.21 as the base version and compares 5.6.31, 5.6.41 and 5.6.51 with it.

  • For point queries, 5.6.51 and 5.6.21 have similar results.
  • For range queries, excluding full scan, 5.6.51 gets 2% to 4% less QPS than 5.6.21. For full scan (scan_range=100) it gets 16% less QPS.
  • For writes, 5.6.51 gets from 5% less to 2% more QPS than 5.6.21 except for update-list where it gets 20% more.