
Thursday, September 11, 2025

Postgres 18rc1 vs sysbench

This post has results for Postgres 18rc1 vs sysbench on small and large servers. Results for Postgres 18beta3 are here for a small and large server.

tl;dr

  • Postgres 18 looks great
  • I continue to see small CPU regressions in Postgres 18 for range queries that don't do aggregation on low-concurrency workloads. I have yet to explain that. 
  • The throughput for the scan microbenchmark has more variance with Postgres 18. I assume this is related to more or less work getting done by vacuum but I have yet to debug the root cause.

Builds, configuration and hardware

I compiled Postgres from source for versions 17.6, 18 beta3 and 18 rc1.

The servers are:
  • small
    • an ASUS ExpertCenter PN53 with AMD Ryzen 7735HS CPU, 32G of RAM, 8 cores with AMD SMT disabled, Ubuntu 24.04 and an NVMe device with ext4 and discard enabled.
  • large32
    • Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32 Cores and AMD SMT disabled, Ubuntu 24.04 and an NVMe device with ext4 and discard enabled.
  • large48
    • an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled
    • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
    • 128G RAM
    • Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic)
All configurations use synchronous IO, which is the only option prior to Postgres 18; for Postgres 18 the config file sets io_method=sync.
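
If you are repeating these tests, one way to confirm which AIO method a Postgres 18 server is actually using is to query pg_settings. This is a minimal sketch that assumes a Postgres 18 server (io_method and io_workers do not exist in earlier versions), psycopg2 as the client library and a placeholder DSN:

import psycopg2

# Placeholder DSN -- adjust for your setup.
DSN = "dbname=postgres host=localhost"

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        # io_method and io_workers are new GUCs in Postgres 18.
        cur.execute(
            "SELECT name, setting FROM pg_settings "
            "WHERE name IN ('io_method', 'io_workers')"
        )
        for name, setting in cur.fetchall():
            print(f"{name} = {setting}")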

Configuration files:

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres.

For all servers the read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.

The number of tables and rows per table was:
  • small server - 1 table, 50M rows
  • large servers - 8 tables, 10M rows per table
The number of clients (amount of concurrency) was:
  • small server - 1
  • large32 server - 24
  • large48 server - 40
Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for Postgres 17.6)
When the relative QPS is > 1 then some version is faster than PG 17.6. When it is < 1 then there might be a regression. Values from iostat and vmstat divided by QPS are also provided here. These can help to explain why something is faster or slower because they show how much HW is used per request.
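
To make the arithmetic concrete, this is a minimal sketch of the relative QPS calculation used throughout this post; the QPS numbers in it are made up and only there to show the formula:

# Relative QPS = (QPS for some version) / (QPS for Postgres 17.6)
def relative_qps(qps_version, qps_base):
    return round(qps_version / qps_base, 2)

# Hypothetical throughput numbers for one microbenchmark.
qps_pg176 = 10000.0    # base case: Postgres 17.6
qps_pg18rc1 = 9800.0   # some version: Postgres 18rc1

print(relative_qps(qps_pg18rc1, qps_pg176))  # 0.98 -> less than 1, so possibly a small regression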

The numbers highlighted in yellow below might be from a small regression for range queries that don't do aggregation. But note that this does not reproduce for the full table scan microbenchmark (scan). I am not certain it is a regression as this might be from non-deterministic CPU overheads for read-heavy workloads that are run after vacuum. I hope to look at CPU flamegraphs soon.

Results: small server

I continue to see small (~3%) regressions in throughput for range queries without aggregation across Postgres 18 beta1, beta2, beta3 and rc1. But I have yet to debug this and am not certain it is a regression. I am also skeptical about the great results for scan. I suspect that I have more work to do to make the benchmark less subject to variance from MVCC GC (vacuum here). I also struggle with that on RocksDB (compaction), but not on InnoDB (purge).

Relative to: Postgres 17.6
col-1 : 18beta3
col-2 : 18rc1

col-1   col-2   point queries
1.01    0.98    hot-points_range=100
1.01    1.00    point-query_range=100
1.02    1.02    points-covered-pk_range=100
0.99    1.01    points-covered-si_range=100
1.00    0.99    points-notcovered-pk_range=100
1.00    0.99    points-notcovered-si_range=100
1.01    1.00    random-points_range=1000
1.01    0.99    random-points_range=100
1.01    1.00    random-points_range=10

col-1   col-2   range queries without aggregation
0.97    0.96    range-covered-pk_range=100
0.97    0.97    range-covered-si_range=100
0.99    0.99    range-notcovered-pk_range=100
0.99    0.99    range-notcovered-si_range=100
1.35    1.36    scan_range=100

col-1   col-2   range queries with aggregation
1.02    1.03    read-only-count_range=1000
1.00    1.00    read-only-distinct_range=1000
0.99    0.99    read-only-order_range=1000
1.00    1.00    read-only_range=10000
1.00    0.99    read-only_range=100
0.99    0.98    read-only_range=10
1.01    1.01    read-only-simple_range=1000
1.02    1.00    read-only-sum_range=1000

col-1   col-2   writes
0.99    0.99    delete_range=100
0.99    1.01    insert_range=100
0.99    0.99    read-write_range=100
0.99    0.99    read-write_range=10
0.98    0.98    update-index_range=100
1.00    0.99    update-inlist_range=100
0.98    0.98    update-nonindex_range=100
0.98    0.97    update-one_range=100
0.98    0.97    update-zipf_range=100
0.99    0.98    write-only_range=10000

Results: large32 server

I don't see small regressions in throughput for range queries without aggregation across Postgres 18 beta1, beta2, beta3 and rc1. I have only seen that on the low concurrency (small server) results.

The improvements on the scan microbenchmark come from using less CPU. But I am skeptical about the improvements. I might have more work to do to make the benchmark less subject to variance from MVCC GC (vacuum here). I also struggle with that on RocksDB (compaction), but not on InnoDB (purge).

Relative to: Postgres 17.6
col-1 : Postgres 18rc1

col-1   point queries
1.01    hot-points_range=100
1.01    point-query_range=100
1.01    points-covered-pk_range=100
1.01    points-covered-si_range=100
1.00    points-notcovered-pk_range=100
1.00    points-notcovered-si_range=100
1.01    random-points_range=1000
1.00    random-points_range=100
1.01    random-points_range=10

col-1   range queries without aggregation
0.99    range-covered-pk_range=100
0.99    range-covered-si_range=100
0.99    range-notcovered-pk_range=100
0.99    range-notcovered-si_range=100
1.12    scan_range=100

col-1   range queries with aggregation
1.00    read-only-count_range=1000
1.02    read-only-distinct_range=1000
1.01    read-only-order_range=1000
1.03    read-only_range=10000
1.00    read-only_range=100
1.00    read-only_range=10
1.00    read-only-simple_range=1000
1.00    read-only-sum_range=1000

col-1   writes
1.01    delete_range=100
1.00    insert_range=100
1.00    read-write_range=100
1.00    read-write_range=10
1.00    update-index_range=100
1.00    update-inlist_range=100
1.00    update-nonindex_range=100
0.99    update-one_range=100
1.00    update-zipf_range=100
1.00    write-only_range=10000

Results: large48 server

I don't see small regressions in throughput for range queries without aggregation across Postgres 18 beta1, beta2, beta3 and rc1. I have only seen that on the low concurrency (small server) results.

The improvements on the scan microbenchmark come from using less CPU. But I am skeptical about the improvements. I might have more work to do to make the benchmark less subject to variance from MVCC GC (vacuum here). I also struggle with that on RocksDB (compaction), but not on InnoDB (purge).

I am skeptical about the regression I see here for scan. That comes from using ~10% more CPU per query. I might have more work to do to make the benchmark less subject to variance from MVCC GC (vacuum here). I also struggle with that on RocksDB (compaction), but not on InnoDB (purge).

I have not seen the large improvements for the insert and delete microbenchmarks in previous tests on that large server. I assume this is another case where I need to figure out how to reduce variance when I run the benchmark.

Relative to: Postgres 17.6
col-1 : Postgres 18beta3
col-2 : Postgres 18rc1

col-1   col-2   point queries
0.99    0.99    hot-points_range=100
0.99    0.99    point-query_range=100
1.00    0.99    points-covered-pk_range=100
0.99    1.02    points-covered-si_range=100
1.00    0.99    points-notcovered-pk_range=100
0.99    1.01    points-notcovered-si_range=100
1.00    0.99    random-points_range=1000
1.00    0.99    random-points_range=100
1.00    1.00    random-points_range=10

col-1   col-2   range queries without aggregation
0.99    0.99    range-covered-pk_range=100
0.98    0.99    range-covered-si_range=100
0.99    0.99    range-notcovered-pk_range=100
1.01    1.01    range-notcovered-si_range=100
0.91    0.91    scan_range=100

col-1   col-2   range queries with aggregation
1.04    1.03    read-only-count_range=1000
1.02    1.01    read-only-distinct_range=1000
1.01    1.00    read-only-order_range=1000
1.06    1.06    read-only_range=10000
0.98    0.97    read-only_range=100
0.99    0.99    read-only_range=10
1.02    1.02    read-only-simple_range=1000
1.03    1.03    read-only-sum_range=1000

col-1   col-2   writes
1.46    1.49    delete_range=100
1.32    1.32    insert_range=100
0.99    1.00    read-write_range=100
0.98    1.00    read-write_range=10
0.99    1.00    update-index_range=100
0.95    1.03    update-inlist_range=100
1.00    1.02    update-nonindex_range=100
0.96    1.04    update-one_range=100
1.00    1.01    update-zipf_range=100
1.00    1.00    write-only_range=10000




Wednesday, June 11, 2025

Postgres 18 beta1: small server, IO-bound Insert Benchmark (v2)

This is my second attempt at IO-bound Insert Benchmark results on a small server. The first attempt is here and has been deprecated because sloppy programming on my part meant the benchmark client created too many connections, which hurt results in some cases for Postgres 18 beta1.

There might be regressions from 17.5 to 18 beta1

  • QPS decreases by ~5% and CPU increases by ~5% on the l.i2 (write-only) step
  • QPS decreases by <= 2% and CPU increases by ~2% on the qr* (range query) steps
There might be regressions from 14.0 to 18 beta1
  • QPS decreases by ~6% and ~18% on the write-heavy steps (l.i1, l.i2)

Builds, configuration and hardware

I compiled Postgres from source using -O2 -fno-omit-frame-pointer for versions  14.0, 14.18, 15.0, 15.13, 16.0, 16.9, 17.0, 17.5 and 18 beta1.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, 32G of RAM and one NVMe device for the database. The OS has been updated to Ubuntu 24.04. More details on it are here.

For Postgres versions 14.0 through 17.5 the configuration files are in the pg* subdirectories here with the name conf.diff.cx10a_c8r32. For Postgres 18 beta1 I used 3 variations, which are here:
  • conf.diff.cx10b_c8r32
    • uses io_method='sync' to match Postgres 17 behavior
  • conf.diff.cx10c_c8r32
    • uses io_method='worker' and io_workers=16 to do async IO via a thread pool. I eventually learned that 16 is too large.
  • conf.diff.cx10d_c8r32
    • uses io_method='io_uring' to do async IO via io_uring
The Benchmark

The benchmark is explained here and is run with 1 client and 1 table with 800M rows. I provide two performance reports:
  • one to compare Postgres 14.0 through 18 beta1, all using synchronous IO
  • one to compare Postgres 17.5 with 18 beta1 using 3 configurations for 18 beta1 -- one for each of io_method= sync, workers and io_uring.
The benchmark steps are:

  • l.i0
    • insert 800 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts 4M rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. A sketch of this insert+delete pattern follows the list of steps below.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and 1M rows are inserted and deleted per table.
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
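As a rough illustration of the l.i1 and l.i2 insert+delete pattern described above, here is a minimal sketch. It is not the Insert Benchmark client: the table name, column list and DSN are hypothetical, there is no rate coordination between the two connections, and the real client runs the delete connection at the same rate as the insert connection.

import psycopg2

DSN = "dbname=ib host=localhost"  # placeholder
ROWS_PER_TXN = 50                 # 50 for l.i1 (big transactions), 5 for l.i2
TRANSACTIONS = 1000               # the real step runs for a fixed number of inserts

ins_conn = psycopg2.connect(DSN)
del_conn = psycopg2.connect(DSN)

next_pk = 800_000_000  # continue inserting after the initial load
for _ in range(TRANSACTIONS):
    # One connection inserts ROWS_PER_TXN rows per transaction.
    with ins_conn.cursor() as cur:
        for _ in range(ROWS_PER_TXN):
            next_pk += 1
            cur.execute("INSERT INTO t (pk, val) VALUES (%s, %s)", (next_pk, "x"))
    ins_conn.commit()

    # The other connection deletes the same number of rows (oldest first)
    # so the table size stays roughly constant.
    with del_conn.cursor() as cur:
        cur.execute(
            "DELETE FROM t WHERE pk IN (SELECT pk FROM t ORDER BY pk LIMIT %s)",
            (ROWS_PER_TXN,),
        )
    del_conn.commit()

ins_conn.close()
del_conn.close()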
Results: overview

The performance report is here for Postgres 14 through 18 and here for Postgres 18 configurations.

The summary sections (here, here and here) have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for the benchmark steps. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS (rQPS) to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from Postgres 17.5.

When rQPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. When it is 0.90 then I claim there is a 10% regression. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.97, green for >= 1.03 and grey for values between 0.98 and 1.02.

Results: Postgres 14.0 through 18 beta1

The performance summary is here

See the previous section for the definition of relative QPS (rQPS). For the rQPS formula, Postgres 14.0 is the base version and that is compared with more recent Postgres versions. The results here are similar to what I reported prior to fixing the too many connections problem in the benchmark client.

For 14.0 through 18 beta1, QPS on ...
  • the initial load (l.i0)
    • Performance is stable across versions
    • 18 beta1 and 17.5 have similar performance
    • rQPS for (17.5, 18 beta1 with io_method=sync) is (1.00, 0.99)
  • create index (l.x)
    • ~10% faster starting in 15.0
    • 18 beta1 and 17.5 have similar performance
    • rQPS for (17.5, 18 beta1 with io_method=sync) is (1.11, 1.12)
  • first write-only step (l.i1)
    • Performance decreases ~7% from version 16.9 to 17.0. CPU overhead (see cpupq here) increases by ~5% in 17.0.
    • 18 beta1 and 17.5 have similar performance
    • rQPS for (17.5, 18 beta1 with io_method=sync) is (0.93, 0.94)
  • second write-only step (l.i2)
    • Performance decreases ~6% in 15.0, ~8% in 17.0 and then ~5% in 18.0. CPU overhead (see cpupq here) increases ~5%, ~6% and ~5% in 15.0, 17.0 and 18beta1. Of all benchmark steps, this has the largest perf regression from 14.0 through 18 beta1 which is ~20%.
    • 18 beta1 is ~4% slower than 17.5
    • rQPS for (17.5, 18 beta1 with io_method=sync) is (0.86, 0.82)
  • range query steps (qr100, qr500, qr1000)
    • 18 beta1 and 17.5 have similar performance, but 18 beta1 is slightly slower
    • rQPS for (17.5, 18 beta1 with io_method=sync) is (1.00, 0.99) for qr100, (0.97, 0.98) for qr500 and (0.97, 0.95) for qr1000. The issue is new CPU overhead, see cpupq here.
  • point query steps (qp100, qp500, qp1000)
    • 18 beta1 and 17.5 have similar performance but 18 beta1 is slightly slower
    • rQPS for (17.5, 18 beta1 with io_method=sync) is (1.00, 0.98) for qp100, (0.99, 0.98) for qp500 and (0.97, 0.96) for qp1000. The issue is new CPU overhead, see cpupq here.
Results: Postgres 17.5 vs 18 beta1

The performance summary is here.

See the previous section for the definition of relative QPS (rQPS). For the rQPS formula, Postgres 17.5 is the base version and that is compared with results from 18 beta1 using the three configurations explained above:
  • x10b with io_method=sync
  • x10c with io_method=worker and io_workers=16
  • x10d with io_method=io_uring
The summary is:
  • initial load step (l.i0)
    • rQPS for (x10b, x10c, x10d) was (0.99, 1.00, 1.00)
  • create index step (l.x)
    • rQPS for (x10b, x10c, x10d) was (1.01, 1.02, 1.02)
  • first write-heavy step (l.i1)
    • for l.i1 the rQPS for (x10b, x10c, x10d) was (1.00, 0.99, 1.01)
  • second write-heavy step (l.i2)
    • for l.i2 the rQPS for (x10b, x10c, x10d) was (0.96, 0.93, 0.94)
    • CPU overhead (see cpupq here) increases by ~5% in 18 beta1
  • range query steps (qr100, qr500, qr1000)
    • for qr100 the rQPS for (x10b, x10c, x10d) was (1.00, 0.99, 0.99)
    • for qr500 the rQPS for (x10b, x10c, x10d) was (1.00, 0.97, 0.99)
    • for qr1000 the rQPS for (x10b, x10c, x10d) was (0.99, 0.98, 0.97)
    • CPU overhead (see cpupq here, here and here) increases by ~2% in 18 beta1
  • point query steps (qp100, qp500, qp1000)
    • for qp100 the rQPS for (x10b, x10c, x10d) was (0.98, 0.99, 0.99)
    • for qp500 the rQPS for (x10b, x10c, x10d) was (0.99, 1.00, 1.00)
    • for qp1000 the rQPS for (x10b, x10c, x10d) was (0.99, 0.99, 0.99)

Sunday, June 8, 2025

Postgres 18 beta1: small server, CPU-bound Insert Benchmark (v2)

This is my second attempt at CPU-bound Insert Benchmark results on a small server. The first attempt is here and has been deprecated because sloppy programming on my part meant the benchmark client created too many connections, which hurt results in some cases for Postgres 18 beta1.

tl;dr

  • Performance between 17.5 and 18 beta1 is mostly similar on read-heavy steps
  • 18 beta1 might have small regressions from new CPU overheads on write-heavy steps

Builds, configuration and hardware

I compiled Postgres from source using -O2 -fno-omit-frame-pointer for versions  14.0, 14.18, 15.0, 15.13, 16.0, 16.9, 17.0, 17.5 and 18 beta1.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, 32G of RAM and one NVMe device for the database. The OS has been updated to Ubuntu 24.04 -- I used 22.04 prior to that. More details on it are here.

For Postgres versions 14.0 through 17.5 the configuration files are in the pg* subdirectories here with the name conf.diff.cx10a_c8r32. For Postgres 18 beta1 I used 3 variations, which are here:
  • conf.diff.cx10b_c8r32
    • uses io_method='sync' to match Postgres 17 behavior
  • conf.diff.cx10c_c8r32
    • uses io_method='worker' and io_workers=16 to do async IO via a thread pool. I eventually learned that 16 is too large.
  • conf.diff.cx10d_c8r32
    • uses io_method='io_uring' to do async IO via io_uring
The Benchmark

The benchmark is explained here and is run with 1 client and 1 table with 20M rows. I provide two performance reports:
  • one to compare Postgres 14.0 through 18 beta1, all using synchronous IO
  • one to compare Postgres 17.5 with 18 beta1 using 3 configurations for 18 beta1 -- one for each of io_method= sync, workers and io_uring.
The benchmark steps are:

  • l.i0
    • insert 20 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts 40M rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and 10M rows are inserted and deleted per table.
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: overview

The performance report is here for Postgres 14 through 18 and here for Postgres 18 configurations.

The summary sections (here and here) have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS (rQPS) to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from Postgres 17.5.

When rQPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. When it is 0.90 then I claim there is a 10% regression. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.97, green for >= 1.03 and grey for values between 0.98 and 1.02.

Results: Postgres 14.0 through 18 beta1

The performance summary is here

See the previous section for the definition of relative QPS (rQPS). For the rQPS formula, Postgres 14.0 is the base version and that is compared with more recent Postgres versions.

For 14.0 through 18 beta1, QPS on ...
  • l.i0 (the initial load)
    • Slightly faster starting in 15.0
    • Throughput was ~4% faster starting in 15.0 and that drops to ~2% in 18 beta1
    • 18 beta1 and 17.5 have similar performance
  • l.x (create index) 
    • Faster starting in 15.0
    • Throughput is between 9% and 17% faster in 15.0 through 18 beta1
    • 18 beta1 and 17.5 have similar performance
  • l.i1 (write-only)
    • Slower starting in 15.0
    • It is ~3% slower in 15.0 and that increases to between 6% and 10% in 18 beta1
    • 18 beta1 and 17.5 have similar performance
  • l.i2 (write-only)
    • Slower starting in 15.13 with a big drop in 17.0
    • 18 beta1 with io_method= sync and io_uring is worse than 17.5. It isn't clear but one problem might be more CPU/operation (see cpupq here)
  • qr100, qr500, qr1000 (range query)
    • Stable from 14.0 through 18 beta1
  • qp100, qp500, qp1000 (point query) 
    • Stable from 14.0 through 18 beta1
Results: Postgres 17.5 vs 18 beta1

The performance summary is here

See the previous section for the definition of relative QPS (rQPS). For the rQPS formula, Postgres 17.5 is the base version and that is compared with results from 18 beta1 using the three configurations explained above:
  • x10b with io_method=sync
  • x10c with io_method=worker and io_workers=16
  • x10d with io_method=io_uring
The summary of the summary is:
  • initial load step (l.i0)
    • 18 beta1 is 1% to 3% slower than 17.5
    • This step is short running so I don't have a strong opinion on the change
  • create index step (l.x)
    • 18 beta1 is 0% to 2% faster than 17.5
    • This step is short running so I don't have a strong opinion on the change
  • write-heavy step (l.i1)
    • 18 beta1 with io_method= sync and workers has similar perf as 17.5
    • 18 beta1 with io_method=io_uring is ~4% slower than 17.5. The problem might be more CPU/operation, see cpupq here
  • write-heavy step (l.i2)
    • 18 beta1 with io_method=workers is ~2% faster than 17.5
    • 18 beta1 with io_method= sync and io_uring is 6% and 8% slower than 17.5. The problem might be more CPU/operation, see cpupq here
  • range query steps (qr100, qr500, qr1000)
    • 18 beta1 and 17.5 have similar performance
  • point query steps (qp100, qp500, qp1000)
    • 18 beta1 and 17.5 have similar performance
The summary is:
  • initial load step (l.i0)
    • rQPS for (x10b, x10c, x10d) was (0.98, 0.99, 0.97)
  • create index step (l.x)
    • rQPS for (x10b, x10c, x10d) was (1.00, 1.02, 1.00)
  • write-heavy steps (l.i1, l.i2)
    • for l.i1 the rQPS for (x10b, x10c, x10d) was (1.01, 1.00, 0.96)
    • for l.i2 the rQPS for (x10b, x10c, x10d) was (0.94, 1.02, 0.92)
  • range query steps (qr100, qr500, qr1000)
    • for qr100 the rQPS for (x10b, x10c, x10d) was (0.99, 1.00, 1.00)
    • for qr500 the rQPS for (x10b, x10c, x10d) was (0.99, 1.01, 1.00)
    • for qr1000 the rQPS for (x10b, x10c, x10d) was (0.99, 1.00, 1.00)
  • point query steps (qp100, qp500, qp1000)
    • for qp100 the rQPS for (x10b, x10c, x10d) was (1.00, 1.00, 1.00)
    • for qp500 the rQPS for (x10b, x10c, x10d) was (0.99, 1.00, 1.00)
    • for qp1000 the rQPS for (x10b, x10c, x10d) was (0.99, 1.00, 0.98)

Friday, June 6, 2025

Postgres 18 beta1: large server, IO-bound Insert Benchmark

This has results for an IO-bound Insert Benchmark with Postgres on a large server. A blog post about a CPU-bound workload on the same server is here.

tl;dr

  • initial load step (l.i0)
    • 18 beta1 is 4% faster than 17.4
  • create index step (l.x)
    • 18 beta1 with io_method =sync and =workers has similar perf as 17.4 and is 7% faster than 17.4 with =io_uring
  • write-heavy steps (l.i1, l.i2)
    • 18 beta1 and 17.4 have similar performance except for l.i2 with 18 beta1 and io_method=workers where 18 beta1 is 40% faster. This is an odd result and I am repeating the benchmark.
  • range query steps (qr100, qr500, qr1000)
    • 18 beta1 is up to (3%, 2%, 3%) slower than 17.4 with io_method= (sync, workers, io_uring). The issue might be new CPU overhead.
  • point query steps (qp100, qp500, qp1000)
    • 18 beta1 is up to (3%, 5%, 2%) slower than 17.4 with io_method= (sync, workers, io_uring). The issue might be new CPU overhead.

Builds, configuration and hardware

I compiled Postgres from source using -O2 -fno-omit-frame-pointer for version 18 beta1 and 17.4. I got the source for 18 beta1 from github using the REL_18_BETA1 tag because I started this benchmark effort a few days before the official release.

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4. More details on it are here.

The config file for Postgres 17.4 is here and named conf.diff.cx10a_c32r128.

For 18 beta1 I tested 3 configuration files, and they are here:
  • conf.diff.cx10b_c32r128 (x10b) - uses io_method=sync
  • conf.diff.cx10cw4_c32r128 (x10cw4) - uses io_method=worker with io_workers=4
  • conf.diff.cx10d_c32r128 (x10d) - uses io_method=io_uring
The Benchmark

The benchmark is explained here and is run with 20 clients and 20 tables (table per client) and 200M rows per table. The database is larger than memory. In some benchmark steps the working set is larger than memory (see the point query steps qp100, qp500, qp1000) while the working set is cached for other benchmark steps (see the range query steps qr100, qr500 and qr1000).

The benchmark steps are:

  • l.i0
    • insert 200 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts 4M rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and 1M rows are inserted and deleted per table.
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: overview

The performance report is here.

The summary section has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS (rQPS) to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from Postgres 17.4.

When rQPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. When it is 0.90 then I claim there is a 10% regression. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.97, green for >= 1.03 and grey for values between 0.98 and 1.02.

Results: details

The performance summary is here

See the previous section for the definition of relative QPS (rQPS). For the rQPS formula, Postgres 17.4 is the base version and that is compared with results from 18 beta1 using the three configurations explained above:
  • x10b with io_method=sync
  • x10cw4 with io_method=worker and io_workers=4
  • x10d with io_method=io_uring
The summary of the summary is:
  • initial load step (l.i0)
    • 18 beta1 is 4% faster than 17.4
    • From metrics, 18 beta1 has a lower context switch rate (cspq) and sustains a higher write rate to storage (wmbps).
  • create index step (l.x)
    • 18 beta1 with io_method =sync and =workers has similar perf as 17.4 and is 7% faster than 17.4 with =io_uring
    • From metrics, 18 beta1 with io_method=io_uring sustains a higher write rate (wmbps)
  • write-heavy steps (l.i1, l.i2)
    • 18 beta1 and 17.4 have similar performance except for l.i2 with 18 beta1 and io_method=workers where 18 beta1 is 40% faster. This is an odd result and I am repeating the benchmark.
    • From metrics for l.i1 and l.i2, in the case where 18 beta1 is 40% faster, there is much less CPU/operation (cpupq).
  • range query steps (qr100, qr500, qr1000)
    • 18 beta1 is up to (3%, 2%, 3%) slower than 17.4 with io_method= (sync, workers, io_uring)
    • From metrics for qr100, qr500 and qr1000 the problem might be more CPU/operation (cpupq)
    • Both 17.4 and 18 beta1 failed to sustain the target rate of 20,000 inserts and 20,000 deletes/s. They were close and did ~18,000/s for each. See the third table here.
  • point query steps (qp100, qp500, qp1000)
    • 18 beta1 is up to (3%, 5%, 2%) slower than 17.4 with io_method= (sync, workers, io_uring).
    • From metrics for qp100, qp500 and qp1000 the problem might be more CPU/operation (cpupq)
    • Both 17.4 and 18 beta1 failed to sustain the target rate of 20,000 inserts and 20,000 deletes/s. They were close and did ~18,000/s for each. See the third table here.
The summary is:
  • initial load step (l.i0)
    • rQPS for (x10b, x10cw4, x10d) was (1.04, 1.04, 1.04)
  • create index step (l.x)
    • rQPS for (x10b, x10cw4, x10d) was (0.99, 0.99, 1.07)
  • write-heavy steps (l.i1, l.i2)
    • for l.i1 the rQPS for (x10b, x10cw4, x10d) was (1.01, 0.99, 1.02)
    • for l.i2 the rQPS for (x10b, x10cw4, x10d) was (1.00, 1.40, 0.99)
  • range query steps (qr100, qr500, qr1000)
    • for qr100 the rQPS for (x10b, x10cw4, x10d) was (0.97, 0.98, 0.97)
    • for qr500 the rQPS for (x10b, x10cw4, x10d) was (0.98, 0.98, 0.97)
    • for qr1000 the rQPS for (x10b, x10cw4, x10d) was (1.00, 0.99, 0.98)
  • point query steps (qp100, qp500, qp1000)
    • for qp100 the rQPS for (x10b, x10cw4, x10d) was (1.00, 0.99, 0.98)
    • for qp500 the rQPS for (x10b, x10cw4, x10d) was (1.00, 0.95, 0.98)
    • for qp1000 the rQPS for (x10b, x10cw4, x10d) was (0.97, 0.95, 0.99)

Tuesday, September 24, 2024

Configuration options that make writes faster or slower with MySQL and Postgres

This post is inspired by a comment about a recent blog post I shared. While I don't focus on MySQL vs Postgres comparisons, sometimes I share them, and in my benchmarks with CPU-bound workloads and fsync disabled (which makes them even more CPU-bound) the general result was that insert throughput was much better with Postgres than with MySQL+InnoDB.

The commenter stated that Postgres and MySQL+InnoDB get similar insert rates when the MySQL binlog is disabled. I rarely test such a config because I never ran MySQL in production without the binlog enabled. Regardless, the commenter is a smart person so I decided to run a few more tests.

Some of the configurations I test here are great for performance but lousy for durability so don't take anything here as tuning advice.

tl;dr

  • MySQL+InnoDB suffers more than Postgres from updates that require index maintenance
  • For MySQL with the binlog disabled, insert throughput gets close to Postgres results but not many deployments run primaries with the binlog disabled. The gap is much larger, in favor of Postgres, for updates 
  • When fsyncs are enabled the difference is much larger, in favor of Postgres. One reason is that MySQL must do an fsync for the binlog and InnoDB redo while Postgres only does it for the WAL. But there is more going on here and the fsync impact doesn't explain all of the performance differences.

Configurations tested

I used MySQL 8.0.39 and Postgres 17rc1 both compiled from source. The config files are here.

For MySQL I used 12 configurations:

name            binlog  fsync   InnoDB doublewrite
bl0_sync0_dw0   off     off     off
bl0_sync0_dw1   off     off     on
bl0_sync0_dwDO  off     off     DETECT_ONLY
bl0_sync1_dw0   off     on      off
bl0_sync1_dw1   off     on      on
bl0_sync1_dwDO  off     on      DETECT_ONLY
bl1_sync0_dw0   on      off     off
bl1_sync0_dw1   on      off     on
bl1_sync0_dwDO  on      off     DETECT_ONLY
bl1_sync1_dw0   on      on      off
bl1_sync1_dw1   on      on      on
bl1_sync1_dwDO  on      on      DETECT_ONLY

For Postgres I used 12 configurations on the medium server and 8 on the small server (I skipped the 4 that use LZ4 for wal_compression):

name                    fsync   wal_compression full_page_writes
sync0_comp0_fpi0        off     off             off
sync0_comp0_fpi1        off     off             on
sync0_comp1_fpi0        off     pglz            off
sync0_comp1_fpi1        off     pglz            on
sync0_compLZ4_fpi0      off     lz4             off
sync0_compLZ4_fpi1      off     lz4             on
sync1_comp0_fpi0        on      off             off
sync1_comp0_fpi1        on      off             on
sync1_comp1_fpi0        on      pglz            off
sync1_comp1_fpi1        on      pglz            on
sync1_compLZ4_fpi0      on      lz4             off
sync1_compLZ4_fpi1      on      lz4             on

And a few notes:
  • the high durability configuration for MySQL is bl1_sync1_dw1 but that means there will be two fsyncs per commit (one for binlog, one for InnoDB redo)
  • the InnoDB doublewrite options (disabled, enabled, DETECT_ONLY) don't impact performance here. They will matter on workloads that are heavier on IO.
  • by fsync enabled I mean fsync (or fdatasync) is done for any logs that are enabled (Postgres WAL, InnoDB redo, MySQL binlog) and enabled for InnoDB means innodb_flush_log_at_trx_commit =1 while disabled for InnoDB means =2.
  • for Postgres I use wal_sync_method=fdatasync but below I often refer to that as fsync
  • for Postgres fsync enabled/disabled is the value for synchronous_commit

Hardware

I used two server types:

  • small
    • Beelink SER7 7840HS with Ryzen 7 7840HS, 8 AMD cores, AMD SMT disabled, 32G RAM, 1T m.2 SSD for OS, an NVMe device for the database using ext4 and Ubuntu 22.04.
  • medium
    • A c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU, 64G RAM and SMT disabled so there are 16 cores. It uses Ubuntu 22.04 and storage is ext4 (data=writeback) using SW RAID 0 over 2 locally attached NVMe devices.
I haven't measured it on either the small or medium server but fdatasync latency might not be excellent on the medium server and definitely isn't excellent on the small server.
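
For anyone who wants to measure it, a minimal sketch of how fdatasync latency could be checked on the database filesystem is below. The path is a placeholder and this is not the method I used (I haven't measured it); it just times os.fdatasync on a small file:

import os
import time

PATH = "/path/to/db-filesystem/fsync_test.tmp"  # placeholder, same filesystem as the database
ITERATIONS = 1000

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o600)
latencies = []
for _ in range(ITERATIONS):
    os.write(fd, b"x" * 4096)
    start = time.monotonic()
    os.fdatasync(fd)               # what InnoDB redo, the binlog and the Postgres WAL wait on
    latencies.append(time.monotonic() - start)
os.close(fd)
os.unlink(PATH)

latencies.sort()
print("median fdatasync latency: %.3f ms" % (1000 * latencies[len(latencies) // 2]))
print("p99 fdatasync latency: %.3f ms" % (1000 * latencies[int(len(latencies) * 0.99)]))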

Benchmark

I used sysbench and my usage is explained here. But I only ran a few of the tests from it rather than the full 40+. The ones for which I report performance are insert, update-index and update-noindex. Note that update-index means that index maintenance is required and update-noindex means that it is not. 

For the small server the test is run with 1 thread, 1 table and 50M rows. For the medium server it is run with 12 threads, 8 tables and 10M rows per table.

Results: small server

The columns list the throughput for the insert (ins), update-index (upd-ix) and update-noindex (upd-nix).

When fsync is disabled

  • MySQL throughput with the binlog on is (57%, 83%, 54%) of throughput with the binlog off for (insert, update-index, update-noindex) using bl0_sync0_dw1. The result for update-index is the most interesting, and the likely explanation is that QPS for update-index is already low even without the binlog.
  • The various choices for the InnoDB doublewrite buffer (off, on, DETECT_ONLY) don't have an impact here. They will have an impact on workloads that do more write-back and I am happy that InnoDB now supports DETECT_ONLY.
  • MySQL with the binlog off gets (80%, 20%, 62%) of the throughput relative to Postgres using bl0_sync0_dw1 vs sync0_comp0_fpi1. The low value for update-index with MySQL supports what I wrote above in the first bullet point.
  • MySQL with the binlog on gets (46%, 17%, 33%) of the throughput relative to Postgres using bl0_sync0_dw1 vs sync0_comp0_fpi1. Most MySQL primaries will run with the binlog enabled.
  • Using Postgres with full_page_writes=on (recommended) cuts throughput in half for the update tests and by ~30% for the insert test when wal_compression uses pglz. The impact is much less from lz4, so just use lz4.

When fsync is enabled

  • It is much harder to spot CPU regressions when fsync is enabled, which is why most of my tests disable it
  • MySQL throughput with the binlog on is ~50% relative to the binlog off. This is expected because the bottleneck is fsync and there are 2 fsyncs/write with the binlog on vs 1 fsync/write with it off.
  • The various choices for the InnoDB doublewrite buffer (off, on, DETECT_ONLY) don't have an impact here.
  • MySQL with the binlog off gets ~27% of the throughput relative to Postgres and for MySQL with the binlog on that drops to ~14% relative to Postgres using bl0_sync1_dw1 and bl1_sync1_dw1 vs sync1_comp0_fpi1. These differences are larger than I expect. A small arithmetic cross-check of these ratios follows the tables below.
  • The impact from full_page_writes=on for Postgres is smaller here than it is above, but again the real issue is the CPU overhead from pglz and the solution is to use lz4.
--- without fsync

- no binlog for MySQL
ins     upd-ix  upd-nix dbms
20413    5662   19837   my8039 bl0_sync0_dw0
20458    5881   20271   my8039 bl0_sync0_dw1
20252    5790   19742   my8039 bl0_sync0_dwDO

- binlog for MySQL
ins     upd-ix  upd-nix dbms
11814    4935   11051   my8039 bl1_sync0_dw0
11614    4910   10992   my8039 bl1_sync0_dw1
11649    4896   11062   my8039 bl1_sync0_dwDO

- Postgres
ins     upd-ix  upd-nix dbms
25590   28835   32951   pg17rc1 sync0_comp0_fpi0
25471   28840   32837   pg17rc1 sync0_comp0_fpi1
26452   29196   32702   pg17rc1 sync0_comp1_fpi0
18537   14276   16033   pg17rc1 sync0_comp1_fpi1

--- with fsync

- no binlog for MySQL (1 fsync per commit: InnoDB)
ins     upd-ix  upd-nix dbms
  139     135     142   my8039 bl0_sync1_dw0
  140     134     142   my8039 bl0_sync1_dw1
  140     134     142   my8039 bl0_sync1_dwDO

- binlog for MySQL (2 fsync per commit: binlog + InnoDB)
ins     upd-ix  upd-nix dbms
   72      72      73   my8039 bl1_sync1_dw0
   73      71      71   my8039 bl1_sync1_dw1
   73      68      70   my8039 bl1_sync1_dwDO

- Postgres (1 fsync per commit: WAL)
ins     upd-ix  upd-nix dbms
  512     513     516   pg17rc1 sync1_comp0_fpi0
  505     507     517   pg17rc1 sync1_comp0_fpi1
  512     514     516   pg17rc1 sync1_comp1_fpi0
  438     422     387   pg17rc1 sync1_comp1_fpi1
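
As a cross-check on the ratios quoted in the bullets above, this small sketch recomputes a few of them from the throughput tables (the tuples are insert, update-index, update-noindex values copied from the tables):

# Throughput (insert, update-index, update-noindex) copied from the tables above.
pg_nofsync = (25471, 28840, 32837)           # pg17rc1 sync0_comp0_fpi1
my_nobinlog_nofsync = (20458, 5881, 20271)   # my8039 bl0_sync0_dw1
my_binlog_nofsync = (11614, 4910, 10992)     # my8039 bl1_sync0_dw1
pg_fsync = (505, 507, 517)                   # pg17rc1 sync1_comp0_fpi1
my_nobinlog_fsync = (140, 134, 142)          # my8039 bl0_sync1_dw1
my_binlog_fsync = (73, 71, 71)               # my8039 bl1_sync1_dw1

def pct(a, b):
    return tuple(round(100.0 * x / y) for x, y in zip(a, b))

print(pct(my_binlog_nofsync, my_nobinlog_nofsync))  # ~(57, 83, 54): binlog on vs off, no fsync
print(pct(my_nobinlog_nofsync, pg_nofsync))         # ~(80, 20, 62): MySQL no binlog vs Postgres
print(pct(my_nobinlog_fsync, pg_fsync))             # ~(28, 26, 27): with fsync, no binlog vs Postgres
print(pct(my_binlog_fsync, pg_fsync))               # ~(14, 14, 14): with fsync, binlog vs Postgres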

Results: medium server

The columns list the throughput for the insert (ins), update-index (upd-ix) and update-noindex (upd-nix).

When fsync is disabled

  • MySQL throughput with the binlog on is (44%, 79%, 43%) of throughput with the binlog off for (insert, update-index, update-noindex) using bl0_sync0_dw1. The result for update-index is the most interesting.
  • The various choices for the InnoDB doublewrite buffer (off, on, DETECT_ONLY) don't have an impact here. They will have an impact on workloads that do more write-back and I am happy that InnoDB now supports DETECT_ONLY.
  • MySQL with the binlog off gets (96%, 16%, 70%) of the throughput relative to Postgres using bl0_sync0_dw1 vs sync0_comp0_fpi1. Once again the result for update-index with MySQL is odd.
  • MySQL with the binlog on gets (42%, 13%, 30%) of the throughput relative to Postgres using bl0_sync0_dw1 vs sync0_comp0_fpi1. Most MySQL primaries will run with the binlog enabled.
  • Using Postgres with full_page_writes=on (recommended) reduces throughput by ~5% when wal_compression uses pglz and the solution is lz4.

When fsync is enabled

  • MySQL throughput with the binlog on is (34%, 58%, 41%) of throughput with the binlog off for (insert, update-index, update-noindex) using bl1_sync1_dw1 and bl0_sync1_dw1. This is (mostly) expected because the bottleneck is fsync and there are 2 fsyncs/write with the binlog on vs 1 fsync/write with it off.
  • The various choices for the InnoDB doublewrite buffer (off, on, DETECT_ONLY) don't have an impact here.
  • MySQL with the binlog off gets (133%, 68%, 132%) of the throughput relative to Postgres using bl0_sync1_dw1 vs sync1_compLZ4_fpi1. 
  • MySQL with the binlog on gets (45%, 39%, 54%) of the throughput relative to Postgres using bl1_sync1_dw1 vs sync1_compLZ4_fpi1. Most MySQL primaries will run with the binlog enabled.
  • The impact from full_page_writes=on for Postgres is large for updates (reduces throughput by ~50%) and smaller for the insert test because the insert test loads in PK order.

--- without fsync

- no binlog for MySQL
ins     upd-ix  upd-nix dbms
117078   17620  104215  my8039 bl0_sync0_dw0
118038   19821  103625  my8039 bl0_sync0_dw1
117373   19582  103625  my8039 bl0_sync0_dwDO

- binlog for MySQL
ins     upd-ix  upd-nix dbms
 52602   15201   44727  my8039 bl1_sync0_dw0
 52193   15736   44372  my8039 bl1_sync0_dw1
 52456   16293   45195  my8039 bl1_sync0_dwDO

- Postgres
ins     upd-ix  upd-nix dbms
127864  138330  154650  pg17rc1 sync0_comp0_fpi0
123512  123825  148386  pg17rc1 sync0_comp0_fpi1
127456  139634  155127  pg17rc1 sync0_comp1_fpi0
119504  117795  136421  pg17rc1 sync0_comp1_fpi1
127885  139301  154869  pg17rc1 sync0_compLZ4_fpi0
125192  125532  149092  pg17rc1 sync0_compLZ4_fpi1

--- with fsync

- no binlog for MySQL (1 fsync per commit for InnoDB redo)
ins     upd-ix  upd-nix dbms
 20602    6824   15804  my8039 bl0_sync1_dw0
 20350    6640   15095  my8039 bl0_sync1_dw1
 20677    6836   16134  my8039 bl0_sync1_dwDO

- binlog for MySQL (2 fsync per commit, one for binlog, one for InnoDB redo)
ins     upd-ix  upd-nix dbms
  6972    3863    6238  my8039 bl1_sync1_dw0
  6963    3829    6183  my8039 bl1_sync1_dw1
  6922    3904    6266  my8039 bl1_sync1_dwDO

- Postgres (1 fsync per commit, for WAL)
ins     upd-ix  upd-nix dbms
 19754   24293   26250  pg17rc1 sync1_comp0_fpi0
 15089    9507   10963  pg17rc1 sync1_comp0_fpi1
 19732   24335   26280  pg17rc1 sync1_comp1_fpi0
 13031    8571    9204  pg17rc1 sync1_comp1_fpi1
 19698   24629   26163  pg17rc1 sync1_compLZ4_fpi0
 15309    9712   11420  pg17rc1 sync1_compLZ4_fpi1

Saturday, September 14, 2024

MySQL and Postgres vs the Insert Benchmark on a large server

This has benchmark results for MySQL and Postgres vs the Insert Benchmark on a large server. My intent is to document how performance changes over time more so than start a Postgres vs MySQL argument.

MySQL has accumulated large regressions from 5.6 to 8.0 that are obvious on low-concurrency benchmarks. While they are less obvious on high-concurrency benchmarks, and there have been significant improvements to make MySQL 8 better at high-concurrency, the regressions that hurt low-concurrency results also reduce throughput at high-concurrency.

tl;dr

  • For Postgres 17rc1 vs 15.8
    • Performance is mostly the same
  • For MySQL 8.0.39 vs 5.6.51
    • Writes are much faster in 8.0.39
    • Read throughput is mixed: 8.0.39 is slower than, similar to, or faster than 5.6.51 depending on the context
  • For MySQL vs Postgres
    • MySQL is faster on point queries
    • Postgres is faster on writes and range queries
Builds, configuration and hardware

I compiled from source:
  • Postgres version 17rc1 from source using -O2 -fno-omit-frame-pointer.
  • MySQL versions 5.6.51, 5.7.44 and 8.0.39 from source using -DCMAKE_BUILD_TYPE =Release and -O2 -fno-omit-frame-pointer
The server is an ax162-s from Hetzner with 48 cores, 128G RAM and AMD SMT disabled. It uses Ubuntu 22.04 and storage is ext4 using SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.

The Postgres configuration file is here.
The MySQL configuration files are here for 5.6.51, 5.7.44 and 8.0.39.

The Benchmark

The benchmark is explained here and is run with 20 clients and a table per client with three workloads:
  • cached - database fits in the Postgres buffer pool
  • less IO-bound - database is larger than memory with 64M rows per table
  • more IO-bound - database is larger than memory with 200M rows per table
The benchmark steps are:

  • l.i0
    • insert X million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. The value of X is 20 for cached, 64 for less IO-bound and 200 for more IO-bound.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts X rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. The value of X is 40M for cached and 4M for less/more IO-bound.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and X rows are inserted and deleted per table. The value of X is 10M for cached and 1M for less/more IO-bound.
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: overview

The performance reports are here:
The summary section in each performance report has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. The base case here is MySQL 5.6.51 for MySQL vs Postgres, MySQL 5.6.51 for only MySQL and Postgres 15.8 for only Postgres.

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

Results: cached

From the summary for only Postgres:
  • Write-heavy steps are up to 11% faster in 17rc1 vs 15.8
  • Read-heavy steps are unchanged in 17rc1 vs 15.8
From the summary for only MySQL:
  • Write-heavy steps are ~2.5X to ~4X faster in 8.0.39 vs 5.6.51
  • Read-heavy steps are between 12% and 27% slower in 8.0.39 vs 5.6.51 
From the summary for MySQL vs Postgres:
  • Postgres 17rc1 is faster on write-heavy and range queries
  • MySQL 8.0.39 and Postgres 17rc1 get similar QPS on point queries
Results: less IO-bound

From the summary for only Postgres:
  • Write-heavy steps are up to 13% faster in 17rc1 vs 15.8 except for l.i2 which suffers from an intermittently high CPU overhead from get_actual_variable_range
  • Read-heavy steps are mostly similar between 17rc1 and 15.8
From the summary for only MySQL:
  • Write-heavy steps are ~1.9X to ~4X faster in 8.0.39 vs 5.6.51
  • Read-heavy steps are up to 24% slower in 8.0.39 vs 5.6.51
From the summary for MySQL vs Postgres:
  • Postgres 17rc1 is faster on write-heavy and range queries
    • The range queries are covering and the secondary indexes are mostly cached so the tests are CPU bound for both Postgres and MySQL (see rpq here)
  • MySQL 8.0.39 is faster on point queries
    • MySQL does ~0.1 reads/query while Postgres does ~3 (see rpq here). The table is index organized with MySQL and heap organized with Postgres. So with InnoDB the point queries just access the PK index while with Postgres they first access the PK index and then the heap table to get all of the columns.
Results: more IO-bound

From the summary for only Postgres:
  • Write-heavy steps have similar throughput between 17rc1 and 15.8 ignoring index create (l.x) and variance from l.i2 (problems with get_actual_variable_range)
  • Read-heavy steps are up to 6% faster in 17rc1 vs 15.8
From the summary for only MySQL:
  • Write-heavy steps are up to ~4X faster in 8.0.39 vs 5.6.51
  • Point queries are up to ~3X faster in 8.0.39 vs 5.6.51
  • Range queries are up to 27% slower in 8.0.39 vs 5.6.51
From the summary for MySQL vs Postgres:
  • Postgres 17rc1 is faster on write-heavy and range queries
    • See the less IO-bound section above
    • MySQL does more read IO per range query (see rpq here)
  • MySQL 8.0.39 is faster on point queries
    • See the less IO-bound section above
    • Postgres does ~2X more read IO per point query (see rpq here)

Thursday, September 12, 2024

MySQL + InnoDB vs sysbench on a large server

This has benchmark results for MySQL 5.6.51, 5.7.44 and 8.0.39 using sysbench on a large server. I also add a few results comparing Postgres 17rc1 with MySQL 8.0.39. My goal with that is to highlight things that can be made better in MySQL (and in Postgres).

One of the responses to my claims that MySQL is getting too many performance regressions over time is that this is true for low-concurrency tests but not for high-concurrency tests. Alas, that claim is more truthy than true and fixing some of these regressions would help make modern MySQL not look so slow when compared to modern Postgres.

tl;dr

  • MySQL 8.0 is faster than 5.6 for point queries and writes but slower for range queries
  • Postgres 17rc1 is a lot faster than MySQL 8.0 for point queries and writes. For range queries Postgres was still faster but the difference was usually not as large.

Builds, configuration and hardware

I compiled 
  • Postgres version 17rc1 from source using -O2 -fno-omit-frame-pointer.
  • MySQL versions 5.6.51, 5.7.44 and 8.0.39 from source using -DCMAKE_BUILD_TYPE =Release and -O2 -fno-omit-frame-pointer
The server is an ax162-s from Hetzner with 48 cores, 128G RAM and AMD SMT disabled. It uses Ubuntu 22.04 and storage is ext4 using SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.

The Postgres configuration file is here.
The MySQL configuration files are here for 5.6.51, 5.7.44 and 8.0.39.

Benchmark

I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres.

For the large server the tests run with 8 tables and 10M rows/table. There are 40 client threads, read-heavy microbenchmarks run for 180 seconds and write-heavy run for 300 seconds. The command line to run all tests was: bash r.sh 8 10000000 180 300 md2 1 1 40

Results

For the results below I split the 42 microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. The spreadsheet with all data is here.

Values from iostat and vmstat divided by QPS are here for MySQL 5.6, 5.7, 8.0 and here for MySQL 8.0.39 vs Postgres 17rc1. These help to explain why something is faster or slower because it shows how much HW is used per request.

The numbers in the spreadsheets are the relative QPS and $version is faster than the base case when it is greater than 1.0. When it is 3.0 then $version is 3X faster than the base case.

The relative QPS is computed by: (QPS for $version) / (QPS for base case).

For the comparison between MySQL 5.6.51, 5.7.44 and 8.0.39 the base case is 5.6.51 and $version is one of 5.7.44 or 8.0.39.

For the comparison between MySQL 8.0.39 and Postgres 17rc1 the base case is MySQL and $version is Postgres.

Results: MySQL 5.6.51 vs 5.7.44 vs 8.0.39

Point queries, part 1
  • MySQL 8.0 is slower than 5.6 on 4 of 11 tests and faster on 7 of 11 tests. 
  • The 4 on which it is slower are all for secondary index lookups. From vmstat metrics, when MySQL 8.0 is slower the problem is CPU overhead (see cpu/o here).
  • The 7 on which it is faster are all for PK lookups. From two of those tests, it is faster because it uses less CPU per query (see cpu/o here).
Point queries, part 2
  • MySQL 8.0 is faster than 5.6 in all tests here. All of these tests use a PK index. The perf difference is larger for queries that fetch more data. This suggests that new optimizer overheads prevent it from being faster in cases when it fetches less data and/or InnoDB in MySQL 8.0 doesn't have regressions for point queries relative to MySQL 5.6.
  • From vmstat metrics (see cpu/o here) MySQL 8.0 is faster because it uses less CPU
Range queries, part 1
  • MySQL 5.7 and 8.0 are slower than 5.6 for range queries. This is a surprise and the regressions are getting larger over time (worse in 8.0 than 5.7). The queries here do range scans of various lengths without aggregation.
  • From vmstat metrics (see cpu/o here) MySQL 8.0 is slower because it uses more CPU per query
Range queries, part 2
  • All of the tests here do some aggregation except for scan (the right most bars). And MySQL 8.0 is faster than 5.6 for all such tests.
  • When MySQL 8.0 is faster the primary reason is that it uses less CPU per query (see cpu/o here) and sometimes it also does fewer context switches per query (see cs/o here) where a high context switch rate often implies mutex contention.
Writes
  • MySQL 8.0 is significantly faster than 5.6 for all tests. These show the benefit of the work to improve performance for high-concurrency workloads.
  • MySQL 8.0 is faster because it uses less CPU per SQL statement (see cpu/o here) and does fewer context switches per SQL statement (see cs/o here). A high context switch rate often implies mutex contention.
Results: MySQL 8.0.39 vs Postgres 17rc1

Point queries, part 1
  • The result for hot-points is an outlier, perhaps because prepared statements are better in Postgres than in MySQL. In MySQL, server-side prepared statements just save on parse. The difference is large because MySQL uses a lot more CPU than Postgres (see cpu/o here).
  • In most other tests Postgres is ~1.5X faster than MySQL and the root cause is that MySQL uses more CPU per query (see cpu/o here). For the context switch rate, sometimes it is larger for MySQL and sometimes it is smaller (see cs/o here).
  • Excluding hot-points, the largest difference occurs on tests that do non-covering secondary index lookups. InnoDB tables are index-organized and for non-covering secondary index lookups to fetch the missing columns InnoDB must traverse the PK index while Postgres just uses the rowID (tuple ID) to access the heap pages.
Point queries, part 2
  • Postgres is ~1.5X faster for point queries, see above in Point queries, part 1 for more details
Range queries, part 1
  • Postgres is much slower than InnoDB on scan (full table scan) and on range-notcovered-pk. While I can only wave hands about the scan result, for the range-notcovered-pk result Postgres must scan the PK index and then look up missing columns from heap pages while InnoDB has all columns in the PK index because it is index-organized. For scan, the CPU overhead and context switch rates are much larger for Postgres (see cpu/o and cs/o here).
  • Postgres is much faster on the two tests that do non-covering secondary index scans because it fetches missing columns by following the tupleID to a heap page while InnoDB must traverse the PK index.
  • Excluding the 4 outliers, InnoDB and Postgres have similar performance
Range queries, part 2
  • The performance difference here is smaller than it is above for point queries. In one case, MySQL is much faster, in three cases Postgres is much faster and in two cases perf is similar.
  • For tests run before writes (less b-tree fragmentation) or after writes (more b-tree fragmentation) the pattern is that the advantage for Postgres is larger when the length of the range query is shorter -- see results for read-only.pre_range=X and read-only_range=X for X in 10000, 100 and 10. My guess is that MySQL optimizer overhead hurts performance and that is more obvious for queries that take less time -- see the CPU overhead metrics here (cpu/o columns).
Writes
  • Postgres is faster for all tests that do writes and is up to ~5X faster. The reason is that Postgres uses less CPU per statement (see cpu/o here) and does fewer context switches per statement (see cs/o here) where context switch rates often predict mutex contention.
  • The difference is smaller for read-write* tests because they do a mix of reads and writes.

Postgres 18rc1 vs sysbench

This post has results for Postgres 18rc1 vs sysbench on small and large servers. Results for Postgres 18beta3 are here for a small and larg...