
Thursday, September 11, 2025

Postgres 18rc1 vs sysbench

This post has results for Postgres 18rc1 vs sysbench on small and large servers. Results for Postgres 18beta3 are here for a small and large server.

tl;dr

  • Postgres 18 looks great
  • I continue to see small CPU regressions in Postgres 18 for range queries that don't do aggregation on low-concurrency workloads. I have yet to explain that. 
  • The throughput for the scan microbenchmark has more variance with Postgres 18. I assume this is related to more or less work getting done by vacuum but I have yet to debug the root cause.

Builds, configuration and hardware

I compiled Postgres from source for versions 17.6, 18 beta3 and 18 rc1.

The servers are:
  • small
    • an ASUS ExpertCenter PN53 with AMD Ryzen 7735HS CPU, 32G of RAM, 8 cores with AMD SMT disabled, Ubuntu 24.04 and an NVMe device with ext4 and discard enabled.
  • large32
    • Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32 Cores and AMD SMT disabled, Ubuntu 24.04 and an NVMe device with ext4 and discard enabled.
  • large48
    • an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled
    • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
    • 128G RAM
    • Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic)
All configurations use synchronous IO, which is the only option prior to Postgres 18; for Postgres 18 the config file sets io_method=sync.
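
If you are repeating these tests, one way to confirm which AIO method a Postgres 18 server is actually using is to query pg_settings. This is a minimal sketch that assumes a Postgres 18 server (io_method and io_workers do not exist in earlier versions), psycopg2 as the client library and a placeholder DSN:

import psycopg2

# Placeholder DSN -- adjust for your setup.
DSN = "dbname=postgres host=localhost"

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        # io_method and io_workers are new GUCs in Postgres 18.
        cur.execute(
            "SELECT name, setting FROM pg_settings "
            "WHERE name IN ('io_method', 'io_workers')"
        )
        for name, setting in cur.fetchall():
            print(f"{name} = {setting}")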

Configuration files:

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres.

For all servers the read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.

The number of tables and rows per table was:
  • small server - 1 table, 50M rows
  • large servers - 8 tables, 10M rows per table
The number of clients (amount of concurrency) was:
  • small server - 1
  • large32 server - 24
  • large48 server - 40
Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for Postgres 17.6)
When the relative QPS is > 1 then some version is faster than PG 17.6. When it is < 1 then there might be a regression. Values from iostat and vmstat divided by QPS are also provided here. These can help to explain why something is faster or slower because they show how much HW is used per request.
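
To make the arithmetic concrete, this is a minimal sketch of the relative QPS calculation used throughout this post; the QPS numbers in it are made up and only there to show the formula:

# Relative QPS = (QPS for some version) / (QPS for Postgres 17.6)
def relative_qps(qps_version, qps_base):
    return round(qps_version / qps_base, 2)

# Hypothetical throughput numbers for one microbenchmark.
qps_pg176 = 10000.0    # base case: Postgres 17.6
qps_pg18rc1 = 9800.0   # some version: Postgres 18rc1

print(relative_qps(qps_pg18rc1, qps_pg176))  # 0.98 -> less than 1, so possibly a small regression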

The numbers highlighted in yellow below might be from a small regression for range queries that don't do aggregation. But note that this does not reproduce for the full table scan microbenchmark (scan). I am not certain it is a regression as this might be from non-deterministic CPU overheads for read-heavy workloads that are run after vacuum. I hope to look at CPU flamegraphs soon.

Results: small server

I continue to see small (~3%) regressions in throughput for range queries without aggregation across Postgres 18 beta1, beta2, beta3 and rc1. But I have yet to debug this and am not certain it is a regression. I am also skeptical about the great results for scan. I suspect that I have more work to do to make the benchmark less subject to variance from MVCC GC (vacuum here). I also struggle with that on RocksDB (compaction), but not on InnoDB (purge).

Relative to: Postgres 17.6
col-1 : 18beta3
col-2 : 18rc1

col-1   col-2   point queries
1.01    0.98    hot-points_range=100
1.01    1.00    point-query_range=100
1.02    1.02    points-covered-pk_range=100
0.99    1.01    points-covered-si_range=100
1.00    0.99    points-notcovered-pk_range=100
1.00    0.99    points-notcovered-si_range=100
1.01    1.00    random-points_range=1000
1.01    0.99    random-points_range=100
1.01    1.00    random-points_range=10

col-1   col-2   range queries without aggregation
0.97    0.96    range-covered-pk_range=100
0.97    0.97    range-covered-si_range=100
0.99    0.99    range-notcovered-pk_range=100
0.99    0.99    range-notcovered-si_range=100
1.35    1.36    scan_range=100

col-1   col-2   range queries with aggregation
1.02    1.03    read-only-count_range=1000
1.00    1.00    read-only-distinct_range=1000
0.99    0.99    read-only-order_range=1000
1.00    1.00    read-only_range=10000
1.00    0.99    read-only_range=100
0.99    0.98    read-only_range=10
1.01    1.01    read-only-simple_range=1000
1.02    1.00    read-only-sum_range=1000

col-1   col-2   writes
0.99    0.99    delete_range=100
0.99    1.01    insert_range=100
0.99    0.99    read-write_range=100
0.99    0.99    read-write_range=10
0.98    0.98    update-index_range=100
1.00    0.99    update-inlist_range=100
0.98    0.98    update-nonindex_range=100
0.98    0.97    update-one_range=100
0.98    0.97    update-zipf_range=100
0.99    0.98    write-only_range=10000

Results: large32 server

I don't see small regressions in throughput for range queries without aggregation across Postgres 18 beta1, beta2, beta3 and rc1. I have only seen that on the low concurrency (small server) results.

The improvements on the scan microbenchmark come from using less CPU. But I am skeptical about the improvements. I might have more work to do to make the benchmark less subject to variance from MVCC GC (vacuum here). I also struggle with that on RocksDB (compaction), but not on InnoDB (purge).

Relative to: Postgres 17.6
col-1 : Postgres 18rc1

col-1   point queries
1.01    hot-points_range=100
1.01    point-query_range=100
1.01    points-covered-pk_range=100
1.01    points-covered-si_range=100
1.00    points-notcovered-pk_range=100
1.00    points-notcovered-si_range=100
1.01    random-points_range=1000
1.00    random-points_range=100
1.01    random-points_range=10

col-1   range queries without aggregation
0.99    range-covered-pk_range=100
0.99    range-covered-si_range=100
0.99    range-notcovered-pk_range=100
0.99    range-notcovered-si_range=100
1.12    scan_range=100

col-1   range queries with aggregation
1.00    read-only-count_range=1000
1.02    read-only-distinct_range=1000
1.01    read-only-order_range=1000
1.03    read-only_range=10000
1.00    read-only_range=100
1.00    read-only_range=10
1.00    read-only-simple_range=1000
1.00    read-only-sum_range=1000

col-1   writes
1.01    delete_range=100
1.00    insert_range=100
1.00    read-write_range=100
1.00    read-write_range=10
1.00    update-index_range=100
1.00    update-inlist_range=100
1.00    update-nonindex_range=100
0.99    update-one_range=100
1.00    update-zipf_range=100
1.00    write-only_range=10000

Results: large48 server

I don't see small regressions in throughput for range queries without aggregation across Postgres 18 beta1, beta2, beta3 and rc1. I have only seen that on the low concurrency (small server) results.

The improvements on the scan microbenchmark come from using less CPU. But I am skeptical about the improvements. I might have more work to do to make the benchmark less subject to variance from MVCC GC (vacuum here). I also struggle with that on RocksDB (compaction), but not on InnoDB (purge).

I am skeptical about the regression I see here for scan. That comes from using ~10% more CPU per query. I might have more work to do to make the benchmark less subject to variance from MVCC GC (vacuum here). I also struggle with that on RocksDB (compaction), but not on InnoDB (purge).

I have not seen the large improvements for the insert and delete microbenchmarks in previous tests on that large server. I assume this is another case where I need to figure out how to reduce variance when I run the benchmark.

Relative to: Postgres 17.6
col-1 : Postgres 18beta3
col-2 : Postgres 18rc1

col-1   col-2   point queries
0.99    0.99    hot-points_range=100
0.99    0.99    point-query_range=100
1.00    0.99    points-covered-pk_range=100
0.99    1.02    points-covered-si_range=100
1.00    0.99    points-notcovered-pk_range=100
0.99    1.01    points-notcovered-si_range=100
1.00    0.99    random-points_range=1000
1.00    0.99    random-points_range=100
1.00    1.00    random-points_range=10

col-1   col-2   range queries without aggregation
0.99    0.99    range-covered-pk_range=100
0.98    0.99    range-covered-si_range=100
0.99    0.99    range-notcovered-pk_range=100
1.01    1.01    range-notcovered-si_range=100
0.91    0.91    scan_range=100

col-1   col-2   range queries with aggregation
1.04    1.03    read-only-count_range=1000
1.02    1.01    read-only-distinct_range=1000
1.01    1.00    read-only-order_range=1000
1.06    1.06    read-only_range=10000
0.98    0.97    read-only_range=100
0.99    0.99    read-only_range=10
1.02    1.02    read-only-simple_range=1000
1.03    1.03    read-only-sum_range=1000

col-1   col-2   writes
1.46    1.49    delete_range=100
1.32    1.32    insert_range=100
0.99    1.00    read-write_range=100
0.98    1.00    read-write_range=10
0.99    1.00    update-index_range=100
0.95    1.03    update-inlist_range=100
1.00    1.02    update-nonindex_range=100
0.96    1.04    update-one_range=100
1.00    1.01    update-zipf_range=100
1.00    1.00    write-only_range=10000




Wednesday, June 11, 2025

Postgres 18 beta1: small server, IO-bound Insert Benchmark (v2)

This is my second attempt at IO-bound Insert Benchmark results on a small server. The first attempt is here and has been deprecated because sloppy programming on my part meant the benchmark client created too many connections, which hurt results in some cases for Postgres 18 beta1.

There might be regressions from 17.5 to 18 beta1

  • QPS decreases by ~5% and CPU increases by ~5% on the l.i2 (write-only) step
  • QPS decreases by <= 2% and CPU increases by ~2% on the qr* (range query) steps
There might be regressions from 14.0 to 18 beta1
  • QPS decreases by ~6% and ~18% on the write-heavy steps (l.i1, l.i2)

Builds, configuration and hardware

I compiled Postgres from source using -O2 -fno-omit-frame-pointer for versions  14.0, 14.18, 15.0, 15.13, 16.0, 16.9, 17.0, 17.5 and 18 beta1.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, 32G of RAM and one NVMe device for the database. The OS has been updated to Ubuntu 24.04. More details on it are here.

For Postgres versions 14.0 through 17.5 the configuration files are in the pg* subdirectories here with the name conf.diff.cx10a_c8r32. For Postgres 18 beta1 I used 3 variations, which are here:
  • conf.diff.cx10b_c8r32
    • uses io_method='sync' to match Postgres 17 behavior
  • conf.diff.cx10c_c8r32
    • uses io_method='worker' and io_workers=16 to do async IO via a thread pool. I eventually learned that 16 is too large.
  • conf.diff.cx10d_c8r32
    • uses io_method='io_uring' to do async IO via io_uring
The Benchmark

The benchmark is explained here and is run with 1 client and 1 table with 800M rows. I provide two performance reports:
  • one to compare Postgres 14.0 through 18 beta1, all using synchronous IO
  • one to compare Postgres 17.5 with 18 beta1 using 3 configurations for 18 beta1 -- one for each of io_method= sync, workers and io_uring.
The benchmark steps are:

  • l.i0
    • insert 800 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts 4M rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. A sketch of this insert+delete pattern follows the list of steps below.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and 1M rows are inserted and deleted per table.
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
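As a rough illustration of the l.i1 and l.i2 insert+delete pattern described above, here is a minimal sketch. It is not the Insert Benchmark client: the table name, column list and DSN are hypothetical, there is no rate coordination between the two connections, and the real client runs the delete connection at the same rate as the insert connection.

import psycopg2

DSN = "dbname=ib host=localhost"  # placeholder
ROWS_PER_TXN = 50                 # 50 for l.i1 (big transactions), 5 for l.i2
TRANSACTIONS = 1000               # the real step runs for a fixed number of inserts

ins_conn = psycopg2.connect(DSN)
del_conn = psycopg2.connect(DSN)

next_pk = 800_000_000  # continue inserting after the initial load
for _ in range(TRANSACTIONS):
    # One connection inserts ROWS_PER_TXN rows per transaction.
    with ins_conn.cursor() as cur:
        for _ in range(ROWS_PER_TXN):
            next_pk += 1
            cur.execute("INSERT INTO t (pk, val) VALUES (%s, %s)", (next_pk, "x"))
    ins_conn.commit()

    # The other connection deletes the same number of rows (oldest first)
    # so the table size stays roughly constant.
    with del_conn.cursor() as cur:
        cur.execute(
            "DELETE FROM t WHERE pk IN (SELECT pk FROM t ORDER BY pk LIMIT %s)",
            (ROWS_PER_TXN,),
        )
    del_conn.commit()

ins_conn.close()
del_conn.close()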
Results: overview

The performance report is here for Postgres 14 through 18 and here for Postgres 18 configurations.

The summary sections (here, here and here) have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for the benchmark steps. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS (rQPS) to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from Postgres 17.5.

When rQPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. When it is 0.90 then I claim there is a 10% regression. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.97, green for >= 1.03 and grey for values between 0.98 and 1.02.

Results: Postgres 14.0 through 18 beta1

The performance summary is here

See the previous section for the definition of relative QPS (rQPS). For the rQPS formula, Postgres 14.0 is the base version and that is compared with more recent Postgres versions. The results here are similar to what I reported prior to fixing the too many connections problem in the benchmark client.

For 14.0 through 18 beta1, QPS on ...
  • the initial load (l.i0)
    • Performance is stable across versions
    • 18 beta1 and 17.5 have similar performance
    • rQPS for (17.5, 18 beta1 with io_method=sync) is (1.00, 0.99)
  • create index (l.x)
    • ~10% faster starting in 15.0
    • 18 beta1 and 17.5 have similar performance
    • rQPS for (17.5, 18 beta1 with io_method=sync) is (1.11, 1.12)
  • first write-only step (l.i1)
    • Performance decreases ~7% from version 16.9 to 17.0. CPU overhead (see cpupq here) increases by ~5% in 17.0.
    • 18 beta1 and 17.5 have similar performance
    • rQPS for (17.5, 18 beta1 with io_method=sync) is (0.93, 0.94)
  • second write-only step (l.i2)
    • Performance decreases ~6% in 15.0, ~8% in 17.0 and then ~5% in 18.0. CPU overhead (see cpupq here) increases ~5%, ~6% and ~5% in 15.0, 17.0 and 18beta1. Of all benchmark steps, this has the largest perf regression from 14.0 through 18 beta1 which is ~20%.
    • 18 beta1 is ~4% slower than 17.5
    • rQPS for (17.5, 18 beta1 with io_method=sync) is (0.86, 0.82)
  • range query steps (qr100, qr500, qr1000)
    • 18 beta1 and 17.5 have similar performance, but 18 beta1 is slightly slower
    • rQPS for (17.5, 18 beta1 with io_method=sync) is (1.00, 0.99) for qr100, (0.97, 0.98) for qr500 and (0.97, 0.95) for qr1000. The issue is new CPU overhead, see cpupq here.
  • point query steps (qp100, qp500, qp1000)
    • 18 beta1 and 17.5 have similar performance but 18 beta1 is slightly slower
    • rQPS for (17.5, 18 beta1 with io_method=sync) is (1.00, 0.98) for qp100, (0.99, 0.98) for qp500 and (0.97, 0.96) for qp1000. The issue is new CPU overhead, see cpupq here.
Results: Postgres 17.5 vs 18 beta1

The performance summary is here.

See the previous section for the definition of relative QPS (rQPS). For the rQPS formula, Postgres 17.5 is the base version and that is compared with results from 18 beta1 using the three configurations explained above:
  • x10b with io_method=sync
  • x10c with io_method=worker and io_workers=16
  • x10d with io_method=io_uring
The summary is:
  • initial load step (l.i0)
    • rQPS for (x10b, x10c, x10d) was (0.99, 1.00, 1.00)
  • create index step (l.x)
    • rQPS for (x10b, x10c, x10d) was (1.01, 1.02, 1.02)
  • first write-heavy step (l.i1)
    • for l.i1 the rQPS for (x10b, x10c, x10d) was (1.00, 0.99, 1.01)
  • second write-heavy step (l.i2)
    • for l.i2 the rQPS for (x10b, x10c, x10d) was (0.96, 0.93, 0.94)
    • CPU overhead (see cpupq here) increases by ~5% in 18 beta1
  • range query steps (qr100, qr500, qr1000)
    • for qr100 the rQPS for (x10b, x10c, x10d) was (1.00, 0.99, 0.99)
    • for qr500 the rQPS for (x10b, x10c, x10d) was (1.00, 0.97, 0.99)
    • for qr1000 the rQPS for (x10b, x10c, x10d) was (0.99, 0.98, 0.97)
    • CPU overhead (see cpupq here, here and here) increases by ~2% in 18 beta1
  • point query steps (qp100, qp500, qp1000)
    • for qp100 the rQPS for (x10b, x10c, x10d) was (0.98, 0.99, 0.99)
    • for qp500 the rQPS for (x10b, x10c, x10d) was (0.99, 1.00, 1.00)
    • for qp1000 the rQPS for (x10b, x10c, x10d) was (0.99, 0.99, 0.99)

Sunday, June 8, 2025

Postgres 18 beta1: small server, CPU-bound Insert Benchmark (v2)

This is my second attempt at CPU-bound Insert Benchmark results on a small server. The first attempt is here and has been deprecated because sloppy programming on my part meant the benchmark client created too many connections, which hurt results in some cases for Postgres 18 beta1.

tl;dr

  • Performance between 17.5 and 18 beta1 is mostly similar on read-heavy steps
  • 18 beta1 might have small regressions from new CPU overheads on write-heavy steps

Builds, configuration and hardware

I compiled Postgres from source using -O2 -fno-omit-frame-pointer for versions  14.0, 14.18, 15.0, 15.13, 16.0, 16.9, 17.0, 17.5 and 18 beta1.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, 32G of RAM and one NVMe device for the database. The OS has been updated to Ubuntu 24.04 -- I used 22.04 prior to that. More details on it are here.

For Postgres versions 14.0 through 17.5 the configuration files are in the pg* subdirectories here with the name conf.diff.cx10a_c8r32. For Postgres 18 beta1 I used 3 variations, which are here:
  • conf.diff.cx10b_c8r32
    • uses io_method='sync' to match Postgres 17 behavior
  • conf.diff.cx10c_c8r32
    • uses io_method='worker' and io_workers=16 to do async IO via a thread pool. I eventually learned that 16 is too large.
  • conf.diff.cx10d_c8r32
    • uses io_method='io_uring' to do async IO via io_uring
The Benchmark

The benchmark is explained here and is run with 1 client and 1 table with 20M rows. I provide two performance reports:
  • one to compare Postgres 14.0 through 18 beta1, all using synchronous IO
  • one to compare Postgres 17.5 with 18 beta1 using 3 configurations for 18 beta1 -- one for each of io_method= sync, workers and io_uring.
The benchmark steps are:

  • l.i0
    • insert 20 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts 40M rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and 10M rows are inserted and deleted per table.
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: overview

The performance report is here for Postgres 14 through 18 and here for Postgres 18 configurations.

The summary sections (here and here) have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS (rQPS) to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from Postgres 17.5.

When rQPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. When it is 0.90 then I claim there is a 10% regression. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.97, green for >= 1.03 and grey for values between 0.98 and 1.02.

Results: Postgres 14.0 through 18 beta1

The performance summary is here

See the previous section for the definition of relative QPS (rQPS). For the rQPS formula, Postgres 14.0 is the base version and that is compared with more recent Postgres versions.

For 14.0 through 18 beta1, QPS on ...
  • l.i0 (the initial load)
    • Slightly faster starting in 15.0
    • Throughput was ~4% faster starting in 15.0 and that drops to ~2% in 18 beta1
    • 18 beta1 and 17.5 have similar performance
  • l.x (create index) 
    • Faster starting in 15.0
    • Throughput is between 9% and 17% faster in 15.0 through 18 beta1
    • 18 beta1 and 17.5 have similar performance
  • l.i1 (write-only)
    • Slower starting in 15.0
    • It is ~3% slower in 15.0 and that increases to between 6% and 10% in 18 beta1
    • 18 beta1 and 17.5 have similar performance
  • l.i2 (write-only)
    • Slower starting in 15.13 with a big drop in 17.0
    • 18 beta1 with io_method= sync and io_uring is worse than 17.5. It isn't clear but one problem might be more CPU/operation (see cpupq here)
  • qr100, qr500, qr1000 (range query)
    • Stable from 14.0 through 18 beta1
  • qp100, qp500, qp1000 (point query) 
    • Stable from 14.0 through 18 beta1
Results: Postgres 17.5 vs 18 beta1

The performance summary is here

See the previous section for the definition of relative QPS (rQPS). For the rQPS formula, Postgres 17.5 is the base version and that is compared with results from 18 beta1 using the three configurations explained above:
  • x10b with io_method=sync
  • x10c with io_method=worker and io_workers=16
  • x10d with io_method=io_uring
The summary of the summary is:
  • initial load step (l.i0)
    • 18 beta1 is 1% to 3% slower than 17.5
    • This step is short running so I don't have a strong opinion on the change
  • create index step (l.x)
    • 18 beta1 is 0% to 2% faster than 17.5
    • This step is short running so I don't have a strong opinion on the change
  • write-heavy step (l.i1)
    • 18 beta1 with io_method= sync and workers has similar perf as 17.5
    • 18 beta1 with io_method=io_uring is ~4% slower than 17.5. The problem might be more CPU/operation, see cpupq here
  • write-heavy step (l.i2)
    • 18 beta1 with io_method=workers is ~2% faster than 17.5
    • 18 beta1 with io_method= sync and io_uring is 6% and 8% slower than 17.5. The problem might be more CPU/operation, see cpupq here
  • range query steps (qr100, qr500, qr1000)
    • 18 beta1 and 17.5 have similar performance
  • point query steps (qp100, qp500, qp1000)
    • 18 beta1 and 17.5 have similar performance
The summary is:
  • initial load step (l.i0)
    • rQPS for (x10b, x10c, x10d) was (0.98, 0.99, 0.97)
  • create index step (l.x)
    • rQPS for (x10b, x10c, x10d) was (1.00, 1.02, 1.00)
  • write-heavy steps (l.i1, l.i2)
    • for l.i1 the rQPS for (x10b, x10c, x10d) was (1.01, 1.00, 0.96)
    • for l.i2 the rQPS for (x10b, x10c, x10d) was (0.94, 1.02, 0.92)
  • range query steps (qr100, qr500, qr1000)
    • for qr100 the rQPS for (x10b, x10c, x10d) was (0.99, 1.00, 1.00)
    • for qr500 the rQPS for (x10b, x10c, x10d) was (0.99, 1.01, 1.00)
    • for qr1000 the rQPS for (x10b, x10c, x10d) was (0.99, 1.00, 1.00)
  • point query steps (qp100, qp500, qp1000)
    • for qp100 the rQPS for (x10b, x10c, x10d) was (1.00, 1.00, 1.00)
    • for qp500 the rQPS for (x10b, x10c, x10d) was (0.99, 1.00, 1.00)
    • for qp1000 the rQPS for (x10b, x10c, x10d) was (0.99, 1.00, 0.98)

Friday, June 6, 2025

Postgres 18 beta1: large server, IO-bound Insert Benchmark

This has results for an IO-bound Insert Benchmark with Postgres on a large server. A blog post about a CPU-bound workload on the same server is here.

tl;dr

  • initial load step (l.i0)
    • 18 beta1 is 4% faster than 17.4
  • create index step (l.x)
    • 18 beta1 with io_method =sync and =workers has similar perf as 17.4 and is 7% faster than 17.4 with =io_uring
  • write-heavy steps (l.i1, l.i2)
    • 18 beta1 and 17.4 have similar performance except for l.i2 with 18 beta1 and io_method=workers where 18 beta1 is 40% faster. This is an odd result and I am repeating the benchmark.
  • range query steps (qr100, qr500, qr1000)
    • 18 beta1 is up to (3%, 2%, 3%) slower than 17.4 with io_method= (sync, workers, io_uring). The issue might be new CPU overhead.
  • point query steps (qp100, qp500, qp1000)
    • 18 beta1 is up to (3%, 5%, 2%) slower than 17.4 with io_method= (sync, workers, io_uring). The issue might be new CPU overhead.

Builds, configuration and hardware

I compiled Postgres from source using -O2 -fno-omit-frame-pointer for version 18 beta1 and 17.4. I got the source for 18 beta1 from github using the REL_18_BETA1 tag because I started this benchmark effort a few days before the official release.

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4. More details on it are here.

The config file for Postgres 17.4 is here and named conf.diff.cx10a_c32r128.

For 18 beta1 I tested 3 configuration files, and they are here:
  • conf.diff.cx10b_c32r128 (x10b) - uses io_method=sync
  • conf.diff.cx10cw4_c32r128 (x10cw4) - uses io_method=worker with io_workers=4
  • conf.diff.cx10d_c32r128 (x10d) - uses io_method=io_uring
The Benchmark

The benchmark is explained here and is run with 20 clients and 20 tables (table per client) and 200M rows per table. The database is larger than memory. In some benchmark steps the working set is larger than memory (see the point query steps qp100, qp500, qp1000) while the working set is cached for other benchmark steps (see the range query steps qr100, qr500 and qr1000).

The benchmark steps are:

  • l.i0
    • insert 200 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts 4M rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and 1M rows are inserted and deleted per table.
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: overview

The performance report is here.

The summary section has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS (rQPS) to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from Postgres 17.4.

When rQPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. When it is 0.90 then I claim there is a 10% regression. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.97, green for >= 1.03 and grey for values between 0.98 and 1.02.

Results: details

The performance summary is here

See the previous section for the definition of relative QPS (rQPS). For the rQPS formula, Postgres 17.4 is the base version and that is compared with results from 18 beta1 using the three configurations explained above:
  • x10b with io_method=sync
  • x10cw4 with io_method=worker and io_workers=4
  • x10d with io_method=io_uring
The summary of the summary is:
  • initial load step (l.i0)
    • 18 beta1 is 4% faster than 17.4
    • From metrics, 18 beta1 has a lower context switch rate (cspq) and sustains a higher write rate to storage (wmbps).
  • create index step (l.x)
    • 18 beta1 with io_method =sync and =workers has similar perf as 17.4 and is 7% faster than 17.4 with =io_uring
    • From metrics, 18 beta1 with io_method=io_uring sustains a higher write rate (wmbps)
  • write-heavy steps (l.i1, l.i2)
    • 18 beta1 and 17.4 have similar performance except for l.i2 with 18 beta1 and io_method=workers where 18 beta1 is 40% faster. This is an odd result and I am repeating the benchmark.
    • From metrics for l.i1 and l.i2, in the case where 18 beta1 is 40% faster, there is much less CPU/operation (cpupq).
  • range query steps (qr100, qr500, qr1000)
    • 18 beta1 is up to (3%, 2%, 3%) slower than 17.4 with io_method= (sync, workers, io_uring)
    • From metrics for qr100, qr500 and qr1000 the problem might be more CPU/operation (cpupq)
    • Both 17.4 and 18 beta1 failed to sustain the target rate of 20,000 inserts and 20,000 deletes/s. They were close and did ~18,000/s for each. See the third table here.
  • point query steps (qp100, qp500, qp1000)
    • 18 beta1 is up to (3%, 5%, 2%) slower than 17.4 with io_method= (sync, workers, io_uring).
    • From metrics for qp100, qp500 and qp1000 the problem might be more CPU/operation (cpupq)
    • Both 17.4 and 18 beta1 failed to sustain the target rate of 20,000 inserts and 20,000 deletes/s. They were close and did ~18,000/s for each. See the third table here.
The summary is:
  • initial load step (l.i0)
    • rQPS for (x10b, x10cw4, x10d) was (1.04, 1.04, 1.04)
  • create index step (l.x)
    • rQPS for (x10b, x10cw4, x10d) was (0.99, 0.99, 1.07)
  • write-heavy steps (l.i1, l.i2)
    • for l.i1 the rQPS for (x10b, x10cw4, x10d) was (1.01, 0.99, 1.02)
    • for l.i2 the rQPS for (x10b, x10cw4, x10d) was (1.00, 1.40, 0.99)
  • range query steps (qr100, qr500, qr1000)
    • for qr100 the rQPS for (x10b, x10cw4, x10d) was (0.97, 0.98, 0.97)
    • for qr500 the rQPS for (x10b, x10cw4, x10d) was (0.98, 0.98, 0.97)
    • for qr1000 the rQPS for (x10b, x10cw4, x10d) was (1.00, 0.99, 0.98)
  • point query steps (qp100, qp500, qp1000)
    • for qp100 the rQPS for (x10b, x10cw4, x10d) was (1.00, 0.99, 0.98)
    • for qp500 the rQPS for (x10b, x10cw4, x10d) was (1.00, 0.95, 0.98)
    • for qp1000 the rQPS for (x10b, x10cw4, x10d) was (0.97, 0.95, 0.99)

Tuesday, September 24, 2024

Configuration options that make writes faster or slower with MySQL and Postgres

This post is inspired by a comment about a recent blog post I shared. While I don't focus on MySQL vs Postgres comparisons, sometimes I share them, and in my benchmarks with CPU-bound workloads and fsync disabled (which makes them even more CPU-bound) the general result was that insert throughput was much better with Postgres than with MySQL+InnoDB.

The commenter stated that Postgres and MySQL+InnoDB get similar insert rates when the MySQL binlog is disabled. I rarely test such a config because I never ran MySQL in production without the binlog enabled. Regardless, the commenter is a smart person so I decided to run a few more tests.

Some of the configurations I test here are great for performance but lousy for durability so don't take anything here as tuning advice.

tl;dr

  • MySQL+InnoDB suffers more than Postgres from updates that require index maintenance
  • For MySQL with the binlog disabled, insert throughput gets close to Postgres results but not many deployments run primaries with the binlog disabled. The gap is much larger, in favor of Postgres, for updates 
  • When fsyncs are enabled the difference is much larger, in favor of Postgres. One reason is that MySQL must do an fsync for the binlog and InnoDB redo while Postgres only does it for the WAL. But there is more going on here and the fsync impact doesn't explain all of the performance differences.

Configurations tested

I used MySQL 8.0.39 and Postgres 17rc1 both compiled from source. The config files are here.

For MySQL I used 12 configurations:

name            binlog  fsync   InnoDB doublewrite
bl0_sync0_dw0   off     off     off
bl0_sync0_dw1   off     off     on
bl0_sync0_dwDO  off     off     DETECT_ONLY
bl0_sync1_dw0   off     on      off
bl0_sync1_dw1   off     on      on
bl0_sync1_dwDO  off     on      DETECT_ONLY
bl1_sync0_dw0   on      off     off
bl1_sync0_dw1   on      off     on
bl1_sync0_dwDO  on      off     DETECT_ONLY
bl1_sync1_dw0   on      on      off
bl1_sync1_dw1   on      on      on
bl1_sync1_dwDO  on      on      DETECT_ONLY

For Postgres I used 12 configurations on the medium server and 8 on the small server (I skipped the 4 that use LZ4 for wal_compression):

name                    fsync   wal_compression full_page_writes
sync0_comp0_fpi0        off     off             off
sync0_comp0_fpi1        off     off             on
sync0_comp1_fpi0        off     pglz            off
sync0_comp1_fpi1        off     pglz            on
sync0_compLZ4_fpi0      off     lz4             off
sync0_compLZ4_fpi1      off     lz4             on
sync1_comp0_fpi0        on      off             off
sync1_comp0_fpi1        on      off             on
sync1_comp1_fpi0        on      pglz            off
sync1_comp1_fpi1        on      pglz            on
sync1_compLZ4_fpi0      on      lz4             off
sync1_compLZ4_fpi1      on      lz4             on

And a few notes:
  • the high durability configuration for MySQL is bl1_sync1_dw1 but that means there will be two fsyncs per commit (one for binlog, one for InnoDB redo)
  • the InnoDB doublewrite options (disabled, enabled, DETECT_ONLY) don't impact performance here. They will matter on workloads that are heavier on IO.
  • by fsync enabled I mean fsync (or fdatasync) is done for any logs that are enabled (Postgres WAL, InnoDB redo, MySQL binlog) and enabled for InnoDB means innodb_flush_log_at_trx_commit =1 while disabled for InnoDB means =2.
  • for Postgres I use wal_sync_method=fdatasync but below I often refer to that as fsync
  • for Postgres fsync enabled/disabled is the value for synchronous_commit

Hardware

I used two server types:

  • small
    • Beelink SER7 7840HS with Ryzen 7 7840HS, 8 AMD cores, AMD SMT disabled, 32G RAM, 1T m.2 SSD for OS, an NVMe device for the database using ext4 and Ubuntu 22.04.
  • medium
    • A c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU, 64G RAM and SMT disabled so there are 16 cores. It uses Ubuntu 22.04 and storage is ext4 (data=writeback) using SW RAID 0 over 2 locally attached NVMe devices.
I haven't measured it on either the small or medium server but fdatasync latency might not be excellent on the medium server and definitely isn't excellent on the small server.
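
For anyone who wants to measure it, a minimal sketch of how fdatasync latency could be checked on the database filesystem is below. The path is a placeholder and this is not the method I used (I haven't measured it); it just times os.fdatasync on a small file:

import os
import time

PATH = "/path/to/db-filesystem/fsync_test.tmp"  # placeholder, same filesystem as the database
ITERATIONS = 1000

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o600)
latencies = []
for _ in range(ITERATIONS):
    os.write(fd, b"x" * 4096)
    start = time.monotonic()
    os.fdatasync(fd)               # what InnoDB redo, the binlog and the Postgres WAL wait on
    latencies.append(time.monotonic() - start)
os.close(fd)
os.unlink(PATH)

latencies.sort()
print("median fdatasync latency: %.3f ms" % (1000 * latencies[len(latencies) // 2]))
print("p99 fdatasync latency: %.3f ms" % (1000 * latencies[int(len(latencies) * 0.99)]))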

Benchmark

I used sysbench and my usage is explained here. But I only ran a few of the tests from it rather than the full 40+. The ones for which I report performance are insert, update-index and update-noindex. Note that update-index means that index maintenance is required and update-noindex means that it is not. 

For the small server the test is run with 1 thread, 1 table and 50M rows. For the medium server it is run with 12 threads, 8 tables and 10M rows per table.

Results: small server

The columns list the throughput for the insert (ins), update-index (upd-ix) and update-noindex (upd-nix).

When fsync is disabled

  • MySQL throughput with the binlog on is (57%, 83%, 54%) of throughput with the binlog off for (insert, update-index, update-noindex) using bl0_sync0_dw1. The result for update-index is the most interesting, and the likely explanation is that QPS for update-index is already low even without the binlog.
  • The various choices for the InnoDB doublewrite buffer (off, on, DETECT_ONLY) don't have an impact here. They will have an impact on workloads that do more write-back and I am happy that InnoDB now supports DETECT_ONLY.
  • MySQL with the binlog off gets (80%, 20%, 62%) of the throughput relative to Postgres using bl0_sync0_dw1 vs sync0_comp0_fpi1. The low value for update-index with MySQL supports what I wrote above in the first bullet point.
  • MySQL with the binlog on gets (46%, 17%, 33%) of the throughput relative to Postgres using bl0_sync0_dw1 vs sync0_comp0_fpi1. Most MySQL primaries will run with the binlog enabled.
  • Using Postgres with full_page_writes=on (recommended) cuts throughput in half for the update tests and by ~30% for the insert test when wal_compression uses pglz. The impact is much less from lz4, so just use lz4.

When fsync is enabled

  • It is much harder to spot CPU regressions when fsync is enabled, which is why most of my tests disable it
  • MySQL throughput with the binlog on is ~50% relative to the binlog off. This is expected because the bottleneck is fsync and there are 2 fsyncs/write with the binlog on vs 1 fsync/write with it off.
  • The various choices for the InnoDB doublewrite buffer (off, on, DETECT_ONLY) don't have an impact here.
  • MySQL with the binlog off gets ~27% of the throughput relative to Postgres and for MySQL with the binlog on that drops to ~14% relative to Postgres using bl0_sync1_dw1 and bl1_sync1_dw1 vs sync1_comp0_fpi1. These differences are larger than I expect. A small arithmetic cross-check of these ratios follows the tables below.
  • The impact from full_page_writes=on for Postgres is smaller here than it is above, but again the real issue is the CPU overhead from pglz and the solution is to use lz4.
--- without fsync

- no binlog for MySQL
ins     upd-ix  upd-nix dbms
20413    5662   19837   my8039 bl0_sync0_dw0
20458    5881   20271   my8039 bl0_sync0_dw1
20252    5790   19742   my8039 bl0_sync0_dwDO

- binlog for MySQL
ins     upd-ix  upd-nix dbms
11814    4935   11051   my8039 bl1_sync0_dw0
11614    4910   10992   my8039 bl1_sync0_dw1
11649    4896   11062   my8039 bl1_sync0_dwDO

- Postgres
ins     upd-ix  upd-nix dbms
25590   28835   32951   pg17rc1 sync0_comp0_fpi0
25471   28840   32837   pg17rc1 sync0_comp0_fpi1
26452   29196   32702   pg17rc1 sync0_comp1_fpi0
18537   14276   16033   pg17rc1 sync0_comp1_fpi1

--- with fsync

- no binlog for MySQL (1 fsync per commit: InnoDB)
ins     upd-ix  upd-nix dbms
  139     135     142   my8039 bl0_sync1_dw0
  140     134     142   my8039 bl0_sync1_dw1
  140     134     142   my8039 bl0_sync1_dwDO

- binlog for MySQL (2 fsync per commit: binlog + InnoDB)
ins     upd-ix  upd-nix dbms
   72      72      73   my8039 bl1_sync1_dw0
   73      71      71   my8039 bl1_sync1_dw1
   73      68      70   my8039 bl1_sync1_dwDO

- Postgres (1 fsync per commit: WAL)
ins     upd-ix  upd-nix dbms
  512     513     516   pg17rc1 sync1_comp0_fpi0
  505     507     517   pg17rc1 sync1_comp0_fpi1
  512     514     516   pg17rc1 sync1_comp1_fpi0
  438     422     387   pg17rc1 sync1_comp1_fpi1
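
As a cross-check on the ratios quoted in the bullets above, this small sketch recomputes a few of them from the throughput tables (the tuples are insert, update-index, update-noindex values copied from the tables):

# Throughput (insert, update-index, update-noindex) copied from the tables above.
pg_nofsync = (25471, 28840, 32837)           # pg17rc1 sync0_comp0_fpi1
my_nobinlog_nofsync = (20458, 5881, 20271)   # my8039 bl0_sync0_dw1
my_binlog_nofsync = (11614, 4910, 10992)     # my8039 bl1_sync0_dw1
pg_fsync = (505, 507, 517)                   # pg17rc1 sync1_comp0_fpi1
my_nobinlog_fsync = (140, 134, 142)          # my8039 bl0_sync1_dw1
my_binlog_fsync = (73, 71, 71)               # my8039 bl1_sync1_dw1

def pct(a, b):
    return tuple(round(100.0 * x / y) for x, y in zip(a, b))

print(pct(my_binlog_nofsync, my_nobinlog_nofsync))  # ~(57, 83, 54): binlog on vs off, no fsync
print(pct(my_nobinlog_nofsync, pg_nofsync))         # ~(80, 20, 62): MySQL no binlog vs Postgres
print(pct(my_nobinlog_fsync, pg_fsync))             # ~(28, 26, 27): with fsync, no binlog vs Postgres
print(pct(my_binlog_fsync, pg_fsync))               # ~(14, 14, 14): with fsync, binlog vs Postgres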

Results: medium server

The columns list the throughput for the insert (ins), update-index (upd-ix) and update-noindex (upd-nix).

When fsync is disabled

  • MySQL throughput with the binlog on is (44%, 79%, 43%) of throughput with the binlog off for (insert, update-index, update-noindex) using bl0_sync0_dw1. The result for update-index is the most interesting.
  • The various choices for the InnoDB doublewrite buffer (off, on, DETECT_ONLY) don't have an impact here. They will have an impact on workloads that do more write-back and I am happy that InnoDB now supports DETECT_ONLY.
  • MySQL with the binlog off gets (96%, 16%, 70%) of the throughput relative to Postgres using bl0_sync0_dw1 vs sync0_comp0_fpi1. Once again the result for update-index with MySQL is odd.
  • MySQL with the binlog on gets (42%, 13%, 30%) of the throughput relative to Postgres using bl0_sync0_dw1 vs sync0_comp0_fpi1. Most MySQL primaries will run with the binlog enabled.
  • Using Postgres with full_page_writes=on (recommended) reduces throughput by ~5% when wal_compression uses pglz and the solution is lz4.

When fsync is enabled

  • MySQL throughput with the binlog on is (34%, 58%, 41%) of throughput with the binlog off for (insert, update-index, update-noindex) using bl1_sync1_dw1 and bl0_sync1_dw1. This is (mostly) expected because the bottleneck is fsync and there are 2 fsyncs/write with the binlog on vs 1 fsync/write with it off.
  • The various choices for the InnoDB doublewrite buffer (off, on, DETECT_ONLY) don't have an impact here.
  • MySQL with the binlog off gets (133%, 68%, 132%) of the throughput relative to Postgres using bl0_sync1_dw1 vs sync1_compLZ4_fpi1. 
  • MySQL with the binlog on gets (45%, 39%, 54%) of the throughput relative to Postgres using bl1_sync1_dw1 vs sync1_compLZ4_fpi1. Most MySQL primaries will run with the binlog enabled.
  • The impact from full_page_writes=on for Postgres is large for updates (reduces throughput by ~50%) and smaller for the insert test because the insert test loads in PK order.

--- without fsync

- no binlog for MySQL
ins     upd-ix  upd-nix dbms
117078   17620  104215  my8039 bl0_sync0_dw0
118038   19821  103625  my8039 bl0_sync0_dw1
117373   19582  103625  my8039 bl0_sync0_dwDO

- binlog for MySQL
ins     upd-ix  upd-nix dbms
 52602   15201   44727  my8039 bl1_sync0_dw0
 52193   15736   44372  my8039 bl1_sync0_dw1
 52456   16293   45195  my8039 bl1_sync0_dwDO

- Postgres
ins     upd-ix  upd-nix dbms
127864  138330  154650  pg17rc1 sync0_comp0_fpi0
123512  123825  148386  pg17rc1 sync0_comp0_fpi1
127456  139634  155127  pg17rc1 sync0_comp1_fpi0
119504  117795  136421  pg17rc1 sync0_comp1_fpi1
127885  139301  154869  pg17rc1 sync0_compLZ4_fpi0
125192  125532  149092  pg17rc1 sync0_compLZ4_fpi1

--- with fsync

- no binlog for MySQL (1 fsync per commit for InnoDB redo)
ins     upd-ix  upd-nix dbms
 20602    6824   15804  my8039 bl0_sync1_dw0
 20350    6640   15095  my8039 bl0_sync1_dw1
 20677    6836   16134  my8039 bl0_sync1_dwDO

- binlog for MySQL (2 fsync per commit, one for binlog, one for InnoDB redo)
ins     upd-ix  upd-nix dbms
  6972    3863    6238  my8039 bl1_sync1_dw0
  6963    3829    6183  my8039 bl1_sync1_dw1
  6922    3904    6266  my8039 bl1_sync1_dwDO

- Postgres (1 fsync per commit, for WAL)
ins     upd-ix  upd-nix dbms
 19754   24293   26250  pg17rc1 sync1_comp0_fpi0
 15089    9507   10963  pg17rc1 sync1_comp0_fpi1
 19732   24335   26280  pg17rc1 sync1_comp1_fpi0
 13031    8571    9204  pg17rc1 sync1_comp1_fpi1
 19698   24629   26163  pg17rc1 sync1_compLZ4_fpi0
 15309    9712   11420  pg17rc1 sync1_compLZ4_fpi1

Saturday, September 14, 2024

MySQL and Postgres vs the Insert Benchmark on a large server

This has benchmark results for MySQL and Postgres vs the Insert Benchmark on a large server. My intent is to document how performance changes over time more so than start a Postgres vs MySQL argument.

MySQL has accumulated large regressions from 5.6 to 8.0 that are obvious on low-concurrency benchmarks. While they are less obvious on high-concurrency benchmarks, and there have been significant improvements to make MySQL 8 better at high-concurrency, the regressions that hurt low-concurrency results also reduce throughput at high-concurrency.

tl;dr

  • For Postgres 17rc1 vs 15.8
    • Performance is mostly the same
  • For MySQL 8.0.39 vs 5.6.51
    • Writes are much faster in 8.0.39
    • Read throughput is mixed: 8.0.39 is slower than, similar to, or faster than 5.6.51 depending on the context
  • For MySQL vs Postgres
    • MySQL is faster on point queries
    • Postgres is faster on writes and range queries
Builds, configuration and hardware

I compiled from source:
  • Postgres version 17rc1 from source using -O2 -fno-omit-frame-pointer.
  • MySQL versions 5.6.51, 5.7.44 and 8.0.39 from source using -DCMAKE_BUILD_TYPE =Release and -O2 -fno-omit-frame-pointer
The server is an ax162-s from Hetzner with 48 cores, 128G RAM and AMD SMT disabled. It uses Ubuntu 22.04 and storage is ext4 using SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.

The Postgres configuration file is here.
The MySQL configuration files are here for 5.6.51, 5.7.44 and 8.0.39.

The Benchmark

The benchmark is explained here and is run with 20 clients and a table per client with three workloads:
  • cached - database fits in the Postgres buffer pool
  • less IO-bound - database is larger than memory with 64M rows per table
  • more IO-bound - database is larger than memory with 200M rows per table
The benchmark steps are:

  • l.i0
    • insert X million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. The value of X is 20 for cached, 64 for less IO-bound and 200 for more IO-bound.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts X rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. The value of X is 40M for cached and 4M for less/more IO-bound.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and X rows are inserted and deleted per table. The value of X is 10M for cached and 1M for less/more IO-bound.
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: overview

The performance reports are here:
The summary section in each performance report has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. The base case here is MySQL 5.6.51 for MySQL vs Postgres, MySQL 5.6.51 for only MySQL and Postgres 15.8 for only Postgres.

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

Results: cached

From the summary for only Postgres:
  • Write-heavy steps are up to 11% faster in 17rc1 vs 15.8
  • Read-heavy steps are unchanged in 17rc1 vs 15.8
From the summary for only MySQL:
  • Write-heavy steps are ~2.5X to ~4X faster in 8.0.39 vs 5.6.51
  • Read-heavy steps are between 12% and 27% slower in 8.0.39 vs 5.6.51 
From the summary for MySQL vs Postgres:
  • Postgres 17rc1 is faster on write-heavy and range queries
  • MySQL 8.0.39 and Postgres 17rc1 get similar QPS on point queries
Results: less IO-bound

From the summary for only Postgres:
  • Write-heavy steps are up to 13% faster in 17rc1 vs 15.8 except for l.i2 which suffers from an intermittently high CPU overhead from get_actual_variable_range
  • Read-heavy steps are mostly similar between 17rc1 and 15.8
From the summary for only MySQL:
  • Write-heavy steps are ~1.9X to ~4X faster in 8.0.39 vs 5.6.51
  • Read-heavy steps are up to 24% slower in 8.0.39 vs 5.6.51
From the summary for MySQL vs Postgres:
  • Postgres 17rc1 is faster on write-heavy and range queries
    • The range queries are covering and the secondary indexes are mostly cached so the tests are CPU bound for both Postgres and MySQL (see rpq here)
  • MySQL 8.0.39 is faster on point queries
    • MySQL does ~0.1 reads/query while Postgres does ~3 (see rpq here). The table is index organized with MySQL and heap organized with Postgres. So with InnoDB the point queries just access the PK index while with Postgres they first access the PK index and then the heap table to get all of the columns.
Results: more IO-bound

From the summary for only Postgres:
  • Write-heavy steps have similar throughput between 17rc1 and 15.8 ignoring index create (l.x) and variance from l.i2 (problems with get_actual_variable_range)
  • Read-heavy steps are up to 6% faster in 17rc1 vs 15.8
From the summary for only MySQL:
  • Write-heavy steps are up to ~4X faster in 8.0.39 vs 5.6.51
  • Point queries are up to ~3X faster in 8.0.39 vs 5.6.51
  • Range queries are up to 27% slower in 8.0.39 vs 5.6.51
From the summary for MySQL vs Postgres:
  • Postgres 17rc1 is faster on write-heavy and range queries
    • See the less IO-bound section above
    • MySQL does more read IO per range query (see rpq here)
  • MySQL 8.0.39 is faster on point queries
    • See the less IO-bound section above
    • Postgres does ~2X more read IO per point query (see rpq here)

Thursday, September 12, 2024

MySQL + InnoDB vs sysbench on a large server

This has benchmark results for MySQL 5.6.51, 5.7.44 and 8.0.39 using sysbench on a large server. I also add a few results comparing Postgres 17rc1 with MySQL 8.0.39. My goal with that is to highlight things that can be made better in MySQL (and in Postgres).

One of the responses to my claims that MySQL is getting too many performance regressions over time is that this is true for low-concurrency tests but not for high-concurrency tests. Alas, that claim is more truthy than true and fixing some of these regressions would help make modern MySQL not look so slow when compared to modern Postgres.

tl;dr

  • MySQL 8.0 is faster than 5.6 for point queries and writes but slower for range queries
  • Postgres 17rc1 is a lot faster than MySQL 8.0 for point queries and writes. For range queries Postgres was still faster but the difference was usually not as large.

Builds, configuration and hardware

I compiled 
  • Postgres version 17rc1 from source using -O2 -fno-omit-frame-pointer.
  • MySQL versions 5.6.51, 5.7.44 and 8.0.39 from source using -DCMAKE_BUILD_TYPE =Release and -O2 -fno-omit-frame-pointer
The server is an ax162-s from Hetzner with 48 cores, 128G RAM and AMD SMT disabled. It uses Ubuntu 22.04 and storage is ext4 using SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.

The Postgres configuration file is here.
The MySQL configuration files are here for 5.6.51, 5.7.44 and 8.0.39.

Benchmark

I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres.

For the large server the tests run with 8 tables and 10M rows/table. There are 40 client threads, read-heavy microbenchmarks run for 180 seconds and write-heavy run for 300 seconds. The command line to run all tests was: bash r.sh 8 10000000 180 300 md2 1 1 40

Results

For the results below I split the 42 microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. The spreadsheet with all data is here.

Values from iostat and vmstat divided by QPS are here for MySQL 5.6, 5.7, 8.0 and here for MySQL 8.0.39 vs Postgres 17rc1. These help to explain why something is faster or slower because it shows how much HW is used per request.

The numbers in the spreadsheets are the relative QPS and $version is faster than the base case when it is greater than 1.0. When it is 3.0 then $version is 3X faster than the base case.

The relative QPS is computed by: (QPS for $version) / (QPS for base case).

For the comparison between MySQL 5.6.51, 5.7.44 and 8.0.39 the base case is 5.6.51 and $version is one of 5.7.44 or 8.0.39.

For the comparison between MySQL 8.0.39 and Postgres 17rc1 the base case is MySQL and $version is Postgres.

Results: MySQL 5.6.51 vs 5.7.44 vs 8.0.39

Point queries, part 1
  • MySQL 8.0 is slower than 5.6 on 4 of 11 tests and faster on 7 of 11 tests. 
  • The 4 on which it is slower are all for secondary index lookups. From vmstat metrics, when MySQL 8.0 is slower the problem is CPU overhead (see cpu/o here).
  • The 7 on which it is faster are all for PK lookups. From two of those tests, it is faster because it uses less CPU per query (see cpu/o here).
Point queries, part 2
  • MySQL 8.0 is faster than 5.6 in all tests here. All of these tests use a PK index. The perf difference is larger for queries that fetch more data. This suggests that new optimizer overheads prevent it from being faster in cases when it fetches less data and/or InnoDB in MySQL 8.0 doesn't have regressions for point queries relative to MySQL 5.6.
  • From vmstat metrics (see cpu/o here) MySQL 8.0 is faster because it uses less CPU
Range queries, part 1
  • MySQL 5.7 and 8.0 are slower than 5.6 for range queries. This is a surprise and the regressions are getting larger over time (worse in 8.0 than 5.7). The queries here do range scans of various lengths without aggregation.
  • From vmstat metrics (see cpu/o here) MySQL 8.0 is slower because it uses more CPU per query
Range queries, part 2
  • All of the tests here do some aggregation except for scan (the right most bars). And MySQL 8.0 is faster than 5.6 for all such tests.
  • When MySQL 8.0 is faster the primary reason is that it uses less CPU per query (see cpu/o here) and sometimes it also does fewer context switches per query (see cs/o here) where a high context switch rate often implies mutex contention.
Writes
  • MySQL 8.0 is significantly faster than 5.6 for all tests. These show the benefit of the work to improve performance for high-concurrency workloads.
  • MySQL 8.0 is faster because it uses less CPU per SQL statement (see cpu/o here) and does fewer context switches per SQL statement (see cs/o here). A high context switch rate often implies mutex contention.
Results: MySQL 8.0.39 vs Postgres 17rc1

Point queries, part 1
  • The result for hot-points is an outlier, perhaps because prepared statements are better in Postgres than in MySQL. In MySQL, server-side prepared statements just save on parse. The difference is large because MySQL uses a lot more CPU than Postgres (see cpu/o here).
  • In most other tests Postgres is ~1.5X faster than MySQL and the root cause is that MySQL uses more CPU per query (see cpu/o here). For the context switch rate, sometimes it is larger for MySQL and sometimes it is smaller (see cs/o here).
  • Excluding hot-points, the largest difference occurs on tests that do non-covering secondary index lookups. InnoDB tables are index-organized and for non-covering secondary index lookups to fetch the missing columns InnoDB must traverse the PK index while Postgres just uses the rowID (tuple ID) to access the heap pages.
Point queries, part 2
  • Postgres is ~1.5X faster for point queries, see above in Point queries, part 1 for more details
Range queries, part 1
  • Postgres is much slower than InnoDB on scan (full table scan) and on range-notcovered-pk. While I can only wave hands about the scan result, for the range-notcovered-pk result Postgres must scan the PK index and then look up missing columns from heap pages while InnoDB has all columns in the PK index because it is index-organized. For scan, the CPU overhead and context switch rates are much larger for Postgres (see cpu/o and cs/o here).
  • Postgres is much faster on the two tests that do non-covering secondary index scans because it fetches missing columns by following the tupleID to a heap page while InnoDB must traverse the PK index.
  • Excluding the 4 outliers, InnoDB and Postgres have similar performance
Range queries, part 2
  • The performance difference here is smaller than it is above for point queries. In one case, MySQL is much faster, in three cases Postgres is much faster and in two cases perf is similar.
  • For tests run before writes (less b-tree fragmentation) or after writes (more b-tree fragmentation) the pattern is that the advantage for Postgres is larger when the length of the range query is shorter -- see results for read-only.pre_range=X and read-only_range=X for X in 10000, 100 and 10. My guess is that MySQL optimizer overhead hurts performance and that is more obvious for queries that take less time -- see the CPU overhead metrics here (cpu/o columns).
Writes
  • Postgres is faster for all tests that do writes and is up to ~5X faster. The reason is that Postgres uses less CPU per statement (see cpu/o here) and does fewer context switches per statement (see cs/o here) where context switch rates often predict mutex contention.
  • The difference is smaller for read-write* tests because they do a mix of reads and writes.

Postgres 18rc1 vs sysbench

This post has results for Postgres 18rc1 vs sysbench on small and large servers. Results for Postgres 18beta3 are here for a small and larg...