Saturday, January 27, 2024

Updated Insert benchmark: Postgres 9.x to 16.x, small server, IO-bound database

This has results for Postgres vs the Insert Benchmark on a small server with an IO-bound workload. I include results for the latest point release from all major versions from 9.0 to 16.

  • While there are no regressions in the CPU-bound (cached) workload there are regressions here
  • There are two changes related to get_actual_variable_range and get_actual_variable_endpoint (a previous post also explained this). Note some parts of this workload are not typical and regressions I find here aren't relevant to many other workloads.
    • Starting in Postgres 12 the throughput for l.i1 and l.i2 improves by ~2X because the CPU overhead from the query planner during DELETE statements has been reduced.
    • Starting in Postgres 14 the throughput for range queries decreases by ~30% because the CPU overhead for range queries and DELETE statements has grown. I am still debugging this.
  • Most versions were unable to sustain the target write rates (1000 inserts/s and 1000 deletes/s) during the qr1000 and qp1000 benchmark steps. Only Postgres 12.17 and 13.13 sustained them; most others were far from the target and the worst were Postgres 14.10, 15.5 and 16.1.
  • Something changed for the worse in Postgres 14 that increases CPU overhead for queries and DELETE statements in this workload.
Comparing throughput in Postgres 16.1 to 9.0.23
  • Write-heavy - Postgres 16.1 is between 1.2X and 2.3X faster than 9.0.23
  • Range queries - Postgres 16.1 is up to ~20% slower than 9.0.23
  • Point queries - Postgres 16.1 is similar to 9.0.23

    Build + Configuration

    See the previous report for more details. I tested these versions: 9.0.23, 9.1.24, 9.2.24, 9.3.25, 9.4.26, 9.5.25, 9.6.24, 10.23, 11.22, 12.17, 13.13, 14.10, 15.5, 16.1. 

    The configuration files are in subdirectories from here. Search for files named conf.diff.cx9a2_bee which exist for each major version of Postgres.

    The Benchmark

    The test server is a Beelink SER4 with 8 AMD cores, 16G RAM, Ubuntu 22.04 and XFS using 1 m.2 device.

    The benchmark steps are:

    • l.i0
      • insert 800 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
    • l.x
      • create 3 secondary indexes per table. There is one connection per client.
    • l.i1
      • use 2 connections/client. One inserts 4M rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
    • l.i2
      • like l.i1 but each transaction modifies 5 rows (small transactions) and inserts 1M rows total
      • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow.
    • qr100
      • use 3 connections/client. One does range queries for 1800 seconds and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for a fixed amount of time. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
    • qp100
      • like qr100 except uses point queries on the PK index
    • qr500
      • like qr100 but the insert and delete rates are increased from 100/s to 500/s
    • qp500
      • like qp100 but the insert and delete rates are increased from 100/s to 500/s
    • qr1000
      • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
    • qp1000
      • like qp100 but the insert and delete rates are increased from 100/s to 1000/s

    The performance report is here.

    The summary has 3 tables. The first shows absolute throughput by DBMS tested and benchmark step. The second has throughput relative to the version on the first row of the table, which makes it easy to see how performance changes over time. The third shows the background insert rate for the benchmark steps that have background inserts, and whether each system sustained the target rates.

    Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
    • insert/s for l.i0, l.i1, l.i2
    • indexed rows/s for l.x
    • range queries/s for qr100, qr500, qr1000
    • point queries/s for qp100, qp500, qp1000
    Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.
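The relative QPS and color thresholds above are simple to compute; a small sketch (function names are mine, not from the benchmark scripts):

```python
# Relative QPS and the color buckets used in the summary tables.
# Thresholds match the post: red <= 0.95, green >= 1.05, grey in between.
def relative_qps(qps_me: float, qps_base: float) -> float:
    return qps_me / qps_base

def color(rel: float) -> str:
    if rel <= 0.95:
        return "red"    # regression
    if rel >= 1.05:
        return "green"  # improvement
    return "grey"       # within the noise

# Example: a step where $me does 122 ops/s and $base does 100 ops/s
print(color(relative_qps(122.0, 100.0)))  # green
```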

    • The base case is pg9023_def which means Postgres 9.0.23
    • For the read-heavy benchmark steps that do range queries (qr100, qr500, qr1000) throughput improved between Postgres 9.2 and 13 and then it drops by ~30% in Postgres 14.10 and I confirmed the drop is also in Postgres 14.0. I will start to explain this in another post.
    • For the read-heavy benchmark steps that do point queries (qp100, qp500, qp1000) throughput is mostly unchanged from 9.0.23 through 16.1.
    • For the write-heavy steps (l.i0, l.x, l.i1, l.i2) throughput improves a lot
      • l.i0 - things get a lot better in Postgres 11.22
      • l.x - things get a lot better between Postgres 9.4.26 and 11.22
      • l.i1, l.i2 - things get a lot better in Postgres 12.17 likely because the query planner overhead during DELETE statements has been reduced (see the comments about get_actual_variable_range)
    • Comparing throughput in Postgres 16.1 to 9.0.23
      • Write-heavy -- Postgres 16 is faster
        • l.i0, l.x, l.i1, l.i2 - relative QPS is 1.22, 2.32, 1.83, 1.87
      • Range queries -- Postgres 16 is mostly slower
        • qr100, qr500, qr1000 - relative QPS is 0.81, 0.89, 1.01
      • Point queries -- Postgres 16 is slightly slower
        • qp100, qp500, qp1000 - relative QPS is 0.98, 0.96, 1.00
    Target write rates

    The third table in the summary shows the write rates sustained during the read-write benchmark steps. The target write rates are 100/s for qr100 and qp100, 500/s for qr500 and qp500, and 1000/s for qr1000 and qp1000. Note that X/s means X inserts/s and X deletes/s. When the value is close enough to the target I assume the target has been sustained. The table cells in red indicate the cases where the target was not sustained.
    • For qr100, qp100, qr500, qp500 -- all versions sustained the targets
    • For qr1000, qp1000 - only Postgres 12.17 and 13.13 sustained the targets.
    One session is used for INSERT statements and another for DELETE statements. They run at the same rate so if one session runs slow, then both will be slow. I assume the problem here is that DELETE processing is slow and this is related to changes in get_actual_variable_range.

    The following table shows the number of CPU seconds consumed per connection during the qp1000 benchmark step. There is:
    • a big increase in CPU starting in 12.17 for the query connection
    • a big decrease in CPU starting in 12.17 for the delete connection
    • a big increase in CPU starting in 14.10 for the query and delete connections
    CPU seconds per connection during qp1000
    * query = connection that does point queries
    * ins = connection that does inserts
    * del = connection that does deletes

            query   ins     del
    11.22   626     157     3657
    12.17   311     144     1671
    13.13   312     145     1758
    14.10   595     158     3596
    15.5    609     156     3714
    16.1    612     158     3716
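To put numbers on the Postgres 14 regression, the ratios can be computed directly from the table above (a small sketch; the dict just copies two rows of the table):

```python
# CPU seconds per connection during qp1000, copied from the table above.
cpu = {
    "13.13": {"query": 312, "ins": 145, "del": 1758},
    "14.10": {"query": 595, "ins": 158, "del": 3596},
}

def ratio(a: str, b: str, conn: str) -> float:
    """CPU used by version a relative to version b for one connection."""
    return cpu[a][conn] / cpu[b][conn]

# The query and delete connections use roughly 2X more CPU in 14.10
print(round(ratio("14.10", "13.13", "query"), 2))  # 1.91
print(round(ratio("14.10", "13.13", "del"), 2))    # 2.05
```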


    1. I like the grouped bar-chart graph presentation. Traditionally I've seen analytical benchmarks like TPC-H presented in this format, it's like you borrowed the idea to use it for more OLTP-oriented testing. It's a really clear way to present testing like this.

      BTW, for some reason I can't comment using my google account... signon doesn't seem to work, likely it's a blogspot problem you can fix though

      1. Thank you. I continue to iterate on presentation. There are competing demands -- I want it to be readable, but given that I publish so many of these I also need to be able to generate these quickly.