Friday, July 12, 2024

Postgres 17beta2 vs the Insert Benchmark on a medium server: looking good

This has benchmark results for Postgres 17beta2 using the Insert Benchmark on a medium server. By small, medium or large server I mean < 10 cores for small, 10 to 19 cores for medium, and 20+ cores for large. A result for sysbench from the same server is here. Results for the Insert Benchmark comparing 16.3 and 17beta1 on a large server are here and here.

This work was done by Small Datum LLC.

    tl;dr

    • 17beta2 looks good but results on public cloud HW might have more noise
    • I will repeat tests on my home servers soon

    Builds, configuration and hardware

    I compiled Postgres versions 16.3, 17beta1 and 17beta2 from source.

    The server is a c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU, 64G RAM and SMT disabled so there are 16 cores. It uses Ubuntu 22.04 and storage is ext4 (data=writeback) using SW RAID 0 over 2 locally attached NVMe devices.

    The configuration file is here.

    The Benchmark

    The benchmark is explained here. It is run with 8 clients and a table per client, using two workloads:
    • cached - database fits in the Postgres buffer pool
    • IO-bound - database is larger than memory and there are many reads from disk
    The benchmark steps are:

    • l.i0
      • insert X million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. The value of X is 10 for cached and 128 for IO-bound.
    • l.x
      • create 3 secondary indexes per table. There is one connection per client.
    • l.i1
      • use 2 connections/client. One inserts X rows per table and the other does deletes at the same rate as the inserts, so the table size stays roughly constant (a sketch of this pattern appears after this list). Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. The value of X is 40M for cached and 4M for IO-bound.
    • l.i2
      • like l.i1 but each transaction modifies 5 rows (small transactions) and X rows are inserted and deleted per table. The value of X is 10M for cached and 1M for IO-bound.
      • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
    • qr100
      • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
    • qp100
      • like qr100 except uses point queries on the PK index
    • qr500
      • like qr100 but the insert and delete rates are increased from 100/s to 500/s
    • qp500
      • like qp100 but the insert and delete rates are increased from 100/s to 500/s
    • qr1000
      • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
    • qp1000
      • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
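
    The l.i1 and l.i2 steps use the delete-plus-insert pattern described above. The sketch below is a minimal illustration of that pattern rather than the real benchmark client: the table name (ib_t1), its schema, the connection string and the use of psycopg2 are assumptions, and the real driver coordinates the two connections so deletes run at the same rate as inserts.

# Minimal sketch of the l.i1 pattern: one connection inserts rows in transactions
# of 50 while a second connection deletes the 50 oldest rows per transaction, so
# the table size stays roughly constant. Table ib_t1 (k bigserial primary key,
# val text) and the connection string are assumptions for illustration only.
import threading

import psycopg2

DSN = "dbname=ib user=postgres"  # assumed connection string
BATCH = 50                       # rows per transaction in l.i1
TOTAL = 10_000                   # small number for illustration (the post uses 40M or 4M)

def inserter():
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    done = 0
    while done < TOTAL:
        # one transaction inserts BATCH rows
        cur.executemany("INSERT INTO ib_t1 (val) VALUES (%s)", [("payload",)] * BATCH)
        conn.commit()
        done += BATCH
    conn.close()

def deleter():
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    done = 0
    while done < TOTAL:
        # one transaction deletes the BATCH oldest rows
        cur.execute(
            "DELETE FROM ib_t1 WHERE k IN (SELECT k FROM ib_t1 ORDER BY k LIMIT %s)",
            (BATCH,))
        conn.commit()
        done += BATCH
    conn.close()

threads = [threading.Thread(target=inserter), threading.Thread(target=deleter)]
for t in threads:
    t.start()
for t in threads:
    t.join()
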
    Results: overview

    The performance reports are here for cached and for IO-bound.

      The summary (for cached and for IO-bound) in each performance report has 3 tables. The first shows absolute throughput for each DBMS tested and benchmark step. The second shows throughput relative to the version in the first row of the table, which makes it easy to see how performance changes across versions. The third shows the background insert rate for the benchmark steps that have background inserts, which makes it easy to see which DBMS+configs failed to meet the SLA. Here all systems sustained the target rates.

      Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. The base case here is Postgres 16.3 and the versions for $me are 17beta1 and 17beta2. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
      • insert/s for l.i0, l.i1, l.i2
      • indexed rows/s for l.x
      • range queries/s for qr100, qr500, qr1000
      • point queries/s for qp100, qp500, qp1000
      Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.
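
      The snippet below is a small illustration of the relative QPS arithmetic and the color buckets described above; the thresholds come from the sentence above and the sample numbers are hypothetical.

# Relative QPS and the color buckets used in the summaries below.
def relative_qps(qps_me, qps_base):
    # $me is the version being evaluated, $base is the base case (16.3 here)
    return qps_me / qps_base

def bucket(rel):
    if rel <= 0.95:
        return "red"    # regression
    if rel >= 1.05:
        return "green"  # improvement
    return "grey"       # within noise

# Hypothetical example: the base case does 1000 point queries/s, my version does 890.
rel = relative_qps(890.0, 1000.0)
print(round(rel, 2), bucket(rel))  # prints: 0.89 red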

      Results: cached

      Summary:
      • Results here are less positive than the results on a large server I have at home. Perhaps there is more variance in HW perf in the public cloud. For now I will claim this detects no regressions and repeat tests on my home servers.
      • The worst results are from l.i2 and they are related to the CPU overhead from get_actual_variable_range, a problem I hope is fixed soon.
      • From vmstat metrics for l.i1 and l.i2 I don't see changes in CPU overhead (cpupq is CPU /operation) or context switches (cspq is context switches /operation). I do see a reduction in CPU utilization (cpups is the sum of the vmstat us and sy columns). A sketch of how these metrics are derived is after this list.
      • From ps (output not shared here) I see that the connections doing DELETE statements used ~7% more CPU in 17beta2 vs 16.3. When DELETEs take longer to process, the INSERT connection is idle for more time, which explains why the CPU utilization is lower with 17beta2 than with 16.3.
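
      The sketch below shows one way the cpupq, cspq and cpups values can be derived from vmstat samples; the exact normalization used by the benchmark scripts is an assumption here, and what matters is that the same formula is applied to every version being compared.

# Per-operation vmstat metrics (assumed normalization, not the exact formulas
# used by the benchmark scripts).
def vmstat_metrics(samples, operations, seconds):
    # samples: vmstat rows collected once per second, each a dict with us, sy and cs
    avg_us = sum(s["us"] for s in samples) / len(samples)
    avg_sy = sum(s["sy"] for s in samples) / len(samples)
    avg_cs = sum(s["cs"] for s in samples) / len(samples)
    ops_per_sec = operations / seconds
    cpups = avg_us + avg_sy      # CPU utilization: sum of the vmstat us and sy columns
    cpupq = cpups / ops_per_sec  # CPU per operation (cpupq)
    cspq = avg_cs / ops_per_sec  # context switches per operation (cspq)
    return {"cpups": cpups, "cpupq": cpupq, "cspq": cspq}
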
      From the summary the relative throughput per benchmark step is:
      • l.i0
        • relative QPS is 0.97 in PG 17beta1
        • relative QPS is 0.98 in PG 17beta2
      • l.x - I ignore this for now
      • l.i1, l.i2
        • relative QPS is 1.03, 0.95 in PG 17beta1
        • relative QPS is 1.04, 0.93 in PG 17beta2
      • qr100, qr500, qr1000
        • relative QPS is 1.01, 1.03, 1.05 in PG 17beta1
        • relative QPS is 1.01, 1.01, 1.04 in PG 17beta2
      • qp100, qp500, qp1000
        • relative QPS is 1.01, 1.00, 1.00 in PG 17beta1
        • relative QPS is 1.01, 1.01, 1.00 in PG 17beta2

      Results: IO-bound

      Summary:
      • Results here are less positive than the results on a large server I have at home. Perhaps there is more variance in HW perf in the public cloud. For now I will claim this detects no regressions and repeat tests on my home servers.
      • The worst results are from l.i2 and they are related to the CPU overhead from get_actual_variable_range, a problem I hope is fixed soon.
      • From vmstat metrics for l.i1 and l.i2 I see changes in CPU overhead (cpupq is CPU /operation) but not in context switches (cspq is context switches /operation).
      • From ps (output not shared here) I see that the connections doing DELETE statements used ~20% more CPU in 17beta2 vs 16.3. When DELETEs take longer to process, the INSERT connection is idle for more time, which explains why the CPU utilization is lower with 17beta2 than with 16.3.
      From the summary the relative throughput per benchmark step is:
      • l.i0
        • relative QPS is 0.96 in PG 17beta1
        • relative QPS is 0.96 in PG 17beta2
      • l.x - I ignore this for now
      • l.i1, l.i2
        • relative QPS is 1.02, 0.79 in PG 17beta1
        • relative QPS is 0.98, 0.78 in PG 17beta2
      • qr100, qr500, qr1000
        • relative QPS is 1.00, 1.00, 0.98 in PG 17beta1
        • relative QPS is 1.01, 1.00, 0.99 in PG 17beta2
      • qp100, qp500, qp1000
        • relative QPS is 1.00, 0.99, 0.99 in PG 17beta1
        • relative QPS is 1.00, 0.99, 0.89 in PG 17beta2
