This has results for the Insert Benchmark with Postgres on a large server.
There might be small regressions, and I have more work in progress to explain them:
- for a workload with 1 client and a cached database I see a small increase in CPU/operation (~10%) during the l.i2 benchmark step. I am repeating that benchmark.
- for a workload with 20 clients and an IO-bound database I see a small decrease in QPS (typically 2% to 4%) during read+write benchmark steps.
Builds, configuration and hardware
I compiled Postgres from source using -O2 -fno-omit-frame-pointer for versions 18 beta2, 18 beta1, 17.5 and 17.4.
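For reference, this is roughly the build sequence I use, assuming the autoconf build and a placeholder install prefix:

    ./configure --prefix=$HOME/pg/18beta2 CFLAGS="-O2 -fno-omit-frame-pointer"
    make -j$(nproc)
    make install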
The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4. More details on it are here.
The config file for Postgres 17.4 and 17.5 is here and named conf.diff.cx10a_c32r128.
For 18 beta1 and beta2 I tested 3 configuration files, and they are here:
- conf.diff.cx10b_c32r128 (x10b) - uses io_method=sync
- conf.diff.cx10cw4_c32r128 (x10cw4) - uses io_method=worker with io_workers=4
- conf.diff.cx10d_c32r128 (x10d) - uses io_method=io_uring
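Concretely, a sketch of the lines that differ across these three configs, assuming everything else matches the base config:

    # x10b
    io_method = sync

    # x10cw4
    io_method = worker
    io_workers = 4

    # x10d
    io_method = io_uring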
The Benchmark
The benchmark is explained here and was run for three workloads:
- 1 client, cached
- run with 1 client, 1 table and a cached database
- load 50M rows in step l.i0, do 16M writes in step l.i1 and 4M in l.i2
- 20 clients, cached
- run with 20 clients, 20 tables (table per client) and a cached database
- for each client/table - load 10M rows in step l.i0, do 16M writes in step l.i1 and 4M in l.i2
- 20 clients, IO-bound
- run with 20 clients, 20 tables (table per client) and a database larger than RAM
- for each client/table - load 200M rows in step l.i0, do 4M writes in step l.i1 and 1M in l.i2
- for the qr100, qr500 and qr1000 steps the working set is cached, otherwise it is not
The benchmark steps are:
- l.i0
- insert X million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
- l.x
- create 3 secondary indexes per table. There is one connection per client.
- l.i1
- use 2 connections/client. One inserts Y million rows per table and the other deletes rows at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate (see the sketch after this list).
- l.i2
- like l.i1 but each transaction modifies 5 rows (small transactions) and Z million rows are inserted and deleted per table.
- Wait for N seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of N is a function of the table size.
- qr100
- use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
- qp100
- like qr100 except uses point queries on the PK index
- qr500
- like qr100 but the insert and delete rates are increased from 100/s to 500/s
- qp500
- like qp100 but the insert and delete rates are increased from 100/s to 500/s
- qr1000
- like qr100 but the insert and delete rates are increased from 100/s to 1000/s
- qp1000
- like qp100 but the insert and delete rates are increased from 100/s to 1000/s
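To make the l.i1 and l.i2 write pattern concrete, here is a minimal Python sketch of the insert connection, assuming psycopg2, a hypothetical table named pt1 and a hypothetical row generator. The real benchmark client does more (rate reporting, coordination with the delete connection):

    import psycopg2

    ROWS_PER_TXN = 50           # 50 for l.i1 (big transactions), 5 for l.i2 (small)
    TOTAL_INSERTS = 16_000_000  # Y million rows for l.i1

    def gen_row(i):
        # hypothetical row generator: a PK value plus columns that the
        # 3 secondary indexes created by the l.x step cover
        return (i, i % 1000, i % 100, 'x' * 100)

    def insert_worker(dsn):
        conn = psycopg2.connect(dsn)
        cur = conn.cursor()
        done = 0
        while done < TOTAL_INSERTS:
            for _ in range(ROWS_PER_TXN):
                cur.execute("INSERT INTO pt1 VALUES (%s, %s, %s, %s)", gen_row(done))
                done += 1
            conn.commit()  # one commit per ROWS_PER_TXN rows
        conn.close()

The delete connection runs a similar loop with DELETE statements at the same rate, so the table size stays roughly constant.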
Results: overview
The performance reports are here:
The summary section has 3 tables. The first shows absolute throughput for each DBMS tested and benchmark step. The second has throughput relative to the version in the first row of the table, which makes it easy to see how performance changes over time. The third shows the background insert rate for benchmark steps that have background inserts, which makes it easy to see which DBMS+configs failed to meet the SLA. The summary sections are here:
Below I use relative QPS (rQPS) to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from Postgres 17.4.
When rQPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. When it is 0.90 then I claim there is a 10% regression. The Q in relative QPS measures:
- insert/s for l.i0, l.i1, l.i2
- indexed rows/s for l.x
- range queries/s for qr100, qr500, qr1000
- point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.97, green for >= 1.03 and grey for values between 0.98 and 1.02.
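The rQPS math and the color buckets are simple; a minimal sketch in Python (function names are mine):

    def rqps(qps_me, qps_base):
        # rQPS = QPS for $me / QPS for $base, where $base is Postgres 17.4
        return qps_me / qps_base

    def color(r):
        if r <= 0.97:
            return 'red'    # a regression of 3% or more
        if r >= 1.03:
            return 'green'  # an improvement of 3% or more
        return 'grey'

    # example: QPS drops from 1000 (PG 17.4) to 900 -> rQPS = 0.90, a 10% regression
    assert color(rqps(900, 1000)) == 'red'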
Results: 1 client, cached
Normally I summarize the summary but I don't do that here to save space.
There might be regressions on the l.i2 benchmark step, which does inserts+deletes with smaller transactions (l.i1 does the same with larger transactions). These first arrive with Postgres 17.5 but I will ignore them for now because 17.5 sustained a higher rate on the preceding benchmark step (l.i1) and so might suffer from more vacuum debt during l.i2.
From the response time table, 18 beta2 does better than 18 beta1 based on the 256us and 1ms columns.
From the vmstat and iostat metrics, there is a ~10% increase in CPU/operation starting in Postgres 17.5 -- the value in the cpupq column increases from 596 for PG 17.4 to ~660 starting in 17.5. With one client the l.i2 step finishes in ~500 seconds and that might be too short. I am repeating the benchmark to run that step for 4X longer.
Results: 20 clients, cached
Normally I summarize the summary but I don't do that here to save space. Regardless, this is easy to summarize - there are small improvements (~4%) on the l.i1 and l.i2 benchmark steps and no regressions elsewhere.
Results: 20 clients, IO-bound
Normally I summarize the summary but I don't do that here to save space.
From the summary, Postgres did not sustain the target write rates during qp1000 and qr1000 but I won't claim that it should have been able to -- perhaps I need faster IO. The first table in the summary section uses a grey background to indicate that. Fortunately, all versions sustained a similar write rate. This was also a problem for some versions on the qp500 step.
For the l.i2 step there is an odd outlier with PG 18 beta2 and the cx10cw4_c32r128 config (that uses io_method=worker). I will ignore that for now.
For many of the read+write steps (qp100, qr100, qp500, qr500, qp1000, qr1000) throughput with PG 18 beta1 and beta2 is up to 5% less than with PG 17.4. The regression might be explained by a small increase in CPU/operation.