Wednesday, January 10, 2024

Updated Insert benchmark: Postgres 9.x to 16.x, small server, cached database, v2

I recently shared results for the updated Insert Benchmark with Postgres versions 9.0 to 16 using a small server and cached database. Here I have results for a slightly larger but still cached database. The reason for using a larger database is to make some of the benchmark steps run longer.

tl;dr

  • Results here are similar to the previous results, although a performance problem during the l.i1 and l.i2 benchmark steps is clearer here. In some benchmark steps the planner can spend too much CPU time trying to determine the min and/or max value of a column by reading from the index.
  • While Postgres performance is mostly getting better from old to new releases, there have been regressions in a few major releases (PG 11 through 13) for benchmark steps where this is an issue.
  • The regressions are likely to be larger for the IO-bound benchmark but that will take a few more days to finish.
The Problem

I shared details about the problem here and as expected a Postgres expert quickly replied with advice pointing me to a few changes that improve the problem.

The problem is the pattern of inserts and deletes. Several of the benchmark steps do inserts in ascending PK order (inserts to the head) while doing deletes at the same rate to keep the number of rows fixed. The deletes are done from the other end of the table (deletes to the tail) by removing batches of rows with the smallest value for the PK.
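The insert/delete pattern can be sketched in a few lines. This is a hypothetical simplification of the benchmark client, not its actual code: inserts append rows with ever-larger PK values while deletes remove batches of rows with the smallest PK, so the row count stays fixed and the live PK range keeps sliding upward.

```python
from collections import deque

def run_step(initial_rows, inserts, batch_size):
    # PKs kept in ascending order; the deque head holds the smallest PK
    table = deque(range(initial_rows))
    next_pk = initial_rows
    while inserts > 0:
        n = min(batch_size, inserts)
        for _ in range(n):        # inserts in ascending PK order
            table.append(next_pk)
            next_pk += 1
        for _ in range(n):        # deletes remove the smallest PKs
            table.popleft()
        inserts -= n
    return table

t = run_step(initial_rows=1000, inserts=500, batch_size=50)
# table size is unchanged; the smallest surviving PK has advanced by 500
assert len(t) == 1000 and t[0] == 500
```

The sliding PK range is what matters for the planner: the smallest live PK value keeps moving past whatever the last ANALYZE recorded.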

The PG planner has code in get_actual_variable_range to determine the min or max value of a column when there is a predicate on that column like X < $const or X > $const and $const falls into the largest or smallest histogram bucket. From PMP thread stacks, what I see is too much time with that function on the call stack. From ps output, the session that does delete statements can use 10X to 100X more CPU than the session that does insert statements. From explain analyze I see that the planner spends ~100 milliseconds per delete statement.
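The triggering condition can be sketched as follows. This is a hedged simplification with hypothetical helper names; the real logic lives in get_actual_variable_range in the Postgres planner. When the predicate's constant falls inside the first or last histogram bucket of the column's statistics, the planner probes the index for the actual min (or max) rather than trusting possibly stale stats, and with the delete-from-the-tail pattern the constant always lands in the smallest bucket.

```python
def needs_index_probe(const, histogram_bounds):
    # histogram_bounds: the sorted bucket boundaries for the column,
    # as in the pg_stats.histogram_bounds view. If const falls in the
    # first or last bucket, the planner reads the index for the actual
    # min/max instead of relying on the (possibly stale) histogram.
    return const <= histogram_bounds[1] or const >= histogram_bounds[-2]

bounds = [0, 100, 200, 300, 400, 500]
assert needs_index_probe(50, bounds)       # first bucket -> probe the index
assert not needs_index_probe(250, bounds)  # interior bucket -> stats suffice
assert needs_index_probe(450, bounds)      # last bucket -> probe the index
```

With inserts at the head and deletes at the tail, both extremes of the PK range move constantly, so delete statements with a predicate on the smallest PK values hit this path on nearly every plan.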

Build + Configuration

See the previous report for more details. I used all of the versions described there: 9.0.23, 9.1.24, 9.2.24, 9.3.25, 9.4.26, 9.5.25, 9.6.24, 10.23, 11.22, 12.17, 13.13, 14.10, 15.5, 16.1. And then I also tested 11.19 and 13.10.

The Benchmark

The benchmark is explained here, except that the first benchmark step, l.i0, loads 30M rows/table here while previously it loaded only 20M. The database still fits in memory as the test server has 16G of RAM and the database tables are ~8G.

The test server was named SER4 in the previous report. It has 8 cores, 16G RAM, Ubuntu 22.04 and XFS using 1 m.2 device.

The benchmark steps are:

  • l.i0
    • insert 30 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts 50M rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions).
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow.
  • qr100
    • use 3 connections/client. One does range queries for 1800 seconds and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for a fixed amount of time. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results

The benchmark report is here.

I start with the summary for the current round with 30M rows loaded and the previous round with 20M rows loaded. Here I focus on the benchmark steps where things are slightly different between the current and previous rounds -- the results for the l.i1 and l.i2 benchmark steps where regressions are more obvious in the current round.

The summary has 3 tables. The first shows absolute throughput for each DBMS version and benchmark step. The second has throughput relative to the version on the first row of the table, which makes it easy to see how performance changes over time. The third shows the background insert rate for benchmark steps that have background inserts; all systems sustained the target rates.

There are big regressions in 11.19 and 11.22, and a small one in 13.13, for the l.i1 and l.i2 benchmark steps; these are visible in the summary.
  • For the l.i1 benchmark step the inserts/s rate drops from ~18k/s in 9.6.24 and 10.23 to ~11k/s in 11.19 and 11.22. It also drops by ~14% from 13.10 to 13.13.
  • The regressions for the l.i2 benchmark step occur in the same versions but are larger. The delete statements in l.i1 remove more rows per statement (50 vs 5), so the fixed planner overhead per statement is amortized over fewer rows in l.i2, making the per-row overhead larger there.
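The amortization argument is simple arithmetic. Taking the ~100 milliseconds of planner time per delete statement observed above, and the 50 vs 5 rows-per-transaction batch sizes of l.i1 and l.i2:

```python
# Back-of-envelope: the same ~100 ms of planner time per delete
# statement is spread over fewer rows in l.i2, so the per-row cost
# is 10x larger there.
planner_ms_per_stmt = 100  # from explain analyze, per the text above

for step, rows_per_stmt in (("l.i1", 50), ("l.i2", 5)):
    per_row_ms = planner_ms_per_stmt / rows_per_stmt
    print(step, per_row_ms)  # l.i1 -> 2.0 ms/row, l.i2 -> 20.0 ms/row
```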
From the iostat and vmstat metrics collected per benchmark step, with both absolute and normalized values (normalized = absolute value divided by the insert rate), I see that the CPU overhead per version (cpupq, CPU usecs per insert) is inversely correlated with the insert rate.

This table shows the value of cpupq (CPU overhead) per version for the l.i1 and l.i2 benchmark steps. All of the numbers for iostat and vmstat are here for l.i1 and for l.i2.

version   l.i1   l.i2
10.23     1253   5157
11.19     1619   8285
11.22     1623   6611
12.17     1263   5815
13.10     1222   3373
13.13     1367   4863
14.10     1126   3449
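Normalizing the cpupq values against the 10.23 baseline makes the size of the regressions explicit. A small computation over the numbers in the table above:

```python
# cpupq (CPU usecs per insert) for the l.i1 and l.i2 steps,
# copied from the table above.
cpupq = {
    "10.23": (1253, 5157),
    "11.19": (1619, 8285),
    "13.10": (1222, 3373),
    "13.13": (1367, 4863),
}

base = cpupq["10.23"]
rel = {v: (round(a / base[0], 2), round(b / base[1], 2))
       for v, (a, b) in cpupq.items()}
# e.g. 11.19 uses ~1.29x the CPU per insert for l.i1 and ~1.61x for l.i2
```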

The table above includes all CPU overhead from everything running on the server (Postgres and the benchmark client). The data below shows the CPU time per session measured by ps near the end of a benchmark step. There is one connection/session that only does delete statements and another that only does insert statements. The output from ps is here. The table below has the CPU seconds per version for both connections, insert and delete. There are big changes in CPU overhead for the delete connection.

version   insert   delete
10.23     587      2587
11.19     574      5196
11.22     497      3851
12.17     573      3137
13.10     532      1278
13.13     548      2403
14.10     532      1317
