Thursday, June 16, 2022

Insert Benchmark for Postgres 12, 13, 14 and 15

I ran the Insert Benchmark for Postgres versions 12.11, 13.7, 14.3 and 15b1. Reports are provided for cached and IO-bound workloads. The benchmark is run on Intel NUC servers at low concurrency. The goal is to determine how performance changes over time. A description of the Insert Benchmark is here.

tl;dr

  • I am not a Postgres expert
  • regressions are small in most cases
  • the l.i0 benchmark step has regressions that are worth investigating in versions 14 and 15b1
    • these regressions might have been fixed, see the perf report for the patch (15b1p1)
Updates:
  • regressions have been fixed, scroll to the end
  • added links to the configuration files
  • part 2 with throughput vs time graphs is here
  • provided data on the percentage of time in parse, analyze, optimize, execute
  • added command lines 
  • added the index + relation sizes after each test step
  • added links to performance reports when prepared statements were used for queries
  • added links to performance reports with a patch to improve version 15b1
  • added a second set of results for 15b1p1 (15b1 with the patch) for cached and IO-bound. Results are similar to the first set.
  • livestreaming my debugging efforts, see Debugging the regression below
  • added data from pg_stat_wal to the final section
  • added results for 15b2 for cached and IO-bound
The regression

After I shared this blog post a possible cause for the regression was discussed here and the community quickly provided a patch for me to test. The results are excellent as 15b1p1 (15b1 with the patch) almost matches the results for version 13.7 on the cached workload. For the IO-bound workload the patch fixes the regression for all but one of the benchmark steps (l.i1).

The improvement is obvious in the cached workload Summary. The improvement is also there for the IO-bound workload, but less obvious. In the IO-bound workload Summary the throughput for l.i0 (load without secondary indexes) and q100.1 are better than results for version 14.3 and almost match version 13.7. While the throughput for l.i1 (inserts with secondary index maintenance) did not improve. From the IO-bound metrics for l.i1 I see that CPU (cpupq == CPU per statement) has been reducing with each major release, but with the patch (15b1p1) the value increased to match the value for version 13.7. I don't see this effect in the l.i1 metrics with the cached workload.

Summer has arrived and thermal throttling might be an issue for my SSDs which would be an unfortunate source of variance. The next step is to repeat the IO-bound workload at night when it is cooler and then look at CPU profiles for l.i1.

A guide to the performance reports is here. For explaining performance the most interesting metric is cpupq (CPU/query or CPU/insert depending on the benchmark step). From the guide, cpupq includes CPU by the client, DBMS and anything else running on the server. The benchmark client, written in Python, uses a lot of CPU. That can make it harder to see how CPU changes from the DBMS. With MySQL, I can also measure the CPU time used by mysqld. With Postgres I am still figuring out how to provide a DBMS-only measurement of the CPU overheads.

Test setup

Tests are run on Intel NUC servers I have at home. The OS is Ubuntu 20.04, hyperthreading and turbo-boost are disabled. All binaries are compiled using the same gcc/glibc toolchain and all tests used the same server.

The benchmark is repeated for a cached and IO-bound workload. The database fits in memory for the cached workload and is larger than memory for the IO-bound workload. The Postgres configuration files are here for 12.11, 13.7, 14.3 and 15b1.

The Insert Benchmark description has more details. In summary the benchmark has several steps and I focus on the ones where performance changes:

  • l.i0 - load in PK order, the table that has no secondary indexes
  • l.x - create three secondary indexes
  • l.i1 - insert in PK order, the table that has 3 secondary indexes
  • q100.1 - measure range query QPS from one client while another does 100 inserts/s
  • q500.1 - measure range query QPS from one client while another does 500 inserts/s
  • q1000.1 - measure range query QPS from one client while another does 1000 inserts/s
The l.i0 step inserts 20m (500m) rows for the cached (IO-bound) workload. The l.i1 step inserts 20m (10m) rows for the cached (IO-bound) workload. The q100.1, q500.1 and q1000.1 steps each run for 2 hours.

Command lines to try out the insert benchmark:
# old school, no prepared statements
python3 iibench.py --db_user=foo --db_host=127.0.0.1 \
  --db_password=bar --dbms=postgres --db_name=ib --setup \
  --query_threads=1 --max_rows=1000000 --secs_per_report=1

# try out the new support for prepared statements
python3 iibench.py --db_user=foo --db_host=127.0.0.1 \
  --db_password=bar --dbms=postgres --db_name=ib --setup \
  --query_threads=1 --max_rows=1000000 --secs_per_report=1 \
  --use_prepared_insert --use_prepared_query

Cached workload

The reports are here for ibench without and with prepared statements used for queries. Next is a report for prepared statements and a patch for 15b1 (I call it 15b1p1) that fixes a CPU regression. Postgres query throughput improves by ~1.5X when prepared statements are used for queries. Prepared statements are not yet used for inserts because I need to debug CPU perf issues on the ibench client.

I use relative throughput to document performance changes. It is: my_throughput / base_throughput where base_throughput is the result from Postgres 13.7. The interesting changes are below. The results aren't surprising. In most cases the regressions are small because Postgres is good at avoiding them. The regressions for 15b1 are larger because it is a beta and a newer release.

Throughput relative to 13.7
version l.i0    l.i1    q100.1
14.3    0.934   0.971   0.971
15b1    0.856   0.946   0.953

The percentage of samples from pg_plan_queries, pg_parse_query, pg_analyze_and_rewrite and query execution from v14.3. Prepared statements were not used.

step    plan    parse   analyze execute
l.i0    20      11      25      44
l.i1    8       5       10      77
q100.1  33      6       10      51

Cached - l.i0

The largest regressions are from the l.i0 step. While this test doesn't run for a long time, the results are similar for the IO-bound workload. Flamegraphs are here for 12.11, 13.7, 14.3 and 15b1. Differential flamegraphs are here for 13.7 vs 14.3 and 14.3 vs 15b1. For 15b1 I renamed ExecInitExprInternal to ExecInitExpr to match the names used in older versions.

I also have a benchmark report here for Postgres versions 14.0, 14.1, 14.2 and 14.3. There is little change across them so any regressions from 13.7 to 14.3 are likely in 13.7 to 14.0.

Before looking at the differential flamegraphs I compared the output by hand. For functions called by PostgresMain not much changes between versions in pg_plan_queries and pg_parse_queries. There is a small improvement for pg_analyze_and_rewrite and a small regression for PortalRun. The rest of the drill down is here with a focus on functions called under PortalRun. There are regressions in a few functions when ExecScan is on the call stack.

Percentage of samples for children of PostgresMain
12.11   13.7    14.3    15b1    function
-----   -----   -----   -----   --------
21.50   20.27   20.44   19.84   pg_plan_queries
11.56   12.19   11.34   10.34   pg_parse_queries
25.76   25.70   24.92   22.64   pg_analyze_and_rewrite
33.02   33.06   34.94   37.31   PortalRun

Cached - l.i1

Flamegraphs are here for 12.11, 13.7, 14.3 and 15b1. Differential flamegraphs are here for 13.7 vs 14.3 and 14.3 vs 15b1. I didn't do the summary by hand because the differences are small.

Cached - q100.1

Flamegraphs are here for 12.1113.714.3 and 15b1. Differential flamegraphs are here for 13.7 vs 14.3 and 14.3 vs 15b1. I didn't do the summary by hand because the differences are small. ReadyForQuery is much less prominent in 15b1 relative to earlier versions. It isn't clear that the differential graph for 14.3 vs 15b1 is useful. Perhaps too much code has changed.

Table and index sizes

These are the sizes from pg_index_sizes, pg_relation_size and pg_total_relation_size. I suspect that the improvements (size reduction) in recent Postgres versions don't reproduce here because the workload is insert only.

Legend
* all sizes are in MB unless indicated with G (for GB)
* index is value from pg_index_sizes
* rel is value from pg_relation_size
* total is value from pg_total_relation_size

l.i0 - after load in PK order without secondary indexes
version index   rel     total
12.11   428     1531    1959
13.7    428     1531    1959
14.3    428     1531    1959
15b1    428     1531    1959

l.x - after create 3 secondary indexes
version index   rel     total
12.11   2243    1538    3782
13.7    2243    1538    3782
14.3    2243    1538    3782
15b1    2243    1538    3782

l.i1 - after load more data with 3 secondary indexes to maintain
version index   rel     total
12.11   5299    3069    8368
13.7    5308    3069    8378
14.3    5308    3069    8378
15b1    5308    3069    8378

q3.1000 - at the end of the read+write steps
version index   rel     total
12.11   8338    3951    12G
13.7    8333    3950    12G
14.3    8333    3950    12G
15b1    8333    3950    12G

IO-bound workload

The reports are here for ibench without and with prepared statements used for queries. Next is a report for prepared statements and a patch for 15b1 (I call it 15b1p1) that fixes a CPU regression.

Postgres query throughput improves by ~1.5X when prepared statements are used for queries. Prepared statements are not yet used for inserts because I need to debug CPU perf issues on the ibench client. I still need to figure out why more IO isn't done per query for this workload given that it is IO bound and index indexes are much larger than memory.

I don't think the IO-bound report needs additional analysis as the regression for l.i0 is analyzed above. I also have a benchmark report here for Postgres versions 14.0, 14.1, 14.2 and 14.3. There is little change across them so any regressions from 13.7 to 14.3 are likely in 13.7 to 14.0.

Table and index sizes

These are the sizes from pg_index_sizes, pg_relation_size and pg_total_relation_size. I suspect that the improvements (size reduction) in recent Postgres versions don't reproduce here because the workload is insert only.

Legend
* all sizes are in GB
* index is value from pg_index_sizes
* rel is value from pg_relation_size
* total is value from pg_total_relation_size

l.i0 - after load in PK order without secondary indexes
version index   rel     total
12.11   10      37      48
13.7    10      37      48
14.3    10      37      48
15b1    10      37      48

l.x - after create 3 secondary indexes
version index   rel     total
12.11   55      37      92
13.7    55      37      92
14.3    55      37      92
15b1    55      37      92

l.i1 - after load more data with 3 secondary indexes to maintain
version index   rel     total
12.11   55      38      94
13.7    55      38      94
14.3    55      38      94
15b1    55      38      94

q3.1000 - at the end of the read+write steps
version index   rel     total
12.11   57      39      96
13.7    57      39      96
14.3    57      39      96
15b1    57      39      96

Debugging the regression

This section has details from trying to debug the regression while I collaborate with a few experts from the Postgres community. For the regression in the l.i1 benchmark step (load in PK order with 3 secondary indexes) on the IO-bound workload there is:

  • ~7% increase in KB written per insert(wkbpi)
  • ~7% increase in KB read per insert (rkbpq)
  • ~12% increase in CPU per insert (cpupq), but note that CPU here includes everything on the host, including the benchmark client

From pg_stat_bgwriter there are differences between 15b1 and 15b1p1 ...

select * from pg_stat_bgwriter is here for pg15b1p1:

checkpoints_timed     | 32

checkpoints_req       | 6

checkpoint_write_time | 7734090

checkpoint_sync_time  | 6471

buffers_checkpoint    | 12444695

buffers_clean         | 13580345

maxwritten_clean      | 0

buffers_backend       | 7593943

buffers_backend_fsync | 0

buffers_alloc         | 19970817

stats_reset           | 2022-06-23 21:45:38.834884-07

And then for pg15b1:


checkpoints_timed     | 37

checkpoints_req       | 2

checkpoint_write_time | 8283432

checkpoint_sync_time  | 6109

buffers_checkpoint    | 12753126

buffers_clean         | 13702867

maxwritten_clean      | 0

buffers_backend       | 6500038

buffers_backend_fsync | 0

buffers_alloc         | 20002787

stats_reset           | 2022-06-20 16:55:15.35976-07

And finally the graph of throughput vs time for l.i1 shows some kind of stall for 15b1p1 that doesn't occur for 15b1, and this stall is repeatable. It occurs in both runs of the benchmark I did for 15b1p1.

First is the graph for 15b1 which don't have the stall. The x-axis is time, the y-axis is inserts/s and the results are reported by the benchmark client at 1-second intervals.

Next is the graph for 15b1p1 which has a stall.


From a few more tests, this error message doesn't occur with 15b1 but does occur with 15b1p1 and variations of 15b1p1 that exclude some of the patches:
2022-06-27 ... PDT [1004196] ERROR:  canceling autovacuum task

The end

The regression is fixed and 15b1 with the patch now matches the performance from 13.7, so this fixed all of the regressions that arrived in version 14.

The fix for the regression comes in two parts. First, there is the patch that reduces CPU overhead. Second, I increased max_wal_size from 20G to 40G and that resolves some noise that shows up on the l.i1 (inserts with secondary index maintenenace) step. I started with the cx5 config that has max_wal_size set to 20G and then switched to the cx7 config that sets it to 40G.

The performance reports are here for the cached and IO-bound workloads.

The data below is from pg_stat_wal at the end of each benchmark step for versions 14.3, 15b1 and 15b1p1. All had wal_buffers = 16M. I don't know why wal_buffers_full and wal_write increased from 14.3 to 15b1, nor do I know whether that is an issue but this might explain whey I needed to increase max_wal_size from 20G to 40G. Most of that difference occurs during the l.x step (create secondary indexes).

from pg_stat_wal at end of l.i0
                        14.3    15b1    15b1p1
wal_records             1.0B    1.0B    1.0B
wal_fpi                 3.0M    3.0M    2.6M
wal_bytes               109.2B  109.5B  106.4B
wal_buffers_full        0       0       0
wal_write               4.5M    4.6M    4.5M
wal_sync                103K    103K    99K

from pg_stat_wal at end of l.ix
                        14.3    15b1    15b1p1
wal_records             1.0B    1.0B    1.0B
wal_fpi                 8.8M    9.3M    9.1M
wal_bytes               153.0B  157.1B  155.3B
wal_buffers_full        2.3M    3.5M    3.6M
wal_write               6.9M    8.2M    8.1M
wal_sync                107K    106K    102K

from pg_stat_wal at end of l.i1
                        14.3    15b1    15b1p1
wal_records             1.1B    1.1B    1.1B
wal_fpi                 26.3M   26.8M   26.5M
wal_bytes               285.5B  289.7B  287.4B
wal_buffers_full        2.3M    3.5M    3.6M
wal_write               7.1M    8.4M    8.3M
wal_sync                207K    207K    202K

CREATE INDEX in 15b2

Create index is ~3% slower for the IO-bound workload. See the summary and the metrics. I can reproduce this. The config file for these tests has max_parallel_workers = 0 so I assume that parallel index create isn't done.

From the metrics, this doesn't look like a CPU regression although I had to consult the vmstat output to confirm because the metrics page just shows cpupq = 4 for both. The average CPU utilization (us) is lower for 15b2 but the CPU/work is similar because the value of us * Nsamp is similar for 15b1 and 15b2 (Nsamp is the number of vmstat lines processed). Samples were take every second.

Averages from vmstat
cat 500m.pg15b1.cx7/l.x/o.vm.dop1 \
  | grep -v procs | grep -v swpd \
  | awk '{ c+=1; s12+=$12; s13+=$13; s14+=$14; s15+=$15; s16+=$16 } END { printf "%s\t%.1f\t%.1f\t%.1f\t%.1f\t%.1f\n", c, s12/c, s13/c, s14/c, s15/c, s16/c }'

version Nsamp   cs      us      sy      id      wa
15b1    415     769.7   21.3    3.0     73.1    2.6
15b2    426     752.2   20.8    3.0     73.0    3.2

From the metrics this also doesn't look like an IO regression because wkbpi and rkbpq haven't changed. For this benchmark step wkbpi is KB-written / Num-rows and rkbpq is KB-read / Num-rows where KB-written and KB-read are measured by iostat and Num-rows is the number of rows in the indexed table.

But from iostat output the values for r_await and w_await are larger for 15b2. HW (SSD thermal throttling in this case) is one cause, but programmers always love to start by blaming the HW. I will repeat this at night and let 15b2 run first.

cat 500m.pg15b1.cx7/l.x/o.io.dop1 \
  | grep nvme \
  | awk '{ c+=1; s2+=$2; s3+=$3; s4+=$4; s5+=$5; s6+=$6; s7+=$7; s8+=$8; s9+=$9; s10+=$10; s11+=$11; s12+=$12; s13+=$13 } END { printf "%s\t%.1f\t%.1f\t%.3f\t%.3f\t%.3f\t%.1f\t%.1f\t%.1f\t%.3f\t%.3f\t%.3f\t%.1f\n", c, s2/c, s3/c, s4/c, s5/c, s6/c, s7/c, s8/c, s9/c, s10/c, s11/c, s12/c, s13/c }'

version Nsamp   r/s     rkB/s   rrqm/s  %rrqm   r_await rareq-sz w/s    wkB/s   wrqm/s  %wrqm   w_await wareq-sz
15b1    414     666.0   84702.1 0.000   0.000   0.195   113.5   228.6   91061.3 3.820   2.048   6.010   331.7
15b2    425     648.2   82450.5 0.000   0.000   0.224   114.4   218.1   88508.5 3.428   2.017   7.271   342.5

No comments:

Post a Comment