I ran the Insert Benchmark for Postgres versions 12.11, 13.7, 14.3 and 15b1. Reports are provided for cached and IO-bound workloads. The benchmark is run on Intel NUC servers at low concurrency. The goal is to determine how performance changes over time. A description of the Insert Benchmark is here.
tl;dr
- I am not a Postgres expert
- regressions are small in most cases
- the l.i0 benchmark step has regressions that are worth investigating in versions 14 and 15b1
- these regressions might have been fixed, see the perf report for the patch (15b1p1)
Updates:
- added links to the configuration files
- part 2 with throughput vs time graphs is here
- provided data on the percentage of time in parse, analyze, optimize, execute
- added command lines
- added the index + relation sizes after each test step
- added links to performance reports when prepared statements were used for queries
- added links to performance reports with a patch to improve version 15b1
- added a second set of results for 15b1p1 (15b1 with the patch) for cached and IO-bound. Results are similar to the first set.
- livestreaming my debugging efforts, see Debugging the regression below
The regression
After I shared this blog post a possible cause for the regression was discussed
here and the community quickly provided
a patch for me to test. The results are excellent as 15b1p1 (15b1 with the patch) almost matches the results for version 13.7 on the cached workload. For the IO-bound workload the patch fixes the regression for all but one of the benchmark steps (l.i1).
The improvement is obvious in the cached workload
Summary. The improvement is also there for the IO-bound workload, but less obvious. In the IO-bound workload
Summary the throughput for l.i0 (load without secondary indexes) and q100.1 are better than results for version 14.3 and almost match version 13.7. While the throughput for l.i1 (inserts with secondary index maintenance) did not improve. From the
IO-bound metrics for l.i1 I see that CPU (cpupq == CPU per statement) has been reducing with each major release, but with the patch (15b1p1) the value increased to match the value for version 13.7. I don't see this effect in the
l.i1 metrics with the cached workload.
Summer has arrived and thermal throttling might be an issue for my SSDs which would be an unfortunate source of variance. The next step is to repeat the IO-bound workload at night when it is cooler and then look at CPU profiles for l.i1.
A guide to the performance reports
is here. For explaining performance the most interesting metric is cpupq (CPU/query or CPU/insert depending on the benchmark step). From the guide, cpupq
includes CPU by the client, DBMS and anything else running on the server. The benchmark client, written in Python, uses a lot of CPU. That can make it harder to see how CPU changes from the DBMS. With MySQL, I can also measure the CPU time used by mysqld. With Postgres I am still figuring out how to provide a DBMS-only measurement of the CPU overheads.
Test setup
Tests are run on Intel NUC servers I have at home. The OS is Ubuntu 20.04, hyperthreading and turbo-boost are disabled. All binaries are compiled using the same gcc/glibc toolchain and all tests used the same server.
The benchmark is repeated for a cached and IO-bound workload. The database fits in memory for the cached workload and is larger than memory for the IO-bound workload. The Postgres configuration files are here for 12.11, 13.7, 14.3 and 15b1.
The Insert Benchmark description has more details. In summary the benchmark has several steps and I focus on the ones where performance changes:
- l.i0 - load in PK order, the table that has no secondary indexes
- l.x - create three secondary indexes
- l.i1 - insert in PK order, the table that has 3 secondary indexes
- q100.1 - measure range query QPS from one client while another does 100 inserts/s
- q500.1 - measure range query QPS from one client while another does 500 inserts/s
- q1000.1 - measure range query QPS from one client while another does 1000 inserts/s
The l.i0 step inserts 20m (500m) rows for the cached (IO-bound) workload. The l.i1 step inserts 20m (10m) rows for the cached (IO-bound) workload. The q100.1, q500.1 and q1000.1 steps each run for 2 hours.
# old school, no prepared statementspython3 iibench.py --db_user=foo --db_host=127.0.0.1 \
--db_password=bar --dbms=postgres --db_name=ib --setup \
--query_threads=1 --max_rows=1000000 --secs_per_report=1
# try out the new support for prepared statements
python3 iibench.py --db_user=foo --db_host=127.0.0.1 \
--db_password=bar --dbms=postgres --db_name=ib --setup \
--query_threads=1 --max_rows=1000000 --secs_per_report=1 \
--use_prepared_insert --use_prepared_query
Cached workload
The reports are here for ibench without and with prepared statements used for queries. Next is a report for prepared statements and a patch for 15b1 (I call it 15b1p1) that fixes a CPU regression. Postgres query throughput improves by ~1.5X when prepared statements are used for queries. Prepared statements are not yet used for inserts because I need to debug CPU perf issues on the ibench client.
I use relative throughput to document performance changes. It is: my_throughput / base_throughput where base_throughput is the result from Postgres 13.7. The interesting changes are below. The results aren't surprising. In most cases the regressions are small because Postgres is good at avoiding them. The regressions for 15b1 are larger because it is a beta and a newer release.
Throughput relative to 13.7
version l.i0 l.i1 q100.1
14.3 0.934 0.971 0.971
15b1 0.856 0.946 0.953
The percentage of samples from pg_plan_queries, pg_parse_query, pg_analyze_and_rewrite and query execution from v14.3. Prepared statements were not used.
step plan parse analyze execute
l.i0 20 11 25 44
l.i1 8 5 10 77
q100.1 33 6 10 51
Cached - l.i0
The largest regressions are from the l.i0 step. While this test doesn't run for a long time, the results are similar for the IO-bound workload. Flamegraphs are here for 12.11, 13.7, 14.3 and 15b1. Differential flamegraphs are here for 13.7 vs 14.3 and 14.3 vs 15b1. For 15b1 I renamed ExecInitExprInternal to ExecInitExpr to match the names used in older versions.
I also have a benchmark report here for Postgres versions 14.0, 14.1, 14.2 and 14.3. There is little change across them so any regressions from 13.7 to 14.3 are likely in 13.7 to 14.0.
Before looking at the differential flamegraphs I compared the output by hand. For functions called by PostgresMain not much changes between versions in pg_plan_queries and pg_parse_queries. There is a small improvement for pg_analyze_and_rewrite and a small regression for PortalRun. The rest of the drill down is here with a focus on functions called under PortalRun. There are regressions in a few functions when ExecScan is on the call stack.
Percentage of samples for children of PostgresMain
12.11 13.7 14.3 15b1 function
----- ----- ----- ----- --------
21.50 20.27 20.44 19.84 pg_plan_queries
11.56 12.19 11.34 10.34 pg_parse_queries
25.76 25.70 24.92 22.64 pg_analyze_and_rewrite
33.02 33.06 34.94 37.31 PortalRun
Cached - l.i1
Flamegraphs are here for 12.11, 13.7, 14.3 and 15b1. Differential flamegraphs are here for 13.7 vs 14.3 and 14.3 vs 15b1. I didn't do the summary by hand because the differences are small.
Cached - q100.1
Flamegraphs are here for 12.11, 13.7, 14.3 and 15b1. Differential flamegraphs are here for 13.7 vs 14.3 and 14.3 vs 15b1. I didn't do the summary by hand because the differences are small. ReadyForQuery is much less prominent in 15b1 relative to earlier versions. It isn't clear that the differential graph for 14.3 vs 15b1 is useful. Perhaps too much code has changed.
Table and index sizes
These are the sizes from pg_index_sizes, pg_relation_size and pg_total_relation_size. I suspect that the improvements (size reduction) in recent Postgres versions don't reproduce here because the workload is insert only.
Legend
* all sizes are in MB unless indicated with G (for GB)
* index is value from pg_index_sizes
* rel is value from pg_relation_size
* total is value from pg_total_relation_size
l.i0 - after load in PK order without secondary indexes
version index rel total
12.11 428 1531 1959
13.7 428 1531 1959
14.3 428 1531 1959
15b1 428 1531 1959
l.x - after create 3 secondary indexes
version index rel total
12.11 2243 1538 3782
13.7 2243 1538 3782
14.3 2243 1538 3782
15b1 2243 1538 3782
l.i1 - after load more data with 3 secondary indexes to maintain
version index rel total
12.11 5299 3069 8368
13.7 5308 3069 8378
14.3 5308 3069 8378
15b1 5308 3069 8378
q3.1000 - at the end of the read+write steps
version index rel total
12.11 8338 3951 12G
13.7 8333 3950 12G
14.3 8333 3950 12G
15b1 8333 3950 12G
IO-bound workload
The reports are here for ibench without and with prepared statements used for queries. Next is a report for prepared statements and a patch for 15b1 (I call it 15b1p1) that fixes a CPU regression.
Postgres query throughput improves by ~1.5X when prepared statements are used for queries. Prepared statements are not yet used for inserts because I need to debug CPU perf issues on the ibench client. I still need to figure out why more IO isn't done per query for this workload given that it is IO bound and index indexes are much larger than memory.
I don't think the IO-bound report needs additional analysis as the regression for l.i0 is analyzed above. I also have a benchmark report here for Postgres versions 14.0, 14.1, 14.2 and 14.3. There is little change across them so any regressions from 13.7 to 14.3 are likely in 13.7 to 14.0.
Table and index sizes
These are the sizes from pg_index_sizes, pg_relation_size and pg_total_relation_size. I suspect that the improvements (size reduction) in recent Postgres versions don't reproduce here because the workload is insert only.
Legend
* all sizes are in GB
* index is value from pg_index_sizes
* rel is value from pg_relation_size
* total is value from pg_total_relation_size
l.i0 - after load in PK order without secondary indexes
version index rel total
12.11 10 37 48
13.7 10 37 48
14.3 10 37 48
15b1 10 37 48
l.x - after create 3 secondary indexes
version index rel total
12.11 55 37 92
13.7 55 37 92
14.3 55 37 92
15b1 55 37 92
l.i1 - after load more data with 3 secondary indexes to maintain
version index rel total
12.11 55 38 94
13.7 55 38 94
14.3 55 38 94
15b1 55 38 94
q3.1000 - at the end of the read+write steps
version index rel total
12.11 57 39 96
13.7 57 39 96
14.3 57 39 96
15b1 57 39 96
Debugging the regression
This section has details from trying to debug the regression while I collaborate with a few experts from the Postgres community. For the regression in the l.i1 benchmark step (load in PK order with 3 secondary indexes) on the IO-bound workload there is:
- ~7% increase in KB written per insert(wkbpi)
- ~7% increase in KB read per insert (rkbpq)
- ~12% increase in CPU per insert (cpupq), but note that CPU here includes everything on the host, including the benchmark client
From pg_stat_bgwriter there are differences between 15b1 and 15b1p1 ...
select * from pg_stat_bgwriter is here for pg15b1p1:
checkpoints_timed | 32
checkpoints_req | 6
checkpoint_write_time | 7734090
checkpoint_sync_time | 6471
buffers_checkpoint | 12444695
buffers_clean | 13580345
maxwritten_clean | 0
buffers_backend | 7593943
buffers_backend_fsync | 0
buffers_alloc | 19970817
stats_reset | 2022-06-23 21:45:38.834884-07
And then for pg15b1:
checkpoints_timed | 37
checkpoints_req | 2
checkpoint_write_time | 8283432
checkpoint_sync_time | 6109
buffers_checkpoint | 12753126
buffers_clean | 13702867
maxwritten_clean | 0
buffers_backend | 6500038
buffers_backend_fsync | 0
buffers_alloc | 20002787
stats_reset | 2022-06-20 16:55:15.35976-07
And finally the graph of throughput vs time for l.i1 shows some kind of stall for 15b1p1 that doesn't occur for 15b1, and this stall is repeatable. It occurs in both runs of the benchmark I did for 15b1p1.
First is the graph for 15b1 which don't have the stall. The x-axis is time, the y-axis is inserts/s and the results are reported by the benchmark client at 1-second intervals.
Next is the graph for 15b1p1 which has a stall.