This has results for Postgres vs the insert benchmark on a big server. My previous post had results for a small server. The goals are to look for performance changes from PG version 11 to 15 and to determine the impact of compiler optimizations. The context here is a cached workload (reads from cache, writes to storage) with high concurrency.
tl;dr
- See the Most Important Disclaimer here
- I accept that tuning might be required to improve throughput and/or reduce response time variance, but I am also wary of a DBMS that requires too much per-workload tuning.
- The insert rate for l.i1 drops from (420k/s, 400k/s) when the database is cached by Postgres to (150k/s, 200k/s) when the database is cached by the OS but not by Postgres. The two numbers are for (20 tables, 1 table). I don't think this is a problem, but I document it here for comparisons with other DBMS.
- Per-second response time and insert rates for l.i1 show much variance, aka write stalls. I have yet to explain what causes this. The variance is visible but not as significant for l.i0.
- The o3_native_lto build has the best performance but the benefit isn't as large here as it was on the small server: <= 3% for writes and <= 5% for reads
- I ignore the create index benchmark step (see l.x below)
- For the configuration with a table per client
- For inserts without secondary indexes version 15.2 is ~2% slower than 11.19 (see l.0 below)
- For inserts with secondary indexes version 15.2 is ~1% faster than 11.19 (see l.1 below)
- Reads are ~4% faster in version 12.14 than 11.19 and the perf has not changed since then (see q100.1, q500.1, q1000.1 below)
- For the configuration with all clients sharing one table
- For inserts without secondary indexes version 15.2 is ~4% slower than 11.19 (see l.0 below). This is slightly worse than the case above with a table per client.
- For inserts with secondary indexes perf is the same as above for a table per client
- Read perf is the same as above for a table per client
Benchmarks
The server is a c2-standard-60 from GCP with 30 cores, hyperthreads disabled, 240G RAM and 3T of NVMe (XFS, RAID 0 across 8 devices). Huge pages were enabled for all benchmark runs.
An overview of the insert benchmark is here and here. The insert benchmark was run for a cached database with 20 clients. The read+write steps (q100.1, q500.1, q1000.1) were run for 1800 seconds each.
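The read+write steps pace their background writers at a fixed rate per thread. A minimal sketch of one way to do that pacing, assuming a sleep-until-deadline scheme -- the actual benchmark client may pace writers differently:

```python
import time

# Sketch: pace one writer thread at a target inserts/s rate by sleeping
# until each batch's deadline. The real insert benchmark client may pace
# differently; this only illustrates the idea of a rate-limited writer.
def paced_batches(rate_per_s, batch, duration_s,
                  now=time.monotonic, sleep=time.sleep):
    start = now()
    sent = 0
    while now() - start < duration_s:
        # deadline by which the next batch should start at the target rate
        delay = (start + sent / rate_per_s) - now()
        if delay > 0:
            sleep(delay)
        yield sent  # the caller issues a multi-row insert of `batch` rows here
        sent += batch
```

With rate_per_s=100 and batch=10 a thread wakes roughly every 100ms, which matches the 100 inserts/s/thread pacing of q100.1.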
Benchmarks were repeated for two configurations:
- cached by Postgres - all data fits in the Postgres buffer pool and vm.nr_hugepages was 85000
- cached by OS - all data fits in the OS page cache but not the Postgres buffer pool. The buffer pool size is 4G and the database was ~190G at test end. For huge pages I set vm.nr_hugepages to 4000.
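The huge page counts above can be sanity-checked with a little arithmetic, assuming the default 2MB huge page size on x86: the count must cover shared_buffers plus some extra shared memory. A minimal sketch -- the 5% overhead factor is my assumption, not a value from the benchmark setup:

```python
# Sketch: estimate vm.nr_hugepages for a given shared_buffers size.
# Assumes the default 2MB huge page size on x86; the 5% overhead factor
# (for Postgres shared memory beyond the buffer pool) is a guess.
HUGE_PAGE_MB = 2

def nr_hugepages(shared_buffers_gb, overhead=1.05):
    shared_mb = shared_buffers_gb * 1024 * overhead
    return int(shared_mb / HUGE_PAGE_MB + 1)

# cached by OS: a 4G buffer pool needs ~2151 huge pages, so
# vm.nr_hugepages=4000 leaves headroom
print(nr_hugepages(4))  # -> 2151
# cached by Postgres: vm.nr_hugepages=85000 covers ~166G of shared
# memory, enough for a buffer pool sized for the 240G server
print(85000 * HUGE_PAGE_MB // 1024)  # -> 166
```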
The configuration files are here.
The benchmark is a sequence of steps:
- l.i0 - insert 20 million rows without secondary indexes
- l.x - create 3 secondary indexes. I usually ignore results from this step.
- l.i1 - insert another 20 million rows with the overhead of secondary index maintenance
- q100.1 - do queries as fast as possible with 100 inserts/s/thread done in the background
- q500.1 - do queries as fast as possible with 500 inserts/s/thread done in the background
- q1000.1 - do queries as fast as possible with 1000 inserts/s/thread done in the background
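The load steps issue multi-row insert statements. A rough sketch of how one client might build them; the table name, column list and batch size here are my inventions for illustration, not the actual insert benchmark schema:

```python
# Sketch: build the multi-row INSERT statements a load client issues.
# Table name, columns and batch size are illustrative; the real insert
# benchmark schema and batch size differ.
def make_insert(table, start_id, batch=100):
    rows = ", ".join(
        "({}, 'payload-{}')".format(i, i)
        for i in range(start_id, start_id + batch)
    )
    return "INSERT INTO {} (id, payload) VALUES {}".format(table, rows)

stmt = make_insert("pi1", 0, batch=3)
print(stmt)
# INSERT INTO pi1 (id, payload) VALUES (0, 'payload-0'), (1, 'payload-1'), (2, 'payload-2')
```

For l.i0 the target table has no secondary indexes, so each statement only updates the heap and primary key; for l.i1 the same statements also pay for maintenance of the 3 secondary indexes.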
Builds
For Postgres 15.1 I have results for several builds. A description of the builds is here. For Postgres versions 11.19, 12.14, 13.10, 14.7 and 15.2 I only have results for the o3_native_lto build as that tends to have the best performance.
Reports
Performance summaries generated by shell scripts are below. A short guide to these results is here. There are two types of reports. The version 15.1 builds report compares performance for Postgres 15.1 using different compiler optimizations. The versions 11 to 15 report compares performance for Postgres versions 11.19, 12.14, 13.10, 14.7, 15.1 and 15.2. For each type of report there are two variations:
- 20 clients, 20 tables - used 20 clients & 20 tables with a client per table
- 20 clients, 1 table - used 20 clients & 1 table
Reports for cached by Postgres:
- 20 clients, 20 tables: version 15.1 builds, versions 11 to 15
- 20 clients, 1 table: version 15.1 builds, versions 11 to 15
Reports for cached by OS:
- 20 clients, 20 tables: version 15.1 builds, versions 11 to 15
- 20 clients, 1 table: version 15.1 builds, versions 11 to 15
Response time
Charts with throughput and response time per 1-second intervals are here. First, the disclaimers:
- The graphs are from just one of the 20 clients.
- I risk drawing strong conclusions from small samples.
The charts:
- cached by Postgres
- 20 clients, 20 tables: l.i0, l.i1, q100.1, q500.1, q1000.1
- For l.i0 there are regular drops in the insert rate from X to X/2 per second.
- For l.i1 there is a sawtooth pattern where the insert rate drops to near zero every ~300 seconds.
- For q500.1 there is much noise on the per-second insert rate graphs and the per-second response time graphs have a sawtooth pattern. For this test the background insert rate is 10000/s (20 clients at 500 inserts/s/client).
- The q1000.1 results are similar to q500.1 -- much variance for inserts.
- 20 clients, 1 table: l.i0, l.i1, q100.1, q500.1, q1000.1
- Results are similar to 20 clients, 20 tables above
- cached by OS:
- 20 clients, 20 tables: l.i0, l.i1, q100.1, q500.1, q1000.1
- Results for l.i0 are similar to the results for cached by Postgres
- Results for l.i1 are much worse than for cached by Postgres. While they have a similar sawtooth pattern there is a much larger drop between 1500 and 2000 seconds.
- For q500.1 and q1000.1 the per-second insert rates are similar to cached by Postgres but the per-second query rates have more variance.
Up next are tables with response time details per benchmark step -- both histograms and the max. Notes: ms means milliseconds; by insert I mean a multi-row insert statement; brackets ([]) indicate the range of the max response time over multiple results (1 result per build). It can be risky to draw strong inferences from a small sample (one run per build).
- cached by Postgres
- 20 clients, 20 tables
- l.i0 - the max is [410ms, 1234ms]. Version 11.19 has the largest max.
- l.i1 - the max is [1045ms, 1255ms].
- q100.1 - the max is [27ms, 129ms] for queries and [36ms, 778ms] for inserts.
- q500.1 - the max is [45ms, 285ms] for queries and [230ms, 399ms] for inserts.
- q1000.1 - the max is [58ms, 113ms] for queries and [199ms, 254ms] for inserts.
- 20 clients, 1 table
- l.i0 - the max is [612ms, 1883ms]. Version 14.7 has the largest max.
- l.i1 - the max is [884ms, 2045ms].
- q100.1 - the max is [23ms, 1654ms] for queries and [45ms, 1807ms] for inserts.
- q500.1 - the max is [35ms, 352ms] for queries and [171ms, 473ms] for inserts.
- q1000.1 - the max is [33ms, 54ms] for queries and [163ms, 216ms] for inserts.
- cached by OS
- 20 clients, 20 tables
- l.i0 - the max is [386ms, 682ms]. Results here are better than cached by Postgres. The worst-case stall might be a function of the buffer pool size.
- l.i1 - the max is [722ms, 1055ms].
- q100.1 - the max is [137ms, 372ms] for queries and [38ms, 292ms] for inserts.
- q500.1 - the max is [65ms, 158ms] for queries and [219ms, 318ms] for inserts.
- q1000.1 - the max is [76ms, 134ms] for queries and [205ms, 489ms] for inserts.
- 20 clients, 1 table
- l.i0 - the max is [366ms, 928ms]. Results here are better than cached by Postgres.
- l.i1 - the max is [539ms, 753ms].
- q100.1 - the max is [15ms, 24ms] for queries and [39ms, 318ms] for inserts.
- q500.1 - the max is [25ms, 270ms] for queries and [103ms, 377ms] for inserts.
- q1000.1 - the max is [48ms, 238ms] for queries and [217ms, 344ms] for inserts.