This post presents performance and efficiency results for Postgres on the Insert Benchmark using release 12.3 and a small server with 4 CPU cores, 16G of RAM and a fast SSD. The goal is to understand the impact of configuration (postgresql.conf) options and I have results for 11 different configurations.
I use scripts to generate the performance reports. The reports are here for the in-memory, less IO-bound and more IO-bound database sizes. My summary of the results is below. Some of the conclusions are specific to this workload and HW so be careful about applying this advice.
I didn't try as many configurations for Postgres as I did for InnoDB and MyRocks. I am not sure whether that is because Postgres offers less tuning opportunity. Note that I didn't try to tune vacuum, which has many options; instead my test scripts did a manual non-blocking vacuum after each test step. While the only writes in this workload are inserts, vacuum is still needed to set bits in the visibility map. I think Postgres could be more clever about setting those bits after inserts.
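The manual non-blocking vacuum between test steps can be done with a plain VACUUM, which does not take the exclusive lock that VACUUM FULL requires and which sets visibility-map bits along the way. A sketch, where the table name pi1 is an assumption for illustration:

```sql
-- Plain (non-blocking) VACUUM: reclaims space and sets visibility-map
-- bits without rewriting the table or blocking reads and writes.
-- VACUUM FULL, by contrast, takes an ACCESS EXCLUSIVE lock.
-- The table name pi1 is a hypothetical example.
VACUUM (VERBOSE) pi1;
```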
- With wal_compression=on, see configs cx2 through cx4b, performance drops for create index (l.x) and load more data (l.i1). The drop was larger than I expected. The compression method is PGLZ and I wonder if lz4 would help. The benefit from WAL compression should be fewer checkpoints and less checkpoint IO. See the summaries for in-memory, less and more IO-bound. For create index (l.x) the CPU overhead (cpupq) doubles from 4 to 8 (or 9) when WAL compression is enabled (cx1 vs cx2) for in-memory, less and more IO-bound, but there isn't a significant reduction in KB written per row (wkbpi). The impact on CPU overhead (cpupq) is even larger for the load more data (l.i1) step for in-memory, less and more IO-bound.
- With a small WAL (max_wal_size is not set for cx2 and is 2G for cx3a), performance suffers during the load more data (l.i1) step. This is expected: a larger WAL means fewer checkpoints and less checkpoint IO, and KB written to storage per insert (wkbpi) is lower with a larger WAL for in-memory, less and more IO-bound. The benefit is smaller for more IO-bound because page writeback there is triggered more by buffer pool pressure than by checkpoint.
- QPS during the read+write tests didn't vary much with different configurations.
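The efficiency metrics referenced above are derived from HW counters collected during each step. A minimal sketch of how I assume cpupq (CPU per operation) and wkbpi (KB written per insert) are computed; the exact formulas in the report scripts may differ:

```python
def cpupq(cpu_usecs: float, operations: float) -> float:
    """CPU microseconds per operation (assumed definition of cpupq)."""
    return cpu_usecs / operations

def wkbpi(kb_written: float, rows_inserted: float) -> float:
    """KB written to storage per row inserted (assumed definition of wkbpi)."""
    return kb_written / rows_inserted

# Illustrative numbers (not from the reports): with the same amount of
# work, doubling total CPU doubles cpupq, as seen with WAL compression.
base = cpupq(80_000_000, 20_000_000)               # 4.0 usecs/op
with_compression = cpupq(160_000_000, 20_000_000)  # 8.0 usecs/op
```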
There is much detail in the reports that I mostly ignore, including response time histograms and the HW efficiency metrics. But they can be used to explain the results and answer questions.
The insert benchmark was run for three database sizes that I call in-memory, less IO-bound and more IO-bound. The database was cached in memory for the in-memory size but not for the others. The initial load size was 20M rows for in-memory, 100M rows for less IO-bound and 500M rows for more IO-bound. Additional rows were inserted after the initial load, but most of the inserts were done by the initial load.
Tests were run with at most 2 threads/connections -- one for inserts and another for queries -- and I consider that to be 1 insert benchmark client.
The benchmark has several steps as described here -- initial load, create 3 secondary indexes, load more data, read+write. Each read+write step is run for 30 minutes with a query thread that runs as fast as possible and a writer that is rate limited to 100, 100, 200, 200, 400, 400, 600, 600, 800, 800, 1000 and 1000 inserts/s. The load more data step inserts 20M, 10M and 10M more rows for the in-memory, less IO-bound and more IO-bound workloads.
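The rate-limited writer can be sketched as a loop that sleeps between inserts to hold a target rate. This is an illustration, not the benchmark's actual client code, and do_insert is a hypothetical callback that performs one insert:

```python
import time

def rate_limited_writer(do_insert, inserts_per_sec: int, duration_sec: float) -> int:
    """Issue inserts at a fixed rate for duration_sec seconds.

    do_insert is a hypothetical callback that performs one insert.
    Returns the number of inserts issued."""
    interval = 1.0 / inserts_per_sec
    deadline = time.monotonic() + duration_sec
    next_at = time.monotonic()
    count = 0
    while time.monotonic() < deadline:
        do_insert()
        count += 1
        # Schedule relative to the ideal next slot so small delays in
        # do_insert don't accumulate into a lower overall rate.
        next_at += interval
        sleep_for = next_at - time.monotonic()
        if sleep_for > 0:
            time.sleep(sleep_for)
    return count

# Example: ~100 inserts/s for 0.1s issues roughly 10 inserts.
n = rate_limited_writer(lambda: None, 100, 0.1)
```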
In the performance reports, the steps are named:
- l.i0 - initial load
- l.x - create 3 secondary indexes
- l.i1 - load more data
- q100.2, q200.2, ..., q1000.2 - read+write where the insert rate limit is 100, 200, ..., 1000 and the .2 means the second run at each rate. The reports focus on the second run.
All tests were done with fsync-on-commit disabled to avoid turning this into an fsync latency benchmark. To save space, I also leave off the conf.diff. prefix when naming config files in the text.
- conf.diff.cx0 - minimal configuration
- conf.diff.cx1 - cx0 + shared_buffers=10G
- conf.diff.cx2 - cx1 + wal_compression=on
- conf.diff.cx3a - cx2 + max_wal_size=2G
- conf.diff.cx3b - cx2 + max_wal_size=10G
- conf.diff.cx3c - cx2 + max_wal_size=20G
- conf.diff.cx3d - cx2 + max_wal_size=40G
- conf.diff.cx4a - cx3c + bgwriter_lru_maxpages=2000, bgwriter_delay=100ms
- conf.diff.cx4b - cx3c + bgwriter_lru_maxpages=1000, bgwriter_delay=50ms
- conf.diff.cx5 - cx4a + wal_compression=off
- conf.diff.cx6 - cx4a + shared_buffers=2G
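Expanding the increments above, a sketch of what conf.diff.cx3c would contain beyond the minimal configuration. The exact file contents are an assumption; note that Postgres memory units are spelled GB in postgresql.conf:

```
# cx1 increment
shared_buffers = 10GB
# cx2 increment (PGLZ in release 12; wal_compression=lz4 arrived in later releases)
wal_compression = on
# cx3c increment: larger WAL -> fewer checkpoints, less checkpoint IO
max_wal_size = 20GB
# fsync-on-commit disabled for all tests -- likely via this setting (assumption)
synchronous_commit = off
```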