Monday, March 27, 2023

Perf regressions in Postgres, a larger server and sysbench

This has results for in-memory sysbench on a c2-standard-60 server in GCP to determine whether there are CPU performance regressions from old Postgres (version 11) to modern Postgres (version 15). This is the second post from my current round of tests with Postgres, the previous post is here.

There are two goals from these benchmarks. The first goal is to determine whether there are CPU regressions (more CPU/query) from old versions to new versions. The second goal is to determine which compiler optimizations I should use when building Postgres from source.

tl;dr

  • The o3_native_lto build has the best performance but the improvement is usually between 1% and 5% which is smaller than the improvements from the small server test. That is a reasonable result because a test here with more concurrency can also suffer from more mutex contention. Regardless, link time optimization seems like a nice improvement.
  • There aren't CPU regressions from version 11 to version 15 for the point and range query microbenchmarks.
  • There are significant improvements for write throughput from version 11 to version 15 as the median improvement for write microbenchmarks is larger than 1.5X. I assume this is from architectural improvements.
Hardware

The c2-standard-60 server has 30 cores (hyperthreading disabled), 240G RAM. The storage is SW RAID 0 across 8 local NVMe devices with XFS. The database is cached by Postgres so there are no reads from storage.

Benchmark

A description of how I run sysbench is here. The sysbench microbenchmarks were run for 20 clients, 600 seconds per microbenchmark using 4 table with 50M rows per table. The test database fits in the Postgres buffer pool.

The conf.diff.cx7_gcp_c2s60 Postgres configuration was used which is here for version 11version 12version 13version 14 and version 15

Builds

I tested Postgres versions 11.19, 12.14, 13.10, 14.7, 15.1 and 15.2 using multiple builds for each version. For each build+version the full set of sysbench microbenchmarks was repeated.

For version 15.1 I tested these builds. The command line for configure, output from configure and output from make is here for the builds I did on my home Beelink servers. This sequence of builds helps to distinguish between the impact from -O3, CPU specific optimizations (-march, -mtune) and link time optimization. Results below show that most of the benefit comes from link time optimization.
  • def - uses -O2
  • o2_nofp - uses -O2 -fno-omit-frame-pointer 
  • o3 - uses -O3
  • o3_native - uses -O3 -march=native -mtune=native
  • o3_native_lto - uses -O3 -march=native -mtune=native -flto, -flto enables link time optimization
For versions 11.19, 12.14, 13.10, 14.7 and 15.2 I tested these builds. The command line for configure, output from configure and output from make is here for 11.1912.1413.1014.7 and 15.2 -- from the builds I did on my home Beelink servers. I have yet to publish the make/configure output for the GCP builds.
  • def - uses -O2
  • o3_native_lto - uses -O3 -march=native -mtune=native -flto, -flto enables link time optimization
Results: per-version

The result spreadsheet is here.

The graphs use relative throughput which is throughput for me / throughput for base case. When the relative throughput is > 1 then my results are better than the base case. When it is 1.10 then my results are ~10% better than the base case. For all versions the base case is the def build that uses -O2.

For all but version 15.1 there is 1 bar per microbenchmark (throughput for o3_native_lto). There are three graphs per Postgres version: one for point queries, one for range queries, one for writes. 

The write microbenchmarks have a lot of variance. I didn't investigate it and will ignore the issue for now to focus on the point and range microbenchmarks.

Disclaimers:
  • Readability is much better via the spreadsheet so I did not make the graphs x-large here. 
  • For most of the graphs the axis with values doesn't start at 0 to improve readability
For version 11.19 the relative throughput median for o3_native_lto is 1.03 for point, 1.05 for range, 1.00 for writes.

For version 12.14 the relative throughput median for o3_native_lto is 1.04 for point, 1.03 for range, 0.95 for writes.
For version 13.10 the relative throughput median for o3_native_lto is 1.03 for point, 1.05 for range, 1.04 for writes.
For version 14.7 the relative throughput median for o3_native_lto is 1.03 for point, 1.05 for range, 0.99 for writes.
For version 15.1 the relative throughput median for o3_native_lto is 1.04 for point, 1.04 for range, 1.04 for writes. The graphs for 15.1 have more data (more builds were tested) so I swapped the x and y axis. The green bar has results for o3_native_lto, which again is the best build.

The results here show that most of the benefit from o3_native_lto comes from lto (link time optimization) and not from o3 (-O3) nor from native (-march=native -mtune=native).
For version 15.2 the relative throughput median for o3_native_lto is 1.04 for point, 1.05 for range, 1.01 for writes.
Results: all versions

These have relative throughput for Postgres versions 12.14, 13.10, 14.7, 15.1 and 15.2. The base case is version 11.19 and all used the o3_native_lto build. For version 15.2 the relative throughput median for o3_native_lto (relative to 11.19) is 1.01 for point, 1.00 for range, 1.65 for writes.
Summary statistics: all versions

Summary statistics of the relative throughput values for versions 12.14, 13.10, 14.7. 15.1 and 15.2 relative to version 11.19 are below. These are grouped by workload type: point, range and writes. All use the results from the o3_native_lto build. 

Summary:
  • My small server tests showed two microbenchmarks (scan, update-one_range=100) for which Postgres version 15 did much better than 11.19. Those results do not reproduce here. But note that the scan test here only uses 4 clients (one per table).
  • The relative throughput for writes is much better here in versions 14.7, 15.1 and 15.2 relative to 11.19. This does not reproduce in the small servers tests.
  • The write benchmarks have a lot of variance and I have yet to explain that. It could be a result of checkpoints and write-heavy tests that only run for 10 minutes.
  • Otherwise results are great in that there aren't regressions in from version 11.19 to version 15 -- so the community has done a great job in avoiding CPU regressions.

11.1912.1413.1014.715.115.2
Point: avg1.001.001.011.011.01
Point: median1.001.001.011.011.01
Point: min0.960.980.970.950.94
Point: max1.031.031.031.031.03
Point: stddev0.0140.0130.0150.0190.021
Range: avg0.980.990.981.011.00
Range: median0.991.010.991.001.00
Range: min0.850.830.870.960.89
Range: max1.021.031.011.131.12
Range: stddev0.0390.0480.0360.0400.049
Write: avg1.111.031.591.671.68
Write: median1.021.051.541.691.65
Write: min0.800.550.770.850.84
Write: max1.631.622.423.153.24
Write: stddev0.2830.3460.5930.7020.718

Searching for performance regressions in Postgres

I used sysbench to search for performance regressions from old Postgres (11.19) to modern Postgres (15.2) and to determine the impact of compiler optimizations because I build it from source. The context for the results is in-memory (cached by Postgres) with low-concurrency (1 & 4 clients) on a small server (8-core AMD).

tl;dr

  • Small improvements are more common than small regressions but on average the CPU/query overhead has not changed from 11.19 to 15.2 which is great news. 
  • Modern Postgres (version 15.x) does slightly better relative to version 11.19 on the 4-thread tests than on the 1-thread tests.
  • Two microbenchmarks have large improvements in modern Postgres relative to version 11.19 -- full table scan and an update-only workload that updates only one row out of 20M
  • The o3_native_lto build has the best performance with 5% to 10% more QPS than the def build. This is a result I have repeated elsewhere -- the o3_native_lto build benefits a little bit from -O3 and CPU specific optimizations, but mostly from link time optimization
Benchmark

A description of how I run sysbench is here. Tests use the Beelink server (8-core AMD, 16G RAM, NVMe SSD). The sysbench tests were run for 600 seconds per microbenchmark using 1 table with 20M rows. The test database fits in the Postgres buffer pool. The benchmark was repeated for 1 and 4 clients.

The conf.diff.cx7 (x7) Postgres configuration was used which is here for version 12, version 13, version 14 and version 15. For Postgres version 11 I used the version 12 config.

Builds

I tested Postgres versions 11.19, 12.14, 13.10, 14.7, 15.1 and 15.2 using multiple builds for each version. For each build+version the full set of sysbench microbenchmarks was repeated.

For version 15.1 I tested these builds. The command line for configure, output from configure and output from make is here. This sequence of builds helps to distinguish between the impact from -O3, CPU specific optimizations (-march, -mtune) and link time optimization. Results below show that most of the benefit comes from link time optimization.
  • def - uses -O2
  • o2_nofp - uses -O2 -fno-omit-frame-pointer 
  • o3 - uses -O3
  • o3_native - uses -O3 -march=native -mtune=native
  • o3_native_lto - uses -O3 -march=native -mtune=native -flto, -flto enables link time optimization
For versions 11.19, 12.14, 13.10, 14.7 and 15.2 I tested these builds. The command line for configure, output from configure and output from make is here for 11.19, 12.14, 13.10, 14.7 and 15.2.
  • def - uses -O2
  • o3_native - uses -O3 -march=native -mtune=native
  • o3_native_lto - uses -O3 -march=native -mtune=native -flto, -flto enables link time optimization
Results: 1 thread, per-version

The result spreadsheet is here for 1-thread and for 4-threads.

The graphs use relative throughput which is throughput for me / throughput for base case. When the relative throughput is > 1 then my results are better than the base case. When it is 1.10 then my results are ~10% better than the base case. For all versions the base case is the def build that uses -O2.

For all but version 15.1 there are 2 bars per microbenchmark and the red bar is the interesting one (o3_native_lto). There are three graphs per Postgres version which group the microbenchmarks by the dominant operation: one for point queries, one for range queries, one for writes. 

For the point query microbenchmarks the hot-points test benefits the most from the o3_native_lto build. For range queries, the o3_native_lto build has the largest benefit for the first four microbenchmarks (range-covered-pk.* & range-covered-si.*).

Disclaimers:
  • Readability is much better via the spreadsheet so I did not make the graphs x-large here. 
  • For most of the graphs the axis with values doesn't start at 0 to improve readability
For version 11.19 the relative throughput median for o3_native_lto is 1.05 for point, 1.10 for range, 1.07 for writes.
For version 12.14 the relative throughput median for o3_native_lto is 1.07 for point, 1.09 for range, 1.06 for writes.
For version 13.10 the relative throughput median for o3_native_lto is 1.06 for point, 1.09 for range, 1.08 for writes.
For version 14.7 the relative throughput median for o3_native_lto is 1.08 for point, 1.10 for range, 1.08 for writes.

For version 15.1 the relative throughput median for o3_native_lto is 1.08 for point, 1.12 for range, 1.09 for writes. The graphs for 15.1 have more data (more builds were tested) so I swapped the x and y axis. The green bar has results for o3_native_lto, which again is the best build.

The results here show that most of the benefit from o3_native_lto comes from lto (link time optimization) and not from o3 (-O3) nor from native (-march=native -mtune=native).
For version 15.2 the relative throughput median for o3_native_lto is 1.08 for point, 1.08 for range, 1.09 for writes.
Results: 1 thread, all versions

These have relative throughput for Postgres versions 12.14, 13.10, 14.7, 15.1 and 15.2. The base case is version 11.19 and all used the o3_native_lto build. There are two outliers for which throughput is much better in modern Postgres -- full table scan (scan_range=100) starting in 14.7 and update the same row (update-one_range=100) starting in 12.14.

For point queries the median relative throughput is 1.01 for version 15.2 versus version 11.19. There are small improvements for many of the point query workloads in modern Postgres .

For range queries the median relative throughput for version 15.2 is 1.00 versus version 11.19. There were improvements to relative throughput in versions 12 and 13 but those are going away in modern Postgres.

For writes the median relative throughput for version 15.2 is 0.99 versions version 11.19.
Summary statistics: 1 thread, all versions

Summary statistics of the relative throughput values for versions 12.14, 13.10, 14.7. 15.1 and 15.2 relative to version 11.19 are below. These are grouped by workload type: point, range and writes. All use the results from the o3_native_lto build.  

Summary:
  • Excluding the outliers there aren't CPU regressions from old (11.19) to modern (15.2) Postgres which is great news.
  • The outliers are two microbenchmarks for which modern Postgres does much better than old Postgres -- scan_range=100 starting in version 14.7 which does full table scan, update-one_range=100 starting in 12.14 which does updates to the same row (out of 20M rows)

11.1912.1413.1014.715.115.2
Point: avg1.011.001.021.021.02
Point: median1.011.001.021.021.01
Point: min0.970.990.970.970.99
Point: max1.071.021.061.071.06
Point: stddev0.0260.0090.0260.0280.023
Range: avg0.991.001.021.031.00
Range: median0.991.001.011.011.00
Range: min0.960.980.940.990.94
Range: max1.011.031.211.241.21
Range: stddev0.0130.0150.0580.0610.061
Write: avg1.041.061.091.121.05
Write: median0.980.991.021.030.99
Write: min0.960.950.841.000.89
Write: max1.541.731.801.841.76
Write: stddev0.1780.2370.2600.2570.254

Summary statistics: 4 threads, all versions

To save space I won't show the graphs for the benchmarks repeated with 4 threads but they were similar to the 1-thread results. I will share the summary statistics for the o3_native_lto builds across all versions (11.19, 12.14, 13.10, 14.7, 15.1. 15.2).

The result spreadsheet is here for 1-thread and for 4-threads.

The average and median relative throughput here is slightly better than for the 1-thread results above.

11.1912.1413.1014.715.115.2
Point: avg1.021.031.031.041.04
Point: median1.021.021.011.031.03
Point: min1.001.010.991.011.01
Point: max1.031.081.091.121.11
Point: stddev0.0110.0200.0300.0300.029
Range: avg1.001.011.021.041.02
Range: median1.001.011.011.031.01
Range: min0.970.980.920.980.96
Range: max1.031.041.191.191.19
Range: stddev0.0170.0160.0630.0480.053
Write: avg0.991.001.011.041.03
Write: median1.001.011.011.021.01
Write: min0.950.950.790.960.96
Write: max1.021.031.141.141.14
Write: stddev0.0210.0280.0950.0630.063