Saturday, January 27, 2024

Updated Insert benchmark: Postgres 9.x to 16.x, small server, IO-bound database

This has results for Postgres vs the Insert Benchmark on a small server with an IO-bound workload. I include results for the latest point release from all major versions from 9.0 to 16.

tl;dr
  • While there are no regressions in the CPU-bound (cached) workload there are regressions here
  • There are two changes related to get_actual_variable_range and get_actual_variable_endpoint (a previous post also explained this). Note some parts of this workload are not typical and regressions I find here aren't relevant to many other workloads.
    • Starting in Postgres 12 the throughput for l.i1 and l.i2 improves by ~2X because the CPU overhead from the query planner during DELETE statements has been reduced.
    • Starting in Postgres 14 the throughput for range queries decreases by ~30% because the CPU overhead for range queries and DELETE statements has grown. I am still debugging this.
  • Most versions were unable to sustain the target write rates (1000 inserts/s and 1000 delete/s) during the qr1000 and qp1000 benchmark steps.  Only Postgres 12.17 and 13.13 were able to sustain it, most others were far from the target and the worst were Postgres 14.10, 15.5 and 16.1.
  • Something changed for the worse in Postgres 14 that increases CPU overhead for queries and DELETE statements in this workload.
Comparing throughput in Postgres 16.1 to 9.0.23
  • Write-heavy - Postgres 16.1 is between 1.2X and 2.3X faster than 9.0.23
  • Range queries - Postgres 16.1 is up to ~20% slower than 9.0.23
  • Point queries - Postgres 16.1 is similar to 9.0.23

    Build + Configuration

    See the previous report for more details. I tested these versions: 9.0.23, 9.1.24, 9.2.24, 9.3.25, 9.4.26, 9.5.25, 9.6.24, 10.23, 11.22, 12.17, 13.13, 14.10, 15.5, 16.1. 

    The configuration files are in subdirectories from here. Search for files named conf.diff.cx9a2_bee which exist for each major version of Postgres.

    The Benchmark

    The test server is a Beelink SER4 with 8 AMD cores, 16G RAM, Ubuntu 22.04 and XFS using 1 m.2 device.

    The benchmark steps are:

    • l.i0
      • insert 800 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
    • l.x
      • create 3 secondary indexes per table. There is one connection per client.
    • l.i1
      • use 2 connections/client. One inserts 4M rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
    • l.i2
      • like l.i1 but each transaction modifies 5 rows (small transactions) and inserts 1M rows total
      • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow.
    • qr100
      • use 3 connections/client. One does range queries for 1800 seconds and performance is reported for this. The second does does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for a fixed amount of time. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
    • qp100
      • like qr100 except uses point queries on the PK index
    • qr500
      • like qr100 but the insert and delete rates are increased from 100/s to 500/s
    • qp500
      • like qp100 but the insert and delete rates are increased from 100/s to 500/s
    • qr1000
      • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
    • qp1000
      • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
    Results

    The performance report is here.

    The summary has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version on the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time.

    Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
    • insert/s for l.i0, l.i1, l.i2
    • indexed rows/s for l.x
    • range queries/s for qr100, qr500, qr1000
    • point queries/s for qp100, qp500, qp1000
    Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

    • The base case is pg9023_def which means Postgres 9.0.23
    • For the read-heavy benchmark steps that do range queries (qr100, qr500, qr1000) throughput improved between Postgres 9.2 and 13 and then it drops by ~30% in Postgres 14.10 and I confirmed the drop is also in Postgres 14.0. I will start to explain this in another post.
    • For the read-heavy benchmark steps that do point queries (qp100, qp500, qp1000) thoughput is mostly unchanged from 9.0.23 through 16.1.
    • For the write-heavy steps (l.i0, l.x, l.i1, l.i2) throughput improves a lot
      • l.i0 - things get a lot better in Postgres 11.22
      • l.x - things get a lot better between Postgres 9.4.26 and 11.22
      • l.i1, l.i2 - things get a lot better in Postgres 12.17 likely because the query planner overhead during DELETE statements has been reduced (see the comments about get_actual_variable_range)
    • Comparing throughput in Postgres 16.1 to 9.0.23
      • Write-heavy -- Postgres 16 is faster
        • l.i0, l.x, l.i1, l.i2 - relative QPS is 1.22, 2.321.831.87
      • Range queries -- Postgres 16 is mostly slower
        • qr100, qr500, qr1000 - relative QPS is 0.810.891.01
      • Point queries -- Postgres 16 is slightly slower
        • qp100, qp500, qp1000 - relative QPS is 0.980.961.00
    Target write rates

    The third table in the summary shows the write rates sustained during the read-write benchmark steps. The target write rates are 100/s for qr100 and qp100, 500/s for qr500 and qp500 and then 1000/s for qr1000 and qp1000. Note that X/s means X inserts/s and X delete/s. When the value is close enough to the target then I assume the target has been sustained. The table cells in red indicate the cases where the target has not been sustained.
    • For qr100, qp100, qr500, qp500 -- all versions sustained the targets
    • For qr1000, qp1000 - only Postgres 12.17 and 13.13 sustained the targets.
    One session is used for INSERT statements and another for DELETE statements. They run at the same rate so if one session runs slow, then both will be slow. I assume the problem here is that DELETE processing is slow and this is related to changes in get_actual_variable_range.

    The following table show the number of CPU seconds consumed per connection during the qp1000 benchmark step. There is:
    • a big increase in CPU starting in 12.17 for the query connection
    • a big decrease in CPU starting in 12.17 for the delete connection
    • a big increase in CPU starting in 14.10 for the query and delete connection
    CPU seconds per connection during qp1000
    * query = connection that does point queries
    * ins = connection that does inserts
    * del = connection that does deletes

            query   ins     del
    11.22   626     157     3657
    12.17   311     144     1671
    13.13   312     145     1758
    14.10   595     158     3596
    15.5    609     156     3714
    16.1    612     158     3716




    2 comments:

    1. I like the grouped bar-chart graph presentation. Traditionally I've seen analytical benchmarks like TPC-H presented in this format, it's like you borrowed the idea to use it for more OLTP-oriented testing. It's a really clear way to present testing like this.

      BTW, for some reason I can't comment using my google account... signon doesn't seem to work, likely it's a blogspot problem you can fix though

      ReplyDelete
      Replies
      1. Thank you. I continue to iterate on presentation. There are competing demands -- I want it to be readable, but given that I publish so many of these I also need to be able to generate these quickly.

        Delete

    RocksDB on a big server: LRU vs hyperclock, v2

    This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...