Sunday, January 28, 2024

Explaining a performance regression in Postgres 14

I am trying to explain a performance regression that arrived in Postgres 14 and shows up in the Insert Benchmark.

The primary problem appears to be more CPU used by the query planner for DELETE statements when the predicates in the WHERE clause have constants that fall into either the max or min histogram bucket for a given column. An example is a DELETE statement like the following, where transactionid is the primary key (so there is an index on it).

delete from t1 where (transactionid>=100 and transactionid<110)

The table is used like a queue -- inserts are done in increasing order with respect to transactionid, and for every N rows inserted, N rows are deleted to keep the size of the table constant. The rows to be deleted are the N rows with the smallest value for transactionid.
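
To make the pattern concrete, here is a minimal sketch of what the two write connections do. The single-row statements and the values are hypothetical -- the benchmark batches its writes -- the column names come from the DDL shown later in this post, and this assumes transactionid is filled from the sequence behind the PK.

-- inserts append to the high end of the PK
insert into t1 (dateandtime, cashregisterid, customerid, productid, price, data)
values (now(), 1, 1, 1, 100, 'x');

-- deletes trim the low end of the PK so the table size stays roughly constant
delete from t1 where (transactionid>=100 and transactionid<110);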

The problem is worse for IO-bound workloads (see here) than for cached workloads (see here) probably because the extra work done by the query planner involves accessing the index and possibly reading data from storage.

It is always possible I am doing something wrong but I suspect there is a fixable performance regression in Postgres 14 for this workload. The workload is explained here and note that vacuum (analyze) is done between the write-heavy and read-heavy benchmark steps.
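
For reference, that vacuum step is presumably just a plain vacuum with analyze per table, along these lines:

vacuum (analyze) t1;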

There are three issues:

  1. There is only one good index for the DELETE statement, yet the query planner does (too much) work to figure out the selectivity for that index.
  2. When the constants used in WHERE clause predicates fall into the largest or smallest histogram bucket for a column, then the query planner reads from the index to figure out the real min or max value in the index. The code for this is in the function get_actual_variable_range.
  3. Extra work is done while reading from the index because there are many index entries that are eligible to be removed by vacuum but have not yet been removed. So the index scan encounters and then skips them until it reaches a visible entry.

Issue #3 is made worse by the workload. The table is used like a queue: inserts get ascending values for the PK column from a sequence, while deletes are done from the other end of the table -- each DELETE statement removes the N rows with the smallest value for the PK column. Similar problems can occur with InnoDB and MyRocks -- I know from experience.
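
Both issue #2 and issue #3 can be observed from SQL. This is a hedged sketch -- the catalog views are standard Postgres, and the table and column names follow the example above:

-- issue #2: when a WHERE-clause constant lands in the first (or last) histogram bucket
-- for transactionid, the planner probes the index for the actual min (or max) via
-- get_actual_variable_range
select histogram_bounds
from pg_stats
where tablename = 't1' and attname = 'transactionid';

-- issue #3: entries for rows that are deleted but not yet removed by vacuum, which
-- that probe has to skip over
select n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
from pg_stat_user_tables
where relname = 't1';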

I suspect the solution in this case is to not try as hard to figure out selectivity when there is only one good index (fix issue #1), although it might also help to do something about issue #2.

Request 1

Can the query planner do less work when there is only one index that should be used? The full DDL for the table is here.

An abbreviated version of the DDL is below and the PK is on transactionid which uses a sequence.

     Column     |            Type             |
----------------+-----------------------------+
 transactionid  | bigint                      |
 dateandtime    | timestamp without time zone |
 cashregisterid | integer                     |
 customerid     | integer                     |
 productid      | integer                     |
 price          | integer                     |
 data           | character varying(4000)     |
Indexes:
    "pi1_pkey" PRIMARY KEY, btree (transactionid)
    "pi1_marketsegment" btree (productid, customerid, price)
    "pi1_pdc" btree (price, dateandtime, customerid)
    "pi1_registersegment" btree (cashregisterid, customerid, price)
Access method: heap
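
A hedged reconstruction of roughly equivalent DDL is below. The types, names and indexes come from the output above (the DELETE examples refer to the table as t1); how the sequence feeds transactionid is my assumption, and the real DDL is linked above.

create table pi1 (
    transactionid  bigint not null,           -- values come from a sequence
    dateandtime    timestamp without time zone,
    cashregisterid integer,
    customerid     integer,
    productid      integer,
    price          integer,
    data           varchar(4000),
    primary key (transactionid)               -- the pi1_pkey btree
);
create index pi1_marketsegment   on pi1 (productid, customerid, price);
create index pi1_pdc             on pi1 (price, dateandtime, customerid);
create index pi1_registersegment on pi1 (cashregisterid, customerid, price);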

For a DELETE statement like the following, the only efficient index is pi1_pkey. So I prefer that the query planner do less work to figure that out.

delete from t1 where (transactionid>=100 and transactionid<110)
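
One way to confirm which index the planner picks, and to see how much of the cost is planning rather than execution, is EXPLAIN. This is a sketch, not output from the benchmark: EXPLAIN ANALYZE executes the DELETE, so it is wrapped in a transaction that rolls back, and the planning-phase buffer counts ("Planning: Buffers") only appear in Postgres 13 and later.

begin;
explain (analyze, buffers)
delete from t1 where (transactionid>=100 and transactionid<110);
-- the plan should show an index scan on pi1_pkey; "Planning Time" and, on Postgres 13+,
-- "Planning: Buffers" show the planner-side cost, including index pages read by
-- get_actual_variable_range
rollback;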

CPU overhead

When I run the Insert Benchmark there are 6 read-write benchmark steps -- 3 that do range queries as fast as possible and 3 that do point queries as fast as possible. For all of them there are also inserts and deletes done concurrent with the queries, and they are rate limited -- first at 100 inserts/s and 100 deletes/s, then at 500 inserts/s and 500 deletes/s, and finally at 1000 inserts/s and 1000 deletes/s. So the work for writes (inserts & deletes) is fixed per benchmark step while the work done by queries is not. Also, for each benchmark step there are three connections -- one for queries, one for inserts and one for deletes.

Using separate connections makes it easier to spot changes in CPU overhead. Below I show the number of CPU seconds used during the range query benchmark steps (qr100, qr500, qr1000), where the number in the step name indicates the write (insert & delete) rate. Results are provided for Postgres 13.13 and 14.10 from the benchmark I described here (small server, IO-bound).

From below I see two problems. First, the CPU overhead for the delete connection is much larger with Postgres 14.10 for all benchmark steps (qr100, qr500, qr1000). Second, the CPU overhead for the query connection is much larger with Postgres 14.10 for qr1000, the benchmark step with the largest write rate.

Legend
* ins = connection that does inserts
* del = connection that does deletes
* query = connection that does range queries

CPU seconds with 100 inserts/s, 100 deletes/s -> qr100
        ins     del     query
13.13   5       14      1121
14.10   15      187     1148

CPU seconds with 500 inserts/s, 500 deletes/s -> qr500
        ins     del     query
13.13   71      71      1128
14.10   73      1050    1144

CPU seconds with 1000 inserts/s, 1000 deletes/s -> qr1000
        ins     del     query
13.13   135     1113    1129
14.10   151     2912    1906

Debugging after the fact: CPU profiling

I repeated the benchmark for Postgres 13.13 and 14.10 and, after it finished, reran the qr100 benchmark step a few times for each version. The things that I measure here don't match exactly what happens during the benchmark because the database might be in a better state with respect to writeback and vacuum.

While this is far from scientific, I used explain analyze on a few DELETE statements some time after they were used. The results are here. I repeated the statement twice for each Postgres release and the planning time for the first explain is 49.985ms for Postgres 13.13 vs 100.660ms for Postgres 14.10.

So I assume the problem is the CPU overhead from the planner and not from executing the statement.

Then I looked at the CPU seconds used by the connection that does deletes after running for 10 minutes: ~50s for Postgres 13.13 vs ~71s for 14.10. The difference at this point is large, but much smaller than what I report above, which means the problem might be harder to see via CPU profiling. Also, if the problem is IO latency rather than CPU overhead then CPU profiling won't be as useful.

This gist has the top-5 call stacks from hierarchical profiling with perf for the connection that does deletes. While there isn't an obvious difference between Postgres 13.13 and 14.10 there is something I don't like -- all stacks are from the query planner and include the function get_actual_variable_range.

IO profiling

It looks like the query planner does more read IO for delete statements in Postgres 14.10 than in 13.13.

From the full benchmark I see the following for the range query benchmark steps: there is more read IO (see the rps column) with Postgres 14.10 for the qr100 and qr500 benchmark steps but not for the qr1000 benchmark step. And in all cases the range query rate (see the qps column) is significantly lower with Postgres 14.10.

Legend:
* qps = range queries/s
* rps = read IO requests/s per iostat

        qr100
version qps     rps
13.13   8338.2  166.5
14.10   5822.6  183.5

        qr500
version qps     rps
13.13   8101.7  615.6
14.10   5917.9  885.6

        qr1000
version qps     rps
13.13   7090.1  1682.9
14.10   5139.0  1036.2


Saturday, January 27, 2024

Updated Insert benchmark: Postgres 9.x to 16.x, small server, IO-bound database

This has results for Postgres vs the Insert Benchmark on a small server with an IO-bound workload. I include results for the latest point release from all major versions from 9.0 to 16.

tl;dr
  • While there are no regressions in the CPU-bound (cached) workload, there are regressions here
  • There are two changes related to get_actual_variable_range and get_actual_variable_endpoint (a previous post also explained this). Note that some parts of this workload are not typical, so the regressions I find here aren't relevant to many other workloads.
    • Starting in Postgres 12 the throughput for l.i1 and l.i2 improves by ~2X because the CPU overhead from the query planner during DELETE statements has been reduced.
    • Starting in Postgres 14 the throughput for range queries decreases by ~30% because the CPU overhead for range queries and DELETE statements has grown. I am still debugging this.
  • Most versions were unable to sustain the target write rates (1000 inserts/s and 1000 deletes/s) during the qr1000 and qp1000 benchmark steps. Only Postgres 12.17 and 13.13 were able to sustain them; most others were far from the target and the worst were Postgres 14.10, 15.5 and 16.1.
  • Something changed for the worse in Postgres 14 that increases CPU overhead for queries and DELETE statements in this workload.
Comparing throughput in Postgres 16.1 to 9.0.23
  • Write-heavy - Postgres 16.1 is between 1.2X and 2.3X faster than 9.0.23
  • Range queries - Postgres 16.1 is up to ~20% slower than 9.0.23
  • Point queries - Postgres 16.1 is similar to 9.0.23

    Build + Configuration

    See the previous report for more details. I tested these versions: 9.0.23, 9.1.24, 9.2.24, 9.3.25, 9.4.26, 9.5.25, 9.6.24, 10.23, 11.22, 12.17, 13.13, 14.10, 15.5, 16.1. 

    The configuration files are in subdirectories from here. Search for files named conf.diff.cx9a2_bee which exist for each major version of Postgres.

    The Benchmark

    The test server is a Beelink SER4 with 8 AMD cores, 16G RAM, Ubuntu 22.04 and XFS using 1 m.2 device.

    The benchmark steps are:

    • l.i0
      • insert 800 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
    • l.x
      • create 3 secondary indexes per table. There is one connection per client.
    • l.i1
      • use 2 connections/client. One inserts 4M rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
    • l.i2
      • like l.i1 but each transaction modifies 5 rows (small transactions) and inserts 1M rows total
      • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow.
    • qr100
      • use 3 connections/client. One does range queries for 1800 seconds and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for a fixed amount of time. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
    • qp100
      • like qr100 except uses point queries on the PK index
    • qr500
      • like qr100 but the insert and delete rates are increased from 100/s to 500/s
    • qp500
      • like qp100 but the insert and delete rates are increased from 100/s to 500/s
    • qr1000
      • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
    • qp1000
      • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
    Results

    The performance report is here.

    The summary has 3 tables. The first shows absolute throughput for each DBMS tested, per benchmark step. The second has throughput relative to the version on the first row of the table. The third shows the background insert rate for the benchmark steps that have background inserts and whether all systems sustained the target rates. The second table makes it easy to see how performance changes over time.

    Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
    • insert/s for l.i0, l.i1, l.i2
    • indexed rows/s for l.x
    • range queries/s for qr100, qr500, qr1000
    • point queries/s for qp100, qp500, qp1000
    Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

    • The base case is pg9023_def which means Postgres 9.0.23
    • For the read-heavy benchmark steps that do range queries (qr100, qr500, qr1000) throughput improved between Postgres 9.2 and 13, then drops by ~30% in Postgres 14.10 (I confirmed the drop is also in Postgres 14.0). I will start to explain this in another post.
    • For the read-heavy benchmark steps that do point queries (qp100, qp500, qp1000) throughput is mostly unchanged from 9.0.23 through 16.1.
    • For the write-heavy steps (l.i0, l.x, l.i1, l.i2) throughput improves a lot
      • l.i0 - things get a lot better in Postgres 11.22
      • l.x - things get a lot better between Postgres 9.4.26 and 11.22
      • l.i1, l.i2 - things get a lot better in Postgres 12.17 likely because the query planner overhead during DELETE statements has been reduced (see the comments about get_actual_variable_range)
    • Comparing throughput in Postgres 16.1 to 9.0.23
      • Write-heavy -- Postgres 16 is faster
        • l.i0, l.x, l.i1, l.i2 - relative QPS is 1.22, 2.32, 1.83, 1.87
      • Range queries -- Postgres 16 is mostly slower
        • qr100, qr500, qr1000 - relative QPS is 0.81, 0.89, 1.01
      • Point queries -- Postgres 16 is slightly slower
        • qp100, qp500, qp1000 - relative QPS is 0.98, 0.96, 1.00
    Target write rates

    The third table in the summary shows the write rates sustained during the read-write benchmark steps. The target write rates are 100/s for qr100 and qp100, 500/s for qr500 and qp500 and then 1000/s for qr1000 and qp1000. Note that X/s means X inserts/s and X deletes/s. When the value is close enough to the target then I assume the target has been sustained. The table cells in red indicate the cases where the target has not been sustained.
    • For qr100, qp100, qr500, qp500 -- all versions sustained the targets
    • For qr1000, qp1000 - only Postgres 12.17 and 13.13 sustained the targets.
    One session is used for INSERT statements and another for DELETE statements. They run at the same rate so if one session runs slow, then both will be slow. I assume the problem here is that DELETE processing is slow and this is related to changes in get_actual_variable_range.

    The following table shows the number of CPU seconds consumed per connection during the qp1000 benchmark step. There is:
    • a big decrease in CPU starting in 12.17 for the query connection
    • a big decrease in CPU starting in 12.17 for the delete connection
    • a big increase in CPU starting in 14.10 for the query and delete connection
    CPU seconds per connection during qp1000
    * query = connection that does point queries
    * ins = connection that does inserts
    * del = connection that does deletes

            query   ins     del
    11.22   626     157     3657
    12.17   311     144     1671
    13.13   312     145     1758
    14.10   595     158     3596
    15.5    609     156     3714
    16.1    612     158     3716




    Thursday, January 25, 2024

    Updated Insert benchmark: InnoDB/MySQL 5.6, 5.7 and 8.0, small server, cached database

    I now have 4 server types at home (8 cores + 16G RAM, 8 cores + 32G RAM,  24 cores, 32 cores) and am trying to finish a round of the Insert Benchmark for each. This has results for the smallest (8 cores + 16G RAM) using a cached workload and MySQL 5.6, 5.7, 8.0.

    tl;dr

    • For this setup MySQL has large regressions over time while Postgres does not
    • The regressions in MySQL are large here, but smaller on workloads with more concurrency
    • There are few regressions with the 5.6 and 5.7 release cycles
    • There are large regressions within the 8.0 release cycle
    • There are large regressions at the start of the 5.7 and 8.0 release cycles
    • Enabling the perf schema reduces throughput by ~4% for most write-heavy benchmark steps, by ~10% for read-heavy benchmark steps and by a lot more for index create

    Build + Configuration

    I tested many versions of MySQL 5.6, 5.7 and 8.0. These were compiled from source. I used the CMake files from here with the patches here to fix problems that otherwise prevent compiling older MySQL releases on modern Ubuntu. In all cases I use the rel build that uses CMAKE_BUILD_TYPE=Release.

    I used the cz10a_bee my.cnf files that are here for 5.6, for 5.7 and for 8.0. For 5.7 and 8.0 there are many variants of that file to make them work on a range of the point releases.

    The versions I tested are:
    • 5.6
      • 5.6.21, 5.6.31, 5.6.41, 5.6.51
    • 5.7
      • 5.7.10, 5.7.20, 5.7.30, 5.7.44
    • 8.0
      • 8.0.13, 8.0.14, 8.0.20, 8.0.28, 8.0.35, 8.0.36
    For 8.0.35 I tested a few variations from what is described above to understand the cost of the performance schema:
    • my8035_rel.cz10aps0_bee
      • this uses my.cnf.cz10aps0_bee which is the same as my.cnf.cz10a_bee except it adds performance_schema=0 (a quick runtime check is shown after this list)
    • my8035_rel_lessps.cz10a_bee
      • the build disables as much as possible of the performance schema. The CMake file is here.
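
    As a quick runtime check for which setup is in effect (a sketch; performance_schema is the standard MySQL variable name, and for the rel_lessps build this only shows whether the variable is on, not how much instrumentation was compiled out):

    -- returns 0 for the cz10aps0_bee config (performance_schema=0) and 1 when the perf schema is enabled
    select @@global.performance_schema;
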
    Benchmark

    The test server is a Beelink SER4 with 8 cores, 16G RAM, Ubuntu 22.04 and XFS using 1 m.2 device. The benchmark is run with one client.

    I used the updated Insert Benchmark so there are more benchmark steps described below. In order, the benchmark steps are:

    • l.i0
      • insert 30 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
    • l.x
      • create 3 secondary indexes per table. There is one connection per client.
    • l.i1
      • use 2 connections/client. One inserts 50M rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
    • l.i2
      • like l.i1 but each transaction modifies 5 rows (small transactions).
    • qr100
      • use 3 connections/client. One does range queries for 1800 seconds and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for a fixed amount of time. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
    • qp100
      • like qr100 except uses point queries on the PK index
    • qr500
      • like qr100 but the insert and delete rates are increased from 100/s to 500/s
    • qp500
      • like qp100 but the insert and delete rates are increased from 100/s to 500/s
    • qr1000
      • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
    • qp1000
      • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
    Results

    The performance reports are here for MySQL 5.6, MySQL 5.7, MySQL 8.0 and MySQL 5.6 to 8.0.

    The summary has 3 tables. The first shows absolute throughput for each DBMS tested, per benchmark step. The second has throughput relative to the version on the first row of the table. The third shows the background insert rate for the benchmark steps that have background inserts and whether all systems sustained the target rates. The second table makes it easy to see how performance changes over time.

    Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
    • insert/s for l.i0, l.i1, l.i2
    • indexed rows/s for l.x
    • range queries/s for qr100, qr500, qr1000
    • point queries/s for qp100, qp500, qp1000
    Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

    From the summary for 5.6
    • The base case is 5.6.21
    • Throughput in 5.6.51 is ~2% less than 5.6.21
    From the summary for 5.7
    • The base case is 5.7.10
    • Throughput in 5.7.44 is ~3% less than 5.7.10
    From the summary for 8.0
    • The base case is 8.0.13
    • I ignore the 8.0.35 variations (cz10aps0_bee config, rel_lessps build) for now
    • Unlike MySQL 5.6 and 5.7 above, there are larger regressions during the 8.0 cycle. Comparing 8.0.36 with 8.0.13
      • l.i0 - relative QPS is 0.81 in 8.0.36
      • l.x (create index) - I ignore this for now but read this
      • l.i1, l.i2 - relative QPS is 0.91 and 0.80 in 8.0.36
      • qr100, qr500, qr1000 - relative QPS is 0.97, 0.96 and 0.94 in 8.0.36
      • qp100, qp500, qp1000 - relative QPS is 0.86, 0.88 and 0.84 in 8.0.36
    From the summary for 8.0 focusing on the 8.0.35 variations that disable the perf schema
    • Throughput for write-heavy steps (l.i0, l.i1, l.i2) is up to 4% better
    • Throughput for read-heavy steps (qr*, qp*) is ~11% better
    • Throughput for parallel index create is ~1.5X better (read this)
    From the summary for 5.6, 5.7, 8.0
    • The base case is 5.6.21
    • Comparing 5.7.44 and 8.0.36 with 5.6.21 shows the large regressions
      • l.i0
        • relative QPS is 0.81 in 5.7.44
        • relative QPS is 0.55 in 8.0.36
      • l.x - I ignore this for now
      • l.i1, l.i2
        • relative QPS is 1.10, 0.86 in 5.7.44
        • relative QPS is 0.91, 0.71 in 8.0.36
      • qr100, qr500, qr1000
        • relative QPS is 0.73, 0.72, 0.72 in 5.7.44
        • relative QPS is 0.62, 0.63, 0.62 in 8.0.36
      • qp100, qp500, qp1000
        • relative QPS is 0.81, 0.80, 0.80 in 5.7.44
        • relative QPS is 0.60, 0.61, 0.61 in 8.0.36

    Wednesday, January 24, 2024

    Updated Insert benchmark: Postgres 9.x to 16.x, small server, cached database, v3

    I now have 4 server types at home (8 cores + 16G RAM, 8 cores + 32G RAM,  24 cores, 32 cores) and am trying to finish a round of the Insert Benchmark for each. This has results for the smallest (8 cores + 16G RAM) using a cached workload and Postgres.

    In previous blog posts I claimed that there are large regressions from old to new MySQL but not from old to new Postgres, and I shared results for MySQL 5.6, 5.7 and 8.0 along with Postgres versions 10 through 16. A comment on those results said the comparison was unfair because the first GA MySQL 5.6 release is 5.6.10 from 2013 while the first Postgres 10 GA release is 10.0 from 2017.

    Here I have results going back to Postgres 9.0.23 and the first 9.0 release is 9.0.0 from 2010.

    tl;dr

    • the song remains the same: MySQL has large regressions over time while Postgres avoids them
    • comparing Postgres 16.1 with Postgres 9.0.23
      • for write-heavy benchmark steps PG 16.1 gets between 1.2X and 2.8X more throughput
      • for range queries PG 16.1 gets ~1.2X more throughput
      • for point queries PG 16.1 gets ~1.1X more throughput

    Build + Configuration

    See the previous report for more details. I used these versions: 9.0.23, 9.1.24, 9.2.24, 9.3.25, 9.4.26, 9.5.25, 9.6.24, 10.23, 11.22, 12.17, 13.13, 14.10, 15.5, 16.1. 

    The configuration files are in subdirectories from here. Search for files named conf.diff.cx9a2_bee which exist for each major version of Postgres.

    The Benchmark

    The benchmark is explained here except the first benchmark step, l.i0, loads 30M rows/table here while previously it only loaded 20M. The database still fits in memory as the test server has 16G of RAM and the database tables are ~8G. The benchmark is run with 1 client.

    The test server was named SER4 in the previous report. It has 8 cores, 16G RAM, Ubuntu 22.04 and XFS using 1 m.2 device.

    The benchmark steps are:

    • l.i0
      • insert 30 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
    • l.x
      • create 3 secondary indexes per table. There is one connection per client.
    • l.i1
      • use 2 connections/client. One inserts 40M rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
    • l.i2
      • like l.i1 but each transaction modifies 5 rows (small transactions) and inserts 10M rows total
      • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow.
    • qr100
      • use 3 connections/client. One does range queries for 1800 seconds and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for a fixed amount of time. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
    • qp100
      • like qr100 except uses point queries on the PK index
    • qr500
      • like qr100 but the insert and delete rates are increased from 100/s to 500/s
    • qp500
      • like qp100 but the insert and delete rates are increased from 100/s to 500/s
    • qr1000
      • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
    • qp1000
      • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
    Results

    The performance report is here.

    The summary has 3 tables. The first shows absolute throughput for each DBMS tested, per benchmark step. The second has throughput relative to the version on the first row of the table. The third shows the background insert rate for the benchmark steps that have background inserts and whether all systems sustained the target rates. The second table makes it easy to see how performance changes over time.

    Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
    • insert/s for l.i0, l.i1, l.i2
    • indexed rows/s for l.x
    • range queries/s for qr100, qr500, qr1000
    • point queries/s for qp100, qp500, qp1000
    Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

    • The base case is pg9023_def which means Postgres 9.0.23
    • For most of the read-write benchmark steps throughput improves a lot from 9.1.24 to 9.2.24 and has been stable since then. The exception is the last step (qp1000) for which throughput is flat. It might be that writeback and/or vacuum hurts query throughput by that point.
    • For the write-heavy steps (l.i0, l.x, l.i1, l.i2) throughput improves a lot
      • l.i0 - things get a lot better in Postgres 11.22
      • l.x - things get a lot better in Postgres 9.6.24
      • l.i1 - things get a lot better in Postgres 9.5.25 and then again in 12.17
      • l.i2 - improvements are similar to l.i1 but not as good because of the query planner overhead during DELETE statements (see the comments about get_actual_variable_range)
    • Comparing throughput in Postgres 16.1 to 9.0.23
      • Write-heavy
        • l.i0, l.x, l.i1, l.i2 - relative QPS is 1.23, 1.81, 2.82, 2.69
      • Range queries
        • qr100, qr500, qr1000 - relative QPS is 1.20, 1.24, 1.25
      • Point queries
        • qp100, qp500, qp1000 - relative QPS is 1.10, 1.09, 1.00

    Wednesday, January 17, 2024

    Updated Insert benchmark: MyRocks 5.6 and 8.0, medium server, IO-bound database, v2

    This has results for the Insert Benchmark using MyRocks 5.6 and 8.0, a medium server and an IO-bound workload with a working set that isn't cached.

    tl;dr

    • The cost from enabling the perf schema was insignificant for the write-heavy and point-query benchmark steps. It was significant for the range-query benchmark steps.
    Comparing latest MyRocks 5.6.35 to older MyRocks 5.6.35
    • Write-heavy perf mostly improves, especially on the initial load step (l.i0)
    • Point-query perf is stable
    • Range-query perf shows a big regression between the fbmy5635_rel_202210112144 and fbmy5635_rel_202302162102 builds
    Comparing latest MyRocks 8.0.32 to older MyRocks 5.6.35
    • The cost of the perf schema is large for range queries and otherwise not large
    • Write-heavy perf mostly improves, especially on the initial load step (l.i0)
    • Point-query perf is stable
    • Range-query perf shows a big regression between the fbmy5635_rel_202210112144 and fbmy5635_rel_202302162102 builds and doesn't recover in the 8.0 builds
    Comparing latest MyRocks 8.0.32 to latest MyRocks 5.6.35
    • Write-heavy perf is similar except for the initial load step (l.i0) in which 8.0 is almost 20% slower
    • Point-query perf is similar
    • Range-query perf is ~5% worse in 8.0
    Comparing latest MyRocks 8.0.32 to latest MyRocks 8.0.28
    • Results are similar

    Build + Configuration

    See the previous report.

    Benchmark

    See the previous report.

    Benchmark steps

    The benchmark is run with 8 clients and a client per table.

    The benchmark is a sequence of steps that are run in order:
    • l.i0
      • insert 500M rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
    • l.x
      • create 3 secondary indexes per table. There is one connection per client.
    • l.i1
      • use 2 connections/client. One does inserts as fast as possible and the other does deletes at the same rate as the inserts to avoid changing the number of rows in the table. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
    • l.i2
      • like l.i1 but each transaction modifies 5 rows (small transactions).
      • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow where X is max(1200, 60 + #nrows/1M). While waiting do things to reduce writeback debt where the things are:
        • MyRocks (see here) - set rocksdb_force_flush_memtable_now to flush the memtable, wait 20 seconds and then set rocksdb_compact_lzero_now to flush L0. Note that rocksdb_compact_lzero_now wasn't supported until mid-2023.
    • qr100
      • use 3 connections/client. One does range queries as fast as possible and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
    • qp100
      • like qr100 except uses point queries on the PK index
    • qr500
      • like qr100 but the insert and delete rates are increased from 100/s to 500/s
    • qp500
      • like qp100 but the insert and delete rates are increased from 100/s to 500/s
    • qr1000
      • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
    • qp1000
      • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
    Results

    The performance reports are here for
    The summary has 3 tables. The first shows absolute throughput for each DBMS tested, per benchmark step. The second has throughput relative to the version on the first row of the table. The third shows the background insert rate for the benchmark steps that have background inserts and whether all systems sustained the target rates. The second table makes it easy to see how performance changes over time.

    Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
    • insert/s for l.i0, l.i1, l.i2
    • indexed rows/s for l.x
    • range queries/s for qr100, qr500, qr1000
    • point queries/s for qp100, qp500, qp1000
    Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

    From the summary for 5.6
    • The base case is fbmy5635_rel_202104072149
    • Comparing throughput in fbmy5635_rel_221222 to the base case
      • Write-heavy
        • l.i0, l.x, l.i1, l.i2 - relative QPS is 1.11, 0.92, 1.00, 1.00
      • Range queries
        • qr100, qr500, qr1000 - relative QPS is 0.62, 0.79, 0.77
      • Point queries
        • qp100, qp500, qp1000 - relative QPS is 0.97, 1.00, 0.99
    From the summary for 8.0
    • The base case is fbmy8028_rel_221222
    • The cost of the perf schema is <= 2% for write-heavy, <= 19% for range queries and <= 1% for point queries. I am not certain that the impact on range queries is all from the perf schema. I still need to explain why the range query benchmark steps have too much noise.
    • Comparing throughput in fbmy8032_rel_221222 to the base case
      • Write-heavy
        • l.i0, l.x, l.i1, l.i2 - relative QPS is 0.96, 0.98, 0.99, 0.97
      • Range queries
        • qr100, qr500, qr1000 - relative QPS is 1.02, 1.04, 1.04
      • Point queries
        • qp100, qp500, qp1000 - relative QPS is 0.99, 1.00, 0.98
    From the summary for 5.6, 8.0 with many versions
    • The base case is fbmy5635_rel_202104072149
    • Comparing throughput in fbmy8032_rel_221222 to the base case
      • Write-heavy
        • l.i0, l.x, l.i1, l.i2 - relative QPS is 0.91, 0.87, 0.99, 0.97
      • Range queries
        • qr100, qr500, qr1000 - relative QPS is 0.58, 0.76, 0.74
      • Point queries
        • qp100, qp500, qp1000 - relative QPS is 0.98, 1.03, 1.01
    From the summary for 5.6, 8.0 with latest versions
    • The base case is fbmy5635_rel_221222
    • Comparing throughput in fbmy8032_rel_221222 to the base case
      • Write-heavy
        • l.i0, l.x, l.i1, l.i2 - relative QPS is 0.82, 0.95, 0.98, 0.97
      • Range queries
        • qr100, qr500, qr1000 - relative QPS is 0.95, 0.96, 0.96
      • Point queries
        • qp100, qp500, qp1000 - relative QPS is 1.01, 1.02, 1.02

    Friday, January 12, 2024

    Updated Insert benchmark: MyRocks 5.6 and 8.0, small(est) server, cached database, v2

    This has results for the Insert Benchmark using MyRocks 5.6 and 8.0, a small server and a cached workload. I have two versions of small servers -- Beelink SER4 with 16G of RAM, Beelink SER7 with 32G of RAM. This report uses the SER4. This report replaces a January 2 report for the Beelink SER4. The difference is that I improved the benchmark scripts to reduce compaction debt prior to the read-write benchmark steps. My intention was to reduce noise in the throughput results. Alas, I have more work to do.

    tl;dr

    • Enabling the perf schema reduces throughput by up to 10% for write-heavy and up to 5% for read-heavy.
    • The range query benchmark steps (qr*) have too much noise that I have yet to explain
    • Comparing latest MyRocks 8.0.32 to 5.6.35 shows
      • 8.0.32 gets 20% to 30% less throughput for write-heavy
      • 8.0.32 gets ~10% less throughput for point queries
      • There is too much noise on the range query benchmark steps

      Build + Configuration


      Benchmark

      The server is a Beelink SER4 described here with 8 cores, 16G RAM, Ubuntu 22.04 and XFS on a fast m.2 NVMe device. The benchmark is run with 1 client.

      The benchmark is a sequence of steps that are run in order:
      • l.i0
        • insert 30M rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
      • l.x
        • create 3 secondary indexes per table. There is one connection per client.
      • l.i1
        • use 2 connections/client. One does inserts as fast as possible and the other does deletes at the same rate as the inserts to avoid changing the number of rows in the table. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
      • l.i2
        • like l.i1 but each transaction modifies 5 rows (small transactions).
        • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow where X is max(1200, 60 + #nrows/1M). While waiting do things to reduce writeback debt where the things are:
          • MyRocks (see here) - set rocksdb_force_flush_memtable_now to flush the memtable, wait 20 seconds and then set rocksdb_compact_lzero_now to flush L0. Note that rocksdb_compact_lzero_now wasn't supported until mid-2023.
      • qr100
        • use 3 connections/client. One does range queries as fast as possible and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
      • qp100
        • like qr100 except uses point queries on the PK index
      • qr500
        • like qr100 but the insert and delete rates are increased from 100/s to 500/s
      • qp500
        • like qp100 but the insert and delete rates are increased from 100/s to 500/s
      • qr1000
        • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
      • qp1000
        • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
      Results

      The performance reports are here for
      The summary has 3 tables. The first shows absolute throughput for each DBMS tested, per benchmark step. The second has throughput relative to the version on the first row of the table. The third shows the background insert rate for the benchmark steps that have background inserts and whether all systems sustained the target rates. The second table makes it easy to see how performance changes over time.

      Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
      • insert/s for l.i0, l.i1, l.i2
      • indexed rows/s for l.x
      • range queries/s for qr100, qr500, qr1000
      • point queries/s for qp100, qp500, qp1000
      Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

      The range query benchmark steps suffer from too much noise that I have yet to explain.

      From the summary for 5.6
      • The base case is fbmy5635_rel_202104072149
      • The results with the builds that use clang are similar to gcc except for the l.i0 and l.ix benchmark steps. I opened a bug against LLVM for code generation related to crc32 functions.
      • Comparing throughput in fbmy5635_rel_221222 to the base case
        • Write-heavy
          • l.i0, l.x, l.i1, l.i2 - relative QPS is 0.96, 0.98, 0.99, 1.00
        • Range queries
          • qr100, qr500, qr1000 - relative QPS is 0.80, 0.86, 1.63
        • Point queries
          • qp100, qp500, qp1000 - relative QPS is 0.97, 1.00, 1.03
      From the summary for 8.0
      • The base case is fbmy8028_rel_20220829_752
      • The results with clang are worse than gcc. See the previous section for details.
      • Comparing throughput in fbmy8032_rel_221222 to the base case
        • Write-heavy
          • l.i0, l.x, l.i1, l.i2 - relative QPS is 0.98, 1.02, 1.01, 1.03
        • Range queries
          • qr100, qr500, qr1000 - relative QPS is 1.33, 0.95, 0.94
        • Point queries
          • qp100, qp500, qp1000 - relative QPS is 0.97, 0.98, 0.97
      From the summary for 5.6, 8.0 with many versions
      • The base case is fbmy5635_rel_202104072149
      • Enabling the perf schema costs up to 10% of throughput for write-heavy and up to 5% for read-heavy.
      • Comparing throughput in fbmy8032_rel_221222 to the base case
        • Write-heavy
          • l.i0, l.x, l.i1, l.i2 - relative QPS is 0.69, 0.88, 0.83, 0.84
        • Range queries
          • qr100, qr500, qr1000 - relative QPS is 0.92, 1.03, 1.55
        • Point queries
          • qp100, qp500, qp1000 - relative QPS is 0.88, 0.89, 0.92
      From the summary for 5.6, 8.0 with latest versions
      • The base case is fbmy5635_rel_221222
      • Comparing throughput in fbmy8032_rel_221222 to the base case
        • Write-heavy
          • l.i0, l.x, l.i1, l.i2 - relative QPS is 0.68, 0.88, 0.81, 0.79
        • Range queries
          • qr100, qr500, qr1000 - relative QPS is 1.02, 1.30, 0.90
        • Point queries
          • qp100, qp500, qp1000 - relative QPS is 0.90, 0.89, 0.89



      Updated Insert benchmark: MyRocks 5.6 and 8.0, small server, cached database, v2

      This has results for the Insert Benchmark using MyRocks 5.6 and 8.0, a small server and a cached workload. I have two versions of small servers -- Beelink SER4 with 16G of RAM, Beelink SER7 with 32G of RAM. This report uses the SER7. A recent report from the Beelink SER4 is here but that report will be replaced in a few days.

      tl;dr

      • Some of the regressions between MyRocks 5.6 and 8.0 come from upstream. Here that shows up on the l.i0, qp100, qp500 and qr1000 benchmark steps.
      • There is too much noise in the range query benchmark steps (qr*) that I have yet to explain

      Noise

      I recently improved the benchmark scripts to remove writeback and compaction debt after the l.i2 benchmark step to reduce noise in the read-write steps that follow. At least for MyRocks, the range query benchmark steps (qr100, qr500, qr1000) have more noise. The worst case for noise with MyRocks is the qr100 step, and this is more obvious on a small server. 

      For MyRocks, the benchmark script now does the following after l.i2:

      • wait for X seconds where X = min(1200, 60 + #rows / 1M)
      • while waiting: flush memtable, wait 20 seconds, compact L0 into L1. But compacting L0 into L1 is only done for MyRocks builds from mid-2023 or newer because the feature I used for that was buggy prior to mid-2023.
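
      The flush and compaction steps above map to statements like the following (a sketch; both variables are named in this post and rocksdb_compact_lzero_now only exists in builds from mid-2023 or later):

      -- flush the memtable to L0
      set global rocksdb_force_flush_memtable_now = 1;
      -- the script waits ~20 seconds for the flush to finish, then compacts L0 into L1
      set global rocksdb_compact_lzero_now = 1;
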
      When the qr100 benchmark step starts the memtable is empty and the L0 might be empty. On small servers when I run the benchmark step for less than one hour the memtable never gets full and there are no memtable flushes. On larger servers the memtable is likely to be flushed many times.

      Regardless, I have yet to figure out why there is more noise with MyRocks on the range query benchmark steps. Until then, with MyRocks I focus on qr500 and qr1000 or on the results from larger servers in my search for regressions in range queries. What I see now is that the CPU/query overhead changes significantly, but I need to explain why that happens.

      Build + Configuration

      I tested MyRocks 5.6.35, 8.0.28 and 8.0.32 using the latest code as of December 2023. I also repeated tests for older builds for MyRocks 5.6.35 and 8.0.28. These were compiled from source. All builds use CMAKE_BUILD_TYPE=Release.

      MyRocks 5.6.35 builds:
      • fbmy5635_rel_202104072149
        • from code as of 2021-04-07 at git hash f896415f with RocksDB 6.19.0
      • fbmy5635_rel_202203072101
        • from code as of 2022-03-07 at git hash e7d976ee with RocksDB 6.28.2
      • fbmy5635_rel_202205192101
        • from code as of 2022-05-19 at git hash d503bd77 with RocksDB 7.2.2
      • fbmy5635_rel_202208092101
        • from code as of 2022-08-09 at git hash 877a0e58 with RocksDB 7.3.1
      • fbmy5635_rel_202210112144
        • from code as of 2022-10-11 at git hash c691c716 with RocksDB 7.3.1
      • fbmy5635_rel_202302162102
        • from code as of 2023-02-16 at git hash 21a2b0aa with RocksDB 7.10.0
      • fbmy5635_rel_202304122154
        • from code as of 2023-04-12 at git hash 205c31dd with RocksDB 7.10.2
      • fbmy5635_rel_202305292102
        • from code as of 2023-05-29 at git hash b739eac1 with RocksDB 8.2.1
      • fbmy5635_rel_20230529_832
        • from code as of 2023-05-29 at git hash b739eac1 with RocksDB 8.3.2
      • fbmy5635_rel_20230529_843
        • from code as of 2023-05-29 at git hash b739eac1 with RocksDB 8.4.3
      • fbmy5635_rel_20230529_850
        • from code as of 2023-05-29 at git hash b739eac1 with RocksDB 8.5.0
      • fbmy5635_rel_221222
        • from code as of 2023-12-22 at git hash 4f3a57a1, RocksDB 8.7.0 at git hash 29005f0b
      MyRocks 8.0.28 builds:
      • fbmy8028_rel_20220829_752
        • from code as of 2022-08-29 at git hash a35c8dfeab, RocksDB 7.5.2
      • fbmy8028_rel_20230129_754
        • from code as of 2023-01-29 at git hash 4d3d44a0459, RocksDB 7.5.4
      • fbmy8028_rel_20230502_810
        • from code as of 2023-05-02 at git hash d1ca8b276d, RocksDB 8.1.0
      • fbmy8028_rel_20230523_821
        • from code as of 2023-05-23 at git hash b08cc536f1, RocksDB 8.2.1
      • fbmy8028_rel_20230619_831
        • from code as of 2023-06-19 at git hash 6164cf0274, RocksDB 8.3.1
      • fbmy8028_rel_20230629_831
        • from code as of 2023-06-29 at git hash ab522f6df7c, RocksDB 8.3.1
      • fbmy8028_rel_221222
        • from code as of 2023-12-22 at git hash 2ad105fc, RocksDB 8.7.0 at git hash 29005f0b
      MyRocks 8.0.32 builds:
      • fbmy8032_rel_221222
        • from code as of 2023-12-22 at git hash 76707b44, RocksDB 8.7.0 at git hash 29005f0b

      Benchmark

      The server is a Beelink SER7 described here with 8 cores, 32G RAM, Ubuntu 22.04 and XFS on a fast m.2 NVMe device. The benchmark is run with 1 client.

      The benchmark is a sequence of steps that are run in order:
      • l.i0
        • insert 60M rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
      • l.x
        • create 3 secondary indexes per table. There is one connection per client.
      • l.i1
        • use 2 connections/client. One does inserts as fast as possible and the other does deletes at the same rate as the inserts to avoid changing the number of rows in the table. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
      • l.i2
        • like l.i1 but each transaction modifies 5 rows (small transactions).
        • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow where X is max(1200, 60 + #nrows/1M). While waiting do things to reduce writeback debt where the things are:
          • MyRocks (see here) - set rocksdb_force_flush_memtable_now to flush the memtable, wait 20 seconds and then set rocksdb_compact_lzero_now to flush L0. Note that rocksdb_compact_lzero_now wasn't supported until mid-2023.
      • qr100
        • use 3 connections/client. One does range queries as fast as possible and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
      • qp100
        • like qr100 except uses point queries on the PK index
      • qr500
        • like qr100 but the insert and delete rates are increased from 100/s to 500/s
      • qp500
        • like qp100 but the insert and delete rates are increased from 100/s to 500/s
      • qr1000
        • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
      • qp1000
        • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
      Results

      The performance reports are here for
      The summary has 3 tables. The first shows absolute throughput for each DBMS tested, per benchmark step. The second has throughput relative to the version on the first row of the table. The third shows the background insert rate for the benchmark steps that have background inserts and whether all systems sustained the target rates. The second table makes it easy to see how performance changes over time.

      Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
      • insert/s for l.i0, l.i1, l.i2
      • indexed rows/s for l.x
      • range queries/s for qr100, qr500, qr1000
      • point queries/s for qp100, qp500, qp1000
      Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

      The range query benchmark steps suffer from too much noise that I have yet to explain.

      From the summary for 5.6
      • The base case is fbmy5635_rel_202104072149
      • Comparing throughput in fbmy5635_rel_221222 to the base case
        • Write-heavy
          • l.i0, l.x, l.i1, l.i2 - relative QPS is 0.95, 0.98, 0.97, 0.95
        • Range queries
          • qr100, qr500, qr1000 - relative QPS is 0.65, 1.11, 0.70
        • Point queries
          • qp100, qp500, qp1000 - relative QPS is 1.00, 0.99, 0.99
      From the summary for 8.0
      • The base case is fbmy8028_rel_20220829_752
      • Comparing throughput in fbmy8032_rel_221222 to the base case
        • Write-heavy
          • l.i0, l.x, l.i1, l.i2 - relative QPS is 0.95, 1.01, 1.00, 0.97
        • Range queries
          • qr100, qr500, qr1000 - relative QPS is 0.98, 0.72, 1.04
        • Point queries
          • qp100, qp500, qp1000 - relative QPS is 0.99, 1.00, 0.99
      From the summary for 5.6, 8.0 with many versions
      • The base case is fbmy5635_rel_202104072149
      • Comparing throughput in fbmy8032_rel_221222 to the base case
        • Write-heavy
          • l.i0, l.x, l.i1, l.i2 - relative QPS is 0.66, 0.89, 0.82, 0.81
        • Range queries
          • qr100, qr500, qr1000 - relative QPS is 0.93, 1.04, 0.69
        • Point queries
          • qp100, qp500, qp1000 - relative QPS is 0.86, 0.86, 0.83
      From the summary for 5.6, 8.0 with latest versions
      • The base case is fbmy5635_rel_221222
      • Comparing throughput in fbmy8032_rel_221222 to the base case
        • Write-heavy
          • l.i0, l.x, l.i1, l.i2 - relative QPS is 0.69, 0.91, 0.85, 0.84
        • Range queries
          • qr100, qr500, qr1000 - relative QPS is 1.44, 0.93, 0.98
        • Point queries
          • qp100, qp500, qp1000 - relative QPS is 0.86, 0.87, 0.84





      Thursday, January 11, 2024

      Updated Insert benchmark: MyRocks 5.6 and 8.0, medium server, cached database, v2

      This has results for the Insert Benchmark using MyRocks 5.6 and 8.0, a medium server and a cached workload. This replaces a recent report. The difference between this and the recent report is that I changed the benchmark scripts to reduce writeback and compaction debt between the last write-only benchmark step (l.i2) and the first read-write benchmark step (qr100). The intention is to reduce variance and make it easier to spot regressions. Alas, that is still an unsolved problem especially on the range query benchmark steps.

      tl;dr - context matters

      The biggest concerns I have are the ~16% slowdown on the initial load (l.i0) benchmark step from MyRocks 5.6.35 to 8.0.32 and the ~5% slowdown for benchmark steps that do point queries (qp*) from MyRocks 8.0.28 to 8.0.32.

      Comparing latest MyRocks 8.0.32 relative to latest MyRocks 5.6.35
      • Initial load is ~17% slower
      • Other write-heavy benchmark steps are ~3% slower
      • Range queries are between 6% and 14% faster
      • Point queries are ~7% faster
      Comparing latest MyRocks 8.0.32 to an old build of MyRocks 5.6.35
      • Initial load is ~16% slower
      • Other write-heavy benchmarks steps are between 2% and 6% slower
      • Range queries are between 5% slower and 5% faster
      • Point queries are 5% to 11% faster
      Comparing latest MyRocks 8.0.32 to latest MyRocks 8.0.28
      • Initial load is ~4% slower
      • Other write-heavy benchmark steps are between 3% slower and 2% faster
      • Range queries are between 1% slower and 6% faster
      • Point queries are ~5% slower

      Build + Configuration

      See the previous report.

      Benchmark

      See the previous report.

      Benchmark steps

      The benchmark is run with 8 clients and a client per table.

      The benchmark is a sequence of steps that are run in order:
      • l.i0
        • insert 20M rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
      • l.x
        • create 3 secondary indexes per table. There is one connection per client.
      • l.i1
        • use 2 connections/client. One does inserts as fast as possible and the other does deletes at the same rate as the inserts to avoid changing the number of rows in the table. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
      • l.i2
        • like l.i1 but each transaction modifies 5 rows (small transactions).
        • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow where X is max(1200, 60 + #nrows/1M). While waiting do things to reduce writeback debt where the things are:
          • MyRocks (see here) - set rocksdb_force_flush_memtable_now to flush the memtable, wait 20 seconds and then set rocksdb_compact_lzero_now to flush L0. Note that rocksdb_compact_lzero_now wasn't supported until mid-2023.
      • qr100
        • use 3 connections/client. One does range queries as fast as possible and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1200 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
      • qp100
        • like qr100 except uses point queries on the PK index
      • qr500
        • like qr100 but the insert and delete rates are increased from 100/s to 500/s
      • qp500
        • like qp100 but the insert and delete rates are increased from 100/s to 500/s
      • qr1000
        • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
      • qp1000
        • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
      Results

      The performance reports are here for
      The summary has 3 tables. The first shows absolute throughput for each DBMS tested, per benchmark step. The second has throughput relative to the version on the first row of the table. The third shows the background insert rate for the benchmark steps that have background inserts and whether all systems sustained the target rates. The second table makes it easy to see how performance changes over time.

      Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
      • insert/s for l.i0, l.i1, l.i2
      • indexed rows/s for l.x
      • range queries/s for qr100, qr500, qr1000
      • point queries/s for qp100, qp500, qp1000
      Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.
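
      As an illustrative example with made-up numbers: if the base case sustains 20,000 inserts/s on l.i1 and my version sustains 19,000 inserts/s, then relative QPS is 19000 / 20000 = 0.95, which falls at the red threshold.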

      From the summary for 5.6
      • The base case is fbmy5635_rel_202104072149
      • Comparing throughput in fbmy5635_rel_221222 to the base case
        • Write-heavy
          • l.i0, l.x, l.i1, l.i2 - relative QPS is 1.02, 0.97, 0.97, 1.01
        • Range queries
          • qr100, qr500, qr1000 - relative QPS is 0.93, 0.92, 0.99
        • Point queries
          • qp100, qp500, qp1000 - relative QPS is 0.98, 1.03, 1.01
      From the summary for 8.0
      • The base case is fbmy8028_rel_221222
      • The cost of the perf schema is <= 3% for write-heavy, <= 14% for range queries and <= 5% for point queries
      • Comparing throughput in fbmy8032_rel_221222 to the base case
        • Write-heavy
          • l.i0, l.x, l.i1, l.i2 - relative QPS is 0.96, 1.02, 0.98, 0.97
        • Range queries
          • qr100, qr500, qr1000 - relative QPS is 1.01, 1.06, 0.99
        • Point queries
          • qp100, qp500, qp1000 - relative QPS is 0.95, 0.96, 0.95
      From the summary for 5.6, 8.0 with many versions
      • The base case is fbmy5635_rel_202104072149
      • Comparing throughput in fbmy8032_rel_221222 to the base case
        • Write-heavy
          • l.i0, l.x, l.i1, l.i2 - relative QPS is 0.84, 0.94, 0.95, 0.98
        • Range queries
          • qr100, qr500, qr1000 - relative QPS is 0.95, 1.05, 1.00
        • Point queries
          • qp100, qp500, qp1000 - relative QPS is 1.05, 1.11, 1.09
      From the summary for 5.6, 8.0 with latest versions
      • The base case is fbmy5635_rel_221222
      • Comparing throughput in fbmy8032_rel_221222 to the base case
        • Write-heavy
          • l.i0, l.x, l.i1, l.i2 - relative QPS is 0.83, 0.97, 0.97, 0.97
        • Range queries
          • qr100, qr500, qr1000 - relative QPS is 1.06, 1.06, 1.14
        • Point queries
          • qp100, qp500, qp1000 - relative QPS is 1.07, 1.07, 1.07

      Wednesday, January 10, 2024

      Updated Insert benchmark: Postgres 9.x to 16.x, small server, cached database, v2

      I recently shared results for the updated Insert Benchmark with Postgres versions 9.0 to 16 using a small server and cached database. Here I have results for a slightly larger but still cached database. The reason for using a larger database is to get some of the benchmark steps to run for more time.

      tl;dr

      • Results here are similar to the previous results although a performance problem during the l.i1 and l.i2 benchmark steps is clearer here. In some benchmark steps the planner can spend too much CPU time trying to determine the min and/or max value of a column by reading from the index.
      • While Postgres performance is mostly getting better from old to new releases, there have been regressions in a few major releases (PG 11 through 13) for benchmark steps where this is an issue.
      • The regressions are likely to be larger for the IO-bound benchmark but that will take a few more days to finish.
      The Problem

      I shared details about the problem here and, as expected, a Postgres expert quickly replied with advice pointing me to a few changes that mitigate the problem.

      The problem is the pattern of inserts and deletes. Several of the benchmark steps do inserts in ascending PK order (inserts to the head) while doing deletes at the same rate to keep the number of rows fixed. The deletes are done from the other end of the table (deletes to the tail) by removing batches of rows with the smallest value for the PK.

      The PG planner has code in get_actual_variable_range to determine the min or max value of a column when there is a predicate on that column like X < $const or X > $const and $const falls into the largest or smallest histogram bucket. From PMP thread stacks, what I see is too much time with that function on the call stack. From ps output, the session that does delete statements can use 10X to 100X more CPU than the session that does insert statements. From explain analyze I see that the planner spends ~100 milliseconds per delete statement.
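
      To see what this looks like locally, here is a minimal, self-contained sketch that mimics the queue pattern. The table name, row counts and constants are made up for illustration, and how much planning time is spent varies by Postgres version:

        create table q (id bigint primary key, v int);
        insert into q select g, g from generate_series(1, 1000000) g;
        -- delete from the tail of the queue; without a vacuum the dead entries
        -- for id <= 100000 remain in the PK index
        delete from q where id <= 100000;
        analyze q;
        -- the constant 100000 now falls below the smallest histogram bucket, so the
        -- planner calls get_actual_variable_range and probes the low end of the index
        -- (how far it scans past dead entries depends on the version)
        explain (analyze, buffers)
        delete from q where id > 100000 and id <= 100050;
        -- compare "Planning Time:" to "Execution Time:" in the output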

      Build + Configuration

      See the previous report for more details. I used all of the versions described there: 9.0.23, 9.1.24, 9.2.24, 9.3.25, 9.4.26, 9.5.25, 9.6.24, 10.23, 11.22, 12.17, 13.13, 14.10, 15.5, 16.1. And then I also tested 11.19 and 13.10.

      The Benchmark

      The benchmark is explained here, except that the first benchmark step, l.i0, loads 30M rows/table rather than the 20M used previously. The database still fits in memory as the test server has 16G of RAM and the database tables are ~8G.

      The test server was named SER4 in the previous report. It has 8 cores, 16G RAM, Ubuntu 22.04 and XFS using 1 m.2 device.

      The benchmark steps are:

      • l.i0
        • insert 30 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
      • l.x
        • create 3 secondary indexes per table. There is one connection per client.
      • l.i1
        • use 2 connections/client. One inserts 50M rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
      • l.i2
        • like l.i1 but each transaction modifies 5 rows (small transactions).
        • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow.
      • qr100
        • use 3 connections/client. One does range queries for 1800 seconds and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for a fixed amount of time. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
      • qp100
        • like qr100 except uses point queries on the PK index
      • qr500
        • like qr100 but the insert and delete rates are increased from 100/s to 500/s
      • qp500
        • like qp100 but the insert and delete rates are increased from 100/s to 500/s
      • qr1000
        • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
      • qp1000
        • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
      Results

      The benchmark report is here.

      I start with the summary for the current round with 30M rows loaded and the previous round with 20M rows loaded. Here I focus on the benchmark steps where things are slightly different between the current and previous rounds -- the results for the l.i1 and l.i2 benchmark steps where regressions are more obvious in the current round.

      The summary has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version on the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time.

      There are big regressions in 11.19 and 11.22 and a small one in 13.13 for the l.i1 and l.i2 benchmark steps, which are visible in the summary.
      • For the l.i1 benchmark step the inserts/s rate drops from ~18k/s in 9.6.24 and 10.23 to ~11k/s in 11.19 and 11.22. It also drops by ~14% from 13.10 to 13.13.
      • The regressions for the l.i2 benchmark step occur in the same versions but are larger. The reason is that the delete statements in l.i1 remove 50 rows per transaction versus 5 for l.i2, so the per-statement planner overhead is amortized over fewer rows in l.i2 and the overhead per deleted row is larger.
      From the iostat and vmstat metrics collected per benchmark step, with both absolute and normalized values (normalized means the absolute value divided by the insert rate), I see that the CPU overhead per version (cpupq, which is CPU usecs per insert) is inversely correlated with the insert rate.

      This table shows the value of cpupq (CPU overhead) per version for the l.i1 and l.i2 benchmark steps. All of the numbers for iostat and vmstat are here for l.i1 and for l.i2.

      version | l.i1 | l.i2
      --------+------+------
        10.23 | 1253 | 5157
        11.19 | 1619 | 8285
        11.22 | 1623 | 6611
        12.17 | 1263 | 5815
        13.10 | 1222 | 3373
        13.13 | 1367 | 4863
        14.10 | 1126 | 3449

      The table above includes all CPU overhead from everything running on the server (Postgres and the benchmark client). The data below shows the CPU time per session measured by ps near the end of a benchmark step. There is one connection/session that only does delete statements and another that only does insert statements. The output from ps is here. The table below has the CPU seconds per version for both connections -- insert and delete. There are big changes in CPU overhead for the delete connection.

      version | insert | delete
      --------+--------+-------
        10.23 |    587 |   2587
        11.19 |    574 |   5196
        11.22 |    497 |   3851
        12.17 |    573 |   3137
        13.10 |    532 |   1278
        13.13 |    548 |   2403
        14.10 |    532 |   1317
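
      For reference, here is one way to find the two backends so that per-session CPU can be sampled with ps. The filters below are an assumption about how to identify the insert and delete sessions, not the exact commands used to produce the table above:

        -- find the pids of the backend that only runs deletes and the one that only runs inserts
        select pid, usename, left(query, 40) as query
        from pg_stat_activity
        where query ilike 'delete from %' or query ilike 'insert into %';
        -- then, from a shell: ps -o pid,cputime,cmd -p <pid> for each backend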