Sunday, January 28, 2024

Explaining a performance regression in Postgres 14

I am trying to explain a performance regression that appears in Postgres 14 during the Insert Benchmark.

The primary problem appears to be extra CPU used by the query planner for DELETE statements whose WHERE clause predicates have constants that fall into either the max or min histogram bucket for a given column. An example is a DELETE statement like the following, where transactionid is the primary key so there is an index on it.

delete from t1 where (transactionid>=100 and transactionid<110)

The table is used like a queue -- inserts are done in increasing order with respect to transactionid, and for every N rows inserted, N rows are deleted to keep the size of the table constant. The rows to be deleted are the N rows with the smallest values for transactionid.
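As a sketch of that pattern (this assumes the table is named t1 as in the DELETE above, that transactionid is filled in from its sequence by default, and it uses made-up values for the other columns, which come from the DDL shown later in this post):

-- each insert gets the next transactionid from the sequence
insert into t1 (dateandtime, cashregisterid, customerid, productid, price, data)
values (now(), 1, 1, 1, 1, 'x');
-- after N inserts, delete the N rows with the smallest transactionid values
delete from t1 where (transactionid>=100 and transactionid<110);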

The problem is worse for IO-bound workloads (see here) than for cached workloads (see here) probably because the extra work done by the query planner involves accessing the index and possibly reading data from storage.

It is always possible I am doing something wrong but I suspect there is a fixable performance regression in Postgres 14 for this workload. The workload is explained here and note that vacuum (analyze) is done between the write-heavy and read-heavy benchmark steps.

There are three issues:

  1. There is only one good index for the DELETE statement, yet the query planner does (too much) work to figure out the selectivity for that index.
  2. When the constants used in WHERE clause predicates fall into the largest or smallest histogram bucket for a column, the query planner reads from the index to figure out the actual min or max value for that column (the histogram bounds can be inspected with the pg_stats query after this list). The code for this is in the function get_actual_variable_range.
  3. Extra work is done while reading from the index because there are too many entries that are eligible for removal but have not yet been removed by vacuum. So the index scan encounters and then skips them until it reaches a visible entry.
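The histogram bucket boundaries that the planner consults are visible in pg_stats. A minimal sketch, assuming the table is named t1 as in the DELETE statements:

select histogram_bounds from pg_stats where tablename = 't1' and attname = 'transactionid';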

Issue #3 is made worse by the workload. The table is used like a queue: inserts get new, ascending values for the PK column from a sequence, while deletes are done from the other end of the table -- each DELETE statement removes the N rows with the smallest values for the PK column. Similar problems can occur with InnoDB and MyRocks -- I know this from experience.
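One way to see how far vacuum has fallen behind is to look at the dead-tuple counts. A minimal sketch, again assuming the table is named t1:

select n_live_tup, n_dead_tup, last_autovacuum from pg_stat_user_tables where relname = 't1';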

I suspect the solution in this case is to not try as hard to figure out selectivity when there is only one good index (fix issue #1), although it might help to do something about issue #2 as well.

Request 1

Can the query planner do less work when there is only one index that should be used? The full DDL for the table is here.

An abbreviated version of the DDL is below; the PK is on transactionid, which gets its values from a sequence.

     Column     |            Type             |
----------------+-----------------------------+
 transactionid  | bigint                      |
 dateandtime    | timestamp without time zone |
 cashregisterid | integer                     |
 customerid     | integer                     |
 productid      | integer                     |
 price          | integer                     |
 data           | character varying(4000)     |
Indexes:
    "pi1_pkey" PRIMARY KEY, btree (transactionid)
    "pi1_marketsegment" btree (productid, customerid, price)
    "pi1_pdc" btree (price, dateandtime, customerid)
    "pi1_registersegment" btree (cashregisterid, customerid, price)
Access method: heap

For a DELETE statement like the following, the only efficient index is pi1_pkey. So I prefer that the query planner do less work to figure that out.

delete from t1 where (transactionid>=100 and transactionid<110)
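A plain explain (without analyze) shows which index the planner picks without executing the delete, and it should show an index scan on pi1_pkey; note that planning alone still calls get_actual_variable_range:

explain delete from t1 where (transactionid>=100 and transactionid<110);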

CPU overhead

When I run the Insert Benchmark there are 6 read-write benchmark steps -- 3 that do range queries as fast as possible and 3 that do point queries as fast as possible. For all of them there are also inserts and deletes done concurrent with the queries, and the writes are rate limited -- first at 100 inserts/s and 100 deletes/s, then at 500 inserts/s and 500 deletes/s, and finally at 1000 inserts/s and 1000 deletes/s. So the work for writes (inserts & deletes) is fixed per benchmark step while the work done by queries is not. Also, for each benchmark step there are three connections -- one for queries, one for inserts, one for deletes.

Using separate connections makes it easier to spot changes in CPU overhead. Below I show the number of CPU seconds for the range query benchmark steps (qr100, qr500, qr1000), where the number indicates the write (insert & delete) rate. Results are provided for Postgres 13.13 and 14.10 from the benchmark I described here (small server, IO-bound).

From below I see two problems. First, the CPU overhead for the delete connection is much larger with Postgres 14.10 for all benchmark steps (qr100, qr500, qr1000). Second, the CPU overhead for the query connection is much larger with Postgres 14.10 for qr1000, the benchmark step with the largest write rate.

Legend
* ins = connection that does inserts
* del = connection that does deletes
* query = connection that does range queries

CPU seconds with 100 inserts/s, 100 deletes/s -> qr100
        ins     del     query
13.13   5       14      1121
14.10   15      187     1148

CPU seconds with 500 inserts/s, 500 deletes/s -> qr500
        ins     del     query
13.13   71      71      1128
14.10   73      1050    1144

CPU seconds with 1000 inserts/s, 1000 deletes/s -> qr1000
        ins     del     query
13.13   135     1113    1129
14.10   151     2912    1906

Debugging after the fact: CPU profiling

I repeated the benchmark for Postgres 13.13 and 14.10, and after it finished I repeated the qr100 benchmark step a few times for each release. The things that I measure here don't match exactly what happens during the benchmark because the database might be in a better state with respect to writeback and vacuum.

While this is far from scientific, I used explain analyze on a few DELETE statements some time after they were first run. The results are here. I repeated the statement twice for each Postgres release, and the planning time for the first explain is 49.985ms for Postgres 13.13 vs 100.660ms for Postgres 14.10.

So I assume the problem is the CPU overhead from the planner and not from executing the statement.
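That kind of measurement can be repeated without actually removing rows by wrapping the statement in a transaction that is rolled back -- a minimal sketch:

begin;
explain (analyze, buffers) delete from t1 where (transactionid>=100 and transactionid<110);
rollback;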

Then I looked at the CPU seconds used by the connection that does deletes after running for 10 minutes: ~50s for Postgres 13.13 vs ~71s for 14.10. So the difference at this point is large, but much smaller than what I report above, which means the things I want to spot via CPU profiling might be harder to spot. Also, if the problem is IO latency rather than CPU overhead then CPU profiling won't be as useful.
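Per-connection CPU numbers and the perf profiles below need the backend pid for the delete connection. A minimal sketch for finding it via pg_stat_activity (the filter on the query text is just a guess at how to pick out that connection):

select pid, state, query from pg_stat_activity where query like 'delete from%';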

This gist has the top-5 call stacks from hierarchical profiling with perf for the connection that does deletes. While there isn't an obvious difference between Postgres 13.13 and 14.10, there is something I don't like -- all stacks are from the query planner and include the function get_actual_variable_range.

IO profiling

It looks like the query planner does more read IO for delete statements in Postgres 14.10 than in 13.13.

From the full benchmark I see the following for the range query benchmark steps, which shows there is more read IO (see the rps column) with Postgres 14.10 for the qr100 and qr500 benchmark steps but not for the qr1000 benchmark step. And in all cases the range query rate (see the qps column) is significantly less with Postgres 14.10.

Legend:
* qps = range queries/s
* rps = read IO requests/s per iostat

        qr100
version qps     rps
13.13   8338.2  166.5
14.10   5822.6  183.5

        qr500
version qps     rps
13.13   8101.7  615.6
14.10   5917.9  885.6

        qr1000
version qps     rps
13.13   7090.1  1682.9
14.10   5139.0  1036.2

