Monday, June 3, 2024

Does disabling full_page_writes reduce variance for Postgres with a cached Insert Benchmark?

This is a follow up to my previous benchmark report. While Postgres does a great job at avoiding performance regressions (performance doesn't get worse over time) it also hasn't gotten better over time with respect to variance on some of the steps of the Insert Benchmark.

Two of the benchmark steps (l.i0, l.i1 -- see here) do deletes in one connection, inserts in the other and the benchmark client does them at the same rate. If deletes are slow then inserts will be slow, and vice versa. The per-second rates for inserts and deletes have too much variance (see here). From the response time histograms it is obvious that the problem is that deletes are slower.

Updates:

  1. based on advice from experts I repeated tests after adding vacuum_index_cleanup=ON to the create table SQL; with that change throughput dropped by ~7% while CPU increased by ~20%
  2. A patch to enforce VISITED_PAGES_LIMIT during get_actual_variable_range makes a huge difference

I am not a fan of vacuum

The issue isn't new to me. The root cause is the overhead from get_actual_variable_range trying to figure out the min value of the PK column by reading from the index, which takes too much time when it must scan and skip many not-visible index entries that will eventually be removed when vacuum runs again. See my posts on this topic. If your writes are clustered in a key range then any query that encounters such a range will be slower than expected, because there can be a long time between committing the writes and vacuum cleaning them up, leaving much work for queries to scan and skip.
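
To make that concrete, below is a minimal sketch (a hypothetical table, not the Insert Benchmark schema) of the pattern: a large delete near the minimum of the PK leaves dead index entries behind until vacuum removes them, and planning a query with a range predicate in that region may invoke get_actual_variable_range, which probes the index and has to skip those entries. The overhead shows up as planning time.

  -- Hypothetical table, not the Insert Benchmark schema
  CREATE TABLE t (pk bigint PRIMARY KEY, v text);
  INSERT INTO t SELECT g, 'x' FROM generate_series(1, 1000000) g;
  ANALYZE t;

  -- Delete the low end of the key range; the index entries remain until vacuum
  DELETE FROM t WHERE pk <= 100000;

  -- Planning a range predicate near the (former) minimum may call
  -- get_actual_variable_range, which probes the PK index and skips the
  -- dead entries left by the delete; the cost appears as Planning Time
  EXPLAIN (ANALYZE) SELECT count(*) FROM t WHERE pk <= 5000;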

Results

So I repeated the Insert Benchmark for Postgres 16.3 with a cached workload and a small server, which is the same setup as reported here. This time I only ran it for two configurations:

  • pg163_def.cx9a2_c8r32 - Postgres 16.3 with the cx9a2_c8r32 config as previously used
  • pg163_def.cx9a2nofpw_c8r32 - Postgres 16.3 with the cx9a2nofpw_c8r32 config, which starts with the cx9a2_c8r32 config and then adds full_page_writes=off (a sketch of the setting change follows this list).
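
For reference, a minimal sketch of how the setting can be toggled; the benchmark config simply adds full_page_writes=off to the config file, and the setting only needs a config reload (not a restart) to take effect.

  -- Equivalent to adding "full_page_writes = off" to postgresql.conf
  ALTER SYSTEM SET full_page_writes = off;
  SELECT pg_reload_conf();   -- takes effect on reload, no restart needed
  SHOW full_page_writes;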

The results are here:
  • Throughput is similar for all benchmark steps
  • KB written to storage per insert (wkbpi) is reduced with full_page_writes=off for the l.i1 and l.i2 benchmark steps that are write heavy. This is expected (a quick way to measure it is sketched after this list).
  • Using full_page_writes=off doesn't fix the throughput variance problem and from the response time histograms the root cause remains the same (delete statements).
  • LWN reported on this recently and I look forward to widespread support for atomic writes across SSD vendors with Linux. Even more so when this doesn't require a proprietary file system or vendor API. With such support everyone can safely use full_page_writes=off and turn off the InnoDB doublewrite buffer.
  • If you scroll down on the IPS charts here you reach more charts for DPS (deletes/s) and max delete response time, both over 1 second intervals. While the max insert response time chart doesn't show variance, the max delete response time chart shows much variance. From the chart there are ~16 spikes at regular intervals over 3237 seconds or one per ~200 seconds.
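
A quick way to confirm where the write savings come from, assuming Postgres 14 or later where the pg_stat_wal view exists: full-page images are counted separately from ordinary WAL records, so wal_fpi should drop sharply with full_page_writes=off.

  SHOW full_page_writes;
  -- wal_fpi counts full-page images, wal_bytes is total WAL generated;
  -- both are cumulative since the last stats reset
  SELECT wal_records, wal_fpi, wal_bytes FROM pg_stat_wal;
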
From logs that I have not shared for the full_page_writes=off result:
  • the l.i0 step started at 11:07 and finished at 11:13
  • the l.x step finished at 11:15
  • the l.i1 step finished at 12:09
    • one checkpoint started and completed while it was running, then a second checkpoint started but did not finish (see below)
    • many vacuums were done during this step and the interval between the spikes on the delete graphs aligns with the vacuum jobs listed below
  • the l.i2 step finished at 14:07
  • vacuum was done for the test table followed by a checkpoint; this finished at 14:09
  • the read-write test steps were done in sequence: qr100, qp100, ..., qr1000, qp1000; each runs for 1 hour and qp1000 finished at 20:10
The full output for logfile is here for the test with full_page_writes=ON and full_page_writes=OFF.

The following is from the test with full_page_writes=OFF.

Output from grep checkpoint logfile is here and output from grep "checkpoint starting" logfile is below:

2024-06-02 11:07:38.612 UTC [1340363] LOG:  checkpoint starting: immediate force wait
2024-06-02 11:37:39.041 UTC [1340363] LOG:  checkpoint starting: time
2024-06-02 12:07:39.938 UTC [1340363] LOG:  checkpoint starting: time
2024-06-02 12:37:39.817 UTC [1340363] LOG:  checkpoint starting: time
2024-06-02 13:07:39.765 UTC [1340363] LOG:  checkpoint starting: time
2024-06-02 13:37:39.609 UTC [1340363] LOG:  checkpoint starting: time
2024-06-02 14:07:39.717 UTC [1340363] LOG:  checkpoint starting: time
2024-06-02 14:08:04.767 UTC [1340363] LOG:  checkpoint starting: immediate force wait
2024-06-02 14:38:05.017 UTC [1340363] LOG:  checkpoint starting: time
2024-06-02 15:08:05.261 UTC [1340363] LOG:  checkpoint starting: time
2024-06-02 15:38:05.265 UTC [1340363] LOG:  checkpoint starting: time
2024-06-02 16:08:05.181 UTC [1340363] LOG:  checkpoint starting: time
2024-06-02 16:38:05.211 UTC [1340363] LOG:  checkpoint starting: time
2024-06-02 17:08:05.436 UTC [1340363] LOG:  checkpoint starting: time
2024-06-02 17:38:05.545 UTC [1340363] LOG:  checkpoint starting: time
2024-06-02 18:08:05.581 UTC [1340363] LOG:  checkpoint starting: time
2024-06-02 18:38:05.401 UTC [1340363] LOG:  checkpoint starting: time
2024-06-02 19:08:05.801 UTC [1340363] LOG:  checkpoint starting: time
2024-06-02 19:38:05.757 UTC [1340363] LOG:  checkpoint starting: time
2024-06-02 20:08:05.629 UTC [1340363] LOG:  checkpoint starting: time

The output from grep vacuum logfile is below and the spikes above for the per-second delete rates appear to align with the vacuum frequency shown below. The first vacuums for the pi1 table are fast and more frequent because index scans: 0 is shown for them -- so they do less work. Eventually vacuums that have index scans: 1 are done every ~4 minutes.

Note that I recently began using index_cleanup ON as an option for the manual vacuums done after the l.i2 step finishes, because otherwise whether indexes get cleaned is left to the current heuristics. Unfortunately, I don't yet know whether there is a way (or whether I need to do it) to do the same for the automatic vacuums that represent the majority of the vacuum jobs listed below.
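
For reference, a minimal sketch of that manual vacuum, assuming the benchmark table is the pi1 table that appears in the log entries below; INDEX_CLEANUP ON forces the index-vacuum phase to run even when the heuristics would otherwise decide to skip it.

  -- Force index cleanup regardless of the skip heuristics
  VACUUM (INDEX_CLEANUP ON, VERBOSE) pi1;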

2024-06-02 11:07:39.879 UTC [1340408] LOG:  automatic aggressive vacuum of table "template1.pg_catalog.pg_attribute": index scans: 0
2024-06-02 11:07:39.888 UTC [1340408] LOG:  automatic aggressive vacuum of table "template1.pg_catalog.pg_class": index scans: 0
2024-06-02 11:07:39.892 UTC [1340408] LOG:  automatic aggressive vacuum of table "template1.pg_catalog.pg_description": index scans: 0
2024-06-02 11:07:39.896 UTC [1340408] LOG:  automatic aggressive vacuum of table "template1.pg_catalog.pg_depend": index scans: 1
2024-06-02 11:07:46.574 UTC [1340428] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:07:51.646 UTC [1340430] LOG:  automatic aggressive vacuum of table "template1.pg_catalog.pg_statistic": index scans: 1
2024-06-02 11:07:56.670 UTC [1340433] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:08:06.671 UTC [1340437] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:08:16.673 UTC [1340447] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:08:26.674 UTC [1340451] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:08:36.675 UTC [1340464] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:08:46.678 UTC [1340474] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:08:56.679 UTC [1340478] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:09:06.680 UTC [1340482] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:09:16.681 UTC [1340492] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:09:26.681 UTC [1340496] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:09:36.782 UTC [1340509] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:09:46.882 UTC [1340519] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:09:56.982 UTC [1340523] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:10:06.983 UTC [1340528] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:10:16.985 UTC [1340538] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:10:26.986 UTC [1340542] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:10:37.088 UTC [1340555] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:10:47.089 UTC [1340565] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:10:57.091 UTC [1340569] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:11:07.092 UTC [1340573] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:11:27.213 UTC [1340588] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:11:47.316 UTC [1340611] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:12:07.319 UTC [1340619] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:12:30.655 UTC [1340634] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:12:50.621 UTC [1340657] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:13:10.625 UTC [1340671] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 0
2024-06-02 11:17:00.374 UTC [1341771] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 11:20:37.635 UTC [1341979] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 11:24:08.272 UTC [1342191] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 11:27:32.919 UTC [1342394] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 11:31:36.540 UTC [1342628] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 11:35:01.656 UTC [1342836] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 11:38:42.656 UTC [1343055] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 11:42:38.499 UTC [1343281] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 11:45:52.954 UTC [1343488] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 11:49:16.466 UTC [1343691] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 11:52:49.834 UTC [1343896] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 11:56:09.758 UTC [1344102] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 11:59:40.492 UTC [1344308] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 12:02:56.674 UTC [1344514] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 12:06:14.134 UTC [1344715] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 12:09:27.171 UTC [1344905] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 12:39:11.117 UTC [1347226] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 13:07:49.413 UTC [1349013] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 13:36:56.651 UTC [1350908] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 14:07:32.120 UTC [1352802] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 17:09:33.324 UTC [1586393] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 18:21:56.439 UTC [1811927] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 19:03:48.109 UTC [1813892] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1
2024-06-02 19:45:44.782 UTC [1946850] LOG:  automatic vacuum of table "ib.public.pi1": index scans: 1

Update 1: try vacuum_index_cleanup=ON

A new performance report is here after I changed the benchmark client to add WITH (vacuum_index_cleanup=ON) to the create table SQL. Throughput was reduced by ~7% for many of the benchmark steps with that change. The per-second charts for inserts/s, deletes/s, insert response time and delete response time for the l.i1 benchmark step are here and some of the charts look a bit worse with vacuum_index_cleanup=ON.
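
For reference, a minimal sketch of the changed DDL; the column list is hypothetical (not the exact Insert Benchmark schema) and the WITH clause is the relevant part.

  -- Column list is hypothetical; the storage parameter is the relevant part
  CREATE TABLE pi1 (
    transactionid bigint PRIMARY KEY,
    dateandtime   timestamp,
    data          varchar(4000)
  ) WITH (vacuum_index_cleanup = ON);
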
The DBMS tested are:
  • pg163_def.cx9a2_c8r32 - full_page_writes=ON, doesn't use vacuum_index_cleanup
  • pg163_def.cx9a2vic_c8r32 - full_page_writes=ON, does use vacuum_index_cleanup
  • pg163_def.cx9a2nofpw_c8r32 - full_page_writes=OFF, doesn't use vacuum_index_cleanup
  • pg163_def.cx9a2nofpwvic_c8r32 - full_page_writes=OFF, does use vacuum_index_cleanup
From the metrics for the l.i1 benchmark step that does random writes, the CPU per insert (cpupq) is at least 1.2X larger with vacuum_index_cleanup=ON. Other metrics changed but that was the largest.

From a simple analysis of logfile, I count the number of automatic vacuum entries with index scans: 1 and index scans: 0:
  • pg163_def.cx9a2_c8r32 - 27 for scans: 0 vs 24 for scans: 1
  • pg163_def.cx9a2vic_c8r32 - 28 for scans: 0 vs 24 for scans: 1
  • pg163_def.cx9a2nofpw_c8r32 - 27 for scans: 0 vs 24 for scans: 1
  • pg163_def.cx9a2nofpwvic_c8r32 - 28 for scans: 0 vs 24 for scans: 1

Update 2: the patch

There is a patch to enforce VISITED_PAGES_LIMIT during get_actual_variable_range calls. The patch as written keeps it at 100 and I tried it with two variations -- as-is (=100) and then I reduced it to 10. Then I repeated the Insert Benchmark on the small server with a cached database for 3 binaries:
  • pg17beta1_def.cx9a2_c8r32 - Postgres 17beta1 with the cx9a2_c8r32 config
  • pg17beta1_def_hack10.cx9a2_c8r32 - Postgres 17beta1 with the cx9a2_c8r32 config and a patch with VISITED_PAGES_LIMIT=10 enforced during get_actual_variable_range
  • pg17beta1_def_hack100.cx9a2_c8r32 - Postgres 17beta1 with the cx9a2_c8r32 config and a patch with VISITED_PAGES_LIMIT=100 enforced during get_actual_variable_range
The patch improves throughput on l.i1 by ~5X and on l.i2 by ~20X because it reduces the CPU overhead of the optimizer during DELETE statements and DELETE throughput was the bottleneck.
  • the performance summary is here which shows the ~5X and ~20X improvements
  • the charts for per-second throughput and response time are here and the improvement is obvious. The insert and delete rates have much variance without the patch while they look much better with the patch (here and here)
  • response time histograms are here and there is a huge improvement for DELETE statements
