Thursday, May 26, 2022

The benefit of lz4 and zstd for Postgres WAL compression

In 2020 I published benchmark results to understand the impact of DBMS tuning options for the insert benchmark with MySQL and Postgres (here and here). One of the results was that wal_compression = pglz used more CPU than expected. I asked aloud whether support for lz4 would help.

Well, the Postgres community has delivered, and in the upcoming Postgres version 15 either lz4 or zstd can be used in place of pglz. Docs are here. Support for lz4 to compress TOAST was added in version 14.
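
Enabling it is one setting. A minimal sketch, assuming a superuser session (wal_compression takes effect on a config reload, no restart needed, and lz4/zstd require a Postgres build with the corresponding library):

    ALTER SYSTEM SET wal_compression = 'lz4';  -- or 'zstd', 'pglz', 'on', 'off'
    SELECT pg_reload_conf();
    SHOW wal_compression;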

tl;dr - wal_compression=lz4 or =zstd are a big improvement over =pglz

Details

Postgres optionally writes full-page images to the WAL the first time a page is modified after a checkpoint. This provides protection from torn-page writes. InnoDB solves the same problem via the doublewrite buffer. An LSM like RocksDB doesn't modify previously written database pages, so it doesn't need anything similar.
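
One way to see how much of the WAL volume comes from these page images is the pg_stat_wal view (added in Postgres 14), which counts full-page images separately. A quick sketch:

    -- wal_fpi counts full-page images, wal_bytes is total WAL volume
    SELECT wal_records, wal_fpi, wal_bytes FROM pg_stat_wal;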

Writing page images to the WAL increases the rate at which log files fill and must be rotated. A possible side effect is triggering checkpoints more often, which means more page writeback IO, and that can hurt performance.
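
Whether checkpoints are being triggered by WAL volume rather than by the timer can be checked from pg_stat_bgwriter (where these counters live in Postgres 15). A sketch:

    -- checkpoints_timed: started by checkpoint_timeout
    -- checkpoints_req: requested for other reasons, e.g. max_wal_size was reached
    SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint FROM pg_stat_bgwriter;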

Compressing those page images can reduce the write rate, which then reduces the rate at which checkpoints happen (and also the rate at which checkpoint IO happens). Prior to Postgres version 15 the only option for wal_compression was pglz. Postgres version 15 adds new options -- lz4 and zstd. My github pages repo has benchmark reports for the Insert Benchmark run on a small server with IO-bound and cached workloads to understand the impact of this option.
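
One way to quantify the WAL write rate for a given workload is to diff the WAL insert position before and after a run. A psql sketch (\gset stores the query result in a psql variable):

    -- capture the WAL insert position before the run
    SELECT pg_current_wal_insert_lsn() AS start_lsn \gset
    -- ... run the workload ...
    -- bytes of WAL generated by the workload
    SELECT pg_wal_lsn_diff(pg_current_wal_insert_lsn(), :'start_lsn');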

Results for IO-bound

This section has results for a benchmark where the database is much larger than RAM. The report is here. For this benchmark I used four configurations:

  • cx5 - wal_compression=off
  • cx5a - wal_compression=pglz
  • cx5b - wal_compression=lz4
  • cx5c - wal_compression=zstd

Summarizing the summary from the (shell script generated) report:

  • Loads in PK order without secondary indexes have similar performance regardless of the wal_compression setting. This isn't a surprise, as PK-order loads should have the least writeback per insert. The metrics are here.
  • Create secondary index is ~1.7X faster with wal_compression=lz4 compared to =pglz. From the metrics, with pglz the CPU overhead is almost 2X larger than with lz4 (see the cpupq column). The performance with =lz4 is the same as with =off.
  • Inserts that are in PK order, but in random order for the 3 secondary indexes, are faster with =lz4 or =zstd than with =off or =pglz, where =lz4 is ~1.4X faster than =pglz. From the metrics, the use of =lz4 or =zstd reduces IO somewhat (see the wkbpi column) and CPU a lot (see the cpupq column).

Results for cached

This section has results for a benchmark where the database fits in the Postgres buffer pool. The report is here. From the summary, the results are similar to the IO-bound results above. The CPU and IO improvements from lz4 and zstd are also similar to the IO-bound results -- see the metrics sections for create secondary index (l.x) and load with secondary index maintenance (l.i1).

7 comments:

  1. I wonder if Postgres could be made smarter regarding writing page images to the WAL. MariaDB Server 10.5 and later will skip the doublewrite buffer when writing out a page that was fully initialized since the previous checkpoint. In that case, recovery can reconstruct the page based on the log records, without reading it from the data file at all.

    To my understanding, the InnoDB recovery in MySQL always reads data pages for which any log records were found since the latest checkpoint. That may slow down recovery and make it more prone to failure. I would guess that Postgres recovery never reads any data pages; all information should be available in the WAL.

    1. I lack the expertise on Postgres internals to provide a useful response.

    2. We already don't emit full page images when a page was fully initialized since the last checkpoint. The rule basically is that we emit an FPW if a) the modification doesn't start from an empty / uninitialized buffer and b) the page hasn't been modified since the last checkpoint.

    3. Addendum: That's the reason why there's no performance effect of compression in l.i0.

    4. Thank you for explaining that.

    5. Thank you, Andres Freund. So, Postgres and MariaDB are implementing similar logic with regard to avoiding full page writes or the doublewrite buffer.

      When it comes to concurrently writing clients, I assume that log compression will require that a single thread is responsible for writing the log. That might be a scalability bottleneck.

      MariaDB Server 10.8 replaced log_sys.mutex with an rw-lock, so that multiple threads may concurrently append their log to the global buffer, which may reside in persistent memory, or in regular RAM from which it is ultimately written to the log file.

      A committing mini-transaction acquires a shared log_sys.latch and briefly acquires a mutex to determine the write position. During the copying, only the shared rw-lock will be held. A log checkpoint will hold an exclusive log_sys.latch, to ensure that any "holes" at the end of the log have been filled.

    6. Marko - if you ever run the insert benchmark I am happy to consult

