Posts

Showing posts from May, 2022

The benefit of lz4 and zstd for Postgres WAL compression

In 2020 I published benchmark results to understand the impact of DBMS tuning options for the insert benchmark with MySQL and Postgres (here and here). One of the results was that wal_compression = pglz used more CPU than expected. I asked aloud whether support for lz4 would help. Well, the Postgres community has delivered, and either lz4 or zstd can be used in place of pglz in the upcoming Postgres version 15. Docs are here. Support for lz4 to compress TOAST was added in version 14.

tl;dr - wal_compression=lz4 or =zstd are a big improvement over =pglz

Details

Postgres optionally writes page images (the pre-image) to the WAL the first time a page is to be written back after a checkpoint. This provides protection from torn-page writes. InnoDB solves this via the doublewrite buffer. An LSM like RocksDB doesn't modify previously written database pages, so it doesn't have something similar. Writing pages to the redo log increases the frequency at which log files must be rotated…
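For reference, a minimal sketch of the server setting involved, assuming Postgres 15; lz4 and zstd are only available when the server was built with the matching compression libraries:

    # postgresql.conf (sketch, assuming Postgres 15)
    # Valid values include off, pglz, lz4 and zstd; lz4/zstd need build-time support.
    wal_compression = lz4    # or zstd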

Defining "X has too many options"

This is another post in my Well, actually series. Sadly, I didn't think of naming that series until now, so the previous posts might take a while to identify. This post is inspired by the claims I read that RocksDB has too many options. I agree that it does, but the claim is true for every database that I have used. This claim is too vague. I can make it less vague via multiple levels:

level 0 - X has too many options and they all matter
level 1 - X has too many options and the ones that matter change based on the workload. Figuring out which options matter per workload can be expensive.
level 2 - X has too many options but only a small subset matters regardless of workload. Figuring out which options matter here can be expensive but is usually a one-time cost. I have done some option sensitivity analysis benchmarks for MyRocks and RocksDB.
level 3 - X has too many options but none of them matter

Any system in level 0 is going to be a problem. RocksDB is between levels 1 and 2.

RocksDB internals: LRU

RocksDB uses LRU to manage the size of the block cache. The LRU can suffer from mutex contention, so I describe the implementation in detail here. An interesting question is whether clock is a better choice than LRU, assuming it suffers less from mutex contention. Until that is answered I hope to determine how much improvement can be made by changing the code to not move a block to the head of the LRU after each reference.

Updates: fix a bogus claim about cache_index_and_filter_blocks

The big picture

The LRU is sharded and the number of shards is 2**num_shard_bits, where num_shard_bits is an argument to the LRUCacheOptions constructor. With db_bench you can set the --cache_numshardbits option; the default is 6 (64 shards). The use of the LRUCacheOptions constructor in db_bench is here. The cache is created by a call to NewLRUCache. When cache_index_and_filter_blocks is true then index and filter blocks are stored in the block cache. Doing this provides better control over the amount…
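A minimal sketch of wiring this up through the RocksDB C++ API, assuming a recent release; the 8 GB capacity and database path are illustrative, not from the post:

    #include <memory>
    #include <rocksdb/cache.h>
    #include <rocksdb/db.h>
    #include <rocksdb/options.h>
    #include <rocksdb/table.h>

    int main() {
      // Sharded LRU: 8 GB capacity (illustrative), 2**6 = 64 shards, which
      // matches the db_bench default of --cache_numshardbits=6.
      rocksdb::LRUCacheOptions cache_opts;
      cache_opts.capacity = 8UL << 30;
      cache_opts.num_shard_bits = 6;
      std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(cache_opts);

      rocksdb::BlockBasedTableOptions table_opts;
      table_opts.block_cache = cache;
      // Keep index and filter blocks in the block cache so their memory use
      // is bounded by the cache capacity.
      table_opts.cache_index_and_filter_blocks = true;

      rocksdb::Options options;
      options.create_if_missing = true;
      options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/lru_example", &db);
      if (!s.ok()) return 1;
      delete db;
      return 0;
    }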

Performance bugs: p50 vs p99

One way to classify performance bugs is p50 vs p99.

- A p50 performance bug occurs for all workloads.
- A p99 performance bug only happens for some workloads or in some contexts. By some workloads I mean it might only occur during hot backup or load. By some contexts I mean it might only happen under high concurrency or load spikes which don't represent the steady state.

The impact from fixing a p50 perf bug is easy to predict. Everything gets slightly faster and/or more efficient. But p99 perf bugs are outliers -- they don't make everything slower, so fixing them doesn't make everything faster. The impact from fixing a p99 bug is less obvious. On average there might be little impact with respect to average throughput and efficiency. There can be a huge impact with respect to response time goals. The challenge is to quantify the benefit, and I have learned over the years that motivating the fix for p99 perf bugs is a hard problem. For MySQL I wrote many blog posts to market open…
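To make the p50/p99 terminology concrete, a small sketch (not from the post) that computes those percentiles from a set of hypothetical response-time samples:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Return the p-th percentile of the samples (0 < p < 100) using a simple
    // rank-based method: sort, then index at floor(p/100 * n), clamped.
    double percentile(std::vector<double> samples, double p) {
      std::sort(samples.begin(), samples.end());
      auto idx = static_cast<std::vector<double>::size_type>(p / 100.0 * samples.size());
      if (idx >= samples.size()) idx = samples.size() - 1;
      return samples[idx];
    }

    int main() {
      // Hypothetical latencies in milliseconds: mostly fast, a few slow outliers.
      std::vector<double> ms = {1.0, 1.1, 1.1, 1.1, 1.2, 1.2, 1.2, 1.3, 9.5, 48.0};
      std::printf("p50 = %.1f ms\n", percentile(ms, 50));  // middle of the distribution
      std::printf("p99 = %.1f ms\n", percentile(ms, 99));  // the tail
      // A p50 bug shifts the whole distribution; a p99 bug mostly moves the tail.
      return 0;
    }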

Using mmap with RocksDB

RocksDB inherited support for mmap from LevelDB. I was curious how performance with mmap compared to buffered IO (pread) and ran benchmarks with db_bench. I haven't had great experiences with mmap for database workloads in the past but tried to be fair.

tl;dr
- Don't use mmap for IO-bound workloads until issue 9931 is fixed.
- For in-memory databases performance with mmap can be better than pread if db_bench is run with --cache_index_and_filter_blocks=false. Setting --verify_checksum=false provides a smaller performance boost.

Longer than tl;dr

If the database doesn't fit in memory then issue 9931 means there will be significant overfetching on reads from storage and this will ruin performance. I assume the fix is to add madvise calls similar to the posix_fadvise calls used for buffered IO (pread). With --cache_index_and_filter_blocks=false the RocksDB process might use more memory than you expect, so be careful while chasing performance. When it is true and --mmap_read=true…
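A minimal sketch of the RocksDB options behind those db_bench flags, assuming a recent release; the database path and key are illustrative:

    #include <string>
    #include <rocksdb/db.h>
    #include <rocksdb/options.h>
    #include <rocksdb/table.h>

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      // Equivalent of db_bench --mmap_read=true: read SST files via mmap
      // instead of buffered IO (pread).
      options.allow_mmap_reads = true;

      // Equivalent of db_bench --cache_index_and_filter_blocks=false, the
      // setting that gave the better mmap results for in-memory workloads.
      rocksdb::BlockBasedTableOptions table_opts;
      table_opts.cache_index_and_filter_blocks = false;
      options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/mmap_example", &db);
      if (!s.ok()) return 1;

      // Equivalent of db_bench --verify_checksum=false for reads.
      rocksdb::ReadOptions ropts;
      ropts.verify_checksums = false;
      std::string value;
      s = db->Get(ropts, "some_key", &value);  // NotFound is expected on an empty db

      delete db;
      return 0;
    }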

To compress or not compress

What are the benefits and costs from compressing the smaller levels of an LSM tree? You won't save much space from compressing them unless the LSM tree doesn't have data beyond L2. I usually suggest no compression for levels 0, 1 and 2. Here I share results from a benchmark to show the impact from using compression for those levels.

tl;dr - if levels 0, 1 and 2 are compressed:

Benefits
- less write-amplification

Costs
- there might be more write stalls
- write throughput might be lower if there are more write stalls
- time per compaction/flush job is 2X larger for flush to L0 and compaction to L1/L2
- compaction CPU overhead is increased

By "might be more write stalls" I mean that this is only an issue when the write rate is large. For the results here I ran one test with the write rate limited to 10 MB/s and there were no write stalls whether or not L0/L1/L2 were compressed. Then I ran a test without a limit on the write rate and stalls were worse when L0/L1/L2 were compressed. Quantifying…
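A minimal sketch of how per-level compression can be configured in RocksDB, leaving L0, L1 and L2 uncompressed and compressing the larger levels; this is an assumed setup for illustration, not the exact options used in the benchmark:

    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      // One entry per level (the default num_levels is 7): no compression for
      // L0/L1/L2, LZ4 for the larger levels.
      options.compression_per_level = {
          rocksdb::kNoCompression,   // L0
          rocksdb::kNoCompression,   // L1
          rocksdb::kNoCompression,   // L2
          rocksdb::kLZ4Compression,  // L3
          rocksdb::kLZ4Compression,  // L4
          rocksdb::kLZ4Compression,  // L5
          rocksdb::kLZ4Compression,  // L6
      };
      // Optionally use a stronger codec for the bottommost level.
      options.bottommost_compression = rocksdb::kZSTD;

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/compress_example", &db);
      if (!s.ok()) return 1;
      delete db;
      return 0;
    }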