Monday, July 10, 2023

Write stalls with RocksDB from a too small block cache and/or too many block cache shards

The RocksDB block cache is sharded to reduce mutex contention. There are two implementations of the RocksDB block cache - LRU (old) and Hyper (new). I wrote about LRU internals here. While I have yet to write about hyper clock cache internals, I have shared results to show that performance with it is excellent.

tl;dr

  • Don't use a too small block cache but sometimes that cannot be avoided. By default InnoDB uses 8 buffer pool instances while MyRocks is likely to use 64 block cache shards. Hopefully RocksDB works fine with 8 block cache shards, at least when using the Hyper clock cache.
  • Manually set the number of cache shards to avoid shards that are too small. Alas, MyRocks does not allow this to be set and I filed a feature request to change that.
  • I never show precisely how slow reads from L0 SSTs lead to write stalls. I have shown that write stalls are much worse when there are too many slow reads from the L0. I can speculate at the cause but for now correlation is causation.

Sharding

Both the LRU and Hyper caches are sharded but Hyper doesn't need as many shards. The MakeSharedCache function creates an instance of an LRU cache or a Hyper cache. If you haven't set the value of num_shard_bits then the number of shards is computed for you as a function of the size of the cache -- see the calls to GetDefaultCacheShardBits for LRU and for Hyper. The minimum size of a shard is 512KB for LRU and 32MB for Hyper. Unless you manually set it via the num_shard_bits option, the number of shards (and shard bits) is computed by the following, where the number of shards is 2^num-shard-bits:

num-shards ~= min(64, sizeof(block-cache) / min-shard-size)
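
Here is a minimal sketch of that default computation, written to mirror the description above rather than quote the RocksDB source (DefaultCacheShardBits is my name for it; min_shard_size would be 512KB for LRU and 32MB for Hyper):

#include <cstddef>  // size_t

// Sketch of the GetDefaultCacheShardBits logic described above, not the
// exact RocksDB source. min_shard_size is 512KB for LRU and 32MB for Hyper.
int DefaultCacheShardBits(size_t capacity, size_t min_shard_size) {
  int num_shard_bits = 0;
  size_t num_shards = capacity / min_shard_size;
  // Each extra bit doubles the shard count, capped at 6 bits (64 shards).
  while (num_shards >>= 1) {
    if (++num_shard_bits >= 6) {
      return num_shard_bits;
    }
  }
  return num_shard_bits;
}

For example, this returns 5 bits (32 shards) for a 1G Hyper cache and caps at 6 bits (64 shards) for a 4G cache with either implementation.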

This is about performance problems I had with MyRocks and the insert benchmark. MyRocks lets you choose between the LRU cache and the Hyper cache -- LRU is the default, Hyper is enabled by rocksdb_use_hyper_clock_cache=ON. But MyRocks doesn't let you choose the number of cache shards and by default RocksDB will try to use a lot of shards. You get 64 shards once the block cache size is >= 32M if using the LRU cache and >= 2G if using the Hyper cache.
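
For anyone embedding RocksDB directly (again, MyRocks does not expose this), here is a sketch that sets the shard count explicitly. The cache size and shard count are illustrative, and it assumes a recent RocksDB release where HyperClockCacheOptions treats estimated_entry_charge=0 as automatic sizing:

#include <memory>

#include <rocksdb/cache.h>
#include <rocksdb/table.h>

// Sketch: a 4GB block cache with 8 shards (num_shard_bits=3), so each shard
// is 512MB. The capacity and shard count are illustrative.
std::shared_ptr<rocksdb::Cache> MakeBlockCache(bool use_hyper) {
  const size_t capacity = 4ULL << 30;  // 4GB
  const int num_shard_bits = 3;        // 2^3 = 8 shards
  if (use_hyper) {
    // Assumes a recent RocksDB where estimated_entry_charge=0 requests
    // automatic sizing for the Hyper clock cache.
    rocksdb::HyperClockCacheOptions opts(
        capacity, /*estimated_entry_charge=*/0, num_shard_bits);
    return opts.MakeSharedCache();
  }
  return rocksdb::NewLRUCache(capacity, num_shard_bits);
}

// Wire the cache into the table factory used by the DB.
rocksdb::BlockBasedTableOptions MakeTableOptions() {
  rocksdb::BlockBasedTableOptions table_opts;
  table_opts.block_cache = MakeBlockCache(/*use_hyper=*/true);
  return table_opts;
}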

With more shards there is less memory per shard. I run some benchmarks with a 4G block cache to stress the DBMS buffer/block cache code (for MyRocks, InnoDB and Postgres). That means I get 64 shards and 64M/shard (4G / 64 = 64M). On smaller servers I use a 1G block cache and get 64 shards (16M/shard) with LRU cache and 32 shards (32M/shard) with Hyper cache.

A conspiracy!

When the size of a block cache shard is X MB, putting an object of size ~X MB into the shard will wipe (evict) everything else in that shard. If this happens frequently then it can be a big and difficult-to-diagnose performance problem.

The common source of large objects is the filter block for a large SST. But my tests use leveled compaction with target_file_size_base=64M. This means that the size of an SST should be ~64M. With 64-byte rows and 2X compression there will be ~2M KV pairs in an SST. Assuming 10 bits/key for the bloom filter, the filter block size should be ~2.5M (the arithmetic is spelled out below). And while putting an object of size 2.5M is going to evict some things from a 64M cache shard, it won't evict everything.
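
Spelling out that estimate:

64MB SST * 2X compression     ~= 128MB of uncompressed KV pairs
128MB / 64 bytes per KV pair  ~= 2M KV pairs
2M keys * 10 bits/key          = 20M bits ~= 2.5MB for the filter block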

However with leveled compaction there can be intra-L0 compaction (input is from L0, output written to L0) which is used when the L0 has too many SSTs but L0->L1 compaction isn't possible (because L0->L1 is in progress or because the inputs from L1 are already busy with L1->L2 compaction jobs). When intra-L0 runs it can create SSTs in L0 that have a size up to max_compaction_bytes (25 * target_file_size_base) and in my case that is 1600MB. An SST that is 25X larger than 64M can have a filter block that is 25X larger than the 2.5M I estimated above (25 * 2.5M = 62.5M). And objects of size 62.5M will evict almost everything else in a cache shard.

Notes on intra-L0 compaction are here.

Is this true or truthy?

So far there is much speculation from me, so I try to provide more proof here. While running the l.i1 benchmark step from the insert benchmark, with delete-per-insert enabled to avoid growing the size of the table, I measure the worst-case response time for an insert. The l.i1 benchmark step is explained here and does inserts + deletes for a table with a PK index and 3 secondary indexes.

Here is one example of worst-case response time for inserts by LRU block cache size. There is a problem at 4GB which doesn't occur when the cache size is >= 8GB. The server has 256G of RAM, the database is cached by the OS but not by RocksDB and the benchmark uses 24 clients with 3 connections/client (1 is busy, the other 2 have think time).

cache   response
GB      time(s)
 4      23.7
 8       2.5
12       1.7
16       2.5
20       2.5
24       2.7
28       2.1
32       1.8

How do I debug this? First, I have some experience with this problem of large objects wiping the block cache. One thing I see is much worse worst-case response time for file reads from L0 versus larger levels -- see here. The worst case is between 380ms and 570ms for L0, between 33ms and 50ms for L3 (logical L1), between 4.4ms and 6.6ms for L4 (logical L2), between 22ms and 33ms for L5 (logical L3), and between 1.9ms and 2.9ms for L6 (logical L4). From the read response time distributions ~0.4% of all reads from L0 take at least 9.9ms. The response time distributions for levels larger than L0 are much better as are the distributions for L0 when the block cache is >= 8G -- see here for the L0 response time distribution when the block cache is 8G.

Other details:

  • From LOG files I see that the largest/slowest intra-L0 compaction jobs are similar between tests that use a 4G and 8G block cache.
  • From compaction IO stats I see a big change from 4G to 8G+ in the stats for compaction into L3 (the logical L1).
  • From counters I see a more than 10X difference in rocksdb.block.cache.filter.miss. It is 868,992 for the 4G block cache vs 74,818 for the 8G block cache (a sketch for reading that counter is below).
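
For reference, here is a minimal sketch of reading that counter when embedding RocksDB directly. The ticker rocksdb.block.cache.filter.miss corresponds to BLOCK_CACHE_FILTER_MISS in the C++ API; the DB path is illustrative:

#include <cstdint>
#include <cstdio>

#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/statistics.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Enable statistics so tickers like BLOCK_CACHE_FILTER_MISS are counted.
  options.statistics = rocksdb::CreateDBStatistics();

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/example_db", &db);
  if (!s.ok()) return 1;

  // ... run the workload here ...

  uint64_t filter_miss =
      options.statistics->getTickerCount(rocksdb::BLOCK_CACHE_FILTER_MISS);
  printf("block cache filter misses: %llu\n",
         (unsigned long long)filter_miss);

  delete db;
  return 0;
}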

How do you fix this?

If there were no intra-L0 compaction then there would be no too-large SSTs in the L0. But there is no option to disable intra-L0 compaction and a future blog post will show that perf suffers if you disable it (via a hacked version of MyRocks). In theory you can reduce the value of max_compaction_bytes to limit the max size of an L0 SST produced by intra-L0, but that will also impact L0->L1 compactions.
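
For what it's worth, here is a sketch of where those knobs live in the RocksDB C++ Options. The values are illustrative and, as noted above, lowering max_compaction_bytes also affects L0->L1 compactions:

#include <rocksdb/options.h>

// Sketch: the knobs discussed above. By default max_compaction_bytes is
// 25 * target_file_size_base, which is where the 1600MB figure comes from
// (25 * 64MB).
rocksdb::Options MakeCompactionOptions() {
  rocksdb::Options options;
  options.target_file_size_base = 64ULL << 20;  // 64MB
  // Illustrative: cap compaction output size at 4 * 64MB instead of 25 * 64MB.
  options.max_compaction_bytes = 256ULL << 20;
  return options;
}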

The best advice is:

  • Don't use a too small block cache but sometimes that cannot be avoided.
  • Manually set the number of cache shards to avoid shards that are too small. Alas, MyRocks does not allow this to be set and I filed a feature request to change that.
