Thursday, June 22, 2023

Universal Compaction in RocksDB and me

I almost always use leveled compaction with RocksDB and rarely use universal. Universal is useful for many workloads but takes more time for users to get working. One overview of RocksDB compaction is here and details on universal are here. A comparison of RocksDB universal with other implementations of tiered is here.

If you are a pessimist this is a rant. If you are an optimist this is a short list of work that can be done to improve universal compaction.

What RocksDB calls universal compaction is called tiered compaction by every other LSM. I wish RocksDB hadn't created a new name as that is a source of confusion.

Updates

  • I added the Trivial Move section with stats to show write amp during fillseq


How it began

Universal started as a clever hack: reuse much code from leveled but keep all SSTs in the L0. Alas, this has a bad side-effect. Bloom filters and block indexes are per-SST and not paged (paged == split into ~4kb pieces) when cached. When an SST is huge then the bloom filter and block index blocks for that SST are also large -- reading and decompressing them takes more time, they are cached as-is, and putting a large thing into the block cache frequently displaces more useful info. The best practice at the time of using 64 shards for the block cache (to reduce mutex contention) made the problem worse because each shard is smaller -- 64 shards means each shard gets 1/64th of the block cache memory.
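One mitigation available in a modern RocksDB is to page the per-SST metadata via partitioned filters and two-level indexes, and to use fewer block cache shards. The sketch below shows the relevant options; the cache size and shard count are examples, not recommendations:

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

// A sketch: page per-SST metadata so huge SSTs don't produce huge cached blocks.
rocksdb::Options SketchTableOptions() {
  rocksdb::BlockBasedTableOptions table_opts;

  // Fewer shards means larger shards, so one big block is less likely to
  // displace everything in its shard. 2^4 = 16 shards, 8 GB total (examples).
  table_opts.block_cache = rocksdb::NewLRUCache(8ULL << 30, 4);

  // Partitioned filters and two-level indexes split the per-SST filter and
  // index blocks into ~4kb pieces that are read and cached independently.
  table_opts.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
  table_opts.partition_filters = true;
  table_opts.index_type = rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
  table_opts.metadata_block_size = 4096;

  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
  return options;
}
```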

Large SSTs are expected with universal, thus this was a common perf problem with it, and the problems weren't easy to diagnose. The solution was to use levels beyond the L0 for the largest SSTs so that one large logical sorted run can be stored on level N, split into many smaller physical SSTs, and then the per-SST bloom filter and index blocks are smaller. If you think of sorted runs with tiered compaction as being ordered in a list (run-1, run-2, ..., run-6) and a leveled LSM tree with levels L0 to L5, then L5 can store run-6 split into many SSTs, L4 might store run-5 split into many SSTs, and run-1 to run-4 might all be in L0 with each using just one SST.
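The run-to-level mapping described above is what universal does when given more than one level. A minimal sketch of the options involved:

```cpp
#include <rocksdb/options.h>

// A sketch: universal compaction that can use levels beyond the L0 so the
// largest sorted runs are split into many smaller SSTs.
rocksdb::Options SketchUniversalWithLevels() {
  rocksdb::Options options;
  options.compaction_style = rocksdb::kCompactionStyleUniversal;
  // With num_levels > 1 the largest sorted run can live in the last level,
  // split into many SSTs, keeping per-SST filter and index blocks small.
  options.num_levels = 7;
  return options;
}
```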

Performance problems

Common performance problems with universal include:

  • transient space-amp
    • transient space-amp is large during compaction (true for all tiered, until they add incremental) but is larger in practice than you expect because any SST created after a long-running compaction has started cannot be dropped until the long-running compaction ends. See issue 10533.
  • weak limit on space-amp
    • while leveled compaction has a strong limit on worst-case space amp, universal compaction does not and this is a different issue than the one with transient space-amp listed above. See issue 9561.
  • single-threaded compaction
    • a compaction job is single-threaded unless subcompactions are enabled (they are disabled by default), and with universal the amount of data processed by one compaction job can be huge (tens of GB or more), which means the equivalent of major compaction in RocksDB can take many hours. Subcompactions allow some compaction jobs to be multi-threaded. Alas, subcompaction threads are not counted against the max_background_jobs setting and you can easily have too many threads running when using this feature (issue 9291 is still open). A few more details on subcompactions are here.
  • trivial move is better with leveled
    • Inserts in key order with leveled frequently have minimal (=2) write-amplification (write to WAL, write during memtable flush) thanks to the trivial move optimization (see here and here). While trivial move can be enabled for universal, it might not work as well -- that is based on what I reported in this issue and eventually I will revisit it.
  • compaction is bursty
    • With leveled compaction it is likely that compaction jobs are running all of the time and each compaction job processes a not-too-large amount of data (no more than a few GB). If you haven't enabled subcompactions and have set max_background_jobs to a reasonable value then you can assume ~X threads will always be busy doing compaction where X = max_background_jobs. There is some variance because the amount of data processed for an L0->L1 compaction is likely to be larger than for an Ln->Ln+1 compaction where n > 0, but that usually isn't a big deal. The meaning of busy also has variance: L0->L1 compaction is less likely to be IO-bound (reads should come from the OS page cache if not using O_DIRECT), while compaction into larger levels can have more CPU overhead (when smaller levels use no or lighter (lz4) compression and larger levels use zstd) and is also more likely to be IO-bound.
    • With universal compaction there can also be some compaction jobs running all of the time, although not as many as with leveled. But there can also be bursts of work because compaction jobs are started less often (less write-amp means less compaction) and the jobs can run for a long time. These bursts can be a surprise if you didn't leave spare CPU and IO capacity to handle them (kind of like not paying taxes gives you more money to spend, until the tax bill arrives).
Several of these problems will be less of an issue if incremental universal is implemented.
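The knobs involved in several of the problems above are collected in the sketch below. This is a sketch under my assumptions, not a recommended configuration; the values shown are examples (and the defaults, where noted):

```cpp
#include <rocksdb/options.h>
#include <rocksdb/rate_limiter.h>

// A sketch of the options discussed above, not a recommended config.
rocksdb::Options SketchUniversalOptions() {
  rocksdb::Options options;
  options.compaction_style = rocksdb::kCompactionStyleUniversal;

  // Compaction is triggered when the estimated space-amp exceeds this
  // percent, but it is a trigger, not a hard bound on worst-case space-amp
  // (issue 9561). 200 is the default.
  options.compaction_options_universal.max_size_amplification_percent = 200;

  // Subcompactions are disabled by default (value 1). Values > 1 let one
  // compaction job use multiple threads, but those threads are not counted
  // against max_background_jobs (issue 9291).
  options.max_background_jobs = 8;
  options.max_subcompactions = 1;

  // Trivial move for universal is off by default. When enabled, SSTs can
  // sometimes be moved rather than rewritten during compaction.
  options.compaction_options_universal.allow_trivial_move = false;

  // One way to leave headroom for compaction bursts: cap compaction and
  // flush write throughput (100 MB/s here as an example).
  options.rate_limiter.reset(rocksdb::NewGenericRateLimiter(100 << 20));

  return options;
}
```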

Trivial move

I ran the following to document the difference between leveled and universal. For universal the test was repeated with and without the trivial move optimization enabled. The fillseq benchmark inserts keys in ascending key order. The test was run in 3 configurations:
  1. no compression
  2. lz4 compression for all levels
  3. lz4 compression with min_level_to_compress=2: in this case the trivial move optimization can only be used until SSTs reach level 2 of the LSM tree, where they must be compressed.
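For reference, the three configurations map roughly to the option settings below. This is a sketch based on my understanding of what the db_bench flags do (in db_bench they are set via --compression_type and --min_level_to_compress):

```cpp
#include <rocksdb/options.h>

// Case 1: no compression.
void CaseNoCompression(rocksdb::Options& options) {
  options.compression = rocksdb::kNoCompression;
}

// Case 2: lz4 compression for all levels.
void CaseLz4All(rocksdb::Options& options) {
  options.compression = rocksdb::kLZ4Compression;
}

// Case 3: roughly what min_level_to_compress=2 maps to, as I understand it:
// levels 0 and 1 are uncompressed, lz4 from level 2 on. Trivial moves are
// possible only while an SST stays uncompressed.
void CaseMinLevelToCompress2(rocksdb::Options& options) {
  options.compression_per_level = {
      rocksdb::kNoCompression, rocksdb::kNoCompression,
      rocksdb::kLZ4Compression, rocksdb::kLZ4Compression,
      rocksdb::kLZ4Compression, rocksdb::kLZ4Compression,
      rocksdb::kLZ4Compression};
}
```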
Then I looked at the compaction IO statistics printed when the test finishes and the results are below. The behavior for leveled is similar to universal with the trivial move optimization for the first two cases but not for the third where min_level_to_compress=2. I am curious whether the difference in the third case is by design, an oversight or a bug.

The compaction IO stats from test end are here for case 1, case 2 and case 3. The test script is here.

--- no compression

GB written by compaction and memtable flush
11.3    leveled
68.9    universal with universal_allow_trivial_move disabled (default)
11.3    universal with universal_allow_trivial_move enabled

--- lz4 for everything

GB written by compaction and memtable flush
 6.3    leveled
32.5    universal with universal_allow_trivial_move disabled (default)
 6.3    universal with universal_allow_trivial_move enabled

--- lz4 for everything, min_level_to_compress=2

GB written by compaction and memtable flush
16.8    leveled
37.0    universal with universal_allow_trivial_move disabled (default)
11.3    universal with universal_allow_trivial_move enabled
