Thursday, February 10, 2022

RocksDB internals: trivial move

RocksDB has an optimization called trivial move that reduces write amplification. There is a short post on it, but it needs a longer post. I am not sure whether this optimization originated with RocksDB. I don't know whether it is implemented elsewhere, nor do I recall whether it has been described in a conference paper. Update - RocksDB inherited trivial move from LevelDB.

Trivial move can be done for leveled and universal compaction but here I focus on leveled. Trivial move is disabled by default for universal and is enabled via the allow_trivial_move option.

For leveled, when compaction is done from level N to level N+1 there is usually much work: read the input SSTs from level N, read the ~10 input SSTs from level N+1, merge them and write ~10 SSTs to level N+1.

The trivial move can be done when the key range for the input SST from level N doesn't overlap with the key range of any SST in level N+1. In this case the input SST fits in between the key ranges of the level N+1 SSTs and rather than reading and re-writing SSTs the trivial move optimization simply changes metadata (the MANIFEST) to indicate that the SST that had been on level N is now on level N+1, thus the name trivial move. With the exception of the need to rewrite the MANIFEST there is no write-amplification from a trivial move.

Code

This is the code where compaction decides whether to do a trivial move and the logic is in the IsTrivialMove method. A trivial move is not done when ...

  1. ... the compression (lz4, zstd, none) used for the input SST does not match the compression to be used for the output level. If bottommost_compression is zstd and non-bottom levels do not use zstd then trivial moves can be done up to Lmax-1 and compaction from Lmax-1 to Lmax cannot use trivial move (Lmax is max level, Lmax-1 is the next to max level). The logic for that is here.
  2. ... the SST moved to level N+1 would overlap too many SSTs in level N+2. The logic for that is here. And overlap too many SSTs means that the compaction from level N+1 to N+2 would exceed max_compaction_bytes. The max_compaction_bytes option is 0 by default, and when 0 is set to 25 * target_file_size_base. So overlap too many SSTs == overlap ~25 SSTs by default. This check is only for leveled compaction.
I encounter #1 when using --benchmarks=fillseq with db_bench. In that case I make sure that all levels use the same compression type. And for the benchmarks that are run after fillseq I will change the config so that the larger levels use a better but more CPU intensive compression while the smaller levels use no compression. I created issue 9292 with the request that RocksDB be enhanced to avoid this complexity.

You might encounter #2 if you increase max_bytes_for_level_multiplier from the default (10) to something larger like 20 or 30 in which case you can explicitly set max_compaction_bytes to something larger than 25 * target_file_size_base. I created issue 9544 for this.

Logs

These are examples of messages in LOG when trivial move is done:
EVENT_LOG_v1 {"time_micros": 1644111871547273, "job": 429, "event": "trivial_move", "destination_level": 5, "files": 1, "total_files_size": 16872686}

[/db_impl/db_impl_compaction_flush.cc:3189] [default] Moved #1 files to level-5 16872686 bytes OK: base level 4 level ...

Monitoring

Columns in Compaction IO statistics shows when trivial move is done:
  • the Moved(GB) column counts the number of bytes that were trivial moved into that level
  • the Write(GB) column counts the number of bytes that were not trivial moved into that level and by not trivial moved I mean that normal compaction was done. 

No comments:

Post a Comment

Fixing some of the InnoDB scan perf regressions in a MySQL fork

I recently learned of Advanced MySQL , a MySQL fork, and ran my sysbench benchmarks for it. It fixed some, but not all, of the regressions f...