RocksDB has an optimization called trivial move that reduces write amplification. There is a short post on it, but it needs a longer post. I am not sure whether this optimization originated with RocksDB. I don't know whether it is implemented elsewhere, nor do I recall whether it has been described in a conference paper. Update - RocksDB inherited trivial move from LevelDB.
Trivial move can be done for leveled and universal compaction but here I focus on leveled. Trivial move is disabled by default for universal and is enabled via the allow_trivial_move option.
For leveled, when compaction is done from level N to level N+1 there is usually much work: read the input SSTs from level N, read the ~10 input SSTs from level N+1, merge them and write ~10 SSTs to level N+1.
The trivial move can be done when the key range for the input SST from level N doesn't overlap with the key range of any SST in level N+1. In this case the input SST fits in between the key ranges of the level N+1 SSTs and rather than reading and re-writing SSTs the trivial move optimization simply changes metadata (the MANIFEST) to indicate that the SST that had been on level N is now on level N+1, thus the name trivial move. With the exception of the need to rewrite the MANIFEST there is no write-amplification from a trivial move.
Code
This is the code where compaction decides whether to do a trivial move and the logic is in the IsTrivialMove method. A trivial move is not done when ...
- ... the compression (lz4, zstd, none) used for the input SST does not match the compression to be used for the output level. If bottommost_compression is zstd and non-bottom levels do not use zstd then trivial moves can be done up to Lmax-1 and compaction from Lmax-1 to Lmax cannot use trivial move (Lmax is max level, Lmax-1 is the next to max level). The logic for that is here.
- ... the SST moved to level N+1 would overlap too many SSTs in level N+2. The logic for that is here. And overlap too many SSTs means that the compaction from level N+1 to N+2 would exceed max_compaction_bytes. The max_compaction_bytes option is 0 by default, and when 0 is set to 25 * target_file_size_base. So overlap too many SSTs == overlap ~25 SSTs by default. This check is only for leveled compaction.
Logs
These are examples of messages in LOG when trivial move is done:
Monitoring
Columns in Compaction IO statistics shows when trivial move is done:
- the Moved(GB) column counts the number of bytes that were trivial moved into that level
- the Write(GB) column counts the number of bytes that were not trivial moved into that level and by not trivial moved I mean that normal compaction was done.
No comments:
Post a Comment