My last post explained some aspects of the trivial move optimization based on configuration options and source code. This post explains it based on benchmarks.
I have speculated in the past about whether workloads with N streams of inserts would benefit from the trivial move optimization in RocksDB and I am happy to discover that they do.
But first let me explain what I mean by streams of inserts; see this post for more detail.
- An insert stream is a sequence of keys in ascending or descending order.
- For a given stream the order is always ascending or always descending; the stream cannot switch.
- The keys in the stream are inserted (a RocksDB Put), and by insert I mean that the key does not already exist in the database.
- When there are N streams the keys from each stream use a different prefix to guarantee that RocksDB will see N different streams.
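The definition above can be sketched in a few lines of code. This is a hypothetical illustration, not db_bench source: each stream owns a prefix and a counter, keys within a stream are strictly ascending (or descending) in byte order, and different prefixes keep the streams disjoint.

```python
def make_stream(prefix: bytes, ascending: bool = True, start: int = 0):
    """Yield keys for one insert stream: always ascending or always
    descending, never switching direction (a hypothetical sketch)."""
    counter = start
    while True:
        # 8-byte big-endian encoding makes byte order match numeric order
        yield prefix + counter.to_bytes(8, "big")
        counter += 1 if ascending else -1

# Three streams distinguished by a one-byte prefix, as in the asc.a/b/c
# setup below; every key from the 'b' stream sorts after every 'a' key.
streams = [make_stream(p) for p in (b"a", b"b", b"c")]
```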
- Trivial move works great, and as expected, for a small number of insert streams.
- It wasn't used as much as I expected when there were 8+ streams. Perhaps my expectations were wrong or perhaps this can be improved.
- asc.1 - to get one ascending stream I used db_bench as-is. In this case the key is derived from a counter that is incremented (see here).
- desc.1 - to get one descending stream I used a modified version of db_bench. The modification is to decrement the counter rather than increment it in the key generator (see here).
- asc.N - to get N concurrent and ascending streams I used --keys_per_prefix and --prefix_size. Note to self, don't forget to set prefix_size because it is 0 by default. By concurrent, I mean there are N streams during one run of db_bench.
- asc.a, asc.b, asc.c - to get N (well, N = 2 or 3) not-concurrent and ascending streams I used 3 modified versions of db_bench. Each used a different one-byte prefix ('a', 'b', or 'c') for their keys and then the incrementing counter made the keys ascending. There was one ascending stream per run of db_bench and db_bench was run two or three times.
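For the asc.N case, a rough sketch of how a single ascending counter can be spread across N prefixed streams. To be clear, this is my approximation of what --keys_per_prefix and --prefix_size achieve, not db_bench's actual key generator, and the round-robin assignment is an assumption:

```python
def prefixed_key(counter: int, num_prefixes: int, prefix_size: int = 8) -> bytes:
    """Map a global ascending counter onto one of num_prefixes streams
    (a round-robin approximation, not db_bench's exact logic)."""
    # Consecutive counters land in different prefixes, so RocksDB sees
    # num_prefixes interleaved streams, each ascending within its prefix.
    prefix = (counter % num_prefixes).to_bytes(prefix_size, "big")
    suffix = (counter // num_prefixes).to_bytes(8, "big")
    return prefix + suffix
```

Note the reminder in the bullet above: with prefix_size=0 (the db_bench default) there is effectively one prefix, so one stream.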
Workloads
I confirmed that trivial move was done for each of the workloads described below. The database was empty at the start of each workload. In the best case all compaction is done by trivial move and write-amp=1; the best case occurs for all workloads except N ascending & concurrent.
The results for the N ascending & concurrent workloads are the most interesting because trivial move doesn't happen for all compactions. I expected it to be done for all compactions into Ln+1, where Ln is the first level large enough to have N or more SSTs. However, dynamic leveled compaction makes this complicated, so I also use move-ratio to judge the frequency of trivial move, where move-ratio is Moved(GB) / Write(GB) for levels excluding L0. When write-amp=1 then move-ratio=INF.
For each workload I provide a link to the compaction IO stats reported near the end of the test. The Write(GB) column shows how many bytes were written by compaction (for levels > 0) or flushed (for L0). The Moved(GB) column shows how many bytes used trivial move. When write-amp=1, the Write(GB) column is zero for all levels except L0.
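The move-ratio arithmetic is simple enough to state as code. A minimal sketch, assuming the per-level Write(GB) values are keyed by level number and Moved(GB) has already been summed over levels above L0:

```python
def move_ratio(moved_gb: float, write_gb_by_level: dict) -> float:
    """move-ratio = Moved(GB) / Write(GB), where Write(GB) counts only
    levels above L0 (L0 bytes come from flush, not compaction)."""
    write = sum(gb for level, gb in write_gb_by_level.items() if level > 0)
    # When all compaction above L0 was done by trivial move, Write(GB)
    # for those levels is zero, write-amp=1 and the ratio is infinite.
    return float("inf") if write == 0 else moved_gb / write
```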
The workloads were:
- 1 ascending
- Use asc.1 and run db_bench one time to insert 100M rows, see r.asc.sh
- Write-amp=1 because all compaction was done by trivial move, stats are here
- 1 descending
- Use desc.1 and run db_bench one time to insert 100M rows, see r.desc.sh
- Write-amp=1 because all compaction was done by trivial move, stats are here
- N ascending & concurrent
- Use asc.N for N = 1, 2, 4, ..., 524288, 1048576, but I only share results up to N=64
- Run db_bench one time for each value of N to insert 100M rows
- For N=1 write-amp=1, move-ratio=INF and stats are here
- For N=2, write-amp=4.0, move-ratio=0.779 and stats are here
- For N=4, write-amp=5.3, move-ratio=0.276 and stats are here
- For N=8, write-amp=5.8, move-ratio=0.097 and stats are here
- For N=16, write-amp=6.6, move-ratio=0.066 and stats are here
- For N=32, write-amp=6.7, move-ratio=0.023 and stats are here
- For N=64, write-amp=6.7, move-ratio=0.016 and stats are here
- 2 ascending but not concurrent
- These run db_bench two times to insert 50M rows/run, see r.asc.a.b.sh and r.asc.b.a.sh
- Write-amp=1 because all compaction was done by trivial move
- Use asc.a then asc.b, stats from the end of asc.a and asc.b are here
- Use asc.b then asc.a, stats from the end of asc.b and asc.a are here
- 3 ascending but not concurrent
- These run db_bench three times to insert 30M or 40M rows/run
- Write-amp=1 because all compaction was done by trivial move
- Use asc.a, asc.b, asc.c. See r.asc.a.b.c.sh. Stats are here.
- Use asc.a, asc.c, asc.b. See r.asc.a.c.b.sh. Stats are here.
- Use asc.b, asc.a, asc.c. See r.asc.b.a.c.sh. Stats are here.
- Use asc.b, asc.c, asc.a. See r.asc.b.c.a.sh. Stats are here.
- Use asc.c, asc.a, asc.b. See r.asc.c.a.b.sh. Stats are here.
- Use asc.c, asc.b, asc.a. See r.asc.c.b.a.sh. Stats are here.