My last post explained some aspects of the trivial move optimization based on configuration options and source code. This post explains it based on benchmarks.
I have speculated in the past about whether workloads with N streams of inserts would benefit from the trivial move optimization in RocksDB and I am happy to discover that they do.
But first let me explain what I mean by streams of inserts; see this post for more detail.
- An insert stream is a sequence of keys in ascending or descending order.
- For a given stream the order is always ascending or always descending; the stream cannot switch.
- The keys in the stream are inserted (a RocksDB Put), and by insert I mean that the key does not already exist in the database.
- When there are N streams the keys from each stream use a different prefix to guarantee that RocksDB will see N different streams.
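The definition above can be sketched in a few lines of code. This is a hypothetical illustration, not db_bench source: each stream owns a prefix and a counter, keys within a stream are strictly ascending (or descending) in byte order, and different prefixes keep the streams disjoint.

```python
def make_stream(prefix: bytes, ascending: bool = True, start: int = 0):
    """Yield keys for one insert stream: always ascending or always
    descending, never switching direction (a hypothetical sketch)."""
    counter = start
    while True:
        # 8-byte big-endian encoding makes byte order match numeric order
        yield prefix + counter.to_bytes(8, "big")
        counter += 1 if ascending else -1

# Three streams distinguished by a one-byte prefix, as in the asc.a/b/c
# setup below; every key from the 'b' stream sorts after every 'a' key.
streams = [make_stream(p) for p in (b"a", b"b", b"c")]
```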
- Trivial move works great, and as expected, for a small number of insert streams.
- It wasn't used as much as I expected when there were 8+ streams. Perhaps my expectations were wrong or perhaps this can be improved.
- asc.1 - to get one ascending stream I used db_bench as-is. In this case the key is derived from a counter that is incremented (see here).
- desc.1 - to get one descending stream I used a modified version of db_bench. The modification is to decrement the counter rather than increment it in the key generator (see here).
- asc.N - to get N concurrent and ascending streams I used --keys_per_prefix and --prefix_size. Note to self, don't forget to set prefix_size because it is 0 by default. By concurrent, I mean there are N streams during one run of db_bench.
- asc.a, asc.b, asc.c - to get N (well, N = 2 or 3) not-concurrent and ascending streams I used 3 modified versions of db_bench. Each used a different one-byte prefix ('a', 'b', or 'c') for their keys and then the incrementing counter made the keys ascending. There was one ascending stream per run of db_bench and db_bench was run two or three times.
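For the asc.N case, a rough sketch of how a single ascending counter can be spread across N prefixed streams. To be clear, this is my approximation of what --keys_per_prefix and --prefix_size achieve, not db_bench's actual key generator, and the round-robin assignment is an assumption:

```python
def prefixed_key(counter: int, num_prefixes: int, prefix_size: int = 8) -> bytes:
    """Map a global ascending counter onto one of num_prefixes streams
    (a round-robin approximation, not db_bench's exact logic)."""
    # Consecutive counters land in different prefixes, so RocksDB sees
    # num_prefixes interleaved streams, each ascending within its prefix.
    prefix = (counter % num_prefixes).to_bytes(prefix_size, "big")
    suffix = (counter // num_prefixes).to_bytes(8, "big")
    return prefix + suffix
```

Note the reminder in the bullet above: with prefix_size=0 (the db_bench default) there is effectively one prefix, so one stream.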
Workloads
I confirmed that trivial move was done for each of the workloads described below. The database was empty at the start of each workload. In the best case all compaction is done by trivial move and write-amp=1; the best case occurs for all workloads except N ascending & concurrent.
The results for the N ascending & concurrent workloads are the most interesting because trivial move doesn't happen for all compactions. I expected it to be done for all compactions into Ln+1, where Ln is the first level large enough to have N or more SSTs. However, dynamic leveled compaction makes this complicated, so I also use move-ratio to judge the frequency of trivial move, where move-ratio is Moved(GB) / Write(GB) for levels excluding L0. When write-amp=1 then move-ratio=INF.
For each workload I provide a link to the compaction IO stats reported near the end of the test. The Write(GB) column shows how many bytes were written by compaction (for levels > 0) or flushed (for L0). The Moved(GB) column shows how many bytes used trivial move. When write-amp=1, the Write(GB) column is zero for all levels except L0.
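The move-ratio arithmetic is simple enough to state as code. A minimal sketch, assuming the per-level Write(GB) values are keyed by level number and Moved(GB) has already been summed over levels above L0:

```python
def move_ratio(moved_gb: float, write_gb_by_level: dict) -> float:
    """move-ratio = Moved(GB) / Write(GB), where Write(GB) counts only
    levels above L0 (L0 bytes come from flush, not compaction)."""
    write = sum(gb for level, gb in write_gb_by_level.items() if level > 0)
    # When all compaction above L0 was done by trivial move, Write(GB)
    # for those levels is zero, write-amp=1 and the ratio is infinite.
    return float("inf") if write == 0 else moved_gb / write
```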
The workloads were:
- 1 ascending
- Use asc.1 and run db_bench one time to insert 100M rows, see r.asc.sh
- Write-amp=1 because all compaction was done by trivial move, stats are here
- 1 descending
- Use desc.1 and run db_bench one time to insert 100M rows, see r.desc.sh
- Write-amp=1 because all compaction was done by trivial move, stats are here
- N ascending & concurrent
- Use asc.N for N = 1, 2, 4, ..., 524288, 1048576, but I only share results up to N=64
- Run db_bench one time for each value of N to insert 100M rows
- For N=1 write-amp=1, move-ratio=INF and stats are here
- For N=2, write-amp=4.0, move-ratio=0.779 and stats are here
- For N=4, write-amp=5.3, move-ratio=0.276 and stats are here
- For N=8, write-amp=5.8, move-ratio=0.097 and stats are here
- For N=16, write-amp=6.6, move-ratio=0.066 and stats are here
- For N=32, write-amp=6.7, move-ratio=0.023 and stats are here
- For N=64, write-amp=6.7, move-ratio=0.016 and stats are here
- 2 ascending but not concurrent
- These run db_bench two times to insert 50M rows/run, see r.asc.a.b.sh and r.asc.b.a.sh
- Write-amp=1 because all compaction was done by trivial move
- Use asc.a then asc.b, stats from the end of asc.a and asc.b are here
- Use asc.b then asc.a, stats from the end of asc.b and asc.a are here
- 3 ascending but not concurrent
- These run db_bench three times to insert 30M or 40M rows/run
- Write-amp=1 because all compaction was done by trivial move
- Use asc.a, asc.b, asc.c. See r.asc.a.b.c.sh. Stats are here.
- Use asc.a, asc.c, asc.b. See r.asc.a.c.b.sh. Stats are here.
- Use asc.b, asc.a, asc.c. See r.asc.b.a.c.sh. Stats are here.
- Use asc.b, asc.c, asc.a. See r.asc.b.c.a.sh. Stats are here.
- Use asc.c, asc.a, asc.b. See r.asc.c.a.b.sh. Stats are here.
- Use asc.c, asc.b, asc.a. See r.asc.c.b.a.sh. Stats are here.