I previously shared results for MyRocks and the Insert Benchmark where some queries were too slow because there was too much MVCC GC debt and the queries encountered long runs of tombstones. I am happy to share results here that show the problem can be avoided by adding a few options to the MyRocks my.cnf file.
By "too slow" I mean a query could take tens of seconds when it is expected to take tens of milliseconds. The problem occurs when a table is used like a queue, with inserts at one end and deletes at the other, and the slow query is select min(transactionid) from $table. The query is slow when it encounters runs of tombstones left by the deletes.
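A sketch of that queue-like access pattern, with a hypothetical table name and payload column (only transactionid comes from the post):

```sql
-- Producer inserts at the head of the queue.
INSERT INTO q (transactionid, payload) VALUES (1001, 'work item');

-- Consumer deletes from the tail, leaving a tombstone per deleted row.
DELETE FROM q WHERE transactionid = 900;

-- The scan for the minimum key must step over the run of tombstones
-- left by the deletes before it reaches the first live row.
SELECT MIN(transactionid) FROM q;
```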
The problem is avoided by setting two options in my.cnf that are disabled by default: rocksdb_compaction_sequential_deletes and rocksdb_compaction_sequential_deletes_window. Documentation for these options isn't great and I have yet to read the related source code, but the results are great so I will share them for now. If compaction encounters X tombstones within any window of length Y (where X=rocksdb_compaction_sequential_deletes and Y=rocksdb_compaction_sequential_deletes_window) then extra work is done to get rid of the tombstones. Alas, today I am fuzzy on what that extra work is.
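As a sketch, the sd32k config used below would add lines like these to my.cnf (the literal values are my reading of "32k-1, 32k", so 32767 and 32768):

```
rocksdb_compaction_sequential_deletes=32767
rocksdb_compaction_sequential_deletes_window=32768
```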
Note that dropping of tombstones allows for false negatives. Without the options described in the previous paragraph, many tombstones that can be dropped will not be dropped until they reach the max level of the LSM tree because the check for whether a tombstone can be dropped is meant to be cheap (some CPU, no IO) -- if the tombstone's key overlaps any SSTs from the larger levels below the tombstone's SST then it is not dropped. It might be possible to spend a bit more CPU and check bloom filters when there is overlap but I am not aware of an LSM that does that today.
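A toy model of that cheap overlap check, using simplified per-SST [smallest, largest] key ranges. This is my illustration of the idea, not RocksDB code:

```python
# Toy model of the cheap "can this tombstone be dropped?" test during
# compaction. A tombstone can be discarded before it reaches the max level
# only if no older version of its key can exist in a lower (larger) level.
# The cheap test (some CPU, no IO) compares the key against each lower-level
# SST's [smallest, largest] key range: any overlap means "maybe present
# below", so the tombstone is kept -- a false negative when the key is not
# actually in the overlapping SST.

def can_drop_tombstone(key, lower_level_sst_ranges):
    """lower_level_sst_ranges: iterable of (smallest_key, largest_key)."""
    return not any(lo <= key <= hi for lo, hi in lower_level_sst_ranges)

# Example: lower-level SSTs cover keys [10, 20] and [40, 60].
ranges = [(10, 20), (40, 60)]
print(can_drop_tombstone(30, ranges))  # True: no lower SST can hold key 30
print(can_drop_tombstone(15, ranges))  # False: overlaps [10, 20], so kept
```

Checking the bloom filter of the overlapping SST, as the paragraph above suggests, would turn some of these false negatives into drops at the cost of extra CPU.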
MyRocks also makes this less of an issue by using SingleDelete for deletes of secondary index entries as those are easy to drop before they reach the max level of the LSM tree.
tl;dr - setting these options ...
- Prevents a scenario where some queries take 1000X longer than expected
- There is some cost -- it increases CPU overhead by up to 10% and write-amplification by up to 20% for this benchmark. The cost varies depending on the values the options are set to.
Builds
I used FB MySQL 8.0.28 compiled from git hash ef5b9b101 in June 2023.
Benchmark
The insert benchmark was run in two setups.
- cached by RocksDB - all tables are cached by RocksDB
- IO-bound - the database is larger than memory
The benchmark used a server with 80 cores, hyperthreads enabled, 256G of RAM and fast NVMe with XFS.
The benchmark is run with 24 clients and 24 tables (one client per table). The benchmark is a sequence of steps.
- l.i0
- insert X million rows per table where X is 20 for cached and 500 for IO-bound
- l.x
- create 3 secondary indexes. I usually ignore performance from this step.
- l.i1
- insert and delete another 50 million rows per table with secondary index maintenance. The number of rows per table at the end of this benchmark step matches the number at the start, with inserts done at the head of the table and deletes done from the tail.
- q100, q500, q1000
- do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.
The benchmark was repeated with these my.cnf configurations:
- y9cx_u (base) - base config
- y9cx_u_sd8k (sd8k) - sets rocksdb_compaction_sequential_deletes[_window] to 8k-1, 8k
- y9cx_u_sd32k (sd32k) - sets rocksdb_compaction_sequential_deletes[_window] to 32k-1, 32k
- y9cx_u_sd128k (sd128k) - sets rocksdb_compaction_sequential_deletes[_window] to 128k-1, 128k
For Cached by RocksDB the average throughput for all configs is similar.
- The CPU overhead is similar to the base config. See the cpupq columns here, where cpupq means CPU per insert or per query.
- There is a 10% increase in write-amplification based on the wkbpi column for the l.i1 benchmark step. Note that wkbpi stands for KB written to storage per insert.
- The increases are largest for the sd8k config and small for the sd128k config.
For IO-bound
- There is a 5% to 10% increase in CPU overhead based on the cpupq column here for l.i1, q100, q500 and q1000
- There is an increase in write-amplification of 20% based on the wkbpi column for the l.i1 benchmark step
- The increases are largest for the sd8k config and small for the sd128k config.
How does the LSM tree look at the end of the benchmark?
How much overhead do the sequential delete options add to compaction?
Compaction IO statistics are here for Cached by RocksDB and for IO-bound (click on Raw, lines are long).
For Cached by RocksDB
- The total CPU time doing compaction is ~96,000 seconds for the base config and between 102,000 and 106,000 seconds for the configs that set the sequential delete options, an increase of between 7% and 9.6%. The increase is largest for the sd8k config. See the CompMergeCPU(sec) column in the lines that start with Sum, like this one. Most of the increase is for compaction into the max level of the LSM tree (see the lines for sd8k and for base).
- Write-amplification increases by between 7.7% and 12.1% based on the wkbpi columns in the Sum lines (see the lines for sd8k and for base).
For IO-bound
- The total CPU time doing compaction is ~178,000 seconds for the base config and between 190,436 seconds (for sd128k) and 209,000 seconds (for sd8k) for the other configs, so the CPU overhead from compaction increases by between 7% (for sd128k) and 17.3% (for sd8k). See the Sum lines for sd8k and for base.
- Write-amplification increases by between 5.6% and 14% based on the wkbpi columns in the Sum lines (see the lines for sd8k and for base).
The perf context counters were collected with rocksdb_perf_context_level=2.
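With rocksdb_perf_context_level=2 set, the counters accumulate in tables MyRocks exposes via information_schema. A sketch of the capture step; the table and column names here are my assumption, not confirmed from the post:

```sql
-- Snapshot the global perf context counters before and after a benchmark
-- step, then diff the two snapshots.
SELECT stat_type, value
  FROM information_schema.rocksdb_perf_context_global
 WHERE stat_type IN ('INTERNAL_DELETE_SKIPPED_COUNT',
                     'INTERNAL_KEY_SKIPPED_COUNT');
```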
Then I computed the difference between the after and before counter values, with a focus on INTERNAL_DELETE_SKIPPED_COUNT and INTERNAL_KEY_SKIPPED_COUNT. An example where the counters are huge is:
The script to compute the diff is:
# $x selects the counter-file prefix; these values are assumed, one per config
for x in base sd8k sd32k sd128k ; do
  for f in 1 3 5 ; do
    echo; echo $x - $f
    join o.pc.${x}.${f} o.pc.${x}.$(( f + 1 )) | \
      awk '{ printf "%d\t\t%s\n", $3 - $2, $1 }' | \
      sort -n -k 1,1 | \
      tail -5
  done
done