Monday, August 14, 2023

Tuning MyRocks to reduce MVCC GC debt

I previously shared results for MyRocks and the Insert Benchmark where some queries were too slow because there was too much MVCC GC debt and the queries encountered long runs of tombstones. I am happy to share results here that show the problem can be avoided by adding a few options to the MyRocks my.cnf file.

By too slow I mean the query could take tens of seconds when it is expected to take tens of milliseconds. The problem occurs when a table is used like a queue, with inserts at one end and deletes at the other, and the slow query is select min(transactionid) from $table. The query is slow when it encounters runs of tombstones left by the deletes.
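The queue-like access pattern can be sketched in SQL. The table name pi1 and the transactionid column match what appears later in the debugging section, but the statements here are illustrative, not the insert benchmark's actual schema or workload:

```sql
-- Illustrative queue-like usage (not the benchmark's real DDL).
-- Inserts happen at the head (largest transactionid):
INSERT INTO pi1 (transactionid, ...) VALUES (...);
-- Deletes happen at the tail (smallest transactionid):
DELETE FROM pi1 WHERE transactionid = ...;
-- The slow query: when many tombstones from the tail deletes have not
-- been dropped, the iterator must skip a long run of deleted keys
-- before it finds the minimum.
SELECT min(transactionid) FROM pi1;
```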

The problem is avoided by setting two options in my.cnf that are disabled by default. Documentation isn't great for these options and I have yet to read the source code related to them, but the results are great so I will share them for now. The options are rocksdb_compaction_sequential_deletes and rocksdb_compaction_sequential_deletes_window. If compaction encounters at least X tombstones within any window of Y consecutive keys, where X is rocksdb_compaction_sequential_deletes and Y is rocksdb_compaction_sequential_deletes_window, then extra work is done to get rid of the tombstones. Alas, today I am fuzzy on what that extra work is.
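As a concrete example, the sd8k config described below adds something like the following to my.cnf. The values are the ones used in this post (8k-1 and 8k); treat this as a sketch rather than the full config, which is linked in the Configurations section:

```ini
# Trigger extra compaction work when any window of 8192 consecutive keys
# seen by compaction contains at least 8191 tombstones (the sd8k config).
rocksdb_compaction_sequential_deletes=8191
rocksdb_compaction_sequential_deletes_window=8192
```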

Note that the check for dropping tombstones allows false negatives. Without the options described in the previous paragraph, many tombstones that could be dropped will not be dropped until they reach the max level of the LSM tree, because the check for whether a tombstone can be dropped is meant to be cheap (some CPU, no IO): if the tombstone's key overlaps any SSTs in the levels below the tombstone's SST then it is not dropped. It might be possible to spend a bit more CPU and check bloom filters when there is overlap, but I am not aware of an LSM that does that today.
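To make the window check concrete, here is a toy sketch in Python of the trigger as I read the option semantics: flag a file when any window of Y consecutive entries contains at least X tombstones. This is my illustration, not RocksDB's implementation (per the above, I have not read that code):

```python
from collections import deque

def window_triggers(entries, deletes, window):
    """Return True if any window of `window` consecutive entries contains
    at least `deletes` tombstones. `entries` is a sequence of ops where
    'del' marks a tombstone and anything else is a live key. This is a
    toy model of rocksdb_compaction_sequential_deletes[_window]."""
    win = deque()   # sliding window of the most recent `window` ops
    ndel = 0        # tombstones currently inside the window
    for op in entries:
        win.append(op)
        if op == 'del':
            ndel += 1
        if len(win) > window:
            if win.popleft() == 'del':
                ndel -= 1
        if ndel >= deletes:
            return True   # this run would get the extra compaction work
    return False
```

With the sd8k settings (deletes=8191, window=8192) a trigger needs an almost unbroken run of tombstones, which is exactly what queue-style tail deletes produce.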

MyRocks also makes this less of an issue by using SingleDelete for deletes of secondary index entries as those are easy to drop before they reach the max level of the LSM tree.
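A toy sketch of why SingleDelete helps: when a SingleDelete tombstone meets the one Put it covers during a compaction, both can be dropped on the spot, while a regular Delete must be kept because older versions of the key may exist in levels below. The model assumes newest-first input and that SingleDelete keys were written exactly once (its contract); it is my illustration, not MyRocks code:

```python
def compact(entries):
    """Toy merge for one compaction at a non-bottom level.
    entries: list of (key, op), newest version first; op is one of
    'put', 'del' (Delete) or 'sdel' (SingleDelete).
    Returns the surviving entries."""
    resolved = set()  # keys whose newest version has been decided
    out = []
    for i, (key, op) in enumerate(entries):
        if key in resolved:
            continue  # older version shadowed by a newer one: dropped
        resolved.add(key)
        if op == 'sdel':
            # SingleDelete annihilates with the one Put it covers when
            # that Put is part of this compaction; otherwise it stays.
            if any(k == key and o == 'put' for k, o in entries[i + 1:]):
                continue  # both the SingleDelete and the Put are dropped
            out.append((key, op))
        else:
            # A Put survives as the newest version. A Delete must survive
            # until the max level unless its key overlaps nothing below
            # (the cheap check described above).
            out.append((key, op))
    return out
```

This is why the tombstone runs that hurt the min(transactionid) query come from the primary index: the secondary index tombstones mostly cancel out long before reaching the max level.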

tl;dr - setting these options ...

  • Prevents a scenario where some queries take 1000X longer than expected
  • Has some cost -- it increases CPU overhead by up to 10% and write-amplification by up to 20% for this benchmark. The cost depends on the values to which the options are set.

Builds

I used FB MySQL 8.0.28 compiled from git hash ef5b9b101 in June 2023.

Benchmark

The insert benchmark was run in two setups.

  • cached by RocksDB - all tables are cached by RocksDB
  • IO-bound - the database is larger than memory

The benchmark used a server with 80 cores, hyperthreads enabled, 256G of RAM and fast NVMe with XFS. 

The benchmark is run with 24 clients and 24 tables (client per table). The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows per table where X is 20 for cached and 500 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 50 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start, with inserts done at the table head and deletes done from the tail.
  • q100, q500, q1000
    • do queries as fast as possible while 100, 500 and 1000 inserts/s/client (and deletes at the same rate) are done in the background. Run for 3600 seconds.

Configurations

The configuration files are here. They all include rocksdb_perf_context_level=2 which was done for debugging and isn't intended to always be set for production.
  • y9cx_u (base) - base config
  • y9cx_u_sd8k (sd8k) - sets rocksdb_compaction_sequential_deletes[_window] to 8k-1, 8k
  • y9cx_u_sd32k (sd32k) - sets rocksdb_compaction_sequential_deletes[_window] to 32k-1, 32k
  • y9cx_u_sd128k (sd128k) - sets rocksdb_compaction_sequential_deletes[_window] to 128k-1, 128k

Results

Performance reports are here for Cached by RocksDB and IO-bound.

The tables below show the time in seconds for the select min(transactionid) query that is run once at the start of some benchmark steps. It is fast at the start of l.i1 because there is no MVCC GC debt at that point. It is slow for the base config at the start of q100, q500 and q1000 because there is much MVCC GC debt at those points. However, it isn't slow for the configs that set the sequential delete my.cnf options.

Cached by RocksDB
config  l.i1    q100    q500    q1000
base    0.001   8.206   9.926   9.188
sd8k    0.001   0.196   0.003   0.149
sd32k   0.001   0.194   0.021   0.102
sd128k  0.001   0.077   0.071   0.113

IO-bound
config  l.i1    q100    q500    q1000
base    0.001   10.883  10.267  12.833
sd8k    0.001   0.132   0.016   0.247
sd32k   0.001   0.115   0.020   0.234
sd128k  0.001   0.107   0.070   0.254

Results: average throughput

For Cached by RocksDB the average throughput for all configs is similar. 

For IO-bound the query rate for q100, q500 and q1000 is about 5% less for the configs that set sequential delete options (sd8k, sd32k, sd128k) when compared to the base config.

Results: CPU and IO overhead

For Cached by RocksDB and configs that set sequential delete options (sd8k, sd32k, sd128k)
  • The CPU overhead is similar to the base config. See the cpupq columns here; cpupq means CPU per insert or per query.
  • There is a 10% increase in write-amplification based on the wkbpi column for the l.i1 benchmark step. Note that wkbpi stands for KB written to storage per insert.
  • The increases are largest for the sd8k config and small for the sd128k config.
For IO-bound and configs that set sequential delete options (sd8k, sd32k, sd128k)
  • There is a 5% to 10% increase in CPU overhead based on the cpupq column here for l.i1, q100, q500 and q1000.
  • There is a 20% increase in write-amplification based on the wkbpi column for the l.i1 benchmark step.
  • The increases are largest for the sd8k config and small for the sd128k config.

Results: compaction IO statistics

How does the LSM tree look at the end of the benchmark?
How much overhead do the sequential delete options add to compaction?

Compaction IO statistics are here for Cached by RocksDB and for IO-bound (click on Raw, lines are long).

For Cached by RocksDB
  • The total CPU time doing compaction is ~96,000 seconds for the base config and between 102,000 and 106,000 seconds for the configs that set sequential delete options which is an increase of between 7% and 9.6%. The increase is largest for the sd8k config. See the CompMergeCPU(sec) column in the lines that start with Sum like this one. Most of the increase is for compaction into the max level of the LSM tree (see the lines for sd8k and for base).
  • Write-amplification increases by between 7.7% and 12.1% based on the wkbpi columns in the Sum lines (see the lines for sd8k and for base).
For IO-bound
  • The total CPU time doing compaction is ~178,000 seconds for the base config and between 190,436 (for sd128k) and 209,000 (for sd8k) seconds for the other configs. So the CPU overhead from compaction increases by between 7% (for sd128k) and 17.3% (for sd8k). See the Sum lines for sd8k and for base.
  • Write-amplification increases by between 5.6% and 14% based on the wkbpi columns in the Sum lines (see the lines for sd8k and for base).

Debugging

This wasn't easy to debug. Fortunately I have easy access to experts. To confirm that tombstones were the problem I set the perf_context level to 2 in my.cnf with:
rocksdb_perf_context_level=2
Then I queried information_schema.rocksdb_perf_context_global before and after running the query:

mysql -B -e 'select * from information_schema.rocksdb_perf_context_global' > o.pc.x.3
mysql ib -e 'select min(transactionid) from pi1'
mysql -B -e 'select * from information_schema.rocksdb_perf_context_global' > o.pc.x.4

Then I computed the difference of the counter values from after and before with a focus on INTERNAL_DELETE_SKIPPED_COUNT and INTERNAL_KEY_SKIPPED_COUNT. An example where the counters are huge is:

36144100                INTERNAL_DELETE_SKIPPED_COUNT
36144100                INTERNAL_KEY_SKIPPED_COUNT

The script to compute the diff is:

# Diff the perf_context counters captured before and after each query.
# o.pc.${x}.${f} and o.pc.${x}.$(( f + 1 )) are the before/after snapshots.
for x in x i1 q1 q2 q3 ; do
for f in 1 3 5 ; do
  echo; echo $x - $f
  # join on counter name, print (after - before) per counter,
  # then show the 5 counters with the largest increase
  join o.pc.${x}.${f} o.pc.${x}.$(( f + 1 )) | \
  awk '{ printf "%d\t\t%s\n", $3 - $2, $1 }' | \
  sort -n -k 1,1 | \
  tail -5
done
done
