Thursday, February 24, 2022

RocksDB externals: avoiding problems with db_bench --seed

This is the first post in the RocksDB externals series. This post is about the db_bench --seed option. The seed is used by random number generators that generate keys to be used in RocksDB requests. Setting the seed allows you to have deterministic key sequences in tests. 

If you run db_bench --benchmarks=readrandom --seed=10 twice then the second run uses the same sequence of keys as the first run. That is usually not the desired behavior. If the database is larger than memory but there is a cache for storage (OS page cache + buffered IO, etc) then the first run warms the cache for exactly the keys the second run will read, and the second run can be faster than expected thanks to a better cache hit rate.

Your benchmark results will be misleading if you are not careful. I have made this mistake more than once. Sometimes I spot it quickly, but I have lost more than a few hours to this. This post might help you avoid repeating my mistakes.

I have written about this before (here, here and here). 

One way to avoid problems

Assuming you want to run db_bench 1+ times for your benchmark, the following will usually avoid problems from reuse of the seed:

  • Optionally clear the OS page cache: sync; echo 3 > /proc/sys/vm/drop_caches
  • db_bench --seed=$( date +%s ) --threads=X --benchmarks=...
  • db_bench --seed=$( date +%s ) --threads=X --benchmarks=...
  • db_bench --seed=$( date +%s ) --threads=X --benchmarks=...
The above pattern is fine as long as the number of seconds between the start of one run and the start of the next is larger than the number of threads (X above). See below for an explanation of the number of seconds vs number of threads issue.
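If a run can finish in fewer seconds than it has threads, one possible workaround is to take the seed from the clock but never let it fall inside the previous run's seed range. The sketch below is not part of db_bench. The helper names next_seed and db_bench_args are hypothetical, and it assumes each run uses the seeds base, base+1, ..., base+threads-1 as described later in this post.

```python
import time


def next_seed(prev_seed, prev_threads, now=None):
    """Return a base seed that does not overlap the previous run's seeds.

    Assumes the previous run used prev_seed .. prev_seed + prev_threads - 1.
    """
    if now is None:
        now = int(time.time())  # same value as `date +%s`
    if prev_seed is None:
        return now
    # Never go below the first seed the previous run did not use.
    return max(now, prev_seed + prev_threads)


def db_bench_args(benchmark, threads, seed):
    """Build an argument list using only the flags mentioned in this post."""
    return [
        "db_bench",
        f"--seed={seed}",
        f"--threads={threads}",
        f"--benchmarks={benchmark}",
    ]


# Three back-to-back runs with 8 threads, pretending the clock barely moves.
seed = None
for fake_now in (1645700000, 1645700002, 1645700003):
    seed = next_seed(seed, 8, now=fake_now)
    print(" ".join(db_bench_args("readrandom", 8, seed)))
```

With the pretend clock values above the seeds come out as 1645700000, 1645700008 and 1645700016, so no two runs share a per-thread seed even though the runs are only a second or two apart.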

The above pattern works when --benchmarks has one benchmark. If it has two, for example --benchmarks=readrandom,overwrite, then the seeds used for readrandom will be reused for overwrite. There is no way to avoid that problem with RocksDB's db_bench until issue 9632 is fixed. I just noticed that LevelDB's db_bench has a fix for it.

Number of threads vs number of seconds

Note that date +%s is the number of seconds since the epoch.

With db_bench --seed=1 --threads=4, four threads are created to make RocksDB requests and they use the seeds 1, 2, 3 and 4. If you then run db_bench --seed=2 --threads=4, four threads are created and they use the seeds 2, 3, 4 and 5, so the seeds 2, 3 and 4 are shared by the first and second runs. If --seed=$( date +%s ) is used in place of --seed=1 and --seed=2 then seed overlap is avoided when the value for --threads in the first run is smaller than the number of seconds required for the first run.
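The arithmetic above can be checked with a small sketch. It only models the rule that a run with N threads uses N consecutive seeds starting at the value of --seed. The function names are made up for illustration and are not part of db_bench.

```python
def thread_seeds(base_seed, threads):
    """Seeds used by one run: base_seed, base_seed + 1, ..., one per thread."""
    return set(range(base_seed, base_seed + threads))


def overlap(seed_a, seed_b, threads):
    """Seeds shared by two runs that use the same thread count."""
    return sorted(thread_seeds(seed_a, threads) & thread_seeds(seed_b, threads))


# The example from the text: --seed=1 then --seed=2, both with --threads=4.
print(overlap(1, 2, 4))  # [2, 3, 4]

# Seeds taken from `date +%s`: the second run starts 3 seconds after the
# first, which is less than the 4 threads, so one seed is shared.
first = 1645700000
print(overlap(first, first + 3, 4))  # [1645700003]

# Second run starts 10 seconds after the first: no shared seeds.
print(overlap(first, first + 10, 4))  # []
```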

Implementation details

Both RocksDB and LevelDB have a seed reuse bug:

  • With RocksDB there is reuse when --benchmarks lists more than one test (issue 9632)
  • With LevelDB there is reuse across runs of db_bench

How --seed is used in RocksDB's db_bench:

  • When there are N threads, the IDs for the threads range from 0 to N-1 (see here)
  • And from this code:
    • Each thread has a random number generator (RNG) initialized with a seed
    • The value of base_seed comes from --seed when not zero, else it is 1000
    • The value of the seed for each per-thread RNG is base_seed + ID 
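
The rules in the list above can be written down as a small model. This is a Python sketch of the behavior described here, not the db_bench source (which is C++), and the function names are hypothetical. It also shows why listing two benchmarks in --benchmarks reuses seeds: thread IDs start at 0 again for each benchmark.

```python
DEFAULT_BASE_SEED = 1000  # used when --seed is 0, per the description above


def base_seed(seed_flag):
    """Value of base_seed for a given --seed value."""
    return seed_flag if seed_flag != 0 else DEFAULT_BASE_SEED


def seeds_per_benchmark(seed_flag, threads, benchmarks):
    """Map each benchmark name to the seeds its threads would use.

    Thread IDs range from 0 to threads - 1 for every benchmark, so every
    benchmark in the list gets the same seeds.
    """
    base = base_seed(seed_flag)
    return {name: [base + tid for tid in range(threads)] for name in benchmarks}


plan = seeds_per_benchmark(0, 3, ["readrandom", "overwrite"])
print(plan["readrandom"])  # [1000, 1001, 1002]
print(plan["overwrite"])   # [1000, 1001, 1002]
```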

How seeds are used in LevelDB's db_bench:

  • seed_base is hardwired to 1000 (see here)
  • A counter is incremented each time a thread is created (see here)
  • The per-thread seed is seed_base + counter (see here)
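
The LevelDB scheme can be modeled the same way. This is a Python sketch rather than the LevelDB source, and the class name is hypothetical. It assumes the counter starts at 0 in every process and is incremented before it is used. The exact starting value does not change the conclusion: seeds differ across benchmarks within one run, but a new run starts the counter over and gets the same seeds.

```python
SEED_BASE = 1000  # hardwired, per the description above


class LevelDbBenchProcess:
    """Models one db_bench process; the counter lives only as long as it does."""

    def __init__(self):
        self.threads_created = 0  # assumption: starts at 0 in every process

    def run_benchmark(self, threads):
        """Return the seeds given to the threads created for one benchmark."""
        seeds = []
        for _ in range(threads):
            self.threads_created += 1  # assumption: incremented before use
            seeds.append(SEED_BASE + self.threads_created)
        return seeds


run1 = LevelDbBenchProcess()
print(run1.run_benchmark(2))  # [1001, 1002]
print(run1.run_benchmark(2))  # [1003, 1004], no reuse within a run

run2 = LevelDbBenchProcess()
print(run2.run_benchmark(2))  # [1001, 1002], same seeds as the first run
```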
