Thursday, February 24, 2022

RocksDB externals: avoiding problems with db_bench --seed

This is the first post in the RocksDB externals series and it is about the db_bench --seed option. The seed is used by the random number generators that generate keys for RocksDB requests, so setting it gives you deterministic key sequences in tests.

If you run db_bench --benchmarks=readrandom --seed=10 twice then the second run uses the same sequence of keys as the first run. That is usually not the desired behavior. If the database is larger than memory but there is a cache for storage (OS page cache + buffered IO, etc) then the first test warms the cache and the second run can run faster than expected thanks to a better cache hit rate.

Your benchmark results will be misleading if you are not careful. I have made this mistake more than once. Sometimes I spot it quickly, but I have lost more than a few hours from this. This post might help you avoid repeating my mistakes.

I have written about this before (here, here and here). 

One way to avoid problems

Assuming you want to run db_bench 1+ times for your benchmark, the following will usually avoid problems from reuse of the seed:

  • Optionally clear the OS page cache: sync; echo 3 > /proc/sys/vm/drop_caches
  • db_bench --seed=$( date +%s ) --threads=X --benchmarks=...
  • db_bench --seed=$( date +%s ) --threads=X --benchmarks=...
  • db_bench --seed=$( date +%s ) --threads=X --benchmarks=...
The above pattern is fine as long as the number of seconds between each run is larger than the number of threads (X above). See below for an explanation of the number of seconds vs number of threads issue.

The above pattern works when --benchmarks has one benchmark. If it has two, for example --benchmarks=readrandom,overwrite, then the seeds used for readrandom will be reused for overwrite. There is no way to avoid that problem with RocksDB's db_bench until issue 9632 is fixed. I just noticed that LevelDB's db_bench has a fix for it.

Number of threads vs number of seconds

Note that date +%s is the number of seconds since the epoch.

With db_bench --seed=1 --threads=4, four threads are created to make RocksDB requests and they use the seeds 1, 2, 3 and 4. If you then run db_bench --seed=2 --threads=4, four threads are created and they use the seeds 2, 3, 4 and 5, so there is some overlap in the seeds between the first and second runs. If --seed=$( date +%s ) is used in place of --seed=1 and --seed=2 then seed overlap is avoided as long as the value of --threads in the first run is smaller than the number of seconds the first run takes.

Implementation details

Both RocksDB and LevelDB have a seed reuse bug:

  • With RocksDB there is reuse when --benchmarks lists more than one test (issue 9632)
  • With LevelDB there is reuse across runs of db_bench

How --seed is used in RocksDB's db_bench (a sketch follows these lists):

  • When there are N threads, the IDs for the threads range from 0 to N-1 (see here)
  • And from this code:
    • Each thread has a random number generator (RNG) initialized with a seed
    • The value of base_seed comes from --seed when not zero, else it is 1000
    • The value of the seed for each per-thread RNG is base_seed + ID 

How seeds are used in LevelDB's db_bench:
  • seed_base is hardwired to 1000 (see here)
  • A counter is incremented each time a thread is created (see here)
  • The per-thread seed is seed_base + counter (see here)
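
The seeding scheme above is easy to demonstrate in a few lines. This is a minimal sketch and not the db_bench code; the function name PerThreadSeeds is mine:

#include <cstdint>
#include <cstdio>
#include <vector>

// Sketch: per-thread seeds when --seed=S and --threads=T.
// base_seed comes from --seed (or a default when --seed=0) and the
// per-thread seed is base_seed + thread ID for IDs 0 .. T-1.
std::vector<uint64_t> PerThreadSeeds(uint64_t base_seed, int threads) {
  std::vector<uint64_t> seeds;
  for (int id = 0; id < threads; id++) {
    seeds.push_back(base_seed + id);
  }
  return seeds;
}

int main() {
  // Run 1: --seed=1 --threads=4 -> seeds 1 2 3 4
  // Run 2: --seed=2 --threads=4 -> seeds 2 3 4 5 (three seeds are reused)
  for (uint64_t s : PerThreadSeeds(1, 4)) printf("%llu ", (unsigned long long) s);
  printf("\n");
  for (uint64_t s : PerThreadSeeds(2, 4)) printf("%llu ", (unsigned long long) s);
  printf("\n");
  return 0;
}

This is why back-to-back runs need --seed values that differ by at least the number of threads used by the first run.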

RocksDB externals: the series

This is a series of posts on using RocksDB to complement the series on RocksDB Internals.

Tuesday, February 15, 2022

RocksDB internals: trivial move and benchmarks

My last post explained some aspects of the trivial move optimization based on configuration options and source code. This post explains it based on benchmarks.

I have speculated in the past about whether workloads with N streams of inserts would benefit from the trivial move optimization in RocksDB and I am happy to discover that they do.

But first let me explain streams of inserts here; see this post for more detail.

  • An insert stream is a sequence of keys in ascending or descending order. 
  • For a given stream the order is always ascending or always descending, the stream cannot switch.
  • The keys in the stream are inserted (a RocksDB Put) and by insert I mean that the key does not already exist in the database.
  • When there are N streams the keys from each stream use a different prefix to guarantee that RocksDB will see N different streams.
The goal is to determine the write-amplification for each workload. When write-amp=1 then trivial move is always used during compaction. This is the best case. But even when write-amp is greater than one there can still be a benefit from trivial move and that benefit can be estimated by the ratio between bytes that get trivial move and bytes that must be rewritten during compaction. Numbers for these are listed in the compaction IO statistics output -- the Moved(GB) vs Write(GB) columns.

tl;dr
  • Trivial move works great and as expected for a small number of insert streams
  • It wasn't used as much as I expected when there were 8+ streams. Perhaps my expectations were wrong or perhaps this can be improved. 
Update - WRT the second bullet point, the problem was me. The issue is that RocksDB does not know there are N streams, so when creating SSTs it is unlikely to end them on keys that respect the streams, and this makes trivial move less likely. LSMs with optimizations for time-series workloads have a solution for this. Solutions are likely to require guidance from the user. I don't have an opinion on whether detecting the streams at runtime is a good idea and whether it can be done without consuming too much CPU. The sst_partitioner in RocksDB is something that would consume the guidance were it provided.

Implementation details

All of my tests used db_bench with --benchmarks=fillseq although I had to modify db_bench in some cases. 
  • asc.1 - to get one ascending stream I used db_bench as-is. In this case the key is derived from a counter that is incremented (see here).
  • desc.1 - to get one descending stream I used a modified version of db_bench. The modification is to decrement the counter rather than increment it in the key generator (see here).
  • asc.N - to get N concurrent and ascending streams I used --keys_per_prefix and --prefix_size. Note to self: don't forget to set prefix_size because it is 0 by default. By concurrent, I mean there are N streams during one run of db_bench. A sketch of this style of key generator is below.
  • asc.a, asc.b, asc.c - To get N (well, N = 2 or 3) not-concurrent and ascending streams I used 3 modified versions of db_bench. Each used a different one-byte prefix ('a', 'b', or 'c') for its keys and then the incrementing counter made the keys ascending. There was one ascending stream per run of db_bench and db_bench was run two or three times.
The shell scripts I used to run tests and a diff with all changes for db_bench are here.
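
For asc.N the keys amount to a per-stream prefix plus a per-stream counter. This is a minimal sketch of that style of key generator and not the db_bench code; the class name and key format are mine:

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Sketch: N concurrent ascending streams, each with its own prefix and
// its own monotonically increasing counter.
class MultiStreamKeyGen {
 public:
  explicit MultiStreamKeyGen(int num_streams) : counters_(num_streams, 0) {}

  // Returns the next key for one stream as "<prefix>:<counter>".
  std::string Next(int stream) {
    char buf[32];
    snprintf(buf, sizeof(buf), "%04d:%012llu", stream,
             (unsigned long long) counters_[stream]++);
    return std::string(buf);
  }

 private:
  std::vector<uint64_t> counters_;
};

int main() {
  MultiStreamKeyGen gen(4);  // N=4 streams
  // Round-robin across streams: each stream is ascending on its own but
  // the interleaved sequence that RocksDB sees is not.
  for (int i = 0; i < 8; i++) {
    printf("%s\n", gen.Next(i % 4).c_str());
  }
  return 0;
}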

Workloads

I confirmed that trivial move was done for each of the workloads described below. The database was empty at the start of each workload. In the best case all compaction was done by trivial move and write-amp=1; the best case occurs for all workloads except N ascending & concurrent.

The results for the N ascending & concurrent workloads are the most interesting because trivial move doesn't happen for all compactions. I expected it to be done for all compactions into Ln+1 where Ln is the first level large enough to have N or more SSTs. However, dynamic leveled compaction makes this complicated, so I also use move-ratio to judge the frequency of trivial move where move-ratio is Moved(GB) / Write(GB) for levels excluding L0. When write-amp=1 then move-ratio=INF.
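
As a worked example with made-up numbers: if the levels excluding L0 report Moved(GB)=8 and Write(GB)=2 then move-ratio = 8 / 2 = 4 and most bytes arrived in those levels via trivial move. If Write(GB) is 0 for those levels then move-ratio is INF, which is the write-amp=1 case.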

For each workload I provide a link to compaction IO stats reported near test end. The Write(GB) shows how many bytes were written by compaction (for levels > 0) or flushed (for L0). The Moved(GB) columns show how many bytes used trivial move. When write-amp=1, then the Write(GB) column is zero for all levels except L0.

The workloads were:

  • 1 ascending
    • Use asc.1 and run db_bench one time to insert 100M rows, see r.asc.sh
    • Write-amp=1 because all compaction was done by trivial move, stats are here
  • 1 descending
    • Use desc.1 and run db_bench one time to insert 100M rows, see r.desc.sh
    • Write-amp=1 because all compaction was done by trivial move, stats are here
  • N ascending & concurrent
    • Use asc.N for N = 1, 2, 4, ..., 524288, 1048576 but I only share results up to N=64
    • Run db_bench one time for each value of N to insert 100M rows
    • For N=1 write-amp=1, move-ratio=INF and stats are here
    • For N=2, write-amp=4.0, move-ratio=0.779 and stats are here
    • For N=4, write-amp=5.3, move-ratio=0.276  and stats are here
    • For N=8, write-amp=5.8, move-ratio=0.097  and stats are here
    • For N=16, write-amp=6.6, move-ratio=0.066  and stats are here
    • For N=32, write-amp=6.7, move-ratio=0.023  and stats are here
    • For N=64, write-amp=6.7, move-ratio=0.016  and stats are here
  • 2 ascending but not concurrent
    • These run db_bench two times to insert 50M rows/run, see r.asc.a.b.sh and r.asc.b.a.sh
    • Write-amp=1 because all compaction was done by trivial move
    • Use asc.a then asc.b, stats from the end of asc.a and asc.b are here
    • Use asc.b then asc.a, stats from the end of asc.b and asc.a are here
  • 3 ascending, but not concurrent

Thursday, February 10, 2022

RocksDB internals: trivial move

RocksDB has an optimization called trivial move that reduces write amplification. There is a short post on it, but it needs a longer post. I am not sure whether this optimization originated with RocksDB. I don't know whether it is implemented elsewhere, nor do I recall whether it has been described in a conference paper. Update - RocksDB inherited trivial move from LevelDB.

Trivial move can be done for leveled and universal compaction but here I focus on leveled. Trivial move is disabled by default for universal and is enabled via the allow_trivial_move option.

For leveled, when compaction is done from level N to level N+1 there is usually much work: read the input SSTs from level N, read the ~10 input SSTs from level N+1, merge them and write ~10 SSTs to level N+1.

The trivial move can be done when the key range for the input SST from level N doesn't overlap with the key range of any SST in level N+1. In this case the input SST fits in between the key ranges of the level N+1 SSTs and rather than reading and re-writing SSTs the trivial move optimization simply changes metadata (the MANIFEST) to indicate that the SST that had been on level N is now on level N+1, thus the name trivial move. With the exception of the need to rewrite the MANIFEST there is no write-amplification from a trivial move.

Code

This is the code where compaction decides whether to do a trivial move and the logic is in the IsTrivialMove method; a simplified sketch follows the list. A trivial move is not done when ...

  1. ... the compression (lz4, zstd, none) used for the input SST does not match the compression to be used for the output level. If bottommost_compression is zstd and non-bottom levels do not use zstd then trivial moves can be done up to Lmax-1 and compaction from Lmax-1 to Lmax cannot use trivial move (Lmax is max level, Lmax-1 is the next to max level). The logic for that is here.
  2. ... the SST moved to level N+1 would overlap too many SSTs in level N+2. The logic for that is here. Overlapping too many SSTs means that the compaction from level N+1 to N+2 would exceed max_compaction_bytes. The max_compaction_bytes option is 0 by default, and when it is 0 it is set to 25 * target_file_size_base. So overlapping too many SSTs means overlapping ~25 SSTs by default. This check is only done for leveled compaction.
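
This is a simplified sketch of that decision and not the RocksDB implementation; the struct and function names are mine and many details are omitted:

#include <cstdint>
#include <string>
#include <vector>

struct SstFile {
  std::string smallest_key;
  std::string largest_key;
  uint64_t size_bytes;
  int compression;  // e.g. none, lz4, zstd
};

// Sketch of the trivial-move decision for moving one SST from level N to
// level N+1. The real logic is in the IsTrivialMove method.
bool CanTrivialMove(const SstFile& input, int output_compression,
                    const std::vector<SstFile>& level_n1,
                    const std::vector<SstFile>& level_n2,
                    uint64_t max_compaction_bytes) {
  // 1. The SST must not need to be rewritten with a different compression.
  if (input.compression != output_compression) return false;

  // 2. The key range must not overlap any SST in level N+1.
  for (const SstFile& f : level_n1) {
    if (!(input.largest_key < f.smallest_key ||
          f.largest_key < input.smallest_key)) {
      return false;
    }
  }

  // 3. The moved SST must not overlap too much data in level N+2, where
  //    too much means more than max_compaction_bytes (25 *
  //    target_file_size_base by default).
  uint64_t overlap_bytes = 0;
  for (const SstFile& f : level_n2) {
    if (!(input.largest_key < f.smallest_key ||
          f.largest_key < input.smallest_key)) {
      overlap_bytes += f.size_bytes;
    }
  }
  return overlap_bytes <= max_compaction_bytes;
}
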
I encounter #1 when using --benchmarks=fillseq with db_bench. In that case I make sure that all levels use the same compression type. And for the benchmarks that are run after fillseq I will change the config so that the larger levels use a better but more CPU intensive compression while the smaller levels use no compression. I created issue 9292 with the request that RocksDB be enhanced to avoid this complexity.

You might encounter #2 if you increase max_bytes_for_level_multiplier from the default (10) to something larger like 20 or 30, in which case you can explicitly set max_compaction_bytes to something larger than 25 * target_file_size_base. I created issue 9544 for this.

Logs

These are examples of messages in LOG when trivial move is done:
EVENT_LOG_v1 {"time_micros": 1644111871547273, "job": 429, "event": "trivial_move", "destination_level": 5, "files": 1, "total_files_size": 16872686}

[/db_impl/db_impl_compaction_flush.cc:3189] [default] Moved #1 files to level-5 16872686 bytes OK: base level 4 level ...

Monitoring

Columns in the Compaction IO statistics show when trivial move is done:
  • the Moved(GB) column counts the number of bytes that were trivially moved into that level
  • the Write(GB) column counts the number of bytes that were not trivially moved into that level, meaning that normal compaction was done.

Tuesday, February 8, 2022

RocksDB internals: the series

This is a series of posts based on reading RocksDB source to figure out how things are implemented.

RocksDB internals: prefetch and/or readahead

RocksDB can optimize IO for large and small read requests. Small read requests are done for user queries while large read requests can be done for iterators from users and compaction.

By reads I mean reading data from the filesystem. That data might be in the OS page cache; otherwise it must be read from a storage device. Back in the day the choices were to use buffered IO or mmap. Today there is a new option -- O_DIRECT.

tl;dr for POSIX (someone else can document this for Windows)

  • for small reads RocksDB can use posix_fadvise with POSIX_FADV_RANDOM
  • for large reads RocksDB can use posix_fadvise with POSIX_FADV_SEQUENTIAL to request filesystem readahead. It can also do large reads synchronously in the context of the thread that consumes the read. Some of the RocksDB docs describe this as readahead and/or prefetch. To (pedantic) me prefetch (definitely) and readahead (possibly) imply async requests. Writing this post helps me avoid that confusion. A small sketch of the two fadvise hints follows this list.
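
This is a minimal sketch of the two fadvise hints for POSIX and not the RocksDB Env code; the file path is a placeholder and error handling is omitted:

#include <fcntl.h>
#include <unistd.h>

int main() {
  int fd = open("/path/to/sst", O_RDONLY);
  if (fd < 0) return 1;

  // Small, random reads (user point queries): ask the kernel not to do
  // readahead for this file.
  posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

  // Large, sequential reads (compaction, long range scans): ask the kernel
  // to do aggressive readahead.
  posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

  close(fd);
  return 0;
}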

History

In the early days all reads by RocksDB were done one block at a time, where the uncompressed block size was usually between 4kb and 16kb. The block might be compressed so the size of the pread request can be less than that. Unless O_DIRECT is used, the block is not aligned to the filesystem page size so the read request can span filesystem pages, meaning that reading a 2kb compressed block might cause two filesystem pages (4kb each) to be read.

Today not all reads are done one block at a time. There are at least two cases where a file will be read sequentially -- compaction and a long range scan via an iterator -- for which RocksDB can do larger (multi-block) read requests. RocksDB also has a Hint method for open files and that can trigger a call to posix_fadvise.

Some details on iterator readahead are here. Search for automatic readahead.

Prefetch, readahead and large IO requests

RocksDB can request that the kernel does readahead or prefetching via a posix_fadvise call. This is controlled by the access_hint_on_compaction_start option. RocksDB can also explicitly use large read requests for user and compaction iterators. Some of the code is complicated. It is easy to see where the options are used and where the large reads are done. But the path between the option use and the large read isn't as easy to trace compared to what I described in my past few posts. Two important methods are FilePrefetchBuffer::Prefetch and BlockPrefetcher::PrefetchIfNeeded.

Options

The relevant options include:

  • max_auto_readahead_size - the max size of a readahead request for an iterator. This is adaptive; the readahead size starts at 8kb and grows to max_auto_readahead_size (a sketch of the growth follows this list).
  • compaction_readahead_size - sets the readahead size for iterators used by compaction to read input SSTs. The use is here and then here in the BlockBasedTableIterator ctor.
  • prepopulate_block_cache - this isn't readahead but is related. When this option is set then the output from memtable flushes (new L0 SSTs) is written into the block cache. This is useful when O_DIRECT is used as it saves the cost of reading new L0 files into the block cache. It might help to extend this to new SSTs written to the L1. Start here to read the code that caches blocks from memtable flushes.
  • advise_random_on_open - when True, Hint(kRandom) is called when a file is opened for POSIX that triggers a call to posix_fadvise with POSIX_FADV_RANDOM. This is intended for files that will be used for user queries.
  • access_hint_on_compaction_start - specifies the access hint that should be used for reads done by compaction. This is used in SetupForCompaction when Hint is called. The Hint implementation for POSIX calls Fadvise which calls posix_fadvise. It might help to use SEQUENTIAL as the value for access_hint_on_compaction_start but I haven't tested this in a long time. Issue 9448 is open because there is a race WRT the posix_fadvise call for this and advise_random_on_open.
  • Things in ReadOptions:
    • adaptive_readahead - enables adaptive readahead for iterators that increases the size of readahead as more data is read from the iterator
    • readahead_size - default is 0, when set can increase the size of the initial readahead. When not set the initial readahead is 2 blocks.
  • verify_checksums_readahead_size - the size of read requests when verifying checksums on SST files that will be ingested
  • blob_compaction_readahead_size - sets the readahead size for reads from blob files. I am not sure whether this is only for reads from blobs during leveled LSM compaction or during compaction triggered by blob_garbage_collection_force_threshold.
  • log_readahead_size - the readahead size for log reads
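
To make the adaptive readahead concrete, this is a minimal sketch of a readahead size that starts at 8kb and grows toward a cap, assuming a simple doubling policy for illustration; it is not the RocksDB code:

#include <algorithm>
#include <cstddef>
#include <cstdio>

// Sketch: readahead for an iterator that starts small and grows toward a cap.
class AutoReadahead {
 public:
  explicit AutoReadahead(size_t max_bytes) : cur_(8 * 1024), max_(max_bytes) {}

  // Returns the size to use for the next readahead request and doubles it
  // (capped at the max) for the request after that.
  size_t Next() {
    size_t result = cur_;
    cur_ = std::min(cur_ * 2, max_);
    return result;
  }

 private:
  size_t cur_;
  size_t max_;
};

int main() {
  AutoReadahead ra(256 * 1024);  // cap at 256kb
  for (int i = 0; i < 7; i++) {
    printf("readahead %zu bytes\n", ra.Next());
  }
  return 0;
}

With a 256kb cap the sequence is 8kb, 16kb, 32kb, 64kb, 128kb, 256kb, 256kb and so on.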

Other options for large IO requests:

Options that limit the amount of dirty data from compaction and the WAL

  • bytes_per_sync - in some cases when zero the value is set to 1M at startup. When non-zero, RangeSync is called after this many bytes have been written to an SST during compaction. The RangeSync method calls sync_file_range to trigger writeback and reduce the duration of the fsync or fdatasync done when compaction is done writing to that SST. The call to sync_file_range might block but the hope is that it doesn't because the goal is async write-behind. A sketch of the sync_file_range call follows this list.
  • wal_bytes_per_sync - similar to bytes_per_sync but for the WAL. I am confused by the impact of this option. Ignoring the cases where the value of it is assigned to the value of another option, this is the only use for it.
  • strict_bytes_per_sync - when true then SYNC_FILE_RANGE_WAIT_BEFORE is used in the sync_file_range call. Otherwise only SYNC_FILE_RANGE_WRITE is used. When true this provides a strict bound on the amount of data that would have to be written to storage when fsync or fdatasync are called when the SST is full/closed.
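
To make the write-behind concrete, this is a minimal sketch of the sync_file_range call described above and not the RocksDB WritableFile code; the function name is mine and error handling is left to the caller:

#define _GNU_SOURCE
#include <fcntl.h>

// Sketch: trigger writeback for the range [offset, offset + nbytes) of fd.
// With strict=false this only starts writeback (SYNC_FILE_RANGE_WRITE).
// With strict=true it also waits for any previous writeback of that range
// to finish (SYNC_FILE_RANGE_WAIT_BEFORE), which bounds the amount of dirty
// data left for the final fsync or fdatasync.
int RangeSyncSketch(int fd, off_t offset, off_t nbytes, bool strict) {
  unsigned int flags = SYNC_FILE_RANGE_WRITE;
  if (strict) {
    flags |= SYNC_FILE_RANGE_WAIT_BEFORE;
  }
  return sync_file_range(fd, offset, nbytes, flags);
}

A caller would invoke this each time another bytes_per_sync (or wal_bytes_per_sync) bytes have been appended, with strict=true when strict_bytes_per_sync is set.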