
Showing posts from February, 2022

RocksDB externals: avoiding problems with db_bench --seed

This is the first post in the RocksDB externals series. This post is about the db_bench --seed option. The seed is used by the random number generators that generate keys for RocksDB requests. Setting the seed gives you deterministic key sequences in tests. If you run db_bench --benchmarks=readrandom --seed=10 twice then the second run uses the same sequence of keys as the first run. That is usually not the desired behavior. If the database is larger than memory but there is a cache for storage (OS page cache + buffered IO, etc) then the first run warms the cache and the second run can be faster than expected thanks to a better cache hit rate. Your benchmark results will be misleading if you are not careful. I have made this mistake more than once. Sometimes I spot it quickly, but I have lost more than a few hours to it. This post might help you avoid repeating my mistakes. I have written about this before (here, here and here). One way to avoid problems …
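The problem is easy to demonstrate with any seeded RNG. Here is a Python sketch (not db_bench's actual key generator, just the principle): reusing one seed replays the identical key sequence, while a per-run seed such as the current time gives each run fresh keys.

```python
import random
import time

def keys_for_run(seed, n=5, key_space=10_000):
    # Stand-in for db_bench's key generator: a fixed seed yields
    # the same key sequence on every run.
    rng = random.Random(seed)
    return [rng.randrange(key_space) for _ in range(n)]

# Two runs with --seed=10 replay the same keys, so the second run
# re-reads whatever the first run pulled into the OS page cache.
assert keys_for_run(10) == keys_for_run(10)

# A per-run seed (e.g. the current time) avoids the replay.
fresh = keys_for_run(int(time.time()))
```

With db_bench itself, one option is to pass the current time as the seed, e.g. --seed=$( date +%s ), so each invocation uses a different value.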

RocksDB externals: the series

This is a series of posts on using RocksDB to complement the series on RocksDB Internals.
- db_bench --seed
- db_bench --benchmarks=mixgraph, running the mixgraph benchmark
- Examples of the trivial move optimization using db_bench

RocksDB internals: trivial move and benchmarks

My last post explained some aspects of the trivial move optimization based on configuration options and source code. This post explains it based on benchmarks. I have speculated in the past about whether workloads with N streams of inserts would benefit from the trivial move optimization in RocksDB and I am happy to discover that they do. But first let me explain streams of inserts here; see this post for more detail. An insert stream is a sequence of keys in ascending or descending order. For a given stream the order is always ascending or always descending; the stream cannot switch. The keys in the stream are inserted (a RocksDB Put) and by insert I mean that the key does not already exist in the database. When there are N streams the keys from each stream use a different prefix to guarantee that RocksDB will see N different streams. The goal is to determine the write-amplification for each workload. When write-amp=1 then trivial move is always used during compaction. This …
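A small Python sketch (helper name is hypothetical) of what N insert streams with distinct prefixes look like: each stream is ascending on its own, and the prefixes keep the streams' key ranges disjoint, which is what lets RocksDB treat them as N separate streams.

```python
def make_streams(n_streams, keys_per_stream):
    # Each stream gets a unique prefix so its key range is disjoint
    # from every other stream's key range.
    streams = []
    for s in range(n_streams):
        prefix = f"{s:02d}:"
        streams.append([f"{prefix}{k:08d}" for k in range(keys_per_stream)])
    return streams

streams = make_streams(3, 4)
# Each stream is ascending (never switches direction).
assert all(st == sorted(st) for st in streams)
# Prefixes keep the streams disjoint: stream 0's largest key sorts
# before stream 1's smallest key, and so on.
assert streams[0][-1] < streams[1][0] < streams[1][-1] < streams[2][0]
```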

RocksDB internals: trivial move

RocksDB has an optimization called trivial move that reduces write amplification. There is a short post on it, but it needs a longer post. I am not sure whether this optimization originated with RocksDB. I don't know whether it is implemented elsewhere, nor do I recall whether it has been described in a conference paper. Update - RocksDB inherited trivial move from LevelDB. Trivial move can be done for leveled and universal compaction but here I focus on leveled. Trivial move is disabled by default for universal and is enabled via the allow_trivial_move option. For leveled, when compaction is done from level N to level N+1 there is usually much work: read the input SSTs from level N, read the ~10 overlapping SSTs from level N+1, merge them and write ~10 SSTs to level N+1. The trivial move can be done when the key range for the input SST from level N doesn't overlap with the key range of any SST in level N+1. In this case the input SST fits in between the key ranges of the level …
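The non-overlap condition is the heart of it. This is a minimal Python sketch of that check (not RocksDB's actual code, which operates on SST metadata in C++), where each SST is represented as its (smallest key, largest key) pair:

```python
def can_trivial_move(input_sst, next_level_ssts):
    # The input SST from level N can be trivially moved to level N+1
    # when its [smallest, largest] key range does not overlap the key
    # range of any SST already in level N+1.
    lo, hi = input_sst
    return all(hi < s_lo or lo > s_hi for (s_lo, s_hi) in next_level_ssts)

# Level N+1 holds SSTs covering b..d and m..p; an input SST covering
# f..j fits between them, so it can be moved without a merge.
assert can_trivial_move(("f", "j"), [("b", "d"), ("m", "p")])

# An input SST covering c..g overlaps b..d, so a normal (merging)
# compaction is required instead.
assert not can_trivial_move(("c", "g"), [("b", "d"), ("m", "p")])
```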

RocksDB internals: the series

This is a series of posts based on reading RocksDB source to figure out how things are implemented.
- bytes pending compaction
- intra-L0 compaction
- compaction stall counters
- write rate limiter
- prefetch and/or readahead
- trivial move and benchmark results
- Block cache

RocksDB internals: prefetch and/or readahead

RocksDB can optimize IO for large and small read requests. Small read requests are done for user queries while large read requests can be done for iterators from users and for compaction. By reads I mean reading data from the filesystem. That data might be in the OS page cache, otherwise it must be read from a storage device. Back in the day the choices were to use buffered IO or mmap. Today there is a new option -- O_DIRECT.

tl;dr for POSIX (someone else can document this for Windows):
- for small reads RocksDB can use posix_fadvise with POSIX_FADV_RANDOM
- for large reads RocksDB can use posix_fadvise with POSIX_FADV_SEQUENTIAL to request filesystem readahead. It can also do large reads synchronously in the context of the thread that consumes the read.

Some of the RocksDB docs describe this as readahead and/or prefetch. To (pedantic) me prefetch (definitely) and readahead (possibly) imply async requests. Writing this post helps me avoid that confusion.

History: In the early days all reads by …
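The two advice calls are easy to try from Python, whose os module is a thin wrapper over posix_fadvise(2). This is a sketch of the hints themselves (POSIX-only; RocksDB issues them from C++, not like this):

```python
import os
import tempfile

# Create a small file to advise on.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * 4096)
    # Small random reads: tell the kernel readahead won't help.
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
    # Large sequential reads: ask the kernel for aggressive readahead.
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
finally:
    os.close(fd)
    os.unlink(path)
```

An offset and length of 0 applies the advice to the whole file; the advice is a hint, so the kernel is free to ignore it.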