Tuesday, May 10, 2022

RocksDB internals: LRU

RocksDB uses LRU to manage the size of the block cache. The LRU can suffer from mutex contention so I describe the implementation in detail here.

An interesting question is whether clock is a better choice than LRU assuming it suffers less from mutex contention. Until that is answered I hope to determine how much improvement can be made by changing the code to not move a block to the head of the LRU after each reference.

Updates:

  • fix a bogus claim about cache_index_and_filter_blocks

The big picture

The LRU is sharded and the number of shards is 2**num_shard_bits where num_shard_bits is an argument to the LRUCacheOptions constructor. With db_bench you can set the --cache_numshardbits option and the default is 6 (64 shards). The use of the LRUCacheOptions constructor in db_bench is here. The cache is created by a call to NewLRUCache.

When cache_index_and_filter_blocks is true then index and filter blocks are stored in the block cache. Doing this provides better control over the amount of memory used by RocksDB. The cache can do midpoint insertion to guard frequently read blocks from being evicted by long range scans. This is enabled by setting cache_index_and_filter_blocks_with_high_priority=true and possibly setting high_pri_pool_ratio in the LRUCacheOptions constructor. From db_bench and the readrandom benchmark with 32 client threads and uniform key access, I get 5% more throughput when pin_l0_filter_and_index_blocks_in_cache is true.
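
For readers who configure this via the C++ API rather than db_bench, here is a minimal sketch, assuming the public LRUCacheOptions and BlockBasedTableOptions interfaces; the capacity and ratio values are examples, not recommendations:

#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeOptionsWithShardedLRU() {
  rocksdb::LRUCacheOptions co;
  co.capacity = 8ull << 30;        // 8GB total, divided evenly across the shards
  co.num_shard_bits = 6;           // 2**6 = 64 shards, like --cache_numshardbits=6
  co.high_pri_pool_ratio = 0.5;    // reserve space for high-priority (index/filter) blocks

  rocksdb::BlockBasedTableOptions bbto;
  bbto.block_cache = rocksdb::NewLRUCache(co);
  bbto.cache_index_and_filter_blocks = true;
  bbto.cache_index_and_filter_blocks_with_high_priority = true;
  bbto.pin_l0_filter_and_index_blocks_in_cache = true;

  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbto));
  return options;
}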

From a comment for NewLRUCache, the total capacity is divided and evenly assigned to each shard. Each shard also gets a mutex, a hash table in which blocks are stored and an LRU list. The mutex is locked when the cache is searched for a block, a block is inserted into the cache or blocks are evicted from the cache. 

The per-shard mutex and per-shard LRU list pointers can suffer from contention. On a search the found block is removed from the LRU list and later, when the searcher is done with the block, the block is inserted at the head of the LRU list. Thus, using a block means going through the hot spot twice -- lock/unlock the per-shard mutex each time, and update the hot LRU head pointer when inserting the block at the LRU head.

Finally, the block cache stores data blocks, index blocks and filter blocks. With leveled compaction, the work that a point lookup (Get operation) does per level of the LSM tree for levels > 0 is listed below (and sketched in code after the list). A typical search accesses the block cache once per level for levels on which the key does not exist and then 3 times for the level on which the key exists.

  1. Search to find the SST for which the search key is between that SST's min and max keys
  2. Get the filter block for that SST from the block cache. Stop if the bloom filter check means the key isn't in that SST and go to step 1 for the next level of the LSM tree.
  3. Get the index block for that SST from the block cache. Search the index block to find the data block that might contain the key.
  4. Get the data block for that SST from the block cache. If the key is found there then the search is done, else go to step 1 for the next level of the LSM tree.
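
The sketch below is not RocksDB code -- the types and helpers are hypothetical stand-ins for the real logic (which lives in places like Version::Get and BlockBasedTable::Get) -- but it shows why a miss typically costs one block cache access per level while a hit costs three on that level:

#include <optional>
#include <string>

// Hypothetical placeholder types and helpers, not RocksDB code. Each
// Get*FromCache call stands for one block cache access (Lookup + Release).
struct SST {};
struct FilterBlock { bool KeyMayMatch(const std::string&) const { return true; } };
struct IndexBlock { int FindDataBlock(const std::string&) const { return 0; } };
struct DataBlock {
  std::optional<std::string> Find(const std::string&) const { return std::nullopt; }
};
std::optional<SST> FindCandidateSST(int, const std::string&) { return std::nullopt; }
FilterBlock GetFilterBlockFromCache(const SST&) { return {}; }
IndexBlock GetIndexBlockFromCache(const SST&) { return {}; }
DataBlock GetDataBlockFromCache(const SST&, int) { return {}; }

std::optional<std::string> PointLookup(const std::string& key, int num_levels) {
  for (int level = 1; level < num_levels; ++level) {
    std::optional<SST> sst = FindCandidateSST(level, key);            // step 1
    if (!sst) continue;
    if (!GetFilterBlockFromCache(*sst).KeyMayMatch(key)) continue;    // step 2: 1st cache access
    IndexBlock index = GetIndexBlockFromCache(*sst);                  // step 3: 2nd cache access
    DataBlock data = GetDataBlockFromCache(*sst, index.FindDataBlock(key));  // step 4: 3rd
    if (auto value = data.Find(key)) return value;
  }
  return std::nullopt;
}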

Implementation overview

These are my notes from reading the code after looking at PMP stack traces that show contention in the 3 places where the block cache is searched - for data, index and filter blocks. Reformatted traces are here.

The code is mostly in 4 classes/structs: LRUHandle, LRUHandleTable, LRUCacheShard and LRUCache.

LRUHandle

LRUHandle has great comments that I summarize here:
  • each entry is a variable length heap-allocated structure
  • entries are referenced by cache and/or by any external entity
  • the cache keeps all its entries in a hash table. Some elements are also stored on LRU list.

An LRUHandle is in one of three states (a usage sketch follows this list):
  1. Referenced externally AND in hash table
    • the entry is *not* in the LRU list
    • refs >= 1 && in_cache == true
  2. Not referenced externally AND in hash table
    • the entry is in the LRU list and can be freed
    • refs == 0 && in_cache == true
  3. Referenced externally AND not in hash table
    • entry is not in the LRU list and not in hash table
    • entry can be freed when refs becomes 0
    • refs >= 1 && in_cache == false
All newly created LRUHandles are in state 1.
  • If you call LRUCacheShard::Release on entry in state 1, it will go into state 2.
  • To move from state 1 to state 3, either call LRUCacheShard::Erase or LRUCacheShard::Insert with the same key (but possibly different value).
  • To move from state 2 to state 1, use LRUCacheShard::Lookup.
  • Before destruction, make sure that no handles are in state 1. This means that any successful LRUCacheShard::Lookup/LRUCacheShard::Insert have a matching LRUCache::Release (to move into state 2) or LRUCacheShard::Erase (to move into state 3).
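
To make the state transitions concrete, here is a minimal sketch of how a caller of the public rocksdb::Cache API pins and releases an entry; the Block type and deleter are hypothetical, and the comments map each call to the states above:

#include <cassert>
#include <memory>
#include <string>
#include "rocksdb/cache.h"

// Hypothetical cached value type and its deleter.
struct Block { std::string data; };
void DeleteBlock(const rocksdb::Slice& /*key*/, void* value) {
  delete static_cast<Block*>(value);
}

void PinAndRelease(const std::shared_ptr<rocksdb::Cache>& cache) {
  // Insert with a handle: the new entry starts in state 1
  // (externally referenced, in the hash table, not on the LRU list).
  rocksdb::Cache::Handle* h = nullptr;
  rocksdb::Status s = cache->Insert("key1", new Block{std::string(4096, 'x')},
                                    /*charge=*/4096, &DeleteBlock, &h);
  assert(s.ok());

  // Release drops the external reference: refs becomes 0 and the entry
  // moves to state 2 (unreferenced, in the hash table, on the LRU list).
  cache->Release(h);

  // Lookup pins the entry again (state 2 -> state 1). Every successful
  // Lookup needs a matching Release or the entry can never be evicted.
  rocksdb::Cache::Handle* h2 = cache->Lookup("key1");
  if (h2 != nullptr) {
    Block* b = static_cast<Block*>(cache->Value(h2));
    (void)b;  // use the pinned block here
    cache->Release(h2);
  }
}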

Explaining the stack traces

The rest of this post explains what I saw in the stack traces that show contention for LRUCacheShard::Lookup and LRUCacheShard::Release. A block search is done in Lookup and the block is pinned (referenced) if found. The block is unpinned by Release when the search is finished with it. Both require a lock/unlock of the per-shard mutex. And Release can also write to the (contended) head of the LRU list. I also have edited PMP output for those traces.

A cache search starts with ShardedCache::Lookup. That computes the hash to determine the shard and then calls LRUCacheShard::Lookup on that shard. When the searcher is done with the block it calls Release, which ends up in LRUCacheShard::Release (covered below). LRUCacheShard::Lookup does the following:

  lock LRUCacheShard::mutex_
  if LRUHandleTable::Lookup returns an entry (e)
    if not LRUHandle::HasRefs():
      LRU_Remove(e) // entry was on the LRU list since it had no references
    LRUHandle::Ref (for e)    // does refs++
    LRUHandle::SetHit (for e) // does flags |= ...
  unlock LRUCacheShard::mutex_


LRU_Remove does the following:

  removes an LRUHandle from the doubly linked LRU list
  calls LRUHandle::CalcTotalCharge
  updates memory accounting

LRUHandle::CalcTotalCharge does the following:

  if (metadata_charge_policy == kFullChargeCacheMetadata)
    call malloc_usable_size to update meta_charge

I was curious about the overhead from frequent calls to malloc_usable_size which can be done when the per-shard mutex is locked. Results are here. Preventing these calls doesn't solve the mutex contention problem.

LRUCacheShard::Release does the following:

  lock LRUCacheShard::mutex_
  call e->Unref()
  
  if ref count is 0 and e->InCache()
    if cache_full
      LRUHandleTable::Remove(e)
      e->SetInCache(False)
    else
      LRU_Insert(e)
  
  if (no longer referenced):
    call e->CalcTotalCharge
  unlock LRUCacheShard::mutex_
  
  if (no longer referenced):
    // This is done after the mutex has been unlocked
    e->Free()

LRU_Insert does the following:

  // LRUCacheShard::mutex_ held by caller
  call e->CalcTotalCharge()
  if high_pri_pool enabled && (e->IsHighPri() || e->HasHit())
    move e to head of LRU list
    MaintainPoolSize()
  else
    move e to head of low-pri pool

MaintainPoolSize does the following:

  while high pri pool too big
    move entry from tail of high pri LRU to low-pri pool

LRUHandleTable::Remove does the following:

  call FindPointer
  remove from double linked list

LRUHandleTable::FindPointer does the following:

  walk a list to find the LRUHandle for a key

The number of entries on the list that FindPointer will encounter should be ~1 on average and  LRUHandleTable::Insert calls LRUHandleTable::Resize to make that likely. I am curious whether calls to Resize are a source of stalls because a per-shard mutex is locked when that happens. That won't be a common stall.

Performance bugs: p50 vs p99

One way to classify performance bugs is p50 vs p99. 

  • A p50 performance bug occurs for all workloads
  • A p99 performance bug only happens for some workloads or in some contexts. By some workloads I mean it might only occur during hot backup or load. By some contexts I mean it might only happen under high concurrency or load spikes which don't represent the steady state.
The impact from fixing a p50 perf bug is easy to predict. Everything gets slightly faster and/or more efficient. But p99 perf bugs are outliers -- they don't make everything slower so fixing them doesn't make everything faster.

The impact from fixing a p99 bug is less obvious. On average there might be little impact with respect to average throughput and efficiency. There can be a huge impact with respect to response time goals. The challenge is to quantify the benefit and I have learned over the years that motivating the fix for p99 perf bugs is a hard problem. For MySQL I wrote many blog posts to market open bug reports.

One aspect is that there are easy workarounds for p50 perf bugs. If a bug makes something 10% less efficient then buy 10% more hardware. There might be no workarounds for a p99 perf bug. If something causes a long stall once/day or ruins performance for one infrequent part of your workload then the choices are to either tolerate it (someone suffers) or replace the system.

I spent a few years iterating on p99 bugs in MySQL back in my web-scale MySQL+InnoDB days. It was interesting work and I benefited from production which was a great source of p99 bugs and a colleague named Domas who was great at documenting such bugs.

Monday, May 9, 2022

Using mmap with RocksDB

RocksDB inherited support for mmap from LevelDB. I was curious how performance with mmap compared with buffered IO (pread) and ran benchmarks with db_bench. I haven't had great experiences with mmap for database workloads in the past but tried to be fair.

tl;dr

  • Don't use mmap for IO-bound workloads until issue 9931 is fixed
  • For in-memory databases performance with mmap can be better than pread if db_bench is run with --cache_index_and_filter_blocks=false. Setting --verify_checksum=false provides a smaller performance boost.

Longer than tl;dr

If the database doesn't fit in memory then issue 9931 means there will be significant overfetching on reads from storage and this will ruin performance. I assume the fix is to add madvise calls similar to the posix_fadvise calls used for buffered IO (pread).
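
To illustrate the difference, a small sketch of the two hints (this is not the RocksDB code, just the POSIX calls it would need): buffered IO can hint via posix_fadvise on the file descriptor, while an mmap-based read path would need madvise on the mapped region.

#include <fcntl.h>
#include <sys/mman.h>
#include <cstddef>

// Hint that access will be random, to discourage kernel readahead/overfetching.
void HintRandomAccess(int fd, void* mapped, size_t len) {
  posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);  // for the pread (buffered IO) path
  madvise(mapped, len, MADV_RANDOM);           // the analogous hint for an mmap'd file
}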

With --cache_index_and_filter_blocks=false the RocksDB process might use more memory than you expect so be careful while chasing performance.  When it is true and --mmap_read=true then index and filter blocks will still be copied into the RocksDB block cache.

While --verify_checksum=false improves performance in some cases, it also prevents the use of checksums to detect file corruption on data blocks. Checksums for filter and index blocks read into the block cache are always checked regardless of the value of --verify_checksum because the BlockFetcher uses a ReadOptions struct with default values and the default for ReadOptions::verify_checksums is true (see here and here). 
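
For reference, a minimal sketch of the C++ options that (as far as I know) sit behind the db_bench flags used in this post; the values are only illustrative:

#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeMmapTestOptions() {
  rocksdb::Options options;
  options.allow_mmap_reads = true;                  // --mmap_read=true

  rocksdb::BlockBasedTableOptions bbto;
  bbto.cache_index_and_filter_blocks = false;       // --cache_index_and_filter_blocks=false
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbto));
  return options;
}

rocksdb::ReadOptions MakeReadOptionsNoChecksum() {
  rocksdb::ReadOptions ro;
  ro.verify_checksums = false;                      // --verify_checksum=false (a per-read option)
  return ro;
}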

Configuration

I used benchmark.sh to run db_bench on a server with more than 32 CPU cores and hyperthreading enabled. The database was run in three configurations: cached by RocksDB, cached by OS and IO-bound. For cached by RocksDB the block cache was larger than the database files. For cached by OS the block cache was 1G but the database fits in the OS page cache. For IO-bound the database is larger than memory. I don't share graphs for cached by OS but the results were similar to cached by RocksDB.

The benchmarks were run with 32 client threads (--threads=32) and WAL sync disabled. The background writer was limited to 2MB/s.

The benchmarks were run in this order:

  • fillseq - load the database in key order
  • readrandom - point lookups via Get
  • multireadrandom - point lookups via MultiGet
  • revrangeww - reverse range scans with a background writer
  • fwdrangeww - forward range scans with a background writer
  • readww - point lookups via Get with a background writer
  • overwrite - writes via Put run as fast as possible.

IO-bound

The database was loaded with 4B key-value pairs by fillseq. The throughput graphs show that results were lousy for mmap on the read-heavy benchmarks. On my server RocksDB with mmap was fetching ~100X more data per query compared to pread. While tuning filesystem options might have reduced the excessive fetching I am not sure it would prevent it and I am reluctant to change global options on a server to benefit one of the applications that use the server. The real fix and more stats on the excessive fetching are explained in issue 9931.

For pread and mmap the benchmarks used --verify_checksum=true and --cache_index_and_filter_blocks=true.

The graph shows throughput on the y-axis by benchmark + configuration on the x-axis.

Cached by RocksDB

The database was loaded with 40M key-value pairs by fillseq. Results are provided for several configurations. Font size reduced to avoid spanning lines.

  • pread-ver1-cm1: -mmap_read=false --verify_checksum=true --cache_index_and_filter_blocks=true
  • mmap-ver1-cm1: --mmap_read=true --verify_checksum=true --cache_index_and_filter_blocks=true
  • pread-ver0-cm1: --mmap_read=false --verify_checksum=false --cache_index_and_filter_blocks=true
  • pread-ver0-cm0: --mmap_read=false --verify_checksum=false --cache_index_and_filter_blocks=false
  • pread-ver1-cm0: --mmap_read=false --verify_checksum=true --cache_index_and_filter_blocks=false
  • mmap-ver0-cm1: --mmap_read=true --verify_checksum=false --cache_index_and_filter_blocks=true
  • mmap-ver0-cm0: --mmap_read=true --verify_checksum=false --cache_index_and_filter_blocks=false
  • mmap-ver1-cm0: --mmap_read=true --verify_checksum=true --cache_index_and_filter_blocks=false

Command lines are here for pread-ver0-cm1. To use the *-cm0 variants see here. To run the IO-bound variants see here.

The most important db_bench option for performance is --cache_index_and_filter_blocks=false as shown in the results for pread-ver*-cm0 and mmap-ver*-cm0.

Setting --verify_checksum=false helps for mmap but not for pread. See pread-ver0-cm0 vs pread-ver1-cm0 and mmap-ver0-cm0 vs mmap-ver1-cm0. This might be workload specific.

Setting --mmap_read also helps compared to pread but only when --cache_index_and_filter_blocks=false. It is possible that also enabling the option for compaction to pre-populate the block cache would allow pread to be as fast as mmap but I did not try it.

The graph shows throughput on the y-axis by benchmark + configuration on the x-axis.

Thursday, May 5, 2022

To compress or not compress

What are the benefits and costs from compressing the smaller levels of an LSM tree? You won't save much space from compressing them unless the LSM tree doesn't have data beyond L2. I usually suggest no compression for levels 0, 1 and 2. Here I share results from a benchmark to show the impact from using compression for those levels.

tl;dr - if levels 0, 1 and 2 are compressed:

  • Benefits
    • less write-amplification
  • Costs
    • there might be more write stalls
    • write throughput might be lower if there are more write stalls
    • time per compaction/flush job is 2X larger for flush to L0 and compaction to L1/L2 
    • compaction CPU overhead is increased
By might be more write stalls I mean that this is only an issue when the write rate is large. For the results here I ran one test with the write-rate limited to 10 MB/s and there were no write stalls whether or not L0/L1/L2 were compressed. Then I ran a test without a limit on the write rate and stalls were worse when L0/L1/L2 were compressed.

Quantifying the tl;dr based on the overwrite benchmark:
  • Write-amplification was 1.36X larger when L0/L1/L2 were not compressed
  • Write throughput was 1.32X larger when L0/L1/L2 were not compressed
  • Compaction CPU was 1.36X larger when L0/L1/L2 were compressed via lz4

Overview

The benchmark command lines are here and compaction IO statistics are here. Tests were run on a server with more than 32 CPU cores and hyperthreading was enabled. The server has fast SSD and enough DRAM.

I used the db_bench benchmark client for RocksDB and RocksDB version 6.28.2 to run a sequence of benchmarks. Here I explain results from two of them -- readwhilewriting and overwrite. The readwhilewriting benchmark ran with the write limited to 10MB/second. The overwrite benchmark was run without a constraint on the write rate. WAL fsync was disabled to allow the writes to happen as fast as possible.

Levels L3 and larger were compressed with lz4. Tests were then repeated with L0/L1/L2 not compressed vs compressed with lz4.

Data used in this benchmark compresses in half as the db_bench command line included --compression_ratio=0.5.
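
A minimal sketch of how that per-level choice can be expressed with the C++ API, assuming Options::compression_per_level (db_bench drives the same thing via --min_level_to_compress); as far as I know, when the vector is shorter than the number of levels the last entry applies to the deeper levels:

#include "rocksdb/options.h"

rocksdb::Options MakeCompressionOptions() {
  rocksdb::Options options;
  // No compression for L0/L1/L2, lz4 from L3 on.
  options.compression_per_level = {
      rocksdb::kNoCompression, rocksdb::kNoCompression,
      rocksdb::kNoCompression, rocksdb::kLZ4Compression};
  return options;
}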

Results, part 1

Notes:

  • Write rates (wps) are the same for readwhilewriting because the writer had a 10M/s rate limit
  • Write rates (wps) are ~1.3X larger for overwrite when L0/L1/L2 are not compressed because there were fewer write stalls
  • Bytes flushed per write are ~2X larger when flushes (L0) are not compressed
  • Write-amp is ~1.3X larger when L0/L1/L2 are not compressed. The LSM tree in this workload has 7 levels, the write rate into each level is similar despite the amount of data per level varying greatly. Thus, whether or not you compress 3 of the 7 levels has a significant impact on write-amp. 

Legend:
* wps = writes/second
* ingest - GB of data written into RocksDB
* flush - GB of memtable flushes
* compact - GB of compaction
* w-amp - write-amplification = (flush + compact) / ingest

wps     ingest  flush   compact w-amp  L0/L1/L2-compression
--- readwhilewriting
220555  18.05   17.81   313     18.3   none
224320  18.05    9.84   245     14.1   lz4
--- overwrite
117383  167.25  169.42  2627    16.7   none
 89177  127.07   71.12  1486    12.3   lz4

Results, part 2

Notes:

  • Write rates are ~1.3X larger when L0/L1/L2 are not compressed
  • Write stalls are more common when L0/L1/L2 are compressed
  • The compaction CPU overhead per write is ~1.3X larger when L0/L1/L2 are compressed

Legend:
* wps = writes/second
* c.wsecs - wall clock seconds of compaction
* c.csecs - CPU seconds of compaction
* cpu/w - c.csecs / wps = compaction CPU / write
* stall% - percentage of time writes are stopped or stalled

wps     c.wsecs c.csecs cpu/w   stall%  L0/L1/L2-compression
--- readwhilewriting
220555  2867    2798    NA      0       none
224320  2737    2693    NA      0       lz4
--- overwrite
117383  17084   16464   0.140   49.7    none
 89177  14554   14209   0.159   62.9    lz4

Results, part 3

This shows the time per flush or compaction job.  A flush job is the work done to flush one memtable into an L0 SST. A compaction job is the time to do a compaction into L1+. A compaction job into L1 usually merges all data in L0 with all data in L1. A compaction job into Ln (n > 1) usually merges ~10 files in Ln-1 with 1 file in Ln. Therefore, compaction jobs into L1 usually process more data than jobs into Ln (n > 1). This data is from compaction IO statistics output.

Notes:

  • Compaction/flush jobs into L0/L1/L2 take up to 3X longer when L0/L1/L2 is compressed
  • Compaction jobs into L1 are the longest running. This happens with overwrite because it was run without rate limit for the writer, compaction fell behind and RocksDB allows L0 to grow large in that case. One side-effect is that the L0->L1 compaction job takes longer because it merges all data from L0 with all data from L1.

readwhilewriting - wall clock seconds per compaction/flush job

        compression
level   none    lz4
L0       0.05    0.08
L2       0.53    1.03
L3       0.17    0.47
L4       0.86    0.92
L5       0.86    0.88
L6       0.87    0.88
L7       0.58    0.55

overwrite - wall clock seconds per compaction/flush job


        compression
level   none    lz4
L0       0.11    0.18
L2      13.38   28.86
L3       0.14    0.42
L4       0.31    0.44
L5       0.37    0.40
L6       0.59    0.58
L7       0.77    0.79

Monday, April 25, 2022

clang vs crc32c in RocksDB

By default RocksDB uses crc32c as a block checksum. The source is here and a benchmark is here (db_bench --benchmarks=crc32c). By default RocksDB is compiled with gcc on Linux but it is easy to switch to clang. I did that switch to compare benchmark performance between gcc and clang builds and the initial results were interesting.

The overhead for crc32c is apparent for benchmarks that do a lot of IO from fast storage (fast SSD or the OS page cache).
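
For context, the block checksum function is chosen per table via BlockBasedTableOptions::checksum; a minimal sketch (names assumed from the public header, values illustrative):

#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeChecksumOptions() {
  rocksdb::BlockBasedTableOptions bbto;
  bbto.checksum = rocksdb::kCRC32c;   // the default; rocksdb::kXXH3 selects xxh3
  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbto));
  return options;
}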

tl;dr for x86 HW

  • For crc32
    • throughput is more than 2X faster with gcc 9.4 than clang for clang versions < 14
    • the difference drops to 1.36X at clang version 14
  • For xxh3
    • clang versions 10 to 13 are ~1.08X faster than gcc 9.4
    • the difference drops to ~1.04X faster for clang 14
  • clang versions 14 and 15 have similar performance, so do clang versions 10 through 13

Update - issue 55153 filed for LLVM

Results

Legend:

  • cc - compiler
  • crc32c, xxhash, xxh3 - db_bench benchmark names
  • others are abbreviated names for db_bench benchmarks: xxh64 = xxhash64, comp = compress, uncomp = uncompress
The numbers in the table are the throughput in MB/s reported by db_bench.

cc      crc32c  xxhash  xxh64   xxh3    comp    uncomp
gcc     19327   5036    9796    26805   661     5201
clang    8935   5043    9847    29327   658     5185
clang11  8367   5050    9849    29344   660     5184
clang12  7594   5024    9832    28392   658     4858
clang13  7571   5021    9832    29031   659     5234
clang14 14239   5008    9847    27946   660     4770
clang15 14266   5027    9811    28020   660     5170

It is not shown here but the results for the uncompress test had too much variance so I ignore them. Perhaps the test needs to run for more time.

Setup

Most of my tests used an Intel NUC described here. This has Ubuntu 20.04 with gcc 9.4.0 and clang 10.0.0-4ubuntu1. After noticing that gcc 9.4 was much faster than clang 10 I tried clang versions 11, 12, 13 and 14 and the scripts here made it easy to install the newer versions of clang.

Other notes:

  • RocksDB is compiled with -O2 and all of the needed flags/includes to get a fast crc32.
  • From clang --version the versions were 11.1.0, 12.0.1, 13.0.1 and 14.0.1.

I then compiled RocksDB using gcc and clang:

# Compile for gcc 9.4
make clean; make DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 -j4 static_lib db_bench; mv db_bench db_bench.gcc.use1

# Compile for clang 10.0.0-4ubuntu1
make clean; CC=/usr/bin/clang CXX=/usr/bin/clang++ USE_CLANG=1 make DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 -j4 static_lib db_bench; mv db_bench db_bench.clang.use1

# Compile for clang versions 11, 12, 13, 14
for v in 11 12 13 14; do make clean; CC=/usr/bin/clang-${v} CXX=/usr/bin/clang++-${v} USE_CLANG=1 make DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 -j4 static_lib db_bench; mv db_bench db_bench.clang${v}.use1 ; done

And then I ran each of the CPU-intensive microbenchmarks 3 times and reported the median result:


for bm in crc32c xxhash xxhash64 xxh3 compress uncompress ; do
  echo; echo $bm; echo
  for x in gcc clang.use1 clang11.use1 clang12.use1 clang13.use1 clang14.use1 ; do
    echo; echo $x
    for z in 1 2 3; do
      ./db_bench.$x --benchmarks=$bm --stats_per_interval=1 --stats_interval_seconds=600 2> /dev/null | grep ^"$bm"
    done
  done
done

Clang versions

$ /usr/bin/clang-11 --version
Ubuntu clang version 11.1.0-++20211011094159+1fdec59bffc1-1~exp1~20211011214622.5
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ /usr/bin/clang-12 --version
Ubuntu clang version 12.0.1-++20211029101322+fed41342a82f-1~exp1~20211029221816.4
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ /usr/bin/clang-13 --version
Ubuntu clang version 13.0.1-++20220120110924+75e33f71c2da-1~exp1~20220120231001.58
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ /usr/bin/clang-14 --version
Ubuntu clang version 14.0.1-++20220423123024+9a3e81e1f91f-1~exp1~20220423003108.124
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ /usr/bin/clang-15 --version
Ubuntu clang version 15.0.0-++20220424052740+3f0f20366622-1~exp1~20220424172822.231
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

Monday, April 4, 2022

Becoming less confused about perf record

I previously wrote about generating bogus flamegraphs and now I can be more clear about what I didn't understand. The default behavior for perf record is frequency-based and on modern Linux you probably get perf record -F 4000 as the default. For now assume the command line also used -p $pid and did not use -e. So in that case perf record does sampling of one process for the cycles event.

Disclaimer -- I am still waiting for an expert to review this.

With frequency-based sampling (-F $frequency) perf record tries to generate $frequency samples per second. The perf tutorial wiki does a good job explaining what happens next. A sample (stack trace) is taken when the PMU cycles counter overflows. The challenge is that the number of cycles consumed per second by your process varies (more when it is CPU-bound, less when it is IO or mutex bound). So the sample period (the number of cycles at which overflow occurs) must be changed dynamically to provide $frequency samples/second. A key point here is that perf is sampling based on the number of cycles consumed by the process named in -p. If it were just doing that based on wall-clock time, for example running PMP once/second, then no adjustment would be needed (just wake up $frequency times/second and get samples).

The sample periods are displayed in perf script output and here is an example where 1 cycles and 258 cycles are the sample periods. The sample periods also serve as the weight of a sample. For example, if there are 3 stack traces each with a sample period like the following, then perf report will report that g,h,i accounts for 50% of the time.

stack   period/cycles
a,b,c   1
d,e,f   1
g,h,i   2

The equivalence of sample period with what I called weight in my previous post confused me at first. If perf record is run with -c instead of -F then sample periods are not adjusted and all samples have the same sample period (same weight, a value likely to be much larger than 1) which is another workaround for FlameGraph issue 165 as suggested in the issue report.

As I wrote in a LinkedIn discussion: this is interesting and complicated. Given perf record -g -p $pid and a stack trace taken after a sample period of X cycles then you assume that stack trace represents where CPU was consumed by that process for the entire X cycles of that sample period. When the sample period is constant for all stack traces (perf record -c) then this doesn't matter because all stack traces count equally (have equal weight). But it does matter when the sample period is adjusted (perf record -F or perf record) and stack traces have different sample period (the X in X cycles changes) as in this example.

In addition to the links above, these helped reduce my confusion. So did a co-worker from the kernel team.


Sunday, April 3, 2022

Examples of the RocksDB trivial move optimization

I recently wrote two posts (here and here) on the trivial move optimization in RocksDB and now have a third post that provides examples via db_bench to demonstrate where it works, the impact of using compression on some levels and the complexity introduced by dynamic leveled. All of the examples use db_bench and the fillseq benchmark that inserts keys in ascending order. The four examples are:

  • dynamic level enabled, compression disabled
  • dynamic level disabled, compression disabled
  • dynamic level enabled, compression enabled
  • dynamic level disabled, compression enabled

Configuration

Dynamic leveled is explained by this post, enabled by this configuration option and set on the db_bench command line by --level_compaction_dynamic_level_bytes.

Compression is enabled by the db_bench flags --min_level_to_compress=2 --compression_type=lz4.
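
A minimal sketch of the corresponding C++ options (my best-effort mapping of those db_bench flags, so treat it as an approximation):

#include "rocksdb/options.h"

rocksdb::Options MakeTrivialMoveTestOptions() {
  rocksdb::Options options;
  options.level_compaction_dynamic_level_bytes = true;  // --level_compaction_dynamic_level_bytes
  // --min_level_to_compress=2 --compression_type=lz4: no compression below L2, lz4 from L2 on.
  options.compression_per_level = {rocksdb::kNoCompression, rocksdb::kNoCompression,
                                   rocksdb::kLZ4Compression};
  return options;
}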

Dynamic level enabled, compression disabled

The db_bench command line and output are here. This shows the best case as there are no writes for levels beyond L0 -- Write(GB)=0 and Moved(GB) > 0. Trivial move cannot be done for writes into L0 (memtable flush) because the LSM has to write data at least once.

Dynamic level disabled, compression disabled

The db_bench command line and output are here. This shows the best case as there are no writes for levels beyond L0 -- Write(GB)=0 and Moved(GB) > 0.  Trivial move cannot be done for writes into L0 (memtable flush) because the LSM has to write data at least once.

Dynamic level enabled, compression enabled

The db_bench command line and output are here. This shows the impact of compression. When Ln and Ln+1 don't have the same style of compression then trivial move cannot be done from Ln to Ln+1. This is complicated by dynamic leveled because the LSM tree starts with L0 then grows into L0,L6 then grows into L0,L5,L6 and finally reaches L0,L4,L5,L6. Note I am using the level names displayed in compaction IO statistics and I consider these to be physical names. The logical names would be L0,L1,L2,L3. This is one source of complexity created by a great feature (dynamic leveled).

When the tree has data in L0,L5,L6 then the --min_level_to_compress option means that only L6 should be compressed. When the tree reaches L0,L4,L5,L6 then L5 and L6 should be compressed. This is another source of complexity from dynamic leveled.

This explains why both Write(GB) and Moved(GB) are greater than zero for L5 and L6. That doesn't occur with dynamic level disabled.

  • For L0 all bytes written are from memtable flush, Write(GB) > 0 & Moved(GB)=0.
  • For L4 only trivial move is done, Write(GB)=0 & Moved(GB) > 0
  • For L5 both Write(GB) > 0 & Moved(GB) > 0. 
    • With L0,L5,L6  trivial moves are done from L0 to L5 with uncompressed data
    • With L0,L4,L5,L6 writes are done from L4 to L5 because L4 is uncompressed and L5 uses lz4
  • For L6 both Write(GB) > 0 & Moved(GB) > 0. 
    • With L0,L6 trivial moves are done from L0 to L6 with uncompressed data
    • With L0,L5,L6 writes are done from L5 to L6 because L5 is uncompressed and L6 uses lz4
    • With L0,L4,L5,L6 trivial moves are done from L5 to L6 because both use lz4


Dynamic level disabled, compression enabled

The db_bench command line and output are here. When Ln and Ln+1 don't have the same style of compression then trivial move cannot be done from Ln to Ln+1. However dynamic leveled is disabled so the compaction IO statistics are easier to read.

  • For L0 all bytes written are from memtable flush, Write(GB) > 0 & Moved(GB)=0.
  • For L1 only trivial move is done, Write(GB)=0 & Moved(GB) > 0
  • For L2 only writes are done, Write(GB) > 0 & Moved(GB)=0,  because L1 is uncompressed & L2 is
  • For L3 only trivial move is done, Write(GB)=0 & Moved(GB) > 0, because L2 and L3 use lz4

Friday, April 1, 2022

One problem with flamegraphs

In the past I used PMP a lot with some usage of perf. PMP was great for me because my focus was on stalls from IO and synchronization. PMP is simple but good for showing off-cpu stalls. As I catch up to modern performance tools I have begun to use flamegraphs with perf. However, I also started to share bogus flamegraphs because there is a bug and I did not validate that the flamegraphs matched the output from perf report. Fortunately, Domas pointed out the problem.

The bug is FlameGraph issue 165 and the problem is that stackcollapse-perf.pl assumes that all stack traces are weighted equally but the output from perf script does not match that assumption. In the perf script output the stack traces have weights like 1 cycles and 258 cycles (collected via a script like this). If you use those values to compute time spent in each stack trace then your results will match those provided by perf report (at least in the cases I checked). But stackcollapse-perf.pl doesn't parse that number and just assumes each stack trace has weight 1 (see here).

I started with PR 250 but it is ~20 diffs behind and I am not sure it is correct. My variant of PR 250 that works for me is here. And now I get valid flamegraphs again that are extremely useful. Note that I am not an expert in Perl, perf or the FlameGraph repo.

Update - I am now less confused about perf record



Wednesday, March 9, 2022

The db_bench mixgraph benchmark

The mixgraph workload for db_bench was first described in this paper. The paper's first author was on the RocksDB team and is now a professor at ASU (see here and here). I am happy that he spent time making RocksDB better and is now doing research. A few years near production and in R&D is a great education for researchers.

Mixgraph

The mixgraph workload is interesting because it is more complex than the other db_bench workloads. By workload I mean the value foo in db_bench --benchmarks=$foo. Most, or perhaps all, of the workloads are listed here. By perhaps all I acknowledge there is some tech debt in db_bench and I hope to reduce that this year. Use --benchmarks=mixgraph to run the mixgraph benchmark. 

The mixgraph workload mimics social graph OLTP. It is implemented as part of db_bench and uses the RocksDB API directly. This is a simpler version of LinkBench. The workload uses a configurable mix of Put, Get and Seek+Next operations (see here). It doesn't use transactions (yet). The access pattern distributions are configurable. The value size distributions are configurable. There is much that is configurable via options, which is good but adds complexity. The default value for most of these options is zero and I suspect that isn't a good default for several of them. For example, if --iter_k and --iter_sigma aren't explicitly set then scan_length will be zero and Next won't get called after Seek. In other cases, the implication of changing an option requires math and stats. While the default for the max value length is 1024 the value length per Put is determined by a Pareto distribution and is ~35 with --value_k=0.2615 --value_sigma=25.45. I created issue 9672 to improve monitoring for mixgraph to make it easier to learn the implications of such changes. Regardless, start by reading the appendix in this paper if you want to use mixgraph. All of the db_bench options for mixgraph are listed here.

Another problem, that I will try to fix soon, is that mixgraph shouldn't be run with more than one thread, so use db_bench --benchmarks=mixgraph --threads=1 or don't specify --threads as the default value is 1. See issue 9651 for details.

Using mixgraph

I created shell scripts to make it easy for me to run mixgraph and collect performance data. They are here. The scripts provide the four configurations listed in the appendix (all_dist, prefix_dist, all_random, prefix_random) and then one more, hwperf, from elsewhere. With these scripts the average value length is ~35 bytes. The average scan length, the number of Next calls per Seek, varies by configuration. It is zero for hwperf and ~560 for the others when mix_max_scan_len is 10000. The scan length is zero for hwperf because it doesn't set --iter_k or --iter_sigma. For this reason I suggest you use the other configurations.

I was able to get between 10,000 and 20,000 operations/s from a single thread depending on the configuration for an IO-bound benchmark and a server with a fast SSD.

I spend (too) much time running benchmarks that use uniform distributions for access patterns. With mixgraph you can get a workload with a non-uniform distribution and I was curious how that impacted the block cache hit rate. This table shows the block cache hit rate as a function of block cache size. The block cache size varies from 1 to 128 GB. The database size is 574 GB. That table also has many other metrics. The results make it clear that all_random and prefix_random are more likely to use uniform random access distributions while all_dist and prefix_dist are not.

I was curious about the IO overhead for Get vs Seek+Next and repeated tests limited to Get or Seek+Next operations. The former is called Get-only and the latter is called Seek-only in the results. The results are here and interesting details are:

  • block cache hit rate for Get-only is much lower than for Seek-only. I assume that each Next call is a block cache access and thus many Next calls per Seek inflate the hit rate for that.
  • operations/second for Get-only is much higher than for Seek-only because Seek-only does ~560 Next calls per Seek, except for hwperf.

Thursday, February 24, 2022

RocksDB externals: avoiding problems with db_bench --seed

This is the first post in the RocksDB externals series. This post is about the db_bench --seed option. The seed is used by random number generators that generate keys to be used in RocksDB requests. Setting the seed allows you to have deterministic key sequences in tests. 

If you run db_bench --benchmarks=readrandom --seed=10 twice then the second run uses the same sequence of keys as the first run. That is usually not the desired behavior. If the database is larger than memory but there is a cache for storage (OS page cache + buffered IO, etc) then the first test warms the cache and the second run can run faster than expected thanks to a better cache hit rate.

Your benchmark results will be misleading if you are not careful. I have made this mistake more than once. Sometimes I spot it quickly, but I have lost more than a few hours from this. This post might help you avoid repeating my mistakes.

I have written about this before (here, here and here). 

One way to avoid problems

Assuming you want to run db_bench 1+ times for your benchmark, the following will usually avoid problems from reuse of the seed:

  • Optionally clear the OS page cache: sync; echo 3 > /proc/sys/vm/drop_caches
  • db_bench --seed=$( date +%s ) --threads=X --benchmarks=...
  • db_bench --seed=$( date +%s ) --threads=X --benchmarks=...
  • db_bench --seed=$( date +%s ) --threads=X --benchmarks=...
The above pattern is fine as long as the number of seconds between each run is larger than the number of threads (X above). See below for explanation of the number of seconds vs number of threads issue. 

The above pattern works when --benchmarks has one benchmark. If it has two, for example --benchmarks=readrandom,overwrite then the seeds used for readrandom will be reused for overwrite. There is no way to avoid that problem with RocksDB's db_bench until issue 9632 is fixed. I just noticed that LevelDB's db_bench has a fix for it.

Number of threads vs number of seconds

Note that date +%s is the number of seconds since the epoch.

With db_bench --seed=1 --threads=4, four threads are created to make RocksDB requests and they use the seeds 1, 2, 3 and 4. If you then run db_bench --seed=2 --threads=4 then four threads are created and use the seeds 2, 3, 4, 5. So there is some overlap in the seeds between the first and second runs. If --seed=$( date +%s ) is used in place of --seed=1 and --seed=2 then seed overlap is avoided when the value for --threads in the first run is smaller than the number of seconds required for the first run.

Implementation details

Both RocksDB and LevelDB have a seed reuse bug:

  • With RocksDB there is reuse when --benchmarks lists more than one test (issue 9632)
  • With LevelDB there is reuse across runs of db_bench

How --seed is used in RocksDB's db_bench (a small sketch follows these lists):

  • When there are N threads, the IDs for the threads range from 0 to N-1 (see here)
  • And from this code:
    • Each thread has a random number generator (RNG) initialized with a seed
    • The value of base_seed comes from --seed when not zero, else it is 1000
    • The value of the seed for each per-thread RNG is base_seed + ID 

How seeds are used in LevelDB's db_bench:
  • seed_base is hardwired to 1000 (see here)
  • A counter is incremented each time a thread is created (see here)
  • The per-thread seed is seed_base + counter (see here)
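
A minimal sketch of that seed scheme (not the actual db_bench code; std::mt19937_64 stands in for the db_bench RNG) to show why two runs whose base seeds differ by less than the thread count share per-thread seeds:

#include <cstdint>
#include <random>
#include <vector>

std::vector<std::mt19937_64> MakeThreadRngs(uint64_t base_seed, int num_threads) {
  std::vector<std::mt19937_64> rngs;
  for (int id = 0; id < num_threads; ++id) {
    // Thread IDs are 0..N-1 and each thread's seed is base_seed + id, so
    // base seeds 1 and 2 with 4 threads overlap on seeds 2, 3 and 4.
    rngs.emplace_back(base_seed + id);
  }
  return rngs;
}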

RocksDB externals: the series

This is a series of posts on using RocksDB to complement the series on RocksDB Internals.

Tuesday, February 15, 2022

RocksDB internals: trivial move and benchmarks

My last post explained some aspects of the trivial move optimization based on configuration options and source code. This post explains it based on benchmarks.

I have speculated in the past about whether workloads with N streams of inserts would benefit from the trivial move optimization in RocksDB and I am happy to discover that they do.

But first let me explain streams of inserts here and see this post for more detail. 

  • An insert stream is a sequence of keys in ascending or descending order. 
  • For a given stream the order is always ascending or always descending, the stream cannot switch.
  • The keys in the stream are inserted (a RocksDB Put) and by insert I mean that the key does not already exist in the database.
  • When there are N streams the keys from each stream use a different prefix to guarantee that RocksDB will see N different streams.
The goal is to determine the write-amplification for each workload. When write-amp=1 then trivial move is always used during compaction. This is the best case. But even when write-amp is greater than one there can still be a benefit from trivial move and that benefit can be estimated by the ratio between bytes that get trivial move and bytes that must be rewritten during compaction. Numbers for these are listed in the compaction IO statistics output -- the Moved(GB) vs Write(GB) columns.

tl;dr
  • Trivial move works great and as expected for a small number of insert streams
  • It wasn't used as much as I expected when there were 8+ streams. Perhaps my expectations were wrong or perhaps this can be improved. 
Update - WRT the second bullet point, the problem was me. The issue is that RocksDB does not know there are N streams and, when creating SSTs, is unlikely to end them on keys that respect the streams, which makes trivial move less likely. LSMs with optimizations for time-series workloads have a solution for this. Solutions are likely to require guidance from the user. I don't have an opinion on whether detecting the streams at runtime is a good thing to do and won't consume too much CPU. The sst_partitioner in RocksDB is something that would consume the guidance were it provided.

Implementation details

All of my tests used db_bench with --benchmarks=fillseq although I had to modify db_bench in some cases. 
  • asc.1 - to get one ascending stream I used db_bench as-is. In this case the key is derived from a counter that is incremented (see here).
  • desc.1 - to get one descending stream I used a modified version of db_bench. The modification is to decrement the counter rather than increment it in the key generator (see here).
  • asc.N - to get N concurrent and ascending streams I used --keys_per_prefix and --prefix_size. Note to self, don't forget to set prefix_size because it is 0 by default. By concurrent, I mean there are N streams during one run of db_bench.
  • asc.a, asc.b, asc.c - To get N (well, N= 2 or 3) not-concurrent and ascending streams I used 3 modified versions of db_bench. Each used a different one-byte prefix ('a', 'b', or 'c') for their keys and then the incrementing counter made the keys ascending. There was one ascending stream per run of db_bench and db_bench was run two or three times.
The shell scripts I used to run tests and a diff with all changes for db_bench are here.

Workloads

I confirmed that trivial move was done for each of the workloads described below. The database was empty at the start of each workload. In the best case all compaction was done by trivial move and write-amp=1, and the best case occurs for all workloads except N ascending & concurrent.

The results for the N ascending & concurrent workloads are the most interesting because trivial move doesn't happen for all compactions. I expected it to be done for all compactions into Ln+1 where Ln is the first level large enough to have N or more SSTs. However, dynamic leveled compaction makes this complicated, so I also use move-ratio to judge the frequency of trivial move where move-ratio is Moved(GB) / Write(GB) for levels excluding L0. When write-amp=1 then move-ratio=INF.

For each workload I provide a link to compaction IO stats reported near test end. The Write(GB) shows how many bytes were written by compaction (for levels > 0) or flushed (for L0). The Moved(GB) columns show how many bytes used trivial move. When write-amp=1, then the Write(GB) column is zero for all levels except L0.

The workloads were:

  • 1 ascending
    • Use asc.1 and run db_bench one time to insert 100M rows, see r.asc.sh
    • Write-amp=1 because all compaction was done by trivial move, stats are here
  • 1 descending
    • Use desc.1 and run db_bench one time to insert 100M rows, see r.desc.sh
    • Write-amp=1 because all compaction was done by trivial move, stats are here
  • N ascending & concurrent
    • Use asc.N for N = 1, 2, 4, ..., 524288, 1048576 but I only share results up to N=64
    • Run db_bench one time for each value of N to insert 100M rows
    • For N=1 write-amp=1, move-ratio=INF and stats are here
    • For N=2, write-amp=4.0, move-ratio=0.779 and stats are here
    • For N=4, write-amp=5.3, move-ratio=0.276  and stats are here
    • For N=8, write-amp=5.8, move-ratio=0.097  and stats are here
    • For N=16, write-amp=6.6, move-ratio=0.066  and stats are here
    • For N=32, write-amp=6.7, move-ratio=0.023  and stats are here
    • For N=64, write-amp=6.7, move-ratio=0.016  and stats are here
  • 2 ascending but not concurrent
    • These run db_bench two times to insert 50M rows/run, see r.asc.a.b.sh and r.asc.b.a.sh
    • Write-amp=1 because all compaction was done by trivial move
    • Use asc.a then asc.b, stats from the end of asc.a and asc.b are here
    • Use asc.b then asc.a, stats from the end of asc.b and asc.a are here
  • 3 ascending, but not concurrent

Thursday, February 10, 2022

RocksDB internals: trivial move

RocksDB has an optimization called trivial move that reduces write amplification. There is a short post on it, but it needs a longer post. I am not sure whether this optimization originated with RocksDB. I don't know whether it is implemented elsewhere, nor do I recall whether it has been described in a conference paper. Update - RocksDB inherited trivial move from LevelDB.

Trivial move can be done for leveled and universal compaction but here I focus on leveled. Trivial move is disabled by default for universal and is enabled via the allow_trivial_move option.

For leveled, when compaction is done from level N to level N+1 there is usually much work: read the input SSTs from level N, read the ~10 input SSTs from level N+1, merge them and write ~10 SSTs to level N+1.

The trivial move can be done when the key range for the input SST from level N doesn't overlap with the key range of any SST in level N+1. In this case the input SST fits in between the key ranges of the level N+1 SSTs and rather than reading and re-writing SSTs the trivial move optimization simply changes metadata (the MANIFEST) to indicate that the SST that had been on level N is now on level N+1, thus the name trivial move. With the exception of the need to rewrite the MANIFEST there is no write-amplification from a trivial move.

Code

This is the code where compaction decides whether to do a trivial move and the logic is in the IsTrivialMove method. A trivial move is not done when ...

  1. ... the compression (lz4, zstd, none) used for the input SST does not match the compression to be used for the output level. If bottommost_compression is zstd and non-bottom levels do not use zstd then trivial moves can be done up to Lmax-1 and compaction from Lmax-1 to Lmax cannot use trivial move (Lmax is max level, Lmax-1 is the next to max level). The logic for that is here.
  2. ... the SST moved to level N+1 would overlap too many SSTs in level N+2. The logic for that is here. And overlap too many SSTs means that the compaction from level N+1 to N+2 would exceed max_compaction_bytes. The max_compaction_bytes option is 0 by default, and when 0 is set to 25 * target_file_size_base. So overlap too many SSTs == overlap ~25 SSTs by default. This check is only for leveled compaction.
I encounter #1 when using --benchmarks=fillseq with db_bench. In that case I make sure that all levels use the same compression type. And for the benchmarks that are run after fillseq I will change the config so that the larger levels use a better but more CPU intensive compression while the smaller levels use no compression. I created issue 9292 with the request that RocksDB be enhanced to avoid this complexity.

You might encounter #2 if you increase max_bytes_for_level_multiplier from the default (10) to something larger like 20 or 30 in which case you can explicitly set max_compaction_bytes to something larger than 25 * target_file_size_base. I created issue 9544 for this.
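
A minimal sketch of that workaround, assuming the Options fields named above; the values are illustrative:

#include "rocksdb/options.h"

rocksdb::Options MakeLargerFanoutOptions() {
  rocksdb::Options options;
  options.max_bytes_for_level_multiplier = 20;     // raised from the default of 10
  options.target_file_size_base = 64 << 20;        // 64MB SSTs
  // max_compaction_bytes defaults to 25 * target_file_size_base when left at 0;
  // raise it so trivial moves are not rejected for overlapping too many SSTs
  // in the grandparent level.
  options.max_compaction_bytes = 50ull * options.target_file_size_base;
  return options;
}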

Logs

These are examples of messages in LOG when trivial move is done:
EVENT_LOG_v1 {"time_micros": 1644111871547273, "job": 429, "event": "trivial_move", "destination_level": 5, "files": 1, "total_files_size": 16872686}

[/db_impl/db_impl_compaction_flush.cc:3189] [default] Moved #1 files to level-5 16872686 bytes OK: base level 4 level ...

Monitoring

Columns in Compaction IO statistics shows when trivial move is done:
  • the Moved(GB) column counts the number of bytes that were trivial moved into that level
  • the Write(GB) column counts the number of bytes that were not trivial moved into that level and by not trivial moved I mean that normal compaction was done. 

Tuesday, February 8, 2022

RocksDB internals: the series

This is a series of posts based on reading RocksDB source to figure out how things are implemented.

RocksDB internals: prefetch and/or readahead

RocksDB can optimize IO for large and small read requests. Small read requests are done for user queries while large read requests can be done for iterators from users and compaction.

By reads I mean reading data from the filesystem. That data might be in the OS page cache; otherwise it must be read from a storage device. Back in the day the choices were to use buffered IO or mmap. Today there is a new option -- O_DIRECT.

tl;dr for POSIX (someone else can document this for Windows)

  • for small reads RocksDB can use posix_fadvise with POSIX_FADV_RANDOM
  • for large reads RocksDB can use posix_fadvise with POSIX_FADV_SEQUENTIAL to request filesystem readahead. It can also do large reads synchronously in the context of the thread that consumes the read. Some of the RocksDB docs describe this as readahead and/or prefetch. To (pedantic) me prefetch (definitely) and readahead (possibly) imply async requests. Writing this post helps me avoid that confusion.

History

In the early days all reads by RocksDB were done one block at a time where the uncompressed block size was usually between 4kb and 16kb. The block might be compressed so the size of the pread request can be less than that. Unless O_DIRECT is used, the block is not aligned to the filesystem page size so the read request can span filesystem pages meaning that reading a 2kb compressed block might end up causing two filesystem pages (4kb each) getting read.

Today, not all reads are done one block at a time. There are at least two cases where a file will be read sequentially -- compaction and a long range scan via an iterator -- for which RocksDB can do larger (multi-block) read requests. RocksDB also has a Hint method for open files and that can trigger a call to posix_fadvise.

Some details on iterator readahead are here. Search for automatic readahead.

Prefetch, readahead and large IO requests

RocksDB can request that the kernel does readahead or prefetching via a posix_fadvise call. This is controlled by the access_hint_on_compaction option. RocksDB can also explicitly use large read requests for user and compaction iterators. Some of the code is complicated. It is easy to see where the options are used and where the large reads are done. But the path between the option use and the large read isn't as easy to trace compared to what I described in my past few posts. Two important methods are FilePrefetchBuffer::Prefetch and BlockPrefetcher::PrefetchIfNeeded.
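
As one concrete example, a minimal sketch of a long range scan that opts in to the iterator readahead described below (assuming the public ReadOptions fields; the 2MB value is illustrative):

#include <memory>
#include "rocksdb/db.h"

void ScanWithReadahead(rocksdb::DB* db) {
  rocksdb::ReadOptions ro;
  ro.adaptive_readahead = true;   // let the readahead size grow as the scan continues
  ro.readahead_size = 2 << 20;    // optional: start with a larger initial readahead
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // consume it->key() and it->value()
  }
}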

Options

The relevant options include:

  • max_auto_readahead_size - the max size of a readahead request for an iterator. This is adaptive, the readahead size starts at 8kb and grows to max_auto_readahead_size.
  • compaction_readahead_size - sets the readahead size for iterators used by compaction to read input SSTs. The use is here and then here in the BlockBasedTableIterator ctor.
  • prepopulate_block_cache - this isn't readahead but is related. When this option is set then the output from memtable flushes (new L0 SSTs) are written into the block cache. This is useful when O_DIRECT is used as it saves the cost of reading new L0 files into the block cache. It might help to extend this to new SSTs written to the L1. Start here to read the code that caches blocks from memtable flushes.
  • advise_random_on_open - when True, Hint(kRandom) is called when a file is opened for POSIX that triggers a call to posix_fadvise with POSIX_FADV_RANDOM. This is intended for files that will be used for user queries.
  • access_hint_on_compaction_start - specifies the access hint that should be used for reads done by compaction. This is used in SetupForCompaction when Hint is called. The Hint implementation for POSIX calls Fadvise which calls posix_fadvise. It might help to use SEQUENTIAL as the value for access_hint_on_compaction_start but I haven't tested this in a long time. Issue 9448 is open because there is a race WRT to the posix_fadvise call for this and advise_random_on_open.
  • Things in ReadOptions:
    • adaptive_readahead - enables adaptive readahead for iterators that increases the size of readahead as more data is read from the iterator
    • readahead_size - default is 0, when set can increase the size of the initial readahead. When not set the initial readahead is 2 blocks.
  • verify_checksums_readahead_size - the size of read requests when verifying checksums on SST files that will be ingested
  • blob_compaction_readahead_size - sets the readahead size for reads from blob files. I am not sure whether this is only for reads from blobs during leveled LSM compaction or during compaction triggered by blob_garbage_collection_force_threshold.
  • log_readahead_size - the readahead size for log reads

Other options for large IO requests limit the amount of dirty data from compaction and the WAL:

  • bytes_per_sync - in some cases when zero the value is set to 1M at startup. When non-zero, RangeSync is called after this many bytes have been written to an SST during compaction. The RangeSync method calls sync_file_range to trigger writeback and reduce the duration of the fsync or fdatasync done when compaction is done writing to that SST. The call to sync_file_range might block but the hope is that it doesn't because the goal is async write-behind.
  • wal_bytes_per_sync - similar to bytes_per_sync but for the WAL. I am confused by the impact of this option. Ignoring the cases where the value of it is assigned to the value of another option, this is the only use for it.
  • strict_bytes_per_sync - when true then SYNC_FILE_RANGE_WAIT_BEFORE is used in the sync_file_range call. Otherwise only SYNC_FILE_RANGE_WRITE is used. When true this provides a strict bound on the amount of data that would have to be written to storage when fsync or fdatasync are called when the SST is full/closed.