Tuesday, May 10, 2022

RocksDB internals: LRU

RocksDB uses LRU to manage the size of the block cache. The LRU can suffer from mutex contention so I describe the implementation in detail here.

An interesting question is whether clock is a better choice than LRU assuming it suffers less from mutex contention. Until that is answered I hope to determine how much improvement can be made by changing the code to not move a block to the head of the LRU after each reference.

Updates:

  • fix a bogus claim about cache_index_and_filter_blocks

The big picture

The LRU is sharded and the number of shards is 2**num_shard_bits where num_shard_bits is an argument to the LRUCacheOptions constructor. With db_bench you can set the --cache_numshardbits option and the default is 6 (64 shards). The use of the LRUCacheOptions constructor in db_bench is here. The cache is created by a call to NewLRUCache.

When cache_index_and_filter_blocks is true then index and filter blocks are stored in the block cache. Doing this provides better control over the amount of memory used by RocksDB. The cache can do midpoint insertion to guard frequently read blocks from being evicted by long range scans. This is enabled by setting cache_index_and_filter_blocks_with_high_priority=true and possibly setting high_pri_pool_ratio in the LRUCacheOptions constructor. From db_bench with the readrandom benchmark, 32 client threads and uniform key access I get 5% more throughput when pin_l0_filter_and_index_blocks_in_cache is true.
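
Here is a minimal sketch of how those options fit together, using the LRUCacheOptions and BlockBasedTableOptions fields named in this post; the capacity and high_pri_pool_ratio values are example numbers, not a recommendation.

#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

int main() {
  rocksdb::LRUCacheOptions cache_opts;
  cache_opts.capacity = 8UL << 30;       // 8 GB total, split evenly across shards
  cache_opts.num_shard_bits = 6;         // 2**6 = 64 shards (the db_bench default)
  cache_opts.high_pri_pool_ratio = 0.5;  // reserve space for high-priority blocks

  rocksdb::BlockBasedTableOptions table_opts;
  table_opts.block_cache = rocksdb::NewLRUCache(cache_opts);
  table_opts.cache_index_and_filter_blocks = true;
  table_opts.cache_index_and_filter_blocks_with_high_priority = true;
  table_opts.pin_l0_filter_and_index_blocks_in_cache = true;

  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
  return 0;
}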

From a comment for NewLRUCache, the total capacity is divided and evenly assigned to each shard. Each shard also gets a mutex, a hash table in which blocks are stored and an LRU list. The mutex is locked when the cache is searched for a block, a block is inserted into the cache or blocks are evicted from the cache. 

The per-shard mutex and per-shard LRU pointers can suffer from contention. On a search the found block is removed from the LRU list and later, when the searcher is done with the block, the block is inserted at the head of the LRU list. Thus, using a block means going through the hot spot twice -- lock/unlock the per-shard mutex each time, and update the (hot) LRU head pointer when inserting the block at the LRU head.

Finally, the block cache stores data blocks, index blocks and filter blocks. With leveled compaction, the work done per level of the LSM tree for a point lookup (Get operation) on levels > 0 is listed below. A typical search accesses the block cache once per level for levels on which the key does not exist and then 3 times for the level on which the key exists. A toy sketch of this flow follows the list.

  1. Search to find the SST whose min and max keys bracket the searched-for key
  2. Get the filter block for that SST from the block cache. If the bloom filter check shows the key can't be in that SST then stop here and go to step 1 for the next level of the LSM tree.
  3. Get the index block for that SST from the block cache. Search the index block to find the data block that might contain the key.
  4. Get the data block for that SST from the block cache. If the key is found there then the search is done, else go to step 1 for the next level of the LSM tree.
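
Here is a toy sketch of that per-level flow. This is not RocksDB code: Sst, BloomMayContain and the nested maps are invented stand-ins for the real SST metadata, filter block, index block and data blocks, and a real level > 0 uses binary search over non-overlapping SSTs rather than a scan.

#include <iostream>
#include <map>
#include <optional>
#include <string>
#include <vector>

struct Sst {
  std::string min_key, max_key;
  // Maps the last key of each data block to that block's key/value pairs,
  // standing in for the index block plus data blocks.
  std::map<std::string, std::map<std::string, std::string>> blocks;
};

// Stand-in for the bloom filter check (step 2). A real filter returns false
// for most keys that are not in the SST.
bool BloomMayContain(const Sst&, const std::string&) { return true; }

std::optional<std::string> GetFromLevel(const std::vector<Sst>& level,
                                        const std::string& key) {
  for (const Sst& sst : level) {
    // Step 1: skip SSTs whose [min_key, max_key] range excludes the key.
    if (key < sst.min_key || key > sst.max_key) continue;
    // Step 2: filter block. Stop early if the filter says "not here".
    if (!BloomMayContain(sst, key)) return std::nullopt;
    // Step 3: index block. Find the data block that might contain the key.
    auto block = sst.blocks.lower_bound(key);
    if (block == sst.blocks.end()) return std::nullopt;
    // Step 4: data block. Search within the block for the key.
    auto kv = block->second.find(key);
    if (kv != block->second.end()) return kv->second;
    return std::nullopt;  // not found on this level, the caller tries the next
  }
  return std::nullopt;
}

int main() {
  std::vector<Sst> level1 = {{"a", "m", {{"m", {{"d", "v1"}}}}}};
  std::cout << GetFromLevel(level1, "d").value_or("not found") << "\n";
  return 0;
}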

Implementation overview

These are my notes from reading the code after looking at PMP stack traces that show contention in the 3 places where the block cache is searched - for data, index and filter blocks. Reformatted traces are here.

The code is mostly in 4 classes/structs: LRUCache, LRUCacheShard, LRUHandle and LRUHandleTable.

LRUHandle

LRUHandle has great comments that I summarize here:
  • each entry is a variable length heap-allocated structure
  • entries are referenced by cache and/or by any external entity
  • the cache keeps all its entries in a hash table. Some entries are also stored on the LRU list.

An LRUHandle is in one of three states:
  1. Referenced externally AND in hash table
    • the entry is *not* in the LRU list
    • refs >= 1 && in_cache == true
  2. Not referenced externally AND in hash table
    • the entry is in the LRU list and can be freed
    • refs == 0 && in_cache == true
  3. Referenced externally AND not in hash table
    • entry is not in the LRU list and not in hash table
    • entry can be freed when refs becomes 0
    • refs >= 1 && in_cache == false
All newly created LRUHandles are in state 1.
  • If you call LRUCacheShard::Release on an entry in state 1, it will go into state 2.
  • To move from state 1 to state 3, either call LRUCacheShard::Erase or LRUCacheShard::Insert with the same key (but possibly different value).
  • To move from state 2 to state 1, use LRUCacheShard::Lookup.
  • Before destruction, make sure that no handles are in state 1. This means that any successful LRUCacheShard::Lookup/LRUCacheShard::Insert has a matching LRUCache::Release (to move into state 2) or LRUCacheShard::Erase (to move into state 3). A sketch of this protocol from the caller's side is below.
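
A sketch of that protocol from the caller's side, using the public Cache API (Insert, Lookup, Release, Erase) rather than the per-shard methods. This assumes the deleter-based Insert signature that RocksDB had when this post was written; newer releases have changed these signatures.

#include <cassert>
#include <memory>
#include <rocksdb/cache.h>
#include <rocksdb/slice.h>

int main() {
  std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(1 << 20);

  int value = 42;
  auto deleter = [](const rocksdb::Slice& /*key*/, void* /*value*/) {};

  // Insert returns a referenced handle: state 1 (external ref, in hash table).
  rocksdb::Cache::Handle* h = nullptr;
  rocksdb::Status s = cache->Insert("key", &value, /*charge=*/1, deleter, &h);
  assert(s.ok() && h != nullptr);

  // Release drops the external ref: state 2 (no external ref, on the LRU list).
  cache->Release(h);

  // Lookup takes a ref and removes the entry from the LRU list: back to state 1.
  h = cache->Lookup("key");
  assert(h != nullptr && *static_cast<int*>(cache->Value(h)) == 42);

  // Erase removes it from the hash table while still referenced: state 3.
  cache->Erase("key");

  // The final Release frees the entry: refs drops to 0 and in_cache is false.
  cache->Release(h);
  return 0;
}
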
Explaining the stack traces

The rest of this post explains what I saw in the stack traces that show contention for LRUCacheShard::Lookup and LRUCacheShard::Release. A block search is done in Lookup and the block is pinned (referenced) if found. The block is unpinned by Release when the search is finished with it. Both require a lock/unlock of the per-shard mutex. And Release can also write to the (contended) head of the LRU list. I also have edited PMP output for those traces.

A cache search starts with ShardedCache::Lookup. That computes the hash to determine the shard and then calls LRUCacheShard::Lookup on that shard. When the searcher is done with the block it calls LRUCacheShard::Release.

LRUCacheShard::Lookup does the following:
  lock LRUCacheShard::mutex_
  if LRUHandleTable::Lookup returns an entry (e)
    if not LRUHandle::HasRefs():
      LRU_Remove(e) // entry was on the LRU list since it had no external references
    LRUHandle::Ref (for e)    // does refs++
    LRUHandle::SetHit (for e) // does flags |= ...
  unlock LRUCacheShard::mutex_

LRUCacheShard::LRU_Remove does the following:
  removes an LRUHandle from double linked LRU list
  calls LRUHandle::CalcTotalCharge
  updates memory accounting

LRUHandle::CalcTotalCharge does the following:
  if (metadata_charge_policy == kFullChargeCacheMetadata)
    call malloc_usable_size to update meta_charge

I was curious about the overhead from frequent calls to malloc_usable_size which can be done when the per-shard mutex is locked. Results are here. Preventing these calls doesn't solve the mutex contention problem.

LRUCacheShard::Release does the following:
  lock LRUCacheShard::mutex_
  call e->Unref()
  
  if ref count is 0 and e->InCache()
    if cache_full
      LRUHandleTable::Remove(e)
      e->SetInCache(False)
    else
      LRU_Insert(e)
  
  if (no longer referenced):
    call e->CalcTotalCharge
  unlock LRUCacheShard::mutex_
  
  if (no longer referenced):
    // This is done after the mutex has been unlocked
    e->Free()

LRUCacheShard::LRU_Insert does the following:
  // LRUCacheShard::mutex_ held by caller
  call e->CalcTotalCharge()
  if high_pri_pool enabled && (e->IsHighPri() || e->HasHit())
    move e to head of LRU list
    MaintainPoolSize()
  else
    move e to head of low-pri pool

LRUCacheShard::MaintainPoolSize does the following:
  while high pri pool too big
    move entry from tail of high pri LRU to low-pri pool

LRUHandleTable::Remove does the following:
  call FindPointer
  remove from double linked list

LRUHandleTable::FindPointer does the following:
  walk a list to find the LRUHandle for a key

The number of entries on the list that FindPointer will encounter should be ~1 on average, and LRUHandleTable::Insert calls LRUHandleTable::Resize to make that likely. I am curious whether calls to Resize are a source of stalls because the per-shard mutex is locked when that happens, but that should not be a common stall.
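
A toy sketch of that resize policy (this is not the RocksDB code): grow the bucket array whenever there are more elements than buckets, so a FindPointer-style chain walk sees about one entry per bucket on average.

#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct Node {
  std::string key;
  Node* next_hash = nullptr;
};

class ToyHandleTable {
 public:
  void Insert(Node* n) {
    size_t b = std::hash<std::string>{}(n->key) % buckets_.size();
    n->next_hash = buckets_[b];
    buckets_[b] = n;
    // Keep the load factor <= 1 so chains stay short (~1 entry on average).
    if (++elems_ > buckets_.size()) Resize();
  }

 private:
  void Resize() {
    // Doubling the bucket count requires rehashing every entry, which is why
    // a Resize done while the per-shard mutex is held could cause a rare stall.
    std::vector<Node*> bigger(buckets_.size() * 2, nullptr);
    for (Node* head : buckets_) {
      while (head != nullptr) {
        Node* next = head->next_hash;
        size_t b = std::hash<std::string>{}(head->key) % bigger.size();
        head->next_hash = bigger[b];
        bigger[b] = head;
        head = next;
      }
    }
    buckets_.swap(bigger);
  }

  std::vector<Node*> buckets_ = std::vector<Node*>(4, nullptr);
  size_t elems_ = 0;
};

int main() {
  ToyHandleTable table;
  std::vector<Node> nodes(100);
  for (size_t i = 0; i < nodes.size(); i++) {
    nodes[i].key = "key" + std::to_string(i);
    table.Insert(&nodes[i]);
  }
  return 0;
}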

Performance bugs: p50 vs p99

One way to classify performance bugs is p50 vs p99. 

  • A p50 performance bug occurs for all workloads
  • A p99 performance bug only happens for some workloads or in some contexts. By some workloads I mean it might only occur during hot backup or load. By some contexts I mean it might only happen under high concurrency or load spikes which don't represent the steady state.
The impact from fixing a p50 perf bug is easy to predict. Everything gets slightly faster and/or more efficient. But p99 perf bugs are outliers -- they don't make everything slower so fixing them doesn't make everything faster.

The impact from fixing a p99 bug is less obvious. On average there might be little impact with respect to average throughput and efficiency. There can be a huge impact with respect to response time goals. The challenge is to quantify the benefit and I have learned over the years that motivating the fix for p99 perf bugs is a hard problem. For MySQL I wrote many blog posts to market open bug reports.

One aspect is that there are easy workarounds for p50 perf bugs. If a bug makes something 10% less efficient then buy 10% more hardware. There might be no workarounds for a p99 perf bug. If something causes a long stall once/day or ruins performance for one infrequent part of your workload then the choices are to either tolerate it (someone suffers) or replace the system.

I spent a few years iterating on p99 bugs in MySQL back in my web-scale MySQL+InnoDB days. It was interesting work and I benefited from production which was a great source of p99 bugs and a colleague named Domas who was great at documenting such bugs.

Monday, May 9, 2022

Using mmap with RocksDB

RocksDB inherited support for mmap from LevelDB. I was curious how performance with mmap compared to buffered IO (pread) and ran benchmarks with db_bench. I haven't had great experiences with mmap for database workloads in the past but tried to be fair.

tl;dr

  • Don't use mmap for IO-bound workloads until issue 9931 is fixed
  • For in-memory databases performance with mmap can be better than pread if db_bench is run with --cache_index_and_filter_blocks=false. Setting --verify_checksum=false provides a smaller performance boost.
Longer than tl;dr

If the database doesn't fit in memory then issue 9931 means there will be significant overfetching on reads from storage and this will ruin performance. I assume the fix is to add madvise calls similar to the posix_fadvise calls used for buffered IO (pread).
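
For illustration, this is the kind of hint I assume the fix would add: after mmapping an SST file, madvise(MADV_RANDOM) tells the kernel not to read ahead, similar to what POSIX_FADV_RANDOM does for buffered reads. This is only a sketch of the idea, not the actual RocksDB change.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstdio>

int main(int argc, char** argv) {
  if (argc != 2) return 1;
  int fd = open(argv[1], O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }

  struct stat st;
  if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

  void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
  if (base == MAP_FAILED) { perror("mmap"); return 1; }

  // Hint that access will be random so the kernel avoids readahead,
  // which is the overfetching described above.
  if (madvise(base, st.st_size, MADV_RANDOM) != 0) perror("madvise");

  // ... random reads via pointers into [base, base + st.st_size) ...

  munmap(base, st.st_size);
  close(fd);
  return 0;
}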

With --cache_index_and_filter_blocks=false the RocksDB process might use more memory than you expect so be careful while chasing performance.  When it is true and --mmap_read=true then index and filter blocks will still be copied into the RocksDB block cache.

While --verify_checksum=false improves performance in some cases, it also prevents the use of checksums to detect file corruption on data blocks. Checksums for filter and index blocks read into the block cache are always checked regardless of the value of --verify_checksum because the BlockFetcher uses a ReadOptions struct with default values and the default for ReadOptions::verify_checksums is true (see here and here). 
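
A sketch of how those db_bench flags map to the API, assuming --mmap_read corresponds to Options::allow_mmap_reads and --verify_checksum to ReadOptions::verify_checksums; the path and key are placeholders.

#include <rocksdb/db.h>
#include <rocksdb/options.h>

#include <string>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.allow_mmap_reads = true;  // db_bench --mmap_read=true

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/mmap_example", &db);
  if (!s.ok()) return 1;

  rocksdb::ReadOptions read_options;
  read_options.verify_checksums = false;  // db_bench --verify_checksum=false
  // Filter and index block reads done by the BlockFetcher use a default
  // ReadOptions, so their checksums are still verified (see above).

  std::string value;
  s = db->Get(read_options, "some-key", &value);

  delete db;
  return 0;
}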

Configuration

I used benchmark.sh to run db_bench on a server with more than 32 CPU cores and hyperthreading enabled. The database was run in three configurations: cached by RocksDB, cached by OS and IO-bound. For cached by RocksDB the block cache was larger than the database files. For cached by OS the block cache was 1G but the database fit in the OS page cache. For IO-bound the database was larger than memory. I don't share graphs for cached by OS but the results were similar to cached by RocksDB.

The benchmarks were run with 32 client threads (--threads=32) and WAL sync disabled. The background writer was limited to 2MB/s.

The benchmarks were run in this order:

  • fillseq - load the database in key order
  • readrandom - point lookups via Get
  • multireadrandom - point lookups via MultiGet
  • revrangeww - reverse range scans with a background writer
  • fwdrangeww - forward range scans with a background writer
  • readww - point lookups via Get with a background writer
  • overwrite - writes via Put run as fast as possible.

IO-bound

The database was loaded with 4B key-value pairs by fillseq. The throughput graphs show that results were lousy for mmap on the read-heavy benchmarks. On my server RocksDB with mmap was fetching ~100X more data per query compared to pread. While tuning filesystem options might have reduced the excessive fetching I am not sure it would prevent it and I am reluctant to change global options on a server to benefit one of the applications that use the server. The real fix and more stats on the excessive fetching are explained in issue 9931.

For pread and mmap the benchmarks used --verify_checksum=true and --cache_index_and_filter_blocks=true.

The graph shows throughput on the y-axis by benchmark + configuration on the x-axis.

Cached by RocksDB

The database was loaded with 40M key-value pairs by fillseq. Results are provided for several configurations. Font size reduced to avoid spanning lines.

  • pread-ver1-cm1: --mmap_read=false --verify_checksum=true --cache_index_and_filter_blocks=true
  • mmap-ver1-cm1: --mmap_read=true --verify_checksum=true --cache_index_and_filter_blocks=true
  • pread-ver0-cm1: --mmap_read=false --verify_checksum=false --cache_index_and_filter_blocks=true
  • pread-ver0-cm0: --mmap_read=false --verify_checksum=false --cache_index_and_filter_blocks=false
  • pread-ver1-cm0: --mmap_read=false --verify_checksum=true --cache_index_and_filter_blocks=false
  • mmap-ver0-cm1: --mmap_read=true --verify_checksum=false --cache_index_and_filter_blocks=true
  • mmap-ver0-cm0: --mmap_read=true --verify_checksum=false --cache_index_and_filter_blocks=false
  • mmap-ver1-cm0: --mmap_read=true --verify_checksum=true --cache_index_and_filter_blocks=false

Command lines are here for pread-ver0-cm1. To use the *-cm0 variants see here. To run the IO-bound variants see here.

The most important db_bench option for performance is --cache_index_and_filter_blocks=false as shown in the results for pread-ver*-cm0 and mmap-ver*-cm0.

Setting --verify_checksum=false helps for mmap but not for pread. See pread-ver0-cm0 vs pread-ver1-cm0 and mmap-ver0-cm0 vs mmap-ver1-cm0. This might be workload specific.

Setting --mmap_read also helps compared to pread but only when --cache_index_and_filter_blocks=false. It is possible that also enabling the option for compaction to pre-populate the block cache would allow pread to be as fast as mmap but I did not try it.

The graph shows throughput on the y-axis by benchmark + configuration on the x-axis.

Thursday, May 5, 2022

To compress or not compress

What are the benefits and costs from compressing the smaller levels of an LSM tree? You won't save much space from compressing them unless the LSM tree doesn't have data beyond L2. I usually suggest no compression for levels 0, 1 and 2. Here I share results from a benchmark to show the impact from using compression for those levels.

tl;dr - if levels 0, 1 and 2 are compressed:

  • Benefits
    • less write-amplification
  • Costs
    • there might be more write stalls
    • write throughput might be lower if there are more write stalls
    • time per compaction/flush job is 2X larger for flush to L0 and compaction to L1/L2 
    • compaction CPU overhead is increased
By "might be more write stalls" I mean that this is only an issue when the write rate is large. For the results here I ran one test with the write rate limited to 10 MB/s and there were no write stalls whether or not L0/L1/L2 were compressed. Then I ran a test without a limit on the write rate and stalls were worse when L0/L1/L2 were compressed.

Quantifying the tl;dr based on the overwrite benchmark:
  • Write-amplification was 1.36X larger when L0/L1/L2 were not compressed
  • Write throughput was 1.32X larger when L0/L1/L2 were not compressed
  • Compaction CPU was 1.36X larger when L0/L1/L2 were compressed via lz4

Overview

The benchmark command lines are here and compaction IO statistics are here. Tests were run on a server with more than 32 CPU cores and hyperthreading was enabled. The server has fast SSD and enough DRAM.

I used the db_bench benchmark client with RocksDB version 6.28.2 to run a sequence of benchmarks. Here I explain results from two of them -- readwhilewriting and overwrite. The readwhilewriting benchmark ran with the write rate limited to 10MB/second. The overwrite benchmark was run without a constraint on the write rate. WAL fsync was disabled to allow the writes to happen as fast as possible.

Levels L3 and larger were compressed with lz4. Tests were then repeated with L0/L1/L2 not compressed vs compressed with lz4.
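
A sketch of the per-level compression setup, assuming the db_bench --min_level_to_compress and --compression_type flags map to Options::compression_per_level roughly as shown; the exact mapping done by db_bench may differ.

#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.num_levels = 7;

  // Baseline: every level compressed with lz4.
  options.compression_per_level.assign(options.num_levels,
                                       rocksdb::kLZ4Compression);

  // Variant tested above: leave L0, L1 and L2 uncompressed, lz4 for the rest.
  options.compression_per_level[0] = rocksdb::kNoCompression;
  options.compression_per_level[1] = rocksdb::kNoCompression;
  options.compression_per_level[2] = rocksdb::kNoCompression;

  return 0;
}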

Data used in this benchmark compresses in half as the db_bench command line included --compression_ratio=0.5.

Results, part 1

Notes:

  • Write rates (wps) are the same for readwhilewriting because the writer had a 10MB/s rate limit
  • Write rates (wps) are ~1.3X larger for overwrite when L0/L1/L2 are not compressed because there were fewer write stalls
  • Bytes flushed per write are ~2X larger when flushes (L0) are not compressed
  • Write-amp is ~1.3X larger when L0/L1/L2 are not compressed. The LSM tree in this workload has 7 levels, the write rate into each level is similar despite the amount of data per level varying greatly. Thus, whether or not you compress 3 of the 7 levels has a significant impact on write-amp. 

Legend:
* wps = writes/second
* ingest - GB of data written into RocksDB
* flush - GB of memtable flushes
* compact - GB of compaction
* w-amp - write-amplification = (flush + compact) / ingest

wps     ingest  flush   compact w-amp  L0/L1/L2-compression
--- readwhilewriting
220555  18.05   17.81   313     18.3   none
224320  18.05    9.84   245     14.1   lz4
--- overwrite
117383  167.25  169.42  2627    16.7   none
 89177  127.07   71.12  1486    12.3   lz4

Results, part 2

Notes:

  • Write rates are ~1.3X larger when L0/L1/L2 are not compressed
  • Write stalls are more common when L0/L1/L2 are compressed
  • The compaction CPU overhead per write is ~1.3X larger when L0/L1/L2 are compressed

Legend:
* wps = writes/second
* c.wsecs - wall clock seconds of compaction
* c.csecs - CPU seconds of compaction
* cpu/w - c.csecs / wps = compaction CPU / write
* stall% - percentage of time writes are stopped or stalled

wps     c.wsecs c.csecs cpu/w   stall%  L0/L1/L2-compression
--- readwhilewriting
220555  2867    2798    NA      0       none
224320  2737    2693    NA      0       lz4
--- overwrite
117383  17084   16464   0.140   49.7    none
 89177  14554   14209   0.159   62.9    lz4

Results, part 3

This shows the time per flush or compaction job.  A flush job is the work done to flush one memtable into an L0 SST. A compaction job is the time to do a compaction into L1+. A compaction job into L1 usually merges all data in L0 with all data in L1. A compaction job into Ln (n > 1) usually merges ~10 files in Ln-1 with 1 file in Ln. Therefore, compaction jobs into L1 usually process more data than jobs into Ln (n > 1). This data is from compaction IO statistics output.

Notes:

  • Compaction/flush jobs into L0/L1/L2 take up to 3X longer when L0/L1/L2 is compressed
  • Compaction jobs into L1 are the longest running. This happens with overwrite because it was run without rate limit for the writer, compaction fell behind and RocksDB allows L0 to grow large in that case. One side-effect is that the L0->L1 compaction job takes longer because it merges all data from L0 with all data from L1.

readwhilewriting - wall clock seconds per compaction/flush job

        compression
level   none    lz4
L0       0.05    0.08
L2       0.53    1.03
L3       0.17    0.47
L4       0.86    0.92
L5       0.86    0.88
L6       0.87    0.88
L7       0.58    0.55

overwrite - wall clock seconds per compaction/flush job


        compression
level   none    lz4
L0       0.11    0.18
L2      13.38   28.86
L3       0.14    0.42
L4       0.31    0.44
L5       0.37    0.40
L6       0.59    0.58
L7       0.77    0.79

Monday, April 25, 2022

clang vs crc32c in RocksDB

By default RocksDB uses crc32c as a block checksum. The source is here and a benchmark is here (db_bench --benchmarks=crc32c). By default RocksDB is compiled with gcc on Linux but it is easy to switch to clang. I did that switch to compare benchmark performance between gcc and clang builds and the initial results were interesting.
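
For reference, a sketch of where the block checksum is chosen, using BlockBasedTableOptions::checksum and the ChecksumType enum; kCRC32c is the default discussed here.

#include <rocksdb/options.h>
#include <rocksdb/table.h>

int main() {
  rocksdb::BlockBasedTableOptions table_opts;
  // The default; alternatives include kxxHash, kxxHash64 and kXXH3.
  table_opts.checksum = rocksdb::kCRC32c;

  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
  return 0;
}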

The overhead for crc32c is apparent for benchmarks that do a lot of IO from fast storage (fast SSD or the OS page cache).

tl;dr for x86 HW

  • For crc32
    • throughput is more than 2X faster with gcc 9.4 than clang for clang versions < 14
    • the difference drops to 1.36X at clang version 14
  • For xxh3
    • clang versions 10 to 13 are ~1.08X faster than gcc 9.4
    • the difference drops to ~1.04X faster for clang 14
  • clang versions 14 and 15 have similar performance, so do clang versions 10 through 13

Update - issue 55153 filed for LLVM

Results

Legend:

  • cc - compiler
  • crc32c, xxhash, xxh3 - db_bench benchmark names
  • others are abbreviated names for db_bench benchmarks: xxh64 = xxhash64, comp = compress, uncomp = uncompress
The numbers in the table are the throughput in MB/s reported by db_bench.

cc      crc32c  xxhash  xxh64   xxh3    comp    uncomp
gcc     19327   5036    9796    26805   661     5201
clang    8935   5043    9847    29327   658     5185
clang11  8367   5050    9849    29344   660     5184
clang12  7594   5024    9832    28392   658     4858
clang13  7571   5021    9832    29031   659     5234
clang14 14239   5008    9847    27946   660     4770
clang15 14266   5027    9811    28020   660     5170

The variance is not shown here, but the results for the uncompress test had too much of it so I ignore them. Perhaps the test needs to run for more time.

Setup

Most of my tests used an Intel NUC described here. This has Ubuntu 20.04 with gcc 9.4.0 and clang 10.0.0-4ubuntu1. After noticing that gcc 9.4 was much faster than clang 10 I tried clang versions 11, 12, 13 and 14 and the scripts here made it easy to install the newer versions of clang.

Other notes:

  • RocksDB is compiled with -O2 and all of the needed flags/includes to get a fast crc32.
  • From clang --version the versions were 11.1.0, 12.0.1, 13.0.1 and 14.0.1.

I then compiled RocksDB using gcc and clang:

# Compile for gcc 9.4
make clean; make DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 -j4 static_lib db_bench; mv db_bench db_bench.gcc.use1

# Compile for clang 10.0.0-4ubuntu1
make clean; CC=/usr/bin/clang CXX=/usr/bin/clang++ USE_CLANG=1 make DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 -j4 static_lib db_bench; mv db_bench db_bench.clang.use1

# Compile for clang versions 11, 12, 13, 14
for v in 11 12 13 14; do make clean; CC=/usr/bin/clang-${v} CXX=/usr/bin/clang++-${v} USE_CLANG=1 make DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 -j4 static_lib db_bench; mv db_bench db_bench.clang${v}.use1 ; done

And then I ran each of the CPU-intensive microbenchmarks 3 times and reported the median result:


for bm in crc32c xxhash xxhash64 xxh3 compress uncompress ; do
  echo; echo $bm; echo
  for x in gcc clang.use1 clang11.use1 clang12.use1 clang13.use1 clang14.use1 ; do
    echo; echo $x
    for z in 1 2 3; do
      ./db_bench.$x --benchmarks=$bm --stats_per_interval=1 --stats_interval_seconds=600 2> /dev/null | grep ^"$bm"
    done
  done
done

Clang versions

$ /usr/bin/clang-11 --version
Ubuntu clang version 11.1.0-++20211011094159+1fdec59bffc1-1~exp1~20211011214622.5
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ /usr/bin/clang-12 --version
Ubuntu clang version 12.0.1-++20211029101322+fed41342a82f-1~exp1~20211029221816.4
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ /usr/bin/clang-13 --version
Ubuntu clang version 13.0.1-++20220120110924+75e33f71c2da-1~exp1~20220120231001.58
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ /usr/bin/clang-14 --version
Ubuntu clang version 14.0.1-++20220423123024+9a3e81e1f91f-1~exp1~20220423003108.124
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ /usr/bin/clang-15 --version
Ubuntu clang version 15.0.0-++20220424052740+3f0f20366622-1~exp1~20220424172822.231
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

Monday, April 4, 2022

Becoming less confused about perf record

I previously wrote about generating bogus flamegraphs and now I can be more clear about what I didn't understand. The default behavior for perf record is frequency-based and on modern Linux you probably get perf record -F 4000 as the default. For now assume the command line also used -p $pid and did not use -e. So in that case perf record does sampling of one process for the cycles event.

Disclaimer -- I am still waiting for an expert to review this.

With frequency-based sampling (-F $frequency) perf record tries to generate $frequency samples per second. The perf tutorial wiki does a good job explaining what happens next. A sample (stack trace) is taken when the PMU cycles counter overflows. The challenge is that the number of cycles consumed per second by your process varies (more when it is CPU-bound, less when it is IO or mutex bound). So the sample period (the number of cycles at which overflow occurs) must be changed dynamically to provide $frequency samples/second. A key point here is that perf is sampling based on the number of cycles consumed by the process named in -p. If it were just doing that based on wall-clock time, for example running PMP once/second, then no adjustment is needed (just wake up $frequency times/second and get samples).

The sample periods are displayed in perf script output and here is an example where 1 cycles and 258 cycles are the sample periods. The sample periods also serve as the weight of a sample. For example, if there are 3 stack traces each with a sample period like the following, then perf report will report that g,h,i accounts for 50% of the time.

stack   period/cycles
a,b,c   1
d,e,f   1
g,h,i   2

The equivalence of sample period with what I called weight in my previous post confused me at first. If perf record is run with -c instead of -F then sample periods are not adjusted and all samples have the same sample period (same weight, a value likely to be much larger than 1) which is another workaround for FlameGraph issue 165 as suggested in the issue report.

As I wrote in a LinkedIn discussion: this is interesting and complicated. Given perf record -g -p $pid and a stack trace taken after a sample period of X cycles then you assume that stack trace represents where CPU was consumed by that process for the entire X cycles of that sample period. When the sample period is constant for all stack traces (perf record -c) then this doesn't matter because all stack traces count equally (have equal weight). But it does matter when the sample period is adjusted (perf record -F or perf record) and stack traces have different sample period (the X in X cycles changes) as in this example.

In addition to the links above, other references helped reduce my confusion. So did a co-worker from the kernel team.


Sunday, April 3, 2022

Examples of the RocksDB trivial move optimization

I recently wrote two posts (here and here) on the trivial move optimization in RocksDB and now have a third post that provides examples via db_bench to demonstrate where it works, the impact of using compression on some levels and the complexity introduced by dynamic leveled. All of the examples use db_bench and the fillseq benchmark that inserts keys in ascending order. The four examples are:

  • dynamic level enabled, compression disabled
  • dynamic level disabled, compression disabled
  • dynamic level enabled, compression enabled
  • dynamic level disabled, compression enabled
Configuration

Dynamic leveled is explained by this post, enabled by this configuration option and set on the db_bench command line by --level_compaction_dynamic_level_bytes.

Compression is enabled by the db_bench flags --min_level_to_compress=2 --compression_type=lz4.
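
A sketch of the option behind that flag, assuming --level_compaction_dynamic_level_bytes maps to the Options field of the same name; compression per level is configured as in the sketch in the "To compress or not compress" post above.

#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  // Dynamic leveled compaction: target level sizes are derived from the size
  // of the last level, which is why the physical level names (L4/L5/L6) differ
  // from the logical ones (L1/L2/L3) discussed below.
  options.level_compaction_dynamic_level_bytes = true;
  return 0;
}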

Dynamic level enabled, compression disabled

The db_bench command line and output are here. This shows the best case as there are no writes for levels beyond L0 -- Write(GB)=0 and Moved(GB) > 0. Trivial move cannot be done for writes into L0 (memtable flush) because the LSM has to write data at least once.

Dynamic level disabled, compression disabled

The db_bench command line and output are here. This shows the best case as there are no writes for levels beyond L0 -- Write(GB)=0 and Moved(GB) > 0.  Trivial move cannot be done for writes into L0 (memtable flush) because the LSM has to write data at least once.

Dynamic level enabled, compression enabled

The db_bench command line and output are here. This shows the impact of compression. When Ln and Ln+1 don't have the same style of compression then trivial move cannot be done from Ln to Ln+1. This is complicated by dynamic leveled because the LSM tree starts with L0 then grows into L0,L6 then grows into L0,L5,L6 and finally reaches L0,L4,L5,L6. Note I am using the level names displayed in compaction IO statistics and I consider these to be physical names. The logical names would be L0,L1,L2,L3. This is one source of complexity created by a great feature (dynamic leveled).

When the tree has data in L0,L5,L6 then the --min_level_to_compress option means that only L6 should be compressed. When the tree reaches L0,L4,L5,L6 then L5 and L6 should be compressed. This is another source of complexity from dynamic leveled.

This explains why both Write(GB) and Moved(GB) are greater than zero for L5 and L6. That doesn't occur with dynamic level disabled.

  • For L0 all bytes written are from memtable flush, Write(GB) > 0 & Moved(GB)=0.
  • For L4 only trivial move is done, Write(GB)=0 & Moved(GB) > 0
  • For L5 both Write(GB) > 0 & Moved(GB) > 0. 
    • With L0,L5,L6  trivial moves are done from L0 to L5 with uncompressed data
    • With L0,L4,L5,L6 writes are done from L4 to L5 because L4 is uncompressed and L5 uses lz4
  • For L6 both Write(GB) > 0 & Moved(GB) > 0. 
    • With L0,L6 trivial moves are done from L0 to L6 with uncompressed data
    • With L0,L5,L6 writes are done from L5 to L6 because L5 is uncompressed and L6 uses lz4
    • With L0,L4,L5,L6 trivial moves are done from L5 to L6 because both use lz4


Dynamic level disabled, compression enabled

The db_bench command line and output are here. When Ln and Ln+1 don't have the same style of compression then trivial move cannot be done from Ln to Ln+1. However dynamic leveled is disabled so the compaction IO statistics are easier to read.

  • For L0 all bytes written are from memtable flush, Write(GB) > 0 & Moved(GB)=0.
  • For L1 only trivial move is done, Write(GB)=0 & Moved(GB) > 0
  • For L2 only writes are done, Write(GB) > 0 & Moved(GB)=0, because L1 is uncompressed & L2 is compressed
  • For L3 only trivial move is done, Write(GB)=0 & Moved(GB) > 0, because L2 and L3 use lz4