Small Datum: To compress or not compress

What are the benefits and costs of compressing the smaller levels of an LSM tree? You won't save much space by compressing them unless the LSM tree doesn't have data beyond L2. I usually suggest no compression for levels 0, 1 and 2. Here I share results from a benchmark to show the impact of using compression for those levels.

tl;dr - if levels 0, 1 and 2 are compressed:

Benefits

* less write-amplification

Costs

* there might be more write stalls
* write throughput might be lower if there are more write stalls
* time per compaction/flush job is roughly 2X larger (up to 3X) for flush to L0 and compaction to L1/L2
* compaction CPU overhead is increased

By "might be more write stalls" I mean that stalls are only an issue when the write rate is high. For the results here I ran one test with the write rate limited to 10 MB/s and there were no write stalls whether or not L0/L1/L2 were compressed. Then I ran a test without a limit on the write rate and stalls were worse when L0/L1/L2 were compressed.

Quantifying the tl;dr based on the overwrite benchmark:

* Write-amplification was 1.36X larger when L0/L1/L2 were not compressed
* Write throughput was 1.32X larger when L0/L1/L2 were not compressed
* Compaction CPU per write was 1.14X larger when L0/L1/L2 were compressed via lz4 (cpu/w of 0.159 vs 0.140)

Overview

The benchmark command lines are here and compaction IO statistics are here. Tests were run on a server with more than 32 CPU cores and hyperthreading was enabled. The server has fast SSD and enough DRAM.

I used the db_bench benchmark client for RocksDB and RocksDB version 6.28.2 to run a sequence of benchmarks. Here I explain results from two of them: readwhilewriting and overwrite. The readwhilewriting benchmark ran with the write rate limited to 10 MB/s. The overwrite benchmark was run without a constraint on the write rate. WAL fsync was disabled to allow the writes to happen as fast as possible.

Levels L3 and larger were compressed with lz4. Tests were then repeated with L0/L1/L2 not compressed vs compressed with lz4.

The data used in this benchmark compresses to half its size because the db_bench command line included --compression_ratio=0.5.

Results, part 1

Notes:

* Write rates (wps) are the same for readwhilewriting because the writer had a 10 MB/s rate limit
* Write rates (wps) are ~1.3X larger for overwrite when L0/L1/L2 are not compressed because there were fewer write stalls
* Bytes flushed per write are ~2X larger when flushes (L0) are not compressed
* Write-amp is ~1.3X larger when L0/L1/L2 are not compressed. The LSM tree in this workload has 7 levels and the write rate into each level is similar despite the amount of data per level varying greatly. Thus, whether or not you compress 3 of the 7 levels has a significant impact on write-amp.
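
The last note can be checked with a back-of-the-envelope model. This is only a sketch under two assumptions that the text suggests but does not state precisely: every level receives the same number of uncompressed bytes per user write, and lz4 halves whatever it compresses (from --compression_ratio=0.5). The function name is made up for illustration.

```python
# Simplified model, not a measurement.
# Assumption 1: each of the 7 levels receives the same uncompressed bytes per write.
# Assumption 2: a compressed level writes half of those bytes.
LEVELS = 7
COMPRESSION_RATIO = 0.5


def relative_bytes_written(uncompressed_levels: int) -> float:
    """Total bytes written, in units of one level's uncompressed input."""
    compressed_levels = LEVELS - uncompressed_levels
    return uncompressed_levels * 1.0 + compressed_levels * COMPRESSION_RATIO


all_compressed = relative_bytes_written(0)      # all 7 levels use lz4
small_uncompressed = relative_bytes_written(3)  # L0/L1/L2 not compressed
ratio = small_uncompressed / all_compressed

print(f"predicted write-amp ratio: {ratio:.2f}")
```

The model predicts a ratio of about 1.43, somewhat above the measured 1.30 (readwhilewriting) and 1.36 (overwrite), which is plausible given how rough the equal-write-rate assumption is.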

Legend:
* wps = writes/second
* ingest - GB of data written into RocksDB
* flush - GB of memtable flushes
* compact - GB of compaction
* w-amp - write-amplification = (flush + compact) / ingest

wps      ingest   flush    compact  w-amp  L0/L1/L2-compression

--- readwhilewriting

220555   18.05    17.81    313      18.3   none
224320   18.05    9.84     245      14.1   lz4

--- overwrite

117383   167.25   169.42   2627     16.7   none
89177    127.07   71.12    1486     12.3   lz4
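
The w-amp column can be reproduced from the other columns. The sketch below assumes that write-amp is (flush + compact) / ingest; that definition is inferred from the numbers in the table rather than taken from the db_bench source, and the variable names are only for illustration.

```python
# Values copied from the table above, in GB.
rows = {
    ("readwhilewriting", "none"): {"ingest": 18.05, "flush": 17.81, "compact": 313},
    ("readwhilewriting", "lz4"): {"ingest": 18.05, "flush": 9.84, "compact": 245},
    ("overwrite", "none"): {"ingest": 167.25, "flush": 169.42, "compact": 2627},
    ("overwrite", "lz4"): {"ingest": 127.07, "flush": 71.12, "compact": 1486},
}


def write_amp(row: dict) -> float:
    # Assumed definition: bytes written by flush and compaction per byte ingested.
    return (row["flush"] + row["compact"]) / row["ingest"]


w_amp = {key: write_amp(row) for key, row in rows.items()}
for (test, compression), value in w_amp.items():
    print(f"{test:18s} {compression:5s} w-amp={value:.1f}")

# Ratio of write-amp without vs with compression for L0/L1/L2.
overwrite_ratio = w_amp[("overwrite", "none")] / w_amp[("overwrite", "lz4")]
print(f"overwrite: none/lz4 = {overwrite_ratio:.2f}")
```

The computed values match the w-amp column to one decimal place, and the overwrite ratio matches the 1.36X quoted in the tl;dr.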

Results, part 2

Notes:

* Write rates for overwrite are ~1.3X larger when L0/L1/L2 are not compressed
* Write stalls are more common when L0/L1/L2 are compressed
* The compaction CPU overhead per write is ~1.1X larger when L0/L1/L2 are compressed (cpu/w of 0.159 vs 0.140)

Legend:
* wps = writes/second
* c.wsecs - wall-clock seconds of compaction
* c.csecs - CPU seconds of compaction
* cpu/w - c.csecs / wps = compaction CPU / write
* stall% - percentage of time writes are stopped or stalled

wps      c.wsecs  c.csecs  cpu/w    stall%  L0/L1/L2-compression

--- readwhilewriting

220555   2867     2798     NA       0       none
224320   2737     2693     NA       0       lz4

--- overwrite

117383   17084    16464    0.140    49.7    none
89177    14554    14209    0.159    62.9    lz4
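
As a quick check of the overwrite rows, the sketch below recomputes cpu/w as c.csecs / wps, following the legend, and then takes the ratios between the two configurations. The dictionary layout and variable names are made up for illustration.

```python
# Overwrite rows copied from the table above.
overwrite = {
    "none": {"wps": 117383, "c_csecs": 16464, "stall_pct": 49.7},
    "lz4": {"wps": 89177, "c_csecs": 14209, "stall_pct": 62.9},
}

# cpu/w as defined in the legend: compaction CPU seconds divided by writes/second.
cpu_per_write = {name: r["c_csecs"] / r["wps"] for name, r in overwrite.items()}

cpu_ratio = cpu_per_write["lz4"] / cpu_per_write["none"]
throughput_ratio = overwrite["none"]["wps"] / overwrite["lz4"]["wps"]

for name, value in cpu_per_write.items():
    print(f"{name:5s} cpu/w={value:.3f}")
print(f"cpu/w, lz4 vs none: {cpu_ratio:.2f}X")
print(f"wps, none vs lz4:   {throughput_ratio:.2f}X")
```

With these inputs cpu/w comes out as 0.140 and 0.159, so compaction CPU per write is about 1.14X larger with lz4 on L0/L1/L2, while throughput is about 1.32X larger without it.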

Results, part 3

This shows the time per flush or compaction job. A flush job is the work done to flush one memtable into an L0 SST. A compaction job is the work done to compact data into L1 or a larger level. A compaction job into L1 usually merges all data in L0 with all data in L1. A compaction job into Ln (n > 1) usually merges 1 file from Ln-1 with ~10 files in Ln. Therefore, compaction jobs into L1 usually process more data than jobs into Ln (n > 1). This data is from the compaction IO statistics output.

Notes:

* Compaction/flush jobs into L0/L1/L2 take up to 3X longer when L0/L1/L2 are compressed
* Compaction jobs into L1 are the longest running. This happens with overwrite because it was run without a rate limit for the writer. Compaction fell behind and RocksDB allows L0 to grow large in that case. One side-effect is that the L0->L1 compaction job takes longer because it merges all data from L0 with all data from L1.

readwhilewriting - wall clock seconds per compaction/flush job

        compression
level   none    lz4
L0      0.05    0.08
L2      0.53    1.03
L3      0.17    0.47
L4      0.86    0.92
L5      0.86    0.88
L6      0.87    0.88
L7      0.58    0.55

overwrite - wall clock seconds per compaction/flush job

        compression
level   none    lz4
L0      0.11    0.18
L2      13.38   28.86
L3      0.14    0.42
L4      0.31    0.44
L5      0.37    0.40
L6      0.59    0.58
L7      0.77    0.79
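
To make the "up to 3X" note concrete, the sketch below divides the lz4 time by the uncompressed time for each row of the two tables above. Level labels are copied exactly as they appear in the tables, and the helper name is hypothetical.

```python
# Seconds per job copied from the tables above: level -> (none, lz4).
readwhilewriting = {
    "L0": (0.05, 0.08), "L2": (0.53, 1.03), "L3": (0.17, 0.47),
    "L4": (0.86, 0.92), "L5": (0.86, 0.88), "L6": (0.87, 0.88),
    "L7": (0.58, 0.55),
}
overwrite = {
    "L0": (0.11, 0.18), "L2": (13.38, 28.86), "L3": (0.14, 0.42),
    "L4": (0.31, 0.44), "L5": (0.37, 0.40), "L6": (0.59, 0.58),
    "L7": (0.77, 0.79),
}


def slowdown(table: dict) -> dict:
    """Ratio of lz4 job time to uncompressed job time, per level."""
    return {level: lz4 / none for level, (none, lz4) in table.items()}


for name, table in (("readwhilewriting", readwhilewriting), ("overwrite", overwrite)):
    ratios = slowdown(table)
    worst = max(ratios, key=ratios.get)
    print(name, {level: round(r, 2) for level, r in ratios.items()})
    print(f"  largest slowdown: {worst} at {ratios[worst]:.1f}X")
```

In both tables the first three rows are the ones that slow down noticeably (roughly 1.6X to 3X), while most of the remaining rows stay close to 1X.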