What are the benefits and costs of compressing the smaller levels of an LSM tree? Compressing them saves little space unless the tree has no data beyond L2, so I usually suggest no compression for levels 0, 1 and 2. Here I share results from a benchmark that shows the impact of compressing those levels.
tl;dr - if levels 0, 1 and 2 are compressed:
- write-amplification is lower
- there might be more write stalls
- write throughput might be lower if there are more write stalls
- time per compaction/flush job is ~2X larger for flushes to L0 and compactions into L1/L2
- compaction CPU overhead is larger
Summary of the results:
- Write-amplification was 1.36X larger when L0/L1/L2 were not compressed
- Write throughput was 1.32X larger when L0/L1/L2 were not compressed
- Compaction CPU was 1.36X larger when L0/L1/L2 were compressed via lz4
The benchmark command lines are here and the compaction IO statistics are here. Tests were run on a server with more than 32 CPU cores and hyperthreading enabled. The server has a fast SSD and enough DRAM.
I used db_bench, the benchmark client for RocksDB, with RocksDB version 6.28.2 to run a sequence of benchmarks. Here I explain results from two of them: readwhilewriting and overwrite. The readwhilewriting benchmark ran with the write rate limited to 10MB/second while overwrite ran without a constraint on the write rate. WAL fsync was disabled so writes could happen as fast as possible.
Levels L3 and larger were always compressed with lz4. Tests were then repeated with L0/L1/L2 either not compressed or compressed with lz4. The data used in this benchmark compresses 2:1 because the db_bench command line included --compression_ratio=0.5.
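The actual command lines are in the linked page; the sketch below only illustrates the relevant db_bench options, assuming --min_level_to_compress is the knob used to leave the smaller levels uncompressed (other flag values here are placeholders, not the ones from the tests):

```shell
# Sketch only -- the real command lines are linked above.
# Assumption: --min_level_to_compress=3 leaves L0/L1/L2 uncompressed while
# L3 and larger use lz4; --min_level_to_compress=0 compresses every level.
./db_bench \
  --benchmarks=overwrite \
  --compression_type=lz4 \
  --compression_ratio=0.5 \
  --min_level_to_compress=3 \
  --sync=0
```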
Results, part 1
- Write rates (wps) are the same for readwhilewriting because the writer had a 10MB/s rate limit
- Write rates (wps) are ~1.3X larger for overwrite when L0/L1/L2 are not compressed because there were fewer write stalls
- Bytes flushed per write are ~2X larger when flushes (L0) are not compressed
- Write-amp is ~1.3X larger when L0/L1/L2 are not compressed. The LSM tree in this workload has 7 levels, and the write rate into each level is similar even though the amount of data per level varies greatly. Thus, whether or not you compress 3 of the 7 levels has a significant impact on write-amp.
* wps = writes/second
* ingest - GB of data written into RocksDB
* flush - GB of memtable flushes
* compact - GB of compaction
benchmark         wps     ingest  flush  compact  w-amp  L0/L1/L2-compression
readwhilewriting  224320  18.05   9.84   245      14.1   lz4
overwrite         89177   127.07  71.12  1486     12.3   lz4
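The w-amp column is just the bytes written by flush and compaction divided by the bytes ingested, which can be checked from the table:

```python
# Recompute the w-amp column from the table above:
# w-amp = (GB flushed + GB compacted) / GB ingested.
rows = [
    # (ingest GB, flush GB, compact GB) per table row
    (18.05, 9.84, 245.0),
    (127.07, 71.12, 1486.0),
]
w_amps = [round((flush + compact) / ingest, 1) for ingest, flush, compact in rows]
print(w_amps)  # [14.1, 12.3], matching the w-amp column
```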
Results, part 2
- Write rates are ~1.3X larger when L0/L1/L2 are not compressed
- Write stalls are more common when L0/L1/L2 are compressed
- The compaction CPU overhead per write is ~1.3X larger when L0/L1/L2 are compressed
Results, part 3
This shows the time per flush or compaction job. A flush job is the work done to flush one memtable into an L0 SST. A compaction job is the work done for one compaction into L1 or beyond. A compaction job into L1 usually merges all data in L0 with all data in L1, while a compaction job into Ln (n > 1) usually merges one file from Ln-1 with ~10 overlapping files in Ln. Therefore, compaction jobs into L1 usually process more data than jobs into Ln (n > 1). This data is from the compaction IO statistics output.
- Compaction/flush jobs into L0/L1/L2 take up to 3X longer when L0/L1/L2 are compressed
- Compaction jobs into L1 are the longest running. This happens with overwrite because it ran without a rate limit for the writer, so compaction fell behind, and RocksDB allows L0 to grow large in that case. One side-effect is that the L0->L1 compaction job takes longer because it merges all data in L0 with all data in L1.
readwhilewriting - wall clock seconds per compaction/flush job
level  none   lz4
L0     0.05   0.08
L2     0.53   1.03
L3     0.17   0.47
L4     0.86   0.92
L5     0.86   0.88
L6     0.87   0.88
L7     0.58   0.55
overwrite - wall clock seconds per compaction/flush job
level  none   lz4
L0     0.11   0.18
L2     13.38  28.86
L3     0.14   0.42
L4     0.31   0.44
L5     0.37   0.40
L6     0.59   0.58
L7     0.77   0.79
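The per-level slowdown from compressing the smaller levels can be computed from the two tables above; a quick sketch using the L0 through L3 rows:

```python
# Ratio of job time with L0/L1/L2 compressed (lz4) vs not (none),
# using (none, lz4) values copied from the two tables above.
tables = {
    "readwhilewriting": {"L0": (0.05, 0.08), "L2": (0.53, 1.03), "L3": (0.17, 0.47)},
    "overwrite":        {"L0": (0.11, 0.18), "L2": (13.38, 28.86), "L3": (0.14, 0.42)},
}
ratios = {
    bench: {level: round(lz4 / none, 2) for level, (none, lz4) in tbl.items()}
    for bench, tbl in tables.items()
}
print(ratios["overwrite"]["L3"])  # 3.0 -- the "up to 3X" case
```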