Thursday, May 5, 2022

To compress or not compress

What are the benefits and costs from compressing the smaller levels of an LSM tree? You won't save much space from compressing them unless the LSM tree doesn't have data beyond L2. I usually suggest no compression for levels 0, 1 and 2. Here I share results from a benchmark to show the impact from using compression for those levels.

tl;dr - if levels 0, 1 and 2 are compressed:

  • Benefits
    • less write-amplification
  • Costs
    • there might be more write stalls
    • write throughput might be lower if there are more write stalls
    • time per compaction/flush job is 2X larger for flush to L0 and compaction to L1/L2 
    • compaction CPU overhead is increased
By might be more write stalls I mean that this is only an issue when the write rate is large. For the results here I ran one test with the write-rate limited to 10 MB/s and there were no write stalls whether or not L0/L1/L2 were compressed. Then I ran a test without a limit on the write rate and stalls were worse when L0/L1/L2 were compressed.

Quantifying the tl;dr based on the overwrite benchmark:
  • Write-amplification was 1.36X larger when L0/L1/L2 were not compressed
  • Write throughput was 1.32X larger when L0/L1/L2 were not compressed
  • Compaction CPU was 1.36X larger when L0/L1/L2 were compressed via lz4


The benchmark command lines are here and compaction IO statistics are here. Tests were run on a server with more than 32 CPU cores and hyperthreading was enabled. The server has fast SSD and enough DRAM.

I used the db_bench benchmark client for RocksDB and RocksDB version 6.28.2 to run a sequence of benchmarks. Here I explain results from two of them -- readwhilewriting and overwrite. The readwhilewriting benchmark ran with the write limited to 10MB/second. The overwrite benchmark was run without a constraint on the write rate. WAL fsync was disabled to allow the writes to happen as fast as possible.

Levels L3 and larger were compressed with lz4. Tests were then repeated with L0/L1/L2 not compressed vs compressed with lz4.

Data used in this benchmark compresses in half as the db_bench command line included --compression_ratio=0.5.

Results, part 1


  • Write rates (wps) are the same for readwhilewriting because the writer had a 10M/s rate limit
  • Write rates (wps) are ~1.3X larger for overwrite when L0/L1/L2 are not compressed because there were fewer write stalls
  • Bytes flushed per write are ~2X larger when flushes (L0) are not compressed
  • Write-amp is ~1.3X larger when L0/L1/L2 are not compressed. The LSM tree in this workload has 7 levels, the write rate into each level is similar despite the amount of data per level varying greatly. Thus, whether or not you compress 3 of the 7 levels has a significant impact on write-amp. 

* wps = writes/second
* ingest - GB of data written into RocksDB
* flush - GB of memtable flushes
* compact - GB of compaction

wps     ingest  flush   compact w-amp  L0/L1/L2-compression
--- readwhilewriting
220555  18.05   17.81   313     18.3   none
224320  18.05    9.84   245     14.1   lz4
--- overwrite
117383  167.25  169.42  2627    16.7   none
 89177  127.07   71.12  1486    12.3   lz4

Results, part 2


  • Write rates are ~1.3X larger when L0/L1/L2 are not compressed
  • Write stalls are more common when L0/L1/L2 are compressed
  • The compaction CPU overhead per write is ~1.3X larger when L0/L1/L2 are compressed

* wps = writes/second
* c.csecs - CPU seconds of compaction
* cpu/w - c.csecs / wps = compaction CPU / write
* stall% - percentage of time writes are stopped or stalled

wps     c.wsecs c.csecs cpu/w   stall%  L0/L1/L2-compression
--- readwhilewriting
220555  2867    2798    NA      0       none
224320  2737    2693    NA      0       lz4
--- overwrite
117383  17084   16464   0.140   49.7    none
 89177  14554   14209   0.159   62.9    lz4

Results, part 3

This shows the time per flush or compaction job.  A flush job is the work done to flush one memtable into an L0 SST. A compaction job is the time to do a compaction into L1+. A compaction job into L1 usually merges all data in L0 with all data in L1. A compaction job into Ln (n > 1) usually merges ~10 files in Ln-1 with 1 file in Ln. Therefore, compaction jobs into L1 usually process more data than jobs into Ln (n > 1). This data is from compaction IO statistics output.


  • Compaction/flush jobs into L0/L1/L2 take up to 3X longer when L0/L1/L2 is compressed
  • Compaction jobs into L1 are the longest running. This happens with overwrite because it was run without rate limit for the writer, compaction fell behind and RocksDB allows L0 to grow large in that case. One side-effect is that the L0->L1 compaction job takes longer because it merges all data from L0 with all data from L1.

readwhilewriting - wall clock seconds per compaction/flush job

level   none    lz4
L0       0.05    0.08
L2       0.53    1.03
L3       0.17    0.47
L4       0.86    0.92
L5       0.86    0.88
L6       0.87    0.88
L7       0.58    0.55

overwrite - wall clock seconds per compaction/flush job

level   none    lz4
L0       0.11    0.18
L2      13.38   28.86
L3       0.14    0.42
L4       0.31    0.44
L5       0.37    0.40
L6       0.59    0.58
L7       0.77    0.79

No comments:

Post a Comment