Thursday, May 23, 2024

Space vs CPU tradeoffs with universal compaction in RocksDB

This post explains the impact of compression for write-heavy workloads with universal compaction in RocksDB. With an LSM you can spend CPU to save space. For leveled compaction I normally use no compression for L0, L1 and L2 because compressing them saves little space but costs a lot in CPU and write stalls. My advice for universal compaction is similar -- don't compress everything.
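
For reference, that leveled-compaction advice maps to per-level compression settings. This is a minimal sketch with the RocksDB C++ Options API, not the exact options I use; the number of levels and the choice of LZ4 are just an illustration:

    #include <rocksdb/options.h>

    // Sketch: leveled compaction with no compression for L0, L1 and L2
    // and LZ4 for the larger, colder levels below them.
    rocksdb::Options LeveledSketch() {
      rocksdb::Options options;
      options.compaction_style = rocksdb::kCompactionStyleLevel;
      options.num_levels = 7;
      // One entry per level: L0..L2 uncompressed, L3..L6 use LZ4.
      options.compression_per_level = {
          rocksdb::kNoCompression, rocksdb::kNoCompression,
          rocksdb::kNoCompression, rocksdb::kLZ4Compression,
          rocksdb::kLZ4Compression, rocksdb::kLZ4Compression,
          rocksdb::kLZ4Compression};
      return options;
    }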

tl;dr

  • For fillseq (load in key order) the insert rates are better with less compression, so no compression beats LZ4 and LZ4 beats ZSTD because with stronger compression there are more write stalls. But for overwrite the insert rate is similar regardless of compression for reasons I haven't fully explained.
  • There is more CPU used by compaction when there is more compression. The difference in CPU overhead can exceed 3X when comparing no compression (m1.none) and ZSTD for everything (m1.zstd).
  • There is less write-amp when there is more compression because with a smaller database there is less to write.

Workload

I used a hacked version of my db_bench helper scripts to run the fillseq and overwrite benchmark steps on a small server (see v5 here, Beelink SER7 with 8 cores, Ryzen 7 CPU, 32G RAM, M.2 storage, Ubuntu 22.04 and XFS).

Note that db_bench was configured to generate values that can compress in half, thus the numbers below show that an uncompressed database can be ~2X larger than a compressed database.

The sequence of benchmark steps is below. I look at performance results for fillseq and overwriteandwait. The command lines are here for the m1, 80 and 90 config groups.

  • fillseq - load 100M KV pairs in key order
  • overwritesome - overwrite 10M KV pairs to fragment the LSM tree
  • waitforcompaction - wait for compaction to finish so that all configs start from a similar state for the next benchmark step
  • overwriteandwait - overwrite keys as fast as possible for 20 minutes

There were 3 types of configurations that I call m1, 80 and 90 below. They differ in the value used with --universal_compression_size_percent, which is set to -1, 80 and 90 for the m1, 80 and 90 configuration types (see the sketch after the list below).

  • m1 - all SSTs are either compressed or not compressed (see below)
  • 80 - 80% of the SSTs are compressed, the 20% smallest (newest) are not compressed.
  • 90 - 90% of the SSTs are compressed, the 10% smallest (newest) are not compressed
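
In the RocksDB C++ API that db_bench flag corresponds to CompactionOptionsUniversal::compression_size_percent. A minimal sketch for the 80 group, where the choice of LZ4 is just one of the variants tested below:

    #include <rocksdb/options.h>

    // Sketch: universal compaction where roughly the oldest 80% of the data
    // is compressed and the newest ~20% is left uncompressed. With -1 (the
    // default) the regular compression setting applies to all SSTs.
    rocksdb::Options UniversalSketch80() {
      rocksdb::Options options;
      options.compaction_style = rocksdb::kCompactionStyleUniversal;
      options.compaction_options_universal.compression_size_percent = 80;
      options.compression = rocksdb::kLZ4Compression;  // the compressed ~80%
      return options;
    }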

Then within each configuration type I try a mix of no, lz4 and zstd compression. The configs described below as using a different compression type for the largest sorted run get that by setting bottommost_compression (see the sketch after the list). The full set of configs tested is:

  • m1.none
    • nothing is compressed
  • m1.lz4
    • uses LZ4 compression for all SSTs
  • m1.zstd
    • uses ZSTD compression for all SSTs
  • m1.lz
    • uses LZ4 compression for all SSTs except the largest sorted run which uses ZSTD
  • 80.none
    • nothing is compressed, just like m1.none
  • 80.lz4
    • uses LZ4 to compress the largest (oldest) 80% of the SSTs
  • 80.zstd
    • uses ZSTD for the largest (oldest) 80% of the SSTs
  • 90.nnl
    • uses LZ4 for the largest sorted run and no compression elsewhere
  • 90.nnz
    • uses ZSTD for the largest sorted run and no compression elsewhere
  • 90.nlz
    • uses ZSTD for the largest sorted run, LZ4 for ~90% of the next largest SSTs and then no compression for everything else (the newest, smallest SSTs)
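
As an illustration, here is my reading of how two of these configs map onto the RocksDB C++ Options API. These are sketches, not the db_bench command lines linked above:

    #include <rocksdb/options.h>

    // Sketch for m1.lz: compress everything (-1), LZ4 for most SSTs, ZSTD
    // for the largest sorted run via bottommost_compression.
    rocksdb::Options SketchM1Lz() {
      rocksdb::Options options;
      options.compaction_style = rocksdb::kCompactionStyleUniversal;
      options.compaction_options_universal.compression_size_percent = -1;
      options.compression = rocksdb::kLZ4Compression;
      options.bottommost_compression = rocksdb::kZSTD;
      return options;
    }

    // Sketch for 90.nlz: no compression for the newest ~10% of the data,
    // LZ4 for the older ~90%, ZSTD for the largest sorted run.
    rocksdb::Options Sketch90Nlz() {
      rocksdb::Options options;
      options.compaction_style = rocksdb::kCompactionStyleUniversal;
      options.compaction_options_universal.compression_size_percent = 90;
      options.compression = rocksdb::kLZ4Compression;
      options.bottommost_compression = rocksdb::kZSTD;
      return options;
    }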

Results

Legend for the tables below:

  • config
    • name of the tested configuration; the values are explained above
  • ips
    • the average rate for inserts/s
  • stall%
    • percentage of time for which there are write stalls. Note that tests are run with fsync on commit disabled which can make write stalls more likely
  • w_amp
    • write amplification. Without compression the minimum should be 2 (write once to WAL, once to L0 on memtable flush) but with compression it can be less.
  • c_wsecs
    • wall-clock seconds doing compaction
  • c_csecs
    • CPU seconds doing compaction. In some cases this is larger than c_wsecs because subcompactions were enabled (set to 2) and one compaction job can use 2 threads. A sketch for dumping these compaction stats follows this legend.
  • sizeGB
    • size in GB of the database. For fillseq this is the total database size but for overwriteandwait it is only the size of the largest sorted run (which was stored in L39) because there is too much variance on the total database size with universal compaction for benchmark steps other than fillseq.
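
The w_amp, c_wsecs and c_csecs values come from RocksDB's compaction stats. As a minimal sketch (not part of my scripts), something like this prints the same stats for an open DB:

    #include <iostream>
    #include <string>
    #include <rocksdb/db.h>

    // Sketch: print the compaction stats table, which includes per-level
    // write-amp and compaction wall/CPU seconds for an open DB.
    void PrintCompactionStats(rocksdb::DB* db) {
      std::string stats;
      if (db->GetProperty(rocksdb::DB::Properties::kCFStats, &stats)) {
        std::cout << stats << std::endl;
      }
    }
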
Results from the fillseq benchmark step that does inserts in key order. I use green to indicate the best results from each column and red for the worst. All of the results are here.

  • The insert rates are better with less compression, so no beats LZ4 and LZ4 beats ZSTD. 
  • With more compression there are more write stalls.
  • There is more CPU used by compaction when there is more compression. The difference in CPU overhead is larger than 3X when comparing no compression (m1.none) and ZSTD for everything (m1.zstd).
  • There is less write-amp when there is more compression because with a smaller database there is less to write.

config   ips      stall%  w_amp  c_wsecs  c_csecs  sizeGB
m1.none  1124117  44.0    2.0    83       71       39.9
m1.lz4   834739   57.8    1.1    114      105      21.9
m1.zstd  401282   78.2    0.9    244      236      18.0
m1.lz    818534   57.8    1.1    116      108      21.9
80.none  995011   50.2    2.0    89       70       39.9
80.lz4   820899   57.8    1.1    114      103      21.9
80.zstd  399171   78.2    0.9    245      236      18.0
90.nnl   1010723  49.3    2.0    92       77       39.9
90.nnz   1010318  49.1    2.0    92       77       39.9
90.nlz   808058   58.5    1.1    118      109      21.9

Results from the overwriteandwait benchmark step. I use green to indicate the best results from each column and red for the worst.

  • All configs provide a similar insert rate because there isn't a large difference in write stall percentage between the configs that use more compression and the configs that use less. The benchmark client uses more CPU per request here than for fillseq because key generation here uses a RNG while above it just increments a counter. Were I able to test something that sustains a higher request rate there might be differences here similar to what I see with fillseq.
  • There is a large difference in CPU overhead between configs that use less compression and configs that use more. But this CPU overhead is consumed by background threads that do compaction so it doesn't change the insert rate.
  • There is less write-amp when there is more compression because with a smaller database there is less to write.

config   ips     stall%  w_amp  c_wsecs  c_csecs  sizeGB
m1.none  307477  0.6     6.9    1880     1232     40.0
m1.lz4   312220  0.1     3.7    1797     1915     21.6
m1.zstd  300634  2.2     3.1    4565     5307     17.8
m1.lz    312476  0.1     3.8    2776     3549     17.8
80.none  307114  0.5     6.9    1864     1219     40.0
80.lz4   308854  0.1     5.9    1733     1620     21.6
80.zstd  310717  0.1     5.6    2451     2656     17.8
90.nnl   307881  0.4     6.9    2061     1900     21.6
90.nnz   310731  0.2     6.2    2431     2793     17.8
90.nlz   308793  0.2     5.7    2470     2996     17.8

