Thursday, May 23, 2024

Space vs CPU tradeoffs with universal compaction in RocksDB

This post explains the impact of compression for write-heavy workloads with universal compaction in RocksDB. With an LSM you can spend on CPU to save on space. For leveled compaction I normally use no compression for L0, L1 and L2 because compressing them has a small impact on space but costs a lot in CPU and write stalls. My advice for universal compaction is similar -- don't compress everything.
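
To make that advice concrete, below is a minimal sketch using the RocksDB C++ API (the tests in this post were driven by db_bench flags, not this code): leave L0, L1 and L2 uncompressed and compress the deeper levels with LZ4. When compression_per_level has fewer entries than there are levels, the last entry applies to the remaining, deeper levels.

    #include "rocksdb/options.h"

    // Sketch of the leveled-compaction advice: no compression for the small,
    // hot levels and LZ4 for everything below them.
    rocksdb::Options MakeLeveledOptions() {
      rocksdb::Options options;
      options.compaction_style = rocksdb::kCompactionStyleLevel;
      options.compression_per_level = {
          rocksdb::kNoCompression,   // L0
          rocksdb::kNoCompression,   // L1
          rocksdb::kNoCompression,   // L2
          rocksdb::kLZ4Compression,  // L3 and deeper
      };
      return options;
    }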

tl;dr

  • For fillseq (load in key order) the insert rates are better with less compression, so no compression beats LZ4 and LZ4 beats ZSTD because with stronger compression there are more write stalls. But for overwrite the insert rate is similar regardless of compression for reasons I haven't fully explained.
  • There is more CPU used by compaction when there is more compression. The difference in CPU overhead can exceed 3X when comparing no compression (m1.none) and ZSTD for everything (m1.zstd).
  • There is less write-amp when there is more compression because with a smaller database there is less to write.

Workload

I used a hacked version of my db_bench helper scripts to run the fillseq and overwrite benchmark steps on a small server (see v5 here, Beelink SER7 with 8 cores, Ryzen7 CPU, 32G RAM, m.2 storage, Ubuntu 22.04 and XFS).

Note that db_bench was configured to generate values that can compress in half, thus the numbers below show that an uncompressed database can be ~2X larger than a compressed database.

The sequence of benchmark steps is below. I look at performance results for fillseq and overwriteandwait. The command lines are here for the m1, 80 and 90 config groups. A sketch of the access patterns these steps generate follows the list.

  • fillseq - load 100M KV pairs in key order
  • overwritesome - overwrite 10M KV pairs to fragment the LSM tree
  • waitforcompaction - wait for compaction to finish so that all configs start from a similar state for the next benchmark step
  • overwriteandwait - overwrite keys as fast as possible for 20 minutes
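
The sketch below shows roughly what those steps do against the RocksDB API, assuming made-up key counts and value sizes (the real runs use db_bench, which generates values that compress ~2:1): fillseq inserts keys in increasing order, while overwrite re-puts random keys that already exist.

    #include <cstdio>
    #include <random>
    #include <string>

    #include "rocksdb/db.h"
    #include "rocksdb/options.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      options.compaction_style = rocksdb::kCompactionStyleUniversal;

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/udb", &db);
      if (!s.ok()) {
        std::fprintf(stderr, "open failed: %s\n", s.ToString().c_str());
        return 1;
      }

      const int kNumKeys = 1000000;       // scaled down from 100M
      const std::string value(400, 'x');  // made-up value; db_bench values compress ~2:1
      char key[32];

      // fillseq: load in key order
      for (int i = 0; i < kNumKeys; i++) {
        std::snprintf(key, sizeof(key), "%016d", i);
        db->Put(rocksdb::WriteOptions(), key, value);
      }

      // overwrite: re-put random keys that already exist
      std::mt19937 rng(42);
      std::uniform_int_distribution<int> dist(0, kNumKeys - 1);
      for (int i = 0; i < kNumKeys; i++) {
        std::snprintf(key, sizeof(key), "%016d", dist(rng));
        db->Put(rocksdb::WriteOptions(), key, value);
      }

      delete db;
      return 0;
    }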

There were 3 types of configurations that I call m1, 80 and 90 below. They differ in the value used with --universal_compression_size_percent. The configs set it to -1, 80 and 90 for the m1, 80 and 90 configuration types. 

  • m1 - all SSTs are either compressed or not compressed (see below)
  • 80 - 80% of the SSTs are compressed, the 20% smallest (newest) are not compressed.
  • 90 - 90% of the SSTs are compressed, the 10% smallest (newest) are not compressed.

Then within each configuration type I try a mix of no, LZ4 and ZSTD compression. The configs described as using a different compression type for the largest sorted run get that by setting bottommost_compression. The full set of configs tested is below, and a sketch showing how one of them maps to RocksDB options follows the list:

  • m1.none
    • nothing is compressed
  • m1.lz4
    • uses LZ4 compression for all SSTs
  • m1.zstd
    • uses ZSTD compression for all SSTs
  • m1.lz
    • uses LZ4 compression for all SSTs except the largest sorted run which uses ZSTD
  • 80.none
    • nothing is compressed, just like m1.none
  • 80.lz4
    • uses LZ4 to compress the largest (oldest) 80% of the SSTs
  • 80.zstd
    • uses ZSTD for the largest (oldest) 80% of the SSTs
  • 90.nnl
    • uses LZ4 for the largest sorted run and no compression elsewhere
  • 90.nnz
    • uses ZSTD for the largest sorted run and no compression elsewhere
  • 90.nlz
    • uses ZSTD for the largest sorted run, LZ4 for ~90% of the next largest SSTs and then no compression for everything else (the newest, smallest SSTs)
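
As a rough sketch (assuming the RocksDB C++ API; the real runs set these through db_bench flags), the 90.nlz config above corresponds to options like these:

    #include "rocksdb/options.h"

    rocksdb::Options MakeUniversal_90_nlz() {
      rocksdb::Options options;
      options.compaction_style = rocksdb::kCompactionStyleUniversal;
      // -1 for the m1 configs, 80 for the 80 configs, 90 for the 90 configs
      options.compaction_options_universal.compression_size_percent = 90;
      // compression used for the SSTs that fall into the "compressed" fraction
      options.compression = rocksdb::kLZ4Compression;
      // override for the largest (bottommost) sorted run only
      options.bottommost_compression = rocksdb::kZSTD;
      return options;
    }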

Results

Legend for the tables below (a sketch of where these counters come from in RocksDB follows the legend):

  • config
    • name of the tested configuration and the values are explained above
  • ips
    • the average rate for inserts/s
  • stall%
    • percentage of time for which there are write stalls. Note that tests are run with fsync on commit disabled which can make write stalls more likely
  • w_amp
    • write amplification. Without compression the minimum should be 2 (write once to WAL, once to L0 on memtable flush) but with compression it can be less.
  • c_wsecs
    • wall-clock seconds doing compaction
  • c_csecs
    • CPU seconds doing compaction. In some cases this is larger than c_wsecs because subcompactions were enabled (set to 2) and one compaction job can use 2 threads.
  • sizeGB
    • size in GB of the database. For fillseq this is the total database size but for overwriteandwait it is only the size of the largest sorted run (which was stored in L39) because there is too much variance on the total database size with universal compaction for benchmark steps other than fillseq.
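
My helper scripts scrape these counters from db_bench output. For reference, a minimal sketch of reading similar numbers directly from RocksDB via GetProperty (the exact columns vary by RocksDB version):

    #include <iostream>
    #include <string>

    #include "rocksdb/db.h"

    // Dump the built-in stats, which include the per-level compaction table
    // (with a W-Amp column and compaction wall-clock/CPU seconds) and the
    // cumulative write stall counters.
    void DumpStats(rocksdb::DB* db) {
      std::string stats;
      if (db->GetProperty("rocksdb.stats", &stats)) {
        std::cout << stats << std::endl;
      }
    }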

Results from the fillseq benchmark step that does inserts in key order. I use green to indicate the best results from each column and red for the worst. All of the results are here.

  • The insert rates are better with less compression, so no compression beats LZ4 and LZ4 beats ZSTD.
  • With more compression there are more write stalls.
  • There is more CPU used by compaction when there is more compression. The difference in CPU overhead is larger than 3X when comparing no compression (m1.none) and ZSTD for everything (m1.zstd).
  • There is less write-amp when there is more compression because with a smaller database there is less to write.

config   ips      stall%  w_amp  c_wsecs  c_csecs  sizeGB
m1.none  1124117  44.0    2.0    83       71       39.9
m1.lz4   834739   57.8    1.1    114      105      21.9
m1.zstd  401282   78.2    0.9    244      236      18.0
m1.lz    818534   57.8    1.1    116      108      21.9
80.none  995011   50.2    2.0    89       70       39.9
80.lz4   820899   57.8    1.1    114      103      21.9
80.zstd  399171   78.2    0.9    245      236      18.0
90.nnl   1010723  49.3    2.0    92       77       39.9
90.nnz   1010318  49.1    2.0    92       77       39.9
90.nlz   808058   58.5    1.1    118      109      21.9

Results from the overwriteandwait benchmark step. I use green to indicate the best results from each column and red for the worst.

  • All configs provide a similar insert rate because there isn't a large difference in write stall percentage between the configs that use more compression and the configs that use less. The benchmark client uses more CPU per request here than for fillseq because key generation here uses a RNG while above it just increments a counter. Were I able to test something that sustains a higher request rate there might be differences here similar to what I see with fillseq.
  • There is a large difference in CPU overhead between configs that use less compression and configs that use more. But this CPU overhead is consumed by background threads that do compaction so it doesn't change the insert rate.
  • There is less write-amp when there is more compression because with a smaller database there is less to write.

config   ips     stall%  w_amp  c_wsecs  c_csecs  sizeGB
m1.none  307477  0.6     6.9    1880     1232     40.0
m1.lz4   312220  0.1     3.7    1797     1915     21.6
m1.zstd  300634  2.2     3.1    4565     5307     17.8
m1.lz    312476  0.1     3.8    2776     3549     17.8
80.none  307114  0.5     6.9    1864     1219     40.0
80.lz4   308854  0.1     5.9    1733     1620     21.6
80.zstd  310717  0.1     5.6    2451     2656     17.8
90.nnl   307881  0.4     6.9    2061     1900     21.6
90.nnz   310731  0.2     6.2    2431     2793     17.8
90.nlz   308793  0.2     5.7    2470     2996     17.8
