Thursday, January 4, 2024

RocksDB 8.x benchmarks: large server, IO-bound

This post has results for performance tests of all 8.x versions, from 8.0.0 to 8.9.2, using a large server and an IO-bound workload. In a previous post I shared results for the same hardware with a cached database.

tl;dr

  • There is a small regression that arrives in RocksDB 8.6 for overwriteandwait (write-only, random writes), but only with buffered IO. I think this is caused by changes to compaction readahead. For now I will reuse RocksDB issue 12038 for this.
  • I focus on the benchmark steps that aren't read-only because they suffer less from noise. These steps are fillseq, revrangewhilewriting, fwdrangewhilewriting, readwhilewriting and overwriteandwait. I also focus on leveled compaction more than universal, in part because there is more noise with universal, but also because the workloads I care about most use leveled.

Builds

I compiled RocksDB 8.0.0, 8.1.1, 8.2.1, 8.3.3, 8.4.4, 8.5.4, 8.6.7, 8.7.3, 8.8.1 and 8.9.2 with gcc. These are the latest patch releases for each 8.x release.

Benchmark

All tests used a server with 40 cores, 80 HW threads, 2 sockets, 256GB of RAM and many TB of fast NVMe SSD with Linux 5.1.2, XFS and SW RAID 0 across 6 devices. For the results here the workload is IO-bound -- the database is larger than memory. The benchmark was repeated for leveled and universal compaction using both buffered IO and O_DIRECT.

Everything used the LRU block cache and the default value for compaction_readahead_size. Soon I will switch to using the hyper clock cache once RocksDB 9.0 arrives.
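For context, a rough sketch of how these choices map to db_bench flags. The flag names and values are my understanding of db_bench in RocksDB 8.x, and the cache size below is a placeholder rather than what my benchmark scripts use:

# LRU block cache, as used for these results; the size here is only an example
--cache_type=lru_cache --cache_size=$(( 32 * 1024 * 1024 * 1024 ))
# the hyper clock cache I plan to switch to after RocksDB 9.0 arrives
--cache_type=hyper_clock_cache
# --compaction_readahead_size is left unset so that the default value is used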

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 24 threads. How I do benchmarks for RocksDB is explained here and here. The command line to run them is: 
bash x3.sh 24 no 3600 c40r256bc180 40000000 4000000000 iobuf iodir
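For readers who don't use the wrapper scripts, the sequence is roughly a series of db_bench invocations like the sketch below. This is an illustration rather than what x3.sh actually runs -- the mapping of the arguments above to db_bench flags is my guess, the list of steps is abbreviated, and overwriteandwait is the wrapper's name for overwrite followed by waiting for compaction to finish:

# load in key order, then a read-write step, then write-only
# (flags such as --db and compression settings are omitted)
./db_bench --benchmarks=fillseq --num=4000000000
./db_bench --use_existing_db=1 --benchmarks=readwhilewriting --threads=24 --duration=3600
./db_bench --use_existing_db=1 --benchmarks=overwrite,waitforcompaction --threads=24 --duration=3600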
A spreadsheet with all results is here and performance summaries are here.
Results: leveled

There is one fake regression in overwriteandwait for RocksDB 8.6.7. The issue is that the db_bench benchmark client ignored a new default value for compaction_readahead_size. That has been fixed in 8.7.

There is one real regression in overwriteandwait that probably arrived in 8.6 and is definitely in 8.7 through 8.9. The throughput for overwriteandwait drops about 5% from 8.5 to 8.7+. I assume this is from changes to compaction readahead that arrived in 8.6. These changes affect readahead done with buffered IO but not with O_DIRECT, and in the charts below the regression does not repeat with O_DIRECT.
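For reference, these db_bench flags distinguish the two IO modes being compared, and the last line shows how compaction readahead can be set explicitly instead of relying on the default. The flag names are my understanding of db_bench in 8.x and the readahead value is only an example:

# buffered IO (the mode with the regression)
--use_direct_reads=false --use_direct_io_for_flush_and_compaction=false
# O_DIRECT (the regression does not repeat here)
--use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
# example of pinning compaction readahead rather than using the default
--compaction_readahead_size=2097152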

From the performance summary for overwriteandwait with buffered IO (see here):
  • compaction wall clock time (c_wsecs) increases by ~3% from ~18200 in 8.5 to ~18700 in 8.7+
  • compaction CPU seconds (c_csecs) decreases by ~5% from ~18000 in 8.5 to ~17200 in 8.7+
  • the c_csecs / c_wsecs ratio is ~0.99 for 8.0 thru 8.5 and drops to ~0.92 in 8.7+, so one side effect of the change in 8.6 is that compaction threads see more IO latency
  • this issue doesn't repeat with O_DIRECT, see here
From iostat metrics during overwriteandwait with buffered IO:
  • rawait (r_await) drops from 0.21 in 8.5 to ~0.08 in 8.7+
  • rareq-sz (rareqsz) drops from 28.3 in 8.5 to ~9 in 8.7+
  • the decrease in rawait was expected given the decrease in rareq-sz; the real problem is the drop in rareq-sz, because the only reads during overwriteandwait are from compaction
  • this issue doesn't repeat with O_DIRECT
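The iostat columns in the tables below use abbreviated names. As far as I can tell they map to iostat extended output collected while the benchmark runs, roughly as in the sketch below (assuming a recent sysstat): rps/wps are r/s and w/s, rmbps/wmbps are rMB/s and wMB/s, rawait/wawait are r_await and w_await, and rareqsz/wareqsz are rareq-sz and wareq-sz.

# per-device extended stats in MB/s, sampled once per second during the benchmark
iostat -x -m 1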
leveled, buffered IO
c       rps     rmbps   rrqmps  rawait  rareqsz wps     wmbps   wrqmps  wawait  wareqsz ver
3762    4762    70.0    0.00    0.21    28.3    5648    576.2   0.00    0.06    104.3   8.5.4
3879    21308   90.9    0.00    0.05    4.2     4393    447.7   0.00    0.06    104.8   8.6.7
3790    9029    79.9    0.00    0.07    9.0     5229    535.8   0.00    0.06    105.2   8.7.3
3790    9678    74.5    0.00    0.08    8.3     5283    539.6   0.00    0.06    104.8   8.8.1
3790    9808    75.5    0.00    0.08    8.5     5298    540.1   0.00    0.06    104.7   8.9.2

leveled, O_DIRECT
c       rps     rmbps   rrqmps  rawait  rareqsz wps     wmbps   wrqmps  wawait  wareqsz ver
3765    5236    619.4   0.00    0.32    120.5   5779    687.7   0.00    0.07    121.1   8.5.4
4187    37528   340.4   0.00    0.09    9.2     1908    218.5   0.00    0.06    118.0   8.6.7
3754    5170    612.6   0.00    0.33    121.1   5708    679.1   0.00    0.07    121.4   8.7.3
3759    5084    602.8   0.00    0.35    121.1   5612    668.0   0.00    0.08    121.5   8.8.1
3759    5048    598.1   0.00    0.37    121.1   5574    663.3   0.00    0.08    121.4   8.9.2

These charts show relative QPS which is (QPS for a given version / QPS for RocksDB 8.0).

First is with buffered IO (no O_DIRECT)
Next is with O_DIRECT (no OS page cache)
Results: universal

Summary
  • Just like above for leveled, there is a bogus regression for overwriteandwait with RocksDB 8.6
  • Results here have more variance than the results for leveled above. While I have yet to prove it, universal compaction benchmarks are likely more prone to variance, so I don't think there are regressions here.
These charts show relative QPS which is (QPS for a given version / QPS for RocksDB 8.0).

First is with buffered IO (no O_DIRECT)
Next is with O_DIRECT (no OS page cache)

