Monday, July 22, 2024

Searching for regressions in RocksDB with db_bench: part 2

In a recent post I shared results for RocksDB performance tests using versions 6.0 through 9.0 on 3 different types of servers (small, medium, big). While there were few regressions over time, one regression arrived in version 8.6 (bug 12038) and the workarounds are to do one of the following:

  • use O_DIRECT for compaction reads
  • set compaction_readahead_size to be <= max_sectors_kb for the database storage device. When SW RAID is used I don't know whether the value that matters is from the underlying storage devices or the SW RAID device.

In this post I have more results from tests done with compaction_readahead_size set to a value <= max_sectors_kb.
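
To make the two workarounds above concrete, here is a minimal sketch of how they map to fields in the RocksDB C++ Options API. The helper name is mine and the 96K value is just the one I use on servers where max_sectors_kb is 128 -- it is an example, not a recommendation.

    #include <rocksdb/options.h>

    // Sketch: the two workarounds expressed as RocksDB Options fields.
    rocksdb::Options MakeCompactionReadOptions(bool prefer_direct_io) {
      rocksdb::Options options;
      if (prefer_direct_io) {
        // Workaround 1: use O_DIRECT for compaction (and flush) reads and writes.
        options.use_direct_io_for_flush_and_compaction = true;
      } else {
        // Workaround 2: keep compaction_readahead_size <= max_sectors_kb of the
        // database storage device (96K here, for a device where max_sectors_kb = 128).
        options.compaction_readahead_size = 96 * 1024;
      }
      return options;
    }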

tl;dr
  • Setting compaction_readahead_size to be <= max_sectors_kb was good for performance on the small and big servers. One effect of this is that the average read request size is large (tens of KB) when the value is correctly sized and ~4K (single-block reads) when it is not.
  • If you don't want to worry about this, then use O_DIRECT for compaction reads.

Read the Builds and Benchmark sections from my recent post for more context.

Hardware

I tested on three servers:
  • Small server
    • SER4 - Beelink SER 4700u (see here) with 8 cores and a Ryzen 7 4700u CPU, ext4 with data=writeback and 1 NVMe device. The storage device has 128 for max_hw_sectors_kb and max_sectors_kb (see the sketch after this list for reading these values from sysfs).
    • I set compaction_readahead_size to 96K
  • Medium server
    • C2D - a c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU and 16 cores, XFS with data=writeback, SW RAID 0 and 4 NVMe devices. The RAID device has 512 for max_hw_sectors_kb and max_sectors_kb while the storage devices have max_hw_sectors_kb =2048 and max_sectors_kb =1280.
    • I set compaction_readahead_size to 512K
  • Big server
    • BIG - Intel CPU, 1-socket, 48 cores with HT enabled, enough RAM (vague on purpose), XFS with SW RAID 0 and 3 devices. The RAID device has 128 for max_hw_sectors_kb and max_sectors_kb while the storage devices have max_hw_sectors_kb =2048 and max_sectors_kb =1280.
    • I set compaction_readahead_size to 96K
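
As a reference for the max_sectors_kb values listed above, here is a small sketch (assuming the usual Linux sysfs layout; nvme0n1 is just an example device name) that prints the value compaction_readahead_size should not exceed. As noted earlier, with SW RAID I don't know whether the RAID device or the underlying devices is the one that matters.

    #include <fstream>
    #include <iostream>
    #include <string>

    // Sketch: print max_sectors_kb for a block device so that
    // compaction_readahead_size can be chosen to not exceed it.
    int main(int argc, char** argv) {
      const std::string dev = (argc > 1) ? argv[1] : "nvme0n1";  // example device name
      std::ifstream f("/sys/block/" + dev + "/queue/max_sectors_kb");
      long max_sectors_kb = 0;
      if (!(f >> max_sectors_kb)) {
        std::cerr << "could not read max_sectors_kb for " << dev << "\n";
        return 1;
      }
      std::cout << dev << ": max_sectors_kb=" << max_sectors_kb
                << " -> use compaction_readahead_size <= " << max_sectors_kb << "K\n";
      return 0;
    }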

Workloads

There are two workloads:

  • byrx - the database is cached by RocksDB
  • iobuf - the database is larger than memory and RocksDB uses buffered IO

Results: byrx (cached)

For each server there are links to two sets of results.

The first set of results has 3 lines per test. The first line is from RocksDB 8.5.4, the second from 8.7.3 using the default (=2M) for compaction_readahead_size and the third from 8.7.3 with compaction_readahead_size =96K. An example is here.

The second set of results is similar to the first, except the second and third lines are from RocksDB 9.3.1 instead of 8.7.3.

Below I use CRS to mean compaction_readahead_size and compare the QPS from the overwriteandwait microbenchmark.

The results:
  • SER4 (small server)
    • Results for 8.5.4 vs 8.7.3 and then for 8.5.4 vs 9.3.1
    • Results for overwriteandwait are here for 8.7.3 and for 9.3.1
      • 8.7.3 and 9.3.1 with CRS =2M get ~12% less QPS than 8.5.4
      • 8.7.3 and 9.3.1 with CRS =96K get ~6% less QPS than 8.5.4
      • Setting CRS to be <= max_sectors_kb is good for perf
  • C2D (medium server)
    • Results for 8.5.4 vs 8.7.3 and then for 8.5.4 vs 9.3.1
    • Results for overwriteandwait are here for 8.7.3 and for 9.3.1
      • 8.7.3 and 9.3.1 with CRS =2M get ~2% more QPS than 8.5.4
      • 8.7.3 and 9.3.1 with CRS =96K get ~26% more QPS than 8.5.4
      • Setting CRS to be <= max_sectors_kb is good for perf
  • BIG (big server)
    • Results for 8.5.4 vs 8.7.3 and then for 8.5.4 vs 9.3.1
    • Results for overwriteandwait are here for 8.7.3 and for 9.3.1
      • 8.7.3 gets the same QPS as 8.5.4 with CRS set to either =2M or =96K
      • 9.3.1 gets ~4% less QPS than 8.5.4 with CRS set to either =2M or =96K

Summary
  • Setting compaction_readahead_size to be <= max_sectors_kb helps performance on the small and medium servers but not on the big server. Note there are large differences on the big server between the value of max_sectors_kb for the RAID device and for the underlying storage devices -- it is much larger for the storage devices.
  • In the cases where reducing the value of compaction_readahead_size helped, QPS from overwriteandwait in RocksDB 8.5.4 is still better than in the versions that follow.

Results: iobuf (IO-bound with buffered IO)

For each server there are links to two sets of results.

The first set of results has 4 lines per test. The first line is from RocksDB 8.5.4, the second from 8.7.3 using the default (=2M) for compaction_readahead_size and the third from 8.7.3 with the reduced value for compaction_readahead_size (96K, or 512K on the C2D server). O_DIRECT was not used for the first three lines. The fourth line is from 8.7.3 using O_DIRECT. An example is here.

The second set of results is similar to the first, except the second, third and fourth lines are from RocksDB 9.3.1 instead of 8.7.3.

Below I use CRS to mean compaction_readahead_size and compare the QPS from the overwriteandwait microbenchmark.

The results:
  • SER4 (small server)
    • Results for 8.5.4 vs 8.7.3 and then for 8.5.4 vs 9.3.1
    • Results for overwriteandwait are here for 8.7.3 and for 9.3.1
      • 8.7.3 and 9.3.1 with CRS =2M get 30% to 40% less QPS than 8.5.4
      • 8.7.3 and 9.3.1 with CRS =96K get ~10% less QPS than 8.5.4
      • 8.7.3 and 9.3.1 with O_DIRECT get ~2% more QPS than 8.5.4
      • Setting CRS to be <= max_sectors_kb is good for perf but O_DIRECT is better
      • Average read request size per iostat (see rareqsz here) is much larger with CRS =96K than =2M (84.5 vs 4.6)
  • C2D (medium server)
    • Results for 8.5.4 vs 8.7.3 and then for 8.5.4 vs 9.3.1
    • Results for overwriteandwait are here for 8.7.3 and for 9.3.1
      • 8.7.3 and 9.3.1 with CRS =2M get 4% to 7% less QPS than 8.5.4
      • 8.7.3 and 9.3.1 with CRS =512K get ~20% more QPS than 8.5.4
      • 8.7.3 and 9.3.1 with O_DIRECT get ~11% more QPS than 8.5.4
      • Setting CRS to be <= max_sectors_kb is good for perf and better than O_DIRECT
      • Average read request size per iostat (see rareqsz here) is similar with CRS =512K and 2M (185.9 vs 194.1)
  • BIG (big server)
    • Results for 8.5.4 vs 8.7.3 and then for 8.5.4 vs 9.3.1
    • Results for overwriteandwait are here for 8.7.3 and for 9.3.1
      • 8.7.3 and 9.3.1 with CRS =2M get ~28% less QPS than 8.5.4
      • 8.7.3 and 9.3.1 with CRS =96K get 4% to 7% less QPS than 8.5.4
      • 8.7.3 and 9.3.1 with O_DIRECT get 7% to 10% more QPS than 8.5.4
      • Setting CRS to be <= max_sectors_kb is good for perf but O_DIRECT is better
      • Average read request size per iostat (see rareqsz here) is much larger with CRS =96K than =2M (61.2 vs 5.1)

Summary
  • Setting compaction_readahead_size to be <= max_sectors_kb helps on all servers.
  • On the small and big servers, performance with O_DIRECT was better than without.