Thursday, July 21, 2022

The impact of max_background_threads, part 2

This is a follow to the results I published for the impact of max_background_jobs on RocksDB performance. The previous post used an IO-bound workload. This post uses a cached workload. The purpose for this work is to evaluate my rule of thumb that the number of busy threads should be <= the number of CPU cores. This is explained in detail in the previous post.

tl;dr

  • Similar to the results for IO-bound, while throughput increases with concurrency, so does variance so my rule of thumb still holds (keep the number of busy threads <= the number of cores).

Details

In this post I use jobs=X to indicate the value for max_background_jobs. Tests were run for jobs=8 and jobs=16. Command lines are here for a run that uses max_background_jobs=8 and 8 client threads. The IO-bound test used a database with 4B key-value pairs. This test uses 40M key-value pairs so the database fits in the RocksDB block cache. See the previous post for more details.

Results: throughput

The benchmark summaries are here for 1, 8, 16, 24, 32, 40, 48, 56, 64, 72 and 80 client threads.

For all of the throughput graphs below except overwrite, the improvement with concurrency after 32 client threads isn't as good compared to the IO-bound results in the previous post. This is expected because the results here are for a CPU-bound workload, the server has 40 CPUs with 80 HW threads, and hyperthreading doesn't double compute capacity. The overwrite case is special because 1 client thread is sufficient to saturate throughput and beyond that the extra client threads just interfere with compaction threads.

Graphs for throughput are next. The first graph is for readrandom. I don't share the graph for fillseq because the benchmark step finishes in less than 1 minute. Throughput improves with concurrency but the improvement degrades with more concurrency.


The graph for fwdrangewhilewriting is similar to readrandom.


The graph for readwhilewriting is similar to readrandom.


The graph for overwrite shows that 1 client thread is sufficient to saturate throughput. That isn't a surprise because fsync isn't enabled and writes are just inserts into an in-memory table. The average throughput is usually, but not always, better for jobs=16 than jobs=8.


Results: p99.99 response time


The p99.99 response time graphs are similar. Performance degrades once the CPU becomes saturated around 40 client threads. The first graph is for readrandom.


The next graph is for fwdrangewhilewriting.


The next graph is for read while writing. The curve here is straighter than the curve above for fwdrangewhilewriting. I have yet to explain that.


And the final graph is for overwrite. There is an interesting artifact between 64 and 80 client threads. I have yet to explain that.



Results: throughput vs time

The next graphs display throughput per 5-second interval. There are two graphs per benchmark step -- one for jobs=8 and another for jobs=16.

The readrandom results aren't a surprise. Throughput increases with concurrency as does variance - compare the results for 1 client thread (nt1) with 80 client threads (nt80).


The graphs for fwdrangewhilewriting have more variance with increase concurrency compared to readrandom. This is expected for two reasons. First, there is less CPU because client threads compete with compaction threads for CPU. Second, there is more mutex contention and can be more write stalls. In this case there weren't any write stalls but mutex contention is harder to measure. Also note that throughput doesn't improve much beyond 32 client threads. This is also OK.





Results for readwhilewriting are similar to fwdrangewhilewriting in that beyond 32 client threads there is more variance and not much improvement in throughput. This lack of improvement is OK but I hope we can reduce the variance. However there are differences. There is less variance in general here than for fwdrangewhilewriting. My theory is that the range queries in fwdrangewhilewriting are more sensitive to the shape of the LSM tree. The point queries benefit from bloom filters. The range queries do not and there is periodic behavior as new SSTs are added to the L0, then compacted, repeat.



The overwrite graphs show the most variance to the point that they are hard to read. Keep on reading and there will be other graphs. Note that 1 client thread is sufficient to saturate throughput for overwrite and extra threads just compete for the CPU with compaction threads and increase variance.



The results for overwrite have much variance even at 1 client thread. The results for jobs=8 and jobs=16 are similar. The worst-case write stalls are less than 0.5 seconds.



The results for overwrite with 8 client threads have a different pattern than 1 client thread, but still much variance.




The last pair of graphs is for overwrite with 32 client threads. The results are similar to the graphs above for 8 client threads.



Results: overwrite

This section has a few more graphs for overwrite.

The p50 response times are similar for jobs=8 and jobs=16.

The p99 response times are similar for jobs=8 and jobs=16.


The p99.9 response times are similar for jobs=8 and jobs=16.


The p99.99 response times are similar for jobs=8 and jobs=16.


The max response times are similar for jobs=8 and jobs=16, but less so than the graphs above.


The graphs for write stall% are similar for jobs=8 vs jobs=16 but not similar to the graphs above. The best (lowest) case occurs for 1 client thread. The worst is for ~32 client threads. I don't know why it drops from there.


Wednesday, July 20, 2022

The impact of max_background_jobs on write throughput for RocksDB

I rely on rules of thumb when configuring a DBMS for a benchmark. This post is about configuring the number of background threads and the number of client threads for RocksDB to improve throughput and quality of service (low variance) with IO-bound workloads. The background threads are used for compaction and memtable flushes. The option is max_background_jobs

Eventually I need to evaluate or re-evaluate whether these rules of thumb are valid and I am doing that here. One rule I use is that the number of busy threads should be less than or equal to the number of CPU cores (real cores, not HW threads courtesy of hyperthreading) and the reason for that rule is to reduce variance for workloads that are either CPU-bound or write-heavy (write-heavy implies much CPU from compaction). I frequently run benchmarks to look for performance regressions and variance makes that search harder. By busy threads I mean #client-threads + #background-threads where #client-threads is the number of client threads used by the benchmark client and #background-threads is the value for max_background_jobs.

For this blog post I compare the performance between two RocksDB configurations that are identical with one exception. One uses max_background_jobs=8 and the other uses =16.

The tl;dr context is an IO-bound workload (many storage reads and writes) on a server with fast storage and 80 HW threads (40 cores, hyperthreading enabled).

tl;dr

  • Throughput improves with concurrency for up to 80 client threads for IO-bound workloads that are read-only or read-mostly. This comes at a reasonable increase in response time, usually <= 1.5x although fwdrangewhilewriting stops improving at 72 client threads.
  • Throughput for overwrite (write-only) degrades with concurrency. The peak is at <= 16 client threads. This isn't a surprise, as there is no fsync on write and the additional client threads just compete with compaction threads for CPU although the interference is interesting (keep on reading).
  • Throughput for overwrite is ~1.4X better with max_background_jobs=16 vs =8 on average. But the cost of better average throughput is more variance.
For now I will stick with jobs=8 when doing benchmarks on this hardware to get better quality of service but less throughput for overwrite.

Details

Something I learned recently is that the number of flush threads is max_background_jobs/4 and the number of compaction threads is max_background_jobs minus the number of flush threads. The code is here.

I used benchmark.sh, a server with 80 HW threads (hyperthreading enabled) and fast storage. The workload was IO-bound -- 4B KV pairs, ~1TB database and a low cache hit rate. The benchmarks were run for 1, 8, 16, 24, 32, 40, 48, 56, 64, 72 and 80 client threads using two configurations that only differed in the configuration value for max_background jobs -- 8 vs 16. Note that fillseq always uses 1 client thread.

I used a recent build of db_bench at git hash 7e2004a123. This has code that will become the 7.5 release. It includes two changes that greatly reduce the worst-case write stall (6115254 and b397dcd) but that is a topic for another post.

The sequence of benchmark steps is determined by benchmark_compare.sh and the interesting steps were: fillseq, readrandom, multireadrandom, fwdrangewhilewriting, readwhilewriting and overwrite. The *whilewriting steps used a 2MB/s rate limit on the background writer and were not write heavy. The fillseq step is write heavy but doesn't require compaction because inserts are in key order and the push down optimization is used. The final step, overwrite, is write heavy. Command lines are here for a run with 8 client threads and max_background_jobs=8.

I neglected to use --multiread_batched=true with the multireadrandom benchmark step so the results are mostly a duplicate of readrandom.

The benchmark summaries are here for 1, 8, 16, 24, 32, 40, 48, 56, 64, 72 and 80 client threads.

These benchmarks might suffer from coordinated omission because a fixed number of client threads is used. These results might also overstate the benefit of more throughput from more concurrency because the benchmark client doesn't enforce a response time SLA.

Results: throughput

The throughput graphs have results for 1, 8, 16, 24, 32, 40, 48, 56, 64, 72 and 80 client threads.

I used gnuplot for the graphs and did not set the y-range to zero. This was done to make it easier to see the differences but can mislead a drive-by reader. I also use jobs= rather than max_background_jobs= to save on typing. I also used lines rather than bars. Next time I might use bars.

Throughput for fillseq was slightly better with jobs=8 than =16. The difference is small and I won't try to explain it. The fillseq benchmark step does inserts in key order and always uses 1 client thread.

Throughput for readrandom improves with concurrency up to 80 client threads but the improvement degrades and USL might explain that. There is no difference between jobs=8 and =16. That is expected because this is read-only.

The results for fwdrangewhilewriting are similar to readrandom. There is no difference between jobs=8 and =16. That is probably because the write rate is low (2MB/s).

The results for readwhilewriting are similar to readrandom.

The throughput for overwrite is ~1.4X better for jobs=16 than =8. I didn't expect this but it is not a huge surprise. Write amplification during this test is ~17. Compaction IO statistics from the test end are here. The LSM tree has data in 8 levels so this is one example where the write-amp estimate of per-level fanout X number-of-levels is pessimistic.



Results: p99 response time

These graphs show the p99 response time in microseconds.

The results for readrandom and multireadrandom are interesting but I won't try to explain them.

The p99 response time for fwdrangewhilewriting shows that saturation happens after 72 client threads. That it happens at some point is expected.

The p99 response time for readwhilewriting doesn't show the saturation that is visible above for fwdrangewhilewriting.

The p99 response time for overwrite is the most interesting. Above it was shown that throughput for jobs=16 was ~1.4X better than for =8. But here the p99 response times are similar. While I haven't displayed it, the p50 response times are also similar. This will be explained below.

Throughput vs time

Next up is graphs for throughput vs time at 5-second intervals and I just wrote a note to myself change this to 1-second intervals for future tests. For each benchmark step there are two graphs, one for jobs=8 and another for jobs=16, with 11 lines per graph. There is one line for each number of threads, from 1 to 80 and they are labeled as nt$X on the graph where $X is the number of threads.

The results for fillseq show more variance at first glance for jobs=8 but that might be an illusion because the max for the y-range is larger for jobs=16. The throughput range in both cases is between ~100k/second and ~300k/second after 2000 seconds.



For readrandom the variance increases with the number of client threads. The result for 80 client threads (nt80) has the most jitter. Results for multireadrandom are similar and I won't show them here.

Results for fwdrangewhilewriting show increasing variance as the number of client threads increases.


Results for readwhilewriting show increasing variance as the number of client threads increases.

The overwrite benchmark step has the most interesting graphs although it can be hard to see that here given there are 11 lines per graph. That will be fixed below if you keep on reading. Important points are:

  1. The worst-case write stall (minutes long) doesn't occur because it was recently fixed. I will explain more about that soon in another blog post.
  2. The result for jobs=8 shows a regular pattern of a peak then a gradual decline.
  3. The result for jobs=16 has more variance.
Which do you prefer -- better average throughput or less variance? Obviously, both is my answer.


The graphs above for overwrite are busy so I also have graphs limited to a specific number of client threads -- 1, 8 and 32. The problem with variance for jobs=16 doesn't occur at 1 client thread (nt1) but does occur at 8 or more client threads. I hope to explain this one day. For now I will wave my hands and claim that the additional client threads interfere with the compaction threads -- for mutexes and CPU time.


Overwrite: the mystery about p50 and p99

Above I showed that overwrite has similar p99 response times for jobs=8 and jobs=16 but average throughput is ~1.4X better for jobs=16. I didn't show it but the p50 response times are also similar. But in the p99.9 and p99.99 results (no graphs but read the summaries linked above) the values are worse for jobs=16.

The problem is that outliers are worse for jobs=8 and one way to view that is via the response time histrograms printed at the end of the benchmark step. They are here for the nt1 (1 client thread) test. With jobs=16, 99.278% of responses take <= 10 usecs vs (see here and here).

While they are listed in the summaries linked above I don't have graphs for the write stall% from overwrite as this post is already too long. But from the linked summaries at 1 client thread the write stall% is 24.6 for jobs=8 vs 0% for jobs=16. At 8 client threads it is 54.2% for jobs=8 vs 33.1% for jobs=16.