Tuesday, August 23, 2022

How I do RocksDB performance tests, part 2

This post extends my previous post. It isn't specific to RocksDB. It also has more opinions and might serve as speaker notes were I to write slides. I am writing this prior to speaking to peers about my work so it might have an audience of one (me) but that is typical of many of my posts. Regardless, I appreciate that people read and engage with some of my posts.

Points

  • How did I get here?
    • Long ago I worked on DBMS internals full time - I added features, fixed bugs and created bugs. Then I moved to web-scale MySQL at Google and started to spend time in production. Production is a great education but it came at the cost of less time for new features. I still spent much time finding and fixing bugs. After a few years I moved to Facebook and the trend continued. Over time I stopped doing feature development, spent much less time fixing bugs but still spend time reporting bugs. I read a lot of code when trying to explain things, but I don't write much that makes it upstream. I have enjoyed the change. I don't need to write code because I am surrounded by talented developers. I can specialize in my thing while others specialize in theirs. It is hard to be an expert in too many things.
  • Benchmarks are what you make of them
    • They are far from perfect but they are quite useful. Testing by comparing things and explaining the results makes them more valuable. Benchmarks informed by production are even better.
  • How does code age?
    • Single-thread performance on CPUs isn't improving like it used to. Long-lived code tends to attract CPU regressions. This combination is a problem. Good regression tests help spot the problems, and spotting them early is a big deal because removing them long after they arrive is too expensive. This isn't just a technical problem. How do you reject new features that help a fraction of the user base when the cost is more CPU overhead for all users?
  • Needs improvement
    • I hope to get better about using benchmarks that avoid coordinated omission, have more statistical rigor, expand beyond single-node tests, use benchmark workloads that are adaptive and use benchmarks that enforce response time constraints.
  • Build a network of smart peers
    • I have been fortunate to have many talented peers. I engage with Postgres experts on Twitter and have also met smart people who work on upstream projects via bug report discussions. 
  • Explain things
    • Explain your results. But find a balance because explaining everything will slow you down.
  • Testing an LSM is complicated
    • Old posts on this are here and here.
    • The shape of an LSM tree has more variance than the shape of a B-tree. This can be a source of variance in benchmarks, especially in read-heavy ones. While this is still a work in progress, there are db_bench commands to make the LSM shape more deterministic (flush memtable, compact L0, compact L1, wait-for-compaction). A sketch of such a sequence is at the end of this post.
    • Another problem is a test that inherits a non-deterministic amount of compaction debt. If the sequence is --benchmarks=write-heavy,read-heavy then the read-heavy step might suffer from compaction debt inherited from write-heavy. The impact of reducing this debt during the read-heavy step can vary and produce confusing results for the read-heavy step.
    • Try to get the LSM tree into a steady state before read-heavy tests. For example, after fillseq there is no overlap between SSTs. After a full compaction there is only one level (or one SST). These are usually not steady states.
    • For a load + query benchmark it is easy for the LSM (or B-tree) to not be in a steady state after the load and many benchmarks suffer from this. If the load is done in key order then the PK index won't be fragmented with a B-tree and the SSTs won't overlap with an LSM, which hides some of the overhead that occurs during query processing. When storage is a locally attached SSD and the workload is heavy on IO then you need to worry about non-determinism from the SSD. You either want no SSD GC or to get SSD GC into a steady state (by running the test for long enough and having database files that are large enough, something between 50% and 90% of the device capacity).
  • Make the DBMS unhappy
    • Find ways to make the DBMS unhappy and see if it falls over. The challenge is that there are more and less realistic ways to make a DBMS fall over. An easy way to make a DBMS unhappy is to provide it with too many concurrent requests, especially a DBMS that doesn’t provide admission control (like RocksDB). But some problems are best fixed elsewhere because fixes have an opportunity cost. It might be nice to have an optional RocksDB API that implements admission control. 
  • Define your goals
    • Do you care about average throughput or outliers (p99, p99.9, p99.99)? I have a post on this. Average throughput is easy to measure but p99 and beyond (p99.9, p99.99) matter in production because outliers determine user experience and capacity planning is based on p99. While single-valued metrics like p99 are easy to share, graphs for throughput over time at 1-second intervals make it easier to spot problems like stalls, cyclic behavior or throughput that degrades over time. A sketch of collecting per-interval throughput from db_bench is at the end of this post.
  • Statistical rigor
    • Statistical rigor is great but can be expensive. Repeating every benchmark 3 times increases the accuracy of your results at the cost of 3X more HW. I am usually less rigorous, in part because I frequently repeat benchmark runs after making a mistake or realizing I need to measure one more thing. Another way to think of this is: assume B units of HW capacity, each benchmark has a warmup cost of W and runtime of R. Then solve for N in B = N(W+R) where N is the number of times the benchmark is repeated. For example, with B = 24 hours and W = 1 hour, N=3 limits R to 7 hours while N=1 allows R to be 23 hours. A larger value for N implies a smaller value for R and the confidence interval is a function of both N and R.
  • Coordinated omission
    • Coordinated omission is a real problem. All of the benchmark clients that I use suffer from it, yet they are still useful. Two things prevent me from doing open-loop benchmarks. First, the benchmark clients I use don't support it and it takes work to implement a new benchmark client and incorporate it into my workflow. Second, an open-loop benchmark takes more work to set up because I need to discover an arrival rate that the DBMS can handle, or I need a more complicated client that can discover it for me. One day I will use open-loop clients.
  • Response time constraints
    • The db_bench benchmark client for RocksDB doesn't have an option to use response time constraints (ignore responses slower than X ms). Another problem is computing throughput without a response time constraint: more concurrency usually means more throughput, but it also means worse response time and more response time outliers, and those slow responses should not be counted. Most of the benchmark clients that I use don't enforce a response time SLA. Enforcing an SLA is more work because you need to select a reasonable value, but I hope to improve here and to add support for this to db_bench.
  • Single node
    • Most of my testing runs the benchmark client and the DBMS on the same host. While I prefer to use separate hosts for client and server when the DBMS supports it, that introduces the risk of perf variance because I will be sharing the network.
  • Stable platform
    • I use HW at work, in the public cloud and my home test cluster. My work HW has value-added services that consume a variable and occasionally significant amount of compute and storage so I am wary of using it for low-concurrency benchmarks. Public cloud HW means I am using a VM and might be sharing compute and storage with noisy neighbors, so I found a way to reduce the CPU variance by using the largest number of CPUs for a given instance type and disabling HT. From quick tests with fio there wasn't much variance in the cloud block storage I chose. My home HW is the most stable after I disabled HT and turbo boost (a sketch of the commands is at the end of this post). Alas, it is also the least capable: 4 CPUs, 16G of RAM.
  • Compare things 
    • I rarely test one thing in isolation because the results are hard to interpret. So I do A/B or even A/B/C/D/... testing where these represent different DBMS, different versions of the same DBMS or different configurations for one version of one DBMS.
  • Measure things
    • Start with throughput, then add variance, then add CPU, IO and memory. Foreground CPU and IO can remain constant while background CPU and IO change significantly, and most DBMS do much work in the background (excluding SQLite, which doesn't have background threads). Don't forget to watch VSZ/RSS for the DBMS processes because increases there might lead to OOM. Watch disk space usage because growth there can lead to out-of-space errors. When something is slower, search top down. Look at iostat metrics to see if IO/query has changed. Look at vmstat to see if CPU/query has changed. Look at vmstat to see if context switches/query has changed (mutex contention?). Normalize your metrics: IO/query, CPU/query, context switches/query. I frequently have scripts running that scrape output from ps and top. To watch for disk space issues I have a script that runs du and ls in a loop during benchmarks. A sketch of such a monitoring loop is at the end of this post.
  • Summarize things
    • One practice I have is to create one-line performance summaries with useful numbers for throughput, HW (CPU/storage/memory/disk space) usage and normalized HW usage (CPU/query, IO/query). One-line summaries make it easy to compare performance when A/B or A/B/C/D/... testing is done. They also make it easy to spot regressions that don't directly hurt throughput but are a concern -- larger RSS for the DBMS process, more disk space used, more CPU consumed by background threads. The summaries also provide a starting point when I try to explain a performance change. An example is here.
  • Name & archive things
    • A mistake I have made many times is starting a benchmark, getting interrupted for a week and forgetting an important detail about the benchmark when I return. Naming patterns reduce the chance of this. I try to archive the test scripts and command lines via GitHub. Saving RocksDB LOG files is also important. All of my important scripts are in GitHub.
  • Adaptive benchmark clients
    • I often have to assume things when configuring a benchmark client. The number of threads that db_bench uses for clients is currently fixed. It would be nice to have some benchmarks that increase the request rate or the number of request clients over time or until a response time constraint is violated. I currently do this manually and my solution is sad (a sketch of that manual approach is at the end of this post).
  • Proactive vs reactive
    • Is it a bug when it has yet to happen in production? That is an interesting question. The answer requires nuance. Some bugs do happen but have yet to be noticed, some bugs might happen and are worth avoiding, and other bugs just aren't worth fixing. It isn't always easy to classify a bug into one of these groups.
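
A minimal sketch of the db_bench sequence meant in the "Testing an LSM is complicated" item: run the write-heavy step, then flush the memtable, compact L0, compact L1 and wait for compaction so the read-heavy step doesn't inherit a non-deterministic amount of compaction debt. The benchmark names (flush, compact0, compact1, waitforcompaction) and flags are my best recollection of recent db_bench, so confirm them with db_bench --help, and /data/rdb is just a placeholder path.

  # load in key order
  db_bench --db=/data/rdb --num=100000000 --benchmarks=fillseq
  # write-heavy step, then the determinism steps, then the read-heavy step
  db_bench --db=/data/rdb --use_existing_db=1 --num=100000000 --duration=1800 \
    --benchmarks=overwrite,flush,compact0,compact1,waitforcompaction,readrandom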
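
A sketch for the "Define your goals" item: getting throughput over time at 1-second intervals from db_bench rather than just the average. The --report_interval_seconds, --report_file and --stats_interval_seconds flags are from memory, so confirm them with db_bench --help.

  # write per-interval QPS to a CSV so stalls, cyclic behavior and slow
  # degradation are visible in a graph
  db_bench --db=/data/rdb --use_existing_db=1 --benchmarks=readwhilewriting \
    --duration=1800 --report_interval_seconds=1 --report_file=/tmp/rww.csv \
    --stats_interval_seconds=60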
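
A sketch for the "Stable platform" item of what disabling HT and turbo boost looks like on Linux. The sysfs paths below depend on the kernel and cpufreq driver, so treat them as assumptions to verify on your HW.

  # disable SMT (hyperthreads), needs root
  echo off > /sys/devices/system/cpu/smt/control
  # disable turbo boost with the intel_pstate driver
  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
  # or with the acpi-cpufreq driver
  echo 0 > /sys/devices/system/cpu/cpufreq/boost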
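
A sketch for the "Measure things" item of the kind of monitoring loop I mean. This isn't my actual script; $dbpid and $dbdir are placeholders for the DBMS process id and the database directory.

  while true; do
    date
    vmstat 1 2 | tail -1        # CPU utilization and context switches
    iostat -x 1 2               # per-device IO rates
    ps -o vsz=,rss= -p $dbpid   # VSZ and RSS for the DBMS process
    du -sk $dbdir               # disk space used by the database
    sleep 60
  done >> mon.log 2>&1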
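
A sketch for the "Adaptive benchmark clients" item of the manual approach: rerun db_bench at increasing client counts and then compare throughput and response time outliers across the runs. The flags are my best recollection of db_bench, so confirm them with --help.

  for t in 1 2 4 8 16 32 64; do
    db_bench --db=/data/rdb --use_existing_db=1 --num=100000000 \
      --benchmarks=readrandom --threads=$t --duration=300 > rr.$t.out
  done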
