## Wednesday, February 1, 2023

### Statistical rigor and database benchmarks

I am good at benchmarks and not good at statistics. I have no doubt that my work lacks statistical rigor. I am also wary of the "X has a flaw, therefore X is useless" argument that is overdone on the web, but I don't know whether it applies here.

Update -- if you have suggestions for a paper or book I should read then please share them.

The challenge is that I have a finite budget (my time and HW time) in which to learn something, and what I want to learn are the performance and efficiency characteristics of a DBMS for a given workload. Things I want to learn include:

• HW consumed (CPU and IO per request, database size on disk, etc)
• Response time at 50th, 99th and other percentiles. Histograms are also great but take more space. Think of response time percentiles as a lossy-compressed histogram (flawed but useful).
• Results at different concurrency levels
• Results for IO-bound and CPU-bound workloads
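As a toy illustration of treating response time percentiles as a lossy-compressed histogram, here is a nearest-rank percentile over raw samples (the sample values are made up):

```python
# Sketch: summarize response-time samples as percentiles, i.e. a
# lossy-compressed histogram. Sample latencies (ms) are hypothetical.
def percentile(sorted_samples, p):
    """Nearest-rank percentile for p in [0, 100] over a sorted list."""
    idx = min(len(sorted_samples) - 1,
              int(p / 100.0 * len(sorted_samples)))
    return sorted_samples[idx]

samples = sorted([1.2, 0.9, 3.4, 0.8, 15.0, 1.1, 1.0, 2.2, 0.7, 1.3])
p50 = percentile(samples, 50)   # typical request
p99 = percentile(samples, 99)   # tail request
```

The full histogram would also show how many requests sit between p50 and p99; the percentiles alone discard that, which is the "flawed but useful" trade-off.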

A benchmark has a startup cost (time to load and index) and a warmup cost (warm the caches, fragment index structures, get SSDs to start doing GC). It also has time-varying behavior, and the longer you run the benchmark the more likely you are to see such behavior -- stalls from B-Tree checkpoints, LSM compaction and flash GC are things you want to capture.

Assume the budget is B units of time, it takes S units of time to load and index and it takes W units of time to warmup after load and index. With such assumptions what is the best way to use that budget to learn something about performance and have sufficient statistical rigor?
1. Run the benchmark once and collect results from the remaining (B - S - W) units of time
2. Run the benchmark N times and collect results from the remaining (B - S - NW) units of time assuming that I can archive/restore the database quickly after the load and index step. Otherwise the remaining time is (B - NS - NW).
If #2 is the better approach then what is a good value for N?
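The arithmetic behind the two options can be sketched as follows (the budget numbers are hypothetical):

```python
# Sketch of the measurement-time budget above: B = total budget,
# S = load+index time, W = warmup time, N = number of runs.
def measure_time(B, S, W, N, can_restore_snapshot):
    """Units of time left for actual measurement after setup costs."""
    if can_restore_snapshot:
        return B - S - N * W    # pay load+index once, warmup N times
    return B - N * (S + W)      # pay both costs on every run

# With a hypothetical 100-hour budget, 10h load+index, 5h warmup:
one_run = measure_time(100, 10, 5, 1, True)    # 85 hours of measurement
four_runs = measure_time(100, 10, 5, 4, True)  # 70 hours, split 4 ways
```

So with a fast archive/restore, going from N=1 to N=4 costs 15 hours of measurement time in this example; without it, the cost is 45 hours.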

1. Generally the standard deviation of the mean (the standard error) goes down with the square root of N, thus 4 experiments halve it. The 95% confidence interval is that your mean value will be between X - 2*std_err and X + 2*std_err; replacing 2 with 3 gives about a 99% confidence interval. This assumes a Gaussian distribution, which is a good quick estimate.
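A minimal sketch of that quick estimate, using the sample standard deviation divided by sqrt(N) as the standard error (the sample values are made up):

```python
# Sketch: Gaussian confidence interval for the mean of N runs.
# z = 2 gives roughly a 95% interval, z = 3 roughly 99%.
import math
import statistics

def gaussian_ci(samples, z=2.0):
    mean = statistics.mean(samples)
    stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    return (mean - z * stderr, mean + z * stderr)

# Hypothetical throughput results from N=4 runs:
lo, hi = gaussian_ci([10, 12, 11, 13])
```

Doubling N from 4 to 16 would halve stderr again, at the cost of shorter (or more) runs within the fixed budget.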

1. We are discussing the thing I have a question about. Using the equations above, if I do N=4 rather than N=1 experiments, then the run time for each experiment must be reduced because the budget of HW time (B) is fixed.

2. Actually, when testing a DBMS it isn't proper to think of experiments as independent of each other, as the above does. The reason is that a DBMS runs checkpoints regularly, which affects performance. My experience is that whether you run several experiments or just longer experiments doesn't matter much. Given the workings of a DBMS, I would more or less say that each period of running the benchmark is a separate experiment. These experiments are not completely independent, so the statistics aren't perfect, but running longer at least makes the variance go down, although not as fast as the square root of the number of periods.

3. I am in the camp that prefers to run benchmarks for longer periods of time, but quantifying whether my rule of thumb is good advice will be expensive. The goal is to make sure the indexes (btree fragmentation and the LSM equivalent) are in a steady state.

I see too many benchmarks that load in PK order, create secondary indexes, then immediately measure perf. At that point those indexes are in great shape (unfragmented), which both helps and hurts perf. It hurts because they are (likely) more prone to time-consuming index splits. It helps because they (likely) have minimal space amplification and are most dense.

2. Of course, that's assuming his benchmark is capturing benchmark *throughput*, for which working under Gaussian assumptions is acceptable. But since he mentioned "response time" in the article, which is decidedly not Gaussian in nature, he would want to avoid mean/stddev/stderr in his benchmark reporting.

1. Forgot to mention that I am looking for paper/book recommendations on this topic.

2. My library is full of books on mathematical statistics, but I think these are too theoretical for you. I suggest books on performance analysis of computer systems that also cover basic statistics. I have two such books in my library: The Art of Computer Systems Performance Analysis by Raj Jain and Performance Modeling and Design of Computer Systems by Mor Harchol-Balter. Personally I like mathematics as a hobby as well, so I got a book as a birthday present to learn nonlinear programming (I don't expect it to benefit my work on DBMSs, but one never knows :) ). Good luck with the reading.

3. We have similar hobbies. I have more than a few math books, including a few for Linear & Nonlinear programming and the titles you mention above, that I will eventually spend time on. I had downtime several years ago and enjoyed a few months with them. I look forward to repeating that.

3. https://parsec.cs.princeton.edu/publications/chen14ieeetc.pdf or https://aakinshin.net/posts/statistics-for-performance/ :)

4. The answer I have for you here is "it depends a lot on your benchmark and it's subtle." I also don't know if there are good books on this. I learned about the useful statistics for computer benchmarks from studying econometrics, but I don't have a textbook to recommend.

Your benchmark is not stationary (due to LSM compactions and GC), but there may be a steady state point where you can assume that it is roughly stationary. If you call the time to reach that steady state S, and you assume that this "spin-up" time is "dirty" (since DBs tend to run for a long time, the production numbers you care about are probably only from some time after S), running one trial will get you much better data, and you can get accurate descriptive statistics from the one run. Running N trials may actually reduce the fidelity of your benchmark, since you incur N * S spin-up time instead of S, and variance in the spin-up period can hurt. However, if your database does not spin up in the same way every time, running only one benchmark will get you bad numbers.

Without knowing more, my inclination is to suggest that you run either 1 trial if you are confident that the benchmark will be repeatable (because you know a lot about the database) or N <= 5 trials (to validate the meta-assumptions involved in only running one trial), taking the results from the median trial, depending on how paranoid you are about the repeatability of the benchmark.
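A sketch of the take-the-median-trial idea, assuming each trial is summarized by a single throughput number (the values below are hypothetical):

```python
# Sketch: run N <= 5 trials, report the median trial. Using the median
# rather than the mean limits the damage from one unlucky spin-up.
def median_trial(results):
    """Return the trial result at the median position."""
    ordered = sorted(results)
    return ordered[len(ordered) // 2]

trials = [9800, 10150, 10020, 9950, 10400]  # hypothetical QPS per trial
best_estimate = median_trial(trials)
```

If the spread across trials is small, that is evidence for the repeatability assumption and future runs can drop back to one trial.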

1. Thanks for the reply. Books with titles like "Math for Economists" have been great for refreshing my skills but I have yet to open the econometrics texts as those are a harder read. But when I return to my math texts I will start with probability, stats and linear algebra.

WRT variance after S, that is a big issue for read-only benchmarks with an LSM because the LSM tree shape can vary between runs once the workload becomes read-only, and that has a big impact on CPU overhead per query. Issues include the number of keys in the memtable and the number of SSTs in the smaller levels of the LSM tree.

5. If your systems are stable, one iteration is generally enough.

However, my main attention would be focused not just on final numbers, but rather on overall stability and progress in processing during test workloads. For example, when you see stalls -- for many users that will not be acceptable, and the DBMS vendor needs to address it, etc. The same goes for response time stability (and not just a final AVG or P95)..

And the main rule : "you can always generate more money, but you cannot generate more time.." -- so, don't forget to enjoy your life ;-))

Rgds,
-Dimitri

1. Plotting throughput over time is something I don't do as much as I should. It shows things that aren't easy to see via percentiles and histograms. A common problem that throughput graphs highlight is regular stalls from checkpoint or compaction.
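A minimal sketch of building such a throughput-over-time series, assuming you log one completion timestamp per request (the timestamps below are made up, in seconds since the benchmark started):

```python
# Sketch: bucket per-request completion timestamps into 1-second
# intervals; a zero (or dip) in the series is a stall that percentile
# summaries can hide.
from collections import Counter

def throughput_per_second(completion_times):
    """Requests completed in each 1-second interval, including empty ones."""
    buckets = Counter(int(t) for t in completion_times)
    end = int(max(completion_times))
    return [buckets.get(s, 0) for s in range(end + 1)]

series = throughput_per_second([0.1, 0.5, 0.9, 1.2, 3.7])
# second 2 has zero completions -- exactly the kind of stall to plot
```

Feeding the series to any plotting tool then makes checkpoint or compaction stalls visible as periodic dips.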

WRT enjoying life -- I have my odd hobbies, spring is almost here which means I spend much time outside.

2. Plot not only throughput (QPS/TPS), but also Min/Max/Avg/P95 response times for every 1 sec. to observe whether your processing is stable or you have "waves", stalls, and so on -- e.g. not just final numbers; then everything is easy to see and point out ;-))

Many users have different ETA requirements: some will never accept stalls, some have a max latency "stability barrier", and so on.. Meanwhile, many benchmark workloads do not keep processing under the same conditions over time as at the beginning (data will grow, the BTree will be larger, more background work gets involved, etc.) -- so only by having the whole picture can you confirm (at least for yourself) that on the given workload you did not miss any problem (or not). It is always better to discover this in tests than in Production ;-))

Happy to hear all is good on your side ! and All My Best to all your future testings !

Rgds,
-Dimitri

6. I have always used the following approach to testing since my early IBM days:

* Repeat the same test multiple times, especially in cloud environments. I often choose to do 12 tests.
* Discard the best and worst outliers of the tests.
* For all datapoints collected, summarize and get the median, average and spread. Do this not only for the summary metrics (total transactions, errors, etc.) but also for the per-interval metrics collected. This way you can see the median behavior and how much deviation appears, which is also important when measuring the microstalls due to compaction, etc.
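The recipe above can be sketched roughly as follows (the trial numbers are made up):

```python
# Sketch: run many trials, drop the best and worst, then report
# median, mean and spread of the rest -- per the recipe above.
import statistics

def trimmed_summary(trials):
    trimmed = sorted(trials)[1:-1]   # discard best and worst outliers
    return {
        "median": statistics.median(trimmed),
        "mean": statistics.mean(trimmed),
        "spread": max(trimmed) - min(trimmed),
    }

# 12 hypothetical per-trial throughput results, with two outliers:
summary = trimmed_summary(
    [100, 98, 103, 97, 101, 99, 102, 100, 250, 60, 101, 99])
```

The spread is the cheap stand-in for deviation here; reporting it alongside the median shows whether the trials actually agree.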

Also, I have historically run the MySQL tests at 10 different levels of concurrency using sysbench: 1, 2 ... 1024 threads (or fewer depending on the machine). As someone mentioned before, you want to see the steady state -- approximately, as DBs in many scenarios do not have such a state due to temporal load variations -- so the idea is to run at each concurrency level for enough time that the steady state is achieved, which is often around 20-30 minutes (except when dealing with very large machines and datasets).