Small Datum: Statistical rigor and database benchmarks

Wednesday, February 1, 2023

Statistical rigor and database benchmarks

I am good at benchmarks and not good at statistics. I have no doubt that my work lacks statistical rigor -- I am also wary of the X has a flaw, therefore X is useless argument that is overdone on the web, but I don't know whether that applies here.

Update - if you have a suggestions for a paper or book I should read then please share.

The challenge is that I have a finite budget (in my time and HW time) in which to learn something and what I want to learn are performance and efficiency characteristics about a DBMS for a given workload. Things I want to learn include:

HW consumed (CPU and IO per request, database size on disk, etc)
Response time at 50th, 99th and other percentiles. Histograms are also great but take more space. Think of response time percentiles as a lossy-compressed histogram (flawed but useful).
Results at different concurrency levels
Results for IO-bound and CPU-bound workloads

A benchmark has a startup cost (time to load and index) and a warmup cost (warm the caches, fragment index structures, get SSDs to start doing GC). It also has time varying behavior and the longer you run the benchmark the more likely you are to see such behavior -- stalls from B-Tree checkpoint, LSM compaction and flash GC are things you want to capture.

Assume the budget is B units of time, it takes S units of time to load and index and it takes W units of time to warmup after load and index. With such assumptions what is the best way to use that budget to learn something about performance and have sufficient statistical rigor?

Run the benchmark once and collect results from the remaining (B - S - W) units of time
Run the benchmark N times and collect results from the remaining (B - S - NW) units of time assuming that I can archive/restore the database quickly after the load and index step. Otherwise the remaining time is (B - NS - NW).

If #2 is the better approach then what is a good value for N?

15 comments:

Mikael RonstromFebruary 1, 2023 at 12:10 PM
Generally standard deviation goes down with the square root of N, thus 4 experiments halfs the standard deviation. The 95% confidence interval is that your mean value will be between X - 2*std_dev, X+2*std_dev, replacing 2 by 3 gives about 99% confidence interval. This is assuming Gaussian distribution which is a good quick estimate.
ReplyDelete
Replies
Mark E. Dawson, JrFebruary 1, 2023 at 5:20 PM
Of course, that's assuming his benchmark is capturing benchmark *throughput*, for which working under Gaussian assumptions is acceptable. But since he mentioned "response time" in the article, which is decidedly not Gaussian in nature, he would want to avoid mean/sddev/stderr in his benchmark reporting.
ReplyDelete
Replies
VKFebruary 2, 2023 at 5:28 AM
https://parsec.cs.princeton.edu/publications/chen14ieeetc.pdf or https://aakinshin.net/posts/statistics-for-performance/ :)
ReplyDelete
Replies
NimaFebruary 2, 2023 at 7:04 AM
The answer I have for you here is "it depends a lot on your benchmark and it's subtle." I also don't know if there are good books on this. I learned about the useful statistics for computer benchmarks from studying econometrics, but I don't have a textbook to recommend.

Your benchmark is not stationary (due to LSM compactions and GC), but there may be a steady state point where you can assume that it is roughly stationary. If you call the time to reach that steady state S, and you assume that this "spin-up" time is "dirty" (since DBs tend to run for a long time, the production numbers you care about are probably only from some time after S), running one trial will get you much better data, and you can get accurate descriptive statistics from the one run. Running N trials may actually reduce the fidelity of your benchmark, since you incur N * S spin-up time instead of S, and variance in the spin-up period can hurt. However, if your database does not spin up in the same way every time, running only one benchmark will get you bad numbers.

Without knowing more, my inclination is to suggest that you run either 1 trial if you are confident that the benchmark will be repeatable (because you know a lot about the database) or N <= 5 trials (to validate the meta-assumptions involved in only running one trial), taking the results from the median trial, depending on how paranoid you are about the repeatability of the benchmark.
ReplyDelete
Replies
DimitriFebruary 3, 2023 at 3:03 AM
If your systems are stable, one iteration is generally enough.

However, my main attention would be not focused just on final numbers, but rather on overall stability and progress in processing during test workloads. For ex. when you see stalls -- for many users it'll not be acceptable, and DBMS vendor need to address it, etc. Same about response times stability (and not just a final AVG or P95)..

And the main rule : "you can always generate more money, but you cannot generate more time.." -- so, don't forget to enjoy your life ;-))

Rgds,
-Dimitri
ReplyDelete
Replies
VincentFebruary 13, 2023 at 4:47 PM
I have always used the following approach to testing since my early IBM days:

* Repeat the same test multiple times, specially in cloud environments. I often choose to do 12 tests.
* Discards the best and worst outliers of the test.
* for all datapoints collected, summarize and get the median, average and the spread. This not only for the summary metrics like (total transactions, errors, etc..) but also for the metrics collected this way you can see the median behavior and how much deviation appears which is also important when measuring the microstalls due to compaction, etc...

Also I have historically ran the MySQL tests at 10 different levels of concurrency using sysbench: 1,2...1024 threads (or less depending on the machine). As someone mentioned before, you want to see the steady state -aproximativo, as db's in many scenarios do not have such state due to temporal load variations- so the idea is to run at each state for enough time that the steady state is achieve, which is often around 20-30 minutes (except when dealing with very large machines and datasets).

ReplyDelete
Replies

Add comment

Wednesday, February 1, 2023

Statistical rigor and database benchmarks

15 comments:

Postgres 18 beta2: large server, Insert Benchmark, part 2