Comments on Small Datum: Statistical rigor and database benchmarks

Vincent (https://www.linkedin.com/in/vincent-w/), 2023-02-13:

I have always used the following approach to testing since my early IBM days:

* Repeat the same test multiple times, especially in cloud environments. I often choose to do 12 tests.
* Discard the best and worst outliers.
* For all datapoints collected, compute the median, the average, and the spread. Do this not only for the summary metrics (total transactions, errors, etc.) but also for the per-interval metrics; this way you can see the median behavior and how much deviation appears, which also matters when measuring the micro-stalls due to compaction, etc.

I have also historically run the MySQL tests at 10 different levels of concurrency using sysbench: 1, 2, ... 1024 threads (or fewer, depending on the machine). As someone mentioned before, you want to see the steady state -- approximately, since in many scenarios databases do not have such a state due to temporal load variations -- so the idea is to run at each state for enough time that the steady state is achieved, which is often around 20-30 minutes (except when dealing with very large machines and datasets).

Dimitri (http://dimitrik.free.fr/blog), 2023-02-03:

Plot not only throughput (QPS/TPS) but also min/max/avg/P95 times for every 1 sec., to observe whether your processing is stable or you have "waves", stalls, and so on -- e.g.
not just final numbers; then everything is easy to see and point to ;-))

Many users have different SLA requirements: some will never accept stalls, some have a max-latency "stability barrier", and so on. And many benchmark workloads are not processing under the same conditions over time as at the beginning (data grows, the B-tree gets larger, more background work is involved, etc.) -- so only by having the whole picture can you confirm (at least for yourself) that on the given workload you did not miss any problem. Always better to discover this in tests than in production ;-))

Happy to hear all is good on your side! All my best for your future testing!

Rgds,
-Dimitri

Mark Callaghan (https://www.blogger.com/profile/09590445221922043181), 2023-02-03:

Plotting throughput over time is something I don't do as much as I should. It shows things that aren't easy to see via percentiles and histograms. A common problem that throughput graphs highlight is regular stalls from checkpoint or compaction.

WRT enjoying life -- I have my odd hobbies, and spring is almost here, which means I spend much time outside.

Dimitri (http://dimitrik.free.fr), 2023-02-03:

If your systems are stable, one iteration is generally enough.

However, my main attention would be focused not just on final numbers but rather on overall stability and progress in processing during test workloads. For example, when you see stalls -- for many users that will not be acceptable, and the DBMS vendor needs to address it, etc.
The same goes for response-time stability (and not just a final AVG or P95).

And the main rule: "you can always generate more money, but you cannot generate more time" -- so don't forget to enjoy your life ;-))

Rgds,
-Dimitri

Mark Callaghan, 2023-02-02:

I am in the camp that prefers to run benchmarks for longer periods of time, but quantifying whether my rule of thumb is good advice will be expensive. The goal is to make sure the indexes (b-tree fragmentation and the LSM equivalent) are in a steady state.

I see too many benchmarks that load in PK order, then create secondary indexes, then immediately measure perf. At that point those indexes are in great shape (unfragmented), which both helps and hurts perf. It hurts because they are (likely) more prone to time-consuming index splits. It helps because they (likely) have minimal space amplification and are most dense.

Mark Callaghan, 2023-02-02:

Thanks for the reply. Books with titles like "Math for Economists" have been great for refreshing my skills, but I have yet to open the econometrics texts as those are a harder read. When I return to my math texts I will start with probability, stats, and linear algebra.

WRT variance after S, that is a big issue for read-only benchmarks with an LSM, because the LSM tree shape can vary between runs once the workload becomes read-only, which has a big impact on CPU overhead per query.
Issues include the number of keys in the memtable and the number of SSTs in the smaller levels of the LSM tree.

Mark Callaghan, 2023-02-02:

We have similar hobbies. I have more than a few math books, including a few on linear & nonlinear programming and the titles you mention above, that I will eventually spend time on. I had downtime several years ago and enjoyed a few months with them. I look forward to repeating that.

Nima (https://specbranch.com), 2023-02-02:

The answer I have for you here is "it depends a lot on your benchmark and it's subtle." I also don't know if there are good books on this. I learned about the useful statistics for computer benchmarks from studying econometrics, but I don't have a textbook to recommend.

Your benchmark is not stationary (due to LSM compactions and GC), but there may be a steady-state point after which you can assume it is roughly stationary. Call the time to reach that steady state S, and assume this "spin-up" time is "dirty" (since DBs tend to run for a long time, the production numbers you care about are probably only from some time after S). Then running one trial will get you much better data, and you can get accurate descriptive statistics from that one run. Running N trials may actually reduce the fidelity of your benchmark, since you incur N * S spin-up time instead of S, and variance in the spin-up period can hurt.
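That single-trial approach -- discard the "dirty" spin-up window, then take descriptive statistics over the roughly stationary remainder -- might be sketched like this. This is a minimal sketch, not anyone's actual harness: `summarize` is a hypothetical helper over per-second throughput samples, and the cutoff `spinup` stands in for the workload-specific steady-state point S.

```python
import statistics

def summarize(samples, spinup):
    """Drop the 'dirty' spin-up window (the first `spinup` per-second
    samples), then report descriptive statistics over the roughly
    stationary remainder. `spinup` is the assumed steady-state point S."""
    steady = sorted(samples[spinup:])
    # nearest-rank style p95 over the steady-state samples
    p95 = steady[int(0.95 * (len(steady) - 1))]
    return {
        "mean": statistics.mean(steady),
        "median": statistics.median(steady),
        "stdev": statistics.stdev(steady),
        "p95": p95,
    }
```

The same summary can be taken per metric (QPS, P95 latency, stall counts), which is where the per-interval collection discussed earlier pays off.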
However, if your database does not spin up in the same way every time, running only one benchmark will get you bad numbers.

Without knowing more, my inclination is to suggest either 1 trial, if you are confident the benchmark will be repeatable (because you know a lot about the database), or N <= 5 trials (to validate the meta-assumptions involved in only running one trial), taking the results from the median trial, depending on how paranoid you are about the repeatability of the benchmark.

VK (https://www.blogger.com/profile/14048725351852117724), 2023-02-02:

https://parsec.cs.princeton.edu/publications/chen14ieeetc.pdf or https://aakinshin.net/posts/statistics-for-performance/ :)

Mikael Ronstrom (https://www.blogger.com/profile/07134215866292829917), 2023-02-02:

My library is full of books on mathematical statistics, but these are too theoretical for you, I think. I suggest books on performance analysis of computer systems that also go through basic statistics. I have two such books in my library: The Art of Computer Systems Performance Analysis by Raj Jain and Performance Modeling and Design of Computer Systems by Mor Harchol-Balter. Personally, I like mathematics as a hobby as well, so I got a book on nonlinear programming as a birthday present (I don't expect it to benefit my work on DBMSs, but one never knows :)).
Good luck with the reading.

Mikael Ronstrom, 2023-02-02:

Actually, when testing a DBMS it isn't proper to think of experiments as independent of each other, as the above does. The reason is that a DBMS runs checkpoints regularly, which affects performance. My experience is that running several experiments versus just longer experiments doesn't matter so much. Given the workings of a DBMS, I would more or less say that each period of a running benchmark is a separate experiment. These experiments are not completely independent, so the statistics aren't perfect, but running longer at least makes the variance go down, although not as fast as the square root of the number of periods.

Mark Callaghan, 2023-02-01:

Forgot to mention that I am looking for paper/book recommendations on this topic.

Mark Callaghan, 2023-02-01:

We are discussing the thing I have a question about.
Using the equations above, if I do N=4 rather than N=1 experiments, then the run time for each experiment must be reduced because the total budget (B, the HW time) is fixed.

Mark E. Dawson, Jr. (https://www.jabperf.com), 2023-02-01:

Of course, that's assuming his benchmark is capturing benchmark *throughput*, for which working under Gaussian assumptions is acceptable. But since he mentioned "response time" in the article, which is decidedly not Gaussian in nature, he would want to avoid mean/stddev/stderr in his benchmark reporting.

Mikael Ronstrom, 2023-02-01:

Generally, the standard deviation of the mean goes down with the square root of N, thus 4 experiments halve it. The 95% confidence interval is that the true mean will be between X - 2*std_dev and X + 2*std_dev; replacing 2 with 3 gives about a 99% confidence interval. This assumes a Gaussian distribution, which is a good quick estimate.
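That sqrt(N) rule of thumb can be made concrete. A minimal sketch, assuming (per the Gaussian caveat above) that the per-trial means are roughly normal; `confidence_interval` is a hypothetical helper, not from any of the tools discussed here:

```python
import math
import statistics

def confidence_interval(trial_means, z=2.0):
    """Approximate CI for the true mean under a Gaussian assumption.
    The std dev of the mean shrinks as 1/sqrt(N), so quadrupling the
    number of trials halves the interval width. z=2 gives ~95%, z=3 ~99%."""
    n = len(trial_means)
    mean = statistics.mean(trial_means)
    stderr = statistics.stdev(trial_means) / math.sqrt(n)  # std dev of the mean
    return mean - z * stderr, mean + z * stderr
```

For small N (the 4 or 5 trials discussed above), a Student's t multiplier would be more honest than a fixed z, but as noted this is meant as a good quick estimate, and only for metrics like throughput where the Gaussian assumption is tolerable.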