Friday, February 14, 2020

Benchmarks vs real workloads

How do you test a DBMS that you are building, whether for internal use or for the world? This is a good question whether you are the upstream DBMS or maintaining a downstream fork. The choices include real workloads, synthetic benchmarks and throw it over the wall.

There is a great paper from Microsoft on doing live capture & replay at scale for real customers. I hope to add a link for a talk on the live capture & replay that was done for testing at Facebook.

Real workloads

Real workloads are great. Alas, real workloads are hard to get (code IP, data privacy). Even if those issues are resolved, a bigger problem is that real workloads are complex and buggy (all software is buggy). If you get a real workload without a team to support it then you will spend a lot of time debugging it and otherwise trying to figure it out. This is probably more true for OLTP than for analytics.

Real workloads v2

Capture & replay (C&R) is an easier way to get some of the goodness of a real workload with less effort. This avoids the overhead of dealing with the code from the real workload. This doesn't avoid issues with data privacy.

C&R can be online or offline. Online C&R does the replay online (immediately after the capture). Online C&R is a great choice when you have significant internal workloads that use the DBMS you are building. Running a DBMS in production is a great education for database developers in so many ways. This is one more perk.

Offline C&R archives the captured workload for replay at a later time. This is useful for workloads that can't be shadowed (customer workloads). This is also useful when you want a workload that can be repeated (replayed many times) as online C&R does not allow for that. An interesting problem for offline C&R is making the replay realistic -- respecting the timing and concurrency that was recorded. This is more challenging when the DBMS being tested is a lot faster than the DBMS used for the capture.
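The replay-timing problem can be sketched in a few lines. This is a hypothetical driver, not from the post: it assumes captured events are (offset_seconds, query) pairs relative to the start of the capture, and an execute callable that sends one query to the DBMS under test; speedup > 1 compresses the recorded timeline.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def replay(events, execute, speedup=1.0):
    """Replay captured (offset_seconds, query) events, preserving the
    recorded inter-arrival times. Offsets are relative to the start of
    the capture; speedup > 1 compresses the timeline."""
    start = time.monotonic()
    # A pool lets queries overlap the way they did during capture,
    # instead of serializing them behind one connection.
    with ThreadPoolExecutor(max_workers=32) as pool:
        for offset, query in events:
            # Sleep until this query's scheduled replay time.
            target = start + offset / speedup
            delay = target - time.monotonic()
            if delay > 0:
                time.sleep(delay)
            pool.submit(execute, query)
```

With speedup left at 1.0 the replay preserves the recorded concurrency, as long as the pool is large enough that no submission waits behind a slow predecessor.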

One question for C&R is the point at which the capture is done. When capture is done at the DBMS you don't miss anything, but sometimes it is easier to do the capture at the application tier.

The Facebook MySQL team had a great setup for online C&R via a shadow tier. I don't have much experience with offline C&R but I know it has been useful for others. In my Google MySQL years I built a tool to sample production queries (SELECT, not writes) and replay them against replicas running new and old binaries to compare performance and correctness. That was a fun and useful project for me. It was mostly online C&R. One interesting bug it found was a result of changing the internal datatype for aggregation between decimal and IEEE 754 floating point.
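The compare step of such a tool can be sketched as follows. This is an assumption-laden sketch, not the tool I built: run_old and run_new stand in for callables that send the sampled SELECT to replicas running the old and new binaries and return rows, and the slowdown threshold is arbitrary.

```python
import time

def compare(query, run_old, run_new, slow_ratio=1.5):
    """Run one sampled SELECT against both replicas, then report
    correctness (differing rows) and performance (large slowdowns)."""
    t0 = time.monotonic()
    rows_old = run_old(query)
    old_s = time.monotonic() - t0
    t0 = time.monotonic()
    rows_new = run_new(query)
    new_s = time.monotonic() - t0
    issues = []
    # Order-insensitive row comparison; a mismatch here is how a bug
    # like the decimal vs IEEE 754 aggregation change would surface.
    if sorted(map(tuple, rows_old)) != sorted(map(tuple, rows_new)):
        issues.append("result mismatch")
    # Ignore sub-millisecond timings, which are mostly noise.
    if new_s > slow_ratio * max(old_s, 1e-3):
        issues.append("regression: %.1fx slower" % (new_s / max(old_s, 1e-3)))
    return issues
```

Sampling only SELECTs keeps the replay side-effect free, which is what makes it safe to run the same query against two replicas.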

Synthetic benchmarks

Benchmarks have a bad reputation but I like them. I even enjoy benchmarketing but only when I am not involved. I rarely run benchmarks in isolation. Absolute values (QPS, TPS) don't help me. I am a big fan of relative comparisons (new binary vs old binary, this DBMS vs that DBMS).
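Normalizing against a base system makes the relative comparison explicit; a minimal helper (the names and shape are mine, not from the post):

```python
def relative(results, base):
    """Convert absolute throughput numbers (QPS or TPS) into ratios
    against a base system; values above 1.0 beat the base."""
    return {name: value / results[base] for name, value in results.items()}
```

For example, relative({"old": 1000, "new": 1200}, "old") reports the new binary at 1.2x the old one, which is the number worth discussing rather than either absolute value.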

Comparisons require sufficient expertise in all of the systems that are tested as the goal is to get results that can be explained and trusted. Expertise is hard to find. Therefore I have less faith in benchmark results that test more than one DBMS -- whether this is work done by myself or others. Perhaps one day someone will do a study and provide a function that estimates truthiness as a function of the number of DBMS tested.

Throw it over the wall

This has to be done at some point in time, otherwise you are building a DBMS that nobody uses. Hopefully the steps above have been done to increase the chance that there won't be too much drama at this point.


  1. Part of this problem was described in the article:

    1. Thank you for sharing an interesting post.

      Is the star schema benchmark still in use? I used it for a while at Oracle when I worked on bitmap indexes but I am curious how well it scales (behaves with larger scale factors). Searching now I found a paper from 2009 that I will read -

  2. Having studied and implemented TPC-C a few times, I am particularly a fan of well-defined benchmarks. There's so much more involved in running a benchmark; the workload definition is a small but critical part of it.

    I enjoyed reading your linked post on benchmarketing; it succinctly describes the enormous challenges of running, publishing and comparing benchmarks.

    1. I assume there is a lot of pressure when doing TPC-C given that so much HW is required and that HW isn't available indefinitely.

      And then there was this story. Is that another kind of Byzantine failure?

    2. Well, I never did get the opportunity to run TPC-C at a large enough scale to be able to publish the results. I have developed applications/workloads that emulate TPC-C as closely as possible; TPC-C.js is an example of that. For the next implementation I intend to use either Golang or Rust so that at least the client side of the infrastructure can be used efficiently, and so it can be cheap for the vendor doing the benchmark run.

    3. Your project is interesting. I have run tpcc-mysql a few times but my focus has been elsewhere. I am slowly figuring out whether to use Java or golang for my next benchmark client.