Tuesday, May 10, 2022

Performance bugs: p50 vs p99

One way to classify performance bugs is p50 vs p99. 

  • A p50 performance bug occurs for all workloads
  • A p99 performance bug only happens for some workloads or in some contexts. By some workloads I mean it might only occur during hot backup or load. By some contexts I mean it might only happen under high concurrency or load spikes which don't represent the steady state.
The impact from fixing a p50 perf bug is easy to predict. Everything gets slightly faster and/or more efficient. But p99 perf bugs are outliers -- they don't make everything slower so fixing them doesn't make everything faster.

The impact from fixing a p99 bug is less obvious. On average there might be little impact with respect to average throughput and efficiency. There can be a huge impact with respect to response time goals. The challenge is to quantify the benefit and I have learned over the years that motivating the fix for p99 perf bugs is a hard problem. For MySQL I wrote many blog posts to market open bug reports.

One aspect is that there are easy workarounds for p50 perf bugs. If a bug makes something 10% less efficient then buy 10% more hardware. There might be no workarounds for a p99 perf bug. If something causes a long stall once/day or ruins performance for one infrequent part of your workload then the choices are to either tolerate it (someone suffers) or replace the system.

I spent a few years iterating on p99 bugs in MySQL back in my web-scale MySQL+InnoDB days. It was interesting work and I benefited from production which was a great source of p99 bugs and a colleague named Domas who was great at documenting such bugs.

No comments:

Post a Comment

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...