Saturday, December 21, 2019

What is the future of off-cpu analysis?

The answer is that I don't know. I am really asking about the tools I will use in 2020, and others will make different choices. I will have better answers in a few months. For many years people have been telling me there is something better than PMP. Perhaps that claim is finally true, but this post from Percona suggests that PMP might have a future.

I start with Brendan Gregg when I want to learn about modern performance debugging, and he has a page on off-cpu analysis. From there I learned that eBPF, perf and bcc are the future, and I hope that is true. For now I will summarize my use cases and potential solutions.

I have three use cases:
  1. Small server (< 10 cores) doing ~100k QPS on benchmarks
  2. Many-core server (<= 2 sockets, lots of cores/socket) doing x00k QPS on benchmarks.
  3. Servers in production
Stalls are more tolerable in the first two cases. Crashes and multi-second stalls in production are rarely acceptable, although a long stall or crash might be OK when a production server is already extremely unhappy. Sometimes I want my long-running benchmarks to collect some thread stacks for off-cpu and on-cpu analysis. This is more important for workloads that take days to set up. But I have yet to figure out how to exclude the impact of collecting those stacks from the throughput and response time metrics. This is more of an issue for off-cpu analysis than for on-cpu analysis with perf, as the impact from perf is smaller.

Some approaches, such as gdb and quickstack, have a per-sample overhead that is probably linear in the number of thread stacks. I assume the work done by gdb and quickstack to get stack traces is single-threaded, which contributes to longer stalls on busier servers.

Other approaches like eBPF have an overhead that is independent of the number of samples -- the overhead is gone once samples have been collected. The overhead is probably linear in the amount of activity on the server (linear in the number of CPU cores) and I hope that much of it is handled in parallel -- each CPU core has more work to do when scheduling threads.
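The two cost shapes described above can be written down as a toy model. This is only a sketch of the reasoning, not a measurement: the function names and every constant in the usage example are hypothetical, and both linear relationships are the guesses stated in the text rather than verified behavior of gdb, quickstack, or eBPF.

```python
def sampling_stall_seconds(num_threads, num_samples, per_thread_cost_s):
    """Stall time for a gdb/quickstack-style sampler.

    Assumes each sample walks every thread stack serially, so the cost
    grows with both the number of threads and the number of samples.
    """
    return num_threads * num_samples * per_thread_cost_s


def tracing_overhead_seconds(sched_events_per_s, duration_s, per_event_cost_s):
    """CPU time added by a tracing-style approach.

    Assumes a fixed cost per scheduler event while tracing is enabled,
    so the cost grows with server activity and tracing duration and
    does not depend on how many samples are wanted.
    """
    return sched_events_per_s * duration_s * per_event_cost_s


if __name__ == "__main__":
    # All numbers below are made up for illustration.
    print(sampling_stall_seconds(num_threads=1000, num_samples=10,
                                 per_thread_cost_s=0.001))
    print(tracing_overhead_seconds(sched_events_per_s=200_000, duration_s=30,
                                   per_event_cost_s=0.000001))
```

In this model the first cost is paid as stalls on the target process, while the second is spread across the CPU cores that are doing the scheduling, which is why the hope is that much of it is handled in parallel.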

The possible approaches are:
  • Runtime - Java makes this easy with jstack. I am not sure whether gperftools and libunwind make this easy for C and C++.
  • PMP/gdb - Percona provides PMP via pt-pmp. Percona has a great post on fast ways to get thread stacks and I look forward to evaluating their advice. This might make PMP useful for the first two use cases listed above.
  • PMP/quickstack - quickstack is much faster than the original PMP/gdb but was also less truthy. Regardless, it made PMP much better in production.
  • eBPF - I am waiting for the eBPF book to arrive. Until then I have the web.
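For readers who have not used PMP, the core idea is to dump all thread stacks, collapse each stack into a list of function names, and count how many threads share each stack. The following is a minimal sketch of that aggregation step in Python. It is not the pt-pmp implementation, and it assumes the input looks like typical `thread apply all bt` output from gdb; the `aggregate_stacks` helper and the `SAMPLE` text are made up for illustration.

```python
import re
from collections import Counter

# Matches gdb backtrace frames such as:
#   #0  0x00007f0000000001 in pthread_cond_wait () from /lib/libpthread.so.0
#   #1  do_work (arg=0x0) at worker.c:42
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([^\s(]+)")


def aggregate_stacks(gdb_output, max_frames=None):
    """Count identical call stacks in gdb-style backtrace text."""
    counts = Counter()
    frames = None  # stays None until the first "Thread" header is seen

    for line in gdb_output.splitlines():
        line = line.strip()
        if line.startswith("Thread "):
            # A new thread begins: record the previous thread's stack.
            if frames:
                counts[",".join(frames)] += 1
            frames = []
        elif frames is not None:
            match = FRAME_RE.match(line)
            if match and (max_frames is None or len(frames) < max_frames):
                frames.append(match.group(1))

    # Record the last thread in the dump.
    if frames:
        counts[",".join(frames)] += 1
    return counts


# Hypothetical backtrace text, hand-written for this example.
SAMPLE = """
Thread 3 (Thread 0x7f01 (LWP 103)):
#0  0x00007f0000000001 in pthread_cond_wait () from /lib/libpthread.so.0
#1  0x0000000000400a10 in wait_for_work () at worker.c:10
#2  0x0000000000400b20 in worker_main () at worker.c:50

Thread 2 (Thread 0x7f02 (LWP 102)):
#0  0x00007f0000000001 in pthread_cond_wait () from /lib/libpthread.so.0
#1  0x0000000000400a10 in wait_for_work () at worker.c:10
#2  0x0000000000400b20 in worker_main () at worker.c:50

Thread 1 (Thread 0x7f03 (LWP 101)):
#0  0x00007f0000000002 in fsync () from /lib/libc.so.6
#1  0x0000000000400c30 in flush_log () at log.c:77
"""

if __name__ == "__main__":
    # Print the most common stacks first, PMP-style.
    for stack, count in aggregate_stacks(SAMPLE).most_common():
        print(count, stack)
```

The stacks with the highest counts are where most threads are waiting, which is what makes this useful for off-cpu analysis. Truncating with `max_frames` groups stacks that differ only in their outer callers.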


  1. FYI, ClickHouse has a sampling (PMP-style) query profiler that is embedded in the server and runs continuously in production. All collected traces are saved in ClickHouse itself.

    1. I am happy to learn that and to see how fast ClickHouse is growing per db-engines --