I start with Brendan Gregg when I want to learn about modern performance debugging, and he has a page on off-CPU analysis. From there I learned that eBPF, perf and bcc are the future, and I hope that is true. For now I will summarize my use cases and potential solutions.
I have three use cases:
- Small server (< 10 cores) doing ~100k QPS on benchmarks
- Many-core server (<= 2 sockets, lots of cores/socket) doing x00k QPS on benchmarks.
- Servers in production
Stalls are more tolerable in the first two cases. Crashes and multi-second stalls in production are rarely acceptable, although when a production server is extremely unhappy a long stall or crash might be OK. Sometimes I want my long-running benchmarks to collect some thread stacks for off-CPU and on-CPU analysis. This is more important for workloads that take days to set up. But I have yet to figure out how to exclude the impact of that sampling when collecting throughput and response time metrics. This is more of an issue for off-CPU analysis than for on-CPU analysis with perf, because the impact from perf is smaller.
Some approaches, gdb and quickstack, have a per-sample overhead which is probably linear in the number of thread stacks. I assume the work done by gdb and quickstack to get stack traces is single-threaded, which contributes to longer stalls on busier servers.
Other approaches like eBPF have an overhead that is independent of the number of samples -- the overhead is gone once samples have been collected. The overhead is probably linear in the amount of activity on the server (linear in the number of CPU cores) and I hope that much of it is handled in parallel -- each CPU core has more work to do when scheduling threads.
The possible approaches are:
- Runtime - Java makes this easy with jstack. I am not sure whether gperftools and libunwind make this easy for C and C++.
- PMP/gdb - Percona provides PMP via pt-pmp. Percona has a great post on fast ways to get thread stacks and I look forward to evaluating their advice. This might make PMP useful for the first two use cases listed above.
- PMP/quickstack - quickstack is much faster than the original PMP/gdb but was also less truthy. Regardless, it made PMP much better in production.
- eBPF - I am waiting for the eBPF book to arrive. Until then I have the web.
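For readers who have not seen PMP, the aggregation step it relies on can be sketched in a few lines of Python. This is a minimal sketch and not pt-pmp itself: it assumes the input is text in the shape gdb prints for `thread apply all bt`, and the function name `collapse_stacks`, the regex and the `SAMPLE` text are all made up for illustration. Each thread's frames are collapsed into one comma-separated string and identical stacks are counted, so the most common wait or hot path floats to the top.

```python
import re
from collections import Counter

# Matches a gdb-style frame line, for example
#   "#1  0x000055aa in worker_wait (q=0x1) at worker.c:42"
#   "#1  main () at main.c:12"
# and captures the function name. This pattern is an assumption that covers
# the simple sample below, not every frame format gdb can print.
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)\s*\(")


def collapse_stacks(gdb_output):
    """Count identical stacks in `thread apply all bt` style text."""
    counts = Counter()
    frames = []
    for line in gdb_output.splitlines():
        line = line.strip()
        if line.startswith("Thread "):
            # A new thread header ends the previous thread's stack.
            if frames:
                counts[",".join(frames)] += 1
            frames = []
        else:
            match = FRAME_RE.match(line)
            if match:
                frames.append(match.group(1))
    if frames:
        counts[",".join(frames)] += 1
    return counts


# Hypothetical sample, hand-written for this sketch (not real server output).
SAMPLE = """
Thread 3 (Thread 0x7f01 (LWP 103)):
#0  0x00007f0a in pthread_cond_wait () from /lib/libpthread.so.0
#1  0x000055aa in worker_wait (q=0x1) at worker.c:42
#2  0x000055bb in worker_main (arg=0x0) at worker.c:90

Thread 2 (Thread 0x7f02 (LWP 102)):
#0  0x00007f0a in pthread_cond_wait () from /lib/libpthread.so.0
#1  0x000055aa in worker_wait (q=0x1) at worker.c:42
#2  0x000055bb in worker_main (arg=0x0) at worker.c:90

Thread 1 (Thread 0x7f03 (LWP 101)):
#0  0x00007f0b in poll () from /lib/libc.so.6
#1  main () at main.c:12
"""

if __name__ == "__main__":
    # Print the most frequent stacks first, one per line.
    for stack, n in collapse_stacks(SAMPLE).most_common():
        print(n, stack)
```

The sketch covers only aggregation. Collecting the stacks is where the per-sample stall comes from, and that part depends on which of the tools above is used.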
FYI ClickHouse has a sampling (PMP-style) query profiler that is embedded in the server and runs continuously in production. All collected traces are saved in ClickHouse itself.
https://clickhouse.yandex/docs/en/operations/settings/settings/#query_profiler_real_time_period_ns
I am happy to learn that and to see how fast ClickHouse is growing per db-engines -- https://db-engines.com/en/ranking