Small Datum: Code bloat vs memory system bloat : why is something getting slower

Wednesday, October 16, 2024

Code bloat vs memory system bloat : why is something getting slower

As I document performance regressions over time in MySQL it helps to assign names for common problems that I see. While there are many problems in general including mutex contention and poor usage of IO, my current focus is on the following:

code bloat - the system uses more instructions per unit of work
memory system bloat - instructions execute more slowly because there is more TLB and cache activity, so IPC decreases and CPI increases

Fortunately, I can use perf stat to measure all of these and what I see for MySQL 5.6 to 8.0 is that both code bloat and memory system bloat are to blame.
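To make the two definitions concrete, here is a minimal sketch of how they could be computed from hardware counters for a base version and a newer version. The Counters class, the function names and the numbers are all hypothetical and are not taken from my results. Attributing every change in CPI to the memory system is also a simplification.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Hypothetical totals for one benchmark run of one DBMS version."""
    instructions: int  # retired instructions (perf event: instructions)
    cycles: int        # CPU cycles (perf event: cycles)
    queries: int       # units of work completed during the run


def instructions_per_query(c: Counters) -> float:
    # Code bloat shows up here: more instructions per unit of work.
    return c.instructions / c.queries


def cpi(c: Counters) -> float:
    # Cycles per instruction; goes up when instructions stall more often.
    return c.cycles / c.instructions


def ipc(c: Counters) -> float:
    # Instructions per cycle; the inverse of CPI.
    return c.instructions / c.cycles


def relative_bloat(base: Counters, new: Counters) -> dict:
    # Values above 1.0 mean the new version is worse than the base version.
    return {
        "code_bloat": instructions_per_query(new) / instructions_per_query(base),
        "memory_system_bloat": cpi(new) / cpi(base),
    }


# Made-up numbers, chosen only to make the arithmetic easy to follow.
base = Counters(instructions=1_000_000, cycles=500_000, queries=100)
new = Counters(instructions=1_500_000, cycles=900_000, queries=100)
result = relative_bloat(base, new)
print(result)
```

With these made-up inputs the new version uses about 1.5X more instructions per query and its CPI is about 1.2X larger, so both kinds of bloat would contribute.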

An example

I use the results from a recent run of CPU-bound sysbench on an AMD 7735HS CPU. I focus on the scan microbenchmark for InnoDB. In all cases the numbers are relative to the result from InnoDB with MySQL 5.6.51. From the table of results I see:

the relative QPS (rQPS) for InnoDB in MySQL 8.0.39 is 0.75. This means that 8.0.39 gets 75% of the throughput relative to 5.6.51 (or 5.6.51 runs ~1.33X faster than 8.0.39).
the relative CPU overhead (cpu/o) for 8.0.39 is 1.36. For CPU-bound workloads I expect the relative CPU overhead to be (approximately) the inverse of the relative QPS -- unless there is too much mutex contention. And that is true here as 1/1.36 ~= 0.74.
the relative number of instructions per unit of work for 8.0.39 is 1.44. If code bloat weren't partially to blame for the perf regressions then the value for this would be ~1.0. But 1/1.44 ~= 0.69 and it is possible that most of the regression is from code bloat.
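The arithmetic behind those three points can be sketched in a few lines. The three inputs are the values quoted above. The function and variable names are mine and are only for illustration.

```python
def implied_relative_qps(relative_overhead: float) -> float:
    # For a CPU-bound workload, relative QPS is expected to be
    # approximately the inverse of the relative overhead per unit of work.
    return 1.0 / relative_overhead


# Values for InnoDB in MySQL 8.0.39 relative to 5.6.51, from the text above.
rqps = 0.75      # relative QPS
rel_cpu = 1.36   # relative CPU overhead (cpu/o)
rel_insn = 1.44  # relative instructions per unit of work

speedup_of_base = 1.0 / rqps                     # how much faster 5.6.51 is
qps_from_cpu = implied_relative_qps(rel_cpu)     # rQPS implied by cpu/o
qps_from_insn = implied_relative_qps(rel_insn)   # rQPS implied by instructions

print(f"5.6.51 is ~{speedup_of_base:.2f}X faster than 8.0.39")
print(f"rQPS implied by CPU overhead: ~{qps_from_cpu:.2f}")
print(f"rQPS implied by instruction overhead: ~{qps_from_insn:.2f}")
```

Both implied values (~0.74 and ~0.69) are close to the measured rQPS of 0.75, which is why I think most of the regression might be from code bloat.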

Disclaimer - using CPU and instruction overhead as I do here has some risks.

For CPU overhead I measure all CPU usage on a server. That includes my benchmark client, other things on the server and the DBMS process(es). Fortunately there aren't many other things on the server and the overhead from the benchmark client should be somewhat constant across DBMS versions. However the CPU used by the DBMS process(es) includes things that matter more to performance (CPU used processing a request) and things that matter less (CPU used by background tasks).

For instruction overhead I use perf stat on the DBMS process. For Postgres, with a process per client, this is the process for one of the connections used by my benchmark. For MySQL, with a multithreaded server, this is the process and includes all of the threads (foreground and background). So this still captures some overheads that matter more (foreground) and some that matter less (background).
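As a rough sketch of how per-process counters might be collected and read, one option is to run something like `perf stat -x, -e instructions,cycles -p <pid> -- sleep 10` and parse the CSV that perf writes. This is not the script I use. The parser below assumes the field order described in the CSV FORMAT section of the perf-stat man page (counter value, unit, event name, then run time and other fields) and the sample text is made up.

```python
import csv
import io


def parse_perf_stat_csv(text: str) -> dict:
    """Return {event_name: count} from `perf stat -x,` style output.

    Assumes each data line starts with: value, unit, event name.
    """
    counts = {}
    for row in csv.reader(io.StringIO(text)):
        # Skip blank lines, comment lines and anything too short to be data.
        if len(row) < 3 or row[0].startswith("#"):
            continue
        value, event = row[0], row[2]
        # perf prints markers such as "<not counted>" instead of a number
        # when an event could not be measured; skip those lines.
        if not value.isdigit():
            continue
        counts[event] = int(value)
    return counts


# Made-up sample in the assumed layout, not real measurements.
SAMPLE = """\
1440000000,,instructions,10000000000,100.00,0.72,insn per cycle
2000000000,,cycles,10000000000,100.00,,
"""

counts = parse_perf_stat_csv(SAMPLE)
print(counts["instructions"] / counts["cycles"])  # IPC for the sample
```

Dividing the instruction count by the number of queries the benchmark completed in the same interval gives instructions per unit of work. For MySQL the counts still mix foreground and background threads, as noted above.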

Human nature

Code bloat is hard to avoid in long-lived software projects. Some of it is human nature. People new to the project want to (or must) show impact and new features are viewed as more impactful than improving existing code. I assume that company culture is to blame in many cases, but this happens outside of companies as well.

Features are added to grow the market, but eventually you reach the point where the new features grow the market by a small amount with a much larger cost in complexity and performance.

Code bloat often leads to memory system bloat. All of that new code means the instruction working set is spread out. So now we have clever link-time and post-link optimizations (LTO, BOLT) that try to better organize the hot paths to undo some of the damage. We also have an option to use huge pages for text and reduce iTLB activity. These help but are not solutions and have a large cost in manageability.

And on the data path we also have the option to use huge pages for some cases such as large allocations done for buffer pools (see here and here). This can help but is not a solution and has a large cost in manageability.

I am happy to link to content that describes these problems in more detail.
