We were testing the impact of changing from gcc 4.9 to 5.x and from an older to a newer jemalloc. For one of the in-memory sysbench tests the QPS at high concurrency dropped by 10%. The test was read-only with range-size set to 1000. When the test was limited to the order-ranges query the QPS dropped by ~25%. This wasn't good.
I repeated the test with newer jemalloc and gcc 4.9 and there was no loss of QPS. So now it looked like a problem with gcc 5.x and not jemalloc. I then did a build with tcmalloc, but at startup mysqld would get an illegal free error from tcmalloc. After finding what I think is a RocksDB bug and then another problem, I found a workaround and soon discovered that for tcmalloc and glibc malloc there was also a decrease in QPS for gcc 5.x but not 4.9. Now I was worried that I might lose the debugging expertise of the internal jemalloc team at work, but fortunately they found the problem while collaborating with the MyRocks team.
AFAIK this isn't a MyRocks-only problem because it comes from the allocation done to sort for the ORDER BY clause. But I am tired of running tests and won't test it for InnoDB. The good news for the rest of the world is that this is an issue for MySQL 5.6 but probably not for 5.7 and 8.x.
The problem is a gcc 5 optimization (see gcc issues 67618 and 83022) that transforms the call sequence malloc followed by memset into a call to calloc. This appears to be done even for 2MB allocations (my.cnf had sort_buffer_size=2m). The output from perf wasn't clear about the problem this creates. For jemalloc it reported a new CPU overhead from smp_call_function_interrupt calling flush_tlb_func, and all of that is kernel code. I was told that it came from jemalloc zeroing pages. A workaround that doesn't require a code change is to compile with -fno-builtin-malloc. There are workarounds that require code changes that I won't list here.
Here are performance results (QPS) from mysqld compiled with gcc 4.9 vs 5.x and linked with different allocators (jemalloc, tcmalloc, glibc malloc). The test is in-memory sysbench read-only with range-size=1000, 8 tables and 1M rows/table. The test uses 48 concurrent connections and the server has 48 HW threads (24 cores, HT enabled).
78622 gcc4.9, jemalloc
76787 gcc4.9, tcmalloc
73340 gcc4.9, glibc malloc
65673 gcc5.x, jemalloc
58958 gcc5.x, tcmalloc
48750 gcc5.x, glibc malloc
78028 gcc5.x, jemalloc, -fno-builtin-malloc
78135 gcc5.x, tcmalloc, -fno-builtin-malloc
78207 gcc5.x, glibc malloc, -fno-builtin-malloc