Wednesday, October 19, 2016

Make MyRocks 2X less slow

Fixing mutex contention has been good for my career. I had the workload, an RDBMS that needed a few improvements and support from a great team. Usually someone else found the bugs and I got to fix many of them. Sometimes I got too much credit because a good bug report is as valuable as the bug fix. These days I don't see many mutex contention bugs but I have begun to see more bugs from memory contention. My perf debugging skills need refreshing. They are far from modern. Thankfully we have Brendan Gregg.

For someone who debugs performance, shared_ptr is a gift. Mistakenly passing shared_ptr by value means the reference count will be changed too much and that is not good on a concurrent workload. I have encountered that at least twice in RocksDB and MyRocks. I even encountered it in MongoDB with SERVER-13382.

I have twice made MyRocks 2X less slow. First with issue 231 peak compaction throughput was doubled and now with issue 343 we almost double range-scan throughput (for long range scans with many concurrent queries). Someone important recently reported a disappointing performance result when comparing MyRocks with InnoDB. After a few days with sysbench I was able to reproduce it. This should be easy to fix.

Not mutex contention

In this bug, with sysbench read-only and read-write the peak QPS for MyRocks saturated long before InnoDB. While MyRocks and InnoDB had similar QPS at low concurrency, the QPS at high concurrency was almost 2X better for InnoDB. This was only an issue for longer range scans (try --oltp-range-size=10000) and the default was a shorter range scan (--oltp-range-size=100). My first guess was mutex contention. There was an odd pattern in vmstat where the context switch rate alternated every second for MyRocks but was steady for InnoDB. Spikes in context switch rate sometimes mean mutex contention but I did not see that with PMP. What next?

The next guess is memory system contention but my debugging skills for that problem are weak. I have told myself many times this year that I need to refresh my skills. So I started with this blog post from Brendan Gregg and tried perf stat and found that InnoDB completed almost twice the number of instructions compared to MyRocks in the same time period. Why is IPC almost 2X better for InnoDB? Results from perf are here.

I then tried a different perf stat command to get other hardware perf counters and results are here. This also shows that InnoDB completed twice the number of instructions while both have a similar value for bus-cycles, so MyRocks uses 2X the number of bus-cycles per instruction. That can explain why it is slower. What are bus-cycles? Most of the documentation only explained this is as [Hardware event] and without more details I can't look that up in Intel manuals. I asked internally and learned the magic code, 0x013c, that leads to more information. Start with this article (and consider subscribing to LWN, I do).

The next step was to get call graphs when bus-cycles was incremented. I used the command below to find the offending code. Disabling that code fixes the problem, but work remains to make that code performant. InnoDB and MyRocks have similar code to count rows read and InnoDB doesn't fall over because of it. I want to make MyRocks not fall over because of it.
perf record -ag -p $( pidof mysqld ) -e bus-cycles -- sleep 10

Useful commands

I used all of these commands today:
perf stat -a sleep 10
perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles -a sleep 10
perf stat -e 'syscalls:sys_enter_*' -a sleep 10
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores -a sleep 10
perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses -a sleep 10
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches -a sleep 10
perf top -e raw_syscalls:sys_enter -ns comm
perf list --help
perf record -ag -p $( pidof mysqld ) -e bus-cycles -- sleep 10

4 comments:

  1. I am still chasing down the bug so my mind & time are fully occupied by that and I haven't read through all of the links that I shared. I assume it is the number of cycles during which the memory bus (or interconnect given modern CPU) is in use. Not sure if that means that the CPU can still be doing something or if it implies that the CPU is blocked waiting for a memory operation to complete.

    I wish that one day we can get better docs for Linux performance internals - so many things to learn the hard way.

    ReplyDelete
  2. I read Brendangregg blog ,talk about perf,but not mysql part.Thank you share

    ReplyDelete
    Replies
    1. Brendan is awesome and I hope to learn a lot from him

      Delete

RocksDB on a big server: LRU vs hyperclock

This has benchmark results for RocksDB using a big (48-core) server. I ran tests to document the impact of the the block cache type (LRU vs ...