Monday, November 27, 2023

Explaining changes in MySQL performance via hardware perf counters: part 3

This is part 3 of my series on using HW counters from Linux perf to explain why MySQL gets slower from 5.6 to 8.0. Refer to part 1 for an overview.

What happened in MySQL 5.7 and 8.0 to put so much more stress on the memory system?

tl;dr

  • It looks like someone sprinkled magic go-slower dust across most of the MySQL code, because the slowdown from MySQL 5.6 to 8.0 is not isolated to a few call stacks.
  • MySQL 8.0 uses ~1.5X more instructions/operation than 5.6. Cache activity (references, loads, misses) is frequently up 1.5X or more. TLB activity (loads, misses) is frequently up 2X to 5X, with the iTLB being a bigger problem than the dTLB.
  • innodb_log_writer_threads=ON is worse than I thought. I will soon have a post on that.

Too long to be a tl;dr
  • I don't have much experience using Linux perf counters to explain performance changes.
  • There are huge increases for data TLB, instruction TLB and L1 cache counters. For the Xeon CPU (socket2 below), the changes from MySQL 5.6.21 to 8.0.34, measured as events/query, are:
    • branches, branch misses: up ~1.6X, ~1.5X
    • cache references: up ~3.7X
    • instructions: up ~1.5X
    • dTLB loads, load-misses, stores: up ~1.6X, ~2.5X, ~1.7X
    • iTLB loads, load-misses: up ~2.0X, ~2.8X
    • L1 data cache loads, load-misses, stores: up ~1.6X, ~1.5X, ~1.7X
    • L1 instruction cache load-misses: up ~1.9X
    • LLC loads, stores, store-misses: up ~2.2X, ~2.7X, ~1.3X
    • Context switches are flat, CPU migrations down
  • For many of the HW counters the biggest jumps occur between the last point release in 5.6 and the first in 5.7 and then again between the last in 5.7 and the first in 8.0. Perhaps this is good news because it means the problems are not spread across every point release.

The posts in this series are: 

  • part 1 - introduction and results for the l.i0 (initial load) benchmark step
  • part 2 - results for the l.i1 (write-only) benchmark step
  • part 3 - (this post) results for the q100 (read+write) benchmark step
  • part 4 - results for the q1000 (read+write) benchmark step

Performance

The charts below show average throughput (QPS, or really operations/s) for the q100 benchmark step.

  • The benchmark uses 1 client for the small servers (beelink, ser7) and 12 clients for the big server (socket2). 
  • MySQL 8.0 is much slower than 5.6 on the small servers (beelink, ser7) and slightly slower on the big server (socket2).
Results

Refer to the Results section in part 1 to understand what is displayed on the charts. The y-axis frequently does not start at zero to improve readability. But this also makes it harder to compare adjacent graphs.

Below, when I write up 30% that is the same as up 1.3X. I switch from the percentage form to the multiplier form when the increase is larger than 99%.
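The two notations describe the same ratio. A tiny illustration of the convention (my code, not from the post):

```python
# Convert a before/after pair of counter values into the notation used
# in this post: "up N%" for increases of 99% or less, "up N.NX" beyond that.
def describe_change(before: float, after: float) -> str:
    ratio = after / before
    if ratio >= 2.0:  # increase larger than 99% -> multiplier form
        return f"up {ratio:.1f}X"
    if ratio >= 1.0:
        return f"up {round((ratio - 1) * 100)}%"
    return f"down {round((1 - ratio) * 100)}%"

print(describe_change(100, 130))  # prints "up 30%", the same ratio as 1.3X
print(describe_change(100, 250))  # prints "up 2.5X"
print(describe_change(100, 95))   # prints "down 5%"
```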

Spreadsheets are here for beelink, ser7 and socket2. See the Servers section above to understand the HW for beelink, ser7 and socket2.

Results: branches and branch misses

Summary:

  • the Results section above explains the y-axis
  • on beelink
    • from 5.6.51 to 5.7.10 branches up 29%, branch-misses up 55%
    • from 5.7.43 to 8.0.34 branches up 17%, branch-misses up 67%
  • on ser7
    • from 5.6.51 to 5.7.10 branches up 39%, branch-misses up 52%
    • from 5.7.43 to 8.0.34 branches up 18%, branch-misses up 56%
  • on socket2
    • from 5.6.51 to 5.7.10 branches up 19%, branch-misses up 20%
    • from 5.7.43 to 8.0.34 branches up 17%, branch-misses up 22%
Results: cache references and misses

Summary:

  • the Results section above explains the y-axis
  • on beelink
    • from 5.6.51 to 5.7.10 references up 38%, misses up 38%
    • from 5.7.43 to 8.0.34 references up 32%, misses up 36%
  • on ser7
    • from 5.6.51 to 5.7.10 references up 57%, misses up 2.5X
    • from 5.7.43 to 8.0.34 references up 45%, misses up 2.0X
  • on socket2
    • from 5.6.51 to 5.7.10 references up 69%, misses up 1%
    • from 5.7.43 to 8.0.34 references up 79%, misses down 5%

Results: cycles, instructions, CPI

Summary:

  • the Results section above explains the y-axis
  • on beelink
    • from 5.6.51 to 5.7.10 cycles up 11%, instructions are flat, cpi up 12%
    • from 5.7.43 to 8.0.34 cycles up 42%, instructions up 18%, cpi up 18%
  • on ser7
    • from 5.6.51 to 5.7.10 cycles up 66%, instructions up 52%, cpi up 9%
    • from 5.7.43 to 8.0.34 cycles up 36%, instructions up 12%, cpi up 22%
  • on socket2
    • from 5.6.51 to 5.7.10 cycles up 28%, instructions up 21%, cpi up 5%
    • from 5.7.43 to 8.0.34 cycles up 36%, instructions up 16%, cpi up 17%
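CPI is just cycles divided by instructions, so the CPI change follows from the other two numbers. A quick check against the socket2 row above (5.7.43 to 8.0.34: cycles up 36%, instructions up 16%):

```python
# CPI = cycles / instructions, so the relative change in CPI is the
# ratio of the two relative changes. Using the socket2 numbers above
# for 5.7.43 -> 8.0.34.
cycles_ratio = 1.36        # cycles up 36%
instructions_ratio = 1.16  # instructions up 16%
cpi_ratio = cycles_ratio / instructions_ratio
print(f"cpi up {round((cpi_ratio - 1) * 100)}%")  # prints "cpi up 17%"
```

This matches the reported "cpi up 17%", which is a useful sanity check that the three counters were measured over the same interval.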
Results: dTLB

Summary:

  • the Results section above explains the y-axis
  • on beelink
    • from 5.6.51 to 5.7.10 dTLB-loads up 49%, dTLB-load-misses up 43%
    • from 5.7.43 to 8.0.34 dTLB-loads up 39%, dTLB-load-misses up 53%
  • on ser7
    • from 5.6.51 to 5.7.10 dTLB-loads up 2X, dTLB-load-misses up 56%
    • from 5.7.43 to 8.0.34 dTLB-loads up 44%, dTLB-load-misses up 23%
  • on socket2
    • loads
      • from 5.6.51 to 5.7.10 dTLB-loads up 22%, dTLB-load-misses up 18%
      • from 5.7.43 to 8.0.34 dTLB-loads up 14%, dTLB-load-misses up 76%
    • stores
      • from 5.6.51 to 5.7.10 dTLB-stores up 27%, dTLB-store-misses down 61%
      • from 5.7.43 to 8.0.34 dTLB-stores up 20%, dTLB-store-misses up 2.4X
Results: iTLB

Summary:

  • the Results section above explains the y-axis
  • on beelink
    • from 5.6.51 to 5.7.10 iTLB-loads up 48%, iTLB-load-misses up 3.1X
    • from 5.7.43 to 8.0.34 iTLB-loads up 37%, iTLB-load-misses up 2.6X
  • on ser7
    • from 5.6.51 to 5.7.10 iTLB-loads up 81%, iTLB-load-misses up 95%
    • from 5.7.43 to 8.0.34 iTLB-loads up 30%, iTLB-load-misses up 2.8X
  • on socket2
    • from 5.6.51 to 5.7.10 iTLB-loads up 39%, iTLB-load-misses up 33%
    • from 5.7.43 to 8.0.34 iTLB-loads up 25%, iTLB-load-misses up 2.1X
Results: L1 cache

Summary:

  • the Results section above explains the y-axis
  • on beelink
    • dcache
      • from 5.6.51 to 5.7.10 loads up 24%, load-misses up 16%
      • from 5.7.43 to 8.0.34 loads up 21%, load-misses up 24%
    • icache
      • from 5.6.51 to 5.7.10 load-misses up 23%
      • from 5.7.43 to 8.0.34 load-misses up 33%
  • on ser7
    • dcache
      • from 5.6.51 to 5.7.10 loads up 38%, load-misses up 27%
      • from 5.7.43 to 8.0.34 loads up 28%, load-misses up 27%
    • icache
      • from 5.6.51 to 5.7.10 load-misses up 40%
      • from 5.7.43 to 8.0.34 load-misses up 34%
  • on socket2
    • dcache
      • from 5.6.51 to 5.7.10 loads up 22%, load-misses up 13%, stores up 26%
      • from 5.7.43 to 8.0.34 loads up 18%, load-misses up 20%, stores up 24%
    • icache
      • from 5.6.51 to 5.7.10 load-misses up 30%
      • from 5.7.43 to 8.0.34 load-misses up 37%
Results: LLC

The LLC counters were only supported on the socket2 CPU.

Summary:

  • the Results section above explains the y-axis
  • on socket2
    • loads
      • from 5.6.51 to 5.7.10 loads up 26%, load-misses down 3%
      • from 5.7.43 to 8.0.34 loads up 24%, load-misses down 18%
    • stores
      • from 5.6.51 to 5.7.10 stores up 16%, store-misses down 25%
      • from 5.7.43 to 8.0.34 stores up 69%, store-misses up 29%
Results: context switches

Summary:
  • on beelink
    • from 5.6.51 to 5.7.10 context switches down 31%
    • from 5.7.43 to 8.0.34 context switches up 3%
  • on ser7
    • from 5.6.51 to 5.7.10 context switches up 8%
    • from 5.7.43 to 8.0.34 context switches down 5%
  • on socket2
    • from 5.6.51 to 5.7.10 context switches up 1%
    • from 5.7.43 to 8.0.34 context switches down 2%
Results: CPU migrations

Summary:
  • on beelink
    • from 5.6.51 to 5.7.10 CPU migrations down 38%
    • from 5.7.43 to 8.0.34 CPU migrations up 94%
  • on ser7
    • from 5.6.51 to 5.7.10 CPU migrations up 41%
    • from 5.7.43 to 8.0.34 CPU migrations up 5.1X
  • on socket2
    • from 5.6.51 to 5.7.10 CPU migrations down 41%
    • from 5.7.43 to 8.0.34 CPU migrations down 54%
