Monday, September 23, 2024

Why do table scans get slower with MySQL from 5.6 to 8.0?

As I work to explain why MySQL is getting (too much) slower over time, I collect a lot of data from many servers and then get overwhelmed trying to summarize it. Here I focus on one microbenchmark (table scan) on one server (ax162-s from Hetzner) using upstream MySQL with InnoDB and FB MyRocks.

While my primary goal is to explain the regressions in MyRocks, many of them are inherited from upstream MySQL so I must first explain them in upstream MySQL with InnoDB.

My focus is on a test that does full table scans, so the ratio of (time spent in storage engine / time spent elsewhere) is much larger for it than for many of the other tests I use with sysbench.

The basic problem is that the table scan done by my tests is

  • ~1.5X faster for InnoDB in MySQL 5.6.51 than in 8.0.39
  • ~1.5X faster for MyRocks in FB MySQL 5.6.35 than in 8.0.32

The problem is spread over many releases, but the summary is:

  • Code bloat - more instructions are executed per SQL statement
    • InnoDB 8.0.28 uses ~1.3X more instructions/query than InnoDB 5.6.51
    • MyRocks 8.0.32 uses ~1.3X more instructions/query than MyRocks 5.6.35
  • Memory system bloat - more TLB and cache misses per SQL statement
    • InnoDB 8.0.28 has a lot more cache & TLB loads & misses relative to InnoDB 5.6.51. The difference ranges from ~1.6X more cache-misses to ~1700X more iTLB-loads.
    • MyRocks 8.0.32 has a lot more cache & TLB loads & misses relative to MyRocks 5.6.35. The difference ranges from ~1.6X more cache-misses to ~22X more iTLB-loads.

The benchmark

I used sysbench but not as described here. Instead I just did the load step, a few point-query tests and then the full scan test. Here I focus on the full scan. I ran a subset of the tests to get more results in less time.

Tests are run with 1 thread, 1 table and 50M rows. While modern MySQL suffers from new CPU overheads it also benefits from a reduction in mutex contention. Alas, I don't highlight the improvements here. And modern MySQL would be even faster if it could reduce mutex contention without increasing CPU overheads.
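
As a rough sketch, the load step with stock sysbench 1.0 looks like this (my tests use custom Lua scripts, and the connection settings here are placeholders):

    # create and load one table with 50M rows using 1 thread
    sysbench oltp_point_select --mysql-user=root --mysql-db=test \
        --tables=1 --table-size=50000000 --threads=1 prepare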

I ran things 3 times:

1. plain - just run the test, don't use perf to get HW counters or flamegraphs
2. stat - run perf stat to collect HW counters
3. svg - run perf record -e cycles -F 1999 -g -p $pid to generate flame graphs
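
For the svg step, a minimal sketch of rendering the recorded samples, assuming the FlameGraph scripts (stackcollapse-perf.pl, flamegraph.pl) are on the PATH and using a 60-second recording as an example:

    # record stacks from mysqld, then render perf.data as an SVG flame graph
    perf record -e cycles -F 1999 -g -p $pid -- sleep 60
    perf script | stackcollapse-perf.pl | flamegraph.pl > scan.svg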

For stat, the counters are sampled over 10-second intervals for several groups of counters via:

    perf stat -e cpu-clock,cycles,bus-cycles,instructions,branches,branch-misses -p $pid -- sleep 10
    perf stat -e cache-references,cache-misses,stalled-cycles-backend,stalled-cycles-frontend -p $pid -- sleep 10
    perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores -p $pid -- sleep 10
    perf stat -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses -p $pid -- sleep 10
    perf stat -e iTLB-load-misses,iTLB-loads,L1-icache-loads-misses,L1-icache-loads -p $pid -- sleep 10
    perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses,LLC-prefetches -p $pid -- sleep 10
    perf stat -e alignment-faults,context-switches,migrations,major-faults,minor-faults,faults -p $pid -- sleep 10
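
Because perf stat writes its report to stderr, one way to keep per-group results is to redirect stderr to a file per group. A sketch for the first group (the file name is just an example):

    # save the counter report (perf stat prints to stderr)
    perf stat -e cpu-clock,cycles,bus-cycles,instructions,branches,branch-misses \
        -p $pid -- sleep 10 2> perf_stat_cycles.txt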

Database versions tested

For upstream MySQL I tested InnoDB from 5.6.51, 5.7.1, 5.7.3, 5.7.5, 5.7.7, 5.7.9, 5.7.10, 5.7.19, 5.7.35, 5.7.44, 8.0.0, 8.0.2, 8.0.4, 8.0.11, 8.0.13, 8.0.15, 8.0.17, 8.0.19, 8.0.20, 8.0.21, 8.0.23, 8.0.25, 8.0.27, 8.0.28, 8.0.39.

For MyRocks I tested:

• 5635.old - MyRocks 5.6.35 from Apr 7, 2021 at git hash f896415f with RocksDB 6.19.0
• 5635.new - MyRocks 5.6.35 from Oct 16, 2023 at git hash 4f3a57a1 with RocksDB 8.7.0
• 8028.old - MyRocks 8.0.28 from Aug 29 2022 at git hash a35c8dfe with RocksDB 7.5.2
• 8028.new - MyRocks 8.0.28 from Dec 2 2023 at git hash 4edf1eec with RocksDB 8.7.0
• 8032.old - MyRocks 8.0.32 from Dec 4 2023 at git hash e3a854e8 with RocksDB 8.7.0
• 8032.new - MyRocks 8.0.32 from May 29 2024 at git hash 49b37dfe with RocksDB 9.2.1

Charts: InnoDB

The charts use relative QPS which is: (QPS for my version / QPS for InnoDB 5.6.51). When it is less than one there is a performance regression. The spreadsheet with charts is here.

The regressions during 5.7 are small.
The regressions during 8.0 are larger. The worst occurs after 8.0.28 and hopefully it is resolved in the soon-to-be-released 8.0.40 -- see bug 111538.

Charts: MyRocks

The charts use relative QPS which is: (QPS for my version / QPS for MyRocks 5635.old). When it is less than one there is a performance regression. The spreadsheet with charts is here.

The large regression occurs during the 5.6.35 to 8.0.28 transition.

Hardware counters

Above I explained how perf stat is used to collect HW counters over 10-second intervals. It is important to remember that measurements are made over a fixed amount of time rather than a fixed amount of work (queries). I want to know things like how many instructions are executed per query and for that I have to normalize the results from perf stat by QPS.
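
As a sketch of that normalization (the QPS value here is made up):

    # instructions/query = (instructions counted / sample seconds) / QPS
    # e.g. 122,334,672,524 instructions over 10 seconds at 50,000 QPS
    echo "scale=1; (122334672524 / 10) / 50000" | bc
    # -> 244669.3 instructions per query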

But before I spend too much time evaluating results from perf stat I confirm that the clock rate during the samples is similar. I learned this the hard way when using the schedutil CPU frequency governor with MyRocks (see here). The information for this comes from perf stat -e cycles via the line for cycles:

             10,010.49 msec cpu-clock                 #    1.001 CPUs utilized
        27,526,557,704      cycles                    #    2.750 GHz
       <not supported>      bus-cycles
       122,334,672,524      instructions              #    4.44  insn per cycle
        23,595,569,084      branches                  #    2.357 G/sec
             9,662,382      branch-misses             #    0.04% of all branches

          10.002641688 seconds time elapsed
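
To confirm the clock rate is similar across samples, a quick check that pulls the GHz figure from a saved report (the file name follows the sketch above):

    # print the effective clock rate from the saved cycles report
    grep GHz perf_stat_cycles.txt | awk '{print $4, $5}'
    # -> 2.750 GHz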

I assume that the 10-second sample is representative. That is more true for read-heavy tests and less true for write-heavy tests because write-heavy tests have intermittent background work (B-Tree writeback, LSM compaction).

Using relative values

I focus on using relative values for the counters which is: (value for my version) / (value for base case). This makes it easy to see how things change over time from old to new releases but it doesn't show the significance of the change. For example, in many cases the value for context switches during a table scan grows by a large amount (as in 40X) but that isn't significant because the change might be from 10 context switches per second to 400. And I doubt such a change has any impact on the performance regressions.

Unfortunately, it isn't trivial to understand the significance of changes for other metrics. For example, when should I care more about cache misses vs TLB misses?

The data

At last, I can share some of the data collected by perf stat. The data is here for:

• InnoDB 5.6.51 to 5.7.44
  • see here. The gist has two parts -- first is the counters that might explain the regression and the second is all of the counters. Everything uses relative values.
• InnoDB 5.7.44 to 8.0.39
  • see here. The gist has three parts -- first is the counters (as relative values) that might explain the regressions, the second is the absolute values for those counters and the third is all of the counters as relative values.
• MyRocks 5.6.35 to 8.0.32
  • see here. The gist has two parts -- first is the counters that might explain the regression and the second is all of the counters. Everything uses relative values.

First up is InnoDB:
• MySQL 8.0.28 uses ~1.3X more instructions/query than 5.6.51
• MySQL 8.0.28 has 1.6X or more cache & TLB misses and loads per query than 5.6.51. The worst ratio is for iTLB-loads (over 1700X) but I don't know whether the absolute cost from that is more significant than for the other counters.
    (value/query for MySQL 8.0.28) / (value/query for MySQL 5.6.51)

    instructions            (.991 / .109) / (1 / .135) =    1.227
    cache-misses            (1.301 / .109) / (1 / .135) =   1.611
    cache-references        (1.504 / .109) / (1 / .135) =   1.862
    dTLB-loads              (2.903 / .109) / (1 / .135) =   3.595
    iTLB-load-misses        (2.011 / .109) / (1 / .135) =   2.490
    iTLB-loads              (1412.0 / .109) / (1 / .135) =  1748.903
    L1-dcache-load-misses   (1.723 / .109) / (1 / .135) =   2.134
    L1-icache-loads         (3.101 / .109) / (1 / .135) =   3.840
    L1-icache-loads-misses  (15.607 / .109) / (1 / .135) =  19.330
    stalled-cycles-backend  (1.427 / .109) / (1 / .135) =   1.767
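
The arithmetic in these tables is just two divisions; a sketch of checking the instructions row with bc:

    # (relative instructions / relative QPS) for 8.0.28, divided by the
    # same ratio for 5.6.51
    echo "scale=3; (.991 / .109) / (1 / .135)" | bc
    # -> 1.227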

Next up is MyRocks:
• MyRocks 8032.new uses ~1.3X more instructions/query than 5635.old
• MyRocks 8032.new has 1.6X or more cache & TLB misses and loads per query than 5635.old. The worst ratio is for iTLB-loads (over 22X) but I don't know whether the absolute cost from that is more significant than for the other counters.
    (value/query for MyRocks 8032.new) / (value/query for 5635.old)

    instructions            (0.883 / .049) / (1 / .073) =   1.315
    cache-references        (1.410 / .049) / (1 / .073) =   2.100
    dTLB-loads              (1.253 / .049) / (1 / .073) =   1.866
    iTLB-load-misses        (1.104 / .049) / (1 / .073) =   1.644
    iTLB-loads              (15.068 / .049) / (1 / .073) =  22.449
    L1-dcache-load-misses   (1.275 / .049) / (1 / .073) =   1.899
    L1-icache-loads         (1.432 / .049) / (1 / .073) =   2.133
    L1-icache-loads-misses  (5.306 / .049) / (1 / .073) =   7.905
    stalled-cycles-frontend (1.429 / .049) / (1 / .073) =   2.128

