Wednesday, July 17, 2024

Make MyRocks 2X faster by changing the CPU frequency governor

Thank you for clicking on one of my better clickbait titles. Obviously this trick won't work everywhere, but it does work on one of my home servers.

tl;dr

  • Switching from the Ubuntu 22.04 Server default CPU frequency governor (schedutil) to the performance governor increases QPS by ~2X for many write-heavy microbenchmarks with MyRocks on one of my large home servers.
  • The problem is that Linux is under-clocking the CPU with MyRocks on the server with the schedutil frequency governor. Switching to the performance frequency governor fixes that -- I am not making MyRocks faster by over-clocking. 
  • The same change doesn't impact InnoDB performance.
  • The same change doesn't appear to make a difference on my other home servers.

Introduction

My home servers are listed here and changing the CPU frequency governor from schedutil to performance improves QPS by up to 2X on the write-heavy benchmark steps of sysbench. An example of my sysbench usage is here.

But first a short diversion: I finally switched from XFS to ext4 on my benchmark servers because I got odd performance and odd warnings (grep for "hogged CPU" in syslog). That was my first odd problem in Ubuntu 22.04 Server with the HWE kernel (currently 6.5.0-41-generic) -- see here.

This blog post is about my second odd problem and it is limited to one of my servers which is named socket2 below. 

I have been experimenting with this on 3 servers. It is only a problem on one -- socket2:

  • socket2
    • the large server with the problem. This is the v6 server here and is a SuperMicro SuperWorkstation with 2 sockets, 12 cores/socket, Intel Xeon CPUs with HT disabled, 64G RAM, Ubuntu 22.04 and ext4 using data=writeback and SW RAID 0 over 2 NVMe devices. I enabled HWE kernels and the kernel is 6.5.0-41-generic.
  • dell32
    • another large server that doesn't have this problem. This is the v7 server here and is a Dell Precision 7865 Tower with 32-cores, AMD Ryzen Threadripper PRO 5975WX, AMD SMT disabled, 128G RAM, Ubuntu 22.04 and ext4 using data=writeback and SW RAID 0 over 2 NVMe devices. I enabled HWE kernels and the kernel is 6.5.0-41-generic.
  • pn53
    • a small server that doesn't have this problem. This is the v8 server here and is an ASUS ExpertCenter PN53 with 8 cores, AMD SMT disabled, an AMD Ryzen 7 7735HS CPU, 32G RAM, Ubuntu 22.04 and ext4 using data=writeback over 1 NVMe device. I enabled HWE kernels and the kernel is 6.5.0-41-generic.

The symptoms

The relative QPS for FB MyRocks was worse than expected on the socket2 server. By relative QPS in this case I mean: (QPS for FB MyRocks) / (QPS for InnoDB) where:

  • FB MyRocks
    • FB MySQL 8.0.32 with source as of 24-05-29 built at git hash 49b37dfe using RocksDB 9.2.1
  • InnoDB
    • InnoDB from MySQL 8.0.37

Using a subset of the sysbench microbenchmarks, the relative QPS for FB MyRocks on socket2 is bad (less than 0.5) for some write-heavy tests. When the relative QPS is 0.48 (see update-index below), FB MyRocks gets ~48% of the QPS vs InnoDB (or InnoDB gets ~2X more QPS). This result is much worse than what I see on my other large server, the dell32.

0.82    point-query.pre_range=100
0.78    range-notcovered-si.pre_range=100
0.56    range-notcovered-si_range=100
0.46    scan_range=100
0.31    delete_range=100
0.24    insert_range=100
0.70    read-write_range=100
0.71    read-write_range=10
0.48    update-index_range=100
0.31    update-inlist_range=100
0.27    update-nonindex_range=100
0.27    update-one_range=100
0.27    update-zipf_range=100
0.78    write-only_range=10000

For reference, here are the results from dell32, which show that InnoDB still does better than MyRocks, but the worst case below has a difference of ~2X while the worst case above is ~4X.

0.84    point-query.pre_range=100
0.73    range-notcovered-si.pre_range=100
0.49    range-notcovered-si_range=100
0.42    scan_range=100
0.45    delete_range=100
0.46    insert_range=100
0.68    read-write_range=100
0.68    read-write_range=10
1.22    update-index_range=100
0.44    update-inlist_range=100
0.53    update-nonindex_range=100
0.54    update-one_range=100
0.53    update-zipf_range=100
0.76    write-only_range=10000

Diagnosis: step 1

Why does MyRocks do so much worse on socket2? I started with flamegraphs, but they were similar between socket2 and dell32, so the distribution of time spent across the code is similar.

Then I repeated the benchmark while recording perf counters (see here). The first clue is below: perf claims the CPU runs at 1.018 GHz for MyRocks vs 2.823 GHz for InnoDB.
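
These look like the default perf stat events. A minimal sketch that collects similar counters, assuming mysqld is the server process and a 60-second sample window:

# attach to the running mysqld and count the default events for ~60 seconds
perf stat -p $(pidof mysqld) -- sleep 60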

With update-index I get this for MyRocks on socket2:

       81,862.93 msec task-clock        #   8.182 CPUs utilized
         690,877      context-switches  #   8.439 K/sec
          12,119      cpu-migrations    # 148.040 /sec
          12,944      page-faults       # 158.118 /sec
  83,298,404,517      cycles            #   1.018 GHz
  50,239,502,774      instructions      #   0.60  insn per cycle
   9,318,574,550      branches          # 113.831 M/sec
     309,350,411      branch-misses     #   3.32% of all branches

And then with update-index for InnoDB on socket2:

      152,948.80 msec task-clock        #  15.290 CPUs utilized
       2,085,965      context-switches  #  13.638 K/sec
         270,671      cpu-migrations    #   1.770 K/sec
         167,039      page-faults       #   1.092 K/sec
 431,800,251,266      cycles            #   2.823 GHz
 503,680,176,190      instructions      #   1.17  insn per cycle
  86,250,048,599      branches          # 563.915 M/sec
     787,451,136      branch-misses     #   0.91% of all branches

Results on the dell32 server don't reproduce the problem, as perf claims the CPU runs at 3.038 GHz for MyRocks vs 3.269 GHz for InnoDB.

With update-index I get this for MyRocks on dell32:

      106,338.62 msec task-clock        #   10.629 CPUs utilized
       1,998,242      context-switches  #   18.791 K/sec
          66,738      cpu-migrations    #  627.599 /sec
          55,534      page-faults       #  522.237 /sec
 323,031,027,098      cycles            #    3.038 GHz                         (83.39%)
    2,209,382,673      stalled-cycles-frontend  #  0.68% frontend cycles idle    (82.94%)
   41,816,695,185      stalled-cycles-backend   # 12.95% backend cycles idle     (83.23%)
  208,542,471,854      instructions      #    0.65  insn per cycle
                                         #    0.20  stalled cycles per insn (83.37%)
   40,354,000,154      branches          # 379.486 M/sec (83.46%)
    3,275,786,898      branch-misses     #   8.12% of all branches (83.63%)

And then with update-index for InnoDB on dell32:

      174,166.34 msec task-clock        #  17.409 CPUs utilized
       2,202,343      context-switches  #  12.645 K/sec
         262,216      cpu-migrations    #   1.506 K/sec
              15      page-faults       #   0.086 /sec
 569,317,425,554      cycles            #   3.269 GHz (83.39%)
    3,105,643,611      stalled-cycles-frontend  #  0.55% frontend cycles idle (83.38%)
   57,983,678,763      stalled-cycles-backend   # 10.18% backend cycles idle  (83.19%)
 728,678,166,055      instructions      #   1.28  insn per cycle
                                        #   0.08  stalled cycles per insn (83.29%)
 127,614,340,327      branches          # 732.715 M/sec (83.35%)
   3,729,114,189      branch-misses     #   2.92% of all branches (83.42%)

Diagnosis: step 2

I then edited my helper scripts to sample per-core speeds from /proc/cpuinfo every 5 seconds and aggregate them into 100 MHz buckets.
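
The helper scripts aren't shown here, but a minimal sketch of the sampling and bucketing, assuming a 30-minute benchmark step and the usual "cpu MHz" lines in /proc/cpuinfo:

# every 5 seconds print the per-core MHz rounded down to a 100 MHz bucket,
# then print the histogram as (count, MHz) pairs when the loop ends
for i in $(seq 1 360); do
  awk '/^cpu MHz/ { print int($4/100)*100 }' /proc/cpuinfo
  sleep 5
done | sort -n | uniq -c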

Note that the socket2 server has 24 cores and the benchmark is run with 16 clients. From cpupower frequency-info I see the following for socket2 and learn that the CPU can range between 1 GHz and 3.5 GHz. Also, the schedutil CPU frequency governor is used.

  driver: intel_cpufreq
  CPUs which run at the same hardware frequency: 23
  CPUs which need to have their frequency coordinated by software: 23
  maximum transition latency: 20.0 us
  hardware limits: 1000 MHz - 3.50 GHz
  available cpufreq governors: conservative ondemand userspace powersave performance schedutil
  current policy: frequency should be within 1000 MHz and 3.50 GHz.
                  The governor "schedutil" may decide which speed to use
                  within this range.

The output for dell32 is slightly different as it uses the acpi-cpufreq driver, and at least in theory, with AMD the CPU frequency is limited to one of three values:

  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 31
  CPUs which need to have their frequency coordinated by software: 31
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 1.80 GHz - 7.01 GHz
  available frequency steps:  3.60 GHz, 2.70 GHz, 1.80 GHz
  available cpufreq governors: conservative ondemand userspace powersave performance schedutil
  current policy: frequency should be within 1.80 GHz and 3.60 GHz.
                  The governor "schedutil" may decide which speed to use
                  within this range.

With update-index I get this for MyRocks on socket2 when I aggregate the CPU speeds listed in /proc/cpuinfo into 100 MHz buckets. Note that the histogram is skewed towards the lower values, which means that most of the cores run at a low speed most of the time. That isn't expected because the workload is CPU-bound and most cores should be busy:

  count MHz
   1042 900
    970 1000
    161 1100
     94 1200
     68 1300
     95 1400
    167 1500
     67 1600
     46 1700
     29 1800
     12 1900
      6 2000
     13 2100
     14 2200
     64 2300
     38 2400
      2 2500
      4 2600
      9 2700
      2 2900
      1 3500

With update-index and InnoDB on socket2 the histogram is skewed towards the larger values. So the result with InnoDB is what I expect and much better than the result above for MyRocks.

  count MHz
      1 900
     22 1000
      4 1100
      5 1200
      6 1300
      1 1400
      4 1500
      3 1600
     10 1700
      5 1800
      8 1900
      9 2000
     12 2100
     16 2200
     34 2300
     31 2400
     37 2500
     58 2600
     69 2700
    102 2800
    554 2900
   1895 3000
      6 3100
      8 3200
      3 3300
      1 3500

The fix

Why are the CPUs run at low speed (often 1 GHz) for MyRocks, while with InnoDB they run at high speed (often 3 GHz)? Thermal throttling is one possible cause, but that wasn't happening here.

I then changed the CPU frequency governor from schedutil to performance using these commands:

cpupower frequency-set --governor performance
cpupower frequency-set -u 2.5GHz
cpupower frequency-set -d 2.0GHz
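
To confirm the change took effect on every core, the sysfs cpufreq files can be checked (a quick sketch, not from my helper scripts):

# every core should now report the performance governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

Switching back is just: cpupower frequency-set --governor schedutil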

And then I repeated the benchmark. The first group of numbers below are the results with schedutil and the second group are the results with the performance frequency governor. Note that the values for many of the write-heavy microbenchmarks are about 2X larger with the performance frequency governor, which means that MyRocks is ~2X faster on them with the change. Unfortunately, I might never learn why.

First, the relative QPS for MyRocks prior to the change when using schedutil:

0.82    point-query.pre_range=100
0.78    range-notcovered-si.pre_range=100
0.56    range-notcovered-si_range=100
0.46    scan_range=100
0.31    delete_range=100
0.24    insert_range=100
0.70    read-write_range=100
0.71    read-write_range=10
0.48    update-index_range=100
0.31    update-inlist_range=100
0.27    update-nonindex_range=100
0.27    update-one_range=100
0.27    update-zipf_range=100
0.78    write-only_range=10000

And then the values for MyRocks after the change from schedutil to performance:

0.82    point-query.pre_range=100
0.77    range-notcovered-si.pre_range=100
0.54    range-notcovered-si_range=100
0.48    scan_range=100
0.52    delete_range=100
0.53    insert_range=100
0.73    read-write_range=100
0.75    read-write_range=10
1.03    update-index_range=100
0.52    update-inlist_range=100
0.60    update-nonindex_range=100
0.62    update-one_range=100
0.60    update-zipf_range=100
0.86    write-only_range=10000

Appendix 

Random articles I found while debugging

5 comments:

  1. Old knowledge: https://clickhouse.com/docs/en/operations/tips#cpu-scaling-governor

     Reply: that old knowledge would be more valuable if there were more details that motivated the advice

  2. A shot in the dark: Schedutil prioritizes power consumption and so it has to sample the load. It's quite possible that the sampling falls into some kind of convoy pattern. The hypothesis being that LSM writes are batched and large, and CPU consumption could be low during those periods? Maybe its sampling method is broken?

  3. Another candidate could be: https://www.reddit.com/r/linux/comments/15p4bfs/amd_pstate_and_amd_pstate_epp_scaling_driver/ -- apparently Intel CPUs have similar issues.

     Reply: I want everything to be simple & easy except for the thing in which I have the time and expertise to cope with the complexity.
