Sunday, October 6, 2024

How to run MySQL with text in huge pages

I have been documenting why MySQL gets slower (in some cases) from version 5.6 to 8.0. One of the causes is more cache and TLB activity, especially misses. I have tried huge pages for data and it helps a bit, but doesn't solve the problem - see here and here. But it comes at cost in usability.

Here I explain how to use huge pages for text (the code of mysqld). I have work in progress to share more details about the changes in cache and TLB activity over time from MySQL 5.6 to 8.0.

tl;dr - using huge pages for text ...

  • improved throughput by up to 9% but a typical improvement was ~5%. This is nice but iTLB activity is obviously not the only problem.
  • comes at a cost in usability
Another interesting result is that QPS with gcc was ~3% better than with clang. I am curious if this will repeat on other CPUs, as the result here is from my smallest small server.

Education

Documentation for this is worse than the docs that exist for using huge pages with data. I used:
Building MySQL

First up, is to install a few things: sudo apt install libhugetlbfs-dev libhugetlbfs-bin

The first step is to build mysqld so that it can use huge pages for text. I started with the advice from here and the first step is to add flags for the linker to align text for 2MB pages.
    -DCMAKE_CXX_LINK_FLAGS="-Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-size=2097152"

I did this with MySQL 8.0.28 and that wasn't sufficient, I also had to prevent the gold linker from being used via: -DUSE_LD_GOLD=OFF. The docs for USE_LD_GOLD (and USE_LD_LLD) are here and state that USE_LD_GOLD was removed after MySQL 8.0.31.

The next step was to enable the usage of huge pages. But note that this has an impact on the memory available to the kernel whether or not the reserved huge pages (128 here) are in use. In this case, that means that 256M (128 pages at 2M each) of memory can't be used elsewhere. To undo the effect of hugeadm --pool-pages-min 2M:128 I can run hugeadm --pool-pages-min 2M:0 and the impact of 2M:0 means there are no huge pages available for use (in /proc/meminfo HugePages_Total returns to 0).

$ sudo apt install libhugetlbfs-bin
$ sudo hugeadm --create-global-mounts
$ sudo hugeadm --pool-pages-min 2M:128

When doing that it helps to run this before and after running the commands:

cat /proc/meminfo  | grep huge -i

The next step was to either run the binary with help from hugectl --text or to use hugeedit --text and then run mysqld as you normally run it. Since I start mysqld via mysqld_safe I first tried hugeedit so that I could avoid adding hugectl --text to the mysqld_safe scripts but it wasn't clear that worked.

So I just did "ps mysqld" to see the command line when mysqld was running and then changed my helper scripts to use that as:
    hugectl --force --text bin/mysqld ...

While the blog post above doesn't mention a need for --force I had to add it to avoid error messages, and to understand those error messages I ended up reading the source for hugectl. The error message was: LD_PRELOAD not appropriate for this map combination (see here).

At last, things work (assuming I didn't forget a step).

/proc/meminfo is your friend

There is information in /proc/meminfo that can tell you whether huge pages can be used and whether they are used. This is example output when huge pages cannot be used because HugePages_Total is 0.

$ grep -i huge /proc/meminfo
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB

After I run sudo hugeadm --pool-pages-min 2M:128 then huge pages can be used and I can see that in /proc/meminfo as HugePages_Total is 128, but none of the 128 huge pages are used because HugePages_Free is also 128.

$ grep -i huge /proc/meminfo
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:     128
HugePages_Free:      128
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:          262144 kB

When I start mysqld and it uses some of those 128 huge pages I see this and 30 pages are used by mysqld because HugePages_Total - HugePages_Free = 30.

$ grep -i huge /proc/meminfo
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:     128
HugePages_Free:       98
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:          262144 kB

Benchmark

I used MySQL 8.0.28, small server, a Beelink SER 4700u (see here) with 8 cores, a Ryzen 7 4700u CPU, 16G RAM and Ubuntu 22.04.

I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by InnoDB. It is run with 1 thread, 1 table and 30M rows. 

The benchmark was run with and without using perf stat to collect HW counters. The command line for my helper scripts was one of:
    # when not running perf
    bash r.sh 1 30000000 300 600 nvme0n1 1 1 1
    # when running perf 
    bash r.sh 1 30000000 330 630 nvme0n1 1 1 1

Results: QPS

The builds are done on Ubuntu 22.04 and the compilers are:

  • gcc 11.4.0-1ubuntu1~22.04
  • clang 14.0.0-1ubuntu1.1
This has results for 5 different MySQL builds, all using 8.0.28, CMAKE_BUILD_TYPE=Release and this my.cnf.

  • my8028_rel_o2nofp (the base version below)
    • the base version in the results below
    • MySQL 8.0.28 using gcc, -fno-omit-frame-pointer and -O2
  • my8028_rel_o2nofp_clang
    • col-1 in the results below
    • MySQL 8.0.28 using clang, -fno-omit-frame-pointer and -O2
  • my8028_rel_o2nofp_nogold
    • col-2 in the results below
    • MySQL 8.0.28 using gcc, -DUSE_GOLD_LD=OFF, -fno-omit-frame-pointer and -O2
  • my8028_rel_o2nofp_huge
    • col-3 in the results below
    • MySQL 8.0.28 using gcc, -fno-omit-frame-pointer, -O2 and huge pages for text
  • my8028_rel_o2nofp_huge_clang
    • col-4 in the results below
    • MySQL 8.0.28 using clang, -fno-omit-frame-pointer, -O2 and huge pages for text

A summary of the results below:

  • QPS with clang (col-1) are slightly worse than with gcc (base case) and the worst case was for scan where clang was ~11% slower. The common case is that clang was ~3% slower.
  • QPS without the gold linker (col-2) are similar to the base case
  • QPS with gcc and huge pages for text (col-3) are up to 9% better than the base case
  • QPS with clang and huge pages for text (col-4) are up to 4% better than the base case but definitely worse than gcc + huge pages (col-3)

These results show the relative QPS which is: (QPS for $my_version / QPS for $base_version) where $base_version is my8028_rel_o2nofp. When the relative QPS is > 0 then $my_version is faster than the base version. When it is < 0 then $my_version is slower than the base version.

col-1   col-2   col-3   col-4
0.99    1.00    1.03    1.03    hot-points_range=100
0.97    1.03    1.05    1.01    point-query.pre_range=100
0.98    1.03    1.06    1.01    point-query_range=100
1.00    1.00    1.03    1.03    points-covered-pk.pre_range=100
1.00    1.00    1.03    1.03    points-covered-pk_range=100
1.01    1.00    1.03    1.04    points-covered-si.pre_range=100
1.00    0.99    1.02    1.03    points-covered-si_range=100
1.00    1.00    1.02    1.04    points-notcovered-pk.pre_range=100
1.00    1.00    1.02    1.03    points-notcovered-pk_range=100
1.01    0.99    1.01    1.02    points-notcovered-si.pre_range=100
1.02    1.00    1.02    1.02    points-notcovered-si_range=100
1.02    1.02    1.02    1.04    random-points.pre_range=1000
1.00    1.00    1.02    1.03    random-points.pre_range=100
0.97    0.99    1.02    0.99    random-points.pre_range=10
1.02    1.02    1.02    1.04    random-points_range=1000
1.00    0.99    1.02    1.03    random-points_range=100
0.98    0.98    1.02    0.98    random-points_range=10
0.94    0.96    1.05    0.98    range-covered-pk.pre_range=100
0.93    0.96    1.04    0.98    range-covered-pk_range=100
0.96    0.97    1.07    0.99    range-covered-si.pre_range=100
0.96    0.97    1.06    1.00    range-covered-si_range=100
0.98    1.00    1.03    1.02    range-notcovered-pk.pre_range=100
0.98    1.00    1.03    1.02    range-notcovered-pk_range=100
1.01    1.00    1.02    1.02    range-notcovered-si.pre_range=100
1.01    1.00    1.01    1.02    range-notcovered-si_range=100
0.98    1.00    1.05    1.02    read-only.pre_range=10000
0.98    1.01    1.05    1.02    read-only.pre_range=100
0.97    1.01    1.06    1.01    read-only.pre_range=10
0.98    1.00    1.05    1.03    read-only_range=10000
0.98    1.01    1.06    1.02    read-only_range=100
0.98    1.01    1.06    1.01    read-only_range=10
0.89    0.96    1.05    0.93    scan_range=100
0.97    1.00    1.04    1.01    delete_range=100
0.98    1.00    1.05    1.02    insert_range=100
0.98    1.00    1.05    1.02    read-write_range=100
0.98    1.00    1.06    1.02    read-write_range=10
0.98    1.05    1.09    1.01    update-index_range=100
0.99    1.00    1.04    1.01    update-inlist_range=100
0.96    1.00    1.05    1.01    update-nonindex_range=100
0.98    1.01    1.05    1.02    update-one_range=100
0.97    1.00    1.05    1.01    update-zipf_range=100
0.97    1.00    1.06    1.01    write-only_range=10000

Results from perf stat

Here I have results from perf stat run for 10 seconds approximately 60 seconds into each benchmark step. I only share results for 4 of the 42 microbenchmarks: 

  • scan - does a full table scan
  • insert - does inserts
  • point-query - each SQL statement fetches one row by exact match on the PK
  • update-index - does updates that require secondary index maintenance
For each of the 4 microbenchmarks there is a link to a gist that has both absolute values and relative values. The absolute values are useful to understand whether differences are significant. For example, using huge pages might reduce values for a counter, but if the reduction is from 100 to 10 then the impact might not be significant. While reducing a counter from 1B to 100M might be significant. Alas, it isn't trivial to figure out the performance impact from a counter value. 

The relative value is: (value for $my_version / value for $base_version) where the base version is the same as above (my8028_rel_o2nofp). There are four columns in the relative value results (similar to col-1 through col-4 above). There are five columns for the absolute value results (the first column is for the base version and that is followed by col-1 through col-4 results).

I focus on results from gcc + huge pages which is col-3 above.

The perf stat results are:
  • scan (relative and absolute values)
    • QPS is 5% better with gcc and huge pages
    • iTLB-loads drops from 1M+ to ~30 (see here)
    • L1-icache-loads drops by ~6X and L1-icache-loads-misses drops by ~8% (see here)
  • insert (relative and absolute values)
    • QPS is 5% better with gcc and huge pages
    • iTLB-loads and iTLB-load-misses drop by more than 16X (see here)
    • L1-icache-loads and L1-icache-loads-misses increase by ~8% (see here)
  • point-query (relative and absolute values)
    • QPS is 6% better with gcc and huge pages
    • iTLB-loads and iTLB-load-misses drop almost to zero (see here)
    • L1-icache-loads and L1-icache-loads-misses increase by ~5% (see here)
  • update-index (relative and absolute values)
    • QPS is 9% better with gcc and huge pages
    • iTLB-loads and iTLB-load-misses drop by ~30X (see here)
    • L1-icache-loads-misses increases by ~6% (see here)

Results from gcc vs clang

The results above show that mysqld compiled with gcc was slightly faster than clang when using -O2 so I recompiled with -O3 to see if that would repeat and it did. The difference isn't big, maybe ~3% is the typical difference, but that result is still interesting to me.

The QPS results for -O2 and -O3 are here. The base case is my8028_rel_o2nofp for -O2 and (not listed above) my8028_rel for -O3. That is compared to my8028_rel_o2nofp_clang for -O2 and my8028_rel_clang for -O3.

The worst-case comparison is for scan where clang is ~10% slower than gcc. The perf counters are here: relative and absolute values. I am not sure whether I will do more to investigate this but I see that ipc is ~6% worse for clang with -O2 and ~8% worse for clang with -O3.




No comments:

Post a Comment

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...