I have been documenting why MySQL gets slower (in some cases) from version 5.6 to 8.0. One of the causes is more cache and TLB activity, especially misses. I have tried huge pages for data and it helps a bit, but doesn't solve the problem - see here and here. But it comes at cost in usability.
Here I explain how to use huge pages for text (the code of mysqld). I have work in progress to share more details about the changes in cache and TLB activity over time from MySQL 5.6 to 8.0.
tl;dr - using huge pages for text ...
- improved throughput by up to 9% but a typical improvement was ~5%. This is nice but iTLB activity is obviously not the only problem.
- comes at a cost in usability
- this description of the contents of /proc/$pid/smaps to understand whether my attempts to use huge pages for text had an impact
- this overview of /proc/meminfo to understand whether huge pages were configured (also see here)
- this tutorial was my starting point and most of the advice here worked for me although I am not sure that hugeedit --text worked for me
- background reading
- background reading from Oracle
- using THP (from Oracle)
- CMake: linker flags that can be set, flags overview,
- hugectl source when you need to understand an error
- using readelf to understand whether a binary can use huge pages
- another great blog post that explains why and how to do this
First up, is to install a few things: sudo apt install libhugetlbfs-dev libhugetlbfs-bin
$ sudo apt install libhugetlbfs-bin
$ sudo hugeadm --create-global-mounts
$ sudo hugeadm --pool-pages-min 2M:128
cat /proc/meminfo | grep huge -i
So I just did "ps mysqld" to see the command line when mysqld was running and then changed my helper scripts to use that as:
At last, things work (assuming I didn't forget a step).
When I start mysqld and it uses some of those 128 huge pages I see this and 30 pages are used by mysqld because HugePages_Total - HugePages_Free = 30.
Results: QPS
The builds are done on Ubuntu 22.04 and the compilers are:
- gcc 11.4.0-1ubuntu1~22.04
- clang 14.0.0-1ubuntu1.1
- my8028_rel_o2nofp (the base version below)
- the base version in the results below
- MySQL 8.0.28 using gcc, -fno-omit-frame-pointer and -O2
- my8028_rel_o2nofp_clang
- col-1 in the results below
- MySQL 8.0.28 using clang, -fno-omit-frame-pointer and -O2
- my8028_rel_o2nofp_nogold
- col-2 in the results below
- MySQL 8.0.28 using gcc, -DUSE_GOLD_LD=OFF, -fno-omit-frame-pointer and -O2
- my8028_rel_o2nofp_huge
- col-3 in the results below
- MySQL 8.0.28 using gcc, -fno-omit-frame-pointer, -O2 and huge pages for text
- my8028_rel_o2nofp_huge_clang
- col-4 in the results below
- MySQL 8.0.28 using clang, -fno-omit-frame-pointer, -O2 and huge pages for text
A summary of the results below:
- QPS with clang (col-1) are slightly worse than with gcc (base case) and the worst case was for scan where clang was ~11% slower. The common case is that clang was ~3% slower.
- QPS without the gold linker (col-2) are similar to the base case
- QPS with gcc and huge pages for text (col-3) are up to 9% better than the base case
- QPS with clang and huge pages for text (col-4) are up to 4% better than the base case but definitely worse than gcc + huge pages (col-3)
These results show the relative QPS which is: (QPS for $my_version / QPS for $base_version) where $base_version is my8028_rel_o2nofp. When the relative QPS is > 0 then $my_version is faster than the base version. When it is < 0 then $my_version is slower than the base version.
0.99 1.00 1.03 1.03 hot-points_range=100
0.97 1.03 1.05 1.01 point-query.pre_range=100
0.98 1.03 1.06 1.01 point-query_range=100
1.00 1.00 1.03 1.03 points-covered-pk.pre_range=100
1.00 1.00 1.03 1.03 points-covered-pk_range=100
1.01 1.00 1.03 1.04 points-covered-si.pre_range=100
1.00 0.99 1.02 1.03 points-covered-si_range=100
1.00 1.00 1.02 1.04 points-notcovered-pk.pre_range=100
1.00 1.00 1.02 1.03 points-notcovered-pk_range=100
1.01 0.99 1.01 1.02 points-notcovered-si.pre_range=100
1.02 1.00 1.02 1.02 points-notcovered-si_range=100
1.02 1.02 1.02 1.04 random-points.pre_range=1000
1.00 1.00 1.02 1.03 random-points.pre_range=100
0.97 0.99 1.02 0.99 random-points.pre_range=10
1.02 1.02 1.02 1.04 random-points_range=1000
1.00 0.99 1.02 1.03 random-points_range=100
0.98 0.98 1.02 0.98 random-points_range=10
0.94 0.96 1.05 0.98 range-covered-pk.pre_range=100
0.93 0.96 1.04 0.98 range-covered-pk_range=100
0.96 0.97 1.07 0.99 range-covered-si.pre_range=100
0.96 0.97 1.06 1.00 range-covered-si_range=100
0.98 1.00 1.03 1.02 range-notcovered-pk.pre_range=100
0.98 1.00 1.03 1.02 range-notcovered-pk_range=100
1.01 1.00 1.02 1.02 range-notcovered-si.pre_range=100
1.01 1.00 1.01 1.02 range-notcovered-si_range=100
0.98 1.00 1.05 1.02 read-only.pre_range=10000
0.98 1.01 1.05 1.02 read-only.pre_range=100
0.97 1.01 1.06 1.01 read-only.pre_range=10
0.98 1.00 1.05 1.03 read-only_range=10000
0.98 1.01 1.06 1.02 read-only_range=100
0.98 1.01 1.06 1.01 read-only_range=10
0.89 0.96 1.05 0.93 scan_range=100
0.97 1.00 1.04 1.01 delete_range=100
0.98 1.00 1.05 1.02 insert_range=100
0.98 1.00 1.05 1.02 read-write_range=100
0.98 1.00 1.06 1.02 read-write_range=10
0.98 1.05 1.09 1.01 update-index_range=100
0.99 1.00 1.04 1.01 update-inlist_range=100
0.96 1.00 1.05 1.01 update-nonindex_range=100
0.98 1.01 1.05 1.02 update-one_range=100
0.97 1.00 1.05 1.01 update-zipf_range=100
0.97 1.00 1.06 1.01 write-only_range=10000
Results from perf stat
Here I have results from perf stat run for 10 seconds approximately 60 seconds into each benchmark step. I only share results for 4 of the 42 microbenchmarks:
- scan - does a full table scan
- insert - does inserts
- point-query - each SQL statement fetches one row by exact match on the PK
- update-index - does updates that require secondary index maintenance
The relative value is: (value for $my_version / value for $base_version) where the base version is the same as above (my8028_rel_o2nofp). There are four columns in the relative value results (similar to col-1 through col-4 above). There are five columns for the absolute value results (the first column is for the base version and that is followed by col-1 through col-4 results).
I focus on results from gcc + huge pages which is col-3 above.
- scan (relative and absolute values)
- QPS is 5% better with gcc and huge pages
- iTLB-loads drops from 1M+ to ~30 (see here)
- L1-icache-loads drops by ~6X and L1-icache-loads-misses drops by ~8% (see here)
- insert (relative and absolute values)
- QPS is 5% better with gcc and huge pages
- iTLB-loads and iTLB-load-misses drop by more than 16X (see here)
- L1-icache-loads and L1-icache-loads-misses increase by ~8% (see here)
- point-query (relative and absolute values)
- QPS is 6% better with gcc and huge pages
- iTLB-loads and iTLB-load-misses drop almost to zero (see here)
- L1-icache-loads and L1-icache-loads-misses increase by ~5% (see here)
- update-index (relative and absolute values)
The results above show that mysqld compiled with gcc was slightly faster than clang when using -O2 so I recompiled with -O3 to see if that would repeat and it did. The difference isn't big, maybe ~3% is the typical difference, but that result is still interesting to me.
The QPS results for -O2 and -O3 are here. The base case is my8028_rel_o2nofp for -O2 and (not listed above) my8028_rel for -O3. That is compared to my8028_rel_o2nofp_clang for -O2 and my8028_rel_clang for -O3.
The worst-case comparison is for scan where clang is ~10% slower than gcc. The perf counters are here: relative and absolute values. I am not sure whether I will do more to investigate this but I see that ipc is ~6% worse for clang with -O2 and ~8% worse for clang with -O3.
No comments:
Post a Comment