Thursday, March 16, 2023

Huge pages with Postgres & InnoDB: better perf, but a bit more work

I started to use huge pages for the benchmarks I do with Postgres and MySQL/InnoDB. Huge pages help performance, but they require a few extra setup steps and those steps are another potential source of failures for production deployments. See here for an example of the perf improvements.

Disclaimer - I am new to this and far from an expert.

Postgres

As always, the Postgres docs are useful:

  1. Figure out how many huge pages will be needed, call this X
  2. Edit /etc/sysctl.conf with vm.nr_hugepages=$X
  3. sudo sysctl -p
  4. Compact the Linux VM (optional, not in the PG docs)
  5. Add huge_pages=try or huge_pages=on to the Postgres configuration
  6. Start Postgres
To estimate the number of huge pages (the value for X), first figure out the value of Hugepagesize (grep Huge /proc/meminfo); in the example below it is 2MB (2048 kB). Then X = sizeof(buffer pool) / Hugepagesize. A worked example follows the output below.

$ grep Huge /proc/meminfo

AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
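
As a worked example (the buffer pool size here is hypothetical): with shared_buffers=8GB and the 2048 kB Hugepagesize shown above, X = (8 * 1024 * 1024 kB) / 2048 kB = 4096 huge pages. Rounding X up a bit is safer because Postgres allocates slightly more shared memory than shared_buffers. Then:

$ echo "vm.nr_hugepages=4096" | sudo tee -a /etc/sysctl.conf
$ sudo sysctl -p

And in postgresql.conf:

shared_buffers=8GB
huge_pages=try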
I suggest compacting the Linux VM because huge pages might not be available if you don't (I learned this from experience), and you can judge that based on the value of HugePages_Total in /proc/meminfo. After starting Postgres check /proc/meminfo again to confirm that huge pages were used. This post has more details.

My script for compacting the VM is:

# write dirty pages to storage, drop the page cache, then defragment free memory
# the echo commands must be run as root
sync; sync; sync
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
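
To confirm it worked I look at the HugePages_ counters: HugePages_Total should match the value set for vm.nr_hugepages, and HugePages_Free should drop once Postgres is using the pages.

$ grep HugePages_ /proc/meminfo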

MySQL/InnoDB

The docs are here for InnoDB with huge pages, although I think they need to be updated. AFAIK InnoDB always uses mmap now, while it used shm (shmget, shmat) prior to modern 8.0. When shm is used you need to worry about /proc/sys/vm/hugetlb_shm_group, /proc/sys/kernel/shmmax and /proc/sys/kernel/shmall as mentioned in the InnoDB docs, but I don't think those matter now that InnoDB appears to always use mmap. For the 5.7 code that uses shm see here and for modern 8.0 that uses mmap see here.

To use huge pages with InnoDB:

  1. Figure out how many huge pages will be needed, call this X. See the Postgres section above.
  2. Edit /etc/sysctl.conf with vm.nr_hugepages=$X
  3. sudo sysctl -p
  4. Compact the Linux VM (optional, not in the InnoDB docs). See the Postgres section above.
  5. Confirm that innodb_buffer_pool_chunk_size is larger than the huge page size. The default is 128M which is great when the huge page size is 2M.
  6. Add large_pages=ON to my.cnf
  7. Start mysqld
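A minimal my.cnf sketch for steps 5 to 7 (the buffer pool size is illustrative, not a recommendation):

[mysqld]
large_pages=ON
innodb_buffer_pool_size=8G
innodb_buffer_pool_chunk_size=128M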
When I first tried this with MySQL 5.6 I forgot to edit /proc/sys/vm/hugetlb_shm_group as described in the InnoDB docs. I was able to start mysqld, but there were errors in the database error log and odd query and connection errors afterwards. Fortunately, I don't think you need to edit hugetlb_shm_group with modern 8.0 because it uses mmap instead of shm.





Wednesday, March 15, 2023

Compiler optimizations, MyRocks and a small server

I revisited prior work after taking more care to document the impact of the flags I use when compiling MyRocks. I ran two in-memory benchmarks (sysbench, insert benchmark) to understand the impact of the compile-time flags on CPU efficiency.

tl;dr:

  • use CMAKE_BUILD_TYPE=Release
  • enable link time optimization (-flto with gcc) via -DWITH_LTO=ON
  • use -march=native & -mtune=native (if possible)
For MyRocks with FB MySQL 8.0.28 the build that uses both link time optimization and CPU specific optimizations (-march=native -mtune=native), called rel_native_lto, gets about 5% more throughput with the insert benchmark and about 5%, 7% and 6% more with sysbench for point queries, range queries and writes.

Disclaimer - the MySQL build for MyRocks wraps the RocksDB build and that gets in the way. While the RocksDB build respected most of the flags described below, I wasn't able to control the use of -march=native, which ended up being applied to the RocksDB source files in all builds, even when I didn't want it to be used.

Compile time options

I tested the following builds for MyRocks. I wasn't able to get link time optimization working for the 5.6.35 builds. It might be possible but I stopped after trying for a few hours.

Note that in many cases with modern MySQL the use of CMAKE_BUILD_TYPE=RelWithDebInfo implies the use of link time optimization (-flto in gcc) while CMAKE_BUILD_TYPE=Release does not. With help from experts I documented that in this post. However, the only builds listed below that use link time optimization are ones with _lto in their name and that is done via -DWITH_LTO=ON.

I tested these builds for FB MySQL 5.6.35 at git sha 256826240. The CMake command lines are in the cmk.* files here:
  • rel_withdbg - CMAKE_BUILD_TYPE=RelWithDebInfo which implies -O2
  • rel_o2 - CMAKE_BUILD_TYPE=Release, explicitly set -O2
  • rel - CMAKE_BUILD_TYPE=Release which implies -O3
I tested these builds for FB MySQL 8.0.28 at git sha 8fae2bbdc. The CMake command lines are in the cmk.* files here:

  • rel_withdbg - CMAKE_BUILD_TYPE=RelWithDebInfo which implies -O2
  • rel_o2 - CMAKE_BUILD_TYPE=Release, explicitly set -O2
  • rel - CMAKE_BUILD_TYPE=Release which implies -O3
  • rel_native - CMAKE_BUILD_TYPE=Release which implies -O3, added -march=native -mtune=native
  • rel_o2_lto - CMAKE_BUILD_TYPE=Release, explicitly set -O2, added -flto
  • rel_lto - CMAKE_BUILD_TYPE=Release which implies -O3, added -flto
  • rel_native_lto - CMAKE_BUILD_TYPE=Release which implies -O3, added -march=native -mtune=native -flto
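As a sketch, the rel_native_lto build amounts to something like the following. This is an approximation, not the exact invocation; the exact command lines are in the cmk.* files linked above.

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DWITH_LTO=ON \
  -DCMAKE_C_FLAGS="-march=native -mtune=native" \
  -DCMAKE_CXX_FLAGS="-march=native -mtune=native"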
Benchmark HW

The HW is a Beelink SER 4700u with 8 AMD cores, 16G of RAM and a fast NVMe SSD, described here.

Benchmarks

The benchmarks were sysbench and the insert benchmark configured so that the database fits in memory. For some benchmark steps there were writes to storage but there were no reads from storage. 

An overview of the insert benchmark is here and here. My scripts for running it are here. It was run in three configurations: 1 client & 1 table, 4 clients & 4 tables, and 4 clients & 1 table. Each run has 6 steps:

  • l.i0 - insert 20M rows without secondary indexes
  • l.x - create 3 secondary indexes
  • l.i1 - insert 20M rows with 3 secondary indexes in place. 
  • q100 - range queries with 100 inserts/s in the background, runs for 30 minutes
  • q500 - range queries with 500 inserts/s in the background, runs for 30 minutes
  • q1000 - range queries with 1000 inserts/s in the background, runs for 1 hour

My scripts for running sysbench are here and the lua directory includes additional benchmark steps not included upstream. There are 42 Lua scripts for which I provide results. Each represents a (micro)benchmark step that was run for 10 minutes. I place them into three groups -- point, range, write -- based on the common operation done for each where point does point queries, range does range queries and write does insert/update/delete. This was repeated for two configurations: 1 table & 1 thread, 1 table & 4 threads. 
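
For illustration, a single benchmark step is invoked roughly like this, where point-query.lua stands for one of the scripts in the lua directory and the option values match the 20M-row tables and 10-minute steps used here. My helper scripts set the connection options, so this command line is an approximation:

$ sysbench point-query.lua --tables=1 --table_size=20000000 --threads=1 --time=600 run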

Results: insert benchmark

The benchmark was repeated for three configurations. A spreadsheet with the throughput for each benchmark step for each of the three configurations is here. Graphs with relative throughput for the 1 thread & 1 table configuration are below; by relative throughput I mean the throughput for a build relative to the rel_withdbg build. The y-axis starts at 0.8 to improve readability. Graphs for the 4 thread configurations are similar. Conclusions from the graphs:
  • For MyRocks with 5.6.35 the rel build usually has ~3% more throughput and with 8.0.28 the rel_native_lto build has ~5% more throughput compared to the base case (rel_withdbg).
  • The l.x benchmark step has the most variance. I ignore that for now.
  • Performance with the rel build (uses -O3) is slightly better than with the rel_o2 build (uses -O2)
  • Using link time optimization helps (see results for rel_lto)
  • Using -march=native -mtune=native helps (see results for rel_native_lto)

Results: sysbench

The benchmark was repeated for two configurations: 1 table & 1 thread, 1 table & 4 threads. In each case the table starts with 20M rows and each benchmark step ran for 10 minutes. The benchmark steps are grouped into three classes based on the dominant operation: point (for point queries), range (for range queries) and write (for insert, update & delete).

The best performance comes from the rel build in FB MySQL 5.6.35 and the rel_native_lto build in FB MySQL 8.0.28, as shown by the tables below with summary statistics. The numbers are the throughput for each build relative to the throughput for the rel_withdbg build. For all benchmark steps except scan the throughput metric is QPS; for scan it is millions of rows read/second.

The spreadsheet with the throughput, summary statistics and graphs is here.

Summary statistics for MyRocks in FB MySQL 5.6.35 with the 1 table & 1 thread configuration show that the rel build on average gets 3% more QPS on point queries, 2% more on range queries and 0% more on writes relative to the rel_withdbg build. Results for the 1 table & 4 threads configuration are similar.

fbmy5635         rel     rel_o2
Point: avg       1.03    1.01
Point: median    1.01    1.00
Point: min       0.99    0.98
Point: max       1.22    1.18
Point: stddev    0.059   0.052
Range: avg       1.02    1.01
Range: median    1.02    1.00
Range: min       1.00    0.98
Range: max       1.07    1.03
Range: stddev    0.015   0.015
Write: avg       1.00    0.98
Write: median    1.00    0.99
Write: min       0.99    0.90
Write: max       1.04    1.02
Write: stddev    0.016   0.032

Summary statistics for MyRocks in FB MySQL 8.0.28 show that the rel_native_lto build provides the best performance and the rel_lto build is second best. The rel_native_lto build on average gets 5% more QPS on point queries, 7% more on range queries and 6% more on writes.

fbmy8028         rel     rel_o2  rel_native  rel_o2_lto  rel_lto  rel_native_lto
Point: avg       0.99    0.98    0.98        1.02        1.03     1.05
Point: median    1.00    1.00    0.99        1.03        1.05     1.06
Point: min       0.83    0.82    0.83        0.91        0.83     0.83
Point: max       1.07    1.06    1.02        1.06        1.09     1.14
Point: stddev    0.052   0.052   0.047       0.037       0.067    0.074
Range: avg       1.01    0.99    1.00        1.03        1.05     1.07
Range: median    1.01    1.00    1.00        1.04        1.06     1.07
Range: min       0.99    0.96    0.99        1.00        0.99     1.05
Range: max       1.03    1.01    1.02        1.10        1.11     1.11
Range: stddev    0.013   0.013   0.012       0.026       0.029    0.017
Write: avg       1.01    1.01    1.00        1.05        1.06     1.06
Write: median    1.01    1.00    1.01        1.06        1.07     1.07
Write: min       0.99    0.99    0.99        1.01        1.03     0.98
Write: max       1.02    1.02    1.02        1.07        1.09     1.10
Write: stddev    0.010   0.012   0.011       0.020       0.019    0.032

Results: sysbench graphs

There are a few outlier benchmark steps that get much more improvement than average from the rel build with 5.6.35 and the rel_native_lto build with 8.0.28. They are visible in the graphs below. The x-axis for the graphs starts at 0.8 rather than 0 to improve readability.

For MyRocks in 5.6.35 the benchmark step names are:
  • points-covered-si_range=100 - rel does 1.22X better
  • points-notcovered-si_range=100 - rel does 1.14X better
  • range-covered-si_range=100 - rel does 1.07X better
  • read-write_range=100 - rel does 1.04X better
And for MyRocks in 8.0.28:
  • hot-points_range=100 - rel_native_lto does 1.14X better
  • point-query.pre_range=100 - rel_native_lto does 1.11X better
  • point-query_range=100 - rel_native_lto does 1.11X better
First, the graphs for MyRocks in FB MySQL 5.6.35.
Next the graphs for MyRocks in FB MySQL 8.0.28. These only have results for the rel_lto and rel_native_lto builds to improve readability.


Wednesday, March 1, 2023

Adventures in compiling MySQL: RelWithDebInfo vs Release

I am a non-expert in many build tools -- CMake for MySQL, autoconf for Postgres, scons for MongoDB and Maven for Linkbench. While working to confirm my MySQL builds are OK I used sysbench to compare several of them and was confused by the results. This is part 3 of my adventure - parts 1 and 2 are here and here.

tl;dr

  • RelWithDebInfo uses link time optimization by default
  • Release does not use link time optimization by default
  • Performance is better with link time optimization
  • Link time optimization helped point queries more than range queries or writes
Compiling MySQL from source

The options for CMAKE_BUILD_TYPE with MySQL include RelWithDebInfo and Release. On Ubuntu 22.04 with gcc the interesting differences include:
  • RelWithDebInfo uses -O2, link time optimization (-flto) and -fstack-protector-strong
  • Release uses -O3 but does not use link time optimization by default
In my benchmarks (CPU-bound sysbench) the RelWithDebInfo build gets better throughput because it uses less CPU/query, and the difference was up to 10%. Looking at statistics from perf I saw that insn per cycle was frequently ~10% better for the RelWithDebInfo build, and memory system counters were also better for it.
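
For example, counters like these can be collected with perf while a benchmark step runs. A minimal sketch, assuming mysqld is running; the event list is illustrative:

$ perf stat -e instructions,cycles -p $( pidof mysqld ) -- sleep 60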

My first guess at the root cause was the binary size from -O2 vs -O3, 54M for RelWithDebInfo vs 66M for Release using stripped mysqld binaries. But that was wrong.

Then I remembered that RelWithDebInfo used -flto while Release did not. I ignored that last week, but it turned out to explain the difference.

Why don't Release builds use link time optimization? Yura Sorokin explained this to me. In CMakeLists.txt there is this code that sets WITH_PACKAGE_FLAGS_DEFAULT to ON for RelWithDebInfo and OFF for Release. When that is set to ON then the output from dpkg-buildflags is added to compile and linker command lines (dpkg-buildflags --get $X for X in CPPFLAGS, CFLAGS, CXXFLAGS, LDFLAGS). And on Ubuntu 22.04 I see:

$ dpkg-buildflags --get CPPFLAGS

-Wdate-time -D_FORTIFY_SOURCE=2


$ dpkg-buildflags --get CFLAGS

-g -O2 -ffile-prefix-map=/home/mdcallag/git/mytools/bench/sysbench.lua/r.1tab.1thr.feb23.repro=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security


$ dpkg-buildflags --get CXXFLAGS

-g -O2 -ffile-prefix-map=/home/mdcallag/git/mytools/bench/sysbench.lua/r.1tab.1thr.feb23.repro=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security


$ dpkg-buildflags --get LDFLAGS

-Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro

Thus, I get -flto by default with RelWithDebInfo but not with Release. To enable link time optimization for Release I can do one of:
  • add -DWITH_LTO=ON to the CMake command line (what I do below)
  • add -flto to the compiler and linker flags

Results: graphs

The full spreadsheet is here and has the throughput for builds relative to a RelWithDebInfo build (QPS for build / QPS for RelWithDebInfo). LTO means link time optimization which is enabled by default for RelWithDebInfo, disabled by default for Release and enabled via WITH_LTO=ON. The builds tested are:
  • RelWithDebInfo - CMAKE_BUILD_TYPE=RelWithDebInfo, LTO, -O2
  • Release - CMAKE_BUILD_TYPE=Release, -O3, does not use LTO
  • Release+LTO - CMAKE_BUILD_TYPE=Release, LTO, -O3
  • Release+LTO+O2 - CMAKE_BUILD_TYPE=Release, LTO, -O2
  • Release+LTO+Native - CMAKE_BUILD_TYPE=Release, LTO, -march=native, -mtune=native
The spreadsheet has results for all builds. The graphs only show QPS for Release and Release+LTO relative to RelWithDebInfo to improve readability. From Release+LTO I see that LTO benefits point queries more than range queries or writes, and that is also clear in the summary statistics displayed in the following section. Also, range queries and writes each have two outliers for which LTO had a large benefit; they are visible in the graphs. The graphs show the relative QPS for each build, which is (QPS for the build / QPS for RelWithDebInfo).

Results: summary statistics

This section has summary statistics: the average, median, min and max value for each of the benchmark types (point, range and write) for each of the builds using the relative QPS (QPS for the build / QPS for RelWithDebInfo build). From the Average column it is clear that LTO benefits point queries more than range queries or writes. It is also clear that the use of LTO improves performance.

Release
         Point   Range   Write
Average  0.97    0.95    0.93
Median   0.98    0.96    0.93
Min      0.91    0.91    0.90
Max      1.00    0.99    0.98

Release+LTO
         Point   Range   Write
Average  1.13    1.06    1.13
Median   1.15    1.04    1.07
Min      1.02    0.97    1.02
Max      1.20    1.21    1.45

Release+LTO+O2
         Point   Range   Write
Average  1.09    1.03    1.09
Median   1.10    1.02    1.04
Min      1.01    0.97    0.99
Max      1.14    1.16    1.33

Release+LTO+Native
         Point   Range   Write
Average  1.13    1.04    1.14
Median   1.15    1.03    1.07
Min      1.04    0.93    1.03
Max      1.19    1.20    1.45

CMake command lines

RelWithDebInfo
cmake .. \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
  -DMYSQL_MAINTAINER_MODE=0 \
  -DENABLED_LOCAL_INFILE=1 \
  -DCMAKE_INSTALL_PREFIX=$1 \
  -DWITH_BOOST=$PWD/../boost \
  -DWITH_NUMA=ON \
  -DWITH_ROUTER=OFF \
  -DWITH_MYSQLX=OFF \
  -DWITH_UNIT_TESTS=OFF
Release
BF=" -g1 "
CF=" $BF "
CXXF=" $BF "

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
  -DMYSQL_MAINTAINER_MODE=0 \
  -DENABLED_LOCAL_INFILE=1 \
  -DCMAKE_INSTALL_PREFIX=$1 \
  -DWITH_BOOST=$PWD/../boost \
  -DCMAKE_CXX_FLAGS="$CXXF" -DCMAKE_C_FLAGS="$CF" \
  -DWITH_NUMA=ON \
  -DWITH_ROUTER=OFF \
  -DWITH_MYSQLX=OFF \
  -DWITH_UNIT_TESTS=OFF
Release+LTO
BF=" -g1 "
CF=" $BF "
CXXF=" $BF "

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
  -DMYSQL_MAINTAINER_MODE=0 \
  -DENABLED_LOCAL_INFILE=1 \
  -DCMAKE_INSTALL_PREFIX=$1 \
  -DWITH_BOOST=$PWD/../boost \
  -DCMAKE_CXX_FLAGS="$CXXF" -DCMAKE_C_FLAGS="$CF" \
  -DWITH_LTO=ON \
  -DWITH_NUMA=ON \
  -DWITH_ROUTER=OFF -DWITH_MYSQLX=OFF -DWITH_UNIT_TESTS=OFF
Release+LTO+O2
BF=" -g1 "
CF=" $BF "
CXXF=" $BF "

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
  -DMYSQL_MAINTAINER_MODE=0 \
  -DENABLED_LOCAL_INFILE=1 \
  -DCMAKE_INSTALL_PREFIX=$1 \
  -DWITH_BOOST=$PWD/../boost \
  -DCMAKE_CXX_FLAGS="$CXXF" -DCMAKE_C_FLAGS="$CF" \
  -DCMAKE_C_FLAGS_RELEASE="-O2 -DNDEBUG" \
  -DCMAKE_CXX_FLAGS_RELEASE="-O2 -DNDEBUG" \
  -DWITH_LTO=ON \
  -DWITH_NUMA=ON \
  -DWITH_ROUTER=OFF -DWITH_MYSQLX=OFF -DWITH_UNIT_TESTS=OFF
Release+LTO+Native
BF=" -march=native -mtune=native -g1 "
CF=" $BF "
CXXF=" $BF "

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
  -DMYSQL_MAINTAINER_MODE=0 \
  -DENABLED_LOCAL_INFILE=1 \
  -DCMAKE_INSTALL_PREFIX=$1 \
  -DWITH_BOOST=$PWD/../boost \
  -DCMAKE_CXX_FLAGS="$CXXF" -DCMAKE_C_FLAGS="$CF" \
  -DWITH_LTO=ON \
  -DWITH_NUMA=ON \
  -DWITH_ROUTER=OFF -DWITH_MYSQLX=OFF -DWITH_UNIT_TESTS=OFF
Compiler command lines

This section lists the interesting diffs in the compiler command lines for mysqld.cc.
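
One way to capture such command lines (a sketch of the method, assuming a make-based build; the build directory names are borrowed from the -ffile-prefix-map values below): run make with VERBOSE=1 in each build directory, save the output, keep the lines that compile mysqld.cc, then diff the two results.

$ make VERBOSE=1 > build.log 2>&1
$ grep mysqld.cc build.log > cmds.txt
$ diff build.rel_withdbg/cmds.txt build.rel/cmds.txt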

RelWithDebInfo vs Release
< -D_FORTIFY_SOURCE=2
< -ffat-lto-objects
< -ffile-prefix-map=/foobar/build.rel_withdbg=.
< -flto=auto
< -fstack-protector-strong
< -g

< -O2
---
> -O3
RelWithDebInfo vs Release+LTO
< -D_FORTIFY_SOURCE=2
< -ffat-lto-objects
< -ffile-prefix-map=/foobar/build.rel_withdbg=.
< -fstack-protector-strong
< -g

< -flto=auto
---
> -flto


< -O2
---
> -O3
RelWithDebInfo vs Release+LTO+O2
< -D_FORTIFY_SOURCE=2
< -ffat-lto-objects
< -ffile-prefix-map=/foobar/build.rel_withdbg=.
< -fstack-protector-strong
< -g

< -flto=auto
---
> -flto
RelWithDebInfo vs Release+LTO+Native
< -D_FORTIFY_SOURCE=2
< -ffat-lto-objects
< -ffile-prefix-map=/foobar/build.rel_withdbg=.
< -fstack-protector-strong
< -g
---
> -march=native
> -mtune=native

< -flto=auto
---
> -flto

< -O2
---
> -O3