Tuesday, January 17, 2023

clang, gcc & compiler flags vs the insert benchmark & ARM

I have a large set of results from the Insert Benchmark run on AWS servers that use ARM CPUs (c7g). But before sharing them I wanted to make sure my builds (Postgres and MySQL compiled from source) weren't ruined by using the wrong compiler flags. By wrong I mean that the default flags give you a build optimized for an older version of the ARM architecture, while the flags that I tried optimize it for something more modern. So I ran a smaller batch of tests to compare the impact of compiler flags for builds with both gcc and clang on ARM (c7g) instance types.

My goal is to determine whether my larger set of results can be trusted, because those binaries were compiled with the flags that MySQL and PG choose by default. My special builds that used things like -march=native, -mcpu=native and -mtune=native didn't get significantly better performance. I am in favor of using the =native options.

The conclusion is that my builds were OK and I didn't detect a large benefit from using CPU specific compiler flags. The disclaimer is that my methods won't detect small differences -- some of the tests are short running and I didn't repeat the benchmark enough times to detect such small differences. Regardless, I was happy to learn that my results weren't ruined by using the wrong compiler flags.

Servers

I used c7g instance types (c7g.2xlarge, c7g.8xlarge) on AWS with Ubuntu 22.04. The compilers were clang 14.0.0-1ubuntu1 and gcc 11.3.0-1ubuntu1~22.04.

The c7g.2xlarge server has 8 cores, 16G RAM and I ran the insert benchmark at 1 thread. Storage was EBS (io2, 256G, 10k IOPs).

The c7g.8xlarge server has 32 cores, 64G RAM and I ran the insert benchmark at 8 and 16 threads. Storage was EBS (io2, 2000G, 49k IOPs).

Compiling

I used Postgres 15.1 and MySQL 8.0.31.  By MySQL I mean upstream.

A nice summary of -mtune, -mcpu and -march compiler options is here.

Postgres + gcc was compiled three ways. Later in this post I call them gcc.default, gcc.arch.native and gcc.cpu.native.

1) configure --prefix=$pfx --enable-debug CFLAGS="-O2 -fno-omit-frame-pointer"
2) configure --prefix=$pfx --enable-debug CFLAGS="-O2 -fno-omit-frame-pointer -march=native"
3) configure --prefix=$pfx --enable-debug CFLAGS="-O2 -fno-omit-frame-pointer -mcpu=native"
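The three gcc variants differ only in the last CFLAGS entry, so they can be generated from one loop. This is a minimal sketch rather than the script used for these builds: `pg_configure_cmd` and `extra_flag` are hypothetical helper names, and giving each build its own subdirectory under the prefix is an assumption. It prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: print the configure command for each gcc build variant.
# pg_configure_cmd, extra_flag and the <prefix>/<name> layout are hypothetical.

# Map a build name to the CPU flag it adds (empty for the default build).
extra_flag() {
  case "$1" in
    gcc.default)     echo "" ;;
    gcc.arch.native) echo "-march=native" ;;
    gcc.cpu.native)  echo "-mcpu=native" ;;
    *) echo "unknown build: $1" >&2; return 1 ;;
  esac
}

# Print the configure command for build $1 installed under prefix $2.
pg_configure_cmd() {
  name=$1
  prefix=$2
  flag=$(extra_flag "$name") || return 1
  cflags="-O2 -fno-omit-frame-pointer"
  if [ -n "$flag" ]; then
    cflags="$cflags $flag"
  fi
  printf './configure --prefix=%s/%s --enable-debug CFLAGS="%s"\n' \
    "$prefix" "$name" "$cflags"
}

for b in gcc.default gcc.arch.native gcc.cpu.native; do
  pg_configure_cmd "$b" /path/to/pfx
done
```

Keeping the flag mapping in one place makes it harder for a build name and its flags to drift apart.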

Postgres + clang was also compiled three ways and I call them clang.default, clang.cpu.native and clang.tune.native. I used -mtune=native for clang vs -mcpu=native for gcc because clang doesn't support -mcpu=native.

1) configure CC=/usr/bin/clang --prefix=$pfx --enable-debug CFLAGS="-O2 -fno-omit-frame-pointer"
2) configure CC=/usr/bin/clang --prefix=$pfx --enable-debug CFLAGS="-O2 -fno-omit-frame-pointer -march=native"
3) configure CC=/usr/bin/clang --prefix=$pfx --enable-debug CFLAGS="-O2 -fno-omit-frame-pointer -mtune=native"

For the gcc and clang builds with Postgres, ARM related output and CFLAGS are here.
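One way to see what a =native flag resolves to on a given host is gcc's `-Q --help=target` output, which lists target options along with the values in effect. This is a minimal sketch that assumes gcc; `gcc_native_target` is a hypothetical helper name, and this is not necessarily how the linked output was produced.

```shell
#!/bin/sh
# Sketch: show the -march/-mcpu/-mtune values gcc selects for a given flag.
# gcc_native_target is a hypothetical helper, not part of any tool.
gcc_native_target() {
  cc=${1:-gcc}              # compiler to query
  flag=${2:--march=native}  # flag whose effect we want to inspect
  # -Q --help=target prints target-specific options and their current values;
  # keep only the three options discussed in this post.
  "$cc" "$flag" -Q --help=target 2>/dev/null | grep -E -- '-m(arch|cpu|tune)='
}

# Usage (run on the build host):
#   gcc_native_target gcc -march=native
#   gcc_native_target gcc -mcpu=native
```

Comparing the output for the default build against a =native build shows whether the flag changed the architecture that gcc targets.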

MySQL + gcc was compiled three ways that I call gcc.default, gcc.arch.native and gcc.cpu.native, similar to what was done for Postgres. Perhaps my cmake skills are weak, but I ended up editing configure.cmake via diffs like this to get those binaries. ARM related output and CFLAGS/CXXFLAGS for each build are here.

MySQL + clang was compiled two ways that I call clang.default and clang.cpu.native. My configure.cmake hack didn't work for -mtune=native so there is no clang.tune.native build. ARM related output and CFLAGS/CXXFLAGS for each build are here.

The MySQL builds were done via cmake like:

cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DWITH_SSL=system -DWITH_ZLIB=bundled -DMYSQL_MAINTAINER_MODE=0 -DENABLED_LOCAL_INFILE=1 -DCMAKE_INSTALL_PREFIX=$1 -DWITH_BOOST=$PWD/../boost -DWITH_NUMA=ON 
CC=/usr/bin/clang CXX=/usr/bin/clang++ cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DWITH_SSL=system -DWITH_ZLIB=bundled -DMYSQL_MAINTAINER_MODE=0 -DENABLED_LOCAL_INFILE=1 -DCMAKE_INSTALL_PREFIX=$1 -DWITH_BOOST=$PWD/../boost -DWITH_NUMA=ON
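CMake also accepts extra compiler flags on the command line through its standard `CMAKE_C_FLAGS` and `CMAKE_CXX_FLAGS` variables, which might be an alternative to editing configure.cmake. This is an untested sketch and not what was done for the builds above: it has not been verified that MySQL 8.0.31 ends up with the same flags as the configure.cmake edit, and `mysql_cmake_cmd` is a hypothetical helper that only prints the command. The other -D options from the commands above are left out for brevity.

```shell
#!/bin/sh
# Sketch: print a cmake command that passes a CPU flag via CMAKE_C_FLAGS and
# CMAKE_CXX_FLAGS. mysql_cmake_cmd is hypothetical; the builds in this post
# used an edited configure.cmake instead.
mysql_cmake_cmd() {
  prefix=$1
  cpuflag=$2   # e.g. -march=native, or empty for the default build
  cmd="cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_INSTALL_PREFIX=$prefix"
  if [ -n "$cpuflag" ]; then
    cmd="$cmd -DCMAKE_C_FLAGS=$cpuflag -DCMAKE_CXX_FLAGS=$cpuflag"
  fi
  printf '%s\n' "$cmd"
}

mysql_cmake_cmd /path/to/install -march=native
```

If this route is tried, checking the CFLAGS/CXXFLAGS that cmake reports is the way to confirm the flag reached the compiler.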

Benchmark

For an overview of the insert benchmark see here and here. I only use the cached workload here and the database fits in the DBMS buffer pool. The read+write benchmark steps were run for 1 hour each. Using terminology from the Benchmark section of my previous post:

  • for c7g.2xlarge - X=20M, Y=20M and there was 1 client thread
  • for c7g.8xlarge - X=75M, Y=25M and the benchmark was repeated for 8 and 16 threads

For the c7g.2xlarge server I used the x7 config file for Postgres and y8 config file for InnoDB. For the c7g.8xlarge server I used the x7_c32r64_50kio config file for Postgres and y8_c32r64_50kio config file for MySQL. Note that I have to edit malloc-lib in the MySQL config files for ARM and add the fix for bug 109429.

A spreadsheet with results (QPS by benchmark step) is here. There are 3 sheets for: c7g.2xlarge with 1 thread, c7g.8xlarge with 8 threads, and c7g.8xlarge with 16 threads. For l.i0 and l.i1 the spreadsheet lists the insert rate. For q100, q500 and q1000 the spreadsheet lists the query rate. Note that l.i0 and l.i1 are short running (not great, I have a replacement for the insert benchmark in progress). Still, the results are similar regardless of the compiler flags.
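To compare builds, one option is to express each build's rate for a benchmark step as a ratio to the default build, so that values near 1.00 mean no meaningful difference. This is a minimal sketch: `rel_rate` is a hypothetical helper, and the numbers in the usage comment are placeholders rather than measured results.

```shell
#!/bin/sh
# Sketch: express one build's rate relative to the default build's rate.
# rel_rate is a hypothetical helper; real inputs would come from the spreadsheet.
rel_rate() {
  awk -v base="$1" -v other="$2" 'BEGIN {
    if (base + 0 == 0) {
      print "base rate must be non-zero" > "/dev/stderr"
      exit 1
    }
    # Ratio > 1 means the other build was faster than the default build.
    printf "%.2f\n", other / base
  }'
}

# Usage with placeholder numbers (not measured results):
#   rel_rate 1000 1020
```

Given the disclaimer above about short-running steps and few repetitions, small deviations from 1.00 should be read as noise.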

The benchmark steps are:

  • l.i0 - load X (see above) rows in PK order, table only has a PK index
  • l.i1 - load Y (see above) rows in PK order, table has a PK and 3 secondary indexes
  • q100 - each thread does short range queries, and is paired with a background thread that does 100 inserts/s
  • q500 - same as q100, but background thread does 500 inserts/s
  • q1000 - same as q100, but background thread does 1000 inserts/s

3 comments:

  1. I have found that feedback compilation has a major impact on MySQL Server performance; I saw a 20% improvement from it before. Unfortunately newer GCC versions (going from 8 to 10) worsened the performance by 5%. I haven't had time to investigate, but feel pretty confident that the problem is that newer GCC is too aggressive in inlining. So it should be possible to regain the lost performance with some compiler flags. Modern CPUs haven't invented so many new ways of running code, so I gather your results are somewhat expected, but they are sure getting faster :)

    1. Thanks. I have seen good results from them for production workloads. For benchmarks I have yet to use them because time spent on that leaves less time for other things. Also, I am wary of a binary optimized for one benchmark when production is more varied. But I have yet to figure out whether my concerns are valid.
    2. I have seen great results from feedback for production workloads. I am wary of using them for benchmarks because the benchmark workloads have less variety and I worry this will distort results, but I have yet to verify that. Or maybe I just don't want the time & complexity of adding this to my build process because if I spend time on this, I will have less time to spend on other things.
