Wednesday, November 30, 2022

Compiling MySQL 5.6 & 5.7 on Ubuntu 22.04

One of my hobbies is testing open source DBMS for CPU regressions and for that I want to compare perf between old and new versions of the DBMS. Depending on the DBMS it can be a challenge to build the old DBMS with the current (modern) compiler toolchain. 

Using open source frequently means compiling from source and compiling from source eventually means debugging a failed build. Alas, the proliferation of build tools means you are likely to be debugging a build tool you know little about. For me that included svn+MongoDB, cmake+MySQL, make/configure+MySQL, mvn+Linkbench, mvn+First_Robotics and make+RocksDB. Perhaps my debugging would be easier if there weren't as many build tools.

Postgres might be an exception WRT compiling old versions - it works great. Alas, this isn't as easy with MySQL versions 5.6.51 and 5.7.39. Note that MySQL 5.6 reached end of life in 2021 but 5.7 doesn't reach that until next year. For MySQL 5.6, 5.7 and perhaps some 8.0 releases prior to 8.0.31, there is a common error -- cmake fails with a message about being unable to find ssl and a suggestion to install it. 
CMake Error at cmake/ssl.cmake:63 (MESSAGE):
  Please install the appropriate openssl developer package.

Call Stack (most recent call first):
  cmake/ssl.cmake:306 (FATAL_SSL_NOT_FOUND_ERROR)
  CMakeLists.txt:603 (MYSQL_CHECK_SSL)
This occurs on Ubuntu 22.04.1 with gcc 11.3.0 and OpenSSL was definitely installed, and used when I compiled MySQL 8.0.31 from source. The problem is there are several conditions that trigger the error message and one of them occurs when the version number is wrong or the version number cannot be parsed from a header. So a better error message would make debugging easier in the future.

For 5.7.39 I fixed the error by backporting code from 8.0.31 that parses the SSL version numbers from openssl/openslv.h. A diff for that is here. Perhaps the real fix needed is smaller but I was in copy/paste mode given my lack of cmake skills.

For 5.6.51 I fixed the error in the same way and then added --std=c++11 to the CXX flags because some code uses the register keyword which is no longer a thing in C++ 2017. A diff for that is here. Output from the compiler error before --std=c++11 was added is here.

Updates:
  • Filed bug 109251, must stay ahead of Vilnius DB in the bug reporter contest
  • Compiling FB MySQL 5.6.35 was even more fun, see below
Compiling FB MySQL 5.6.35

A diff to make this work is here. I had to modify cmake/ssl.cmake as explained above and storage/rocksdb/get_rocksdb_files.sh.  The cmake command line is here.

I also did cd rocksdb; make static_lib to generate rocksdb/util/build_version.cc because the automation to do that was not working, but after I applied the diff listed above that wasn't needed.

Compiling FB MySQL 8.0.28

I needed a small diff to edit storage/rocksdb/get_rocksdb_files.sh similar to what was done for FB MySQL 5.6.35 above. The cmake command line is here.

While boost is installed by git submodule update I am not sure that works for my OSS builds and/or cmake command line (-DWITH_BOOST=$PWD/../boost). So I just do:
  cd $SRC_ROOT; mv boost boost.up; cp -p -r ~/mysql-8.0.28/boost .

Tuesday, November 29, 2022

Insert benchmark: Postgres, InnoDB and MyRocks with low concurrency

This has results for the insert benchmark using Postgres, InnoDB and MyRocks. For an overview of the insert benchmark see here and here. Some information on the performance summaries generated by my test scripts is here. I used small servers and ran the test at low concurrency (1 or 2 threads) for cached and IO-bound workloads. The insert benchmark has several phases and the interesting phases are: insert-only without secondary indexes, insert-only with 3 secondary indexes and then range queries with rate-limited inserts.

Performance reports are provided for:

Disclaimer - I used a small server and will soon repeat this on larger servers and with more concurrency.

tl;dr

  • Postgres is boring: no CPU regressions from version 12 to 15 for cached and IO-bound workloads.
    • The insert benchmark found a CPU regression in 15beta1 which was quickly fixed.
  • InnoDB with a cached workload has large CPU regressions from 5.6 to 8.0.
    • This is visible here. By large I mean that the insert and query rates for InnoDB in 8.0.31 are ~64% and ~75% of the rates in 5.6. The root cause is new CPU overhead in code other than the storage engine.
  • MyRocks with a cached workload has large CPU regressions from 5.6 to 8.0
    • This is visible here. The regression for insert-only without secondary indexes is similar to InnoDB. The regression for range queries is smaller than the one above for InnoDB (throughput declines by ~8%). The root cause is new CPU overhead in upstream MySQL.
  • InnoDB with an IO-bound workload has large CPU regressions in the insert-only without secondary index phase, which is normally CPU-bound. See here.
    • The regression is similar to what I describe above for the cached workload. Throughput for the insert-only with secondary index phase and range queries with rate-limited inserts phase is better in 8.0.31 than 5.6.
  • MyRocks with an IO-bound workload has CPU regressions that are similar to MyRocks with a cached workload. See here.

Friday, November 18, 2022

SSD read response time: raw device vs a filesystem

I am trying to understand why 4kb random reads from an SSD are about 2X slower when using a filesystem vs a raw device and this reproduces across different servers but I am only sharing the results from my home Intel NUCs. The symptoms are, the read response time is:

  • ~2X larger per iostat's r_await for reads done via a filesystem vs a raw device (.04 vs .08 millisecs)
  • ~3X larger per blkparse for reads done via a filesystem vs a raw device (~16 vs 50+ microsecs)
Once again, I am confused. Is something below user land doing (if from-filesystem -> go-slow). In theory, if the D->C transition reported by blkparse includes a lot of CPU overhead from the filesystem then that might explain this, but the CPU per IO overhead isn't large enough for that to be true here.

My test scripts for this are rfio.sh and do_fio.sh.

Update - the mystery has been solved thanks to advice from an expert (Andreas Freund) who is in my Twitter circle, and engaging with experts I would never get to meet in real life is why I use Twitter. I have been running the raw device tests with the device mostly empty and the fix is to run the test with it mostly full otherwise the SSD firmware can do something special (and faster) when reading data that was never written.

blktrace + blkparse

I used fio for 3 configurations: raw device, O_DIRECT and buffered IO. For O_DIRECT and buffered IO the filesystem is XFS and there were 8 20G files on a server that has 16G of RAM. The results for O_DIRECT and buffered IO were similar.

The results I discuss in the next 2 paragraphs are from fio run with --numjobs=1.

Output from blkparse for a few IOs are here. The lifecycle of an IO starts with state Q (queued) and ends with state C (completed). The number in parentheses at the end of the line for state C is the response time in microseconds. The RWBS field has RA for buffered IO (R = read, A = possible readahead) and R for O_DIRECT and raw. The timestamp format is seconds.nanoseconds and I pasted examples from ~18 seconds into the measurement. Most of the time for an IO request occurs between state D (dispatch to device) and C (completed). The blkparse man page is here.

From the examples I shared the response time is ~15 microseconds for raw and 50+ microseconds for O_DIRECT and buffered IO. Averaging the values from all samples collected over 20 seconds shows the average was 15 microseconds for raw vs ~73 microseconds for O_DIRECT and buffered. The read request size is 4096 in all cases (see + 8 in the blkparse output).

The sector offsets in the blkparse output are all a multiple of 8 (8 x 512 == 4096) so the requests have that in common between raw, O_DIRECT and buffered.

fio

Command lines and performance metrics from fio are here. I ran fio for numjobs in 1, 2, 4, 8, 16, 32, 48 and 64. A summary of the results:
And results after making the fix described in Update above, the results from raw are only slightly better than O_DIRECT, while buffered gets a better throughput number because it benefits from some hits in the OS page cache.