Thursday, September 26, 2024

InnoDB code bloat in MySQL 8.0 makes me sad

InnoDB uses a lot more CPU per query in MySQL 8.0, which makes it a lot slower for low-concurrency workloads that are CPU-bound. That is offset by improvements that reduce mutex contention, so it might not be slower for high-concurrency workloads.

It would be great to get the reduction in mutex contention without the increase in CPU overhead. Alas, I don't think that is ever going to happen and this makes me sad. I doubt the performance regressions for InnoDB in MySQL 8.0 will ever be fixed and the workarounds are: MyRocks, MariaDB, MySQL 5.7 forever, Postgres.

I explained in a recent post that the problems are code bloat:

  • MySQL 8.0 uses more instructions per query
  • MySQL 8.0 wastes more time on cache and TLB activity per query 
Another recent post shows that the MySQL binary is growing at an alarming rate. This post has more details with a focus on InnoDB. Again, this makes me sad. InnoDB was very good to me for a long time but that time is ending.

I have spent much time trying to find a way to undo some of the bloat and have learned, multiple times, that I can't undo the damage via compile-time options, even though there are a few options that might make things 5% to 20% faster. But some of those options also make older MySQL faster, so they don't close the gap.

Updates
  • v1 of this post claimed that InnoDB started to use Boost in MySQL 8.0.2. That is incorrect; it started to use STL (unordered_map, array) in 8.0.2
  • see Bloat in 8.0.2 from STL below for details on the impact of the diff that adds STL in 8.0.2
Bloat in 8.0.2 from STL

While I suspect that the new usage of STL by InnoDB in fil0fil.cc contributes to the code bloat that I report below, and there is a big jump in the size of fil0fil.o from 8.0.1 to 8.0.2, it is difficult to prove that STL is one of the causes. Even the results here don't prove it because the diff that adds STL in 8.0.2 is a large squash merge with many other changes.

I compiled MySQL at the diff that first uses STL in fil0fil.cc and the one prior to it:
  • 201b2b20d1
    • adds the usage of STL. But it is a squash merge that combines many commits. So STL remains a suspect but I am still not certain. Regardless, the results show that things grow a lot from this commit.
  • 817379925c
    • the diff prior to 201b2b20d1. 
The patches that I had to apply to get these to compile and the CMake command line are here. Too bad it is hard to automate the search for bloat because all of these builds need patches to compile. I won't blame C++ for this because I don't need patches to compile old versions of RocksDB.
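If the builds didn't need per-commit patches, automating the search would be a small loop like this sketch. The commit hashes are the ones listed below; the build options and the location of libinnobase.a and fil0fil.cc.o inside the build tree are assumptions:

  for hash in 817379925c 201b2b20d1; do
    git checkout $hash
    # the per-commit patches would have to be applied here, otherwise these versions don't compile
    cmake -S . -B build-$hash -DCMAKE_BUILD_TYPE=Release
    cmake --build build-$hash -j16
    # report the sizes that matter for the bloat comparison
    find build-$hash -name libinnobase.a -o -name fil0fil.cc.o | xargs ls -l
  done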

The conclusion, from 817379925c to 201b2b20d1:
  • libinnobase.a grows by ~5M from 26492K to 31518K
  • fil0fil.o grows by 2.6X from 548K to 1440K
A few things that don't have a big impact

A short summary of things I tried:
  • MySQL 5.7 started to use -fPIC while compiling the server. That was not used in 5.6. Alas, switching from -fPIC to -fpic doesn't have a big impact.
  • Avoiding -fPIC via -DDISABLE_SHARED (in the cases where that doesn't break the build) also doesn't have a big impact. 
  • Compiling with -DWITH_LTO=ON makes things ~5% faster (a hedged build sketch follows this list)
  • PGO can be a big deal, but needs more evaluation to understand whether I need a PGO build per workload, which is costly to maintain
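A hedged sketch of a Release build with LTO enabled; whether -DWITH_LTO and -DDISABLE_SHARED are supported depends on the MySQL version, so treat the flags and paths as assumptions:

  cmake /path/to/mysql-src -DCMAKE_BUILD_TYPE=Release \
    -DWITH_LTO=ON \
    -DCMAKE_C_FLAGS="-O2 -fno-omit-frame-pointer" \
    -DCMAKE_CXX_FLAGS="-O2 -fno-omit-frame-pointer"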
Measuring code bloat

InnoDB started to use STL in MySQL 8.0.2-dmr and that is likely part of the code-bloat problem (see the update above).

Here I measure bloat using several methods. I am sure some of them are more flawed than others but they all show much bloat in MySQL 8.0. The methods are below, with a sketch of the measurement commands after the list:
  • lines of code (yes, this is far from perfect)
  • size of libinnobase.a for MySQL compiled with CMAKE_BUILD_TYPE=Release
  • size of object files for MySQL compiled with CMAKE_BUILD_TYPE=Release
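To make the three methods concrete, this is a minimal sketch of how each can be measured. The source and build paths are assumptions and cloc is just one of several line-counting tools:

  # lines of code for InnoDB
  cloc storage/innobase
  # size of libinnobase.a from a CMAKE_BUILD_TYPE=Release build directory
  find build -name libinnobase.a -exec ls -lh {} \;
  # size of individual object files, for example the one for fil0fil.cc
  find build -name 'fil0fil.cc.o' -exec size {} \;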
From the files I checked, fil0fil.cc grew the most
Up and up, keep on growing!
Based on the size of object files there is much innovation in fil0fil.cc. There is a big jump from 8.0.1 to 8.0.2 -- STL (unordered_map, array) is first used in 8.0.2, and there were many other changes.
There is small growth up to 8.0.28 and then things take off. If 8.0.40 really fixes bug 111538 then some of this will be undone.

Tuesday, September 24, 2024

Configuration options that make writes faster or slower with MySQL and Postgres

This post is inspired by a comment about a recent blog post I shared. While I don't focus on MySQL vs Postgres comparisons, I sometimes share them. In my benchmarks with CPU-bound workloads, and with fsync disabled (which makes them even more CPU-bound), the general result was that insert throughput was much better with Postgres than with MySQL+InnoDB.

The commenter stated that Postgres and MySQL+InnoDB get similar insert rates when the MySQL binlog is disabled. I rarely test such a config because I never ran MySQL in production without the binlog enabled. Regardless, the commenter is a smart person so I decided to run a few more tests.

Some of the configurations I test here are great for performance but lousy for durability so don't take anything here as tuning advice.

tl;dr

  • MySQL+InnoDB suffers more than Postgres from updates that require index maintenance
  • For MySQL with the binlog disabled, insert throughput gets close to the Postgres results, but not many deployments run primaries with the binlog disabled. The gap is much larger, in favor of Postgres, for updates
  • When fsyncs are enabled the difference is much larger, in favor of Postgres. One reason is that MySQL must do an fsync for the binlog and InnoDB redo while Postgres only does it for the WAL. But there is more going on here and the fsync impact doesn't explain all of the performance differences.

Configurations tested

I used MySQL 8.0.39 and Postgres 17rc1 both compiled from source. The config files are here.

For MySQL I used 12 configurations:

name            binlog  fsync   InnoDB doublewrite
bl0_sync0_dw0   off     off     off
bl0_sync0_dw1   off     off     on
bl0_sync0_dwDO  off     off     DETECT_ONLY
bl0_sync1_dw0   off     on      off
bl0_sync1_dw1   off     on      on
bl0_sync1_dwDO  off     on      DETECT_ONLY
bl1_sync0_dw0   on      off     off
bl1_sync0_dw1   on      off     on
bl1_sync0_dwDO  on      off     DETECT_ONLY
bl1_sync1_dw0   on      on      off
bl1_sync1_dw1   on      on      on
bl1_sync1_dwDO  on      on      DETECT_ONLY

For Postgres I used 12 configurations on the medium server and 8 on the small server (I skipped the 4 that use LZ4 for wal_compression):

name                    fsync   wal_compression full_page_writes
sync0_comp0_fpi0        off     off             off
sync0_comp0_fpi1        off     off             on
sync0_comp1_fpi0        off     pglz            off
sync0_comp1_fpi1        off     pglz            on
sync0_compLZ4_fpi0      off     lz4             off
sync0_compLZ4_fpi1      off     lz4             on
sync1_comp0_fpi0        on      off             off
sync1_comp0_fpi1        on      off             on
sync1_comp1_fpi0        on      pglz            off
sync1_comp1_fpi1        on      pglz            on
sync1_compLZ4_fpi0      on      lz4             off
sync1_compLZ4_fpi1      on      lz4             on

And a few notes:
  • the high durability configuration for MySQL is bl1_sync1_dw1 but that means there will be two fsyncs per commit (one for binlog, one for InnoDB redo)
  • the InnoDB doublewrite options (disabled, enabled, DETECT_ONLY) don't impact performance here. They will matter on workloads that are heavier on IO.
  • by fsync enabled I mean fsync (or fdatasync) is done for any logs that are enabled (Postgres WAL, InnoDB redo, MySQL binlog); enabled for InnoDB means innodb_flush_log_at_trx_commit=1 while disabled means =2 (see the config excerpts after this list)
  • for Postgres I use wal_sync_method=fdatasync but below I often refer to that as fsync
  • for Postgres fsync enabled/disabled is the value for synchronous_commit
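Hedged excerpts of the relevant options, assuming typical option names for these versions (the full config files are linked above):

# MySQL (my.cnf)
innodb_flush_log_at_trx_commit=1    # 1 = fsync redo at commit, 2 = fsync disabled
sync_binlog=1                       # 1 = fsync binlog at commit, 0 = fsync disabled
skip_log_bin                        # only for the bl0_* configs
innodb_doublewrite=DETECT_ONLY      # or ON / OFF

# Postgres (postgresql.conf)
synchronous_commit=on               # off for the sync0_* configs
wal_compression=lz4                 # off, pglz or lz4
full_page_writes=on                 # off for the fpi0 configs
wal_sync_method=fdatasync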

Hardware

I used two server types:

  • small
    • Beelink SER7 7840HS with Ryzen 7 7840HS, 8 AMD cores, AMD SMT disabled, 32G RAM, 1T m.2 SSD for OS, an NVMe device for the database using ext4 and Ubuntu 22.04.
  • medium
    • A c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU, 64G RAM and SMT disabled so there are 16 cores. It uses Ubuntu 22.04 and storage is ext4 (data=writeback) using SW RAID 0 over 2 locally attached NVMe devices.
I haven't measured it on either the small or medium server but fdatasync latency might not be excellent on the medium server and definitely isn't excellent on the small server.

Benchmark

I used sysbench and my usage is explained here. But I only ran a few of the tests from it rather than the full 40+. The ones for which I report performance are insert, update-index and update-noindex. Note that update-index means that index maintenance is required and update-noindex means that it is not. 

For the small server the test is run with 1 thread, 1 table and 50M rows. For the medium server it is run with 12 threads, 8 tables and 10M rows per table.
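For reference, this is a rough approximation of the three tests using the stock sysbench Lua scripts (my scripts differ), shown for the small-server shape; the connection options are assumptions:

sysbench oltp_insert           --mysql-user=root --mysql-db=test --tables=1 --table-size=50000000 --threads=1 prepare
sysbench oltp_insert           --mysql-user=root --mysql-db=test --tables=1 --table-size=50000000 --threads=1 --time=600 run
sysbench oltp_update_index     --mysql-user=root --mysql-db=test --tables=1 --table-size=50000000 --threads=1 --time=600 run
sysbench oltp_update_non_index --mysql-user=root --mysql-db=test --tables=1 --table-size=50000000 --threads=1 --time=600 run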

Results: small server

The columns list the throughput for the insert (ins), update-index (upd-ix) and update-noindex (upd-nix).

When fsync is disabled

  • MySQL throughput with the binlog on is (57%, 83%, 54%) of throughput with the binlog off for (insert, update-index, update-noindex), comparing bl1_sync0_dw1 to bl0_sync0_dw1. The result for update-index is the most interesting; it looks like the issue is the low QPS for update-index even without the binlog.
  • The various choices for the InnoDB doublewrite buffer (off, on, DETECT_ONLY) don't have an impact here. They will have an impact on workloads that do more write-back and I am happy that InnoDB now supports DETECT_ONLY.
  • MySQL with the binlog off gets (80%, 20%, 62%) of the throughput relative to Postgres using bl0_sync0_dw1 vs sync0_comp0_fpi1. The low value for update-index with MySQL supports what I wrote above in the first bullet point.
  • MySQL with the binlog on gets (46%, 17%, 33%) of the throughput relative to Postgres using bl1_sync0_dw1 vs sync0_comp0_fpi1. Most MySQL primaries will run with the binlog enabled.
  • Using Postgres with full_page_writes=on (recommended) cuts throughput in half for the update tests and by ~30% for the insert test when wal_compression uses pglz. The impact is much less from lz4, so just use lz4.

When fsync is enabled

  • It is much harder to spot CPU regressions when fsync is enabled, which is why most of my tests disable it
  • MySQL throughput with the binlog on is ~50% relative to the binlog off. This is expected because the bottleneck is fsync and there are 2 fsyncs/write with the binlog on vs 1 fsync/write with it off.
  • The various choices for the InnoDB doublewrite buffer (off, on, DETECT_ONLY) don't have an impact here.
  • MySQL with the binlog off gets ~27% of the throughput relative to Postgres and with the binlog on that drops to ~14%, using bl0_sync1_dw1 and bl1_sync1_dw1 vs sync1_comp0_fpi1. These differences are larger than I expect.
  • The impact from full_page_writes=on for Postgres is smaller here than it is above, but again the real issue is the CPU overhead from pglz and the solution is to use lz4.
--- without fsync

- no binlog for MySQL
ins     upd-ix  upd-nix dbms
20413    5662   19837   my8039 bl0_sync0_dw0
20458    5881   20271   my8039 bl0_sync0_dw1
20252    5790   19742   my8039 bl0_sync0_dwDO

- binlog for MySQL
ins     upd-ix  upd-nix dbms
11814    4935   11051   my8039 bl1_sync0_dw0
11614    4910   10992   my8039 bl1_sync0_dw1
11649    4896   11062   my8039 bl1_sync0_dwDO

- Postgres
ins     upd-ix  upd-nix dbms
25590   28835   32951   pg17rc1 sync0_comp0_fpi0
25471   28840   32837   pg17rc1 sync0_comp0_fpi1
26452   29196   32702   pg17rc1 sync0_comp1_fpi0
18537   14276   16033   pg17rc1 sync0_comp1_fpi1

--- with fsync

- no binlog for MySQL (1 fsync per commit: InnoDB)
ins     upd-ix  upd-nix dbms
  139     135     142   my8039 bl0_sync1_dw0
  140     134     142   my8039 bl0_sync1_dw1
  140     134     142   my8039 bl0_sync1_dwDO

- binlog for MySQL (2 fsync per commit: binlog + InnoDB)
ins     upd-ix  upd-nix dbms
   72      72      73   my8039 bl1_sync1_dw0
   73      71      71   my8039 bl1_sync1_dw1
   73      68      70   my8039 bl1_sync1_dwDO

- Postgres (1 fsync per commit: WAL)
ins     upd-ix  upd-nix dbms
  512     513     516   pg17rc1 sync1_comp0_fpi0
  505     507     517   pg17rc1 sync1_comp0_fpi1
  512     514     516   pg17rc1 sync1_comp1_fpi0
  438     422     387   pg17rc1 sync1_comp1_fpi1

Results: medium server

The columns list the throughput for the insert (ins), update-index (upd-ix) and update-noindex (upd-nix).

When fsync is disabled

  • MySQL throughput with the binlog on is (44%, 79%, 43%) of throughput with the binlog off for (insert, update-index, update-noindex), comparing bl1_sync0_dw1 to bl0_sync0_dw1. The result for update-index is the most interesting.
  • The various choices for the InnoDB doublewrite buffer (off, on, DETECT_ONLY) don't have an impact here. They will have an impact on workloads that do more write-back and I am happy that InnoDB now supports DETECT_ONLY.
  • MySQL with the binlog off gets (96%, 16%, 70%) of the throughput relative to Postgres using bl0_sync0_dw1 vs sync0_comp0_fpi1. Once again the result for update-index with MySQL is odd.
  • MySQL with the binlog on gets (42%, 13%, 30%) of the throughput relative to Postgres using bl1_sync0_dw1 vs sync0_comp0_fpi1. Most MySQL primaries will run with the binlog enabled.
  • Using Postgres with full_page_writes=on (recommended) reduces throughput by ~5% when wal_compression uses pglz and the solution is lz4.

When fsync is enabled

  • MySQL throughput with the binlog on is (34%, 58%, 41%) of throughput with the binlog off for (insert, update-index, update-noindex), comparing bl1_sync1_dw1 to bl0_sync1_dw1. This is (mostly) expected because the bottleneck is fsync and there are 2 fsyncs/write with the binlog on vs 1 fsync/write with it off.
  • The various choices for the InnoDB doublewrite buffer (off, on, DETECT_ONLY) don't have an impact here.
  • MySQL with the binlog off gets (133%, 68%, 132%) of the throughput relative to Postgres using bl0_sync1_dw1 vs sync1_compLZ4_fpi1.
  • MySQL with the binlog on gets (45%, 39%, 54%) of the throughput relative to Postgres using bl1_sync1_dw1 vs sync1_compLZ4_fpi1. Most MySQL primaries will run with the binlog enabled.
  • The impact from full_page_writes=on for Postgres is large for updates (reduces throughput by ~50%) and smaller for the insert test because the insert test loads in PK order.

--- without fsync

- no binlog for MySQL
ins     upd-ix  upd-nix dbms
117078   17620  104215  my8039 bl0_sync0_dw0
118038   19821  103625  my8039 bl0_sync0_dw1
117373   19582  103625  my8039 bl0_sync0_dwDO

- binlog for MySQL
ins     upd-ix  upd-nix dbms
 52602   15201   44727  my8039 bl1_sync0_dw0
 52193   15736   44372  my8039 bl1_sync0_dw1
 52456   16293   45195  my8039 bl1_sync0_dwDO

- Postgres
ins     upd-ix  upd-nix dbms
127864  138330  154650  pg17rc1 sync0_comp0_fpi0
123512  123825  148386  pg17rc1 sync0_comp0_fpi1
127456  139634  155127  pg17rc1 sync0_comp1_fpi0
119504  117795  136421  pg17rc1 sync0_comp1_fpi1
127885  139301  154869  pg17rc1 sync0_compLZ4_fpi0
125192  125532  149092  pg17rc1 sync0_compLZ4_fpi1

--- with fsync

- no binlog for MySQL (1 fsync per commit for InnoDB redo)
ins     upd-ix  upd-nix dbms
 20602    6824   15804  my8039 bl0_sync1_dw0
 20350    6640   15095  my8039 bl0_sync1_dw1
 20677    6836   16134  my8039 bl0_sync1_dwDO

- binlog for MySQL (2 fsync per commit, one for binlog, one for InnoDB redo)
ins     upd-ix  upd-nix dbms
  6972    3863    6238  my8039 bl1_sync1_dw0
  6963    3829    6183  my8039 bl1_sync1_dw1
  6922    3904    6266  my8039 bl1_sync1_dwDO

- Postgres (1 fsync per commit, for WAL)
ins     upd-ix  upd-nix dbms
 19754   24293   26250  pg17rc1 sync1_comp0_fpi0
 15089    9507   10963  pg17rc1 sync1_comp0_fpi1
 19732   24335   26280  pg17rc1 sync1_comp1_fpi0
 13031    8571    9204  pg17rc1 sync1_comp1_fpi1
 19698   24629   26163  pg17rc1 sync1_compLZ4_fpi0
 15309    9712   11420  pg17rc1 sync1_compLZ4_fpi1

The size of the mysqld binary as a proxy for innovation

I have been documenting performance regressions over time in MySQL. The regressions mean that some workloads get less throughput because the server uses more CPU per SQL operation. From perf stat I see there is more instruction cache and TLB activity per query. I also see that it takes more instructions per query. A recent blog post from me has more details.

This is bloat if you are a pessimist and a side-effect of innovation if you are an optimist. I must repeat that the issue is more serious for low-concurrency workloads because there has been much great work to reduce mutex contention.

The table below shows the size of the mysqld binary both as-is (not stripped, includes debug symbols) and stripped. The size of the stripped binary (almost) doubled from the last 5.6 release (5.6.51) to the last 5.7 release (5.7.44). It doubled again from 5.7.44 to a somewhat recent 8.0 release (8.0.28).
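A minimal sketch of how the two sizes can be measured; the path to mysqld is an assumption and the stripping is done on a copy so the original binary is untouched:

ls -lh bin/mysqld                 # as-is: not stripped, includes debug symbols
cp bin/mysqld /tmp/mysqld.stripped
strip /tmp/mysqld.stripped
ls -lh /tmp/mysqld.stripped       # stripped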

Assuming the binary size is a proxy for innovation then there is much innovation (2X per major release). But I am not sure that ends well given the impact on CPU overhead.

So I ask two things:

  1. Be more careful about innovation going forward
  2. Start using Nyrkio to detect regressions early in the development cycle

        -- size in MB --
version  as-is  stripped
5.6.51    28     16
5.7.1     29     15
5.7.3     31     16
5.7.5     40     19
5.7.7     69     26
5.7.9     71     27
5.7.10    72     27
5.7.19    72     27
5.7.27    73     28
5.7.35    80     30
5.7.44    81     30
8.0.0    117     39
8.0.1    109     39
8.0.2    159     47
8.0.4    176     47
8.0.12   179     50
8.0.13   190     52
8.0.14   192     52
8.0.15   192     52
8.0.16   194     54
8.0.17   196     55
8.0.18   207     57
8.0.19   209     57
8.0.20   213     58
8.0.21   219     58
8.0.22   221     59
8.0.23   223     60
8.0.24   225     60
8.0.25   225     60
8.0.26   242     62
8.0.27   232     61
8.0.28   233     61

Monday, September 23, 2024

Why do table scans get slower with MySQL from 5.6 to 8.0?

As I search to explain why MySQL is getting (too much) slower over time I collect a lot of data from many servers and then get overwhelmed trying to summarize it. Here I focus on one microbenchmark (table scan) on one server (ax162-s from Hetzner) using upstream MySQL with InnoDB and FB MyRocks.

While my primary goal is to explain the regressions in MyRocks, many of them are inherited from upstream MySQL so I must first explain them in upstream MySQL with InnoDB.

My focus is on a test that does full table scans so the ratio of (time spent in storage engine / time spent elsewhere) is much larger for that test than many of the other tests I use with sysbench.

The basic problem is that the table scan done by my tests is

  • ~1.5X faster for InnoDB in MySQL 5.6.51 vs 8.0.39
  • ~1.5X faster for MyRocks in FB MySQL 5.6.35 vs 8.0.32

The problem is spread over many releases, but the summary is:

  • Code bloat - more instructions are executed per SQL statement
    • InnoDB 8.0.28 uses ~1.3X more instructions/query than InnoDB 5.6.51
    • MyRocks 8.0.32 uses ~1.3X more instructions/query than MyRocks 5.6.35

  • Memory system bloat - more TLB and cache misses per SQL statement
    • InnoDB 8.0.28 has a lot more cache & TLB loads & misses relative to InnoDB 5.6.51. The difference ranges from ~1.6X more cache-misses to ~1700X more iTLB-loads.
    • MyRocks 8.0.32 has a lot more cache & TLB loads & misses relative to MyRocks 5.6.35. The difference ranges from ~1.6X more cache-misses to ~22X more iTLB-loads.

    The benchmark

    I used sysbench but not as described here. Instead I just did load, a few point-query tests and then the full scan test. Here I focus on the full scan. I ran a subset of the tests to get more results in less time.

    Tests are run with 1 thread, 1 table and 50M rows. While modern MySQL suffers from new CPU overheads it also benefits from a reduction in mutex contention. Alas, I don't highlight the improvements here. And modern MySQL would be even faster if it could reduce mutex contention without increasing CPU overheads.

    I ran things 3 times:

    1. plain - just run the test, don't use perf to get HW counters or flamegraphs
    2. stat - run perf stat to collect HW counters
    3. svg - run perf record -e cycles -F 1999 -g -p $pid to generate flame graphs
    For stat the counters are sampled over 10-second intervals for several groups of counters via:

    perf stat -e cpu-clock,cycles,bus-cycles,instructions,branches,branch-misses -p $pid -- sleep 10
    perf stat -e cache-references,cache-misses,stalled-cycles-backend,stalled-cycles-frontend -p $pid -- sleep 10
    perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores -p $pid -- sleep 10
    perf stat -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses -p $pid -- sleep 10
    perf stat -e iTLB-load-misses,iTLB-loads,L1-icache-loads-misses,L1-icache-loads -p $pid -- sleep 10
    perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses,LLC-prefetches -p $pid -- sleep 10
    perf stat -e alignment-faults,context-switches,migrations,major-faults,minor-faults,faults -p $pid -- sleep 10

    Database versions tested

    For upstream MySQL I tested InnoDB from 5.6.51, 5.7.1, 5.7.3, 5.7.5, 5.7.7, 5.7.9, 5.7.10, 5.7.19, 5.7.35, 5.7.44, 8.0.0, 8.0.2, 8.0.4, 8.0.11, 8.0.13, 8.0.15, 8.0.17, 8.0.19, 8.0.20, 8.0.21, 8.0.23, 8.0.25, 8.0.27, 8.0.28, 8.0.39.

    For MyRocks I tested:

    • 5635.old - MyRocks 5.6.35 from Apr 7, 2021 at git hash f896415f with RocksDB 6.19.0
    • 5635.new - MyRocks 5.6.35 from Oct 16, 2023 at git hash 4f3a57a1 with RocksDB 8.7.0
    • 8028.old - MyRocks 8.0.28 from Aug 29 2022 at git hash a35c8dfe with RocksDB 7.5.2
    • 8028.new - MyRocks 8.0.28 from Dec 2 2023 at git hash 4edf1eec with RocksDB 8.7.0
    • 8032.old - MyRocks 8.0.32 from Dec 4 2023 at git hash e3a854e8 with RocksDB 8.7.0
    • 8032.new - MyRocks 8.0.32 from May 29 2024 at git hash 49b37dfe with RocksDB 9.2.1

    Charts: InnoDB

    The charts use relative QPS which is: (QPS for my version / QPS for InnoDB 5.6.51). When it is less than 1.0 there is a performance regression. The spreadsheet with charts is here.

    The regressions during 5.7 are small.
    The regressions during 8.0 are larger. The worst occurs after 8.0.28 which hopefully is resolved in the soon-to-be released 8.0.40 -- see bug 111538.
    Charts: MyRocks

    The charts use relative QPS which is: (QPS for my version / QPS for MyRocks 5635.old). When it is less than 1.0 there is a performance regression. The spreadsheet with charts is here.

    The large regression occurs during the 5.6.35 to 8.0.28 transition.
    Hardware counters

    Above I explained how perf stat is used to collect HW counters over 10-second intervals. It is important to remember that measurements are made over a fixed amount of time rather than a fixed amount of work (queries). I want to know things like how many instructions are executed per query and for that I have to normalize the results from perf stat by QPS.
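    As a worked example of the normalization, if a 10-second perf stat sample counts 122 billion instructions while the server sustains a hypothetical 20,000 QPS, then:

      instructions/query = instructions in sample / (QPS * sample seconds)
                         = 122,000,000,000 / (20,000 * 10)
                         = 610,000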

    But before I spend too much time evaluating results from perf stat I confirm that the clock rate during the samples is similar. I learned this the hard way when using the schedutil CPU frequency governor with MyRocks (see here). The information for this comes from perf stat -e cycles via the line for cycles:

             10,010.49 msec cpu-clock                 #    1.001 CPUs utilized
        27,526,557,704      cycles                    #    2.750 GHz
       <not supported>      bus-cycles
       122,334,672,524      instructions              #    4.44  insn per cycle
        23,595,569,084      branches                  #    2.357 G/sec
             9,662,382      branch-misses             #    0.04% of all branches

          10.002641688 seconds time elapsed

    I assume that the 10-second sample is representative and that is more true for read-heavy tests and less true for write-heavy tests because write-heavy has intermittent background work (b-tree write back, LSM compaction).

    Using relative values

    I focus on using relative values for the counters which is: (value for my version) / (value for base case). This makes it easy to see how things change over time from old to new releases but it doesn't show the significance of the change. For example, in many cases the value for context switches during a table scan grows by a large amount (as in 40X) but that isn't significant because the change might be from 10 context switches per second to 400. And I doubt such a change has any impact on the performance regressions.

    Unfortunately, it isn't trivial to understand the significance of changes for other metrics. For example, when should I care more about cache vs TLB misses?

    The data

    At last, I can share some of the data collected by perf stat. The data is here for:
    • InnoDB 5.6.51 to 5.7.44
      • see here. The gist has two parts -- first is the counters that might explain the regression and the second is all of the counters. Everything uses relative values.
    • InnoDB 5.7.44 to 8.0.39
      • see here. The gist has three parts -- first is the counters (as relative values) that might explain the regressions, the second is the absolute values for those counters and the third is all of the counters as relative values.
    • MyRocks 5.6.35 to 8.0.32
      • see here. The gist has two parts -- first is the counters that might explain the regression and the second is all of the counters. Everything uses relative values.
    First up is InnoDB:
    • MySQL 8.0.28 uses ~1.3 times more instructions/query than 5.6.51
    • MySQL 8.0.28 has 1.6X or more cache & TLB misses and loads per query than 5.6.51. The worst ratio is for iTLB-loads (over 1700X) but I don't know whether the absolute cost from that is more significant than for the other counters.
    (value/query for MySQL 8.0.28) / (value/query for MySQL 5.6.51)

    instructions            (.991 / .109) / (1 / .135) =    1.227
    cache-misses            (1.301 / .109) / (1 / .135) =   1.611
    cache-references        (1.504 / .109) / (1 / .135) =   1.862
    dTLB-loads              (2.903 / .109) / (1 / .135) =   3.595
    iTLB-load-misses        (2.011 / .109) / (1 / .135) =   2.490
    iTLB-loads              (1412.0 / .109) / (1 / .135) =  1748.903
    L1-dcache-load-misses   (1.723 / .109) / (1 / .135) =   2.134
    L1-icache-loads         (3.101 / .109) / (1 / .135) =   3.840
    L1-icache-loads-misses  (15.607 / .109) / (1 / .135) =  19.330
    stalled-cycles-backend  (1.427 / .109) / (1 / .135) =   1.767

    Next up is MyRocks:
    • MyRocks 8032.new uses ~1.3X more instructions/query vs 5635.old
    • MyRocks 8032.new has 1.6X or more cache & TLB misses and loads per query than 5635.old. The worst ratio is for iTLB-loads (over 22X) but I don't know whether the absolute cost from that is more significant than for the other counters.
    (value/query for MyRocks 8032.new) / (value/query for 5635.old)

    instructions            (0.883 / .049) / (1 / .073) =   1.315
    cache-references        (1.410 / .049) / (1 / .073) =   2.100
    dTLB-loads              (1.253 / .049) / (1 / .073) =   1.866
    iTLB-load-misses        (1.104 / .049) / (1 / .073) =   1.644
    iTLB-loads              (15.068 / .049) / (1 / .073) =  22.449
    L1-dcache-load-misses   (1.275 / .049) / (1 / .073) =   1.899
    L1-icache-loads         (1.432 / .049) / (1 / .073) =   2.133
    L1-icache-loads-misses  (5.306 / .049) / (1 / .073) =   7.905
    stalled-cycles-frontend (1.429 / .049) / (1 / .073) =   2.128

    Friday, September 20, 2024

    Ubuntu 22.04, Linux kernels and me

    This is a summary of interesting performance issues I have had with the HWE kernels (6.5.x) on Ubuntu 22.04, mostly this year. This is mostly a note to my future self.

    • Use ext4 instead of XFS
      • I have used XFS since pre-2010 because it avoided a few perf problems that show up in the ext family. But I finally found a reason to switch to ext4. The problem is that with 6.5 kernels (when HWE is used) there are perf problems and many error messages (WQ_UNBOUND, CPU hogging). Some details are here. I am curious if XFS needs a few updates to match whatever changed in the 6.5 kernels.
    • Switch from schedutil to performance CPU frequency governor
      • This issue hurt MyRocks a lot more than Postgres or InnoDB. I don't know the root cause but MyRocks perf was lousy with the schedutil CPU frequency governor. See here for more details from me, and this LWN post might be relevant (commands to switch the governor and check max_sectors_kb are after this list).
    • readahead requests are dropped
      • RocksDB changed how readahead is done when O_DIRECT is not used and the change uses the readahead call. It looks like readahead requests are ignored when the request size is larger than the value of max_sectors_kb. Alas, to figure out the value of max_sectors_kb you must read /sys/block/$device/queue/max_sectors_kb and I prefer to avoid that (whether I have to do it, or RocksDB code does it). Surely there is a better way to handle this.
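    For the governor and readahead items above, the relevant commands are short. This is a minimal sketch; the device name and the presence of the cpupower utility are assumptions:

      # switch every CPU from schedutil to the performance governor
      sudo cpupower frequency-set -g performance
      # check the limit that caps the size of readahead() requests
      cat /sys/block/nvme0n1/queue/max_sectors_kb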


    Saturday, September 14, 2024

    MySQL and Postgres vs the Insert Benchmark on a large server

    This has benchmark results for MySQL and Postgres vs the Insert Benchmark on a large server. My intent is to document how performance changes over time more so than start a Postgres vs MySQL argument.

    MySQL has accumulated large regressions from 5.6 to 8.0 that are obvious on low-concurrency benchmarks. While they are less obvious on high-concurrency benchmarks, and there have been significant improvements to make MySQL 8 better at high-concurrency, the regressions that hurt low-concurrency results also reduce throughput at high-concurrency.

    tl;dr

    • For Postgres 17rc1 vs 15.8
      • Performance is mostly the same
    • For MySQL 8.0.39 vs 5.6.51
      • Writes are much faster in 8.0.39
      • Read throughput is mixed: 8.0.39 is slower than, similar to, and faster than 5.6.51 depending on the context
    • For MySQL vs Postgres
      • MySQL is faster on point queries
      • Postgres is faster on writes and range queries
    Builds, configuration and hardware

    I compiled from source:
    • Postgres version 17rc1 from source using -O2 -fno-omit-frame-pointer
    • MySQL versions 5.6.51, 5.7.44 and 8.0.39 from source using -DCMAKE_BUILD_TYPE=Release and -O2 -fno-omit-frame-pointer
    The server is an ax162-s from Hetzner with 48 AMD cores, 128G RAM and AMD SMT disabled. It uses Ubuntu 22.04 and storage is ext4 using SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.

    The Postgres configuration file is here.
    The MySQL configuration files are here for 5.6.51, 5.7.44 and 8.0.39

    The Benchmark

    The benchmark is explained here and is run with 20 clients and a table per client with three workloads:
    • cached - database fits in the Postgres buffer pool
    • less IO-bound - database is larger than memory with 64M rows per table
    • more IO-bound - database is larger than memory with 200M rows per table
    The benchmark steps are:

    • l.i0
      • insert X million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. The value of X is 20 for cached, 64 for less IO-bound and 200 for more IO-bound.
    • l.x
      • create 3 secondary indexes per table. There is one connection per client.
    • l.i1
      • use 2 connections/client. One inserts X rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. The value of X is 40M for cached and 4M for less/more IO-bound.
    • l.i2
      • like l.i1 but each transaction modifies 5 rows (small transactions) and X rows are inserted and deleted per table. The value of X is 10M for cached and 1M for less/more IO-bound.
      • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
    • qr100
      • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
    • qp100
      • like qr100 except uses point queries on the PK index
    • qr500
      • like qr100 but the insert and delete rates are increased from 100/s to 500/s
    • qp500
      • like qp100 but the insert and delete rates are increased from 100/s to 500/s
    • qr1000
      • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
    • qp1000
      • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
    Results: overview

    The performance reports are here:
    The summary section in each performance report has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

    Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. The base case here is MySQL 5.6.51 for MySQL vs Postgres, MySQL 5.6.51 for only MySQL and Postgres 15.8 for only Postgres.

    When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
    • insert/s for l.i0, l.i1, l.i2
    • indexed rows/s for l.x
    • range queries/s for qr100, qr500, qr1000
    • point queries/s for qp100, qp500, qp1000
    Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

    Results: cached

    From the summary for only Postgres:
    • Write-heavy steps are up to 11% faster in 17rc1 vs 15.8
    • Read-heavy steps are unchanged in 17rc1 vs 15.8
    From the summary for only MySQL:
    • Write-heavy steps are ~2.5X to ~4X faster in 8.0.39 vs 5.6.51
    • Read-heavy steps are between 12% and 27% slower in 8.0.39 vs 5.6.51 
    From the summary for MySQL vs Postgres:
    • Postgres 17rc1 is faster on write-heavy and range queries
    • MySQL 8.0.39 and Postgres 17rc1 get similar QPS on point queries
    Results: less IO-bound

    From the summary for only Postgres:
    • Write-heavy steps are up to 13% faster in 17rc1 vs 15.8 except for l.i2 which suffers from an intermittently high CPU overhead from get_actual_variable_range
    • Read-heavy steps are mostly similar between 17rc1 and 15.8
    From the summary for only MySQL:
    • Write-heavy steps are ~1.9X to ~4X faster in 8.0.39 vs 5.6.51
    • Read-heavy steps are up to 24% slower in 8.0.39 vs 5.6.51
    From the summary for MySQL vs Postgres:
    • Postgres 17rc1 is faster on write-heavy and range queries
      • The range queries are covering and the secondary indexes are mostly cached so the tests are CPU bound for both Postgres and MySQL (see rpq here)
    • MySQL 8.0.39 is faster on point queries
      • MySQL does ~0.1 reads/query while Postgres does ~3 (see rpq here). The table is index organized with MySQL and heap organized with Postgres. So with InnoDB the point queries just access the PK index while with Postgres they first access the PK index and then the heap table to get all of the columns.
    Results: more IO-bound

    From the summary for only Postgres:
    • Write-heavy steps have similar throughput between 17rc1 and 15.8 ignoring index create (l.x) and variance from l.i2 (problems with get_actual_variable_range)
    • Read-heavy steps are up to 6% faster in 17rc1 vs 15.8
    From the summary for only MySQL:
    • Write-heavy steps are up to ~4X faster in 8.0.39 vs 5.6.51
    • Point queries are up to ~3X faster in 8.0.39 vs 5.6.51
    • Range queries are up to 27% slower in 8.0.39 vs 5.6.51
    From the summary for MySQL vs Postgres:
    • Postgres 17rc1 is faster on write-heavy and range queries
      • See the less IO-bound section above
      • MySQL does more read IO per range query (see rpq here)
    • MySQL 8.0.39 is faster on point queries
      • See the less IO-bound section above
      • Postgres does ~2X more read IO per point query (see rpq here)








    Thursday, September 12, 2024

    HTAP benchmarks: trying out HATtrick with MySQL

    For a few years I have only used sysbench and the Insert Benchmark for work but now I need an HTAP benchmark. There are several to choose from but I will start with HATtrick. The source is here. It uses C++ and ODBC. I prefer Java (and JDBC), golang or Python but let's see how this goes.

    My current OS is Ubuntu 22.04 and I install MySQL in custom locations so I assume that I need to install MySQL's Connector/ODBC from source. That depends on an ODBC driver manager, and for Ubuntu I will try iodbc. I have no experience with ODBC and now I am worried about pointless complexity (I am looking at you Log4J, CORBA and EJB).

    The benchmark looks interesting and useful but I'd rather not deal with setting up ODBC again, nor with core dumps. So I will move on and try CH-benchmark from benchbase (Java + JDBC).

    The details

    So the first step is to install iodbc via: sudo apt install libiodbc2 libiodbc2-dev

    Then configure MySQL Connector/ODBC. I didn't expect to need ODBC_INCLUDES= because CMake was able to find iodbc without it, but I do because that path wasn't added to CXXFLAGS unless I added it:

    cmake .. -G "Unix Makefiles" \
      -DMYSQLCLIENT_STATIC_LINKING=ON \
      -DBUNDLE_DEPENDENCIES=ON \
      -DMYSQL_CONFIG_EXECUTABLE=/path/to/mysql_config \
      -DDISABLE_GUI=1 \
      -DCMAKE_INSTALL_PREFIX=/install/path/to/myodbc \
      -DODBC_INCLUDES=/usr/include/iodbc

    make; make install


    Then I compile HATtrick after downloading it from GitHub. I had to edit the Makefile:


    diff --git a/Makefile b/Makefile
    index b55f767..fea8860 100644
    --- a/Makefile
    +++ b/Makefile
    @@ -1,6 +1,7 @@
     CC=g++
    -CXXFLAGS=-c -std=c++2a
    -LDFLAGS=-lodbc -pthread
    +CXXFLAGS=-c -std=c++2a -I/usr/include/iodbc
    +LDFLAGS=-L/home/me/d/myodbc/lib -lmyodbc8a -pthread
    +#LDFLAGS=-lodbc -pthread
     SRCS= src/DataGen.cpp src/DataSrc.cpp src/UserInput.cpp src/Driver.cpp \
     src/DBInit.cpp src/SQLDialect.cpp src/Barrier.cpp src/AnalyticalClient.cpp \
     src/TransactionalClient.cpp src/Workload.cpp src/Globals.cpp \


    Then I compile HATtrick. The scale factor is hardcoded in src/UserInput.h so you probably need to edit it and compile once per scale factor you want to try. Compiling just means: make


    Then generate data. I assume that ":" is a good value for the delimiter. I had to add LD_LIBRARY_PATH to find the MySQL ODBC libraries that I compiled above.

    LD_LIBRARY_PATH=/home/me/d/myodbc/lib \
    ./HATtrickBench -gen -pa /home/me/hatrick.data -d :


    Then load the database. I added secure_file_priv="" to etc/my.cnf for this to work
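    This is the my.cnf excerpt I mean; setting the option to an empty string removes the restriction on which directories LOAD DATA INFILE can read from:

    [mysqld]
    secure_file_priv=""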


    LD_LIBRARY_PATH=/home/me/d/myodbc/lib \
    ./HATtrickBench -init -pa /home/me/hatrick.data -usr root -pwd pw -db mysql


    All 4 steps completed OK but there was a core dump at the end. While it might be easy to fix I will move on to CH-benchmark which is in benchbase as that means Java + JDBC so I don't have to deal with ODBC, nor must I worry about core dumps.

    Program received signal SIGABRT, Aborted.
    __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352611648) at ./nptl/pthread_kill.c:44
    44      ./nptl/pthread_kill.c: No such file or directory.
    (gdb) where
    #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352611648) at ./nptl/pthread_kill.c:44
    #1  __pthread_kill_internal (signo=6, threadid=140737352611648) at ./nptl/pthread_kill.c:78
    #2  __GI___pthread_kill (threadid=140737352611648, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
    #3  0x00007ffff6e42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
    #4  0x00007ffff6e287f3 in __GI_abort () at ./stdlib/abort.c:79
    #5  0x00007ffff6e2871b in __assert_fail_base (fmt=0x7ffff6fdd130 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7ffff6fd9599 "mutex->__data.__owner == 0", file=0x7ffff6fd956e "pthread_mutex_lock.c", line=94, function=<optimized out>)
        at ./assert/assert.c:92
    #6  0x00007ffff6e39e96 in __GI___assert_fail (assertion=assertion@entry=0x7ffff6fd9599 "mutex->__data.__owner == 0", file=file@entry=0x7ffff6fd956e "pthread_mutex_lock.c", line=line@entry=94,
        function=function@entry=0x7ffff6fe1840 <__PRETTY_FUNCTION__.0> "___pthread_mutex_lock") at ./assert/assert.c:101
    #7  0x00007ffff6e980d0 in ___pthread_mutex_lock (mutex=<optimized out>) at ./nptl/pthread_mutex_lock.c:94
    #8  0x0000555555579fdf in __gthread_mutex_lock(pthread_mutex_t*) ()
    #9  0x000055555557a034 in std::mutex::lock() ()
    #10 0x000055555557a30f in std::unique_lock<std::mutex>::lock() ()
    #11 0x000055555557a1a3 in std::unique_lock<std::mutex>::unique_lock(std::mutex&) ()
    #12 0x00007ffff770caca in ENV::remove_dbc(DBC*) () from /home/markcallaghan/d/myodbc/lib/libmyodbc8a.so
    #13 0x00007ffff770cf2e in DBC::~DBC() () from /home/markcallaghan/d/myodbc/lib/libmyodbc8a.so
    #14 0x00007ffff770d610 in my_SQLFreeConnect(void*) () from /home/markcallaghan/d/myodbc/lib/libmyodbc8a.so
    #15 0x00007ffff770e4bb in SQLFreeHandle () from /home/markcallaghan/d/myodbc/lib/libmyodbc8a.so
    #16 0x000055555556ff64 in Driver::disconnectDB(void*&) ()
    #17 0x000055555558a6c8 in main ()


    MySQL + InnoDB vs sysbench on a large server

    This has benchmark results for MySQL 5.6.51, 5.7.44 and 8.0.39 using sysbench on a large server. I also add a few results comparing Postgres 17rc1 with MySQL 8.0.39. My goal with that is to highlight things that can be made better in MySQL (and in Postgres).

    One of the responses to my claims that MySQL is getting too many performance regressions over time is that this is true for low-concurrency tests but not for high-concurrency tests. Alas, that claim is more truthy than true and fixing some of these regressions would help make modern MySQL not look so slow when compared to modern Postgres.

    tl;dr

    • MySQL 8.0 is faster than 5.6 for point queries and writes but slower for range queries
    • Postgres 17rc1 is a lot faster than MySQL 8.0 for point queries and writes. For range queries Postgres was still faster but the difference was usually not as large.

    Builds, configuration and hardware

    I compiled:
    • Postgres version 17rc1 from source using -O2 -fno-omit-frame-pointer
    • MySQL versions 5.6.51, 5.7.44 and 8.0.39 from source using -DCMAKE_BUILD_TYPE=Release and -O2 -fno-omit-frame-pointer
    The server is an ax162-s from Hetzner with 48 AMD cores, 128G RAM and AMD SMT disabled. It uses Ubuntu 22.04 and storage is ext4 using SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.

    The Postgres configuration file is here.
    The MySQL configuration files are here for 5.6.51, 5.7.44 and 8.0.39.

    Benchmark

    I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres.

    For the large server the tests run with 8 tables and 10M rows/table. There are 40 client threads, read-heavy microbenchmarks run for 180 seconds and write-heavy run for 300 seconds. The command line to run all tests was: bash r.sh 8 10000000 180 300 md2 1 1 40

    Results

    For the results below I split the 42 microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. The spreadsheet with all data is here.

    Values from iostat and vmstat divided by QPS are here for MySQL 5.6, 5.7, 8.0 and here for MySQL 8.0.39 vs Postgres 17rc1. These help to explain why something is faster or slower because it shows how much HW is used per request.

    The numbers in the spreadsheets are the relative QPS and $version is faster than the base case when it is greater than 1.0. When it is 3.0 then $version is 3X faster than the base case.

    The relative QPS is computed by: (QPS for $version) / (QPS for base case).

    For the comparison between MySQL 5.6.51, 5.7.44 and 8.0.39 the base case is 5.6.51 and $version is one of 5.7.44 or 8.0.39.

    For the comparison between MySQL 8.0.39 and Postgres 17rc1 the base case is MySQL and $version is Postgres.

    Results: MySQL 5.6.51 vs 5.7.44 vs 8.0.39

    Point queries, part 1
    • MySQL 8.0 is slower than 5.6 on 4 of 11 tests and faster on 7 of 11 tests. 
    • The 4 on which it is slower are all for secondary index lookups. From vmstat metrics, when MySQL 8.0 is slower the problem is CPU overhead (see cpu/o here).
    • The 7 on which it is faster are all for PK lookups. From two of those tests, it is faster because it uses less CPU per query (see cpu/o here).
    Point queries, part 2
    • MySQL 8.0 is faster than 5.6 in all tests here. All of these tests use a PK index. The perf difference is larger for queries that fetch more data. This suggests that new optimizer overheads prevent it from being faster in cases when it fetches less data and/or InnoDB in MySQL 8.0 doesn't have regressions for point queries relative to MySQL 5.6.
    • From vmstat metrics (see cpu/o here) MySQL 8.0 is faster because it uses less CPU
    Range queries, part 1
    • MySQL 5.7 and 8.0 are slower than 5.6 for range queries. This is a surprise and the regressions are getting larger over time (worse in 8.0 than 5.7). The queries here do range scans of various lengths without aggregation.
    • From vmstat metrics (see cpu/o here) MySQL 8.0 is slower because it uses more CPU per query
    Range queries, part 2
    • All of the tests here do some aggregation except for scan (the right most bars). And MySQL 8.0 is faster than 5.6 for all such tests.
    • When MySQL 8.0 is faster the primary reason is that it uses less CPU per query (see cpu/o here) and sometimes it also does fewer context switches per query (see cs/o here) where a high context switch rate often implies mutex contention.
    Writes
    • MySQL 8.0 is significantly faster than 5.6 for all tests. These show the benefit of the work to improve performance for high-concurrency workloads.
    • MySQL 8.0 is faster because it uses less CPU per SQL statement (see cpu/o here) and does fewer context switches per SQL statement (see cs/o here). A high context switch rate often implies mutex contention.
    Results: MySQL 8.0.39 vs Postgres 17rc1

    Point queries, part 1
    • The result for hot-points is an outlier, perhaps because prepared statements are better in Postgres than in MySQL. In MySQL, server-side prepared statements just save on parse. The difference is large because MySQL uses a lot more CPU than Postgres (see cpu/o here).
    • In most other tests Postgres is ~1.5X faster than MySQL and the root cause is that MySQL uses more CPU per query (see cpu/o here). For the context switch rate, sometimes it is larger for MySQL and sometimes it is smaller (see cs/o here).
    • Excluding hot-points, the largest difference occurs on tests that do non-covering secondary index lookups. InnoDB tables are index-organized and for non-covering secondary index lookups to fetch the missing columns InnoDB must traverse the PK index while Postgres just uses the rowID (tuple ID) to access the heap pages.
    Point queries, part 2
    • Postgres is ~1.5X faster for point queries, see above in Point queries, part 1 for more details
    Range queries, part 1
    • Postgres is much slower than InnoDB on scan (full table scan) and on range-notcovered-pk. I can only wave hands about the scan result; for range-notcovered-pk, Postgres must scan the PK index and then look up the missing columns from heap pages while InnoDB has all columns in the PK index because it is index-organized. For scan, the CPU overhead and context switch rates are much larger for Postgres (see cpu/o and cs/o here).
    • Postgres is much faster on the two tests that do non-covering secondary index scans because it fetches missing columns by following the tupleID to a heap page while InnoDB must traverse the PK index.
    • Excluding the 4 outliers, InnoDB and Postgres have similar performance
    Range queries, part 2
    • The performance difference here is smaller than it is above for point queries. In one case, MySQL is much faster, in three cases Postgres is much faster and in two cases perf is similar.
    • For tests run before writes (less b-tree fragmentation) or after writes (more b-tree fragmentation) the pattern is that the advantage for Postgres is larger when the length of the range query is shorter -- see results for read-only.pre_range=X and read-only_range=X for X in 10000, 100 and 10. My guess is that MySQL optimizer overhead hurts performance and that is more obvious for queries that take less time -- see the CPU overhead metrics here (cpu/o columns).
    Writes
    • Postgres is faster for all tests that do writes and is up to ~5X faster. The reason is that Postgres uses less CPU per statement (see cpu/o here) and does fewer context switches per statement (see cs/o here) where context switch rates often predict mutex contention.
    • The difference is smaller for read-write* tests because they do a mix of reads and writes.
