Small Datum: 2023

Sunday, December 31, 2023

Updated Insert benchmark: MyRocks 5.6 and 8.0, large server, cached database

This has results for the Insert Benchmark using MyRocks 5.6 and 8.0 using a large server and cached workload.

tl;dr

performance between MyRocks 5.6.35, 8.0.28 and 8.0.32 are similar, unlike what I see with InnoDB where there are large changes, both good (writes) and bad (for reads) using the same setup

Build + Configuration

I tested MyRocks 5.6.35, 8.0.28 and 8.0.32. These were compiled from source. All builds use CMAKE_BUILD_TYPE =Release.

MyRocks 5.6.35

compiled with clang on Dec 21, 2023 at git hash 4f3a57a1, RocksDB 8.7.0 at git hash 29005f0b

MyRocks 8.0.28

compiled with clang on Dec 22, 2023 at git hash 2ad105fc, RocksDB 8.7.0 at git hash 29005f0b

MyRocks 8.0.32

compiled with clang on Dec 22, 2023 at git hash 76707b44, RocksDB 8.7.0 at git hash 29005f0b

The my.cnf files are here for 5.6.35 (d1), 8.0.28 (d1 and d2) and 8.0.32 (d1 and d2). For MyRocks 8.0 the d1 configs enable the performance_schema and the d2 configs disable it. The full names for the configs are my.cnf.cy9d1_u (d1) and my.cnf.cy9d2_u (d2).

Benchmark

The benchmark is run with 24 clients to avoid over-subscribing the CPU. The server has 40 cores, 80 HW threads (hyperthreads enabled), 256G of RAM and many TB of fast SSD.

I used the updated Insert Benchmark so there are more benchmark steps described below. In order, the benchmark steps are:

l.i0

insert 20 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.

create 3 secondary indexes per table. There is one connection per client.

l.i1

use 2 connections/client. One inserts 50M rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.

l.i2

like l.i1 but each transaction modifies 5 rows (small transactions).

qr100

use 3 connections/client. One does range queries for 3600 seconds and performance is reported for this. The second does does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for a fixed amount of time. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.

qp100

lik qr100 except uses point queries on the PK index

qr500

like qr100 but the insert and delete rates are increased from 100/s to 500/s

qp500

like qp100 but the insert and delete rates are increased from 100/s to 500/s

qr1000

like qr100 but the insert and delete rates are increased from 100/s to 1000/s

qp1000

like qp100 but the insert and delete rates are increased from 100/s to 1000/s

Results

The performance report is here. It has a lot more detail including charts, tables and metrics from iostat and vmstat to help explain the performance differences.

From the summary throughput is similar between MyRocks 5.6 and 8.0. The summary has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to MyRocks 5.6.35. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time.

Another report is here and the difference is that I changed the benchmark to wait for X seconds after l.i2 to reduce writeback debt before the read-write steps start. And X is max(1200, 60 + #rows/1M). The goal is to reduce variance for the read-write benchmark steps.

From the HW efficiency metrics per benchmark step

l.i0, l.x

metrics are similar

l.i1

most metrics are similar but CPU overhead (cpupq) is ~13% larger in 8.0 and max response time (max_op) is more than 2X larger in 8.0

l.i2

most metrics are similar but CPU overhead (cpupq) is ~39% larger in 8.0 and max response time (max_op) is more than 2X larger in 8.0

qr100, qp100, qr500, qp500, qr1000, qp1000

metrics are similar

It is interesting that the CPU overhead is larger for MyRocks 8.0.28 and 8.0.32 during the l.i1 and l.i2 benchmark steps because throughput and response time distributions are similar to MyRocks 5.6.35. So the extra CPU in MyRocks 8.0 does not appear to occur in the foreground threads that processes the insert and delete statements.

The CPU overhead from compaction is 3% to 6% larger for MyRocks 8.0 -- see the CompMergeCPU(sec) column on the Sum line for 5.6.35, 8.0.28 and 8.0.32. But that can only explain a few (<= 5) usecs/operation of CPU overhead -- not the 20 to 100+ that I want to explain.

If I just look at the output from ps aux | grep mysqld I see that MyRocks 8.0 uses at least 1.2X more CPU than 5.6. But again, that can be from background activities and might not be on the path for foreground operations (queries).

For now this is a mystery to which I will return later.

Updated Insert benchmark: InnoDB, MySQL 5.6 to 8.0, large server, cached database

This has results for the Insert Benchmark using InnoDB from MySQL 5.6 to 8.0, a large server and a cached workload.

tl;dr

MySQL 8.0 is faster than 5.6 for writes because the improvements for concurrent workloads are larger than the regressions from new CPU overheads. With 8.0 the throughput is ~2X larger than 5.6 for l.i0 (inserts with no secondary indexes) and ~3X larger for l.i1/l.i2 (random inserts, random deletes).
MySQL 8.0 is slower than 5.6 for reads because there are new CPU overheads. For range queries with 8.0 the throughput is ~13% to ~40% less than 5.6 and for point queries it is ~20% less.
With MySQL 5.7 the write throughput is not as large as MySQL 8.0 but the read throughput is better.
Context matters as the results here for MySQL 8.0 are better than results on a small server. The difference is that I use a workload with high concurrency (24 clients) on the large server and low concurrency (1 client) on the small server. On both servers MySQL 8.0 pays a price from new CPU overheads. But on the large server it also gets a benefit from some of that new code.

Build + Configuration

I tested MySQL 5.6.51, 5.7.40 and 8.0.33. These were compiled from source. The builds use non-default compiler toolchains to run on production hardware so the build process is non-trivial (CMake input files have to set many things) and I am reluctant to build many versions from each major release because it takes hours to days to figure out the CMake input file. All builds use CMAKE_BUILD_TYPE =Release, -march=native and -mtune=native. The 8.0 builds might have used -DWITH_LTO =ON.

The my.cnf files are here for 5.6, for 5.7 and for 8.0.

Benchmark

The benchmark is run with 24 clients to avoid over-subscribing the CPU.

I used the updated Insert Benchmark so there are more benchmark steps described below. In order, the benchmark steps are:

l.i0

insert 20 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.

create 3 secondary indexes per table. There is one connection per client.

l.i1

use 2 connections/client. One inserts 50M rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.

l.i2

like l.i1 but each transaction modifies 5 rows (small transactions).

qr100

use 3 connections/client. One does range queries for 3600 seconds and performance is reported for this. The second does does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for a fixed amount of time. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.

qp100

lik qr100 except uses point queries on the PK index

qr500

like qr100 but the insert and delete rates are increased from 100/s to 500/s

qp500

like qp100 but the insert and delete rates are increased from 100/s to 500/s

qr1000

like qr100 but the insert and delete rates are increased from 100/s to 1000/s

qp1000

like qp100 but the insert and delete rates are increased from 100/s to 1000/s

Results

The performance report is here. It has a lot more detail including charts, tables and metrics from iostat and vmstat to help explain the performance differences.

The summary has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version on the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time.

A brief overview from the summary

MySQL 8.0 is faster than 5.6 for writes because the improvements for concurrent workloads are larger than the regressions from new CPU overheads. With 8.0 the throughput is ~2X larger than 5.6 for l.i0 (inserts with no secondary indexes) and ~3X larger for l.i1/l.i2 (random inserts, random deletes).
MySQL 8.0 is slower than 5.6 for reads because there are new CPU overheads. For range queries with 8.0 the throughput is ~13% to ~40% less than 5.6 and for point queries it is ~20% less.
MySQL 5.7 does better than 8.0 for reads but worse for writes.

Most of the performance differences can be explained by CPU overhead. See the cpupq column in the tables here. This is the total CPU time per operation where CPU time is the sum of the us and sy columns in vmstat and operations is inserts for l.i0, l.i1 and l.i2, indexed rows for l.x and queries for the qr* and qp* steps. Note that that cpupq isn't perfect -- it measures too much including the benchmark client CPU -- but it still is a useful tool.

Updates for the Insert Benchmark, December 2023

This describes the Insert Benchmark after another round of changes I recently made. Past descriptions are here and here. Source code for the benchmark is here. It is still written in Python and I hope that one day a Python JIT will arrive to reduce the overhead of the benchmark client.

The changes include:

switched from float to int for the price column
changed the marketsegment index from (price, customerid) to (productid, customerid). The other indexes are unchanged -- registersegment is on (cashregisterid, price, customerid) and pdc is on (price, dateandtime, customerid)
added a mode for the read-write benchmark step to do point queries on the PK index
changed some of the benchmark steps to do a delete per insert to avoid growing the table size. I made this change a few months ago.
add the l.i2 benchmark step that modifies fewer rows per transaction compared to l.i1
added code to reduce checkpoint (InnoDB) and compaction (RocksDB) debt that runs between the l.i2 and qr100 benchmark steps. The code for Postgres was already there.

Alas, I have yet to address coordinated omission.

Benchmark steps

The benchmark is run with X clients and usually with a client per table.

The benchmark is a sequence of steps that are run in order:

l.i0

insert Y million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.

create 3 secondary indexes per table. There is one connection per client.

l.i1

use 2 connections/client. One does inserts as fast as possible and the other does deletes at the same rate as the inserts to avoid changing the number of rows in the table. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.

l.i2

like l.i1 but each transaction modifies 5 rows (small transactions).
Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow where X is max(1200, 60 + #nrows/1M). While waiting do things to reduce writeback debt where the things are:

Postgres (see here) - do a vacuum analyze for all tables concurrently. When that finishes do a checkpoint.
InnoDB (see here) - change innodb_max_dirty_pages_pct[_lwm] to 1, change innodb_idle_flush_pct to 100. When done waiting restore them to previous values.
MyRocks (see here) - set rocksdb_force_flush_memtable_now to flush the memtable, wait 20 seconds and then set rocksdb_compact_lzero_now to flush L0. Note that rocksdb_compact_lzero_now wasn't supported until mid-2023.

qr100

use 3 connections/client. One does range queries as fast as possible and performance is reported for this. The second does does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for a fixed amount of time. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.

qp100

lik qr100 except uses point queries on the PK index

qr500

like qr100 but the insert and delete rates are increased from 100/s to 500/s

qp500

like qp100 but the insert and delete rates are increased from 100/s to 500/s

qr1000

like qr100 but the insert and delete rates are increased from 100/s to 1000/s

qp1000

like qp100 but the insert and delete rates are increased from 100/s to 1000/s

Friday, December 29, 2023

RocksDB 8.x benchmarks: large server, cached database

This post has results for performance regressions in all versions of 8.x from 8.0.0 to 8.9.2 using a large server. In a previous post I shared results for RocksDB 8.0 to 8.8 on the same hardware.

tl;dr

The regressions from 8.0 to 8.9 are small. For the non read-only benchmark steps, QPS from 8.9.2 is at most 3% less than 8.0.0 with leveled and at most 1% less with universal.

I focus on the benchmark steps that aren't read-only because they suffer less from noise. These benchmark steps are fillseq, revrangewhilewriting, fwdrangewhilewriting, readwhilewriting and overwriteandwait. I also focus on leveled more so than universal, in part because there is more noise with universal but also because the workloads I care about most use leveled.

Builds

I compiled with gcc RocksDB 8.0.0, 8.1.1, 8.2.1, 8.3.3, 8.4.4, 8.5.4, 8.6.7, 8.7.3 and 8.8.1 and 8.9.2 which are the latest patch releases.

Benchmark

All tests used a server with 40 cores, 80 HW threads, 2 sockets, 256GB of RAM and many TB of fast NVMe SSD with Linux 5.1.2, XFS and SW RAID 0 across 6 devices. For the results here, the database is cached by RocksDB. The benchmark was repeated for leveled and universal compaction.

Everything used the LRU block cache and the default value for compaction_readahead_size. Soon I will switch to using the hyper clock cache once RocksDB 9.0 arrives.

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 24 threads. How I do benchmarks for RocksDB is explained here and here. The command line to run them is:

bash x3.sh 24 no 3600 c40r256bc180 40000000 4000000000 byrx

A spreadsheet with all results is here and performance summaries are here for leveled and for universal.

Results: leveled

For the non read-only benchmark steps, QPS from 8.9.2 is at most 3% less than from 8.0.0.

Results: universal

For the non read-only benchmark steps, QPS from 8.9.2 is at most 1% less than from 8.0.0.

Sunday, December 24, 2023

Create InnoDB indexes 2X faster with this simple trick

I made parallel create index for InnoDB up to 2.4X faster by doing one of:

performance_schema=0
removing this line of code

I filed bug 113505 for this. The problem is that the parallel create index threads all call that function and there is too much contention on the counter that is incremented. The problem wasn't obvious from Linux perf when using the default metric (cycles). I then tried perf stat and IPC (instructions/cycle) was about 2X larger during create index with performance_schema=0. Finally, I repeated the perf measurements using memory system counters (cache-misses, etc) and with perf record, perf report and perf annotate the problem was obvious.

I am still deciding whether this 2X faster bug fix is more exciting then the one I found for Falcon.

Results

The results here are from a server with 2 sockets, 12 cores/socket, 64G of RAM running Ubuntu 22.04 and one m.2 device with 2TB and XFS. All tests use MySQL 8.0.35 and all of the my.cnf files are here.

The test is the create index benchmark step from the Insert Benchmark with one client, 80M rows and 3 secondary indexes created in one ALTER statement. With innodb_ddl_threads=N if N=1 then only one thread handles the index create work, otherwise it is done in parallel over N threads -- this is assuming innodb_parallel_read_threads >= innodb_ddl_threads, I am not sure I want to remember how they interact.

I tested my.cnf configs with innodb_ddl_threads and innodb_parallel_read_threads set to =1, =4, =8 and =16. Note that the default is =4. Tests were repeated with performance_schema=1 and =0 and for =1, nothing else was set for the perf schema.

The following configurations are tested:

ddl1 - innodb_ddl_threads=1, innodb_parallel_read_threads=1, performance_schema=1
ddl1+ps0 - like ddl1 but with perfomance_schema=0
ddl4 - innodb_ddl_threads=4, innodb_parallel_read_threads=4, performance_schema=1
ddl4+ps0 - like ddl4 but with perfomance_schema=0
ddl8 - innodb_ddl_threads=8, innodb_parallel_read_threads=8, performance_schema=1
ddl8+ps0 - like ddl8 but with perfomance_schema=0
ddl16 - innodb_ddl_threads=16, innodb_parallel_read_threads=16, performance_schema=1
ddl16+ps0 - like ddl16 but with perfomance_schema=0

The chart shows the time to create indexes for MySQL 8.0.35 using two builds. Note that shorter bars are better here. The as-is build is 8.0.35 unchanged and the fixed build is 8.0.35 with this line removed.

When performance_schema=0 then there is no difference between the as-is and fixed builds
When performance_schema=1 then the fixed build is ~2X faster than the as-is build

Monday, December 18, 2023

What is the cost of the performance schema in MySQL 5.7 and 8.0: small server, cached database, simple queries

This post has results from the Insert Benchmark for some MySQL 8.0.35 with InnoDB using the Insert Benchmark, a small server and a cached workload. The goal is to document the overhead of the performance schema.

tl;dr

The perf schema in 5.7 costs <= 4% of QPS
The perf schema in 8.0 costs ~10% of QPS, except for index create where the cost is much larger. I will try to explain that later.

Build + configuration

I used InnoDB with MySQL 8.0.35 and 5.7.44.

The cmake files for each of the builds are here for 5.7.44 and for 8.0.35.

By default the my.cnf files I use for MySQL 5.7 and 8.0 have performance_schema=1 but I don't otherwise enable instruments. So the tests here document the overhead from using whatever is enabled by default with the perf schema.

For 8.0.35 there were 5 variants of the build and 2 my.cnf files. I didn't test the full cross product, but there are 7 different combinations I tried that are listed below.

The 8.0.35 builds are:

my8035_rel

MySQL 8.0.35 and the rel build with CMAKE_BUILD_TYPE=Release

my8035_rel_native

MySQL 8.0.35 and the rel_native build with CMAKE_BUILD_TYPE=Release -march=native -mtune=native

my8035_rel_native_lto

MySQL 8.0.35 and the rel_native_lto build with CMAKE_BUILD_TYPE=Release -march=native -mtune=native WITH_LTO=ON

my8035_rel_less

MySQL 8.0.35 and the rel_less build with CMAKE_BUILD_TYPE=Release ENABLED_PROFILING=OFF WITH_RAPID=OFF

my8035_rel_lessps.cz10a_bee

MySQL 8.0.35 and the rel build with CMAKE_BUILD_TYPE=Release and as much as possible of the perf schema code disabled at compile time. See the cmake files linked above.

The 8.0.35 my.cnf files:

my.cnf.cz10a_bee - my new default my.cnf for the small server
my.cnf.cz10aps0_bee - same as cz10a_bee except the perf schema is disabled

I tried 7 combinations of build + configuration for 8.0.35:

my8035_rel.cz10a_bee
my8035_rel_native.cz10a_bee
my8035_rel_native_lto.cz10a_bee
my8035_rel_less.cz10a_bee
my8035_rel_lessps.cz10a_bee
my8035_rel.cz10aps0_bee
my8035_rel_lessps.cz10aps0_bee

For 5.7.44 I tried two combinations of build + configuration

my5744_rel.cz10a_bee - uses the rel build and cz10a_bee my.cnf
my5744_rel.cz10aps0_bee - uses the rel build and cz10aps0_bee my.cnf that disables the perf schema

Benchmark

The Insert Benchmark was run in one setup - a cached workload.

The benchmark used the Beelink server explained here that has 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04.

The benchmark is run with 1 client and 1 table. The benchmark is a sequence of steps.

l.i0

insert 20 million rows per table

create 3 secondary indexes. I usually ignore performance from this step.

l.i1

insert and delete another 50 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail.

q100, q500, q1000

do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.

Results

Reports are here for 5.7.44 and for 8.0.35.

The Summary sections linked below have tables for absolute and relative throughput. The relative throughput is (QPS for my version / QPS for base version). The base version is my5744_rel.cz10a_bee for the 5.7.44 report and my8035_rel.cz10a_bee for the 8.0.35 report.

The summaries linked below have three tables. The first table shows the absolute throughput (queries or writes/s). The second table shows the throughput for a given version relative to MySQL 5.6.21 (a value less than 1.0 means the version is slower). The third table shows the average background insert rate during q100, q500 and q1000. All versions sustained the targets, so that table can be ignored here.

By write-heavy I mean the l.i0, l.x and l.i1 benchmark steps. By read-heavy I mean the q100, q500 and q1000 benchmark steps. Below I use red and green to indicate when the relative QPS is down or up by more than 5% and grey otherwise.

From the Summary for 5.7.44

relative QPS is relative to my5744_rel.cz10a_bee
relative QPS for my5744_rel.cz10aps0_bee is (1.03, 1.12, 1.03) and (1.02, 1.03, 1.04) for write- and read-heavy. So throughput benefit from disabling the performance schema is <= 4% except for the l.x benchmark step that creates indexes.

From the Summary for 8.0.35

relative QPS is relative to my8035_rel.cz10a_bee
my8035_rel_native.cz10a_bee and my8035_rel_less.cz10a_bee have performance similar to the base case
my8035_rel_native_lto.cz10a_bee gets between 4% and 9% more QPS than the base case
builds that disable the perf schema in my.cnf or at compile time get ~1.07X more QPS for l.i0 and l.i1, ~1.5X more throughput for index create and between 1.11X and 1.15X more QPS for read-heavy. I will soon get flamegraphs to explain some of this.

Perf regressions in MySQL from 5.6 to 8.0 using the Insert Benchmark and a small server, Dec23: part 1

This post has results from the Insert Benchmark for some MySQL 5.6 releases, all 5.7 releases and all 8.0 releases using a small server and a cached workload. It follows up on work I published in October. The differences are that I updated the my.cnf files to make them more similar across releases and I added results for pre-GA releases from 5.7 and 8.0.

tl;dr

For 4 of the 6 benchmark steps, MySQL 8.0.35 gets about 60% of the throughput relative to 5.6.51 with simple queries and a cached database. The regressions for random writes (l.i1) is less and index create (l.x) is faster in 8.0 than in 5.6.
For MySQL 5.7 about half of the perf regressions occur between the last 5.6 release (5.6.51) and the first 5.7 release that I was able to compile (5.7.5). The rest are spread evenly across 5.7.
For MySQL 8.0 almost all of the perf regressions are spread across 8.0 releases from 8.0.0 to 8.0.35.
For InnoDB there is at least one significant regression arriving in 8.0.30+ related to which code gets inlined by the compiler. This is somewhat obfuscated because two perf bugs that hurt this benchmark were fixed after 8.0.28. For more details see PS-8822.

Builds

It isn't easy to build older code on newer systems, compilers, etc. Notes on that are here for 5.6, for 5.6 and 5.7, for 5.7 and for 8.0. A note on using cmake is here. The rel builds were used as everything was compiled using CMAKE_BUILD_TYPE=Release.

Tests were done for:

5.6 - 5.6.21, 5.6.51
5.7 - 5.7.5 thru 5.7.10, 5.7.20, 5.7.30, 5.7.44
8.0 - 8.0.0 thru 8.0.5, 8.0.11 thru 8.0.14, 8.0.20, 8.0.22, 8.0.28, 8.0.30, 8.0.34, 8.0.35

I used the cz10a_bee config and it is here for

5.6 - all point releases
5.7 - 5.7.5 and 5.7.6, 5.7.7, 5.7.8 to 5.7.44
8.0 - 8.0.0 to 8.0.2, 8.0.3 to 8.0.4, 8.0.11 to 8.0.18, 8.0.19 to 8.0.35

Benchmark

The Insert Benchmark was run in one setup - a cached workload.

The benchmark used the Beelink server explained here that has 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04.

The benchmark is run with 1 client and 1 table. The benchmark is a sequence of steps.

l.i0

insert 20 million rows per table

create 3 secondary indexes. I usually ignore performance from this step.

l.i1

insert and delete another 50 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail.

q100, q500, q1000

do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.

Results

The Summary sections linked below have tables for absolute and relative throughput. Absolute throughput is just the QPS per version. The relative throughput is (QPS for my version / QPS for base version). The base version is 5.6.21 for the 5.6-only report, 5.7.5 for the 5.7-only report, 8.0.0 for the 8.0-only report and 5.6.21 for the report that compares 5.6, 5.7 and 8.0.

Reports are here for 5.6, 5.7, 8.0, 5.6 thru 8.0.

From the Summary for 5.6, 5.7 and 8.0

relative QPS is relative to MySQL 5.6.21
relative QPS in 5.7.44 is (0.82, 1.35, 1.10) and (0.73, 0.72, 0.72) for write- and read-heavy

There are perf regressions for the l.i0, q100, q500, and q1000 benchmark steps. About half of that is from the 5.6 to 5.7 transition and the rest is a gradual loss during across 5.7.
For the l.x benchmark step perf improves from 5.6 to 5.7 and then gradually declines across 5.7.
For the l.i1 benchmark step perf improves from 5.6 to 5.7 and is steady across 5.7

relative QPS in 8.0.35 is (0.55, 1.24, 0.89) and (0.62, 0.61, 0.62) for write- and read-heavy

Perf in 8.0.0 is close to 5.7.44 and 8.0.0 is the first (pre-GA) 8.0 release
There are perf regressions for all benchmark steps. These are gradual across 8.0.

From the Summary for 8.0

relative QPS is relative to MySQL 8.0.0
relative QPS in 8.0.35 is (0.69, 0.92, 0.84) and (0.84, 0.85, 0.85) for write- and read-heavy

From the Summary for 5.7

relative QPS is relative to MySQL 5.7.5
relative QPS in 5.7.44 is (0.92, 0.83, 1.02) and (0.88, 0.87, 0.88) for write- and read-heavy

From the Summary for 5.6

relative QPS is relative to MySQL 5.6.21
relative QPS in 5.6.51 is (0.96, 1.02, 0.97) and (0.98, 0.97, 0.98) for write- and read-heavy

Monday, November 27, 2023

Explaining changes in MySQL performance via hardware perf counters: part 4

This is part 4 of my series on using HW counters from Linux perf to explain why MySQL gets slower from 5.6 to 8.0. Refer to part 1 for an overview.

What happened in MySQL 5.7 and 8.0 to put so much more stress on the memory system?

tl;dr

It looks like someone sprinkled magic go slower dust across most of the MySQL code because the slowdown from MySQL 5.6 to 8.0 is not isolated to a few call stacks.
MySQL 8.0 uses ~1.5X more instructions/operation than 5.6. Cache activity (references, loads, misses are frequently up 1.5X or more. TLB activity (loads, misses) is frequently up 2X to 5X with the iTLB being a bigger problem than the dTLB.
innodb_log_writer_threads=ON is worse than I thought. I will soon have a post on that.

Too long to be a tl;dr

I don't have much experience using Linux perf counters to explain performance changes.
There are huge increases for data TLB, instruction TLB and L1 cache counters. From the Xeon CPU (socket2 below) the change from MySQL 5.6.21 to 8.0.34 measured as events/query are:

branches, branch misses: up ~1.5X, ~1.2X
cache references: up ~2.8X
instructions: up ~1.5X
dTLB loads, load-misses, stores: up ~1.6X, ~1.9X, ~1.8X
iTLB loads, load-misses: up ~2.1X, ~2.4X
L1 data cache loads, load-misses, stores: up ~1.6X, ~1.5X, ~1.8X
L1 instruction cache load-misses: up ~1.9X
LLC loads, stores: up ~1.6X, ~1.9X
Context switches and CPU migrations are down

For many of the HW counters the biggest jumps occur between the last point release in 5.6 and the first in 5.7 and then again between the last in 5.7 and the first in 8.0. Perhaps this is good news because it means the problems are not spread across every point release.

The posts in this series are:

part 1 - introduction and results for the l.i0 (initial load) benchmark step
part 2 - results for the l.i1 (write-only) benchmark step
part 3 - results for the q100 (read+write) benchmark step
part 4 - (this post) results for the q1000 (read+write) benchmark step

Performance

The charts below show average throughput (QPS, or really operations/s) for the l.i0 benchmark step.

The benchmark uses 1 client for the small servers (beelink, ser7) and 12 clients for the big server (socket2).
MySQL 8.0 is much slower than 5.6 on the small servers (beelink, ser7) and slightly slower on the big server (socket2).

Results

Refer to the Results section in part 1 to understand what is displayed on the charts. The y-axis frequently does not start at zero to improve readability. But this also makes it harder to compare adjacent graphs.

Below when I write up 30% that is the same as up 1.3X. I switch from the former to the latter when the increase is larger than 99%.

Spreadsheets are here for beelink, ser7 and socket2. See the Servers section above to understand the HW for beelink, ser7 and socket2.

Results: branches and branch misses

Summary:

the Results section above explains the y-axis
on beelink

from 5.6.51 to 5.7.10 branches up 24%, branch-misses up 49%
from 5.7.43 to 8.0.34 branches up 32%, branch-misses up 71%

on ser7

from 5.6.51 to 5.7.10 branches up 27%, branch-misses up 36%
from 5.7.43 to 8.0.34 branches up 20%, branch-misses up 53%

on socket2

from 5.6.51 to 5.7.10 branches up 14%, branch-misses up 6%
from 5.7.43 to 8.0.34 branches up 17%, branch-misses up 12%

Results: cache references and misses

Summary:

the Results section above explains the y-axis
on beelink

from 5.6.51 to 5.7.10 references up 35%, misses up 30%
from 5.7.43 to 8.0.34 references up 42%, misses up 41%

on ser7

from 5.6.51 to 5.7.10 references up 43%, misses up 2.3X
from 5.7.43 to 8.0.34 references up 44%, misses up 94%

on socket2

from 5.6.51 to 5.7.10 references up 44%, misses down 2%
from 5.7.43 to 8.0.34 references up 63%, misses down 2%

Results: cycles, instructions, CPI

Summary:

the Results section above explains the y-axis
on beelink

from 5.6.51 to 5.7.10 cycles up 35%, instructions up 31%, cpi up 2%
from 5.7.43 to 8.0.34 cycles up 48%, instructions up 36%, cpi up 9%

on ser7

from 5.6.51 to 5.7.10 cycles up 51%, instructions up 38%, cpi up 10%
from 5.7.43 to 8.0.34 cycles up 52%, instructions up 28%, cpi up 17%

on socket2

from 5.6.51 to 5.7.10 cycles up 8%, instructions up 15%, cpi down 6%
from 5.7.43 to 8.0.34 cycles up 9%, instructions up 19%, cpi down 9%

Results: dTLB

Summary:

the Results section above explains the y-axis
on beelink

from 5.6.51 to 5.7.10 dTLB-loads up 62%, dTLB-load-misses up 44%
from 5.7.43 to 8.0.34 dTLB-loads up 55%, dTLB-load-misses up 52%

on ser7

from 5.6.51 to 5.7.10 dTLB-loads up 66%, dTLB-load-misses up 23%
from 5.7.43 to 8.0.34 dTLB-loads up 53%, dTLB-load-misses up 28%

on socket2

loads

from 5.6.51 to 5.7.10 dTLB-loads up 17%, dTLB-load-misses up 9%
from 5.7.43 to 8.0.34 dTLB-loads up 23%, dTLB-load-misses up 50%

stores

from 5.6.51 to 5.7.10 dTLB-stores up 22%, dTLB-store-misses down 33%
from 5.7.43 to 8.0.34 dTLB-stores up 30%, dTLB-store-misses up 91%

Results: iTLB

Summary:

the Results section above explains the y-axis
on beelink

from 5.6.51 to 5.7.10 iTLB-loads up 65%, iTLB-load-misses up 3.1X
from 5.7.43 to 8.0.34 iTLB-loads up 54%, iTLB-load-misses up 2.7X

on ser7

from 5.6.51 to 5.7.10 iTLB-loads up 72%, iTLB-load-misses up 1.9X
from 5.7.43 to 8.0.34 iTLB-loads up 21%, iTLB-load-misses up 2.5X

on socket2

from 5.6.51 to 5.7.10 iTLB-loads up 28%, iTLB-load-misses up 25%
from 5.7.43 to 8.0.34 iTLB-loads up 27%, iTLB-load-misses up 2.1X

Results: L1 cache

Summary:

the Results section above explains the y-axis
on beelink

dcache

from 5.6.51 to 5.7.10 loads up 39%, load-misses up 32%
from 5.7.43 to 8.0.34 loads up 45%, load-misses up 43%

icache

from 5.6.51 to 5.7.10 loads-misses up 25%
from 5.7.43 to 8.0.34 loads-misses up 49%

on ser7

dcache

from 5.6.51 to 5.7.10 loads up 29%, load-misses up 19%
from 5.7.43 to 8.0.34 loads up 6%, load-misses up 5%

icache

from 5.6.51 to 5.7.10 loads-misses up 29%
from 5.7.43 to 8.0.34 loads-misses up 17%

on socket2

dcache

from 5.6.51 to 5.7.10 loads up 18%, load-misses up 11%, stores up 23%
from 5.7.43 to 8.0.34 loads up 22%, load-misses up 19%, stores up 29%

icache

from 5.6.51 to 5.7.10 loads-misses up 25%
from 5.7.43 to 8.0.34 loads-misses up 37%

Results: LLC

The LLC counters were only supported in the socket2 CPU.

Summary:

the Results section above explains the y-axis
on socket2

loads

from 5.6.51 to 5.7.10 loads up 5%, load-misses down 13%
from 5.7.43 to 8.0.34 loads up 27%, load-misses down 4%

stores

from 5.6.51 to 5.7.10 stores are flat, store-misses down 28%
from 5.7.43 to 8.0.34 stores up 56%, store-misses up 22%

Results: context switches

Summary:

on beelink

from 5.6.21 to 5.7.10 context switches down 5%
from 5.7.43 to 8.0.34 context switches up 30%

on ser7

from 5.6.21 to 5.7.10 context switches up 12%
from 5.7.43 to 8.0.34 context switches up 8%

on socket2

from 5.6.21 to 5.7.10 context switches down 18%
from 5.7.43 to 8.0.34 context switches up 21%

Results: CPU migrations

Summary:

on beelink

from 5.6.21 to 5.7.10 CPU migrations up 24%
from 5.7.43 to 8.0.34 CPU migrations up 6.1X

on ser7

from 5.6.21 to 5.7.10 CPU migrations up 4.5X
from 5.7.43 to 8.0.34 CPU migrations up 2.4X

on socket2

from 5.6.21 to 5.7.10 CPU migrations down 44%
from 5.7.43 to 8.0.34 CPU migrations up 46%

Explaining changes in MySQL performance via hardware perf counters: part 3

This is part 3 of my series on using HW counters from Linux perf to explain why MySQL gets slower from 5.6 to 8.0. Refer to part 1 for an overview.

What happened in MySQL 5.7 and 8.0 to put so much more stress on the memory system?

tl;dr

It looks like someone sprinkled magic go slower dust across most of the MySQL code because the slowdown from MySQL 5.6 to 8.0 is not isolated to a few call stacks.
MySQL 8.0 uses ~1.5X more instructions/operation than 5.6. Cache activity (references, loads, misses are frequently up 1.5X or more. TLB activity (loads, misses) is frequently up 2X to 5X with the iTLB being a bigger problem than the dTLB.
innodb_log_writer_threads=ON is worse than I thought. I will soon have a post on that.

Too long to be a tl;dr

I don't have much experience using Linux perf counters to explain performance changes.
There are huge increases for data TLB, instruction TLB and L1 cache counters. From the Xeon CPU (socket2 below) the change from MySQL 5.6.21 to 8.0.34 measured as events/query are:

branches, branch misses: up ~1.6X, ~1.5X
cache references: up ~3.7X
instructions: up ~1.5X
dTLB loads, load-misses, stores: up ~1.6X, ~2.5X, ~1.7X
iTLB loads, load-misses: up ~2.0X, ~2.8X
L1 data cache loads, load-misses, stores: up ~1.6X, ~1.5X, ~1.7X
L1 instruction cache load-misses: up ~1.9X
LLC loads, stores, store-misses: up ~2.2X, ~2.7X, ~1.3X
Context switches are flat, CPU migrations down

For many of the HW counters the biggest jumps occur between the last point release in 5.6 and the first in 5.7 and then again between the last in 5.7 and the first in 8.0. Perhaps this is good news because it means the problems are not spread across every point release.

The posts in this series are:

part 1 - introduction and results for the l.i0 (initial load) benchmark step
part 2 - results for the l.i1 (write-only) benchmark step
part 3 - (this post) results for the q100 (read+write) benchmark step
part 4 - results for the q1000 (read+write) benchmark step

Performance

The charts below show average throughput (QPS, or really operations/s) for the l.i0 benchmark step.

The benchmark uses 1 client for the small servers (beelink, ser7) and 12 clients for the big server (socket2).
MySQL 8.0 is much slower than 5.6 on the small servers (beelink, ser7) and slightly slower on the big server (socket2).