Sunday, December 31, 2023

Updated Insert benchmark: MyRocks 5.6 and 8.0, large server, cached database

This has results for the Insert Benchmark using MyRocks 5.6 and 8.0 with a large server and a cached workload. 

tl;dr

  • performance is similar between MyRocks 5.6.35, 8.0.28 and 8.0.32, unlike what I see with InnoDB using the same setup, where there are large changes, both good (for writes) and bad (for reads)
Build + Configuration

I tested MyRocks 5.6.35, 8.0.28 and 8.0.32. These were compiled from source. All builds use CMAKE_BUILD_TYPE=Release.
  • MyRocks 5.6.35
    • compiled with clang on Dec 21, 2023 at git hash 4f3a57a1, RocksDB 8.7.0 at git hash 29005f0b
  • MyRocks 8.0.28
    • compiled with clang on Dec 22, 2023 at git hash 2ad105fc, RocksDB 8.7.0 at git hash 29005f0b
  • MyRocks 8.0.32
    • compiled with clang on Dec 22, 2023 at git hash 76707b44, RocksDB 8.7.0 at git hash 29005f0b
The my.cnf files are here for 5.6.35 (d1), 8.0.28 (d1 and d2) and 8.0.32 (d1 and d2). For MyRocks 8.0 the d1 configs enable the performance_schema and the d2 configs disable it. The full names for the configs are my.cnf.cy9d1_u (d1) and my.cnf.cy9d2_u (d2).

Benchmark

The benchmark is run with 24 clients to avoid over-subscribing the CPU. The server has 40 cores, 80 HW threads (hyperthreads enabled), 256G of RAM and many TB of fast SSD.

I used the updated Insert Benchmark so there are more benchmark steps described below. In order, the benchmark steps are:

  • l.i0
    • insert 20 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts 50M rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions).
  • qr100
    • use 3 connections/client. One does range queries for 3600 seconds and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for a fixed amount of time. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except it uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
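The SLA rule for the read-write steps can be sketched in Python. This is a sketch of my understanding of the rule; the tolerance value is an assumption for illustration, not the benchmark's actual check:

```python
def sustained_target(target_rate, inserts_done, runtime_secs, tolerance=0.01):
    """Return True if the background insert rate met the target.

    target_rate: target inserts/s (e.g. 100 * number of clients for qr100)
    tolerance: allowed shortfall -- an assumption for this sketch, the real
    benchmark may use a different rule.
    """
    actual_rate = inserts_done / runtime_secs
    return actual_rate >= target_rate * (1.0 - tolerance)

# With 24 clients at 100 inserts/s each for 3600 seconds the target is
# 2400/s, so 8,640,000 inserts are expected.
print(sustained_target(2400, 8_640_000, 3600))  # True
print(sustained_target(2400, 8_000_000, 3600))  # False (~2222/s)
```

When the target is sustained, every system does the same amount of background write work during the step, which keeps the query results comparable.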

Results

The performance report is here. It has a lot more detail including charts, tables and metrics from iostat and vmstat to help explain the performance differences.

From the summary throughput is similar between MyRocks 5.6 and 8.0. The summary has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to MyRocks 5.6.35. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time.

Another report is here and the difference is that I changed the benchmark to wait for X seconds after l.i2 to reduce writeback debt before the read-write steps start, where X is max(1200, 60 + #rows/1M). The goal is to reduce variance for the read-write benchmark steps.
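The wait formula is easy to misread, so here it is as Python (nrows is the number of rows in the table):

```python
def wait_seconds(nrows):
    # X = max(1200, 60 + #rows/1M), i.e. at least 20 minutes, and longer
    # once the table has more than ~1.14B rows
    return max(1200, 60 + nrows // 1_000_000)

print(wait_seconds(20_000_000))     # 1200 (60 + 20 is below the floor)
print(wait_seconds(2_000_000_000))  # 2060
```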

From the HW efficiency metrics per benchmark step
  • l.i0, l.x
    • metrics are similar
  • l.i1
    • most metrics are similar but CPU overhead (cpupq) is ~13% larger in 8.0 and max response time (max_op) is more than 2X larger in 8.0
  • l.i2
    • most metrics are similar but CPU overhead (cpupq) is ~39% larger in 8.0 and max response time (max_op) is more than 2X larger in 8.0
  • qr100, qp100, qr500, qp500, qr1000, qp1000
    • metrics are similar
It is interesting that the CPU overhead is larger for MyRocks 8.0.28 and 8.0.32 during the l.i1 and l.i2 benchmark steps because throughput and response time distributions are similar to MyRocks 5.6.35. So the extra CPU in MyRocks 8.0 does not appear to occur in the foreground threads that process the insert and delete statements. 

The CPU overhead from compaction is 3% to 6% larger for MyRocks 8.0 -- see the CompMergeCPU(sec) column on the Sum line for 5.6.35, 8.0.28 and 8.0.32. But that can only explain a few (<= 5) usecs/operation of CPU overhead -- not the 20 to 100+ that I want to explain. 
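The arithmetic behind that claim can be sketched with assumed numbers (2000 CPU seconds of compaction for 50M inserts is my illustration, not a measurement from the report):

```python
def extra_usecs_per_op(base_cpu_secs, new_cpu_secs, operations):
    # convert extra background CPU seconds into microseconds per operation
    return (new_cpu_secs - base_cpu_secs) * 1_000_000 / operations

# If 5.6 used 2000s of compaction CPU over 50M inserts and 8.0 used 6% more
# (2120s), the extra cost is only a few usecs per insert.
print(extra_usecs_per_op(2000, 2120, 50_000_000))  # 2.4
```

A few usecs/operation from compaction cannot account for a 20 to 100+ usec/operation difference, which is why the extra CPU remains unexplained.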

If I just look at the output from ps aux | grep mysqld I see that MyRocks 8.0 uses at least 1.2X more CPU than 5.6. But again, that can be from background activities and might not be on the path for foreground operations (queries).

For now this is a mystery to which I will return later.












Updated Insert benchmark: InnoDB, MySQL 5.6 to 8.0, large server, cached database

This has results for the Insert Benchmark using InnoDB from MySQL 5.6 to 8.0, a large server and a cached workload. 

tl;dr

  • MySQL 8.0 is faster than 5.6 for writes because the improvements for concurrent workloads are larger than the regressions from new CPU overheads. With 8.0 the throughput is ~2X larger than 5.6 for l.i0 (inserts with no secondary indexes) and ~3X larger for l.i1/l.i2 (random inserts, random deletes).
  • MySQL 8.0 is slower than 5.6 for reads because there are new CPU overheads. For range queries with 8.0 the throughput is ~13% to ~40% less than 5.6 and for point queries it is ~20% less.
  • With MySQL 5.7 the write throughput is not as good as with 8.0 but the read throughput is better. 
  • Context matters as the results here for MySQL 8.0 are better than results on a small server. The difference is that I use a workload with high concurrency (24 clients) on the large server and low concurrency (1 client) on the small server. On both servers MySQL 8.0 pays a price from new CPU overheads. But on the large server it also gets a benefit from some of that new code.
Build + Configuration

I tested MySQL 5.6.51, 5.7.40 and 8.0.33. These were compiled from source. The builds use non-default compiler toolchains to run on production hardware so the build process is non-trivial (CMake input files have to set many things) and I am reluctant to build many versions from each major release because it takes hours to days to figure out the CMake input file. All builds use CMAKE_BUILD_TYPE=Release, -march=native and -mtune=native. The 8.0 builds might have used -DWITH_LTO=ON.

The my.cnf files are here for 5.6, for 5.7 and for 8.0.

Benchmark

The benchmark is run with 24 clients to avoid over-subscribing the CPU.

I used the updated Insert Benchmark so there are more benchmark steps described below. In order, the benchmark steps are:

  • l.i0
    • insert 20 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts 50M rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions).
  • qr100
    • use 3 connections/client. One does range queries for 3600 seconds and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for a fixed amount of time. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except it uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s

Results

The performance report is here. It has a lot more detail including charts, tables and metrics from iostat and vmstat to help explain the performance differences.

The summary has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version on the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time.

A brief overview from the summary

  • MySQL 8.0 is faster than 5.6 for writes because the improvements for concurrent workloads are larger than the regressions from new CPU overheads. With 8.0 the throughput is ~2X larger than 5.6 for l.i0 (inserts with no secondary indexes) and ~3X larger for l.i1/l.i2 (random inserts, random deletes).
  • MySQL 8.0 is slower than 5.6 for reads because there are new CPU overheads. For range queries with 8.0 the throughput is ~13% to ~40% less than 5.6 and for point queries it is ~20% less.
  • MySQL 5.7 does better than 8.0 for reads but worse for writes.
Most of the performance differences can be explained by CPU overhead. See the cpupq column in the tables here. This is the total CPU time per operation where CPU time is the sum of the us and sy columns in vmstat and operations is inserts for l.i0, l.i1 and l.i2, indexed rows for l.x and queries for the qr* and qp* steps. Note that cpupq isn't perfect -- it measures too much, including the benchmark client CPU -- but it still is a useful tool.
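A sketch of how a metric like cpupq can be derived from vmstat samples. This is my reconstruction of the definition in the text, not the benchmark's actual code:

```python
def cpupq(vmstat_us_sy, interval_secs, num_cpus, operations):
    """CPU microseconds per operation from vmstat (us, sy) samples.

    vmstat reports us and sy as percentages of all CPUs, so each sample
    contributes pct/100 * num_cpus * interval CPU-seconds.
    """
    cpu_secs = (sum(us + sy for us, sy in vmstat_us_sy) / 100.0
                * num_cpus * interval_secs)
    return cpu_secs * 1_000_000 / operations

# 10 one-second samples at 50% us + 10% sy on an 80-thread server while
# 1M inserts complete: 480 CPU-seconds -> 480 usecs/insert.
samples = [(50, 10)] * 10
print(cpupq(samples, 1, 80, 1_000_000))  # 480.0
```

Because the client runs on the same host, its CPU time is included in us/sy, which is the imperfection mentioned above.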

Updates for the Insert Benchmark, December 2023

This describes the Insert Benchmark after another round of changes I recently made. Past descriptions are here and here. Source code for the benchmark is here. It is still written in Python and I hope that one day a Python JIT will arrive to reduce the overhead of the benchmark client.

The changes include:

  • switched from float to int for the price column
  • changed the marketsegment index from (price, customerid) to (productid, customerid). The other indexes are unchanged -- registersegment is on (cashregisterid, price, customerid) and pdc is on (price, dateandtime, customerid)
  • added a mode for the read-write benchmark step to do point queries on the PK index
  • changed some of the benchmark steps to do a delete per insert to avoid growing the table size. I made this change a few months ago.
  • add the l.i2 benchmark step that modifies fewer rows per transaction compared to l.i1
  • added code to reduce checkpoint (InnoDB) and compaction (RocksDB) debt that runs between the l.i2 and qr100 benchmark steps. The code for Postgres was already there.
Alas, I have yet to address coordinated omission.

Benchmark steps

The benchmark is run with X clients and usually with a client per table.

The benchmark is a sequence of steps that are run in order:
  • l.i0
    • insert Y million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One does inserts as fast as possible and the other does deletes at the same rate as the inserts to avoid changing the number of rows in the table. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions).
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow where X is max(1200, 60 + #nrows/1M). While waiting do things to reduce writeback debt where the things are:
      • Postgres (see here) - do a vacuum analyze for all tables concurrently. When that finishes do a checkpoint.
      • InnoDB (see here) - change innodb_max_dirty_pages_pct[_lwm] to 1, change innodb_idle_flush_pct to 100. When done waiting restore them to previous values.
      • MyRocks (see here) - set rocksdb_force_flush_memtable_now to flush the memtable, wait 20 seconds and then set rocksdb_compact_lzero_now to flush L0. Note that rocksdb_compact_lzero_now wasn't supported until mid-2023.
  • qr100
    • use 3 connections/client. One does range queries as fast as possible and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for a fixed amount of time. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except it uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
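The per-engine debt-reduction actions described under l.i2 can be summarized in a small sketch. The statement and variable names come from the text above; the ordering, timing and value restoration are simplified relative to what the benchmark scripts actually do:

```python
def debt_reduction_sql(engine):
    """Return the statements used to reduce writeback debt between
    l.i2 and qr100, per the steps described above (a sketch)."""
    if engine == "postgres":
        # vacuum analyze runs for all tables concurrently, then checkpoint
        return ["VACUUM (ANALYZE)", "CHECKPOINT"]
    if engine == "innodb":
        return [
            "SET GLOBAL innodb_max_dirty_pages_pct_lwm = 1",
            "SET GLOBAL innodb_max_dirty_pages_pct = 1",
            "SET GLOBAL innodb_idle_flush_pct = 100",
            # previous values are restored when the wait ends
        ]
    if engine == "myrocks":
        return [
            "SET GLOBAL rocksdb_force_flush_memtable_now = 1",
            # ... wait ~20 seconds for the memtable flush ...
            "SET GLOBAL rocksdb_compact_lzero_now = 1",  # needs mid-2023+ MyRocks
        ]
    raise ValueError(f"no debt-reduction recipe for {engine}")

print(debt_reduction_sql("myrocks")[0])
```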

Friday, December 29, 2023

RocksDB 8.x benchmarks: large server, cached database

This post has results for performance regressions in all versions of 8.x from 8.0.0 to 8.9.2 using a large server. In a previous post I shared results for RocksDB 8.0 to 8.8 on the same hardware.

tl;dr

  • The regressions from 8.0 to 8.9 are small. For the non read-only benchmark steps, QPS from 8.9.2 is at most 3% less than 8.0.0 with leveled and at most 1% less with universal.
I focus on the benchmark steps that aren't read-only because they suffer less from noise.  These benchmark steps are fillseq, revrangewhilewriting, fwdrangewhilewriting, readwhilewriting and overwriteandwait. I also focus on leveled more so than universal, in part because there is more noise with universal but also because the workloads I care about most use leveled.

Builds

I compiled RocksDB 8.0.0, 8.1.1, 8.2.1, 8.3.3, 8.4.4, 8.5.4, 8.6.7, 8.7.3, 8.8.1 and 8.9.2 with gcc; these are the latest patch releases.

Benchmark

All tests used a server with 40 cores, 80 HW threads, 2 sockets, 256GB of RAM and many TB of fast NVMe SSD with Linux 5.1.2, XFS and SW RAID 0 across 6 devices. For the results here, the database is cached by RocksDB. The benchmark was repeated for leveled and universal compaction.

Everything used the LRU block cache and the default value for compaction_readahead_size. Soon I will switch to using the hyper clock cache once RocksDB 9.0 arrives.

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 24 threads. How I do benchmarks for RocksDB is explained here and here. The command line to run them is: 
bash x3.sh 24 no 3600 c40r256bc180 40000000 4000000000 byrx
A spreadsheet with all results is here and performance summaries are here for leveled and for universal.

Results: leveled

For the non read-only benchmark steps, QPS from 8.9.2 is at most 3% less than from 8.0.0.
Results: universal

For the non read-only benchmark steps, QPS from 8.9.2 is at most 1% less than from 8.0.0.


Sunday, December 24, 2023

Create InnoDB indexes 2X faster with this simple trick

I made parallel create index for InnoDB up to 2.4X faster by doing one of: disable the performance schema (performance_schema=0) or remove the line of code that increments the contended counter.

I filed bug 113505 for this. The problem is that the parallel create index threads all call that function and there is too much contention on the counter that is incremented. The problem wasn't obvious from Linux perf when using the default metric (cycles). I then tried perf stat and IPC (instructions/cycle) was about 2X larger during create index with performance_schema=0. Finally, I repeated the perf measurements using memory system counters (cache-misses, etc) and with perf record, perf report and perf annotate the problem was obvious.

I am still deciding whether this 2X faster bug fix is more exciting than the one I found for Falcon.

Results

The results here are from a server with 2 sockets, 12 cores/socket, 64G of RAM running Ubuntu 22.04 and one m.2 device with 2TB and XFS. All tests use MySQL 8.0.35 and all of the my.cnf files are here.

The test is the create index benchmark step from the Insert Benchmark with one client, 80M rows and 3 secondary indexes created in one ALTER statement. With innodb_ddl_threads=N, if N=1 then only one thread handles the index create work; otherwise it is done in parallel over N threads. This assumes innodb_parallel_read_threads >= innodb_ddl_threads -- I am not sure I want to remember how they interact.

I tested my.cnf configs with innodb_ddl_threads and innodb_parallel_read_threads set to =1, =4, =8 and =16. Note that the default is =4. Tests were repeated with performance_schema=1 and =0 and for =1, nothing else was set for the perf schema.

The following configurations are tested:
  • ddl1 - innodb_ddl_threads=1, innodb_parallel_read_threads=1, performance_schema=1
  • ddl1+ps0 - like ddl1 but with performance_schema=0
  • ddl4 - innodb_ddl_threads=4, innodb_parallel_read_threads=4, performance_schema=1
  • ddl4+ps0 - like ddl4 but with performance_schema=0
  • ddl8 - innodb_ddl_threads=8, innodb_parallel_read_threads=8, performance_schema=1
  • ddl8+ps0 - like ddl8 but with performance_schema=0
  • ddl16 - innodb_ddl_threads=16, innodb_parallel_read_threads=16, performance_schema=1
  • ddl16+ps0 - like ddl16 but with performance_schema=0
The chart shows the time to create indexes for MySQL 8.0.35 using two builds. Note that shorter bars are better here. The as-is build is 8.0.35 unchanged and the fixed build is 8.0.35 with this line removed.
  • When performance_schema=0 then there is no difference between the as-is and fixed builds
  • When performance_schema=1 then the fixed build is ~2X faster than the as-is build


Monday, December 18, 2023

What is the cost of the performance schema in MySQL 5.7 and 8.0: small server, cached database, simple queries

This post has results from the Insert Benchmark for MySQL 5.7.44 and 8.0.35 with InnoDB using a small server and a cached workload. The goal is to document the overhead of the performance schema.

tl;dr

  • The perf schema in 5.7 costs <= 4% of QPS
  • The perf schema in 8.0 costs ~10% of QPS, except for index create where the cost is much larger. I will try to explain that later. 
Build + configuration

I used InnoDB with MySQL 8.0.35 and 5.7.44.

The cmake files for each of the builds are here for 5.7.44 and for 8.0.35.

By default the my.cnf files I use for MySQL 5.7 and 8.0 have performance_schema=1 but I don't otherwise enable instruments. So the tests here document the overhead from using whatever is enabled by default with the perf schema.

For 8.0.35 there were 5 variants of the build and 2 my.cnf files. I didn't test the full cross product, but there are 7 different combinations I tried that are listed below.

The 8.0.35 builds are: 

  • my8035_rel
    • MySQL 8.0.35 and the rel build with CMAKE_BUILD_TYPE=Release
  • my8035_rel_native
    • MySQL 8.0.35 and the rel_native build with CMAKE_BUILD_TYPE=Release -march=native -mtune=native
  • my8035_rel_native_lto
    • MySQL 8.0.35 and the rel_native_lto build with CMAKE_BUILD_TYPE=Release -march=native -mtune=native WITH_LTO=ON
  • my8035_rel_less
    • MySQL 8.0.35 and the rel_less build with CMAKE_BUILD_TYPE=Release ENABLED_PROFILING=OFF WITH_RAPID=OFF
  • my8035_rel_lessps
    • MySQL 8.0.35 and the rel build with CMAKE_BUILD_TYPE=Release and as much as possible of the perf schema code disabled at compile time. See the cmake files linked above.
The 8.0.35 my.cnf files are cz10a_bee (performance_schema=1) and cz10aps0_bee (performance_schema=0).

I tried 7 combinations of build + configuration for 8.0.35:

  • my8035_rel.cz10a_bee
  • my8035_rel_native.cz10a_bee
  • my8035_rel_native_lto.cz10a_bee
  • my8035_rel_less.cz10a_bee
  • my8035_rel_lessps.cz10a_bee
  • my8035_rel.cz10aps0_bee
  • my8035_rel_lessps.cz10aps0_bee
For 5.7.44 I tried two combinations of build + configuration
  • my5744_rel.cz10a_bee - uses the rel build and cz10a_bee my.cnf
  • my5744_rel.cz10aps0_bee - uses the rel build and cz10aps0_bee my.cnf that disables the perf schema

Benchmark

The Insert Benchmark was run in one setup - a cached workload.

The benchmark used the Beelink server explained here that has 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04. 

The benchmark is run with 1 client and 1 table. The benchmark is a sequence of steps.

  • l.i0
    • insert 20 million rows per table
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 50 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail. 
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.
Results

Reports are here for 5.7.44 and for 8.0.35.

The Summary sections linked below have tables for absolute and relative throughput. The relative throughput is (QPS for my version / QPS for base version). The base version is my5744_rel.cz10a_bee for the 5.7.44 report and my8035_rel.cz10a_bee for the 8.0.35 report.

The summaries linked below have three tables. The first table shows the absolute throughput (queries or writes/s). The second table shows the throughput for a given version relative to the base version (a value less than 1.0 means the version is slower). The third table shows the average background insert rate during q100, q500 and q1000. All versions sustained the targets, so that table can be ignored here.

By write-heavy I mean the l.i0, l.x and l.i1 benchmark steps. By read-heavy I mean the q100, q500 and q1000 benchmark steps. Below I use red and green to indicate when the relative QPS is down or up by more than 5% and grey otherwise.
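The red/green/grey rule is just a 5% band around 1.0; as a sketch:

```python
def qps_color(relative_qps, band=0.05):
    # red when relative QPS is down by more than 5%, green when up by
    # more than 5%, grey otherwise
    if relative_qps < 1.0 - band:
        return "red"
    if relative_qps > 1.0 + band:
        return "green"
    return "grey"

print(qps_color(0.89))  # red
print(qps_color(1.12))  # green
print(qps_color(1.03))  # grey
```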

From the Summary for 5.7.44
  • relative QPS is relative to my5744_rel.cz10a_bee
  • relative QPS for my5744_rel.cz10aps0_bee is (1.03, 1.12, 1.03) and (1.02, 1.03, 1.04) for write- and read-heavy. So the throughput benefit from disabling the performance schema is <= 4%, except for the l.x benchmark step that creates indexes.
From the Summary for 8.0.35
  • relative QPS is relative to my8035_rel.cz10a_bee
  • my8035_rel_native.cz10a_bee and my8035_rel_less.cz10a_bee have performance similar to the base case
  • my8035_rel_native_lto.cz10a_bee gets between 4% and 9% more QPS than the base case
  • builds that disable the perf schema in my.cnf or at compile time get ~1.07X more QPS for l.i0 and l.i1, ~1.5X more throughput for index create and between 1.11X and 1.15X more QPS for read-heavy. I will soon get flamegraphs to explain some of this.

Perf regressions in MySQL from 5.6 to 8.0 using the Insert Benchmark and a small server, Dec23: part 1

This post has results from the Insert Benchmark for some MySQL 5.6 releases, all 5.7 releases and all 8.0 releases using a small server and a cached workload. It follows up on work I published in October. The differences are that I updated the my.cnf files to make them more similar across releases and I added results for pre-GA releases from 5.7 and 8.0.

tl;dr

  • For 4 of the 6 benchmark steps, MySQL 8.0.35 gets about 60% of the throughput relative to 5.6.51 with simple queries and a cached database. The regression for random writes (l.i1) is smaller and index create (l.x) is faster in 8.0 than in 5.6.
  • For MySQL 5.7 about half of the perf regressions occur between the last 5.6 release (5.6.51) and the first 5.7 release that I was able to compile (5.7.5). The rest are spread evenly across 5.7.
  • For MySQL 8.0 almost all of the perf regressions are spread across 8.0 releases from 8.0.0 to 8.0.35. 
  • For InnoDB there is at least one significant regression arriving in 8.0.30+ related to which code gets inlined by the compiler. This is somewhat obfuscated because two perf bugs that hurt this benchmark were fixed after 8.0.28. For more details see PS-8822.
Builds

It isn't easy to build older code on newer systems, compilers, etc. Notes on that are here for 5.6, for 5.7 and for 8.0. A note on using cmake is here. The rel builds were used as everything was compiled using CMAKE_BUILD_TYPE=Release.

Tests were done for:
  • 5.6 - 5.6.21, 5.6.51
  • 5.7 - 5.7.5 thru 5.7.10, 5.7.20, 5.7.30, 5.7.44
  • 8.0 - 8.0.0 thru 8.0.5, 8.0.11 thru 8.0.14, 8.0.20, 8.0.22, 8.0.28, 8.0.30, 8.0.34, 8.0.35
I used the cz10a_bee config and it is here.
Benchmark

The Insert Benchmark was run in one setup - a cached workload.

The benchmark used the Beelink server explained here that has 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04. 

The benchmark is run with 1 client and 1 table. The benchmark is a sequence of steps.

  • l.i0
    • insert 20 million rows per table
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 50 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail. 
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.
Results

The Summary sections linked below have tables for absolute and relative throughput. Absolute throughput is just the QPS per version. The relative throughput is (QPS for my version / QPS for base version). The base version is 5.6.21 for the 5.6-only report, 5.7.5 for the 5.7-only report, 8.0.0 for the 8.0-only report and 5.6.21 for the report that compares 5.6, 5.7 and 8.0.
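The relative throughput computation, matching how the summary tables present it as one value per benchmark step (the throughput numbers below are hypothetical, only for illustration):

```python
def relative_qps(absolute, base):
    """(QPS for my version / QPS for base version) per benchmark step,
    rounded to 2 digits as in the summary tables. Values below 1.0 mean
    the version is slower than the base version."""
    return tuple(round(q / b, 2) for q, b in zip(absolute, base))

# hypothetical write-heavy (l.i0, l.x, l.i1) throughput vs a base version
print(relative_qps((8200, 13500, 11000), (10000, 10000, 10000)))
```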

Reports are here for 5.6, 5.7, 8.0, 5.6 thru 8.0.

The summaries linked below have three tables. The first table shows the absolute throughput (queries or writes/s). The second table shows the throughput for a given version relative to the base version for that report (a value less than 1.0 means the version is slower). The third table shows the average background insert rate during q100, q500 and q1000. All versions sustained the targets, so that table can be ignored here.

By write-heavy I mean the l.i0, l.x and l.i1 benchmark steps. By read-heavy I mean the q100, q500 and q1000 benchmark steps. Below I use red and green to indicate when the relative QPS is down or up by more than 5% and grey otherwise.

From the Summary for 5.6, 5.7 and 8.0
  • relative QPS is relative to MySQL 5.6.21
  • relative QPS in 5.7.44 is (0.82, 1.35, 1.10) and (0.73, 0.72, 0.72) for write- and read-heavy
    • There are perf regressions for the l.i0, q100, q500, and q1000 benchmark steps. About half of that is from the 5.6 to 5.7 transition and the rest is a gradual loss across 5.7. 
    • For the l.x benchmark step perf improves from 5.6 to 5.7 and then gradually declines across 5.7.
    • For the l.i1 benchmark step perf improves from 5.6 to 5.7 and is steady across 5.7
  • relative QPS in 8.0.35 is (0.55, 1.24, 0.89) and (0.62, 0.61, 0.62) for write- and read-heavy
    • Perf in 8.0.0 is close to 5.7.44 and 8.0.0 is the first (pre-GA) 8.0 release
    • There are perf regressions for all benchmark steps. These are gradual across 8.0.
From the Summary for 8.0
  • relative QPS is relative to MySQL 8.0.0
  • relative QPS in 8.0.35 is (0.69, 0.92, 0.84) and (0.84, 0.85, 0.85) for write- and read-heavy
From the Summary for 5.7
  • relative QPS is relative to MySQL 5.7.5
  • relative QPS in 5.7.44 is (0.92, 0.83, 1.02) and (0.88, 0.87, 0.88) for write- and read-heavy
From the Summary for 5.6
  • relative QPS is relative to MySQL 5.6.21
  • relative QPS in 5.6.51 is (0.96, 1.02, 0.97) and (0.98, 0.97, 0.98) for write- and read-heavy









Monday, November 27, 2023

Explaining changes in MySQL performance via hardware perf counters: part 4

This is part 4 of my series on using HW counters from Linux perf to explain why MySQL gets slower from 5.6 to 8.0. Refer to part 1 for an overview.

What happened in MySQL 5.7 and 8.0 to put so much more stress on the memory system?

tl;dr

  • It looks like someone sprinkled magic go-slower dust across most of the MySQL code because the slowdown from MySQL 5.6 to 8.0 is not isolated to a few call stacks.
  • MySQL 8.0 uses ~1.5X more instructions/operation than 5.6. Cache activity (references, loads, misses) is frequently up 1.5X or more. TLB activity (loads, misses) is frequently up 2X to 5X with the iTLB being a bigger problem than the dTLB.
  • innodb_log_writer_threads=ON is worse than I thought. I will soon have a post on that.

Too long to be a tl;dr
  • I don't have much experience using Linux perf counters to explain performance changes.
  • There are huge increases for data TLB, instruction TLB and L1 cache counters. From the Xeon CPU (socket2 below) the change from MySQL 5.6.21 to 8.0.34 measured as events/query are:
    • branches, branch misses: up ~1.5X, ~1.2X
    • cache references: up ~2.8X
    • instructions: up ~1.5X
    • dTLB loads, load-misses, stores: up ~1.6X, ~1.9X, ~1.8X
    • iTLB loads, load-misses: up ~2.1X, ~2.4X
    • L1 data cache loads, load-misses, stores: up ~1.6X, ~1.5X, ~1.8X
    • L1 instruction cache load-misses: up ~1.9X
    • LLC loads, stores: up ~1.6X, ~1.9X
    • Context switches and CPU migrations are down
  • For many of the HW counters the biggest jumps occur between the last point release in 5.6 and the first in 5.7 and then again between the last in 5.7 and the first in 8.0. Perhaps this is good news because it means the problems are not spread across every point release.

The posts in this series are: 

  • part 1 - introduction and results for the l.i0 (initial load) benchmark step
  • part 2 - results for the l.i1 (write-only) benchmark step
  • part 3 - results for the q100 (read+write) benchmark step
  • part 4 - (this post) results for the q1000 (read+write) benchmark step

Performance

The charts below show average throughput (QPS, or really operations/s) for the q1000 benchmark step.

  • The benchmark uses 1 client for the small servers (beelink, ser7) and 12 clients for the big server (socket2). 
  • MySQL 8.0 is much slower than 5.6 on the small servers (beelink, ser7) and slightly slower on the big server (socket2).
Results

Refer to the Results section in part 1 to understand what is displayed on the charts. The y-axis frequently does not start at zero to improve readability. But this also makes it harder to compare adjacent graphs.

Below when I write up 30% that is the same as up 1.3X. I switch from the former to the latter when the increase is larger than 99%.
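That convention can be captured in a tiny helper (a sketch of the convention, not code from the posts):

```python
def fmt_increase(pct):
    # report "up N%" for increases of 99% or less and "up M.MX" above
    # that, where up 30% is the same as up 1.3X
    if pct > 99:
        return f"up {1 + pct / 100:.1f}X"
    return f"up {pct}%"

print(fmt_increase(30))   # up 30%
print(fmt_increase(130))  # up 2.3X
```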

Spreadsheets are here for beelink, ser7 and socket2. See the Servers section above to understand the HW for beelink, ser7 and socket2.

Results: branches and branch misses

Summary:

  • the Results section above explains the y-axis
  • on beelink
    • from 5.6.51 to 5.7.10 branches up 24%, branch-misses up 49%
    • from 5.7.43 to 8.0.34 branches up 32%, branch-misses up 71%
  • on ser7
    • from 5.6.51 to 5.7.10 branches up 27%, branch-misses up 36%
    • from 5.7.43 to 8.0.34 branches up 20%, branch-misses up 53%
  • on socket2
    • from 5.6.51 to 5.7.10 branches up 14%, branch-misses up 6%
    • from 5.7.43 to 8.0.34 branches up 17%, branch-misses up 12%
Results: cache references and misses

Summary:

  • the Results section above explains the y-axis
  • on beelink
    • from 5.6.51 to 5.7.10 references up 35%, misses up 30%
    • from 5.7.43 to 8.0.34 references up 42%, misses up 41%
  • on ser7
    • from 5.6.51 to 5.7.10 references up 43%, misses up 2.3X
    • from 5.7.43 to 8.0.34 references up 44%, misses up 94%
  • on socket2
    • from 5.6.51 to 5.7.10 references up 44%, misses down 2%
    • from 5.7.43 to 8.0.34 references up 63%, misses down 2%
Results: cycles, instructions, CPI

Summary:

  • the Results section above explains the y-axis
  • on beelink
    • from 5.6.51 to 5.7.10 cycles up 35%, instructions up 31%, cpi up 2%
    • from 5.7.43 to 8.0.34 cycles up 48%, instructions up 36%, cpi up 9%
  • on ser7
    • from 5.6.51 to 5.7.10 cycles up 51%, instructions up 38%, cpi up 10%
    • from 5.7.43 to 8.0.34 cycles up 52%, instructions up 28%, cpi up 17%
  • on socket2
    • from 5.6.51 to 5.7.10 cycles up 8%, instructions up 15%, cpi down 6%
    • from 5.7.43 to 8.0.34 cycles up 9%, instructions up 19%, cpi down 9%
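CPI is just cycles divided by instructions, which is why cycles and instructions can both rise on socket2 while CPI falls: instructions grew faster than cycles. A sketch with made-up counter values:

```python
def cpi(cycles, instructions):
    """Cycles per instruction from the two raw perf counters."""
    return cycles / instructions

# Illustrative values: cycles up ~8%, instructions up ~15%,
# so CPI drops even though both raw counters increased.
old_cpi = cpi(1.00e12, 1.00e12)
new_cpi = cpi(1.08e12, 1.15e12)
print(f"CPI: {old_cpi:.2f} -> {new_cpi:.2f} ({new_cpi / old_cpi - 1:+.0%})")
```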
Results: dTLB

Summary:

  • the Results section above explains the y-axis
  • on beelink
    • from 5.6.51 to 5.7.10 dTLB-loads up 62%, dTLB-load-misses up 44%
    • from 5.7.43 to 8.0.34 dTLB-loads up 55%, dTLB-load-misses up 52%
  • on ser7
    • from 5.6.51 to 5.7.10 dTLB-loads up 66%, dTLB-load-misses up 23%
    • from 5.7.43 to 8.0.34 dTLB-loads up 53%, dTLB-load-misses up 28%
  • on socket2
    • loads
      • from 5.6.51 to 5.7.10 dTLB-loads up 17%, dTLB-load-misses up 9%
      • from 5.7.43 to 8.0.34 dTLB-loads up 23%, dTLB-load-misses up 50%
    • stores
      • from 5.6.51 to 5.7.10 dTLB-stores up 22%, dTLB-store-misses down 33%
      • from 5.7.43 to 8.0.34 dTLB-stores up 30%, dTLB-store-misses up 91%
Results: iTLB

Summary:

  • the Results section above explains the y-axis
  • on beelink
    • from 5.6.51 to 5.7.10 iTLB-loads up 65%, iTLB-load-misses up 3.1X
    • from 5.7.43 to 8.0.34 iTLB-loads up 54%, iTLB-load-misses up 2.7X
  • on ser7
    • from 5.6.51 to 5.7.10 iTLB-loads up 72%, iTLB-load-misses up 1.9X
    • from 5.7.43 to 8.0.34 iTLB-loads up 21%, iTLB-load-misses up 2.5X
  • on socket2
    • from 5.6.51 to 5.7.10 iTLB-loads up 28%, iTLB-load-misses up 25%
    • from 5.7.43 to 8.0.34 iTLB-loads up 27%, iTLB-load-misses up 2.1X
Results: L1 cache

Summary:

  • the Results section above explains the y-axis
  • on beelink
    • dcache
      • from 5.6.51 to 5.7.10 loads up 39%, load-misses up 32%
      • from 5.7.43 to 8.0.34 loads up 45%, load-misses up 43%
    • icache
      • from 5.6.51 to 5.7.10 load-misses up 25%
      • from 5.7.43 to 8.0.34 load-misses up 49%
  • on ser7
    • dcache
      • from 5.6.51 to 5.7.10 loads up 29%, load-misses up 19%
      • from 5.7.43 to 8.0.34 loads up 6%, load-misses up 5%
    • icache
      • from 5.6.51 to 5.7.10 load-misses up 29%
      • from 5.7.43 to 8.0.34 load-misses up 17%
  • on socket2
    • dcache
      • from 5.6.51 to 5.7.10 loads up 18%, load-misses up 11%, stores up 23%
      • from 5.7.43 to 8.0.34 loads up 22%, load-misses up 19%, stores up 29%
    • icache
      • from 5.6.51 to 5.7.10 load-misses up 25%
      • from 5.7.43 to 8.0.34 load-misses up 37%
Results: LLC

The LLC counters were only supported by the socket2 CPU.

Summary:

  • the Results section above explains the y-axis
  • on socket2
    • loads
      • from 5.6.51 to 5.7.10 loads up 5%, load-misses down 13%
      • from 5.7.43 to 8.0.34 loads up 27%, load-misses down 4%
    • stores
      • from 5.6.51 to 5.7.10 stores are flat, store-misses down 28%
      • from 5.7.43 to 8.0.34 stores up 56%, store-misses up 22%
Results: context switches

Summary:
  • on beelink
    • from 5.6.21 to 5.7.10 context switches down 5%
    • from 5.7.43 to 8.0.34 context switches up 30%
  • on ser7
    • from 5.6.21 to 5.7.10 context switches up 12%
    • from 5.7.43 to 8.0.34 context switches up 8%
  • on socket2
    • from 5.6.21 to 5.7.10 context switches down 18%
    • from 5.7.43 to 8.0.34 context switches up 21%
Results: CPU migrations

Summary:
  • on beelink
    • from 5.6.21 to 5.7.10 CPU migrations up 24%
    • from 5.7.43 to 8.0.34 CPU migrations up 6.1X
  • on ser7
    • from 5.6.21 to 5.7.10 CPU migrations up 4.5X
    • from 5.7.43 to 8.0.34 CPU migrations up 2.4X
  • on socket2
    • from 5.6.21 to 5.7.10 CPU migrations down 44%
    • from 5.7.43 to 8.0.34 CPU migrations up 46%

Explaining changes in MySQL performance via hardware perf counters: part 3

This is part 3 of my series on using HW counters from Linux perf to explain why MySQL gets slower from 5.6 to 8.0. Refer to part 1 for an overview.
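A hedged sketch of how counters like these can be collected: run `perf stat` in machine-readable mode against the mysqld process and parse the per-event totals. The event list, PID handling and sampling interval here are assumptions for illustration, not the exact commands used in this series.

```python
import subprocess

# Events to sample; this list is an assumption for illustration.
EVENTS = "cycles,instructions,branches,branch-misses,context-switches"

def parse_perf_csv(text):
    """Parse `perf stat -x,` output: lines of value,unit,event,..."""
    counters = {}
    for line in text.splitlines():
        fields = line.split(",")
        if len(fields) >= 3:
            try:
                counters[fields[2]] = float(fields[0])
            except ValueError:
                pass  # skip '<not supported>', '<not counted>', comments
    return counters

def collect(pid, seconds=10):
    # perf stat writes the counter table to stderr.
    proc = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", EVENTS,
         "-p", str(pid), "--", "sleep", str(seconds)],
        capture_output=True, text=True)
    return parse_perf_csv(proc.stderr)

# Parsing demo on canned output (no perf run needed).
sample = ("123456789,,cycles,10000000,100.00,,\n"
          "98765,,branch-misses,10000000,100.00,,\n")
print(parse_perf_csv(sample))
```

The counters returned by `collect` can then be divided by the number of queries the server executed during the interval to get the events/query values reported in these posts.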

What happened in MySQL 5.7 and 8.0 to put so much more stress on the memory system?

tl;dr

  • It looks like someone sprinkled magic go-slower dust across most of the MySQL code because the slowdown from MySQL 5.6 to 8.0 is not isolated to a few call stacks.
  • MySQL 8.0 uses ~1.5X more instructions/operation than 5.6. Cache activity (references, loads, misses) is frequently up 1.5X or more. TLB activity (loads, misses) is frequently up 2X to 5X, with the iTLB being a bigger problem than the dTLB.
  • innodb_log_writer_threads=ON is worse than I thought. I will soon have a post on that.

Too long to be a tl;dr
  • I don't have much experience using Linux perf counters to explain performance changes.
  • There are huge increases for data TLB, instruction TLB and L1 cache counters. From the Xeon CPU (socket2 below) the change from MySQL 5.6.21 to 8.0.34 measured as events/query are:
    • branches, branch misses: up ~1.6X, ~1.5X
    • cache references: up ~3.7X
    • instructions: up ~1.5X
    • dTLB loads, load-misses, stores: up ~1.6X, ~2.5X, ~1.7X
    • iTLB loads, load-misses: up ~2.0X, ~2.8X
    • L1 data cache loads, load-misses, stores: up ~1.6X, ~1.5X, ~1.7X
    • L1 instruction cache load-misses: up ~1.9X
    • LLC loads, stores, store-misses: up ~2.2X, ~2.7X, ~1.3X
    • Context switches are flat, CPU migrations down
  • For many of the HW counters the biggest jumps occur between the last point release in 5.6 and the first in 5.7 and then again between the last in 5.7 and the first in 8.0. Perhaps this is good news because it means the problems are not spread across every point release.

The posts in this series are: 

  • part 1 - introduction and results for the l.i0 (initial load) benchmark step
  • part 2 - results for the l.i1 (write-only) benchmark step
  • part 3 - (this post) results for the q100 (read+write) benchmark step
  • part 4 - results for the q1000 (read+write) benchmark step

Performance

The charts below show average throughput (QPS, or really operations/s) for the q100 benchmark step.

  • The benchmark uses 1 client for the small servers (beelink, ser7) and 12 clients for the big server (socket2). 
  • MySQL 8.0 is much slower than 5.6 on the small servers (beelink, ser7) and slightly slower on the big server (socket2).
Results

Refer to the Results section in part 1 to understand what is displayed on the charts. The y-axis frequently does not start at zero to improve readability. But this also makes it harder to compare adjacent graphs.

Below, when I write up 30% that is the same as up 1.3X. I switch from percentages to multipliers when the increase is larger than 99%.

Spreadsheets are here for beelink, ser7 and socket2. See the Servers section above to understand the HW for beelink, ser7 and socket2.

Results: branches and branch misses

Summary:

  • the Results section above explains the y-axis
  • on beelink
    • from 5.6.51 to 5.7.10 branches up 29%, branch-misses up 55%
    • from 5.7.43 to 8.0.34 branches up 17%, branch-misses up 67%
  • on ser7
    • from 5.6.51 to 5.7.10 branches up 39%, branch-misses up 52%
    • from 5.7.43 to 8.0.34 branches up 18%, branch-misses up 56%
  • on socket2
    • from 5.6.51 to 5.7.10 branches up 19%, branch-misses up 20%
    • from 5.7.43 to 8.0.34 branches up 17%, branch-misses up 22%
Results: cache references and misses

Summary:

  • the Results section above explains the y-axis
  • on beelink
    • from 5.6.51 to 5.7.10 references up 38%, misses up 38%
    • from 5.7.43 to 8.0.34 references up 32%, misses up 36%
  • on ser7
    • from 5.6.51 to 5.7.10 references up 57%, misses up 2.5X
    • from 5.7.43 to 8.0.34 references up 45%, misses up 2.0X
  • on socket2
    • from 5.6.51 to 5.7.10 references up 69%, misses up 1%
    • from 5.7.43 to 8.0.34 references up 79%, misses down 5%

Results: cycles, instructions, CPI

Summary:

  • the Results section above explains the y-axis
  • on beelink
    • from 5.6.51 to 5.7.10 cycles up 11%, instructions are flat, cpi up 12%
    • from 5.7.43 to 8.0.34 cycles up 42%, instructions up 18%, cpi up 18%
  • on ser7
    • from 5.6.51 to 5.7.10 cycles up 66%, instructions up 52%, cpi up 9%
    • from 5.7.43 to 8.0.34 cycles up 36%, instructions up 12%, cpi up 22%
  • on socket2
    • from 5.6.51 to 5.7.10 cycles up 28%, instructions up 21%, cpi up 5%
    • from 5.7.43 to 8.0.34 cycles up 36%, instructions up 16%, cpi up 17%
Results: dTLB

Summary:

  • the Results section above explains the y-axis
  • on beelink
    • from 5.6.51 to 5.7.10 dTLB-loads up 49%, dTLB-load-misses up 43%
    • from 5.7.43 to 8.0.34 dTLB-loads up 39%, dTLB-load-misses up 53%
  • on ser7
    • from 5.6.51 to 5.7.10 dTLB-loads up 2X, dTLB-load-misses up 56%
    • from 5.7.43 to 8.0.34 dTLB-loads up 44%, dTLB-load-misses up 23%
  • on socket2
    • loads
      • from 5.6.51 to 5.7.10 dTLB-loads up 22%, dTLB-load-misses up 18%
      • from 5.7.43 to 8.0.34 dTLB-loads up 14%, dTLB-load-misses up 76%
    • stores
      • from 5.6.51 to 5.7.10 dTLB-stores up 27%, dTLB-store-misses down 61%
      • from 5.7.43 to 8.0.34 dTLB-stores up 20%, dTLB-store-misses up 2.4X
Results: iTLB

Summary:

  • the Results section above explains the y-axis
  • on beelink
    • from 5.6.51 to 5.7.10 iTLB-loads up 48%, iTLB-load-misses up 3.1X
    • from 5.7.43 to 8.0.34 iTLB-loads up 37%, iTLB-load-misses up 2.6X
  • on ser7
    • from 5.6.51 to 5.7.10 iTLB-loads up 81%, iTLB-load-misses up 95%
    • from 5.7.43 to 8.0.34 iTLB-loads up 30%, iTLB-load-misses up 2.8X
  • on socket2
    • from 5.6.51 to 5.7.10 iTLB-loads up 39%, iTLB-load-misses up 33%
    • from 5.7.43 to 8.0.34 iTLB-loads up 25%, iTLB-load-misses up 2.1X
Results: L1 cache

Summary:

  • the Results section above explains the y-axis
  • on beelink
    • dcache
      • from 5.6.51 to 5.7.10 loads up 24%, load-misses up 16%
      • from 5.7.43 to 8.0.34 loads up 21%, load-misses up 24%
    • icache
      • from 5.6.51 to 5.7.10 load-misses up 23%
      • from 5.7.43 to 8.0.34 load-misses up 33%
  • on ser7
    • dcache
      • from 5.6.51 to 5.7.10 loads up 38%, load-misses up 27%
      • from 5.7.43 to 8.0.34 loads up 28%, load-misses up 27%
    • icache
      • from 5.6.51 to 5.7.10 load-misses up 40%
      • from 5.7.43 to 8.0.34 load-misses up 34%
  • on socket2
    • dcache
      • from 5.6.51 to 5.7.10 loads up 22%, load-misses up 13%, stores up 26%
      • from 5.7.43 to 8.0.34 loads up 18%, load-misses up 20%, stores up 24%
    • icache
      • from 5.6.51 to 5.7.10 load-misses up 30%
      • from 5.7.43 to 8.0.34 load-misses up 37%
Results: LLC

The LLC counters were only supported by the socket2 CPU.

Summary:

  • the Results section above explains the y-axis
  • on socket2
    • loads
      • from 5.6.51 to 5.7.10 loads up 26%, load-misses down 3%
      • from 5.7.43 to 8.0.34 loads up 24%, load-misses down 18%
    • stores
      • from 5.6.51 to 5.7.10 stores up 16%, store-misses down 25%
      • from 5.7.43 to 8.0.34 stores up 69%, store-misses up 29%
Results: context switches

Summary:
  • on beelink
    • from 5.6.21 to 5.7.10 context switches down 31%
    • from 5.7.43 to 8.0.34 context switches up 3%
  • on ser7
    • from 5.6.21 to 5.7.10 context switches up 8%
    • from 5.7.43 to 8.0.34 context switches down 5%
  • on socket2
    • from 5.6.21 to 5.7.10 context switches up 1%
    • from 5.7.43 to 8.0.34 context switches down 2%
Results: CPU migrations

Summary:
  • on beelink
    • from 5.6.21 to 5.7.10 CPU migrations down 38%
    • from 5.7.43 to 8.0.34 CPU migrations up 94%
  • on ser7
    • from 5.6.21 to 5.7.10 CPU migrations up 41%
    • from 5.7.43 to 8.0.34 CPU migrations up 5.1X
  • on socket2
    • from 5.6.21 to 5.7.10 CPU migrations down 41%
    • from 5.7.43 to 8.0.34 CPU migrations down 54%
