This post has results for MySQL with InnoDB on the updated Insert Benchmark with an IO-bound workload and an 8-core server, with results from MySQL versions 5.6 through 8.0. Recent results from a cached workload on the same server are here.
tl;dr
- Regressions here with the IO-bound workload are smaller than with a cached workload because the extra IO latency often dominates the extra CPU overhead that arrives in modern MySQL.
- Regressions tend to be large between major versions (5.6 -> 5.7, 5.7 -> 8.0). They were small within 5.6 (5.6.21 -> 5.6.51) and within 5.7 (5.7.10 -> 5.7.44), but were also large within 8.0.
- The perf schema continues to have performance problems. The biggest problems are a soon-to-be-fixed bug for parallel create index (see here) and a ~15% drop in range query throughput. The drop is larger for range queries than for point queries in this workload because the point queries are much more IO-bound, so IO latency hides the cost of the perf schema.
Comparing MySQL 8.0.36 with 5.6.21
- Initial load (l.i0) throughput is ~2X larger in 5.6
- Write only (l.i1, l.i2) throughput is ~1.2X larger in 8.0
- Range query (qr*) throughput is much smaller in 8.0
- Point query (qp*) throughput ranges from ~9% smaller to similar in 8.0
Build + Configuration
I tested many versions of MySQL 5.6, 5.7 and 8.0. These were compiled from source. I used the CMake files from here with the patches here to fix problems that otherwise prevent compiling older MySQL releases on modern Ubuntu. In all cases I used the rel build that uses CMAKE_BUILD_TYPE=Release.
I used the cz10a_bee my.cnf files that are here for 5.6, for 5.7 and for 8.0. For 5.7 and 8.0 there are many variants of that file to make them work on a range of the point releases.
The versions I tested are:
- 5.6
- 5.6.21, 5.6.31, 5.6.41, 5.6.51
- 5.7
- 5.7.10, 5.7.20, 5.7.30, 5.7.44
- 8.0
- 8.0.13, 8.0.14, 8.0.20, 8.0.28, 8.0.35, 8.0.36
For 8.0.35 I tested a few variations from what is described above to understand the cost of the performance schema:
- my8035_rel.cz10aps0_bee
- this uses my.cnf.cz10aps0_bee which is the same as my.cnf.cz10a_bee except it adds performance_schema=0
- my8035_rel_lessps.cz10a_bee
- the build disables as much as possible of the performance schema. The CMake file is here.
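The my.cnf-level change is a one-line edit. A minimal sketch of the difference (the section header is standard MySQL config syntax; the surrounding contents of the cz10a_bee file are not shown here):

```ini
# my.cnf.cz10aps0_bee differs from my.cnf.cz10a_bee only by this line,
# which disables the performance schema at server startup
[mysqld]
performance_schema=0
```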
The Benchmark
The test server is described here. It is a Beelink SER4 with 8 cores, 16G RAM and Ubuntu 22.04. Storage is an m.2 device with XFS and discard enabled.
The benchmark is explained here and is run with 1 client. The benchmark steps are:
- l.i0
- insert 800 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
- l.x
- create 3 secondary indexes per table. There is one connection per client.
- l.i1
- use 2 connections/client. One inserts 4M rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
- l.i2
- like l.i1 but each transaction modifies 5 rows (small transactions) and 1M rows total
- Work is done and waiting occurs at the end of this step to reduce write-back debt
- qr100
- use 3 connections/client. One does range queries for 1800 seconds and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for a fixed amount of time. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
- qp100
- like qr100 except uses point queries on the PK index
- qr500
- like qr100 but the insert and delete rates are increased from 100/s to 500/s
- qp500
- like qp100 but the insert and delete rates are increased from 100/s to 500/s
- qr1000
- like qr100 but the insert and delete rates are increased from 100/s to 1000/s
- qp1000
- like qp100 but the insert and delete rates are increased from 100/s to 1000/s
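The SLA check described for the read-write steps reduces to simple arithmetic: did the background writer sustain the target rate over the measurement window? A minimal sketch, with names of my own (this is not the benchmark client's code):

```python
def sla_ok(inserts_done: int, run_seconds: float, target_rate: float,
           tolerance: float = 0.05) -> bool:
    """Return True when the background writer sustained the target
    insert rate to within the given tolerance."""
    actual_rate = inserts_done / run_seconds
    return actual_rate >= target_rate * (1.0 - tolerance)

# qr1000 targets 1000 inserts/s over an 1800-second window
print(sla_ok(1_800_000, 1800.0, 1000.0))  # target sustained
print(sla_ok(1_500_000, 1800.0, 1000.0))  # SLA failure
```

When the SLA holds, every system tested does the same number of background inserts per step, which keeps the query results comparable.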
Results
The summary has 3 tables. The first shows absolute throughput by DBMS tested per benchmark step. The second has throughput relative to the version on the first row of the table. The third shows the background insert rate for benchmark steps that have background inserts; all systems sustained the target rates. The second table makes it easy to see how performance changes over time.
Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there is a regression. The Q in relative QPS measures:
- insert/s for l.i0, l.i1, l.i2
- indexed rows/s for l.x
- range queries/s for qr100, qr500, qr1000
- point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.
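The relative QPS arithmetic and the color buckets above can be sketched as follows (the function names are mine, chosen for illustration):

```python
def relative_qps(qps_me: float, qps_base: float) -> float:
    """Relative QPS: > 1.0 means improvement, < 1.0 means regression."""
    return qps_me / qps_base

def color(rel: float) -> str:
    """Bucket a relative QPS value: red <= 0.95, green >= 1.05,
    grey for values in between."""
    if rel <= 0.95:
        return "red"
    if rel >= 1.05:
        return "green"
    return "grey"

# Example: l.i0 for 8.0.36 vs the 5.6.21 base has relative QPS 0.54
print(color(relative_qps(54.0, 100.0)))  # red
```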
From the summary for 5.6
- The base case is 5.6.21
- Comparing 5.6.51 with 5.6.21
- l.i0 - relative QPS is 0.92 in 5.6.51
- l.x - relative QPS is 1.01 in 5.6.51
- l.i1, l.i2 - relative QPS is 1.00, 0.99 in 5.6.51
- qr100, qr500, qr1000 - relative QPS is 0.98, 0.97, 0.98 in 5.6.51
- qp100, qp500, qp1000 - relative QPS is 0.96, 0.96, 0.98 in 5.6.51
From the summary for 5.7
- The base case is 5.7.10
- Comparing 5.7.44 with 5.7.10
- l.i0 - relative QPS is 0.96 in 5.7.44
- l.x - relative QPS is 0.96 in 5.7.44
- l.i1, l.i2 - relative QPS is 0.96, 0.98 in 5.7.44
- qr100, qr500, qr1000 - relative QPS is 0.95, 0.97, 0.97 in 5.7.44
- qp100, qp500, qp1000 - relative QPS is 1.00, 0.99, 0.99 in 5.7.44
From the summary for 8.0
- The base case is 8.0.13
- Comparing 8.0.36 with 8.0.13
- l.i0 - relative QPS is 0.81 in 8.0.36
- l.x - relative QPS is 1.00 in 8.0.36
- l.i1, l.i2 - relative QPS is 0.98, 0.93 in 8.0.36
- qr100, qr500, qr1000 - relative QPS is 1.01, 0.97, 0.93 in 8.0.36
- qp100, qp500, qp1000 - relative QPS is 0.98, 1.00, 1.01 in 8.0.36
From the summary for 8.0 but focusing on the 8.0.35 variations that disable the perf schema
- Throughput for write-heavy steps (l.i0, l.i1, l.i2) is ~5% better
- Throughput for parallel index create is ~1.5X better (read this)
- For read-write benchmark steps
- Throughput for range queries (qr*) is ~15% better
- Throughput for point queries (qp*) is unchanged
- The point query benchmark steps are a lot more IO-bound than the range query steps, which might explain why the perf schema cost is larger for range queries here. See the rpq column here (iostat reads per query): it is less than 0.2 for qr100 but larger than 9 for qp100. From the cpupq column here (CPU per query), the perf schema increases CPU by up to 10% for point queries.
- To reduce perf schema overhead it is better to disable it at compile time than via my.cnf
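The IO-bound argument above can be checked with back-of-the-envelope arithmetic: when most of a query's time is spent on reads, a fixed percentage increase in CPU time is a much smaller percentage of total query time. A sketch using a simple cost model; the per-read and per-query CPU latencies below are illustrative assumptions, while the rpq values (~0.2 for qr100, ~9 for qp100) and the ~10% CPU increase come from the results above:

```python
def query_time_us(reads_per_query: float, read_latency_us: float,
                  cpu_us: float) -> float:
    """Simple model: query time = IO time + CPU time."""
    return reads_per_query * read_latency_us + cpu_us

# Assumed latencies: 100us per storage read, 200us of CPU per query
READ_US, CPU_US = 100.0, 200.0
CPU_PENALTY = 1.10  # perf schema adds ~10% CPU per query

for name, rpq in (("qr100", 0.2), ("qp100", 9.0)):
    base = query_time_us(rpq, READ_US, CPU_US)
    slow = query_time_us(rpq, READ_US, CPU_US * CPU_PENALTY)
    print(f"{name}: perf schema adds {100 * (slow / base - 1):.1f}% to query time")
```

With these assumed latencies the model shows the same shape as the results: the CPU penalty is several times larger for the read-light range queries than for the read-heavy point queries.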
From the summary for 5.6, 5.7, 8.0
- The base case is 5.6.21
- The regressions here are smaller than for a cached workload because the workload here is frequently IO-bound and the extra IO latency often dominates the extra CPU overhead that arrives in modern MySQL.
- Comparing 5.7.44 and 8.0.36 with 5.6.21
- l.i0
- relative QPS is 0.80 in 5.7.44
- relative QPS is 0.54 in 8.0.36
- l.x
- relative QPS is 1.36 in 5.7.44
- relative QPS is 1.30 in 8.0.36
- l.i1, l.i2
- relative QPS is 1.30, 1.25 in 5.7.44
- relative QPS is 1.29, 1.16 in 8.0.36
- qr100, qr500, qr1000
- relative QPS is 0.73, 0.83, 0.92 in 5.7.44
- relative QPS is 0.68, 0.77, 0.85 in 8.0.36
- qp100, qp500, qp1000
- relative QPS is 0.96, 0.96, 1.02 in 5.7.44
- relative QPS is 0.91, 0.93, 1.00 in 8.0.36