I previously shared results for the Insert Benchmark, InnoDB and a medium server. Now I have updated results that include Percona Server and a few new configs (a20, a21, a22). The good news is that Percona Server does better than upstream on the write-intensive parts of the benchmark. The bad news is that it does not fix the problem of bad performance from single-page flushing.
Several of the benchmark steps here (l.i1, q100, q500, q1000) were run for more time than they were in the previous report.
tl;dr
- Setting innodb_empty_free_list_algorithm to backoff with Percona Server does not fix the bad performance from single-page flushing
- The slow query from too much MVCC GC debt can be avoided by making sure that innodb_page_cleaners, innodb_purge_threads and innodb_buffer_pool_instances have the same value. They don't by default (see the sketch after this list).
- Limiting max purge lag avoids some queries taking 1000X longer than needed, but at a big cost in write throughput. Query rates also suffer in many cases with the configs that limit max purge lag, which isn't expected. Regardless, enabling limits for max purge lag is an important topic to consider.
- InnoDB doesn't suffer much from the fairness issue
- For Percona Server vs upstream using a similar config ...
- With a cached workload Percona Server has similar write throughput and slightly worse read throughput
- With an IO-bound workload Percona Server has better write throughput and similar read throughput
- Must not forget about bug 109595 as that reduces query rates for q100, q500 and q1000
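To make the alignment concrete, here is a minimal sketch assuming MySQL 8.0, where SET PERSIST_ONLY stages values for variables that can't be changed at runtime (editing my.cnf works just as well):

  -- By default innodb_purge_threads and innodb_page_cleaners are 4 while
  -- innodb_buffer_pool_instances is 8 for buffer pools >= 1G, so the three
  -- don't match. Check the current values:
  SELECT @@innodb_buffer_pool_instances,
         @@innodb_purge_threads,
         @@innodb_page_cleaners;

  -- None of these are dynamic, so stage them to take effect at the next
  -- restart (SET PERSIST_ONLY writes to mysqld-auto.cnf):
  SET PERSIST_ONLY innodb_purge_threads = 8;
  SET PERSIST_ONLY innodb_page_cleaners = 8;
  SET PERSIST_ONLY innodb_buffer_pool_instances = 8;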
The insert benchmark was run in two setups.
- cached by InnoDB - all tables are cached by InnoDB
- IO-bound - the database is larger than memory
The benchmark used a c2-standard-30 server from GCP with Ubuntu 22.04, 15 cores, hyperthreads disabled, 120G of RAM and 1.5T of storage from RAID 0 over 4 local NVMe devices with XFS.
The benchmark is run with 8 clients and 8 tables (client per table). The benchmark is a sequence of steps.
- l.i0
- insert X million rows per table where X is 20 for cached and 500 for IO-bound
- l.x
- create 3 secondary indexes. I usually ignore performance from this step.
- l.i1
- insert and delete another X million rows per table with secondary index maintenance where X is 200 for cached by InnoDB and 50 for IO-bound. The number of rows/table at the end of the benchmark step matches the number at the start, with inserts done to the table head and deletes done from the tail (a sketch of the pattern follows this list).
- q100, q500, q1000
- do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 7200 seconds.
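The l.i1 pattern, roughly, as a sketch with made-up names (the real benchmark clients do more, but the shape is inserts at the head and deletes at the tail so the row count stays flat):

  -- t1, id, k and v are hypothetical. Assume id is the PK and k has a
  -- secondary index, so both statements do secondary index maintenance.
  INSERT INTO t1 (id, k, v) VALUES (20000001, 987654, 'filler');  -- head: next largest id
  DELETE FROM t1 WHERE id = 1;                                    -- tail: oldest id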
Configurations
All of the my.cnf files for 8.0, 5.7 and 5.6 are here.
The base configurations are here:
- for upstream MySQL 8.0, 5.7 and 5.6. The naming pattern is my.cnf.cy10$x_gcp_c2s30 where x is a for the base config and a1 ... a22 otherwise.
- for Percona Server 8.0.33 (ps1 and ps2, see below)
The configurations tested are (a sketch of the a13-style purge lag settings follows the list):
- base - see links above
- a2 - increases innodb_io_capacity[_max] to 20k/40k
- a7 - adds innodb_lru_scan_depth=2048
- a10 - adds innodb_purge_threads=8, innodb_page_cleaners=8
- a12 - reduces innodb_redo_log_capacity to 32G
- a13 - adds innodb_max_purge_lag=50k, innodb_max_purge_lag_delay=1M, innodb_purge_threads=8
- a17 - adds innodb_doublewrite=OFF (not safe for production)
- a18 - adds innodb_buffer_pool_instances=4, innodb_page_cleaners=4
- a19 - adds innodb_max_dirty_pages_pct_lwm=20, innodb_max_dirty_pages_pct=50
- a20 - like a13 but with innodb_max_purge_lag=200k
- a21 - like a13 but with innodb_max_purge_lag=500k
- a22 - combines a10 and a20
- ps1 - for Percona Server, same as base but sets innodb_empty_free_list_algorithm to legacy
- ps2 - for Percona Server, same as base but sets innodb_empty_free_list_algorithm to backoff
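As a concrete example, a sketch of the a13-style purge lag settings assuming MySQL 8.0. The first two variables are dynamic; innodb_purge_threads is not, so it is staged for the next restart. Note that innodb_max_purge_lag_delay is in microseconds, so 1M caps the delay imposed on DML at 1 second:

  -- a13 uses a lag limit of 50k; a20 and a21 raise it to 200k and 500k.
  SET PERSIST innodb_max_purge_lag = 50000;
  -- Max delay imposed on DML when the lag limit is exceeded, in microseconds.
  SET PERSIST innodb_max_purge_lag_delay = 1000000;
  -- Read-only at runtime, so stage it to take effect at the next restart.
  SET PERSIST_ONLY innodb_purge_threads = 8;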
Performance reports are here:
- Cached by InnoDB: 5.6 vs 5.7 vs 8.0 and 8.0 only
- IO-bound: 5.6 vs 5.7 vs 8.0 and 8.0 only
The sections that follow cover:
- average throughput
- fairness between benchmark clients
- SLA for background inserts and deletes
- response time histograms
- performance at 1-second intervals
- purge lag
- The impact of MVCC GC on some queries
Average throughput
Cached by InnoDB
- For l.i1 the a10, a13, a20, a21 and a22 configs reduce the write rate by ~15%. All but the a10 config enable limits on max purge lag and the reduction is expected for them. The reduction for a10 is surprising. The a10 config sets innodb_purge_threads and innodb_page_cleaners to 8 to match the default for innodb_buffer_pool_instances. Note that the a18 config also sets purge_threads, page_cleaners and buffer_pool_instances to the same value, but the value is 4 for a18, and there isn't a reduction for a18.
- For q100 the query throughput is ~1.4X better for the configs (a13, a20, a21, a22) that limit max purge lag. For those configs there is less MVCC GC debt which means there is less overhead from the work done to reduce that debt. The throughput is also better for a10 (see the previous bullet point).
- For q500 and q1000 the throughput for the max purge lag limiting configs is similar to the others because the history list length has been reduced for all configs (the debt was repaid during q100).
- For l.i0 (initial load) throughput in 5.7 is 1.88X better than 5.6 while in 8.0 it is only ~1.6X better than 5.6. From the rps (storage reads/second), rpq (storage reads/insert) and cpupq (CPU/insert) columns, the difference is mostly because 5.6 does read IO (I don't know why) that 5.7 and 8.0 don't, while the regression from 5.7 to 8.0 might be from new CPU overhead -- see here.
- For l.x the create index step is fastest for 5.7 and then regresses from 5.7 to 8.0. There is more CPU and read IO overhead in 8.0 vs 5.7 -- see here for the rps, rpq and cpupq columns.
- For l.i1 (inserts & deletes with secondary indexes) the write rates for 5.7 and 8.0 are ~1.6X better than for 5.6
- For q100 the query rate drops from 5.6 to 5.7 to 8.0 as the CPU overhead (see cpupq here) increases
- For q500 and q1000 the query rates are closer and 5.6 suffers from more synchronization overhead (see cspq here) while 8.0 benefits from code improvements related to concurrency.
- Upstream InnoDB is similar to Percona Server for writes and slightly better for reads here. The difference for reads appears to be from CPU overhead (see cpupq here for q100, q500 and q1000).
IO-bound
- For l.i1 the configs that limit max purge lag (a13, a20, a21, a22) get ~25% of the write rate that other configs sustain. This is a big price to pay for better query performance.
- For q100 most of the configs that limit purge lag (a13, a20, a22) get better query rates but a21 does not. From metrics I see the CPU overhead (cpupq) and storage read overhead (rps, rpq) are too high for a21.
- For q500 and q1000 the query rates suffer for the configs that limit max purge lag. This is odd.
- Percona Server does better than upstream InnoDB (comparing to the base config) on the l.i1 benchmark step and similar on query throughput.
- Results for the a10 config are hard to explain -- throughput for l.i1, q100, q500 and q1000 is worse than the base config in every case. From the metrics here I see more mutex contention (cspq) and more CPU overhead (cpupq). This isn't an issue for a18, which is similar but sets the options to 4 rather than 8. Maybe the CPU is oversubscribed or innodb_purge_threads=8 is a problem.
- For l.i0 (initial load) throughput is 1.97X better in 5.7 vs 5.6 while 8.0 is only ~1.7X better than 5.6 because 8.0 has more CPU overhead than 5.7 (see cpupq here)
- For l.x (create index) perf is ~1.8X better in 5.7 vs 5.6 while 8.0 is only ~1.5X better in upstream 8.0 and ~1.4X better in Percona Server. The difference between 5.7 and 8.0 is from more CPU and read IO overhead (see cpupq, rps and rpq here). The difference between upstream and Percona is from CPU overhead (see cpupq here).
- For l.i1 (inserts+deletes with secondary indexes) 5.7 and Percona Server are ~1.1X faster than 5.6 while upstream 8.0 is slightly slower than 5.6. Percona Server uses less CPU and read IO than upstream (see cpupq, rps and rpq here).
- For q100, q500 and q1000 5.7 is ~1.2X faster than 5.6 while 8.0 (upstream and Percona Server) are ~1.3X faster than 5.6 because they use less CPU (see cpupq here).
SLA
- Cached by InnoDB
- 8.0 - target rates were sustained
- 5.6 vs 5.7 vs 8.0 - 5.6.51 failed for the q100 step
- IO-bound
- 8.0 - a13, a20 and a22 configs for upstream 8.0.33 failed for the q1000 step. These configs enable limits on the max purge lag to avoid too much MVCC GC debt and the limits were reached during q1000 for which the targets are 8000 inserts/s and 8000 delete/s (8x1000). The value from limits on purge lag can be seen in the Impact of MVCC GC section below where a query can take ~100 seconds without a purge lag limit vs ~30 milliseconds with the limit.
- 5.6 vs 5.7 vs 8.0 - target rates were sustained
Response time histograms
- Cached by InnoDB
- 8.0
- not that interesting
- 5.6 vs 5.7 vs 8.0
- results for 5.7 and 8.0 are mostly better than 5.6
- IO-bound
- 8.0
- on the l.i1 step the a21 config has a bad outlier, 18.493s for a write, while Percona Server has outliers that are worse than upstream.
- 5.6 vs 5.7 vs 8.0
- For l.i1 the insert rate and max response time charts are similar between upstream 8.0 (base config) and Percona Server (ps2 config).
- For q1000 Percona Server with the ps2 config has less variance for inserts (see here) vs upstream InnoDB (see here).
Performance at 1-second intervals
- For l.i0 upstream 8.0 is similar to Percona Server
- For l.i1 the insert rate charts are similar -- a slow rise over ~4000s and then stable -- while all have a bit of response time variance
- For q100 there is more query throughput variance with Percona Server and the ps2 config vs upstream 8.0
- For q1000 there is more query throughput variance with Percona Server and the ps2 config vs upstream 8.0
The impact of MVCC GC on some queries
The query measured here can be a problem for InnoDB (regardless of tuning) and for MyRocks (depending on tuning). It isn't a problem for Postgres.
A blog post with more details is here.
The tables below show the query response time in seconds for the query done at the start of each benchmark step. It gets slow starting with the steps that follow l.i1 because l.i1 creates too much MVCC GC debt. Some configs (a13, a20, a21) avoid the slow query by enabling a limit on the max purge lag. Other configs (a10, a22) avoid the slow query by making sure that innodb_page_cleaners, innodb_purge_threads and innodb_buffer_pool_instances have the same value. A sketch for watching the purge lag follows the links below.
Cached by InnoDB
IO-bound
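A minimal sketch for watching the MVCC GC debt, assuming the default innodb_metrics counters are enabled (trx_rseg_history_len is on by default). This is the same number that SHOW ENGINE INNODB STATUS reports as the history list length:

  -- The history list length measures unpurged MVCC history, aka the GC debt.
  SELECT name, count
  FROM information_schema.innodb_metrics
  WHERE name = 'trx_rseg_history_len';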
Comments
Q: Conclusion?
A: I don't think that Percona Server fixes the problem of InnoDB write back falling behind leading to single-page flush stalls.