Monday, August 14, 2023

InnoDB vs the Insert Benchmark on a medium server, includes Percona Server

I previously shared results for the Insert Benchmark, InnoDB and a medium server. Now I have updated results that include Percona Server and a few new configs (a20, a21, a22). The good news is that Percona Server does better than upstream on the write-intensive parts of the benchmark. The bad news is that it does fix the problem of bad performance.

Several of the benchmark steps here (l.i1, q100, q500, q1000) were run for more time than they were in the previous report.

tl;dr

  • Setting innodb_empty_free_list_algorithm to backoff with Percona Server does not fix the bad performance from single-page flushing
  • The slow query from too much MVCC GC debt can be avoided by making sure that innodb_page_cleaners, innodb_purge_threads and innodb_buffer_pool_instances have the same value. They won't by default.
  • Limiting max purge lag avoids some queries taking 1000X longer than needed, but at a big cost in write throughput. Query rates also suffer in many cases with the configs that limit max purge lag which isn't expected. Regardless, enabling limits for max purge lag is an important topic to consider.
  • InnoDB doesn't suffer much from the fairness issue
  • For Percona Server vs upstream using a similar config ...
    • With a cached workload Percona Server has similar write throughput and slightly worse read throughput
    • With an IO-bound workload Percona Server has better write throughput and similar read throughput
  • Must not forget about bug 109595 as that reduces query rates for q100, q500 and q1000
Builds

I used InnoDB from upstream MySQL 5.6.51, 5.7.40 and 8.0.33. I also used it from Percona Server 8.0.33 (8.0.33-25). All were compiled from source with CMAKE_BUILD_TYPE=Release. Link-time optimization (-flto) was used for 5.7 and 8.0.

Benchmark

The insert benchmark was run in two setups.

  • cached by InnoDB - all tables are cached by InnoDB
  • IO-bound - the database is larger than memory

The benchmark used a c2-standard-30 server from GCP with Ubuntu 22.04, 15 cores, hyperthreads disabled, 120G of RAM and 1.5T of storage from RAID 0 over 4 local NVMe devices with XFS.

The benchmark is run with 8 clients and 8 tables (client per table). The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows per table where X is 20 for cached and 500 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another X million rows per table with secondary index maintenance where X is 200 for cached by InnoDB and 50 for IO-bound. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail. 
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 7200 seconds.

Configurations

All of the my.cnf files for 8.0, 5.7 and 5.6.

The base configurations are here:

  • for upstream MySQL 8.0, 5.7 and 5.6. The naming pattern is my.cnf.cy10$x_gcp_c2s30 where x is a for the base config and a1 ... a22 otherwise.
  • for Percona Server 8.0.33 (ps1 and ps2, see below)

The configurations tested are:

  • base - see links above
  • a2 - increases innodb_io_capacity[_max] to 20k/40k
  • a7 - adds innodb_lru_scan_depth=2048
  • a10 - adds innodb_purge_threads=8, innodb_page_cleaners=8
  • a12 - reduces innodb_redo_log_capacity to 32G
  • a13 - adds innodb_max_purge_lag=50k, innodb_max_purge_lag_delay=1M, innodb_purge_threads=8
  • a17 - adds innodb_doublewrite=OFF (not safe for production)
  • a18 - adds innodb_buffer_pool_instances=4, innodb_page_cleaners=4
  • a19 - adds innodb_max_dirty_pages_pct_lwm=20, innodb_max_dirty_pages_pct=50
  • a20 - like a13 but with innodb_max_purge_lag=200k
  • a21 - like a13 but with innodb_max_purge_lag=500k
  • a22 - combines a10 and a20
  • ps1 - for Percona Server, same as base but sets innodb_empty_free_list_algorithm to legacy
  • ps2 - for Percona Server, same as base but sets innodb_empty_free_list_algorithm to backoff
Most of the configurations were used for MySQL 8.0.33. Only ps1 and ps2 were used for Percona Server. Only the base config was used for 5.6.51 and 5.7.40.

Results

Reports are here for:
I analyze these in terms of:
  • average throughput
  • fairness between benchmark clients
  • SLA for background inserts and deletes
  • response time histograms
  • performance at 1-second intervals
  • purge lag 
  • The impact of MVCC GC on some queries
Results: average throughput

The summaries show average throughput for each benchmark step using absolute and relative values. The relative values are the throughput relative to the throughput for the base config. The summaries are here for Cached by InnoDB (5.6 v 5.7 v 8.0 and 8.0 only) and IO-bound (5.6 v 5.7 v 8.0 and 8.0 only). In the summaries green/red mean good/bad for the table with absolute values and blue/gold mean better/worse for the table with relative values. I focus on the l.i1, q100, q500 and q1000 benchmark steps.

For Cached by InnoDB, 8.0 only (see here)
  • For l.i1 the a10, a13, a20, a21 and a22 configs reduce the write rate by ~15%. All but the a10 config enable limits on max purge lag and the reduction is expected for them. The reduction for a10 is surprising. The a10 config sets innodb_purge_threads and innodb_page_cleaners to 8 to match the default for innodb_buffer_pool_instances. Note that the a18 config also sets purge_threads, page_cleaners and buffer_pool_instances to the same value, but the value is 4 for a18, and there isn't a reduction for a18.
  • For q100 the query throughput is ~1.4X better for the configs (a13, a20, a21, a22) that limit max purge lag. For those configs there is less MVCC GC debt which means there is less overhead from the work done to reduce that debt. The throughput is also better for a10 (see the previous bullet point). 
  • For q500 and q1000 the throughput for the the max purge limiting configs is similar to the others because the history list length has been reduced for all configs (the debt was repaid during q100).
For Cached by InnoDB, 5.6 vs 5.7 vs 8.0 (see here)
  • For l.i0 (initial load) throughput in 5.7 is 1.88X better than 5.6 while in 8.0 it is only ~1.6X better than 5.6. From the rps (storage reads/second), rpq (storage reads/insert) and cpupq (CPU/insert) columns the difference is mostly due to 5.6 doing read IO (I don't know why) that isn't done by 5.7 and 8.0 while the regression from 5.7 to 8.0 might be from new CPU overhead -- see here.
  • For l.x the create index step is fastest for 5.7 and then regressions from 5.7 to 8.0. There is more CPU and read IO overhead in 8.0 vs 5.7 -- see here for the rps, rpq and cpupq columns.
  • For l.i1 (inserts & deletes with secondary indexes) the write rates for 5.7 and 8.0 are ~1.6X better than for 5.6
  • For q100 the query rate drops from 5.6 to 5.7 to 8.0 as the CPU overhead (see cpupq here) increases
  • For q500 and q1000 the query rates are closer and 5.6 suffers from more synchronization overhead (see cspq here) while 8.0 benefits from code improvements related to concurrency.
  • Upstream InnoDB is similar to Percona Server for writes and slightly better for reads here. The difference for reads appears to be from CPU overhead (see cpupq here for q100, q500 and q1000).
For IO-bound, 8.0 only (see here)
  • For l.i1 the configs that limit max purge lag (a13, a20, a21, a22) get ~25% of the write rate that other configs sustain. This is a big price to pay for better query performance. 
  • For q100 most of the configs that limit purge lag (a13, a20, a22) get better query rates but a21 does not. From metrics I see the CPU overhead (cpupq) and storage read overhead (rps, rpq) are too high for a21.
  • For q500 and q1000 the query rates suffer for the configs that limit max purge lag. This is odd.
  • Percona Server does better than upstream InnoDB (comparing to the base config) on the l.i1 benchmark step and similar on query throughput.
  • Results for the a10 config are hard to explain -- throughput for l.i1, q100, q500 and q1000 are all worse than the base config. From the metrics here I see more mutex contention (cspq) and more CPU overhead (cpupq). This isn't an issue for a18, which is similar buts sets the options to =4 rather than =8. Maybe the CPU is oversubscribed or innodb_purge_threads=8 is a problem.
For IO-bound, 5.6 vs 5.7 vs 8.0 (see here)
  • For l.i0 (initial load) throughput is 1.97X better in 5.7 vs 5.6 while 8.0 is only ~1.7X better than 5.6 because 8.0 has more CPU overhead than 5.7 (see cpupq here)
  • For l.x (create index) perf is ~1.8X better in 5.7 vs 5.6 while 8.0 is only ~1.5X better in upstream 8.0 and ~1.4X better in Percona Server. The difference between 5.7 and 8.0 is from more CPU and read IO overhead (see cpupq, rps and rpq here). The difference between upstream and Percona is from CPU overhead (see cpupq here).
  • For l.i1 (inserts+delete with secondary indexes) 5.7 and Percona Server are ~1.1X faster than 5.6 while upstream 8.0 is slightly slower than 5.6. Percona Server uses less CPU and read IO than upstream (see cpupq, rps and rpq here).
  • For q100, q500 and q1000 5.7 is ~1.2X faster than 5.6 while 8.0 (upstream and Percona Server) are ~1.3X faster than 5.6 because they use less CPU (see cpupq here).
Results: Fairness between benchmark clients

These are measured from the l.i1 benchmark step and the duration for this step varies as it runs until the configured number of inserts are done. In this case the benchmark step takes >= 10,000 seconds. The results here show that the benchmark clients don't finish at the same time but the difference isn't as large as it is for Postgres. There are spikes in throughput near the end of the benchmark steps for the stragglers because they have less competition for the DBMS at that point.

Legend:
* min_duration/max_duration - min and max runtimes from the 8 clients
* diff - max_duration - min_duration
* pct - (diff / min_duration) * 100
* config

Cached by InnoDB
diff    pct     config
123     0.6     my5651_rel.cy10a_gcp_c2s30
30      0.2     my5740_rel_lto.cy10a_gcp_c2s30
37      0.2     my8033_rel_lto.cy10a10_gcp_c2s30
31      0.2     my8033_rel_lto.cy10a12_gcp_c2s30
58      0.4     my8033_rel_lto.cy10a13_gcp_c2s30
32      0.3     my8033_rel_lto.cy10a17_gcp_c2s30
27      0.2     my8033_rel_lto.cy10a18_gcp_c2s30
55      0.3     my8033_rel_lto.cy10a20_gcp_c2s30
17      0.1     my8033_rel_lto.cy10a21_gcp_c2s30
25      0.2     my8033_rel_lto.cy10a22_gcp_c2s30
37      0.3     my8033_rel_lto.cy10a2_gcp_c2s30
24      0.2     my8033_rel_lto.cy10a7_gcp_c2s30
23      0.2     my8033_rel_lto.cy10a_gcp_c2s30
30      0.2     ps8033_rel_lto.cy10aps1_gcp_c2s30
23      0.2     ps8033_rel_lto.cy10aps2_gcp_c2s30

IO-bound
diff    pct     config
118     0.7     my5651_rel.cy10a_gcp_c2s30
95      0.7     my5740_rel_lto.cy10a_gcp_c2s30
441     2.7     my8033_rel_lto.cy10a_gcp_c2s30
201     1.4     ps8033_rel_lto.cy10aps1_gcp_c2s30
205     1.4     ps8033_rel_lto.cy10aps2_gcp_c2s30
822     4.6     my8033_rel_lto.cy10a10_gcp_c2s30
213     1.2     my8033_rel_lto.cy10a12_gcp_c2s30
295     0.5     my8033_rel_lto.cy10a13_gcp_c2s30
337     2.6     my8033_rel_lto.cy10a17_gcp_c2s30
304     1.8     my8033_rel_lto.cy10a18_gcp_c2s30
150     0.2     my8033_rel_lto.cy10a20_gcp_c2s30
317     0.4     my8033_rel_lto.cy10a21_gcp_c2s30
316     0.5     my8033_rel_lto.cy10a22_gcp_c2s30
273     1.6     my8033_rel_lto.cy10a2_gcp_c2s30
285     1.8     my8033_rel_lto.cy10a7_gcp_c2s30

Results: SLA for background inserts and deletes

The insert+delete rates are listed at the end of the Summary section in the perf reports.
  • Cached by InnoDB
  • IO-bound
    • 8.0 - a13, a20 and a22 configs for upstream 8.0.33 failed for the q1000 step. These configs enable limits on the max purge lag to avoid too much MVCC GC debt and the limits were reached during q1000 for which the targets are 8000 inserts/s and 8000 delete/s (8x1000). The value from limits on purge lag can be seen in the Impact of MVCC GC section below where a query can take ~100 seconds without a purge lag limit vs ~30 milliseconds with the limit.
    • 5.6 vs 5.7 vs 8.0 - target rates were sustained
Results: response time histograms

Response time distributions -- trying to be brief because this blog post is long.
  • Cached by InnoDB
  • IO-bound
    • 8.0
      • on the l.i1 step the a21 config has a bad outlier, 18.493s for a write, while Percona Server has outliers that are worse than upstream.
    • 5.6 vs 5.7 vs 8.0
      • on the li0 step 5.7 has the best histogram.
      • on the l.i1 step the max response time is 16.256s for 5.6, 14.511s for 5.7, 0.447s for upstream 8.0 and ~2s/~1s for Percona with the ps1/ps2 configs. Percona Server has the best response time histograms.
      • For q100, q500 and q1000 the histograms are similar
Results: performance at 1-second intervals

For Cached by InnoDB and 5.6 vs 5.7 vs 8.0
For IO-bound and 5.6 vs 5.7 vs 8.0
Results: purge lag

These show history list length (InnoDB purge lag) at the end of each benchmark step for Cached by InnoDB and IO-bound. The a13, a20, a21 and a22 configs all enable limits on max purge lag to avoid a too-large history list length. The benefit from this is avoiding slow queries (see the next section). The cost is a lower write rate (see the average throughput section above). There are always tradeoffs and the Rum Conjecture applies. I don't know whether the costs of this tradeoff from limiting max purge lag could be smaller. Is the current code good enough? 

Results: The impact of MVCC GC on some queries

There is a query done at the start of benchmark steps to determine the min value of transactionid (the PK column in the table). Inserts are done to the head of the table while deletes are done from the tail and for InnoDB and MyRocks this query can be slow unless you change the values of a few options.

This query can be a problem for InnoDB (regardless of tuning) and for MyRocks (dependent on tuning). It isn't a problem for Postgres.

A blog post with more details is here.

The query response time in seconds for the query done at the start of the benchmark step. It gets slow starting with the steps that follow l.i1 because l.i1 creates too much MVCC GC debt. Some configs (a13, a20, a21) avoid the slow query by enabling a limit on the max purge lag. Other configs (a10, a22) avoid the slow query by making sure that the innodb_page_cleaners, innodb_purge_threads and innodb_buffer_pool_instances have the same value.

Cached by InnoDB
config                                  l.i1    q100    q500    q1000
ps8033_rel_lto.cy10aps2_gcp_c2s30       0.032   86.014  5.308   0.001
ps8033_rel_lto.cy10aps1_gcp_c2s30       0.032   96.757  5.292   0.001

my8033_rel_lto.cy10a_gcp_c2s30          0.033   95.307  1.915   0.001
my8033_rel_lto.cy10a2_gcp_c2s30         0.033   94.290  1.566   0.001
my8033_rel_lto.cy10a7_gcp_c2s30         0.033   92.804  0.002   0.001
my8033_rel_lto.cy10a10_gcp_c2s30        0.033   0.001   0.001   0.001
my8033_rel_lto.cy10a12_gcp_c2s30        0.033   94.619  1.962   0.001
my8033_rel_lto.cy10a13_gcp_c2s30        0.034   0.001   0.001   0.001
my8033_rel_lto.cy10a17_gcp_c2s30        0.032   38.458  0.083   0.001
my8033_rel_lto.cy10a18_gcp_c2s30        0.033   97.824  3.295   0.001
my8033_rel_lto.cy10a20_gcp_c2s30        0.034   0.001   0.001   0.001
my8033_rel_lto.cy10a21_gcp_c2s30        0.033   0.001   0.001   0.001
my8033_rel_lto.cy10a22_gcp_c2s30        0.034   0.001   0.001   0.001

my5740_rel_lto.cy10a_gcp_c2s30          0.000   153.6   121.0   128.6
my5651_rel.cy10a_gcp_c2s30              0.000   450.5   368.1   126.5


IO-bound
config                                  l.i1    q100    q500    q1000
ps8033_rel_lto.cy10aps2_gcp_c2s30       0.031   77.351  25.876  27.465
ps8033_rel_lto.cy10aps1_gcp_c2s30       0.031   84.890  29.894  29.342

my8033_rel_lto.cy10a_gcp_c2s30          0.036   79.598  28.934  30.864
my8033_rel_lto.cy10a7_gcp_c2s30         0.034   80.706  27.499  31.721
my8033_rel_lto.cy10a2_gcp_c2s30         0.034   67.411  28.523  32.066
my8033_rel_lto.cy10a22_gcp_c2s30        0.032   1.178   0.001   0.001
my8033_rel_lto.cy10a21_gcp_c2s30        0.032   4.447   0.001   0.002
my8033_rel_lto.cy10a20_gcp_c2s30        0.034   1.459   0.002   0.002
my8033_rel_lto.cy10a18_gcp_c2s30        0.032   85.202  31.091  32.530
my8033_rel_lto.cy10a17_gcp_c2s30        0.034   29.340  14.566  13.648
my8033_rel_lto.cy10a13_gcp_c2s30        0.032   0.095   0.001   0.002
my8033_rel_lto.cy10a12_gcp_c2s30        0.034   61.456  26.763  29.332
my8033_rel_lto.cy10a10_gcp_c2s30        0.031   63.485  27.013  22.921

my5740_rel_lto.cy10a_gcp_c2s30          0.003   101.5   39.1    42.5
my5651_rel.cy10a_gcp_c2s30              0.018   102.7   108.9   66.2












2 comments:

  1. Replies
    1. I don't think that Percona Server fixes the problem of InnoDB write back falling behind leading to single-page flush stalls.

      Delete

How efficient is RocksDB for IO-bound, point-query workloads?

How efficient is RocksDB for workloads that are IO-bound and read-only? One way to answer this is to measure the CPU overhead from RocksDB a...