Monday, April 17, 2023

Perf regressions in MyRocks with the insert benchmark

I used the insert benchmark to search for performance regressions from old MyRocks (5.6.35) to modern MyRocks (8.0.28) and to determine the impact of compiler optimizations, because I build MyRocks from source. The context for the results is read-heavy and write-heavy workloads with an in-memory (cached by MyRocks) database, on a small server with low concurrency and a big server with high concurrency.

The small server is a Beelink SER 4700u with 8 AMD cores, 16G of RAM and NVMe SSD, where low concurrency means 1 and 4 clients. The big server is a c2-standard-60 with 30 cores (hyperthreading disabled), 240G of RAM and 3T of local NVMe SSD, where high concurrency means 20 clients.

tl;dr

  • The best 8.0 builds use link-time optimization, which increases QPS by up to 10%. I have yet to figure out how to get 5.6 builds with link-time optimization. The comparisons below between the best 5.6 build and the best 8.0 build are skewed for this reason.
  • It is possible that QPS on the read+write benchmark steps (q100.1, q500.1, q1000.1) suffers from bug 109595.

tl;dr for the small server:

  • For 5.6.35 the rel build has the best performance (up to 8% more QPS)
  • For 8.0.28 the rel_native_lto build has the best performance (up to 12% more QPS)
  • Relative throughput for 8.0.28 versus 5.6.35 is between 0.81 and 0.92 for the benchmark steps, excluding index creation (l.x), in the 1 client & 1 table configuration. Because CPU per operation scales as the inverse of relative throughput for this CPU-bound workload, 8.0.28 uses roughly 9% to 23% more CPU per operation even with the ~10% improvement from link-time optimization.

tl;dr for the big server:

  • For 5.6.35 the rel build has the best performance (up to 5% more QPS)
  • For 8.0.28 the rel_native_lto build has the best performance (up to 10% more QPS)
  • 8.0.28 had similar insert throughput but ~20% better read QPS versus 5.6.35. But remember that 8.0 benefits from link-time optimization while 5.6 does not.

Benchmarks

An overview of the insert benchmark is here and here. The insert benchmark was run for a cached database with 1 and 4 clients on the small server and 20 clients on the big server. For 1 client the benchmark used 1 table. For multiple clients the benchmark was first run with 1 client per table (4 or 20 tables) and then again with 1 table shared by all clients. The read+write steps (q100.1, q500.1, q1000.1) were run for 1800 seconds each.

Benchmarks were repeated for two configurations:
  • cached by RocksDB - all data fits in the RocksDB block cache
  • cached by OS - all data fits in the OS page cache but not the RocksDB block cache. For the small server the RocksDB block cache size was set to 1G and the database was ~4G at test end. For the big server the block cache size was 4G and the database was ~60G at test end.

The my.cnf files are here.

The benchmark is a sequence of steps; a sketch of the rate-limited insert loop used by the q100.1, q500.1 and q1000.1 steps follows the list. The value of X was 20 for the small server and 400 for the big server. The steps are:

  • l.i0 - insert X million rows without secondary indexes
  • l.x - create 3 secondary indexes. I usually ignore results from this step.
  • l.i1 - insert another X million rows with the overhead of secondary index maintenance
  • q100.1 - do queries as fast as possible with 100 inserts/s/thread done in the background
  • q500.1 - do queries as fast as possible with 500 inserts/s/thread done in the background
  • q1000.1 - do queries as fast as possible with 1000 inserts/s/thread done in the background
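
To make the q* steps concrete, below is a minimal sketch of a rate-limited background writer, assuming one writer thread per client. This is illustrative Python, not the actual benchmark client, and the execute callable is a hypothetical stand-in for running one multi-row insert statement.

    import threading
    import time

    def background_writer(execute, rate_per_sec, stop):
        # Issue inserts at a fixed rate (e.g. 100/s for q100.1).
        # The sleep-until-deadline loop keeps the long-run average
        # close to rate_per_sec even when individual inserts are slow.
        interval = 1.0 / rate_per_sec
        next_deadline = time.monotonic()
        while not stop.is_set():
            execute()  # hypothetical: run one multi-row INSERT
            next_deadline += interval
            delay = next_deadline - time.monotonic()
            if delay > 0:
                time.sleep(delay)

    # Queries run as fast as possible in the foreground while the
    # writer thread sustains the background insert rate.
    stop = threading.Event()
    writer = threading.Thread(target=background_writer,
                              args=(lambda: None, 100, stop))
    writer.start()
    # ... run queries for 1800 seconds ...
    stop.set()
    writer.join()
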
MyRocks

Tests used MyRocks from FB MySQL 5.6.35 and 8.0.28 with different builds to test compiler optimizations. The builds are described in a previous post and I tested:
  • 5.6.35 - I tested the rel, rel_o2 and rel_withdbg builds
  • 8.0.28 - I tested the rel_with_dbug, rel_o2, rel_native, rel, rel_o2_lto, rel_native_lto and rel_lto builds

Reports: small server

Performance summaries generated by shell scripts are below. A short guide to these results is here.

For understanding CPU regressions, the most interesting result is the comparison between 5.6.35 and 8.0.28 using 1 client, 1 table and a cached database -- see here. Tests with more concurrency can show improvements or regressions from mutex contention, which is also interesting, but the first thing I want to understand is CPU overhead. The 5.6 vs 8.0 comparison uses the rel build for 5.6.35 and the rel_native_lto build for 8.0.28. The use of link-time optimization for the 8.0 build but not for the 5.6 build will hide regressions because it increases QPS by 5% to 10% for 8.0.

The relative throughput for 8.0.28 vs 5.6.35 in the 1 table, 1 client, cached database configuration is listed below; a sketch of how I interpret these ratios follows the list.
  • 0.81 for l.i0 (inserts without secondary index maintenance)
  • 0.87 for l.i1 (inserts with secondary index maintenance)
  • 0.88 for q100.1 - range queries with 100 inserts/s
  • 0.90 for q500.1 - range queries with 500 inserts/s
  • 0.92 for q1000.1 - range queries with 1000 inserts/s
Again, these compare the 5.6.35 rel build with the 8.0.28 rel_native_lto build, so the 8.0 results get an (unfair) boost from link-time optimization.
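
The sketch below shows the arithmetic, assuming the workload is CPU-bound so that CPU per operation scales as the inverse of relative throughput. The 5% LTO gain is an assumption taken from the low end of the 5% to 10% range mentioned above.

    # Interpret relative throughput (8.0 QPS / 5.6 QPS) for the
    # 1 client, 1 table, cached configuration.
    ratios = {"l.i0": 0.81, "l.i1": 0.87,
              "q100.1": 0.88, "q500.1": 0.90, "q1000.1": 0.92}
    lto_gain = 1.05  # assumed: LTO gives 8.0 a 5% boost (could be up to 10%)

    for step, r in ratios.items():
        cpu_per_op = 1.0 / r     # more CPU per operation when r < 1
        r_no_lto = r / lto_gain  # estimated ratio without the LTO advantage
        print(f"{step}: {cpu_per_op:.2f}x CPU/op, ~{r_no_lto:.2f} without LTO")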

Reports for cached by RocksDB:
Reports for cached by OS:

Reports: big server

Performance summaries generated by shell scripts are below. A short guide to these results is here.

For understanding CPU regressions, the most interesting result is the comparison between 5.6.35 and 8.0.28 using 20 clients, 20 tables and a cached database -- see here. The 5.6 vs 8.0 comparison uses the rel build for 5.6.35 and the rel_native_lto build for 8.0.28. The use of link-time optimization for the 8.0 build but not for the 5.6 build will hide regressions because it increases QPS by 5% to 10% for 8.0.

The relative throughput for 8.0.28 vs 5.6.35 in the 20 clients, 20 tables and cached database configuration is:
  • 1.05 for l.i0 (inserts without secondary index maintenance)
  • 1.02 for l.i1 (inserts with secondary index maintenance)
  • 1.01 for q100.1 - range queries with 100 inserts/s
  • 1.00 for q500.1 - range queries with 500 inserts/s
  • 1.01 for q1000.1 - range queries with 1000 inserts/s

Reports for cached by RocksDB:
Reports for cached by OS:

Response time: big server

Charts with throughput and response time per 1-second intervals are here using results from the big server. First, the disclaimers:
  • The graphs are from just one of the 20 clients
  • I risk drawing strong conclusions from small samples
The charts:
  • cached by RocksDB
    • 20 clients, 20 tables: l.i0, l.i1, q100.1, q500.1, q1000.1
      • From the graph, the l.i1 result has less variance than it does for InnoDB
      • For q100.1 there are write stalls between 1000s and 1600s. The QPS graph has a pattern that repeats every ~200s: QPS starts high, slowly degrades, and then the cycle repeats. This might be the impact of a memtable flush or an L0->L1 compaction, because the CPU overhead for queries increases when the memtable is full or the L0 has more SSTs. A back-of-envelope sketch of the cycle-length arithmetic follows this list.
      • For q500.1 results are similar to q100.1 but the QPS pattern repeats faster because compaction happens faster (the background insert rate here is 5X larger).
      • For q1000.1 results are similar to q100.1 but the QPS pattern repeats even faster (see q500.1 above; the background insert rate here is 10X larger).
    • 20 clients, 1 table: l.i0, l.i1, q100.1, q500.1, q1000.1
      • Results are similar to 20 clients, 20 tables above
  • cached by OS: 
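
Below is the back-of-envelope sketch for the repeating QPS pattern. It assumes each cycle ends after a fixed amount of data has been written (for example, when a memtable fills and is flushed), so the cycle period is inversely proportional to the background insert rate. The ~200s period is the observed value for q100.1; the rest is arithmetic.

    # If each QPS cycle ends after a fixed amount of data is written
    # (e.g. one memtable flush), the cycle period shrinks in proportion
    # to the background insert rate.
    clients = 20
    period_q100 = 200.0  # seconds, observed for q100.1

    for rate in (100, 500, 1000):  # inserts/s/client for q100.1, q500.1, q1000.1
        total = rate * clients
        period = period_q100 * (100.0 / rate)
        print(f"{rate}/s/client -> {total}/s total, ~{period:.0f}s per cycle")
    # -> ~200s for q100.1, ~40s for q500.1, ~20s for q1000.1
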
Up next are tables with response time details per benchmark step: both histograms and the max. Note that ms means milliseconds, that by insert I mean a multi-row insert statement, and that brackets ([]) below indicate the range of the max response time over multiple results (1 result per build); a sketch of that bracket computation follows the tables. It can be risky to draw strong inferences from a small sample (one run per build).
  • cached by RocksDB
    • 20 clients, 20 tables
      • l.i0 - the max is [357ms, 376ms]
      • l.i1 - the max is [380ms, 397ms]
      • q100.1 - the max is [31ms, 47ms] for queries and [9ms, 25ms] for inserts
      • q500.1 - the max is [27ms, 41ms] for queries and [31ms, 128ms] for inserts
      • q1000.1 - the max is [37ms, 38ms] for queries and [34ms, 89ms] for inserts
    • 20 clients, 1 table
      • l.i0 - the max is [239ms, 309ms]
      • l.i1 - the max is [226ms, 467ms]
      • q100.1 - the max is [30ms, 37ms] for queries and [15ms, 16ms] for inserts
      • q500.1 - the max is [36ms, 39ms] for queries and [31ms, 51ms] for inserts
      • q1000.1 - the max is [24ms, 46ms] for queries and [36ms, 53ms] for inserts
  • cached by OS
    • 20 clients, 20 tables
      • l.i0 - the max is [330ms, 333ms]
      • l.i1 - the max is [372ms, 431ms]
      • q100.1 - the max is [31ms, 46ms] for queries and [17ms, 24ms] for inserts
      • q500.1 - the max is [30ms, 55ms] for queries and [25ms, 46ms] for inserts
      • q1000.1 - the max is [31ms, 55ms] for queries and [44ms, 51ms] for inserts
    • 20 clients, 1 table
      • l.i0 - the max is [243ms, 264ms]
      • l.i1 - the max is [208ms, 349ms]
      • q100.1 - the max is [16ms, 51ms] for queries and [10ms, 17ms] for inserts
      • q500.1 - the max is [42ms, 49ms] for queries and [31ms, 35ms] for inserts
      • q1000.1 - the max is [34ms, 50ms] for queries and [34ms, 37ms] for inserts
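
To be explicit about the bracket notation, below is a minimal sketch of how the ranges are computed. The per-build values are hypothetical and for illustration only.

    # One max response time per build; the bracket is the range across builds.
    max_ms_per_build = {"rel": 357, "rel_o2": 361, "rel_native_lto": 376}

    lo, hi = min(max_ms_per_build.values()), max(max_ms_per_build.values())
    print(f"the max is [{lo}ms, {hi}ms]")  # -> the max is [357ms, 376ms]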