Monday, April 17, 2023

Perf regressions in MyRocks with the insert benchmark

I used the insert benchmark to search for performance regressions from old MyRocks (5.6.35) to modern MyRocks (8.0.28) and to determine the impact of compiler optimizations, because I build it from source. The context for the results is read-heavy and write-heavy workloads with an in-memory (cached by MyRocks) database, on a small server with low concurrency and a big server with high concurrency.

The small server is a Beelink SER 4700u with 8 AMD cores, 16G of RAM and NVMe SSD, where low concurrency means 1 and 4 clients. The big server is a c2-standard-60 with 30 cores, hyperthreading disabled, 240G of RAM and 3T of local NVMe SSD, where high concurrency means 20 clients.

tl;dr

  • The best 8.0 builds use link-time optimization, which increases QPS by up to 10%. I have yet to figure out how to get 5.6 builds with link-time optimization, so the comparisons below between the best 5.6 build and the best 8.0 build are skewed in favor of 8.0 (a build sketch follows this list).
  • It is possible that QPS for the read+write benchmark steps (q100.1, q500.1, q1000.1) suffers from bug 109595.
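
The LTO builds use CMake. Below is a minimal sketch of one way to enable link-time optimization, using CMake's interprocedural optimization option. This is illustrative, not my exact build script, and the source path and -j value are placeholders:

    # Enable LTO via CMake's interprocedural optimization option, which adds
    # the compiler's LTO flags (-flto for gcc/clang) at compile and link time.
    cmake /path/to/fb-mysql-8.0.28 \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON
    make -j8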

tl;dr for the small server:

  • For 5.6.35 the rel build has the best performance (up to 8% more QPS)
  • For 8.0.28 the rel_native_lto build has the best performance (up to 12% more QPS)
  • Relative throughput for 8.0.28 versus 5.6.35 is between 0.81 and 0.92 for the benchmark steps, excluding index creation (l.x), in the 1 client & 1 table configuration. So 8.0.28 uses roughly 9% to 23% more CPU per operation (1/0.92 to 1/0.81), even with the ~10% improvement from link-time optimization.

tl;dr for the big server:

  • For 5.6.35 the rel build has the best performance (up to 5% more QPS)
  • For 8.0.28 the rel_native_lto build has the best performance (up to 10% more QPS)
  • 8.0.28 had similar insert throughput but ~20% better read QPS versus 5.6.35. But remember that 8.0 benefits from link-time optimization while 5.6 does not.

Benchmarks

An overview of the insert benchmark is here and here. The insert benchmark was run for a cached database with both 1 and 4 clients. For 1 client the benchmark used 1 table. For 4 clients the benchmark was first run with 4 tables (1 client/table) and then again with 1 table (all clients shared the table). The read+write steps (q100.1, q500.1, q1000.1) were run for 1800 seconds each.

Benchmarks were repeated for two configurations:
  • cached by RocksDB - all data fits in the RocksDB block cache
  • cached by OS - all data fits in the OS page cache but not the RocksDB block cache. For the small server RocksDB block cache size was set to 1G and the database was ~4G at test end. For the big server the block cache size was 4G and the database was ~60G at test end.
The my.cnf files for the small server and the big server are here.
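
As a hedged illustration of the cached-by-OS configuration on the small server, the relevant my.cnf line would look like the following. rocksdb_block_cache_size is the MyRocks option for the block cache; the rest of the my.cnf is not shown:

    # Block cache (1G) much smaller than RAM (16G) so the ~4G database fits
    # in the OS page cache but not in the RocksDB block cache.
    rocksdb_block_cache_size=1G
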
The benchmark is a sequence of steps. The values for X were 20 for the small server and 400 for the big server. The steps are listed below, and a simplified sketch of them follows the list:

  • l.i0 - insert X million rows without secondary indexes
  • l.x - create 3 secondary indexes. I usually ignore results from this step.
  • l.i1 - insert another X million rows with the overhead of secondary index maintenance
  • q100.1 - do queries as fast as possible with 100 inserts/s/thread done in the background
  • q500.1 - do queries as fast as possible with 500 inserts/s/thread done in the background
  • q1000.1 - do queries as fast as possible with 1000 inserts/s/thread done in the background
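
The insert benchmark client is a Python program. Below is a much-simplified sketch of the statement shapes per step; the table name, columns and query form are illustrative, not the real benchmark schema:

    # Simplified sketch of the insert benchmark steps, not the real client.
    # The real benchmark uses a wider table and rate-limits the background
    # writers during the query steps.
    import random, string

    def batched_insert(next_pk, n_rows):
        """l.i0 and l.i1: one multi-row INSERT per call, PK values ascending."""
        rows = []
        for i in range(n_rows):
            rows.append("(%d, %d, %d, '%s')" % (
                next_pk + i,                    # PK, inserted in order
                random.randint(0, 10**9),       # column a, indexed by l.x
                random.randint(0, 10**9),       # column b, indexed by l.x
                ''.join(random.choices(string.ascii_letters, k=30))))
        return "INSERT INTO t(pk, a, b, c) VALUES " + ",".join(rows)

    # l.x: create 3 secondary indexes, after which l.i1 pays for their upkeep
    CREATE_INDEXES = [
        "CREATE INDEX i_a ON t(a)",
        "CREATE INDEX i_b ON t(b)",
        "CREATE INDEX i_ab ON t(a, b)",
    ]

    def range_query():
        """q100.1, q500.1 and q1000.1: short range scans run as fast as
        possible while background threads insert at 100, 500 or 1000
        rows/s/thread."""
        return ("SELECT a, b FROM t WHERE a >= %d ORDER BY a LIMIT 10"
                % random.randint(0, 10**9))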

MyRocks

Tests used MyRocks from FB MySQL 5.6.35 and 8.0.28 with different builds to test compiler optimizations. The builds are described in a previous post and I tested:
  • 5.6.35 - I tested the rel, rel_o2 and rel_withdbg builds
  • 8.0.28 - I tested the rel_with_dbug, rel_o2, rel_native, rel, rel_o2_lto, rel_native_lto and rel_lto builds

Reports: small server

Performance summaries generated by shell scripts are below. A short guide to these results is here.

For understanding CPU regressions the most interesting result is the comparison between 5.6.35 and 8.0.28 using 1 client, 1 table and a cached database -- see here. Tests with more concurrency can show improvements or regressions from mutex contention, which is also interesting, but the first thing I want to understand is CPU overhead. The 5.6 vs 8.0 comparison uses the rel build for 5.6.35 and the rel_native_lto build for 8.0.28. The use of link-time optimization for the 8.0 build but not for the 5.6 build will hide regressions because it increases QPS by between 5% and 10% for 8.0. 

The relative throughput for 8.0.28 vs 5.6.35 in the 1 table, 1 client, cached database configuration is:
  • 0.81 for l.i0 (inserts without secondary index maintenance)
  • 0.87 for l.i1 (inserts with secondary index maintenance)
  • 0.88 for q100.1 - range queries with 100 inserts/s
  • 0.90 for q500.1 - range queries with 500 inserts/s
  • 0.92 for q1000.1 - range queries with 1000 inserts/s 
The 5.6 vs 8.0 results compare the 5.6.35 rel build with the 8.0.28 rel_native_lto build. So the 8.0 results get an (unfair) boost from link-time optimization.
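
Relative throughput is the ratio of average rates for the same benchmark step, and for a CPU-bound step its inverse approximates the extra CPU per operation. A trivial worked example with placeholder QPS values (not numbers from the reports):

    # relative throughput = (8.0.28 rate) / (5.6.35 rate) for one step
    qps_5635, qps_8028 = 10000.0, 8100.0
    ratio = qps_8028 / qps_5635     # -> 0.81
    cpu_per_op = 1.0 / ratio        # -> ~1.23, ~23% more CPU per operation
    print(round(ratio, 2), round(cpu_per_op, 2))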

Reports for cached by RocksDB:
Reports for cached by OS:

Reports: big server

Performance summaries generated by shell scripts are below. A short guide to these results is here.

For understanding CPU regressions the most interesting result is the comparison between 5.6.35 and 8.0.28 using 20 clients, 20 tables and a cached database -- see here. The 5.6 vs 8.0 comparison uses the rel build for 5.6.35 and the rel_native_lto build for 8.0.28. The use of link-time optimization for the 8.0 build but not for the 5.6 build will hide regressions because it increases QPS by between 5% and 10% for 8.0. 

The relative throughput for 8.0.28 vs 5.6.35 in the 20 clients, 20 tables and cached database configuration is:
  • 1.05 for l.i0 (inserts without secondary index maintenance)
  • 1.02 for l.i1 (inserts with secondary index maintenance)
  • 1.01 for q100.1 - range queries with 100 inserts/s
  • 1.00 for q500.1 - range queries with 500 inserts/s
  • 1.01 for q1000.1 - range queries with 1000 inserts/s 
Reports for cached by RocksDB:
Reports for cached by OS:

Response time: big server

Charts with throughput and response time at 1-second intervals, using results from the big server, are here. First, the disclaimers:
  • The graphs are from just one of the 20 clients
  • I risk drawing strong conclusions from small samples
The charts:
  • cached by RocksDB
    • 20 clients, 20 tables: l.i0, l.i1, q100.1, q500.1, q1000.1
      • From the graph the l.i1 result has less variance than InnoDB
      • For q100.1 there are write stalls between 1000s and 1600s. The QPS graph has a pattern that repeats every ~200s: QPS starts high and then slowly degrades. This might be the impact of a memtable flush or an L0->L1 compaction, because when the memtable is nearly full or the L0 has more SSTs, the CPU overhead from queries increases (a monitoring sketch follows this list).
      • For q500.1 results are similar to q100.1 but the QPS pattern happens faster because compaction happens faster (background insert rate here is 5X larger).
      • For q1000.1 results are similar to q100.1 but the QPS pattern happens even faster (see q500.1 above, background insert rate here is 10X larger).
    • 20 clients, 1 table: l.i0, l.i1, q100.1, q500.1, q1000.1
      • Results are similar to 20 clients, 20 tables above
  • cached by OS: 
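
To check whether the repeating QPS pattern lines up with memtable flushes and L0->L1 compaction, something like the sketch below could sample MyRocks state once per second. It assumes the FB MySQL table INFORMATION_SCHEMA.ROCKSDB_CFSTATS with columns (CF_NAME, STAT_TYPE, VALUE) and that the mysql CLI can connect without extra arguments; verify both on your build:

    # Hedged monitoring sketch: poll per-column-family stats once per second
    # to compare memtable size and L0 state against the QPS sawtooth.
    import subprocess, time

    SQL = ("SELECT CF_NAME, STAT_TYPE, VALUE "
           "FROM INFORMATION_SCHEMA.ROCKSDB_CFSTATS")

    while True:
        out = subprocess.run(["mysql", "-B", "-e", SQL],
                             capture_output=True, text=True).stdout
        print(time.strftime("%H:%M:%S"))
        print(out.strip())
        time.sleep(1)
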
Up next are tables with response time details per benchmark step, both histograms and the max. Note that ms means milliseconds, insert means a multi-row insert statement, and the brackets ([]) below indicate the range of the max response time over multiple results, one per build. It can be risky to draw strong inferences from such a small sample.
  • cached by RocksDB
    • 20 clients, 20 tables
      • l.i0 - the max is [357ms, 376ms]
      • l.i1 - the max is [380ms, 397ms]
      • q100.1 - the max is [31ms, 47ms] for queries and [9ms, 25ms] for inserts
      • q500.1 - the max is [27ms, 41ms] for queries and [31ms, 128ms] for inserts
      • q1000.1 - the max is [37ms, 38ms] for queries and [34ms, 89ms] for inserts
    • 20 clients, 1 table
      • l.i0 - the max is [239ms, 309ms]
      • l.i1 - the max is [226ms, 467ms]
      • q100.1 - the max is [30ms, 37ms] for queries and [15ms, 16ms] for inserts
      • q500.1 - the max is [36ms, 39ms] for queries and [31ms, 51ms] for inserts
      • q1000.1 - the max is [24ms, 46ms] for queries and [36ms, 53ms] for inserts
  • cached by OS
    • 20 clients, 20 tables
      • l.i0 - the max is [330ms, 333ms]
      • l.i1 - the max is [372ms, 431ms]
      • q100.1 - the max is [31ms, 46ms] for queries and [17ms, 24ms] for inserts
      • q500.1 - the max is [30ms, 55ms] for queries and [25ms, 46ms] for inserts
      • q1000.1 - the max is [31ms, 55ms] for queries and [44ms, 51ms] for inserts
    • 20 clients, 1 table
      • l.i0 - the max is [243ms, 264ms]
      • l.i1 - the max is [208ms, 349ms]
      • q100.1 - the max is [16ms, 51ms] for queries and [10ms, 17ms] for inserts
      • q500.1 - the max is [42ms, 49ms] for queries and [31ms, 35ms] for inserts
      • q1000.1 - the max is [34ms, 50ms] for queries and [34ms, 37ms] for inserts