I used the insert benchmark to search for performance regressions from old MyRocks (5.6.35) to modern MyRocks (8.0.28) and to determine the impact of compiler optimizations, because I build MyRocks from source. The context for the results is read-heavy and write-heavy workloads with an in-memory (cached by MyRocks) database, on a small server with low concurrency and a big server with high concurrency.
The small server is a Beelink SER 4700u with 8 AMD cores, 16G of RAM and an NVMe SSD; low concurrency means 1 and 4 clients. The big server is a c2-standard-60 with 30 cores, hyperthreading disabled, 240G of RAM and 3T of local NVMe SSD; high concurrency means 20 clients.
tl;dr
- The best 8.0 builds use link-time optimization, which increases QPS by up to 10%. I have yet to figure out how to get 5.6 builds with link-time optimization. The comparisons below between the best 5.6 build and the best 8.0 build are skewed for this reason.
- It is possible that QPS on the read+write benchmark steps (q100.1, q500.1, q1000.1) suffers from bug 109595.
tl;dr for the small server:
- For 5.6.35 the rel build has the best performance (up to 8% more QPS)
- For 8.0.28 the rel_native_lto build has the best performance (up to 12% more QPS)
- Relative throughput for 8.0.28 versus 5.6.35 is between 0.81 and 0.92 for the benchmark steps, excluding index creation (l.x), on the 1 client & 1 table configuration. So 8.0.28 uses 10% to 20% more CPU per operation even with the ~10% improvement from link-time optimization.
tl;dr for the big server:
- For 5.6.35 the rel build has the best performance (up to 5% more QPS)
- For 8.0.28 the rel_native_lto build has the best performance (up to 10% more QPS)
- 8.0.28 had similar insert throughput but ~20% better read QPS versus 5.6.35. But remember that 8.0 benefits from link-time optimization while 5.6 does not.
Benchmarks were repeated for two configurations:
- cached by RocksDB - all data fits in the RocksDB block cache
- cached by OS - all data fits in the OS page cache but not the RocksDB block cache. For the small server RocksDB block cache size was set to 1G and the database was ~4G at test end. For the big server the block cache size was 4G and the database was ~60G at test end.
The tested combinations were:
- small server: 5.6.35 cached by RocksDB and by OS, 8.0.28 cached by RocksDB and by OS
- big server: 5.6.35 cached by RocksDB and by OS, 8.0.28 cached by RocksDB and by OS
The benchmark steps:
- l.i0 - insert X million rows without secondary indexes
- l.x - create 3 secondary indexes. I usually ignore results from this step.
- l.i1 - insert another X million rows with the overhead of secondary index maintenance
- q100.1 - do queries as fast as possible with 100 inserts/s/thread done in the background (a sketch of the rate-limited writer follows this list)
- q500.1 - do queries as fast as possible with 500 inserts/s/thread done in the background
- q1000.1 - do queries as fast as possible with 1000 inserts/s/thread done in the background
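The q*.1 steps pair query clients with background writers that are rate limited to a fixed insert rate. The sketch below is only an illustration of how a writer thread can be paced at N inserts/s; it is not the insert benchmark client, and insert_one_row is a hypothetical placeholder for issuing one INSERT.

```python
import time

def background_writer(insert_one_row, inserts_per_sec, duration_sec):
    """Pace one writer thread at a fixed insert rate (for example 100, 500 or 1000/s)."""
    interval = 1.0 / inserts_per_sec
    stop_at = time.monotonic() + duration_sec
    next_send = time.monotonic()
    while time.monotonic() < stop_at:
        insert_one_row()                 # placeholder for one INSERT on the test table
        next_send += interval            # fixed schedule so sleep jitter does not accumulate
        delay = next_send - time.monotonic()
        if delay > 0:
            time.sleep(delay)

# Example: pace a no-op "insert" at 100/s for 2 seconds.
if __name__ == "__main__":
    background_writer(lambda: None, inserts_per_sec=100, duration_sec=2)
```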
The builds tested (a sketch of how the names might map to compiler options follows this list):
- 5.6.35 - I tested the rel, rel_o2 and rel_withdbg builds
- 8.0.28 - I tested the rel_with_dbug, rel_o2, rel_native, rel, rel_o2_lto, rel_native_lto and rel_lto builds
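The build names encode compiler options. The sketch below shows how such variants might be configured with CMake; the option sets are my assumptions based on the names (link-time optimization via CMAKE_INTERPROCEDURAL_OPTIMIZATION, -march=native for the native builds), not the exact invocations behind these results. MySQL 8.0 also provides a WITH_LTO CMake option while 5.6 does not, which may be part of why LTO builds are easy to get for 8.0 but not for 5.6.

```python
import subprocess

# Hypothetical mapping from build name to configure options. These are assumptions
# that illustrate what the names suggest, not the exact options used for these results.
BUILD_OPTIONS = {
    "rel":            [],                             # defaults from the source tree
    "rel_o2":         ["-DCMAKE_C_FLAGS=-O2", "-DCMAKE_CXX_FLAGS=-O2"],
    "rel_native":     ["-DCMAKE_C_FLAGS=-march=native", "-DCMAKE_CXX_FLAGS=-march=native"],
    "rel_lto":        ["-DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON"],    # link-time optimization
    "rel_native_lto": ["-DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON",
                       "-DCMAKE_C_FLAGS=-march=native", "-DCMAKE_CXX_FLAGS=-march=native"],
}

def configure(build, src_dir, build_dir):
    """Run cmake for one build variant (a sketch, not the exact command used)."""
    cmd = ["cmake", "-S", src_dir, "-B", build_dir,
           "-DCMAKE_BUILD_TYPE=RelWithDebInfo"] + BUILD_OPTIONS[build]
    subprocess.run(cmd, check=True)

# Example: configure("rel_native_lto", "mysql-8.0.28", "build-rel_native_lto")
```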
Results from the small server:
For understanding CPU regressions the most interesting result is the comparison between 5.6.35 and 8.0.28 using 1 client, 1 table and a cached database -- see here. Tests with more concurrency can show improvements or regressions from mutex contention, which is also interesting, but the first thing I want to understand is CPU overhead. The 5.6 vs 8.0 comparison uses the rel build for 5.6.35 and the rel_native_lto build for 8.0.28. The use of link-time optimization for the 8.0 build but not for the 5.6 build will hide regressions because it increases QPS by between 5% and 10% for 8.0. The relative throughput (QPS for 8.0.28 relative to QPS for 5.6.35) per benchmark step is below, followed by a rough conversion to CPU per operation:
- 0.81 for l.i0 (inserts without secondary index maintenance)
- 0.87 for l.i1 (inserts with secondary index maintenance)
- 0.88 for q100.1 - range queries with 100 inserts/s
- 0.90 for q500.1 - range queries with 500 inserts/s
- 0.92 for q1000.1 - range queries with 1000 inserts/s
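A rough sanity check on the CPU claim in the tl;dr: with 1 client and a cached database these steps are close to CPU-bound, so CPU per operation is approximately the inverse of the throughput ratio. This is an approximation, not a measurement:

\[
\frac{\text{CPU per operation, 8.0.28}}{\text{CPU per operation, 5.6.35}} \approx \frac{1}{\text{QPS ratio}},
\qquad \frac{1}{0.92} \approx 1.09, \qquad \frac{1}{0.81} \approx 1.23
\]

which is roughly the 10% to 20% extra CPU per operation mentioned above.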
Summaries per configuration:
- cached by RocksDB
- 1 client, 1 table: 5.6.35, 8.0.28, 5.6 vs 8.0
- 4 clients, 4 tables: 5.6.35, 8.0.28, 5.6 vs 8.0
- 4 clients, 1 table: 5.6.35, 8.0.28, 5.6 vs 8.0
- cached by OS
- 1 client, 1 table: 5.6.35, 8.0.28, 5.6 vs 8.0
- 4 clients, 4 tables: 5.6.35, 8.0.28, 5.6 vs 8.0
- 4 clients, 1 table: 5.6.35, 8.0.28, 5.6 vs 8.0
Results from the big server:
For understanding CPU regressions the most interesting result is the comparison between 5.6.35 and 8.0.28 using 20 clients, 20 tables and a cached database -- see here. The 5.6 vs 8.0 comparison uses the rel build for 5.6.35 and the rel_native_lto build for 8.0.28. The use of link-time optimization for the 8.0 build but not for the 5.6 build will hide regressions because it increases QPS by between 5% and 10% for 8.0. The relative throughput (QPS for 8.0.28 relative to QPS for 5.6.35) per benchmark step:
- 1.05 for l.i0 (inserts without secondary index maintenance)
- 1.02 for l.i1 (inserts with secondary index maintenance)
- 1.01 for q100.1 - range queries with 100 inserts/s
- 1.00 for q500.1 - range queries with 500 inserts/s
- 1.01 for q1000.1 - range queries with 1000 inserts/s
Summaries per configuration:
- cached by RocksDB
- 20 clients, 20 tables: 5.6.35, 8.0.28, 5.6 vs 8.0
- 20 clients, 1 table: 5.6.35, 8.0.28, 5.6 vs 8.0
- cached by OS
- 20 clients, 20 tables: 5.6.35, 8.0.28, 5.6 vs 8.0
- 20 clients, 1 table: 5.6.35, 8.0.28, 5.6 vs 8.0
Graphs from the big server:
- The graphs are from just one of the 20 clients
- I risk drawing strong conclusions from small samples
- cached by RocksDB
- 20 clients, 20 tables: l.i0, l.i1, q100.1, q500.1, q1000.1
- From the graph the l.i1 result has less variance than InnoDB
- For q100.1 there are write stalls between 1000s and 1600s. The QPS graph has a pattern that repeats every ~200s -- QPS starts high and then slowly degrades, then the cycle repeats. This might be the impact of a memtable flush or an L0->L1 compaction: when the memtable is full or the L0 has more SSTs, the CPU overhead for queries increases (rough arithmetic on the cycle length follows this list).
- For q500.1 results are similar to q100.1 but the QPS pattern happens faster because compaction happens faster (background insert rate here is 5X larger).
- For q1000.1 results are similar to q100.1 but the QPS pattern happens even faster (see q500.1 above, background insert rate here is 10X larger).
- 20 clients, 1 table: l.i0, l.i1, q100.1, q500.1, q1000.1
- Results are similar to 20 clients, 20 tables above
- cached by OS:
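A rough check on the repeating QPS pattern described above for q100.1, q500.1 and q1000.1: if the cycle is driven by how quickly the memtable (or the L0) fills, its period should scale inversely with the background insert rate. The cause is my assumption; only the ~200s period for q100.1 comes from the graphs:

\[
T \propto \frac{1}{\text{background insert rate}}
\;\Rightarrow\;
T_{q500.1} \approx \frac{200\,\text{s}}{5} = 40\,\text{s},
\qquad
T_{q1000.1} \approx \frac{200\,\text{s}}{10} = 20\,\text{s}
\]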
Max response times from the big server:
- cached by RocksDB
- 20 clients, 20 tables
- l.i0 - the max is [357ms, 376ms]
- l.i1 - the max is [380ms, 397ms]
- q100.1 - the max is [31ms, 47ms] for queries and [9ms, 25ms] for inserts
- q500.1 - the max is [27ms, 41ms] for queries and [31ms, 128ms] for inserts
- q1000.1 - the max is [37ms, 38ms] for queries and [34ms, 89ms] for inserts
- 20 clients, 1 table
- l.i0 - the max is [239ms, 309ms]
- l.i1 - the max is [226ms, 467ms]
- q100.1 - the max is [30ms, 37ms] for queries and [15ms, 16ms] for inserts
- q500.1 - the max is [36ms, 39ms] for queries and [31ms, 51ms] for inserts
- q1000.1 - the max is [24ms, 46ms] for queries and [36ms, 53ms] for inserts
- cached by OS
- 20 clients, 20 tables
- l.i0 - the max is [330ms, 333ms]
- l.i1 - the max is [372ms, 431ms]
- q100.1 - the max is [31ms, 46ms] for queries and [17ms, 24ms] for inserts
- q500.1 - the max is [30ms, 55ms] for queries and [25ms, 46ms] for inserts
- q1000.1 - the max is [31ms, 55ms] for queries and [44ms, 51ms] for inserts
- 20 clients, 1 table
- l.i0 - the max is [243ms, 264ms]
- l.i1 - the max is [208ms, 349ms]
- q100.1 - the max is [16ms, 51ms] for queries and [10ms, 17ms] for inserts
- q500.1 - the max is [42ms, 49ms] for queries and [31ms, 35ms] for inserts
- q1000.1 - the max is [34ms, 50ms] for queries and [34ms, 37ms] for inserts