Wednesday, March 29, 2023

Perf regressions in MyRocks, a larger server and sysbench

This has results for in-memory sysbench on a c2-standard-60 server in GCP to determine whether there are CPU performance regressions from old MyRocks (5.6.35) to modern MyRocks (8.0.28). Results for MyRocks and sysbench on a small server are here. The context for the results is short-running queries, in-memory (cached by MyRocks) with high-concurrency (20 clients) on a big server (30-cores).

There are two goals for these benchmarks. The first goal is to determine whether there are CPU regressions (more CPU/query) from old versions to new versions. The second goal is to determine which compiler optimizations I should use when building MyRocks from source.

tl;dr

  • For MyRocks 5.6.35 the rel build has the best performance
  • For MyRocks 8.0.28 the rel_native_lto build has the best performance. The largest improvement is from link time optimization.
  • For the 5.6.35 vs 8.0.28 comparison, only the write benchmarks show a regression in 8.0.28. For point queries 5.6.35 and 8.0.28 have similar throughput, and for range queries 8.0.28 gets about 20% more throughput. The results here for 8.0.28 are much better than the results on the small server. Perhaps the extra CPU/query in 8.0.28 is offset by less mutex contention.

Range queries in 5.6.35 vs 8.0.28

Why are the range query microbenchmarks ~20% faster in 8.0.28? My first guess was that 8.0.28 used the hyper clock cache, but it used LRUCache just like 5.6.35. Then I looked at vmstat output for range-covered-pk.range100 where 8.0.28 gets ~32% more QPS: 108204 for 8.0.28 vs 82209 for 5.6.35. From vmstat I see that the average values for user and system CPU (the us and sy columns), written as (user, system), are (50, 17) for 5.6.35 and (59, 10) for 8.0.28. The larger share of system time for 5.6.35 suggests that it has more mutex contention.

Some of the difference is due to the use of link-time optimization for 8.0.28 but not for 5.6.35, because I wasn't willing to invest a few hours in figuring out how to build 5.6 with -flto.
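The arithmetic behind the QPS and vmstat numbers above can be checked with a few lines of Python. This is a minimal sketch: the inputs are the values quoted in the text, while the "system share" metric (sy / (us + sy)) is my own way of expressing the ratio and is not something vmstat reports directly.

```python
# QPS for range-covered-pk.range100, copied from the text above.
qps = {"5.6.35": 82209, "8.0.28": 108204}

# Average vmstat CPU percentages as (us, sy), copied from the text above.
cpu = {"5.6.35": (50, 17), "8.0.28": (59, 10)}


def qps_gain(new, old):
    """Fractional throughput gain of new over old."""
    return new / old - 1.0


def system_share(us, sy):
    """Fraction of busy CPU time spent in the kernel (an illustrative metric)."""
    return sy / (us + sy)


gain = qps_gain(qps["8.0.28"], qps["5.6.35"])
shares = {version: system_share(us, sy) for version, (us, sy) in cpu.items()}

print(f"QPS gain for 8.0.28: {gain:.1%}")
for version, share in shares.items():
    print(f"{version}: system share of busy CPU = {share:.1%}")
```

The system share is larger for 5.6.35 than for 8.0.28, which is consistent with (but does not prove) more mutex contention in 5.6.35.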

Benchmark

A description of how I run sysbench is here. Tests use a c2-standard-60 server on GCP (30 cores with hyperthreading disabled, 240G RAM, 3T of storage using XFS and SW RAID 0 striped over 8 NVMe devices). The sysbench tests were run for 20 clients and 600 seconds per microbenchmark using 4 tables with 50M rows per table. All tests use the MyRocks storage engine. The test database fits in the MyRocks buffer pool.

I used a similar configuration (my.cnf) for all versions which is here for 5.6.35 and 8.0.28.

Builds

I tested MyRocks in FB MySQL versions 5.6.35 and 8.0.28 using multiple builds for each version. For each build+version the full set of sysbench microbenchmarks was repeated.

Compiler options tested by the builds include:
  • -O2 vs -O3
  • link time optimization via -flto
  • CPU specific tuning via -march=native -mtune=native
  • CMAKE_BUILD_TYPE set to RelWithDebInfo vs Release (see here)
The builds are fully described in the previous post.

For MyRocks 5.6.35 I tested these builds: rel, rel_o2, rel_withdbg. 

For MyRocks 8.0.28 I tested these builds: rel_withdbg, rel_o2, rel_native, rel, rel_o2_lto, rel_native_lto, rel_lto.

Results: per-version

The result spreadsheet is here.

The graphs use relative throughput which is throughput for me / throughput for base case. When the relative throughput is > 1 then my results are better than the base case. When it is 1.10 then my results are ~10% better than the base case. The base case is the rel_withdbg build for 5.6.35 and 8.0.28.
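As a small illustration of the relative throughput definition above, here is a sketch in Python. The QPS values are hypothetical and are not taken from the result spreadsheet.

```python
def relative_throughput(mine, base):
    """Return throughput for my build divided by throughput for the base case."""
    if base <= 0:
        raise ValueError("base throughput must be positive")
    return mine / base


# Hypothetical QPS values for one microbenchmark (not from the spreadsheet).
base_qps = 10000.0  # the base case, for example the rel_withdbg build
my_qps = 11000.0    # some other build

rt = relative_throughput(my_qps, base_qps)
print(f"relative throughput = {rt:.2f}")
```

With these made-up inputs the result is 1.10, so the build would be ~10% better than the base case.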

There are three graphs per version which group the microbenchmarks by the dominant operation: one for point queries, one for range queries, one for writes. 

Disclaimers:
  • Readability is much better via the spreadsheet so I did not make the graphs x-large here. 
  • For most of the graphs the axis with values doesn't start at 0 to improve readability

For MyRocks 5.6.35 the throughput median for the rel build relative to rel_withdbg is 1.02 for point, 1.04 for range, 1.03 for writes.

For MyRocks 8.0.28 the throughput median for the rel_native_lto build relative to rel_withdbg is 1.06 for point, 1.06 for range, 1.01 for writes.

Results: all versions

These have results for MyRocks versions 5.6.35 and 8.0.28 on one graph using the rel build for 5.6.35 and the rel_native_lto build for 8.0.28. The result spreadsheet is here.

The graphs use relative throughput which is throughput for me / throughput for base case. When the relative throughput is > 1 then my results are better than the base case. When it is 1.10 then my results are ~10% better than the base case. The base case is the rel build with MyRocks 5.6.35.

There are three graphs per version which group the microbenchmarks by the dominant operation: one for point queries, one for range queries, one for writes.

The throughput median for the rel_native_lto 8.0.28 build relative to the 5.6.35 rel is 1.02 for point, 1.23 for range, 0.97 for writes. The results here (few regressions) are much better than the results on the small server. Perhaps mutex contention was greatly reduced to counter the increase in CPU/query.
Summary statistics

These are computed for the throughput relative to the rel_withdbg build. 

For MyRocks 5.6.35

rel_withdbg      rel_o2  rel
Point: avg       0.98    1.03
Point: median    0.99    1.02
Point: min       0.87    0.93
Point: max       1.31    1.27
Point: stddev    0.096   0.082
Range: avg       1.00    1.03
Range: median    1.01    1.04
Range: min       0.91    0.96
Range: max       1.03    1.06
Range: stddev    0.037   0.025
Write: avg       1.00    1.03
Write: median    1.00    1.03
Write: min       1.00    1.01
Write: max       1.02    1.04
Write: stddev    0.007   0.009

For MyRocks 8.0.28

rel_withdbg      rel_o2  rel_native  rel    rel_o2_lto  rel_native_lto  rel_lto
Point: avg       0.98    1.06        1.06   1.00        1.07            1.05
Point: median    1.00    1.04        1.04   1.00        1.06            1.05
Point: min       0.78    0.99        1.01   0.92        1.04            0.97
Point: max       1.04    1.24        1.21   1.11        1.13            1.13
Point: stddev    0.056   0.056       0.050  0.050       0.030           0.037
Range: avg       1.00    1.03        1.05   1.02        1.06            1.06
Range: median    1.00    1.02        1.05   1.02        1.06            1.06
Range: min       0.97    1.00        1.00   0.91        1.02            1.01
Range: max       1.02    1.05        1.15   1.14        1.21            1.22
Range: stddev    0.013   0.016       0.040  0.047       0.045           0.051
Write: avg       1.00    1.01        1.01   1.00        1.01            1.01
Write: median    1.00    1.01        1.00   1.00        1.01            1.01
Write: min       1.00    0.99        1.00   0.99        1.00            1.00
Write: max       1.02    1.02        1.02   1.02        1.04            1.04
Write: stddev    0.007   0.013       0.007  0.007       0.013           0.014

For MyRocks 8.0.28 with the rel_native_lto build relative to MyRocks 5.6.35 with the rel build

5635_rel         8028_rel_native_lto
Point: avg       1.02
Point: median    1.02
Point: min       0.69
Point: max       1.33
Point: stddev    0.151
Range: avg       1.20
Range: median    1.23
Range: min       0.95
Range: max       1.45
Range: stddev    0.159
Write: avg       0.97
Write: median    0.97
Write: min       0.80
Write: max       1.12
Write: stddev    0.103
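
The summary statistics above (avg, median, min, max, stddev per group of microbenchmarks) can be reproduced from a list of relative throughput values. The sketch below uses Python's statistics module. The input values are hypothetical, and the use of the sample standard deviation is an assumption because the text does not say which form the spreadsheet uses.

```python
import statistics


def summarize(rel):
    """Summarize relative throughput values for one group of microbenchmarks."""
    return {
        "avg": statistics.mean(rel),
        "median": statistics.median(rel),
        "min": min(rel),
        "max": max(rel),
        # Sample standard deviation (needs at least 2 values); this choice
        # is an assumption, the spreadsheet may use the population form.
        "stddev": statistics.stdev(rel),
    }


# Hypothetical relative throughput per microbenchmark, grouped by the
# dominant operation. These are not the real results.
groups = {
    "Point": [1.02, 0.98, 1.05, 1.01],
    "Range": [1.20, 1.25, 0.95],
    "Write": [0.97, 0.99, 0.94, 0.98],
}

for name, values in groups.items():
    stats = summarize(values)
    for key in ("avg", "median", "min", "max"):
        print(f"{name}: {key:<8}{stats[key]:.2f}")
    print(f"{name}: {'stddev':<8}{stats['stddev']:.3f}")
```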

Perf regressions in MyRocks, a small server & sysbench

I used sysbench to search for performance regressions from old MyRocks (5.6.35) to modern MyRocks (8.0.28) and to determine the impact of compiler optimizations because I build it from source. The context for the results is short-running queries, in-memory (cached by MyRocks) with low-concurrency (1 & 4 clients) on a small server (8-core AMD).

tl;dr:

  • For MyRocks 5.6.35 the rel build has the best performance
  • For MyRocks 8.0.28 the rel_native_lto build has the best performance. The largest improvement is from link time optimization.
  • MyRocks 8.0.28 gets ~10% less throughput than 5.6.35 for short-running queries. The cause is more CPU/query. Much of the regression appears to be above the MySQL storage engine layer because the regressions from 5.6 to 8.0 are even larger for InnoDB than for MyRocks -- 25% or more with upstream MySQL/InnoDB vs 10% here.
  • The microbenchmarks with the largest regressions from 5.6 to 8.0 are random-points (select 1000 rows via in-list), insert and scan. Explaining these has been added to my TODO list although the problem with random-points is probably bug 102037 (fixed upstream in 8.0.31). See the Results: all versions section for more detail. 

Benchmark

A description of how I run sysbench is here. Tests use the Beelink server (8-core AMD, 16G RAM, NVMe SSD). The sysbench tests were run for 600 seconds per microbenchmark using 1 table with 20M rows. All tests use the MyRocks storage engine. The test database fits in the MyRocks buffer pool.  The benchmark was repeated for 1 and 4 clients.

I used a similar configuration (my.cnf) for all versions which is here for 5.6.35 and 8.0.28.

Builds

I tested MyRocks in FB MySQL versions 5.6.35 and 8.0.28 using multiple builds for each version. For each build+version the full set of sysbench microbenchmarks was repeated.

Compiler options tested by the builds include:
  • -O2 vs -O3
  • link time optimization via -flto
  • CPU specific tuning via -march=native -mtune=native
  • CMAKE_BUILD_TYPE set to RelWithDebInfo vs Release (see here)
The possible builds are:
  • rel_withdbg
    • CMAKE_BUILD_TYPE=RelWithDebInfo which implies -O2 with debug symbols (-g); link time optimization is not enabled unless -flto is added
  • rel
    • CMAKE_BUILD_TYPE=Release which implies -O3
  • rel_o2
    • CMAKE_BUILD_TYPE=Release, forces -O2
  • rel_native
    • CMAKE_BUILD_TYPE=Release which implies -O3, adds -march=native -mtune=native
  • rel_o2_lto
    • CMAKE_BUILD_TYPE=Release, forces -O2, adds -flto for link time optimization
  • rel_native_lto
    • CMAKE_BUILD_TYPE=Release which implies -O3, adds -march=native -mtune=native, adds -flto for link time optimization
  • rel_lto
    • CMAKE_BUILD_TYPE=Release which implies -O3, adds -flto for link time optimization
For MyRocks 5.6.35 I tested these builds: rel, rel_o2, rel_withdbg. The command line for cmake, output from cmake and output from make is here.

For MyRocks 8.0.28 I tested these builds: rel_withdbg, rel_o2, rel_native, rel, rel_o2_lto, rel_native_lto, rel_lto. The command line for cmake, output from cmake and output from make is here.
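
To make the list of builds concrete, here is a hedged sketch in Python that maps each build name to a set of cmake arguments. It is not the command line I used (that is linked above). It assumes that extra compiler flags are passed via CMAKE_C_FLAGS and CMAKE_CXX_FLAGS and that -O2 is forced by overriding CMAKE_C_FLAGS_RELEASE and CMAKE_CXX_FLAGS_RELEASE, which are standard CMake variables. Whether the MySQL build files honor them in exactly this way is an assumption.

```python
import shlex

# build name -> (CMAKE_BUILD_TYPE, forced optimization level or None, extra flags)
BUILDS = {
    "rel_withdbg": ("RelWithDebInfo", None, []),
    "rel": ("Release", None, []),
    "rel_o2": ("Release", "-O2", []),
    "rel_native": ("Release", None, ["-march=native", "-mtune=native"]),
    "rel_o2_lto": ("Release", "-O2", ["-flto"]),
    "rel_native_lto": ("Release", None, ["-march=native", "-mtune=native", "-flto"]),
    "rel_lto": ("Release", None, ["-flto"]),
}


def cmake_args(build):
    """Return a list of cmake -D arguments for the named build (a sketch)."""
    build_type, opt_level, extra = BUILDS[build]
    args = [f"-DCMAKE_BUILD_TYPE={build_type}"]
    if opt_level is not None:
        # Assumption: replace the default Release flags (-O3 -DNDEBUG).
        release_flags = f"{opt_level} -DNDEBUG"
        args.append(f"-DCMAKE_C_FLAGS_RELEASE={release_flags}")
        args.append(f"-DCMAKE_CXX_FLAGS_RELEASE={release_flags}")
    if extra:
        flags = " ".join(extra)
        args.append(f"-DCMAKE_C_FLAGS={flags}")
        args.append(f"-DCMAKE_CXX_FLAGS={flags}")
    return args


for name in BUILDS:
    quoted = " ".join(shlex.quote(arg) for arg in cmake_args(name))
    print(f"{name}: cmake {quoted} <source-dir>")
```

The optimization level goes into the per-build-type variables because CMake appends those after CMAKE_C_FLAGS and CMAKE_CXX_FLAGS on the compiler command line, and the last -O option wins.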

Results: per-version

The result spreadsheet is here.

The graphs use relative throughput which is throughput for me / throughput for base case. When the relative throughput is > 1 then my results are better than the base case. When it is 1.10 then my results are ~10% better than the base case. The base case is the rel_withdbg build for 5.6.35 and 8.0.28.

There are three graphs per version which group the microbenchmarks by the dominant operation: one for point queries, one for range queries, one for writes. 

Disclaimers:
  • Readability is much better via the spreadsheet so I did not make the graphs x-large here. 
  • For most of the graphs the axis with values doesn't start at 0 to improve readability

For MyRocks 5.6.35 with 1 client the throughput median for the rel build relative to rel_withdbg is 1.03 for point, 1.03 for range, 1.00 for writes.

For MyRocks 5.6.35 with 4 clients the throughput median for the rel build relative to rel_withdbg is 1.01 for point, 1.03 for range, 1.01 for writes.

For MyRocks 8.0.28 with 1 client the throughput median for the rel_native_lto build relative to rel_withdbg is 1.08 for point, 1.08 for range, 1.08 for writes.

For MyRocks 8.0.28 with 4 clients the throughput median for the rel_native_lto build relative to rel_withdbg is 1.07 for point, 1.11 for range, 1.08 for writes.

Results: all versions

These have results for MyRocks versions 5.6.35 and 8.0.28 on one graph using the rel build for 5.6.35 and the rel_native_lto build for 8.0.28. The result spreadsheet is here.

The graphs use relative throughput which is throughput for me / throughput for base case. When the relative throughput is > 1 then my results are better than the base case. When it is 1.10 then my results are ~10% better than the base case. The base case is the rel build with MyRocks 5.6.35.

There are regressions (more CPU/query) in MySQL releases from 5.6 to 8.0 and most appear to be above the storage engine level because the regressions here are not as bad as the results for upstream MySQL with InnoDB.

This table shows the median throughput for MyRocks 8.0.28 relative to 5.6.35 for the 1-client and 4-client benchmarks.

       1-client  4-clients
Point  0.85      0.90
Range  0.90      0.98
Write  0.90      0.91

The microbenchmarks with the largest regressions from 5.6.35 to 8.0.28 are:

                              1-client  4-clients
random-points.pre_range=1000  0.43      0.45
random-points_range=1000      0.45      0.50
scan_range=100                0.78      0.84
insert_range=100              0.70      0.73
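
Finding the microbenchmarks with the largest regressions is a matter of sorting by relative throughput. The sketch below does that in Python for the four microbenchmarks listed above. In practice the input would be the full set of microbenchmarks from the spreadsheet.

```python
# microbenchmark -> (1-client, 4-clients) throughput for 8.0.28 relative
# to 5.6.35, copied from the table above.
results = {
    "random-points.pre_range=1000": (0.43, 0.45),
    "random-points_range=1000": (0.45, 0.50),
    "scan_range=100": (0.78, 0.84),
    "insert_range=100": (0.70, 0.73),
}


def worst(results, column, n):
    """Return the n entries with the smallest relative throughput in a column."""
    return sorted(results.items(), key=lambda item: item[1][column])[:n]


for column, label in enumerate(("1-client", "4-clients")):
    print(label)
    for name, values in worst(results, column, 3):
        print(f"  {name}: {values[column]:.2f}")
```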

For the microbenchmarks with the largest regression, I will do more to explain these in a future post:
  • random-points - the Lua file is oltp_inlist_select.lua and the SQL is here. The query is a SELECT statement with 1000 values in the in-list to fetch rows by an exact match on an index. My first guess was that this comes from the optimizer doing more index dives in 8.0.28 than in 5.6.35, as I filed bug 91139 and blogged about this in 2017. However, the my.cnf files I use have eq_range_index_dive_limit=10, so that does not explain it. Then I remembered that I reported another bug for the same microbenchmark that arrived around 8.0.22 and was fixed in 8.0.31 -- see bug 102037. I don't think MyRocks 8.0.28 has that fix yet.
  • scan - the Lua file is oltp_scan.lua and the SQL is here. The query is written to filter all rows via the WHERE clause (nothing matches). So it isn't clear whether the regression is from the storage engine or the MySQL code that evaluates the WHERE clause.
  • insert - the Lua file is oltp_insert.lua and the SQL is here.
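
To show the shape of the random-points query described above, here is a sketch in Python that builds a SELECT with 1000 values in the in-list. The table name (sbtest1), the column names (id, c) and the helper function are assumptions for illustration based on the usual sysbench schema. The real SQL is in the Lua file linked above.

```python
import random


def inlist_select(table, column, values):
    """Build a SELECT that fetches rows by exact match using an in-list."""
    if not values:
        raise ValueError("the in-list must not be empty")
    in_list = ",".join(str(int(v)) for v in values)
    # The selected column (c) is a guess; only the in-list shape matters here.
    return f"SELECT c FROM {table} WHERE {column} IN ({in_list})"


rng = random.Random(42)                       # fixed seed for repeatability
ids = rng.sample(range(1, 20_000_001), 1000)  # 1000 distinct keys from 20M rows
query = inlist_select("sbtest1", "id", ids)

print(query[:60] + " ...")
print(f"values in the in-list: {len(ids)}")
```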
There are three graphs per version which group the microbenchmarks by the dominant operation: one for point queries, one for range queries, one for writes.

First the graphs for 1 client (1 thread).
And then the graphs for 4 clients (4 threads).
Summary statistics: per version

These are computed for the throughput relative to the rel_withdbg build. 

For MyRocks 5.6.35 with 1 client

rel_withdbg      rel_o2  rel
Point: avg       1.01    1.09
Point: median    0.99    1.03
Point: min       0.97    0.99
Point: max       1.24    1.74
Point: stddev    0.071   0.193
Range: avg       0.99    1.04
Range: median    0.99    1.03
Range: min       0.93    1.00
Range: max       1.04    1.16
Range: stddev    0.024   0.038
Write: avg       0.99    1.00
Write: median    1.00    1.00
Write: min       0.96    0.96
Write: max       1.02    1.02
Write: stddev    0.018   0.019

For MyRocks 5.6.35 with 4 clients

rel_withdbg      rel_o2  rel
Point: avg       1.03    1.01
Point: median    1.00    1.01
Point: min       0.98    0.99
Point: max       1.29    1.04
Point: stddev    0.079   0.015
Range: avg       0.99    1.02
Range: median    0.99    1.03
Range: min       0.89    0.98
Range: max       1.06    1.06
Range: stddev    0.036   0.022
Write: avg       1.00    1.01
Write: median    1.00    1.01
Write: min       0.99    0.99
Write: max       1.01    1.02
Write: stddev    0.008   0.008

For MyRocks 8.0.28 with 1 client

rel_withdbg      rel_o2  rel_native  rel    rel_o2_lto  rel_native_lto  rel_lto
Point: avg       1.00    1.01        1.01   1.04        1.08            1.10
Point: median    1.00    1.01        1.01   1.03        1.08            1.09
Point: min       0.96    0.99        0.99   0.97        0.99            1.05
Point: max       1.03    1.03        1.04   1.10        1.18            1.26
Point: stddev    0.020   0.011       0.013  0.036       0.042           0.051
Range: avg       1.00    1.01        1.02   1.05        1.09            1.08
Range: median    1.00    1.01        1.02   1.04        1.08            1.08
Range: min       0.96    0.99        1.00   0.98        1.07            1.06
Range: max       1.02    1.04        1.05   1.11        1.12            1.12
Range: stddev    0.015   0.013       0.017  0.033       0.014           0.018
Write: avg       1.01    1.01        1.01   1.06        1.08            1.07
Write: median    1.00    1.01        1.02   1.06        1.08            1.07
Write: min       0.99    0.99        0.99   1.03        1.04            1.04
Write: max       1.03    1.02        1.03   1.08        1.10            1.10
Write: stddev    0.011   0.011       0.013  0.014       0.018           0.018

For MyRocks 8.0.28 with 4 clients

rel_withdbg      rel_o2  rel_native  rel    rel_o2_lto  rel_native_lto  rel_lto
Point: avg       0.99    1.00        0.99   1.02        1.06            1.05
Point: median    1.01    1.02        1.01   1.04        1.07            1.08
Point: min       0.80    0.77        0.75   0.76        0.82            0.78
Point: max       1.01    1.04        1.03   1.09        1.14            1.16
Point: stddev    0.057   0.071       0.074  0.083       0.077           0.089
Range: avg       1.00    1.02        1.02   1.04        1.10            1.08
Range: median    1.01    1.02        1.03   1.05        1.11            1.08
Range: min       0.96    0.95        0.95   0.97        1.06            1.03
Range: max       1.03    1.03        1.05   1.08        1.12            1.13
Range: stddev    0.016   0.021       0.026  0.029       0.024           0.030
Write: avg       1.01    1.00        1.01   1.05        1.08            1.07
Write: median    1.01    1.00        1.01   1.06        1.08            1.07
Write: min       1.00    0.99        0.99   1.02        1.04            1.04
Write: max       1.02    1.02        1.02   1.08        1.10            1.08
Write: stddev    0.007   0.009       0.010  0.017       0.018           0.014

Summary statistics: all versions

These are computed for the throughput from MyRocks 8.0.28 with the rel_native_lto build relative to the rel build in MyRocks 5.6.35

1 client (1 thread)

5635_rel         8028_rel_native_lto
Point: avg       0.79
Point: median    0.85
Point: min       0.43
Point: max       0.97
Point: stddev    0.167
Range: avg       0.91
Range: median    0.90
Range: min       0.78
Range: max       1.04
Range: stddev    0.073
Write: avg       0.87
Write: median    0.90
Write: min       0.70
Write: max       0.92
Write: stddev    0.064

4 clients (4 threads)

5635_rel         8028_rel_native_lto
Point: avg       0.86
Point: median    0.90
Point: min       0.45
Point: max       1.01
Point: stddev    0.159
Range: avg       0.97
Range: median    0.98
Range: min       0.84
Range: max       1.08
Range: stddev    0.060
Write: avg       0.89
Write: median    0.91
Write: min       0.73
Write: max       0.99
Write: stddev    0.077
