Thursday, March 16, 2023

Huge pages with Postgres & InnoDB: better perf, but a bit more work

I started to use huge pages for the benchmarks I do with Postgres and MySQL/InnoDB. Huge pages help performance, but they require a few extra setup steps and those steps are another potential source of failures for production deployments. See here for an example of the perf improvements.

Disclaimer - I am new to this and far from an expert.

Postgres

As always, the Postgres docs are useful:

  1. Figure out how many huge pages will be needed, call this X
  2. Edit /etc/sysctl.conf with vm.nr_hugepages=$X
  3. sudo sysctl -p
  4. Compact the Linux VM (optional, not in the PG docs)
  5. Add huge_pages=try or huge_pages=on to the Postgres configuration
  6. Start Postgres
To estimate the number of huge pages (the value for X), first figure out the value of Hugepagesize (grep Huge /proc/meminfo); in the example below it is 2MB (2048 kB). Then X = sizeof(buffer pool) / Hugepagesize. A worked example follows the output below.

$ grep Huge /proc/meminfo

AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
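
As a worked example (the buffer pool size here is hypothetical): with shared_buffers=8GB and the 2048 kB Hugepagesize shown above, X = (8 * 1024 * 1024 kB) / 2048 kB = 4096 huge pages. Rounding X up a bit is safer because Postgres allocates slightly more shared memory than shared_buffers. Then:

$ echo "vm.nr_hugepages=4096" | sudo tee -a /etc/sysctl.conf
$ sudo sysctl -p

And in postgresql.conf:

shared_buffers=8GB
huge_pages=try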
I suggest compacting the Linux VM because huge pages might not be available if you don't (I learned this from experience), and you can judge that based on the value of HugePages_Total in /proc/meminfo. After starting Postgres check /proc/meminfo again to confirm that huge pages were used. This post has more details.

My script for compacting the VM is:

# write dirty pages to storage, drop the page cache, then defragment free memory
# the echo commands must be run as root
sync; sync; sync
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
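
To confirm it worked I look at the HugePages_ counters: HugePages_Total should match the value set for vm.nr_hugepages, and HugePages_Free should drop once Postgres is using the pages.

$ grep HugePages_ /proc/meminfo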

MySQL/InnoDB

The docs are here for InnoDB with huge pages, although I think they need to be updated. AFAIK InnoDB always uses mmap now, while it used shm (shmget, shmat) prior to modern 8.0. When shm is used you need to worry about /proc/sys/vm/hugetlb_shm_group, /proc/sys/kernel/shmmax and /proc/sys/kernel/shmall as mentioned in the InnoDB docs, but I don't think those matter now that InnoDB appears to always use mmap. For the 5.7 code that uses shm see here and for modern 8.0 that uses mmap see here.

To use huge pages with InnoDB:

  1. Figure out how many huge pages will be needed, call this X. See the Postgres section above.
  2. Edit /etc/sysctl.conf with vm.nr_hugepages=$X
  3. sudo sysctl -p
  4. Compact the Linux VM (optional, not in the InnoDB docs). See the Postgres section above.
  5. Confirm that innodb_buffer_pool_chunk_size is larger than the huge page size. The default is 128M which is great when the huge page size is 2M.
  6. Add large_pages=ON to my.cnf
  7. Start mysqld
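A minimal my.cnf sketch for steps 5 to 7 (the buffer pool size is illustrative, not a recommendation):

[mysqld]
large_pages=ON
innodb_buffer_pool_size=8G
innodb_buffer_pool_chunk_size=128M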
When I first tried this with MySQL 5.6 I forgot to edit /proc/sys/vm/hugetlb_shm_group as described in the InnoDB docs. I was able to start mysqld, but there were errors in the database error log and odd query and connection errors afterwards. Fortunately, I don't think you need to edit hugetlb_shm_group with modern 8.0 because it uses mmap instead of shm.





Wednesday, March 15, 2023

Compiler optimizations, MyRocks and a small server

I revisited prior work after taking more care to document the impact of the flags I use when compiling MyRocks. I ran two in-memory benchmarks (sysbench, insert benchmark) to understand the impact of the compile-time flags on CPU efficiency.

tl;dr:

  • use CMAKE_BUILD_TYPE=Release
  • enable link time optimization (-flto with gcc) via -DWITH_LTO=ON
  • use -march=native & -mtune=native (if possible)
For MyRocks with FB MySQL 8.0.28 the build that uses both link time optimization and CPU specific optimizations (-march=native -mtune=native), called rel_native_lto, gets about 5% more throughput with the insert benchmark and about 5%, 7% and 6% more with sysbench for point queries, range queries and writes.

Disclaimer - the MySQL build for MyRocks wraps the RocksDB build and that gets in the way. While the RocksDB build respected most of the flags described below, I wasn't able to control the use of -march=native, which ended up being applied to the RocksDB source files in all builds, even when I didn't want it to be used.

Compile time options

I tested the following builds for MyRocks. I wasn't able to get link time optimization working for the 5.6.35 builds. It might be possible but I stopped after trying for a few hours.

Note that in many cases with modern MySQL the use of CMAKE_BUILD_TYPE=RelWithDebInfo implies the use of link time optimization (-flto in gcc) while CMAKE_BUILD_TYPE=Release does not. With help from experts I documented that in this post. However, the only builds listed below that use link time optimization are ones with _lto in their name and that is done via -DWITH_LTO=ON.

I tested these builds for FB MySQL 5.6.35 at git sha 256826240. The CMake command lines are in the cmk.* files here:
  • rel_withdbg - CMAKE_BUILD_TYPE=RelWithDebInfo which implies -O2
  • rel_o2 - CMAKE_BUILD_TYPE=Release, explicitly set -O2
  • rel - CMAKE_BUILD_TYPE=Release which implies -O3
I tested these builds for FB MySQL 8.0.28 at git sha 8fae2bbdc. The CMake command lines are in the cmk.* files here:

  • rel_withdbg - CMAKE_BUILD_TYPE=RelWithDebInfo which implies -O2
  • rel_o2 - CMAKE_BUILD_TYPE=Release, explicitly set -O2
  • rel - CMAKE_BUILD_TYPE=Release which implies -O3
  • rel_native - CMAKE_BUILD_TYPE=Release which implies -O3, added -march=native -mtune=native
  • rel_o2_lto - CMAKE_BUILD_TYPE=Release, explicitly set -O2, added -flto
  • rel_lto - CMAKE_BUILD_TYPE=Release which implies -O3, added -flto
  • rel_native_lto - CMAKE_BUILD_TYPE=Release which implies -O3, added -march=native -mtune=native -flto
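As a sketch, the rel_native_lto build amounts to something like the following. This is an approximation, not the exact invocation; the exact command lines are in the cmk.* files linked above.

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DWITH_LTO=ON \
  -DCMAKE_C_FLAGS="-march=native -mtune=native" \
  -DCMAKE_CXX_FLAGS="-march=native -mtune=native"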
Benchmark HW

The HW is a Beelink SER 4700u with 8 AMD cores, 16G of RAM and a fast NVMe SSD, described here.

Benchmarks

The benchmarks were sysbench and the insert benchmark configured so that the database fits in memory. For some benchmark steps there were writes to storage but there were no reads from storage. 

An overview of the insert benchmark is here and here. My scripts for running it are here. It was run in three configurations: 1 client & 1 table, 4 clients & 4 tables, and 4 clients & 1 table. Each run has 6 steps:

  • l.i0 - insert 20M rows without secondary indexes
  • l.x - create 3 secondary indexes
  • l.i1 - insert 20M rows with 3 secondary indexes in place. 
  • q100 - range queries with 100 inserts/s in the background, runs for 30 minutes
  • q500 - range queries with 500 inserts/s in the background, runs for 30 minutes
  • q1000 - range queries with 1000 inserts/s in the background, runs for 1 hour

My scripts for running sysbench are here and the lua directory includes additional benchmark steps not included upstream. There are 42 Lua scripts for which I provide results. Each represents a (micro)benchmark step that was run for 10 minutes. I place them into three groups -- point, range, write -- based on the common operation done for each where point does point queries, range does range queries and write does insert/update/delete. This was repeated for two configurations: 1 table & 1 thread, 1 table & 4 threads. 
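
For illustration, a single benchmark step is invoked roughly like this, where point-query.lua stands for one of the scripts in the lua directory and the option values match the 20M-row tables and 10-minute steps used here. My helper scripts set the connection options, so this command line is an approximation:

$ sysbench point-query.lua --tables=1 --table_size=20000000 --threads=1 --time=600 run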

Results: insert benchmark

The benchmark was repeated for three configurations. A spreadsheet with the throughput for each benchmark step for each of the three configurations is here. Graphs with relative throughput for the 1 thread & 1 table configuration are below; by relative throughput I mean the throughput for a build relative to the rel_withdbg build. The y-axis starts at 0.8 to improve readability. Graphs for the 4 thread configurations are similar. Conclusions from the graphs:
  • For MyRocks with 5.6.35 the rel build usually has ~3% more throughput and with 8.0.28 the rel_native_lto build has ~5% more throughput compared to the base case (rel_withdbg).
  • The l.x benchmark step has the most variance. I ignore that for now.
  • Performance with the rel build (uses -O3) is slightly better than with the rel_o2 build (uses -O2)
  • Using link time optimization helps (see results for rel_lto)
  • Using -march=native -mtune=native helps (see results for rel_native_lto)

Results: sysbench

The benchmark was repeated for two configurations: 1 table & 1 thread, 1 table & 4 threads. In each case the table starts with 20M rows and each benchmark step ran for 10 minutes. The benchmark steps are grouped into three classes based on the dominant operation: point (for point queries), range (for range queries) and write (for insert, update & delete).

The best performance comes from the rel build in FB MySQL 5.6.35 and the rel_native_lto build in FB MySQL 8.0.28, as shown by the tables below with summary statistics. The numbers are the throughput for each build relative to the throughput for the rel_withdbg build. For all benchmark steps except scan the throughput metric is QPS; for scan it is millions of rows read/second.

The spreadsheet with the throughput, summary statistics and graphs is here.

Summary statistics for MyRocks in FB MySQL 5.6.35 with the 1 table & 1 thread configuration show that the rel build on average gets 3% more QPS on point queries, 2% more on range queries and 0% more on writes relative to the rel_withdbg build. Results for the 1 table & 4 threads configuration are similar.

fbmy5635         rel     rel_o2
Point: avg       1.03    1.01
Point: median    1.01    1.00
Point: min       0.99    0.98
Point: max       1.22    1.18
Point: stddev    0.059   0.052
Range: avg       1.02    1.01
Range: median    1.02    1.00
Range: min       1.00    0.98
Range: max       1.07    1.03
Range: stddev    0.015   0.015
Write: avg       1.00    0.98
Write: median    1.00    0.99
Write: min       0.99    0.90
Write: max       1.04    1.02
Write: stddev    0.016   0.032

Summary statistics for MyRocks in FB MySQL 8.0.28 show that the rel_native_lto build provides the best performance and the rel_lto build is second best. The rel_native_lto build on average gets 5% more QPS on point queries, 7% more on range queries and 6% more on writes.

fbmy8028         rel     rel_o2  rel_native  rel_o2_lto  rel_lto  rel_native_lto
Point: avg       0.99    0.98    0.98        1.02        1.03     1.05
Point: median    1.00    1.00    0.99        1.03        1.05     1.06
Point: min       0.83    0.82    0.83        0.91        0.83     0.83
Point: max       1.07    1.06    1.02        1.06        1.09     1.14
Point: stddev    0.052   0.052   0.047       0.037       0.067    0.074
Range: avg       1.01    0.99    1.00        1.03        1.05     1.07
Range: median    1.01    1.00    1.00        1.04        1.06     1.07
Range: min       0.99    0.96    0.99        1.00        0.99     1.05
Range: max       1.03    1.01    1.02        1.10        1.11     1.11
Range: stddev    0.013   0.013   0.012       0.026       0.029    0.017
Write: avg       1.01    1.01    1.00        1.05        1.06     1.06
Write: median    1.01    1.00    1.01        1.06        1.07     1.07
Write: min       0.99    0.99    0.99        1.01        1.03     0.98
Write: max       1.02    1.02    1.02        1.07        1.09     1.10
Write: stddev    0.010   0.012   0.011       0.020       0.019    0.032

Results: sysbench graphs

There are a few outlier benchmark steps that get much more improvement than average from the rel build with 5.6.35 and the rel_native_lto build with 8.0.28. They are visible in the graphs below. The x-axis for the graphs starts at 0.8 rather than 0 to improve readability.

For MyRocks in 5.6.35 the benchmark step names are:
  • points-covered-si_range=100 - rel does 1.22X better
  • points-notcovered-si_range=100 - rel does 1.14X better
  • range-covered-si_range=100 - rel does 1.07X better
  • read-write_range=100 - rel does 1.04X better
And for MyRocks in 8.0.28:
  • hot-points_range=100 - rel_native_lto does 1.14X better
  • point-query.pre_range=100 - rel_native_lto does 1.11X better
  • point-query_range=100 - rel_native_lto does 1.11X better
First, the graphs for MyRocks in FB MySQL 5.6.35.
Next the graphs for MyRocks in FB MySQL 8.0.28. These only have results for the rel_lto and rel_native_lto builds to improve readability.


Wednesday, March 1, 2023

Adventures in compiling MySQL: RelWithDebInfo vs Release

I am a non-expert in many build tools -- CMake for MySQL, autoconf for Postgres, scons for MongoDB and Maven for Linkbench. While working to confirm my MySQL builds are OK I used sysbench to compare several of them and was confused by the results. This is part 3 of my adventure - parts 1 and 2 are here and here.

tl;dr

  • RelWithDebInfo uses link time optimization by default
  • Release does not use link time optimization by default
  • Performance is better with link time optimization
  • Link time optimization helped point queries more than range queries or writes
Compiling MySQL from source

The options for CMAKE_BUILD_TYPE with MySQL include RelWithDebInfo and Release. On Ubuntu 22.04 with gcc the interesting differences include:
  • RelWithDebInfo uses -O2, link time optimization (-flto) and -fstack-protector-strong
  • Release uses -O3 but does not use link time optimization by default
In my benchmarks (CPU-bound sysbench) the RelWithDebInfo build gets better throughput because it uses less CPU/query, and the difference was up to 10%. Looking at statistics from perf I saw that insn per cycle was frequently ~10% better for the RelWithDebInfo build, and memory system counters were also better for it.
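
For example, counters like these can be collected with perf while a benchmark step runs. A minimal sketch, assuming mysqld is running; the event list is illustrative:

$ perf stat -e instructions,cycles -p $( pidof mysqld ) -- sleep 60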

My first guess at the root cause was the binary size from -O2 vs -O3, 54M for RelWithDebInfo vs 66M for Release using stripped mysqld binaries. But that was wrong.

Then I remembered that RelWithDebInfo used -flto while Release did not. I ignored that last week, but it turned out to explain the difference.

Why don't Release builds use link time optimization? Yura Sorokin explained this to me. In CMakeLists.txt there is this code that sets WITH_PACKAGE_FLAGS_DEFAULT to ON for RelWithDebInfo and OFF for Release. When that is set to ON then the output from dpkg-buildflags is added to compile and linker command lines (dpkg-buildflags --get $X for X in CPPFLAGS, CFLAGS, CXXFLAGS, LDFLAGS). And on Ubuntu 22.04 I see:

$ dpkg-buildflags --get CPPFLAGS

-Wdate-time -D_FORTIFY_SOURCE=2


$ dpkg-buildflags --get CFLAGS

-g -O2 -ffile-prefix-map=/home/mdcallag/git/mytools/bench/sysbench.lua/r.1tab.1thr.feb23.repro=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security


$ dpkg-buildflags --get CXXFLAGS

-g -O2 -ffile-prefix-map=/home/mdcallag/git/mytools/bench/sysbench.lua/r.1tab.1thr.feb23.repro=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security


$ dpkg-buildflags --get LDFLAGS

-Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro

Thus, I get -flto by default with RelWithDebInfo but not with Release. To enable link time optimization for Release I can do one of:
  • add -DWITH_LTO=ON to the CMake command line (what I do below)
  • add -flto to the compiler and linker flags

Results: graphs

The full spreadsheet is here and has the throughput for builds relative to a RelWithDebInfo build (QPS for build / QPS for RelWithDebInfo). LTO means link time optimization which is enabled by default for RelWithDebInfo, disabled by default for Release and enabled via WITH_LTO=ON. The builds tested are:
  • RelWithDebInfo - CMAKE_BUILD_TYPE=RelWithDebInfo, LTO, -O2
  • Release - CMAKE_BUILD_TYPE=Release, -O3, does not use LTO
  • Release+LTO - CMAKE_BUILD_TYPE=Release, LTO, -O3
  • Release+LTO+O2 - CMAKE_BUILD_TYPE=Release, LTO, -O2
  • Release+LTO+Native - CMAKE_BUILD_TYPE=Release, LTO, -march=native, -mtune=native
The spreadsheet has results for all builds. The graphs only show QPS for Release and Release+LTO relative to RelWithDebInfo to improve readability. From Release+LTO I see that LTO benefits point queries more than range queries or writes, and that is also clear in the summary statistics displayed in the following section. Also, range queries and writes each have two outliers for which LTO had a large benefit; they are visible in the graphs. The graphs show the relative QPS for each build, which is (QPS for the build / QPS for RelWithDebInfo).

Results: summary statistics

This section has summary statistics: the average, median, min and max value for each of the benchmark types (point, range and write) for each of the builds using the relative QPS (QPS for the build / QPS for RelWithDebInfo build). From the Average column it is clear that LTO benefits point queries more than range queries or writes. It is also clear that the use of LTO improves performance.

Release
         Point   Range   Write
Average  0.97    0.95    0.93
Median   0.98    0.96    0.93
Min      0.91    0.91    0.90
Max      1.00    0.99    0.98

Release+LTO
         Point   Range   Write
Average  1.13    1.06    1.13
Median   1.15    1.04    1.07
Min      1.02    0.97    1.02
Max      1.20    1.21    1.45

Release+LTO+O2
         Point   Range   Write
Average  1.09    1.03    1.09
Median   1.10    1.02    1.04
Min      1.01    0.97    0.99
Max      1.14    1.16    1.33

Release+LTO+Native
         Point   Range   Write
Average  1.13    1.04    1.14
Median   1.15    1.03    1.07
Min      1.04    0.93    1.03
Max      1.19    1.20    1.45

CMake command lines

RelWithDebInfo
cmake .. \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
  -DMYSQL_MAINTAINER_MODE=0 \
  -DENABLED_LOCAL_INFILE=1 \
  -DCMAKE_INSTALL_PREFIX=$1 \
  -DWITH_BOOST=$PWD/../boost \
  -DWITH_NUMA=ON \
  -DWITH_ROUTER=OFF \
  -DWITH_MYSQLX=OFF \
  -DWITH_UNIT_TESTS=OFF
Release
BF=" -g1 "
CF=" $BF "
CXXF=" $BF "

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
  -DMYSQL_MAINTAINER_MODE=0 \
  -DENABLED_LOCAL_INFILE=1 \
  -DCMAKE_INSTALL_PREFIX=$1 \
  -DWITH_BOOST=$PWD/../boost \
  -DCMAKE_CXX_FLAGS="$CXXF" -DCMAKE_C_FLAGS="$CF" \
  -DWITH_NUMA=ON \
  -DWITH_ROUTER=OFF \
  -DWITH_MYSQLX=OFF \
  -DWITH_UNIT_TESTS=OFF
Release+LTO
BF=" -g1 "
CF=" $BF "
CXXF=" $BF "

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
  -DMYSQL_MAINTAINER_MODE=0 \
  -DENABLED_LOCAL_INFILE=1 \
  -DCMAKE_INSTALL_PREFIX=$1 \
  -DWITH_BOOST=$PWD/../boost \
  -DCMAKE_CXX_FLAGS="$CXXF" -DCMAKE_C_FLAGS="$CF" \
  -DWITH_LTO=ON \
  -DWITH_NUMA=ON \
  -DWITH_ROUTER=OFF -DWITH_MYSQLX=OFF -DWITH_UNIT_TESTS=OFF
Release+LTO+O2
BF=" -g1 "
CF=" $BF "
CXXF=" $BF "

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
  -DMYSQL_MAINTAINER_MODE=0 \
  -DENABLED_LOCAL_INFILE=1 \
  -DCMAKE_INSTALL_PREFIX=$1 \
  -DWITH_BOOST=$PWD/../boost \
  -DCMAKE_CXX_FLAGS="$CXXF" -DCMAKE_C_FLAGS="$CF" \
  -DCMAKE_C_FLAGS_RELEASE="-O2 -DNDEBUG" \
  -DCMAKE_CXX_FLAGS_RELEASE="-O2 -DNDEBUG" \
  -DWITH_LTO=ON \
  -DWITH_NUMA=ON \
  -DWITH_ROUTER=OFF -DWITH_MYSQLX=OFF -DWITH_UNIT_TESTS=OFF
Release+LTO+Native
BF=" -march=native -mtune=native -g1 "
CF=" $BF "
CXXF=" $BF "

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
  -DMYSQL_MAINTAINER_MODE=0 \
  -DENABLED_LOCAL_INFILE=1 \
  -DCMAKE_INSTALL_PREFIX=$1 \
  -DWITH_BOOST=$PWD/../boost \
  -DCMAKE_CXX_FLAGS="$CXXF" -DCMAKE_C_FLAGS="$CF" \
  -DWITH_LTO=ON \
  -DWITH_NUMA=ON \
  -DWITH_ROUTER=OFF -DWITH_MYSQLX=OFF -DWITH_UNIT_TESTS=OFF
Compiler command lines

This section lists the interesting diffs in the compiler command lines for mysqld.cc.
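
One way to capture such command lines (a sketch of the method, assuming a make-based build; the build directory names are borrowed from the -ffile-prefix-map values below): run make with VERBOSE=1 in each build directory, save the output, keep the lines that compile mysqld.cc, then diff the two results.

$ make VERBOSE=1 > build.log 2>&1
$ grep mysqld.cc build.log > cmds.txt
$ diff build.rel_withdbg/cmds.txt build.rel/cmds.txt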

RelWithDebInfo vs Release
< -D_FORTIFY_SOURCE=2
< -ffat-lto-objects
< -ffile-prefix-map=/foobar/build.rel_withdbg=.
< -flto=auto
< -fstack-protector-strong
< -g

< -O2
---
> -O3
RelWithDebInfo vs Release+LTO
< -D_FORTIFY_SOURCE=2
< -ffat-lto-objects
< -ffile-prefix-map=/foobar/build.rel_withdbg=.
< -fstack-protector-strong
< -g

< -flto=auto
---
> -flto


< -O2
---
> -O3
RelWithDebInfo vs Release+LTO+O2
< -D_FORTIFY_SOURCE=2
< -ffat-lto-objects
< -ffile-prefix-map=/foobar/build.rel_withdbg=.
< -fstack-protector-strong
< -g

< -flto=auto
---
> -flto
RelWithDebInfo vs Release+LTO+Native
< -D_FORTIFY_SOURCE=2
< -ffat-lto-objects
< -ffile-prefix-map=/foobar/build.rel_withdbg=.
< -fstack-protector-strong
< -g
---
> -march=native
> -mtune=native

< -flto=auto
---
> -flto

< -O2
---
> -O3