This post addresses some of the feedback I received from my previous post on the impact of the malloc library when using RocksDB and MyRocks. Here I test:
- MALLOC_ARENA_MAX with glibc malloc
- see here for more background on MALLOC_ARENA_MAX. By default glibc can use too many arenas for some workloads (8 X number_of_CPU_cores, which allows up to 384 arenas on the 48-core server used here), so I tested it with 1, 8, 48 and 96 arenas.
- compiling RocksDB and MyRocks with jemalloc specific code enabled
- In my previous results I just set malloc-lib in my.cnf, which uses LD_PRELOAD to load your favorite malloc library implementation.
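For reference, a minimal my.cnf sketch for the malloc-lib approach; the library path here is an assumption, use whatever your distro installs:
[mysqld_safe]
malloc-lib=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2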
tl;dr: jemalloc
- For mysqld with jemalloc enabled via malloc-lib (LD_PRELOAD) versus mysqld with jemalloc specific code enabled
- performance, VSZ and RSS were similar
- After setting rocksdb_cache_dump=0 in my.cnf for the binary with jemalloc specific code enabled
- performance is slightly better (excluding the outlier, the benefit is up to 3%)
- peak VSZ is cut in half
- peak RSS is reduced by ~5%
tl;dr: glibc malloc on a 48-core server
- With 1 arena performance is lousy but the RSS bloat is mostly solved
- With 8, 48 or 96 arenas the RSS bloat is still there
- With 48 arenas there are still significant (5% to 10%) performance drops
- With 96 arenas the performance drop was mostly ~2%
Building MyRocks with jemalloc support
This was harder than I expected. The first step was easy -- I added these two options to the CMake command line; the first is for MyRocks and the second is for RocksDB. When the first is set then HAVE_JEMALLOC is defined in config.h. When the second is set then ROCKSDB_JEMALLOC is defined on the compiler command line.
-DHAVE_JEMALLOC=1 -DWITH_JEMALLOC=1
The hard part was that there were linker errors for unresolved symbols -- the open-source build was broken. The fix that worked for me is here. I removed libunwind.so and added libjemalloc.so in its place.
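For reference, a sketch of the build; only the two jemalloc flags come from the steps above, everything else (build type, paths) is an assumption:
# from a build directory under the source tree
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DHAVE_JEMALLOC=1 -DWITH_JEMALLOC=1
make -j $(nproc)
# if the link step fails with unresolved jemalloc symbols, apply the fix above:
# remove libunwind.so from the link command and use libjemalloc.so in its place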
Running mysqld with MALLOC_ARENA_MAX
I wasn't sure if it was sufficient for me to set an environment variable when invoking mysqld_safe, so I just edited the mysqld_safe script to do that for me:
182a183,184
> cmd="MALLOC_ARENA_MAX=1 $cmd"
> echo Run :: $cmd
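To confirm the variable reached mysqld, the environment of the running process can be checked. A sketch, assuming a single mysqld process:
tr '\0' '\n' < /proc/$(pidof mysqld)/environ | grep MALLOC_ARENA_MAX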
Results: jemalloc
The jemalloc specific code in MyRocks and RocksDB is useful but most of it is not there to boost performance. The jemalloc specific code most likely to boost performance is here in MyRocks and is enabled when rocksdb_cache_dump=0 is added to my.cnf.
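A minimal my.cnf fragment for that setup; only the option name and value come from the text above:
[mysqld]
rocksdb_cache_dump=0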
Results are here for 3 setups:
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128
- This is the base case in the table below
- this is what I used in my previous post and jemalloc is enabled via setting malloc-lib in my.cnf, which uses LD_PRELOAD
- fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
- This is col-1 in the table below
- MySQL with jemalloc specific code enabled at compile time
- fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128
- This is col-2 in the table below
- MySQL with jemalloc specific code enabled at compile time and rocksdb_cache_dump=0 added to my.cnf
These results use the relative QPS, which is the following:
(QPS for the binary in a given column) / (QPS for the base case)
When this value is larger than 1.0 then QPS is better than the base case.
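For example, the 0.92 for hot-points in col-1 below means that binary gets about 92% of the base case QPS (an 8% drop), while the 1.40 in col-2 means it is about 40% faster than the base case.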
From the results below:
- results in col-1 are similar to the base case. So compiling in the jemalloc specific code didn't help performance.
- results in col-2 are slightly better than the base case with one outlier (hot-points). So consider setting rocksdb_cache_dump=0 in my.cnf after compiling in jemalloc specific code.
Relative to: fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128
col-1 : fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
col-2 : fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128
col-1 col-2
0.92 1.40 hot-points_range=100
1.00 1.01 point-query_range=100
1.01 1.02 points-covered-pk_range=100
0.94 1.03 points-covered-si_range=100
1.01 1.02 points-notcovered-pk_range=100
0.98 1.02 points-notcovered-si_range=100
1.01 1.03 random-points_range=1000
1.01 1.02 random-points_range=100
0.99 1.00 random-points_range=10
0.98 1.00 range-covered-pk_range=100
0.96 0.97 range-covered-si_range=100
0.98 0.98 range-notcovered-pk_range=100
1.00 1.02 range-notcovered-si_range=100
0.98 1.00 read-only-count_range=1000
1.01 1.01 read-only-distinct_range=1000
0.99 0.99 read-only-order_range=1000
1.00 1.00 read-only_range=10000
0.99 0.99 read-only_range=100
0.99 1.00 read-only_range=10
0.98 0.99 read-only-simple_range=1000
0.99 0.99 read-only-sum_range=1000
0.98 0.98 scan_range=100
1.01 1.02 delete_range=100
1.01 1.03 insert_range=100
0.99 1.01 read-write_range=100
1.00 1.01 read-write_range=10
1.00 1.02 update-index_range=100
1.02 1.02 update-inlist_range=100
1.01 1.03 update-nonindex_range=100
0.99 1.01 update-one_range=100
1.01 1.03 update-zipf_range=100
1.00 1.01 write-only_range=10000
The impact on VSZ and RSS is interesting. The table below shows the peak values for VSZ and RSS from mysqld during the benchmark, both in GB (one way to sample these is sketched after the table). The last column is the ratio (peak RSS / buffer pool size). To save space I use abbreviated names for the binaries.
- jemalloc.1
- base case, fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128
- jemalloc.2
- col-1 above, fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
- This has little impact on VSZ and RSS
- jemalloc.3
- col-2 above, fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128
- This cuts peak VSZ in half and reduces peak RSS by ~5%
Peak values for MyRocks with 10G buffer pool
alloc        VSZ (GB)   RSS (GB)   RSS/10G
jemalloc.1   45.6       12.2       1.22
jemalloc.2   46.0       12.5       1.25
jemalloc.3   20.2       11.6       1.16
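For anyone repeating this, a sketch of one way to sample peak VSZ and RSS for mysqld during a benchmark; this is an assumption about tooling, not necessarily what produced the numbers above, and ps reports KB while the tables here use GB:
# sample once per second while the benchmark runs (assumes one mysqld process)
while true; do ps -o vsz=,rss= -p "$(pidof mysqld)"; sleep 1; done > vsz_rss.log
# peak VSZ (KB):
sort -n -k1 vsz_rss.log | tail -1 | awk '{print $1}'
# peak RSS (KB):
sort -n -k2 vsz_rss.log | tail -1 | awk '{print $2}'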
Results: MALLOC_ARENA_MAX
The binaries tested are:
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_c32r128
- base case in the table below
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_1arena_c32r128
- col-1 in the table below
- uses MALLOC_ARENA_MAX=1
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_8arena_c32r128
- col-2 in the table below
- uses MALLOC_ARENA_MAX=8
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_48arena_c32r128
- col-3 in the table below
- uses MALLOC_ARENA_MAX=48
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_96arena_c32r128
- col-4 in the table below
- uses MALLOC_ARENA_MAX=96
These results use the relative QPS, which is the following:
(QPS with MALLOC_ARENA_MAX set) / (QPS with default glibc malloc)
When this value is larger than 1.0 then QPS is better than the base case.
From the results below:
- performance with 1 or 8 arenas is lousy
- performance drops some (often 5% to 10%) with 48 arenas
- performance drops ~2% with 96 arenas
Relative to: fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_c32r128
col-1 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_1arena_c32r128
col-2 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_8arena_c32r128
col-3 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_48arena_c32r128
col-4 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_96arena_c32r128
col-1 col-2 col-3 col-4
0.89 0.78 0.72 0.78 hot-points_range=100
0.23 0.61 0.96 0.98 point-query_range=100
0.31 0.86 0.96 1.01 points-covered-pk_range=100
0.24 0.87 0.95 1.01 points-covered-si_range=100
0.31 0.86 0.97 1.01 points-notcovered-pk_range=100
0.20 0.86 0.97 1.00 points-notcovered-si_range=100
0.35 0.79 0.96 1.01 random-points_range=1000
0.30 0.87 0.96 1.01 random-points_range=100
0.23 0.67 0.96 0.99 random-points_range=10
0.06 0.48 0.92 0.96 range-covered-pk_range=100
0.14 0.52 0.97 0.99 range-covered-si_range=100
0.13 0.46 0.91 0.97 range-notcovered-pk_range=100
0.23 0.87 0.96 1.01 range-notcovered-si_range=100
0.23 0.76 0.97 0.99 read-only-count_range=1000
0.56 1.00 0.96 0.97 read-only-distinct_range=1000
0.20 0.47 0.90 0.94 read-only-order_range=1000
0.68 1.04 1.00 1.00 read-only_range=10000
0.21 0.76 0.98 0.99 read-only_range=100
0.19 0.70 0.97 0.99 read-only_range=10
0.21 0.58 0.94 0.98 read-only-simple_range=1000
0.19 0.57 0.95 1.00 read-only-sum_range=1000
0.53 0.98 1.00 1.01 scan_range=100
0.30 0.81 0.98 1.00 delete_range=100
0.50 0.92 1.00 1.00 insert_range=100
0.23 0.72 0.97 0.98 read-write_range=100
0.20 0.67 0.96 0.98 read-write_range=10
0.33 0.88 0.99 1.00 update-index_range=100
0.36 0.76 0.94 0.98 update-inlist_range=100
0.30 0.85 0.98 0.99 update-nonindex_range=100
0.86 0.98 1.00 1.01 update-one_range=100
0.32 0.86 0.98 0.98 update-zipf_range=100
0.27 0.80 0.97 0.98 write-only_range=10000
The impact on VSZ and RSS is interesting. The table below shows the peak values for VSZ and RSS from mysqld during the benchmark, both in GB. The last column is the ratio (peak RSS / buffer pool size). To save space I use abbreviated names for the binaries.
Using 1 arena prevents RSS bloat but comes at a huge cost in performance. If I had more time I would have tested for 2, 4 and 6 arenas but I don't think glibc malloc + RocksDB are meant to be.
Peak values for MyRocks with 10G buffer pool
alloc        VSZ (GB)   RSS (GB)   RSS/10G
default      46.1       36.2       3.62
arena = 1    15.9       14.1       1.41
arena = 8    32.6       27.7       2.77
arena = 48   35.2       29.2       2.92
arena = 96   39.3       32.5       3.25