
Saturday, April 19, 2025

Battle of the Mallocators: part 2

This post addresses some of the feedback I received from my previous post on the impact of the malloc library when using RocksDB and MyRocks. Here I test:

  • MALLOC_ARENA_MAX with glibc malloc
    • see here for more background on MALLOC_ARENA_MAX. By default glibc can create too many arenas for some workloads (up to 8 x number_of_CPU_cores), so I tested it with 1, 8, 48 and 96 arenas.
  • compiling RocksDB and MyRocks with jemalloc specific code enabled
    • In my previous results I just set malloc-lib in my.cnf, which uses LD_LIBRARY_PATH so that mysqld runs with your preferred malloc library (see the my.cnf sketch below).
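For reference, the malloc-lib approach looks like this in my.cnf. The library path below is the one Ubuntu 22.04 provides and may differ on other systems:

  [mysqld_safe]
  malloc-lib=/usr/lib/x86_64-linux-gnu/libjemalloc.so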
tl;dr: jemalloc
  • For mysqld with jemalloc enabled via malloc-lib (LD_LIBRARY_PATH) versus mysqld with jemalloc specific code enabled
    • performance, VSZ and RSS were similar
  • After setting rocksdb_cache_dump=0 in the binary with jemalloc specific code
    • performance is slightly better (excluding the outlier, the benefit is up to 3%)
    • peak VSZ is cut in half
    • peak RSS is reduced by ~9%
tl;dr: glibc malloc on a 48-core server
  • With 1 arena performance is lousy but the RSS bloat is mostly solved
  • With 8, 48 or 96 arenas the RSS bloat is still there
  • With 48 arenas there are still significant (5% to 10%) performance drops
  • With 96 arenas the performance drop was mostly ~2%
Building MyRocks with jemalloc support

This was harder than I expected. The first step was easy -- I added these two options to the CMake command line; the first is for MyRocks and the second is for RocksDB. When the first is set, HAVE_JEMALLOC is defined in config.h. When the second is set, ROCKSDB_JEMALLOC is defined on the compiler command line.

  -DHAVE_JEMALLOC=1
  -DWITH_JEMALLOC=1
The hard part is that there were linker errors for unresolved symbols -- the open-source build was broken. The fix that worked for me is here. I removed libunwind.so and added libjemalloc.so in its place.
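Putting it together, here is a hedged sketch of the CMake invocation -- only the two jemalloc options are from this post, the rest is a generic example and not my exact command line:

  # run from a build directory under the FB MySQL source tree
  cmake .. -DCMAKE_BUILD_TYPE=Release \
    -DHAVE_JEMALLOC=1 \
    -DWITH_JEMALLOC=1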

Running mysqld with MALLOC_ARENA_MAX

I wasn't sure if it was sufficient for me to set an environment variable when invoking mysqld_safe, so I just edited the mysqld_safe script to do that for me:

182a183,184
>   cmd="MALLOC_ARENA_MAX=1 $cmd"
>   echo Run :: $cmd
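The alternative I didn't verify would be to export the variable in the shell that starts mysqld_safe, something like this (the defaults-file path is just an example):

  export MALLOC_ARENA_MAX=1
  bin/mysqld_safe --defaults-file=/path/to/my.cnf &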

Results: jemalloc

The jemalloc specific code in MyRocks and RocksDB is useful but most of it is not there to boost performance. The jemalloc specific code most likely to boost performance is here in MyRocks and is enabled when rocksdb_cache_dump=0 is added to my.cnf.
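For reference, enabling that code path is one line in my.cnf, and it only matters when the binary was compiled with the jemalloc specific code:

  [mysqld]
  rocksdb_cache_dump=0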

Results are here for 3 setups:
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128
    • This is the base case in the table below
    • this is what I used in my previous post and jemalloc is enabled via setting malloc-lib in my.cnf which uses LD_LIBRARY_PATH
  • fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
    • This is col-1 in the table below
    • MySQL with jemalloc specific code enabled at compile time
  • fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128
    • This is col-2 in the table below
    • MySQL with jemalloc specific code enabled at compile time and rocksdb_cache_dump=0 added to my.cnf
These results use the relative QPS, which is the following, where the base case is jemalloc enabled via malloc-lib. When this value is larger than 1.0 then QPS is larger than it is for the base case.
(QPS for my binary) / (QPS for the base case)
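This is just per-microbenchmark division. A minimal sketch of computing it, assuming result files with one "microbenchmark QPS" pair per line (the file names are hypothetical):

  # base.txt has QPS for the base case, new.txt for the binary being compared
  awk 'NR==FNR {base[$1]=$2; next} {printf "%.2f\t%s\n", $2/base[$1], $1}' base.txt new.txt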
From the results below:
  • results in col-1 are similar to the base case. So compiling in the jemalloc specific code didn't help performance.
  • results in col-2 are slightly better than the base case with one outlier (hot-points). So consider setting rocksdb_cache_dump=0 in my.cnf after compiling in jemalloc specific code.
Relative to: fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128

col-1 : fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
col-2 : fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128

col-1   col-2
0.92    1.40    hot-points_range=100
1.00    1.01    point-query_range=100
1.01    1.02    points-covered-pk_range=100
0.94    1.03    points-covered-si_range=100
1.01    1.02    points-notcovered-pk_range=100
0.98    1.02    points-notcovered-si_range=100
1.01    1.03    random-points_range=1000
1.01    1.02    random-points_range=100
0.99    1.00    random-points_range=10
0.98    1.00    range-covered-pk_range=100
0.96    0.97    range-covered-si_range=100
0.98    0.98    range-notcovered-pk_range=100
1.00    1.02    range-notcovered-si_range=100
0.98    1.00    read-only-count_range=1000
1.01    1.01    read-only-distinct_range=1000
0.99    0.99    read-only-order_range=1000
1.00    1.00    read-only_range=10000
0.99    0.99    read-only_range=100
0.99    1.00    read-only_range=10
0.98    0.99    read-only-simple_range=1000
0.99    0.99    read-only-sum_range=1000
0.98    0.98    scan_range=100
1.01    1.02    delete_range=100
1.01    1.03    insert_range=100
0.99    1.01    read-write_range=100
1.00    1.01    read-write_range=10
1.00    1.02    update-index_range=100
1.02    1.02    update-inlist_range=100
1.01    1.03    update-nonindex_range=100
0.99    1.01    update-one_range=100
1.01    1.03    update-zipf_range=100
1.00    1.01    write-only_range=10000

The impact on VSZ and RSS is interesting. The table below shows the peak values for VSZ and RSS from mysqld during the benchmark. The last column is the ratio (peak RSS / buffer pool size). To save space I use abbreviated names for the binaries.
  • jemalloc.1
    • base case, fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128
  • jemalloc.2
    • col-1 above, fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
    • This has little impact on VSZ and RSS
  • jemalloc.3
    • col-2 above, fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128
    • This cuts peak VSZ in half and reduces peak RSS by 9%
Peak values for MyRocks with 10G buffer pool (VSZ and RSS in GB)
alloc           VSZ     RSS     RSS/10
jemalloc.1      45.6    12.2    1.22
jemalloc.2      46.0    12.5    1.25
jemalloc.3      20.2    11.6    1.16
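For context, the peak values are the largest VSZ and RSS observed for mysqld while the benchmark runs. A hedged sketch of one way to sample them (not necessarily how I collected them):

  # print VSZ and RSS in KB for mysqld once per minute; the peak is the max across samples
  while true; do ps -o vsz=,rss= -C mysqld; sleep 60; done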

Results: MALLOC_ARENA_MAX

The binaries tested are:
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_c32r128
    • base case in the table below
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_1arena_c32r128
    • col-1 in the table below
    • uses MALLOC_ARENA_MAX=1
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_8arena_c32r128
    • col-2 in the table below
    • uses MALLOC_ARENA_MAX=8
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_48arena_c32r128
    • col-3 in the table below
    • uses MALLOC_ARENA_MAX=48
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_96arena_c32r128
    • col-4 in the table below
    • uses MALLOC_ARENA_MAX=96
These results use the relative QPS, which is the following, where the base case is glibc malloc with the default number of arenas. When this value is larger than 1.0 then QPS is larger than with default glibc malloc.
(QPS with MALLOC_ARENA_MAX set) / (QPS with default glibc malloc)
From the results below:
  • performance with 1 or 8 arenas is lousy
  • performance drops some (often 5% to 10%) with 48 arenas
  • performance drops ~2% with 96 arenas
Relative to: fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_c32r128

col-1 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_1arena_c32r128
col-2 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_8arena_c32r128
col-3 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_48arena_c32r128
col-4 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_96arena_c32r128

col-1   col-2   col-3   col-4
0.89    0.78    0.72    0.78    hot-points_range=100
0.23    0.61    0.96    0.98    point-query_range=100
0.31    0.86    0.96    1.01    points-covered-pk_range=100
0.24    0.87    0.95    1.01    points-covered-si_range=100
0.31    0.86    0.97    1.01    points-notcovered-pk_range=100
0.20    0.86    0.97    1.00    points-notcovered-si_range=100
0.35    0.79    0.96    1.01    random-points_range=1000
0.30    0.87    0.96    1.01    random-points_range=100
0.23    0.67    0.96    0.99    random-points_range=10
0.06    0.48    0.92    0.96    range-covered-pk_range=100
0.14    0.52    0.97    0.99    range-covered-si_range=100
0.13    0.46    0.91    0.97    range-notcovered-pk_range=100
0.23    0.87    0.96    1.01    range-notcovered-si_range=100
0.23    0.76    0.97    0.99    read-only-count_range=1000
0.56    1.00    0.96    0.97    read-only-distinct_range=1000
0.20    0.47    0.90    0.94    read-only-order_range=1000
0.68    1.04    1.00    1.00    read-only_range=10000
0.21    0.76    0.98    0.99    read-only_range=100
0.19    0.70    0.97    0.99    read-only_range=10
0.21    0.58    0.94    0.98    read-only-simple_range=1000
0.19    0.57    0.95    1.00    read-only-sum_range=1000
0.53    0.98    1.00    1.01    scan_range=100
0.30    0.81    0.98    1.00    delete_range=100
0.50    0.92    1.00    1.00    insert_range=100
0.23    0.72    0.97    0.98    read-write_range=100
0.20    0.67    0.96    0.98    read-write_range=10
0.33    0.88    0.99    1.00    update-index_range=100
0.36    0.76    0.94    0.98    update-inlist_range=100
0.30    0.85    0.98    0.99    update-nonindex_range=100
0.86    0.98    1.00    1.01    update-one_range=100
0.32    0.86    0.98    0.98    update-zipf_range=100
0.27    0.80    0.97    0.98    write-only_range=10000

The impact on VSZ and RSS is interesting. The table below shows the peak values for VSZ and RSS from mysqld during the benchmark. The last column is the ratio (peak RSS / buffer pool size). To save space I use abbreviated names for the binaries.

Using 1 arena prevents RSS bloat but comes at a huge cost in performance. If I had more time I would have tested 2, 4 and 6 arenas, but I don't think glibc malloc and RocksDB are meant for each other.

Peak values for MyRocks with 10G buffer pool (VSZ and RSS in GB)
alloc           VSZ     RSS     RSS/10
default         46.1    36.2    3.62
arena = 1       15.9    14.1    1.41
arena = 8       32.6    27.7    2.77
arena = 48      35.2    29.2    2.92
arena = 96      39.3    32.5    3.25


Friday, April 11, 2025

Battle of the Mallocators

If you use RocksDB and want to avoid OOM then use jemalloc or tcmalloc and avoid glibc malloc. That was true in 2015 and remains true in 2025 (see here). The problem is that RocksDB can be an allocator stress test because it does an allocation (calls malloc) when a block is read from storage and then does a deallocation (calls free) on eviction. These allocations have very different lifetimes as some blocks remain cached for a long time and that leads to much larger RSS than expected when using glibc malloc. Fortunately, jemalloc and tcmalloc are better at tolerating that allocation pattern without making RSS too large.

I have yet to notice a similar problem with InnoDB, in part because it does a few large allocations at process start for the InnoDB buffer pool and it doesn't do malloc/free per block read from storage.

There was a recent claim from a MySQL performance expert, Dimitri Kravtchuk, that either RSS or VSZ can grow too large with InnoDB and jemalloc. I don't know all of the details of his setup and I failed to reproduce his result on mine. To be fair, I show here that VSZ for InnoDB + jemalloc can be larger than you might expect, but that isn't a problem -- it is just an artifact of jemalloc that can be confusing. But RSS for jemalloc with InnoDB is similar to what I get from tcmalloc.

tl;dr

  • For glibc malloc with MyRocks I get OOM on a server with 128G of RAM when the RocksDB buffer pool size is 50G. I might have been able to avoid OOM by using between 30G and 40G for the buffer pool. On that host I normally use jemalloc with MyRocks and a 100G buffer pool.
  • With respect to peak RSS
    • For InnoDB the peak RSS with all allocators is similar and peak RSS is ~1.06X larger than the InnoDB buffer pool.
    • For MyRocks the peak RSS is smallest with jemalloc, slightly larger with tcmalloc and much too large with glibc malloc. For (jemalloc, tcmalloc, glibc malloc) it was (1.22, 1.31, 3.62) times larger than the 10G MyRocks buffer pool. I suspect those ratios would be smaller for jemalloc and tcmalloc had I used an 80G buffer pool.
  • For performance, QPS with jemalloc and tcmalloc is slightly better than with glibc malloc
    • For InnoDB: [jemalloc, tcmalloc] get [2.5%, 3.5%] more QPS than glibc malloc
    • For MyRocks: [jemalloc, tcmalloc] get [5.1%, 3.0%] more QPS than glibc malloc

Prior art

I have several blog posts on using jemalloc with MyRocks.

  • October 2015 - MyRocks with glibc malloc, jemalloc and tcmalloc
  • April 2017 - Performance for large, concurrent allocations
  • April 2018 - RSS for MyRocks with jemalloc vs glibc malloc
  • August 2023 - RocksDB and glibc malloc
  • September 2023 - A regression in jemalloc 4.4.0 and 4.5.0 (too-large RSS) 
  • September 2023 - More on the regression in jemalloc 4.4.0 and 4.5.0
  • October 2023 - Even more on the regression in jemalloc 4.4.0 and 4.5.0

Builds, configuration and hardware

I compiled upstream MySQL 8.0.40 from source for InnoDB. I also compiled FB MySQL 8.0.32 from source for MyRocks. For FB MySQL I used source as of October 23, 2024 at git hash ba9709c9 with RocksDB 9.7.1.

The server is an ax162-s from Hetzner with 48 cores (AMD EPYC 9454P), 128G RAM and AMD SMT disabled. It uses Ubuntu 22.04 and storage is ext4 with SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.

For malloc the server uses:
  • glibc
    • version 2.35-0ubuntu3.9
  • tcmalloc
    • provided by libgoogle-perftools-dev and apt-cache show claims this is version 2.9.1
    • enabled by malloc-lib=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so in my.cnf
  • jemalloc
    • provided by libjemalloc-dev and apt-cache show claims this is version 5.2.1-4ubuntu1
    • enabled by malloc-lib=/usr/lib/x86_64-linux-gnu/libjemalloc.so in my.cnf
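To confirm which allocator mysqld actually loaded, a quick check like this works (a hedged example, library names as above):

  grep -E 'jemalloc|tcmalloc' /proc/$(pidof mysqld)/maps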

The configuration files are here for InnoDB and for MyRocks. For InnoDB I used an 80G buffer pool. I tried to use a 50G buffer pool for MyRocks but with glibc malloc there was OOM, so I repeated all tests with a 10G buffer pool. I might have been able to avoid OOM with MyRocks and glibc malloc by using between 30G and 40G for MyRocks -- but I didn't want to spend more time figuring that out when the real answer is to use jemalloc or tcmalloc.

Benchmark

I used sysbench and my usage is explained here. To save time I only run 27 of the 42 microbenchmarks and most test only 1 type of SQL statement.

The tests run with 16 tables and 50M rows/table. There are 256 client threads and each microbenchmark runs for 1200 seconds. Normally I don't run with (client threads / cores) >> 1 but I do so here to create more stress and to copy what I think Dimitri had done.

Normally when I run sysbench I configure it so that the test tables fit in the buffer pool (block cache) but I don't do that here because I want MyRocks to do IO, as the allocations done per storage read create much drama for the allocator.

The command line to run all tests is: bash r.sh 16 50000000 1200 1200 md2 1 0 256

Peak VSZ and RSS

The tables below show the peak values for VSZ and RSS from mysqld during the benchmark. The last column is the ratio (peak RSS / buffer pool size). I am not sure it is fair to compare these ratios between InnoDB and MyRocks from this work because the buffer pool size is so much larger for InnoDB. Regardless, RSS is more than 3X larger than the MyRocks buffer pool size with glibc malloc and that is a problem.

Peak values for InnoDB with 80G buffer pool (VSZ and RSS in GB)
alloc           VSZ     RSS     RSS/80
glibc           88.2    86.5    1.08
tcmalloc        88.1    85.3    1.06
jemalloc        91.5    87.0    1.08

Peak values for MyRocks with 10G buffer pool (VSZ and RSS in GB)
alloc           VSZ     RSS     RSS/10
glibc           46.1    36.2    3.62
tcmalloc        15.3    13.1    1.31
jemalloc        45.6    12.2    1.22

Performance: InnoDB

From the results here, QPS is mostly similar between tcmalloc and jemalloc but there are a few microbenchmarks where tcmalloc is much better than jemalloc and those are highlighted.

The results for read-only_range=10000 are an outlier (tcmalloc much faster than jemalloc) and from vmstat metrics here I see that CPU/operation (cpu/o) and context switches/operation (cs/o) are much larger for jemalloc than for tcmalloc.

These results use the relative QPS, which is the following where $allocator is tcmalloc or jemalloc. When this value is larger than 1.0 then QPS is larger with tcmalloc or jemalloc.
(QPS with $allocator) / (QPS with glibc malloc)
Relative to results with glibc malloc
col-1 : results with tcmalloc
col-2 : results with jemalloc

col-1 col-2
0.99 1.02 hot-points_range=100
1.05 1.04 point-query_range=100
0.96 0.99 points-covered-pk_range=100
0.98 0.99 points-covered-si_range=100
0.96 0.99 points-notcovered-pk_range=100
0.97 0.98 points-notcovered-si_range=100
0.97 1.00 random-points_range=1000
0.95 0.99 random-points_range=100
0.99 0.99 random-points_range=10
1.04 1.03 range-covered-pk_range=100
1.05 1.07 range-covered-si_range=100
1.04 1.03 range-notcovered-pk_range=100
0.98 1.00 range-notcovered-si_range=100
1.02 1.02 read-only-count_range=1000
1.05 1.07 read-only-distinct_range=1000
1.07 1.12 read-only-order_range=1000
1.28 1.09 read-only_range=10000
1.03 1.05 read-only_range=100
1.05 1.08 read-only_range=10
1.08 1.07 read-only-simple_range=1000
1.04 1.03 read-only-sum_range=1000
1.02 1.02 scan_range=100
1.01 1.00 delete_range=100
1.03 1.01 insert_range=100
1.02 1.02 read-write_range=100
1.03 1.03 read-write_range=10
1.01 1.02 update-index_range=100
1.15 0.98 update-inlist_range=100
1.06 0.99 update-nonindex_range=100
1.03 1.03 update-one_range=100
1.02 1.01 update-zipf_range=100
1.18 1.05 write-only_range=10000

Performance: MyRocks

From the results here, QPS is mostly similar between tcmalloc and jemalloc with a slight advantage for jemalloc but there are a few microbenchmarks where jemalloc is much better than tcmalloc and those are highlighted.

The results for hot-points below are odd (jemalloc is a lot faster than tcmalloc) and from vmstat metrics here I see that CPU/operation (cpu/o) and context switches/operation (cs/o) are both much larger for tcmalloc.

These results use the relative QPS, which is the following where $allocator is tcmalloc or jemalloc. When this value is larger than 1.0 then QPS is larger with tcmalloc or jemalloc.
(QPS with $allocator) / (QPS with glibc malloc)
Relative to results with glibc malloc
col-1 : results with tcmalloc
col-2 : results with jemalloc

col-1 col-2
0.68 1.00 hot-points_range=100
1.04 1.04 point-query_range=100
1.09 1.09 points-covered-pk_range=100
1.00 1.09 points-covered-si_range=100
1.09 1.09 points-notcovered-pk_range=100
1.10 1.12 points-notcovered-si_range=100
1.08 1.08 random-points_range=1000
1.09 1.09 random-points_range=100
1.05 1.10 random-points_range=10
0.99 1.07 range-covered-pk_range=100
1.01 1.03 range-covered-si_range=100
1.05 1.09 range-notcovered-pk_range=100
1.10 1.09 range-notcovered-si_range=100
1.07 1.05 read-only-count_range=1000
1.00 1.00 read-only-distinct_range=1000
0.98 1.04 read-only-order_range=1000
1.03 1.03 read-only_range=10000
0.96 1.03 read-only_range=100
1.02 1.04 read-only_range=10
0.98 1.07 read-only-simple_range=1000
1.07 1.09 read-only-sum_range=1000
1.02 1.02 scan_range=100
1.05 1.03 delete_range=100
1.11 1.07 insert_range=100
0.96 0.97 read-write_range=100
0.94 0.95 read-write_range=10
1.08 1.04 update-index_range=100
1.08 1.07 update-inlist_range=100
1.09 1.04 update-nonindex_range=100
1.04 1.04 update-one_range=100
1.07 1.04 update-zipf_range=100
1.03 1.02 write-only_range=10000

Monday, September 25, 2023

Variance in peak RSS with jemalloc 5.2.1

Peak RSS for jemalloc 5.2.1 varies a lot with the Insert Benchmark and MyRocks. The variance is a function of how you build and configure jemalloc. The worst case (largest peak RSS) is the jemalloc 5.2.1 provided by Ubuntu 22.04 and I have yet to figure out how to reproduce that result using jemalloc 5.2.1 compiled from source.

I previously shared results to show that jemalloc and tcmalloc are better than glibc malloc for RocksDB. That was followed by a post that shows the peak RSS with different jemalloc versions. This post has additional results for jemalloc 5.2.1 using different jemalloc config options. 

tl;dr

  • Peak RSS has large spikes with jemalloc 4.4, 4.5 and somewhat in 5.0 and 5.1. Tobin Baker suggested these might be from changes to the usage of MADV_FREE and MADV_DONTNEED. These start to show during the l.i1 benchmark step and then are obvious during q100, q500 and q1000.
  • For tests that use Hyper Clock cache there is a large peak RSS with Ubuntu-provided jemalloc 5.2.1 that is obvious during the l.x and l.i1 benchmark steps. I can't reproduce this using jemalloc 5.2.1 compiled from source despite my attempts to match the configuration.
  • Benchmark throughput is generally improving over time from old jemalloc (4.0) to modern jemalloc (5.3).

Builds

My previous post explains the benchmark and HW. 

To get the jemalloc config details I added malloc-conf="stats_print:true" to my.cnf which causes stats and the config details to get written to the MySQL error log on shutdown.
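The jemalloc-native way to do the same thing, for anyone not using the my.cnf option, is the MALLOC_CONF environment variable (the defaults-file path is just an example):

  MALLOC_CONF="stats_print:true" bin/mysqld_safe --defaults-file=/path/to/my.cnf &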

I compiled many versions of jemalloc from source -- 4.0.4, 4.1.1, 4.2.1, 4.3.1, 4.4.0, 4.5.0, 5.0.1, 5.1.0, 5.2.0, 5.2.1, 5.3.0. All of these used the default jemalloc config, and while it isn't listed in the output below, the default value for background_thread is false.

  config.cache_oblivious: true
  config.debug: false
  config.fill: true
  config.lazy_lock: false
  config.malloc_conf: ""
  config.opt_safety_checks: false
  config.prof: false
  config.prof_libgcc: false
  config.prof_libunwind: false
  config.stats: true
  config.utrace: false
  config.xmalloc: false
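For reference, a minimal sketch of building a jemalloc release from source with the default config (the tag is one of the versions listed above; my exact steps may have differed):

  git clone https://github.com/jemalloc/jemalloc.git
  cd jemalloc
  git checkout 5.2.1
  ./autogen.sh
  ./configure
  make -j8
  sudo make install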

The config for Ubuntu-provided 5.2.1 is below. This is also the config used by je-5.2.1.prof (see below). It also gets background_thread=false. It differs from what I show above by:

  • uses config.prof: true
  • uses config.prof_libgcc: true

  config.cache_oblivious: true
  config.debug: false
  config.fill: true
  config.lazy_lock: false
  config.malloc_conf: ""
  config.opt_safety_checks: false
  config.prof: true
  config.prof_libgcc: true
  config.prof_libunwind: false
  config.stats: true
  config.utrace: false
  config.xmalloc: false

Finally, I tried one more config when compiling from source to match the config that is used at work. I get that via: 

configure --disable-cache-oblivious --enable-opt-safety-checks --enable-prof --disable-prof-libgcc --enable-prof-libunwind --with-malloc-conf="background_thread:true,metadata_thp:auto,abort_conf:true,muzzy_decay_ms:0"

With that the option values are the following, plus the background thread is enabled. The build that uses this is named je-5.2.1.prod below.


  config.cache_oblivious: false
  config.debug: false
  config.fill: true
  config.lazy_lock: false
  config.malloc_conf: "background_thread:true,metadata_thp:auto,abort_conf:true,muzzy_decay_ms:0"
  config.opt_safety_checks: true
  config.prof: true
  config.prof_libgcc: false
  config.prof_libunwind: true
  config.stats: true
  config.utrace: false
  config.xmalloc: false

Now I have results for variants of jemalloc 5.2.1 and the names here match the names I used on the spreadsheets that show peak RSS.

  • je-5.2.1.ub - Ubuntu-provided 5.2.1
  • je-5.2.1 - compiled from source with default options
  • je-5.2.1.prof - compiled with configure --enable-prof --enable-prof-libgcc to get a config that matches je-5.2.1.ub
  • je-5.2.1.prod - compiled from source using the configure command shown above (the config used at work)

Benchmarks

I ran the Insert Benchmark using a 60G RocksDB block cache. The benchmark was repeated twice -- once using the (older) LRU block cache, once using the (newer) Hyper Clock cache.

The benchmark was run in the IO-bound setup and the database is larger than memory. The benchmark used a c2-standard-30 server from GCP with Ubuntu 22.04, 15 cores, hyperthreads disabled, 120G of RAM and 1.5T of storage from RAID 0 over 4 local NVMe devices with XFS.

The benchmark is run with 8 clients and 8 tables (client per table). The benchmark is a sequence of steps and the peak RSS problem is worst for the l.x benchmark step that creates indexes and allocates a lot of memory while doing so:

  • l.i0
    • insert 500 million rows per table
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 100 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail. 
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.

Configurations

The benchmark was run with 2 my.cnf files: c5 and c7, edited to use a 40G RocksDB block cache. The difference between them is that c5 uses the LRU block cache (older code) while c7 uses the Hyper Clock cache.
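A hedged sketch of that edit, showing only the block cache size (the option that selects LRU vs Hyper Clock differs between c5 and c7 and isn't shown here):

  # 40G block cache, value in bytes
  rocksdb_block_cache_size=42949672960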

Results: perf reports

My standard perf reports are here for both types of block caches: LRU and Hyper Clock.

  • Throughput is generally improving over time from old jemalloc (4.0) to modern jemalloc (5.3). See the tables with absolute and relative throughput in the Summary section for LRU and for Hyper Clock.
  • HW performance metrics are mostly similar regardless of the peak RSS spikes. See the tables for LRU and for Hyper Clock. The interesting columns include: cpupq has CPU per operation, cpups has the average value for vmstat's us + sy, csps has the average value for vmstat's cs and cspq has context switches per operation.
So the good news is that tests here don't find performance regressions, although the more interesting test would be on larger HW with more concurrency.

Results: peak RSS

I measured the peak RSS during each benchmark step. The spreadsheet is here.

Summary

  • The larger values for jemalloc 4.4, 4.5, 5.0 and 5.1 might be from changes in how MADV_FREE and MADV_DONTNEED were used.
  • The peak RSS is larger for je-5.2.1.ub during l.x and l.i1. I have been unable to reproduce that with jemalloc compiled from source despite matching the configuration.

Sunday, August 27, 2023

RocksDB and glibc malloc don't play nice together

Pineapple and ham work great together on pizza. RocksDB and glibc malloc don't work great together. The primary problem is that for RocksDB processes the RSS with glibc malloc is much larger than with jemalloc or tcmalloc. I have written about this before -- see here and here. RocksDB is a stress test for an allocator.

tl;dr

  • For a process using RocksDB the RSS with glibc malloc is much larger than with jemalloc or tcmalloc. There will be more crashes from the OOM killer with glibc malloc.

Benchmark

The benchmark is explained in a previous post.

The insert benchmark was run in the IO-bound setup and the database is larger than memory.

The benchmark used a c2-standard-30 server from GCP with Ubuntu 22.04, 15 cores, hyperthreads disabled, 120G of RAM and 1.5T of storage from RAID 0 over 4 local NVMe devices with XFS.

The benchmark is run with 8 clients and 8 tables (client per table). The benchmark is a sequence of steps.

  • l.i0
    • insert 500 million rows per table
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 100 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail. 
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.

Configurations

The benchmark was run with 2 my.cnf files: c5 and c7, edited to use a 40G RocksDB block cache. The difference between them is that c5 uses the LRU block cache (older code) while c7 uses the Hyper Clock cache.

Malloc

The test was repeated with 4 malloc implementations:

  • je-5.2.1 - jemalloc 5.2.1, the version provided by Ubuntu 22.04
  • je-5.3.0 - jemalloc 5.3.0, the current jemalloc release, built from source
  • tc-2.9.1 - tcmalloc 2.9.1, the version provided by Ubuntu 22.04
  • glibc 2.35 - this is the version provided by Ubuntu 22.04

Results

I measured the peak RSS during each benchmark step.

The benchmark completed for all malloc implementations using the c5 config, but had some benchmark steps run for more time, there would have been OOM with glibc. All of the configs used a 40G RocksDB block cache.

The benchmark completed for jemalloc and tcmalloc using the c7 config but failed with OOM for glibc on the q1000 step. Had the l.i1, q100 and q500 steps run for more time, the OOM would have happened sooner.
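When the OOM killer strikes it leaves evidence in the kernel log, so a quick way to confirm is something like:

  dmesg -T | grep -iE 'out of memory|oom-killer'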



This post has results for Postgres 18rc1 vs sysbench on small and large servers. Results for Postgres 18beta3 are here for a small and larg...