Monday, October 2, 2023

Variance in peak RSS with jemalloc 5.2.1, part 2

I previously shared results showing peak RSS by jemalloc version for the Insert Benchmark. This post has additional results.

The problem is that peak RSS is larger during the l.x benchmark step when the Hyper Clock Cache is enabled, and the difference is large enough that there might be OOM in production unless you are prepared for it.
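
Peak RSS here means the high-water mark of the process's resident set size. As a minimal sketch of one way to read it on Linux (an assumption for illustration -- the benchmark harness may sample VmHWM from /proc/<pid>/status instead):

#include <stdio.h>
#include <sys/resource.h>

/* Sketch: print the peak RSS of the calling process.
   On Linux, ru_maxrss is reported in kilobytes. */
int main(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        perror("getrusage");
        return 1;
    }
    printf("peak RSS: %ld KB\n", ru.ru_maxrss);
    return 0;
}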

The previous results were from an unmodified Insert Benchmark, which has the l.x benchmark step to create secondary indexes after the database has been loaded by the l.i0 benchmark step. Here I changed the Insert Benchmark to create the secondary indexes at the start of the l.i0 benchmark step and skip l.x. The theory is that the large memory allocations done during create index contribute to the problem. One source of large allocations is sized by rocksdb_merge_combine_read_size, which I set to 64M for these tests -- large, but not huge.
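
For illustration, a hypothetical my.cnf fragment that sets it (the full MyRocks configurations used here are in the linked post):

[mysqld]
# 64M, in bytes
rocksdb_merge_combine_read_size=67108864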

tl;dr

  • The peak RSS problems with the Hyper Clock Cache don't reproduce when the l.x step is skipped
  • Peak RSS is too large with jemalloc 4.4 and 4.5, and that is fixed across several 5.x releases. Tobin Baker suggested these problems might come from changes to the usage of MADV_FREE and MADV_DONTNEED.
  • The je-5.2.1.prod configuration does better than the other jemalloc 5.2 configurations I tested. Perhaps that is from its use of background_thread:true.

Benchmark and builds

The benchmark and test HW are explained here. The jemalloc versions, jemalloc configurations and MyRocks configurations are explained here. In this post I share results for the c7 MyRocks configuration, which enables the Hyper Clock Cache.

The Insert Benchmark was run in one setup -- the database was larger than memory. The benchmark used a c2-standard-30 server from GCP with Ubuntu 22.04, 15 cores, hyperthreads disabled, 120G of RAM and 1.5T of storage from RAID 0 over 4 local NVMe devices with XFS.

The benchmark is run with 8 clients and 8 tables. The benchmark is a sequence of steps.

  • l.i0
    • insert 50 million rows per table
  • l.x
    • create 3 secondary indexes.
  • l.i1
    • insert and delete another 100 million rows per table with secondary index maintenance. The number of rows per table at the end of the benchmark step matches the number at the start: inserts are done at the head of the table and deletes are done from the tail. 
  • q100, q500, q1000
    • do queries as fast as possible while inserts are done in the background at 100, 500 and 1000 inserts/s/client (for q100, q500 and q1000 respectively), with deletes done at the same rate. Each step runs for 1800 seconds.

jemalloc configurations

Output from stats_print:true is here for all versions tested. Output from grep to determine the value of muzzy_decay_ms per version is below. All of the configurations use the default except for stats.j521ub (je-5.2.1.ub), which sets it to zero -- but zero is already the default in 5.2.1.

je-5.0.1:      opt.muzzy_decay_ms: 10000 (arenas.muzzy_decay_ms: 10000)
je-5.1.0:      opt.muzzy_decay_ms: 10000 (arenas.muzzy_decay_ms: 10000)
je-5.2.0:      opt.muzzy_decay_ms: 0 (arenas.muzzy_decay_ms: 0)
je-5.2.1:      opt.muzzy_decay_ms: 0 (arenas.muzzy_decay_ms: 0)
je-5.2.1.prod: opt.muzzy_decay_ms: 0 (arenas.muzzy_decay_ms: 0)
je-5.2.1.prof: opt.muzzy_decay_ms: 0 (arenas.muzzy_decay_ms: 0)
je-5.2.1.ub:   opt.muzzy_decay_ms: 0 (arenas.muzzy_decay_ms: 0)
je-5.3.0:      opt.muzzy_decay_ms: 0 (arenas.muzzy_decay_ms: 0)
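
The same value can also be read programmatically. This is a minimal sketch using jemalloc's mallctl API, not part of the benchmark harness, and it assumes a binary linked against jemalloc 5.x:

#include <stdio.h>
#include <jemalloc/jemalloc.h>

/* Sketch: read opt.muzzy_decay_ms, the same value that the grep
   of the stats_print:true output reports above. */
int main(void) {
    ssize_t decay;
    size_t sz = sizeof(decay);
    if (mallctl("opt.muzzy_decay_ms", &decay, &sz, NULL, 0) != 0) {
        fprintf(stderr, "mallctl failed\n");
        return 1;
    }
    printf("opt.muzzy_decay_ms: %zd\n", decay);
    return 0;
}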

Results

A spreadsheet with results is here. The with_l.x sheet has results with the l.x benchmark step, and those were explained in a previous blog post. The no_l.x sheet has results where the l.x benchmark step is skipped.

The first chart is from a previous blog post and shows peak RSS with the Hyper Clock Cache when the l.x benchmark step is not skipped. I focus on the result for l.x, where there is a spike for je-5.2.1.ub.

This next chart is from the most recent tests where the l.x benchmark step is skipped:

  • There are still spikes with jemalloc 4.4 and 4.5 that are fixed by 5.2
  • Within the 5.2 builds, the best result is from je-5.2.1.prod and that might be from using background_thread:true (see the sketch below)
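
background_thread:true can be set via the MALLOC_CONF environment variable at process start, or enabled at runtime with mallctl as sketched below. This is an illustration, not the actual je-5.2.1.prod setup, which is described in the linked configuration notes.

#include <stdbool.h>
#include <stdio.h>
#include <jemalloc/jemalloc.h>

/* Sketch: enable jemalloc background threads, which purge unused
   dirty and muzzy pages asynchronously instead of on application
   threads. Roughly equivalent to starting the process with
   MALLOC_CONF=background_thread:true. */
int main(void) {
    bool enable = true;
    if (mallctl("background_thread", NULL, NULL, &enable, sizeof(enable)) != 0) {
        fprintf(stderr, "failed to enable background_thread\n");
        return 1;
    }
    return 0;
}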


4 comments:

  1. Can you also compare jemalloc results with tcmalloc and mimalloc?

    Reply: http://smalldatum.blogspot.com/2023/08/rocksdb-and-glibc-malloc-dont-play-nice.html
  2. The Linux implementation of MADV_FREE seems unusable. Freed memory counts against your RSS and makes you attractive to the OOM killer. And it confuses other tools that try to manage memory between processes. It shouldn't count against RSS since logically the kernel owns it.

  3. And the OOM killer ought to arrange to deallocate freed memory before it resorts to killing.
