I previously shared results showing peak RSS by jemalloc version for the Insert Benchmark. This post has additional results.
The problem is that peak RSS is larger during the l.x benchmark step when the Hyper Clock Cache is enabled, and the difference is large enough that there might be OOM in production unless you are prepared for it.
The previous results were from an unmodified Insert Benchmark that has the l.x benchmark step to create secondary indexes after the database has been loaded by the l.i0 benchmark step. Here I changed the Insert Benchmark to create the secondary indexes at the start of the l.i0 benchmark step and skip l.x (a sketch of the change follows this paragraph). The theory is that the large memory allocations done during create index contribute to the problem. One source of large allocations is sized by rocksdb_merge_combine_read_size, which I set to 64M for these tests -- large but not huge.
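The sketch below shows the idea of the change, assuming a generic MySQL client (pymysql) and hypothetical table, column and index names rather than the real Insert Benchmark schema.

import pymysql

# Hypothetical DDL; these are example names, not the Insert Benchmark schema
SECONDARY_INDEX_DDL = [
    "CREATE INDEX ix_a ON t1 (a)",
    "CREATE INDEX ix_ab ON t1 (a, b)",
    "CREATE INDEX ix_bc ON t1 (b, c)",
]

def setup_table(conn, create_indexes_early):
    # Modified benchmark: when create_indexes_early is True the secondary
    # indexes exist during the l.i0 load and the l.x step is skipped
    cur = conn.cursor()
    cur.execute("CREATE TABLE t1 (id BIGINT PRIMARY KEY, a INT, b INT, c INT)")
    if create_indexes_early:
        for ddl in SECONDARY_INDEX_DDL:
            cur.execute(ddl)

def run_lx(conn):
    # Unmodified benchmark: build the secondary indexes after the load,
    # which is where the peak RSS spikes were observed
    cur = conn.cursor()
    for ddl in SECONDARY_INDEX_DDL:
        cur.execute(ddl)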
tl;dr
- The peak RSS problems with Hyper Clock Cache don't reproduce when the l.x step is skipped
- Peak RSS is too large for jemalloc 4.4 and 4.5 and that gets fixed over several 5.x releases. Tobin Baker suggested these might be from changes to the usage of MADV_FREE and MADV_DONTNEED (a small demo of the difference follows this list).
- The je-5.2.1.prod configuration does better than the other jemalloc 5.2 configuration I tested. Perhaps this is from the usage of background_thread:true.
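The MADV_FREE vs MADV_DONTNEED difference that Tobin Baker mentioned can be observed directly with the small demo below. This is not part of the benchmark; it assumes Linux (4.5+ for MADV_FREE) and Python 3.8+. MADV_DONTNEED drops the pages from RSS immediately, while MADV_FREE tends to leave them counted in RSS until the kernel reclaims them under memory pressure.

import mmap

def rss_kb():
    # Resident set size of this process as reported by the kernel, in KB
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])

def demo(advice, name):
    size = 256 * 1024 * 1024
    # MADV_FREE only applies to private anonymous mappings
    m = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    m.write(b"x" * size)        # touch every page so it becomes resident
    before = rss_kb()
    m.madvise(advice)           # tell the kernel the pages are no longer needed
    after = rss_kb()
    print(f"{name}: RSS {before} KB -> {after} KB")
    m.close()

demo(mmap.MADV_DONTNEED, "MADV_DONTNEED")
demo(mmap.MADV_FREE, "MADV_FREE")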
The benchmark is run with 8 clients and 8 tables and is a sequence of steps:
- l.i0
  - insert 50 million rows per table
- l.x
  - create 3 secondary indexes (this step is skipped in the modified benchmark described above)
- l.i1
  - insert and delete another 100 million rows per table with secondary index maintenance. The number of rows per table at the end of the step matches the number at the start: inserts are done at the table head and deletes are done from the tail.
- q100, q500, q1000
  - do queries as fast as possible while 100, 500 and 1000 inserts/s/client, and the same rate of deletes/s, are done in the background. Each step runs for 1800 seconds (a sketch of the write-rate pacing follows this list).
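The sketch below shows one way a client can pace the background writes for the q100, q500 and q1000 steps. It is a generic example, not the Insert Benchmark code, and do_insert/do_delete are placeholder callbacks.

import time

def rate_limited_writes(do_insert, do_delete, rate_per_s, duration_s=1800):
    # Pace one client's background writes: rate_per_s inserts/s and the
    # same rate of deletes/s, for the duration of the benchmark step
    interval = 1.0 / rate_per_s
    deadline = time.monotonic() + duration_s
    next_op = time.monotonic()
    while time.monotonic() < deadline:
        do_insert()    # insert one row at the head of the table
        do_delete()    # delete one row from the tail of the table
        next_op += interval
        time.sleep(max(0.0, next_op - time.monotonic()))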
Results:
- There are still RSS spikes with jemalloc 4.4 and 4.5 that are fixed by 5.2
- Within the 5.2 builds the best result is from je-5.2.1.prod, and that might be from using background_thread:true (see the sketch below)
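A sketch of how background_thread:true can be enabled via jemalloc's MALLOC_CONF environment variable when starting the server. The binary path and arguments are placeholders, and this is an assumption about the je-5.2.1.prod setup rather than a copy of it.

import os
import subprocess

env = dict(os.environ)
# background_thread:true lets jemalloc purge dirty pages from background
# threads instead of deferring that work to application threads
env["MALLOC_CONF"] = "background_thread:true"

# Hypothetical server launch; the path and options are examples only
subprocess.Popen(["/path/to/mysqld", "--defaults-file=/path/to/my.cnf"], env=env)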
Comments:

Can you also compare jemalloc results with tcmalloc and mimalloc?
http://smalldatum.blogspot.com/2023/08/rocksdb-and-glibc-malloc-dont-play-nice.html
The Linux implementation of MADV_FREE seems unusable. Freed memory counts in your RSS and makes you attractive to the OOM killer. And it confuses other tools that try to manage memory between processes. It shouldn't count against RSS since logically the kernel owns it.
And the OOM killer ought to arrange to deallocate freed memory before it resorts to killing.