Sunday, August 27, 2023

RocksDB and glibc malloc don't play nice together

Pineapple and ham work great together on pizza. RocksDB and glibc malloc don't work great together. The primary problem is that for RocksDB processes the RSS with glibc malloc is much larger than with jemalloc or tcmalloc. I have written about this before -- see here and here. RocksDB is a stress test for an allocator.

tl;dr

  • For a process using RocksDB the RSS with glibc malloc is much larger than with jemalloc or tcmalloc. There will be more crashes from the OOM killer with glibc malloc.

Benchmark

The benchmark is explained in a previous post.

The insert benchmark was run in the IO-bound setup and the database is larger than memory.

The benchmark used a c2-standard-30 server from GCP with Ubuntu 22.04, 15 cores, hyperthreads disabled, 120G of RAM and 1.5T of storage from RAID 0 over 4 local NVMe devices with XFS.

The benchmark is run with 8 clients and 8 tables (one client per table). The benchmark is a sequence of steps.

  • l.i0
    • insert 500 million rows per table
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 100 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start, with inserts done to the table head and deletes done from the tail. 
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.

Configurations

The benchmark was run with 2 my.cnf files, c5 and c7, each edited to use a 40G RocksDB block cache. The difference between them is that c5 uses the LRU block cache (the older code) while c7 uses the Hyper Clock cache.
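
For readers who use RocksDB directly rather than through MyRocks my.cnf options, a minimal sketch of the same choice via the RocksDB C++ API is below. The 40G capacity matches the configs above; the estimated entry charge and the rest of the table options are assumptions for illustration, not values from the benchmark.

  #include <rocksdb/cache.h>
  #include <rocksdb/options.h>
  #include <rocksdb/table.h>

  // Build Options with either the LRU block cache (like the c5 config)
  // or the Hyper Clock cache (like the c7 config), both sized at 40G.
  rocksdb::Options MakeOptions(bool use_hyper_clock) {
    const size_t kCacheBytes = 40UL << 30;  // 40G block cache, as in the post
    std::shared_ptr<rocksdb::Cache> cache;
    if (use_hyper_clock) {
      // The estimated_entry_charge (32KB here) is an assumption; RocksDB
      // suggests something close to the typical uncompressed block size.
      rocksdb::HyperClockCacheOptions hcc_opts(kCacheBytes, 32 * 1024);
      cache = hcc_opts.MakeSharedCache();
    } else {
      cache = rocksdb::NewLRUCache(kCacheBytes);
    }
    rocksdb::BlockBasedTableOptions table_options;
    table_options.block_cache = cache;
    rocksdb::Options options;
    options.table_factory.reset(
        rocksdb::NewBlockBasedTableFactory(table_options));
    return options;
  }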

Malloc

The test was repeated with 4 malloc implementations:

  • je-5.2.1 - jemalloc 5.2.1, the version provided by Ubuntu 22.04
  • je-5.3.0 - jemalloc 5.3.0, the current jemalloc release, built from source
  • tc-2.9.1 - tcmalloc 2.9.1, the version provided by Ubuntu 22.04
  • glibc 2.35 - the version provided by Ubuntu 22.04

Results

I measured the peak RSS during each benchmark step.
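
One simple way to get peak RSS for a process on Linux is to read the VmHWM line (peak resident set size) from /proc/<pid>/status. The sketch below shows that approach; it is not necessarily the tool used to collect these results.

  #include <fstream>
  #include <iostream>
  #include <sstream>
  #include <string>

  // Return the peak RSS (VmHWM, in kB) of a process, or -1 if not found.
  long PeakRssKb(long pid) {
    std::ifstream status("/proc/" + std::to_string(pid) + "/status");
    std::string line;
    while (std::getline(status, line)) {
      if (line.rfind("VmHWM:", 0) == 0) {  // e.g. "VmHWM:  123456 kB"
        std::istringstream fields(line.substr(6));
        long kb = -1;
        fields >> kb;
        return kb;
      }
    }
    return -1;
  }

  int main(int argc, char** argv) {
    long pid = argc > 1 ? std::stol(argv[1]) : 1;
    std::cout << "peak RSS (kB): " << PeakRssKb(pid) << "\n";
    return 0;
  }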

The benchmark completed for all malloc implementations using the c5 config, but had some benchmark steps run for more time, there would have been an OOM with glibc. All of the configs used a 40G RocksDB block cache.

The benchmark completed for jemalloc and tcmalloc using the c7 config, but failed with OOM for glibc on the q1000 step. Had the l.i1, q100 and q500 steps run for more time, the OOM would have happened sooner.



4 comments:

  1. The Cloudflare post "The effect of switching to TCMalloc on RocksDB memory use" is also highly recommended reading on what happens in glibc malloc.

    If one is going to change the allocator, it might also make sense to change the OOM score, so that, e.g., backup software is more likely to get killed:

    [Service]
    Environment="LD_PRELOAD=/usr/lib64/libtcmalloc.so.4"
    OOMScoreAdjust=-600

    Don't forget about "systemctl daemon-reload" and "systemctl cat" before a restart.

    Anecdotally, there was no obvious difference the couple of times I instead tried libtcmalloc_minimal.so.4, which is described as "does not include the heap profiler and checker (perhaps to reduce binary size for a static binary)".

    Replies
    1. Thank you for the suggestions. I enjoy getting useful feedback like this when I explore problems that are new to me.

  2. The title says, specifically, RocksDB, but the contents of the post refer to MyRocks/MySQL benchmarks; not what I was expecting given the title. Now, most of us understand this distinction, but it raises the question: is this RSS/glibc issue truly a "rocksdb issue" or a "mysql running myrocks" issue?

    Replies
    1. RocksDB on its own isn't very useful so there won't be any pure RocksDB benchmark results. It is an embedded DBMS compiled with application code. In this case the application was MySQL. Even db_bench, the RocksDB benchmark client, is application code compiled with RocksDB. Perhaps I will try to reproduce this using db_bench, but will that make it a RocksDB running db_bench issue?

      From the old blog posts I linked above in the first paragraph I reproduced the glibc malloc problem long ago using MongoRocks:
      https://smalldatum.blogspot.com/2014/12/malloc-and-mongodb-performance.html

