Monday, November 20, 2017

Concurrent large allocations: glibc malloc, jemalloc and tcmalloc

At high concurrency, mysqld with jemalloc or tcmalloc can get ~4X more QPS on sysbench read-only than mysqld with glibc malloc, courtesy of memory allocation stalls in glibc malloc.

Last week I had more fun with malloc, but the real problem turned out to be a new gcc optimization. This week brings a different problem. I was curious about bug 88071, reported by Alexey Kopytov, which explains the negative performance impact from large allocations with glibc malloc. This can be an issue at high concurrency, and the allocation strategy for sort_buffer_size in MySQL might be part of the problem -- one that might be worth fixing as MySQL gets better at complex query processing.
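To make the allocation pattern concrete, below is a minimal sketch in C (my illustration, not MySQL code) of what the bug describes: many threads repeatedly allocating, touching and freeing a large buffer, roughly what a per-query 32M sort buffer looks like to the allocator. With glibc malloc an allocation this large is typically served by mmap and handed back with munmap on free, so the threads contend in kernel VM code. The thread and iteration counts are arbitrary.

/* build: gcc -O2 -pthread big-alloc.c -o big-alloc */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 48
#define ITERS    1000
#define BUFSZ    (32UL * 1024 * 1024)   /* 32M, like sort_buffer_size=32M */

static volatile char sink;              /* keeps the memset from being optimized away */

static void *worker(void *arg) {
    (void) arg;
    for (int i = 0; i < ITERS; i++) {
        char *buf = malloc(BUFSZ);      /* large request: glibc malloc typically uses mmap */
        if (buf == NULL) abort();
        memset(buf, i & 0xff, BUFSZ);   /* fault the pages in */
        sink = buf[BUFSZ - 1];
        free(buf);                      /* glibc returns it to the kernel with munmap */
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("done\n");
    return 0;
}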

On the bright side there is an easy workaround -- use jemalloc or tcmalloc with mysqld, which can be set in my.cnf. I think upstream MongoDB binaries are linked with tcmalloc. I hope someone can tell me in a comment what is used in the binaries provided by upstream MySQL, Percona and MariaDB.
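For example, when mysqld is started via mysqld_safe the malloc-lib option in my.cnf preloads the allocator. This is a sketch -- the library path and soname vary by distro and allocator version:

[mysqld_safe]
malloc-lib = /usr/lib64/libjemalloc.so.1

# or preload it directly when starting mysqld yourself:
# LD_PRELOAD=/usr/lib64/libjemalloc.so.1 bin/mysqld --defaults-file=my.cnf &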

I previously wrote that the expected benefit from jemalloc and tcmalloc is a smaller RSS. I frequently see that mysqld RSS is 2X larger with glibc malloc than with jemalloc or tcmalloc, and using twice as much memory is a big deal. I did a similar test for MongoDB but only published VSZ for mongod, which was larger with glibc malloc. Sometimes I find that jemalloc and tcmalloc also improve performance (more throughput, more QPS) at high concurrency.
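A quick way to see the difference is to check RSS and VSZ for the running process, for example:

ps -o pid,rss,vsz,comm -p $(pidof mysqld)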

Configuration

I used modern sysbench on a server with 24 cores and 48 HW threads. While I used MyRocks, this will reproduce with InnoDB. The test tables were cached by the storage engine. The sysbench command line is at the end of this post. The test is read-only, with additional options to limit it to one of the queries. The test was repeated with sort_buffer_size set to 2M and 32M, and was run for 8 and 48 concurrent clients.
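The -2m and -32m runs differ only in the sort buffer setting, something like this in my.cnf (a sketch; it can also be changed per session):

[mysqld]
sort_buffer_size = 32M    # 2M for the -2m runs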

I tested 4 binaries:
  • glibc-2.20 - glibc malloc from glibc 2.20 with gcc 4.9
  • glibc-2.23 - glibc malloc from glibc 2.23 with gcc 5.x
  • jemalloc - jemalloc 5.0.x with gcc 4.9
  • tcmalloc - modern tcmalloc with gcc 4.9
Results

The table below has the QPS for 8 and 48 clients for each of the binaries using 2M and 32M for sort_buffer_size. The value of sort_buffer_size is appended to the end of the configuration name (-2m, -32m). Note that there is a significant speedup (about 3X) from 8 to 48 clients for jemalloc and tcmalloc with both 2M and 32M for sort_buffer_size. There is also a speedup for glibc malloc with sort_buffer_size=2M. But glibc malloc has a problem with sort_buffer_size=32M.

QPS@8   QPS@48  configuration
24775   73550   glibc-2.20-2m
21939   18239   glibc-2.20-32m
26494   78082   jemalloc-2m
26602   78993   jemalloc-32m
27109   78106   tcmalloc-2m
26466   76214   tcmalloc-32m
26247   78430   glibc-2.23-2m
22320   24674   glibc-2.23-32m

The chart below is from the data above.

The source of the problem isn't obvious from PMP samples, the top CPU consumers from hierarchical perf, or the top CPU consumers from non-hierarchical perf. But the amount of CPU from queued_spin_lock_slowpath in the perf output hints at the problem. The MySQL bug blames mmap. Many years ago, when I was first evaluating LevelDB, it used to fall over for large databases courtesy of mmap and kernel VM code that didn't do well with concurrency. I don't know whether that is still an issue. I don't need to debug this because I use jemalloc, and I hope you use it or tcmalloc.
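For reference, something like the following is enough to see the time in queued_spin_lock_slowpath (a sketch, assuming perf is installed and symbols are available):

# profile a running mysqld for 30 seconds with call graphs
perf record -g -p $(pidof mysqld) -- sleep 30
perf report                  # hierarchical view
perf report --no-children    # flat view of the top CPU consumers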

The command line for the test is below:

sysbench --db-driver=mysql --mysql-user=... --mysql-password=... --mysql-host=127.0.0.1 --mysql-db=test --mysql-storage-engine=rocksdb --range-size=1000 --table-size=1000000 --tables=8 --threads=48 --events=0 --time=1800 --rand-type=uniform --order-ranges=4 --simple-ranges=0 --distinct-ranges=0 --sum-ranges=0 /data/mysql/sysbench10/share/sysbench/oltp_read_only.lua run
