Wednesday, June 7, 2023

Create index, MyRocks and memory

I got OOM during index create for a few MyRocks benchmarks and I am somewhat ignorant about how MyRocks handles create index. This occurred on a server that has 16G of RAM, RocksDB block cache was 10G and mysqld RSS was ~11G so there wasn't much spare memory.

I am still somewhat ignorant but will share what I learned.

By OOM I mean that the Linux OOM killer decided to kill mysqld when the memory demand exceeded the memory supply.

There is one approach for creating a secondary index: bulk.

tl;dr

  • Plan for the increase to mysqld RSS as a function of rocksdb_merge_combine_read_size

Bulk approach

MyRocks can optimize bulk loading and those features have been extended to support secondary index creation. Documentation on bulk loading is here. My request to improve the docs is here. My possibly correct description of how this works:

  1. The PK is scanned to extract the columns needed for the secondary index and stored in memory
    1. When a memory limit is reached the columns are sorted and written to a temp file
    2. Continue the PK index scan
  2. Assume there are many sorted runs, each in a temp file. Ignore the case where nothing was spilled.
    1. Merge the sorted runs, write the output to new SSTs that are written to the max level of the LSM tree
There are two options that determine memory usage during bulk load:
  • rocksdb_merge_combine_read_size - the default is 1G and there is one concurrently active allocation per alter table add index statement even when multiple indexes are created by that statement.
  • rocksdb_merge_buf_size - the default is 64M and I am not sure how many concurrently active allocations might be used per alter table add index statement

Experiments: small server

I did a few experiments on a small server to understand memory usage by monitoring VSZ and RSS for mysqld while creating indexes. The server has 16G of RAM, the RocksDB block cache was 1G, rocksdb_merge_buf_size was the default (64M) and rocksdb_merge_combine_read_size was set to 128M.

My goal is to understand the memory demand for create index as a function of rocksdb_merge_buf_size and rocksdb_merge_combine_read_size so that I can avoid OOM. However, I really should have tested more values for these options. From the RSS diff table below I think that mysqld RSS grows by ~300M per concurrent index create when rocksdb_merge_combine_read_size is 128M.

First I did a full scan of a large table. At the end of the scan VSZ was 2.3G and RSS was 1.5G for mysqld, courtesy of the 1G RocksDB block cache.

Then I created secondary indexes first with 1 client and then with 4 clients creating the indexes concurrently. For the 1 client test the table had 800M rows. For the 4 client test the tables each have 200M rows. Scripts monitored VSZ and RSS for mysqld during each benchmark step and the table below reports the peak values during the benchmark step. The benchmark steps here are:

  • l.i0 - load the table in PK order, table has no secondary indexes
  • l.x - create 3 secondary indexes per table

Peak VSZ (GB) during benchmark step
Clients l.i0    l.x
1       3.1     4.2
4       3.4     5.7

Peak RSS (GB) during benchmark step
Clients l.i0    l.x
4       1.8     3.8
1       1.7     2.8

Finally, I subtract VSZ/RSS during the full scan (VSZ=2.3G, RSS=1.5G) from the peak VSZ/RSS values in the tables above.

Computed as:
(peak value during benchmark step) - (value after full scan)

VSZ diff
Clients l.i0    l.x
1       0.8     1.9
4       1.1     3.4

RSS diff
Clients l.i0    l.x
1       0.2     1.3
4       0.3     2.3

Experiments: big server

Next I ran tests on a large server with a 4G RocksDB block cache. After a SELECT statement that does a full scan of a large table the VSZ and RSS for mysqld are 6.0G and 4.5G. Tests were repeated with rocksdb_merge_combine_read_size set to 128M, 256M, 512M, 1024M, 2048M, 3000M and 4000M.

As the value for rocksdb_merge_combine_read_size gets larger the increase in RSS approaches the value of it. When it is set to 4000M then mysqld RSS increases by 4.1G. But when it is 128M then mysqld RSS increases by ~300M. 

Legend:
* mcrs - rocksdb_merge_combine_read_size in MB
* vsz - peak mysqld VSZ in GB during the create index
* rss - peak mysqld RSS in GB during the create index

--- Peak RSS and VSZ during create index

mcrs    vsz     rss
 128     7.1    4.8
 256     7.1    4.9
 512     8.1    5.2
1024     8.1    5.7
2048     9.3    6.7
3000    10.8    7.6
4000    12.0    8.6

--- Diffs

Computed as:
  (peak value during benchmark step) - (value after full scan)

mcrs    vsz     rss
 128    1.1     0.3
 256    1.1     0.4
 512    2.1     0.7
1024    2.1     1.2
2048    3.3     2.2
3000    4.8     3.1
4000    6.0     4.1

No comments:

Post a Comment

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...