I used the Insert Benchmark on a small server to see if I could improve the configuration (my.cnf) I have been using.
tl;dr
- With jemalloc the peak RSS for mysqld is larger with rocksdb_use_hyper_clock_cache=ON, so I reduced the value of rocksdb_block_cache_size from 8G to 6G for the configurations that enable it. This isn't fully explained yet but experts are working on it.
- The base config (a0) is good enough and the other configs don't provide a significant improvement. This isn't a big surprise: while the hyper clock cache and subcompactions are a big deal on larger servers, the server here is small and the workload has low concurrency.
- In some cases the a3 config, which disables intra-L0 compaction, hurts write throughput. This result is similar to what I measured on a larger server.
The insert benchmark was run in three setups that determine how much of the database is cached.
- cached by RocksDB - all tables fit in the RocksDB block cache
- cached by OS - all tables fit in the OS page cache but not the 1G RocksDB block cache
- IO-bound - the database is larger than memory
This benchmark used the Beelink server described here that has 8 cores, 16G of RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04.
The benchmark is run with 1 client. The benchmark is a sequence of steps.
- l.i0
- insert X million rows across all tables without secondary indexes, where X is 20 for the cached setups and 800 for IO-bound
- l.x
- create 3 secondary indexes. I usually ignore performance from this step.
- l.i1
- insert and delete another 100 million rows per table with secondary index maintenance. The number of rows per table at the end of the benchmark step matches the number at the start because inserts are done at the head of the table while deletes are done from the tail (see the SQL sketch after this list).
- q100
- do queries as fast as possible with 100 inserts/s/client and the same rate of deletes/s done in the background. Run for 3600 seconds.
- q500
- do queries as fast as possible with 500 inserts/s/client and the same rate of deletes/s done in the background. Run for 3600 seconds.
- q1000
- do queries as fast as possible with 1000 inserts/s/client and the same rate of deletes/s done in the background. Run for 3600 seconds.
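To make the insert and delete pattern concrete, here is a minimal SQL sketch. The table name and columns are made up for illustration and this is not the benchmark's real schema or client code; the point is that inserts land at the head of the PK range while deletes trim the tail, so the number of rows stays fixed.

```
-- Illustrative only: not the insert benchmark's actual schema
CREATE TABLE t (
  pk BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  k INT,
  v VARCHAR(100),
  INDEX xk (k)
) ENGINE=ROCKSDB;

-- Inserts go to the head: auto-increment assigns ever-larger PK values
INSERT INTO t (k, v) VALUES (1, 'new row');

-- Deletes come from the tail: remove the rows with the smallest PK values
DELETE FROM t ORDER BY pk ASC LIMIT 1;
```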
Configurations
The configuration (my.cnf) files are here and I use abbreviated names for them in this post. For each variant there are two files: one with a 1G block cache and one with a larger block cache. The larger block cache is 8G when the LRU cache is used and 6G when the hyper clock cache is used (see the tl;dr). A sketch of the per-variant deltas follows the list.
- a0 (1G, 8G) - base config
- a1 (1G, 6G) - adds rocksdb_use_hyper_clock_cache=ON
- a2 (1G, 8G) - adds rocksdb_block_cache_numshardbits=3
- a3 (1G, 8G) - disables intra-L0 compaction via a hack
- a4 (1G, 8G) - reduces level0_slowdown_writes_trigger from 20 to 8 and level0_stop_writes_trigger from 36 to 12
- a5 (1G, 8G) - enables subcompactions via rocksdb_max_subcompactions=2
- a6 (1G, 6G) - combines a1, a2, a5
- a7 (1G, 6G) - combines a1, a5
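To make the variants concrete, here is a sketch of the deltas each one applies on top of the base config (a0). The server variable names come from the list above; the placement of the level0 triggers inside rocksdb_default_cf_options is my assumption about how MyRocks exposes those column family options, so treat the linked my.cnf files as authoritative.

```
# a1: enable the hyper clock cache and use the smaller 6G cache
rocksdb_use_hyper_clock_cache=ON
rocksdb_block_cache_size=6G

# a2: use fewer block cache shards
rocksdb_block_cache_numshardbits=3

# a4: slow and then stop writes sooner as L0 files pile up. These are
# column family options, so I assume they are appended to
# rocksdb_default_cf_options, e.g.:
# rocksdb_default_cf_options=...;level0_slowdown_writes_trigger=8;level0_stop_writes_trigger=12

# a5: enable subcompactions
rocksdb_max_subcompactions=2
```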
The conclusion is that the base config (a0) is good enough and the other configs don't provide a significant improvement. This isn't a big surprise: while the hyper clock cache (a1) and subcompactions (a5) are a big deal on larger servers, the server in this case is small and the workload has low concurrency. The a3 config is bad for performance on the IO-bound workload -- intra-L0 compaction is useful.
When evaluating this based on average throughput (see the summaries for Cached by RocksDB, Cached by OS and IO-bound), the base config (a0) is good enough and the other configs don't provide significant improvements, although for IO-bound the a3 config is bad during the l.i1 benchmark step because it increases write stalls.
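For anyone who wants to look at write stalls directly rather than via the benchmark summaries, MyRocks exports RocksDB stall counters as status variables. Assuming a build that includes them, sampling the counters before and after the l.i1 step should show the a0 vs a3 difference:

```
-- Counters accumulate from startup, so sample before and after a step
SHOW GLOBAL STATUS LIKE 'rocksdb_stall%';
```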
The charts showing various metrics at 1-second intervals look similar with one exception. Links to them are in the performance summaries; grep for "per 1-second interval" in Cached by RocksDB, Cached by OS and IO-bound. The exception is IO-bound with the a3 config: see the IPS charts for the l.i1 benchmark step with the a0 config and the a3 config, where the a3 config has much more variance.