I used the Insert Benchmark on a large server to see if I could improve the configuration (my.cnf) I have been using.
tl;dr
- The good
- Performance is much better when subcompactions and the hyper clock cache are enabled.
- The not as good
- Peak RSS is larger for configs that enable the hyper clock cache. The problem here is not as large as it is on pure open source builds that I test because the version of jemalloc used here is tuned differently. Understanding this is a work in progress.
- The select min(transactionid) from TABLE query used at the start of benchmark steps can take up to 20 seconds when it should finish in a few milliseconds. I have yet to try to fix this via tuning.
Updates
- Added values from the RocksDB perf context counters to explain the slow select min() query
Builds
I used MyRocks from FB MySQL 8.0.28 with source from June 2023 at git hash ef5b9b101.
Benchmark
The insert benchmark was run in three setups.
- cached by RocksDB - all tables fit in the RocksDB block cache
- cached by OS - all tables fit in the OS page cache but not the 4G RocksDB block cache
- IO-bound - the database is larger than memory
The test HW has 80 cores with hyperthreads enabled, 256G of RAM and fast local-attached NVMe storage.
The benchmark is run with 24 clients and a client per table. The benchmark is a sequence of steps.
- l.i0
- insert X million rows across all tables without secondary indexes where X is 20 for cached and 500 for IO-bound
- l.x
- create 3 secondary indexes. I usually ignore performance from this step.
- l.i1
- insert and delete another 50 million rows per table with secondary index maintenance. The number of rows per table at the end of the benchmark step matches the number at the start: inserts are done to the head of the table and deletes from the tail (see the sketch after this list).
- q100
- do queries as fast as possible with 100 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.
- q500
- do queries as fast as possible with 500 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.
- q1000
- do queries as fast as possible with 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.
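To make the insert and delete pattern in l.i1 concrete, here is a minimal SQL sketch with a made-up two-column table; the real benchmark tables have more columns plus the 3 secondary indexes.
-- hypothetical table; transactionid is the PK as in the benchmark
CREATE TABLE t_example (transactionid BIGINT PRIMARY KEY, price FLOAT) ENGINE=ROCKSDB;
-- inserts append at the head: each new row gets the next, larger transactionid
INSERT INTO t_example VALUES (1000001, 1.5);
-- deletes trim from the tail: the row with the smallest transactionid is removed,
-- so the row count stays constant while tombstones accumulate at the low end of the PK
DELETE FROM t_example ORDER BY transactionid ASC LIMIT 1;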
Configurations
The configuration (my.cnf) files are here and I use abbreviated names for them in this post. For each variant there are two files -- one with a 4G block cache and one with a 180G block cache.
- c (4G, 180G) - base config
- c1 (4G, 180G) - adds rocksdb_use_hyper_clock_cache=ON
- c2 (4G, 180G) - adds rocksdb_block_cache_numshardbits=4
- c3 (4G, 180G) - disables intra-L0 compaction via a hack
- c4 (4G, 180G) - reduces level0_slowdown_writes_trigger from 20 to 8 and level0_stop_writes_trigger from 36 to 12
- c5 (4G, 180G) - enables subcompactions via rocksdb_max_subcompactions=4
- c6 (4G, 180G) - combines c1, c2, c5
- c7 (4G, 180G) - combines c1, c5 (sketched below)
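For illustration, this is roughly what the c7 variant adds on top of the base config. The exact files are linked above; the note about rocksdb_block_cache_size is my assumption about how the 4G and 180G files differ rather than something copied from them.
# c7 = base config + c1 + c5 (sketch, not the full my.cnf)
rocksdb_use_hyper_clock_cache=ON    # from c1
rocksdb_max_subcompactions=4        # from c5
# the 4G and 180G files set rocksdb_block_cache_size accordingly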
Reviewing results based on average throughput:
- Cached by RocksDB
- The c7 config (combines c1 and c5) is the best, followed by c1 and c5, because subcompactions and the hyper clock cache are great for perf. The c2 config is bad for l.i0, perhaps because there is more mutex contention with fewer block cache shards. The c3 config is bad for l.i0 and l.i1 because disabling intra-L0 compaction hurts write throughput. The c4 config is also bad for l.i0 and l.i1 because it makes the L0 less write-friendly and more read-friendly.
- Cached by OS
- See the description above for Cached by RocksDB
- IO-bound
- See the description above for Cached by RocksDB
Reviewing results based on response time histograms:
- Cached by RocksDB
- What I wrote in the Cached by RocksDB section for average throughput applies here. For l.i1 the histograms are much better for the c5, c6 and c7 configs. For q1000 the query response time histograms are much better for the c6 and c7 configs.
- Cached by OS
- What I wrote in the Cached by RocksDB section for average throughput applies here. For l.i1 the histograms are much better for the c5, c6 and c7 configs while the base config has a 25-second write stall. For q1000 the query response time histograms are much better for the c6 and c7 configs.
- IO-bound
Reviewing results based on variance over per-second intervals:
- Cached by RocksDB
- For l.i1 see the max insert response time charts for the base, c3, c4 and c5 configs. The c5 config greatly reduces the variance.
- Cached by OS
- For l.i0 see the IPS (insert rate) charts for the base, c3, c4 and c5 configs. The c3 and c4 configs have much more variance.
- For l.i1 see the IPS (insert rate) charts for the base, c3, c4 and c5 configs. Results are much better with c5.
- IO-bound
Slow get_min queries
Some of the benchmark steps do a simple query at startup to determine the min value of the transactionid column. This is the PK column and the query should be fast, but when MVCC GC falls behind it can be slow. I provide more detail on the problem here. The tables below list the time in seconds for that query at the start of the q100, q500 and q1000 benchmark steps.
-- Cached by RocksDB
config q100 q500 q1000
c 9.744 9.029 8.521
c1 11.762 9.487 8.761
c2 8.938 8.635 7.625
c3 16.680 15.582 16.392
c4 15.819 14.861 16.046
c5 15.999 15.251 17.754
c6 16.348 14.795 18.964
c7 15.046 13.760 18.254
-- Cached by OS
config q100 q500 q1000
c 15.080 15.462 15.896
c1 14.064 13.622 11.954
c2 10.148 9.641 7.242
c3 16.876 18.810 18.159
c4 16.252 18.571 18.138
c5 16.290 18.533 18.411
c6 16.530 17.341 18.236
c7 17.148 19.591 18.775
-- IO-bound
config q100 q500 q1000
c 9.310 8.747 11.018
c1 7.853 8.661 10.508
c2 9.156 9.188 11.096
c3 17.336 20.003 20.281
c4 19.281 19.794 21.849
c5 17.313 23.940 22.283
c6 18.848 19.926 22.483
c7 17.170 17.790 21.088
Peak RSS
This shows peak RSS for mysqld during the l.x (create index) benchmark step. It is ~5GB larger for the configs that enable the hyper clock cache. The difference is larger on the open source builds that I test; I think jemalloc here is tuned better than it is for those builds.
config  peak RSS (GB)
c 185.3
c1 191.3
c2 185.9
c3 183.3
c4 186.5
c5 188.0
c6 191.3
c7 191.3
Update 1: perf context counters
RocksDB has perf context counters that let you understand the work RocksDB does for a specific operation. AFAIK these can be accessed per session, but in my case I queried them via a global table, assuming there wasn't much other activity on the server and ignoring compaction (which is risky to ignore).
The counters are in the information_schema.rocksdb_perf_context_global table. I selected the values of all counters from that table before and after running the query and then computed the diff between the two sets of measurements. The counters are only populated if you set rocksdb_perf_context_level to something >= 2 in my.cnf.
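This is a minimal sketch of the before/after diff, assuming the table exposes stat_type and value columns; pi1 is a placeholder name for one of the benchmark tables.
-- requires rocksdb_perf_context_level >= 2 in my.cnf
-- snapshot the global counters, run the query, snapshot again, then report the deltas
CREATE TEMPORARY TABLE pc_before AS
  SELECT stat_type, value FROM information_schema.rocksdb_perf_context_global;
SELECT min(transactionid) FROM pi1;
CREATE TEMPORARY TABLE pc_after AS
  SELECT stat_type, value FROM information_schema.rocksdb_perf_context_global;
SELECT a.value - b.value AS delta, a.stat_type
FROM pc_after a JOIN pc_before b USING (stat_type)
WHERE a.value > b.value
ORDER BY delta;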
My notes and more numbers from this experiment are here.
This is what the counters look like when the query is fast (0.001s) at the start of benchmark step l.i1.
4 SEEK_CHILD_SEEK_COUNT
5 BLOCK_CACHE_HIT_COUNT
12 IO_THREAD_POOL_ID
46 ITER_READ_BYTES
47 USER_KEY_COMPARISON_COUNT
And this is what they look like when the query is slow (12.715s) at the start of benchmark step q100. The problem is the large values for the *_SKIPPED_COUNT counters. The large value for USER_KEY_COMPARISON_COUNT might just be from background compaction.
65466 BLOCK_READ_COUNT
225827 BLOCK_CACHE_HIT_COUNT
36521600 INTERNAL_DELETE_SKIPPED_COUNT
36521600 INTERNAL_KEY_SKIPPED_COUNT
256113421 USER_KEY_COMPARISON_COUNT
511093703 BLOCK_READ_BYTE
511093703 IO_BYTES_READ