Monday, October 6, 2014

Details on range scan performance regression in MySQL 5.7

My last blog post was already too long, so I deferred the explanation of the performance regressions for range scans with InnoDB in MySQL 5.7. Here are the details; a sketch of the three query shapes (PK, covering secondary, non-covering secondary) follows this list.
  • For a secondary index, a non-covering scan degrades much more at high concurrency than a covering scan, and the problem gets worse as the scan length increases. This problem isn't new in 5.7, but it is much worse in 5.7 than in 5.6. At 32 threads the problem is wait time on btr_search_latch and hash_table_locks. I opened bug 74283 for this.
  • A covering scan of a secondary index degrades much more at high concurrency than a PK scan. This problem is equally bad in 5.7 and 5.6. Wait times reported by the performance schema do not explain the problem -- search below for PFS_NOT_INSTRUMENTED to understand why. Using Linux perf I definitely see more mutex contention in the covering index scan, but I don't understand the internal differences between PK and covering secondary index scans. I opened bug 74280 for this.
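
To make the comparison concrete, here is a minimal sketch of the three query shapes, referred to as q1 (PK scan), q4 (non-covering secondary) and q5 (covering secondary) in the Performance Schema section below. The table, index and column names are my illustration, not the exact benchmark statements:

    -- Hypothetical sysbench-like table; the real schema may differ.
    CREATE TABLE t (
      id INT NOT NULL PRIMARY KEY,
      k  INT NOT NULL,
      c  CHAR(120) NOT NULL,
      KEY k_idx (k)
    ) ENGINE=InnoDB;

    -- q1: PK range scan, everything is read from the clustered index
    SELECT c FROM t WHERE id BETWEEN 1000 AND 1999;

    -- q5: covering scan, the secondary index k_idx has every referenced column
    SELECT k FROM t WHERE k BETWEEN 1000 AND 1999;

    -- q4: non-covering scan, each index entry requires a lookup back into the PK
    SELECT c FROM t WHERE k BETWEEN 1000 AND 1999;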
non-covering vs covering

These graphs show the absolute response time in seconds, not the normalized response time used in my previous posts. The response time difference between non-covering and covering secondary index scans is much larger at high concurrency than at low concurrency. For the 10-row scan the non-covering index is close to the covering index response time at 1 thread, but at 32 threads it is almost 2X worse. For the 1000-row scan the non-covering index is about 2X worse at 1 thread, and at 32 threads it is about 8X worse in 5.6 and about 13X worse in 5.7. From the PS data that follows, the problem is much more wait time on btr_search_latch and hash_table_locks.


covering vs PK

These graphs show absolute response time for the 1000-row scan using MySQL 5.6 and 5.7. There is not much difference in response time between the PK and covering secondary index scans at 1 thread. There is a big difference at 32 threads. 
Performance Schema

Assuming the problem is mutex contention, we have the perfect tool to diagnose it -- the Performance Schema. I enabled it to instrument wait/synch/% objects and the results are interesting. I ran the scan tests at 32 threads for the q1, q4 and q5 range scans (q1 is the PK scan, q4 is the non-covering secondary index scan, q5 is the covering secondary index scan). Tests were done for 10-row and 1000-row scans with the adaptive hash index enabled and then disabled. For each configuration the top-10 wait events are listed below. Note that q4 runs for much longer than both q1 and q5, and that q5 runs for up to about 2X longer than q1 on the 1000-row scans. I have not normalized the wait times to account for that.
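
For reference, this is roughly how the per-event wait times below can be collected. The setup_instruments and events_waits_summary_global_by_event_name tables are standard Performance Schema tables; the exact commands used for these runs are not shown here, so treat this as a sketch:

    -- Enable timing for the mutex and rw-lock instruments before the test run
    -- (assumes the default consumers are enabled).
    UPDATE performance_schema.setup_instruments
       SET ENABLED = 'YES', TIMED = 'YES'
     WHERE NAME LIKE 'wait/synch/%';

    -- Reset the counters, run the scan test, then list the top-10 wait events.
    TRUNCATE TABLE performance_schema.events_waits_summary_global_by_event_name;

    SELECT EVENT_NAME,
           ROUND(SUM_TIMER_WAIT / 1000000000000, 2) AS wait_seconds  -- timers are in picoseconds
      FROM performance_schema.events_waits_summary_global_by_event_name
     WHERE EVENT_NAME LIKE 'wait/%'
     ORDER BY SUM_TIMER_WAIT DESC
     LIMIT 10;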

It took the following amounts of time to finish the tests (the adaptive hash index toggle is sketched after this list):
  • adaptive hash on, 10 rows: 23, 35, 23 seconds for q1, q4, q5
  • adaptive hash off, 10 rows: 23, 36, 23 seconds for q1, q4, q5
  • adaptive hash on, 1000 rows: 173, 3540, 237 seconds for q1, q4, q5
  • adaptive hash off, 1000 rows: 137, 3440, 254 seconds for q1, q4, q5
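
A note on the "adaptive hash on/off" configurations: innodb_adaptive_hash_index is a dynamic variable, so I assume a toggle like the following between runs (it could also be set in my.cnf with a restart):

    -- Disable the adaptive hash index for the "adaptive hash off" runs ...
    SET GLOBAL innodb_adaptive_hash_index = OFF;
    -- ... and re-enable it for the "adaptive hash on" runs.
    SET GLOBAL innodb_adaptive_hash_index = ON;

    -- Confirm the current setting.
    SHOW GLOBAL VARIABLES LIKE 'innodb_adaptive_hash_index';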

10-row PK scan at 32 threads, adaptive hash index enabled

This is the base case. It was about as fast as the covering index scan.

Seconds    Event
7.50       wait/synch/sxlock/innodb/hash_table_locks
2.57       wait/synch/rwlock/sql/LOCK_grant
2.51       wait/synch/mutex/innodb/fil_system_mutex
2.48       wait/synch/mutex/sql/THD::LOCK_query_plan
1.56       wait/synch/mutex/sql/THD::LOCK_thd_data
1.21       wait/synch/mutex/sql/LOCK_table_cache
1.20       wait/synch/sxlock/innodb/btr_search_latch
0.84       wait/synch/mutex/sql/THD::LOCK_thd_query
0.76       wait/synch/mutex/innodb/trx_pool_mutex
0.75       wait/synch/sxlock/innodb/index_tree_rw_lock

10-row covering scan at 32 threads, adaptive hash index enabled

Compared to the 10-row PK scan, this has:

  • More wait time on btr_search_latch (3.89 vs 1.20 seconds)

Seconds    Event
7.49       wait/synch/sxlock/innodb/hash_table_locks
3.89       wait/synch/sxlock/innodb/btr_search_latch
2.69       wait/synch/mutex/sql/THD::LOCK_query_plan
2.60       wait/synch/mutex/innodb/fil_system_mutex
2.37       wait/synch/rwlock/sql/LOCK_grant
1.45       wait/synch/mutex/sql/THD::LOCK_thd_data
1.22       wait/synch/mutex/sql/LOCK_table_cache
0.88       wait/synch/mutex/sql/THD::LOCK_thd_query
0.74       wait/synch/sxlock/innodb/index_tree_rw_lock
0.68       wait/synch/mutex/innodb/trx_pool_mutex

10-row non-covering scan at 32 threads, adaptive hash index enabled

Compared to the 10-row covering scan, this has:
  • Much more wait time on btr_search_latch (15.60 vs 3.89 seconds)
  • A bit more time on hash_table_locks (9.16 vs 7.49 seconds)
  • A bit less time on LOCK_grant and fil_system_mutex (~1.5 vs ~2.5 seconds)

Seconds    Event
15.60      wait/synch/sxlock/innodb/btr_search_latch
 9.16      wait/synch/sxlock/innodb/hash_table_locks
 2.75      wait/synch/mutex/sql/THD::LOCK_query_plan
 1.54      wait/synch/mutex/innodb/fil_system_mutex
 1.47      wait/synch/mutex/sql/THD::LOCK_thd_data
 1.42      wait/synch/rwlock/sql/LOCK_grant
 1.15      wait/synch/mutex/sql/LOCK_table_cache
 0.96      wait/synch/mutex/sql/THD::LOCK_thd_query
 0.84      wait/synch/sxlock/innodb/index_tree_rw_lock
 0.71      wait/synch/mutex/innodb/trx_pool_mutex

10-row PK scan at 32 threads, adaptive hash index disabled

This is the base case. It was about as fast as the covering index scan whether or not the adaptive hash index was enabled.

Seconds    Event
8.47       wait/synch/sxlock/innodb/hash_table_locks
2.65       wait/synch/rwlock/sql/LOCK_grant
2.58       wait/synch/mutex/innodb/fil_system_mutex
2.42       wait/synch/mutex/sql/THD::LOCK_query_plan
1.52       wait/synch/mutex/sql/THD::LOCK_thd_data
1.25       wait/synch/sxlock/innodb/index_tree_rw_lock
1.17       wait/synch/mutex/sql/LOCK_table_cache
0.87       wait/synch/mutex/sql/THD::LOCK_thd_query
0.65       wait/synch/mutex/innodb/trx_pool_mutex
0.08       wait/io/file/innodb/innodb_data_file

10-row covering scan at 32 threads, adaptive hash index disabled

The waits here are similar to the PK scan.

Seconds    Event
8.53       wait/synch/sxlock/innodb/hash_table_locks
2.67       wait/synch/rwlock/sql/LOCK_grant
2.54       wait/synch/mutex/innodb/fil_system_mutex
2.47       wait/synch/mutex/sql/THD::LOCK_query_plan
1.52       wait/synch/mutex/sql/LOCK_table_cache
1.44       wait/synch/mutex/sql/THD::LOCK_thd_data
1.25       wait/synch/sxlock/innodb/index_tree_rw_lock
0.93       wait/synch/mutex/sql/THD::LOCK_thd_query
0.68       wait/synch/mutex/innodb/trx_pool_mutex
0.08       wait/io/file/innodb/innodb_data_file

10-row non-covering scan at 32 threads, adaptive hash index disabled

Compared to the covering index scan, this has:
  • Much more time in hash_table_locks (30.97 vs 8.53 seconds)
  • Much more time in index_tree_rw_lock (8.80 vs 1.25 seconds)

Seconds    Event
30.97      wait/synch/sxlock/innodb/hash_table_locks
 8.80      wait/synch/sxlock/innodb/index_tree_rw_lock
 2.76      wait/synch/mutex/sql/THD::LOCK_query_plan
 2.14      wait/synch/rwlock/sql/LOCK_grant
 2.12      wait/synch/mutex/innodb/fil_system_mutex
 1.46      wait/synch/mutex/sql/THD::LOCK_thd_data
 1.24      wait/synch/mutex/sql/LOCK_table_cache
 1.00      wait/synch/mutex/sql/THD::LOCK_thd_query
 0.66      wait/synch/mutex/innodb/trx_pool_mutex
 0.08      wait/io/file/innodb/innodb_data_file

1000-row PK scan at 32 threads, adaptive hash index enabled

This is the base case.

Seconds    Event
8.21       wait/synch/sxlock/innodb/hash_table_locks
3.13       wait/synch/mutex/sql/THD::LOCK_query_plan
1.57       wait/synch/mutex/sql/THD::LOCK_thd_data
1.08       wait/synch/mutex/sql/LOCK_table_cache
1.04       wait/synch/mutex/sql/THD::LOCK_thd_query
0.69       wait/synch/rwlock/sql/LOCK_grant
0.66       wait/synch/mutex/innodb/trx_pool_mutex
0.64       wait/synch/sxlock/innodb/btr_search_latch
0.62       wait/synch/mutex/innodb/fil_system_mutex
0.60       wait/synch/sxlock/innodb/index_tree_rw_lock

1000-row covering scan at 32 threads, adaptive hash index enabled

This is about 1.4X slower than the PK scan at 1000 rows. The waits here are about the same as in the PK case, and the top wait event, hash_table_locks, is smaller here than in the PK scan, so there isn't a wait event that explains the difference in performance. Perhaps the difference comes from CPU instructions or cache misses. The reason the PS doesn't explain the difference is that the per-block rw-lock doesn't use the PS; buf_block_init() creates it with this code:
   rw_lock_create(PFS_NOT_INSTRUMENTED, &block->lock, SYNC_LEVEL_VARYING);
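
One way to confirm this from SQL is to list the InnoDB rw-lock and sx-lock instruments registered with the Performance Schema; a lock created with PFS_NOT_INSTRUMENTED is not associated with any of these instruments, so its wait time cannot appear in the tables above:

    SELECT NAME, ENABLED, TIMED
      FROM performance_schema.setup_instruments
     WHERE NAME LIKE 'wait/synch/sxlock/innodb/%'
        OR NAME LIKE 'wait/synch/rwlock/innodb/%';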


Seconds    Event
6.45       wait/synch/sxlock/innodb/hash_table_locks
3.22       wait/synch/mutex/sql/THD::LOCK_query_plan
1.52       wait/synch/mutex/sql/THD::LOCK_thd_data
1.00       wait/synch/mutex/sql/THD::LOCK_thd_query
0.93       wait/synch/mutex/sql/LOCK_table_cache
0.72       wait/synch/mutex/innodb/trx_pool_mutex
0.70       wait/synch/sxlock/innodb/btr_search_latch
0.65       wait/synch/mutex/innodb/fil_system_mutex
0.63       wait/synch/rwlock/sql/LOCK_grant
0.56       wait/synch/sxlock/innodb/index_tree_rw_lock

1000-row non-covering scan at 32 threads, adaptive hash index enabled

This is 15 to 20 times slower than the PK and covering index scans. Compared to the covering index scan this has:
  • Much more time in btr_search_latch (655.57 vs 0.70 seconds)
  • Much more time in hash_table_locks (35.47 vs 6.45 seconds)
  • Much more time in index_tree_rw_lock (9.68 vs 0.56 seconds)

Seconds    Event
655.57     wait/synch/sxlock/innodb/btr_search_latch
 35.47     wait/synch/sxlock/innodb/hash_table_locks
  9.68     wait/synch/sxlock/innodb/index_tree_rw_lock
  3.42     wait/synch/mutex/sql/THD::LOCK_query_plan
  1.65     wait/synch/mutex/sql/THD::LOCK_thd_data
  1.26     wait/synch/mutex/sql/THD::LOCK_thd_query
  0.86     wait/synch/mutex/sql/LOCK_table_cache
  0.74     wait/synch/mutex/innodb/trx_pool_mutex
  0.73     wait/synch/rwlock/sql/LOCK_grant
  0.43     wait/synch/mutex/innodb/fil_system_mutex

1000-row PK scan at 32 threads, adaptive hash index disabled

This is the base case.

Seconds    Event
9.80       wait/synch/sxlock/innodb/hash_table_locks
3.00       wait/synch/mutex/sql/THD::LOCK_query_plan
1.53       wait/synch/mutex/sql/THD::LOCK_thd_data
1.13       wait/synch/sxlock/innodb/index_tree_rw_lock
1.05       wait/synch/mutex/sql/LOCK_table_cache
1.02       wait/synch/mutex/sql/THD::LOCK_thd_query
0.79       wait/synch/mutex/innodb/trx_pool_mutex
0.64       wait/synch/rwlock/sql/LOCK_grant
0.58       wait/synch/mutex/innodb/fil_system_mutex
0.08       wait/io/file/innodb/innodb_data_file

1000-row covering scan at 32 threads, adaptive hash index disabled

This is about 2X slower than the PK scan at 1000 rows. The waits here are about the same as the PK case but the top wait event, hash_table_locks, is less here than in the PK scan. There isn't a wait event that explains the difference in performance. Perhaps the difference comes from CPU instructions or cache misses.

Seconds    Event
8.14       wait/synch/sxlock/innodb/hash_table_locks
3.00       wait/synch/mutex/sql/THD::LOCK_query_plan
1.51       wait/synch/mutex/sql/THD::LOCK_thd_data
0.99       wait/synch/sxlock/innodb/index_tree_rw_lock
0.99       wait/synch/mutex/sql/THD::LOCK_thd_query
0.95       wait/synch/mutex/sql/LOCK_table_cache
0.77       wait/synch/mutex/innodb/trx_pool_mutex
0.66       wait/synch/rwlock/sql/LOCK_grant
0.62       wait/synch/mutex/innodb/fil_system_mutex
0.08       wait/io/file/innodb/innodb_data_file

1000-row non-covering scan at 32 threads, adaptive hash index disabled

Wow, compared to the covering index scan this has:

  • Much more wait time on hash_table_locks (1434.73 vs 8.14 seconds)
  • Much more time on index_tree_rw_lock (659.07 vs 0.99 seconds)

Seconds    Event
1434.73    wait/synch/sxlock/innodb/hash_table_locks
 659.07    wait/synch/sxlock/innodb/index_tree_rw_lock
   3.25    wait/synch/mutex/sql/THD::LOCK_query_plan
   1.51    wait/synch/mutex/sql/THD::LOCK_thd_data
   0.97    wait/synch/mutex/sql/THD::LOCK_thd_query
   0.87    wait/synch/mutex/sql/LOCK_table_cache
   0.71    wait/synch/mutex/innodb/trx_pool_mutex
   0.67    wait/synch/rwlock/sql/LOCK_grant
   0.44    wait/synch/mutex/innodb/fil_system_mutex
   0.08    wait/io/file/innodb/innodb_data_file

Missing time?

I am trying to explain why the 1000-row covering index scan is slower than the PK scan. Output from the performance schema doesn't explain the difference. Perhaps the PS isn't instrumenting something. Looking at non-hierarchical output from perf I see a difference. This is from the PK scan:

     8.58%   mysqld  mysqld                [.] row_search_mvcc(...)                                     
     8.00%   mysqld  mysqld                [.] ut_delay(unsigned long)                
     4.87%   mysqld  mysqld                [.] rec_get_offsets_func(...)
     3.50%   mysqld  mysqld                [.] mtr_t::Command::release_all()
     3.12%   mysqld  libc-2.14.1.so        [.] __memcpy_ssse3_back
     3.02%   mysqld  mysqld                [.] TTASEventMutex<TrackPolicy>::spin_and_wait(...)
     2.94%   mysqld  mysqld                [.] buf_page_optimistic_get(...)
     2.50%   mysqld  mysqld                [.] ha_innobase::general_fetch(...)

And this is from the covering index scan. For the secondary index scan the top two CPU consumers are from the mutex busy-wait loop, so there is mutex contention that isn't reported by the performance schema.

    24.30%   mysqld  mysqld                [.] ut_delay(unsigned long)
    16.51%   mysqld  mysqld                [.] TTASEventMutex<TrackPolicy>::spin_and_wait(...)
     4.54%   mysqld  mysqld                [.] mtr_t::Command::release_all()
     4.51%   mysqld  mysqld                [.] row_search_mvcc(...)
     2.70%   mysqld  mysqld                [.] _ZL22pfs_rw_lock_s_lock_lowP9rw_lock_tmPKcm.isra.17
     2.25%   mysqld  libc-2.14.1.so        [.] __memcpy_ssse3_back
     2.15%   mysqld  mysqld                [.] buf_page_optimistic_get(...)
     2.03%   mysqld  mysqld                [.] rec_get_offsets_func(...)

From hierarchical perf output this is the result for the PK scan:

     8.55%   mysqld  mysqld                [.] row_search_mvcc(...)
             |
             --- row_search_mvcc(...)
                |          
                |--99.11%-- ha_innobase::general_fetch(unsigned char*, unsigned int, unsigned int)
                |          handler::ha_index_next(unsigned char*)
                |          handler::read_range_next()
                |          handler::multi_range_read_next(char**)
                |          QUICK_RANGE_SELECT::get_next()
                |          _ZL8rr_quickP11READ_RECORD
                |          sub_select(JOIN*, QEP_TAB*, bool)
                |          JOIN::exec()
                |          mysql_select(...)
                |          handle_select(THD*, select_result*, unsigned long)
                |          _ZL21execute_sqlcom_selectP3THDP10TABLE_LIST
                |          mysql_execute_command(THD*)
                |          mysql_parse(THD*, Parser_state*)
                |          dispatch_command(enum_server_command, THD*, char*, unsigned long)
                |          handle_connection
                |          pfs_spawn_thread
                |          start_thread
                |          __clone
                |          
                |--0.54%-- ha_innobase::index_read(...)
                |          handler::ha_index_read_map(...)
                |          handler::read_range_first(st_key_range const*, st_key_range const*, bool, bool)
                |          handler::multi_range_read_next(char**)
                |          QUICK_RANGE_SELECT::get_next()
                |          _ZL8rr_quickP11READ_RECORD
                |          sub_select(JOIN*, QEP_TAB*, bool)
                |          JOIN::exec()
                |          mysql_select(...)
                |          handle_select(THD*, select_result*, unsigned long)
                |          _ZL21execute_sqlcom_selectP3THDP10TABLE_LIST
                |          mysql_execute_command(THD*)
                |          mysql_parse(THD*, Parser_state*)
                |          dispatch_command(enum_server_command, THD*, char*, unsigned long)
                |          handle_connection
                |          pfs_spawn_thread
                |          start_thread
                |          __clone
                 --0.34%-- [...]

     7.96%   mysqld  mysqld                [.] ut_delay(unsigned long)
             |
             --- ut_delay(unsigned long)
                |          
                |--99.96%-- _ZN11PolicyMutexI14TTASEventMutexI11TrackPolicyEE5enterEmmPKcm.constprop.95
                |          |          
                |          |--99.85%-- buf_page_optimistic_get(...)
                |          |          btr_cur_optimistic_latch_leaves(...)
                |          |          btr_pcur_restore_position_func(...)
                |          |          _ZL30sel_restore_position_for_mysqlPmmP10btr_pcur_tmP5mtr_t.constprop.75
                |          |          row_search_mvcc(...)
                |          |          ha_innobase::general_fetch(unsigned char*, unsigned int, unsigned int)
                |          |          handler::ha_index_next(unsigned char*)
                |          |          handler::read_range_next()
                |          |          handler::multi_range_read_next(char**)
                |          |          QUICK_RANGE_SELECT::get_next()
                |          |          _ZL8rr_quickP11READ_RECORD
                |          |          sub_select(JOIN*, QEP_TAB*, bool)
                |          |          JOIN::exec()
                |          |          mysql_select(...)
                |          |          handle_select(THD*, select_result*, unsigned long)
                |          |          _ZL21execute_sqlcom_selectP3THDP10TABLE_LIST
                |          |          mysql_execute_command(THD*)
                |          |          mysql_parse(THD*, Parser_state*)
                |          |          dispatch_command(enum_server_command, THD*, char*, unsigned long)
                |          |          handle_connection
                |          |          pfs_spawn_thread
                |          |          start_thread
                |          |          __clone
                |           --0.15%-- [...]
                 --0.04%-- [...]

That looks very different from the result for the covering index scan:

    24.49%   mysqld  mysqld                [.] ut_delay(unsigned long)
             |
             --- ut_delay(unsigned long)
                |          
                |--99.98%-- _ZN11PolicyMutexI14TTASEventMutexI11TrackPolicyEE5enterEmmPKcm.constprop.95
                |          |          
                |          |--98.79%-- buf_page_optimistic_get(...)
                |          |          btr_cur_optimistic_latch_leaves(...)
                |          |          btr_pcur_restore_position_func(...)
                |          |          _ZL30sel_restore_position_for_mysqlPmmP10btr_pcur_tmP5mtr_t.constprop.75
                |          |          row_search_mvcc(...)
                |          |          ha_innobase::general_fetch(unsigned char*, unsigned int, unsigned int)
                |          |          handler::ha_index_next(unsigned char*)
                |          |          handler::read_range_next()
                |          |          handler::multi_range_read_next(char**)
                |          |          QUICK_RANGE_SELECT::get_next()
                |          |          _ZL8rr_quickP11READ_RECORD
                |          |          sub_select(JOIN*, QEP_TAB*, bool)
                |          |          JOIN::exec()
                |          |          mysql_select(...)
                |          |          handle_select(THD*, select_result*, unsigned long)
                |          |          _ZL21execute_sqlcom_selectP3THDP10TABLE_LIST
                |          |          mysql_execute_command(THD*)
                |          |          mysql_parse(THD*, Parser_state*)
                |          |          dispatch_command(enum_server_command, THD*, char*, unsigned long)
                |          |          handle_connection
                |          |          pfs_spawn_thread
                |          |          start_thread
                |          |          __clone

    16.22%   mysqld  mysqld                [.] TTASEventMutex<TrackPolicy>::spin_and_wait(...)
             |
             --- TTASEventMutex<TrackPolicy>::spin_and_wait(...)
                |          
                |--99.71%-- _ZN11PolicyMutexI14TTASEventMutexI11TrackPolicyEE5enterEmmPKcm.constprop.95
                |          |          
                |          |--98.90%-- buf_page_optimistic_get(...)
                |          |          btr_cur_optimistic_latch_leaves(...)
                |          |          btr_pcur_restore_position_func(...)
                |          |          _ZL30sel_restore_position_for_mysqlPmmP10btr_pcur_tmP5mtr_t.constprop.75
                |          |          row_search_mvcc(...)
                |          |          ha_innobase::general_fetch(unsigned char*, unsigned int, unsigned int)
                |          |          handler::ha_index_next(unsigned char*)
                |          |          handler::read_range_next()
                |          |          handler::multi_range_read_next(char**)
                |          |          QUICK_RANGE_SELECT::get_next()
                |          |          _ZL8rr_quickP11READ_RECORD
                |          |          sub_select(JOIN*, QEP_TAB*, bool)
                |          |          JOIN::exec()
                |          |          mysql_select(...)
                |          |          handle_select(THD*, select_result*, unsigned long)
                |          |          _ZL21execute_sqlcom_selectP3THDP10TABLE_LIST
                |          |          mysql_execute_command(THD*)
                |          |          mysql_parse(THD*, Parser_state*)
                |          |          dispatch_command(enum_server_command, THD*, char*, unsigned long)
                |          |          handle_connection
                |          |          pfs_spawn_thread
                |          |          start_thread
                |          |          __clone
