Wednesday, August 27, 2014

The InnoDB mutex

InnoDB provides a custom mutex and rw-lock implementation. I wondered how the performance of the InnoDB mutex compares to that of a pthread mutex on modern hardware and finally had time to test it. My test client is innotsim and includes code copied from InnoDB in MySQL 5.6. There is a lot of work in progress for InnoDB mutexes in 5.7 and I have yet to catch up to that work.

Summary

Generally, performance is more constant for the pthread mutexes and more a function of the number of threads for the InnoDB mutex. The interesting result is the point (the number of threads in the test) at which the lines cross and the InnoDB mutex changes from being faster to slower than the pthread mutexes. There are a few performance metrics that matter. I only report on two of them: the time per loop iteration, which includes the overhead of (in)efficient mutex code, and the CPU utilization during the test. I don't report on fairness here. Another interesting question, for which I provide data but not analysis, is whether the performance improvement from using a busy-wait loop in a mutex to wait for unlock is worth the additional CPU & power consumption.

My conclusions:

  • For low thread counts the InnoDB mutex had a better response time than the pthread variants.
  • The InnoDB mutex does much worse with many threads per mutex. This is probably due to code in InnoDB that wakes all threads waiting for a mutex lock when the mutex is unlocked.
  • The crossing point (number of threads) at which pthread mutexes outperform the InnoDB mutex increased as the lock hold time increased.
  • Performance with the default pthread mutex was usually better than with the adaptive pthread mutex.
  • Assuming the vmstat CPU counters (us, sy, id) are correct, there was rarely any idle time for the tests with 0 and 1000 nsec lock hold times. This means that all CPUs were saturated even for the test with a 1000 nsec lock hold time and 1024 threads competing for one mutex.
  • There was significant idle time in the 4000 nsec lock hold test for the pthread mutexes but not for the InnoDB mutex. In that case InnoDB is using much more CPU than it should.
  • I didn't include stats for it, but innotsim has a custom mutex that does a busy-wait and then calls pthread_mutex_trylock. There is a variant that uses either a global counter or a per-mutex counter to limit the max number of threads that can spin concurrently. While the option to limit the max number of spinning threads helped performance at high concurrency, it otherwise hurt performance. The likely cause is the overhead (memory transfers) from maintaining the extra counters.

InnoDB mutex

The custom InnoDB mutex adds value in terms of behavior (monitoring, debugging, correctness) but that might come at a cost in performance. I suspect that the value-added behavior could be implemented on something lighter weight that uses a pthread mutex. On x86-64 + Linux the InnoDB mutex uses a lock word and atomic instructions. When the lock word is already set, a busy-wait loop runs for up to ~50 usecs. The busy-wait time is configurable, but doesn't adapt per mutex. When that limit is reached, a slot is reserved in the sync array, which requires a short-term lock/unlock of the sync array pthread mutex. Then the lock word is checked again and, if still set, the thread sleeps on an InnoDB os_event_t (pthread_cond_t + pthread_mutex_t) after storing a pointer to the os_event_t in the sync array. When the InnoDB mutex is unlocked, a broadcast is done on the pthread_cond_t in the os_event_t. This broadcast can wake all threads trying to lock the InnoDB mutex, but only one will get the lock.
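
To make the description above concrete, here is a simplified sketch of that lock path. This is not the InnoDB source: it uses C11 atomics, the names are illustrative (ib_mutex_t, and os_event_t as a stand-in for the real struct) and the sync array bookkeeping is elided.

/* Simplified sketch of the InnoDB-style mutex described above.
 * Not the actual InnoDB code; names and constants are illustrative. */
#include <pthread.h>
#include <stdatomic.h>

#define SPIN_ROUNDS 30            /* stands in for the ~50 usec busy wait */

typedef struct {                  /* stand-in for InnoDB os_event_t */
  pthread_mutex_t mtx;
  pthread_cond_t  cond;
  int             signaled;
} os_event_t;

typedef struct {
  atomic_int lock_word;           /* 0 = unlocked, 1 = locked */
  os_event_t event;               /* waiters sleep here via the sync array */
} ib_mutex_t;

static void ib_mutex_init(ib_mutex_t *m) {
  atomic_init(&m->lock_word, 0);
  pthread_mutex_init(&m->event.mtx, NULL);
  pthread_cond_init(&m->event.cond, NULL);
  m->event.signaled = 0;
}

static void ib_mutex_lock(ib_mutex_t *m) {
  for (;;) {
    /* Busy-wait: try to set the lock word with an atomic operation. */
    for (int spins = 0; spins < SPIN_ROUNDS; spins++) {
      int expected = 0;
      if (atomic_compare_exchange_strong(&m->lock_word, &expected, 1))
        return;                   /* got the lock while spinning */
      __builtin_ia32_pause();     /* PAUSE instruction on x86 */
    }
    /* Slow path: reserve a sync array slot (elided), recheck the lock
     * word and sleep on the event if it is still set. */
    pthread_mutex_lock(&m->event.mtx);
    int expected = 0;
    if (atomic_compare_exchange_strong(&m->lock_word, &expected, 1)) {
      pthread_mutex_unlock(&m->event.mtx);
      return;
    }
    m->event.signaled = 0;
    while (!m->event.signaled)
      pthread_cond_wait(&m->event.cond, &m->event.mtx);
    pthread_mutex_unlock(&m->event.mtx);
    /* Woken by the broadcast in unlock: compete for the lock again. */
  }
}

static void ib_mutex_unlock(ib_mutex_t *m) {
  atomic_store(&m->lock_word, 0);
  /* Wake every thread sleeping on the event. Only one waiter wins the
   * lock word; the rest spin and may sleep again. */
  pthread_mutex_lock(&m->event.mtx);
  m->event.signaled = 1;
  pthread_cond_broadcast(&m->event.cond);
  pthread_mutex_unlock(&m->event.mtx);
}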

I assume this code arrived in the mid-90s, when SMP systems were less common, so the performance overhead from it wasn't a big deal. It also provides useful behavior. A few more comments on this:
  • Waking all threads waiting to lock the mutex when it is unlocked doesn't scale as the number of waiters increases. If all waiting threads are scheduled to run by the broadcast then this can have O(N*N) CPU overhead for N threads. The overhead from this is visible in the graphs below.
  • The busy-wait loop uses the PAUSE instruction on x86. This gives a waiting thread a chance to get the lock without sleeping, assuming the lock hold time is not too long. The wait time is configurable for InnoDB but it is not adaptive. There is a pthread mutex option on Linux that provides an adaptive, but not configurable, busy-wait loop. The busy-wait loop means that more user CPU time can be consumed. InnoDB does not limit the max number of threads that can spin concurrently, and I assume the pthread adaptive mutex also lacks that limit. It might not be a good thing to have 1000 threads spinning in the busy-wait loop. For some of the configurations tested below, limiting the max number of spinning threads was useful. I also don't know whether the adaptive pthread mutex on Linux uses the PAUSE instruction, and I haven't looked at current source code for glibc.
  • Threads wait on the InnoDB sync array. A background thread occasionally scans the sync array to find threads that have been waiting too long (many minutes) and kills the server when one is found, because that should not happen. Long waits for row locks could happen, but those don't use the sync array. Were pthread mutexes used directly, each thread could use a loop of pthread_mutex_timedlock calls, assuming that function is generally available (see the sketch after this list). Otherwise, each InnoDB mutex could contain a pthread condition variable and do timed sleeps on that, but using a condition variable means we again confront the problem of broadcast waking all waiters on unlock.
  • Monitoring commands can display the threads that wait for each mutex. This has been very useful for debugging rare problems.
  • The mutex includes fields for the filename and line# of the caller that last locked it. Back in the day this was extremely useful to debug InnoDB.
  • Performance counters have been added to the InnoDB mutex to count the number of lock requests, the number of requests that require the busy-wait loop and the number that require use of the sync array. However, with the introduction of the Performance Schema in upstream MySQL these counters might be removed.
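
For reference, here is a rough sketch of the pthread_mutex_timedlock alternative mentioned in the list above. This is not InnoDB code; the 1 second poll interval, the 600 second threshold and the function name are made up for illustration.

/* Sketch of a lock with a long-wait check via pthread_mutex_timedlock.
 * Not InnoDB code; the threshold and names are illustrative. */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LONG_WAIT_SECS 600        /* "waiting too long" threshold */

static void lock_with_long_wait_check(pthread_mutex_t *m, const char *name) {
  int waited = 0;
  for (;;) {
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;         /* wait at most 1 second per attempt */

    int rc = pthread_mutex_timedlock(m, &deadline);
    if (rc == 0)
      return;                     /* got the lock */
    if (rc != ETIMEDOUT) {
      fprintf(stderr, "lock %s failed: %d\n", name, rc);
      abort();
    }
    if (++waited >= LONG_WAIT_SECS) {
      /* InnoDB kills the server here because a mutex should never be
       * held this long. */
      fprintf(stderr, "wait on %s exceeded %d seconds\n", name, LONG_WAIT_SECS);
      abort();
    }
  }
}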

Test setup

The test servers have 2 sockets and 20 cores / 40 vCPUs (HT is enabled) and ran Linux 3.10. The test client is innotsim, compiled with gcc 4.4.6, and the test server used glibc 2.12. We have a talented Linux kernel team, and I have been working with them to repeat tests on several versions of our 3.10 branch as performance bugs are fixed.

The innotsim client includes the mutex code extracted from InnoDB, with a few changes to remove performance counters. innotsim is configured to use N threads and M mutexes. Each thread does J iterations of the benchmark work loop, where each iteration locks a mutex, increments a per-mutex counter, does Y iterations of a work loop to simulate holding the lock and then unlocks the mutex (a sketch of this loop appears after the list below). Tests used several mutex implementations: InnoDB, pthread adaptive and pthread default. Tests were repeated for the cross product of:
  • 1, 2, 4, 8, 16 mutexes. Threads were evenly distributed across the mutexes. Threads were not pinned to cores. With 16 mutexes there was much less contention.
  • 1, 2, 4, 8, 16, 20, 24, 36, 40, 48, 64, 128, 256, 512, 1024 threads
  • Lock hold times of 0, 1000 and 4000 nsecs
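
Below is a sketch of what the per-thread benchmark loop might look like for the pthread variants. This is not the innotsim source; the struct and function names are my own, and PTHREAD_MUTEX_ADAPTIVE_NP is the Linux-specific type used for the pthread adaptive case.

/* Sketch of the benchmark loop for the pthread variants. Not the
 * innotsim source; names and structure are illustrative. */
#define _GNU_SOURCE               /* for PTHREAD_MUTEX_ADAPTIVE_NP */
#include <pthread.h>

struct bench_mutex {
  pthread_mutex_t mtx;
  unsigned long counter;          /* incremented while the lock is held */
  char pad[64];                   /* reduce false sharing between mutexes */
};

struct worker_arg {
  struct bench_mutex *m;          /* the mutex assigned to this thread */
  long iterations;                /* J: loop iterations per thread */
  long hold_spins;                /* Y: work-loop iterations while locked */
};

static void bench_mutex_init(struct bench_mutex *m, int adaptive) {
  pthread_mutexattr_t attr;
  pthread_mutexattr_init(&attr);
  if (adaptive)
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
  pthread_mutex_init(&m->mtx, &attr);
  pthread_mutexattr_destroy(&attr);
  m->counter = 0;
}

static void *worker(void *varg) {
  struct worker_arg *arg = varg;
  volatile long sink = 0;         /* keep the work loop from being optimized away */
  for (long i = 0; i < arg->iterations; i++) {
    pthread_mutex_lock(&arg->m->mtx);
    arg->m->counter++;            /* the per-mutex counter from the post */
    for (long j = 0; j < arg->hold_spins; j++)
      sink += j;                  /* simulate the lock hold time */
    pthread_mutex_unlock(&arg->m->mtx);
  }
  return NULL;
}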

Graphs for 0 nsec lock hold

The graphs below display performance (nanoseconds per loop iteration) for the tests that use a 0 nanosecond lock hold duration and 1, 4 and 16 mutexes. In practice the lock hold duration is greater than 0 because a per-mutex counter is also incremented when the lock is held and that adds latency from memory system transfers. Note that:
  • InnoDB degrades significantly at high concurrency; this is most obvious when all (1024) threads share one mutex.
  • InnoDB does better than pthread adaptive & pthread default at low concurrency.
  • pthread default does better than pthread adaptive at all thread counts.



Graphs for 1000 nsec lock hold

The graphs below display performance (nanoseconds per loop iteration) for the tests that use a ~1000 nanosecond lock hold duration and 1, 4 and 16 mutexes. Note that:
  • InnoDB degrades significantly at high concurrency; this is most obvious when all (1024) threads share one mutex.
  • The point at which pthread does better than InnoDB has shifted to the right (a higher thread count is required).
  • pthread default and pthread adaptive are similar.




Graphs for 4000 nsec lock hold

The graphs below display performance (nanoseconds per loop iteration) for the tests that use a ~4000 nanosecond lock hold duration and 1, 4 and 16 mutexes. Results are similar to the 1000 nsec lock hold case.



CPU overhead

The question for which I am still working on an answer is whether the extra performance from the busy-wait loop in InnoDB and pthread adaptive is worth the CPU overhead and extra context switches. The innotsim client has a mutex variation that uses the busy-wait loop but limits the max number of threads that can spin. That is an attempt to get the benefit from busy-wait with less overhead. Perhaps I will return to that. For now I will share sample vmstat output during the tests. I collected it at 1 second intervals and share data from one of the samples for each of the test configurations.
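
For reference, this is a rough sketch of the spin-then-trylock mutex with a cap on the number of concurrent spinners. It is not the innotsim code; the constants and names are illustrative, and the atomic counter it maintains is the source of the extra memory traffic mentioned in the summary.

/* Sketch of a busy-wait + pthread_mutex_trylock mutex that limits the
 * number of concurrent spinners. Not the innotsim code; constants and
 * names are illustrative. */
#include <pthread.h>
#include <stdatomic.h>

#define MAX_SPINNERS 4            /* per-mutex cap on concurrent spinners */
#define SPIN_ROUNDS  100

typedef struct {
  pthread_mutex_t mtx;
  atomic_int spinners;            /* the extra counter that adds overhead */
} spin_mutex_t;

static void spin_mutex_init(spin_mutex_t *m) {
  pthread_mutex_init(&m->mtx, NULL);
  atomic_init(&m->spinners, 0);
}

static void spin_mutex_lock(spin_mutex_t *m) {
  /* Spin only if we are under the cap; otherwise block right away. */
  if (atomic_fetch_add(&m->spinners, 1) < MAX_SPINNERS) {
    for (int i = 0; i < SPIN_ROUNDS; i++) {
      if (pthread_mutex_trylock(&m->mtx) == 0) {
        atomic_fetch_sub(&m->spinners, 1);
        return;
      }
      __builtin_ia32_pause();     /* PAUSE on x86 */
    }
  }
  atomic_fetch_sub(&m->spinners, 1);
  pthread_mutex_lock(&m->mtx);    /* give up and let the kernel block us */
}

static void spin_mutex_unlock(spin_mutex_t *m) {
  pthread_mutex_unlock(&m->mtx);
}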

This is vmstat output from the 0 nsec lock hold tests at 1024 threads. The columns are:
  • cs - context switch rate
  • us - %user CPU time
  • sy - %system CPU time
  • id - %idle CPU time
InnoDB mutex
#mutexes        cs - us - sy - id
       1    444925 - 27 - 71 -  2 
       2    446279 - 25 - 75 -  0 
       4    408583 - 36 - 64 -  0 
       8    223385 - 82 - 18 -  0 
      16     75598 - 97 -  3 -  0 

pthread adaptive mutex
#mutexes        cs - us - sy - id
       1     73605 -  8 - 92 -  0 
       2    203305 - 24 - 76 -  0 
       4    614610 - 49 - 51 -  0 
       8    659100 - 72 - 28 -  0 
      16    343300 - 87 - 13 -  0 

pthread default mutex
#mutexes        cs - us - sy - id
       1     48735 -  3 - 97 -  0 
       2    141863 -  5 - 95 -  0 
       4    337363 - 10 - 90 -  0 
       8    853007 - 18 - 82 -  0 
      16    966979 - 45 - 55 -  0 

And this is vmstat output from the 1000 nsec lock hold tests at 1024 threads.

InnoDB mutex
#mutexes        cs - us - sy - id
       1    452304 - 24 - 75 -  1
       2    441478 - 21 - 79 -  0
       4    446445 - 23 - 77 -  0 
       8    434083 - 32 - 68 -  0 
      16    411418 - 43 - 57 -  0 

pthread adaptive mutex
#mutexes        cs - us - sy - id
       1    102700 -  5 - 95 -  0 
       2    249238 - 16 - 84 -  0 
       4    718922 - 30 - 70 -  0 
       8   1172786 - 55 - 45 -  0 
      16   1017120 - 76 - 24 -  0 

pthread default mutex
#mutexes        cs - us - sy - id
       1     98288 -  2 - 98 -  0 
       2    204332 -  4 - 96 -  0 
       4    580350 -  9 - 91 -  0 
       8   1050830 - 17 - 83 -  0 
      16   2711492 - 28 - 72 -  0 

And vmstat output from the 4000 nsec lock hold tests at 1024 threads. Note that there is significant idle time for the pthread mutexes but not the InnoDB mutex.

InnoDB mutex
#mutexes        cs - us - sy - id
       1    455946 - 28 - 70 -  1
       2    449311 - 24 - 76 -  0
       4    443422 - 26 - 74 -  0
       8    444105 - 39 - 61 -  0
      16    461858 - 62 - 38 -  0

pthread adaptive mutex
#mutexes        cs - us - sy - id
       1    318129 -  2 -  1 - 96
       2    645815 -  5 -  3 - 92
       4   1222591 -  9 -  6 - 84
       8   2300555 - 17 - 10 - 72
      16   1806223 - 39 - 11 - 50

pthread default mutex
#mutexes        cs - us - sy - id
       1    312771 -  3 -  2 - 96 
       2    639662 -  4 -  3 - 93 
       4   1224032 -  9 -  6 - 85 
       8   2298452 - 19 - 11 - 70 
      16   1892771 - 39 -  9 - 52 

3 comments:

  1. Courtesy of https://twitter.com/morgo
    http://dev.mysql.com/worklog/task/?id=6044

  2. One of the XtraDB 5.6 focus areas was mutexes and rwlocks. We have extended the custom InnoDB mutex, and I am not sure whether we could have done the same with pthread mutexes or futexes.

    The extension we did was a priority mutex: add a new event to the mutex, add new API for acquiring the mutex with high priority and low priority. Whenever there is a high priority waiter, 1) new incoming low priority acquisition requests skip spinning completely; 2) only the high priority event is raised on mutex unlock.

    The use case for this was empty buffer pool free list refill: once you have hundreds of query threads looping in an attempt to get a free page - lock mutex, check whether the free list is empty, unlock, repeat, go to single page flush after a few repeats - a page cleaner thread that has produced a new free page has a very hard time putting it on the free list: it's just one more thread in the thundering herd. A vicious circle ensues. The priority mutex helped here as expected.

    We then tried to use this framework in ensuring that InnoDB utility threads aren't starved when there is a high number of query threads by converting selected other mutexes and rwlocks to the priority framework. It sort of helped, but we ended up going the route of raising the utility thread scheduling priority relative to query threads instead.

    We then discovered that the priority mutex by itself was unable to solve the free list refill problem completely. An issue of high lock/unlock rate by the query threads remained. For this (and IIRC, for the index lock?) we experimented with adaptive spinning/waiting inside the mutex/rwlock. IIRC it was easy to get some improvement but hard to get significant and consistent improvement across the board. We ended up solving the problem at the higher level, by introducing adaptive sleep for the free list page users.

    To conclude, we found it hard to use a bottom-up approach where we'd attack the mutex framework in isolation. We have achieved some results by a top-down approach where we analysed a particular issue and made changes to both mutexes and high-level code using that mutex.

    The current 5.7 code has the new mutex framework but not many uses for it. I am looking forward to new code pushes with those uses and benchmarks to support them.

    Replies
    1. My philosophy is that good mutex implementations make things not worse, bad mutex implementations make things worse. I think your results reflect that and the mutex-using code needs to be fixed. I worked on making free-list reclaim more efficient for high-concurrency, high-IOPs workloads, but then 5.6 totally changed things and made my changes useless. I like the 5.6 changes, even if I lost some nice patches. I have not looked at 5.6 under high IOPs load.

