Small Datum: August 2014

InnoDB provides a custom mutex and rw-lock implementation. I wondered how the InnoDB mutex performance compares to a pthread mutex on modern hardware and finally had time to test it. My test client is innotsim and includes code copied from InnoDB in MySQL 5.6. There is a lot of work-in-progress for InnoDB mutexes in 5.7 and I have yet to catch up to that work.

Summary

Generally, the performance is more constant for pthread mutex and more a function of the number of threads for the InnoDB mutex. The interesting result is the point (number of threads in the test) at which the lines cross and the InnoDB mutex changes from being faster to slower than pthread mutexes. There are a few performance metrics that matter. I only report on two of them -- the time per loop iteration which includes the overhead of (in)efficient mutex code and the CPU utilization during the test. I don't report on fairness here. Another interesting question for which I provide data but not analysis is whether the performance improvement from using a busy-wait loop in a mutex to wait for unlock is worth the additional CPU & power consumption.

My conclusions:

For low thread counts the InnoDB mutex had a better response time than the pthread variants.
The InnoDB mutex does much worse with many threads per mutex. This is probably due to code in InnoDB that wakes all threads waiting for a mutex lock when the mutex is unlocked.
The crossing point (number of threads) at which pthread mutexes outperform the InnoDB mutex increased as the lock hold time increased.
Performance with the default pthread mutex was usually better than with the adaptive pthread mutex
Assuming the vmstat CPU counters (us, sy, id) are correct then there was rarely any idle time for the tests with zero and 1000 nsec lock hold times. This means that all CPUs were saturated even for the test with 1000 nsec lock hold time and 1024 threads competing for one mutex.
There was significant idle time from the 4000 nsec lock hold test for the pthread mutexes but not for the InnoDB mutex. In that case InnoDB is using much more CPU than it should.
I didn't include stats for it, but innotsim has a custom mutex that does a busy-wait and then calls pthread_mutex_trylock. There is a variant that uses either a global counter or a per-mutex counter to limit the max number of threads that can spin concurrently. While the option to limit the number of max spinning threads helped performance at high concurrency it otherwise hurt performance. The likely cause is the overhead (memory transfers) from maintaining the extra counters.

InnoDB mutex

The custom InnoDB mutex adds value in terms of behavior (monitoring, debugging, correctness) but that might come at a cost in performance. I suspect that the value-added behavior could be implemented on something lighter weight using a pthread mutex. On x86-64 + Linux the InnoDB mutex uses a lock word and atomic instructions. When the lock word is already set a busy wait loop is done for up to ~50 usecs. The busy wait time is configurable, but doesn't adapt per mutex. When that limit is reached a slot is reserved in the sync array and this requires a short term lock/unlock of the sync array pthread mutex. Then the lock word is checked again and if still set the thread sleeps on an InnoDB os_event_t (pthread_cond_t + pthread_mutex_t) after storing a pointer to the os_event_t in the sync array. When the InnoDB mutex is unlocked then broadcast is done on the pthread_cond_t in the os_event_t. This broadcast can wake all threads trying to lock the InnoDB mutex but only one will get to lock it.

I assume this code arrived in the mid-90s when SMP systems were less common so the performance overhead from it wasn't a big deal. It also provides useful behavior. A few more comments on this:

Waking all threads waiting to lock the mutex when it is unlocked doesn't scale as the number of waiters increases. If all waiting threads are scheduled to run from the broadcast then this can have an O(N*N) CPU overhead for N threads. The overhead from this is visible in the graphs below.
A busy wait loop is done that uses the PAUSE instruction on x86. This gives a waiting thread a chance to get the lock without sleeping assuming the lock hold time is not too long. The wait time is configurable for InnoDB but it is not adaptive. There is a pthread mutex option on Linux that provides an adaptive, but not configurable, busy wait loop. The busy wait loop means that more user CPU time can be consumed. InnoDB does not have a limit on the max number of threads that can spin concurrently. I assume that the pthread adaptive mutex also lacks that. It might not be a good thing to have 1000 threads spinning in the busy-wait loop. For some of the configurations tested below limiting the max number of threads that spin was useful. I also don't know if the adaptive pthread mutex on Linux uses the PAUSE instruction and I haven't looked at current source code for glibc.
Threads wait on the InnoDB sync array. A background thread occasionally scans this sync array to find threads that have been waiting too long (many minutes) and will kill the server when one is found because that should not happen. Long waits for row locks could happen, but those don't use the sync array. Were pthread mutex used directly then each thread could use a loop of pthread_mutex_timedlock calls assuming that function is generally available. Otherwise, each InnoDB mutex could contain a pthread condition variable and then do timed sleeps on that. But using a condition variable means we again confront the problem of broadcast and waking all waiters on unlock.
Monitoring commands can display the threads that wait for each mutex. This has been very useful for debugging rare problems.
The mutex includes fields for the filename and line# of the caller that last locked it. Back in the day this was extremely useful to debug InnoDB.
Performance counters have been added to the InnoDB mutex to count the number of lock requests, the number of requests that require a busy wait loop and the number that require use of the sync array. However, with the introduction of the performance schema to upstream MySQL these counters might get removed.

Test setup

The test servers have 2 sockets, 20/40 CPUs/vCPUs (HT is enabled) and ran Linux 3.10. The test client is innotsim compiled with gcc 4.4.6 and the test server used glibc 2.12. We have a talented Linux kernel team, and I have been working with them to repeat tests on several versions of our 3.10 branch as performance bugs are fixed.

The innotsim client includes the mutex code extracted from InnoDB with a few changes to remove performance counters. innotsim is configured to use N threads and M mutexes. Each thread does J iterations of the benchmark work loop where each iteration locks a mutex, increments a per-mutex counter, does Y iterations of a work loop to simulate holding the lock and then unlocks the mutex. Tests used several mutex implementations: InnoDB, pthread adaptive and pthread default. Tests were repeated for the cross product of:

1, 2, 4, 8, 16 mutexes. Threads were evenly distributed across the mutexes. Threads were not pinned to cores. With 16 mutexes there was much less contention.
1, 2, 4, 8, 16, 20, 24, 36, 40, 48, 64, 128, 256, 512, 1024 threads
Lock hold times of 0, 1000 and 4000 nsecs

Graphs for 0 nsec lock hold

The graphs below display performance (nanoseconds per loop iteration) for the tests that use a 0 nanosecond lock hold duration and 1, 4 and 16 mutexes. In practice the lock hold duration is greater than 0 because a per-mutex counter is also incremented when the lock is held and that adds latency from memory system transfers. Note that:

InnoDB degrades significantly at high concurrency so that is most obvious when all (1024) threads share one mutex.
InnoDB does better than pthread adaptive & pthread default at low concurrency
pthread default does better than pthread adaptive at all times

Graphs for 1000 nsec lock hold

The graphs below display performance (nanoseconds per loop iteration) for the tests that use a ~1000 nanosecond lock hold duration and 1, 4 and 16 mutexes. Note that:

InnoDB degrades significantly at high concurrency so that is most obvious when all (1024) threads share one mutex.
The point at which pthread does better than InnoDB has shifted to the right (higher thread count required)
pthread default and pthread adaptive are similar

Graphs for 4000 nsec lock hold

The graphs below display performance (nanoseconds per loop iteration) for the tests that use a ~4000 nanosecond lock hold duration and 1, 4 and 16 mutexes. Results are similar to the 1000 nsec lock hold case.

CPU overhead

The question for which I am still working on an answer is whether the extra performance from the busy-wait loop in InnoDB and pthread adaptive is worth the CPU overhead and extra context switches. The innotsim client has a mutex variation that uses the busy-wait loop but limits the max number of threads that can spin. That is an attempt to get the benefit from busy-wait with less overhead. Perhaps I will return to that. For now I will share sample vmstat output during the tests. I collected it at 1 second intervals and share data from one of the samples for each of the test configurations.

This is vmstat output from the 0 nsec lock hold tests at 1024 threads. The columns are:

cs - context switch rate
us - %user CPU time
sy - %system CPU time
id - %idle CPU time

InnoDB mutex

#mutexes cs - us - sy - id

1 444925 - 27 - 71 - 2

2 446279 - 25 - 75 - 0

4 408583 - 36 - 64 - 0

8 223385 - 82 - 18 - 0

16 75598 - 97 - 3 - 0

pthread adaptive mutex

#mutexes cs - us - sy - id

1 73605 - 8 - 92 - 0

2 203305 - 24 - 76 - 0

4 614610 - 49 - 51 - 0

8 659100 - 72 - 28 - 0

16 343300 - 87 - 13 - 0

pthread default mutex

#mutexes cs - us - sy - id

1 48735 - 3 - 97 - 0

2 141863 - 5 - 95 - 0

4 337363 - 10 - 90 - 0

8 853007 - 18 - 82 - 0

16 966979 - 45 - 55 - 0

And this is vmstat output from the 1000 nsec lock hold tests at 1024 threads.

InnoDB mutex

#mutexes cs - us - sy - id

1 452304 - 24 - 75 - 1

2 441478 - 21 - 79 - 0

4 446445 - 23 - 77 - 0

8 434083 - 32 - 68 - 0

16 411418 - 43 - 57 - 0

pthread adaptive mutex

#mutexes cs - us - sy - id

1 102700 - 5 - 95 - 0

2 249238 - 16 - 84 - 0

4 718922 - 30 - 70 - 0

8 1172786 - 55 - 45 - 0

16 1017120 - 76 - 24 - 0

pthread default mutex

#mutexes cs - us - sy - id

1 98288 - 2 - 98 - 0

2 204332 - 4 - 96 - 0

4 580350 - 9 - 91 - 0

8 1050830 - 17 - 83 - 0

16 2711492 - 28 - 72 - 0

And vmstat output from the 4000 nsec lock hold tests at 1024 threads. Note that there is significant idle time for the pthread mutexes but not the InnoDB mutex.

InnoDB mutex

#mutexes cs - us - sy - id

1 455946 - 28 - 70 - 1

2 449311 - 24 - 76 - 0

4 443422 - 26 - 74 - 0

8 444105 - 39 - 61 - 0

16 461858 - 62 - 38 - 0

pthread adaptive mutex

#mutexes cs - us - sy - id

1 318129 - 2 - 1 - 96

2 645815 - 5 - 3 - 92

4 1222591 - 9 - 6 - 84

8 2300555 - 17 - 10 - 72

16 1806223 - 39 - 11 - 50

pthread default mutex

#mutexes cs - us - sy - id

1 312771 - 3 - 2 - 96

2 639662 - 4 - 3 - 93

4 1224032 - 9 - 6 - 85

8 2298452 - 19 - 11 - 70

16 1892771 - 39 - 9 - 52

Asynchronous replication is lossy by nature. It is also useful in production especially for replication across a WAN where the latency for one or two network roundtrips between replicas is too much. Latency is reduced when replicas are nearby but that greatly increases cost from using more replicas. In MySQL land we have work-in-progress that uses enhanced semi-sync replication and binlog archivers to make the binlog 2-safe within a datacenter and greatly reduce the chance that a committed transaction is lost (some people call this lossless). Even better, it can be done without running another local replica so it has a minor impact on the cost of the tier and the network latency overhead from it is very low.

In a perfect world we could always afford the network latency and/or extra hardware cost from synchronous replication. But production would be much less fun in a perfect world. Even for systems that have implemented sync replication I am skeptical that the system behaves as advertised until it has aged in production for a few years

Replication can also be lossy in practice because of bugs. MySQL replication slaves were not crash-safe for too long because the replication state was maintained in a file that wasn't kept consistent with InnoDB on crash recovery. This was first fixed in the Google patch for MySQL via rpl_transaction_enabled. It was fixed many years later in upstream MySQL.

Reputations from bugs and missing features can linger long after the bugs have been fixed and the features have been implemented. MySQL suffered from this because of MyISAM but some of the damage was self inflicted. Compare the claims about integrity for atomic operations & MyISAM in an older version of the manual with the current version.

In transactional terms, MyISAM tables effectively always operate in autocommit = 1 mode. Atomic operations often offer comparable integrity with higher performance.

MongoDB had a MyISAM moment with the initial version of its storage engine that was not crash safe. Fortunately they rallied and added journalling. I think they are having another MyISAM moment with replication. The TokuMX team has an excellent series of posts and a great white paper describing performance and correctness problems that exist in MongoDB and have been fixed in TokuMX. The problems include slow elections and unexpected data loss on master failover.

The post by Aphyr is also worth reading. While there read the posts on other systems. A lot of innovative correctness testing & debugging was done to produce those results.

Will these problems be fixed before the reputation sticks? This is a good reason to consider TokuMX.

Small Datum

Thursday, August 28, 2014