Tuesday, December 10, 2019

Fixing the InnoDB RW-lock

I hope that someone updates the InnoDB Wikipedia page to explain how it came to be. I know it started with Heikki Tuuri. At what point did he get help? Regardless, early InnoDB was an amazing accomplishment and fortunately MySQL saw the potential for it after other large DBMS vendors were not interested.

I enjoyed reading the source code -- it is well written with useful comments. Some of the comments are amusing at this point with references to HW assumptions that were valid in the early and mid 90s. InnoDB had, and might still have, a SQL parser that is used to maintain dictionary tables. Fortunately the code has aged well and been significantly improved by the InnoDB team.

InnoDB and SMP

InnoDB had a big problem in 2005. Commodity SMP hardware was changing from 4-core/1-socket to 8-core/2-socket and InnoDB wasn't ready for it. My memory is that MySQL in 2005 saturated at 10k QPS with sysbench and moving from 1 to 2 sockets hurt QPS.

InnoDB had many problems. A lot of work was done to fix them by the InnoDB team, Percona, Google, Facebook and others. I have forgotten a lot of the details but much is recorded in blog posts and slide decks. The InnoDB rw-lock was a big part of the problem. InnoDB implements its own rw-lock and mutex. Both add value by including monitoring for performance and debugging. The rw-lock is also required to support special use cases (I have forgotten the reasons, but they are valid). Unfortunately both the mutex and rw-lock were lousy on 2-socket servers under contention.

The problem was too much spinning. The rw-lock used the InnoDB mutex to protect its internal state. Both the rw-lock and mutex did some spinning before going to sleep when the lock could not be acquired. This is similar to PTHREAD_MUTEX_ADAPTIVE_NP but the spinning is configurable. The InnoDB rw-lock did spinning on its own and then could do even more spinning when using the InnoDB mutex that guarded its state. There was so much spinning.
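The double-spinning pattern described above can be sketched as follows. This is a minimal illustration, not InnoDB's actual code; all names (SpinMutex, RWLock, kSpinRounds) are hypothetical, and the real implementation slept on an OS event after spinning rather than yielding.

```cpp
#include <atomic>
#include <thread>

// Hypothetical sketch: a mutex that spins before giving up the CPU.
// In InnoDB the spin count was configurable (innodb_sync_spin_loops).
struct SpinMutex {
    std::atomic<bool> locked{false};
    static constexpr int kSpinRounds = 30;

    void lock() {
        while (true) {
            for (int i = 0; i < kSpinRounds; i++) {  // first level of spinning
                bool expected = false;
                if (locked.compare_exchange_weak(expected, true))
                    return;
            }
            std::this_thread::yield();  // real InnoDB slept on a sync event here
        }
    }
    void unlock() { locked.store(false); }
};

// A rw-lock whose internal state is guarded by the spinning mutex above.
// A contended reader can spin twice: once in its own retry loop, and
// again inside the mutex that protects the counters.
struct RWLock {
    SpinMutex mutex;  // guards the fields below
    int readers = 0;
    bool writer = false;

    bool try_read_lock() {
        mutex.lock();                // second level of spinning hides here
        bool ok = !writer;
        if (ok) readers++;
        mutex.unlock();
        return ok;
    }
    void read_unlock() {
        mutex.lock();
        readers--;
        mutex.unlock();
    }
    void read_lock() {
        while (!try_read_lock())     // the rw-lock's own spin/retry loop
            std::this_thread::yield();
    }
};
```

Under contention, threads burn cycles in the outer retry loop and then burn more inside SpinMutex::lock, which is why the cost compounded on 2-socket machines.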

The performance impact from this work is explained here.

A Solution

My memory is that Yasufumi Kinoshita implemented the solution first and I learned of that from his presentation at Percona Live. He might have been working at NTT at the time. Inspired by him, my team at Google implemented a similar solution, wrote a Spin model to validate the correctness and then contributed the fix to the InnoDB team who were part of Oracle. The Spin model helped convince the InnoDB team that the code might be correct. I am a huge fan of Spin.

I am proud of the work my team at Google did to fix the rw-lock and it was easy to work with the InnoDB team -- they have always been a great partner for external InnoDB contributors. But I also want to give credit for the person who first showed how to fix the InnoDB rw-lock as he has a long history of making InnoDB better.

Other fixes

Much more has been done to make InnoDB great for many-core servers. Back when the rw-lock was fixed, the mutex was also changed to use atomic operations rather than a pthread mutex (at least for x86, hopefully for ARM now). My early blog posts mention that and changes to the InnoDB memory heap mutex. I remember nothing about the InnoDB heap mutex. Some of my early docs were published on code.google.com (since shut down by Google). I archived some of these before the shutdown and will try to republish them.
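The atomic-operations approach can be sketched as a simple test-and-set lock. This is an illustration of the general technique, not the InnoDB implementation (which used compiler builtins such as __sync_lock_test_and_set on x86); the class name is made up.

```cpp
#include <atomic>
#include <thread>

// Hypothetical sketch of a mutex built on atomics instead of pthread_mutex_t.
// Acquire/release ordering makes the critical section visible across threads.
class AtomicMutex {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Spin on test-and-set; a production lock would spin a bounded
        // number of rounds and then sleep on an OS event.
        while (flag.test_and_set(std::memory_order_acquire))
            std::this_thread::yield();
    }
    void unlock() { flag.clear(std::memory_order_release); }
};
```

The win over a pthread mutex at the time was a cheaper uncontended acquire path and more control over spin-versus-sleep behavior.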

4 comments:

  1. At the time it would have been something other than Percona Live. But great reminder of the kind of talks you could see at the MySQL user conference, and what effect one talk can have for an entire product ecosystem.

    ReplyDelete
    Replies
    1. You are right. Maybe someone can share a link for this talk.

      Delete
  2. The "memory heap mutex" probably was the one that protected a global linked list of all memory that had been malloc()ed by InnoDB, so that InnoDB could hide memory leaks from Valgrind. (Yes, you read that right!) That mutex would trivially end up being a scalability bottleneck.

    An alternative to the harmful memory allocator wrapper was implemented and it was possible to enable it by setting the startup parameter innodb_use_sys_malloc. It took quite a few releases to first make this the default setting, and then to deprecate and remove the parameter and the original implementation.

    Related to this, I think it was in MySQL 5.7 when I replaced the useless wrappers mem_alloc() and mem_free() with ut_malloc() and ut_free(). The mem_alloc() would create an "anonymous" mem_heap_t, basically adding overhead for no good reason.

    Another scalability bottleneck was the InnoDB kernel_mutex. It was split into trx_sys->mutex, lock_sys->mutex and possibly something else. Starting with MariaDB Server 10.3, the trx_sys.mutex (we removed pointer indirection for the singleton object) should be less of a contention point, because Sergey Vojtovich implemented a lock-free trx_sys.rw_trx_hash and a more efficient way of creating read views for MVCC and purge.

    ReplyDelete
    Replies
    1. I hope someone puts together a list of the work done by upstream, Percona and MariaDB. They did so much. Even the things we did at FB and Goog were hard to keep track of. But the work from others was bigger.

      Delete
