Wednesday, April 23, 2014

RW locks are hard

MongoDB and TokuMX saturated at a lower QPS rate then MySQL when running read-only workloads on a cached database with high concurrency. Many of the stalls were on the per-database RW-lock and I was curious about the benefit from removing that lock. I hacked MongoDB to not use the RW-lock per query (not safe for production) and repeated the test. I got less than 5% more QPS at 32 concurrent clients. I expected more, looked at performance with PMP and quickly realized there were several other sources of mutex contention that are largely hidden by contention on the per-database RW-lock. So this problem won't be easy to fix but I think it can be fixed.

The easy way to implement a reader-writer lock uses the pattern listed below. That includes pthread_rwlock_t in glibc the last time I checked and the per-database RW-lock used by MongoDB. InnoDB used this pattern many years ago and then we rewrote it to make InnoDB better on multi-core. An implementation like this tends to have problems on multi-core servers. The first problem is from locking/unlocking the internal mutex at least twice per use, once to get it in read or write mode and then again to unlock it. When there is contention it can be locked/unlocked many more times than twice per use from threads that wait, wake-up and then wait again. If the operation protected by this RW-lock is very fast then a mutex is usually a better choice. Note that even when all threads are trying to lock in read mode there is still contention on the internal mutex ("mtx" below). Another problem occurs when the thread trying to unlock a RW-lock is blocked trying to lock the internal state mutex ("mtx" below). There might be other threads waiting to run as soon as the unlock gets through but the unlock is stalled because incoming lock requests are competing for the same mutex ("mtx"). I have seen many PMP thread stacks where the unlocking thread is stuck on the lock_mutex call.

    lock(mode)
        lock_mutex(mtx)
        wait_until_lock_granted(mode)
        modify_lock_state()
        unlock_mutex(mtx)

    unlock()
        lock_mutex(mtx)
        modify_lock_state()
        notify_some_waiters()
        unlock_mutex(mtx)

Something better

The alternative that scales better is to use a lock-free approach to get and set internal state in the RW-lock. We did this as part of the Google MySQL patch many years ago and that code was contributed upstream. Such an approach removes much of the contention added by an inefficient RW-lock. It won't prevent contention added because threads want the lock in read and write mode at the same time. That still requires some threads to wait. When we did the work at Google on the InnoDB RW-lock, Yasufumi Kinoshita was working on a similar change. I am very happy he continues to make InnoDB better.

A lock-free implementation for a DBMS is likely to be much more complex than what you might read about on the web or a top-tier systems conference paper. There is more complexity because of the need to support performance monitoring, manageability, special semantics and the occasional wrong design decision. For performance monitoring we need to know how frequently a lock is used and how long threads wait on it. For manageability we need to know what threads wait on a lock and which thread holds it. A frequent pattern is for today's special semantics to become tomorrow's design decisions that we regret. But we can't expect perfection given the need to move fast and the rate at which hardware changes.

The low-level reader-writer lock in MongoDB, QLock, is a RW-lock with special semantics. It has two modes each for read and write locks:  r, R, w and W. It also supports upgrades and downgrades: W to R, R to W, w to X and X to w (I didn't mention X above). Internally there are 6 condition variables, one each for r, R, w and W and then two others, U and X, to support upgrades and downgrades. Read the source for more details. I don't understand the code enough to guess whether lock-free state changes can be supported as they were for the InnoDB RW-lock.

MongoDB details

I spent a few hours browsing the source for the MongoDB RW-lock and these are my notes. I hope they help you, otherwise they will be a reference for me in the future. Queries that call find to fetch one row by PK start to run in mongod via the newRunQuery function. That gets the per-database RW-lock in read mode by creating a Client::ReadContext object on the stack and ReadContext gets the per-database RW-lock in read mode:

    /** "read lock, and set my context, all in one operation"
     *  This handles (if not recursively locked) opening an unopened database.
     */
    Client::ReadContext::ReadContext(const string& ns, const std::string& path) {
        {
            lk.reset( new Lock::DBRead(ns) );
            Database *db = dbHolder().get(ns, path);
            if( db ) {
                c.reset( new Context(path, ns, db) );
                return;
            }
        }
        ...

The dbHolder().get() call above locks a mutex in DatabaseHolder while using the database name to find the database object. There is simple string searching while the mutex is locked. It might be easy to move some of that work outside the scope of the mutex and perhaps use a mutex per hash table bucket.

        Database * get( const string& ns , const string& path ) const {
            SimpleMutex::scoped_lock lk(_m);
            Lock::assertAtLeastReadLocked(ns);
            Paths::const_iterator x = _paths.find( path );
            if ( x == _paths.end() )
                return 0;
            const DBs& m = x->second;
            string db = _todb( ns );
            DBs::const_iterator it = m.find(db);
            if ( it != m.end() )
                return it->second;
            return 0;
        }

        static string __todb( const string& ns ) {
            size_t i = ns.find( '.' );
            if ( i == string::npos ) {
                uassert( 13074 , "db name can't be empty" , ns.size() );
                return ns;
            }
            uassert( 13075 , "db name can't be empty" , i > 0 );
            return ns.substr( 0 , i );

        }

Lets get back to the DBRead constructor that was called in the ReadContext constructor above. It calls lockDB to do the real work. The code below will call other functions that lock mutexes but no mutex is held by the caller to the code below. In my case the block with "if (DB_LEVEL_LOCKING_ENABLED)" is entered and lockTop gets called to do the real work.

    Lock::DBRead::DBRead( const StringData& ns )
        : ScopedLock( 'r' ), _what(ns.toString()), _nested(false) {
        lockDB( _what );
    }

    void Lock::DBRead::lockDB(const string& ns) {
        fassert( 16254, !ns.empty() );
        LockState& ls = lockState();

        Acquiring a(this,ls);
        _locked_r=false;
        _weLocked=0;

        if ( ls.isRW() )
            return;
        if (DB_LEVEL_LOCKING_ENABLED) {
            StringData db = nsToDatabaseSubstring(ns);
            Nestable nested = n(db);
            if( !nested )
                lockOther(db);
            lockTop(ls);
            if( nested )
                lockNestable(nested);
        }
        else {
            qlk.lock_R();
            _locked_r = true;
        }
    }

Well, lockTop doesn't do the real work during my benchmark. It calls qlk.lock_r to do that.

    void Lock::DBRead::lockTop(LockState& ls) {
        switch( ls.threadState() ) {
        case 'r':
        case 'w':
            break;
        default:
            verify(false);
        case  0  :
            qlk.lock_r();
            _locked_r = true;
        }
    }

Almost there, just one more level of indirection. The call to qlk.lock_r calls the lock_r method on an instance of QLock and then something gets done.

    void lock_r() {
        verify( threadState() == 0 );
        lockState().lockedStart( 'r' );
        q.lock_r();
    }

    inline void QLock::lock_r() {
        boost::mutex::scoped_lock lk(m);
        while( !r_legal() ) {
            r.c.wait(m);
        }
        r.n++;
    }

Eventually the unlock_r method is called for the same instance of QLock. I won't show the route there however.

    inline void QLock::unlock_r() {
        boost::mutex::scoped_lock lk(m);
        fassert(16137, r.n > 0);
        --r.n;
        notifyWeUnlocked('r');
    }

And notifyWeUnlocked provides the special semantics. This includes not letting a new reader in when there is a pending write request. The code below also wakes all waiting write requests when one is waiting. This might cause many threads to be scheduled to run even though at most one will get the RW-lock. InnoDB does something similar.

    inline void QLock::notifyWeUnlocked(char me) {
        fassert(16201, W.n == 0);
        if ( me == 'X' ) {
            X.c.notify_all();
        }
        if( U.n ) {
            // U is highest priority
            if( (r.n + w.n + W.n + X.n == 0) && (R.n == 1) ) {
                U.c.notify_one();
                return;
            }
        }
        if ( X_legal() && i_block(me, 'X') ) {
            X.c.notify_one();
        }
        if ( W_legal() && i_block(me, 'W') ) {
            W.c.notify_one();
            if( _areQueueJumpingGlobalWritesPending() )
                return;
        }
        if ( R_legal_ignore_greed() && i_block(me, 'R') ) {
            R.c.notify_all();
        }
        if ( w_legal_ignore_greed() && i_block(me, 'w') ) {
            w.c.notify_all();
        }
        if ( r_legal_ignore_greed() && i_block(me, 'r') ) {
            r.c.notify_all();
        }
    }



1 comment:

  1. Also, a second rwlock (a SimpleRWLock in a WrapperForRWLock object, this is probably a pthread_rwlock_t or boost::shared_mutex on your system) gets locked before the "top" QLock is handled: https://github.com/mongodb/mongo/blob/master/src/mongo/db/d_concurrency.cpp#L743. These objects are managed in the DBLocksMap but get cached per-client so repeated locks of the same database by the same client don't have to search that map.

    ReplyDelete

Evaluating vector indexes in MariaDB and pgvector: part 2

This post has results from the ann-benchmarks with the   fashion-mnist-784-euclidean  dataset for MariaDB and Postgres (pgvector) with conc...