A rw-lock where reads dominate can suffer from contention because it has at least twice the memory writes per lock/unlock pair compared to a mutex. So when the lock hold time is short a mutex wins even when exclusive access isn't required. This can often be seen in PMP output where there are convoys and the worst-case is when a thread gets stuck trying to get the internal latch during unlock, but the InnoDB custom rw-lock might not have that problem. Lock elision for the rw-lock might be a big deal in this case.
RocksDB might also benefit from this change.
One of the challenges with glibc pthreads is documentation. I previously wrote about the difficulty of finding documentation for PTHREAD_MUTEX_ADAPTIVE_NP. The problem continues. There isn't much about pthreads in a recent version of the glibc manual. From Google searches I wasn't able to find recent docs elsewhere, except for man pages. But man pages don't document PTHREAD_MUTEX_ADAPTIVE_NP. With lock elision we get new options -- PTHREAD_MUTEX_ELISION_NP and PTHREAD_MUTEX_NO_ELISION_MP. Google searches will take you to bits of source code and email list discussions. I hope this can be improved. Given the lack of docs you might need to read the source. I hope that the community (web-scale companies) can sponsor a tech writer to provide the missing docs.
There has been drama because the introduction of this feature failed when it encountered buggy microcode on certain CPUs. Then there was more drama when it broke buggy software that worked despite the bugs, until lock elision made the bugs serious. Google searches find many of the stories.
One of my favorite perks at work is getting answers from experts. In this case the expert is Nathan Bronson (thank you). A summary of the glibc 2.23 implementation per the expert is:
- NPTL lock elision is performed using TSX's RTM (Restricted Transactional Memory) instructions XBEGIN, XEND, and XABORT, rather than TSX's HLE (Hardware Lock Elision) instructions XACQUIRE and XRELEASE
- On x86, elision support is always present when detected by HAS_CPU_FEATURE(RTM)
- pthread_rwlock_t always attempts elision if the hardware has it (both for .._rdlock and .._wrlock)
- pthread_rwlock_t uses an adaptive strategy for falling back to the non-TSX implementation. If the lock is held in a non-TSX mode, there is a transaction conflict, or the transaction exceeds TSX's (undocumented) capacity, then the current lock acquisition and the 3 following use the non-TXN code path. This means that once a lock falls off the elision path it needs several uncontended acquisitions before a transaction it will be attempted again. This seems quite conservative
- pthread_rwlock_rdlock -> pthread_rwlock_unlock with a successful transaction is about twice as fast as the non-TSX implementation under no contention, and massively better under contention
XACQUIRE / XRELEASE aren't actually instructions, they use re-uses of the REP / REPNE opcode prefixes, which allows backwards compatibility with older generation processors. Very elegant.
ReplyDeleteI don't really like the XABORT semantics. It would have been much nicer if XABORT actually set %rcx to non-zero on abort instead of allowing a 16 bit offset.
Then:
XBEGIN C7 F8
JRCXZ offset E3 xx !!
And the encoding is tighter. In practice I this would have been nearly optimal:
XBEGIN
JRCXZ 1f # elide locking
CALL lock_acquire
1f: # begin critical section
# end critical section
XEND
JRCXZ 2f # if not aborted...
CALL lock_release
2f:
Reversing the branch hints for JRCXZ forward jumps after an XBEGIN/XEND shouldn't be too hard to do.
Why? Because JECXZ is an awesome weapon to have in your arsenal. We used and abused it back in the days of binary translation and it was so much fun.
Thanks for the details.
Delete