Small Datum

Wednesday, April 23, 2014

Concurrent, read-only, not cached: MongoDB, TokuMX, MySQL

I repeated the tests described here using a database larger than RAM. The test database has 8 collections/tables with 400M documents/rows per table. I previously reported results for this workload using a server with 24 CPU cores and a slightly different flash storage device. This time I provide a graph and use a server with more CPU cores. The goal for this test is to determine whether the DBMS can use the capacity of a high-performance storage device, the impact from different filesystem readahead settings for MongoDB and TokuMX and the impact from different read page sizes for TokuMX and InnoDB. It will take two blog posts to share everything. I think I will have much better QPS for MongoDB and TokuMX in my next post so I won't list any conclusions here.

Setup

I used my forked Java and C sysbench clients. The test query fetches one document/row by PK. The test database has 8 collections/tables with 400M rows per collection/table. All are in one database. I still need to enhance the Java sysbench client to support a database per collection. I tested the configurations listed below. I don't think these are the best configurations for TokuMX and MongoDB and am running more tests to confirm. The test server has 144G RAM, 40 CPU cores and a fast flash storage device.

fb56.handler - 740G database, MySQL 5.6.12 with the Facebook patch, InnoDB, page_size=8k, data fetched via HANDLER
fb56.sql - 740G database, MySQL 5.6.12 with the Facebook path, InnoDB, page_size=8k, data fetched via SELECT
orig57.handler - 740G database, official MySQL 5.7.4, InnoDB, page_size=8k, data fetched via HANDLER.
orig57.sql - 740G database, official MySQL 5.7.4, InnoDB, page_size=8k, data fetched via SELECT
tokumx32 - 554G database, TokuMX 1.4.1, quicklz, readPageSize=32K, 16K filesystem readahead
tokumx64 - 582G database, TokuMX 1.4.1, quicklz, readPageSize=64K, 32K filesystem readahead
mongo24 - 834G database, MongoDB 2.4.9, powerOf2Sizes=0, 16K filesystem readahead
mongo26 - 874G database, MongoDB 2.6.0, powerOf2Sizes=1, 16K filesystem readahead

Results

Results for MySQL 5.7.4 are not in the graph to keep it readable and are similar to MySQL 5.6.12. Note that MySQL is able to get more than 100,000 QPS at high concurrency, TokuMX reaches 30,000 and MongoDB isn't able to reach 20,000. I think MongoDB and TokuMX can do a lot better when I reduce the filesystem readahead for both and reduce the read page size for TokuMX and results for that are in my next post. MongoDB also suffers in this test because the PK index is so large that all leaf nodes cannot fit in RAM so there is more than one disk read per query. This isn't something that goes away via tuning. The workaround it to make sure the database:RAM ratio isn't too big (and spend more money on hardware).

This lists the QPS from the graph.

point queries per second
8 16 32 40 clients
39928 63542 102294 107769 fb56.handler
33630 56834 91132 102336 fb56.sql
39714 63359 101987 106205 orig57.handler
33561 56725 90900 101476 orig57.sql
12586 22738 31407 32167 tokumx32
10119 16373 18310 18232 tokumx64
12782 16639 17350 17435 mongo24
12503 17474 17988 18022 mongo26

Analysis

These tables list the average disk read rate from iostat r/s and the average number of disk reads per query. InnoDB is by far the most efficient with the smallest number of disk reads per query. TokuMX benefits from having the smallest database courtesy of quicklz compression but might suffer from a larger read page size (32k and 64k). But I don't think that is the only reason why the disk reads per query ratio is so much larger than InnoDB and TokuMX. I am repeating tests with an 8k read page size to confirm. MongoDB suffers from a PK index that is too large to be cached so disk reads are done for it and the document store. Both TokuMX and MongoDB might also do extra reads because of the filesystem readahead and I am repeating tests with smaller values for it to confirm.

iostat r/s
8 16 32 40 clients
33661 53502 86028 90616 fb56.handler
29120 49155 78748 88423 fb56.sql
33776 53702 86193 89755 orig57.handler
29244 49268 78801 88027 orig57.sql
26756 47813 65885 67840 tokumx32
23728 37442 41357 42089 tokumx64
18966 24440 25147 25322 mongo24
18312 25313 25701 25781 mongo26

disk reads per query
8 16 32 40 clients
0.84a 0.84 0.84 0.84 fb56.handler
0.86 0.86 0.86 0.86 fb56.sql
0.85 0.84 0.84 0.84 orig57.handler
0.87 0.86 0.86 0.86 orig57.sql
2.12 2.10 2.09 2.10 tokumx32
2.34 2.28 2.25 2.29 tokumx64
1.48 1.46 1.44 1.45 mongo24
1.54 1.44 1.42 1.43 mongo26

RW locks are hard

MongoDB and TokuMX saturated at a lower QPS rate then MySQL when running read-only workloads on a cached database with high concurrency. Many of the stalls were on the per-database RW-lock and I was curious about the benefit from removing that lock. I hacked MongoDB to not use the RW-lock per query (not safe for production) and repeated the test. I got less than 5% more QPS at 32 concurrent clients. I expected more, looked at performance with PMP and quickly realized there were several other sources of mutex contention that are largely hidden by contention on the per-database RW-lock. So this problem won't be easy to fix but I think it can be fixed.

The easy way to implement a reader-writer lock uses the pattern listed below. That includes pthread_rwlock_t in glibc the last time I checked and the per-database RW-lock used by MongoDB. InnoDB used this pattern many years ago and then we rewrote it to make InnoDB better on multi-core. An implementation like this tends to have problems on multi-core servers. The first problem is from locking/unlocking the internal mutex at least twice per use, once to get it in read or write mode and then again to unlock it. When there is contention it can be locked/unlocked many more times than twice per use from threads that wait, wake-up and then wait again. If the operation protected by this RW-lock is very fast then a mutex is usually a better choice. Note that even when all threads are trying to lock in read mode there is still contention on the internal mutex ("mtx" below). Another problem occurs when the thread trying to unlock a RW-lock is blocked trying to lock the internal state mutex ("mtx" below). There might be other threads waiting to run as soon as the unlock gets through but the unlock is stalled because incoming lock requests are competing for the same mutex ("mtx"). I have seen many PMP thread stacks where the unlocking thread is stuck on the lock_mutex call.

lock(mode)
lock_mutex(mtx)
wait_until_lock_granted(mode)
modify_lock_state()
unlock_mutex(mtx)

unlock()
lock_mutex(mtx)
modify_lock_state()
notify_some_waiters()
unlock_mutex(mtx)

Something better

The alternative that scales better is to use a lock-free approach to get and set internal state in the RW-lock. We did this as part of the Google MySQL patch many years ago and that code was contributed upstream. Such an approach removes much of the contention added by an inefficient RW-lock. It won't prevent contention added because threads want the lock in read and write mode at the same time. That still requires some threads to wait. When we did the work at Google on the InnoDB RW-lock, Yasufumi Kinoshita was working on a similar change. I am very happy he continues to make InnoDB better.

A lock-free implementation for a DBMS is likely to be much more complex than what you might read about on the web or a top-tier systems conference paper. There is more complexity because of the need to support performance monitoring, manageability, special semantics and the occasional wrong design decision. For performance monitoring we need to know how frequently a lock is used and how long threads wait on it. For manageability we need to know what threads wait on a lock and which thread holds it. A frequent pattern is for today's special semantics to become tomorrow's design decisions that we regret. But we can't expect perfection given the need to move fast and the rate at which hardware changes.

The low-level reader-writer lock in MongoDB, QLock, is a RW-lock with special semantics. It has two modes each for read and write locks: r, R, w and W. It also supports upgrades and downgrades: W to R, R to W, w to X and X to w (I didn't mention X above). Internally there are 6 condition variables, one each for r, R, w and W and then two others, U and X, to support upgrades and downgrades. Read the source for more details. I don't understand the code enough to guess whether lock-free state changes can be supported as they were for the InnoDB RW-lock.

MongoDB details

I spent a few hours browsing the source for the MongoDB RW-lock and these are my notes. I hope they help you, otherwise they will be a reference for me in the future. Queries that call find to fetch one row by PK start to run in mongod via the newRunQuery function. That gets the per-database RW-lock in read mode by creating a Client::ReadContext object on the stack and ReadContext gets the per-database RW-lock in read mode:

/** "read lock, and set my context, all in one operation"

* This handles (if not recursively locked) opening an unopened database.

Client::ReadContext::ReadContext(const string& ns, const std::string& path) {

{

lk.reset( new Lock::DBRead(ns) );

Database *db = dbHolder().get(ns, path);

if( db ) {

c.reset( new Context(path, ns, db) );

return;

}

...

The dbHolder().get() call above locks a mutex in DatabaseHolder while using the database name to find the database object. There is simple string searching while the mutex is locked. It might be easy to move some of that work outside the scope of the mutex and perhaps use a mutex per hash table bucket.

Database * get( const string& ns , const string& path ) const {
SimpleMutex::scoped_lock lk(_m);
Lock::assertAtLeastReadLocked(ns);
Paths::const_iterator x = _paths.find( path );
if ( x == _paths.end() )
return 0;
const DBs& m = x->second;
string db = _todb( ns );
DBs::const_iterator it = m.find(db);
if ( it != m.end() )
return it->second;
return 0;
}

static string __todb( const string& ns ) {
size_t i = ns.find( '.' );
if ( i == string::npos ) {
uassert( 13074 , "db name can't be empty" , ns.size() );
return ns;
}
uassert( 13075 , "db name can't be empty" , i > 0 );
return ns.substr( 0 , i );

}

Lets get back to the DBRead constructor that was called in the ReadContext constructor above. It calls lockDB to do the real work. The code below will call other functions that lock mutexes but no mutex is held by the caller to the code below. In my case the block with "if (DB_LEVEL_LOCKING_ENABLED)" is entered and lockTop gets called to do the real work.

Lock::DBRead::DBRead( const StringData& ns )
: ScopedLock( 'r' ), _what(ns.toString()), _nested(false) {
lockDB( _what );
}

void Lock::DBRead::lockDB(const string& ns) {
fassert( 16254, !ns.empty() );
LockState& ls = lockState();

Acquiring a(this,ls);
_locked_r=false;
_weLocked=0;

if ( ls.isRW() )
return;
if (DB_LEVEL_LOCKING_ENABLED) {
StringData db = nsToDatabaseSubstring(ns);
Nestable nested = n(db);
if( !nested )
lockOther(db);
lockTop(ls);
if( nested )
lockNestable(nested);
}
else {
qlk.lock_R();
_locked_r = true;
}
}

Well, lockTop doesn't do the real work during my benchmark. It calls qlk.lock_r to do that.

void Lock::DBRead::lockTop(LockState& ls) {

switch( ls.threadState() ) {

case 'r':

case 'w':

break;

default:

verify(false);

case 0 :

qlk.lock_r();

_locked_r = true;

}

Almost there, just one more level of indirection. The call to qlk.lock_r calls the lock_r method on an instance of QLock and then something gets done.

void lock_r() {

verify( threadState() == 0 );

lockState().lockedStart( 'r' );

q.lock_r();

}

inline void QLock::lock_r() {

boost::mutex::scoped_lock lk(m);

while( !r_legal() ) {

r.c.wait(m);

}

r.n++;

}

Eventually the unlock_r method is called for the same instance of QLock. I won't show the route there however.

inline void QLock::unlock_r() {

boost::mutex::scoped_lock lk(m);

fassert(16137, r.n > 0);

--r.n;

notifyWeUnlocked('r');

}

And notifyWeUnlocked provides the special semantics. This includes not letting a new reader in when there is a pending write request. The code below also wakes all waiting write requests when one is waiting. This might cause many threads to be scheduled to run even though at most one will get the RW-lock. InnoDB does something similar.

inline void QLock::notifyWeUnlocked(char me) {

fassert(16201, W.n == 0);

if ( me == 'X' ) {

X.c.notify_all();

}

if( U.n ) {

// U is highest priority

if( (r.n + w.n + W.n + X.n == 0) && (R.n == 1) ) {

U.c.notify_one();

return;

}

if ( X_legal() && i_block(me, 'X') ) {

X.c.notify_one();

}

if ( W_legal() && i_block(me, 'W') ) {

W.c.notify_one();

if( _areQueueJumpingGlobalWritesPending() )

return;

}

if ( R_legal_ignore_greed() && i_block(me, 'R') ) {

R.c.notify_all();

}

if ( w_legal_ignore_greed() && i_block(me, 'w') ) {

w.c.notify_all();

}

if ( r_legal_ignore_greed() && i_block(me, 'r') ) {

r.c.notify_all();

}

Tuesday, April 22, 2014

Concurrent, read-only & cached: MongoDB, TokuMX, MySQL

This has results for a read-only workload where all data is cached. The test query fetches all columns in one doucment/row by PK. For InnoDB all data is in the buffer pool. For TokuMX and MongoDB all data is in the OS filesystem cache and accessed via mmap'd files. The test server has 40 CPU cores with HT enabled and the test clients share the host with mysqld/mongod to reduce variance from network latency. This was similar to a previous test, except the database is in cache and the test host has more CPU cores. The summary of my results is:

MongoDB 2.6 has a performance regression from using more CPU per query. The regression might be limited to simple queries that do single row lookups on the _id index. I spent a bit of time rediscovering how to get hierarchical CPU profile data from gperftools to explain this. JIRAs 13663 and 13685 are open for this.
MySQL gets much more QPS at high concurrency than MongoDB and TokuMX
MySQL gets more QPS using the HANDLER interface than SELECT. I expect the InnoDB memcached API to be even faster than HANDLER but did not test it.
MySQL uses more CPU per query in 5.7.4 than in 5.6.12 but this didn't have an impact on QPS

Setup

The test was repeated for 1, 2, 4, 8, 16, 32 and 40 concurrent clients. It uses my forked versions of the MongoDB and C clients for sysbench. There are 8 collections/tables in one database. Each table has 400M rows but queries are limited to the first 1M. I don't know yet whether using a database per collection would improve the MongoDB results. Each query fetches all columns in one document/row by PK. I have yet to push my changes to the MongoDB sysbench client to make it fetch all columns. I tested these binaries:

fb56.handler - MySQL 5.6.12 with the Facebook patch and 8k pages. Uses HANDLER to fetch data.
fb56.sql - MySQL 5.6.12 with the Facebook patch and 8k pages. Uses SELECT to fetch data.
orig57.handler - MySQL 5.7.4 without the Facebook patch and 8k pages. Uses HANDLER to fetch data.
orig57.sql - MySQL 5.7.4 without the Facebook patch and 8k pages. Uses SELECT to fetch data.
tokumx - TokuMX 1.4.1 using quicklz and 32kb pages. There should be no decompression during the test as all data used by the test (1M documents) is much smaller than 50% of RAM.
mongo24 - MongoDB 2.4.9
mongo26 - MongoDB 2.6.0

Results

At last I included a graph. I have been reluctant to include graphs on previous posts comparing MongoDB, TokuMX and MySQL because I want to avoid benchmarketing and drive-by analysis. These tests have been time consuming to run and document and I don't want to make it too easy to misinterpret the results. Results for MySQL 5.7.4 are not in the graph to make it easier to read. The top two bars (blue & red) are for MySQL and you can see that QPS increases with more concurrency. QPS for MongoDB and TokuMX saturates at a lower level of concurrency.

Numbers used for the graph above.

point queries per second
1 2 4 8 16 32 40 clients
17864 32397 60294 106374 184566 298276 350665 fb56.handler
11730 22884 39646 73485 131533 215487 249402 fb56.sql
18161 33262 59413 107505 185894 306084 371045 orig57.handler
11775 21838 40528 75322 135331 227450 266917 orig57.sql
14298 25219 45743 83214 142489 168498 161840 tokumx
17203 30158 52476 94705 161922 174453 170177 mongo24
10705 19502 34318 61977 109684 152667 151555 mongo26

Analysis

I used vmstat to measure the average CPU utilization (user + system) during the test. The numbers below are: (CPU_utilization / QPS) * 1,000,000. There are some interesting details.

the values are larger for MySQL 5.7 than for 5.6 at low concurrency. Note that in both cases the performance schema was disabled at compile time.
the values are much larger for MongoDB 2.6 than for 2.4 and hopefully this can be fixed via JIRAs 13663 and 13685.

(CPU_utilization / QPS) * 1,000,000

1 2 4 8 16 32 40 clients

218 197 197 208 216 251 268 fb56.handler

323 310 287 298 304 352 372 fb56.sql

357 279 240 216 215 248 250 orig57.handler

407 380 313 288 302 342 359 orig57.sql

272 269 251 254 266 302 296 tokumx

232 215 219 225 234 257 252 mongo24

373 333 340 342 355 425 422 mongo26

I also used vmstat to measure the context switch rate and the table below lists the number of context switches per query. Note that the rate decreases with concurrency for MySQL but not for MongoDB and TokuMX. I don't know enough about Linux internals to interpret this.

vmstat.cs / QPS

context switch per query

1 2 4 8 16 32 40 clients

4.44 4.14 4.01 3.79 3.47 3.05 2.19 fb56.handler

4.61 4.32 4.03 3.84 3.59 3.23 2.65 fb56.sql

4.53 4.27 4.07 3.88 3.52 3.08 2.20 orig57.handler

4.81 4.48 4.19 3.96 3.63 3.07 2.19 orig57.sql

4.59 4.30 4.08 3.87 3.77 4.32 4.32 tokumx

4.54 4.23 4.03 3.84 3.79 4.29 4.30 mongo24

4.80 4.43 4.21 3.99 3.93 4.58 4.63 mongo26

Hierarchical CPU profiling for MongoDB

One day it will be easy to get hierarchical CPU profile results for open-source databases using open-source profiling tools. Support for CPU profiling via Google perftools can be compiled into MongoDB via the --use-cpu-profiler option. Given the use of a compiler toolchain in a nonstandard location I also used --extrapath and --extralib to help it find libunwind. However, the profiler output file was mostly empty after doing this and did not have any per-thread results.

For future reference, profiling was started and stopped via:

echo "db.runCommand({ _cpuProfilerStart: { profileFilename: '/path/to/output' } })" | bin/mongo admin
echo "db.runCommand({ _cpuProfilerStop:1})" | bin/mongo admin

Eventually I remembered that it might help to add a call to ProfilerRegisterThread() at the start of each new thread. I did something similar in the Google patch for MySQL many years ago. And now I have hierarchical CPU profiles to help understand the source of performance regressions in MongoDB 2.6. I updated JIRA 6628 with details from this and then was asked to create JIRA 13683. The change is to handleIncomingMsg():
static void* handleIncomingMsg(void* arg) {
TicketHolderReleaser connTicketReleaser( &Listener::globalTicketHolder );
::ProfilerRegisterThread(); // Add this

Friday, April 18, 2014

Biebermarks

Yet another microbenchmark result. This one is based on behavior that has caused problems in the past for a variety of reasons which lead to a few interesting discoveries. The first was that using a short lock-wait timeout was better than the InnoDB deadlock detection code. The second was that no-stored procedures could overcome network latency.

The workload is a large database where all updates are done to a small number of rows. I think it is important to use a large database to include the overhead from searching multiple levels of a b-tree. The inspiration for this is maintaining counts for popular entities like Justin Bieber and One Direction. This comes from serving the social graph. For more on that read about TAO and LinkBench.

The most popular benchmark for MySQL is sysbench and it is usually run with a uniform distribution so that all rows are equally likely to be queried or modified. But real workloads have skew which can cause stress in unexpected places and I describe one such place within InnoDB from this microbenchmark. YCSB and LinkBench are benchmarks that have skew and can be run for MySQL. I hope that more of the MySQL benchmark results in the future include skew.

Configuration

See a previous post for more details. Eight collections/tables with 400M documents/rows per collection/table were created. All collections/tables are in one database so MongoDB suffers from the per-database RW-lock. But MySQL and TokuMX also suffer from a similar issue when all clients are trying to update the same row. Tests were run for 1, 2, 4 and 8 tables where one row per table was updated. So when the test used 4 tables there were 4 rows getting updates. For each number of tables tests were run for up to 64 concurrent clients/threads. The result tables listed in the next section should make that clear.

The workload is updating the non-indexed column of one document/row by PK per transaction. There are no secondary indexes on the table. In this case the document/row with ID=1 is chosen for every update. For MySQL and TokuMX an auto-commit transaction is used. The journal (redo log) is used but the update does not wait for the journal/log to be forced to disk. The updates should not require disk reads as all relevant index and data blocks remain in cache. TokuMX might do reads in the background to maintain fractal trees but I don't understand their algorithm to be certain.

The database was loaded in PK order and about 8 hours of concurrent & random updates were done to warmup the database prior to this test. The warmup was the same workload as described in a previous post.

The MySQL test client limits clients to one table. So when there are 64 clients and 8 tables then there are 8 clients updating the 1 row per table. The MongoDB/TokuMX client does not do that. It lets all clients update all tables so in this case there are at most 64 clients updating the row per table and on average there would be 8.

The test server has 40 CPU cores with HT enabled, fast flash storage and 144G of RAM. The benchmark client and database servers shared the host. Tests were run for several configurations:

mongo26 - MongoDB 2.6.0rc2, powerOf2Sizes=1, journalCommitInterval=300, w:1,j:0
mongo24 - MongoDB 2.6.0rc2, powerOf2Sizes=0, journalCommitInterval=300, w:1,j:0
mysql - MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=120G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=16000, lru_scan_depth=2000, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=0
toku-32 - TokuMX 1.4.1, readPageSize=32k, quicklz compression, logFlushPeriod=300, w:1,j:0. I don't have results for toku-32 yet.
toku-64 - TokuMX 1.4.1, readPageSize=64k, quicklz compression, logFlushPeriod=300, w:1,j:0

Results per DBMS

I first list the results by DBMS to show the impact from spreading the workload over more rows/tables. The numbers below are the updates per second rate. I use "DOP=X" to indicate the number of concurrent clients and "DOP" stands for Degree Of Parallelism (it is an Oracle thing). A few conclusions from the results below:

MySQL/InnoDB does much better with more tables for two reasons. The first is that it allows for more concurrency. The second is that it avoids some of the overhead in the code that maintains row locks and threads waiting for row locks. I describe that in more detail at the end of this post.
MongoDB 2.4.9 is slightly faster than 2.6.0rc2. I think the problem is that mongod requires more CPU per update in 2.6 versus 2.4 and this looks like a performance regression in 2.6 (at least in 2.6.0rc2). I am still profiling to figure out where. More details on this are at the end of the post. I filed JIRA 13663 for this.
MongoDB doesn't benefit from spreading the load over more collections when all collections are in the same database. This is expected given the per-database RW-lock.

Updates per second
config #tables DOP=1 DOP=2 DOP=4 DOP=8 DOP=16 DOP=32 DOP=64
mysql 1 8360 15992 30182 24932 23924 23191 21048
mysql 2 X 16527 30824 49999 41045 40506 38357
mysql 4 X X 32351 51791 67423 62116 59137
mysql 8 X X X 54826 80409 73782 68128

config #tables DOP=1 DOP=2 DOP=4 DOP=8 DOP=16 DOP=32 DOP=64

mongo24 1 10212 17844 30204 34003 33895 33564 33451

mongo24 2 X 10256 17698 30547 34125 33717 33573

mongo24 4 X X 10670 17690 30903 34027 33586

mongo24 8 X X X 10379 17702 30920 33758

config #tables DOP=1 DOP=2 DOP=4 DOP=8 DOP=16 DOP=32 DOP=64

mongo26 1 9187 16131 27648 28506 27784 27437 27021

mongo26 2 X 9367 16035 27490 28326 27746 27354

mongo26 4 X X 9179 16028 27666 28330 27647

mongo26 8 X X X 9125 16038 27275 27858

config #tables DOP=1 DOP=2 DOP=4 DOP=8 DOP=16 DOP=32 DOP=64

toku-64 1 7327 12804 16179 12154 11021 9990 8344

toku-64 2 X 7173 12690 20483 23064 22354 20349

toku-64 4 X X 7191 12943 21399 33485 40124

toku-64 8 X X X 7121 12727 22096 38207

Results per number of tables

This reorders the results from above to show them for all configurations at the same number of tables. You are welcome to draw conclusions about which is faster.

Updates per second
config #tables DOP=1 DOP=2 DOP=4 DOP=8 DOP=16 DOP=32 DOP=64
mysql 1 8360 15992 30182 24932 23924 23191 21048
mongo24 1 10212 17844 30204 34003 33895 33564 33451
mongo26 1 9187 16131 27648 28506 27784 27437 27021

toku-64 1 7327 12804 16179 12154 11021 9990 8344

config #tables DOP=1 DOP=2 DOP=4 DOP=8 DOP=16 DOP=32 DOP=64
mysql 2 X 16527 30824 49999 41045 40506 38357
mongo24 2 X 10256 17698 30547 34125 33717 33573
mongo26 2 X 9367 16035 27490 28326 27746 27354
toku-64 2 X 7173 12690 20483 23064 22354 20349

config #tables DOP=1 DOP=2 DOP=4 DOP=8 DOP=16 DOP=32 DOP=64
mysql 4 X X 32351 51791 67423 62116 59137
mongo24 4 X X 10670 17690 30903 34027 33586
mongo26 4 X X 9179 16028 27666 28330 27647
toku-64 4 X X 7191 12943 21399 33485 40124

config #tables DOP=1 DOP=2 DOP=4 DOP=8 DOP=16 DOP=32 DOP=64
mysql 8 X X X 54826 80409 73782 68128
mongo24 8 X X X 10379 17702 30920 33758
mongo26 8 X X X 9125 16038 27275 27858
toku-64 8 X X X 7121 12727 22096 38207

Row locks for InnoDB

I used PMP to understand MySQL/InnoDB on this workload. I frequently saw all user threads blocked on a condition variable with this stack trace. It seems odd that all threads are sleeping. I think the problem is that one thread can run but has yet to be scheduled by Linux. My memory of the row lock code is that it wakes threads in FIFO order and when N threads wait for a lock on the same row then each thread waits on a separate condition variable. I am not sure if this code has been improved in MySQL 5.7. A quick reading of some of the 5.6.12 row lock code showed many mutex operations. Problems in this code have escaped scrutiny in the past because much of our public benchmark activity has used workloads with uniform distributions.

pthread_cond_wait@@GLIBC_2.3.2,os_cond_wait,os_event_wait_low2,lock_wait_suspend_thread,row_mysql_handle_errors,row_search_for_mysql,ha_innobase::index_read,handler::read_range_first,handler::multi_range_read_next,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection

This was a less frequent stack trace from the test ...

lock_get_mode,lock_table_other_has_incompatible,lock_table,row_search_for_mysql,ha_innobase::index_read,handler::read_range_first,handler::multi_range_read_next,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection

Row locks for TokuMX

TokuMX has a similar point at which all threads wait. It isn't a big surprise given that both provide fine-grained concurrency control but there is no granularity finer than a row lock.

pthread_cond_timedwait@@GLIBC_2.3.2,toku_cond_timedwait,toku::lock_request::wait,toku_db_wait_range_lock,toku_c_getf_set(__toku_dbc*,,db_getf_set,autotxn_db_getf_set(__toku_db*,,mongo::CollectionBase::findByPK(mongo::BSONObj,mongo::queryByPKHack(mongo::Collection*,,mongo::updateObjects(char,mongo::lockedReceivedUpdate(char,mongo::receivedUpdate(mongo::Message&,,mongo::assembleResponse(mongo::Message&,,mongo::MyMessageHandler::process(mongo::Message&,,mongo::PortMessageServer::handleIncomingMsg(void*)

MongoDB 2.4 versus 2.6

I get about 1.2X more updates/second with MongoDB 2.4.9 compared to 2.6.0rc2. I think the problem is that 2.6 uses more CPU per update. I filed JIRA 13663 for this but am still trying to profile the code. So far I know the following all of which indicates that the 2.4.9 test is running 1.2X faster than 2.6.0rc2 with 32 client threads and 1 table:

I get ~1.2X more updates/second with 2.4.9
the Java sysbench client uses ~1.2X more CPU per "top" with 2.4.9
the context switch rate is ~1.2X higher with 2.4.9

The interesting point is that mongod for 2.4.9 only uses ~1.03X more CPU than 2.6.0rc2 per "top" during this test even though it is doing 1.2X more updates/second. So 2.6.0rc2 uses more CPU per update. I will look at "perf" output. I can repeat this with the GA version of 2.6.

Wednesday, April 16, 2014

TokuMX, MongoDB and InnoDB, IO-bound update-only with fast storage

I repeated the update-only IO-bound tests using pure-flash servers to compare TokuMX, MongoDB and InnoDB. The test setup was the same as on the pure-disk servers except for the hardware. In this case the servers have fast flash storage, 144G of RAM and 24 CPU cores with HT enabled. As a reminder, the InnoDB change buffer and TokuMX fractal tree don't help on this workload because there are no secondary indexes to maintain. Note that all collections/tables are in one database for this workload thus showing the worst-case for the MongoDB per-database RW-lock. The result summary:

InnoDB is much faster than MongoDB and TokuMX. This test requires a high rate of dirty page writeback and thanks to a lot of work from the InnoDB team at MySQL with help from Percona and Facebook (and others) the InnoDB engine is now very good at that. Relative to MongoDB, InnoDB also benefits from a clustered PK index.
MongoDB is much slower than InnoDB for two reasons. First it doesn't have a clustered PK index so it might do storage reads for both the index search and then while reading the document. The second reason is the per-database RW-lock. As I described previously this lock appears to be held during disk reads when the index is searched so at most one thread searches the index at a time even though there are concurrent update requests. I created JIRA 3177 to make that obvious in the documentation. Because of this the peak rate for MongoDB is approximately the number of reads per second that one thread can do from the flash device. The device can sustain many more reads/second with concurrency but MongoDB doesn't get much benefit from it. I think there will be at most 2 concurrent flash/disk reads at any time -- one while searching the index and the other while prefetching the document into RAM after releasing the per-database RW-lock in Record::touch.
TokuMX also benefits from the clustered PK index but it suffers from other problems that I was unable to debug. I think it can do much better once a Toku expert reproduces the problem on their hardware.

Configuration

This test used the sysbench clients as described previously. Tests were run for 8, 16, 32 and 64 concurrent clients. There were 8 collections/tables in one database with 400M documents/rows per collection/table. The test server has fast flash storage that can do more than 5000 reads/second from one thread and more than 50,000 reads/second from many threads. The server also has 24 CPU cores with HT enabled and 144G of RAM. The sysbench clients ran on the same host as mysqld/mongod. Tests were first run for 30 minutes at each concurrency level to warmup the DBMS and then for either 60 or 120 minutes when measurements were taken. I tested the configurations listed below. I ran tests for more configurations but forgot to adjust read_ahead_kb so I won't publish results from those hosts.

mongo-p2y - 874 GB database, MongoDB 2.6.0rc2, powerOf2Sizes=1, journalCommitInterval=300, w:1,j:0

mysql-4k - 698 GB database, MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=120G, flush_method=O_DIRECT, page_size=4k, doublewrite=0, io_capacity=16000, lru_scan_depth=2000, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=0

mysql-8k - 698 GB database, MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=120G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=16000, lru_scan_depth=2000, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=0

tokumx-quicklz - 513 GB database, TokuMX 1.4.1 with quicklz compression, logFlushPeriod=300, w:1,j:0

Results

That probably isn't a typo below. InnoDB sustained about 5 to 10 times more updates/second. MongoDB does many more disk reads per update which is similar to the pure-disk results. I don't have the expertise to explain why TokuMX results weren't better but I shared information with the Tokutek team. Bytes written to storage per update is listed for InnoDB to show the impact on the write rate from using a smaller page. That can be important when flash endurance must be improved.

TPS
configuration 8 clients 16 clients 32 clients 64 clients
mysql-8k 24834 33886 37573 40198
mysql-4k 24826 31704 34644 34987
tokumx-quicklz 3706 3950 3552 3357
mongo-p2y 5194 5167 5173 5102

disk reads/second from iostat r/s
configuration 8 clients 16 clients 32 clients 64 clients

mysql-8k 20995 28371 31397 33537

mysql-4k 22016 27985 30553 30972

tokumx-quicklz 4943 5641 4962 4783

mongo-p2y 8960 8921 8951 8859

disk reads per update

configuration 8 clients 16 clients 32 clients 64 clients

mysql-8k 0.85 0.84 0.84 0.83

mysql-4k 0.89 0.88 0.88 0.89

tokumx-quicklz 1.33 1.43 1.40 1.42

mongo-p2y 1.73 1.73 1.73 1.74

bytes written per update

configuration 8 clients 16 clients 32 clients 64 clients

mysql-8k 6.56 6.40 5.36 5.36

mysql-4k 3.86 3.72 3.76 3.78

Types of writes

What does it mean to make writes fast? It helps to distinguish between the different types of writes. The slowest is a write that must be implemented as read-modify-write. This might require a disk read and can also create contention from preventing concurrent changes to the row for the duration of the read, modify and write. The row might not be unlocked until the change is made durable on storage (commit, fsync, etc) which lets you estimate the peak rate at which a single row can be changed on a traditional DBMS. And this latency between changes can get even worse when there is sync replication or multiple client-server round trips per transaction. The UPDATE statement in SQL is usually implemented as read-modify-write. Some DBMS engines require locking to be done above the DBMS because they don't support locking across operations where read and write are separate operations (RocksDB is an example). Other DBMS engines compensate for that with a conditional put that performs a write when parts of the row have not changed like checkAndPut in HBase. But if the client does a read prior to the write then the overhead from the read still exists.

Some UPDATE statements perform changes that are commutative and it is possible to avoid the read prior to the write. That optimization is rarely implemented but it is possible in RocksDB with the merge operator, TokuDB, and Redis. Increment and decrement are examples of commutative operations. But this also requires upsert behavior to handle the case where the row to be changed doesn't exist. If the read is to be skipped it also means that a result cannot be returned -- that is the count of rows changed or the old/new value of the changed column.

A blind-write is the easiest write to make fast. This can be done via a Put call with a key-value store like LevelDB or RocksDB. Some SQL operations can also use a blind-write if we have an option to not return the count of changed rows when the statement is executed and the operation behaves like an upsert. This is rarely implemented but TokuDB might help people appreciate it.

So there are at least 3 types of writes and from slowest to fastest they are read-modify-write, commutative-write, blind-write. Hopefully these optimizations will become more common in new database engines. From the write operation there shouldn't be a performance difference between commutative-write and blind-write. But the query latency for a row subject to commutative-write can be much worse than for blind-write because many old updates might have to be read and merged.