Wednesday, April 23, 2014

Concurrent, read-only, not cached: MongoDB, TokuMX, MySQL

I repeated the tests described here using a database larger than RAM. The test database has 8 collections/tables with 400M documents/rows per table. I previously reported results for this workload using a server with 24 CPU cores and a slightly different flash storage device. This time I provide a graph and use a server with more CPU cores. The goal for this test is to determine whether the DBMS can use the capacity of a high-performance storage device, to measure the impact of different filesystem readahead settings for MongoDB and TokuMX, and to measure the impact of different read page sizes for TokuMX and InnoDB. It will take two blog posts to share everything. I think I will have much better QPS for MongoDB and TokuMX in my next post so I won't list any conclusions here.

Setup

I used my forked Java and C sysbench clients. The test query fetches one document/row by PK. The test database has 8 collections/tables with 400M rows per collection/table. All are in one database. I still need to enhance the Java sysbench client to support a database per collection. I tested the configurations listed below. I don't think these are the best configurations for TokuMX and MongoDB and am running more tests to confirm. The test server has 144G RAM, 40 CPU cores and a fast flash storage device.
  • fb56.handler - 740G database, MySQL 5.6.12 with the Facebook patch, InnoDB, page_size=8k, data fetched via HANDLER
  • fb56.sql - 740G database, MySQL 5.6.12 with the Facebook patch, InnoDB, page_size=8k, data fetched via SELECT
  • orig57.handler - 740G database, official MySQL 5.7.4, InnoDB, page_size=8k, data fetched via HANDLER
  • orig57.sql - 740G database, official MySQL 5.7.4, InnoDB, page_size=8k, data fetched via SELECT
  • tokumx32 - 554G database, TokuMX 1.4.1, quicklz, readPageSize=32K, 16K filesystem readahead
  • tokumx64 - 582G database, TokuMX 1.4.1, quicklz, readPageSize=64K, 32K filesystem readahead
  • mongo24 - 834G database, MongoDB 2.4.9, powerOf2Sizes=0, 16K filesystem readahead
  • mongo26 - 874G database, MongoDB 2.6.0, powerOf2Sizes=1, 16K filesystem readahead

Results

Results for MySQL 5.7.4 are not in the graph to keep it readable and are similar to MySQL 5.6.12. Note that MySQL is able to get more than 100,000 QPS at high concurrency, TokuMX reaches 30,000 and MongoDB isn't able to reach 20,000. I think MongoDB and TokuMX can do a lot better when I reduce the filesystem readahead for both and reduce the read page size for TokuMX; results for that are in my next post. MongoDB also suffers in this test because the PK index is so large that all leaf nodes cannot fit in RAM, so there is more than one disk read per query. This isn't something that goes away via tuning. The workaround is to make sure the database:RAM ratio isn't too big (and spend more money on hardware).
This table lists the QPS from the graph.

point queries per second
     8     16     32     40  clients
 39928  63542 102294 107769  fb56.handler
 33630  56834  91132 102336  fb56.sql
 39714  63359 101987 106205  orig57.handler
 33561  56725  90900 101476  orig57.sql
 12586  22738  31407  32167  tokumx32
 10119  16373  18310  18232  tokumx64
 12782  16639  17350  17435  mongo24
 12503  17474  17988  18022  mongo26

Analysis

These tables list the average disk read rate from iostat r/s and the average number of disk reads per query (computed as iostat r/s divided by QPS). InnoDB is by far the most efficient with the smallest number of disk reads per query. TokuMX benefits from having the smallest database courtesy of quicklz compression but might suffer from a larger read page size (32k and 64k). But I don't think that is the only reason why the disk reads per query ratio is so much larger for TokuMX than for InnoDB. I am repeating tests with an 8k read page size to confirm. MongoDB suffers from a PK index that is too large to be cached, so disk reads are done both for the index and for the document store. Both TokuMX and MongoDB might also do extra reads because of the filesystem readahead and I am repeating tests with smaller values for it to confirm.

iostat r/s
     8     16     32     40  clients
 33661  53502  86028  90616  fb56.handler
 29120  49155  78748  88423  fb56.sql
 33776  53702  86193  89755  orig57.handler
 29244  49268  78801  88027  orig57.sql
 26756  47813  65885  67840  tokumx32
 23728  37442  41357  42089  tokumx64
 18966  24440  25147  25322  mongo24
 18312  25313  25701  25781  mongo26

disk reads per query
     8     16     32     40  clients
  0.84   0.84   0.84   0.84  fb56.handler
  0.86   0.86   0.86   0.86  fb56.sql
  0.85   0.84   0.84   0.84  orig57.handler
  0.87   0.86   0.86   0.86  orig57.sql
  2.12   2.10   2.09   2.10  tokumx32
  2.34   2.28   2.25   2.29  tokumx64
  1.48   1.46   1.44   1.45  mongo24
  1.54   1.44   1.42   1.43  mongo26


RW locks are hard

MongoDB and TokuMX saturated at a lower QPS rate than MySQL when running read-only workloads on a cached database with high concurrency. Many of the stalls were on the per-database RW-lock and I was curious about the benefit from removing that lock. I hacked MongoDB to not use the RW-lock per query (not safe for production) and repeated the test. I got less than 5% more QPS at 32 concurrent clients. I expected more, looked at performance with PMP and quickly realized there were several other sources of mutex contention that are largely hidden by contention on the per-database RW-lock. So this problem won't be easy to fix but I think it can be fixed.

The easy way to implement a reader-writer lock uses the pattern listed below. That includes pthread_rwlock_t in glibc the last time I checked and the per-database RW-lock used by MongoDB. InnoDB used this pattern many years ago and then we rewrote it to make InnoDB better on multi-core. An implementation like this tends to have problems on multi-core servers. The first problem is from locking/unlocking the internal mutex at least twice per use, once to get it in read or write mode and then again to unlock it. When there is contention it can be locked/unlocked many more times than twice per use from threads that wait, wake up and then wait again. If the operation protected by this RW-lock is very fast then a mutex is usually a better choice. Note that even when all threads are trying to lock in read mode there is still contention on the internal mutex ("mtx" below). Another problem occurs when the thread trying to unlock a RW-lock is blocked trying to lock the internal state mutex ("mtx" below). There might be other threads waiting to run as soon as the unlock gets through but the unlock is stalled because incoming lock requests are competing for the same mutex ("mtx"). I have seen many PMP thread stacks where the unlocking thread is stuck on the lock_mutex call.

    lock(mode)
        lock_mutex(mtx)
        wait_until_lock_granted(mode)
        modify_lock_state()
        unlock_mutex(mtx)

    unlock()
        lock_mutex(mtx)
        modify_lock_state()
        notify_some_waiters()
        unlock_mutex(mtx)

Something better

The alternative that scales better is to use a lock-free approach to get and set internal state in the RW-lock. We did this as part of the Google MySQL patch many years ago and that code was contributed upstream. Such an approach removes much of the contention added by an inefficient RW-lock. It won't prevent contention added because threads want the lock in read and write mode at the same time. That still requires some threads to wait. When we did the work at Google on the InnoDB RW-lock, Yasufumi Kinoshita was working on a similar change. I am very happy he continues to make InnoDB better.
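
A sketch of the idea, assuming the lock state fits in a single atomic word (this is my illustration, not the InnoDB or Google-patch code, and the slow path that waits is omitted):

    #include <atomic>
    #include <cstdint>

    // RW-lock state packed into one atomic word: high bit = writer active,
    // low bits = reader count. The read-lock fast path is one compare-and-swap
    // with no internal mutex. Waiting still needs a slow path (spin, futex or
    // condition variable) which is not shown here.
    class LockFreeRWState {
    public:
        bool try_lock_read() {
            uint32_t cur = state.load(std::memory_order_acquire);
            while ((cur & kWriterBit) == 0) {            // no writer holds the lock
                if (state.compare_exchange_weak(cur, cur + 1,
                                                std::memory_order_acquire))
                    return true;                         // reader count incremented
            }
            return false;                                // caller must use the slow path
        }

        void unlock_read() {
            state.fetch_sub(1, std::memory_order_release);
        }

        bool try_lock_write() {
            uint32_t expected = 0;                       // no readers, no writer
            return state.compare_exchange_strong(expected, kWriterBit,
                                                 std::memory_order_acquire);
        }

        void unlock_write() {
            state.fetch_and(~kWriterBit, std::memory_order_release);
        }

    private:
        static constexpr uint32_t kWriterBit = 1u << 31;
        std::atomic<uint32_t> state{0};
    };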

A lock-free implementation for a DBMS is likely to be much more complex than what you might read about on the web or a top-tier systems conference paper. There is more complexity because of the need to support performance monitoring, manageability, special semantics and the occasional wrong design decision. For performance monitoring we need to know how frequently a lock is used and how long threads wait on it. For manageability we need to know what threads wait on a lock and which thread holds it. A frequent pattern is for today's special semantics to become tomorrow's design decisions that we regret. But we can't expect perfection given the need to move fast and the rate at which hardware changes.
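
For example, the monitoring and manageability state might look something like the hypothetical wrapper below (the names are mine): a counter for how often the lock is taken, accumulated wait time, and the identity of the current holder. Real DBMS locks track more than this (waiter lists, lock names, and so on).

    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <mutex>
    #include <thread>

    // Hypothetical instrumented mutex: answers "how often is this lock used",
    // "how long do threads wait on it" and "which thread holds it right now".
    class InstrumentedMutex {
    public:
        void lock() {
            auto start = std::chrono::steady_clock::now();
            mtx.lock();
            auto waited = std::chrono::steady_clock::now() - start;
            wait_usecs += std::chrono::duration_cast<std::chrono::microseconds>(waited).count();
            uses++;
            owner.store(std::this_thread::get_id());
        }

        void unlock() {
            owner.store(std::thread::id());    // no owner
            mtx.unlock();
        }

        uint64_t use_count() const { return uses.load(); }
        uint64_t total_wait_usecs() const { return wait_usecs.load(); }
        std::thread::id holder() const { return owner.load(); }

    private:
        std::mutex mtx;
        std::atomic<uint64_t> uses{0};
        std::atomic<uint64_t> wait_usecs{0};
        std::atomic<std::thread::id> owner{std::thread::id()};
    };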

The low-level reader-writer lock in MongoDB, QLock, is a RW-lock with special semantics. It has two modes each for read and write locks:  r, R, w and W. It also supports upgrades and downgrades: W to R, R to W, w to X and X to w (I didn't mention X above). Internally there are 6 condition variables, one each for r, R, w and W and then two others, U and X, to support upgrades and downgrades. Read the source for more details. I don't understand the code enough to guess whether lock-free state changes can be supported as they were for the InnoDB RW-lock.

MongoDB details

I spent a few hours browsing the source for the MongoDB RW-lock and these are my notes. I hope they help you, otherwise they will be a reference for me in the future. Queries that call find to fetch one row by PK start to run in mongod via the newRunQuery function. That function acquires the per-database RW-lock in read mode by creating a Client::ReadContext object on the stack:

    /** "read lock, and set my context, all in one operation"
     *  This handles (if not recursively locked) opening an unopened database.
     */
    Client::ReadContext::ReadContext(const string& ns, const std::string& path) {
        {
            lk.reset( new Lock::DBRead(ns) );
            Database *db = dbHolder().get(ns, path);
            if( db ) {
                c.reset( new Context(path, ns, db) );
                return;
            }
        }
        ...

The dbHolder().get() call above locks a mutex in DatabaseHolder while using the database name to find the database object. There is simple string searching while the mutex is locked. It might be easy to move some of that work outside the scope of the mutex and perhaps use a mutex per hash table bucket; a sketch of that idea follows the code below.

        Database * get( const string& ns , const string& path ) const {
            SimpleMutex::scoped_lock lk(_m);
            Lock::assertAtLeastReadLocked(ns);
            Paths::const_iterator x = _paths.find( path );
            if ( x == _paths.end() )
                return 0;
            const DBs& m = x->second;
            string db = _todb( ns );
            DBs::const_iterator it = m.find(db);
            if ( it != m.end() )
                return it->second;
            return 0;
        }

        static string __todb( const string& ns ) {
            size_t i = ns.find( '.' );
            if ( i == string::npos ) {
                uassert( 13074 , "db name can't be empty" , ns.size() );
                return ns;
            }
            uassert( 13075 , "db name can't be empty" , i > 0 );
            return ns.substr( 0 , i );

        }
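
As a rough sketch of that per-bucket idea (hypothetical code, not a patch for MongoDB): the database name is extracted from the namespace before any mutex is taken, and only one of N bucket mutexes is locked, so lookups for different databases don't serialize on a single mutex.

    #include <functional>
    #include <map>
    #include <mutex>
    #include <string>

    class Database;   // stand-in for the real Database type

    // Hypothetical holder with a mutex per hash bucket instead of one mutex.
    class ShardedDatabaseHolder {
    public:
        Database* get(const std::string& ns) const {
            std::string db = todb(ns);                    // string work outside any lock
            const Bucket& b = buckets[hasher(db) % kBuckets];
            std::lock_guard<std::mutex> lk(b.m);          // lock only this bucket
            std::map<std::string, Database*>::const_iterator it = b.dbs.find(db);
            return it == b.dbs.end() ? 0 : it->second;
        }

    private:
        static std::string todb(const std::string& ns) {
            std::string::size_type i = ns.find('.');
            return i == std::string::npos ? ns : ns.substr(0, i);
        }

        static const int kBuckets = 16;
        struct Bucket {
            mutable std::mutex m;
            std::map<std::string, Database*> dbs;
        };
        Bucket buckets[kBuckets];
        std::hash<std::string> hasher;
    };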

Let's get back to the DBRead constructor that was called in the ReadContext constructor above. It calls lockDB to do the real work. The code below calls other functions that lock mutexes, but no mutex is held by the caller when it is entered. In my case the block with "if (DB_LEVEL_LOCKING_ENABLED)" is entered and lockTop gets called to do the real work.

    Lock::DBRead::DBRead( const StringData& ns )
        : ScopedLock( 'r' ), _what(ns.toString()), _nested(false) {
        lockDB( _what );
    }

    void Lock::DBRead::lockDB(const string& ns) {
        fassert( 16254, !ns.empty() );
        LockState& ls = lockState();

        Acquiring a(this,ls);
        _locked_r=false;
        _weLocked=0;

        if ( ls.isRW() )
            return;
        if (DB_LEVEL_LOCKING_ENABLED) {
            StringData db = nsToDatabaseSubstring(ns);
            Nestable nested = n(db);
            if( !nested )
                lockOther(db);
            lockTop(ls);
            if( nested )
                lockNestable(nested);
        }
        else {
            qlk.lock_R();
            _locked_r = true;
        }
    }

Well, lockTop doesn't do the real work during my benchmark. It calls qlk.lock_r to do that.

    void Lock::DBRead::lockTop(LockState& ls) {
        switch( ls.threadState() ) {
        case 'r':
        case 'w':
            break;
        default:
            verify(false);
        case  0  :
            qlk.lock_r();
            _locked_r = true;
        }
    }

Almost there, just one more level of indirection. The call to qlk.lock_r calls the lock_r method on an instance of QLock and then something gets done.

    void lock_r() {
        verify( threadState() == 0 );
        lockState().lockedStart( 'r' );
        q.lock_r();
    }

    inline void QLock::lock_r() {
        boost::mutex::scoped_lock lk(m);
        while( !r_legal() ) {
            r.c.wait(m);
        }
        r.n++;
    }

Eventually the unlock_r method is called for the same instance of QLock. I won't show the route there however.

    inline void QLock::unlock_r() {
        boost::mutex::scoped_lock lk(m);
        fassert(16137, r.n > 0);
        --r.n;
        notifyWeUnlocked('r');
    }

And notifyWeUnlocked provides the special semantics. This includes not letting a new reader in when there is a pending write request. The code below also wakes all waiting write requests when one is waiting. This might cause many threads to be scheduled to run even though at most one will get the RW-lock. InnoDB does something similar.

    inline void QLock::notifyWeUnlocked(char me) {
        fassert(16201, W.n == 0);
        if ( me == 'X' ) {
            X.c.notify_all();
        }
        if( U.n ) {
            // U is highest priority
            if( (r.n + w.n + W.n + X.n == 0) && (R.n == 1) ) {
                U.c.notify_one();
                return;
            }
        }
        if ( X_legal() && i_block(me, 'X') ) {
            X.c.notify_one();
        }
        if ( W_legal() && i_block(me, 'W') ) {
            W.c.notify_one();
            if( _areQueueJumpingGlobalWritesPending() )
                return;
        }
        if ( R_legal_ignore_greed() && i_block(me, 'R') ) {
            R.c.notify_all();
        }
        if ( w_legal_ignore_greed() && i_block(me, 'w') ) {
            w.c.notify_all();
        }
        if ( r_legal_ignore_greed() && i_block(me, 'r') ) {
            r.c.notify_all();
        }
    }



Tuesday, April 22, 2014

Concurrent, read-only & cached: MongoDB, TokuMX, MySQL

This has results for a read-only workload where all data is cached. The test query fetches all columns in one document/row by PK. For InnoDB all data is in the buffer pool. For TokuMX and MongoDB all data is in the OS filesystem cache and accessed via mmap'd files. The test server has 40 CPU cores with HT enabled and the test clients share the host with mysqld/mongod to reduce variance from network latency. This was similar to a previous test, except the database is in cache and the test host has more CPU cores. The summary of my results is:
  • MongoDB 2.6 has a performance regression from using more CPU per query. The regression might be limited to simple queries that do single row lookups on the _id index. I spent a bit of time rediscovering how to get hierarchical CPU profile data from gperftools to explain this. JIRAs 13663 and 13685 are open for this.
  • MySQL gets much more QPS at high concurrency than MongoDB and TokuMX
  • MySQL gets more QPS using the HANDLER interface than SELECT. I expect the InnoDB memcached API to be even faster than HANDLER but did not test it.
  • MySQL uses more CPU per query in 5.7.4 than in 5.6.12 but this didn't have an impact on QPS

Setup

The test was repeated for 1, 2, 4, 8, 16, 32 and 40 concurrent clients. It uses my forked versions of the MongoDB and C clients for sysbench. There are 8 collections/tables in one database. Each table has 400M rows but queries are limited to the first 1M. I don't know yet whether using a database per collection would improve the MongoDB results. Each query fetches all columns in one document/row by PK. I have yet to push my changes to the MongoDB sysbench client to make it fetch all columns. I tested these binaries:
  • fb56.handler - MySQL 5.6.12 with the Facebook patch and 8k pages. Uses HANDLER to fetch data.
  • fb56.sql - MySQL 5.6.12 with the Facebook patch and 8k pages. Uses SELECT to fetch data.
  • orig57.handler - MySQL 5.7.4 without the Facebook patch and 8k pages. Uses HANDLER to fetch data.
  • orig57.sql - MySQL 5.7.4 without the Facebook patch and 8k pages. Uses SELECT to fetch data.
  • tokumx - TokuMX 1.4.1 using quicklz and 32kb pages. There should be no decompression during the test as all data used by the test (1M documents) is much smaller than 50% of RAM.
  • mongo24 - MongoDB 2.4.9
  • mongo26 - MongoDB 2.6.0

Results

At last I included a graph. I have been reluctant to include graphs on previous posts comparing MongoDB, TokuMX and MySQL because I want to avoid benchmarketing and drive-by analysis. These tests have been time consuming to run and document and I don't want to make it too easy to misinterpret the results. Results for MySQL 5.7.4 are not in the graph to make it easier to read. The top two bars (blue & red) are for MySQL and you can see that QPS increases with more concurrency. QPS for MongoDB and TokuMX saturates at a lower level of concurrency.
Numbers used for the graph above.

point queries per second
    1      2      4      8     16     32     40  clients
17864  32397  60294 106374 184566 298276 350665  fb56.handler
11730  22884  39646  73485 131533 215487 249402  fb56.sql
18161  33262  59413 107505 185894 306084 371045  orig57.handler
11775  21838  40528  75322 135331 227450 266917  orig57.sql
14298  25219  45743  83214 142489 168498 161840  tokumx
17203  30158  52476  94705 161922 174453 170177  mongo24
10705  19502  34318  61977 109684 152667 151555  mongo26

Analysis

I used vmstat to measure the average CPU utilization (user + system) during the test. The numbers below are: (CPU_utilization / QPS) * 1,000,000. There are some interesting details.
  • the values are larger for MySQL 5.7 than for 5.6 at low concurrency. Note that in both cases the performance schema was disabled at compile time.
  • the values are much larger for MongoDB 2.6 than for 2.4 and hopefully this can be fixed via JIRAs 13663 and 13685.
(CPU_utilization / QPS) * 1,000,000
  1      2      4      8     16     32     40  clients
218    197    197    208    216    251    268  fb56.handler
323    310    287    298    304    352    372  fb56.sql
357    279    240    216    215    248    250  orig57.handler
407    380    313    288    302    342    359  orig57.sql
272    269    251    254    266    302    296  tokumx
232    215    219    225    234    257    252  mongo24
373    333    340    342    355    425    422  mongo26

I also used vmstat to measure the context switch rate and the table below lists the number of context switches per query. Note that the rate decreases with concurrency for MySQL but not for MongoDB and TokuMX. I don't know enough about Linux internals to interpret this.

vmstat.cs / QPS

context switch per query
     1      2      4      8     16     32     40  clients
  4.44   4.14   4.01   3.79   3.47   3.05   2.19  fb56.handler
  4.61   4.32   4.03   3.84   3.59   3.23   2.65  fb56.sql
  4.53   4.27   4.07   3.88   3.52   3.08   2.20  orig57.handler
  4.81   4.48   4.19   3.96   3.63   3.07   2.19  orig57.sql
  4.59   4.30   4.08   3.87   3.77   4.32   4.32  tokumx
  4.54   4.23   4.03   3.84   3.79   4.29   4.30  mongo24
  4.80   4.43   4.21   3.99   3.93   4.58   4.63  mongo26

Hierarchical CPU profiling for MongoDB

One day it will be easy to get hierarchical CPU profile results for open-source databases using open-source profiling tools. Support for CPU profiling via Google perftools can be compiled into MongoDB via the --use-cpu-profiler option. Given the use of a compiler toolchain in a nonstandard location I also used --extrapath and --extralib to help it find libunwind. However, the profiler output file was mostly empty after doing this and did not have any per-thread results.

For future reference, profiling was started and stopped via:
echo "db.runCommand({ _cpuProfilerStart: { profileFilename: '/path/to/output' } })" | bin/mongo admin
echo "db.runCommand({ _cpuProfilerStop:1})" | bin/mongo admin
Eventually I remembered that it might help to add a call to ProfilerRegisterThread() at the start of each new thread. I did something similar in the Google patch for MySQL many years ago. And now I have hierarchical CPU profiles to help understand the source of performance regressions in MongoDB 2.6. I updated JIRA 6628 with details from this and then was asked to create JIRA 13683. The change is to handleIncomingMsg():
        static void* handleIncomingMsg(void* arg) {
            TicketHolderReleaser connTicketReleaser( &Listener::globalTicketHolder );
            ::ProfilerRegisterThread(); // Add this

Friday, April 18, 2014

Biebermarks

Yet another microbenchmark result. This one is based on behavior that has caused problems in the past for a variety of reasons, which led to a few interesting discoveries. The first was that using a short lock-wait timeout was better than the InnoDB deadlock detection code. The second was that stored procedures could overcome network latency.

The workload is a large database where all updates are done to a small number of rows. I think it is important to use a large database to include the overhead from searching multiple levels of a b-tree. The inspiration for this is maintaining counts for popular entities like Justin Bieber and One Direction. This comes from serving the social graph. For more on that read about TAO and LinkBench.

The most popular benchmark for MySQL is sysbench and it is usually run with a uniform distribution so that all rows are equally likely to be queried or modified. But real workloads have skew which can cause stress in unexpected places and I describe one such place within InnoDB from this microbenchmark. YCSB and LinkBench are benchmarks that have skew and can be run for MySQL. I hope that more of the MySQL benchmark results in the future include skew.

Configuration

See a previous post for more details. Eight collections/tables with 400M documents/rows per collection/table were created. All collections/tables are in one database so MongoDB suffers from the per-database RW-lock. But MySQL and TokuMX also suffer from a similar issue when all clients are trying to update the same row. Tests were run for 1, 2, 4 and 8 tables where one row per table was updated. So when the test used 4 tables there were 4 rows getting updates. For each number of tables tests were run for up to 64 concurrent clients/threads. The result tables listed in the next section should make that clear.

The workload is updating the non-indexed column of one document/row by PK per transaction. There are no secondary indexes on the table. In this case the document/row with ID=1 is chosen for every update. For MySQL and TokuMX an auto-commit transaction is used. The journal (redo log) is used but the update does not wait for the journal/log to be forced to disk. The updates should not require disk reads as all relevant index and data blocks remain in cache. TokuMX might do reads in the background to maintain fractal trees but I don't understand their algorithm well enough to be certain.

The database was loaded in PK order and about 8 hours of concurrent & random updates were done to warmup the database prior to this test. The warmup was the same workload as described in a previous post.

The MySQL test client limits clients to one table. So when there are 64 clients and 8 tables then there are 8 clients updating the 1 row per table. The MongoDB/TokuMX client does not do that. It lets all clients update all tables so in this case there are at most 64 clients updating the row per table and on average there would be 8.

The test server has 40 CPU cores with HT enabled, fast flash storage and 144G of RAM. The benchmark client and database servers shared the host. Tests were run for several configurations:
  • mongo26 - MongoDB 2.6.0rc2, powerOf2Sizes=1, journalCommitInterval=300, w:1,j:0
  • mongo24 - MongoDB 2.4.9, powerOf2Sizes=0, journalCommitInterval=300, w:1,j:0
  • mysql - MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=120G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=16000, lru_scan_depth=2000, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=0
  • toku-32 - TokuMX 1.4.1, readPageSize=32k, quicklz compression, logFlushPeriod=300, w:1,j:0. I don't have results for toku-32 yet.
  • toku-64 - TokuMX 1.4.1, readPageSize=64k, quicklz compression, logFlushPeriod=300, w:1,j:0

Results per DBMS

I first list the results by DBMS to show the impact from spreading the workload over more rows/tables. The numbers below are the updates per second rate. I use "DOP=X" to indicate the number of concurrent clients and "DOP" stands for Degree Of Parallelism (it is an Oracle thing). A few conclusions from the results below:

  • MySQL/InnoDB does much better with more tables for two reasons. The first is that it allows for more concurrency. The second is that it avoids some of the overhead in the code that maintains row locks and threads waiting for row locks. I describe that in more detail at the end of this post.
  • MongoDB 2.4.9 is slightly faster than 2.6.0rc2. I think the problem is that mongod requires more CPU per update in 2.6 versus 2.4 and this looks like a performance regression in 2.6 (at least in 2.6.0rc2). I am still profiling to figure out where. More details on this are at the end of the post. I filed JIRA 13663 for this.
  • MongoDB doesn't benefit from spreading the load over more collections when all collections are in the same database. This is expected given the per-database RW-lock.

Updates per second
config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         1   8360  15992  30182  24932   23924   23191   21048
mysql         2      X  16527  30824  49999   41045   40506   38357
mysql         4      X      X  32351  51791   67423   62116   59137
mysql         8      X      X      X  54826   80409   73782   68128

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mongo24       1  10212  17844  30204  34003   33895   33564   33451
mongo24       2      X  10256  17698  30547   34125   33717   33573
mongo24       4      X      X  10670  17690   30903   34027   33586
mongo24       8      X      X      X  10379   17702   30920   33758

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mongo26       1   9187  16131  27648  28506   27784   27437   27021
mongo26       2      X   9367  16035  27490   28326   27746   27354
mongo26       4      X      X   9179  16028   27666   28330   27647
mongo26       8      X      X      X   9125   16038   27275   27858

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
toku-64       1   7327  12804  16179  12154   11021    9990    8344
toku-64       2      X   7173  12690  20483   23064   22354   20349
toku-64       4      X      X   7191  12943   21399   33485   40124
toku-64       8      X      X      X   7121   12727   22096   38207

Results per number of tables

This reorders the results from above to show them for all configurations at the same number of tables. You are welcome to draw conclusions about which is faster.

Updates per second
config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         1   8360  15992  30182  24932   23924   23191   21048
mongo24       1  10212  17844  30204  34003   33895   33564   33451
mongo26       1   9187  16131  27648  28506   27784   27437   27021
toku-64       1   7327  12804  16179  12154   11021    9990    8344

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         2      X  16527  30824  49999   41045   40506   38357
mongo24       2      X  10256  17698  30547   34125   33717   33573
mongo26       2      X   9367  16035  27490   28326   27746   27354
toku-64       2      X   7173  12690  20483   23064   22354   20349

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         4      X      X  32351  51791   67423   62116   59137
mongo24       4      X      X  10670  17690   30903   34027   33586
mongo26       4      X      X   9179  16028   27666   28330   27647
toku-64       4      X      X   7191  12943   21399   33485   40124

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         8      X      X      X  54826   80409   73782   68128
mongo24       8      X      X      X  10379   17702   30920   33758
mongo26       8      X      X      X   9125   16038   27275   27858
toku-64       8      X      X      X   7121   12727   22096   38207

Row locks for InnoDB

I used PMP to understand MySQL/InnoDB on this workload. I frequently saw all user threads blocked on a condition variable with this stack trace. It seems odd that all threads are sleeping. I think the problem is that one thread can run but has yet to be scheduled by Linux. My memory of the row lock code is that it wakes threads in FIFO order and when N threads wait for a lock on the same row then each thread waits on a separate condition variable. I am not sure if this code has been improved in MySQL 5.7. A quick reading of some of the 5.6.12 row lock code showed many mutex operations. Problems in this code have escaped scrutiny in the past because much of our public benchmark activity has used workloads with uniform distributions.
pthread_cond_wait@@GLIBC_2.3.2,os_cond_wait,os_event_wait_low2,lock_wait_suspend_thread,row_mysql_handle_errors,row_search_for_mysql,ha_innobase::index_read,handler::read_range_first,handler::multi_range_read_next,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection
This was a less frequent stack trace from the test ...
lock_get_mode,lock_table_other_has_incompatible,lock_table,row_search_for_mysql,ha_innobase::index_read,handler::read_range_first,handler::multi_range_read_next,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection

Row locks for TokuMX

TokuMX has a similar point at which all threads wait. It isn't a big surprise given that both provide fine-grained concurrency control but there is no granularity finer than a row lock.
pthread_cond_timedwait@@GLIBC_2.3.2,toku_cond_timedwait,toku::lock_request::wait,toku_db_wait_range_lock,toku_c_getf_set(__toku_dbc*,,db_getf_set,autotxn_db_getf_set(__toku_db*,,mongo::CollectionBase::findByPK(mongo::BSONObj,mongo::queryByPKHack(mongo::Collection*,,mongo::updateObjects(char,mongo::lockedReceivedUpdate(char,mongo::receivedUpdate(mongo::Message&,,mongo::assembleResponse(mongo::Message&,,mongo::MyMessageHandler::process(mongo::Message&,,mongo::PortMessageServer::handleIncomingMsg(void*)

MongoDB 2.4 versus 2.6

I get about 1.2X more updates/second with MongoDB 2.4.9 compared to 2.6.0rc2. I think the problem is that 2.6 uses more CPU per update. I filed JIRA 13663 for this but am still trying to profile the code. So far I know the following, all of which indicates that the 2.4.9 test is running 1.2X faster than 2.6.0rc2 with 32 client threads and 1 table:
  • I get ~1.2X more updates/second with 2.4.9
  • the Java sysbench client uses ~1.2X more CPU per "top" with 2.4.9
  • the context switch rate is ~1.2X higher with 2.4.9
The interesting point is that mongod for 2.4.9 only uses ~1.03X more CPU than 2.6.0rc2 per "top" during this test even though it is doing 1.2X more updates/second. So 2.6.0rc2 uses more CPU per update. I will look at "perf" output. I can repeat this with the GA version of 2.6.

Wednesday, April 16, 2014

TokuMX, MongoDB and InnoDB, IO-bound update-only with fast storage

I repeated the update-only IO-bound tests using pure-flash servers to compare TokuMX, MongoDB and InnoDB. The test setup was the same as on the pure-disk servers except for the hardware. In this case the servers have fast flash storage, 144G of RAM and 24 CPU cores with HT enabled. As a reminder, the InnoDB change buffer and TokuMX fractal tree don't help on this workload because there are no secondary indexes to maintain. Note that all collections/tables are in one database for this workload thus showing the worst-case for the MongoDB per-database RW-lock. The result summary:
  • InnoDB is much faster than MongoDB and TokuMX. This test requires a high rate of dirty page writeback and thanks to a lot of work from the InnoDB team at MySQL with help from Percona and Facebook (and others) the InnoDB engine is now very good at that. Relative to MongoDB, InnoDB also benefits from a clustered PK index.
  • MongoDB is much slower than InnoDB for two reasons. First, it doesn't have a clustered PK index so it might do storage reads both for the index search and for reading the document. The second reason is the per-database RW-lock. As I described previously this lock appears to be held during disk reads when the index is searched, so at most one thread searches the index at a time even though there are concurrent update requests. I created JIRA 3177 to make that obvious in the documentation. Because of this the peak rate for MongoDB is approximately the number of reads per second that one thread can do from the flash device. The device can sustain many more reads/second with concurrency but MongoDB doesn't get much benefit from it. I think there will be at most 2 concurrent flash/disk reads at any time -- one while searching the index and the other while prefetching the document into RAM after releasing the per-database RW-lock in Record::touch.
  • TokuMX also benefits from the clustered PK index but it suffers from other problems that I was unable to debug. I think it can do much better once a Toku expert reproduces the problem on their hardware.

Configuration

This test used the sysbench clients as described previously. Tests were run for 8, 16, 32 and 64 concurrent clients. There were 8 collections/tables in one database with 400M documents/rows per collection/table. The test server has fast flash storage that can do more than 5000 reads/second from one thread and more than 50,000 reads/second from many threads.  The server also has 24 CPU cores with HT enabled and 144G of RAM. The sysbench clients ran on the same host as mysqld/mongod. Tests were first run for 30 minutes at each concurrency level to warmup the DBMS and then for either 60 or 120 minutes when measurements were taken. I tested the configurations listed below. I ran tests for more configurations but forgot to adjust read_ahead_kb so I won't publish results from those hosts.
  • mongo-p2y - 874 GB database, MongoDB 2.6.0rc2, powerOf2Sizes=1, journalCommitInterval=300, w:1,j:0
  • mysql-4k - 698 GB database, MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=120G, flush_method=O_DIRECT, page_size=4k, doublewrite=0, io_capacity=16000, lru_scan_depth=2000, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=0
  • mysql-8k - 698 GB database, MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=120G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=16000, lru_scan_depth=2000, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=0
  • tokumx-quicklz - 513 GB database, TokuMX 1.4.1 with quicklz compression, logFlushPeriod=300, w:1,j:0

Results

That probably isn't a typo below. InnoDB sustained about 5 to 10 times more updates/second. MongoDB does many more disk reads per update which is similar to the pure-disk results. I don't have the expertise to explain why TokuMX results weren't better but I shared information with the Tokutek team. Bytes written to storage per update is listed for InnoDB to show the impact on the write rate from using a smaller page. That can be important when flash endurance must be improved.

TPS
configuration  8 clients  16 clients  32 clients  64 clients
mysql-8k           24834       33886       37573       40198
mysql-4k           24826       31704       34644       34987
tokumx-quicklz      3706        3950        3552        3357
mongo-p2y           5194        5167        5173        5102

disk reads/second from iostat r/s
configuration  8 clients  16 clients  32 clients  64 clients
mysql-8k           20995       28371       31397       33537
mysql-4k           22016       27985       30553       30972
tokumx-quicklz      4943        5641        4962        4783
mongo-p2y           8960        8921        8951        8859

disk reads per update
configuration  8 clients  16 clients  32 clients  64 clients
mysql-8k            0.85        0.84        0.84        0.83
mysql-4k            0.89        0.88        0.88        0.89
tokumx-quicklz      1.33        1.43        1.40        1.42
mongo-p2y           1.73        1.73        1.73        1.74 

KB written per update
configuration  8 clients  16 clients  32 clients  64 clients
mysql-8k            6.56        6.40        5.36        5.36
mysql-4k            3.86        3.72        3.76        3.78

Types of writes

What does it mean to make writes fast? It helps to distinguish between the different types of writes. The slowest is a write that must be implemented as read-modify-write. This might require a disk read and can also create contention from preventing concurrent changes to the row for the duration of the read, modify and write. The row might not be unlocked until the change is made durable on storage (commit, fsync, etc) which lets you estimate the peak rate at which a single row can be changed on a traditional DBMS. And this latency between changes can get even worse when there is sync replication or multiple client-server round trips per transaction. The UPDATE statement in SQL is usually implemented as read-modify-write. Some DBMS engines require locking to be done above the DBMS because they don't support locking across operations where read and write are separate operations (RocksDB is an example). Other DBMS engines compensate for that with a conditional put that performs the write only when parts of the row have not changed, like checkAndPut in HBase. But if the client does a read prior to the write then the overhead from the read still exists.

Some UPDATE statements perform changes that are commutative and it is possible to avoid the read prior to the write. That optimization is rarely implemented but it is possible in RocksDB with the merge operator, in TokuDB, and in Redis. Increment and decrement are examples of commutative operations. But this also requires upsert behavior to handle the case where the row to be changed doesn't exist. If the read is to be skipped it also means that a result cannot be returned -- that is the count of rows changed or the old/new value of the changed column.

A blind-write is the easiest type of write to make fast. This can be done via a Put call with a key-value store like LevelDB or RocksDB. Some SQL operations can also use a blind-write if we have an option to not return the count of changed rows when the statement is executed and the operation behaves like an upsert. This is rarely implemented but TokuDB might help people appreciate it.

So there are at least 3 types of writes and from slowest to fastest they are read-modify-write, commutative-write and blind-write. Hopefully these optimizations will become more common in new database engines. From the perspective of the write operation there shouldn't be a performance difference between commutative-write and blind-write. But the query latency for a row subject to commutative-write can be much worse than for blind-write because many old updates might have to be read and merged.
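
To make the three types concrete, here is a sketch against a made-up key-value interface (KVStore and its methods are my names, not any product's API):

    #include <map>
    #include <string>

    // Made-up in-memory key-value store, only to illustrate the three write types.
    struct KVStore {
        std::map<std::string, long> data;

        bool get(const std::string& k, long* v) const {
            std::map<std::string, long>::const_iterator it = data.find(k);
            if (it == data.end()) return false;
            *v = it->second;
            return true;
        }
        void put(const std::string& k, long v) { data[k] = v; }
        void merge_add(const std::string& k, long delta) { data[k] += delta; }
    };

    // 1) read-modify-write: the old value is read first (possibly a disk read in
    //    a real engine) and the row stays locked across read, modify and write.
    void increment_rmw(KVStore& kv, const std::string& key) {
        long v = 0;
        kv.get(key, &v);
        kv.put(key, v + 1);
    }

    // 2) commutative-write: the increment is applied without returning the old
    //    value; a log-structured engine could defer the merge, at the cost of
    //    queries having to merge pending increments later.
    void increment_commutative(KVStore& kv, const std::string& key) {
        kv.merge_add(key, 1);
    }

    // 3) blind-write: overwrite whatever was there; nothing is read and nothing is
    //    returned about the old value or the number of rows changed.
    void set_blind(KVStore& kv, const std::string& key, long value) {
        kv.put(key, value);
    }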

More details on disk IO-bound, update only for MongoDB, TokuMX and InnoDB

This has a few more details on the results for update-only sysbench using a disk IO-bound workload. I describe the impact from changing innodb_flush_neighbors. The parameter can be set to write back some dirty pages early when other pages in the same extent must be written back. The goal is to reduce the number of disk seeks consumed by page writeback and this can help on disk based servers.

Judging from both the TPS results and the amount of data written to disk per update, there might be a small impact from changing innodb_flush_neighbors on this workload. In a previous blog post I explained the impact of this parameter on the insert benchmark for pure-disk servers. The benefit there was much greater than here. I think there are fewer dirty pages per extent in this workload because the database is much larger than RAM so the feature is less likely to be used.

The test configuration is described in a previous post. The only difference here is that I repeated the test for innodb_flush_neighbors set to 0, 1 and 2.

TPS
configuration      8 clients  16 clients  32 clients  64 clients
flush_neighbors=0        498         677         926         969
flush_neighbors=1        506         692         848        1004
flush_neighbors=2        543         737         913        1043

KB/update written to disk
configuration      8 clients  16 clients  32 clients  64 clients
flush_neighbors=0       7.95        6.75        5.88        6.55
flush_neighbors=1       7.91        6.71        6.53        6.50
flush_neighbors=2       7.78        6.68        6.68        6.74




Tuesday, April 15, 2014

MongoDB, TokuMX and InnoDB for disk IO-bound, update-only by PK

I used sysbench to measure TPS for a workload that does 1 update by primary key per transaction. The database was much larger than RAM and the server has a SAS disk array that can do at least 2000 IOPs with a lot of concurrency. The update is to a non-indexed column so there is no secondary index maintenance which also means there is no benefit from a fractal tree in TokuMX or the change buffer in InnoDB. I also modified the benchmark client to avoid creating a secondary index. Despite that TokuMX gets almost 2X more TPS than InnoDB and InnoDB gets 3X to 5X more TPS than MongoDB.
  • TokuMX is faster because it doesn't use (or waste) random IOPs on writes so more IO capacity is available for reads. In this workload an update is a read-modify-write operation where the read is likely to require a disk read.
  • MongoDB is slower for two reasons. The first reason is the per-database RW-lock and the result doesn't get better with more concurrent clients. For this test all collections were in one database. The lock is held while the b-tree index for the PK is searched to find the document to update. Disk reads might be done when the lock is held. The second reason is that it does twice the number of disk reads per update while InnoDB & TokuMX do about 1 per update. Part of the difference is that InnoDB and TokuMX have clustered PK indexes but the results are much worse than I expected for MongoDB. I wonder if caching of index leaf blocks is not as effective as I expect or if I am wrong to expect this. Maybe this is one of the problems of depending on the OS VM to cache the right data.

Yield on page fault

The TPS results for MongoDB are limited by disk read latency. Even though there is a disk array that can do a few thousand random reads per second, the array sustains about 150 reads/second when there is a single stream of IO requests. And the per-database RW-lock guarantees that is the case. So MongoDB won't get more than 1 / disk-read-latency updates per second for this test regardless of the number of disks in the array or number of concurrent clients. At about 150 single-stream reads/second that puts the ceiling at roughly 150 to 170 updates/second, which matches the TPS results below.

MongoDB documentation mentions that the per-database RW-lock can be yielded on page faults but the documentation wasn't specific enough for me. I think this is what you need to know and I hope MongoDB experts correct any mistakes.
  1. Yield is only done for access to documents. It is not done while accessing primary or secondary indexes. To see in the code where a yield might be done search for calls to Record::_accessing() which throws PageFaultException. The record might also be "prefetched" after releasing the per-database RW-lock via a call to Record::touch().
  2. Yield is done on predicted page faults, not on actual page faults. AFAIK, a signal handler for SIGSEGV could be used to do this for actual page faults and MongoDB creates a handler for SIGSEGV but only to print a stack trace before exiting. MongoDB has something like an LRU to track memory references and predict page faults. I haven't spent much time trying to figure out that code but have seen those functions use a lot of CPU time for some benchmarks. I am curious why the btree code uses that tracking code (it calls likelyInPhysicalMemory). To learn more about the page fault prediction code read the functions Record::likelyInPhysicalMemory and Record::_accessing and the classes PageFaultException and Rolling.
From reading the above you should assume that you really want all indexes to be cached in RAM. Alas that can be hard to do for big data databases. For this test my server has 72G of RAM and the PK indexes are 83G. So I know that all of the indexes won't be cached.
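
My mental model of that flow, as a conceptual sketch rather than MongoDB source (every name below is invented):

    // Conceptual sketch of yield-on-predicted-page-fault; not MongoDB code.
    struct PageFaultException {};

    struct FakeRecord {
        bool in_memory;                      // stand-in for the page fault prediction
        void touch() { in_memory = true; }   // stand-in for reading the page from disk
    };

    template <class Lock, class Op>
    void run_with_yield(Lock& db_lock, FakeRecord& rec, Op op) {
        for (;;) {
            db_lock.lock();                  // per-database lock, read mode
            try {
                if (!rec.in_memory)
                    throw PageFaultException();   // predicted fault, not a real SIGSEGV
                op(rec);                          // access the document under the lock
                db_lock.unlock();
                return;
            } catch (PageFaultException&) {
                db_lock.unlock();            // yield the per-database lock ...
                rec.touch();                 // ... do the expected disk read without it ...
            }                                // ... then retry the operation
        }
    }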

I tried to overcome disk read stalls during index searching by changing the Java sysbench client to manually prefetch the to-be-updated document by calling findOne prior to the update. That improved TPS by about 20%. I hoped for more but the prefetch attempt needs a read-lock and pending write-lock requests on the per-database RW-lock appear to block new read-lock requests. I think this is done to prevent write-lock requests from getting starved. My attempt is not a workaround.

Configuration

This test used the sysbench clients as described previously. Tests were run for 8, 16, 32 and 64 concurrent clients. There were 8 collections/tables in one database with 400M documents/rows per collection/table. The test server has a SAS disk array that can do more than 2000 IOPs with many concurrent requests, 16 CPU cores with HT enabled and 72G of RAM. The sysbench clients ran on the same host as mysqld/mongod. Tests were first run for 30 minutes at each concurrency level to warmup the DBMS and then for either 60 or 120 minutes when measurements were taken. I tested these configurations:
  • mongo-p2y - 874 GB database, MongoDB 2.6.0rc2, powerOf2Sizes=1, journalCommitInterval=300, w:1,j:0
  • mongo-p2n - 828 GB database, MongoDB 2.6.0rc2, powerOf2Sizes=0, journalCommitInterval=300, w:1,j:0
  • mysql - 698 GB database, MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=60G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=3000, lru_scan_depth=500, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=2
  • mysql-zlib - 349 GB database, MySQL 5.6.12, InnoDB 2X compression (key_block_size=4) via zlib, flush_log_at_trx_commit=2, buffer_pool_size=60G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=3000, lru_scan_depth=500, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=1
  • tokumx-quicklz - 513 GB database, TokuMX 1.4.1 with quicklz compression, logFlushPeriod=300, w:1,j:0
  • tokumx-zlib - 385 GB database, TokuMX 1.4.1 with zlib compression, logFlushPeriod=300, w:1,j:0

Results

MongoDB does twice the number of disk reads per update compared to TokuMX and InnoDB. MongoDB TPS does not increase with concurrency. TPS does increase with concurrency for InnoDB and TokuMX which benefit from having many more concurrent pending disk reads. TokuMX does better than InnoDB because it doesn't use random IOPs for database page writes so there is more capacity remaining for reads.

TPS
configuration  8 clients  16 clients  32 clients  64 clients
tokumx-zlib          888        1267        1647        2034
tokumx-quicklz       870        1224        1567        1915
mysql-zlib           562         809         983        1140
mysql                543         737         913        1043
mongo-p2y            168         168         169         169
mongo-p2n            168         169         168         169

iostat r/s
configuration  8 clients  16 clients  32 clients  64 clients
tokumx-zlib          924        1279        1650        2032
tokumx-quicklz       891        1243        1600        1948
mysql-zlib           520         727         862         966
mysql                512         695         855         970
mongo-p2y            337         340         342         344
mongo-p2n            343         347         350         350

disk reads per update
configuration  8 clients  16 clients  32 clients  64 clients
tokumx-zlib         1.04        1.01        1.00        1.00
tokumx-quicklz      1.02        1.02        1.02        1.02
mysql-zlib          0.93        0.90        0.88        0.85
mysql               0.94        0.94        0.94        0.93
mongo-p2y           2.01        2.02        2.02        2.04
mongo-p2n           2.04        2.05        2.08        2.07

Sunday, April 13, 2014

Why aren't you using X, version 2

Sometimes I get asked why I am not using product X where X is anything but MySQL. The products that are suggested change over time and the value of X very much depends on the person asking the question. An ex-manager from my days at Oracle told me that Oracle would be better and developers from the SQL Server team told me the same. For those keeping score there was a social network that ran SQL Server and they were kind enough to explain why.

Too often this is an assertion rather than a question and it would be more clear to say "I think you should be using X". A better question would be "Why are you using MySQL". This is the burden we carry for running MySQL at scale, but I am not in search of sympathy. There are several possible answers.

Getting better with age

This might explain existing deployments:
  1. It was there when we arrived
  2. We made it better (along with Oracle, Percona, MariaDB, etc)

It is pretty good

With the quality of the 5.6 release and features likely to appear in 5.7, I expect MySQL to get many new deployments. This isn't just a legacy thing. MySQL performance is excellent for IO-bound workloads and almost excellent for in-memory. Manageability is about to get much better when GTID, parallel apply and enhanced semi-sync are deployed (or deployable). It isn't perfect. We need more features, PS usability is a work in progress and all of the replication goodness might not be there for most users until 5.7 is GA. 

Scalability

Sometimes I am told that something else scales better, but scalability is rarely defined. Context is very important here. If your deployment has a few servers then you want to minimize management overhead as the cost for people is larger than for hardware. But things change when a small team runs a huge number of servers and for that it is very important to minimize hardware cost by using a DBMS that is efficient for some combination of high-QPS, IO-bound, read-heavy or write-heavy workloads. Note that a small team running a huge deployment is an existence proof that one or both of these are true -- the team is awesome, the product is manageable.

Leaving out quality of service, a simple definition for scalability is that a given workload requires A people, B hardware units and C lines of automation code. For something to scale better than MySQL it should reduce some of A, B and C. For many web-scale deployments the cost of C has mostly been paid and migrating to something new means a large cost for C. Note that B represents many potential bottlenecks. The value of B might be large to get more IOPs for IO-bound workloads with databases that are much bigger than RAM. It might be large to get more RAM to keep everything cached. Unfortunately, some deployments are not going to fully describe that context (some things are secret). The value of A is influenced by the features in C and the manageability features in the DBMS but most web-scale companies don't disclose the values of B and A.

Thursday, April 10, 2014

MongoDB, TokuMX and InnoDB for concurrent inserts

I used the insert benchmark with concurrent insert threads to understand performance limits in MongoDB, TokuMX and InnoDB. The database started empty and eventually was much larger than RAM. The benchmark requires many random writes for secondary index maintenance with the update-in-place b-trees used by MongoDB and InnoDB. The test server has fast flash storage. The work per transaction for this test is inserting 1000 documents/rows where each document/row is small (100 bytes) and has 3 secondary indexes to maintain. The test used 10 client connections to run these transactions concurrently and each client uses a separate collection/table. The performance summaries listed below are based on the context for this test -- fast storage, insert heavy with secondary index maintenance. My conclusion from running many insert benchmark tests is that I don't want to load big databases with MongoDB when secondary index maintenance must be done during the load. Sometimes creating the indexes after the load is an option but performance in this area must be improved.

The performance summary for the workload when the database is cached (smaller than RAM).
  • InnoDB and TokuMX are always much faster than MongoDB except when database-per-collection is used and the MongoDB journal is disabled. I did not attempt to run InnoDB or TokuMX with the journal disabled or even on tmpfs so I am not comparing the same thing in that case. Reasons for better performance from InnoDB include the insert buffer, less bloat in the database, more internal concurrency and a more mature b-tree (sometimes older is better). Reasons for better performance from TokuMX include fractal trees, compression and more internal concurrency. AFAIK, the MongoDB write lock is not released when there is a disk read (page fault) during secondary index maintenance. Even when there aren't faults at most one client searches the b-tree indexes at a time.
  • InnoDB is faster than TokuMX
  • MongoDB inserts/second are doubled by using a database-per-collection compared to one database for all collections
  • MongoDB inserts/second are doubled by disabling the journal
  • MongoDB performance doesn't change much between using j:1 (fsync-on-commit) and w:1,j:0 (fsync a few times per second). 
The performance summary for the workload when the database is much larger than RAM. 
  • Eventually TokuMX is much faster than InnoDB. This is expected for the insert benchmark.
  • TokuMX and InnoDB are much faster than MongoDB. TPS degrades as the database grows: slowly for TokuMX, faster for InnoDB and fastest for MongoDB.
  • Disabling the journal doesn't help MongoDB. The bottleneck is elsewhere.
  • Not using fsync-on-commit doesn't help MongoDB. The bottleneck is elsewhere.
  • Using database-per-collection doesn't do much to help MongoDB. The bottleneck is elsewhere.

Configuration

The Java and Python insert benchmark clients were used to load up to 3B documents/rows into 10 collections/tables, which is also up to 300M documents/rows per collection/table. For MongoDB the tests were run with all collections in one database and then again with a database-per-collection. Tests were run as a sequence of rounds where 100M documents/rows were loaded per round (10M per client) and performance metrics were computed per round. The test hosts have 144GB of RAM, PCIe flash and 24 CPU cores with HT enabled. The client and DBMS software ran on the same host. Several configurations were tested:
  • inno-lazy - MySQL 5.6.12, InnoDB, doublewrite=0, page_size=8k, flush_log_at_trx_commit=2
  • inno-sync - MySQL 5.6.12, InnoDB, doublewrite=0, page_size=8k, flush_log_at_trx_commit=1
  • toku-lazy - TokuMX 1.4.1, logFlushPeriod=300, w:1,j:0
  • toku-sync - TokuMX 1.4.1, logFlushPeriod=0, j:1
  • mongo24-1db-noj - MongoDB 2.4.9, nojournal=true, 10 collections in 1 database
  • mongo26-1db-noj - MongoDB 2.6.0rc2, nojournal=true, 10 collections in 1 database
  • mongo24-10db-noj - MongoDB 2.4.9, nojournal=true, database per collection
  • mongo26-10db-noj - MongoDB 2.6.0rc2, nojournal=true, database per collection
  • mongo24-1db-lazy - MongoDB 2.4.9, journalCommitInterval=300, w:1,j:0, 10 collections in 1 database
  • mongo26-1db-lazy - MongoDB 2.6.0rc2, journalCommitInterval=300, w:1,j:0, 10 collections in 1 database
  • mongo24-10db-lazy - MongoDB 2.4.9, journalCommitInterval=300, w:1,j:0, database per collection
  • mongo26-10db-lazy - MongoDB 2.6.0rc2, journalCommitInterval=300, w:1,j:0, database per collection
  • mongo24-1db-sync - MongoDB 2.4.9, journalCommitInterval=2, j:1, 10 collections in 1 database
  • mongo26-1db-sync - MongoDB 2.6.0rc2, journalCommitInterval=2, j:1, 10 collections in 1 database
  • mongo24-10db-sync - MongoDB 2.4.9, journalCommitInterval=2, j:1, database per collection
  • mongo26-10db-sync - MongoDB 2.6.0rc2, journalCommitInterval=2, j:1, database per collection
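
For reference, this is a minimal sketch of how the -sync, -lazy and -noj durability levels above map to client write concern in pymongo. The server-side journal options (journalCommitInterval, nojournal) are the ones listed in the configurations and are set on mongod, not by the client:

  from pymongo import MongoClient

  # "-sync": acknowledged and journaled per commit (j:1);
  # the server runs with journalCommitInterval=2
  sync_client = MongoClient("localhost", 27017, w=1, j=True)

  # "-lazy": acknowledged but not journaled per commit (w:1,j:0);
  # the server runs with journalCommitInterval=300
  lazy_client = MongoClient("localhost", 27017, w=1, j=False)

  # "-noj": the server runs with nojournal=true, so journaled writes are not
  # possible and the client only asks for w:1
  noj_client = MongoClient("localhost", 27017, w=1)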

Results @100M rows

Legend for the columns (a short sketch showing how the derived values are computed follows the notes):
  • DB-size - size of the database at the end of the test round
  • Bytes-per-doc - DB-size divided by the count of documents/rows loaded so far
  • Write-rate - average rate of bytes written to storage during the test round measured by iostat
  • Bytes-written - total bytes written to storage during the test round
  • Test-secs - number of seconds to complete the test round
  • TPS - average rate of documents/rows inserted per second during the test round (each transaction inserts 1000 documents/rows)
  • Server - the tested configuration
Notes:
  • I don't know why inno-sync is a bit faster than inno-lazy; perhaps hardware variance was the cause. The key point is that fsync-on-commit isn't significant for this test. It also has a small impact for TokuMX.
  • MongoDB TPS is only close to InnoDB & TokuMX when the journal is disabled and a database per collection is used
  • There is up to a 2X benefit with MongoDB from using a database per collection

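The sketch below shows approximately how the derived columns are computed from the raw measurements. It is an illustration, not the exact accounting used by the test harness:

  def round_metrics(db_size_bytes, docs_total, docs_this_round,
                    iostat_wmb_samples, test_secs):
      # Bytes-per-doc: database size divided by documents/rows loaded so far
      bytes_per_doc = db_size_bytes / float(docs_total)
      # Write-rate: average of the per-second iostat write MB/s samples
      write_rate_mb = sum(iostat_wmb_samples) / float(len(iostat_wmb_samples))
      # Bytes-written: roughly Write-rate multiplied by Test-secs
      bytes_written_gb = write_rate_mb * test_secs / 1000.0
      # Insert rate: documents/rows inserted per second during the round
      inserts_per_sec = docs_this_round / float(test_secs)
      return bytes_per_doc, write_rate_mb, bytes_written_gb, inserts_per_sec
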
DB-size  Bytes-per-doc Write-rate  Bytes-written  Test-secs     TPS   Server
  17 GB     182          164 MB/s     77 GB           494    202672   inno-sync
  17 GB     182          159 MB/s     81 GB           540    185742   inno-lazy
  15 GB     161           60 MB/s     42 GB           686    145720   toku-sync
  14 GB     150           61 MB/s     41 GB           663    150718   toku-lazy
   X GB       X          140 MB/s   1606 GB         11451      8743   mongo24-1db-sync
  38 GB     400          148 MB/s   1566 GB         10586      9463   mongo26-1db-sync
  60 GB     644          205 MB/s   1337 GB          6485     15921   mongo24-10db-sync
  40 GB     429          216 MB/s   1310 GB          6069     16933   mongo26-10db-sync
  36 GB     387          138 MB/s   1611 GB         11640      8593   mongo24-1db-lazy
   X GB       X          147 MB/s   1574 GB         10702      9352   mongo26-1db-lazy
  60 GB     644          181 MB/s   1267 GB          6989     14684   mongo24-10db-lazy
  40 GB     429          189 MB/s   1251 GB          6629     15610   mongo26-10db-lazy
  37 GB     397          206 MB/s    995 GB          4819     20752   mongo24-1db-noj
   X GB       X          234 MB/s    878 GB          3742     26736   mongo26-1db-noj
  60 GB     644          370 MB/s    456 GB          1226     81663   mongo24-10db-noj
  40 GB     429          600 MB/s    348 GB           577    179224   mongo26-10db-noj

Results @500M rows

MongoDB TPS starts to drop as the database becomes larger than RAM.

DB-size  Bytes-per-doc Write-rate  Bytes-written  Test-secs     TPS   Server
  83 GB     178          269 MB/s    183 GB           708    141290   inno-sync
  83 GB     178          269 MB/s    180 GB           710    141611   inno-lazy
  44 GB      94           62 MB/s     59 GB           948    105421   toku-sync
  46 GB      99           61 MB/s     56 GB           918    108930   toku-lazy
 180 GB     387          147 MB/s   2284 GB         15519      6454   mongo24-1db-sync
 200 GB     429          158 MB/s   2277 GB         14394      6964   mongo26-1db-sync
 220 GB     472          249 MB/s   2230 GB          8938     11886   mongo24-10db-sync
 220 GB     472          262 MB/s   2232 GB          8528     12041   mongo26-10db-sync
 180 GB     387          146 MB/s   2287 GB         15696      6374   mongo24-1db-lazy
 199 GB     427          154 MB/s   2278 GB         14791      6773   mongo26-1db-lazy
 219 GB     470          226 MB/s   2197 GB          9718     10667   mongo24-10db-lazy
 219 GB     470          233 MB/s   2190 GB          9389     10817   mongo26-10db-lazy
 180 GB     387          261 MB/s   1997 GB          7640     13094   mongo24-1db-noj
   X GB       X          291 MB/s   1980 GB          6800     14736   mongo26-1db-noj
 220 GB     472          758 MB/s   1721 GB          2271     44067   mongo24-10db-noj
 220 GB     472          741 MB/s   1438 GB          1703     60036   mongo26-10db-noj

Results @1B rows

The gap widens between InnoDB/TokuMX and MongoDB. Another result is uneven durations for the test clients with MongoDB. With InnoDB/TokuMX the clients usually finish within a few seconds of each other. With MongoDB I frequently see test runs where a few clients take hundreds of seconds longer than the others.

DB-size  Bytes-per-doc Write-rate  Bytes-written  Test-secs     TPS   Server
 159 GB     170          292 MB/s    292 GB          1051     95067   inno-sync
 159 GB     170          291 MB/s    291 GB          1057     94994   inno-lazy
  74 GB      79           63 MB/s     67 GB          1053     95443   toku-sync
  74 GB      79           67 MB/s     69 GB          1030     97071   toku-lazy
 400 GB     430           83 MB/s   2328 GB         28151      3557   mongo24-1db-sync
 380 GB     408           87 MB/s   2323 GB         26771      3746   mongo26-1db-sync
 419 GB     450          146 MB/s   2300 GB         15821      6880   mongo24-10db-sync
 400 GB     430          148 MB/s   2293 GB         15520      6989   mongo26-10db-sync
 400 GB     430           83 MB/s   2327 GB         28461      3517   mongo24-1db-lazy
 380 GB     408           86 MB/s   2322 GB         26937      3718   mongo26-1db-lazy
 420 GB     451          143 MB/s   2280 GB         15949      6555   mongo24-10db-lazy
 400 GB     430          148 MB/s   2267 GB         15361      6880   mongo26-10db-lazy
 400 GB     430          104 MB/s   2100 GB         20291      4938   mongo24-1db-noj
   X GB       X          110 MB/s   2098 GB         19147      5246   mongo26-1db-noj
 420 GB     451          206 MB/s   2068 GB         10039     10037   mongo24-10db-noj
 400 GB     430          214 MB/s   2063 GB          9623     10662   mongo26-10db-noj

Results @1.5B rows

Tests for mongo24-1db-lazy and mongo26-1db-lazy were stopped. I wasn't willing to wait. TPS for MongoDB continues to degrade faster than for InnoDB & TokuMX.

DB-size  Bytes-per-doc Write-rate  Bytes-written  Test-secs     TPS   Server
 229 GB     163          304 MB/s    371 GB          1290     77491   inno-sync
 229 GB     163          303 MB/s    372 GB          1300     77108   inno-lazy
 110 GB      78           75 MB/s     80 GB          1072     93240   toku-sync
 110 GB      78           75 MB/s     81 GB          1068     93575   toku-lazy
 540 GB     387           54 MB/s   2336 GB         43013      2327   mongo24-1db-sync
 580 GB     415           56 MB/s   2330 GB         41803      2400   mongo26-1db-sync
 560 GB     401           84 MB/s   2322 GB         27778      3790   mongo24-10db-sync
 600 GB     429           82 MB/s   2315 GB         28257      3708   mongo26-10db-sync
 560 GB     401           93 MB/s   2316 GB         24887      4218   mongo24-10db-lazy
 600 GB     429           94 MB/s   2312 GB         24638      4185   mongo26-10db-lazy
 544 GB     389           60 MB/s   2116 GB         35192      2845   mongo24-1db-noj
   X GB       X           62 MB/s   2118 GB         34123      2951   mongo26-1db-noj
 560 GB     401           96 MB/s   2104 GB         21903      4837   mongo24-10db-noj
 600 GB     429           98 MB/s   2090 GB         21276      5115   mongo26-10db-noj

Results @1.7B rows

Tests for mongo2?-*db-sync were stopped at 1.7B rows. I wasn't willing to wait.

DB-size  Bytes-per-doc Write-rate  Bytes-written  Test-secs     TPS   Server
 620 GB     392           50 MB/s   2336 GB         46506      2153   mongo24-1db-sync
 620 GB     392           52 MB/s   2333 GB         45273      2218   mongo26-1db-sync
 639 GB     404           74 MB/s   2318 GB         31264      3348   mongo24-10db-sync
 640 GB     404           73 MB/s   2317 GB         31960      3266   mongo26-10db-sync

Results @2B rows

More of the same.

DB-size  Bytes-per-doc Write-rate  Bytes-written  Test-secs     TPS   Server
 305 GB     163          267 MB/s    449 GB          1777     56930   inno-sync
 305 GB     163          265 MB/s    445 GB          1774     56466   inno-lazy
 141 GB      76           77 MB/s     80 GB          1037     96401   toku-sync
 141 GB      76           78 MB/s     79 GB          1019     97993   toku-lazy
 704 GB     378           50 MB/s   2125 GB         42575      2352   mongo24-1db-noj
   X GB       X           51 MB/s   2124 GB         41401      2422   mongo26-1db-noj
 720 GB     387           69 MB/s   2114 GB         30779      3269   mongo24-10db-noj
 740 GB     397           65 MB/s   2111 GB         32501      3102   mongo26-10db-noj

Results @2.5B rows

More of the same. The inno-sync test was stopped. I was impatient but TPS was still pretty good.

DB-size  Bytes-per-doc Write-rate  Bytes-written  Test-secs     TPS   Server
 376 GB     161          228 MB/s    617 GB          2863     35489   inno-lazy
 180 GB      76           82 MB/s     87 GB          1044     94770   toku-sync
 180 GB      76           83 MB/s     85 GB          1023     97588   toku-lazy
 840 GB     361           47 MB/s   2202 GB         46782      2139   mongo24-1db-noj
   X GB       X           49 MB/s   2207 GB         45456      2209   mongo26-1db-noj
 860 GB     369           62 MB/s   2198 GB         35462      2824   mongo24-10db-noj
 920 GB     395           58 MB/s   2199 GB         37755      2652   mongo26-10db-noj

Results @3B rows

More of the same.  A few tests are still running and I will update this in a few days if they finish.

DB-size  Bytes-per-doc Write-rate  Bytes-written  Test-secs     TPS   Server
 448 GB     160          220 MB/s    665 GB          3186     31831   inno-lazy
 212 GB      76           84 MB/s     93 GB          1091     91657   toku-sync
 214 GB      77           86 MB/s     92 GB          1066     93724   toku-lazy
 960 GB     344           43 MB/s   2105 GB         48793      2052   mongo24-1db-noj
1100 GB     394           45 MB/s   2107 GB         47366      2123   mongo26-1db-noj
 980 GB     351           53 MB/s   2529 GB         39545      2530   mongo24-10db-noj
1100 GB     394           50 MB/s   2109 GB         42416      2362   mongo26-10db-noj
