Wednesday, April 23, 2014

Concurrent, read-only, not cached: MongoDB, TokuMX, MySQL

I repeated the tests described here using a database larger than RAM. The test database has 8 collections/tables with 400M documents/rows per table. I previously reported results for this workload using a server with 24 CPU cores and a slightly different flash storage device. This time I provide a graph and use a server with more CPU cores. The goal for this test is to determine whether the DBMS can use the capacity of a high-performance storage device, the impact from different filesystem readahead settings for MongoDB and TokuMX and the impact from different read page sizes for TokuMX and InnoDB. It will take two blog posts to share everything. I think I will have much better QPS for MongoDB and TokuMX in my next post so I won't list any conclusions here.

Setup

I used my forked Java and C sysbench clients. The test query fetches one document/row by PK. The test database has 8 collections/tables with 400M rows per collection/table. All are in one database. I still need to enhance the Java sysbench client to support a database per collection. I tested the configurations listed below. I don't think these are the best configurations for TokuMX and MongoDB and am running more tests to confirm. The test server has 144G RAM, 40 CPU cores and a fast flash storage device.
  • fb56.handler - 740G database, MySQL 5.6.12 with the Facebook patch, InnoDB, page_size=8k, data fetched via HANDLER
  • fb56.sql - 740G database, MySQL 5.6.12 with the Facebook path, InnoDB, page_size=8k, data fetched via SELECT
  • orig57.handler - 740G database, official MySQL 5.7.4, InnoDB, page_size=8k, data fetched via HANDLER. 
  • orig57.sql - 740G database, official MySQL 5.7.4, InnoDB, page_size=8k, data fetched via SELECT
  • tokumx32 - 554G database, TokuMX 1.4.1, quicklz, readPageSize=32K, 16K filesystem readahead
  • tokumx64 - 582G database, TokuMX 1.4.1, quicklz, readPageSize=64K, 32K filesystem readahead
  • mongo24 - 834G database, MongoDB 2.4.9, powerOf2Sizes=0, 16K filesystem readahead
  • mongo26 - 874G database, MongoDB 2.6.0, powerOf2Sizes=1, 16K filesystem readahead

Results

Results for MySQL 5.7.4 are not in the graph to keep it readable and are similar to MySQL 5.6.12. Note that MySQL is able to get more than 100,000 QPS at high concurrency, TokuMX reaches 30,000 and MongoDB isn't able to reach 20,000. I think MongoDB and TokuMX can do a lot better when I reduce the filesystem readahead for both and reduce the read page size for TokuMX and results for that are in my next post. MongoDB also suffers in this test because the PK index is so large that all leaf nodes cannot fit in RAM so there is more than one disk read per query. This isn't something that goes away via tuning. The workaround it to make sure the database:RAM ratio isn't too big (and spend more money on hardware).
This lists the QPS from the graph.

point queries per second
     8     16     32     40  clients
 39928  63542 102294 107769  fb56.handler
 33630  56834  91132 102336  fb56.sql
 39714  63359 101987 106205  orig57.handler
 33561  56725  90900 101476  orig57.sql
 12586  22738  31407  32167  tokumx32
 10119  16373  18310  18232  tokumx64
 12782  16639  17350  17435  mongo24
 12503  17474  17988  18022  mongo26

Analysis

These tables list the average disk read rate from iostat r/s and the average number of disk reads per query. InnoDB is by far the most efficient with the smallest number of disk reads per query. TokuMX benefits from having the smallest database courtesy of quicklz compression but might suffer from a larger read page size (32k and 64k). But I don't think that is the only reason why the disk reads per query ratio is so much larger than InnoDB and TokuMX. I am repeating tests with an 8k read page size to confirm. MongoDB suffers from a PK index that is too large to be cached so disk reads are done for it and the document store. Both TokuMX and MongoDB might also do extra reads because of the filesystem readahead and I am repeating tests with smaller values for it to confirm.

iostat r/s
     8     16     32     40  clients
 33661  53502  86028  90616  fb56.handler
 29120  49155  78748  88423  fb56.sql
 33776  53702  86193  89755  orig57.handler
 29244  49268  78801  88027  orig57.sql
 26756  47813  65885  67840  tokumx32
 23728  37442  41357  42089  tokumx64
 18966  24440  25147  25322  mongo24
 18312  25313  25701  25781  mongo26

disk reads per query
     8     16     32     40  clients
  0.84a  0.84   0.84   0.84  fb56.handler
  0.86   0.86   0.86   0.86  fb56.sql
  0.85   0.84   0.84   0.84  orig57.handler
  0.87   0.86   0.86   0.86  orig57.sql
  2.12   2.10   2.09   2.10  tokumx32
  2.34   2.28   2.25   2.29  tokumx64
  1.48   1.46   1.44   1.45  mongo24
  1.54   1.44   1.42   1.43  mongo26


RW locks are hard

MongoDB and TokuMX saturated at a lower QPS rate then MySQL when running read-only workloads on a cached database with high concurrency. Many of the stalls were on the per-database RW-lock and I was curious about the benefit from removing that lock. I hacked MongoDB to not use the RW-lock per query (not safe for production) and repeated the test. I got less than 5% more QPS at 32 concurrent clients. I expected more, looked at performance with PMP and quickly realized there were several other sources of mutex contention that are largely hidden by contention on the per-database RW-lock. So this problem won't be easy to fix but I think it can be fixed.

The easy way to implement a reader-writer lock uses the pattern listed below. That includes pthread_rwlock_t in glibc the last time I checked and the per-database RW-lock used by MongoDB. InnoDB used this pattern many years ago and then we rewrote it to make InnoDB better on multi-core. An implementation like this tends to have problems on multi-core servers. The first problem is from locking/unlocking the internal mutex at least twice per use, once to get it in read or write mode and then again to unlock it. When there is contention it can be locked/unlocked many more times than twice per use from threads that wait, wake-up and then wait again. If the operation protected by this RW-lock is very fast then a mutex is usually a better choice. Note that even when all threads are trying to lock in read mode there is still contention on the internal mutex ("mtx" below). Another problem occurs when the thread trying to unlock a RW-lock is blocked trying to lock the internal state mutex ("mtx" below). There might be other threads waiting to run as soon as the unlock gets through but the unlock is stalled because incoming lock requests are competing for the same mutex ("mtx"). I have seen many PMP thread stacks where the unlocking thread is stuck on the lock_mutex call.

    lock(mode)
        lock_mutex(mtx)
        wait_until_lock_granted(mode)
        modify_lock_state()
        unlock_mutex(mtx)

    unlock()
        lock_mutex(mtx)
        modify_lock_state()
        notify_some_waiters()
        unlock_mutex(mtx)

Something better

The alternative that scales better is to use a lock-free approach to get and set internal state in the RW-lock. We did this as part of the Google MySQL patch many years ago and that code was contributed upstream. Such an approach removes much of the contention added by an inefficient RW-lock. It won't prevent contention added because threads want the lock in read and write mode at the same time. That still requires some threads to wait. When we did the work at Google on the InnoDB RW-lock, Yasufumi Kinoshita was working on a similar change. I am very happy he continues to make InnoDB better.

A lock-free implementation for a DBMS is likely to be much more complex than what you might read about on the web or a top-tier systems conference paper. There is more complexity because of the need to support performance monitoring, manageability, special semantics and the occasional wrong design decision. For performance monitoring we need to know how frequently a lock is used and how long threads wait on it. For manageability we need to know what threads wait on a lock and which thread holds it. A frequent pattern is for today's special semantics to become tomorrow's design decisions that we regret. But we can't expect perfection given the need to move fast and the rate at which hardware changes.

The low-level reader-writer lock in MongoDB, QLock, is a RW-lock with special semantics. It has two modes each for read and write locks:  r, R, w and W. It also supports upgrades and downgrades: W to R, R to W, w to X and X to w (I didn't mention X above). Internally there are 6 condition variables, one each for r, R, w and W and then two others, U and X, to support upgrades and downgrades. Read the source for more details. I don't understand the code enough to guess whether lock-free state changes can be supported as they were for the InnoDB RW-lock.

MongoDB details

I spent a few hours browsing the source for the MongoDB RW-lock and these are my notes. I hope they help you, otherwise they will be a reference for me in the future. Queries that call find to fetch one row by PK start to run in mongod via the newRunQuery function. That gets the per-database RW-lock in read mode by creating a Client::ReadContext object on the stack and ReadContext gets the per-database RW-lock in read mode:

    /** "read lock, and set my context, all in one operation"
     *  This handles (if not recursively locked) opening an unopened database.
     */
    Client::ReadContext::ReadContext(const string& ns, const std::string& path) {
        {
            lk.reset( new Lock::DBRead(ns) );
            Database *db = dbHolder().get(ns, path);
            if( db ) {
                c.reset( new Context(path, ns, db) );
                return;
            }
        }
        ...

The dbHolder().get() call above locks a mutex in DatabaseHolder while using the database name to find the database object. There is simple string searching while the mutex is locked. It might be easy to move some of that work outside the scope of the mutex and perhaps use a mutex per hash table bucket.

        Database * get( const string& ns , const string& path ) const {
            SimpleMutex::scoped_lock lk(_m);
            Lock::assertAtLeastReadLocked(ns);
            Paths::const_iterator x = _paths.find( path );
            if ( x == _paths.end() )
                return 0;
            const DBs& m = x->second;
            string db = _todb( ns );
            DBs::const_iterator it = m.find(db);
            if ( it != m.end() )
                return it->second;
            return 0;
        }

        static string __todb( const string& ns ) {
            size_t i = ns.find( '.' );
            if ( i == string::npos ) {
                uassert( 13074 , "db name can't be empty" , ns.size() );
                return ns;
            }
            uassert( 13075 , "db name can't be empty" , i > 0 );
            return ns.substr( 0 , i );

        }

Lets get back to the DBRead constructor that was called in the ReadContext constructor above. It calls lockDB to do the real work. The code below will call other functions that lock mutexes but no mutex is held by the caller to the code below. In my case the block with "if (DB_LEVEL_LOCKING_ENABLED)" is entered and lockTop gets called to do the real work.

    Lock::DBRead::DBRead( const StringData& ns )
        : ScopedLock( 'r' ), _what(ns.toString()), _nested(false) {
        lockDB( _what );
    }

    void Lock::DBRead::lockDB(const string& ns) {
        fassert( 16254, !ns.empty() );
        LockState& ls = lockState();

        Acquiring a(this,ls);
        _locked_r=false;
        _weLocked=0;

        if ( ls.isRW() )
            return;
        if (DB_LEVEL_LOCKING_ENABLED) {
            StringData db = nsToDatabaseSubstring(ns);
            Nestable nested = n(db);
            if( !nested )
                lockOther(db);
            lockTop(ls);
            if( nested )
                lockNestable(nested);
        }
        else {
            qlk.lock_R();
            _locked_r = true;
        }
    }

Well, lockTop doesn't do the real work during my benchmark. It calls qlk.lock_r to do that.

    void Lock::DBRead::lockTop(LockState& ls) {
        switch( ls.threadState() ) {
        case 'r':
        case 'w':
            break;
        default:
            verify(false);
        case  0  :
            qlk.lock_r();
            _locked_r = true;
        }
    }

Almost there, just one more level of indirection. The call to qlk.lock_r calls the lock_r method on an instance of QLock and then something gets done.

    void lock_r() {
        verify( threadState() == 0 );
        lockState().lockedStart( 'r' );
        q.lock_r();
    }

    inline void QLock::lock_r() {
        boost::mutex::scoped_lock lk(m);
        while( !r_legal() ) {
            r.c.wait(m);
        }
        r.n++;
    }

Eventually the unlock_r method is called for the same instance of QLock. I won't show the route there however.

    inline void QLock::unlock_r() {
        boost::mutex::scoped_lock lk(m);
        fassert(16137, r.n > 0);
        --r.n;
        notifyWeUnlocked('r');
    }

And notifyWeUnlocked provides the special semantics. This includes not letting a new reader in when there is a pending write request. The code below also wakes all waiting write requests when one is waiting. This might cause many threads to be scheduled to run even though at most one will get the RW-lock. InnoDB does something similar.

    inline void QLock::notifyWeUnlocked(char me) {
        fassert(16201, W.n == 0);
        if ( me == 'X' ) {
            X.c.notify_all();
        }
        if( U.n ) {
            // U is highest priority
            if( (r.n + w.n + W.n + X.n == 0) && (R.n == 1) ) {
                U.c.notify_one();
                return;
            }
        }
        if ( X_legal() && i_block(me, 'X') ) {
            X.c.notify_one();
        }
        if ( W_legal() && i_block(me, 'W') ) {
            W.c.notify_one();
            if( _areQueueJumpingGlobalWritesPending() )
                return;
        }
        if ( R_legal_ignore_greed() && i_block(me, 'R') ) {
            R.c.notify_all();
        }
        if ( w_legal_ignore_greed() && i_block(me, 'w') ) {
            w.c.notify_all();
        }
        if ( r_legal_ignore_greed() && i_block(me, 'r') ) {
            r.c.notify_all();
        }
    }



Tuesday, April 22, 2014

Concurrent, read-only & cached: MongoDB, TokuMX, MySQL

This has results for a read-only workload where all data is cached. The test query fetches all columns in one doucment/row by PK. For InnoDB all data is in the buffer pool. For TokuMX and MongoDB all data is in the OS filesystem cache and accessed via mmap'd files. The test server has 40 CPU cores with HT enabled and the test clients share the host with mysqld/mongod to reduce variance from network latency. This was similar to a previous test, except the database is in cache and the test host has more CPU cores. The summary of my results is:
  • MongoDB 2.6 has a performance regression from using more CPU per query. The regression might be limited to simple queries that do single row lookups on the _id index. I spent a bit of time rediscovering how to get hierarchical CPU profile data from gperftools to explain this. JIRAs 13663 and 13685 are open for this.
  • MySQL gets much more QPS at high concurrency than MongoDB and TokuMX
  • MySQL gets more QPS using the HANDLER interface than SELECT. I expect the InnoDB memcached API to be even faster than HANDLER but did not test it.
  • MySQL uses more CPU per query in 5.7.4 than in 5.6.12 but this didn't have an impact on QPS

Setup

The test was repeated for 1, 2, 4, 8, 16, 32 and 40 concurrent clients. It uses my forked versions of the MongoDB and C clients for sysbench. There are 8 collections/tables in one database. Each table has 400M rows but queries are limited to the first 1M. I don't know yet whether using a database per collection would improve the MongoDB results. Each query fetches all columns in one document/row by PK. I have yet to push my changes to the MongoDB sysbench client to make it fetch all columns. I tested these binaries:
  • fb56.handler - MySQL 5.6.12 with the Facebook patch and 8k pages. Uses HANDLER to fetch data.
  • fb56.sql - MySQL 5.6.12 with the Facebook patch and 8k pages. Uses SELECT to fetch data.
  • orig57.handler - MySQL 5.7.4 without the Facebook patch and 8k pages. Uses HANDLER to fetch data.
  • orig57.sql - MySQL 5.7.4 without the Facebook patch and 8k pages. Uses SELECT to fetch data.
  • tokumx - TokuMX 1.4.1 using quicklz and 32kb pages. There should be no decompression during the test as all data used by the test (1M documents) is much smaller than 50% of RAM.
  • mongo24 - MongoDB 2.4.9
  • mongo26 - MongoDB 2.6.0

Results

At last I included a graph. I have been reluctant to include graphs on previous posts comparing MongoDB, TokuMX and MySQL because I want to avoid benchmarketing and drive-by analysis. These tests have been time consuming to run and document and I don't want to make it too easy to misinterpret the results. Results for MySQL 5.7.4 are not in the graph to make it easier to read. The top two bars (blue & red) are for MySQL and you can see that QPS increases with more concurrency. QPS for MongoDB and TokuMX saturates at a lower level of concurrency.
Numbers used for the graph above.

point queries per second
    1      2      4      8     16     32     40  clients
17864  32397  60294 106374 184566 298276 350665  fb56.handler
11730  22884  39646  73485 131533 215487 249402  fb56.sql
18161  33262  59413 107505 185894 306084 371045  orig57.handler
11775  21838  40528  75322 135331 227450 266917  orig57.sql
14298  25219  45743  83214 142489 168498 161840  tokumx
17203  30158  52476  94705 161922 174453 170177  mongo24
10705  19502  34318  61977 109684 152667 151555  mongo26

Analysis

I used vmstat to measure the average CPU utilization (user + system) during the test. The numbers below are: (CPU_utilization / QPS) * 1,000,000. There are some interesting details.
  • the values are larger for MySQL 5.7 than for 5.6 at low concurrency. Note that in both cases the performance schema was disabled at compile time.
  • the values are much larger for MongoDB 2.6 than for 2.4 and hopefully this can be fixed via JIRAs 13663 and 13685.
(CPU_utilization / QPS) * 1,000,000
  1      2      4      8     16     32     40  clients
218    197    197    208    216    251    268  fb56.handler
323    310    287    298    304    352    372  fb56.sql
357    279    240    216    215    248    250  orig57.handler
407    380    313    288    302    342    359  orig57.sql
272    269    251    254    266    302    296  tokumx
232    215    219    225    234    257    252  mongo24
373    333    340    342    355    425    422  mongo26

I also used vmstat to measure the context switch rate and the table below lists the number of context switches per query. Note that the rate decreases with concurrency for MySQL but not for MongoDB and TokuMX. I don't know enough about Linux internals to interpret this.

vmstat.cs / QPS

context switch per query
     1      2      4      8     16     32     40  clients
  4.44   4.14   4.01   3.79   3.47   3.05   2.19  fb56.handler
  4.61   4.32   4.03   3.84   3.59   3.23   2.65  fb56.sql
  4.53   4.27   4.07   3.88   3.52   3.08   2.20  orig57.handler
  4.81   4.48   4.19   3.96   3.63   3.07   2.19  orig57.sql
  4.59   4.30   4.08   3.87   3.77   4.32   4.32  tokumx
  4.54   4.23   4.03   3.84   3.79   4.29   4.30  mongo24
  4.80   4.43   4.21   3.99   3.93   4.58   4.63  mongo26

Hierarchical CPU profiling for MongoDB

One day it will be easy to get hierarchical CPU profile results for open-source databases using open-source profiling tools. Support for CPU profiling via Google perftools can be compiled into MongoDB via the --use-cpu-profiler option. Given the use of a compiler toolchain in a nonstandard location I also used --extrapath and --extralib to help it find libunwind. However, the profiler output file was mostly empty after doing this and did not have any per-thread results.

For future reference, profiling was started and stopped via:
echo "db.runCommand({ _cpuProfilerStart: { profileFilename: '/path/to/output' } })" | bin/mongo admin
echo "db.runCommand({ _cpuProfilerStop:1})" | bin/mongo admin
Eventually I remembered that it might help to add a call to ProfilerRegisterThread() at the start of each new thread. I did something similar in the Google patch for MySQL many years ago. And now I have hierarchical CPU profiles to help understand the source of performance regressions in MongoDB 2.6. I updated JIRA 6628 with details from this and then was asked to create JIRA 13683. The change is to handleIncomingMsg():
        static void* handleIncomingMsg(void* arg) {
            TicketHolderReleaser connTicketReleaser( &Listener::globalTicketHolder );
            ::ProfilerRegisterThread(); // Add this

Friday, April 18, 2014

Biebermarks

Yet another microbenchmark result. This one is based on behavior that has caused problems in the past for a variety of reasons which lead to a few interesting discoveries. The first was that using a short lock-wait timeout was better than the InnoDB deadlock detection code. The second was that no-stored procedures could overcome network latency.

The workload is a large database where all updates are done to a small number of rows. I think it is important to use a large database to include the overhead from searching multiple levels of a b-tree. The inspiration for this is maintaining counts for popular entities like Justin Bieber and One Direction. This comes from serving the social graph. For more on that read about TAO and LinkBench.

The most popular benchmark for MySQL is sysbench and it is usually run with a uniform distribution so that all rows are equally likely to be queried or modified. But real workloads have skew which can cause stress in unexpected places and I describe one such place within InnoDB from this microbenchmark. YCSB and LinkBench are benchmarks that have skew and can be run for MySQL. I hope that more of the MySQL benchmark results in the future include skew.

Configuration

See a previous post for more details. Eight collections/tables with 400M documents/rows per collection/table were created. All collections/tables are in one database so MongoDB suffers from the per-database RW-lock. But MySQL and TokuMX also suffer from a similar issue when all clients are trying to update the same row. Tests were run for 1, 2, 4 and 8 tables where one row per table was updated. So when the test used 4 tables there were 4 rows getting updates. For each number of tables tests were run for up to 64 concurrent clients/threads. The result tables listed in the next section should make that clear.

The workload is updating the non-indexed column of one document/row by PK per transaction. There are no secondary indexes on the table. In this case the document/row with ID=1 is chosen for every update. For MySQL and TokuMX an auto-commit transaction is used. The journal (redo log) is used but the update does not wait for the journal/log to be forced to disk. The updates should not require disk reads as all relevant index and data blocks remain in cache. TokuMX might do reads in the background to maintain fractal trees but I don't understand their algorithm to be certain.

The database was loaded in PK order and about 8 hours of concurrent & random updates were done to warmup the database prior to this test. The warmup was the same workload as described in a previous post.

The MySQL test client limits clients to one table. So when there are 64 clients and 8 tables then there are 8 clients updating the 1 row per table. The MongoDB/TokuMX client does not do that. It lets all clients update all tables so in this case there are at most 64 clients updating the row per table and on average there would be 8.

The test server has 40 CPU cores with HT enabled, fast flash storage and 144G of RAM. The benchmark client and database servers shared the host. Tests were run for several configurations:
  • mongo26 - MongoDB 2.6.0rc2, powerOf2Sizes=1, journalCommitInterval=300, w:1,j:0
  • mongo24 - MongoDB 2.6.0rc2, powerOf2Sizes=0, journalCommitInterval=300, w:1,j:0
  • mysql - MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=120G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=16000, lru_scan_depth=2000, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=0
  • toku-32 - TokuMX 1.4.1, readPageSize=32k, quicklz compression, logFlushPeriod=300, w:1,j:0. I don't have results for toku-32 yet.
  • toku-64 - TokuMX 1.4.1, readPageSize=64k, quicklz compression, logFlushPeriod=300, w:1,j:0

Results per DBMS

I first list the results by DBMS to show the impact from spreading the workload over more rows/tables. The numbers below are the updates per second rate. I use "DOP=X" to indicate the number of concurrent clients and "DOP" stands for Degree Of Parallelism (it is an Oracle thing). A few conclusions from the results below:

  • MySQL/InnoDB does much better with more tables for two reasons. The first is that it allows for more concurrency. The second is that it avoids some of the overhead in the code that maintains row locks and threads waiting for row locks. I describe that in more detail at the end of this post.
  • MongoDB 2.4.9 is slightly faster than 2.6.0rc2. I think the problem is that mongod requires more CPU per update in 2.6 versus 2.4 and this looks like a performance regression in 2.6 (at least in 2.6.0rc2). I am still profiling to figure out where. More details on this are at the end of the post. I filed JIRA 13663 for this.
  • MongoDB doesn't benefit from spreading the load over more collections when all collections are in the same database. This is expected given the per-database RW-lock.

Updates per second
config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         1   8360  15992  30182  24932   23924   23191   21048
mysql         2      X  16527  30824  49999   41045   40506   38357
mysql         4      X      X  32351  51791   67423   62116   59137
mysql         8      X      X      X  54826   80409   73782   68128

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mongo24       1  10212  17844  30204  34003   33895   33564   33451
mongo24       2      X  10256  17698  30547   34125   33717   33573
mongo24       4      X      X  10670  17690   30903   34027   33586
mongo24       8      X      X      X  10379   17702   30920   33758

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mongo26       1   9187  16131  27648  28506   27784   27437   27021
mongo26       2      X   9367  16035  27490   28326   27746   27354
mongo26       4      X      X   9179  16028   27666   28330   27647
mongo26       8      X      X      X   9125   16038   27275   27858

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
toku-64       1   7327  12804  16179  12154   11021    9990    8344
toku-64       2      X   7173  12690  20483   23064   22354   20349
toku-64       4      X      X   7191  12943   21399   33485   40124
toku-64       8      X      X      X   7121   12727   22096   38207

Results per number of tables

This reorders the results from above to show them for all configurations at the same number of tables. You are welcome to draw conclusions about which is faster.

Updates per second
config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         1   8360  15992  30182  24932   23924   23191   21048
mongo24       1  10212  17844  30204  34003   33895   33564   33451
mongo26       1   9187  16131  27648  28506   27784   27437   27021
toku-64       1   7327  12804  16179  12154   11021    9990    8344

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         2      X  16527  30824  49999   41045   40506   38357
mongo24       2      X  10256  17698  30547   34125   33717   33573
mongo26       2      X   9367  16035  27490   28326   27746   27354
toku-64       2      X   7173  12690  20483   23064   22354   20349

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         4      X      X  32351  51791   67423   62116   59137
mongo24       4      X      X  10670  17690   30903   34027   33586
mongo26       4      X      X   9179  16028   27666   28330   27647
toku-64       4      X      X   7191  12943   21399   33485   40124

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         8      X      X      X  54826   80409   73782   68128
mongo24       8      X      X      X  10379   17702   30920   33758
mongo26       8      X      X      X   9125   16038   27275   27858
toku-64       8      X      X      X   7121   12727   22096   38207

Row locks for InnoDB

I used PMP to understand MySQL/InnoDB on this workload. I frequently saw all user threads blocked on a condition variable with this stack trace. It seems odd that all threads are sleeping. I think the problem is that one thread can run but has yet to be scheduled by Linux. My memory of the row lock code is that it wakes threads in FIFO order and when N threads wait for a lock on the same row then each thread waits on a separate condition variable. I am not sure if this code has been improved in MySQL 5.7. A quick reading of some of the 5.6.12 row lock code showed many mutex operations. Problems in this code have escaped scrutiny in the past because much of our public benchmark activity has used workloads with uniform distributions.
pthread_cond_wait@@GLIBC_2.3.2,os_cond_wait,os_event_wait_low2,lock_wait_suspend_thread,row_mysql_handle_errors,row_search_for_mysql,ha_innobase::index_read,handler::read_range_first,handler::multi_range_read_next,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection
This was a less frequent stack trace from the test ...
lock_get_mode,lock_table_other_has_incompatible,lock_table,row_search_for_mysql,ha_innobase::index_read,handler::read_range_first,handler::multi_range_read_next,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection

Row locks for TokuMX

TokuMX has a similar point at which all threads wait. It isn't a big surprise given that both provide fine-grained concurrency control but there is no granularity finer than a row lock.
pthread_cond_timedwait@@GLIBC_2.3.2,toku_cond_timedwait,toku::lock_request::wait,toku_db_wait_range_lock,toku_c_getf_set(__toku_dbc*,,db_getf_set,autotxn_db_getf_set(__toku_db*,,mongo::CollectionBase::findByPK(mongo::BSONObj,mongo::queryByPKHack(mongo::Collection*,,mongo::updateObjects(char,mongo::lockedReceivedUpdate(char,mongo::receivedUpdate(mongo::Message&,,mongo::assembleResponse(mongo::Message&,,mongo::MyMessageHandler::process(mongo::Message&,,mongo::PortMessageServer::handleIncomingMsg(void*)

MongoDB 2.4 versus 2.6

I get about 1.2X more updates/second with MongoDB 2.4.9 compared to 2.6.0rc2. I think the problem is that 2.6 uses more CPU per update. I filed JIRA 13663 for this but am still trying to profile the code. So far I know the following all of which indicates that the 2.4.9 test is running 1.2X faster than 2.6.0rc2 with 32 client threads and 1 table:
  • I get ~1.2X more updates/second with 2.4.9
  • the Java sysbench client uses ~1.2X more CPU per "top" with 2.4.9
  • the context switch rate is ~1.2X higher with 2.4.9
The interesting point is that mongod for 2.4.9 only uses ~1.03X more CPU than 2.6.0rc2 per "top" during this test even though it is doing 1.2X more updates/second. So 2.6.0rc2 uses more CPU per update. I will look at "perf" output. I can repeat this with the GA version of 2.6.

Wednesday, April 16, 2014

TokuMX, MongoDB and InnoDB, IO-bound update-only with fast storage

I repeated the update-only IO-bound tests using pure-flash servers to compare TokuMX, MongoDB and InnoDB. The test setup was the same as on the pure-disk servers except for the hardware. In this case the servers have fast flash storage, 144G of RAM and 24 CPU cores with HT enabled. As a reminder, the InnoDB change buffer and TokuMX fractal tree don't help on this workload because there are no secondary indexes to maintain. Note that all collections/tables are in one database for this workload thus showing the worst-case for the MongoDB per-database RW-lock. The result summary:
  • InnoDB is much faster than MongoDB and TokuMX. This test requires a high rate of dirty page writeback and thanks to a lot of work from the InnoDB team at MySQL with help from Percona and Facebook (and others) the InnoDB engine is now very good at that. Relative to MongoDB, InnoDB also benefits from a clustered PK index.
  • MongoDB is much slower than InnoDB for two reasons. First it doesn't have a clustered PK index so it might do storage reads for both the index search and then while reading the document. The second reason is the per-database RW-lock. As I described previously this lock appears to be held during disk reads when the index is searched so at most one thread searches the index at a time even though there are concurrent update requests. I created JIRA 3177 to make that obvious in the documentation. Because of this the peak rate for MongoDB is approximately the number of reads per second that one thread can do from the flash device. The device can sustain many more reads/second with concurrency but MongoDB doesn't get much benefit from it. I think there will be at most 2 concurrent flash/disk reads at any time -- one while searching the index and the other while prefetching the document into RAM after releasing the per-database RW-lock in Record::touch.
  • TokuMX also benefits from the clustered PK index but it suffers from other problems that I was unable to debug. I think it can do much better once a Toku expert reproduces the problem on their hardware.

Configuration

This test used the sysbench clients as described previously. Tests were run for 8, 16, 32 and 64 concurrent clients. There were 8 collections/tables in one database with 400M documents/rows per collection/table. The test server has fast flash storage that can do more than 5000 reads/second from one thread and more than 50,000 reads/second from many threads.  The server also has 24 CPU cores with HT enabled and 144G of RAM. The sysbench clients ran on the same host as mysqld/mongod. Tests were first run for 30 minutes at each concurrency level to warmup the DBMS and then for either 60 or 120 minutes when measurements were taken. I tested the configurations listed below. I ran tests for more configurations but forgot to adjust read_ahead_kb so I won't publish results from those hosts.
  • mongo-p2y - 874 GB database, MongoDB 2.6.0rc2, powerOf2Sizes=1, journalCommitInterval=300, w:1,j:0
  • mysql-4k - 698 GB database, MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=120G, flush_method=O_DIRECT, page_size=4k, doublewrite=0, io_capacity=16000, lru_scan_depth=2000, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=0
  • mysql-8k - 698 GB database, MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=120G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=16000, lru_scan_depth=2000, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=0
  • tokumx-quicklz - 513 GB database, TokuMX 1.4.1 with quicklz compression, logFlushPeriod=300, w:1,j:0

Results

That probably isn't a typo below. InnoDB sustained about 5 to 10 times more updates/second. MongoDB does many more disk reads per update which is similar to the pure-disk results. I don't have the expertise to explain why TokuMX results weren't better but I shared information with the Tokutek team. Bytes written to storage per update is listed for InnoDB to show the impact on the write rate from using a smaller page. That can be important when flash endurance must be improved.

TPS
configuration  8 clients  16 clients  32 clients  64 clients
mysql-8k           24834       33886       37573       40198
mysql-4k           24826       31704       34644       34987
tokumx-quicklz      3706        3950        3552        3357
mongo-p2y           5194        5167        5173        5102

disk reads/second from iostat r/s
configuration  8 clients  16 clients  32 clients  64 clients
mysql-8k           20995       28371       31397       33537
mysql-4k           22016       27985       30553       30972
tokumx-quicklz      4943        5641        4962        4783
mongo-p2y           8960        8921        8951        8859

disk reads per update
configuration  8 clients  16 clients  32 clients  64 clients
mysql-8k            0.85        0.84        0.84        0.83
mysql-4k            0.89        0.88        0.88        0.89
tokumx-quicklz      1.33        1.43        1.40        1.42
mongo-p2y           1.73        1.73        1.73        1.74 

bytes written per update
configuration  8 clients  16 clients  32 clients  64 clients
mysql-8k            6.56        6.40        5.36        5.36
mysql-4k            3.86        3.72        3.76        3.78

Types of writes

What does it mean to make writes fast? It helps to distinguish between the different types of writes. The slowest is a write that must be implemented as read-modify-write. This might require a disk read and can also create contention from preventing concurrent changes to the row for the duration of the read, modify and write. The row might not be unlocked until the change is made durable on storage (commit, fsync, etc) which lets you estimate the peak rate at which a single row can be changed on a traditional DBMS. And this latency between changes can get even worse when there is sync replication or multiple client-server round trips per transaction. The UPDATE statement in SQL is usually implemented as read-modify-write. Some DBMS engines require locking to be done above the DBMS because they don't support locking across operations where read and write are separate operations (RocksDB is an example). Other DBMS engines compensate for that with a conditional put that performs a write when parts of the row have not changed like checkAndPut in HBase. But if the client does a read prior to the write then the overhead from the read still exists.

Some UPDATE statements perform changes that are commutative and it is possible to avoid the read prior to the write. That optimization is rarely implemented but it is possible in RocksDB with the merge operatorTokuDB, and Redis. Increment and decrement are examples of commutative operations. But this also requires upsert behavior to handle the case where the row to be changed doesn't exist. If the read is to be skipped it also means that a result cannot be returned -- that is the count of rows changed or the old/new value of the changed column.

blind-write is the easiest write to make fast. This can be done via a Put call with a key-value store like LevelDB or RocksDB. Some SQL operations can also use a blind-write if we have an option to not return the count of changed rows when the statement is executed and the operation behaves like an upsert. This is rarely implemented but TokuDB might help people appreciate it.

So there are at least 3 types of writes and from slowest to fastest they are read-modify-writecommutative-writeblind-write. Hopefully these optimizations will become more common in new database engines. From the write operation there shouldn't be a performance difference between commutative-write and blind-write. But the query latency for a row subject to commutative-write can be much worse than for blind-write because many old updates might have to be read and merged.

The insert benchmark on a small server, IO-bound workload : Postgres 19 beta1

This has results for Postgres versions 19 beta1, 18.4 and 17.10 with the Insert Benchmark on a small server using a cached and CPU-bound wo...