Tuesday, April 1, 2014

TokuMX, MongoDB and InnoDB on IO-bound point queries

I used sysbench to understand whether TokuMX, MongoDB and InnoDB can use most of the IOPs provided by a fast flash device and whether their use is efficient. The workload query fetches one document/row by primary key. This is a very simple workload but helps me to understand how disk read requests are processed. The primary metric is the QPS that can be sustained when the database is much larger than RAM. A secondary metric is the number of disk reads per query during the test. Efficiency, not doing too many reads per query, matters when you want to support many concurrent users.
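The query is nothing more than a fetch by primary key (_id). Below is a minimal sketch of what one request looks like from the Java driver against MongoDB or TokuMX; the server address, database name (sbtest), collection name (sbtest1) and key range are my assumptions for illustration, not copied from the sysbench client.

  import com.mongodb.BasicDBObject;
  import com.mongodb.DB;
  import com.mongodb.DBCollection;
  import com.mongodb.DBObject;
  import com.mongodb.MongoClient;
  import java.util.Random;

  public class PointQuery {
    public static void main(String[] args) throws Exception {
      // TokuMX speaks the MongoDB wire protocol so the same client code works for both.
      MongoClient client = new MongoClient("localhost", 27017);
      DB db = client.getDB("sbtest");                  // assumed database name
      DBCollection coll = db.getCollection("sbtest1"); // assumed collection name
      Random rand = new Random();
      for (int i = 0; i < 1000; i++) {
        int id = 1 + rand.nextInt(40 * 1000 * 1000);   // 40M documents per collection
        // The entire workload: fetch one document by primary key.
        DBObject doc = coll.findOne(new BasicDBObject("_id", id));
      }
      client.close();
    }
  }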

tl;dr - new database engines are usually worse on multi-core than old database engines. I know there are exceptions to this rule for both new engines, like WiredTiger and RethinkDB, and old engines that won't be named, but I am generalizing. Or maybe I am just venting after having read a few too many attempts to frame the discussion as legacy software retrofitted for the web versus bespoke software (all shiny, new and perfect). As good as my team at work is, we also start with something (MySQL) that is pretty good.

Conclusions:
  • InnoDB is faster (higher QPS). Whatever, I am a fanboy after all. I assume TokuMX and MongoDB will get better. I know Toku has made huge improvements in this area.
  • InnoDB does the fewest disk reads per query and MongoDB does by far the most. The MongoDB result should be explained and might be something to fix. The iostat r/s rate was much higher than the fault/s rate reported by mongostat. InnoDB and TokuMX have a clustered index on the primary key. MongoDB does not and might do a disk read for both the index lookup and the heap storage, but the primary key leaf nodes should have been cached during the test. Alas, I did not confirm my assumption. I will repeat this test on disk servers and the impact from the extra storage reads should be much worse there if these really are "random" reads from the storage device's perspective.
  • TokuMX with quicklz and 8k readPageSize is much faster than with zlib and 64k readPageSize. The benefit for quicklz versus zlib when compression ratios aren't an issue isn't a surprise. QPS is better with 8k reads versus 64k because there is less to decompress after each read and because flash devices are bound by transfer rates rather than seeks at high concurrency.
  • I modified the Java sysbench client to not call beginTransaction and ensureConnection as this workload works fine with autocommit (see the sketch after this list). Those calls were only made for TokuMX as MongoDB doesn't do multi-document transactions. Without my change the TokuMX results suffer from extra work that isn't needed.
  • MongoDB doesn't have a buffer pool but it does have something like an LRU to predict which pages are in memory. This can use a lot of CPU and cause mutex contention. See Rolling::access and Record::likelyInPhysicalMemory.
  • MongoDB is the most sensitive to the filesystem readahead setting. You will do more tuning of the filesystem with MongoDB and TokuMX given their use of buffered IO, and more tuning of the server with InnoDB given its use of direct IO. Note that MongoDB benefits most from readahead. TokuMX does page-sized reads via buffered IO so the OS knows the full extent of the read request. With MongoDB and buffered IO via mmap, when reading a 16kb object the Record::touch method will "touch" the record in memory at 2kb offsets. When this crosses a filesystem page boundary there might be more than one disk read for the document. Perhaps madvise calls can be used to hint the full extent of the access. Note that the documents for this test were less than 512 bytes so that should be less of an issue.
  • There is a lot of activity (locks, unlocks) on the per-database reader-writer lock in MongoDB when doing disk reads because there is code to release it during a disk read and then relock, which implies at least 2 lock/unlock pairs per query for this workload. See PlanExecutor::getNext, ClientCursor::staticYield and Record::touch.
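To illustrate the autocommit change mentioned above: the per-query path in the Java client went from something like the first fragment below to the second. This reuses the db, coll and id names from the sketch near the top of the post, and beginTransaction/commitTransaction are TokuMX's multi-statement transaction commands as I understand them, so treat this as a sketch rather than the client's exact code.

  // Before: each point query wrapped in an explicit TokuMX transaction.
  db.command(new BasicDBObject("beginTransaction", 1));
  DBObject doc1 = coll.findOne(new BasicDBObject("_id", id));
  db.command(new BasicDBObject("commitTransaction", 1));

  // After: rely on autocommit, so a query is one round trip instead of three.
  DBObject doc2 = coll.findOne(new BasicDBObject("_id", id));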

Setup

The Java sysbench client was used for TokuMX and MongoDB. My forked C sysbench client was used for InnoDB. The test database used 8 collections with 40M documents per collection. Note that all collections are in one database, which isn't the best case for MongoDB given the per-database reader-writer lock. There is contention from the frequent lock & unlock of a reader-writer lock even when all locks are in shared mode. The test was repeated for 8, 16, 32 and 64 clients, and for a few tests I also have QPS at 128 clients. The clients ran on the same host as the server and the server has 24 cores with HT enabled. QPS was stable throughout the test. For MongoDB and TokuMX the tests were repeated with filesystem readahead set to 0kb, 8kb and 32kb.
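A minimal sketch of the concurrency side of the client is below, for the MongoDB/TokuMX path via the Java driver. The names, the fixed 60-second duration and the uniform random key distribution are my assumptions; the real Java sysbench client does more than this (ramp-up, per-interval reporting, multiple query types).

  import com.mongodb.BasicDBObject;
  import com.mongodb.DB;
  import com.mongodb.MongoClient;
  import java.util.Random;
  import java.util.concurrent.atomic.AtomicLong;

  public class PointQueryBench {
    static final int COLLECTIONS = 8;                         // 8 collections ...
    static final int DOCS_PER_COLLECTION = 40 * 1000 * 1000;  // ... of 40M documents each
    static final long RUN_MILLIS = 60 * 1000;                 // assumed test duration

    public static void main(String[] args) throws Exception {
      final int clients = Integer.parseInt(args[0]);          // 8, 16, 32, 64 or 128
      final MongoClient mc = new MongoClient("localhost", 27017);
      final DB db = mc.getDB("sbtest");                       // assumed database name
      final AtomicLong queries = new AtomicLong();
      final long end = System.currentTimeMillis() + RUN_MILLIS;

      Thread[] threads = new Thread[clients];
      for (int t = 0; t < clients; t++) {
        threads[t] = new Thread(new Runnable() {
          public void run() {
            Random rand = new Random();
            while (System.currentTimeMillis() < end) {
              // Pick a random collection and a random primary key, then fetch by _id.
              String name = "sbtest" + (1 + rand.nextInt(COLLECTIONS));
              int id = 1 + rand.nextInt(DOCS_PER_COLLECTION);
              db.getCollection(name).findOne(new BasicDBObject("_id", id));
              queries.incrementAndGet();
            }
          }
        });
        threads[t].start();
      }
      for (Thread t : threads) {
        t.join();
      }
      System.out.println("QPS: " + (queries.get() * 1000 / RUN_MILLIS));
      mc.close();
    }
  }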

Several configurations were tested and the database size is included for each. To get the load done for MongoDB I did not create the secondary index that is included in the InnoDB and TokuMX sizes. A sketch of how the TokuMX collection options might have been set follows the list.
  • inno-u-8 - 740 GB, MySQL 5.6.12, InnoDB without compression, 8kb pages
  • inno-z-8 - 349 GB, MySQL 5.6.12, InnoDB with 2X compression, 8kb pages compressed to 4kb on disk
  • mongo - 877 GB, MongoDB 2.6.0rc2
  • toku-z-64 - 371 GB, TokuMX 1.4.0, 64kb readPageSize, zlib compression
  • toku-z-8 - 390 GB, TokuMX 1.4.0, 8kb readPageSize, zlib compression
  • toku-q-8 - 499 GB, TokuMX 1.4.0, 8kb readPageSize, quicklz compression
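For the TokuMX configurations, the compression algorithm and read page size are per-collection settings chosen at creation time. A hypothetical sketch of creating the toku-q-8 collection from the Java driver is below; the option names (compression, readPageSize) and the byte-valued size are my recollection of TokuMX's createCollection options rather than the exact commands I used.

  // Hypothetical: create a TokuMX collection with quicklz compression and 8kb read pages.
  // Reuses the db object from the first sketch.
  BasicDBObject opts = new BasicDBObject("compression", "quicklz")
      .append("readPageSize", 8 * 1024);
  db.createCollection("sbtest1", opts);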

Results

For the MongoDB and TokuMX results in the table below the filesystem readahead was set to 8kb. The QPS@X columns list the QPS at that number of clients. The r/q@128 column lists the number of disk reads per query for 128 clients. The disk read rate was measured by iostat. Note the rate is much higher for MongoDB than for others and this is not explained by the difference in database sizes. The context switch rate for TokuMX is much lower than the others but my TokuMX test binaries don't have debug symbols so I didn't try to explain that via PMP and Linux perf.
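To make the r/q@128 numbers concrete, multiplying reads per query by QPS recovers the approximate iostat read rate at 128 clients: about 0.83 × 96771 ≈ 80,000 r/s for inno-u-8, 1.09 × 61686 ≈ 67,000 r/s for toku-q-8 and 1.65 × 59916 ≈ 99,000 r/s for mongo. So MongoDB did the most reads per second from the device while sustaining the lowest QPS at that concurrency.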

config     QPS@8  QPS@16  QPS@32  QPS@64  QPS@128  r/q@128  cs@128
inno-u-8   31459   55988   92927   96628    96771     0.83  410279
inno-z-8   21766   46967   66423   72508
toku-q-8   28076   46139   64839   64467    61686     1.09  308911  
toku-z-8   24592   41665   59477   59817  
toku-z-64  11109   18519   26048   28594
mongo      18635   36916   52523   63469    59916     1.65  432390

This table lists the CPU overhead from the tests for which QPS is reported above. The value is reported as user+sys (the vmstat us and sy columns). The high rate of system time for toku-z-64 surprises me. Alas, I wiped the database before trying to debug it. MongoDB also used more system time than the others; in its case that appears to be overhead from the Linux virtual memory system given the use of mmap.

config     CPU@8   CPU@16  CPU@32  CPU@64
inno-u-8   20+9    35+22   55+28   65+34
inno-z-8   23+11   40+14   64+20   73+23
toku-q-8   25+9    38+16   56+24   59+25
toku-z-8   27+8    43+14   63+20   66+21
toku-z-64  5+30    9+49    13+70   16+82
mongo      16+12   25+20   38+31   46+40

The last table lists the CPU reported by the Linux top utility for the database process (mongod/mysqld) and the client processes. The value 400 means 400%, or 4 CPU cores. Values are reported as server+client and are rounded because I estimated them by hand.

config    CPU@8     CPU@16     CPU@32     CPU@64     CPU@128
inno-u-8  490+120   930+216    1500+320   1800+390   1800+400
toku-q-8  490+105   900+200    1400+300   1400+320   1450+380
mongo     460+100   800+150    1200+250   1450+300   1600+360


4 comments:

  1. Just wondering, any experiences with TokuDB on these experiments? I am guessing not, but figured I would ask.

  2. I stopped testing TokuDB -- time is finite and I am helping a MongoDB user. InnoDB is included as a sanity check.

  3. What about toku-u-8?

    Replies
    1. Will share those results offline. Don't really understand them and didn't take the time to debug.

