Friday, May 9, 2014

Write amplification from log writes

MongoDB, TokuMX and MySQL use log files with high-value data. For MongoDB this is the journal, which uses direct IO. For MySQL this is the binlog, relay log and InnoDB redo log. All of these use buffered IO by default. The InnoDB redo log uses 512 bytes as the "page size" and the replication logs have no notion of page size.

The minimum size for a write to a storage device is either the sector size or filesystem page size. The sector size is either 512 or 4096 bytes today. In the future it will be 4096 or larger. The filesystem page size on my servers is 4096 bytes. When a DBMS tries to append 309 bytes to the end of a log file then more than 309 bytes are written to the storage device. Depending on the filesystem and choice of buffered or direct IO either a disk sector or a filesystem page will be written. This can explain why the bytes written rate as reported by iostat using the wsec/s column is higher than the rate reported by the DBMS.
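The rounding described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic. The function name device_bytes_for_append is hypothetical, and the sketch assumes that every block touched by the append is written in full.

```python
def device_bytes_for_append(offset, nbytes, block_size):
    """Bytes written to storage when nbytes are appended at file offset,
    assuming every block_size-aligned block that is touched is written in full."""
    first_block = offset // block_size
    last_block = (offset + nbytes - 1) // block_size
    return (last_block - first_block + 1) * block_size


# The 309-byte append from the text, for a 512-byte sector and a 4096-byte page.
for block_size in (512, 4096):
    written = device_bytes_for_append(0, 309, block_size)
    print(block_size, written, round(written / 309, 2))
```

For a 309-byte append at offset 0 the device sees 512 or 4096 bytes. An append that straddles a block boundary touches two blocks, so the gap between the DBMS counter and iostat can be larger still.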

MongoDB avoids the uncertainty by padding journal writes to a multiple of 8 KB. Perhaps the padding will be reduced to a multiple of 4 KB (see JIRA 13344). But the good thing is that the counters reported by MongoDB are correct. The OS and storage device will report the same value for bytes written to the journal file. Of course, this ignores any write-amplification from flash garbage collection.

MySQL and TokuMX do not avoid the uncertainty. I spent a few hours today looking at a busy host to explain the difference between the write rates reported by MySQL, iostat and flash storage physical bytes written counters. The sources of writes on the host include the following. Most of the bytes written were from the doublewrite buffer followed by dirty page write back.
  • InnoDB dirty page writeback - The host uses mostly 2X compression with an in-memory page size of 16 KB. So most disk writes are 8 KB but some are 16 KB.
  • InnoDB doublewrite buffer - Even though most pages are 8 KB, any page in the doublewrite buffer uses 16 KB. Domas has described this as triple-writing pages in his bug report.
  • InnoDB redo log - logically these are done as a multiple of 512 bytes with buffered IO. In the WebScaleSQL patch we have an option to round up the write to 4 KB to avoid reads from the filesystem when the to-be-written page is not in the OS filesystem cache. Alas the InnoDB counter for bytes written to the log does not include the bytes added by the round-to-4KB.
  • MySQL binlog - the binlog makes no attempt to round up writes to the end of the log. The counter for bytes written to the binlog ignores the round up done by the filesystem when fsync is called.
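The cost of the doublewrite buffer for compressed pages can be worked out directly from the sizes in the list above. This is a minimal sketch, assuming one writeback plus one 16 KB doublewrite slot per page flush. The name flush_cost is hypothetical.

```python
KB = 1024
DOUBLEWRITE_SLOT = 16 * KB  # per the text: any page in the doublewrite buffer uses 16 KB


def flush_cost(on_disk_page_bytes):
    """Return (total bytes written, multiple of the on-disk page size)
    for one page flush: the writeback itself plus the doublewrite slot."""
    total = on_disk_page_bytes + DOUBLEWRITE_SLOT
    return total, total / on_disk_page_bytes


for size in (8 * KB, 16 * KB):
    total, factor = flush_cost(size)
    print(f"{size // KB} KB page: {total // KB} KB written, {factor:.1f}x")
```

Under these assumptions an 8 KB compressed page costs 24 KB of writes, which is where the triple-writing description comes from, while an uncompressed 16 KB page costs 32 KB.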
We are working on fixing the InnoDB redo log bytes-written counter to include the padding added when rounding writes up to 4 KB. Until then you can use existing counters to estimate how much rounding was needed. These counters show the number of bytes written to the redo log and the number of fsyncs (or fdatasyncs) done. The size of the average write per fsync is less than 2 KB. With a 4 KB filesystem page the real rate is more than 2X the value reported by the Innodb_os_log_written counter.
    Innodb_os_log_fsyncs 370923966
    Innodb_os_log_written 656878594048

Counters for binlog writes and fsyncs can also be used to understand whether the reported write rate for the binlog is too low. The size of the average write per fsync is again less than 2 KB and with a 4 KB filesystem page the real rate is more than 2X the value reported by the Binlog_bytes_written counter.
    Binlog_bytes_written 699055440580
    Binlog_fsync_count 385104140
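The estimates for the redo log and binlog can be reproduced from the counter values listed above. This is a rough sketch, not an exact accounting. It assumes each fsync flushes the average number of bytes and that those bytes are rounded up to whole 4 KB filesystem pages. The function name estimate is hypothetical.

```python
import math

FS_PAGE = 4096  # filesystem page size on the servers described in the text

counters = {
    # name: (bytes written, fsync count), copied from the counters above
    "redo log": (656878594048, 370923966),
    "binlog": (699055440580, 385104140),
}


def estimate(bytes_written, fsyncs, page=FS_PAGE):
    """Return (average bytes per fsync, estimated real/reported write ratio)."""
    avg = bytes_written / fsyncs
    rounded = math.ceil(avg / page) * page  # bytes the filesystem writes per fsync
    return avg, rounded / avg


for name, (written, fsyncs) in counters.items():
    avg, factor = estimate(written, fsyncs)
    print(f"{name}: {avg:.0f} bytes per fsync, about {factor:.2f}x after rounding")
```

Both averages come out below 2 KB, so under these assumptions the real write rate is a bit more than 2X the reported rate in each case.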

The final set of counters was used to estimate the average size of a page writeback. The ratio of the values below is a bit under 10 KB (about 9,659 bytes). Most pages written were 8 KB but some were 16 KB.
    Innodb_data_async_write_requests 286202895
    Innodb_data_async_write_bytes 2764395651584
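The mix of 8 KB and 16 KB writes can be estimated from the two counters above. This is a sketch that assumes every writeback is exactly 8 KB or 16 KB, which the counters alone cannot confirm.

```python
requests = 286202895          # Innodb_data_async_write_requests
bytes_written = 2764395651584  # Innodb_data_async_write_bytes

SMALL, LARGE = 8 * 1024, 16 * 1024

avg = bytes_written / requests
# Solve avg = f * SMALL + (1 - f) * LARGE for f, the fraction of 8 KB writes.
frac_small = (LARGE - avg) / (LARGE - SMALL)

print(f"average write: {avg:.0f} bytes")
print(f"estimated share of 8 KB writes: {frac_small:.0%}")
```

With these inputs roughly four out of five writebacks would be 8 KB, which agrees with the host using mostly 2X compression.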

I am not sure whether all of the counters above are available in upstream MySQL. They are in WebScaleSQL. We have been adding extra monitoring for many years to support running MySQL at web scale.

Monday, May 5, 2014

Overhead of reading the Nth attribute in a document for MongoDB & TokuMX

I used the MongoDB sysbench client to measure the overhead from skipping N attributes in a document to access the N+1th attribute for N set to 1, 10, 100 and 1000. As expected there is overhead as N grows. For small N the overhead is less for TokuMX and for large N the overhead is less for MongoDB.

My test used 1 collection with 1M documents and 1 client thread. The sysbench client was changed to add extra attributes immediately after the _id attribute. The extra attributes used names of the form x${K} for K in 1 to 1, 1 to 10, 1 to 100 and 1 to 1000. Each of these attributes was assigned the value of K (x1=1, x2=2, ...). Immediately after these another attribute was added with the name y and assigned the same value as the _id attribute. Each test query matches one document and returns the values of the _id and y attributes after matching the _id and y attributes. The purpose of this test is to determine the overhead from skipping the extra attributes to find the y attribute. For N=1 the test and results were similar to what I did for a recent cached & read-only test.
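The document layout used by the test can be illustrated with a small BSON walker. This is a sketch under stated assumptions: all values are encoded as int32 (the sysbench client may use other types), the helper names build_doc and find_field are hypothetical, and the code only shows that BSON elements are stored sequentially. It does not model where MongoDB or TokuMX actually spend CPU.

```python
import struct

INT32 = 0x10  # BSON element type for a 32-bit integer


def build_doc(n, key=7):
    """Encode {_id: key, x1: 1, ..., xn: n, y: key} as BSON with int32 values."""
    fields = [("_id", key)] + [(f"x{k}", k) for k in range(1, n + 1)] + [("y", key)]
    body = b""
    for name, value in fields:
        body += bytes([INT32]) + name.encode() + b"\x00" + struct.pack("<i", value)
    # Total size counts the 4-byte length prefix and the trailing 0x00.
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"


def find_field(doc, wanted):
    """Walk elements in order. Return (value, number of elements visited)."""
    pos, visited = 4, 0
    while doc[pos] != 0:
        visited += 1
        if doc[pos] != INT32:
            raise ValueError("this sketch only handles int32 elements")
        name_end = doc.index(b"\x00", pos + 1)
        name = doc[pos + 1:name_end].decode()
        value = struct.unpack_from("<i", doc, name_end + 1)[0]
        if name == wanted:
            return value, visited
        pos = name_end + 1 + 4  # skip the terminator and the 4-byte value
    return None, visited


for n in (1, 10, 100, 1000):
    print(n, find_field(build_doc(n), "y"))
```

Finding y requires visiting _id, every x${K} attribute and then y itself, so the number of elements visited is N + 2.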

This is data for the chart above. Note that TokuMX does much better for small N but MongoDB does much better for large N. I repeated the tests with the y attribute before the x${K} attributes and the results were about the same as below. So my assumption that the overhead comes from attribute searches was wrong and the real overhead is from BSON parsing. Note that I ran extra tests for 200, 400, 500 and 750 attributes to understand the rate at which QPS decreases.

queries per second by #attributes
    1     10    100   200   400   500   750  1000   number of x${K} attributes
14326  13643  10317  8638  6543  5832  4515  3306   tokumx141
 9866   9398   8473  8305  7242  6903  6164  4954   mongo249
 9119   9530   8547  8613  7890  7577  6602  5289   mongo260
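
The table above can be summarized with a short script that reports QPS relative to the 1-attribute case and the first tested N at which MongoDB is faster than TokuMX. The numbers are copied from the table. The helper names are hypothetical.

```python
attrs = [1, 10, 100, 200, 400, 500, 750, 1000]
qps = {
    "tokumx141": [14326, 13643, 10317, 8638, 6543, 5832, 4515, 3306],
    "mongo249": [9866, 9398, 8473, 8305, 7242, 6903, 6164, 4954],
    "mongo260": [9119, 9530, 8547, 8613, 7890, 7577, 6602, 5289],
}


def relative_to_first(series):
    """QPS at each N as a fraction of QPS at N=1."""
    return [round(v / series[0], 2) for v in series]


def first_crossover(a, b):
    """Smallest tested N at which series b has higher QPS than series a."""
    for n, qa, qb in zip(attrs, a, b):
        if qb > qa:
            return n
    return None


for name, series in qps.items():
    print(name, relative_to_first(series))
for name in ("mongo249", "mongo260"):
    print(name, "passes tokumx141 at N =", first_crossover(qps["tokumx141"], qps[name]))
```

Both MongoDB versions first pass TokuMX at the 400-attribute test, so the crossover lies somewhere between 200 and 400 attributes.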

Overhead for 1000 attributes

I looked at top when the tests were running and for all cases mongod was using ~90% of 1 CPU core and Java was using ~10% of another. So MongoDB was using CPU at the same rate as TokuMX while serving many more QPS. I then used the Linux perf utility to understand what was using CPU in mongod. The CPU overhead for TokuMX is from calls to free & malloc. If I had an easy way to get hierarchical profiles I would include that, alas I don't have that. I do have the -g option with perf but the output isn't very interesting. It would be nice if TokuMX figured out how to get more debug symbols into their binary without expecting users to download/install a separate debug package.
This is the output from "perf -g" for TokuMX 1.4.1

    14.56%   mongod  mongod                 [.] free                             
             --- free
                |--91.43%-- 0x7fe27c1c09d8
                |          (nil)
                |--7.66%-- (nil)
                 --0.91%-- [...]

    12.71%   mongod  mongod                 [.] malloc                                                                               
             --- malloc
                |--76.17%-- mongo::ElementMatcher::~ElementMatcher()
                |          (nil)
                |--16.90%-- operator new(unsigned long)
                |          0x100000000
                |          0x3031317473657462
                |--4.22%-- (nil)
                |          |          
                |           --100.00%-- 0xffb
                |--1.37%-- 0xffffffff00000000
                |          0x3031317473657462
                 --1.34%-- mongo::BSONObj::init

And this is the non-hierarchical output from Linux perf during a 5 second interval of the test.

TokuMX 1.4.1
    15.83%   mongod  mongod                 [.] free
    11.64%   mongod  mongod                 [.] malloc
    11.23%   mongod  mongod                 [.] mongo::Projection::transform
     6.42%   mongod    [.] _ZNSs4_Rep10_M_disposeERKSaIcE.part.13
     6.40%   mongod  mongod                 [.] std::_Rb_tree<>::find
     4.81%   mongod            [.] strlen
     4.01%   mongod    [.] std::basic_string::basic_string
     3.49%   mongod    [.] char* std::string::_S_construct
     2.43%   mongod            [.] memcmp
     2.38%   mongod            [.] memcpy
     2.00%   mongod    [.] operator new(unsigned long)
     1.93%   mongod  mongod                 [.] mongo::Projection::append
     1.27%   mongod    [.] std::string::_Rep::_S_create

MongoDB 2.4.9
    13.78%   mongod  mongod               [.] mongo::Projection::transform
    11.75%   mongod  mongod               [.] mongo::UnorderedFastKeyTable::Area::find
     7.95%   mongod         [.] __strlen_sse42
     7.28%   mongod  mongod               [.] mongo::Projection::append
     4.21%   mongod  mongod               [.] (anonymous namespace)::cpp_alloc
     3.86%   mongod  mongod               [.] mongo::BSONElement::size()
     2.07%   mongod  mongod               [.] _ZN12_GLOBAL__N_121do_free_with_callback
     1.18%   mongod  mongod               [.] mongo::KeyV1::toBson()
     0.92%   mongod  mongod               [.] mongo::BucketBasics::KeyNode::KeyNode
     0.86%   mongod  mongod               [.] boost::detail::shared_count::~shared_count()
     0.71%   mongod   [.] pthread_mutex_lock
     0.70%   mongod  [.] std::basic_string::~basic_string()
     0.69%   mongod  mongod               [.] tc_malloc
     0.68%   mongod         [.] __memcpy_ssse3

MongoDB 2.6.0
    14.28%   mongod  mongod               [.] mongo::ProjectionStage::transform
    10.03%   mongod  mongod               [.] MurmurHash3_x64_128
     4.45%   mongod  mongod               [.] mongo::BSONElement::size()
     3.93%   mongod         [.] __strlen_sse42
     2.83%   mongod  mongod               [.] _ZNKSt3tr110_HashtableIN5mongo10StringDataES2_SaIS2_ESt9_IdentityIS2_ESt8equal_toIS2_ENS2_6HasherENS_8__detail18_Mod_range_hashingENS9_20_Default_ranged_hashENS9_20_Prime_rehash_policyELb0ELb1ELb1EE12_M_find_nodeEPNS9_10_Hash_nodeIS2_Lb0EEERKS2_m.isra.162.constprop.255
     2.65%   mongod  mongod               [.] operator new(unsigned long)
     1.98%   mongod         [.] __memcmp_sse4_1
     1.74%   mongod  mongod               [.] mongo::StringData::Hasher::operator()
     1.62%   mongod  mongod               [.] operator delete(void*)
     1.46%   mongod  mongod               [.] mongo::KeyV1::toBson()
     1.20%   mongod  mongod               [.] mongo::BucketBasics::KeyNode::KeyNode
     1.06%   mongod         [.] __memcpy_ssse3
     0.83%   mongod  mongod               [.] tc_malloc
     0.82%   mongod  mongod               [.] mongo::ps::Rolling::access
     0.78%   mongod   [.] pthread_mutex_lock

Overhead for 10 attributes

This lists the CPU profile from Linux perf during 5 seconds of the test. Unlike the 1000 attribute result above, here malloc/free or their equivalent are the top two sources for TokuMX and MongoDB.

TokuMX 1.4.1
     4.53%   mongod  mongod                 [.] malloc
     4.12%   mongod  mongod                 [.] free
     2.89%   mongod  [.] _ZNK4toku3omtIP13klpair_structS2_Lb0EE24find_internal_plus_arrayIR9ft_searchXa
     1.80%   mongod            [.] strlen
     1.53%   mongod  mongod                 [.] mongo::storage::KeyV1::woCompare
     1.14%   mongod      [.] pthread_mutex_lock
     1.02%   mongod            [.] memcpy
     0.99%   mongod  mongod                 [.] mongo::Projection::transform
     0.96%   mongod    [.] _ZNSs4_Rep10_M_dispose
     0.87%   mongod    [.] operator new(unsigned long)
     0.86%   mongod  mongod                 [.] mongo::Projection::init
     0.85%   mongod  [.] _Z26toku_ft_search_which_childP17__toku_descriptorPFiP9__toku_dbPK10__toku_dbt
     0.85%   mongod            [.] memcmp

MongoDB 2.4.9
     7.47%   mongod  mongod               [.] cpp_alloc
     3.94%   mongod  mongod               [.] _ZN12_GLOBAL__N_121do_free_with_callback    
     2.24%   mongod  mongod               [.] mongo::KeyV1::toBson()
     2.21%   mongod         [.] __strlen_sse42
     1.58%   mongod  mongod               [.] boost::detail::shared_count::~shared_count()
     1.51%   mongod  mongod               [.] mongo::BSONElement::size()
     1.48%   mongod         [.] __memcpy_ssse3
     1.40%   mongod  mongod               [.] mongo::BucketBasics::KeyNode::KeyNode
     1.39%   mongod  [.] _ZNSs4_Rep10_M_disposeERKSaIcE
     1.22%   mongod  [.] std::basic_string::~basic_string()
     1.22%   mongod         [.] __memcmp_sse4_1
     1.19%   mongod  mongod               [.] tc_malloc
     0.99%   mongod  mongod               [.] operator new(unsigned long)
     0.95%   mongod  mongod               [.] boost::intrusive_ptr::~intrusive_ptr()
     0.94%   mongod   [.] pthread_mutex_lock
     0.89%   mongod  [.] std::basic_string::basic_string
     0.77%   mongod  [.] std::basic_string::basic_string
     0.69%   mongod  mongod               [.] mongo::BtreeBucket::customBSONCmp
     0.68%   mongod  mongod               [.] mongo::UnorderedFastKeyTable::Area::find
     0.64%   mongod  [ip_tables]          [k] ipt_do_table
     0.64%   mongod  mongod               [.] CoveredIndexMatcher::matches

MongoDB 2.6.0
     4.26%   mongod  mongod               [.] operator new(unsigned long)
     2.86%   mongod  mongod               [.] operator delete(void*)
     2.70%   mongod  mongod               [.] mongo::KeyV1::toBson() const
     1.62%   mongod  mongod               [.] mongo::BucketBasics::KeyNode::KeyNode
     1.57%   mongod         [.] __memcpy_ssse3
     1.56%   mongod         [.] __strlen_sse42
     1.14%   mongod         [.] vfprintf
     1.13%   mongod   [.] pthread_mutex_lock
     1.13%   mongod         [.] __memcmp_sse4_1
     1.06%   mongod  mongod               [.] mongo::BSONElement::size()
     0.92%   mongod  [.] _ZNSs4_Rep10_M_dispose
     0.87%   mongod  mongod               [.] tc_malloc
     0.84%   mongod  [.] std::basic_string::~basic_string()  

Friday, May 2, 2014

The impact of read ahead and read size on MongoDB, TokuMX and MySQL

This continues my work on using a very simple workload (read-only, fetch 1 document by PK, database larger than RAM, fast storage) to understand how to get more QPS from TokuMX and MongoDB. I need to get more random read IOPs from storage to get more QPS from the DBMS. Beyond getting more throughput I ran tests to understand the impact of filesystem readahead on MongoDB and TokuMX and the impact of the read page size on TokuMX and InnoDB. My results are based on fast flash storage that can do more than 100,000 4kb reads/second. Be careful when trying to use these results for slower storage devices, especially disks.

In my previous post I wrote that I was able to get much more QPS from InnoDB but was likely to report improvements for TokuMX/MongoDB in the next post. This is the next post and some of the results I report here are better for a few reasons. First, some tests limited queries to the first half of each collection/table so the cache hit rate is much better especially for non-leaf index nodes. TokuMX and MongoDB are more sensitive to that and might require more cache/RAM for indexes than InnoDB. Second, I use better values for read_ahead_kb and readPageSize. My summary for this post is that:
  • Using a smaller value for readPageSize with TokuMX can help
  • Using the right value for read_ahead_kb can help. But I think that depending on filesystem readahead is a kludge that ends up wasting IO or hurting QPS
  • TokuMX and MongoDB require more cache/RAM for indexes than InnoDB even when TokuMX uses compression and InnoDB does not.


I describe the impact of read_ahead_kb for TokuMX and MongoDB, the impact of readPageSize for TokuMX and the impact of innodb_page_size for MySQL. My test server uses fast flash storage and what I report here might not be relevant for disk storage. From past experience there isn't much difference in peak IOPs from disks using 8kb versus 32kb reads as the device is bound by seek and rotational latency rather than data transfer. But flash devices usually do many more random reads per second at 8kb compared to 32kb.

I have a lot of experience with systems that use direct IO, less for systems that do reads & writes via buffered IO and much less for buffered IO + mmap. The epic rant about O_DIRECT is amusing because there wasn't a good alternative at the time for any DBMS that cared about IO performance. I suspect the DBMS community hasn't done enough to make things better (reach out to Linux developers, pay for improvements, volunteer time to run performance tests, donate expensive storage HW to people who do the systems work). But I think things are getting better and better.

mmap reads

MongoDB uses buffered IO and mmap. To get peak QPS for an IO-bound workload it must read the right amount of data per request -- not too much and not too little. With pread you tell the OS the right amount of data to read by specifying the read size. With mmap it is harder to communicate this. When a page fault occurs the OS knows that it must read the 4k page that contains the faulting address from storage. The OS doesn't know whether it should read adjacent 4k pages by using a larger read request to avoid a seek per 4kb of data. Perhaps the workaround is to suggest the read size by calling madvise with the MADV_WILLNEED flag prior to referencing memory and hope the suggestion isn't ignored.
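
The madvise workaround can be sketched with Python's mmap module, which exposes madvise and MADV_WILLNEED on Linux in Python 3.8 and later. This is only an illustration of the idea, not how MongoDB does its reads. The function name read_with_willneed is hypothetical and the kernel is free to ignore the hint.

```python
import mmap


def read_with_willneed(path, offset, length):
    """Read length bytes at offset through mmap, first hinting to the kernel
    that the whole range will be needed so it can use a larger read."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            end = min(len(m), offset + length)
            # madvise needs a page-aligned start address.
            start = offset - (offset % mmap.PAGESIZE)
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
                m.madvise(mmap.MADV_WILLNEED, start, end - start)
            # Touching the bytes is what triggers page faults if the hint was ignored.
            return m[offset:end]
```

Compare this with pread, where the read size is part of the request itself and no hint is needed.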

Reading 4kb at a time isn't good for MongoDB index pages which are 8kb, so the OS might do two disk reads and two disk seeks in the worst case. A 4kb read is also too small for documents that are larger than 4kb and the worst case again is extra disk reads. A partial workaround is to set read_ahead_kb to suggest to Linux that more data should be read. But one value for read_ahead_kb won't cover all use cases -- 8kb index pages, large documents and small documents. Note that this is also a hint. If you want high performance and high efficiency for IO-bound OLTP and you rely on read_ahead_kb then you aren't going to have a good time.
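
On Linux the read_ahead_kb setting for a block device lives in sysfs at /sys/block/<device>/queue/read_ahead_kb. The sketch below reads and writes it from Python. The helper names and the sysfs_root parameter are hypothetical (the parameter exists only so the code can be tried against a scratch directory), the device name is a placeholder, and writing the real file requires root.

```python
import os


def read_ahead_path(device, sysfs_root="/sys"):
    """Path of the read_ahead_kb setting for a block device such as 'sda'."""
    return os.path.join(sysfs_root, "block", device, "queue", "read_ahead_kb")


def get_read_ahead_kb(device, sysfs_root="/sys"):
    with open(read_ahead_path(device, sysfs_root)) as f:
        return int(f.read().strip())


def set_read_ahead_kb(device, kb, sysfs_root="/sys"):
    # Needs root on a real system. The value applies to the whole device,
    # so index pages, small documents and large documents all share it.
    with open(read_ahead_path(device, sysfs_root), "w") as f:
        f.write(str(kb))
```

Because there is one value per device, the setting cannot distinguish an 8kb index page read from a read of a large document.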

There isn't much information on the impact of read_ahead_kb on MongoDB and TokuMX. A simple way to understand the impact from it is to look at the disk read rate and disk sector read rate as measured by iostat. The number of disk sectors per disk read is also interesting. I report that data below from my tests. There is an open question about this on the MongoDB users email list and reference to a blog post.
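
The sectors per read and bytes per read metrics are simple ratios of two iostat columns. This is a minimal sketch, assuming the r/s and rsec/s columns from iostat -x and the 512-byte sector unit that iostat uses. The function names and the sample rates are made up for illustration.

```python
IOSTAT_SECTOR_BYTES = 512  # iostat reports sectors in 512-byte units


def sectors_per_read(reads_per_sec, sectors_read_per_sec):
    """Average number of sectors transferred per disk read."""
    return sectors_read_per_sec / reads_per_sec


def bytes_per_read(reads_per_sec, sectors_read_per_sec):
    """Average number of bytes transferred per disk read."""
    return sectors_per_read(reads_per_sec, sectors_read_per_sec) * IOSTAT_SECTOR_BYTES


# Hypothetical sample: 100,000 reads/second and 800,000 sectors/second.
print(sectors_per_read(100000, 800000), bytes_per_read(100000, 800000))
```

A result of 8 sectors (4096 bytes) per read means the device is doing single-page reads. Larger values show that readahead is pulling in extra data.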


InnoDB uses 16kb pages by default and the page size can be changed via the innodb_page_size parameter when installing MySQL. A fast flash device can do more random IOPs for 4kb requests than for 16kb requests because data transfer can limit throughput. TokuMX has a similar option that can be set per table via readPageSize. The TokuMX default is 64kb which might be good for disk-based servers but I suspect is too large for flash. Even on disk-based servers there are benefits from using a smaller page as less space will be wasted in RAM when only a few documents/rows per page are useful. For TokuMX the readPageSize is the size of the page prior to compression.

Setup for 1.6B documents

I used the same test as described previously except I limited queries to the first half of each table/collection and there were 1.6B documents/rows that could be fetched in the 8 collections/tables. I tested the configurations listed below. This lists the total database size but only half of the database was accessed.
  • fb56.handler - 740G database, MySQL 5.6.12 with the Facebook patch, InnoDB, page_size=4k, 8k, 16k, data fetched via HANDLER
  • fb56.sql - 740G database, MySQL 5.6.12 with the Facebook patch, InnoDB, page_size=4k, 8k, 16k, data fetched via SELECT
  • orig57.handler - 740G database, official MySQL 5.7.4, InnoDB, page_size=4k, 8k, 16k, data fetched via HANDLER. 
  • orig57.sql - 740G database, official MySQL 5.7.4, InnoDB, page_size=4k, 8k, 16k, data fetched via SELECT
  • tokumx8 - 456G database, TokuMX 1.4.1, quicklz, readPageSize=8k
  • tokumx64 - 582G database, TokuMX 1.4.1, quicklz, readPageSize=64K
  • mongo249 - 834G database, MongoDB 2.4.9, powerOf2Sizes=0
  • mongo260 - 874G database, MongoDB 2.6.0, powerOf2Sizes=1

Results for 1.6B documents

The graph below shows the best QPS for each of the servers tested: 8kb page_size for MySQL, 8kb readahead for tokumx8 and tokumx64, 4kb readahead for MongoDB. I exclude the results for MySQL 5.7 because they are similar to 5.6. As in my previous blog post MySQL with InnoDB does better than the others but the difference is less significant for two reasons. First I repeated tests for many different values of read_ahead_kb and display the best one below. Second the test database is smaller. I assume this helps MongoDB and TokuMX by improving the cache hit rate for index data -- MongoDB because the index isn't clustered so we need all of the index in RAM to avoid doing an extra disk read per query, TokuMX because the non-leaf levels of the index are larger than for a b-tree. MongoDB 2.6.0 does worse than 2.4.9 because of a CPU performance regression for simple PK queries (JIRA 13663 and 13685). TokuMX with an 8kb readPageSize is much better than a 64kb readPageSize because there is less data to decompress per disk read, there is a better cache hit rate and fast storage does more IOPs for smaller read requests. Note that the database was also much smaller on disk with an 8kb readPageSize. It would be good to explain that.
This has QPS results for all of the configurations tested. The peak QPS for each of the configurations below was used in the graph above except for MySQL 5.7.

queries per second
    8    16     32     40  concurrent clients
44078 71929 108999 110640  fb56.handler, 16k page
36739 63622 100654 107970  fb56.sql, 16k page
44743 74183 117871 131160  fb56.handler, 8k page
37064 64602 102672 119555  fb56.sql, 8k page
43440 71328 113778 128366  fb56.handler, 4k page
35629 61437  97452 113916  fb56.sql, 4k page
44120 73919 119059 130071  orig57.handler, 8k page
36589 64339 102399 119260  orig57.sql, 8k page
42772 70368 113781 128088  orig57.handler, 4k page
35287 60928  96736 112950  orig57.sql, 4k page
24502 39332  62389  68967  tokumx8, 0k readahead
27762 45074  73102  79999  tokumx8, 4k readahead
29256 49093  81222  91508  tokumx8, 8k readahead
27835 45406  76164  82695  tokumx8, 16k readahead
 6347  9287  13512  14396  tokumx64, 0k readahead
 7221 12835  20233  21477  tokumx64, 4k readahead
 9263 16088  26595  27943  tokumx64, 8k readahead
10272 18602  22645  22015  tokumx64, 16k readahead
11191 20349  24090  24086  tokumx64, 32k readahead
10384 16492  17093  16600  tokumx64, 64k readahead
38154 62257  96033 103770  mongo249, 0k readahead
38274 62321  96131 106017  mongo249, 4k readahead
33088 51609  72699  76311  mongo249, 8k readahead
16533 22871  24019  25076  mongo249, 16k readahead
17572 23332  24319  24324  mongo249, 32k readahead
29179 49731  77114  84779  mongo260, 0k readahead
28979 49521  76569  86985  mongo260, 4k readahead
26321 42967  65662  71112  mongo260, 8k readahead
15338 23131  24448  25131  mongo260, 16k readahead
16277 23428  24443  24566  mongo260, 32k readahead

The next section displays the number of disk reads per query for some of the configurations to understand whether the server is efficient for IO. The result is from the 40 concurrent client test. Disk reads/query is much higher for TokuMX with a 64k readPageSize than for all other servers. The rate changes for MongoDB 2.4.9 and 2.6.0 between 8k and 16k readahead.

reads/query  server
0.649        fb56.handler, 16kb page
0.672        fb56.handler, 8kb page
0.748        fb56.handler, 4kb page
1.363        tokumx8, 0k readahead
1.275        tokumx8, 4k readahead
0.927        tokumx8, 8k readahead
0.987        tokumx8, 16k readahead
7.161        tokumx64, 0k readahead
6.547        tokumx64, 4k readahead
3.833        tokumx64, 8k readahead
2.659        tokumx64, 16k readahead
2.432        tokumx64, 32k readahead
2.909        tokumx64, 64k readahead
0.806        mongo249, 0k readahead
0.807        mongo249, 4k readahead
0.824        mongo249, 8k readahead
1.117        mongo249, 16k readahead
1.147        mongo249, 32k readahead
0.830        mongo260, 0k readahead
0.832        mongo260, 4k readahead
0.847        mongo260, 8k readahead
1.115        mongo260, 16k readahead
1.145        mongo260, 32k readahead

The final table has the number of bytes read per disk read. This was measured from the 40 concurrent client test. MySQL used direct IO for InnoDB so storage reads the requested data and no more. The larger values for MongoDB are expected when readahead is set too large but this also demonstrates the difficulty of trying to be efficient when setting read_ahead_kb.

bytes/read   server
16384        fb56.handler, 16kb page
 8192        fb56.handler, 8kb page
 4096        fb56.handler, 4kb page
 4096        tokumx8, 0k readahead
 5939        tokumx8, 4k readahead
 9011        tokumx8, 8k readahead
10752        tokumx8, 16k readahead
 4096        tokumx64, 0k readahead
 4557        tokumx64, 4k readahead
 9011        tokumx64, 8k readahead
23347        tokumx64, 16k readahead
23398        tokumx64, 32k readahead
27802        tokumx64, 64k readahead
 4096        mongo249, 0k readahead
 4198        mongo249, 4k readahead
 7322        mongo249, 8k readahead
19098        mongo249, 16k readahead
18534        mongo249, 32k readahead
 4096        mongo260, 0k readahead
 4250        mongo260, 4k readahead
 7424        mongo260, 8k readahead
19712        mongo260, 16k readahead
19149        mongo260, 32k readahead
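
Multiplying the two tables above gives bytes read from storage per query, which is one way to compare IO efficiency across servers. This is a sketch using a few rows copied from the tables. The function name bytes_per_query is hypothetical.

```python
# (reads/query, bytes/read) from the 40 concurrent client test
rows = {
    "fb56.handler, 8kb page": (0.672, 8192),
    "tokumx8, 8k readahead": (0.927, 9011),
    "tokumx64, 32k readahead": (2.432, 23398),
    "tokumx64, 64k readahead": (2.909, 27802),
    "mongo249, 4k readahead": (0.807, 4198),
    "mongo249, 16k readahead": (1.117, 19098),
}


def bytes_per_query(reads_per_query, bytes_per_read):
    """Average bytes read from storage to serve one query."""
    return reads_per_query * bytes_per_read


for name, (rpq, bpr) in rows.items():
    print(f"{bytes_per_query(rpq, bpr):8.0f}  {name}")
```

With these inputs MongoDB 2.4.9 reads about six times more data per query at 16k readahead than at 4k readahead, which lines up with the drop in QPS between those two settings.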

Setup for 3.2B documents

The same test was repeated except the clients were able to query all 3.2B documents/rows in the test collections/tables. I exclude results for TokuMX with 64k readPageSize.

Results for 3.2B documents

The graph has the best configuration for each server: 8kb page_size for MySQL, 16kb readahead for tokumx8, 4kb readahead for MongoDB. TokuMX matches MongoDB 2.4.9 here while it did worse than it in the 1.6B document test.
This has QPS results for all of the configurations tested. It should be possible for MySQL with InnoDB to get more QPS at 4kb pages. I don't know why that didn't happen and suspect that mutex contention was a problem. TokuMX with 8k readPageSize matched MongoDB 2.4.9 here, otherwise the results are similar to the 1.6B document/row test.

queries per second
    8    16     32     40  concurrent clients
38767 60366  86895  87517  fb56.handler, 16k page
33062 54847  84480  85764  fb56.sql, 16k page
39940 63628 102261 107312  fb56.handler, 8k page
33599 56819  91128 102378  fb56.sql, 8k page
39165 62409 101283 111496  fb56.handler, 4k page
32593 54810  87967 100644  fb56.sql, 4k page
38829 60419  86223  86655  orig57.handler, 16k page
33298 55034  84282  84734  orig57.sql, 16k page
39898 63673 102641 106424  orig57.handler, 8k page
33801 57097  91397 101718  orig57.sql, 8k page
39015 62159 101203 107398  orig57.handler, 4k page
32433 54067  84663  90872  orig57.sql, 4k page
21062 32292  49979  54177  tokumx8, 0k readahead
24379 39353  58571  61245  tokumx8, 4k readahead
27871 45992  73472  80748  tokumx8, 8k readahead
27396 45214  74040  81592  tokumx8, 16k readahead
29529 45976  69002  72772  mongo249, 0k readahead
29482 45868  71590  73942  mongo249, 4k readahead
23608 35676  48662  51503  mongo249, 8k readahead
18606 27554  31865  32637  mongo249, 16k readahead
12485 16190  16668  16662  mongo249, 32k readahead
24296 40154  61795  66992  mongo260, 0k readahead
24245 39781  61343  68450  mongo260, 4k readahead
20559 32111  46572  49358  mongo260, 8k readahead
17115 26220  32825  33184  mongo260, 16k readahead
12309 17050  17542  17595  mongo260, 32k readahead

The next section displays the number of disk reads per query for some of the configurations to understand whether the server is efficient for IO. The result is from the 40 concurrent client test. Note that TokuMX QPS gets much better as a larger readahead reduces the number of disk reads per query.

reads/query  server
0.823        fb56.handler, 16kb page
0.842        fb56.handler, 8kb page
0.897        fb56.handler, 4kb page
1.794        tokumx8, 0k readahead
1.723        tokumx8, 4k readahead
1.074        tokumx8, 8k readahead
0.999        tokumx8, 16k readahead
1.213        mongo249, 0k readahead
1.213        mongo249, 4k readahead
1.270        mongo249, 8k readahead
1.232        mongo249, 16k readahead
1.478        mongo249, 32k readahead
1.225        mongo260, 0k readahead
1.226        mongo260, 4k readahead
1.285        mongo260, 8k readahead
1.225        mongo260, 16k readahead
1.458        mongo260, 32k readahead

The final table has the number of bytes read per disk read. This was measured from the 40 concurrent client test. MySQL used direct IO for InnoDB so storage reads the requested data and no more. The larger values for MongoDB are expected when readahead is set too large but this also demonstrates the difficulty of trying to be efficient when setting read_ahead_kb.

bytes/read   server
16384        fb56.handler, 16kb page
 8192        fb56.handler, 8kb page
 4096        fb56.handler, 4kb page
 4096        tokumx8, 0k readahead
 5939        tokumx8, 4k readahead
 9574        tokumx8, 8k readahead
11469        tokumx8, 16k readahead
 4096        mongo249, 0k readahead
 4250        mongo249, 4k readahead
 7373        mongo249, 8k readahead
13107        mongo249, 16k readahead
21197        mongo249, 32k readahead
 4096        mongo260, 0k readahead
 4301        mongo260, 4k readahead
 7475        mongo260, 8k readahead
13210        mongo260, 16k readahead
21658        mongo260, 32k readahead