Small Datum: May 2014

Friday, May 9, 2014

Write amplification from log writes

MongoDB, TokuMX and MySQL use log files with high-value data. For MongoDB this is the journal, which uses direct IO. For MySQL this is the binlog, relay log and InnoDB redo log, all of which use buffered IO by default. The InnoDB redo log uses 512 bytes as the "page size" and the replication logs have no notion of page size.

The minimum size for a write to a storage device is either the sector size or filesystem page size. The sector size is either 512 or 4096 bytes today. In the future it will be 4096 or larger. The filesystem page size on my servers is 4096 bytes. When a DBMS tries to append 309 bytes to the end of a log file then more than 309 bytes are written to the storage device. Depending on the filesystem and choice of buffered or direct IO either a disk sector or a filesystem page will be written. This can explain why the bytes written rate as reported by iostat using the wsec/s column is higher than the rate reported by the DBMS.
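As a rough illustration of the rounding described above, this sketch computes how many bytes reach the device when an append is rounded up to whole write units. It assumes the append starts on a unit boundary, and the function name is made up for this example.

```python
def bytes_written(append_bytes, unit_bytes):
    """Bytes sent to storage when an append is rounded up to whole units.

    Assumes the append starts on a unit boundary (a simplification).
    """
    units = -(-append_bytes // unit_bytes)  # ceiling division
    return units * unit_bytes


# The 309-byte append from the text, with a 512-byte sector and a 4096-byte page.
for unit in (512, 4096):
    written = bytes_written(309, unit)
    print(unit, written, round(written / 309, 2))
```

Under that assumption the 309-byte append becomes a 512-byte or a 4096-byte write, which is roughly 1.7X or 13X the size the DBMS would report.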

MongoDB avoids the uncertainty by padding journal writes to a multiple of 8 KB. Perhaps the padding will be reduced to a multiple of 4 KB (see JIRA 13344). But the good thing is that the counters reported by MongoDB are correct. The OS and storage device will report the same value for bytes written to the journal file. Of course, this ignores any write-amplification from flash garbage collection.

MySQL and TokuMX do not avoid the uncertainty. I spent a few hours today looking at a busy host to explain the difference between the write rates reported by MySQL, iostat and flash storage physical bytes written counters. The sources of writes on the host include the following. Most of the bytes written were from the doublewrite buffer followed by dirty page write back.

InnoDB dirty page writeback - The host uses mostly 2X compression with an in-memory page size of 16 KB. So most disk writes are 8 KB but some are 16 KB.
InnoDB doublewrite buffer - Even though most pages are 8 KB, any page in the doublewrite buffer uses 16 KB. Domas has described this as triple-writing pages in his bug report.
InnoDB redo log - logically these are done as a multiple of 512 bytes with buffered IO. In the WebScaleSQL patch we have an option to round up the write to 4 KB to avoid reads from the filesystem when the to-be-written page is not in the OS filesystem cache. Alas the InnoDB counter for bytes written to the log did not include the bytes written from the round-to-4KB.
MySQL binlog - the binlog makes no attempt to round up writes to the end of the log. The counter for bytes written to the binlog ignores the round up done by the filesystem when fsync is called.
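The doublewrite arithmetic can be sketched as below. This is my reading of the "triple-writing" description, not something taken from the bug report: a page is written once in place at its on-disk size and once more to the doublewrite buffer at 16 KB. The function name is hypothetical.

```python
KB = 1024


def page_write_amplification(on_disk_kb, doublewrite_kb=16):
    """Bytes written per page flush divided by the on-disk page size.

    Assumes one in-place write plus one doublewrite buffer write per flush.
    """
    total = (on_disk_kb + doublewrite_kb) * KB
    return total / (on_disk_kb * KB)


print(page_write_amplification(8))   # compressed 8 KB page
print(page_write_amplification(16))  # uncompressed 16 KB page
```

With these assumptions an 8 KB compressed page costs 24 KB of writes, or 3X, while a 16 KB page costs 2X.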

We are working on fixing the InnoDB redo log bytes-written counter to include the padding added when rounding writes up to 4 KB. Until then you can use existing counters to estimate how much rounding was needed. These counters show the number of bytes written to the redo log and the number of fsyncs (or fdatasyncs) done. The size of the average write per fsync is less than 2 KB. With a 4 KB filesystem page the real rate is more than 2X the value reported by the Innodb_os_log_written counter.

Innodb_os_log_fsyncs 370923966

Innodb_os_log_written 656878594048

Counters for binlog writes and fsyncs can be used the same way to check whether the reported binlog write rate understates the real rate. The size of the average write per fsync is again less than 2 KB and with a 4 KB filesystem page the real rate is more than 2X the value reported by the Binlog_bytes_written counter.

Binlog_bytes_written 699055440580

Binlog_fsync_count 385104140
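The estimates for the redo log and the binlog can be reproduced from the counter values above. This is a rough sketch: it assumes each fsync flushes one average-sized write and that the write is rounded up to whole 4 KB filesystem pages, so it ignores writes that straddle a page boundary.

```python
import math


def avg_write_per_fsync(bytes_written, fsyncs):
    """Average number of bytes the DBMS reports per fsync."""
    return bytes_written / fsyncs


def estimated_amplification(bytes_written, fsyncs, page_bytes=4096):
    """Ratio of page-rounded bytes to reported bytes for an average write."""
    avg = avg_write_per_fsync(bytes_written, fsyncs)
    pages = math.ceil(avg / page_bytes)
    return pages * page_bytes / avg


redo = (656878594048, 370923966)    # Innodb_os_log_written, Innodb_os_log_fsyncs
binlog = (699055440580, 385104140)  # Binlog_bytes_written, Binlog_fsync_count

for name, counters in (("redo log", redo), ("binlog", binlog)):
    print(name,
          round(avg_write_per_fsync(*counters)),
          round(estimated_amplification(*counters), 2))
```

Both averages come out below 2 KB, and both ratios come out a little above 2X, which matches the claims in the text.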

The final set of counters was used to estimate the average size of a page writeback. The ratio of the values below is a bit under 10 KB (about 9.4 KB). Most pages written were 8 KB but some were 16 KB.

Innodb_data_async_write_requests 286202895

Innodb_data_async_write_bytes 2764395651584
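The average writeback size, and a guess at the mix of page sizes, can be computed from the two counters. The mix is only an estimate that assumes every write is exactly 8 KB or 16 KB, which the counters cannot confirm.

```python
requests = 286202895         # Innodb_data_async_write_requests
total_bytes = 2764395651584  # Innodb_data_async_write_bytes

avg_bytes = total_bytes / requests

# If every write is either 8 KB or 16 KB, the average pins down the mix.
small, large = 8 * 1024, 16 * 1024
frac_large = (avg_bytes - small) / (large - small)

print(round(avg_bytes), round(avg_bytes / 1024, 2), round(frac_large, 2))
```

Under that assumption a bit less than one in five writes were 16 KB and the rest were 8 KB.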

I am not sure whether all of the counters above are available in upstream MySQL. They are in WebScaleSQL. We have been adding extra monitoring for many years to support running MySQL at web scale.

Monday, May 5, 2014

Overhead of reading the Nth attribute in a document for MongoDB & TokuMX

I used the MongoDB sysbench client to measure the overhead from skipping N attributes in a document to access the N+1th attribute for N set to 1, 10, 100 and 1000. As expected the overhead grows with N. For small N the overhead is less for TokuMX and for large N the overhead is less for MongoDB.

My test used 1 collection with 1M documents and 1 client thread. The sysbench client was changed to add extra attributes immediately after the _id attribute. The extra attributes used names of the form x${K} for K in 1 to 1, 1 to 10, 1 to 100 and 1 to 1000. Each of these attributes was assigned the value of K (x1=1, x2=2, ...). Immediately after these another attribute was added with the name y and assigned the same value as the _id attribute. Each test query matches one document and returns the values of the _id and y attributes after matching the _id and y attributes. The purpose of this test is to determine the overhead from skipping the extra attributes to find the y attribute. For N=1 the test and results were similar to what I did for a recent cached & read-only test.
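The document layout and query shape described above look roughly like this. The real client is the Java sysbench client; this Python version only illustrates the attribute order, and the function names are made up.

```python
def make_document(doc_id, n_extra):
    """Build a test document: _id first, then x1..xN, then y (same value as _id)."""
    doc = {"_id": doc_id}
    for k in range(1, n_extra + 1):
        doc[f"x{k}"] = k
    doc["y"] = doc_id
    return doc


def make_query(doc_id):
    """Match on _id and y, and return only _id and y."""
    query_filter = {"_id": doc_id, "y": doc_id}
    projection = {"_id": 1, "y": 1}
    return query_filter, projection


doc = make_document(42, 10)
print(list(doc)[:3], list(doc)[-2:])
print(make_query(42))
```

Python dicts keep insertion order, so y ends up after all of the x${K} attributes as in the test.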

This is data for the chart above. Note that TokuMX does much better for small N but MongoDB does much better for large N. I repeated the tests with the y attribute before the x${K} attributes and the results were about the same as below. So my assumption that the overhead comes from attribute searches was wrong and the real overhead is from BSON parsing. Note that I ran extra tests for 200, 400, 500 and 750 attributes to understand the rate at which QPS decreases.

queries per second by #attributes
    1     10    100   200   400   500   750  1000  number of x${K} attributes
14326  13643  10317  8638  6543  5832  4515  3306  tokumx141
 9866   9398   8473  8305  7242  6903  6164  4954  mongo249
 9119   9530   8547  8613  7890  7577  6602  5289  mongo260
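To make the trend easier to see, this sketch computes how much QPS each server keeps at 1000 attributes relative to 1 attribute, and the first tested N at which TokuMX falls behind MongoDB. The numbers are copied from the table above and the helper names are made up.

```python
attrs = [1, 10, 100, 200, 400, 500, 750, 1000]
qps = {
    "tokumx141": [14326, 13643, 10317, 8638, 6543, 5832, 4515, 3306],
    "mongo249": [9866, 9398, 8473, 8305, 7242, 6903, 6164, 4954],
    "mongo260": [9119, 9530, 8547, 8613, 7890, 7577, 6602, 5289],
}


def retained(server):
    """QPS at 1000 attributes as a fraction of QPS at 1 attribute."""
    return qps[server][-1] / qps[server][0]


def first_n_slower(server, other):
    """Smallest tested N at which `server` has lower QPS than `other`, else None."""
    for n, a, b in zip(attrs, qps[server], qps[other]):
        if a < b:
            return n
    return None


for name in qps:
    print(name, round(retained(name), 2))
print(first_n_slower("tokumx141", "mongo249"),
      first_n_slower("tokumx141", "mongo260"))
```

TokuMX keeps less than a quarter of its QPS at 1000 attributes while both MongoDB versions keep at least half, and TokuMX is still ahead at 200 attributes but behind at 400.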

Overhead for 1000 attributes

I looked at top when the tests were running and for all cases mongod was using ~90% of 1 CPU core and Java was using ~10% of another. So MongoDB was using CPU at the same rate as TokuMX while serving many more QPS. I then used the Linux perf utility to understand what was using CPU in mongod. The CPU overhead for TokuMX is from calls to free & malloc. If I had an easy way to get hierarchical profiles I would include them, but I don't. I do have the -g option with perf but the output isn't very interesting. It would be nice if TokuMX figured out how to get more debug symbols into their binary without expecting users to download/install a separate debug package.
This is the output from "perf -g" for TokuMX 1.4.1

14.56% mongod mongod [.] free
--- free
|--91.43%-- 0x7fe27c1c09d8
| (nil)
|--7.66%-- (nil)
--0.91%-- [...]

12.71% mongod mongod [.] malloc
--- malloc
|--76.17%-- mongo::ElementMatcher::~ElementMatcher()
| (nil)
|--16.90%-- operator new(unsigned long)
| 0x100000000
| 0x3031317473657462
|--4.22%-- (nil)
| |
| --100.00%-- 0xffb
|--1.37%-- 0xffffffff00000000
| 0x3031317473657462
--1.34%-- mongo::BSONObj::init

And this is the non-hierarchical output from Linux perf during a 5 second interval of the test.

TokuMX 1.4.1

15.83% mongod mongod [.] free
11.64% mongod mongod [.] malloc
11.23% mongod mongod [.] mongo::Projection::transform
6.42% mongod libstdc++.so.6.0.17 [.] _ZNSs4_Rep10_M_disposeERKSaIcE.part.13
6.40% mongod mongod [.] std::_Rb_tree<>::find
4.81% mongod libc-2.5.so [.] strlen
4.01% mongod libstdc++.so.6.0.17 [.] std::basic_string::basic_string
3.49% mongod libstdc++.so.6.0.17 [.] char* std::string::_S_construct
2.43% mongod libc-2.5.so [.] memcmp
2.38% mongod libc-2.5.so [.] memcpy
2.00% mongod libstdc++.so.6.0.17 [.] operator new(unsigned long)
1.93% mongod mongod [.] mongo::Projection::append
1.27% mongod libstdc++.so.6.0.17 [.] std::string::_Rep::_S_create

MongoDB 2.4.9

13.78% mongod mongod [.] mongo::Projection::transform
11.75% mongod mongod [.] mongo::UnorderedFastKeyTable::Area::find
7.95% mongod libc-2.13.so [.] __strlen_sse42
7.28% mongod mongod [.] mongo::Projection::append
4.21% mongod mongod [.] (anonymous namespace)::cpp_alloc
3.86% mongod mongod [.] mongo::BSONElement::size()
2.07% mongod mongod [.] _ZN12_GLOBAL__N_121do_free_with_callback
1.18% mongod mongod [.] mongo::KeyV1::toBson()
0.92% mongod mongod [.] mongo::BucketBasics::KeyNode::KeyNode
0.86% mongod mongod [.] boost::detail::shared_count::~shared_count()
0.71% mongod libpthread-2.13.so [.] pthread_mutex_lock
0.70% mongod libstdc++.so.6.0.16 [.] std::basic_string::~basic_string()
0.69% mongod mongod [.] tc_malloc
0.68% mongod libc-2.13.so [.] __memcpy_ssse3

MongoDB 2.6.0

14.28% mongod mongod [.] mongo::ProjectionStage::transform
10.03% mongod mongod [.] MurmurHash3_x64_128
4.45% mongod mongod [.] mongo::BSONElement::size()
3.93% mongod libc-2.13.so [.] __strlen_sse42
2.83% mongod mongod [.] _ZNKSt3tr110_HashtableIN5mongo10StringDataES2_SaIS2_ESt9_IdentityIS2_ESt8equal_toIS2_ENS2_6HasherENS_8__detail18_Mod_range_hashingENS9_20_Default_ranged_hashENS9_20_Prime_rehash_policyELb0ELb1ELb1EE12_M_find_nodeEPNS9_10_Hash_nodeIS2_Lb0EEERKS2_m.isra.162.constprop.255
2.65% mongod mongod [.] operator new(unsigned long)
1.98% mongod libc-2.13.so [.] __memcmp_sse4_1
1.74% mongod mongod [.] mongo::StringData::Hasher::operator()
1.62% mongod mongod [.] operator delete(void*)
1.46% mongod mongod [.] mongo::KeyV1::toBson()
1.20% mongod mongod [.] mongo::BucketBasics::KeyNode::KeyNode
1.06% mongod libc-2.13.so [.] __memcpy_ssse3
0.83% mongod mongod [.] tc_malloc
0.82% mongod mongod [.] mongo::ps::Rolling::access
0.78% mongod libpthread-2.13.so [.] pthread_mutex_lock

Overhead for 10 attributes

This lists the CPU profile from Linux perf during 5 seconds of the test. Here malloc/free or their equivalents (cpp_alloc, operator new and operator delete) are the top two sources of CPU for both TokuMX and MongoDB. In the 1000 attribute result above that was only true for TokuMX.

TokuMX 1.4.1

4.53% mongod mongod [.] malloc
4.12% mongod mongod [.] free
2.89% mongod libtokufractaltree.so [.] _ZNK4toku3omtIP13klpair_structS2_Lb0EE24find_internal_plus_arrayIR9ft_searchXadL_Z15wrappy_fun_findIS6_XadL_ZL23heaviside_from_search_tRK10__toku_db
1.80% mongod libc-2.5.so [.] strlen
1.53% mongod mongod [.] mongo::storage::KeyV1::woCompare
1.14% mongod libpthread-2.5.so [.] pthread_mutex_lock
1.02% mongod libc-2.5.so [.] memcpy
0.99% mongod mongod [.] mongo::Projection::transform
0.96% mongod libstdc++.so.6.0.17 [.] _ZNSs4_Rep10_M_dispose
0.87% mongod libstdc++.so.6.0.17 [.] operator new(unsigned long)
0.86% mongod mongod [.] mongo::Projection::init
0.85% mongod libtokufractaltree.so [.] _Z26toku_ft_search_which_childP17__toku_descriptorPFiP9__toku_dbPK10__toku_dbtS5_EP6ftnodeP9ft_search
0.85% mongod libc-2.5.so [.] memcmp

MongoDB 2.4.9

7.47% mongod mongod [.] cpp_alloc
3.94% mongod mongod [.] _ZN12_GLOBAL__N_121do_free_with_callback
2.24% mongod mongod [.] mongo::KeyV1::toBson()
2.21% mongod libc-2.13.so [.] __strlen_sse42
1.58% mongod mongod [.] boost::detail::shared_count::~shared_count()
1.51% mongod mongod [.] mongo::BSONElement::size()
1.48% mongod libc-2.13.so [.] __memcpy_ssse3
1.40% mongod mongod [.] mongo::BucketBasics::KeyNode::KeyNode
1.39% mongod libstdc++.so.6.0.16 [.] _ZNSs4_Rep10_M_disposeERKSaIcE
1.22% mongod libstdc++.so.6.0.16 [.] std::basic_string::~basic_string()
1.22% mongod libc-2.13.so [.] __memcmp_sse4_1
1.19% mongod mongod [.] tc_malloc
0.99% mongod mongod [.] operator new(unsigned long)
0.95% mongod mongod [.] boost::intrusive_ptr::~intrusive_ptr()
0.94% mongod libpthread-2.13.so [.] pthread_mutex_lock
0.89% mongod libstdc++.so.6.0.16 [.] std::basic_string::basic_string
0.77% mongod libstdc++.so.6.0.16 [.] std::basic_string::basic_string
0.69% mongod mongod [.] mongo::BtreeBucket::customBSONCmp
0.68% mongod mongod [.] mongo::UnorderedFastKeyTable::Area::find
0.64% mongod [ip_tables] [k] ipt_do_table
0.64% mongod mongod [.] CoveredIndexMatcher::matches

MongoDB 2.6.0

4.26% mongod mongod [.] operator new(unsigned long)
2.86% mongod mongod [.] operator delete(void*)
2.70% mongod mongod [.] mongo::KeyV1::toBson() const
1.62% mongod mongod [.] mongo::BucketBasics::KeyNode::KeyNode
1.57% mongod libc-2.13.so [.] __memcpy_ssse3
1.56% mongod libc-2.13.so [.] __strlen_sse42
1.14% mongod libc-2.13.so [.] vfprintf
1.13% mongod libpthread-2.13.so [.] pthread_mutex_lock
1.13% mongod libc-2.13.so [.] __memcmp_sse4_1
1.06% mongod mongod [.] mongo::BSONElement::size()
0.92% mongod libstdc++.so.6.0.16 [.] _ZNSs4_Rep10_M_dispose
0.87% mongod mongod [.] tc_malloc
0.84% mongod libstdc++.so.6.0.16 [.] std::basic_string::~basic_string()

Friday, May 2, 2014

The impact of read ahead and read size on MongoDB, TokuMX and MySQL

This continues my work on using a very simple workload (read-only, fetch 1 document by PK, database larger than RAM, fast storage) to understand how to get more QPS from TokuMX and MongoDB. I need to get more random read IOPs from storage to get more QPS from the DBMS. Beyond getting more throughput I ran tests to understand the impact of filesystem readahead on MongoDB and TokuMX and the impact of the read page size on TokuMX and InnoDB. My results are based on fast flash storage that can do more than 100,000 4kb reads/second. Be careful when trying to use these results for slower storage devices, especially disks.

In my previous post I wrote that I was able to get much more QPS from InnoDB but was likely to report improvements for TokuMX/MongoDB in the next post. This is the next post and some of the results I report here are better for a few reasons. First, some tests limited queries to the first half of each collection/table so the cache hit rate is much better especially for non-leaf index nodes. TokuMX and MongoDB are more sensitive to that and might require more cache/RAM for indexes than InnoDB. Second, I use better values for read_ahead_kb and readPageSize. My summary for this post is that:

Using a smaller value for readPageSize with TokuMX can help
Using the right value for read_ahead_kb can help. But I think that depending on filesystem readahead is a kludge that ends up wasting IO or hurting QPS
TokuMX and MongoDB require more cache/RAM for indexes than InnoDB even when TokuMX uses compression and InnoDB does not.

Goals

I describe the impact of read_ahead_kb for TokuMX and MongoDB, the impact of readPageSize for TokuMX and the impact of innodb_page_size for MySQL. My test server uses fast flash storage and what I report here might not be relevant for disk storage. From past experience there isn't much difference in peak IOPs from disks using 8kb versus 32kb reads as the device is bound by seek and rotational latency rather than data transfer. But flash devices usually do many more random reads per second at 8kb compared to 32kb.

I have a lot of experience with systems that use direct IO, less for systems that do reads & writes via buffered IO and much less for buffered IO + mmap. The epic rant about O_DIRECT is amusing because there wasn't a good alternative at the time for any DBMS that cared about IO performance. I suspect the DBMS community hasn't done enough to make things better (reach out to Linux developers, pay for improvements, volunteer time to run performance tests, donate expensive storage HW to people who do the systems work). But I think things are getting better and better.

mmap reads

MongoDB uses buffered IO and mmap. To get peak QPS for an IO-bound workload it must read the right amount of data per request -- not too much and not too little. With pread you tell the OS the right amount of data to read by specifying the read size. With mmap it is harder to communicate this. When a page fault occurs the OS knows that it must read the 4k page that contains the faulting address from storage. The OS doesn't know whether it should read adjacent 4k pages by using a larger read request to avoid a seek per 4kb of data. Perhaps the workaround is to suggest the read size by calling madvise with the MADV_WILLNEED flag prior to referencing memory and hope the suggestion isn't ignored.
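The madvise workaround mentioned above might look like the sketch below. This is not what MongoDB does. It is a small Python illustration that assumes Python 3.8 or newer on a platform that exposes MADV_WILLNEED, and it skips the hint when either is missing. The helper name and the temporary file are made up for the example.

```python
import mmap
import os
import tempfile


def read_with_hint(path, offset, length):
    """Read `length` bytes at `offset` through mmap after hinting the kernel."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            page = mmap.PAGESIZE
            start = (offset // page) * page  # madvise wants a page-aligned start
            span = (offset + length) - start
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
                # Only a hint: the kernel may read less, more or nothing early.
                m.madvise(mmap.MADV_WILLNEED, start, span)
            return m[offset:offset + length]


# Demo: a 64 KB file, then an 8 KB read at offset 8 KB (one index-page-sized read).
payload = bytes(range(256)) * 256
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name
data = read_with_hint(demo_path, 8192, 8192)
os.remove(demo_path)
print(len(data))
```

The contrast with pread is that the hint and the memory access are two separate steps, and nothing guarantees the kernel acts on the hint.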

Reading 4kb at a time isn't good for MongoDB index pages which are 8kb, so the OS might do two disk reads and two disk seeks in the worst case. A 4kb read is also too small for documents that are larger than 4kb and the worst case again is extra disk reads. A partial workaround is to set read_ahead_kb to suggest to Linux that more data should be read. But one value for read_ahead_kb won't cover all use cases -- 8kb index pages, large documents and small documents. Note that this is also a hint. If you want high performance and high efficiency for IO-bound OLTP and you rely on read_ahead_kb then you aren't going to have a good time.

There isn't much information on the impact of read_ahead_kb on MongoDB and TokuMX. A simple way to understand the impact from it is to look at the disk read rate and disk sector read rate as measured by iostat. The number of disk sectors per disk read is also interesting. I report that data below from my tests. There is an open question about this on the MongoDB users email list and reference to a blog post.
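The two derived metrics used in the tables below can be computed from iostat output and the QPS reported by the client. This sketch assumes the iostat sector columns count 512-byte sectors, which is the case on Linux. The sample numbers in the demo are made up and are not from my tests.

```python
SECTOR_BYTES = 512  # iostat on Linux reports sectors in 512-byte units


def bytes_per_read(reads_per_sec, sectors_read_per_sec):
    """Average size of a disk read, from the iostat r/s and rsec/s columns."""
    return sectors_read_per_sec * SECTOR_BYTES / reads_per_sec


def reads_per_query(reads_per_sec, queries_per_sec):
    """Average number of disk reads needed to serve one query."""
    return reads_per_sec / queries_per_sec


# Hypothetical sample: 80,000 reads/s, 1,280,000 sectors/s and 100,000 QPS.
print(bytes_per_read(80000, 1280000), reads_per_query(80000, 100000))
```

A server that is efficient for IO keeps reads/query close to 1 or below, and keeps bytes/read close to the amount of data it actually needs.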

readPageSize

InnoDB uses 16kb pages by default and the page size can be changed via the innodb_page_size parameter when installing MySQL. A fast flash device can do more random IOPs for 4kb requests than for 16kb requests because data transfer can limit throughput. TokuMX has a similar option that can be set per table via readPageSize. The TokuMX default is 64kb which might be good for disk-based servers but I suspect is too large for flash. Even on disk-based servers there are benefits from using a smaller page as less space will be wasted in RAM when only a few documents/rows per page are useful. For TokuMX the readPageSize is the size of the page prior to compression.

Setup for 1.6B documents

I used the same test as described previously except I limited queries to the first half of each table/collection and there were 1.6B documents/rows that could be fetched in the 8 collections/tables. I tested the configurations listed below. This lists the total database size but only half of the database was accessed.

fb56.handler - 740G database, MySQL 5.6.12 with the Facebook patch, InnoDB, page_size=4k, 8k, 16k, data fetched via HANDLER
fb56.sql - 740G database, MySQL 5.6.12 with the Facebook patch, InnoDB, page_size=4k, 8k, 16k, data fetched via SELECT
orig57.handler - 740G database, official MySQL 5.7.4, InnoDB, page_size=4k, 8k, 16k, data fetched via HANDLER
orig57.sql - 740G database, official MySQL 5.7.4, InnoDB, page_size=4k, 8k, 16k, data fetched via SELECT
tokumx8 - 456G database, TokuMX 1.4.1, quicklz, readPageSize=8k
tokumx64 - 582G database, TokuMX 1.4.1, quicklz, readPageSize=64k
mongo249 - 834G database, MongoDB 2.4.9, powerOf2Sizes=0
mongo260 - 874G database, MongoDB 2.6.0, powerOf2Sizes=1

Results for 1.6B documents

The graph below shows the best QPS for each of the servers tested: 8kb page_size for MySQL, 8kb readahead for tokumx8 and tokumx64, 4kb readahead for MongoDB. I exclude the results for MySQL 5.7 because they are similar to 5.6. As in my previous blog post MySQL with InnoDB does better than the others but the difference is less significant for two reasons. First I repeated tests for many different values of read_ahead_kb and display the best one below. Second the test database is smaller. I assume this helps MongoDB and TokuMX by improving the cache hit rate for index data -- MongoDB because the index isn't clustered so we need all of the index in RAM to avoid doing an extra disk read per query, TokuMX because the non-leaf levels of the index are larger than for a b-tree. MongoDB 2.6.0 does worse than 2.4.9 because of a CPU performance regression for simple PK queries (JIRA 13663 and 13685). TokuMX with an 8kb readPageSize is much better than a 64kb readPageSize because there is less data to decompress per disk read, there is a better cache hit rate and fast storage does more IOPs for smaller read requests. Note that the database was also much smaller on disk with an 8kb readPageSize. It would be good to explain that.

This has QPS results for all of the configurations tested. The peak QPS for each of the configurations below was used in the graph above except for MySQL 5.7.

queries per second
    8     16      32      40  concurrent clients
44078  71929  108999  110640  fb56.handler, 16k page
36739  63622  100654  107970  fb56.sql, 16k page
44743  74183  117871  131160  fb56.handler, 8k page
37064  64602  102672  119555  fb56.sql, 8k page
43440  71328  113778  128366  fb56.handler, 4k page
35629  61437   97452  113916  fb56.sql, 4k page
44120  73919  119059  130071  orig57.handler, 8k page
36589  64339  102399  119260  orig57.sql, 8k page
42772  70368  113781  128088  orig57.handler, 4k page
35287  60928   96736  112950  orig57.sql, 4k page
24502  39332   62389   68967  tokumx8, 0k readahead
27762  45074   73102   79999  tokumx8, 4k readahead
29256  49093   81222   91508  tokumx8, 8k readahead
27835  45406   76164   82695  tokumx8, 16k readahead
 6347   9287   13512   14396  tokumx64, 0k readahead
 7221  12835   20233   21477  tokumx64, 4k readahead
 9263  16088   26595   27943  tokumx64, 8k readahead
10272  18602   22645   22015  tokumx64, 16k readahead
11191  20349   24090   24086  tokumx64, 32k readahead
10384  16492   17093   16600  tokumx64, 64k readahead
38154  62257   96033  103770  mongo249, 0k readahead
38274  62321   96131  106017  mongo249, 4k readahead
33088  51609   72699   76311  mongo249, 8k readahead
16533  22871   24019   25076  mongo249, 16k readahead
17572  23332   24319   24324  mongo249, 32k readahead
29179  49731   77114   84779  mongo260, 0k readahead
28979  49521   76569   86985  mongo260, 4k readahead
26321  42967   65662   71112  mongo260, 8k readahead
15338  23131   24448   25131  mongo260, 16k readahead
16277  23428   24443   24566  mongo260, 32k readahead
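Picking the best readahead per server from the table above is a simple max over the 40-client column. The values are copied from the table and the helper name is made up.

```python
# QPS at 40 concurrent clients, keyed by read_ahead_kb.
qps_at_40 = {
    "tokumx8": {0: 68967, 4: 79999, 8: 91508, 16: 82695},
    "tokumx64": {0: 14396, 4: 21477, 8: 27943, 16: 22015, 32: 24086, 64: 16600},
    "mongo249": {0: 103770, 4: 106017, 8: 76311, 16: 25076, 32: 24324},
    "mongo260": {0: 84779, 4: 86985, 8: 71112, 16: 25131, 32: 24566},
}


def best_readahead(server):
    """Return (read_ahead_kb, qps) for the setting with the highest QPS."""
    by_setting = qps_at_40[server]
    best = max(by_setting, key=by_setting.get)
    return best, by_setting[best]


for server in qps_at_40:
    print(server, best_readahead(server))
```

This selects 8kb readahead for both TokuMX configurations and 4kb readahead for both MongoDB versions.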

The next section displays the number of disk reads per query for some of the configurations to understand whether the server is efficient for IO. The result is from the 40 concurrent client test. Disk reads/query is much higher for TokuMX with a 64k readPageSize than for all other servers. The rate changes for MongoDB 2.4.9 and 2.6.0 between 8k and 16k readahead.

reads/query server
0.649 fb56.handler, 16kb page
0.672 fb56.handler, 8kb page
0.748 fb56.handler, 4kb page
-
1.363 tokumx8, 0k readahead
1.275 tokumx8, 4k readahead
0.927 tokumx8, 8k readahead
0.987 tokumx8, 16k readahead
-
7.161 tokumx64, 0k readahead
6.547 tokumx64, 4k readahead
3.833 tokumx64, 8k readahead
2.659 tokumx64, 16k readahead
2.432 tokumx64, 32k readahead
2.909 tokumx64, 64k readahead
-
0.806 mongo249, 0k readahead
0.807 mongo249, 4k readahead
0.824 mongo249, 8k readahead
1.117 mongo249, 16k readahead
1.147 mongo249, 32k readahead
-
0.830 mongo260, 0k readahead
0.832 mongo260, 4k readahead
0.847 mongo260, 8k readahead
1.115 mongo260, 16k readahead
1.145 mongo260, 32k readahead

The final table has the number of bytes read per disk read. This was measured from the 40 concurrent client test. MySQL used direct IO for InnoDB so storage reads the requested data and no more. The larger values for MongoDB are expected when readahead is set too large but this also demonstrates the difficulty of trying to be efficient when setting read_ahead_kb.

bytes/read server
16384 fb56.handler, 16kb page
8192 fb56.handler, 8kb page
4096 fb56.handler, 4kb page
-
4096 tokumx8, 0k readahead
5939 tokumx8, 4k readahead
9011 tokumx8, 8k readahead
10752 tokumx8, 16k readahead
-
4096 tokumx64, 0k readahead
4557 tokumx64, 4k readahead
9011 tokumx64, 8k readahead
23347 tokumx64, 16k readahead
23398 tokumx64, 32k readahead
27802 tokumx64, 64k readahead
-
4096 mongo249, 0k readahead
4198 mongo249, 4k readahead
7322 mongo249, 8k readahead
19098 mongo249, 16k readahead
18534 mongo249, 32k readahead
-
4096 mongo260, 0k readahead
4250 mongo260, 4k readahead
7424 mongo260, 8k readahead
19712 mongo260, 16k readahead
19149 mongo260, 32k readahead
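Multiplying the two tables gives the average number of bytes read from storage per query, which is one way to compare IO efficiency across servers. This derived metric is my addition here. The inputs are rows copied from the reads/query and bytes/read tables above.

```python
# (reads/query, bytes/read) at 40 concurrent clients, 1.6B document test.
rows = {
    "fb56.handler, 8kb page": (0.672, 8192),
    "tokumx8, 8k readahead": (0.927, 9011),
    "tokumx64, 8k readahead": (3.833, 9011),
    "mongo249, 4k readahead": (0.807, 4198),
    "mongo249, 16k readahead": (1.117, 19098),
}


def bytes_per_query(reads_per_query, bytes_per_read):
    """Average bytes read from storage to serve one query."""
    return reads_per_query * bytes_per_read


for name, (rpq, bpr) in rows.items():
    print(name, round(bytes_per_query(rpq, bpr)))
```

By this measure MongoDB 2.4.9 reads about 6X more data per query with 16k readahead than with 4k readahead.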

Setup for 3.2B documents

The same test was repeated except the clients were able to query all 3.2B documents/rows in the test collections/tables. I exclude results for TokuMX with 64k readPageSize.

Results for 3.2B documents

The graph has the best configuration for each server: 8kb page_size for MySQL, 16kb readahead for tokumx8, 4kb readahead for MongoDB. TokuMX matches MongoDB 2.4.9 here while it did worse than it in the 1.6B document test.

This has QPS results for all of the configurations tested. It should be possible for MySQL with InnoDB to get more QPS at 4kb pages. I don't know why that didn't happen and suspect that mutex contention was a problem. TokuMX with 8k readPageSize matched MongoDB 2.4.9 here, otherwise the results are similar to the 1.6B document/row test.

queries per second
    8     16      32      40  concurrent clients
38767  60366   86895   87517  fb56.handler, 16k page
33062  54847   84480   85764  fb56.sql, 16k page
39940  63628  102261  107312  fb56.handler, 8k page
33599  56819   91128  102378  fb56.sql, 8k page
39165  62409  101283  111496  fb56.handler, 4k page
32593  54810   87967  100644  fb56.sql, 4k page
38829  60419   86223   86655  orig57.handler, 16k page
33298  55034   84282   84734  orig57.sql, 16k page
39898  63673  102641  106424  orig57.handler, 8k page
33801  57097   91397  101718  orig57.sql, 8k page
39015  62159  101203  107398  orig57.handler, 4k page
32433  54067   84663   90872  orig57.sql, 4k page
21062  32292   49979   54177  tokumx8, 0k readahead
24379  39353   58571   61245  tokumx8, 4k readahead
27871  45992   73472   80748  tokumx8, 8k readahead
27396  45214   74040   81592  tokumx8, 16k readahead
29529  45976   69002   72772  mongo249, 0k readahead
29482  45868   71590   73942  mongo249, 4k readahead
23608  35676   48662   51503  mongo249, 8k readahead
18606  27554   31865   32637  mongo249, 16k readahead
12485  16190   16668   16662  mongo249, 32k readahead
24296  40154   61795   66992  mongo260, 0k readahead
24245  39781   61343   68450  mongo260, 4k readahead
20559  32111   46572   49358  mongo260, 8k readahead
17115  26220   32825   33184  mongo260, 16k readahead
12309  17050   17542   17595  mongo260, 32k readahead

The next section displays the number of disk reads per query for some of the configurations to understand whether the server is efficient for IO. The result is from the 40 concurrent client test. Note that TokuMX QPS gets much better as reads/query decreases with a larger readahead.

reads/query server
0.823 fb56.handler, 16kb page
0.842 fb56.handler, 8kb page
0.897 fb56.handler, 4kb page
-
1.794 tokumx8, 0k readahead
1.723 tokumx8, 4k readahead
1.074 tokumx8, 8k readahead
0.999 tokumx8, 16k readahead
-
1.213 mongo249, 0k readahead
1.213 mongo249, 4k readahead
1.270 mongo249, 8k readahead
1.232 mongo249, 16k readahead
1.478 mongo249, 32k readahead
-
1.225 mongo260, 0k readahead
1.226 mongo260, 4k readahead
1.285 mongo260, 8k readahead
1.225 mongo260, 16k readahead
1.458 mongo260, 32k readahead

The final table has the number of bytes read per disk read. This was measured from the 40 concurrent client test. MySQL used direct IO for InnoDB so storage reads the requested data and no more. The larger values for MongoDB are expected when readahead is set too large but this also demonstrates the difficulty of trying to be efficient when setting read_ahead_kb.

bytes/read server
16384 fb56.handler, 16kb page
8192 fb56.handler, 8kb page
4096 fb56.handler, 4kb page
-
4096 tokumx8, 0k readahead
5939 tokumx8, 4k readahead
9574 tokumx8, 8k readahead
11469 tokumx8, 16k readahead
-
4096 mongo249, 0k readahead
4250 mongo249, 4k readahead
7373 mongo249, 8k readahead
13107 mongo249, 16k readahead
21197 mongo249, 32k readahead
-
4096 mongo260, 0k readahead
4301 mongo260, 4k readahead
7475 mongo260, 8k readahead
13210 mongo260, 16k readahead
21658 mongo260, 32k readahead
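Combining the three tables for the 3.2B document test gives the storage read rate implied by each configuration at 40 clients. This is a back-of-the-envelope sketch and not something I measured directly: it multiplies QPS by reads/query to get reads per second, then by bytes/read to get bandwidth.

```python
# (QPS at 40 clients, reads/query, bytes/read) copied from the tables above.
configs = {
    "fb56.handler, 8k page": (107312, 0.842, 8192),
    "tokumx8, 16k readahead": (81592, 0.999, 11469),
    "mongo249, 4k readahead": (73942, 1.213, 4250),
}


def implied_iops(qps, reads_per_query):
    """Disk reads per second implied by QPS and reads/query."""
    return qps * reads_per_query


def implied_mb_per_sec(qps, reads_per_query, bytes_per_read):
    """Storage read bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return implied_iops(qps, reads_per_query) * bytes_per_read / 1e6


for name, (qps, rpq, bpr) in configs.items():
    print(name, round(implied_iops(qps, rpq)), round(implied_mb_per_sec(qps, rpq, bpr)))
```

All three land between roughly 80,000 and 90,000 reads per second, but TokuMX needs the most read bandwidth to get there and MongoDB with 4k readahead needs the least.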