Small Datum: The impact of read ahead and read size on MongoDB, TokuMX and MySQL

This continues my work on using a very simple workload (read-only, fetch 1 document by PK, database larger than RAM, fast storage) to understand how to get more QPS from TokuMX and MongoDB. I need to get more random read IOPs from storage to get more QPS from the DBMS. Beyond getting more throughput I ran tests to understand the impact of filesystem readahead on MongoDB and TokuMX and the impact of the read page size on TokuMX and InnoDB. My results are based on fast flash storage that can do more than 100,000 4kb reads/second. Be careful when trying to use these results for slower storage devices, especially disks.

In my previous post I wrote that I was able to get much more QPS from InnoDB but was likely to report improvements for TokuMX/MongoDB in the next post. This is the next post and some of the results I report here are better for a few reasons. First, some tests limited queries to the first half of each collection/table so the cache hit rate is much better especially for non-leaf index nodes. TokuMX and MongoDB are more sensitive to that and might require more cache/RAM for indexes than InnoDB. Second, I use better values for read_ahead_kb and readPageSize. My summary for this post is that:

Using a smaller value for readPageSize with TokuMX can help
Using the right value for read_ahead_kb can help. But I think that depending on filesystem readahead is a kludge that ends up wasting IO or hurting QPS
TokuMX and MongoDB require more cache/RAM for indexes than InnoDB even when TokuMX uses compression and InnoDB does not.

Goals

I describe the impact of read_ahead_kb for TokuMX and MongoDB, the impact of readPageSize for TokuMX and the impact of innodb_page_size for MySQL. My test server uses fast flash storage and what I report here might not be relevant for disk storage. From past experience there isn't much difference in peak IOPs from disks using 8kb versus 32kb reads as the device is bound by seek and rotational latency rather than data transfer. But flash devices usually do many more random reads per second at 8kb compared to 32kb.

I have a lot of experience with systems that use direct IO, less for systems that do reads & writes via buffered IO and much less for buffered IO + mmap. The epic rant about O_DIRECT is amusing because there wasn't a good alternative at the time for any DBMS that cared about IO performance. I suspect the DBMS community hasn't done enough to make things better (reach out to Linux developers, pay for improvements, volunteer time to run performance tests, donate expensive storage HW to people who do the systems work). But I think things are getting better and better.

mmap reads

MongoDB uses buffered IO and mmap. To get peak QPS for an IO-bound workload it must read the right amount of data per request -- not too much and not too little. With pread you tell the OS the right amount of data to read by specifying the read size. With mmap it is harder to communicate this. When a page fault occurs the OS knows that it must read the 4kb page that contains the faulting address from storage. The OS doesn't know whether it should also read adjacent 4kb pages by using a larger read request to avoid a seek per 4kb of data. Perhaps the workaround is to suggest the read size by calling madvise with the MADV_WILLNEED flag prior to referencing memory and hope the suggestion isn't ignored.
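A minimal sketch of that difference, in Python on Linux. With pread the read size is stated explicitly, while the mmap path can only hint with madvise before touching memory. The helper names, the 8kb size and the temporary file are assumptions made for illustration. This is not how MongoDB issues reads, and the kernel is free to ignore the hint.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # usually 4096 on Linux


def read_with_pread(fd, offset, size):
    """Explicit read size: the OS is told exactly how much to fetch."""
    return os.pread(fd, size, offset)


def read_with_mmap_hint(mm, offset, size):
    """mmap path: hint the byte range first, then touch the memory."""
    start = offset - (offset % PAGE)  # madvise wants a page-aligned start
    length = offset + size - start
    # MADV_WILLNEED is only a suggestion and is not defined on every platform.
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
        mm.madvise(mmap.MADV_WILLNEED, start, length)
    return mm[offset:offset + size]


def demo():
    """Read the same hypothetical 8kb 'index page' both ways and compare."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(os.urandom(64 * 1024))
        f.flush()
        fd = f.fileno()
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mm:
            via_pread = read_with_pread(fd, 16384, 8192)
            via_mmap = read_with_mmap_hint(mm, 16384, 8192)
    return via_pread == via_mmap and len(via_pread) == 8192
```

Both paths return the same bytes. The difference is in what the OS was told before the read: pread passes the size as an argument, the mmap path depends on a hint.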

Reading 4kb at a time isn't good for MongoDB index pages which are 8kb, so the OS might do two disk reads and two disk seeks in the worst case. A 4kb read is also too small for documents that are larger than 4kb and the worst case again is extra disk reads. A partial workaround is to set read_ahead_kb to suggest to Linux that more data should be read. But one value for read_ahead_kb won't cover all use cases -- 8kb index pages, large documents and small documents. Note that this is also a hint. If you want high performance and high efficiency for IO-bound OLTP and you rely on read_ahead_kb then you aren't going to have a good time.
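To make the tradeoff concrete, here is a deliberately simplified model: it assumes every device read is an aligned chunk of a fixed size and counts how many chunks an object touches and how many bytes are read but not needed. Real Linux readahead is adaptive and not this simple, and the function names are made up for the example.

```python
def device_reads(offset, size, read_size=4096):
    """Number of aligned read_size chunks touched by [offset, offset + size)."""
    first = offset // read_size
    last = (offset + size - 1) // read_size
    return last - first + 1


def wasted_bytes(offset, size, read_size=4096):
    """Bytes fetched from storage that are not part of the requested object."""
    return device_reads(offset, size, read_size) * read_size - size


if __name__ == "__main__":
    # 8kb index page, small document and large document (hypothetical sizes)
    for name, size in [("8kb index page", 8192), ("1kb doc", 1024), ("12kb doc", 12288)]:
        for read_size in (4096, 8192, 16384):
            print(name, read_size,
                  device_reads(0, size, read_size),
                  wasted_bytes(0, size, read_size))
```

In this model no single read size is best for all three objects: a small read size costs extra reads for the index page and the large document, while a large one wastes most of each read on the small document.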

There isn't much information on the impact of read_ahead_kb on MongoDB and TokuMX. A simple way to understand the impact from it is to look at the disk read rate and disk sector read rate as measured by iostat. The number of disk sectors per disk read is also interesting. I report that data below from my tests. There is an open question about this on the MongoDB users email list and reference to a blog post.
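The derived metrics are simple ratios. A small sketch, assuming the iostat reads/second and sectors-read/second columns have already been averaged over the test and that sectors are 512 bytes as Linux reports them. The function name and the input numbers are hypothetical and are not measurements from these tests.

```python
SECTOR_BYTES = 512  # Linux block layer statistics count 512-byte sectors


def io_efficiency(reads_per_sec, sectors_per_sec, qps):
    """Turn iostat rates and the DBMS query rate into per-query IO metrics."""
    return {
        "reads_per_query": reads_per_sec / qps,
        "sectors_per_read": sectors_per_sec / reads_per_sec,
        "bytes_per_read": sectors_per_sec * SECTOR_BYTES / reads_per_sec,
    }


if __name__ == "__main__":
    # hypothetical sample: 100,000 reads/s, 1,600,000 sectors/s, 80,000 QPS
    print(io_efficiency(100_000, 1_600_000, 80_000))
```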

readPageSize

InnoDB uses 16kb pages by default and the page size can be changed via the innodb_page_size parameter when installing MySQL. A fast flash device can do more random IOPs for 4kb requests than for 16kb requests because data transfer can limit throughput. TokuMX has a similar option that can be set per table via readPageSize. The TokuMX default is 64kb which might be good for disk-based servers but I suspect is too large for flash. Even on disk-based servers there are benefits from using a smaller page as less space will be wasted in RAM when only a few documents/rows per page are useful. For TokuMX the readPageSize is the size of the page prior to compression.
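A rough way to see why data transfer can limit throughput: model the device as the smaller of a request-rate cap and a bandwidth cap. The sketch below uses made-up device numbers (200,000 IOPs and 1.6 GB/second), not the specification of the storage used in these tests.

```python
def max_random_iops(read_bytes, iops_cap, bandwidth_bytes_per_sec):
    """Peak random reads/second when limited by request rate or by transfer."""
    return min(iops_cap, bandwidth_bytes_per_sec / read_bytes)


if __name__ == "__main__":
    IOPS_CAP = 200_000           # hypothetical request-rate limit
    BANDWIDTH = 1_600_000_000    # hypothetical transfer limit, bytes/second
    for kb in (4, 8, 16, 64):
        print(kb, "kb reads:", int(max_random_iops(kb * 1024, IOPS_CAP, BANDWIDTH)))
```

In this model small reads are bound by the request-rate cap and large reads by bandwidth, which is the effect that makes a 64kb page a poor fit for flash.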

Setup for 1.6B documents

I used the same test as described previously except I limited queries to the first half of each table/collection and there were 1.6B documents/rows that could be fetched in the 8 collections/tables. I tested the configurations listed below. This lists the total database size but only half of the database was accessed.

fb56.handler - 740G database, MySQL 5.6.12 with the Facebook patch, InnoDB, page_size=4k, 8k, 16k, data fetched via HANDLER
fb56.sql - 740G database, MySQL 5.6.12 with the Facebook patch, InnoDB, page_size=4k, 8k, 16k, data fetched via SELECT
orig57.handler - 740G database, official MySQL 5.7.4, InnoDB, page_size=4k, 8k, 16k, data fetched via HANDLER.
orig57.sql - 740G database, official MySQL 5.7.4, InnoDB, page_size=4k, 8k, 16k, data fetched via SELECT
tokumx8 - 456G database, TokuMX 1.4.1, quicklz, readPageSize=8k
tokumx64 - 582G database, TokuMX 1.4.1, quicklz, readPageSize=64K
mongo249 - 834G database, MongoDB 2.4.9, powerOf2Sizes=0
mongo260 - 874G database, MongoDB 2.6.0, powerOf2Sizes=1

Results for 1.6B documents

The graph below shows the best QPS for each of the servers tested: 8kb page_size for MySQL, 8kb readahead for tokumx8 and tokumx64, 4kb readahead for MongoDB. I exclude the results for MySQL 5.7 because they are similar to 5.6. As in my previous blog post MySQL with InnoDB does better than the others, but the difference is less significant for two reasons. First, I repeated tests for many different values of read_ahead_kb and display the best one below. Second, the test database is smaller. I assume this helps MongoDB and TokuMX by improving the cache hit rate for index data: MongoDB because the index isn't clustered, so all of the index must be in RAM to avoid doing an extra disk read per query, and TokuMX because the non-leaf levels of the index are larger than for a b-tree.

MongoDB 2.6.0 does worse than 2.4.9 because of a CPU performance regression for simple PK queries (JIRA 13663 and 13685). TokuMX with an 8kb readPageSize is much better than with a 64kb readPageSize because there is less data to decompress per disk read, the cache hit rate is better and fast storage does more IOPs for smaller read requests. Note that the database was also much smaller on disk with an 8kb readPageSize. It would be good to explain that.

This has QPS results for all of the configurations tested. The peak QPS for each of the configurations below was used in the graph above except for MySQL 5.7.

queries per second
8 16 32 40 concurrent clients
44078 71929 108999 110640 fb56.handler, 16k page
36739 63622 100654 107970 fb56.sql, 16k page
44743 74183 117871 131160 fb56.handler, 8k page
37064 64602 102672 119555 fb56.sql, 8k page
43440 71328 113778 128366 fb56.handler, 4k page
35629 61437 97452 113916 fb56.sql, 4k page
-
44120 73919 119059 130071 orig57.handler, 8k page
36589 64339 102399 119260 orig57.sql, 8k page
42772 70368 113781 128088 orig57.handler, 4k page
35287 60928 96736 112950 orig57.sql, 4k page
-
24502 39332 62389 68967 tokumx8, 0k readahead
27762 45074 73102 79999 tokumx8, 4k readahead
29256 49093 81222 91508 tokumx8, 8k readahead
27835 45406 76164 82695 tokumx8, 16k readahead
-
6347 9287 13512 14396 tokumx64, 0k readahead
7221 12835 20233 21477 tokumx64, 4k readahead
9263 16088 26595 27943 tokumx64, 8k readahead
10272 18602 22645 22015 tokumx64, 16k readahead
11191 20349 24090 24086 tokumx64, 32k readahead
10384 16492 17093 16600 tokumx64, 64k readahead
-
38154 62257 96033 103770 mongo249, 0k readahead
38274 62321 96131 106017 mongo249, 4k readahead
33088 51609 72699 76311 mongo249, 8k readahead
16533 22871 24019 25076 mongo249, 16k readahead
17572 23332 24319 24324 mongo249, 32k readahead
-
29179 49731 77114 84779 mongo260, 0k readahead
28979 49521 76569 86985 mongo260, 4k readahead
26321 42967 65662 71112 mongo260, 8k readahead
15338 23131 24448 25131 mongo260, 16k readahead
16277 23428 24443 24566 mongo260, 32k readahead

The next section displays the number of disk reads per query for some of the configurations to understand whether the server is efficient for IO. The result is from the 40 concurrent client test. Disk reads/query is much higher for TokuMX with a 64k readPageSize than for all other servers. For MongoDB 2.4.9 and 2.6.0 the rate jumps from about 0.82-0.85 to about 1.12 between 8k and 16k readahead.

reads/query server
0.649 fb56.handler, 16kb page
0.672 fb56.handler, 8kb page
0.748 fb56.handler, 4kb page
-
1.363 tokumx8, 0k readahead
1.275 tokumx8, 4k readahead
0.927 tokumx8, 8k readahead
0.987 tokumx8, 16k readahead
-
7.161 tokumx64, 0k readahead
6.547 tokumx64, 4k readahead
3.833 tokumx64, 8k readahead
2.659 tokumx64, 16k readahead
2.432 tokumx64, 32k readahead
2.909 tokumx64, 64k readahead
-
0.806 mongo249, 0k readahead
0.807 mongo249, 4k readahead
0.824 mongo249, 8k readahead
1.117 mongo249, 16k readahead
1.147 mongo249, 32k readahead
-
0.830 mongo260, 0k readahead
0.832 mongo260, 4k readahead
0.847 mongo260, 8k readahead
1.115 mongo260, 16k readahead
1.145 mongo260, 32k readahead

The final table has the number of bytes read per disk read. This was measured from the 40 concurrent client test. MySQL used direct IO for InnoDB so storage reads the requested data and no more. The larger values for MongoDB are expected when readahead is set too large but this also demonstrates the difficulty of trying to be efficient when setting read_ahead_kb.

bytes/read server
16384 fb56.handler, 16kb page
8192 fb56.handler, 8kb page
4096 fb56.handler, 4kb page
-
4096 tokumx8, 0k readahead
5939 tokumx8, 4k readahead
9011 tokumx8, 8k readahead
10752 tokumx8, 16k readahead
-
4096 tokumx64, 0k readahead
4557 tokumx64, 4k readahead
9011 tokumx64, 8k readahead
23347 tokumx64, 16k readahead
23398 tokumx64, 32k readahead
27802 tokumx64, 64k readahead
-
4096 mongo249, 0k readahead
4198 mongo249, 4k readahead
7322 mongo249, 8k readahead
19098 mongo249, 16k readahead
18534 mongo249, 32k readahead
-
4096 mongo260, 0k readahead
4250 mongo260, 4k readahead
7424 mongo260, 8k readahead
19712 mongo260, 16k readahead
19149 mongo260, 32k readahead
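Multiplying the two tables together gives bytes read from storage per query, which makes the cost of too much readahead easier to see. A small sketch using a few rows copied from the 40 concurrent client results above. The function names are mine and the selection of rows is arbitrary.

```python
# (QPS at 40 clients, disk reads/query, bytes/read) copied from the tables above
RESULTS = {
    "fb56.handler, 8kb page": (131160, 0.672, 8192),
    "tokumx8, 8k readahead": (91508, 0.927, 9011),
    "mongo249, 4k readahead": (106017, 0.807, 4198),
    "mongo249, 16k readahead": (25076, 1.117, 19098),
}


def bytes_per_query(reads_per_query, bytes_per_read):
    """Average bytes read from storage to answer one query."""
    return reads_per_query * bytes_per_read


def read_mb_per_sec(qps, reads_per_query, bytes_per_read):
    """Storage read rate in MB/second implied by the three measurements."""
    return qps * bytes_per_query(reads_per_query, bytes_per_read) / 1e6


if __name__ == "__main__":
    for name, (qps, rpq, bpr) in RESULTS.items():
        print(f"{name}: {bytes_per_query(rpq, bpr):.0f} bytes/query, "
              f"{read_mb_per_sec(qps, rpq, bpr):.0f} MB/s")
```

For MongoDB 2.4.9, going from 4k to 16k readahead raises bytes read per query by more than 6x while QPS drops to less than a quarter, so the device moves more data to answer far fewer queries.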

Setup for 3.2B documents

The same test was repeated except the clients were able to query all 3.2B documents/rows in the test collections/tables. I exclude results for TokuMX with 64k readPageSize.

Results for 3.2B documents

The graph has the best configuration for each server: 8kb page_size for MySQL, 16kb readahead for tokumx8, 4kb readahead for MongoDB. TokuMX matches MongoDB 2.4.9 here while it did worse than it in the 1.6B document test.

This has QPS results for all of the configurations tested. It should be possible for MySQL with InnoDB to get more QPS at 4kb pages. I don't know why that didn't happen and suspect that mutex contention was a problem. TokuMX with 8k readPageSize matched MongoDB 2.4.9 here, otherwise the results are similar to the 1.6B document/row test.

queries per second
8 16 32 40 concurrent clients
38767 60366 86895 87517 fb56.handler, 16k page
33062 54847 84480 85764 fb56.sql, 16k page
39940 63628 102261 107312 fb56.handler, 8k page
33599 56819 91128 102378 fb56.sql, 8k page
39165 62409 101283 111496 fb56.handler, 4k page
32593 54810 87967 100644 fb56.sql, 4k page
-
38829 60419 86223 86655 orig57.handler, 16k page
33298 55034 84282 84734 orig57.sql, 16k page
39898 63673 102641 106424 orig57.handler, 8k page
33801 57097 91397 101718 orig57.sql, 8k page
39015 62159 101203 107398 orig57.handler, 4k page
32433 54067 84663 90872 orig57.sql, 4k page
-
21062 32292 49979 54177 tokumx8, 0k readahead
24379 39353 58571 61245 tokumx8, 4k readahead
27871 45992 73472 80748 tokumx8, 8k readahead
27396 45214 74040 81592 tokumx8, 16k readahead
-
29529 45976 69002 72772 mongo249, 0k readahead
29482 45868 71590 73942 mongo249, 4k readahead
23608 35676 48662 51503 mongo249, 8k readahead
18606 27554 31865 32637 mongo249, 16k readahead
12485 16190 16668 16662 mongo249, 32k readahead
-
24296 40154 61795 66992 mongo260, 0k readahead
24245 39781 61343 68450 mongo260, 4k readahead
20559 32111 46572 49358 mongo260, 8k readahead
17115 26220 32825 33184 mongo260, 16k readahead
12309 17050 17542 17595 mongo260, 32k readahead

The next section displays the number of disk reads per query for some of the configurations to understand whether the server is efficient for IO. The result is from the 40 concurrent client test. Note that TokuMX QPS gets much better as a larger readahead reduces the number of disk reads per query.

reads/query server
0.823 fb56.handler, 16kb page
0.842 fb56.handler, 8kb page
0.897 fb56.handler, 4kb page
-
1.794 tokumx8, 0k readahead
1.723 tokumx8, 4k readahead
1.074 tokumx8, 8k readahead
0.999 tokumx8, 16k readahead
-
1.213 mongo249, 0k readahead
1.213 mongo249, 4k readahead
1.270 mongo249, 8k readahead
1.232 mongo249, 16k readahead
1.478 mongo249, 32k readahead
-
1.225 mongo260, 0k readahead
1.226 mongo260, 4k readahead
1.285 mongo260, 8k readahead
1.225 mongo260, 16k readahead
1.458 mongo260, 32k readahead

The final table has the number of bytes read per disk read. This was measured from the 40 concurrent client test. MySQL used direct IO for InnoDB so storage reads the requested data and no more. The larger values for MongoDB are expected when readahead is set too large but this also demonstrates the difficulty of trying to be efficient when setting read_ahead_kb.

bytes/read server
16384 fb56.handler, 16kb page
8192 fb56.handler, 8kb page
4096 fb56.handler, 4kb page
-
4096 tokumx8, 0k readahead
5939 tokumx8, 4k readahead
9574 tokumx8, 8k readahead
11469 tokumx8, 16k readahead
-
4096 mongo249, 0k readahead
4250 mongo249, 4k readahead
7373 mongo249, 8k readahead
13107 mongo249, 16k readahead
21197 mongo249, 32k readahead
-
4096 mongo260, 0k readahead
4301 mongo260, 4k readahead
7475 mongo260, 8k readahead
13210 mongo260, 16k readahead
21658 mongo260, 32k readahead

2 comments:

Raghavendra Prabhu, May 4, 2014 at 12:36 AM
a) Interesting blog post. I came here mainly for read_ahead_kb. I had looked at it
a while back and worked on it (and pushed many to mainline kernel as well, though
couldn't follow up).

While it is an interesting tunable, high global values of it can be bad. The
reason is that, even though kernel is wise enough to (or at least it was till
a while back, see max_sane_readahead in mm/readahead.c), readahead is still
considered under high memory pressure.

So, mainly, I would look if higher RA window triggered any sort of reclaim
(through perf top or so).

b)
That is, even though readahead is done under __GFP_COLD | __GFP_NORETRY |
__GFP_NOWARN (cold, don't retry and don't warn), it still can make page
allocation fall back to slow path! By this I mean, suppose your application does
a read of a particular segment of file and there isn't much memory available,
the readahead will still try to read pages *even* if it means triggering zone
reclaim (check __alloc_pages_slowpath in mm/page_alloc.c for more).

I wanted to fix the above and added a new GFP for it - GFP_READAHEAD which meant
to avoid any reclaim under some conditions. I hope this makes sense - we may not
want readahead to trigger side effects involving further dirty write I/O.

(For b, I need to brush up more with latest kernel readahead changes, mine is latest
till about 3.6).

c)
Anyhow, one more warning about ra_pages (readahead window) is that it ramps up
dynamically based on the detected pattern. It can go up to 512 4k pages. The
whole RA window resizing is quite interesting and is in mm/readahead.c -
page_cache_sync_readahead and page_cache_async_readahead. (Note - the 512 4k
seems to be added lately, earlier it was based on NR_INACTIVE_FILE and
NR_FREE_PAGES).

d)
Regarding mmap, mmap has some additional readahead logic built on top of this:

    /* If we don't want any read-ahead, don't bother */
    if (vma->vm_flags & VM_RAND_READ)
        return;
    if (!ra->ra_pages)
        return;

    if (vma->vm_flags & VM_SEQ_READ) {
        page_cache_sync_readahead(mapping, ra, file, offset,
                                  ra->ra_pages);
        return;
    }

    /* Avoid banging the cache line if not needed */
    if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
        ra->mmap_miss++;

    /*
     * Do we miss much more than hit in this file? If so,
     * stop bothering with read-ahead. It will only hurt.
     */
    if (ra->mmap_miss > MMAP_LOTSAMISS)
        return;

This has remained unchanged for a while. But, the key is MMAP_LOTSAMISS - ie.
when it detects readahead is not helping, it doesn't do it unless VM_SEQ_READ is
set (which is from a madvise).

The above readahead is triggered directly in page fault path - filemap_fault.

e) Regarding the numbers, I think tokumx may be using fadvise/madvise quite
deftly. That is the key when doing buffered I/O (something which even pgsql has
issues with, from what I could gather at collab summit). Some of madvise
actually double the RA window.

f) Regarding filesystems and read_ahead_kb, strictly speaking, filesystems
shouldn't directly do this and VFS should take care of it (filesystems defer
this with ->read_pages and so on). BUT, I have seen a
few filesystems make direct calls to page_cache_{a,}sync_readahead: btrfs and ext*
(former seems to be doing it a lot and latter only in readdir()).

Anyways, I have a few readahead branches here
http://git.wnohang.net/cgit.cgi/bldit.git/refs/heads readahead branches -
especially http://git.wnohang.net/cgit.cgi/bldit.git/log/?h=new-readahead

Mark Callaghan, May 6, 2014 at 10:27 AM
Thanks for the details. My concern is about trying to use read_ahead_kb to manage a small amount of readahead needed for OLTP -- no full scans but sometimes clients need more than 4kb at a time. I think that depending on readahead in that case is not going to make things efficient & performant.


Friday, May 2, 2014
