Friday, May 2, 2014

The impact of read ahead and read size on MongoDB, TokuMX and MySQL

This continues my work on using a very simple workload (read-only, fetch 1 document by PK, database larger than RAM, fast storage) to understand how to get more QPS from TokuMX and MongoDB. I need to get more random read IOPs from storage to get more QPS from the DBMS. Beyond getting more throughput I ran tests to understand the impact of filesystem readahead on MongoDB and TokuMX and the impact of the read page size on TokuMX and InnoDB. My results are based on fast flash storage that can do more than 100,000 4kb reads/second. Be careful when trying to use these results for slower storage devices, especially disks.

In my previous post I wrote that I was able to get much more QPS from InnoDB but was likely to report improvements for TokuMX/MongoDB in the next post. This is the next post and some of the results I report here are better for a few reasons. First, some tests limited queries to the first half of each collection/table so the cache hit rate is much better especially for non-leaf index nodes. TokuMX and MongoDB are more sensitive to that and might require more cache/RAM for indexes than InnoDB. Second, I use better values for read_ahead_kb and readPageSize. My summary for this post is that:
  • Using a smaller value for readPageSize with TokuMX can help.
  • Using the right value for read_ahead_kb can help. But I think that depending on filesystem readahead is a kludge that ends up wasting IO or hurting QPS.
  • TokuMX and MongoDB require more cache/RAM for indexes than InnoDB even when TokuMX uses compression and InnoDB does not.

Goals

I describe the impact of read_ahead_kb for TokuMX and MongoDB, the impact of readPageSize for TokuMX and the impact of innodb_page_size for MySQL. My test server uses fast flash storage and what I report here might not be relevant for disk storage. From past experience there isn't much difference in peak IOPs from disks using 8kb versus 32kb reads as the device is bound by seek and rotational latency rather than data transfer. But flash devices usually do many more random reads per second at 8kb compared to 32kb.

I have a lot of experience with systems that use direct IO, less with systems that do reads & writes via buffered IO and much less with buffered IO + mmap. The epic rant about O_DIRECT is amusing because there wasn't a good alternative at the time for any DBMS that cared about IO performance. I suspect the DBMS community hasn't done enough to make things better (reach out to Linux developers, pay for improvements, volunteer time to run performance tests, donate expensive storage HW to people who do the systems work). But I think things are getting better and better.

mmap reads

MongoDB uses buffered IO and mmap. To get peak QPS for an IO-bound workload it must read the right amount of data per request -- not too much and not too little. With pread you tell the OS the right amount of data to read by specifying the read size. With mmap it is harder to communicate this. When a page fault occurs the OS knows that it must read the 4kb page that contains the faulting address from storage. The OS doesn't know whether it should read adjacent 4kb pages by using a larger read request to avoid a seek per 4kb of data. Perhaps the workaround is to suggest the read size by calling madvise with the MADV_WILLNEED flag prior to referencing memory and hope the suggestion isn't ignored.
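The difference between the two interfaces can be sketched in a few lines of Python. This is an illustration only, not how MongoDB or any other server does its reads. The function names, the temporary file and the 8kb request are assumptions made for the example, and mmap.madvise needs Python 3.8 or later on a platform that defines MADV_WILLNEED.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # typically 4096 on Linux/x86-64


def read_with_pread(fd: int, offset: int, length: int) -> bytes:
    """Explicit read size: the kernel is asked for exactly `length` bytes."""
    return os.pread(fd, length, offset)


def read_with_mmap(mm: mmap.mmap, offset: int, length: int) -> bytes:
    """Hint the range with MADV_WILLNEED, then touch it through the mapping."""
    # madvise needs a page-aligned start, so round the range outward.
    start = (offset // PAGE) * PAGE
    end = min(len(mm), -(-(offset + length) // PAGE) * PAGE)
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
        # Advisory only: the kernel is free to ignore the hint.
        mm.madvise(mmap.MADV_WILLNEED, start, end - start)
    return mm[offset:offset + length]


def demo() -> bool:
    """Read one hypothetical 8kb range both ways and compare the bytes."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(os.urandom(16 * PAGE))
        f.flush()
        fd = f.fileno()
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mm:
            offset, length = 3 * PAGE + 100, 8192
            a = read_with_pread(fd, offset, length)
            b = read_with_mmap(mm, offset, length)
            return a == b
```

Both paths return the same bytes. The difference is that pread states the request size in the call while the mmap path can only suggest it.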

Reading 4kb at a time isn't good for MongoDB index pages which are 8kb, so the OS might do two disk reads and two disk seeks in the worst case. A 4kb read is also too small for documents that are larger than 4kb and the worst case again is extra disk reads. A partial workaround is to set read_ahead_kb to suggest to Linux that more data should be read. But one value for read_ahead_kb won't cover all use cases -- 8kb index pages, large documents and small documents. Note that this is also a hint. If you want high performance and high efficiency for IO-bound OLTP and you rely on read_ahead_kb then you aren't going to have a good time.
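For reference, read_ahead_kb is a per-device setting that Linux exposes under /sys/block/<device>/queue/read_ahead_kb. A minimal sketch for reading and changing it follows. The helper names and the sysfs_root parameter are my own, added so the code can be pointed at a scratch directory, and writing the real file requires root.

```python
from pathlib import Path


def read_ahead_path(device: str, sysfs_root: str = "/sys") -> Path:
    """Path of the readahead setting for one block device, e.g. 'sda'."""
    return Path(sysfs_root) / "block" / device / "queue" / "read_ahead_kb"


def get_read_ahead_kb(device: str, sysfs_root: str = "/sys") -> int:
    """Return the current readahead value in kb."""
    return int(read_ahead_path(device, sysfs_root).read_text().strip())


def set_read_ahead_kb(device: str, kb: int, sysfs_root: str = "/sys") -> None:
    """Set the readahead value in kb; 0 disables readahead."""
    if kb < 0:
        raise ValueError("read_ahead_kb must be >= 0")
    read_ahead_path(device, sysfs_root).write_text(f"{kb}\n")
```

The setting applies to the whole device, which is why one value can't fit 8kb index pages, small documents and large documents at the same time.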

There isn't much information on the impact of read_ahead_kb on MongoDB and TokuMX. A simple way to understand the impact from it is to look at the disk read rate and disk sector read rate as measured by iostat. The number of disk sectors per disk read is also interesting. I report that data below from my tests. There is an open question about this on the MongoDB users email list and reference to a blog post.
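The same ratios can be computed from the raw counters that iostat uses. The sketch below assumes two snapshots of /proc/diskstats text taken before and after a test, plus a query count from the benchmark client. The function names and the device name in the usage note are hypothetical. On Linux the fourth and sixth fields of each diskstats line are reads completed and sectors read, and sectors are always 512 bytes there.

```python
from typing import Dict, Tuple

SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors


def read_counters(diskstats_text: str, device: str) -> Tuple[int, int]:
    """Return (reads completed, sectors read) for `device`."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major minor name reads_completed reads_merged sectors_read ...
        if len(fields) >= 6 and fields[2] == device:
            return int(fields[3]), int(fields[5])
    raise KeyError(device)


def io_efficiency(before: str, after: str, device: str,
                  queries: int) -> Dict[str, float]:
    """Derive per-query and per-read ratios from two counter snapshots."""
    reads0, sectors0 = read_counters(before, device)
    reads1, sectors1 = read_counters(after, device)
    reads = reads1 - reads0
    sectors = sectors1 - sectors0
    return {
        "reads_per_query": reads / queries,
        "sectors_per_read": sectors / reads,
        "bytes_per_read": sectors * SECTOR_BYTES / reads,
    }
```

A call would look like io_efficiency(snapshot_before, snapshot_after, "sda", queries_done), where the snapshots are the file contents read at the start and end of the run.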

readPageSize

InnoDB uses 16kb pages by default and the page size can be changed via the innodb_page_size parameter when installing MySQL. A fast flash device can do more random IOPs for 4kb requests than for 16kb requests because data transfer can limit throughput. TokuMX has a similar option that can be set per table via readPageSize. The TokuMX default is 64kb which might be good for disk-based servers but I suspect is too large for flash. Even on disk-based servers there are benefits from using a smaller page as less space will be wasted in RAM when only a few documents/rows per page are useful. For TokuMX the readPageSize is the size of the page prior to compression.

Setup for 1.6B documents

I used the same test as described previously except I limited queries to the first half of each table/collection and there were 1.6B documents/rows that could be fetched in the 8 collections/tables. I tested the configurations listed below. This lists the total database size but only half of the database was accessed.
  • fb56.handler - 740G database, MySQL 5.6.12 with the Facebook patch, InnoDB, page_size=4k, 8k, 16k, data fetched via HANDLER
  • fb56.sql - 740G database, MySQL 5.6.12 with the Facebook patch, InnoDB, page_size=4k, 8k, 16k, data fetched via SELECT
  • orig57.handler - 740G database, official MySQL 5.7.4, InnoDB, page_size=4k, 8k, 16k, data fetched via HANDLER
  • orig57.sql - 740G database, official MySQL 5.7.4, InnoDB, page_size=4k, 8k, 16k, data fetched via SELECT
  • tokumx8 - 456G database, TokuMX 1.4.1, quicklz, readPageSize=8k
  • tokumx64 - 582G database, TokuMX 1.4.1, quicklz, readPageSize=64K
  • mongo249 - 834G database, MongoDB 2.4.9, powerOf2Sizes=0
  • mongo260 - 874G database, MongoDB 2.6.0, powerOf2Sizes=1

Results for 1.6B documents

The graph below shows the best QPS for each of the servers tested: 8kb page_size for MySQL, 8kb readahead for tokumx8 and tokumx64, 4kb readahead for MongoDB. I exclude the results for MySQL 5.7 because they are similar to 5.6. As in my previous blog post MySQL with InnoDB does better than the others but the difference is less significant for two reasons. First, I repeated tests for many different values of read_ahead_kb and display the best one below. Second, the test database is smaller. I assume this helps MongoDB and TokuMX by improving the cache hit rate for index data -- MongoDB because the index isn't clustered so all of the index must be in RAM to avoid doing an extra disk read per query, TokuMX because the non-leaf levels of the index are larger than for a b-tree.

MongoDB 2.6.0 does worse than 2.4.9 because of a CPU performance regression for simple PK queries (JIRA 13663 and 13685). TokuMX with an 8kb readPageSize is much better than with a 64kb readPageSize because there is less data to decompress per disk read, there is a better cache hit rate and fast storage does more IOPs for smaller read requests. Note that the database was also much smaller on disk with an 8kb readPageSize. It would be good to explain that.

The table below has QPS results for all of the configurations tested. The peak QPS for each of the configurations below was used in the graph above except for MySQL 5.7.

queries per second
    8    16     32     40  concurrent clients
44078 71929 108999 110640  fb56.handler, 16k page
36739 63622 100654 107970  fb56.sql, 16k page
44743 74183 117871 131160  fb56.handler, 8k page
37064 64602 102672 119555  fb56.sql, 8k page
43440 71328 113778 128366  fb56.handler, 4k page
35629 61437  97452 113916  fb56.sql, 4k page
-
44120 73919 119059 130071  orig57.handler, 8k page
36589 64339 102399 119260  orig57.sql, 8k page
42772 70368 113781 128088  orig57.handler, 4k page
35287 60928  96736 112950  orig57.sql, 4k page
-
24502 39332  62389  68967  tokumx8, 0k readahead
27762 45074  73102  79999  tokumx8, 4k readahead
29256 49093  81222  91508  tokumx8, 8k readahead
27835 45406  76164  82695  tokumx8, 16k readahead
-
 6347  9287  13512  14396  tokumx64, 0k readahead
 7221 12835  20233  21477  tokumx64, 4k readahead
 9263 16088  26595  27943  tokumx64, 8k readahead
10272 18602  22645  22015  tokumx64, 16k readahead
11191 20349  24090  24086  tokumx64, 32k readahead
10384 16492  17093  16600  tokumx64, 64k readahead
-
38154 62257  96033 103770  mongo249, 0k readahead
38274 62321  96131 106017  mongo249, 4k readahead
33088 51609  72699  76311  mongo249, 8k readahead
16533 22871  24019  25076  mongo249, 16k readahead
17572 23332  24319  24324  mongo249, 32k readahead
-
29179 49731  77114  84779  mongo260, 0k readahead
28979 49521  76569  86985  mongo260, 4k readahead
26321 42967  65662  71112  mongo260, 8k readahead
15338 23131  24448  25131  mongo260, 16k readahead
16277 23428  24443  24566  mongo260, 32k readahead

The next table displays the number of disk reads per query for some of the configurations to understand whether the server is efficient for IO. The result is from the 40 concurrent client test. Disk reads/query is much higher for TokuMX with a 64k readPageSize than for all other servers. For MongoDB the rate jumps between 8k and 16k readahead, from 0.824 to 1.117 for 2.4.9 and from 0.847 to 1.115 for 2.6.0.

reads/query  server
0.649        fb56.handler, 16kb page
0.672        fb56.handler, 8kb page
0.748        fb56.handler, 4kb page
-
1.363        tokumx8, 0k readahead
1.275        tokumx8, 4k readahead
0.927        tokumx8, 8k readahead
0.987        tokumx8, 16k readahead
-
7.161        tokumx64, 0k readahead
6.547        tokumx64, 4k readahead
3.833        tokumx64, 8k readahead
2.659        tokumx64, 16k readahead
2.432        tokumx64, 32k readahead
2.909        tokumx64, 64k readahead
-
0.806        mongo249, 0k readahead
0.807        mongo249, 4k readahead
0.824        mongo249, 8k readahead
1.117        mongo249, 16k readahead
1.147        mongo249, 32k readahead
-
0.830        mongo260, 0k readahead
0.832        mongo260, 4k readahead
0.847        mongo260, 8k readahead
1.115        mongo260, 16k readahead
1.145        mongo260, 32k readahead

The final table has the number of bytes read per disk read. This was measured from the 40 concurrent client test. MySQL used direct IO for InnoDB so storage reads the requested data and no more. The larger values for MongoDB are expected when readahead is set too large but this also demonstrates the difficulty of trying to be efficient when setting read_ahead_kb.

bytes/read   server
16384        fb56.handler, 16kb page
 8192        fb56.handler, 8kb page
 4096        fb56.handler, 4kb page
-
 4096        tokumx8, 0k readahead
 5939        tokumx8, 4k readahead
 9011        tokumx8, 8k readahead
10752        tokumx8, 16k readahead
-
 4096        tokumx64, 0k readahead
 4557        tokumx64, 4k readahead
 9011        tokumx64, 8k readahead
23347        tokumx64, 16k readahead
23398        tokumx64, 32k readahead
27802        tokumx64, 64k readahead
-
 4096        mongo249, 0k readahead
 4198        mongo249, 4k readahead
 7322        mongo249, 8k readahead
19098        mongo249, 16k readahead
18534        mongo249, 32k readahead
-
 4096        mongo260, 0k readahead
 4250        mongo260, 4k readahead
 7424        mongo260, 8k readahead
19712        mongo260, 16k readahead
19149        mongo260, 32k readahead
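The three tables can be combined to estimate how much data is read from storage per query and how many reads per second the device serves. The sketch below copies a few rows from the 40 concurrent client results above. The derived numbers are my arithmetic on those rows, not separate measurements, and the names are made up for the example.

```python
from typing import Dict, Tuple

# (QPS, disk reads/query, bytes/disk read), copied from the
# 40 concurrent client columns of the three tables above.
RESULTS: Dict[str, Tuple[float, float, float]] = {
    "fb56.handler, 8k page":    (131160, 0.672, 8192),
    "tokumx8, 8k readahead":    (91508, 0.927, 9011),
    "tokumx64, 8k readahead":   (27943, 3.833, 9011),
    "mongo249, 4k readahead":   (106017, 0.807, 4198),
    "mongo249, 16k readahead":  (25076, 1.117, 19098),
}


def bytes_per_query(reads_per_query: float, bytes_per_read: float) -> float:
    """Data read from storage for an average query."""
    return reads_per_query * bytes_per_read


def disk_reads_per_second(qps: float, reads_per_query: float) -> float:
    """Read IOPs the storage device must sustain at this QPS."""
    return qps * reads_per_query


def summarize() -> Dict[str, Tuple[float, float]]:
    """Map each configuration to (bytes/query, disk reads/second)."""
    return {
        name: (bytes_per_query(rpq, bpr), disk_reads_per_second(qps, rpq))
        for name, (qps, rpq, bpr) in RESULTS.items()
    }


if __name__ == "__main__":
    for name, (bpq, iops) in summarize().items():
        print(f"{bpq:10.0f} bytes/query {iops:10.0f} reads/s  {name}")
```

Comparing the two mongo249 rows shows the cost of a readahead that is too large: more bytes are read per query while the device serves fewer reads per second.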

Setup for 3.2B documents

The same test was repeated except the clients were able to query all 3.2B documents/rows in the test collections/tables. I exclude results for TokuMX with 64k readPageSize.

Results for 3.2B documents

The graph has the best configuration for each server: 8kb page_size for MySQL, 16kb readahead for tokumx8, 4kb readahead for MongoDB. TokuMX with 8k readPageSize matches MongoDB 2.4.9 here while it did worse than it in the 1.6B document test.

The table below has QPS results for all of the configurations tested. It should be possible for MySQL with InnoDB to get more QPS at 4kb pages. I don't know why that didn't happen and suspect that mutex contention was a problem. Otherwise the results are similar to the 1.6B document/row test.

queries per second
    8    16     32     40  concurrent clients
38767 60366  86895  87517  fb56.handler, 16k page
33062 54847  84480  85764  fb56.sql, 16k page
39940 63628 102261 107312  fb56.handler, 8k page
33599 56819  91128 102378  fb56.sql, 8k page
39165 62409 101283 111496  fb56.handler, 4k page
32593 54810  87967 100644  fb56.sql, 4k page
-
38829 60419  86223  86655  orig57.handler, 16k page
33298 55034  84282  84734  orig57.sql, 16k page
39898 63673 102641 106424  orig57.handler, 8k page
33801 57097  91397 101718  orig57.sql, 8k page
39015 62159 101203 107398  orig57.handler, 4k page
32433 54067  84663  90872  orig57.sql, 4k page
-
21062 32292  49979  54177  tokumx8, 0k readahead
24379 39353  58571  61245  tokumx8, 4k readahead
27871 45992  73472  80748  tokumx8, 8k readahead
27396 45214  74040  81592  tokumx8, 16k readahead
-
29529 45976  69002  72772  mongo249, 0k readahead
29482 45868  71590  73942  mongo249, 4k readahead
23608 35676  48662  51503  mongo249, 8k readahead
18606 27554  31865  32637  mongo249, 16k readahead
12485 16190  16668  16662  mongo249, 32k readahead
-
24296 40154  61795  66992  mongo260, 0k readahead
24245 39781  61343  68450  mongo260, 4k readahead
20559 32111  46572  49358  mongo260, 8k readahead
17115 26220  32825  33184  mongo260, 16k readahead
12309 17050  17542  17595  mongo260, 32k readahead

The next table displays the number of disk reads per query for some of the configurations to understand whether the server is efficient for IO. The result is from the 40 concurrent client test. Note that for TokuMX the rate decreases and QPS gets much better when a larger readahead is used.

reads/query  server
0.823        fb56.handler, 16kb page
0.842        fb56.handler, 8kb page
0.897        fb56.handler, 4kb page
-
1.794        tokumx8, 0k readahead
1.723        tokumx8, 4k readahead
1.074        tokumx8, 8k readahead
0.999        tokumx8, 16k readahead
-
1.213        mongo249, 0k readahead
1.213        mongo249, 4k readahead
1.270        mongo249, 8k readahead
1.232        mongo249, 16k readahead
1.478        mongo249, 32k readahead
-
1.225        mongo260, 0k readahead
1.226        mongo260, 4k readahead
1.285        mongo260, 8k readahead
1.225        mongo260, 16k readahead
1.458        mongo260, 32k readahead

The final table has the number of bytes read per disk read. This was measured from the 40 concurrent client test. MySQL used direct IO for InnoDB so storage reads the requested data and no more. The larger values for MongoDB are expected when readahead is set too large but this also demonstrates the difficulty of trying to be efficient when setting read_ahead_kb.

bytes/read   server
16384        fb56.handler, 16kb page
 8192        fb56.handler, 8kb page
 4096        fb56.handler, 4kb page
-
 4096        tokumx8, 0k readahead
 5939        tokumx8, 4k readahead
 9574        tokumx8, 8k readahead
11469        tokumx8, 16k readahead
-
 4096        mongo249, 0k readahead
 4250        mongo249, 4k readahead
 7373        mongo249, 8k readahead
13107        mongo249, 16k readahead
21197        mongo249, 32k readahead
-
 4096        mongo260, 0k readahead
 4301        mongo260, 4k readahead
 7475        mongo260, 8k readahead
13210        mongo260, 16k readahead
21658        mongo260, 32k readahead

2 comments:

  1. a) Interesting blog post. I came here mainly for read_ahead_kb. I had looked at it
    a while back and worked on it (and pushed many to mainline kernel as well, though
    couldn't follow up).

    While it is an interesting tunable, high global values of it can be bad. The
    reason is that, even though kernel is wise enough to (or at least it was till
    a while back, see max_sane_readahead in mm/readahead.c), readahead is still
    considered under high memory pressure.

    So, mainly, I would look if higher RA window triggered any sort of reclaim
    (through perf top or so).

    b)
    That is, even though readahead is done under __GFP_COLD | __GFP_NORETRY |
    __GFP_NOWARN (cold, don't retry and don't warn), it still can make page
    allocation fall back to slow path! By this I mean, suppose your application does
    a read of a particular segment of file and there isn't much memory available,
    the readahead will still try to read pages *even* if it means triggering zone
    reclaim (check __alloc_pages_slowpath in mm/page_alloc.c for more).

    I wanted to fix the above and added a new GFP for it - GFP_READAHEAD which meant
    to avoid any reclaim under some conditions. I hope this makes sense - we may not
    want readahead to trigger side effects involving further dirty write I/O.

    (For b, I need to brush up more with latest kernel readahead changes, mine is latest
    till about 3.6).

    c)
    Anyhow, one more warning about ra_pages (readahead window) is that it ramps up
    dynamically based on the detected pattern. It can go up to 512 4k pages. The
    whole RA window resizing is quite interesting and is in mm/readahead.c -
    page_cache_sync_readahead and page_cache_async_readahead. (Note - the 512 4k
    seems to be added lately, earlier it was based on NR_INACTIVE_FILE and
    NR_FREE_PAGES).


    d)
    Regarding mmap, mmap has some additional readahead logic built on top of this:


    /* If we don't want any read-ahead, don't bother */
    if (vma->vm_flags & VM_RAND_READ)
        return;
    if (!ra->ra_pages)
        return;

    if (vma->vm_flags & VM_SEQ_READ) {
        page_cache_sync_readahead(mapping, ra, file, offset,
                                  ra->ra_pages);
        return;
    }

    /* Avoid banging the cache line if not needed */
    if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
        ra->mmap_miss++;

    /*
     * Do we miss much more than hit in this file? If so,
     * stop bothering with read-ahead. It will only hurt.
     */
    if (ra->mmap_miss > MMAP_LOTSAMISS)
        return;


    This has remained unchanged for a while. But, the key is MMAP_LOTSAMISS - ie.
    when it detects readahead is not helping, it doesn't do it unless VM_SEQ_READ is
    set (which is from a madvise).

    The above readahead is triggered directly in page fault path - filemap_fault.


    e) Regarding the numbers, I think tokumx may be using fadvise/madvise quite
    deftly. That is the key when doing buffered I/O (something which even pgsql has
    issues with, from what I could gather at collab summit). Some of madvise
    actually double the RA window.


    f) Regarding filesystems and read_ahead_kb, strictly speaking, filesystems
    shouldn't directly do this and VFS should take care of it (filesystems defer
    this with ->read_pages and so on). BUT, I have seen a
    few filesystems make direct calls to page_cache_{a,}sync_readahead: btrfs and ext*
    (former seems to be doing it a lot and latter only in readdir()).


    Anyways, I have a few readahead branches here
    http://git.wnohang.net/cgit.cgi/bldit.git/refs/heads readahead branches -
    especially http://git.wnohang.net/cgit.cgi/bldit.git/log/?h=new-readahead


  2. Thanks for the details. My concern is about trying to use read_ahead_kb to manage a small amount of readahead needed for OLTP -- no full scans but sometimes clients need more than 4kb at a time. I think that depending on readahead in that case is not going to make things efficient & performant.
