Comments on Small Datum: The impact of read ahead and read size on MongoDB, TokuMX and MySQL
Author: Mark Callaghan (http://www.blogger.com/profile/09590445221922043181)
Feed updated: 2024-03-26T09:43:01.052-07:00

Comment posted 2014-05-06T10:27:01.709-07:00:

Thanks for the details. My concern is about trying to use read_ahead_kb to
manage a small amount of readahead needed for OLTP: no full scans, but
sometimes clients need more than 4kb at a time. I think that depending on
readahead in that case is not going to make things efficient and performant.

Mark Callaghan (https://www.blogger.com/profile/09590445221922043181)

Comment posted 2014-05-04T00:36:05.247-07:00:

a) Interesting blog post. I came here mainly for read_ahead_kb. I had looked
at it a while back and worked on it (and pushed many patches to the mainline
kernel as well, though I couldn't follow up).

While it is an interesting tunable, high global values of it can be bad. The
reason is that, even though the kernel is wise enough to cap the window (or at
least it was till a while back, see max_sane_readahead in mm/readahead.c),
readahead is still considered under high memory pressure.

So, mainly, I would check whether a higher RA window triggered any sort of
reclaim (through perf top or so).

b)
That is, even though readahead is done under __GFP_COLD | __GFP_NORETRY |
__GFP_NOWARN (cold, don't retry and don't warn), it can still make page
allocation fall back to the slow path!
By this I mean: suppose your application reads a particular segment of a file
and there isn't much memory available. The readahead will still try to read
pages *even* if it means triggering zone reclaim (check
__alloc_pages_slowpath in mm/page_alloc.c for more).

I wanted to fix the above and added a new GFP flag for it, GFP_READAHEAD,
which is meant to avoid any reclaim under some conditions. I hope this makes
sense: we may not want readahead to trigger side effects involving further
dirty write I/O.

(For b, I need to brush up on the latest kernel readahead changes; my
knowledge is current only up to about 3.6.)

c)
Anyhow, one more warning about ra_pages (the readahead window) is that it
ramps up dynamically based on the detected pattern. It can go up to 512 4k
pages. The whole RA window resizing is quite interesting and is in
mm/readahead.c, in page_cache_sync_readahead and page_cache_async_readahead.
(Note: the 512 4k limit seems to have been added lately; earlier it was based
on NR_INACTIVE_FILE and NR_FREE_PAGES.)

d)
Regarding mmap, mmap has some additional readahead logic built on top of this:

    /* If we don't want any read-ahead, don't bother */
    if (vma->vm_flags & VM_RAND_READ)
        return;
    if (!ra->ra_pages)
        return;

    if (vma->vm_flags & VM_SEQ_READ) {
        page_cache_sync_readahead(mapping, ra, file, offset,
                                  ra->ra_pages);
        return;
    }

    /* Avoid banging the cache line if not needed */
    if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
        ra->mmap_miss++;

    /*
     * Do we miss much more than hit in this file? If so,
     * stop bothering with read-ahead. It will only hurt.
     */
    if (ra->mmap_miss > MMAP_LOTSAMISS)
        return;

This has remained unchanged for a while. But the key is MMAP_LOTSAMISS, i.e.
when it detects that readahead is not helping, it stops doing it unless
VM_SEQ_READ is set (which comes from a madvise).

The above readahead is triggered directly in the page fault path, in
filemap_fault.

e) Regarding the numbers, I think TokuMX may be using fadvise/madvise quite
deftly. That is the key when doing buffered I/O (something which even pgsql
has issues with, from what I could gather at collab summit). Some madvise
hints actually double the RA window.

f) Regarding filesystems and read_ahead_kb, strictly speaking, filesystems
shouldn't do this directly and the VFS should take care of it (filesystems
defer this with ->read_pages and so on). BUT, I have seen a few filesystems
make direct calls to page_cache_{a,}sync_readahead: btrfs and ext* (the former
seems to be doing it a lot and the latter only in readdir()).

Anyways, I have a few readahead branches here:
http://git.wnohang.net/cgit.cgi/bldit.git/refs/heads
especially http://git.wnohang.net/cgit.cgi/bldit.git/log/?h=new-readahead

Raghavendra Prabhu (http://wnohang.net)
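
As a footnote to point (d) above, the gating in the quoted excerpt can be modelled in user space to see when a mapping stops getting readahead. This is only a rough sketch, not kernel code: the names should_readahead_on_miss, struct ra_state and enum access_hint are made up for illustration, the value 100 for MMAP_LOTSAMISS is an assumption about the kernel of that era, and what the kernel does after the last check is not shown in the excerpt, so the sketch simply reports that readahead would go ahead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed value; check mm/filemap.c in the kernel you care about. */
#define MMAP_LOTSAMISS 100

/* Hypothetical stand-in for the VM_RAND_READ / VM_SEQ_READ vma flags. */
enum access_hint { HINT_NORMAL, HINT_RANDOM, HINT_SEQUENTIAL };

/* Hypothetical, reduced version of the per-file readahead state. */
struct ra_state {
    unsigned int ra_pages;  /* readahead window in pages; 0 disables it */
    unsigned int mmap_miss; /* running count of page cache misses */
};

/*
 * Model of the checks in the quoted excerpt, for a fault that missed
 * the page cache. Returns true if readahead would still be issued.
 */
bool should_readahead_on_miss(struct ra_state *ra, enum access_hint hint)
{
    assert(ra != NULL);

    /* Random access hint or a zero window: never read ahead. */
    if (hint == HINT_RANDOM)
        return false;
    if (ra->ra_pages == 0)
        return false;

    /* Sequential hint: always read ahead, miss counter untouched. */
    if (hint == HINT_SEQUENTIAL)
        return true;

    /* Count the miss, but stop counting at 10x the threshold. */
    if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
        ra->mmap_miss++;

    /* Too many misses on this file: give up on readahead. */
    if (ra->mmap_miss > MMAP_LOTSAMISS)
        return false;

    return true;
}
```

Under these assumptions, a mapping with no hint gets readahead on its first MMAP_LOTSAMISS misses and none after that, while a sequential hint keeps it on regardless of the miss count. Note the sketch leaves out any decrement of mmap_miss on cache hits, since the excerpt does not show one.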