Thanks for the details. My concern is about trying...

2014-05-06T10:27:01.709-07:00

Thanks for the details. My concern is about trying to use read_ahead_kb to manage a small amount of readahead needed for OLTP -- no full scans but sometimes clients need more than 4kb at a time. I think that depending on readahead in that case is not going to make things efficient & performant.

a) Interesting blog post. I came here mainly for r...

2014-05-04T00:36:05.247-07:00

a) Interesting blog post. I came here mainly for read_ahead_kb. I had looked at it
a while back and worked on it (and pushed many patches to the mainline kernel as
well, though I couldn't follow up).

While it is an interesting tunable, high global values of it can be bad. The
reason is that, even though the kernel is wise enough to cap the window (or at
least it was till a while back, see max_sane_readahead in mm/readahead.c),
readahead is still attempted under high memory pressure.

So, mainly, I would check whether a higher RA window triggered any sort of
reclaim (through perf top or so).
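
Besides perf top, a cruder check is to compare the reclaim counters in /proc/vmstat before and after a run. This is a different technique from the one suggested above. A minimal sketch, assuming Linux; the helper names are hypothetical, and the counter names (pgscan_kswapd*, pgscan_direct*, allocstall*) vary between kernel versions, which is why the match is by prefix:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sum the values of all "name value" lines whose name starts with prefix.
 * Prefix matching is deliberately coarse so that per-zone variants of a
 * counter are added together. */
unsigned long long sum_vmstat_prefix(const char *text, const char *prefix)
{
    unsigned long long total = 0;
    size_t plen = strlen(prefix);
    const char *line = text;

    while (*line != '\0') {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);

        if (len > plen && strncmp(line, prefix, plen) == 0) {
            const char *sp = memchr(line, ' ', len);
            if (sp != NULL)
                total += strtoull(sp + 1, NULL, 10);
        }
        line += len;
        if (*line == '\n')
            line++;
    }
    return total;
}

/* Read /proc/vmstat into buf (NUL-terminated, truncated to cap - 1 bytes).
 * Returns 0 on success, -1 if the file cannot be opened. */
int read_vmstat(char *buf, size_t cap)
{
    FILE *f = fopen("/proc/vmstat", "r");
    size_t n;

    if (f == NULL || cap == 0) {
        if (f != NULL)
            fclose(f);
        return -1;
    }
    n = fread(buf, 1, cap - 1, f);
    buf[n] = '\0';
    fclose(f);
    return 0;
}
```

Take one snapshot before the benchmark and one after, then compare sum_vmstat_prefix(buf, "pgscan_direct") for the two: a large jump with the bigger read_ahead_kb suggests that allocations went into direct reclaim.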

b)
That is, even though readahead is done under __GFP_COLD | __GFP_NORETRY |
__GFP_NOWARN (cold, don't retry and don't warn), it can still make page
allocation fall back to the slow path! By this I mean, suppose your application
reads a particular segment of a file and there isn't much memory available:
the readahead will still try to read pages *even* if it means triggering zone
reclaim (check __alloc_pages_slowpath in mm/page_alloc.c for more).

I wanted to fix the above and added a new GFP for it - GFP_READAHEAD, which was
meant to avoid any reclaim under some conditions. I hope this makes sense - we
may not want readahead to trigger side effects involving further dirty write I/O.

(For b, I need to brush up on the latest kernel readahead changes; my knowledge
is current only up to about 3.6).

c)
Anyhow, one more warning about ra_pages (readahead window) is that it ramps up
dynamically based on the detected pattern. It can go up to 512 4k pages. The
whole RA window resizing is quite interesting and is in mm/readahead.c -
page_cache_sync_readahead and page_cache_async_readahead. (Note - the 512 4k
limit seems to have been added recently; earlier it was based on
NR_INACTIVE_FILE and NR_FREE_PAGES).
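
To give a feel for the ramp-up, here is a rough userspace sketch modeled on how get_next_ra_size in mm/readahead.c grew the window in kernels of that era, as far as I recall it: 4x while the window is below 1/16 of the maximum, 2x after that, capped at the maximum. The names next_ra_size and ramp_up are made up for illustration, and the starting size of 4 pages used below is an assumption, since the real initial size depends on the request.

```c
#include <stddef.h>

/* Next window size in pages: grow 4x while small, 2x afterwards,
 * never exceeding max. */
unsigned long next_ra_size(unsigned long cur, unsigned long max)
{
    unsigned long next = (cur < max / 16) ? 4 * cur : 2 * cur;

    return next < max ? next : max;
}

/* Record successive window sizes, starting from start, until the window
 * reaches max or the sizes[] array (cap entries) is full.
 * Returns the number of entries written. start should be nonzero. */
size_t ramp_up(unsigned long start, unsigned long max,
               unsigned long *sizes, size_t cap)
{
    size_t n = 0;
    unsigned long cur = start;

    while (n < cap) {
        sizes[n++] = cur;
        if (cur >= max)
            break;
        cur = next_ra_size(cur, max);
    }
    return n;
}
```

With start = 4 and max = 512 this gives 4, 16, 64, 128, 256, 512 pages, so a sequential reader gets to the full 2MB window after only a handful of hits.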

d)
Regarding mmap, mmap has some additional readahead logic built on top of this:

    /* If we don't want any read-ahead, don't bother */
    if (vma->vm_flags & VM_RAND_READ)
        return;
    if (!ra->ra_pages)
        return;

    if (vma->vm_flags & VM_SEQ_READ) {
        page_cache_sync_readahead(mapping, ra, file, offset,
                                  ra->ra_pages);
        return;
    }

    /* Avoid banging the cache line if not needed */
    if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
        ra->mmap_miss++;

    /*
     * Do we miss much more than hit in this file? If so,
     * stop bothering with read-ahead. It will only hurt.
     */
    if (ra->mmap_miss > MMAP_LOTSAMISS)
        return;

This has remained unchanged for a while. But the key is MMAP_LOTSAMISS - i.e.
when it detects that readahead is not helping, it stops doing it unless
VM_SEQ_READ is set (which comes from a madvise call).

The above readahead is triggered directly in the page fault path - filemap_fault.
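
From userspace, the two flags tested above are set through madvise: MADV_SEQUENTIAL sets VM_SEQ_READ on the mapping and MADV_RANDOM sets VM_RAND_READ. A minimal sketch, assuming Linux with glibc; map_with_hint is a hypothetical helper name:

```c
#define _DEFAULT_SOURCE   /* for madvise() under strict -std= modes */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only and apply an madvise hint to the mapping.
 * advice is e.g. MADV_SEQUENTIAL or MADV_RANDOM.
 * Returns the mapping and stores its length in *len_out,
 * or returns NULL on any failure (including an empty file). */
void *map_with_hint(const char *path, int advice, size_t *len_out)
{
    struct stat st;
    void *p;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);   /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;
    if (madvise(p, (size_t)st.st_size, advice) != 0) {
        munmap(p, (size_t)st.st_size);
        return NULL;
    }
    *len_out = (size_t)st.st_size;
    return p;
}
```

With MADV_RANDOM the fault path returns at the first check in the excerpt, so a fault brings in no readahead window; with MADV_SEQUENTIAL it always calls page_cache_sync_readahead, regardless of the miss counter.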

e) Regarding the numbers, I think TokuMX may be using fadvise/madvise quite
deftly. That is the key when doing buffered I/O (something which even pgsql has
issues with, from what I could gather at collab summit). Some of these hints
actually double the RA window.

f) Regarding filesystems and read_ahead_kb, strictly speaking, filesystems
shouldn't do readahead directly and the VFS should take care of it (filesystems
defer this with ->readpages and so on). BUT, I have seen a few filesystems make
direct calls to page_cache_{a,}sync_readahead: btrfs and ext* (the former seems
to do it a lot and the latter only in readdir()).

Anyway, I have a few readahead branches at
http://git.wnohang.net/cgit.cgi/bldit.git/refs/heads - especially
http://git.wnohang.net/cgit.cgi/bldit.git/log/?h=new-readahead

Comments on Small Datum: The impact of read ahead and read size on MongoDB, TokuMX and MySQL
