I am trying to understand why 4kb random reads from an SSD are about 2X slower when using a filesystem vs a raw device and this reproduces across different servers but I am only sharing the results from my home Intel NUCs. The symptoms are, the read response time is:
- ~2X larger per iostat's r_await for reads done via a filesystem vs a raw device (.04 vs .08 millisecs)
- ~3X larger per blkparse for reads done via a filesystem vs a raw device (~16 vs 50+ microsecs)
Once again, I am confused. Is something below user land doing (if from-filesystem -> go-slow). In theory, if the D->C transition reported by blkparse includes a lot of CPU overhead from the filesystem then that might explain this, but the CPU per IO overhead isn't large enough for that to be true here.
My test scripts for this are rfio.sh
- the mystery has been solved thanks to advice
from an expert (Andreas Freund) who is in my Twitter circle, and engaging with experts I would never get to meet in real life is why I use Twitter. I have been running the raw device tests with the device mostly empty and the fix is to run the test with it mostly full otherwise the SSD firmware can do something special (and faster) when reading data that was never written.
blktrace + blkparse
I used fio for 3 configurations: raw device, O_DIRECT and buffered IO. For O_DIRECT and buffered IO the filesystem is XFS and there were 8 20G files on a server that has 16G of RAM. The results for O_DIRECT and buffered IO were similar.
The results I discuss in the next 2 paragraphs are from fio run with --numjobs=1.
Output from blkparse for a few IOs are here
. The lifecycle of an IO starts with state Q
(queued) and ends with state C
(completed). The number in parentheses at the end of the line for state C is the response time in microseconds. The RWBS field has RA for buffered IO (R = read, A = possible readahead) and R for O_DIRECT and raw. The timestamp format is seconds.nanoseconds and I pasted examples from ~18 seconds into the measurement. Most of the time for an IO request occurs between state D (dispatch to device) and C (completed). The blkparse man page is here
From the examples I shared the response time is ~15 microseconds for raw and 50+ microseconds for O_DIRECT and buffered IO. Averaging the values from all samples collected over 20 seconds shows the average was 15 microseconds for raw vs ~73 microseconds for O_DIRECT and buffered. The read request size is 4096 in all cases (see + 8 in the blkparse output).
The sector offsets in the blkparse output are all a multiple of 8 (8 x 512 == 4096) so the requests have that in common between raw, O_DIRECT and buffered.
Command lines and performance metrics from fio are here
. I ran fio for numjobs in 1, 2, 4, 8, 16, 32, 48 and 64. A summary of the results:
And results after making the fix described in Update above, the results from raw are only slightly better than O_DIRECT, while buffered gets a better throughput number because it benefits from some hits in the OS page cache.