I am trying to understand why 4KB random reads from an SSD are about 2X slower when done via a filesystem than via the raw device. This reproduces across different servers, but I am only sharing the results from my home Intel NUCs. The symptom is that the read response time is:
- ~2X larger per iostat's r_await for reads done via a filesystem vs a raw device (~0.08 vs ~0.04 milliseconds)
- ~3X larger per blkparse for reads done via a filesystem vs a raw device (50+ vs ~16 microseconds)
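The r_await and rareq-sz numbers cited in this post come from iostat's extended device statistics. A minimal sketch of how they can be collected:

```sh
# Extended device stats at 1-second intervals.
# r_await  = average read response time in milliseconds
# rareq-sz = average read request size in KB
iostat -x 1
```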
My test scripts for this are rfio.sh and do_fio.sh.
Update - the mystery has been solved thanks to advice from an expert (Andres Freund) in my Twitter circle, and engaging with experts I would never get to meet in real life is why I use Twitter. I had been running the raw device tests with the device mostly empty. The fix is to run the test with it mostly full; otherwise the SSD firmware can do something special (and faster) when reading data that was never written.
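One way to make sure the LBAs being read have been written at least once is to fill the device with a write pass before the read tests. This is a minimal sketch, not the exact commands from my scripts; the device name is hypothetical and the write pass overwrites whatever is on the device:

```sh
# Fill most of the raw device with a sequential write pass so that later
# random reads hit LBAs that have been written at least once.
# WARNING: this destroys any data on the device.
fio --name=fill --filename=/dev/nvme0n1 --rw=write --bs=1M \
    --ioengine=libaio --iodepth=32 --direct=1 --size=95%
```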
blktrace + blkparse
I used fio for 3 configurations: raw device, O_DIRECT and buffered IO. For O_DIRECT and buffered IO the filesystem is XFS and there were eight 20G files on a server with 16G of RAM. The results for O_DIRECT and buffered IO were similar.
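To make the three configurations concrete, here is a minimal sketch of what the fio invocations can look like. These are not my rfio.sh/do_fio.sh scripts; the device name, directory and IO engine are assumptions:

```sh
# raw device: 4KB random reads directly from the block device
fio --name=raw --filename=/dev/nvme0n1 --rw=randread --bs=4k --direct=1 \
    --ioengine=psync --runtime=60 --time_based --numjobs=1

# O_DIRECT via XFS: 8 files of 20G each under /data/fio
fio --name=odirect --directory=/data/fio --nrfiles=8 --filesize=20g \
    --rw=randread --bs=4k --direct=1 --ioengine=psync \
    --runtime=60 --time_based --numjobs=1

# buffered IO via XFS: same files, direct=0 so reads go through the page cache
fio --name=buffered --directory=/data/fio --nrfiles=8 --filesize=20g \
    --rw=randread --bs=4k --direct=0 --ioengine=psync \
    --runtime=60 --time_based --numjobs=1
```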
The results I discuss in the next 2 paragraphs are from fio run with --numjobs=1.
Output from blkparse for a few IOs is here. The lifecycle of an IO starts with state Q (queued) and ends with state C (completed). The number in parentheses at the end of the line for state C is the response time in microseconds. The RWBS field has RA for buffered IO (R = read, A = possible readahead) and R for O_DIRECT and raw. The timestamp format is seconds.nanoseconds, and I pasted examples from ~18 seconds into the measurement. Most of the time for an IO request is spent between state D (dispatched to the device) and state C (completed). The blkparse man page is here.
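For anyone who wants to reproduce this, a minimal sketch of how a trace like this can be captured, assuming the device under test is /dev/nvme0n1 (a hypothetical name):

```sh
# Trace block-layer events on the device for 20 seconds, then decode them.
blktrace -d /dev/nvme0n1 -w 20 -o nvme0n1
# Each IO shows up as a sequence of events (Q, ..., D, C) with timestamps.
blkparse -i nvme0n1 > nvme0n1.blkparse.txt
```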
From the examples I shared the response time is ~15 microseconds for raw and 50+ microseconds for O_DIRECT and buffered IO. Averaging all samples collected over 20 seconds gives ~15 microseconds for raw vs ~73 microseconds for O_DIRECT and buffered. The read request size is 4096 bytes in all cases (the + 8 in the blkparse output means 8 512-byte sectors).
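The per-IO latency can also be computed from the trace by pairing D and C events. A minimal awk sketch, assuming blkparse's default output layout (timestamp in field 4, action in field 6, sector in field 8):

```sh
blkparse -i nvme0n1 | awk '
  $6 == "D" { issue[$8] = $4 }             # remember dispatch time, keyed by sector
  $6 == "C" && ($8 in issue) {
    sum += ($4 - issue[$8]) * 1000000      # D -> C latency in microseconds
    n++
    delete issue[$8]
  }
  END { if (n) printf "avg D->C latency: %.1f usec over %d IOs\n", sum / n, n }'
```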
The sector offsets in the blkparse output are all multiples of 8 (8 x 512 == 4096), so the requests are 4KB-aligned in all three cases: raw, O_DIRECT and buffered.
fio
Command lines and performance metrics from fio are here. I ran fio for numjobs in 1, 2, 4, 8, 16, 32, 48 and 64. A summary of the results from the original runs, with the raw device mostly empty:
- IOPs are ~2X larger for raw vs O_DIRECT and buffered
- CPU is 1.25 to 1.5X larger for O_DIRECT and buffered once numjobs is >= 8
- iostat r_await is ~2X larger for O_DIRECT and buffered vs raw
- iostat rareq-sz is the same for O_DIRECT, buffered and raw
A summary of the results after the fix, with the raw device mostly full:
- IOPs are similar
- CPU is higher with a filesystem, ~1.2X with O_DIRECT and ~1.5X with buffered IO vs raw for numjobs >= 8. But this only amounts to a few more microseconds, as the cost is ~8, ~9 and ~12 microseconds per IO for raw, O_DIRECT and buffered (see the sketch after this list for one way to estimate a per-IO CPU cost). I don't know yet how to explain the larger CPU/IO numbers for numjobs < 8. Is that a warmup cost, or is something amortized at higher concurrency?
- iostat r_await is similar
- iostat rareq-sz is the same
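One way to estimate a per-IO CPU cost like the ~8 to ~12 microseconds above is to combine CPU utilization with the read rate. A minimal sketch of the arithmetic, not my measurement script; the CPU percentages and IOPS below are hypothetical sample values:

```sh
# CPU microseconds per IO = (busy CPU fraction) * (number of CPUs) * 1e6 / IOPS
NCPU=$(nproc)
US=12         # user CPU %, e.g. from vmstat (hypothetical value)
SY=9          # system CPU % (hypothetical value)
IOPS=250000   # reads per second reported by fio (hypothetical value)
awk -v ncpu="$NCPU" -v us="$US" -v sy="$SY" -v iops="$IOPS" \
  'BEGIN { printf "CPU usec per IO: %.1f\n", (us + sy) / 100 * ncpu * 1e6 / iops }'
```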