Thursday, July 16, 2015

Performance impact from too-small RAID stripes

What is the impact from the RAID stripe size? That depends on the workload. Many years ago Oracle published an excellent paper, Optimal Storage Configuration Made Easy, that is still worth reading today. One of the points in the paper is that for workloads with high concurrency you don't want an individual storage read request to span more than one disk, because that keeps the number of disk seeks per request down.

RocksDB is used for OLTP and supports workloads with high concurrency, both from user requests and from compaction done in the background. There might be 8 threads doing compaction on a busy server, and a typical compaction step done by a thread is to read from ~11 SST files and then write ~11 SST files. While each SST file is read sequentially, all of the input files are read at the same time, and the output files are written one after another. So we have ~12 streams of IO per compaction thread (11 for reads, 1 for writes), and with 8 compaction threads we can have ~96 streams of concurrent IO. The reads are done a block at a time (for example 16KB) and depend on filesystem readahead to get larger requests and reduce seeks. The write requests are likely to be large because the files are multi-MB and fsync is called after all of the data for a file has been written, or optionally sync_file_range is called after every N MB of writes.
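To make that write pattern concrete, here is a minimal sketch in C, not RocksDB code, of a writer that asks for writeback with sync_file_range after every N MB and calls fsync once the file is complete. The 1MB write size, 8MB sync interval and the function name are assumptions made for the example.

/* Sketch: write a large file, asking the kernel to start writeback every
   SYNC_INTERVAL bytes so the device sees large, steady write requests
   instead of one burst of dirty pages at fsync. Sizes are examples. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define WRITE_SIZE    (1L << 20)         /* 1MB per write() call */
#define SYNC_INTERVAL (8L * (1L << 20))  /* request writeback every 8MB */

int write_big_file(const char *path, long total_bytes) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return -1; }

    char *buf = malloc(WRITE_SIZE);
    memset(buf, 'x', WRITE_SIZE);

    long written = 0, synced = 0;
    while (written < total_bytes) {
        if (write(fd, buf, WRITE_SIZE) != WRITE_SIZE) { perror("write"); break; }
        written += WRITE_SIZE;
        if (written - synced >= SYNC_INTERVAL) {
            /* start async writeback for the bytes written since the last call */
            sync_file_range(fd, synced, written - synced, SYNC_FILE_RANGE_WRITE);
            synced = written;
        }
    }
    fsync(fd);   /* make the whole file durable once it is complete */
    free(buf);
    close(fd);
    return 0;
}

The point is only to show that the write stream for each output file stays large and sequential, which is what makes stripe size matter for these requests.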

Even if we limit our focus to writes there is a lot of concurrency, as there is likely one write stream in progress for each compaction thread and another for the WAL. The reads done by compaction are interesting because many of them will be served from the OS filesystem cache. When level style compaction is used, the small levels near the top of the LSM are likely to be in cache while the largest level is not.

Reads vs stripe size

Enough about RocksDB, this post is supposed to be about the performance impact from RAID stripe sizes. I ran a simple IO performance test on two servers that were identical except for the RAID stripe size. Both had 15 disks with HW RAID 0; one used a 256KB stripe and the other a 1MB stripe. While the disk array is ~55TB, I limited the test to the first 10TB and used the raw device. Domas provided the test client and it was run with 1 to 128 threads doing 1MB random reads.
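Domas provided the actual client so I won't reproduce it here, but a minimal sketch of that kind of test might look like the following: each thread does O_DIRECT 1MB reads at random aligned offsets within the first 10TB of the raw device. The device path, reads per thread and thread count are placeholders.

/* Sketch of a concurrent random-read test: each thread issues 1MB reads
   at random 1MB-aligned offsets within the first LIMIT bytes of the device.
   This is an illustration, not the client used for the numbers below. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define READ_SIZE (1L << 20)            /* 1MB per read */
#define LIMIT     (10L * (1L << 40))    /* first 10TB of the device */
#define READS_PER_THREAD 1000

static const char *device = "/dev/sdX"; /* placeholder raw device */

static void *reader(void *arg) {
    unsigned int seed = (unsigned int)(long)arg;
    int fd = open(device, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return NULL; }

    void *buf;
    posix_memalign(&buf, 4096, READ_SIZE); /* O_DIRECT needs aligned buffers */

    for (int i = 0; i < READS_PER_THREAD; i++) {
        long block = rand_r(&seed) % (LIMIT / READ_SIZE);
        if (pread(fd, buf, READ_SIZE, block * READ_SIZE) != READ_SIZE)
            perror("pread");
    }
    free(buf);
    close(fd);
    return NULL;
}

int main(int argc, char **argv) {
    int nthreads = argc > 1 ? atoi(argv[1]) : 8; /* e.g. 1 to 128 */
    if (nthreads < 1) nthreads = 1;
    if (nthreads > 128) nthreads = 128;

    pthread_t tids[128];
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, reader, (void *)(long)(i + 1));
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    return 0;
}

The sketch omits timing and reporting; throughput is just total bytes read divided by elapsed time.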

The array with a 1MB stripe gets more throughput when there are at least 2 threads doing reads. At high concurrency the array with a 1MB stripe gets almost 2X more throughput.

Read MB/second by concurrency

threads         1      2      4      8      12     16     24     32     64     128
1MB stripe      67     125    223    393    528    642    795    892    891    908
256KB stripe    74     115    199    285    332    373    407    435    499    498
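A rough way to think about the gap, assuming the 1MB reads are stripe aligned: with a 256KB stripe each read spans 4 stripes and therefore up to 4 disks, so it costs about 4 seeks, while with a 1MB stripe it touches 1 disk (2 when it straddles a stripe boundary) and costs 1 or 2 seeks. At high concurrency the 15-disk array is limited by seeks, so cutting the seeks per read by a factor of 2 or more is consistent with the almost 2X difference above.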

2 comments:

  1. Hi Mark,

    I'd say this is as expected.. ;-)
    However I'm also curious whether things look similar when a 256KB block size is used (16K and 64K would be interesting too).
    And another curious observation may happen (or not) if you replay the same tests limited to 1TB, then 2TB, then 5TB.. it all depends on your disk array and its RAID controller's capacity/capabilities..

    Rgds,
    -Dimitri

    Replies
    1. A clever RAID controller can't fix the problem of too-small RAID stripes when doing large IO requests. Controllers for HW RAID have been a challenge for my peers at times because they make it hard to understand what is really going on.
