Thursday, July 16, 2015

Performance impact from too-small RAID stripes

What is the performance impact of the RAID stripe size? That depends on the workload. Many years ago Oracle published an excellent paper, Optimal Storage Configuration Made Easy, that is still worth reading today. One of its points is that for workloads with high concurrency you don't want an individual storage read request to use more than one disk, because that reduces the number of disk seeks.
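To make that point concrete, here is a small sketch (mine, not from the paper) that estimates how many disks a single read touches for a given stripe size. With the stripe sizes and 1MB reads used later in this post, the worst case is 5 disks for a 256KB stripe and 2 disks for a 1MB stripe.

/* Illustration (not from the Oracle paper): estimate how many disks one
 * read touches in a striped array when the request starts at an
 * arbitrary offset. */
#include <stdio.h>

static long disks_touched(long req_bytes, long stripe_bytes, long start_offset) {
    long first = start_offset % stripe_bytes;   /* offset within the first stripe unit */
    /* number of stripe units, and therefore disks, the request spans */
    return (first + req_bytes + stripe_bytes - 1) / stripe_bytes;
}

int main(void) {
    long req = 1L << 20;   /* 1MB read, as used in the test below */
    /* worst case: the request starts just past a stripe boundary */
    printf("256KB stripe: up to %ld disks\n", disks_touched(req, 256L << 10, 1));
    printf("1MB stripe:   up to %ld disks\n", disks_touched(req, 1L << 20, 1));
    return 0;
}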

RocksDB is used for OLTP and supports workloads with high concurrency, both from user requests and from compaction done in the background. There might be 8 threads doing compaction on a busy server, and a typical compaction step done by a thread is to read from ~11 SST files and then write ~11 SST files. While each SST file is read sequentially, all of the input files are read at the same time; the output files are written one at a time. So we have ~12 streams of IO per compaction thread (11 for reads, 1 for writes), and with 8 compaction threads we can have ~96 streams of concurrent IO. The reads are done a block at a time (for example 16kb) and depend on filesystem readahead to get larger requests and reduce seeks. The write requests are likely to be large because the files are multi-MB and fsync is called when all of the data for a file has been written, or optionally sync_file_range is called after every N MB of writes.
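For illustration, a minimal sketch of that write pattern follows. This is my code, not RocksDB source, and the 8MB sync interval is just an example value: write sequentially, push dirty pages with sync_file_range every N MB, then fsync when the file is complete.

/* Minimal sketch of the compaction write pattern described above:
 * sequential 1MB writes, sync_file_range every N MB, fsync at the end.
 * This is an illustration, not RocksDB source. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define WRITE_SZ   (1L << 20)   /* write 1MB at a time */
#define SYNC_EVERY (8L << 20)   /* example value for the N MB sync interval */

int write_sst(const char *path, long file_bytes) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    char *buf = malloc(WRITE_SZ);
    memset(buf, 'x', WRITE_SZ);

    long written = 0, last_sync = 0;
    while (written < file_bytes) {
        if (write(fd, buf, WRITE_SZ) != WRITE_SZ) break;
        written += WRITE_SZ;
        if (written - last_sync >= SYNC_EVERY) {
            /* start writeback of the dirty range without waiting for it */
            sync_file_range(fd, last_sync, written - last_sync,
                            SYNC_FILE_RANGE_WRITE);
            last_sync = written;
        }
    }
    fsync(fd);   /* called when all of the data for the file has been written */
    free(buf);
    close(fd);
    return 0;
}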

Even if we limit our focus to writes there is a lot of concurrency, as there is likely 1 write stream in progress for each compaction thread and another for the WAL. The reads done by compaction are interesting because many of them will be served from the OS filesystem cache. When level style compaction is used, the small levels near the top of the LSM are likely to be in cache while the largest level is not.

Reads vs stripe size

Enough about RocksDB; this post is supposed to be about the performance impact of RAID stripe sizes. I ran a simple IO performance test on two servers that were identical except for the RAID stripe size. Both had 15 disks in HW RAID 0; one used a 256KB stripe and the other a 1MB stripe. While the disk array is ~55TB, I limited the tests to the first 10TB and used the raw device. Domas provided the test client, which was run with 1 to 128 threads doing 1MB random reads.
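I won't reproduce the test client here, but a minimal sketch of that kind of test might look like the code below. The device name, span, and per-thread read count are example values, and this is not the client that produced the results that follow. Compile with -pthread and measure throughput with iostat while it runs.

/* Minimal sketch of a concurrent 1MB random-read test against a raw device.
 * Not the client used for the results below; names and sizes are examples. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define READ_SZ   (1L << 20)     /* 1MB reads */
#define TEST_SPAN (10L << 40)    /* limit the test to the first 10TB */
#define READS_PER_THREAD 1000    /* example value */

static const char *device = "/dev/sdX";   /* example device name */

static void *reader(void *arg) {
    int fd = open(device, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return NULL; }

    void *buf;
    posix_memalign(&buf, 4096, READ_SZ);   /* O_DIRECT needs aligned buffers */

    unsigned int seed = (unsigned int)(long)arg;
    for (int i = 0; i < READS_PER_THREAD; i++) {
        /* pick a random 1MB-aligned offset within the first 10TB */
        long block = rand_r(&seed) % (TEST_SPAN / READ_SZ);
        if (pread(fd, buf, READ_SZ, block * READ_SZ) != READ_SZ)
            perror("pread");
    }
    free(buf);
    close(fd);
    return NULL;
}

int main(int argc, char **argv) {
    int nthreads = argc > 1 ? atoi(argv[1]) : 8;
    if (nthreads > 128) nthreads = 128;
    pthread_t tids[128];
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, reader, (void *)(long)i);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    return 0;
}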

The array with a 1MB stripe gets more throughput once there are at least 2 threads doing reads. At high concurrency the array with a 1MB stripe gets almost 2X the throughput of the array with a 256KB stripe, which matches the arithmetic above: with the smaller stripe each 1MB read touches more disks, so more of the array's seek capacity is consumed per request.

Read MB/second by concurrency

threads          1     2     4     8    12    16    24    32    64   128
1MB stripe      67   125   223   393   528   642   795   892   891   908
256KB stripe    74   115   199   285   332   373   407   435   499   498

2 comments:

  1. Hi Mark,

    I'd say this is as expected.. ;-)
    However, I'm also curious whether things look similar when a 256KB block size is used (16K and 64K would be interesting too).
    Another curious observation may (or may not) show up if you replay the same tests limited to 1TB, then 2TB, then 5TB.. -- it all depends on your disk array and its RAID controller's capacity/capabilities..

    Rgds,
    -Dimitri

    Replies
    1. A clever RAID controller can't fix the problem of too-small RAID stripes when doing large IO requests. Controllers for HW RAID have been a challenge for my peers at times; they make it hard to understand what is really going on.

