Wednesday, October 26, 2022

Quantifying storage on Linux

Some things are complicated but I understand them (RocksDB). Clearly that isn't too complicated and the complexity might be a barrier to entry which boosts the demand for my skills. Other things are complicated and I don't understand them that well. Clearly those things are too complicated.

Yes, I am trying to be funny but what I wrote above might be true for many of us. In this case the things that I don't understand that well are the layers that support IO for a DBMS -- filesystems, the block layer and storage devices. It is likely that something in this post is factually incorrect and I am happy to be corrected. Some of my posts are thinly veiled attempts to get free advice from experts.

The problem I am trying to understand this week is the size of IO requests at different layers of the stack while running RocksDB benchmarks. To be specific: are the reads done in multiples of 512 or 4096 bytes, and when is each possible (O_DIRECT vs buffered IO)? From the details below I suspect I can do 512-byte reads with O_DIRECT on the v3.small and v4.small servers, but a 512-byte read on the GCP server will end up as a 4096-byte transfer at some level of the stack.
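
To make the question concrete, here is a minimal sketch of a single 512-byte read with O_DIRECT. The file path is hypothetical (not from the benchmarks) and the code assumes the device's logical sector size is 512 bytes; with O_DIRECT the buffer, the file offset and the length generally must all be aligned to the logical block size.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    /* Hypothetical path, just for illustration. */
    const char *path = "/data/test.sst";
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT needs an aligned buffer, typically aligned to the
       logical block size of the device (512 assumed here). */
    void *buf = NULL;
    if (posix_memalign(&buf, 512, 512) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        close(fd);
        return 1;
    }

    /* Offset and length must also be multiples of the logical block size. */
    ssize_t n = pread(fd, buf, 512, 0);
    if (n < 0) perror("pread");
    else printf("read %zd bytes\n", n);

    free(buf);
    close(fd);
    return 0;
}

Whether the device then moves 512 or 4096 bytes internally is exactly the question above; one way to check is to watch the counters in /sys/block/$device/stat or iostat while the read runs.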

I am trying to understand the cases when a read-only workload with RocksDB can and cannot saturate the IO capacity of a storage device. I am using 3 types of servers: home servers that I will abbreviate as v3.small and v4.small, and a c2-standard-60 server in GCP that uses SSD Persistent Disk. In all cases the filesystem is XFS and the OS is Ubuntu 22.04. You need CPU to do IO and the number of CPU cores is 4 (Intel i7 @ 2.7GHz) for v3.small, 8 (AMD Ryzen 7 @ 2GHz) for v4.small and 30 (Intel Xeon @ 3.1GHz) for c2-standard-60. The storage is NVMe: a Samsung 970 EVO on v3.small and a Kingston device on v4.small. I don't know what device is used for GCP.

The rest of this post lists the information that I found via:

  • /sys/block/$device/queue/*
  • lsblk -t $dev
  • xfs_info

Details: lsblk

Note:

  • {v3,v4}.small use min-io=512, phy-sec=512, log-sec=512
  • GCP uses min-io=4096, phy-sec=4096, log-sec=512
  • From this I wonder whether there are cases where RocksDB can actually do 512-byte IO requests (logical) and whether all layers of the stack will respect that and not do 4096-byte requests to return the requested 512 bytes (physical). 
  • One guess is that when LOG-SEC < PHY-SEC (see GCP below), some layer of the stack will do an operation at the larger (PHY-SEC) size but return only the smaller (LOG-SEC) amount. A way to query these sizes from a program is shown in the sketch after this list.
From lsblk --help:
  • MIN-IO - minimum I/O size
  • PHY-SEC - physical sector size
  • LOG-SEC - logical sector size
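
Here is a minimal sketch of how a program can ask the kernel for these sizes directly, using the BLKSSZGET (logical) and BLKPBSZGET (physical) ioctls against the block device. The device path is one of the devices above; opening it usually requires root, and this is just an illustration, not something from the benchmarks.

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void) {
    /* Open the block device read-only; usually needs root. */
    int fd = open("/dev/nvme0n1", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    int logical = 0;
    unsigned int physical = 0;
    if (ioctl(fd, BLKSSZGET, &logical) != 0) perror("BLKSSZGET");
    if (ioctl(fd, BLKPBSZGET, &physical) != 0) perror("BLKPBSZGET");

    /* These should match LOG-SEC and PHY-SEC in the lsblk output. */
    printf("logical sector size:  %d\n", logical);
    printf("physical sector size: %u\n", physical);

    close(fd);
    return 0;
}

On the GCP server I would expect this to print 512 and 4096, matching the lsblk output below.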

And here are the full details on the storage that I use. The man page for lsblk is here.

# v3.small
$ lsblk -t /dev/nvme0n1
NAME    ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
nvme0n1         0    512      0     512     512    0 none     1023 128    0B

# v4.small
$ lsblk -t /dev/nvme0n1
NAME    ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
nvme0n1         0    512      0     512     512    0 none      255 128    0B

# GCP
$ lsblk -t /dev/sdb
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
sdb          0   4096      0    4096     512    0 none     8192 128    4G

Details: /sys

Docs for these are here and several of these values are also in lsblk output above.

From /sys/block/$device/queue/$name
v3      v4      GCP     name
512     512     4096    physical_block_size
512     512     512     logical_block_size
512     512     512     hw_sector_size
512     512     4096    minimum_io_size
none    none    none    scheduler
512     512     4096    discard_granularity
1280    256     256     max_sectors_kb
nvme0n1 nvme0n1 sdb     $device

Details: xfs_info

The man page for xfs_info is here. The filesystems were created using the default options for mkfs.xfs.
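
Related to the filesystem values below: an application can also ask for a preferred IO size via the st_blksize field returned by stat/fstat, which comes up in the comments. A minimal sketch (the default path is hypothetical, and I have not confirmed how st_blksize relates to the xfs_info or /sys values):

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv) {
    /* Pass a file on the XFS mount; the default path is hypothetical. */
    const char *path = (argc > 1) ? argv[1] : "/data/somefile";
    struct stat st;
    if (stat(path, &st) != 0) { perror("stat"); return 1; }
    printf("st_blksize for %s: %ld\n", path, (long)st.st_blksize);
    return 0;
}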


# v3.small
$ xfs_info /dev/nvme0n1
meta-data=/dev/nvme0n1           isize=512    agcount=4, agsize=30524162 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=122096646, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=59617, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# v4.small
$ xfs_info /dev/nvme0n1
meta-data=/dev/nvme0n1           isize=512    agcount=4, agsize=30524162 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=122096646, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=59617, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# GCP
$ xfs_info /dev/sdb
meta-data=/dev/sdb               isize=512    agcount=4, agsize=196608000 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=786432000, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=384000, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

3 comments:

  1. I found this blog while seeking answers myself. Thank you for the work you do!

    For now, I only have more questions and observations. I was mainly interested in efficient IO using 'read' syscalls.

    My questions/findings are:
    1) Try using the 'fstat.st_blksize' value, which is the "Block size for filesystem I/O". I wonder what it shows compared to your previous measurements: https://man7.org/linux/man-pages/man2/lstat.2.html

    2) C has a constant called BUFSIZ: "The value of BUFSIZ is chosen on each system so as to make stream I/O efficient". I wonder how the value is actually chosen: https://www.gnu.org/software/libc/manual/html_node/Controlling-Buffering.html#index-BUFSIZ

    3) I observed that using 64KB buffer with buffered IO on my laptop leads to the fastest read time and the fewest number of 'read' syscalls when using Java's FileInputStream. 64KB happens to be the L1 cache line size on my Mac. I am interested if different CPU caches matter when having buffered IO.

    Anyway, I hope you find your answers. And I will hopefully get mine one day :)

    Replies
    1. For 1) I don't know which of the /sys/block/*/queue entries sets the value for fstat.st_blksize. Maybe we will learn one day.

      For 2) I don't know about BUFSIZ but from the values above the useful one is max_sectors_kb, which is the max size of an IO request. Note that it is 256KB for GCP (and I assume also 256KB for EBS). I read elsewhere that a contiguous request on EBS counts as 1 IO when it is <= 256KB, so if you want to max out your EBS IOPS budget then figure out how to do 256KB requests (and an LSM makes it easy to do that on the write side).

      For 3) I think you mean the L1 cache line is 64 bytes, not KB. WRT the impact of the cache on buffered IO perf, one challenge is to find the HW on which to do the tests.

    2. Thank you for the answer! It is quite helpful.

