Wednesday, October 26, 2022

Quantifying storage on Linux

Some things are complicated but I understand them (RocksDB). Clearly that isn't too complicated and the complexity might be a barrier to entry which boosts the demand for my skills. Other things are complicated and I don't understand them that well. Clearly those things are too complicated.

Yes, I am trying to be funny but what I wrote above might be true for many of us. In this case the things that I don't understand that well are the layers that support IO for a DBMS -- filesystems, the block layer and storage devices. It is likely that something in this post is factually incorrect and I am happy to be corrected. Some of my posts are thinly veiled attempts to get free advice from experts.

The problem I am trying to understand this week is the size of IO requests at different layers of the stack while running RocksDB benchmarks. To be specific: are the reads done in multiples of 512 or 4096 bytes, and when is each possible (O_DIRECT vs buffered IO)? From the details below I suspect I can do 512-byte reads with O_DIRECT on the v3.small and v4.small servers, but a 512-byte read on the GCP server will end up as a 4096-byte transfer at some level of the stack.
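
To make the question concrete, here is a minimal sketch of a single 512-byte read with O_DIRECT. The file path is hypothetical (not from the benchmarks) and the code assumes the device's logical sector size is 512 bytes; with O_DIRECT the buffer, the file offset and the length generally must all be aligned to the logical block size.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    /* Hypothetical path, just for illustration. */
    const char *path = "/data/test.sst";
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT needs an aligned buffer, typically aligned to the
       logical block size of the device (512 assumed here). */
    void *buf = NULL;
    if (posix_memalign(&buf, 512, 512) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        close(fd);
        return 1;
    }

    /* Offset and length must also be multiples of the logical block size. */
    ssize_t n = pread(fd, buf, 512, 0);
    if (n < 0) perror("pread");
    else printf("read %zd bytes\n", n);

    free(buf);
    close(fd);
    return 0;
}

Whether the device then moves 512 or 4096 bytes internally is exactly the question above; one way to check is to watch the counters in /sys/block/$device/stat or iostat while the read runs.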

I am trying to understand the cases when a read-only workload with RocksDB can and cannot saturate the IO capacity of a storage device. I am using 3 types of servers: home servers that I will abbreviate as v3.small and v4.small, and a c2-standard-60 server in GCP that uses SSD Persistent Disk. In all cases the filesystem is XFS and the OS is Ubuntu 22.04. You need CPU to do IO and the number of CPU cores is 4 (Intel i7 @ 2.7GHz) for v3.small, 8 (AMD Ryzen 7 @ 2GHz) for v4.small and 30 (Intel Xeon @ 3.1GHz) for c2-standard-60. The storage is NVMe: a Samsung 970 EVO on v3.small and a Kingston device on v4.small. I don't know what device is used for GCP.

The rest of this post lists the information that I found via:

  • /sys/block/$device/queue/*
  • lsblk -t $dev
  • xfs_info

Details: lsblk

Note:

  • {v3,v4}.small use min-io=512, phy-sec=512, log-sec=512
  • GCP uses min-io=4096, phy-sec=4096, log-sec=512
  • From this I wonder whether there are cases where RocksDB can actually do 512-byte IO requests (logical) and whether all layers of the stack will respect that and not do 4096-byte requests to return the requested 512 bytes (physical). 
  • One guess is that when LOG-SEC < PHY-SEC (see GCP below), some layer of the stack will do an operation at the larger (PHY-SEC) size but return only the smaller (LOG-SEC) amount. A way to query these sizes from a program is shown in the sketch after this list.
From lsblk --help:
  • MIN-IO - minimum I/O size
  • PHY-SEC - physical sector size
  • LOG-SEC - logical sector size
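
Here is a minimal sketch of how a program can ask the kernel for these sizes directly, using the BLKSSZGET (logical) and BLKPBSZGET (physical) ioctls against the block device. The device path is one of the devices above; opening it usually requires root, and this is just an illustration, not something from the benchmarks.

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void) {
    /* Open the block device read-only; usually needs root. */
    int fd = open("/dev/nvme0n1", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    int logical = 0;
    unsigned int physical = 0;
    if (ioctl(fd, BLKSSZGET, &logical) != 0) perror("BLKSSZGET");
    if (ioctl(fd, BLKPBSZGET, &physical) != 0) perror("BLKPBSZGET");

    /* These should match LOG-SEC and PHY-SEC in the lsblk output. */
    printf("logical sector size:  %d\n", logical);
    printf("physical sector size: %u\n", physical);

    close(fd);
    return 0;
}

On the GCP server I would expect this to print 512 and 4096, matching the lsblk output below.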

And here are the full details on the storage that I use. The man page for lsblk is here.

# v3.small
$ lsblk -t /dev/nvme0n1
NAME    ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
nvme0n1         0    512      0     512     512    0 none     1023 128    0B

# v4.small
$ lsblk -t /dev/nvme0n1
NAME    ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
nvme0n1         0    512      0     512     512    0 none      255 128    0B

# GCP
$ lsblk -t /dev/sdb
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
sdb          0   4096      0    4096     512    0 none     8192 128    4G

Details: /sys

Docs for these are here and several of these values are also in lsblk output above.

From /sys/block/$device/queue/$name
v3      v4      GCP     name
512     512     4096    physical_block_size
512     512     512     logical_block_size
512     512     512     hw_sector_size
512     512     4096    minimum_io_size
none    none    none    scheduler
512     512     4096    discard_granularity
1280    256     256     max_sectors_kb
nvme0n1 nvme0n1 sdb     $device

Details: xfs_info

The man page for xfs_info is here. The filesystems were created using the default options for mkfs.xfs.
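
Related to the filesystem values below: an application can also ask for a preferred IO size via the st_blksize field returned by stat/fstat, which comes up in the comments. A minimal sketch (the default path is hypothetical, and I have not confirmed how st_blksize relates to the xfs_info or /sys values):

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv) {
    /* Pass a file on the XFS mount; the default path is hypothetical. */
    const char *path = (argc > 1) ? argv[1] : "/data/somefile";
    struct stat st;
    if (stat(path, &st) != 0) { perror("stat"); return 1; }
    printf("st_blksize for %s: %ld\n", path, (long)st.st_blksize);
    return 0;
}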


# v3.small
$ xfs_info /dev/nvme0n1
meta-data=/dev/nvme0n1           isize=512    agcount=4, agsize=30524162 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=122096646, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=59617, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# v4.small
$ xfs_info /dev/nvme0n1
meta-data=/dev/nvme0n1           isize=512    agcount=4, agsize=30524162 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=122096646, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=59617, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# GCP
$ xfs_info /dev/sdb
meta-data=/dev/sdb               isize=512    agcount=4, agsize=196608000 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=786432000, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=384000, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

3 comments:

  1. I found this blog while seeking answers myself. Thank you for the work you do!

    For now, I only have more questions and observations. I was mainly interested in efficient IO using 'read' syscalls.

    My questions/findings are:
    1) Try using the 'fstat.st_blksize' value, which is the "Block size for filesystem I/O". I wonder what it shows compared to your previous measurements: https://man7.org/linux/man-pages/man2/lstat.2.html

    2) C has a constant called BUFSIZ: "The value of BUFSIZ is chosen on each system so as to make stream I/O efficient". I wonder how the value is actually chosen: https://www.gnu.org/software/libc/manual/html_node/Controlling-Buffering.html#index-BUFSIZ

    3) I observed that using 64KB buffer with buffered IO on my laptop leads to the fastest read time and the fewest number of 'read' syscalls when using Java's FileInputStream. 64KB happens to be the L1 cache line size on my Mac. I am interested if different CPU caches matter when having buffered IO.

    Anyway, I hope you find your answers. And I will hopefully get mine one day :)

    Replies
    1. For 1) I don't know which of the /sys/block/*/queue entries sets the value for fstat.st_blksize. Maybe we will learn one day.

      For 2) I don't know about BUFSIZ but from the values above the useful one is max_sectors_kb, which is the max size of an IO request. Note that it is 256KB for GCP (and I assume also 256KB for EBS). I read elsewhere that a contiguous request on EBS counts as 1 IO when it is <= 256KB, so if you want to max out your EBS IOPS budget then figure out how to do 256KB requests (and an LSM makes it easy to do that on the write side).

      For 3) I think you mean the L1 cache line is 64 bytes, not KB. WRT the impact of the cache on buffered IO perf, one challenge is to find the HW on which to do the tests.

    2. Thank you for the answer! It is quite helpful.

