Tuesday, February 8, 2022

RocksDB internals: prefetch and/or readahead

RocksDB can optimize IO for large and small read requests. Small read requests are done for user queries while large read requests can be done for iterators from users and compaction.

By reads I mean reading data from the filesystem. That data might be in the OS page cache; otherwise it must be read from a storage device. Back in the day the choices were to use buffered IO or mmap. Today there is a new option -- O_DIRECT.

tl;dr for POSIX (someone else can document this for Windows)

  • for small reads RocksDB can use posix_fadvise with POSIX_FADV_RANDOM
  • for large reads RocksDB can use posix_fadvise with POSIX_FADV_SEQUENTIAL to request filesystem readahead. It can also do large reads synchronously in the context of the thread that consumes the read. Some of the RocksDB docs describe this as readahead and/or prefetch. To (pedantic) me prefetch (definitely) and readahead (possibly) imply async requests. Writing this post helps me avoid that confusion.
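
A minimal C++ sketch of the two hints from the list above, calling posix_fadvise directly. The AccessPattern enum and the AdviceFor and AdviseFile helpers are hypothetical names used only for illustration. This is not RocksDB code; RocksDB reaches posix_fadvise through its own Hint method.

```cpp
#include <fcntl.h>   // posix_fadvise, POSIX_FADV_*
#include <cassert>
#include <cerrno>    // errno values such as EBADF, returned directly by posix_fadvise

// Hypothetical enum for this sketch: the two access patterns discussed above.
enum class AccessPattern { kRandom, kSequential };

// Map an access pattern to the matching posix_fadvise advice constant.
int AdviceFor(AccessPattern p) {
  switch (p) {
    case AccessPattern::kRandom:
      return POSIX_FADV_RANDOM;      // small point reads: ask for no readahead
    case AccessPattern::kSequential:
      return POSIX_FADV_SEQUENTIAL;  // scans: ask for filesystem readahead
  }
  assert(false && "unhandled AccessPattern");
  return POSIX_FADV_NORMAL;
}

// Advise the kernel about the whole file. Offset 0 with length 0 means
// "from the start through the end of the file".
// Returns 0 on success or an error number on failure. posix_fadvise
// returns the error number rather than setting errno.
int AdviseFile(int fd, AccessPattern p) {
  return posix_fadvise(fd, 0, 0, AdviceFor(p));
}
```

The hint is advisory. The kernel is free to ignore it, and it only matters for buffered IO because O_DIRECT bypasses the page cache.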

History

In the early days all reads by RocksDB were done one block at a time where the uncompressed block size was usually between 4kb and 16kb. The block might be compressed so the size of the pread request can be less than that. Unless O_DIRECT is used, the block is not aligned to the filesystem page size, so the read request can span filesystem pages. That means reading a 2kb compressed block might cause two filesystem pages (4kb each) to be read.
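
The page-spanning arithmetic can be sketched as a small C++ function. PagesTouched is a hypothetical helper name, and the 4kb default page size is the value used in the example above rather than something queried from the filesystem.

```cpp
#include <cassert>
#include <cstdint>

// Number of filesystem pages touched by a read of `len` bytes starting at
// byte `offset`, for pages of `page_size` bytes.
uint64_t PagesTouched(uint64_t offset, uint64_t len, uint64_t page_size = 4096) {
  assert(page_size > 0);
  if (len == 0) return 0;
  const uint64_t first_page = offset / page_size;
  const uint64_t last_page = (offset + len - 1) / page_size;
  return last_page - first_page + 1;
}
```

With this, a 2kb block that starts at offset 3072 covers bytes 3072 through 5119 and touches two pages, while the same block at offset 4096 touches one.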

Today not all reads are done one block at a time. There are at least two cases where a file will be read sequentially -- compaction and a long range scan via an iterator -- for which RocksDB can do larger (multi-block) read requests. RocksDB also has a Hint method for open files and that can trigger a call to posix_fadvise.

Some details on iterator readahead are here. Search for automatic readahead.

Prefetch, readahead and large IO requests

RocksDB can request that the kernel does readahead or prefetching via a posix_fadvise call. This is controlled by the access_hint_on_compaction_start option. RocksDB can also explicitly use large read requests for user and compaction iterators. Some of the code is complicated. It is easy to see where the options are used and where the large reads are done. But the path between the option use and the large read isn't as easy to trace compared to what I described in my past few posts. Two important methods are FilePrefetchBuffer::Prefetch and BlockPrefetcher::PrefetchIfNeeded.
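
To make the idea of explicit large reads concrete, here is a simplified C++ sketch: one large synchronous pread fills a buffer, and later block-sized reads that fall inside the buffer are served from memory. ReadaheadBuffer and its members are hypothetical names. This is not how FilePrefetchBuffer is implemented; it only illustrates the general technique, and it assumes a single-threaded caller.

```cpp
#include <fcntl.h>    // open, O_RDONLY (callers open the file; the class only needs the fd)
#include <unistd.h>   // pread
#include <algorithm>  // std::min, std::max
#include <cassert>
#include <cstdint>
#include <cstring>    // std::memcpy
#include <vector>

// Hypothetical, simplified buffer that turns many small reads into few large ones.
class ReadaheadBuffer {
 public:
  ReadaheadBuffer(int fd, size_t readahead_size)
      : fd_(fd), readahead_size_(readahead_size) {
    assert(readahead_size_ > 0);
  }

  // Copy up to `n` bytes at file offset `offset` into `out`.
  // Returns the number of bytes copied (short at end of file) or -1 on error.
  ssize_t Read(uint64_t offset, size_t n, char* out) {
    if (!Contains(offset, n)) {
      // Miss: do one large synchronous read in the caller's thread.
      if (!Fill(offset, std::max(n, readahead_size_))) return -1;
    }
    const size_t start = static_cast<size_t>(offset - buf_offset_);
    const size_t copied = std::min(n, buf_.size() - start);
    if (copied > 0) std::memcpy(out, buf_.data() + start, copied);
    return static_cast<ssize_t>(copied);
  }

  // Number of times the buffer was refilled from the file.
  uint64_t fills() const { return fills_; }

 private:
  bool Contains(uint64_t offset, size_t n) const {
    return offset >= buf_offset_ && offset + n <= buf_offset_ + buf_.size();
  }

  bool Fill(uint64_t offset, size_t want) {
    buf_.resize(want);
    size_t got = 0;
    while (got < want) {
      const ssize_t r = pread(fd_, buf_.data() + got, want - got,
                              static_cast<off_t>(offset + got));
      if (r < 0) {
        buf_.clear();
        buf_offset_ = 0;
        return false;
      }
      if (r == 0) break;  // end of file
      got += static_cast<size_t>(r);
    }
    buf_.resize(got);
    buf_offset_ = offset;
    ++fills_;
    return true;
  }

  int fd_;
  size_t readahead_size_;
  std::vector<char> buf_;
  uint64_t buf_offset_ = 0;
  uint64_t fills_ = 0;
};
```

Reading a 64kb file in 4kb blocks through a buffer with a 32kb readahead size needs two refills instead of sixteen separate reads. Note that the large read is synchronous here, which is the point made earlier about these reads not being async.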

Options

The relevant options include:

  • max_auto_readahead_size - the max size of a readahead request for an iterator. This is adaptive: the readahead size starts at 8kb and grows to max_auto_readahead_size.
  • compaction_readahead_size - sets the readahead size for iterators used by compaction to read input SSTs. The use is here and then here in the BlockBasedTableIterator ctor.
  • prepopulate_block_cache - this isn't readahead but is related. When this option is set then the output from memtable flushes (new L0 SSTs) is written into the block cache. This is useful when O_DIRECT is used as it saves the cost of reading new L0 files into the block cache. It might help to extend this to new SSTs written to the L1. Start here to read the code that caches blocks from memtable flushes.
  • advise_random_on_open - when True, Hint(kRandom) is called when a file is opened. For POSIX that triggers a call to posix_fadvise with POSIX_FADV_RANDOM. This is intended for files that will be used for user queries.
  • access_hint_on_compaction_start - specifies the access hint that should be used for reads done by compaction. This is used in SetupForCompaction when Hint is called. The Hint implementation for POSIX calls Fadvise which calls posix_fadvise. It might help to use SEQUENTIAL as the value for access_hint_on_compaction_start but I haven't tested this in a long time. Issue 9448 is open because there is a race WRT the posix_fadvise call for this and advise_random_on_open.
  • Things in ReadOptions:
    • adaptive_readahead - enables adaptive readahead for iterators that increases the size of readahead as more data is read from the iterator
    • readahead_size - default is 0, when set can increase the size of the initial readahead. When not set the initial readahead is 2 blocks.
  • verify_checksums_readahead_size - the size of read requests when verifying checksums on SST files that will be ingested
  • blob_compaction_readahead_size - sets the readahead size for reads from blob files. I am not sure whether this is only for reads from blobs during leveled LSM compaction or during compaction triggered by blob_garbage_collection_force_threshold.
  • log_readahead_size - the readahead size for log reads
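
The adaptive growth described for max_auto_readahead_size can be sketched in C++ as a schedule of request sizes. ReadaheadSchedule is a hypothetical helper, and doubling on each step is an assumption made for this sketch. The text above only says that the size starts at 8kb and grows to the max.

```cpp
#include <algorithm>  // std::min
#include <cassert>
#include <cstddef>
#include <vector>

// Sizes of `steps` successive readahead requests: start at `initial`,
// double after each request (an assumption of this sketch), and never
// exceed `max_size`.
std::vector<size_t> ReadaheadSchedule(size_t initial, size_t max_size,
                                      size_t steps) {
  assert(initial > 0 && initial <= max_size);
  std::vector<size_t> sizes;
  size_t cur = initial;
  for (size_t i = 0; i < steps; ++i) {
    sizes.push_back(cur);
    cur = std::min(cur * 2, max_size);
  }
  return sizes;
}
```

Starting at 8kb with a hypothetical cap of 256kb, the schedule under this assumption is 8kb, 16kb, 32kb, 64kb, 128kb, 256kb and then stays at 256kb.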

Other options for large IO requests:

Options that limit the amount of dirty data from compaction and the WAL

  • bytes_per_sync - in some cases when zero the value is set to 1M at startup. When non-zero, RangeSync is called after this many bytes have been written to an SST during compaction. The RangeSync method calls sync_file_range to trigger writeback and reduce the duration of the fsync or fdatasync done when compaction is done writing to that SST. The call to sync_file_range might block but the hope is that it doesn't because the goal is async write-behind.
  • wal_bytes_per_sync - similar to bytes_per_sync but for the WAL. I am confused by the impact of this option. Ignoring the cases where the value of it is assigned to the value of another option, this is the only use for it.
  • strict_bytes_per_sync - when true then SYNC_FILE_RANGE_WAIT_BEFORE is used in the sync_file_range call. Otherwise only SYNC_FILE_RANGE_WRITE is used. When true this provides a strict bound on the amount of data that would have to be written to storage when fsync or fdatasync are called when the SST is full/closed.
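
A Linux-only C++ sketch of the write-behind pattern described above: request writeback with sync_file_range after every bytes_per_sync bytes are appended. RangeSyncWriter and its members are hypothetical names and this is not the RocksDB implementation. In strict mode the sketch covers the file from offset 0 so that the wait also applies to ranges submitted earlier. That is my assumption about how to get the strict bound, not a claim about what the RocksDB code does.

```cpp
#include <fcntl.h>   // sync_file_range, SYNC_FILE_RANGE_* (Linux, needs _GNU_SOURCE, which g++ defines)
#include <unistd.h>  // write
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical append-only writer that limits dirty data with sync_file_range.
class RangeSyncWriter {
 public:
  RangeSyncWriter(int fd, uint64_t bytes_per_sync, bool strict)
      : fd_(fd), bytes_per_sync_(bytes_per_sync), strict_(strict) {
    assert(fd_ >= 0);
  }

  // Append `n` bytes. Returns false if write() fails.
  bool Append(const char* data, size_t n) {
    while (n > 0) {
      const ssize_t w = write(fd_, data, n);
      if (w < 0) return false;
      data += w;
      n -= static_cast<size_t>(w);
      written_ += static_cast<uint64_t>(w);
    }
    // A value of 0 disables range syncs in this sketch.
    if (bytes_per_sync_ > 0 && written_ - synced_ >= bytes_per_sync_) {
      RangeSync();
    }
    return true;
  }

  uint64_t written() const { return written_; }
  uint64_t range_syncs() const { return range_syncs_; }
  uint64_t sync_errors() const { return sync_errors_; }

 private:
  void RangeSync() {
    unsigned int flags = SYNC_FILE_RANGE_WRITE;  // start async writeback
    uint64_t start = synced_;
    if (strict_) {
      // First wait for writeback already in flight in the range, then submit.
      flags |= SYNC_FILE_RANGE_WAIT_BEFORE;
      start = 0;  // assumption: cover earlier ranges so the wait bounds dirty data
    }
    if (sync_file_range(fd_, static_cast<off64_t>(start),
                        static_cast<off64_t>(written_ - start), flags) != 0) {
      ++sync_errors_;  // some filesystems do not support this call
    }
    synced_ = written_;
    ++range_syncs_;
  }

  int fd_;
  uint64_t bytes_per_sync_;
  bool strict_;
  uint64_t written_ = 0;
  uint64_t synced_ = 0;
  uint64_t range_syncs_ = 0;
  uint64_t sync_errors_ = 0;
};
```

The writer still needs fsync or fdatasync when the file is finished. sync_file_range only starts (or waits for) writeback of data pages and does not make file metadata durable.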















