Small Datum: Can RocksDB use all of the IOPs from fast storage?

Last October someone asked on the RocksDB email list whether RocksDB can use all of the IO capacity of a fast storage device. They provided more context -- they want to use it to cache objects that are ~4kb and give RocksDB as little memory as possible and the workload is mostly point queries (RocksDB Get).

That person provided a shell script to show how they used db_bench to test this. I started with that script, we had some discussion on that thread and then I went quiet. At last I have some results to share. To be clear, I am trying to answer two questions. First, can the QPS rate for RocksDB approach the IO capacity of the storage system. Second, can the IO rate for a RocksDB benchmark approach the IO capacity of the storage system. And the context is a for a read-mostly/large-object workload that gives RocksDB as little memory as possible. The RocksDB workload should do ~1 IO/query for the answer to the first question to be true. If it does much more than 1 per query then we are trying to answer the second question.

Results vary based on whether the block cache is too small versus large enough. Unfortunately, for many workloads it isn't easy to determine what is large enough. By large enough I mean that the metadata working set fits in the block cache (bloom filters and index blocks). Alas, making this estimate can be hard especially with a database that grows or shrinks over time. Partitioned metadata is a solution if you don't want to figure out the proper size for your block cache (which isn't easy) or if you don't have enough memory.

I took too long to writeup these experiments and lost some details with respect to the hardware used and benchmark command lines. Also, I feel like a character from Severance while doing writeups like this. There is much time looking for scary numbers.

tl;dr

In some cases RocksDB can saturate the IO capacity for a server for an IO-bound and read-heavy workload -- either efficiently (doing 1 IO/query or inefficiently (doing extra IOs/query).
You can spend on memory to reduce IO demand or spend on IO to reduce memory demand.
The price to pay for a too-small block cache is more IO/query. That price is larger with non-partitioned metadata.
When the block cache is large enough the price to pay for partitioned metadata is more CPU/query and more IO/query. However this price appears to be larger with big servers than with small and explaining that is on my TODO list.
One of the server types (Beelink) gets more IOPs from fio with O_DIRECT+filesystem than it does from the same SSD accessed as a raw device. I have no idea why.

Well, actually

Can RocksDB saturate an IO device? It depends and there are several factors to consider:

CPU - it takes CPU to do IO. On a variety of hosts the CPU overhead per IO is ~10 usecs from Linux (kernel and userland, measured with fio) and then in the best case RocksDB uses another ~20 usecs/query for point queries that read ~4kb values from storage per query. So in the best case that is 30 usecs of CPU time per IO which means the peak rate is ~33k queries per core. In this case I need at least 10 CPU cores to consume ~330k IO/s. The 20 usecs/query for RocksDB ignores decompression which adds another 1 to 5 usecs per block read from storage (~1 usec for lz4, ~5 for zstd).
memory - you can spend more on memory (larger RocksDB block cache) to save on IO or you can spend more on IO to save on memory (smaller RocksDB block cache). In the best case for an IO-bound workload there is one read from storage per query. To achieve that all (or most) of the RocksDB metadata (bloom filter, index) blocks must be in memory.
partitioned metadata blocks - the db_bench option --partition_index_and_filters can be set to partition index and bloom filter blocks (split them into pieces). The cost of partitioned index and filter blocks is more CPU overhead (discussed below) but the benefit is large for workloads when the working set doesn't fit in the RocksDB block cache. When all metadata can't be cached and partitioning isn't enabled then there will be frequent large IO requests (large == several hundred KB) to read index and filter blocks and this is bad for performance. Using partitioned metadata blocks avoids this problem. If you don't use partitioned metadata blocks then you will have to manually tune, or guess, at the amount of memory needed for the RocksDB block cache to avoid the problem.
IO capacity - obviously it is easier for RocksDB to saturate the IO capacity of a slow storage device (spinning disk or cloud storage) than a fast one (local attached SSD). But another important part of the context is concurrency. For most storage systems the IO capacity is a function of concurrency -- with more concurrency there will be more IO/second up to a point.

One thing to keep in mind is that RocksDB must read more than 4kb from storage to get a 4kb value. How much more is a function of software (filesystem, EBS vs local attach), hardware (storage devices have a minimal read size of 512b or 4096b) and configuration (O_DIRECT vs buffered IO).

I didn't enable the option to align blocks, db_bench --block_align is false by default for Leveled compaction and there is no option for block align with BlobDB. There is a trade to be had with --block_align -- when enabled it might reduce IOPs and bytes transferred from storage per user query but when disabled the database uses less space.
The workload fetches 4kb of user data per point query and RocksDB adds a few bytes of metadata to a data block. Thus, without compression (I didn't enable it) RocksDB must fetch a bit more than 4kb of data to get a 4kb value so it always must fetch 4kb + one sector (sector might be 512 or 4096 bytes).

Hardware

I used a variety of hardware:

work

80 HW threads with hyperthreading enabled, 256G RAM, several local attach NVMe SSD
tests used 8 x 300G files and were run for 1, 5, 10, 20, 30 and 40 threads

Intel NUC with 4 cores, 16G RAM and local attach NVMe (see here)
tests used 8 x 20G files and were run for 1, 4, 8, 16, 24 and 32 threads

Beelink

Beelink with 8 AMD cores, 16G RAM and local attach NVMe (see here). All results are provided for this with the original (Kingston) SSD and some results with the new (Samsung 980 Pro) SSD
tests used 8 x 20G files and were run for 1, 4, 8, 16, 24 and 32 threads

c7g.16xlarge with 64 cores, 128G RAM and EBS (io2, 5T, 100k+ IOPs)
tests used 8 x 300G files and were run for 1, 2, 5, 10, 20, 30, 40 and 60 threads

c2-standard-60 with 30 cores (hyperthreading disabled), 240G RAM and SSD Persistent disk (maybe 5T)
tests used 8 x 300G files and were run for 1, 5, 10, 20 and 30 threads

While I lost a few details with respect to the cloud servers tested, the real IO capacity is discovered by running fio as I want to know how many IO/second can be done at different concurrency levels. For the fio tests that used raw devices I made sure the filesystem on the raw device was at least 90% full.

All servers except for work used Ubuntu 22.04. All used XFS.

Benchmark

I first used fio to quantify the IO throughput I could get at different concurrency levels. The script is here and was repeated for raw devices, O_DIRECT and buffered IO. I used the psync IO engine so pread (synchronous IO) was used and there would be one pending IO per thread to match what RocksDB does in most cases.

I then used db_bench via the runit.sh script that invokes the q.sh script that invokes db_bench. Tests were repeated for Leveled compaction and integrated BlobDB in each case with partitioned metadata blocks enabled and disabled.

The benchmarks with fio are:

fio.raw.4kb - fio does 4kb random reads on a raw device
fio.dir.4kb - fio does 4kb random reads with O_DIRECT
fio.raw.8kb - fio does 8kb random reads on a raw device
fio.dir.8kb - fio does 8kb random reads with O_DIRECT

The benchmarks with db_bench are repeated with leveled compaction and BlobDB using different sizes for the RocksDB block cache. For the work server the block cache sizes were 2G, 4G, 8G, 16G, 24G and 32G and in that test 24G was enough to cache all RocksDB metadata. For NUC and Beelink the block cache sizes were 256M, 512M, 1G, 2G, 4G, 8G, 12G and 2G was enough to cache RocksDB metadata. For GCP the block cache sizes were 2G, 4G, 8G, 16G and 24G. I didn't test enough block cache sizes for AWS.

The RocksDB (db_bench) tests were repeated with and without partitioned index and filters. Most tabs on the spreadsheet (work, nuc, beelink, gcp) have results for the following. The beelink, new SSD tab only has fio results for the Beelink server after I replaced the Kingston NVMe with a Samsung 980 Pro.

4096b.dir.blob.p1 - uses O_DIRECT, BlobDB and partitioned metadata
4096b.dir.blob.p0 - uses O_DIRECT, BlobDB but doesn't use partitioned metadata
4096b.dir.lev.p1 - uses O_DIRECT, Leveled compaction and partitioned metadata
4096b.dir.lev.p0 - uses O_DIRECT, Leveled compaction but doesn't use partitioned metadata

Initially I didn't make sure the filesystem mounted on the SSD was at least 90% full before running the fio.raw.4kb and fio.raw.8kb tests. That was a mistake that I soon fixed because raw reads to space that has been trim'ed from an SSD can be a noop and you get bogus perf results.

Results

A spreadsheet with all of the details is here. The spreadsheet lists the absolute and relative QPS by concurrency level. The relative QPS is relative to the base case (fio with a raw device doing 4kb reads). For each tab on the spreadsheet the left hand side has the throughput (reads/s) and the right side has the relative QPS.

While I won't explain how to decode them, files with more performance details including IO latency and CPU/query are here for work, NUC, Beelink, AWS and GCP.

Results from fio

The interesting results:

The bottleneck is IOPs, transfer rate or both.

It is IOPs when the 4kb and 8kb results are similar.
It is transfer rate when the 4kb results are significantly better.
It is both when the 4kb results are better but not significantly better
For now I claim that similar means 10% or less difference (IOPs bottleneck), significantly better means a 30% or more difference (transfer rate bottleneck) and anything in between means the bottleneck is both IOPs and transfer rate.
Note that this classification changes with concurrency for some devices.

Using a raw device gets the best throughput for all devices except the Beelink. For the Beelink the fio.dir.4kb (O_DIRECT, 4kb reads) run has the best throughput. I have yet to learn why but if you scroll to the Update 1 section at the end of this post I provide details that start to explain this.

Classifying the bottleneck:

IOPs at low concurrency, both at mid concurrency, transfer rate at high concurrency

Beelink with original SSD

Both at all concurrency levels

Beelink with new SSD

Transfer rate at all concurrency levels
The results confuse me because the read IOPs is larger with O_DIRECT+filesystem than accessing the SSD as a raw device.

IOPs at all concurrency levels. This is expected with AWS where a read <= 256kb counts as one IO and EBS provided ~80k IOPs.

For raw it was IOPs at all concurrency levels
For O_DIRECT it was both at low and mid concurrency, IOPs at high

Work

IOPs at all concurrency levels

Results from db_bench

I share 4 graphs for two server types; Work and NUC. Graphs for the other server types are in the spreadsheet and all look similar to the graphs I share here. The graphs are from db_bench for a read-only workload that reads 4kb values and the benchmark is repeated for different block cache sizes. The goal is to determine the impact of using a larger block cache.

The four graphs are:

BlobDB with partitioned metadata
BlobDB without partitioned metadata
Leveled compaction with partitioned metadata
Leveled compaction without partitioned metadata

For each server type the first 3 graphs look the same while the last graph (Leveled compaction without partitioned metadata) shows the impact of partitioned metadata. There reduction in performance from using a too-small cache is much larger in this case. While the performance problem from using a too-small cache can be diagnosed at run-time it isn't trivial to notice and I have wasted time on this.

Results: db_bench, work server

Scroll down to the fourth graph. It doesn't look like the first three.

Results: db_bench, NUC

Scroll down to the fourth graph. It doesn't look like the first three.

QPS for partitioned vs non-partitioned: graphs

These graphs show (QPS with partitioned metadata / QPS without partitioned metadata) for Leveled compaction. There is a performance penalty for partitioned metadata (relative QPS is much less than 1.0) for the Work and GCP servers but not for the NUC and Beelink servers (explaining that is on my TODO list). Regardless, there is a huge benefit from using partitioned metadata when the block cache is too small

QPS for partitioned vs non-partitioned: tables

Tables with data for graphs from the previous section.

Work server

QPS for p1 / p0	1	5	10	20	30	40
2G cache	2.23	2.43	2.69	3.16	3.60	4.00
4G cache	2.13	2.32	2.57	3.02	3.42	3.82
8G cache	1.78	1.94	2.13	2.47	2.81	3.12
16G cache	0.71	0.76	0.81	0.87	0.91	0.96
24G cache	0.63	0.69	0.71	0.73	0.73	0.74
32G cache	0.67	0.81	0.85	0.87	0.88	0.88

NUC

QPS for p1 / p0	1	4	8	16	24	32
256M cache	3.04	4.65	6.80	9.22	10.66	11.76
512M cache	2.48	4.10	5.57	7.32	8.39	9.25
1G cache	0.87	1.66	1.75	1.94	2.15	2.33
2G cache	0.91	0.90	0.93	0.90	0.95	0.95
4G cache	0.98	0.99	0.99	0.95	1.00	1.00
8G cache	0.97	0.86	0.99	0.95	1.00	1.00
12G cache	0.97	0.99	0.98	0.95	0.99	0.99

Beelink

QPS for p1 / p0	1	4	8	16	24	32
256M cache	1.79	3.36	4.95	6.65	7.20	7.26
512M cache	1.54	2.81	4.09	5.46	5.93	6.01
1G cache	0.83	1.11	1.39	1.72	1.80	1.80
2G cache	0.93	0.94	0.95	0.96	0.96	0.95
4G cache	0.98	0.99	1.01	1.01	1.02	1.02
8G cache	0.98	1.00	1.01	1.03	1.03	1.02
12G cache	0.99	1.01	1.01	1.02	1.02	1.02

GCP

QPS for p1 / p0	1	5	10	20	30
2G cache	1.53	1.44	1.87	2.37	2.37
4G cache	1.40	1.34	1.68	2.16	2.16
8G cache	1.02	1.10	1.14	1.38	1.38
16G cache	0.46	0.59	0.65	0.64	0.64
24G cache	0.46	0.66	0.82	0.81	0.81

Partitioned vs Non-partitioned metadata

This explains the performance impact from Leveled compaction with a too-small block cache using metrics collected by iostat and vmstat that are normalized (divided by) the QPS to determine the IO and CPU cost per query. The data is here for the Work, GCP, NUC and Beelink server types.

All server types show a performance problem with Leveled compaction and non-partitioned metadata when the block cache is too small. However, the comparison between partitioned and non-partitioned metadata when the block cache is large enough is interesting. When the block cache is large enough:

For the big servers (Work and GCP) the performance with non-partitioned metadata is better than with partitioned metadata. Also, HW efficiency is worse with partitioned metadata as there is more IO/query (1.1 vs 1.0, 1.3 vs 1.0) and more CPU/query (39 vs 31 usecs, 57 vs 41 usecs).
For the small servers the performance with non-partitioned and partitioned metadata are similar.

Explaining the perf difference for partitioned metadata between big and small servers when the block cache is large enough is on my TODO list. The perf difference between partitioned and non-partitioned that reproduces in the big and small servers can be explained by the fact that more and larger IO requests are done when the cache is too small and metadata is not partitioned. The result from that is more IO latency and more CPU overhead.

Results from the work server with a too-small block cache:

	partitioned	not partitioned
QPS	62091	19674
Response time (usecs)	322	1016
cpu usecs /query	59	188
IO reads /query	3.1	4.7
r_await (usecs)	82	348
iostat rareq-sz (KB)	6.8	66
cache size, concurrency	2G, 20	2G, 20

Results from the work server with large enough block cache:

	partitioned	not partitioned
QPS	147691	169566
Response time (usecs)	135	118
cpu usecs /query	39	31
IO reads /query	1.1	1
r_await (usecs)	90	89
iostat rareq-sz (KB)	8.1	8.2
cache size, concurrency	32G, 20	32G, 20

Results from GCP with a too-small block cache:

	partitioned	not partitioned
QPS	9947	4191
Response time (usecs)	2010	4769
cpu usecs /query	73	212
IO reads /query	3	3.6
r_await (usecs)	672	1632
iostat rareq-sz (KB)	4.4	79
cache size, concurrency	2G, 20	2G, 20

Results from GCP with a large enough block cache:

	partitioned	not partitioned
QPS	23734	29168
Response time (usecs)	843	686
cpu usecs /query	57	41
IO reads /query	1.3	1
r_await (usecs)	650	638
iostat rareq-sz (KB)	5.4	5
cache size, concurrency	24G, 20	24G, 20

Results from the NUC with a too-small block cache:

	partitioned	not partitioned
QPS	24211	3563
Response time (usecs)	330	2244
cpu usecs /query	43	539
IO reads /query	2.8	2.6
r_await (usecs)	104	630
iostat rareq-sz (KB)	3.5	100.3
cache size, concurrency	256M, 8	256M, 8

Results from the NUC with a large enough block cache:

	partitioned	not partitioned
QPS	60180	60994
Response time (usecs)	133	131
cpu usecs /query	24	22
IO reads /query	1	1
r_await (usecs)	113	110
iostat rareq-sz (KB)	4.7	4.6
cache size, concurrency	8G, 8	8G, 8

Results from the Beelink with a too-small block cache:

	partitioned	not partitioned
QPS	19962	3004
Response time (usecs)	801	5323
cpu usecs /query	48	181
IO reads /query	2.8	3.3
r_await (usecs)	273	2053
iostat rareq-sz (KB)	3.4	78.3
cache size, concurrency	256M, 16	256M, 16

Results from the Beelink with a large enough block cache:

	partitioned	not partitioned
QPS	44902	44287
Response time (usecs)	356	361
cpu usecs /query	29	24
IO reads /query	1	1
r_await (usecs)	333	341
iostat rareq-sz (KB)	4.6	4.6
cache size, concurrency	4G, 16	4G, 16

Update 1 - Beelink, fio raw vs O_DIRECT

This is an attempt to explain why I get more IOPs with O_DIRECT + XFS vs raw devices when using fio on my Beelink servers. For the O_DIRECT+XFS tests I have been creating 8 20G files on an otherwise empty device. Discard is enabled when XFS is mounted. The device is a 1 TB Samsung 980 Pro, although I can repro this with a 512G Kingston SSD. After running the O_DIRECT+XFS tests I then create more 20G files to make the device almost (at least 95%) full and then I do the raw device tests. When tested this way I get more IOPs from O_DIRECT + XFS than from raw using the Beelink server as shown above.

Then I tried the following:

Do mkfs.xfs on the Beelink for the Samsung 980 Pro, create 8 20G files, run fio to do random 4kb reads with O_DIRECT and I get ~400k IOPs at 32 threads and the average value for iostat's r_await is 70 microseconds.
Create 37 more 20G files to make the device almost full and repeat the test from step 1, again just using the first 8 20G files and I get 360k IOPs at 32 threads and the average value for iostat's r_await is now 80 microseconds.
Run the test for fio with a now almost full raw device and I get 370k IOPs at 32 threads.
Drop those 37 20G files, wait 5 minutes and repeat the test. Now I get ~400k IOPs at 32 threads.

The fio+raw test has been repeated after the server was idle for 12 hours in case there was SSD GC/grooming needed after making the device almost full but IOPs do not change after the 12 hour wait.

So I still don't understand this, but I will claim that fio+raw gets more IOPs than fio+O_DIRECT when the device is in the same state. The confusion comes from reporting numbers above where fio+raw was run on a full device but fio+O_DIRECT was not.

Small Datum

Wednesday, February 22, 2023

Can RocksDB use all of the IOPs from fast storage?

1 comment:

Postgres 18 beta2: large server, Insert Benchmark, part 2