Thursday, October 23, 2025

How efficient is RocksDB for IO-bound, point-query workloads?

How efficient is RocksDB for workloads that are IO-bound and read-only? One way to answer this is to measure the CPU overhead from RocksDB as this is extra overhead beyond what libc and the kernel require to perform an IO. Here my focus is on KV pairs that are smaller than the typical RocksDB block size that I use -- 8kb.

By IO efficiency I mean:
    (storage read IOPs from RocksDB benchmark / storage read IOPs from fio)

And I measure this in a setup where RocksDB doesn't get much benefit from RocksDB block cache hits (database size > 400G, block cache size was 16G).

This value will be less than 1.0 in such a setup. But how much less than 1.0 will it be? On my hardware the IO efficiency was ~0.85 at 1 client and ~0.88 at 6 clients. Were I to use slower storage, such as an SSD where read latency was ~200 usecs at io_depth=1, then the IO efficiency would be closer to 0.95.

Note that:

  • IO efficiency increases (decreases) when SSD read latency increases (decreases)
  • IO efficiency increases (decreases) when the RocksDB CPU overhead decreases (increases)
  • RocksDB QPS increases by ~8% for IO-bound workloads when --block_align is enabled

The overheads per 8kb block read on my test hardware were:

  • about 11 microseconds from libc + kernel
  • between 6 and 10 microseconds from RocksDB
  • ~100 usecs of IO latency at io_depth=1, ~150 usecs at io_depth=6

A simple performance model

A simple model to predict the wall-clock latency for reading a block is:
    userland CPU + libc/kernel CPU + device latency

For fio I assume that userland CPU is zero, I measured libc/kernel at ~11 usecs, and I estimate device latency at ~91 usecs. The device latency estimate comes from read-only benchmarks with fio, where fio reports the average latency as 102 usecs; that includes the ~11 usecs of CPU from libc+kernel, and 91 = 102 - 11.

This model isn't perfect, as I show below when reporting results for RocksDB, but it is good enough to predict latencies and IO efficiency as the RocksDB CPU overhead increases or decreases.
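
As a sketch, the model can be written as a few lines of Python. The constants are the fio-derived numbers from this post (~11 usecs of libc+kernel CPU, ~91 usecs of device latency) and the RocksDB CPU value is one of the measured results reported below; treat this as an illustration rather than the exact script I used.

LIBC_KERNEL_CPU_US = 11.0   # measured with fio on this hardware
DEVICE_LATENCY_US = 91.0    # estimated as 102 - 11 from the fio results

def predicted_latency_us(userland_cpu_us):
    """Predicted wall-clock usecs for one block read at io_depth=1."""
    return userland_cpu_us + LIBC_KERNEL_CPU_US + DEVICE_LATENCY_US

def predicted_iops(userland_cpu_us):
    """Predicted reads/s for one thread doing synchronous reads."""
    return 1_000_000 / predicted_latency_us(userland_cpu_us)

print(round(predicted_iops(0.0)))   # fio: no userland CPU -> ~9800, close to the measured 9884
print(round(predicted_iops(8.33)))  # readrandom at 1 thread -> ~9060 predicted, vs the 8350 measured below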

Q and A

The RocksDB API could function as a universal API for storage engines, and if new DBMS were built on it then it would be possible to combine new DBMS with new storage engines much faster than is possible today.

Persistent hash indexes are not widely implemented, but getting one that uses the RocksDB API would be interesting for workloads such as the one I run here. However, there are fewer use cases for a hash index (no range queries) than for a range index like an LSM so it is harder to justify the investment in such work.

Q: What is the CPU overhead from libc + kernel per 8kb read?
A: About 11 microseconds on this CPU.

Q: Can you write your own code that will be faster than RocksDB for such a workload?
A: Yes, you can.

Q: Should you write your own library for this?
A: It depends on how many features you need and the opportunity cost in spending time writing that code vs doing something else.

Q: Will RocksDB add features to make this faster?
A: That is for them to answer. But all projects have a complexity budget. Code can become too expensive to maintain when that budget is exceeded. There is also the opportunity cost to consider as working on this delays work on other features.

Q: Does this matter?
A: It matters more when storage is fast (read latency less than 100 usecs). As read response time grows the CPU overhead from RocksDB becomes much less of an issue.

Benchmark hardware

I ran tests on a Beelink SER7 with a Ryzen 7 7840HS CPU that has 8 cores and 32G of RAM. The storage device is a Crucial CT1000P3PSSD8 (Crucial P3, 1TB) using ext-4 with discard enabled. The OS is Ubuntu 24.04.

From fio, the average read latency for the SSD is 102 microseconds using O_DIRECT with io_depth=1 and the sync engine.

CPU frequency management makes it harder to claim that the CPU runs at X GHz, but the details are:

$ cpupower frequency-info

analyzing CPU 5:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 5
  CPUs which need to have their frequency coordinated by software: 5
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 1.60 GHz - 3.80 GHz
  available frequency steps:  3.80 GHz, 2.20 GHz, 1.60 GHz
  available cpufreq governors: conservative ... powersave performance schedutil
  current policy: frequency should be within 1.60 GHz and 3.80 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 3.79 GHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: no

Results from fio

I started with fio using a command-line like the following for NJ=1 and NJ=6 to measure average IOPs and the CPU overhead per IO.

fio --name=randread --rw=randread --ioengine=sync --numjobs=$NJ --iodepth=1 \
  --buffered=0 --direct=1 \
  --bs=8k \
  --size=400G \
  --randrepeat=0 \
  --runtime=600s --ramp_time=1s \
  --filename=G_1:G_2:G_3:G_4:G_5:G_6:G_7:G_8  \
  --group_reporting

Results are:

legend:
* iops - average reads/s reported by fio
* usPer, syPer - user, system CPU usecs per read
* cpuPer - usPer + syPer
* lat.us - average read latency in microseconds
* numjobs - the value for --numjobs with fio

iops    usPer   syPer   cpuPer  lat.us  numjobs
 9884   1.351    9.565  10.916  101.61  1
43782   1.379   10.642  12.022  136.35  6

Results from RocksDB

I used an edited version of my benchmark helper scripts that run db_bench. In this case the sequence of tests was:

  1. fillseq - loads the LSM tree in key order
  2. revrange - I ignore the results from this
  3. overwritesome - overwrites 10% of the KV pairs
  4. flush_mt_l0 - flushes the memtable, waits, compacts L0 to L1, waits
  5. readrandom - does random point queries when LSM tree has many levels
  6. compact - compacts LSM tree into one level
  7. readrandom2 - does random point queries when LSM tree has one level, bloom filters enabled
  8. readrandom3 - does random point queries when LSM tree has one level, bloom filters disabled

I use readrandom, readrandom2 and readrandom3 to vary the amount of work that RocksDB must do per query and measure the CPU overhead of that work. The most work happens with readrandom as the LSM tree has many levels and there are bloom filters to check. The least work happens with readrandom3 as the LSM tree only has one level and there are no bloom filters to check.

Initially I ran tests with --block_align not set, as that reduces space-amplification (less padding), but then 8kb reads are likely to cross file system page boundaries and become larger reads. Given that the focus here is on IO efficiency, I used --block_align.
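
For reference, a hypothetical db_bench command line for the readrandom step is below. My helper scripts set many more options, so treat the flag values (paths, duration, cache size and so on) as illustrative assumptions rather than my exact invocation.

./db_bench --benchmarks=readrandom --use_existing_db=1 --db=/data/rocksdb \
  --duration=600 --threads=1 \
  --block_size=8192 --block_align=true --cache_size=17179869184 \
  --use_direct_reads=true --compression_type=none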

A summary of the results for db_bench with 1 user (thread) and 6 users (threads) is:

--- 1 user
qps     iops    reqsz   usPer   syPer   cpuPer  rx.lat  io.lat  test
8282     8350   8.5     11.643   7.602  19.246  120.74  101     readrandom
8394     8327   8.7      9.997   8.525  18.523  119.13  105     readrandom2
8522     8400   8.2      8.732   8.718  17.450  117.34  100     readrandom3

--- 6 users
38391   38628   8.1     14.645   7.291  21.936  156.27  134     readrandom
39359   38623   8.3     10.449   9.346  19.795  152.43  144     readrandom2
39669   38874   8.0      9.459   9.850  19.309  151.24  140     readrandom3

From the results above and the derived values below:
  • IO efficiency is approximately 0.84 at 1 client and 0.88 at 6 clients
  • With 1 user RocksDB adds between 6.534 and 8.330 usecs of CPU time per query compared to fio depending on the amount of work it has to do. 
  • With 6 users RocksDB adds between 7.287 and 9.914 usecs of CPU time per query
  • IO latency as reported by RocksDB is ~20 usecs larger than as reported by iostat. But I have to re-read the RocksDB source code to understand where and how it is measured.

legend:
* io.eff - IO efficiency as (db_bench storage read IOPs / fio storage read IOPs)
* us.inc - incremental user CPU usecs per read as (db_bench usPer - fio usPer)
* cpu.inc - incremental total CPU usecs per read as (db_bench cpuPer - fio cpuPer)

--- 1 user

        io.eff          us.inc          cpu.inc         test
        ------          ------          ------
        0.844           10.292           8.330          readrandom
        0.842            8.646           7.607          readrandom2
        0.849            7.381           6.534          readrandom3

--- 6 users

        io.eff          us.inc          cpu.inc         test
        ------          ------          ------
        0.882           13.266           9.914          readrandom
        0.882            9.070           7.773          readrandom2
        0.887            8.080           7.287          readrandom3
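
The derived values are just ratios and differences of the tables above. A minimal sketch using the 1-thread fio row and the 1-user readrandom row (numbers copied from the tables above):

fio = dict(iops=9884, usPer=1.351, cpuPer=10.916)
rr  = dict(iops=8350, usPer=11.643, cpuPer=19.246)

io_eff  = rr["iops"] / fio["iops"]       # io.eff, ~0.84
us_inc  = rr["usPer"] - fio["usPer"]     # us.inc, 10.292
cpu_inc = rr["cpuPer"] - fio["cpuPer"]   # cpu.inc, 8.330
print(round(io_eff, 3), round(us_inc, 3), round(cpu_inc, 3))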

Evaluating the simple performance model

I described a simple performance model earlier in this blog post and now it is time to see how well it does for RocksDB. First I will use values from the 1 user/client/thread case:
  • IO latency is ~91 usecs per fio
  • libc+kernel CPU overhead is ~11 usecs per fio
  • RocksDB CPU overhead is 8.330, 7.607 and 6.534 usecs for readrandom, *2 and *3

The model is far from perfect. Using predicted IOPs = 1,000,000 / (device latency + libc/kernel CPU + RocksDB CPU, all in usecs), it predicts that RocksDB will sustain:
  • 9063 IOPs for readrandom, when it actually did 8350
  • 9124 IOPs for readrandom2, when it actually did 8327
  • 9214 IOPs for readrandom3, when it actually did 8400
Regardless, the model is a good way to think about the problem.

The impact from --block_align

RocksDB QPS increases by between 7% and 9% when --block_align is enabled. Enabling it reduces read-amp and increases space-amp. But given the focus here is on IO efficiency I prefer to enable it. RocksDB QPS increases with it enabled because fewer storage read requests cross file system page boundaries, thus the average read size from storage is reduced (see the reqsz column below).
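
To make the page-boundary point concrete, here is a minimal sketch (my own illustration, assuming 4kb file system pages and reads that must be page-aligned as with O_DIRECT) of how large the storage read must be for an aligned vs an unaligned 8kb block:

PAGE = 4096

def bytes_read(offset, block_size=8192, page=PAGE):
    # Round the start down and the end up to page boundaries, as aligned reads require.
    start = (offset // page) * page
    end = ((offset + block_size + page - 1) // page) * page
    return end - start

print(bytes_read(0))      # page-aligned 8kb block -> 8192 bytes (2 pages)
print(bytes_read(5000))   # unaligned 8kb block -> 12288 bytes (3 pages)

With --block_align enabled, data blocks start on aligned offsets so the smaller read is the common case.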

legend:
* qps - RocksDB QPS
* iops - average reads/s reported by fio
* reqsz - average read request size in KB per iostat
* usPer, syPer, cpuPer - user, system and (user+system) CPU usecs per read
* rx.lat - average read latency in microseconds, per RocksDB
* io.lat - average read latency in microseconds, per iostat
* test - the db_bench test name

- block_align disabled
qps     iops    reqsz   usPer   syPer   cpuPer  rx.lat  io.lat  test
7629     7740   8.9     12.133   8.718  20.852  137.92  111     readrandom
7866     7813   9.1     10.094   9.098  19.192  127.12  115     readrandom2
7972     7862   8.6      8.931   9.326  18.257  125.44  110     readrandom3

- block_align enabled
qps     iops    reqsz   usPer   syPer   cpuPer  rx.lat  io.lat  test
8282     8350   8.5     11.643   7.602  19.246  120.74  101     readrandom
8394     8327   8.7      9.997   8.525  18.523  119.13  105     readrandom2
8522     8400   8.2      8.732   8.718  17.450  117.34  100     readrandom3

Async IO in RocksDB

Per the wiki, RocksDB can do async IO for point queries that use MultiGet. That is done via coroutines and requires linking with Folly. My builds do not support that today and because my focus is on efficiency rather than throughput I did not try it for this test.

Flamegraphs

Flamegraphs are here for readrandom, readrandom2 and readrandom3.

A summary of where CPU time is spent based on the flamegraphs.

Legend:
* rr, rr2, rr3 - readrandom, readrandom2, readrandom3
* libc+k - time in libc + kernel
* checksm - verify data block checksum after read
* IBI:Sk - IndexBlockIter::SeekImpl
* DBI:Sk - DataBlockIter::SeekImpl
* LRU - lookup, insert blocks in the LRU, update metrics
* bloom - check bloom filters
* BSI - BinarySearchIndexReader::NewIterator
* File - FilePicker::GetNextFile, FindFileInRange
* other - other parts of the call stack, from DBImpl::Get and functions called by it


Percentage of samples
        rr      rr2     rr3
libc+k  37.30   42.22   50.92
checksm  3.76    2.66    2.91
IBI:Sk   7.07    7.36    7.76
DBI:Sk   3.05    2.15    1.96
LRU      5.19    6.19    6.02
bloom   18.35    8.14    0
BSI      2.28    4.02    3.12
File     3.74    3.34    4.44
other   19.26   23.92   22.87


Monday, October 20, 2025

Determine how much concurrency to use on a benchmark for small, medium and large servers

What I describe here works for me given my goal, which is to find performance regressions. A benchmark run at low concurrency is used to find regressions from CPU overhead. A benchmark run at high concurrency is used to find regressions from mutex contention. A benchmark run at medium concurrency might help find both.

My informal way for classifying servers by size is:

  • small - has less than 10 cores
  • medium - has between 10 and 20 cores
  • large - has more than 20 cores

How much concurrency?

I almost always co-locate benchmark clients and the DBMS on the same server. This comes at a cost (less CPU and RAM is available for the DBMS) and might have odd artifacts because clients in the real world are usually not co-located. But it has benefits that matter to me. First, I don't worry about variance from changes in network latency. Second, this is much easier to set up.

I try to not oversubscribe the CPU when I run a benchmark. For benchmarks where there are few waits for reads from or writes to storage, I will limit the number of benchmark users so that the concurrent connection count is less than the number of CPU cores (cores, not vCPUs), and I almost always use servers with Intel Hyperthreading and AMD SMT disabled. I do this because DBMS performance suffers when the CPU is oversubscribed, and back when I was closer to production we did our best to avoid that state.

Even for benchmarks that have some benchmark steps where the workload will have IO waits, I will still limit the amount of concurrency unless all benchmark steps that I measure will have IO waits.

Assuming a benchmark is composed of a sequence of steps (at minimum: load, query) then I consider the number of concurrent connections per benchmark user. For sysbench, the number of concurrent connections is the same as the number of users, although sysbench uses the --threads argument to set the number of users. I am just getting started with TPROC-C via HammerDB and that appears to be like sysbench with one concurrent connection per virtual user (VU).

For the Insert Benchmark the number of concurrent connections is 2X the number of users on the l.i1 and l.i2 steps and then 3X the number of users on the range-query read-write steps (qr*) and the point-query read-write steps (qp*). And whether or not there are IO-waits for these users is complicated, so I tend to configure the benchmark so that the number of users is no more than half the number of CPU cores.

Finally, I usually set the benchmark concurrency level to be less than the number of CPU cores because I want to leave some cores for the DBMS to do the important background work, which is mostly MVCC garbage collection -- MyRocks compaction, InnoDB purge and dirty page writeback, Postgres vacuum.
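
A minimal sketch of that rule of thumb. The function name and the one-half factor are my own simplification of what is described above, and the connections-per-user values come from the sysbench and Insert Benchmark descriptions:

def pick_users(cpu_cores, connections_per_user=1, core_fraction=0.5):
    """Choose a user count so that (users * connections_per_user) stays at or
    below a fraction of the CPU cores, leaving cores for background work."""
    return max(1, int(cpu_cores * core_fraction) // connections_per_user)

print(pick_users(48))                          # sysbench: 1 connection per user -> 24
print(pick_users(48, connections_per_user=3))  # Insert Benchmark read-write steps -> 8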

Thursday, October 16, 2025

Why is RocksDB spending so much time handling page faults?

This week I was running benchmarks to understand how fast RocksDB could do IO, and then compared that to fio to understand the CPU overhead added by RocksDB. While looking at flamegraphs taken during the benchmark I was surprised to see that about 20% of the samples were from page fault handling.

The lesson here is to run your benchmark long enough to reach a steady state before you measure things or there will be confusion. And I was definitely confused when I first saw this. Perhaps my post saves time for the next person who spots this.

The workload is db_bench with a database size that is much larger than memory and read-only microbenchmarks for point lookups and range scans.

Then I wondered if this was a transient issue that occurs while RocksDB is warming up the block cache and growing process RSS until the block cache has been fully allocated.

While b-trees as used by Postgres and MySQL will do a large allocation at process start, RocksDB does an allocation per block read, and when the block is evicted the allocation is freed. This can be a stress test for a memory allocator which is why jemalloc and tcmalloc work better than glibc malloc for RocksDB. I revisit the mallocator topic every few years and my most recent post is here.

In this case I use RocksDB with jemalloc. Even though per-block allocations are transient, the memory used by jemalloc is mostly not transient. While there are cases where jemalloc can return memory to the OS, with my usage that is unlikely to happen.

Were I to let the benchmark run for a long enough time, then eventually jemalloc would finish getting memory from the OS. However, my tests were running for about 10 minutes and doing about 10,000 block reads per second while I had configured RocksDB to use a block cache that was at least 36G and the block size was 8kb. So my tests weren't running long enough for the block cache to fill, which means that during the measurement period:

  • jemalloc was still asking for memory
  • block cache eviction wasn't needed and after each block read a new entry was added to the block cache

The result in this example is that 22.69% of the samples are from page fault handling. That is the second large stack from the left. The RocksDB code where it happens is rocksdb::BlockFetcher::ReadBlockContents.

When I run the benchmark for more time, the CPU overhead from page fault handling goes away.
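
A rough back-of-envelope estimate, using the numbers above (a block cache of at least 36G, 8kb blocks, ~10,000 block reads/second) and assuming every read inserts a new block while the cache is cold:

cache_gib = 36          # block cache size (it was at least this large)
block_kib = 8           # block size
reads_per_sec = 10_000  # block reads per second during the test

blocks_to_fill = cache_gib * 1024 * 1024 // block_kib
minutes_to_fill = blocks_to_fill / reads_per_sec / 60
print(minutes_to_fill)  # ~7.9 minutes with a 36G cache, longer with a larger one

So much or all of a 10-minute run happens while the block cache and jemalloc are still growing, which is consistent with the page-fault overhead fading once the benchmark runs for longer.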




Tuesday, October 14, 2025

Is it time for TPC-BLOB?

If you want to store vectors in your database then what you store as a row, KV pair or document is likely to be larger than the fixed-page size (when your DBMS uses fixed-page sizes) and you will soon care about efficient and performant support for large objects. I assume this support hasn't been the top priority for many DBMS implementations and there will be some performance bugs.

In a SQL DBMS, support for large objects will use the plumbing created to handle LOB (Large OBject) datatypes. We should define what the L in LOB means here, and I will wave my hands and claim larger than a fixed-size page in your favorite DBMS but smaller than 512kb, because I limit my focus to online workloads.

Perhaps now is the time for industry standard benchmarks for workloads with large objects. Should it be TPC-LOB or TPC-BLOB?

Most popular DBMS use fixed-size pages whether that storage is index-organized via an update-in-place b-tree (InnoDB) or heap-organized (Postgres, Oracle). For rows that are larger than the page size, which is usually between 4kb and 16kb, the entire row or largest columns will be stored out of line and likely split across several pages in the out of line storage. When the row is read, additional reads will be done to gather all of the too-large parts from the out of line locations.

This approach is far from optimal as there will be more CPU overhead, more random IO and might be more wasted space. But this was good enough because support for LOBs wasn't a priority for these DBMS as their focus was on OLTP where rows were likely to be smaller than a fixed-size page.

Perhaps by luck, perhaps it was fate, but WiredTiger is a great fit for MongoDB because it is more flexible about page sizes. And it is more flexible because it isn't an update-in-place b-tree, instead it is a copy-on-write random (CoW-R) b-tree that doesn't need or use out-of-line storage, although for extra large documents there might be a benefit from out-of-line.

MyRocks, and other LSM-based DBMS, also don't require out-of-line storage but they can benefit from it as shown by WiscKey and other engines that do key-value separation. Even the mighty RocksDB has an implementation of key-value separation via BlobDB.

Monday, October 13, 2025

Postgres 18.0 vs sysbench on a 32-core server

This is yet another great result for Postgres 18.0 vs sysbench. This time I used a 32-core server. Results for a 24-core server are here. The goal for this benchmark is to check for regressions from new CPU overhead and mutex contention.

I repeated the benchmark twice because I had some uncertainty about platform variance (HW and SW) on the first run.

tl;dr, from Postgres 17.6 to 18.0

  • There might be regressions from 17.6 to 18.0 but they are small (usually <= 3%)

tl;dr, from Postgres 12.22 through 18.0

  • the hot-points test is almost 2X faster starting in 17.6
  • scan is ~1.2X faster starting in 14.19
  • all write tests are much faster starting in 17.6

Builds, configuration and hardware

I compiled Postgres from source for versions 12.22, 13.22, 14.19, 15.14, 16.10, 17.6, and 18.0.

The server is a Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM and an AMD Ryzen Threadripper PRO 5975WX with 32-Cores. The OS is Ubuntu 24.04 and storage is a 2TB m.2 SSD with ext-4 and discard enabled.

Prior to 18.0, the configuration file was named conf.diff.cx10a_c32r128 and is here for 12.22, 13.22, 14.19, 15.14, 16.10 and 17.6.

For 18.0 I tried 3 configuration files: x10b with io_method=sync, x10c with io_method=worker and x10d with io_method=io_uring.

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres.

The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.

The benchmark is run with 24 clients and 8 tables with 10M rows per table. The purpose is to search for regressions from new CPU overhead and mutex contention.

I ran the benchmark twice. In the first run, there were several weeks between getting results for the older Postgres releases and Postgres 18.0, so I am less certain about variance from the hardware and software. One concern is changes in daily temperature because I don't have a climate-controlled server room. Another concern is changes from updating my OS install.

In the second run, all results were collected within 7 days and I am less concerned about variance there.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for base version)
When the relative QPS is > 1 then some version is faster than the base version. When it is < 1 then there might be a regression. Values from iostat and vmstat divided by QPS are also provided here. These can help to explain why something is faster or slower because they show how much HW is used per request.

I present results for:
  • versions 12 through 18 using 12.22 as the base version
  • versions 17.6 and 18.0 using 17.6 as the base version

Results: Postgres 17.6 and 18.0

All files are here.

Results per microbenchmark from vmstat and iostat are here for the first and second run.

Some comments:
  • 18.0 looks better relative to 17.6 in the second run and I explain my uncertainty about the first run above
  • But I am skeptical about the great result for 18.0 on the full scan test (scan_range=100) in the second run. That might be variance induced by vacuum.
  • There might be regressions from 17.6 to 18.0 but they are small (usually <= 3%)
  • The small regression in read-only_range=10 might be from new optimizer overhead, because it doesn't reproduce when the length of the range query is increased -- see read-only_range=100 and read-only_range=10000.

Relative to: 17.6
col-1 : 18.0 with the x10b config that uses io_method=sync
col-2 : 18.0 with the x10c config that uses io_method=worker
col-3 : 18.0 with the x10d config that uses io_method=io_uring

col-1   col-2   col-3   point queries, first run
0.97    0.99    0.94    hot-points_range=100
0.97    0.98    0.96    point-query_range=100
1.00    0.99    0.99    points-covered-pk_range=100
0.99    1.00    1.00    points-covered-si_range=100
0.98    0.99    0.98    points-notcovered-pk_range=100
0.99    0.99    0.99    points-notcovered-si_range=100
1.00    1.00    0.99    random-points_range=1000
0.98    0.98    0.98    random-points_range=100
0.99    0.98    0.99    random-points_range=10

col-1   col-2   col-3   point queries, second run
0.98    1.00    0.99    hot-points_range=100
1.00    1.00    0.99    point-query_range=100
1.01    1.01    1.01    points-covered-pk_range=100
1.00    1.01    1.00    points-covered-si_range=100
1.00    0.98    1.00    points-notcovered-pk_range=100
1.00    1.00    1.01    points-notcovered-si_range=100
1.00    1.01    1.01    random-points_range=1000
1.00    0.99    1.01    random-points_range=100
0.99    0.99    1.00    random-points_range=10

col-1   col-2   col-3   range queries without aggregation, first run
0.97    0.98    0.95    range-covered-pk_range=100
0.97    0.97    0.94    range-covered-si_range=100
0.98    0.98    0.97    range-notcovered-pk_range=100
0.99    0.99    0.98    range-notcovered-si_range=100
0.97    0.99    0.96    scan_range=100

col-1   col-2   col-3   range queries without aggregation, second run
0.99    0.99    0.98    range-covered-pk_range=100
0.99    0.99    0.99    range-covered-si_range=100
0.98    0.99    0.98    range-notcovered-pk_range=100
0.99    1.00    1.00    range-notcovered-si_range=100
1.24    1.24    1.22    scan_range=100

col-1   col-2   col-3   range queries with aggregation, first run
0.99    1.00    1.00    read-only-count_range=1000
1.01    1.01    1.01    read-only-distinct_range=1000
1.01    1.01    1.00    read-only-order_range=1000
1.04    1.04    1.04    read-only_range=10000
0.99    0.99    0.98    read-only_range=100
0.97    0.98    0.97    read-only_range=10
0.99    0.98    0.98    read-only-simple_range=1000
0.99    0.99    0.99    read-only-sum_range=1000

col-1   col-2   col-3   range queries with aggregation, second run
0.99    1.00    1.00    read-only-count_range=1000
1.01    1.01    1.00    read-only-distinct_range=1000
0.99    0.99    1.00    read-only-order_range=1000
1.02    1.03    1.03    read-only_range=10000
0.99    0.99    0.99    read-only_range=100
0.99    0.99    0.98    read-only_range=10
0.99    1.00    1.01    read-only-simple_range=1000
1.00    1.00    1.00    read-only-sum_range=1000

col-1   col-2   col-3   writes, first run
0.99    0.98    0.96    delete_range=100
0.99    0.96    0.98    insert_range=100
1.00    0.99    0.98    read-write_range=100
0.99    0.98    0.98    read-write_range=10
1.00    0.99    1.00    update-index_range=100
1.03    0.95    1.01    update-inlist_range=100
0.99    0.99    1.00    update-nonindex_range=100
1.00    1.00    1.01    update-one_range=100
0.98    0.99    1.00    update-zipf_range=100
0.97    0.97    0.99    write-only_range=10000

col-1   col-2   col-3   writes, second run
0.97    0.97    0.98    delete_range=100
0.99    0.99    1.00    insert_range=100
0.99    0.99    0.98    read-write_range=100
0.98    0.98    0.98    read-write_range=10
0.97    0.98    0.97    update-index_range=100
0.98    0.99    1.04    update-inlist_range=100
0.98    0.99    0.98    update-nonindex_range=100
0.99    0.99    0.98    update-one_range=100
0.98    0.99    0.98    update-zipf_range=100
0.99    0.97    0.95    write-only_range=10000

Results: Postgres 12 to 18

All files are here.

Results per microbenchmark from vmstat and iostat are here for the first and second run.
The data below with a larger font is here.

Some comments:
  • the hot-points test is almost 2X faster starting in 17.6
  • scan is ~1.2X faster starting in 14.19
  • all write tests are much faster starting in 17.6

Relative to: 12.22
col-1 : 13.22
col-2 : 14.19
col-3 : 15.14
col-4 : 16.10
col-5 : 17.6
col-6 : 18.0 with the x10b config
col-7 : 18.0 with the x10c config
col-8 : 18.0 with the x10d config

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8   point queries, first run
1.02    1.00    1.01    1.00    1.94    1.87    1.91    1.82    hot-points_range=100
1.01    1.02    1.02    1.00    1.02    0.99    1.00    0.98    point-query_range=100
1.02    1.02    1.01    1.03    1.01    1.01    1.00    1.00    points-covered-pk_range=100
1.01    1.04    1.03    1.05    1.03    1.02    1.03    1.03    points-covered-si_range=100
1.01    1.01    1.01    1.02    1.02    1.00    1.00    1.00    points-notcovered-pk_range=100
1.00    1.03    1.02    1.03    1.02    1.01    1.01    1.02    points-notcovered-si_range=100
1.01    1.02    1.02    1.03    1.00    1.00    1.00    0.99    random-points_range=1000
1.01    1.02    1.02    1.02    1.02    1.00    1.00    1.00    random-points_range=100
1.02    1.03    1.02    1.02    1.01    1.00    1.00    1.00    random-points_range=10

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8   point queries, second run
1.00    0.98    0.99    1.00    1.94    1.90    1.93    1.92    hot-points_range=100
1.00    1.01    1.02    1.03    1.03    1.02    1.02    1.02    point-query_range=100
1.02    1.01    1.00    1.04    0.99    1.00    1.00    0.99    points-covered-pk_range=100
1.01    1.04    1.03    1.07    1.03    1.03    1.05    1.04    points-covered-si_range=100
1.01    1.02    1.03    1.04    1.01    1.00    0.99    1.01    points-notcovered-pk_range=100
1.02    1.05    1.05    1.05    1.03    1.03    1.03    1.04    points-notcovered-si_range=100
1.01    1.02    1.03    1.03    0.99    0.99    1.00    1.00    random-points_range=1000
1.02    1.02    1.03    1.04    1.01    1.01    1.00    1.01    random-points_range=100
1.02    1.02    1.02    1.03    1.02    1.01    1.01    1.02    random-points_range=10

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8   range queries without aggregation, first run
1.00    1.02    1.02    1.01    1.00    0.97    0.98    0.95    range-covered-pk_range=100
1.00    1.02    1.02    1.01    1.00    0.97    0.97    0.94    range-covered-si_range=100
1.01    1.00    1.00    1.00    0.99    0.97    0.97    0.97    range-notcovered-pk_range=100
0.99    1.00    1.00    0.99    1.01    1.00    1.00    0.99    range-notcovered-si_range=100
0.98    1.24    1.11    1.13    1.16    1.12    1.14    1.11    scan_range=100

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8   range queries without aggregation, second run
1.01    1.02    1.02    1.02    1.01    1.00    1.00    0.99    range-covered-pk_range=100
1.01    1.03    1.02    1.02    1.01    1.00    1.01    1.00    range-covered-si_range=100
1.00    0.99    1.00    1.00    0.99    0.97    0.98    0.98    range-notcovered-pk_range=100
1.00    1.00    1.00    0.98    1.01    1.00    1.01    1.01    range-notcovered-si_range=100
1.00    1.27    1.15    1.15    0.97    1.20    1.20    1.18    scan_range=100

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8   range queries with aggregation, first run
1.02    1.00    1.00    1.01    0.97    0.96    0.97    0.97    read-only-count_range=1000
1.00    1.00    1.02    1.02    0.98    0.99    0.99    0.99    read-only-distinct_range=1000
1.01    1.00    1.03    1.03    1.00    1.01    1.01    1.01    read-only-order_range=1000
1.00    0.98    1.00    1.06    0.95    0.99    0.99    0.99    read-only_range=10000
1.00    1.00    1.00    1.00    1.00    0.98    0.98    0.98    read-only_range=100
1.00    1.01    1.01    1.00    1.01    0.98    0.99    0.98    read-only_range=10
1.01    1.00    1.02    1.01    1.00    0.99    0.98    0.98    read-only-simple_range=1000
1.00    1.00    1.01    1.00    0.99    0.98    0.98    0.98    read-only-sum_range=1000

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8   range queries with aggregation, second run
1.03    1.02    1.02    1.03    0.97    0.97    0.97    0.98    read-only-count_range=1000
1.00    0.99    1.02    1.02    0.98    0.99    0.99    0.99    read-only-distinct_range=1000
1.00    0.99    1.02    1.04    1.02    1.01    1.01    1.02    read-only-order_range=1000
1.01    1.03    1.03    1.06    0.97    0.99    0.99    0.99    read-only_range=10000
0.99    1.00    1.00    1.01    1.00    0.99    0.99    0.99    read-only_range=100
0.99    1.00    1.00    1.00    1.01    0.99    1.00    0.99    read-only_range=10
1.00    0.99    1.01    1.00    0.99    0.98    0.98    0.99    read-only-simple_range=1000
1.00    1.00    1.01    1.01    0.99    0.98    0.98    0.98    read-only-sum_range=1000

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8   writes, first run
1.00    1.08    1.08    1.05    1.25    1.24    1.23    1.20    delete_range=100
1.01    1.05    1.04    1.03    1.07    1.06    1.02    1.05    insert_range=100
1.00    1.06    1.07    1.07    1.10    1.09    1.08    1.07    read-write_range=100
1.00    1.07    1.08    1.07    1.13    1.13    1.11    1.11    read-write_range=10
0.99    1.04    1.04    0.90    1.43    1.43    1.41    1.43    update-index_range=100
1.00    1.09    1.08    1.08    1.11    1.15    1.06    1.12    update-inlist_range=100
1.00    1.05    1.05    1.04    1.35    1.34    1.34    1.35    update-nonindex_range=100
1.02    0.95    0.96    0.93    1.19    1.19    1.19    1.20    update-one_range=100
1.00    1.05    1.08    1.07    1.23    1.21    1.22    1.23    update-zipf_range=100
1.01    1.06    1.05    1.01    1.25    1.22    1.20    1.24    write-only_range=10000

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8   writes, second run
1.00    1.06    1.07    1.07    1.26    1.23    1.23    1.24    delete_range=100
1.03    1.07    1.05    1.05    1.09    1.07    1.08    1.09    insert_range=100
1.01    1.07    1.08    1.07    1.11    1.10    1.10    1.09    read-write_range=100
0.99    1.04    1.06    1.07    1.13    1.11    1.11    1.12    read-write_range=10
0.99    1.02    1.04    0.87    1.44    1.40    1.41    1.40    update-index_range=100
1.00    1.11    1.12    1.09    1.17    1.14    1.16    1.22    update-inlist_range=100
1.01    1.04    1.06    1.03    1.36    1.33    1.35    1.34    update-nonindex_range=100
1.01    0.95    0.98    0.94    1.22    1.21    1.21    1.20    update-one_range=100
0.99    1.05    1.07    1.07    1.24    1.21    1.22    1.21    update-zipf_range=100
1.02    1.06    1.06    1.02    1.27    1.25    1.23    1.21    write-only_range=10000


Monday, October 6, 2025

My time at Oracle: functional and design specification reviews

I worked at Oracle from 1997 to 2005 for 3 years on the app server team in Portland and the last 5 on DBMS query execution in Redwood Shores. I had a good time there, made many friends and learned a lot.

They had an excellent process for functional and design specification reviews. Like many, I am wary of (too much) process but this wasn't too much. It was just enough.

At a high level, you would write and then get a review for the functional spec. The review was an in-person meeting. Once that was resolved the process would repeat for the design spec. You were expected to write a good spec -- it was better for one person (the author) to spend much time on it to avoid wasting time for the many readers. Many specs would be revisited long after the review because there is turnover and specs are easier to read than source code.

We used FrameMaker to write the specs on Solaris workstations. That was a long time ago. The functional spec I wrote for IEEE754 datatypes was more than 50 pages because I had to document every aspect of PL/SQL and SQL that would be impacted by it (there were so many functions to document). The design spec I wrote for a new sort algorithm was also quite long because I had already implemented the algorithm to collect performance results to justify the effort. The patent attorney copied much of that design doc into the patent resulting in a patent that might be more readable than average.

For each specification you set up a meeting a few weeks out and shared the spec with people who might attend the meeting. In many cases there was feedback via email or in person prior to the meeting that could be resolved before the meeting. But in some cases there was feedback that wouldn't get resolved until the meeting.

It is important to split the functional and design specs, and their reviews. It helps with efficiency and the design review might change a lot based on the outcome of the functional spec review.

There are a variety of responses to the feedback, and all of that was added to an appendix of the spec (both the feedback and the response). Common responses include:

  • good point
    • I will change my spec as you suggest
  • no thank you
    • I disagree and will not change my spec as you suggest. Hopefully this isn't the response to all feedback but some people like to bike shed and/or get in the way of progress. When I rewrote the sort algorithm, I used something that was derived from quicksort and quicksort implementations have worse than expected performance on some input sequences. The algorithm I used was far better than vanilla quicksort in that regard, but it didn't eliminate the risk. However, the performance improvement over the existing code was so large (the white paper claims 5X faster) that I said no thank you and the project got done. But I did spend some time doing the math to show how likely (or unlikely) the worst cases were. I needed a tool with arbitrary precision math for that because the numbers are small, and I might have ended up using a Scheme implementation.
  • good point, but
    • I won't change my spec, but I have a workaround for the problem you mention. For IEEE754 datatypes, a few people objected because a few infrequently used and fading platforms for the DBMS did not have hardware support for IEEE754. My solution was to use functions for each IEEE754 operation that were trivial for platforms with IEEE754 HW support -- things like double multiply_double(x, y) { return x*y } -- but could be implemented as needed on the platforms that lacked IEEE754 via a software implementation of IEEE754.

Sunday, October 5, 2025

Measuring scaleup for Postgres 18.0 with sysbench

This post has results to measure scaleup for Postgres 18.0 on a 48-core server.

tl;dr

  • Postgres continues to be boring (in a good way)
  • Results are mostly excellent
  • A few of the range query tests have a scaleup that is less than great but I need time to debug

Builds, Configuration & Hardware

The server has an AMD EPYC 9454P 48-Core Processor with AMD SMT disabled, 128G of RAM and SW RAID 0 with 2 NVMe devices. The OS is Ubuntu 22.04.

I compiled Postgres 18.0 from source and the configuration file is here.

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres. Each microbenchmark is run for 300 seconds.

The benchmark is run with 1, 2, 4, 8, 12, 16, 20, 24, 32, 40 and 48 clients. The purpose is to determine how well Postgres scales up. All tests use 8 tables with 10M rows per table.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I still use relative QPS here, but in a different way. The relative QPS here is:
(QPS at X clients) / (QPS at 1 client)

The goal is to determine scaleup efficiency for Postgres. When the relative QPS at X clients is a value near X, then things are great. But sometimes things aren't great and the relative QPS is much less than X. One issue is data contention for some of the write-heavy microbenchmarks. Another issue is mutex and rw-lock contention.

Perf debugging via vmstat and iostat

I use normalized results from vmstat and iostat to help explain why things aren't as fast as expected. By normalized I mean I divide the average values from vmstat and iostat by QPS to see things like how much CPU is used per query or how many context switches occur per write. And note that a high context switch rate is often a sign of mutex contention.

Those results are here but can be difficult to read.
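
As a sketch of the normalization, assuming average us, sy and cs values from vmstat over the measurement interval; the cpu/o and cs/o names match the columns in the linked reports, while the function itself is only an illustration:

def normalize(vmstat_avg, qps):
    """Divide average vmstat values by QPS to get approximate per-query costs.
    The absolute numbers are not meaningful on their own; I only compare them
    across versions or client counts."""
    return {
        "cpu/o": (vmstat_avg["us"] + vmstat_avg["sy"]) / qps,  # CPU per query
        "cs/o":  vmstat_avg["cs"] / qps,                       # context switches per query
    }

# made-up example: 40% user CPU, 10% system CPU, 200,000 context switches/s, 50,000 QPS
print(normalize({"us": 40, "sy": 10, "cs": 200_000}, 50_000))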

Charts: point queries

The spreadsheet with all of the results is here.

While results aren't perfect, they are excellent. Perfect results would be to get a scaleup of 48 at 48 clients and here the result is between 40 and 42 in most tests. The worst-case is for hot-points where the scaleup is 32.57 at 48 clients. Note that the hot-points test has the most data contention of the point-query tests, as all queries fetch the same rows.

From the vmstat metrics (see here) I don't see an increase in mutex contention (more context switches, see the cs/o column) but I do see an increase in CPU (cpu/o). Compared to a test that has better scaleup, like points-covered-pk, I also don't see an increase in mutex contention there, and I do see an increase in CPU overhead (cpu/o), but the CPU increase is smaller (see here).

Charts: range queries without aggregation

The spreadsheet with all of the results is here.

The results again are great, but not perfect. The worst case is for range-notcovered-pk where the scaleup is 32.92 at 48 clients. The best case is for scan where the scaleup is 46.56 at 48 clients.

From the vmstat metrics for range-notcovered-pk I don't see any obvious problems. The CPU overhead (cpu/o, CPU per query) increases by 1.08 (about 8%) from 1 to 48 clients while the context switches per query (cs/o) decreases (see here).

Charts: range queries with aggregation

The spreadsheet with all of the results is here.

Results for range queries with aggregation are worse than for range queries without aggregation. I hope to try and explain that later. A perfect result is scaleup equal to 48. Here, 3 of 8 tests have scaleup less than 30, 4 have scaleup between 30 and 40, and the best case is read-only_range=10 with a scaleup of 43.35.

The worst-case was read-only-count with a scaleup of 21.38. From the vmstat metrics I see that the CPU overhead (cpu/o, CPU per query) increases by 2.08 at 48 clients vs 1 client while context switches per query (cs/o) decrease (see here). I am curious about that CPU increase as it isn't as bad for the other range query tests; for example, see here where it is no larger than 1.54. The query for read-only-count is here.

Later I hope to explain why read-only-count, read-only-simple and read-only-sum don't do better.

Charts: writes

The spreadsheet with all of the results is here.

The worst-case is update-one where scaleup is 2.86 at 48 clients. The bad result is expected as having many concurrent clients update the same row is an anti-pattern with Postgres. The scaleup for Postgres on that test is a lot worse than for MySQL where it was ~8 with InnoDB. But I am not here for Postgres vs InnoDB arguments.

Excluding the tests that mix reads and writes (read-write-*) the scaleup is between 13 and 21. This is far from great but isn't horrible. I run with fsync-on-commit disabled which highlights problems but is less realistic. So for now I am happy with these results.



Wednesday, October 1, 2025

Measuring scaleup for MariaDB with sysbench

This post has results to measure scaleup for MariaDB 11.8.3 on a 48-core server.

tl;dr

  • Scaleup is better for range queries than for point queries
  • For tests where results were less than great, the problem appears to be mutex contention within InnoDB

Builds, Configuration & Hardware

The server has an AMD EPYC 9454P 48-Core Processor with AMD SMT disabled, 128G of RAM and SW RAID 0 with 2 NVMe devices. The OS is Ubuntu 22.04.

I compiled MariaDB 11.8.3 from source and the my.cnf file is here.

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by MariaDB. Each microbenchmark is run for 300 seconds.

The benchmark is run with 1, 2, 4, 8, 12, 16, 20, 24, 32, 40 and 48 clients. The purpose is to determine how well MariaDB scales up. All tests use 8 tables with 10M rows per table.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I still use relative QPS here, but in a different way. The relative QPS here is:
(QPS at X clients) / (QPS at 1 client)

The goal is to determine scaleup efficiency for MariaDB. When the relative QPS at X clients is a value near X, then things are great. But sometimes things aren't great and the relative QPS is much less than X. One issue is data contention for some of the write-heavy microbenchmarks. Another issue is mutex and rw-lock contention.

Perf debugging via vmstat and iostat

I use normalized results from vmstat and iostat to help explain why things aren't as fast as expected. By normalized I mean I divide the average values from vmstat and iostat by QPS to see things like how much CPU is used per query or how many context switches occur per write. And note that a high context switch rate is often a sign of mutex contention.

Charts: point queries

The spreadsheet with all of the results is here.

For point queries:

  • tests for which the relative QPS at 48 clients is greater than 40
    • point-query
  • tests for which the relative QPS at 48 clients is between 30 and 40
    • none
  • tests for which the relative QPS at 48 clients is between 20 and 30
    • hot-points, points-covered-si, random-points_range=10
  • tests for which the relative QPS at 48 clients is between 10 and 20
    • points-covered-pk, points-notcovered-pk, points-notcovered-si, random-points_range=100
  • tests for which the relative QPS at 48 clients is less than 10
    • random-points_range=1000

For 5 of the 9 point query tests, QPS stops improving beyond 16 clients. And I assume that mutex contention is the problem.

Results for the random-points_range=Z tests are interesting. They use oltp_inlist_select.lua which does a SELECT with a large IN-list where the IN-list entries can find rows by exact match on the PK. The value of Z is the number of entries in the IN-list. And here MariaDB scales worse with a larger Z (1000) than with a smaller Z (10 or 100), which means that the thing that limits scaleup is more likely in InnoDB than the parser or optimizer.

From the normalized vmstat metrics (see here) for 1 client and 48 clients the number of context switches per query (the cs/o column) grows a lot more from 1 to 48 clients for random-points_range=1000 than for random-points_range=10. The ratio (cs/o at 48 clients / cs/o at 1 client) is 1.46 for random-points_range=10 and then increases to 19.96 for random-points_range=1000. The problem appears to be mutex contention.

Charts: range queries without aggregation

The spreadsheet with all of the results is here.

For range queries without aggregation:

  • tests for which the relative QPS at 48 clients is greater than 40
    • range-covered-pk, range-covered-si, range-notcovered-pk
  • tests for which the relative QPS at 48 clients is between 30 and 40
    • scan
  • tests for which the relative QPS at 48 clients is between 20 and 30
    • none
  • tests for which the relative QPS at 48 clients is between 10 and 20
    • none
  • tests for which the relative QPS at 48 clients is less than 10
    • range-notcovered-si

Only one test has less than great results for scaleup -- range-notcovered-si. QPS for it stops growing beyond 12 clients. The root cause appears to be mutex contention based on the large value for cs/o in the normalized vmstat metrics (see here). Of all the range-*covered-* tests, it has the most InnoDB activity per query -- the query isn't covering so it must do a PK index access per index entry it finds in the secondary index.

Charts: range queries with aggregation

The spreadsheet with all of the results is here.

For range queries with aggregation:

  • tests for which the relative QPS at 48 clients is greater than 40
    • read-only-distinct, read-only-order, read-only-range=Y, read-only-sum
  • tests for which the relative QPS at 48 clients is between 30 and 40
    • read-only-count, read-only-simple
  • tests for which the relative QPS at 48 clients is between 20 and 30
    • none
  • tests for which the relative QPS at 48 clients is between 10 and 20
    • none
  • tests for which the relative QPS at 48 clients is less than 10
    • none

Results here are excellent, and better than the results above for range queries without aggregation. The difference might mean that there is less concurrent activity within InnoDB because aggregation code is run after each row is fetched from InnoDB.

Charts: writes

The spreadsheet with all of the results is here.

For writes:

  • tests for which the relative QPS at 48 clients is greater than 40
    • none
  • tests for which the relative QPS at 48 clients is between 30 and 40
    • read-write_range=Y
  • tests for which the relative QPS at 48 clients is between 20 and 30
    • update-index, write-only
  • tests for which the relative QPS at 48 clients is between 10 and 20
    • delete, insert, update-inlist, update-nonindex, update-zipf
  • tests for which the relative QPS at 48 clients is less than 10
    • update-one

The best result is for the read-write_range=Y tests which are the classic sysbench transaction that does a mix of writes, point and range queries.

The worst result is from update-one which suffers from data contention as all updates are to the same row. A poor result is expected here.
