Wednesday, January 7, 2026

SSDs, power loss protection and fsync latency

This post has results that measure the impact of calling fsync (or fdatasync) per write for files opened with O_DIRECT. My goal is to document the impact of the innodb_flush_method option.

The primary point of this post is to document the claim:

For an SSD without power loss protection, writes are fast but fsync is slow.

The secondary point of this post is to provide yet another example where context matters when reporting performance problems. This post is motivated by results that look bad when run on a server with slow fsync but look OK otherwise. 

tl;dr

  • for my mini PCs I will switch from the Samsung 990 Pro to the Crucial T500 to get lower fsync latency. Both are nice devices but the T500 is better for my use case.
  • with a consumer SSD writes are fast but fsync is often slow
  • use an enterprise SSD if possible; if not, run tests to understand fsync and fdatasync latency
Updates:

InnoDB, O_DIRECT and O_DIRECT_NO_FSYNC

When innodb_flush_method is set to O_DIRECT there are calls to fsync after each batch of writes. While I don't know the source code as well as I used to, I did browse it for this blog post and then looked at SHOW GLOBAL STATUS counters. I think that InnoDB does the following when it is set to O_DIRECT (a rough sketch follows the list):

  1. Do one large write to the doublewrite buffer, call fsync on that file
  2. Do the batch of in-place (16kb) page writes
  3. Call fsync once per database file that was written by step 2
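The sketch below is my reading of that sequence, not InnoDB source; the file descriptors, the batch structure and the helper name are illustrative, and it ignores the buffer alignment that O_DIRECT requires.

import os

def flush_batch(dblwr_fd, pages):
    # pages: a list of (data_file_fd, offset, 16kb page) tuples
    # 1. one large write to the doublewrite buffer, then fsync that file
    os.pwrite(dblwr_fd, b"".join(p for _, _, p in pages), 0)
    os.fsync(dblwr_fd)
    # 2. the batch of in-place 16kb page writes
    touched = set()
    for fd, offset, page in pages:
        os.pwrite(fd, page, offset)
        touched.add(fd)
    # 3. one fsync per database file written in step 2
    for fd in touched:
        os.fsync(fd)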

When set to O_DIRECT_NO_FSYNC the frequency of calls to fsync is greatly reduced and fsync is only done in cases where important filesystem metadata needs to be updated, such as after extending a file. The reference manual is misleading WRT the following sentence. I don't think that InnoDB ever does an fsync after each write; it can do an fsync after each batch of writes:

O_DIRECT_NO_FSYNC: InnoDB uses O_DIRECT during flushing I/O, but skips the fsync() system call after each write operation.

Many years ago it was risky to use O_DIRECT_NO_FSYNC on some filesystems because the feature as implemented (either upstream or in forks) didn't do fsync for cases where it was needed (see the comment about metadata above). I experienced problems from this and I only have myself to blame. But the feature has been enhanced to do the right thing. And if the #whynotpostgres crowd wants to snark about MySQL not caring about data, let's not forget that InnoDB had per-page checksums long before Postgres -- those checksums made web-scale life much easier when using less than stellar hardware.

The following table uses results from running the Insert Benchmark for InnoDB to compute the ratio of fsyncs per write from the SHOW GLOBAL STATUS counters:
Innodb_data_fsyncs / Innodb_data_writes

A few things are clear from the table below. First, there isn't an fsync per write with O_DIRECT but there might be an fsync per batch of writes as explained above. Second, the rate of fsyncs is greatly reduced by using O_DIRECT_NO_FSYNC.

5.7.44  8.0.44
.01046  .00729  O_DIRECT
.00172  .00053  O_DIRECT_NO_FSYNC
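For reference, below is a small sketch that computes that ratio; it assumes the mysql command line client is on the path and can connect without extra options. The counters are cumulative since startup, so for a benchmark you would take the difference between samples collected before and after the run.

import subprocess

def fsyncs_per_write():
    # parse the tab-separated output of SHOW GLOBAL STATUS
    out = subprocess.run(
        ["mysql", "-B", "-N", "-e",
         "SHOW GLOBAL STATUS LIKE 'Innodb_data_%'"],
        capture_output=True, text=True, check=True).stdout
    counters = dict(line.split("\t") for line in out.splitlines())
    return int(counters["Innodb_data_fsyncs"]) / int(counters["Innodb_data_writes"])

print(round(fsyncs_per_write(), 5))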

Power loss protection

I am far from an expert on this topic, but most SSDs have a write-buffer that makes small writes fast. And one way to achieve speed is to buffer those writes in RAM on the SSD while waiting for enough data to be written to an extent. But that speed means there is a risk of data loss if a server loses power. Some SSDs, especially those marketed as enterprise SSDs, have a feature called power loss protection that makes data loss unlikely. Other SSDs, let's call them consumer SSDs, don't have that feature, although some of the consumer SSDs claim to make a best effort to flush writes from the write buffer on power loss.
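One related check, a sketch that assumes Linux and the block layer's queue/write_cache attribute: it reports whether the kernel treats a device's cache as volatile ("write back", so fsync must trigger a device cache flush) or not ("write through").

from pathlib import Path

# print the kernel's view of the write cache for each NVMe device
for dev in sorted(Path("/sys/block").glob("nvme*")):
    cache = (dev / "queue" / "write_cache").read_text().strip()
    print(f"{dev.name}: {cache}")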

One way to avoid the risk is to only buy enterprise SSDs. But they are more expensive, less common, and many are larger (22110 rather than 2280) because more room is needed for the capacitors or other HW that provide the power loss protection. Note that power loss protection is often abbreviated as PLP.

For devices without power loss protection it is often true that writes are fast but fsync is slow. When fsync is slow, calling fsync more frequently in InnoDB will hurt performance.

Results from fio

I used this fio script to measure write performance for files opened with O_DIRECT. The test was run twice per configuration for 5 minutes per run, followed by a 5 minute sleep. This was repeated for 1, 2, 4, 8, 16 and 32 fio jobs but I only share results here for 1 job. A minimal sketch of the per-write sync pattern follows the list. The configurations tested were:

  • O_DIRECT without fsync, 16kb writes
  • O_DIRECT with an fsync per write, 16kb writes
  • O_DIRECT with an fdatasync per write, 16kb writes
  • O_DIRECT without fsync, 2M writes
  • O_DIRECT with an fsync per write, 2M writes
  • O_DIRECT with an fdatasync per write, 2M writes
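This is not the fio script I used, but a minimal Python sketch of the same per-write sync pattern; the file path, write count and lack of warmup are arbitrary choices, and it assumes Linux. It times the sync call made after each 16kb O_DIRECT write.

import mmap
import os
import time

FNAME = "/data/fsync_test.bin"   # hypothetical path, adjust for your server
BLOCK = 16 * 1024                # 16kb writes, as in the tests here
WRITES = 1000

def run(sync_fn, label):
    fd = os.open(FNAME, os.O_CREAT | os.O_WRONLY | os.O_DIRECT, 0o644)
    # preallocate so fdatasync need not also flush file size changes
    os.posix_fallocate(fd, 0, BLOCK * WRITES)
    buf = mmap.mmap(-1, BLOCK)   # O_DIRECT needs an aligned buffer; mmap is page aligned
    buf.write(b"x" * BLOCK)
    total = 0.0
    for i in range(WRITES):
        os.pwrite(fd, buf, i * BLOCK)
        t0 = time.perf_counter()
        if sync_fn:
            sync_fn(fd)
        total += time.perf_counter() - t0
    os.close(fd)
    print(f"{label}: {1e6 * total / WRITES:.1f} usec per sync")

run(None, "no-sync")
run(os.fsync, "fsync")
run(os.fdatasync, "fdatasync")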
Results from all tests are here. I did the test on several servers:
  • dell32
    • a large server I have at home. The SSD is a Crucial T500 2TB using ext-4 with discard enabled and Ubuntu 24.04. This is a consumer SSD. While the web claims it has PLP via capacitors, the fsync latency for it was almost 1 millisecond.
  • gcp
    • a c3d-standard-30-lssd from the Google cloud with 2 local NVMe devices using SW RAID 0 and 1TB of Hyperdisk Balanced storage configured for 50,000 IOPS and 800MB/s of throughput. The OS is Ubuntu 24.04 and I repeated tests for both ext-4 and xfs, both with discard enabled. I was not able to determine the brand of the local NVMe devices.
  • hetz
    • an ax162-s from Hetzner with 2 local NVMe devices using SW RAID 1. Via udisksctl status I learned the devices are Intel D7-P5520 (now Solidigm). These are datacenter SSDs and the web claims they have power loss protection. The OS is Ubuntu 24.04 and the drives use ext-4 without discard enabled.
  • ser7
    • a mini PC I have at home. The SSD is a Samsung 990 Pro, a consumer SSD without PLP.
  • socket2
    • a 2-socket server I have at home. The SSD is a Samsung PM-9a3. This is an enterprise SSD with power loss protection. The OS is Ubuntu 24.04 and the drives use ext-4 with discard enabled.
Results: overview

All of the results are here.

This table lists fsync and fdatasync latency per server:
  • for servers with consumer SSDs (dell, ser7) the latency is much larger on the ser7, which uses a Samsung 990 Pro, than on the dell, which uses a Crucial T500. This is to be expected given that the T500 claims to have PLP while the 990 Pro does not.
  • sync latency is much lower on servers with enterprise SSDs
  • sync latency after 2M writes is sometimes much larger than after 16kb writes
  • for the Google server with Hyperdisk Balanced storage the fdatasync latency was good but fsync latency was high. With the local NVMe devices the latencies were larger than for enterprise SSDs but much smaller than for consumer SSDs.

--- Sync latency in microseconds for sync after 16kb writes

dell    hetz    ser7    socket2
891.1   12.4    2974.2  1.6     fsync
447.4    9.8    2783.2  0.7     fdatasync

gcp
local devices           hyperdisk
ext-4   xfs             ext-4   xfs
56.2    39.5            738.1   635.0   fsync
28.1    29.0             46.8    46.0   fdatasync

--- Sync latency in microseconds for sync after 2M writes

dell    hetz    ser7    socket2
980.1   58.2    5396.8  139.1   fsync
449.7   10.8    3508.2    2.2   fdatasync

gcp
local devices           hyperdisk
ext-4   xfs             ext-4   xfs
1020.4  916.8           821.2   778.9   fsync
 832.4  809.7            63.6    51.2   fdatasync

Results: dell

Summary:
  • Write throughput drops dramatically when there is an fsync or fdatasync per write because sync latency is large.
  • This server uses a consumer SSD so high sync latency is expected
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)

16 KB writes
w/s     MB/s    sync    test
43400   646.6   0.0     no-sync
43500   648.5   0.0     no-sync
-
1083    16.1    891.1   fsync
1085    16.2    889.2   fsync
-
2100    31.3    447.4   fdatasync
2095    31.2    448.6   fdatasync

2 MB writes
w/s     MB/s    sync    test
2617    4992.5  0.0     no-sync
2360    4502.3  0.0     no-sync
-
727     1388.5  980.1   fsync
753     1436.2  942.5   fsync
-
1204    2297.4  449.7   fdatasync
1208    2306.0  446.9   fdatasync

Results: gcp

Summary
  • Local NVMe devices have lower sync latency and more throughput with and without a sync per write at low concurrency (1 fio job).
  • At higher concurrency (32 fio jobs), the Hyperdisk Balanced setup provides similar throughput to local NVMe and would do even better had I paid more to get more IOPS and throughput. Results don't have nice formatting but are here for xfs on the local and Hyperdisk Balanced devices.
  • fsync latency is ~2X larger than fdatasync on the local devices and closer to 15X larger on the Hyperdisk Balanced setup. That difference is interesting. I wonder what the results are for Hyperdisk Extreme.
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)
--- ext-4 and local devices

16 KB writes
w/s     MB/s    sync    test
10100   150.7   0.0     no-sync
10300   153.5   0.0     no-sync
-
6555    97.3    56.2    fsync
6607    98.2    55.1    fsync
-
8189    122.1   28.1    fdatasync
8157    121.1   28.2    fdatasync

2 MB writes
w/s     MB/s    sync    test
390     744.8   0.0     no-sync
390     744.8   0.0     no-sync
-
388     741.0   1020.4  fsync
388     741.0   1012.7  fsync
-
390     744.8   832.4   fdatasync
390     744.8   869.6   fdatasync

--- xfs and local devices

16 KB writes
w/s     MB/s    sync    test
9866    146.9   0.0     no-sync
9730    145.0   0.0     no-sync
-
7421    110.6   39.5    fsync
7537    112.5   38.3    fsync
-
8100    121.1   29.0    fdatasync
8117    121.1   28.8    fdatasync

2 MB writes
w/s     MB/s    sync    test
390     744.8   0.0     no-sync
390     744.8   0.0     no-sync
-
389     743.9   916.8   fsync
389     743.9   919.1   fsync
-
390     744.8   809.7   fdatasync
390     744.8   806.5   fdatasync

--- ext-4 and Hyperdisk Balanced

16 KB writes
w/s     MB/s    sync    test
2093    31.2    0.0     no-sync
2068    30.8    0.0     no-sync
-
804     12.0    738.1   fsync
798     11.9    740.6   fsync
-
1963    29.3    46.8    fdatasync
1922    28.6    49.0    fdatasync

2 MB writes
w/s     MB/s    sync    test
348     663.8   0.0     no-sync
367     701.0   0.0     no-sync
-
278     531.2   821.2   fsync
271     517.8   814.1   fsync
-
358     683.8   63.6    fdatasync
345     659.0   64.5    fdatasync

--- xfs and Hyperdisk Balanced

16 KB writes
w/s     MB/s    sync    test
2033    30.3    0.0     no-sync
2004    29.9    0.0     no-sync
-
870     13.0    635.0   fsync
858     12.8    645.0   fsync
-
1787    26.6    46.0    fdatasync
1727    25.7    49.6    fdatasync

2 MB writes
w/s     MB/s    sync    test
343     655.2   0.0     no-sync
343     655.2   0.0     no-sync
-
267     511.2   778.9   fsync
268     511.2   774.7   fsync
-
347     661.8   51.2    fdatasync
336     642.8   54.4    fdatasync

Results: hetz

Summary
  • this has an enterprise SSD with excellent (low) sync latency
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)
16 KB writes
w/s     MB/s    sync    test
37700   561.7   0.0     no-sync
37500   558.9   0.0     no-sync
-
25200   374.8   12.4    fsync
25100   374.8   12.4    fsync
-
27600   411.0   0.0     fdatasync
27200   404.4   9.8     fdatasync

2 MB writes
w/s     MB/s    sync    test
1833    3497.1  0.0     no-sync
1922    3667.8  0.0     no-sync
-
1393    2656.9  58.2    fsync
1355    2585.4  59.6    fsync
-
1892    3610.6  10.8    fdatasync
1922    3665.9  10.8    fdatasync

Results: ser7

Summary:
  • this has a consumer SSD with high sync latency
  • results had much variance (see the 2MB results below), as did the results at higher concurrency. This is a great SSD, but not for my use case.
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)
16 KB writes
w/s     MB/s    sync    test
34000   506.4   0.0     no-sync
40200   598.9   0.0     no-sync
-
325     5.0     2974.2  fsync
333     5.1     2867.3  fsync
-
331     5.1     2783.2  fdatasync
330     5.0     2796.1  fdatasync

2 MB writes
w/s     MB/s    sync    test
362     691.4   0.0     no-sync
364     695.2   0.0     no-sync
-
67      128.7   10828.3 fsync
114     218.4   5396.8  fsync
-
141     268.9   3864.0  fdatasync
192     368.1   3508.2  fdatasync

Results: socket2

Summary:
  • this has an enterprise SSD with excellent (low) sync latency after small writes, but fsync latency after 2MB writes is much larger
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)
16 KB writes
w/s     MB/s    sync    test
49500   737.2   0.0     no-sync
49300   734.3   0.0     no-sync
-
44500   662.8   1.6     fsync
45400   676.2   1.5     fsync
-
46700   696.2   0.7     fdatasync
45200   674.2   0.7     fdatasync

2 MB writes
w/s     MB/s    sync    test
707     1350.4  0.0     no-sync
708     1350.4  0.0     no-sync
-
703     1342.8  139.1   fsync
703     1342.8  122.5   fsync
-
707     1350.4  2.2     fdatasync
707     1350.4  2.1     fdatasync


Friday, January 2, 2026

Common prefix skipping, adaptive sort

Patent US7680791B2 has expired. I invented this while at Oracle and it landed in 10gR2 with claims of ~5X better performance vs the previous sort algorithm used by Oracle. I hope for an open-source implementation one day. The patent has a good description of the algorithm; it is much easier to read than your typical patent. Thankfully the IP lawyer made good use of the functional and design docs that I wrote.

The patent is for a new in-memory sort algorithm that needs a name. Features include:

  • common prefix skipping
    • skips comparing the common prefix of key bytes when possible
  • adaptive
    • switches between quicksort and most-significant digit radix sort
  • key substring caching
    • reduces CPU cache misses by caching the next few bytes of the key
  • produces results before sort is done
    • sorted output can be produced (passed to the rest of the query, or spilled to disk for an external sort) before the sort is finished.
Update:
  • the sort algorithm needs a name and common prefix skipping adaptive quicksort is much too long. So I suggest Orasort.

How it came to be

From 2000 to 2005 I worked on query processing for Oracle. I am not sure why I started on this effort and it wasn't suggested by my bosses or peers. But the Sort Benchmark contest was active and I had more time to read technical papers. Perhaps I was inspired by the Alphasort paper.

While the Sort Benchmark advanced the state of the art in sort algorithms, it also encouraged algorithms that were great for benchmarks (focus on short keys with uniform distribution). But keys sorted by a DBMS are often much larger than 8 bytes and adjacent rows often have long common prefixes in their keys.

So I thought about this while falling asleep, and after many nights realized that with a divide and conquer sort, as the algorithm descends into subpartitions of the data, the common prefixes of the keys in each subpartition are likely to grow (a toy sketch of the idea follows the list):

  • if the algorithm remembers the length of the common prefix as it descends, then it can skip the common prefix during comparisons to save on CPU overhead
  • if the algorithm can detect when the length of the common prefix grows, then it can switch from quicksort to most-significant digit (MSD) radix sort on the next byte beyond the common prefix, and switch back to quicksort after doing that
  • the algorithm can cache bytes from the key in an array, like Alphasort. But unlike Alphasort, as it descends it can cache the next few bytes it will need to compare rather than only caching the first few bytes of the key. This provides much better memory system behavior (fewer cache misses).
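The toy sketch below only shows the common prefix skipping and the adaptive switch between a comparison sort and MSD radix partitioning; it is not the patented implementation and it omits the key substring caching and early output features. The threshold of 16 is an arbitrary choice.

def prefix_sort(keys, depth=0):
    # keys: byte strings that already share their first `depth` bytes
    if len(keys) <= 1:
        return keys
    # grow the known common prefix for this partition
    while all(len(k) > depth and k[depth] == keys[0][depth] for k in keys):
        depth += 1
    if len(keys) <= 16:
        # small partition: comparison sort that skips the shared prefix bytes
        return sorted(keys, key=lambda k: k[depth:])
    # large partition: MSD radix step on the first byte past the common prefix
    short, buckets = [], {}
    for k in keys:
        if len(k) <= depth:
            short.append(k)          # keys equal to the prefix sort first
        else:
            buckets.setdefault(k[depth], []).append(k)
    out = short
    for byte in sorted(buckets):
        out.extend(prefix_sort(buckets[byte], depth + 1))
    return out

With DBMS-style keys (long, with long shared prefixes) the comparisons in the small partitions only touch bytes past the prefix, which is where the CPU savings come from.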
Early implementation

This might have been in 2003 before we were able to access work computers from home. I needed to get results that would convince management this was worth doing. I started my proof-of-concept on an old PowerPC based Mac I had at home that found a second life after I installed Yellow Dog Linux on it.

After some iteration I had good results on the PowerPC. So I brought my source code into work and repeated the test on other CPUs that I could find. On my desk I had a Sun workstation and a Windows PC with a 6 year old Pentium 3 CPU (600MHz, 128kb L2 cache). Elsewhere I had access to a new Sun server with a 900MHz UltraSPARC IV (or IV+) CPU and an HP server with a PA RISC CPU.

I also implemented other state of the art algorithms including Alphasort along with the old sort algorithm used by Oracle. From testing I learned:
  1. my new sort was much faster than other algorithms when keys were larger than 8 bytes
  2. my new sort was faster on my old Pentium 3 CPU than on the Sun UltraSPARC IV
The first was great news for me; the second was less than great news for Sun shareholders. I never learned why the UltraSPARC IV performance was lousy. It might have been latency to the caches.

Real implementation

Once I had great results, it was time for the functional and design specification reviews. I remember two issues:
  • the old sort was stable, the new sort was not
    • I don't remember how this concern was addressed
  • the new sort has a bad, but unlikely, worst-case
    • The problem here is the worst case, when quicksort picks the worst pivot every time it selects a pivot. The new sort wasn't naive; it used the median from a sample of keys each time to select a pivot (the sample size might have been 5). So I did the math to estimate the risk. Given that the numbers are big and the probabilities are small, I needed a library or tool that supported arbitrary-precision arithmetic and ended up using a Scheme implementation. The speedup in most cases justified the risk in a few cases. A toy version of that kind of estimate is sketched below.
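A toy version of that kind of estimate, using Python's exact rational arithmetic instead of Scheme; the 10% cutoff, the sample size of 5 and the independence assumption are mine, not from the original analysis.

from fractions import Fraction
from math import comb

def p_bad_pivot(q=Fraction(1, 10), sample=5):
    # P(median of `sample` uniform draws < q) = P(a majority of the draws < q)
    need = sample // 2 + 1
    return sum(comb(sample, k) * q**k * (1 - q)**(sample - k)
               for k in range(need, sample + 1))

p = p_bad_pivot()
print(p, float(p))     # 107/12500, about 0.86% per partitioning step
print(float(p ** 20))  # a bad pivot 20 levels in a row is vanishingly unlikely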
And once I had this implemented within the Oracle DBMS I was able to compare it with the old sort. The new sort was often about 5 times faster than the old sort. I then compared it with SyncSort. I don't remember whether they had a DeWitt Clause so I won't share the results but I will say that the new sort in Oracle looked great in comparison.

The End

The new sort landed in 10gR2 and was featured in a white paper. I also got a short email from Larry Ellison thanking me for the work. A promotion or bonus would have to wait, as you had to play the long game in your career at Oracle. And that was all the motivation I needed to leave Oracle -- first for a startup, and then to Google and Facebook.

After leaving Oracle, much of my time was spent on making MySQL better. Great open-source DBMS, like MySQL and PostgreSQL, were not good for Oracle's new license revenue. Oracle is a better DBMS, but not everyone needs it or can afford it.
