Wednesday, June 19, 2024

Stalls from TRIM after deleting a lot of data

If you delete a lot of data in a short amount of time and are using the discard option when mounting a filesystem on an SSD, then there might be stalls because some SSDs can't process TRIM as fast as you want.

Long ago, Domas wrote a post on this and shared a link to the slowrm utility that can be used to delete data in a way that doesn't lead to stalls from an SSD that isn't fast enough at TRIM. 
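I have not looked at how slowrm is implemented, so the following is only a sketch of the general idea rather than its actual behavior: shrink a file in chunks, with a pause between chunks, so that TRIM is issued gradually instead of in one large burst. The chunk size and sleep interval here are made-up values.

#!/usr/bin/env bash
# slow_delete.sh - sketch of the slowrm idea, not the real utility
# Shrink a file in chunks with a pause between chunks so the filesystem
# issues TRIM gradually instead of all at once.
# Usage: bash slow_delete.sh /path/to/big_file
f="$1"
chunk=$((1024 * 1024 * 1024))    # shrink by 1 GB per step (arbitrary)
size=$(stat -c %s "$f")
while [ "$size" -gt "$chunk" ]; do
  size=$((size - chunk))
  truncate -s "$size" "$f"       # drop the last 1 GB of the file
  sleep 1                        # give the device time to process the TRIM
done
rm "$f"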

tl;dr

  • Deleting a large amount of data in a short amount of time can lead to IO stalls when a filesystem is mounted with the discard option, and this appears to be specific to some device models
Updates:
  • Useful feedback on Twitter was that I used a consumer SSD and this won't happen on enterprise SSDs. So I repeated the test on two more devices, one described as enterprise and the other as datacenter. The stall reproduced on one of those devices.

Some background

  • Using the discard option can be a good idea because it improves SSD endurance (see the mount example after this list)
  • The rate at which an SSD can process TRIM appears to be vendor-specific. The first SSD I used in production was from FusionIO and it was extra fast at TRIM. Some SSDs that followed have not been, but I won't name the devices I have used at work.
  • We need trimbench to make it easier to identify which devices suffer from this, although by trimbench I mean a proper benchmark client, not the script I used here. The real trimbench would have more complex workloads.
  • While deleting files is more common with an LSM storage engine than a b-tree, there are still large deletions in production with a b-tree. One example is deleting log files (database or other) after they have been archived. A more common example might be the various background jobs running on your database HW that exist to keep things healthy (except when they drop files too fast). Another example is the temp files used for complex queries when joins and sorts spill to disk.
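For readers who have not used it, a minimal example of the two common approaches follows: online discard via the mount option versus periodic fstrim. The device name and mount point are placeholders and this is not the configuration used for the tests below.

# Online discard: TRIM is issued as extents are freed (for example at unlink).
mount -o discard /dev/nvme0n1p1 /data
# or via /etc/fstab:
#   /dev/nvme0n1p1  /data  xfs  defaults,discard  0  2

# The alternative: mount without discard and TRIM periodically, which moves
# the cost to a scheduled window instead of paying it at delete time.
fstrim -v /data
systemctl enable --now fstrim.timer    # weekly TRIM via systemd on Ubuntu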
trimbench

I wrote a script to measure the impact of a large amount of TRIM on read IO performance and it is here. I tested it on two servers I have at home that are described here as v4 (SER4) and v6 (socket2):
  • SER4 - runs Ubuntu 22.04 with both the non-HWE (5.x) and HWE (6.5.x) kernels. Storage is a Samsung 980 Pro with XFS. Tests were run for the filesystem mounted with and without the discard option
  • socket2 - runs Ubuntu 22.04 with the HWE (6.5.x) kernel. There were two device types: a Samsung 870 EVO and a pair of Crucial T500 (CT2000T500SSD5). Tests were repeated with and without the discard option and for both XFS and ext4.
The trimbench workload is here, with a simplified sketch after the list:
  • Create a large file that will soon be deleted, then sleep
  • Create a smaller file from which reads will be done, then sleep
  • Delete the large file
  • Concurrent with the delete, use fio to read from the smaller file using O_DIRECT
  • Look at the results from iostat at 1-second intervals to see what happens to read throughput when the delete is in progress
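The exact arguments are in run.sh in the repo; the following is only a simplified sketch of the steps above, with made-up paths, sizes and fio settings rather than what run.sh actually uses.

# Sketch of the trimbench steps -- paths, sizes and fio options are placeholders.

# 1. Create a large file that will soon be deleted, then sleep.
fio --name=bigfile --filename=/data/m/big_file --size=512G \
    --rw=write --bs=1M --direct=1
sleep 30

# 2. Create a smaller file from which reads will be done, then sleep.
fio --name=readfile --filename=/data/m/read_file --size=32G \
    --rw=write --bs=1M --direct=1
sleep 30

# 3. Watch read and discard throughput at 1-second intervals.
iostat -mx 1 200 > iostat.out &

# 4. Delete the large file while fio reads from the smaller file with O_DIRECT.
rm /data/m/big_file &
fio --name=reads --filename=/data/m/read_file --size=32G \
    --rw=randread --bs=4K --direct=1 --ioengine=io_uring \
    --iodepth=8 --runtime=90 --time_based
wait    # wait for the rm and the 200 iostat samples to finish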
Results

For these tests:
  • Samsung 980 Pro - could not reproduce a problem and processes TRIM at 524 GB/s (or more)
  • Samsung 870 Evo - could reproduce a problem and processes TRIM at ~70 GB/s
  • Crucial T500 - could reproduce the problem and processes TRIM at ~63 GB/s
The SER4 test created and dropped a 512 GB file with this command line:
bash run.sh /path/to/f 512 32 30 90 test 8

The socket2 test created and dropped a 1 TB file with this command line:
bash run.sh /path/to/f 1024 32 30 90 test 8

The run.sh script has since been updated to add one more parameter, the value for fio --ioengine=$X, and examples are now:

bash run.sh /data/m/f 4096 128 90 300 test 24 io_uring

bash run.sh /data/m/f 8192 256 90 300 test 32 io_uring

bash run.sh /data/m/f 8192 256 90 300 test 32 libaio


Results from iostat for the Samsung 980 Pro are here, which show that read IOPs drop by ~5% while the TRIM is in progress and that the TRIM is processed at 524 GB/s (or more). The read IOPs are ~110k /s before and after the TRIM. They drop to ~105k /s for the 1-second interval in which the TRIM is done. By "or more" I mean that the TRIM is likely done in less than one second, and repeating the test with a larger file to drop might show a larger rate in GB/s for TRIM processing.

Results from iostat for the Samsung 870 Evo are here, which show that read IOPs drop from ~58k /s to ~5k /s for the 16-second interval in which the TRIM is done, and the TRIM is processed at ~70 GB/s.

Results from iostat for the Crucial T500 are here, which show that read IOPs drop from ~92k /s to ~4k /s for the 18-second interval in which the TRIM is done, and the TRIM is processed at ~62 GB/s. I also have results for SW RAID 0 using 2 T500 devices and they also show a large drop in read IOPs when TRIM is processed.



3 comments:

  1. > While deleting files is more common with an LSM storage engine than a b-tree, there are still large deletions in production with a b-tree.

    Another example is when deleting the "old" table after an online schema change (gh-ost and pt-osc). Also, when purging data with DROP TABLE or DROP PARTITION.

    I understand you did IO bandwidth and IOPs tests; have you considered doing IO latency tests? ioping would allow such a test.

    Replies
    1. My current script runs fio and the fio output includes response time at various percentiles. That output is in the github repo; I just didn't look at it to save time.

  2. Maybe one can reliably reproduce the VirtualBox bug with this tool?
    https://superuser.com/questions/529149/how-to-compact-virtualboxs-vdi-file-size#comment2853473_1182422

