Wednesday, January 1, 2020

From disk to flashcache to flash

The past decade in database storage was interesting whether you stayed with local attach storage, used block & object storage from cloud or on-prem vendors or moved to OSS scale-out storage like Ceph, GlusterFS and MinIO. I am writing about my experience and will focus on local attach.

Over the past decade the DBMS deployments I cared for went from disk to flashcache to flash on the HW side and then from MySQL+InnoDB to MySQL+MyRocks on the SW side. I assume that HW changes faster than DBMS software. DBMS algorithms that can adapt to such changes will do better in the next decade.

One comment I have heard a few too many times is that storage performance doesn't matter much because you can fit the database in RAM. More recently I hear the same claim with Optane in place of RAM. I agree that this can be done for many workloads. I am less certain that it should be done for many workloads. Keeping all data in RAM/Optane costs a lot in money, power and even space in the data center. Let's make web-scale DBMS green. Use enough RAM/Optane for cache to meet the performance SLA and then use SSD or disk arrays. At some point there is no return from cutting the DBMS query response time in half but cutting the DBMS HW cost in half is usually a big deal.

Priorities

With disk and flashcache I worried a lot about the IOPs demand because the supply was limited, usually less than 2000 operations/second. On moving to flash I stopped worrying about that and began worrying about write and space amplification (efficiency).

The context for this is small data (OLTP) workloads and deployments where reducing HW cost matters. Overcoming the IOPs shortage was interesting at times and good for my career as there were always new problems that had to be fixed right now. Moving to flash made life easier for everyone. There was an opportunity cost from using disks -- time spent solving the latest IOPs demand crisis was time not spent on longer term projects. Moving to flash gave us time to build and deploy MyRocks.

MyRocks has better space and write efficiency than a b-tree. The cost of better space and write efficiency with an LSM is more CPU overhead for point and range queries. Sometimes that is a great trade. Better space and write efficiency means you buy less SSD and it lasts longer. Better write efficiency is a big deal with lower endurance (TLC and QLC) NAND flash. I wonder how this changes in the cloud. Cloud vendors might improve their profit margins with better space and write efficiency but they also have the ability to pass on some of the inefficiency costs to the user. A cloud user doesn't have to worry as much about write efficiency because they are renting the SSD.
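The claim that better write efficiency makes an SSD last longer can be put into a back-of-the-envelope calculation. The sketch below assumes the device has a rated endurance in terabytes written and that the DBMS turns each byte written by the application into `write_amp` bytes written to the device. The function name and every number are hypothetical and are not measurements of InnoDB or MyRocks.

```python
def ssd_lifetime_days(endurance_tb_written, app_write_gb_per_day, write_amp):
    """Estimate days until the rated endurance is used up.

    endurance_tb_written: rated device endurance in TB (1 TB = 1000 GB here)
    app_write_gb_per_day: GB/day of changes made by the application
    write_amp: device bytes written per application byte (assumed constant)
    """
    device_gb_per_day = app_write_gb_per_day * write_amp
    return (endurance_tb_written * 1000) / device_gb_per_day


if __name__ == "__main__":
    # Made-up inputs: 1000 TB rated endurance, 100 GB/day of application writes.
    for amp in (10, 5):
        days = ssd_lifetime_days(1000, 100, amp)
        print(f"write_amp={amp}: about {days:.0f} days")
```

Under these assumptions the estimate is inversely proportional to write amplification, so halving it doubles the estimated device lifetime. The model ignores write amplification inside the device (flash GC), which adds to the total.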

Hardware

This is my history with storage for web-scale MySQL. The NUC servers I use today have similar RAM/CPU as the DBMS servers I started with in 2005 but the NUC servers have much more IO capacity.

First there were disk arrays with HW RAID and SW RAID. This was RAID 10 which was better for durability than availability. Data isn't lost on a single-disk failure but the server performance is unacceptable when a HW RAID cache battery fails (fsync is too slow) or a rebuild is in progress after a disk gets replaced.

Then there was flashcache. Performance was wonderful when the read working set fit in the flash cache but there was an abrupt change in performance when it did not. Those were exciting years. Some of the performance critical parts of flashcache were in the Linux kernel. I lack kernel skills and it took us (really, Domas) a while to identify perf problems that were eventually fixed.

Then there was flash and the abundance of IOPs was wonderful. I look forward to the next decade.

Anecdotes

If you use disk arrays at scale then you will see corruption at scale. You are likely using multiple storage devices with multiple firmware revisions. It is interesting when 99% of corruption occurs on 1% of the deployment -- all on the same, new firmware revision. That result makes it easy to focus on the HW as the probable problem and stop blaming MySQL. I can't imagine doing web-scale DBMS without per-page checksums.
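To illustrate what a per-page checksum buys, here is a minimal sketch: a checksum is stored with each page when it is written and recomputed when the page is read. The 16KB page size, the layout (checksum in the first 4 bytes), the use of `zlib.crc32`, and the function names are assumptions for illustration and do not match the on-disk format of InnoDB or any other engine.

```python
import struct
import zlib

PAGE_SIZE = 16 * 1024   # assumed page size
CHECKSUM_BYTES = 4      # assumed layout: 4-byte CRC32 at the start of the page


def seal_page(payload: bytes) -> bytes:
    """Return a full page: checksum of the payload followed by the payload."""
    if len(payload) != PAGE_SIZE - CHECKSUM_BYTES:
        raise ValueError("payload must fill the page exactly")
    return struct.pack("<I", zlib.crc32(payload)) + payload


def verify_page(page: bytes) -> bool:
    """Recompute the checksum on read and compare it with the stored value."""
    if len(page) != PAGE_SIZE:
        return False
    (stored,) = struct.unpack_from("<I", page, 0)
    return stored == zlib.crc32(page[CHECKSUM_BYTES:])


if __name__ == "__main__":
    page = seal_page(bytes(PAGE_SIZE - CHECKSUM_BYTES))
    print("clean page ok:", verify_page(page))

    # Simulate a single flipped bit, as a buggy device or firmware might cause.
    damaged = bytearray(page)
    damaged[100] ^= 0x01
    print("damaged page ok:", verify_page(bytes(damaged)))
```

A failed check on read points at the storage stack rather than the DBMS logic, which is what makes it practical to correlate corruption with a device model or firmware revision.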

Performance variance with NAND flash is to be expected. I hope that more is done to explain and document it for OLTP workloads. The common problem is that NAND flash GC can stall reads on the device. I wish it were easier to characterize device performance for enthusiasts like myself. I am sure there is an interesting paper on this topic. How much concurrency does the device provide? How are writes done? How is GC done? What is the stall distribution? What can be done to reduce stalls (see multistream and LightNVM)?

Using TRIM (mount FS with discard) at scale is exciting. RocksDB and MyRocks do a lot of TRIM while InnoDB does not. How many GB/s and unlink/s of TRIM does the device support? TRIM performance varies greatly by vendor. I hope more is done to document these differences. Perhaps we need trimbench. People at web-scale companies have stories that never get shared because they don't want to throw their SSD vendors under the bus. I was spoiled by FusionIO. My memory is that FusionIO TRIM was a noop from a perf perspective.

Innosim is an InnoDB IO simulator that I wrote to help device vendors reproduce performance stalls we encountered with web-scale MySQL. It is easier to run than MySQL while able to generate similar IO patterns. I wrote it because InnoDB has a pattern of coordinated IO that fio wasn't able to reproduce. The pattern occurs during page write back -- first write the double write buffer (1MB or 2MB sequential write) and then do 64 or 128 random 16KB writes. Innosim also takes much less time to reach steady state -- just sequentially write out X GB of database pages versus load InnoDB and then run (for days) an update workload to fragment the indexes. Fragmentation takes time. I wish more DBMS benchmarks ran long enough to get sufficient fragmentation but that can be expensive.
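The write-back pattern described above can be sketched in a few lines. This is not Innosim's code; it is a minimal illustration that assumes a single file holding a 1MB doublewrite area followed by the database pages, with an fsync after the sequential write and another after the random writes. The file layout and the names `write_back_batch`, `db_pages` and `data_offset` are hypothetical.

```python
import os
import random
import tempfile

PAGE_SIZE = 16 * 1024                              # 16KB pages
DOUBLEWRITE_BYTES = 1 << 20                        # 1MB sequential write
PAGES_PER_BATCH = DOUBLEWRITE_BYTES // PAGE_SIZE   # 64 pages per batch


def write_back_batch(fd, db_pages, rng, data_offset=DOUBLEWRITE_BYTES):
    """Do one batch: a large sequential write, then 64 random page writes."""
    page_nos = rng.sample(range(db_pages), PAGES_PER_BATCH)
    buf = os.urandom(DOUBLEWRITE_BYTES)

    # Step 1: write all pages of the batch sequentially to the doublewrite area.
    os.pwrite(fd, buf, 0)
    os.fsync(fd)

    # Step 2: write each page in place at its (random) location in the database.
    for i, page_no in enumerate(page_nos):
        page = buf[i * PAGE_SIZE:(i + 1) * PAGE_SIZE]
        os.pwrite(fd, page, data_offset + page_no * PAGE_SIZE)
    os.fsync(fd)
    return page_nos


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        rng = random.Random(42)
        for _ in range(4):
            write_back_batch(fd, db_pages=1024, rng=rng)
        print("file size in bytes:", os.fstat(fd).st_size)
    finally:
        os.close(fd)
        os.remove(path)
```

A real simulator would use direct IO, concurrent writers and a much larger file on the device under test. The sketch only shows the coordination between the sequential write and the random writes that follow it.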

Perhaps one day I will write WTsim, the WiredTiger IO simulator. I wrote ldbsim, the LevelDB IO simulator, but it was rarely used because the RocksDB benchmark client, db_bench, was easy to use even if fragmenting the LSM tree still took a long time. I am not sure that fio would be able to reproduce the special IO patterns created by RocksDB compaction. I love fio but I am not sure it should try to solve this problem for me.

5 comments:

  1. Do any SSD vendors support multistream? It seems like with multistream you can avoid using trim, but still get most of the benefits of trim. If the vendors don't want to support trim well, maybe they can support multistream well...

    1. The multistream work I have seen has been done by Samsung. Multistream is in at least one of the important specs. Maybe the endurance benefits it provides are more important as the world moves to lower endurance NAND flash (QLC).

      I enjoyed hacking on RocksDB to make it use a PoC multistream implementation. That was easy.

  2. I see how the new, smarter FTL interaction (Open-channel SSD or Zoned Namespaces) helps LSM, but so far MongoDB has only used the WiredTiger API's default btree table type. So I don't think it will help reduce checkpoint stalls. Well it might but I'm not optimistic I guess.

    Are you hinting that you will test a version of MongoDB using WT's lsm type of table (instead of btree) for all the collection and indexes?

    1. I have yet to learn about the plans for the WT LSM in MongoDB. I also need more time to get current on WT before I trust my opinions. Regardless, I agree that multistream is less likely to benefit the WT b-tree.

    2. Thanks Mark. OK, got it. I was a bit surprised because trying out LSM-within-WiredTiger would have been a second exploration after (iirc) first attempts drew a negative conclusion about LSM in MongoDB (caveat: for the majority of use cases, but not all).

      But that was before these disks with very LSM-suitable FTL interfaces appeared, on top of being NVMe instead of NAND. So I could see there might be return on that investment. And I'm thinking it will have increasing dividends without further software work if the hardware folks keep on giving us new flash disks that get closer and closer to DRAM performance.

