Tuesday, February 18, 2020

New school vs old school DBMS

There are two approaches to DBMS deployment
  1. Old school - pay whatever it takes to keep that one instance running
  2. New school - allow for failure and make failover reliable
At what point do you stop paying to increase availability and durability of one DBMS instance and instead spend that money and energy elsewhere? The old school approach was more popular in the days before cloud and web-scale. Of course early in the cloud and web-scale days we allowed for failure without making failover reliable. Those weren't fun times.

That y must design for failures also doesn't mean you must tolerate lousy hardware. Just because failover should be fast, reliable and lossless doesn't mean you want it running too frequently. There is a difference between commodity and cheap hardware. On the other hand, asking for too much special hardware won't win you friends on the datacenter HW team. If the DBMS is the only thing asking for redundant power, redundant network, special cooling or HW RAID with battery-backed write cache then the DBMS is a problem for web-scale. I know this from experience.

One reason for the old school approach was the large cost of the DBMS licenses and the SMP server on which it ran. Users are motivated to not buy 3X the licenses when the cost is sufficiently large and instead invest in an environment that helps that DBMS keep running (more HA hardware). Any comparison between new school and old school should consider whether the goal is durability or availability because availability costs more than durability. You can try to provide one or both. In this post I am writing about something that needs both.

Storage is one place where you can spend a lot of money to keep that one instance running. That can be done via RAID-10 where the impact is buying twice more storage devices. Or it can be done via a HA storage solution that provides impressive levels of performance, availability and durability.

One thing I am curious about is whether SSD devices need RAID-10. All storage devices have some chance of failure so the question isn't whether RAID-10 is useful. The question is whether the usefulness outweighs the cost. I assume the answer will depend on the device. Samsung advertises fail in place (FIP) as a new feature. But chip fail protection and wear leveling have been here for a long time. For which devices is that sufficient so that RAID-10 isn't recommended? Do vendors make this clear in their docs?

New School

The new school approach must do two things: make durable commits fast, make failover fast and lossless.

Fast durable commits are usually done via sync replication. I think of sync log shipping as a variant of sync replication where the end point is a log archive rather than a replica. But some sync replication solutions already support some of that via a witness.

MongoDB provides fast durable commits via async replication. I was impressed when I first learned of the implementation. The property needed for durable commits is to avoid making writes visible before they are durable. Sync replication is an implementation detail. With MongoDB and the majority read concern a snapshot on the primary is advanced to track the point-in-time at which commits are durable (applied on enough replicas).

Whether durable commits are fast depends on where you place the replicas. Speed of light matters. When replicas are far apart then there will be more commit latency. When replicas are not far apart there might be a larger HW bill. Witnesses (log only replicas) can help here.

Lossless failover assumes a solution for durable commit. Once durable commit has been provided then fast failover is a matter of detecting failures, electing a new primary, promoting the replica to be a primary and then directing traffic to the new primary. There are many details here and plenty of opportunities for mistakes. I know from having helped make some of them. Fortunately the rise of web-scale DBMS means that we get solutions that work.

Things that I ignore in this post:
  • Fast failure is a challenging problem depending on how fast you want it to be. 
  • Systems that allow multiple replicas to initiate writes might not expose failover to a client, but many of the problems described here are still solved under the covers by such systems.
  • Even for systems that don't have explicit failover there is still an impact on clients from failed in-progress transactions. Although Comdb2 hides that from clients.

No comments:

Post a Comment

Evaluating vector indexes in MariaDB and pgvector: part 2

This post has results from the ann-benchmarks with the   fashion-mnist-784-euclidean  dataset for MariaDB and Postgres (pgvector) with conc...