Friday, January 31, 2020

Durability vs Availability

I don't consider myself an expert on distributed systems but sometimes I try to write on the topic. I hope any mistakes here aren't too serious. I like to describe things so here is my attempt.

Durability

Durability is about the probability of losing data whether that data was written to the DBMS yesterday or recently. I describe this as a probability rather than systems that can or cannot lose data. I have spent too much time with web-scale deployments to claim that transactions will never be lost.

Backups prevent loss of data added yesterday or last year. Making recent changes durable is harder and comes at a cost in complexity and commit latency. The solution is to get commit log entries onto multiple servers before something is considered committed. There are fascinating and complicated ways to do this. Fortunately most of us can make use of Raft and Paxos implementations written by experts.

While my team at Google was the first to implement semisync replication, someone else gets credit for the idea to make it lossless. And that turned out to be a great idea. Another interesting idea is the way that MongoDB has separated the read and write paths to provide strongly consistent reads and highly durable writes without implementing synchronous replication on the commit path (Raft is used elsewhere). This is clever and flexible. Users get the semantics they need while only paying for the stronger semantics when desired.

Availability

Availability is about avoiding big and small downtime. An obvious source of big downtime is the time required to restore from a backup and replay archived commit logs after a server fails. The cost of better availability is more hardware because this is frequently done via replica sets, something that looks like synchronous replication and one of:
  • Single-primary elected via consensus with fast & automated failover and failure detection
  • Multi-master - I assume this means no failover, which might be the fastest kind of failover

Small Downtime

Small downtime is another way to reduce availability and deserves more PR. Sources of small downtime include:
  • Too slow commit - not all clients can wait, so too slow might also imply data loss.
  • Java GC stalls - I am happy to have little experience with servers written in Java
  • Flash GC stalls - NAND flash response time latency isn't great once you add enough 9s to percentile response time latency
  • Flash TRIM stalls - we rarely discuss this in public
  • Other stalls - more things that don't get enough discussion
  • Overloaded server - who is at fault for this. It seems like users are more willing to blame the DBMS for being too slow when subject to too much concurrent load than they are willing to blame a storage device for lousy response times when subject to ridiculous numbers of concurrent requests
Implementation Details
I once worked on a self-service MySQL project that was ahead of its time. Unfortunately I left the company before it was deployed. I think it succeeded for many years. Since I was a part time product manager on the effort I think of the cross-product of durability and availability via two levels of durability (more lossy, less lossy) and three levels of availability (low, medium, high). 

For durability more lossy is provided via async replication or async log shipping. There are weak bounds on the number of commits that might be lost when the primary disappears. Less lossy durability usually requires synchronous replication.

The availability levels are implemented by:
  • low - on failure of the primary restore from backup and replay the archived commit log. This can take hours.
  • medium - use HA storage (EBS, Google Persistent Disk, S3, etc). On failure of the primary setup a new compute node, run crash recovery and continue. This takes less time than the low availability solution. You still need backups and commit log archives as the HA storage will occasionally have problems.
  • high - use replication to maintain a failover target. Hopefully the failover target can handle queries to get more value from the extra hardware. 
At this point most of the combinations of the following are useful. However resources are finite and it might not be a good idea to provide all of them (jack of all trades vs master of none):
   (more, less lossy durability) X (low, medium, high availability)

8 comments:

  1. Hi Mark,

    I liked this one but think you missed talking about thing such a NDB or Oracle RAC which are beyond the current list of availability in the post. You covered them in failover but I think a "max" availability design where a DBMS can automatically retry some or all queries (mongo also here) is an important call out.

    ReplyDelete
    Replies
    1. RAC is HA for compute. Sometimes I see that as stone soup, just add HA storage and you have an HA solution. But HA storage is the more valuable thing and maybe Oracle missed a huge opportunity with RAC with a focus on HA compute rather than storage.

      Not sure what you mean by "beyond current list of availability". People are welcome to report what kind of availability these deliver in practice. If there is suggestion that these two max out availability vs alternatives, then I am skeptical. Also I prefer solution that can run in commodity data center on commodity HW -- not expensive shared disk, not isolated purpose-built setup with resources shared by nothing else. Such constraints mean that these won't be widely used.

      For solution that hides server blips (no transaction restart needed) and OSS see Comdb2 at https://bloomberg.github.io/comdb2/overview_home.html

      Delete
    2. Also, I assume the value added by NDB and RAC is mostly within a LAN. Still need to figure out WAN solution either because workload is global or for disasters.

      Delete
    3. This is a fun topic. If we are considering max HA without regard to cost then non-stop style systems (Tandom, whatever is there now) have to be discussed. But I prefer to consider max HA per dollar rather than max HA regardless of cost.

      Delete
  2. Durability: To expand a little, when I worked on a spinning disk cluster of several thousand nodes (web crawler for YST) I tracked the probability of node failure per day based on how many nodes were auto-remediated out of service (p-fail or probability of failure). At the time, each shard had two replicas.

    In this system, when I spoke about durability I claimed that the probability of loosing a shard completely was p-fail^2 (death of primary and the secondary). But with the caveat that there was an 8 hour window for a fresh replica to come online, during which the probably was just p-fail, or one could call it p-fail * (8hr/24hr). I was never quite sure if that was a good enough explanation.
    We changed the system such that each node spread it's shard replicas to many nodes (before it had been that all shards from one node were backed up to one other node). This meant that you were more likely to loose a single shard, but the window of the time to recover went down, and the amount of data lost also went down.

    Some applications can accept some amount of data loss gracefully but for others, any loss can be a catastrophe. I would call this "loss-risk" and say it was 1/num_shards as a percentage, so 50 shards is a 2% loss-risk and 1,000 shards has a 0.1% loss-risk.

    Although I've always wanted a clear way to express these things, I like your model of classifying these with low/high durability X low/med/high availability. It's always been enough in my experience for actual planning and steering.

    Thanks,
    /charles

    ReplyDelete
    Replies
    1. While at FB I saw some really interesting math on probability failures in large cluster. That is a fun + hard problem, much more so that what I am writing about here. Makes me think a bit about black swan failures.

      Delete
  3. No mention of the accelerated database recovery in SQL Server 2019 which cuts down recovery time from the transaction log from hours to minutes or seconds

    ReplyDelete
    Replies
    1. Lots of great things haven't been mentioned. I have a vague idea that Oracle also has optimizations to boost recovery time. This is easier to make fast with an LSM -- as recovery in that case is just replaying Puts into the memtable -- not much IO.

      But I try to avoid products with a DeWitt Clause. I like to write about perf and that clause would prevent me. A famous database prof once strongly asserted that SQL Server would do awesome on Linkbench and the insert benchmark. I am happy for someone to show me that is true.

      Delete

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...