Friday, January 31, 2020

Durability vs Availability

I don't consider myself an expert on distributed systems, but sometimes I write on the topic anyway. I hope any mistakes here aren't too serious. I like to describe things, so here is my attempt.


Durability is about the probability of losing data, whether that data was written to the DBMS yesterday or a moment ago. I describe this as a probability rather than as a binary property (systems that can or cannot lose data). I have spent too much time with web-scale deployments to claim that transactions will never be lost.

Backups prevent loss of data added yesterday or last year. Making recent changes durable is harder and comes at a cost in complexity and commit latency. The solution is to get commit log entries onto multiple servers before something is considered committed. There are fascinating and complicated ways to do this. Fortunately most of us can make use of Raft and Paxos implementations written by experts.

While my team at Google was the first to implement semisync replication, someone else gets credit for the idea of making it lossless. And that turned out to be a great idea. Another interesting idea is the way that MongoDB has separated the read and write paths to provide strongly consistent reads and highly durable writes without implementing synchronous replication on the commit path (Raft is used elsewhere). This is clever and flexible. Users get the semantics they need while only paying for the stronger semantics when desired.
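To make the lossless idea concrete, here is a minimal sketch (my simplification in Python, not MySQL's actual implementation) of the two commit orderings. The difference is whether the client is acked before or after a replica has the commit log entry:

```python
class Node:
    """A toy log holder standing in for a primary's engine or a replica."""
    def __init__(self):
        self.log = []

    def append(self, entry):
        self.log.append(entry)

def commit_lossy(txn, engine, replica):
    # Commit locally and ack the client, then ship the log entry.
    # A primary crash between the two steps loses an acked transaction.
    engine.append(txn)
    replica.append(txn)
    return "ack"

def commit_lossless(txn, engine, replica):
    # Wait for the replica to have the log entry before committing
    # locally and acking the client. A crash at any point loses no
    # acked transaction because the replica can supply the entry.
    replica.append(txn)
    engine.append(txn)
    return "ack"
```

The real protocols add group commit, timeouts, and fallback to async when replicas are unreachable; the sketch only shows the ordering that decides whether an acked commit can be lost.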


Availability is about avoiding downtime, big and small. An obvious source of big downtime is the time required to restore from a backup and replay archived commit logs after a server fails. The cost of better availability is more hardware, because it is frequently provided via replica sets: something that looks like synchronous replication plus one of:
  • Single-primary elected via consensus with fast & automated failover and failure detection
  • Multi-master - I assume this means no failover, which might be the fastest kind of failover

Small Downtime

Small downtime is another way to reduce availability and deserves more PR. Sources of small downtime include:
  • Too slow commit - not all clients can wait, so too slow might also imply data loss.
  • Java GC stalls - I am happy to have little experience with servers written in Java
  • Flash GC stalls - NAND flash response times aren't great once you add enough 9s to the percentiles
  • Flash TRIM stalls - we rarely discuss this in public
  • Other stalls - more things that don't get enough discussion
  • Overloaded server - who is at fault for this? Users seem more willing to blame the DBMS for being too slow under too much concurrent load than to blame a storage device for lousy response times under a ridiculous number of concurrent requests

Implementation Details

I once worked on a self-service MySQL project that was ahead of its time. Unfortunately I left the company before it was deployed, but I think it succeeded for many years. Since I was a part-time product manager on the effort, I think of the cross-product of durability and availability via two levels of durability (more lossy, less lossy) and three levels of availability (low, medium, high).

For durability, more lossy is provided via async replication or async log shipping. There are weak bounds on the number of commits that might be lost when the primary disappears. Less lossy durability usually requires synchronous replication.

The availability levels are implemented by:
  • low - on failure of the primary, restore from backup and replay the archived commit log. This can take hours.
  • medium - use HA storage (EBS, Google Persistent Disk, S3, etc). On failure of the primary, set up a new compute node, run crash recovery and continue. This takes less time than the low availability solution. You still need backups and commit log archives because the HA storage will occasionally have problems.
  • high - use replication to maintain a failover target. Hopefully the failover target can handle queries to get more value from the extra hardware. 
Most of the combinations of the following are useful. However, resources are finite and it might not be a good idea to provide all of them (jack of all trades vs master of none):
   (more lossy, less lossy durability) X (low, medium, high availability)
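A small sketch of that cross-product (the level names come from the post; the loss and recovery summaries are my paraphrase of the descriptions above):

```python
from itertools import product

DURABILITY = {
    "more lossy": "async replication; weak bound on commits lost at failover",
    "less lossy": "sync replication; acked commits survive primary loss",
}

AVAILABILITY = {
    "low":    "restore backup, replay archived log (hours)",
    "medium": "HA storage, new compute node, crash recovery (minutes)",
    "high":   "replica set with a ready failover target (seconds)",
}

def service_levels():
    """Enumerate the durability X availability combinations."""
    return [
        {"durability": d, "availability": a,
         "loss": DURABILITY[d], "recovery": AVAILABILITY[a]}
        for d, a in product(DURABILITY, AVAILABILITY)
    ]

for level in service_levels():
    print(level["durability"], "x", level["availability"])
```

Six combinations fall out, and the point above stands: a team with finite resources probably should not try to ship all of them.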


  1. Hi Mark,

    I liked this one but think you missed talking about things such as NDB or Oracle RAC, which are beyond the current list of availability options in the post. You covered them in failover, but I think a "max" availability design where a DBMS can automatically retry some or all queries (mongo also here) is an important call out.

    1. RAC is HA for compute. Sometimes I see that as stone soup: just add HA storage and you have an HA solution. But HA storage is the more valuable thing, and maybe Oracle missed a huge opportunity by focusing RAC on HA compute rather than HA storage.

      Not sure what you mean by "beyond current list of availability". People are welcome to report what kind of availability these deliver in practice. If the suggestion is that these two max out availability vs alternatives, then I am skeptical. Also, I prefer a solution that can run in a commodity data center on commodity HW -- not expensive shared disk, not an isolated, purpose-built setup with resources shared by nothing else. Such constraints mean that these won't be widely used.

      For a solution that hides server blips (no transaction restart needed) and is OSS, see Comdb2 at

    2. Also, I assume the value added by NDB and RAC is mostly within a LAN. Still need to figure out a WAN solution, either because the workload is global or for disaster recovery.

    3. This is a fun topic. If we are considering max HA without regard to cost then non-stop style systems (Tandem, whatever is there now) have to be discussed. But I prefer to consider max HA per dollar rather than max HA regardless of cost.

  2. Durability: To expand a little, when I worked on a spinning disk cluster of several thousand nodes (web crawler for YST) I tracked the probability of node failure per day based on how many nodes were auto-remediated out of service (p-fail or probability of failure). At the time, each shard had two replicas.

    In this system, when I spoke about durability I claimed that the probability of losing a shard completely was p-fail^2 (death of the primary and the secondary). But with the caveat that there was an 8 hour window for a fresh replica to come online, during which the probability was just p-fail, or one could call it p-fail * (8hr/24hr). I was never quite sure if that was a good enough explanation.
    We changed the system such that each node spread its shard replicas across many nodes (before, all shards from one node were backed up to one other node). This meant that you were more likely to lose a single shard, but the time to recover went down, and the amount of data lost also went down.

    Some applications can accept some amount of data loss gracefully but for others, any loss can be a catastrophe. I would call this "loss-risk" and say it was 1/num_shards as a percentage, so 50 shards is a 2% loss-risk and 1,000 shards is a 0.1% loss-risk.
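    The arithmetic in this comment can be written out directly. A small sketch, where p_fail is the per-day node failure probability and the window model is my reading of the caveat above (the surviving replica must also fail while the re-replication window is open):

```python
def p_shard_loss_per_day(p_fail, rebuild_hours=8.0):
    """Probability that the surviving replica also fails during the
    re-replication window, scaling by the fraction of a day that the
    window is open."""
    return p_fail * (p_fail * rebuild_hours / 24.0)

def loss_risk_pct(num_shards):
    """Fraction of the dataset lost, as a percent, if one shard is
    lost completely."""
    return 100.0 / num_shards

print(loss_risk_pct(50))    # 2.0 -> 50 shards is a 2% loss-risk
print(loss_risk_pct(1000))  # 0.1
```

    Shrinking rebuild_hours is exactly the effect of spreading replicas across many nodes, which is why that change reduced both the window and the amount of data at risk.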

    Although I've always wanted a clear way to express these things, I like your model of classifying these with low/high durability X low/med/high availability. It's always been enough in my experience for actual planning and steering.


    1. While at FB I saw some really interesting math on the probability of failures in large clusters. That is a fun + hard problem, much more so than what I am writing about here. Makes me think a bit about black swan failures.

  3. No mention of the accelerated database recovery in SQL Server 2019, which cuts recovery time from the transaction log from hours to minutes or seconds

    1. Lots of great things haven't been mentioned. I have a vague idea that Oracle also has optimizations to boost recovery time. This is easier to make fast with an LSM -- as recovery in that case is just replaying Puts into the memtable -- not much IO.

      But I try to avoid products with a DeWitt Clause. I like to write about perf and that clause would prevent me from doing so. A famous database prof once strongly asserted that SQL Server would do awesome on Linkbench and the insert benchmark. I am happy for someone to show me that is true.