Durability is about the probability of losing data whether that data was written to the DBMS yesterday or recently. I describe this as a probability rather than systems that can or cannot lose data. I have spent too much time with web-scale deployments to claim that transactions will never be lost.
Backups prevent loss of data added yesterday or last year. Making recent changes durable is harder and comes at a cost in complexity and commit latency. The solution is to get commit log entries onto multiple servers before something is considered committed. There are fascinating and complicated ways to do this. Fortunately most of us can make use of Raft and Paxos implementations written by experts.
While my team at Google was the first to implement semisync replication, someone else gets credit for the idea to make it lossless. And that turned out to be a great idea. Another interesting idea is the way that MongoDB has separated the read and write paths to provide strongly consistent reads and highly durable writes without implementing synchronous replication on the commit path (Raft is used elsewhere). This is clever and flexible. Users get the semantics they need while only paying for the stronger semantics when desired.
Availability is about avoiding big and small downtime. An obvious source of big downtime is the time required to restore from a backup and replay archived commit logs after a server fails. The cost of better availability is more hardware because this is frequently done via replica sets, something that looks like synchronous replication and one of:
- Single-primary elected via consensus with fast & automated failover and failure detection
- Multi-master - I assume this means no failover, which might be the fastest kind of failover
Small downtime is another way to reduce availability and deserves more PR. Sources of small downtime include:
- Too slow commit - not all clients can wait, so too slow might also imply data loss.
- Java GC stalls - I am happy to have little experience with servers written in Java
- Flash GC stalls - NAND flash response time latency isn't great once you add enough 9s to percentile response time latency
- Flash TRIM stalls - we rarely discuss this in public
- Other stalls - more things that don't get enough discussion
- Overloaded server - who is at fault for this. It seems like users are more willing to blame the DBMS for being too slow when subject to too much concurrent load than they are willing to blame a storage device for lousy response times when subject to ridiculous numbers of concurrent requests
I once worked on a self-service MySQL project that was ahead of its time. Unfortunately I left the company before it was deployed. I think it succeeded for many years. Since I was a part time product manager on the effort I think of the cross-product of durability and availability via two levels of durability (more lossy, less lossy) and three levels of availability (low, medium, high).
For durability more lossy is provided via async replication or async log shipping. There are weak bounds on the number of commits that might be lost when the primary disappears. Less lossy durability usually requires synchronous replication.
The availability levels are implemented by:
- low - on failure of the primary restore from backup and replay the archived commit log. This can take hours.
- medium - use HA storage (EBS, Google Persistent Disk, S3, etc). On failure of the primary setup a new compute node, run crash recovery and continue. This takes less time than the low availability solution. You still need backups and commit log archives as the HA storage will occasionally have problems.
- high - use replication to maintain a failover target. Hopefully the failover target can handle queries to get more value from the extra hardware.
At this point most of the combinations of the following are useful. However resources are finite and it might not be a good idea to provide all of them (jack of all trades vs master of none):
(more, less lossy durability) X (low, medium, high availability)