Comments on Small Datum: "Aurora for MySQL is coming" (blog by Mark Callaghan; 20 comments)

Anurag@AWS (2014-11-24 23:36):
$1.44/hour is for a 1-year term; $0.96/hour is for a 3-year term. The up-front fees are also different.

Incremental backup is not the same as the redo log; it just means the changes to the data files, no different from what you do when backing up your home computer.

Anurag@AWS (2014-11-24 23:33):
Not really. You get acks back for writes issued asynchronously. Once enough come back for the things you need, you can ack the commit upwards. That said, concurrent updates to a single row certainly cause contention - that's the nature of ACID - but not for the reason you say. You can group commit updates to a single row just fine. There is no need to sync each commit separately.

Anurag@AWS (2014-11-24 23:26):
For now, we kept the coarse-grain invalidation as in MySQL today. I agree with you that this has issues, but we don't want to change the db surface area, and tracking this at a low level in the kernel is complex, particularly given SQL's expressiveness. In any case, I think the question on query caching is more about the penalty of the query cache lookup (and locking) than the hit rate. That's fixable.
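Anurag's earlier reply ("You can group commit updates to a single row just fine. There is no need to sync each commit separately.") can be sketched as follows. This is a minimal illustration with a hypothetical `GroupCommitLog` class and `sync` callback, not Aurora's or MySQL's actual implementation:

```python
# Minimal sketch of group commit: commits that arrive before a flush are
# queued and acknowledged together by one durable write, instead of one
# fsync per transaction. `sync` is a hypothetical durable-write callback.

class GroupCommitLog:
    def __init__(self, sync):
        self.sync = sync      # callback that durably writes a batch (e.g. fsync)
        self.pending = []     # transactions waiting for the next sync
        self.syncs = 0        # number of durable writes actually issued

    def commit(self, txn):
        # Committing only queues the transaction; it is acked at flush time.
        self.pending.append(txn)

    def flush(self):
        # One durable write acknowledges every pending commit, even when
        # several of them updated the same row.
        if not self.pending:
            return []
        self.sync(self.pending)
        self.syncs += 1
        acked, self.pending = self.pending, []
        return acked

log = GroupCommitLog(sync=lambda batch: None)
for i in range(100):      # 100 sequential updates to the same row
    log.commit("UPDATE t SET v=%d WHERE id=1" % i)

assert len(log.flush()) == 100   # all 100 commits acknowledged...
assert log.syncs == 1            # ...by a single durable write, not 100
```

The point of the sketch is that serializing conflicting commits does not require one sync per commit; the sync cost is amortized across the whole batch.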
Mark Callaghan (2014-11-21 14:42):
The pricing guide at http://aws.amazon.com/rds/aurora/pricing/ shows $1.44/hour for a reserved r3.8xlarge.

For backup, I am skeptical that incremental backup can go too far back in time, because of the extra time to apply deltas during recovery and because servers with high update rates will have huge deltas. Regardless, people will figure that out soon.

Mark Callaghan (2014-11-21 14:39):
How is invalidation done for the query cache? I have seen caches work when invalidation is fine-grained (invalidate cached results based on the keys that are changed), but that has required users to specify cache keys to invalidate as part of insert, update and delete statements. I have also seen query caches that don't work because results are invalidated whenever a table is modified. Without any details on your query cache, performance people won't be able to figure out whether it can be used. I assume that docs will eventually arrive. People are impatient for details because Aurora looks really impressive.

Mark Callaghan (2014-11-21 14:34):
Given the use of sync replication I assume:
* commit latency is one network roundtrip between AZs and at least one fsync
* commits for transactions that don't conflict can be in flight concurrently
* commits for transactions that conflict are serialized
* so a workload with concurrent updates to one row gets a throughput of about 1 / network-round-trip

Mark Callaghan (2014-11-21 14:32):
Does this mean you implemented a new storage layer under InnoDB? That is interesting, and it would be good news to learn you can reuse that code. Oh, and thanks for answering questions here.

Anurag@AWS (2014-11-21 14:11):
A slight clarification: that's our design goal with respect to MySQL 5.6 using the InnoDB storage engine. There are of course many other storage engines, and it is hard to be compatible with the behavior across multiple engines. We also plan to retain compatibility as MySQL 5.7 and later versions emerge, though it is always hard to make statements about the future.

Anurag@AWS (2014-11-21 13:48):
Sorry, I didn't provide a sig on the prior response to Harish.
It was from Anurag@AWS.

Anurag@AWS (2014-11-21 13:45):
I think some of the confusion we've caused is in the difference between replication to Replicas and replication of storage. Harish's note is largely right, and I've added on where necessary.

Beyond that, the storage subsystem is new/custom, not based on EBS. Replication is on a 10GB chunk basis, so failures have a reduced blast radius relative to replication failure of an entire cluster. The buffer pool survives unplanned restart.

Feel free to reach out to me directly if you have further questions (awgupta@amazon.com).

Anurag@AWS (2014-11-21 13:16):
The pricing estimate in your post isn't quite right. 3YR RI pricing for an r3.8xlarge is $20K upfront + $0.96/hour. That's 20,000 + 0.96 * 8,760 hr/yr * 3 yrs = $45,228.80 per box. For your scenario of two boxes, $90,457.60.

We don't charge for backup up to the instance storage size; beyond that, we charge at prevailing S3 rates, which start at $0.03/GB-month. I'm not sure it is that easy to generate 50TB of backup for a 3TB database that does incremental (log-structured) backup. You can if you take a lot of user snapshots and rewrite your dataset monthly, but that seems extreme.

Storage is shared across the primary and replicas. Remember that you're paying for the storage and IOs you actually use, rather than what is peak-provisioned.
For most customers, that is 5x lower for used vs provisioned storage and 10-20x for average IOs vs peak IOPS.

Anurag@AWS (2014-11-21 12:54):
We were trying to compare the best number in Aurora vs the best number in MySQL. My presentation shows our numbers with and without the query cache, as well as those for MySQL based on our measurements. I do believe that (properly written) query caching is valuable for database customers.

I'd also agree that performance doesn't matter if the database isn't up. We spent a lot of time on availability for this reason.

Anurag@AWS (2014-11-21 12:52):
We expect to support the full MySQL surface area.

Anurag@AWS (2014-11-21 12:51):
Harish, thanks for the careful read-through. Here are some responses.

We asynchronously write to 6 copies and ack the write when we see four completions. So, traditional 4/6 quorums with synchrony, as you surmised. Now, each log record can end up with an independent quorum from any other log record, which helps with jitter but introduces some sophistication in the recovery protocols. We fill in holes peer-to-peer. We also repair bad segments in the background, and downgrade to a 3/4 quorum if unable to place in an AZ for any extended period.
You need a pretty bad failure to get a write outage.

You're also right about how Aurora replicas work. One clarification: most of the time we are applying the redo change in the replica's buffer pool rather than invalidating. That helps with blocks seeing a lot of both reads and writes. And, of course, down at this layer there is no need to wait for transaction commit before pushing to the replica; all the standard MVCC code works.

Failover is as you describe; we're mostly just purging old records out of the cache to catch up. The redo log itself provides a clean model for seeing what needs to be done. It helps a lot in this world to have a definitive LSN "clock" rather than the fuzzy wall-clock times used in other replication schemes.

Mark Callaghan (2014-11-21 06:14):
Better perf in Aurora will be nice if it is really there, but I think the real story is better availability and manageability: auto-failover, incremental db size growth up to 64T, 6X replication, etc.

Anonymous (2014-11-21 05:55):
It means apples are being compared to oranges, where a private take on the query cache responding to simple sysbench queries is being compared to MySQL with the query cache disabled.

Harish M (blog.kodekabuki.com) (2014-11-21 01:44):
Reading the FAQs and watching the video carefully, I get the sense that it's synchronous replication at the
storage layer. The key idea of the product is scale-out, network-attached SSD storage (optimized for the DB workload). The underlying storage layer is replicated 6x (2x in each of 3 different AZs). I'm going to guess they use distributed consensus for all writes to the storage layer (you need 4 out of 6 for a majority quorum, so you can tolerate 2 failures and still be up for writes).

On top of this storage layer, they probably have multiple EC2 instances running their custom MySQL daemons: one of them is the leader, which accepts new writes, and a bunch are read-only slaves (they're calling them Aurora replicas to differentiate them from regular MySQL binlog-based replicas). If you think about it, the main job of the slaves is to keep invalidating dirty pages from their bufpool (if any). They don't actually have to go through the mechanics of committing a txn, because the txn has already been committed to the storage node in this replica's AZ. This would explain why they can do low-millisecond replication lag for the Aurora replicas, compared to the typical "several seconds" of replication lag for normal MySQL replicas.

Doing a failover in this model should simply involve electing a new leader among the Aurora slaves, letting that slave catch up to all updates from the storage layer, and then accepting new writes.
And so you get zero data loss and a really quick failover with a warm bufpool.

Mark Callaghan (2014-11-20 15:29):
What does it mean for Aurora benchmarks to have a 100% hit rate for the result set cache?
https://media.amazonwebservices.com/blog/2014/os_rds_full_page_mon_2.png

Mark Callaghan (2014-11-20 11:29):
Or text? Or GIS? Or character sets?

Tim Callaghan (2014-11-20 11:09):
Does it support foreign keys? I thought I saw that it doesn't.
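The 4/6 write quorum discussed in this thread (writes issued to 6 storage segments, the commit acked once 4 completions come back) can be sketched as below. The `send` callback and segment numbering are hypothetical; the real system issues these writes asynchronously, while this sketch serializes them for clarity:

```python
# Rough sketch of a 4-of-6 write quorum: a log record goes to 6 storage
# segments (2 per AZ) and the write succeeds once any 4 of them ack.
# `send` is a hypothetical transport callback returning True on ack.

REPLICAS = 6        # 2 copies in each of 3 availability zones
WRITE_QUORUM = 4    # tolerate 2 lost/slow copies and still commit

def write_log_record(record, send):
    """Return True once a write quorum of segments has acked `record`."""
    acks = 0
    for segment in range(REPLICAS):
        if send(segment, record):
            acks += 1
        if acks >= WRITE_QUORUM:
            return True   # quorum reached; stragglers finish in background
    return False          # fewer than 4 of 6 reachable: a write outage

# Losing any two copies (here segments 0 and 3) still commits:
assert write_log_record(b"redo", lambda seg, rec: seg not in (0, 3))

# Losing three copies breaks the write quorum:
assert not write_log_record(b"redo", lambda seg, rec: seg not in (0, 1, 2))
```

Because each log record forms its own quorum independently, one slow segment delays only the records that happen to need it, which is the jitter benefit Anurag mentions; the cost is the more involved recovery protocol needed to find and fill holes.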