Thursday, November 20, 2014

Aurora for MySQL is coming

I am excited about Aurora for MySQL. While there aren't many details, I have two conclusions from the information that is available. First, many talented people did great work on this. Second, many customers want the features it provides and some of these features are otherwise unavailable unless you are web-scale and can afford a team of automation experts. This is a big deal and good for the MySQL community. I am not just writing this to boost my priority on the Aurora preview signup list.

I don't think it matters whether Aurora has better performance. The big story is much better availability and manageability without having to hire an expert MySQL operations team. The limiting factor might be cost but that depends on whether you come from the land where everything is free or from a commercial DBMS. And even in the land of free, the cost for an operations team is large.

Soapbox

Before describing what I learned about it, I have a short editorial.
  • Is Amazon the reason we need the AGPL? It isn't clear to me that they improve upstream MySQL. They do benefit the community by making it easier to run MySQL in the cloud, but I don't see them in the community. That means I also don't see their developers in the community, and who wants to disappear while working on mostly open-source software?
  • Their marketing leads with "up to 5X faster than MySQL", which the tech press translates into "5X faster than MySQL". When I signed up for an Aurora preview, the response email from Amazon also used the "5X faster than MySQL" claim. I prefer "up to 5X faster than MySQL".
  • In the video from AWS re:Invent 2014 James Hamilton makes some interesting claims.
    • I am not sure he heard about InnoDB based on this statement -- "Just doing a state of the art storage engine would have been worth doing. Take Jim Gray's black book on transactions, implement it, I would be happy." Someone can tell him to be happy.
    • In big print -- 3X write performance, 5X read performance. In small print -- sysbench. It will take time to determine whether similar performance & availability can be had elsewhere at a similar cost.
    • In big print -- 400X less lag, 2 seconds vs 5 milliseconds. I doubt this. The slide also stated -- synchronous multi-DC replication. I don't see how that is possible except within a datacenter given network latency. Other content from Amazon claims this is async replication. But then Mr. Hamilton stated that everything is there, transactions are not lost were two data centers to instantly disappear. Again, that requires sync replication. This is confusing and I am surprised his claims don't match the other documentation from Amazon.
    • It can fix torn pages. InnoDB does this for writes in progress during a crash. I assume Aurora can also do this for pages written a long time ago and that would be a big deal.

Information

What is Aurora? I don't know and we might never find out. I assume it is a completely new storage engine rather than a new IO layer under InnoDB. I encourage people to read the details page, FAQ, slides and pricing guide. The response from Clustrix is also useful. At least one answer on Quora is excellent, and this overview of Amazon datacenters will convince you that they have the network needed to make this work with low latency.

The big deal is that the database is replicated 6X using 3 availability zones (AZs). I assume this means it uses 2 storage clusters per AZ. Documentation from Amazon states this is done using async replication and (committed) writes are available within a few milliseconds or tens of milliseconds. I assume this is only for Aurora read replicas in the same AZ as the master; network latency will make that impossible for some remote AZs. In the presentation at AWS re:Invent there are claims that the replication is synchronous. That is confusing.

For now I will assume that Aurora does async replication to the two remote AZs and the non-primary storage cluster in the primary AZ -- which means that 5 of the 6 copies are updated via async replication. If this is true, then an outage at the primary storage cluster can mean that recent commits are lost. It would be great if someone were to make this clear. My preferred solution would be sync replication within the primary AZ to maintain copies in 2 storage clusters, and then async replication to the 4 copies in the 2 remote AZs. We will soon have multiple solutions in the MySQL community that can do sync replication within a datacenter and async replication elsewhere -- Galera, upstream and lossless semisync. But Aurora is much easier to deploy than the alternatives. My standard question is what commit rate can be sustained when all updates are to the same row? Sync replication with cross-country network round trips makes that slow.
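
To make my standard question concrete, here is a back-of-the-envelope sketch in Python. It assumes, as a simplification, that conflicting commits to one row are serialized and that each commit waits for one network round trip plus an fsync; the latency values are placeholders, not measured Aurora numbers.

    # Rough upper bound on commits/second to a single hot row when conflicting
    # commits are serialized and each one waits for a network round trip plus
    # an fsync. Latencies below are placeholders, not measured values.
    def max_single_row_commit_rate(round_trip_ms, fsync_ms):
        return 1000.0 / (round_trip_ms + fsync_ms)

    print(max_single_row_commit_rate(0.5, 0.5))   # same AZ: ~1000 commits/s
    print(max_single_row_commit_rate(70.0, 0.5))  # cross-country: ~14 commits/s

With cross-country round trips the bound drops by almost two orders of magnitude, which is why I keep asking this question about sync replication setups. Whether group commit can amortize that wait for a single hot row is one of the things discussed in the comments below.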

The presentation also claimed this was mostly drop-in compatible. I am interested in compatibility with InnoDB. 
  • What are the semantics for cursor isolation? PostgreSQL, InnoDB and Oracle all do snapshot isolation with different semantics for writes. PostgreSQL has excellent documentation to describe the behavior. A small probe for this is sketched after this list.
  • Does Aurora support clustered indexes? 
  • What is the max size of an index key?
  • How are large columns supported? Is data always inline?
  • How is multi-versioning implemented? InnoDB usually does updates in place so there isn't much work for purge to do except for deletes and updates to secondary index columns. 
  • Does this use pessimistic or optimistic concurrency control?
  • Does this support partitioning?
  • What is the block size for reads?
  • Does this use compression?
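
For the cursor isolation question above, a small probe like the following shows the kind of behavior I want documented. It is only a sketch: it assumes mysql-connector-python, placeholder connection settings, and a hypothetical table t(id INT PRIMARY KEY, c INT) that already has a row with id=1.

    import mysql.connector  # pip install mysql-connector-python

    # Placeholders: the endpoint, credentials, database and table are assumptions.
    def connect():
        return mysql.connector.connect(host="aurora-endpoint", user="user",
                                       password="password", database="test")

    a, b = connect(), connect()
    ca, cb = a.cursor(), b.cursor()

    ca.execute("SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ")
    ca.execute("START TRANSACTION")
    ca.execute("SELECT c FROM t WHERE id = 1")
    print("first read:", ca.fetchone())

    # A second session commits a change to the same row.
    cb.execute("UPDATE t SET c = c + 1 WHERE id = 1")
    b.commit()

    # Under snapshot isolation the repeated read returns the original value.
    # The engine-specific differences show up if session A now updates the
    # same row: does it block, fail, or silently overwrite?
    ca.execute("SELECT c FROM t WHERE id = 1")
    print("second read:", ca.fetchone())
    a.commit()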

Features

The brief description of features is very interesting and I will summarize it here. The features sound great and I expect them to attract a lot of customers.
  • Unplanned failover to an Aurora read replica takes a few minutes. Unplanned failover to a new instance can take up to 15 minutes; that is what happens for customers who aren't spending money on Aurora read replicas. This is another feature that will make Aurora very popular. While it will be nice to make failover faster, the big deal is that they provide it at all. The usual MySQL deployment requires some do-it-yourself effort to get something similar.
  • Storage uses SSD. I assume this is based on EBS. They do background scrubbing to detect and correct corrupt pages. Storage grows in 10G increments; you don't have to provision a fixed amount of GB or TB, it grows as needed up to 64T (or 64T per table). I am not sure I would want 64T in one instance, but it can make it easy to archive data in place. Automatic growth to such large database sizes will also make it much easier for deployments to avoid sharding, especially when an instance has many TB of cold data.
  • There are interesting features for point-in-time recovery, incremental backup and snapshots. This is integrated with S3. Incremental backups make it affordable to archive data in place as you don't do full backups for data that doesn't change. But I don't understand all of their backup options. 
  • The database is replicated twice in each of 3 AZs, so there are 6 copies. Up to 2 copies can be lost and writes are still possible. Up to 3 copies can be lost and reads are still possible. I assume that by copies they mean storage clusters. Automatic recovery here is another big deal. The quorum arithmetic this implies is sketched after this list.
  • The buffer pool survives mysqld process restart. I wonder if that is only true for planned restart. Regardless, this is a very useful feature for IO-bound workloads when it is important to have a warm cache. Percona used to have a patch for this with InnoDB.
  • Replication is supported in two ways -- via Aurora (at storage level) and MySQL (binlog). My bet is that Aurora replication will be most popular but some people will use MySQL replication to replicas with locally attached storage to save money. They claim much less lag with Aurora and I agree there will be less but I am not sure there will be 400X less. However, they avoid some overhead on commit by not using the binlog and they avoid a lot of complexity by not requiring MySQL replication on the master or slave. I assume that Aurora replication ships deltas rather than page images to be efficient on the network.
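
Here is a minimal sketch of the quorum arithmetic implied by the replication bullet above, assuming a 4-of-6 write quorum and a 3-of-6 read quorum; the actual Aurora protocol isn't documented in the material I have read.

    COPIES = 6         # 2 copies in each of 3 AZs
    WRITE_QUORUM = 4   # assumed: lose 2 copies and writes still work
    READ_QUORUM = 3    # assumed: lose 3 copies and reads still work

    def writes_possible(copies_lost):
        return COPIES - copies_lost >= WRITE_QUORUM

    def reads_possible(copies_lost):
        return COPIES - copies_lost >= READ_QUORUM

    for lost in range(COPIES + 1):
        print(lost, "lost -> writes:", writes_possible(lost),
              "reads:", reads_possible(lost))

Because 4 + 3 > 6, any read quorum overlaps any write quorum, which is the standard way to guarantee that reads see the latest acknowledged write.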

Cost

Cost comparisons will be interesting. I am very uncertain about my estimates here as I don't have much experience with prices on AWS or for normal-sized datacenter deployments limited to a few servers. I estimate a 3 year cost of $568,814 for a high-end Aurora deployment: largest servers, 1 master, 1 replica, backup and storage (the components are summed in the sketch after the list below). It will be interesting to compare this to non-AWS hardware because you need to account for the features that Aurora provides and for the extra availability of the Aurora storage solution. I used 3TB and 10,000 IOPS from storage because that can be provided easily by a high-end PCIe flash device, but that also assumes a very busy server.
  • Reserved db.r3.8xlarge (32 vCPUs, 244GiB RAM) for 2 servers is $115,682 over 3 years
  • 50TB of backup at 10 cents/GB/month is $53,100 over 3 years
  • 3TB of storage and 10,000 IOPS per server is $400,032 for 2 servers over 3 years. But I am not sure if the storage cost includes IO done to all AZs to maintain the 6 database copies. If it doesn't, the storage cost might be higher.
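
The total above is just the sum of the three components; a quick check in Python using the numbers from this post, not an AWS quote:

    servers = 115682   # 2 reserved db.r3.8xlarge over 3 years
    backup  = 53100    # 50TB of backup at 10 cents/GB/month over 3 years
    storage = 400032   # 3TB + 10,000 IOPS per server, 2 servers, 3 years
    print(servers + backup + storage)  # 568814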

20 comments:

  1. Does it support foreign keys? I thought I saw that it doesn't.

    Replies
    1. Or text? Or GIS? Or character sets?

    2. We expect to support the full MySQL surface area.

      Anurag@AWS

    3. Slight clarification - that's our design goal with respect to MySQL 5.6 using the InnoDB storage engine. There are of course many other storage engines, and it is hard to be compatible with the behavior across multiple engines. We also plan to retain compatibility as MySQL 5.7 and later versions emerge, though it is always hard to make statements about the future.

      Anurag@AWS

    4. Does this mean you implemented a new storage layer under InnoDB? That is interesting and would be good news to learn you can reuse that code. Oh, and thanks for answering questions here.

  2. What does it mean for Aurora benchmarks to have 100% hit rate for result set cache?
    https://media.amazonwebservices.com/blog/2014/os_rds_full_page_mon_2.png

    Replies
    1. It means apples are being compared to oranges: a private take on the query cache responding to simple sysbench queries is being compared to MySQL with the query cache disabled.

    2. Better perf in Aurora will be nice if it is really there but I think the real story is better availability and manageability -- auto-failover, incremental db size growth up to 64T, 6X replication, etc.

    3. We were trying to compare the best number in Aurora vs the best number in MySQL. My presentation shows our numbers with and without query cache, as well as those for MySQL based on our measurement. I do believe that (properly written) query caching is valuable for database customers.

      I'd also agree that performance doesn't matter if the database isn't up. We spent a lot of time on availability for this reason.

      Anurag@AWS

    4. How is invalidation done for the query cache? I have seen caches work when invalidation is fine-grained (invalidate cached results based on the keys that are changed), but that has required users to specify cache keys to invalidate as part of insert, update and delete statements. I have also seen query caches that don't work because results are invalidated whenever a table is modified. Without any details on your query cache, performance people won't be able to figure out whether it can be used. I assume that docs will eventually arrive. People are impatient for details because Aurora looks really impressive.

    5. For now, we kept the coarse-grained invalidation as in MySQL today. I agree with you that this has issues, but we don't want to change the db surface area, and tracking this at a low level in the kernel is complex - particularly given SQL's expressiveness. In any case, I think the question on query caching is more about the penalty of the query cache lookup (and locking) than the hit rate. That's fixable. Anurag@AWS

  3. Reading the FAQs and watching the video carefully, I get the sense that it's synchronous replication at the storage layer. The key idea with the product is scale-out, network-attached SSD storage (optimized for the DB workload). Now the underlying storage layer is replicated 6x (2x in 3 different AZs). I'm going to guess they use distributed consensus for all writes to the storage layer (you need 4 out of 6 for majority quorum and so you can tolerate 2 failures and still be up for writes).

    On top of this storage layer, they probably have multiple EC2 instances running their custom MySQL daemons – one of them is the leader which is accepting new writes and a bunch of read-only slaves (they're calling them Aurora replicas to differentiate them from regular MySQL binlog-based replicas). If you think about it, the main job of the slaves is to keep invalidating dirty pages from their bufpool (if any). They don't actually have to go through the mechanics of committing a txn because the txn has already been committed to the storage node in this replica's AZ. This would explain why they can do low millisecond replication-lag for the Aurora replicas compared to typical 'several seconds' replication-lag for the normal MySQL replicas.

    Doing a failover in this model should simply involve electing a new leader among the Aurora slaves and let that slave catch up to all updates from the storage layer and start accepting new writes. And so you get zero data loss and a really quick failover with a warm bufpool.

    Replies
    1. Harish, thanks for the careful read through. Here are some responses.

      We asynchronously write to 6 copies and ack the write when we see four completions. So, traditional 4/6 quorums with synchrony, as you surmised. Now, each log record can end up with an independent quorum from any other log record, which helps with jitter but introduces some sophistication in recovery protocols. We fill in holes peer to peer. We also repair bad segments in the background, and downgrade to a 3/4 quorum if unable to place in an AZ for any extended period. You need a pretty bad failure to get a write outage.

      You're also right about how Aurora replicas work. One clarification is that, most of the time, we are applying the redo change in the replica buffer pool rather than invalidating. That helps with blocks seeing a lot of both reads and writes. And, of course, down at this layer, there is no need to wait for transaction commit before pushing the replica - all the standard MVCC code works.

      Failover is as you describe - we're mostly just purging old records out of cache to catch up - the redo log itself provides a clean model for seeing what needs to be done. It helps a lot in this world to have a definitive LSN "clock" rather than the fuzzy wall-clock times used in other replication schemes.

    2. Sorry, I didn't provide a sig on the prior response to Harish. It was from Anurag@AWS.

    3. Given the use of sync replication I assume:
      * commit latency is one network roundtrip between AZs and at least one fsync
      * commits to transactions that don't conflict can be in flight concurrently
      * commits to transactions that conflict are serialized
      * so a workload with concurrent updates to one row gets a throughput of about 1 / network-round-trip

    4. Not really. You get acks back to writes issued asynchronously. Once enough come back for the things you need, you can ack the commit upwards. That said, certainly concurrent updates to a single row cause contention - that's the nature of ACID. But not for the reason you say. You can group commit updates to a single row just fine. There is no need to sync each commit separately. Anurag@AWS

  4. The pricing estimate in your post isn't quite right. 3YR RI pricing for an R3.8xlarge is 20K upfront + 0.96/hour. That's 20K + 0.96 * 8760 hr/yr * 3 yrs = 42,228.80 per box. For your scenario of two boxes, 84,457.60.

    We don't charge for backup up to the instance storage size, and beyond that we charge at prevailing S3 rates, which start at 0.03/GB-month. I'm not sure it is that easy to generate 50TB of backup for a 3TB database that does incremental backup (log-structured). You can if you do a lot of user snapshots and rewrite your dataset monthly. Seems extreme.

    Storage is shared across the primary and replica. Remember that you're paying for the storage and IOs you actually use, rather than what is provisioned for the peak. For most customers, that is 5x lower for used vs provisioned storage and 10-20x for average IOs vs peak IOPS.

    Anurag@AWS

    Replies
    1. The pricing guide at http://aws.amazon.com/rds/aurora/pricing/ shows $1.44/hour for a reserved r3.8xlarge.

      For backup I am skeptical that incremental backup can go too far back in time because of the extra time to apply deltas during recovery and because servers with high update rates will have huge deltas. Regardless, people will figure that out soon.

    2. $1.44 is for a 1 year term, $0.96 is for a 3 year term. The up-front fees are also different.

      Incremental backup <> redo log. It just means the changes to the data file, no different than what you do when backing up your home computer. Anurag@AWS

  5. I think some of the confusion we've caused is in the difference between replication to Replicas and replication of storage. Harish's note is largely right and I've added on where necessary.

    Beyond that, the storage subsystem is new/custom, not based on EBS. Replication is on a 10GB chunk basis, so failures have a reduced blast radius relative to replication failure of an entire cluster. The buffer pool survives unplanned restart.

    Feel free to reach out to me directly if you have further questions (awgupta@amazon.com). Anurag@AWS

