Tuesday, September 30, 2014

Sync replication in MySQL 5.X

The MySQL 5.7.5 beta is here and looks great. Group replication is in a labs preview and provides synchronous replication. We will soon have 2 choices for sync replication in the MySQL family (this and Galera). Descriptions of sync replication tend to focus on the details and I prefer to understand behavior at a high level before going into the details. Sync replication has a wonderful property -- there is no need for failover because every replica is a master. However that comes at a cost. A commit that is too slow is another form of downtime. Putting more replicas closer together is a workaround (reduces network round trip time) if you don't mind the HW cost. Lossless semi-sync replication is the alternative where you are willing to deal with failover, want a 2-safe binlog but don't want extra database servers. I am happy to see both solutions for the MySQL community.

My standard questions are:

  1. When all commits originate from the same replica, does commit processing (getting consensus) require 1 or more network round trips between replicas? AFAIK there are Paxos variations for which 1 round trip is sufficient.
  2. When all commits don't originate from the same replica, does commit processing require 2 or more than 2 network round trips? Is the expected usage to originate commits at the same replica or to let them use any replica? Because originate-at-any usually has a performance overhead of 1 extra network round trip and this matters when replicas are far away.
  3. What occurs between 2 commits to the same key from different transactions? Assuming 1 network round trip for commit (see #1 above) then my guess is fsync on replica that originates commit, send message to other replicas, fsync on other replicas, send reply to originator. So the limit on commits per second to one key is 1 / (network-round-trip + 2 * fsync).
  4. What occurs between 2 commits to different keys from different transactions? Is there grouping or pipelining so that the commit rate here is better than for #3?
  5. Does "commit" mean that the change is accepted by or applied to a majority of replicas? If it only means the change has been accepted, then I can read from a majority of replicas after commit and not see my change.
  6. What is the workaround for a key with a lot of logical contention? Is there any support for pessimistic locking? Should I route/originate all such transactions from the same replica?

Wednesday, September 10, 2014

Get help from an MBA - MySQL Benchmark Advisors

I am not an academic but I attend academic database conferences and read many papers from them. My focus is transaction processing (OLTP, small data). Some of the work published there is awesome, try reading the papers describing the R&D effort for Hekaton and you might agree (see papers from Larson, Levandoski and others). Better performance is the key contribution for many of the papers and in most cases that is measured via benchmarks because the ideas change the constant factor rather than reduce complexity from N*N to NlogN. I also wonder if there is too much emphasis on peak performance and not enough on reliability and manageability.

MySQL is frequently used as the comparison system in papers and I frequently think to myself that my MySQL is faster than their MySQL when I read the results. Then I doubt the results which makes me doubt the paper and the paper has less impact. We can fix that. Get an MBA (MySQL Benchmark Advisor) to consult on your usage of MySQL before submitting the paper for review.

Benchmarking is hard, for example see some of the things you should consider when doing performance tests for the LevelDB family. It becomes harder as the number of products compared is increased. One difference between benchmarking and benchmarketing is that a benchmark explains the difference in performance. In the context of an academic paper the new idea is almost always faster otherwise the paper would not get published (see this post for more on the issue). It might be faster because it is better. Or it might be faster because:
  • it was compared to products that were misconfigured
  • the test wasn't run long enough to fragment the file structure (b-tree, LSM, etc)
  • the test wasn't run long enough to force flash garbage collection to start
  • the test database was too small relative to the storage device size
I have a lot of experience with benchmarks for MySQL, RocksDB and locally attached storage. This means I have made a lot of mistakes and sometimes learned from them. I have also published and retracted bogus results, repeated many tests and spent a lot of time explaining what I see. In my case I have to explain the result if I want to fix the problem or expect the vendor to fix the problem. Performance results in the MySQL community are subject to peer review. Much of that happens in public when other gurus question, praise or debug your results.

MySQL and InnoDB are a frequent choice when comparisons are needed for a conference paper. I have read more than one paper with results that I don't trust. There is a lot of ambiguity because papers rarely have sufficient information to repeat the test (client source and my.cnf and steps to run the test and storage description and ...) and repeatability is an open problem for the academic database community.

So this is an open offer to the academic database & systems community. The MySQL community (at least me) is willing to offer advice on results, tuning & performance debugging. This offer is limited to the academic community. Startups or vendors are best served by the excellent consultants in the MySQL community. I expect similar degrees to be created for MongoDB (MBA) and PostgreSQL (PBA).

Saturday, September 6, 2014

VLDB & MySQL in China

I am at VLDB in Hangzhou, China and spoke with clever people from academia, industrial research labs, industry and even web scale companies. A co-worker setup meetings with people who make MySQL better including Zhejiang University, Alibaba, Taobao & NetEase. We had another meeting with expert DBAs in Shanghai. These were interesting meetings.

There is a lot of great work done to make MySQL better at these companies. While I have been aware of this, I missed the extent of the effort -- both improvements made to MySQL and the complexity of the workload. I hope we can do more to engage these web-scale MySQL teams in the community. We all benefit by making MySQL better -- adding features, fixing bugs faster, explaining best practices. Engagement can mean speaking at conferences, blogging and sharing code (github, launchpad). Everyone benefits when patches are pushed upstream - the diff contributor gets to shrink their diff and maintenance overhead.

I am impressed by the work of the MySQL teams from Alibaba and Taobao. They have a lot in common with the teams I have been on. Operations and engineering work together and engineering learns what needs to be made better. Running a DBMS in production is a great education that many developers do not get at traditional (commercial) DBMS companies. They also confront many performance problems from high-throughput workloads that require them to get more from InnoDB, even as the Oracle/InnoDB team continues to make it much more efficient.

A frequent topic was the excellent new replication features in MySQL 5.6. I did not find anyone using them. They were reluctant for good reasons, some of which will be explained in a post to be published real soon now by my co-workers. We think their concerns have been addressed by changes in WebScaleSQL and upstream MySQL. It will take time to get that message out.

There were also questions about improvements to InnoDB in MySQL 5.5 and 5.6. People want to know whether InnoDB scales better (to more cores). The short answer is yes. Each release from 5.1 to 5.5 to 5.6 and now to 5.7 has significant changes that reduce mutex contention. The end result is that 5.7 will get more QPS than earlier releases on a big multi-core server. MySQL marketing has been gettting this message out, but marketing is hard and the message needs to be repeated.


Evaluating vector indexes in MariaDB and pgvector: part 2

This post has results from the ann-benchmarks with the   fashion-mnist-784-euclidean  dataset for MariaDB and Postgres (pgvector) with conc...