Sunday, October 11, 2015

Losing it?

Many years ago the MySQL team at Google implemented semi-sync replication courtesy of Wei Li. The use case for it was limited and the community was disappointed that it did not provide the semantics they wanted. Eventually group commit was implemented for binlog+InnoDB: first by the MySQL team at Facebook, then much better by MariaDB, and finally in upstream MySQL. With group commit some magic was added to give us lossless semisync and now we have automated, lossless and fast failover in MySQL without using extra replicas. This feature is a big deal. I look forward to solutions for MariaDB (via binlog server and MaxScale) and MySQL (via Fabric and Proxy).

MongoDB is ahead of MySQL in features that make scale-out easier to manage and the next release (3.2) adds a few more features and more robust code for existing features. I hope that some of the features that have been sacrificed in the name of scale-out will eventually arrive in a MongoDB release: per-shard transactions, per-shard joins, per-shard consistent read.


Many times we describe algorithms when users need to understand behavior and then the user gets lost in the details. It is important for developers to understand when transactions can be lost and I will describe that with answers to two questions.
  1. After the changes from a commit are visible to a concurrent session under what conditions can that commit be lost?
  2. After a client has been informed that a commit has succeeded under what conditions can that commit be lost?


I list four combinations of behavior to consider and the features in MySQL and MongoDB that provide them:


MySQL uses one solution (lossless semisync) to protect against both loss of visible and acknowledged commits. With lossless semisync row locks are held on the master until at least one slave acknowledges the commit. This can reduce commit throughput as there is a network round trip between master and replica(s) before commit is finished and row locks are released. There is at one network round trip between commits to the same row. This overhead is reduced by moving the replica close to the master. In the solution described by Yoshinori the binlog archive provides the ack rather than using extra replicas in every data center and because the binlog archive doesn't require a lot of hardware it is easier to move it closer to the master.

In MongoDB total protection comes from two solutions -- majority read concern and write concern. The benefit is that majority write concern doesn't make commit slower assuming a workload has sufficient concurrency. It will add latency to each client doing a commit just like MySQL semisync. A different feature, majority read concern, protects against loss of a visible commit. However, there is the risk that a client that needs read-your-own write semantics will have to wait. At this point it isn't clear to me that MongoDB makes it easy to read your own writes. I wonder if applications that care more about performance will use majority read concern without using majority write concern. That isn't an option with MySQL.

It will take time to figure out the implications of the performance differences. With MySQL delays are added to writers as it takes them longer to get the through semi-sync commit protocol. With MongoDB delays might be added to readers as they wait for the majority read snapshot to be advanced.

Durable on a slave?

It can be important to understand how durable a change is on a replica when the replica acknowledges a transaction to the master. There are several options and I have not read enough recent MySQL or MongoDB documentation to determine whether there are options beyond durable in memory:
  1. durable in memory - the commit is buffered in memory on a replica before it acks. Many years ago when we implemented semi-sync for MySQL this was the only choice. I tend to think that durable in memory is an oxymoron, but I am a pessimist.
  2. durable in a log - the commit is durable in a log file on a replica before it acks. There has been talk that MySQL would 
  3. committed on the replica - the commit is applied on a replica before it acks. That guarantees read-your-writes semantics when that replica is queried soon after committing a change on a master. Alas this is also likely to create performance lag unless the replica uses many threads to apply changes concurrently, just like on the master. It also creates a window where a commit is visible on a replica before the master.

More editorial

MongoDB documentation has tended to be optimistic about the features provided by the software. I think this will be resolved as the community grows. There have been some interesting discoveries. Hopefully the gap between documented and actual behavior will be reduced over time.

The mmap engine releases the per-database or per-instance write lock before syncing the oplog even when durable writes are requested.  This is now described as read uncommitted, but read non-durable might be a better name because reads are still consistent but you can see changes from others before those changes are durable in the oplog. I wrote about this when reading code and the docs have been updated since then but I think their docs need more edits. This is only a problem for the mmap engine and multiple engines in MongoDB means they need to be clear about behavior for mmap versus WiredTiger.

There were too strong claims about the semantics of the majority write concern. It protects against the loss of an acknowledged commit but some docs suggested it protected against the loss of a visible commit. Aphyr, an expert in distributed systems testing, highlighted this problem in his Call Me Maybe series and a bug report. I wrote about part of the problem prior to that but I did not connect the problem with the too-strong claims in the documentation. Many years ago MySQL made a similar mistake when documenting semi-sync replication and fixed their docs after I filed a bug.

Documentation claimed that 2-phase commit was used to keep config servers in sync. That makes it more likely that commit is all-or-nothing for the 3 or 5 config servers hosting the same data. Alas it can lead to read-only mode when a server goes away. I read the code and the two phases were 1) ping all config servers and then if all responded 2) send the change to all config servers. If all servers did not respond with OK then manual intervention was required. This isn't 2 phase commit. Fortunately, something much better will be done for the 3.2 release and the docs have been updated.

Finally, read the excellent series of posts from Tokutek on replica set failover including the overview and posts one, two, three and four. There have been problems that haven't been widely known. Fortunately the 3.2 release of MongoDB should make things better.

Wednesday, October 7, 2015

Problems not worth fixing

I worked on a closed-source DBMS years ago and the development cycle was 1) code for 6 months 2) debug for 18 months. Part 2 was longer than part 1 because the customers have high expectations for quality and the product delivers on that. But it was also longer because some co-workers might not have been code complete after 6 months and were quietly extending part 1 into part 2.

I fixed many bugs during part 2. One that I remember was in the optimizer. Someone added code to detect and remove duplicate expressions (A and A and B --> A and B). Alas the algorithm was O(N*N). Analytics queries can have complex WHERE clauses and all queries were forced to pay the O(N*N) cost whether or not they had any duplicate expressions. 

The overhead for this feature was reasonable after I fixed the code but it raises an interesting question. Are some problems not worth the cost of prevention? Several times a year I see feature requests that make me think of this.

Forgot to add the bug that inspired this post. In bug 30342 there was a change in the MySQL 5.0 optimizer that added predicates which could increase the overhead for query optimization when more predicates means more calls to records_in_range.

Monday, October 5, 2015

MyRocks versus allocators: glibc, tcmalloc, jemalloc

I used a host with Fedora 19 and Linkbench to determine the ability of memory allocators to avoid fragmentation with MyRocks (MySQL+RocksDB). RocksDB can create pressure on an allocator because allocations for the block cache can have a short lifetime. Memory is allocated on page read and released when a page is evicted from the cache. The allocation is the size of the decompressed page and sometimes the size of the compressed page. RocksDB puts more pressure on the allocator than InnoDB because with InnoDB the buffer pool allocation is done once at process start. After the Linkbench load finished I ran the query test for 24 hours and report the values of VSZ and RSS from ps at the end. From the results below jemalloc & tcmalloc do much better at avoiding fragmentation (value of RSS) compared to glibc 2.1.7. I am still adjusting to larger values of VSZ with jemalloc. It isn't a problem, but it takes time to accept.

VSZ       RSS       allocator
22246784  10062100  jemalloc
13600032  10545284  tcmalloc
26136152  20313268  glibc

Test details:
  • Linkbench with maxid=100M, loaders=10, requesters=20
  • MyRocks with 8G block cache for RocksDB
  • libc is glibc 2.17, tcmalloc from gperftools 2.1, jemalloc 3.6.0

Friday, October 2, 2015

Wanted: a file system on which InnoDB transparent page compression works

I work on MyRocks, which is a MySQL storage engine with RocksDB. This is an alternative to InnoDB. It might be good news for MyRocks if transparent page compression is the future of InnoDB compression. I got more feedback from the MySQL team that despite the problems I have reported, transparent page compression works. I was just testing the wrong systems.

So I asked core developers from Btrfs whether it was OK to do punch hole per write and they politely told me to go away. Bcachefs might be a great way to add online compression to a b-tree without doing punch hole per write but it is not ready for production. Someone from MySQL suggested I try ext4 so I setup two servers with Ubuntu 14.04 which is on the list of supported systems. I used XFS on one and ext4 on the other.

XFS was still a problem and ext4 was just as bad. The problem is that the unlink() system call takes a ridiculous amount of time after a multi-GB file has been subject to many punch hole writes. By ridiculous I mean 50X or 100X longer. Maybe I am the only one using InnoDB file-per-table and measuring time to drop a table or database. Bad things happen when DROP takes 10+ minutes --InnoDB is unusable by other connections and an InnoDB background thread might kill mysqld because it thinks there is an internal deadlock.

Did I mention this was good news for MyRocks? If you want compression then we get 2X better compression compared to compressed InnoDB. We also made sure that DROP TABLE and DROP INDEX are fast.

I updated bug 78277 for this problem. I am the only person updating that bug. I also found a corruption bug with transparent page compression, bug 78672. Past experience with transparent page compression is described here, here and here.

The end to my testing day was perfect. I rebooted the host to get back memory from XFS metadata allocations and the filesystem came back corrupt. Being a pessimist I was using a scratch filesystem for this, so I didn't lose my Ubuntu install.

Tuesday, September 8, 2015

Third day with InnoDB transparent page compression

My first two days with InnoDB transparent page compression didn't turn out well. Transparent page compression can make InnoDB source code simpler and InnoDB more performant on insert heavy workloads. Unfortunately the versions of XFS that I use are not happy after doing a hole-punch on write. The performance summary is that with transparent compression:

  • Database load is slightly faster
  • Transaction processing is slightly slower
  • DROP TABLE is 43X slower

MySQL 5.6 vs 5.7

I used a host with 24 HW threads, 144G of RAM and a 400G Intel s3700 SSD. The server uses Fedora 19, XFS and Linux kernel 3.14.27-100.fc19. The benchmark application is linkbench and was run with maxid1=100M, loaders=10 and requesters=20 (10 clients for the load, 20 for queries). I compared the Facebook patch for MySQL 5.6 with upstream MySQL 5.7. For 5.6 I used old-style compression for linktable and counttable. For 5.7 I used transparent compression for all tables. I also used 32 partitions for 5.6 and no partitions for 5.7. After the database was loaded I ran the query test for 140 loops with 1 hour per loop.

The table below has data for the Facebook patch for MySQL 5.6 (fb56) and upstream 5.7.8 (orig578). The results are the row insert rate during the load (ips) and the average QPS of the Nth hourly run (qps@N). The QPS is better for MySQL 5.6 and the load rate is better for 5.7 but I don't know how much of that is due to the use of partitions for 5.6. I ran DROP TABLE after the test and that took ~8 minutes for MySQL 5.7.8. More details are in a previous post.

          fb56      orig578
ips       55199     81970
qps@20    13731     10581
qps@40    12172      9874
qps@60    11353      8875
qps@80    10977      8234
qps@100   10793      8021
qps@120   10691      7946
qps@140   10636      7949

Transparent vs old-style compression

I then ran a test for 24 hours to compare MySQL 5.7.8 in two setups and both used partitions for all tables. They differed in that one used old-style compression and the other used transparent compression. The results were similar to the comparison with MySQL 5.6 as the load was faster with transparent compression. transaction processing was faster with old-style compression and DROP TABLE was ~30X slower with transparent compression.

After the linkbench load I ran the query test for 24 1-hour loops. At test end the database with old-style compression was 4% larger than transparent compression, but it also had more data as it sustained a higher QPS rate. I didn't count the number of rows to determine whether it had 4% more data.

The table below displays the row insert rate during the load (ips) and the average QPS from 1-hour runs at the 2nd, 12th and 24th hours (qps@N). The load rate is better with transparent compression and the QPS is better with old-style compression.

           578, old-style    578, transparent
ips        72566             79518
qps@2      16542             15504
qps@12     16079             15136
qps@24     15506             14383

Transparent compression doesn't have to provide better compression or performance to be a win, but it needs to be stable. I ran DROP DATABASE at test end and that took 5 seconds for old-style compression vs 216 seconds for transparent. The database was ~100G when dropped.

This paste has the output from the 24th 1-hour run of the linkbench query test. There are two sections, the first is from old-style compression and the second from transparent compression. For most of the linkbench operations old-style is slightly faster. But the max times for operations is much worse (~2X) with transparent.

Friday, September 4, 2015

Linkbench for MySQL 5.7.8 with an IO-bound database

I wanted to try InnoDB transparent page compression that is new in the MySQL 5.7.8 RC. That didn't work out, so I limited my tests to old-style compression. I compared MyRocks with InnoDB from the Facebook patch for 5.6, upstream 5.6.26 and upstream 5.7.8. My performance summary is:

  • MyRocks loads data faster than InnoDB. This isn't a new result. Non-unique secondary index maintenance doesn't require a read before the write (unlike a B-Tree). This is also helped by less random IO on writes and better compression.
  • MyRocks compression is much better than compressed InnoDB. After 24 hours it used between 56% and 64% of the space compared to the compressed InnoDB configurations.
  • MyRocks QPS degrades over time. This will be fixed real soon.
  • Partitioning improves InnoDB load performance in MySQL 5.6 for compressed and non-compressed tables. This reduces stalls from the per-index mutex used by InnoDB when inserts cause or might cause a page split (pessimistic code path) because there is one mutex per partition. With MySQL 5.7 partitioning doesn't help in the non-compressed table case. There has been work in 5.7 to reduce contention on the per-index mutex and I think it helped. I suspect it is still needed with old-style compression because compressed page splits are more expensive as they include recompression.
  • The Facebook patch for MySQL 5.6 is faster than upstream 5.6 and competitive with upstream 5.7.8. Too bad that patches might not reach upstream.


My test server has 144G of RAM, 40 HW threads with HT enabled and fast PCIe flash storage. I configured linkbench with loaders=10, requesters=20 and maxid1=1B. This uses 10 clients for the load, 20 clients for the query runs and about 1B rows in the node table after the load. The linkbench clients share the server with mysqld. The my.cnf settings are explained in a previous post.  The load was done with the binlog disabled. After the load there were 12 1-hour runs of the query test and I report results for hours 2 and 12. Then mysqld was restarted with the binlog enabled and 12 more 1-hour runs of the query test were done and I report results for hours 14 and 24. Fsync for the binlog was disabled. Fsync for the InnoDB redo log was done by a background thread (innodb_flush_log_at_trx_commit=2). Note that the InnoDB page size was 8kb so I used 2X compression for the link and count tables. The node table is not compressed for InnoDB because it is unlikely to compression by 50%.

I tested the following binaries:
  • myrocks - RocksDB storage engine for MySQL using the Facebook patch for MySQL 5.6
  • fb56 - InnoDB using the Facebook patch for MySQL 5.6
  • orig56 - upstream MySQL 5.6.26
  • orig57 - upstream MySQL 5.7.8
The partitioning and compression options are described by the following.  For partitioning I use 32 partitions and transactions/queries don't span partitions. All of the DDL is here.
  • p0 - no partitioning for RocksDB
  • p1 - partitioning for RocksDB
  • p0.c0 - no partitioning, no compression for InnoDB
  • p0.c1 - no partitioning, old-style compression for InnoDB
  • p1.c0 - partitioning, no compression for InnoDB
  • p1.c1 - partitioning, old-style compression for InnoDB


This lists the database size in GB after the load and query tests at the 2nd, 12th, 14th and 24th hours. I don't have sufficient granularity in my measurement script for databases larger than 1T. I am not sure why compression with upstream 5.6 and 5.7 uses more space than with the Facebook patch.

Update - I removed the results for myrocks, p1 because my measurements were wrong.

load    2h      12h     14h     24h
gb      gb      gb      gb      gb      config
 487     493     512     514     523    myrocks, p0
11XX    11XX    12XX    12XX    13XX    fb56, p0.c0
 666     697     779     787     814    fb56, p0.c1
11XX    12XX    12XX    13XX    13XX    fb56, p1.c0
 707     745     803     808     826    fb56, p1.c1
12XX    12XX    13XX    14XX    14XX    orig56, p0.c0
 756     790     879     889     920    orig56, p0.c1
13XX    13XX    14XX    14XX    14XX    orig56, p1.c0
 803     838     901     907     930    orig56, p1.c1
12XX    13XX    14XX    14XX    15XX    orig57, p0.c0
 756     796     892     902     931    orig57, p0.c1
13XX    13XX    14XX    14XX    15XX    orig57, p1.c0
 803     844     844     916     940    orig57, p1.c1

This lists the insert rate during the load (load ips) and the average query rates for the 2nd, 12th, 14th and 24th hours. Note that the query rate is lousy for p0.c1 immediately after the load. The problem is that the b-tree pages are almost full after the load and then over time many of them get split. There are stalls from page splits with compression and over time the page split rate drops.

load    2h      12h     14h     24h
ips     qps     qps     qps     qps     config
165210  31826   22347   21293   17888   myrocks, p0
103145  30045   22376   21325   18387   myrocks, p1
109355  21151   23733   23478   24865   fb56, p0.c0
 74210   8261   13928   14706   18656   fb56, p0.c1
104900  26953   26029   25161   25479   fb56, p1.c0
 90162  19888   24431   22596   22811   fb56, p1.c1
105356  16472   16873   16575   17073   orig56, p0.c0
 45966   7638   12492   13178   16516   orig56, p0.c1
 98104  18797   18273   17625   17702   orig56, p1.c0
 66738  17731   19854   19159   19418   orig56, p1.c1
122454  31009   30260   29905   29751   orig57, p0.c0
 49101   9217   17552   18448   22092   orig57, p0.c1
114400  28191   26797   25820   25832   orig57, p1.c0
 69746  22028   25204   23882   23983   orig57, p1.c1

This is the same data as above, but grouped by configuration.

load    2h      12h     14h     24h
ips     qps     qps     qps     qps     config
109355  21151   23733   23478   24865   fb56, p0.c0
105356  16472   16873   16575   17073   orig56, p0.c0
122454  31009   30260   29905   29751   orig57, p0.c0
165210  31826   22347   21293   17888   myrocks, p0
 74210   8261   13928   14706   18656   fb56, p0.c1
 45966   7638   12492   13178   16516   orig56, p0.c1
 49101   9217   17552   18448   22092   orig57, p0.c1
104900  26953   26029   25161   25479   fb56, p1.c0
 98104  18797   18273   17625   17702   orig56, p1.c0
114400  28191   26797   25820   25832   orig57, p1.c0
103145  30045   22376   21325   18387   myrocks, p1
 90162  19888   24431   22596   22811   fb56, p1.c1
 66738  17731   19854   19159   19418   orig56, p1.c1
 69746  22028   25204   23882   23983   orig57, p1.c1


For people who prefer graphs I include one for the load rates and another for the QPS from the configurations that use partitioning.

Thursday, September 3, 2015

Second day with InnoDB transparent page compression

My first day with InnoDB transparent page compression didn't turn out OK, but I am an optimist so onto day 2. I gave up trying to use it on a host with CentOS 6.6, XFS and a 3.10.53 kernel. I had a bit more luck with a Fedora 19 host using XFS and a 3.14.27 kernel and I ran linkbench.

You can follow along here or at bug 78277.

I configured linkbench with maxid1=100M and the database was about 100GB after the load. I used MySQL 5.7.8 with transparent page compression. I won't share the performance results just yet, but at the and of 7 days of the query test I shutdown mysqld and my storage device (400G Intel s3700) had 180G free space. I confirmed that I was able to use at least 176 of the 180 GB, which was a good sign. However, the device was only able to write at ~60 MB/second while making copies of a 2G test file. File copies were writing at ~200 MB/second on a second host that has the same hardware, the same amount of free space, but didn't use hole-punch for compression. So this is one more cost of using hole-punch, it makes the filesystem much slower. But I don't blame XFS. I don't think that doing a hole-punch on every small write was the intended use for that feature.

Then I ran DROP DATABASE for the database used by linkbench. It contained 3 unpartitioned tables. Two were ~60G and one was ~10G. The DROP TABLE took 8 minutes. While that ran there were many warnings like this:
[Warning] InnoDB: Trying to close/delete/truncate tablespace 'linkdb/linktable' but there are 1 pending operations on it.
[Warning] InnoDB: Trying to close/delete/truncate tablespace 'linkdb/linktable' but there are 1 pending operations on it.

Then it got more exciting -- a long semaphore wait courtesy of a slow unlink. Alas the DROP TABLE finished before the background thread killed mysqld.
InnoDB: A long semaphore wait:--Thread 139616032126720 has waited at line 830 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x327b74a8 created in file line 1153
a writer (thread id 139616238388992) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file line 830
Last time write locked in file /home/mdcallag/b/mysql-5.7.8-rc/storage/innobase/row/ line 4202
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info: