Tuesday, December 30, 2014

malloc and MongoDB performance

I used iibench to measure the impact of malloc on MongoDB performance, comparing malloc from glibc 2.17, jemalloc 3.6.0 and tcmalloc/gperftools 2.2 for an insert-only workload with 10 concurrent clients. The test server has 40 cores with hyperthreading enabled. Alas, I lost a few configuration details, but I think I used MongoDB 2.8.0rc4 and the WiredTiger b-tree.

The summary is that tcmalloc and jemalloc provide a similar benefit and are much better than glibc malloc. I made no attempt to tune tcmalloc or jemalloc. I ran iibench 3 times for each configuration and chose the median result. There are four metrics for each configuration:

  1. Test duration in seconds
  2. Address space size per the VSZ column in ps
  3. Normalized context switch rate - the number of context switches per N documents inserted. I will leave N undefined for now, so the absolute value isn't meaningful, but the values can be compared between malloc implementations (a short normalization sketch follows this list).
  4. Normalized CPU utilization - CPU utilization per N documents inserted, normalized the same way as #3.
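The normalization is just division. Here is a minimal sketch of it, assuming vmstat was sampled once per second into a file named vmstat.log while the load ran; the file name, document count and scale factor are placeholders rather than details from my test harness:

DOCS=2000000000   # documents inserted while vmstat ran
N=200000          # the arbitrary scale factor
awk -v docs="$DOCS" -v n="$N" '
  NR > 2 { cs += $12; cpu += $13 + $14 }   # cs, us and sy columns of vmstat
  END    { printf "context switches per N docs: %.2f\n", cs * n / docs;
           printf "CPU (us+sy) per N docs: %.2f\n", cpu * n / docs }
' vmstat.log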

Results

          seconds  VSZ(GB)   context-switch(relative)   CPU(relative)
tcmalloc   1943     65.1            6.6                   26510
jemalloc   1877     64.5            7.1                   27165
glibc      2251     72.3           23.0                   32120

For this test glibc uses 1.19X more time, 1.12X more address space, 3.23X more context switches and 1.18X more CPU than jemalloc, and tcmalloc is similar to jemalloc. As tcmalloc is bundled with the MongoDB source distribution I assume it is used in the binaries they distribute, which appears to be a good thing.
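I don't remember exactly how each binary was linked, but an easy way to test an alternate malloc without recompiling is LD_PRELOAD. The library paths and config file name below are assumptions for a typical Linux install:

# jemalloc
LD_PRELOAD=/usr/lib64/libjemalloc.so.1 ./mongod --config mongo.conf
# tcmalloc from gperftools
LD_PRELOAD=/usr/lib64/libtcmalloc.so.4 ./mongod --config mongo.conf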

Friday, December 26, 2014

Storage overhead for attribute names in MongoDB

The flexibility of dynamic schema support in MongoDB comes with a few costs. One of the costs is that extra space is required to store the database. The problem is that attribute names are repeated in every document. If sizeof(attribute names) is significant relative to sizeof(attribute values) then a lot more space will be used for MongoDB with the mmapv1 engine compared to a DBMS that doesn't support dynamic schemas. Note that dynamic schemas are valuable because they support less schema, not no schema. Indexes, required attributes and assumptions about attribute values are all examples where some schema is used.

How much extra space is used for attribute names? Does page level compression in the WiredTiger and TokuMX engines make this a non-issue? Long attribute names repeated in every document seem like something that is easy to compress, and the alternative of using extra-short attribute names will cause pain for anyone trying to use the database, so page level compression might be the preferred solution.

While page level compression can remove the bloat from long attribute names in the compressed copies of pages, it doesn't solve the problem for the uncompressed copies, so there is still a cost from dynamic schemas: fewer uncompressed pages fit in cache because of the space overhead. Perhaps one day we will get an engine that encodes long attribute names efficiently even for uncompressed pages. Note that when page level compression is used some database pages are in cache in both compressed and uncompressed forms. I assume that the WiredTiger and TokuMX block caches only cache uncompressed pages, but I am not an expert in either engine, and the OS filesystem cache has copies of compressed pages. I am not sure what happens when direct IO is used with WiredTiger because that prevents use of the OS filesystem cache.

Results

I used iibench for MongoDB and loaded 2B documents for a few configurations: mmapv1, and WiredTiger without compression (wt-none), with snappy compression (wt-snappy) and with zlib compression (wt-zlib). To keep the documents small I edited the iibench test to use a 1-byte character field per document and disabled creation of secondary indexes. I used two versions of iibench: the first used the attribute names as-is (long) and the second used shorter versions (short) for a few of the attribute names (sample documents follow the list):
  • long attribute names: price, customerid, cashregisterid, dateandtime
  • short attribute names: price, cuid, crid, ts
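To make the comparison concrete, the two document shapes look like this in the mongo shell. The values, the filler field name and the collection and database names are illustrative, not from my test; the three renamed attributes contribute 35 bytes of key text per document with long names versus 10 bytes with short names:

mongo iibench --quiet --eval '
  db.long_names.insert({ price: 14.76, customerid: 421, cashregisterid: 77, dateandtime: new Date(), fill: "x" });
  db.short_names.insert({ price: 14.76, cuid: 421, crid: 77, ts: new Date(), fill: "x" });
  print(Object.bsonsize(db.long_names.findOne()), Object.bsonsize(db.short_names.findOne()));
'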
There are two graphs below. The first has the ratio of the database size with long attribute names to the size with short attribute names. A larger ratio means more overhead from long attribute names, and the mmapv1 engine has the worst ratio (1.53). The ratio for WiredTiger with zlib is close to one (1.01). WiredTiger with snappy wasn't as good as zlib, and I think part of the reason is that WiredTiger uses a larger page size when zlib compression is enabled. While that improves the compression ratio it can cause other performance problems, and I am waiting for documentation to catch up to the code to understand this. One problem from a larger page size is that more memory is wasted in the block cache when a page is read to get one hot document. Another is that flash devices are usually much faster at 4kb reads than at 64kb reads, while there isn't much difference on disk between 4kb and 64kb reads. For an IO-bound workload there is also more CPU overhead from decompressing a 64kb page versus a 4kb page.
The second graph shows the database size in GB for each of the engine configurations. Note that WiredTiger with zlib uses about 1/8th the space compared to mmapv1, and even the uncompressed WiredTiger engine does a lot better than mmapv1. I suspect that most of the benefit for wt-none versus mmapv1 comes from avoiding the overhead of the power-of-2 allocation used by mmapv1. As a side note, I am not sure we will be able to turn off power of 2 allocation for mmapv1 in future releases.
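For reference, the engine configurations can be selected at mongod startup roughly as shown below. The option names are from the released MongoDB 3.0 documentation and might differ slightly in the 2.8.0rc builds I used:

mongod --storageEngine wiredTiger --wiredTigerCollectionBlockCompressor none    # wt-none
mongod --storageEngine wiredTiger --wiredTigerCollectionBlockCompressor snappy  # wt-snappy
mongod --storageEngine wiredTiger --wiredTigerCollectionBlockCompressor zlib    # wt-zlib
mongod --storageEngine mmapv1                                                   # mmapv1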


Tuesday, December 23, 2014

Read-modify-write optimized

A log structured merge tree (LSM) has optimizations that reduce the amount of reads from and writes to storage for a write-heavy workload. These optimizations aren't unique to an LSM as the fractal tree algorithm used by TokuMX and TokuDB has similar benefits.

There are different kinds of writes so I will explain this workload. I used iibench for MongoDB with 3 secondary indexes. This loads a collection in primary key order. Random values are used for attributes with secondary indexes. I repeated the test for different database cache sizes to show where an LSM is faster than a b-tree. While I wouldn't call the WiredTiger btree a write-optimized database algorithm, it is much more IO efficient than the mmapv1 engine (uses less space on disk, reads less from and writes less to storage for the same workload).

Results

This displays the document insert rate from iibench with 10 loader threads inserting 200M documents each. The collection goes from 0 to 2B documents during the benchmark. The test was repeated for the B-tree and LSM file structures in WiredTiger and the mmapv1 engine. The test used 3 cache sizes - 144G, 32G and 8G. Buffered IO was used for 144G and direct IO was used for 32G and 8G. The server has fast flash storage, 144G RAM and 40 hyperthread cores. The test used MongoDB 2.8.0rc3. At test end the database was 310G for the b-tree, 290G for the LSM and 815G for mmapv1. The WiredTiger tests used snappy compression. All tests used the SAFE write concern with 1000 documents inserted per request.

The insert rate for the LSM is insensitive to cache size, which is expected for this workload. The insert rate for the b-tree is sensitive to cache size and goes from being much better than the LSM when the secondary index data stays in cache to much worse than the LSM when it does not. Results for mmapv1 explain why it is great that MongoDB 2.8 includes WiredTiger.
I prefer benchmark results that include enough detail to understand them. The results below are from a test with 2.8.0rc2 and are metrics from vmstat and iostat normalized per 200k inserted documents. As each insert request inserted 1000 documents, the results are also per 200 requests. They are best used to compare the WiredTiger btree (wt btree), WiredTiger LSM (wt lsm) and mmapv1 engines. The columns are:
  • vm.cs - context switches per 200 requests. Note the spike for the WiredTiger btree as the cache size is decreased. 
  • vm.us+sy - user & system CPU time (sum of the us and sy columns from vmstat multiplied by 1M). The relative value is useful. Again there is a spike for the WiredTiger btree and none for the LSM, but the LSM uses much more CPU when the cache isn't small.
  • io.kBw - KB written per iostat per 200 requests. The WiredTiger LSM is always more efficient than the btree and its rate doesn't grow as the cache shrinks. The IO rate grows a lot for the WiredTiger btree as the cache shrinks. All of that is expected. The WiredTiger btree has a much lower rate than mmapv1 for the same cache size.
  • io.kBr - KB read per iostat per 200 requests. The WiredTiger btree rate grows fast as the cache shrinks while the LSM rate does not grow as the cache is reduced from 32G to 8G. The rate for mmapv1 at 144G is much worse than for WiredTiger.
  • io.r - read requests per iostat per 200 requests. The results match io.kBr.

per 200k inserts
vm.cs  vm.us+sy  io.kBw   io.kBr    io.r     engine, cache
  29     42445     417      0.49     0.02    wt btree, 144G
 306     67025     791    179.0     17.66    wt btree, 32G
1744    140560    2524   1889.0    215.0     wt btree, 8G

  76    152235     294      0.31     0.02    wt lsm, 144G
 134    148545     359     65.0      4.02    wt lsm, 32G
 137    146215     355     65.0      4.11    wt lsm, 8G 

 350     79500    2414      3.17     0.16    mmapv1, 144G 

B-tree

For a b-tree the IO pattern for the primary key index is efficient -- it is append only to grow the index to the right. If the user provides the value for the PK then the database must confirm that value is unique by reading from the database. That will not require extra reads from storage here because the database pages with that value should remain in cache.

The IO patterns for the secondary indexes are not efficient as inserted docs use random values for the secondary index attributes. Secondary index maintenance for a b-tree is a read-modify-write operation.  The page that has or will have the key must be fetched before it is written. For WiredTiger, and most other popular b-tree implementations, this is a random storage read eventually followed by a random storage write. When compression is used then a page must be decompressed as part of the read and compressed as part of the write.

InnoDB is able to avoid some of the random storage reads from secondary index maintenance via the change buffer. That avoids reads when the change buffer gets multiple changes to a secondary index page before doing the read and it can be very useful in production and on benchmarks.

A b-tree backed by a redo log defers writing back dirty pages until checkpoint or buffer pool pressure forces some pages to be written back. This reduces the IO overhead when one write to storage includes multiple changes to the page. This is less likely as the database:cache ratio increases. This occurs more frequently for workloads with skew (some keys are written much more frequently).

The size of the secondary indexes relative to the cache is critical. While the database might be huge relative to the cache (100X larger, for example), if the secondary indexes fit in cache then all storage reads and many storage writes for secondary index pages will be avoided, and that makes a huge difference in the amount of IO for this workload.

For now I will ignore a b-tree that uses log-structured writes. That might reduce the cost for writing back pages but it will require storage reads and writes for garbage collection to copy-out live pages from previously written extents. It will also suffer from write-amplification for both the copy-out done during GC and from writing back a dirty page when only a fraction of the page has changed.

LSM

The popular compaction algorithms for an LSM are size-tiered and leveled. The WiredTiger LSM uses size-tiered. Without having looked at the WiredTiger source, I expect blind writes to be sufficient for secondary index maintenance because there are no unique constraints on secondary indexes, so a storage read is not required before the write.

A read before the write might be required for primary index maintenance to confirm that the PK for the inserted doc is unique. However the page with that data should almost always be in cache. An LSM can also use a bloom filter for workloads that don't load in PK order. It might be hard to define a bloom filter that is great for multiple indexes, in which case a column family per index is needed. I am not sure whether WiredTiger uses one -- I did not find references to bloom filters in the WiredTiger engine for MongoDB.

An LSM writes to storage during compaction and memtable flushes. It can be difficult to compare the overhead of these operations with a b-tree because most of the IO requests from an LSM are larger (many pages at a time) and are amortized over many user operations. A busy LSM still does random IO:

  • multi-threaded compaction means that each thread has one stream of writes and one or more streams of reads when the database is larger than cache. The request size for these is large (many pages at a time)
  • another stream of writes is in progress for the redo log (WAL, write ahead log)
  • user operations can require single-page reads from storage

Thursday, November 20, 2014

Aurora for MySQL is coming

I am excited about Aurora for MySQL. While there aren't many details, I have two conclusions from the information that is available. First, many talented people did great work on this. Second, many customers want the features it provides and some of these features are otherwise unavailable unless you are web-scale and can afford a team of automation experts. This is a big deal and good for the MySQL community. I am not just writing this to boost my priority on the Aurora preview signup list.

I don't think it matters whether Aurora has better performance. The big story is much better availability and manageability without having to hire an expert MySQL operations team. The limiting factor might be cost but that depends on whether you come from the land where everything is free or from a commercial DBMS. And even in the land of free, the cost for an operations team is large.

Soapbox

Before describing what I learned about it I have a short editorial.
  • Is Amazon the reason we need AGPL? It isn't clear to me that they improve upstream MySQL. They do benefit the community by making it easier to run MySQL in the cloud. But I don't see them in the community. Which means I also don't see their developers in the community and who wants to disappear while working on mostly open-source?
  • Their marketing leads with "up to 5X faster than MySQL", which the tech press translates into "5X faster than MySQL". When I signed up for an Aurora preview the response email from Amazon also used the "5X faster than MySQL" claim. I prefer "up to 5X faster than MySQL".
  • In the video from AWS re:Invent 2014 James Hamilton makes some interesting claims.
    • I am not sure he has heard about InnoDB based on this statement -- "Just doing a state of the art storage engine would have been worth doing. Take Jim Gray's black book on transactions, implement it, I would be happy." Someone can tell him to be happy.
    • In big print -- 3X write performance, 5X read performance. In small print -- sysbench. It will take time to determine whether similar performance & availability can be had elsewhere at a similar cost.
    • In big print -- 400X less lag, 2 seconds vs 5 milliseconds. I doubt this. The slide also stated -- synchronous multi-DC replication. I don't see how that is possible except within a datacenter given network latency. Other content from Amazon claims this is async replication. But then Mr. Hamilton stated that everything is there, transactions are not lost were two data centers to instantly disappear. Again, that requires sync replication. This is confusing and I am surprised his claims don't match the other documentation from Amazon.
    • It can fix torn pages. InnoDB does this for writes in progress during a crash. I assume Aurora can also do this for pages written a long time ago and that would be a big deal.

Information

What is Aurora? I don't know and we might never find out. I assume it is a completely new storage engine rather than a new IO layer under InnoDB. I encourage people to read the details page, FAQ, slides and pricing guide. The response from Clustrix is also useful. At least one answer on Quora is excellent, and this overview of Amazon datacenters will convince you that they have the network needed to make this work with low latency.

The big deal is that the database is replicated 6X using 3 availability zones (AZs). I assume this means it uses 2 storage clusters per AZ. Documentation from Amazon states this is done using async replication and (committed) writes are available within a few milliseconds or 10s of milliseconds. I assume this is only for Aurora read replicas in the same AZ as the master. Network latency will make that impossible for some remote AZs.  In the presentation at AWS re:Invent there are claims that the replication is synchronous. That is confusing.

For now I will assume that Aurora does async replication to the two remote AZs and the non-primary storage cluster in the primary AZ -- which means that 5 of the 6 copies are updated via async replication. If this is true, then an outage at the primary storage cluster can mean that recent commits are lost. It would be great if someone were to make this clear. My preferred solution would be sync replication within the primary AZ to maintain copies in 2 storage clusters, and then async replication to the 4 copies in the 2 remote AZs. We will soon have multiple solutions in the MySQL community that can do sync replication within a datacenter and async replication elsewhere -- Galera, upstream and lossless semisync. But Aurora is much easier to deploy than the alternatives. My standard question is what commit rate can be sustained when all updates are to the same row? Sync replication with cross-country network round trips makes that slow.

The presentation also claimed this was mostly drop-in compatible. I am interested in compatibility with InnoDB. 
  • What are the semantics for cursor isolation? PostgreSQL, InnoDB and Oracle all do snapshot isolation with different semantics for writes. PostgreSQL has excellent documentation to describe the behavior.
  • Does Aurora support clustered indexes? 
  • What is the max size of an index key?
  • How are large columns supported? Is data always inline?
  • How is multi-versioning implemented? InnoDB usually does updates in place so there isn't much work for purge to do except for deletes and updates to secondary index columns. 
  • Does this use pessimistic or optimistic concurrency control?
  • Does this support partitioning?
  • What is the block size for reads?
  • Does this use compression?

Features

The brief description of features is very interesting. I will summarize that here. The features sound great and I expect them to get a lot of customers.
  • Unplanned failover to an Aurora read replica takes a few minutes. Unplanned failover to a new instance can take up to 15 minutes; the longer delay applies to customers who aren't spending money on Aurora read replicas. This is another feature that will make Aurora very popular. While it will be nice to make failover faster, the big deal is that they provide this at all. The usual MySQL deployment required some do-it-yourself effort to get something similar.
  • Storage uses SSD. I assume this is based on EBS. They do background scrubbing to detect and correct corrupt pages. Storage grows in 10G increments, so you don't have to provision a fixed amount of GB or TB -- it grows as needed up to 64T (or 64T per table). I am not sure I would want 64T in one instance, but it can make it easy to archive data in place. Automatic growth to such large database sizes will also make it much easier for deployments to avoid sharding, especially when an instance has many TB of cold data.
  • There are interesting features for point-in-time recovery, incremental backup and snapshots. This is integrated with S3. Incremental backups make it affordable to archive data in place as you don't do full backups for data that doesn't change. But I don't understand all of their backup options. 
  • The database is replicated 2 times within each of 3 AZs so there are 6 copies. Up to 2 copies can be lost and writes are still possible. Up to 3 copies can be lost and reads are still possible. I assume that by "copies can be lost" they mean storage clusters can be lost. Automatic recovery here is another big deal.
  • The buffer pool survives mysqld process restart. I wonder if that is only true for planned restart. Regardless, this is a very useful feature for IO-bound workloads when it is important to have a warm cache. Percona used to have a patch for this with InnoDB.
  • Replication is supported in two ways -- via Aurora (at storage level) and MySQL (binlog). My bet is that Aurora replication will be most popular but some people will use MySQL replication to replicas with locally attached storage to save money. They claim much less lag with Aurora and I agree there will be less but I am not sure there will be 400X less. However, they avoid some overhead on commit by not using the binlog and they avoid a lot of complexity by not requiring MySQL replication on the master or slave. I assume that Aurora replication ships deltas rather than page images to be efficient on the network.

Cost

Cost comparisons will be interesting. I am very uncertain about my estimates here as I don't have much experience with prices on AWS or for normal sized datacenter deployments limited to a few servers. I estimate a 3 year cost of $568,814 for a high-end Aurora deployment: largest servers, 1 master, 1 replica, backup and storage (the three line items below sum to that total; a quick check follows the list). It will be interesting to compare this to non-AWS hardware because you need to account for the features that Aurora provides and for the extra availability of the Aurora storage solution. I used 3TB and 10k IOPs from storage because that can be provided easily by a high-end PCIe flash device, but that also assumes a very busy server.
  • Reserved db.r3.8xlarge (32 vCPUs, 244GiB RAM) for 2 servers is $115,682 over 3 years
  • 50TB of backup at 10 cents/GB/month is $53,100 over 3 years
  • 3TB of storage and 10,000 IOPs per server is $400,032 for 2 servers over 3 years. But I am not sure if the storage cost includes IO done to all AZs to maintain the 6 database copies. In that case, the storage cost might be higher.
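A quick check that the line items sum to the estimate above:

echo $(( 115682 + 53100 + 400032 ))   # prints 568814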

Wednesday, October 15, 2014

Updates with secondary index maintenance: 5.7 vs previous releases

My previous results for an update-only workload to a cached and IO-bound database used a table that did not require secondary index maintenance from the update. Prior to that I ran tests using mysqlslap and might have found a regression in InnoDB when the update requires secondary index maintenance (bug 74235). The mysqlslap test did all updates to the same row. The test I describe here chooses the row to update at random and this workload does not reproduce the worst case from bug 74235.

The results are that newer releases tend to do worse at 1 thread, 5.7.5 does best at 32 threads and 5.7.5 does slightly worse than 5.6.21 at less than 32 threads.  We need to fix the regressions in 5.7.5 at less-than-high concurrency workloads.

setup

The test here was the same as described in my previous posts with one exception -- the test tables have a secondary index on (k,c) and note that the column c is set to 0 at test start. The updates are of the form: UPDATE $table SET c = c + 1 WHERE id = $X. The test tables look like:
CREATE TABLE sbtest1 (
  id int(10) unsigned NOT NULL auto_increment,
  k int(10) unsigned NOT NULL default '0',
  c char(120) NOT NULL default '',
  pad char(60) NOT NULL default '',
  PRIMARY KEY  (id),
  KEY k (k,c)
) ENGINE=InnoDB AUTO_INCREMENT=8000001 DEFAULT CHARSET=latin1

There were 8 tables in the database with 8M rows per table. Tests were run for 1 thread with 1 table and 8, 16, 32 threads with 8 tables. The test at each concurrency level was run for 10 minutes. Tests were run for a 1G (IO-bound) and 32G (cached) InnoDB buffer pool. For the cached test the tables were read into the buffer pool at test start. The updates/second rate from these tests had more variance than previous tests. I ran tests many times and report rates here that were typical. For the 1-thread tests that use 1 table, the database (1 table) is 2G. For the 8, 16 and 32 thread tests that use 8 tables the database (8 tables) is 16G.

results at 1 thread

The rates for a cached database and 1 thread match most previous results -- newer releases are much slower than older releases and MySQL 5.7.5 is about 1.5X worse than MySQL 5.0.85. The results for the IO-bound setup don't follow that trend. Newer releases do better. I regret that I did not have the time to explain the difference between 5.6.21 and 5.7.5.


results at many threads

MySQL 5.7.5 and 5.6.21 are much better than older releases. 5.6.21 does better than 5.7.5 at 8 and 16 threads while 5.7.5 is the best at 32 threads.



configuration

Tuesday, October 14, 2014

Sysbench IO-bound updates: MySQL 5.7 vs previous releases

I repeated the sysbench update-only test after reducing the InnoDB buffer pool to 1G. The test database is 16G so the test should be IO-bound. MySQL 5.7.5 is 10% worse than 5.0.85 at 1 thread and much better at 8+ threads.

The previous blog post has more details on the setup. The configuration is not crash safe as the doublewrite buffer and binlog are disabled and innodb_flush_log_at_trx_commit=2. This test continued to use direct IO for InnoDB, but unlike the previous test, which did many writes and no reads, it requires a high rate of reads from storage. A fast flash device is used for storage.

results at 1-thread

Only 1 table is used so there is a 2G database and 1G buffer pool. This test is less IO-bound than the many threads tests reported in the next section. Two graphs are displayed below. The first has the updates/second rates and the second has the rates normalized to the rate from 5.0.85. MySQL 5.7.5 is about 10% worse than 5.0.85 and all servers do between 3000 and 4000 updates/second.
results at many threads

These tests uniformly distribute clients across the 8 tables. MySQL 5.7.5 is the best at 16 and 32 threads. MySQL 5.0.85 and 5.1.63 saturate at 8 threads.

Monday, October 13, 2014

Sysbench cached updates: MySQL 5.7 vs previous releases

I used MySQL 5.7.5 to repeat tests I did for MySQL 5.7.2 and 5.6.10 with the sysbench client and a cached database to compare performance for low and high concurrency. My configuration was optimized for throughput rather than crash safety. The performance summary:
  • MySQL 5.7.5 is 1.47X slower than 5.0.85 at 1 thread
  • MySQL 5.7.5 is a bit slower than 5.6.21 at 8 and 16 threads
  • MySQL 5.7.5 is faster than others at 32 threads
setup

I used my sysbench 0.4 fork and set the value of the c column to 0 in the sbtest tables as that column is incremented during the test. The test used 8 tables with 8M rows per table (about 16G total) and the InnoDB buffer pool was 32G. One table was used for the 1 thread case. For the other tests the clients were uniformly distributed across the 8 tables. All data was in the buffer pool at test start. The test was run for 10 minutes at each concurrency level (1, 8, 16, 32 threads). Throughput appeared to be stable throughout the test.

Tests were run for MySQL versions 5.0.85, 5.1.63, 5.6.21 and 5.7.5. Tests were repeated for 4k and 8k innodb_page_size with MySQL 5.7.5 (see -16k, -8k and -4k below). Tests were also repeated with changes to innodb_flush_neighbors (=0), innodb_write_io_threads (=8), innodb_page_cleaners (=4) and innodb_purge_threads (=4). Results for those changes are not provided here as there wasn't an impact on QPS. The change to flush_neighbors reduced the InnoDB page write rate, which is a good thing.

results

The regression from 5.0.85 to recent versions is visible in the single-thread results below.
MySQL 5.7.5 is the champ at 32 threads but is slightly worse than 5.6.21 at 8 and 16 threads. Recent versions are much better than old versions.

configuration

I used this my.cnf for MySQL 5.7.5 and then something similar for other versions:
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES 
table-definition-cache=1000
table-open-cache=2000
table-open-cache-instances=8
max_connections=20000
key_buffer_size=200M
metadata_locks_hash_instances=256 
query_cache_size=0
query_cache_type=0
server_id=9
performance_schema=0
binlog_format=row
skip_log_bin
innodb_buffer_pool_instances=8
innodb_io_capacity=32000
innodb_lru_scan_depth=4000
innodb_checksum_algorithm=CRC32
innodb_thread_concurrency=0
innodb_buffer_pool_size=32G
innodb_log_file_size=1900M
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=0
innodb_flush_method=O_DIRECT
innodb_thread_concurrency=0
innodb_max_dirty_pages_pct=80
innodb_file_format=barracuda
innodb_file_per_table

This was the command line for one of the tests:
./sysbench --batch --batch-delay=10 --test=oltp --mysql-host=127.0.0.1 --mysql-db=test --oltp-table-size=8000000 --max-time=600 --max-requests=0 --mysql-table-engine=innodb --db-ps-mode=disable --mysql-engine-trx=yes --oltp-table-name=sbtest1 --oltp-skip-trx --oltp-test-mode=incupdate --oltp-dist-type=uniform --oltp-range-size=10 --oltp-connect-delay=0 --percentile=99 --num-threads=1 --seed-rng=1413171854 run

Sunday, October 12, 2014

Page read performance: MySQL 5.7 vs previous releases

For this test I used sysbench to determine the peak rate for disk reads per second from InnoDB with fast storage. Fast storage in this case is the OS filesystem cache. I want to know the peak for MySQL 5.7.5 and whether there is a regression from 5.6. My summary is:
  • 5.7.5 is 12% to 15% worse than 5.6.21 at 1 thread. I hope we can fix this via bug 74342.
  • 5.7.5 is 2% to 4% worse than 5.6.21 at 8, 16 and 32 threads
  • 5.6/5.7 are much better than 5.0 especially at high concurrency
  • Page size (16k, 8k, 4k) has a big impact on the peak only when compression is used. 
setup

The test server has 40 hyperthread cores and is shared by the sysbench client and mysqld server.

I used my sysbench 0.4 fork configured to fetch 1 row by primary key via SELECT and HANDLER statements. The InnoDB buffer cache was 1G with innodb_flush_method="" (buffered IO). There were 8 tables in the test database with 16M rows per table.  The test database was either 32G uncompressed or 16G compressed so tests with compression had a better InnoDB buffer pool hit rate. Tests were run with 1, 8, 16 and 32 threads. The 1 thread test was limited to 1 table (2G or 4G of data) so this benefits from a better InnoDB buffer pool hit rate. The 8, 16 and 32 thread tests uniformly distribute clients across 8 tables.
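The statements issued look roughly like the following; the exact text generated by my sysbench fork may differ and the key value is just an example:

SELECT * FROM sbtest1 WHERE id = 12345;

HANDLER sbtest1 OPEN;
HANDLER sbtest1 READ `PRIMARY` = (12345);
HANDLER sbtest1 CLOSE;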

Tests were done for MySQL 5.0.85, 5.6.21 and 5.7.5. All versions were tested for 16k pages without compression (see -16k-c0 below). Some MySQL 5.7.5 tests were repeated for 8k and 4k pages (see -4k- and -8k- below). Some MySQL 5.7.5 tests were repeated for 2X compression via zlib (see -c1- below).

results at 1 thread

The results at 1 thread show that peak QPS for 5.7.5 is 12% to 15% worse than for 5.6.21. I hope this can be fixed. I assume the problem is a longer code path from new features in 5.7.

results at 8 threads

The difference between 5.6.21 and 5.7.5 is smaller than at 1 thread. I assume this is the benefit from less mutex contention in 5.7.
results at 16 threads

The difference between 5.6.21 and 5.7.5 is smaller than at 1 thread. I assume this is the benefit from less mutex contention in 5.7.

results at 32 threads

The difference between 5.6.21 and 5.7.5 is smaller than at 1 thread. I assume this is the benefit from less mutex contention in 5.7.


configuration

I used this my.cnf for 5.7.5 and modified it for 5.6:
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES 
table-definition-cache=1000
table-open-cache=2000
table-open-cache-instances=8
max_connections=20000
key_buffer_size=200M
metadata_locks_hash_instances=256 
query_cache_size=0
query_cache_type=0
server_id=9
performance_schema=0
binlog_format=row
skip_log_bin
innodb_buffer_pool_instances=8
innodb_io_capacity=32000
innodb_lru_scan_depth=4000
innodb_checksum_algorithm=CRC32
innodb_thread_concurrency=0
innodb_buffer_pool_size=1G
innodb_log_file_size=1900M
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=0
# innodb_flush_method=O_DIRECT
innodb_thread_concurrency=0
innodb_max_dirty_pages_pct=80
innodb_file_format=barracuda
innodb_file_per_table
datadir=...

This is an example command line for a test that uses SELECT and 1 thread per table. My testing script starts 8 sysbench clients, one per table. The value for --seed-rng is different for every sysbench client (a function of the current time and the table number, 1 to 8) so that different keys are accessed on each run:
./sysbench --batch --batch-delay=10 --test=oltp --mysql-host=127.0.0.1 --mysql-db=test --oltp-table-size=16000000 --max-time=120 --max-requests=0 --mysql-table-engine=innodb --db-ps-mode=disable --mysql-engine-trx=yes --oltp-table-name=sbtest1 --oltp-read-only --oltp-skip-trx --oltp-test-mode=simple --oltp-point-select-all-cols --oltp-dist-type=uniform --oltp-range-size=10 --oltp-connect-delay=0 --percentile=99 --num-threads=1 --seed-rng=1413134256 run

Friday, October 10, 2014

Low-concurrency performance for updates with InnoDB: MySQL 5.7 vs previous releases

First I tested updates with the Heap engine as a base case and now I report on InnoDB to find regressions that are specific to it. The workload is described in the previous blog post. Response time is measured at 1, 4 and 32 threads for two types of updates. The unindexed update increments an unindexed column. The indexed update updates a column that is in a secondary index. Tests are run with the binlog enabled and disabled. Tests are done for MySQL versions 5.0.85, 5.1.63, 5.5.40, 5.6.21 and 5.7.5. In some cases tests are also done with the adaptive hash index disabled. The test database has 64,000 rows but all updates are to one row (lots of contention). The client is mysqlslap running on the same host as mysqld. The test server has 40 hyperthread cores.

For the single-thread update unindexed workload MySQL 5.7.5 is about 1.6X slower than 5.0.85. This is similar to what I reported in previous posts. For the single-thread update indexed workload MySQL 5.7.5 is about 4X slower than 5.0.85 and much slower than 5.6.21. This is a regression for which I created bug 74325. This and a few other bugs might be motivation to get similar tests into the upstream release process.

The binaries below are tagged with -noahi, -nobl and -bl. The -noahi binary has the binlog and adaptive hash index disabled. The -nobl binary has the binlog disabled. The -bl binary has the binlog enabled.

update indexed, binlog off

Result summary:
  • at 1 thread: 5.7.5 is 3.8X slower, 5.6.21 is 1.43X slower
  • at 4 threads: 5.7.5 is 3.88X slower, 5.6.21 is slightly faster
  • at 32 threads: 5.7.5 is 1.96X slower, 5.6.21 is faster

update indexed, binlog on

Result summary:
  • at 1 thread: 5.7.5 is 4X slower, 5.6.21 is 1.37X slower
  • at 4 threads: 5.7.5 is 2.1X slower, 5.6.21 is 1.09X slower
  • at 32 threads: 5.7.5 is 2.26X slower, 5.6.21 is faster


update unindexed, binlog off

Result summary:
  • at 1 thread: 5.7.5 is 1.65X slower, 5.6.21 is 1.44X slower
  • at 4 threads: 5.7.5 is 1.33X slower, 5.6.21 is faster. Note that 5.7.5 is the only post-5.0.85 release still slower than 5.0.85 at 4 threads.
  • at 32 threads: 5.7.5 is faster, 5.6.21 is faster



update unindexed, binlog on

Result summary:
  • at 1 thread: 5.7.5 is 1.62X slower, 5.6.21 is 1.43X slower
  • at 4 threads: 5.7.5 is 1.39X slower, 5.6.21 is 1.18X slower
  • at 32 threads: 5.7.5 is faster, 5.6.21 is faster



Thursday, October 9, 2014

Low-concurrency performance for updates and the Heap engine: MySQL 5.7 vs previous releases

The low-concurrency performance regression evaluation continues after the evaluation of range queries. This post covers updates with the Heap engine, which is used as the base case; the next post has results for InnoDB. The database has 64,000 rows but all updates were for the same row. Tests were done for updates to an indexed and then an unindexed column at 1, 4 and 32 threads. Tests were also repeated with the binlog disabled and then enabled. There are regressions at 1, 4 and 32 threads. The summary is:
  • Response time at 1 thread for 5.7.5 is between 1.55X and 1.67X worse than 5.0.85 
  • Response time at 32 threads for 5.7.5 is between 1.19X and 1.49X worse than 5.0.85 
  • In all cases it is worse in 5.7 than in 5.6
  • Regressions at 4 and 32 threads are much worse when the binlog is enabled
All of the graphs use normalized response time. For each tested version that is the time at X threads divided by the time at X threads for MySQL 5.0.85, where X is 1, 4 and 32.

update unindexed, binlog off

This section has results for updates to an unindexed column with the binlog disabled. Newer releases are faster at 4 and 32 threads. There is a gradual regression at 1 thread and the response time for MySQL 5.7.5 is 1.55X worse than for 5.0.85.



update unindexed, binlog on

This section has results for updates to an unindexed column with the binlog enabled. Newer releases are faster at 4 and 32 threads. There is a gradual regression at 1 thread and the response time for MySQL 5.7.5 is 1.67X worse than for 5.0.85. The single-thread regression here is worse than with the binlog disabled, so there is some regression from the binlog code.


update indexed, binlog off

This section has results for updates to an indexed column with the binlog disabled. Newer releases are faster at 4 and 32 threads. There is a gradual regression at 1 thread and the response time for MySQL 5.7.5 is 1.56X worse than for 5.0.85.



update indexed, binlog on

This section has results for updates to an indexed column with the binlog enabled. Newer releases are faster at 4 and 32 threads. There is a gradual regression at 1 thread and the response time for MySQL 5.7.5 is 1.67X worse than for 5.0.85. The single-thread regression here is worse than with the binlog disabled, so there is some regression from the binlog code.



command lines

Tests used mysqlslap as described in previous posts. The command lines for updates to the non-indexed and indexed columns are:
mysqlslap --concurrency=$t -h127.0.0.1 --create-schema=test1 --number-of-queries=$(( $t * 100000 )) --create=${ddl} --query="update foo set l = l + 2 where i=100"

mysqlslap --concurrency=$t -h127.0.0.1 --create-schema=test1 --number-of-queries=$(( $t * 100000 )) --create=${ddl} --query="update foo set k = k + 2 where i=100"

Monday, October 6, 2014

Details on range scan performance regression in MySQL 5.7

My last blog post was too long to also explain the performance regressions for range scans with InnoDB in MySQL 5.7, so I do that here.
  • For a secondary index a non-covering scan degrades much more at high-concurrency than a covering scan. The problem gets worse as the scan length increases. This problem isn't new in 5.7 but it is much worse in 5.7 versus 5.6. At 32 threads the problem is waits on btr_search_latch and hash_table_locks. I opened bug 74283 for this.
  • A covering scan of a secondary index degrades much more at high-concurrency than a PK scan. This problem is equally bad in 5.7 and 5.6. Wait times reported by the performance schema do not explain the problem -- search below for PFS_NOT_INSTRUMENTED to understand why. Using Linux perf (an invocation sketch follows this list) I definitely see more mutex contention in the covering index scan. But I don't understand the internal differences between PK and covering secondary index scans. I opened bug 74280 for this.
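This is roughly how the perf profiles near the end of this post were collected; the sampling duration is an assumption and a single mysqld process is assumed:

perf record -p $(pidof mysqld) -- sleep 60      # flat (non-hierarchical) profile
perf report --stdio
perf record -g -p $(pidof mysqld) -- sleep 60   # call-graph (hierarchical) profile
perf report --stdio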
non-covering vs covering

These graphs show the absolute response time in seconds, not the normalized response time as used in my previous posts. The response time difference between non-covering and covering secondary index scans is much larger at high-concurrency than at low-concurrency. For the 10-row scan the non-covering index is close to the covering index response time at 1 thread but at 32 threads it is almost 2X worse. For the 1000-row scan the non-covering index is about 2X worse at 1 thread and then at 32 threads is about 8X worse in 5.6 and about 13X worse in 5.7. From the PS data that follows the problem is much more wait time on btr_search_latch and hash_table_locks.


covering vs PK

These graphs show absolute response time for the 1000-row scan using MySQL 5.6 and 5.7. There is not much difference in response time between the PK and covering secondary index scans at 1 thread. There is a big difference at 32 threads. 
Performance Schema

Assuming the problem is mutex contention then we have the perfect tool to diagnose it -- the Performance Schema. I enabled it to instrument wait/synch/% objects and the results are interesting. I ran the scan tests at 32 threads for the q1, q4 and q5 range scans (q1 is PK, q4 is non-covering secondary, q5 is covering secondary). Tests were done for 10-row and 1000-row scans with the adaptive hash index enabled and then disabled. For each configuration the top-10 wait events are listed below. Note that the q4 query runs for much more time than both q1 and q5 and that q5 runs for about 2X longer than q1. I have not normalized the wait times to account for that.
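The queries below are a sketch of how the wait times can be pulled from the Performance Schema (I don't recall the exact statements I used); the 1e12 divisor converts the picosecond timers to seconds:

UPDATE performance_schema.setup_instruments
SET ENABLED = 'YES', TIMED = 'YES'
WHERE NAME LIKE 'wait/synch/%';

TRUNCATE TABLE performance_schema.events_waits_summary_global_by_event_name;
-- run the 32-thread scan test, then:
SELECT EVENT_NAME, SUM_TIMER_WAIT / 1e12 AS seconds
FROM performance_schema.events_waits_summary_global_by_event_name
WHERE EVENT_NAME LIKE 'wait/synch/%'
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;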

It took the following amount of time to finish the tests
  • adaptive hash on, 10 rows: 23, 35, 23 seconds for q1, q4, q5
  • adaptive hash off, 10 rows: 23, 36, 23 seconds for q1, q4, q5
  • adaptive hash on, 1000 rows: 173, 3540, 237 seconds for q1, q4, q5
  • adaptive hash off, 1000 rows: 137, 3440, 254 seconds for q1, q4, q5

10-row PK scan at 32 threads, adaptive hash index enabled

This is the base case. It was about as fast as the covering index scan.

Seconds    Event
7.50       wait/synch/sxlock/innodb/hash_table_locks
2.57       wait/synch/rwlock/sql/LOCK_grant
2.51       wait/synch/mutex/innodb/fil_system_mutex
2.48       wait/synch/mutex/sql/THD::LOCK_query_plan
1.56       wait/synch/mutex/sql/THD::LOCK_thd_data
1.21       wait/synch/mutex/sql/LOCK_table_cache
1.20       wait/synch/sxlock/innodb/btr_search_latch
0.84       wait/synch/mutex/sql/THD::LOCK_thd_query
0.76       wait/synch/mutex/innodb/trx_pool_mutex
0.75       wait/synch/sxlock/innodb/index_tree_rw_lock

10-row covering scan at 32 threads, adaptive hash index enabled

Compared to the 10-row PK scan, this has:

  • More wait time on btr_search_latch (3.89 vs 1.20 seconds)

Seconds    Event
7.49       wait/synch/sxlock/innodb/hash_table_locks
3.89       wait/synch/sxlock/innodb/btr_search_latch
2.69       wait/synch/mutex/sql/THD::LOCK_query_plan
2.60       wait/synch/mutex/innodb/fil_system_mutex
2.37       wait/synch/rwlock/sql/LOCK_grant
1.45       wait/synch/mutex/sql/THD::LOCK_thd_data
1.22       wait/synch/mutex/sql/LOCK_table_cache
0.88       wait/synch/mutex/sql/THD::LOCK_thd_query
0.74       wait/synch/sxlock/innodb/index_tree_rw_lock
0.68       wait/synch/mutex/innodb/trx_pool_mutex

10-row non-covering scan at 32 threads, adaptive hash index enabled

Compared to the 10-row covering scan, this has:
  • Much more wait time on btr_search_latch (15.60 vs 3.89 seconds)
  • A bit more time on hash_table_locks (9.16 vs 7.49 seconds)
  • A bit less time on LOCK_grant and fil_system_mutex (~1.5 vs ~2.5 seconds)
Seconds    Event
15.60      wait/synch/sxlock/innodb/btr_search_latch
 9.16      wait/synch/sxlock/innodb/hash_table_locks
 2.75      wait/synch/mutex/sql/THD::LOCK_query_plan
 1.54      wait/synch/mutex/innodb/fil_system_mutex
 1.47      wait/synch/mutex/sql/THD::LOCK_thd_data
 1.42      wait/synch/rwlock/sql/LOCK_grant
 1.15      wait/synch/mutex/sql/LOCK_table_cache
 0.96      wait/synch/mutex/sql/THD::LOCK_thd_query
 0.84      wait/synch/sxlock/innodb/index_tree_rw_lock
 0.71      wait/synch/mutex/innodb/trx_pool_mutex

10-row PK scan at 32 threads, adaptive hash index disabled

This is the base case. It was about as fast as the covering index scan whether or not the adaptive hash index was enabled.

Seconds    Event
8.47       wait/synch/sxlock/innodb/hash_table_locks
2.65       wait/synch/rwlock/sql/LOCK_grant
2.58       wait/synch/mutex/innodb/fil_system_mutex
2.42       wait/synch/mutex/sql/THD::LOCK_query_plan
1.52       wait/synch/mutex/sql/THD::LOCK_thd_data
1.25       wait/synch/sxlock/innodb/index_tree_rw_lock
1.17       wait/synch/mutex/sql/LOCK_table_cache
0.87       wait/synch/mutex/sql/THD::LOCK_thd_query
0.65       wait/synch/mutex/innodb/trx_pool_mutex
0.08       wait/io/file/innodb/innodb_data_file

10-row covering scan at 32 threads, adaptive hash index disabled

The waits here are similar to the PK scan.

Seconds    Event
8.53       wait/synch/sxlock/innodb/hash_table_locks
2.67       wait/synch/rwlock/sql/LOCK_grant
2.54       wait/synch/mutex/innodb/fil_system_mutex
2.47       wait/synch/mutex/sql/THD::LOCK_query_plan
1.52       wait/synch/mutex/sql/LOCK_table_cache
1.44       wait/synch/mutex/sql/THD::LOCK_thd_data
1.25       wait/synch/sxlock/innodb/index_tree_rw_lock
0.93       wait/synch/mutex/sql/THD::LOCK_thd_query
0.68       wait/synch/mutex/innodb/trx_pool_mutex
0.08       wait/io/file/innodb/innodb_data_file

10-row non-covering scan at 32 threads, adaptive hash index disabled

Compared to the covering index scan, this has:
  • Much more time in hash_table_locks (30.97 vs 8.53 seconds)
  • Much more time in index_tree_rw_lock (8.80 vs 1.25 seconds)

Seconds    Event
30.97      wait/synch/sxlock/innodb/hash_table_locks
 8.80      wait/synch/sxlock/innodb/index_tree_rw_lock
 2.76      wait/synch/mutex/sql/THD::LOCK_query_plan
 2.14      wait/synch/rwlock/sql/LOCK_grant
 2.12      wait/synch/mutex/innodb/fil_system_mutex
 1.46      wait/synch/mutex/sql/THD::LOCK_thd_data
 1.24      wait/synch/mutex/sql/LOCK_table_cache
 1.00      wait/synch/mutex/sql/THD::LOCK_thd_query
 0.66      wait/synch/mutex/innodb/trx_pool_mutex
 0.08      wait/io/file/innodb/innodb_data_file

1000-row PK scan at 32 threads, adaptive hash index enabled

This is the base case.

Seconds    Event
8.21       wait/synch/sxlock/innodb/hash_table_locks
3.13       wait/synch/mutex/sql/THD::LOCK_query_plan
1.57       wait/synch/mutex/sql/THD::LOCK_thd_data
1.08       wait/synch/mutex/sql/LOCK_table_cache
1.04       wait/synch/mutex/sql/THD::LOCK_thd_query
0.69       wait/synch/rwlock/sql/LOCK_grant
0.66       wait/synch/mutex/innodb/trx_pool_mutex
0.64       wait/synch/sxlock/innodb/btr_search_latch
0.62       wait/synch/mutex/innodb/fil_system_mutex
0.60       wait/synch/sxlock/innodb/index_tree_rw_lock

1000-row covering scan at 32 threads, adaptive hash index enabled

This is about 1.4X slower than the PK scan at 1000 rows. The waits here are about the same as the PK case but the top wait event, hash_table_locks, is less here than in the PK scan. There isn't a wait event that explains the difference in performance. Perhaps the difference comes from CPU instructions or cache misses. The reason the PS doesn't explain the difference is that the per-block rw-lock isn't instrumented; in buf_block_init() there is this code:
   rw_lock_create(PFS_NOT_INSTRUMENTED, &block->lock, SYNC_LEVEL_VARYING);


Seconds    Event
6.45       wait/synch/sxlock/innodb/hash_table_locks
3.22       wait/synch/mutex/sql/THD::LOCK_query_plan
1.52       wait/synch/mutex/sql/THD::LOCK_thd_data
1.00       wait/synch/mutex/sql/THD::LOCK_thd_query
0.93       wait/synch/mutex/sql/LOCK_table_cache
0.72       wait/synch/mutex/innodb/trx_pool_mutex
0.70       wait/synch/sxlock/innodb/btr_search_latch
0.65       wait/synch/mutex/innodb/fil_system_mutex
0.63       wait/synch/rwlock/sql/LOCK_grant
0.56       wait/synch/sxlock/innodb/index_tree_rw_lock

1000-row non-covering scan at 32 threads, adaptive hash index enabled

This is 15 to 20 times slower than the PK and covering index scans. Compared to the covering index scan this has:
  • Much more time in btr_search_latch (655.57 vs 0.70 seconds)
  • Much more time in hash_table_locks (35.47 vs 6.45 seconds)
  • Much more time in index_tree_rw_lock (9.68 vs 0.56 seconds)
Seconds    Event
655.57     wait/synch/sxlock/innodb/btr_search_latch
 35.47     wait/synch/sxlock/innodb/hash_table_locks
  9.68     wait/synch/sxlock/innodb/index_tree_rw_lock
  3.42     wait/synch/mutex/sql/THD::LOCK_query_plan
  1.65     wait/synch/mutex/sql/THD::LOCK_thd_data
  1.26     wait/synch/mutex/sql/THD::LOCK_thd_query
  0.86     wait/synch/mutex/sql/LOCK_table_cache
  0.74     wait/synch/mutex/innodb/trx_pool_mutex
  0.73     wait/synch/rwlock/sql/LOCK_grant
  0.43     wait/synch/mutex/innodb/fil_system_mutex

1000-row PK scan at 32 threads, adaptive hash index disabled

This is the base case.

Seconds    Event
9.80       wait/synch/sxlock/innodb/hash_table_locks
3.00       wait/synch/mutex/sql/THD::LOCK_query_plan
1.53       wait/synch/mutex/sql/THD::LOCK_thd_data
1.13       wait/synch/sxlock/innodb/index_tree_rw_lock
1.05       wait/synch/mutex/sql/LOCK_table_cache
1.02       wait/synch/mutex/sql/THD::LOCK_thd_query
0.79       wait/synch/mutex/innodb/trx_pool_mutex
0.64       wait/synch/rwlock/sql/LOCK_grant
0.58       wait/synch/mutex/innodb/fil_system_mutex
0.08       wait/io/file/innodb/innodb_data_file

1000-row covering scan at 32 threads, adaptive hash index disabled

This is about 2X slower than the PK scan at 1000 rows. The waits here are about the same as the PK case but the top wait event, hash_table_locks, is less here than in the PK scan. There isn't a wait event that explains the difference in performance. Perhaps the difference comes from CPU instructions or cache misses.

Seconds    Event
8.14       wait/synch/sxlock/innodb/hash_table_locks
3.00       wait/synch/mutex/sql/THD::LOCK_query_plan
1.51       wait/synch/mutex/sql/THD::LOCK_thd_data
0.99       wait/synch/sxlock/innodb/index_tree_rw_lock
0.99       wait/synch/mutex/sql/THD::LOCK_thd_query
0.95       wait/synch/mutex/sql/LOCK_table_cache
0.77       wait/synch/mutex/innodb/trx_pool_mutex
0.66       wait/synch/rwlock/sql/LOCK_grant
0.62       wait/synch/mutex/innodb/fil_system_mutex
0.08       wait/io/file/innodb/innodb_data_file

1000-row non-covering scan at 32 threads, adaptive hash index disabled

Wow, compared to the covering index scan this has:

  • Much more wait time on hash_table_locks (1434.73 vs 8.14 seconds)
  • Much more time on index_tree_rw_lock (659.07 vs 0.99 seconds)

Seconds    Event
1434.73    wait/synch/sxlock/innodb/hash_table_locks
 659.07    wait/synch/sxlock/innodb/index_tree_rw_lock
   3.25    wait/synch/mutex/sql/THD::LOCK_query_plan
   1.51    wait/synch/mutex/sql/THD::LOCK_thd_data
   0.97    wait/synch/mutex/sql/THD::LOCK_thd_query
   0.87    wait/synch/mutex/sql/LOCK_table_cache
   0.71    wait/synch/mutex/innodb/trx_pool_mutex
   0.67    wait/synch/rwlock/sql/LOCK_grant
   0.44    wait/synch/mutex/innodb/fil_system_mutex
   0.08    wait/io/file/innodb/innodb_data_file

Missing time?

I am trying to explain why the 1000-row covering index scan is slower than the PK scan. Output from the performance schema doesn't explain the difference. Perhaps the PS isn't instrumenting something. Looking at non-hierarchical output from perf I see a difference. This is from the PK scan:

     8.58%   mysqld  mysqld                [.] row_search_mvcc(...)                                     
     8.00%   mysqld  mysqld                [.] ut_delay(unsigned long)                
     4.87%   mysqld  mysqld                [.] rec_get_offsets_func(...)
     3.50%   mysqld  mysqld                [.] mtr_t::Command::release_all()
     3.12%   mysqld  libc-2.14.1.so        [.] __memcpy_ssse3_back
     3.02%   mysqld  mysqld                [.] TTASEventMutex<TrackPolicy>::spin_and_wait(...)
     2.94%   mysqld  mysqld                [.] buf_page_optimistic_get(...)
     2.50%   mysqld  mysqld                [.] ha_innobase::general_fetch(...)

And this is from the covering index scan. For the secondary index scan the top two CPU consumers are the mutex busy wait loop. So there is mutex contention that isn't reported by the performance schema.

    24.30%   mysqld  mysqld                [.] ut_delay(unsigned long)
    16.51%   mysqld  mysqld                [.] TTASEventMutex<TrackPolicy>::spin_and_wait(...)
     4.54%   mysqld  mysqld                [.] mtr_t::Command::release_all()
     4.51%   mysqld  mysqld                [.] row_search_mvcc(...)
     2.70%   mysqld  mysqld                [.] _ZL22pfs_rw_lock_s_lock_lowP9rw_lock_tmPKcm.isra.17
     2.25%   mysqld  libc-2.14.1.so        [.] __memcpy_ssse3_back
     2.15%   mysqld  mysqld                [.] buf_page_optimistic_get(...)
     2.03%   mysqld  mysqld                [.] rec_get_offsets_func(...)

From hierarchical perf output this is the result for the PK scan:

     8.55%   mysqld  mysqld                [.] row_search_mvcc(...)
             |
             --- row_search_mvcc(...)
                |          
                |--99.11%-- ha_innobase::general_fetch(unsigned char*, unsigned int, unsigned int)
                |          handler::ha_index_next(unsigned char*)
                |          handler::read_range_next()
                |          handler::multi_range_read_next(char**)
                |          QUICK_RANGE_SELECT::get_next()
                |          _ZL8rr_quickP11READ_RECORD
                |          sub_select(JOIN*, QEP_TAB*, bool)
                |          JOIN::exec()
                |          mysql_select(...)
                |          handle_select(THD*, select_result*, unsigned long)
                |          _ZL21execute_sqlcom_selectP3THDP10TABLE_LIST
                |          mysql_execute_command(THD*)
                |          mysql_parse(THD*, Parser_state*)
                |          dispatch_command(enum_server_command, THD*, char*, unsigned long)
                |          handle_connection
                |          pfs_spawn_thread
                |          start_thread
                |          __clone
                |          
                |--0.54%-- ha_innobase::index_read(...)
                |          handler::ha_index_read_map(...)
                |          handler::read_range_first(st_key_range const*, st_key_range const*, bool, bool)
                |          handler::multi_range_read_next(char**)
                |          QUICK_RANGE_SELECT::get_next()
                |          _ZL8rr_quickP11READ_RECORD
                |          sub_select(JOIN*, QEP_TAB*, bool)
                |          JOIN::exec()
                |          mysql_select(...)
                |          handle_select(THD*, select_result*, unsigned long)
                |          _ZL21execute_sqlcom_selectP3THDP10TABLE_LIST
                |          mysql_execute_command(THD*)
                |          mysql_parse(THD*, Parser_state*)
                |          dispatch_command(enum_server_command, THD*, char*, unsigned long)
                |          handle_connection
                |          pfs_spawn_thread
                |          start_thread
                |          __clone
                 --0.34%-- [...]

     7.96%   mysqld  mysqld                [.] ut_delay(unsigned long)
             |
             --- ut_delay(unsigned long)
                |          
                |--99.96%-- _ZN11PolicyMutexI14TTASEventMutexI11TrackPolicyEE5enterEmmPKcm.constprop.95
                |          |          
                |          |--99.85%-- buf_page_optimistic_get(...)
                |          |          btr_cur_optimistic_latch_leaves(...)
                |          |          btr_pcur_restore_position_func(...)
                |          |          _ZL30sel_restore_position_for_mysqlPmmP10btr_pcur_tmP5mtr_t.constprop.75
                |          |          row_search_mvcc(...)
                |          |          ha_innobase::general_fetch(unsigned char*, unsigned int, unsigned int)
                |          |          handler::ha_index_next(unsigned char*)
                |          |          handler::read_range_next()
                |          |          handler::multi_range_read_next(char**)
                |          |          QUICK_RANGE_SELECT::get_next()
                |          |          _ZL8rr_quickP11READ_RECORD
                |          |          sub_select(JOIN*, QEP_TAB*, bool)
                |          |          JOIN::exec()
                |          |          mysql_select(...)
                |          |          handle_select(THD*, select_result*, unsigned long)
                |          |          _ZL21execute_sqlcom_selectP3THDP10TABLE_LIST
                |          |          mysql_execute_command(THD*)
                |          |          mysql_parse(THD*, Parser_state*)
                |          |          dispatch_command(enum_server_command, THD*, char*, unsigned long)
                |          |          handle_connection
                |          |          pfs_spawn_thread
                |          |          start_thread
                |          |          __clone
                |           --0.15%-- [...]
                 --0.04%-- [...]

Which looks very different from the result for the covering index scan:

    24.49%   mysqld  mysqld                [.] ut_delay(unsigned long)
             |
             --- ut_delay(unsigned long)
                |          
                |--99.98%-- _ZN11PolicyMutexI14TTASEventMutexI11TrackPolicyEE5enterEmmPKcm.constprop.95
                |          |          
                |          |--98.79%-- buf_page_optimistic_get(...)
                |          |          btr_cur_optimistic_latch_leaves(...)
                |          |          btr_pcur_restore_position_func(...)
                |          |          _ZL30sel_restore_position_for_mysqlPmmP10btr_pcur_tmP5mtr_t.constprop.75
                |          |          row_search_mvcc(...)
                |          |          ha_innobase::general_fetch(unsigned char*, unsigned int, unsigned int)
                |          |          handler::ha_index_next(unsigned char*)
                |          |          handler::read_range_next()
                |          |          handler::multi_range_read_next(char**)
                |          |          QUICK_RANGE_SELECT::get_next()
                |          |          _ZL8rr_quickP11READ_RECORD
                |          |          sub_select(JOIN*, QEP_TAB*, bool)
                |          |          JOIN::exec()
                |          |          mysql_select(...)
                |          |          handle_select(THD*, select_result*, unsigned long)
                |          |          _ZL21execute_sqlcom_selectP3THDP10TABLE_LIST
                |          |          mysql_execute_command(THD*)
                |          |          mysql_parse(THD*, Parser_state*)
                |          |          dispatch_command(enum_server_command, THD*, char*, unsigned long)
                |          |          handle_connection
                |          |          pfs_spawn_thread
                |          |          start_thread
                |          |          __clone

    16.22%   mysqld  mysqld                [.] TTASEventMutex<TrackPolicy>::spin_and_wait(...)
             |
             --- TTASEventMutex<TrackPolicy>::spin_and_wait(...)
                |          
                |--99.71%-- _ZN11PolicyMutexI14TTASEventMutexI11TrackPolicyEE5enterEmmPKcm.constprop.95
                |          |          
                |          |--98.90%-- buf_page_optimistic_get(...)
                |          |          btr_cur_optimistic_latch_leaves(...)
                |          |          btr_pcur_restore_position_func(...)
                |          |          _ZL30sel_restore_position_for_mysqlPmmP10btr_pcur_tmP5mtr_t.constprop.75
                |          |          row_search_mvcc(...)
                |          |          ha_innobase::general_fetch(unsigned char*, unsigned int, unsigned int)
                |          |          handler::ha_index_next(unsigned char*)
                |          |          handler::read_range_next()
                |          |          handler::multi_range_read_next(char**)
                |          |          QUICK_RANGE_SELECT::get_next()
                |          |          _ZL8rr_quickP11READ_RECORD
                |          |          sub_select(JOIN*, QEP_TAB*, bool)
                |          |          JOIN::exec()
                |          |          mysql_select(...)
                |          |          handle_select(THD*, select_result*, unsigned long)
                |          |          _ZL21execute_sqlcom_selectP3THDP10TABLE_LIST
                |          |          mysql_execute_command(THD*)
                |          |          mysql_parse(THD*, Parser_state*)
                |          |          dispatch_command(enum_server_command, THD*, char*, unsigned long)
                |          |          handle_connection
                |          |          pfs_spawn_thread
                |          |          start_thread
                |          |          __clone