Thursday, August 27, 2015

Single-threaded linkbench performance for MySQL 5.7, 5.6, WebScale and MyRocks

The good news is that features in the pending 5.7 release look great. More good news is that InnoDB transparent page compression might be faster than the original compression feature, assuming your storage system supports it. The bad news is that there are significant performance regressions for low-concurrency workloads. I previously reported this for 5.6 and 5.7.5 and have yet to see progress. While the focus on high-concurrency workloads has been important, we can't lose this much performance at low concurrency.

I used linkbench with one client thread to load and then query a small & cached database. This was repeated for many configurations to determine the impact from compression and partitioning. The binlog was enabled but it and the InnoDB redo log were not synced on commit. The performance summary is:
  • InnoDB transparent compression is faster for loads than non-transparent compression
  • Insert rates for 5.7.8 are much worse than for 5.6. The insert rates for 5.6 are between 1.07X and 1.77X those for 5.7.8.
  • The Facebook 5.6 build is always faster than 5.6.26 and 5.7.8, except when transparent compression is used.
  • MyRocks (MySQL+RocksDB) has the fastest load rate
  • The default configuration for the performance schema reduces the insert rate by about 5%

Transparent compression

While transparent compression might be a lot faster than non-transparent compression, I share the same doubts as Domas about whether we can use this feature in production. The feature requires either a filesystem that is happy doing a holepunch on every write, or a filesystem that is really a key-value store where the key is the pageID, the value is a page, and the page is compressed on write with minimal roundup to meet alignment requirements. By roundup I mean that when a 16kb page is compressed to 11kb, it is written as 11kb when rounded up to the next 512b sector or 12kb when rounded up to the next 4kb sector. With 2X non-transparent compression, that page would instead be split and the ~5.5kb compressed output for each post-split page would be rounded up to 8kb.
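To make the roundup and holepunch-on-write mechanics concrete, here is a minimal sketch in C. It assumes a 16kb page that compressed to 11kb and a hypothetical tablespace file name; the rounding helper and the fallocate call illustrate the mechanism the feature relies on, not InnoDB's actual implementation.

/* Sketch of holepunch-on-write: write the compressed bytes into the page's
 * 16kb slot, then punch a hole in the unused remainder so only the
 * sector-rounded compressed size stays allocated. Linux-specific. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/falloc.h>

#define PAGE_SIZE   16384
#define SECTOR_SIZE   512               /* use 4096 on 4kb-sector devices */

/* Round n up to the next multiple of the sector size. */
static off_t round_up(off_t n) {
    return (n + SECTOR_SIZE - 1) / SECTOR_SIZE * SECTOR_SIZE;
}

int main(void)
{
    off_t page_no = 42;                     /* hypothetical page number */
    off_t compressed_len = 11 * 1024;       /* 16kb page compressed to 11kb */
    off_t slot = page_no * PAGE_SIZE;       /* file offset of the page slot */
    off_t keep = round_up(compressed_len);  /* 11kb with 512b sectors, 12kb with 4kb */

    int fd = open("tablespace.ibd", O_RDWR);    /* hypothetical file name */
    if (fd < 0) { perror("open"); return 1; }

    /* pwrite(fd, compressed_bytes, compressed_len, slot) would go here. */

    /* Deallocate the tail of the slot; the logical file size is unchanged. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  slot + keep, PAGE_SIZE - keep) != 0)
        perror("fallocate");

    close(fd);
    return 0;
}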

AFAIK there isn't a GA filesystem solution that avoids holepunch on write. There is a proprietary solution (filesystem & storage device) from one vendor that is not yet GA, and the open-source option might be bcachefs, assuming it gets sufficient support to reach GA.

InnoDB page compression adds latency to query processing. By non-transparent compression I mean the original page compression for InnoDB, where a compression factor (2X, 4X, 8X) is declared per table. Both transparent and non-transparent page compression add the latency of decompression after page reads. There is other latency that only occurs for non-transparent page compression. By latency I mean work done by threads handling SQL statements, and these are things I see when looking at PMP (poor man's profiler) stack traces.
  1. Page images are optionally written to the redo log to avoid problems if the zlib library changes before crash recovery runs. I always disable page image logging (innodb_log_compressed_pages=OFF).
  2. When the per-page compression log is full, all changes must be applied to the page and then the page must be recompressed.
  3. When a page doesn't compress enough (see #2), it is split and the post-split pages must then be compressed. The per-index mutex is held while the split is done, so other threads can't access that index until the operation completes. I frequently use partitioning to reduce the impact from contention on the per-index mutex. A sketch of the recompress-or-split decision follows this list.
  4. There is a bit more work to manage both the uncompressed and compressed LRUs.
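Here is a highly simplified sketch of the recompress-or-split decision from items 2 and 3, using zlib. The function names, the fixed 8kb target (2X compression) and the split policy are assumptions made for the example; this is not InnoDB's actual code.

/* Sketch of the recompress-or-split decision for 2X non-transparent
 * compression: recompress the page and, if it no longer fits in the 8kb
 * block, split it and compress both halves. Build with: gcc sketch.c -lz */
#include <stdio.h>
#include <zlib.h>

#define PAGE_SIZE  16384    /* uncompressed page size */
#define BLOCK_SIZE  8192    /* 2X compression target */

/* Return 1 if the page compresses into BLOCK_SIZE, 0 otherwise. */
static int fits_after_compression(const unsigned char *page)
{
    unsigned char out[PAGE_SIZE + 1024];
    uLongf out_len = sizeof(out);
    if (compress2(out, &out_len, page, PAGE_SIZE, 6) != Z_OK)
        return 0;
    return out_len <= BLOCK_SIZE;
}

int main(void)
{
    unsigned char page[PAGE_SIZE];
    /* A repeating pattern compresses well; random data would not and
     * would force the split path below. */
    for (int i = 0; i < PAGE_SIZE; i++)
        page[i] = (unsigned char)(i % 7);

    if (fits_after_compression(page)) {
        printf("page fits: write the compressed image\n");
    } else {
        /* In InnoDB the split is done while the per-index mutex is held,
         * which is why it hurts other threads using the same index. */
        printf("page does not fit: split it, then compress both halves\n");
    }
    return 0;
}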

Performance regressions

Using partitioning has improved InnoDB performance for IO-bound linkbench tests in the past as it reduces contention for the per-index mutex. This test wasn't IO-bound as the database was always cached by InnoDB, but I still wanted to see the impact from the different configurations. Transparent compression in this case is done via holepunch-on-write. I ignore the compression rates for now and will revisit that in a future test. I tested these configurations:
  • p0.c0 - no partitioning, no compression
  • p0.c1 - no partitioning, compression (transparent & non-transparent)
  • p1.c0 - partitioning, no compression
  • p1.c1 - partitioning, compression (transparent & non-transparent)
The linkbench test was run with maxid=2000001, loaders=1 & requesters=1. Otherwise the default settings were used. The InnoDB buffer pool was large enough to cache the database, and the binlog was enabled but the binlog and InnoDB redo log were not synced on commit. I set many other my.cnf values to match what we use in production but don't list them here. The p0.c0 test was repeated for upstream 5.6.26 and 5.7.8 with and without the performance schema (PS) enabled (performance_schema=0 or 1 in my.cnf). I didn't set any other PS options. The load rate is reduced by about 5% when the PS is enabled with default options.

I tested several binaries:
  • myrocks - the Facebook patch for MySQL, 5.6.X and the RocksDB storage engine
  • fb56 - the Facebook patch for MySQL, 5.6.X and InnoDB
  • orig56.nops, orig56.ps - upstream 5.6.26 without and with the PS enabled. The my.cnf file was based on what we use in production.
  • orig57.nops, orig57.ps - upstream 5.7.8 (non-transparent compression for the compressed configurations), without and with the PS enabled. The my.cnf file was based on what we use in production.
  • orig57.tc - upstream 5.7.8 with transparent compression and PS disabled
The test was done in two parts. First I measured the rate at which one thread can load the database. Then I measured the average QPS sustained for one hour. The data below includes the insert rate from the load (load ips), the database size after the load (load gb), the QPS during the 1-hour run (1h qps) and the size after the 1-hour run (1h gb).

The relative ips column is the insert rate for that configuration divided by the rate for upstream 5.7.8; for example, the myrocks rate in p0.c0 is 53577 / 34233 ≈ 1.57. These ratios are large, and we have lost a lot of performance at low concurrency in 5.7.8. Most of the loss is not from the performance schema.

p0.c0
load    load    1h      1h     relative
ips     gb      qps     gb     ips       config
 53577  X       3244    X      1.57      myrocks
 47165  2.3     4016    3.0    1.38      fb56
 46785  2.6     3698    3.4    1.37      orig56.nops
 48857  2.6     3735    3.4    1.43      orig56.ps
 34233  2.6     3604    3.4    1.0       orig57.nops
 36077  2.6     3693    3.4    1.05      orig57.ps

p0.c1
load    load    1h      1h     relative
ips     gb      qps     gb     ips       config
 17688  1.5     3874    1.9    1.33      fb56
 14164  1.7     3481    2.1    1.07      orig56.nops
 13252  1.7     3372    2.1    1.0       orig57.nops
 34770  2.1     3510    2.3    2.62      orig57.tc

p1.c0
load    load    1h      1h     relative
ips     gb      qps     gb     ips       config
 45980  3.2     3921    3.8    1.77      fb56
 34660  3.5     3560    4.1    1.34      orig56.nops
 25908  3.5     3391    4.1    1.0       orig57.nops

p1.c1
load    load    1h      1h     relative
ips     gb      qps     gb     ips       config
 16152  2.1     3707    2.4    1.50      fb56
 12345  2.3     3296    2.7    1.15      orig56.nops
 10779  2.3     3214    2.7    1.0       orig57.nops
 23148  3.0     3361    3.0    2.15      orig57.tc

Graphs

A graph for the p0.c0 test (no partitioning, no compression) showing relative insert rates.

