Wednesday, October 19, 2016

Make MyRocks 2X less slow

Fixing mutex contention has been good for my career. I had the workload, an RDBMS that needed a few improvements and support from a great team. Usually someone else found the bugs and I got to fix many of them. Sometimes I got too much credit because a good bug report is as valuable as the bug fix. These days I don't see many mutex contention bugs but I have begun to see more bugs from memory contention. My perf debugging skills need refreshing. They are far from modern. Thankfully we have Brendan Gregg.

For someone who debugs performance, shared_ptr is a gift. Mistakenly passing shared_ptr by value means the reference count will be changed too much and that is not good on a concurrent workload. I have encountered that at least twice in RocksDB and MyRocks. I even encountered it in MongoDB with SERVER-13382.

I have twice made MyRocks 2X less slow. First with issue 231 peak compaction throughput was doubled and now with issue 343 we almost double range-scan throughput (for long range scans with many concurrent queries). Someone important recently reported a disappointing performance result when comparing MyRocks with InnoDB. After a few days with sysbench I was able to reproduce it. This should be easy to fix.

Not mutex contention

In this bug, with sysbench read-only and read-write the peak QPS for MyRocks saturated long before InnoDB. While MyRocks and InnoDB had similar QPS at low concurrency, the QPS at high concurrency was almost 2X better for InnoDB. This was only an issue for longer range scans (try --oltp-range-size=10000) and the default was a shorter range scan (--oltp-range-size=100). My first guess was mutex contention. There was an odd pattern in vmstat where the context switch rate alternated every second for MyRocks but was steady for InnoDB. Spikes in context switch rate sometimes mean mutex contention but I did not see that with PMP. What next?

The next guess is memory system contention but my debugging skills for that problem are weak. I have told myself many times this year that I need to refresh my skills. So I started with this blog post from Brendan Gregg and tried perf stat and found that InnoDB completed almost twice the number of instructions compared to MyRocks in the same time period. Why is IPC almost 2X better for InnoDB? Results from perf are here.

I then tried a different perf stat command to get other hardware perf counters and results are here. This also shows that InnoDB completed twice the number of instructions while both have a similar value for bus-cycles, so MyRocks uses 2X the number of bus-cycles per instruction. That can explain why it is slower. What are bus-cycles? Most of the documentation only explained this is as [Hardware event] and without more details I can't look that up in Intel manuals. I asked internally and learned the magic code, 0x013c, that leads to more information. Start with this article (and consider subscribing to LWN, I do).

The next step was to get call graphs when bus-cycles was incremented. I used the command below to find the offending code. Disabling that code fixes the problem, but work remains to make that code performant. InnoDB and MyRocks have similar code to count rows read and InnoDB doesn't fall over because of it. I want to make MyRocks not fall over because of it.
perf record -ag -p $( pidof mysqld ) -e bus-cycles -- sleep 10

Useful commands

I used all of these commands today:
perf stat -a sleep 10
perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles -a sleep 10
perf stat -e 'syscalls:sys_enter_*' -a sleep 10
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores -a sleep 10
perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses -a sleep 10
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches -a sleep 10
perf top -e raw_syscalls:sys_enter -ns comm
perf list --help
perf record -ag -p $( pidof mysqld ) -e bus-cycles -- sleep 10

Saturday, October 15, 2016

scons verbose command line

Hopefully I can find this blog post the next time I get stuck. How do you see command lines when building your favorite open source project? Try one of variants below. I am sure this list will grow over time. The scons variant is my least favorite. I use too many tools for source configuration and compiling. I am barely competent with most of them, but it is easy to find answers for popular tools. I get to use scons with MongoDB. It is less fun searching for answers to problems with less popular tools.

  make V=1
  make VERBOSE=1
  scons --debug=presub

Pagerank seems to be busted for scons. Top results are for too-old versions of scons. Top-ranked results usually tell you how to solve the problem with Python, but users aren't writing scons input files, we are doing things via the command line. At least with MongoDB's use of scons, the separator for construction variables is a space, not a colon. So do LIBS="lz4 zstd" but not LIBS="lz4:zstd".

This is my second scons inspired post. Just noticed my previous one.

Wednesday, October 12, 2016

MongoRocks and WiredTiger versus linkbench on a small server

I spent a lot of time evaluating open-source database engines over the past few years and WiredTiger has been one of my favorites. The engine and the team are excellent. I describe it as a copy-on-write-random (CoW-R) b-tree as defined in a previous post. WiredTiger also has a log-structured merge tree. It isn't officially supported in MongoDB. Fortunately we have MongoRocks if you want an LSM.

This one is long. Future reports will be shorter and reference this. My tl;dr for Linkbench with low concurrency on a small server:
  • I think there is something wrong in how WiredTiger uses zlib, at least in MongoDB 3.2.4
  • mmapV1 did better than I expected.
  • We can improve MongoRocks write efficiency and write throughput. The difference for write efficiency between MongoRocks and other MongoDB engines isn't as large as it is between MyRocks and other MySQL engines.
Update - I have a few followup tasks to do after speaking with WiredTiger and MongoRocks gurus. First, I will repeat tests using MongoDB 3.2.10. Second, I will use zlib and zlib-noraw compression for WiredTiger. Finally, I will run tests with and without the oplog to confirm whether the oplog hurts MongoRocks performance more than WiredTiger.

All about the algorithm

Until recently I have not been sharing my performance evaluations that compare MongoRocks with WiredTiger. In some cases MongoRocks performance is much better than WiredTiger and I want to explain those cases. There are two common reasons. First, WiredTiger is a new engine and there is more work to be done to improve performance. I see progress and I know more is coming. This takes time.

The second reason for differences is the database algorithm. An LSM and a B-Tree make different tradeoffs for read, write and space efficiency. See the Rum Conjecture for more details. In most cases an LSM should have better space and write efficiency while a B-Tree should have better read efficiency. But better write and space efficiency can enable better read efficiency. First, when less IO capacity is consumed for writing back database changes then more IO capacity is available for the storage reads done for user queries. Second, when less space is wasted for caching database blocks then the cache hit ratio is higher. I expect the second reason is more of an issue for InnoDB than for WiredTiger because WT does prefix encoding for indexes and should have less or no fragmentation for database pages in cache.

Page write-back is a hard feature to get right for a B-Tree. There will be dirty pages at the end of the buffer pool LRU and these pages must be written back as they approach the LRU tail. Things that need to read a page into the buffer pool will take a page from the LRU tail. If the page is still dirty at the point the thing requesting the page will stall until the page has been written back. It took a long time to make this performant for InnoDB and work still remains. It will take more time to get this right for WiredTiger. Checkpoint and eviction are the steps by which dirty pages are written back for WT. While I am far from an expert on this I have filed several performance bugs and feature requests (and many of them have been fixed). One open problem is that checkpoint is still single threaded. This one thread must find dirty pages, compress them and then do buffered writes. When zlib is used then that is too much work for one thread. Even with a faster compression algorithm I think more threads are needed, and the cost of faster decompression is more space used for the database. Server-16736 is open as a feature request for this.

Test setup

I have three small servers at home. They used Ubuntu 14.04 at the time, but have since been upgraded to 16.04. Each is a core i3 with 2 CPUs, 4 HW threads and 8G of RAM. The storage is a 120G Samsung 850 EVO m.2 SSD for the database and a 7200 RPM disk for the OS. I like the NUC servers but my next cluster will use a better CPU (core i5) with more RAM.

The benchmark is Linkbench using LinkbenchX from Percona that has support for MongoDB. For WiredTiger and MongoRocks engines this doesn't use transactions to protect the multi-operation transactions. I look forward to multi-document transactions in a future MongoDB release. I use main from my Linkbench fork rather than from LinkbenchX to avoid the use of the feature to sustain a constant request rate because that has added too much CPU overhead in some tests.

I ran two tests. First, I used an in-memory database workload with maxid1=2M. Second, I used an IO-bound database with maxid1=40M. By IO-bound I mean that the database is larger than 8G but smaller than 120G and the SSD is very busy during the test. Both tests were run with 2 connections for loading and 1 connection (client) for the query tests. The query tests were run for 24 1-hour loops and the result from the 24th hour is shared. I provide results for performance, quality of service (QoS) and efficiency. Note that for the mmapv1 IO-bound test I had to use maxid1=20M rather than 40M to avoid a full storage device.

The oplog is enabled, sync-on-commit is disabled and WiredTiger/MongoRocks get 2G of RAM for cache. Tests were run with zlib and snappy compression. I reduced file system readahead from 128 to 16 for the mmapV1 engine tests. For MongoRocks I disabled compression for the smaller levels of the LSM. For the cached database, much more of the database is not compressed because of this. I limited the oplog to 2000MB. The full mongo.conf is at the end of this post.

I used MongoDB 3.2.4 to compare the MongoRocks, WiredTiger and mmapv1 engines. I will share more results soon for MongoDB 3.3.5 and I think results are similar. When MongoDB 3.4 is published I will repeat my tests and hope to include zstandard.

If you measure storage write rates and use iostat then be careful because iostat includes bytes trimmed as bytes written. If the filesystem is mounted with discard enabled and the database engine frequently deletes files (RocksDB does) then iostat might overstate bytes written. The results I share here have been corrected for that.

Cached database load

These are the results for maxid1=2M. The database is cached for all engines except mmapV1.

  • ips - average inserts/second
  • wKB/i - average KB written to storage per insert measured by iostat
  • Mcpu/i - CPU usecs/insert, measured by vmstat
  • Size - database size in GB at the end of the load
  • rss - mongod process size (RSS) in GB from ps at the end of the load
  • engine - rx.snap/rx.zlib is MongoRocks with snappy or zlib. wt.snap/wt.zlib is WiredTiger with snappy or zlib
  • MongoRocks has the worst insert rate. Some of this is because more efficient writes can mean less efficient reads and the LSM does more key comparisons than a B-Tree when navigating the memtable. But I think that most of the reason is management of the oplog where there are optimizations we have yet to do for MongoRocks.
  • MongoRocks writes the most to storage per insert. See the previous bullet point.
  • MongoRocks and WiredTiger use a similar amount of space. Note that during the query test that follows the load the size of WT will be much larger than MongoRocks. As expected, the database is much larger with mmapV1.

ips     wKB/i   Mcpu/i  size    rss     engine
5359    4.81     6807    2.5    0.21    rx.snap
4876    4.82    10432    2.2    0.45    rx.zlib
8198    1.84     3361    2.7    1.82    wt.snap
7949    1.79     4149    2.1    1.98    wt.zlib
7936    1.64     3353   13.0    6.87    mmapV1

Cached database query

These are the results for maxid1=2M for the 24th 1-hour loop. The database is cached for all engines except mmapV1.

  • tps - average transactions/second
  • wKB/t - average KB written to storage per transaction measured by iostat
  • Mcpu/t - CPU usecs/transaction, measured by vmstat
  • Size - database size in GB at test end
  • rss - mongod process size (RSS) in GB from ps at test end
  • un, gn, ul, gll - p99 response time in milliseconds for the most popular transactions: un is updateNode, gn is getNode, ul is updateList, gll is getLinkedList. See the Linkbench paper for details.
  • engine - rx.snap/rx.zlib is MongoRocks with snappy or zlib. wt.snap/wt.zlib is WiredTiger with snappy or zlib
  • WiredTiger throughput is much worse with zlib than with snappy. I think the problem is that dirty page write back doesn't keep up because of the extra overhead from zlib compression. See above for my feature request for multi-threaded checkpoint. There is also a huge difference in the CPU overhead for WiredTiger with zlib compared to WT with snappy. That pattern does not repeat for MongoRocks. I wish I had looked at that more closely.
  • While WiredTiger and MongoRocks used a similar amount of space after the load, WT uses much more space after the query steps. I am not sure whether this is from live or dead versions of B-Tree pages.
  • Response time is better for MongoRocks than for WiredTiger. It is pretty good for mmapV1.
  • mmapV1 has the best throughput. I have been surprised by mmapV1 on several tests.
  • MongoRocks writes the least amount to storage per transaction.

tps  r/t  rKB/t  wKB/t  Mcpu/t  size   rss   un   gn   ul  gll  engine
1741   0      0   2.72   16203   3.7  2.35  0.3  0.1   1   0.9  rx.snap
1592   0      0   2.66   19306   3.0  2.36  0.4  0.2   1   1    rx.zlib
1763   0      0   5.70   23687   4.9  2.52  0.4  0.1   2   1    wt.snap
 933   0      0   8.94   81250   6.5  2.97  1    0.6   7   5    wt.zlib
1967 0.2   9.70   4.61   12048  20.0  4.87  0.9  0.7   1   1    mmapV1

IO-bound database load

These are the results for maxid1=40M for the 24th 1-hour loop. The database does not fit in cache. I used maxid1=20M for mmapV1 to avoid a full SSD. So tests for it ran with half the data.

The summary is the same as it was for the cached database and I think we can make MongoRocks a lot faster.

ips     wkb/i   Mcpu/i  size    rss     engine
4896    7.11     8177   27      2.45    rx.snap
4436    6.67    11979   22      2.29    rx.zlib
7979    1.93     3526   29      2.20    wt.snap
7719    1.89     4330   24      2.30    wt.zlib
7612    1.85     3476   66      6.93    mmapV1, 20m

IO-bound database query

These are the results for maxid1=40M for the 24th 1-hour loop. The database does not fit in cache. I used maxid1=20M for mmapV1 to avoid a full SSD. So tests for it ran with half the data.

  • Like the cached test, WiredTiger with zlib is much worse than with snappy. Most metrics are much worse for it. This isn't just zlib, I wonder if there is a bug in the way WT uses zlib.
  • Throughput continues to be better than I expected for mmapv1, but it has started to do more disk reads per transaction. It uses about 2X the space for the other engines for half the data.
  • MongoRocks provides the best efficiency with performance comparable to other engines. This is the desired result.

tps   r/t   rKB/t  wKB/t  Mcpu/t  size   rss   un    gn    ul  gll  engine
1272  1.22  12.82   4.02   17475  29     2.56   0.8   0.5   2   1   rx.snap
1075  1.03  10.44   4.56   25223  23     2.53   0.9   0.6   2   2   rx.zlib
1037  1.24  17.23  11.60   45335  34     2.69   1     1     3   2   wt.snap
 446  1.21  23.61  18.00  151628  33     3.38  13    11    21  18   wt.zlib
1261  2.43  34.77   5.28   13357  72     2.05   0.9   0.5   3   2   mmapV1, 20m


This is the full mongo.conf for zlib. It needs to be edited to enable snappy.

  fork: true
  destination: file
  path: /path/to/log
  logAppend: true
  syncPeriodSecs: 60
  dbPath: /path/to/data
    enabled: true
      commitIntervalMs: 100
operationProfiling.slowOpThresholdMs: 2000
replication.oplogSizeMB: 2000

storage.wiredTiger.collectionConfig.blockCompressor: zlib
storage.wiredTiger.engineConfig.journalCompressor: none
storage.wiredTiger.engineConfig.cacheSizeGB: 2

storage.rocksdb.cacheSizeGB: 2
storage.rocksdb.configString: "compression_per_level=kNoCompression:kNoCompression:kNoCompression:kZlibCompression:kZlibCompression:kZlibCompression:kZlibCompression;compression_opts=-14:1:0;"

Tuesday, October 11, 2016

Making the case for MyRocks. It is all about efficiency.

I had two talks at Percona Live - one on MyRocks and another on web-scale. The talk links include the slides, but slides lose a lot of context. But first, the big news is that MyRocks will appear in MariaDB Server and Percona Server. I think MyRocks is great for the community and getting it into supported distributions makes it usable.

Efficiency is the reason for MyRocks. The RUM Conjecture explains the case in detail. The summary is that MyRocks has the best space efficiency, better write efficiency and good read efficiency compared to other storage engines for MySQL. The same is true of MongoRocks and MongoDB storage engines. Better efficiency is a big deal. Better compression means you can use less SSD. Better write efficiency means you get better SSD endurance or that you can switch from MLC to TLC NAND flash. Better write efficiency also means that more IO capacity will be available to handle reads from user queries.

But performance in practice has nuance that theory can miss. While I expect read performance to suffer with MyRocks compared to InnoDB, I usually don't see that when evaluating production and benchmark workloads. I spent most of this year doing performance evaluations for MyRocks and MongoRocks. I haven't shared much beyond summaries. I expect to share a lot in the future.

I prefer to not write about performance in isolation. I want to write about performance, quality of service and efficiency. By performance I usually mean peak or average throughput under realistic conditions. By quality of service I mean the nth (95th, 99th) percentile response time for queries and transactions. By efficiency I mean the amount of hardware (CPU time, disk reads, disk KB written, disk KB written, disk space) consumed. I have frequently written about performance in isolation in the past. I promise to do that less frequently in the future.

My other goal is to explain the performance that I measure. This is hard to do. I define benchmarketing as the use of unexplained performance results to claim that one product is better than another. While I am likely to do some benchmarketing for MyRocks I will also provide proper benchmarks where results are explained and details on quality of service and efficiency are included.

Let me end this with benchmarking and benchmarketing. For benchmarking I have a result from Linkbench on a small server: Intel 5th generation core i3 NUC, 4 HW threads, 8G RAM, Samsung 850 EVO SSD.  The result here is typical of results from many tests I have done. MySQL does better than MongoDB, MyRocks does better than InnoDB and MongoRocks does better than WiredTiger. MyRocks and MongoRocks have better QoS based on the p99 update time in milliseconds. The hardware efficiency metrics explain why MyRocks and MongoRocks have more throughput (TPS is transactions/second). M*Rocks does fewer disk reads per transaction (iostat r/t), writes less to disk per transaction (iostat wKB/t) and uses less space on disk (size GB). It uses more CPU time per transaction than uncompressed InnoDB. That is the price of better compression. Why it has better hardware efficiency is a topic for another post and conference talk.
For benchmarketing I have a result from read-only sysbench for an in-memory database. MyRocks matches InnoDB at low and mid concurrency and does better at high-concurrency. This is a workload (read-only & in-memory) that favors InnoDB.

Friday, October 7, 2016

MyRocks, MongoRocks, RocksDB and Mr. Mime

The big news is that MyRocks will arrive in proper distributions with expert support (Percona Server and MariaDB Server). This is a big deal for me as it helps make MyRocks better faster and it gives you a chance to evaluate MyRocks. Earlier this year Percona announced support for MongoRocks.

After two weeks in Europe (Dublin, London, Amsterdam) I have yet to catch or even encounter Mr. Mime, the Europe-only Pokemon Go character. So I will return in November to spend more time searching for Mr. Mime and speak at CodeMesh in London and HighLoad++ in Moscow.
  • - I look forward to attending as many talks as possible at CodeMesh. I speak on November 4 on the relationship between performance and efficiency. I think the RUM Conjecture makes it easier to understand the choices between database engines which matters more given that we aren't limited to update-in-place b-trees today. Contact me directly if you want a discount code for a CodeMesh ticket.
  • HighLoad++ - I visit Moscow to speak at HighLoad++, explain the case for MyRocks and MongoRocks and learn more about Tarantool, one of my favorite projects which is also in the process of adding a write-optimized database engine (Vinyl).

Tuesday, September 20, 2016

MyRocks and InnoDB with large objects and compression

I ran tests to explain the difference between MyRocks and InnoDB when storing large objects and data with varying amounts of compressibility.

Compression in MyRocks is simpler than in InnoDB. You should expect the database to use about 1.1X times the size of the compressed output. When rows compress to 60% of their original size and are 10kb before compression, then each row should use about 6.6kb in the database. The 1.1X adjustment is for space-amplification from leveled compaction.

Predicting the space used for InnoDB is harder. First, large LOB column are not stored inline and overflow pages are not shared. Second, disk pages have a fixed size and you risk using too much space or getting too many page splits when searching for a good value for key_block size. More details are here.

I ran two tests for two types of data. The first test is an insert only workload in PK-order for data with varying amounts of compressibility. The second test determined how fast point queries could be done on that data while rate-limited inserts were in progress. By varying amounts of compressibility I mean that there was one large varchar column per row and that 20%, 45%, 75% or 95% of the data in the column was random and the remainder was constant and easily compressed. Both tests used one connection for inserts. The query test also used one connection for queries.

The test pattern was run twice. In both cases the large column was a varchar. In the first case it had a length between 10,000 and 20,000 characters. In the second case it had a length between 100 and 1000 characters. The database block size was 16kb for MyRocks and InnoDB.

Insert only

For the insert-only workload the space used for MyRocks can be predicted from the compressibility of the data. That is much less true for InnoDB. For example compressed InnoDB uses about the same amount of space for pctRand in 20, 45 and 75.

MyRocks used the least amount of space. InnoDB used much more space when the column was larger (10,000 to 20,000 vs 100 to 1000). Overflow pages are the root cause.

The insert rates are better for MyRocks than for InnoDB. They were also stable for MyRocks and uncompressed InnoDB independent of the compressibility. Rates for uncompressed InnoDB are better than compressed InnoDB. While this wasn't a performance benchmark, it matches many other results I get. It is hard to get performance and compression from InnoDB.  The CPU overhead per insert was similar between MyRocks and uncompressed InnoDB. CPU overheads were mostly larger for compressed InnoDB.

Legend for the data:
  • ips - inserts per second
  • size - database size in GB at test end
  • Mcpu - microseconds of CPU per insert
  • pctRand - percentage of random data in large column
  • engine - rx.zlib-6 is MyRocks with zlib level 6 compression. i6n is InnoDB in MySQL 5.6.26 without compression. i6c is InnoDB in MySQL 5.6.26 with compression.

column up to 20,000       column up to 1000
ips     size    Mcpu      ips     size    Mcpu    pctRand engine
5489      7.7    1090     34468   11      151     20      rx.zlib-6
5540     16      1127     34824   19      149     45
5532     24      1307     34517   27      166     75
5523     30      1467     34701   33      160     95

ips     size    Mcpu      ips     size    Mcpu    pctRand engine
3995     87       933     23470   66      173     20      i6n
3981     87       928     23704   66      174     45
3981     86       917     23487   66`     175     75
3995     88       914     23658   66      176     95

ips     size    Mcpu      ips     size    Mcpu    pctRand engine
3339     36      1064     13429   33      262     20      i6c
2779     32      1278     13124   33      271     45
2133     35      1750      8767   30      392     75
1757     50      2061      7228   38      461     95

Point queries

MyRocks provides the best compression, the best query throughput, and the east CPU overhead per query. My conclusions for InnoDB space consumption are similar to the results from the insert-only workload.

Legend for the data:
  • qps - queries per second
  • size - database size in GB at test end
  • Mcpu - microseconds of CPU per query
  • pctRand - percentage of random data in large column
  • engine - rx.zlib-6 is MyRocks with zlib level 6 compression. i6n is InnoDB in MySQL 5.6.26 without compression. i6c is InnoDB in MySQL 5.6.26 with compression.

qps     size    Mcpu      qps     size    Mcpu    pctRand engine
 984      9.3    4308     2214    11      1585    20      rx.zlib-6
 910     19      4532     2113    19      1627    45
 846     30      4952     2102    27      1601    75
 795     37      5598     2051    33      1691    95

qps     size    Mcpu      qps     size    Mcpu    pctRand engine
 628    113      6240     1302    62      2527    20      i6n
 624    110      6226     1300    63      2501    45
 624    114      6312     1302    63      2536    75
 628    115      6218     1305    66      2474    95

qps     size    Mcpu      qps     size    Mcpu    pctRand engine
 708     38      5560      770    34      4450    20      i6c
 629     39      6643      687    34      4895    45
 513     44      8494      589    30      6046    75
 418     57     10619      576    39      6599    95

Thursday, September 15, 2016

Peak benchmarketing season for MySQL

Maybe this is my XKCD week. With Oracle Open World and Percona Live Amsterdam we are approaching peak benchmarketing season for MySQL. I still remember when MySQL 4.0 was limited to about 10k QPS on 4 and 8 core servers back around 2005, so the 1M QPS results we see today are a reminder of the great progress that has been made thanks to investments by upstream and the community.

In General

But getting 1.5M QPS today compared to 1M QPS last year isn't at the top of the list for many (potential) users of MySQL. I use performance, usability, mangeability, availability and efficiency to explain what matters for web-scale DBMS users. My joke is that each of these makes a different group happy: performance -> marketing, usability -> developers, manageability -> operations, availability -> end users, efficiency -> management.

The benchmarketing results mostly focus on performance. Whether InnoDB does a bit more QPS than Amazon Aurora isn't going to make Aurora less popular. Aurora might have excellent performance but I assume people are deploying it for other reasons. I hope we make it easier to market usability, manageability, availability and efficiency in the MySQL community. MongoDB has gone a long way by marketing and then delivering usability and manageability.

Even when limited to performance we need to share more than peak QPS. Efficiency and quality-of-service (QoS) are equally important. QPS without regard to response time is frequently a bogus metric. I get more IOPs from a disk by using a too large queue depth. But more IOPs at the cost of 100 millisecond disk read response times is an expensive compromise. Even when great QPS is accompanied by a good average response time I want to know if there is lousy QoS from frequent stalls leading to lousy 99th percentile response times. Percona has built their business in part by being excellent at documenting and reducing stalls in InnoDB that occur on benchmarks and real workloads.

I have been guilty of sharing too many benchmark reports in the past that ignored efficiency and QoS. I have been trying to change that this year and hope that other providers of MySQL performance results do the same. This is an example of a result that includes performance, efficiency and QoS.

MyRocks and RocksDB

A lot of the RocksDB marketing message has been about performance. Database access is faster with an embedded database than client/server because you avoid network latency. The MyRocks message has been about efficiency. The target has been better compression and less write amplification than InnoDB so you can use less SSD and lower-endurance SSD. For a workload I care about we see 2X better compression and 1/10 the write rate to storage. This is a big deal.

When starting the project we had many discussions about the amount of performance loss (reduced QPS, higher response time) we could tolerate to get more efficiency. While we were vague the initial goal was to get similar QPS and response time to InnoDB for real workloads, but we were willing to accept some regressions. It turned out that there was no regression and similar performance with much better efficiency is a big deal.

But benchmarks aren't real workloads and there will soon be more benchmark results. Some of these will repeat what I have claimed, others will not. I don't expect to respond to every result that doesn't match my expectations. I will consult when possible.

One last disclaimer. If you care about read-mostly/in-memory workloads then InnoDB is probably an excellent choice. MyRocks can still be faster than InnoDB for in-memory workloads. That is more likely when the bottleneck for InnoDB is page write-back performance. So write-heavy/in-memory can still be a winner for MyRocks.

Seriously, this is the last disclaimer. While we are bickering about benchmark results others are focusing on usability and manageability and getting all of the new deployments.