Monday, July 6, 2015

Linkbench for MySQL & MongoDB with a cached database

I used linkbench for MySQL and MongoDB to understand OLTP performance on an interesting workload. Here I post results for a cached database and in a few days I will have more results for a database larger than RAM. The summary is that MySQL is a lot faster than MongoDB even when using the same storage engine (RocksDB). I don't know all of the reasons for this and it is possible I made a mistake because this is the first time I have run linkbenchx for MongoDB. There are overheads in MongoDB that don't exist in MySQL, but I have not quantified the contribution from each. The ones that are interesting for this test include:
  1. PK index isn't clustered - the PK index is clustered for RocksDB and InnoDB in MySQL, so all of the columns for a row can be fetched from a PK index leaf page. The PK index on _id is not clustered in MongoDB: it maps _id to DiskLoc, and then WiredTiger & RocksDB use a hidden clustered index on DiskLoc to access the document. While the PK index isn't clustered for mmapv1 either, mmapv1 doesn't need the hidden clustered index because DiskLoc is the address (filename & offset) of the document. I expect this to be fixed in a future release so that RocksDB and WiredTiger can use a clustered index on _id. The current behavior is a legacy from the storage engine interface used by mmapv1.
  2. PK can't be composite - the primary key for a MongoDB collection is on the _id field. If your data has a natural primary key using one field, then rename that field to _id. If the natural primary key is composite then you might use _id as a surrogate key and declare a unique secondary index on the composite key. The cost for this is an extra index.
The costs above mean that linkbench uses more indexes in MongoDB than in MySQL. This makes updates and queries slower as there are more indexes to maintain and more indexes to search. For MySQL there are 3 tables and 4 indexes in linkbench: a PK and secondary index on the link table, a PK on the count table and a PK on the node table. For MongoDB there are 3 collections and 9 indexes. Each collection has a PK index on the _id field and the hidden index that maps DiskLoc to a document for RocksDB & WiredTiger. As there are 3 collections this explains the first 6 indexes. Then there are unique secondary indexes for the count and link collections where a composite key would have been used in a SQL DBMS. Finally, there is the secondary index on the link collection.
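
To make the index count concrete, here is a minimal sketch in Python with pymongo of the indexes a MongoDB setup would declare. The collection and field names (linktable, counttable, nodetable, id1, id2, link_type, visibility, time) are assumptions based on the linkbench schema rather than copied from linkbenchx, and the _id indexes and hidden DiskLoc indexes are created by the server and storage engine, not by this code:

from pymongo import ASCENDING, MongoClient

db = MongoClient("localhost", 27017).linkdb

# link collection: the _id index and the hidden DiskLoc index come for free,
# then a unique index on the natural composite key and a secondary index
# used for range scans
db.linktable.create_index(
    [("id1", ASCENDING), ("id2", ASCENDING), ("link_type", ASCENDING)],
    unique=True)
db.linktable.create_index(
    [("id1", ASCENDING), ("link_type", ASCENDING),
     ("visibility", ASCENDING), ("time", ASCENDING)])

# count collection: _id index, hidden DiskLoc index and a unique index
# on the natural composite key
db.counttable.create_index(
    [("id", ASCENDING), ("link_type", ASCENDING)], unique=True)

# node collection: only the _id index and the hidden DiskLoc index
# total: 6 automatic/hidden indexes + 3 declared here = 9, versus 4 for MySQL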

I am not sure that this workload is a great fit for MongoDB but it is a good benchmark for looking at OLTP performance and is less synthetic than the insert benchmark. The MongoDB version, linkbenchx, would benefit greatly from support for per-shard transactions. That is only available via TokuMX today. These tests were run without transactions, so there might be races between operations that need to atomically change the link and count collections.

From speaking to my local MongoDB internals guru (hi Igor), I learned that a secondary index in MongoDB contains the DiskLoc to reference the document. When the secondary index isn't covering, the hidden index on DiskLoc is searched to find the missing document. This overhead is similar to what occurs with a clustered PK index in MySQL - the PK stored in the secondary index is used to search the PK index to get the missing columns.

A search by _id in MongoDB is more expensive than a search on the PK in MySQL for RocksDB and WiredTiger. With the clustered PK index in MySQL all of the columns are in the leaf page of the index. For MongoDB there are two indexes to search -- first the index on _id to find a DiskLoc and then the hidden index on DiskLoc to find the document.

A search on the natural PK for the link and count tables is also more expensive in MongoDB. First the search of the unique secondary index is done to find a DiskLoc and then the hidden index on DiskLoc is searched to get the document. For MySQL there is only one search using the clustered PK index.
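
One way to see the extra lookup, assuming the collections sketched above and the explain output format from MongoDB 3.0: a query on the natural key of the count collection should show a FETCH stage above the IXSCAN, where the FETCH is the lookup of the document via DiskLoc. This is a sketch, not the exact query used by linkbenchx:

from pymongo import MongoClient

db = MongoClient("localhost", 27017).linkdb

# Query by the natural composite key of the count collection (field names are
# assumptions, as above). The winning plan is expected to be an IXSCAN on the
# unique secondary index with a FETCH above it for the document lookup.
plan = db.counttable.find({"id": 1234, "link_type": 1}).explain()
print(plan["queryPlanner"]["winningPlan"])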

Configuration

I used linkbench for MySQL and linkbenchx for MongoDB. I changed bin/linkbench in linkbenchx to use the same driver as MySQL: CLASS='com.facebook.LinkBench.LinkBenchDriver'. I set maxid1=20000001 in config/FBWorkload.properties. Finally I set loaders=10, requesters=20, maxtime=3600, warmup_time=60 and requests=1000000000 in config/LinkConfigMongoDBV2.properties and config/LinkConfigMysql.properties.
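
For reference, this is a sketch of the properties that were changed, showing only the modified settings with the key names from the stock config files:

# config/FBWorkload.properties
maxid1 = 20000001

# config/LinkConfigMongoDBV2.properties and config/LinkConfigMysql.properties
loaders = 10
requesters = 20
maxtime = 3600
warmup_time = 60
requests = 1000000000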

With this setup the database is cached in RAM during the entire test on servers with 144G of RAM. The test pattern is to load the database and then run either 12 or 24 1-hour query runs. I report the average QPS sustained during the 2nd, 12th and optionally the 24th 1-hour runs. New data is added to the database over time so the database is larger during the 12th 1-hour run than during the 2nd. The growth rate is a function of the QPS.

I have yet to share quality-of-service metrics like p95 or p99 response time. Eventually I will do that.

I used WebScaleSQL for MySQL and the v3.0-fb MongoDB partners branch for MongoDB. The following configurations were tested:

  • mysql.rocks.nolog - MySQL with RocksDB and the binary log disabled
  • mysql.rocks.log - MySQL with RocksDB and the binary log enabled
  • mysql.inno.nolog - MySQL with InnoDB and the binary log disabled
  • mysql.inno.log - MySQL with InnoDB and the binary log enabled
  • mongo.rocks.nolog - MongoDB with RocksDB and the oplog disabled
  • mongo.rocks.log - MongoDB with RocksDB and the oplog enabled
  • mongo.wiredtiger.nolog - MongoDB with WiredTiger and the oplog disabled
  • mongo.wiredtiger.log - MongoDB with WiredTiger and the oplog enabled
  • mongo.mmap.nolog - MongoDB with mmapv1 and the oplog disabled
  • mongo.mmap.log - MongoDB with mmapv1 and the oplog enabled

I used this mongo.conf file:
processManagement:
  fork: true
systemLog:
  destination: file
  path: /data/mysql/mongo.30/log
  logAppend: true
storage:
  syncPeriodSecs: 60
  dbPath: /data/mysql/mongo.30/data
  journal:
    enabled: true
  mmapv1:
    journal:
      commitIntervalMs: 100
operationProfiling.slowOpThresholdMs: 2000
replication.oplogSizeMB: 2000

storage.wiredTiger.collectionConfig.blockCompressor: snappy
storage.wiredTiger.engineConfig.journalCompressor: none

storage.rocksdb.compression: snappy

storage.rocksdb.configString: "write_buffer_size=16m;max_write_buffer_number=4;max_background_compactions=6;max_background_flushes=3;target_file_size_base=16m;soft_rate_limit=2.9;hard_rate_limit=3;max_bytes_for_level_base=128m;stats_dump_period_sec=60;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=12;level0_stop_writes_trigger=20;max_grandparent_overlap_factor=8;max_bytes_for_level_multiplier=8"

And I used a my.cnf file that is too long to paste here. This is the schema for RocksDB and InnoDB. All InnoDB tables use 32 hash partitions to reduce mutex contention on the per-index mutex. All tables are compressed except for the node table with InnoDB because the average compression rate for rows in the node table is less than 2X. Four column families are used for the RocksDB tables, one for each index. All collections are in one column family for MongoDB+RocksDB. The my.cnf settings for RocksDB are here.

Compression varied by configuration.
  • MongoDB+mmapv1 - no compression
  • MongoDB+WiredTiger - snappy
  • MySQL+InnoDB - zlib
  • MongoDB+RocksDB - no compression for levels 0, 1, snappy for level 2, zlib for others
  • MySQL+RocksDB - no compression for levels 0, 1 and zlib for others

Load performance

The All results section below lists the average insert rate sustained during the load. With maxid1=20,000,001 there are 20M rows/documents in the node table/collection, ~90M rows/documents in the link table/collection and ~15M rows/documents in the count table/collection. A few conclusions:
  • the insert rate is much better for MySQL than MongoDB even when the same engine, RocksDB, is used for both. MongoDB does more work during the load as it maintains 9 indexes versus 4 for MySQL.
  • enabling the binlog for MySQL+RocksDB hurts performance much more than for MySQL+InnoDB
  • enabling the oplog for MongoDB doesn't hurt performance as much as it does on the insert benchmark
  • for MongoDB the load rate was faster for WiredTiger than RocksDB because the database is cached. This changes when the database is larger than RAM.
 

Query performance

The average QPS was similar for MySQL+RocksDB and MySQL+InnoDB. They were both much faster than MongoDB. For MongoDB the best QPS is from WiredTiger. The MongoDB results match what I have gotten from other benchmarks and I have some faith these are accurate.

All results

This lists all results per configuration. The 24 1-hour query tests were only done for a few configurations. The results include:
  • load time - the number of seconds for the load test
  • load rate - the average insert rate during the load test
  • load size - the database size in GB when the load ended
  • 2h qps - the average QPS during the 2nd 1-hour query test
  • 2h size - the database size in GB after the 2nd 1-hour query test
  • 12h qps - the average QPS during the 12th 1-hour query test
  • 12h size - the database size in GB after the 12th 1-hour query test
  • 24h qps - the average QPS during the 24th 1-hour query test
  • 24h size - the database size in GB after the 24th 1-hour query test
A few conclusions:
  • The database size grows more for MySQL+InnoDB than for MySQL+RocksDB. Fragmentation in the B-Tree costs more space than the ~1.1X space amplification from leveled compaction (see the sketch after this list). It is difficult to compare growth for configurations that don't sustain the same QPS because new data is added as a function of the QPS.
  • The database size grows more for Mongo+WiredTiger than for Mongo+RocksDB. WiredTiger sustains a higher QPS and therefore adds more data, but I think part of the difference is that B-Tree fragmentation is worse than the space amplification from leveled compaction.
  • I am interested in comparing database growth between MySQL+InnoDB and Mongo+WiredTiger. Both implement a B-Tree but InnoDB is update-in-place and can suffer from internal fragmentation while WiredTiger is copy-on-write and can suffer from external fragmentation. I don't know whether there are counters to explain this in WiredTiger.
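
A back-of-the-envelope sketch for the ~1.1X number, assuming the database is dominated by the largest level and using the level size multiplier of 8 from the RocksDB configString above (the multiplier used by the MySQL+RocksDB config might differ):

# Space amplification from leveled compaction is roughly the sum of the level
# sizes divided by the size of the largest level: 1 + 1/8 + 1/64 + ...
fanout = 8
space_amp = sum(1.0 / fanout**i for i in range(10))
print(round(space_amp, 2))  # ~1.14, in line with the ~1.1X quoted above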


load    load    load    2h      2h      12h     12h     24h     24h
time    rate    size    qps     size    qps     size    qps     size    server
478     184496  14g     50153   17g     48671   25g     48258   34g     mysql.rocks.nolog
602     146462  14g     47981   16g     47250   23g     46537   34g     mysql.rocks.log
839     105008  15g     51574   22g     50229   37g     50058   56g     mysql.inno.nolog
851     103489  15g     51277   22g     49780   37g     49805   56g     mysql.inno.log
4156    21200   13g     15149   17g     15271   23g                     mongo.rocks.nolog
5015    17565   14g     14442   17g     13925   24g     13506   29g     mongo.rocks.log
3020    29174   14g     27940   32g     23043   45g     22141   53g     mongo.wt.nolog
3601    28020   16g     25488   34g     22992   45g                     mongo.wt.log
25563   3446    63g     9774    70g     8703    82g                     mongo.mmap.nolog
27000   3263    66g     9414    70g     8471    84g                     mongo.mmap.log
