- PK index isn't clustered - the PK index is clustered for RocksDB and InnoDB in MySQL. All of the columns for a row can be fetched from a PK index leaf page. The PK index on _id is not clustered in MongoDB and is from _id to DiskLoc and then WiredTiger & RocksDB use a hidden clustered index on DiskLoc to access the document. While the PK index isn't clustered for mmapv1, it doesn't have the hidden clustered index on DiskLoc as it uses DiskLoc as the address (filename & offset) for the document. I expect this to be fixed in a future release so that RocksDB and WiredTiger can use a clustered index on _id. The current behavior is a legacy from the storage engine interface used by mmapv1.
- PK can't be composite - the primary key for a MongoDB collection is on the _id field. If your data has a natural primary key using one field, then rename that field _id. If the natural primary key is composite then you might use _id as a surrogate key and declare a unique secondary index on the composite key. The cost for this is an extra index.
The costs above mean that linkbench uses more indexes in MongoDB than in MySQL. This makes updates and queries slower as there are more indexes to maintain and more indexes to search. For MySQL there are 3 tables and 4 indexes in linkbench: a PK and secondary index on the link table, a PK on the count table and a PK on the node table. For MongoDB there are 3 collections and 9 indexes. Each collection has a PK index on the _id field and the hidden index that maps DiskLoc to a document for RocksDB & WiredTiger. As there are 3 collections this explains the first 6 indexes. Then there are unique secondary indexes for the count and link collections where a composite key would have been used in a SQL DBMS. Finally, there is the secondary index on the link collection.
I am not sure that this workload is a great fit for MongoDB but it is a good benchmark for looking at OLTP performance and is less synthetic than the insert benchmark. The MongoDB version, linkbenchx, would benefit greatly from support for per-shard transactions. That is only available via TokuMX today. These tests were run without transactions and there might be races for transactions that need to atomically change the link and count collections.
From speaking to my local MongoDB internals guru (hi Igor), the secondary index in MongoDB contains the DiskLoc to reference the document. When the secondary index isn't covering then the hidden index on DiskLoc will be searched to find the missing document. This overhead is similar to what occurs with a clustered PK index in MySQL - the PK in the secondary index is used to search the PK index to get the missing columns.
A search by _id in MongoDB is more expensive than a search on the PK in MySQL for RocksDB and WiredTiger. With the clustered PK index in MySQL all of the columns are in the leaf page of the index. For MongoDB there are two indexes to search -- first the index on _id to find a DiskLoc and then the hidden index on DiskLoc to find the document.
A search on the natural PK for the link and count tables is also more expensive in MongoDB. First the search of the unique secondary index is done to find a DiskLoc and then the hidden index on DiskLoc is searched to get the document. For MySQL there is only one search using the clustered PK index.
ConfigurationI used linkbench for MySQL and linkbenchx for MongoDB. I changed bin/linkbench in linkbenchx to use the same driver as MySQL: CLASS='com.facebook.LinkBench.LinkBenchDriver'. I set maxid1=20000001 in config/FBWorkload.properties. Finally I set loaders=10, requesters=20, maxtime=3600, warmup_time=60 and requests=1000000000 in config/LinkConfigMongoDBV2.properties and config/LinkConfigMysql.properties.
With this setup the database is cached in RAM during the entire test on servers with 144G of RAM. The test pattern is to load the database and then run either 12 or 24 1-hour query runs. I report the average QPS sustained during the 2nd, 12th and optionally the 24th 1-hour runs. New data is added to the database over time so the database is larger during the 12th 1-hour run than during the 2nd. The growth rate is a function of the QPS.
I have yet to share quality-of-service metrics like p95 or p99 response time. Eventually I will do that.
I used WebScaleSQL for MySQL and the v3.0-fb MongoDB partners branch for MongoDB. The following configurations were tested:
- mysql.rocks.nolog - MySQL with RocksDB and the binary log disabled
- mysql.rocks.log - MySQL with RocksDB and the binary log enabled
- mysql.inno.nolog - MySQL with InnoDB and the binary log disabled
- mysql.inno.log - MySQL with InnoDB and the binary log enabled
- mongo.rocks.nolog - MongoDB with RocksDB and the oplog disabled
- mongo.rocks.log - MongoDB with RocksDB and the oplog enabled
- mongo.wiredtiger.nolog - MongoDB with WiredTiger and the oplog disabled
- mongo.wiredtiger.log - MongoDB with WiredTiger and the oplog enabled
- mongo.mmap.nolog - MongoDB with mmapv1 and the oplog disabled
- mongo.mmap.log - MongoDB with mmapv1 and the oplog enabled
I used this mongo.conf file:
And I used a my.cnf file that is too long to paste here. This is the schema for RocksDB and InnoDB. All InnoDB tables use 32 hash partitions to reduce mutex contention on the per-index mutex. All tables are compressed except for the node table with InnoDB because the average compression rate for rows in the node table is less than 2X. Four column families are used for the RocksDB tables, one for each index. All collections are in one column family for MongoDB+RocksDB The my.cnf settings for RocksDB are here.
Compression varied by configuration.
- MongoDB+mmapv1 - no compression
- MongoDB+WiredTiger - snappy
- MySQL+InnoDB - zlib
- MongoDB+RocksDB - no compression for levels 0, 1, snappy for level 2, zlib for others
- MySQL+RocksDB - no compression for levels 0, 1 and zlib for others
This lists the average insert rate sustained during the load. With maxid1=20,000,001 there are 20M rows/documents in the node table/collection, ~90M rows/documents in the link table/collection and ~15M rows/documents in the count table/collection. A few conclusions:
- the insert rate is much better for MySQL than MongoDB even when the same engine, RocksDB, is used for both. MongoDB does more work during the load as it maintains 9 indexes versus 4 for MySQL.
- enabling the binlog for MySQL+RocksDB hurts performance much more than for MySQL+InnoDB
- enabling the oplog for MongoDB doesn't hurt performance as much as it does on the insert benchmark
- for MongoDB the load rate was faster for WiredTiger than RocksDB because the database is cached. This changes when the database is larger than RAM.
Query performanceThe average QPS was similar for MySQL+RocksDB and MySQL+InnoDB. They were both much faster than MongoDB. For MongoDB the best QPS is from WiredTiger. The MongoDB results match what I have gotten from other benchmarks and I have some faith these are accurate.
All resultsThis lists all results per configuration. The 24 1-hour query tests were only done for a few configurations. The results include:
- load time - the number of seconds for the load test
- load rate - the average insert rate during the load test
- load size - the database size in GB when the load ended
- 2h qps - the average QPS during the 2nd 1-hour query test
- 2h size - the database size in GB after the 2nd 1-hour query test
- 12h qps - the average QPS during the 12th 1-hour query test
- 12h size - the database size in GB after the 12th 1-hour query test
- 24h qps - the average QPS during the 24th 1-hour query test
- 24h size - the database size in GB after the 24th 1-hour query test
A few conclusions:
- The database size grows more for MySQL+InnoDB than for MySQL+RocksDB. Fragmentation from the B-Tree is larger than space-amplification of ~1.1X from leveled compaction. It is difficult to compare growth for configurations that don't sustain the same QPS because new data is added as a function of the QPS.
- The database size grows more for Mongo+WiredTiger than for Mongo+RocksDB. While WiredTiger sustains a higher QPS I think part of the problem is that B-Tree fragmentation is worse than space amplification from leveled compaction.
- I am interesting in comparing database growth between MySQL+InnoDB and Mongo+WiredTiger. Both implement a B-Tree but InnoDB is update-in-place and can suffer from internal fragmentation while WiredTiger is copy-on-write and can suffer from external fragmentation. I don't know whether there are counters to explain this in WiredTiger.
load load load 2h 2h 12h 12h 24h 24h
time rate size qps size qps size qps size server
478 184496 14g 50153 17g 48671 25g 48258 34g mysql.rocks.nolog
602 146462 14g 47981 16g 47250 23g 46537 34g mysql.rocks.log
839 105008 15g 51574 22g 50229 37g 50058 56g mysql.inno.nolog
851 103489 15g 51277 22g 49780 37g 49805 56g mysql.inno.log
4156 21200 13g 15149 17g 15271 23g mongo.rocks.nolog
5015 17565 14g 14442 17g 13925 24g 13506 29g mongo.rocks.log
3020 29174 14g 27940 32g 23043 45g 22141 53g mongo.wt.nolog
3601 28020 16g 25488 34g 22992 45g mongo.wt.log
25563 3446 63g 9774 70g 8703 82g mongo.mmap.nolog
27000 3263 66g 9414 70g 8471 84g mongo.mmap.log