Small Datum: Linkbench for MySQL & MongoDB with a cached database

I used linkbench for MySQL and MongoDB to understand OLTP performance on an interesting workload. Here I post results for a cached database and in a few days I will have more results for a database larger than RAM. The summary is that MySQL is a lot faster than MongoDB even when using the same storage engine (RocksDB). I don't know all of the reasons for this and it is possible I made a mistake because this is the first time I have run linkbenchX for MongoDB. There are overheads in MongoDB that doesn't exist in MySQL, but I have not quantified the contribution from each. The ones that are interesting for this test include:

PK index isn't clustered - the PK index is clustered for RocksDB and InnoDB in MySQL. All of the columns for a row can be fetched from a PK index leaf page. The PK index on _id is not clustered in MongoDB and is from _id to DiskLoc and then WiredTiger & RocksDB use a hidden clustered index on DiskLoc to access the document. While the PK index isn't clustered for mmapv1, it doesn't have the hidden clustered index on DiskLoc as it uses DiskLoc as the address (filename & offset) for the document. I expect this to be fixed in a future release so that RocksDB and WiredTiger can use a clustered index on _id. The current behavior is a legacy from the storage engine interface used by mmapv1.
PK can't be composite - the primary key for a MongoDB collection is on the _id field. If your data has a natural primary key using one field, then rename that field _id. If the natural primary key is composite then you might use _id as a surrogate key and declare a unique secondary index on the composite key. The cost for this is an extra index.

The costs above mean that linkbench uses more indexes in MongoDB than in MySQL. This makes updates and queries slower as there are more indexes to maintain and more indexes to search. For MySQL there are 3 tables and 4 indexes in linkbench: a PK and secondary index on the link table, a PK on the count table and a PK on the node table. For MongoDB there are 3 collections and 9 indexes. Each collection has a PK index on the _id field and the hidden index that maps DiskLoc to a document for RocksDB & WiredTiger. As there are 3 collections this explains the first 6 indexes. Then there are unique secondary indexes for the count and link collections where a composite key would have been used in a SQL DBMS. Finally, there is the secondary index on the link collection.

I am not sure that this workload is a great fit for MongoDB but it is a good benchmark for looking at OLTP performance and is less synthetic than the insert benchmark. The MongoDB version, linkbenchx, would benefit greatly from support for per-shard transactions. That is only available via TokuMX today. These tests were run without transactions and there might be races for transactions that need to atomically change the link and count collections.

From speaking to my local MongoDB internals guru (hi Igor), the secondary index in MongoDB contains the DiskLoc to reference the document. When the secondary index isn't covering then the hidden index on DiskLoc will be searched to find the missing document. This overhead is similar to what occurs with a clustered PK index in MySQL - the PK in the secondary index is used to search the PK index to get the missing columns.

A search by _id in MongoDB is more expensive than a search on the PK in MySQL for RocksDB and WiredTiger. With the clustered PK index in MySQL all of the columns are in the leaf page of the index. For MongoDB there are two indexes to search -- first the index on _id to find a DiskLoc and then the hidden index on DiskLoc to find the document.

A search on the natural PK for the link and count tables is also more expensive in MongoDB. First the search of the unique secondary index is done to find a DiskLoc and then the hidden index on DiskLoc is searched to get the document. For MySQL there is only one search using the clustered PK index.

Configuration

I used linkbench for MySQL and linkbenchx for MongoDB. I changed bin/linkbench in linkbenchx to use the same driver as MySQL: CLASS='com.facebook.LinkBench.LinkBenchDriver'. I set maxid1=20000001 in config/FBWorkload.properties. Finally I set loaders=10, requesters=20, maxtime=3600, warmup_time=60 and requests=1000000000 in config/LinkConfigMongoDBV2.properties and config/LinkConfigMysql.properties.

With this setup the database is cached in RAM during the entire test on servers with 144G of RAM. The test pattern is to load the database and then run either 12 or 24 1-hour query runs. I report the average QPS sustained during the 2nd, 12th and optionally the 24th 1-hour runs. New data is added to the database over time so the database is larger during the 12th 1-hour run than during the 2nd. The growth rate is a function of the QPS.

I have yet to share quality-of-service metrics like p95 or p99 response time. Eventually I will do that.

I used WebScaleSQL for MySQL and the v3.0-fb MongoDB partners branch for MongoDB. The following configurations were tested:

mysql.rocks.nolog - MySQL with RocksDB and the binary log disabled
mysql.rocks.log - MySQL with RocksDB and the binary log enabled
mysql.inno.nolog - MySQL with InnoDB and the binary log disabled
mysql.inno.log - MySQL with InnoDB and the binary log enabled
mongo.rocks.nolog - MongoDB with RocksDB and the oplog disabled
mongo.rocks.log - MongoDB with RocksDB and the oplog enabled
mongo.wiredtiger.nolog - MongoDB with WiredTiger and the oplog disabled
mongo.wiredtiger.log - MongoDB with WiredTiger and the oplog enabled
mongo.mmap.nolog - MongoDB with mmapv1 and the oplog disabled
mongo.mmap.log - MongoDB with mmapv1 and the oplog enabled

I used this mongo.conf file:

processManagement:

fork: true

systemLog:

destination: file

path: /data/mysql/mongo.30/log

logAppend: true

storage:

syncPeriodSecs: 60

dbPath: /data/mysql/mongo.30/data

journal:

enabled: true

mmapv1:

journal:

commitIntervalMs: 100

operationProfiling.slowOpThresholdMs: 2000

replication.oplogSizeMB: 2000

storage.wiredTiger.collectionConfig.blockCompressor: snappy

storage.wiredTiger.engineConfig.journalCompressor: none

storage.rocksdb.compression: snappy

storage.rocksdb.configString: "write_buffer_size=16m;max_write_buffer_number=4;max_background_compactions=6;max_background_flushes=3;target_file_size_base=16m;soft_rate_limit=2.9;hard_rate_limit=3;max_bytes_for_level_base=128m;stats_dump_period_sec=60;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=12;level0_stop_writes_trigger=20;max_grandparent_overlap_factor=8;max_bytes_for_level_multiplier=8"

And I used a my.cnf file that is too long to paste here. This is the schema for RocksDB and InnoDB. All InnoDB tables use 32 hash partitions to reduce mutex contention on the per-index mutex. All tables are compressed except for the node table with InnoDB because the average compression rate for rows in the node table is less than 2X. Four column families are used for the RocksDB tables, one for each index. All collections are in one column family for MongoDB+RocksDB The my.cnf settings for RocksDB are here.

Compression varied by configuration.

MongoDB+mmapv1 - no compression
MongoDB+WiredTiger - snappy
MySQL+InnoDB - zlib
MongoDB+RocksDB - no compression for levels 0, 1, snappy for level 2, zlib for others
MySQL+RocksDB - no compression for levels 0, 1 and zlib for others

Load performance

This lists the average insert rate sustained during the load. With maxid1=20,000,001 there are 20M rows/documents in the node table/collection, ~90M rows/documents in the link table/collection and ~15M rows/documents in the count table/collection. A few conclusions:

the insert rate is much better for MySQL than MongoDB even when the same engine, RocksDB, is used for both. MongoDB does more work during the load as it maintains 9 indexes versus 4 for MySQL.
enabling the binlog for MySQL+RocksDB hurts performance much more than for MySQL+InnoDB
enabling the oplog for MongoDB doesn't hurt performance as much as it does on the insert benchmark
for MongoDB the load rate was faster for WiredTiger than RocksDB because the database is cached. This changes when the database is larger than RAM.

Query performance

The average QPS was similar for MySQL+RocksDB and MySQL+InnoDB. They were both much faster than MongoDB. For MongoDB the best QPS is from WiredTiger. The MongoDB results match what I have gotten from other benchmarks and I have some faith these are accurate.

All results

This lists all results per configuration. The 24 1-hour query tests were only done for a few configurations. The results include:

load time - the number of seconds for the load test
load rate - the average insert rate during the load test
load size - the database size in GB when the load ended
2h qps - the average QPS during the 2nd 1-hour query test
2h size - the database size in GB after the 2nd 1-hour query test
12h qps - the average QPS during the 12th 1-hour query test
12h size - the database size in GB after the 12th 1-hour query test
24h qps - the average QPS during the 24th 1-hour query test
24h size - the database size in GB after the 24th 1-hour query test

A few conclusions:

The database size grows more for MySQL+InnoDB than for MySQL+RocksDB. Fragmentation from the B-Tree is larger than space-amplification of ~1.1X from leveled compaction. It is difficult to compare growth for configurations that don't sustain the same QPS because new data is added as a function of the QPS.
The database size grows more for Mongo+WiredTiger than for Mongo+RocksDB. While WiredTiger sustains a higher QPS I think part of the problem is that B-Tree fragmentation is worse than space amplification from leveled compaction.
I am interesting in comparing database growth between MySQL+InnoDB and Mongo+WiredTiger. Both implement a B-Tree but InnoDB is update-in-place and can suffer from internal fragmentation while WiredTiger is copy-on-write and can suffer from external fragmentation. I don't know whether there are counters to explain this in WiredTiger.

load load load 2h 2h 12h 12h 24h 24h

time rate size qps size qps size qps size server

478 184496 14g 50153 17g 48671 25g 48258 34g mysql.rocks.nolog

602 146462 14g 47981 16g 47250 23g 46537 34g mysql.rocks.log

839 105008 15g 51574 22g 50229 37g 50058 56g mysql.inno.nolog

851 103489 15g 51277 22g 49780 37g 49805 56g mysql.inno.log

4156 21200 13g 15149 17g 15271 23g mongo.rocks.nolog

5015 17565 14g 14442 17g 13925 24g 13506 29g mongo.rocks.log

3020 29174 14g 27940 32g 23043 45g 22141 53g mongo.wt.nolog

3601 28020 16g 25488 34g 22992 45g mongo.wt.log

25563 3446 63g 9774 70g 8703 82g mongo.mmap.nolog

27000 3263 66g 9414 70g 8471 84g mongo.mmap.log

Small Datum

Monday, July 6, 2015

Linkbench for MySQL & MongoDB with a cached database

Configuration

Load performance

Query performance

All results

No comments:

Post a Comment

HammerDB tproc-c on a large server, Postgres 14 to 19 beta1