Monday, August 3, 2015

Write-heavy workload vs the MongoDB optimizer

We experienced wasted disk IO from the MySQL query optimizer a few years back and fixed it via FORCE INDEX hints. I think I am experiencing wasted disk IO from the MongoDB optimizer during recent tests with the insert benchmark.

The problem in MySQL occurs for queries that can use more than one index. The problem query was a single-table SELECT statement. The optimizer must figure out the best index and calls the records_in_range method to determine the selectivity of the query predicates for each relevant index. For InnoDB tables the records_in_range method reads the b-tree leaf nodes that contain the first and last index entries satisfying the predicates and then estimates the amount of data between them. Assuming all non-leaf nodes are cached, this does 2 disk reads per index in the worst case. This disk IO isn't wasted when it is done for the index the query ends up using. It is wasted when done for the other indexes. By using FORCE INDEX we limited the optimizer to considering only the correct index and stopped wasting disk reads for a frequent query.
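MongoDB has a rough equivalent of FORCE INDEX: cursor.hint() restricts a query to one index so the planner does not evaluate the others. A minimal mongo shell sketch, using a hypothetical collection c with indexes on {a: 1} and {b: 1} (not part of the benchmark):

// hypothetical collection with two candidate indexes
db.c.createIndex({a: 1})
db.c.createIndex({b: 1})

// without a hint the planner can evaluate both indexes for this predicate
db.c.find({a: {$gte: 1, $lte: 100}, b: 5})

// hint() pins the query to one index, similar to FORCE INDEX in MySQL
db.c.find({a: {$gte: 1, $lte: 100}, b: 5}).hint({a: 1})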

The MongoDB query planner is different from the DBMS optimizers I have known (Informix, Oracle, MySQL).
  1. To find a new plan it runs plans from the candidate indexes in parallel and then chooses the plan that finishes first. I assume, but have not confirmed, that "in parallel" means multiple cursors are open while all processing is done in one thread. Dynamic query optimization was used in the SQL optimizer for RDB and I suspect it has been used elsewhere.
  2. Plans are cached per collection and invalidated after the collection gets 1000 write operations. You can see how this is used by searching the source code for internalQueryCacheWriteOpsBetweenFlush. The method PlanCache::notifyOfWriteOp is called per document, not per operation, so a single operation that touches many documents counts many times. Plans are likely to be invalidated much too soon because a busy application can quickly perform 1000 write operations on a collection. The cached plans can be inspected from the shell; see the sketch after this list.
  3. This will change for the better in the next release (3.2) thanks to SERVER-17895 and SERVER-15225, but my tests are based on 3.0 and I didn't notice the pending change until this weekend.
  4. The value 1000 can be changed at runtime by this command:
db.runCommand({setParameter:1, internalQueryCacheWriteOpsBetweenFlush:100000})
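
For reference, the cached plans described in item 2 can be inspected and cleared from the mongo shell. A minimal sketch, assuming a hypothetical collection named c:

// list the query shapes that currently have cached plans for the collection
db.c.getPlanCache().listQueryShapes()

// show the cached plans for one query shape
db.c.getPlanCache().getPlansByQuery({a: {$gte: 1, $lte: 100}, b: 5})

// drop all cached plans for the collection
db.c.getPlanCache().clear()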

I am running the insert benchmark for MongoDB using RocksDB, mmapv1, WiredTiger and TokuMX on a server with 15 SATA disks and SW RAID 0. The test is run for four configurations: 10 insert threads and 1 query thread, 10 insert threads and 0 query threads, 1 insert thread and 10 query threads, 0 insert threads and 10 query threads.

I will focus on two configurations: 0 insert threads and 10 query threads, and 1 insert thread and 10 query threads. The database is 127GB with MongoRocks and the server has 64GB of RAM. The QPS from the 10 query threads was ~93 with 1 insert thread and ~7685 with 0 insert threads. That is a huge difference.

I then ran a shorter test with 1 insert thread and got 317 QPS from the 10 query threads. Finally I changed internalQueryCacheWriteOpsBetweenFlush from 1,000 to 1,000,000 and repeated the shorter test with 1 insert thread and got 3248 QPS from the 10 query threads. Alas, I have confirmed there is a big impact from too-frequent plan invalidation.
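
For reference, a sketch of the shell commands for that change; setParameter is run against the admin database and getParameter is an easy way to confirm the new value:

// check the current value (run against the admin database)
db.adminCommand({getParameter: 1, internalQueryCacheWriteOpsBetweenFlush: 1})

// raise the flush threshold from 1,000 to 1,000,000 write operations
db.adminCommand({setParameter: 1, internalQueryCacheWriteOpsBetweenFlush: 1000000})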

3 comments:

  1. Hi Mark,
    We knew the threshold was too low, but were nervous to change it until we got SERVER-15225 in, which will dynamically re-plan a query that is not performing. We backported SERVER-15225 to the 3.0 branch (3.0.4) but made it flag protected.

    You can now combine removing the cache flush, as you showed above, with enabling re-planning via this command:
    db.runCommand({setParameter: 1, internalQueryCacheReplanningEnabled: true})

    Love to hear how this works for you.

    -Dan

  2. Can somebody help? The command executed under the admin db: db.runCommand({setParameter:1, internalQueryCacheWriteOpsBetweenFlush:100000}) is giving the error "unrecognized parameter". Running version 3.1.8.

    thanks \ marian.badinka@gmail.com

