Thursday, February 27, 2014

Exceptions versus returning an error code

This is a note to myself as it explains some of the index maintenance code in MongoDB prior to 2.6. I looked at a few PMP samples from a MongoDB server that was running slower than desired for a concurrent workload. PMP is a convenient way to see where threads are running and waiting. A common reason for something being slow is that most threads are blocked on a mutex or IO. In this case the culprit was the per-database write lock. The running thread has this call stack in both samples that I checked:


mongo::fetchIndexInserters mongo::indexRecordUsingTwoSteps mongo::DataFileMgr::insert(char mongo::DataFileMgr::insertWithObjMod mongo::checkAndInsert mongo::insertMulti mongo::receivedInsert mongo::assembleResponse mongo::MyMessageHandler::process mongo::PortMessageServer::handleIncomingMsg Why is so much time being spent on C++ exception handling? I think the exception is thrown by alreadyInIndex. I haven't written modern C++ for at least 5 years and was curious about the overhead from try/catch when exceptions are thrown. For my simple test using g++ version 4.8.1 with -O3 the overhead was much larger than returning error codes and it looks like try/catch/throw adds at least 1 microsecond of latency. The latency might be higher for a less naive use of exceptions. Frequent exceptions getting thrown (and rethrown in this case) can increase the latency for work done when the per-db write lock is held which might not be a good thing. Local C++ experts are helping me to understand the overhead from try/catch/throw/rethrow.
/** * this error is ok/benign when doing a background indexing -- * that logic in pdfile checks explicitly for the 10287 error code. */ static void alreadyInIndex() { // we don't use massert() here as that does logging and this is 'benign' - // see catches in _indexRecord() throw MsgAssertionException(10287, "btree: key+recloc already in index"); } This is the call sequence for the read phase of index maintenance. It appears to read the index leaf pages for all indexes into memory before getting the per-db write lock. Reading before getting the write lock should reduce stalls. indexRecordUsingTwoSteps fetchIndexInserters beginInsertIntoIndex twoStepInsert insertStepOne addInsertionContinuation The twoStepInsert function has unfortunate behavior in 2.4.9. The code below allows inserts and updates to succeed without doing index maintenance when the key for that doc/index is too large. Similar code exists in a few places. A message is written to the log but queries that use the index can get incorrect results. In 2.6 the code has been changed to fail the insert/update but I have yet to look at that. if ( c.key.dataSize() > this->KeyMax ) { problem() << "ERROR: key too large len:" << c.key.dataSize() << " max:" << this->KeyMax << ' ' << c.key.dataSize() << ' ' << c.idx.indexNamespace() << endl; return; // op=Nothing } The insertStepOne function finds the location in the index to modify. It also raises an exception via alreadyInIndex when the key has already been added to the index by background indexing. The exception is caught by fetchIndexInserters and might be rethrown there. The exception will also be caught further up the stack in DataFileMgr::insert which has two places where it catches the exception, one has a debug assert and immediate rethrow (I wonder if g++ is able to optimize that away). The other location does real work after the catch. So the exception can be caught twice and rethrow twice. I wonder how much that costs. The write phase of index maintenance also starts in indexRecordUsingTwoSteps. It is step 2. This is done by calling finishAllInsertions and then calling bt_insert for all multikeys (I wonder if multikey is a composite index). The call stack for finishAllInsertions is: finishAllInsertions doIndexInsertionWrites insertHere basicInsert And the call stack for bt_insert is listed below. bt_insert and _insert can also skip index maintenance when the key is too large. bt_insert _insert insertHere basicInsert The insertHere function might call split when a b-tree page is full. I think that might need to read a sibling page in the b-tree and I am curious whether the sibling is read during step 1 (the read phase). Otherwise the per-db write lock can be held during a disk read which can be a performance problem. We had something like that while implementing slave prefetch for MySQL via the faker tool and changes to InnoDB. Eventually the majority of the page reads done by the slave in a patched version of MySQL were from reads of sibling pages during page splits. It took time to fix that.

No comments:

Post a Comment