Thursday, February 27, 2014

Exceptions versus returning an error code

This is a note to myself as it explains some of the index maintenance code in MongoDB prior to 2.6. I looked at a few PMP samples from a MongoDB server that was running slower than desired for a concurrent workload. PMP is a convenient way to see where threads are running and waiting. A common reason for something being slow is that most threads are blocked on a mutex or IO. In this case the culprit was the per-database write lock. The running thread has this call stack in both samples that I checked:

_Unwind_RaiseException
_Unwind_Resume_or_Rethrow
__cxa_rethrow

mongo::fetchIndexInserters
mongo::indexRecordUsingTwoSteps
mongo::DataFileMgr::insert(char
mongo::DataFileMgr::insertWithObjMod
mongo::checkAndInsert
mongo::insertMulti
mongo::receivedInsert
mongo::assembleResponse
mongo::MyMessageHandler::process
mongo::PortMessageServer::handleIncomingMsg

Why is so much time being spent on C++ exception handling? I think the exception is thrown by alreadyInIndex. I haven't written modern C++ for at least 5 years and was curious about the overhead from try/catch when exceptions are thrown. For my simple test using g++ version 4.8.1 with -O3, the overhead was much larger than returning error codes and it looks like try/catch/throw adds at least 1 microsecond of latency. The latency might be higher for a less naive use of exceptions. Frequent exceptions getting thrown (and rethrown in this case) can increase the latency for work done while the per-db write lock is held, which might not be a good thing. Local C++ experts are helping me to understand the overhead from try/catch/throw/rethrow.
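
My test was a trivial loop, something like the sketch below. This is a reconstruction rather than the exact program I ran, and the function names are made up; it only gives a rough sense of the cost of the failure path when it returns an error code versus when it throws.

// Reconstruction of the kind of microbenchmark described above, not the exact
// program I ran: compare the cost of the failure path reported via an error
// code versus a thrown exception. Build with: g++ -O3 -std=c++11 bench.cc
#include <chrono>
#include <cstdio>
#include <stdexcept>

// failure reported by return code
static int __attribute__((noinline)) failWithCode() { return -1; }

// failure reported by throwing
static void __attribute__((noinline)) failWithThrow() {
    throw std::runtime_error("btree: key+recloc already in index");
}

int main() {
    const long iters = 1000000;
    long failures = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; i++)
        if (failWithCode() != 0) failures++;
    auto t1 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; i++) {
        try {
            failWithThrow();
        } catch (const std::runtime_error&) {
            failures++;
        }
    }
    auto t2 = std::chrono::steady_clock::now();

    long long ns1 = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    long long ns2 = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    printf("error code: %lld ns/call, throw+catch: %lld ns/call (failures=%ld)\n",
           ns1 / iters, ns2 / iters, failures);
    return 0;
}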

/**
 * this error is ok/benign when doing a background indexing --
 * that logic in pdfile checks explicitly for the 10287 error code.
 */
static void alreadyInIndex() {
    // we don't use massert() here as that does logging and this is 'benign' -
    // see catches in _indexRecord()
    throw MsgAssertionException(10287, "btree: key+recloc already in index");
}

This is the call sequence for the read phase of index maintenance. It appears to read the index leaf pages for all indexes into memory before getting the per-db write lock. Reading before getting the write lock should reduce stalls.

indexRecordUsingTwoSteps
fetchIndexInserters
beginInsertIntoIndex
twoStepInsert
insertStepOne
addInsertionContinuation

The twoStepInsert function has unfortunate behavior in 2.4.9. The code below allows inserts and updates to succeed without doing index maintenance when the key for that doc/index is too large. Similar code exists in a few places. A message is written to the log but queries that use the index can get incorrect results. In 2.6 the code has been changed to fail the insert/update but I have yet to look at that.

if ( c.key.dataSize() > this->KeyMax ) {
    problem() << "ERROR: key too large len:" << c.key.dataSize()
              << " max:" << this->KeyMax << ' ' << c.key.dataSize()
              << ' ' << c.idx.indexNamespace() << endl;
    return; // op=Nothing
}

The insertStepOne function finds the location in the index to modify. It also raises an exception via alreadyInIndex when the key has already been added to the index by background indexing. The exception is caught by fetchIndexInserters and might be rethrown there. The exception will also be caught further up the stack in DataFileMgr::insert, which has two places where it catches the exception: one has a debug assert and an immediate rethrow (I wonder if g++ is able to optimize that away), the other does real work after the catch. So the exception can be caught twice and rethrown twice. I wonder how much that costs.

The write phase of index maintenance also starts in indexRecordUsingTwoSteps. It is step 2. This is done by calling finishAllInsertions and then calling bt_insert for all multikeys (I wonder if multikey is a composite index). The call stack for finishAllInsertions is:

finishAllInsertions
doIndexInsertionWrites
insertHere
basicInsert

And the call stack for bt_insert is listed below. bt_insert and _insert can also skip index maintenance when the key is too large.

bt_insert
_insert
insertHere
basicInsert

The insertHere function might call split when a b-tree page is full. I think that might need to read a sibling page in the b-tree and I am curious whether the sibling is read during step 1 (the read phase). Otherwise the per-db write lock can be held during a disk read, which can be a performance problem. We had something like that while implementing slave prefetch for MySQL via the faker tool and changes to InnoDB. Eventually the majority of the page reads done by the slave in a patched version of MySQL were from reads of sibling pages during page splits. It took time to fix that.
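
To make the "caught twice and rethrown twice" path concrete, here is a minimal standalone illustration of that pattern. It is not MongoDB source and the function names are stand-ins; it only shows one exception being caught, asserted on, rethrown, caught again and only then handled.

// Illustration of the catch/assert/rethrow pattern described above -- not
// MongoDB source. One exception is caught and rethrown at two levels before
// it is finally handled, so the exception runtime runs for each throw/rethrow.
#include <cassert>
#include <cstdio>
#include <stdexcept>

static void stepOne() {
    // stands in for alreadyInIndex()
    throw std::runtime_error("btree: key+recloc already in index");
}

static void middleLayer() {
    try {
        stepOne();
    } catch (const std::runtime_error&) {
        throw;  // first catch: nothing useful to do at this level, rethrow
    }
}

static void topLayer() {
    try {
        middleLayer();
    } catch (const std::runtime_error& e) {
        assert(e.what() != nullptr);  // debug assert, then immediate rethrow
        throw;
    }
}

int main() {
    try {
        topLayer();
    } catch (const std::runtime_error& e) {
        printf("handled benign error: %s\n", e.what());  // real work happens here
    }
    return 0;
}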

Wednesday, February 26, 2014

Write Amplification: write-optimized versus update-in-place

I spent a few days running linkbench to compare write-amplification between an update-in-place b-tree (InnoDB) and a write-optimized algorithm (TokuDB). There are a few changes that can be made to greatly reduce the bytes written rate for InnoDB. My co-workers had great success in doing this for a specific workload but might not have described the impact in public. So I will describe the changes using linkbench. Note that the bytes written rate is very important when using flash storage as device endurance is rated in terms of it. Reducing the rate can either extend the lifetime of the device or allow you to buy a lower endurance and less expensive device.

The changes to reduce the bytes written rate for InnoDB include:
  • reduce innodb_page_size - if pages are written back when only a small fraction of the page is dirty then you might cut the bytes written rate in half by cutting the page size in half (from 16k to 8k). Note that 2X compression already cuts the on-disk page size in half, and it is probably a bad idea to combine 2X compression with 4k pages, so the best you can do is a 4k page without compression or an 8k page with 2X compression.
  • increase innodb_log_file_size - if pages are being written back because of the fuzzy checkpoint constraint (from the flush list not from the LRU) then making the log file larger will reduce the write back rate. However you need to be careful to avoid sacrificing too much RAM to redo log files cached by the OS as buffered IO is used for this.
  • increase innodb_max_dirty_pages_pct - if pages are being written back because this limit is reached then making it larger will reduce the rate. But that will also increase shutdown time.
  • disable the doublewrite buffer - disabling this cuts the bytes written rate for database files in half. Alas it exposes you to torn pages. One day atomic page writes might be common on flash devices and this won't be a concern.
  • disable innodb_flush_neighbors - this is definitely an optimization for spinning disks. Disabling it will prevent InnoDB from writing back some dirty pages before their time.
  • magic - there are a few more changes that my clever colleagues have done. I will allow them to describe that work.

tl;dr

The summary of the results below:
  • with tuning you can significantly reduce the InnoDB bytes written rate
  • TokuDB requires very little tuning to get good performance. I set one parameter in my.cnf for TokuDB versus many for InnoDB. They have done a great job with that.
  • TokuDB might be able to do much better with tuning but that is a task for another Callaghan
  • the bytes written rate is significantly less for TokuDB than for InnoDB
  • the database size is significantly smaller for TokuDB than for InnoDB
  • QPS on multi-core is much better today than it was last year for TokuDB
  • the bytes written rate to flash is underestimated for InnoDB or overestimated for TokuDB because this does not include writes done for copy out during flash block cleaning. So TokuDB is doing much better relative to InnoDB than reported by my metrics

Configuration

For InnoDB fsync-on-commit was enabled and the binlog was disabled. So this includes writes from the redo logs. InnoDB used a 4G or 8G redo log as indicated in each test result. InnoDB also used O_DIRECT and the doublewrite buffer was disabled. I don't have results for InnoDB with the doublewrite enabled. The write rates would be about 2X of what I report here which is about 4X worse than TokuDB in the worst case. When compression is enabled I used zlib and it is only used for the link and count tables. It is not used for the object table.

TokuDB used a 16k read block size and buffered IO. I also shared a much larger performance report with the TokuDB team but I am reluctant to share performance comparisons in public. I feel much better about sharing reports that show the benefit of a few changes to one DBMS. I used the default compression -- zlib. I repeated some tests for the uncached database using a larger read block size but QPS was much worse as I predicted. I also increased the TokuDB checkpoint interval from 60 seconds to 300 seconds for a few tests but that did not reduce the bytes written rate.

Write rates were measured by iostat. Note that all bytes written are not created equal. When small random writes are done, the flash block cleaner (garbage collection) encounters a mix of live & dead pages while cleaning and must copy out (write) the live pages elsewhere. This is another source of write amplification and bytes written. I did not measure it but the impact on InnoDB should be much larger than the impact on TokuDB. A write-optimized database that does large random writes, for example writes that are a multiple of the flash block size, will avoid some/most/all of the copy out. Note that I describe all of these writes as random. Most busy DBMS workloads have a variety of IO streams running concurrently on a small number of storage devices, and from the perspective of the storage device this is unlikely to look sequential.
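
A rough back-of-envelope model of that copy-out cost, with made-up numbers (it assumes a steady state where cleaned blocks are some fixed fraction live, which is a simplification):

// Back-of-envelope model of copy-out during flash block cleaning, with
// made-up numbers: for every page of host data written, the device also
// rewrites roughly liveFraction / (1 - liveFraction) live pages while cleaning.
#include <cstdio>

int main() {
    for (double liveFraction : {0.1, 0.3, 0.5, 0.7}) {
        double copyOut = liveFraction / (1.0 - liveFraction);
        printf("cleaned blocks %.0f%% live -> %.2f extra device writes per host write\n",
               liveFraction * 100, copyOut);
    }
    return 0;
}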

For all of the tests I ran the workloads for many hours before taking any measurements to get the database into something resembling a steady state with fragmentation for InnoDB and dead versions of rows for TokuDB. I read many benchmark results for which measurements are done immediately after load and those results can be very misleading. It also helps to make sure the flash device is full so that flash GC is in progress when the benchmarks are running. I am reasonably sure that was true in this case.

The test server has 12 CPU cores and 24 with HT enabled. I used good flash devices.

The database configurations that I tested were:
  • TokuDB
  • InnoDB-4k-c - InnoDB with a 4k page size and 2X compression. This means that database pages are 2k for compressed tables which is not a good idea in the future when 4k disk sectors are deployed.
  • InnoDB-8k-c - InnoDB with an 8k page size and 2X compression.
  • InnoDB-16k-c - InnoDB with a 16k page size and 2X compression
  • InnoDB-4k-u - InnoDB with a 4k page size
  • InnoDB-8k-u - InnoDB with an 8k page size
  • InnoDB-16k-u - InnoDB with a 16k page size

Small cached database


For the first test I used a database that was cached by InnoDB and TokuDB. I configured linkbench to use maxid1=10M+1. The database size after initial load was much less than 20G for both InnoDB and TokuDB. 10 client threads each did 100M linkbench requests and the test took about 8 hours. A 4G redo log was used for InnoDB.

The results show that InnoDB with 4k pages and 2X compression is close to TokuDB in the bytes written rate. This is a great result but I don't recommend 4k pages with 2X compression and I did not run this test for 8k pages with 2X compression.


Larger cached database

I repeated the test with maxid1=100M+1. In this case an 8G redo log was used for InnoDB. The database after load was 34G/66G for TokuDB/InnoDB and grew to 60G/102G. It was still cached by both TokuDB and InnoDB. For this test I used 20 client threads that each did 20M requests. The bytes written rate for InnoDB with 8k pages and 2X compression was about 2X the rate for TokuDB. I did not save metrics to look at page write rates by table. If most of the writes were done for the uncompressed node table then supporting 8k+2X compressed tables concurrent with 4k-uncompressed tables would be a useful feature for InnoDB.


Not cached database

For the final result I used maxid1=1B+1. The test database after load was 341G for TokuDB, 672G for InnoDB-8k-c and 651G for InnoDB-16k-c. At test end it was 422G, 831G and 792G respectively. An 8G redo log was used for InnoDB. The test used 20 client threads that each did 20M requests. In this case the rate for InnoDB with 8k pages and 2X compression is about 1.3X the rate for TokuDB.


Notes on MongoDB space allocation

This describes what I learned about space allocation in MongoDB. I am new to it so I might be wrong and welcome corrections. I make some comparisons to MySQL given my many years working with it. MongoDB creates several types of files -- database, journal, temp files for index creation, temp files for repair and the oplog for replication. I only discuss database files in this post.

I have seen some results from comparisons with TokuMX that suggest MongoDB uses a lot more space than desired. I can't trust everything I read on the web and maybe you can't trust what I write but I won't discount the claims. Understanding space allocation makes it easier to figure out whether these claims might be an issue for a given workload. Here is one example where it used 10X more space than TokuMX. Here are other examples from Tokutek. And this is an example from a production workload that got between 3X and 4X better compression with TokuMX.

MongoDB uses 1+ files per database and there can be many collections per database. With MySQL/InnoDB there are options to use a file per instance or file per table and at times it would have been nice to use a few files per database. With MongoDB each file in the database has a fixed size and is preallocated to that full size by a background thread. Note that InnoDB will incrementally grow files and there have been stalls in the past from that (since fixed by Facebook, Percona and Oracle). Preallocation means that when a database requires space from N files it is likely that N+1 files will be created. If you are testing MongoDB to see how much space it might use for your data be careful to use a dataset that isn't too small or the preallocated file will overstate the amount of space needed.

The sizes for database files are 64M, 128M, 256M, 512M, 1G, 2G, 2G, 2G and so on unless the smallfiles option is set, in which case the size pattern is 16M, 32M, 64M, 128M, 256M, 512M, 512M, 512M and so on. Even with the smallfiles option it might not be a best practice to have very small databases, where small means the average database is less than 512M, because there will be a lot of space wasted by file preallocation. File preallocation can be disabled by the noprealloc option but that is not recommended. I think the per-database write lock can be held during file allocation, so disabling preallocation would mean more stalls.
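
The doubling-then-cap pattern above is easy to sketch. This is illustrative only, not the MongoDB allocation code:

// A sketch of the file size pattern described above: each new database file
// doubles in size until the cap is reached (2G normally, 512M with the
// smallfiles option). Illustrative only -- not the MongoDB source.
#include <cstdint>
#include <cstdio>

static int64_t fileSize(int n, bool smallfiles) {
    const int64_t MB = 1024 * 1024;
    int64_t size = smallfiles ? 16 * MB : 64 * MB;   // size of file 0
    int64_t cap  = smallfiles ? 512 * MB : 2048 * MB;
    for (int i = 0; i < n && size < cap; i++)
        size *= 2;
    return size;
}

int main() {
    for (int n = 0; n < 8; n++)
        printf("file %d: %lld MB\n", n, (long long)(fileSize(n, false) / (1024 * 1024)));
    // prints 64, 128, 256, 512, 1024, 2048, 2048, 2048
    return 0;
}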

Space allocation for documents is a hot topic. Note that documents are stored contiguously in a file: a 4MB document will be stored from file offset X to file offset X+4MB. Compare this to InnoDB, which uses no more than half of a page (16kb by default) per row and then stores large columns in overflow pages, with each spilled column stored on its own page(s). There are some cases where InnoDB can waste a lot of space from spilled blob columns that are not too large. For example, when a spilled blob column is 4k it will use a 16k page for itself. MongoDB might be more space efficient in that case.

Free space lists are maintained to store all unused space that can store records of size 32, 64, 128, 256, 512, 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, 1M, 2M, 4M and 16M. Space is added to these lists when a file is created and probably when records are deleted and moved. Entries on the free space lists can have variable sizes but allocation requests are quantized (see below) to reduce the number of possible sizes. When InnoDB compression is used and there are either tables with different compression factors or compressed and uncompressed tables then InnoDB has to manage free pages with different sizes, but the number of sizes is greatly reduced (16k, 8k, 4k, 2k). We had one performance bug with code in InnoDB that tried to merge adjacent pages on one list to promote them to the next larger list. I wonder how the MongoDB code handles that.

A record must be moved when it is updated and the post-update size exceeds the space originally allocated. This is why padding can be important. When space is needed for a new or a moved record there are two algorithms to decide how much space to allocate. The original algorithm uses a padding factor so that X * paddingFactor bytes are allocated for an X byte document. The padding factor is managed per collection and can be queried but cannot be set by a DBA. It took me a few minutes to realize I could not set it. Well, it can be set during compact but not repairDatabase (more on this later). The padding factor is kept between 1.0 and 2.0: it is decreased by 0.001 after some updates that are done in place and increased by 0.001 * (min(nIndexes, 7) + 3) after some updates that are not done in place, where "some" means a sample of 25% of the updates. Even when the padding factor is 1.0 there is likely to be some padding because space allocation is quantized. The amount of space to allocate is rounded up to a multiple of X bytes where X is slab-size / 16 and the slab sizes are listed above (32, 64, 128, ...). There might be an open bug where allocation requests that are already a power of 2 get rounded up too much when quantized.
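
A sketch of that quantization rule is below. It assumes the quantum comes from the smallest slab that can hold the request, which is my reading of the description above; the bucket list is abbreviated and this is not the MongoDB source.

// Sketch of the quantization rule described above: round an allocation
// request up to a multiple of (slab size / 16), where the slab is assumed to
// be the smallest free-list bucket that can hold the request.
#include <cstdio>

static const int kBuckets[] = {32, 64, 128, 256, 512, 1024, 2048, 4096,
                               8192, 16384, 32768, 65536};

static int quantize(int bytes) {
    for (int b : kBuckets) {
        if (bytes <= b) {
            int step = b / 16;                          // quantum for this slab
            return ((bytes + step - 1) / step) * step;  // round up to a multiple
        }
    }
    return bytes;  // larger than the last slab: no quantization in this sketch
}

int main() {
    // e.g. a 300 byte request falls in the 512 byte slab, quantum 32,
    // so it is rounded up to 320 bytes.
    printf("%d -> %d\n", 300, quantize(300));
    printf("%d -> %d\n", 1000, quantize(1000));  // 1024 slab, quantum 64 -> 1024
    return 0;
}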

The new space allocation algorithm is powerOf2Sizes. The space allocated for a document is rounded up to the next power of 2. Without using this in production I assume that on average this uses 1.5X more bytes than needed and the overhead from a fragmented b-tree is similar. With the old approach (padding factor) there isn't much padding on a new collection and with the new approach (powerOf2Sizes) there is a lot of padding. As powerOf2Sizes can be set on a collection I expect that a common pattern will be to set it prior to a reload for collections that will get frequent changes after the load and to not set it for collections that will be read-only after a load. Note that using powerOf2Sizes for a read-only collection will waste a lot of space. It can take time for MongoDB to settle on a good value for the padding factor of a collection. The good value can be lost during dump & reload. It can be preserved during the compact command but not during repairDatabase or an explicit dump and reload. There might be a few feature requests here.
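
The powerOf2Sizes rule is simpler. A minimal sketch is below; the 32 byte minimum is my assumption and this is not the MongoDB source.

// Sketch of powerOf2Sizes as described above: round the record size up to the
// next power of 2. For uniformly distributed sizes this wastes roughly 1.5X
// on average, which is the estimate used in the text.
#include <cstdio>

static int roundUpToPowerOf2(int bytes) {
    int size = 32;              // assumed minimum allocation
    while (size < bytes)
        size *= 2;
    return size;
}

int main() {
    printf("%d -> %d\n", 300, roundUpToPowerOf2(300));    // 512
    printf("%d -> %d\n", 1025, roundUpToPowerOf2(1025));  // 2048
    return 0;
}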

MongoDB doesn't support compression for database files. Support for compression of read-only collections will be much easier to implement than for read-write collections. Maybe we will see something similar to MyISAM, where compression is only implemented for read-only data. Note that TokuMX supports compression for read-write collections. InnoDB is able to do compression for read-write tables and a big reason this was possible is the ability to do a page split when a page does not compress by the expected amount. This requires clustered indexes and MongoDB doesn't do clustered indexes.

MongoDB doesn't return space to the OS when there is free space at the tail of a large file. InnoDB also lacks this feature but it has never been a priority for me. The compact command defragments one collection and there are 1+ collections per database. The docs state that it can use up to 2G of extra space when running. I assume it might be writing into new extents when copying out from fragmented extents. So the result is that compact can leave the database with more files on completion but much more free space in the old files. repairDatabase appears to have similar behavior in terms of file allocation. It defragments all collections in a database. It might be OK to describe this as incremental dump and reload where space from the input files is released as new files are created. I like the incremental nature with respect to file allocation.

Last but definitely not least there are no checksums on database pages. There are checksums on journal writes. Page checksums are extremely important and have helped identify bad storage devices on many occasions. I prefer not to support a database running at scale without them.

Sunday, February 23, 2014

Use case?

It is time for a new blog. I still work on small data (OLTP) but my focus is not limited to MySQL. Fortunately the product is in great shape thanks to investment by Oracle/MySQL, SkySQL/MariaDB and Percona. The MySQL teams at Facebook are also doing amazing work and some of that will be described at the MySQL UC.

I don't think that semi-sync replication was sufficiently appreciated. Perhaps it was misunderstood, but I might be biased because my team at Google did the first implementation of it. Rather than dwell on the past, let's focus on the future with enhanced semi-sync. I think this provides the semantics that people hoped to get from semi-sync and will be a huge feature beyond the already big deal that fast & automated failover will become courtesy of global transaction IDs and semi-sync replication.

So we will get faster & automated failover in 2014 and might get a solution that makes the binlog 2-safe without much complexity and with very low latency. But is that enough? I think we need better support for dynamic schemas (see documents in MongoDB) and Postgres might need the same. By this I mean that a table should support the use of columns that are not declared in the schema. And then we need to support indexes on those columns. As a bonus it would be nice to avoid repeating dynamic column names in every row when stored on disk, because using really short column names isn't a desirable best practice for long-lived applications with many developers. MariaDB has some support for dynamic columns and the feature continues to get better in MariaDB v10. I don't know whether there are plans for indexing.

The other feature is support for auto-sharding. This is a much simpler problem to solve for a document data model than for a relational model where there are multiple tables with PK-FK constraints.

The standard response when making feature requests is to explain the use case, and that is sometimes followed up with an expression of doubt that enough customers want this feature. I have memories from my time at BigDBMS company when product managers would repeat that BigDBMS wasn't losing customers to product X or that BigDBMS customers weren't asking for features provided by product X. The implication was that new features in product X weren't compelling. This ignored the real problem -- customers who needed the features in product X were never becoming customers of BigDBMS, and the new license revenue growth rate might have made that clear.

I think we are in a similar situation with MySQL. Customers who want some support for sharding and documents are using other products. I think we can make MySQL competitive for this market and leverage some of the things that are much better in MySQL than elsewhere, like InnoDB, TokuDB, monitoring, support and a lot of expertise.