Posts

Showing posts from March, 2014

TokuMX, MongoDB and InnoDB versus the insert benchmark with disks

I used the insert benchmark on servers that use disks in my quest to learn more about MongoDB internals. The insert benchmark is interesting for a few reasons. First, while inserting a lot of data isn't something I do all of the time, it is something for which performance matters some of the time. Second, it subjects secondary indexes to fragmentation, and excessive fragmentation leads to wasted IO and wasted disk space. Finally, it allows for useful optimizations including write-optimized algorithms (fractal tree via TokuMX, LSM via RocksDB and WiredTiger) or the InnoDB insert buffer. Hopefully I can move on to other workloads after this week. This test is interesting for another reason that I don't really explore here but will in a future post. While caching all or most of the database in RAM works great at eliminating reads, it might not do much for avoiding random writes. So a write-heavy workload with a cached database can still be limited by random write IO and this will
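The core loop of an insert benchmark can be sketched as follows. This is a minimal stand-in, not the actual benchmark client: it uses sqlite3 (available in the Python standard library) in place of the engines discussed, and the table and column names are made up. The point it illustrates is that rows arrive in primary-key order but with random secondary-key values, so secondary index maintenance touches random leaf pages — which is what drives the fragmentation and random IO described above.

```python
# Minimal sketch of an insert-benchmark core loop, using sqlite3 as a
# stand-in engine. All names here are illustrative.
import random
import sqlite3
import time

def run_insert_benchmark(num_rows, batch_size=1000):
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, k INTEGER, c TEXT)")
    # Secondary index: inserts arrive in PK order but in random k order,
    # so index maintenance lands on random leaf pages.
    cur.execute("CREATE INDEX t_k ON t (k)")
    start = time.perf_counter()
    for base in range(0, num_rows, batch_size):
        rows = [(base + i, random.randrange(10**9), "x" * 10)
                for i in range(min(batch_size, num_rows - base))]
        cur.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
        conn.commit()
    elapsed = time.perf_counter() - start
    conn.close()
    return num_rows / elapsed  # inserts per second

print(f"{run_insert_benchmark(10000):.0f} inserts/second")
```

With a disk-backed file instead of `:memory:` and many more rows, the insert rate degrades as the secondary index outgrows cache — the behavior the benchmark is designed to expose.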

Redo logs in MongoDB and InnoDB

Both MongoDB and InnoDB support ACID. For MongoDB this is limited to single-document changes while InnoDB extends that to multi-statement and possibly long-lived transactions. My goal in this post is to explain how the MongoDB journal is implemented and used to support ACID. Hopefully this will help to understand performance. I include comparisons to InnoDB. What is ACID? There are a few interesting constraints on the support for ACID with MongoDB. It uses a per-database reader-writer lock. When a write is in progress all other uses of that database (writes & reads) are blocked. Reads can be done concurrently with other reads but block writes. The manual states that the lock is writer-greedy, so that a pending writer gets the lock before a pending reader. I am curious whether this also means that a pending writer prevents additional readers when the lock is currently in read mode, and I have added that question to my TODO list. The reader-writer lock code is kind of complex. By in progress I
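The writer-greedy behavior described above can be sketched with a small reader-writer lock. This is illustrative only — the class name and structure are mine, not MongoDB's code — and it assumes the stronger interpretation from my TODO question: a pending writer blocks new readers even while the lock is held in read mode.

```python
# Sketch of a writer-greedy reader-writer lock. Illustrative only; this
# is not MongoDB's implementation. It assumes a pending writer blocks
# new readers even when the lock is currently held in read mode.
import threading

class WriterGreedyRWLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False
        self._waiting_writers = 0

    def acquire_read(self):
        with self._cond:
            # Writer greed: a pending writer keeps new readers out.
            while self._writer or self._waiting_writers > 0:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            self._waiting_writers += 1
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._waiting_writers -= 1
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

The key line is the read-acquire wait condition: checking `_waiting_writers > 0` there is exactly what makes the lock writer-greedy rather than fair.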

A few comments on MongoDB and InnoDB replication

MongoDB replication has something like InnoDB fake changes built in. It prefetches all documents to be changed while holding a read lock before trying to apply any changes. I don't know whether the read prefetch extends to indexes. That question has now been added to my TODO list. Using fake changes to prefetch on MySQL replicas for InnoDB worked better than everything that came before it because it prefetched any index pages that were needed for index maintenance. Then we made it even better by making sure to prefetch sibling pages of b-tree leaf pages when pessimistic changes (changes not limited to a single page) might be done (thanks, Domas). Hopefully InnoDB fake changes can be retired with the arrival of parallel replication apply. This is an example of the challenge of running a DBMS at web scale -- we can't always wait, and many of our solutions are in production 3 to 5 years before the proper fix arrives from the upstream vendor. I am not complaining as this gives u
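The prefetch-before-apply idea can be sketched in a few lines. This is a toy model, not MongoDB's or InnoDB's code: the function and names are mine, a dict stands in for the collection, and "warming the cache" is simulated by plain reads done outside the write lock.

```python
# Sketch of replication prefetch: read every document the change will
# touch before taking the write lock, so the apply step finds them
# cached and holds the lock for less time. Illustrative names only.
import threading

def apply_with_prefetch(changes, collection, write_lock):
    # Phase 1: prefetch outside the write lock (reads only).
    for key, _ in changes:
        collection.get(key)  # warms the cache; value is discarded
    # Phase 2: apply under the write lock; no read should miss now.
    with write_lock:
        for key, value in changes:
            collection[key] = value

coll = {"a": 1}
apply_with_prefetch([("a", 2), ("b", 3)], coll, threading.Lock())
print(coll)  # -> {'a': 2, 'b': 3}
```

The open question from the post maps directly onto phase 1: prefetching the documents is easy, but whether the index pages needed for index maintenance also get read here is what determines how effective the technique is.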

Insert benchmark for InnoDB, MongoDB and TokuMX and flash storage

This work is introduced with a few disclaimers in an earlier post. For these tests I ran the insert benchmark client in two steps: first to load 100M documents/rows into an empty database and then to load another 400M documents/rows. The test used 1 client thread and 1 collection/table, and the query threads were disabled. The replication log (oplog/binlog) was disabled for all tests. I assume that all of the products would suffer with it enabled. Note that the database in this test was fully cached by InnoDB and TokuMX and almost fully cached for MongoDB. This means there will be no disk reads for InnoDB during index maintenance, no disk reads for TokuMX during compaction and few disk reads for MongoDB during index maintenance. Future posts have results for databases much larger than RAM. Performance summary: database size - InnoDB matches TokuMX when using 2X compression, alas this comes at a cost in the insert rate. Without compression TokuMX is better than InnoDB and both are

Bigger data

Database benchmarks are hard but not useless. I use them to validate performance models and to find behavior that can be improved. It is easy to misunderstand results produced by others, and they are often misused for marketing (benchmarketing). It is also easy to report incorrect results and I have done that a few times for InnoDB. A benchmark report is much more useful when it includes an explanation. Only one of these is an explanation: "A is faster than B" versus "A is faster than B because it uses less random IO". It isn't easy to explain results. That takes time and expertise in the DBMS and the rest of the hardware and software stack used during the test. The trust I have in benchmark reports is inversely related to the number of different products that have been tested. This is an introduction for a sequence of blog posts that compare MongoDB, TokuMX and InnoDB. I am an expert with InnoDB, above average with TokuDB (and the Toku part of TokuMX) and just getting started with MongoDB.

When does MongoDB make a transaction visible?

Changes to a single document are atomic in MongoDB, but what does that mean? One thing it means is that when reads run concurrently with an update to a document, the reads will see the document in either the pre-update or the post-update state. But there are several post-update states: the pre-fsync post-update state and the post-fsync post-update state. In the pre-fsync post-update state the change record for the update has been written to the journal but the journal has not been forced to disk. If a client were to read the document in this state and then the server rebooted before the group commit fsync was done, then the document might be restored to the pre-update state during crash recovery. Note that if the only failure is a mongod process failure then the document will be restored to the post-update state. But another client might have read the post-update state prior to the server reboot. Is it a problem for a client to observe a transaction that is undone? That is appli
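The visibility window described above can be sketched with a toy journaled store. All names here are illustrative, not MongoDB internals: an update becomes visible to readers as soon as it is applied in memory, but it only survives a server reboot once the group-commit fsync has run.

```python
# Toy model of the pre-fsync post-update state. Illustrative only.
class JournaledStore:
    def __init__(self):
        self.memory = {}           # what readers see
        self.journal_buffer = []   # change records not yet fsynced
        self.durable_journal = []  # what survives a server reboot

    def update(self, key, value):
        # Post-update, pre-fsync: readers can see the change now,
        # but it is not yet durable.
        self.memory[key] = value
        self.journal_buffer.append((key, value))

    def group_commit_fsync(self):
        self.durable_journal.extend(self.journal_buffer)
        self.journal_buffer.clear()

    def crash_and_recover(self):
        # A reboot discards memory and the unsynced buffer; recovery
        # replays only the durable journal, possibly undoing a state
        # that a client already observed.
        self.memory = {}
        self.journal_buffer = []
        for key, value in self.durable_journal:
            self.memory[key] = value

store = JournaledStore()
store.update("doc", "v1")
store.group_commit_fsync()
store.update("doc", "v2")   # visible to readers...
print(store.memory["doc"])  # -> v2
store.crash_and_recover()   # ...but lost: reboot before the fsync
print(store.memory["doc"])  # -> v1
```

Note that a mongod process crash alone would not trigger the rollback in this model: the journal file written by the OS would still be recoverable, which is why only a server reboot (or storage loss) exposes the pre-update state.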