Posts

Showing posts from March, 2014

TokuMX, MongoDB and InnoDB versus the insert benchmark with disks

I used the insert benchmark on servers that use disks in my quest to learn more about MongoDB internals. The insert benchmark is interesting for a few reasons. First, while inserting a lot of data isn't something I do all of the time, it is something for which performance matters some of the time. Second, it subjects secondary indexes to fragmentation, and excessive fragmentation leads to wasted IO and wasted disk space. Finally, it allows for useful optimizations including write-optimized algorithms (fractal tree via TokuMX, LSM via RocksDB and WiredTiger) or the InnoDB insert buffer. Hopefully I can move on to other workloads after this week. This test is interesting for another reason that I don't really explore here but will in a future post. While caching all or most of the database in RAM works great at eliminating reads, it might not do much for avoiding random writes. So a write-heavy workload with a cached database can still be limited by random write IO and this will
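The core loop of an insert benchmark can be sketched as follows. This is a minimal stand-in, not the actual benchmark client: it uses sqlite3 (available in the Python standard library) in place of the engines discussed, and the table and column names are made up. The point it illustrates is that rows arrive in primary-key order but with random secondary-key values, so secondary index maintenance touches random leaf pages — which is what drives the fragmentation and random IO described above.

```python
# Minimal sketch of an insert-benchmark core loop, using sqlite3 as a
# stand-in engine. All names here are illustrative.
import random
import sqlite3
import time

def run_insert_benchmark(num_rows, batch_size=1000):
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, k INTEGER, c TEXT)")
    # Secondary index: inserts arrive in PK order but in random k order,
    # so index maintenance lands on random leaf pages.
    cur.execute("CREATE INDEX t_k ON t (k)")
    start = time.perf_counter()
    for base in range(0, num_rows, batch_size):
        rows = [(base + i, random.randrange(10**9), "x" * 10)
                for i in range(min(batch_size, num_rows - base))]
        cur.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
        conn.commit()
    elapsed = time.perf_counter() - start
    conn.close()
    return num_rows / elapsed  # inserts per second

print(f"{run_insert_benchmark(10000):.0f} inserts/second")
```

With a disk-backed file instead of `:memory:` and many more rows, the insert rate degrades as the secondary index outgrows cache — the behavior the benchmark is designed to expose.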

Redo logs in MongoDB and InnoDB

Both MongoDB and InnoDB support ACID. For MongoDB this is limited to single-document changes while InnoDB extends that to multi-statement and possibly long-lived transactions. My goal in this post is to explain how the MongoDB journal is implemented and used to support ACID. Hopefully this will help to understand performance. I include comparisons to InnoDB. What is ACID? There are a few interesting constraints on the support for ACID with MongoDB. It uses a per-database reader-writer lock. When a write is in progress all other uses of that database (writes & reads) are blocked. Reads can be done concurrently with other reads but block writes. The manual states that the lock is writer-greedy, so that a pending writer gets the lock before a pending reader. I am curious whether this also means that a pending writer prevents additional readers when the lock is currently in read mode, and I have added that question to my TODO list. The reader-writer lock code is kind of complex. By in progress I
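The writer-greedy behavior described above can be sketched with a small reader-writer lock. This is illustrative only — the class name and structure are mine, not MongoDB's code — and it assumes the stronger interpretation from my TODO question: a pending writer blocks new readers even while the lock is held in read mode.

```python
# Sketch of a writer-greedy reader-writer lock. Illustrative only; this
# is not MongoDB's implementation. It assumes a pending writer blocks
# new readers even when the lock is currently held in read mode.
import threading

class WriterGreedyRWLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False
        self._waiting_writers = 0

    def acquire_read(self):
        with self._cond:
            # Writer greed: a pending writer keeps new readers out.
            while self._writer or self._waiting_writers > 0:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            self._waiting_writers += 1
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._waiting_writers -= 1
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

The key line is the read-acquire wait condition: checking `_waiting_writers > 0` there is exactly what makes the lock writer-greedy rather than fair.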

A few comments on MongoDB and InnoDB replication

MongoDB replication has something like InnoDB fake changes built in. It prefetches all documents to be changed while holding a read lock before trying to apply any changes. I don't know whether the read prefetch extends to indexes. That question has now been added to my TODO list. Using fake changes to prefetch on MySQL replicas for InnoDB worked better than everything that came before it because it prefetched any index pages that were needed for index maintenance. Then we made it even better by making sure to prefetch sibling pages of b-tree leaf pages when pessimistic changes (changes not limited to a single page) might be done (thanks, Domas). Hopefully InnoDB fake changes can be retired with the arrival of parallel replication apply. This is an example of the challenge of running a DBMS at web scale -- we can't always wait, and many of our solutions are in production 3 to 5 years before the proper fix arrives from the upstream vendor. I am not complaining as this gives u
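The prefetch-before-apply idea can be sketched in a few lines. This is a toy model, not MongoDB's or InnoDB's code: the function and names are mine, a dict stands in for the collection, and "warming the cache" is simulated by plain reads done outside the write lock.

```python
# Sketch of replication prefetch: read every document the change will
# touch before taking the write lock, so the apply step finds them
# cached and holds the lock for less time. Illustrative names only.
import threading

def apply_with_prefetch(changes, collection, write_lock):
    # Phase 1: prefetch outside the write lock (reads only).
    for key, _ in changes:
        collection.get(key)  # warms the cache; value is discarded
    # Phase 2: apply under the write lock; no read should miss now.
    with write_lock:
        for key, value in changes:
            collection[key] = value

coll = {"a": 1}
apply_with_prefetch([("a", 2), ("b", 3)], coll, threading.Lock())
print(coll)  # -> {'a': 2, 'b': 3}
```

The open question from the post maps directly onto phase 1: prefetching the documents is easy, but whether the index pages needed for index maintenance also get read here is what determines how effective the technique is.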

Insert benchmark for InnoDB, MongoDB and TokuMX and flash storage

This work is introduced with a few disclaimers in an earlier post. For these tests I ran the insert benchmark client in two steps: first to load 100M documents/rows into an empty database and then to load another 400M documents/rows. The test used 1 client thread and 1 collection/table, and the query threads were disabled. The replication log (oplog/binlog) was disabled for all tests. I assume that all of the products would suffer with it enabled. Note that the database in this test was fully cached by InnoDB and TokuMX and almost fully cached for MongoDB. This means there will be no disk reads for InnoDB during index maintenance, no disk reads for TokuMX during compaction and few disk reads for MongoDB during index maintenance. Future posts have results for databases much larger than RAM. Performance summary: database size - InnoDB matches TokuMX when using 2X compression, alas this comes at a cost in the insert rate. Without compression TokuMX is better than InnoDB and both are

Bigger data

Database benchmarks are hard but not useless. I use them to validate performance models and to find behavior that can be improved. It is easy to misunderstand results produced by others, and they are often misused for marketing (benchmarketing). It is also easy to report incorrect results and I have done that a few times for InnoDB. A benchmark report is much more useful when it includes an explanation. Only one of these is an explanation: "A is faster than B" versus "A is faster than B because it uses less random IO". It isn't easy to explain results. That takes time and expertise in the DBMS and the rest of the hardware and software stack used during the test. The trust I have in benchmark reports is inversely related to the number of different products that have been tested. This is an introduction for a sequence of blog posts that compare MongoDB, TokuMX and InnoDB. I am an expert with InnoDB, above average with TokuDB (and the Toku part of TokuMX) and just getting started with MongoDB.

When does MongoDB make a transaction visible?

Changes to a single document are atomic in MongoDB, but what does that mean? One thing it means is that when reads run concurrently with an update to a document, the reads will see the document in either the pre-update or the post-update state. But there are several post-update states: the pre-fsync post-update state and the post-fsync post-update state. In the pre-fsync post-update state the change record for the update has been written to the journal but the journal has not been forced to disk. If a client were to read the document in this state and then the server rebooted before the group commit fsync was done, then the document might be restored to the pre-update state during crash recovery. Note that if the only failure is a mongod process failure then the document will be restored to the post-update state. But another client might have read the post-update state prior to the server reboot. Is it a problem for a client to observe a transaction that is undone? That is appli
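The visibility window described above can be sketched with a toy journaled store. All names here are illustrative, not MongoDB internals: an update becomes visible to readers as soon as it is applied in memory, but it only survives a server reboot once the group-commit fsync has run.

```python
# Toy model of the pre-fsync post-update state. Illustrative only.
class JournaledStore:
    def __init__(self):
        self.memory = {}           # what readers see
        self.journal_buffer = []   # change records not yet fsynced
        self.durable_journal = []  # what survives a server reboot

    def update(self, key, value):
        # Post-update, pre-fsync: readers can see the change now,
        # but it is not yet durable.
        self.memory[key] = value
        self.journal_buffer.append((key, value))

    def group_commit_fsync(self):
        self.durable_journal.extend(self.journal_buffer)
        self.journal_buffer.clear()

    def crash_and_recover(self):
        # A reboot discards memory and the unsynced buffer; recovery
        # replays only the durable journal, possibly undoing a state
        # that a client already observed.
        self.memory = {}
        self.journal_buffer = []
        for key, value in self.durable_journal:
            self.memory[key] = value

store = JournaledStore()
store.update("doc", "v1")
store.group_commit_fsync()
store.update("doc", "v2")   # visible to readers...
print(store.memory["doc"])  # -> v2
store.crash_and_recover()   # ...but lost: reboot before the fsync
print(store.memory["doc"])  # -> v1
```

Note that a mongod process crash alone would not trigger the rollback in this model: the journal file written by the OS would still be recoverable, which is why only a server reboot (or storage loss) exposes the pre-update state.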