Sunday, March 23, 2014

Redo logs in MongoDB and InnoDB

Both MongoDB and InnoDB support ACID. For MongoDB this is limited to single-document changes while InnoDB extends that to multi-statement and possibly long-lived transactions. My goal in this post is to explain how the MongoDB journal is implemented and used to support ACID. Hopefully this will help you understand performance. I include comparisons to InnoDB.

What is ACID?

There are a few interesting constraints on the support for ACID with MongoDB. It uses a per-database reader-writer lock. When a write is in progress all other uses of that database (writes & reads) are blocked. Reads can be done concurrently with other reads but block writes. The manual states that the lock is writer greedy so that a pending writer gets the lock before a pending reader. I am curious whether this also means that a pending writer prevents additional readers when the lock is currently in read mode, and I added that question to my TODO list. The reader-writer lock code is kind of complex. By "in progress" I mean updating in-memory structures from mmap'd files like the document in the heap storage and any b-tree indexes. For a multi-document write the lock can be yielded (released) in between documents, so the write might be done from multiple points in time and readers can see the write in progress. There are no dirty reads. The $isolated option can be used to avoid yielding the per-db lock for multi-document writes. Even with that option an error half-way into a multi-document operation results in a half-done change as there is no undo. MyISAM users are familiar with this problem. The cursor isolation provided by MongoDB isn't that different from READ COMMITTED on a system that doesn't do MVCC snapshots (see older versions of BerkeleyDB and maybe IBM DB2 today).
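
A minimal sketch of a writer-greedy reader-writer lock, to illustrate the policy described above. This is a generic illustration and not MongoDB's lock code; the class and member names are made up. Note that as written a pending writer also blocks new readers, which is the open question mentioned above:
#include <condition_variable>
#include <mutex>

class WriterGreedyRWLock {
    std::mutex m;
    std::condition_variable cv;
    int readers = 0;          // readers currently holding the lock
    int waiting_writers = 0;  // writers queued for the lock
    bool writer = false;      // true while a writer holds the lock

public:
    void lock_shared() {
        std::unique_lock<std::mutex> lk(m);
        // A new reader waits while a writer holds the lock or is waiting for it.
        cv.wait(lk, [&] { return !writer && waiting_writers == 0; });
        ++readers;
    }
    void unlock_shared() {
        std::unique_lock<std::mutex> lk(m);
        if (--readers == 0) cv.notify_all();
    }
    void lock() {
        std::unique_lock<std::mutex> lk(m);
        ++waiting_writers;
        // A writer waits only for current lock holders, not for queued readers.
        cv.wait(lk, [&] { return !writer && readers == 0; });
        --waiting_writers;
        writer = true;
    }
    void unlock() {
        std::unique_lock<std::mutex> lk(m);
        writer = false;
        cv.notify_all();
    }
};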

It is bad for an in-progress write to stall on a disk read (page fault from an mmap'd file) while holding the per-database lock. MongoDB has support to yield the per-db lock on a page fault (real or predicted) but I have questions about this. Is the yield only done for the document pages in the heap storage or is it extended to index pages? Is anything done to guarantee that most or all pages (the document in the heap store, all index pages to be read) are in memory before applying any writes? Note that if the document is updated in memory and then a page fault occurs during index maintenance then I doubt the lock can be yielded. This is another question on my TODO list. I am not the only person with that question. MongoDB has something like an LRU to predict whether there will be a page fault on a read and understanding the performance overhead from that is also on my TODO list. I have seen a lot of CPU overhead from that code on some benchmarks.

MongoDB doesn't have row locks. The change to a document is visible as soon as the per-db lock is released. Not only are some writes from a multi-document change visible before all documents have been modified, but all changes are visible before the change is durable via the journal. This behavior is different from what you can get from a DBMS with row locks and users should be aware of it.

Redo logs

InnoDB has a redo log and uses the system tablespace for undo. The changes written to support undo are made durable via the redo log just like changes to database tables. The undo information enables consistent reads for long-running transactions. The InnoDB redo log uses buffered IO by default and is configured via the innodb_flush_method option. Redo log pages are 512 bytes and that might need to change when the conversion from 512 to 4096-byte disk sectors is complete. Each log page has a checksum. The innodb_flush_log_at_trx_commit option determines whether the log is forced on commit or once per second. There are a fixed number of redo log files and they are preallocated. With buffered IO and 512-byte aligned writes the first write on a 4kb boundary can incur a disk read to get the page into the OS cache before applying the change. This is a waste of disk IO and the workaround seems to be some form of padding to 4kb for some writes, but note that the padding will be reused/reclaimed on the next log write. An alternative is to use direct IO, but there might be several calls to write or pwrite before the log must be forced and making each write synchronous will delay log writing. With buffered IO the filesystem can coalesce the data from multiple writes as these are adjacent in the log file.
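
A minimal sketch of the two innodb_flush_log_at_trx_commit behaviors described above, assuming buffered IO. This is not InnoDB code; the function names and the demo file name are made up:
#include <fcntl.h>
#include <unistd.h>
#include <string>

static int log_fd = -1;
static int flush_log_at_trx_commit = 1;  // 1: force the log per commit, 2: force ~once per second

// Called at COMMIT: the redo for the transaction is appended with buffered IO.
// With =1 the log is forced (fsync) before returning success to the client.
// With =2 the force is left to a background task that runs about once per second.
void commit_transaction(const std::string& redo) {
    write(log_fd, redo.data(), redo.size());
    if (flush_log_at_trx_commit == 1)
        fsync(log_fd);  // the commit is durable when this returns
}

// Background task, run about once per second when flush_log_at_trx_commit=2.
void background_log_flush() {
    fsync(log_fd);
}

int main() {
    log_fd = open("ib_logfile_demo", O_WRONLY | O_CREAT | O_APPEND, 0644);
    commit_transaction("redo record for transaction 1");
    background_log_flush();
    close(log_fd);
    return 0;
}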

MongoDB doesn't do undo. It does have a redo log called the journal. This uses direct IO on Linux. The log page size is 8kb and is protected by a checksum. The in-memory log buffer is compressed via Snappy before the log write is done and the compressed result is padded to 8kb. The space taken by padding isn't reused/reclaimed for the next write, so a sequence of small inserts with j:1 each writes at least 8kb to the journal. Writes to the journal are done by a background thread (see durThread in dur.cpp). Note that the background thread iterates over a list of redo log entries that must be written to the journal, copies them to a string buffer, then uses Snappy to compress that data, then pads the compressed output to the next multiple of 8kb, then writes the padded output to the journal file. The dur section in serverStatus output has counters for the amount of data written to the journal which includes the padding (journaledMB). The size of the data prior to padding is the journaledMB counter divided by the compression counter. Note that these counters are computed over the last few seconds.
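
A minimal sketch of the journal write path described above, assuming the 8kb padding unit. This is not the code from dur.cpp; the function name and constant are made up, and the alignment requirements of direct IO are ignored here:
#include <snappy.h>
#include <unistd.h>
#include <string>
#include <vector>

static const size_t JOURNAL_PAGE_SIZE = 8192;

void append_to_journal(int journal_fd, const std::vector<std::string>& entries) {
    // Copy the pending redo log entries into one string buffer.
    std::string uncompressed;
    for (const std::string& e : entries)
        uncompressed += e;

    // Compress the buffer with Snappy.
    std::string compressed;
    snappy::Compress(uncompressed.data(), uncompressed.size(), &compressed);

    // Pad the compressed output to the next multiple of 8kb. The padding is not
    // reclaimed, so a small insert with j:1 still costs at least one 8kb write.
    size_t padded_size =
        ((compressed.size() + JOURNAL_PAGE_SIZE - 1) / JOURNAL_PAGE_SIZE) * JOURNAL_PAGE_SIZE;
    compressed.resize(padded_size, '\0');

    // One write per batch of entries (the real journal file is opened with
    // direct IO on Linux, which this sketch does not attempt to honor).
    write(journal_fd, compressed.data(), compressed.size());
}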

MongoDB optionally recycles log files and whether that is done is determined by whether journal preallocation (preallocj) is enabled. With preallocj, 3 journal files are created at process start and this can delay startup for the time required to create three 1GB journal files (see preallocateFiles in dur_journal.cpp). This writes data to the files so real IO is done, including an fsync. In steady state, after process start, old log files are recycled with preallocj (see removeOldJournalFile in dur_journal.cpp). Without preallocj the journal files are not preallocated at process start and old journal files are not reused. There is an undocumented option, --nopreallocj, that can disable preallocj. There is no option to force preallocj; that is determined by a short performance test done at process start (see preallocateIsFaster in dur_journal.cpp). One way to determine whether preallocj is in use is to check the journal directory for the preallocated files.

Preallocation for both database and journal files doesn't mean that files are written an extra time (once during preallocation and at least once during regular use). I was happy to learn this. Database file preallocation uses posix_fallocate rather than write/pwrite on Linux (see the run and ensureLength methods in the FileAllocator class). Journal file preallocation uses write append, but that should only be done at process start and then the files are recycled (see preallocateFile and removeOldJournalFile in dur_journal.cpp).
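
A minimal sketch of the two preallocation styles described above. The function names and chunk size are made up for illustration; this is not the MongoDB implementation:
#include <fcntl.h>
#include <unistd.h>
#include <vector>

// Database file preallocation: posix_fallocate reserves the blocks without
// writing the data, so the file is not written an extra time.
bool preallocate_database_file(const char* path, off_t len) {
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return false;
    int rc = posix_fallocate(fd, 0, len);
    close(fd);
    return rc == 0;
}

// Journal file preallocation: data is appended and forced to disk, which is
// why creating 3 1GB journal files can delay process start. The files are
// then recycled rather than recreated.
bool preallocate_journal_file(const char* path, off_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return false;
    std::vector<char> chunk(1 << 20, 0);  // append 1MB at a time
    for (off_t written = 0; written < len; written += (off_t)chunk.size())
        write(fd, chunk.data(), chunk.size());
    fsync(fd);
    close(fd);
    return true;
}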

Using strace is a great way to understand complex DBMS software. This shows the sequence of 8kb writes to the journal during the insert benchmark with a client that uses j:1 and 1 document per insert:
strace -f -p $( pidof mongod ) -ewrite
write(5, "g\n\0\0\264\346\0\0\0\0\0\0\216\f/S\374=8\22\224'L\376\377\377\377iiben"..., 8192) = 8192
write(5, "\t\v\0\0\264\346\0\0\0\0\0\0\216\f/S\374=8\22\344)L\376\377\377\377iiben"..., 8192) = 8192
write(5, "m\6\0\0\264\346\0\0\0\0\0\0\216\f/S\374=8\22\264\26L\376\377\377\377iiben"..., 8192) = 8192
write(5, "R\4\0\0\264\346\0\0\0\0\0\0\216\f/S\374=8\22\224\16L\376\377\377\377iiben"..., 8192) = 8192

Group commit

Proper group commit is now supported for InnoDB but I will skip the details. It is done directly by a thread handling the COMMIT operation for a user's connection and there is no wait unless another thread is already forcing the log. My team did the first implementation of group commit but MariaDB and MySQL did something better. We were thrilled to remove that change from our big patch for MySQL.

MongoDB has group commit. The journal is forced to disk every journalCommitInterval milliseconds. When a thread is blocked waiting for the journal to be forced the interval is reduced to 1/3 of that value. The minimum value for journalCommitInterval is 2, so the maximum wait in that case should be 1 millisecond (2/3 rounded up). This means that MongoDB will do at most 1000 log forces per second. Some hardware can do 5000+ fast fsyncs courtesy of battery-backed write cache in HW RAID or flash, so there are some workloads that will want MongoDB to force the log faster than it can today. Group commit is done by a background thread (see durThread, durThreadGroupCommit, and _groupCommit in dur.cpp). Forcing the journal at journalCommitInterval/3 milliseconds is also done when there is too much data ready to be written to it.
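
A minimal sketch of the group commit timing described above. This is not the dur.cpp code; force_journal and waiter_blocked are placeholders used for illustration:
#include <unistd.h>
#include <algorithm>
#include <atomic>

std::atomic<bool> waiter_blocked{false};  // set while a client thread waits on j:1

void force_journal() {
    // Write the pending journal data and fsync it (omitted in this sketch).
}

// The background thread forces the journal every journalCommitInterval ms, or
// every interval/3 ms (with a floor of 1 ms) when a commit is blocked waiting,
// so there are at most ~1000 journal forces per second.
void dur_thread(int journal_commit_interval_ms) {  // minimum allowed value is 2
    for (;;) {
        int sleep_ms = journal_commit_interval_ms;
        if (waiter_blocked.load())
            sleep_ms = std::max(1, journal_commit_interval_ms / 3);
        usleep(sleep_ms * 1000);
        force_journal();
    }
}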

Performance

I used the insert benchmark to understand redo log performance. The test used 1 client thread to insert 10M documents/rows into an empty collection/table with 1 document/row per insert. The test was repeated in several configurations to understand what limited performance. I did this to collect data for several questions: how fast can a single-threaded workload sync the log and how much data is written to the log per small transaction. For the InnoDB tests I used MySQL 5.6.12, disabled the doublewrite buffer and used an 8k page. The MongoDB tests used 2.6.0 rc0. The TokuMX tests used 1.4.0. The following configurations were tested:

  • inno-sync - fsync per insert with innodb_flush_log_at_trx_commit=1
  • inno-lazy - ~1 fsync per second with innodb_flush_log_at_trx_commit=2
  • toku-sync - fsync per insert with logFlushPeriod=0
  • mongo-sync - fsync per insert with journalCommitInterval=2, inserts used j:1
  • mongo-lazy - a few fsyncs per second with journalCommitInterval=300, inserts used w:1, j:0
  • mongo-nojournal - journal disabled, inserts used w:1

Test Results

The following metrics are reported for this test.
  • bpd - bytes per document (or row). This is the size of the database at test end divided by the number of documents (or rows) in the database. As I previously reported, MongoDB uses much more space than InnoDB whether or not powerOf2Sizes is enabled. They are both update-in-place b-trees so I don't understand why MongoDB does so much worse when subject to a workload that causes fragmentation. Storing attribute names in every document doesn't explain the difference. But in this case the results overstate the MongoDB overhead because of database file preallocation.
  • MB/s - the average disk write rate during the test in megabytes per second
  • GBw - the total amount of data written to disk during the test, in GB. This includes writes to the database files and (when enabled) the redo logs. The difference between inno-sync and inno-lazy is the overhead of a 4kb redo log write per insert. The same is not true between mongo-sync and mongo-lazy. My educated guess to explain why MongoDB and InnoDB differ is that the mongo-sync test takes much longer to finish than mongo-lazy, so there are many more hard checkpoints (write all dirty pages every 60 seconds). InnoDB is much better at keeping pages dirty in the buffer pool without writeback. In all cases MongoDB is writing much more data to disk. In the lazy mode it writes ~15X more and in the sync mode it writes ~6X more. I don't know if MongoDB does hard checkpoints (force all dirty pages to disk every syncdelay seconds) when the journal is disabled. Perhaps I was too lazy to read more code.
  • secs - the number of seconds to insert 10M documents/rows.
  • bwpi - bytes written per inserted document/row. This is GBw divided by the number of documents/rows inserted. The per row overhead for inno-sync is 4kb because a redo log force is a 4kb write. The per document overhead for mongo-sync is 8kb because a redo log force is an 8kb write. So most of the difference in the write rates between MongoDB and InnoDB is not from the redo log force overhead.
  • irate - the rate of documents/rows inserted per second. MongoDB does fewer than 1000 per second as expected given the use of journalCommitInterval. This makes for a simple implementation of group commit but is not good for some workloads (single-threaded with j:1).
  • logb - this is the total bytes written to the redo log as reported by the DBMS. Only MongoDB reports this accurately when sync-on-commit is used because it pads the result to a multiple of the filesystem page size. For MongoDB the data comes from the dur section of the serverStatus output. But I changed MongoDB to not reset the counters as upstream code resets them every few seconds. InnoDB pads to a multiple of 512 bytes and I used the os_log_written counter to measure it. AFAIK TokuMX doesn't pad and the counter is LOGGER_BYTES_WRITTEN. So neither TokuMX nor InnoDB accounts for the fact that the write is done using the filesystem page size (a multiple of 4kb).
  • logbpi - log bytes per insert. This is logb divided by the number of documents/rows inserted. There are two numbers for MongoDB. The first is the size after padding and compression. It is a bit larger than 8kb, which isn't a surprise given that 8kb is the minimum for this workload. The second number is the size prior to compression and padding. This value can be compared to InnoDB and TokuMX and I am surprised that it is so much larger for MongoDB. I assume MongoDB doesn't log page images. This is something for my TODO list.

iibench 1 doc/insert, fsync, 10M rows
                 bpd   MB/s    GBw    secs   bwpi  irate    logb  logbpi
inno-sync        146   18.9    54.3   3071   5690   3257    7.5G   785  
inno-lazy        146    2.8     5.8   2251    613   4442    7.5G   785 
toku-sync        125   31.0    86.8   2794   9104   3579    2.3G   251 
mongo-sync       492   23.1   312.0  13535  32712    739   83.3G  8733/4772
mongo-lazy       429   40.5    79.8   1969   8365   5078   21.9G  2294/4498
mongo-nojournal  440   34.1    42.0   1226   4401   8154    NA      NA

Feature requests

From all of this I have a few feature requests:

  1. Don't try to compress the journal buffer when it is already less than 8kb. That makes commit processing slower and doesn't reduce the amount of data written to the journal as it will be padded to 8kb. See JIRA 13343. A sketch of this check follows the list.
  2. Provide an option to disable journal compression. For some configurations of the insert benchmark I get 10% more inserts/second with compression disabled. Compression is currently done by the background thread before writing to the log. This adds latency for many workloads. When compression is enabled it is possible to be more clever and begin compressing the journal buffer early. Compression requires three copies of the data -- one into the compression input buffer, one for the compression output and one for the write to the journal file. It also adds the CPU overhead from Snappy. See JIRA 13345.
  3. Pad journal writes to 4kb rather than 8kb. I don't understand why padding is done to 8kb. See JIRA 13344.
  4. Provide an option to force journal file preallocation and recycling. When running a large number of servers I don't want this behavior determined dynamically, which can result in a mix of behaviors. Even with fast storage I get 10% more inserts/second with journal file preallocation on some workloads. See JIRA 13346.
  5. Provide a section in serverStatus output with dur stats that are not reset. See JIRA 13186.
  6. Write less data to the redo log. The Test Results section above shows that many more bytes are generated for changes to MongoDB than for InnoDB and TokuMX. I don't know why but given that MongoDB and InnoDB are both update-in-place b-trees I suspect that MongoDB can be more efficient. I won't create a JIRA request until I can be more specific.
  7. Write less data to the database files. This will be hard to fix. MongoDB does hard checkpoints every syncdelay seconds. There is some reduction from using a larger value for syncdelay but I don't think that is the solution. It writes much more data to database files than InnoDB for the insert benchmark. It will be interesting to compare this for other workloads.
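
For item 1 above, a minimal sketch of the suggested check: skip compression when the journal buffer is already smaller than the 8kb padding unit. This is illustrative, not a patch; the on-disk format would also need a flag to record whether a batch was compressed:
#include <snappy.h>
#include <string>

static const size_t JOURNAL_PAGE_SIZE = 8192;

// Returns the bytes to write to the journal for one group commit.
std::string prepare_journal_write(const std::string& buffer) {
    std::string out;
    if (buffer.size() < JOURNAL_PAGE_SIZE) {
        // Compression cannot reduce the number of 8kb writes here, so skip it
        // and avoid the CPU cost and the extra copy.
        out = buffer;
    } else {
        snappy::Compress(buffer.data(), buffer.size(), &out);
    }
    // Pad to the next multiple of 8kb, as the journal does today.
    size_t padded = ((out.size() + JOURNAL_PAGE_SIZE - 1) / JOURNAL_PAGE_SIZE) * JOURNAL_PAGE_SIZE;
    out.resize(padded, '\0');
    return out;
}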


7 comments:

  1. Great post as usual :-)

    I have some WIP changes that allow any block size for the InnoDB redo logs. What we want to do is separate the log buffer writes from the log disk writes, separate logwr thread perhaps. That way making it O_DIRECT or buffered could be a log writer thread property. All very experimental yet. There was a suggestion from Taobao to improve the writes to the log buffer (see http://bugs.mysql.com/bug.php?id=70950). In our tests on 5.7.4 we didn't see any improvement but then we didn't use O_DIRECT either. However, I don't think O_DIRECT would have improved things.

  2. Given that mongo uses mmaps, would it really be considered ACID in the case of soft/hard offlining of pages before they make it to the journal? I did some investigation into the usage of mmaps and from my reading there are two mmaps that are used per instance.

    Interesting ops note: when upgrading to 2.2 from 1.8 and having the journal enabled by default, the disks had to do a lot more [even with 512mb write back] and being able to increase the commit interval to 499 ms during run time kept io wait from climbing too badly. Eventually moving to a dual hybrid raid config allowed the journal to run at 100ms without issues. Changing the sync delay down from 60 makes a huge difference as well; although the flush is not *supposed* to be blocking, it is, but less now than it was in earlier releases.

    http://blog.mongodb.org/post/64962828969/performance-tuning-mongodb-on-solidfire

    ~Alexis

    1. syncdelay=1 as mentioned in the article must make the SSD vendor happy, more writes = less endurance on flash.

      Not sure what you mean by soft/hard offlining. While they have some detail on how the mmap is done, maybe they could be more clear (http://docs.mongodb.org/manual/core/journaling/). There are two mmap views -- one regular and the other created with MAP_PRIVATE (the private view). The private view gets all changes prior to the msync call every syncdelay secs. My guess is that there is only one private view, so all changes get applied to it in sequence. At msync time something is done to copy the changes from the private view to the other view and then msync is done there (I haven't read enough of that code yet). It is probably OK to copy changes from the private view over to the other view as soon as journal changes are forced to disk. Maybe that is a future optimization.

      I don't think MongoDB has protection against torn writes (partial writes). So writes in progress during a crash may result in a corrupt database that can't be recovered from the journal. That requires the full copy of the page to be written. InnoDB has that via the doublewrite buffer, but that recovery safety comes at a high cost (it doubles the write rate to storage) so you need to be careful about which storage device gets the doublewrite writes.

    2. No wonder the MongoDB guys asked us to write an LMDB driver for them. LMDB is immune to torn writes.

  3. Updated with JIRAs:
    https://jira.mongodb.org/browse/SERVER-13186
    https://jira.mongodb.org/browse/SERVER-13343
    https://jira.mongodb.org/browse/SERVER-13344
    https://jira.mongodb.org/browse/SERVER-13345
    https://jira.mongodb.org/browse/SERVER-13346

  4. From http://docs.mongodb.org/manual/core/journaling/: "a group commit must block all writers during the commit." Do you know whether "writers" refers to data writers or specifically to "other group committers"? Since readers are apparently not affected, is this a second locking mechanism or another state of the readers/writers lock? TIA.

  5. My memory is vague but I think Asya can give you a great answer on https://groups.google.com/forum/#!forum/mongodb-user

