Small Datum: TokuMX, MongoDB and InnoDB versus the insert benchmark with disks

I used the insert benchmark on servers that use disks in my quest to learn more about MongoDB internals. The insert benchmark is interesting for a few reasons. First, while inserting a lot of data isn't something I do all of the time, it is something for which performance matters some of the time. Second, it subjects secondary indexes to fragmentation, and excessive fragmentation leads to wasted IO and wasted disk space. Finally, it allows for useful optimizations including write-optimized algorithms (fractal tree via TokuMX, LSM via RocksDB and WiredTiger) or the InnoDB insert buffer. Hopefully I can move on to other workloads after this week.

This test is interesting for another reason that I don't really explore here but will in a future post. While caching all or most of the database in RAM works great at eliminating reads it might not do much for avoiding random writes. So a write heavy workload with a cached database can still be limited by random write IO and this will be more of an issue as RAM capacity grows on commodity servers while people try to reuse their favorite update-in-place b-tree for cached workloads. Some of the impact from that can be viewed in the results for MongoDB when the database is smaller than 72G. I wonder whether InnoDB can be improved in this case. The traditional solution is to use snapshots (sequential IO) and a redo log.

The test server has 72G of RAM and at least 8 10K RPM SAS disks with HW RAID and a battery-backed write cache, so it can do a few thousand random IOPs given many pending requests if we trade latency for throughput. The insert benchmark was used with 1 client thread and the test was started with an empty collection/table. I used the Java client for MongoDB and TokuMX and the Python client for InnoDB. The MongoDB inserts are done with w:1,j:1 and journalCommitInterval=2 (or logFlushPeriod=0 with TokuMX). So there is a wait for fsync, but with all of the tests I have done to this point there is not much difference between j:0 and j:1 as the journal sync does not have much impact when inserting 1000 documents per insert request. The InnoDB inserts are done with innodb_flush_log_at_trx_commit=1 so it also waits for fsync. I also used 8kb pages for InnoDB and disabled the doublewrite buffer. Compression was not used for InnoDB. Fsync is fast on the test hardware given the RAID write cache. The clients run on the same host as the server to reduce network latency. The oplog/binlog was disabled.

I usually have feature requests listed in a post but not this time. I think that MongoDB needs much more in the way of per-collection and per-index statistics. That shouldn't be a surprise given my work on the same for MySQL. But that will wait for another post.

The performance summary isn't a surprise. TokuMX does better than InnoDB because fractal trees greatly reduce the random IOPs demand. InnoDB does better than MongoDB. There are a few reasons why InnoDB does better than MongoDB even though they both use an update-in-place b-tree:

Databases with MongoDB are larger than with InnoDB so cache hit rates are lower when the database is larger than RAM. I don't understand all of the reasons for the size differences. Including attribute names in every document is not the majority of the problem. I think there is more secondary index fragmentation with MongoDB. I have results with and without the powerOf2Sizes option enabled and that doesn't explain the difference.
The InnoDB insert buffer is the primary reason that InnoDB does better. This is true when comparing InnoDB to many products that use an update-in-place b-tree, not just MongoDB. Because of the insert buffer InnoDB is unlikely to stall on disk reads to leaf pages of secondary indexes during index maintenance. Those reads can be done in the background using many concurrent IO requests. MongoDB doesn't have this feature. It blocks on disk reads during secondary index maintenance and won't benefit from concurrent IO for reads despite the RAID array used by the server. This note has performance results for the insert benchmark and InnoDB when the insert buffer is disabled to show the benefit from that feature. I have also written about problems since fixed in InnoDB that prevented the insert buffer from being useful because it became full.
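To make the difference between the two reasons above concrete, here is a toy model of buffered versus unbuffered secondary index maintenance. It is only a sketch under my own assumptions and not how InnoDB or MongoDB is implemented: the `ToyIndex` class, the fixed set of cached leaf pages and the key-to-page mapping are all hypothetical.

```python
from collections import defaultdict

class ToyIndex:
    """Toy secondary index: keys map to leaf pages and only some pages are cached."""

    def __init__(self, num_pages, cached_pages, use_buffer):
        self.num_pages = num_pages
        self.cached = set(cached_pages)     # assumption: the cached set never changes
        self.use_buffer = use_buffer
        self.pages = defaultdict(list)      # page id -> keys stored on that leaf
        self.pending = defaultdict(list)    # page id -> changes waiting to be merged
        self.blocking_reads = 0             # disk reads an insert had to wait for
        self.background_reads = 0           # disk reads done later during merge

    def insert(self, key):
        page = key % self.num_pages
        if page in self.cached:
            self.pages[page].append(key)    # leaf is in memory, change it directly
        elif self.use_buffer:
            self.pending[page].append(key)  # defer the change, no disk read now
        else:
            self.blocking_reads += 1        # must read the leaf before changing it
            self.pages[page].append(key)

    def merge(self):
        """Apply buffered changes; one read per page covers all of its pending keys."""
        for page, keys in self.pending.items():
            self.background_reads += 1
            self.pages[page].extend(keys)
        self.pending.clear()

plain = ToyIndex(100, cached_pages=range(10), use_buffer=False)
buffered = ToyIndex(100, cached_pages=range(10), use_buffer=True)
for key in range(1000):
    plain.insert(key)
    buffered.insert(key)
buffered.merge()
print(plain.blocking_reads, buffered.blocking_reads, buffered.background_reads)
```

In this model the buffered index never waits on a read during an insert, and the reads done at merge time are independent of each other so they could be issued concurrently. That is the property that lets a RAID array help.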

Results

For the test the client inserts up to 2B rows. But I wasn't willing to wait for MongoDB and stopped it after less than 600M rows. InnoDB was stopped after 1.8B rows. The columns used for the result table are listed below. There are a lot more details on these columns in a previous post. Each of the sections that follow describes the performance to insert the next 100M documents/rows.

sizeGB - the database size in GB
bpd - bytes per document/row computed from sizeGB / #docs (or #rows)
MB/s - the average rate for bytes written per second computed from iostat. This has IO for the database file and the journal/redo logs
GBw - the total number of GB written to the database including journal/redo logs
secs - the number of seconds to insert data
irate - the rate of documents or rows inserted per second
notes - more details on the configuration
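As a rough check on the derived columns, the sketch below recomputes bpd and irate from the raw numbers. It assumes that sizeGB is in units of 2^30 bytes and that an interval covers 100M rows. The function names are mine and are not from the benchmark client.

```python
GIB = 2 ** 30  # assumption: sizeGB is reported in units of 2^30 bytes

def bytes_per_doc(size_gb, docs):
    """bpd: database size in bytes divided by the number of documents/rows."""
    return size_gb * GIB / docs

def insert_rate(rows, secs):
    """irate: rows inserted per second over one measurement interval."""
    return rows / secs

# InnoDB after the first 100M rows: 16 GB on disk, 4290 seconds to load
bpd = bytes_per_doc(16, 100_000_000)
irate = insert_rate(100_000_000, 4290)
print(f"bpd ~ {bpd:.1f}, irate ~ {irate:.0f}")
```

Both values land close to the 171 and 23308 reported for InnoDB in the first result table. The small differences presumably come from rounding in the reported sizeGB and secs.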

From 0 to 100M rows

This has results from inserting 100M documents/rows to an empty collection/table. Things that interest me that I have previously reported include 1) MongoDB databases are much larger and 2) MongoDB does much more disk IO for the same workload and the increase in bytes written isn't explained by the database being larger. One of the reasons for the high bytes written rate is that the test takes longer to complete with MongoDB and a hard checkpoint is done every syncdelay seconds. InnoDB is better at delaying writeback for dirty pages.

The interesting result that I have seen in a few cases with both MongoDB 2.4.9 and 2.6.0 is that results are worse with powerOf2Sizes enabled. I have not taken the time to debug this problem. That is on my TODO list. At first I thought I had a few bad servers (flaky HW, etc) but I haven't seen the opposite for this workload (powerOf2Sizes enabled getting better insertion rates). The problem appears to be intermittent. Note that 2.6 has a fix for JIRA 12216 that doesn't block allocation of new files when msync is in progress so 2.6 should be somewhat faster than 2.4.

config sizeGB bpd MB/s GBw secs irate notes
innodb 16 171 28.9 124 4290 23308
tokumx 9.2 98 11.1 79 7127 14030
mongo24 43 461 46.4 1539 33230 3009 powerOf2Sizes=0
mongo24 44 472 30.0 1545 51634 1937 powerOf2Sizes=1
mongo26 42 450 47.9 1446 30199 3311 powerOf2Sizes=1
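One way to see that the extra bytes written are not explained by database size is to divide GBw by sizeGB for each configuration. The sketch below does that for the first 100M rows using the values from the table above. The ratio is my own ad hoc metric and is not a column from the benchmark.

```python
# (sizeGB, GBw) for the first 100M rows, copied from the table above
first_100m = {
    "innodb": (16, 124),
    "tokumx": (9.2, 79),
    "mongo24 powerOf2Sizes=0": (43, 1539),
    "mongo24 powerOf2Sizes=1": (44, 1545),
    "mongo26 powerOf2Sizes=1": (42, 1446),
}

def written_per_stored(size_gb, gb_written):
    """GB written to storage for each GB of database created."""
    return gb_written / size_gb

ratios = {cfg: written_per_stored(size, gbw) for cfg, (size, gbw) in first_100m.items()}
for cfg, ratio in ratios.items():
    print(f"{cfg:26s} {ratio:5.1f}")
```

By this measure the MongoDB configurations write roughly 34 to 36 GB per GB stored, versus about 8 to 9 for InnoDB and TokuMX.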

From 100M to 200M rows

TokuMX and fractal trees are starting to show a benefit relative to InnoDB.

config sizeGB bpd MB/s GBw secs irate notes
innodb 31 166 24.3 238 9781 10224
tokumx 17 91 12.3 90 7328 13646
mongo24 72 386 37.4 1768 47329 2113 powerOf2Sizes=0
mongo24 79 424 24.6 1731 70325 1422 powerOf2Sizes=1
mongo26 76 408 39.3 1611 40992 2439 powerOf2Sizes=1

From 200M to 300M rows

More of the same as TokuMX gets better relative to others.

config sizeGB bpd MB/s GBw secs irate notes
innodb 45 161 21.7 350 16136 6198
tokumx 25 89 12.0 84 7071 14142
mongo24 98 350 30.7 2008 65514 1526 powerOf2Sizes=0
mongo24 106 379 19.9 1917 96351 1038 powerOf2Sizes=1
mongo26 108 386 24.9 1933 77677 1287 powerOf2Sizes=1

From 300M to 400M rows

TokuMX begins to get slower. MongoDB gets a lot slower as the database is much larger than RAM. Problems unrelated to MongoDB cost me two of the long running test servers (for 2.4.9 and 2.6.0 with powerOf2Sizes=1).

config sizeGB bpd MB/s GBw secs irate notes
innodb 61 163 21.1 376 17825 5610
tokumx 31 83 12.1 86 7172 13941
mongo24 130 348 14.7 2313 157395 635 powerOf2Sizes=0

From 400M to 500M rows

MongoDB is getting significantly slower as the database is larger than RAM. More on this in the next section.

config sizeGB bpd MB/s GBw secs irate notes
innodb 75 161 19.8 462 23337 4285
tokumx 39 83 11.1 84 7584 13186
mongo24 160 344 4.7 2105 441534 202 powerOf2Sizes=0

From 500M to 600M rows

I wasn't willing to wait for MongoDB to make it to 600M, so I stopped the test when it reached ~540M inserts. The insert rate continues to drop dramatically. InnoDB does better because of the insert buffer. I assume that for MongoDB it would drop to ~50/second were I willing to wait. That would happen when each inserted document requires a disk read for every secondary index (there are 3), because the disk array can do ~150 disk reads/second when requests are submitted serially. InnoDB was slightly faster compared to the previous 100M inserts, but it will get slower in the long run.
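The ~50/second estimate is simple division, sketched below. It assumes the worst case where no secondary index leaf page is cached and reads are issued one at a time. The function name is mine.

```python
def blocked_insert_rate(serial_reads_per_sec, secondary_indexes):
    """Inserts/second when every insert waits on one disk read per secondary index."""
    return serial_reads_per_sec / secondary_indexes

# ~150 serial disk reads/second and 3 secondary indexes, as described above
print(blocked_insert_rate(150, 3))
```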

I looked at iostat output and the MongoDB host was doing ~260 disk reads/second and ~375 disk writes/second at test end. For both reads and writes the average request size was ~8kb. The write stats include writes to journal and database files. From PMP stack traces I see a single thread busy walking b-tree indexes most of the time. Note that the write rate for MongoDB has fallen in line with the reduction in the insert rate. Database pages aren't getting dirty as fast as they used to get because MongoDB is stalled on secondary index leaf node reads.

config sizeGB bpd MB/s GBw secs irate notes
... at 600M docs/rows
innodb 89 159 20.1 392 19465 5137
tokumx 46 82 11.6 90 7741 12917
... at 540M documents
mongo24 168 340 2.9 1235 X 123 powerOf2Sizes=0

From 900M to 1B rows

Alas InnoDB has begun to degrade faster. Even the insert buffer eventually is no match for a write-optimized algorithm.

config sizeGB bpd MB/s GBw secs irate notes
innodb 148 158 15.4 1515 98413 1016
tokumx 74 79 10.9 92 8436 11853

From 1.4B to 1.5B rows

More of the same.

config sizeGB bpd MB/s GBw secs irate notes
innodb 221 158 12.5 1745 140274 713
tokumx 104 74 11.0 96 8722 11464

From 1.5B to 2B rows

TokuMX is all alone.

config sizeGB bpd MB/s GBw secs irate notes
tokumx 142 76 12.6 99 7868 12709

Monday, March 24, 2014

1 comment:

Is it time for TPC-BLOB?