Monday, March 24, 2014

TokuMX, MongoDB and InnoDB versus the insert benchmark with disks

I used the insert benchmark on servers that use disks in my quest to learn more about MongoDB internals. The insert benchmark is interesting for a few reasons. First, while inserting a lot of data isn't something I do all of the time, it is something for which performance matters some of the time. Second, it subjects secondary indexes to fragmentation, and excessive fragmentation leads to wasted IO and wasted disk space. Finally, it allows for useful optimizations including write-optimized algorithms (fractal trees via TokuMX, LSM via RocksDB and WiredTiger) or the InnoDB insert buffer. Hopefully I can move on to other workloads after this week.

This test is interesting for another reason that I don't really explore here but will in a future post. While caching all or most of the database in RAM works great at eliminating reads, it might not do much for avoiding random writes. So a write-heavy workload with a cached database can still be limited by random write IO, and this will be more of an issue as RAM capacity grows on commodity servers while people try to reuse their favorite update-in-place b-tree for cached workloads. Some of the impact from that can be seen in the results for MongoDB when the database is smaller than 72G. I wonder whether InnoDB can be improved in this case. The traditional solution is to use snapshots (sequential IO) and a redo log.

The test server has 72G of RAM and at least 8 10K RPM SAS disks with HW RAID and a battery-backed write cache, so it can do a few thousand random IOPs given many pending requests if we trade latency for throughput. The insert benchmark was used with 1 client thread and the test was started with an empty collection/table. I used the Java client for MongoDB and TokuMX and the Python client for InnoDB. The MongoDB inserts are done with w:1, j:1 and journalCommitInterval=2 (or logFlushPeriod=0 with TokuMX). So there is a wait for fsync, but in all of the tests I have done to this point there is not much difference between j:0 and j:1 as the journal sync does not have much impact when inserting 1000 documents per insert request. The InnoDB inserts are done with innodb_flush_log_at_trx_commit=1 so it also waits for fsync. I also used 8KB pages for InnoDB and disabled the doublewrite buffer. Compression was not used for InnoDB. Fsync is fast on the test hardware given the RAID write cache. The clients run on the same host as the server to reduce network latency. The oplog/binlog was disabled.
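
Below is a minimal sketch of the insert loop in Python, not the actual benchmark client, assuming a recent pymongo and a local mongod; the database/collection names and the document shape are made up. The write concern matches the w:1, j:1 described above and each request inserts 1000 documents.

  import time
  from pymongo import MongoClient

  # wait for the journal on every insert request, as described above
  client = MongoClient("localhost", 27017, w=1, j=True)
  coll = client.test.iibench  # hypothetical database and collection names

  batch = 1000  # documents per insert request
  inserted = 0
  start = time.time()
  while inserted < 100 * 1000 * 1000:
      docs = [{"seq": inserted + i, "pad": "x" * 100} for i in range(batch)]
      coll.insert_many(docs)
      inserted += batch
  print("inserts/second:", inserted / (time.time() - start))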

I usually have feature requests listed in a post but not this time. I think that MongoDB needs much more in the way of per-collection and per-index statistics. That shouldn't be a surprise given my work on the same for MySQL. But that will wait for another post.

The performance summary isn't a surprise. TokuMX does better than InnoDB because fractal trees greatly reduce the random IOPs demand. InnoDB does better than MongoDB. There are a few reasons why InnoDB does better than MongoDB even though they both use an update-in-place b-tree:
  1. Databases with MongoDB are larger than with InnoDB so cache hit rates are lower when the database is larger than RAM. I don't understand all of the reasons for the size differences. Including attribute names in every document is not the majority of the problem. I think there is more secondary index fragmentation with MongoDB. I have results with and without the powerOf2Sizes option enabled and that doesn't explain the difference.
  2. The InnoDB insert buffer is the primary reason that InnoDB does better. This is true when comparing InnoDB to many products that use an update-in-place b-tree, not just MongoDB. Because of the insert buffer, InnoDB is unlikely to stall on disk reads to leaf pages of secondary indexes during index maintenance. Those reads can be done in the background using many concurrent IO requests. MongoDB doesn't have this feature. It blocks on disk reads during secondary index maintenance and won't benefit from concurrent IO for reads despite the RAID array used by the server. This note has performance results for the insert benchmark and InnoDB when the insert buffer is disabled to show the benefit from that feature (a my.cnf sketch for that configuration follows this list). I have also written about problems, since fixed in InnoDB, that prevented the insert buffer from being useful because it became full.
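
A my.cnf sketch for the InnoDB configuration described above, assuming MySQL 5.6 option names; the last line is only for a comparison run with the insert buffer (change buffer) disabled and was not used for the main results.

  [mysqld]
  innodb_flush_log_at_trx_commit = 1   # wait for fsync on commit
  innodb_page_size = 8k                # 8KB pages instead of the 16KB default
  innodb_doublewrite = 0               # doublewrite buffer disabled
  # innodb_change_buffering = none     # disables the insert buffer (comparison run only)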

Results

For the test the client inserts up to 2B rows. But I wasn't willing to wait for MongoDB and stopped it after less than 600M rows. InnoDB was stopped after 1.8B rows. The columns used for the result tables are listed below, with a small sketch of the arithmetic after the list; there are a lot more details on these columns in a previous post. Each of the sections that follow describes the performance to insert the next 100M documents/rows.
  • sizeGB - the database size in GB
  • bpd - bytes per document/row computed from sizeGB / #docs (or #rows)
  • MB/s - the average rate for bytes written per second computed from iostat. This includes IO for the database files and the journal/redo logs
  • GBw - the total number of GB written to the database including journal/redo logs
  • secs - the number of seconds to insert data
  • irate - the rate of documents or rows inserted per second
  • notes - more details on the configuration
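
The derived columns are simple arithmetic. The Python sketch below recomputes them from the first InnoDB row in the next table; the unit assumptions (binary GB for database size, decimal GB for bytes written) are mine and only approximately reproduce the published values.

  docs   = 100 * 1000 * 1000   # documents/rows inserted in this interval
  sizeGB = 16                  # database size, assumed to be in GiB
  GBw    = 124                 # GB written to database + journal/redo logs
  secs   = 4290                # seconds to insert the 100M rows

  bpd   = sizeGB * 2**30 / docs   # bytes per document/row  -> ~171
  mbps  = GBw * 1000.0 / secs     # average write MB/s      -> ~28.9
  irate = docs / secs             # inserts per second      -> ~23310
  print(int(bpd), round(mbps, 1), int(irate))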

From 0 to 100M rows

This has results from inserting 100M documents/rows into an empty collection/table. Things that interest me and that I have previously reported include 1) MongoDB databases are much larger and 2) MongoDB does much more disk IO for the same workload, and the increase in bytes written isn't explained by the database being larger. One of the reasons for the higher bytes-written rate is that the test takes longer to complete with MongoDB and a hard checkpoint is done every syncdelay seconds. InnoDB is better at delaying writeback for dirty pages.
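
For reference, the checkpoint interval is the mongod syncdelay setting and I believe the default of 60 seconds was used; a sketch with the 2.4-style flat config file, where journalCommitInterval is the value described earlier.

  # mongod.conf (2.4-style options)
  # seconds between msync of dirty database pages (the default)
  syncdelay = 60
  # milliseconds between journal group commits, as described earlier
  journalCommitInterval = 2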

The interesting result that I have seen in a few cases with both MongoDB 2.4.9 and 2.6.0 is that results are worse with powerOf2Sizes enabled. I have not taken the time to debug this problem. That is on my TODO list. At first I thought I had a few bad servers (flaky HW, etc) but I haven't seen the opposite for this workload (powerOf2Sizes enabled getting better insertion rates). The problem appears to be intermittent. Note that 2.6 has a fix for JIRA 12216 so that allocation of new files is not blocked while msync is in progress, so 2.6 should be somewhat faster than 2.4.
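
For reference, in 2.4 powerOf2Sizes is a per-collection setting that can be toggled with the collMod command (it became the default for new collections in 2.6). A sketch using pymongo, with hypothetical database and collection names:

  from pymongo import MongoClient

  client = MongoClient("localhost", 27017)
  db = client.test
  # enable power-of-two record allocation for an existing collection
  db.command({"collMod": "iibench", "usePowerOf2Sizes": True})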

config   sizeGB   bpd   MB/s    GBw    secs   irate  notes
innodb     16     171   28.9    124    4290   23308
tokumx      9.2    98   11.1     79    7127   14030
mongo24    43     461   46.4   1539   33230    3009  powerOf2Sizes=0
mongo24    44     472   30.0   1545   51634    1937  powerOf2Sizes=1
mongo26    42     450   47.9   1446   30199    3311  powerOf2Sizes=1

From 100M to 200M rows

TokuMX and fractal trees are starting to show a benefit relative to InnoDB.

config   sizeGB   bpd   MB/s    GBw    secs   irate  notes
innodb     31     166   24.3    238    9781   10224
tokumx     17      91   12.3     90    7328   13646
mongo24    72     386   37.4   1768   47329    2113  powerOf2Sizes=0
mongo24    79     424   24.6   1731   70325    1422  powerOf2Sizes=1
mongo26    76     408   39.3   1611   40992    2439  powerOf2Sizes=1

From 200M to 300M rows

More of the same as TokuMX gets better relative to others.

config   sizeGB   bpd   MB/s    GBw    secs   irate  notes
innodb     45     161   21.7    350   16136    6198
tokumx     25      89   12.0     84    7071   14142
mongo24    98     350   30.7   2008   65514    1526  powerOf2Sizes=0
mongo24   106     379   19.9   1917   96351    1038  powerOf2Sizes=1
mongo26   108     386   24.9   1933   77677    1287  powerOf2Sizes=1

From 300M to 400M rows

TokuMX begins to get slower. MongoDB gets a lot slower as the database is much larger than RAM. Problems unrelated to MongoDB cost me two of the long-running test servers (for 2.4.9 and 2.6.0 with powerOf2Sizes=1).

config   sizeGB  bpd   MB/s    GBw    secs   irate  notes
innodb     61    163   21.1    376   17825    5610
tokumx     31     83   12.1     86    7172   13941
mongo24   130    348   14.7   2313  157395     635  powerOf2Sizes=0

From 400M to 500M rows

MongoDB is getting significantly slower as the database is larger than RAM. More on this in the next section.

config   sizeGB  bpd   MB/s    GBw    secs   irate  notes
innodb     75    161   19.8    462   23337    4285
tokumx     39     83   11.1     84    7584   13186
mongo24   160    344   4.7    2105  441534     202  powerOf2Sizes=0

From 500M to 600M rows

I wasn't willing to wait for MongoDB to make it to 600M. I stopped the test when it reached ~540M inserts. The insert rate continues to drop dramatically. InnoDB does better because of the insert buffer. I assume that for MongoDB it would drop to ~50/second were I willing to wait. That would happen once there is a disk read for every secondary index per inserted document (there are 3) and the disk array can do ~150 disk reads/second when requests are submitted serially. InnoDB was slightly faster compared to the previous 100M inserts, but it will get slower in the long run.

I looked at iostat output and the MongoDB host was doing ~260 disk reads/second and ~375 disk writes/second at test end. For both reads and writes the average request size was ~8KB. The write stats include writes to the journal and database files. From PMP (poor man's profiler) stack traces I see a single thread busy walking b-tree indexes most of the time. Note that the write rate for MongoDB has fallen in line with the reduction in the insert rate. Database pages aren't getting dirty as fast as they used to because MongoDB is stalled on secondary index leaf node reads.
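
Those rates are roughly consistent with the write rate in the table below and with the ~50/second floor estimated above; a quick check in Python, using the figures from the previous paragraphs.

  writes_per_sec = 375      # from iostat at test end
  avg_request_kb = 8
  # implied write rate, close to the 2.9 MB/s in the table below
  print(writes_per_sec * avg_request_kb / 1000.0)   # -> 3.0

  # eventual insert-rate floor: one serial disk read per secondary index
  disk_reads_per_sec = 150
  secondary_indexes = 3
  print(disk_reads_per_sec / secondary_indexes)     # -> 50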

config   sizeGB  bpd   MB/s    GBw    secs   irate  notes
... at 600M docs/rows
innodb     89    159   20.1    392   19465    5137
tokumx     46     82   11.6     90    7741   12917
... at 540M documents
mongo24   168    340    2.9   1235       X     123  powerOf2Sizes=0

From 900M to 1B rows

Alas, InnoDB has begun to degrade faster. Even the insert buffer is eventually no match for a write-optimized algorithm.

config   sizeGB  bpd   MB/s    GBw    secs   irate  notes
innodb    148    158   15.4   1515   98413    1016
tokumx     74     79   10.9     92    8436   11853

From 1.4B to 1.5B rows

More of the same.

config   sizeGB  bpd   MB/s    GBw    secs   irate  notes
innodb    221    158   12.5   1745  140274     713
tokumx    104     74   11.0     96    8722   11464

From 1.5B to 2B rows

TokuMX is all alone.

config   sizeGB  bpd   MB/s    GBw    secs   irate  notes
tokumx    142     76   12.6     99    7868   12709

