Tuesday, March 11, 2014

Bigger data

Database benchmarks are hard but not useless. I use them to validate performance models and to find behavior that can be improved. It is easy to misunderstand results produced by others, and results are often misused for marketing (benchmarketing). It is also easy to report incorrect results and I have done that a few times for InnoDB. A benchmark report is much more useful when it includes an explanation. Only one of these is an explanation: "A is faster than B" versus "A is faster than B because it uses less random IO". It isn't easy to explain results. That takes time and expertise in the DBMS and in the rest of the hardware and software stack used during the test. The trust I have in a benchmark report is inversely related to the number of different products that have been tested.

This is an introduction for a sequence of blog posts that compare MongoDB, TokuMX and InnoDB. I am an expert with InnoDB, above average with TokuDB (and the Toku part of TokuMX) and just getting started with MongoDB. Moving from technical expertise to personal bias, I have had a lot of success with InnoDB. I also have a lot of respect for Tokutek, and my brother works there. While I am new to MongoDB, I think they are building a useful product. My focus is on DBMS internals and storage efficiency. This is a narrow focus but a critical one for deployments where the hardware cost is significant. My goal is to describe problems that can be fixed to make MongoDB better.

I used the insert benchmark to measure the insert rate, write efficiency and database size. The benchmark loads a collection/table in PK order. The table has 3 secondary indexes and the values for these attributes/columns are randomly generated. The random key order for secondary index maintenance can generate fragmentation and random IO. I use the benchmark to measure storage efficiency and am more interested in write efficiency and database size than in the average insert rate.
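The shape of that load can be sketched in a few lines. This is not the insert benchmark's actual code, just a minimal illustration (the `generate_rows` helper and its field names are made up for this post) of why the workload stresses secondary indexes: the primary key arrives in order while the three indexed attributes do not.

```python
import random

def generate_rows(num_rows, seed=0):
    """Generate an insert-benchmark-style load: the primary key is
    sequential while the three secondary attributes get random values,
    forcing secondary index maintenance in random key order."""
    rng = random.Random(seed)
    rows = []
    for pk in range(num_rows):
        rows.append({
            "pk": pk,                       # inserted in PK order
            "k1": rng.randint(0, 2**31),    # random order fragments the k1 index
            "k2": rng.randint(0, 2**31),
            "k3": rng.randint(0, 2**31),
        })
    return rows

rows = generate_rows(1000)
```

Inserts into the primary index are append-only and cheap; inserts into the three secondary indexes land at random positions, which is where fragmentation and random IO come from.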

  • Can the DBMS use the IOPS capacity of the storage devices? Many years ago InnoDB wasn't able to do more than 10,000 reads/second from disk. Now I am able to demo something between 150,000 and 200,000 reads/second assuming low latency storage and 24 CPU cores with HT enabled. It has come a long way. There are many problems that prevent a DBMS from using the IOPS capacity including mutex contention, CPU overhead/pathlength and IO serialization. TokuMX and InnoDB have one CPU overhead that is missing from MongoDB: they generate checksums when database pages are written and validate the checksums when pages are read from disk. That is an overhead that I hope will be added to MongoDB.
  • Can the DBMS reduce random IO for secondary indexes? InnoDB has the change buffer for this and TokuMX has fractal trees.
  • Can the DBMS reduce index fragmentation? TokuMX has fractal trees. InnoDB is an update-in-place b-tree and I won't claim it has anything special in this regard. A common problem for database benchmarks is load-and-go: queries are tested against an unfragmented database while the typical production DBMS has been fragmented. By fragmented I mean subject to inserts, updates and deletes in random key order. Write-optimized databases can also be fragmented, and there is an impact on query performance in that case, but that is for another blog post. The insert benchmark tries very hard to fragment the database by requiring secondary index maintenance in random key order.
  • Can the DBMS reduce the amount of disk space used for the database? Both TokuMX and InnoDB support compression.
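The page checksum mentioned above is simple in concept: compute a checksum over the page on write, store it with the page, and verify it on read so that disk corruption is caught before bad data reaches a query. This sketch uses CRC32 for illustration; it is not InnoDB's or TokuMX's actual checksum algorithm or page format, and `write_page`/`read_page` are names invented for this post.

```python
import struct
import zlib

PAGE_SIZE = 16 * 1024  # 16KB, InnoDB's default page size

def write_page(payload):
    """Append a CRC32 checksum to the page before it goes to disk."""
    csum = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + struct.pack("<I", csum)

def read_page(stored):
    """Validate the checksum on read; fail loudly on corruption."""
    payload, trailer = stored[:-4], stored[-4:]
    (csum,) = struct.unpack("<I", trailer)
    if zlib.crc32(payload) & 0xFFFFFFFF != csum:
        raise IOError("page checksum mismatch: possible disk corruption")
    return payload
```

The cost is one pass over every page on each write and read, which is the CPU overhead the bullet above refers to; the benefit is that a torn or bit-flipped page is detected instead of silently served.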

For write efficiency I report the total bytes written to the storage device during the test. Flash devices are rated by write endurance. A DBMS that writes less to storage allows you to either buy less expensive flash devices or get more years of use from a device. I used iostat to record the write rate, and this includes writes to the journal file (or redo log) and to the database files. Note that not all writes are created equal. TokuMX does large random writes. InnoDB and MongoDB do small random writes, and small random writes also trigger background writes when flash garbage collection (block cleaning) is done. The writes from flash GC are not captured by iostat and I do not report them.
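The accounting behind that metric is just counter arithmetic. Linux reports sectors written per device (iostat reads these counters from /proc/diskstats, in 512-byte units), so the total for a test run is the delta across the run times the sector size. The helpers below are a sketch with names invented for this post, not the script I used.

```python
SECTOR_SIZE = 512  # /proc/diskstats counts 512-byte sectors

def bytes_written(sectors_before, sectors_after):
    """Total device bytes written between two samples of the
    sectors-written counter for a device."""
    return (sectors_after - sectors_before) * SECTOR_SIZE

def write_amplification(device_bytes, payload_bytes):
    """Bytes written to the device per byte of useful payload.
    Lower is better for flash endurance."""
    return device_bytes / payload_bytes
```

With this, comparing engines is a matter of running the same load, sampling the counter before and after, and dividing device bytes by the logical bytes inserted. As noted above, writes generated by flash GC inside the device are invisible to these counters.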

Database size can be increased by fragmentation for update-in-place indexes and by old versions of rows for write-optimized databases. It can be decreased by compression. A smaller database means you buy less flash or disk to store it and to store backups.

The insert rate is the least interesting of the three metrics. I don't ignore the results but a lower insert rate can be tolerated if that provides better write efficiency or database compression.

This was a long introduction. Benchmark results will be in another post.


  1. Would be interesting for you to add LMDB to your test mix. I don't claim to be an expert at tuning InnoDB, so you should get better results than I did. http://symas.com/mdb/memcache/

  2. I appreciate the hard work that has gone into documenting the internals of LMDB and then explaining performance via benchmark results. The world could use a few more copy-on-write b-trees.