Friday, December 26, 2014

Storage overhead for attribute names in MongoDB

The flexibility of dynamic schema support in MongoDB comes with a few costs. One of the costs is that extra space is required to store the database. The problem is that attribute names are repeated in every document. If sizeof(attribute names) is significant relative to sizeof(attribute values) then a lot more space will be used for MongoDB with the mmapv1 engine compared to a DBMS that doesn't support dynamic schemas. Note that dynamic schemas are valuable because they support less schema, not no schema. Indexes, required attributes and assumptions about attribute values are all examples where some schema is used.
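To make the cost concrete, here is a rough back-of-the-envelope sketch in Python. The field names below are made up for illustration; the layout it assumes (one type byte, then the field name as a NUL-terminated string, then the value) follows the BSON element format.

```python
# Back-of-the-envelope estimate of the bytes BSON spends on field names.
# Each BSON element is stored as: 1 type byte + field name + NUL + value,
# so every document pays for every field name plus a terminator byte.
# The field names below are made up for illustration.

def name_bytes(field_names):
    """Bytes used by field names (plus NUL terminators) in one document."""
    return sum(len(name.encode("utf-8")) + 1 for name in field_names)

long_names = ["customer_email_address", "purchase_timestamp", "unit_price"]
short_names = ["cea", "pts", "up"]

extra = name_bytes(long_names) - name_bytes(short_names)
print("extra bytes per document:", extra)
print("extra GB over 1 billion documents: %.1f" % (extra * 1e9 / 2**30))
```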

How much extra space is used for attribute names? Does page level compression in the WiredTiger and TokuMX engines make this a non-issue? Long attribute names repeated in every document seem like something that should be easy to compress. And the alternative, using extra short attribute names, will cause pain for anyone trying to use the database, so page level compression might be the preferred solution.
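As a rough illustration of how compressible those repeated names are, the sketch below builds a fake "page" of many small documents and compresses it with Python's zlib. This is only an approximation: WiredTiger and TokuMX compress their own page formats, not JSON.

```python
# Illustrate how compressible repeated field names are: build a fake "page"
# of many small JSON-ish documents and compare zlib output sizes.
# This is only a sketch; real engines compress their own page formats.
import json
import zlib

def fake_page(field_names, ndocs=1000):
    docs = [{name: i for name in field_names} for i in range(ndocs)]
    return json.dumps(docs).encode("utf-8")

long_page = fake_page(["price", "customerid", "cashregisterid", "dateandtime"])
short_page = fake_page(["price", "cuid", "crid", "ts"])

for label, page in (("long", long_page), ("short", short_page)):
    print(label, "raw:", len(page), "zlib:", len(zlib.compress(page, 6)))
```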

While page level compression can remove the bloat from long attribute names in the compressed copies of pages, it doesn't solve the problem for the uncompressed copies, so there is still a cost from dynamic schemas: fewer uncompressed pages fit in cache because of the space overhead. Perhaps one day we will get an engine that encodes long attribute names efficiently even for uncompressed pages. Note that when page level compression is used, some database pages are in cache in both compressed and uncompressed forms. I assume that the WiredTiger and TokuMX block caches only cache uncompressed pages, but I am not an expert in either engine, and the OS filesystem cache has copies of compressed pages. I am not sure what happens when direct IO is used with WiredTiger because that prevents use of the OS filesystem cache.
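A rough way to see the cache impact, using made-up numbers (the cache size, document size, and per-document bloat below are all hypothetical):

```python
# Hypothetical arithmetic: if field names inflate the uncompressed document
# by some number of bytes, the block cache holds proportionally fewer documents.
cache_gb = 32                      # hypothetical cache size
doc_bytes_short = 60               # hypothetical document size with short names
name_bloat = 25                    # hypothetical extra bytes from long names

docs_short = cache_gb * 2**30 / doc_bytes_short
docs_long = cache_gb * 2**30 / (doc_bytes_short + name_bloat)
print("documents cached with short names: %.0fM" % (docs_short / 1e6))
print("documents cached with long names:  %.0fM" % (docs_long / 1e6))
print("reduction: %.0f%%" % (100 * (1 - docs_long / docs_short)))
```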

Results

I used iibench for MongoDB and loaded 2B documents for a few configurations: mmapv1 and WiredTiger without compression (wt-none), with snappy compression (wt-snappy) and with zlib compression (wt-zlib). To keep the documents small I edited the iibench test to use a 1-byte character field per document and disabled creation of secondary indexes. I used two versions of iibench: the first used the attribute names as-is (long) and the second used shorter versions (short) for a few of the attribute names. A sketch of the two document shapes follows the list:
  • long attribute names: price, customerid, cashregisterid, dateandtime
  • short attribute names: price, cuid, crid, ts
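The two document shapes were roughly as follows. This is a sketch based on the description above; the exact iibench field set, the name of the 1-byte character field, and the value ranges are assumptions here.

```python
# Sketch of the two document shapes, based on the description above.
# The exact iibench fields, the "charfield" name and the value ranges
# are assumptions; the character field is trimmed to 1 byte for this test.
import random
import time

def doc_long():
    return {"dateandtime": time.time(),
            "cashregisterid": random.randint(0, 1000),
            "customerid": random.randint(0, 100000),
            "price": round(random.random() * 500, 2),
            "charfield": "a"}

def doc_short():
    return {"ts": time.time(),
            "crid": random.randint(0, 1000),
            "cuid": random.randint(0, 100000),
            "price": round(random.random() * 500, 2),
            "charfield": "a"}

# Extra bytes per document spent on the longer names alone:
extra = (sum(map(len, ("customerid", "cashregisterid", "dateandtime")))
         - sum(map(len, ("cuid", "crid", "ts"))))
print(extra, "extra bytes per document ->",
      round(extra * 2e9 / 2**30, 1), "GB over 2B documents")
```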
There are two graphs below. The first graph shows the ratio of the database size with long attribute names to the size with short attribute names. A larger ratio means more overhead from long attribute names, and the mmapv1 engine has the worst ratio (1.53). The ratio for WiredTiger with zlib is close to one (1.01). WiredTiger with snappy wasn't as good as zlib, and I think part of the reason is that WiredTiger uses a larger page size when zlib compression is enabled. While that improves the compression ratio it can cause other performance problems, but I am waiting for documentation to catch up to the code to understand this. One problem from a larger page size is that more memory is wasted in the block cache when a page is read to get one hot document. Another is that flash devices are usually much faster at 4kb reads than at 64kb reads, while there isn't much difference between 4kb and 64kb reads on disk. For an IO-bound workload there is also more CPU overhead from decompressing a 64kb page versus a 4kb page.
The second graph shows the database size in GB for each of the engine configurations. Note that WiredTiger with zlib uses about 1/8th the space compared to mmapv1, and even the uncompressed WiredTiger engine does a lot better than mmapv1. I suspect that most of the benefit for wt-none versus mmapv1 comes from avoiding the overhead of power of 2 allocation in mmapv1. As a side note, I am not sure we will be able to turn off power of 2 allocation for mmapv1 in future releases.
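To see why power of 2 allocation can waste so much space for small documents, here is a small sketch. The record sizes are hypothetical, and mmapv1's real allocator has additional rules (minimum sizes, record headers) that are not modeled here.

```python
# Sketch of power-of-2 record allocation: each record is rounded up to the
# next power of two, so small documents can waste a large fraction of space.
# Record sizes below are hypothetical; mmapv1 has additional rules (minimum
# sizes, record headers) that are not modeled here.

def next_pow2(n):
    p = 1
    while p < n:
        p *= 2
    return p

for record_bytes in (70, 100, 130, 260):
    alloc = next_pow2(record_bytes)
    waste = 100.0 * (alloc - record_bytes) / alloc
    print("record %4d bytes -> allocate %4d bytes (%.0f%% wasted)"
          % (record_bytes, alloc, waste))
```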


7 comments:

  1. As you mention, compression can eliminate much of the overhead on disk, but field names are uncompressed in the cache and reduce its effectiveness.

    1. Forgot to re-share the earlier results from TokuMX. Are there others? http://www.tokutek.com/2013/08/tokumx-tip-create-any-field-name-you-want/

  2. Regarding your question about turning off powerOf2, it is still available in 2.8 using the "noPadding" argument to createCollection or via collMod command on an existing collection. It's a little buried in the release notes at the moment: http://docs.mongodb.org/v2.8/release-notes/2.8-general-improvements/#mmapv1-record-allocation-behavior-changed. powerOf2Sizes can definitely cost space in an insert-only experiment like this.
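    For reference, a minimal sketch of setting noPadding from pymongo, assuming a MongoDB 2.8 server with mmapv1 and that pymongo passes these options through; see the linked release notes for the authoritative syntax.

    ```python
    # Sketch: disable power-of-2 allocation (noPadding) on an mmapv1 collection.
    # Assumes MongoDB 2.8 with mmapv1; option names follow the linked release notes.
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    db = client["test"]

    # At creation time: extra keyword args are passed through to createCollection.
    db.create_collection("iibench", noPadding=True)

    # Or on an existing collection via the collMod command.
    db.command("collMod", "iibench", noPadding=True)
    ```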

  3. Favorite topic of mine :-) Just wanted to jot down here some comments from our fb thread:

    - I think you present here a worst case: values are 1 character only, an insert-only workload has no fragmentation, and I believe usePowerOf2Sizes (which you wouldn't really want to use in an insert-only workload) also amplifies the difference.

    - Nevertheless, it is useful to know what the worst case can be. (Of course, you could have ridiculously long key names, like those commonly seen in Java or .Net, to show an even worse case.)

    - The similar test by your brother, which you linked to in the comments, showed only a 10% difference with the mmap engine. I believe this is closer to real world metrics, but I have not done any measurements of my own. (But it seems Tim's test also has no fragmentation, so a real world situation could be even below 10%.)

    - A lot of "experts" routinely advise every MongoDB user to use short key names. I believe that as a general piece of advice this is misguided. The loss in readability is a bigger problem than the overhead from longer key names. Such advice should always be accompanied by actual numbers proving the overhead that is avoided. As this blog shows, the real fix is to fix the data storage to avoid the overhead. (...where optimizing RAM consumption remains on the TODO list for MongoDB.)

    - The most interesting part of your results is the observation that snappy compression in WT does very little to fix this problem. It's good to note the commentary from your previous blog post that WT is configured to use different page sizes with snappy vs zlib (as the assumed use cases are different). It would be interesting (but not that important, really) to know if this difference is due to the compression algorithm or just the page size.

    1. We need more documentation on the WT engines to make it easier to understand things like the difference between page sizes used for snappy and zlib. Maybe someone should write a book?

  4. I find it interesting that MongoDB does not have something similar to InnoDB's unzip LRU. Is this on purpose or is it just not implemented? For IO-bound workloads it would be more efficient to keep the compressed copies and not the uncompressed ones.

    1. With buffered IO and WiredTiger the OS filesystem cache is the LRU for compressed pages. Not as fast as accessing them from the mongod address space, but avoids a lot of complexity.

