Tuesday, December 30, 2014

malloc and MongoDB performance

I used iibench to understand the impact of the malloc implementation on MongoDB performance, comparing malloc from glibc 2.17, jemalloc 3.6.0 and tcmalloc from gperftools 2.2 for an insert-only workload with 10 concurrent clients. The test server has 40 cores with hyperthreading enabled. Alas, I lost a few configuration details, but I think I used MongoDB 2.8.0rc4 and the WiredTiger b-tree.
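One way to run such a comparison (not necessarily what I did; I lost those details) is to preload the allocator's shared library when starting mongod. A minimal sketch in Python, with the library paths and mongod options as assumptions:

# Minimal sketch: start mongod with a preloaded allocator via LD_PRELOAD.
# The library paths and mongod options are assumptions, not the exact setup.
import os
import subprocess

ALLOCATORS = {
    "glibc": None,  # default malloc from glibc, nothing to preload
    "jemalloc": "/usr/lib64/libjemalloc.so.1",
    "tcmalloc": "/usr/lib64/libtcmalloc.so.4",
}

def start_mongod(allocator, dbpath="/data/mongo"):
    env = dict(os.environ)
    lib = ALLOCATORS[allocator]
    if lib:
        env["LD_PRELOAD"] = lib
    # other mongod options omitted
    return subprocess.Popen(
        ["mongod", "--dbpath", dbpath, "--storageEngine", "wiredTiger"],
        env=env)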

The summary is that tcmalloc and jemalloc provide a similar benefit and are much better than glibc malloc. I made no attempt to tune tcmalloc or jemalloc. I ran iibench 3 times for each configuration and chose the median result. There are four metrics for each configuration:

  1. Test duration in seconds
  2. Address space size per the VSZ column in ps
  3. Normalized context switch rate - the number of context switches per N documents inserted. I will leave N undefined for now, so the absolute value isn't interesting, but the values can be compared between malloc implementations.
  4. Normalized CPU utilization - CPU utilization per N documents inserted. See #3.

Results

          seconds  VSZ(GB)   context-switch(relative)   CPU(relative)
tcmalloc   1943     65.1            6.6                   26510
jemalloc   1877     64.5            7.1                   27165
glibc      2251     72.3           23.0                   32120

For this test glibc uses 1.19X more time, 1.12X more address space, 3.23X more context switches and 1.18X more CPU than jemalloc, and tcmalloc is similar to jemalloc. As tcmalloc is bundled with the MongoDB source distribution, I assume it is used in the binaries they distribute. That appears to be a good thing.
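The ratios come straight from the table. A quick way to reproduce them (the numbers quoted above are rounded):

# Compute glibc vs jemalloc ratios from the results table above.
results = {
    "tcmalloc": {"seconds": 1943, "vsz_gb": 65.1, "cs": 6.6, "cpu": 26510},
    "jemalloc": {"seconds": 1877, "vsz_gb": 64.5, "cs": 7.1, "cpu": 27165},
    "glibc":    {"seconds": 2251, "vsz_gb": 72.3, "cs": 23.0, "cpu": 32120},
}
g, j = results["glibc"], results["jemalloc"]
for metric in ("seconds", "vsz_gb", "cs", "cpu"):
    print("%s: %.2fX" % (metric, g[metric] / j[metric]))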

Friday, December 26, 2014

Storage overhead for attribute names in MongoDB

The flexibility of dynamic schema support in MongoDB comes with a few costs. One of the costs is that extra space is required to store the database. The problem is that attribute names are repeated in every document. If sizeof(attribute names) is significant relative to sizeof(attribute values) then a lot more space will be used for MongoDB with the mmapv1 engine compared to a DBMS that doesn't support dynamic schemas. Note that dynamic schemas are valuable because they support less schema, not no schema. Indexes, required attributes and assumptions about attribute values are all examples where some schema is used.

How much extra space is used for attribute names? Does page level compression in the WiredTiger and TokuMX engines make this a non-issue? Long attribute names repeated in every document seem like something that is easy to compress, and the alternative of using extra-short attribute names will cause pain for anyone trying to use the database, so page level compression might be the preferred solution.

While page level compression can remove the bloat from long attribute names in the compressed copies of pages, it doesn't solve the problem for the uncompressed copies. So there is still a cost from dynamic schemas: fewer uncompressed pages fit in cache because of the space overhead. Perhaps one day we will get an engine that encodes long attribute names efficiently even for uncompressed pages. Note that when page level compression is used, some database pages are in cache in both compressed and uncompressed forms. I assume that the WiredTiger and TokuMX block caches only cache uncompressed pages, but I am not an expert in either engine, and the OS filesystem cache has copies of compressed pages. I am not sure what happens when direct IO is used with WiredTiger because that prevents use of the OS filesystem cache.

Results

I used iibench for MongoDB and loaded 2B documents for a few configurations: mmapv1 and WiredTiger without compression (wt-none), with snappy compression (wt-snappy) and with zlib compression (wt-zlib). To keep the documents small I edited the iibench test to use a 1-byte character field per document and disabled creation of secondary indexes. I used two versions of iibench. The first used the attribute names as-is (long) and the second used shorter versions (short) for a few of the attribute names:
  • long attribute names: price, customerid, cashregisterid, dateandtime
  • short attribute names: price, cuid, crid, ts
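To make the overhead concrete, the per-document cost can be estimated from the BSON encoding. This is a sketch, not the exact iibench document (the _id and 1-byte character field shown here are simplifications), but the attribute names match the lists above:

# Estimate per-document overhead from long attribute names by comparing
# BSON sizes. Uses the bson package from the pymongo distribution.
import bson
from datetime import datetime

now = datetime.utcnow()
long_doc  = {"_id": 1, "price": 2.5, "customerid": 12345,
             "cashregisterid": 678, "dateandtime": now, "c": "x"}
short_doc = {"_id": 1, "price": 2.5, "cuid": 12345,
             "crid": 678, "ts": now, "c": "x"}

long_len = len(bson.BSON.encode(long_doc))
short_len = len(bson.BSON.encode(short_doc))
# The long names add ~25 bytes of key text per document here. Across 2B
# documents that is on the order of 50GB before compression.
print(long_len, short_len, long_len - short_len)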

There are two graphs below. The first graph has the ratio of the database size with long attribute names to the size with short attribute names. A larger ratio means there is more overhead from long attribute names; the mmapv1 engine has the worst ratio (1.53) and the ratio for WiredTiger with zlib is close to one (1.01). WiredTiger with snappy wasn't as good as zlib, and I think part of the reason is that WiredTiger uses a larger page size when zlib compression is enabled. While that improves the compression ratio it can cause other performance problems, but I am waiting for documentation to catch up to the code to understand this. One problem from a larger page size is that more memory is wasted in the block cache when a page is read to get one hot document. Another is that flash devices are usually much faster at 4kb reads than at 64kb reads, while there isn't much difference with disk for 4kb versus 64kb reads. For an IO-bound workload there is also more CPU overhead from decompressing a 64kb page versus a 4kb page.

The second graph shows the database size in GB for each of the engine configurations. Note that WiredTiger with zlib uses about 1/8th the space compared to mmapv1, and even the uncompressed WiredTiger engine does a lot better than mmapv1. I suspect that most of the benefit for wt-none versus mmapv1 is the overhead from using power of 2 allocation in mmapv1. As a side note, I am not sure we will be able to turn off power of 2 allocation for mmapv1 in future releases.
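For reference, this is my understanding of what power of 2 allocation implies for small documents (a sketch that ignores record headers; the 32-byte minimum allocation is an assumption):

# Power-of-2 record allocation rounds each record up to the next power of
# two, so a document just past a power-of-2 boundary wastes almost half of
# its record slot.
def power_of_2_size(record_bytes, min_alloc=32):
    size = min_alloc
    while size < record_bytes:
        size *= 2
    return size

for doc_bytes in (90, 130, 260):
    alloc = power_of_2_size(doc_bytes)
    waste = 100.0 * (alloc - doc_bytes) / alloc
    print("%d bytes -> %d byte slot, %.0f%% wasted" % (doc_bytes, alloc, waste))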


Tuesday, December 23, 2014

Read-modify-write optimized

A log structured merge tree (LSM) has optimizations that reduce the amount of reads from and writes to storage for a write-heavy workload. These optimizations aren't unique to an LSM as the fractal tree algorithm used by TokuMX and TokuDB has similar benefits.

There are different kinds of writes so I will explain this workload. I used iibench for MongoDB with 3 secondary indexes. This loads a collection in primary key order. Random values are used for attributes with secondary indexes. I repeated the test for different database cache sizes to show where an LSM is faster than a b-tree. While I wouldn't call the WiredTiger btree a write-optimized database algorithm, it is much more IO efficient than the mmapv1 engine (uses less space on disk, reads less from and writes less to storage for the same workload).
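For context, here is a stripped-down sketch of an iibench-style loader using a current pymongo API. It is not the real iibench for MongoDB (the compound index definitions and field details are stand-ins), but it shows the shape of the workload: documents inserted in primary key order, 1000 per request, with random values for the attributes that have secondary indexes.

# Stripped-down sketch of an iibench-style loader: PK-ordered inserts,
# 1000 documents per request, random values for the secondary index fields.
import random
import pymongo

client = pymongo.MongoClient()
coll = client.test.purchases_index

# 3 secondary indexes (stand-ins for the iibench definitions)
coll.create_index([("price", pymongo.ASCENDING), ("customerid", pymongo.ASCENDING)])
coll.create_index([("cashregisterid", pymongo.ASCENDING), ("price", pymongo.ASCENDING)])
coll.create_index([("price", pymongo.ASCENDING), ("dateandtime", pymongo.ASCENDING)])

def load(thread_id, docs_per_thread, batch=1000):
    pk = thread_id * docs_per_thread
    for _ in range(docs_per_thread // batch):
        docs = []
        for i in range(batch):
            docs.append({
                "_id": pk + i,                     # primary key order
                "price": random.uniform(0, 500),   # random secondary values
                "customerid": random.randint(0, 100000),
                "cashregisterid": random.randint(0, 1000),
                "dateandtime": random.randint(0, 2**31),
            })
        coll.insert_many(docs)                     # one request per 1000 docs
        pk += batch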

Results

This displays the document insert rate from iibench with 10 loader threads inserting 200M documents each. The collection goes from 0 to 2B documents during the benchmark. The test was repeated for the B-tree and LSM file structures in WiredTiger and the mmapv1 engine. The test used 3 cache sizes - 144G, 32G and 8G. Buffered IO was used for 144G and direct IO was used for 32G and 8G. The server has fast flash storage, 144G RAM and 40 hyperthread cores. The test used MongoDB 2.8.0rc3. At test end the database was 310G for the b-tree, 290G for the LSM and 815G for mmapv1. The WiredTiger tests used snappy compression. All tests used the SAFE write concern with 1000 documents inserted per request.

The insert rate for the LSM is insensitive to cache size, which is expected for this workload. The insert rate for the b-tree is sensitive to cache size and goes from being much better than the LSM when the secondary index data stays in cache to much worse than the LSM when it does not. The results for mmapv1 explain why it is great that MongoDB 2.8 includes WiredTiger.

I prefer benchmark results that include enough detail to understand them. The results below are from a test with 2.8.0rc2 and are metrics from vmstat and iostat normalized per 200k inserted documents (a sketch of how such a normalization can be computed follows the table). As each insert request was for 1000 documents, the results are also per 200 requests. They are best used to compare the WiredTiger btree (wt btree), WiredTiger LSM (wt lsm) and mmapv1 engines. The columns are:
  • vm.cs - context switches per 200 requests. Note the spike for the WiredTiger btree as the cache size is decreased.
  • vm.us+sy - user and system CPU time (sum of the us and sy columns from vmstat multiplied by 1M). The relative value is what matters. Again there is a spike for the WiredTiger btree and none for the LSM, but the LSM uses much more CPU when the cache isn't small.
  • io.kBw - KB written per iostat per 200 requests. The WiredTiger LSM is always more efficient than the btree and its rate doesn't grow as the cache shrinks, while the rate for the WiredTiger btree grows a lot as the cache shrinks. All of that is expected. The WiredTiger btree has a much lower rate than mmapv1 for the same cache size.
  • io.kBr - KB read per iostat per 200 requests. The WiredTiger btree rate grows fast as the cache shrinks while the LSM rate does not grow as the cache is reduced from 32G to 8G. The rate for mmapv1 at 144G is much worse than for WiredTiger.
  • io.r - read requests per iostat per 200 requests. The results match io.kBr.

per 200k inserts
vm.cs  vm.us+sy  io.kBw   io.kBr    io.r     engine, cache
  29     42445     417      0.49     0.02    wt btree, 144G
 306     67025     791    179.0     17.66    wt btree, 32G
1744    140560    2524   1889.0    215.0     wt btree, 8G

  76    152235     294      0.31     0.02    wt lsm, 144G
 134    148545     359     65.0      4.02    wt lsm, 32G
 137    146215     355     65.0      4.11    wt lsm, 8G 

 350     79500    2414      3.17     0.16    mmapv1, 144G 
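A sketch of how such a normalization can be computed, assuming vmstat samples and cumulative IO counters have already been parsed into dicts (the field names here are my own, not from the scripts used for this post):

# Normalize vmstat/iostat metrics per 200k inserted documents.
# vmstat_samples: per-interval dicts with 'cs', 'us', 'sy' (vmstat reports
# per-second averages for each interval of interval_secs seconds).
# io_start/io_end: cumulative counters for the device: KB written, KB read
# and read requests.
def normalize(vmstat_samples, interval_secs, io_start, io_end,
              docs_inserted, per_docs=200000):
    scale = float(per_docs) / docs_inserted
    cs_total = sum(s["cs"] * interval_secs for s in vmstat_samples)
    # only the relative value matters for CPU, mirroring (us + sy) * 1M
    cpu_total = sum(s["us"] + s["sy"] for s in vmstat_samples) * 1e6
    return {
        "vm.cs":    cs_total * scale,
        "vm.us+sy": cpu_total * scale,
        "io.kBw":   (io_end["kb_written"] - io_start["kb_written"]) * scale,
        "io.kBr":   (io_end["kb_read"] - io_start["kb_read"]) * scale,
        "io.r":     (io_end["reads"] - io_start["reads"]) * scale,
    }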

B-tree

For a b-tree the IO pattern for the primary key index is efficient -- it is append only to grow the index to the right. If the user provides the value for the PK then the database must confirm that value is unique by reading from the database. That will not require extra reads from storage here because the database pages with that value should remain in cache.

The IO patterns for the secondary indexes are not efficient as inserted docs use random values for the secondary index attributes. Secondary index maintenance for a b-tree is a read-modify-write operation.  The page that has or will have the key must be fetched before it is written. For WiredTiger, and most other popular b-tree implementations, this is a random storage read eventually followed by a random storage write. When compression is used then a page must be decompressed as part of the read and compressed as part of the write.

InnoDB is able to avoid some of the random storage reads from secondary index maintenance via the change buffer. That avoids reads when the change buffer gets multiple changes to a secondary index page before doing the read and it can be very useful in production and on benchmarks.

A b-tree backed by a redo log defers writing back dirty pages until a checkpoint or buffer pool pressure forces some pages to be written back. This reduces the IO overhead when one write to storage includes multiple changes to the page. Getting multiple changes per write is less likely as the database:cache ratio increases and more likely for workloads with skew (some keys are written much more frequently).

The size of the secondary indexes relative to the cache is critical. While the database might be huge relative to the cache (100X larger, for example), if the secondary indexes fit in cache then all storage reads and many storage writes for the secondary index pages will be avoided, and that makes a huge difference in the amount of IO for this workload.

For now I will ignore a b-tree that uses log-structured writes. That might reduce the cost of writing back pages, but it will require storage reads and writes for garbage collection to copy out live pages from previously written extents. It will also suffer from write-amplification, both from the copy-out done during GC and from writing back a dirty page when only a fraction of the page has changed.

LSM

The popular compaction algorithms for an LSM are size-tiered and leveled. The WiredTiger LSM uses size-tiered. I haven't checked the WiredTiger source, but blind writes should be sufficient for secondary index maintenance because there are no unique constraints on secondary indexes, so a storage read is not required before the write.
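A simplified sketch of the difference (my pseudocode-style Python, not WiredTiger internals): b-tree secondary index maintenance must have the leaf page in cache before modifying it, which may cost a random storage read, while the LSM appends to the memtable without reading anything.

# usage: cache, storage, stats = {}, {}, {"reads": 0}; memtable = []
def btree_secondary_insert(key, doc_id, cache, storage, stats):
    page_id = hash(key) % 1000            # stand-in for descending the tree
    page = cache.get(page_id)
    if page is None:
        page = storage.setdefault(page_id, [])   # random storage read
        cache[page_id] = page
        stats["reads"] += 1
    page.append((key, doc_id))            # modify; the dirty page is written
                                          # back later (checkpoint/eviction)

def lsm_secondary_insert(key, doc_id, memtable):
    memtable.append((key, doc_id))        # blind write; merged by compaction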

A read before the write might be required for primary index maintenance to confirm that the PK for the inserted doc is unique, but the page with that data should almost always be in cache for this workload. An LSM can also use a bloom filter for workloads that don't load in PK order. It might be hard to define a bloom filter that works well for multiple indexes, in which case a column family per index is needed. I am not sure what WiredTiger does here; I did not find references to bloom filters in the WiredTiger engine for MongoDB.

An LSM also writes to storage during compaction and memtable flushes. It can be difficult to compare the overhead of these operations with a b-tree because most of the requests from the LSM are larger (many pages at a time) and are amortized over many user operations. A busy LSM still does random IO:

  • multi-threaded compaction means that each thread has one stream of writes and one or more streams of reads when the database is larger than cache. The request size for these is large (many pages at a time)
  • another stream of writes is in progress for the redo log (WAL, write ahead log)
  • user operations can require single-page reads from storage