Tuesday, December 23, 2014

Read-modify-write optimized

A log structured merge tree (LSM) has optimizations that reduce the amount of reads from and writes to storage for a write-heavy workload. These optimizations aren't unique to an LSM as the fractal tree algorithm used by TokuMX and TokuDB has similar benefits.

There are different kinds of write workloads, so I will describe this one. I used iibench for MongoDB with 3 secondary indexes. It loads a collection in primary key order. Random values are used for the attributes with secondary indexes. I repeated the test for different database cache sizes to show where an LSM is faster than a b-tree. While I wouldn't call the WiredTiger btree a write-optimized database algorithm, it is much more IO efficient than the mmapv1 engine (it uses less space on disk, and reads less from and writes less to storage for the same workload).
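The workload can be sketched in a few lines. This is a hypothetical generator, not the real iibench client; the field names follow the iibench convention but are assumptions here.

```python
import random

# Hypothetical sketch of the iibench workload (field names are
# assumptions, not the exact client code): documents are inserted in
# primary key order while the three secondary-index attributes get
# random values, so secondary index maintenance touches random pages.
def iibench_docs(start_id, count, max_random=2_000_000_000):
    for pk in range(start_id, start_id + count):
        yield {
            "_id": pk,                                       # PK order
            "cashregisterid": random.randrange(max_random),  # secondary index 1
            "customerid": random.randrange(max_random),      # secondary index 2
            "price": random.randrange(max_random),           # secondary index 3
        }

# one insert request = 1000 documents
batch = list(iibench_docs(0, 1000))
```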


This displays the document insert rate from iibench with 10 loader threads inserting 200M documents each. The collection goes from 0 to 2B documents during the benchmark. The test was repeated for the B-tree and LSM file structures in WiredTiger and the mmapv1 engine. The test used 3 cache sizes - 144G, 32G and 8G. Buffered IO was used for 144G and direct IO was used for 32G and 8G. The server has fast flash storage, 144G RAM and 40 hyperthread cores. The test used MongoDB 2.8.0rc3. At test end the database was 310G for the b-tree, 290G for the LSM and 815G for mmapv1. The WiredTiger tests used snappy compression. All tests used the SAFE write concern with 1000 documents inserted per request.

The insert rate for the LSM is insensitive to cache size, which is expected for this workload. The insert rate for the b-tree is sensitive to cache size: it goes from much better than the LSM when the secondary index data stays in cache to much worse when it does not. Results for mmapv1 explain why it is great that MongoDB 2.8 includes WiredTiger.

I prefer benchmark results that include enough detail to understand them. The results below are from a test with 2.8.0rc2 and are metrics from vmstat and iostat, normalized per 200k inserted documents. As each insert request was for 1000 documents, the results are also per 200 requests. They are best used to compare the WiredTiger btree (wt btree), WiredTiger LSM (wt lsm) and mmapv1 engine. The columns are:
  • vm.cs - context switches per 200 requests. Note the spike for the WiredTiger btree as the cache size is decreased. 
  • vm.us+sy - user & system CPU time (sum of the us and sy columns from vmstat, multiplied by 1M). The relative value is useful. Again there is a spike for the WiredTiger btree and none for the LSM, but the LSM uses much more CPU than the btree when the cache is large.
  • io.kBw - KB written per iostat per 200 requests. The WiredTiger LSM is always more efficient than the btree and its rate doesn't grow as the cache shrinks. The IO rate grows a lot for the WiredTiger btree as the cache shrinks. All of that is expected. The WiredTiger btree has a much lower rate than mmapv1 for the same cache size.
  • io.kBr - KB read per iostat per 200 requests. The WiredTiger btree rate grows fast as the cache shrinks while the LSM rate does not grow as the cache is reduced from 32G to 8G. The rate for mmapv1 at 144G is much worse than for WiredTiger.
  • io.r - read requests per iostat per 200 requests. The results match io.kBr.
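The normalization behind the table below is just a linear scaling of counter deltas taken from vmstat and iostat; a sketch with made-up numbers:

```python
# Scale a raw counter delta to a rate per 200k inserted documents
# (equivalently, per 200 requests of 1000 documents each). The input
# values here are made up for illustration.
def per_200k(counter_delta, inserted_docs):
    return counter_delta * 200_000 / inserted_docs

# e.g. 1,000,000 context switches measured while inserting 2B documents
rate = per_200k(1_000_000, 2_000_000_000)
print(rate)  # 100.0 context switches per 200k inserts
```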

per 200k inserts
vm.cs  vm.us+sy  io.kBw   io.kBr    io.r     engine, cache
  29     42445     417      0.49     0.02    wt btree, 144G
 306     67025     791    179.0     17.66    wt btree, 32G
1744    140560    2524   1889.0    215.0     wt btree, 8G

  76    152235     294      0.31     0.02    wt lsm, 144G
 134    148545     359     65.0      4.02    wt lsm, 32G
 137    146215     355     65.0      4.11    wt lsm, 8G 

 350     79500    2414      3.17     0.16    mmapv1, 144G 


For a b-tree the IO pattern for the primary key index is efficient -- it is append only to grow the index to the right. If the user provides the value for the PK then the database must confirm that value is unique by reading from the database. That will not require extra reads from storage here because the database pages with that value should remain in cache.

The IO patterns for the secondary indexes are not efficient as inserted docs use random values for the secondary index attributes. Secondary index maintenance for a b-tree is a read-modify-write operation.  The page that has or will have the key must be fetched before it is written. For WiredTiger, and most other popular b-tree implementations, this is a random storage read eventually followed by a random storage write. When compression is used then a page must be decompressed as part of the read and compressed as part of the write.
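A toy model may help show why the miss rate matters. This is my own illustration, not WiredTiger code; it simulates a page cache where each secondary index insert must fetch the leaf page before modifying it (the eviction policy here is arbitrary, not LRU).

```python
import random

# Toy model of b-tree secondary index maintenance: a cache miss costs a
# random storage read (read-modify-write), and a dirty page eventually
# costs a random storage write when it is evicted.
class BTreeIndex:
    def __init__(self, num_pages, cache_pages):
        self.num_pages = num_pages        # leaf pages in the index
        self.cache_pages = cache_pages    # pages that fit in cache
        self.cache = {}                   # page id -> dirty flag
        self.storage_reads = 0
        self.storage_writes = 0

    def insert(self, key):
        page = hash(key) % self.num_pages  # leaf that has/will have the key
        if page not in self.cache:
            self.storage_reads += 1        # fetch the page before modifying it
            if len(self.cache) >= self.cache_pages:
                _, dirty = self.cache.popitem()  # evict an arbitrary page
                if dirty:
                    self.storage_writes += 1     # write back on eviction
        self.cache[page] = True            # the page is now dirty

random.seed(1)
idx = BTreeIndex(num_pages=100_000, cache_pages=1_000)
for _ in range(50_000):
    idx.insert(random.randrange(10**9))    # random secondary index values
# With random keys and a cache much smaller than the index,
# almost every insert misses and triggers a storage read.
```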

InnoDB is able to avoid some of the random storage reads from secondary index maintenance via the change buffer. That avoids reads when the change buffer gets multiple changes to a secondary index page before doing the read and it can be very useful in production and on benchmarks.
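A sketch of the change buffer idea (my own simplified model, not InnoDB code): changes to secondary index pages that are not in cache are buffered, and one storage read later applies all of them at once.

```python
from collections import defaultdict

# Buffer changes per page; the storage read is deferred until the page
# is finally fetched, at which point all buffered changes are applied.
class ChangeBuffer:
    def __init__(self):
        self.pending = defaultdict(list)  # page id -> buffered changes
        self.storage_reads = 0

    def modify(self, page, change):
        self.pending[page].append(change)  # no storage read here

    def merge(self, page):
        self.storage_reads += 1            # one read covers many changes
        return self.pending.pop(page, [])

cb = ChangeBuffer()
for key in (17, 42, 99):          # three changes land on the same page
    cb.modify(page=3, change=key)
applied = cb.merge(3)             # 3 changes applied with 1 storage read
```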

A b-tree backed by a redo log defers writing back dirty pages until checkpoint or buffer pool pressure forces some pages to be written back. This reduces the IO overhead when one write to storage includes multiple changes to the page. This is less likely as the database:cache ratio increases. This occurs more frequently for workloads with skew (some keys are written much more frequently).

The size of the secondary indexes relative to the cache is critical. While the database might be huge relative to the cache (100X larger, for example), if the secondary indexes fit in cache then all storage reads and many storage writes for secondary index pages will be avoided, and that makes a huge difference in the amount of IO for this workload.

For now I will ignore a b-tree that uses log-structured writes. That might reduce the cost for writing back pages but it will require storage reads and writes for garbage collection to copy-out live pages from previously written extents. It will also suffer from write-amplification for both the copy-out done during GC and from writing back a dirty page when only a fraction of the page has changed.


The popular compaction algorithms for an LSM are size-tiered and leveled; the WiredTiger LSM uses size-tiered. I haven't checked the WiredTiger source, but blind writes should be sufficient for secondary index maintenance because there are no unique constraints. A storage read is not required before the write.
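The contrast with the b-tree case can be shown with a toy model (my own illustration, not the WiredTiger implementation): with no unique constraint, a blind write only appends to the memtable and never reads existing index pages first.

```python
# Toy model of LSM secondary index maintenance: inserts are blind
# writes into the in-memory memtable.
class LSMIndex:
    def __init__(self):
        self.memtable = []       # sorted/flushed later by the real engine
        self.storage_reads = 0   # stays 0: blind writes only

    def insert(self, key, value):
        self.memtable.append((key, value))  # flushed and compacted later

idx = LSMIndex()
for k in (42, 7, 42):            # duplicates are fine: non-unique index
    idx.insert(k, "doc-id")
```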

A read before the write might be required for primary index maintenance to confirm that the PK for the inserted doc is unique. However, the page with that data should almost always be in cache. An LSM can also use a bloom filter for workloads that don't load in PK order. It might be hard to define a bloom filter that works well across multiple indexes, in which case a column family is needed per index. I am not sure what WiredTiger uses; I did not find references to bloom filters in the WiredTiger engine for MongoDB.
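A minimal bloom filter sketch; the sizing and hash construction here are arbitrary choices of mine, not WiredTiger's. A False answer means the key is definitely absent, so the storage read for a PK uniqueness check can be skipped.

```python
import hashlib

class Bloom:
    def __init__(self, nbits=8192, nhashes=4):
        self.bits = bytearray(nbits // 8)
        self.nbits = nbits
        self.nhashes = nhashes

    def _positions(self, key):
        # derive nhashes bit positions from a salted cryptographic hash
        for i in range(self.nhashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def maybe_contains(self, key):
        # False means definitely absent; True means "probably present"
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = Bloom()
bf.add("pk:1001")
assert bf.maybe_contains("pk:1001")  # no false negatives
```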

An LSM also writes to storage during compaction and memtable flushes. It can be difficult to compare the overhead of these operations with that of a b-tree because most LSM requests are larger (many pages at a time) and are amortized over many user operations. A busy LSM still does random IO:

  • multi-threaded compaction means that each thread has one stream of writes and one or more streams of reads when the database is larger than cache. The request size for these is large (many pages at a time)
  • another stream of writes is in progress for the redo log (WAL, write ahead log)
  • user operations can require single-page reads from storage
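The compaction step above can be sketched as a multi-way merge of sorted runs (an illustration of size-tiered compaction in general, not WiredTiger's implementation):

```python
import heapq

# Several similar-sized sorted runs are merge-sorted into one larger
# run, reading and writing many pages at a time rather than one page
# per user operation.
def compact(runs):
    return list(heapq.merge(*runs))  # k-way merge of sorted runs

runs = [[1, 5, 9], [2, 5, 7], [0, 8]]
merged = compact(runs)
print(merged)  # [0, 1, 2, 5, 5, 7, 8, 9]
```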


  1. Thank you for publishing this, as it saves me from working on a similar test. Also, it is nice to see improvements in the LSM engine compared to RC0!

    I have to say I admire your bench setup; with a single run you're able to produce more results than I did with my setup since Dec. Makes me think I should put more effort into automation. Otoh my work is ad hoc by nature (every customer is different), which is the reason I'm often left with very manual, ad hoc solutions too.


    You report the db size but did you record the index sizes? (db.collection.stats()) From the results I would guess indexes have been way above 32G and maybe below 144G?

    Did you use LSM just for the secondary indexes or everything? (Another way to phrase the same question is: what was the exact command line string used?)

    In the LSM test, what did the data dir look like in the end? It seems the LSM engine creates lots of files, but I think in my tests with RC0 there might have been some bug that caused merges to never happen at all, or some other anomaly.

    As you note the _id index is a unique index. (Note that the way MongoDB storage engine api is implemented, _id is really just another secondary index, and there's an internal hidden primary key which is a 64-bit integer, increasing. The WT data file is then an index with data clustered around this hidden primary key.) Nevertheless, if _id is of type ObjectId(), it will also be increasing and therefore almost always in cache.

    I had some discussions with Tokutek people about that some time ago, but don't remember the exact outcome... I think we agreed that using upserts will allow a completely blind write but an insert would have to read the index to verify uniqueness. Since MongoDB 2.6 it seems to be possible to do bulk updates/upserts, but I have not tried those myself.

    WT documentation mentions bloom filters here: http://source.wiredtiger.com/2.3.1/lsm.html

    1. I did not record per index sizes.

      I assume it was LSM for everything when LSM was used via...
      bin/mongod --config /data/mysql/mongo.280rc4/mongo.conf --storageEngine wiredTiger --wiredTigerIndexConfigString="type=lsm"

      I have read discussion in a few places that the _id index is an extra secondary index. I hope that is fully disclosed and then fixed.

      Compaction was getting done for the LSM. The only unexpected space usage was from the WT b-tree where too-slow checkpoints led to too many journal files.

      Bloom filters are an interesting topic. They can create more problems than they solve so there is opportunity for code or consultants to be clever and figure out when to use them and when to avoid them.

    2. There should also be a --wiredTigerCollectionConfigString="type=lsm". However, I don't know if that would make performance better or worse than allowing the datafile to remain type=btree. I can think of arguments why either could be better.

      About _id index, it is a known limitation for now. Essentially in MMAP engine the record identifier is called DiskLoc, a 64bit pointer to the data file. This is unfortunately also the record identifier in the storage engine API. It will doubtless be fixed, but AFAIK not for 2.8. (I haven't pestered the devs too much on this point, but my understanding is that the explanation is: "there's only so much refactoring that could be done and still ship something by year end 2014", which is reasonable.) So this is why the WT hidden primary key is a 64bit integer.

    3. Do we really need the option to set type=lsm for both index and collection configs?
      Does anyone know the difference between doing that for collection vs index?

    4. Sorry, old thread but stumbled here today...

      Arguably and also according to some internal testing, there's no benefit for WT-in-MongoDB to have the collection itself (which from WT point of view is the clustered PK) in LSM format. This is because new documents get the auto-incremented hidden PK (DiskLoc), so they will cause the btree structure to be append only. After this the hidden PK is immutable. Hence, there's no benefit from using LSM and seems that there actually is overhead as the merging has to be done anyway.

    5. No benefit in what context? Even for the load, an append-only copy-on-write b-tree isn't as good as the LSM because the b-tree does read-before-write to fetch a leaf page before changing it. The LSM doesn't do read-before-write for non-unique secondary indexes.