A log structured merge tree (LSM) has optimizations that reduce the amount of reads from and writes to storage for a write-heavy workload. These optimizations aren't unique to an LSM, as the fractal tree algorithm used by TokuMX and TokuDB has similar benefits.
There are different kinds of writes, so I will explain this workload. I used iibench for MongoDB with 3 secondary indexes. It loads a collection in primary key order and uses random values for the attributes with secondary indexes. I repeated the test for different database cache sizes to show where an LSM is faster than a b-tree. While I wouldn't call the WiredTiger btree a write-optimized database algorithm, it is much more IO efficient than the mmapv1 engine (it uses less space on disk, and reads less from and writes less to storage for the same workload).
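To make the workload concrete, here is a minimal sketch of an iibench-style loader document, with hypothetical field names (a monotonically increasing _id plus three randomly valued attributes, each covered by a secondary index); the real benchmark has more fields and options:

```python
import random
from pymongo import ASCENDING, MongoClient

client = MongoClient()
coll = client.iibench.purchases  # hypothetical database and collection names

# Three secondary indexes on randomly valued attributes.
coll.create_index([("cash_register_id", ASCENDING)])
coll.create_index([("customer_id", ASCENDING)])
coll.create_index([("price", ASCENDING), ("customer_id", ASCENDING)])

def make_doc(seq):
    # _id grows monotonically so the PK index is appended to on the right,
    # while the secondary index attributes are random so their maintenance
    # touches random leaf pages.
    return {
        "_id": seq,
        "cash_register_id": random.randint(0, 1000),
        "customer_id": random.randint(0, 100000),
        "price": random.uniform(0.0, 500.0),
        "filler": "x" * 100,
    }
```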
Results
This displays the document insert rate from iibench with 10 loader threads inserting 200M documents each. The collection goes from 0 to 2B documents during the benchmark. The test was repeated for the b-tree and LSM file structures in WiredTiger and the mmapv1 engine. The test used 3 cache sizes - 144G, 32G and 8G. Buffered IO was used for 144G and direct IO was used for 32G and 8G. The server has fast flash storage, 144G RAM and 40 hyperthread cores. The test used MongoDB 2.8.0rc3. At test end the database was 310G for the b-tree, 290G for the LSM and 815G for mmapv1. The WiredTiger tests used snappy compression. All tests used the SAFE write concern with 1000 documents inserted per request.
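A sketch of the load loop under those settings, reusing the hypothetical make_doc helper above: 1000 documents per insert request, acknowledged (SAFE, w=1) writes, and one such loop per loader thread:

```python
from pymongo import WriteConcern

BATCH = 1000                          # documents per insert request
DOCS_PER_THREAD = 200 * 1000 * 1000   # 200M documents per loader thread

def load(thread_id, coll):
    # SAFE write concern: wait for the server to acknowledge each request.
    acked = coll.with_options(write_concern=WriteConcern(w=1))
    base = thread_id * DOCS_PER_THREAD
    for start in range(0, DOCS_PER_THREAD, BATCH):
        docs = [make_doc(base + start + i) for i in range(BATCH)]
        acked.insert_many(docs, ordered=False)
```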
The insert rate for the LSM is insensitive to cache size, which is expected for this workload. The insert rate for the b-tree is sensitive to cache size and goes from being much better than the LSM when the secondary index data stays in cache to much worse than the LSM when it does not. Results for mmapv1 explain why it is great that MongoDB 2.8 includes WiredTiger.
I prefer benchmark results that include enough detail to understand them. The results below are from a test with 2.8.0rc2 and are metrics from vmstat and iostat normalized per 200k inserted documents. Because each insert request inserted 1000 documents, the results are also per 200 requests (a sketch of the normalization follows the table). They are best used to compare the WiredTiger btree (wt btree), WiredTiger LSM (wt lsm) and mmapv1 engine. The columns are:
- vm.cs - context switches per 200 requests. Note the spike for the WiredTiger btree as the cache size is decreased.
- vm.us+sy - user & system CPU time (sum of the us and sy columns from vmstat multiplied by 1M). The relative value is useful. Again there is a spike for the WiredTiger btree and none for the LSM, but the LSM uses much more CPU when the cache isn't small.
- io.kBw - KB written per iostat per 200 requests. The WiredTiger LSM is always more efficient than the btree and the rate doesn't grow as the cache shrinks. The IO rate grows a lot for the WiredTiger btree as the cache shrinks. All of that is expected. The WiredTiger btree has a much lower rate than mmapv1 for the same cache size.
- io.kBr - KB read per iostat per 200 requests. The WiredTiger btree rate grows fast as the cache shrinks while the LSM rate does not grow as the cache is reduced from 32G to 8G. The rate for mmapv1 at 144G is much worse than for WiredTiger.
- io.r - read requests per iostat per 200 requests. The results match io.kBr.
per 200k inserts
vm.cs  vm.us+sy  io.kBw  io.kBr    io.r  engine, cache
   29     42445     417    0.49    0.02  wt btree, 144G
  306     67025     791   179.0   17.66  wt btree, 32G
 1744    140560    2524  1889.0   215.0  wt btree, 8G
   76    152235     294    0.31    0.02  wt lsm, 144G
  134    148545     359    65.0    4.02  wt lsm, 32G
  137    146215     355    65.0    4.11  wt lsm, 8G
  350     79500    2414    3.17    0.16  mmapv1, 144G
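A minimal sketch of that normalization, with hypothetical counter names: it scales raw vmstat and iostat totals collected over the measurement interval to a per-200k-documents (per-200-requests) rate.

```python
def normalize(raw_totals, docs_inserted, docs_per_unit=200_000):
    """Scale raw vmstat/iostat totals to a per-200k-inserts rate.

    raw_totals: dict of counters summed over the measurement interval,
                e.g. {"vm.cs": ..., "io.kBw": ..., "io.kBr": ...}
    docs_inserted: documents inserted during the same interval.
    """
    units = docs_inserted / docs_per_unit
    return {name: value / units for name, value in raw_totals.items()}

# Hypothetical example: 1.2M context switches while inserting 6M documents
# -> 40,000 context switches per 200k inserts (per 200 requests).
print(normalize({"vm.cs": 1_200_000}, 6_000_000))
```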
B-tree
For a b-tree the IO pattern for the primary key index is efficient -- it is append only to grow the index to the right. If the user provides the value for the PK then the database must confirm that value is unique by reading from the database. That will not require extra reads from storage here because the database pages with that value should remain in cache.
The IO patterns for the secondary indexes are not efficient as inserted docs use random values for the secondary index attributes. Secondary index maintenance for a b-tree is a read-modify-write operation. The page that has or will have the key must be fetched before it is written. For WiredTiger, and most other popular b-tree implementations, this is a random storage read eventually followed by a random storage write. When compression is used then a page must be decompressed as part of the read and compressed as part of the write.
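Here is a toy model of that read-modify-write pattern, not WiredTiger internals; it only tracks how a cache miss turns a secondary index insert into a random read (with decompression) and, on eviction, a random write (with compression):

```python
import random
import zlib
from collections import OrderedDict

PAGES = 1000                    # leaf pages in the toy secondary index
CACHE_PAGES = 100               # cache holds 10% of the leaf pages
storage = {p: zlib.compress(b"") for p in range(PAGES)}  # compressed pages
cache = OrderedDict()           # page id -> decompressed page bytes
reads = writes = 0              # random storage IO counters

def leaf_page_for(key):
    return hash(key) % PAGES    # stand-in for a b-tree descent

def insert_secondary_key(key, record_id):
    global reads, writes
    page_id = leaf_page_for(key)
    if page_id not in cache:
        # Cache miss: random storage read + decompress before modifying.
        cache[page_id] = bytearray(zlib.decompress(storage[page_id]))
        reads += 1
        if len(cache) > CACHE_PAGES:
            evict_id, evict_page = cache.popitem(last=False)
            # Write back the evicted (dirty) page: compress + random write.
            storage[evict_id] = zlib.compress(bytes(evict_page))
            writes += 1
    # Read-modify-write: the page must be in memory before the key is added.
    cache[page_id] += b"%d:%d;" % (hash(key), record_id)

for i in range(100_000):
    insert_secondary_key(random.randint(0, 10**9), i)
print(reads, writes)  # most inserts cause a random read and, later, a write
```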
InnoDB is able to avoid some of the random storage reads from secondary index maintenance via the change buffer. That avoids reads when the change buffer gets multiple changes to a secondary index page before doing the read, and it can be very useful in production and on benchmarks.
A b-tree backed by a redo log defers writing back dirty pages until a checkpoint or buffer pool pressure forces some pages to be written back. This reduces the IO overhead when one write to storage includes multiple changes to the page. Combining changes like that is less likely as the database:cache ratio increases and more frequent for workloads with skew (some keys are written much more frequently).
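A rough way to put numbers on that, with hypothetical values: if a dirty page collects k key changes before it is written back, the page write-back cost per change is 1/k of a page write.

```python
page_kb = 32                   # hypothetical compressed leaf page size in KB
changes_per_writeback = 4      # hypothetical; larger with more cache or skew

print(page_kb / changes_per_writeback)  # 8.0 KB of write-back per key change
```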
The size of the secondary indexes relative to cache is critical. While the database might be huge relative to cache (100X larger for example), if the secondary indexes fit in cache then all storage reads and many storage writes for the secondary index pages will be avoided, and that makes a huge difference in the amount of IO for this workload.
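A back-of-envelope sketch of where that threshold falls for this benchmark, using a hypothetical per-entry size (the real size depends on key width, page fill factor and compression):

```python
docs = 2_000_000_000       # documents at the end of the benchmark
indexes = 3                # secondary indexes in the iibench schema
bytes_per_entry = 20       # hypothetical: key + record id + page overhead

secondary_index_gib = docs * indexes * bytes_per_entry / 2**30
print(round(secondary_index_gib))  # ~112 GiB: close to fitting in the 144G
                                   # cache, far too big for 32G or 8G
```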
For now I will ignore a b-tree that uses log-structured writes. That might reduce the cost of writing back pages, but it will require storage reads and writes for garbage collection to copy out live pages from previously written extents. It will also suffer from write-amplification, both from the copy-out done during GC and from writing back a dirty page when only a fraction of the page has changed.
LSM
The popular compaction algorithms for an LSM are size-tiered and leveled. The WiredTiger LSM uses size-tiered. Without looking at the WiredTiger source, blind writes should be sufficient for secondary index maintenance because there are no unique constraints. A storage read is not required before the write.
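A sketch of a blind write into an LSM memtable, with hypothetical names; the point is that the insert only touches the in-memory buffer and the write-ahead log and never reads the older index data from storage:

```python
import bisect

class Memtable:
    """Toy sorted in-memory buffer; real engines use a skiplist or b-tree."""
    def __init__(self):
        self.entries = []               # sorted list of (key, record_id)

    def put(self, key, record_id):
        bisect.insort(self.entries, (key, record_id))

memtable = Memtable()
wal = open("/tmp/toy_wal", "ab")        # hypothetical redo log path

def blind_insert_secondary(key, record_id):
    # No unique constraint on the secondary index, so there is no read to
    # check for an existing key: append to the WAL and update the memtable.
    wal.write(f"{key}:{record_id}\n".encode())
    memtable.put(key, record_id)
    # When the memtable is full it is flushed to a new sorted file; later,
    # size-tiered compaction merges sorted files of similar size.
```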
A read before the write might be required for primary index maintenance to confirm that the PK for the inserted doc is unique. However the page with that data should almost always be in cache for this workload. An LSM can also use a bloom filter for workloads that don't load in PK order. It might be hard to define a bloom filter that is great for multiple indexes, in which case a column family is needed per index. I am not sure whether WiredTiger uses one. I did not find references to bloom in the WiredTiger engine for MongoDB.
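A sketch of how a bloom filter avoids storage reads during a PK uniqueness check; this is a generic bloom filter, not anything WiredTiger-specific, and the names are hypothetical:

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: k hash functions over a bit array of m bits."""
    def __init__(self, m_bits=8 * 1024 * 1024, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# One filter per sorted file. On a PK insert the files whose filter says
# "definitely not present" are skipped, so when the PK is new most
# uniqueness checks avoid any storage read.
def pk_exists(pk, sorted_files):
    for bloom, read_key_from_storage in sorted_files:
        if bloom.might_contain(pk) and read_key_from_storage(pk) is not None:
            return True
    return False
```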
An LSM will do writes to storage during compaction and memtable flushes. It can be difficult to compare the overhead from these operations with a b-tree because most of the IO requests from the LSM are larger (many pages at a time) and are amortized over many user operations. A busy LSM still does random IO (a sketch of size-tiered compaction follows the list):
- multi-threaded compaction means that each thread has one stream of writes and one or more streams of reads when the database is larger than cache. The request size for these is large (many pages at a time)
- another stream of writes is in progress for the redo log (WAL, write ahead log)
- user operations can require single-page reads from storage
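A minimal sketch of size-tiered compaction under simplified assumptions (pick a group of similarly sized sorted files, merge them into one larger file); this is generic, not the WiredTiger implementation:

```python
import heapq

def pick_bucket(files, min_merge=4, size_ratio=2.0):
    """Pick the smallest group of similarly sized sorted files to merge.

    files: list of (size_in_bytes, sorted_entries). Returns a bucket with at
    least min_merge files whose sizes are within size_ratio of each other,
    or None if no bucket is ready.
    """
    files = sorted(files, key=lambda f: f[0])
    bucket = [files[0]]
    for f in files[1:]:
        if f[0] <= bucket[0][0] * size_ratio:
            bucket.append(f)
        elif len(bucket) >= min_merge:
            return bucket
        else:
            bucket = [f]
    return bucket if len(bucket) >= min_merge else None

def compact(bucket):
    # Sequential reads of the inputs and one sequential write of the output,
    # amortized over all the user writes that produced those files.
    merged = list(heapq.merge(*(entries for _, entries in bucket)))
    return (sum(size for size, _ in bucket), merged)
```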