Friday, August 7, 2015

Different kinds of copy-on-write for a b-tree: CoW-R, CoW-S

I use CoW-R and CoW-S to distinguish between two approaches to doing copy-on-write for a B-Tree.

CoW-R stands for copy-on-write random and is used by LMDB and WiredTiger. Page writeback is not in place. The writes use the space from dead versions of previously written pages. Background threads are not used to copy live data from old log segments or extents. This tends to look like random IO. When the number of pages written per fsync is large then a clever IO scheduler can significantly reduce the random IO penalty for disks and LMDB might benefit from that. WiredTiger has shown that compression is possible with CoW-R. But compression makes space management harder.

CoW-S stands for copy-on-write sequential. New writes are done to one or a few active log segments. Background threads are required to copy-out live data from previously written log segments. Compared to CoW-R this has less random IO at the cost of extra write-amplification from cleaning old segments. It allows for a tradeoff between space and write amplification -- more dead data can be allowed to reduce the frequency of cleaning and the write amplification overhead. I am not aware of an implementation of CoW-S for a B-Tree.

In CoW-R and CoW-S B-Trees the file structure is page based. In the worst case a page is written back to disk when with one dirty row and the write amplification from that is sizeof(page) / sizeof(row). This can be large when a row is small (128 bytes) and a page is 4kb or 8kb. Write optimized approaches avoid that source of write amplification as they only write back the changed rows. Examples of this includes an LSM like RocksDB and a log structured product like ForestDB. Alas, there is no free lunch. While these avoid the write amplification from page writeback, they add write amplification from compaction in an LSM and log segment cleaning in ForestDB.


  1. That's a pretty fair and neutral description of things. But there's no need to be so neutral; the trend in technology favors the LMDB approach.

    E.g., compare a SATA HDD to a SATA SSD. The bulk data transfer speed for a 7200rpm HDD is around 160MB/sec and it yields only ~54 IOPS reads / ~125 IOPS writes.

    Meanwhile a SATA SSD yields around 500MB/sec bulk transfer and ~100K IOPS.

    So while data transfer rate only doubles or triples, IOPS increases by a factor of a thousand or more. This means it's far more important to optimize write amplification than it is to optimize random I/Os, and everything log-based does far worse on the write amplification front as record sizes increase.

    Aside from that, the LMDB approach yields deterministic write throughput, while the log-oriented approaches get unpredictable latency spikes due to the background threads running compaction/cleanup.

    1. 1) While I think LMDB is kind of awesome, we might not agree on whether it is suitable for all workloads. Regardless, I appreciate the work that you do.
      2) SSD/flash peak write rates are excellent but if you want the device to last for a few years then the write rates that can be sustained for that time are not that impressive.
      3) I agree that write-amp for page-based stores, like LMDB or InnoDB, decreases as record size increases. Alas many of the workloads I care about has average record sizes that are much less than 1kb, and write-optimized solutions like RocksDB provide much better compression and much less write amplification.

    2. I can't disagree with facts. Everything I've seen shows that for small record sizes, the LSM approaches' write speeds blows LMDB out of the water.

      Write speed doesn't really tip in LMDB's favor until record sizes are over 1KB.

      My only other comment is that write-ahead loggers are fiendishly difficult to get right. BerkeleyDB has been around since 1996 and ~20 years later they're still fixing reliability bugs in their transaction log handling. [#24169], log file format change in, [#22215], etc... The LevelDB family has its share of reliability flaws still too. SQLite, which ought to be fairly simple (compared to many larger systems) is also plagued with WAL issues. The complexity here is a major liability, one which doesn't plague LMDB.

    3. (Going back to restate - on random access patterns, LSMs write speeds excel for small record sizes. For sequential writes, LMDB still wins at all record sizes.)

    4. When byte-addressable NVM gets more ubiquitous it may become worthwhile to investigate small-record B+tree optimizations...