Small Datum: Different kinds of copy-on-write for a b-tree: CoW-R, CoW-S

Friday, August 7, 2015

Different kinds of copy-on-write for a b-tree: CoW-R, CoW-S

I use CoW-R and CoW-S to distinguish between two approaches to doing copy-on-write for a B-Tree.

CoW-R stands for copy-on-write random and is used by LMDB and WiredTiger. Page writeback is not done in place. Instead, writes reuse the space freed by dead versions of previously written pages. Background threads are not used to copy live data from old log segments or extents. This tends to look like random IO. When the number of pages written per fsync is large, a clever IO scheduler can significantly reduce the random IO penalty for disks, and LMDB might benefit from that. WiredTiger has shown that compression is possible with CoW-R, but compression makes space management harder.

CoW-S stands for copy-on-write sequential. New writes are done to one or a few active log segments. Background threads are required to copy-out live data from previously written log segments. Compared to CoW-R this has less random IO at the cost of extra write-amplification from cleaning old segments. It allows for a tradeoff between space and write amplification -- more dead data can be allowed to reduce the frequency of cleaning and the write amplification overhead. I am not aware of an implementation of CoW-S for a B-Tree.
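The space versus write-amplification tradeoff can be made concrete with the usual back-of-envelope model for log cleaning. This is a sketch under a simplifying assumption that is mine, not a measurement of any engine: every segment chosen for cleaning has the same live fraction u. The function name cleaning_write_amp is hypothetical. To free space for one segment of new data the cleaner must process 1 / (1 - u) segments and copy u / (1 - u) segments of live data, so total writes per unit of new data are 1 / (1 - u).

```python
def cleaning_write_amp(live_fraction: float) -> float:
    """Writes per unit of new data when cleaned segments are `live_fraction` live.

    Assumes a uniform model: every segment picked for cleaning has the same
    live fraction u. Cleaning one segment copies u and frees (1 - u), so
    writing 1 unit of new data costs 1 + u / (1 - u) = 1 / (1 - u) in total.
    """
    if not 0.0 <= live_fraction < 1.0:
        raise ValueError("live_fraction must be in [0, 1)")
    return 1.0 / (1.0 - live_fraction)


if __name__ == "__main__":
    # Tolerating more dead data (a lower live fraction at cleaning time)
    # lowers the write amplification caused by the cleaner.
    for u in (0.5, 0.8, 0.9):
        print(f"live fraction {u:.1f}: write-amp ~{cleaning_write_amp(u):.1f}")
```

Waiting until segments are only half live before cleaning them costs about 2x in this model, while cleaning at 90% live costs about 10x. The price of the lower number is more dead data sitting on disk.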

In CoW-R and CoW-S B-Trees the file structure is page based. In the worst case a page is written back to disk with only one dirty row, and the write amplification from that is sizeof(page) / sizeof(row). This can be large when a row is small (128 bytes) and a page is 4kb or 8kb. Write-optimized approaches avoid that source of write amplification because they only write back the changed rows. Examples include an LSM like RocksDB and a log-structured product like ForestDB. Alas, there is no free lunch. While these avoid the write amplification from page writeback, they add write amplification from compaction in an LSM and from log segment cleaning in ForestDB.
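To put numbers on the page writeback case, here is a small sketch of the sizeof(page) / sizeof(row) arithmetic using the row and page sizes mentioned above. The function name and the dirty_rows parameter are hypothetical, added only for illustration.

```python
def page_writeback_write_amp(page_bytes: int, row_bytes: int, dirty_rows: int = 1) -> float:
    """Bytes written to storage divided by bytes that actually changed.

    Models a page-based B-Tree that writes back a whole page even when only
    `dirty_rows` rows on that page were modified.
    """
    if page_bytes <= 0 or row_bytes <= 0 or dirty_rows <= 0:
        raise ValueError("sizes and row counts must be positive")
    # The changed bytes cannot exceed the page itself.
    changed_bytes = min(dirty_rows * row_bytes, page_bytes)
    return page_bytes / changed_bytes


if __name__ == "__main__":
    # Worst case from the text: one dirty 128-byte row per page.
    for page_bytes in (4096, 8192):
        amp = page_writeback_write_amp(page_bytes, 128)
        print(f"{page_bytes}-byte page, one 128-byte row: write-amp {amp:.0f}x")
```

With one dirty 128-byte row this gives 32x for a 4kb page and 64x for an 8kb page. The ratio drops as more rows on the same page are dirty at writeback time.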

5 comments:

hyc, August 7, 2015 at 9:43 PM
That's a pretty fair and neutral description of things. But there's no need to be so neutral; the trend in technology favors the LMDB approach.

E.g., compare a SATA HDD to a SATA SSD. The bulk data transfer speed for a 7200rpm HDD is around 160MB/sec and it yields only ~54 IOPS reads / ~125 IOPS writes. http://www.storagereview.com/seagate_barracuda_3tb_review_1tb_platters_st3000dm001

Meanwhile a SATA SSD yields around 500MB/sec bulk transfer and ~100K IOPS.
http://www.storagereview.com/crucial_mx200_ssd_review

So while data transfer rate only doubles or triples, IOPS increases by a factor of a thousand or more. This means it's far more important to optimize write amplification than it is to optimize random I/Os, and everything log-based does far worse on the write amplification front as record sizes increase.

Aside from that, the LMDB approach yields deterministic write throughput, while the log-oriented approaches get unpredictable latency spikes due to the background threads running compaction/cleanup.
