Small Datum: wiredtiger

Sunday, November 21, 2021

Well actually, LSM and B-tree

Things I read about LSM that make me go well actually:

LSM does sequential writes - well actually, writes are sequential logically (per-file) but not physically (per-device). From the physical perspective compaction writes are large & concurrent rather than sequential. Each compaction thread writes its output file sequentially, but compaction is usually concurrent (multi-threaded), so the large (maybe 1MB+) writes from those threads are interleaved at the device. There can also be small writes to the WAL.
Compaction does merge-sort - well actually, that is a k-way merge, not a merge-sort.
Leveled compaction suffers from write-amplification - well actually, write-amplification was 10X smaller with MyRocks when the first workload was moved to it from InnoDB. Write-amp with leveled compaction is larger than with tiered, but that is the (C)RUM Conjecture in action: leveled means less space-amp at the cost of more write-amp. One of the problems is that the naive (hand-wavy) estimate of write-amp for leveled (~8 or ~10 per level) is frequently too pessimistic.
LSM suffers from write stalls - well actually, that is more true than the previous bullet points. But the real problem for an LSM is figuring out how to smooth the write stalls (reduce response time variance) and RocksDB needs to get better at that. Everything (b-tree, LSM, etc.) suffers from write stalls when ingest exceeds write-back throughput, and the easy way to reproduce that is a benchmark with WAL fsync disabled. A b-tree benefits from feedback that serves as a small write stall: when the buffer pool is full of dirty pages and the working set is not cached, a write likely needs to wait for 1 page to be written back before it can read something into the buffer pool. But an LSM doesn't have that feedback, as a write usually just needs something to be put into the memtable, non-unique secondary index maintenance doesn't require reads, and even reads needed to check unique constraints (likely for a PK) can be avoided in some cases with bloom filters. Checkpoint is one place where a b-tree can suffer from large write stalls. Not that many years ago the insert benchmark was great at documenting longer write stalls related to checkpoint in InnoDB and WiredTiger. Fortunately those have been fixed. Postgres also had serious problems, now fixed, from checkpoint IO creating bursts of write IO that starve other IO (like reads for user queries). Checkpoint is a hard problem.
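
To make the k-way merge point concrete, here is a minimal sketch in Python. It is not RocksDB code. It assumes each input run is already sorted by key and holds hypothetical (key, seqno, value) tuples, where a larger seqno means a newer version. Nothing gets sorted here: the inputs are already in order and the merge only interleaves them, which is why this is a k-way merge and not a merge-sort.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def k_way_merge(runs):
    """Merge already-sorted runs of (key, seqno, value) tuples.

    Assumption for this sketch: each run is sorted by key and a larger
    seqno means a newer version of that key.
    """
    # Order by key ascending, then seqno descending, so that the newest
    # version of each key is the first one seen in the merged stream.
    merged = heapq.merge(*runs, key=lambda entry: (entry[0], -entry[1]))
    for _, versions in groupby(merged, key=itemgetter(0)):
        # Keep the newest version and drop the older ones.
        yield next(versions)


if __name__ == "__main__":
    runs = [
        [("a", 7, "a7"), ("c", 9, "c9")],
        [("a", 3, "a3"), ("b", 4, "b4")],
        [("b", 1, "b1"), ("d", 2, "d2")],
    ]
    for entry in k_way_merge(runs):
        print(entry)
```

The heap holds one entry per input run, so the cost per output entry is O(log k) comparisons for k runs. A real compaction also has to handle tombstones, snapshots and merge operators, which this sketch ignores.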
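
The claim that everything stalls when ingest exceeds write-back throughput can be shown with a toy model. This is a hand-wavy sketch, not a model of any real engine: the function name, the parameters and the idea of a single bounded buffer of dirty data (memtables plus L0 for an LSM, dirty pages for a b-tree) are all assumptions made for illustration.

```python
def simulate(ingest_per_tick, writeback_per_tick, capacity, ticks):
    """Toy model of a bounded buffer of dirty data.

    Returns (accepted, stalled): units of write work admitted and units
    that had to wait because the buffer was full. All names and units
    are hypothetical.
    """
    dirty = 0
    accepted = 0
    stalled = 0
    for _ in range(ticks):
        # Background write-back (flush, compaction, page cleaning) drains first.
        dirty -= min(dirty, writeback_per_tick)
        # Foreground writes are admitted only while the buffer has room.
        room = capacity - dirty
        admitted = min(ingest_per_tick, room)
        dirty += admitted
        accepted += admitted
        stalled += ingest_per_tick - admitted
    return accepted, stalled


if __name__ == "__main__":
    # Ingest matches write-back: the buffer never fills.
    print(simulate(ingest_per_tick=10, writeback_per_tick=10, capacity=100, ticks=100))
    # Ingest exceeds write-back: once the buffer fills, the excess stalls.
    print(simulate(ingest_per_tick=15, writeback_per_tick=10, capacity=100, ticks=100))
```

In this model the buffer only delays the stall. Once it is full, the admitted rate is capped at the write-back rate no matter how large the buffer is, so the interesting question is whether the stalls arrive as many small waits or a few long ones.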
