Many years ago I sat next to a Couchbase exec at a VLDB event and was explained on how awesome their new storage engine was especially when compared to RocksDB. That new engine was based on ForestDB and while the idea was interesting the engine was never finished. On the bright side I got to meet the clever person who invented the ForestDB algorithm and had fun evaluating the perf claims (blog posts A, B, C, D).
At last, Couchbase has a new storage named Magma. The VLDB paper is worth reading, the engine design is interesting and the product is feature complete. The rest of this post explains Magma based on the paper and a few other posts. The Magma engine is an example of the index+log index structure and looks like a great upgrade from the previous engine (Couchstore).
Motivation
Magma should eventually replace Couchstore as the Couchbase storage engine. Couchstore is a copy-on-write B-Tree (CoW-S). Compaction (GC) for Couchstore is single-threaded and not incremental (a database file is read & fully rewritten). Couchstore does not do block compression. Couchstore writes are done by appending the modified pages (from leaf to root) to the end of the database file. That is a lot of write-amp.
It was time for something new and I like the replacement - Magma. Goals for it include:
- concurrent, incremental compaction
- less write-amplification
- optimized for SSD (sustain more concurrent IO requests)
- support deployments with 100:1 ratios for data:memory
- avoid the fatal flaw of index+log (don't do index lookups during GC)
- point read of documents by primary key
- range scan of documents by seqno (commit sequence number) for change feeds
- upsert N documents
- while this requires two indexes (by PK, by seqno) secondary indexes on document attributes is provided by another service
- average document size is between 1kb and 16kb
Implementation
A summary of the implementation:
- writes are persisted to a WAL and then added to the write cache. I assume each document gets a unique seqno at that time.
- the write cache (RocksDB memtable) buffers KV pairs with 2 skiplists. The paper first describes them as the active and immutable lists. Later it states one skiplist orders by PK and the other by seqno. I believe the later description. When full the documents are appended to the tail of the open log segment and the (key,seqno, doc size) tuples are flushed to the LSM index.
- an LSM tree index stores (key, seqno, doc size) tuples to map a document PK to a seqno
- the Log Structured Object Store is composed of log segments. Per-segment metadata includes the min and max seqno in that segment and the seqno values do not overlap between segments.
- there is a document cache and index block cache. The index block cache caches blocks from the LSM index and from the per-segment B-tree indexes.
- Search the write-cache, if found return the document and stop
- Else search the LSM index. If the PK is found use its seqno to search a log segment, else done
- Use the seqno to find the log segment that contains it
- Use the B-tree embedded in that log segment to find the data block that has the seqno
- Read that data block and return the document
The Log Structured Object Store is a set of log segments. Each log segment has a CoW B-Tree to map seqno to data block. When the write-cache is flushed 1+ data blocks are appended to the tail of the open log segment (each data block is 4kb) and one index entry per datablock is added to the B-Tree. The index maps seqno to data block. Updates to the B-Tree are done by appending the new or changed index blocks to the tail of the open log segment and that is done by writing from leaf up to the root (or less than the root in most cases). This B-Tree index is right growing (seqno is increasing). The data blocks are compressed with lz4. I am not sure if the index blocks are compressed.
I am not sure what is done for documents that are larger than 4kb, but that should not be a hard problem to solve.
Eventually the log segments accumulate documents that have been deleted and GC is done to reclaim that space. The important thing is that Magma doesn't have to do index lookups during GC to determine if a document is the latest version (live) for a given PK. Instead, Magma has a clever way to main lists of deleted seqnos and uses that list when copying out live documents from a log segment during GC.
Perf results
I tend to be skeptical of perf comparisons with RocksDB. But I also don't let my skepticism prevent me from appreciating a paper, and this paper deserves to be appreciated. The authors did a great job of presenting results while covering a variety of concerns (performance and efficiency).
Of course the paper shows that performance and efficiency are better than RocksDB and there wouldn't be a VLDB paper if their results didn't show that. The improvements vs Couchstore are even larger and there wouldn't be the investment to build Magma if that weren't true. So while I am skeptical I don't think these results are misleading. I also have no plans to run my own tests to compare RocksDB with Magma from my perspective.