Wednesday, September 28, 2022

Magma, a new storage engine for Couchbase

Many years ago I sat next to a Couchbase exec at a VLDB event and was explained on how awesome their new storage engine was especially when compared to RocksDB. That new engine was based on ForestDB and while the idea was interesting the engine was never finished. On the bright side I got to meet the clever person who invented the ForestDB algorithm and had fun evaluating the perf claims (blog posts A, B, C, D).

At last, Couchbase has a new storage named Magma. The VLDB paper is worth reading, the engine design is interesting and the product is feature complete. The rest of this post explains Magma based on the paper and a few other posts. The Magma engine is an example of the index+log index structure and looks like a great upgrade from the previous engine (Couchstore).


Magma should eventually replace Couchstore as the Couchbase storage engine. Couchstore is a copy-on-write B-Tree (CoW-S). Compaction (GC) for Couchstore is single-threaded and not incremental (a database file is read & fully rewritten). Couchstore does not do block compression. Couchstore writes are done by appending the modified pages (from leaf to root) to the end of the database file. That is a lot of write-amp.

It was time for something new and I like the replacement - Magma. Goals for it include:

  • concurrent, incremental compaction
  • less write-amplification
  • optimized for SSD (sustain more concurrent IO requests)
  • support deployments with 100:1 ratios for data:memory
  • avoid the fatal flaw of index+log (don't do index lookups during GC)
The basic workload for a Couchbase storage engine is:
  • point read of documents by primary key
  • range scan of documents by seqno (commit sequence number) for change feeds
  • upsert N documents
  • while this requires two indexes (by PK, by seqno) secondary indexes on document attributes is provided by another service
  • average document size is between 1kb and 16kb


A summary of the implementation:

  • writes are persisted to a WAL and then added to the write cache. I assume each document gets a unique seqno at that time.
  • the write cache (RocksDB memtable) buffers KV pairs with 2 skiplists. The paper first describes them as the active and immutable lists. Later it states one skiplist orders by PK and the other by seqno. I believe the later description. When full the documents are appended to the tail of the open log segment and the (key,seqno, doc size) tuples are flushed to the LSM index.
  • an LSM tree index stores (key, seqno, doc size) tuples to map a document PK to a seqno
  • the Log Structured Object Store is composed of log segments. Per-segment metadata includes the min and max seqno in that segment and the seqno values do not overlap between segments.
  • there is a document cache and index block cache. The index block cache caches blocks from the LSM index and from the per-segment B-tree indexes. 
If you want to avoid read IO while accessing the index then with an LSM like RocksDB you only need to cache one key per data block because the KV pairs in the data blocks are in key order. But with an index+log approach like Magma the documents in the log segments are not in PK order, so you would need to cache one key per document rather than per block (meaning there is more data to cache). Thus the Magma approach suffers from more cache amplification than RocksDB. Alas, there is no free lunch and the index+log approach has other benefits that can be worth the price of more cache-amp.

The typical search flow to find a document by PK is:
  1. Search the write-cache, if found return the document and stop
  2. Else search the LSM index. If the PK is found use its seqno to search a log segment, else done
  3. Use the seqno to find the log segment that contains it
  4. Use the B-tree embedded in that log segment to find the data block that has the seqno
  5. Read that data block and return the document
Log segments

The Log Structured Object Store is a set of log segments. Each log segment has a CoW B-Tree to map seqno to data block. When the write-cache is flushed 1+ data blocks are appended to the tail of the open log segment (each data block is 4kb) and one index entry per datablock is added to the B-Tree. The index maps seqno to data block. Updates to the B-Tree are done by appending the new or changed index blocks to the tail of the open log segment and that is done by writing from leaf up to the root (or less than the root in most cases). This B-Tree index is right growing (seqno is increasing). The data blocks are compressed with lz4. I am not sure if the index blocks are compressed.

I am not sure what is done for documents that are larger than 4kb, but that should not be a hard problem to solve.

Eventually the log segments accumulate documents that have been deleted and GC is done to reclaim that space. The important thing is that Magma doesn't have to do index lookups during GC to determine if a document is the latest version (live) for a given PK. Instead, Magma has a clever way to main lists of deleted seqnos and uses that list when copying out live documents from a log segment during GC.

The clever solution is to persist the list of (seqno, doc size) pairs in a separate LSM tree called the delete list. When the LSM index tree is compacted and (PK, seqno, doc size) tuples are to be dropped (because they were deleted) then those tuples are first written to the delete list LSM tree. The doc size values are used to estimate the amount of deleted bytes per log segment and compaction is done for log segments that have the largest ratio of deleted bytes. Compaction then merges the delete list with the documents from a log segment -- if the seqno for a document is on the delete list then the document is not copied out by GC.

Perf results

I tend to be skeptical of perf comparisons with RocksDB. But I also don't let my skepticism prevent me from appreciating a paper, and this paper deserves to be appreciated. The authors did a great job of presenting results while covering a variety of concerns (performance and efficiency).

Of course the paper shows that performance and efficiency are better than RocksDB and there wouldn't be a VLDB paper if their results didn't show that. The improvements vs Couchstore are even larger and there wouldn't be the investment to build Magma if that weren't true. So while I am skeptical I don't think these results are misleading. I also have no plans to run my own tests to compare RocksDB with Magma from my perspective.

Snark aside, I appreciate this paper provided more info than the typical conference paper with respect to how they configured RocksDB. They used RocksDB 5.18, perhaps because that test was done long ago. Modern RocksDB now can use io_uring to get concurrent storage reads during MultiGet operations and coroutines to get IO concurrency elsewhere (I am still learning about that). Modern RocksDB also has an option to do key-value separation via Integrated BlobDB.

No comments:

Post a Comment