Wednesday, September 28, 2022

Magma, a new storage engine for Couchbase

Many years ago I sat next to a Couchbase exec at a VLDB event and was told how awesome their new storage engine was, especially compared to RocksDB. That new engine was based on ForestDB and while the idea was interesting the engine was never finished. On the bright side I got to meet the clever person who invented the ForestDB algorithm and had fun evaluating the perf claims (blog posts A, B, C, D).

At last, Couchbase has a new storage engine named Magma. The VLDB paper is worth reading, the engine design is interesting and the product is feature complete. The rest of this post explains Magma based on the paper and a few other posts. The Magma engine is an example of the index+log index structure and looks like a great upgrade from the previous engine (Couchstore).

Motivation

Magma should eventually replace Couchstore as the Couchbase storage engine. Couchstore is a copy-on-write B-Tree (CoW-S). Compaction (GC) for Couchstore is single-threaded and not incremental (a database file is read & fully rewritten). Couchstore does not do block compression. Couchstore writes are done by appending the modified pages (from leaf to root) to the end of the database file. That is a lot of write-amp.

It was time for something new and I like the replacement - Magma. Goals for it include:

  • concurrent, incremental compaction
  • less write-amplification
  • optimized for SSD (sustain more concurrent IO requests)
  • support deployments with 100:1 ratios for data:memory
  • avoid the fatal flaw of index+log (don't do index lookups during GC)
The basic workload for a Couchbase storage engine is:
  • point read of documents by primary key
  • range scan of documents by seqno (commit sequence number) for change feeds
  • upsert N documents
  • while this requires two indexes (by PK, by seqno), secondary indexes on document attributes are provided by another service
  • average document size is between 1kb and 16kb
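
To make that workload concrete, here is a minimal sketch of the interface such an engine has to serve. The type and method names are my own, not Magma's API.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical interface for the workload above; names are illustrative
// and not taken from the Magma source.
struct Document {
  std::string key;    // primary key
  uint64_t seqno;     // commit sequence number assigned at write time
  std::string value;  // document body, typically 1kb to 16kb
};

class EngineSketch {
 public:
  virtual ~EngineSketch() = default;

  // Point read of a document by primary key.
  virtual bool Get(const std::string& pk, Document* doc) = 0;

  // Range scan by seqno for change feeds.
  virtual void ScanBySeqno(uint64_t start_seqno, uint64_t end_seqno,
                           const std::function<void(const Document&)>& cb) = 0;

  // Upsert a batch of N documents.
  virtual void UpsertBatch(const std::vector<Document>& docs) = 0;
};
```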

Implementation

A summary of the implementation:

  • writes are persisted to a WAL and then added to the write cache. I assume each document gets a unique seqno at that time.
  • the write cache (RocksDB memtable) buffers KV pairs with 2 skiplists. The paper first describes them as the active and immutable lists. Later it states one skiplist orders by PK and the other by seqno. I believe the latter description. When full, the documents are appended to the tail of the open log segment and the (key, seqno, doc size) tuples are flushed to the LSM index.
  • an LSM tree index stores (key, seqno, doc size) tuples to map a document PK to a seqno
  • the Log Structured Object Store is composed of log segments. Per-segment metadata includes the min and max seqno in that segment and the seqno values do not overlap between segments.
  • there is a document cache and index block cache. The index block cache caches blocks from the LSM index and from the per-segment B-tree indexes. 
If you want to avoid read IO while accessing the index then with an LSM like RocksDB you only need to cache one key per data block because the KV pairs in the data blocks are in key order. But with an index+log approach like Magma the documents in the log segments are not in PK order, so you would need to cache one key per document rather than per block (meaning there is more data to cache). Thus the Magma approach suffers from more cache amplification than RocksDB. Alas, there is no free lunch and the index+log approach has other benefits that can be worth the price of more cache-amp.
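
A rough worked example of that cache-amp difference, using numbers I picked for illustration (1 TB of data, 1 KB documents which is the small end of the range above, 4 KB blocks):

```latex
% Illustrative numbers only: 1 TB of data, 1 KB documents, 4 KB blocks.
\[
\text{keys to cache (LSM)} \approx \frac{1\,\text{TB}}{4\,\text{KB per block}} = 2.5 \times 10^{8}
\qquad
\text{keys to cache (index+log)} \approx \frac{1\,\text{TB}}{1\,\text{KB per doc}} = 10^{9}
\]
```

With these numbers the index+log engine must cache about 4x as many keys to avoid index read IO, and the gap shrinks as documents get larger.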

The typical search flow to find a document by PK is:
  1. Search the write-cache, if found return the document and stop
  2. Else search the LSM index for the PK. If the PK is not found then stop, else use its seqno in the next steps
  3. Use the seqno to find the log segment that contains it
  4. Use the B-tree embedded in that log segment to find the data block that has the seqno
  5. Read that data block and return the document
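
A minimal sketch of that point-read path, with hypothetical types standing in for the write cache, the LSM index and the log segments (none of these names come from Magma):

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical components; the real Magma structures differ.
struct Doc { std::string key; uint64_t seqno; std::string value; };

struct WriteCache  { virtual std::optional<Doc> Find(const std::string& pk) = 0; };
struct LsmIndex    { virtual std::optional<uint64_t> FindSeqno(const std::string& pk) = 0; };
struct LogSegment  { virtual Doc ReadBySeqno(uint64_t seqno) = 0; };       // uses the embedded B-tree
struct SegmentList { virtual LogSegment* FindSegment(uint64_t seqno) = 0; };  // via per-segment min/max seqno

// Sketch of the search flow listed above.
std::optional<Doc> GetByPk(WriteCache& wc, LsmIndex& index, SegmentList& segments,
                           const std::string& pk) {
  // 1. Search the write cache.
  if (auto doc = wc.Find(pk)) return doc;
  // 2. Search the LSM index to map PK -> seqno.
  auto seqno = index.FindSeqno(pk);
  if (!seqno) return std::nullopt;
  // 3. Find the log segment whose seqno range contains it.
  LogSegment* seg = segments.FindSegment(*seqno);
  // 4 + 5. Use the segment's B-tree to find the data block, read it, return the doc.
  return seg->ReadBySeqno(*seqno);
}
```
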
Log segments

The Log Structured Object Store is a set of log segments. Each log segment has a CoW B-Tree to map seqno to data block. When the write-cache is flushed 1+ data blocks are appended to the tail of the open log segment (each data block is 4kb) and one index entry per data block is added to the B-Tree. The index maps seqno to data block. Updates to the B-Tree are done by appending the new or changed index blocks to the tail of the open log segment, writing from leaf up toward the root (often stopping before the root). This B-Tree index is right growing (seqno is increasing). The data blocks are compressed with lz4. I am not sure if the index blocks are compressed.
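
An illustrative sketch of a log segment as I understand it from the paper; the field names are mine, not Magma's, and the std::map stands in for the CoW B-tree:

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct DataBlock {
  // ~4kb of lz4-compressed documents, appended at the tail of the open segment.
  std::vector<uint8_t> compressed_bytes;
};

struct LogSegmentSketch {
  uint64_t min_seqno = 0;  // per-segment metadata; seqno ranges do not
  uint64_t max_seqno = 0;  //   overlap across segments

  std::vector<DataBlock> data_blocks;  // in append order

  // Stand-in for the CoW B-tree mapping the first seqno in a data block to its
  // file offset. New or changed index blocks are appended leaf-first at the
  // tail of the segment; since seqno only grows the tree is right-growing.
  std::map<uint64_t, uint64_t> seqno_to_block_offset;

  // A write-cache flush appends 1+ data blocks and one index entry per block.
  void Append(uint64_t first_seqno, uint64_t last_seqno, uint64_t file_offset,
              DataBlock block) {
    data_blocks.push_back(std::move(block));
    seqno_to_block_offset[first_seqno] = file_offset;
    max_seqno = last_seqno;
  }
};
```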

I am not sure what is done for documents that are larger than 4kb, but that should not be a hard problem to solve.

Eventually the log segments accumulate documents that have been deleted and GC is done to reclaim that space. The important thing is that Magma doesn't have to do index lookups during GC to determine if a document is the latest version (live) for a given PK. Instead, Magma has a clever way to maintain lists of deleted seqnos and uses those lists when copying out live documents from a log segment during GC.

The clever solution is to persist the list of (seqno, doc size) pairs in a separate LSM tree called the delete list. When the LSM index tree is compacted and (PK, seqno, doc size) tuples are to be dropped (because they were deleted) then those tuples are first written to the delete list LSM tree. The doc size values are used to estimate the amount of deleted bytes per log segment and compaction is done for log segments that have the largest ratio of deleted bytes. Compaction then merges the delete list with the documents from a log segment -- if the seqno for a document is on the delete list then the document is not copied out by GC.
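
A sketch of that GC step under my reading of the paper; the real implementation presumably streams a merge of two seqno-ordered inputs rather than materializing a set, but the point is that no PK index lookup is needed:

```cpp
#include <cstdint>
#include <set>
#include <vector>

struct DocRecord { uint64_t seqno; std::vector<uint8_t> body; };

// Copy out the live documents from one log segment. A document is dead iff
// its seqno is on the delete list, so the PK index is never consulted.
std::vector<DocRecord> CompactSegment(
    const std::vector<DocRecord>& segment_docs,      // ordered by seqno
    const std::set<uint64_t>& delete_list_seqnos) {  // from the delete list LSM tree
  std::vector<DocRecord> live;
  for (const DocRecord& doc : segment_docs) {
    if (delete_list_seqnos.count(doc.seqno) == 0) {
      live.push_back(doc);  // copy the live document into the new segment
    }
  }
  return live;
}
```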

Perf results

I tend to be skeptical of perf comparisons with RocksDB. But I also don't let my skepticism prevent me from appreciating a paper, and this paper deserves to be appreciated. The authors did a great job of presenting results while covering a variety of concerns (performance and efficiency).

Of course the paper shows that performance and efficiency are better than RocksDB and there wouldn't be a VLDB paper if their results didn't show that. The improvements vs Couchstore are even larger and there wouldn't be the investment to build Magma if that weren't true. So while I am skeptical I don't think these results are misleading. I also have no plans to run my own tests to compare RocksDB with Magma from my perspective.

Snark aside, I appreciate this paper provided more info than the typical conference paper with respect to how they configured RocksDB. They used RocksDB 5.18, perhaps because that test was done long ago. Modern RocksDB now can use io_uring to get concurrent storage reads during MultiGet operations and coroutines to get IO concurrency elsewhere (I am still learning about that). Modern RocksDB also has an option to do key-value separation via Integrated BlobDB.

Storage engines, efficiency and large documents, rows, objects

Storage engines for OLTP focus on storing small objects and by small object I mean sizeof(object) << sizeof(page). They can store larger objects but the engines are not optimized for that and both efficiency and performance will suffer. By larger objects I mean sizeof(object) >= sizeof(page). 

Given the growth of the document model we need some OLTP storage engines to be efficient and performant for larger objects. What might that mean? The rest of this post is speculation by me. I am far from an expert on this topic and will try to be careful. The topic would make for an interesting conference paper.

This post is more truthy than true. It has speculation, unanswered questions and likely has several mistakes or things I omitted because I was not aware of them.

I will focus on four index structures that are popular in my world:

  • Clustered b-tree (InnoDB)
  • Heap-organized b-tree (Postgres)
  • Cow-R b-tree (WiredTiger)
  • LSM (MyRocks)
At a high-level I want to understand where the storage engine consumes CPU & IO while reading and writing large objects. The writes can be small updates (changing a few fields in an object) or big updates/inserts (changing all or most of an object).

The context that I care about is a workload where the working set isn't in memory (because RAM is expensive and fast storage is here).

One more bit of context - with JSON documents stored in a document DBMS like MongoDB or a SQL DBMS like MySQL, I hope to learn about support for partial updates. If only a few fields are updated in a large document does the DBMS have optimizations to reduce the overhead of persisting and replicating such updates?

Document vs Relational

I started to think about this blog post when considering the impact of document vs relational approaches -- or larger documents vs smaller rows -- starting with the scenario where you want to update a few fields in that large document vs updating a few of the small rows in the relational approach. And I assume the working set doesn't fit in cache.

At a high-level there is a big difference between a b-tree and LSM. With an LSM if a 3KB object is modified without support for partial updates, then 3KB is written to the memtable and then compaction does its thing. We can estimate the total write overhead based on average write-amp and sizeof(updated object). The predicted cost is largely determined by sizeof(updated object). But this is less true with a b-tree given my working-set assumption above. With a b-tree, whether the updated object is a few bytes or a few KB, if the object fits in one page and page write-back is usually done for pages with only one modified object, then the write-back cost is roughly independent of object size -- the cost is 1 page.

A rule of thumb from the previous paragraph can be that a b-tree write-back overhead is less sensitive than an LSM to large objects as long as sizeof(object) <= sizeof(page).
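
A worked example of that rule of thumb, with numbers I picked for illustration (a 3KB object, a 16KB page, LSM write-amp of 10, no partial-update support):

```latex
% Illustrative numbers only: 3 KB object, 16 KB page, LSM write-amp = 10.
\[
\text{LSM write-back} \approx \text{write-amp} \times \text{sizeof(object)} = 10 \times 3\,\text{KB} = 30\,\text{KB}
\]
\[
\text{B-tree write-back} \approx 1 \text{ page} = 16\,\text{KB}, \text{ whether 30 bytes or 3 KB of the object changed}
\]
```

Once sizeof(object) exceeds sizeof(page) the b-tree advantage fades because multiple (overflow) pages must be written back.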

From the read perspective, if you need all or most of the fields of a large document then it will be more efficient to keep them in a large document and avoid the joins. It will also make better use of the cache memory because the SQL DBMS approach might end up with pages in cache where only one row of such pages are getting accessed.

Efficiency overview

I focus my speculation on efficiency. But being more efficient frequently implies being more performant. The efficiency areas are:
  • storage - how are the larger objects stored in the index structure? In general it is another source of (reasonable) complexity to understand if you want to be a power user.
  • write-back - for a b-tree this means the work (CPU & IO) to write back dirty pages. For an LSM this is the work done by compaction and memtable flush. The overhead for this is measured in the amount of CPU, random page writes and KB written per transaction. The relative importance of random page writes vs KB written depends on your storage technology (random page writes hurt more with spinning disk than with SSD). Note that with compression the CPU overhead includes the cost to compress the data being written back or compacted. There can also be an overhead from reading data -- obviously with compaction, but also with read-modify-write for a b-tree as pages can't be modified unless they are in-memory.
  • redo log - the redo log lets a write transaction do a small amount of writing to the log to defer a larger amount of writing needed by write-back
  • replication - the amount of data that must be written to a replication log and sent over the network per transaction. If log shipping is used then this is just a repeat of the redo log analysis. Postgres does log shipping and there has been work to support statement-based replication. MySQL started with statement-based and then moved to row-based. My knowledge of MongoDB internals is weaker but I assume it is document-based, which is similar to MySQL's row-based.
  • partial updates - when a large document gets a partial update (a few fields are changed) does the DBMS optimize for that case WRT replication, redo log and write-back?
Efficiency analysis

Disclaimer, this section is more truthy than true. I am curious to learn which of the DBMS (Postgres, MySQL, MongoDB) have optimizations for partial updates to large documents but not curious enough to do the research.

Clustered b-tree (InnoDB)

  • storage - objects are stored inline in the clustered index leaf pages if they are small enough. The definition of small enough depends on the row format but certainly it must be smaller than the InnoDB page size. I wonder if a clustered index is the right structure for large objects unless the large objects are always stored in overflow pages. Percona has a great summary and the reference manual has more info. But large objects are likely to be stored in overflow pages (meaning there can be more random IO to read them) and overflow pages are not shared by multiple blobs (meaning there will be more space amplification). Compression is possible via table or page compression. I am not a fan of the page compression approach but table compression worked great for me long ago. There are many recent improvements to InnoDB for large objects done to support JSON docs including support for partial updates, per-LOB indexes and compressed LOBs. I have yet to figure out whether I can enable compression for large JSON documents stored in LOB overflow pages without enabling table or page compression.
  • write-back - assuming overflow pages are used then write-back likely means writing back both the clustered index leaf page that references the overflow pages along with the overflow pages. This likely means more write amplification. Assuming the doublewrite buffer is enabled then there is extra write amplification from the doublewrite buffer when previously written pages are to be written again.
  • redo log - InnoDB doesn't write page images to the redo log, with one exception (compression) that I will ignore for now. But the entire row must be written to the redo log unless there is something clever for partial updates.
  • replication - read up on the binlog_row_image option. But I am not sure if MySQL does anything clever with respect to the partial updates mentioned above. Can it avoid sending the full document when only a few fields have changed?
  • partial updates - yes, see here
Heap-organized b-tree (Postgres)
  • storage - objects are stored in the heap-organized table pages if they fit. When they don't, read up on TOAST. I like that compression can be enabled/disabled just for TOAST.
  • write-back - similar to InnoDB there might be extra write amplification from writing back both the heap table pages that reference TOAST columns and the out-of-line storage used to store the TOAST columns.
  • redo log - the first time a page is modified after a checkpoint the full page is written to the redo log (full-page writes). For large objects the entire object must be written to the log.
  • replication
  • partial updates - I don't know whether that is optimized for JSON/JSONB.
Cow-R b-tree (WiredTiger, mostly unanswered because my internals knowledge has faded)
  • storage
  • write-back
  • redo log
  • replication
  • partial updates - I don't know whether MongoDB does anything special
LSM (MyRocks)
  • storage - large objects are stored inline unless you use Integrated BlobDB to get key-value separation. Compression is trivial.
  • write-back - the write-back cost is average-write-amp * sizeof(object)
  • redo log - the entire large object is written to the RocksDB redo log
  • replication - see binlog_row_image
  • partial updates - yes via the merge operator if you use pure RocksDB. Optimizations for large objects have yet to arrive.
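
A sketch of how a partial document update could be modeled with the RocksDB merge operator, as mentioned in the last bullet. The DocPatchMerge class and ApplyPatch helper are names I made up; a real implementation would parse the document encoding and apply only the changed fields.

```cpp
#include <string>
#include "rocksdb/merge_operator.h"
#include "rocksdb/slice.h"

// Hypothetical merge operator: the merge operand encodes only the changed
// fields and Merge() applies that patch to the stored document.
class DocPatchMerge : public rocksdb::AssociativeMergeOperator {
 public:
  bool Merge(const rocksdb::Slice& /*key*/, const rocksdb::Slice* existing_value,
             const rocksdb::Slice& value, std::string* new_value,
             rocksdb::Logger* /*logger*/) const override {
    if (existing_value == nullptr) {
      new_value->assign(value.data(), value.size());  // no base document yet
      return true;
    }
    *new_value = ApplyPatch(existing_value->ToString(), value.ToString());
    return true;
  }

  const char* Name() const override { return "DocPatchMerge"; }

 private:
  // Stand-in patch function: a real one would decode the document and the
  // patch and rewrite only the changed fields.
  static std::string ApplyPatch(const std::string& doc, const std::string& patch) {
    return doc + patch;
  }
};
```

The operator is installed via options.merge_operator and the application then calls db->Merge() with the patch instead of doing a read-modify-write with Get and Put.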

Monday, September 19, 2022

Understanding some jemalloc stats for RocksDB

I have been struggling to understand the root cause for RSS being larger than expected during my usage of db_bench. In this case db_bench used jemalloc. Statistics for jemalloc are written to LOG every stats_dump_period_sec seconds and this is an abbreviated example of that output -- I hacked db_bench to disable the mutex related stats in the call to malloc_stats_print.

To keep this post short and because I am not a jemalloc expert I will only try to explain a few things in this output that might help you, and me in a few years when I have to understand jemalloc again. It helps to know a lot about RocksDB internals to understand which of the jemalloc bin sizes map to allocations in RocksDB. When you don't know that (as I don't know all of that) then a heap profiler like Massif can be used.

The first is the number of arenas that are listed on the opt.arenas line. In this case there were 320 and I assume that jemalloc uses 4 arenas per CPU (the server had 80 CPUs). There is also a line with Arenas: 321 which is confusing but the difference isn't a big deal to me. Arenas are a way to shard the allocator to reduce mutex contention, and threads are (hopefully) assigned to different arenas. In my usage with db_bench the threads are long-lived and I assume the mapping of thread to arena doesn't change. I don't know whether that is a potential source of perf problems.

The next interesting line starts with Allocated: and displays a few counters for currently allocated memory. The value next to Allocated: is ~51GB which is reasonable given I configured RocksDB to use 50GB for the block cache. 

This is followed by an interesting section that starts with Merged arena stats. This has various counters aggregated over all 320 arenas. Start with the assigned threads: 10 line. That means that 10 threads are currently assigned to arenas. While per-arena stats might also be useful, there are many arenas, printing them would require a lot of output to read, and I have not been able to get malloc_stats_print to return results for each of the 320 arenas (it stops after 4), so I prefer just using malloc_stats_print(..., opts="a") to disable per-arena stats. If you use opts="xa" then per-arena and mutex counters are not displayed which makes the output easier to read.

Next are lines that start with small: and large: to show the amount of memory currently allocated for small and large allocation sizes. Most of the allocations in this case are small and <= 8kb.

The next section has counters for active, mapped, retained and more. These are frequently interesting but not so much in this case.

Given that most allocations are small the following section for bins: is the most interesting. The bin with size 8 is used for all allocations of <= 8 bytes. The bin with size 16 is used for all allocations between 9 and 16 bytes. This section has long lines and clicking on Raw improves readability but I can't link to lines in the Raw view.

I wanted to know the bins that accounted for the largest amount of currently allocated memory. Sorting these lines by curslabs * pgs would provide that. Alas I can't sort that in the gist so I just look for the lines with the largest value in curslabs. The curslabs column is the number of slabs (slab is better explained by an expert, not me) and pgs is the number of pages per slab. The page size is 4kb in this case so curslabs * pgs * 4096 is the number of bytes currently mapped for a given bin.

In this case the bins using the most memory have sizes 32, 96, 448 and 8192. Note that this is from db_bench configured to use Integrated BlobDB with 400-byte blob values and blob values share the block cache:

  • bins of size 32, 80 (not in this case though) and 96 are commonly used. Their usage is described by an expert as follows: the 32-byte bucket corresponds to the blob-owning object, the 80-byte bucket to the block-owning object, and the 96-byte bucket to the cache metadata (which is an overhead that’s applicable to both blocks and blobs)
  • the bin of size 448 is used by the 400-byte blob values stored in the RocksDB block cache
  • the bin of size 8192 is used by the uncompressed data blocks that are approximately 8192 bytes each because the block_size option was set to 8192.
The util column is also interesting. This lists the utilization fraction, a value between 0 and 1. Multiply by 100 to get the utilization percentage. I read the source code to learn how it is computed. It is curregs / (regs * curslabs) where curregs is the number of currently used regions, regs is the number of regions per slab and curslabs is the number of slabs.
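
The two derived values from the last few paragraphs, written out as code. The column names follow the malloc_stats_print output and 4096 is the page size on this host; treat this as my reading of the stats, not jemalloc documentation.

```cpp
#include <cstdint>

// Columns copied from one "bins:" line of the malloc_stats_print output.
struct BinStats {
  uint64_t curslabs;  // slabs currently mapped for the bin
  uint64_t pgs;       // pages per slab (x 4096 bytes = slab_size)
  uint64_t regs;      // regions per slab
  uint64_t curregs;   // regions currently in use
};

// Bytes currently mapped for the bin: curslabs * pgs * page size.
uint64_t MappedBytes(const BinStats& b) { return b.curslabs * b.pgs * 4096; }

// Utilization fraction: curregs / (regs * curslabs).
double Util(const BinStats& b) {
  if (b.curslabs == 0) return 0.0;
  return static_cast<double>(b.curregs) /
         static_cast<double>(b.regs * b.curslabs);
}
```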

In this example the util value is low (~0.7) for bins of size 32, 96 and 448 but large (1) for the bin of size 8192. To me this indicates there is fragmentation for the bins of size 32, 96 and 448 and removing that fragmentation would reduce RSS and improve memory efficiency. Alas I have yet to figure out how to make that happen. The root cause with respect to RocksDB is that at test start the blob values get a larger fraction of the block cache and over time that fraction is reduced. The difference between the peak memory usage near startup and the usage over time became significant (by my standards) as RSS was ~1.15X the size of the RocksDB block cache.

Many of the columns are explained in the man page, however some of the column names used in the malloc_stats_print output are different from what is used in the source code and explained in the man page. One example is the usage of slab_size in the man page vs pgs in the malloc_stats_print output where pgs * 4096 = slab_size. The source for printing per-arena metrics is in stats_arena_print. But you can find curslabs in the man page.

Thursday, September 15, 2022

Configuring the RocksDB block cache

I learn by making mistakes and the rate at which RocksDB improves makes it easier for me to make mistakes. My mistake here was reading that RocksDB has high and low priority classes for the block cache and assuming there was a medium priority. There is a third priority but its name is bottom and it is lower than low priority.

RocksDB has had an option to do key-value separation for many years. That was recently rewritten so it could be feature complete with the RocksDB API. The official name for the feature is Integrated BlobDB but I just call it BlobDB.

I am excited about BlobDB and have begun testing it after adding support for it to tools/benchmark.sh. But it brings a new challenge with respect to the RocksDB block cache that stores uncompressed data, index and filter blocks in memory. 

Thanks to a great intern project, with BlobDB there is now another thing that can be stored in the block cache -- blob values. The db_bench option named use_shared_block_and_blob_cache shows how that is enabled. There are 3 priority classes that can be used by the block cache: high, low, bottom.

The options that control this are:

  • high_pri_pool_ratio
    • The fraction of the cache devoted to filter and index blocks.
  • low_pri_pool_ratio 
    • When zero then data blocks and blob values have the same priority. Otherwise, this is the fraction of the cache for data blocks and blob values get the remainder (1 - high_pri_pool_ratio - low_pri_pool_ratio). The sum of high_pri_pool_ratio and low_pri_pool_ratio should be <= 1.

The priority for storing things in the cache is one of the following depending on how you configure it:

  • everything has the same priority
    • when high_pri_pool_ratio=0 and low_pri_pool_ratio=0
  • index & filter blocks have high priority, data blocks and blob values have low priority
    • when high_pri_pool_ratio > 0 and low_pri_pool_ratio=0
  • index & filter blocks have high-pri, data blocks have low-pri, blob values have bottom-pri
    • when high_pri_pool_ratio > 0 and low_pri_pool_ratio > 0
  • data, index & filter blocks have low priority, blob values have bottom priority
    • when high_pri_pool_ratio=0 and low_pri_pool_ratio > 0
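
A sketch of wiring this up through the C++ API, under the assumption that you are on a recent RocksDB release: low_pri_pool_ratio in LRUCacheOptions and the blob_cache / enable_blob_files options only exist in versions that support the three priority classes and Integrated BlobDB, so check your release before copying this.

```cpp
#include <memory>
#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

// Shared block + blob cache with three priority classes (assumes a recent
// RocksDB release; the ratios below are examples, not recommendations).
rocksdb::Options MakeOptions() {
  rocksdb::LRUCacheOptions cache_opts;
  cache_opts.capacity = 50ull << 30;      // 50 GB
  cache_opts.high_pri_pool_ratio = 0.2;   // index & filter blocks
  cache_opts.low_pri_pool_ratio = 0.5;    // data blocks; blob values get the rest
  std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(cache_opts);

  rocksdb::BlockBasedTableOptions table_opts;
  table_opts.block_cache = cache;
  table_opts.cache_index_and_filter_blocks = true;
  table_opts.cache_index_and_filter_blocks_with_high_priority = true;

  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
  options.enable_blob_files = true;  // Integrated BlobDB
  options.blob_cache = cache;        // blob values share the block cache
  return options;
}
```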

How to ask questions

This is a follow-up to my post on how to answer questions. The audience for that post was people doing something in production that others have a strong opinion about. The audience for this post is people with strong opinions about what others are doing in production.

One of these is a great way to start a discussion:

  • You are doing X, you should be doing Y.
    • This isn't even a question and it isn't a productive way to start a discussion. With respect to supporting a social graph OLTP workload I have been told on separate occasions that I should be using an in-memory DBMS, an OLAP DBMS, a graph DBMS and a NoSQL DBMS (ScyllaDB). In two of those cases the people offering that advice were quite famous. Alas, nobody offering that advice followed up and submitted a PR for Linkbench to show the benefits of such a change.
  • Why are you doing X!
    • Still not a question. Still not a great way to start a discussion.
  • Why are you doing X?
    • We have a winner. I still remember being quizzed by Peter Bailis, while at HPTS, about the social graph OLTP workload I supported. I did my best to answer his questions. It was a great discussion and I was impressed by his curiosity.

Wednesday, September 14, 2022

Using SPIN to validate the InnoDB read-write lock

I hope to begin learning how to use TLA+ and that reminded me of prior work I did with SPIN.

I have written two blog posts (here and here) about my usage of SPIN to validate concurrent code. The most important usage was for the rewrite of the InnoDB read-write lock that my team did at Google. It was important because that model helped to convince the InnoDB team to accept the change.

Within Google the project was mostly done by Ben Handy and I helped. This was ~15 years ago so my memory is vague. I remember reading the SPIN book to learn how to write a model for the algorithm. I also remember finding a machine with a lot of RAM and then doing what I could to reduce the memory requirements to run the model. Reducing the amount of concurrency tested was one such change. Finally, I remember enjoying SPIN.

I didn't share the model before but a friend, and InnoDB/TiDB expert, provided me with a copy. That model can now be read here.