Wednesday, September 28, 2022

Magma, a new storage engine for Couchbase

Many years ago I sat next to a Couchbase exec at a VLDB event and was explained on how awesome their new storage engine was especially when compared to RocksDB. That new engine was based on ForestDB and while the idea was interesting the engine was never finished. On the bright side I got to meet the clever person who invented the ForestDB algorithm and had fun evaluating the perf claims (blog posts A, B, C, D).

At last, Couchbase has a new storage named Magma. The VLDB paper is worth reading, the engine design is interesting and the product is feature complete. The rest of this post explains Magma based on the paper and a few other posts. The Magma engine is an example of the index+log index structure and looks like a great upgrade from the previous engine (Couchstore).


Magma should eventually replace Couchstore as the Couchbase storage engine. Couchstore is a copy-on-write B-Tree (CoW-S). Compaction (GC) for Couchstore is single-threaded and not incremental (a database file is read & fully rewritten). Couchstore does not do block compression. Couchstore writes are done by appending the modified pages (from leaf to root) to the end of the database file. That is a lot of write-amp.

It was time for something new and I like the replacement - Magma. Goals for it include:

  • concurrent, incremental compaction
  • less write-amplification
  • optimized for SSD (sustain more concurrent IO requests)
  • support deployments with 100:1 ratios for data:memory
  • avoid the fatal flaw of index+log (don't do index lookups during GC)
The basic workload for a Couchbase storage engine is:
  • point read of documents by primary key
  • range scan of documents by seqno (commit sequence number) for change feeds
  • upsert N documents
  • while this requires two indexes (by PK, by seqno) secondary indexes on document attributes is provided by another service
  • average document size is between 1kb and 16kb


A summary of the implementation:

  • writes are persisted to a WAL and then added to the write cache. I assume each document gets a unique seqno at that time.
  • the write cache (RocksDB memtable) buffers KV pairs with 2 skiplists. The paper first describes them as the active and immutable lists. Later it states one skiplist orders by PK and the other by seqno. I believe the later description. When full the documents are appended to the tail of the open log segment and the (key,seqno, doc size) tuples are flushed to the LSM index.
  • an LSM tree index stores (key, seqno, doc size) tuples to map a document PK to a seqno
  • the Log Structured Object Store is composed of log segments. Per-segment metadata includes the min and max seqno in that segment and the seqno values do not overlap between segments.
  • there is a document cache and index block cache. The index block cache caches blocks from the LSM index and from the per-segment B-tree indexes. 
If you want to avoid read IO while accessing the index then with an LSM like RocksDB you only need to cache one key per data block because the KV pairs in the data blocks are in key order. But with an index+log approach like Magma the documents in the log segments are not in PK order, so you would need to cache one key per document rather than per block (meaning there is more data to cache). Thus the Magma approach suffers from more cache amplification than RocksDB. Alas, there is no free lunch and the index+log approach has other benefits that can be worth the price of more cache-amp.

The typical search flow to find a document by PK is:
  1. Search the write-cache, if found return the document and stop
  2. Else search the LSM index. If the PK is found use its seqno to search a log segment, else done
  3. Use the seqno to find the log segment that contains it
  4. Use the B-tree embedded in that log segment to find the data block that has the seqno
  5. Read that data block and return the document
Log segments

The Log Structured Object Store is a set of log segments. Each log segment has a CoW B-Tree to map seqno to data block. When the write-cache is flushed 1+ data blocks are appended to the tail of the open log segment (each data block is 4kb) and one index entry per datablock is added to the B-Tree. The index maps seqno to data block. Updates to the B-Tree are done by appending the new or changed index blocks to the tail of the open log segment and that is done by writing from leaf up to the root (or less than the root in most cases). This B-Tree index is right growing (seqno is increasing). The data blocks are compressed with lz4. I am not sure if the index blocks are compressed.

I am not sure what is done for documents that are larger than 4kb, but that should not be a hard problem to solve.

Eventually the log segments accumulate documents that have been deleted and GC is done to reclaim that space. The important thing is that Magma doesn't have to do index lookups during GC to determine if a document is the latest version (live) for a given PK. Instead, Magma has a clever way to main lists of deleted seqnos and uses that list when copying out live documents from a log segment during GC.

The clever solution is to persist the list of (seqno, doc size) pairs in a separate LSM tree called the delete list. When the LSM index tree is compacted and (PK, seqno, doc size) tuples are to be dropped (because they were deleted) then those tuples are first written to the delete list LSM tree. The doc size values are used to estimate the amount of deleted bytes per log segment and compaction is done for log segments that have the largest ratio of deleted bytes. Compaction then merges the delete list with the documents from a log segment -- if the seqno for a document is on the delete list then the document is not copied out by GC.

Perf results

I tend to be skeptical of perf comparisons with RocksDB. But I also don't let my skepticism prevent me from appreciating a paper, and this paper deserves to be appreciated. The authors did a great job of presenting results while covering a variety of concerns (performance and efficiency).

Of course the paper shows that performance and efficiency are better than RocksDB and there wouldn't be a VLDB paper if their results didn't show that. The improvements vs Couchstore are even larger and there wouldn't be the investment to build Magma if that weren't true. So while I am skeptical I don't think these results are misleading. I also have no plans to run my own tests to compare RocksDB with Magma from my perspective.

Snark aside, I appreciate this paper provided more info than the typical conference paper with respect to how they configured RocksDB. They used RocksDB 5.18, perhaps because that test was done long ago. Modern RocksDB now can use io_uring to get concurrent storage reads during MultiGet operations and coroutines to get IO concurrency elsewhere (I am still learning about that). Modern RocksDB also has an option to do key-value separation via Integrated BlobDB.

Storage engines, efficiency and large documents, rows, objects

Storage engines for OLTP focus on storing small objects and by small object I mean sizeof(object) << sizeof(page). They can store larger objects but the engines are not optimized for that and both efficiency and performance will suffer. By larger objects I mean sizeof(object) >= sizeof(page). 

Given the growth of the document model we need some OLTP storage engines to be efficient and performant for larger objects. What might that mean? The rest of this post is speculation by me. I am far from an expert on this topic and will try to be careful. The topic would make for an interesting conference paper.

This post is more truthy then true. It has speculation, unanswered questions and likely has several mistakes or things I omitted because I was not aware of them.

I will focus on four index structures that are popular in my world:

  • Clustered b-tree (InnoDB)
  • Heap-organized b-tree (Postgres)
  • Cow-R b-tree (WiredTiger)
  • LSM (MyRocks)
At a high-level I want to understand where the storage engine consumes CPU & IO while reading and writing large objects. The writes can be small updates (changing a few fields in an object) or big updates/inserts (changing all or most of an object).

The context that I care about is a workload where the working set isn't in memory (because RAM is expensive and fast storage is here).

One more bit of context - with JSON documents stored in a document DBMS like MongoDB or a SQL DBMS like MySQL, I hope to learn about support for partial updates. If only a few fields are updated in a large document does the DBMS have optimizations to reduce the overhead of persistent and replicating such updates?

Document vs Relational

I started to think about this blog post when considering the impact of document vs relational approaches -- or larger documents vs smaller rows -- starting with the scenario where you want to update a few fields in that large document vs updating a few of the small rows in the relational approach. And I assume the working set doesn't fit in cache.

At a high-level there is a big difference between a b-tree and LSM. With an LSM if a 3KB object is modified without support for partial updates, then 3KB is written to the memtable and then compaction does its thing. We can estimate the total write overhead based on average write-amp and sizeof(updated object). The predicted cost is largely determined by sizeof(updated object). But this is less true with a b-tree given my working-set assumption above. With a b-tree whether the updated object is only a few bytes or a few KB, if that object fits in one page and page write-back is usually done for pages with only one modified object then the write-back cost is less dependent on object size -- because the write-back cost is 1 page.

A rule of thumb from the previous paragraph can be that a b-tree write-back overhead is less sensitive than an LSM to large objects as long as sizeof(object) <= sizeof(page).

From the read perspective, if you need all or most of the fields of a large document then it will be more efficient to keep them in a large document and avoid the joins. It will also make better use of the cache memory because the SQL DBMS approach might end up with pages in cache where only one row of such pages are getting accessed.

Efficiency overview

I focus my speculation on efficiency. But being more efficient frequently implies being more performant. The efficiency areas are:
  • storage - how are the larger objects stored in the index structure? In general it is another source of (reasonable) complexity to understand if you want to be a power user.
  • write-back - for a b-tree this means the work (CPU & IO) to write back dirty pages. For an LSM this is the work done by compaction and memtable flush. The overhead for this is measured in the amount of CPU, random page writes and KB written per transaction. The relative importance of random page writes vs KB written depends on your storage technology (random page writes hurt more with spinning disk than with SSD). Note that with compression the CPU overhead includes the cost to compress the data being written back or compacted. There can also be an overhead from reading data -- obviously with compaction, but also with read-modify-write for a b-tree as pages can't be modified unless they are in-memory.
  • redo log - the redo log lets a write transaction do a small amount of writing to the log to defer a larger amount of writing needed by write-back
  • replication - the amount of data that must be written to a replication log and sent over the network per transaction. If log shipping is used then this is just a repeat of the redo log analysis. Postgres does log shipping and there has been work to support statement-based. MySQL started with statement-based and then moved to row-based. My knowledge of MongoDB internals is weaker but I assume it is document-based, which is similar to MySQL's row-based. '
  • partial updates - when a large document gets a partial updates (a few fields are changed) does the DBMS optimize for that case WRT to replication, redo log and write-back?
Efficiency analysis

Disclaimer, this section is more truthy than true. I am curious to learn which of the DBMS (Postgres, MySQL, MongoDB) have optimizations for partial updates to large documents but not curious enough to do the research.

Clustered b-tree (InnoDB)

  • storage - objects are stored inline in the clustered index leaf pages if they are small enough. The definition of small enough depends on the row format but certainly it must be smaller than the InnoDB page size. I wonder if a clustered index is the right structure for large objects unless the large objects are always stored in overflow pages. Percona has a great summary and the reference manual has more info. But large objects are likely to be stored in overflow pages (meaning there can be more random IO to read them) and overflow pages are not shared by multiple blobs (meaning there will be more space amplification). Compression is possible via table or page compression. I am not a fan of the page compression approach but table compression worked great for me long ago. There are many recent improvements to InnoDB for large objects done to support JSON docs including support for partial updates, per-LOB indexes and compressed LOBs. I have yet to figure out whether I can enable compression for large JSON documents stored in LOB overflow pages without enabling table or page compression.
  • write-back - assuming overflow pages are used then write-back likely means writing back both the clustered index leaf page that references the overflow pages along with the overflow pages. This likely means more write amplification. Assuming the doublewrite buffer is enabled then there is extra write amplification from the doublewrite buffer when previously written pages are to be written again.
  • redo log - InnoDB doesn't write page images to the redo log, with one exception (compression) that will ignore for now. But the entire row must be written to the redo log unless there is something clever for partial updates.
  • replication - read up on the binlog_row_image option. But I am not sure if MySQL does anything clever with respect to the partial updates mentioned above. Can it avoid sending the full document when only a few fields have changed?
  • partial updates - yes, see here
Heap-organized b-tree (Postgres)
  • storage - objects are stored in the heap-organized table pages if they fit. When they don't read up on TOAST. I like that there compression can be enabled/disabled just for TOAST.
  • write-back - similar to InnoDB there might be extra write amplification from writing back both the heap table pages that reference TOAST columns and the out-of-line storage used to store the TOAST columns.
  • redo log - the first time a page is modified for a given redo log then that page is written to the redo log. For large objects the entire object must be written to the log.
  • replication
  • partial updates - I don't know whether that is optimized for JSON/JSONB.
Cow-R b-tree (WiredTiger, mostly unanswered because my internals knowledge has faded)
  • storage
  • write-back
  • redo log
  • replication
  • partial updates - I don't know whether MongoDB does anything special
LSM (MyRocks)
  • storage - large objects are stored inline unless you use Integrated BlobDB to get key-value separation. Compression is trivial.
  • write-back - the write-back cost is average-write-amp * sizeof(object)
  • redo log - the entire large object is written to the RocksDB redo log
  • replication - see binlog_row_image
  • partial updates - yes via the merge operator if you use pure RocksDB. Optimizations for large objects have yet to arrive.

Monday, September 19, 2022

Understanding some jemalloc stats for RocksDB

I have been struggling to understand the root cause for RSS being larger than expected during my usage of db_bench. In this case db_bench used jemalloc. Statistics for jemalloc are written to LOG every stats_dump_period_sec seconds and this is an abbreviated example of that output -- I hacked db_bench to disable the mutex related stats in the call to malloc_stats_print.

To keep this post short and because I am not a jemalloc expert I will only try to explain a few things in this output that might help you, and me in a few years when I have to understand jemalloc again.

The first is the number of arenas that are listed on the opt.arenas line. In this case there were 320 and I assume that jemalloc uses 4 arenas per CPU (the server had 80 CPUs). There is also a line with Arenas: 321 which is confusing but the difference isn't a big deal to me. Arenas are a way to shard the allocator to reduce mutex contention and when threads are (hopefully) assigned to a different arenas. In my usage with db_bench the threads are long-lived and I assume the mapping of thread to arena doesn't change. I don't know whether that is a potential source of perf problems.

The next interesting line starts with Allocated: and displays a few counters for currently allocated memory. The value next to Allocated: is ~51GB which is reasonable given I configured RocksDB to use 50GB for the block cache. 

This is followed by an interesting section that starts with Merged arena stats. This has various counters aggregated over all 320 arenas. Start with the assigned threads: 10 line. That means that 10 threads are currently assigned to arenas. While per-arena stats might also be useful there are many arenas, that would require a lot of output to print and read and I have not been able to get malloc_stats_print to return results for each of the 320 arenas (it stops after 4) so I prefer just using malloc_stats_print(..., opts="a") to disable per-arena stats. If you use opts="xa" then per-arena and mutex counters are not displayed which makes the output easier to read.

Next are lines that start with small: and large: to show the amount of memory currently allocated for small and large allocations sizes. Most of the allocations in this case are small and <= 8kb.

The next section has counters for active, mapped, retained and more. These are frequently interesting but not so much in this case.

Given that most allocations are small the following section for bins: is the most interesting. The bin with size 8 is used for all allocations of <= 8 bytes. The bin with size 16 is used for all allocations between 9 and 16 bytes. This section has long lines and clicking on Raw improves readability but I can't link to lines in the Raw view.

I wanted to know the bins that accounted for the largest amount of currently allocated memory. Sorting these lines by curslabs * pgs would provide that. Alas I can't sort that in the gist so I just look for the lines with the largest value in curslabs. The curslabs column is the number of slabs (slab is better explained by an expert, not me) and pgs is the size of a slab. The unit for pgs is 4kb in this case so curslabs * pgs * 4096 is the number of bytes currently mapped for a given bin.

In this case the bins using the most memory have sizes 32, 96, 448 and 8192. Note that this is from db_bench configured to use Integrated BlobDB with 400-byte blob values and blob values share the block cache:

  • bins of size 32, 80 (not in this case though) and 96 are commonly used. Their usage is described by an expert as the 32-byte bucket corresponds to the blob-owning object, the 80-byte bucket to the block-owning object, and the 96-byte bucket to the cache metadata (which is an overhead that’s applicable to both blocks and blobs)
  • the bin of size 448 is used by the 400-byte blob values stored in the RocksDB block cache
  • the bin of size 8192 is used by the uncompressed data blocks that are approximately 8192 bytes each because the block_size option was set to 8192.
The util column is also interesting. This lists the utilization fraction, a value between 0 and 1. Multiply by 100 to get the utilization percentage. I read the source code to learn how it is computed. It is curregs / (regs * curslabs) where curregs is the number of currently used regions, regs is the number of regions per slab and curslabs is the number of slabs.

In this example the util value is low (~0.7) for bins of size 32, 96 and 448 but large (1) for the bin of size 8192. To me this indicates there is fragmentation for the bins of size 32, 96 and 448 and removing that fragmentation will reduce RSS and improve memory efficiency. Alas I have yet to figure how to make that happen. The root cause with respect to RocksDB is that at test start the blob values get a larger fraction of the block cache, over time that fraction is reduced and the difference between the peak memory usage near startup and the usage over time became significant (by my standards) as RSS was ~1.15X the size of the RocksDB block cache.

Many of the columns are explained in the man page, however some of the column names used in the malloc_stats_print output are different from what is used in the source code and explained in the man page. One example is the usage of slab_size in the man page vs pgs in the malloc_stats_print output where pgs * 4096 = slab_size. The source for printing per-arena metrics is in stats_arena_print. But you can find curslabs in the man page.

Thursday, September 15, 2022

Configuring the RocksDB block cache

I learn by making mistakes and the rate at which RocksDB improves makes it easier for me to make mistakes. My mistake here was reading that RocksDB has high and low priority classes for the block cache and assuming there was a medium priority. There is a third priority but its name is bottom and it is lower than low priority.

RocksDB has had an option to do key-value separation for many years. That was recently rewritten so it could be feature complete with the RocksDB API. The official name for the feature is Integrated BlobDB but I just call it BlobDB.

I am excited about BlobDB and have begun testing it after adding support for it to tools/ But it brings a new challenge with respect to the RocksDB block cache that stores uncompressed data, index and filter blocks in memory. 

Thanks to a great intern project, with BlobDB there is now another thing that can be stored in the block cache -- blob values. The usage of the db_bench option named use_shared_block_and_blob_cache option shows how that is enabled. There are 3 priority classes that can be used by the block cache: high, low, bottom.

The options that control this are:

  • high_pri_pool_ratio
    • The fraction of the cache devoted to filter and index blocks.
  • low_pri_pool_ratio 
    • When zero then data block and blob values have the same priority. Otherwise, this is the fraction of the cache for data blocks and blob values get the remainder (1 - high_pri_pool_ratio - low_pri_pool_ratio). The sum of high_pri_pool_ratio and low_pri_pool_ratio should be <= 1.

The priority for storing things in the cache is one of the following depending on how you configure it:

  • everything has the same priority
    • when high_pri_pool ratio=0 and low_pri_pool_ratio=0
  • index & filter blocks have high priority, data blocks and blob values have low priority
    • when high_pri_pool ratio > 0 and low_pri_pool_ratio=0
  • index & filter blocks have high-pri, data blocks have low-pri, blob values have bottom-pri
    • when high_pri_pool ratio > 0 and low_pri_pool_ratio > 0
  • data, index & filter blocks have low priority, blob values have bottom priority
    • when high_pri_pool_ratio=0 and low_pri_pool_ratio > 0

How to ask questions

This is a follow-up to my post on how to answer questions. The audience for that post was people doing in something in production that others have a strong opinion about. The audience for this post is people with strong opinions about what others are doing in production.

One of these is a great way to start a discussion:

  • You are doing X, you should be doing Y.
    • This is isn't even a question and it isn't a productive way to start a discussion. With respect to supporting a social graph OLTP workload I have been told on separate occasions that I should be using an in-memory DBMS, an OLAP DBMS, a graph DBMS and a NoSQL DBMS (ScyllaDB). In two of those cases the people offering that advice were quite famous. Alas, nobody offering that advice followed up and submitted a PR for Linkbench to show the benefits of such a change.
  • Why are you doing X!
    • Still not a question. Still not a great way to start a discussion.
  • Why are you doing X?
    • We have a winner. I still remember being quizzed by Peter Bailis, while at HPTS, about the social graph OLTP workload I supported. I did my best to answer his questions. It was a great discussion and I was impressed by his curiousity.

Wednesday, September 14, 2022

Using SPIN to validate the InnoDB read-write lock

I hope to being learning how to use TLA+ and that reminded me of prior work I did with SPIN.

I have written two blog posts (here and here) about my usage of SPIN to validate concurrent code. The most important usage was for the rewrite of the InnoDB read-write lock that my team did at Google. It was important because that model helped to convince the InnoDB team to accept the change. I hope to being learning about TLA+ 

Within Google the project was mostly done by Ben Handy and I helped. This was ~15 years ago so my memory is vague. I remember reading the SPIN book to learn how to write a model for the algorithm. I also remember finding a machine with a lot of RAM and then doing what I could to reduce the memory requirements to run the model. Reducing the amount of concurrency tested was one such change. Finally, I remember enjoying SPIN.

I didn't share the model before but a friend, and InnoDB/TiDB expert, provided me with a copy. That model can now be read here.

Tuesday, August 23, 2022

How I do RocksDB performance tests, part 2

This extends on my previous postThis post isn’t specific to RocksDB. It also has more opinions and might serve as speaker notes were I to write slides. I am writing this prior to speaking to peers about my work so it might have an audience of one (me) but that is typical of many of my posts. Regardless, I appreciate that people read and engage with some of my posts.


  • How did I get here?
    • Long ago I worked on DBMS internals full time - I added features, fixed bugs and created bugs. Then I moved to web-scale MySQL at Google and started to spend time in production. Production is a great education but it came at the cost of less time for new features. I still spent much time finding and fixing bugs. After a few years I moved to Facebook and the trend continued. Over time I stopped doing feature development, spent much less time fixing bugs but still spend time reporting bugs. I read a lot of code when trying to explain things, but I don't write much that makes it upstream. I have enjoyed the change. I don't need to write code because I am surrounded by talented developers. I can specialize in my thing, and others specializing in their thing. It is hard to be expert in too many things.
  • Benchmarks are what you make of them
    • They are far from perfect but they are quite useful. Testing by comparing things and explaining the results makes them more value. Benchmarks informed by production are even better.
  • How does code age?
    • Single-thread performance on CPUs isn't improving like it used to. Long-lived code tends to attract CPU regressions. This combination is a problem. Good regression tests help spot the problems and spotting them early is a big deal because removing them long after they arrive is too expensive. This isn't just a technical problem. How do you reject new features that help a fraction of the user base when the cost if more CPU overhead for all users?
  • Needs improvement
    • I hope to get better about using benchmarks that avoid coordinated omission, have more statistical rigor, expand beyond single-node tests, use benchmark workloads that are adaptive and use benchmarks that enforce response time constraints.
  • Build a network of smart peers
    • I have been fortunate to have many talented peers. I engage with Postgres experts on Twitter and have also met smart people who work on upstream projects via bug report discussions. 
  • Explain things
    • Explain your results. But find a balance because explaining everything will slow you down.
  • Testing an LSM is complicated
    • Old posts on this are here and here.
    • The shape of an LSM tree has more variance than the shape of a B-tree. This can be a source of variance in benchmarks, especially in read-heavy ones. While this is still a work in progress there are db_bench commands to make the LSM shape more deterministic (flush memtable, compact L0, compact L1, wait-for-compaction).
    • Another problem is a test that inherits a non-deterministic amount of compaction debt. If the sequence is: —benchmarks=write-heavy,read-heavy then the read-heavy step might suffer from compaction debt inherited from write-heavy. The impact of reducing this debt during the read-heavy step can vary and produce confusing results for the read-heavy step.
    • Try to get the LSM tree into a steady state before read-heavy tests. For example, after fillseq there is no overlap between SSTs. After a full compaction there is only one level (or one SST). These are usually not steady states.
    • For a load + query benchmark it is easy for the LSM (or B-Tree) to not be in a steady state after the load and many benchmarks suffer from this. If the load is done in key order then the PK index won’t be fragmented with a B-Tree and the SSTs won’t overlap with an LSM — which hides some of the overhead that occurs during query processing. When storage is a local attached SSD and the workload is heavy on IO then you need to worry about non-determinism from the SSD — you either want no SSD GC or to get SSD GC into a steady state (by running the test for long enough and having database files that are large enough, something between 50% and 90% of the device capacity).
  • Make the DBMS unhappy
    • Find ways to make the DBMS unhappy and see if it falls over. The challenge is that there are more and less realistic ways to make a DBMS fall over. An easy way to make a DBMS unhappy is to provide it with too many concurrent requests, especially a DBMS that doesn’t provide admission control (like RocksDB). But some problems are best fixed elsewhere because fixes have an opportunity cost. It might be nice to have an optional RocksDB API that implements admission control. 
  • Define your goals
    • Do you care about average throughput or outliers (p99, p99.9, p99.99). I have a post on this. Average throughput is easy to measure but p99 and beyond (p99.9, p99.99) matters in production because outliers determine user experience and capacity planning is based on p99. While single-valued metrics like p99 are easy to share, graphs for throughput over time at 1-second intervals make it easier to spot problems like stalls, cyclic behavior or throughput that degrades over time.
  • Statistical rigor
    • Statistical rigor is great but can be expensive. Repeating every benchmark 3 times increases the accuracy of your results at the cost of 3X more HW. I usually get less rigorous statistical rigor because I frequently repeat benchmark runs because I made a mistake or need to measure one more thing. Another way to think of this is: assume B units of HW capacity, each benchmark has a warmup cost of W and runtime of R. Then solve for N in B = N(W+R) where N is the number of times the benchmark is repeated. A larger value for N implies a smaller value for R and the confidence interval is a function of both N and R.
  • Coordinated omission
    • Coordinated omission is a real problem. All of the benchmark clients that I use suffer from it, yet they are still useful. Two things prevent me from doing open-loop benchmarks. First, the benchmark clients I use don’t support it and it takes work to implement a new benchmark client and incorporate it into my workflow. Second, an open-loop benchmark takes more work to setup as I need to discover an arrival rate that the DBMS can handle — or I need a more complicated client that can discover it for me. One day I will use open-loop clients.
  • Response time constraints
    • The db_bench benchmark client for RocksDB doesn't have an option to use response time constraints (ignore responses slower than X ms). Another problem is computing throughput without a response time constraint. More concurrency usually means more throughput, but it also means worse response time and more response time outliers. Those slow responses should not be counted. Most of the benchmark clients that I use don’t enforce a response time SLA. Such an SLA is more work, you need to select a reasonable value, but I hope to improve with this. I hope to add them to db_bench.
  • Single node
    • Most of my testing runs the client and server on the same server. While I prefer to use separate servers for client & server when the DBMS supports it, that introduces the risk of perf variance because I will be sharing the network.
  • Stable platform
    • I use HW at work, in the public cloud and my home test cluster. My work HW has value-added services that consume a variable and occasionally significant amount of compute and storage so I am wary of using it for low-concurrency benchmarks. Public cloud HW means I am using a VM and might be sharing compute and storage with noisy neighbors so I found a way to reduce the CPU variance by using the largest number of CPUs for a given instance type and disabling HT. From quick tests with fio there wasn't much variance in the cloud block storage I chose. My home HW is the most stable after I disabled HT and turbo boost. Alas, it is also the least capable — 4 CPUs, 16G of RAM.
  • Compare things 
    • I rarely test one thing in isolation because the results are hard to interpret. So I do A/B or even A/B/C/D/... testing where these represent different DBMS, different versions of the same DBMS or different configurations for one version of one DBMS.
  • Measure things
    • Start with throughput, then add variance, then add CPU, IO and memory. Foreground CPU and IO can remain constant while background CPU and IO change significantly and most DBMS do much work in the background (excluding SQLite which doesn’t have background threads. Don’t forget to watch VSZ/RSS for the DBMS processes because increases there might lead to OOM. Has disk space usage increases because that can lead to out of space errors. When something is slower search top down. Look at iostat metrics to see if IO/query has changed. Look at vmstat to see if CPU/query has changed. Look at vmstat to see if context switches/query has changed (mutex contention?). Normalize your metrics — IO/query, CPU/query, context switches/query. I frequently have scripts running that scrape output from ps and top. To watch for disk space issues I have a script that runs du and ls in a loop during benchmarks.
  • Summarize things
    • One practice I have it to create one line performance summaries with useful numbers for throughput, HW (CPU/storage/memory/disk space) usage, normalized HW usage (CPU/query, IO/query). One line summaries make it easy to compare performance when A/B or A/B/C/D/... testing is done. They also make it easy to spot regressions that don't directly hurt throughput but are a concern -- larger RSS for the DBMS process, more disk space used, more CPU consumed by background threads. The summaries also provide a starting point when I try to explain a performance change. An example is here.
  • Name & archive things
    • A mistake I have made many times is starting a benchmark, getting interrupted for a week and forgetting an important detail about the benchmark when I return. Naming patterns reduces the change of this. I try to archive the test scripts and command lines via Github. Saving RocksDB LOG files is also important. All of my important scripts are in Github.
  • Adaptive benchmark clients
    • I often have to assume things when configuring a benchmark client. The number of threads that db_bench uses for clients is currently fixed. It would be nice to have some benchmarks that increase the request rate or number of request clients over time or until a response time constraint is violated. I currently do this manually and my solution is sad.
  • Proactive vs reactive
    • Is it a bug when it has yet to happen in production? That is an interesting question. The answer requires nuance. Some bugs do happen but have yet to be noticed, some bugs might happen and are worth avoiding other bugs just aren't worth fixing. It isn't always easy to classify a bug into one of these groups.

Friday, August 19, 2022

When is a secondary index-only query not index-only for InnoDB and Postgres?

For Postgres and MySQL/InnoDB an index-only query plan that uses a covering secondary index might still require access to the base row for some of the index entries read from the secondary index. Most of the time index-only really is index-only. But in some cases there will still be fetches of the indexed row and power users should know when this can happen.

This doesn't happen with MySQL/MyRocks.


One reason for this is recent transactions. The issue for Postgres is well-documented and widely known -- the visibility map. I won't explain it because experts have done a better job than I can. I don't know whether Postgres has performance counters to monitor the frequency of this.

One interesting variant of this with Postgres was for insert-only workloads that I encountered with the Insert Benchmark. The problem was that prior to PG 13 autovacuum wasn't triggered by insert-only workloads, thus visibility map bits were not set for pages full of new inserts and index-only scans weren't really index only for deployments that relied on autovacuum.

The result is that for some index entries read from the secondary index for an index-only query plan, Postgres might have to fetch the row by using the tuple id to read from the heap-organized table. This can result in unexpected random IO. 


The issue for InnoDB is less widely-known and not well documented. I don't think there are performance counters to monitor the frequency for this. Long-ago when I supported InnoDB in production I added some to the FB patch.

The best doc I found in a quick search is this slide deck (thanks Marko from MariaDB) and I will quote parts from it:

  1. secondary indexes have a PAGE_MAX_TRX_ID
  2. PK index entries have a DB_TRX_ID per entry, secondary index entries do not
  3. See here for a description of DB_TRX_ID
  4. Secondary indexes may contain multiple (indexed_col, pk) records pointing to a single pk, one for each value of indexed_col
  5. All versions except at most one must be delete-marked
  6. If a secondary index record is delete-marked, MVCC must look up pk in the PRIMARY index and attempt to construct a version that matches indexed_col, to determine if (indexed_col, pk) exists in the read view
  7. If the PAGE_MAX_TRX_ID is too new, for each record in the secondary index leaf page, we must look up pk
  8. For old read views, secondary index scan can be very expensive!
InnoDB looks up the base by searching the clustered PK index using the PK columns stored in the secondary index entry.

While the InnoDB docs have some description of this, they aren't as clear as the slide deck above. 

When a secondary index column is updated, old secondary index records are delete-marked, new records are inserted, and delete-marked records are eventually purged. When a secondary index record is delete-marked or the secondary index page is updated by a newer transaction, InnoDB looks up the database record in the clustered index. In the clustered index, the record's DB_TRX_ID is checked, and the correct version of the record is retrieved from the undo log if the record was modified after the reading transaction was initiated.

If a secondary index record is marked for deletion or the secondary index page is updated by a newer transaction, the covering index technique is not used. Instead of returning values from the index structure, InnoDB looks up the record in the clustered index.

That page also explains another detail about secondary indexes that many InnoDB users might not know. Secondary index entries are not updated in place. An update is implemented as delete-mark existing entry, insert new entry. Eventually purge will physically remove the delete-marked entry after it is no longer visible to current snapshots. Thus InnoDB secondary indexes risk suffering from bloat and churn which is worse with long-open snapshots that block purge -- because those delete-marked entries can linger leading to excessive page-splits.
I wrote above that MyRocks doesn't have this problem and that is true. It can have problems with bloat/churn after secondary index maintenance that I mention above for InnoDB. Fortunately the SingleDelete optimization helps a lot to avoid that as described in the VLDB paper. SingleDelete allows for faster removal of tombstones as normal (Delete) tombstones are frequently not removed until reaching the max level of the LSM tree, and MyRocks uses SingleDelete for secondary index maintenance.

The reason that tombstones are frequently not removed until reaching the max level of the LSM tree is that the removal check chooses to be fast and has many false positives. During compaction when a tombstone is found, if the key for the tombstone is covered by the key ranges of SSTs in larger levels, then it is not dropped because the key might exist in those larger levels. The number of false positives could be reduced at the cost of more CPU, and possibly some IO, if after confirming overlap the bloom filters in those larger levels were probed.

Thursday, August 18, 2022

How I do performance tests for RocksDB

My primary tool for doing RocksDB performance tests is db_bench, but as always there are layers of shell scripts to automate the process. First, I will define two terms. A benchmark is a sequence of benchmark steps. A common pattern for big data benchmarks is load and then query where load and query are the steps. I do small data (OLTP) so the common pattern for me is load and then several read+write steps. To save time the load step can be replaced by copying from a backup or snapshot.

I have shared notes on how to do benchmarks for an LSM with a focus on RocksDB and LevelDB. These are based on the many mistakes I have made and occasionally learned from: see here and here.

The layers of shell scripts are:

  • tools/ - runs one benchmark step by invoking db_bench
  • tools/ - runs a benchmark by invoking a sequence of benchmark steps
  • - selects configuration options based on HW size then calls
  • - selects configuration options based on workload (IO-bound vs cached) then calls
I also forked the scripts above to work on older versions of RocksDB (versions 4 and 5). The forks are here and were done to reduce the complexity in the scripts above.

The scripts also let me run a benchmark for many versions of RocksDB. When I test many versions it might take more than a week for to finish (tmux and screen help here), but it runs unattended which saves me time and in the end I get ascii formatted reports that make it easy to compare results across versions.

RocksDB has many configuration options. The script has good default values for many options, but the values for some options depend on HW size (amount of RAM, number of CPUs) which is one reason there are layers of scripts.

The benchmark tool, db_bench, also has many configuration options to select different benchmark steps (load in key order, point queries, etc). The primary purpose for is to run the benchmark steps (and configure db_bench to do that) in an order that is useful to me and minimizes noise. For example, if benchmark steps were to follow overwrite then they might inherit a large and varying amount of compaction debt which would cause variance. The sequence of benchmark steps begins here.

The script selects values for RocksDB configuration options based on HW capacity. For now I have hardwired this based on known servers that I use (see here) but eventually this will accept CPU and memory sizes as input and go from there.

The script is invoked by me. It defines four workloads:
  • byrx
    • byrx is short for cached by RocksDB and the database fits in the RocksDB block cache
  • byos
    • byos is short for cached by OS and the database fits in the OS page cache but is larger than the RocksDB block cache. This simulates fast storage for reads and lets me increase stress on the RocksDB block cache code.
  • iobuf
    • iobuf is short for IO-bound with buffered IO. The database is larger than RAM and RocksDB uses buffered IO.
  • iodir
    • iodir is short for IO-bound with O_DIRECT. The database is larger than RAM and RocksDB uses O_DIRECT for user reads and compaction.
An example command line is:

nohup bash 22 no 1800 c30r240 40000000 2000000000 iodir &

Benchmark steps

Note that I have pending changes for and that are not yet pushed upstream.

The benchmark steps are:
  • fillseq - loads the database in key order. There is not much compaction debt when this finishes.
  • read-only
    • revrange - reverse range scans, the real use for this is to let compacton catch up
    • fwdrange - forward range scans, this has too much noise that I have yet to explain
    • readrandom - point queries
    • multireadrandom - more point queries, but enhanced by io_uring
  • fragment the LSM tree, prior to this keys of SST files do not overlap
    • overwritesome - does overwrite with –num set to 10% of the keys
    • flush_mt_l0 - flushes the memtable, flushes the L0 then waits for compaction to catch up
  • read+write - perf is reported for reads, the background writer has a 2MB/s rate limit
    • revrangewhilewriting - short, reverse range scans
    • fwdrangewhilewriting - short, forward range scans
    • readwhilewriting - point queries
  • write-only
    • overwrite - the writer does not have a rate limit. If there were more write-only tests that followed then I would use overwriteandwait which waits for compaction to finish when overwrite ends.

Thursday, July 21, 2022

The impact of max_background_threads, part 2

This is a follow to the results I published for the impact of max_background_jobs on RocksDB performance. The previous post used an IO-bound workload. This post uses a cached workload. The purpose for this work is to evaluate my rule of thumb that the number of busy threads should be <= the number of CPU cores. This is explained in detail in the previous post.


  • Similar to the results for IO-bound, while throughput increases with concurrency, so does variance so my rule of thumb still holds (keep the number of busy threads <= the number of cores).


In this post I use jobs=X to indicate the value for max_background_jobs. Tests were run for jobs=8 and jobs=16. Command lines are here for a run that uses max_background_jobs=8 and 8 client threads. The IO-bound test used a database with 4B key-value pairs. This test uses 40M key-value pairs so the database fits in the RocksDB block cache. See the previous post for more details.

Results: throughput

The benchmark summaries are here for 1, 8, 16, 24, 32, 40, 48, 56, 64, 72 and 80 client threads.

For all of the throughput graphs below except overwrite, the improvement with concurrency after 32 client threads isn't as good compared to the IO-bound results in the previous post. This is expected because the results here are for a CPU-bound workload, the server has 40 CPUs with 80 HW threads, and hyperthreading doesn't double compute capacity. The overwrite case is special because 1 client thread is sufficient to saturate throughput and beyond that the extra client threads just interfere with compaction threads.

Graphs for throughput are next. The first graph is for readrandom. I don't share the graph for fillseq because the benchmark step finishes in less than 1 minute. Throughput improves with concurrency but the improvement degrades with more concurrency.

The graph for fwdrangewhilewriting is similar to readrandom.

The graph for readwhilewriting is similar to readrandom.

The graph for overwrite shows that 1 client thread is sufficient to saturate throughput. That isn't a surprise because fsync isn't enabled and writes are just inserts into an in-memory table. The average throughput is usually, but not always, better for jobs=16 than jobs=8.

Results: p99.99 response time

The p99.99 response time graphs are similar. Performance degrades once the CPU becomes saturated around 40 client threads. The first graph is for readrandom.

The next graph is for fwdrangewhilewriting.

The next graph is for read while writing. The curve here is straighter than the curve above for fwdrangewhilewriting. I have yet to explain that.

And the final graph is for overwrite. There is an interesting artifact between 64 and 80 client threads. I have yet to explain that.

Results: throughput vs time

The next graphs display throughput per 5-second interval. There are two graphs per benchmark step -- one for jobs=8 and another for jobs=16.

The readrandom results aren't a surprise. Throughput increases with concurrency as does variance - compare the results for 1 client thread (nt1) with 80 client threads (nt80).

The graphs for fwdrangewhilewriting have more variance with increase concurrency compared to readrandom. This is expected for two reasons. First, there is less CPU because client threads compete with compaction threads for CPU. Second, there is more mutex contention and can be more write stalls. In this case there weren't any write stalls but mutex contention is harder to measure. Also note that throughput doesn't improve much beyond 32 client threads. This is also OK.

Results for readwhilewriting are similar to fwdrangewhilewriting in that beyond 32 client threads there is more variance and not much improvement in throughput. This lack of improvement is OK but I hope we can reduce the variance. However there are differences. There is less variance in general here than for fwdrangewhilewriting. My theory is that the range queries in fwdrangewhilewriting are more sensitive to the shape of the LSM tree. The point queries benefit from bloom filters. The range queries do not and there is periodic behavior as new SSTs are added to the L0, then compacted, repeat.

The overwrite graphs show the most variance to the point that they are hard to read. Keep on reading and there will be other graphs. Note that 1 client thread is sufficient to saturate throughput for overwrite and extra threads just compete for the CPU with compaction threads and increase variance.

The results for overwrite have much variance even at 1 client thread. The results for jobs=8 and jobs=16 are similar. The worst-case write stalls are less than 0.5 seconds.

The results for overwrite with 8 client threads have a different pattern than 1 client thread, but still much variance.

The last pair of graphs is for overwrite with 32 client threads. The results are similar to the graphs above for 8 client threads.

Results: overwrite

This section has a few more graphs for overwrite.

The p50 response times are similar for jobs=8 and jobs=16.

The p99 response times are similar for jobs=8 and jobs=16.

The p99.9 response times are similar for jobs=8 and jobs=16.

The p99.99 response times are similar for jobs=8 and jobs=16.

The max response times are similar for jobs=8 and jobs=16, but less so than the graphs above.

The graphs for write stall% are similar for jobs=8 vs jobs=16 but not similar to the graphs above. The best (lowest) case occurs for 1 client thread. The worst is for ~32 client threads. I don't know why it drops from there.

Wednesday, July 20, 2022

The impact of max_background_jobs on write throughput for RocksDB

I rely on rules of thumb when configuring a DBMS for a benchmark. This post is about configuring the number of background threads and the number of client threads for RocksDB to improve throughput and quality of service (low variance) with IO-bound workloads. The background threads are used for compaction and memtable flushes. The option is max_background_jobs

Eventually I need to evaluate or re-evaluate whether these rules of thumb are valid and I am doing that here. One rule I use is that the number of busy threads should be less than or equal to the number of CPU cores (real cores, not HW threads courtesy of hyperthreading) and the reason for that rule is to reduce variance for workloads that are either CPU-bound or write-heavy (write-heavy implies much CPU from compaction). I frequently run benchmarks to look for performance regressions and variance makes that search harder. By busy threads I mean #client-threads + #background-threads where #client-threads is the number of client threads used by the benchmark client and #background-threads is the value for max_background_jobs.

For this blog post I compare the performance between two RocksDB configurations that are identical with one exception. One uses max_background_jobs=8 and the other uses =16.

The tl;dr context is an IO-bound workload (many storage reads and writes) on a server with fast storage and 80 HW threads (40 cores, hyperthreading enabled).


  • Throughput improves with concurrency for up to 80 client threads for IO-bound workloads that are read-only or read-mostly. This comes at a reasonable increase in response time, usually <= 1.5x although fwdrangewhilewriting stops improving at 72 client threads.
  • Throughput for overwrite (write-only) degrades with concurrency. The peak is at <= 16 client threads. This isn't a surprise, as there is no fsync on write and the additional client threads just compete with compaction threads for CPU although the interference is interesting (keep on reading).
  • Throughput for overwrite is ~1.4X better with max_background_jobs=16 vs =8 on average. But the cost of better average throughput is more variance.
For now I will stick with jobs=8 when doing benchmarks on this hardware to get better quality of service but less throughput for overwrite.


Something I learned recently is that the number of flush threads is max_background_jobs/4 and the number of compaction threads is max_background_jobs minus the number of flush threads. The code is here.

I used, a server with 80 HW threads (hyperthreading enabled) and fast storage. The workload was IO-bound -- 4B KV pairs, ~1TB database and a low cache hit rate. The benchmarks were run for 1, 8, 16, 24, 32, 40, 48, 56, 64, 72 and 80 client threads using two configurations that only differed in the configuration value for max_background jobs -- 8 vs 16. Note that fillseq always uses 1 client thread.

I used a recent build of db_bench at git hash 7e2004a123. This has code that will become the 7.5 release. It includes two changes that greatly reduce the worst-case write stall (6115254 and b397dcd) but that is a topic for another post.

The sequence of benchmark steps is determined by and the interesting steps were: fillseq, readrandom, multireadrandom, fwdrangewhilewriting, readwhilewriting and overwrite. The *whilewriting steps used a 2MB/s rate limit on the background writer and were not write heavy. The fillseq step is write heavy but doesn't require compaction because inserts are in key order and the push down optimization is used. The final step, overwrite, is write heavy. Command lines are here for a run with 8 client threads and max_background_jobs=8.

I neglected to use --multiread_batched=true with the multireadrandom benchmark step so the results are mostly a duplicate of readrandom.

The benchmark summaries are here for 1, 8, 16, 24, 32, 40, 48, 56, 64, 72 and 80 client threads.

These benchmarks might suffer from coordinated omission because a fixed number of client threads is used. These results might also overstate the benefit of more throughput from more concurrency because the benchmark client doesn't enforce a response time SLA.

Results: throughput

The throughput graphs have results for 1, 8, 16, 24, 32, 40, 48, 56, 64, 72 and 80 client threads.

I used gnuplot for the graphs and did not set the y-range to zero. This was done to make it easier to see the differences but can mislead a drive-by reader. I also use jobs= rather than max_background_jobs= to save on typing. I also used lines rather than bars. Next time I might use bars.

Throughput for fillseq was slightly better with jobs=8 than =16. The difference is small and I won't try to explain it. The fillseq benchmark step does inserts in key order and always uses 1 client thread.

Throughput for readrandom improves with concurrency up to 80 client threads but the improvement degrades and USL might explain that. There is no difference between jobs=8 and =16. That is expected because this is read-only.

The results for fwdrangewhilewriting are similar to readrandom. There is no difference between jobs=8 and =16. That is probably because the write rate is low (2MB/s).

The results for readwhilewriting are similar to readrandom.

The throughput for overwrite is ~1.4X better for jobs=16 than =8. I didn't expect this but it is not a huge surprise. Write amplification during this test is ~17. Compaction IO statistics from the test end are here. The LSM tree has data in 8 levels so this is one example where the write-amp estimate of per-level fanout X number-of-levels is pessimistic.

Results: p99 response time

These graphs show the p99 response time in microseconds.

The results for readrandom and multireadrandom are interesting but I won't try to explain them.

The p99 response time for fwdrangewhilewriting shows that saturation happens after 72 client threads. That it happens at some point is expected.

The p99 response time for readwhilewriting doesn't show the saturation that is visible above for fwdrangewhilewriting.

The p99 response time for overwrite is the most interesting. Above it was shown that throughput for jobs=16 was ~1.4X better than for =8. But here the p99 response times are similar. While I haven't displayed it, the p50 response times are also similar. This will be explained below.

Throughput vs time

Next up is graphs for throughput vs time at 5-second intervals and I just wrote a note to myself change this to 1-second intervals for future tests. For each benchmark step there are two graphs, one for jobs=8 and another for jobs=16, with 11 lines per graph. There is one line for each number of threads, from 1 to 80 and they are labeled as nt$X on the graph where $X is the number of threads.

The results for fillseq show more variance at first glance for jobs=8 but that might be an illusion because the max for the y-range is larger for jobs=16. The throughput range in both cases is between ~100k/second and ~300k/second after 2000 seconds.

For readrandom the variance increases with the number of client threads. The result for 80 client threads (nt80) has the most jitter. Results for multireadrandom are similar and I won't show them here.

Results for fwdrangewhilewriting show increasing variance as the number of client threads increases.

Results for readwhilewriting show increasing variance as the number of client threads increases.

The overwrite benchmark step has the most interesting graphs although it can be hard to see that here given there are 11 lines per graph. That will be fixed below if you keep on reading. Important points are:

  1. The worst-case write stall (minutes long) doesn't occur because it was recently fixed. I will explain more about that soon in another blog post.
  2. The result for jobs=8 shows a regular pattern of a peak then a gradual decline.
  3. The result for jobs=16 has more variance.
Which do you prefer -- better average throughput or less variance? Obviously, both is my answer.

The graphs above for overwrite are busy so I also have graphs limited to a specific number of client threads -- 1, 8 and 32. The problem with variance for jobs=16 doesn't occur at 1 client thread (nt1) but does occur at 8 or more client threads. I hope to explain this one day. For now I will wave my hands and claim that the additional client threads interfere with the compaction threads -- for mutexes and CPU time.

Overwrite: the mystery about p50 and p99

Above I showed that overwrite has similar p99 response times for jobs=8 and jobs=16 but average throughput is ~1.4X better for jobs=16. I didn't show it but the p50 response times are also similar. But in the p99.9 and p99.99 results (no graphs but read the summaries linked above) the values are worse for jobs=16.

The problem is that outliers are worse for jobs=8 and one way to view that is via the response time histrograms printed at the end of the benchmark step. They are here for the nt1 (1 client thread) test. With jobs=16, 99.278% of responses take <= 10 usecs vs (see here and here).

While they are listed in the summaries linked above I don't have graphs for the write stall% from overwrite as this post is already too long. But from the linked summaries at 1 client thread the write stall% is 24.6 for jobs=8 vs 0% for jobs=16. At 8 client threads it is 54.2% for jobs=8 vs 33.1% for jobs=16.