Posts

Showing posts from September, 2022

Magma, a new storage engine for Couchbase

Many years ago I sat next to a Couchbase exec at a VLDB event who explained how awesome their new storage engine was, especially compared to RocksDB. That new engine was based on ForestDB and while the idea was interesting the engine was never finished. On the bright side I got to meet the clever person who invented the ForestDB algorithm and had fun evaluating the perf claims (blog posts A, B, C, D). At last, Couchbase has a new storage engine named Magma. The VLDB paper is worth reading, the engine design is interesting and the product is feature complete. The rest of this post explains Magma based on the paper and a few other posts. The Magma engine is an example of the index+log index structure and looks like a great upgrade from the previous engine (Couchstore). Motivation: Magma should eventually replace Couchstore as the Couchbase storage engine. Couchstore is a copy-on-write B-Tree (CoW-S). Compaction (GC) for Couchstore is single-threaded and not incremental (a dat

Storage engines, efficiency and large documents, rows, objects

Storage engines for OLTP focus on storing small objects, and by small object I mean sizeof(object) << sizeof(page). They can store larger objects but the engines are not optimized for that and both efficiency and performance will suffer. By larger objects I mean sizeof(object) >= sizeof(page). Given the growth of the document model we need some OLTP storage engines to be efficient and performant for larger objects. What might that mean? The rest of this post is speculation by me. I am far from an expert on this topic and will try to be careful. The topic would make for an interesting conference paper. This post is more truthy than true. It has speculation, unanswered questions and likely has several mistakes or things I omitted because I was not aware of them. I will focus on four index structures that are popular in my world: clustered b-tree (InnoDB), heap-organized b-tree (Postgres), CoW-R b-tree (WiredTiger) and LSM (MyRocks). At a high-level I want to understand where the sto

Understanding some jemalloc stats for RocksDB

I have been struggling to understand the root cause for RSS being larger than expected during my usage of db_bench. In this case db_bench used jemalloc. Statistics for jemalloc are written to LOG every stats_dump_period_sec seconds and this is an abbreviated example of that output -- I hacked db_bench to disable the mutex related stats in the call to malloc_stats_print. To keep this post short, and because I am not a jemalloc expert, I will only try to explain a few things in this output that might help you, and me in a few years when I have to understand jemalloc again. It helps to know a lot about RocksDB internals to understand which of the jemalloc bin sizes map to allocations in RocksDB. When you don't know that (as I don't know all of it) a heap profiler like Massif can be used. The first thing to note is the number of arenas listed on the opt.arenas line. In this case there were 320 and I assume that jemalloc uses 4 arenas per CPU (the server had 80 CPUs). There is a
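As a minimal sketch, not the db_bench code, the following shows how a C++ program might dump jemalloc statistics while omitting the mutex detail. It assumes jemalloc 5.x, where the opts string for malloc_stats_print accepts "x" to suppress mutex stats; the callback and output destination are illustrative.

    // Sketch: dump jemalloc stats without per-mutex detail (assumes jemalloc 5.x).
    #include <cstdio>
    #include <jemalloc/jemalloc.h>

    // Callback that writes jemalloc output to a FILE*; a real program might
    // append this to the RocksDB LOG instead of stderr.
    static void write_cb(void* opaque, const char* msg) {
      fputs(msg, static_cast<FILE*>(opaque));
    }

    int main() {
      // "x" asks jemalloc to omit mutex statistics; pass nullptr for the full report.
      malloc_stats_print(write_cb, stderr, "x");
      return 0;
    }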

Configuring the RocksDB block cache

I learn by making mistakes and the rate at which RocksDB improves makes it easier for me to make mistakes. My mistake here was reading that RocksDB has high and low priority classes for the block cache and assuming there was a medium priority. There is a third priority but its name is bottom and it is lower than low priority. RocksDB has had an option to do key-value separation for many years. That was recently rewritten so it could be feature complete with the RocksDB API. The official name for the feature is Integrated BlobDB but I just call it BlobDB. I am excited about BlobDB and have begun testing it after adding support for it to tools/benchmark.sh. But it brings a new challenge with respect to the RocksDB block cache that stores uncompressed data, index and filter blocks in memory. Thanks to a great intern project, with BlobDB there is now another thing that can be stored in the block cache -- blob values. The usage of the db_bench option named use_shared_block_and_blob_cach
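The sketch below shows one way to configure a cache that is shared by data, index, filter blocks and blob values in recent RocksDB releases. It is an illustration rather than the db_bench implementation: the capacity, pool ratio and blob thresholds are made-up values, and the option names (blob_cache, enable_blob_files, cache_index_and_filter_blocks_with_high_priority) should be checked against the RocksDB version you run.

    // Sketch: one LRU cache shared by block and blob reads (recent RocksDB).
    #include <rocksdb/cache.h>
    #include <rocksdb/db.h>
    #include <rocksdb/options.h>
    #include <rocksdb/table.h>

    int main() {
      rocksdb::LRUCacheOptions cache_opts;
      cache_opts.capacity = 8UL << 30;         // 8 GB shared cache (example value)
      cache_opts.high_pri_pool_ratio = 0.5;    // reserve space for index/filter blocks
      std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(cache_opts);

      rocksdb::BlockBasedTableOptions table_opts;
      table_opts.block_cache = cache;
      table_opts.cache_index_and_filter_blocks = true;
      table_opts.cache_index_and_filter_blocks_with_high_priority = true;

      rocksdb::Options options;
      options.create_if_missing = true;
      options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));

      // Integrated BlobDB: store large values in blob files and cache the
      // uncompressed blob values in the same cache as the blocks.
      options.enable_blob_files = true;
      options.min_blob_size = 4096;            // values >= 4 KB go to blob files
      options.blob_cache = cache;

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/blobdb_example", &db);
      if (s.ok()) delete db;
      return s.ok() ? 0 : 1;
    }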

How to ask questions

This is a follow-up to my post on how to answer questions. The audience for that post was people doing something in production that others have a strong opinion about. The audience for this post is people with strong opinions about what others are doing in production. One of these is a great way to start a discussion: You are doing X, you should be doing Y. This isn't even a question and it isn't a productive way to start a discussion. With respect to supporting a social graph OLTP workload I have been told on separate occasions that I should be using an in-memory DBMS, an OLAP DBMS, a graph DBMS and a NoSQL DBMS (ScyllaDB). In two of those cases the people offering that advice were quite famous. Alas, nobody offering that advice followed up and submitted a PR for Linkbench to show the benefits of such a change. Why are you doing X! Still not a question. Still not a great way to start a discussion. Why are you doing X? We have a winner. I still remember being quizzed by

Using SPIN to validate the InnoDB read-write lock

I hope to begin learning how to use TLA+ and that reminded me of prior work I did with SPIN. I have written two blog posts (here and here) about my usage of SPIN to validate concurrent code. The most important usage was for the rewrite of the InnoDB read-write lock that my team did at Google. It was important because that model helped to convince the InnoDB team to accept the change. Within Google the project was mostly done by Ben Handy and I helped. This was ~15 years ago so my memory is vague. I remember reading the SPIN book to learn how to write a model for the algorithm. I also remember finding a machine with a lot of RAM and then doing what I could to reduce the memory requirements to run the model. Reducing the amount of concurrency tested was one such change. Finally, I remember enjoying SPIN. I didn't share the model before but a friend, and InnoDB/TiDB expert, provided me with a copy. That model can now be read here.