
Showing posts from October, 2022

s/optimal/better/g - on reviewing conference papers

I spent a few years reviewing papers for database conferences. I think that is winding down. I was OK as a reviewer, definitely not great, and this summarizes my experience. For starters, I am in awe of good reviewers. As a reviewer you get to see feedback from the other reviewers after submitting your review. And I was always nervous while waiting to see the other reviews. Was my review an outlier? How much did I miss in my review? Reading good reviews after submitting a mediocre review is a great way to learn. As always, a key to success is to choose the right base case, especially if you want to show linear speedup or scaleup. Many of the papers use a DBMS that I know quite well (MySQL, RocksDB) as the base case. So I have frequent "someone on the internet is wrong" moments while reading such papers. My offer to provide (free) benchmark advice to research projects was ignored. But maybe that is OK, because I am already busy. My goal was to focus on the ideas in the paper and allow t

Quantifying storage on Linux

Some things are complicated but I understand them (RocksDB). Clearly that isn't too complicated, and the complexity might be a barrier to entry that boosts the demand for my skills. Other things are complicated and I don't understand them that well. Clearly those things are too complicated. Yes, I am trying to be funny, but what I wrote above might be true for many of us. In this case the things that I don't understand that well are the layers that support IO for a DBMS -- filesystems, the block layer and storage devices. It is likely that something in this post is factually incorrect and I am happy to be corrected. Some of my posts are thinly veiled attempts to get free advice from experts. The problem I am trying to understand this week is the size of IO requests at different layers of the stack while running RocksDB benchmarks. To be specific: are the reads being done at a multiple of 512 or 4096 bytes? And when might each be possible (O_DIRECT vs buffered IO)? From the details be
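One way to look at this, as a sketch: observe the request sizes at each layer while the benchmark runs. The process id ($DBPID) and device name below are placeholders for illustration, not details from the post.

  # Syscall layer: the sizes the DBMS passes to pread/pwrite ($DBPID is the db_bench process id)
  strace -f -e trace=pread64,pwrite64 -p $DBPID 2>&1 | head -100

  # Block layer: average request size per device (rareq-sz/wareq-sz, or avgrq-sz in older sysstat)
  iostat -x 1

  # Device level: per-request sector counts (assumes the database is on /dev/nvme0n1)
  sudo blktrace -d /dev/nvme0n1 -o - | blkparse -i - | head -100

If the syscall sizes are multiples of 4096 but the block layer shows larger average requests, then readahead or merging is happening below the filesystem; with O_DIRECT the block-layer request usually tracks the syscall size, modulo splitting and merging.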

Small servers for performance testing, v4

I am setting up my fourth cluster of small servers to test open source database software. Cluster might be an overstatement because each cluster is limited to 2 or 3 servers. The clusters were/are:
v1 - Intel NUC5i3ryh (5th gen core i3), 8G RAM, SATA disk for OS, Samsung 850 EVO m.2 for db
v2 - Intel NUC7i5bnh (7th gen core i5), 16G RAM, Samsung 850 EVO SATA for OS, Samsung 960 EVO m.2 for db
v3 - Intel NUC8i7beh (8th gen core i7), 16G RAM, Samsung 860 EVO SATA for OS, Samsung 970 EVO m.2 for db
v4 - Beelink SER 4700u with Ryzen 7 4700u, 16G RAM, WD Blue 1T SATA for OS, Kingston NVMe for db
v5 - Beelink SER7 7840HS with Ryzen 7 7840HS, 32G RAM, 1T m.2 SSD for OS, 2T Samsung 990 Pro for the database. This is also my first mini PC with a fan and the TDP is 65W, which is a bit larger than what came before. Passmark single-thread rating is 3771 vs 2532 for the v4 server.
v6 - SuperMicro SuperWorkstation 7049A-T with 2 sockets, 12 cores/socket, 64G RAM, one m.2 SSD (2TB, XFS). CPU is Intel Xeon

Early lock release and InnoDB

Early lock release has been in the news, and I almost forgot that we prototyped this for InnoDB while figuring out how to do group commit for both the InnoDB redo log and the MySQL replication log. A Facebook Note about that work is here, but the formatting isn't great as Notes for Pages have been deprecated. Had the feature made it into a release, it would have been documented, because it isn't good to surprise users with a feature that can make visible commits disappear after the primary DBMS crashes and recovers. It is even worse when the race is as simple as the DBMS process doing crash/recover, which is more common than the primary node's HW doing the same. In the former there is no protection; in the latter, enabling fsync on commit prevents it. Early lock release also made it into MariaDB and Percona via XtraDB. Again, that was documented. Since I don't want to lose the content for the note, I have republished it below. Several of the links no longer work. Content Group comm

Hyping the hyper clock cache in RocksDB

I previously tweeted about the performance improvements from the hyper clock cache for RocksDB. In this post I provide more info about my performance tests. This feature is new in RocksDB 7.7.3 and you can try it via db_bench --cache_type=hyper_clock_cache .... The hyper clock cache feature implements the block cache for RocksDB and is expected to replace the LRU cache. As you can guess from the name, the hyper clock cache implements a variant of CLOCK cache management. The LRU implementation is sharded with a mutex per shard. Even with many shards (I use 64) there will be hot shards and mutex contention. The hyper clock cache avoids those problems, but I won't try to explain it here because I am not an expert on it. At the time of writing this, nobody has claimed to be running this feature in production and the 7.7.3 release is new. I am not throwing shade at the feature, but it is new code and new DBMS code takes time to prove itself in production. Experiments My first se
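As an example of how to try it, a command line like the one below is roughly what I mean. The values for num, threads, cache_size and duration are placeholders, not the settings from my tests, and it assumes the database was created by an earlier load step.

  # Read-heavy test of the block cache against an existing database
  db_bench --benchmarks=readrandom --use_existing_db=1 \
    --cache_type=hyper_clock_cache --cache_size=$((8 * 1024 * 1024 * 1024)) \
    --num=100000000 --threads=32 --duration=300

  # Repeat with --cache_type=lru_cache to compare against the current default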

Reasons for writeback with an update-in-place B-Tree

Are there well-known names for the things that trigger dirty page writeback with an update-in-place B-Tree? Some of my performance discussions would be easier if I could point to those definitions. I spent many years keeping InnoDB, an update-in-place B-Tree, happy in production and a shorter time with WiredTiger, a copy-on-write random (CoW-R) B-Tree. At work I recently revisited the topic of checkpointing and wondered if there were good names for the things that trigger dirty page writeback. Alas, my InnoDB expertise isn't what it used to be. InnoDB implements fuzzy checkpointing. While the reference manual page for this is too brief, the advice on that page is excellent -- a larger redo log is good because that reduces write-amplification. A short pitch for fuzzy checkpointing is that it avoids writeback storms, the kind that used to plague Postgres. When you ask a storage device to write GBs of data in a short amount of time there will be a queue and requests (reads, log f
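For InnoDB, the triggers I have in mind map to configuration knobs. A sketch of that mapping as I understand it, with placeholder values rather than a recommendation:

  # innodb_log_file_size       - redo log size; a larger log delays checkpoint-driven writeback
  # innodb_max_dirty_pages_pct - writeback when the buffer pool has too many dirty pages
  # innodb_io_capacity         - pace of background flushing
  # innodb_lru_scan_depth      - writeback so that LRU eviction can find clean (free) pages
  mysqld --innodb_log_file_size=8G --innodb_max_dirty_pages_pct=90 \
         --innodb_io_capacity=2000 --innodb_lru_scan_depth=1024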