Posts

Showing posts from 2022

Benchmark(et)ing RocksDB vs SplinterDB

While I am a huge fan of research papers presenting storage engines that claim to be better than RocksDB, I am always wary of the performance results. A paper can be great despite an imperfect performance evaluation, so pointing out the imperfections doesn't take away from the interesting ideas in the paper. Also, as a believer in the (C)RUM Conjecture I want to know how the new thing is both better and worse, but papers mostly focus on the better parts and don't highlight what isn't better. One factor that determines the truthiness of a database benchmark is the number of DBMS that are compared. It is hard enough to gain expertise in one DBMS, and more DBMS == more chance of making a mistake. Here I present a result for RocksDB and SplinterDB. I am definitely not an expert on SplinterDB, so perhaps my results are more truthy than true. I read the SplinterDB paper and hope their research continues. However, the paper didn't have enough detail on how the benchmark was done...

Compiling MySQL 5.6 & 5.7 on Ubuntu 22.04

One of my hobbies is testing open source DBMS for CPU regressions, and for that I want to compare perf between old and new versions of the DBMS. Depending on the DBMS it can be a challenge to build the old DBMS with the current (modern) compiler toolchain. Using open source frequently means compiling from source, and compiling from source eventually means debugging a failed build. Alas, the proliferation of build tools means you are likely to be debugging a build tool you know little about. For me that included scons+MongoDB, cmake+MySQL, make/configure+MySQL, mvn+Linkbench, mvn+First_Robotics and make+RocksDB. Perhaps my debugging would be easier if there weren't as many build tools. Postgres might be an exception WRT compiling old versions - it works great. Alas, this isn't as easy with MySQL versions 5.6.51 and 5.7.39. Note that MySQL 5.6 reached end of life in 2021 but 5.7 doesn't reach that until next year. For MySQL 5.6, 5.7 and perhaps some 8.0 releases prior to 8.0.31, ...
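
A minimal sketch of what such a build can look like, not the post's exact recipe: it points cmake at an older toolchain on Ubuntu 22.04, and the choice of gcc-10 plus the non-Boost flags are my assumptions:

    # A minimal sketch, assuming gcc-10/g++-10 from the Ubuntu 22.04 repos
    # are old enough for the MySQL 5.7 source tree.
    sudo apt install -y gcc-10 g++-10 cmake make bison libncurses-dev libssl-dev
    tar xzf mysql-5.7.39.tar.gz && cd mysql-5.7.39
    mkdir build && cd build
    cmake .. \
      -DCMAKE_C_COMPILER=gcc-10 -DCMAKE_CXX_COMPILER=g++-10 \
      -DDOWNLOAD_BOOST=1 -DWITH_BOOST=$HOME/boost \
      -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DCMAKE_INSTALL_PREFIX=$HOME/mysql-5.7.39
    make -j$(nproc) && make install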

Insert benchmark: Postgres, InnoDB and MyRocks with low concurrency

This has results for the insert benchmark using Postgres, InnoDB and MyRocks. For an overview of the insert benchmark see here and here. Some information on the performance summaries generated by my test scripts is here. I used small servers and ran the test at low concurrency (1 or 2 threads) for cached and IO-bound workloads. The insert benchmark has several phases and the interesting phases are: insert-only without secondary indexes, insert-only with 3 secondary indexes and then range queries with rate-limited inserts. Performance reports are provided for:
Postgres versions 12.11, 13.7, 14.6 and 15.1: cached and IO-bound
InnoDB 5.6.49, 5.7.35, 8.0.21 and 8.0.31: cached and IO-bound
MyRocks 5.6.35 and 8.0.28: cached and IO-bound
Disclaimer - I used a small server and will soon repeat this on larger servers and with more concurrency. tl;dr: Postgres is boring: no CPU regressions from version 12 to 15 for cached and IO-bound workloads. The insert benchmark found a CPU regression...
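
To make the phases concrete, here is a hedged sketch of the workload shape. The table and index names are hypothetical; the real benchmark scripts define their own schema:

    # Hypothetical schema, just to illustrate the phases (names are made up).
    mysql -e "CREATE TABLE ib (pk BIGINT AUTO_INCREMENT PRIMARY KEY,
              a INT, b INT, c INT, pad VARCHAR(255))" test
    # Phase 1: insert-only load with no secondary indexes
    # Phase 2: add 3 secondary indexes, then continue the insert-only load
    mysql -e "ALTER TABLE ib ADD INDEX xa (a), ADD INDEX xb (b), ADD INDEX xc (c)" test
    # Phase 3: range queries run while inserts continue at a rate-limited pace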

SSD read response time: raw device vs a filesystem

I am trying to understand why 4kb random reads from an SSD are about 2X slower when using a filesystem vs a raw device. This reproduces across different servers but I am only sharing the results from my home Intel NUCs. The symptoms are that the read response time is:
~2X larger per iostat's r_await for reads done via a filesystem vs a raw device (.04 vs .08 millisecs)
~3X larger per blkparse for reads done via a filesystem vs a raw device (~16 vs 50+ microsecs)
Once again, I am confused. Is something below userland doing (if from-filesystem -> go-slow)? In theory, if the D->C transition reported by blkparse includes a lot of CPU overhead from the filesystem then that might explain this, but the CPU per IO overhead isn't large enough for that to be true here. My test scripts for this are rfio.sh and do_fio.sh. Update - the mystery has been solved thanks to advice from an expert (Andreas Freund) who is in my Twitter circle, and engaging with experts I would never get...
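
In the spirit of rfio.sh and do_fio.sh (these are not the exact scripts, and the device and file paths are assumptions), a minimal fio pair that contrasts the two cases looks like:

    # 4kb random reads from the raw device (requires root; use a device that
    # doesn't hold a filesystem you care about), then from a file on a filesystem.
    fio --name=raw --filename=/dev/nvme0n1 --rw=randread --bs=4k --direct=1 \
        --ioengine=psync --runtime=60 --time_based
    fio --name=fs --filename=/data/fio.dat --size=10G --rw=randread --bs=4k \
        --direct=1 --ioengine=psync --runtime=60 --time_based
    # While each runs, compare r_await:
    iostat -x 1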

s/optimal/better/g - on reviewing conference papers

I spent a few years reviewing papers for database conferences. I think that is winding down. I was OK as a reviewer, definitely not great, and this summarizes my experience. For starters, I am in awe of good reviewers. As a reviewer you get to see feedback from the other reviewers after submitting your review. And I was always nervous while waiting to see the other reviews. Was my review an outlier? How much did I miss in my review? Reading good reviews after submitting a mediocre review is a great way to learn. As always, a key to success is to choose the right base case, especially if you want to show linear speedup or scaleup. Many of the papers use the DBMS that I know quite well (MySQL, RocksDB) as the base case. So I have frequent "someone on the internet is wrong" moments while reading such papers. My offer to provide (free) benchmark advice to research projects was ignored. But maybe that is OK, because I am already busy. My goal was to focus on the ideas in the paper...

Quantifying storage on Linux

Some things are complicated but I understand them (RocksDB). Clearly that isn't too complicated, and the complexity might be a barrier to entry which boosts the demand for my skills. Other things are complicated and I don't understand them that well. Clearly those things are too complicated. Yes, I am trying to be funny, but what I wrote above might be true for many of us. In this case the things that I don't understand that well are the things that support IO for a DBMS -- filesystems, the block layer and storage devices. It is likely that something in this post is factually incorrect and I am happy to be corrected. Some of my posts are thinly veiled attempts to get free advice from experts. The problem I am trying to understand this week is the size of IO requests at different layers of the stack while running RocksDB benchmarks. To be specific: are the reads being done at a multiple of 512 or 4096 bytes? And when might that be possible (O_DIRECT vs buffered IO)? From the details...
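
One hedged way to answer that question is to histogram the request sizes that reach the device. This is a sketch, assuming blktrace is installed, the database is on /dev/nvme0n1, and blkparse's default output format:

    # Trace the device while the benchmark runs, then count read request sizes.
    # blkparse reports sizes in 512-byte sectors, so 8 sectors == 4096 bytes.
    blktrace -d /dev/nvme0n1 -o trace &
    # ... run the RocksDB benchmark here ...
    kill %1
    # Field positions assume blkparse's default format: $6 is the action
    # (D = dispatched to the device), $7 is RWBS, $10 is the size in sectors.
    blkparse -i trace | awk '$6 == "D" && $7 ~ /R/ {print $10 * 512}' | sort -n | uniq -c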

Small servers for performance testing, v4

I am setting up my fourth cluster of small servers to test open source database software. Cluster might be an overstatement because each cluster is limited to 2 or 3 servers. The clusters were/are:
v1 - Intel NUC5i3ryh (5th gen core i3), 8G RAM, SATA disk for OS, Samsung 850 EVO m.2 for db
v2 - Intel NUC7i5bnh (7th gen core i5), 16G RAM, Samsung 850 EVO SATA for OS, Samsung 960 EVO m.2 for db
v3 - Intel NUC8i7beh (8th gen core i7), 16G RAM, Samsung 860 EVO SATA for OS, Samsung 970 EVO m.2 for db
v4 - Beelink SER 4700u with Ryzen 7 4700u, 16G RAM, WD Blue 1T SATA for OS, Kingston NVMe for db
v5 - Beelink SER7 7840HS with Ryzen 7 7840HS, 32G RAM, 1T m.2 SSD for OS, 2T Samsung Pro 990 for the database. This is also my first mini PC with a fan and TDP is 65w which is a bit larger than what came before. Passmark single-thread rating is 3771 vs 2532 for the v4 server.
v6 - SuperMicro SuperWorkstation 7049A-T with 2 sockets, 12 cores/socket, 64G RAM, one m.2 SSD (2TB, XFS). CPU is Intel Xeon...

Early lock release and InnoDB

Early lock release has been in the news and I almost forgot that we prototyped this for InnoDB while figuring out how to do group commit for both the InnoDB redo log and the MySQL replication log. A Facebook Note about that work is here, but the formatting isn't great as Notes for Pages have been deprecated. Had the feature made it into a release it would have been documented, because it isn't good to surprise users with a feature that can make visible commits disappear after the primary DBMS crashes and recovers. It is even worse when the race is as simple as the DBMS process doing crash/recover, which is more common than the primary node's HW doing the same. In the former there is no protection; in the latter enabling fsync on commit prevents it. Early lock release also made it into MariaDB and Percona via XtraDB. Again, that was documented. Since I don't want to lose the content for the note I have republished it below. Several of the links no longer work...
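
For context on the fsync-on-commit point, this is the standard InnoDB setting, not the prototype's knob:

    # innodb_flush_log_at_trx_commit=1 makes InnoDB write and fsync the redo
    # log at commit, so a committed transaction survives a crash; values 0
    # and 2 trade that durability for throughput.
    mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit = 1"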

Hyping the hyper clock cache in RocksDB

I previously tweeted about the performance improvements from the hyper clock cache for RocksDB. In this post I provide more info about my performance tests. This feature is new in RocksDB 7.7.3 and you can try it via db_bench --cache_type=hyper_clock_cache .... The hyper clock cache feature implements the block cache for RocksDB and is expected to replace the LRU cache. As you can guess from the name, the hyper clock cache implements a variant of CLOCK cache management. The LRU implementation is sharded with a mutex per shard. Even with many shards (I use 64), there will be hot shards and mutex contention. The hyper clock cache avoids those problems, but I won't try to explain it here because I am not an expert on it. At the time of writing this, nobody has claimed to be running this feature in production and the 7.7.3 release is new. I am not throwing shade at the feature, but it is new code and new DBMS code takes time to prove itself in production...
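
As a hedged example of the comparison (the cache size, thread count and benchmark choice here are illustrative, not the post's exact settings), a before/after pair with db_bench looks like:

    # Same read-heavy test against an existing db: LRU cache, then hyper clock cache.
    db_bench --benchmarks=readrandom --use_existing_db=1 --duration=300 \
        --threads=32 --cache_size=$((32 * 1024 * 1024 * 1024)) \
        --cache_type=lru_cache
    db_bench --benchmarks=readrandom --use_existing_db=1 --duration=300 \
        --threads=32 --cache_size=$((32 * 1024 * 1024 * 1024)) \
        --cache_type=hyper_clock_cache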