I am wary of papers with performance results for too many products.Too many means including results from systems for which you lack expertise. Wary means I have less faith in the comparison even when the ideas in the paper are awesome. I have expertise in MySQL, MongoDB, RocksDB, WiredTiger and InnoDB but even for them I have made and acknowledged ridiculous mistakes.
Database configuration is too hard. There are too many options, most of them aren't significant and the approach is bottom-up. I an expert on this -- in addition to years of tuning I have added more than a few options to RocksDB and MySQL.
This post was motivated by PostgreSQL. I want to run the insert benchmark for it and need a good configuration. I have nothing against PG with the exception of a few too many why not Postgres comments. The community is strong, docs are great and the product is still improving. But I think PostgreSQL configuration has room to improve -- just like RocksDB (here, here) and MySQL/InnoDB.
Too many options
A non-expert user lacks both the ability to choose good values for options and the ability to understand which options might be useful to set. My solution to too many options and most aren't significant is to use good defaults and split the option name space into two parts -- regular and expert. Regular options are set by most users because they matter for performance and don't have good default values. The amount of memory the DBMS can use is one such option - the default will be small.
Everything else is an expert option. These include options for which the default is great and options that rarely impact performance. There is a reason for expert options -- some workloads benefit from their existence and being able to set that option at runtime might avoid downtime. Options are also added early in the lifecycle of new features to allow developers to evaluate the new feature and choose good default values. But such options don't need to be exposed to all users.
The benefit from doing this is to avoid presenting a new user with tens or hundreds of options to consider. That is a lousy experience. And while X is too hard isn't always a valid complaint -- language (human and database query) is complex because they let us express complex idea -- I don't think we gain much from the current approach.
RocksDB has added functions that simplify configuration and even split the option namespace into two parts -- regular and advanced. This is a step in the right direction but I hope for more. I confirmed that most RocksDB options either have good defaults or aren't significant for my workloads and then published advice on tuning RocksDB.
The performance configurations I use for MongoDB/WiredTiger and MySQL/InnoDB are similar to my experience with RocksDB. I don't have to set too many options to get great performance. Alas, it took a long time to figure that out.
Top-down configuration is another approach that can help. The idea is simple - tell the DBMS about the hardware it can use and optionally state a few constraints.
The basic hardware configuration is empty which implies the DBMS gets everything it can find -- all memory, all CPU cores, all IO capacity. When a host does more than run a DBMS it should be easy to enforce that limit with one option for memory consumption, one for CPU, etc. The user shouldn't have to set ten options for ten different memory consumers. It is even worse when these limits are per instance -- limiting how much memory each sort buffer gets is a lousy way to manage total memory usage. IO capacity is interesting. AFAIK there was a tool included in RethinkDB that characterized IO capacity, PostgreSQL has a tool for fsync performance and we can't forget fio. But it is easy to be mislead about SSD performance.
The constraints cover things that are subjective. What is the max recovery time objective? How do you rank read, write, space and memory efficiency?
A great example of this is SQL Memory Management in Oracle 9i -- tell the DBMS how much memory it can use and let it figure out the best way to use it.
What about ML
I hope that ML makes it easier to discover the options that aren't significant and can be moved into the expert options namespace. But I prefer a solution with fewer tuning knobs, or at least fewer visible tuning knobs. I hope to avoid too many knobs (status quota) combined with ML. Lets make smarter database algorithms. If nothing else this should be a source of research funding, interesting PhDs and many papers worth reading.
While I appreciate that someone made the MySQL memory calculator available I wish this weren't needed. Setting memory limits based on peak concurrency means you will under-allocate memory in the normal case or instead you can over-allocate at peak concurrency and get OOM.