Thursday, September 17, 2020

Performance results and DBMS experts

I used to assume that many performance results couldn't be trusted because the people sharing them lacked expertise in the system(s) under test. I am still wary of conference papers that present results for too many DBMS; my wariness grows with the number of systems considered.

But my opinion on this has changed. Now I prefer to focus on the results that a typical performance-sensitive deployment will get. By performance-sensitive I mean a deployment that cares enough about performance to spend some money and/or time to get it. This isn't a deployment that uses a default configuration, but it also might not be able to afford a full-time performance team and months of tuning.

There are far more deployments than experts, so it is safe to assume that expertise is in short supply. If we are to assign blame, then I prefer to blame the DBMS for being too complex, having too many sharp edges, and not being adaptive.

Lack of expertise in a DBMS isn't a mark of shame. Resources are finite (time to learn, brain capacity). Some of us can afford to fill our heads with DBMS trivia at the expense of learning other useful skills, but most people need to focus on things other than the DBMS to get their jobs done.

There is a big opportunity to build systems that provide better performance without excessive tuning. While ML seems to be the primary motivator for current research, I expect non-black-box approaches to deliver interesting results as well. By non-black-box approaches I mean designing algorithms that consider performance goals and know how to adjust to achieve them, which assumes they have some understanding of performance models. Can we call these self-aware index structures?
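To make the non-black-box idea concrete, here is a minimal sketch of what "an algorithm that understands its own performance model" might look like. Everything here is an illustrative assumption, not code from any real engine: the cost formulas are the textbook back-of-envelope estimates for a leveled LSM tree, and `pick_fanout` simply chooses the knob value that minimizes a user-weighted cost instead of requiring the user to set the knob directly.

```python
import math

def lsm_costs(fanout: int, total_gb: float = 100.0, memtable_gb: float = 0.1):
    """Rough leveled-LSM cost model (illustrative assumption, not a real engine).

    Returns (write_amp, read_amp, space_amp) for a given per-level fanout.
    """
    levels = max(1, math.ceil(math.log(total_gb / memtable_gb, fanout)))
    write_amp = fanout * levels     # each level rewrites data ~fanout times
    read_amp = levels               # one sorted run per level to check on a point read
    space_amp = 1.0 + 1.0 / fanout  # obsolete versions pending compaction
    return write_amp, read_amp, space_amp

def pick_fanout(goal_weights, candidates=range(4, 33, 2)):
    """Choose the fanout that minimizes the user's weighted cost.

    goal_weights = (write, read, space) relative importance; the engine,
    not the user, translates the goal into a knob setting.
    """
    wr, rd, sp = goal_weights
    def cost(f):
        w, r, s = lsm_costs(f)
        return wr * w + rd * r + sp * s
    return min(candidates, key=cost)
```

With this shape, a write-favoring goal selects a small fanout (fewer rewrites per level) while read- or space-favoring goals select a large one (fewer levels, less dead space), which is the kind of tradeoff an end user would otherwise have to reason about by hand.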

Managed databases are also an opportunity for scaling expertise. The service provider has the expertise and has a better chance of applying it across their many customers.


  1. You have it completely about-face. Yes resources are finite. The issue is that those resources are directed in utterly the wrong direction; writing code, thinking code is paramount (rather than outcomes/data), believing you can write code that is superior to those before you who studied for years on one algorithm, being “agile”, being “quick”.

    Databases are on the receiving end of this idiocy. No amount of technical nonsense can change that; databases are a solved problem: predicate logic & set theory are prerequisites for any true developer. Without that knowledge you cannot write robust and correct applications. Dijkstra knew this. This was 50 years ago. "Lack of expertise in a DBMS isn't a mark of shame". Wrong. Expertise in DBMS is the hallmark of real developers. Lack of expertise in logic & set theory, when saying you are a developer, means you are a charlatan. Stop looking for magic bullet bullshit. Just be a real developer who actually learns & applies fundamentals. If you can't or won't do that then leave the profession please.

    1. This might be my favorite smalldatum reply ever. Since I might never be judged by Fabian Pascal of fame this will have to do.

  2. Curious, do you think that this shift in your focus will change how you look to do benchmarking on this blog? And that first reply has got to be some sort of meme. Or someone on mind altering substances.

    1. Yes, it will change.
      1) I have been using one configuration per DBMS regardless of the workload, at least as my starting point. I will then document when per-benchmark options are added.
      2) Later this year I will publish results for the Insert Benchmark and Linkbench to show the impact of config options for MyRocks, InnoDB and Postgres.
      3) I will continue to pontificate in public and collaborate in private on this issue. I am not sure the former has an impact, but Twitter is meant for pontification.
      4) With innodb_dedicated_server, InnoDB is easier to configure than Postgres and RocksDB. I expect to spend more time thinking about ways to make LSM configuration simpler. My hope is for storage engines to do more self-tuning given goals from the end user, for example favoring read, write or space efficiency.
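As a concrete example of the simplification mentioned in point 4, MySQL's innodb_dedicated_server option lets InnoDB derive several sizing knobs from the server's available memory instead of requiring each to be hand-tuned. A minimal my.cnf fragment (the option name is real; which derived settings it covers varies by MySQL version):

```ini
# my.cnf fragment: with innodb_dedicated_server enabled, InnoDB sizes
# settings such as the buffer pool and redo log from server memory,
# replacing several hand-tuned options with one switch.
[mysqld]
innodb_dedicated_server = ON
```

This is the direction I hope LSM engines take as well: one goal-level setting rather than many interacting low-level knobs.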