Thursday, September 17, 2020

Performance results and DBMS experts

I used to assume that many performance results couldn't be trusted because the people sharing the results lacked expertise in the system(s) under test. I am still wary of results in conference papers that have results for too many DBMS and my wariness is a function of the number of DBMS considered. 

But my opinion on this has changed. Now I prefer to focus on the results that the typical performance-sensitive deployment will get. By performance-sensitive I mean a deployment that cares enough about performance to spend some money and/or time to get it. This isn't a deployment that uses a default configuration but it might not be able to afford full time performance teams and months of tuning.

There are far more deployments than experts, so it is safe to assume that expertise is in short supply. If we are to assign blame then I prefer to blame the DBMS for being too complex, having too many sharp edges and not being adaptive.

Lack of expertise in a DBMS isn't a mark of shame. Resources are finite (time to learn, brain capacity).  Some of us can afford to fill our heads with DBMS trivia at the expense of learning other useful skills. But most people need to focus on things other than the DBMS to get their jobs done.

There is a big opportunity to build systems that provide better performance without excessive tuning. While ML seems to be the primary motivator for current research, I expect non black-box approaches to deliver interesting results as well. And by non black-box approaches I mean designing algorithms that consider performance goals and know how to adjust to achieve them. This assumes they have some understanding of performance models. Can we call these self-aware index structures?

Managed databases are also an opportunity for scaling expertise. The service provider has the expertise and has a better chance of applying it across their many customers.

Friday, September 4, 2020

When is it OK to lose a commit?

I am thrilled that OSS DBMS are better at avoiding the loss of commits thanks to replication that can do more than async. But there are good reasons for some workloads to not pay the cost (in latency) from such features.

By losing a commit here I mean the loss of the most recent N commits, not the lost of any commit from the past. By workload I mean one of the applications using the DBMS and there can be a variety of workloads using the DBMS. For some DBMS, all applications must use the same level of commit durability. But that is an implementation artifact, not something fundamental to the algorithms we implement.

A workload for which it is OK to lose the most recent N commits is inserting into a DBMS from a log file. If this has limited concurrency then commit latency can be a problem. If commits are lost then they can be replayed. This assumes that the application 1) can determine when a commit has been lost and 2) figure out where to begin the replay.

To determine when a commit has been lost the application queries DBMS state while connecting and while reconnecting. One reason to reconnect is after a failover and a tradeoff for faster commit is the potential for losing the last N commits on failover.

In the simple case there is one single-threaded application doing inserts from a log file into one DBMS table. It is feasible but not trivial to make sure there really is at most one application running in this case -- because distributed system challenges extend to applications. But I ignore that complexity for now.

The DBMS state is found in either the table to which the inserts are done or a separate state table. If there is a way to find the last inserted row, then that tells you where to restart, assuming you can map the value of that row back to an offset in the source (log). Otherwise, a separate state table can maintain the log offset from which inserts should continue and the state table is updated in the transaction that does bulk inserts from the log.

Update

I just remembered something that I never understood. Ingres Replicator does async replication via 2pc between the primary and a secondary -- see here and here. I never understood why. That is an expensive way to replay a log (from the primary) on a secondary. The alternative is for a secondary to track its state independent of the primary, replay transactions on a secondary (commit locally on the secondary, update state table on the secondary) and some time after commit the state (queue of transactions to be replayed) can be updated (truncated).


RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...