Wednesday, April 16, 2014

Types of writes

What does it mean to make writes fast? It helps to distinguish between the different types of writes. The slowest is a write that must be implemented as read-modify-write. This might require a disk read and can also create contention from preventing concurrent changes to the row for the duration of the read, modify and write. The row might not be unlocked until the change is made durable on storage (commit, fsync, etc) which lets you estimate the peak rate at which a single row can be changed on a traditional DBMS. And this latency between changes can get even worse when there is sync replication or multiple client-server round trips per transaction. The UPDATE statement in SQL is usually implemented as read-modify-write. Some DBMS engines require locking to be done above the DBMS because they don't support locking across operations where read and write are separate operations (RocksDB is an example). Other DBMS engines compensate for that with a conditional put that performs a write when parts of the row have not changed like checkAndPut in HBase. But if the client does a read prior to the write then the overhead from the read still exists.

Some UPDATE statements perform changes that are commutative and it is possible to avoid the read prior to the write. That optimization is rarely implemented but it is possible in RocksDB with the merge operatorTokuDB, and Redis. Increment and decrement are examples of commutative operations. But this also requires upsert behavior to handle the case where the row to be changed doesn't exist. If the read is to be skipped it also means that a result cannot be returned -- that is the count of rows changed or the old/new value of the changed column.

blind-write is the easiest write to make fast. This can be done via a Put call with a key-value store like LevelDB or RocksDB. Some SQL operations can also use a blind-write if we have an option to not return the count of changed rows when the statement is executed and the operation behaves like an upsert. This is rarely implemented but TokuDB might help people appreciate it.

So there are at least 3 types of writes and from slowest to fastest they are read-modify-writecommutative-writeblind-write. Hopefully these optimizations will become more common in new database engines. From the write operation there shouldn't be a performance difference between commutative-write and blind-write. But the query latency for a row subject to commutative-write can be much worse than for blind-write because many old updates might have to be read and merged.


  1. I don't think the write needs to be commutative in order to be fast. We just need to know what keys will need modification in advance. Let's take the following SQL example:
    create table foo (id int, a int, b int, primary key (id), key (a))

    In this schema, the primary key is the sole clustering key. Suppose I did "update foo set b=100 where id=10;" This update can be done fast, we know the pk needs a modification, and we know which one (the one where it is equal to 10). We also know that secondary index 'a' needs no modification.

    Now take the following two examples:
    update foo set a=a+1 where id=10;
    update foo set b=b+1 where a=10;

    In each of these cases, the update cannot be done fast. In the first example, we know the secondary index 'a' will need to be modified, but we don't know for which value. So a query is necessary. Similarly, in the second case, we don't know which id in the primary key associated with a being 10 needs modification, so that requires a query as well.

    Similarly, if we have multiple clustering indexes, many updates that used to be fast are no longer fast. If 'a' is a clustering key, the this update that used to be done fast can no longer be fast:
    update foo set b=100 where id=10;
    This is because the new 'b' value will need to reflect the row in the clustering key, and we don't know where that row is.

    So, really, the common update that can theoretically be done fast for TokuMX, TokuDB, etc... is one where the where clause is on the PK, the modifications don't touch indexed columns, and we have no additional clustering keys. I'd be interested in hearing how common such a scenario is.

  2. Your first example is an example of a blind-write. I think we agree that writes don't have commute to be fast. My guess is that updating counts is the most common use case and maintaining last-N for a bag/set subject to a max size is another use case.

  3. The handler interface supports avoiding the reads in the below case:
    update foo set b=100 where id=10
    It's used by NDB currently but can be implemented by any storage engine that supports it.
    It would even be possible to do blind writes of:
    update foo set a=a+1 where id=10
    but this requires a bit more mapping the calculation to something useful so isn't currently
    supported in the handler interface. Naturally this would require support for calculations in
    the storage engine.

    1. How do we bring the greatness of NDB/Cluster to more users. Maybe hosted NDB as a service is the solution?

    2. Is this new? I investigated this in MySQL 5.1 and MySQL 5.5 without any success. Also, I recall a bunch of innovations NDB had (like online alter table, back in the day) were available in the MySQL Cluster code base and not the MySQL Server code base. Has that changed?

    3. I guess it's on our side to make it easier to use, to have less amount of hassles. This is the major focus of the next cluster version MySQL Cluster 7.4. I think also the development on hardware (more memory, more networking bandwidth and more CPU resources) and the MySQL Cluster product becoming more and more stable works for the product.

    4. It's new in the MySQL Server code base that avoiding reads before updates is available in the handler interface (added in MySQL 5.6, it's been in MySQL Cluster code base for a long time). On-line alter support was added in 5.6 as well in the MySQL Server code base and there are some things that InnoDB supports performing on-line.