Friday, September 20, 2019

Review of Spanner: Becoming a SQL System

This is a review of Spanner: Becoming a SQL System, at least the bits of it that I understand. The paper is worth reading; it is from SIGMOD 2017 but I just noticed it. I am impressed by Google's ability to build large-scale systems. Spanner + SQL is used within Google and by external customers for OLTP and OLAP (HTAP), although I am curious whether it is performant for OLAP, which is really a function of the time spent improving the optimizer and execution engine for complex SQL. According to the paper, adding SQL, and sharing a common SQL dialect across the DBMS within Google, has greatly increased Spanner usage.

Cloud Spanner provides serializable isolation. The paper states this is done with pessimistic concurrency control and timestamps, so I assume that reads take locks in read-write transactions to prevent changes to rows that have been read. I wonder whether conflicts are an issue for workloads with contention. It isn't clear to me whether CockroachDB uses read locks.
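A minimal sketch of the read-locking idea, assuming strict two-phase locking with shared locks for reads and exclusive locks for writes. This is my toy illustration, not Spanner's lock manager, and all names are hypothetical:

import threading

# Toy lock manager illustrating pessimistic concurrency control:
# reads take shared locks, writes take exclusive locks, and conflicting
# requests fail immediately (a real system would block or use wound-wait).
class LockManager:
    def __init__(self):
        self._mu = threading.Lock()
        self._shared = {}     # key -> set of txn ids holding a shared lock
        self._exclusive = {}  # key -> txn id holding an exclusive lock

    def lock_shared(self, txn, key):
        with self._mu:
            owner = self._exclusive.get(key)
            if owner is not None and owner != txn:
                raise RuntimeError(f"txn {txn}: read of {key!r} conflicts with writer {owner}")
            self._shared.setdefault(key, set()).add(txn)

    def lock_exclusive(self, txn, key):
        with self._mu:
            readers = self._shared.get(key, set()) - {txn}
            owner = self._exclusive.get(key)
            if readers or (owner is not None and owner != txn):
                raise RuntimeError(f"txn {txn}: write of {key!r} conflicts with readers/writer")
            self._exclusive[key] = txn

    def release_all(self, txn):
        with self._mu:
            for holders in self._shared.values():
                holders.discard(txn)
            for key in [k for k, v in self._exclusive.items() if v == txn]:
                del self._exclusive[key]

# Example: txn 1 has read a row, so txn 2 cannot update it until txn 1 releases.
lm = LockManager()
lm.lock_shared(txn=1, key="user:42")
try:
    lm.lock_exclusive(txn=2, key="user:42")
except RuntimeError as e:
    print(e)
lm.release_all(1)
lm.lock_exclusive(txn=2, key="user:42")  # succeeds after txn 1 releases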

There are several SQL systems within Google -- F1, Cloud Spanner and Dremel. They implement the same SQL language thanks to a common front-end (parsing, semantic analysis), a library of scalar functions, a test framework, feature tests (coverage and correctness) and a random query generator for testing. I wonder what will become of F1, which first did SQL on Spanner for Google Ads.
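The paper doesn't describe the random query generator in detail, but the idea is easy to sketch: generate random expressions, embed them in queries, and run the same text against multiple engines and compare the results. A toy, hypothetical version:

import random

# Toy random query generator: build random scalar expressions over a few
# integer columns and wrap them in a SELECT. All details here are made up;
# the paper does not describe its generator at this level.
COLUMNS = ["a", "b", "c"]

def random_expr(depth=0):
    choice = random.random()
    if depth >= 3 or choice < 0.4:
        return random.choice(COLUMNS + [str(random.randint(0, 9))])
    if choice < 0.8:
        op = random.choice(["+", "-", "*"])
        return f"({random_expr(depth + 1)} {op} {random_expr(depth + 1)})"
    return f"ABS({random_expr(depth + 1)})"

def random_query():
    return f"SELECT {random_expr()} FROM t WHERE {random_expr()} > {random_expr()}"

for _ in range(3):
    print(random_query())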

Range extraction is done at compile time and run time to prune table shards, determine which fragments of an index should be read and determine key ranges to lock. All DBMS do this at compile time; doing the analysis at run time is less common. Distributed SQL has more overhead for data access, so the extra cost of this analysis is worthwhile.
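As a rough illustration of compile-time range extraction, here is a toy sketch that turns simple predicates on the leading key column into a key range and prunes shards that cannot overlap it. This is a hypothetical example with integer keys, not Spanner's algorithm:

from dataclasses import dataclass

@dataclass
class Shard:
    name: str
    lo: int   # inclusive lower bound of the shard's key range
    hi: int   # exclusive upper bound

def extract_range(predicates):
    """predicates: list of (op, value) on the key, e.g. [(">=", 100), ("<", 200)]."""
    lo, hi = float("-inf"), float("inf")
    for op, value in predicates:
        if op in (">=", ">"):
            lo = max(lo, value if op == ">=" else value + 1)
        elif op in ("<", "<="):
            hi = min(hi, value if op == "<" else value + 1)
        elif op == "=":
            lo, hi = max(lo, value), min(hi, value + 1)
    return lo, hi

def prune(shards, predicates):
    lo, hi = extract_range(predicates)
    return [s for s in shards if s.lo < hi and s.hi > lo]

shards = [Shard("s0", 0, 100), Shard("s1", 100, 200), Shard("s2", 200, 300)]
print([s.name for s in prune(shards, [(">=", 150), ("<", 250)])])  # ['s1', 's2']

The same range computation can feed index fragment selection and lock-range determination; doing it again at run time lets bound parameters tighten the ranges further.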

Restart tokens are used to allow the client library to hide transient failures even in cases where partial query results have been received. The token represents the internal state of query execution and works across upgrades. Intermediate query results are not persisted on the server side as the target workload is OLTP. The comdb2 DBMS also returns information to the client to hide transient failures.
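Here is a toy sketch of how a restart token might be used by the client library, assuming the token is an opaque value describing how far execution has progressed. This is a hypothetical illustration, not the Spanner API:

import random

# Toy restart tokens: the server tags each batch of rows with a token; on a
# transient failure the client re-sends the query with the last token and the
# server resumes without re-sending rows already delivered.
class TransientError(Exception):
    pass

def server_execute(query, resume_token=0):
    """Yield (rows, token) batches, starting after resume_token rows."""
    all_rows = [f"row-{i}" for i in range(10)]
    for i in range(resume_token, len(all_rows)):
        if resume_token == 0 and random.random() < 0.2:
            raise TransientError("server restarted")
        yield [all_rows[i]], i + 1  # token = number of rows already delivered

def client_query(query, max_attempts=5):
    token, results = 0, []
    for _ in range(max_attempts):
        try:
            for rows, token in server_execute(query, resume_token=token):
                results.extend(rows)
            return results  # stream completed
        except TransientError:
            continue  # retry from the last token; rows already received are kept
    raise RuntimeError("query failed after retries")

print(client_query("SELECT * FROM t"))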

Ressi is an LSM optimized for OLAP rather than OLTP. This uses the PAX idea within the LSM data blocks. Data within the LSM tree is ordered by key (row-wise) but within each data block is stored column-wise. This is a great compromise for HTAP -- one IO fetches all columns for a row but the vectorized format within the data block reduces CPU overhead during OLAP and the paper states that Ressi can operate directly on vectors in compressed form. Other features from Ressi include:
  • Large (multi-page) values are stored in separate files. I assume this is similar to the LOB indirection done by InnoDB.
  • There are active and inactive files. The active file has the current version for a key while the inactive file has older versions for a key. It wasn't clear to me how this is integrated into the LSM tree. My guess is that the split is done during LSM compaction and I am curious if this ever creates search overhead.
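To make the PAX-within-a-block idea concrete, here is a toy sketch of a block whose rows are assigned in key order but whose values are stored column-wise within the block. This is a hypothetical illustration of the layout, not Ressi's actual format:

# Toy PAX-style block: rows land in blocks in key order (row-wise across
# blocks), but within a block each column is stored contiguously, so a scan
# of one column walks a dense vector.
class PaxBlock:
    def __init__(self, rows, columns):
        # rows: list of dicts sorted by key; columns: list of column names
        self.columns = columns
        self.keys = [r["key"] for r in rows]
        self.vectors = {c: [r[c] for r in rows] for c in columns}

    def get_row(self, i):
        # OLTP-style access: one block read returns every column of the row.
        return {c: self.vectors[c][i] for c in self.columns}

    def scan_column(self, column):
        # OLAP-style access: iterate a single dense vector with good locality.
        return self.vectors[column]

rows = [{"key": k, "price": k * 10, "qty": k % 3} for k in range(4)]
block = PaxBlock(rows, ["price", "qty"])
print(block.get_row(2))            # {'price': 20, 'qty': 2}
print(block.scan_column("price"))  # [0, 10, 20, 30]

One point read fetches all columns of a row from a single block, while a column scan iterates a dense vector -- the HTAP compromise described above.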
I am uncertain, but I think the quote below from the paper means that a transaction can't read its own uncommitted writes. That seems like a source of confusion for external users.
Just like SQL syntax and semantics, transactional guarantees of database products vary widely from one system to another. Spanner supports very strong guarantees that establish a global order among transactions, called external consistency. Within a given read-modify-write transaction, however, Spanner imposes a constraint that all updates must be committed as the last step of the transaction. Overcoming this limitation requires supporting reading the uncommitted results within an active transaction. While we have seen comparatively little demand for this feature internally, supporting such semantics improves compatibility with other SQL systems and their ecosystems and is on our long-term radar. 
While the paper didn't cover pricing, I wonder how Spanner compares to the alternatives from AWS and MSFT. Spanner charges per GB of storage while AWS Aurora charges per GB of storage and per IO. Is that a big deal? Do users pay for extra RAM to reduce their read IO with Aurora?
