Tuesday, September 10, 2019

FoundationDB Record Layer

This is a review of the FoundationDB Record Layer paper. My summary is that Record Layer is a wonderful solution when you need to support billions of OLTP databases where each database is small with a simple and low-concurrency workload. I am not an expert on FoundationDB so this review contains questions and might have mistakes. My focus is not on the distributed system properties.

The big question is whether Record Layer can be the solution for less simple workloads, larger databases and higher concurrency. There doesn't appear to be an OLTP SQL layer in development for FoundationDB and there are many interesting alternatives in the open-source and source-available scale-out OLTP DBMS space that already have query layers including CockroachDB, Yugabyte, FaunaDB, TiDB, MySQL Cluster, Comdb2 and MongoDB.

Apple's CloudKit is the star customer for FoundationDB. Other use cases include SnowflakeDB (metadata mgtmt) and Wavefront (primary store). FoundationDB is open source (Apache 2.0) and has an active community. At the upcoming summit you can learn about Redwood, the new storage engine that will replace SQLite.

Update - CouchDB might be moving to FoundationDB

Questions and Comments

An overview of the API is here.

  • Record Layer uses optimistic concurrency control to provide serializable isolation. Commit can fail when there are conflicts. Locks are not held prior to commit. There will be conflicts especially when there is a lag between the time at which the transactions does reads and commit (stalls for reads from storage make this worse). 
  • It provides atomic RMW operations that don't fail from conflicts at commit like min/max and increment because they never conflict (but what happens when the target row has been deleted). This is great but not enough to reduce my concern about support for high-contention workloads -- there will be conflicts.
  • Support for server-side logic would help reduce conflicts by shrinking transaction duration, alas there is neither a layer for SQL nor for server-side logic.
  • Clients can remove keys from a transactions read set to reduce the chance of conflict. I wonder if this supports weaker isolation like repeatable read. Search for conflict and conflict ranges in the paper.
  • Record Layer supports aggregation indexes to maintain things like sum, count, min and max. I assume this uses the atomic RMW operations. I wonder what trouble this could cause with hot spots.
  • SQLite implements the ssd engine. This isn't the best choice if you care more about performance or efficiency. This is a great choice if you care about manageability while scaling out to billions of databases.
  • Transactions can run for at most 5 seconds. With the ssd engine, implemented by SQLite, only committed versions are forced to disk and undo is buffered in memory to support reads at previous versions. The 5 second limit is there to reduce the memory demand. This restriction might go away with the upcoming Redwood engine.
  • Transactions are limited to 10mb of changes.
  • Redwood is expected to replace SQLite. It is a copy-on-write b-tree with prefix compression. Long common prefixes are frequent with Record Layer so prefix compression is in demand. It uses shadow pages for the leaf pages and then remaps page IDs for non-leaf pages to reduce write-amp. There was an awesome PhD on shadow paging many years ago.
  • The storage engine must be single-threaded which makes it harder to reuse engines like RocksDB. 
  • Order by is only supported via indexes to reduce the memory demand on the server.
  • Record Layer is stateless. Query results are streamed back to the client. I hope the client can get more than one row per network RPC. A continuation is returned to the client when it must fetch more from a cursor.
  • Protocol Buffers are used to encode data. Index keys are compared via memcmp and Record Layer has support for combining composite keys, and hopefully for charsets, into a string for which memcmp does the right thing.

4 comments:

  1. "The big question is whether Record Layer can be the solution for less simple workloads, larger databases and higher concurrency."

    FoundationDB's fatal flaw for any workload of this kind is its HA behavior. Any failure in the metadata topology (cluster controller, master, proxies, or resolvers) triggers a total reconstruction of that topology, resulting in downtime. This includes upgrades of the database software itself, which requires a maintenance window.

    FoundationDB was not designed for unreliable cloud hardware.

    ReplyDelete
  2. CouchDB did vote to adopt FoundationDB in CouchDB 4.0,https://lists.apache.org/thread.html/9f76b49de19442b0f1896100bdf598ebc7718b65b7bcfd54ee32a85a@%3Cdev.couchdb.apache.org%3E

    ReplyDelete
    Replies
    1. I was excited to learn that earlier today and added an update above. Hopefully your URL will render better as https://lists.apache.org/thread.html/9f76b49de19442b0f1896100bdf598ebc7718b65b7bcfd54ee32a85a@%3Cdev.couchdb.apache.org%3E

      Delete
    2. Not sure why everything is rendering weird! Sorry about that. Great read otherwise, thank you!

      Delete

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...