Wednesday, August 11, 2021

On storage engines

I wish it were easier to implement new storage engines for MySQL, Postgres, MongoDB and other OSS databases. There is so much innovation that we miss out on, and FASTER is one example. All three (MySQL, MongoDB, Postgres) have storage engine APIs, but there are not many OSS implementations of them.

MyRocks, MySQL Aurora and MySQL HeatWave are examples of the benefits. But they also show that it helps to have the backing of a well-funded company because this is a huge undertaking.

The API for Postgres is the most recent and perhaps it will become the most popular. The API for MySQL is the oldest and has become harder to implement as more requirements (like partitioning) are pushed into it. The MongoDB API is friendlier than the MySQL API, but MongoRocks was deprecated when the API was enhanced to support transactions and RocksDB wasn't able to support user-provided commit timestamps.

One problem is naming. MongoDB and WiredTiger are databases, but WiredTiger is also a component of MongoDB. To avoid confusion I will use storage engine for things like WiredTiger, RocksDB and InnoDB and database for the systems that provide query languages, replication and more.

I have been involved in three such engines -- MyRocks, MongoRocks and a read-only MySQL engine used long ago at Google. MyRocks benefited from a large team and from Sergey Petrunya at MariaDB, and MongoRocks benefited from amazing support from MongoDB the company.

Two interesting things are:

  • The impact of database design decisions on the storage engine API. See what is needed for transactions in the MongoDB API.
  • The impact of the original storage engine on the storage engine API.
    • MongoDB doesn't take advantage of clustered indexes in WiredTiger or RocksDB. An extra (hidden) index must be used to map between DiskLoc and the PK index. See SERVER-14569.
    • I have more research to do before I understand the impact of vacuum and 32-bit transaction IDs on the Postgres API.
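
The cost of the extra hidden index can be pictured with a toy model. This is only a sketch: the class names, the integer record id standing in for DiskLoc, and the lookup counter are all assumptions made for illustration, and none of it is MongoDB or WiredTiger code.

```python
# Toy model only: hypothetical classes, not MongoDB or WiredTiger code.

class NonClusteredCollection:
    """Documents are stored by an internal record id (a stand-in for DiskLoc),
    so a primary key lookup has to go through an extra index first."""

    def __init__(self):
        self._next_record_id = 1
        self.records = {}    # record id -> document
        self.pk_index = {}   # primary key -> record id (the extra index)
        self.lookups = 0     # number of index/store probes performed

    def insert(self, pk, doc):
        rid = self._next_record_id
        self._next_record_id += 1
        self.records[rid] = doc
        self.pk_index[pk] = rid

    def find_by_pk(self, pk):
        self.lookups += 1
        rid = self.pk_index[pk]    # probe 1: primary key -> record id
        self.lookups += 1
        return self.records[rid]   # probe 2: record id -> document


class ClusteredCollection:
    """Documents are stored directly under the primary key."""

    def __init__(self):
        self.records = {}    # primary key -> document
        self.lookups = 0

    def insert(self, pk, doc):
        self.records[pk] = doc

    def find_by_pk(self, pk):
        self.lookups += 1
        return self.records[pk]    # single probe


non_clustered = NonClusteredCollection()
clustered = ClusteredCollection()
for coll in (non_clustered, clustered):
    coll.insert("user:1", {"name": "a"})
    coll.find_by_pk("user:1")
print(non_clustered.lookups, clustered.lookups)
```

In the toy model the non-clustered layout pays two probes per primary key read and maintains two structures per insert, while the clustered layout pays one of each.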

I also wonder whether the RocksDB API could be the universal storage engine API. In theory any storage engine implementing that would be able to replace RocksDB in MyRocks or MongoRocks. The goal is then to implement the RocksDB API glue once per database and then be able to use a variety of storage engines based on that. Of course, XKCD has a great take on standards and this might just be naive and wishful thinking on my part.
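
To make the "write the glue once" idea concrete, here is a minimal sketch. Everything in it is hypothetical: `KVEngine` is a tiny interface loosely inspired by get/put/delete/iterate, not the real RocksDB API, and the two engines and `TableGlue` are invented stand-ins. The point is only that the database-side code depends on the interface and never on a specific engine.

```python
from abc import ABC, abstractmethod
import bisect


class KVEngine(ABC):
    """Hypothetical minimal engine interface (not the real RocksDB API)."""

    @abstractmethod
    def put(self, key: bytes, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: bytes): ...

    @abstractmethod
    def delete(self, key: bytes) -> None: ...

    @abstractmethod
    def scan(self, start: bytes, end: bytes):
        """Yield (key, value) pairs with start <= key < end, in key order."""


class SortedListEngine(KVEngine):
    """Keeps keys sorted at write time."""

    def __init__(self):
        self._keys = []
        self._vals = {}

    def put(self, key, value):
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def get(self, key):
        return self._vals.get(key)

    def delete(self, key):
        if key in self._vals:
            del self._vals[key]
            del self._keys[bisect.bisect_left(self._keys, key)]

    def scan(self, start, end):
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < end:
            yield self._keys[i], self._vals[self._keys[i]]
            i += 1


class DictEngine(KVEngine):
    """Unordered writes; sorts at scan time instead."""

    def __init__(self):
        self._d = {}

    def put(self, key, value):
        self._d[key] = value

    def get(self, key):
        return self._d.get(key)

    def delete(self, key):
        self._d.pop(key, None)

    def scan(self, start, end):
        for k in sorted(self._d):
            if start <= k < end:
                yield k, self._d[k]


class TableGlue:
    """Database-side glue, written once against KVEngine."""

    def __init__(self, engine: KVEngine, table: str):
        self.engine = engine
        self.prefix = table.encode() + b"/"

    def insert(self, pk: str, row: str) -> None:
        self.engine.put(self.prefix + pk.encode(), row.encode())

    def lookup(self, pk: str):
        value = self.engine.get(self.prefix + pk.encode())
        return None if value is None else value.decode()

    def remove(self, pk: str) -> None:
        self.engine.delete(self.prefix + pk.encode())

    def range(self, lo: str, hi: str):
        start, end = self.prefix + lo.encode(), self.prefix + hi.encode()
        for k, v in self.engine.scan(start, end):
            yield k[len(self.prefix):].decode(), v.decode()


# The same glue runs unchanged on either engine.
for engine in (SortedListEngine(), DictEngine()):
    users = TableGlue(engine, "users")
    users.insert("bob", "row-b")
    users.insert("alice", "row-a")
    print(type(engine).__name__, list(users.range("a", "c")))
```

A real version would also need snapshots, transactions, and configuration, which is where a universal API gets much harder to agree on.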


  1. One point on MongoRocks: the fabulous @wolfkdy has added distributed transaction support, bringing it up to date with MongoDB 4.2. There aren't any significant functional changes to MongoDB's storage engine API between 4.2 and 5.0, so bringing it further up to date should be tractable.

    On your larger point: I'm not sure how much of the issue is really about the API vs the required storage functionality. For example, AFAICT, FASTER doesn't support range scans, which MongoDB needs to support indexes. Many storage engines don't have general purpose transaction support (let alone functionality to allow them to participate in distributed transactions). It doesn't seem like there are lots of potential OSS storage engines that are underused because nobody has glued them into a database. That said, databases could lower barriers to entry for new storage engines by making advanced functionality optional in the storage engine API.

    WiredTiger did maintain both a LevelDB and RocksDB API for a while, primarily to make comparative benchmarking easier. Eventually we removed them because the effort of maintaining them wasn't paying off.

    The largest surface area in each storage engine's API is configuration knobs. These are highly implementation dependent. If we had a generic storage engine API with configuration abstracted out, then maybe it would make sense for multiple OSS storage engines to implement it.
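
One way to picture "advanced functionality optional in the storage engine API" is a capability set that each engine declares and the database checks before enabling a feature. This is a sketch under stated assumptions: the `Capability` flags, the `EngineDescriptor` class, the engine names, and the feature names are all hypothetical and do not come from any real database.

```python
from enum import Flag, auto


class Capability(Flag):
    """Hypothetical capability flags an engine could declare."""
    NONE = 0
    POINT_OPS = auto()      # get/put/delete by key
    RANGE_SCAN = auto()     # ordered iteration over a key range
    TRANSACTIONS = auto()   # local multi-key transactions
    PREPARE = auto()        # can take part in distributed transactions


class EngineDescriptor:
    """What a (hypothetical) engine tells the database about itself."""

    def __init__(self, name: str, capabilities: Capability):
        self.name = name
        self.capabilities = capabilities


def supported_features(engine: EngineDescriptor) -> list:
    """Database-side decision: enable only what the engine can back."""
    caps = engine.capabilities
    features = []
    if Capability.POINT_OPS in caps:
        features.append("key lookups")
    if Capability.RANGE_SCAN in caps:
        features.append("secondary indexes")
    if Capability.TRANSACTIONS in caps:
        features.append("multi-document transactions")
    if Capability.PREPARE in caps and Capability.TRANSACTIONS in caps:
        features.append("distributed transactions")
    return features


# Two invented engines: one point-lookup only, one with everything.
point_only = EngineDescriptor("hypothetical-hash-engine", Capability.POINT_OPS)
full = EngineDescriptor(
    "hypothetical-full-engine",
    Capability.POINT_OPS | Capability.RANGE_SCAN
    | Capability.TRANSACTIONS | Capability.PREPARE,
)

for engine in (point_only, full):
    print(engine.name, supported_features(engine))
```

An engine without range scans could then still be plugged in for workloads that only need key lookups, instead of being excluded outright.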

    1. I lost track of the work on MongoRocks. I am happy to see the github repo is active. That project deserves more visibility. With MongoRocks and MyRocks the next step is an LSM for Postgres.

      The lack of range scans will be a showstopper in some cases. Support for transactions can be bolted on after the fact, as it has been for RocksDB, although that comes with a few costs. Support for distributed transactions (PREPARE ?) won't be as easy to bolt on.

  2. Yes, in Jan 2020 Deyu Kong (wolfkdy) and Yanqin Jin submitted pull request #6407 (timestamp ordering transactions) to facebook/rocksdb. That was the prerequisite for going beyond MongoDB v3.4 support (v3.6 may not have provided user transactions, but timestamps were already pervading a lot of the code).

    Then the MongoRocks v40 fork was shared as soon as someone granted access to the mongodb-partners repo, I think.

    As you joined MongoDB at around this time, Mark, I had guessed it was all about that! LSM expert, LSM storage engine, strong coincidence.

    Deyu and Yanqin did deserve a lot more visibility for this.

    1. Yes, they deserve much more credit and I appreciate how hard they have worked on this. That PR, 6407, has yet to be merged. Is there any update on it?

  3. Not that I'm aware of. I can't recall if I asked Igor Canadi why.
    (Checking emails ... no, no record sorry.)

    As it stands at the moment I think Huawei's DDS (their MongoDB DBaaS) is using it though. I.e. only one branch of Huawei's repo of RocksDB.

    1. Are there any interesting URLs you can share that describe MongoRocks in Huawei DDS?

  4. (That one is in English, which is an exception for the site.) is an older announcement-style post.

  5. 1. I think RocksDB, if it fully supports the timestamp API (get/put/delete/merge/compaction/reserve oldest timestamp window/rollback to timestamp) in the future, is the most promising candidate to be 'the only one' KV engine. Even without these APIs, it is already the most widely used.
    1.1. Currently, RocksDB only supports the LSN (for snapshots) as its default multi-version control mechanism. However, almost all the new SQL databases (crdb, tidb...) are distributed, and timestamp-based concurrency control seems the best choice. RocksDB (6.x is the latest stable major version today) lacks integrated timestamp concurrency control.

    1.2. RocksDB's timestamp API is experimental for now and lacks compaction/(pin oldest timestamp window)/(compaction to a timestamp, named rtt in WiredTiger) support. I believe RocksDB's timestamp API won't be released before December 2022.

    2. mongoRocks currently supports MongoDB v4.2.5. It mainly added the prepareHeap for ClockSI's reads-wait-for-prepare operation. I released it because it passed almost all of MongoDB's official JavaScript resmoke tests. So mongoRocks now supports distributed transactions.

    Why is there no mongoRocks v4.2.6, v4.2.7, ... v4.2.N? Because there are no significant changes in the storage layer.

    Why is there no mongoRocks v4.4 or v5.0? It is not that difficult. As @Michael Cahill mentioned, there aren't too many changes in the storage engine layer after v4.4. Currently I am focusing on rewriting the new server-layer APIs to get rid of the annoying SSPL license.

    I will release some problem-fix patches for mongorocks v4.2.5 in September. v4.4 may be soon but v5.0 may not be that soon.

    I also added an issue to use official RocksDB for mongoRocks. This work won't be done until RocksDB fully supports the timestamp API (currently, compaction/oldest timestamp window/rollback to some specific timestamp are still missing).

    3. @Mark Callaghan, the tech slides about Huawei mongoRocks are almost all written in Chinese. There is one article in English, but I don't know who translated it. It can be found here.

    1. Thank you for so much info and for bringing MongoRocks forward.