Friday, October 18, 2019

Just put the cold data over there

There are several ways to use less SSD for an OLTP workload: choose a database engine has less space amplification, store less data, move the cold data elsewhere. The first approach is a plan while the others are goals. A plan is something you can implement. A goal requires a plan to get done.

This matters when you want to decrease the cost of storage for a database workload but not everyone needs to do that. The first approach assumes your DBMS supports an LSM with leveled compaction and compression (MariaDB and Percona Server include MyRocks, ScyllaDB and Cassandra are also options). The second approach, store less data, assumes you can get everyone to agree to remove data and that is a hard conversation.

The third approach, move the cold data elsewhere, is a wonderful goal. I wonder how often that goal is achieved. To implement this you must find data that won't (well, almost never) be read or written again and then move it to less expensive storage. I assume this has been more successful when implemented in storage than in a DBMS. The obvious example is an OS page cache but there are also tiered storage servers. An LSM already classifies data that is hot vs cold for writes, data closer to the L0 was written more recently, but that might not imply anything about read likelihood.

I have read wonderful research papers that show how a DBMS might do this but I am still wary. Assume that data can be described by read and write likelihood attributes -- {read N, read forever} X {write N, write forever} then the goal is to find data that is read N, write N for which N has been reached. You frequently won't know the value of N and monitoring will be needed to identify data that is rarely read or written, along with more CPU, RAM and IO to perform that monitoring. This is easier to do when the granularity of hot vs cold is per table but that is rare in my experience. I assume the common case is a table with a mix of hot and cold data.

Don't forget that it is a lousy experience when cold data becomes hot again.

This post was inspired by something I read on a wonderful blog -- Blocks and Files (no snark, blog is great). My summary of the post is that SSD endurance isn't a problem, just evict cold data to cold storage. Just remember that is a goal not a plan.

Update - I can't believe I forgot to mention the RocksDB persistent read cache that can use a faster persistent device (like Optane) to cache data read from a slower persistent device. The RocksDB-cloud effort from Rockset makes RocksDB work on S3 and uses the read cache to benefit from local storage. This post explains RocksDB-cloud.

Update - I am not sure I reviewed this before, but there is an interesting paper that extends RocksDB to migrate cold data to cold storage -- see Mutant.


  1. Are you aware of bcache and bcachefs?

    It is a lot like Flashcache. You set up bcache so that your recently accessed data is on SSD and infrequently accessed data is on slower storage. It really works quite well.

    1. I am somewhat familiar and look forward to progress. I support the effort via

    2. I'm going to try to test it with external rasters with PostGIS.

  2. This comment has been removed by the author.