The PrismDB paper explains how to use heterogenous storage with an LSM. This is my review of the paper and I don't review papers that aren't interesting and worth reading. I might be biased given that I worked on RocksDB, MyRocks, helped with the persistent read cache for tiered storage in RocksDB for which there is an interesting paper. For the rest of this post I use the word tiered instead of heterogenous as that is easier to type.
The paper is about one of my favorite topics, efficient performance, where the goal is to get good enough performance and then optimize for efficiency. In this case the goal is to use mostly lower-perf, lower cost storage (QLC NAND flash) with some higher-perf, higher-cost storage (NVM, SLC/TLC NAND). The simple solution to use NVM for the top levels (L0, L1), TLC for the middle levels and QLC for the max level of the LSM tree. Alas, that doesn't always work out great as the paper shows.
An LSM tree with leveled compaction organizes keys by write recency. When n < m then a key/value pair in Ln is likely to have been written more recently than a key in Lm. But it isn't clear that location in the LSM tree suggests something about read recency or frequency. Thus the simple tiered storage approach mentioned in the previous paragraph might not help much for reads.
Paper summary
The idea in the paper is to pin hot key/value pairs higher in the LSM tree and then use higher-perf, higher-cost storage for those levels. Thus, storage read latency for such data will be reduced. This is implemented via three components: tracker, mapper, placer.
- The tracker identifies keys that are frequently accessed. This is done via CLOCK with 2 bits/key and implemented via a concurrent hash map.
- The mapper decides which keys are hot. The mapper can achieve the goal of at most X% of keys on a given level getting pinned. Given such a constraint the mapper will determine which keys can get pinned in that level. The mapper uses the information from the concurrent hash map managed by the tracker.
- The placer implements pinned compaction, explained below, to implement the decisions made by the mapper.
Pinned compaction is a new compaction strategy. For leveled compaction from Ln to Ln+1 pinned compaction will keep keys on Ln when such keys are listed for pinning on that level. Pinned compaction will also pull up keys form Ln+1 to Ln for the same reason.
Pinned compaction benefits from weighted scores that determine which SSTs are most likely to benefit from pinned compaction (have keys that should be pinned). The scores determine which SSTs to use as input for the next round of pinned compaction.
Access distributions
My biggest concern about the paper is whether production access patterns will benefit form this approach. Academic work on this topic has been hampered because not enough has been published on this topic. It is hard to work on cache management without knowing the access distributions.
The paper has two claims on this point:
- Objects stored in the higher levels of the LSM are read and updated much more often than objects stored at the bottom layers.
- In addition, since they are updated much less frequently, these lower levels can meet the lower endurance requirements of cheaper flash storage.
I agree that objects in the higher levels are likely to be updated more often. I am less certain that this extends to reads. Most of the benchmarks from the paper use YCSB with Zipfian and the claims will definitely be true in that case. The claim about reads is more likely to be true for workloads with many range reads especially when prefix bloom filters cannot be used.
For point #2, my experience with production and benchmark workloads has been that per-level compaction write rates are similar (MB/s written to storage per level). But this can be true without contradicting claim 2 above as the larger levels have more objects. Regardless the per-level write rates must be considered to make sure a device will meet the desired lifetime.
By similar write rates I mean they tend to be within a factor of 2 and
here is an example from Linkbench -- see the Write(GB) column.
Deployment
I know that conference papers don't have to solve my production concerns, but I want to share more background on the topic.
How can I deploy this in production? If I am a cloud customer then my storage form factors are different from on-prem. For AWS I can choose from ephemeral, EBS disk, EBS SSD, EBS SSD with PIOPs and S3. I probably don't have to worry about SSD endurance in the cloud but if my workload does many writes then I have to pay for that except when using ephemeral storage. In the cloud there are still different price & performance tradeoffs and I assume work like PrismDB is relevant there.
For on-prem it isn't easy to get multiple storage devices (NVM & TLC & QLC) into one server as web-scale operators try to reduce the number of items on the menu. Even if they allow this the question is whether I can get something with small amounts of NVM per server. Disaggregated storage might solve the small amounts of storage per server problem but with disagg I have to worry about new failure modes when all servers in a rack are sharing access to the same NVM device. I think the most likely solution is for storage vendors to provide a hybrid device that includes some NVM with more MLC/TLC NAND and a lot of QLC NAND (they already do this to some extent).
Assorted comments
- Does the mapper create a list of keys (to be pinned) per level or is there a global list?
- Does pinned compaction risk creating SSTs with a small number of keys? I'd rather not have an LSM tree with too many small SSTs.
- Do the decisions about pinning avoid the double caching problem? I am not sure I need to pin objects that are likely to be in the block cache.
- Is the CPU overhead of searching the to-be-pinned data structure significant? This might depend on whether pinned compaction is the only type of compaction run. The paper suggests regular compaction is also run.
- For the two points above, using a bloom filter per level for the keys to be pinned might help.
- Is there a mechanism for unpin? If regular compaction is still run then is that the way in which pinned keys get unpinned. Or does recomputing the to-be-pinned list over time server to unpin keys that used to be hot and now are cold.
- Logical backup and long scans for batch queries are things that can interfere with the computation of the to-be-pinned list. In MyRocks hints can be used as part of a SELECT statement to suggest that data for the query should not pulled into the block cache.