Thursday, August 20, 2020

iostat output format changes again

The format for iostat differs across the last 3 Ubuntu LTS (16.04, 18.04, 20.04). Some of my scripts scrape that output and now it is time to update them to detect the new format. I wrote about this before. There are other ways to get this data so this is a risk I accept for being lazy. The changes are from added columns, removed columns, renamed columns and reordered columns. I wish the last two weren't done. There are a few more projects out there that scrape iostat output and must update how they parse it.

In Ubuntu 20.04 the new columns are for trim, which is useful. The svctm column was removed and fields were reordered. I assume that the bytes written metrics no longer include bytes trimmed -- which inflated by 2X the write rate for anything that does frequent unlike (like RocksDB).

New fields in 20.04 are:

  • d/s - number (after merges) of discard requests completed per second for the device
  • dsec/s (dkB/s, dMB/s) - number of sectors (KB, MB) discarded for the device per second
  • drqm/s - number of discard requests merged per second that were queued to the device
  • %drqm - percentage of discard requests merged together before being sent to the device
  • dareq-sz - average size (in kilobytes) of the discard requests that were issued to the device
  • d_await - average time (in milliseconds) for discard requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them

Sunday, August 16, 2020

Review of PrismDB: Read-aware LSM Tree for Heterogeneous Storage

The PrismDB paper explains how to use heterogenous storage with an LSM. This is my review of the paper and I don't review papers that aren't interesting and worth reading. I might be biased given that I worked on RocksDB, MyRocks, helped with the persistent read cache for tiered storage in RocksDB for which there is an interesting paper. For the rest of this post I use the word tiered instead of heterogenous as that is easier to type. 

The paper is about one of my favorite topics, efficient performance, where the goal is to get good enough performance and then optimize for efficiency. In this case the goal is to use mostly lower-perf, lower cost storage (QLC NAND flash) with some higher-perf, higher-cost storage (NVM, SLC/TLC NAND). The simple solution to use NVM for the top levels (L0, L1), TLC for the middle levels and QLC for the max level of the LSM tree. Alas, that doesn't always work out great as the paper shows.

An LSM tree with leveled compaction organizes keys by write recency. When n < m then a key/value pair in Ln is likely to have been written more recently than a key in Lm. But it isn't clear that location in the LSM tree suggests something about read recency or frequency. Thus the simple tiered storage approach mentioned in the previous paragraph might not help much for reads. 

Paper summary

The idea in the paper is to pin hot key/value pairs higher in the LSM tree and then use higher-perf, higher-cost storage for those levels. Thus, storage read latency for such data will be reduced. This is implemented via three components: tracker, mapper, placer.

  • The tracker identifies keys that are frequently accessed. This is done via CLOCK with 2 bits/key and implemented via a concurrent hash map. 
  • The mapper decides which keys are hot. The mapper can achieve the goal of at most X% of keys on a given level getting pinned. Given such a constraint the mapper will determine which keys can get pinned in that level. The mapper uses the information from the concurrent hash map managed by the tracker.
  • The placer implements pinned compaction, explained below, to implement the decisions made by the mapper.

Pinned compaction is a new compaction strategy. For leveled compaction from Ln to Ln+1 pinned compaction will keep keys on Ln when such keys are listed for pinning on that level. Pinned compaction will also pull up keys form Ln+1 to Ln for the same reason.

Pinned compaction benefits from weighted scores that determine which SSTs are most likely to benefit from pinned compaction (have keys that should be pinned). The scores determine which SSTs to use as input for the next round of pinned compaction.

Access distributions

My biggest concern about the paper is whether production access patterns will benefit form this approach. Academic work on this topic has been hampered because not enough has been published on this topic. It is hard to work on cache management without knowing the access distributions. 

The paper has two claims on this point:

  1. Objects stored in the higher levels of the LSM are read and updated much more often than objects stored at the bottom layers.
  2. In addition, since they are updated much less frequently, these lower levels can meet the lower endurance requirements of cheaper flash storage.
I agree that objects in the higher levels are likely to be updated more often. I am less certain that this extends to reads. Most of the benchmarks from the paper use YCSB with Zipfian and the claims will definitely be true in that case. The claim about reads is more likely to be true for workloads with many range reads especially when prefix bloom filters cannot be used.

For point #2, my experience with production and benchmark workloads has been that per-level compaction write rates are similar (MB/s written to storage per level). But this can be true without contradicting claim 2 above as the larger levels have more objects. Regardless the per-level write rates must be considered to make sure a device will meet the desired lifetime. 

By similar write rates I mean they tend to be within a factor of 2 and here is an example from Linkbench -- see the Write(GB) column.


I know that conference papers don't have to solve my production concerns, but I want to share more background on the topic.

How can I deploy this in production? If I am a cloud customer then my storage form factors are different from on-prem. For AWS I can choose from ephemeral, EBS disk, EBS SSD, EBS SSD with PIOPs and S3. I probably don't have to worry about SSD endurance in the cloud but if my workload does many writes then I have to pay for that except when using ephemeral storage. In the cloud there are still different price & performance tradeoffs and I assume work like PrismDB is relevant there. 

For on-prem it isn't easy to get multiple storage devices (NVM & TLC & QLC) into one server as web-scale operators try to reduce the number of items on the menu. Even if they allow this the question is whether I can get something with small amounts of NVM per server. Disaggregated storage might solve the small amounts of storage per server problem but with disagg I have to worry about new failure modes when all servers in a rack are sharing access to the same NVM device. I think the most likely solution is for storage vendors to provide a hybrid device that includes some NVM with more MLC/TLC NAND and a lot of QLC NAND (they already do this to some extent).

Assorted comments

  • Does the mapper create a list of keys (to be pinned) per level or is there a global list? 
  • Does pinned compaction risk creating SSTs with a small number of keys? I'd rather not have an LSM tree with too many small SSTs.
  • Do the decisions about pinning avoid the double caching problem? I am not sure I need to pin objects that are likely to be in the block cache.
  • Is the CPU overhead of searching the to-be-pinned data structure significant? This might depend on whether pinned compaction is the only type of compaction run. The paper suggests regular compaction is also run. 
  • For the two points above, using a bloom filter per level for the keys to be pinned might help.
  • Is there a mechanism for unpin? If regular compaction is still run then is that the way in which pinned keys get unpinned. Or does recomputing the to-be-pinned list over time server to unpin keys that used to be hot and now are cold.
  • Logical backup and long scans for batch queries are things that can interfere with the computation of the to-be-pinned list. In MyRocks hints can be used as part of a SELECT statement to suggest that data for the query should not pulled into the block cache.

Thursday, August 6, 2020

Over fetching in a DBMS

By over fetching I mean fetching irrelevant documents or fields while processing a query. By fetching I mean the data examined by the DBMS, not the result set returned to the user.  By irrelevant I mean documents and fields that won't change the query result if they don't exist. This also applies to a SQL DBMS after substituting column for field, row for document and collection for table. In the rest of this post I mostly use MongoDB names (documents, fields and collections). I might reuse some of the terminology from a post on predicates from Use The Index, Luke. That post is interesting.

I can refine the over fetching definition into four parts:

  • co-located
  • single-table
  • join
  • aggregation
  • other

Co-located over fetching occurs when irrelevant fields are read from a document or indexes. Columnar storage is a one way to avoid this for analytic workloads. Sometimes a covering index can reduce this for OLTP workloads.

For a collection C with fields a1, a2, ..., aN then db.C.find({}, {a1:1, a2:1, _id:0}) does co-located over fetching when a collection scan is used. An index on (a1, a2) avoids over fetching because the fields a3 ... aN are not in the index. The SQL version of the query is select a1, a2 from C.

Co-located over fetching also occurs when an index has fields that are needed by the query. For the query in the previous paragraph if there is an index on (a1, a2, a3, a4) and the index is used for the query then over fetching occurs because a3 and a4 are read by the DBMS but not needed.


Single-table over fetching occurs when irrelevant documents are read. A good index is one way to avoid this, especially for OLTP workloads. 

For this example the collection C has the documents:
  • { _id: 0, a1:7, a2:8, a3:9 }
  • { _id: 1, a1:7, a2:18, a3:19 }
  • { _id: 2, a1:27, a2:28, a3:29 }
The query db.C.find({a1: 7}) does single-table over fetching when there isn't an index on a1. In that case a collection scan is done that examines all docs but the doc with _id:2 is irrelevant. With an index on a1 then only the docs with _id:0 and _id:1 are examined and there is no over fetching. The SQL version of this query is select * from C where a1=7.

Single-table over fetching can occur with indexes. For the query db.C.find({ a1: 7, a2: 8}) with an index on a1 then docs with _id:0 and _id:1 are examined but _id:1 is irrelevant because the predicate on a2 excludes it. An index on (a1, a2) avoids the over fetching. The SQL version of this query is select * from C where a1=7 and a2=8.


Join over fetching occurs when join predicates filter documents. The docs that were filtered by the join predicate don't change the query result and figuring out how to avoid examining them prior to the join might improve performance.

With MongoDB $lookup is a left outer join and over fetching cannot occur for the input documents. But it can occur for docs in the from collection when there is no index, or no good index, on it. But explain doesn't show the access path for the from collection -- until SERVER-22622 is fixed.

Whether join over fetching occurs in SQL depends on the join (inner, left outer, right outer, etc). The example uses tables C1 and C2 and the query select * from C1, C2 where C1.x1 = C2.y1
  • C1 has the rows: (x1:1, x2:2), (x1:11, x2:12), (x1:21, x2:22)
  • C2 has the rows: (y1:1, y2:2), (y1:11, y2:12), (y1:31, y2:32)
Assume this is evaluated by scanning C1 and then probing C2 (nested loops join). If there is no index on C2.y1 then join over fetching occurs for C2 because (y1:31, y2:32) is examined by a table scan but filtered by the join predicate. Join over fetching occurs for C1 whether or not there is an index on C2.y1 because (x1:21, x2:22) is examined but filtered by the join predicate.


Aggregation over fetching occurs when docs are filtered by the aggregation operator semantics. The obvious ones are $max and $min in MongoDB (max and min in SQL).

For this example the collection C has the documents { _id: 0, a1:7, a2:8 }, { _id:1, a1:7, a2:9 } and the queries are:
  • MongoDB: db.C.aggregate([ { $group : { _id: "$a1", maxval : { $max : "$a2" } } } ])
  • SQL: select max(a2), a1 from C group by a1
Aggregation over fetching occurs for the doc with _id:0 because it doesn't have the max value of a2 for the group with a1:7. An index on (a1, a2) can avoid the aggregation over fetching but you should consult the DBMS documentation to understand whether that optimization has been implemented. MySQL has loose index scan for min/max and distinct, Postgres has recursive CTE (here and here) while MongoDB has DISTINCT_SCAN (for $first but not for $min or $max - see SERVER-40090).


Sometimes index-only queries aren't as index-only as you want them to be. Postgres relies on bits being set in the visibility map or it will fetch the base row from the heap. InnoDB relies on a different mechanism but there are cases where it too must fetch base row images from the PK (clustered) index for queries that appear to be index only.