Thursday, September 17, 2020

Performance results and DBMS experts

I used to assume that many performance results couldn't be trusted because the people sharing the results lacked expertise in the system(s) under test. I am still wary of results in conference papers that have results for too many DBMS and my wariness is a function of the number of DBMS considered. 

But my opinion on this has changed. Now I prefer to focus on the results that the typical performance-sensitive deployment will get. By performance-sensitive I mean a deployment that cares enough about performance to spend some money and/or time to get it. This isn't a deployment that uses a default configuration but it might not be able to afford full time performance teams and months of tuning.

There are far more deployments than experts, so it is safe to assume that expertise is in short supply. If we are to assign blame then I prefer to blame the DBMS for being too complex, having too many sharp edges and not being adaptive.

Lack of expertise in a DBMS isn't a mark of shame. Resources are finite (time to learn, brain capacity).  Some of us can afford to fill our heads with DBMS trivia at the expense of learning other useful skills. But most people need to focus on things other than the DBMS to get their jobs done.

There is a big opportunity to build systems that provide better performance without excessive tuning. While ML seems to be the primary motivator for current research, I expect non black-box approaches to deliver interesting results as well. And by non black-box approaches I mean designing algorithms that consider performance goals and know how to adjust to achieve them. This assumes they have some understanding of performance models. Can we call these self-aware index structures?

Managed databases are also an opportunity for scaling expertise. The service provider has the expertise and has a better chance of applying it across their many customers.

Friday, September 4, 2020

When is it OK to lose a commit?

I am thrilled that OSS DBMS are better at avoiding the loss of commits thanks to replication that can do more than async. But there are good reasons for some workloads to not pay the cost (in latency) from such features.

By losing a commit here I mean the loss of the most recent N commits, not the lost of any commit from the past. By workload I mean one of the applications using the DBMS and there can be a variety of workloads using the DBMS. For some DBMS, all applications must use the same level of commit durability. But that is an implementation artifact, not something fundamental to the algorithms we implement.

A workload for which it is OK to lose the most recent N commits is inserting into a DBMS from a log file. If this has limited concurrency then commit latency can be a problem. If commits are lost then they can be replayed. This assumes that the application 1) can determine when a commit has been lost and 2) figure out where to begin the replay.

To determine when a commit has been lost the application queries DBMS state while connecting and while reconnecting. One reason to reconnect is after a failover and a tradeoff for faster commit is the potential for losing the last N commits on failover.

In the simple case there is one single-threaded application doing inserts from a log file into one DBMS table. It is feasible but not trivial to make sure there really is at most one application running in this case -- because distributed system challenges extend to applications. But I ignore that complexity for now.

The DBMS state is found in either the table to which the inserts are done or a separate state table. If there is a way to find the last inserted row, then that tells you where to restart, assuming you can map the value of that row back to an offset in the source (log). Otherwise, a separate state table can maintain the log offset from which inserts should continue and the state table is updated in the transaction that does bulk inserts from the log.


I just remembered something that I never understood. Ingres Replicator does async replication via 2pc between the primary and a secondary -- see here and here. I never understood why. That is an expensive way to replay a log (from the primary) on a secondary. The alternative is for a secondary to track its state independent of the primary, replay transactions on a secondary (commit locally on the secondary, update state table on the secondary) and some time after commit the state (queue of transactions to be replayed) can be updated (truncated).

Thursday, August 20, 2020

iostat output format changes again

The format for iostat differs across the last 3 Ubuntu LTS (16.04, 18.04, 20.04). Some of my scripts scrape that output and now it is time to update them to detect the new format. I wrote about this before. There are other ways to get this data so this is a risk I accept for being lazy. The changes are from added columns, removed columns, renamed columns and reordered columns. I wish the last two weren't done. There are a few more projects out there that scrape iostat output and must update how they parse it.

In Ubuntu 20.04 the new columns are for trim, which is useful. The svctm column was removed and fields were reordered. I assume that the bytes written metrics no longer include bytes trimmed -- which inflated by 2X the write rate for anything that does frequent unlike (like RocksDB).

New fields in 20.04 are:

  • d/s - number (after merges) of discard requests completed per second for the device
  • dsec/s (dkB/s, dMB/s) - number of sectors (KB, MB) discarded for the device per second
  • drqm/s - number of discard requests merged per second that were queued to the device
  • %drqm - percentage of discard requests merged together before being sent to the device
  • dareq-sz - average size (in kilobytes) of the discard requests that were issued to the device
  • d_await - average time (in milliseconds) for discard requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them

Sunday, August 16, 2020

Review of PrismDB: Read-aware LSM Tree for Heterogeneous Storage

The PrismDB paper explains how to use heterogenous storage with an LSM. This is my review of the paper and I don't review papers that aren't interesting and worth reading. I might be biased given that I worked on RocksDB, MyRocks, helped with the persistent read cache for tiered storage in RocksDB for which there is an interesting paper. For the rest of this post I use the word tiered instead of heterogenous as that is easier to type. 

The paper is about one of my favorite topics, efficient performance, where the goal is to get good enough performance and then optimize for efficiency. In this case the goal is to use mostly lower-perf, lower cost storage (QLC NAND flash) with some higher-perf, higher-cost storage (NVM, SLC/TLC NAND). The simple solution to use NVM for the top levels (L0, L1), TLC for the middle levels and QLC for the max level of the LSM tree. Alas, that doesn't always work out great as the paper shows.

An LSM tree with leveled compaction organizes keys by write recency. When n < m then a key/value pair in Ln is likely to have been written more recently than a key in Lm. But it isn't clear that location in the LSM tree suggests something about read recency or frequency. Thus the simple tiered storage approach mentioned in the previous paragraph might not help much for reads. 

Paper summary

The idea in the paper is to pin hot key/value pairs higher in the LSM tree and then use higher-perf, higher-cost storage for those levels. Thus, storage read latency for such data will be reduced. This is implemented via three components: tracker, mapper, placer.

  • The tracker identifies keys that are frequently accessed. This is done via CLOCK with 2 bits/key and implemented via a concurrent hash map. 
  • The mapper decides which keys are hot. The mapper can achieve the goal of at most X% of keys on a given level getting pinned. Given such a constraint the mapper will determine which keys can get pinned in that level. The mapper uses the information from the concurrent hash map managed by the tracker.
  • The placer implements pinned compaction, explained below, to implement the decisions made by the mapper.

Pinned compaction is a new compaction strategy. For leveled compaction from Ln to Ln+1 pinned compaction will keep keys on Ln when such keys are listed for pinning on that level. Pinned compaction will also pull up keys form Ln+1 to Ln for the same reason.

Pinned compaction benefits from weighted scores that determine which SSTs are most likely to benefit from pinned compaction (have keys that should be pinned). The scores determine which SSTs to use as input for the next round of pinned compaction.

Access distributions

My biggest concern about the paper is whether production access patterns will benefit form this approach. Academic work on this topic has been hampered because not enough has been published on this topic. It is hard to work on cache management without knowing the access distributions. 

The paper has two claims on this point:

  1. Objects stored in the higher levels of the LSM are read and updated much more often than objects stored at the bottom layers.
  2. In addition, since they are updated much less frequently, these lower levels can meet the lower endurance requirements of cheaper flash storage.
I agree that objects in the higher levels are likely to be updated more often. I am less certain that this extends to reads. Most of the benchmarks from the paper use YCSB with Zipfian and the claims will definitely be true in that case. The claim about reads is more likely to be true for workloads with many range reads especially when prefix bloom filters cannot be used.

For point #2, my experience with production and benchmark workloads has been that per-level compaction write rates are similar (MB/s written to storage per level). But this can be true without contradicting claim 2 above as the larger levels have more objects. Regardless the per-level write rates must be considered to make sure a device will meet the desired lifetime. 

By similar write rates I mean they tend to be within a factor of 2 and here is an example from Linkbench -- see the Write(GB) column.


I know that conference papers don't have to solve my production concerns, but I want to share more background on the topic.

How can I deploy this in production? If I am a cloud customer then my storage form factors are different from on-prem. For AWS I can choose from ephemeral, EBS disk, EBS SSD, EBS SSD with PIOPs and S3. I probably don't have to worry about SSD endurance in the cloud but if my workload does many writes then I have to pay for that except when using ephemeral storage. In the cloud there are still different price & performance tradeoffs and I assume work like PrismDB is relevant there. 

For on-prem it isn't easy to get multiple storage devices (NVM & TLC & QLC) into one server as web-scale operators try to reduce the number of items on the menu. Even if they allow this the question is whether I can get something with small amounts of NVM per server. Disaggregated storage might solve the small amounts of storage per server problem but with disagg I have to worry about new failure modes when all servers in a rack are sharing access to the same NVM device. I think the most likely solution is for storage vendors to provide a hybrid device that includes some NVM with more MLC/TLC NAND and a lot of QLC NAND (they already do this to some extent).

Assorted comments

  • Does the mapper create a list of keys (to be pinned) per level or is there a global list? 
  • Does pinned compaction risk creating SSTs with a small number of keys? I'd rather not have an LSM tree with too many small SSTs.
  • Do the decisions about pinning avoid the double caching problem? I am not sure I need to pin objects that are likely to be in the block cache.
  • Is the CPU overhead of searching the to-be-pinned data structure significant? This might depend on whether pinned compaction is the only type of compaction run. The paper suggests regular compaction is also run. 
  • For the two points above, using a bloom filter per level for the keys to be pinned might help.
  • Is there a mechanism for unpin? If regular compaction is still run then is that the way in which pinned keys get unpinned. Or does recomputing the to-be-pinned list over time server to unpin keys that used to be hot and now are cold.
  • Logical backup and long scans for batch queries are things that can interfere with the computation of the to-be-pinned list. In MyRocks hints can be used as part of a SELECT statement to suggest that data for the query should not pulled into the block cache.

Thursday, August 6, 2020

Over fetching in a DBMS

By over fetching I mean fetching irrelevant documents or fields while processing a query. By fetching I mean the data examined by the DBMS, not the result set returned to the user.  By irrelevant I mean documents and fields that won't change the query result if they don't exist. This also applies to a SQL DBMS after substituting column for field, row for document and collection for table. In the rest of this post I mostly use MongoDB names (documents, fields and collections). I might reuse some of the terminology from a post on predicates from Use The Index, Luke. That post is interesting.

I can refine the over fetching definition into four parts:

  • co-located
  • single-table
  • join
  • aggregation
  • other

Co-located over fetching occurs when irrelevant fields are read from a document or indexes. Columnar storage is a one way to avoid this for analytic workloads. Sometimes a covering index can reduce this for OLTP workloads.

For a collection C with fields a1, a2, ..., aN then db.C.find({}, {a1:1, a2:1, _id:0}) does co-located over fetching when a collection scan is used. An index on (a1, a2) avoids over fetching because the fields a3 ... aN are not in the index. The SQL version of the query is select a1, a2 from C.

Co-located over fetching also occurs when an index has fields that are needed by the query. For the query in the previous paragraph if there is an index on (a1, a2, a3, a4) and the index is used for the query then over fetching occurs because a3 and a4 are read by the DBMS but not needed.


Single-table over fetching occurs when irrelevant documents are read. A good index is one way to avoid this, especially for OLTP workloads. 

For this example the collection C has the documents:
  • { _id: 0, a1:7, a2:8, a3:9 }
  • { _id: 1, a1:7, a2:18, a3:19 }
  • { _id: 2, a1:27, a2:28, a3:29 }
The query db.C.find({a1: 7}) does single-table over fetching when there isn't an index on a1. In that case a collection scan is done that examines all docs but the doc with _id:2 is irrelevant. With an index on a1 then only the docs with _id:0 and _id:1 are examined and there is no over fetching. The SQL version of this query is select * from C where a1=7.

Single-table over fetching can occur with indexes. For the query db.C.find({ a1: 7, a2: 8}) with an index on a1 then docs with _id:0 and _id:1 are examined but _id:1 is irrelevant because the predicate on a2 excludes it. An index on (a1, a2) avoids the over fetching. The SQL version of this query is select * from C where a1=7 and a2=8.


Join over fetching occurs when join predicates filter documents. The docs that were filtered by the join predicate don't change the query result and figuring out how to avoid examining them prior to the join might improve performance.

With MongoDB $lookup is a left outer join and over fetching cannot occur for the input documents. But it can occur for docs in the from collection when there is no index, or no good index, on it. But explain doesn't show the access path for the from collection -- until SERVER-22622 is fixed.

Whether join over fetching occurs in SQL depends on the join (inner, left outer, right outer, etc). The example uses tables C1 and C2 and the query select * from C1, C2 where C1.x1 = C2.y1
  • C1 has the rows: (x1:1, x2:2), (x1:11, x2:12), (x1:21, x2:22)
  • C2 has the rows: (y1:1, y2:2), (y1:11, y2:12), (y1:31, y2:32)
Assume this is evaluated by scanning C1 and then probing C2 (nested loops join). If there is no index on C2.y1 then join over fetching occurs for C2 because (y1:31, y2:32) is examined by a table scan but filtered by the join predicate. Join over fetching occurs for C1 whether or not there is an index on C2.y1 because (x1:21, x2:22) is examined but filtered by the join predicate.


Aggregation over fetching occurs when docs are filtered by the aggregation operator semantics. The obvious ones are $max and $min in MongoDB (max and min in SQL).

For this example the collection C has the documents { _id: 0, a1:7, a2:8 }, { _id:1, a1:7, a2:9 } and the queries are:
  • MongoDB: db.C.aggregate([ { $group : { _id: "$a1", maxval : { $max : "$a2" } } } ])
  • SQL: select max(a2), a1 from C group by a1
Aggregation over fetching occurs for the doc with _id:0 because it doesn't have the max value of a2 for the group with a1:7. An index on (a1, a2) can avoid the aggregation over fetching but you should consult the DBMS documentation to understand whether that optimization has been implemented. MySQL has loose index scan for min/max and distinct, Postgres has recursive CTE (here and here) while MongoDB has DISTINCT_SCAN (for $first but not for $min or $max - see SERVER-40090).


Sometimes index-only queries aren't as index-only as you want them to be. Postgres relies on bits being set in the visibility map or it will fetch the base row from the heap. InnoDB relies on a different mechanism but there are cases where it too must fetch base row images from the PK (clustered) index for queries that appear to be index only.

Wednesday, July 22, 2020

Sustaining high insert rates despite secondary indexes

If you want to sustain high insert rates in the presence of secondary indexes then there are two approaches at a high level. The first is to buy enough RAM to cache the entire database but that gets expensive. The second is to avoid storage reads during secondary index maintenance.

So now the problem shifts to methods for avoiding storage reads during secondary index maintenance and there are a few popular ways to do that.
  1. Partition the database so that only a few of the partitions get inserts at any point in time and then use local (per-partition) secondary indexes. If the partitions are small enough they will fit in memory and then traditional indexing (B-Tree) can be used and the per-partition secondary indexes will fit it memory for the partitions that are getting inserts. TimescaleDB is an example of this.
  2. Use an LSM, or something like an LSM, that supports read-free index maintenance for non-unique secondary indexes. MyRocks is an example of that.
  3. Don't use secondary indexes because the target workloads expect to scan large amounts of data
The interesting thing about the first approach is that you end up with a tree per partition and then a query that uses the per-partition secondary index might have to search many trees (fanout or broadcast queries) unless there are other query predicates that enable enough partitions to be pruned.

Searching trees of trees sounds a lot like what an LSM does, which is a nice way of stating that the CPU overhead of an LSM for reads can be reproduced by the first approach. In an LSM with leveled compaction there is a sorted run (or tree) per level. However an LSM benefits from a bloom filter for point queries and might benefit from a prefix bloom filter for range queries. There are also several interesting research papers that show how to use something like a bloom filter for range queries -- see papers on SuRF and Rosetta. I am not sure whether systems that use the first approach (partition with local secondary indexes) also provide something like a bloom filter to match what an LSM can do. But if most partitions eventually become read-only then there are opportunities for being clever.

  • this post has more detail on avoiding reads in a write-optimized index structure -- RocksDB merge operator, InnoDB change buffer and more
  • I would put things like Postgres BRIN (min/max values per block) into the third category (do scans but more efficiently)

Postgres, vacuum and the insert benchmark

I have been running benchmarks with Postgres for a few months and am slowly learning enough about Postgres to trust the results that I get. When I first read about Postgres vacuum I didn't think it would be important for the insert benchmark because the workload is insert-only. This post explains why I was wrong.

Vacuum arrived in the original POSTGRES storage system in the 1980s. Today it serves several purposes: 1) reclaim space for old versions of rows that are no longer visible, 2) update statistics for the query planner, 3) update bits in the visibility map to help index-only queries and 4) preventing transactionID wraparound.

I didn't expect vacuum to help much with the insert benchmark because it doesn't do updates or deletes and there is no space for vacuum to reclaim. Also, query plans were OK and there weren't enough transactions for wraparound to be an issue.

What I missed was the impact of the visibility map on the index-only range scans run during the insert benchmark. These bits are only set by vacuum and if the bit is not set for a heap page, then the base row must be read from the heap page during an index-only scan for any index entries that point to that heap page. The impact from reading the base row is significant when the working set is not cached but still visible when the working set is cached.

Fortunately, the insert benchmark pattern of inserts seems to be friendly to Postgres in that a heap page will become full after some inserts and then doesn't get more inserts. So once the visibility map bit gets set it remains set for that heap page.


Autovacuum is triggered for a table based on the number of updates and deletes done to the table. So while an insert-only workload can benefit from vacuum when it also does range scans, the inserts won't trigger vacuum. I haven't read the code to confirm that inserts don't trigger autovacuum. The main docs aren't clear as the routine vacuuming section states that When enabled, autovacuum checks for tables that have had a large number of inserted, updated or deleted tuples. But the docs for autovacuum_vacuum threshold state that it is only updates and deletes: Specifies the minimum number of updated or deleted tuples needed to trigger a VACUUM in any one table.

I now run the insert benchmark as a sequence of steps: load some data with only PK indexes, create secondary indexes, load more data, then read+write. The read+write step does short range scans as fast as possible and the writes are rate-limited inserts where there is a target on the insert rate per second. As part of this I optionally run vacuum after creating the secondary indexes. Each read+write test runs for 1 hour and vacuum is started at the end of each hour. While my scripts wait for vacuum to finish after creating the indexes, and don't wait for it to finish during the read+write tests that doesn't matter much because vacuum is fast during the insert benchmark.

Vacuum internals

This is a brief description. Hopefully it is truthy. The key point is that vacuum can require full index scans for every index of the vacuumed table. That can take a long time. Alas, this isn't an issue for the insert benchmark because it is insert-only and doesn't create old versions of rows so there are no full scans of all indexes during autovacuum.

The Internals of Postgres web-site is a great place to start after reading the online Postgres docs if you want to learn more -- see chapter 5 and chapter 6. Note that chapter 6 of the Internals site lists steps 3 before step 2 and I need to figure that out before I write too many more blog posts on this topic. There are many useful blog posts on this topic including this blog post.

I don't think vacuum in original POSTGRES did full index scans. I have read some of the discussion that explains why a full index scan is done today but I didn't save links to that. My preference is for vacuum to do index probes rather than full index scans to reduce the time between consecutive vacuums for the same table.

Regardless let me explain some of the work that vacuum can do to make it obvious that vacuum for a table can take a long time, and does when I run Linkbench.
  1. Scan heap pages for which visibility map bits are not set. From each heap page if a row is an old version (because delete or update) and not visible to any current or future transaction then copy the row's CTID (into an array?)
  2. If CTIDs were found in the previous step then scan all secondary indexes for this table and for each index entry determine whether its CTID is in the array from step 1. If yes then reclaim the space for that index entry.
  3. Repeat step 1 but this time reclaim the space for the dead rows.
Finally, the memory used to buffer CTIDs is determined by maintenance_work_mem. If that limit is reached then the steps (and full index scans) are repeated.