Thursday, September 17, 2020

Performance results and DBMS experts

I used to assume that many performance results couldn't be trusted because the people sharing the results lacked expertise in the system(s) under test. I am still wary of results in conference papers that have results for too many DBMS and my wariness is a function of the number of DBMS considered. 

But my opinion on this has changed. Now I prefer to focus on the results that the typical performance-sensitive deployment will get. By performance-sensitive I mean a deployment that cares enough about performance to spend some money and/or time to get it. This isn't a deployment that uses a default configuration but it might not be able to afford full time performance teams and months of tuning.

There are far more deployments than experts, so it is safe to assume that expertise is in short supply. If we are to assign blame then I prefer to blame the DBMS for being too complex, having too many sharp edges and not being adaptive.

Lack of expertise in a DBMS isn't a mark of shame. Resources are finite (time to learn, brain capacity).  Some of us can afford to fill our heads with DBMS trivia at the expense of learning other useful skills. But most people need to focus on things other than the DBMS to get their jobs done.

There is a big opportunity to build systems that provide better performance without excessive tuning. While ML seems to be the primary motivator for current research, I expect non black-box approaches to deliver interesting results as well. And by non black-box approaches I mean designing algorithms that consider performance goals and know how to adjust to achieve them. This assumes they have some understanding of performance models. Can we call these self-aware index structures?

Managed databases are also an opportunity for scaling expertise. The service provider has the expertise and has a better chance of applying it across their many customers.

Friday, September 4, 2020

When is it OK to lose a commit?

I am thrilled that OSS DBMS are better at avoiding the loss of commits thanks to replication that can do more than async. But there are good reasons for some workloads to not pay the cost (in latency) from such features.

By losing a commit here I mean the loss of the most recent N commits, not the lost of any commit from the past. By workload I mean one of the applications using the DBMS and there can be a variety of workloads using the DBMS. For some DBMS, all applications must use the same level of commit durability. But that is an implementation artifact, not something fundamental to the algorithms we implement.

A workload for which it is OK to lose the most recent N commits is inserting into a DBMS from a log file. If this has limited concurrency then commit latency can be a problem. If commits are lost then they can be replayed. This assumes that the application 1) can determine when a commit has been lost and 2) figure out where to begin the replay.

To determine when a commit has been lost the application queries DBMS state while connecting and while reconnecting. One reason to reconnect is after a failover and a tradeoff for faster commit is the potential for losing the last N commits on failover.

In the simple case there is one single-threaded application doing inserts from a log file into one DBMS table. It is feasible but not trivial to make sure there really is at most one application running in this case -- because distributed system challenges extend to applications. But I ignore that complexity for now.

The DBMS state is found in either the table to which the inserts are done or a separate state table. If there is a way to find the last inserted row, then that tells you where to restart, assuming you can map the value of that row back to an offset in the source (log). Otherwise, a separate state table can maintain the log offset from which inserts should continue and the state table is updated in the transaction that does bulk inserts from the log.


I just remembered something that I never understood. Ingres Replicator does async replication via 2pc between the primary and a secondary -- see here and here. I never understood why. That is an expensive way to replay a log (from the primary) on a secondary. The alternative is for a secondary to track its state independent of the primary, replay transactions on a secondary (commit locally on the secondary, update state table on the secondary) and some time after commit the state (queue of transactions to be replayed) can be updated (truncated).

Thursday, August 20, 2020

iostat output format changes again

The format for iostat differs across the last 3 Ubuntu LTS (16.04, 18.04, 20.04). Some of my scripts scrape that output and now it is time to update them to detect the new format. I wrote about this before. There are other ways to get this data so this is a risk I accept for being lazy. The changes are from added columns, removed columns, renamed columns and reordered columns. I wish the last two weren't done. There are a few more projects out there that scrape iostat output and must update how they parse it.

In Ubuntu 20.04 the new columns are for trim, which is useful. The svctm column was removed and fields were reordered. I assume that the bytes written metrics no longer include bytes trimmed -- which inflated by 2X the write rate for anything that does frequent unlike (like RocksDB).

New fields in 20.04 are:

  • d/s - number (after merges) of discard requests completed per second for the device
  • dsec/s (dkB/s, dMB/s) - number of sectors (KB, MB) discarded for the device per second
  • drqm/s - number of discard requests merged per second that were queued to the device
  • %drqm - percentage of discard requests merged together before being sent to the device
  • dareq-sz - average size (in kilobytes) of the discard requests that were issued to the device
  • d_await - average time (in milliseconds) for discard requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them

Sunday, August 16, 2020

Review of PrismDB: Read-aware LSM Tree for Heterogeneous Storage

The PrismDB paper explains how to use heterogenous storage with an LSM. This is my review of the paper and I don't review papers that aren't interesting and worth reading. I might be biased given that I worked on RocksDB, MyRocks, helped with the persistent read cache for tiered storage in RocksDB for which there is an interesting paper. For the rest of this post I use the word tiered instead of heterogenous as that is easier to type. 

The paper is about one of my favorite topics, efficient performance, where the goal is to get good enough performance and then optimize for efficiency. In this case the goal is to use mostly lower-perf, lower cost storage (QLC NAND flash) with some higher-perf, higher-cost storage (NVM, SLC/TLC NAND). The simple solution to use NVM for the top levels (L0, L1), TLC for the middle levels and QLC for the max level of the LSM tree. Alas, that doesn't always work out great as the paper shows.

An LSM tree with leveled compaction organizes keys by write recency. When n < m then a key/value pair in Ln is likely to have been written more recently than a key in Lm. But it isn't clear that location in the LSM tree suggests something about read recency or frequency. Thus the simple tiered storage approach mentioned in the previous paragraph might not help much for reads. 

Paper summary

The idea in the paper is to pin hot key/value pairs higher in the LSM tree and then use higher-perf, higher-cost storage for those levels. Thus, storage read latency for such data will be reduced. This is implemented via three components: tracker, mapper, placer.

  • The tracker identifies keys that are frequently accessed. This is done via CLOCK with 2 bits/key and implemented via a concurrent hash map. 
  • The mapper decides which keys are hot. The mapper can achieve the goal of at most X% of keys on a given level getting pinned. Given such a constraint the mapper will determine which keys can get pinned in that level. The mapper uses the information from the concurrent hash map managed by the tracker.
  • The placer implements pinned compaction, explained below, to implement the decisions made by the mapper.

Pinned compaction is a new compaction strategy. For leveled compaction from Ln to Ln+1 pinned compaction will keep keys on Ln when such keys are listed for pinning on that level. Pinned compaction will also pull up keys form Ln+1 to Ln for the same reason.

Pinned compaction benefits from weighted scores that determine which SSTs are most likely to benefit from pinned compaction (have keys that should be pinned). The scores determine which SSTs to use as input for the next round of pinned compaction.

Access distributions

My biggest concern about the paper is whether production access patterns will benefit form this approach. Academic work on this topic has been hampered because not enough has been published on this topic. It is hard to work on cache management without knowing the access distributions. 

The paper has two claims on this point:

  1. Objects stored in the higher levels of the LSM are read and updated much more often than objects stored at the bottom layers.
  2. In addition, since they are updated much less frequently, these lower levels can meet the lower endurance requirements of cheaper flash storage.
I agree that objects in the higher levels are likely to be updated more often. I am less certain that this extends to reads. Most of the benchmarks from the paper use YCSB with Zipfian and the claims will definitely be true in that case. The claim about reads is more likely to be true for workloads with many range reads especially when prefix bloom filters cannot be used.

For point #2, my experience with production and benchmark workloads has been that per-level compaction write rates are similar (MB/s written to storage per level). But this can be true without contradicting claim 2 above as the larger levels have more objects. Regardless the per-level write rates must be considered to make sure a device will meet the desired lifetime. 

By similar write rates I mean they tend to be within a factor of 2 and here is an example from Linkbench -- see the Write(GB) column.


I know that conference papers don't have to solve my production concerns, but I want to share more background on the topic.

How can I deploy this in production? If I am a cloud customer then my storage form factors are different from on-prem. For AWS I can choose from ephemeral, EBS disk, EBS SSD, EBS SSD with PIOPs and S3. I probably don't have to worry about SSD endurance in the cloud but if my workload does many writes then I have to pay for that except when using ephemeral storage. In the cloud there are still different price & performance tradeoffs and I assume work like PrismDB is relevant there. 

For on-prem it isn't easy to get multiple storage devices (NVM & TLC & QLC) into one server as web-scale operators try to reduce the number of items on the menu. Even if they allow this the question is whether I can get something with small amounts of NVM per server. Disaggregated storage might solve the small amounts of storage per server problem but with disagg I have to worry about new failure modes when all servers in a rack are sharing access to the same NVM device. I think the most likely solution is for storage vendors to provide a hybrid device that includes some NVM with more MLC/TLC NAND and a lot of QLC NAND (they already do this to some extent).

Assorted comments

  • Does the mapper create a list of keys (to be pinned) per level or is there a global list? 
  • Does pinned compaction risk creating SSTs with a small number of keys? I'd rather not have an LSM tree with too many small SSTs.
  • Do the decisions about pinning avoid the double caching problem? I am not sure I need to pin objects that are likely to be in the block cache.
  • Is the CPU overhead of searching the to-be-pinned data structure significant? This might depend on whether pinned compaction is the only type of compaction run. The paper suggests regular compaction is also run. 
  • For the two points above, using a bloom filter per level for the keys to be pinned might help.
  • Is there a mechanism for unpin? If regular compaction is still run then is that the way in which pinned keys get unpinned. Or does recomputing the to-be-pinned list over time server to unpin keys that used to be hot and now are cold.
  • Logical backup and long scans for batch queries are things that can interfere with the computation of the to-be-pinned list. In MyRocks hints can be used as part of a SELECT statement to suggest that data for the query should not pulled into the block cache.

Thursday, August 6, 2020

Over fetching in a DBMS

By over fetching I mean fetching irrelevant documents or fields while processing a query. By fetching I mean the data examined by the DBMS, not the result set returned to the user.  By irrelevant I mean documents and fields that won't change the query result if they don't exist. This also applies to a SQL DBMS after substituting column for field, row for document and collection for table. In the rest of this post I mostly use MongoDB names (documents, fields and collections). I might reuse some of the terminology from a post on predicates from Use The Index, Luke. That post is interesting.

I can refine the over fetching definition into four parts:

  • co-located
  • single-table
  • join
  • aggregation
  • other

Co-located over fetching occurs when irrelevant fields are read from a document or indexes. Columnar storage is a one way to avoid this for analytic workloads. Sometimes a covering index can reduce this for OLTP workloads.

For a collection C with fields a1, a2, ..., aN then db.C.find({}, {a1:1, a2:1, _id:0}) does co-located over fetching when a collection scan is used. An index on (a1, a2) avoids over fetching because the fields a3 ... aN are not in the index. The SQL version of the query is select a1, a2 from C.

Co-located over fetching also occurs when an index has fields that are needed by the query. For the query in the previous paragraph if there is an index on (a1, a2, a3, a4) and the index is used for the query then over fetching occurs because a3 and a4 are read by the DBMS but not needed.


Single-table over fetching occurs when irrelevant documents are read. A good index is one way to avoid this, especially for OLTP workloads. 

For this example the collection C has the documents:
  • { _id: 0, a1:7, a2:8, a3:9 }
  • { _id: 1, a1:7, a2:18, a3:19 }
  • { _id: 2, a1:27, a2:28, a3:29 }
The query db.C.find({a1: 7}) does single-table over fetching when there isn't an index on a1. In that case a collection scan is done that examines all docs but the doc with _id:2 is irrelevant. With an index on a1 then only the docs with _id:0 and _id:1 are examined and there is no over fetching. The SQL version of this query is select * from C where a1=7.

Single-table over fetching can occur with indexes. For the query db.C.find({ a1: 7, a2: 8}) with an index on a1 then docs with _id:0 and _id:1 are examined but _id:1 is irrelevant because the predicate on a2 excludes it. An index on (a1, a2) avoids the over fetching. The SQL version of this query is select * from C where a1=7 and a2=8.


Join over fetching occurs when join predicates filter documents. The docs that were filtered by the join predicate don't change the query result and figuring out how to avoid examining them prior to the join might improve performance.

With MongoDB $lookup is a left outer join and over fetching cannot occur for the input documents. But it can occur for docs in the from collection when there is no index, or no good index, on it. But explain doesn't show the access path for the from collection -- until SERVER-22622 is fixed.

Whether join over fetching occurs in SQL depends on the join (inner, left outer, right outer, etc). The example uses tables C1 and C2 and the query select * from C1, C2 where C1.x1 = C2.y1
  • C1 has the rows: (x1:1, x2:2), (x1:11, x2:12), (x1:21, x2:22)
  • C2 has the rows: (y1:1, y2:2), (y1:11, y2:12), (y1:31, y2:32)
Assume this is evaluated by scanning C1 and then probing C2 (nested loops join). If there is no index on C2.y1 then join over fetching occurs for C2 because (y1:31, y2:32) is examined by a table scan but filtered by the join predicate. Join over fetching occurs for C1 whether or not there is an index on C2.y1 because (x1:21, x2:22) is examined but filtered by the join predicate.


Aggregation over fetching occurs when docs are filtered by the aggregation operator semantics. The obvious ones are $max and $min in MongoDB (max and min in SQL).

For this example the collection C has the documents { _id: 0, a1:7, a2:8 }, { _id:1, a1:7, a2:9 } and the queries are:
  • MongoDB: db.C.aggregate([ { $group : { _id: "$a1", maxval : { $max : "$a2" } } } ])
  • SQL: select max(a2), a1 from C group by a1
Aggregation over fetching occurs for the doc with _id:0 because it doesn't have the max value of a2 for the group with a1:7. An index on (a1, a2) can avoid the aggregation over fetching but you should consult the DBMS documentation to understand whether that optimization has been implemented. MySQL has loose index scan for min/max and distinct, Postgres has recursive CTE (here and here) while MongoDB has DISTINCT_SCAN (for $first but not for $min or $max - see SERVER-40090).


Sometimes index-only queries aren't as index-only as you want them to be. Postgres relies on bits being set in the visibility map or it will fetch the base row from the heap. InnoDB relies on a different mechanism but there are cases where it too must fetch base row images from the PK (clustered) index for queries that appear to be index only.

Wednesday, July 22, 2020

Sustaining high insert rates despite secondary indexes

If you want to sustain high insert rates in the presence of secondary indexes then there are two approaches at a high level. The first is to buy enough RAM to cache the entire database but that gets expensive. The second is to avoid storage reads during secondary index maintenance.

So now the problem shifts to methods for avoiding storage reads during secondary index maintenance and there are a few popular ways to do that.
  1. Partition the database so that only a few of the partitions get inserts at any point in time and then use local (per-partition) secondary indexes. If the partitions are small enough they will fit in memory and then traditional indexing (B-Tree) can be used and the per-partition secondary indexes will fit it memory for the partitions that are getting inserts. TimescaleDB is an example of this.
  2. Use an LSM, or something like an LSM, that supports read-free index maintenance for non-unique secondary indexes. MyRocks is an example of that.
  3. Don't use secondary indexes because the target workloads expect to scan large amounts of data
The interesting thing about the first approach is that you end up with a tree per partition and then a query that uses the per-partition secondary index might have to search many trees (fanout or broadcast queries) unless there are other query predicates that enable enough partitions to be pruned.

Searching trees of trees sounds a lot like what an LSM does, which is a nice way of stating that the CPU overhead of an LSM for reads can be reproduced by the first approach. In an LSM with leveled compaction there is a sorted run (or tree) per level. However an LSM benefits from a bloom filter for point queries and might benefit from a prefix bloom filter for range queries. There are also several interesting research papers that show how to use something like a bloom filter for range queries -- see papers on SuRF and Rosetta. I am not sure whether systems that use the first approach (partition with local secondary indexes) also provide something like a bloom filter to match what an LSM can do. But if most partitions eventually become read-only then there are opportunities for being clever.

  • this post has more detail on avoiding reads in a write-optimized index structure -- RocksDB merge operator, InnoDB change buffer and more
  • I would put things like Postgres BRIN (min/max values per block) into the third category (do scans but more efficiently)

Postgres, vacuum and the insert benchmark

I have been running benchmarks with Postgres for a few months and am slowly learning enough about Postgres to trust the results that I get. When I first read about Postgres vacuum I didn't think it would be important for the insert benchmark because the workload is insert-only. This post explains why I was wrong.

Vacuum arrived in the original POSTGRES storage system in the 1980s. Today it serves several purposes: 1) reclaim space for old versions of rows that are no longer visible, 2) update statistics for the query planner, 3) update bits in the visibility map to help index-only queries and 4) preventing transactionID wraparound.

I didn't expect vacuum to help much with the insert benchmark because it doesn't do updates or deletes and there is no space for vacuum to reclaim. Also, query plans were OK and there weren't enough transactions for wraparound to be an issue.

What I missed was the impact of the visibility map on the index-only range scans run during the insert benchmark. These bits are only set by vacuum and if the bit is not set for a heap page, then the base row must be read from the heap page during an index-only scan for any index entries that point to that heap page. The impact from reading the base row is significant when the working set is not cached but still visible when the working set is cached.

Fortunately, the insert benchmark pattern of inserts seems to be friendly to Postgres in that a heap page will become full after some inserts and then doesn't get more inserts. So once the visibility map bit gets set it remains set for that heap page.


Autovacuum is triggered for a table based on the number of updates and deletes done to the table. So while an insert-only workload can benefit from vacuum when it also does range scans, the inserts won't trigger vacuum. I haven't read the code to confirm that inserts don't trigger autovacuum. The main docs aren't clear as the routine vacuuming section states that When enabled, autovacuum checks for tables that have had a large number of inserted, updated or deleted tuples. But the docs for autovacuum_vacuum threshold state that it is only updates and deletes: Specifies the minimum number of updated or deleted tuples needed to trigger a VACUUM in any one table.

I now run the insert benchmark as a sequence of steps: load some data with only PK indexes, create secondary indexes, load more data, then read+write. The read+write step does short range scans as fast as possible and the writes are rate-limited inserts where there is a target on the insert rate per second. As part of this I optionally run vacuum after creating the secondary indexes. Each read+write test runs for 1 hour and vacuum is started at the end of each hour. While my scripts wait for vacuum to finish after creating the indexes, and don't wait for it to finish during the read+write tests that doesn't matter much because vacuum is fast during the insert benchmark.

Vacuum internals

This is a brief description. Hopefully it is truthy. The key point is that vacuum can require full index scans for every index of the vacuumed table. That can take a long time. Alas, this isn't an issue for the insert benchmark because it is insert-only and doesn't create old versions of rows so there are no full scans of all indexes during autovacuum.

The Internals of Postgres web-site is a great place to start after reading the online Postgres docs if you want to learn more -- see chapter 5 and chapter 6. Note that chapter 6 of the Internals site lists steps 3 before step 2 and I need to figure that out before I write too many more blog posts on this topic. There are many useful blog posts on this topic including this blog post.

I don't think vacuum in original POSTGRES did full index scans. I have read some of the discussion that explains why a full index scan is done today but I didn't save links to that. My preference is for vacuum to do index probes rather than full index scans to reduce the time between consecutive vacuums for the same table.

Regardless let me explain some of the work that vacuum can do to make it obvious that vacuum for a table can take a long time, and does when I run Linkbench.
  1. Scan heap pages for which visibility map bits are not set. From each heap page if a row is an old version (because delete or update) and not visible to any current or future transaction then copy the row's CTID (into an array?)
  2. If CTIDs were found in the previous step then scan all secondary indexes for this table and for each index entry determine whether its CTID is in the array from step 1. If yes then reclaim the space for that index entry.
  3. Repeat step 1 but this time reclaim the space for the dead rows.
Finally, the memory used to buffer CTIDs is determined by maintenance_work_mem. If that limit is reached then the steps (and full index scans) are repeated.

Friday, July 17, 2020

Review of -- TimescaleDB: SQL made scalable for time-series data

This is a short review of TimescaleDB: SQL made scalable for time-series data. My focus is on indexing and I ignore many interesting things described in that paper. My goal in reading it was to understand possible optimizations for workloads like the insert benchmark v3. I also read some of the online TimescaleDB docs. This blog post is truthy as I am not an expert on TimescaleDB. Hopefully I haven't made too many mistakes. The paper is several years old and TimescaleDB has improved a lot since then.

The paper does a great job asserting what they know to be true about the workloads for which TimescaleDB is targeted.

Assertion 1:
Time-series data is largely immutable. New data continually arrives, typically corresponding to the latest time periods. In other words, writes primarily occur as new inserts, not as updates to existing rows. Further, while the database needs to be able to support backfill for delayed data, writes are made primarily to recent time intervals.
Assertion 2:
Workloads have a natural partitioning across both time and space. Writes typically are made to the latest time interval(s) and across the “partitioning key” in the space dimension (e.g., data sources, devices, users, etc.). Queries typically ask questions about a specific time series or data source, or across many data sources constrained to some time interval. Yet the queries might not be limited to a particular metric, but may regularly select multiple metrics at once (or use predicates that rely on multiple metrics).
Rows are stored in hypertables rather than tables and hypertables are composed of chunks. A hypertable must have a time column and can have a partition column. The paper states that the partition column is required for clustered TimescaleDB. But the online docs have more nuance.

Chunks are right-sized (their name for it) to fit in memory. Without a partition column the hypertable is range partitioned on time into chunks. With the partition column the data is distributed by hash on the partition column and then range on the time column. In SQL DBMS partitioning terminology this is composite partitioning with: hash(partition), range(time).

But this isn't traditional partitioning because it is dynamic and automatic. Nobody has to run DDL to add, drop and change partitions. That is one way they add value.
Compression has been added since the paper was published. It is described here. I have yet to read that post but assume that old chunks are compressed while chunks still getting inserts remain in row format.
Secondary indexes are local to a chunk. I assume that means that fanout over (too) many chunks can happen when a query doesn't have a predicate on the partition column. But it also means that if the chunk fits in memory, then secondary index maintenance is not delayed by reading from storage and it is easier to sustain high ingest rates.


What is a good indexing strategy for a simple time series workload where new data has four attributes: time, deviceID, metricID, metricValue? I write about this in my post on a replacement for the insert benchmark. Assume that I will use a partition column in addition to the time column. My choices with TimescaleDB are:
  1. partition on deviceID and create local secondary indexes on metricID
  2. partition on metricID and create local secondary indexes on deviceID
I am not sure what I would do to make the min/max query fast as explained in the section on Physical Schema - Read Optimized.


I quickly read the post on columnar compression and I am impressed they made this work in the context of Postgres. They use TOAST to store compressed columnar chunks off-page so IO is only done for the target columns. In terms of being surprised that this was feasible, it reminds me of the work to support columnar in SQL Server.

Wednesday, July 15, 2020

Indexing and write-heavy workloads

When I see impressive numbers for the insert rate that a DBMS can sustain I wonder what indexes exist and whether the inserts are in sequential or random order with respect to each index. One way to explain this is in terms of the numbers of points in the index at which the inserts occur. Although I use streams rather than insert points in what follows.

I am writing this in part so that I can reference this post in future performance reports when describing workloads. It isn't sufficient to state that inserts are in PK order. They can be in ascending or descending PK order. When ascending the point at which the inserts are done can be at the right end of the index (inserted keys > than existing keys) or somewhere in the middle of the index. When descending the inserts can be done at the left end of the index (inserted keys < existing keys) or somewhere in the middle of the index.

Explaining insert patterns

There are four attributes per index that can explain such insert patterns. The attributes are:
  • nAsc - number of streams for which inserts occur in ascending order WRT the index
  • nDesc - number of streams for which inserts occur in descending order WRT the index
  • nLHS - the number of descending streams that are at the left end of the index 
  • nRHS - the number of ascending streams that are at the right end of the index
  • nAsc >= 0, nDesc >= 0 and (nAsc + nDesc) >= 1
  • nLHS and nRHS must be 0 or 1
  • if nLHS is 1 then nDesc must be >= 1
  • if nRHS is 1 then nAsc must be >= 1
There is one exception. When the insert pattern is random WRT the index then inf is used instead of the four attributes.

Geek Code

This is a geek code for explaining insert patterns. The attributes are specified per index. When there is only a PK index, named pk, and inserts occur in PK order at the right end of the index (right growing) then the geek code is:
pk=(nAsc:1, nDesc:0, nLHS:0, nRHS:1)
When there is only a PK index but the inserts are in random order WRT the PK then the geek code is:
To improve readability I omit attributes for which the value is 0. So these mean the same thing:
pk=(nAsc:1, nDesc:0, nLHS:0, nRHS:1)
pk=(nAsc:1, nRHS:1)

I am interested in this for three reasons. First, index maintenance has a big impact on insert performance whether or not the working set is in memory. Second, there are optimizations that a DBMS can do for some insert patterns and I suspect there is room for even more optimizations. Many storage engines optimize for right-growing inserts. In that case RocksDB with leveled compaction will have write amplification of 2 -- write once for the WAL, write again for the memtable flush, no compaction. Finally, this makes it easier to explain write-heavy workloads.

Steams and insert points

I use ordered arrays rather than indexes to explain streams (insert points). Assume the array starts as: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0], this represents the keys in the PK index and there are no secondary indexes. Some of the examples showwhy I use streams to describe this.

  • Random
    • pk=(inf)
    • insert sequence: 1.5, 6.5, 1.7, 8.1, 0.0, 4.5, ...
  • Right growing
    • pk=(nAsc:1, nRHS:1)
    • insert sequence: 10.0, 11.0, 12.0, ...
  • Left growing
    • pk=(nDesc:1, nLHS:1)
    • insert sequence: 0.0, -1.0, -2.0, ...
  • Left & right growing
    • pk=(nAsc:1, nDesc:1, nLHS:1, nRHS:1)
    • insert sequence: 10.0, 0.0, -1.0, 11.0, 12.0, -2.0
    • insert sequence as interleaved streams: [10.0, 11.0, 12.0] and [0.0, -1.0, -2.0]
  • 1 middle ascending
    • pk=(nAsc:1, nRHS:0)
    • insert sequence: 8.1, 8.11, 8.111, 8.1111, ...
  • 1 middle descending
    • pk=(nDesc:1, nLHS:1)
    • insert sequence: 7.9, 7.89, 7.889, 7.8889, ...
  • 1 middle ascending, 1 middle descending
    • pk=(nAsc:1, nDesc:1)
    • insert sequence: 8.1, 7.9, 8.11, 7.89, 8.111, 7.889, ... 
    • insert sequence as interleaved streams: [8.1, 8.11, 8.111] and [7.9, 7.89, 7.889]
  • 2 middle ascending:
    • pk=(nAsc:2)
    • insert sequence: 8.1, 6.1, 8.11, 6.11, 8.111, 6.111, ...
    • insert sequence as interleaved streams: [8.1, 8.11, 8.111] and [6.1, 6.11, 6.111]
  • N middle ascending
    • pk=(nAsc:N) for some finite value N

Explaining the insert benchmark

Until recently I ran the insert benchmark by first creating a PK index and 3 secondary indexes per table (or collection) and then doing inserts. Informally, the inserts were in PK order but random WRT to each secondary index. More formally, the insert pattern is the following when the secondary indexes are named s1, s2 and s3:
pk=(nAsc:1, nRHS:1)
The insert benchmark can become extremely IO-bound because of the random insert patterns for each of the secondary indexes. In the worst case with a B-Tree there is one page read and one page written back per secondary index per insert (3 pages read, 3 pages written back with 3 secondary indexes).

I recently changed the way that I run the insert benchmark to create the PK index, load some data, create secondary indexes and then load more data. In this case the insert pattern during load some data is:
pk=(nAsc:1, nRHS:1)

And then during load more data (with secondary indexes in place) is:
pk=(nAsc:1, nRHS:1)

Tuesday, July 14, 2020

Review: The Design of the Postgres Storage System

This is a review of The Design of the Postgres Storage System. The paper was in VLDB 1987 although my review used a version of the paper that might differ from the one in VLDB (which is appropriate, given MVCC). This is the also the first in a series of posts on MVCC GC. My standard disclaimer is that I don't review papers that aren't worth reading.

The paper is about POSTGRES, not modern PostgreSQL. POSTGRES was using POSTQUEL at the time and the paper has sample POSTQUEL statements. Something I like very much about the paper is the use of simple estimates to explain the design decisions.

1 Introduction

The abstract and introduction provide four reasons for the research:
  1. Optionally archive all previous versions of rows on separate storage as WORM optical disks were likely to arrive on the market. 
  2. Don't use a WAL (redo log). The benefits include instant recovery, less code complexity and better support for archiving.
  3. Use multiple processes to benefit from multi-processor systems that were likely to arrive and were in development at UC Berkeley and elsewhere.
  4. Use non-volatile RAM (NVRAM) to boost performance.
I assume that the first point (archive all previous versions) was the big deal. While this feature might not have been an immediate success it has turned into a big deal in production systems today. Atlas Data Lake from MongoDB is one example. Another popular way to archive all versions is via change data capture (CDC) to store the older versions in a data warehouse separate from the OLTP system.

2.1 Transaction System

POSTGRES used 40-bit transaction IDs and stated that was sufficient for 320 years of operation assuming 1 TPS. The POSTGRES transaction log uses 2 bits per transaction -- there is no redo log. Logically the log is a large array indexed by XID. The bits represent the status of a transaction: committed, aborted, in progress. The XID is assigned at transaction start. Commit is done by changing the bit in the log to committed, forcing the log page to stable storage and forcing modified database pages to stable storage. Stable storage is either magnetic disk or NVRAM.

The log tail is log from the XID of the oldest active transaction to the present and requires 2 bits per XID. The body is the rest of the log (transactions oldest than the oldest active transaction). As all transactions in the log body are either committed or aborted only 1 bit per XID is needed for the log body.

The goal is to keep the log tail in NVRAM and the log body cached in memory. The log body is read-only, the log tail is not. Both can be searched to determine whether a row is visible to a snapshot and the goal is to avoid disk reads in that search. The paper also explains that a bloom filter can be created on the XIDs of aborted transactions to avoid keeping the log body in memory.

Modern PostgreSQL uses 32-bit transaction IDs and wraparound is a source of problems. Other difference are that modern PostgreSQL has a redo log, doesn't force modified pages to stable storage on commit and doesn't (yet) try to take advantage of NVRAM.

I expect that POSTGRES had worse write-amplification then a system that didn't force dirty pages on commit. But I am unlikely to run the insert benchmark to confirm this. Besides, LMDB does FORCE on commit and has many happy users.

2.2 Relation Storage

The per-row metadata includes:

  • OID - system-assigned unique ID

  • Xmin, Xmax - the XID that starts, ends the version

  • Tmin, Tmax - commit time of XID from Xmin, Xmax

  • Cmin, Cmax - ID of command that starts, ends the version. This is 1-byte so there could be at most 256 commands (statements?) per transaction.

  • PTR - pointer to older or newer version of the row (explained below)

Modern PostgreSQL has similar per-row metadata. The differences are that in PostgreSQL the XID is 32 bits, there is only one field for command ID, there is a 6-byte tuple ID (TID) and the OID is usually not used for user tables.

Fields were set:

  • On insert the OID, Xmin and Cmin were set. Tmin was not set because commit had yet to occur.

  • On update Xmax and Cmax were set to end the row version and a new version of the row was inserted (hopefully to the same page). The new version reused the OID of the ended version and the PTR for the new version pointed to the ended version.

  • On delete Xmax and Cmax were set.

To use less space updates only stored fields that changed and the other fields were found by following the PTR chain (a singly-linked list). The oldest version of a row was called the anchor point. The notion of an anchor point and update (delta) chain is similar to the current support for Heap Only Tuples (HOT) in modern PostgreSQL. I wonder if that is a feature that was removed in early PostgreSQL and then was returned for a different reason.

2.3 Time Management

This section shows the logic required to determine whether a version is visible to a query. The check is more complicated than what InnoDB and RocksDB require, but I assume the CPU overhead is not that different than what occurs in modern PostgreSQL and in my testing of modern PG this isn't an issue. The logic includes a check of the transaction log to determine whether the transaction from Xmin or Xmax committed. That check wouldn't be needed if the commit timestamp were written into the row on commit -- but doing that is non-trivial and can hurt performance. The need to check the transaction log also means that the searched parts of the log must remain in memory or there will be disk reads. The ability to keep that in memory is explained in section 2.1. I am wary of the ability to keep the log in memory for high TPS systems but this is a problem they didn't need to solve at the time.

2.4 Concurrency Control and Timestamp Management

POSTGRES contains a TIME relation that has the commit time for each transaction. This has 32 bits per XID and is updated on commit. The tail of TIME should be stored in stable main memory to avoid forcing a disk page on commit.

Relations are marked by the user as no-archive, light-archive or heavy-archive. Tmin and Tmax are never set for no-archive relations and I assume old versions for them are not moved to the archive. For light-archive, old versions are moved to the archive but Tmin/Tmax are not set to avoid the overhead of doing a search of the transaction log to determine their status. For heavy-archive the reader (a query) will lookup the commit time from the TIME relation and update Tmin/Tmax (thus making a page dirty). Vacuum sets Tmin/Tmax for heavy-archive when moving older versions to the archive. It is possible that the thing (query, vacuum) that searches TIME will be delayed by disk reads.

2.5 Record Access

On each page there is a line table with an entry per anchor point record. Secondary index entry points to line table entry. On update a secondary index only needs maintenance if the indexed columns have been changed.

Modern PostgreSQL uses the name line pointer. Also modern PostgreSQL does secondary index maintenance for all secondary indexes unless no indexed columns have changed. So if there 3 secondary indexes and an update changes a column used by 1 of them then maintenance is done for all of them -- unless HOT is used. If no indexed columns have changed then the Heap Only Tuples (HOT) optimization is used and the new version is added to the end of the update chain and secondary index entries reference the line pointer for the head of the update chain. Quoting from the HOT document:

Without HOT, every version of a row in an update chain has its own index entries, even if all indexed columns are the same. With HOT, a new tuple placed on the same page and with all indexed columns the same as its parent row version does not get new index entries.

3.1 Vacuuming the disk

POSTGRES had a command to trigger vacuum of a relation. The example was vacuum rel-name after "30 days". This reclaims space from aborted transactions and moves old versions to the archive. Old versions for relations marked as light-archive and heavy-archive are moved to archive storage. If heavy-archive is set for the relation then vacuum will set Tmin/Tmax if unset. Differences between POSTGRES and modern PostgreSQL include:

  • Vacuum in modern PostgreSQL doesn't move older versions to an archive. It does reclaim space for versions that have been deleted and are no longer visible. It also sets bits in the visibility map and does work to avoid transaction ID wraparound.

  • Vacuum did a full scan of the relation in POSTGRES while modern PostgreSQL only checks pages that require vacuum courtesy of the visibility map

  • Vacuum in modern PostgreSQL does a full index scan for every secondary index of the vacuumed relation when there are rows to remove.

3.2 Archival Medium

The target archival media was optical WORM. While WORM might not have been a huge hit CD-R and DVD-R were a big deal for a long time. Zip drives were a big deal for a shorter time and now we have USB thumb drives. Maybe WORM will return in the form of ultra-low-endurance NAND flash SSDs that support only one device write.

The paper also explained interesting ways to manage secondary indexes using both magnetic disk and archive devices with plans for R-trees to support efficient time-bounded queries.

3.3 Vacuum Process

Vacuum in POSTGRES did:

  1. Write archive record and index entries

  2. Write new anchor point in current database, insert new index entries

  3. Reclaim space from old anchor point and delta records

This wasn't crash safe but POSTGRES did the right thing in spite of crashes. Crashes could leave duplicate records with a copy of the same version in both the archive and main store. But POSTGRES was relational and eliminated such duplicates. I explain differences with modern PostgreSQL above in section 3.1.

5.1 [Performance Comparison] Assumptions

One nit I have with the paper is the argument that CPU is not a critical resource. It listed a few reasons for this -- CPUs were getting much faster than disk, multi-processors were coming, co-processors could be used and custom logic could be used. While the CPU-disk speed gap was growing the paper ignored that RAM density was growing quickly and many DBMS applications would be less IO-bound in the future.

Another nit is that the paper ignores the overhead from vacuum. Vacuum doesn't just use CPU. It reads from the vacuumed relations and dirties pages in them. Accounting for that overhead would be complicated and the focus of the paper was on simple performance models, which made it a nice paper to read.

Monday, July 13, 2020

Updates for the insert benchmark

I continue to run and improve the insert benchmark. Eventually I will even share results. I also have scripts that generate charts, graphs and tables all combined into an HTML-formatted report. This is an update to the overview of the insert benchmark and details on how I share performance results is here.

Per interval results

The first change is that the benchmark client displays per-second performance results including IPS, QPS and max response time for inserts and queries. This makes it easier to understand stalls.

This is output from a load and the legend for the columns is:
  • i_sec - number of seconds for this interval
  • t_sec - cumulative number of seconds
  • i_ips, i_qps - average insert and query rate for that interval
  • t_ips, t_qps - average insert and query rate from test start until now
  • max_i, max_q - max insert and query response time for that interval, in microseconds
  • t_ins, t_query - total inserts and queries from test start until now

i_sec   t_sec   i_ips   t_ips   i_qps   t_qps   max_i   max_q   t_ins   t_query
1.0     1.0     61833   61833   0       0       2919    0       61900   0
1.0     2.0     63625   62729   0       0       1983    0       125600  0
1.0     3.0     63829   63095   0       0       2080    0       189500  0

Test steps

The second change is in how I run the benchmark. As described in the overview, it used to be run as: create tables/collections with secondary indexes, load X million rows/documents, do a full scan of each secondary index, do read+write tests with rate-limited writers.

I have changed it since then. The new pattern is:
  • create tables/collections without secondary indexes
  • load some data (l.i0)
  • create secondary indexes (l.x)
  • load some more data (l.i1)
  • do read+write tests (q100.1, q100.2, q200.1, q200.2, ...)
The read+write tests use rate-limited writers and the tests are run for varying limits. The test starts with 100 insert/s per client (q100) and then 200, 400, 600, 800 and 1000. In some cases the DBMS is unable to sustain the target insert rates. Also, for each limit the test is run twice so the results might use the names q100.1, q100.2, q200.1, q200.2, ..., q1000.1, q1000.2.

Wednesday, July 1, 2020

Something changed for the better in create index between MySQL 8.0.18 and 8.0.20

I include MongoDB, Postgres, MySQL/InnoDB and MySQL/RocksDB (MyRocks) in the things I test via Linkbench and the insert benchmark. Hopefully I add another test later this year. I have been using MySQL 8.0.18 this year and recently started to use 8.0.20.

The new insert benchmark workflow is create tables with PK indexes, load tables, create 3 secondary indexes on each table, continue the load and then run several rounds of reads+inserts. For this test there were 8 clients with a table per client and 10M rows/table after the initial insert.

I think something changed for the better from MySQL 8.0.18 to 8.0.20 and I didn't spot the change in the release notes. I see:
  • Create index in 8.0.20 is 10% to 20% faster (nice, but not why I wrote this)
  • Create index in 8.0.18 uses ~7X more read IO with innodb_flush_method = O_DIRECT (14G vs 2G) and ~2X more read IO with it set to O_DIRECT_NO_FSYNC (4G vs 2G). The table was cached at the start of create index and the indexed table should be less than 2G. Tests used innodb_sort_buffer_size = 64M.
    • Something changed with the sort used by create index. Maybe it does fewer passes. I am not sure there is a way to monitor that.
    • I don't understand the impact from O_DIRECT vs O_DIRECT_NO_FSYNC in 8.0.18, while it had no impact in 8.0.20.
  • When the load was restarted after create index there was initially a lot of read IO with 8.0.20. Either the indexed table or the newly created indexes were not in cache and my bet is on the new indexes. This doesn't happen with 8.0.18. 
  • For 8.0.20, create index is faster with innodb_use_native_aio set to ON, but I am ignoring that topic for now.
Tests are run for 3 configurations (c10b40, c10b40a, c10b40b). The InnoDB options for those configs are here. Perhaps because of the way I compiled MySQL from source, innodb_use_native_aio is off by default for 8.0.18 but on for 8.0.20 -- so 8.0.18 and 8.0.20 differ at runtime only for that option with the c10b40 config. For the c10b40a and c10b40b configs, the options are the same at run time because innodb_use_native_aio is set. The differences between the configs are:
  • c10b40 - use O_DIRECT
  • c10b40a - start with c10b40, add innodb_use_native_aio=FALSE
  • c10b40b - start with c10b40b, change to O_DIRECT_NO_FSYNC
Performance metrics are here. The slightly out of date legend for the metrics is here. Similar to my favorite InnoDB performance expert, the results are compressed so I can compare many configurations on one screen. The metrics don't include it but the time to create the index for 8.0.18 is (275, 281, 233) seconds and for 8.0.20 is (157, 228, 209) seconds for the (c10b40, c10b40a, c10b40b) configs. The ips metric in the tables for create index is indexed rows per second computed as (number of rows in table / time to create all indexes).

Tuesday, June 30, 2020

Java GC and Linkbench

I filed a bug for the Sun JDK a long time ago, sometime around 1998. At the time growing the heap didn't work and the workaround was to set -Xms and -Xmx to the same value. That worked as long as you knew a reasonable value -- too small and your process dies, too large and you waste memory.

Now I get to revisit Java GC. I am running Linkbench and the bin/linkbench script that starts the bench client uses -Xmx=1000 to limit the heap to 1000MB of RAM. Someone else wrote that script and I didn't know it was done until a test failed when the heap was too small.

I use Linkbench on large and small servers and need to be careful about memory usage on the small servers so I ran a few tests to understand the impact on performance and memory usage for Linkbench run with -Xmx=1000, -Xmx=2000 and -Xmx not set. I used a test with a ~10G database (maxid1=10M) and two levels of concurrency -- 16 clients, 64 clients.

For the tests I report metrics from the Linkbench client process: max value for VSZ, max value for RSS and number of CPU seconds. This is collected via "ps aux" run at 30-second intervals and now I wish I reduced that to 1-second intervals but I don't want to repeat the tests. The tests used MySQL 8.0.18 and JDBC via Connector/J 8.0.20.

Now I get to revisit Java GC on Amazon Linux 2 and Java is:
$ java -version
java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)
Disclaimers -- I am far from a Java expert and the JDK I used might not be the latest and greatest. Feedback is welcome.


I noticed two things:
  1. VSZ is 5X to 10X larger than RSS. I get scared when I see a large VSZ, even when RSS is small. But some good things (jemalloc) do that so eventually I tolerate it.
  2. Unlimited heap size doesn't save much CPU time. I expected it to help more by reducing the GC overhead. It helped a bit with 64 clients, but hurt with 16 clients.
Based on these results I will update my test scripts to use -Xmx=1000 on small servers and -Xmx=2000 on large servers. Were I to use more than 64 concurrent clients then I might use a value larger than 2000.

The data

  • hMax - value for -Xmx
  • vsz - Linkbench client max VSZ in GB
  • rss.l, rss.r - Linkbench client max RSS in GB during the load (rss.l) and run (rss.r)
  • cpuSec - number of CPU seconds used by the Linkbench client

---16 clients
hMax    vsz     rss.l   rss.r   cpuSec
1000     5.8    0.5     0.9     2795
2000     6.9    0.8     1.3     2856
none    20.8    0.8     4.0     2853

--- 64 clients
hMax    vsz     rss.l   rss.r   cpuSec
1000     9.0    0.9     1.3     8104
2000    10.1    1.2     1.8     7800
none    24.0    5.2     5.8     7773

Monday, May 18, 2020

Avoiding reads in write-optimized index structures

This is a summary of features that avoid storage read IO in OSS write-optimized index structures. My scope is limited to MySQL and Tarantool -- I know there are similar features in other DBMS and am happy for comments on that:
  • RocksDB merge operator logically does read-modify-write but physically does a Put into the memtable and the read is deferred until user queries or compaction. I hope that MyRocks eventually takes advantage of the merge operator.
  • Read free replication in TokuDB and MyRocks.
  • Fast Updates in TokuDB do something similar to the merge operator for update and insert
  • Non-unique secondary index maintenance is read free in MyRocks
  • Deferred secondary index update in Tarantool is a new approach to avoiding reads for write-heavy workloads. It is worth reading. I wish it had been published by VLDB.
  • MyRocks also avoids some storage reads when validating unique constraints (for PK and unique secondary indexes) on inserts and some updates because the Get() that it does can use a bloom filter. This benefit is larger when there is a bloom filter on the max level of the LSM tree, but that has other costs is is frequently not done.
  • MyRocks uses the SingleDelete feature in RocksDB to allow tombstones to be removed earlier which decreases the CPU overhead for reads after a batch of writes.
  • A b-tree can also reduce index maintenance related storage read IO. See the InnoDB change buffer.

An incomplete list of relevant research:
  • DiffIndex - this is more about distributed systems, but still interesting
  • Techniques for reducing index maintenance overhead in an LSM by Luo & Carey (VLDB 2019)
  • Deferred Lightweight Indexing (DELI)

Thursday, April 9, 2020

Reviving mstat

A long time ago I wrote mstat to collect iostat, vmstat and MySQL (show global status) performance counters. It samples everything at the same interval (every N seconds), computes rates for all numeric values and has some support for expressions so I can define a new counter as a function of other counters.

Eventually I stopped using it and then there were problems. Recently I fixed many of them. The tool now works with Python3 and supports old and new iostat output format. This week I started to add support for Postgres and MongoDB. It gets data from pg_stat_bgwriter for Postgres and from db.serverStatus() for MongoDB. So the tool is useful to me again but it won't be useful to others until I add docs.

The tool prints one line in CSV format per time interval. The line is likely to be long because there can be hundreds of counters in the line. The name and offset for each counter is printed at startup. I extract data using bash and awk as part of my benchmark workflow.

Because of this dependency to list the offset per counter at startup the tool is not able to handle dynamic data yet. An example of dynamic data are per-DBMS counters such as the rows in pg_stat_database. That view has one row per database and the database names (datname) are not fixed. Similar problems exist for the per-table and per-index monitoring tables and views provided by MySQL, other stats views provided by Postgres and other monitoring data provided by MongoDB. So I need more time to figure out how to support them.

I have more work to do on mstat, but I am happy that I can use it again:
  • I need to figure out whether there are global counters in Postgres for things like queries and rows read
  • For MongoDB mstat has a hack to replace space, dot and parentheses with _ in key names
  • More work remains to make sure that counters are ignored unless their type is numeric or a date

Monday, April 6, 2020

Creating performance reports

I am trying to automate some of the tasks that I must do to produce performance reports. By trying I mean that I am just getting started on this and will be making decisions I might regret later. This is also a request for advice, feedback and corrections.

Some requirements, not all of which will be satisfied:
  • interactive - I prefer that the format can be hosted in a way that allows for discussions. Google Docs does a great job at this by supporting access control and threaded discussions on comments.
  • graphics - support charts and graphs. Back in the day I was a happy user of Google Image Charts that let me define a chart in a URL. Alas, the service has been turned off and the new thing from Google requires Javascript.
  • self-contained - I want to share the report as one thing. With Google Docs that thing can be the URL for the doc, even when there are additional things, like spreadsheets, linked to it. With HTML the thing is the HTML document.
  • script-able - I want to create as much as possible of the document via scripting. I frequently repeat tests and don't want to waste time reformatting tables.
  • private - some reports are not yet public and I prefer a solution that isn't always public. However I also want a solution that works for company-private reports and also for reports that are shared with the public. 
  • other - supports wide lines (via a slider, wrapping wide lines is lousy for a reader), what else?

  • Google docs
    • interactive - yes
    • graphics - yes
    • self-contained -yes
    • scriptable - no
    • privacy - yes
    • other - no support for wide lines, they are always wrapped
  • Github Pages
    • interactive - no, always public
    • graphics - maybe
    • self-contained - yes
    • scriptable - yes
    • privacy - no, always public
    • other - supports wide lines
  • Blogger
    • interactive - somewhat via comments but not as nice as Google docs, but always public
    • graphics - yes by embedding images
    • self-contained - yes
    • scriptable - maybe if I can paste an HTML doc as the content
    • privacy - no, always public
    • other - weak support for wide lines
  • Plain HTML
    • interactive - maybe*
    • graphics - maybe*
    • self-contained - maybe*
    • scriptable - yes
    • privacy - yes*
    • other - supports wide lines
From the above I am not aware of anything close to a perfect solution. I am likely to choose plain HTML. Privacy can be managed by sharing via company email and or hosting the HTML files on company-private servers.

Charts and graphs could be provided by a web-service that lets a chart be defined in a URL. Quickchart is an example of that. Of course, use of such a service means the data is no longer private. But there is a difference between posting a web-page for the world to read and sharing that data with the company that provides the URL chart service.

If I use plain HTML then I must find a way to share such pages with the public. That last sentence is amusing given that the web runs on HTML. But I have never done this. I have been a long-time, and mostly happy, user of Blogger. I have begun to use Github Pages and am a big fan of markup.

As a bonus, I prefer a solution that doesn't depend on the whims of a vendor. For example, if I commit to service X and the vendor for service X turns off that service, I don't want to spend weeks reformatting reports while I migrate to a new service.

It would be great if I could inline some of the charts in the HTML document. I need to figure out whether there are good tools for simple ascii bar charts and graphs that can be pasted into an HTML document.

I am a fan of gnuplot and need to learn more about how I can share gnuplot output in PNG format as part of an HTML document without posting all of that on the web.

Monday, March 30, 2020

Compiling from source for Amazon Linux 2

Different is worse

I used to know struggle with  make & automake but I knew enough to get things done. Now I have fun with make, cmake, maven and gradle. I am sure this proliferation solves problems for people who create the build scripts that I use. But it isn't a good experience for the intermittent user -- different is worse because my skill level per tool is lower.

What is Amazon Linux 2? It has a landing page. I call it AL2 below. It claims to be like CentOS, RHEL, Fedora and it uses yum. I know how to use Google to get answers for those distros when trying to figure out how to build something from source.

But it isn't clear to me that packages that exist for CentOS also exist for AL2. One example of the challenge is that I ended up installing jemalloc from source because I couldn't find the package without advice from an expert.

I hope the output from cat /etc/os-release on AL2 is improved. Below are examples for AL2 and Ubuntu 18.04. Ubuntu specific that this is Ubuntu 18.04. From AL2 I learn it is like CentOS. But which CentOS?

NAME="Amazon Linux" 
VERSION="2" ID="amzn"
ID_LIKE="centos rhel fedora"
PRETTY_NAME="Amazon Linux 2"

Lets compare that with Ubuntu:

VERSION="18.04.4 LTS (Bionic Beaver)"
PRETTY_NAME="Ubuntu 18.04.4 LTS"

Dependencies for AL2

This explains how to satisfy the dependencies before running cmake. I struggled figuring out the names for packages that had to be installed. Some of this is because I don't have much experience with CentOS, some of this is on AL2. Eventually I found this page which might have helped.

Add libraries for the insert benchmark, some of this is needed for the build and all is needed when I run benchmarks so I will just install before I try to build:

sudo yum install -y python3
sudo python3 -m pip install pymongo
sudo yum install -y mysql-devel python3-devel
sudo pip3 install mysqlclient
sudo pip3 install psycopg2-binary
sudo yum install -y openssl11-libs
sudo yum install -y libzstd-1.3.3
sudo yum install -y ncurses-compat-libs
# The host arrives comes with /etc/my.cnf from MariaDB

sudo rm -f /etc/my.cnf /etc/mysql/my.cnf

Install jemalloc from source because I couldn't figure out how to install the package:

cd /media/ephemeral1
rm -rf jemalloc-5.2.1*
bunzip2 jemalloc-5.2.1.tar.bz2
tar xvf jemalloc-5.2.1.tar
cd jemalloc-5.2.1
./configure --prefix=/usr > 2>
make -j4 > o.m 2> e.m
sudo make install

Install things needed to compile MySQL8 and MyRocks from FB MySQL 5.6. I didn't confirm that all of these packages exist. But I was able to build after doing this:
sudo yum install -y cmake3
sudo yum install -y
sudo yum install -y numactl-devel
sudo yum install -y libedit-devel
sudo yum install -y cmake gcc-c++ bzip2-devel libaio-devel bison 
sudo yum install -y zlib-devel snappy-devel
sudo yum install -y gflags-devel readline-devel ncurses-devel openssl-devel
sudo yum install -y lz4-devel gdb git libzstd-devel

Building MySQL8

Download, compile and install MySQL 8. It is installed at /media/ephemeral1/my8018. The upstream docs are useful.

cd /media/ephemeral1
rm -rf mysql-8.0.18*
tar xzvf mysql-boost-8.0.18.tar.gz
cd mysql-8.0.18
mkdir build
cd build
bash /media/ephemeral1/cmk80 /media/ephemeral1/my8018 > 2>
make -j8 V=1 VERBOSE=1 > 2>
make install

The contents of /media/ephemeral/cmk80.

cmake3 .. \
      -DBUILD_CONFIG=mysql_release \
      -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DWITH_SSL="system" \
      -DWITH_ZLIB="system" \
      -DMYSQL_DATADIR="${prefix}/data" \
      -DMYSQL_UNIX_ADDR="${prefix}/var/mysql.sock" \
      -DWITH_BOOST=$PWD/../boost \

Building MyRocks

Download, compile and install MyRocks from FB MySQL 5.6. It is installed at /media/ephemeral1/fbmy56. The upstream docs are useful.

First install boost 1.65 from source:

cd /media/ephemeral1
bunzip2 boost_1_65_1.tar.bz2
tar xvf boost_1_65_1.tar
cd boost_1_65_1
./ --prefix=/usr

./b2 > o.b2 2> e.b2
sudo ./b2 install

Then build MySQL 5.6

cd /media/ephemeral1
git clone fbmy56-src
cd fbmy56-src
# This is the old version that I used for a while
# git checkout 20aaaf8d

git submodule init
git submodule update
mkdir build
cd build
bash /media/ephemeral1/cmkfbmy56 /media/ephemeral1/fbmy56 > 2>
make -j8 V=1 VERBOSE=1 > 2>
make install

The contents of /media/ephemeral1/cmkfbmy56

if [ -z $1 ]; then
echo Requires prefix as arg1
exit -1


# Look at /proc/cpuinfo to determine whether these (sse, avx) are supported


# -march=native gets avx and sse if they are supported
CXXF="-DNDEBUG -march=native"
CXXF+=" -faligned-new"

# extra flags to avoid warnings with gcc on ubuntu 18
CF="-Wno-implicit-fallthrough -Wno-int-in-bool-context \
  -Wno-shift-negative-value -Wno-misleading-indentation \
  -Wno-format-overflow -Wno-nonnull -Wno-unused-function"

CXXF+=" -Wno-implicit-fallthrough -Wno-int-in-bool-context \
  -Wno-shift-negative-value -Wno-misleading-indentation \
  -Wno-format-overflow -Wno-nonnull -Wno-unused-function"

cmake3 .. \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
-DWITH_LZ4=system \

Not done yet

There is a crash in Boost code that was added in the FB MySQL branch. Boost 1.53 is installed by yum install boost-devel. I will rebuild MyRocks with Boost 1.65 as that is what Ubuntu 18.04 uses, and MyRocks works great for me there.


stack_log: _ZN5boost13property_tree11json_parser12json_grammarINS0_11basic_ptreeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_St4lessIS9_EEEE10definitionINS_6spirit7clas