Wednesday, October 13, 2021

Compatible with MySQL or Postgres?

Open and closed source scale-out DBMS that are compatible with MySQL and Postgres have emerged on the market. This is great for the community but there will be much confusion about the meaning of compatible

This post has yet to have anything on the cloud vendors in China. They are doing impressive work but I don't know enough about it. I am happy to update this post when I learn more.

But first, the MySQL and Postgres teams.

One way to describe compatibility is via three levels: protocol, syntax and semantics. By upstream I mean MySQL and PostgreSQL.

  • protocol - an app can use existing client libraries to authenticate and connect 
  • syntax - the DBMS will parse SQL that upstream parses. It is not guaranteed to provide semantics that matches upstream and some syntax can be (or might be) parsed but ignored. Syntax compatible implies a best effort to match upstream semantics but that isn't guaranteed nor must it be guaranteed to be useful. 
  • semantics - the DBMS will match upstream semantics. This implies syntax compatible.
Note that syntax and semantics compatibility aren't all or nothing. A syntax compatible DBMS can be useful without supporting (parsing) 100% of the upstream syntax. A semantics compatible DBMS can be useful without supporting or matching behavior for 100% of upstream syntax.

Also note that semantics compatible implies syntax compatible. But protocol compatible implies neither.

Elsewhere in DBMS land

While this post is about MySQL and PostgreSQL, compatibility is growing in popularity elsewhere:
  • MariaDB provides an Oracle compatible mode that provides syntax but not protocol or semantics compatibility. 
  • EnterpriseDB provides an Oracle compatibility product that I don't know much about. 
  • Amazon will soon open source Babelfish that is protocol compatible with SQL Server.
  • Amazon DocumentDB is protocol, syntax and semantics compatible with (an older version of) MongoDB. It supports some (much) of the MongoDB 4.0 API as of October, 2021 per Wikipedia. Public statements suggest this was built on top of, or reusing, the Aurora PostgreSQL code.
  • added TimescaleDB to Team Postgres
  • added Redshift to Team Postgres
  • added SingleStore to Team MySQL
  • added ClickHouse to Team MySQL
  • added CrateDB to Team Postgres
  • added Dolt to Team MySQL
  • added Team MongoDB
  • added YellowBrick and Greenplum to Team Postgres
  • added Materialize to Team Postgres
  • Building a DBMS to be compatible with MySQL costs more than for Postgres. The Team Postgres projects can reuse BSD licensed PG code while the Team MySQL projects would have to respect the GPL.
  • I have a vague memory of this but to be JDBC compliant the MySQL JDBC driver does a few queries at connect time (either via tables or session/global variables). My JDBC-related bugs are here.

Friday, October 1, 2021

The other way to compress InnoDB: outsource it

There are at least three ways to do compression for InnoDB - classic, holepunch and outsource. 

The classic approach (table compression) was used and enhanced by the FB MySQL team. It might not have been widely used elsewhere. While it works, even for read/write workloads, the implementation is complicated and it isn't clear that it has a bright future.

The holepunch approach (page compression) is much simpler than the classic approach. Alas, I am skeptical that a filesystem will be happy tracking the metadata from doing a holepunch for every (or most) pages written. I am also skeptical that unlink() response times of seconds to minutes (because of the holepunch usage) will be good for a production DBMS. I wrote a few posts about my experience with the holepunch approach: here, here, here and here.

The outsource approach is the most simple from the perspective of InnoDB - let the filesystem or storage do the compression for you. In this case InnoDB continues to do filesystem reads and writes as if pages have a fixed size and the FS/storage compresses prior to writing to storage, decompresses after reading form storage and does all of the magic to make this work. While there are log-structured filesystems in OSS that might make this possible, such filesystems aren't widely used relative to XFS and the EXT family. There is also at least one storage device on the market that supports this.

tl;dr - the outsource approach is useful when the data is sufficiently compressible and the cost of this approach (more write-amp) is greatly reduced when it provides atomic page writes.

After publishing I noticed this interesting Twitter thread on support for atomic writes.

Performance model

I have a simple performance model to understand when the outsource approach will work. As always, the performance model makes assumptions that can be incorrect. Regardless the model is a good start when comparing the space and write amplification with the outsource approach relative to uncompressed InnoDB.


  • The log-structured filesystem has 1+ log segments open for writing. Compressed (variable length) pages are written to the end of an open segment. Such a write can create garbage - the previous version of the page stored elsewhere in a log segment. Garbage collection (GC) copies live data from previously written log segments into open log segments to reclaim space from old log segments that have too much garbage. 
  • The garbage in log segments is a source of space amplification. Garbage collection is a source of write amplification.
  • g - represents the average percentage of garbage (free space) in a previously written log segment. 
  • r - represents the compression rate. With r=0.5 then a 16kb page is compressed to 8kb.
  • Write and space amplification are functions of g:
    • write-amp = 100 / g
    • space-amp = 100 / (100 - g)
Risks in the assumptions:
  • Assumes that g is constant across log segments. A better perf model would allow for variance.
  • Assumes that r is constant across pages. A better perf model might allow for variance.
  • Estimates of write and space amplification might be more complicated than the formula above.

Now I estimate the space-amp and write-amp for outsource relative to uncompressed InnoDB. The ratios are (value for outsource / value for uncompressed InnoDB). For space-amp when the ratio is < 1 then outsource uses less space vs uncompressed InnoDB. For write-amp when the ratio is > 1 then outsource writes more to storage vs uncompressed InnoDB. 

I show below that when the compression rate (r above) is < 0.6 then outsource provides much less space-amp without suffering from a too-large increase in write-amp. But when r is >= 0.6 the increase in write-amp, relative to uncompressed InnoDB, might be a problem.

However, whether a large increase in write-amp is a problem depends on your workload. For example, if 99% of storage IOPs are reads and 1% are writes with uncompressed InnoDB then a write-amp penalty that changes this from 1% writes to 2% writes is unlikely to be a problem.

I created a graph with Desmos to show the three ratios. The graph allows r to be adjusted. I then copied some values from the graph into a table below. The graph has 3 curves, one for each ratio:
  • s is the space-amp ratio where s = r * 100 / (100 - g).
  • w1 is the write-amp ratio assuming the doublewrite buffer is enabled where w1 = r * 100/g.
  • w2 is the write-amp ratio assuming the doublewrite buffer is disabled. This assumes that outsource provides atomic page writes for free. The formula is w2 = r/2 * 100/g.
The table below has values from the graph for r = 0.4, 0.5 and 0.6. What I see in the graph is that with r in (0.4, 0.5) and the doublewrite buffer disabled (w2) it is possible to get much of the compression benefit (see s) from outsource without a significant increase in write-amp. But the write-amp penalty can be a problem when r >= 0.6. Of course, whether or not more write-amp is an issue depends on the storage read and write rates as I explained above. 

g       s       w1      w2
20      0.50    2       1
30      0.57    1.33    0.67
40      0.67    1       0.5

g       s       w1      w2
20      0.63    2.5     1.25
30      0.71    1.67    0.83
40      0.83    1.25    0.63

g       s       w1      w2
20      0.75    3       1.5
30      0.86    2       1
40      1       1.5     0.75

Wednesday, September 29, 2021

Automating memory management in a DBMS

I recently read two papers on automating memory management - one for DB2, one for Oracle. While the papers aren't new (~2005), they are definitely worth reading. AFAIK both made it into the products, I have no idea how that turned out but the outcome doesn't change my opinion that the papers were excellent. Disclaimer - I worked on query execution for Oracle at the time and made a few changes to the row sources that I maintained to support the work described by the Oracle paper.

Both describe methods to automate memory management with the goal of improving performance. In many DBMS memory management is difficult for several reasons:

  • It isn't global and the DBA must estimate good values for the sizes of various caches and other large memory consumers: sort and hash for order by, aggregation, join and index create.
  • It isn't bounded when allocations are specified per instance of sort or join rather than one limit on the memory used by all sorts or joins. Too many concurrent queries == OOM in this case.
  • It is static. This is an issue for large caches like the buffer pool (block cache).
There was a marketing battle between Oracle and IBM in the mid-2000s over autonomic computing and these papers, along with getting them into products, is one outcome from that battle.

From the DB2 paper

A summary:
  1. Differentiable functions were created to explain the query latency saved per page of memory for a variety of consumers. All functions had the same form (see section 3.2) with graphs that look like this. I think that time saved means wall clock time to account for CPU and IO overheads.
  2. For a few caches (statement and buffer pool) online simulators were added to the DBMS to estimate changes to the hit rate if the cache were given more memory. 
  3. Constrained optimization was used to determine the optimal allocation at any point in time. The constraint was the amount of memory available. I assume that Lagrange multipliers were used.
  4. Feedback control was used to apply the desired memory configuration. One benefit from this is to avoid negative impacts from suddenly applying significant changes.
  5. Decisions are revisited because workloads (types of queries, concurrency) changes.
  • At least for sort, the real curve for time vs memory can't be described by a differentiable function. See page 2 in the Oracle paper: there are three intervals where each interval can be explained by a function but the points where the intervals meet are a problem. I am not sure whether constrained optimization can handle that.
  • At a higher level, the paper doesn't consider the steps in the time vs memory benefits for some row sources (this is a kind of a repeat of the previous comment).
  • It wasn't clear whether row sources could give back memory. There is a risk from giving a long-running sort or hash a lot of memory. If there is no facility to get back some of that memory then memory allocation will be far from optimal until that query completes or is killed.
From the Oracle paper

While Oracle has automatic memory management for caches (plan cache, buffer pool) and queries (hash, sort, bitmap index operators) the paper is limited to memory management for queries (PGA in Oracle terminology).

  1. for each instance of a row source (a query is composed of multiple row sources, some row sources can use a lot of memory for sort and hash) the knees in the response time vs memory graph are estimated, see section 3.2 in the paper. These knees are a function of the row source and the data specific to a query (query A might be able to sort in-memory with 100M of RAM while query B might require 1G to remain in memory).
  2. decisions are made about the amount of memory that each row source can used based on the information from the previous point
  3. over time a row source might be able to get more memory and might be told to give back memory. Giving back memory might not be immediate, but should eventually be done.
  • the paper was vague about how point #2 was done. While section 4.2.2 lists 5 rules used to guide the decisions it wasn't clear there was a goal beyond making sure all queries have enough memory to run. In contrast the IBM paper was clear that constrained optimization was used and explains the function that was optimized.

Tuesday, September 28, 2021

Developer experience, what about the other stakeholders?

While developer experience gets a lot of press, there are four stakeholders when you provide a database service. So we can call these DX, UX, MX and OX for developer, user, management and operations experience:

  • developers want the DBMS to stay out of their way. Schema is one example because waiting for a schema change gets in the way. Note that NoSQL databases have less schema rather than being schema-less because indexes are schema. I am curious whether less schema leads to a great developer experience in the long run for large scale projects given the risk of an unmanaged schema and poorly understood data. The developers who were able to move fast early in the project can create much tech debt for those who arrive years later.
  • users of the services that depend on the database want great QoS - high uptime, low and predictable latency for queries, low chance of lost data. They just want to use your database-backed app without problems.
  • management wants to minimize cost while getting great QoS.
  • operations wants to be able to sleep when they are oncall (self-healing database, auto failover, etc). It helps if the database isn't in chaos mode during working hours as that gives them freedom to get their work done.

Tuesday, September 21, 2021

Review of DiffKV

This is a review of Differentiated Key-Value Storage Management for Balanced I/O Performance that was published in ATC 2021. This is a wonderful paper that resolves several of my concerns about key-value separation (here and here). The authors have experience with key-value separation via Titan which is part of TiDB.

The key idea in the paper is to use leveled compaction for keys and tiered compaction for values. Write-amplification is reduced by not using leveled compaction for values. Read-amplification is reduced, relative to classic key-value separation (see WiscKey), by using tiered compaction for values rather than logs. With classic key-value separation the worst case read-amp for a scan is the need to do a random read from the log for every qualifying key as that random read can include a block decompression and a storage IO. With DiffKV the use of tiered compaction for values provides some ordering to avoid that worst case.

While I enjoyed the paper, it didn't have performance results with compression enabled and I am curious about the impact of decompression overhead on point and range read latency.

A summary of the approach:

  • small values are stored inline in the LSM tree
  • large values are stored in logs, called vLogs. The paper explains a few improvements to make GC faster in this case. But mostly this is classic key-value separation.
  • medium values use the vTree (tiered compaction for values)
  • the user defines the two size limits that determines whether a value is small, medium or large
The vTree

Medium values are stored in the vTree which resembles an LSM that uses tiered compaction. A vTable is the unit of storage in the vTree. A vTable is 8MB by default and stores KV pairs in key order. A sorted group is a sequence vTables for which the keys don't overlap -- (vTable, sorted group) are similar to an (SST, sorted run) in RocksDB. A vTree has multiple levels where each level has one or more sorted groups and the levels have exponentially increasing size.

This is similar to tiered compaction for two reasons. First, there are multiple sorted groups per level. Second, when merges are done to move values from level N to N+1, the merge process only reads data from level N and then writes it to level N+1. It doesn't read and rewrite data already on level N+1. 

To reduce the number of times that values get rewritten the vTree has fewer levels than the LSM tree that stores keys. The smallest N levels in the key's LSM tree share the smallest vTree level. I assume the remaining levels in the key's LSM tree each have their own level in the vTree.

vTree GC

Keys in the LSM tree point into the vTree. When a value is moved between levels in the vTree the key's entry in the LSM tree must be updated to point to the new location in the vTree. DiffKV couples GC in the vTree with GC in the key's LSM tree to avoid extra writes. By coupling I mean that vTree GC is extra work added to compaction done on the key's LSM tree. When a key is moved from level N to level N+1 in the LSM tree then its value might be moved to the next level in the vTree.

By coupling it like this, DiffKV avoids the need to probe the index (key's LSM tree) to determine whether a value is live -- which is an extra overhead that occurs with classic key-value separation. It also avoids generating extra writes to change the point into the vTree that is stored with the key.

This above is called compaction-triggered merge and driven by compaction of the key's LSM tree. Another reason for doing vTree GC is called scan-triggered and triggered by scans that encounter regions of the vTree that need GC.

I am curious whether there is a need to also trigger GC based on vTables that have excessive space-amplification.

vTree rewrites

Leveled compaction has more write-amp than tiered because it rewrites previously written KV pairs. While vTree GC frequently avoids the need to do rewrites, there is one case where a rewrite is needed and I didn't learn enough about how DiffKV handles this.

There can be space wasted from vTables that have low utilization rates because most of the values in the vTable have been deleted. In this case something should be done to copy out (rewrite) the values to a new vTable. And when values are moved the key LSM tree must be updated to reference their new location. For workloads with updates and deletes this will be needed in the largest level of the vTree and might be needed for smaller levels.

Saturday, August 14, 2021

Happy 15th to Percona

I am thrilled to be celebrate Percona's 15th anniversary. My time with MySQL began about the same time as the founding of Percona. Those years, the mid-2000s, were the dark ages for MySQL. There was doubt about the future of MySQL because there were many things that needed to be made better. Fortunately, upstream and the community, with much help from Percona, rallied to fix the problems and add needed features.

Percona has been a huge part of the community. I don't have time to list everything, but examples include saving the user conference, providing an open-source hot backup solution, educating many of us (including me) via their blog posts and helping push the product to get so much better. I have also been a customer as they helped with the MyRocks effort and in the great patch migration (5.6 to 8.0).

I like that the anniversary book starts by mentioning the multi-core scaling problem that InnoDB had in the mid-2000s. My MySQL deployment used 4-core, 1-socket CPUs at the time and 8-core, 2-socket servers were about to arrive. It was difficult to get more than 10k QPS from MySQL/InnoDB on such hardware regardless of the number of cores or sockets. InnoDB didn't benefit from 2-socket servers because of contention in the custom InnoDB mutex (which was also used to guard state in the custom InnoDB rw-lock). A fix was first proposed by Yasufumi Kinoshita, who would soon join Percona. After seeing his presentation at the user conference, my team at Google proposed a similar solution which was accepted by the InnoDB team and a serious problem was resolved.

I also like that the book is full of stories from people in the community. I know many of these people from my time traveling to conferences. I am an introvert except when at conferences. I rarely observe talks because I am out in the hallway track talking to others.

Wednesday, August 11, 2021

On storage engines

I wish it were easier to implement new storage engines for MySQL, Postgres and MongoDB and other OSS databases. There is so much innovation that we miss out on - FASTER is one example. All (MySQL, MongoDB, Postgres) have storage engine APIs but there are not many OSS implementations of them.

MyRocks, MySQL Aurora and MySQL HeatWave are examples of the benefits. But they also show that it helps to have the backing of a well-funded company because this is a huge undertaking.

The API for Postgres is the most recent and perhaps that will be the most popular. The API for MySQL is the oldest and has become harder as more requirements (like partitioning) are pushed into it. The MongoDB API is friendlier than the MySQL API, but MongoRocks was deprecated when the API was enhanced to support transactions and RocksDB wasn't able to support user-provided commit timestamps.

One problem is naming. MongoDB and WiredTiger are databases, but WiredTiger is also a component of MongoDB. To avoid confusion I will use storage engine for things like WiredTiger, RocksDB and InnoDB and database for the systems that provide query languages, replication and more.

I have been involved in three such engines -- MyRocks, MongoRocks and a read-only MySQL engine used long-ago at Google. MyRocks benefited from a large team and Sergey Petrunya at MariaDB and MongoRocks benefited from amazing support from MongoDB the company.

Two interesting things are:

  • The impact of database design decisions on the storage engine API. See what is needed for transactions in the MongoDB API.
  • The impact of the original storage engine on the storage engine API.
    • MongoDB doesn't take advantage of clustered indexes in WiredTiger or RocksDB. An extra (hidden) index must be used to map between DiskLoc and the PK index. See SERVER-14569.
    • I have more research to do before I understand the impact of vacuum and 32-bit transactions IDs on the Postgres API.

I also wonder whether the RocksDB API could be the universal storage engine API. In theory any storage engine implementing that would be able to replace RocksDB in MyRocks or MongoRocks. The goal is then to implement the RocksDB API glue once per database and then be able to use a variety of storage engines based on that. Of course, XKCD has a great take on standards and this might just be naive and wishful thinking on my part.