Friday, October 1, 2021

The other way to compress InnoDB: outsource it

There are at least three ways to do compression for InnoDB - classic, holepunch and outsource. 

The classic approach (table compression) was used and enhanced by the FB MySQL team. It might not have been widely used elsewhere. While it works, even for read/write workloads, the implementation is complicated and it isn't clear that it has a bright future.

The holepunch approach (page compression) is much simpler than the classic approach. Alas, I am skeptical that a filesystem will be happy tracking the metadata from doing a holepunch for every (or most) pages written. I am also skeptical that unlink() response times of seconds to minutes (because of the holepunch usage) will be good for a production DBMS. I wrote a few posts about my experience with the holepunch approach: here, here, here and here.

The outsource approach is the simplest from the perspective of InnoDB - let the filesystem or storage do the compression for you. In this case InnoDB continues to do filesystem reads and writes as if pages have a fixed size, and the FS/storage compresses prior to writing to storage, decompresses after reading from storage and does all of the magic to make this work. While there are log-structured filesystems in OSS that might make this possible, such filesystems aren't widely used relative to XFS and the EXT family. There is also at least one storage device on the market that supports this.
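
To make the outsource idea concrete, here is a toy Python sketch (mine, not any real filesystem or device API) of a write path that compresses fixed-size pages into an open log segment:

import zlib

PAGE_SIZE = 16 * 1024  # InnoDB keeps reading and writing fixed-size pages

class ToyCompressingFS:
    # Toy model of a log-structured filesystem that compresses pages.
    def __init__(self):
        self.segment = bytearray()  # one open log segment
        self.index = {}             # page_no -> (offset, length) of the live copy

    def write_page(self, page_no, page):
        # InnoDB hands over a full-size page; the FS stores a compressed,
        # variable-length copy at the end of the open segment.
        assert len(page) == PAGE_SIZE
        compressed = zlib.compress(page)
        self.index[page_no] = (len(self.segment), len(compressed))
        self.segment += compressed
        # Any previous copy of page_no elsewhere in a segment is now garbage
        # that a GC pass would have to reclaim.

    def read_page(self, page_no):
        # Decompress back to the fixed page size that InnoDB expects.
        offset, length = self.index[page_no]
        return zlib.decompress(bytes(self.segment[offset:offset + length]))

Writing the same page twice leaves the first compressed copy behind as garbage, which is the g in the performance model below.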

tl;dr - the outsource approach is useful when the data is sufficiently compressible and the cost of this approach (more write-amp) is greatly reduced when it provides atomic page writes.

After publishing, I noticed this interesting Twitter thread on support for atomic writes.

Performance model

I have a simple performance model to understand when the outsource approach will work. As always, the performance model makes assumptions that can be incorrect. Regardless, the model is a good start when comparing the space and write amplification of the outsource approach relative to uncompressed InnoDB.

Assumptions:

  • The log-structured filesystem has 1+ log segments open for writing. Compressed (variable length) pages are written to the end of an open segment. Such a write can create garbage - the previous version of the page, stored elsewhere in a log segment, becomes garbage. Garbage collection (GC) copies live data from previously written log segments into open log segments to reclaim space from old log segments that have too much garbage.
  • The garbage in log segments is a source of space amplification. Garbage collection is a source of write amplification.
  • g - represents the average percentage of garbage (free space) in a previously written log segment. 
  • r - represents the compression rate. With r=0.5, a 16kb page is compressed to 8kb.
  • Write and space amplification are functions of g:
    • write-amp = 100 / g
    • space-amp = 100 / (100 - g)
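
Here is a small Python sketch (mine, not from the post) that prints write-amp and space-amp for a few values of g:

def write_amp(g):
    # Writing 1 unit of new data eventually requires GC to copy (100 - g) / g
    # units of live data to reclaim space, so total writes are 1 + (100 - g) / g = 100 / g.
    return 100.0 / g

def space_amp(g):
    # Only (100 - g)% of a previously written log segment holds live data.
    return 100.0 / (100.0 - g)

for g in (20, 30, 40):
    print("g=%d: write-amp=%.2f space-amp=%.2f" % (g, write_amp(g), space_amp(g)))
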
Risks in the assumptions:
  • Assumes that g is constant across log segments. A better perf model would allow for variance.
  • Assumes that r is constant across pages. A better perf model might allow for variance.
  • Estimates of write and space amplification might be more complicated than the formulas above.
Results

Now I estimate the space-amp and write-amp for outsource relative to uncompressed InnoDB. The ratios are (value for outsource / value for uncompressed InnoDB). For space-amp, a ratio < 1 means outsource uses less space than uncompressed InnoDB. For write-amp, a ratio > 1 means outsource writes more to storage than uncompressed InnoDB.

I show below that when the compression rate (r above) is < 0.6, outsource provides much less space-amp without suffering a too-large increase in write-amp. But when r is >= 0.6, the increase in write-amp relative to uncompressed InnoDB might be a problem.

However, whether a large increase in write-amp is a problem depends on your workload. For example, if 99% of storage IOPs are reads and 1% are writes with uncompressed InnoDB then a write-amp penalty that changes this from 1% writes to 2% writes is unlikely to be a problem.

I created a graph with Desmos to show the three ratios. The graph allows r to be adjusted. I then copied some values from the graph into the tables below. The graph has 3 curves, one for each ratio:
  • s is the space-amp ratio where s = r * 100 / (100 - g).
  • w1 is the write-amp ratio assuming the doublewrite buffer is enabled where w1 = r * 100/g.
  • w2 is the write-amp ratio assuming the doublewrite buffer is disabled. This assumes that outsource provides atomic page writes for free, which lets InnoDB disable the doublewrite buffer and halves the write-amp relative to w1. The formula is w2 = r/2 * 100/g.
The table below has values from the graph for r = 0.4, 0.5 and 0.6. What I see in the graph is that with r in (0.4, 0.5) and the doublewrite buffer disabled (w2) it is possible to get much of the compression benefit (see s) from outsource without a significant increase in write-amp. But the write-amp penalty can be a problem when r >= 0.6. Of course, whether or not more write-amp is an issue depends on the storage read and write rates as I explained above. 

r=0.4
g       s       w1      w2
20      0.50    2       1
30      0.57    1.33    0.67
40      0.67    1       0.5

r=0.5
g       s       w1      w2
20      0.63    2.5     1.25
30      0.71    1.67    0.83
40      0.83    1.25    0.63

r=0.6
g       s       w1      w2
20      0.75    3       1.5
30      0.86    2       1
40      1       1.5     0.75
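
The values in the tables above can be reproduced with another short Python sketch (again mine, not the Desmos graph), up to rounding in the last digit:

# Ratios of (outsource / uncompressed InnoDB), as defined above:
#   s  - space-amp ratio
#   w1 - write-amp ratio, doublewrite buffer enabled
#   w2 - write-amp ratio, doublewrite buffer disabled (atomic page writes)

def ratios(r, g):
    s = r * 100 / (100 - g)
    w1 = r * 100 / g
    w2 = (r / 2) * 100 / g
    return s, w1, w2

for r in (0.4, 0.5, 0.6):
    print("r=%.1f" % r)
    print("g       s       w1      w2")
    for g in (20, 30, 40):
        s, w1, w2 = ratios(r, g)
        print("%-7d %-7.2f %-7.2f %.2f" % (g, s, w1, w2))
    print()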


2 comments:

  1. On the subject of atomic writes.
    One problem is that if the SSD hardware sector size is 4K and the DB page size is 8K/16k you don't get atomic writes. However, I've seen something in an ssd spec long ago that queried the 'atomic' write size as if a device COULD support an atomic size > sector size. Generally it is one sector but Huawei markets their solution to the torn page problem using an SSD which supports larger atomic writes. They do this for MySQL to eliminate the need for the double-write mechanism.

    What is less clear is how they deal with the OS. Dirty memory pages are not guaranteed to be merged when written. And if this idea required directio then even an 8k(PG) or 16k(MySQL) write is NOT guaranteed to be done as a single bus write. Perhaps they have a driver to deal with this but they don't mention it.

    Hi Mark - Dan Wood hexexpert@comcast.net

    https://support.huaweicloud.com/intl/en-us/twp-kunpengdbs/kunpengdbs_19_0011.html

    Replies
    1. Dan - It is great to hear from you again.

      Thanks for pointing out that I neglected to mention the plumbing complexity of atomic write support. There has to be a channel from the app down to the block layer to inform it of page boundaries.

