There are at least three ways to do compression for InnoDB - classic, holepunch and outsource.
The classic approach (table compression) was used and enhanced by the FB MySQL team. It might not have been widely used elsewhere. While it works, even for read/write workloads, the implementation is complicated and it isn't clear that it has a bright future.
The holepunch approach (page compression) is much simpler than the classic approach. Alas, I am skeptical that a filesystem will be happy tracking the metadata from doing a holepunch for every page (or most pages) written. I am also skeptical that unlink() response times of seconds to minutes (a consequence of the holepunch usage) will be acceptable for a production DBMS. I wrote a few posts about my experience with the holepunch approach: here, here, here and here.
The outsource approach is the simplest from the perspective of InnoDB - let the filesystem or storage do the compression for you. In this case InnoDB continues to do filesystem reads and writes as if pages have a fixed size, and the FS/storage compresses prior to writing to storage, decompresses after reading from storage and does all of the magic to make this work. While there are log-structured filesystems in OSS that might make this possible, such filesystems aren't widely used relative to XFS and the EXT family. There is also at least one storage device on the market that supports this.
tl;dr - the outsource approach is useful when the data is sufficiently compressible and the cost of this approach (more write-amp) is greatly reduced when it provides atomic page writes.
After publishing I noticed this interesting Twitter thread on support for atomic writes.
Performance model
I have a simple performance model to understand when the outsource approach will work. As always, the performance model makes assumptions that can be incorrect. Regardless, the model is a good start when comparing the space and write amplification of the outsource approach relative to uncompressed InnoDB.
Assumptions:
- The log-structured filesystem has 1+ log segments open for writing. Compressed (variable length) pages are written to the end of an open segment. Such a write can create garbage - the previous version of the page stored elsewhere in a log segment. Garbage collection (GC) copies live data from previously written log segments into open log segments to reclaim space from old log segments that have too much garbage.
- The garbage in log segments is a source of space amplification. Garbage collection is a source of write amplification.
- g - represents the average percentage of garbage (free space) in a previously written log segment.
- r - represents the compression rate. With r=0.5, a 16kb page is compressed to 8kb.
- Write and space amplification are functions of g:
- write-amp = 100 / g
- space-amp = 100 / (100 - g)
- Assumes that g is constant across log segments. A better perf model would allow for variance.
- Assumes that r is constant across pages. A better perf model might allow for variance.
- Estimates of write and space amplification might be more complicated than the formula above.
- s is the space-amp ratio where s = r * 100 / (100 - g).
- w1 is the write-amp ratio assuming the doublewrite buffer is enabled where w1 = r * 100/g.
- w2 is the write-amp ratio assuming the doublewrite buffer is disabled. This assumes that outsource provides atomic page writes for free. The formula is w2 = r/2 * 100/g.
On the subject of atomic writes.
One problem is that if the SSD hardware sector size is 4K and the DB page size is 8K/16K you don't get atomic writes. However, I've seen something in an SSD spec long ago that queried the 'atomic' write size, as if a device COULD support an atomic size > sector size. Generally it is one sector, but Huawei markets their solution to the torn page problem using an SSD which supports larger atomic writes. They do this for MySQL to eliminate the need for the double-write mechanism.
What is less clear is how they deal with the OS. Dirty memory pages are not guaranteed to be merged when written. And if this idea required directio then even an 8k(PG) or 16k(MySQL) write is NOT guaranteed to be done as a single bus write. Perhaps they have a driver to deal with this but they don't mention it.
Hi Mark - Dan Wood hexexpert@comcast.net
https://support.huaweicloud.com/intl/en-us/twp-kunpengdbs/kunpengdbs_19_0011.html
Dan - It is great to hear from you again.
Thanks for pointing out that I neglected to mention the plumbing complexity of atomic write support. There has to be a channel from the app down to the block layer to inform it of page boundaries.