Small Datum: Wanted: a file system on which InnoDB transparent page compression works

Friday, October 2, 2015

Wanted: a file system on which InnoDB transparent page compression works

I work on MyRocks, which is a MySQL storage engine with RocksDB. This is an alternative to InnoDB. It might be good news for MyRocks if transparent page compression is the future of InnoDB compression. I got more feedback from the MySQL team that despite the problems I have reported, transparent page compression works. I was just testing the wrong systems.

So I asked core developers from Btrfs whether it was OK to do punch hole per write and they politely told me to go away. Bcachefs might be a great way to add online compression to a b-tree without doing punch hole per write but it is not ready for production. Someone from MySQL suggested I try ext4, so I set up two servers with Ubuntu 14.04, which is on the list of supported systems. I used XFS on one and ext4 on the other.

XFS was still a problem and ext4 was just as bad. The problem is that the unlink() system call takes a ridiculous amount of time after a multi-GB file has been subject to many punch hole writes. By ridiculous I mean 50X or 100X longer. Maybe I am the only one using InnoDB file-per-table and measuring time to drop a table or database. Bad things happen when DROP takes 10+ minutes: InnoDB is unusable by other connections and an InnoDB background thread might kill mysqld because it thinks there is an internal deadlock.
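A minimal sketch of how the unlink() cost could be measured outside of MySQL, assuming 64-bit Linux and a file system that accepts fallocate() with FALLOC_FL_PUNCH_HOLE. It is not the test used for this post: the helper names, the 16 KB page size, and the 4 KB hole per page are illustrative assumptions, and a real reproduction would need a multi-GB file rather than the small default here.

```python
import ctypes
import os
import tempfile
import time

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02


def _load_fallocate():
    """Return the libc fallocate() function, or None if it is unavailable."""
    try:
        fn = ctypes.CDLL(None, use_errno=True).fallocate
    except (OSError, AttributeError, TypeError):
        return None
    # Assumes a 64-bit off_t, which holds on 64-bit Linux.
    fn.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_int64, ctypes.c_int64]
    fn.restype = ctypes.c_int
    return fn


_fallocate = _load_fallocate()


def punch_hole(fd, offset, length):
    """Try to deallocate [offset, offset + length). Return True on success."""
    if _fallocate is None:
        return False
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    return _fallocate(fd, mode, offset, length) == 0


def time_unlink_after_punch(path, pages=256, page_size=16384, hole=4096):
    """Write `pages` pages, punch a hole at the tail of each, then time unlink().

    Returns (number of holes punched, seconds spent in unlink).
    """
    punched = 0
    fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o600)
    try:
        for i in range(pages):
            os.pwrite(fd, b"x" * page_size, i * page_size)
            if punch_hole(fd, (i + 1) * page_size - hole, hole):
                punched += 1
        os.fsync(fd)
    finally:
        os.close(fd)
    start = time.perf_counter()
    os.unlink(path)
    return punched, time.perf_counter() - start


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "scratch.ibd")
    holes, seconds = time_unlink_after_punch(target)
    print(f"punched {holes} holes, unlink took {seconds:.6f}s")
```

Running it once with `hole=0` replaced by a variant that skips `punch_hole` would give the baseline to compare against. If the file system rejects punch hole, `punched` stays at 0 and the timing is just that of an ordinary unlink.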

Did I mention this was good news for MyRocks? If you want compression then we get 2X better compression compared to compressed InnoDB. We also made sure that DROP TABLE and DROP INDEX are fast.

I updated bug 78277 for this problem. I am the only person updating that bug. I also found a corruption bug with transparent page compression, bug 78672. Past experience with transparent page compression is described here, here and here.

The end to my testing day was perfect. I rebooted the host to get back memory from XFS metadata allocations and the filesystem came back corrupt. Being a pessimist I was using a scratch filesystem for this, so I didn't lose my Ubuntu install.

15 comments:

Unknown, October 2, 2015 at 8:31 PM
Your experience on XFS and EXT4 with drops taking 10+ minutes is not unusual. To get around this, just create hardlinks to the ibd files before dropping. Then remove the hardlinks outside of MySQL during off hours.
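The hardlink workaround described in this comment could be sketched roughly as follows. This is an illustration only: `park_ibd`, `purge_parked`, and the "parked" directory are hypothetical names, and the directory must be on the same file system as the data directory because hardlinks cannot cross file systems. With an extra link in place, the unlink() issued by DROP TABLE only decrements the link count, and the expensive block deallocation is deferred until the last link is removed.

```python
import os


def park_ibd(ibd_path, parked_dir):
    """Create an extra hardlink to an .ibd file before DROP TABLE is run."""
    os.makedirs(parked_dir, exist_ok=True)
    parked = os.path.join(parked_dir, os.path.basename(ibd_path))
    os.link(ibd_path, parked)  # fails if parked_dir is on another file system
    return parked


def purge_parked(parked_dir):
    """Remove parked files whose original name is already gone.

    A link count of 1 means the parked name is the only one left,
    so the table has been dropped and the file can be deleted.
    """
    removed = []
    for name in sorted(os.listdir(parked_dir)):
        path = os.path.join(parked_dir, name)
        if os.stat(path).st_nlink == 1:
            os.unlink(path)  # the slow deallocation happens here, off hours
            removed.append(name)
    return removed
```

`park_ibd` would be called just before the DROP, and `purge_parked` from a scheduled job during a quiet period.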
Alex, October 3, 2015 at 7:38 PM
Mark --- did you move your blog to a different theme? The fonts all got a lot smaller and harder to read~
Unknown, October 7, 2015 at 9:07 AM
I think ZFS supports hole punching, on top of other niceties like compression, block level checksums and making InnoDB's double write irrelevant (yay performance!). Have you given it a try?
Anonymous, October 17, 2015 at 11:56 PM
I have done testing on this feature, mostly using NVMFS and a FusionIO NVM device, but my tests did not include actual DROP TABLE speed.
Marko Mäkelä, July 30, 2021 at 6:05 AM
Mark, I may be missing something, but I just conducted a small experiment where I created and populated a page_compressed table in MariaDB Server 10.6 after the Sysbench prepare step. I then started the Sysbench oltp_update_index benchmark and executed DROP TABLE while the test was running. I confirmed with "ls -ls" that the file had been compressed, but I did not run anything special to ensure that it was heavily fragmented. The DROP TABLE executed virtually instantly, and I was not able to observe any degradation of throughput.

Several changes could have improved this during the past years. Thanks to MDEV-8069, MariaDB no longer holds any mutexes while the file is being delete-on-close'd. Thanks to MDEV-21069, DROP TABLE will avoid modifying any pages in the *.ibd file. Thanks to MDEV-23855, MariaDB uses asynchronous I/O for page writes, also for the doublewrite buffer.

Linux has improved too. For example, I understood that ext4fs supports FUA mode, which was previously only supported by xfs. If there was a file system scalability bug that would cause the deletion of a large fragmented file to interfere with writes to other files, maybe that bug has been fixed by now. I used Linux 5.10.0. The server was configured with innodb_flush_log_at_trx_commit=1, and the data directory was located on an ext4 file system on an Intel Optane 960 NVMe PCIe card.

In related news, we recently ran some benchmarks on a thinly provisioned ScaleFlux computational storage device. In our tests, storing uncompressed tables on a thinly provisioned device was not only fastest, but also resulted in the best compression. The thinly provisioned device would "misrepresent" itself to the file system as larger than its real capacity and then compress the data by itself.
Marko Mäkelä, August 2, 2021 at 7:56 AM
Mark, thank you. Yes, while page compression at the file system interface makes things much simpler than the ‘opaque’ ROW_FORMAT=COMPRESSED that I started to implement in 2005 (with intrusive changes to several subsystems), it does make things complex for the file system. As we all know, complexity often correlates with bugs.

I hope that smart storage will become mainstream in the not too distant future, so that the complexity can stay away from both the database and the operating system kernel.

The first rule of database fight club: admit nothing