Friday, October 2, 2015

Wanted: a file system on which InnoDB transparent page compression works

I work on MyRocks, a MySQL storage engine built on RocksDB and an alternative to InnoDB. It might be good news for MyRocks if transparent page compression is the future of InnoDB compression. I got more feedback from the MySQL team that, despite the problems I have reported, transparent page compression works. I was just testing the wrong systems.

So I asked core developers from Btrfs whether it was OK to do punch hole per write and they politely told me to go away. Bcachefs might be a great way to add online compression to a b-tree without doing punch hole per write, but it is not ready for production. Someone from MySQL suggested I try ext4, so I set up two servers with Ubuntu 14.04, which is on the list of supported systems. I used XFS on one and ext4 on the other.
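For readers who have not seen it, "punch hole per write" can be sketched roughly as follows: compress the page, write it into its fixed-size slot, then ask the file system to free the unused tail of the slot. This is a minimal illustration and not InnoDB's code or on-disk format. It assumes 64-bit Linux with glibc; `write_page`, `punch_hole`, the 4096-byte file system block size and the file name `t.ibd` are hypothetical choices made for the example, and zlib stands in for whatever compressor is configured.

```python
import ctypes
import ctypes.util
import os
import tempfile
import zlib

FALLOC_FL_KEEP_SIZE = 0x01   # value from <linux/falloc.h>
FALLOC_FL_PUNCH_HOLE = 0x02  # value from <linux/falloc.h>
PAGE_SIZE = 16 * 1024        # InnoDB default page size
FS_BLOCK = 4096              # assumed file system block size

# Assumption: glibc on 64-bit Linux, where off_t is 64 bits wide.
_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)


def punch_hole(fd, offset, length):
    """Free the blocks behind [offset, offset + length); file size is unchanged."""
    rc = _libc.fallocate(
        ctypes.c_int(fd),
        ctypes.c_int(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE),
        ctypes.c_int64(offset),
        ctypes.c_int64(length),
    )
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def write_page(fd, page_no, page):
    """Write one page compressed and punch out the unused tail of its slot.

    Returns the number of bytes in the slot that still need disk blocks.
    """
    if len(page) != PAGE_SIZE:
        raise ValueError("page must be exactly PAGE_SIZE bytes")
    offset = page_no * PAGE_SIZE
    payload = zlib.compress(page)
    used = -(-len(payload) // FS_BLOCK) * FS_BLOCK  # round up to whole blocks
    if used >= PAGE_SIZE:
        os.pwrite(fd, page, offset)  # no block saved, store the page as-is
        return PAGE_SIZE
    os.pwrite(fd, payload.ljust(used, b"\0"), offset)
    punch_hole(fd, offset + used, PAGE_SIZE - used)  # one punch per page write
    return used


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        fd = os.open(os.path.join(d, "t.ibd"), os.O_RDWR | os.O_CREAT, 0o600)
        os.ftruncate(fd, 4 * PAGE_SIZE)
        print(write_page(fd, 1, b"a" * PAGE_SIZE), "bytes kept for page 1")
        os.close(fd)
```

Every page write can change how many blocks the slot needs, so a busy multi-GB file accumulates a very large number of small extents. That is the state the file system has to clean up when the file is eventually unlinked.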

XFS was still a problem and ext4 was just as bad. The problem is that the unlink() system call takes a ridiculous amount of time after a multi-GB file has been subject to many punch hole writes. By ridiculous I mean 50X or 100X longer. Maybe I am the only one using InnoDB file-per-table and measuring time to drop a table or database. Bad things happen when DROP takes 10+ minutes: InnoDB is unusable by other connections and an InnoDB background thread might kill mysqld because it thinks there is an internal deadlock.
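Measuring the unlink() cost in isolation needs nothing more than a timer around the call. The sketch below is an illustration rather than the harness used for these tests; `timed_unlink`, `make_sparse_file` and the file name `scratch.ibd` are hypothetical. Note that a freshly created sparse file is not fragmented, so this only gives a baseline. Reproducing the slow case requires a file that has already been through many punch hole writes.

```python
import os
import tempfile
import time


def make_sparse_file(path, size_bytes):
    """Create a file with the given logical size; truncate() leaves it sparse."""
    with open(path, "wb") as f:
        f.truncate(size_bytes)


def timed_unlink(path):
    """Remove path and return the wall-clock seconds spent inside unlink()."""
    start = time.perf_counter()
    os.unlink(path)
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "scratch.ibd")  # hypothetical file name
        make_sparse_file(path, 64 * 1024 * 1024)
        print(f"unlink took {timed_unlink(path):.6f} s")
```

Pointing `timed_unlink` at a copy of a real, heavily updated .ibd file on a scratch file system would show whether a given kernel and file system have the problem.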

Did I mention this was good news for MyRocks? If you want compression then we get 2X better compression compared to compressed InnoDB. We also made sure that DROP TABLE and DROP INDEX are fast.

I updated bug 78277 for this problem. I am the only person updating that bug. I also found a corruption bug with transparent page compression, bug 78672. Past experience with transparent page compression is described here, here and here.

The end to my testing day was perfect. I rebooted the host to get back memory from XFS metadata allocations and the filesystem came back corrupt. Being a pessimist I was using a scratch filesystem for this, so I didn't lose my Ubuntu install.

15 comments:

  1. Your experience on XFS and EXT4 with drops taking 10+ minutes is not unusual. To get around this, just create hard links to the ibd files before dropping. Then remove the hard links outside of MySQL during off hours.

    1. Some of us don't have off hours. 10 minutes for unlink is a problem I have never experienced before. Setting up hard links doesn't work at scale. I imagine filesystem performance will suck for the 10+ minutes when the unlink is done in the background, but I am not going to spend more time showing this feature is busted.

  2. Mark --- did you move your blog to a different theme? The fonts all got a lot smaller and harder to read~

    1. I went back to an older theme as the "live" theme had a few glitches. I made the font larger just now.

  3. I think ZFS supports hole punching, on top of other niceties like compression, block level checksums and making InnoDB's double write irrelevant (yay performance!). Have you given it a try?

    1. I have not. XFS tends to be the FS of choice for web scale Linux. I have never encountered ZFS in production, nor have I heard of the other big web-scale companies using it on Linux.

      I know of many deployments that disable the doublewrite buffer.

      People who run MySQL on ZFS should write about their experiences. Would be even better if Percona stood behind it.

    2. There are folks inside of Percona that really like ZFS and recommend it, particularly for snapshot backups. It is kind of surprising that it doesn't get blogged about.

    3. I use MySQL on ZFS in production for a service with just under a million and a half registered users. I set it up partially using recommendations I found from Percona's public material (slides & blog posts). I wouldn't mind writing about it but I wonder about one thing: What would you like to know? :)

    4. How does it work?

      Linux, BSD or Solaris?

      Disk or SSD? Because ZFS is copy-on-write it will increase fragmentation on database files, if this runs on disk the extra seeks would concern me.

      Have you ever needed fsck? AFAIK, ZFS doesn't have that. HW and SW bugs can cause corruption. Is repair ever needed?

      What is your workload (database & IO) and database size?

      Why ZFS? Did you choose it or was it there when you arrived?

      Are you using anything fancy with it, like the SSD accelerators that Oracle sells?

      I have not read about ZFS for a while, so my terminology is limited, but how have you configured it -- pools and disks?

      What other ZFS features do you use?

  4. I have done testing on this feature mostly using NVMFS and FusionIO NVM device but my tests did not include actual drop table speed.

    1. Can someone provide a link to show that NVMFS is a real product with real customers? I didn't find anything on the SanDisk web site.

  5. Mark, I may be missing something, but I just conducted a small experiment where I created and populated a page_compressed table in MariaDB Server 10.6 after the Sysbench prepare step. I would then start the Sysbench oltp_update_index benchmark and execute DROP TABLE while the test is running. I confirmed with "ls -ls" that the file had been compressed, but I did not run anything special to ensure that it is heavily fragmented. The DROP TABLE executed virtually instantly, and I was not able to observe any degradation of throughput.

    Several changes could have improved this during the past years. Thanks to MDEV-8069, MariaDB no longer holds any mutexes while the file is being delete-on-close'd. Thanks to MDEV-21069, DROP TABLE will avoid modifying any pages in the *.ibd file. Thanks to MDEV-23855, MariaDB uses asynchronous I/O for page writes, also for the doublewrite buffer.

    Linux has improved too. For example, I understood that ext4fs supports FUA mode, which was previously only supported by xfs. If there was a file system scalability bug that would cause the deletion of a large fragmented file to interfere with writes to other files, maybe that bug has been fixed by now. I used Linux 5.10.0. The server was configured with innodb_flush_log_at_trx_commit=1, and the data directory was located on an ext4 file system on an Intel Optane 960 NVMe PCIe card.

    In related news, we recently ran some benchmarks on a thinly provisioned ScaleFlux computational storage device. In our tests, storing uncompressed tables on a thinly provisioned device was not only fastest, but also resulted in the best compression. The thinly provisioned device would "misrepresent" itself to the file system as larger than its real capacity and then compress the data by itself.

    1. If there were a contest for best comments on my blog, you would win. Thank you.

      I am unlikely to repeat my experiments but any conclusion requires a filesystem that has been sufficiently fragmented. Alas, I can't define "sufficient".

      One of the changes that upstream did was to move the unlink out of the critical path. But it still has to be done and a slow unlink can stall mysqld shutdown, another not fun thing in production.

      I don't think there is an easy way to get compression for free. Either InnoDB has to commit to implementing a feature or you need a clever storage device. I doubt that file system developers are interested in making "punch hole on every page write" robust. They might have better things to do.

    2. Another hole punch bug in Linux. I assume there will be more given this isn't a widely used feature and it makes things complex for the kernel.
      https://lwn.net/Articles/864363/

  6. Mark, thank you. Yes, while page compression at the file system interface makes things much simpler than the ‘opaque’ ROW_FORMAT=COMPRESSED that I started to implement in 2005 (with intrusive changes to several subsystems), it does make things complex for the file system. As we all know, complexity often correlates with bugs.

    I hope that smart storage will become mainstream in the not too distant future, so that the complexity can stay away from both the database and the operating system kernel.

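The hard-link workaround suggested in the first comment can be sketched as below. It is an illustration only, and as the reply to that comment points out it does not remove the unlink cost, it only moves it out of DROP TABLE. The function names `park_with_hardlink` and `reap_parked` and the idea of a separate "parking" directory are assumptions made for the example. The parking directory must be on the same file system as the .ibd file, and files with the same base name from different databases would collide in this simple layout.

```python
import os
import tempfile


def park_with_hardlink(ibd_path, parking_dir):
    """Add a second name for ibd_path so DROP TABLE only removes one link."""
    os.makedirs(parking_dir, exist_ok=True)
    link_path = os.path.join(parking_dir, os.path.basename(ibd_path))
    os.link(ibd_path, link_path)  # fails across file systems
    return link_path


def reap_parked(parking_dir):
    """Unlink parked files whose table is gone; the blocks are freed here."""
    removed = []
    for name in sorted(os.listdir(parking_dir)):
        path = os.path.join(parking_dir, name)
        # A link count of 1 means only the parked name is left,
        # i.e. the server has already dropped its own name for the file.
        if os.stat(path).st_nlink == 1:
            os.unlink(path)
            removed.append(name)
    return removed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ibd = os.path.join(d, "t1.ibd")  # stand-in for a real tablespace file
        with open(ibd, "wb") as f:
            f.write(b"x")
        parking = os.path.join(d, "parked")
        park_with_hardlink(ibd, parking)
        os.unlink(ibd)  # stands in for DROP TABLE removing its link
        print(reap_parked(parking))
```

The slow unlink still happens when `reap_parked` runs, so other I/O on the same file system may suffer for as long as it takes.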
