Friday, October 2, 2015

Wanted: a file system on which InnoDB transparent page compression works

I work on MyRocks, a MySQL storage engine built on RocksDB and an alternative to InnoDB. It might be good news for MyRocks if transparent page compression is the future of InnoDB compression. I got more feedback from the MySQL team that, despite the problems I have reported, transparent page compression works; I was just testing the wrong systems.

So I asked Btrfs core developers whether it was OK to do a punch hole per write and they politely told me to go away. Bcachefs might be a great way to add online compression to a b-tree without doing a punch hole per write, but it is not ready for production. Someone from MySQL suggested I try ext4, so I set up two servers with Ubuntu 14.04, which is on the list of supported systems. I used XFS on one and ext4 on the other.

XFS was still a problem and ext4 was just as bad. The problem is that the unlink() system call takes a ridiculous amount of time after a multi-GB file has been subject to many punch hole writes. By ridiculous I mean 50X or 100X longer than normal. Maybe I am the only one using InnoDB file-per-table and measuring the time to drop a table or database. Bad things happen when DROP takes 10+ minutes: InnoDB is unusable by other connections, and an InnoDB background thread might kill mysqld because it thinks there is an internal deadlock.

Did I mention this was good news for MyRocks? If you want compression, MyRocks gets about 2X better compression than compressed InnoDB. We also made sure that DROP TABLE and DROP INDEX are fast.

I updated bug 78277 for this problem. I am the only person updating that bug. I also found a corruption bug with transparent page compression, bug 78672. Past experience with transparent page compression is described here, here and here.

The end to my testing day was perfect. I rebooted the host to get back memory from XFS metadata allocations and the filesystem came back corrupt. Being a pessimist I was using a scratch filesystem for this, so I didn't lose my Ubuntu install.

11 comments:

  1. Your experience on XFS and ext4 with drops taking 10+ minutes is not unusual. To get around this, just create hard links to the ibd files before dropping. Then remove the hard links outside of MySQL during off hours.

    1. Some of us don't have off hours. 10 minutes for unlink is a problem I have never experienced before. Setting up hard links doesn't work at scale. I imagine filesystem performance will suck for the 10+ minutes when the unlink is done in the background, but I am not going to spend more time showing this feature is busted.

  2. Mark, did you move your blog to a different theme? The fonts all got a lot smaller and harder to read.

    1. I went back to an older theme as the "live" theme had a few glitches. I made the font larger just now.

  3. I think ZFS supports hole punching, on top of other niceties like compression, block level checksums and making InnoDB's double write irrelevant (yay performance!). Have you given it a try?

    1. I have not. XFS tends to be the FS of choice for web scale Linux. I have never encountered ZFS in production, nor have I heard of the other big web-scale companies using it on Linux.

      I know of many deployments that disable the doublewrite buffer.

      People who run MySQL on ZFS should write about their experiences. Would be even better if Percona stood behind it.

    2. There are folks inside of Percona that really like ZFS and recommend it, particularly for snapshot backups. It is kind of surprising that it doesn't get blogged about.

    3. I use MySQL on ZFS in production for a service with just under a million and a half registered users. I set it up partially using recommendations I found from Percona's public material (slides & blog posts). I wouldn't mind writing about it but I wonder about one thing: What would you like to know? :)

    4. How does it work?

      Linux, BSD or Solaris?

      Disk or SSD? Because ZFS is copy-on-write it will increase fragmentation on database files, if this runs on disk the extra seeks would concern me.

      Have you ever needed fsck? AFAIK, ZFS doesn't have that. HW and SW bugs can cause corruption. Is repair ever needed?

      What is your workload (database & IO) and database size?

      Why ZFS? Did you choose it or was it there when you arrived?

      Are you using anything fancy with it, like the SSD accelerators that Oracle sells?

      I have not read about ZFS for a while, so my terminology is limited, but how have you configured it -- pools and disks?

      What other ZFS features do you use?

  4. I have done testing on this feature, mostly using NVMFS and a FusionIO NVM device, but my tests did not include actual DROP TABLE speed.

    1. Can someone provide a link to show that NVMFS is a real product with real customers? I didn't find anything on the SanDisk web site.
