Wednesday, February 26, 2014

Write Amplification: write-optimized versus update-in-place

I spent a few days running linkbench to compare write-amplification between an update-in-place b-tree (InnoDB) and a write-optimized algorithm (TokuDB). There are a few changes that can be made to greatly reduce the bytes written rate for InnoDB. My co-workers had great success in doing this for a specific workload but might not have described the impact in public. So I will describe the changes using linkbench. Note that that bytes written rate is very important when using flash storage as device endurance is rated in terms of it. Reducing the rate can either extend the lifetime of the device or allow you to buy a lower endurance and less expensive device.

The changes to reduce the bytes written rate for InnoDB include:
  • reduce innodb_page_size - if pages are being written back when only a small fraction of the page is dirty then you might reduce the bytes written rate in half by cutting the page size in half (from 16k to 8k). Note that using 2X compression already reduces the disk page size in half and it probably is a bad idea to use 2X compression with 4k pages so the best you can do is use a 4k page without compression or an 8k page with 2X compression.
  • increase innodb_log_file_size - if pages are being written back because of the fuzzy checkpoint constraint (from the flush list not from the LRU) then making the log file larger will reduce the write back rate. However you need to be careful to avoid sacrificing too much RAM to redo log files cached by the OS as buffered IO is used for this.
  • increase innodb_max_dirty_pages_pct - if pages are being written back because this limit is reached then making it larger will reduce the rate. But that will also increase shutdown time.
  • disable the doublewrite buffer - disabling this cuts the bytes written rate for database files in half. Alas it exposes you to torn pages. One day atomic page writes might be common on flash devices and this won't be a concern.
  • disable innodb_flush_neighbors - this is definitely an optimization for spinning disks. Disabling it will prevent InnoDB from writing back some dirty pages before their time.
  • magic - there are a few more changes that my clever colleagues have done. I will allow them to describe that work.


The summary of the results below:
  • with tuning you can significantly reduce the InnoDB bytes written rate
  • TokuDB requires very little tuning to get good performance. I set one parameter in my.cnf for TokuDB versus many for InnoDB. They have done a great job with that.
  • TokuDB might be able to do much better with tuning but that is a task for another Callaghan
  • the bytes written rate is significantly less for TokuDB than for InnoDB
  • the database size is significantly smaller for TokuDB than for InnoDB
  • QPS on multi-core is much better today than it was last year for TokuDB
  • the bytes written rate to flash is underestimated for InnoDB or overestimated for TokuDB because this does not include writes done for copy out during flash block cleaning. So TokuDB is doing much better relative to InnoDB than reported by my metrics


For InnoDB fsync-on-commit was enabled and the binlog was disabled. So this includes writes from the redo logs. InnoDB used a 4G or 8G redo log as indicated in each test result. InnoDB also used O_DIRECT and the doublewrite buffer was disabled. I don't have results for InnoDB with the doublewrite enabled. The write rates would be about 2X of what I report here which is about 4X worse than TokuDB in the worst case. When compression is enabled I used zlib and it is only used for the link and count tables. It is not used for the object table.

TokuDB used a 16k read block size and buffered IO. I also shared a much larger performance report with the TokuDB team but I am reluctant to share performance comparisons in public. I feel much better about sharing reports that show the benefit of a few changes to one DBMS. I used the default compression -- zlib. I repeated some tests for the uncached database using a larger read block size but QPS was much worse as I predicted. I also increased the TokuDB checkpoint interval from 60 seconds to 300 seconds for a few tests but that did not reduce the bytes written rate.

Write rates were measured by iostat. Note that all bytes written are not created equal. When small random writes are done so that the flash block cleaner (garbage collection) encounters a mix of live & dead pages while cleaning it must copy out (write) the live pages elsewhere. This is another source of write amplification and bytes written. I did not measure it but the impact on InnoDB should be much larger than the impact on TokuDB. A write optimized database that does large random writes, for example writes that are a multiple of the flash block size, will avoid some/most/all of the copy out. Note that I also describe all writes as random. Most busy DBMS workloads will have a variety of IO streams running concurrently on a small number of storage devices. From the perspective of the storage device this is likely to not look sequential.

For all of the tests I ran the workloads for many hours before taking any measurements to get the database into something resembling a steady state with fragmentation for InnoDB and dead versions of rows for TokuDB. I read many benchmark results for which measurements are done immediately after load and those results can be very misleading. It also helps to make sure the flash device is full so that flash GC is in progress when the benchmarks are running. I am reasonably sure that was true in this case.

The test server has 12 CPU cores and 24 with HT enabled. I used good flash devices.

The database configurations that I tested were:
  • TokuDB
  • InnoDB-4k-c - InnoDB with a 4k page size and 2X compression. This means that database pages are 2k for compressed tables which is not a good idea in the future when 4k disk sectors are deployed.
  • InnoDB-8k-c - InnoDB with an 8k page size and 2X compression.
  • InnoDB-16k-c - InnoDB with a 16k page size and 2X compression
  • InnoDB-4k-u - InnoDB with a 4k page size
  • InnoDB-8k-u - InnoDB with an 8k page size
  • InnoDB-16k-u - InnoDB with a 16k page size

Small cached database

For the first test I used a database that was cached by InnoDB and TokuDB. I configured linkbench to use maxid1=10M+1. The database size after initial load was much less than 20G for both InnoDB and TokuDB. 10 client threads each did 100M linkbench requests and the test took about 8 hours. A 4G redo log was used for InnoDB.

The results show that InnoDB with 4k pages and 2X compression is close to TokuDB in the bytes written rate. This is a great result but I don't recommend 4k pages with 2X compression and I did not run this test for 8k pages with 2X compression.

Larger cached database

I repeated the test with maxid1=100M+1. In this case an 8G redo log was used for InnoDB. The database after load was 34G/66G for TokuDB/InnoDB and grew to 60G/102G. It was still cached by both TokuDB and InnoDB. For this test I used 20 client threads that each did 20M requests. The bytes written rate for InnoDB with 8k pages and 2X compression was about 2X the rate for TokuDB. I did not save metrics to look at page write rates by table. If most of the writes were done for the uncompressed node table then supporting 8k+2X compressed tables concurrent with 4k-uncompressed tables would be a useful feature for InnoDB.

Not cached database

For the final result I used maxid1=1B+1. The test database after load was 341G for TokuDB, 672G for InnoDB-8k-c and 651G for InnoDB-16k-c. At test end it was 422G, 831G and 792G respectively. An 8G redo log was used for InnoDB. The test used 20 client threads that each did 20M requests. In this case the rate for InnoDB with 8k pages and 2X compression is about 1.3X the rate for TokuDB.


  1. I have a query that is a large data set needed for reporting purposes. Currently, the "duration" showing in MySQL workbench which I'm assuming to be execution time is about 7 seconds, so it is fairly optimized. It returns a measly 6000 rows, but it takes nearly 150 seconds to return them according to the "fetch" time.

    Now, there are over 50 columns, which may explain some of the speed, but when I extracted the data set into a spreadsheet, it turned out to be about 4MB. I'm certainly not an expert, but I didn't expect 4MB to take 150 seconds to return over the pipe. I went ahead and performed the same query on a localhost setup to eliminate networking issues. Same result! It took about 7 seconds to execute, and 150 seconds to return the data on the same machine.

    This report is expected to run real-time on demand, so having the end user wait 2 minutes is unacceptable for this use case. How can I improve the time it takes to return the data from MySQL?

  2. I think that is a better question for the experts at and

  3. We are very keen to reduce the complexity around InnoDB IO tuning. I agree it takes a certain level of expertise to get it right for the given context. After the CPU fixes around transaction life cycle management the next thing on my radar is IO. It needs the same level of treatment as the kernel mutex split. Serious hard look and close to a rewrite if it comes to that.

    As usual great post, please keep them coming!

    1. I should have been more clear that the FB team might have added a few tuning options. They are OK for us as we like to test and tune and want to help other companies get more consulting.

    2. Sunny,

      Its heartening to see that finally we are coming around to work towards reducing the number of knobs.