Friday, July 27, 2018

Fast shutdown, fast startup

Managing a web-scale database service is easier when shutdown and startup are fast, or at least fast enough. To me, fast enough means a few seconds and too slow means tens of seconds.

When these operations are too slow then:
  • scripts might time out - one of the MySQL scripts used to do that, see bug 25341
  • uncertainty increases - storage engines rarely provide progress indicators for shutdown. Most write only a few lines to the error log: one when shutdown starts, one when it ends, and maybe a few more. Alas, you have to ssh to the host and tail the error log to see them. When InnoDB does crash recovery at startup there is a useful progress indicator in the error log, but again you must ssh to the host to see it. Note that "ssh to the host to tail the error log" is not a best practice for web-scale.
  • downtime increases - shutdown, startup and restart (shutdown then startup) can all be sources of downtime. They happen for many reasons: changing a configuration option that isn't dynamic, upgrading to a new binary, upgrading the kernel, etc. When they take 60 seconds your service might incur 60 seconds of downtime.
The work done by shutdown/startup also depends on the index structure (LSM vs b-tree) and on implementation details.

B-Tree

For a b-tree either shutdown or startup will be slow. The choice is to either flush dirty pages on shutdown (one random write per dirty page) or do crash recovery at startup (one random read per page that was dirty at shutdown; eventually those dirty pages must be written back). The innodb_fast_shutdown option lets you control which one will be slow.
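
To make the trade-off concrete, here is a minimal sketch of that option; the value meanings are paraphrased from the MySQL manual:

    -- innodb_fast_shutdown decides which side pays
    --   0 - slow shutdown: full purge, change buffer merge, flush dirty pages
    --   1 - the default: skip the purge/merge, but still flush dirty pages
    --   2 - flush only the redo log and shut down cold; crash recovery runs
    --       at the next startup
    SET GLOBAL innodb_fast_shutdown = 2;  -- fast shutdown, slow startup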

When dirty page writeback is done on shutdown, the time for that is a function of storage performance and the number of dirty pages. Back in the day (InnoDB running on disks) shutdown was slower. Storage is much faster today, but buffer pools are also larger because servers have more RAM. Shutdown can be made faster by reducing the value of innodb_max_dirty_pages_pct a few minutes before the shutdown. Alas, using a large value for innodb_max_dirty_pages_pct can be very good for performance -- less log IO, less page IO, less write-amplification.
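
A hedged sketch of that pre-shutdown routine; the target of 1 percent is my choice rather than a recommendation, and Innodb_buffer_pool_pages_dirty is the status counter to poll:

    -- a few minutes before a planned shutdown, let writeback catch up
    SET GLOBAL innodb_max_dirty_pages_pct = 1;
    -- poll until the dirty page count is small, then shut down
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';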

Amazon Aurora is a hybrid, or mullet, with a b-tree up front and log-structured in the back. Shutdown for it is fast. It also doesn't need to warm up after startup because the buffer pool survives an instance restart. Many years ago there was an option in Percona XtraDB to make the buffer pool survive restart; I wonder if that option will return. InnoDB also has an option to warm up the buffer pool at startup, but that still does IO, which isn't as nice as preserving the buffer pool.
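
For reference, a sketch of those InnoDB warmup options; note that the load at startup still re-reads pages from storage, unlike a buffer pool that survives restart:

    -- record buffer pool page IDs at shutdown, reload those pages at startup
    SET GLOBAL innodb_buffer_pool_dump_at_shutdown = ON;
    -- innodb_buffer_pool_load_at_startup is not dynamic, set it in my.cnf
    -- a dump or load can also be triggered manually:
    SET GLOBAL innodb_buffer_pool_dump_now = ON;
    SET GLOBAL innodb_buffer_pool_load_now = ON;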

Back in the day InnoDB startup was often too slow, though my memory of this has faded. One part of the problem was that per-index statistics were computed the first time a table was opened, at a cost of roughly 6 random reads per index. My memory of the second part of the delay has faded more, but I think at times this work was single-threaded. A post from Percona explained some of this. Today InnoDB stats can be persistent, so they don't have to be sampled at startup, and InnoDB was also enhanced to avoid some of this problem long before persistent stats were added. I hope a reader provides a less vague description of this.
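
A sketch of the persistent stats configuration; the table name t is hypothetical:

    -- persistent stats avoid sampling indexes when a table is first opened
    SET GLOBAL innodb_stats_persistent = ON;  -- the default in modern MySQL
    -- it can also be set per table:
    ALTER TABLE t STATS_PERSISTENT = 1;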

LSM

Shutdown for an LSM is fast -- flush the write buffer, no random writes. One thing that made shutdown slower for RocksDB was calling free for every page in the LRU. Note that RocksDB does a malloc per page in the LRU rather than one huge malloc like InnoDB. With MyRocks the LRU isn't freed on shutdown, so there are no stalls from that.

Startup for MyRocks should be fast but there is still at least one problem to solve. If you configure it with max_open_files=-1 then file descriptors are opened for all SSTs at startup. This helps performance by avoiding the need to search a hash table. The cost of this is 1) more open file descriptors and 2) more work at startup. See the description of the RocksDB option and more details in the tuning guide and FAQ. The work done to open all of the SSTs can be done by multiple threads, and the number of threads is controlled by the max_file_opening_threads RocksDB option, which defaults to 16. From looking at the MyRocks code I don't think there is a way to change max_file_opening_threads. With MyRocks, when rocksdb_max_open_files is set to -2 the open files limit is auto-configured, when set to -1 there is no limit, and when set to a value > 0 that value is the limit.

The not-yet-solved problem is that RocksDB tries to precache some data from the end of every SST by reading that data into the OS page cache, which can be a lot of IO at startup and can make startup slower.
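
A sketch of the MyRocks side of this, assuming rocksdb_max_open_files can only be set at startup:

    -- in my.cnf, pick one:
    --   rocksdb_max_open_files = -2    # auto-configure from the open files limit
    --   rocksdb_max_open_files = -1    # no limit, open every SST at startup
    --   rocksdb_max_open_files = 1000  # a value > 0 is used as the limit
    -- confirm the running value:
    SELECT @@GLOBAL.rocksdb_max_open_files;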

