
Showing posts from July, 2018

Fast shutdown, fast startup

Managing a web-scale database service is easier when shutdown and startup are fast, or at least fast enough. Fast enough to me means a few seconds, and too slow means tens of seconds. When these operations are too slow then:
- scripts might time out - one of the MySQL scripts used to do that, see bug 25341
- uncertainty increases - storage engines rarely provide progress indicators for shutdown. Most write only a few lines to the error log: one for shutdown starting, one for shutdown ending and maybe a few more. Alas, you have to ssh to the host and tail the error log to see them. When InnoDB startup does crash recovery there is a useful progress indicator in the error log, but again, you need to ssh to the host to see it. Note that "ssh to the host to tail the error log" is not a best practice for web-scale.
- downtime increases - restart (shutdown/startup), shutdown and startup can be sources of downtime. They happen for many reasons -- changing a configuration option that isn't …
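As an illustration of the "scripts might time out" point: any wrapper that restarts mysqld has to pick a timeout, and slow shutdown is what makes that choice painful. A minimal sketch, assuming mysqladmin is on the PATH and reads credentials from the usual option files (the 30-second limit is an arbitrary choice, not from the post):

```python
import subprocess
import time

SHUTDOWN_TIMEOUT = 30  # seconds; arbitrary -- "too slow" in the post means tens of seconds

def timed_shutdown():
    """Run mysqladmin shutdown and report how long it took, or fail on timeout."""
    start = time.monotonic()
    try:
        subprocess.run(["mysqladmin", "shutdown"], check=True, timeout=SHUTDOWN_TIMEOUT)
    except subprocess.TimeoutExpired:
        raise RuntimeError(f"shutdown did not finish within {SHUTDOWN_TIMEOUT}s")
    return time.monotonic() - start

if __name__ == "__main__":
    print(f"shutdown took {timed_shutdown():.1f}s")
```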

Tiered or leveled compaction, why not both via adaptive compaction?

First there was leveled compaction and it was good, but it took a while for implementations to become popular. Then there was (size) tiered compaction and it was good too, but more confusing given the diversity in strategies for picking files to compact. RocksDB didn't help with the confusion by calling it universal compaction. Eventually compaction algorithms optimized for time series were added (see DTCS for Cassandra). Finally, Kudu and InfluxDB have specialized compaction algorithms that are also worth understanding. This post is about adaptive compaction, which is yet another compaction algorithm. The summary for adaptive compaction is:
- LSM file structure is orthogonal to the use of tiered or leveled compaction. A given LSM instance can be viewed as using both tiered and leveled; it just depends on how you look at it.
- Some levels in the LSM tree can be read optimized while others can be write optimized.
- Each compaction step makes cost-based decisions to change a level between …
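To make the "cost-based decisions" bullet concrete, here is a toy sketch of choosing, per level, between a write-optimized layout (several sorted runs, tiered-style) and a read-optimized one (a single sorted run, leveled-style). The cost model and numbers are hypothetical, not from the post:

```python
# Toy sketch of a per-level, cost-based choice between a write-optimized layout
# (tiered: several sorted runs per level) and a read-optimized one (leveled:
# a single sorted run per level). The cost model and numbers are hypothetical.

def level_cost(sorted_runs, reads_per_sec, writes_per_sec):
    """Relative cost of serving a level with a given number of sorted runs:
    more runs make reads more expensive (each run may need to be checked)
    but writes cheaper (less data is rewritten by compaction)."""
    read_cost = reads_per_sec * sorted_runs
    write_cost = writes_per_sec / sorted_runs
    return read_cost + write_cost

def choose_runs(reads_per_sec, writes_per_sec, max_runs=8):
    """Pick the number of sorted runs for a level that minimizes total cost."""
    return min(range(1, max_runs + 1),
               key=lambda runs: level_cost(runs, reads_per_sec, writes_per_sec))

# A read-heavy level converges to 1 run (leveled behavior), a write-heavy
# level converges to many runs (tiered behavior).
print(choose_runs(reads_per_sec=10_000, writes_per_sec=100))   # 1
print(choose_runs(reads_per_sec=100, writes_per_sec=10_000))   # 8
```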

Default options in MyRocks

We need to make MyRocks easier to configure -- this isn't a new idea. If you are using MyRocks with default options in mid-2018 then you are probably not using bloom filters, compression or my favorite compaction policy. You can fix all of that by setting rocksdb_default_cf_options. I wish this were the default. rocksdb_default_cf_options=block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio The above will enable the default compression type for all levels of the LSM tree, which is Snappy in a recent MyRocks build with FB MySQL. But one of the proper distros only provides zlib and doing that for the small levels in the LSM tree (L0, L1, L2) might slow down compaction too much. To set rocksdb_default_cf_options but disable compression use: rocksdb_default_cf_options=block_based_table_factory={cach…
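The filter_policy=bloomfilter:10:false part of that string configures 10 bits per key for the bloom filters. As a quick aside (my arithmetic, not from the post), the standard bloom filter formula shows why ~10 bits/key is a common choice:

```python
import math

def bloom_fp_rate(bits_per_key):
    """Approximate false-positive rate of a bloom filter that uses the
    optimal number of hash functions for the given bits per key."""
    k = bits_per_key * math.log(2)                 # optimal number of hash functions
    return (1.0 - math.exp(-k / bits_per_key)) ** k

for bits in (8, 10, 12, 16):
    print(f"{bits:2d} bits/key -> ~{bloom_fp_rate(bits) * 100:.2f}% false positives")
```

With 10 bits/key the false-positive rate is roughly 1%, so almost every read for a key that is not in a given SST file can skip that file entirely.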

Index+log - an alternative to LSM

While there are many algorithms for index structures I think that most of them can be put into one of three families -- page-based, LSM or index+log. I have not spent much time with index+log implementations, but I think they are interesting. These include BitCask, ForestDB, WiscKey, Faster, HashKV and RocksDB BlobDB. With the index+log approach keys are stored separate from values, values are written to log segments and GC is required. There is variety in the index algorithms. A hash index is used by BitCask and Faster while the others use persistent tree indexes -- a trie for ForestDB, an LSM for WiscKey, HashKV and RocksDB BlobDB. How might index+log do in terms of read, write, space and cache amplification? I will focus on implementations that use a tree index, but some of what I write is true for a hash index. My summary for page-based and LSM is in a previous post. An important part of the answer is that it depends on the index algorithm and index+log allows for diversity …
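To make the shape of index+log concrete, here is a toy store in the spirit of the hash-index variants (my own illustrative sketch, not BitCask or any of the systems named above): an in-memory index maps each key to its location in an append-only log, and GC copies live values out of old segments before dropping them.

```python
# Toy index+log store: an in-memory hash index maps key -> (segment_id, offset),
# values are appended to log segments, and GC rewrites segments to reclaim space
# from overwritten values. Purely illustrative; not any of the real systems above.

class IndexLogStore:
    def __init__(self, segment_size=4):
        self.segment_size = segment_size     # records per segment (tiny, for the demo)
        self.segments = {0: []}              # segment_id -> list of (key, value)
        self.active = 0                      # segment currently being appended to
        self.index = {}                      # key -> (segment_id, offset)

    def put(self, key, value):
        seg = self.segments[self.active]
        if len(seg) >= self.segment_size:    # roll to a new segment when full
            self.active += 1
            self.segments[self.active] = seg = []
        seg.append((key, value))
        self.index[key] = (self.active, len(seg) - 1)

    def get(self, key):
        seg_id, offset = self.index[key]         # one index probe ...
        return self.segments[seg_id][offset][1]  # ... then one read from the log

    def gc(self, seg_id):
        """Copy live values out of a segment, then drop it to reclaim space."""
        for offset, (key, value) in enumerate(self.segments[seg_id]):
            if self.index.get(key) == (seg_id, offset):  # still the live version?
                self.put(key, value)                     # rewrite into the active segment
        del self.segments[seg_id]

store = IndexLogStore()
for i in range(10):
    store.put("k1", f"v{i}")   # overwriting a key leaves dead versions in old segments
store.put("k2", "x")
store.gc(0)                    # segment 0 holds only dead versions of k1
print(store.get("k1"), store.get("k2"))   # -> v9 x
```

The GC step is where the write amplification of index+log shows up: live values are rewritten when their segment is reclaimed, and how often that happens depends on how much dead data a segment is allowed to accumulate.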

CPU overheads for RocksDB queries

An LSM like RocksDB has much better write and space efficiency than a B-Tree. That means with RocksDB you will use less SSD than with a B-Tree and the SSD will either last longer or you can use lower-endurance SSD. But this efficiency comes at a cost. In the worst case RocksDB might use 2X more CPU/query than a B-Tree, which means that in the worst case QPS with RocksDB might be half of what it is with a B-Tree. In the examples below I show that RocksDB can use ~2X more comparisons per query compared to a B-Tree, and it is nice when results in practice can be explained by theory. But first I want to explain the context where this matters and where it doesn't matter. It matters when the CPU overhead from RocksDB is a significant fraction of the query response time -- so the workload needs to be CPU bound (cached working set). It isn't a problem for workloads that are IO-bound. Many performance results have been published on my blog and more are coming later this year. With fa…
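A back-of-the-envelope version of the comparison-counting argument (my arithmetic with made-up run sizes, not the exact example from the post): a point lookup in a B-Tree over N keys costs about log2(N) key comparisons, while an LSM does a binary search per sorted run it has to probe, and those per-run searches add up.

```python
import math

N = 1_000_000_000                    # total keys (made-up)

# One binary search over all keys: the B-Tree case.
btree = math.log2(N)

# Hypothetical LSM shape: sorted-run sizes from smallest to largest,
# growing roughly 10x per level.
runs = [1_000_000, 10_000_000, 100_000_000, 1_000_000_000]

# Range scan / no bloom filters: every sorted run must be searched.
lsm_all_runs = sum(math.log2(r) for r in runs)

# Point lookup with bloom filters: assume only the smallest run (no filter on
# the memtable) and the run that holds the key are searched; filter probes ignored.
lsm_point = math.log2(runs[0]) + math.log2(runs[-1])

print(f"B-Tree           : ~{btree:.0f} comparisons")
print(f"LSM, all runs    : ~{lsm_all_runs:.0f} comparisons ({lsm_all_runs / btree:.1f}x)")
print(f"LSM, point lookup: ~{lsm_point:.0f} comparisons ({lsm_point / btree:.1f}x)")
```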