Friday, July 27, 2018

Fast shutdown, fast startup

Managing a web-scale database service is easier when shutdown and startup are fast, or at least fast enough. Fast enough to me means a few seconds, and too slow means tens of seconds.

When these operations are too slow then:
  • scripts might time out - one of the MySQL scripts used to do that, see bug 25341
  • uncertainty increases - storage engines rarely provide progress indicators for shutdown. Most write only a few lines to the error log: one when shutdown starts, one when it ends and maybe a few more. Alas, you have to ssh to the host and tail the error log to see them. When InnoDB does crash recovery at startup there is a useful progress indicator in the error log, but again, you need to ssh to the host to see it. Note that "ssh to host to tail the error log" is not a best practice for web-scale.
  • downtime increases - restart (shutdown/startup), shutdown and startup can be sources of downtime. They happen for many reasons -- changing a configuration option that isn't dynamic, upgrading to a new binary, upgrading the kernel, etc. When they take 60 seconds your service might incur 60 seconds of downtime.
The work done by shutdown/startup also depends on the index structure (LSM vs b-tree) and on implementation details.

B-Tree

For a b-tree either shutdown or startup will be slow. The choice is either to flush dirty pages on shutdown (one random write per dirty page) or to do crash recovery at startup (one random read per page that was dirty on shutdown, eventually those dirty pages must be written back). The innodb_fast_shutdown option lets you control which one will be slow.
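As a reminder of how that option maps to the tradeoff, here is a my.cnf sketch based on my reading of the MySQL manual (pick one of the three, not all):
innodb_fast_shutdown=0   # slow shutdown: full purge and change buffer merge, then flush dirty pages
innodb_fast_shutdown=1   # the default: skip the purge and change buffer merge, still flush dirty pages
innodb_fast_shutdown=2   # flush only the redo log, skip dirty page writeback, pay for crash recovery at startup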

When dirty page writeback is done on shutdown then the time for that is a function of storage performance and the number of dirty pages. Back in the day (InnoDB running with disks) shutdown was slower. Storage is much faster today, but buffer pools are also larger because servers have more RAM. Shutdown can be made faster by reducing the value of innodb_max_dirty_pages_pct a few minutes before shutdown will be done. Alas, using a large value for innodb_max_dirty_pages_pct can be very good for performance -- less log IO, less page IO, less write-amplification.
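A back-of-the-envelope sketch of the writeback time. This is my toy model, not InnoDB internals: it assumes writeback is limited by storage write throughput, and the buffer pool size, dirty percentage and throughput values are made up:
def shutdown_secs(buffer_pool_gb, dirty_pct, write_mb_per_sec=500):
    # bytes of dirty pages divided by sustained write throughput
    dirty_bytes = buffer_pool_gb * 2**30 * (dirty_pct / 100.0)
    return dirty_bytes / (write_mb_per_sec * 2**20)

# 256G buffer pool, 75% dirty, 500 MB/s of writes -> ~393 seconds
print(round(shutdown_secs(256, 75), 1))
# reduce innodb_max_dirty_pages_pct to 10 before shutdown -> ~52 seconds
print(round(shutdown_secs(256, 10), 1))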

Amazon Aurora is a hybrid, or mullet, with a b-tree up front and log-structured in the back. Shutdown for it is fast. It also doesn't need to warmup after startup because the buffer pool survives instance restart. Many years ago there was an option in Percona XtraDB to make the buffer pool survive restart, I wonder if that option will return. InnoDB also has an option to warmup the buffer pool at startup, but that still does IO which isn't as nice as preserving the buffer pool.

Back in the day InnoDB startup was often too slow. My memory about this has faded. One part of the problem is that per-index statistics are computed the first time a table is opened and that did ~6 random reads per index. That was the first part of the delay. My memory about the second part of the delay has faded more but I think at times this was single-threaded. A post from Percona explained some of this. Today InnoDB stats can be persistent, so they don't have to be sampled at startup. But InnoDB was also enhanced to avoid some of this problem long before persistent stats were added. I hope a reader provides a less vague description of this.

LSM

Shutdown for an LSM is fast -- flush the write buffer, no random writes. One thing that made shutdown slower for RocksDB was calling free for every page in the LRU. Note that RocksDB does a malloc per page in the LRU rather than one huge malloc like InnoDB. With MyRocks the LRU isn't freed on shutdown so there are no stalls from that.

Startup for MyRocks should be fast but there is still at least one problem to solve. With MyRocks, when rocksdb_max_open_files is set to -2 the open files limit is auto-configured, when set to -1 there is no limit, and when set to > 0 then that is the limit. If you configure it with max_open_files=-1 then file descriptors are opened for all SSTs at startup. This helps performance by avoiding the need to search a hash table. The cost of this is 1) more open file descriptors and 2) more work at startup. See the description of the RocksDB option and more details in the tuning guide and FAQ. The work done to open all of the SSTs can be done by multiple threads and the number of threads is controlled by the max_file_opening_threads RocksDB option. From looking at MyRocks code I don't think there is a way to change the value of max_file_opening_threads and the default is 16. The not-yet-solved problem is that RocksDB tries to precache some data from the end of every SST by reading that data into the OS page cache, and that can be a lot of IO which makes startup slower.
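Restating those settings as a my.cnf sketch (the explicit limit in the last line is a hypothetical value):
rocksdb_max_open_files=-2        # auto-configure the open files limit
# rocksdb_max_open_files=-1      # no limit: every SST is opened at startup
# rocksdb_max_open_files=10000   # explicit limit on open SST file descriptors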

Thursday, July 26, 2018

Tiered or leveled compaction, why not both via adaptive compaction?

First there was leveled compaction and it was good, but it took a while for implementations to become popular. Then there was (size) tiered compaction and it was good too, but more confusing given the diversity in strategies for picking files to compact. RocksDB didn't help with the confusion by calling it universal compaction. Eventually compaction algorithms optimized for time series were added (see DTCS for Cassandra). Finally, Kudu and InfluxDB have specialized compaction algorithms that are also worth understanding.

This post is about adaptive compaction which is yet another compaction algorithm. The summary for adaptive compaction is:
  • LSM file structure is orthogonal to the use of tiered or leveled compaction. A given LSM instance can be viewed as using either tiered or leveled compaction -- it just depends on how you look at it.
  • Some levels in the LSM tree can be read optimized while others can be write optimized.
  • Each compaction step makes cost-based decisions to change a level between read and write optimized states, or remain as is. Compaction steps can also change the per-level fanout (thanks to Siying and Andrew for that suggestion).
  • Adaptive compaction can be configured to do strictly tiered or strictly leveled. So it can even adapt to become non-adaptive.

This is just an idea today. It has not been implemented. I think it will be good for RocksDB, but validation of ideas takes time.

Drawing the LSM file structure

The LSM file structure with leveled compaction is usually depicted vertically. The example below is from leveled compaction with 3 levels, per-level fanout set to 4, 4 SSTs in L1, 16 SSTs in L2 and 64 SSTs in L3. I ignore level 0 to simplify the picture.


With tiered compaction the LSM files are usually depicted horizontally as a sequence of sorted runs. The compaction algorithm merges 2 or more adjacent sorted runs at a time but it is hard to reason about the write amplification that will occur with such merging. Below each sorted run is range partitioned with 4, 16 and 64 partitions. Each partition is an SST.


File Structure vs Compaction

The key point here is that the LSM file structure is orthogonal to the use of tiered or leveled compaction. For the examples above the same LSM files are depicted vertically for leveled compaction and horizontally for tiered compaction.

The same files from an LSM instance can be used with tiered or leveled compaction. An instance using leveled compaction can be stopped and switched to use tiered compaction. An instance using tiered compaction can be stopped and switched to use leveled compaction. Metadata might have to be changed during the switch but otherwise the switch is trivial (allow me to wave hands).

To make this claim I ignore the work done in leveled with RocksDB to limit overlap between SSTs on adjacent levels. That is useful when incremental compaction is done to compact 1 SST from level N with the ~fanout SSTs from level N+1 as done by LevelDB and RocksDB. Limiting overlap isn't needed with the classic LSM approach because compaction between levels is all-to-all rather than incremental.

To transform from leveled to tiered assume that with tiered compaction the LSM structure is a sequence of sorted runs and each sorted run is 1+ SSTs. Then the N SSTs in the L0 become the first N sorted runs in the tiered sequence. They are followed by the sorted runs from L1 to Lmax. In this case the large sorted runs from L1 to Lmax are range partitioned (split into many SSTs) in the tiered sequence.

To transform from tiered to leveled the sorted runs from the prefix of the tiered sequence that are each a single SST become the SSTs in L0. Each of the remaining sorted runs becomes a level from L1 to Lmax. This requires that large sorted runs are range partitioned into many SSTs with tiered compaction.

Next is a specific example for transforming from leveled to tiered with fanout=8. The LSM with leveled has:
  • 4 SSTs in the L0 named L0.1, L0.2, L0.3 and L0.4
  • L1 is partitioned into 4 SSTs: L1.1 to L1.4
  • L2 is partitioned into 32 SSTs: L2.1 to L2.32
  • L3 is partitioned into 256 SSTs: L3.1 to L3.256
That uses 7 sorted runs with tiered compaction. With tiered the sequence of sorted runs is:
  • sorted runs 1 to 4 are L0.1, L0.2, L0.3, L0.4
  • sorted run 5 is L1.1 to L1.4 (range partitioned)
  • sorted run 6 is L2.1 to L2.32 (range partitioned)
  • sorted run 7 is L3.1 to L3.256 (range partitioned)
A similar transformation can be done to go from tiered back to leveled. Assume the LSM with tiered compaction uses the file structure from above with 7 sorted runs, the first 4 are each one SST and then runs 5, 6 and 7 are each range partitioned into many SSTs. This can be viewed as an LSM with leveled compaction where runs 1 to 4 are in the L0 and runs 5, 6 and 7 become levels 1, 2 and 3. As noted elsewhere this ignores the work done to limit overlap between adjacent levels with leveled and RocksDB.
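A minimal sketch of that transformation, treating an LSM as a sequence of sorted runs where each sorted run is a list of SSTs. The names and structures are illustrative, not RocksDB code:
leveled = {
    "L0": [["L0.1"], ["L0.2"], ["L0.3"], ["L0.4"]],   # each L0 SST is its own sorted run
    "L1": [[f"L1.{i}" for i in range(1, 5)]],         # one range partitioned sorted run
    "L2": [[f"L2.{i}" for i in range(1, 33)]],
    "L3": [[f"L3.{i}" for i in range(1, 257)]],
}

def leveled_to_tiered(lsm):
    # the L0 runs come first, then one sorted run per larger level
    runs = []
    for level in ("L0", "L1", "L2", "L3"):
        runs.extend(lsm[level])
    return runs

def tiered_to_leveled(runs):
    # the prefix of single-SST runs becomes the L0, each remaining run becomes a level
    lsm, level = {"L0": []}, 0
    for run in runs:
        if level == 0 and len(run) == 1:
            lsm["L0"].append(run)
        else:
            level += 1
            lsm[f"L{level}"] = [run]
    return lsm

tiered = leveled_to_tiered(leveled)
print(len(tiered))                      # 7 sorted runs
print(len(tiered_to_leveled(tiered)))   # 4 levels: L0 to L3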

Vertical Depiction of Tiered Compaction

More recently I have seen the vertical depiction used to explain write amplification for tiered compaction (thanks DASLab for doing this). But that has unstated assumptions about how sorted runs are selected for compaction. With such assumptions the number of levels in the LSM tree is the same whether leveled or tiered is used, but the write-amplification is different. The number of levels is log_fanout(R), the log of R taken with the fanout as the base, where R is size(Lmax) / size(L1). With leveled the worst-case per-level write-amp is equal to the fanout and with tiered it is 1 (a sketch after this list works through an example). The unstated assumptions are:
  1. Sorted runs to be merged all come from the same level.
  2. The fanout determines the number of sorted runs per level. When fanout=8 then each level is full with 8 sorted runs and a merge is started.
  3. When each level is full then each level has fanout times more data than the previous level.
  4. The output of a merge is written to the next larger level. Perhaps this rule can be broken when the size of the output is not large enough. For example assume the fanout is 8, the target size for sorted runs on L1 is 100 and the target size for sorted runs on the L2 is 800. When 8 sorted runs from L1 are merged, on which level should the output be written if the output has size 100 or size 200?
  5. The max level has fanout sorted runs, which implies that space amplification is fanout which is too large for me. I prefer to limit the max level to 1 sorted run which also increases the total write-amp. The space-amp can be further reduced by increasing the fanout between the next-to-max and max levels. I am curious whether existing LSMs can be configured to limit the max level to 1 sorted run to limit the worst-case space-amplification to 2 (ignoring transient space-amp during compaction) and a recent paper, Dostoevsky by Dayan and Idreos, claims they cannot.
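A toy sketch of the level-count and write-amp math under the assumptions above. I count L1 explicitly, so the level count is 1 plus the log term from the text:
import math

def write_amp(R, fanout, leveled):
    levels = 1 + round(math.log(R, fanout))   # L1 to Lmax, where R = size(Lmax) / size(L1)
    per_level = fanout if leveled else 1      # worst-case per-level write-amp
    return per_level * levels

R, fanout = 512, 8                            # example: Lmax is 512x larger than L1
print(write_amp(R, fanout, leveled=True))     # 32 -- 4 levels at 8 per level
print(write_amp(R, fanout, leveled=False))    # 4 -- 4 levels at 1 per level
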
The example below is a vertical depiction of tiered compaction with 4 sorted runs per level. If fanout=4 then each level is full. The sorted runs in levels 1, 2 and 3 are L1.r1 to L1.r4, L2.r1 to L2.r4 and L3.r1 to L3.r4. Each sorted run can be an SST or a sequence of range partitioned SSTs. A sorted run in level N+1 is approximately fanout times larger than a sorted run in level N. The sizes of the boxes below don't imply the sizes of the sorted runs (my drawing skills are limited).


Adaptive Compaction

A quick summary for adaptive compaction is that it uses the vertical depiction for tiered but each level can have a different target for the number of sorted runs -- from 1 to fanout.

First, let me show the LSM tree when the max level is constrained to one sorted run so that the worst-case space-amplification is <= 2, ignoring temp space during compaction. Each level has <= fanout sorted runs. A sorted run is either an SST or range partitioned into many SSTs. A level with 2+ sorted runs is write optimized. A level with 0 or 1 sorted runs is read optimized. The sizes of the boxes below don't imply the sizes of the sorted runs (my drawing skills are limited).


Adaptive compaction:
  1. Uses the vertical depiction of the LSM tree to constrain the compaction steps that can occur.
  2. Makes cost-based decisions during compaction to make the LSM tree more efficient for both the short-term and long-term workload. This is done one compaction step at a time. This is called adaptive compaction because it adapts the shape of the LSM tree to the workload.
  3. The decisions are 1) whether a level should be tiered or leveled and 2) the per-level fanout for the level. When a level is tiered then the per-level fanout determines the number of sorted runs on that level. When a level is leveled then the per-level fanout determines the size ratio between it and adjacent levels.
  4. Decisions respect constraints including the maximum space-amplification (both temporary — during compaction, and permanent — after compaction), write-amplification, number of levels and number of sorted runs. Constraints allow limits for the worst-case read and write efficiency. A possible constraint is the max number of sorted runs for the max level. Constraints can also include hints that optimization should favor point reads, range reads or writes.
Compaction is a sequence of compaction steps and at each compaction step adaptive compaction makes the LSM tree more or less read optimized to be more efficient for both the current and long-term workload. Note that workloads can have daily cycles so that optimizing for the current workload during daily backup or bulk load might not be the best decision when things return to normal in a few hours.

There are four types of compaction steps that can be scheduled. Some make the LSM tree more read-optimized, some make the LSM more write-optimized.
  1. tiered - this merges 2+ sorted runs from level N and writes the output to level N or level N+1 depending on the size of the output. Regardless, level N becomes read optimized. Level N+1 remains or becomes write optimized.
  2. tiered+leveled - this merges 2+ sorted runs from level N with 1 sorted run from level N+1 and writes the output to level N+1. For now I assume this cannot be used when level N+1 has more than 1 sorted run. When the compaction step finishes level N+1 remains read optimized and level N becomes read optimized.
  3. leveled.part - this merges one SST from a sorted run on level N with overlapping SSTs from a sorted run on level N+1. This cannot be used when levels N or N+1 have more than one sorted run. This leaves levels N and N+1 as read optimized. For now I ignore the details of minimizing overlaps between a sorted run in level N and sorted runs in level N+1.
  4. leveled.all - this merges the sorted run in level N with the sorted run in level N+1 and writes the output to level N+1. This cannot be used when levels N or N+1 have more than one sorted run. This leaves levels N and N+1 as read optimized. Unlike leveled.part this doesn't require minimizing overlap.
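A hypothetical sketch of how those four step types might be gated by the number of sorted runs on levels N and N+1. Nothing like this exists in RocksDB and the names are mine:
def allowed_steps(runs_n, runs_n1):
    # runs_n and runs_n1 are the number of sorted runs on level N and level N+1
    steps = []
    if runs_n >= 2:
        steps.append("tiered")                 # merge 2+ runs from level N
        if runs_n1 == 1:
            steps.append("tiered+leveled")     # also merge with the single run on level N+1
    if runs_n == 1 and runs_n1 <= 1:
        steps.append("leveled.part")           # incremental merge into level N+1
        steps.append("leveled.all")            # full merge into level N+1
    return steps

print(allowed_steps(4, 1))   # ['tiered', 'tiered+leveled']
print(allowed_steps(1, 1))   # ['leveled.part', 'leveled.all']
print(allowed_steps(4, 3))   # ['tiered']
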
Examples of adaptive compaction

The smaller levels of the LSM tree have the most churn. Therefore they adapt faster as the current workload changes. They can also be fixed faster when the workload shifts from read to write heavy. Optimizations of persistent data structures have risk — which is the time to undo those optimizations when things change. There is much more risk for optimizations done for Lmax than for L1.

A simple example of adaptive compaction is changing L1 and then L2 from having one sorted run to many sorted runs when the current workload shifts from read to write heavy. Assume that the LSM tree starts out with all levels read-optimized (one sorted run per level) and has been using either leveled.part or leveled.all compaction steps. Compaction into level 1 is special: I have yet to mention it, but assume there is a write buffer and leveled.part compaction steps have been used to merge it with level 1.


Then there is a burst of writes and L1 switches from read to write optimized with 4 sorted runs. Write buffer flushes stop using leveled.all and just write new sorted runs into L1. Compaction steps into L2, L3 and L4 continue using leveled.all or leveled.part.


Then the burst of writes continues and L2 switches from read to write optimized with 4 sorted runs. Write buffer flushes continue to create new sorted runs in L1. Compaction into L2 begins to use tiered compaction steps.


Eventually the write burst ends and L2 changes from write to read optimized by using a tiered+leveled compaction step from L1 to L2. The result is 1 sorted run in L2 and 0 sorted runs in L1. Then a write buffer flush creates a sorted run in L1.


Open Problems

I am sure there are many; this section lists two.

One interesting problem is to optimize when there is skew in the workload. The workload is usually not uniform across the key space. Some ranges of the key space might need optimization for writes and point queries. Another range might benefit from optimization for range queries. The work proposed above doesn't handle this any better than leveled or tiered compaction today.

In some cases we can isolate different workloads to different column families, and with MyRocks that means a column family per index. That still doesn't handle the case where the workload has variance within one index, but it is a first step.

Wednesday, July 25, 2018

Default options in MyRocks

We need to make MyRocks easier to configure -- this isn't a new idea. If you are using MyRocks with default options in mid-2018 then you are probably not using bloom filters, compression or my favorite compaction policy.

You can fix all of that by setting rocksdb_default_cf_options. I wish this were the default.
rocksdb_default_cf_options=block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio
The above will enable the default compression type for all levels of the LSM tree, which is Snappy in a recent MyRocks build with FB MySQL. But one of the proper distros only provides zlib, and using zlib for the small levels in the LSM tree (L0, L1, L2) might slow down compaction too much.

To set rocksdb_default_cf_options but disable compression use:
rocksdb_default_cf_options=block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio;compression=kNoCompression
To set rocksdb_default_cf_options and use fast compression for all levels use this after changing $fast to one of kLZ4Compression or kSnappyCompression:
rocksdb_default_cf_options=block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio;compression=$fast
And to set rocksdb_default_cf_options with a fast compression configuration (no compression for smallest levels, fast compression for mid levels, slower compression for max level) try this after changing $fast and $slow to the appropriate values:
rocksdb_default_cf_options=block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio;compression_per_level=kNoCompression:kNoCompression:kNoCompression:$fast:$fast:$fast;bottommost_compression=$slow
You can determine which compression libraries are supported by your MyRocks build by looking in $datadir/.rocksdb/LOG. This is from an FB MySQL build that I use:

Compression algorithms supported:
         kZSTDNotFinalCompression supported: 1
         kZSTD supported: 1
         kXpressCompression supported: 0
         kLZ4HCCompression supported: 1
         kLZ4Compression supported: 1
         kBZip2Compression supported: 1
         kZlibCompression supported: 1
         kSnappyCompression supported: 1

Some information on my.cnf for MyRocks is here and much more info on RocksDB is on the wiki. If you want to tune, then start with the tuning guide.

Another potential problem with the default configuration is rocksdb_max_background_jobs=2, which means there are only 2 background threads for compaction. You usually want more.

There were also some changes to MyRocks to make it better at avoiding too many tombstones, and it looks like that is disabled by default:
rocksdb_compaction_sequential_deletes           0
rocksdb_compaction_sequential_deletes_count_sd  OFF
rocksdb_compaction_sequential_deletes_file_size 0
rocksdb_compaction_sequential_deletes_window    0
Setting rocksdb_max_open_files to -1 can be great for performance but there are side-effects (more open file descriptors, more untracked memory consumption).

If you are only using MyRocks then you might want to set transaction-isolation=READ-COMMITTED because repeatable-read in MyRocks doesn't use gap locks (yet).

Finally, you probably want to set rocksdb_block_cache_size.
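Putting the suggestions from this post together, a my.cnf sketch might look like the following. The numeric values are hypothetical placeholders -- size them for your server:
rocksdb_default_cf_options=block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio
rocksdb_max_background_jobs=8                # hypothetical, the default is 2
rocksdb_block_cache_size=17179869184         # hypothetical (~16G)
rocksdb_max_open_files=-1                    # watch file descriptors and untracked memory
transaction-isolation=READ-COMMITTED         # if MyRocks is the only engine you use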

Tuesday, July 24, 2018

Index+log - an alternative to LSM

While there are many algorithms for index structures I think that most of them can be put into one of three families -- page-based, LSM or index+log.

I have not spent much time with index+log implementations, but I think they are interesting. These include BitCask, ForestDB, WiscKey, Faster, HashKV and RocksDB BlobDB. With the index+log approach keys are stored separate from values, values are written to log segments and GC is required. There is variety in the index algorithms. A hash index is used by BitCask and Faster while the others use a persistent tree index -- a trie for ForestDB, an LSM for WiscKey, HashKV and RocksDB BlobDB.

How might index+log do in terms of read, write, space and cache amplification? I will focus on implementations that use a tree index, but some of what I write is true for a hash index. My summary for page-based and LSM is in a previous post.

An important part of the answer is that it depends on the index algorithm and index+log allows for diversity. The implementations above use an LSM or trie. There might be an implementation in the future that uses a b-tree although tries are getting more popular (see an interesting SIGMOD paper).

My overview of index+log:
  • Space and write amplification for the value log have an inverse relationship. Space amplification is reduced by doing GC more frequently, but that increases write amplification. A simple model that assumes a uniform distribution explains the relationship. When pct_full is the percentage of storage that is used then space-amp = 100 / pct_full and write-amp is 100 / (100 - pct_full). When storage is 80% full then space-amp is 100/80 or 1.25 and write-amp is 100/20 or 5. A sketch below works through the math.
  • The write-amp in the previous bullet point is IO write-amp. But there is also CPU write-amp and for tree indexes that is the number of comparisons per insert. This includes the comparisons done upfront when the insert is processed and the comparisons done over time during GC on behalf of that insert. All implementations have an upfront cost and all except for HashKV must search the index during GC. The upfront cost is equivalent to a point query as it requires a tree search. When write-amp is 5 as above then GC does 5 tree searches. In that case the total CPU write-amp is equivalent to 6 point queries. The CPU cost of a point query depends on the index algorithm -- for an LSM it costs more than for a b-tree (sometimes 2X more even with bloom filters). But in general the CPU write-amp is 1 + IO-write-amp point queries (except for HashKV). 
  • The CPU write-amp for index+log is usually much worse than for LSM when index+log uses an LSM for the index, except for HashKV. With an LSM if write-amp is 20 then each KV pair is rewritten 20 times by compaction to reach the max level of the LSM tree. The CPU cost is one comparison per rewrite and then N comparisons to insert the KV pair into the memtable and a common value for N is 20.
  • I ignore write-amp and space-amp for the persistent index. That depends on the index algorithm -- page-based or LSM. I wonder if index+log could be used for the index itself.
  • Cache amplification can be bad for index+log for algorithms that require an index probe per KV pair during GC -- HashKV doesn't probe the index during GC. The index probe must be efficient to support high GC rates that follow from high insert rates. This means that the index must be in memory with an index entry in memory for every key. That is a lot of memory unless the value/key size ratio is large.
  • Read amplification is different for point and range queries. First, the CPU component of read-amp depends on the algorithm used for the index (page-based vs LSM). Assuming the index fits in memory (and except for HashKV, the index must be in memory) then the IO component of read-amp comes when reading the value log. For a point query the worst case is one storage read. For a range query the worst case is one storage read per KV pair in the range and that can be expensive (more seeks with disk, more request processing with SSD, decompression per storage read). 
In summary, the potential problems with index+log are bad cache-amp (index must fit in memory), bad CPU write-amp (too many compares/insert) and bad IO read-amp for range queries. Note that HashKV might avoid the first two problems. But all index structures have potential problems and for some workloads index+log is a great choice.
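A sketch of the GC math from the first bullet above plus the CPU write-amp estimate. The formulas are the ones stated in the list, and HashKV is the exception because it doesn't probe the index during GC:
def index_log_amps(pct_full):
    space_amp = 100.0 / pct_full
    io_write_amp = 100.0 / (100 - pct_full)
    cpu_write_amp = 1 + io_write_amp          # upfront tree search plus one search per GC rewrite
    return space_amp, io_write_amp, cpu_write_amp

print(index_log_amps(80))   # (1.25, 5.0, 6.0) -- the example from the text
print(index_log_amps(90))   # (~1.11, 10.0, 11.0) -- fuller storage, more GC writes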

There is an optimization for index+log that might not have been described in previous work. If the log segments are ordered then multiple ordered log segments can be merged to create longer runs of ordered log segments over time. This can reduce the random IO penalty for range queries. The log segment isn't ordered when first written, but it can be ordered the first time GC is done for it.
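A toy sketch of that idea. Once segments are ordered, adjacent ordered segments can be merged into one longer ordered run, which is what reduces the random IO for range queries over the value log. The keys and values here are made up:
import heapq

segment1 = [(1, "v1"), (5, "v5"), (9, "v9")]   # (key, value) pairs, already ordered by key
segment2 = [(2, "v2"), (6, "v6"), (8, "v8")]
segment3 = [(3, "v3"), (4, "v4"), (7, "v7")]

merged_run = list(heapq.merge(segment1, segment2, segment3))
print(merged_run[:4])   # [(1, 'v1'), (2, 'v2'), (3, 'v3'), (4, 'v4')]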

Wednesday, July 11, 2018

CPU overheads for RocksDB queries

An LSM like RocksDB has much better write and space efficiency than a B-Tree. That means with RocksDB you will use less SSD than with a B-Tree and the SSD will either last longer or you can use lower endurance SSD. But this efficiency comes at a cost. In the worst-case RocksDB might use 2X more CPU/query than a B-Tree, which means that in the worst case QPS with RocksDB might be half of what it is with a B-Tree. In the examples below I show that RocksDB can use ~2X more comparisons per query compared to a B-Tree and it is nice when results in practice can be explained by theory.

But first I want to explain the context where this matters and where it doesn't matter. It matters when the CPU overhead from RocksDB is a significant fraction of the query response time -- so the workload needs to be CPU bound (cached working set). It isn't a problem for workloads that are IO-bound. Many performance results have been published on my blog and more are coming later this year. With fast storage devices available I recommend that database workloads strive to be IO-bound to avoid using too much memory+power.

Basic Operations

Where does the CPU overhead come from? I started to explain this in my review of the original LSM paper. My focus in this post is on SELECT statements in SQL, but note that writes in SQL also do reads from the storage engine (updates have a where clause, searches are done to find matching rows and appropriate leaf pages, queries might be done to determine whether constraints will be violated, etc).

With RocksDB data can be found in the memtable and in sorted runs from the L0 to Lmax (Lmax is the max/largest level in the LSM tree). There are 3 basic operations that are used to search for data - get, range seek and range next. Get searches for an exact match. Range seek positions iterators at the start of a range scan. Range next gets the next row from a range scan.

First a few comments before I explain the CPU overhead:
  1. I will try to be clear when I refer to the keys for a SQL table (SQL key) and the key as used by RocksDB (RocksDB key). Many SQL indexes can be stored in the same LSM tree (column family) and they are distinguished by prepending the index ID as the first bytes in the RocksDB key. The prepended bytes aren't visible to the user.
  2. Get is used for exact match on a PK index but not used for exact match on a secondary index because with MyRocks the RocksDB key for a secondary index entry is made unique by appending the PK to it. A SQL SELECT that does an exact match on the secondary key searches for a prefix of the RocksDB key and that requires a range scan.
  3. With RocksDB a bloom filter can be on a prefix of the key or the entire key (prefix vs whole key). A prefix bloom filter can be used for range and point queries. A whole key bloom filter will only be used for point queries on a PK. A whole key bloom filter won't be used for range queries. Nor will it be used for point queries on a secondary index (because secondary index exact match uses a range scan internally).
  4. Today bloom filters are configured per column family. A good use for column families is to put secondary indexes into separate ones that are configured with prefix bloom filters. Eventually we expect to make this easier in RocksDB.
Get

A point query is evaluated for RocksDB by searching sorted runs in order: first the memtable, then runs in the L0, then the L1 and so on until the Lmax is reached. The search stops as soon as the key is found, whether from a tombstone or a key-value pair. The work done to search each sorted run varies. I use comparisons and bloom filter probes as the proxy for work. I ignore memory system behavior (cache misses) but assume that the skiplist isn't great in that regard:
  • memtable - this is a skiplist and I assume the search cost is log2(N) when it has N entries. 
  • L0 - for each sorted run first check the bloom filter and if the key might exist then do binary search on the block index and then do binary search within the data block. The cost is a bloom filter probe and possibly log2(M) comparisons when an L0 sorted run has M entries.
  • L1 to Ln - each of these levels is range partitioned into many SST files so the first step is to do a binary search to find the SST that might contain the key, then check the bloom filter for that SST and if the key might exist then do binary search on the block index for that SST and then do binary search within the data block. The binary search cost to find the SST gets larger for the larger levels and that cost doesn't go away when bloom filters are used. For RocksDB with a 1T database, per-level fanout=8, SST size=32M then the number of SSTs per level is 8 in L1, 64 in L2, 512 in L3, 4096 in L4 and 32768 in L5. The number of comparisons before the bloom filter check is 3 for L1, 6 for L2, 9 for L3 and 12 for L4. I assume there is no bloom filter on L5 because it is the max level.
A bloom filter check isn't free. Fortunately, RocksDB makes the cost less than I expected by limiting the bits that are set for a key, and must be checked on a probe, to one cache line. See the code in AddHash. For now I assume that the cost of a bloom filter probe is equivalent to a few (< 5) comparisons given the cache line optimization.

A Get operation on a B-Tree with 8B rows needs ~33 comparisons. With RocksDB it might need 20 comparisons for the memtable, bloom filter probes for 4 SSTs in L0, 3+6+9+12 SST search comparisons for L1 to L4, bloom filter probes for L1 to L4, and then ~33 comparisons to search the max level (L5). So it is easy to see that the search cost might be double the cost for a B-Tree.
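A sketch of that arithmetic. The per-level SST counts match the example above, the memtable is assumed to have 1M entries (which is where the ~20 comparisons come from), and bloom filter probes are counted separately because a probe isn't a plain comparison:
import math

btree = math.ceil(math.log2(8e9))                                        # ~33 comparisons for 8B rows

memtable = math.ceil(math.log2(1e6))                                     # ~20 for an assumed 1M entry memtable
sst_search = sum(math.ceil(math.log2(n)) for n in (8, 64, 512, 4096))    # 3+6+9+12 to find the SST in L1..L4
lmax = math.ceil(math.log2(8e9))                                         # ~33 to search L5, no bloom filter there
bloom_probes = 4 + 4                                                     # 4 L0 SSTs plus L1..L4

print(btree)                                         # 33
print(memtable + sst_search + lmax, bloom_probes)    # 83 comparisons plus 8 bloom filter probes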

The search cost for Get with RocksDB can be reduced by configuring the LSM tree to use fewer sorted runs although we are still figuring out how much that can be reduced. With this example about 1/3 of the comparisons are from the memtable, another 1/3 are from the max level and the remaining 1/3 are from L0 to L4.

Range Seek and Next

The cost for a range scan has two components: range seek to initialize the scan and range next to produce each row. For range seek an iterator is positioned within each sorted run in the LSM tree. For range next the merging iterator combines the per-run iterators. For RocksDB the cost of range seek depends on the number of sorted runs while the cost of range next does not (for leveled compaction, assuming uniform distribution).

Range seek does binary search per sorted run. This is similar to a point query without a bloom filter. The cost across all sorted runs depends on the number of sorted runs. In the example above where RocksDB has a memtable, 4 SSTs in the L0, L1 to L5 and 8B rows, that requires 20 comparisons for the memtable, 4 x 20 comparisons for the L0 SSTs and 21 + 24 + 27 + 30 + 33 comparisons for L1 to L5. The total is 235 comparisons. There isn't a way to avoid this and for short range queries the cost of range seek dominates the cost of range next. While this overhead is significant for short range queries with embedded RocksDB and an in-memory workload it is harder to notice with MyRocks because there is a lot of CPU overhead above MyRocks from optimize, parse and client RPC for a short range query. It is easier to notice the difference in CPU overhead between MyRocks and a B-Tree with longer range scans.
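The same arithmetic as a tiny sketch, using the per-run comparison counts from the text:
memtable = 20
l0 = 4 * 20                        # 4 L0 SSTs at ~20 comparisons each
l1_to_l5 = 21 + 24 + 27 + 30 + 33  # binary search per level, no help from bloom filters
print(memtable + l0 + l1_to_l5)    # 235 comparisons to position the iterators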

The cost for range next is interesting. LevelDB has a comment that suggests using a heap but the merging iterator code uses N-1 comparisons per row produced which means the overhead is dependent on the number of sorted runs. The cost of range next in RocksDB is much less dependent on the number of sorted runs because it uses an optimized binary heap and the number of comparisons to produce a row depends on the node that produces the winner. Only one comparison is done if the root produces the winner and the root produced the previous winner. Two comparisons are done if the root produces the winner but did not produce the previous winner. More comparisons are needed in other cases. The code is optimized for long runs of winners from one iterator and that is likely with leveled compaction because Lmax is ~10X larger than the next largest level and usually produces most winners. Using a simulation with uniform distribution the expected number of comparisons per row is <= 1.55 regardless of the number of sorted runs for leveled compaction.

The optimization in the binary heap used by the merging iterator is limited to remembering the comparison result between the two children of the root node. Were that extended to remembering comparison results between any two sibling nodes in the heap then the expected number of comparisons would be reduced from ~1.55 to ~1.38.

For a B-Tree:
  • The overhead for range seek is similar to a point query -- ~33 comparisons when there are 8B rows.
  • There are no merging iterators. The overhead for range next is low -- usually move to the next row in the current leaf page.