Monday, September 9, 2019

Vinyl - the LSM in Tarantool

Tarantool started as an in-memory DBMS then added an LSM, named Vinyl. This is my review of the Vinyl overview. I have some questions and assume that I will make a few mistakes in describing it. Corrections are welcome.

Updates

Tarantool is fast and one reason is server-side logic implemented in Lua and made faster by LuaJIT.

There has been talk of running Linkbench on Tarantool. I am not sure whether that requires too much work but I expect great throughput results from it.

I know of Tarantool because the former tech lead for it previously did great work at MySQL and now will do the same at ScyllaDB. HighLoad is a great place to learn more about Tarantool, as I did on my two visits.

LSM Tree Shape

Vinyl uses tiered+leveled compaction with the memtable as L0, leveled for Lmax and tiered for the levels in between the memtable and Lmax. See LSM Geek Code for more on my terminology.

Configuration parameters for Vinyl are here. There are two parameters that determine the shape of the LSM tree. The max number of sorted runs per level is set by vinyl_run_count_per_level. The size ratio between adjacent levels is set by vinyl_run_size_ratio. The docs in the Filling an LSM tree section have a nice description of the impact from these options. When vinyl_run_count_per_level is 2 then there will be at most 2 sorted runs in a level. When vinyl_run_size_ratio is 5 then the max size of a run in L2 is 5X the max size of a run in L1.

Not all tiered compaction algorithms allow an LSM tree to be described in terms of levels like this, but Vinyl does and that makes it easier to reason about performance and efficiency. More than one sorted run per level will have a large, negative impact on range query CPU overhead. However the impact for point queries should be much less when there is a bloom filter per sorted run as used by Vinyl.

Tuple Range Cache

The range cache sounds interesting but I am still confused by it. The docs describe it with the following. I assume this stores ranges (less or more than one page) from the max level of the LSM tree based on "after having performed a compaction spanning all tree levels". How is cache invalidation managed? What is done in between all-level compaction? Is there also a block cache separate from this?
Unlike, say, RocksDB or MySQL, this cache doesn’t store pages, but rather ranges of index values obtained from disk, after having performed a compaction spanning all tree levels. This allows the use of caching for both single-key and key-range searches. Since this method of caching stores only hot data and not, say, pages (you may need only some data from a page), RAM is used in the most efficient way possible. The cache size is controlled by the vinyl_cache parameter.
Other highlights

  • Vinyl measures the memtable insert and flush rates to predict when to begin flushing the memtable so that the flush will finish before the memtable is full. 
  • The Vinyl memtable is a b-tree rather than a skip-list
  • Serializable isolation is provided
  • zstd compression is used
  • Multi-level compaction is mentioned a few times but not explained
  • Vinyl has gap locks but the overview didn't explain how they are used.
  • What happens when a transaction stalls on a read from disk with Vinyl? Does Vinyl run transactions twice -- once to prefetch reads without applying changes, then again to apply changes? Does it run transactions until they stall on a disk read and then what happens when the read is ready? This might explain the need for gap locks.
  • I assume that transactions are interleaved when there are read stalls. Locks are not held so commit can fail when conflicts are detected. The overview explains it implements the MVTO (multiversion timestamp ordering) class, whereby the winning transaction is the one that finished earlier. There are no locks and associated deadlocks. But now I am confused by the claim that there are no locks when the overview mentions that gap locks have been implemented.
  • Vinyl creates bloom filters for full and partial key searches. I assume that a bloom filter for partial key search is similar to a prefix bloom filter in RocksDB.
  • Vinyl uses at least one LSM tree per logical index. More than one LSM tree is used when the logical index is too big in which case the LSM trees range partition the logical index. This allows different optimizations per partition and can be a major win for right or left growing indices, in which case only one partition will get writes and the others can be read optimized. Creating partitions is done via a split. Too-small partitions can be combined via a coalesce. Some of the split/coaleasce work is deferred until compaction (think of virtual partitions).
  • Vinyl supports upsert (read-free updates are in RocksDB)

No comments:

Post a Comment

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...