Wednesday, June 6, 2018

The original LSM paper

I reread the LSM Tree paper by O'Neil et al. It is a great paper and I wanted to share my notes. I also make comparisons to leveled compaction in RocksDB.

The design goal was for disks and insert/update/delete intensive tables that get more writes than reads. Transactional logging (history tables) is one example. The LSM was shown to use less IO capacity than a B-Tree for making changes durable and that was a big deal when the cost of IOPs was a significant fraction of the system cost.  The LSM uses less disk IO capacity than a B-Tree for two reasons:
  1. It does multi-page writes rather than single page writes so the seek latency is amortized over N pages (maybe N=64). Assume a disk can do either 32 1MB transfers/s or 128 4kb transfers/s. Then multi-page block IO throughput is 64x faster.
  2. When sized appropriately, it does a page write per M modified rows while a B-Tree in the worst case (for their example) does a page write per modified row.
While OLTP has since moved from disks to SSD, the LSM is still a great idea because it provides better write and space efficiency than a B-Tree -- it needs less SSD capacity and the devices will last longer.


The paper explains two component and multi-component LSM Tress. A component is similar to a level in RocksDB: component 0 (C0) is the in-memory tree, component 1 (C1) is the smallest on-disk tree. For multi-component trees there can be C2 and C3. C0 is similar to the RocksDB memtable and Cn is similar to Ln in RocksDB. The original LSM didn't have something equivalent to the L0 in RocksDB. While it might have been good for LevelDB, I think the L0 is a performance bug in RocksDB as it makes reads less efficient.

The paper has math to explain that the best fanout between components is a constant, and the paper used r for that value. This is similar to RocksDB for levels >= 1. In the paper, r is the fanout between all components including C0 to C1. While in RocksDB the fanout for memtable:L0 and L0:L1 is usually smaller than r.

The paper doesn't mention the use of bloom filters for the LSM Tree, but does mention them as used by another index structure. They aren't needed for a two component LSM Tree, but can be useful when the LSM Tree has more than 2 components. I assume they could be implemented via a bloom filter per multi-page block.

The paper mentions a few features that have yet to arrive in RocksDB. Predicate deletion is similar to range deletion in RocksDB and is a way to quickly perform a bulk delete. A find note is a special indicator written to the LSM Tree that requests data. When the find note moves from C0 to Cmax then the data will be provided as the find note encounters the keys that it requests.

C0 can use a variety of in-memory tree indexes. C1 and larger use a special B-Tree:
  • Leaf nodes are packed (no free space)
  • Key-order adjacent nodes are written in multi-page blocks. Assume a leaf node is 4kb and and a multi-page block is 256kb then 64 leaf nodes are written per multi-page block. This amortizes the seek latency when writing leaf nodes and when reading during long range scans. 
  • Point queries do single-page reads, long range scans do multi-page block reads.
  • Leaf nodes are write once.
  • Non-leaf nodes are also write once and might use multi-page blocks.

The paper uses a rolling merge to move data from C0 to C1. RocksDB calls this compaction. A rolling merge is a sequence of merge steps. Each merge step reads some data from C0 and C1, merges it and then writes it out into a new multi-page block in C1. The rolling merge is in progress all of the time. The rolling merge is similar to compaction for levels L1 to Lmax in RocksDB, but the implementation is different.
  • Compaction in RocksDB is some-to-some and atomic. By some-to-some I mean that RocksDB chooses 1 SST from level N to compact into the ~10 overlapping SSTs in level N+1. By atomic I mean that the output from a compaction step isn't visible until all of the compaction input has been consumed (and this used to be a big problem for universal/tiered compaction in RocksDB with large SSTs). 
  • The rolling merge is neither some-to-some nor atomic. I will call it all-to-all and incremental. By all-to-all I mean that a rolling merge limited to data from a range in each component (perhaps a merge step is). By incremental I mean that the merge output is made visible as soon as possible, alas that is complicated.
  • During a merge step component N is the inner component and component N+1 is the outer component.
  • The merge step input from components N and N+1 are called the emptying nodes. And the output written into component N+1 are called the filling nodes.
  • As soon as possible, the outcome from a merge step is made visible. This part is complicated. I assume this means that the B-Tree for component N is changed to remove that key range and the B-Tree for component N+1 is changed to reference the filling nodes in place of the emptying nodes. The non-leaf nodes of the B-Trees are also written sequentially, just like the leaf nodes. But I wasn't clear on some of the details on when these updates can be done.

A merge step consumes a range of data from the inner component, a range of data from the outer component and replaces that with new data in the outer component. While the paper has a section on concurrency (supporting reads concurrent with the rolling merge, supporting rolling merge in trees with more than 2 components) I don't think all of the details were provided. Some of the details are:
  • for now a node is a b-tree leaf node
  • nodes are locked in read mode for queries
  • the nodes from the outer component, and maybe the inner component, are locked in write mode when data from the node(s) is being merged. I didn't understand all of the details about when the node(s) could be unlocked.

When a merge step fills a multi-page block (for now assume 64 leaf nodes) then that block will be written to the database file. It doesn't overwrite live multi-page blocks. Eventually the B-Tree will be updated to point to the new leaf nodes in this multi-page block. Eventually the space from the multi-page blocks for the emptying nodes can be reclaimed. This maintains one sequential write stream per component.


The paper has a nice example of the 5 Minute Rule to estimate what data should be in memory based on the cost of RAM and IOPs. Assuming RAM is $10/GB and IOPs are $200 / 10k then one IO/second costs 2 cents and 4kb of RAM costs 0.0038 cents and ~500 4kb pages of RAM has the same cost as 1 IO/second, or a 4kb page of RAM costs the same as ~1/500 IO/second. When that is true, data that is referenced more frequently than once per 500 seconds should be in RAM.

Next is the performance model. I renamed a few of the variables. The variables include:
  • IO-S - cost of a single page IO
  • IO-M - cost of a page IO when done as part of a multi-page block IO
  • Se - size of index entry in bytes
  • Sp - size of page in bytes
  • S0 - size in MB of C0
  • S1 - size in MB of C1
  • De is number of B-Tree levels not in cache

The IO cost of a B-Tree insert is: IO-S * (De + 1). There are De IO reads to get the leaf node and then 1 IO to write back the dirty leaf node. This is the worst-case for a B-Tree. Each of the IO operations are single-page with cost IO-S.

M is used for the LSM insert cost. M is the average number of C0 entries merged into a C1 page. One formula for it is (Sp/Se) * (S0 / (S0 + S1)) because:
  • Sp/Se is the number of entries per page
  • S0 / (S0 + S1) is fraction of database entries in C0

Then the IO cost of an LSM insert with a two component LSM Tree is: 2 * IO-M / M.
  • 2 because a component is re-read when it is re-written
  • IO-M because the IO operations are multi-page block reads and writes
  • Divided by M because the cost (2 * IO-M) is amortized over M entries from C0

The ratio of the costs determine when LSM inserts are cheaper:
    Cost(LSM) / Cost(B-Tree) = K1 * (IO-M / IO-S) * (1/M)
  • K1 is 2 / (De + 1), which =1 when De is 1 and =2/3 when De is 2
  • IO-M / IO-S is the speedup from multi-page block IO, which is >> 10 for disks

Assuming disks are used, IO-M/IO-S is 64 and K1 = 1 then 1/M must be >= 1/64 for LSM inserts to be cheaper.

Monday, May 14, 2018

Geek code for database algorithms

I like to read academic papers on database systems but I usually don't have time to do more than browse. If only there were a geek code for this. Part of the geek code would explain the performance vs efficiency tradeoff. While it helps to know that something new is faster, I want to know the cost of faster. Does it require more storage (tiered vs leveled compaction)? Does it hurt SSD endurance (update-in-place vs write-optimized)? Read, write, space and cache amplification are a framework for explaining the tradeoffs.

The next part of the geek code is to group algorithms into one of page-based, LSM, index+log or something else. I suspect that few will go into the something else group. These groups can be used for both tree-based and hash-based algorithms, so I am redefining LSM to mean log structured merge rather than log structured merge tree.

Saturday, May 5, 2018

Where to ask questions about MyRocks and RocksDB


Best places to discuss MyRocks:
Other places to discuss MyRocks:
  • MyRocks group for FB MySQL - this existed before MyRocks made it into Percona and MariaDB. You are better off using the MariaDB and Percona groups.
  • Bugs for MyRocks in FB MySQL are here. Again, if using Percona or MariaDB please use their forums for bugs.


Sunday, April 22, 2018

MyRocks, malloc and fragmentation -- a strong case for jemalloc

While trying to reproduce a MyRocks performance problem I ran a test using a 4gb block cache and tried both jemalloc and glibc malloc. The test server uses Ubuntu 16.04 which has glibc 2.23 today. The table below lists the VSZ and RSS values for the mysqld process after a test table has been loaded. RSS with glibc malloc is 2.6x larger than with jemalloc. MyRocks and RocksDB are much harder on an allocator than InnoDB and this test shows the value of jemalloc.

VSZ(gb) RSS(gb) malloc  
 7.9     4.8    jemalloc-3.6.0
13.6    12.4    glibc-2.23

I am not sure that it is possible to use a large RocksDB block cache with glibc malloc, where large means that it gets about 80% of RAM.

I previously shared results for MySQL and for MongoDB. There have been improvements over the past few years to make glibc malloc perform better on many-core servers. I don't know whether that work also made it better at avoiding fragmentation.

Friday, April 20, 2018

Fun with caching_sha2_password in MySQL 8.0.11

I want to get benchmark numbers with MySQL 8.0.11. This is my first impression. The default auth method was changed to caching_sha2_password. See this post for more details. There will be some confusion with this change. By confusion I mean the difference between "error" and "OK because cached" below. I am not alone. See the experience that an expert had with replication.

Fun with caching_sha2_password occurs even with clients compiled as part of 8.0.11:

  1. install MySQL 8.0.11, disable SSL but use mostly default my.cnf
  2. bin/mysql -u... -p... -h127.0.0.1 -e ... -> error
  3. bin/mysql -u... -p... -e ... -> OK
  4. bin/mysql -u... -p... -h127.0.0.1 -e ... -> OK because cached

The error in step 2 is: ERROR 2061 (HY000): Authentication plugin 'caching_sha2_password' reported error: Authentication requires secure connection.

From show global variables I see the default is caching_sha2_password:

default_authentication_plugin   caching_sha2_password
Setting this in my.cnf after I created the user doesn't fix the problem. Setting this before creating the user is one fix. I did not test whether changing the value of user.plugin to "mysql_native_password" is another workaround.
The error when using an old mysql client will also be a source of confusion:
$ ~/b/orig5635/bin/mysql -u... -p.. -h127.0.0.1
ERROR 2059 (HY000): Authentication plugin 'caching_sha2_password' cannot be loaded: /home/mdcallag/b/orig5635/lib/plugin/ cannot open shared object file: No such file or directory

Thursday, April 5, 2018

Index Structures, Access Methods, whatever

I prefer to use Index Structures when writing about algorithms for indexing data stored on disk and SSD but Access Methods is another popular name. Recently I have been working on a high-level comparison (lots of hand waving) of index structures in terms of read, write, space and cache amplification.

I started by dividing the index structures into tree-based and hash-based. Tree-based support range queries while hash-based do not. Most of my experience is with tree-based approaches. There was a lot of research on hash-based in the 80s (extendible, dynamic and linear hashing), they are in production today but not as prominent as b-trees, and there is recent research on them. This paper by Seltzer and Yigit is a great overview on hash-based index structures.

The next classification is by algorithm - page-based, index+log and LSM. I use this for both tree-based and hash-based so LSM means Log Structured Merge rather than Log Structured Merge Tree. Most DBMS implement one or two of these. Tarantool has more algorithmic diversity, but I don't have much experience with it.

For tree-based:
  • page-based - I used to call this update-in-place but think that page-based is a better name because it includes update-in-place (UiP) and copy-on-write-random (CoW-R) b-trees. Great implementations include InnoDB for UiP and WiredTiger for CoW-R.
  • LSM - leveled and tiered compaction are examples (RocksDB, LevelDB, Cassandra). Note that the original LSM design by O'Neil et al didn't have an L0 and I suspect that the L0 might be a mistake but that is for another post. I don't want to insult LevelDB authors, the L0 makes sense when you want to limit memory consumption for database embedded in client apps, I am not sure it makes sense for serious OLTP with RocksDB. O'Neil also did fundamental work on bitmap indexes. I worked on both so he made my career possible (thank you).
  • index+log - use a tree-based index with data in log segments. The index points into the log segments. GC is required. It scans the log segments and probes the index to determine whether a record is live or can be deleted. There are fewer examples of this approach but interesting systems include WiscKey and ForestDB. Range queries will suffer unless the index is covering (this needs more discussion, but not here).

For hash-based:
  • page-based - see dynamic, extendible and linear hashing. TIL that ZFS implements extendible hashing. BerkeleyDB supports linear hashing. The dbm implementation on many Linux/Unix systems also implements some variant of update-in-place persistent hash tables.
  • LSM - today I only know of one example of this - SILT. That is a great paper (read it). I include it as an example of hash-based even though one of the levels is ordered.
  • index+log - BitCask is one example but the index wasn't durable and it took a long time (scan all log segments) to rebuild it on process start. Faster might be another example, but I am still reading the paper. I hope someone names a future system Efficienter or Greener.

Finally, I have a few comments on performance. I will be brief, maybe there will be another post with a lot more detail on my opinions:
  • b-tree - provides better read efficiency at the cost of write efficiency. The worst case for write efficiency is writing back a page for every modified row. Cache efficiency is better for a clustered index than a non-clustered index -- for a clustered index you should cache at least one key/pointer per leaf block but for a non-clustered index you need the entire index in RAM or there will be extra storage reads. Space efficiency for a b-tree is good (not great, not bad) -- the problem is fragmentation.
  • LSM - provides better write efficiency at the cost of read efficiency. Leveled compaction provides amazing space efficiency. Tiered compaction gets better write efficiency at the cost of space efficiency. Compaction does many key comparisons and this should be considered as part of the CPU overhead for an insert (maybe 4X more comparisons/insert than a b-tree).
  • index+log - provides better write efficiency. Depending on the choice of index structure this doesn't sacrifice read efficiency like an LSM. But the entire index must remain in RAM (just like a non-clustered b-tree) or GC will fall behind and/or do too many storage reads. GC does many index probes and the cost of this is greatly reduced by using a hash-based solution. These comparisons should be considered as part of the CPU overhead of an insert. There is also a write vs space efficiency tradeoff. By increasing the amount of dead data that can be tolerated in log segments then GC is less frequent, write-amplification is improved but space-amplification suffers. There are variants that don't need GC, but they are not general purpose.

Wednesday, March 28, 2018

Missing documentation

In my time with Linux there are some things that would benefit from better documentation.

  1. The use of per-inode mutexes by some members of the ext family for files opened with O_DIRECT
  2. PTHREAD_MUTEX_ADAPTIVE_NP - this enables some busy-waiting when trying to lock a mutex. It isn't mentioned in man pages.