tag:blogger.com,1999:blog-91495239278647510872019-06-18T06:56:57.560-07:00Small DatumMark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.comBlogger276125tag:blogger.com,1999:blog-9149523927864751087.post-36143201028367398152019-06-13T15:23:00.000-07:002019-06-13T18:28:53.676-07:00Interesting indexing in Rockset and MongoDBI was lazy today and <a href="https://twitter.com/markcallaghan/status/1138844420437135361">asked about</a> new indexing features in <a href="https://rockset.com/blog/converged-indexing-the-secret-sauce-behind-rocksets-fast-queries/">Rockset</a> and <a href="https://docs.mongodb.com/master/core/index-wildcard/">MongoDB</a>. They share a valuable goal which is better indexing for the document data model (think less schema, not schema-less). How do you index documents when you don't know all of the attributes that will be used? MongoDB now supports this via a wildcard index and Rockset via converged indexing.<br /><br />Wildcard indexing in MongoDB lets you specify that an index should be maintained on all, most or some attributes in a document. By most I mean there are options to exclude attributes from a wildcard index. By some I mean there are options to limit this to attributes that start with certain prefixes. <a href="https://docs.mongodb.com/master/core/index-wildcard/">Read the docs</a> for more.<br /><br />Converged indexing in Rockset indexes all attributes. There are no options to include or exclude attributes. This makes the product easier to use at the cost of more IO for index maintenance and more storage for larger indexes. Note that Rockset uses the <a href="https://rocksdb.org/">RocksDB LSM</a> which reduces the cost of index maintenance and might also use the excellent <a href="https://facebook.github.io/zstd/">ZStandard</a> compression.<br /><br />Wildcard and converged indexes do not support compound indexes. For the document { a:1, b:2 } there will be two index entries: a=1 and b=2. 
There is no way to get an index entry for (a=1, b=2) or (b=2, a=1). If you want a <a href="https://docs.mongodb.com/master/core/index-compound/">compound index</a> with MongoDB the existing index features can be used. See below (editorial 1) for compound indexes and Rockset.<br /><br /><b>Implementation details</b><br /><br />This section is an educated guess. I don't know enough MongoDB and Rockset internals to claim this with certainty. I ignore the complexity of support for datatypes. In the ideal world all values can be compared via memcmp.<br /><br />For a traditional index limited to a specified attribute the index entries are of the form (value, pointer) where pointer points to the row and can be the primary key value or (file name, file offset).<br /><br />This is more interesting for wildcard/converged indexes. I assume that the attribute name is the leading field in each index entry so that the entry is of the form (attribute name, value, pointer). The common way to use such an index is to have an equality predicate on attribute name which is satisfied when the index is queried with predicates like <i>attributeName relOp value</i>. Examples of such predicates are a=2, a>2 and a<=2.<br /><br />A smart person (Dr Pavlo) <a href="https://twitter.com/andy_pavlo/status/1138849744309166081">mentioned the use</a> of skip scan for these indexes. That could be used to query the index and find documents with any attribute equal to a specific value. That is a less likely use case but still interesting.<br /><br />Wildcard/converged indexes aren't free. Putting the attribute name in every index entry makes index entries larger and consume more space in memory and on storage. Block compression reduces some of this overhead. 
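To make the (attribute name, value, pointer) entry format concrete, here is a toy sketch. The helper names are mine, not MongoDB's or Rockset's, and datatype handling is ignored as above:

```python
# Toy sketch of wildcard/converged index maintenance: one (attribute, value, pk)
# entry per attribute per document. Illustrative only, not a real engine's format.

def index_entries(doc, pk_field="pk"):
    """Yield (attribute, value, pk) index entries for a document."""
    pk = doc[pk_field]
    for attr, value in doc.items():
        if attr != pk_field:
            yield (attr, value, pk)

docs = [{"pk": 1, "a": 7, "b": 3}, {"pk": 2, "a": 7}]
index = sorted(e for d in docs for e in index_entries(d))

# The common access pattern: equality predicate on the attribute name,
# then a predicate on the value (here a=7).
matches = [pk for attr, value, pk in index if attr == "a" and value == 7]
```

A skip scan over such an index would instead iterate over the distinct attribute names and probe each one for value=7, which is how the any-attribute-equals query mentioned above could be served.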
Index prefix compression in <a href="https://scalegrid.io/blog/index-prefix-compression-in-mongodb-3-0-wiredtiger/">WiredTiger</a> and <a href="https://forum.cockroachlabs.com/t/rocksdb-prefix-compression/1321">RocksDB</a> also helps but at the cost of more CPU overhead.<br /><br /><b>Storage differences</b><br /><br />Up to now I have been describing the search index. In this section I will describe the document storage.<br /><br />MongoDB stores the document via the storage engine which will soon be WiredTiger only although I hope MongoRocks returns. I assume that WiredTiger with MongoDB is row-wise so that each document is (usually) a contiguous sequence of bytes on some disk page.<br /><br />Rockset stores each document twice -- row-wise and column-wise. Alas, this gets complicated. The row-wise format is not the traditional approach with one thing in the storage engine per document. Instead there is one thing per attribute per document. This is similar to the <a href="https://www.cockroachlabs.com/blog/sql-in-cockroachdb-mapping-table-data-to-key-value-storage/">CockroachDB approach</a>. I prefer to still call this row-wise given that attributes from a document will be co-located in the LSM SSTs. I am also grateful for the many great blog posts from CockroachDB that I can reference.<br /><br />With two copies of each document in the base storage there is more storage overhead. Fortunately that overhead is reduced courtesy of the write efficiency and compression friendliness of an LSM.<br /><br />The <a href="https://rockset.com/blog/converged-indexing-the-secret-sauce-behind-rocksets-fast-queries/">Rockset blog post</a> does a great job of explaining this with pictures. I do a worse job here without pictures. For the document { pk:1, a:7, b:3 } when the primary key is pk then the keys for row-wise are R.1.a and R.1.b and for column-wise are C.a.1 and C.b.1. The row-wise format clusters all attributes for a given document. 
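Both layouts for { pk:1, a:7, b:3 } can be sketched as key generation, using the R/C dotted keys from the example above (the encoding here is illustrative, not Rockset's actual format):

```python
# Sketch of dual row-wise + column-wise storage: each attribute is stored twice,
# under a row-wise key (R.pk.attr) and a column-wise key (C.attr.pk).

def encode(doc, pk_field="pk"):
    pk = doc[pk_field]
    kv = {}
    for attr, value in doc.items():
        if attr == pk_field:
            continue
        kv[f"R.{pk}.{attr}"] = value  # clusters by document
        kv[f"C.{attr}.{pk}"] = value  # clusters by attribute
    return kv

kv = encode({"pk": 1, "a": 7, "b": 3})
```

Sorted by key, the R.1.* entries sit together and the C.a.* entries sit together, which is what makes each format efficient for its access pattern.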
The column-wise format clusters all values across documents for a given attribute. The row-wise format is efficient when most attributes for a document must be retrieved. The column-wise format is efficient for analytics when a given attribute across all documents must be retrieved.<br /><br /><div><b>Editorial 1</b></div><div><br /></div><div>I interpret <a href="https://docs.mongodb.com/master/core/index-wildcard/">the MongoDB docs</a> to mean that when a query uses a wildcard index it cannot use any other index and the wildcard index will only be used for a predicate on a single attribute. I expect that to limit the utility of wildcard indexes. I also expect MongoDB to fix that given how fast they reduce their tech debt. The limitations are listed below. The 1st won't be fixed. The 2nd and 3rd can be fixed.</div><div><ol><li>Compound wildcard indexes are not supported</li><li style="box-sizing: border-box; padding-top: 0.2em;">MongoDB cannot use a non-wildcard index to satisfy one part of a query predicate and a wildcard index to satisfy another.</li><li style="box-sizing: border-box; padding-top: 0.2em;">MongoDB cannot use one wildcard index to satisfy one part of a query predicate and another wildcard index to satisfy another.</li></ol>I assume that Rockset can combine indexes during query evaluation given their focus on analytics. Thanks to the Rockset team I learned it supports <a href="https://twitter.com/igorcanadi/status/1138885301131206656">index intersection</a>. It also <a href="https://twitter.com/dhruba_rocks/status/1139278376085200899">supports composite indexes</a> via field mappings (functional indexes).</div><div><br /></div><div><b>Editorial 2</b></div><div><br />An open question is whether an LSM can do clever things to support analytics. There has been some work to show the compression benefit from using column-wise storage within a RocksDB SST for the larger levels of the RocksDB LSM. Alas, the key RocksDB workloads have been OLTP. 
With the arrival of Rockset there is more reason to reconsider this work. There can be benefits in the compression ratio and reduced overhead during query processing. Vertica showed that it was useful to combine a write-optimized store for recent writes with a read-optimized store for older writes. An LSM already structures levels by write recency. Perhaps it is time to make the larger levels read-optimized, especially when column-wise data is inserted into the LSM.<br /><br /><b>Update</b> - I read the paper years ago and then forgot that <a href="https://kudu.apache.org/kudu.pdf">Kudu</a> combines LSM + columnar.<br /><br /><b>Editorial 3</b></div><div><b><br /></b>The previous section is mostly about being clever when storing column-wise data in an LSM to get better compression and use less CPU during query evaluation. This section is about being clever when storing the search index. </div><div><br /></div><div>The search index is likely to have many entries for some values of a given attribute. Can an LSM be enhanced to take advantage of that for analytics workloads? In other storage engines there are two approaches -- bitmap indexes and RID-lists. Adapting these for an LSM is non-trivial but not impossible. It is likely that such an adaptation would only be done for the larger levels of the LSM tree.</div><div><br /></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com2tag:blogger.com,1999:blog-9149523927864751087.post-21694295227616694882019-05-17T15:08:00.000-07:002019-05-17T15:08:48.573-07:00index+log: implementationsMy take on index+log systems like WiscKey is that they are neither better nor worse than an LSM - it all depends on your workload. But I am certain that we know much more about an LSM than about the index+log systems. 
Hopefully that changes over time as some of them are thriving.<br /><br />The first <a href="http://smalldatum.blogspot.com/2018/07/indexlog-alternative-to-lsm.html">index+log</a> system that I read about was <a href="https://en.wikipedia.org/wiki/Berkeley_DB">Berkeley DB Java Edition</a>. The design paper is worth reading. Since then there have been a few more implementations and papers that I describe here. This list is probably incomplete: Bitcask, ForestDB, WiscKey, HashKV, TitanDB and RocksDB BlobDB.<br /><br />At this point the systems that are getting into production, TitanDB and BadgerDB, use an LSM for the index. I wonder if an index structure that supports update-in-place would be better, especially <a href="http://smalldatum.blogspot.com/2019/05/indexlog-open-issues.html">when the index must be cached</a>, because I expect the CPU read-amp for an LSM to be <a href="http://smalldatum.blogspot.com/2018/07/query-cpu-overheads-in-rocksdb.html">about 2X larger</a> than for a b-tree, and update-in-place makes it easier to relocate values during GC.<br /><br />While I like index+log systems I think that papers and marketing tend to overstate LSM write-amp. For production RocksDB I usually see write-amp between 10 and 20. I expect that index+log could achieve something closer to 5. This <a href="https://hyeontaek.com/papers/msls-fast2016.pdf">paper from CMU</a> explains one reason why per-level write-amp in an LSM is less than the per-level fanout (less than 10). Write skew is another reason.<br /><br /><b>The systems</b><br /><br /><a href="https://en.wikipedia.org/wiki/Bitcask">Bitcask</a> was part of the Riak effort.<br /><ul><li>The index is an in-memory hash table. The index isn't durable and the entire log has to be scanned to rebuild it on startup -- whether the previous shutdown was clean or a crash. The slow startup is a problem.</li><li>The value log is circular and GC copies live values from the tail to the head of the log. 
Liveness is determined by an index search. </li></ul><br />ForestDB was a finalist in the SIGMOD 2011 <a href="http://db.csail.mit.edu/sigmod11contest/">student programming contest</a>. Eventually the project and creator moved to Couchbase. It is worth reading about <a href="http://db.csail.mit.edu/sigmod11contest/sigmod_2011_contest_poster_jungsang_ahn.pdf">here</a> and on the <a href="https://github.com/couchbase/forestdb/wiki">github page</a>. I published blog posts that compare ForestDB and RocksDB: <a href="http://smalldatum.blogspot.com/2015/06/rocksdb-forestdb-via-forestdb-benchmark_9.html">1</a>, <a href="http://smalldatum.blogspot.com/2015/06/rocksdb-forestdb-via-forestdb-benchmark_8.html">2</a>, <a href="http://smalldatum.blogspot.com/2015/06/rocksdb-forestdb-via-forestdb-benchmark.html">3</a> and <a href="http://smalldatum.blogspot.com/2015/06/rocksdb-forestdb-via-forestdb-benchmark_73.html">4</a>. Google <a href="https://www.google.com/search?q=forestdb">finds more</a> interesting things to read.<br /><ul><li>The index is a space-efficient trie.</li><li>The value log might have log segments. GC copies live values to the head of the log. Liveness is determined by an index search.</li></ul><br />WiscKey is described as an LSM with key-value separation and made popular the term <i>key-value separation</i>. I put it in the index+log family of <a href="http://smalldatum.blogspot.com/2018/04/index-structures-access-methods-whatever.html">index structures</a>.<br /><ul><li>The index is an LSM. There is no redo log for the index as it can be recovered from the head of the value log.</li><li>Kudos for many references to <a href="http://smalldatum.blogspot.com/2019/05/crum-conjecture-read-write-space-and.html">amplification factors</a>. The paper uses bytes read for read-amp. 
I prefer to consider both IO and CPU for read-amp with key comparisons for CPU and storage reads for IO.</li><li>It doesn't mention that it has more <a href="http://smalldatum.blogspot.com/2019/05/crum-conjecture-read-write-space-and.html">cache-amp</a> than an LSM, but few papers mention that problem. Shrinking the LSM by keeping large values separate doesn't make the index parts of it (filter and index blocks) easier to cache as they are already separate from the data blocks. There is more to cache with index+log as I <a href="http://smalldatum.blogspot.com/2019/05/indexlog-open-issues.html">describe here</a>.</li><li>It claims to overcome the (worst-case) problem of one storage IO per KV pair on a range scan by fetching in parallel. Assuming the storage device has enough spare IO this might hide the problem but it doesn't fix it. With many workloads there isn't spare IO and extra IO for reads also means extra CPU for decompression.</li><li>The value log is circular and single-threaded GC copies live values to the head of the log. Liveness is determined by an index search. I assume that multi-threaded GC is feasible.</li><li>The paper isn't clear about the total write-amp that might occur from both the first write to the value log and GC that follows.</li><li>Compression isn't explained.</li></ul><br /><a href="https://github.com/dgraph-io/badger">BadgerDB</a> is a golang implementation, and much more, of the WiscKey paper.<br /><ul><li>It has many features and many production use cases. This is impressive. </li><li>GC is <a href="https://github.com/dgraph-io/badger/blob/master/db.go">scheduled by the user</a>. Based on <a href="https://github.com/dgraph-io/badger">Options.NumCompactors</a> I assume it can be multi-threaded.</li><li>The <a href="https://godoc.org/github.com/dgraph-io/badger">docs state</a> that the LSM can be served from RAM because the values are elsewhere. That is true but I don't consider it a feature. 
It must be in RAM to avoid IO from liveness queries done by GC. An LSM isn't a monolithic thing. There are index blocks, data blocks and filter blocks, and most of the LSM (the data blocks from the max level) doesn't have to be cached. </li><li>There is extra work on reads to find values that have been moved by GC. See the comments about BadgerDB <a href="https://github.com/petermattis/pebble/issues/112">here</a>.</li></ul><br />HashKV is an interesting paper that avoids index queries during GC.<br /><ul><li>Hash-based data grouping distributes KV pairs by hash into one of N logs. GC is probably done by scanning a log twice -- once to get the keys and a second time to relocate the live values. A value is live when the most recent key is not a tombstone. A value might be live when needed for a snapshot. GC doesn't do index searches so the index doesn't have to be cached to make GC efficient, but you might want to cache it to avoid doing index IO on queries -- and this index is still much larger than the block index for an LSM.</li><li>Hotness awareness copies cold values to a cold segment to avoid repeated GC for a value that doesn't get updated or deleted. A header for the value is kept in the non-cold log.</li><li>Small values are stored inline in the LSM.</li><li>I am curious whether more log groups mean more write-amp. See my comment about fsync in a <a href="http://smalldatum.blogspot.com/2019/05/indexlog-v2.html">previous post</a>.</li><li>I am curious whether managing the hash buckets is easy. The goal is to make sure that keys for a segment group fit in memory. The range and number of buckets must change over time. 
Does this have anything in common with <a href="https://en.wikipedia.org/wiki/Linear_hashing">linear</a> and <a href="https://en.wikipedia.org/wiki/Extendible_hashing">extendible</a> hashing?</li></ul><br /><a href="https://pingcap.com/blog/titan-storage-engine-design-and-implementation/">TitanDB</a> is part of TiDB and <a href="https://github.com/pingcap/tidb">TiDB</a> is thriving.<br /><ul><li>A WAL is used for new writes. This might make it easier to compress data on the first write to the value logs.</li><li>GC appears to do index searches to determine liveness.</li><li>Compression is per-record. I hope this becomes per-block in the future.</li><li>It might let the user tune between space-amp and write-amp via discardable_ratio.</li><li>This is compatible with most of the RocksDB API.</li></ul><br />RocksDB <a href="https://github.com/facebook/rocksdb/wiki/Blob-DB">BlobDB</a> is an extension to RocksDB that uses log segments for large values and stores small values in the LSM. GC copies live values and liveness is determined by an index search.<br /><br /><b>Future work</b><br /><br />Future work for index+log systems includes:<br /><ul><li>Determine whether a b-tree is better than an LSM for the index structure.</li><li>Determine whether the HashKV solution is the best way to avoid liveness queries during GC.</li><li>If an LSM is used for the index structure, determine efficient ways to support relocating values during GC without <a href="http://smalldatum.blogspot.com/2019/05/indexlog-open-issues.html">extra overheads and complexity</a> during read processing.</li><li>Determine whether clever things can be done during GC.</li><ul><li>Block compression is easier than on first write.</li><li>Hot/cold value separation is possible (see HashKV). 
This is an example of <a href="https://en.wikipedia.org/wiki/Garbage_collection_(computer_science)">generational GC</a> even if we rarely mention that for index structures.</li><li>Values in a log segment can be ordered by key rather than by time of write during GC. GC can also merge multiple ordered segments to create longer sorted runs. I wonder if it is then possible to use block indexes (key+pointer per block rather than per row) to reduce cache-amp for such log segments.</li></ul></ul><div></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com3tag:blogger.com,1999:blog-9149523927864751087.post-31481061859147701432019-05-16T13:47:00.000-07:002019-05-16T13:47:09.282-07:00index+log: open issuesThis post is about open issues for the index+log family of <a href="http://smalldatum.blogspot.com/2018/04/index-structures-access-methods-whatever.html">index structures</a>. See past posts <a href="http://smalldatum.blogspot.com/2018/07/indexlog-alternative-to-lsm.html">here</a> and <a href="http://smalldatum.blogspot.com/2019/05/indexlog-v2.html">here</a> for more details on index+log. The success of LSM has led to several solutions that use an LSM for the index. I am curious if that is a good choice when implementing index+log from scratch. 
It is a great choice when starting with an LSM and then adding index+log.<br /><br /><b>Open Issues</b><br /><br />The open issues for index+log include:<br /><ul><li>Is GC multi-threaded?</li><li>Does value log GC require index queries?</li><li>Must the index be cached?</li><li>Is block compression possible?</li><li>With an LSM as the index how is relocating an entry in the value log supported?</li></ul><i><br /></i><i>Is GC multi-threaded?</i><br /><div><ol><li>I hope so.</li><li>It hasn't been in some of the index+log papers/systems.</li></ol><div><i>Does value log GC require index queries?</i></div><div><ol><li>Yes, in most of the index+log papers/systems this is required to determine whether a value is live when scanning the value log during GC</li><li><a href="https://www.usenix.org/system/files/conference/atc18/atc18-chan.pdf">HashKV</a> proposed a solution that doesn't require such queries. It is a wonderful paper but I think there are more things to figure out. The general idea is to hash KV pairs into buckets such that all keys for a bucket will fit in memory during GC. GC then reads the value log segments for a bucket twice -- first to get the keys, second to copy the live KV pairs into a new log segment. I wonder if managing the hash bucket ranges has things in common with <a href="https://en.wikipedia.org/wiki/Linear_hashing">linear</a> and <a href="https://en.wikipedia.org/wiki/Extendible_hashing">extendible</a> hashing.</li></ol></div><div><i>Must the index be cached?</i></div><div><ol><li>The index for index+log might be 10X larger than for an LSM because an LSM uses a block index while index+log uses a row index. A block index has a key+pointer per block and a row index has that per row. An LSM only needs a block index because rows are clustered by key.</li><li>To do at most one IO per point query for a database larger than memory the index must be cached for index+log to avoid index IO as the IO is spent reading the value log. 
If the <= 1 IO / query constraint is met then cache-amp is larger for index+log compared to an LSM because the index+log index is larger (see #1).</li><li>If value log GC queries the index to determine whether each value is live then this query shouldn't do IO or GC will be too slow and inefficient. This is another reason for caching the index+log index.</li><li>If the index must be cached then maybe an LSM isn't the best choice. Consider an index structure optimized for a main-memory DBMS.</li></ol></div><div><i>Is block compression possible?</i></div><div><ol><li>I hope so but this hasn't been explained in some of the index+log papers/systems.</li><li>Per-record compression can be done instead of block compression. That will have a lower compression rate but less CPU overhead when decompressing on read.</li><li>It might be hard to do block compression the first time a KV pair is written to a value log. One option is to write to a redo log until enough data is available to do block compression and then write to the value log. Another option is to defer block compression until KV pairs are rewritten during GC.</li></ol></div><div><i>When the index is an LSM what happens when values are moved?</i></div><div><ol><li>This is an issue that I learned about <a href="https://twitter.com/markcallaghan/status/1129064392413319169">via CockroachDB</a>. I should have figured it out long ago but mistakes happen.</li><li>The LSM tree is read in order -- top to bottom with leveled compaction and left to right with tiered compaction. This guarantees that the correct result is returned with respect to visibility. If the first entry for a key is a tombstone then the search can stop (ignoring snapshot reads).</li><li>Value log GC moves live KV pairs to new log segments. 
To find the KV pair after the move either the index entry must be updated or the index entry must have a logical value log key, in which case another index is needed to map the logical value log key to a physical value log location (filename, offset). </li><li>Updating the LSM index entry to reference the new value log location (filename, offset) can be done by inserting a new KV pair into the LSM but that either breaks read consistency semantics or complicates read processing. It would break LSM read processing because inserting back into the LSM implies this is a new write, but it is just a move for GC. An index structure that supports update-in-place, unlike an LSM, makes this easier.</li><li>Details on using a logical value log key are explained in a <a href="https://github.com/petermattis/pebble/issues/112">CockroachDB github issue</a>.</li></ol><br /><div></div><br /><div><br /></div></div></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-38711014485038199752019-05-15T15:30:00.000-07:002019-05-15T16:04:49.949-07:00Index+log, v2I put most <a href="http://smalldatum.blogspot.com/2018/04/index-structures-access-methods-whatever.html">index structures</a> into one of three categories -- page-based, LSM or index+log. My focus is on databases larger than memory and I might be ignoring categories used for main memory DBMS. Over the past decade <a href="http://smalldatum.blogspot.com/2018/07/indexlog-alternative-to-lsm.html">index+log</a> has gotten more attention and this is my second attempt at explaining it.<br /><br />My definition of index+log is simple -- data is appended to a log on each write, index entries point into the log and GC scans the log to copy live values and discard others. The log can be one file or many log segments. GC might have to search the index to determine whether a value is live. 
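That definition can be sketched in a few lines. This is an in-memory toy that assumes a hash-table index (as in Bitcask), not any real implementation:

```python
# Toy index+log store: writes append to a log, the index maps key -> log offset,
# and GC copies live records into a new log. Illustrative only.

class IndexLog:
    def __init__(self):
        self.log = []     # append-only list of (key, value) records
        self.index = {}   # key -> offset of the latest record for that key

    def put(self, key, value):
        self.index[key] = len(self.log)
        self.log.append((key, value))

    def get(self, key):
        return self.log[self.index[key]][1]

    def gc(self):
        new_log = []
        for offset, (key, value) in enumerate(self.log):
            if self.index.get(key) == offset:  # liveness check = index search
                self.index[key] = len(new_log)
                new_log.append((key, value))
        self.log = new_log

s = IndexLog()
s.put("a", 1); s.put("b", 2); s.put("a", 3)  # "a" is rewritten
s.gc()                                        # drops the dead ("a", 1) record
```

Note that each log record here carries the key, which is what lets the GC scan probe the index for liveness.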
The value written to the log might include the key.<br /><br /><a href="https://en.wikipedia.org/wiki/Bitcask">Bitcask</a> was the first index+log system that I noticed but I assume there are earlier examples and just found one -- <a href="https://www.oracle.com/technetwork/database/database-technologies/berkeleydb/transactional-data-manager-129520.pdf">Berkeley DB Java Edition</a>. While WiscKey made popular the term <i><b>key-value separation</b></i> and is presented as an LSM variant, I put it in the index+log category. Other interesting index+log systems include RocksDB <a href="https://github.com/facebook/rocksdb/wiki/Blob-DB">BlobDB</a>, <a href="https://pingcap.com/blog/titan-storage-engine-design-and-implementation/">TitanDB</a>, <a href="https://github.com/couchbase/forestdb">ForestDB</a> and <a href="https://www.microsoft.com/en-us/research/uploads/prod/2018/03/faster-sigmod18.pdf">Faster</a>. While many of the tree-based solutions use an LSM for the index that is not required by index+log and an LSM is not used by Berkeley DB Java Edition or ForestDB.<br /><br />For an index+log solution that must cache all of the index and query the index during GC I suspect that an LSM is not the best choice for the index. Although if you already have an LSM (see RocksDB BlobDB) then I get it.<br /><br />The summary of my LSM vs index+log comparison is:<br /><ol><li>index+log has less write-amp but more space-amp</li><li>index+log has much more cache-amp, maybe 10X more</li><li>index+log might have more deferred CPU write-amp</li><li>I have yet to validate my claims by running benchmarks with index+log implementations.</li></ol><div>Note that #1 is a feature, #2 is not a feature and for #3 it depends. The key point is that the cost of faster writes from index+log is more cache-amp (more memory) and more IO for range queries. In production with RocksDB I frequently see write-amp=15 with space-amp=1.1. 
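A quick back-of-envelope for that tradeoff, using the naive GC model developed below (every log segment is pct_full percent live when GC runs):

```python
# Naive index+log GC model:
#   space-amp = 100 / pct_full
#   write-amp = 100 / (100 - pct_full)

def amps(pct_full):
    return 100 / pct_full, 100 / (100 - pct_full)

space_90, write_90 = amps(90)  # space-amp ~1.11, write-amp 10
space_75, write_75 = amps(75)  # space-amp ~1.33, write-amp 4
```

Moving from pct_full=90 to pct_full=75 cuts write-amp from 10 to 4 at the cost of space-amp growing from ~1.1 to ~1.33.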
I assume that could be reduced with index+log to write-amp ~= 5 and space-amp ~= 1.3. It might be possible to avoid or reduce the impact of #2 and #3 in future index+log implementations.</div><br /><b>Amplification Factors</b><br /><br />I am speculating on read, write, space and cache amplification for index+log because I have little hands-on experience with index+log implementations. Another reason for speculation is that index+log allows for different structures for the index (b-tree, LSM, etc) which affects some of the estimates.<br /><br />The <a href="http://smalldatum.blogspot.com/2019/05/crum-conjecture-read-write-space-and.html">amplification factors</a> are:<br /><ul><li>cache - the <a href="http://smalldatum.blogspot.com/2018/03/cache-amplification.html">cache-amp</a> for index+log is likely to be much larger than for an LSM. To achieve at most one IO per point query index+log might need 10X (or more) memory versus an LSM. Clustering values in key order doesn't just make range queries faster, it also means an LSM only needs block indexes in memory (key+pointer per block) while index+log needs a key+pointer in memory for every KV pair in the database. When there are 10 KV pairs per block then this is 10X more memory. 
Even when the database has hot and cold keys they are likely to be interleaved on the same index leaf pages -- so all of those leaf pages must be in memory to avoid doing two IOs (one for the leaf page, one for the value) on a point query.</li><ul><li>There is additional data that an LSM and b-tree need in memory to satisfy the one IO per query constraint and I described that in previous posts (mostly things from the non leaf/max level).</li><li>It might be valid to debate whether my one IO per point query constraint is valid, but this blog post is already long.</li><li>Another reason for the index to be cached is to avoid doing IO during GC when index searches are done to determine whether KV pairs are live.</li></ul><li>space - I ignore space-amp for the index and focus on the log because the log is usually larger. With index+log the user can trade between write-amp and space-amp. With the variable pct_full representing the percentage of live data in the database (a value between 1 and 100) then:</li><ul><li>space-amp = 100 / pct_full</li><li>write-amp = 100 / (100 - pct_full)</li><li>Just like with an LSM, previously written KV pairs are rewritten during GC with the index+log approach. Fortunately this is done less often with index+log. </li><li>I assumed that block compression is used but that is harder to implement for index+log. The <a href="https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf">WiscKey</a> paper doesn't describe a solution and the <a href="https://www.usenix.org/system/files/conference/atc18/atc18-chan.pdf">HashKV</a> paper suggests using per-record compression, which will have a lower compression rate versus block as used by an LSM. I assume block compression can be done for index+log but it isn't trivial.</li><li>To explain the estimates when pct_full=X assume that all log segments have X% live data (yes, this is naive). 
When GC is done on a log segment, X% is copied, leaving (100-X)% free space in the newly written log segment. So in total 100% of a log segment is written for each (100 - pct_full)% of new data, which is the formula above.</li><li>Thus with pct_full=90, space-amp is 1.1 while write-amp is 10. Comparing these with a leveled LSM, the space-amp is similar while the write-amp is slightly better than what I see in production. To get a write-amp that is significantly better the cost is more space-amp. For example, with pct_full=75 then write-amp=4 and space-amp=1.33. </li></ul><li>read (CPU) - <a href="http://smalldatum.blogspot.com/2018/07/query-cpu-overheads-in-rocksdb.html">see here</a> for range seek/next. The summary is that when an LSM is used for index+log then the costs are similar to an LSM. When a b-tree is used for index+log then the costs are smaller.</li><ul></ul><li>read (IO) - <a href="http://smalldatum.blogspot.com/2018/07/query-cpu-overheads-in-rocksdb.html">see here</a> for range seek/next. In the cache-amp estimate I assume that the index is cached so the only IO to be done is for the log. Therefore the index structure (LSM vs b-tree) doesn't matter.</li><ul><li>point query - the IO read-amp is <= 1 because the log is not cached.</li><li>range seek - range seek doesn't do IO when the index is cached.</li><li>range next - this is much larger for index+log than for an LSM because it might do one IO per call to range next, because rows are not clustered by key in the log. When data is compressed there is also the CPU overhead for decompression per KV pair.</li></ul><li>write - by write I assume update (read-modify-write) rather than a blind write.</li><ul><li>immediate CPU - the cost of an index search. See the section on read CPU for point queries above.</li><li>immediate IO - the cost of an optional redo log write for the index structure and then writing the (value) log. 
Note that the minimum size of a write done by the storage device might be 4kb even if the data written is much smaller. Doing an fsync per 128 byte value might have a write-amp of 32 if that write is really forced to storage and doesn't just linger in a capacitor backed write cache.</li><li>deferred CPU - the deferred CPU write-amp is the cost of index searches done during GC, unless the <a href="https://www.usenix.org/system/files/conference/atc18/atc18-chan.pdf">HashKV</a> approach is used. With pct_full=75, write-amp=4 and space-amp=1.33 then GC is done ~4 times for each key and the deferred CPU cost is 4 index searches. When live KV pairs are copied by GC then there is also the CPU and IO overhead from updating the index entry to point to the new location in the log.</li><li>deferred IO - this is determined by the percentage of live data in the database. With pct_full=75 it is 4.</li></ul></ul>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com3tag:blogger.com,1999:blog-9149523927864751087.post-39611746467227835082019-05-09T16:42:00.000-07:002019-05-09T16:42:23.896-07:00CRUM conjecture - read, write, space and cache amplificationThe <a href="http://daslab.seas.harvard.edu/rum-conjecture/">RUM Conjecture</a> asserts that an <a href="http://smalldatum.blogspot.com/2018/04/index-structures-access-methods-whatever.html">index structure</a> can't be optimal for all of read, write and space. I will ignore whether optimal is about performance or efficiency (faster is better vs efficient-er is better). I want to use CRUM in place of RUM where C stands for database cache.<br /><br />The <a href="http://smalldatum.blogspot.com/2018/03/cache-amplification.html">C in CRUM</a> is the amount of memory per key-value pair (or row) the DBMS needs so that either a point query or the first row from a range query can be retrieved with at most X storage reads. 
The C can also be reported as the minimal database : memory ratio to achieve at most X storage reads per point query.<br /><br />My points here are:<br /><ul><li>There are 4 amplification factors - read, write, space and cache</li><li>CRUM is for comparing index structure efficiency and performance</li><li>Read and write amplification have CPU and IO parts</li><li>Write amplification has immediate and deferred parts</li></ul>Many <a href="http://smalldatum.blogspot.com/2015/11/define-better-for-small-data-dbms.html">X is faster than Y</a> papers and articles neglect to quantify the tradeoffs made in pursuit of performance. I hope that changes and we develop better methods for <a href="http://smalldatum.blogspot.com/2018/04/index-structures-access-methods-whatever.html">quantifying the tradeoffs</a> <a href="http://smalldatum.blogspot.com/2019/01/define-better.html">(a short rant</a> on defining <b><i>better)</i></b>.<br /><br /><a href="http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-pick-2_23.html">Amplification factors</a> (RUM -> CRUM) are used to compare index structures. Values for the factors are measured for real workloads and estimated for hypothetical ones. The comparison is the most valuable part. Knowing that the deferred CPU write-amp on inserts for a b-tree is 30 is not that useful. Knowing that it is 3X or 0.5X the value for an LSM is useful.<br /><br />Workload matters. For estimates of amplification I usually assume uniform distribution because this simplifies the estimate. But there is much skew in production workloads and that impact can be measured to complement the estimates.<br /><h4><br /></h4><h4>Read Amplification</h4><br /><a href="http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-pick-2_23.html">This post is an overview</a> for read-amp. <a href="http://smalldatum.blogspot.com/2018/07/query-cpu-overheads-in-rocksdb.html">This post</a> explains it in detail for an LSM. 
There are two parts to read-amp -- CPU and IO. Thus for each of the <a href="http://smalldatum.blogspot.com/2018/07/query-cpu-overheads-in-rocksdb.html">three basic operations</a> (point query, range seek, range next) there are 2 values for read-amp: CPU and IO. I have yet to consider deferred read-amp and by read-amp I mean immediate read-amp.<br /><br />Metrics for CPU read-amp include CPU time, bytes/pages read and key comparisons. I use key comparisons when predicting performance and CPU time when running tests. I have not used bytes/pages read. While key comparisons are a useful metric they ignore other CPU overheads including hash table search, bloom filter search, page read and page decompress.<br /><br />Metrics for IO read-amp include bytes read and pages read. I use pages read for disk and bytes read for SSD because disks are IOPs limited for small reads. IO read-amp implies extra CPU read-amp when the database is compressed. Decompressing pages after storage reads can use a lot of CPU with fast storage devices and even more with zlib but you <a href="https://github.com/facebook/zstd">should be using zstd</a>.<br /><br />With estimates for hypothetical workloads I assume there is a cache benefit as explained in the Cache Amplification section. This is likely to mean that comparisons assume a different amount of memory for index structures that have more or less cache-amp. For real tests I mostly run with database >> memory but don't attempt to use the least memory that satisfies the cache-amp X reads constraint.<br /><h4></h4><h4><br /></h4><h4>Write Amplification</h4><br /><a href="http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-pick-2_23.html">This post is an overview</a> for write-amp. <a href="http://smalldatum.blogspot.com/2018/07/query-cpu-overheads-in-rocksdb.html">This post</a> explains it in detail for an LSM. Write-amp has two dimensions: CPU vs IO, immediate vs deferred. 
For each operation (insert, delete, update) there are 4 values for write-amp: immediate CPU, deferred CPU, immediate IO and deferred IO.<br /><br />The immediate write-amp occurs during the write. The deferred write-amp occurs after the write completes and includes writing back dirty pages in a b-tree and compaction in an LSM.<br /><br />Possible metrics for CPU write-amp include bytes written, pages written, key comparisons and pages/bytes (de)compressed. Bytes and (in-memory) pages written are useful metrics for an in-memory DBMS but my focus is on databases >> memory.<br /><br />Possible metrics for IO write-amp include bytes written and pages written. These can be estimated for hypothetical workloads and measured for real ones. The choice between bytes or pages written might depend on whether disk or SSD is used, as disk is limited by ops/s while SSD is limited by transfer rate. If you use iostat to measure this then figure out whether Linux still counts <a href="http://smalldatum.blogspot.com/2016/04/trim-iostat-and-linux.html">bytes trimmed</a> as bytes written.<br /><br /><div>Examples of deferred and immediate write-amp:</div><ul><li>The InnoDB change buffer is deferred IO and CPU. Checking the change buffer and applying changes is deferred CPU. The deferred IO is from reading pages from storage to apply changes.</li><li>For a b-tree: page writeback is deferred IO, compression and creating the page checksum are deferred CPU, finding the in-memory copy of a page is immediate CPU, reading the page on a cache miss is immediate IO.</li><li>An LSM insert has immediate/deferred IO/CPU. </li><ul><li>Immediate CPU - key comparisons for memtable insert</li><li>Immediate IO - redo log write</li><li>Deferred IO - reading uncached pages for input SSTs and writing output SSTs during compaction</li><li>Deferred CPU - decompression, compression and key comparisons while merging input SSTs into output SSTs during compaction. 
Note that compaction does a merge, not a sort or merge+sort.</li></ul></ul><br /><h4>Space Amplification</h4><br />Space-amp is the size of the database files versus the size of the data, or the ratio of the physical to logical database size. An estimate for the logical size is the size of the uncompressed database dump with some adjustment if secondary indexes are used. The space-amp is reduced by compression. It is increased by fragmentation in a b-tree and uncompacted data in an LSM.<br /><br />It is best to measure this after the DBMS has reached a steady state to include the impact of fragmentation and uncompacted data.<br /><h4><br /></h4><h4>Cache Amplification</h4><br />I briefly <a href="http://smalldatum.blogspot.com/2018/03/cache-amplification.html">described cache-amp in this post</a>. The cache-amp describes memory efficiency. It represents the minimal database : memory ratio such that a point query requires at most X storage reads. A DBMS with cache-amp=10 (C=10) needs 10 times more memory than one with C=100 to satisfy the at most X reads constraint.<br /><br />It can be more complicated to consider cache-amp for range seek and range next because processing them is more complicated for an LSM or index+log algorithm. Therefore I usually limit this to point queries.<br /><br />For a few years I limited this to X=1 (at most 1 storage read). But it will be interesting to consider X=2 or 3. 
With X=1:<br /><ul><li>For a b-tree all but the leaf level must be in cache</li><li>For an LSM the cache must include all bloom filter and index blocks, and all data blocks except those in the max level</li><li>For an index+log approach it depends (wait for another blog post)</li></ul><div><h4><br /></h4><h4>Other posts</h4><br />Related posts by me on this topic include:<br /><ul><li><a href="http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html">Compare B-Tree and LSM</a> on read, write and space-amp</li><li><a href="http://smalldatum.blogspot.com/2016/11/why-is-myrocks-more-write-efficient_22.html">Compare MyRocks and InnoDB</a> on read, write and space-amp</li><li><a href="http://smalldatum.blogspot.com/2018/07/indexlog-alternative-to-lsm.html">Overview of index+log</a> </li></ul></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-17008508271887571782019-04-12T11:48:00.002-07:002019-04-13T05:43:24.884-07:00A research paper on Optane performanceI just read <a href="https://arxiv.org/pdf/1903.05714.pdf">Basic Performance Measurements of the Intel Optane DC Persistent Memory Module</a> published by the <a href="http://nvsl.ucsd.edu/">NVSL</a> at UCSD. It is worth reading. I appreciate the rigor in testing and the separation of the summary (first 10 pages) from the many details. This is too incomplete to be a review of the paper. It is really a collection of my comments.<br /><br />Comments:<br /><br /><ul><li>When using the device in cached mode where RAM is the cache the cache block size is 4kb. I assume that a cache miss does 16 256-byte reads from the Optane device before returning to the user.</li><li>The paper doesn't explain the endurance for the device. The word "endurance" doesn't occur in the paper. I read elsewhere that Optane might provide ~60 DWPD. 
Update - I assume that endurance isn't mentioned because the vendor has yet to disclose that info.</li><li>The paper states that the Optane DIMM uses a protocol that supports variable response time but doesn't explain how much it varies. How does response time variance in Optane compare to a NAND-flash SSD where the stalls can be bad?</li><li>The Optane DIMM does 256 byte reads and writes. I wonder if that prevents 4kb page writes from being atomic when this is used for a filesystem assuming copy-on-write isn't done internally, as it might be for Nova.</li><li>There is wear-leveling. I am not sure whether that has a name yet. I saw one blog post that called it the XTL to match the FTL used by NAND flash. I am also curious about the latency impact from doing lookups on the XTL to determine locations for 256 byte blocks. The XTL is cached in RAM and a 256g device needs ~4g of RAM assuming each 256 byte block uses 4 bytes in the XTL.</li><li>Nova does much better than XFS and ext4 on Optane. Nova is a research filesystem from NVSL that exploits new features in Optane.</li><li>They modified RocksDB to make the memtable persistent and avoid the need for a WAL. It will be interesting to learn whether that turns out to be useful.</li></ul><div><br />Requests I have for the next Optane performance paper:</div><ul><li>For mixed and concurrent workloads include response time latencies -- average+variance or a histogram. This paper reports the average latency for single-threaded read-only and write-only. For mixed+concurrent workloads this paper reports average throughput which combines read and write performance. It is hard to determine whether reads or writes degrade more from concurrency and a mixed workload. 
</li><li>For any workload include information about response time variation, whether as variance or a histogram</li><li>Provide numbers to accompany graphs. Some of the graphs are hard to understand without numbers when the range for the y-axis is large enough that the lines converge in one part and diverge in another. Figure 18 is one example. </li></ul><div></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com1tag:blogger.com,1999:blog-9149523927864751087.post-9177147697610643762019-01-22T13:30:00.000-08:002019-01-22T13:30:05.683-08:00Less "mark" in MySQL benchmarkingMy goal for the year is more time learning math and less time running MySQL benchmarks. I haven't done serious benchmarks for more than 12 months. It was a great experience but I want to learn new things. MySQL 8.0.14 has been released with fixes for a serious bug I found via the insert benchmark. I won't confirm whether it has been fixed. I hope someone else does.<br /><br />My tests and methodology are described in posts for <a href="http://smalldatum.blogspot.com/2017/02/using-modern-sysbench-to-compare.html">sysbench</a>, <a href="http://smalldatum.blogspot.com/2017/06/all-about-linkbench.html">linkbench</a> and the <a href="http://smalldatum.blogspot.com/2017/06/the-insert-benchmark.html">insert benchmark</a>. I hope the upstream distros (MySQL, MariaDB, Percona) repeat my tests and methodology and I am happy to answer questions about that. I even have inscrutable shell scripts that make it easy to run the tests. 
Despite being a lousy example of how to use Bash, they are portable enough to run on my home and work hardware.Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-74274397832367123732019-01-21T14:32:00.000-08:002019-01-21T14:49:59.342-08:00Optimal configurations for an LSM and moreI have been trying to solve the problem of finding an optimal LSM configuration for a given workload. The real problem is larger than that, which is to find the right <a href="http://smalldatum.blogspot.com/2018/04/index-structures-access-methods-whatever.html">index structure</a> and the right configuration for a given workload. But my focus is RocksDB so I will start by solving for an LSM.<br /><br /><a href="https://docs.google.com/presentation/d/e/2PACX-1vSNk8RkQrVRm_BNZKYyz0sl1k7C6yjTfJIqfMDxnnka8f4pfpf6j2yuXvxvyVGnrzRERdAaxNbOU-CT/pub?start=false&loop=false&delayms=3000&slide=id.p">This link</a> is to slides that summarizes my effort. I have expressed the problem to be solved using differentiable functions to express the cost that is to be minimized. The cost functions have a mix of real and integer valued parameters for which values must be determine to minimize the cost. I have yet to solve the functions, but I am making progress and learning more math. This might be a <a href="https://en.wikipedia.org/wiki/Constrained_optimization">constrained optimization</a> problem and <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange Multipliers</a> might be useful. The slides are from a talk I am about to present at the MongoDB office in Sydney where several WiredTiger developers are based. I appreciate that Henrik Ingo set this up.<br /><br />My work has things in common with the excellent work by <a href="http://daslab.seas.harvard.edu/">Harvard DASlab</a> lead by Stratos Idreos. I have years of production experience on my side, they have many smart and ambitious people on their side. 
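To make the cost-minimization idea concrete, here is a toy sketch of my own (not the cost model from the slides) that numerically minimizes a single-variable LSM cost, total write-amp = n * T^(1/n) for a fixed total fanout T, using a golden-section search. Real cost models have more parameters and constraints, which is where techniques like Lagrange multipliers come in.

```python
import math

def total_write_amp(n: float, total_fanout: float) -> float:
    # Leveled LSM with n levels and per-level growth factor T^(1/n):
    # assume per-level write-amp ~ T^(1/n), so total write-amp ~ n * T^(1/n).
    return n * total_fanout ** (1.0 / n)

def minimize(f, lo: float, hi: float, iters: int = 200) -> float:
    # Golden-section search for the minimum of a unimodal function on [lo, hi].
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2

T = 1024.0
n_best = minimize(lambda n: total_write_amp(n, T), 1.0, 100.0)
# The continuous optimum is n = ln(T) ~ 6.93; the integer answer is ceil or floor of that.
print(n_best, math.log(T))
```

For T=1024 this converges to n = ln(1024), the result derived in my LSM math posts.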
There will be progress. I look forward to more results from their <a href="http://daslab.seas.harvard.edu/datacalculator/">Data Calculator</a> effort. And I have learned a lot from the Monkey and Dostoevsky papers by <a href="http://nivdayan.github.io/">Niv Dayan</a> et al.Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com1tag:blogger.com,1999:blog-9149523927864751087.post-86376377790016449572019-01-20T21:13:00.004-08:002019-01-20T21:13:34.790-08:00Bugs in Windows 10 parental controlsI use Windows 10 parental controls with my two children. Sometimes I am surprised at the bugs I encounter, but I can't rant too much because of glass houses and stones. My old favorite was that a hard reset before the time limit reached zero allowed my clever child to get more time. Apparently Microsoft takes storage efficiency very seriously and didn't want to waste a disk write and/or fsync on persisting the usage counter every few minutes. I haven't tried to reproduce this recently but never heard back after filing a bug report.<br /><br />Now I have a new favorite bug. I am 5 hours behind their timezone and granted another hour to my daughter. It is 4pm here and 9pm there. The landing page after granting the time tells me my child can use the computer until 5pm (my timezone). Child tries to login and immediately encounters the timeout dialog. Apparently timezones are a hard problem. 
But less screen time is a good thing.Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-13133214627514312522019-01-15T20:00:00.003-08:002019-01-15T20:01:16.348-08:00Geek code for LSM trees<a href="https://docs.google.com/presentation/d/e/2PACX-1vQ9AStYSOmJzrcmXA-nkQk0kwoJPpgZAAHcxfrL0TApxbBLVdu0Cszit-EQ11yY1Kpbri5vhJFAssLh/pub?start=false&loop=false&delayms=3000">This is a link to slides</a> from my 5-minute talk at the <a href="http://cidrdb.org/cidr2019/program.html">CIDR 2019</a> Gong Show. The slides are a brief overview of the geek code for LSM trees. If you click on the settings icon in the slide show you can view the speaker notes which have links to blog posts that have more details. I also pasted the links below. Given time I might add to this post, but most of the content is in my past blog posts. Regardless I think there is more to be discovered about performant, efficient and manageable LSM trees.<br /><br />The key points are there are more compaction algorithms to discover, we need to make it easier to describe them and compaction is a property of a level, not of the LSM tree.<br /><br />Links to posts with more details:<br /><ul><li><a href="http://smalldatum.blogspot.com/2018/10/describing-tiered-and-leveled-compaction.html">Describing tiered and leveled compaction</a></li><li><a href="http://smalldatum.blogspot.com/2018/12/lsm-math-how-many-levels-minimizes.html">Number of levels that minimized write amplification</a></li><li><a href="http://smalldatum.blogspot.com/2018/10/combining-tiered-and-leveled-compaction.html">Combining tiered and leveled compaction</a></li><li><a href="http://smalldatum.blogspot.com/2018/07/tiered-or-leveled-compaction-why-not.html">Tiered vs leveled, why not both</a></li><li><a href="http://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html">Name that compaction algorithm</a></li><li><a 
href="http://smalldatum.blogspot.com/2018/06/the-original-lsm-paper.html">Original LSM paper that got this started</a></li><li><a href="http://smalldatum.blogspot.com/2018/09/review-of-slimdb-from-vldb-2018.html">Review of SlimDB</a> with references to the first tiered compaction, Stepped Merge</li></ul><div><br /></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-71052640453851549572019-01-10T10:30:00.000-08:002019-01-10T10:30:27.695-08:00LSM math: fixing mistakes in my last post<a href="http://smalldatum.blogspot.com/2019/01/lsm-math-revisiting-number-of-levels.html">My last post</a> explained the number of levels in an LSM that minimizes write amplification using 3 different estimates for the per-level write-amp. Assuming the per-level growth factor is w then the 3 estimates were approximately w, w+1 and w-1 and named LWA-1, LWA-2 and LWA-3 in the post.<br /><br />I realized there was a mistake in that post for the analysis of LWA-3. The problem is that the per-level write-amp must be >= 1 (and really should be > 1) but the value of w-1 is <= 1 when the per-level growth factor is <= 2. By allowing the per-level write-amp to be < 1 it is easy to incorrectly show that a huge number of levels reduces write-amp as I do for curve #3 <a href="https://www.desmos.com/calculator/ap4fa0okcq">in this graph</a>. While I don't claim that (w-1) or (w-1)/2 can't be a useful estimate for per-level write-amp in some cases, it must be used with care.<br /><br /><b>Explaining LWA-3</b><br /><br />The next challenge is to explain how LWA-3 is derived. That comes from equation 12 on page 9 of the <a href="https://stratos.seas.harvard.edu/files/stratos/files/dostoevskykv.pdf">Dostoevsky paper</a>. Start with the (T-1)/(K+1) term; with K=1 this is (T-1)/2. T in the paper is the per-level growth factor so this is the same as (w-1)/2. 
The paper mentions that this is derived using an arithmetic series but does not show the work. I show my work but was not able to reproduce that result.<br /><br />Assume that the per-level growth factor is w, <a href="http://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html">all-to-all compaction</a> is used and the LSM tree has at least 3 levels. When full L1 has size 1, L2 has size w and L3 has size w*w. There are four derivations below - v1, v2, v3, v4. The results are either w/2 or (w+1)/2 which doesn't match (w-1)/2 from the paper. Fortunately, my previous post shows how to minimize total write-amp assuming the per-level write-amp is w/2 or (w+1)/2. I will contact the author to figure out what I am missing.<br /><br />The analysis below is for merges from L1 to L2, but it holds for merges from Ln to Ln+1. I think that v1 and v2 are correct and their estimate for per-level write-amp is (w+1)/2. As explained below I don't think that v3 or v4 are correct, their estimate for per-level write-amp is w/2.<br /><br />I have yet to explain how to get (w-1)/2.<br /><br /><b>v1</b><br /><br />Assume that merges are triggered from Ln to Ln+1 when a level is full -- L1 has size 1, L2 has size w, L3 has size w*w. A level is empty immediately after it is merged into the next level. So L2 gets full, then is merged into L3 and becomes empty, then slowly gets larger as L1 is merged into it w times. 
The per-level write-amp from this is (w+1)/2.<br /><br /><span style="font-family: Courier New, Courier, monospace;">* merges into L2 write output of size 1, 2, ..., w<br />* then L2 is full<br />* sum of that sequence -> w*(w+1)/2<br />* average value is sum/w -> (w+1)/2<br /><br />1) Moving data of size 1 from L1 to L2 writes (w+1)/2 on average<br />2) Therefore per-level write-amp for L1 -> L2 is (w+1)/2<br /><br />Note that per-level write-amp is (avg merge output to Ln / size of Ln-1)<br />* avg merge output to L2 is (w+1)/2<br />* size of Ln-1 is 1</span><br /><br /><b>v2</b><br /><br />Assume that merges are triggered from Ln to Ln+1 when a level is almost full -- L1 has size 1 * (w-1)/w, L2 has size w * (w-1)/w, L3 has size (w*w) * (w-1)/w. The trigger conditions can be reduced to L1 has size (w-1)/w, L2 has size (w-1) and L3 has size w*(w-1).<br /><br />This assumes that w merges are done from L1 to L2 for L2 to go from empty to full. Each merge adds data of size (w-1)/w because L1:L2 merge is triggered when L1 has that much data. Thus L2 has size (w-1) after w merges into it at which point L2:L3 merge can be done. 
The per-level write-amp from this is the same as it was for v1.<br /><br /><span style="font-family: Courier New, Courier, monospace;">* merges into L2 write output of size (w-1)/w * [1, 2, ..., w]<br />* then L2 is full<br />* sum of that sequence -> (w-1)/w * w*(w+1)/2 = (w-1)(w+1)/2<br />* average value is sum/w -> (w-1)(w+1)/(2*w)<br /><br />As from v1, per-level write-amp is (avg merge output to Ln / size of Ln-1)</span><br /><span style="font-family: Courier New, Courier, monospace;">* avg merge output to L2 = (w-1)(w+1)/(2*w)<br />* size of L1 = (w-1)/w</span><br /><span style="font-family: Courier New, Courier, monospace;"><br /></span><span style="font-family: Courier New, Courier, monospace;">start with: ( (w-1)(w+1)/(2*w) ) / ( (w-1)/w )<br />simplify to: (w+1)/2</span><br /><br /><b>v3</b><br /><br />Merges are triggered the same as for v1 but I assume that only w-1 merges are done from Ln to Ln+1 rather than w. Ln+1 won't be full at the end of that, for example L2 would have size w-1 rather than the expected size w. But I was curious about the math. 
The per-level write-amp is w/2.<br /><br /><span style="font-family: "Courier New", Courier, monospace;">* merges into L2 write output of size 1, 2, ..., w-1</span><br style="font-family: "Courier New", Courier, monospace;" /><span style="font-family: "Courier New", Courier, monospace;">* sum of that sequence -> (w-1)*w/2</span><br style="font-family: "Courier New", Courier, monospace;" /><span style="font-family: "Courier New", Courier, monospace;">* average value is sum/(w-1) -> w/2</span><br style="font-family: "Courier New", Courier, monospace;" /><br style="font-family: "Courier New", Courier, monospace;" /><span style="font-family: "Courier New", Courier, monospace;">1) Moving data of size 1 from L1 to L2 writes w/2 on average</span><br style="font-family: "Courier New", Courier, monospace;" /><span style="font-family: "Courier New", Courier, monospace;">2) </span><span style="font-family: "Courier New", Courier, monospace;">Therefore per-level write-amp for L1 -> L2 is w/2</span><br /><br /><b>v4</b><br /><br />Merges are triggered the same as for v2. But as with v3, only w-1 merges are done into a level. Again I don't think this is correct because a level won't have enough data to trigger compaction at that point. 
The per-level write-amp here is the same as for v3.<br /><br /><span style="font-family: Courier New, Courier, monospace;">* merges into L2 write output of size (w-1)/w * [1, 2, ..., w-1]<br />* sum of that sequence -> (w-1)/w * (w-1)*w/2 = (w-1)(w-1)/2<br />* average value is sum/(w-1) -> (w-1)/2<br /><br />As from v1, per-level write-amp is (avg merge output to Ln / size of Ln-1)</span><br /><span style="font-family: Courier New, Courier, monospace;">* avg merge output to L2 = (w-1)/2<br />* size of L1 = (w-1)/w</span><br /><span style="font-family: Courier New, Courier, monospace;"><br /></span><span style="font-family: Courier New, Courier, monospace;">start with: ( (w-1)/2 ) / ( (w-1)/w )<br />simplify to: w/2</span><br /><br /><br /><br />Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-1759749492812850582019-01-09T15:58:00.000-08:002019-01-10T10:37:58.560-08:00LSM math: revisiting the number of levels that minimizes write amplificationI <a href="http://smalldatum.blogspot.com/2018/12/lsm-math-how-many-levels-minimizes.html">previously used math</a> to explain the number of levels that minimizes write amplification for an LSM tree with leveled compaction. My answer was one of ceil(ln(T)) or floor(ln(T)) assuming the LSM tree has total fanout = T where T is size(database) / size(memtable).<br /><br />Then I heard from a coworker that the real answer is less than floor(ln(T)). Then I heard from Niv Dayan, first author of <a href="https://stratos.seas.harvard.edu/publications/dostoevsky-better-space-time-trade-offs-lsm-tree-based-key-value-stores">the Dostoevsky paper</a>, that the real answer is larger than ceil(ln(T)) and the optimal per-level growth factor is ~2 rather than ~e.<br /><br />All of our answers are correct. We have different answers because we use different functions to estimate the per-level write-amp. 
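A quick numeric preview of how much the estimate matters: this sketch of my own (assuming T=1024 and restricting n to integers) computes the number of levels that minimizes total write-amp = n * LWA for each of the three per-level write-amp estimates.

```python
import math

T = 1024.0  # total fanout = size(database) / size(memtable)

# Three estimates for the per-level write-amp when the growth factor is w = T^(1/n)
lwa = {
    "LWA-1: w": lambda w: w,
    "LWA-2: w+1": lambda w: w + 1,
    "LWA-3: (w-1)/2": lambda w: (w - 1) / 2,
}

best = {}
for name, f in lwa.items():
    # total write-amp = n * per-level write-amp, for integer n in [2, 64]
    twa = {n: n * f(T ** (1.0 / n)) for n in range(2, 65)}
    best[name] = min(twa, key=twa.get)
    # LWA-1 -> 7 (~ ln(1024)), LWA-2 -> 5, LWA-3 -> 64 (no interior
    # minimum on this grid; the total keeps decreasing as n grows)
    print(name, "-> best n =", best[name])
```

The three estimates give three different optimal level counts, which is the point of this post and the next one.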
The graph of the functions for total write-amp using the different cost functions <a href="https://www.desmos.com/calculator/ap4fa0okcq">is here</a> and you can see that the knee in the curve occurs at a different x value for two of the curves and the third curve doesn't appear to have a minimum.<br /><br />While working on this I learned to love the <a href="https://en.m.wikipedia.org/wiki/Lambert_W_function">Lambert W function</a>. But I wonder whether I made the math below for LWA-2 harder than necessary. I am happy to be corrected. I appreciate the excellent advice on Quora: <a href="https://www.quora.com/How-can-I-learn-to-use-the-Lambert-W-function/answer/Awnon-Bhowmik">here</a>, <a href="https://www.quora.com/What-is-the-Lambert-W-function">here</a> and <a href="http://o/">here</a>. The online graphing calculator <a href="https://www.desmos.com/">Desmos</a> is another great resource.<br /><br /><b>Math</b><br /><br />I use differentiable functions to express the total write-amp as a function of the number of levels, then determine the value (number of levels) at which the first derivative is zero as that might be the global minimum. Constants, variables and functions below include:<br /><ul><li>T - total fanout, = size(database) / size(memtable)</li><li>n - number of levels in the LSM tree</li><li>LWA, LWA-x - function for the per-level write-amp</li><li>TWA, TWA-x - function for the total write-amp, = n * LWA</li><li>w - per-level growth factor, = T^(1/n) for all levels <a href="http://smalldatum.blogspot.com/2018/10/minimizing-write-amplification-in-lsm_3.html">to minimize write-amp</a></li></ul>The function for total write-amp has the form: TWA = n * LWA where n is the number of levels and LWA is the per-level write-amp. LWA is a function of T and n. The goal is determine the value of n at which TWA is minimized. While n must be an integer the math here doesn't enforce that and the result should be rounded up or down to an integer. 
T is a constant as I assume a given value for total fanout. Here I use T=1024.<br /><br />I wrote above that the 3 different answers came from using 3 different estimates for the per-level write-amp and I label these LWA-1, LWA-2 and LWA-3. When w is the per-level growth factor then the per-level write-amp functions are:<br /><ul><li>LWA-1 = w -- I used this to find that the best n = ceil(ln(T)) or floor(ln(T))</li><li>LWA-2 = w + 1 -- with this the best n is less than that found with LWA-1</li><li>LWA-3 = (w - 1) / 2 -- with this the best n is greater than that found with LWA-1</li></ul><div>I can also state the per-level write-amp functions directly with T and n. I didn't above to make it easier to see the differences.<br /><ul><li>LWA-1 = T^(1/n)</li><li>LWA-2 = T^(1/n) + 1</li><li>LWA-3 = (T^(1/n) - 1) / 2</li></ul></div><b>Explaining LWA</b><br /><br />First I explain LWA-1 and LWA-2. Compacting 1 SST from Ln to Ln+1 requires merging 1 SST from Ln with ~w SSTs from Ln+1 where w=10 by default with RocksDB. The output will be between w and w+1 SSTs. If the output is closer to w then LWA-1 is correct. If the output is closer to w+1 then LWA-2 is correct. <a href="https://hyeontaek.com/papers/msls-fast2016.pdf">This paper explains</a> why the per level write-amp is likely to be less than w. Were I to use f*w where f < 1 for LWA-1 then the <a href="http://smalldatum.blogspot.com/2018/12/lsm-math-how-many-levels-minimizes.html">math still holds</a>. Maybe that is a future blog post.<br /><br />LWA-3 assumes that all-to-all compaction is used rather than some-to-some. I <a href="http://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html">explain the difference here</a>. RocksDB/LevelDB leveled uses some-to-some but all-to-all is interesting. With all-to-all when compaction from Ln to Ln+1 finishes then Ln is empty and slowly gets full after each merge into it. 
Assume the per-level growth factor is w and Ln-1, Ln and Ln+1 are full at sizes 1, w and w*w. Then Ln becomes full after w merges from Ln-1 and those merges write output of size 1, 2, ..., w-1, w. The <a href="https://en.wikipedia.org/wiki/1_%2B_2_%2B_3_%2B_4_%2B_%E2%8B%AF#Partial_sums">sum of the first w</a> integers is w(w+1)/2. Divide this by w to get the average -- (w+1)/2. However above LWA-3 is (w-1)/2 not (w+1)/2. I will explain that in another blog post. Note that in LWA-3 the numerator, w-1, is more interesting than the denominator, 2. Dividing by any constant doesn't change where the minimum occurs assuming there is a minimum and that is visible on <a href="https://www.desmos.com/calculator/vy9f2k7oyi">this graph</a> that shows the impact of dividing by 2 on the total write-amp.<br /><br />Read on to understand the impact of using w-1, w or w+1 as the function for per-level write-amp. The difference might be more significant than you expect. It surprised me.<br /><br /><b>Minimizing TWA</b><br /><br /><a href="https://www.desmos.com/calculator/ap4fa0okcq">This graph</a> shows the total write-amp for LWA-1, LWA-2 and LWA-3. I call the total write-amp TWA-1, TWA-2 and TWA-3. Two of the curves, for TWA-1 and TWA-2, appear to have a minimum. One occurs for x between 4 and 6, the other for x between 6 and 8. The third curve, for TWA-3, doesn't appear to have a minimum and is decreasing as x (number of levels) grows.<br /><br /><a href="https://www.desmos.com/calculator/kt7zwwe6lj">The next graph</a> uses the first derivative for the total write-amp functions, so it is for TWA-1', TWA-2' and TWA-3'. A global minimum for TWA-x can occur when TWA-x' = 0 and from the graph TWA-1'=0 when x=6.931 and TWA-2'=0 when x=5.422 which matches the estimate from the previous paragraph. 
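The minima estimated from the graphs can be confirmed numerically. This Python sketch is my own; a brute-force grid search stands in for solving TWA-x' = 0, with T=1024:

```python
import math

T = 1024

def twa1(n): return n * T ** (1.0 / n)            # TWA-1, LWA = w
def twa2(n): return n * (T ** (1.0 / n) + 1)      # TWA-2, LWA = w + 1
def twa3(n): return n * (T ** (1.0 / n) - 1) / 2  # TWA-3, LWA = (w - 1) / 2

def argmin(f, lo=2.0, hi=12.0, steps=200000):
    # brute-force grid search; good enough for a smooth 1-d function
    return min((lo + (hi - lo) * i / steps for i in range(steps + 1)), key=f)

print(argmin(twa1))  # ~6.931, which is ln(1024)
print(argmin(twa2))  # ~5.422, matching the graph
print(argmin(twa3))  # 12.0: hits the edge of the range, no interior minimum
```

The third result illustrates the point about TWA-3: the search just runs to the end of the interval because the curve keeps decreasing.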
From the graph it appears that TWA-3' approaches zero as x gets large but is never equal to zero.<br /><br />The next step is to use math to confirm what is visible on the graphs.<br /><br /><b>Min write-amp for LWA-1</b><br /><br />See my <a href="http://smalldatum.blogspot.com/2018/12/lsm-math-how-many-levels-minimizes.html">previous post</a> where I show that n = ln(T) minimizes total write-amp if n isn't limited to an integer and then the per-level growth factor is e. Since the number of levels must be an integer then one of ceil(ln(T)) or floor(ln(T)) minimizes total write-amp.<br /><br /><b>Min write-amp for LWA-2</b><br /><div><br /></div><div>I can reuse some of the math from my <a href="http://smalldatum.blogspot.com/2018/12/lsm-math-how-many-levels-minimizes.html">previous post</a>. But this one is harder to solve.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;"># wa is the total write-amp<br /># n is the number of levels</span><br /><span style="font-family: "courier new" , "courier" , monospace;"># t is the total fanout</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa = n * ( t^(1/n) + 1 )</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa = n*t^(1/n) + n<br /><br /># the difference between this and the previous post is '+1'</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa' = t^(1/n) + n * ln(t) * t^(1/n) * (-1) * (1/n^2) + 1</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa' = t^(1/n) - (1/n) * ln(t) * t^(1/n) + 1</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /><span style="font-family: "times" , "times new roman" , serif;">At this point the difference between this and the previous post is '+1'. 
But wait this starts to get interesting.</span></span><br /><span style="font-family: "courier new" , "courier" , monospace;"># critical point for this occurs when wa' = 0<br />t^(1/n) - (1/n) * ln(t) * t^(1/n) + 1 = 0<br /><br /># multiply by t^(-1/n)<br />1 - (1/n) * ln(t) + t^(-1/n) = 0<br /><br /># move some terms to RHS<br />t^(-1/n) = (1/n) ln(t) - 1</span><br /><span style="font-family: "courier new" , "courier" , monospace;"># use ln on LHS and RHS to get rid of '^(1/n)'</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;">ln ( t^(-1/n) ) = ln( (1/n) * ln(t) - 1 )</span><br /><span style="font-family: "courier new" , "courier" , monospace;">(-1/n) ln(t) = ln( (1/n) * ln(t) - 1 )</span><br /><br /><span style="font-family: "times" , "times new roman" , serif;">I got stuck here but eventually made progress.</span><br /><br /><span style="font-family: "courier new" , "courier" , monospace;"># let a = (1/n) ln(t) and rewrite<br />-a = ln(a - 1)<br /><br /># let x=a-1, a=x+1 and rewrite<br />-(x+1) = ln(x)<br /><br /># do e^LHS = e^RHS<br />e^-(x+1) = e^ln(x)<br />e^-x * e^-1 = x<br /><br /># multiply LHS and RHS by e^x<br />e^-1 = e^x * x<br /><br /># e^-1 -> (1/e)<br />(1/e) = e^x * x</span><br /><br /><span style="font-family: "times" , "times new roman" , serif;">At last I can use the <a href="https://en.m.wikipedia.org/wiki/Lambert_W_function">Lambert W function</a>!</span><br /><br /><span style="font-family: "courier new" , "courier" , monospace;"># Given: e^x * x = K, then x = W(K)<br />x = W(e^-1) ~= 0.27846</span></span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"># because a=x+1<br />a ~= 1.27846</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"># a = (1/n) 
ln(t) -> n = (1/a) ln(t), t=1024<br />n = 1/1.27846 * ln(1024)<br /><br /># The value for n that minimizes total write-amp<br /># from the graph I claimed that n=5.422 and this is close<br />n = 5.4217</span><br /><br /></div><div><b>Min write-amp for LWA-3</b></div><div><br /><b>Update-1</b> - I think I made a few mistakes here. So you can stop reading until update-2 arrives.<br /><br /><b>Update-2</b> - <a href="http://smalldatum.blogspot.com/2019/01/lsm-math-fixing-mistakes-in-my-last-post.html">this post</a> explains my mistake and uses math to estimate that per-level write-amp = (w+1)/2 when all-to-all compaction is used. I am still unable to derive (w-1)/2.<br /><br /></div><div>I started to work on this without paying attention to the <a href="https://www.desmos.com/calculator/kt7zwwe6lj">curve for LWA-3'</a>. From the graph it appears to converge to 0 but is always less than 0, so TWA-3 is decreasing as x, the number of levels, gets large. Therefore make the number of levels as large as possible, 2M or 2B, to minimize total write-amp as visible <a href="https://www.desmos.com/calculator/vheojxcczt">in this graph</a>.<br /><br />But more levels in the LSM tree come at a cost -- more read-amp. And the reduction in write-amp is small when the number of levels increases from 20 to 200 to 2000 to 2M. Again, this is visible <a href="https://www.desmos.com/calculator/vheojxcczt">in the graph</a>. Besides, if you really want less write-amp then use tiered compaction rather than leveled with too many levels.<br /><br />The other consideration is the minimal per-level growth factor that should be allowed. Assume the min per-level growth factor is 2. 
Then that occurs when the number of levels, n, is:<br /><span style="font-family: "courier new" , "courier" , monospace;"><br /># assume total fanout is 1024</span><br /><span style="font-family: "courier new" , "courier" , monospace;">2^n = 1024<br />log2(2^n) = log2(1024)<br />n = log2(1024) = 10</span><br /><br />Alas the total fanout isn't always a power of 2. Given that the number of levels must be an integer then the goal is to use the smallest number of levels such that the per-level growth factor >= 2. Therefore when x isn't limited to an integer there is no answer -- just make x as large as possible (1M, 1B, etc) in which case the per-level growth factor converges to 1 but is always greater than 1.<br /><br />The above can be repeated where the constraint is either the max number of levels or a different value for the min per-level growth factor (either <2 or >2). Regardless, if LWA-3 is the cost function then total write-amp is minimized by using as many levels as possible subject to these constraints.<br /><br />Below is some math for LWA-3 and LWA-3'.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;"># wa is the total write-amp<br /># n is the number of levels</span><br /><span style="font-family: "courier new" , "courier" , monospace;"># t is the total fanout<br />wa = n * ( t^(1/n) - 1 ) / 2</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa = (n*t^(1/n) - n ) / 2<br /><br /># the difference between this and the previous post is the '- 1' and the '/ 2'</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa' = [ t^(1/n) + n * ln(t) * t^(1/n) * (-1) * (1/n^2) - 1 ] / 2</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa' = [ t^(1/n) - (1/n) * ln(t) * t^(1/n) - 1 ] / 2<br /><br /># determine when wa' = 0<br /></span><span style="font-family: "courier new" , "courier" , monospace;">[ t^(1/n) - (1/n) * ln(t) * t^(1/n) - 1 ] / 2 = 0<br /><br /># multiply LHS and RHS by 
2<br /></span><span style="font-family: "courier new" , "courier" , monospace;">t^(1/n) - (1/n) * ln(t) * t^(1/n) - 1 = 0<br /></span><span style="font-family: "courier new" , "courier" , monospace;"># multiply LHS and RHS by t^(-1/n)</span><span style="font-family: "courier new" , "courier" , monospace;"><br />1 - (1/n) * ln(t) - t^(-1/n) = 0<br /><br /># move last term to RHS<br />1 - (1/n) * ln(t) = t^(-1/n)<br /><br /># probably a good idea to stop here<br /># LHS is likely to be < 0 so can't use ln(LHS) = ln(RHS)</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-47330692257427692792019-01-07T12:35:00.000-08:002019-01-07T13:15:18.507-08:00Define "better"Welcome to my first rant of 2019, although I have <a href="http://smalldatum.blogspot.com/2015/11/define-better-for-small-data-dbms.html">written about this</a> before. While I enjoy <a href="http://smalldatum.blogspot.com/2014/06/benchmarketing.html">benchmarketing</a> from a distance it is not much fun to be in the middle of it. The RocksDB project has been successful and thus becomes the base case for products and research claiming that something else is better. While I have no doubt that other things can be better I am wary about the definition of <i><b>better</b></i>.<br /><br />There are at least 3 ways to define better when evaluating database performance. The first, faster is better, ignores efficiency; the last two do not. I'd rather not ignore efficiency. 
The marginal return of X more QPS eventually becomes zero while the benefit of using less hardware is usually greater than zero.<br /><ol><li>Optimize for throughput and ignore efficiency (faster is better)</li><li>Get good enough performance and then optimize for efficiency</li><li>Get good enough efficiency and then optimize for throughput</li></ol><div><b>Call to action</b></div><div><br /></div><div>I forgot to include this before publishing. Whether #1, #2 or #3 is followed I hope that more performance results include details on the HW consumed to create that performance. How much memory and disk space were used? What was the CPU utilization? How many bytes were read from and written to storage? How much random IO was used? I try to report both absolute and relative values where relative values are normalized by the transaction rate.</div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-41563271290807086672019-01-03T11:06:00.001-08:002019-01-03T11:06:25.101-08:00Review of LSM-based Storage Techniques: A SurveyChen Luo and Mike Carey published a wonderful <a href="https://arxiv.org/abs/1812.07527">survey of research on LSM algorithms</a>. They know about LSM because the <a href="http://asterix.ics.uci.edu/">AsterixDB project</a> includes an LSM. They did a great job explaining the LSM space, telling a coherent story and summarizing relevant papers. Reading this paper was a good use of my time and I found a few more papers to read in their references.<br /><br />I have read a few papers, including <a href="http://smalldatum.blogspot.com/2018/11/review-of-triad-creating-synergies.html">TRIAD</a>, with ideas on reducing write-amp for the smaller levels of the LSM tree. I think this could be done for RocksDB by merging and remerging immutable memtables -- this is similar in spirit to subcompactions for the L0. 
With a large immutable memtable there would be one less level in the LSM tree. This is an alternative to having an L0, and maybe an L1, that are not made durable. In all cases the cost is a longer MTTR because WAL replay must be done. In all cases there is an assumption that the non-durable levels (large immutable memtables or L0/L1) are in memory.<br /><br />This is a small complaint from me that I have made in the past. The paper states that an LSM eliminates random IO when making things durable. I prefer to claim that it reduces random IO. With leveled compaction each step merges N (~11) SSTs to generate one stream of output. So for each step there is likely a need to seek when reading the ~11 input streams and writing the output stream. Then compaction steps usually run concurrently when the ingest rate is high so there are more seeks. Then the WAL must be written -- one more stream and a chance for more seeks. Finally user queries are likely to read from storage causing even more seeks. Fortunately, there will be fewer seeks per insert/update/delete compared to a B-Tree.<br /><br />The paper has a short history of compaction describing pure-tiered and pure-leveled. But these are rarely used in practice. The <a href="http://smalldatum.blogspot.com/2018/06/the-original-lsm-paper.html">original LSM paper</a> implemented pure-leveled. LevelDB and RocksDB use a hybrid approach with tiered for the L0 followed by leveled for the remaining levels. Pure-tiered was introduced by the Stepped Merge paper. Using tiered for all levels has a large space-amplification, much larger than 1, because the max level is tiered and that is too much wasted space for many workloads. Tiered in RocksDB and other popular LSM engines can be configured to use leveled compaction into the max level to get a space-amp less than 2, ignoring transient space-amp during compaction into the max level. 
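A rough model makes the space-amp difference concrete. This sketch is my own back-of-envelope arithmetic, not how RocksDB or any other engine accounts for space:

```python
# Leveled max level: one sorted run holds most of the data and the smaller
# levels add a geometric tail on top of it.
def space_amp_leveled(growth=10, levels=6):
    # relative level sizes: 1, 1/growth, 1/growth^2, ...
    return sum(growth ** -i for i in range(levels))

# Tiered max level: up to K sorted runs and in the worst case (an
# update-only workload) each run approaches a full copy of the database.
def space_amp_tiered_worst(runs_at_max=4):
    return runs_at_max

print(round(space_amp_leveled(), 2))  # ~1.11 with a growth factor of 10
print(space_amp_tiered_worst())       # 4
```

Under this model leveled into the max level stays comfortably below a space-amp of 2 while tiered into the max level can reach the number of runs kept there.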
Pure-tiered was a great choice for Stepped Merge because that was a cache for bulk-loading a data warehouse rather than a full copy of the database. While I think that RocksDB leveled and RocksDB tiered are examples of <a href="http://smalldatum.blogspot.com/2018/10/describing-tiered-and-leveled-compaction.html">tiered+leveled</a>, I don't want to rename them.<br /><br />I appreciate that the paper makes clear that trade-offs must be considered when evaluating benchmarks. Many things can support higher write rates than RocksDB with leveled compaction, including RocksDB with tiered compaction. But that comes at a cost in memory, read and/or space amplification. Some papers could do a better job of documenting those costs.<br /><br />The cost analysis in section 2.3 is limited to IO costs. I look forward to coverage of CPU costs in future LSM research. The read penalty for an LSM compared to a B-Tree is usually worse for CPU than for IO. The paper uses partitioned and non-partitioned where I use <a href="http://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html">all-to-all and some-to-some</a> to explain the compaction approaches. RocksDB implements some-to-some for leveled and all-to-all for tiered. The paper does a nice job explaining why the per-level write-amp should be less for all-to-all than some-to-some, ignoring write skew. Note that in production the per-level write-amp is almost always less than the per-level growth factor and <a href="https://hyeontaek.com/papers/msls-fast2016.pdf">this paper from Hyeontaek Lim</a> explains why.<br /><br />For the read IO costs, the paper counts logical IOs rather than physical IOs. Logical IOs are easier to estimate because caches mean that many logical IOs don't cause a physical IO and smaller levels in the LSM tree are usually in cache. There are two ways to consider the cost for a range query -- long vs short range queries or the cost of range seek vs range next. 
The paper uses the first, I use the second. Both are useful.<br /><br />I appreciate that the author noticed this. I realize there is pressure to market research and I am not offering to try and reproduce benchmark results, but I have been skeptical about some of the comparisons I see where the base case is InnoDB or RocksDB.<br /><blockquote class="tr_bq"><span style="font-family: Times, Times New Roman, serif;"><span style="background-color: white; color: #202124; font-size: 16px; font-variant-ligatures: none; letter-spacing: 0.1px; white-space: pre-wrap;">These improvements have mainly </span><span style="background-color: white; color: #202124; font-size: 16px; font-variant-ligatures: none; letter-spacing: 0.1px; white-space: pre-wrap;">been evaluated against a default (untuned) configuration of </span><span style="background-color: white; color: #202124; font-size: 16px; font-variant-ligatures: none; letter-spacing: 0.1px; white-space: pre-wrap;">LevelDB or RocksDB, which use the leveling merge policy </span><span style="background-color: white; color: #202124; font-size: 16px; font-variant-ligatures: none; letter-spacing: 0.1px; white-space: pre-wrap;">with size ratio 10. It is not clear how these improvements </span><span style="background-color: white; color: #202124; font-size: 16px; font-variant-ligatures: none; letter-spacing: 0.1px; white-space: pre-wrap;">would compare against a well-tuned LSM-tree.</span></span></blockquote>The discussion in 3.3.1 on pipelining compaction is interesting but RocksDB already does pipelining. With buffered IO there is support for async read-ahead and async write-behind. Note that the read and write phases can also be CPU-heavy if the cost for decompression on read and compression on write are included, even when the wonderful zstd and lz4 algorithms are used.<br /><br />A few more comments:<br /><ul><li>RocksDB has limited support for fractional cascading (from SST to SST). 
See 3.4.2.</li><li>With key-value separation, GC could merge log segments to generate longer ordered log segments over time. This would reduce the range read penalty. See 3.4.2.</li><li>LHAM might be the first time-series optimized compaction strategy. See 3.5.</li><li>Non-unique secondary index maintenance is already read-free in MyRocks. It has a copy of the row prior to index maintenance, because of SQL semantics or because this was an insert. Write-optimized SQL engines can add support for read-free change statements in some cases but that usually means SQL semantics (like modified row count) will be broken. See 3.7.2.</li><li>MyRocks already collects statistics during compaction. See 3.7.3.</li></ul><br />Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com2tag:blogger.com,1999:blog-9149523927864751087.post-69071944690750933402018-12-17T12:05:00.002-08:002018-12-17T12:06:34.446-08:00New small servers for performance testingMy <a href="http://smalldatum.blogspot.com/2017/05/small-servers-for-database-performance.html">old NUC cluster</a> found a new home and I downsized to 2 new NUC servers. The new server is <a href="https://ark.intel.com/products/126140/Intel-NUC-Kit-NUC8i7BEH">NUC8i7beh</a> with <a href="https://www.amazon.com/gp/product/B01BIWMWVS">16g RAM</a>, 500g <a href="https://www.amazon.com/gp/product/B0781Z7Y3S">Samsung 860</a> EVO for the OS and 500g <a href="https://www.amazon.com/gp/product/B07BN4NJ2J">Samsung 970</a> EVO for performance. The Samsung 860 is SATA and the Samsung 970 is an m.2 device. I expect to wear out the performance devices as I <a href="http://smalldatum.blogspot.com/2017/10/wearing-out-ssd.html">have done that</a> in the past. 
With the OS on a separate device I avoid the need to reinstall the OS when that happens.<br /><br />The new NUC has a post-Skylake CPU (<a href="https://ark.intel.com/products/137979/Intel-Core-i7-8559U-Processor-8M-Cache-up-to-4-50-GHz-">i7-8559u</a>) and provides 4 cores (8 HW threads) compared to 2 cores (4 HW threads) in the old NUCs. I disabled turbo boost again to avoid performance variance as mentioned in the old post. I am not sure these have sufficient cooling for sustained boost and when boost isn't sustained there are frequent changes in CPU performance. I also disabled hyperthreads out of concern for the impact from Spectre fixes and to avoid a different syscall overhead each time I update the kernel.<br /><br />I might use these servers to examine the impact of the <a href="https://twitter.com/markcallaghan/status/1074375153650266112">~10x increase in PAUSE</a> times on InnoDB with and without HT enabled. I might also use them for another round of MySQL performance testing when 8.0.14 is released.<br /><br />I am a big fan of Intel NUC servers. But maybe I am not a fan of the SATA cables they use. I already had one of my old NUCs replaced under warranty after one of the SATA wires was bare. In the new NUCs that I just set up, a few of the SATA cables appear to be cut and I wonder if they will eventually become bare.Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-87931951296277077552018-12-14T10:17:00.001-08:002019-03-08T05:00:32.218-08:00LSM math - size of search space for LSM tree configurationI have <a href="http://smalldatum.blogspot.com/2018/10/describing-tiered-and-leveled-compaction.html">written before</a> and will write again about using 3-tuples to explain the shape of an LSM tree. This makes it easier to explain the configurations supported today and configurations we might want to support tomorrow in addition to traditional tiered and leveled compaction. 
The summary is that an LSM tree has n levels labeled from L1 to Ln and Lmax is another name for Ln. There is one 3-tuple per level and the components of the 3-tuple are (type, fanout, runs) for Lk (level k) where:<br /><ul><li>type is Tiered or Leveled and explains compaction into that level</li><li>fanout is the size of a sorted run in Lk relative to a sorted run from Lk-1, a real and >= 1</li><li>runs is the number of sorted runs in that level, an integer and >= 1</li></ul><div>Given the above how many valid configurations exist for an LSM tree? There are additional constraints that can be imposed on the 3-tuple but I will ignore most of them except for limiting fanout and runs to be <= 20. The answer is easy - there are an infinite number of configurations because fanout is a real.</div><div><br /></div><div>The question is more interesting when fanout is limited to an integer and the number of levels is limited to between 1 and 10. I am doing this to explain the size of the search space but I don't think that fanout should be limited to an integer.</div><div><br /></div><div>There are approximately 2^11 configurations only considering compaction type, which has 2 values, and 1 to 10 levels because there are 2^N configurations of compaction types for a tree with N levels and the sum of 2^1 + 2^2 + ... + 2^9 + 2^10 = 2^11 - 2</div><div><br /></div><div>But when type, fanout and runs are considered then there are 2 x 20 x 20 = 800 choices per level and 800^N combinations for an LSM tree with N levels. Considering LSM trees with 1 to 10 levels then the number of valid configurations is the sum 800^1 + 800^2 + ... + 800^9 + 800^10. That is a large number of configurations if exhaustive search were to be used to find the best configuration. 
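The counting argument is easy to check in Python. The Level class is illustrative only (my invention, not from any real system):

```python
from dataclasses import dataclass

@dataclass
class Level:
    kind: str     # 'tiered' or 'leveled'
    fanout: int   # limited to 1..20 here, though fanout need not be an integer
    runs: int     # limited to 1..20 here

# compaction type alone: 2^1 + 2^2 + ... + 2^10 = 2^11 - 2
types_only = sum(2 ** n for n in range(1, 11))
print(types_only)  # -> 2046

# type x fanout x runs = 2 * 20 * 20 = 800 choices per level
choices_per_level = 2 * 20 * 20
total = sum(choices_per_level ** n for n in range(1, 11))
print(total)  # ~1.1 * 10^29 configurations
```

The geometric sum is dominated by the 10-level term, 800^10, which is already more than 10^29 / 10.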
Note that I don't think exhaustive search should be used.</div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com2tag:blogger.com,1999:blog-9149523927864751087.post-74227521885973079772018-12-13T13:18:00.001-08:002018-12-13T13:39:27.718-08:00LSM math - how many levels minimizes write amplification?How do you configure an LSM tree with leveled compaction to minimize write amplification? For a given number of levels write-amp is minimal when the same fanout (growth factor) is used between all levels, but that does not explain the number of levels to use. In this post I answer that question.<br /><ol><li>The number of levels that minimizes write-amp is one of ceil(ln(T)) or floor(ln(T)) where T is the total fanout -- sizeof(database) / sizeof(memtable)</li><li>When #1 is done then the per-level fanout is e when the number of levels is ln(t) and a value close to e when the number of levels is an integer.</li></ol><div><b>Introduction</b></div><div><br />I don't recall reading this result elsewhere, but I am happy to update this post with a link to such a result. I was encouraged to answer this after a discussion with the RocksDB team and thank Siying Dong for stating #2 above while leaving the math to me. I assume the <a href="http://smalldatum.blogspot.com/2018/06/the-original-lsm-paper.html">original LSM paper</a> didn't address this problem because that system used a fixed number of levels.</div><br />One result from the <a href="http://smalldatum.blogspot.com/2018/06/the-original-lsm-paper.html">original LSM paper</a> and <a href="http://smalldatum.blogspot.com/2018/10/minimizing-write-amplification-in-lsm_3.html">updated by me</a> is that write-amp is minimized when the per-level growth factor is constant. Sometimes I use fanout or per-level fanout rather than per-level growth factor. 
In RocksDB the option name is <a href="https://github.com/facebook/rocksdb/blob/master/include/rocksdb/advanced_options.h#L489">max_bytes_for_level_multiplier</a>. Yes, this can be confusing. The default fanout in RocksDB is 10.<br /><br /><b>Math</b><br /><br />I solve this for pure-leveled compaction which differs from what RocksDB calls leveled. In pure-leveled all levels use leveled compaction. In RocksDB leveled the first level, L0, uses tiered and the other levels use leveled. I started to <a href="http://smalldatum.blogspot.com/2018/10/describing-tiered-and-leveled-compaction.html">explain this here</a> where I claim that RocksDB leveled is really tiered+leveled. But I am not asking for them to change the name.<br /><br />Assumptions:<br /><ul><li>LSM tree uses pure-leveled compaction and compaction from memtable flushes into the first level of the LSM tree uses leveled compaction</li><li>total fanout is T and is size(Lmax) / size(memtable) where Lmax is the max level of the LSM tree</li><li>workload is update-only so the number of keys in the database is fixed</li><li>workload has no write skew and all keys are equally likely to be updated</li><li>per-level write-amp == per-level growth factor. In practice and <a href="https://hyeontaek.com/papers/msls-fast2016.pdf">in theory</a> the per-level write-amp tends to be less than the per-level growth factor.</li><li>total write-amp is the sum of per-level write-amp. I ignore write-amp from the WAL. 
</li></ul><br /><b>Specify function for write-amp and determine critical points</b><br /><br /><span style="font-family: "courier new" , "courier" , monospace;"># wa is the total write-amp<br /># n is the number of levels<br /># per-level fanout is the nth root of the total fanout</span><br /><span style="font-family: "courier new" , "courier" , monospace;"># per-level fanout = per-level write-amp<br /># therefore wa = number of levels * per-level fanout</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa = n * t^(1/n)<br /><br /># given the function for write-amp as wa = a * b<br /># ... then below is a' * b + a * b'<br />a = n, b = t^(1/n)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa' = t^(1/n) + n * ln(t) * t^(1/n) * (-1) * (1/n^2)<br /><br /># which simplifies to</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa' = t^(1/n) - (1/n) * ln(t) * t^(1/n)<br /><br /># critical point for this occurs when wa' = 0<br />t^(1/n) - (1/n) * ln(t) * t^(1/n) = 0</span><br /><span style="font-family: "courier new" , "courier" , monospace;">t^(1/n) = (1/n) * ln(t) * t^(1/n)<br />1 = (1/n) * ln(t)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">n = ln(t)</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: Times, Times New Roman, serif;">When t = 1024 then n = ln(1024) ~= 6.93. In this case write-amp is minimized when 7 levels are used although 6 isn't a bad choice.</span><br /><br />Assuming the cost function is convex (see below) the critical point is the minimum for write-amp. However, n must be an integer so the number of levels that minimizes write-amp is one of: ceil(ln(t)) or floor(ln(t)).<br /><br />The graph for wa when t=1024 can be viewed <a href="https://www.desmos.com/calculator/dyqkf7irep">thanks to Desmos</a>. 
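A quick numeric check of the result, sketched in Python, shows that n=7 is the best integer choice for t=1024:

```python
import math

t = 1024.0

def wa(n):
    # total write-amp with n levels and per-level fanout t**(1/n)
    return n * t ** (1.0 / n)

print(round(math.log(t), 2))  # ~6.93, the real-valued optimum
for n in (6, 7, 8):
    print(n, round(wa(n), 2), round(t ** (1.0 / n), 2))
```

With t=1024 the integer choice n=7 gives slightly less write-amp (about 18.8) than n=6 or n=8 (about 19.0), and the per-level fanout at n=7 is close to e.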
The function looks convex and I show below that it is.<br /><br /><b>Determine whether critical point is a min or max</b><br /><br />The critical point found above is a minimum for wa if wa is convex so we must show that the second derivative is positive.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">wa = n * t ^ (1/n)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa' = t^(1/n) - (1/n) * ln(t) * t^(1/n)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa' = t^(1/n) * (1 - (1/n) * ln(t))</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /># assuming wa' is a * b then wa'' is a' * b + a * b' </span><br /><span style="font-family: "courier new" , "courier" , monospace;">a = t^(1/n)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">a' = ln(t) * t^(1/n) * -1 * (1/n^2)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">a' = - ln(t) * t^(1/n) * (1/n^2)</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;">b = 1 - (1/n) * ln(t)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">b' = (1/n^2) * ln(t)</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"># a' * b </span><br /><span style="font-family: "courier new" , "courier" , monospace;">- ln(t) * t^(1/n) * (1/n^2) --> called x below</span><br /><span style="font-family: "courier new" , "courier" , monospace;">+ ln(t) * ln(t) * (1/n^3) * t^(1/n) --> called y below</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"># b' * a</span><br /><span style="font-family: "courier new" , "courier" , monospace;">t^(1/n) * 
(1/n^2) * ln(t) --> called z below</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"># therefore wa'' = x + y + z</span><br /><span style="font-family: "courier new" , "courier" , monospace;"># note that x, y and z all contain: t^(1/n), 1/n and ln(t)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa'' = t^(1/n) * (1/n) * ln(t) * (-(1/n) + (ln(t) * 1/n^2) + (1/n))</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa'' = t^(1/n) * (1/n) * ln(t) * ( ln(t) * 1/n^2 )</span><br /><span style="font-family: "courier new" , "courier" , monospace;">wa'' = t^(1/n) * 1/n^3 * ln(t)^2</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: Times, Times New Roman, serif;">Therefore wa'' is positive, wa is convex and the critical point is a minimum value for wa.</span><br /><br /><b>Solve for per-level fanout</b><br /><br />The next step is to determine the value of the per-level fanout when write-amp is minimized. If the number of levels doesn't have to be an integer then this occurs when ln(t) levels are used and below I show that the per-level fanout is e in that case. 
When the number of levels is limited to an integer then the per-level fanout that minimizes write-amp is a value that is close to e.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;"># total write-amp is number of levels * per-level fanout<br />wa = n * t^(1/n)</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /># The per-level fanout is t^(1/n) and wa is minimized when n = ln(t)</span><br /><span style="font-family: "courier new" , "courier" , monospace;"># Therefore we show that t^(1/n) = e when n = ln(t)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">Assume t^(1 / ln(t)) = e</span><br /><span style="font-family: "courier new" , "courier" , monospace;">ln (t^(1 / ln(t))) = ln e</span><br /><span style="font-family: "courier new" , "courier" , monospace;">(1 / ln(t)) * ln(t) = 1</span><br /><span style="font-family: "courier new" , "courier" , monospace;">1=1</span><br /><br />When t=1024 then ln(t) ~= 6.93.
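The effect of rounding the number of levels to an integer is easy to check numerically. This small Python sketch (mine, not from the post) evaluates wa = n * t^(1/n) for t=1024 over integer n and reports the best n, the per-level fanout it implies and ln(t):

```python
import math

t = 1024  # total fanout

# write-amp for an LSM with n levels when the per-level fanout
# is the same between all levels: wa = n * t^(1/n)
def write_amp(n):
    return n * t ** (1.0 / n)

# search integer level counts for the one that minimizes write-amp
best_n = min(range(2, 16), key=write_amp)
print(best_n, t ** (1.0 / best_n), math.log(t))  # 7 levels, fanout ~2.69, ln(t) ~6.93
```

As expected the best integer level count is 7, close to ln(1024), and the resulting per-level fanout is close to e.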
With 7 levels the per-level fanout is t^(1/7) ~= 2.69 while e ~= 2.72.<br />Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com2tag:blogger.com,1999:blog-9149523927864751087.post-13803072068986634382018-12-01T08:39:00.000-08:002019-06-09T04:42:52.636-07:00Pixelbook reviewThis has nothing to do with databases. This is a review of a Pixelbook (Chromebook laptop) that I got on sale last month. This one has a Core i5, 8gb RAM and 128gb storage. It runs Linux too but I haven't done much with that. I expected a lot from this given that my 2013 Nexus 7 tablet is still awesome. I have been mostly happy with the laptop but if you care about keyboards and don't like the new Macs thanks to the butterfly keyboard then this might not be the laptop for you. My complaints:<br /><ol><li>keyboard is hard to read. It is grey on grey and too hard to read when there is light on my back even with the backlight turned all the way up. I don't get it -- grey on grey. So this is a great device for using in a dark room or for improving your touch typing skills.</li><li>touchpad control is too coarse grained so it is either too fast or too slow. The settings have 5 values via a slider (1=slowest, 5=fastest).
I have been using it at 3 which is a bit too fast for me while 2 is a bit too slow. I might go back to 2 but that means picking up my finger more frequently when moving a pointer across the screen.</li><li>no iMessage - my family uses Apple devices and I can't run that here as I can on a Mac laptop.</li><li>the "" ey is flay --> the "k" key is flaky -> spacebar is flaky. Keys go bad for a few days, then get better, repeat. Ugh, this is one-off Google hardware. Maybe they don't want Apple and the butterfly keyboard to have all the fun. Fortunately I bought from an authorized reseller (Best Buy) so the <a href="https://support.google.com/pixelbook/answer/7504309?hl=en">1 year warranty</a> should apply.</li><li>The charger failed; fortunately it is easy to replace.</li></ol><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-_qqlkbJTjxo/XAK4Tpc2RSI/AAAAAAAAYqU/hWjw_XHYVx01Cz_SwF1XU9i75S-x56C4QCLcBGAs/s1600/IMG-0067.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1200" data-original-width="1600" height="240" src="https://4.bp.blogspot.com/-_qqlkbJTjxo/XAK4Tpc2RSI/AAAAAAAAYqU/hWjw_XHYVx01Cz_SwF1XU9i75S-x56C4QCLcBGAs/s320/IMG-0067.JPG" width="320" /></a></div><div><br /></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com1tag:blogger.com,1999:blog-9149523927864751087.post-68062792429821627322018-11-19T12:32:00.000-08:002018-11-19T12:32:44.822-08:00Review of TRIAD: Creating Synergies Between Memory, Disk and Log in Log Structured Key-Value StoresThis is a review of TRIAD, which <a href="https://www.usenix.org/conference/atc17/technical-sessions/presentation/balmau">was published</a> in USENIX ATC 2017. It explains how to reduce write amplification for RocksDB leveled compaction, although the ideas are useful for many LSM implementations. I share a review here because the paper has good ideas.
It isn't easy to keep up with all of the LSM research, even when limiting the search to <a href="https://scholar.google.com/scholar?as_vis=1&q=rocksdb+lsm&hl=en&as_sdt=1,5">papers that reference RocksDB</a>, and I didn't notice this paper until recently.<br /><br />TRIAD reduces write amplification for an LSM with leveled compaction and with a variety of workloads gets up to 193% more throughput, up to 4X less write amplification and spends up to 77% less time doing compaction and flush. Per the <a href="http://daslab.seas.harvard.edu/rum-conjecture/">RUM Conjecture</a> improvements usually come at a cost and the cost in this case is more <a href="http://smalldatum.blogspot.com/2018/03/cache-amplification.html">cache amplification</a> (more memory overhead/key) and possibly more <a href="http://smalldatum.blogspot.com/2018/07/query-cpu-overheads-in-rocksdb.html">read amplification</a>. I assume this is a good tradeoff in many cases.<br /><br />The paper explains the improvements via 3 components -- TRIAD-MEM, TRIAD-DISK and TRIAD-LOG -- that combine to reduce write amplification.<br /><br /><b>TRIAD-MEM</b><br /><br />TRIAD-MEM reduces write-amp by keeping frequently updated keys (hot keys) in the memtable. It divides keys in the memtable into two classes: hot and cold. On flush the cold keys are written into a new L0 SST while the hot keys are copied over to the new memtable. The hot keys must be written again to the new WAL so that the old WAL can be dropped. TRIAD-MEM tries to keep the K hottest keys in the memtable and there is work in progress to figure out a good value for K without being told by the DBA.<br /><br />An extra 4-bytes/key is used for the memtable to track write frequency and identify hot keys. Note that RocksDB already uses 8 bytes/key for metadata.
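A minimal sketch of the TRIAD-MEM flush idea, assuming a dict-based memtable and a per-key write counter. The names (flush, write_counts) and the hot-set size K are my shorthand, not from the paper:

```python
# Hypothetical sketch of a TRIAD-MEM style flush: cold keys go to a
# new L0 SST, hot keys are carried over to the new memtable (and must
# be rewritten to the new WAL so the old WAL can be dropped).
K = 2  # keep the K hottest keys in the memtable

def flush(memtable, write_counts):
    """memtable: dict key->value; write_counts: dict key->write frequency."""
    hot = set(sorted(write_counts, key=write_counts.get, reverse=True)[:K])
    l0_sst = {k: v for k, v in memtable.items() if k not in hot}
    new_memtable = {k: v for k, v in memtable.items() if k in hot}
    return l0_sst, new_memtable

sst, mt = flush({'a': 1, 'b': 2, 'c': 3}, {'a': 9, 'b': 8, 'c': 1})
# hot keys 'a' and 'b' stay in the memtable, cold key 'c' is flushed
```

In the best case, where all keys are hot, nothing is written to the L0 and the per-level write-amp of the flush drops toward 0.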
So TRIAD-MEM has a cost in cache-amp but I don't think that is a big deal.<br /><br />Assuming the per-level write-amp is 1 from the memtable flush this reduces it to 0 in the best case where all keys are hot.<br /><br /><b>TRIAD-DISK</b><br /><br />TRIAD-DISK reduces write-amp by delaying L0:L1 compaction until there is sufficient overlap between keys to be compacted. TRIAD continues to use an L0:L1 compaction trigger based on the number of files in the L0 but can trigger compaction earlier when there is probably sufficient overlap between the L0 and L1 SSTs.<br /><br />Overlap is estimated via <a href="https://en.wikipedia.org/wiki/HyperLogLog">Hyperloglog</a> (HLL), which requires 4kb/SST. In the formula below file-i is the i-th SST under consideration, UniqueKeys is the estimated number of distinct keys across all of the SSTs and Keys(file-i) is the number of keys in the i-th SST. The paper states that both UniqueKeys and Keys are approximated using HLL. But I assume that per-SST metadata already has an estimate or exact value for the number of keys in the SST. The formula for overlap is:<br /> <span style="font-family: "courier new" , "courier" , monospace;">UniqueKeys(file-1, file-2, ... file-n) / sum( Keys( file-i))</span><br /><br />The benefit from early L0:L1 compaction is less read-amp, because there will be fewer sorted runs to search on a query. The cost from always doing early compaction is more per-level write-amp which is estimated by size(L1 input) / size(L0 input). TRIAD-DISK provides the benefit with less cost.<br /><br />In RocksDB today you can manually schedule early compaction by setting the trigger to 1 or 2 files, or you can always schedule it to be less early with a trigger set to 8 or more files. But this setting is static. TRIAD-DISK uses a cost-based approach to do early compaction when it won't hurt the per-level write-amp.
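The overlap formula can be illustrated with a sketch that substitutes exact sets for the HLL estimates (the function name overlap is mine, not from the paper):

```python
# Sketch of the TRIAD-DISK overlap estimate. Real TRIAD uses
# HyperLogLog sketches (~4kb/SST); exact key sets stand in for them here.
def overlap(ssts):
    """ssts: list of key sets, one per SST. Returns UniqueKeys / sum(Keys).
    1.0 means the SSTs are disjoint; a value near 1/len(ssts) means they
    almost fully overlap, so compacting them merges many duplicate keys."""
    unique_keys = len(set().union(*ssts))
    total_keys = sum(len(s) for s in ssts)
    return unique_keys / total_keys

disjoint = [{1, 2}, {3, 4}]
identical = [{1, 2}, {1, 2}]
print(overlap(disjoint), overlap(identical))  # 1.0 0.5
```

A low ratio (lots of overlap) suggests early compaction is cheap per byte written, which is when TRIAD-DISK would trigger it.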
This is an interesting idea.<br /><br /><b>TRIAD-LOG</b><br /><br />TRIAD-LOG explains improvements to memtable flush that reduce write-amp. Data in an L0 SST has recently been written to the WAL. So they use the WAL in place of writing the L0 SST. But something extra, an index into the WAL, is written on memtable flush because everything in the L0 must have an index. The WAL used as an L0 SST (called the CL-SST, for commit log SST) will be deleted when it is compacted into the L1.<br /><br />There is cache-amp from TRIAD-LOG. Each key in the CL-SST (L0) and maybe in the memtable needs 8 extra bytes -- 4 bytes for CL-SST ID, 4 bytes for the WAL offset.<br /><br />Assuming the per-level write-amp is one from the memtable flush for cold keys this reduces that to 0.<br /><br /><b>Reducing write amplification</b><br /><br />The total write-amp for an LSM tree with leveled compaction is the sum of:<br /><ul><li>writing the WAL = 1</li><li>memtable flush = 1</li><li>L0:L1 compaction ~= size(L1) / size(L0)</li><li>Ln compaction for n>1 ~= fanout, the per-level growth factor, usually 8 or 10. Note that <a href="https://hyeontaek.com/papers/msls-fast2016.pdf">this paper</a> explains why it is usually a bit less than fanout.</li></ul><div>TRIAD avoids the write-amp from memtable flush thanks to TRIAD-MEM for hot keys and TRIAD-LOG for cold keys. I will wave my hands and suggest that TRIAD-DISK reduces write-amp for L0:L1 from 3 to 1 based on the typical LSM configuration I use. So TRIAD reduces the total write-amp by 1+2 = 3.<br /><br />Reducing total write-amp by 3 is a big deal when the total write-amp for the LSM tree is small, for example <= 10. But that only happens when there are few levels beyond the L1. Assuming you accept my estimate for total write-amp above then per-level write-amp is ~8 for both L1:L2 and L2:L3. The total write-amp for an LSM tree without TRIAD would be 1+1+3+8 = 13 if the max level is L2 and 1+1+3+8+8 = 21 if the max level is L3.
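The arithmetic above can be written as a small calculator. The function and its constants are my shorthand for the estimates in the text (WAL = 1, flush = 1, L0:L1 = 3, per-level write-amp = 8 for larger levels), not from the paper:

```python
# Sketch of the total write-amp estimate from the list above.
def total_write_amp(levels_beyond_l1, triad=False):
    wal, flush, l0_l1 = 1, 1, 3
    if triad:
        flush = 0   # TRIAD-MEM (hot keys) + TRIAD-LOG (cold keys)
        l0_l1 = 1   # hand-wavy TRIAD-DISK estimate from the text
    return wal + flush + l0_l1 + 8 * levels_beyond_l1

print(total_write_amp(1), total_write_amp(2))  # 13 21
print(total_write_amp(1, triad=True), total_write_amp(2, triad=True))  # 10 18
```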
And then TRIAD reduces that from 13 to 10 or from 21 to 18.<br /><br />But my write-amp estimate above is more true for workloads without skew and less true for workloads with skew. Many of the workloads tested in the paper have a large amount of skew. So while I have some questions about the paper I am not claiming they are doing it wrong. What I am claiming is that the benefit from TRIAD is significant when total write-amp is small and less significant otherwise. Whether this matters is workload dependent. It would help to know more about the LSM tree from each benchmark. How many levels were in the LSM tree per benchmark? What is the per-level write-amp with and without TRIAD? Most of this can be observed from <a href="https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide">compaction statistics</a> provided by RocksDB. The paper has some details on the workloads but that isn't sufficient to answer the questions above.</div><br /><b>Questions</b><br /><br />The paper documents the memory overhead, but limits the definition of read amplification to IO, and measured none. I am interested in IO and CPU and suspect there might be some CPU read-amp from using the commit-log SST in the L0 both for searches and during compaction as logically adjacent data is no longer physically adjacent in the commit-log SST.<br />What is the impact of using more levels?<br /><br />Another question is how far down the LSM compaction occurs. For example if the write working set fits in the L2, should compaction stop at the L2? It might with some values of <a href="http://smalldatum.blogspot.com/2016/02/compaction-priority-in-rocksdb.html">compaction priority</a> in RocksDB but it doesn't for all. When the workload has significant write skew then the write working set is likely to fit into one of the smaller levels of the LSM tree.<br /><br />An interesting variant on this is a workload with N streams of inserts that are each appending (right growing).
When N=1 there is an optimization in RocksDB that limits write-amp to 2 (one for WAL, one for SST). I am not aware of optimizations in RocksDB for N>2 but am curious if we could do something better.Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com4tag:blogger.com,1999:blog-9149523927864751087.post-50873921815814846002018-11-02T10:18:00.000-07:002018-11-02T10:18:45.344-07:00Converting an LSM to a B-Tree and back againI wonder if it is possible to convert an LSM to a B-Tree. The goal is to do it online and in-place -- so I don't want two copies of the database while the conversion is in progress. I am interested in data structures for data management that adapt dynamically to improve performance and efficiency for a given workload. <div><br /></div><div>Workloads change in the short and long term. I hope that data structures can adapt to the change and converting between an LSM and a B-Tree is one way to adapt. This is more likely to be useful when the data structure supports some kind of partitioning in the hope that different workloads can be isolated to different partitions -- and then some can use an LSM while others use a B-Tree.<br /><br /><b>LSM to B-Tree</b></div><div><br /></div><div>A B-Tree is one tree. An LSM is a sequence of trees. Each sorted run in the LSM is a tree. With leveled compaction in RocksDB there are a few sorted runs in level 0 (L0) and then one sorted run in each of L1, L2 up to the max level (Lmax). <div><br /></div><div>A B-Tree persists changes by writing back pages -- either in-place or copy-on-write (<a href="http://smalldatum.blogspot.com/2015/08/different-kinds-of-copy-on-write-for-b.html">UiP or CoW</a>). An LSM persists changes by writing and then re-writing rows. I assume that page write back is required if you want to limit the database to one tree and that row write back implies there will be more than one tree.
</div><div><br /></div><div>There are two things that must be done online and in-place:</div><div><ol><li>Convert the LSM from many trees to one tree</li><li>Convert from row write back to page write back</li></ol></div><div>Note that my goal has slightly changed. I want to move from an LSM to a data structure with one tree. For the one-tree solution a B-Tree is preferred but not required.</div><div><br /></div><div>The outline of a solution:<br /><ol><li>Reconfigure the LSM to use 2 levels -- L0 and L1 -- and 3 trees -- memtable, L0, L1.</li><li>Disable the L0. At this point the LSM has two trees -- memtable and L1.</li><li>Flush the memtable and merge it into the L1. Now there is one tree.</li><li>After the flush disable the memtable and switch to a page cache. Changes now require a copy of the L1 block in the page cache that eventually gets written back via UiP or CoW.</li></ol><div>The outline above doesn't explain how to maintain indexes for the L1. Note that after step 2 there is one tree on disk and the layout isn't that different from the leaf level of a B-Tree. The interior levels of the B-Tree could be created by reading/rewriting the block indexes stored in the SSTs.</div></div><div><br /></div><div><b>B-Tree to LSM</b></div><div><br />The conversion can also be done in the opposite direction (B-Tree to LSM):</div><div><ol><li>Treat the current B-Tree as the max level of the LSM tree. While it might help to flush the page cache I don't think that is required.
This is easier to do when your LSM uses a B-Tree per level, as done by WiredTiger.</li><li>Record new changes for insert, update, delete in a memtable rather than a page cache.</li><li>When the memtable is full then flush it to create a new tree (sorted run, SST) on disk.</li><li>Eventually start to do compaction.</li></ol></div></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com2tag:blogger.com,1999:blog-9149523927864751087.post-22596677300676814792018-10-19T16:34:00.000-07:002018-10-19T16:34:21.871-07:00Combining tiered and leveled compactionThere are simple optimization problems for LSM tuning. For example use leveled compaction to minimize space amplification and use tiered to minimize write amplification. But there are interesting problems that are harder to solve:<br /><ol><li>maximize throughput given a constraint on write and/or space amplification</li><li>minimize space and/or write amplification given a constraint on read amplification</li></ol><div>To solve the first problem use leveled compaction if it can satisfy the write amp constraint, else use tiered compaction if it can satisfy the space amp constraint, otherwise there is no solution. The lack of a solution might mean the constraints are unreasonable but it can also mean we need to enhance LSM implementations to support more <a href="http://smalldatum.blogspot.com/2018/10/describing-tiered-and-leveled-compaction.html">diversity in LSM tree shapes</a>. Even when there is a solution using leveled or tiered compaction there are solutions that would do much better were an LSM to support more varieties of tiered+leveled and leveled-N.</div><div><br /></div><div>When I mention <i>solved</i> above I leave out that there is more work to find a solution even when tiered or leveled compaction is used. For both there are decisions about the number of levels and per-level fanout. 
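The decision rule for the first problem can be sketched as follows. The amplification estimates used here are illustrative assumptions, not measured values:

```python
# Hypothetical sketch of the rule above: pick a compaction type that
# can satisfy the write-amp and space-amp constraints, preferring
# leveled (better space-amp) when its write-amp is acceptable.
def choose_compaction(max_write_amp, max_space_amp,
                      leveled_write_amp=20, tiered_space_amp=2):
    if leveled_write_amp <= max_write_amp:
        return "leveled"   # satisfies the space-amp constraint easily
    if tiered_space_amp <= max_space_amp:
        return "tiered"    # low write-amp but higher space-amp
    return None            # no solution with pure leveled or pure tiered

print(choose_compaction(max_write_amp=30, max_space_amp=1.2))  # leveled
print(choose_compaction(max_write_amp=10, max_space_amp=3))    # tiered
print(choose_compaction(max_write_amp=10, max_space_amp=1.2))  # None
```

The None case is where hybrid shapes like tiered+leveled and leveled-N would widen the solution space.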
If <a href="http://smalldatum.blogspot.com/2018/10/minimizing-write-amplification-in-lsm_3.html">minimizing write amp</a> is the goal then that is a solved problem. But there are usually more things to consider.</div><div><br /><b>Tiered+leveled</b></div><div><br />I defined tiered+leveled and leveled-N in a <a href="http://smalldatum.blogspot.com/2018/10/describing-tiered-and-leveled-compaction.html">previous post</a>. They occupy the middle ground between tiered and leveled compaction with better read efficiency than tiered and better write efficiency than leveled. They are not supported today by popular LSM implementations but I think they can and should be supported. </div><div><br />While we tend to explain compaction as a property of an LSM tree (all tiered or all leveled) it is really a property of a level of an LSM tree and RocksDB already supports hybrids, combinations of tiered and leveled. For tiered compaction in RocksDB all levels except the largest use tiered. The largest level is usually configured to use leveled to reduce space amp. For leveled compaction in RocksDB all levels except the smallest use leveled and the smallest (L0) uses tiered.</div><div><br />So tiered+leveled isn't new but I think we need more flexibility. When a string of T and L is created from the per-level compaction choices then the regex for the strings that RocksDB supports is T+L or TL+. I want to support T+L+. I don't want to support cases where leveled is used for a smaller level and tiered for a larger level. So I like TTLL but not LTTL. My reasons for not supporting LTTL are:</div><div><ol><li>The benefit from tiered is less write amp and is independent of the level on which it is used. The reduction in write amp is the same whether tiered is used for L1, L2 or L3.</li><li>The cost from tiered is more read and space amp and that is dependent on the level on which it is used. The cost is larger for larger levels. 
When space amp is 2 more space is wasted on larger levels than smaller levels. More IO read amp is worse for larger levels because they have a lower hit rate than smaller levels and more IO will be done. More IO implies more CPU cost from decompression and the CPU overhead of performing IO.</li></ol><div>From above the benefit from using T is the same for all levels but the cost increases for larger levels so when T and L are both used then T (tiered) should be used on the smaller levels and L (leveled) on the larger levels.</div></div><div><br /></div><div><b>Leveled-N</b></div><div><br /></div><div>I defined leveled-N in a <a href="http://smalldatum.blogspot.com/2018/10/describing-tiered-and-leveled-compaction.html">previous post</a>. Since then a co-worker, <a href="https://twitter.com/MaysamYabandeh">Maysam Yabandeh</a>, explained to me that a level that uses leveled-N can also be described as two levels where the smaller uses leveled and the larger uses tiered. So leveled-N might be syntactic sugar in the LSM tree configuration language.</div><div><br /></div><div>For example with an LSM defined using the triple syntax <a href="https://twitter.com/MaysamYabandeh">from here</a> as (compaction type, fanout, runs-per-level) then this is valid: (T,1,8) (T,8,2) (L,8,2) (L,8,1) and has total fanout of 512 (8 * 8 * 8). The third level (L,8,2) uses leveled-N with N=2. Assuming we allow LSM trees where T follows L then the leveled-N level can be replaced with two levels: (L,8,1) (T,1,8). Then the LSM tree is defined as (T,1,8) (T,8,2) (L,8,1) (T,1,8) (L,8,1). These LSM trees have the same total fanout and total read/write/space amp. Compaction from (L,8,1) to (T,1,8) is special. It has zero write amp because it is done by a file move rather than merging/writing data so all that must be updated is LSM metadata to record the move.</div><div><br />So in general I don't support T after L but I do support it in the special case. 
Of course we can pretend the special case doesn't exist if we use the syntactic sugar provided by leveled-N. But I appreciate that Maysam discovered this.<br /></div><div><br /></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-90342570708996548822018-10-03T11:47:00.000-07:002018-10-30T13:35:17.825-07:00Minimizing write amplification in an LSMWrite-amplification for an LSM with leveled compaction is minimized when the per-level growth factor (fanout) is the same between all levels. This is a result for an LSM tree using a given number of levels. To find the minimal write-amplification for any number of levels this result can be repeated for 2, 3, 4, ... up to a large value. You might find that a large number of levels is needed to get the least write-amp and that comes at the <a href="http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-pick-2_23.html">price of more read-amp</a>, as the <a href="http://daslab.seas.harvard.edu/rum-conjecture/">RUM Conjecture</a> predicts.<br /><br />In all cases below I assume that compaction into the smallest level (from a write buffer flush) has no write-amp. This is done to reduce the size of this blog post.<br /><br />tl;dr - for an LSM with L1, L2, L3 and L4 what values for the per-level fanout minimize write-amp when the total fanout is 1000?<br /><ul><li>(10, 10, 10) for leveled</li><li>(6.3, 12.6, 12.6) for leveled-N assuming two of the levels have 2 sorted runs</li><li>(>1, >1, >1) for tiered</li></ul><br /><b>Minimizing write-amp for leveled compaction</b><br /><br />For an LSM with 4 levels (L1, L2, L3, L4) there is a per-level fanout between L1:L2, L2:L3 and L3:L4. Assume this uses <a href="http://smalldatum.blogspot.com/2018/06/the-original-lsm-paper.html">classic leveled</a> compaction so the total fanout is size(L4) / size(L1). The product of the per-level fanouts must equal the total fanout.
The total write-amp is the sum of the per-level write-amp. I assume that the per-level write amp is the same as the per-level fanout although in practice and <a href="https://hyeontaek.com/papers/msls-fast2016.pdf">in theory</a> it isn't that simple. Let's use a, b and c as the variables for the per-level fanout (write-amp); then the math problem is:<br /><ol><li>minimize a+b+c</li><li>such that a*b*c=k and a, b, c > 1</li></ol>While I have been working on my math skills this year they aren't great and corrections are welcome. This is a <a href="https://en.wikipedia.org/wiki/Constrained_optimization">constrained optimization</a> problem that can be solved using <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange Multipliers</a>. From above #1 is the sum of per-level write-amp and #2 means that the product of per-level fanout must equal the total fanout. The last constraint is that a, b and c must (or should) all be > 1.<br /><div><br /></div><div>This result uses Lagrange Multipliers for an LSM tree with 4 levels so there are 3 variables: a, b, c. But the math holds for an LSM tree with fewer levels or with more levels.
If there are N levels then there are N-1 variables.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">L(a, b, c) = a + b + c - lambda * (a*b*c - k)</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">dL/da = 1 - lambda * bc</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">dL/db = 1 - lambda * ac</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">dL/dc = 1 - lambda * ab</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">then</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">lambda = 1/bc = 1/ac = 1/ab</span><br /><span style="font-family: "courier new" , "courier" , monospace;">bc == ac == ab</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">and a == b == c to minimize the sum in #1</span></div><div><br /></div><div>I wrote a <a href="https://github.com/mdcallag/mytools/blob/master/scripts/index_structures/minwa.py">Python script</a> to discover the (almost) best values and the results match the math above.<br /><br /><b>Minimizing write-amp for tiered compaction</b><br /><br />Assuming you can reason about tiered compaction using the notion of levels then the math changes a bit because the per-level write-amp with tiered equals 1 regardless of the per-level fanout. For tiered with 4 levels and 3 variables the problem is:<br /><ol><li>minimize 1+1+1</li><li>such that a*b*c = k and a, b, c > 1</li></ol><div>Any values for a, b and c are sufficient as long as they satisfy the constraints in #2. But it still helps to minimize a+b+c if that predicts read-amp because a, b and c are also the number of sorted runs in L2, L3 and L4.
So my advice is to use a == b == c in most cases.<br /><br /><b>Minimizing write-amp for leveled-N compaction</b><br />I explain leveled-N compaction <a href="http://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html">here</a> and <a href="http://smalldatum.blogspot.com/2018/10/describing-tiered-and-leveled-compaction.html">here</a>. It is like leveled compaction but allows a level to have more than one sorted run. This reduces the per-level write-amp at the cost of more read-amp. Sometimes that is a good trade.<br /><br />The math above can also be used to determine how to configure per-level fanout to minimize write-amp for leveled-N. Assume an LSM tree with 4 levels (L1, L2, L3, L4) and 2 sorted runs in L2 and L3. The problem is:<br /><ol><li>minimize a + b/2 + c/2</li><li>such that a*b*c = k and a, b, c > 1</li></ol>For leveled compaction I assume that per-level write-amp is all-size(Ln+1) / all-size(Ln) for compaction from Ln into Ln+1. For leveled-N I assume it is run-size(Ln+1) / all-size(Ln) where all-size is the size of all sorted runs on that level and run-size is the size of one sorted run. The astute reader might notice that all-size(Ln) == run-size(Ln) for traditional leveled. For leveled-N I assume that fanout continues to be run-size(Ln+1) / run-size(Ln).<br /><br />Therefore with leveled-N the per-level write-amp is b/2 for L2 to L3 and c/2 for L3 to L4 because there are 2 sorted runs in the compaction input (twice as much data) in those cases. 
Were there 3 sorted runs then the values would be b/3 and c/3.<br /><br />Lagrange Multipliers can be used to solve this assuming we want to minimize a + b/2 + c/2.<br /><br /><div><span style="font-family: "courier new" , "courier" , monospace;">L(a, b, c) = a + b/2 + c/2 - lambda * (a*b*c - k)</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">dL/da = 1 - lambda * bc</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">dL/db = 1/2 - lambda * ac</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">dL/dc = 1/2 - lambda * ab</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">then</span></div><div><span style="font-family: "courier new" , "courier" , monospace;">lambda = 1/bc = 1/2ac = 1/2ab</span><br /><span style="font-family: "courier new" , "courier" , monospace;">bc == 2ac -> b == 2a</span><br /><span style="font-family: "courier new" , "courier" , monospace;">bc == 2ab -> c == 2a</span><br /><span style="font-family: "courier new" , "courier" , monospace;">2ac == 2ab -> c == b </span></div><div><span style="font-family: "courier new" , "courier" , monospace;">and 2a == b == c to minimize the sum</span></div><br />If the total fanout is 1000 then the per-level fanout values that minimize write-amp are 10, 10, 10 for leveled and 6.30, 12.60, 12.60 for this example with leveled-N and can be computed by "bc -l"<br /><div class="p1"><span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: small;"># for leveled-N<br />e(l(1000/4)/3)</span></span></div><div class="p1"><span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: small;">6.29960524947436582381</span></span></div><div class="p2"><span style="font-family: "courier new" , "courier" , monospace; font-size: small;"><span class="s1"></span><br /></span></div><div class="p1"><span class="s1"><span style="font-family: 
"courier new" , "courier" , monospace; font-size: small;">e(l(1000/4)/3) * 2</span></span><br /><span style="font-family: "courier new" , "courier" , monospace;">12.59921049894873164762</span></div><span style="font-family: "courier new" , "courier" , monospace;"><br /># and for leveled<br />e(l(1000)/3)</span><br /><span style="background-color: white; font-family: "courier new" , "courier" , monospace;">9.99999999999999999992</span><br /><br />One way to think of this result is that with leveled compaction the goal is to use the same per-level fanout between levels. This also uses the same per-level write-amp between levels because per-level write-amp == the per-level fanout for leveled.<br /><br />But with leveled-N compaction we need to adjust the per-level fanout for levels to continue to get the same per-level write-amp between levels.</div><br /><br /></div><style type="text/css">p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 13.0px Menlo; color: #000000; background-color: #ffffff} span.s1 {font-variant-ligatures: no-common-ligatures} </style>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-50472360374591024912018-10-02T16:27:00.000-07:002018-10-02T16:27:15.687-07:00Describing tiered and leveled compactionThis is another attempt by me to define the shape of an LSM tree with more formality and this builds on previous posts <a href="http://smalldatum.blogspot.com/2018/07/tiered-or-leveled-compaction-why-not.html">here</a> and <a href="http://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html">here</a>. My key point is that compaction is the property of a level in an LSM tree rather than the LSM tree. Some levels can use tiered and others can use leveled. 
This combination of tiered and leveled is already done in popular LSM implementations but it hasn't been called out as a feature.<br /><br /><b>Stepped Merge</b><br /><br />The <a href="https://www.cse.iitb.ac.in/~sudarsha/Pubs-dir/indexbuffering-vldb97.pdf">Stepped Merge paper</a> might have been the first description of tiered compaction. It is a way to improve B-Tree insert performance. It looked like an LSM tree with a few sorted runs at each level. When a level was full the sorted runs at that level were merged to create a larger sorted run in the next level. The per-level <a href="http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html">write-amplification</a> was 1 because compaction into level N+1 merged runs from level N but did not read/rewrite a run already on level N+1.<br /><br />This <a href="http://smalldatum.blogspot.com/2018/09/review-of-slimdb-from-vldb-2018.html">looks like</a> tiered compaction. However it allows for N sorted runs on the max level which means that space-amplification will be >= N. I assume that is too much for most users of tiered compaction in Cassandra, RocksDB and HBase. But this isn't a problem for Stepped Merge because it is an algorithm for buffering changes to a B-Tree, not for storing the entire database and it doesn't lead to a large space-amp for that workload. Note that the <a href="https://dev.mysql.com/doc/refman/5.5/en/innodb-insert-buffering.html">InnoDB change buffer</a> is a B-Tree that buffers changes to other B-Trees for a similar reason.<br /><br /><b>Compaction per level</b><br /><br />I prefer to define compaction as a property of a level in an LSM tree rather than a property of the LSM tree. Unfortunately this can hamper discussion because it takes more time and text to explain compaction per level.<br /><br />I will start with definitions:<br /><ol><li>When a level is full then compaction is done from it to the next larger level. 
For now I ignore compaction across many levels, but that is a thing (see "major compaction" in HBase).</li><li>A sorted run is a sequence of key-value pairs stored in key order. It is stored using 1+ files.</li><li>A level is tiered when compaction into it doesn't read/rewrite sorted runs already in that level. </li><li>A level is leveled when compaction into that level reads/rewrites sorted runs already in that level.</li><li>Levels are full when they have a configurable number of sorted runs. In <a href="http://smalldatum.blogspot.com/2018/06/the-original-lsm-paper.html">classic leveled compaction</a> a level has one sorted run. A tiered level is full when it has X sorted runs where X is some value >= 2. </li><li><a href="http://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html">leveled-N</a> uses leveled compaction which reads/rewrites an existing sorted run, but it allows N sorted runs (full when runs == N) rather than 1. </li><li>The per level fanout is size(sorted-run in level N) / size(sorted-run in level N-1)</li><li>The total fanout is the product of the per level fanouts. When the write buffer is 1G and the database is 1000G then the total fanout must be 1000.</li><li>The runs-per-level is the number of sorted runs in a level when it is full.</li><li>The per level write-amplification is the work done to compact from Ln to Ln+1. It is 1 for tiered, all-size(Ln+1) / all-size(Ln) for leveled and run-size(Ln+1) / all-size(Ln) for leveled-N where run-size is the size of a sorted run and all-size is the sum of the sizes of all sorted runs on a level.</li></ol><div>A level can be described by a 3-tuple (c, f, r) where c is the type of compaction (T or L for tiered or leveled), f is the fanout and r is the runs-per-level. Unfortunately, now we have made the description of an LSM tree even more complex because there is a 3-tuple per level. For now I don't use 3-tuples to describe the write buffer (memory component). 
That is a topic for another post. Example 3-tuples include:</div><div><ul><li>T:1:4 - this is tiered with fanout=1 and runs-per-level=4. It is a common configuration for the RocksDB level 0 (L0) where the fanout is 1 because the compaction input is a write buffer flush so the size of a sorted run in L0 is similar to the size of a full write buffer. For now I ignore that RocksDB can merge write buffers on a flush.</li><li>T:8:8 - this is tiered with fanout=8 and runs-per-level=8. When Ln and Ln+1 both use tiered then runs-per-level in Ln == fanout in Ln+1. </li><li>T:8:4 - this is tiered with fanout=8 and runs-per-level=4. It might be used when the next larger level uses leveled and runs-per-level on this level can be smaller than fanout to reduce read-amp.</li><li>L:10:1 - this is common in RocksDB with leveled compaction, fanout=10 and runs-per-level=1</li><li>L:10:2 - this is leveled-N with runs-per-level=2</li></ul></div><div><b><br /></b><b>Compaction per LSM tree</b></div><div><br />An LSM tree can be described using the per level 3-tuples from the previous section. The following are examples for popular algorithms.<br /><br /><a href="http://smalldatum.blogspot.com/2018/06/the-original-lsm-paper.html">Classic LSM</a> with total fanout = 1000 is:<br /><ul><li>C0 is the write buffer</li><li>C1, C2 and C3 are L:10:1</li></ul></div><div>RocksDB leveled with total fanout = 1000 is:</div><div><ul><li>L0 is T:1:4</li><li>L1 is L:1:1</li><li>L2, L3, L4 are L:10:1</li></ul><div>Stepped Merge with total fanout = 1000 is:<br /><ul><li>L1 is T:1:10</li><li>L2, L3, L4 are T:10:10</li></ul><div>Tiered in HBase and Cassandra with total fanout = 1000 might be:<br /><ul><li>L1 is T:1:10</li><li>L2, L3 are T:10:10</li><li>L4 is L:10:1</li></ul></div></div></div><div><b><br /></b><b>Tiered+leveled</b><br /><br />Note that some smaller levels using tiered and some larger levels using leveled is done by both RocksDB leveled and Cassandra/HBase tiered. 
I think both of these are examples of an LSM variant that I call tiered+leveled but I won't ask any of the projects to update their docs. My definition of tiered+leveled is that the smallest (1 or more) levels use tiered compaction, then 0 or more levels use leveled-N, then the remaining levels use leveled. When tiered=T, leveled=L and leveled-N=N then the regex for this is T+N*L+.<br /><br />I don't think it is a good idea for leveled levels to precede tiered levels in tiered+leveled (TTL is OK, LTL is not) but that is a topic for another post.<br /><br />The largest level should use leveled compaction with runs-per-level=1 to avoid too much space amplification.<br /><br />LSM trees with tiered+leveled can be described using 3-tuples and the previous section does that, but here I provide one for a tree that uses leveled-N for L1 and L2 with total fanout = 1000:<br /><ul><li>L0 is T:1:4</li><li>L1 is L:1:2</li><li>L2 is L:10:2</li><li>L3, L4 are L:10:1</li></ul><br />And another example to show that tiered levels don't have to use the same fanout or runs-per-level, but runs-per-level for Ln == fanout for Ln+1:<br /><ul><li>L0 is T:1:20</li><li>L1 is T:20:10</li><li>L2 is T:10:2</li><li>L3 is L:5:1</li></ul><b><br /></b><b>Leveled-N</b><br /><br />Leveled-N can reduce the per-level write-amp at the cost of increasing the per-level read-amp.<br /><br />The regex for an LSM tree that uses leveled-N is N+L+ (see the previous section). The largest level should use leveled compaction with runs-per-level=1 to avoid too much space amplification.
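To make the trade-off concrete, here is a small Python sketch of the definitions from the earlier section (per-level write-amp is all-size(Ln+1)/all-size(Ln) for leveled and run-size(Ln+1)/all-size(Ln) for leveled-N). This is my own model, not code from RocksDB or any other implementation, and it assumes Ln and Ln+1 use the same runs-per-level.

```python
# Model of per-level write-amp and a read-amp proxy for leveled vs leveled-N.
# With fanout = run-size(Ln+1) / run-size(Ln) and the same runs-per-level r
# on Ln and Ln+1, both formulas above reduce to fanout / r (r=1 for leveled).

def per_level_write_amp(fanout, runs_per_level):
    # leveled:   all-size(Ln+1) / all-size(Ln) = fanout
    # leveled-N: run-size(Ln+1) / all-size(Ln) = fanout / runs_per_level
    return fanout / runs_per_level

def runs_to_check(levels):
    # read-amp proxy: a point query may have to check every sorted run
    return sum(r for _f, r in levels)

# (fanout, runs-per-level) per level; total fanout = 1000 in both cases
leveled = [(10, 1), (10, 1), (10, 1)]      # L:10:1 on L1..L3
leveled_n = [(10, 2), (10, 2), (10, 1)]    # L:10:2 on L1 and L2

for name, cfg in (("leveled", leveled), ("leveled-N", leveled_n)):
    wamp = [per_level_write_amp(f, r) for f, r in cfg]
    print(name, "per-level write-amp:", wamp, "sorted runs:", runs_to_check(cfg))
```

With runs-per-level=2 on L1 and L2 the per-level write-amp there drops from 10 to 5, while the number of sorted runs a read may have to check grows from 3 to 5 -- which is the write-amp vs read-amp trade described above.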
An example 3-tuple for leveled-N with total fanout = 1000 that has runs-per-level=2 for L1 and L2 is:<br /><ul><li>L1 is L:10:2</li><li>L2 is L:10:2</li><li>L3 is L:10:1</li></ul></div>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-8269597228709268102018-10-01T10:44:00.000-07:002018-10-02T16:31:20.680-07:00Transaction Processing in NewSQLThis is a list of references for transaction processing in NewSQL systems. The work is exciting. I don't have much to add and wrote this to avoid losing interesting links. My focus is on OLTP, but some of these systems support more than that.<br /><br />By NewSQL I mean the following. I am not trying to define "NewSQL" for the world:<br /><ol><li>Support for multiple nodes because the storage/compute on one node isn't sufficient.</li><li>Support for SQL with ACID transactions. If there are shards then cross-shard operations can be consistent and isolated.</li><li>Replication does not prevent the properties listed above when you are willing to pay the price in commit overhead. Alas, synchronous geo-replication is slow, and too-slow commit is another form of downtime. I hope NewSQL systems make this less of a problem (async geo-replication for some or all commits, commutative operations). Contention and conflict are common in OLTP and it is important to understand the minimal time between commits to a single row or the max number of commits/second to a single row.</li></ol>NewSQL Systems<br /><ul><li><a href="https://en.wikipedia.org/wiki/MySQL_Cluster">MySQL Cluster</a> - this was NewSQL before NewSQL was a thing. There is a <a href="https://www.amazon.com/MySQL-Cluster-7-5-Inside-Out/dp/9176998142">nice book</a> that explains the internals. There is <a href="https://www.hops.io/">a company</a> that uses it to <a href="https://medium.com/@jim_dowling/introducing-hops-hadoop-120c30d02676">make HDFS better</a>.
Cluster seems to be more popular for uses other than web-scale workloads.</li><li><a href="https://en.wikipedia.org/wiki/VoltDB">VoltDB</a> - another early NewSQL system that is still <a href="http://voltdb.com/">getting better</a>. It was after MySQL Cluster but years before Spanner and came out of the <a href="http://hstore.cs.brown.edu/">H-Store research effort</a>.</li><li><a href="https://en.wikipedia.org/wiki/Spanner_(database)">Spanner</a> - XA across-shards, Paxos across replicas, special hardware to reduce clock drift between nodes. Sounds amazing, but this is Google so it just works. See the papers that explain <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf">the system</a> and <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46103.pdf">support for SQL</a>. This got the NewSQL movement going.</li><li>CockroachDB - the answer to implementing Spanner <a href="https://www.cockroachlabs.com/blog/living-without-atomic-clocks/">without GPS and atomic clocks</a>. From that URL they explain it as "while Spanner always waits after writes, CockroachDB sometimes waits before reads". It uses RocksDB and they help make it better.</li><li>FaunaDB - FaunaDB is inspired by Calvin and Daniel Abadi explains the difference between it and Spanner -- <a href="https://fauna.com/blog/distributed-consistency-at-scale-spanner-vs-calvin">here</a> and <a href="http://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html">here</a>. Abadi is great at explaining distributed systems, see his work <a href="http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html">on PACELC</a> (and <a href="http://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf">the pdf</a>). A key part of Calvin is that "Calvin uses preprocessing to order transactions. All transactions are inserted into a distributed, replicated log before being processed." 
This approach might limit the peak TPS on a large cluster, but I assume that doesn't matter for a large fraction of the market.</li><li>YugaByte - another <a href="https://docs.yugabyte.com/latest/architecture/concepts/persistence/">user of RocksDB</a>. There is much discussion about it in the <a href="http://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html">recent Abadi post</a>. Their docs are amazing -- <a href="https://www.slideshare.net/YugaByte/yugabyte-db-architecture-storage-engine-and-transactions">slides</a>, <a href="https://docs.yugabyte.com/latest/architecture/transactions/transactional-io-path/">transaction IO path</a>, <a href="https://docs.yugabyte.com/latest/architecture/core-functions/write-path/">single-shard write IO path</a>, <a href="https://docs.yugabyte.com/latest/architecture/transactions/distributed-txns/">distributed ACID</a> and <a href="https://docs.yugabyte.com/latest/architecture/transactions/single-row-transactions/">single-row ACID</a>.</li><li><a href="https://github.com/pingcap/tidb">TiDB</a> - I don't know much about it but they are <a href="https://techcrunch.com/2018/09/11/tidb-developer-pingcap-wants-to-expand-in-north-america-after-raising-50m-series-c/">growing fast</a> and are part of the <a href="https://www.percona.com/live/17/sessions/tidb-newsql-database-compatible-mysql">MySQL community</a>. It uses RocksDB (I shouldn't have forgotten that).</li></ul>Other relevant systems<br /><ul><li><a href="https://www.foundationdb.org/">FoundationDB</a> - I am curious where this goes given the competition explained above.</li><li><a href="https://en.wikipedia.org/wiki/Amazon_Aurora">Aurora</a> - not NewSQL yet because this doesn't scale across nodes. It does support large nodes and that might be sufficient for a large part of the market. 
But Amazon moves fast (see the <a href="https://aws.amazon.com/blogs/aws/new-parallel-query-for-amazon-aurora/">new parallel query feature</a>) so I wouldn't be surprised if this became NewSQL one day. I appreciate that they have begun to explain the internals -- <a href="https://dl.acm.org/citation.cfm?id=3056101">here</a> and <a href="https://dl.acm.org/citation.cfm?id=3196937">here</a>.</li><li>MongoDB - not SQL, but starting to get interesting with the new features for <a href="http://smalldatum.blogspot.com/2015/10/losing-it.html">read and write concerns</a>. There is also new support for <a href="https://docs.mongodb.com/manual/core/read-isolation-consistency-recency/#causal-consistency">causal consistency</a> and <a href="https://docs.mongodb.com/manual/core/retryable-writes/">retryable writes</a>.</li><li>Clustrix - a NewSQL system that is now <a href="https://techcrunch.com/2018/09/20/mariadb-acquires-clusterix/">part of MariaDB</a>. Maybe this becomes open source.</li><li><a href="http://kudu.apache.org/">Kudu</a> - <a href="https://kudu.apache.org/kudu.pdf">awesome paper</a>, interesting <a href="http://users.ece.utexas.edu/~garg/pdslab/david/hybrid-time-tech-report-01.pdf">research on HybridTime</a>, useful docs <a href="https://kudu.apache.org/docs/transaction_semantics.html#1">on the internals</a>.</li><li><a href="https://vitess.io/overview/">Vitess</a> - was created to scale MySQL for Youtube. Now is part of CNCF, backed by a startup and used by many companies. Cross-shard <a href="https://vitess.io/user-guide/twopc/">writes are atomic</a>, but isolation is weaker.</li><li><a href="https://www.splicemachine.com/">Splice Machine</a> - SQL on HBase. Summary is "100% ACID via snapshot isolation with optimistic concurrency via write-write conflicts" and details <a href="https://doc.splicemachine.com/developers_fundamentals_transactions.html">are here</a>. 
Has integration to use Spark for OLAP, so <a href="https://www.splicemachine.com/defining-htap/">this is HTAP</a>.</li></ul>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0tag:blogger.com,1999:blog-9149523927864751087.post-545027977452958192018-09-19T15:01:00.002-07:002018-09-19T15:01:43.963-07:00Durability debtI define <i>durability debt</i> to be the amount of work that must be done to persist changes that have been applied to a database. Dirty pages must be written back for a B-Tree. Compaction must be done for an LSM. Durability debt has IO and CPU components. The common IO overhead is from writing something back to the database. The common CPU overhead is from computing a checksum and optionally from compressing data.<br /><br />From an incremental perspective (pending work per modified row) an LSM usually has less IO and more CPU durability debt than a B-Tree. From an absolute perspective the maximum durability debt can be much larger for an LSM than a B-Tree, which is one reason why tuning can be more challenging for an LSM than a B-Tree.<br /><br />In this post by LSM I mean LSM with leveled compaction.<br /><br /><b>B-Tree</b><br /><br />The maximum durability debt for a B-Tree is limited by the size of the buffer pool. If the buffer pool has N pages then there will be at most N dirty pages to write back. If the buffer pool is 100G then there will be at most 100G to write back. The IO is more random or less random depending on whether the B-Tree is update-in-place, <a href="http://smalldatum.blogspot.com/2015/08/different-kinds-of-copy-on-write-for-b.html">copy-on-write random or copy-on-write sequential</a>. I prefer to describe this as small writes (page at a time) or large writes (many pages grouped into a larger block) rather than random or sequential. InnoDB uses small writes and WiredTiger uses larger writes.
The distinction between small writes and large writes is more important with disks than with SSD.<br /><br />There is a small CPU overhead from computing the per-page checksum prior to write back. There can be a larger CPU overhead from compressing the page. Compression isn't popular with InnoDB but is popular with WiredTiger.<br /><br />There can be an additional IO overhead when torn-write protection is enabled, as provided by the InnoDB <a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-doublewrite-buffer.html">double write buffer</a>.<br /><br /><b>LSM</b><br /><br />The durability debt for an LSM is the work required to compact all data into the max level (Lmax). A byte in the write buffer causes more debt than a byte in the L1 because more work is needed to move the byte from the write buffer to Lmax than from L1 to Lmax.<br /><br />The maximum durability debt for an LSM is limited by the size of the storage device. Users can configure RocksDB such that the level 0 (L0) is huge. Assume that the database needs 1T of storage were it compacted into one sorted run and the write-amplification to move data from the L0 to the max level (Lmax) is 30. Then the maximum durability debt is 30 * sizeof(L0). The L0 is usually configured to be <= 1G, in which case the durability debt from the L0 is <= 30G. But were the L0 configured to be <= 1T then the debt from it could grow to 30T.<br /><br />I use the notion of per-level write-amp to explain durability debt in an LSM. Per-level write-amp is defined in the next section. Per-level write-amp is a proxy for all of the work done by compaction, not just the data to be written. When the per-level write-amp is X, then during compaction from Ln to Ln+1, for every key-value pair from Ln there are ~X key-value pairs from Ln+1 for which work is done, including:<br /><ul><li>Read from Ln+1. If Ln is a small level then the data is likely to be in the LSM block cache or OS page cache. Otherwise it is read from storage.
Some reads will be cached, but all writes go to storage. So the write rate to storage is > the read rate from storage.</li><li>If the level is compressed then the key-value pairs are decompressed for each block not in the LSM block cache.</li><li>The key-value pairs from Ln+1 are merged with Ln. Note that this is a merge, not a merge sort, because the inputs are ordered. The number of comparisons might be less than you expect because one iterator is ~X times larger than the other and there are optimizations for that.</li></ul><div>The output from the merge is then compressed and written back to Ln+1. Some of the work above (reads, decompression) is also done for Ln but most of the work comes from Ln+1 because it is many times larger than Ln. I stated above that an LSM usually has less IO and more CPU durability debt per modified row. The extra CPU overheads come from decompression and the merge. I am not sure whether to count the compression overhead as extra.<br /><br />Assuming the per-level growth factor is 10 and f is 0.7 (see below), the per-level write-amp is 7 for L1 and larger levels. If sizeof(L1) == sizeof(L0) then the per-level write-amp is 2 for the L0. And the per-level write-amp is always 1 for the write buffer.<br /><br />From this we can estimate the pending write-amp for data at any level in the LSM tree.</div><div><ol><li>Key-value pairs in the write buffer have the most pending write-amp because they are furthest from the max level. Key-value pairs in the max level (L5 in this case) have none.
</li><li>Starting with the L2 there is more durability debt from a full Ln+1 than a full Ln -- while there is more pending write-amp for Ln, there is more data in Ln+1.</li><li>Were I given the choice of L1, L2, L3 and L4 when first placing a write in the LSM tree then I would choose L4 as that has the least pending write-amp.</li><li>Were I to choose to make one level 10% larger then I prefer to do that for a smaller level given the values in the <i>rel size X pend</i> column.</li></ol></div><br /><pre style="font-family: 'Courier New', Courier, monospace; font-size: small;">
legend:
w-amp per-lvl   : per-level write-amp
w-amp pend      : write-amp to move a byte to Lmax from this level
rel size        : size of level relative to write buffer
rel size X pend : write-amp to move all data from that level to Lmax

        w-amp    w-amp   rel     rel size
level   per-lvl  pend    size    X pend
-----   -------  -----   -----   --------
wbuf    1        31          1         31
L0      2        30          4        120
L1      7        28          4        112
L2      7        21         40        840
L3      7        14        400       5600
L4      7         7       4000      28000
L5      0         0      40000          0
</pre><br /><b>Per-level write-amp in an LSM</b><br /><br />The per-level write-amplification is the work required to move data between adjacent levels. The per-level write-amp for the write buffer is 1 because a write buffer flush creates a new SST in L0 without reading/re-writing an SST already in L0.<br /><br />I assume that any key in Ln is already in Ln+1 so that merging Ln into Ln+1 does not make Ln+1 larger.
This isn't true in real life, but this is a model.<br /><br />The per-level write-amp for Ln is approximately sizeof(Ln+1) / sizeof(Ln). For n=0 this is 2 with a typical RocksDB configuration. For n>0 this is the per-level growth factor and the default is 10 in RocksDB. Assume that the per-level growth factor is equal to X; in reality the per-level write-amp is f*X rather than X, where f ~= 0.7. See this <a href="https://hyeontaek.com/papers/msls-fast2016.pdf">excellent paper</a> or examine the compaction IO stats from a production RocksDB instance. Too many excellent conference papers assume it is X rather than f*X in practice.<br /><br />The per-level write-amp for Lmax is 0 because compaction stops at Lmax.<br /><style type="text/css">p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 13.0px Menlo; color: #000000; background-color: #ffffff} span.s1 {font-variant-ligatures: no-common-ligatures} </style>Mark Callaghanhttp://www.blogger.com/profile/09590445221922043181noreply@blogger.com0
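The pending write-amp table in the durability debt post above can be reproduced with a short Python sketch. This is just the arithmetic of the model, not measurements from RocksDB, and it hardcodes the assumptions stated in the post: per-level growth factor 10, f = 0.7, a write buffer of size 1, sizeof(L0) == sizeof(L1) == 4 write buffers and L5 as Lmax.

```python
# Reproduce the durability debt table: per-level write-amp is 1 for the
# write buffer, 2 for the L0 (because sizeof(L1) == sizeof(L0)),
# f*X = 0.7*10 = 7 for L1..L4 and 0 for the max level (L5).
levels   = ["wbuf", "L0", "L1", "L2", "L3", "L4", "L5"]
rel_size = [1, 4, 4, 40, 400, 4000, 40000]  # size relative to write buffer
per_lvl  = [1, 2, 7, 7, 7, 7, 0]            # per-level write-amp

# pending write-amp: remaining work to move a byte from this level to Lmax,
# which is the sum of the per-level write-amps from this level onward
pend = [sum(per_lvl[i:]) for i in range(len(per_lvl))]

print("level  per-lvl  pend  rel-size  rel-size X pend")
for name, w, p, s in zip(levels, per_lvl, pend, rel_size):
    print(f"{name:6} {w:7} {p:5} {s:9} {s * p:15}")
```

The output matches the table: pending write-amp 31 for the write buffer down to 0 for L5, and rel size X pend peaking at 28000 for L4, which is why most of the absolute durability debt lives in the larger levels.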