Comments on Small Datum: "Summarizing the different implementations of tiered compaction"

Comment (January 3, 2022):

Thank you for the summary and for a link to the paper. I wasn't aware of it and will read it soon.

Mark Callaghan

Comment (January 3, 2022):

Interesting post. Out of curiosity, I dug up the answers for SingleStoreDB (S2DB). Some of these are from our older ICDE paper (https://www.singlestore.com/resources/research-paper_columnstore-engine-for-real-time-streaming-analytics/). We're working on a newer paper with a more up-to-date description of our storage design.

- Major compaction: the `OPTIMIZE TABLE FULL` command merges the table into a single sorted run. We never do this automatically (it is too expensive).

- The compaction picker sees the sorted runs ordered by their size.

- S2DB can compact a substring of the sorted runs. This can happen in two cases:
1. The longer sorted runs are compacted separately from the shorter ones, so that the shorter runs can be compacted without waiting for the longer ones (a compaction thread is dedicated to the shorter sorted runs). The merger responsible for compacting the longer sorted runs always compacts a substring rather than a prefix, since it skips the shorter sorted runs.
2.
If the number of sorted runs in the selected prefix is larger than the maximum number of sorted runs allowed in a single compaction (10), only the suffix of the longer sorted runs within the selected prefix gets compacted.

- S2DB has no constraint against merging sorted runs that do not overlap in commit-timestamp order. S2DB tracks deletes using a bit vector in the metadata rather than using tombstones, so a merging iterator isn't needed to find the latest versions of the rows.

- S2DB triggers a minor compaction when the sum of the sizes of a prefix of smaller sorted runs grows larger than 1/8 of the next sorted run. This is similar to a "local" trigger, since it only considers sorted runs of similar size.

- There isn't a minimum number of sorted runs to trigger compaction, and there is no read-triggered compaction (though we should probably add that at some point).

- Because S2DB tracks deletes using a bit vector in metadata, the space amplification can be estimated from the metadata as #non-deleted-rows / #all-rows. Compaction is triggered when more than 1/8 of the rows in a segment are deleted.

- All on-disk data can be read incrementally. Each sorted run is partitioned into segments so that it can be read incrementally. The index structures are split into two parts: local index structures stored with each segment, and global index structures stored in a separate LSM tree (decoupled from the sorted runs). The global index structures are paged and cached in pieces.

- Compaction is incremental: it writes out one merged segment (up to 1M rows) at a time. Disk space from a segment can be reclaimed as soon as all of the segment's rows are merged, without waiting for the entire sorted run to finish merging.

- Compaction is single-threaded per shard/partition (so we do get parallelism from sharding).
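The minor-compaction trigger and the 10-run cap described above can be sketched roughly as follows. This is a hypothetical Python illustration, not S2DB source: the function name, the smallest-first ordering of `run_sizes`, and the longest-prefix search are my assumptions.

```python
def pick_minor_compaction(run_sizes, max_runs=10, ratio=8):
    # run_sizes: sizes of the sorted runs, assumed ordered smallest-first.
    # Find the longest prefix of smaller runs whose total size exceeds
    # 1/ratio of the next (larger) sorted run.
    for end in range(len(run_sizes) - 1, 1, -1):
        if sum(run_sizes[:end]) > run_sizes[end] / ratio:
            # Cap the merge width: compact only the suffix of the
            # selected prefix (its larger runs), at most max_runs of them.
            return (max(0, end - max_runs), end)
    return None  # trigger did not fire; no minor compaction
```

Note how this behaves as a "local" trigger: a couple of tiny runs sitting next to a huge run do not fire it, because their combined size stays below 1/8 of the huge run.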
Because we need to keep compaction transparent to the workload, we use only 2 threads per host today unless compaction is manually triggered (in which case we use 1 thread per shard).

- The sorted runs are computed dynamically, so new data inserted in key order can be appended to existing large sorted runs without merging.

- Without going into too much detail: we limit the amount of compaction work (by using only two threads), so that we do something close to leveled compaction (less point read amp, more write amp) at low write rates, but fall back to tiered compaction (more point read amp, less write amp) at higher write rates.

Adam Prout
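The bit-vector delete tracking described earlier means the deleted fraction of a segment is computable from metadata alone, without scanning row data. A minimal sketch of the 1/8 trigger, assuming a hypothetical per-segment `delete_bits` list (True means the row is deleted):

```python
def segment_needs_compaction(delete_bits, threshold=1 / 8):
    # delete_bits: per-row deletion flags kept in segment metadata,
    # so no data pages need to be read to evaluate the trigger.
    deleted = sum(delete_bits)
    return deleted / len(delete_bits) > threshold
```

Because deletes are flags rather than tombstone records, this estimate is exact for the segment, which is what makes the space-amplification-driven trigger cheap to check.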