Friday, January 28, 2022

RocksDB internals: compaction stall counters

Compaction IO statistics in RocksDB include two lines with information on write throttling. Writes are throttled to slow the ingest (new writes) so that compaction (outgest) can keep up. This is a feedback mechanism, and it is needed with an LSM because an LSM lacks the natural feedback that a b-tree has, which occurs when all pages in the buffer pool are dirty. The first line has counters for the causes of write slowdowns and write stalls. The second line has the total time and percentage of time (relative to uptime) that write stalls have been in effect for the RocksDB instance.

Stalls(count): 713 level0_slowdown, 706 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 1160 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 351 total count

Cumulative stall: 00:03:31.992 H:M:S, 65.1 percent

A write stall is when RocksDB blocks writes until compaction debt has been reduced. A write slowdown is when RocksDB delays writes so that the write rate is <= a target rate, giving compaction a chance to catch up and avoid write stalls. The causes for write stalls and slowdowns are listed here, although some enums in that list are for things other than write stalls and slowdowns.

Detecting write stalls

From the sample compaction IO stats output listed above, the first line (stall counters) is printed by this code and the second line (stall timers) by this code. The reason for a stall is computed by GetWriteStallConditionAndCause and that result is processed, including incrementing stall counters, in RecalculateWriteStallConditions. Note that GetWriteStallConditionAndCause makes checks in a fixed sequence. While multiple conditions might be true at any point in time, such as too many L0 files and too many bytes pending compaction, only the first one in that sequence is returned as the reason for the stall.

The stall timer is incremented in DBImpl::DelayWrite, although I am not sure whether it is called for both write stalls and write slowdowns.

The following explains the names used in the stall counter example text above. They are listed in the order in which they are checked by GetWriteStallConditionAndCause (a sketch of that check order follows the list):

  • The first 4 are write stalls, in which case write_controller_token_ is assigned the return value from GetStopToken.
    • memtable_compaction - true when the number of unflushed memtables is >= max_write_buffer_number. A message is written to LOG with the text "Stopping writes because we have %d immutable memtables ..."
    • level0_numfiles - true when the number of L0 files is >= level0_stop_writes_trigger. A message is written to LOG with the text "Stopping writes because we have %d level-0 files ..."
    • level0_numfiles_with_compaction - true when level0_numfiles is true and an L0 compaction is in progress. Thus this counter is <= the counter for level0_numfiles. The LOG message is explained in the previous point.
    • stop for pending_compaction_bytes - true when the number of pending compaction bytes is >= hard_pending_compaction_bytes_limit. A message is written to LOG with the text "Stopping writes because of estimated pending compaction bytes ...".
  • The remaining ones are write slowdowns, in which case write_controller_token_ is assigned the return value from SetupDelay. The delayed write rate computed by SetupDelay is printed in the LOG messages mentioned below.
    • memtable_slowdown - true when max_write_buffer_number > 3 and the number of unflushed memtables is >= both (max_write_buffer_number - 1) and (min_write_buffer_number_to_merge - 1). A message is written to LOG with the text "Stalling writes because we have %d immutable memtables (waiting for flush) ..."
    • level0_slowdown - true when the number of L0 files is >= level0_slowdown_writes_trigger. A message is written to LOG with the text "Stalling writes because we have %d level-0 files rate ..."
    • level0_slowdown_with_compaction - true when level0_slowdown is true and L0 compaction is in progress. Thus this counter is <= the counter for level0_slowdown. See the previous point for the LOG message.
    • slowdown for pending_compaction_bytes - true when the number of bytes pending compaction is >= soft_pending_compaction_bytes_limit. A message is written to LOG with the text "Stalling writes because of estimated pending compaction bytes ..."
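
Below is a minimal sketch, in C++ but not the actual RocksDB code, of that fixed check order. The type and parameter names are simplified stand-ins (assumptions) and only the checks described above are included:

#include <cstdint>
#include <utility>

// Simplified stand-ins for the RocksDB types and options (assumptions).
enum class Condition { kStopped, kDelayed, kNormal };
enum class Cause { kMemtableLimit, kL0FileCountLimit, kPendingCompactionBytes, kNone };

struct Opts {
  int max_write_buffer_number;
  int level0_stop_writes_trigger;
  int level0_slowdown_writes_trigger;
  uint64_t hard_pending_compaction_bytes_limit;
  uint64_t soft_pending_compaction_bytes_limit;
};

// Checks run in a fixed order: stop conditions first, then slowdown
// conditions, so only the first true condition is reported as the cause.
std::pair<Condition, Cause> GetStallConditionAndCause(
    int unflushed_memtables, int l0_files, uint64_t pending_bytes,
    const Opts& o) {
  if (unflushed_memtables >= o.max_write_buffer_number)
    return {Condition::kStopped, Cause::kMemtableLimit};
  if (l0_files >= o.level0_stop_writes_trigger)
    return {Condition::kStopped, Cause::kL0FileCountLimit};
  if (pending_bytes >= o.hard_pending_compaction_bytes_limit)
    return {Condition::kStopped, Cause::kPendingCompactionBytes};
  if (o.max_write_buffer_number > 3 &&
      unflushed_memtables >= o.max_write_buffer_number - 1)
    return {Condition::kDelayed, Cause::kMemtableLimit};
  if (l0_files >= o.level0_slowdown_writes_trigger)
    return {Condition::kDelayed, Cause::kL0FileCountLimit};
  if (pending_bytes >= o.soft_pending_compaction_bytes_limit)
    return {Condition::kDelayed, Cause::kPendingCompactionBytes};
  return {Condition::kNormal, Cause::kNone};
}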

When SetupDelay is called, the penalize_stop argument gets the value of write_controller->IsStopped() for the memtable_slowdown case. For level0_slowdown it is also true when the number of L0 files is within 2 of level0_stop_writes_trigger. For the pending_compaction_bytes slowdown it is also true when the number of bytes pending compaction is more than 3/4 of the way from the soft limit to the hard limit.

If GetWriteStallConditionAndCause returned WriteStallCondition::kNormal then:

  • write_controller_token_ = GetCompactionPressureToken() if either of these is true:
    • vstorage->l0_delay_trigger_count() >= GetL0ThresholdSpeedupCompaction(). The l0_delay_trigger_count is the number of files in the L0. GetL0ThresholdSpeedupCompaction() returns the smaller of 2*level0_file_num_compaction_trigger and X, where X is a value 1/4 of the way from the L0 compaction trigger to the L0 slowdown trigger. A message is written to LOG with the text "Increasing compaction threads because we have %d level-0 files"
    • vstorage->estimated_compaction_needed_bytes() >= soft_pending_compaction_bytes_limit/4. A message is written to LOG with the text "Increasing compaction threads because of estimated pending compaction bytes"
  • otherwise write_controller_token_.reset() is called. Nothing is written to LOG.
  • if write_controller->NeedsDelay() was true then the delayed write rate is set to kDelayRecoverSlowdownRatio (=1.4) times the configured delayed_write_rate and the low-pri delayed write rate is set to 1/4 of that. Nothing is written to LOG.

I will save the write controller for a future note and will end with a description of SetupDelay. This function determines the value for the applied delayed write rate, which is <= the value of the delayed_write_rate option. A sketch follows the list.

  • if write_controller->NeedsDelay() is not true then the applied delayed write rate is not changed. I am curious why SetupDelay() would be called in that case.
  • else if penalize_stop is true (see above) then the applied delayed write rate is multiplied by kNearStopSlowdownRatio (=0.6), which reduces it.
  • else if bytes pending compaction is increasing then the applied delayed write rate is multiplied by kIncSlowdownRatio (=0.8), which reduces it.
  • else if bytes pending compaction is decreasing then the applied delayed write rate is multiplied by kDecSlowdownRatio (=1/kIncSlowdownRatio), which increases it.
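
Here is that logic as a minimal C++ sketch. The constants match those mentioned above; the function name and parameters are assumptions, not the real SetupDelay signature:

#include <cstdint>

// Constants from the text above; the function shape is an assumption.
constexpr double kNearStopSlowdownRatio = 0.6;
constexpr double kIncSlowdownRatio = 0.8;
constexpr double kDecSlowdownRatio = 1.0 / kIncSlowdownRatio;

uint64_t AdjustDelayedWriteRate(uint64_t applied_rate, bool needs_delay,
                                bool penalize_stop, uint64_t pending_bytes,
                                uint64_t prev_pending_bytes) {
  if (!needs_delay) {
    return applied_rate;  // rate is left unchanged
  } else if (penalize_stop) {
    return static_cast<uint64_t>(applied_rate * kNearStopSlowdownRatio);  // near a stop: big cut
  } else if (pending_bytes > prev_pending_bytes) {
    return static_cast<uint64_t>(applied_rate * kIncSlowdownRatio);  // debt growing: reduce rate
  } else if (pending_bytes < prev_pending_bytes) {
    return static_cast<uint64_t>(applied_rate * kDecSlowdownRatio);  // debt shrinking: increase rate
  }
  return applied_rate;
}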

There is a reference count on the number of current CompactionPressureToken instances. When this count is non-zero the NeedSpeedupCompaction() function can return true. I will save the details on the callers of NeedSpeedupCompaction() for the future note on the write controller.

Monitoring

I listed above the relevant LOG messages related to rate-limiting writes. I have two questions:

  • Would it be easier to grep for these from LOG if they all contained a common string unique to this feature?
  • Are all of the transitions covered? Note that in some cases described above nothing is written to LOG.

Thursday, January 27, 2022

RocksDB internals: intra-L0 compaction

There are some details on intra-L0 compaction on the public RocksDB wiki. This post has more details.

The purpose of intra-L0 is to reduce read-amplification when compaction falls behind, by reducing the number of SSTs in the L0. Intra-L0 compaction merges smaller L0 SSTs into a larger SST that remains in the L0. It can be done when L0->L1 compaction is needed but blocked. From the discussion of CBPDF below I suspect that intra-L0 increases write-amplification by <= 1 because SSTs in the L0 are likely to be subject to intra-L0 at most once.

  • Needed means the L0 compaction score is >= 1 either because there are too many SSTs in L0 or the L0 SSTs are too large. The compaction score is computed here.
  • Blocked means that L0->L1 wasn't selected because compaction input (from L0) or output (from L1) SSTs are already being compacted.

There is no option to disable intra-L0. For write-heavy workloads I am trying to quantify whether there is a benefit, because the cost is more CPU and IO from compaction. While there is less or no need to reduce read-amp for write-heavy workloads, intra-L0 can still help by reducing the number of SSTs in L0 and thus reducing write stalls. But I need to repeat benchmarks with intra-L0 disabled (by editing code and compiling a custom db_bench) to better understand this.

Disclaimer

I have a deep understanding of LSM performance in general but not enough understanding of RocksDB internals. That is a risky combination that I am fixing by reading source code. I will share what I learn via blog posts. My approach here is to document as I read and the result might be a lot of detail without a high-level overview.

Code

PickIntraL0Compaction is called by SetupInitialFiles when the L0 compaction score is >= 1 but an L0->L1 compaction was not selected. Both are methods in class LevelCompactionBuilder. PickIntraL0Compaction calls FindIntraL0Compaction if there are at least 2 SSTs in L0 and the first isn't being compacted. When FindIntraL0Compaction is called, the arg min_files_to_compact has the value kMinFilesForIntraL0Compaction (=4) and the arg max_compaction_bytes has the value from the max_compaction_bytes option. The default for the max_compaction_bytes option is 0 and when 0 it is set to 25 * target_file_size_base. Note that with intra-L0, SSTs in the L0 can be as large as max_compaction_bytes (ignoring compression).

FindIntraL0Compaction tries to find SSTs in L0 for intra-L0 compaction. It starts by iterating the list of L0 SSTs from newest to oldest. If one is found that is being compacted, then it returns immediately. Otherwise it stops on the first SST with an old enough largest_seqno and this is the first of the SSTs to include in the intra-L0 compaction. It then adds successive SSTs as long as:

  • the SST is not being compacted and
  • Current CBPDF is less than previous CBPDF, where CBPDF is Compacted Bytes Per Deleted File, computed as TB / NF where TB (Total Bytes) is the sum of the sizes of the SST files and NF is the number of SST files less one. If the SST file sizes were [5, 5, 5, x] then the CBPDF for [5, 5, 5] is 7.5 (15/2) and the CBPDF for [5, 5, 5, x] is (15+x)/3. The loop stops and doesn't include the SST of size x when its CBPDF is greater than 7.5, and (15+x)/3 > 7.5 when x > 7.5. In that case an intra-L0 compaction would not be picked because only 3 SSTs were selected and 3 is less than min_files_to_compact.
  • the sum of the sizes of the SSTs is <= max_compaction_bytes

When it stops, an intra-L0 compaction will be done if the number of SSTs selected is >= min_files_to_compact (kMinFilesForIntraL0Compaction). A sketch of the selection loop follows.
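
Below is a minimal C++ sketch of that selection loop, not the actual FindIntraL0Compaction. It assumes files are ordered newest to oldest, already starting at the first SST with an old enough largest_seqno, and FileMeta is a simplified stand-in for the RocksDB file metadata:

#include <cstdint>
#include <limits>
#include <vector>

struct FileMeta { uint64_t size; bool being_compacted; };  // an assumption

// Returns how many of the newest SSTs to merge, or 0 for no intra-L0.
size_t PickIntraL0(const std::vector<FileMeta>& files,
                   size_t min_files_to_compact,   // kMinFilesForIntraL0Compaction = 4
                   uint64_t max_compaction_bytes) {
  if (files.empty() || files[0].being_compacted) return 0;
  uint64_t compact_bytes = files[0].size;
  double cbpdf = std::numeric_limits<double>::max();
  size_t limit = 1;
  for (; limit < files.size(); ++limit) {
    if (files[limit].being_compacted) break;
    uint64_t new_bytes = compact_bytes + files[limit].size;
    // CBPDF = total bytes / (number of files - 1)
    double new_cbpdf = static_cast<double>(new_bytes) / limit;
    // stop when CBPDF would increase or the input would be too large
    if (new_cbpdf > cbpdf || new_bytes > max_compaction_bytes) break;
    compact_bytes = new_bytes;
    cbpdf = new_cbpdf;
  }
  return limit >= min_files_to_compact ? limit : 0;
}

With the example above, [5, 5, 5, x], the loop accepts the first three files (CBPDF falls from 10 to 7.5) and stops before the fourth when x > 7.5.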

Wednesday, January 26, 2022

RocksDB internals: bytes pending compaction

I have a deep understanding of LSM performance in general but not enough understanding of RocksDB internals. That is a risky combination that I am fixing by reading source code. I will share what I learn via blog posts. My approach here is to document as I read and the result might be a lot of detail without a high-level overview.

This post is about soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit. Their comments in the header are brief (I expect to send a diff to improve them), so reading code is required. Before reading code I misunderstood how bytes pending compaction was computed. Below I use BPC for bytes pending compaction. EstimateCompactionBytesNeeded is the function that computes BPC. I assume it is called each time compaction or memtable flush finishes.

While this and a few other leveled compaction features have confused me, they also make leveled more adaptive to bursts of write-heavy activity and I have been promoting the benefits of adaptive LSM tree tuning for a long time. So perhaps I should embrace and document the complexity.

The soft and hard_pending_compaction_bytes_limit options set bounds on how large bytes pending compaction can get before writes are slowed or stalled. A write slowdown occurs when BPC is >= the soft limit and < the hard limit. The duration for a write slowdown is usually 1 millisecond although it can be longer. A write stall occurs when BPC is >= the hard limit. And by write stall I mean that the write is blocked until BPC becomes < the hard limit. The value for BPC is computed by EstimateCompactionBytesNeeded. The value is the sum of the bytes pending compaction across all levels in the LSM tree. Pseudo-code for this is:

For L0:
  If num_files(L0) >= level0_file_num_compaction_trigger or size(L0) > target_size(L1):
    increment by size(L0) + size(L1)

For Ln, n>0:
  If size(Ln) > target_size(Ln):
    increment by [ size(Ln) - target_size(Ln) ] * [ (size(Ln+1) / size(Ln)) + 1 ]
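
And here is the pseudo-code as a runnable C++ sketch. The LevelInfo struct is an assumption and the L0 trigger check is simplified; the real logic is in EstimateCompactionBytesNeeded:

#include <cstdint>
#include <vector>

struct LevelInfo {
  uint64_t size;         // current bytes in the level
  uint64_t target_size;  // target bytes for the level
  int num_files;         // only meaningful for L0
};

uint64_t EstimateBytesPendingCompaction(const std::vector<LevelInfo>& levels,
                                        int l0_compaction_trigger) {
  if (levels.size() < 2) return 0;
  uint64_t bpc = 0;
  // L0 contributes size(L0) + size(L1) once L0 compaction is triggered.
  if (levels[0].num_files >= l0_compaction_trigger ||
      levels[0].size > levels[1].target_size) {
    bpc += levels[0].size + levels[1].size;
  }
  // Ln (n > 0) contributes its excess over target times (fanout + 1)
  // because compacting the excess also reads and rewrites bytes from Ln+1.
  for (size_t n = 1; n + 1 < levels.size(); ++n) {
    if (levels[n].size > levels[n].target_size) {
      uint64_t excess = levels[n].size - levels[n].target_size;
      double fanout = static_cast<double>(levels[n + 1].size) /
                      static_cast<double>(levels[n].size);
      bpc += static_cast<uint64_t>(excess * (fanout + 1.0));
    }
  }
  return bpc;
}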

Prior to reading the code I assumed that BPC was incremented on each write (or each memtable flush) by something like sizeof(write) * write-amp and then decremented by compaction. Alas, while it is easy to estimate write-amp, it is unlikely that the per-compaction decrements would match the per-write increments and BPC would quickly get out of sync. So I get why the current approach is used.

Challenges

The current approach has challenges:

  • What are good values for soft and hard_pending_compaction_bytes_limit?
    • The defaults are 64G and 256G.
  • The value of BPC changes suddenly.
    • When L0->L1 compaction is triggered at 4 files in L0, the L0 contributes nothing to bytes pending compaction until the number of SSTs in L0 reaches 4. At that point it increments bytes pending compaction by size(L0) + size(L1). If you plot BPC over time there will be many discontinuous changes.
  • All of the debt can be spent at one level
    • Long ago RocksDB used to constrain each level to be no more than X times larger than its target size. Whether or not this worked well in production, it made it easier to reason about the shape of the LSM tree. In papers all leveled compaction LSM trees look like pyramids and Ln+1 always has more data than Ln. With reasonable settings for per-level fanout and the size constraints this was also true in practice.
    • The enforcement today is global and all of the BPC debt can be spent at one level. If the soft and hard limits are 64G and 256G then the L1 can have ~64G of compaction debt before the soft limit starts to delay writes. If the per-level fanout were 8 then size(L1) could be ~8G larger than target_size(L1). I don't know whether this is a problem beyond being a source of confusion.
  • There is monitoring debt.
    • It would help to have details on stalls, per-level target sizes, the total BPC and the per-level contribution to BPC -- written to LOG and/or summarized in compaction IO statistics.

Summary

The current approach allows the LSM tree shape to get very much out of shape. I don't know whether that is a problem, beyond my confusion about this feature which might have led me to use bad values for the soft and hard limits while running benchmarks. Regardless, I need to enhance the comments for those options to help other users.

I am still figuring out a good way to set values for the hard and soft limits but my suggestion for now is to use the defaults.

Thursday, December 2, 2021

Summarizing the different implementations of tiered compaction

This is my first attempt to summarize tiered compaction as implemented by RocksDB, Cassandra, ScyllaDB and HBase. I have limited experience with tiered compaction (that will change) and limited knowledge of Cassandra, ScyllaDB and HBase (that won't change soon) so I will make mistakes but enjoy the learning process. To keep this brief I ignore the variants optimized for time-series and I don't explain compaction basics.

Different DBMS use different names for tiered compaction:

  • RocksDB has universal compaction
  • HBase has ExploringCompactionPolicy and RatioBasedCompactionPolicy, see here
  • Cassandra has STCS (Size Tiered Compaction Strategy)
  • ScyllaDB has STCS and ICS (Incremental Compaction Strategy), see here
Context

The workload matters a lot when considering compaction algorithms. There are many but I list only three below. Note that many real workloads are a mixture of patterns, an example is some overwrites, some inserts.
  • Overwrite-mostly - all of my time in LSM-land has been focused on overwrite-mostly workloads. Compaction is useful for overwrites because it reclaims space by removing tombstone and old versions.
  • Insert-mostly - for now I assume that something removes data via TTL or explicit deletes, otherwise the database size grows forever. Insert-only can be further divided into:
    • Insert-only in key order - this is the easiest pattern to support. Bloom filters are less useful because pruning based on min/max keys per sorted run can eliminate many runs from being accessed on reads. Excluding implementation artifacts there are few benefits from using compaction in this case. While the sorted runs created by memtable flush will be small, they can be concatenated into larger, logical sorted runs that are range-partitioned although not all implementations do that.
    • Insert-only not in key order - compaction won't reclaim space (there is nothing to reclaim) but it still helps to reduce read-amplification.
STCS vs Prefix

While STCS is an accepted and good name for one approach to tiered compaction, HBase and RocksDB have a similar approach but use different names (RatioBasedCompactionPolicy, Universal). I prefer to call that Prefix compaction for reasons explained below.

While it would help to show examples (via diagrams or text) of the compactions that are done for STCS and Prefix assuming a sequence of memtable flushes, this post is already too long. I will do that in a future post.

Properties

Definitions:
  • By compaction picker I mean the algorithm that decides which sorted runs to compact.
  • Major compaction merges all sorted runs, minor compaction does not. 
  • The metadata for a sorted run includes the min and max keys stored in the sorted run and the min and max commit timestamp used by a key in that sorted run.
I would prefer to have a geek code for tiered compaction implementations but I don't have the expertise yet to suggest one. Regardless, below is an incomplete list of things that distinguish the implementations:

  • What triggers major compaction?
    • Many implementations only do major compaction manually (when requested by an admin). The reasons for not doing it automatically include 1) it can take a long time and 2) there might be a transient 2X increase in space. Note that some implementations can merge the largest sorted runs without merging all sorted runs and this can cause the same problems as major compaction.
  • How are sorted runs ordered as viewed by the compaction picker?
    • They can be ordered by commit timestamp or size.
    • Commit timestamp order frequently implies size order, but explicit deletes, TTL, or compaction removing a larger/older value in favor of a smaller/newer value can mean that older sorted runs are not larger.
  • Does the compaction picker (almost) always consume a prefix of the sorted runs or can it consume a substring?
    • This assumes there is an order imposed on the sorted runs (see the previous question).
  • Can the compaction picker merge sorted runs that aren't adjacent in commit timestamp order?
    • See the two previous questions. If this is done then sorted runs can overlap with respect to commit timestamp ranges. A side-effect of allowing sorted runs to overlap in commit timestamp order might be that merging iterators can't use a short-circuit optimization for point queries -- reading from iterators in commit timestamp order and stopping on the first matching key or tombstone (see the sketch after this list). I don't know if that side-effect is a big deal.
  • What read-amplification constraints trigger minor compaction?
    • The trigger can be global or local. Global is triggered when there are too many sorted runs. Local is triggered when there are too many sorted runs within a size bucket.
    • These are naive because they assume all sorted runs might have to be read for a query. For workloads where the newly written keys are correlated with time the assumption might be wrong: many sorted runs can be pruned based on their min/max key and the number of sorted runs in the LSM tree is a too-pessimistic estimate of the read-amp. Time-series workloads might break the naive assumption, but I am trying not to discuss LSM optimizations for time-series in this post.
  • Is there a minimum number of sorted runs to merge for minor compaction?
  • Is there read-triggered compaction?
    • LevelDB has a feature to trigger compaction for regions of the LSM tree with leveled compaction. Some tiered compaction implementations have a read hotness feature that prioritizes compaction for sorted runs that get more reads.
  • What space-amplification constraints trigger compaction?
    • This assumes there is an estimate for space-amp and there might not be one. If there is an estimate it is usually naive because a less naive estimate is expensive. The naive estimate is sizeof(all-sorted-runs) / sizeof(largest-sorted-run). I think this can be improved.
  • Do large sorted runs imply large filter and index blocks in memory?
    • If index and filter blocks are per sorted run, stored as a contiguous sequence of bytes and cannot be paged then a large sorted run implies there are large index and filter blocks. These are inefficient to cache and to read and decompress from storage. 
    • One alternative is to page them so they can be read from storage and cached in pieces. 
    • Another alternative is to have large logical sorted runs that are range partitioned into N sorted runs. Each of the N sorted runs has its own and smaller index and filter blocks.
  • Is compaction for large sorted runs incremental?
    • By incremental I mean there isn't a transient 2X increase in disk space during compaction and that disk space used by compaction input is reclaimed before a large compaction is complete.
  • Is compaction single-threaded?
    • Leveled compaction is usually single-threaded because each compaction step is limited to a small amount of data, and then background threads can do many small compaction steps concurrently. But a compaction step can be much larger with tiered. Multi-threading can reduce this problem.
    • One approach to multi-threading is to have N threads divide the compaction input into N key ranges and work on them independently.
    • Another approach to multi-threading is to have one thread do the merge, 1+ threads read and decompress compaction input and 1+ threads post-process (collect stats, create bloom filters, compress) compaction output. The separate threads for reading and writing can also do async read-ahead and write-behind.
  • Are there optimizations for inserts in key order?
  • What is the space-amp vs read-amp vs write-amp tradeoff?
    • Implementations have different tradeoffs for read, write and space amplification. I am not aware of an attempt to quantify the differences. My answer is left for a future blog post.
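
Before the per-DBMS answers, here is a minimal C++ sketch of the point-query short-circuit mentioned in the list above. It assumes sorted runs do not overlap in commit timestamp order; SortedRun is a toy stand-in with nullopt modeling a tombstone:

#include <map>
#include <optional>
#include <string>
#include <vector>

// Toy sorted run: key -> value, where a nullopt value models a tombstone.
struct SortedRun {
  std::map<std::string, std::optional<std::string>> data;
};

// runs is ordered newest to oldest by commit timestamp. Because the runs do
// not overlap in commit timestamp order, the first match (value or tombstone)
// is the answer and all older runs can be skipped.
std::optional<std::string> PointQuery(const std::vector<SortedRun>& runs,
                                      const std::string& key) {
  for (const SortedRun& run : runs) {
    auto it = run.data.find(key);
    if (it != run.data.end()) return it->second;  // value, or nullopt if deleted
  }
  return std::nullopt;  // key not found in any run
}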

RocksDB

There are plans for improving universal compaction and some details are here.

Answers to the geek code questions:
  • The compaction picker is here and options are here
  • Major compaction is triggered automatically when the space-amp constraint is exceeded. That is set by the option max_size_amplification_percent
  • Sorted runs are ordered by commit timestamp (seqno)
  • The compaction picker usually picks a prefix of the sorted runs. But it skips the first N if they are currently being compacted, so it doesn't always pick a prefix. The next sorted run is added to the prefix when sizeof(next run) <= sizeof(selected runs) X size_ratio. When the size_ratio check fails or max_merge_width is reached then the minor compaction input is known (see the sketch after this list). The stop_style option can also be used but I won't describe that here.
  • Compaction input sorted runs are always adjacent in commit timestamp order.
  • The min and max number of sorted runs to merge are set by min_merge_width and max_merge_width
  • The read-amp constraint that triggers minor compaction is set by the option level0_file_num_compaction_trigger which is borrowed from leveled compaction. This is global.
  • There is no read-triggered compaction. While it was inherited from LevelDB it is currently disabled in RocksDB.
  • The space-amp constraint triggers major compaction as mentioned above. The naive estimate is used for space-amp and it is reasonable for overwrite-mostly workloads. I assume the space-amp constraint should be disabled for insert-mostly workloads.
  • Large sorted runs can lead to large index and filter blocks but that is unlikely because the largest sorted runs are logical and range-partitioned with an index and filter per partition (see here). RocksDB also optionally supports partitioned index and filter blocks.
  • Compaction for large sorted runs is not incremental (yet)
  • Large compactions are multi-threaded via subcompactions that split compaction input into N ranges (see here)
  • The allow_trivial_move option can use trivial moves in some cases when inserts are in key order to avoid re-writing sorted runs.
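
A minimal sketch of the prefix-picking rule described in the list above (the size_ratio check). It ignores skipping runs that are being compacted and the stop_style option; as with the RocksDB option, size_ratio is a percentage:

#include <cstdint>
#include <vector>

// run_sizes is ordered newest to oldest. Returns how many runs to merge,
// or 0 if fewer than min_merge_width qualify.
size_t PickPrefix(const std::vector<uint64_t>& run_sizes, unsigned size_ratio,
                  size_t min_merge_width, size_t max_merge_width) {
  if (run_sizes.empty()) return 0;
  uint64_t selected_bytes = run_sizes[0];
  size_t n = 1;
  while (n < run_sizes.size() && n < max_merge_width) {
    // add the next run while sizeof(next) <= sizeof(selected) * (100 + ratio)%
    if (run_sizes[n] > selected_bytes * (100 + size_ratio) / 100) break;
    selected_bytes += run_sizes[n];
    ++n;
  }
  return n >= min_merge_width ? n : 0;
}
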
HBase

Answers to the geek code questions:
  • Compaction options are here. Search for hstore.compaction.
  • The compaction policies (ExploringCompactionPolicy, RatioBasedCompactionPolicy) are explained here.
  • The min and max number of sorted runs to merge are set by hbase.hstore.compaction.{min,max}
  • Time-based major compaction can be enabled or disabled. Search for hbase.hregion.majorcompaction
  • RatioBasedCompactionPolicy seems similar to universal compaction in RocksDB but ExploringCompactionPolicy might be closer to STCS. However, I didn't spend enough time to figure them out.
Cassandra

Answers to the geek code questions for STCS:
  • The compaction picker is here and options are here
  • Major compaction is triggered manually
  • Sorted runs are ordered by size and the compaction picker groups them into size classes, then looks at size classes that have at least min_threshold sorted runs and chooses the one that gets the most reads (read hotness). The bucket_high and bucket_low options determine the range of sorted run sizes that can be in one size class (bucket); see the bucketing sketch after this list. More detail is here and here.
  • Sorted runs that are not adjacent in commit timestamp order can be merged
  • The min and max number of sorted runs to merge are determined by the options min_threshold and max_threshold
  • The read-amp constraint is min_threshold and triggers minor compaction when there is a size class with at least that many sorted runs
  • There isn't support for read-triggered minor compaction but read hotness is used to prioritize which size class to compact next.
  • I am not sure whether large index and filter blocks are a problem
  • Compaction is single-threaded. I don't know whether the sharding built into Cassandra makes it easier to avoid large shards and the problems related to large compactions.
  • Does not do incremental compaction
  • I am not aware of optimizations for key-order inserts, similar to the allow_trivial_move feature in RocksDB universal compaction.
  • Cassandra might use HyperLogLog to estimate overlap between sorted runs. This can help estimate whether there would be a benefit from merging sorted runs by predicting how many redundant keys would be removed. But all I have today is a DataStax blog post.
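
A minimal sketch of STCS size-class bucketing as described above, using the bucket_low and bucket_high options. Read hotness and the min/max_threshold handling are omitted; the shape of the code is an assumption:

#include <algorithm>
#include <cstdint>
#include <vector>

// Groups sorted-run sizes into buckets. A run joins the current bucket when
// its size is within [bucket_low * avg, bucket_high * avg] of the bucket's
// running average size; otherwise it starts a new bucket.
std::vector<std::vector<uint64_t>> BucketBySize(std::vector<uint64_t> sizes,
                                                double bucket_low,    // e.g. 0.5
                                                double bucket_high) { // e.g. 1.5
  std::sort(sizes.begin(), sizes.end());
  std::vector<std::vector<uint64_t>> buckets;
  double avg = 0.0;
  for (uint64_t sz : sizes) {
    if (!buckets.empty() && sz >= bucket_low * avg && sz <= bucket_high * avg) {
      buckets.back().push_back(sz);
      avg = (avg * (buckets.back().size() - 1) + sz) / buckets.back().size();
    } else {
      buckets.push_back({sz});
      avg = static_cast<double>(sz);
    }
  }
  // A bucket with at least min_threshold runs is a candidate for minor compaction.
  return buckets;
}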

ScyllaDB

Answers to the geek code questions for STCS (size tiered compaction strategy) with a few answers for ICS (incremental compaction strategy). Most of the answers above for Cassandra are also true here:
  • The compaction picker for STCS is here and options are here
  • Major compaction is triggered manually for STCS. ICS has an option to trigger it based on a SAG (space amplification goal). And they explain it is useful for overwrite workloads.
  • The min and max number of sorted runs to merge are determined by the options min_threshold and max_threshold
  • Compaction is single-threaded but ScyllaDB explains how the shard-per-core model reduces the problem from large/slow compactions.
  • STCS does not do incremental compaction but ScyllaDB Enterprise includes ICS
Updates
  • Added a link to the use of HyperLogLog by DataStax Cassandra to estimate sorted run overlap

Wednesday, December 1, 2021

Sysbench: MySQL 5.6, 5.7 & 8.0 on a small server

This has results for sysbench with MySQL versions 5.6.49, 5.7.35 and 8.0.2x using the same setup as previously used for the insert benchmark. The versions for 8.0.2x are 8.0.20, 8.0.22, 8.0.23, 8.0.26 and 8.0.27.

Executive summary:

  • For CPU-bound setups, MySQL 8.0.27 is slower than 5.6.49 on 39 of the 42 microbenchmarks. For the microbenchmarks where it was slower, on average it got 72% of the throughput of 5.6.49.
  • For IO-bound setups, MySQL 8.0.27 is slower than 5.6.49 on 27 of the 42 microbenchmarks, faster on 13 and had the same throughput on 2.
  • In most cases more CPU overhead in 8.0 is the reason it is slower than 5.6.

Details

See the Postgres post for more details on how I run sysbench. The summary is there are 43 invocations of sysbench and each is a microbenchmark. But a typo meant I only have results for 42 of the 43.

Three configurations were tested: 10M rows without prepared statements, 10M rows with prepared statements, 400M rows without prepared statements. The 10M row tests are CPU-bound as the database fits in memory. The 400M row test is usually IO-bound as the database is larger than memory. Below I call the test configurations 10m.prep0, 10m.prep1 and 400m.prep0. The database sizes are 2.9G at 10M rows and 91G at 400M rows after the initial load.

One of my goals was to understand the impact of using prepared statements with MySQL especially with respect to CPU overhead. They didn't make a big difference with the exception of the one test for which there is a performance bug with prepared statements (random-points.pre.range1000.pk1).

How to read the results

The presentation for these results is just text files pasted into gists. There are two types of files: qps and metrics. 

The qps file has the relative QPS for each test where the relative QPS is the ratio: QPS-for-me / QPS-for-base.

  • QPS-for-me is the QPS for one of MySQL 5.6.49, 5.7.35 or 8.0.2x
  • QPS-for-base is the QPS for MySQL 5.6.49
  • Thus each line in the qps file has seven numeric columns and the first one has the value 1.00 (QPS-for-5.6.49 / QPS-for-5.6.49). The second through seventh columns have the values for (QPS-for-me / QPS-for-5.6.49) where me is 5.7.35, 8.0.20, 8.0.22, 8.0.23, 8.0.26 and 8.0.27 in that order. When the value in a column is less than 1.0 then that version is slower than 5.6.49.

The metrics file has absolute and relative (to results for 5.6.49) values for each of the 42 tests. This file is long so I start with the qps file and then consult the metrics file to understand why one result is better than another. The columns are:
  • cpu/o - CPU per operation, measured by iostat. While there is a unit for this, I don't worry about that as it is most useful for comparing numbers between different test configurations so the units drop out.
  • r/o - storage reads per operation, measured by iostat
  • rKB/o - KB read from storage per operation, measured by iostat
  • wKB/o - KB written to storage per operation, measured by iostat
  • o/s - operations/second (QPS, inserts/s, etc). I tend to use QPS below for everything.
  • dbms - the MySQL version and configuration file
Results

Here are the qps and metrics files for the 10m.prep0 (CPU-bound, no prepared statements):
  • 8.0.27 is faster than 5.6.49 on 3 of 42 tests (read-only.pre.range10000.pk1, read-only.range10000.pk1, update-index.range100.pk1). The first two do long range scans and the relative QPS is 1.25. For the update-index test the relative QPS is 1.48.
  • The relative QPS for 8.0.27 is less than 0.8 for 35 of the 42 tests. This means that 8.0.27 gets less than 80% of the throughput relative to 5.6.49 on 35 of the tests. The minimum relative QPS is 0.61. For the 39 tests on which 8.0.27 is slower than 5.6.49 the average of the relative QPS is 0.72.
  • For the tests in which the relative QPS in 8.0.27 is much less than one the cause is using more CPU/query. See the cpu/o column in the metrics file.
  • The speedup for update-index is harder to explain. 8.0.2x uses less CPU, but the difference isn't large enough to explain the improvement. The speedup for the read-only tests is mostly from using less CPU/query in 8.0.2x. See the cpu/o column in the metrics file.

Here are the qps and metrics files for the 10m.prep1 (CPU-bound, prepared statements):
  • Results are similar to 10m.prep0
  • 8.0.27 is faster than 5.6.49 on 3 of 42 tests. The relative QPS is 1.24 for the read-only tests that do long range scans and 1.54 for the update-index test.
  • The relative QPS for 8.0.27 is less than 0.8 for 35 of the 42 tests. 
  • The minimum relative QPS is 0.30 for random-points.pre.range1000.pk1 (AFAIK there is a known perf bug in the optimizer). MySQL 8.0.2x uses more than 3X CPU/query compared to 5.6.49. See the cpu/o column in the metrics file. 
  • For the 39 tests on which 8.0.27 is slower than 5.6.49 the average of the relative QPS is 0.70.

Here are the qps and metrics files for the 400m.prep0 (IO-bound, no prepared statements):
  • 8.0.27 is faster than 5.6.49 for 13 of the 42 tests (tests are listed below)
  • 8.0.27 matched 5.6.49 QPS for 2 of the 42 tests (random-points.pre.range1000.pk1, random-points.range1000.pk1)
  • The relative QPS for 8.0.27 is less than 0.8 for 8 of the 42 tests (tests are listed below)
  • For the 27 tests on which 8.0.27 is slower than 5.6.49 the average of the relative QPS is 0.84

Relative QPS for 8.0.27 vs 5.6.49 when 8.0.27 is faster for the IO-bound tests:

1.22    point-query.pre.range100.pk1
1.02    random-points.pre.range10.pk1
1.05    read-only.pre.range100.pk1
1.13    read-only.pre.range10000.pk1
1.03    range-covered-pk.pre.range100.pk1
1.02    range-notcovered-pk.pre.range100.pk1
1.12    update-inlist.range100.pk1
1.17    read-only.range100.pk1
1.14    read-only.range10000.pk1
1.03    point-query.range100.pk1
1.05    random-points.range10.pk1
1.05    range-covered-pk.range100.pk1
1.06    range-notcovered-pk.range100.pk1

Relative QPS for 8.0.27 vs 5.6.49 when 8.0.27 is much slower for the IO-bound tests:

0.75    range-covered-si.pre.range100.pk1
0.68    update-one.range100.pk1
0.73    update-zipf.range100.pk1
0.55    read-write.range10.pk1
0.65    hot-points.range100.pk1
0.75    range-covered-si.range100.pk1
0.79    scan.range100.pk1
0.60    insert.range100.pk1

Sysbench: PostgreSQL 12, 13 and 14 on a small server

This has results for sysbench with PostgreSQL versions 12.4, 13.4 and 14.0 using the same setup as previously used for the insert benchmark.

Executive summary:

  • Postgres continues to be boring. There are a few regressions and more improvements.
  • There were ~40 tests and each ran for 5 minutes. Running each test for such a short time means there is more chance for confusing results. Alas, running each test for 30 minutes or more would have taken too long because I repeated the benchmark many times for different DBMS, configurations and database sizes. I started another round just for the Postgres IO-bound configuration that will take about 3 days to get results.

Details

See the insert benchmark post for more details on the hardware and Postgres configurations. As always, there are layers upon layers of shell scripts. The file all_small.sh lists the sequence in which the test steps are run and each step is run for 300 seconds with 1 connection (low concurrency!). From the all_small.sh script there were 42 different invocations of sysbench, each one is a microbenchmark. There should have been 43 but a typo got in the way. From 42 X 5 minutes for the tests plus more time for the load it takes 4 to 8 hours for all_small.sh to get results for each DBMS + configuration.

Three configurations were tested: 10M rows without prepared statements, 10M rows with prepared statements, 400M rows without prepared statements. The 10M row tests are CPU-bound as the database fits in memory. The 400M row test is usually IO-bound as the database is larger than memory. Below I call the test configurations 10m.prep0, 10m.prep1 and 400m.prep0. The database sizes are 2.6G at 10M rows and 99G at 400M rows after the initial load.

All of the shell scripts are here and the tests were run using a script like this.

I use my fork of sysbench that includes a few more tests (Lua scripts). The Lua scripts are here. My fork might be a few years behind upstream.

How to read the results

The presentation for these results is just text files pasted into gists. There are two types of files: qps and metrics. 

The qps file has the relative QPS for each test where the relative QPS is the ratio: QPS-for-me / QPS-for-base.

  • QPS-for-me is the QPS for one of Postgres 12.4, 13.4 or 14.0
  • QPS-for-base is the QPS for Postgres 12.4.
  • Thus each line in the qps file has three numeric columns and the first one has the value 1.00 (QPS-for-12.4 / QPS-for-12.4). The value in the second column is the QPS for 13.4 relative to 12.4, and in the third is the QPS for 14.0 relative to 12.4. When the value in the second or third column is less than 1.0 then that version is slower than 12.4.

The metrics file has absolute and relative (to results for 12.4) values for each of the 42 test steps. This file is long so I start with the qps file and then consult the metrics file to understand why one result is better than another. The columns are:
  • cpu/o - CPU per operation, measured by iostat. While there is a unit for this, I don't worry about that as it is most useful for comparing numbers between different test configurations so the units drop out.
  • r/o - storage reads per operation, measured by iostat
  • rKB/o - KB read from storage per operation, measured by iostat
  • wKB/o - KB written to storage per operation, measured by iostat
  • o/s - operations/second (QPS, inserts/s, etc). I tend to use QPS below for everything.
  • dbms - the Postgres version and configuration file
Results

Here are the qps and metrics files for the 10m.prep0 (CPU-bound, no prepared statements):
  • scan.range.pk1 has a regression in 13.4 that was mostly fixed in 14.0. The QPS relative to 12.4 is 0.82 for 13.4 and 0.96 for 14.0. From the metrics file there is a 30% increase in CPU overhead (see the cpu/o column) in 13.4. The scan test does a full scan of the test table using a WHERE clause that filters all rows: SELECT * from %s WHERE LENGTH(c) < 0.

Here are the qps and metrics files for the 10m.prep1 (CPU-bound, prepared statements):
  • Results are similar to 10m.prep0, the only regression is for scan.range.pk1 for 13.4.

Here are the qps and metrics files for the 400m.prep0 (IO-bound, no prepared statements):
  • scan.range.pk1 has a regression from CPU overhead for 13.4
  • read-write.range10.pk1 has a ~10% regression in 13.4 and 14.0. The biggest difference from the metrics file is for wKB/o. Perhaps this test needs to run for more time to get a better signal.
  • Five of the tests improve by ~10% or more in 13.4 and 14.0. I don't count the improvement for random-points.pre.range1000.pk1 because the QPS is too small (5 & 6) and rounding might explain the difference. Reduced CPU/query doesn't explain all of the improvements as in many cases there is also less IO/query but my Postgres expertise isn't strong enough to have a good guess. There have been improvements to vacuum and b-tree indexes that are likely part of the reason.
Updates

I repeated the IO-bound tests with more time per microbenchmark, 1800 seconds rather than 300. Here are the qps and metrics files. For 14.0, there were 3 tests for which perf improved more than ~5% and 5 for which it got worse by more than ~5%.

This lists microbenchmarks for which the QPS ratio is > 1.05 for Postgres 13.4 or 14.0 relative to 12.4. I picked 1.05 as a cutoff. There aren't any huge speedups, but it is nice to see some improvements.

12.4    13.4    14.0
1.00    1.12    1.13    point-query.warm.range100.pk1
1.00    1.04    1.07    points-notcovered-si.pre.range100.pk1
1.00    1.02    1.07    read-only.range10.pk1
1.00    1.05    1.07    points-notcovered-si.range100.pk1

And this lists microbenchmarks for which the QPS ratio is <= 0.95 for Postgres 13.4 or 14.0 relative to 12.4. From the metrics file, new CPU overhead isn't the root cause for 14.0.

12.4    13.4    14.0
1.00    0.98    0.94    update-inlist.range100.pk1
1.00    0.84    0.93    scan.range100.pk1
1.00    1.04    0.93    insert.range100.pk1

Tuesday, November 30, 2021

Insert benchmark: MySQL 5.6, 5.7 & 8.0 on a small server

I have another performance report for the insert benchmark using MySQL 5.6, 5.7 and 8.0 on a small server. As usual, I had to repeat the tests a few times because I make mistakes and also want results for new my.cnf settings. The goal for the tests is to understand how low-concurrency performance changes in new MySQL releases. Low-concurrency tests are great for finding CPU regressions. High-concurrency tests are also great but I only have small servers at home.

For the CPU-bound setup, new MySQL (8.0) gets between 10% and 37% less throughput than old MySQL (5.6) because 8.0 uses more CPU/query. The increase in CPU/query for MySQL from 5.6 to 8.0 is large while PostgreSQL avoids CPU regressions from 12.4 to 14.0.

Disclaimers:

  • Be careful about interpreting these results, the context here is a workload with simple queries.
  • These results have yet to be reviewed by others. There can be mistakes.
  • 8.0.20 and 8.0.22 were exciting releases. Performance is slowly improving since then.
Details

Tests were run for MySQL versions 5.6.49, 5.7.35, 8.0.20, 8.0.22, 8.0.23, 8.0.26 and 8.0.27. I try to use similar my.cnf settings except when new versions have new features. The options are in the my56, my57 and my80 directories here (look for my.cnf.cy8). Scripts to build MySQL from source are here for MySQL 5.6, 5.7 and 8.0. One of the big changes I made was to add innodb_log_writer_threads=OFF to the MySQL 8.0 my.cnf settings. Without that the CPU overhead was much worse.

The test servers have 4 cores and tests were run with either 1 insert connection or 2 connections (rate limited inserts, not rate-limited queries). This page has links to explain the insert benchmark and perf report format. The test was run in CPU-bound and IO-bound modes. For each mode, X rows were inserted into a table without a secondary index, then 3 secondary indexes were created, then Y more rows were inserted. The values for X and Y were (20M, 20M) for CPU-bound and (500M, 10M) for IO-bound. After the inserts there were three insert+query steps where the insert connection was rate-limited to 100, 500 and 1000 inserts/s. Each query step was run for 2 hours.

The test server uses Ubuntu 20.04, gcc-9.3.0-17 and uname -a shows 5.4.0-89-generic. The database filesystem uses XFS with discard enabled. 

Results

I have shell scripts to generate perf reports. The reports for CPU-bound and IO-bound are here and here. Below I use throughput ratios where the denominator is the value for MySQL 5.6.49 and the numerator is the value for another version. A value less than 1.0 means the new version is slower than 5.6.49.

Comments on CPU-bound (see the summary):

  • for the l.i0 test (inserts without secondary indexes) the throughput ratio is 0.63 for 8.0.27. The slowdown is explained by an increase in CPU/query. See the cpupq column here that is 15 for 5.6.49 and 21 for 8.0.27.
  • for the l.i1 test (inserts with secondary indexes) the throughput ratio is 0.73 for 8.0.27. The slowdown is explained by an increase in CPU/query. See the cpupq column here that is 32 for 5.6.49 and 43 for 8.0.27.
  • for the q100.1, q500.1 and q1000.1 tests the throughput ratios using the qps columns are 0.81, 0.76 and 0.74 for 8.0.27. This is probably explained by more CPU/query. See the cpupq columns for q100.1, q500.1 and q1000.1. The throughput ratios were much better for 8.0.22 than for 8.0.27 (0.94, 0.89 and 0.86) but I have not attempted to debug that. I was unable to explain why the wkbpi column is larger for 8.0.27 than 5.6.49 and larger means more write-amp (KB written/insert). It was larger for 8.0.27 when measured by iostat, but smaller when measured by global status counters like Innodb_dblwr_pages_written, Innodb_os_log_written, Innodb_data_written and Innodb_buffer_pool_pages_flushed. This remains a mystery. I am not sure it is possible to configure InnoDB writeback in 8.0 to match 5.6. The only difference in the my.cnf settings is the use of innodb_idle_flush_pct=1 for the 8.0 my.cnfs.

Comments on IO-bound (see the summary):

  • for the l.i0 test (inserts without secondary indexes) results are similar to the CPU-bound results above because this step is still CPU-bound, 8.0 is slower because it uses more CPU/query. See here.
  • for the l.x test (create index) 8.0 is faster than 5.6. I think significant changes were done for InnoDB create index. See here.
  • for the l.i1 test (inserts with secondary indexes) results for 8.0.26 & 8.0.27 are better than 5.6.49, while results for 8.0.20, 8.0.22 and 8.0.23 are in between. With the new release process there can be significant changes for performance between point releases. See the ips column here. I think this is related to the big changes to redo log code in 8.0 as CPU/query (cpupq) and context switches/query (cspq) were larger in 8.0.20 and 8.0.22.
  • for the q100.1, q500.1 and q1000.1 tests the throughput ratios are larger than 1 as 8.0 is faster than 5.6, except for 8.0.20 which appears to be an exciting release. See the qps and cpupq columns here. MySQL 8.0.27 uses less CPU/query than 5.6.49.