Thursday, December 2, 2021

Summarizing the different implementations of tiered compaction

This is my first attempt to summarize tiered compaction as implemented by RocksDB, Cassandra, ScyllaDB and HBase. I have limited experience with tiered compaction (that will change) and limited knowledge of Cassandra, ScyllaDB and HBase (that won't change soon) so I will make mistakes but enjoy the learning process. To keep this brief I ignore the variants optimized for time-series workloads and I don't explain compaction basics.

Different DBMS use different names for tiered compaction:

  • RocksDB has universal compaction
  • HBase has ExploringCompactionPolicy and RatioBasedCompactionPolicy, see here
  • Cassandra has STCS (Size Tiered Compaction Strategy)
  • ScyllaDB has STCS and ICS (Incremental Compaction Strategy), see here
Context

The workload matters a lot when considering compaction algorithms. There are many but I list only three below. Note that many real workloads are a mixture of patterns, for example some overwrites and some inserts.
  • Overwrite-mostly - all of my time in LSM-land has been focused on overwrite-mostly workloads. Compaction is useful for overwrites because it reclaims space by removing tombstones and old versions.
  • Insert-mostly - for now I assume that something removes data via TTL or explicit deletes otherwise the database size grows forever. Insert-only can be further divided into:
    • Insert-only in key order - this is the easiest pattern to support. Bloom filters are less useful because pruning on the min/max keys per sorted run can eliminate many runs from being accessed on reads (see the sketch after this list). Excluding implementation artifacts there are few benefits from using compaction in this case. While the sorted runs created by memtable flush will be small, they can be concatenated into larger, logical sorted runs that are range-partitioned, although not all implementations do that.
    • Insert-only not in key order - compaction won't reclaim space (there is nothing to reclaim) but it still helps to reduce read-amplification.
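
Here is a minimal sketch of the min/max pruning mentioned above. The names (SortedRun, runs_to_read) are mine, not from any of these engines; it only illustrates why key-order inserts make bloom filters less important.

# Sketch: prune sorted runs for a point lookup using per-run min/max key metadata.
from dataclasses import dataclass

@dataclass
class SortedRun:
    min_key: str
    max_key: str
    # plus data blocks, bloom filter, etc.

def runs_to_read(runs, key):
    # Only runs whose [min_key, max_key] range covers the key can contain it.
    # With key-order inserts the ranges rarely overlap, so most runs are pruned
    # before a bloom filter or data block is ever consulted.
    return [r for r in runs if r.min_key <= key <= r.max_key]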
STCS vs Prefix

While STCS is an accepted and good name for one approach to tiered compaction, HBase and RocksDB have a similar approach but use different names (RatioBasedCompactionPolicy, Universal). I prefer to call that Prefix compaction for reasons explained below.

While it would help to show examples (via diagrams or text) of the compactions that are done for STCS and Prefix assuming a sequence of memtable flushes, this post is already too long. I will do that in a future post.

Update - after reading more RocksDB code I see that it doesn't limit compactions to a prefix, so substring would be a better name, assuming you visualize the sequence of sorted runs as a string -- each sorted run is a character in that string.

Properties

Definitions:
  • By compaction picker I mean the algorithm that decides which sorted runs to compact.
  • Major compaction merges all sorted runs, minor compaction does not. 
  • The metadata for a sorted run includes the min and max keys stored in the sorted run and the min and max commit timestamp used by a key in that sorted run.
I would like to have a geek code for tiered compaction implementations but I don't yet have the expertise to suggest one. Regardless, below is an incomplete list of things that distinguish the implementations:

  • What triggers major compaction?
    • Many implementations only do major compaction manually (when requested by an admin). The reasons for not doing it automatically include 1) it can take a long time and 2) there might be a transient 2X increase in space. Note that some implementations can merge the largest sorted runs without merging all sorted runs and this can cause the same problems as major compaction.
  • How are sorted runs ordered as viewed by the compaction picker?
    • They can be ordered by commit timestamp or size.
    • Commit timestamp order frequently implies size order, but explicit deletes, TTL, or compaction replacing a larger/older value with a smaller/newer one can mean that older sorted runs are not larger.
  • Does the compaction picker (almost) always consume a prefix of the sorted runs or can it consume a substring?
    • This assumes there is an order imposed on the sorted runs (see the previous question).
  • Can the compaction picker merge sorted runs that aren't adjacent in commit timestamp order?
    • See the two previous questions. If this is done the sorted runs can overlap with respect to commit timestamp ranges. A side-effect of allowing sorted runs to overlap in commit timestamp order might be that merging iterators can't use a short-circuit optimization for point queries -- reading from iterators in commit timestamp order and stopping on the first matching key or tombstone. I don't know if that side-effect is a big deal. 
  • What read-amplification constraints trigger minor compaction?
    • The trigger can be global or local. Global is triggered when there are too many sorted runs. Local is triggered when there are too many sorted runs within a size bucket.
    • These are naive because they assume all sorted runs might have to be read for a query. For workloads where newly written keys are correlated with time the assumption can be wrong: many sorted runs can be pruned based on their min/max keys and the number of sorted runs in the LSM tree is a too-pessimistic estimate of read-amp. Time-series workloads can break the naive assumption, but I am trying not to discuss LSM optimizations for time-series in this post.
  • Is there a minimum number of sorted runs to merge for minor compaction?
  • Is there read-triggered compaction?
    • LevelDB has a feature to trigger compaction for regions of the LSM tree with leveled compaction. Some tiered compaction implementations have a read hotness feature that prioritizes compaction for sorted runs that get more reads.
  • What space-amplification constraints trigger compaction?
    • This assumes there is an estimate for space-amp and there might not be one. If there is an estimate it is usually naive because a less naive estimate is expensive. The naive estimate is sizeof(all-sorted-runs) / sizeof(largest-sorted-run). I think this can be improved. A small sketch of the naive estimate follows this list.
  • Do large sorted runs imply large filter and index blocks in memory?
    • If index and filter blocks are per sorted run, stored as a contiguous sequence of bytes and cannot be paged then a large sorted run implies there are large index and filter blocks. These are inefficient to cache and to read and decompress from storage. 
    • One alternative is to page them so they can be read from storage and cached in pieces. 
    • Another alternative is to have large logical sorted runs that are range partitioned into N sorted runs. Each of the N sorted runs has its own and smaller index and filter blocks.
  • Is compaction for large sorted runs incremental?
    • By incremental I mean there isn't a transient 2X increase in disk space during compaction and that disk space used by compaction input is reclaimed before a large compaction is complete.
  • Is compaction single-threaded?
    • Leveled compaction is usually single-threaded because each compaction step is limited to a small amount of data, and background threads can do many small compaction steps concurrently. But a compaction step can be much larger with tiered compaction and multi-threading reduces that problem.
    • One approach to multi-threading is to have N threads divide the compaction input into N key ranges and work on them independently.
    • Another approach to multi-threading is to have one thread do the merge, 1+ threads read and decompress compaction input and 1+ threads post-process (collect stats, create bloom filters, compress) compaction output. The separate threads for reading and writing can also do async read-ahead and write-behind.
  • Are there optimizations for inserts in key order?
  • What is the space-amp vs read-amp vs write-amp tradeoff?
    • Implementations have different tradeoffs for read, write and space amplification. I am not aware of an attempt to quantify the differences. My answer is left for a future blog post.
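
The naive space-amp estimate mentioned above is simple enough to show in a few lines. This is only a sketch; the function name and the trigger value in the example are mine, not from any implementation.

# Naive space-amp estimate: assume everything outside the largest sorted run
# is redundant (old versions, tombstones). run_sizes is a list of sorted-run
# sizes in any consistent unit.
def naive_space_amp(run_sizes):
    return sum(run_sizes) / max(run_sizes)

# Example: one 100 GB run plus four 10 GB runs gives an estimate of 1.4,
# which would trigger major compaction if the limit were 1.3.
print(naive_space_amp([100, 10, 10, 10, 10]))  # 1.4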

RocksDB

There are plans for improving universal compaction and some details are here.

Answers to the geek code questions:
  • The compaction picker is here and options are here
  • Major compaction is triggered automatically when the space-amp constraint is exceeded. That is set by the option max_size_amplification_percent
  • Sorted runs are ordered by commit timestamp (seqno)
  • The compaction picker usually picks a prefix of the sorted runs. But it skips the first N if they are currently being compacted so it doesn't always pick a prefix. The next sorted run is added to the prefix when sizeof(next run) <= sizeof(selected runs) X size_ratio. When the size_ratio check fails or max_merge_width is reached then the minor compaction input is known (a simplified sketch follows this list). The stop_style option can also be used but I won't describe that here.
  • Compaction input sorted runs are always adjacent in commit timestamp order.
  • The min and max number of sorted runs to merge are set by min_merge_width and max_merge_width
  • The read-amp constraint that triggers minor compaction is set by the option level0_file_num_compaction_trigger which is borrowed from leveled compaction. This is global.
  • There is no read-triggered compaction. While it was inherited from LevelDB it is currently disabled in RocksDB.
  • The space-amp constraint triggers major compaction as mentioned above. The naive estimate is used for space-amp. The estimate is truthy for overwrite-mostly workloads. I assume the space-amp constraint should be disabled for insert-mostly workloads.
  • Large sorted runs can lead to large index and filter blocks but that is unlikely because the largest sorted runs are logical and range-partitioned with an index and filter per partition (see here). RocksDB also optionally supports partitioned index and filter blocks.
  • Compaction for large sorted runs is not incremental (yet)
  • Large compactions are multi-threaded via subcompactions that split compaction input into N ranges (see here)
  • The allow_trivial_move option can use trivial moves in some cases when inserts are in key order to avoid re-writing sorted runs.
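
Below is a simplified sketch of the prefix-picking loop described above. It ignores runs that are being compacted, stop_style and other details, and it treats size_ratio as a plain multiplier the way the text does, so it is an illustration rather than the RocksDB code.

# Pick a prefix of the sorted runs (ordered from newest to oldest) to merge.
def pick_prefix(run_sizes, size_ratio, min_merge_width, max_merge_width):
    picked = [run_sizes[0]]
    for next_size in run_sizes[1:]:
        if len(picked) >= max_merge_width:
            break
        # Add the next run while it isn't too much larger than what was picked.
        if next_size <= sum(picked) * size_ratio:
            picked.append(next_size)
        else:
            break
    return picked if len(picked) >= min_merge_width else []

# Example: sizes in MB from newest to oldest.
print(pick_prefix([8, 8, 10, 40, 400], size_ratio=1.1, min_merge_width=2, max_merge_width=10))
# -> [8, 8, 10] because 40 > (8 + 8 + 10) * 1.1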
HBase

Answers to the geek code questions:
  • Compaction options are here. Search for hstore.compaction.
  • The compaction policies (ExploringCompactionPolicy, RatioBasedCompactionPolicy) are explained here.
  • The min and max number of sorted runs to merge are set by hbase.hstore.compaction.{min,max}
  • Time-based major compaction can be enabled or disabled. Search for hbase.hregion.majorcompaction
  • RatioBasedCompactionPolicy seems similar to universal compaction in RocksDB but ExploringCompactionPolicy might be closer to STCS. However, I didn't spend enough time to figure them out.
Cassandra

Answers to the geek code questions for STCS:
  • The compaction picker is here and options are here
  • Major compaction is triggered manually
  • Sorted runs are ordered by size and grouped into size classes (buckets). The compaction picker looks at size classes that have at least min_threshold sorted runs and chooses the one that gets the most reads (read hotness). The bucket_high and bucket_low options determine the range of sorted run sizes that can be in one size class. More detail is here and here, and a simplified sketch follows this list.
  • Sorted runs that are not adjacent in commit timestamp order can be merged
  • The min and max number of sorted runs to merge are determined by the options min_threshold and max_threshold
  • The read-amp constraint is min_threshold and triggers minor compaction when there is a size class with at least that many sorted runs
  • There isn't support for read-triggered minor compaction but read hotness is used to prioritize which size class to compact next.
  • I am not sure whether large index and filter blocks are a problem
  • Compaction is single-threaded. I don't know whether the sharding built into Cassandra makes it easier to avoid large shards and the problems related to large compactions.
  • Does not do incremental compaction
  • I am not aware of optimizations for key-order inserts, similar to the allow_trivial_move feature in RocksDB universal compaction.
  • Cassandra might use HyperLogLog to estimate overlap between sorted runs. This can help estimate whether there would be a benefit from merging sorted runs by predicting how many redundant keys would be removed. But all I have today is a DataStax blog post.
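
Below is a simplified sketch of the STCS bucketing idea: walk the sorted runs in size order and group runs whose size is close to the running average of a bucket. This is an approximation written from the description above, not the Cassandra code, and the defaults in the example are only illustrative.

def make_buckets(run_sizes, bucket_low=0.5, bucket_high=1.5):
    buckets = []
    for size in sorted(run_sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            # A run joins a bucket when its size is close to the bucket average.
            if bucket_low * avg <= size <= bucket_high * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return buckets

# Buckets with at least min_threshold sorted runs are compaction candidates
# and read hotness picks among them.
print(make_buckets([10, 11, 12, 13, 100, 110, 990]))
# -> [[10, 11, 12, 13], [100, 110], [990]]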

ScyllaDB

Answers to the geek code questions for STCS (size tiered compaction strategy) with a few answers for ICS (incremental compaction strategy). Most of the answers above for Cassandra are also true here:
  • The compaction picker for STCS is here and options are here
  • Major compaction is triggered manually for STCS. ICS has an option to trigger it based on a SAG (space amplification goal). And they explain it is useful for overwrite workloads.
  • The min and max number of sorted runs to merge are determined by the options min_threshold and max_threshold
  • Compaction is single-threaded but ScyllaDB explains how the shard-per-core model reduces the problem from large/slow compactions.
  • STCS does not do incremental compaction but ScyllaDB Enterprise includes ICS
Updates
  • Added a link to the use of HyperLogLog by DataStax Cassandra to estimate sorted run overlap

Wednesday, December 1, 2021

Sysbench: MySQL 5.6, 5.7 & 8.0 on a small server

This has results for sysbench with MySQL versions 5.6.49, 5.7.35 and 8.0.2x using the same setup as previously used for the insert benchmark. The versions for 8.0.2x are 8.0.20, 8.0.22, 8.0.23, 8.0.26 and 8.0.27.

Executive summary:

  • For CPU-bound setups, MySQL 8.0.27 is slower than 5.6.49 on 39 of the 42 microbenchmarks. For the microbenchmarks where it was slower, on average it gets 72% of the throughput vs 5.6.49.
  • For IO-bound setups, MySQL 8.0.27 is slower than 5.6.49 on 27 of the 42 microbenchmarks, faster on 13 and had the same throughput on 2.
  • In most cases more CPU overhead in 8.0 is the reason it is slower than 5.6.

Details

See the Postgres post for more details on how I run sysbench. The summary is there are 43 invocations of sysbench and each is a microbenchmark. But a typo meant I only have results for 42 of the 43.

Three configurations were tested: 10M rows without prepared statements, 10M rows with prepared statements, 400M rows without prepared statements. The 10M row tests are CPU-bound as the database fits in memory. The 400M row test is usually IO-bound as the database is larger than memory. Below I call the test configurations 10m.prep0, 10m.prep1 and 400m.prep0. The database sizes are 2.9G at 10M rows and 91G at 400M rows after the initial load.

One of my goals was to understand the impact of using prepared statements with MySQL especially with respect to CPU overhead. They didn't make a big difference with the exception of the one test for which there is a performance bug with prepared statements (random-points.pre.range1000.pk1).

How to read the results

The presentation for these results is just text files pasted into gists. There are two types of files: qps and metrics. 

The qps file has the relative QPS for each test where the relative QPS is the ratio: QPS-for-me / QPS-for-base.

  • QPS-for-me is the QPS for one of MySQL 5.6.49, 5.7.35 or 8.0.2x
  • QPS-for-base is the QPS for MySQL 5.6.49
  • Thus each line in the qps file has seven numeric columns and the first one has the value 1.00 (QPS-for-5.6.49 / QPS-for-5.6.49). The second through seventh columns have the values for (QPS-for-me / QPS-for-5.6.49) where me is 5.7.35, 8.0.20, 8.0.22, 8.0.23, 8.0.26 and 8.0.27 in that order. When the value in a column is less than 1.0 then that version is slower than 5.6.49.
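
As a small illustration of how a line in the qps file is derived (the QPS numbers below are made up):

qps = {"5.6.49": 1000, "5.7.35": 900, "8.0.20": 760, "8.0.22": 770,
       "8.0.23": 760, "8.0.26": 740, "8.0.27": 720}
base = qps["5.6.49"]
print("\t".join(f"{v / base:.2f}" for v in qps.values()))
# -> 1.00  0.90  0.76  0.77  0.76  0.74  0.72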
The metrics file has absolute and relative (to results for 5.6.49) values for each of the 42 tests. This file is long so I start with the qps file and then consult the metrics file to understand why one result is better than another.
  • cpu/o - CPU per operation, measured by iostat. While there is a unit for this, I don't worry about that as it is most useful for comparing numbers between different test configurations so the units drop out.
  • r/o - storage reads per operation, measured by iostat
  • rKB/o - KB read from storage per operation, measured by iostat
  • wKB/o - KB written to storage per operation,  measured by iostat
  • o/s - operations/second (QPS, inserts/s, etc). I tend to use QPS below for everything.
  • dbms - the MySQL version and configuration file
Results

Here are the qps and metrics files for the 10m.prep0 (CPU-bound, no prepared statements):
  • 8.0.27 is faster than 5.6.49 on 3 of 42 tests (read-only.pre.range10000.pk1, read-only.range10000.pk1, update-index.range100.pk1). The first two do long range scans and the relative QPS is 1.25. For the update-index test the relative QPS is 1.48.
  • The relative QPS for 8.0.27 is less than 0.8 for 35 of the 42 tests. This means that 8.0.27 gets less than 80% of the throughput relative to 5.6.49 on 35 of the tests. The minimum relative QPS is 0.61. For the 39 tests on which 8.0.27 is slower than 5.6.49 the average of the relative QPS is 0.72.
  • For the tests in which the relative QPS in 8.0.27 is much less than one the cause is using more CPU/query. See the cpu/o column in the metrics file.
  • The speedup for update-index is harder to explain. 8.0.2x uses less CPU, but the difference isn't large enough to explain the improvement. The speedup for the read-only tests is mostly from using less CPU/query in 8.0.2x. See the cpu/o column in the metrics file.

Here are the qps and metrics files for the 10m.prep1 (CPU-bound, prepared statements):
  • Results are similar to 10m.prep0
  • 8.0.27 is faster than 5.6.49 on 3 of 42 tests. The relative QPS is 1.24 for the read-only tests that do long range scans and 1.54 for the update-index test.
  • The relative QPS for 8.0.27 is less than 0.8 for 35 of the 42 tests. 
  • The minimum relative QPS is 0.30 for random-points.pre.range1000.pk1 (AFAIK there is a known perf bug in the optimizer). MySQL 8.0.2x uses more than 3X CPU/query compared to 5.6.49. See the cpu/o column in the metrics file. 
  • For the 39 tests on which 8.0.27 is slower than 5.6.49 the average of the relative QPS is 0.70.

Here are the qps and metrics files for the 400m.prep0 (IO-bound, no prepared statements):
  • 8.0.27 is faster than 5.6.49 for 13 of the 42 tests (tests are listed below)
  • 8.0.27 matched 5.6.49 QPS for 2 of the 42 tests (random-points.pre.range1000.pk1, random-points.range1000.pk1)
  • The relative QPS for 8.0.27 is less than 0.8 for 8 of the 42 tests (tests are listed below)
  • For the 27 tests on which 8.0.27 is slower than 5.6.49 the average of the relative QPS is 0.84

Relative QPS for 8.0.27 vs 5.6.49 when 8.0.27 is faster for the IO-bound tests:

1.22    point-query.pre.range100.pk1
1.02    random-points.pre.range10.pk1
1.05    read-only.pre.range100.pk1
1.13    read-only.pre.range10000.pk1
1.03    range-covered-pk.pre.range100.pk1
1.02    range-notcovered-pk.pre.range100.pk1
1.12    update-inlist.range100.pk1
1.17    read-only.range100.pk1
1.14    read-only.range10000.pk1
1.03    point-query.range100.pk1
1.05    random-points.range10.pk1
1.05    range-covered-pk.range100.pk1
1.06    range-notcovered-pk.range100.pk1

Relative QPS for 8.0.27 vs 5.6.49 when 8.0.27 is much slower for the IO-bound tests:

0.75    range-covered-si.pre.range100.pk1
0.68    update-one.range100.pk1
0.73    update-zipf.range100.pk1
0.55    read-write.range10.pk1
0.65    hot-points.range100.pk1
0.75    range-covered-si.range100.pk1
0.79    scan.range100.pk1
0.60    insert.range100.pk1

Sysbench: PostgreSQL 12, 13 and 14 on a small server

This has results for sysbench with PostgreSQL versions 12.4, 13.4 and 14.0 using the same setup as previously used for the insert benchmark.

Executive summary:

  • Postgres continues to be boring. There are a few regressions and more improvements.
  • There were ~40 tests and each ran for 5 minutes. Running each test for such a short time means there is more chance for confusing results. Alas, running each test for 30 minutes or more would have taken too long because I repeated the benchmark many times for different DBMS, configurations and database sizes. I started another round just for the Postgres IO-bound configuration that will take about 3 days to get results.

Details

See the insert benchmark post for more details on the hardware and Postgres configurations. As always, there are layers upon layers of shell scripts. The file all_small.sh lists the sequence in which the test steps are run and each step is run for 300 seconds with 1 connection (low concurrency!). From the all_small.sh script there were 42 different invocations of sysbench, each one is a microbenchmark. There should have been 43 but a typo got in the way. From 42 X 5 minutes for the tests plus more time for the load it takes 4 to 8 hours for all_small.sh to get results for each DBMS + configuration.

Three configurations were tested: 10M rows without prepared statements, 10M rows with prepared statements, 400M rows without prepared statements. The 10M row tests are CPU-bound as the database fits in memory. The 400M row test is usually IO-bound as the database is larger than memory. Below I call the test configurations 10m.prep0, 10m.prep1 and 400m.prep0. The database sizes are 2.6G at 10M rows and 99G at 400M rows after the initial load.

All of the shell scripts are here and the tests were run using a script like this.

I use my fork of sysbench that includes a few more tests (Lua scripts). The Lua scripts are here. My fork might be a few years behind upstream.

How to read the results

The presentation for these results is just text files pasted into gists. There are two types of files: qps and metrics. 

The qps file has the relative QPS for each test where the relative QPS is the ratio: QPS-for-me / QPS-for-base.

  • QPS-for-me is the QPS for one of Postgres 12.4, 13.4 or 14.0
  • QPS-for-base is the QPS for Postgres 12.4.
  • Thus each line in the qps file has three numeric columns and the first one has the value 1.00 (QPS-for-12.4 / QPS-for-12.4). The value in the second column is the QPS for 13.4 relative to 12.4, and in the third is the QPS for 14.0 relative to 12.4. When the value in the second or third column is less than 1.0 then that version is slower than 12.4.
The metrics file has absolute and relative (to results for 12.4) values for each of the 42 test steps. This file is long so I start with the qps file and then consult the metrics file to understand why one result is better than another.
  • cpu/o - CPU per operation, measured by iostat. While there is a unit for this, I don't worry about that as it is most useful for comparing numbers between different test configurations so the units drop out.
  • r/o - storage reads per operation, measured by iostat
  • rKB/o - KB read from storage per operation, measured by iostat
  • wKB/o - KB written to storage per operation,  measured by iostat
  • o/s - operations/second (QPS, inserts/s, etc). I tend to use QPS below for everything.
  • dbms - the Postgres version and configuration file
Results

Here are the qps and metrics files for the 10m.prep0 (CPU-bound, no prepared statements):
  • scan.range.pk1 has a regression in 13.4 that was mostly fixed in 14.0. The QPS relative to 12.4 is 0.82 for 13.4 and 0.96 for 14.0. From the metrics file there is a 30% increase in CPU overhead (see the cpu/o column) in 13.4. The scan test does a full scan of the test table using a WHERE clause that filters all rows: SELECT * from %s WHERE LENGTH(c) < 0.

Here are the qps and metrics files for the 10m.prep1 (CPU-bound, prepared statements):
  • Results are similar to 10m.prep0, the only regression is for scan.range.pk1 for 13.4.

Here are the qps and metrics files for the 400m.prep0 (IO-bound, no prepared statements):
  • scan.range.pk1 has a regression from CPU overhead for 13.4
  • read-write.range10.pk1 has a ~10% regression in 13.4 and 14.0. The biggest difference from the metrics file is for wKB/o. Perhaps this test needs to run for more time to get a better signal.
  • Five of the tests improve by ~10% or more in 13.4 and 14.0. I don't count the improvement for random-points.pre.range1000.pk1 because the QPS is too small (5 & 6) and rounding might explain the difference. Reduced CPU/query doesn't explain all of the improvements as in many cases there is also less IO/query but my Postgres expertise isn't strong enough to have a good guess. There have been improvements to vacuum and b-tree indexes that are likely part of the reason.
Updates

I repeated the IO-bound tests with more time per microbenchmark, 1800 seconds rather than 300. Here are the qps and metrics files. For 14.0, there were 3 tests for which perf improved more than ~5% and 5 for which it got worse by more than ~5%.

This lists microbenchmarks for which the QPS ratio is > 1.05 for Postgres 13.4 or 14.0 relative to 12.4. I picked 1.05 as a cutoff. There aren't any huge speedups, but it is nice to see some improvements.

12.4    13.4    14.0
1.00    1.12    1.13    point-query.warm.range100.pk1
1.00    1.04    1.07    points-notcovered-si.pre.range100.pk1
1.00    1.02    1.07    read-only.range10.pk1
1.00    1.05    1.07    points-notcovered-si.range100.pk1

And this lists microbenchmarks for which the QPS ratio is <= 0.95 for Postgres 13.4 or 14.0 relative to 12.4. From the metrics file, new CPU overhead isn't the root cause for 14.0.

12.4    13.4    14.0
1.00    0.98    0.94    update-inlist.range100.pk1
1.00    0.84    0.93    scan.range100.pk1
1.00    1.04    0.93    insert.range100.pk1

Tuesday, November 30, 2021

Insert benchmark: MySQL 5.6, 5.7 & 8.0 on a small server

I have another performance report for the insert benchmark using MySQL 5.6, 5.7 and 8.0 on a small server. As normal, I had to repeat the tests a few times: I make mistakes and also want results for new my.cnf settings. The goal for the tests is to understand how low-concurrency performance changes in new MySQL releases. Low-concurrency tests are great for finding CPU regressions. High-concurrency tests are also great but I only have small servers at home.

For the CPU-bound setup, new MySQL (8.0) gets between 10% and 37% less throughput than old MySQL (5.6) because 8.0 uses more CPU/query. The increase in CPU/query for MySQL from 5.6 to 8.0 is large while PostgreSQL avoids CPU regressions from 12.4 to 14.0.

Disclaimers:

  • Be careful about interpreting these results, the context here is a workload with simple queries.
  • These results have yet to be reviewed by others. There can be mistakes.
  • 8.0.20 and 8.0.22 were exciting releases. Performance is slowly improving since then.
Details

Tests were run for MySQL versions 5.6.49, 5.7.35, 8.0.20, 8.0.22, 8.0.23, 8.0.26 and 8.0.27. I try to use similar my.cnf settings except when new versions have new features. The options are in the my56, my57 and my80 directories here (look for my.cnf.cy8). Scripts to build MySQL from source are here for MySQL 5.6, 5.7 and 8.0. One of the big changes I made was to add innodb_log_writer_threads=OFF to the MySQL 8.0 my.cnf settings. Without that the CPU overhead was much worse.

The test servers have 4 cores and tests were run with either 1 insert connection or 2 connections (rate limited inserts, not rate-limited queries). This page has links to explain the insert benchmark and perf report format. The test was run in CPU-bound and IO-bound modes. For each mode, X rows were inserted into a table without a secondary index, then 3 secondary indexes were created, then Y more rows were inserted. The values for X and Y were (20M, 20M) for CPU-bound and (500M, 10M) for IO-bound. After the inserts there were three insert+query steps where the insert connection was rate-limited to 100, 500 and 1000 inserts/s. Each query step was run for 2 hours.

The test server uses Ubuntu 20.04, gcc-9.3.0-17 and uname -a shows 5.4.0-89-generic. The database filesystem uses XFS with discard enabled. 

Results

I have shell scripts to generate perf reports. The reports for CPU-bound and IO-bound are here and here. Below I use throughput ratios where the denominator is the value for MySQL 5.6.49 and the numerator is the value for another version. A value less than 1.0 means the new version is slower than 5.6.49.

Comments on CPU-bound (see the summary):

  • for the l.i0 test (inserts without secondary indexes) the throughput ratio is 0.63 for 8.0.27. The slowdown is explained by an increase in CPU/query. See the cpupq column here that is 15 for 5.6.49 and 21 for 8.0.27.
  • for the l.i1 test (inserts with secondary indexes) the throughput ratio is 0.73 for 8.0.27. The slowdown is explained by an increase in CPU/query. See the cpupq column here that is 32 for 5.6.49 and 43 for 8.0.27.
  • for the q100.1, q500.1 and q1000.1 tests the throughput ratios using the qps columns are 0.81, 0.76 and 0.74 for 8.0.27. This is probably explained by more CPU/query. See the cpupq columns for q100.1, q500.1 and q1000.1. The throughput ratios were much better for 8.0.22 than for 8.0.27 (0.94, 0.89 and 0.86) but I have not attempted to debug that. I was unable to explain why the wkbpi column is larger for 8.0.27 than 5.6.49 and larger means more write-amp (KB written/insert). It was larger for 8.0.27 when measured by iostat, but smaller when measured by global status counters like Innodb_dblwr_pages_written, Innodb_os_log_written, Innodb_data_written and Innodb_buffer_pool_pages_flushed. This remains a mystery. I am not sure it is possible to configure InnoDB writeback in 8.0 to match 5.6. The only difference in the my.cnf settings is the use of innodb_idle_flush_pct=1 for the 8.0 my.cnfs.

Comments on IO-bound (see the summary):

  • for the l.i0 test (inserts without secondary indexes) results are similar to the CPU-bound results above because this step is still CPU-bound, 8.0 is slower because it uses more CPU/query. See here.
  • for the l.x test (create index) 8.0 is faster than 5.6. I think significant changes were done for InnoDB create index. See here.
  • for the l.i1 test (inserts with secondary indexes) results for 8.0.26 & 8.0.27 are better than 5.6.49, but results for 8.0.20, 8.0.22 and 8.0.23 are in between. With the new release process there can be significant changes for performance between point releases. See the ips column here. I think this is related to the big changes to redo log code in 8.0 as CPU/query (cpupq) and context switches/query (cspq) were larger in 8.0.20 and 8.0.22.
  • for the q100.1, q500.1 and q1000.1 tests the throughput ratios are larger than 1 as 8.0 is faster than 5.6, except for 8.0.20 which appears to be an exciting release. See the qps and cpupq columns here. MySQL 8.0.27 uses less CPU/query than 5.6.49.

Insert benchmark: PostgreSQL 12, 13 and 14 on a small server

I have another performance report for the insert benchmark using PostgreSQL 12.4, 13.4 and 14.0 on a small server. As normal, I had to repeat the tests a few times: I make mistakes and sometimes need results for new tuning options. The goal for the tests is to understand how low-concurrency performance changes in new PostgreSQL releases. Low-concurrency tests are great for finding CPU regressions. High-concurrency tests are also great but I only have small servers at home.

Executive summary:

  • Postgres is boring. There were no regressions.

Details

Tests were run for Postgres versions 12.4, 13.4 and 14.0. I used the same configuration file (see here). The script to build Postgres from source is here.

The test servers have 4 cores and tests were run with either 1 insert connection or 2 connections (rate limited inserts, not rate-limited queries). This page has links to explain the insert benchmark and perf report format. The test was run in CPU-bound and IO-bound modes. For each mode, X rows were inserted into a table without a secondary index, then 3 secondary indexes were created, then Y more rows were inserted. The values for X and Y were (20M, 20M) for CPU-bound and (500M, 10M) for IO-bound. After the inserts there were three insert+query steps where the insert connection was rate-limited to 100, 500 and 1000 inserts/s. Each query step was run for 2 hours.

The test server uses Ubuntu 20.04, gcc-9.3.0-17 and uname -a shows 5.4.0-89-generic. The database filesystem uses XFS with discard enabled. 

Results

I have shell scripts to generate perf reports. The reports for CPU-bound and IO-bound are here and here. From the CPU-bound (see the summary) and IO-bound results (see the summary):

  • Postgres is boring. There were no regressions.

Sunday, November 21, 2021

Well actually, LSM and B-tree

 Things I read about LSM that make me go well actually:

  • LSM does sequential writes - well actually, writes are sequential logically (per-file) but not physically (per-device). From the physical perspective compaction writes are large & concurrent rather than sequential. Compaction writes to files sequentially but it is usually concurrent (multi-threaded) with a file written sequentially per-thread. There can also be small writes to the WAL. With concurrent compaction threads the large (maybe 1MB+) writes are interleaved at the device.
  • Compaction does merge-sort - well actually, that is a k-way merge, not a merge-sort.
  • Leveled compaction suffers from write-amplification - well actually, write-amplification was 10X smaller with MyRocks when the first workload was moved to it from InnoDB. Write-amp with leveled compaction is larger than with tiered, but that is the (C)RUM Conjecture in action: leveled means less space-amp at the cost of more write-amp. One of the problems is that the naive (hand-wavy) estimate of write-amp for leveled (~8 or ~10 per level) is frequently too pessimistic.
  • LSM suffers from write stalls - well actually, that is more true than the previous bullet points. But the real problem for an LSM is figuring out how to smooth the write stalls (reduce response time variance) and RocksDB needs to get better at that. Everything (b-tree, LSM, etc) suffers from write-stalls when ingest exceeds write back throughput and the easy way to reproduce that is a benchmark with WAL fsync disabled. A b-tree benefits from feedback that serves as a small write-stall - when the buffer pool is full of dirty pages and the working set is not cached then a write likely needs to wait for 1 page to be written back before it can read something into the buffer pool. But an LSM doesn't have that feedback as a write usually just needs something to be put into the memtable, non-unique secondary index maintenance doesn't require reads, and even reads needed to check unique constraints (likely for a PK) can be avoided in some cases with bloom filters. Checkpoint is one place where a b-tree can suffer from large write stalls. Not that many years ago the insert benchmark was great at documenting longer write-stalls related to checkpoint in InnoDB and WiredTiger. Fortunately those have been fixed. Postgres also had serious problems, now fixed, from checkpoint IO creating bursts of write IO that starve other IO (like reads for user queries). Checkpoint is a hard problem.

Saturday, November 13, 2021

DeWitt Clause vs the public cloud

The DeWitt Clause is the name of a clause in the terms of service that prevents a customer from disclosing performance of a licensed product. Things turned out OK for David DeWitt and he even went on to work for at least one company that now uses a DeWitt Clause for their database products.

I am not a lawyer but I have a strong opinion on the DeWitt Clause and non-compete agreements. I think both are anti-competitive and hope more is done to limit their usage.

One of the arguments for DeWitt Clauses is that benchmarking is hard, others will get it wrong and that can mislead customers. While the first parts are correct (it is hard, people will get it wrong) the last clause feels like a bad faith argument to me. Vendors haven't always pursued truth while running their benchmarks. I agree there will be plenty of benchmarketing in a world without DeWitt Clauses. I think the benefits outweigh the problems.

Database vendors

We used to think of this as a thing for legacy enterprise database vendors and many of the new vendors don't use it -- not MongoDB Atlas as of 24-Jun-2021, not CockroachDB as of 19-Oct-2021, not TimescaleDB as of 24-Sep-2020 and 5-Mar-2019, not Altinity as of 1-Jul-2021.

But other new vendors use it and this became news after Databricks published a great result for 100TB TPC-DS and a less than great result for SnowflakeDB. They then changed their license to the DeWitt Embrace Clause which is like copyleft for the DeWitt Clause. This means that if you disclose benchmark results for them, they are allowed to do the same for you. I hope the DeWitt Embrace Clause is embraced by more vendors who use a DeWitt Clause.

For the record, Snowflake quickly replied to the Databricks result. Frankly, I prefer to see a benchmarketing battle between Snowflake and Oracle. But it is awesome that the Snowflake response explains that anyone can repeat their result by signing up, spending about $300 and clicking a few buttons. Cloud DBMS is fun. The Snowflake ToS (as of 1-Nov-2021) don't have a DeWitt Clause.

Public cloud vendors

Switching from managed database providers to public cloud vendors, the DeWitt Clause usage is interesting. From what I was able to find, Microsoft has the worst terms followed by Google:

  • AWS (as of 28-Oct-2021) allows benchmark disclosure via the DeWitt Embrace Clause
  • Microsoft was harder to find docs on. In this undated agreement if you offer a service on Azure that competes with them, then they are allowed to benchmark and disclose the results of your service. This is far more aggressive than any DeWitt Clause variant. But I didn't find any other mention of benchmarks (to allow or prevent them for Azure).
  • Oracle (a PDF dated 22-Jun-2020) has a DeWitt Clause and requires pre-approval for disclosure. They also don't allow performing a benchmark without pre-approval.
  • Google (as of 18-Oct-2021) has a DeWitt Clause and requires pre-approval for disclosure. They also ban competitors from running benchmarks even when the results are not disclosed. They added the DeWitt Clause on 6-Oct-2020.
Public cloud benchmarks

Examples of useful public cloud benchmarks that will be prevented in the future by DeWitt Clauses. I will add to this list as I look for them. For now there was just one from long ago that triggered this post:
More on Microsoft

This is the part in the Microsoft agreement that seems aggressive:

If Customer offers a service competitive to an Online Service, by using the Online Service, Customer agrees to waive any restrictions on competitive use and benchmark testing in the terms governing its competitive service. If Customer does not intend to waive such restrictions in its terms of use, Customer is not allowed to use the Online Service.



Wednesday, October 13, 2021

Compatible with MySQL or Postgres?

Open and closed source scale-out DBMS that are compatible with MySQL and Postgres have emerged on the market. This is great for the community but there will be much confusion about the meaning of compatible.

This post has yet to have anything on the cloud vendors in China. They are doing impressive work but I don't know enough about it. I am happy to update this post when I learn more.

But first, the MySQL and Postgres teams.

One way to describe compatibility is via three levels: protocol, syntax and semantics. By upstream I mean MySQL and PostgreSQL.

  • protocol - an app can use existing client libraries to authenticate and connect 
  • syntax - the DBMS will parse SQL that upstream parses. It is not guaranteed to provide semantics that matches upstream and some syntax can be (or might be) parsed but ignored. Syntax compatible implies a best effort to match upstream semantics but that isn't guaranteed nor must it be guaranteed to be useful. 
  • semantics - the DBMS will match upstream semantics. This implies syntax compatible.
  • QA compatible - the DBMS will match upstream WRT passing and failing the QA tests provided with upstream. This is slightly stronger than semantics compatible.
  • version - to which version is this (somewhat) compatible? This is easier to explain when the compatible thing is forked from the real thing. When it is, just list the release from which it has merged changes.
Note that syntax and semantics compatibility aren't all or nothing. A syntax compatible DBMS can be useful without supporting (parsing) 100% of the upstream syntax. A semantics compatible DBMS can be useful without supporting or matching behavior for 100% of upstream syntax.

Also note that semantics compatible implies syntax compatible. But protocol compatible implies neither.

Elsewhere in DBMS land

While this post is about MySQL and PostgreSQL, compatibility is growing in popularity elsewhere:
  • MariaDB provides an Oracle compatible mode that provides syntax but not protocol or semantics compatibility. 
  • EnterpriseDB provides an Oracle compatibility product that I don't know much about. 
  • Amazon will soon open source Babelfish that is protocol compatible with SQL Server.
  • Amazon DocumentDB is protocol, syntax and semantics compatible with (an older version of) MongoDB. It supports some (much) of the MongoDB 4.0 API as of October, 2021 per Wikipedia. Public statements suggest this was built on top of, or reusing, the Aurora PostgreSQL code.
Updates
  • added TimescaleDB to Team Postgres
  • added Redshift to Team Postgres
  • added SingleStore to Team MySQL
  • added ClickHouse to Team MySQL
  • added CrateDB to Team Postgres
  • added Dolt to Team MySQL
  • added Team MongoDB
  • added YellowBrick and Greenplum to Team Postgres
  • added Materialize to Team Postgres
  • Building a DBMS to be compatible with MySQL costs more than for Postgres. The Team Postgres projects can reuse BSD licensed PG code while the Team MySQL projects would have to respect the GPL.
  • I have a vague memory of this but to be JDBC compliant the MySQL JDBC driver does a few queries at connect time (either via tables or session/global variables). My JDBC-related bugs are here.

Friday, October 1, 2021

The other way to compress InnoDB: outsource it

There are at least three ways to do compression for InnoDB - classic, holepunch and outsource. 

The classic approach (table compression) was used and enhanced by the FB MySQL team. It might not have been widely used elsewhere. While it works, even for read/write workloads, the implementation is complicated and it isn't clear that it has a bright future.

The holepunch approach (page compression) is much simpler than the classic approach. Alas, I am skeptical that a filesystem will be happy tracking the metadata from doing a holepunch for every (or most) pages written. I am also skeptical that unlink() response times of seconds to minutes (because of the holepunch usage) will be good for a production DBMS. I wrote a few posts about my experience with the holepunch approach: here, here, here and here.

The outsource approach is the simplest from the perspective of InnoDB - let the filesystem or storage do the compression for you. In this case InnoDB continues to do filesystem reads and writes as if pages have a fixed size and the FS/storage compresses prior to writing to storage, decompresses after reading from storage and does all of the magic to make this work. While there are log-structured filesystems in OSS that might make this possible, such filesystems aren't widely used relative to XFS and the EXT family. There is also at least one storage device on the market that supports this.

tl;dr - the outsource approach is useful when the data is sufficiently compressible and the cost of this approach (more write-amp) is greatly reduced when it provides atomic page writes.

After publishing I noticed this interesting Twitter thread on support for atomic writes.

Performance model

I have a simple performance model to understand when the outsource approach will work. As always, the performance model makes assumptions that can be incorrect. Regardless the model is a good start when comparing the space and write amplification with the outsource approach relative to uncompressed InnoDB.

Assumptions:

  • The log-structured filesystem has 1+ log segments open for writing. Compressed (variable length) pages are written to the end of an open segment. Such a write can create garbage - the previous version of the page stored elsewhere in a log segment. Garbage collection (GC) copies live data from previously written log segments into open log segments to reclaim space from old log segments that have too much garbage. 
  • The garbage in log segments is a source of space amplification. Garbage collection is a source of write amplification.
  • g - represents the average percentage of garbage (free space) in a previously written log segment. 
  • r - represents the compression rate. With r=0.5 a 16kb page is compressed to 8kb.
  • Write and space amplification are functions of g:
    • write-amp = 100 / g
    • space-amp = 100 / (100 - g)
Risks in the assumptions:
  • Assumes that g is constant across log segments. A better perf model would allow for variance.
  • Assumes that r is constant across pages. A better perf model might allow for variance.
  • Estimates of write and space amplification might be more complicated than the formula above.
Results

Now I estimate the space-amp and write-amp for outsource relative to uncompressed InnoDB. The ratios are (value for outsource / value for uncompressed InnoDB). For space-amp when the ratio is < 1 then outsource uses less space vs uncompressed InnoDB. For write-amp when the ratio is > 1 then outsource writes more to storage vs uncompressed InnoDB. 

I show below that when the compression rate (r above) is < 0.6 then outsource provides much less space-amp without suffering from a too-large increase in write-amp. But when r is >= 0.6 the increase in write-amp, relative to uncompressed InnoDB, might be a problem.

However, whether a large increase in write-amp is a problem depends on your workload. For example, if 99% of storage IOPs are reads and 1% are writes with uncompressed InnoDB then a write-amp penalty that changes this from 1% writes to 2% writes is unlikely to be a problem.

I created a graph with Desmos to show the three ratios. The graph allows r to be adjusted. I then copied some values from the graph into a table below. The graph has 3 curves, one for each ratio:
  • s is the space-amp ratio where s = r * 100 / (100 - g).
  • w1 is the write-amp ratio assuming the doublewrite buffer is enabled where w1 = r * 100/g.
  • w2 is the write-amp ratio assuming the doublewrite buffer is disabled. This assumes that outsource provides atomic page writes for free. The formula is w2 = r/2 * 100/g.
The table below has values from the graph for r = 0.4, 0.5 and 0.6. What I see in the graph is that with r in (0.4, 0.5) and the doublewrite buffer disabled (w2) it is possible to get much of the compression benefit (see s) from outsource without a significant increase in write-amp. But the write-amp penalty can be a problem when r >= 0.6. Of course, whether or not more write-amp is an issue depends on the storage read and write rates as I explained above. A small sketch that recomputes the table follows it.

r=0.4
g       s       w1      w2
20      0.50    2       1
30      0.57    1.33    0.67
40      0.67    1       0.5

r=0.5
g       s       w1      w2
20      0.63    2.5     1.25
30      0.71    1.67    0.83
40      0.83    1.25    0.63

r=0.6
g       s       w1      w2
20      0.75    3       1.5
30      0.86    2       1
40      1       1.5     0.75
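
The table values can be recomputed from the model. This sketch just applies the formulas above and adds nothing beyond them.

# s = space-amp ratio, w1/w2 = write-amp ratio with/without the doublewrite
# buffer, r = compression rate, g = percentage of garbage in a log segment.
def ratios(r, g):
    s = r * 100 / (100 - g)
    w1 = r * 100 / g
    w2 = (r / 2) * 100 / g
    return s, w1, w2

for r in (0.4, 0.5, 0.6):
    print(f"r={r}")
    for g in (20, 30, 40):
        s, w1, w2 = ratios(r, g)
        print(f"{g}\t{s:.2f}\t{w1:.2f}\t{w2:.2f}")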


Wednesday, September 29, 2021

Automating memory management in a DBMS

I recently read two papers on automating memory management - one for DB2, one for Oracle. While the papers aren't new (~2005), they are definitely worth reading. AFAIK both made it into the products. I have no idea how that turned out but the outcome doesn't change my opinion that the papers were excellent. Disclaimer - I worked on query execution for Oracle at the time and made a few changes to the row sources that I maintained to support the work described by the Oracle paper.

Both describe methods to automate memory management with the goal of improving performance. In many DBMS memory management is difficult for several reasons:

  • It isn't global and the DBA must estimate good values for the sizes of various caches and other large memory consumers: sort and hash for order by, aggregation, join and index create.
  • It isn't bounded when allocations are specified per instance of sort or join rather than one limit on the memory used by all sorts or joins. Too many concurrent queries == OOM in this case.
  • It is static. This is an issue for large caches like the buffer pool (block cache).
There was a marketing battle between Oracle and IBM in the mid-2000s over autonomic computing and these papers, along with getting the work into products, are one outcome of that battle.

From the DB2 paper

A summary:
  1. Differentiable functions were created to explain the query latency saved per page of memory for a variety of consumers. All functions had the same form (see section 3.2) with graphs that look like this. I think that time saved means wall clock time to account for CPU and IO overheads.
  2. For a few caches (statement and buffer pool) online simulators were added to the DBMS to estimate changes to the hit rate if the cache were given more memory. 
  3. Constrained optimization was used to determine the optimal allocation at any point in time. The constraint was the amount of memory available. I assume that Lagrange multipliers were used. A toy sketch of the general idea follows the comments below.
  4. Feedback control was used to apply the desired memory configuration. One benefit from this is to avoid negative impacts from suddenly applying significant changes.
  5. Decisions are revisited because workloads (types of queries, concurrency) change.
Comments:
  • At least for sort, the real curve for time vs memory can't be described by a differentiable function. See page 2 in the Oracle paper: there are three intervals where each interval can be explained by a function but the points where the intervals meet are a problem. I am not sure whether constrained optimization can handle that.
  • At a higher level, the paper doesn't consider the steps in the time vs memory benefits for some row sources (this is a kind of a repeat of the previous comment).
  • It wasn't clear whether row sources could give back memory. There is a risk from giving a long-running sort or hash a lot of memory. If there is no facility to get back some of that memory then memory allocation will be far from optimal until that query completes or is killed.
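
To make the idea concrete, here is a toy allocator that hands out memory one chunk at a time to the consumer whose (made-up) benefit function currently saves the most time per additional page. This is my sketch of the general idea, not the paper's method, which fits differentiable benefit functions and solves a constrained optimization problem.

import heapq

def allocate(total_pages, benefit_fns, chunk=1000):
    # benefit_fns maps consumer name -> time saved per page at a given allocation.
    alloc = {name: 0 for name in benefit_fns}
    heap = [(-fn(0), name) for name, fn in benefit_fns.items()]
    heapq.heapify(heap)
    remaining = total_pages
    while remaining >= chunk:
        _, name = heapq.heappop(heap)
        alloc[name] += chunk
        remaining -= chunk
        heapq.heappush(heap, (-benefit_fns[name](alloc[name]), name))
    return alloc

# Made-up diminishing-returns curves for three consumers.
fns = {"buffer_pool": lambda m: 100 / (1 + m / 5000),
       "sort_heap":   lambda m: 80 / (1 + m / 2000),
       "stmt_cache":  lambda m: 20 / (1 + m / 1000)}
print(allocate(50000, fns))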
From the Oracle paper

While Oracle has automatic memory management for caches (plan cache, buffer pool) and queries (hash, sort, bitmap index operators) the paper is limited to memory management for queries (PGA in Oracle terminology).

Summary:
  1. for each instance of a row source (a query is composed of multiple row sources, some row sources can use a lot of memory for sort and hash) the knees in the response time vs memory graph are estimated, see section 3.2 in the paper. These knees are a function of the row source and the data specific to a query (query A might be able to sort in-memory with 100M of RAM while query B might require 1G to remain in memory).
  2. decisions are made about the amount of memory that each row source can use based on the information from the previous point
  3. over time a row source might be able to get more memory and might be told to give back memory. Giving back memory might not be immediate, but should eventually be done.
Comments:
  • the paper was vague about how point #2 was done. While section 4.2.2 lists 5 rules used to guide the decisions it wasn't clear there was a goal beyond making sure all queries have enough memory to run. In contrast the IBM paper was clear that constrained optimization was used and explains the function that was optimized.


Tuesday, September 28, 2021

Developer experience, what about the other stakeholders?

While developer experience gets a lot of press, there are four stakeholders when you provide a database service. So we can call these DX, UX, MX and OX for developer, user, management and operations experience:

  • developers want the DBMS to stay out of their way. Schema is one example because waiting for a schema change gets in the way. Note that NoSQL databases have less schema rather than being schema-less because indexes are schema. I am curious whether less schema leads to a great developer experience in the long run for large scale projects given the risk of an unmanaged schema and poorly understood data. The developers who were able to move fast early in the project can create much tech debt for those who arrive years later.
  • users of the services that depend on the database want great QoS - high uptime, low and predictable latency for queries, low chance of lost data. They just want to use your database-backed app without problems.
  • management wants to minimize cost while getting great QoS.
  • operations wants to be able to sleep when they are oncall (self-healing database, auto failover, etc). It helps if the database isn't in chaos mode during working hours as that gives them freedom to get their work done.

Tuesday, September 21, 2021

Review of DiffKV

This is a review of Differentiated Key-Value Storage Management for Balanced I/O Performance that was published in ATC 2021. This is a wonderful paper that resolves several of my concerns about key-value separation (here and here). The authors have experience with key-value separation via Titan which is part of TiDB.

The key idea in the paper is to use leveled compaction for keys and tiered compaction for values. Write-amplification is reduced by not using leveled compaction for values. Read-amplification is reduced, relative to classic key-value separation (see WiscKey), by using tiered compaction for values rather than logs. With classic key-value separation the worst case read-amp for a scan is the need to do a random read from the log for every qualifying key as that random read can include a block decompression and a storage IO. With DiffKV the use of tiered compaction for values provides some ordering to avoid that worst case.

While I enjoyed the paper, it didn't have performance results with compression enabled and I am curious about the impact of decompression overhead on point and range read latency.

A summary of the approach:

  • small values are stored inline in the LSM tree
  • large values are stored in logs, called vLogs. The paper explains a few improvements to make GC faster in this case. But mostly this is classic key-value separation.
  • medium values use the vTree (tiered compaction for values)
  • the user defines the two size limits that determine whether a value is small, medium or large (a small routing sketch follows this list)
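
Here is a trivial sketch of the size-based routing. The limits are placeholders because DiffKV makes them user-defined, and the function and destination names are mine.

def route_value(value, small_limit=128, large_limit=8192):
    n = len(value)
    if n <= small_limit:
        return "lsm-inline"   # small values stay in the LSM tree
    if n <= large_limit:
        return "vtree"        # medium values use tiered compaction for values
    return "vlog"             # large values use classic key-value separation

print(route_value(b"x" * 64))     # lsm-inline
print(route_value(b"x" * 1024))   # vtree
print(route_value(b"x" * 65536))  # vlog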
The vTree

Medium values are stored in the vTree which resembles an LSM that uses tiered compaction. A vTable is the unit of storage in the vTree. A vTable is 8MB by default and stores KV pairs in key order. A sorted group is a sequence of vTables for which the keys don't overlap -- (vTable, sorted group) are similar to an (SST, sorted run) in RocksDB. A vTree has multiple levels where each level has one or more sorted groups and the levels have exponentially increasing size.

This is similar to tiered compaction for two reasons. First, there are multiple sorted groups per level. Second, when merges are done to move values from level N to N+1, the merge process only reads data from level N and then writes it to level N+1. It doesn't read and rewrite data already on level N+1. 

To reduce the number of times that values get rewritten the vTree has fewer levels than the LSM tree that stores keys. The smallest N levels in the key's LSM tree share the smallest vTree level. I assume the remaining levels in the key's LSM tree each have their own level in the vTree.

vTree GC

Keys in the LSM tree point into the vTree. When a value is moved between levels in the vTree the key's entry in the LSM tree must be updated to point to the new location in the vTree. DiffKV couples GC in the vTree with GC in the key's LSM tree to avoid extra writes. By coupling I mean that vTree GC is extra work added to compaction done on the key's LSM tree. When a key is moved from level N to level N+1 in the LSM tree then its value might be moved to the next level in the vTree.

By coupling it like this, DiffKV avoids the need to probe the index (key's LSM tree) to determine whether a value is live -- which is an extra overhead that occurs with classic key-value separation. It also avoids generating extra writes to change the pointer into the vTree that is stored with the key.
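
A sketch of what I understand the coupling to be: the compaction that rewrites a key from level N to N+1 also relocates the referenced value into the next vTree level and writes the new address into the key entry it is rewriting anyway. The code below is my approximation, not from the paper, and it assumes a 1:1 mapping between LSM levels and vTree levels, which the paper does not use (small LSM levels share one vTree level).

from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class KeyEntry:
    key: str
    addr: Tuple[int, int]          # (vTree level, slot) -- a placeholder address

class ToyVTree:
    """Just enough of a vTree to show the pointer update; not the paper's design."""
    def __init__(self) -> None:
        self.levels: Dict[int, list] = {}
    def read(self, addr: Tuple[int, int]) -> str:
        level, slot = addr
        return self.levels[level][slot]
    def append(self, level: int, value: str) -> Tuple[int, int]:
        self.levels.setdefault(level, []).append(value)
        return (level, len(self.levels[level]) - 1)

def compact_key_entry(entry: KeyEntry, src_level: int, vtree: ToyVTree) -> KeyEntry:
    """Runs inside the key-LSM compaction that moves entry from src_level to
    src_level+1. The value is live by definition (we reached it via a live key),
    so no index probe is needed, and the new address is written into the key
    entry that compaction is rewriting anyway (no extra write pass)."""
    value = vtree.read(entry.addr)
    new_addr = vtree.append(src_level + 1, value)
    return KeyEntry(entry.key, new_addr)

vtree = ToyVTree()
addr = vtree.append(1, "value-1")
moved = compact_key_entry(KeyEntry("k", addr), src_level=1, vtree=vtree)
assert moved.addr == (2, 0)
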

The above is called compaction-triggered merge and is driven by compaction of the key's LSM tree. The other trigger for vTree GC is called scan-triggered merge and is driven by scans that encounter regions of the vTree that need GC.

I am curious whether there is a need to also trigger GC based on vTables that have excessive space-amplification.

vTree rewrites

Leveled compaction has more write-amp than tiered because it rewrites previously written KV pairs. While vTree GC frequently avoids the need to do rewrites, there is one case where a rewrite is needed and I didn't learn enough about how DiffKV handles this.

There can be wasted space from vTables with low utilization because most of the values in the vTable have been deleted. In this case something should copy out (rewrite) the live values to a new vTable, and when values are moved the key's LSM tree must be updated to reference their new location. For workloads with updates and deletes this will be needed in the largest level of the vTree and might be needed for smaller levels.
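
If such a trigger exists it could be as simple as a per-vTable utilization check; the sketch below uses a made-up 50% threshold for illustration.

# Hypothetical: decide whether a vTable should be rewritten because too many
# of its values are dead. The 0.5 threshold is invented for illustration.
def needs_rewrite(live_bytes: int, total_bytes: int, min_utilization: float = 0.5) -> bool:
    return total_bytes > 0 and (live_bytes / total_bytes) < min_utilization

assert needs_rewrite(live_bytes=2 << 20, total_bytes=8 << 20)        # 25% live -> rewrite
assert not needs_rewrite(live_bytes=6 << 20, total_bytes=8 << 20)    # 75% live -> keep
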

Saturday, August 14, 2021

Happy 15th to Percona

I am thrilled to celebrate Percona's 15th anniversary. My time with MySQL began about the same time as the founding of Percona. Those years, the mid-2000s, were the dark ages for MySQL. There was doubt about the future of MySQL because there were many things that needed to be made better. Fortunately, upstream and the community, with much help from Percona, rallied to fix the problems and add needed features.

Percona has been a huge part of the community. I don't have time to list everything, but examples include saving the user conference, providing an open-source hot backup solution, educating many of us (including me) via their blog posts and helping push the product to get so much better. I have also been a customer as they helped with the MyRocks effort and in the great patch migration (5.6 to 8.0).

I like that the anniversary book starts by mentioning the multi-core scaling problem that InnoDB had in the mid-2000s. My MySQL deployment used 4-core, 1-socket CPUs at the time and 8-core, 2-socket servers were about to arrive. It was difficult to get more than 10k QPS from MySQL/InnoDB on such hardware regardless of the number of cores or sockets. InnoDB didn't benefit from 2-socket servers because of contention in the custom InnoDB mutex (which was also used to guard state in the custom InnoDB rw-lock). A fix was first proposed by Yasufumi Kinoshita, who would soon join Percona. After seeing his presentation at the user conference, my team at Google proposed a similar solution which was accepted by the InnoDB team and a serious problem was resolved.

I also like that the book is full of stories from people in the community. I know many of these people from my time traveling to conferences. I am an introvert except when at conferences. I rarely attend talks because I am out in the hallway track talking to others.

Wednesday, August 11, 2021

On storage engines

I wish it were easier to implement new storage engines for MySQL, Postgres, MongoDB and other OSS databases. There is so much innovation that we miss out on -- FASTER is one example. All three (MySQL, MongoDB, Postgres) have storage engine APIs but there are not many OSS implementations of them.

MyRocks, MySQL Aurora and MySQL HeatWave are examples of the benefits. But they also show that it helps to have the backing of a well-funded company because this is a huge undertaking.

The API for Postgres is the most recent and perhaps it will become the most popular. The API for MySQL is the oldest and has become harder to implement as more requirements (like partitioning) are pushed into it. The MongoDB API is friendlier than the MySQL API, but MongoRocks was deprecated when that API was enhanced to support transactions and RocksDB wasn't able to support user-provided commit timestamps.

One problem is naming. MongoDB and WiredTiger are databases, but WiredTiger is also a component of MongoDB. To avoid confusion I will use storage engine for things like WiredTiger, RocksDB and InnoDB and database for the systems that provide query languages, replication and more.

I have been involved in three such engines -- MyRocks, MongoRocks and a read-only MySQL engine used long ago at Google. MyRocks benefited from a large team and from Sergey Petrunya at MariaDB, while MongoRocks benefited from amazing support from MongoDB the company.

Two interesting things are:

  • The impact of database design decisions on the storage engine API. See what is needed for transactions in the MongoDB API.
  • The impact of the original storage engine on the storage engine API.
    • MongoDB doesn't take advantage of clustered indexes in WiredTiger or RocksDB. An extra (hidden) index must be used to map between DiskLoc and the PK index. See SERVER-14569.
    • I have more research to do before I understand the impact of vacuum and 32-bit transaction IDs on the Postgres API.

I also wonder whether the RocksDB API could be the universal storage engine API. In theory any storage engine implementing it would be able to replace RocksDB in MyRocks or MongoRocks. The goal is to implement the RocksDB API glue once per database and then use a variety of storage engines behind it. Of course, XKCD has a great take on standards and this might just be naive and wishful thinking on my part.
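
To make that concrete, here is a tiny sketch of the kind of common interface I have in mind -- just the obvious subset (point read, atomic batch write, range scan). This is my abstraction for illustration, not the actual RocksDB C++ API.

from abc import ABC, abstractmethod
from typing import Iterator, List, Optional, Tuple

class WriteBatch:
    """Collects writes to be applied atomically, RocksDB-style."""
    def __init__(self) -> None:
        self.ops: List[Tuple[str, bytes, Optional[bytes]]] = []
    def put(self, key: bytes, value: bytes) -> None:
        self.ops.append(("put", key, value))
    def delete(self, key: bytes) -> None:
        self.ops.append(("delete", key, None))

class Engine(ABC):
    """Hypothetical minimal engine interface a database could code against."""
    @abstractmethod
    def get(self, key: bytes) -> Optional[bytes]: ...
    @abstractmethod
    def write(self, batch: WriteBatch) -> None: ...
    @abstractmethod
    def scan(self, start: bytes, end: bytes) -> Iterator[Tuple[bytes, bytes]]: ...

Any engine that implements something like this could, in principle, slot in behind per-database glue written once -- which is the appeal, even if transactions, snapshots and timestamps make the real interface much larger.
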


Saturday, March 13, 2021

Sequential IO and an LSM tree

I see statements that sequential IO is a benefit of using an LSM tree. That is a vague and truthy statement. It is correct that the writes done by compaction are sequential per-file. But I am interested in the IO from the perspective of the storage device. 

The truthier claim is that with an LSM there will be many streams of IO (read & write) that benefit from large IO requests. The reads and writes done by compaction are sequential per file, but there is much concurrency and the storage device will see many concurrent streams of IO.

The IO patterns for an LSM with a busy workload are listed below (a back-of-the-envelope stream count follows the list):

  • N compaction threads where each thread is reading from ~10 files, merging the results and writing the output to create new SSTs. Each thread has one file open at a time that is written sequentially. The reads are sequential per-file. At any point in time there can be 10*N read streams and N write streams. The reads benefit from large IO requests and prefetch. The writes benefit from async write-back. The writes might benefit from clever placement on the storage device (see multistream). The writes are likely to generate large requests to the storage device.
  • The WAL, if enabled, is another write stream. If fsync is done on commit then that stream gets a sequence of small writes.
  • User queries generate read requests. For OLTP these are mostly small (random) requests. But in some cases (logical backup, analytics) there can be scans.
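
Some back-of-the-envelope arithmetic for the stream count, using the numbers from the list above and an arbitrary choice of 8 compaction threads:

# Rough stream count for a busy LSM; all numbers are illustrative.
compaction_threads = 8
files_read_per_compaction = 10        # ~10 input files per merge

read_streams  = compaction_threads * files_read_per_compaction   # 80
write_streams = compaction_threads                                # 1 output file per thread
wal_streams   = 1                                                 # if the WAL is enabled
print(read_streams + write_streams + wal_streams)                 # 89, plus user query reads
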

Sunday, March 7, 2021

How to submit a PR for RocksDB

I submitted my first PR to RocksDB as an external contributor and have advice on the process.

  • Read the Contribution Guide
  • Run make format on your diffs. This depends on clang-format and you might need to install it.
  • Edit HISTORY.md if there are user-facing changes
  • Optionally run make check to confirm tests pass. Some tests failed for me because they use /dev/shm and needed more than the 8G that /dev/shm gets on my small servers (which have 16G of RAM).
  • Confirm that RocksDB LITE builds still work: LITE=1 make db_bench
  • Some of the CI tests are flaky now and have bogus failures (see here). To get the CI tests to run again I edit and push small changes to HISTORY.md. The team is fixing the intermittent test failures. Alas, one of the intermittent problems was an internal gcc failure on ARM -- something I can't reproduce on my x86 hardware. The good news is that after three attempts everything passed.
  • Google for advice on how to do all of this in git. I am not a git guru but the advice I find is great.

Friday, February 26, 2021

More on read-only benchmarks and an LSM

This adds to my last post on the impact of LSM tree shape on read performance. I updated the test script to add one more stage, l0_and_lmax, that runs after compact_all. A full compaction is done for compact_all to show the benefit (more QPS) of reducing the number of trees to search to 1. For the l0_and_lmax stage overwrites are done to get 6 SSTs in the L0. At this point there is data in the L0 and max level of the LSM tree -- thus the name l0_and_lmax.

The updated results and changes to my test scripts are here. The spreadsheet with data and charts is here.

The summary is that query performance suffers if you do all of the work to fully compact the LSM tree and then put 6 SSTs in the L0. By suffers, I mean it is almost as bad as when the LSM tree has data everywhere (memtable, L0, all levels after the L0). This is visible in the charts below -- compare l0_and_lmax with pre_flush and post_flush.

Charts!



Wednesday, February 24, 2021

Read-only benchmarks with an LSM are complicated

I often see benchmark results for an LSM tree where the shape of the tree is not described. When the results are worse than desired I wonder how much of that is inherent to the LSM tree versus the configuration. It might be easier to make an LSM tree look bad than to make a B-Tree look bad.

I have results from RocksDB for which QPS varies by up to 5X depending on the shape of the LSM tree. The workload is read-only and CPU-bound. The impact of the LSM tree shape on CPU overhead is known but not widely appreciated. Hopefully this post helps with that.

Read amplification for an LSM is usually worse than for a B-Tree. But read-amp should be described separately for IO vs CPU because (usually) CPU read-amp is the problem. The unit for IO read-amp is pages read from storage (real storage, not a read from the OS page cache). The unit for CPU read-amp is CPU overhead but some combination of cache lines read, keys compared and bloom filters checked per query might be a good proxy.

The extra CPU read-amp comes from the number of trees in the LSM tree that must be searched on queries. Perhaps LSM tree is a misnomer because it usually has more than one tree, but perhaps not enough trees for LSM forest to be a good name. Is it too late to rename LSM tree to LSM trees? The memtable is one tree, each sorted run in the L0 is another tree, and then each level beyond L0 is one more tree. For the configurations I use it is common for there to be from 10 to 20 trees in one RocksDB instance. While bloom filters help, they don't eliminate the overhead. Even with a bloom filter some searching is done (cost is a function of |values|) and then there is the overhead of the bloom filter check. For too much information on the CPU overhead of a RocksDB query, see this post from me.
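
The count I use is simple -- one tree for the memtable, one per sorted run in the L0 and one per level after the L0. A small sketch:

def trees_to_search(memtables: int, l0_runs: int, levels_after_l0: int) -> int:
    # each of these is one "tree" that a point or range query may have to check
    return memtables + l0_runs + levels_after_l0

print(trees_to_search(memtables=1, l0_runs=6, levels_after_l0=3))   # 10: one memtable, six L0 runs, three larger levels
print(trees_to_search(memtables=0, l0_runs=0, levels_after_l0=1))   # 1: everything compacted into one sorted run
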

I will begin using LSM tree shape to describe the state of the LSM tree where the state includes:

  • number of KV pairs in the active memtable
  • number of immutable memtables waiting to be flushed
  • number of sorted runs in the L0
  • number of levels beyond the L0, and their sizes
Just keep in mind that it is hard to interpret performance results for a CPU-bound and read-only workload without understanding the LSM tree shape. It is easy to make the LSM look better or worse -- on purpose or by accident.

Finally, LSM implementations can get better at this by adapting their LSM tree shape to become more read-friendly when a workload shifts to read-only or read-mostly phases.

Performance summary

The tests I do change the per-query CPU overhead. I directly measure and report throughput, but throughput, per-query response time and per-query CPU overhead are correlated.
  • Flushing the memtable has no impact on throughput
  • The impact from too many trees to search is worse for workloads that don't use bloom filters (like range queries, or any queries when bloom filters don't exist). The worst-case was ~5X for range and point without bloom vs ~2X for point with bloom.
  • There are incremental benefits from compacting the L0, compacting the L1 and doing full compaction. Each of these reduces the number of trees to search.
  • Bloom filters in an LSM aren't O(1). There is a search cost, O(logN), to find the per-SST filter
  • Feature request for LSM trees - adapt to be more read-friendly when a workload is read-heavy
  • This is more of an issue for read-only benchmarks because the LSM tree shape is usually stuck in the same state for the duration of the test. That state might be good or bad and this can add noise or misleading results to your tests. For a test with writes the LSM tree is likely to have data in all places (memtable, L0, L1, ...) most of the time which might not help read performance but will reduce read response time variance.
Charts!

A spreadsheet with absolute and relative throughput is here. All results are from the configuration with the per-level fanout set to 8.





My tests

I ran tests for three types of queries: short range queries, point queries with bloom filters and point queries without bloom filters. I repeated the queries for the five LSM tree shapes described below. By data is in the memtable I mean that the memtable is at least half full but there were no immutable memtables. By data is in L0 I mean there were six SSTs (sorted runs) in L0.

  • pre_flush - data is in the memtable, L0 and each level beyond L0
  • post_flush - memtable is empty while L0 and larger levels have data
  • compact_l0 - memtable and L0 are empty while L1 and larger levels have data
  • compact_l1 - memtable, L0 and L1 are empty while L2 and larger levels have data
  • compact_all - all data is in one sorted run on Lmax (the max level of the LSM tree)
The query tests are provided by db_bench. I used seekrandom for range queries with --seek_nexts=8 to fetch 8 values per query. I used readrandom for point queries with and without bloom filters. Elsewhere I call these tests range query, point with bloom and point without bloom.

Tests were repeated after loading 1M, 10M, 100M and 200M 128-byte values. All tests used one thread and were run on my small servers. The servers have 16G of RAM and RocksDB used an 8G block cache. The database fit in the block cache for all tests except for 200M values. Test output, command lines, and a diff to db_bench are here. It wasn't easy to get db_bench to cooperate. Some parts of the db_bench diff will be useful for future tests and I hope to improve it and submit a PR. The useful features are new benchmarks (things that go in --benchmarks=...):
  • flush - flushes the memtable
  • compact0 - compacts L0 into L1 (and blocks until done)
  • compact1 - compacts L1 into L2 (and blocks until done)
  • waitforcompaction - a hacky attempt to block until compaction is done. Can be used as --benchmarks=fillrandom,compactall,waitforcompaction,readrandom because compactall doesn't block until compaction is done.
Without the options above it was hard to get the LSM tree shape into a deterministic state. Once I had the new features it took a few experiments to find the number of overwrites (275,000) that makes both the memtable and L0 almost full while avoiding too many flushes, which would trigger an L0->L1 compaction.

One interesting problem was that immediately after fillrandom the number of sorted runs in the L0 was almost always zero regardless of the number of values written. The reason is that I run tests with fsync on commit disabled and writes were so fast that at the end of fillrandom there were always too many sorted runs in L0. Once fillrandom ended a compaction was scheduled and it merged all L0 sorted runs into L1. The workaround is to run db_bench twice: first to do fillrandom, flush the memtable and compact L0, and then again to run overwrite to put some SSTs in L0 before doing the query tests. This is in the r.sh script and shown below.

# Step 1
db_bench ... --benchmarks=fillrandom,flush,waitforcompaction,compact0

# Step 2
db_bench ... --use_existing_db=1 --benchmarks=overwrite,waitforcompaction,...

LSM tree shape

Below I describe the LSM tree shape using logical rather than physical names for the levels because an artifact of dynamic leveled compaction is that compaction stats might show L0 followed by L4 where L4 is logically L1.  

The results show something that I hope can be improved. When configuring RocksDB you can reason about the read, write and space amplification tradeoffs, but as the LSM tree grows the actual shape can end up far from what you assumed. The problem is visible in the 10M and 100M values tests when the per-level fanout is 10 (the default): they use the same number of levels even though the 100M test has 10X more values.

For all tests the target size for L1 is 64M and dynamic leveled compaction was enabled. But the interaction between dynamic leveled compaction and my configuration means that the LSM trees for the 10M and 100M values tests use the same number of levels with a dramatically different sizeof(L1) / sizeof(L0) ratio. That ratio is <= 2 for the 10M values test vs ~10 for the 100M values test. The LSM tree uses one too many levels in the 10M values test.

LSM tree shape by number of values inserted when per-level fanout is 10 (default):
  • 1M
    • memtable is ~80% full with 38,239 values
    • L0 has 6 sorted runs and 33M of data
    • L1 has 87M of data
  • 10M
    • memtable is ~80% full with 38,199 values
    • L0 has 6 sorted runs and 34M of data
    • L1 has 80M of data
    • L2 has 829M of data
  • 100M
    • memtable is ~80% full with 38,240 values
    • L0 has 6 sorted runs and 39M of data
    • L1 has 792M of data
    • L2 has 7.77G of data
  • 200M
    • memtable is ~80% full with 38,215 values
    • L0 has 6 sorted runs and 38M of data
    • L1 has 163M of data
    • L2 has 1.61G of data
    • L3 has 16.12G of data
I then repeated the test with the per-level fanout reduced to 8 via the max_bytes_for_level_multiplier option in db_bench. In this case the number of levels is what I expect.
  • 1M
    • memtable is ~80% full with 38,209 values
    • L0 has 6 sorted runs and 33M of data
    • L1 has 87M of data
  • 10M
    • memtable is ~80% full with 38,241 values
    • L0 has 6 sorted runs and 34M of data
    • L1 has 101M of data
    • L2 has 820M of data
  • 100M
    • memtable is ~80% full with 38,264 values
    • L0 has 6 sorted runs and 39M of data
    • L1 has 126M of data
    • L2 has 1.0G of data
    • L3 has 7.9G of data
  • 200M
    • memtable is ~80% full with 38,215 values
    • L0 has 6 sorted runs with 38M of data
    • L1 has 251M of data
    • L2 has 1.98G of data
    • L3 has 15.83G of data