Wednesday, July 26, 2023

Tuning MyRocks for the Insert Benchmark on a large server

I used the Insert Benchmark on a large server to see if I could improve the configuration (my.cnf) I have been using.

tl;dr

  • The good
    • Performance is much better when subcompactions and the hyper clock cache are enabled.
  • The not as good
    • Peak RSS is larger for configs that enable the hyper clock cache. The problem here is not as large as it is on pure open source builds that I test because the version of jemalloc used here is tuned differently. Understanding this is a work in progress.
    • The select min(transactionid) from TABLE query used at the start of benchmark steps can take up to 20 seconds when it should finish in a few milliseconds. I have yet to try to fix this via tuning.
Updates
  • Added values from the RocksDB perf context counters to explain the slow select min() query

Builds

I used MyRocks from FB MySQL 8.0.28 with source from June 2023 at git hash ef5b9b101. 

Benchmark

The insert benchmark was run in three configurations.

  • cached by RocksDB - all tables fit in the RocksDB block cache
  • cached by OS - all tables fit in the OS page cache but not the 4G RocksDB block cache
  • IO-bound - the database is larger than memory

The test HW has 80 cores with hyperthreads enabled, 256G of RAM and fast local-attached NVMe storage.

The benchmark is run with 24 clients and a client per table. The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows across all tables without secondary indexes where X is 20 for cached and 500 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 50 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail.
  • q100
    • do queries as fast as possible with 100 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.
  • q500
    • do queries as fast as possible with 500 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.
  • q1000
    • do queries as fast as possible with 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.

Configurations

The configuration (my.cnf) files are here and I use abbreviated names for them in this post. For each variant there are two files -- one with a 4G block cache, one with 180G cache.

  • c (4G, 180G) - base config
  • c1 (4G, 180G) - adds rocksdb_use_hyper_clock_cache=ON
  • c2 (4G, 180G) - adds rocksdb_block_cache_numshardbits=4
  • c3 (4G, 180G) - disables intra-L0 compaction via a hack
  • c4 (4G, 180G) - reduces level0_slowdown_writes_trigger from 20 to 8 and level0_stop_writes_trigger from 36 to 12
  • c5 (4G, 180G) - enables subcompactions via rocksdb_max_subcompactions=4
  • c6 (4G, 180G) - combines c1, c2, c5
  • c7 (4G, 180G) - combines c1, c5
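
Most of these configs differ in only a few server variables, so a quick sanity check is to query them and confirm which variant a running server picked up. This is a minimal sketch that covers only the block cache and subcompaction variables named above -- the c3 hack and the c4 level0 trigger changes are not visible this way:

select @@rocksdb_block_cache_size,
       @@rocksdb_use_hyper_clock_cache,
       @@rocksdb_block_cache_numshardbits,
       @@rocksdb_max_subcompactions;
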
Results

Performance reports are here for Cached by RocksDB, Cached by OS and IO-bound.

Reviewing results based on average throughput:
  • Cached by RocksDB
    • The c7 config (combines c1 and c5) is the best followed by c1 and c5 because subcompactions and hyper clock cache are great for perf. The c2 config is bad for l.i0 perhaps because there is more mutex contention with fewer block cache shards. The c3 config is bad for l.i0 and l.i1 because disabling intra-L0 compaction is bad for write throughput. The c4 config is also bad for l.i0 and l.i1 because it makes the L0 less write friendly and more read friendly.
  • Cached by OS
    • See the description above for Cached by RocksDB
  • IO-bound
    • See the description above for Cached by RocksDB
Reviewing results based on response time histograms:
  • Cached by RocksDB
    • What I wrote in the Cached by RocksDB section for average throughput applies here. For l.i1 the histograms are much better for the c5, c6 and c7 configs. For q1000 the query response time histograms are much better for the c6 and c7 configs.
  • Cached by OS
    • What I wrote in the Cached by RocksDB section for average throughput applies here.  For l.i1 the histograms are much better for the c5, c6 and c7 configs while the base config has a 25-second write stall. For q1000 the query response time histograms are much better for the c6 and c7 configs.
  • IO-bound
    • What I wrote in the Cached by RocksDB section for average throughput applies here. For l.i1 the histograms are much better for the c5, c6 and c7 configs. For q1000 the query response time histograms are much better for the c6 and c7 configs.
Reviewing results based on variance over per-second intervals:
  • Cached by RocksDB
    • For l.i1 see the max insert response time charts for the base, c3, c4 and c5 configs. The c5 config greatly reduces the variance.
  • Cached by OS
    • For l.i0 see the IPS (insert rate) charts for the base, c3, c4 and c5 configs. The c3 and c4 configs have much more variance.
    • For l.i1 see the IPS (insert rate) charts for the base, c3, c4 and c5 configs. Results are much better with c5.
  • IO-bound
    • For l.i0 compare IPS (insert rate) and max response time charts for c3, c4 and c5. There is much less variance with c5.
    • For l.i1 compare the IPS (insert rate) and max response time charts for c3, c4 and c5. There is much less variance with c5.
Slow get_min queries

Some of the benchmark steps run a simple query at startup to determine the min value of the transactionid column. This is the PK column and the query should be fast, but when MVCC GC falls behind it can be slow. I provide more detail on the problem here. The tables below show the response time in seconds for that query at the start of the q100, q500 and q1000 benchmark steps.

-- Cached by RocksDB
config  q100    q500    q1000
c        9.744   9.029   8.521
c1      11.762   9.487   8.761
c2       8.938   8.635   7.625
c3      16.680  15.582  16.392
c4      15.819  14.861  16.046
c5      15.999  15.251  17.754
c6      16.348  14.795  18.964
c7      15.046  13.760  18.254

-- Cached by OS
config  q100    q500    q1000
c       15.080  15.462  15.896
c1      14.064  13.622  11.954
c2      10.148   9.641   7.242
c3      16.876  18.810  18.159
c4      16.252  18.571  18.138
c5      16.290  18.533  18.411
c6      16.530  17.341  18.236
c7      17.148  19.591  18.775

-- IO-bound
config  q100    q500    q1000
c        9.310   8.747  11.018
c1       7.853   8.661  10.508
c2       9.156   9.188  11.096
c3      17.336  20.003  20.281
c4      19.281  19.794  21.849
c5      17.313  23.940  22.283
c6      18.848  19.926  22.483
c7      17.170  17.790  21.088

Peak RSS

This shows peak RSS for mysqld during the l.x (create index) benchmark step. It is ~5GB larger for the configs that enable the hyper clock cache. The difference is larger on the open-source builds that I do. I think jemalloc here is tuned better than on the open-source builds.

        peak
config  RSS (GB)
c       185.3
c1      191.3
c2      185.9
c3      183.3
c4      186.5
c5      188.0
c6      191.3
c7      191.3

Update 1: perf context counters

RocksDB has perf context counters that let you understand the work RocksDB does for a specific operation. AFAIK these can be accessed per session, but in my case I queried them via a global table, assuming there wasn't much other activity on the server and ignoring compaction (which is risky to ignore).

The counters are in the information_schema.rocksdb_perf_context_global table. I selected the values of all counters from that table before and after running the query and then computed the diff between the two sets of measurements. The counters are only populated if you set rocksdb_perf_context_level to something >= 2 in my.cnf.
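
This is roughly the procedure (a minimal sketch, assuming the benchmark table is named pi1 and that the information_schema table exposes STAT_TYPE and VALUE columns):

-- requires rocksdb_perf_context_level >= 2
create temporary table pc_before as
  select stat_type, value from information_schema.rocksdb_perf_context_global;

select min(transactionid) from pi1;   -- the query being measured

create temporary table pc_after as
  select stat_type, value from information_schema.rocksdb_perf_context_global;

-- the diff approximates the work done for the query, assuming the server is otherwise idle
select a.stat_type, a.value - b.value as diff
from pc_after a join pc_before b using (stat_type)
where a.value > b.value
order by diff;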

My notes and more numbers from this experiment are here.

This is what the counters look like when the query is fast (0.001s) at the start of benchmark step l.i1.

4               SEEK_CHILD_SEEK_COUNT
5               BLOCK_CACHE_HIT_COUNT
12              IO_THREAD_POOL_ID
46              ITER_READ_BYTES
47              USER_KEY_COMPARISON_COUNT

And then when the query is slow (12.715s) at the start of benchmark step q100. The problem is the large values for the *_SKIPPED_COUNT counters. The large value for USER_KEY_COMPARISON_COUNT might just be from background compaction.

65466           BLOCK_READ_COUNT
225827          BLOCK_CACHE_HIT_COUNT
36521600        INTERNAL_DELETE_SKIPPED_COUNT
36521600        INTERNAL_KEY_SKIPPED_COUNT
256113421       USER_KEY_COMPARISON_COUNT
511093703       BLOCK_READ_BYTE
511093703       IO_BYTES_READ
































Tuning MyRocks for the Insert Benchmark on a small server

I used the Insert Benchmark on a small server to see if I could improve the configuration (my.cnf) I have been using.

tl;dr

  • With jemalloc the peak RSS for mysqld is larger with rocksdb_use_hyper_clock_cache=ON so I reduce the value of rocksdb_block_cache_size from 8G to 6G for some configurations. This isn't fully explained but experts are working on it.
  • The base config (a0) is good enough and the other configs don't provide a significant improvement. This isn't a big surprise: while the hyper clock cache and subcompactions are a big deal on larger servers, the server in this case is small and the workload has low concurrency.
  • In some cases the a3 config that disables intra-L0 compaction hurts write throughput. This result is similar to what I measured on a larger server.

Builds

I used MyRocks from FB MySQL 8.0.28 using the rel_native_lto build with source from June 2023 at git hash ef5b9b101. 

Benchmark

The insert benchmark was run in three configurations.

  • cached by RocksDB - all tables fit in the RocksDB block cache
  • cached by OS - all tables fit in the OS page cache but not the 1G RocksDB block cache
  • IO-bound - the database is larger than memory

This benchmark used the Beelink server explained here that has 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04. 

The benchmark is run with 1 client. The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows across all tables without secondary indexes where X is 20 for cached and 800 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 100 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail.
  • q100
    • do queries as fast as possible with 100 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.
  • q500
    • do queries as fast as possible with 500 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.
  • q1000
    • do queries as fast as possible with 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.

Configurations

The configuration (my.cnf) files are here and I use abbreviated names for them in this post. For each variant there are two files -- one with a 1G block cache, one with a larger block cache. The larger block cache size is 8G when LRU is used and 6G when hyper clock cache is used (see tl;dr).

  • a0 (1G, 8G) - base config
  • a1 (1G, 6G) - adds rocksdb_use_hyper_clock_cache=ON
  • a2 (1G, 8G) - adds rocksdb_block_cache_numshardbits=3
  • a3 (1G, 8G) - disables intra-L0 compaction via a hack
  • a4 (1G, 8G) - reduces level0_slowdown_writes_trigger from 20 to 8 and level0_stop_writes_trigger from 36 to 12
  • a5 (1G, 8G) - enables subcompactions via rocksdb_max_subcompactions=2
  • a6 (1G, 6G) - combines a1, a2, a5
  • a7 (1G, 6G) - combines a1, a5

Results

Performance reports are here for Cached by RocksDB, Cached by OS and IO-bound.

The conclusion is that the base config (a0) is good enough and the other configs don't provide a significant improvement. This isn't a big surprise: while the hyper clock cache (a1) and subcompactions (a5) are a big deal on larger servers, the server in this case is small and the workload has low concurrency. The a3 config is bad for performance on the IO-bound workload -- intra-L0 compaction is useful.

When evaluating this based on average throughput (see summaries for Cached by RocksDB, Cached by OS and IO-bound) the base config (a0) is good enough and the other configs don't provide significant improvements, although for IO-bound the a3 config is bad for the l.i1 benchmark step because it increases write stalls.

All configs have similar response time distributions for Cached by RocksDB, Cached by OS and IO-bound with one exception. For IO-bound the a3 config does worse on the l.i1 benchmark step.

The charts showing various metrics at 1-second intervals look similar with one exception. Links are in the performance summaries; grep for "per 1-second interval" in Cached by RocksDB, Cached by OS and IO-bound. The exception is on IO-bound with the a3 config -- see the IPS charts for the l.i1 benchmark step with the a0 config and a3 config where the a3 config has much more variance.









Tuesday, July 25, 2023

How do you limit MVCC GC debt?

Too much MVCC GC debt usually means bad & weird performance along with too much space-amplification. And too much space-amp means the database will be much larger than needed. How do you limit MVCC GC debt for MyRocks, InnoDB and Postgres? This can be an issue for write-heavy workloads especially when the writes aren't slowed by doing reads from storage (because the working-set is cached) or by fsync on commit (because fsync is fast or disabled).

This problem appeared recently for both InnoDB and Postgres while doing write-heavy tests (Insert Benchmark) with a database that could be cached (assuming there isn't too much MVCC GC debt).

Limits are set via:

  • MyRocks
  • InnoDB
    • tune purge. Note that innodb_max_purge_lag is 0 by default, meaning there is no limit and your workload is free to create too much MVCC GC debt. This is easy to notice with file-per-table enabled when purge falls behind -- table *.ibd files will be much larger than expected if purge isn't removing delete-marked rows as fast as they are being created. The InnoDB history list length is a proxy for the amount of MVCC GC debt (see the sketch after this list for one way to check it).
  • Postgres
    • I don't know if this is possible. Certainly write rates will slow down a lot when vacuum falls too far behind. But that is a bit of a disaster and I am not sure if it is possible to set configuration options that will make writes slower (as done by InnoDB and MyRocks) prior to the disaster. Alas, my MyRocks and InnoDB skills exceed my Postgres skills so I won't rule out that I don't know the answer or forgot it. I also don't know if there are any counters or queries that can be done to estimate MVCC GC debt.
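
For InnoDB, a minimal sketch of the check and the throttle mentioned above. I believe trx_rseg_history_len is the relevant innodb_metrics counter and that it is enabled by default; the purge lag values are examples, not recommendations:

-- history list length, a proxy for MVCC GC debt
select name, count
from information_schema.innodb_metrics
where name = 'trx_rseg_history_len';

-- make writers wait when purge falls too far behind
set global innodb_max_purge_lag = 50000;
set global innodb_max_purge_lag_delay = 1000000;

The same history list length value also appears in the output of SHOW ENGINE INNODB STATUS.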

Monday, July 24, 2023

Insert benchmark perf over time for MyRocks 5.6 on a large server

I used the Insert Benchmark to compare performance for MyRocks builds from March 2022 through today to understand how performance changes over time. I was unable to go back further in time because of conflicts between old code and a new compiler toolchain.

tl;dr

  • A build from March 2022 gets ~15% more inserts/s and ~5% more queries/s when compared to recent builds

Builds

All builds use MyRocks 5.6.35 but from different points in time. I used the same compiler toolchain and gcc for all builds. The builds are:
  • fbmy5635_202203072101 - from 7 March 2022 at git hash 84ce624a with RocksDB 6.28.2
  • fbmy5635_202304122154 - from 12 April 2023 at git hash f2161d019 with RocksDB 7.10.2
  • fbmy5635_202305292102 - from 29 May 2023 at git hash 509203f4 with RocksDB 8.2.1
  • fbmy5635_jun23_7e40af67 - from 23 June 2023 at git hash 7e40af67 with RocksDB 8.2.1
Benchmark

The insert benchmark was run in two configurations.

  • cached by RocksDB - RocksDB block cache caches all tables
  • IO-bound - the database is larger than memory

The test HW has 80 cores with hyperthreads enabled, 256G of RAM and fast local-attached NVMe storage.

The benchmark is run with 24 clients and a client per table. The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows across all tables without secondary indexes where X is 20 for cached and 500 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 50 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail.
  • q100
    • do queries as fast as possible with 100 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.
  • q500
    • do queries as fast as possible with 500 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.
  • q1000
    • do queries as fast as possible with 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.

Configurations

The benchmark used the cy9c5_u configuration for MyRocks. Much more detail on the benefit of the c5 configuration is here -- it adds rocksdb_max_subcompactions which makes L0 -> L1 compactions go faster and reduces (or removes) write stalls.

Results

Performance reports are here for Cached by RocksDB and for IO-bound. From the summaries for cached and IO-bound the March 2022 build gets more throughput than the June 2023 build:

  • ~15% more inserts/second on the write-heavy l.i1 benchmark step
  • ~5% more queries/second on the read+write benchmark steps (q100, q500, q1000)
The root cause appears to be more CPU overhead in the June 2023 build.
  • For l.i1 and Cached by RocksDB the cpupq column has the CPU cost per insert (see here) and it is 185 for the March 2022 build vs 209 for the June 2023 build (a 13% difference). The difference for IO-bound is 9% (see here). Note that this metric includes all CPU, including that from background threads, not just CPU from foreground inserts so it can be misleading but I trust it to explain the differences here until I do more perf debugging.
  • For q100 and Cached by RocksDB the cpupq column is 5.8% larger for the June 2023 build relative to the March 2022 build (see here, 252 vs 238). And for IO-bound the cpupq is 4.1% larger for the June 2023 build relative to the March 2022 build (see here, 302 vs 290).
Also, the response time histograms are slightly better for the March 2022 build (see here for Cached by RocksDB and for IO-bound).

Finally, graphs for the insert & query rates at 1-second intervals:
  • Cached by RocksDB: l.i0, l.i1, q100, q500, q1000
    • For q1000 the March 2022 build had the best chart WRT max insert response time and that is also reflected in the IPS (inserts/s) graphs where it has the least noise
  • IO-bound: l.i0, l.i1, q100, q500, q1000
    • For l.i1 the max insert response time charts shift up for builds after March 2022
    • For q100 the max query response time charts have two horizontal lines -- one near 0 and the other near 10,000 usecs for the March 2022 build but near 20,000 usecs for the more recent builds.
    • For q1000 the max insert response time charts have one thick line for the March 2022 build but two less-thick lines for the more recent builds. The max query response time charts are similar to q100.







Tuesday, July 18, 2023

MyRocks, InnoDB and Postgres as a queue

As part of my changes to the Insert Benchmark to support longer-running benchmark steps without growing the size of the benchmark tables, I added a feature to optionally do deletes at the same rate as inserts. With this the table is like a queue -- inserts at one end, deletes from the other end.

The benchmark steps that use deletes must run a query when they start to determine the minimum value of the PK column (transactionid) currently in the table and in some cases that query is very slow for InnoDB, somewhat slow for MyRocks and not really slow for Postgres. The query is: select min(transactionid) from table.

tl;dr

  • Postgres does great, MyRocks and InnoDB have problems
  • I can tune away the problem for InnoDB via the innodb_max_purge_lag my.cnf options but that comes at a big cost -- less write throughput
  • I don't know yet whether I can tune the problem away for MyRocks
  • One response to this post is that doing a queue in a SQL DBMS is an anti-pattern. One response to that response is that too frequent usage of "X is an anti-pattern" leads to fixable problems that don't get fixed.

Workload

This is the table for MySQL. For Postgres I use bigserial rather than auto_increment for transactionid.

CREATE TABLE pi1 (
  transactionid  bigint NOT NULL AUTO_INCREMENT,
  dateandtime    datetime DEFAULT NULL,
  cashregisterid int NOT NULL,
  customerid     int NOT NULL,
  productid      int NOT NULL,
  price          float NOT NULL,
  data           varchar(4000) DEFAULT NULL,
  PRIMARY KEY (transactionid),
  KEY pi1_marketsegment (price, customerid),
  KEY pi1_registersegment (cashregisterid, price, customerid),
  KEY pi1_pdc (price, dateandtime, customerid)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

The benchmark has multiple steps run in this order:

  • l.i0
    • insert X million rows across all tables without secondary indexes
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another X million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start. The inserts are done to the table head and the deletes are done from the tail.
  • q100
    • do queries as fast as possible with 100 inserts/s/client and the same rate for deletes/s done in the background.
  • q500
    • do queries as fast as possible with 500 inserts/s/client and the same rate for deletes/s done in the background.
  • q1000
    • do queries as fast as possible with 1000 inserts/s/client and the same rate for deletes/s done in the background.

At the start of the l.i1, q100, q500 and q1000 steps the select min(transactionid) from table query is done (see here) to determine the smallest transactionid in the table and then deletes are done via statements like delete from table where transactionid >= X and transactionid <= Y (see here). The deletes are done at the same rate as inserts and the thread handling the deletes remembers where it should do the next delete.
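
A minimal sketch of the pattern, using the pi1 table from the schema above (X and Y are placeholders that the delete thread tracks, and the insert values are made up):

-- run once at the start of the benchmark step to find the tail of the queue
select min(transactionid) from pi1;

-- each delete batch removes a range of rows from the tail; X starts at the min
-- found above and the thread remembers Y+1 as the X for the next batch
delete from pi1 where transactionid >= X and transactionid <= Y;

-- inserts go to the head of the queue via auto_increment on transactionid
insert into pi1 (dateandtime, cashregisterid, customerid, productid, price, data)
values (now(), 1, 2, 3, 4.5, 'x');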

The problem

The problem is that this query can be slow when MVCC GC gets behind; otherwise it should take a few milliseconds. And in my testing so far it is slow for both InnoDB and MyRocks.

These are from the c2-standard-30 servers I have been using for the Insert Benchmark with 8 clients. Full benchmark reports are here for MyRocks, for InnoDB and for Postgres. Results from a larger server are pending and from early results they will be a lot worse.

There are tables that show the query response time in seconds by benchmark step (l.i1, q100, q500, q1000) for each of the DBMS + configurations tested. The worst case is likely to occur at the start of the q100 benchmark step because that immediately follows the l.i1 benchmark step and l.i1 is the most write-heavy.

A table summarizing the response times for the select min(transactionid) query at the start of q100 is here. For InnoDB I ignore the results from the a13 config because write throughput is much worse with it as shown in the perf reports for Cached by InnoDB, Cached by OS and IO-bound. For Postgres I use the results from the a22 config.

Typical response times in seconds

        cached  cached
        byDB    byOS    IO-bound
InnoDB  4       15      10
MyRocks 4        4       7
PG      0.008    0.076   0.197

Summaries based on the q100 result. For Postgres I focus on the v2 results because they include the a21 and a22 configs.
  • InnoDB
    • Cached by InnoDB
      • For MySQL 8.0.33 the a13 config has the best response time (1 millisecond). It uses smaller values for the innodb_max_purge_lag options. The response time is ~4 seconds for the other configs except for the a1 config where it is ~100 seconds. The a1 config disables native AIO.
      • For MySQL 5.7.40 the best response times are between 20 and 30 seconds.
      • For MySQL 5.6.51 the best response times are ~30 seconds.
    • Cached by OS
      • For MySQL 8.0.33 the a13 config again has the best response time (545ms). Most of the configs have a response time of ~15 seconds. The a18 and a19 configs are bad outliers.
      • For MySQL 5.6.51 and 5.7.40 the best response time is ~25 seconds.
    • IO-bound
      • For MySQL 8.0.33 the a13 config again has the best response time (171ms). The response time for many configs is ~10 seconds. The a18 and a19 configs are bad outliers. The results for ps8033 are from Percona Server and they are similar to upstream.
      • For MySQL 5.6.51 and 5.7.40 the best response time is ~15 seconds.
  • MyRocks
    • Cached by MyRocks
      • Response time is ~4 seconds for the c6 and c7 configs. I focus on these configs because they provided the best results for the benchmark.
    • Cached by OS
    • IO-bound
  • Postgres
    • Cached by Postgres
      • Response time is between 4ms and 8ms for the a21 and a22 configs.
    • Cached by OS
      • Response time is between 32ms and 76ms for the a21 and a22 configs.
    • IO-bound



















Monday, July 17, 2023

Tuning InnoDB for the insert benchmark on a medium server, part 2

This continues work I shared in June for InnoDB with the Insert Benchmark on a medium server. In this post I have results for more my.cnf variations with MySQL 8.0.33.

tl;dr

  • I am certain I don't understand how innodb_io_capacity and innodb_io_capacity_max are supposed to work in MySQL 8. I suspect I am not alone.
  • MySQL 8.0.33 is almost always faster than 5.6.51 on a medium server with medium concurrency. This result is different from what I get on my small servers where the benchmarks are run with low concurrency.
  • Does GCP provide any way of detecting a slowly failing SSD? I'd rather not scrape dmesg output to figure this out. I have several benchmark runs that are in doubt because of a failing SSD (those that use the a17, a18 and a19 configs).
  • I am still chasing a workaround for stalls from the page cleaner falling behind, this also affects queries because they might get stalled by doing single-page flushes to make pages clean for reuse. One workaround is Percona Server.
  • Despite testing many variations of my.cnf (19 for MySQL 8.0.33) I have not found significant improvements to the config I have been using for years.
Updates

How should I set innodb_io_capacity? It would be great if that variable really did specify how many IOPs InnoDB should try to use. But I doubt that is true given furious flushing and innodb_flush_sync.

Benchmark

The insert benchmark was run in three configurations.

  • cached by InnoDB - InnoDB buffer pool caches all tables
  • cached by OS - the 4G InnoDB buffer pool does not cache all tables but the OS page cache does
  • IO-bound - the database is larger than memory

I used the rel build for MySQL 5.6.51 and the rel_lto builds for MySQL 5.7.40 and 8.0.33. The builds are explained here. In all cases I build from source using upstream MySQL.

The test HW is a c2-standard-30 server from GCP with 15 cores, hyperthreads disabled, 120G RAM and 1.5TB of XFS via SW RAID 0 striped over four local NVMe devices. The OS is Ubuntu 22.04.

The benchmark is run with 8 clients and a client per table. The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows across all tables without secondary indexes where X is 100 for cached and 500 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another X million rows per table with secondary index maintenance where X is 100M for cached and 10M for IO-bound. The number of rows/table at the end of the benchmark step matches the number at the start. The inserts are done to the table head and the deletes are done from the tail.
  • q100
    • do queries as fast as possible with 100 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.
  • q500
    • do queries as fast as possible with 500 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.
  • q1000
    • do queries as fast as possible with 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.

Configurations

The benchmark was run first using the base configurations. Then it was repeated for other configurations that changed or added a small number of options. I use a1 through a19 to name them here but the full name for a configuration file has the pattern my.cnf.cy10X_gcp_c2s30 or my.cnf.cy10X_4g_gcp_c2s30 where X has the value a1 ... a19. All of the config files are here.

For cached by InnoDB and IO-bound the configurations have an 80G InnoDB buffer pool and don't use buffered IO (innodb_flush_method = O_DIRECT_NO_FSYNC), with one exception -- the a4 config uses innodb_flush_method = fsync.

For cached by OS the configurations have a 4G InnoDB buffer pool and use buffered IO (innodb_flush_method = fsync). 

The base configurations are here:

The configurations tested are:

  • a1 - adds innodb_use_native_aio=off
  • a2 - increases innodb_io_capacity[_max] to 20k/40k
  • a3 - adds innodb_flush_sync=OFF
  • a4 - adds innodb_flush_method=fsync
  • a5 - adds innodb_change_buffering=none
  • a6 - adds innodb_max_dirty_pages_pct_lwm=40, innodb_max_dirty_pages_pct=50
  • a7 - adds innodb_lru_scan_depth=2048
  • a8 - adds innodb_lru_scan_depth=4096
  • a9 - adds innodb_lru_scan_depth=8192
  • a10 - adds innodb_purge_threads=8, innodb_page_cleaners=8
  • a11 - reduces innodb_redo_log_capacity to 16G
  • a12 - reduces innodb_redo_log_capacity to 32G
  • a13 - adds innodb_max_purge_lag=50000, innodb_max_purge_lag_delay=1000000, innodb_purge_threads=8
  • a14 - adds innodb_change_buffering=inserts
  • a15 - changes innodb_idle_pct to default
  • a16 - reduces innodb_io_capacity to 2000 and innodb_io_capacity_max to 4000
  • a17 - adds innodb_doublewrite=OFF (not safe for production)
  • a18 - adds innodb_buffer_pool_instances=4, innodb_page_cleaners=4
  • a19 - adds innodb_max_dirty_pages_pct_lwm=20, innodb_max_dirty_pages_pct=50
All of the configurations were tested for MySQL 8.0.33. Only the first three or four (a1 to a4) were tested for 5.6.51 and 5.7.40.

Performance

For each of the workloads below I analyze the performance in four parts:
  • average throughput - I look at the average queries and inserts per second to see if these are much better or worse than the base config
  • response time - I look at max response time and the response time distributions to see if a config makes them much better or much worse than the base config
  • graphs - there are graphs for the per-second query, insert and delete rates. I look at the graphs to see if the rates degrade over time or have too many stalls.
  • SLA - there is a target rate for the background inserts and deletes which is one of 100, 500 or 1000 per second per client. There is an SLA failure when the actual rate is too much smaller than the target.
Reports are here:
Results: cached by InnoDB

Summary
  • average throughput
    • For 5.6.51 and 5.7.40 the base config is best
    • For 8.0.33 relative to the base config only the a8 config has better or similar results for all benchmark steps and the difference was small (<= 1% better on two, the same on three) 
    • From 5.6 vs 5.7 vs 8.0, MySQL 8.0 gets between 1.40X and 1.58X more throughput than MySQL 5.6 on the write-heavy benchmark steps (l.i0, l.x, l.i1). But on the read-heavy benchmark steps (q100, q500, q1000) it is slower and gets between 0.73X and 0.81X the QPS vs MySQL 5.6.
  • response time
    • For 5.6.51 and 5.7.40 there is nothing odd in the response time histograms
    • For 8.0.33 there are bad write stalls with a17 which is reasonable because that config has a much larger average insert rate. On the l.i1 benchmark steps there are bad write stalls for a11 and a19. I assume the problem with a11 is more frequent checkpoint and with a19 it might have been a failing SSD.
  • graphs
    • For 5.6 vs 5.7 vs 8.0 the results for l.i0 and for l.i1 are good (not much variance). For q1000 there is an odd step in the QPS chart for 5.7.40
  • SLA
    • For 5.7.40 the a1 config fails the insert-rate SLA for q100 and q500 and almost for q1000.
    • For 8.0.33 the a1 config fails the insert-rate SLA for q100.
Results: cached by OS

Summary
  • average throughput
    • For 5.6.51 the base config is best
    • For 5.7.40 the a1 config is best. It uses synchronous IO rather than AIO. The read IO done here is from the OS page cache.
    • For 8.0.33 the a1 config is best. I ignore a17 because that isn't safe for prod. The impact from a17 is huge in part because it reduces stalls when the free list is empty (page flushing has less work to do).
    • From 5.6 vs 5.7 vs 8.0, MySQL 8.0 is almost strictly faster than 5.6. It is slightly slower on l.i1, similar on q1000 and gets up to 1.6X more throughput on the other benchmark steps
  • response time
    • For 5.6.51 the a2 config has the largest write stalls
    • For 5.7.40 the base, a1 and a2 configs have the largest write stalls and a3 is better
    • For 8.0.33 the a16, a17 and a18 configs have the largest write stalls but that might have been caused by a failing SSD
  • graphs
    • For 5.6 vs 5.7 vs 8.0 the results for l.i0 aren't that interesting. For l.i1 the IPS charts are similar but it is hard to spot differences on the max insert response time charts. For q1000 the IPS charts are visually appealing but have too much variance and the QPS graphs also have too much variance (from single-page flush stalls).
  • SLA
    • For 8.0.33 the a5, a13, a14, a17 and a18 configs have some insert-rate SLA failures. The failures for a17 and a18 might be caused by a failing SSD.
Results: IO-bound

Summary
  • average throughput
    • For 5.6.51 and 5.7.40 the base config is best
    • For 8.0.33 the base config is good but several configs have similar results including a2, a3, a7, a8, a15 and a16. I am certain I don't understand how innodb_io_capacity and innodb_io_capacity_max are supposed to work. I suspect I am not alone. It is odd that both increasing and decreasing innodb_io_capacity are good for perf -- a2 increases it, a16 decreases it.
    • From 5.6 vs 5.7 vs 8.0, MySQL 8.0 is strictly faster than 5.6 and gets between 1.03X and 1.67X more throughput
  • response time
    • For 5.6.51 all of the configs have multi-second write stalls on write-heavy benchmark steps
    • For 5.7.40 all configs were similar WRT write stalls
    • For 8.0.33 write-stalls were bad for a17, a18 and a19 but the problem might have been a failing SSD
  • graphs
    • For 5.6 vs 5.7 vs 8.0 the results for l.i0 have two almost horizontal lines for the IPS and max response time charts. Results for l.i1 have a curve that slowly increases and I will repeat the benchmark with a longer run time to see if it eventually becomes stable. One possible reason for this is that there are many page splits at the start of the benchmark step, which followed the index creation step (l.x). For q1000 the QPS charts show a slow improvement, similar to the IPS charts from l.i1. I will increase the runtime for this benchmark step as well to see if the curve eventually settles into a horizontal line.
  • SLA
    • For 5.6.51 the a4 config fails the insert-rate SLA for q100, q500 and q1000.
    • For 5.7.40 the a4 config fails the insert-rate SLA for q100 and almost for q500 and q1000.
    • For 8.0.33 the a5, a13, a17, a18 and a19 configs have some failures for the insert-rate SLA. The failures for a17, a18 and a19 might be from a failing SSD.



















my.cnf options that influence create index performance

I measured the time to create indexes for InnoDB and MyRocks using data and schemas from the insert benchmark.

tl;dr

  • For InnoDB changing the my.cnf options was more significant when the secondary indexes were created via one alter table statement
  • Increasing innodb_ddl_threads helps, up to a point, when all indexes are created at once
  • Increasing innodb_ddl_buffer_size helps, up to a point, when all indexes are created at once
  • Increasing rocksdb_merge_combine_read_size didn't help

Benchmarks

All of my scripts and output files are here and the spreadsheet with all results is here

The table was created with a PK index, then loaded with N rows (N=1B, 2B, 4B, 8B) and then I measured the time to create three secondary indexes. The secondary indexes were created using two methods:

  • one at a time - create each index separately (three alter table statements)
  • three at once - create all indexes at once (one alter table statement)
Example my.cnf files are here for InnoDB and for MyRocks. I used upstream 8.0.32 for InnoDB and FB MySQL 8.0.33 for MyRocks. The goal was to understand the impact of innodb_ddl_threads, innodb_ddl_buffer_size and rocksdb_merge_combine_read_size.

The server has 80 cores with hyperthreads enabled, 256G of RAM and fast storage.

To understand the impact of innodb_ddl_threads I measured the time to create the secondary indexes with it set to 1, 2, 4, 8, 12, 16 and 32 while innodb_ddl_buffer_size was fixed at 2G. Assuming innodb_ddl_buffer_size is a per-thread memory limit, a larger value of innodb_ddl_threads means more memory was used. Regardless, when indexes were created one at a time the value of innodb_ddl_threads had little impact. But when all indexes were created via one alter table statement there was an improvement up to innodb_ddl_threads=8.

To understand the impact of innodb_ddl_buffer_size I measured the time to create secondary indexes with it set to 32M, 64M, 128M, 256M, 512M, 1G and 2G while innodb_ddl_threads was fixed at 4. When indexes were created one at a time the value of innodb_ddl_buffer_size has a small impact. But when all indexes were created via one alter table statement the impact was larger. Data is missing for 64m at 8B because the create index statement failed.
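
As a concrete sketch of the three at once method (the table and index definitions come from the insert benchmark schema shown in the July 18 post above; the variable values are examples, not recommendations):

set session innodb_ddl_threads = 8;
set session innodb_ddl_buffer_size = 2147483648;  -- 2G

alter table pi1
  add key pi1_marketsegment (price, customerid),
  add key pi1_registersegment (cashregisterid, price, customerid),
  add key pi1_pdc (price, dateandtime, customerid);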

The value of rocksdb_merge_combine_read_size has little impact on the time to create indexes. The default is 1G and I frequently reduce it to reduce peak RSS for mysqld (and avoid OOM).



Sunday, July 16, 2023

Keeping up with the SQL DBMS market

Sometimes it is easier to talk about tech when you can name things. There is much innovation in progress in the SQL DBMS market. The same is true for NoSQL, but I focus on SQL.

My current attempt to classify SQL DBMS products is:

  • TradSQL
    • traditional SQL DBMS solutions that arrived long before cloud native was a thing. They can be used as the nodes in ShardSQL. They don't provide ACID across shards, although brave people use XA to get atomic writes across shards. Examples are Oracle, Postgres and MySQL.
  • ShardSQL
    • Run many TradSQL DBMS in one cluster, provide some way to figure out where data exists and you have ShardSQL. This might involve but doesn't require a proxy or middleware. Examples include roll your own, Vitess and CitusDB. These have been popular with web-scale companies and I supported roll your own deployments at Google and Facebook. These provide limited support for cross-shard semantics -- perhaps XA for atomic writes, and it will be interesting to see what happens with HLC in MySQL, but otherwise there isn't support for consistent cross-shard reads. Even Oracle has a sharding product, but I don't know much about it.
  • NewSQL (DisaggSQL)
    • The NewSQL name might have been claimed by others and systems with that name didn't end well. I hope to reclaim that name. If NewSQL doesn't work out then the other name is DisaggSQL. By NewSQL I mean a SQL DBMS that is unsharded and cloud-native. The goal is to provide better characteristics, such as throughput, performance and HA, while also supporting much larger databases than are typically supported by TradSQL courtesy of cloud-native storage. Examples include Aurora from AWS, AlloyDB from Google and Neon. A NewSQL DBMS offloads many things that are usually not offloaded by TradSQL. One benefit from offloading is to make more compute and memory available for query processing. Update - I think that DisaggSQL is a better name.
  • DistSQL
    • These provide ACID across shards. If you want to do ACID with a PB-scale database then DistSQL is the answer. There is a cost to DistSQL in more latency and over time we will get a better understanding of that cost. Regardless, this is a big step forward for academia and industry. If you like fancy algorithms, then you will love DistSQL. Clustrix is an early example, for me, but Spanner made the world aware. And now we have TiDB, Yugabyte, CockroachDB, YDB and more. While MongoDB isn't a SQL DBMS (yes, it has some support for SQL) it is definitely a great example of ACID across shards.

Friday, July 14, 2023

Insert+delete benchmark, medium server and MyRocks, part 2

This has more results for MyRocks vs the insert benchmark on a medium server. It expands on work from my previous post by finding a few changes to the config file (my.cnf) that improves performance.

tl;dr

  • Variance during the l.i1 benchmark step is much better for MyRocks than for Postgres or InnoDB. The challenge for a b-tree is the read-modify-write cycle during secondary index maintenance; MyRocks does that maintenance via blind writes (RocksDB Put operations) and is read-free.
  • Query rates at 1-second intervals have an interesting sawtooth pattern. I assume this is the CPU overhead from searching more data as the write buffer and/or L0 fill up and then empty at regular intervals.
  • Things that help performance
    • Enabling the hyper clock cache improves performance but has a cost. With it enabled the peak RSS of mysqld is larger when using jemalloc. I have enlisted help from experts to figure that out.
    • Enabling subcompactions reduces the time for L0->L1 compactions which means there will be fewer write stalls.
  • Things that hurt performance
    • Disabling intra-L0 compaction hurts throughput. Hopefully someone on the RocksDB team is amused by this because I have been skeptical of the benefit from it after encountering a few problems with it. But I was wrong about the benefit from it.
    • Reducing level0_slowdown_writes_trigger and level0_stop_writes_trigger to keep the L0 from getting too large hurts throughput. I tried this to reduce the amount of extra compaction debt that can arrive in L0.
    • I have one more test in progress for a config that disables intra-L0 and reduces the level0 triggers (see previous two bullet points).

Updates

I tried one more config, c8, that combines c3 and c4 (disables intra-L0, reduces the level0 slowdown and stop triggers). Performance was similar to c3 -- which wasn't good.

Hyper clock cache, jemalloc and RSS

The VSZ and RSS for mysqld are larger when the hyper clock cache is enabled and this caused OOM in a few tests until I realized that and reduced the block cache size for any config that enables the hyper clock cache. This occurs with jemalloc, and you really should be using jemalloc or tcmalloc with RocksDB because it can be an allocator stress test. I have sought help from experts to debug this and that is still pending.

The table below shows the impact.

Legend:
* v-delta: peak VSZ - bc, measured during create index
* r-delta: peak RSS - bc, measured during create index
* bc(GB):  rocksdb_block_cache_size in GB

config  v-delta r-delta bc(GB)
base    41.0    22.3    80
c1      49.5    33.3    60
c2      27.0    14.8    80
c3      27.6    10.8    80
c4      27.4    14.7    80
c5      28.6    14.6    80
c6      50.8    32.6    60
c7      53.2    33.3    60

The value of max_subcompactions

Until this round of tests my benchmarks have been using max_subcompactions=1 which disables subcompactions. The reason I haven't been using them is that most of my testing from prior years was on small servers with 4 or 8 CPU cores and it wasn't clear to me that I had enough spare CPU to make use of subcompactions. Docs for subcompactions are here.

On the medium server, setting max_subcompactions=4 has a huge impact that is visible in the Avg(sec) column of the compaction statistics. For the base config the value is 45.744 seconds and for the c5 config it drops to 2.861 seconds. This is the time for an L0->L1 compaction.

When L0->L1 compactions take too long (as in 45+ seconds) then more data will pile up in the L0 and future L0->L1 compactions will be larger and slower. From the Read(GB) columns I know the amount of data read during L0->L1 compaction and from the Comp(cnt) columns I know the number of L0->L1 compactions. From that I learn that on average L0->L1 compaction reads 2.2 GB when subcompactions are disabled versus 0.5 GB when they are enabled. So L0->L1 is faster with subcompactions because there are more threads doing the work and there is less work per compaction job. 

The benefit is also visible in the write stall metrics where the Cumulative stall time drops from 2.8% to 0.0%. A stale overview of stall counters is here. The code has been improved since I wrote that and I have yet to revisit it.
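
The compaction and stall numbers cited above (the Avg(sec), Read(GB) and Comp(cnt) columns and the Cumulative stall time) come from the RocksDB statistics that MyRocks exposes. A minimal way to dump them from a client session (the exact layout of the output varies by RocksDB version) is:

show engine rocksdb status\G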

The benefit of max_subcompactions is visible in the average insert rates for the l.i1 benchmark step. See the summary tables for Cached by RocksDB, Cached by OS and IO-bound. Compare the values for the base config (no subcompactions) and the c5 config (subcompactions).

Disabling intra-L0 compaction

An overview of intra-L0 compaction is here. I have been wary of intra-L0 because it has caused me a few problems in the past and because it makes the LSM tree shape more dynamic and harder to reason about. But the results here show that I would be foolish to not embrace it.

I modified MyRocks so that I could disable intra-L0 by setting max_blob_size to a value greater than zero. This was a convenient hack, not proper code. And the c3 config used that to disable intra-L0. A positive side-effect of this change is that the average time for L0->L1 compaction jobs drops from 45.744 to 24.105 seconds per the Avg(sec) column. But a negative side-effect is that the Cumulative stall time increased from 2.8% with the base config to 36.0% with the c3 config meaning that write stalls were much worse and the result was a reduction in the average insert rate with the c3 config.

The benefit of intra-L0 is visible in the average insert rates for the l.i1 benchmark step. See the summary tables for Cached by RocksDB, Cached by OS and IO-bound. Compare the values for the base config (intra-L0 enabled) and the c3 config (intra-L0 disabled).

Reducing L0 slowdown and stop triggers

Next up is the c4 config that reduces level0_slowdown_writes_trigger from 20 to 8 and level0_stop_writes_trigger from 36 to 12. The hope was that by reducing them the L0 would not have as much data when there was stress  (convoys would be smaller). Perhaps that was true but it was lousy for performance. 

From compaction stats I see that the average time for L0->L1 compaction jobs is 45.744 seconds with the base config and drops to 10.188 seconds with the c4 config. However the Cumulative stall time increases from 2.8% with the base config to 15.4% with the c4 config resulting in a lower insert rate for the c4 config.

The benefit of larger values for the slowdown and stop triggers is visible in the average insert rates for the l.i1 benchmark step. See the summary tables for Cached by RocksDB, Cached by OS and IO-bound. Compare the values for the base config (default values) and the c4 config (smaller values).

Benchmarks

The medium server is c2-standard-30 from GCP with 15 cores, hyperthreads disabled, 120G of RAM, and 1.5T of XFS via SW RAID 0 over 4 local NVMe devices.

An overview of the insert benchmark is here, here and here. The insert benchmark was run for 8 clients. The read+write steps (q100, q500, q1000) were run for 3600 seconds each. The delete per insert option was set for l.i1, q100, q500 and q1000.

Benchmarks were repeated for three setups:
  • cached by RocksDB - all data fits in the 80G RocksDB block cache. The benchmark tables have 160M rows and the database size is ~12G.
  • cached by OS - all data fits in the OS page cache but not the 4G RocksDB block cache. The benchmark tables have 160M rows and the database size is ~12G.
  • IO-bound - the database is larger than memory. The benchmark tables have 4000M rows and the database size is ~281G.
The following configurations were tested:
  • base config
  • c1 - adds rocksdb_use_hyper_clock_cache=ON
  • c2 - adds rocksdb_block_cache_numshardbits=4.
  • c3 - disables intra-L0 compaction
  • c4 - reduces level0_slowdown_writes_trigger from 20 to 8 and level0_stop_writes_trigger from 36 to 12
  • c5 - enables subcompactions via rocksdb_max_subcompactions=4
  • c6 - combines c1, c2, c5
  • c7 - combines c1, c5
The my.cnf files are here.
  • cached by RocksDB, IO-bound: base config, c1, c2, ..., c7. The c1, c6 and c7 configs use a 60G RocksDB block cache to avoid OOM because they enable the hyper clock cache while others use an 80G block cache.
  • cached by OS: base config, c1, c2, ..., c7
The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows across all tables without secondary indexes where X is 20 for cached and 500 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 50 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start. The inserts are done to the table head and the deletes are done from the tail.
  • q100
    • do queries as fast as possible with 100 inserts/s/client and the same rate for deletes/s done in the background.
  • q500
    • do queries as fast as possible with 500 inserts/s/client and the same rate for deletes/s done in the background.
  • q1000
    • do queries as fast as possible with 1000 inserts/s/client and the same rate for deletes/s done in the background.
Reports

Performance reports are here for Cached by RocksDB, Cached by OS and IO-bound. The c7 config provides a small (1% to 10%) improvement for average throughput in most of the benchmark steps.

Most of the configs have worse create index (l.x benchmark step) performance because I used rocksdb_merge_combine_read_size=1G for the base config but =128M for the other configs while I was debugging the OOM issue with hyper clock cache.

From the response time tables
  • For Cached by RocksDB the max response times are all less than one second and the distributions are similar for all configs
  • For Cached by OS the max response times are all less than one second and the distributions are mostly similar for all configs, but for q1000 the distributions are slightly better for c5, c6 and c7
  • For IO-bound the c6 and c7 configs had the worst max response times (~2 seconds) on l.i1 while most others were less than one second. Also for l.i1, the distributions were slightly worse for c5, c6 and c7 versus the others. But that might not be a fair comparison because c5, c6 and c7 sustained a higher insert rate.
Charts for insert/delete/query rates and max response time at 1-second intervals. Note that the charts are from the first client and there are 8 clients
  • Cached by RocksDB
    • l.i0, l.i1, q100, q500, q1000
    • The l.i1 insert rates are stable (see here)
    • The q100, q500 and q1000 query rates have an interesting sawtooth pattern. This is most likely from the write buffer or L0 filling up, then emptying. Note that the pattern is more compressed when the insert rate is higher (q1000).
  • Cached by OS
    • l.i0, l.i1, q100, q500, q1000
    • The l.i1 insert rates are stable (see here)
    • The q100, q500 and q1000 query rates have an interesting sawtooth pattern. This is most likely from the write buffer or L0 filling up, then emptying. Note that the pattern is more compressed when the insert rate is higher (q1000).
  • IO-bound
    • l.i0, l.i1, q100, q500, q1000
    • The l.i1 insert rates are stable (see here)
    • The q100, q500 and q1000 query rates have an interesting sawtooth pattern. This is most likely from the write buffer or L0 filling up, then emptying. Note that the pattern is more compressed when the insert rate is higher (q1000).