Tuesday, June 27, 2023

Insert+delete benchmark, medium server and MyRocks

This has results for the new insert benchmark (with deletes enabled) for MyRocks on a medium server. I ended up repeating the benchmark 3 times (round 2 after realizing I needed to log the response time for potentially slow queries, round 3 because I needed to use a new MyRocks build). Results for the benchmark on a small server are here and here.

tl;dr

  • MyRocks in 8.0.28 has better perf than in 5.6.35 for most of the benchmark steps. This wasn't true on the small server. I suspect that one reason for the change is that a server CPU was used here while the small server uses a mobile CPU -- the Beelink uses AMD Ryzen 7 4700u and /proc/cpuinfo from the c2 server shows Intel(R) Xeon(R) CPU @ 3.10GHz.
  • there is a ~5 second write stall for the l.i1 benchmark step in one of the configurations. I have more work to do to explain it

Benchmarks

The medium server is c2-standard-30 from GCP with 15 cores, hyperthreads disabled, 120G of RAM, and 1.5T of XFS via SW RAID 0 over 4 local NVMe devices.

An overview of the insert benchmark is here and here. The insert benchmark was run for 8 clients. The read+write steps (q100, q500, q1000) were run for 3600 seconds each. The delete per insert option was set for l.i1, q100, q500 and q1000.

Benchmarks were repeated for three setups:
  • cached by RocksDB - all data fits in the 80G RocksDB block cache. The benchmark tables have 160M rows and the database size is ~12G.
  • cached by OS - all data fits in the OS page cache but not the 4G RocksDB block cache. The benchmark tables have 160M rows and the database size is ~12G.
  • IO-bound - the database is larger than memory. The benchmark tables have 4000M rows and the database size is ~281G.
The my.cnf files are here for:
The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows per table without secondary indexes where X is 20 for cached and 500 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 50 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start. The inserts are done to the table head and the deletes are done from the tail (see the sketch after this list).
  • q100
    • do queries as fast as possible with 100 inserts/s/client and the same rate for deletes/s done in the background
  • q500
    • do queries as fast as possible with 500 inserts/s/client and the same rate for deletes/s done in the background
  • q1000
    • do queries as fast as possible with 1000 inserts/s/client and the same rate for deletes/s done in the background
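Below is a minimal Python sketch of the l.i1 (and q100/q500/q1000 background write) pattern. It uses a plain deque in place of a real table and a single key counter, so it only illustrates the insert-at-head, delete-from-tail behavior, not the real benchmark client.

from collections import deque

def insert_delete_step(table, num_ops, next_key):
    # For every insert of a new row at the head (largest key) delete one
    # row from the tail (smallest key), so the table has the same number
    # of rows at the end of the step as at the start.
    for _ in range(num_ops):
        table.append(next_key)   # insert at the head
        next_key += 1
        table.popleft()          # delete from the tail (oldest row)
    return next_key

# usage: a table preloaded by l.i0 with keys 0..999
t = deque(range(1000))
insert_delete_step(t, 500, next_key=1000)
assert len(t) == 1000            # row count is unchanged
print(min(t), max(t))            # 500 1499 -> the key window slid forward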
MyRocks

Tests used MyRocks from FB MySQL 5.6.35 and 8.0.28 with the rel build for 5.6.35 and the rel_lto build for 8.0.28. The builds are described in a previous post.

I used old and new versions of MyRocks source code. The old versions were built from HEAD in February 2023. The new versions were built from HEAD in June 2023. The details are:
  • old versions
    • 5.6.35 - RocksDB 7.10.2, FB MySQL git hash 205c31dd
    • 8.0.28 - RocksDB 7.10.0, FB MySQL git hash unknown
  • new versions
    • 5.6.35 - RocksDB 8.2.1, FB MySQL git hash 7e40af67
    • 8.0.28 - RocksDB 8.3.1, FB MySQL git hash ef5b9b101
Reports

Performance reports are here for cached by RocksDB, cached by OS and IO-bound. For cached there are results from the builds with old and new source code distinguished by _old and _new. For IO-bound I only have results from the new source code.

Comparing summaries for 5.6.35 vs 8.0.28 shows that the results here for 8.0.28 are a lot better than the results on the small server (see here for cached and IO-bound). I speculate above that one reason for the change is that the small server (Beelink) uses a mobile CPU while the c2 here uses a server CPU (Xeon).
  • 8.0.28 gets ~10% less throughput for l.i0 (more CPU/insert, see cpupq here)
  • 8.0.28 gets similar throughput for l.i1 (more CPU but fewer context switches, see cpupq and cspq here)
  • 8.0.28 gets 15% to 20% more throughput for queries (q100, q500, q1000) because there is less CPU/query, see cpupq here
From the response time distributions for cached by RocksDB, cached by OS and IO-bound:
  • more throughput mostly implies a better response time histogram
  • worst-case response times are bad (~5 seconds) in one case: l.i1 and cached by OS (see here). Note that the worst-case response time for l.i1 is <= 1 second for IO-bound and for cached by RocksDB. From the throughput vs time charts that show per-second insert rates and per-second max response times for cached by RocksDB, cached by OS and IO-bound, there is one blip on the cached by RocksDB chart.
From the charts for throughput vs time using per-second measurements there are write stalls in l.i1 in the cached by OS and IO-bound workloads but they are much worse for cached by OS.

Debugging the write stall

What else can I learn about the ~5 second write stalls in l.i1 in cached by OS? Unfortunately, my test scripts don't archive the RocksDB LOG file (not yet). I do have the output from SHOW ENGINE ROCKSDB STATUS run at the end of each benchmark step and from that I see:
  • there are no write stalls at the end of the step (l.x) that precedes l.i1 (see here)
  • the write stalls in l.i1 are from too many files in L0 (see here)
  • the average time for an L0->L1 compaction is in the Avg(sec) column (see here). I don't know if the median time is close to the average time. Again from here, based on the values of the Rn(GB), Rn+1(GB) and Comp(cnt) columns, the average L0->L1 compaction reads ~842M from L0 and ~1075M from L1 and then writes ~1884M. This is done by a single thread that processes the compaction input at ~64M/s. Note that L0->L1->L2 is the choke point for compaction because L0->L1 is usually single-threaded and L0->L1 usually cannot run concurrent with L1->L2. From the results for l.i1 with cached by RocksDB and IO-bound the stats aren't that different, yet those configurations don't get ~5 second write stalls. Based on the configuration, compaction should be triggered with 4 SSTs in L0 (~32M each) and 4 SSTs in L1 (~64M each), which would be ~384M of input, but when compaction gets behind the input gets larger (see the arithmetic sketch below).
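The arithmetic behind that, as a tiny Python sketch. The ~64M/s single-thread rate, the SST sizes and the observed compaction input sizes are taken from the stats above; the rest is just division.

def l0_l1_seconds(input_mb, mb_per_sec=64):
    # time for one single-threaded L0->L1 compaction job
    return input_mb / mb_per_sec

# steady state: 4 L0 SSTs (~32M each) + 4 L1 SSTs (~64M each) = ~384M of input
print(l0_l1_seconds(4 * 32 + 4 * 64))    # ~6 seconds
# observed average once compaction is behind: ~842M (L0) + ~1075M (L1)
print(l0_l1_seconds(842 + 1075))         # ~30 seconds per L0->L1 job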
I need to repeat the cached by OS benchmark with more monitoring and archiving of LOG to try and explain this.


Thursday, June 22, 2023

Universal Compaction in RocksDB and me

I almost always use leveled compaction with RocksDB and rarely use universal. Universal is useful for many workloads but takes more effort from users to get working well. One overview of RocksDB compaction is here and details on universal are here. A comparison of RocksDB universal with other implementations of tiered is here.

If you are a pessimist this is a rant. If you are an optimist this is a short list of work that can be done to improve universal compaction.

Universal compaction in RocksDB is called tiered compaction by every other LSM. I wish RocksDB didn't create a new name as that is a source of confusion. 

Updates

  • I added the Trivial Move section with stats to show write amp during fillseq


How it began

Universal started as a clever hack that reused much code from leveled. The clever hack was to keep all SSTs in the L0. Alas, this has a bad side-effect. Bloom filters and block indexes are per-SST and not paged (paged == split into ~4kb pieces) when cached. When the SST is huge the bloom filter or block index block for that SST will be large -- reading & decompressing it takes more time, it is cached as-is, and putting a large thing into the block cache will frequently displace more useful info. The best practice at the time of using 64 shards for the block cache (to reduce mutex contention) makes the problem worse because each shard only gets 1/64th of the block cache memory.
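A rough example of the filter-size math, assuming ~10 bits/key for the bloom filter and ~100-byte rows -- both are assumptions for illustration, not numbers from this post:

def bloom_filter_mb(sst_gb, avg_row_bytes=100, bits_per_key=10):
    keys = sst_gb * 2**30 / avg_row_bytes
    return keys * bits_per_key / 8 / 2**20

print(bloom_filter_mb(0.064))   # ~64M SST -> ~0.8 MB filter block
print(bloom_filter_mb(64))      # ~64G SST -> ~800 MB filter block
# with 64 block cache shards a shard gets 1/64th of the cache, so a
# filter block that large displaces a lot of more useful data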

Large SSTs are expected with universal, thus this was a common perf problem with it, and it wasn't easy to diagnose. The solution was to use levels beyond the L0 for the largest SSTs so that one large logical sorted run could be stored on level N and split into many smaller physical sorted runs (SSTs), and then the per-SST bloom filter and index blocks would be smaller. If you think of sorted runs with tiered compaction as being ordered in a list (run-1, run-2, ..., run-6) and a leveled LSM tree with levels L0 to L5, then L5 can store run-6 split into many SSTs, L4 might store run-5 split into many SSTs, and run-1 to run-4 might all be in L0 with each using just one SST.

Performance problems

Common performance problems with universal include:

  • transient space-amp
    • transient space-amp is large during compaction (true for all tiered, until they add incremental) but is larger in practice than you expect because any SST created after a long-running compaction has started cannot be dropped until the long-running compaction ends. See issue 10533.
  • weak limit on space-amp
    • while leveled compaction has a strong limit on worst-case space amp, universal compaction does not and this is a different issue than the one with transient space-amp listed above. See issue 9561.
  • single-threaded compaction
    • a compaction job is single-threaded, ignoring subcompactions, and with universal the amount of data processed by one compaction job can be huge (tens of GB or more) which means the equivalent of major compaction in RocksDB can take many hours. Note that subcompactions allow some compaction jobs to be multi-threaded. Alas, subcompactions ignore the max_background_jobs setting and you can easily have too many threads running when using this feature (issue 9291 is still open). Note that subcompactions are disabled by default. A few more details on subcompactions are here.
  • trivial move is better with leveled
    • Inserts in key order with leveled frequently have minimal (=2) write-amplification (write to WAL, write during memtable flush) thanks to the trivial move optimization (see here and here). While this can be enabled for universal it might not work as well -- that is based on what I reported in this issue and eventually I will revisit it.
  • compaction is bursty
    • With leveled compaction it is likely that compaction jobs are running all of the time and each compaction job processes a not-too-large amount of data (no more than a few GB). If you haven't enabled subcompactions and have set max_background_jobs to a reasonable value then you can assume ~X threads will always be busy doing compaction where X = max_background_jobs. There is some variance because the amount of data processed for an L0->L1 compaction is likely to be larger than for Ln->Ln+1 where n > 0, but that usually isn't a big deal. The meaning of busy also has variance because L0->L1 compaction is less likely to be IO-bound (reads should be from the OS page cache if not using O_DIRECT), while larger levels might have more CPU overhead when smaller levels use no or lighter (lz4) compression and larger levels use zstd, and larger levels are also more likely to be IO-bound.
    • With universal compaction there can also be some compaction jobs running all of the time, although not as many as with leveled. But there can also be bursts of work because compaction jobs are started less often (less write-amp means less compaction) and the jobs can run for a long time. These bursts can be a surprise if you didn't leave spare CPU and IO capacity to handle them (kind of like not paying taxes gives you more money to spend, until the tax bill arrives).
Several of these problems will be less of an issue if incremental universal is implemented.

Trivial move

I ran the following to document the difference between leveled and universal. For universal the test was repeated with and without the trivial move optimization enabled. The fillseq benchmark inserts keys in ascending key order. The test was run in 3 configurations:
  1. no compression
  2. lz4 compression for all levels
  3. lz4 compression with min_level_to_compress=2: in this case the trivial move optimization is only used to move uncompressed SSTs to level 2 of the LSM tree, then they must be compressed.
Then I looked at the compaction IO statistics printed when the test finishes and the results are below. The behavior for leveled is similar to universal with the trivial move optimization for the first two cases but not for the third where min_level_to_compress=2. I am curious if the difference in the third case is by design, an oversight or a bug.

The compaction IO stats from test end are here for case 1, case 2 and case 3. The test script is here.

--- no compression

GB written by compaction and memtable flush
11.3    leveled
68.9    universal with universal_allow_trivial_move disabled (default)
11.3    universal with universal_allow_trivial_move enabled

--- lz4 for everything

GB written by compaction and memtable flush
 6.3    leveled
32.5    universal with universal_allow_trivial_move disabled (default)
 6.3    universal with universal_allow_trivial_move enabled

--- lz4 for everything, min_level_to_compress=2

GB written by compaction and memtable flush
16.8    leveled
37.0    universal with universal_allow_trivial_move disabled (default)
11.3    universal with universal_allow_trivial_move enabled
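The GB-written numbers above can be restated as write-amp relative to leveled; a trivial Python calculation, but it makes the gap clear.

# write-amp relative to leveled, from the GB-written numbers above
cases = {
    "no compression": (11.3, 68.9, 11.3),
    "lz4 everywhere": (6.3, 32.5, 6.3),
    "lz4 + min_level_to_compress=2": (16.8, 37.0, 11.3),
}
for name, (leveled, universal, universal_tm) in cases.items():
    print(name, round(universal / leveled, 2), round(universal_tm / leveled, 2))
# universal without trivial move writes ~5x to ~6x more than leveled in the
# first two cases and ~2x more in the third; with trivial move it matches
# leveled except in the min_level_to_compress=2 case where it writes less
# (11.3 vs 16.8)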

Tuesday, June 13, 2023

Tuning Postgres and the new insert benchmark, round 4

After looking at the results from round 3 I reran the insert+delete benchmark for a new configuration that combines the best changes identified in the previous tests. The goal is to find a configuration that improves throughput and reduces variance. In this post I explain the impact of a new configuration (a18) that improves on the previous best configuration (a16).

tl;dr

  • The new config (a18) improves on the previous best one (a16)
  • There is still too much variance in the l.i1 benchmark step
Configurations

The new configuration is a18 and I use two variations of it: one with shared_buffers=70GB for the cached by Postgres and IO-bound workloads, and one with shared_buffers=4GB to simulate fast storage (reads are done from the OS page cache). It starts with the base config and then combines the changes from the a1, a10, a12 and a16 configs. The changes are: wal_compression=lz4, autovacuum_work_mem=256MB, wal_buffers=64MB, autovacuum_vacuum_scale_factor=0.05 and autovacuum_vacuum_insert_scale_factor=0.05. The last two are the most important.

Benchmarks

The previous post has more detail. The benchmark is the updated insert benchmark with the delete_per_insert option enabled so that the logical table size doesn't grow after the initial load. It was run for two workloads:

  • cached by Postgres - database fits in the Postgres buffer pool
  • cached by OS - database is too big for the Postgres buffer pool but fits in the OS page cache
The test server is c2-standard-30 with 15 cores and 120G RAM. The benchmark is run for 8 clients.

The benchmark steps are described here. I focus on the l.i1 benchmark step that is a stress test for MVCC GC and I want to minimize variance and maximize throughput for it.

Results

The reports are here for cached by Postgres and by OS. Things that I see:
  • From the Summary section it is clear that throughput with the a18 config is slightly better than with the a16 config. 
  • I have yet to explain why QPS with the read+write tests (q100, q500, q1000) on the cached by Postgres workload is slightly worse for a16 and a18 vs the base config. One possible issue is that read+write tests inherit more writeback debt with a16/a18 because the l.i1 benchmark step finishes earlier with them.
  • I missed this in my previous post but the max insert response time for cached by Postgres and q1000 is 15.661 seconds with the a11 config that adds autovacuum_work_mem=1GB. That is a bad write stall. For queries with the same config it was 15.654 seconds, again much too long.
  • For cached by Postgres
  • For cached by OS
    • for l.i1 the response time distributions are similar for a16 and a18 except that the max insert response time is 1.998 seconds with a18 which is too long.
    • for l.i1 write stalls are much worse (longer duration) for a18 than for a16. This is harder to see from the response time distribution because (sadly) the benchmark client suffers from coordinated omission. But for q1000 the insert rate has more variance with a16 than with a18.

Tuning Postgres and the new insert benchmark on a small server

This has results for the updated insert benchmark with Postgres 15.2 on a small server. The goal is to search for a better configuration. Previous posts have results from a medium server: post 1, post 2, post 3. My focus is on the l.i1 benchmark step that is write-heavy -- inserts & deletes with a PK index and 3 secondary indexes in place. And the goal is to find a configuration that minimizes write stalls and response time variance.

tl;dr

  • the 1-client tests are more sensitive to config options than the 4-client tests at least for the l.i1 benchmark step. Most of the configs get about 10k inserts/s on l.i1 for the 1-client tests, except for the a16 and a18 configs which get >= 17k/s. But on the 4-client tests all of the configs get ~22k inserts/s for cached by Postgres and ~14k inserts/s for cached by OS
  • the largest improvement to the base config is from the a16 config that tunes autovacuum*scale_factor (see here). This is the second big improvement from tuning autovacuum. The previous one is here.

Benchmark

This benchmark used the Beelink server explained here that has 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04. Postgres version 15.2 was used, compiled from source with the o2_nofp build (compiled with -O2 -fno-omit-frame-pointer).

The insert benchmark was run with --delete_per_insert to avoid growing tables after the initial benchmark step that loads tables. The database was either cached by Postgres (shared_buffers=10GB) or the OS (shared_buffers=1GB).

The benchmark was repeated for 1 client and 4 clients with a table per client in each case. There were 20M inserts/table in the l.i0 step, 50M inserts/table in the l.i1 step and the read+write steps (q100, q500, q1000) run for 1800 seconds. The benchmark steps are explained here.

There were four variants of the benchmark:

  • by Postgres, 1 client - database cached by Postgres, 1 client
  • by Postgres, 4 clients - database cached by Postgres, 4 clients
  • by OS, 1 client - database cached by OS, 1 client
  • by OS, 4 clients - database cached by OS, 4 clients

All of the config files are here. The base config files are here for cached by Postgres and by OS. The other config files make small changes to the base. The biggest win is already in the base config, and is the tuning done to make autovacuum more aggressive (see here). Of the configs, and using abbreviated names:

  • one of them has a significant benefit: a16 which tunes autovacuum*scale_factor
  • a few of them have a small benefit: 
  • many of them hurt performance:  
  • a few of them can be ignored:
The configurations are:
  • a1 - adds wal_compression=lz4
  • a2 - adds backend_flush_after=256kb
  • a3 - adds full_page_writes=off (not for production!)
  • a4 - adds checkpoint_timeout=1800s, which was already set to 1800s in the base config so this can be ignored
  • a9 - adds checkpoint_timeout=900s
  • a10 - adds autovacuum_work_mem=256MB
  • a12 - adds wal_buffers=64MB
  • a14 - adds backend_flush_after=1MB
  • a15 - adds max_wal_size=32GB
  • a16 - adds autovacuum_vacuum_scale_factor=0.05, autovacuum_vacuum_insert_scale_factor=0.05
  • a17 - reduces bgwriter_lru_maxpages from 2000 to 500
  • a18 - combines the best configs: a1, a10, a12 and a16
The full names for the config files used for cached by Postgres have the pattern conf.diff.cx8${X}_bee where X is one of a1 to a18. And for cached by OS the full name has the pattern conf.diff.cx8${X}_1g_bee. Examples are conf.diff.cx8a18_bee and conf.diff.cx8a18_1g_bee.

Results

To understand the config changes I compare results from a given config with results from the base config.

The summaries are here for:
First an analysis for by Postgres & 1 client:
  • Relative to the base config the a16 config provides the largest improvement to average throughput for l.i1.
  • Next I consider the impact on worst-case response time for inserts (see here).
  • Finally, I look at the charts for per-second insert rates vs time for l.i1 (see here). The a16 config has the best average throughput and the chart for a16 is a bit different than the others.
  • The conclusion is that I should definitely use a16 and probably use a1, a10, and a12 for this workload. I then created the a18 config that does just this.
Results from the other workloads are similar to by Postgres & 1 client: the a16 config was the best, but a few other configs were useful, so I created a new config, a18, that combines them. However, results from the a18 config are similar to the a16 config, not significantly better.

I then classified the impact of each config file for 4 workloads: by Postgres & 1 client, by Postgres & 4 clients, by OS & 1 client, by OS & 4 clients. The classification has 5 groups and uses relative throughput, which is (throughput for config) / (throughput for base config). The 5 groups are listed below, with a small sketch of the check after the list:
  • significant benefit - relative throughput is more than 1.05
  • small benefit - relative throughput is >= 1.02 and <= 1.05
  • no impact - relative throughput is >= 0.99 and <= 1.01
  • hurt performance - relative throughput is <= 0.98
  • ignored - the result for this config can be ignored
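A minimal sketch of that classification; the thresholds are copied from the list and values that fall in the small gaps between buckets (for example 1.015) go to the nearest bucket here.

def classify(config_rate, base_rate, ignore=False):
    # bucket a config by relative throughput vs the base config
    if ignore:                       # e.g. a3 (unsafe) or a4 (duplicate of base)
        return "ignored"
    r = config_rate / base_rate
    if r > 1.05:
        return "significant benefit"
    if r >= 1.02:
        return "small benefit"
    if r >= 0.99:
        return "no impact"
    return "hurt performance"

print(classify(17000, 10000))        # a16-like l.i1 result -> significant benefit
print(classify(9800, 10000))         # -> hurt performance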
From these results I chose to create the a18 config by combining a1, a10, a12 and a16.

Significant benefit
by Postgres, 1 client:  a16
by OS, 1 client:        a16
by Postgres, 4 clients: a16
by OS, 4 clients:

Small benefit:
by Postgres, 1 client:  a2 a10
by OS, 1 client:        a1 a9 a10 a12 a15
by Postgres, 4 clients:
by OS, 4 clients:       a16

No impact:
by Postgres, 1 client:  a1 a9 a12 a15 a17
by OS, 1 client:        
by Postgres, 4 clients: a1 a2 a9 a10 a12 a14 a17
by OS, 4 clients:       a1 a9 a10 a12 a15 a17

Hurt performance:
by Postgres, 1 client:  a14
by OS, 1 client:        a2 a14 a17
by Postgres, 4 clients: a15
by OS, 4 clients:       a2 a14

Ignored:
all:    a3 (not safe for prod), a4 (duplicate of base config)

Tuning InnoDB for the insert benchmark on a medium server

This explains my attempt to improve the my.cnf settings that I use for InnoDB with the insert benchmark. Here I use the updated insert benchmark with deletes per insert to avoid growing tables. For MySQL 5.6, 5.7 & 8.0 I tested 3, 4 & 11 changes to the my.cnf options I have been using.

tl;dr

  • In some benchmarks setting innodb_use_native_aio=OFF improves query performance while hurting insert performance. I assume the problem is that IO queues get overloaded with aio=ON. Too many concurrent IO requests is OK for background activity (page writeback) but bad for foreground activity (queries). In some cases I would accept this trade off and use aio=OFF but I hope InnoDB adds a limit to concurrent IO requests with aio=ON.
  • Other than the a1 config (innodb_use_native_aio=OFF) none of the configs consistently help performance. The base config I have been using is good enough for now.
  • While I don't focus on comparisons to other DBMS here, MySQL QPS on the read+write benchmark steps suffers a lot (and too much) from bug 109595.
  • There were several long (~100 second) read stalls at the start of some benchmark steps. At last I can explain this, see the Updates section.
Updates

The following explains the ~100 second read stalls with InnoDB. I assume that Postgres and MyRocks also have this problem, but need to confirm that. The problem is:
  • tables have a PK on trxid column
  • background process intermittently deletes N rows at a time with smallest trxid value
  • at the start of the q100, q500 and q1000 benchmark steps, inserts and deletes will not be done until this query finishes: select min(trxid) from $table
  • if InnoDB purge gets behind then that query encounters and skips a large number of delete-marked rows before finding the first live one. In the tests here that can take ~100 seconds (a toy model of the stall follows this list).
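A toy model of that stall. The per-row cost is made up (assume ~1 usec to skip one delete-marked row); the point is only that the time scales with the number of unpurged rows.

def min_trxid_seconds(unpurged_delete_marked_rows, usec_per_row=1.0):
    # time for 'select min(trxid) from $table' to skip dead rows before
    # it reaches the first live row
    return unpurged_delete_marked_rows * usec_per_row / 1e6

print(min_trxid_seconds(100_000_000))   # ~100 seconds when purge is far behind
print(min_trxid_seconds(1_000_000))     # ~1 second when purge is nearly caught up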
Benchmark

The insert benchmark was run in three configurations.

  • cached by InnoDB - InnoDB buffer pool caches all tables
  • cached by OS - the 4G InnoDB buffer pool does not cache all tables but the OS page cache does
  • IO-bound - the database is larger than memory

I used the rel build for MySQL 5.6.51 and the rel_lto builds for MySQL 5.7.40 and 8.0.33. The builds are explained here. In all cases I built from source using upstream MySQL.

The test HW is a c2-standard-30 server from GCP with 15 cores, hyperthreads disabled, 120G RAM and 1.5TB of XFS via SW RAID 0 striped over four local NVMe devices. The OS is Ubuntu 22.04.

The benchmark is run with 8 clients and a client per table. For cached by InnoDB/OS the l.i0 step inserts 20M rows/table and for IO-bound it inserts 400M rows/table. The l.i1 step is run with --delete_per_insert and does 50M inserts/deletes per table. The read+write steps (q100, q500, q1000) also run with --delete_per_insert for 1800 seconds.

Configurations

The benchmark was run first using the base configurations. Then it was repeated for other configurations that changed or added a small number of options. I use a1 through a11 to name them here but the full name for a configuration file has the pattern my.cnf.cy10X_gcp_c2s30 or my.cnf.cy10X_4g_gcp_c2s30 where X has the value a1 ... a11. All of the config files are here for 5.6.51, 5.7.40 and 8.0.33.

For cached by InnoDB and IO-bound the configurations have an 80G InnoDB buffer pool and don't use buffered IO (innodb_flush_method = O_DIRECT_NO_FSYNC), with one exception -- the a4 config uses innodb_flush_method = fsync.

For cached by OS the configurations have a 4G InnoDB buffer pool and use buffered IO (innodb_flush_method = fsync). 

The base configurations are here:

The configurations tested are:

  • a1 - adds innodb_use_native_aio=OFF
  • a2 - increases innodb_io_capacity[_max] to 20k/40k
  • a3 - adds innodb_flush_sync=OFF
  • a4 - adds innodb_flush_method=fsync
  • a5 - adds innodb_change_buffering=none
  • a6 - adds innodb_max_dirty_pages_pct_lwm=40, innodb_max_dirty_pages_pct=50
  • a7 - adds innodb_lru_scan_depth=2048
  • a8 - adds innodb_lru_scan_depth=4096
  • a9 - adds innodb_lru_scan_depth=8192
  • a10 - adds innodb_purge_threads=8, innodb_page_cleaners=8
  • a11 - reduces innodb_redo_log_capacity to 16G
All of the configurations were tested for MySQL 8.0.33. Only the first three or four (a1 to a4) were tested for 5.6.51 and 5.7.40.

For each of the workloads below I analyze the performance in four parts:
  • average throughput - I look at the average queries and inserts per second to see if these are much better or worse than the base config
  • response time - I look at max response time and the response time distributions to see if a config makes them much better or much worse than the base config
  • graphs - there are graphs for the per-second query, insert and delete rates. I look at the graphs to see if the rates degrade over time or have too many stalls.
  • SLA - there is a target rate for the background inserts and deletes which is one of 100, 500 or 1000 per second per client. There is an SLA failure when the actual rate is too much smaller than the target (a sketch of the check follows this list).
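A sketch of that check. The target rates come from the benchmark step names; the 5% tolerance is my assumption, not the benchmark's exact rule.

def sla_ok(measured_rate, target_rate_per_client, clients=8, tolerance=0.05):
    # fail when the measured background insert (or delete) rate is too
    # far below clients * target
    return measured_rate >= clients * target_rate_per_client * (1 - tolerance)

print(sla_ok(7600, 1000))   # q1000 with 8 clients -> True
print(sla_ok(5000, 1000))   # well below 8000/s    -> False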

Performance Problems

QPS on the read+write benchmark steps is worse than it should be courtesy of bug 109595. The optimizer is doing extra reads from the PK index to determine selectivity. There is no workaround and it is unfortunate that this bug exists because the queries specify the index to be used, so the optimizer shouldn't need this info -- just trust my hint.

The insert rate on the l.i1 benchmark steps is lower than I expect for the cached by OS workloads. By this I mean that inserts/s for l.i1 is much smaller than I expect relative to l.i0. I am not sure this needs to be explained: while the l.i1/l.i0 ratio for cached by OS is much worse with InnoDB than with Postgres, the ratios for InnoDB and Postgres are similar for the IO-bound workloads.

Legend:
* IPS is average for inserts/second
* Numbers are (IPS for l.i1 / IPS for l.i0)

MySQL 5.6.51
  by InnoDB     82051 / 430108 = .190
  by OS         14388 / 422164 = .034
  IO-bound      17252 / 432199 = .039

MySQL 5.7.40
  by InnoDB     137368 / 812183 = .169
  by OS          14443 / 784314 = .018
  IO-bound       17433 / 844327 = .020

MySQL 8.0.33
  by InnoDB     129686 / 686695 = .188
  by OS          13098 / 675105 = .019
  IO-bound       18022 / 709534 = .025

Postgres 15.3
  by Postgres   249299 / 1000000 = .249
  by OS         110345 / 1012658 = .108
  IO-bound       20317 / 1104972 = .018

For several of the benchmark results below there is a ~100 second write stall at the start of a benchmark step. See the Updates section above for an explanation. 

Results: cached by InnoDB

Reports are here for cached by InnoDB, cached by OS and IO-bound.

Summary:

  • average throughput
    • For 5.6.51 the a1, a2, a4 configs have a small impact, both good and bad
    • For 5.7.40 the a1 config reduces QPS on q100, q500 and q1000 by 10% to 20% and the a4 config reduces throughput by 1% to 4% for l.i0, l.i1, q100, q500 and q1000. The a2 and a3 configs have a small impact, both good and bad.
    • For 8.0.33 the a1 config is bad for all benchmark steps. The a2 and a3 configs have a small impact, both good and bad. The a4 config hurts perf. The remaining configs (a5 to a11) generally helped QPS on q100 but hurt perf on all other benchmark steps. And a10 was bad for l.i1.
  • response time
    • distributions are similar for 5.6.51 and 5.7.40. For 8.0.33 they are also similar except that the a10 config is bad for l.i0 and l.i1.
  • graphs
    • for 5.6.51 and 5.7.40 nothing interesting beyond the write stalls
    • for 8.0.33 the a10 config has an odd graph for the l.i1 insert rate
  • SLA
    • the a1 config failed for q100 with 5.7.40. It didn't fail for q500 or q1000. This is odd. From the metrics the disk reads per query metric (rpq) is 2X the other configs. This benchmark step runs immediately after create index and the InnoDB buffer pool has to warm up after create index. Perhaps something wiped the OS page cache. I am happy to ignore this for now. From the graphs there is a write stall for ~100 seconds at the start of the benchmark step.

Results: cached by OS

Reports are here for cached by InnoDB, cached by OS and IO-bound.

Summary:

  • average throughput
    • For 5.6.51 the a1 and a2 configs have a small impact, both good and bad
    • For 5.7.40 the a1 config is great for perf. Note that here the a1 config (innodb_use_native_aio=OFF) is combined with buffered IO, while for cached by InnoDB and IO-bound it would be combined with O_DIRECT. Also note that the read IO done here is from the OS page cache. The a2 and a3 configs hurt perf.
    • For 8.0.33 the a1 config is again great for perf. But see the disclaimers in the previous point. The a2, a3, a7, a8 and a11 configs have a small impact, both good and bad. The a5 config is lousy for perf. The a6 config boosts QPS for a small cost to the insert rates. The a9 config might be good, although it is odd that a7 and a8 were not because all increase lru_scan_depth. The a10 config hurts perf.
  • response time
    • distributions are similar for 5.6.51, 5.7.40
    • For 8.0.33 the a5 config is bad for writes and the a9 config is bad for worst-case response time.
  • graphs
    • not much to see. For 5.7.40 and q1000 the a3 config does much better with the background inserts than the other configs.
  • SLA
    • the a5 config failed for q500 and q1000 with 8.0.33. The a5 config disables the InnoDB change buffer and because of that the KB written to disk per insert (see wkbpi) is 2X to 4X larger vs the other configs. This is one example of the benefit of the change buffer. Alas, the change buffer has significant costs from mutex contention that I will document later.

Results: IO-bound

Reports are here for cached by InnoDB, cached by OS and IO-bound.

Summary:

  • average throughput
    • For 5.6.51 the a1 config has a small impact, good and bad. The a2 config has a small impact, mostly bad. The a4 config is lousy for perf.
    • For 5.7.40 the a1 config has a medium impact (bad for inserts, good for QPS). The a2 and a3 configs have a small impact, good and bad. The a4 config is lousy for perf.
    • For 8.0.33 the a1 config has a medium impact (bad for inserts, good for QPS). The a2 and a7 configs have a small impact, good and bad. The a3 config has a small impact, mostly bad. The a4, a5, a6, a10 and a11 configs are bad for perf. The a8 and a9 configs have a small impact, mostly good.
  • response time
    • For 5.6.51 the a4 config is bad. All of the configs have 4 second write stalls with l.i1.
    • For 5.7.40 the a4 config is bad.
    • For 8.0.33 with l.i1 the a11 config has a 2 second write stall and the a4, a5 and a10 configs are bad. The a4, a5 and a10 configs are also bad for q1000.
  • graphs
    • for 5.6.51 the a4 config has a write stall and QPS variance with q1000
    • for 5.7.40 the a4 config has much QPS variance with q1000
  • SLA
    • the a4 config failed for q100 and q1000 with 5.6.51. For q100 I see that KB read per query (rkbpq) is almost 4X larger with a1, but reads per query (rpq) is the same as other configs. I don't understand this. For the a4 config there is a similar problem with the rkbpq metric. The metrics are here. For q1000 there is a ~100 second write stall at the start of the benchmark step (see here). While a4 didn't fail for q500, the graph shows a similar write stall and too much variance.
    • the a4 config failed for q100 with 5.7.40. Again, there is a 100-second write stall at the start of the benchmark step.
    • the a5 config failed for q1000 with 8.0.33. From the metrics the reads per query (rpq) and KB written per insert (wkbpi) are much larger for the a5 config.

Monday, June 12, 2023

Insert+delete benchmark: small server, MyRocks and IO-bound

This has results for an IO-bound, low-concurrency insert benchmark on a small server. The previous report for a CPU-bound workload is here. This used the --delete_per_insert option so that the write-heavy steps ran for a long time while the working set remained in memory.

This work was delayed because I had to figure out the memory demand for create index to avoid OOM.

tl;dr

  • 5.6.35 has better perf than 8.0.28 courtesy of new CPU overheads in upstream MySQL 8, and with 8.0.28 the throughput is up to 22% less than with 5.6.35
  • Variance is visible, but not horrible.
Benchmark

See the previous report. MyRocks used the cy10a_bee config for 5.6.35 and for 8.0.28.

The insert benchmark was repeated for 1 client and 4 clients. For 1 client the l.i0 benchmark step loaded 800M rows into 1 table. For 4 clients it loaded 200M rows per table and there were 4 tables.

In both cases:
  • the l.i1 benchmark step did 50M inserts matched by 50M deletes
  • the q100, q500 and q1000 benchmark steps each ran for 1800 seconds
Reports

Reports are here for 1 client and 4 clients.

The first thing to acknowledge is the performance regression. There is more CPU overhead in MySQL 8.0 than 5.6 (in non-RocksDB code) so 8.0 gets less throughput than 5.6 in many cases. The following table shows the throughput in 8.0.28 relative to 5.6.35 and a value < 1.0 means 8.0 is slower. These results are from the Summary tables in the reports:

See the previous report for a description of the benchmark steps.

Throughput from 8.0.28 / 5.6.35

l.i0    l.x     l.i1    q100    q500    q1000
0.78    0.97    0.90    1.03    0.99    0.98    1 client
0.80    0.95    1.01    0.92    0.90    0.90    4 clients

Interesting things in the reports:
  • The largest regression is in l.i0
  • Response times are similar for 1 client and for 4 clients.
  • Insert response time graphs show two stall levels (two horizontal lines) for 8.0.28 but only one for 5.6.35. An example is here (see 5.6 and 8.0)
  • QPS graphs vs time have a lot more noise with 8.0.28. Perhaps this is a function of CPU overhead. See here for 5.6 and for 8.0.
  • There are intermittent slow queries in 5.6 and 8.0 (response time in usecs jumps from < 1000 to > 10k). I have yet to explain this. See here for 5.6.
  • For the read+write benchmark steps (q100, q500, q1000) the CPU/query (see the cpupq columns here and here) explains the regressions in the 4 clients results. But I didn't expect this change and will repeat the benchmark with a longer running time.
Write performance vs memory

The table below lists the insert rate for the l.i0 and l.i1 benchmark steps for the cached workload from the previous report and the IO-bound workload explained here. It might be a surprise that the rates are similar between cached and IO-bound. But this is expected thanks to the trivial move optimization (for l.i0) and read-free secondary index maintenance (for l.i1).

--- 1 client

Cached                  IO-bound
l.i0    l.i1            l.i0    l.i1
76628   26667           77527   26288   5.6.35
61162   23866           60469   23719   8.0.28

--- 4 clients

Cached                  IO-bound
l.i0    l.i1            l.i0    l.i1
194175  39877           212427  36062   5.6.35
158730  40530           169635  36298   8.0.28

Wednesday, June 7, 2023

Create index, MyRocks and memory

I got OOM during index create for a few MyRocks benchmarks and I am somewhat ignorant about how MyRocks handles create index. This occurred on a server that has 16G of RAM, RocksDB block cache was 10G and mysqld RSS was ~11G so there wasn't much spare memory.

I am still somewhat ignorant but will share what I learned.

By OOM I mean that the Linux OOM killer decided to kill mysqld when the memory demand exceeded the memory supply.

There is one approach for creating a secondary index: bulk.

tl;dr

  • Plan for the increase to mysqld RSS as a function of rocksdb_merge_combine_read_size

Bulk approach

MyRocks can optimize bulk loading and those features have been extended to support secondary index creation. Documentation on bulk loading is here. My request to improve the docs is here. My possibly correct description of how this works (a minimal sketch follows the list below):

  1. The PK is scanned to extract the columns needed for the secondary index and stored in memory
    1. When a memory limit is reached the columns are sorted and written to a temp file
    2. Continue the PK index scan
  2. Assume there are many sorted runs, each in a temp file. Ignore the case where nothing was spilled.
    1. Merge the sorted runs, write the output to new SSTs that are written to the max level of the LSM tree
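As mentioned above, a minimal external-sort sketch of that flow. It keeps the spilled runs in memory as Python lists for brevity (the real implementation writes temp files and then SSTs into the max level), so treat it as an illustration of the shape of the algorithm, not of the MyRocks code.

import heapq

def build_secondary_index(pk_scan, max_buffered_keys):
    runs, buf = [], []
    for sec_key in pk_scan:                # 1. scan the PK, extract columns
        buf.append(sec_key)
        if len(buf) >= max_buffered_keys:  # 1a. memory limit reached ->
            runs.append(sorted(buf))       #     sort and spill a run
            buf = []
    if buf:
        runs.append(sorted(buf))
    return list(heapq.merge(*runs))        # 2. merge runs -> SSTs in the max level

print(build_secondary_index(iter([5, 3, 9, 1, 7, 2]), max_buffered_keys=2))
# [1, 2, 3, 5, 7, 9]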
There are two options that determine memory usage during bulk load:
  • rocksdb_merge_combine_read_size - the default is 1G and there is one concurrently active allocation per alter table add index statement even when multiple indexes are created by that statement.
  • rocksdb_merge_buf_size - the default is 64M and I am not sure how many concurrently active allocations might be used per alter table add index statement

Experiments: small server

I did a few experiments on a small server to understand memory usage by monitoring VSZ and RSS for mysqld while creating indexes. The server has 16G of RAM, the RocksDB block cache was 1G, rocksdb_merge_buf_size was the default (64M) and rocksdb_merge_combine_read_size was set to 128M.

My goal is to understand the memory demand for create index as a function of rocksdb_merge_buf_size and rocksdb_merge_combine_read_size so that I can avoid OOM. However, I really should have tested more values for these options. From the RSS diff table below I think that mysqld RSS grows by ~300M per concurrent index create when rocksdb_merge_combine_read_size is 128M.

First I did a full scan of a large table. At the end of the scan VSZ was 2.3G and RSS was 1.5G for mysqld, courtesy of the 1G RocksDB block cache.

Then I created secondary indexes first with 1 client and then with 4 clients creating the indexes concurrently. For the 1 client test the table had 800M rows. For the 4 client test the tables each had 200M rows. Scripts monitored VSZ and RSS for mysqld during each benchmark step and the tables below report the peak values during the benchmark step. The benchmark steps here are:

  • l.i0 - load the table in PK order, table has no secondary indexes
  • l.x - create 3 secondary indexes per table

Peak VSZ (GB) during benchmark step
Clients l.i0    l.x
1       3.1     4.2
4       3.4     5.7

Peak RSS (GB) during benchmark step
Clients l.i0    l.x
1       1.7     2.8
4       1.8     3.8

Finally, I subtract VSZ/RSS during the full scan (VSZ=2.3G, RSS=1.5G) from the peak VSZ/RSS values in the tables above.

Computed as:
(peak value during benchmark step) - (value after full scan)

VSZ diff
Clients l.i0    l.x
1       0.8     1.9
4       1.1     3.4

RSS diff
Clients l.i0    l.x
1       0.2     1.3
4       0.3     2.3

Experiments: big server

Next I ran tests on a large server with a 4G RocksDB block cache. After a SELECT statement that does a full scan of a large table the VSZ and RSS for mysqld are 6.0G and 4.5G. Tests were repeated with rocksdb_merge_combine_read_size set to 128M, 256M, 512M, 1024M, 2048M, 3000M and 4000M.

As the value for rocksdb_merge_combine_read_size gets larger, the increase in RSS approaches that value. When it is set to 4000M, mysqld RSS increases by 4.1G; when it is 128M, mysqld RSS increases by ~300M. A rough planning sketch follows the tables below.

Legend:
* mcrs - rocksdb_merge_combine_read_size in MB
* vsz - peak mysqld VSZ in GB during the create index
* rss - peak mysqld RSS in GB during the create index

--- Peak RSS and VSZ during create index

mcrs    vsz     rss
 128     7.1    4.8
 256     7.1    4.9
 512     8.1    5.2
1024     8.1    5.7
2048     9.3    6.7
3000    10.8    7.6
4000    12.0    8.6

--- Diffs

Computed as:
  (peak value during benchmark step) - (value after full scan)

mcrs    vsz     rss
 128    1.1     0.3
 256    1.1     0.4
 512    2.1     0.7
1024    2.1     1.2
2048    3.3     2.2
3000    4.8     3.1
4000    6.0     4.1
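A rough planning sketch from these measurements: assume RSS grows by about max(~0.3G, rocksdb_merge_combine_read_size) per concurrent ALTER TABLE ... ADD INDEX. This is my approximation of the trend above, not a documented formula, and the 4-client small-server result was higher than it predicts, so leave headroom.

def est_peak_rss_gb(rss_after_scan_gb, concurrent_alters, mcrs_gb):
    # very rough lower-bound estimate of peak mysqld RSS during create index
    return rss_after_scan_gb + concurrent_alters * max(0.3, mcrs_gb)

print(est_peak_rss_gb(4.5, 1, 4.0))    # big server, mcrs=4000M -> 8.5 (observed ~8.6)
print(est_peak_rss_gb(4.5, 1, 0.128))  # big server, mcrs=128M  -> 4.8 (observed ~4.8)
print(est_peak_rss_gb(1.5, 4, 0.128))  # small server, 4 clients -> 2.7 (observed ~3.8)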

Thursday, June 1, 2023

Tuning Postgres and the new insert benchmark, round 3

This has results for the insert benchmark on a medium server when Postgres caches the database. The previous post is for a similar setup except the database was cached by the OS but not by Postgres. 

tl;dr

  • the base config I have been using is good once I tuned autovacuum to be more aggressive
  • the big improvement is from adding autovacuum_vacuum_scale_factor=0.05 and autovacuum_vacuum_insert_scale_factor=0.05
  • adding wal_compression=lz4 has a small benefit
  • several of the changes made performance much worse
  • increasing autovacuum_work_mem didn't help here but will probably help on IO-bound tests

Benchmark

This continues to use Postgres 15.3, a c2-standard-30 server with 15 cores and the insert benchmark with delete_per_insert enabled to avoid growing the table so that write-heavy benchmark steps can run for a long time while keeping the working set in memory.

The configurations all use shared_buffers=70G and the database is cached by Postgres. All of the config files are here. The base config file is cx7a_gcp_c2s30 and then the other config files make a small change to the base. The biggest win is already in the base config, and is the tuning done to make autovacuum more aggressive (see here). Of the configs, and using abbreviated names:

  • one of them has a significant benefit: a16
  • a few of them have a small benefit: a1, a2, a6, a9
  • many of them hurt performance:  a5, a7, a10, a11, a12, a13, a14, a15, a17
  • a few of them can be ignored: a3, a4, a8
The configurations are:

  • conf.diff.cx7a1_gcp_c2s30 (a1) - adds wal_compression=lz4
  • conf.diff.cx7a2_gcp_c2s30 (a2) - adds backend_flush_after=256kb
  • conf.diff.cx7a3_gcp_c2s30 (a3) - adds full_page_writes=off (not for production!)
  • conf.diff.cx7a4_gcp_c2s30 (a4) - adds checkpoint_timeout=1800s, which was already set to 1800s in the base config so this can be ignored
  • conf.diff.cx7a5_gcp_c2s30 (a5) - adds min_wal_size=16384
  • conf.diff.cx7a6_gcp_c2s30 (a6) - adds backend_flush_after=256kb, initdb --wal-segsize 256
  • conf.diff.cx7a7_gcp_c2s30 (a7) - adds max_wal_size=16GB
  • conf.diff.cx7a8_gcp_c2s30 (a8) - adds shared_buffers=8GB, ignore this, I was just curious about the benefit of doubling shared buffers from the base config
  • conf.diff.cx7a9_gcp_c2s30 (a9) - adds checkpoint_timeout=900s
  • conf.diff.cx7a10_gcp_c2s30 (a10) - adds autovacuum_work_mem=256MB
  • conf.diff.cx7a11_gcp_c2s30 (a11) - adds autovacuum_work_mem=1GB
  • conf.diff.cx7a12_gcp_c2s30 (a12) - adds wal_buffers=64MB
  • conf.diff.cx7a13_gcp_c2s30 (a13) - adds wal_buffers=64MB, initdb --wal-segsize 256
  • conf.diff.cx7a14_gcp_c2s30 (a14) - adds backend_flush_after=1MB
  • conf.diff.cx7a15_gcp_c2s30 (a15) - adds max_wal_size=32GB
  • conf.diff.cx7a16_gcp_c2s30 (a16) - adds autovacuum_vacuum_scale_factor=0.05, autovacuum_vacuum_insert_scale_factor=0.05
  • conf.diff.cx7a17_gcp_c2s30 (a17) - reduces bgwriter_lru_maxpages from 5000 to 1000
Results

I start by comparing the impact of the config changes with the results for the base config.

The summary is here. Relative to the base config the a16 config provides the largest improvement to average throughput for l.i1 (from 154202 to 249299 inserts/s). Some configs made it much worse than the base case: a5, a7, a8, a10, a11, a12, a13, a14, a15, a17.

Next I consider the impact on worst-case response time for inserts (see here). Some configs made it worse than the base case: a7, a10, a12, a13.

Finally, I look at the charts for per-second insert rates vs time (see here). The a16 config has the best average throughput and the chart for a16 is a bit different than the others.

The conclusion is that I should definitely use a16 and probably use a1, a2, a6 and a9 for this workload.