Small Datum: 2020

Wednesday, December 30, 2020

Sysbench: IO-bound InnoDB in MySQL 5.6, 5.7 and 8.0

This post has results for InnoDB in MySQL 5.6, 5.7 and 8.0 using an IO-bound & low-concurrency workload via sysbench. The setup is the same as I described for MyRocks and I wrote about in-memory sysbench + InnoDB in a previous post. The goal is to understand where performance changes across releases.

I will repeat a disclaimer from my previous post, these results are from microbenchmarks. They are useful but be careful to assume these results will reproduce for your workload, especially if your queries are more complex than those used by sysbench.

Summary:

The regressions here are not as bad as in the in-memory workload. In many tests below the QPS in 8.0 is better than 5.6 because the new CPU overheads are less significant for IO-bound.
I hope to repeat some of these tests to explain the odd results.
Covered point queries with a secondary index use too much CPU and IO in 8.0. Covered range queries also have regressions in 8.0. But versions of those queries that don't use a covering index don't have a problem in 8.0 for IO-bound, while they did for in-memory.
Some of the write-heavy tests are significantly faster in 8.0

Overview

I tested MySQL versions 5.6.49, 5.7.31, and 8.0.17 through 8.0.22. As much as possible, I used the same my.cnf options for all versions. The my.cnf files are in the per-version directories in github and here is my.cnf for 8.0.22. I used the my.cnf options that were good for the insert benchmark (the cx6d configuration). For 8.0 ssl is disabled and the same charset/collation is used for all versions. Fsync is not done on commit to avoid becoming an fsync latency benchmark.

The in-memory workload used 1 table with 10M rows and each test ran for 90 seconds at 1, 2, 3 and 4 threads. The IO-bound workload used 1 table with 200M rows and each test ran for 300 seconds.

Results

The tests are in 4 groups based on the sequence in which they are run: read-only run before write-heavy, write-heavy, read-only run after write-heavy and insert/delete.

I have scripts that generate 3 summaries -- absolute throughput, relative throughput and HW efficiency. Absolute throughput is the QPS/TPS that a test gets for 1, 2, 3 and 4 clients. Relative throughput is the QPS/TPS relative to the base case (MySQL 5.6.49). The HW efficiency report has absolute and relative results (base case is MySQL 5.6.49) for CPU and IO per operation.

I use ratios to explain performance. In this case MySQL 5.6.49 is the denominator and the numerator is the value for 5.7 or 8.0. A QPS ratio < 1 means the new version is slower. For HW efficiency I use CPU/operation and IO/operation (read & write). For CPU and IO per operation a ratio > 1 means the new version uses more CPU or IO per query. Below I only mention IO/operation ratios when 5.6 and 8.0 don't have similar values.

The results are in github for absolute throughput, relative throughput and HW efficiency. I annotate the relative throughput and HW efficiency results below. Unless called out, I am explaining the results for 8.0.22 relative to 5.6.49.

Read-only before write-heavy:

QPS ratios are 1.02, 1.04, 1.10, 1.14 for the first 4 tests (up to here)

These do point queries
CPU/query ratios are: 0.96, 0.93, 0.79, 0.76. Up to here.
The regressions from bug 102037 for the in-memory workload are less of an issue here but the QPS ratio for random-points with range 1000 drops from 1.29 in 8.0.21 to 1.14 here.

QPS ratios are 1.14, 1.18, 1.36 for the next 3 tests (here to here)

These have the range scans from oltp_read_write.lua with ranges of size 10, 100 & 10,000
CPU/query ratios are 0.89, 0.83, 0.67 (here to here). Long scans are better in 8.0 vs 5.6.

QPS ratios are 1.09, 1.07 for the next 2 tests (here to here)

These do point queries via in-lists that are covering and then not covering for the primary index.
CPU/query ratios are 0.85, 0.88 (here to here).

QPS ratios are 0.50, 1.00 for the next 2 tests (here to here)

These are similar to the previous test, but use the secondary index.
CPU/query ratios are 1.95, 0.97 while read IO/query ratios are 2.01, 1.00 (here to here). For the in-memory result the regression was ~2X for both covered and non-covered while here it only occurs for covered. The regression is in 5.7 and 8.0 but only for the covered query while it occurs for both the covered and non-covered queries with the in-memory workload.

QPS ratios are 1.12, 1.25 for the next 2 tests (here to here)

These do range queries that are covering and then not covering for the primary index
CPU/query ratios are 0.88, 0.78 (here to here)

QPS ratios are 0.79, 1.00 for the next 2 tests (here to here)

These are similar to the previous test, but use the secondary index.
CPU/query ratios are 1.29, 0.95 while read IO/query ratios are 1.27 and 1.00 (here to here)
The in-memory workload has regression for both covered and non-covered, while only covered has a regression here. This is similar to the result above for point-queries using covered and non-covered queries.

Write-heavy

QPS ratios are 1.07, 9.50, 2.04, 0.91, 1.19 for the next 5 tests (here to here)

These are update-only
CPU/statement ratios are 0.87, 0.25, 0.79, 1.94, 0.83. Read IO/query ratios are 1.00, 0.44, 0.95, 1.00, 0.97 (here to here).
The improvement for the 2nd and 3rd tests (update-index, update-nonindex) started in 5.7 and improved more in 8.0.

QPS ratio is 2.37 for the next test, write-only. See here.

This has the writes from oltp_read_write.lua.
CPU/transaction ratio is 0.45 and read IO/query ratio is 0.76. See here.

QPS ratios are 1.18, 0.91 for the next two tests, read-write (here to here)

These are the traditional sysbench tests (oltp_read_write.lua) with ranges of size 10 and 100
CPU/transaction ratios are 0.91, 1.16 (here to here)

Read-only after write-heavy includes tests that were run before write-heavy:

QPS ratio is 0.92, 14.12, 1.36 for the next 3 tests, read-only (here to here)

These have the queries (range scans) from oltp_read_write.lua with ranges of size 10, 100 and 10,000. Results here are similar to Read-only before write-heavy and 8.0 is better than 5.6 at long range scans.
CPU/transaction ratios are 0.79, 0.17, 0.67 (here to here)
The huge speedup (14.12 QPS ratio) for the second test is confusing. It only happens for the 1-thread test and the QPS ratio for 2, 3, and 4 threads is ~2. The problem appears to be CPU/query (5.6 does ~5X more) while IO/query is OK.

QPS ratios are 2.04, 1.71, 4.62, 1.14, 0.50 for the next 5 tests (here to here)

These do a variety of point queries. The first 4 were run in Read-only before write heavy, and results here are similar.
CPU/query ratios are 0.51, 0.70, 0.32, 0.90, 1.89 (here to here) and that explains most of the differences.
The hot-points test doesn't do IO reads as the working set is small. So it still suffers from bug 102037 which is new in 8.0.22

QPS ratios are 1.06, 1.05 for the next 2 tests (here to here)

These do point queries via in-lists that are covering and then not covering for the primary index. Results here are similar to Read-only before write-heavy.
CPU/query ratios are 0.87, 0.90 (here to here)

QPS ratios are 0.50, 1.00 for the next 2 tests (here to here)

These are similar to the previous test, but use the secondary index.
CPU/query ratios are 1.93, 0.97 while read IO/query ratios are 2.01, 1.00 (here to here). Results here are similar to Read-only before write-heavy.

QPS ratios are 1.16, 1.24 for the next 2 tests (here to here)

These do range queries that are covering and then not covering for the primary index
Results here are similar to Read-only before write-heavy.
CPU/query ratios are 0.84, 0.80 (here to here)

QPS ratios are 0.78, 1.03 for the next 2 tests (here to here)

These are similar to the previous test, but use the secondary index. Results here are similar to Read-only before write-heavy.
CPU/query ratios are 1.31, 0.94 and read IO/query ratios are 1.29 and 1.00 (here to here)

Insert/delete

QPS ratio is 1.80 for the delete test and 0.83 for the insert test
CPU/statement ratio is 0.86 for delete and 2.62 for insert
Read IO/statement ratio is 0.88 for delete and 234.69 for insert. For insert, the read IO ratio was ~2 for 5.7, then grew to ~5 in 8.0.19, then to ~16 for 8.0.20, then to ~200 for 8.0.21 & 8.0.22.

Sysbench: in-memory InnoDB in MySQL 5.6, 5.7 and 8.0

This post has results for InnoDB in MySQL 5.6, 5.7 and 8.0 using an in-memory & low-concurrency workload via sysbench. The setup is the same as I described for MyRocks. The goal is to understand where performance changes across releases and the in-memory workloads make it easier to spot CPU overhead.

I will repeat a disclaimer from my previous post, these results are from microbenchmarks. They are useful but be careful to assume these results will reproduce for your workload, especially if your queries are more complex than those used by sysbench. Regardless, it is good for users and the environment to be careful about CPU efficiency in new versions of a popular DBMS.

Updated results with improved my.cnf options are here.

Summary:

Things that might motivate a repeat

Change my scripts so that InnoDB writeback is done before starting the tests.
Set innodb_purge_threads=1 as the default value (4) caused mutex contention in the past on my NUC servers that have 4 CPU cores.
Debug whether the new InnoDB redo log code wastes CPU on my small servers. It did in early 8.0 versions.

Most of the performance regression from 5.6 to 8.0 is from code above the storage engine layer
In most cases, 8.0 gets between 70% and 80% of the throughput vs 5.6
8.0 is better than 5.6 at longer range scans but worse at shorter ones
8.0 is a lot worse for point-queries on secondary indexes (5.6 gets 2X more QPS). Most of this arrived in 5.7, but 8.0 made it worse.
Filed bug 102037 because optimizer overhead is too large with large in-list SELECTs

Overview

Results

The tests are in 4 groups based on the sequence in which they are run: read-only run before write-heavy, write-heavy, read-only run after write-heavy and insert/delete.

Read-only before write-heavy:

QPS ratios are 0.76, 0.71, 0.59, 0.24 for the first 4 tests (up to here)

These do point queries
CPU/query ratios are: 1.36, 1.44, 1.78, 4.08. Up to here.
The large regressions (0.59, 0.24) are from bug 102037 which is new in 8.0.22. For 8.0.21 the QPS ratios are 0.74, 0.70, 0.74, 0.89 and CPU/query ratios are 1.46, 1.51, 1.44, 1.16.
MySQL 5.6 was able to finish InnoDB write-back prior to these tests but it was still in progress for 5.7 until the fourth test (above) and for 8.0 until the update-only tests (below). I assume that 5.6 is more willing to do furious flushing while 5.7 and 8.0 are not. That is neither good nor bad but makes it harder to tune across versions. Given that writeback is in progress for 8.0 the CPU overhead in the Read-only before write-heavy section will be overstated for 8.0. Fortunately the results in the Read-only after write-heavy section should still be truthy.

QPS ratios are 0.74, 1.03, 1.26 for the next 3 tests (here to here)

These have the range scans from oltp_read_write.lua with ranges of size 10, 100 & 10,000
CPU/query ratios are 1.38, 1.02, 0.83 (here to here). Long scans are better in 8.0 vs 5.6, but worse in 8.0 vs 5.7.

QPS ratios are 0.71, 0.71 for the next 2 tests (here to here)

These do point queries via in-lists that are covering and then not covering for the primary index.
CPU/query ratios are 1.45, 1.44 (here to here).

QPS ratios are 0.47, 0.49 for the next 2 tests (here to here)

These are similar to the previous test, but use the secondary index. The regression here is larger than for the PK index and CPU/query is the problem. The regression is in 5.7 and 8.0.
CPU/query ratios are 2.06, 1.98 (here to here)

QPS ratios are 0.74, 0.73 for the next 2 tests (here to here)

These do range queries that are covering and then not covering for the primary index
CPU/query ratios are 1.38, 1.39 (here to here)

QPS ratios are 0.75, 0.67 for the next 2 tests (here to here)

These are similar to the previous test, but use the secondary index.
CPU/query ratios are 1.35, 1.52 (here to here)

Write-heavy

QPS ratios are 0.99, 1.61, 0.75, 0.70, 0.74 for the next 5 tests (here to here)

These are update-only
CPU/statement ratios are 2.01, 1.43, 2.08, 2.22, 2.21 (here to here).
8.0 is faster than 5.6 for update-index (QPS ratio is 1.61), but slower in the other tests except for update-inlist. 8.0.17 had problems but perf has improved since then. The wKB/o results would be easier to explain if each test ran longer than 90 seconds, otherwise the writes done during one test might have been triggered by a previous test. 8.0 is almost as fast as 5.6 (QPS ratio is 0.99) for update-inlist. I assume because the in-list is large enough that the overhead is dominated by the storage engine (and not code above the engine).

QPS ratio is 0.88 for the next test, write-only. See here.

This has the writes from oltp_read_write.lua.
CPU/transaction ratio is 1.94. See here.

QPS ratios are 0.73, 0.85 for the next two tests, read-write (here to here)

These are the traditional sysbench tests (oltp_read_write.lua) with ranges of size 10 and 100
CPU/transaction ratios are 1.32, 1.22 (here to here)

Read-only after write-heavy includes tests that were run before write-heavy:

QPS ratio is 0.77, 1.02, 1.24 for the next 3 tests, read-only (here to here)

These have the queries (range scans) from oltp_read_write.lua with ranges of size 10, 100 and 10,000. Results here are similar to Read-only before write-heavy and 8.0 is better than 5.6 at long range scans, but worse at shorter ones.
CPU/transaction ratios are 1.21, 1.03, 0.85 (here to here)

QPS ratios are 0.77, 0.71, 0.59, 0.24, 0.49 for the next 5 tests (here to here)

These do a variety of point queries. The first 4 were run in Read-only before write heavy, and results here are similar.
CPU/query ratios are 1.34, 1.44, 1.78, 4.07, 1.95 (here to here)
The large regressions in QPS ratios (0.59, 0.24, 0.49) are from bug 102037 which is new in 8.0.22. For 8.0.21 the QPS ratios are 0.74, 0.71, 0.74, 0.90, 0.62 and the CPU/query ratios are 1.43, 1.50, 1.45, 1.15, 1.57.

QPS ratios are 0.71, 0.71 for the next 2 tests (here to here)

These do point queries via in-lists that are covering and then not covering for the primary index. Results here are similar to Read-only before write-heavy.
CPU/query ratios are 1.45, 1.45 (here to here)

QPS ratios are 0.48, 0.50 for the next 2 tests (here to here)

These are similar to the previous test, but use the secondary index. Results here are similar to Read-only before write-heavy.
CPU/query ratios are 2.05, 1.97 (here to here).

QPS ratios are 0.73, 0.73 for the next 2 tests (here to here)

These do range queries that are covering and then not covering for the primary index
Results here are similar to Read-only before write-heavy.
CPU/query ratios are 1.42, 1.41 (here to here)

QPS ratios are 0.75, 0.67 for the next 2 tests (here to here)

These are similar to the previous test, but use the secondary index. Results here are similar to Read-only before write-heavy.
CPU/query ratios are 1.37, 1.51 (here to here)

Insert/delete

QPS ratio is 0.69 for the delete test and 0.70 for the insert test
CPU/statement ratio is 2.32 for delete and 2.62 for insert

Tuesday, December 29, 2020

Sysbench: IO-bound MyRocks in 5.6 vs 8.0

This post has results for an IO-bound workload. The previous post has results for an in-memory workload.

Summary:

Many of the regressions are explained by new CPU overhead. But there is also variance that is likely caused by stalls from RocksDB and my scripts don't do a good job of showing that.
In most cases, MyRocks in 8.0.17 gets between 80% and 100% of the throughput vs 5.6.35

Overview

Things that are different for this benchmark from the previous post:

The test table has 200M rows rather than 10M. Tests here are likely to be IO-bound.
Each test step is run for 300 seconds rather than 90.

The IO-bound results and shell scripts used to run the tests are in github.

Results

The tests are in 4 groups based on the sequence in which they are run: read-only run before write-heavy, write-heavy, read-only run after write-heavy and insert/delete.

The summaries focus on relative throughput and HW efficiency for the test with 1 thread although tests were run for 1, 2, 3 and 4 threads.

I use ratios to explain performance. In this case MyRocks in 5.6.35 is the denominator and 8.0.17 is the numerator. A QPS ratio < 1 means that 8.0.17 is slower. For HW efficiency I use CPU/operation and IO/operation (read & write). For CPU and IO per operation a ratio > 1 means that 8.0.17 uses more CPU or IO per query. Below I only mention IO/operation ratios when 5.6 and 8.0 don't have similar values.

Read-only before write-heavy:

QPS ratios are 0.95, 0.92, 0.91, 0.91 for the first 4 tests (up to here)

These do point queries
CPU/query ratios are: 1.12, 1.23, 1.27, 1.19. See here to here

QPS ratios are 0.97, 0.95, 1.09 for the next 3 tests (here to here)

These have the range scans from oltp_read_write.lua with ranges of size 10, 100 & 10,000
CPU/query ratios are 1.08, 1.10, 0.93 (here to here). Long scans are better in 8.0.17.

QPS ratios are 0.97, 0.96 for the next 2 tests (here to here)

These do point queries via in-lists that are covering and then not covering for the primary index.
CPU/query ratios are 1.10, 1.13 (here to here).

QPS ratios are 0.74, 0.98 for the next 2 tests (here to here)

These are similar to the previous test, but use the secondary index. The regression in 8.0.17 is larger here than for the tests that use the primary key index and CPU/query is the problem.
CPU/query ratios are 1.31, 1.08 (here to here)
read IO/query ratios are 1.35, 1.01. I don't know why it is 1.35 rather than closer to 1.

QPS ratios are 1.03, 1.04 for the next 2 tests (here to here)

These do range queries that are covering and then not covering for the primary index
CPU/query ratios are 0.99, 0.97 (here to here)

QPS ratios are 1.01, 0.93 for the next 2 tests (here to here)

These are similar to the previous test, but use the secondary index.
CPU/query ratios are 1.01, 1.17 (here to here)

Write-heavy

QPS ratios are 0.95, 0.84, 1.06, 0.80, 0.76 for the next 5 tests (here to here)

These are update-only
CPU/statement ratios are 1.13, 1.23, 0.90, 1.27, 1.25 (here to here).
Read and write IO/operation ratios are better and worse in 8.0.17 depending on the test

QPS ratio is 1.00 for the next test, write-only. See here.

This has the writes from oltp_read_write.lua.
CPU/transaction ratio is 1.10. See here.

QPS ratios are 0.30, 1.47 for the next two tests, read-write (here to here)

These are the traditional sysbench tests (oltp_read_write.lua) with ranges of size 10 and 100. This result is odd and it would be great to explain it.
CPU/transaction ratios are 3.11, 0.79 (here to here)

Read-only after write-heavy includes tests that were run before write-heavy:

QPS ratio is 1.06, 1.11, 1.10 for the next 3 tests, read-only (here to here)

These have the range scans from oltp_read_write.lua with ranges of size 10, 100 and 10,000. These were also run before write-heavy, results here are slightly worse.
CPU/transaction ratios are 0.96, 0.84, 0.94 (here to here)

QPS ratios are 1.93, 1.03, 1.01, 1.10, 1.02 for the next 5 tests (here to here)

These do a variety of point queries. The first 4 were run in Read-only before write heavy, and results here are similar.
CPU/query ratios are 0.56, 1.00, 1.01, 1.00, 0.99 (here to here)
The 1.93 result is an example of the variance I mentioned above. This ratio is for the 1-thread test and the ratios for the 2, 3 and 4 thread tests are close to 1. Something is creating variance. I have to debug this live.

QPS ratios are 0.94, 0.96 for the next 2 tests (here to here)

These do point queries via in-lists that are covering and then not covering for the primary index. These were also run before write-heavy.
CPU/query ratios are 1.17, 1.11 (here to here)

QPS ratios are 1.19, 1.00 for the next 2 tests (here to here)

These are similar to the previous test, but use the secondary index. These were also run before write-heavy.
CPU/query ratios are 2.26, 1.90 (here to here).

QPS ratios are 1.04, 1.10 for the next 2 tests (here to here)

These do range queries that are covering and then not covering for the primary index
These were also run before write-heavy.
CPU/query ratios are 0.97, 0.88 (here to here)

QPS ratios are 0.98, 0.94 for the next 2 tests (here to here)

These are similar to the previous test, but use the secondary index. These were also run before write-heavy.
CPU/query ratios are 1.02, 1.17 (here to here)

Insert/delete

QPS ratio is 0.84 for the delete test and 0.77 for the insert test
CPU/statement ratio is 1.22 for delete and 1.28 for insert

Sysbench: MyRocks in MySQL 5.6 vs 8.0

I revived my sysbench helper scripts and have started to get results for MySQL (InnoDB, MyRocks) and Postgres. The results here are from all_small.sh which uses a few Lua files that are in my sysbench fork but have yet to be contributed. Tests were run on my home NUC servers for in-memory and IO-bound workloads. The results here are limited to the in-memory workloads.

Disclaimer -- the tests here are microbenchmarks. While useful, be careful about assuming that the regressions here will reproduce on your workloads. The in-memory tests help me understand CPU overhead which tends to increase in newer releases.

Summary:

In most cases, MyRocks in 8.0.17 achieves between 75% and 100% of the throughput that it gets in 5.6.35. The regression tended to be larger for write-heavy tests than for read-only.
Many of the regressions are explained by new CPU overhead. But there is also variance that is likely caused by stalls from RocksDB and my scripts don't do a good job of showing that.
I don't understand why MyRocks was reading from storage for some of the tests. That doesn't reproduce with InnoDB and the database should have been cached.
There was one lousy, but still unexplained result -- a test that does point queries via a secondary index was much slower in 8.0.17. It got between 41% and 54% of the 5.6.35 throughput (search for 0.41 below). The problem occurs when the test is run after write-heavy tests. It does not occur when the test is run before the write-heavy tests. I need to debug this live but my test servers are busy.

Overview

The all_small.sh script run a variety of tests in a sequence that is interesting to me. One example is to run the read-only tests before and then after write-heavy tests to understand the impact of fragmentation on query performance. The tests include the traditional result from oltp_read_write.lua, and results from other Lua scripts focused on a single command (point lookup, range query, etc). These are microbenchmarks and the purpose is to spot regressions between releases. MySQL 8 has many improvements and a few regressions (like bug 102037, found via these scripts).

For the in-memory workload there is 1 table with 10M rows and each test is run for 90 seconds. For the IO-bound setup there is 1 table with 200M rows and each test is run for 300 seconds. In both cases tests are repeated for 1, 2, 3 and 4 threads. The test servers have 4 CPU cores with HT disabled. The in-memory table is small enough to fit in RAM, the IO-bound table is not.

Tests used FB MySQL 5.6.35 at git hash 4911e0 and 8.0.17 at git hash cf9dbc. These were latest as of early December 2020. The NUC servers use Ubuntu 20.04 and XFS.

Results

I pushed results to github with:

Result directories for in-memory and IO-bound workloads
Summaries with results for 1, 2, 3 and 4 threads

Absolute QPS by thread count in-memory and IO-bound workloads
Relative QPS by thread count for in-memory and IO-bound workloads. The results are relative to MyRocks in 5.6.35 (> 1.0 is an improvement, < 1.0 is a regression)
Metrics (CPU and IO per operation) for in-memory and IO-bound

my.cnf for 5.6 and 8.0 are as similar as possible. Fsync is not done on commit to keep this from being an fsync benchmark. Compression was not used.
Sysbench output and test scripts. The run.sh script maps the test names that are in the result summaries to the Lua scripts and per-script options.

In-memory

The tests are in 4 groups based on the sequence in which they are run: read-only run before write-heavy, write-heavy, read-only run after write-heavy and insert/delete.

The summaries focus on relative throughput and HW efficiency for the test with 1 thread. The results for more than 1 thread are similar although at 4 threads there were some tests with larger regressions for 8.0 and the likely cause was CPU saturation.

I use ratios to explain performance. In this case MyRocks in 5.6.35 is the denominator and 8.0.17 is the numerator. A QPS ratio < 1 means that 8.0.17 is slower. A HW efficiency metric, CPU/query, ratio > 1 means that 8.0.17 uses more CPU/query. I use CPU/query because it is correlated with QPS, the CPU/query overhead usually increases in new MySQL releases and that causes a reduction in QPS for low-concurrency workloads, especially when the database is cached and queries are simple.

Read-only before write-heavy:

QPS ratios are 0.89, 0.89, 0.90, 0.89 for the first 4 tests (up to here)

These do point queries
CPU/query ratios are: 1.15, 1.14, 1.12, 1.13. See here to here

QPS ratios are 0.94, 1.00, 1.12 for the next 3 tests (here to here)

These have the range scans from oltp_read_write.lua with ranges of size 10, 100 & 10,000
CPU/query ratios are 1.09, 1.03, 0.92. Long scans are better in 8.0.17 while shorter ones are better in 5.6.35. See here to here.

QPS ratios are 0.91, 0.89 for the next 2 tests (here to here)

These do point queries via in-lists that are covering and then not covering for the primary index.
CPU/query ratios are 1.12, 1.15 (here to here)

QPS ratios are 0.82, 0.84 for the next 2 tests (here to here)

These are similar to the previous test, but use the secondary index. The regression in 8.0.17 is larger here than for the tests that use the PK index and CPU/query is the problem.
CPU/query ratios are 1.23, 1.23 (here to here)

QPS ratios are 0.95, 0.96 for the next 2 tests (here to here)

These do range queries that are covering and then not covering for the primary index
CPU/query ratios are 1.09, 1.04 (here to here)

QPS ratios are 0.98, 0.94 for the next 2 tests (here to here)

These are similar to the previous test, but use the secondary index.
CPU/query ratios are 1.04, 1.07 (here to here)

Write-heavy - many of these tests have more read IO than I expected given.

QPS ratios are 0.80, 0.84, 0.81, 0.77, 0.72 for the next 5 tests (here to here)

These are update-only
CPU/statement ratios are 1.40, 1.23, 1.39, 1.37, 1.38 (here to here).
IO/operation ratios are worse for 8.0.17. I don't understand why there are so many reads as the database should be cached. The update-zipf tests were the worst for IO and the IO problem occurs for both 5.6 and 8.0. I assume the problem is that new compaction output isn't in the OS page cache.

QPS ratio is 0.92 for the next test, write-only. See here.

This has the writes from oltp_read_write.lua.
CPU/transaction ratio is 1.17. See here. Read/write IO is also worse.

QPS ratios are 1.01, 0.98 for the next two tests, read-write (here to here)

These are the traditional sysbench tests (oltp_read_write.lua) with ranges of size 10 and 100
CPU/transaction ratios are 1.05, 1.06 (here to here)
IO/operation results confuse me as the test table should be cached (here to here)

Read-only after write-heavy includes tests that were run before write-heavy:

QPS ratio is 0.98, 0.94, 0.98 for the next 3 tests, read-only (here to here)

These have the range scans from oltp_read_write.lua with ranges of size 10, 100 and 10,000. These were also run before write-heavy, results here are slightly worse.
CPU/transaction ratios are 1.06, 1.09, 1.04 (here to here)

QPS ratios are 0.90, 0.88. 0.85, 0.85, 0.81 for the next 5 tests (here to here)

These do a variety of point queries. The first 4 were run in Read-only before write heavy, and results here are similar.
CPU/query ratios are 1.14, 1.15, 1.18, 1.18, 1.25 (here to here)

QPS ratios are 0.89, 0.87 for the next 2 tests (here to here)

These do point queries via in-lists that are covering and then not covering for the primary index. These were also run before write-heavy. Results here are slightly worse.
CPU/query ratios are 1.14, 1.16 (here to here)

QPS ratios are 0.41, 0.54 for the next 2 tests (here to here)

These are similar to the previous test, but use the secondary index. These were also run before write-heavy. The results here show a serious regression for 8.0.17.
CPU/query ratios are 2.26, 1.90 (here to here). This starts to explain the problem but I need stack traces and/or CPU profiles to really explain it.

QPS ratios are 0.74, 0.76 for the next 2 tests (here to here)

These do range queries that are covering and then not covering for the primary index
These were also run before write-heavy.
CPU/query ratios are 1.31, 1.27 (here to here)

QPS ratios are 0.80, 0.74 for the next 2 tests (here to here)

These are similar to the previous test, but use the secondary index. These were also run before write-heavy.
CPU/query ratios are 1.25, 1.25 (here to here)

Insert/delete

QPS ratio is 0.78 for the delete test and 0.72 for the insert test
CPU/statement ratio is 1.27 for delete and 1.49 for insert

Friday, December 4, 2020

Tuning for the Insert Benchmark: Postgres, InnoDB & MyRocks

I used the Insert Benchmark and a small server to compare performance across many (11, 18 & 17) different configurations for Postgres, InnoDB and MyRocks. The goal is to determine which options matter for this scenario (workload + HW). The goal isn't to compare Postgres vs InnoDB vs MyRocks.

Results are here for Postgres, InnoDB and MyRocks. Conclusions are below. Some of these are specific to the workload + HW, so be careful about applying my findings.

For InnoDB

It is hard to improve on innodb_dedicated_server. This is great news for deployments as there is less to tune. I hope the perf test community does more to confirm that across more workloads and HW.
Performance with the minimal (default + a few things) configuration is lousy. The solution is to set innodb_dedicated_server.
Performance with 4kb and 8kb pages is lousy except for QPS with more IO-bound database size where it is better. I don't understand the CPU overhead when it is lousy.
QPS with the more IO-bound database is much worse than MyRocks and Postgres. I don't know why.

For MyRocks

For the in-memory and less IO-bound databases the configurations didn't matter as perf was good for all. For the more IO-bound database all configurations except the minimal were good.
Tuning MyRocks was easy

For Postgres

Vacuum is needed, even for insert-only workloads, to set visibility map bits
wal_compression=on was bad for perf in many cases. Is PGLZ overhead too large?
Using a larger WAL helps by reducing checkpoint IO. This is standard advice.

A next step is to repeat the experiment with bigger HW and/or more concurrency. I want to try running iibench.py with PyPy and probably should upgrade the Python connector for MySQL.

Tuning for the Insert Benchmark: MyRocks

This presents performance and efficiency results for MyRocks on the Insert Benchmark using MySQL 5.6.35 and a small server with 4 CPU cores, 16G of RAM and a fast SSD. The goal is to understand the impact of my.cnf options and I have results for 17 different configurations.

Summary

I use scripts to generate the performance reports. The reports are here for the in-memory, less IO-bound and more IO-bound database sizes. My summary of the results is below. Some of the conclusions are specific to this workload and HW so be careful about applying this advice.

A summary of the results is below. This one is easy.

Configurations have little impact on throughput for the in-memory and less IO-bound databases
Throughput is lousy with the minimal configuration for the more IO-bound database size

My focus in on throughput. There are larger impacts on HW efficiency metrics that I ignore. The impact of configuration should be more interesting for the benchmark run on larger HW with more concurrency, but that is for a future post.

Tests

The insert benchmark was run for three database sizes that I call in-memory, less IO-bound and more IO-bound. The database was cached in memory for the in-memory size but not for the others. The initial load size was 20M rows for in-memory, 100M rows for less IO-bound and 500M rows for more IO-bound. Additional rows were inserted after the initial load, but most of the inserts were done by the initial load.

Tests were run at most 2 threads/connections -- one for inserts and another for queries -- and I consider that to be 1 insert benchmark client.

The benchmark has several steps as described here -- initial load, create 3 secondary indexes, load more data, read+write. Each read+write step is run for 30 minutes with a query thread that runs as fast as possible and a writer that that is rate limited to 100, 100, 200, 200, 400, 400, 600, 600, 800, 800, 1000 and 1000 inserts/s. The load more data step inserts 20M, 10M and 10M more rows for the in-memory, less IO-bound and more IO-bound workloads.

In the performance reports, the steps are named:

l.i0 - initial load
l.x - create 3 secondary indexes
l.i1 - load more data
q100.2, q200.2, ..., q1000.2 - read+write where the insert rate limit is 100, 200, ..., 1000 and the .2 means the second run at each rate. The reports focus on the second run.

Configurations

All tests were done with fsync-on-commit disabled to avoid becoming an fsync latency benchmark. The option names below leave off rocksdb_ to save space. I also leave off my.cnf. when naming config files.

my.cnf.cx0 - minimal configuration
my.cnf.cx1 - cx0 + block_cache_size=10G
my.cnf.cx3 - cx1 +

max_background_jobs=2, default_cf_options=level_compaction_dynamic_level_bytes=true

my.cnf.cx4a - cx3 +

default_cf_options += block_based_table_factory= {cache_index_and_filter_blocks=1}

my.cnf.cx4b - cx3 +

default_cf_options += block_based_table_factory= {cache_index_and_filter_blocks=1; filter_policy=bloomfilter:10: false; whole_key_filtering=1}; optimize_filters_for_hits=true

my.cnf.cx5a - cx4b +

default_cf_options += max_write_buffer_number=4; write_buffer_size=64m; max_bytes_for_level_base=256m

my.cnf.cx5b - cx4b +

default_cf_options += max_write_buffer_number=4; write_buffer_size=128m; max_bytes_for_level_base=512m

my.cnf.cx5c - cx4b +

default_cf_options += max_write_buffer_number=4; write_buffer_size=256m; max_bytes_for_level_base=1024m

my.cnf.cx6a - cx5c +

default_cf_options += compression_per_level=kNoCompression

my.cnf.cx6b - cx5c +

default_cf_options += compression_per_level=kNoCompression; bottommost_compression=kLZ4Compression

my.cnf.cx6c - cx5c +

default_cf_options += compression_per_level=kNoCompression; bottommost_compression=kZSTD

my.cnf.cx8 - cx6b + default_cf_options += format_version=5
my.cnf.cx9a - cx6b + block_size=8192
my.cnf.cx9b - cx6b + block_size=16384
my.cnf.cx10 - cx6b + bytes_per_sync=1M, wal_bytes_per_sync=1M
my.cnf.cx11 - cx6b + delayed_write_rate=32M
my.cnf.cx12 - cx6b + max_open_files=-1

Tuning for the Insert Benchmark: Postgres

This presents performance and efficiency results for Postgres on the Insert Benchmark using release 12.3 and a small server with 4 CPU cores, 16G of RAM and a fast SSD. The goal is to understand the impact of my.cnf options and I have results for 11 different configurations.

Summary

I didn't try as many configurations for Postgres as I did for InnoDB and MyRocks. I am not sure that is because there is less tuning opportunity in Postgres. Notice that I didn't try to tune vacuum, which has many options, and instead my test scripts did a manual non-blocking vacuum after each test step. While the only writes in this test are inserts, vacuum is still needed to set bits in the visibility map. I think Postgres can be more clever about setting those bits after inserts.

A summary of the results is below.

With wal_compression=on, see configs cx2 through cx4b, performance drops for create index (l.x) and load more data (l.i1). The drop was larger than I expected. The compression method is PGLZ and I wonder if lz4 would help. The benefit from WAL compression should be fewer checkpoints and less checkpoint IO. See the summaries for in-memory, less and more IO-bound. For create index (l.x) the cpu overhead (cpupq) doubles from 4 to 8 (or 9) when WAL compression is enabled from cx1 to cx2 for in-memory, less and more IO-bound but there isn't a significant reduction in KB written per row (wkbpi). The impact on CPU overhead (cpupq) is even larger for the load more data (l.i1) step for in-memory, less and more IO-bound.
With a small WAL, see max_wal_size not set for cx2 and =2G for cx3a, performance suffers during load more data (l.i1). This is expected as a larger WAL means fewer checkpoints and less checkpoint IO and KB written to storage per insert (wkbpi) is lower with a larger WAL for in-memory, less and more IO-bound. The benefit is smaller for more IO-bound because page writeback there is triggered more from buffer pool pressure than from checkpoint.
QPS during the read+write tests didn't vary much with different configurations.

There is much detail in the reports that I mostly ignore, including response time histograms and the HW efficiency metrics. But they can be used to explain the results and answer questions.

Tests

Tests were run at most 2 threads/connections -- one for inserts and another for queries -- and I consider that to be 1 insert benchmark client.

In the performance reports, the steps are named:

l.i0 - initial load
l.x - create 3 secondary indexes
l.i1 - load more data
q100.2, q200.2, ..., q1000.2 - read+write where the insert rate limit is 100, 200, ..., 1000 and the .2 means the second run at each rate. The reports focus on the second run.

Configurations

All tests were done with fsync-on-commit disabled to avoid becoming an fsync latency benchmark. To save space. I also leave off conf.diff. when naming config files.

conf.diff.cx0 - minimal configuration
conf.diff.cx1 - cx0 + shared_buffers=10G
conf.diff.cx2 - cx1 + wal_compression=on
conf.diff.cx3a - cx2 + max_wal_size=2G
conf.diff.cx3b - cx2 + max_wal_size=10G
conf.diff.cx3c - cx2 + max_wal_size=20G
conf.diff.cx3d - cx2 + max_wal_size=40G
conf.diff.cx4a - cx3c + bgwriter_lru_maxpages=2000, bgwriter_delay=100ms
conf.diff.cx4b - cx3c + bgwriter_lru_maxpages=1000, bgwriter_delay=50ms
conf.diff.cx5 - cx4a + wal_compression=off
conf.diff.cx6 - cx4a + shared_buffers=2G

Tuning for the Insert Benchmark: InnoDB

This presents performance and efficiency results for InnoDB on the Insert Benchmark using MySQL 8.0.21 and a small server with 4 CPU cores, 16G of RAM and a fast SSD. The goal is to understand the impact of my.cnf options and I have results for 18 different configurations.

Summary

I use scripts to generate the performance reports. The reports are here for the in-memory, less IO-bound and more IO-bound database sizes. Some of the conclusions are specific to this workload and HW so be careful about applying this advice, but I suspect that innodb_dedicated_server should be used in most cases. A summary of the results is below.

It is hard to improve on innodb_dedicated_server. Much of the benefit for innodb_dedicated_server comes from using a large enough value for innodb_log_files_in_group. See the summaries for the in-memory, less IO-bound and more IO-bound database sizes.
The minimal configuration, my.cnf.cx0, is lousy. This isn't a big deal because it is trivial to add innodb_dedicated_server=ON to my.cnf.
The configs that use 8kb and 4kb for innodb_page_size, my.cnf.cx10a and my.cnf.cx10b, have poor create index performance (see l.x) for the in-memory and less IO-bound database sizes. I can't explain that but the HW efficiency metrics are worse for the context switch rate (cspq) and CPU overhead (cpupq).
Query throughput (qps) with the more IO-bound database was best for the configs that use 8kb and 4kb page sizes, cx10a and cx10b. The reason is less storage read IO, see rpq.
Setting innodb_buffer_pool_instances=1 also hurts the create index performance as it increases the CPU overhead (cpupq) for the in-memory and less IO-bound database sizes.
While my purpose isn't to compare InnoDB, MyRocks and Postgres the QPS results for InnoDB on the more IO-bound (500M rows) database size are odd. InnoDB does ~4 disk reads per query (see rpq) and is therefore limited to ~650 QPS per client. For Postgres, QPS is much better while rpq is close to zero.

There is much detail in the reports that I mostly ignore, including response time histograms and the HW efficiency metrics. But they can be used to explain the results and answer questions.

Tests

Tests were run at most 2 threads/connections -- one for inserts and another for queries -- and I consider that to be 1 insert benchmark client.

In the performance reports, the steps are named:

l.i0 - initial load
l.x - create 3 secondary indexes
l.i1 - load more data
q100.2, q200.2, ..., q1000.2 - read+write where the insert rate limit is 100, 200, ..., 1000 and the .2 means the second run at each rate. The reports focus on the second run.

Configurations

All tests were done with fsync-on-commit disabled to avoid becoming an fsync latency benchmark. The option names below leave off innodb_ to save space. I also leave off my.cnf. when naming config files.

my.cnf.cx0 - minimal configuration
my.cnf.cx1 - my.cnf.cx0 + dedicated_server=ON, buffer_pool_size=10G to cx0
my.cnf.cx2a - cx1 + log_files_in_group=20
my.cnf.cx2b - cx1 + log_files_in_group=40
my.cnf.cx2c - cx1 + log_files_in_group=2
my.cnf.cx6a - cx2b + idle_flush_pct=1, io_capacity=200, io_capacity_max=2000
my.cnf.cx6b - cx2b + idle_flush_pct=1, io_capacity=2500, io_capacity_max=5000
my.cnf.cx6c - cx2b + idle_flush_pct=1, io_capacity=5000, io_capacity_max=10000
my.cnf.cx6d - cx2b + idle_flush_pct=1, io_capacity=1000, io_capacity_max=2000
my.cnf.cx8 - cx6d + adaptive_hash_index=OFF
my.cnf.cx10a - cx6d + page_size=8192
my.cnf.cx10b - cx6d + page_size=4096
my.cnf.cx11a - cx6d + lru_scan_depth=2048
my.cnf.cx13 - cx6d + doublewrite=OFF
my.cnf.cx14 - cx6d + use_native_aio=OFF
my.cnf.cx15a - cx6d + buffer_pool_instances=1
my.cnf.cx16 - cx6d + log_writeahead_size=4096
my.cnf.cx18 - cx6d + purge_threads=1

Monday, November 30, 2020

Blind-writes and an LSM

Blind-writes for an LSM can be a challenge in the presence of secondary indexes. This post was inspired by an interesting but unpublished paper based on Tarantool and a blog post on SAI in DataStax Cassandra. Some of this restates ideas from the Tarantool paper.

Contributions of this post are:

Explains blind-writes
Explains a new way to do blind-writes for some SQL-ish statements. The new way doesn't require readers to validate secondary index entries, which is required with the Tarantool approach. The new approach only works for some types of statements while the Tarantool approach had fewer limits.

Longer introduction

Non-unique secondary index maintenance is read-free for MyRocks during a write (insert, update, delete). By this I mean that nothing is read from the secondary index so there is no chance of doing a storage read during index maintenance. This makes secondary indexes cheaper with MyRocks than the typical DBMS that uses a B-Tree. The InnoDB change buffer also reduces the chance of storage reads during index maintenance, which is nice but not as significant as the MyRocks benefit. I ignore the reads that MyRocks might do during LSM compaction so index maintenance is eventually not read-free so allow me to be truthy. Regardless, with an LSM there is (almost always) less read and write IO per write compared to a B-Tree.

If you only want to provide SQL semantics and support write-heavy workloads with secondary indexes, then it is hard to be more efficient than MyRocks as it doesn't do reads other than those required for SQL -- fetching the row(s). There are several reasons for fetching the row(s) during processing of an UPDATE or DELETE statement including returning the affected row count and validating constraints. Getting the row requires a read from the table (if heap organized) or the PK index (if using a clustered PK index like MyRocks and InnoDB).

If you are willing to give up SQL semantics then it is possible to be read-free -- no reads for secondary index maintenance, no reads to get the base row. This is explained in the Tarantool paper but that solution has a cost -- all secondary index entries must be validated by reading the base row because they can be stale (invalid). The validation isn't that different from what can happen in Postgres and InnoDB when vacuum and purge get behind and index-only queries might not be index-only.

There was also a feature in TokuDB for blind writes but I am not sure whether that avoided reads to get the base row.

Blind writes

A blind write is done without hidden or explicit reads. But I need to add an exception for some reads, because an LSM has to search the memtable to insert key-value pairs. So I limit this post to disk-based DBMS and allow a blind-write to read in-memory search structures. The Put operation in the LevelDB and RocksDB API is a blind write.

An explicit read is done explicitly by the user or implied by the operation. A read-modify-write sequence includes an explicit read and an example is SELECT ... FOR UPDATE followed by UPDATE. Reads done to process the WHERE clause for UPDATE and DELETE statements are explicit reads.

A hidden read is a read that isn't explicit and done while processing an operation. Examples include:

Read an index to validate a unique constraint. A bloom filter makes this fast for an LSM.
Determine the number of affected rows to provide SQL semantics for UPDATE & DELETE
Read B-Tree secondary index leaf pages to maintain them after a write

Secondary Index Maintenance

An index entry is composed (indexed column values, row pointer). The row pointer is usually a row ID or the value of the row's primary key columns. The typical index maintenance after a write consists of some of remove old index entry, insert new index entry. To perform these steps both the index column values and row pointer must be known. When a DBMS has the row then it has the required information, but a read must be done to get the row and we are discussing ways to avoid that read.

SQL

The rest of this post assumes a SQL DBMS and a simple schema.

create table T (id primary key, a int, b int)
create index xa on T(a)

I am interested in whether blind writes can be done for the following statements. The example statements are listed below. The first 3 statements use the PK index to evaluate the WHERE clause. The traditional way to evaluate these statements is to use the PK index to find the qualifying row, and then perform secondary index maintenance. Secondary index maintenance with a B-Tree means doing a read-modify-write on index leaf pages. With an LSM it can be read-free.

P1 - UPDATE T set a = 5 WHERE id=3
P2 - UPDATE T set a = a + 1 WHERE id=3
P3 - DELETE from T WHERE id=3

The next 3 statements use the secondary index, xa, to evaluate the WHERE clause. The traditional way to evaluate these statements is to use the secondary index to find qualifying index entries, then using the row pointer in each index entry to find the base row and finally performing index maintenance.

S1 - UPDATE T set b = 5 WHERE a=4
S2 - UPDATE T set b = b + 1 WHERE a=4
S3 - DELETE from T WHERE a=4

Truly read-free

I promised an approach that was truly read-free and after many words have yet to deliver. An LSM makes this possible, but since this is LSM-based there will still be reads during compaction so I am truthy here rather than true. Also, I assume reads can be done from an index to evaluate the WHERE clause.

With a SQL DBMS that uses an LSM, the base row doesn't have to be fetched for statements S1, S2 and S3 above if SQL semantics aren't required. The base row still has to be fetched for statements P1, P2 and P3. For S1, S2 and S3 I assume that the secondary index xa is scanned to find matching index entries, for each matching index entry the remaining work is:

S1 - Put a delta for the clustered PK index via the merge operator. The delta encodes b=5.
S2 - Put a delta for the clustered PK index via the merge operator. The delta encodes b=b+1.
S3 - Put a tombstone for the clustered PK index to delete the row

I have no idea about the effort required to get this into MySQL. I assume it always reads the base row for UPDATE and DELETE statements.