Sunday, August 27, 2023

RocksDB and glibc malloc don't play nice together

Pineapple and ham work great together on pizza. RocksDB and glibc malloc don't work great together. The primary problem is that for RocksDB processes the RSS with glibc malloc is much larger than with jemalloc or tcmalloc. I have written about this before -- see here and here. RocksDB is a stress test for an allocator.

tl;dr

  • For a process using RocksDB the RSS with glibc malloc is much larger than with jemalloc or tcmalloc. There will be more crashes from the OOM killer with glibc malloc.

Benchmark

The benchmark is explained in a previous post.

The insert benchmark was run in the IO-bound setup and the database is larger than memory.

The benchmark used a c2-standard-30 server from GCP with Ubuntu 22.04, 15 cores, hyperthreads disabled, 120G of RAM and 1.5T of storage from RAID 0 over 4 local NVMe devices with XFS.

The benchmark is run with 8 clients and 8 tables (client per table). The benchmark is a sequence of steps.

  • l.i0
    • insert 500 million rows per table
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 100 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail. A sketch of this insert/delete pattern follows the list.
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.
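
To make the l.i1 insert+delete pattern concrete, here is a minimal sketch of one client's loop. It assumes a hypothetical table t(id, payload) and a Python DB-API connection; the real benchmark's schema, batch sizes and helpers differ.

import random
import string

def make_payload(n=64):
    # Stand-in for the benchmark's indexed and non-indexed columns.
    return "".join(random.choices(string.ascii_lowercase, k=n))

def l_i1(conn, first_id, last_id, rows_to_process, batch=100):
    # The table holds ids in [first_id, last_id] when the step starts. Each
    # pass inserts a batch at the head (largest ids) and deletes a matching
    # batch from the tail (smallest ids), so the row count is unchanged when
    # the step ends. Secondary index maintenance happens inside the engine.
    cur = conn.cursor()
    head, tail = last_id, first_id
    done = 0
    while done < rows_to_process:
        rows = [(head + i + 1, make_payload()) for i in range(batch)]
        cur.executemany("INSERT INTO t (id, payload) VALUES (%s, %s)", rows)
        cur.execute("DELETE FROM t WHERE id >= %s AND id < %s", (tail, tail + batch))
        conn.commit()
        head += batch
        tail += batch
        done += batch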

Configurations

The benchmark was run with 2 my.cnf files, c5 and c7, both edited to use a 40G RocksDB block cache. The difference between them is that c5 uses the LRU block cache (older code) while c7 uses the Hyper Clock cache.

Malloc

The test was repeated with 4 malloc implementations:

  • je-5.2.1 - jemalloc 5.2.1, the version provided by Ubuntu 22.04
  • je-5.3.0 - jemalloc 5.3.0, the current jemalloc release, built from source
  • tc-2.9.1 - tcmalloc 2.9.1, the version provided by Ubuntu 22.04
  • glibc 2.35 - this is the version provided by Ubuntu 22.04

Results

I measured the peak RSS during each benchmark step.
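
I get peak RSS from the OS. A minimal sketch that polls VmHWM (the kernel's high-water mark for RSS) from /proc for the database process is below; the helper names and polling interval are illustrative, not what my scripts actually do.

import time

def peak_rss_gb(pid):
    # VmHWM in /proc/<pid>/status is the peak RSS the kernel has observed for
    # the process, reported in kB. Convert it to GB.
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmHWM:"):
                return int(line.split()[1]) / (1024 * 1024)
    return None

def track_peak_rss(pid, seconds, interval=10):
    # Poll once per interval and keep the largest value seen, in case the
    # process restarts and its VmHWM resets.
    peak = 0.0
    for _ in range(seconds // interval):
        val = peak_rss_gb(pid)
        if val is not None:
            peak = max(peak, val)
        time.sleep(interval)
    return peak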

The benchmark completed for all malloc implementations using the c5 config, but had some of the benchmark steps run for more time, there would have been an OOM with glibc. All of the configs used a 40G RocksDB block cache.

The benchmark completed for jemalloc and tcmalloc using the c7 config but failed with OOM for glibc on the q1000 step. Had the l.i1, q100 and q500 steps run for more time then the OOM would have happened sooner.



Saturday, August 26, 2023

Postgres 16 beta3 and the Insert Benchmark on a small server

This post has results for Postgres 16 beta vs the Insert Benchmark on a small server. I am searching for performance regressions. A previous post using a medium server is here. The tests here used Postgres versions 15.3, 15.4, 16 beta1, 16 beta2 and 16 beta3.

tl;dr

  • I don't see regressions in Postgres 16 beta3 for the workload that is cached by Postgres
  • Results for the IO-bound workloads have too much variance. I don't know whether the issue is Postgres or my HW. It is summer time and the temperature in my home datacenter (aka upstairs) can vary.
  • I am repeating the tests using the o3_native_lto build
Builds

I compiled Postgres 15.3, 15.4, 16 beta1, 16 beta2 and 16 beta3 from source. The builds are named o2_nofp which is shorthand for using: -O2 -fno-omit-frame-pointer.

Benchmark

The insert benchmark was run in two setups.

  • cached by Postgres - all tables are cached by Postgres
  • IO-bound - the database is larger than memory

This benchmark used the Beelink server explained here that has 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04. 

The benchmark is run with 1 client and 1 table. The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows per table where X is 20 for cached and 800 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another X million rows per table with secondary index maintenance where X is 200 for Cached and 20 for IO-bound. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail. This step took ~20,000 seconds.
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.
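
The q100, q500 and q1000 steps above combine unthrottled query clients with background writers held to a fixed rate. Below is a minimal sketch of one way to enforce that rate; it illustrates the idea and is not the insert benchmark's actual implementation.

import time

def rate_limited_writes(do_one_write, writes_per_second, duration_seconds):
    # Spread the target rate evenly across each second: do one write, then
    # sleep until the next slot. do_one_write would issue one insert (and the
    # matching delete) against the database.
    interval = 1.0 / writes_per_second
    stop_at = time.time() + duration_seconds
    next_at = time.time()
    while time.time() < stop_at:
        do_one_write()
        next_at += interval
        delay = next_at - time.time()
        if delay > 0:
            time.sleep(delay)

# Example: the q1000 step uses 1000 inserts/s/client for 3600 seconds
# rate_limited_writes(my_write_fn, 1000, 3600)
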
Configurations

I used the a1 and a2 configurations. The difference between them is that a1 has wal_compression set to lz4 while a2 has it set to off.

Results

Reports are here for Cached (a1 config, a2 config) and IO-bound (a1 config, a2 config).

The results for average throughput are interesting and confusing. The tables linked below use absolute and relative throughput where relative throughput is (QPS for my version / QPS for base version) and the base version is Postgres 15.3.
  • Cached, a1 config (see here)
    • Postgres 16 beta2 and beta3 always get more throughput. Relative throughput is between 1.01 and 1.07 for them on every benchmark step.
    • From the metrics for the write-only benchmark steps (l.i0, l.x, l.i1) the CPU overhead (cpupq column) and write IO overhead (wkbpi column) are similar from 15.3 to 16 beta3
    • From the metrics for the read+write benchmark steps (q100, q500, q1000) there might be a slight reduction in CPU overhead (cpupq column)
  • Cached, a2 config (see here)
    • Postgres 16 beta2 and beta3 were faster than 15.3 on the write-only benchmark steps (l.i0, l.x, l.i1) where relative QPS was between 1.00 and 1.03 but almost always slower on the read-write benchmark steps (q100, q500, q1000) where relative QPS was between 0.91 and 1.03. In the read+write steps 16 beta2 had a relative QPS of 1.03 on q1000, otherwise these versions were slower than 15.3.
    • From the metrics for the write-only benchmark steps (l.i0, l.x, l.i1) the CPU overhead (cpupq column) and write overhead (wkbpi column) are similar from 15.3 to 16 beta3
    • From the metrics for the read+write benchmark steps (q100, q500, q1000) the CPU overhead (cpupq column) is frequently larger in 16 beta versions and this might explain the reduction in QPS.
  • IO-bound, a1 config (see here)
    • I am reluctant to characterize these because there is too much variance
  • IO-bound, a2 config (see here)
    • I am reluctant to characterize these because there is too much variance

Friday, August 25, 2023

Checking MyRocks 5.6 for regressions with the Insert Benchmark and a large server

This documents how performance changes from old to new releases of MyRocks using the Insert Benchmark and a large server. I use MyRocks 5.6 rather than 8.0 because the 5.6 releases go back further in time. Previous posts are here for a small server and medium server.

Update - the builds I used were bad so the results here are bogus. I fixed the builds, repeated the tests and shared the results here. There are no regressions and the initial load of the benchmark is ~10% faster in modern MyRocks.

Comparing old MyRocks with modern MyRocks where old means fbmy5635_202203072101 with RocksDB 6.28.2 and modern means fbmy5635_rel_20230529_850 with RocksDB 8.5.0:

  • Write throughput drops by ~15%. Most of the regressions are confined to a few releases between fbmy5635_rel_202210112144 and fbmy5635_202302162102.
  • Read throughput drops by ~10%. The regressions occur in many releases.
  • In both cases, new CPU overhead explains the performance regression.
Builds

The builds are explained in a previous post but the oldest build used here is fbmy5635_202203072101 which has source as of 2022/03/07 while the other posts have builds going back to 2021/04/07.

The configuration files (my.cnf) are here: base and c5. The difference between them is that c5 adds rocksdb_max_subcompactions=4.

Benchmark

The insert benchmark was run in two setups:

  • cached by RocksDB - all tables fit in the RocksDB block cache
  • IO-bound - the database is larger than memory
The server has 80 HW threads, 40 cores, 256G of RAM and fast NVMe storage with XFS.

The benchmark is run with 24 clients, 24 tables and a client per table. The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows across all tables without secondary indexes where X is 20 for cached and 500 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 50 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail.
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 7200 seconds.
Results

Performance reports are here for Cached by RocksDB (base config and c5 config) and IO-bound (base config and c5 config).

Results: average throughput

This section explains the average throughput tables in the Summary section. I use relative throughput to save on typing where relative throughput is (throughput for some version  / throughput for base case). When relative throughput is > 1 then some version is faster than the base case. The base case is fbmy5635_202203072101 with source from 2022/03/07 and it uses RocksDB 6.28.2.
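
As a worked example of the arithmetic, with made-up numbers rather than values from the reports:

# relative throughput = throughput for some version / throughput for base case
base_ips = 105000   # hypothetical inserts/s for the base case
new_ips  =  89250   # hypothetical inserts/s for some newer build
print(round(new_ips / base_ips, 2))   # 0.85 -> the newer build is ~15% slower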

Comparing old MyRocks with modern MyRocks where old means fbmy5635_202203072101 with RocksDB 6.28.2 and modern means fbmy5635_rel_20230529_850 with RocksDB 8.5.0:
  • Write throughput drops by ~15%. Most of the regressions are confined to a few releases between fbmy5635_rel_202210112144 and fbmy5635_202302162102. The issue appears to be new CPU overhead -- see the cpupq (CPU/operation) column here.
  • Read throughput drops by ~10%. The regressions occur in many releases.
  • The issue appears to be new CPU overhead -- see the cpupq (CPU/operation column) here for writes and here for reads.
Cached by RocksDB, base config (see here)
  • For fbmy5635_202205192101 with RocksDB 7.2.2
    • Relative throughput is (0.96, 0.98, 0.95) for (l.i0, l.x, l.i1)  so write throughput dropped by ~5% vs the base case
    • Relative throughput is (0.98, 0.95, 0.95) for (q100, q500, q1000) so read throughput dropped by ~5% vs the base case
  • For fbmy5635_202302162102 with RocksDB 7.10.0
    • Relative throughput is (0.98, 0.99, 0.83) for (l.i0, l.x, l.i1). The result for l.i1 is much worse than it was on the next earlier build (fbmy5635_rel_202210112144 with RocksDB 7.3.1) and I am tracking that down.
  • For fbmy5635_rel_20230529_850 with RocksDB 8.5.0
    • Relative throughput is (0.94, 0.97, 0.83) for (l.i0, l.x, l.i1) so the regression to write perf for l.i1 hasn't gotten worse compared to fbmy5635_202302162102
    • Relative throughput is (0.91, 0.87, 0.89) for (q100, q500, q1000) and query performance has small regressions in every release tested
Cached by RocksDB, c5 config (see here)
  • Results are similar to Cached by RocksDB with the base config
IO-bound, base config (see here)
  • Results are similar to Cached by RocksDB with the base config
IO-bound, c5 config (see here)
  • Results are similar to Cached by RocksDB with the base config

Wednesday, August 23, 2023

Checking MyRocks 5.6 for regressions with sysbench and a small server, part 2

This post has results from sysbench and many MyRocks versions (old and new) in my search for performance regressions. It uses a small server and my previous post on that is here, which has results for the Insert Benchmark with the same server and same MyRocks versions. I also have results here for sysbench on a medium server.

The tests (microbenchmarks) are split into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. To compare modern MyRocks with classic I use data from the Summary statistics section at the end of the post and the median of the relative throughput values per group, where modern uses source from 2023/05/29 with RocksDB 8.5.0 and classic uses source from 2021/04/07 with RocksDB 6.19. Finally, relative throughput is (QPS for modern / QPS for classic). A sketch of the per-group computation follows the list below.
  • QPS for point queries is 1.01x and 1.07x larger in modern MyRocks vs classic
  • QPS for range queries is 1.02x and 1.12x larger in modern MyRocks vs classic
  • QPS for writes is similar between modern MyRocks and classic.
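
The per-group numbers above come from taking summary statistics over the relative throughput values for the microbenchmarks in each group. Below is a minimal sketch of that computation with made-up test names and values, not the actual per-test results; it uses the sample standard deviation, which may differ from how the spreadsheet computes stddev.

import statistics

def summarize(relative_qps):
    # relative_qps maps microbenchmark name -> (QPS for modern / QPS for classic)
    vals = sorted(relative_qps.values())
    return {
        "average": round(statistics.mean(vals), 2),
        "median": round(statistics.median(vals), 2),
        "min": min(vals),
        "max": max(vals),
        "stddev": round(statistics.stdev(vals), 3),
    }

# Hypothetical values for one group of point-query microbenchmarks
point_group = {"test-a": 0.94, "test-b": 1.07, "test-c": 1.01, "test-d": 2.06}
print(summarize(point_group))
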
Builds

The builds tested are described in a previous post. MyRocks was compiled from source. All tests used the y10a5_bee config.

Benchmark

I used sysbench and my usage is explained here. RocksDB was configured to cache all tables.

This benchmark used the Beelink server explained here that has 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04. 

The benchmark is run with 1 client and 1 table with 20M rows. The read-only tests ran for 600 seconds each and the other tests ran for 1200 seconds each.

Results

All of the results are here and each chart below is there too (and easier to read there). For the charts below I split the results into two groups of point query tests, two groups of range query tests and one group for read-write/write-only tests. 

The charts use relative QPS which is: (QPS for a given version / QPS for the base case) and the base case is the 20210407 build that has a much longer name, fbmy5635_rel_202104072149, in my previous post. The names used below are the dates on which the builds were done, YYYYMMDD, and for the last three the three digits at the end (832, 843, 850) are the RocksDB version when I used a more recent version of RocksDB than the build would otherwise use.

Note 

  • the y-axis for all charts starts at 0.8 rather than 0.0 to improve readability
  • the test names under the x-axis are truncated in the images here but readable here

Point query, group 1

Point query, group 2

Range query, group 1

Range query, group 2

Writes

Summary statistics

All columns below are for the 20230529_850 build, relative to the base case.

        Point-1  Point-2  Range-1  Range-2  Writes
Average 1.19     1.01     1.10     1.13     1.01
Median  1.07     1.01     1.02     1.12     1.00
Min     0.94     0.95     0.80     0.85     0.90
Max     2.06     1.06     1.50     1.62     1.14
Stddev  0.353    0.051    0.260    0.265    0.075




Tuesday, August 22, 2023

The impact from hyperthreading for RocksDB db_bench on a medium server

This has results from db_bench on the same instance type configured with and without hyperthreading to determine the impact from it.

tl;dr

  • I used the LRU block cache and will soon repeat this experiment with Hyper Clock Cache
  • When the CPU is oversubscribed hyperthreading improves QPS on read-heavy benchmark steps but hurts it on the high-concurrency, write-heavy step (overwrite)

Builds

I used RocksDB 8.5.0 compiled from source.

Benchmark

The benchmark was run with the LRU block cache and an in-memory workload. The test database was ~15GB. The RocksDB benchmark scripts were used (here and here).

The test server is a c2-standard-60 server from GCP with 120G of RAM. The OS is Ubuntu 22.04. I repeated the tests with and without hyperthreading and name the servers ht0 and ht1:

  • ht0 - hyperthreads disabled, 30 HW threads and 30 cores
  • ht1 - hyperthreads enabled, 60 HW threads and 30 cores

The benchmark was repeated for 10, 20, 30, 40 and 50 threads. At 10 threads the CPU is undersubscribed for both ht0 and ht1. At 50 threads the CPU is oversubscribed for both ht0 and ht1. I want to see the impact on performance as the workload changes from an undersubscribed to an oversubscribed CPU.
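
Below is a minimal sketch of how the thread-count sweep can be driven. I actually use the RocksDB benchmark scripts linked above; the flags shown (--benchmarks, --threads, --num, --duration) are standard db_bench options but the values are illustrative, not my exact settings.

import subprocess

# Sweep db_bench over thread counts for one read-while-writing style step.
for nthreads in (10, 20, 30, 40, 50):
    subprocess.run(
        [
            "./db_bench",
            "--benchmarks=readwhilewriting",
            f"--threads={nthreads}",
            "--num=100000000",
            "--duration=300",
            "--db=/data/rocksdb",
        ],
        check=True,
    )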

Results

Results are here and charts for the results are below. The y-axis for the charts starts at 0.9 rather than 0 to improve readability. The charts show the relative QPS which is (QPS for ht1 / QPS for ht0). Hyperthreading helps when the relative QPS is greater than 1.

  • At 10 and 20 threads hyperthreading has a small, negative impact on QPS
  • At 30 threads hyperthreading has no impact on QPS
  • At 40 and 50 threads hyperthreading helps performance for read heavy tests and hurts it for the concurrent write heavy test (overwrite)
  • Note that fillseq always runs with 1 thread regardless of what the other tests use
I abbreviated a few of the test names to fit on the chart -- revrangeww is revrangewhilewriting, fwdrangeww is fwdrangewhilewriting and readww is readwhilewriting. See the Benchmark section above that explains how I run db_bench.

What happens at 50 threads for fwdrangeww where hyperthreading gets more QPS vs overwrite where it hurts QPS? The following sections attempt to explain it.

For both tests, the configuration with a much larger context switch rate (more than 2X larger) gets more QPS: ~1.2X more on fwdrangeww for ht1, ~1.1X more for ht0 on overwrite.

Explaining fwdrangeww

First are results where I divide user & system CPU seconds by QPS to compute CPU/operation for the test with 50 threads. The CPU time is measured by time db_bench ... and that is multiplied by 1M, so this is CPU microseconds per operation.

From this, there might be more CPU time consumed per operation when hyperthreads are enabled. However, this metric can be misleading especially when the CPU isn't fully subscribed because a HW thread isn't a CPU core.

        user/q  sys/q   (user+sys)/q
ht0     139     23      163
ht1     183     35      219
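
A small sketch of that arithmetic is below: total CPU seconds reported by time(1) are divided by the operation count for the run (QPS times the run duration) and scaled by 1M to get CPU microseconds per operation. The inputs are hypothetical, chosen only to land near the ht0 row above.

def cpu_us_per_op(user_s, sys_s, qps, duration_s):
    # time(1) reports user and system CPU seconds for the whole db_bench run.
    # Divide by the number of operations (QPS x duration) and scale by 1M to
    # get CPU microseconds per operation.
    ops = qps * duration_s
    return (user_s * 1e6 / ops, sys_s * 1e6 / ops, (user_s + sys_s) * 1e6 / ops)

# Hypothetical inputs for a 300-second run at 150,000 QPS
print(cpu_us_per_op(user_s=6255.0, sys_s=1035.0, qps=150000, duration_s=300))
# -> (139.0, 23.0, 162.0)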

Next are the average values for context switches (cs), user CPU time (us) and system CPU time (sy) from vmstat. The CPU utilization rates can be misleading as explained in the previous paragraph. In each table below the first row is ht0 and the second is ht1. At 50 threads the context switch rate is more than 2X larger for ht1 than for ht0.

10 threads
cs      us      sy
132350  30.8    2.1
128609  15.1    1.1

20 threads
cs      us      sy
515579  58.1    6.1
497088  28.8    3.1

30 threads
cs      us      sy
1053293 82.6    11.8
1053209 41.1    5.9

40 threads
cs      us      sy
900373  83.0    13.0
1338999 54.1    8.6

50 threads
cs      us      sy
742973  83.1    14.1
1620549 66.0    12.7
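
The cs/us/sy values above are averages over vmstat samples taken during each run. A minimal parsing sketch is below; it assumes vmstat output was captured to a file at a fixed sampling interval, which may not match how my helper scripts collect it.

import statistics

def vmstat_averages(path):
    # Average the cs (context switches/s), us and sy (CPU %) columns from a
    # file of vmstat output. Header rows name the fields; data rows are numeric.
    # (The first vmstat sample reports averages since boot and could be dropped.)
    header, cs, us, sy = None, [], [], []
    with open(path) as f:
        for line in f:
            cols = line.split()
            if "cs" in cols and "us" in cols:
                header = cols                      # field-name row
            elif header and cols and cols[0].isdigit():
                cs.append(int(cols[header.index("cs")]))
                us.append(int(cols[header.index("us")]))
                sy.append(int(cols[header.index("sy")]))
    return statistics.mean(cs), statistics.mean(us), statistics.mean(sy)

# Example (hypothetical file name):
# print(vmstat_averages("vmstat.ht0.50threads.txt"))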

Explaining overwrite

First are results where I divide user & system CPU seconds by QPS to compute CPU/operation for the test with 50 threads.

From this, there might be more CPU time consumed per operation when hyperthreads are enabled. However, this metric can be misleading because a HW thread isn't a CPU core.

        user/q  sys/q   (user+sys)/q
ht0     112     62      175
ht1     175     102     277

Next are the average values for context switches (cs), user CPU time (us) and system CPU time (sy) from vmstat.  The average CPU utilization sustained is higher with hyperthreads disabled, but that can also be misleading for the same reason as mentioned above. The context switch rate is much higher when hyperthreads are disabled for 20, 30, 40 and 50 threads. That can mean there is more mutex contention.

10 threads
cs      us      sy
15878   21.3    8.4
16753   10.9    4.4

20 threads
cs      us      sy
50776   30.9    12.8
28250   15.7    6.8

30 threads
cs      us      sy
526622  31.3    14.5
102929  20.1    9.9

40 threads
cs      us      sy
833918  29.1    15.9
248478  22.3    12.7

50 threads
cs      us      sy
1107510 27.7    15.8
461239  19.6    11.7

Checking MyRocks 5.6 for regressions with sysbench and a medium server, part 3

I used sysbench to check for performance regressions over many releases of MyRocks 5.6. I previously shared results using the Insert Benchmark. Here I have results for builds from 2021/04/07 through 2023/05/29.

tl;dr

  • QPS for modern MyRocks on read-only tests is mostly better than older MyRocks
  • QPS for modern MyRocks on write-only tests is ~10% less than older MyRocks
  • Modern MyRocks does much better than older MyRocks for read-heavy tests that run after the database has been subject to random writes.
  • Modern MyRocks does much better on full scans

Builds

The builds tested are described in the previous post. All builds used the y9c5_gcp_c2s30 my.cnf file.

Benchmark

I used sysbench and my usage is explained here. RocksDB was configured to cache all tables.

The benchmark used a c2-standard-30 server from GCP with Ubuntu 22.04, 15 cores, hyperthreads disabled, 120G of RAM and 1.5T of storage from RAID 0 over 4 local NVMe devices with XFS.

The benchmark is run with 8 clients and 4 tables with 20M rows per-table. The read-only tests ran for 600 seconds each and the other tests ran for 1200 seconds each.

Results

All of the results are here and each chart below is there too (and easier to read there). For the charts below I split the results into two groups of point query tests, two groups of range query tests and one group for read-write/write-only tests. 

The charts use relative QPS which is: (QPS for a given version / QPS for the base case) and the base case is the 20210407 build that has a much longer name, fbmy5635_rel_202104072149, in my previous post. The names used below are the dates on which the builds were done, YYYYMMDD, and for the last three the three digits at the end (832, 843, 850) are the RocksDB version when I used a more recent version of RocksDB than the build would otherwise use.

Note 

  • the y-axis for all charts starts at 0.8 rather than 0.0 to improve readability
  • the test names under the x-axis are truncated in the images here but readable here

Point query, group 1

The points-covered-si_range=100 test is an outlier. The test fetches 100 rows via point lookups on a covered secondary index. The same query is done for points-covered-si_pre_range=100 and the improvement there isn't as large. The difference between these two is that pre_range is run before the database is subject to random writes.

Point query, group 2

Here, modern MyRocks does better on the tests that run after the database gets random writes than it does on the pre_range tests that run prior.

Range query, group 1

Four tests show a significant improvement in modern MyRocks. All of them run after the database is subject to random writes. The benchmark steps that don't show the speedup run prior to the random writes.

  • range-covered-pk_range=100
  • range-covered-si_range=100
  • range-notcovered-pk_range=100
  • range-notcovered-si_range=100
Range query, group 2

Performance for full scans is much better in modern MyRocks. Relative performance is better for tests that run after the database has been subject to random writes.

Read-write and write-only

QPS on write-only tests in modern MyRocks is ~10% less than in older MyRocks. The cases below where modern MyRocks is faster than older MyRocks are from tests that are read-write.

Summary statistics

Summary statistics for the relative QPS results are in the spreadsheet here in the tab named qps.rel.56cached. Summary stats for the most recent build (20230529_850) are also below.

        Point-1  Point-2  Range-1  Range-2  Writes
Average 1.27     1.10     1.14     1.23     0.94
Median  1.14     1.08     1.11     1.09     0.91
Min     0.97     0.99     0.91     0.92     0.87
Max     2.14     1.24     1.45     2.17     1.11
Stddev  0.372    0.118    0.223    0.445    0.080

Thursday, August 17, 2023

Checking MyRocks 5.6 for regressions with the Insert Benchmark and a small server, part 2

This is part 2 in my attempt to document how performance changes from old to new releases of MyRocks using the Insert Benchmark. I use MyRocks 5.6 rather than 8.0 because the 5.6 releases go back further in time. A previous post is here but the results there were bogus because my builds were broken.

tl;dr - a hand-wavy summary

  • Modern MyRocks uses RocksDB 8.5.0 and classic MyRocks uses RocksDB 6.19
  • Modern MyRocks gets ~5% less throughput on read-intensive benchmark steps because there is new CPU overhead
  • Modern MyRocks gets ~3% less throughput on write-intensive benchmark steps because there is new CPU overhead
  • Modern MyRocks has ~10% less write-amplification

Builds

I started with the builds from my previous post, removed the fbmy5635_rel_jun23_7e40af677 build and then added 3 builds that use MyRocks as of what is used by fbmy5635_rel_202305292102 but upgraded RocksDB to 8.3.2, 8.4.3 and 8.5.0.

I used MyRocks from FB MySQL 5.6.35 using the rel build (CMAKE_BUILD_TYPE=Release, see here) with source from 2021 through 2023. The versions are:
  • fbmy5635_rel_202104072149 - from 20210407 at git hash (f896415fa0 MySQL, 0f8c041ea RocksDB), RocksDB 6.19
  • fbmy5635_rel_202203072101 - from 20220307 at git hash (e7d976ee MySQL, df4d3cf6fd RocksDB), RocksDB 6.28.2
  • fbmy5635_rel_202205192101 - from 20220519 at git hash (d503bd77 MySQL, f2f26b15 RocksDB), RocksDB 7.2.2
  • fbmy5635_rel_202208092101 - from 20220809 at git hash (877a0e585 MySQL, 8e0f4952 RocksDB), RocksDB 7.3.1
  • fbmy5635_rel_202210112144 - from 20221011 at git hash (c691c7160 MySQL, 8e0f4952 RocksDB), RocksDB 7.3.1
  • fbmy5635_rel_202302162102 - from 20230216 at git hash (21a2b0aa MySQL, e5dcebf7 RocksDB), RocksDB 7.10.0
  • fbmy5635_rel_202304122154 - from 20230412 at git hash (205c31dd MySQL, 3258b5c3 RocksDB), RocksDB 7.10.2
  • fbmy5635_rel_202305292102 - from 20230529 at git hash (b739eac1 MySQL, 03057204 RocksDB), RocksDB 8.2.1
  • fbmy5635_rel_20230529_832 - from 20230529 at git hash (b739eac1 MySQL) but with RocksDB at version 8.3.2
  • fbmy5635_rel_20230529_843 - from 20230529 at git hash (b739eac1 MySQL) but with RocksDB at version 8.4.3
  • fbmy5635_rel_20230529_850 - from 20230529 at git hash (b739eac1 MySQL) but with RocksDB at version 8.5.0
Benchmark

The insert benchmark was run in two setups:

  • cached by RocksDB - all tables fit in the RocksDB block cache
  • IO-bound - the database is larger than memory

This benchmark used the Beelink server explained here that has 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04. 

The benchmark is run with 1 client. The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows across all tables without secondary indexes where X is 20 for cached and 800 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 100 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail.
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.

Configurations

The configuration (my.cnf) files are here and I use abbreviated names for them in this post. For each variant there are two files -- one with a 1G block cache, one with a larger block cache. The larger block cache size is 8G when LRU is used and 6G when hyper clock cache is used (see tl;dr).

  • a (see here) - base config
  • a5 (see here) - enables subcompactions via rocksdb_max_subcompactions=2
Results

Performance reports are here for Cached by RocksDB (base config and c5 config) and IO-bound (base config and c5 config).

Results: average throughput

This section explains the average throughput tables in the Summary section. I use relative throughput to save on typing where relative throughput is (throughput for some version / throughput for base case). When relative throughput is > 1 then some version is faster than the base case. Unless stated otherwise:

  • Base case is fbmy5635_rel_202104072149, the oldest build that uses RocksDB 6.19
  • Some version is fbmy5635_rel_20230529_850, the newest build that uses RocksDB 8.5.0

Cached by RocksDB, base config (see here)
  • Relative throughput for (l.i0, l.x, l.i1, q100, q500, q1000) is (0.94, 0.99, 0.98, 0.96, 0.96, 0.96)
  • Modern MyRocks gets ~4% less throughput on read-intensive steps, 1% less on create index and 2% to 4% less on write-intensive steps.
Cached by RocksDB, c5 config (see here)
  • Relative throughput for (l.i0, l.x, l.i1, q100, q500, q1000) is (0.94, 0.99, 0.97, 0.95, 0.94, 0.95)
  • Modern MyRocks gets ~5% less throughput on read-intensive steps, 1% less on create index and 3% to 6% less on write-intensive steps.
  • Using the HW perf metrics for l.i1 and for q100 the regressions are from new CPU overhead, see the cpupq column (CPU/operation) where it grows from 137 to 141 for the l.i1 step and from 375 to 396 for the q100 step.
IO-bound, base config (see here)
  • Relative throughput for (l.i0, l.x, l.i1, q100, q500, q1000) is (0.93, 0.99, 0.94, 0.94, 0.94, 0.95)
  • Modern MyRocks gets ~6% less throughput on read-intensive steps, 1% less on create index and 6% to 7% less on write-intensive steps.
IO-bound, c5 config (see here)
  • Relative throughput for (l.i0, l.x, l.i1, q100, q500, q1000) is (0.94, 1.00, 0.97, 0.96, 0.95, 0.96)
  • Modern MyRocks gets ~4% less throughput on read-intensive steps and 3% to 6% less on write-intensive steps.
  • Using the HW perf metrics for l.i1 and for q100 there is more CPU overhead in modern MyRocks. See the cpupq column (CPU/operation) where it grows from 166 to 170 for the l.i1 step and from 529 to 560 for the q100 step. On the bright side, write efficiency has improved based on the wkbpi (KB written to storage per insert) for the l.i1 step where it drops from 2.877 to 2.558. Based on compaction IO statistics at the end of l.i1 and the end of q1000 the improvement is from an increase in trivial move and a decrease in compaction writes to most levels. This reduces write-amplification by ~10%.
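
A small worked computation for the wkbpi improvement mentioned above:

# wkbpi = KB written to storage per insert, from the l.i1 step
old_wkbpi = 2.877   # fbmy5635_rel_202104072149 (RocksDB 6.19)
new_wkbpi = 2.558   # fbmy5635_rel_20230529_850 (RocksDB 8.5.0)
print(f"{1 - new_wkbpi / old_wkbpi:.1%}")   # ~11% fewer KB written per insert, i.e. ~10% less write-amplification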

Wednesday, August 16, 2023

Checking MyRocks 5.6 for regressions with the Insert Benchmark and a medium server, part 2

This is a follow up to my previous post where I use the Insert Benchmark to search for performance regressions in MyRocks 5.6. I use 5.6 rather than 8.0 because the MyRocks 5.6 builds go back further in time.

tl;dr - a hand waving summary

  • Modern MyRocks uses RocksDB 8.5.0 and classic MyRocks uses RocksDB 6.19
  • Perf regressions in modern MyRocks are from new CPU overheads
  • The load benchmark step (l.i0) is faster in modern MyRocks 
  • The write heavy benchmark step (l.i1) is ~5% slower in modern MyRocks
  • The read heavy benchmark steps (q100, q500, q1000) are ~10% slower in modern MyRocks
  • Modern MyRocks has ~13% less write-amplification
Builds

I started with the builds from my previous post, removed the fbmy5635_rel_jun23_7e40af677 build and then added 3 builds that use MyRocks as of what is used by fbmy5635_rel_202305292102 but upgraded RocksDB to 8.3.2, 8.4.3 and 8.5.0.

I used MyRocks from FB MySQL 5.6.35 using the rel build (CMAKE_BUILD_TYPE=Release, see here) with source from 2021 through 2023. The versions are:
  • fbmy5635_rel_202104072149 - from 20210407 at git hash (f896415fa0 MySQL, 0f8c041ea RocksDB), RocksDB 6.19
  • fbmy5635_rel_202203072101 - from 20220307 at git hash (e7d976ee MySQL, df4d3cf6fd RocksDB), RocksDB 6.28.2
  • fbmy5635_rel_202205192101 - from 20220519 at git hash (d503bd77 MySQL, f2f26b15 RocksDB), RocksDB 7.2.2
  • fbmy5635_rel_202208092101 - from 20220809 at git hash (877a0e585 MySQL, 8e0f4952 RocksDB), RocksDB 7.3.1
  • fbmy5635_rel_202210112144 - from 20221011 at git hash (c691c7160 MySQL, 8e0f4952 RocksDB), RocksDB 7.3.1
  • fbmy5635_rel_202302162102 - from 20230216 at git hash (21a2b0aa MySQL, e5dcebf7 RocksDB), RocksDB 7.10.0
  • fbmy5635_rel_202304122154 - from 20230412 at git hash (205c31dd MySQL, 3258b5c3 RocksDB), RocksDB 7.10.2
  • fbmy5635_rel_202305292102 - from 20230529 at git hash (b739eac1 MySQL, 03057204 RocksDB), RocksDB 8.2.1
  • fbmy5635_rel_20230529_832 - from 20230529 at git hash (b739eac1 MySQL) but with RocksDB at version 8.3.2
  • fbmy5635_rel_20230529_843 - from 20230529 at git hash (b739eac1 MySQL) but with RocksDB at version 8.4.3
  • fbmy5635_rel_20230529_850 - from 20230529 at git hash (b739eac1 MySQL) but with RocksDB at version 8.5.0
Benchmark

The insert benchmark was run in two setups.

  • cached by RocksDB - all tables are cached by RocksDB
  • IO-bound - the database is larger than memory

The benchmark used a c2-standard-30 server from GCP with Ubuntu 22.04, 15 cores, hyperthreads disabled, 120G of RAM and 1.5T of storage from RAID 0 over 4 local NVMe devices with XFS.

The benchmark is run with 8 clients and 8 tables (client per table). The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows per table where X is 20 for cached and 500 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 200 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail. This step took ~20,000 seconds.
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.

Configurations

I used two config (my.cnf) files: base and c5. The c5 config adds rocksdb_max_subcompactions=4.

Results

Performance reports are here for Cached by RocksDB (for base and for c5 config) and for IO-bound (for base and for c5 config).

Results: average throughput

This section explains the average throughput tables in the Summary section. I use relative throughput to save on typing where relative throughput is (throughput for some version / throughput for base case). When relative throughput is > 1 then some version is faster than the base case. Unless stated otherwise:

  • Base case is fbmy5635_rel_202104072149, the oldest build that uses RocksDB 6.19
  • Some version is fbmy5635_rel_20230529_850, the newest build that uses RocksDB 8.5.0

For Cached by RocksDB with the base config (see here)

  • Relative throughput for (l.i0, l.x, l.i1, q100, q500, q1000) is (0.96, 0.91, 0.95, 0.91, 0.90, 0.90)
  • Modern MyRocks gets 5% to 10% less throughput vs an old build that used RocksDB 6.19. The regression arrives slowly -- a small decrease in each new release.
  • From the HW perf metrics new CPU overhead probably explains the perf regressions
    • For l.i1 the CPU/operation overhead grew from 90 to 103 or 14.4% (see cpupq)
    • For q100 the CPU/operation overhead grew from 206 to 227 or 10.1% (see cpupq)

For Cached by RocksDB with the c5 config (see here)

  • Relative throughput for (l.i0, l.x, l.i1, q100, q500, q1000) is (1.04, 1.00, 1.00, 0.95, 0.95, 0.94)
  • Modern MyRocks does the same or better than old MyRocks for the write-intensive steps and gets ~5% less throughput for the read-intensive steps.

For IO-bound with the base config (see here)

  • Relative throughput for (l.i0, l.x, l.i1, q100, q500, q1000) is (1.03, 0.92, 0.94, 0.91, 0.91, 0.90)
  • Modern MyRocks gets ~10% less throughput on the read-intensive steps, 6% less on the most important write-intensive step (l.i1) but is 3% faster at the load step (l.i0).
  • From the HW perf metrics new CPU overhead probably explains the perf regressions
    • For l.i1 the CPU/operation overhead grew from 115 to 126 or 9.5% (see cpupq)
    • For q100 the CPU/operation overhead grew from 249 to 282 or 13.2% (see cpupq)

For IO-bound with the c5 config (see here)

  • Relative throughput for (l.i0, l.x, l.i1, q100, q500, q1000) is (1.08, 0.92, 0.94, 0.91, 0.91, 0.90)
  • Modern MyRocks gets ~10% less throughput on the read-intensive steps, 6% less on the most important write-intensive step (l.i1) but is 8% faster at the load step (l.i0).
  • Modern MyRocks has ~13% less write-amplification. See the wkbpi column (KB written to storage/insert) from the l.i1 step where it drops from 2.100 with RocksDB 6.19 to 1.825 with RocksDB 8.5.0.