Tuesday, May 16, 2023

Understanding the insert benchmark results: variance and write stalls

Today I published a report for the insert benchmark with InnoDB on a big server and then updated the reports for MyRocks and Postgres. One goal is to find things that can be made better. Another goal is to understand how performance and efficiency change when the context changes, where context might be the version of the DBMS (newer releases fix some things, but add CPU overhead) or whether the database is cached by the DBMS.

Most of my reports focus on one DBMS and don't compare MyRocks with Postgres and InnoDB, but I will do some of that here. I try to leave benchmarketing to others, but I am happy to do perf bug marketing via my posts.

The results are interesting, but before discussing them I have a few disclaimers:

  • I risk making strong conclusions from small samples. For example, I only ran the benchmark once per DBMS/configuration. My focus is on breadth: I tested many builds per DBMS. Had I tried to get 3 or 5 results per setup then I would have used 3X or 5X as much machine time.
  • Comparisons between setups can be hard because the work rates are not fixed. For the write-only benchmark steps (l.i0, l.i1) the inserts are done as fast as possible. For the read+write benchmark steps the background write rate is fixed, but the query clients run as fast as possible. Some comparisons would be easier if I were to fix the insert and query rates in all cases. It isn't unreasonable for a DBMS doing X inserts/second to have more stalls or response time variance than a DBMS doing X/10 inserts/second.
  • It is possible that better tuning will improve the results and I am open to feedback on that. However, I am wary of doing too much tuning, especially per-workload. I prefer to have a DBMS that offers robust tuning which does the right thing for many workloads. 
  • Context matters. The context here is a high-concurrency workload that fits in memory (either the OS page cache or the DBMS cache) so there are writes to storage but few reads from storage.
  • InnoDB might be suffering from bug 109595. I hope it gets fixed soon.
Most important disclaimer (updated)

I should have included this up front. These are results for workloads expected to remain in memory (no reads from storage). The benchmark does inserts and range queries. If too many inserts are done then the working set will no longer fit in memory, and on many servers that point arrives after only ~5 minutes. It is risky to make strong judgements from such a short interval.
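
To show why the in-memory phase is so short, here is a back-of-envelope estimate. All of the numbers in it are hypothetical and chosen only to illustrate the shape of the problem; they are not measurements from these servers.

# Back-of-envelope estimate of how long the in-memory phase lasts.
# All numbers are hypothetical, chosen only to illustrate the point.
insert_rate = 1_000_000    # inserts/second summed over clients (hypothetical)
row_size    = 400          # bytes/row including index entries (hypothetical)
cache_bytes = 128 * 2**30  # 128 GB for the DBMS cache + OS page cache

growth_per_sec = insert_rate * row_size            # ~400 MB/s of new data
minutes_in_mem = cache_bytes / growth_per_sec / 60
print(f"working set fits in memory for ~{minutes_in_mem:.1f} minutes")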

It is OK to make strong judgements on the CPU overhead of the foreground work done for inserts. But it is less OK to confirm that a storage engine doesn't get unhappy for an in-memory, write-heavy workload unless I am able to run that workload for a lot more time than 5 minutes.

While I have a replacement benchmark in progress that includes a fix for this problem, that isn't ready today. So I will revive a dormant feature in the insert benchmark to optionally do a delete per insert to keep the database at a fixed size.
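
As a minimal sketch of the delete-per-insert idea, assuming a table with an integer PK and MySQL-style SQL (the function, connection, and table names here are placeholders, not the insert benchmark's code):

def insert_with_trim(conn, table, rows):
    # Insert a batch, then delete an equally sized batch of the oldest
    # rows (smallest PK values) so the table stays at a fixed size.
    cur = conn.cursor()
    cur.executemany(f"INSERT INTO {table} (pk, v) VALUES (%s, %s)", rows)
    # DELETE ... ORDER BY ... LIMIT is MySQL syntax; Postgres would need
    # a subquery on the PK instead.
    cur.execute(f"DELETE FROM {table} ORDER BY pk ASC LIMIT %s", (len(rows),))
    conn.commit()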

Interesting results

A summary:

  • Max response time for queries and inserts with MyRocks is much smaller than it is for InnoDB and Postgres in all of the benchmark steps (l.i0, l.i1, q100.1, q500.1, q1000.1). I didn't expect this, although I know that worst-case write stalls have been made much smaller in recent RocksDB releases.
  • MyRocks QPS at per-second intervals has an interesting pattern in the read+write workloads (q100.1, q500.1, q1000.1). I assume this is the result of regular changes in the CPU overhead for queries -- memtable flush makes the memtable empty and the memtable search cost drops, and L0 -> L1 compaction makes the L0 empty so there are fewer SSTs to check. But then the search costs for the memtable and L0 grow again and the cycle repeats (see the toy model after this list).
  • Some of the max response times for queries and inserts with Postgres were much worse with a larger buffer pool (database cached by Postgres) than with a smaller buffer pool (cached by the OS but not by Postgres). This isn't a shock but it would be interesting to explain it one day.
  • The InnoDB report has the longest tl;dr. Performance changes much more with context than it does in the same tests for MyRocks and Postgres. That complexity isn't a feature; it means you have more things to worry about.
  • The difference in the l.i1 insert rate between cached by DBMS and cached by OS workloads is interesting.
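
For the MyRocks QPS pattern above, here is a toy model of the cycle. The constants are invented and only the sawtooth shape of the output matters: query cost climbs while the memtable and L0 fill, then drops when a flush or an L0 -> L1 compaction empties them.

# Toy model of the LSM read-cost cycle. All constants are invented.
memtable_rows, l0_ssts = 0, 0
for sec in range(60):
    memtable_rows += 10_000        # background inserts arriving per second
    if memtable_rows >= 50_000:    # memtable flush writes a new L0 SST
        memtable_rows, l0_ssts = 0, l0_ssts + 1
    if l0_ssts >= 4:               # L0 -> L1 compaction empties the L0
        l0_ssts = 0
    # query cost ~ memtable search (weighted) + one probe per L0 SST + rest
    cost = 2 * memtable_rows / 50_000 + l0_ssts + 1
    print(sec, round(cost, 2))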

How does perf change when the database isn't cached by the DBMS?

This table shows how the insert rate changes for the l.i0 and l.i1 benchmark steps between the cached by OS and cached by DBMS setups. Note that l.i0 is the initial load, done in key order with only a PK index, while l.i1 does inserts in random key order with 3 secondary indexes in place, so index maintenance is required for each of those indexes. Also, the 20-tab and 1-tab columns are for the setups with 20 tables (table per client) and 1 table (all clients share 1 table).

The goal is to document how the ratios change between the l.i0 and l.i1 benchmark steps to understand whether either is more sensitive to the database fitting in the DBMS cache. It is reasonable to expect insert rates to be smaller with l.i1 than with l.i0 because l.i1 maintains 4 indexes while l.i0 only maintains 1 and the inserts are in key order for l.i0 but in random order for l.i1.
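
To make the step structure concrete, here is a minimal sqlite3 sketch of it. The schema and row counts are made up; this is not the insert benchmark's real schema or code.

import random
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (pk INTEGER PRIMARY KEY, a INT, b INT, c INT)")

# l.i0: inserts in PK order while only the PK index exists
con.executemany("INSERT INTO t VALUES (?,?,?,?)",
                [(i, i, i, i) for i in range(1000)])

# create 3 secondary indexes (a separate step in the benchmark)
for col in ("a", "b", "c"):
    con.execute(f"CREATE INDEX i_{col} ON t({col})")

# l.i1: inserts in random PK order, so all 4 indexes need maintenance
rows = [(k, random.randrange(1000), random.randrange(1000),
         random.randrange(1000))
        for k in random.sample(range(1000, 100000), 1000)]
con.executemany("INSERT INTO t VALUES (?,?,?,?)", rows)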

The summary is:

  • Performance for l.i0 is mostly cache oblivious. This is expected given the inserts are in key order.
  • Performance for l.i1 is mostly cache oblivious for the LSM (MyRocks) but not the b-trees (Postgres, InnoDB). This too is expected.
    • Performance doesn't change for MyRocks when it doesn't cache the database because non-unique secondary index maintenance doesn't require reads from the secondary index (see the sketch after the table below).
    • Performance for Postgres with l.i1 drops (the ratio for l.i1 is less than for l.i0 with the same number of tables) but the drop is reasonable.
    • Performance for InnoDB with l.i1 drops too much. I have yet to explain this.

insert rates from: (cached by OS) / (cached by DBMS)
        20-tab  20-tab  1-tab   1-tab
        l.i0    l.i1    l.i0    l.i1
MyRocks 0.97    1.01    1.00    1.00
InnoDB  1.01    0.04    1.01    0.14
PG      1.04    0.36    1.04    0.50
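
To expand on the MyRocks result, here is a toy contrast between LSM and b-tree secondary index maintenance. This is pseudocode with hypothetical classes, not MyRocks or InnoDB internals.

class Memtable:
    def __init__(self):
        self.entries = []
    def put(self, key, value):
        # blind write: nothing is read before the write
        self.entries.append((key, value))

def lsm_index_insert(memtable, key, row_id):
    # Non-unique secondary index maintenance is a blind write into the
    # memtable, so a cold cache adds no reads and l.i1 is cache oblivious.
    memtable.put((key, row_id), b"")

def btree_index_insert(buffer_pool, key, row_id):
    # Read-modify-write: the leaf page must be fetched first. When the
    # index doesn't fit in the buffer pool this can be a storage read per
    # insert, which is the drop shown above for InnoDB and Postgres.
    page = buffer_pool.fetch_leaf(key)  # may require a storage read
    page.insert(key, row_id)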

How does perf change when index maintenance must be done?

This table shows the ratio of the insert rates for the l.i1 and l.i0 benchmark steps. See the previous section for a description of workloads.

The summary is:

  • The results for MyRocks show there is no difference between cached by DBMS and cached by OS because non-unique secondary index maintenance doesn't read from secondary indexes.
  • The results for InnoDB show that it is the most sensitive to not having the working set fit in the DBMS cache. The reduction for InnoDB from cached by DBMS to cached by OS is much larger than for Postgres.
  • The results for Postgres seem reasonable.

insert rates from: l.i1 / l.i0
        20-tab  20-tab  1-tab   1-tab
        by-DBMS by-OS   by-DBMS by-OS
MyRocks 0.29    0.30    0.22    0.22
InnoDB  0.41    0.02    0.68    0.09
PG      0.30    0.10    0.37    0.18

