Monday, April 3, 2023

Impact of writes on read performance

This post compares performance from sysbench read-only microbenchmarks before and after the database has been subject to random writes. Results are shared for MyRocks, InnoDB and Postgres.

tl;dr

  • if you care about read-write performance then run read-write benchmarks and be wary of read-only benchmarks. Alas, a common benchmark practice is load and go (load, create indexes, run read-only or read-heavy tests).
  • The penalty for the random writes (the post result) was largest for MyRocks and smallest for InnoDB. The penalty for Postgres range queries was closer to MyRocks and for point queries was closer to InnoDB.
  • The penalties were larger on the c2 server than on the Beelink. At least for MyRocks this isn't a surprise because the LSM tree on the c2 server has more levels.
  • In the worst case, MyRocks gets ~50% of the QPS for point and ~61% of the QPS for range queries after random writes have been done versus before they have been done

Editorial

Measuring read-only performance for a database can be misleading because there is much context that affects the results and that context isn't always documented. The context is the state of the various database structures that will be accessed for queries. This state has a significant impact on the CPU overhead per query.

The problem is that the database structures can be in one of two states:

  • dynamic - by dynamic I mean there is background work in progress -- memtable flush and compaction for an LSM, vacuum and writeback for Postgres, purge and writeback for InnoDB. When the work is in progress there is competition for resources including CPU and IO. When the work has not been completed then queries might be more expensive - for example, a query on a covering secondary index might have to fetch the base row with Postgres and InnoDB.
  • static - by static I mean that background work is not in progress. The problem here is that the database can settle in a state that isn't deterministic across benchmarks. One example is the amount of data in the memtable and number of L0 files with an LSM. When looking for CPU regressions it is hard to distinguish between a regression and more/less overhead from non-deterministic states.
For RocksDB with leveled compaction one extreme WRT the index structure state is to have all data in one level (extremely read friendly). But more common states are:
  • Load in key order - while the LSM tree probably uses multiple levels to store the data the SST files will not overlap and bloom filter checks will only be done for at most one SST as the min/max key per SST can prune all but the SST that has the data. Thus, some CPU/query will be saved. But even in this state the amount of data in the memtable and the number of SSTs in the L0 can have a significant impact on CPU/query.
  • After random writes - once enough random writes have been done there will be overlap between SST files and bloom filter checks must be done.

If you want to understand how a database performs for a read-write workload then ignore read-only benchmark results and get storage/DBMS into a steady state before taking measurements. For more advice on how to do LSM benchmarks see here and here.

I hope that RocksDB gets adaptive behavior that detects when a workload shifts to read-only or read-mostly mode and does the right thing (flush memtable, flush L0, maybe flush L1). Long-ago RocksDB inherited read-triggered compaction from LevelDB, which is adaptive behavior, but that was disabled years ago. Issue 8004 is a request to restore it, but the behavior I seek is more comprehensive.

Benchmarks

A description of how I run sysbench is here. This uses the results from benchmarks I recently ran using my home Beelink servers and larger c2-standard-60 servers from GCP. In both cases the database is cached by the DBMS. For the Beelink server there is 1 table with 20M rows and for the c2 server there are 4 tables with 50M rows per table. There were 42 microbenchmarks put into 3 groups -- point queries, range queries and writes. I focus on the point and range query microbenchmarks here. Each of the 42 was run for 10 minutes. For an overview of the benchmarks see here and here.

Results are provided for MyRocks from FB MySQL 8.0.28, InnoDB from upstream MySQL 8.0.28 and Postgres 15.2.

The benchmark sequence is:

  1. Load tables in PK order 
  2. Create secondary indexes
  3. Do the post-write work
  4. Run read-only microbenchmarks. These are the pre results.
  5. Do random writes
  6. Do the post-write work
  7. Run read-only microbenchmarks. These are the post results.

The post-write work is explained in the How I run sysbench blog post which includes links to the code for the post-write work. It is done to get database structures into a more deterministic state and from above there are pre and post results. The pre (post) results are the read-only microbenchmarks run before (after) subjecting the database to many random writes. The post-write work includes sleep time which is a function of table size and was 20 minutes for the c2 server and 3 minutes for the Beelink server.

This is the shape of the MyRocks LSM tree:

Graphs

A spreadsheet with results is here, the pre results are the read-only microbenchmarks run prior to random writes and the post results are from after the random writes.

The graphs use relative throughput which is throughput for post / throughput for pre. When the relative throughput is > 1 then results from post are better than pre. But the relative throughput is usually < 1 because results from pre are usually better than post (because random writes make index structures less read friendly).

Results from the Beelink server.

Results from the c2 server.


Summary statistics

Summary statistics are provided for the point and range query benchmarks to compare the pre and post results.

Summary of the summary stats:

  • The penalty for random writes (the post result) was largest for MyRocks and smallest for InnoDB. The penalty for Postgres range queries was closer to MyRocks and for point queries was closer to InnoDB. 
  • The penalties were larger on the c2 server than on the Beelink. At least for MyRocks this isn't a surprise because the LSM tree on the c2 server has more levels.

Summary statistics from the Beelink server:

MyRocksInnoDBPostgres
Point: avg0.841.000.98
Point: median0.901.000.98
Point: min0.590.980.96
Point: max0.971.020.99
Point: stddev0.1310.0130.011
Range: avg0.911.000.93
Range: median0.921.000.94
Range: min0.861.000.85
Range: max0.931.010.98
Range: stddev0.0230.0040.050

Summary statistics from the c2 server:

MyRocksInnoDBPostgres
Point: avg0.671.000.99
Point: median0.661.010.99
Point: min0.500.980.98
Point: max0.901.021.00
Point: stddev0.1210.0130.006
Range: avg0.851.000.88
Range: median0.881.000.89
Range: min0.610.950.68
Range: max0.921.011.00
Range: stddev0.1110.0210.119


No comments:

Post a Comment

Speedb vs RocksDB on a large server

I am happy to read about storage engines that claim to be faster than RocksDB. Sometimes the claims are true and might lead to ideas for mak...