Tuesday, June 27, 2023

Insert+delete benchmark, medium server and MyRocks

This has results for the new insert benchmark (with deletes enabled) for MyRocks on a medium server. I ended up repeating the benchmark 3 times (round 2 after realizing I needed to log the response time for potentially slow queries, round 3 because I needed to use a new MyRocks build). Results for the benchmark on a small server are here and here.

tl;dr

  • MyRocks in 8.0.28 has better perf than in 5.6.35 for most of the benchmark steps. This wasn't true on the small server. I suspect that one reason for the change is that a server CPU was used here while the small server uses a mobile CPU -- the Beelink uses an AMD Ryzen 7 4700u and /proc/cpuinfo from the c2 server shows Intel(R) Xeon(R) CPU @ 3.10GHz.
  • there is a ~5 second write stall for the l.i1 benchmark step in one of the configurations. I have more work to do to explain it.

Benchmarks

The medium server is c2-standard-30 from GCP with 15 cores, hyperthreads disabled, 120G of RAM, and 1.5T of XFS via SW RAID 0 over 4 local NVMe devices.

An overview of the insert benchmark is here, here and here. The insert benchmark was run for 8 clients. The read+write steps (q100, q500, q1000) were run for 3600 seconds each. The delete per insert option was set for l.i1, q100, q500 and q1000.

Benchmarks were repeated for three setups:
  • cached by RocksDB - all data fits in the 80G RocksDB block cache. The benchmark tables have 160M rows and the database size is ~12G.
  • cached by OS - all data fits in the OS page cache but not the 4G RocksDB block cache. The benchmark tables have 160M rows and the database size is ~12G.
  • IO-bound - the database is larger than memory. The benchmark tables have 4000M rows and the database size is ~281G.
The my.cnf files are here for each of the setups.
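The main setup-specific difference is the RocksDB block cache size. A minimal sketch of the relevant my.cnf lines, using only the values stated above (all other settings are shared and omitted here):

    # cached by RocksDB: all data fits in the 80G block cache
    rocksdb_block_cache_size=80G

    # cached by OS and IO-bound: 4G block cache, rely on the OS page cache
    rocksdb_block_cache_size=4G
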
The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows per table without secondary indexes where X is 20 for the cached setups and 500 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 50 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start because the inserts are done to the table head while the deletes are done from the tail (see the sketch after this list).
  • q100
    • do queries as fast as possible with 100 inserts/s/client and the same rate for deletes/s done in the background
  • q500
    • do queries as fast as possible with 500 inserts/s/client and the same rate for deletes/s done in the background
  • q1000
    • do queries as fast as possible with 1000 inserts/s/client and the same rate for deletes/s done in the background
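The sketch below is not the real benchmark client -- the table columns, SQL and rate limiting are assumptions, and it expects a DB-API style connection -- but it shows the insert+delete pattern used by l.i1 and by the background writers in q100, q500 and q1000: each insert at the head of a table is paired with a delete from the tail so the number of rows per table stays constant.

    import itertools, random, string, time

    def make_row(pk):
        # hypothetical row shape: PK plus an indexed int and a padding string
        return (pk, random.randint(0, 1 << 30),
                ''.join(random.choices(string.ascii_letters, k=100)))

    def insert_delete_step(conn, table, num_writes, next_pk, inserts_per_sec=None):
        # Sketch of one l.i1-style client. Each insert goes to the head of the
        # table (new, larger PK) and is paired with a delete from the tail
        # (smallest PK) so rows/table is unchanged at the end of the step.
        cur = conn.cursor()
        for pk in itertools.islice(itertools.count(next_pk), num_writes):
            start = time.time()
            cur.execute("INSERT INTO " + table + " (pk, k, pad) VALUES (%s, %s, %s)",
                        make_row(pk))
            # delete the oldest row to keep the table size constant
            cur.execute("DELETE FROM " + table + " ORDER BY pk ASC LIMIT 1")
            conn.commit()
            if inserts_per_sec:  # q100/q500/q1000 rate-limit the background writers
                time.sleep(max(0.0, 1.0 / inserts_per_sec - (time.time() - start)))

l.i1 would run this with inserts_per_sec=None (write as fast as possible) while q100, q500 and q1000 would use 100, 500 or 1000 per client with the queries running in the foreground.
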
MyRocks

Tests used MyRocks from FB MySQL 5.6.35 and 8.0.28 with the rel build for 5.6.35 and the rel_lto build for 8.0.28. The builds are described in a previous post.

I used old and new versions of MyRocks source code. The old versions were built from HEAD in February 2023. The new versions were built from HEAD in June 2023. The details are:
  • old versions
    • 5.6.35 - RocksDB 7.10.2, FB MySQL git hash 205c31dd
    • 8.0.28 - RocksDB 7.10.0, FB MySQL git hash unknown
  • new versions
    • 5.6.35 - RocksDB 8.2.1, FB MySQL git hash 7e40af67
    • 8.0.28 - RocksDB 8.3.1, FB MySQL git hash ef5b9b101
Reports

Performance reports are here for cached by RocksDB, cached by OS and IO-bound. For the cached setups there are results from builds of the old and new source code, distinguished by the _old and _new suffixes. For IO-bound I only have results from the new source code.

Comparing summaries for 5.6.35 vs 8.0.28 shows that the results here for 8.0.28 are a lot better than the results on the small server (see here for cached and IO-bound). I speculate above that one reason for the change is that the small server (Beelink) uses a mobile CPU while the c2 here uses a server CPU (Xeon).
  • 8.0.28 gets ~10% less throughput for l.i0 (more CPU/insert, see cpupq here)
  • 8.0.28 gets similar throughput for l.i1 (more CPU but fewer context switches, see cpupq and cspq here)
  • 8.0.28 gets 15% to 20% more throughput for queries (q100, q500, q1000) because there is less CPU/query, see cpupq here
From the response time distributions for cached by RocksDB, cached by OS and IO-bound:
  • more throughput mostly implies a better response time histogram
  • worst-case response times are bad (~5 seconds) in one case: l.i1 with cached by OS (see here). Note that the worst-case response time for l.i1 is <= 1 second for IO-bound and for cached by RocksDB. From the throughput vs time charts that show per-second insert rates and per-second max response times for cached by RocksDB, cached by OS and IO-bound, there is one blip on the cached by RocksDB chart.
From the charts for throughput vs time that use per-second measurements, there are write stalls in l.i1 for the cached by OS and IO-bound workloads but they are much worse for cached by OS.

Debugging the write stall

What else can I learn about the ~5 second write stalls in l.i1 in cached by OS? Unfortunately, my test scripts don't archive the RocksDB LOG file (not yet). I do have the output from SHOW ENGINE ROCKSDB STATUS run at the end of each benchmark step and from that I see:
  • there are no write stalls at the end of the step (l.x) that precedes l.i1 (see here)
  • the write stalls in l.i1 are from too many files in L0 (see here)
  • the average time for an L0->L1 compaction is in the Avg(sec) column (see here); I don't know if the median time is close to the average time. Again from here, based on the values in the Rn(GB), Rn+1(GB) and Comp(cnt) columns, the average L0->L1 compaction reads ~842M from L0 and ~1075M from L1 and then writes ~1884M. This is done by a single thread which processes the compaction input at ~64M/s. Note that L0->L1->L2 is the choke point for compaction because L0->L1 is usually single-threaded and L0->L1 usually cannot run concurrently with L1->L2. From the results for l.i1 with cached by RocksDB and IO-bound the stats aren't that different, yet those setups don't get ~5 second write stalls. Based on the configuration, compaction should be triggered with 4 SSTs in L0 (~32M each) and 4 SSTs in L1 (~64M each), which would be ~384M of input, but when compaction gets behind the input gets larger (the sketch after this list restates the arithmetic).
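To make the arithmetic above easier to follow, here it is restated as a small sketch. The numbers are the ones quoted above, taken from the compaction stats in SHOW ENGINE ROCKSDB STATUS rather than re-measured, and the duration estimate assumes the ~64M/s rate applies to the full input:

    # Back-of-the-envelope numbers for the average L0->L1 compaction during
    # l.i1 with cached by OS, using the values quoted above.
    read_l0_mb = 842        # average MB read from L0 per compaction
    read_l1_mb = 1075       # average MB read from L1 per compaction
    write_mb = 1884         # average MB written per compaction
    rate_mb_per_sec = 64    # single-threaded rate for processing the input

    input_mb = read_l0_mb + read_l1_mb
    print(f"avg input {input_mb} MB, avg output {write_mb} MB")         # ~1917 in, ~1884 out
    print(f"~{input_mb / rate_mb_per_sec:.0f} seconds per compaction")  # ~30 seconds

    # What the configuration should trigger when compaction is not behind:
    # 4 L0 SSTs at ~32M each plus 4 L1 SSTs at ~64M each.
    print(f"input when not behind: {4 * 32 + 4 * 64} MB")               # 384 MB
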
I need to repeat the cached by OS benchmark with more monitoring and archiving of LOG to try and explain this.
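A minimal sketch of that extra monitoring, assuming a DB-API style connection and the default MyRocks layout where the RocksDB files (including LOG) live under the .rocksdb subdirectory of the MySQL datadir -- the paths and the helper name are hypothetical:

    import shutil, time

    def archive_rocksdb_state(conn, datadir, dest, step_name):
        # Copy the RocksDB LOG and save SHOW ENGINE ROCKSDB STATUS at the end
        # of a benchmark step so write stalls can be explained after the fact.
        stamp = time.strftime("%Y%m%d_%H%M%S")
        # MyRocks uses <datadir>/.rocksdb by default; the location depends on
        # the rocksdb_datadir setting.
        shutil.copy2(f"{datadir}/.rocksdb/LOG", f"{dest}/LOG.{step_name}.{stamp}")
        cur = conn.cursor()
        cur.execute("SHOW ENGINE ROCKSDB STATUS")
        with open(f"{dest}/rocksdb.status.{step_name}.{stamp}", "w") as f:
            for row in cur.fetchall():
                # each row is (Type, Name, Status); Status holds the text
                f.write(str(row[-1]) + "\n")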
