Thursday, June 1, 2017

The Insert Benchmark

The insert benchmark was first published by Tokutek in C++. Since then I converted it to Python, Tokutek made my Python version faster, I added support for MongoDB, and I have been using it to evaluate storage engine efficiency. I think it is about 10 years old and it has been useful at finding things we can make better in storage engines. I hope to convert it to a faster language, but even in Python the client is still faster than the storage engines for IO-bound workloads, so it is not the bottleneck.

I use a helper script to run a sequence of tests. For all tests there is either one table shared by all clients or a separate table per client. Some storage engines suffer more from concurrency within a table. For MySQL each table has a primary key index and up to 3 secondary indexes, and I configure the tests to use 3 secondary indexes. The inserts during the load are in PK order but random for the secondary indexes. InnoDB benefits from the change buffer. MyRocks and MongoRocks benefit from read-free secondary index maintenance. MongoRocks and WiredTiger suffer from the hidden index that MongoDB uses for engines with clustered indexes, but the overhead is less for the insert benchmark than for Linkbench. The tests are below, followed by a sketch of the schema:
  1. load - this is the first test and uses N clients to load into the test table(s).
  2. scan - scans the indexes in this order: the primary index, each secondary index, then the primary index again. So there are 5 scans total (1 + 3 + 1 with 3 secondary indexes). The scan uses 1 client per index. For the test with 16 tables there are 16 concurrent scans per index. For the test with 1 table there is 1 client (no concurrency). Each scan is index only and returns no rows as non-indexed predicates exclude all rows.
  3. q1000 - this is the third test and uses N query clients and N writer clients. Each writer client is rate limited to 1000 inserts/second. With some storage engines the writer clients are unable to sustain that rate. With N writer clients the global write rate should be N*1000 inserts/second. The query clients perform range scans as fast as possible. The usage of tables is the same as during the load.
  4. q100 - this is the fourth test and uses N query clients and N writer clients. Each writer client is rate limited to 100 inserts/second. With N writer clients the global write rate should be N*100 inserts/second. The query clients perform range scans as fast as possible.
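To make that concrete, below is a rough sketch in Python of the schema and load pattern described above. The table name, column names and index definitions are my invention for illustration; iibench.py defines the real schema and queries, so treat this as an approximation.

    import random

    # Hypothetical DDL in the spirit of the insert benchmark: a PK that
    # grows in insert order plus 3 secondary indexes on random columns.
    CREATE_TABLE = """
    CREATE TABLE purchases (
      id bigint NOT NULL AUTO_INCREMENT,
      price float NOT NULL,
      customer int NOT NULL,
      product int NOT NULL,
      PRIMARY KEY (id),
      KEY ix_price (price, customer),
      KEY ix_product (product, price),
      KEY ix_customer (customer, product)
    )"""

    # A short index-only range scan of the kind the query clients run
    # as fast as possible during q1000 and q100.
    RANGE_QUERY = ("SELECT price, customer FROM purchases "
                   "WHERE price >= %s ORDER BY price LIMIT 1000")

    def make_rows(next_id, n):
        # PK values arrive in order while the secondary index columns
        # are random, which is what makes index maintenance expensive.
        for pk in range(next_id, next_id + n):
            yield (pk, random.uniform(0, 1000),
                   random.randrange(10**6), random.randrange(10**4))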
The benchmark client and my helper scripts are on github. The client is iibench.py and the top-level helper script is iq.sh. One day I will add a few comments to iq.sh to explain the command line. An example command line to run it for 500M rows, 16 clients and each client using a separate table is:
    bash iq.sh innodb "" ~/b/orig801/bin/mysql /data/m/my/data nvme0n1 1 1 no no no 0 no 500000000

When it finishes there will be four directories named l, scan, q1000 and q100. In each directory there is a file named o.res.$something that has performance and efficiency metrics. The l directory has results from the load step. The scan directory has results from the scans. The q1000 and q100 directories have results from the write+query steps where each writer is rate limited to 1000 and 100 inserts/second, respectively. An example o.res.$something file for a test with 500M rows and 1 client is here for the load test and here for the q1000 test. Each file has a section with efficiency metrics normalized by the insert rate and then efficiency metrics normalized by the query rate. For the load test only the first section is interesting. For the q1000 test both sections are interesting. For the q100 test only the second section is interesting because the insert rates are too low. Efficiency metrics from the scan are normalized by the number of rows scanned.
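The snippet below is a minimal sketch of how the two normalized sections of an o.res file could be split apart, assuming only the section header strings shown in the next section; it is not part of the benchmark scripts.

    def split_sections(path):
        # Split an o.res file on the two section headers described
        # below. A sketch, not part of the benchmark code.
        headers = ("iostat, vmstat normalized by insert rate",
                   "iostat, vmstat normalized by query rate")
        sections, current = {}, None
        with open(path) as f:
            for line in f:
                line = line.rstrip("\n")
                if line in headers:
                    current = line
                    sections[current] = []
                elif current is not None:
                    sections[current].append(line)
        return sections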

Metrics for load, q1000 and q100

The first section starts with the text iostat, vmstat normalized by insert rate and the second section starts with the text iostat, vmstat normalized by query rate. Both report the same absolute rates from iostat and vmstat, normalized by the insert rate in the first section and by the query rate in the second. An example is:
iostat, vmstat normalized by insert rate
samp    r/s     rkb/s   wkb/s   r/q     rkb/q   wkb/q   ips             spi
501     3406.8  54508   91964   3.410   54.563  92.057  999.0           0.100100

samp    cs/s    cpu/c   cs/q    cpu/q
525     16228   17.3    16.244  0.017276

iostat, vmstat normalized by query rate
samp    r/s     rkb/s   wkb/s   r/q     rkb/q   wkb/q   qps             spq
501     3406.8  54508   91964   4.023   64.370  108.602 846.8           0.118092

samp    cs/s    cpu/c   cs/q    cpu/q
525     16228   17.3    19.164  0.02038

The metric names are:
  • samp - number of iostat or vmstat samples collected during the test
  • r/s, rkb/s, wkb/s - average values for iostat r/s, rKB/s and wKB/s during the test
  • r/q - iostat r/s divided by the insert or query rate
  • rKB/q, wKB/q - iostat rKB/s and wKB/s divided by the insert or query rate
  • ips, qps - average insert and query rate
  • spi, spq - seconds per insert, seconds per query - average response time for inserts and queries
  • cs/s - average vmstat cs/s rate (cs is context switch)
  • cpu/c - average CPU utilization from vmstat us & sy (user & system)
  • cs/q - context switches per insert or per query
  • cpu/q - CPU utilization divided by insert or query rate then multiplied by a constant. This is only comparable between servers with the same CPU count.
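As a sanity check, the normalized columns in the sample output above can be reproduced from the absolute rates. The sketch below does the division for the insert-rate section. Note that spi and spq are response times measured by the client rather than 1/rate, and cpu/q includes a scaling constant, so those columns will not come out of this arithmetic.

    # Reproduce normalized columns from the sample output above.
    r_s, rkb_s, wkb_s = 3406.8, 54508.0, 91964.0  # iostat averages
    cs_s, cpu_c = 16228.0, 17.3                   # vmstat averages
    ips = 999.0                                   # average insert rate

    print(r_s / ips)    # r/q   -> 3.410
    print(rkb_s / ips)  # rkb/q -> 54.563
    print(wkb_s / ips)  # wkb/q -> 92.057
    print(cs_s / ips)   # cs/q  -> 16.244
    # cpu/q is close to cpu_c / ips, but the vmstat and iostat sample
    # windows differ (525 vs 501 samples) and a constant is applied,
    # so 17.3 / 999.0 does not exactly match the 0.017276 above.
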
Metrics for scan

Metrics for scan are measured differently. Absolute rates from iostat and vmstat are collected. But normalized rates are computed by dividing by the row scan rate (rows read / second). The header for the metrics looks like:
secs    rMB/s   wMB/s   r/o     rKB/o   wKB/o   rGB     cs/o    Mcpu/o  wa.sec  engine

And a legend for these columns is:
  • secs - seconds to finish the full scan of the index
  • rMB/s, wMB/s - iostat read MB/second and write MB/second
  • r/o, rKB/o, wKB/o - iostat reads/row, KB read/row and KB written/row. There can be writes because the index scans are done immediately after the load and compaction or page writeback can be in progress.
  • cs/o - context switches per row read using the vmstat cs column
  • Mcpu/o - normalized CPU overhead per row, analogous to cpu/q above
  • wa.sec - number of seconds in IO wait state computed from vmstat wa
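Assuming each per-row value is simply the corresponding absolute rate divided by the row scan rate, the normalization looks like the sketch below. Mcpu/o and wa.sec are left out because their exact scaling is not spelled out here.

    def scan_metrics(rows_scanned, secs, r_s, rkb_s, wkb_s, cs_s):
        # Divide absolute iostat/vmstat rates by the row scan rate
        # (rows read / second) to get per-row values. An assumed
        # formula, not code from the benchmark scripts.
        rows_per_sec = rows_scanned / secs
        return {"r/o": r_s / rows_per_sec,
                "rKB/o": rkb_s / rows_per_sec,
                "wKB/o": wkb_s / rows_per_sec,
                "cs/o": cs_s / rows_per_sec}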
