Monday, January 30, 2017

Compaction stalls: something to make better in RocksDB


In previous results that I shared for the insert benchmark, it was obvious that MyRocks throughput is steady when the workload transitions from in-memory to IO-bound. The reason is that non-unique secondary index maintenance is read-free for MyRocks, so there are no stalls for storage reads of secondary index pages. Even with the change buffer, InnoDB is eventually slowed by storage reads and page writeback.

It was less obvious that MyRocks has more variance on both the in-memory and IO-bound insert benchmark tests. I try to be fair when explaining storage engine performance, so here I provide a few more details along with results for InnoDB in MySQL 5.7.10 and 5.6.26 and for MyRocks from our fork of MySQL 5.6. The binlog was enabled for all tests, fsync-on-commit was disabled, and 16 clients inserted 500m or 2b rows into 16 tables in PK order. Each table has 3 secondary indexes. For MyRocks I made one configuration change from the earlier result: I increased level0_slowdown_writes_trigger from 10 to 20 to reduce compaction stalls. This has a potential bad side effect of making queries slower, but this test was insert-only.
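
For reference, the slowdown trigger is set through the column family options string in my.cnf. A minimal sketch, assuming the rocksdb_default_cf_options form used by MyRocks; the real string also carries the other column family settings, separated by semicolons:

# my.cnf sketch: raise the L0 slowdown trigger from 10 to 20
rocksdb_default_cf_options=level0_slowdown_writes_trigger=20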

For both graphs the y-axis is the average insert rate per 5-second interval and the x-axis is the interval number. I used mstat to collect the data.

The goal is to match InnoDB in MySQL 5.7 in quality of service while providing much better throughput. We have some work to do. On the bright side, this is an opportunity for someone to make RocksDB better.

In-memory

The first graph is for the in-memory workload where all data is cached by the storage engine: nothing is read from storage, but many storage writes are done to persist the changes. InnoDB with MySQL 5.7 is much faster than with 5.6 and also has the least variance. MyRocks has the most variance, largely from compaction stalls when there are too many files in level 0 of the LSM tree.


IO-bound

The second graph is for the IO-bound workload. The graph ends first for MyRocks because it sustains the highest average insert rate. But it also has a thick line because of variance. InnoDB from MySQL 5.6 also has a lot of variance.

Monday, January 23, 2017

The value of write efficiency for the insert benchmark

I shared results last week for the insert benchmark. The test server for that result had fast SSD, courtesy of several NVMe NAND flash devices. I repeated the tests on a server with slower SSD and the results are interesting. Regardless of the storage device, MyRocks did better when the database was larger than RAM and InnoDB 5.7.10 did better when the database fit in RAM. But the difference between MyRocks and InnoDB 5.7.10 for the in memory workload is much smaller on slow SSD than on fast SSD. It is always risky to ignore the context from benchmark results, and the important context here is the performance of the storage device.

Slower SSD hurts InnoDB more than it hurts MyRocks because MyRocks is more efficient for writes. Spending less IO on writes leaves more IO capacity for additional writes or reads, as I previously demonstrated for Linkbench.

When the database fits in RAM there are no reads from storage. But there are many writes to storage whether or not the database is larger than RAM, so the insert benchmark workload is always IO heavy. From looking at PMP stacks, InnoDB throughput is limited by the rate at which the redo log can be written, and InnoDB in 5.6.26 has more CPU overhead from mutex contention than 5.7.10. Note that the binlog was enabled but sync-on-commit was disabled, so there was more stress on redo log throughput.
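
A minimal my.cnf sketch of those durability settings, assuming the standard MySQL and MyRocks option names:

log_bin                            # binlog enabled
sync_binlog=0                      # no binlog fsync on commit
innodb_flush_log_at_trx_commit=2   # no InnoDB redo fsync on commit
rocksdb_flush_log_at_trx_commit=2  # no RocksDB WAL fsync on commit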

Workload

The workload is fully described in my previous post. 500m rows are inserted for the in memory test and 2b rows for the larger than memory test. All tests use 16 clients, each client inserts to a different table and each table has 3 secondary indexes to maintain.

The slower SSD provides about 10k page reads/second versus more than 100k/second for the fast SSD server. Both servers have similar CPUs and RAM.

By in memory I really mean that the database working set fits in memory, but it is easier to ignore that distinction and just claim that the database fits in memory. Sorry for the confusion.

Throughput

The throughput listed below is the average insert rate during the test. Things that I notice in the results include:
  • InnoDB loses more throughput when going from fast to slow SSD and from in memory to larger than memory workloads.
  • Restating the previous point, MyRocks throughput is more stable when going from fast to slow SSD and from in memory to larger than memory workloads.
  • InnoDB in MySQL 5.7.10 scales much better than 5.6.26 for in memory write-heavy workloads.

Throughput for in memory database

            slow-SSD  fast-SSD
MyRocks        83766    102712
InnoDB-5.7    124782    268873
InnoDB-5.6     66251    111111

Throughput for larger than memory database

            slow-SSD  fast-SSD
MyRocks        86401   115234
InnoDB-5.7     38207    65604
InnoDB-5.6     14784    52812

IO Efficiency

Things that I notice in the results include:
  • The value of wKB/i for MyRocks is somewhat smaller than for InnoDB with the in memory database and much smaller with the larger than memory database. I expected this.
  • InnoDB in MySQL 5.7 is able to sustain higher IO rates than 5.6 thanks to many performance improvements.
  • Assuming the slow SSD saturates around 400 MB/s of writes, MyRocks writes less per insert and therefore sustains more inserts/second when IO saturates (see the worked arithmetic after the tables below).

Legend:
* IPS - average inserts/second
* rMB/s, wMB/s - read & write MB/s via iostat
* rKB/i, wKB/i - read & write KB per insert via iostat
* note that iostat usually overstates bytes written by 2X for RocksDB because it counts bytes trimmed as bytes written

IO metrics for in memory database
             IPS     rMB/s   wMB/s   rKB/i   wKB/i
- slow SSD
MyRocks       83766    0      280     0       3.34
InnoDB-5.7   124782    0      533     0       4.27
InnoDB-5.6    66251    0      281     0       4.24
- fast SSD
MyRocks      102712    0      339     0       3.30
InnoDB-5.7   268873    0     1163     0       4.35
InnoDB-5.6   111111    0      451     0       4.05

IO metrics for larger than memory database
             IPS     rMB/s   wMB/s   rKB/i   wKB/i
- slow SSD
MyRocks       86401   86      346     1.00    4.00
InnoDB-5.7    38207   42      385     1.21   10.08
InnoDB-5.6    14784   36      169     2.44   11.41
- fast SSD
MyRocks      115234  118      457     1.03    3.96
InnoDB-5.7    65604   75      878     1.14   13.38
InnoDB-5.6    52812   71      525     1.34    9.94
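
To make the saturation arithmetic concrete: wKB/i is (wMB/s * 1024) / IPS, and dividing an assumed write ceiling by wKB/i estimates the insert rate at saturation. A sketch in SQL using the slow SSD, larger than memory numbers above; the 400 MB/s ceiling is the assumption from the list above, and the results don't match the table exactly because the table averages cover the whole run:

-- wKB/i = (wMB/s * 1024) / IPS, e.g. MyRocks on slow SSD
SELECT 346 * 1024 / 86401;   -- ~4.1 KB written per insert

-- estimated inserts/second at a 400 MB/s write ceiling
SELECT 400 * 1024 / 4.0;     -- MyRocks: ~102,400
SELECT 400 * 1024 / 10.1;    -- InnoDB-5.7: ~40,500

This lines up with the table: InnoDB-5.7 ran near the assumed ceiling (385 wMB/s) while MyRocks stayed below it (346 wMB/s), so MyRocks was not write-limited on the slow SSD.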

Tuesday, January 17, 2017

Insert benchmark, MyRocks and InnoDB

I haven't been sharing many performance results on my blog as I have been saving results for my conference talks. Later this year I expect to blog more and travel less. Today I will share two results from the insert benchmark to compare MyRocks and InnoDB.

tl;dr - the results are expected
  • MyRocks insert performance doesn't degrade when the database is larger than memory
  • InnoDB in MySQL 5.7 scales much better than in MySQL 5.6 for in-memory workloads
  • InnoDB in MySQL 5.7 wins for in-memory, MyRocks wins for io-bound

Configuration

This test does inserts in PK order into a table with 3 secondary indexes. Without the secondary indexes the workload is friendly to a b-tree, but secondary index maintenance for a b-tree is read-modify-write and the reads from and writes to storage make InnoDB slower when the working set doesn't fit in RAM. InnoDB is likely to do better than other b-trees because of the insert buffer. With MyRocks, non-unique secondary index maintenance is read-free, so performance doesn't drop much when the working set no longer fits in RAM.
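
For reference, a sketch of the insert benchmark table, based on the public iibench schema; the exact columns may differ from the script used here:

CREATE TABLE purchases_index (
  transactionid BIGINT NOT NULL AUTO_INCREMENT,
  dateandtime DATETIME,
  cashregisterid INT,
  customerid INT,
  productid INT,
  price FLOAT,
  data VARCHAR(4000),
  PRIMARY KEY (transactionid),  -- inserts arrive in PK order
  KEY marketsegment (price, customerid),  -- 3 non-unique secondary indexes
  KEY registersegment (cashregisterid, price, customerid),
  KEY pdc (price, dateandtime, customerid)
);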

I ran the insert benchmark using this script with two configurations. Both configurations used 16 clients, 16 tables and a client per table. For the io-bound configuration there was 50gb of RAM for the database and OS and 2B rows were inserted.  For the in-memory configuration there was 256gb of RAM for the database and OS and all data was cached by the database cache.

The test server has 48 cores with HT enabled and fast NVMe flash. For MyRocks I used the FB MySQL repo. For InnoDB I used MySQL 5.6.26 and 5.7.10. I tried to use a good my.cnf for all engines and the MyRocks settings are similar to what we suggest. For the io-bound configurations I used a 10gb RocksDB block cache and 35gb InnoDB buffer pool. For the in-memory configurations all database caches were set to 190gb. The binlog was enabled for all tests, but sync on commit was disabled for the database redo log & binlog.
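
As a sketch, the cache sizing above amounts to a few my.cnf lines; this is illustrative, assuming the standard option names, and not the exact files used:

# io-bound configuration
innodb_buffer_pool_size=35G    # InnoDB buffer pool
rocksdb_block_cache_size=10G   # RocksDB block cache
# in-memory configuration: the cache for the engine under test was 190G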

I disabled compression for all engines in this test. InnoDB still uses much more space on disk, probably because of fragmented leaf pages.

Average insert rates

The table below lists the average insert rate for each configuration. It is odd that the insert rate for io-bound MyRocks is better than for in-memory and I am not sure I will debug it. There isn't much difference between InnoDB from MySQL 5.6 and 5.7 for the io-bound configuration, but there is a huge difference between them for the in-memory configuration. I assume the difference comes from all of the performance work done upstream (thank you). MyRocks is much faster than InnoDB for the io-bound configuration, while InnoDB in MySQL 5.7 is much faster than MyRocks for the in-memory configuration. I expect MyRocks to do better on the in-memory setup in the future.

Legend for the table:
  • io-bound - average inserts/second for the io-bound configuration
  • in-memory - average inserts/second for the in-memory configuration
  • size - database size in GB at test end
  • engine - database engine

io-bound  in-memory  size   engine
115234    103088     226    MyRocks
 65604    267523     362    InnoDB-5.7.10
 52812    111259     362    InnoDB-5.6.26

IO-bound results

I have two graphs to show the insert rates over time. This is data from one of the 16 clients rather than the global rate because my test script forgot to collect the global rate. The first graph uses a log scale for the x-axis and the second does not. The first graph makes it easier to see how throughput drops for InnoDB as the database gets larger. While it is odd that the throughput for InnoDB 5.7 picks up at the end, that can occur if the thread I monitored ran longer than the other threads and had less competition for HW resources near the end of the test. The graph for MyRocks stops earlier than the InnoDB graphs because it finished inserting 2B rows much sooner.

As a bonus, MyRocks writes about 1/4 as much to storage per inserted row as InnoDB. That makes an SSD last longer, and using less IO for writes leaves more IO for reads.



Tuesday, January 3, 2017

MyRocks, MongoRocks and RocksDB at Percona Live 2017

Percona Live 2017 is a great place to learn about MyRocks, MongoRocks and RocksDB. The MyRocks community continues to grow: one of the talks is from Alibaba and I am excited that they are helping to make MyRocks better.

Talks about RocksDB

Talks about !RocksDB

There are two other talks that I want to attend given their focus on storage engines:
