Tuesday, January 17, 2017

Insert benchmark, MyRocks and InnoDB

I haven't been sharing many performance results on my blog as I have been saving results for my conference talks. Later this year I expect to blog more and travel less. Today I will share two results from the insert benchmark to compare MyRocks and InnoDB.

tl;dr - both are expected
  • MyRocks insert performance doesn't degrade when the database is larger than memory
  • InnoDB in MySQL 5.7 scales much better than in MySQL 5.6 for in-memory workloads
  • InnoDB in MySQL 5.7 wins for in-memory, MyRocks wins for io-bound

Configuration

This test does inserts in PK order into a table with 3 secondary indexes. Without the secondary indexes the workload is friendly to a b-tree. But secondary index for a b-tree is read-modify-write and the reads from and writes to storage will make InnoDB slower when the working set doesn't fit into RAM. InnoDB is likely to do better than other b-trees because of the insert buffer. With MyRocks, non-unique secondary index maintenance is read-free so performance doesn't drop much when the working set no longer fits in RAM.

I ran the insert benchmark using this script with two configurations. Both configurations used 16 clients, 16 tables and a client per table. For the io-bound configuration there was 50gb of RAM for the database and OS and 2B rows were inserted.  For the in-memory configuration there was 256gb of RAM for the database and OS and all data was cached by the database cache.

The test server has 48 cores with HT enabled and fast NVMe flash. For MyRocks I used the FB MySQL repo. For InnoDB I used MySQL 5.6.26 and 5.7.10. I tried to use a good my.cnf for all engines and the MyRocks settings are similar to what we suggest. For the io-bound configurations I used a 10gb RocksDB block cache and 35gb InnoDB buffer pool. For the in-memory configurations all database caches were set to 190gb. The binlog was enabled for all tests, but sync on commit was disabled for the database redo log & binlog.

I disabled compression for all engines in this test. InnoDB still uses much more space on disk, probably because of fragmented leaf pages.

Average insert rates

The average insert rates for each configuration. It is odd that the insert rate for io-bound MyRocks is better than for in-memory and I am not sure I will debug it. There isn't much difference between InnoDB from MySQL 5.6 and 5.7 for the io-bound configuration. There is a huge difference between them for the in-memory configuration. I assume the difference is all of the performance work done upstream (thank you). MyRocks is much faster than InnoDB for the io-bound configuration. InnoDB in MySQL 5.7 is much faster than MyRocks for the in-memory configuration. I expect MyRocks to do better on the in-memory setup in the future.

Legend for the table:
  • io-bound - average inserts/second for the io-bound configuration
  • in-memory - average inserts/second for the in-memory configuration
  • size - database size in GB at test end
  • engine - database engine

io-bound  in-memory  size   engine
115234    103088     226    MyRocks

 65604    267523     362    InnoDB-5.7.10
 52812    111259     362    InnoDB-5.6.26

IO-bound results

I have two graphs to show the insert rates over time. This is for data from one of the 16 clients rather than the global rate because my test script forgot to collect that. The first graph uses log scale for the x-axis. The second graph does not. The first graph makes it easier to see how throughput drops for InnoDB as the database gets larger. While it is odd that the throughput rate for InnoDB 5.7 picks up at test end, that can occur if the thread I monitored ran longer than other threads and had less competition for HW resources near test end. The graph for MyRocks stops earlier than the InnoDB graphs because it finishes the insert of 2B rows much earlier.

As a bonus for MyRocks, it writes about 1/4 the amount to storage per inserted row when compared to InnoDB. That makes SSD last longer and using less IO for writes saves more IO for reads.



Tuesday, January 3, 2017

MyRocks, MongoRocks and RocksDB at Percona Live 2017

Percona Live 2017 is a great place to learn about MyRocks, MongoRocks and RocksDB. The MyRocks community continues to grow as one of the talks is from Alibaba. I am excited that they help make MyRocks better.

Talks about RocksDB

Talks about !RocksDB
There are two other talks that I want to attend given their focus on storage engines:

Friday, December 30, 2016

The MyRocks community in 2016

The MyRocks community grew in 2016 with significant contributions from early evaluators. I appreciate their help and have been busy documenting the problems that they reported. I look forward to more contributions in 2017.

MariaDB Corporation and Percona announced plans to include MyRocks in their distributions. You will be able to get MyRocks with expert support. Equally big have been the contributions from Sergey Pertunya to MyRocks.

The MySQL team at Alibaba reported poor performance from range scans with sysbench. The problem was memory system contention on a performance counter and the fix made long range scans 2X faster on concurrent workloads.

I have been using sysbench more and created a script to run a sequence of tests with it. During my tests I noticed too much variance during the read-only tests and the problem is non-determinism from the state of the memtable and level 0 of the LSM tree. The overhead for a search of the LSM tree depends on the amount of data in the memtable and the number of files in level 0. This state is not deterministic when a read-only test follows a read-write test. I filed issue 427 to see if we can get RocksDB to adapt and flush data earlier than normal from the memtable and level 0 when a workload becomes read-only.

Sysbench is extremely useful even though the workloads provided by it are synthetic -- especially the read-only tests with a small (multi-GB) database and most tests use a uniform distribution for keys. There is a talk on sysbench at FOSDEM 2017 and I look forward to using it more next year.

Justin Swanhart pointed out the lousy performance MyRocks provided with a default configuration. As a result we improved the default configuration by increasing the memtable to 64M and block cache to 512M and I will share results at FOSDEM. See issues 369375, and 441.

Justin Swanhart and Rick Pizzi reported problems with too-large transactions leading to lousy performance and OOM. MyRocks buffers in memory all changes from a transaction until commit and at commit time the changes are briefly double buffered when copied into the memtable. Until we make this better the workaround is to use rocksdb_commit_in_the_middle for bulk loads. Even after we make this better that option will be useful for bulk loads. Today we have a poor limit on the max size of a transaction via rocksdb_max_row_locks. This is a limit on the number of locked rows rather than on the memory used by a transaction. Even worse, the default is 1B. We are changing the limit to be on the amount of memory used by a transaction. Issues for this include 341, 347, 348 and 365.

The MySQL team at Alibaba reported a performance regression from MyRocks group commit when running sysbench. They also reported a correctness problem that I ran into with sysbench. Auditing the code found another bug. A related problem is that rocksdb_disable_2pc was enabled by default, meaning engine & binlog crash safety wasn't guaranteed. We have changed the default configuration to be crash safe and renamed the option to avoid double negatives. We also used YSCB and sysbench to reproduce the performance regression and have work in progress to make this better. That requires collaboration between the MyRocks and RocksDB teams. It is fun to watch the teams solve problems. Issues for this problem include 474481, 482488.

Wednesday, November 23, 2016

MyRocks: use less IO on writes to have more IO for reads

Holiday is almost here and I wrote a long blog post on write-efficiency yesterday so this one will be short. A longer version of this is in progress because this is an interesting result for me to explain. We assume that an LSM is less efficient for reads because it is more efficient for writes and it is hard to be optimal for all of read, write & space efficiency.

For real workloads it is complicated and for now I include benchmarks in "real workloads".  Here is one interesting result from my IO-bound tests of Linkbench. The summary is that when you spend less on IO to write back changes then you can spend more on IO to handle user queries. That benefit is more apparent on slower storage (disk array) than on faster storage (MLC NAND flash) because slower storage is more likely to be the bottleneck.

IO-bound Linkbench means that I used a server with 50G of RAM and ran Linkbench with maxid1=1B (1B nodes). The MyRocks database was ~400G and the InnoDB database was ~1.6T. Both MyRocks and InnoDB used MySQL 5.6.26. The workload is IO-heavy and the database working set is not cached.

The interesting result is that the difference between MyRocks and InnoDB becomes larger as storage gets slower. Another way to describe this is that InnoDB loses more performance than MyRocks when moving from faster to slower storage. I assume this is because MyRocks uses less IO capacity for writing back database changes so it has more IO capacity for handling user queries.


                Transactions per second
                MyRocks InnoDB  MyRocks/InnoDB
Disk array      2195    414     5.3
Slow SSD        23484   10143   2.3
Fast SSD        28965   21414   1.4

The random operations per second provided by the storage devices above is approximately 1k for the disk array, 10k for the slow SSD and more than 100k for the fast SSD.

Tuesday, November 22, 2016

Why is MyRocks more write-efficient than InnoDB?

This year I shared results where InnoDB wrote between 10X and 20X more data to storage than MyRocks for the same workload. I use KB written to storage per transaction as a measure of write efficiency and I usually compute this with data from the benchmark client and iostat. I get KB written/second from iostat, average transaction/second from the benchmark client and divide the former by the latter to compute KB written/transaction. When using SSD this excludes the writes done by SSD firmware and I previously reported that the overhead was worse for InnoDB than for RocksDB on one vendor's device.

An engine that writes less to storage per transaction is more write efficient. It is a good thing if MyRocks writes 10X less to storage than InnoDB for the same workload. This might enable MyRocks to use lower-endurance SSD for workloads where InnoDB required higher-endurance SSD. This might enable MyRocks to use SSD for workloads in which the device would not last with InnoDB. This also means that MyRocks needs less overprovisioning on the SSD, which is another way of saying you get more capacity from the device.

This is an update on results I previously shared.

Why
There are a few reasons why MyRocks is more write-efficient than InnoDB:
  1. Doublewrite buffer
  2. Configuration
  3. Page size
Doublewrite
The InnoDB doublewrite buffer doubles the storage write rate. It does this for a good reason -- to protect against partial page writes. Perhaps one day SSD vendors will agree on an atomic-write solution that works across vendors and with popular file systems on Linux. Perhaps one day crash safe RAM will be a common thing in data centers. Perhaps one day we will have a copy-on-write filesystem that is widely used for InnoDB on Linux. Until then we are stuck with 2X write-amplification from the doublewrite buffer.

Configuration
I might be overstating this to make a point. If you have fast storage (NAND flash) and the database working set fits in RAM then you have too much RAM. If you have fast storage then configure the database to use it. Or keep the working set in RAM and use a disk array instead of NAND flash. But if you have an in-memory workload and a database engine that does random IO (update-in-place b-tree) then you still need IOPs capacity or you should switch to a proper in-memory database engine like Tarantool.

Most of the servers that I care about are setup so that the database working set isn't in RAM. I configure benchmarks like Linkbench in the same manner. I am not promising that MyRocks will write 10X less to storage than InnoDB for all use cases - workload and configuration matter. But it tends to be better.

Page size
InnoDB page size has a big impact on write-efficiency when the working set isn't cached because dirty b-tree pages will be evicted from the tail of the LRU earlier and when evicted they must be written back to storage (twice with InnoDB thanks to the doublewrite buffer). In the worst case pages are written back with only dirty row and the write-amplification in that case is sizeof(page) / sizeof(row). In the best case all rows on the page are dirty but the best case isn't likely when the working set isn't cached.

The obvious way to reduce write-amplification (and improve write-efficiency) is to reduce the database page size. The default page size for InnoDB is 16kb, but it is possible to use 8kb or 4kb pages at initialization by setting innodb_page_size or when using compressed tables. Both of these have a cost. First, InnoDB performance is greatly reduced when using compressed tables. Second, rows must fit in half of a page, excluding LOB columns, so a smaller page also means a smaller max row size. A smaller page also reduces the max size of an index key. See the manual for more details.

I repeated Linkbench with an IO-bound configuration - 50G of RAM, database is ~400G for MyRocks with zlib compression and ~1.6T for InnoDB without compression. The data below shows throughput (TPS is transactions/second) and the storage write-rate per transaction (iostat wKB/t) for MyRocks versus InnoDB with different page sizes. When using a smaller page size for InnoDB the storage write rate and TPS is better. I assume TPS is better because when less IO capacity is used for writes then more IO capacity is available for reads.


          Page-size  TPS     iostat-wKB/t
MyRocks   16kb       28965    1.25
InnoDB    4kb        24845    6.13
InnoDB    8kb        24352   10.52
InnoDB    16kb       21414   19.70

And graphs for the same data.


Monday, November 21, 2016

Sysbench, InnoDB, transaction isolation and the performance schema

I used sysbench to understand the impact of transaction isolation and the performance schema for InnoDB from upstream MySQL 5.6.26.

The test server has 24 CPU cores, 48 HW threads with hyperthreading enabled, 256G of RAM and fast SSD. For sysbench I used the 1.0 version with support for Lua. Tests were run in two configurations -- cached and IO-bound. For the cached configuration I used 8 tables, 1M rows/table and the database cache was large enough to cache all data. For the IO-bound configuration I used 8 tables, 10M rows/table, a 2G database cache and buffered IO so that all data was in the OS page cache. The database was ~2G for the cached configuration and ~20G for the IO-bound configuration. InnoDB table compression was not used and jemalloc was used. The binlog was enabled but sync-on-commit was disabled for the binlog and InnoDB redo log.

With 8 tables and 1M rows per table the database is very small -- a few GB. I am wary of drawing too many conclusions from sysbench results for such a small database but other people will use it to evaluate MyRocks so I have been spending more time to understand sysbench performance.

Both the cached and IO-bound configurations require data to be written to storage. For the cached test all page reads are served from the database cache. For the IO-bound test some reads are served from the database cache and misses are served from the OS page cache. In both cases nothing is read from storage.

tl;dr
  1. For InnoDB using repeatable-read can help performance.
  2. Disabling performance schema can help performance
  3. My tests are from MySQL 5.6.26. It will be interesting to understand what has changed in 5.7 and 8. I know that the overhead from #1 and #2 should be much less in modern MySQL.
Sysbench

I have a script that uses sysbench to run tests in an interesting pattern and used it for this test. The pattern is in the all.sh script and is explained below. The all.sh script uses the run.sh script to set the sysbench command line options. I run most of the read-write tests before the read-only tests to fragment the database before evaluating query performance.
  • load - load the 8 tables
  • update-index - uses oltp.lua and each transaction is an UPDATE statement that finds & changes 1 row by PK. Secondary index maintenance is required for the update statement.
  • update-nonindex - like update-index but secondary index maintenance is not required.
  • read-write - uses oltp.lua in read-write mode, a classic sysbench workload. Run with oltp-range-size=100.
  • read-only - uses oltp.lua in read-only mode, a classic sysbench workload. Run four times with oltp-range-size set 10, 100, 1000 and 10000.
  • point-query - uses oltp.lua to fetch one row by PK per query
  • select - uses select.lua to fetch one row per query
  • insert - uses insert.lua to insert one row per transaction. This grows the database and the growth amount depends on the the insert rate. So a faster engine will grow the database more than a slower engine. It also means that when I run the test for a long time that the database won't fit in the database or OS page cache. For these reasons I run this test last.
The test was run for 1, 2, 4, 8, 16, 24, 32, 40, 48, 64, 80, 96 and 128 concurrent clients. This is currently hardwired in the all.sh script. For each level of concurrency I ran sysbench for 3 minutes for the read-only tests and 5 minutes for the read-write tests. Eventually I will run it for more time at each level of concurrency but I had a large number of tests to run and am trying to figure out which configurations are interesting.

The my.cnf for these tests is here.

Results

Data for the results is here. The numbers provided are queries per second (QPS) not transactions per second (TPS). The configurations tested are described below:
  • innodb.8t.1m.rr.ps0 - 8 tables, 1M rows/table, repeatable-read, performance_schema=0
  • innodb.8t.1m.rr.ps1 - 8 tables, 1M rows/table, repeatable-read, performance_schema=1
  • innodb.8t.1m.rc.ps0 - 8 tables, 10M rows/table, read-committed, performance_schema=0
  • innodb.8t.1m.rc.ps1 - 8 tables, 10M rows/table, read-committed, performance_schema=1
My summary of performance is:
  • update-index - disabling the performance schema has a small impact on QPS (between 1% and 5% more QPS is common). Changing transaction isolation has no impact on QPS.
  • update-nonindex - same as update-index
  • read-write - disabling the performance schema frequently boosts QPS by 5% to 10% and the impact is greater at high concurrency. Using repeatable-read boosts performance because it reduces the mutex contention from getting a consistent read snapshot as that is done once per transaction rather than once per statement.
  • read-only - see the conclusions for read-write for shorter range scans (oltp-range-size set to 10 or 100 and maybe 1000). For longer range scans (oltp-range-size set to 10000) transaction isolation and the performance schema have little impact because the overhead is elsewhere
  • point-query - disabling the performance schema has a big impact on performance (between 5% and 10% more QPS) and the benefit is larger at high concurrency. Transaction isolation has no impact on performance because transactions are single statement.
  • select - see point-query
  • insert - disabling the performance schema has a smaller impact on QPS. Transaction isolation doesn't have an impact on performance because transactions are single statement.

Thursday, October 27, 2016

Benchmarketing MyRocks

I have been spending time understanding MyRocks performance for new workloads including benchmarks that potential MyRocks users run. One of those benchmarks is sysbench and I wrote a script to make it easier for me to run.

sysbench

Like most synthetic benchmarks sysbench is valuable but has its flaws. It helps to understand the flaws when looking at results. Most uses of sysbench are for very small databases. A typical run for me is 8 tables with 1M rows per table. That uses about 2G of space with uncompressed InnoDB tables. For a typical MyRocks configuration that will use a 3 level LSM tree with data in levels 0, 1 and 2 and I usually disable compression for those levels. And if you are running performance tests for a 2G database that fits in cache I wouldn't use compression. Small databases save time when running benchmarks as the load happens real fast. But you might miss the real overheads that occur with a larger database.

Another possible problem with sysbench is that several of the test configurations are for read-only workloads. If your real workload isn't read-only, then you might miss real overheads. For example, the RocksDB memtable might be empty for a read-only workload. That avoids the cost of checking the memtable on a query and can overstate the QPS you will measure.

I spent a day explaining unexpected performance variance on a read-only sysbench test. I took too long to notice that the LSM on the slower server had data in levels 0, 1 and 2 while the LSM on the faster server only used levels 1 and 2. By not having data in level 0 there was less work to do to process a query and the faster server got more QPS. This was visible in the compaction IO statistics displayed by SHOW ENGINE ROCKSDB STATUS. Had this been a read-write workload the LSM would have been in a steadier state with data (usually) in the memtable and level 0. But in this case the memtable was empty and compaction was stopped because there were no writes and the compaction scores for all levels was <= 1. I wonder whether we can add a feature to RocksDB to trigger compaction during read-only workloads when the LSM tree can be made more performant for queries?

configuration

The best settings for the MyRocks my.cnf file are also a source of confusion. I almost always enable the concurrent memtable. See the comments for the options allow_concurrent_memtable_write and enable_write_thread_adaptive_yield. I explained the benefits of these options in a previous post. Alas the options are disabled by default and not mentioned in the suggested my.cnf options. They are enabled by adding this to my.cnf:
rocksdb_allow_concurrent_memtable_write=1
rocksdb_enable_write_thread_adaptive_yield=1

I enable the concurrent memtable for most of my benchmarks. When MyRocks arrives in MariaDB Server and Percona Server I wonder whether other users will do the same. For read-write workloads the concurrent memtable can be a big deal.