Monday, August 31, 2015

Cached linkbench performance for MySQL 5.7.8, 5.6, WebScale and MyRocks

This extends previous results for Linkbench to compare performance for a cached database with concurrent clients. My conclusions are:
  • InnoDB compression in the Facebook patch for MySQL 5.6 is much faster for insert-heavy workloads than the same feature in upstream 5.6 and 5.7. Too bad those changes might not reach upstream
  • InnoDB transparent page compression is faster than non-transparent for write-heavy workloads assuming that feature is OK to use on your servers.
  • QPS for MyRocks suffers over time. We have work-in-progress to fix this. Otherwise it is already competitive with InnoDB. Compression with MyRocks is much better than InnoDB for linkbench data. That has also been true for real workloads.
  • Load rates are higher for compressed InnoDB tables when partitioning is used for 5.6 but not for 5.7. I didn't debug the slowdown in 5.7. It has been a win in the past for IO-bound linkbench because it reduces contention on the per-index mutex in InnoDB. Work has been done in 5.7 to reduce the contention on the per-index mutex.

Setup

The database size was between 10G and 30G after the load. The test was run with maxid=20000001, loaders=10 & requesters=20. Otherwise the default settings for linkbench were used. The InnoDB buffer pool was large enough to cache the database. The server has 144G of RAM, fast PCIe flash storage and 40 HW threads with HT enabled. The binlog was enabled but fsync was not done for the binlog or InnoDB redo log on commit. I tested several configurations for compression and partitioning:
  • p0.c0 - no partitioning, no compression
  • p0.c1 - no partitioning, compression (transparent & non-transparent)
  • p1.c0 - partitioning, no compression
  • p1.c1 - partitioning, compression (transparent & non-transparent)
I tested several binaries:
  • myrocks - the Facebook patch for MySQL, 5.6.X and the RocksDB storage engine
  • fb56 - the Facebook patch for MySQL, 5.6.X and InnoDB
  • orig56.ps - upstream 5.6.26 with the performance schema (PS) enabled
  • orig57.ps - upstream 5.7.8 with non-transparent compression and PS enabled
  • orig57.tc - upstream 5.7.8 with transparent compression and PS enabled
The test was done in two parts. First I measure the load performance, then I run the query test for 12 1-hour intervals. The data below is the insert rate from the load (load ips), the size after load (load gb), the QPS rate during the second and twelfth 1-hour runs (2h qps, 12h qps) and the size after the second and twelfth 1-hour runs (2h gb, 12h gb).
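For reference, the two parts were driven by commands along these lines. This is a sketch based on stock LinkBench rather than my exact scripts; the properties file name and the -D overrides are assumptions (stock LinkBench calls the maxid property maxid1):
    # load phase with 10 loader threads
    ./bin/linkbench -c config/LinkConfigMysql.properties -l \
        -Dmaxid1=20000001 -Dloaders=10
    # request phase with 20 request threads, repeated for each of the 12 1-hour intervals
    ./bin/linkbench -c config/LinkConfigMysql.properties -r \
        -Drequesters=20 -Dmaxtime=3600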

p0.c0
load    load    2h      2h      12h     12h
ips     gb      qps     gb      qps     gb      config
136041  14      43784   18      24298   20      myrocks
109724  22      49881   31      48459   51      fb56
103030  24      39979   34      39582   54      orig56.ps
116343  24      48506   35      48112   58      orig57.ps

p0.c1
load    load    2h      2h      12h     12h
ips     gb      qps     gb      qps     gb      config
 73115  15      42508   20      35766   32      fb56
 45660  16      36474   22      33107   34      orig56.ps
 46737  16      40890   22      37305   36      orig57.ps
101966  17      33716   23      29695   37      orig57.tc

p1.c0
load    load    2h      2h      12h     12h
ips     gb      qps     gb      qps     gb      config
101783  26      34342   30      21883   36      myrocks
105099  24      48686   33      47369   52      fb56
 97931  27      39343   36      39000   55      orig56.ps
109230  27      46671   37      46155   59      orig57.ps

p1.c1
load    load    2h      2h      12h     12h
ips     gb      qps     gb      qps     gb      config
 91884  15      46852   21      45223   36      fb56
 74080  17      39379   23      38627   38      orig56.ps
 77037  17      45156   24      44070   40      orig57.ps
 87708  19      37062   25      32424   40      orig57.tc

Graphs!

And for people who want graphs this has the average insert rate from the load and the average query rate from the twelfth hour for the p0.c0 test (no partitioning, no compression).



Friday, August 28, 2015

First day with InnoDB transparent page compression

I ran linkbench overnight for a database that started at 100G using MySQL 5.7.8 and InnoDB transparent page compression. After ~24 hours I have 1 mysqld crash with nothing in the error log. I don't know if that is related to bug 77738. I will attach gdb and hope for another crash. For more about transparent page compression read here, here and here. For concerns about the feature see the post by Domas. I previously wrote about this feature.

On the bright side, this is a great opportunity for MyRocks, the RocksDB storage engine for MySQL.

Then I ran 'dmesg -e' and got 81 complaints from XFS on the host that uses transparent compression. The warnings are from the time when the benchmark ran. My other test host isn't using hole-punch and doesn't get these warnings. I get the same error message below on a CentOS 6.6 host using a 3.10.53 kernel.

[Aug27 05:53] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[  +1.999375] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[  +1.999387] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[  +1.983386] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[  +1.999379] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)


The host has Fedora 19 and the kernel is 3.14.27-100.fc19.x86_64. I don't know if Fedora 19 is officially supported. I know that hole punch is available because this is in the error log:
    [Note] InnoDB: PUNCH HOLE support available

And this was used in a create table statement:
    ENGINE=InnoDB COMPRESSION="zlib" DEFAULT CHARSET=latin1;

From my host without hole-punch where the files for the table are ~44G.
    $ xfs_bmap /ssd1/scratch/data.mdcallag/data/linkdb/linktable*.ibd | wc -l
    11446


And this is from the host with hole-punch, where the files for the table are ~104G according to ls but are much smaller because of the holes.
    $ xfs_bmap /ssd1/scratch/data.mdcallag/data/linkdb/linktable.ibd  | wc -l
    11865620
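A quick way to see the gap between the apparent and allocated size of a hole-punched file is to compare ls, which reports the logical size, with du, which reports allocated blocks. Using the same path as above:
    $ ls -lh /ssd1/scratch/data.mdcallag/data/linkdb/linktable.ibd   # logical size, ~104G here
    $ du -sh /ssd1/scratch/data.mdcallag/data/linkdb/linktable.ibd   # allocated space, much smaller because of the holes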


I don't like the vmstat output either. On the host that uses transparent page compression swap is being used and that started during the linkbench load. It is not being used on the other host. Doesn't look right.

 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
13  6 4095996 8410200   1164 6298324    0    0    65   534    0    0  8  1 89  2


But wait, there is more

This is from a host with CentOS 6.6 and the 3.10.53 kernel. After running for a few hours with transparent page compression mysqld got stuck. I looked in the database error log and there was the standard output from threads getting long waits on InnoDB synchronization objects. I attached gdb to look at thread stacks, and that took ~15 minutes just to attach. Fortunately I got stacks before needing to reboot this host. All threads appear to be blocked on calls into the kernel. This gist shows two of the threads -- one is stuck doing aio and another trying to do fallocate64 called from os_file_punch_hole.

Deadlocked mysqld + reboot server == victory? I filed bug 78277 for this. After the reboot, dropping a ~72G ibd file that had been using hole-punch took ~3 minutes. Then I created an 80G file and dropping that took 10 seconds. Is it a good idea to have database files that take minutes to drop, given that InnoDB and filesystem state can get out of sync during a crash and require manual repair?

And more

I have two CentOS 6.6 servers with the 3.10.53 kernel and XFS. The host in the previous section doesn't use partitioning. The host in this section uses partitioning and transparent page compression. They both get InnoDB deadlocks but this host was healthy enough that the InnoDB background thread was able to kill the server. This was triggered by a DROP DATABASE statement. I don't have thread stacks but will guess that a file unlink operation took too long or got stuck. Because DDL isn't transactional the data dictionary isn't in a good state at this point. Some of the per-partition ibd files are missing.

Reboot #2 has arrived. I did kill -9 on mysqld and the process is still there in state Z after 40 minutes. I deleted the database directory but the space is still in use according to df. I assume that space won't be released until the zombie process goes away. I give up on trying to use transparent compression on this host.

Thursday, August 27, 2015

Single-threaded linkbench performance for MySQL 5.7, 5.6, WebScale and MyRocks

The good news is that features in the pending 5.7 release look great. More good news is that InnoDB transparent page compression might be faster than the original compression feature assuming your storage system supports it. The bad news is that there are significant performance regressions for low-concurrency workloads. I previously reported this for 5.6 and 5.7.5 and have yet to see progress. While the focus on high-concurrency workloads has been important, we can't lose this much performance at low-concurrency.

I used linkbench with one client thread to load and then query a small & cached database. This was repeated for many configurations to determine the impact from compression and partitioning. The binlog was enabled but it and the InnoDB redo log were not synced on commit. The performance summary is:
  • InnoDB transparent compression is faster for loads than non-transparent compression
  • Insert rates for 5.7.8 are much worse than for 5.6. The insert rates for 5.6 are between 1.07X and 1.77X higher than for 5.7.
  • The Facebook 5.6 build is always faster than 5.6.26 and 5.7.8, except when transparent compression is used.
  • MyRocks (MySQL+RocksDB) has the fastest load rate
  • Default configuration for the performance schema reduces the insert rate by 5%

Transparent compression

While transparent compression might be a lot faster than non-transparent compression I share the same doubts as Domas about whether we can use this feature in production. The feature requires either a filesystem that is happy when doing a holepunch on every write or a filesystem that is really a key-value store where the key is pageID, the value is a page and the page will be compressed on write with minimal roundup to meet alignment requirements. By roundup I mean that when a 16kb page is compressed to 11kb, it is written as 11kb when rounded up to the next 512b sector or 12kb when rounded up to the next 4kb sector. With 2X compression, that page would be split and the ~5.5kb compressed output for each post-split page would be rounded up to 8kb.
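For illustration, this is roughly what holepunch-on-write means at the filesystem level. InnoDB does the equivalent internally via fallocate (os_file_punch_hole); the file name and offsets below are made up to match the 16kb page rounded to 12kb example above:
    # a 16kb page at offset 163840 compressed to ~11kb and rounded up to 12kb:
    # deallocate the trailing 4kb so only 12kb of the page stays allocated
    $ fallocate --punch-hole --keep-size --offset $((163840 + 12288)) --length 4096 linktable.ibd
    $ ls -ls linktable.ibd    # logical size is unchanged, allocated blocks are reduced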

AFAIK there isn't a GA solution for the filesystem that doesn't require holepunch on write. There is a proprietary solution (filesystem & storage device) that is not yet GA from one vendor and the open-source solution might be bcachefs assuming it gets sufficient support to reach GA.

InnoDB page compression adds latency to query processing. By non-transparent compression I mean the original page compression for InnoDB where a compression factor (2X, 4X, 8X) is declared per table. Both transparent and non-transparent page compression add the latency of decompression after page reads. There is other latency that only occurs for non-transparent page compression. By latency I mean work done by threads handling SQL statements and these are things that I see when looking at PMP stack traces.
  1. Page images are optionally written to the redo log to avoid the risk of changing the zlib library during crash recovery. I always disable page image logging.
  2. When the per-page compression log is full all changes must be applied to the page and then the page is compressed.
  3. When a page doesn't compress enough (see #2) then a page split is done and the split pages must then be compressed. When a page split is done the per-index mutex is held so other threads can't access that index until the operation is done. I frequently use partitioning to reduce the impact from contention on the per-index mutex.
  4. There is a bit more work to manage both the uncompressed and compressed LRUs.
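To make the two variants concrete, this is how each is declared. The table and columns are hypothetical; only the table options matter here:
    -- non-transparent (original) page compression: a compression factor declared per table, here 2X via 8kb blocks
    CREATE TABLE t_old (id BIGINT PRIMARY KEY, val VARBINARY(255))
      ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
    -- transparent page compression in 5.7: full pages in the buffer pool, hole-punch on write
    CREATE TABLE t_new (id BIGINT PRIMARY KEY, val VARBINARY(255))
      ENGINE=InnoDB COMPRESSION="zlib";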

Performance regressions

Using partitioning has improved InnoDB performance for IO-bound linkbench tests in the past as it reduces contention for the per-index mutex. This test wasn't IO-bound as the database was always cached by InnoDB, but I still wanted to see the impact from the different configurations. Transparent compression in this case is done via holepunch-on-write. I ignore the compression rates for now and will revisit that in a future test. I tested these configurations:
  • p0.c0 - no partitioning, no compression
  • p0.c1 - no partitioning, compression (transparent & non-transparent)
  • p1.c0 - partitioning, no compression
  • p1.c1 - partitioning, compression (transparent & non-transparent)
The linkbench test was run with maxid=2000001, loaders=1 & requesters=1. Otherwise the default settings were used. The InnoDB buffer pool was large enough to cache the database, the binlog was enabled but the binlog and InnoDB redo log were not synced on commit. I set many other my.cnf values to match what we use in production but don't list them here. The p0.c0 test was repeated for upstream 5.6.26 and 5.7.8 with and without the PS enabled (performance_schema=0, 1 in my.cnf). I didn't set any other options for the PS. The load rate is reduced by about 5% when the PS is enabled with default options.
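This is a minimal sketch of the my.cnf settings that match the description above. The buffer pool size is an assumption and the production file sets many more options:
    [mysqld]
    innodb_buffer_pool_size=32G         # assumption: large enough to cache the database
    log_bin                             # binlog enabled ...
    sync_binlog=0                       # ... but not synced on commit
    innodb_flush_log_at_trx_commit=2    # redo log written but not synced on commit
    performance_schema=0                # set to 1 for the *.ps configurations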

I tested several binaries:
  • myrocks - the Facebook patch for MySQL, 5.6.X and the RocksDB storage engine
  • fb56 - the Facebook patch for MySQL, 5.6.X and InnoDB
  • orig56.nops, orig56.ps - upstream 5.6.26 without and with PS enabled. The my.cnf file I used was based on what we use in production and is here.
  • orig57.nops, orig57.ps - upstream 5.7.8 with non-transparent compression, without and with PS enabled. The my.cnf file I used was based on what we use in production and is here.
  • orig57.tc - upstream 5.7.8 with transparent compression and PS disabled
The test was done in two parts. First I measure the rate at which one thread can load the database. Then I measure the average QPS sustained for one hour. The data below is the insert rate from the load (load ips), the size after load (load gb), the QPS rate during the 1-hour run (1h qps) and the size after the 1 hour run (1h gb).

The relative ips column is the insert rate for that configuration divided by the rate for upstream 5.7.8. These ratios are large and we have lost a lot of performance at low concurrency in 5.7.8. Most of the loss is not from the performance schema.

p0.c0
load    load    1h      1h     relative
ips     gb      qps     gb     ips       config
 53577  X       3244    X      1.57      myrocks
 47165  2.3     4016    3.0    1.38      fb56
 46785  2.6     3698    3.4    1.37      orig56.nops
 48857  2.6     3735    3.4    1.43      orig56.ps
 34233  2.6     3604    3.4    1.0       orig57.nops
 36077  2.6     3693    3.4    1.05      orig57.ps

p0.c1
load    load    1h      1h     relative
ips     gb      qps     gb     ips       config
 17688  1.5     3874    1.9    1.33      fb56
 14164  1.7     3481    2.1    1.07      orig56.nops
 13252  1.7     3372    2.1    1.0       orig57.nops
 34770  2.1     3510    2.3    2.62      orig57.tc

p1.c0
load    load    1h      1h     relative
ips     gb      qps     gb     ips       config
 45980  3.2     3921    3.8    1.77      fb56
 34660  3.5     3560    4.1    1.34      orig56.nops
 25908  3.5     3391    4.1    1.0       orig57.nops

p1.c1
load    load    1h      1h     relative
ips     gb      qps     gb     ips       config
 16152  2.1     3707    2.4    1.50      fb56
 12345  2.3     3296    2.7    1.15      orig56
 10779  2.3     3214    2.7    1.0       orig57.nops
 23148  3.0     3361    3.0    2.15      orig57.tc

Graphs

A graph for the p0.c0 test (no partitioning, no compression) showing relative insert rates.


Tuesday, August 18, 2015

How to use iTunes family sharing

It took me too long to figure this out. If you have family sharing enabled for iTunes and try to play content from the library of another member in the plan then you will get a dialog asking for your password followed by another dialog claiming the computer is already authorized and the content won't play. You have to download the content, you can't stream it, but that isn't the most annoying part. The shared content is hard to find. iTunes seems to change the UI in random ways each year. This advice works for version 12.2.2.
  1. click iTunes store at the top
  2. click purchased (small text buried on middle of right hand side)
  3. Use pull down on left hand side to switch from self to other person next to Purchased
  4. Click on a show on the left hand side
  5. Click on Episodes at the top to switch from Seasons and see the episodes.

Monday, August 17, 2015

Transactions with RocksDB

RocksDB has work-in-progress to support transactions via optimistic and pessimistic concurrency control. The features need more documentation but we have shared the API, additional code for pessimistic and optimistic and examples for pessimistic and optimistic. Concurrency control is a complex topic (see these posts) and is becoming popular again for academic research. An awesome PhD thesis on serializable snapshot isolation by Michael Cahill ended up leading to an implementation in PostgreSQL.

We intend to use the pessimistic CC code for MyRocks, the RocksDB storage engine for MySQL. We had many discussions about the repeatable read semantics in InnoDB and PostgreSQL and decided on Postgres-style. That is my preference because the gap locking required by InnoDB is more complex.

MongoRocks uses a simpler implementation of optimistic CC today and a brief discussion on CC semantics for MongoDB is here. AFAIK, write-write conflicts can be raised, but many are caught and retried internally. I think we need more details. This is a recent example of confusion about the current behavior.

Thanks go to Anthony for doing the hard work on this.

Friday, August 14, 2015

Reducing write amplification in MongoRocks

After evaluating results for MongoRocks and TokuMX I noticed that the disk write rate per insert was ~3X larger for MongoRocks than TokuMX. I ran tests with different configurations to determine whether I could reduce the write rate for MongoRocks during the insert benchmark tests on a server with slow storage. I was able to get a minor gain in throughput and a large reduction in write amplification by using Snappy compression on all levels of the LSM.

Configuration

The test was to load 500M documents with a 256 byte pad column using 10 threads. There were 3 secondary indexes to be maintained. In the default configuration RocksDB uses leveled compaction with no compression for L0 and L1 and Snappy compression for L2 and beyond. I used QuickLZ for TokuMX 2.0.1 and assume that all database file writes were compressed by it. From other testing I learned that I need to increase the default value used for bytes_per_sync with MongoRocks to trigger calls to sync_file_range less frequently. Here I tried two configurations for MongoRocks:
  • config1 - Use a larger memtable (200M versus 64M) and larger L1 (1500M versus 512M).
  • config2 - Use a larger memtable (400M versus 64M), larger L1 (1500M versus 512M) and Snappy compression for all levels.

Binaries

This describes the configurations that I tested:
  • tokumx - TokuMX 2.0.1
  • rocksdb.def - MongoRocks with default configuration
  • rocksdb.config1 - MongoRocks with larger L1 & memtable
  • rocksdb.config2 - MongoRocks with Snappy compression for all levels

Results

From the results below the value for wkb/i (KB written per insert) dropped almost in half for MongoRocks when Snappy compression was used for all levels. The reduction was smaller for the config that used a larger L1. Both of the changes caused a small improvement in the insert rate. The insert rate and write amplification were better for TokuMX. The table below has these columns:
  • r/s - average rate for iostat r/s
  • rmb/s, wmb/s - average rate for iostat rMB/s, wMB/s
  • r/i - iostat reads per document inserted
  • rkb/i, wkb/i - iostat rKB, wKB per document inserted
  • us+sy - average rate for vmstat us + sy
  • cs/i - vmstat context switches per document inserted
  • (us+sy)/i - us+sy divided by the insert rate
  • ips - average insert rate
r/s     rmb/s   wmb/s   r/i        rkb/i    wkb/i    us+sy   cs/i   (us+sy)/i   ips     engine
728.4   9.9      87.4   0.016343   0.228    2.007    40.1    4      0.000899    44571   tokumx
77.8    0.9     176.6   0.003123   0.036    7.254    22.3    7      0.000895    24923   rocksdb.def
556.3   7.6     149.8   0.022126   0.308    6.101    20.8    7      0.000826    25143   rocksdb.config1
285.3   3.6     105.1   0.010865   0.142    4.099    20.4    6      0.000775    26258   rocksdb.config2
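As a sanity check on how the normalized columns are derived, each per-insert value is the average iostat rate divided by the average insert rate. Using the rocksdb.def row above:
    wkb/i = (176.6 MB/s * 1024) / 24923 inserts/s ≈ 7.25 KB written per insert
    r/i   = 77.8 reads/s / 24923 inserts/s ≈ 0.0031 reads per insert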

Configuration options


This is the change for mongo.conf with rocksdb.config2:
storage.rocksdb.configString: "bytes_per_sync=16m;max_background_flushes=3;max_background_compactions=12;max_write_buffer_number=4;max_bytes_for_level_base=1500m;target_file_size_base=200m;level0_slowdown_writes_trigger=12;write_buffer_size=400m;compression_per_level=kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression"

This is the change for mongo.conf with rocksdb.config1:
storage.rocksdb.configString: "bytes_per_sync=16m;max_background_flushes=3;max_background_compactions=12;max_write_buffer_number=4;max_bytes_for_level_base=1500m;target_file_size_base=200m;level0_slowdown_writes_trigger=12;write_buffer_size=200m"

Thursday, August 13, 2015

Insert benchmark and queries on slow storage for MongoDB

This continues the use of the insert benchmark to understand performance for different MongoDB engines with slow storage. I previously explained load and query performance for a server with PCIe flash and documented the impact of too frequent invalidation in the plan cache. In this post I explain load and query performance for a server with a disk array -- SW RAID 0 across 15 SATA disks.

My summary from this and the previous posts includes:
  • write-optimized engines (TokuMX, RocksDB) do much better than b-trees (WiredTiger, mmapv1) for secondary index maintenance on large databases because they don't do read-before-write. InnoDB might be unique among the b-tree implementations -- it has an optimization to save on disk reads via the insert buffer.
  • MongoDB invalidates cached plans too quickly, but that has been fixed recently. The performance impact from this can be much worse on slow storage.
  • TokuMX 2.0.1 has a CPU bottleneck from the partial eviction code that also hurts range query performance. It also wasn't doing index-only access plans for queries that should be index only.
  • RocksDB is very competitive for this workload. It was usually the first or second best on each test.
  • TokuMX tends to do more reads and fewer writes compared to RocksDB
  • mmapv1 can become single-threaded courtesy of the write lock not getting released when disk reads are done during index maintenance. See server-13325.
  • TokuMX and RocksDB on slow storage do loads almost as fast as on fast storage. With 3 secondary indexes TokuMX gets 98% of the rate it had on fast storage and RocksDB gets 83% of the rate it had on fast storage. RocksDB loses more because of compaction stalls. We are thinking of ways to make this better.
  • For write intensive workloads RocksDB loses more performance than TokuMX when the oplog is enabled. TokuMX loaded 500M docs at 44571/second with the oplog enabled versus 48515/second with it disabled. RocksDB did 24923/second with it enabled and 36759/second with it disabled. The percentages for rate-with-oplog-enabled / rate-with-oplog-disabled are 91% for TokuMX and 67% for RocksDB. TokuMX has more optimizations for the oplog.

The tests

I wasted a few weeks of machine time getting this data. My first attempt used the same setup as the fast storage test -- 2B documents with a 256 byte pad column. The load with 3 secondary indexes would have taken much too long for mmapv1 and WiredTiger but I didn't realize this until more than one week into the load. I changed the load to 500M documents and even then I wasn't willing to wait for mmapv1 and WiredTiger to finish with 3 secondary indexes so I stopped the load early for them. I was more interested in incremental load than bulk load performance, otherwise I would have created the indexes after loading the table.

I then lost another few days and repeated all query tests because I had a lousy setting for /sys/block/$device/queue/read_ahead_kb. This hurt mmapv1 the most on query tests. RocksDB was mostly immune to it because it sets POSIX_FADV_RANDOM by default for user reads. I didn't notice this until late in the process. I did not repeat the load.

I spent a few more days improving the RocksDB configuration for disk arrays and finally another day repeating tests with the oplog disabled. This is a typical experience for performance tests: tests will be repeated.

All tests use my fork of the insert benchmark for MongoDB. The test servers have a disk array (SW RAID 0, 15 SATA disks), 32 hyperthread cores and 64G of RAM. The servers with fast storage had more of everything: PCIe flash, 40 hyperthread cores, 144G of RAM. The CPU on the fast storage server was one generation newer than the CPU on the slow storage server. The mongod configuration was described in the two previous blog posts, but I limited the database cache to 16G versus 32G for the server with fast storage. Loads were done with 10 client threads. The query tests are described in the Test Details section of the previous post.

Load performance with oplog enabled

I had lots of data and a nice graph to share for the load. Then I realized that a lousy setting for read_ahead_kb made my load results bogus. I repeated the load tests for TokuMX and RocksDB with better settings. I did not repeat the load for WiredTiger and mmapv1 because that would take too long. I also stopped the load early for WiredTiger and mmapv1 so the query tests were run with 352M documents for mmapv1 and 419M documents for WiredTiger versus 500M documents for RocksDB and TokuMX. The oplog was enabled for these tests, but fsync was not done on commit.

This has data from iostat and vmstat including absolute and relative (normalized by insert/second) rates. From the normalized rates TokuMX and RocksDB use a similar amount of CPU per insert, while TokuMX reads ~7X more data and RocksDB writes ~3.5X more data per insert. Note that I can greatly reduce the write rate for RocksDB by using universal compaction, but I have yet to explore that tradeoff between write, read and space amplification. I suspect that RocksDB uses one more index than TokuMX to map DiskLoc to a document as explained in a previous post and this is a current limitation in MongoDB 3.0, so WiredTiger also suffers from it. This extra index might be the reason for some extra writes, but not for 3.5X more. The values in the table are:
  • r/s - average rate for iostat r/s
  • rmb/s - average rate for iostat read MB/second 
  • wmb/s - average rate for iostat write MB/second
  • r/o - iostat reads per insert
  • rkb/o - iostat KB read per insert
  • wkb/o - iostat KB written per insert
  • us+sy - average value of vmstat us + sy (user & system CPU)
  • cs/o - vmstat cs per insert (context switches per insert)
  • (us+sy)/o - CPU time per insert
  • ips - inserts/second
  • #indexes - number of secondary indexes
r/s     rmb/s   wmb/s   r/o        rkb/o    wkb/o    us+sy   cs/o   (us+sy)/o   ips     engine
728.4   9.9      87.4   0.016343   0.228    2.007    40.1    4      0.000899    44571   tokumx
 77.8   0.9     176.6   0.003123   0.036    7.254    22.3    7      0.000895    24923   rocksdb

Load performance with oplog disabled

I repeated the load test for RocksDB and TokuMX with the oplog disabled to determine the overhead of the oplog for RocksDB and hopefully provide motivation to make this more efficient. The data below has the load rate (ips) and statistics from iostat and vmstat (absolute and normalized by the insert rate) for TokuMX and RocksDB with the oplog enabled (*.log) and disabled (*.nolog). Things I noticed include:
  • RocksDB loses more from the oplog than TokuMX. RocksDB gets 67% of the insert rate with the oplog enabled compared to when it is disabled, while TokuMX gets 91% in that comparison. TokuMX doesn't have to delete old data from the oplog because it can drop/truncate old partitions, while RocksDB must delete old entries one at a time.
  • The wkb/o (rate for bytes written to storage per insert) value is reduced almost in half for RocksDB with the oplog disabled. For an insert only workload documents are written at least twice (once to the oplog, once to the database) ignoring writes from LSM compaction in RocksDB. So when the oplog is disabled half of the writes are removed.
  • The rkb/o rate drops by ~6X for RocksDB when the oplog is disabled.
r/s     rmb/s   wmb/s   r/o        rkb/o    wkb/o    us+sy   cs/o   (us+sy)/o   ips     engine
728.4   9.9      87.4   0.016343   0.228    2.007    40.1    4      0.000899    44571   tokumx.log
77.8    0.9     176.6   0.003123   0.036    7.254    22.3    7      0.000895    24923   rocksdb.log

706.8   8.3     65.0    0.014568   0.176    1.372    39.0    4      0.000804    48515   tokumx.nolog
25.3    0.2     141.5   0.000689   0.005    3.941    23.2    6      0.000632    36759   rocksdb.nolog

Impact of too-frequent plan invalidation

The i10.q1 test shows how bad the impact can be from too-frequent plan invalidation. The problem is that the 3.0.X version of MongoDB that I used invalidates query plans after 1000 documents have been changed. That happens quickly for the workloads I tested and QPS drops when plans are recomputed. The table below shows the rates for inserts and queries per second with the default setting ($engine.nofix) and a workaround ($engine.fix). I was unable to use the workaround for TokuMX 2.0.1 as it is based on an older version of MongoDB. But I don't think TokuMX suffers from this problem.

             ipsAvg  qpsAvg
wired.nofix     668     2.1
wired.fix       616      38
mmap.nofix      925    0.22
mmap.fix        583     4.2
rocks.nofix   27783     1.3
rocks.fix     29628      78
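The workaround amounts to raising internalQueryCacheWriteOpsBetweenFlush so cached plans are flushed less often. A sketch of how I would set it -- the value is only an example and the exact behavior should be checked against your MongoDB version:
    # at startup
    mongod --setParameter internalQueryCacheWriteOpsBetweenFlush=100000
    # or in the YAML config file
    setParameter:
      internalQueryCacheWriteOpsBetweenFlush: 100000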

Query performance

The query tests are described in the Test Details section of the previous post. The tests were run in sequence after the load with 3 secondary indexes, and here I share the results for i0.q10, i1.q10, i10.q1 and i10.q0. The tests were only run for the configuration with 500M documents and a 256 byte pad field but the load was stopped early for WiredTiger and mmapv1 (see the section on Load Performance) so they had less data.

The queries used in this test are interesting. The secondary indexes were small enough to be cached for RocksDB, TokuMX and WiredTiger but only if I didn't use too much memory for the block caches. I used 16G which left almost 48G for the OS filesystem cache. To be precise, this was enough memory to almost cache the secondary indexes for TokuMX. The queries use covering secondary indexes so they should be index only, but were not for TokuMX for reasons I don't understand. The extra disk reads done by TokuMX were not good for QPS.
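One way to confirm whether a query is index only (covered) is explain. The collection and field names below are hypothetical, not the exact iibench schema; a covered plan shows an IXSCAN without a FETCH stage and totalDocsExamined stays at 0:
    db.purchases.find(
        { price: { $gte: 100, $lte: 200 } },    // predicate on indexed fields only
        { _id: 0, price: 1, customerid: 1 }     // projection restricted to indexed fields
    ).explain("executionStats")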

The secondary indexes were not small enough to be fully cached by mmapv1, but a lot of the data from them was in cache. With mmapv1 I don't have to worry about splitting RAM between uncompressed pages in the block cache and compressed pages in the OS filesystem cache. Also with mmapv1, decompression on page read wasn't needed because it doesn't use compression. This might explain why it had the highest QPS. AFAIK, WiredTiger doesn't use block compression (snappy in this case) for index pages so it has less overhead after reading the data compared to RocksDB.

This describes the columns in the tables below:
  • ipsAvg, qpsAvg - average rates for inserts and queries per second
  • dbSize - database size in GB at test end
  • rssSize - size of mongod process per RSS column at test end

i0.q10

Perhaps because I have never used mmapv1 or MyISAM in production I am always happy to get a result where they are the fastest. The queries were not index only for TokuMX and QPS suffers. They were index only for RocksDB, WiredTiger and mmapv1.

            qpsAvg  dbSize  rssSize
tokumx         529    140g      18g
wiredtiger    8783    129g      15g
mmapv1       10827    251g      62g
rocksdb       7711    120g      22g

This has data from iostat and vmstat including absolute and relative (normalized by IPS or QPS) rates. The columns are explained in the Load Performance section. TokuMX does ~1037X more disk reads per query than RocksDB. That is a problem. mmapv1 does ~8.4X more disk reads per query than RocksDB and that is much less of a problem in this case. RocksDB uses about 2X more CPU/query than mmapv1 while TokuMX uses about 4.8X more CPU/query than RocksDB.

r/s     rmb/s   wmb/s   r/o        rkb/o    wkb/o    us+sy   cs/o   (us+sy)/o   qps     engine
940.1   11.1    0.6     1.778132   21.445   1.204    10.3    13     0.019426    529     tokumx
  1.5    0.2    0.1     0.000176    0.019   0.017    36.7    10     0.004181    8783    wiredtiger
156.1    0.7    0.2     0.014419    0.065   0.015    28.2     7     0.002602    10827   mmapv1
 13.2    0.2    0.1     0.001715    0.029   0.019    31.3     7     0.004060    7711    rocksdb

i1.q10

The one insert thread was rate limited to 1000 inserts/second but WiredTiger and mmapv1 were unable to sustain that rate. This test was run with the fix to avoid too-frequent query planning. The QPS for TokuMX suffers because the query isn't index only for it. QPS for mmapv1 suffers because the database is much larger than others and because of the RW lock which is locked in W mode when disk reads are done for secondary index maintenance as explained in a previous post.

            ipsAvg  qpsAvg  dbSize  rssSize
tokumx         989     502    143g      17g
wiredtiger     897    5985    131g      16g
mmapv1         253     352    258g      21g
rocksdb        991    7129    122g      22g

i10.q0

TokuMX does much better than RocksDB here because RocksDB has more write amplification, TokuMX has more optimizations for the oplog and RocksDB must maintain one extra index. The insert rates for WiredTiger and mmapv1 are much lower because they do read-before-write during secondary index maintenance. The write efficiency can be seen in the wkb/o column where it is ~5X larger for RocksDB than TokuMX and ~23X larger for WiredTiger than TokuMX. The read efficiency can be seen in the r/o column where WiredTiger is ~40X larger than TokuMX, mmapv1 is ~18X larger than TokuMX and RocksDB is ~1.7X larger than TokuMX. WiredTiger and mmapv1 also use much more CPU per insert than TokuMX and RocksDB.

            ipsAvg  dbSize  rssSize
tokumx       48997    153g      19g
wiredtiger     764    139g      16g
mmapv1         649    259g      36g
rocksdb      27810    132g      22g

r/s     rmb/s   wmb/s   r/o        rkb/o    wkb/o    us+sy   cs/o   (us+sy)/o   ips     engine
536.9   7.1      71.2   0.010958   0.148     1.488   32.6    4      0.000666    48997   tokumx
332.5   2.2      25.5   0.435168   3.006    34.154    5.4    8      0.007050    764     wiredtiger
131.6   0.9      15.8   0.202600   1.367    24.962    1.1    5      0.001684    649     mmapv1
547.3   5.0     206.5   0.018976   0.177     7.331   27.1    6      0.000940    28839   rocksdb

i10.q1

I am not sure why the insert rate for RocksDB here is higher than i10.q0, perhaps that is due to variance from compaction and a longer running test would have fixed it. This test was run with the fix to avoid too-frequent query planning. The insert rates are similar to the rates for i10.q0.

            ipsAvg  qpsAvg  dbSize  rssSize
tokumx       46524      39    165g      19g
wiredtiger     668      38    146g      19g
mmapv1         583     4.2    272g      34g
rocksdb      29628      78    138g      22g



Friday, August 7, 2015

Different kinds of copy-on-write for a b-tree: CoW-R, CoW-S

I use CoW-R and CoW-S to distinguish between two approaches to doing copy-on-write for a B-Tree.

CoW-R stands for copy-on-write random and is used by LMDB and WiredTiger. Page writeback is not in place. The writes use the space from dead versions of previously written pages. Background threads are not used to copy live data from old log segments or extents. This tends to look like random IO. When the number of pages written per fsync is large then a clever IO scheduler can significantly reduce the random IO penalty for disks and LMDB might benefit from that. WiredTiger has shown that compression is possible with CoW-R. But compression makes space management harder.

CoW-S stands for copy-on-write sequential. New writes are done to one or a few active log segments. Background threads are required to copy-out live data from previously written log segments. Compared to CoW-R this has less random IO at the cost of extra write-amplification from cleaning old segments. It allows for a tradeoff between space and write amplification -- more dead data can be allowed to reduce the frequency of cleaning and the write amplification overhead. I am not aware of an implementation of CoW-S for a B-Tree.

In CoW-R and CoW-S B-Trees the file structure is page based. In the worst case a page is written back to disk with one dirty row and the write amplification from that is sizeof(page) / sizeof(row). This can be large when a row is small (128 bytes) and a page is 4kb or 8kb. Write optimized approaches avoid that source of write amplification as they only write back the changed rows. Examples of this include an LSM like RocksDB and a log structured product like ForestDB. Alas, there is no free lunch. While these avoid the write amplification from page writeback, they add write amplification from compaction in an LSM and log segment cleaning in ForestDB.
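To make the page-writeback penalty concrete, this is the arithmetic for the sizes mentioned above:
    write-amplification = sizeof(page) / sizeof(row)
                        = 4096 / 128 = 32    (4kb page, 128 byte row)
                        = 8192 / 128 = 64    (8kb page, 128 byte row)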

Wednesday, August 5, 2015

Insert benchmark & queries for MongoDB on fast storage

This extends the performance evaluation using the insert benchmark on a server with fast storage. After the load extra tests were run using different numbers of insert and query threads. The load was done for 2B documents with a 256 byte pad field. The databases are at least 500G and the test server has 144G of RAM and 40 hyperthread cores. My conclusions are:
  • WiredTiger has the best range query performance followed closely by RocksDB. While the LSM used by RocksDB can suffer a penalty on range reads, that wasn't visible for this workload.
  • mmapv1 did OK on range query performance when there were no concurrent inserts. Adding concurrent inserts reduces range query QPS by 4X. Holding exclusive locks while doing disk reads for index maintenance is a problem, see server-13225.
  • TokuMX 2.0.1 has a CPU bottleneck that hurts range query performance. I await advice from TokuMX experts to determine whether tuning can fix this. I have been using a configuration provided by experts.

Test details

More details on the configuration for mongod are in the previous post. The tests here were run after the database was loaded with 2B documents that used a 256 byte pad field. After the initial load the databases were ~620G, ~630G, ~1400G and ~500G for TokuMX, WiredTiger, mmapv1 and RocksDB. I used my fork of iibench for MongoDB (thanks Tim) to support more than 1 query thread. My fork has an important diff to avoid reusing the same RNG seed between tests. The range query fetches at most 4 documents via the QUERY_LIMIT=4 command line option. Tests were run in this sequence:
  • 10i.1q - 10 insert threads, 1 query thread. Run to insert 10M documents for mmapv1 and 100M documents for the other engines
  • 1i.10q - 1 insert thread, 10 query threads. The insert thread was rate limited to 1000 documents/second. Run for at least 1 hour.
  • 0i.10q - 0 insert threads, 10 query threads. Run for at least 1 hour.
  • 10i.0q - 10 insert threads, 0 query threads. Run to insert at least 10M documents.
  • 1i.10q - this is the same as the previous 1i.10q test except I increased the value of internalQueryCacheWriteOpsBetweenFlush to avoid the IO overhead from re-planning queries too frequently as explained in a previous post. Alas, this cannot be done for TokuMX 2.0.1 because that option doesn't exist in the older version of MongoDB it uses.

Results

This is the QPS from the 10 query threads with and without a concurrent insert thread. The result for 1i.10q is from the run where I increased the value of internalQueryCacheWriteOpsBetweenFlush to get better performance, although the overhead from too-frequent re-planning is much more significant with a disk array than with PCIe flash. There are a few things to note here:
  • WiredTiger and RocksDB have similar QPS
  • QPS for mmapv1 is much worse with a concurrent insert thread (locks are held when disk reads are done for index maintenance, not yielded)
  • QPS for TokuMX is low. There is a CPU bottleneck. More details at the end of this post.
The tables that follow have results for each of the tests. The results include:
  • ipsAvg - average rate for inserts/second
  • ipsP95 - rate at 95th percentile for inserts/second using 10-second intervals
  • qpsAvg - average rate for range queries/second
  • qpsP95 - rate at 95th percentile for queries/second using 10-second intervals
  • dbSize - database size in GB at test end
  • rssSize - RSS for mongod at test end

10i.1q

These are results for the 10i.1q test. WiredTiger has significant variance for the insert and query rates. Bugs have been opened for this problem and progress has been made but I think some bugs are still open. The query rate for mmapv1 is very low as described earlier in this blog post.

            ipsAvg  ipsP95  qpsAvg  qpsP95  dbSize  rssSize
tokumx       40775   36001   119.5   101.6   622gb     36gb
wiredtiger    6175    1488      50    14.8   632gb     37gb
mmapv1        1257    1094    1.15     0.7  14XXgb     70gb
rocksdb      29424   25931    45.1      30   507gb     54gb

1i.10q, first

These are results for the first 1i.10q test. The insert thread is rate limited to 1000 documents/second. WiredTiger and RocksDB sustain higher query rates. TokuMX (because of the CPU bottleneck) and mmapv1 (because of the larger database and RW lock) sustain lower query rates.

            ipsAvg  ipsP95  qpsAvg  qpsP95  dbSize  rssSize
tokumx         981     900     726     696   553gb     39gb
wiredtiger     980     908    7414    6505   632gb     32gb
mmapv1         951     874     766     650  14XXgb     66gb
rocksdb        981     969    6720    6189   501gb     53gb

0i.10q

These are results for the 0i.10q test. WiredTiger and RocksDB get about 10% more QPS compared to the 1i.10q result. The QPS for mmapv1 improves by ~4X because it isn't slowed by the RW lock from the insert threads. TokuMX continues to suffer from the CPU bottleneck.

            ipsAvg  ipsP95  qpsAvg  qpsP95  dbSize  rssSize
tokumx           0       0     850     819   552gb     39gb
wiredtiger       0       0    8750    8568   632gb     31gb
mmapv1           0       0    3603    3362  14XXgb     62gb
rocksdb          0       0    7723    7667   501gb     52gb

I was too lazy to copy this into a table. This has results from iostat and vmstat, both absolute and normalized by the operation (query or insert) rate. The values are:
  • qps, ips - operation rate (queries or inserts)
  • r/s - average rate for r/s (storage reads/second)
  • rmb/s - average rate for read MB/second 
  • wmb/s - average rate for write MB/second
  • r/o - storage reads per operation
  • rkb/o - KB of storage reads per operation
  • wkb/o - KB of storage writes per operation
  • cs - average context switch rate
  • us - average user CPU utilization
  • sy - average system CPU utilization
  • us+sy - sum of us and sy
  • cs/o - context switches per operation
  • (us+sy)/o - CPU utilization divided by operation rate
WiredTiger and RocksDB sustain the highest QPS because they do the fewest storage reads per query. The range read penalty from an LSM doesn't occur for this workload with RocksDB. The CPU/operation value is more than 25X larger for TokuMX (0.085000) than for the other engines (0.003104 for RocksDB). TokuMX is also doing almost 4X the disk reads per query compared to RocksDB. I hope we can figure this out. The storage reads per query rate is much higher for mmapv1 in part because the database is much larger and the cache hit rate is lower.

         qps     r/s    rmb/s   wmb/s    r/o    rkb/o   wkb/o
toku     850     2122    32.0   1.4      2.50    38.6   1.7
wired   8750     4756    47.9   0.6      0.54     5.6   0.1
mmap    3603    38748   577.2   0.6     10.75   164.0   0.2
rocks   7723     5755    47.2   0.5      0.75     6.3   0.1

         qps      cs    us      sy      us+sy   cs/o    (us+sy)/o
toku     850     18644  67.1    5.1      72.2    22      0.085000
wired   8750     92629  24.3    3.0      27.3    11      0.003121
mmap    3603    255160   6.8    4.9      11.8    71      0.003264
rocks   7723     75393  21.2    2.8      24.0    10      0.003104

10i.0q

These are results for the 10i.0q test. The insert rate improvement versus 10i.1q is larger for TokuMX than for RocksDB. I assume that TokuMX suffers more than RocksDB from concurrent readers. WiredTiger still has too much variance in the insert rate.

            ipsAvg  ipsP95  qpsAvg  qpsP95  dbSize  rssSize
tokumx       43667   39152       0       0   653gb     36gb
wiredtiger    5357     584       0       0   658gb     40gb
mmapv1        1254    1163       0       0  14XXgb     62gb
rocksdb      29842   25762       0       0   533gb     54gb

The CPU utilization and storage IO per insert are lower for RocksDB & TokuMX. This is expected given they are write-optimized and benefit from not having to read secondary index pages during index maintenance. mmapv1 suffers from doing more disk reads and using more CPU.

        ips       r/s   rmb/s   wmb/s   r/o     rkb/o   wkb/o
toku    43667     587    23.7    96.1   0.01      0.6    2.3
wt       5357    9794   109.9   187.2   1.83     21.0   35.8
mmap     1254   10550   235.7    31.7   8.41    192.5   25.9
rocks   29842    1308    31.1   281.9   0.04      1.1    9.7

        ips     cs      us      sy      us+sy   cs/i    (us+sy)/i
toku    43667   189501  31.7    4.6     36.3     4       0.000830
wt       5357   105640  16.3    4.3     20.6    20       0.003840
mmap     1254    93985   1.3    2.0      3.3    75       0.002665
rocks   29842   189706  22.3    3.6     25.9     6       0.000867

1i.10q, second

These are results for the second 1i.10q test. The QPS rates for RocksDB, WiredTiger and mmapv1 are better than in the first 1i.10q test because mongod was changed to cache query plans for a longer time.

            ipsAvg  ipsP95  qpsAvg  qpsP95  dbSize  rssSize
tokumx         987     907     858     829   527gb     37gb
wiredtiger     977     899    7830    6821   658gb     32gb
mmapv1         949     864     869     727  14XXgb     56gb
rocksdb        991     937    7018    6669   524gb     53gb

TokuMX CPU bottleneck, test 1

I want to understand why TokuMX uses more than 25X the CPU per query compared to other engines. The high CPU load comes from eviction (flushing dirty data so block cache memory can be used for disk reads) but it isn't clear to me why eviction should be so much more expensive in TokuMX than in WiredTiger which also does eviction. RocksDB doesn't show this overhead but it does compaction instead of eviction.

I reloaded the TokuMX database with 2B documents, let it sit idle for 1 day and then started the 0i.10q test. From a flat profile with Linux perf the top-10 functions are listed below. Then I ran PMP and from the stack traces I see that eviction is in progress, which explains why QuickLZ compression uses a lot of CPU: data must be compressed during eviction. From stack traces I also see that the query threads are stalled waiting for eviction to flush dirty pages before they can use memory for data read from storage. This is a common place where a database engine can stall (InnoDB and WiredTiger have stalled there, maybe they still do) but the stalls are worse in TokuMX.

I will let the 0i.10q test run until writes stop and see if the query rate increases at that point.
45.46%  mongod   libtokufractaltree.so  [.] qlz_compress_core
19.10%  mongod   libtokufractaltree.so  [.] toku_decompress 
 7.19%  mongod   libc-2.12.so           [.] memcpy
 1.71%  mongod   libtokufractaltree.so  [.] toku_x1764_memory
 1.43%  mongod   [kernel.kallsyms]      [k] clear_page_c_e
 1.28%  mongod   [kernel.kallsyms]      [k] copy_user_enhanced_fast_string
 1.26%  mongod   mongod                 [.] free
 1.11%  mongod   libtokufractaltree.so  [.] toku_ftnode_pe_callback 
 0.96%  mongod   libtokufractaltree.so  [.] deserialize_ftnode_partition 
 0.83%  mongod   mongod                 [.] malloc
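For reference, profiles like the one above come from commands along these lines; the flags are approximate and PMP is just gdb collecting all thread stacks:
    # flat profile for a running mongod
    $ perf record -p $(pidof mongod) -- sleep 60
    $ perf report
    # PMP-style thread stacks
    $ gdb -batch -ex "set pagination 0" -ex "thread apply all bt" -p $(pidof mongod)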

Update 1

I ran the 0i.10q test overnight and after 20 hours QPS has increased from 1074 to 1343. The CPU bottleneck from the partial eviction code remains. Partial eviction and waits for it are still frequent per PMP stack traces. The top-10 functions by CPU used are:

36.41%  mongod   libtokufractaltree.so  [.] toku_decompress
12.70%  mongod   libc-2.12.so           [.] memcpy
 5.58%  mongod   libtokufractaltree.so  [.] qlz_compress_core
 3.40%  mongod   [kernel.kallsyms]      [k] clear_page_c_e
 3.11%  mongod   libtokufractaltree.so  [.] toku_x1764_memory
 2.72%  mongod   [kernel.kallsyms]      [k] copy_user_enhanced_fast_string
 1.41%  mongod   libtokufractaltree.so  [.] bn_data::deserialize_from_rbuf
 1.40%  mongod   libtokufractaltree.so  [.] toku_ftnode_pe_callback
 1.28%  mongod   mongod                 [.] free
 0.87%  mongod   mongod                 [.] arena_chunk_dirty_remove