Tuesday, April 15, 2014

MongoDB, TokuMX and InnoDB for disk IO-bound, update-only by PK

I used sysbench to measure TPS for a workload that does 1 update by primary key per transaction. The database was much larger than RAM and the server has a SAS disk array that can do at least 2000 IOPs with a lot of concurrency. The update is to a non-indexed column, so there is no secondary index maintenance, which also means there is no benefit from the fractal tree in TokuMX or the change buffer in InnoDB. I also modified the benchmark client to avoid creating a secondary index. Despite that, TokuMX gets almost 2X more TPS than InnoDB and InnoDB gets 3X to 5X more TPS than MongoDB. A sketch of the per-transaction update appears after the list below.
  • TokuMX is faster because it doesn't use (or waste) random IOPs on writes so more IO capacity is available for reads. In this workload an update is a read-modify-write operation where the read is likely to require a disk read.
  • MongoDB is slower for two reasons. The first is the per-database RW-lock, which means the result doesn't get better with more concurrent clients. For this test all collections were in one database. The lock is held while the b-tree index for the PK is searched to find the document to update, and disk reads might be done while the lock is held. The second is that MongoDB does about twice the number of disk reads per update while InnoDB and TokuMX do about 1 per update. Part of the difference is that InnoDB and TokuMX have clustered PK indexes, but the results are much worse than I expected for MongoDB. I wonder if caching of index leaf blocks is not as effective as I expect, or if I am wrong to expect this. Maybe this is one of the problems of depending on the OS VM to cache the right data.
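
To make the workload concrete, here is a minimal sketch of what each transaction does. This is not the actual benchmark client; it assumes sysbench-style names (database sbtest, table/collection sbtest1, integer PK id/_id, non-indexed column c), the legacy 2.x MongoDB Java driver, and placeholder connection details.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;
  import java.util.Random;
  import com.mongodb.BasicDBObject;
  import com.mongodb.DBCollection;
  import com.mongodb.MongoClient;

  public class UpdateByPkSketch {
    static final Random rand = new Random();

    // One transaction for InnoDB: a single UPDATE by PK on a non-indexed column,
    // so there is no secondary index maintenance. With autocommit enabled the
    // statement is the transaction.
    static void updateMysql(Connection conn, long maxId) throws Exception {
      try (PreparedStatement ps =
               conn.prepareStatement("UPDATE sbtest1 SET c = ? WHERE id = ?")) {
        ps.setString(1, Long.toString(rand.nextLong()));
        ps.setLong(2, 1 + (long) (rand.nextDouble() * maxId));
        ps.executeUpdate();  // read-modify-write: the read is likely a disk read
      }
    }

    // The same logical operation for MongoDB with the 2.x Java driver.
    static void updateMongo(DBCollection coll, long maxId) {
      long id = 1 + (long) (rand.nextDouble() * maxId);
      coll.update(new BasicDBObject("_id", id),
                  new BasicDBObject("$set",
                      new BasicDBObject("c", Long.toString(rand.nextLong()))));
    }

    public static void main(String[] args) throws Exception {
      long maxId = 400L * 1000 * 1000;  // 400M documents/rows per collection/table
      Connection conn = DriverManager.getConnection(
          "jdbc:mysql://127.0.0.1:3306/sbtest", "user", "password");
      MongoClient mongo = new MongoClient("127.0.0.1", 27017);
      DBCollection coll = mongo.getDB("sbtest").getCollection("sbtest1");
      updateMysql(conn, maxId);
      updateMongo(coll, maxId);
      conn.close();
      mongo.close();
    }
  }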

Yield on page fault

The TPS results for MongoDB are limited by disk read latency. Even though there is a disk array that can do a few thousand random reads per second, the array sustains about 150 reads/second when there is a single stream of IO requests. And the per-database RW-lock guarantees that is the case. So MongoDB won't get more than 1 / disk-read-latency updates per second for this test regardless of the number of disks in the array or number of concurrent clients.
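
Back-of-the-envelope, using the numbers above:

  ~150 reads/second from one IO stream  ->  ~6.7 ms per serialized disk read
  upper bound ≈ 1 / 6.7 ms ≈ 150 updates/second
  the measured 168 to 169 TPS for MongoDB below is close to that bound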

MongoDB documentation mentions that the per-database RW-lock can be yielded on page faults but the documentation wasn't specific enough for me. I think this is what you need to know, and I hope MongoDB experts correct any mistakes. A sketch of the yield-and-retry pattern is below.
  1. Yield is only done for access to documents. It is not done while accessing primary or secondary indexes. To see in the code where a yield might be done search for calls to Record::_accessing() which throws PageFaultException. The record might also be "prefetched" after releasing the per-database RW-lock via a call to Record::touch().
  2. Yield is done on predicted page faults, not on actual page faults. AFAIK a signal handler for SIGSEGV could be used to do this for actual page faults, and MongoDB installs a SIGSEGV handler, but only to print a stack trace before exiting. MongoDB has something like an LRU to track memory references and predict page faults. I haven't spent much time trying to figure out that code but have seen those functions use a lot of CPU time for some benchmarks. I am curious why the btree code uses that tracking code (it calls likelyInPhysicalMemory). To learn more about the page fault prediction code read the functions Record::likelyInPhysicalMemory and Record::_accessing and the classes PageFaultException and Rolling.

From reading the above you should assume that you really want all indexes to be cached in RAM. Alas, that can be hard to do for a big database. For this test my server has 72G of RAM and the PK indexes are 83G, so I know that all of the indexes won't be cached.
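
Here is a minimal sketch of the yield-and-retry pattern as I understand it. The names are hypothetical Java stand-ins; the real code is C++ (Record::likelyInPhysicalMemory, Record::_accessing, PageFaultException, Record::touch) and the details differ.

  import java.util.concurrent.locks.ReentrantReadWriteLock;

  class YieldOnPageFaultSketch {
    // Hypothetical stand-in for a memory-mapped document.
    interface Record {
      boolean likelyInPhysicalMemory();  // prediction, not an actual SIGSEGV
      void touch();                      // read the page to fault it in
      void applyUpdate(String newValue);
    }

    // Thrown when the page holding the document is predicted to not be resident.
    static class PageFaultException extends RuntimeException {
      final Record record;
      PageFaultException(Record r) { record = r; }
    }

    private final ReentrantReadWriteLock dbLock = new ReentrantReadWriteLock();

    // The PK index search is assumed to happen under the write lock with no
    // yield, so a disk read during the search stalls every other client.
    void updateByPk(Record rec, String newValue) {
      while (true) {
        dbLock.writeLock().lock();
        try {
          if (!rec.likelyInPhysicalMemory()) {
            throw new PageFaultException(rec);  // predicted fault -> yield the lock
          }
          rec.applyUpdate(newValue);
          return;
        } catch (PageFaultException e) {
          // fall through: prefetch with no lock held, then retry
        } finally {
          dbLock.writeLock().unlock();
        }
        rec.touch();  // Record::touch analogue: fault the page in, lock released
      }
    }
  }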

I tried to overcome disk read stalls during index searching by changing the Java sysbench client to manually prefetch the to-be-updated document by calling findOne prior to the update. That improved TPS by about 20%. I hoped for more, but the prefetch attempt needs a read lock, and pending write-lock requests on the per-database RW-lock appear to block new read-lock requests. I think this is done to prevent write-lock requests from being starved. My attempt is not a workaround.
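
The change was roughly this, again assuming the 2.x Java driver and sysbench-style names (database sbtest, collection sbtest1, field c); the actual client code differs.

  import com.mongodb.BasicDBObject;
  import com.mongodb.DBCollection;
  import com.mongodb.DBObject;
  import com.mongodb.MongoClient;

  public class PrefetchThenUpdate {
    // findOne only needs a read lock (and can yield on a predicted page fault),
    // so the disk read for the document happens before the write lock is taken.
    // The index read for the update still happens under the write lock, and
    // pending write-lock requests appear to block new read-lock requests, which
    // limits how much this helps.
    static void updateWithPrefetch(DBCollection coll, long id, String newValue) {
      DBObject query = new BasicDBObject("_id", id);
      coll.findOne(query);  // manual prefetch: warm the document before updating
      coll.update(query, new BasicDBObject("$set", new BasicDBObject("c", newValue)));
    }

    public static void main(String[] args) throws Exception {
      MongoClient client = new MongoClient("127.0.0.1", 27017);
      try {
        DBCollection coll = client.getDB("sbtest").getCollection("sbtest1");
        updateWithPrefetch(coll, 12345L, "new-value");
      } finally {
        client.close();
      }
    }
  }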

Configuration

This test used the sysbench clients as described previously. Tests were run for 8, 16, 32 and 64 concurrent clients. There were 8 collections/tables in one database with 400M documents/rows per collection/table. The test server has a SAS disk array that can do more than 2000 IOPs with many concurrent requests, 16 CPU cores with HT enabled and 72G of RAM. The sysbench clients ran on the same host as mysqld/mongod. Tests were first run for 30 minutes at each concurrency level to warm up the DBMS and then for either 60 or 120 minutes when measurements were taken. I tested these configurations (a my.cnf fragment for the InnoDB settings follows the list):
  • mongo-p2y - 874 GB database, MongoDB 2.6.0rc2, powerOf2Sizes=1, journalCommitInterval=300, w:1,j:0
  • mongo-p2n - 828 GB database, MongoDB 2.6.0rc2, powerOf2Sizes=0, journalCommitInterval=300, w:1,j:0
  • mysql - 698 GB database, MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=60G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=3000, lru_scan_depth=500, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=2
  • mysql-zlib - 349 GB database, MySQL 5.6.12, InnoDB 2X compression (key_block_size=4) via zlib, flush_log_at_trx_commit=2, buffer_pool_size=60G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=3000, lru_scan_depth=500, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=1
  • tokumx-quicklz - 513 GB database, TokuMX 1.4.1 with quicklz compression, logFlushPeriod=300, w:1,j:0
  • tokumx-zlib - 385 GB database, TokuMX 1.4.1 with zlib compression, logFlushPeriod=300, w:1,j:0
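
For reference, this is an approximate my.cnf fragment that maps the InnoDB settings above to MySQL 5.6 option names. It lists only the options named above, not a complete configuration, and the compressed run also needs ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4 on the tables plus innodb_flush_neighbors=1.

  [mysqld]
  innodb_flush_log_at_trx_commit=2
  innodb_buffer_pool_size=60G
  innodb_flush_method=O_DIRECT
  innodb_page_size=8192
  innodb_doublewrite=0
  innodb_io_capacity=3000
  innodb_lru_scan_depth=500
  innodb_buffer_pool_instances=8
  innodb_write_io_threads=32
  innodb_flush_neighbors=2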

Results

MongoDB does twice the number of disk reads per update compared to TokuMX and InnoDB. MongoDB TPS does not increase with concurrency. TPS does increase with concurrency for InnoDB and TokuMX, which benefit from having many more concurrent pending disk reads. TokuMX does better than InnoDB because it doesn't use random IOPs for database page writes, so there is more capacity remaining for reads.

TPS
configuration  8 clients  16 clients  32 clients  64 clients
tokumx-zlib          888        1267        1647        2034
tokumx-quicklz       870        1224        1567        1915
mysql-zlib           562         809         983        1140
mysql                543         737         913        1043
mongo-p2y            168         168         169         169
mongo-p2n            168         169         168         169

iostat r/s
configuration  8 clients  16 clients  32 clients  64 clients
tokumx-zlib          924        1279        1650        2032
tokumx-quicklz       891        1243        1600        1948
mysql-zlib           520         727         862         966
mysql                512         695         855         970
mongo-p2y            337         340         342         344
mongo-p2n            343         347         350         350

disk reads per update (iostat r/s divided by TPS)
configuration  8 clients  16 clients  32 clients  64 clients
tokumx-zlib         1.04        1.01        1.00        1.00
tokumx-quicklz      1.02        1.02        1.02        1.02
mysql-zlib          0.93        0.90        0.88        0.85
mysql               0.94        0.94        0.94        0.93
mongo-p2y           2.01        2.02        2.02        2.04
mongo-p2n           2.04        2.05        2.08        2.07
