- TokuMX is faster because it doesn't use (or waste) random IOPs on writes, so more IO capacity is available for reads. In this workload an update is a read-modify-write operation and the read is likely to require a disk read.
- MongoDB is slower for two reasons. The first reason is the per-database RW-lock, and the result doesn't get better with more concurrent clients. For this test all collections were in one database. The lock is held while the b-tree index for the PK is searched to find the document to update, and disk reads might be done while the lock is held. The second reason is that it does about two disk reads per update while InnoDB and TokuMX do about one. Part of the difference is that InnoDB and TokuMX have clustered PK indexes, but the results are much worse than I expected for MongoDB. I wonder whether caching of index leaf blocks is less effective than I expect, or whether I am wrong to expect it. Maybe this is one of the problems of depending on the OS VM to cache the right data.
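To make the read-modify-write concrete, here is a minimal sketch of what one sysbench-style update by PK might look like against MongoDB with the 2.x Java driver. This is not the actual benchmark client; the host, database, collection and field names are assumptions for illustration. The point is that resolving the _id predicate reads a b-tree leaf and the document is then fetched from a separate location, so a cold update can need two disk reads, while a clustered PK index (InnoDB, TokuMX) keeps the row in the index leaf and one read is usually enough.

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.MongoClient;

    // Minimal sketch of one sysbench-style update by PK (all names are assumptions).
    public class UpdateByPk {
        public static void main(String[] args) throws Exception {
            // MongoClient's default write concern is ACKNOWLEDGED (w:1, j:0),
            // which matches the configurations used for these tests.
            MongoClient mongo = new MongoClient("localhost", 27017);
            DBCollection coll = mongo.getDB("sbtest").getCollection("sbtest1");

            // Step 1: the _id b-tree index is searched (possible disk read #1).
            // Step 2: the document it points to is fetched (possible disk read #2).
            // With a clustered PK the row is in the index leaf, so step 2 is usually free.
            DBObject query = new BasicDBObject("_id", 12345L);
            coll.update(query, new BasicDBObject("$inc", new BasicDBObject("k", 1)));

            mongo.close();
        }
    }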
Yield on page fault
The TPS results for MongoDB are limited by disk read latency. Even though there is a disk array that can do a few thousand random reads per second, the array sustains about 150 reads/second when there is a single stream of IO requests, and the per-database RW-lock guarantees that is the case. So MongoDB won't get more than 1 / disk-read-latency updates per second for this test, regardless of the number of disks in the array or the number of concurrent clients. At about 150 serialized reads/second the read latency is roughly 6-7 milliseconds, so the ceiling is on the order of 150-170 updates/second, which lines up with the ~168 TPS reported below.
MongoDB documentation mentions that the per-database RW-lock can be yielded on page faults but the documentation wasn't specific enough for me. I think this is what you need to know and I hope MongoDB experts correct any mistakes.
- Yield is only done for access to documents. It is not done while accessing primary or secondary indexes. To see where a yield might be done in the code, search for calls to Record::_accessing(), which throws PageFaultException. The record might also be "prefetched" after releasing the per-database RW-lock via a call to Record::touch().
- Yield is done on predicted page faults, not on actual page faults. AFAIK, a signal handler for SIGSEGV could be used to do this for actual page faults. MongoDB does install a handler for SIGSEGV, but only to print a stack trace before exiting. MongoDB has something like an LRU to track memory references and predict page faults. I haven't spent much time trying to figure out that code, but I have seen those functions use a lot of CPU time in some benchmarks. I am curious why the btree code uses that tracking code (it calls likelyInPhysicalMemory). To learn more about the page fault prediction code read the functions Record::likelyInPhysicalMemory and Record::_accessing and the classes PageFaultException and Rolling. A sketch of the yield-and-retry pattern follows this list.
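Here is a minimal sketch in Java of the yield-and-retry control flow described above. The real code is C++ inside mongod and the prediction logic is far more involved; every name below is illustrative and the page-fault prediction is reduced to a boolean flag. It shows only the shape of the mechanism: predict a fault while holding the lock, throw, release the lock, prefetch the record, then retry the operation.

    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Illustrative only: not MongoDB's code, just the yield-and-retry pattern.
    public class PageFaultYieldSketch {
        static class PageFaultException extends RuntimeException {}

        // Stand-in for a document record; the prediction is reduced to a flag.
        static class Record {
            volatile boolean inMemory = false;
            boolean likelyInPhysicalMemory() { return inMemory; } // prediction, not a real SIGSEGV
            void touch() { inMemory = true; }                      // "prefetch": fault the page in
            void applyUpdate() { /* modify the document here */ }
        }

        // Stand-in for the per-database RW-lock.
        static final ReentrantReadWriteLock dbLock = new ReentrantReadWriteLock();

        static void update(Record r) {
            while (true) {
                dbLock.writeLock().lock();
                try {
                    // The index search would happen here and does not yield;
                    // only document access can throw PageFaultException.
                    if (!r.likelyInPhysicalMemory()) {
                        throw new PageFaultException();
                    }
                    r.applyUpdate();
                    return;
                } catch (PageFaultException e) {
                    // fall through: prefetch with the lock released, then retry
                } finally {
                    dbLock.writeLock().unlock();
                }
                r.touch(); // like Record::touch(): read the record with no lock held
            }
        }

        public static void main(String[] args) {
            update(new Record()); // first pass yields and prefetches, second pass succeeds
        }
    }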
I tried to overcome disk read stalls during index searching by changing the Java sysbench client to manually prefetch the to-be-updated document by calling findOne prior to the update. That improved TPS by about 20%. I hoped for more, but the prefetch needs a read-lock and pending write-lock requests on the per-database RW-lock appear to block new read-lock requests. I think this is done to prevent write-lock requests from being starved. So my attempt is not an effective workaround.
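Here is a minimal sketch of that prefetch attempt, again with the 2.x Java driver and illustrative names rather than the actual client code: a findOne on the same _id is issued right before the update so the document is hopefully in memory when the write-lock is taken.

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.MongoClient;

    // Sketch of the manual prefetch: findOne first, then update (names are assumptions).
    public class PrefetchThenUpdate {
        public static void main(String[] args) throws Exception {
            MongoClient mongo = new MongoClient("localhost", 27017);
            DBCollection coll = mongo.getDB("sbtest").getCollection("sbtest1");

            DBObject query = new BasicDBObject("_id", 12345L);

            // Prefetch under a read-lock; this may fault the document into memory,
            // but a pending write-lock request appears to block new read-lock
            // requests, which limits how much this helps.
            coll.findOne(query);

            // The update then takes the per-database write-lock.
            coll.update(query, new BasicDBObject("$inc", new BasicDBObject("k", 1)));

            mongo.close();
        }
    }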
Configuration
This test used the sysbench clients as described previously. Tests were run for 8, 16, 32 and 64 concurrent clients. There were 8 collections/tables in one database with 400M documents/rows per collection/table. The test server has a SAS disk array that can do more than 2000 IOPs with many concurrent requests, 16 CPU cores with HT enabled and 72G of RAM. The sysbench clients ran on the same host as mysqld/mongod. Tests were first run for 30 minutes at each concurrency level to warm up the DBMS and then for either 60 or 120 minutes while measurements were taken. I tested these configurations:
- mongo-p2y - 874 GB database, MongoDB 2.6.0rc2, powerOf2Sizes=1, journalCommitInterval=300, w:1,j:0
- mongo-p2n - 828 GB database, MongoDB 2.6.0rc2, powerOf2Sizes=0, journalCommitInterval=300, w:1,j:0
- mysql - 698 GB database, MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=60G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=3000, lru_scan_depth=500, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=2
- mysql-zlib - 349 GB database, MySQL 5.6.12, InnoDB 2X compression (key_block_size=4) via zlib, flush_log_at_trx_commit=2, buffer_pool_size=60G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=3000, lru_scan_depth=500, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=1
- tokumx-quicklz - 513 GB database, TokuMX 1.4.1 with quicklz compression, logFlushPeriod=300, w:1,j:0
- tokumx-zlib - 385 GB database, TokuMX 1.4.1 with zlib compression, logFlushPeriod=300, w:1,j:0
Results
MongoDB does twice the number of disk reads per update compared to TokuMX and InnoDB. MongoDB TPS does not increase with concurrency. TPS does increase with concurrency for InnoDB and TokuMX which benefit from having many more concurrent pending disk reads. TokuMX does better than InnoDB because it doesn't use random IOPs for database page writes so there is more capacity remaining for reads.
TPS
configuration     8 clients  16 clients  32 clients  64 clients
tokumx-zlib             888        1267        1647        2034
tokumx-quicklz          870        1224        1567        1915
mysql-zlib              562         809         983        1140
mysql                   543         737         913        1043
mongo-p2y               168         168         169         169
mongo-p2n               168         169         168         169
iostat r/s (disk reads per second)
configuration     8 clients  16 clients  32 clients  64 clients
tokumx-zlib             924        1279        1650        2032
tokumx-quicklz          891        1243        1600        1948
mysql-zlib              520         727         862         966
mysql                   512         695         855         970
mongo-p2y               337         340         342         344
mongo-p2n               343         347         350         350
disk reads per update (iostat r/s divided by TPS)
configuration     8 clients  16 clients  32 clients  64 clients
tokumx-zlib            1.04        1.01        1.00        1.00
tokumx-quicklz         1.02        1.02        1.02        1.02
mysql-zlib             0.93        0.90        0.88        0.85
mysql                  0.94        0.94        0.94        0.93
mongo-p2y              2.01        2.02        2.02        2.04
mongo-p2n              2.04        2.05        2.08        2.07