Wednesday, April 16, 2014

TokuMX, MongoDB and InnoDB, IO-bound update-only with fast storage

I repeated the update-only IO-bound tests using pure-flash servers to compare TokuMX, MongoDB and InnoDB. The test setup was the same as on the pure-disk servers except for the hardware: in this case the servers have fast flash storage, 144G of RAM and 24 CPU cores with HT enabled. As a reminder, the InnoDB change buffer and the TokuMX fractal tree don't help on this workload because there are no secondary indexes to maintain. Note that all collections/tables are in one database for this workload, so this is the worst case for the MongoDB per-database RW-lock. The result summary:
  • InnoDB is much faster than MongoDB and TokuMX. This test requires a high rate of dirty page writeback and thanks to a lot of work from the InnoDB team at MySQL with help from Percona and Facebook (and others) the InnoDB engine is now very good at that. Relative to MongoDB, InnoDB also benefits from a clustered PK index.
  • MongoDB is much slower than InnoDB for two reasons. First, it doesn't have a clustered PK index, so it might do storage reads both for the index search and then for reading the document. Second, there is the per-database RW-lock. As I described previously, this lock appears to be held during disk reads done while searching the index, so at most one thread searches the index at a time even when there are concurrent update requests. I created JIRA 3177 to make that obvious in the documentation. Because of this the peak rate for MongoDB is approximately the number of reads per second that one thread can do from the flash device. The device can sustain many more reads/second with concurrency, but MongoDB doesn't get much benefit from that. I think there will be at most 2 concurrent flash/disk reads at any time -- one while searching the index and the other while prefetching the document into RAM after the per-database RW-lock has been released in Record::touch. The sketch after this list illustrates the effect of holding the lock across the read.
  • TokuMX also benefits from the clustered PK index but it suffers from other problems that I was unable to debug. I think it can do much better once a Toku expert reproduces the problem on their hardware.
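
To illustrate the RW-lock effect, here is a minimal Python sketch (not MongoDB code) in which a per-database lock is held across a simulated storage read, as appears to happen during the index search. The 5 ms read latency and the thread count are made up; the point is that adding threads does not raise throughput when the lock is held across the read.

import threading, time

READ_LATENCY = 0.005          # stand-in for one index-search read from storage
db_lock = threading.Lock()    # stand-in for the per-database RW-lock
counter_lock = threading.Lock()
updates = 0

def updater(stop_at):
    global updates
    while time.time() < stop_at:
        with db_lock:              # lock held while "reading from storage"
            time.sleep(READ_LATENCY)
        with counter_lock:
            updates += 1

stop_at = time.time() + 2
threads = [threading.Thread(target=updater, args=(stop_at,)) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
# 8 threads complete about 1/READ_LATENCY updates/second in total,
# which is the same rate that 1 thread would get.
print("updates/second ~= %.0f" % (updates / 2.0))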

Configuration

This test used the sysbench clients as described previously. Tests were run for 8, 16, 32 and 64 concurrent clients. There were 8 collections/tables in one database with 400M documents/rows per collection/table. The test server has fast flash storage that can do more than 5000 reads/second from one thread and more than 50,000 reads/second from many threads. The server also has 24 CPU cores with HT enabled and 144G of RAM. The sysbench clients ran on the same host as mysqld/mongod. Tests were first run for 30 minutes at each concurrency level to warm up the DBMS and then for either 60 or 120 minutes when measurements were taken. I tested the configurations listed below, and a sketch of the update done by the test follows the list. I ran tests for more configurations but forgot to adjust read_ahead_kb, so I won't publish results from those hosts.
  • mongo-p2y - 874 GB database, MongoDB 2.6.0rc2, powerOf2Sizes=1, journalCommitInterval=300, w:1,j:0
  • mysql-4k - 698 GB database, MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=120G, flush_method=O_DIRECT, page_size=4k, doublewrite=0, io_capacity=16000, lru_scan_depth=2000, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=0
  • mysql-8k - 698 GB database, MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=120G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=16000, lru_scan_depth=2000, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=0
  • tokumx-quicklz - 513 GB database, TokuMX 1.4.1 with quicklz compression, logFlushPeriod=300, w:1,j:0
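
To make the update concrete, below is a minimal Python sketch of the workload as I understand it: select a random document by _id (the PK) in one of the 8 collections and increment a non-indexed field, so there is no secondary index maintenance. This is not the sysbench client; the collection and field names, the pymongo calls (pymongo 3.x) and the loop are assumptions for illustration, and a roughly equivalent SQL statement for the InnoDB tests is shown in a comment.

import random
from pymongo import MongoClient

DOCS_PER_COLLECTION = 400 * 1000 * 1000      # 400M documents per collection
client = MongoClient("localhost", 27017)
db = client["sbtest"]                        # all 8 collections in one database

def one_update():
    coll = db["sbtest%d" % random.randint(1, 8)]
    _id = random.randint(1, DOCS_PER_COLLECTION)
    # Roughly equivalent SQL for the InnoDB tests:
    #   UPDATE sbtest1 SET c = c + 1 WHERE id = <_id>
    coll.update_one({"_id": _id}, {"$inc": {"c": 1}})

for _ in range(1000):
    one_update()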

Results

The gap below probably isn't a typo: InnoDB sustained about 5 to 10 times more updates/second than TokuMX and MongoDB. MongoDB does many more disk reads per update, which is similar to the pure-disk results. I don't have the expertise to explain why the TokuMX results weren't better, but I shared information with the Tokutek team. KB written to storage per update is listed for InnoDB to show the impact of a smaller page size on the write rate, which can be important when flash endurance must be improved. The disk reads per update table is computed as iostat r/s divided by TPS; a quick check follows the tables.

TPS
configuration  8 clients  16 clients  32 clients  64 clients
mysql-8k           24834       33886       37573       40198
mysql-4k           24826       31704       34644       34987
tokumx-quicklz      3706        3950        3552        3357
mongo-p2y           5194        5167        5173        5102

disk reads/second from iostat r/s
configuration  8 clients  16 clients  32 clients  64 clients
mysql-8k           20995       28371       31397       33537
mysql-4k           22016       27985       30553       30972
tokumx-quicklz      4943        5641        4962        4783
mongo-p2y           8960        8921        8951        8859

disk reads per update
configuration  8 clients  16 clients  32 clients  64 clients
mysql-8k            0.85        0.84        0.84        0.83
mysql-4k            0.89        0.88        0.88        0.89
tokumx-quicklz      1.33        1.43        1.40        1.42
mongo-p2y           1.73        1.73        1.73        1.74 

KB written per update
configuration  8 clients  16 clients  32 clients  64 clients
mysql-8k            6.56        6.40        5.36        5.36
mysql-4k            3.86        3.72        3.76        3.78
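
As a sanity check, the disk reads per update table is just iostat r/s divided by TPS. A quick computation with the 8-client values from the tables above reproduces it:

# reads per update = iostat r/s divided by TPS, using the 8-client columns above
rs  = {"mysql-8k": 20995, "mysql-4k": 22016, "tokumx-quicklz": 4943, "mongo-p2y": 8960}
tps = {"mysql-8k": 24834, "mysql-4k": 24826, "tokumx-quicklz": 3706, "mongo-p2y": 5194}
for cfg in rs:
    print("%-15s %.2f reads/update" % (cfg, rs[cfg] / float(tps[cfg])))
# mysql-8k 0.85, mysql-4k 0.89, tokumx-quicklz 1.33, mongo-p2y 1.73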
