Monday, April 7, 2014

MongoDB, TokuMX and InnoDB for disk IO-bound, read-only point queries

This repeats a test that was done on pure-flash servers. The goal is to determine whether the DBMS can efficiently use the IO capacity of a pure-disk server. The primary metrics are the QPS that the DBMS can sustain and the ratio of disk reads per query. The summary is that a clustered primary key index makes TokuMX and InnoDB much more IO efficient than MongoDB for PK lookups on IO-bound workloads.

TokuMX and InnoDB get much more QPS than MongoDB from the same IO capacity for this workload because they have a clustered primary key index. With a clustered index there is at most 1 disk read per query, assuming all non-leaf nodes of the index are in memory and all leaf nodes are not. With MongoDB the primary key index is not clustered, so there can be a disk read for the leaf node of the PK index and then a second disk read to fetch the document. And from the data below you can see that the PK index in MongoDB is larger than RAM - 83 vs 72 GB.
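Here is a back-of-the-envelope sketch (mine, not from the benchmark clients) of the expected disk reads per point query under a uniform key distribution. The 20/52 split of RAM between the PK index and the data file is a guess used only for illustration; the post reports only the totals.

  # Hypothetical estimate of expected disk reads per point query, assuming a
  # uniform key distribution and a fixed split of cache between index and data.
  def expected_reads_per_query(index_gb, data_gb, cache_index_gb, cache_data_gb):
      index_miss = max(0.0, 1.0 - cache_index_gb / index_gb)
      data_miss = max(0.0, 1.0 - cache_data_gb / data_gb)
      # Non-clustered PK (MongoDB): maybe one read for the PK index leaf node,
      # then maybe a second read to fetch the document. With a clustered PK
      # (InnoDB, TokuMX) the leaf node holds the row, so the expectation is
      # just data_miss, which is at most 1.
      return index_miss + data_miss

  # Guess for MongoDB: 83 GB index, 828 GB data, 72 GB of RAM split 20/52
  # between index and data gives about 1.7 reads/query, in the same ballpark
  # as the ~1.85 measured below.
  print(expected_reads_per_query(83, 828, 20, 52))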

I repeated the test with the clients limited to half of the PK range. With that limit the PK indexes for MongoDB were smaller than RAM - 42 vs 72 GB. But the rate of disk reads per query did not drop as much as I hoped it would and InnoDB/TokuMX still got much more QPS than MongoDB. Note that MongoDB suffers an extra performance penalty from doing disk seeks over a longer distance because its database files are larger than those for InnoDB and TokuMX. Better compression from TokuMX is kind of like short-stroking the disks for free.

Varying the readahead value between 0kb, 8kb, 32kb and 64kb from /sys/block/$device/queue/read_ahead_kb did not have a significant impact. I used XFS with the deadline scheduler. I only present results for readahead set to 32kb.
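For reference, a minimal sketch of how the readahead setting can be varied via sysfs. The device name is a placeholder and the script must run as root; presumably the benchmark is rerun at each value.

  # Set readahead (in KB) for a block device via sysfs. "sdb" is a placeholder.
  def set_readahead_kb(device, kb):
      with open("/sys/block/%s/queue/read_ahead_kb" % device, "w") as f:
          f.write(str(kb))

  for kb in (0, 8, 32, 64):
      set_readahead_kb("sdb", kb)
      # ... rerun the point-query benchmark at this setting ...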

Benchmark details

The test server has 72G of RAM, 16 CPU cores with HT and can sustain more than 2000 read IOPS when there are 64 concurrent requests. The workload was sysbench as described previously, but without secondary indexes because they were not used by the test and I wanted the load to finish faster (the load time for MongoDB is the problem, as I described previously). There were 8 collections/tables in one database with 400M rows per collection/table; a sketch of the point-query loop follows the list below. For anyone trying to reproduce these results there are a few changes I need to push to the sysbench clients that I used. I tested 6 configurations and they are described below along with the database size. For MongoDB I also include the size of the primary key index. For InnoDB and TokuMX the primary key index is clustered and its size is the same as the size of the database.
  • mongop2y - 828 GB data, 83 GB index, MongoDB 2.6.0rc2 using powerOf2Sizes=1
  • mongop2n - 874 GB data, 83 GB index, MongoDB 2.6.0rc2 using powerOf2Sizes=0
  • tokumx - 513 GB data, TokuMX 1.4.1 using quicklz compression
  • tokumx-z - 385 GB data, TokuMX 1.4.1 using zlib compression
  • mysql - 698 GB data, MySQL 5.6.12, InnoDB using 8kb pages, no compression
  • mysql-z - 349 GB data, MySQL 5.6.12, InnoDB using 8kb pages, zlib compression
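A minimal sketch of the point-query loop, assuming pymongo and sysbench-style names (database sbtest, collections sbtest1 .. sbtest8); the real clients were the modified sysbench programs mentioned above.

  # Uniform-random PK lookups over 8 collections x 400M documents. The
  # database and collection names are assumptions in the sysbench style.
  import random
  from pymongo import MongoClient

  NUM_COLLECTIONS = 8
  ROWS_PER_COLLECTION = 400 * 1000 * 1000
  db = MongoClient().sbtest

  def one_point_query():
      coll = db["sbtest%d" % random.randint(1, NUM_COLLECTIONS)]
      return coll.find_one({"_id": random.randint(1, ROWS_PER_COLLECTION)})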

Results at 3.2B rows

This has results when the sysbench clients were able to query all rows - there were 400M per collection/table. The buffer pool (within the DBMS or in the OS filesystem cache) was in a steady state when the measurements were taken. The tables below have results for tests run at 8, 16, 32 and 64 concurrent clients. The first table lists QPS (more is better), the second lists disk reads per second (iostat r/s) and the third lists the ratio of disk reads per query (less is better). The average read size was 8kb for uncompressed InnoDB, 4kb for compressed InnoDB, 12kb for TokuMX and 26kb for MongoDB. All InnoDB reads were the same size (8kb uncompressed, 4kb compressed) but I don't think that was true for TokuMX and MongoDB.

QPS
configuration  8 clients  16 clients  32 clients  64 clients
mongop2y             461         629         799         975
mongop2n             469         631         803         981
tokumx               940        1325        1686        2055
tokumx-z             950        1341        1745        2160
mysql                927        1287        1644        2023
mysql-z             1009        1432        1877        2360        

iostat r/s
configuration  8 clients  16 clients  32 clients  64 clients
mongop2y             859        1169        1471        1793
mongop2n             836        1186        1498        1818
tokumx               943        1315        1730        2167
tokumx-z             987        1377        1766        2174
mysql                891        1238        1581        1890
mysql-z              911        1291        1684        2028

disk reads per query
configuration  8 clients  16 clients  32 clients  64 clients
mongop2y            1.86        1.86        1.84        1.84   
mongop2n            1.78        1.88        1.87        1.85
tokumx              1.00        0.99        1.03        1.05
tokumx-z            1.04        1.03        1.01        1.01
mysql               0.96        0.96        0.96        0.93
mysql-z             0.90        0.90        0.90        0.86
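The third table is derived from the first two: disk reads per query is iostat r/s divided by QPS. A quick check of the arithmetic with two of the rows above:

  # reads per query = (iostat r/s) / QPS, per configuration and client count
  qps = {"mongop2y": [461, 629, 799, 975], "mysql": [927, 1287, 1644, 2023]}
  rps = {"mongop2y": [859, 1169, 1471, 1793], "mysql": [891, 1238, 1581, 1890]}

  for name in qps:
      print(name, ["%.2f" % (r / q) for r, q in zip(rps[name], qps[name])])
      # mongop2y -> 1.86, 1.86, 1.84, 1.84 and mysql -> 0.96, 0.96, 0.96, 0.93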

Results at 1.6B rows

This has results when the sysbench clients were limited to the first 200M rows per collection/table. The buffer pool (within the DBMS or in the OS filesystem cache) was in a steady state when the measurements were taken. The tables below have results for tests run at 8, 16, 32 and 64 concurrent clients. The first table lists QPS (more is better), the second lists disk reads per second (iostat r/s) and the third lists the ratio of disk reads per query (less is better). The average read size was 8kb for InnoDB, 15kb for TokuMX and 21kb for MongoDB. MySQL benefited much more from this change than TokuMX and I am not sure why.

QPS
configuration  8 clients  16 clients  32 clients  64 clients
mongop2y             511         711         936        1180
mongop2n             527         739         978        1233
tokumx               924        1320        1694        2075
mysql               1135        1635        2193        2728

iostat r/s
configuration  8 clients  16 clients  32 clients  64 clients
mongop2y             901        1255        1608        1983
mongop2n             959        1334        1689        2086
tokumx              1037        1375        1833        2327
mysql               1025        1454        1912        2305

disk reads per query
configuration  8 clients  16 clients  32 clients  64 clients
mongop2y            1.76        1.77        1.72        1.68   
mongop2n            1.82        1.81        1.73        1.69
tokumx              1.12        1.04        1.08        1.12
mysql               0.90        0.89        0.87        0.84

Cache benefit

A larger database file has two obvious costs. The first cost is that reads will be done over more of the disk so the average seek latency increases. The disk array in this case has about 1TB of space and the tested databases range from 349GB (much less than half of the disk) to 874GB (most of the disk). Limiting database files to the faster part of the disk, and thus reducing capacity, is called short stroking and is sometimes done on disk arrays used for database servers.

The second cost is that the database cache miss rate will increase as the file grows. Let's assume a uniform access distribution because that is what this benchmark used (and uniform isn't a great fit for many real workloads). Remember that the miss rate doesn't double when the database size doubles. With uniform access the hit rate is the cache size divided by the database size, so doubling the database halves the hit rate, but the miss rate is what we care about and it grows much more slowly. Below I do the math to show what the miss rate might be assuming a 70GB cache and the database sizes from the tested binaries (a small script that reproduces these numbers follows the list):

  • mysql-z - 349 GB database, 79.9% miss rate
  • tokumx-z - 385 GB database, 81.8% miss rate
  • tokumx - 513 GB database, 86.3% miss rate
  • mysql - 698 GB database, 89.9% miss rate
  • mongop2y - 828 GB database, 91.5% miss rate
  • mongop2n - 874 GB database, 91.9% miss rate
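A small script that reproduces (to within rounding) the estimates above:

  # Uniform access pattern: miss rate = 1 - cache_size / database_size
  CACHE_GB = 70.0
  sizes_gb = {"mysql-z": 349, "tokumx-z": 385, "tokumx": 513,
              "mysql": 698, "mongop2y": 828, "mongop2n": 874}

  for name, gb in sorted(sizes_gb.items(), key=lambda kv: kv[1]):
      print("%-10s %4d GB  miss rate %.1f%%" % (name, gb, 100 * (1 - CACHE_GB / gb)))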

1 comment:

  1. Added results for InnoDB with zlib compression and TokuMX with zlib compression. Also added the section on "Cache benefit"
