Monday, April 7, 2014

MongoDB, TokuMX and InnoDB for disk IO-bound, read-only point queries

This repeats a test that was done on pure-flash servers. The goal is to determine whether the DBMS can efficiently use the IO capacity of a pure-disk server. The primary metrics are the QPS that the DBMS can sustain and the ratio of disk reads per query. The summary is that a clustered primary key index makes TokuMX and InnoDB much more IO efficient than MongoDB for PK lookups on IO-bound workloads.

TokuMX and InnoDB get much more QPS than MongoDB from the same IO capacity for this workload because they have a clustered primary key index: there is at most 1 disk read per query, assuming all non-leaf nodes from the index are in memory and all leaf nodes are not. With MongoDB the primary key index is not clustered, so there can be a disk read for the leaf node of the PK index and then a second disk read to fetch the document. And from the data below you can see that the PK index in MongoDB is larger than RAM - 83 vs 72 GB.
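
A rough way to see why the clustered index wins is to model the expected number of disk reads per point query. The sketch below is mine, not from the post: it assumes uniform access, that non-leaf index nodes are always cached, and a hypothetical split of roughly 70GB of cache between the MongoDB PK index and the data files.

    # Back-of-the-envelope model for expected disk reads per point query.
    # Assumptions (not from the post): uniform access, non-leaf index nodes are
    # always in memory, and a page is cached with probability cache_gb / size_gb.

    def miss_rate(size_gb, cache_gb):
        """Fraction of accesses that miss a cache of cache_gb GB over size_gb GB of data."""
        return max(0.0, 1.0 - cache_gb / size_gb)

    def reads_clustered(db_gb, cache_gb):
        # InnoDB/TokuMX: the PK index is clustered, so one leaf read returns the row.
        return miss_rate(db_gb, cache_gb)

    def reads_nonclustered(index_gb, data_gb, index_cache_gb, data_cache_gb):
        # MongoDB: one read for the PK index leaf, then a second read for the document.
        return miss_rate(index_gb, index_cache_gb) + miss_rate(data_gb, data_cache_gb)

    # Sizes from this test; the ~10/60 GB cache split for MongoDB is a guess.
    print(reads_clustered(698, 70))             # mysql: ~0.90, observed 0.93-0.96
    print(reads_nonclustered(83, 828, 10, 60))  # mongop2y: ~1.81, observed 1.84-1.86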

I repeated the test with the clients limited to half of the PK range. With that limit the PK indexes for MongoDB were smaller than RAM - 42 vs 72 GB. But the rate of disk reads per query did not drop as much as I hoped it would and InnoDB/TokuMX still got much more QPS than MongoDB. Note that MongoDB suffers an extra performance penalty from doing disk seeks over a longer distance because its database files are larger than those for InnoDB and TokuMX. Better compression from TokuMX is kind of like short-stroking the disks for free.

Varying the readahead value among 0kb, 8kb, 32kb and 64kb via /sys/block/$device/queue/read_ahead_kb did not have a significant impact. I used XFS with the deadline scheduler. I only present results for readahead set to 32kb.
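
For reference, this is one way to check and change that setting. This is a minimal sketch, not part of the benchmark scripts; it assumes root access and a placeholder device name ("sdb").

    # Check and set /sys/block/<device>/queue/read_ahead_kb.
    # The device name "sdb" is a placeholder; writing the value requires root.
    from pathlib import Path

    def set_read_ahead_kb(device, kb):
        path = Path(f"/sys/block/{device}/queue/read_ahead_kb")
        print(f"{device}: read_ahead_kb was {path.read_text().strip()}")
        path.write_text(f"{kb}\n")
        print(f"{device}: read_ahead_kb now {path.read_text().strip()}")

    # set_read_ahead_kb("sdb", 32)  # 32kb was used for the published results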

Benchmark details

The test server has 72GB of RAM, 16 CPU cores with HT and can sustain more than 2000 read IOPS when there are 64 concurrent requests. The workload was sysbench as described previously, but without secondary indexes: they were not used by the test and I wanted the load to finish faster (well, the load on MongoDB is the problem as I described previously). There were 8 collections/tables in one database with 400M rows per collection/table, and a sketch of what each point query does follows the configuration list below. For anyone trying to reproduce these results there are a few changes I need to push to the sysbench clients that I used. I tested 6 configurations and they are described below along with the database size. For MongoDB I also include the size of the primary key index. For InnoDB and TokuMX the primary key index is clustered and its size is the same as the size of the database.
  • mongop2y - 828 GB data, 83 GB index, MongoDB 2.6.0rc2 using powerOf2Sizes=1
  • mongop2n - 874 GB data, 83 GB index, MongoDB 2.6.0rc2 using powerOf2Sizes=0
  • tokumx - 513 GB data, TokuMX 1.4.1 using quicklz compression
  • tokumx-z - 385 GB data, TokuMX 1.4.1 using zlib compression
  • mysql - 698 GB data, MySQL 5.6.12, InnoDB using 8kb pages, no compression
  • mysql-z - 349 GB data, MySQL 5.6.12, InnoDB using 8kb pages, zlib compression
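
The sysbench client itself is not shown here, but each point query boils down to a primary key lookup. Below is a minimal Python sketch of that lookup for MongoDB, assuming pymongo, a database named sbtest and collections sbtest1..sbtest8 keyed by an integer _id - the names and PK range are my assumptions for illustration, not necessarily what the modified client uses.

    # Point-query loop sketch: pick a random collection and a random PK, fetch one doc.
    # Database/collection names and the PK range are assumptions for illustration.
    import random
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    db = client["sbtest"]
    rows_per_collection = 400 * 1000 * 1000  # 400M rows per collection in this test

    def point_query():
        coll = db[f"sbtest{random.randint(1, 8)}"]
        pk = random.randint(1, rows_per_collection)  # uniform distribution, as in the test
        return coll.find_one({"_id": pk})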

Results at 3.2B rows

This has results when the sysbench client was able to query all rows - there were 400M per collection/table. The buffer pool (within the DBMS or in the OS filesystem cache) was in a steady state when the measurements were taken. The table below has three sections for tests run at 8, 16, 32 and 64 concurrent clients. The first section lists the QPS (more is better), the second is disk reads per second (iostat r/s) and the third is the ratio of disk reads per query (less is better; a quick check of that ratio follows the tables). The average read size was 8kb for uncompressed InnoDB, 4kb for compressed InnoDB, 12kb for TokuMX and 26kb for MongoDB. All InnoDB reads were the same size (8kb uncompressed, 4kb compressed), but I don't think that was true for TokuMX and MongoDB.

QPS
configuration  8 clients  16 clients  32 clients  64 clients
mongop2y             461         629         799         975
mongop2n             469         631         803         981
tokumx               940        1325        1686        2055
tokumx-z             950        1341        1745        2160
mysql                927        1287        1644        2023
mysql-z             1009        1432        1877        2360        

iostat r/s
configuration  8 clients  16 clients  32 clients  64 clients
mongop2y             859        1169        1471        1793
mongop2n             836        1186        1498        1818
tokumx               943        1315        1730        2167
tokumx-z             987        1377        1766        2174
mysql                891        1238        1581        1890
mysql-z              911        1291        1684        2028

disk reads per query
configuration  8 clients  16 clients  32 clients  64 clients
mongop2y            1.86        1.86        1.84        1.84   
mongop2n            1.78        1.88        1.87        1.85
tokumx              1.00        0.99        1.03        1.05
tokumx-z            1.04        1.03        1.01        1.01
mysql               0.96        0.96        0.96        0.93
mysql-z             0.90        0.90        0.90        0.86
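
The reads-per-query numbers are just iostat r/s divided by QPS. A quick check of the 64-client column, using the values from the two tables above:

    # Disk reads per query = (iostat r/s) / QPS, here for the 64-client column.
    for name, qps, rps in [("mongop2y", 975, 1793), ("tokumx", 2055, 2167),
                           ("mysql", 2023, 1890), ("mysql-z", 2360, 2028)]:
        print(f"{name}: {rps / qps:.2f} disk reads per query")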

Results at 1.6B rows

This has results when the sysbench clients were limited to the first 200M rows per collection/table. The buffer pool (within the DBMS or in the OS filesystem cache) was in a steady state when the measurements were taken. The table below has three sections for tests run at 8, 16, 32 and 64 concurrent clients. The first section lists the QPS (more is better), the second is disk reads per second (iostat r/s) and the third is the ratio of disk reads per query (less is better). The average read size was 8kb for InnoDB, 15kb for TokuMX and 21kb for MongoDB. MySQL benefited much more from this change than TokuMX. I am not sure why.

QPS
configuration  8 clients  16 clients  32 clients  64 clients
mongop2y             511         711         936        1180
mongop2n             527         739         978        1233
tokumx               924        1320        1694        2075
mysql               1135        1635        2193        2728

iostat r/s
configuration  8 clients  16 clients  32 clients  64 clients
mongop2y             901        1255        1608        1983
mongop2n             959        1334        1689        2086
tokumx              1037        1375        1833        2327
mysql               1025        1454        1912        2305

disk reads per query
configuration  8 clients  16 clients  32 clients  64 clients
mongop2y            1.76        1.77        1.72        1.68   
mongop2n            1.82        1.81        1.73        1.69
tokumx              1.12        1.04        1.08        1.12
mysql               0.90        0.89        0.87        0.84

Cache benefit

A larger database file has two obvious costs. The first cost is that reads will be done over more of the disk so the average seek latency increases. The disk array in this case has about 1TB of space and the tested databases range from 349GB (much less than half of the disk) to 874GB (most of the disk). Limiting database files to the faster part of the disk, and thus reducing capacity, is called short stroking and is sometimes done on disk arrays used for database servers.

The second cost is that the database cache miss rate will increase as the file grows. Let's assume a uniform access distribution because that was used for this benchmark (and uniform isn't a great fit for many real workloads). But remember that the miss rate doesn't double when the database size doubles: the hit rate is roughly cache size divided by database size, and the miss rate is what we care about. Below I do the math to show what the miss rate might be assuming a 70GB cache and the database sizes from the tested configurations (a short script after the list reproduces the arithmetic):

  • mysql-z - 349GB database, 79.9% miss rate
  • tokumx-z - 385 GB database, 81.8% miss rate
  • tokumx - 513 GB database, 86.3% miss rate
  • mysql - 698 GB database, 89.9% miss rate
  • mongop2y - 828 GB database, 91.5% miss rate
  • mongop2n - 874 GB database, 91.9% miss rate
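
The arithmetic behind that list, as a short script. The 70GB cache size is the assumption stated above; the last digit can differ slightly from the list because of rounding.

    # Miss rate = 1 - cache_size / database_size under the uniform-access assumption.
    cache_gb = 70
    sizes_gb = [("mysql-z", 349), ("tokumx-z", 385), ("tokumx", 513),
                ("mysql", 698), ("mongop2y", 828), ("mongop2n", 874)]
    for name, size_gb in sizes_gb:
        print(f"{name}: {100 * (1 - cache_gb / size_gb):.1f}% miss rate")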

1 comment:

  1. Added results for InnoDB with zlib compression and TokuMX with zlib compression. Also added the section on "Cache benefit"
