Wednesday, July 29, 2015

Linkbench, MongoDB and a disk array, now with TokuMX

I repeated the Linkbench tests described in a previous post, this time adding TokuMX 2.0.1. The tests use MongoDB and LinkbenchX to compare storage engines. The test server has 144 GB of RAM, 6 SAS disks with HW RAID 0 and 24 hyperthread cores. The benchmark was run with 12 concurrent clients.
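The 12 clients and the 1-hour runs map to LinkBench properties. This is a minimal sketch using stock LinkBench property names; the loaders value is my assumption, the rest comes from the numbers above:

# number of concurrent request (query) threads
requesters = 12
# number of concurrent load threads (assumed, not stated in this post)
loaders = 12
# stop each request phase after 3600 seconds (1 hour)
maxtime = 3600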

Cached database

The first test is with a cached database. The test pattern is to load the database and then do 12 1-hour query tests in a loop. The database always fit in RAM. At the end of the 12th hour the database size was 40 GB for WiredTiger, 22 GB for RocksDB, 30 GB for TokuMX and 78 GB for mmapv1. I used Snappy compression for WiredTiger/RocksDB and QuickLZ for TokuMX.

This is the data for the graph of average QPS per 1-hour interval:

wiredtiger,18279,17715,16740,16585,16472,15924,15703,15632,15783,15401,15872,15654
rocksdb,10892,9649,9580,9639,9860,9981,9316,9535,9578,9682,9437,9689
tokumx,11078,6881,5832,5132,5864,5434,5495,5340,5168,5505,4763,4924
mmapv1,5066,4918,4821,4758,4629,4666,4589,4613,4663,4626,4563,4642
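Each value is the requests/second that LinkBench reports when a run completes. A simple way to collect them, assuming LinkbenchX keeps the stock LinkBench completion message and that each 1-hour run wrote to its own log file (the file names here are hypothetical):

grep "REQUEST PHASE COMPLETED" hour-*.log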

I then looked at the output from the 12th 1-hour run to understand why QPS was much better for WiredTiger. The table below has the average response time in milliseconds for the 3 most frequent operations. WiredTiger has the best times. mmapv1 has the worst times for writes because writes are single-threaded per database or per collection. TokuMX has the worst time for get_links_list, which requires a short range query.

                add_link        update_link     get_links_list
wiredtiger      1.361           1.422           0.768
rocksdb         1.702           1.789           1.460
tokumx          1.538           1.674           3.929
mmapv1          4.788           5.230           2.657
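For context, get_links_list fetches the most recent visible links for an (id1, link_type) pair, at most 10 per call with LinkBench defaults. This is a sketch of the query shape in the mongo shell, assuming a link collection with id1, link_type, visibility and time fields (the actual LinkbenchX schema may differ):

db.link.find({id1: 1234, link_type: 1, visibility: 1}).sort({time: -1}).limit(10)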

Database larger than RAM

The test was repeated with a database that does not fit in RAM. The test was not run for mmapv1 because I didn't have enough disk space or patience to wait for the load to finish. At the end of the 12th hour the database size was 728 GB for WiredTiger, 632 GB for RocksDB and 588 GB for TokuMX. It is interesting that the TokuMX database was smaller than RocksDB here but larger than RocksDB for the cached test.

This is the data for the graph of average QPS per 1-hour interval:
tokumx,439,580,622,625,638,617,598,613,631,609,610,611
rocksdb,387,448,479,468,468,477,471,483,475,473,471,477
wiredtiger,297,343,345,333,333,331,320,335,326,339,324,333

I then looked at the output from the 12th 1-hour run to understand why QPS was much better for TokuMX. The table below has the average response time in milliseconds for the 3 most frequent operations. TokuMX is faster on get_links_list while RocksDB is faster on add_link and update_link, and get_links_list is done about 5 times per add/update, so it dominates. WiredTiger is the slowest on all of the operations.

                add_link        update_link     get_links_list
tokumx          23.499          25.903          22.987
rocksdb         21.704          23.883          25.835
wiredtiger      47.557          51.122          35.648

TokuMX is the most IO-efficient engine for this workload based on the data below, and that explains why it sustains the highest QPS: disk IO is the bottleneck. I used rates from iostat (r/s, w/s, rKB/s and wKB/s) and divided them by the average QPS, with all data taken from the 12th 1-hour run. I assume that disk reads done by queries dominate reads done by compaction. TokuMX does less read IO per query than RocksDB and WiredTiger, and both TokuMX and RocksDB write much less data per query than WiredTiger.

                read/query      read-KB/query   write-KB/query
tokumx          1.612           14.588          2.495
rocksdb         2.135           20.234          2.512
wiredtiger      2.087           26.675          12.110
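For reference, this is a sketch of the arithmetic: divide the iostat rates by the average QPS. The device name, the column positions (which depend on the sysstat version) and skipping the first report (it averages since boot) are assumptions:

# per-query IO = iostat rate / QPS. qps=611 is the TokuMX average from
# the 12th 1-hour run of the IO-bound test. Assumed columns for iostat -x:
# $4=r/s, $6=rkB/s, $7=wkB/s. The first report is skipped via ++n == 2.
qps=611
iostat -x 60 2 | awk -v qps=$qps '$1 == "sda" && ++n == 2 {
    printf "read/query %.3f  read-KB/query %.3f  write-KB/query %.3f\n",
           $4/qps, $6/qps, $7/qps }'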

Configuration

Here are a few more details on the MongoDB configuration I used. The oplog was enabled for all engines and the block cache was ~70G for all engines. This is the configuration file and startup script for TokuMX:

dbpath = /home/mongo/data
logpath = /home/mongo/log
logappend = true
fork = true
slowms = 2000
oplogSize = 2000
expireOplogHours = 2

# interleave memory allocations across NUMA nodes; $1 sets the TokuMX
# cache size (~70G for these tests)
numactl --interleave=all \
bin/mongod \
    --config $PWD/mongo.conf \
    --setParameter="defaultCompression=quicklz" \
    --setParameter="defaultFanout=128" \
    --setParameter="defaultReadPageSize=16384" \
    --setParameter="fastUpdates=true" \
    --cacheSize=$1 \
    --replSet foobar \
    --checkpointPeriod=900

And this is the configuration file for other engines:

processManagement:
  fork: true
systemLog:
  destination: file
  path: /home/mongo/log
  logAppend: true
storage:
  syncPeriodSecs: 60
  dbPath: /home/mongo/data
  journal:
    enabled: true
  mmapv1:
    journal:
      commitIntervalMs: 100
operationProfiling:
  slowOpThresholdMs: 2000
replication:
  oplogSizeMB: 2000

These settings were added under the storage section for WiredTiger:

  wiredTiger:
    collectionConfig:
      blockCompressor: snappy
    engineConfig:
      journalCompressor: none

And these for RocksDB:

  rocksdb:
    compression: snappy
    configString: "write_buffer_size=16m;max_write_buffer_number=4;max_background_compactions=6;max_background_flushes=3;target_file_size_base=16m;soft_rate_limit=2.9;hard_rate_limit=3;max_bytes_for_level_base=128m;stats_dump_period_sec=60;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=12;level0_stop_writes_trigger=20;max_grandparent_overlap_factor=8;max_bytes_for_level_multiplier=8"
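The storage engine itself is selected when mongod starts. That isn't shown above; with MongoDB 3.0 it is the storage.engine option (or --storageEngine on the command line), for example:

storage:
  engine: rocksdb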


