Wednesday, July 29, 2015

Linkbench, MongoDB and a disk array, now with TokuMX

I repeated the linkbench tests described in a previous post for TokuMX 2.0.1. These tests use LinkbenchX with MongoDB and several storage engines. The test server has 144 GB of RAM, 6 SAS disks with HW RAID 0 and 24 hyperthread cores. The benchmark was run with 12 concurrent clients.

Cached database

The first test is with a cached database. The test pattern is to load the database and then do 12 1-hour query tests in a loop. The database always fit in RAM. At the end of the 12th hour the database size was 40 GB for WiredTiger, 22 GB for RocksDB, 30 GB for TokuMX and 78 GB for mmapv1. I used Snappy compression for WiredTiger/RocksDB and QuickLZ for TokuMX.

The graph below has the average QPS per 1-hour interval.
This is the data for the graph:

wiredtiger,18279,17715,16740,16585,16472,15924,15703,15632,15783,15401,15872,15654
rocksdb,10892,9649,9580,9639,9860,9981,9316,9535,9578,9682,9437,9689
tokumx,11078,6881,5832,5132,5864,5434,5495,5340,5168,5505,4763,4924
mmapv1,5066,4918,4821,4758,4629,4666,4589,4613,4663,4626,4563,4642

I then looked at the output from the 12th 1-hour run to understand why QPS was much better for WiredTiger. The table below has the average response time in milliseconds for the 3 most frequent operations. WiredTiger has the best times, mmapv1 has the worst times for writes (writes are single-threaded per database or per collection) and TokuMX has the worst time for get_links_list. The get_links_list operation requires a short range query.

                add_link        update_link     get_links_list
wiredtiger      1.361           1.422           0.768
rocksdb         1.702           1.789           1.460
tokumx          1.538           1.674           3.929
mmapv1          4.788           5.230           2.657
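
As a sanity check, for a closed-loop benchmark like this the mean operation time should be roughly the thread count divided by QPS. A quick sketch of that check in Python, using the 12-client QPS numbers from the 12th run above:

# Little's law sanity check for a closed-loop benchmark:
# threads / QPS ~= mean operation time. QPS is from the 12th run above.
threads = 12
qps_12h = {"wiredtiger": 15654, "rocksdb": 9689, "tokumx": 4924, "mmapv1": 4642}

for engine, qps in sorted(qps_12h.items()):
    mean_ms = 1000.0 * threads / qps
    print("%-10s implied mean op time = %.3f ms" % (engine, mean_ms))

# WiredTiger works out to ~0.77ms, close to its 0.768ms time for
# get_links_list, which dominates the operation mix.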

Database larger than RAM

The test was repeated with a database that does not fit in RAM. The test was not run for mmapv1 because I didn't have enough disk space or patience to wait for the load to finish. At the end of the 12th hour the database size was 728 GB for WiredTiger, 632 GB for RocksDB and 588 GB for TokuMX. It is interesting that the TokuMX database was smaller than RocksDB here but larger than RocksDB for the cached test.

The graph below has the average QPS per 1-hour interval.

This is the data for the graph:
tokumx,439,580,622,625,638,617,598,613,631,609,610,611
rocksdb,387,448,479,468,468,477,471,483,475,473,471,477
wiredtiger,297,343,345,333,333,331,320,335,326,339,324,333

I then looked at the output from the 12th 1-hour run to understand why QPS was much better for TokuMX. The table below has the average response time in milliseconds for the 3 most frequent operations. TokuMX is faster on get_links_list while RocksDB is faster on add_link and update_link, and the get_links_list operation is done about 5 times per add/update. WiredTiger is the slowest on all of the operations.

                add_link        update_link     get_links_list
tokumx          23.499          25.903          22.987
rocksdb         21.704          23.883          25.835
wiredtiger      47.557          51.122          35.648

Based on the data below, TokuMX is the most IO efficient for this workload, and because disk IO is the bottleneck that explains why it sustains the highest QPS. I used data from iostat (r/s, w/s, rKB/s and wKB/s) and divided those rates by the average QPS, with all data taken from the 12th 1-hour run. I assume that disk reads done by queries dominate reads done by compaction. TokuMX does less IO per query than RocksDB and WiredTiger. Both TokuMX and RocksDB write much less data per query than WiredTiger.

                read/query      read-KB/query   write-KB/query
tokumx          1.612           14.588          2.495
rocksdb         2.135           20.234          2.512
wiredtiger      2.087           26.675          12.110
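
The arithmetic behind the table is simple. A sketch of it in Python; the iostat rates below are placeholders, not the measured values:

# Per-query IO efficiency: divide iostat rates by the average QPS
# from the same interval. The rates here are placeholders, not the
# values measured during the test.
def io_per_query(r_s, rkb_s, wkb_s, qps):
    return {"read/query": r_s / qps,
            "read-KB/query": rkb_s / qps,
            "write-KB/query": wkb_s / qps}

print(io_per_query(r_s=1000.0, rkb_s=9000.0, wkb_s=1500.0, qps=611.0))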

Configuration

This has a few more details on the MongoDB configuration I used. The oplog was enabled for all engines. This is the configuration file and startup script for TokuMX. The block cache was ~70G for all engines.

dbpath = /home/mongo/data
logpath = /home/mongo/log
logappend = true
fork = true
slowms = 2000
oplogSize = 2000
expireOplogHours = 2

numactl --interleave=all \
bin/mongod \
    --config $PWD/mongo.conf \
    --setParameter="defaultCompression=quicklz" \
    --setParameter="defaultFanout=128" \
    --setParameter="defaultReadPageSize=16384" \
    --setParameter="fastUpdates=true" \
    --cacheSize=$1 \
    --replSet foobar \
    --checkpointPeriod=900

And this is the configuration file for other engines:

processManagement:
  fork: true
systemLog:
  destination: file
  path: /home/mongo/log
  logAppend: true
storage:
  syncPeriodSecs: 60
  dbPath: /home/mongo/data
  journal:
    enabled: true
  mmapv1:
    journal:
      commitIntervalMs: 100
operationProfiling.slowOpThresholdMs: 2000
replication.oplogSizeMB: 2000

storage.wiredTiger.collectionConfig.blockCompressor: snappy
storage.wiredTiger.engineConfig.journalCompressor: none

storage.rocksdb.compression: snappy
storage.rocksdb.configString: "write_buffer_size=16m;max_write_buffer_number=4;max_background_compactions=6;max_background_flushes=3;target_file_size_base=16m;soft_rate_limit=2.9;hard_rate_limit=3;max_bytes_for_level_base=128m;stats_dump_period_sec=60;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=12;level0_stop_writes_trigger=20;max_grandparent_overlap_factor=8;max_bytes_for_level_multiplier=8"



Thursday, July 16, 2015

Performance impact from too-small RAID stripes

What is the impact from the RAID stripe size? That depends on the workload. Many years ago there was an excellent paper from Oracle with the title Optimal Storage Configuration Made Easy. It is still worth reading today. One of the points in the paper is that you don't want individual storage read requests to use more than one disk for workloads with high concurrency to reduce the number of disk seeks.

RocksDB is used for OLTP and supports workloads with high concurrency from user requests and from compaction done in the background. There might be 8 threads doing compaction on a busy server and a typical compaction step done by a thread is to read from ~11 SST files and then write ~11 SST files. While each SST file is read sequentially, all of the files are read at the same time. The SST files are written in sequence. So we have ~12 streams of IO per compaction thread (11 for reads, 1 for write) and with 8 compaction threads then we can have ~96 streams of concurrent IO. The reads are done a block at a time (for example 16kb) and depend on file system readahead to get larger requests and reduce seeks. The write requests are likely to be large because the files are multi-MB and fsync is called when all of the data to the file has been written or optionally when sync_file_range is called after every N MB of writes.
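
The arithmetic for the stream count, as a sketch:

# Back-of-envelope count of concurrent IO streams from compaction,
# using the numbers in the paragraph above.
compaction_threads = 8
read_streams_per_thread = 11   # ~11 input SST files read at the same time
write_streams_per_thread = 1   # output SST files are written in sequence

streams = compaction_threads * (read_streams_per_thread +
                                write_streams_per_thread)
print("concurrent IO streams from compaction:", streams)  # ~96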

Even if we limit our focus to writes there is a lot of concurrency as there is likely 1 write stream in progress for each compaction thread and another for the WAL. The reads done by compaction are interesting because many of them will get data from the OS filesystem cache. When level style compaction is used, the small levels near the top of the LSM are likely to be in cache while the largest level is not.

Reads vs stripe size

Enough about RocksDB, this post is supposed to be about the performance impact from RAID stripe sizes. I ran a simple IO performance test on two servers that were identical except for the RAID stripe size. Both had 15 disks with HW RAID 0, one used a 256KB stripe and the other a 1MB stripe. While the disk array is ~55TB I ran tests limited to the first 10TB and used the raw device. Domas provided the test client and it was run for 1 to 128 threads doing 1MB random reads.

The array with a 1MB stripe gets more throughput when there are at least 2 threads doing reads. At high concurrency the array with a 1MB stripe gets almost 2X more throughput.

Read MB/second by concurrency
1       2       4       8       12      16      24      32      64      128     threads
67      125     223     393     528     642     795     892     891     908     1MB stripe
74      115     199     285     332     373     407     435     499     498     256KB stripe
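
A simple model explains most of the gap: a randomly aligned read of size R on stripe size S touches about R/S + 1 disks, and each disk touched costs a seek. A sketch of the model, not a measurement:

# Expected disks touched (and seeks paid) per randomly aligned read.
# A request of size R on stripe size S spans about R/S + 1 stripes.
def disks_touched(request_kb, stripe_kb):
    return float(request_kb) / stripe_kb + 1

for stripe_kb in (256, 1024):
    print("stripe=%4dKB: ~%.1f disks per 1MB read" %
          (stripe_kb, disks_touched(1024, stripe_kb)))

# 256KB stripe: ~5 disks per read; 1MB stripe: ~2 disks per read.
# With seeks as the bottleneck at high concurrency that predicts up
# to ~2.5X more throughput for the 1MB stripe, close to the ~2X
# measured above.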

Monday, July 13, 2015

Summary of Linkbench for MongoDB & MySQL

I published many benchmark reports last week for Linkbench using MongoDB and MySQL with several storage engines: RocksDB, InnoDB, WiredTiger and mmapv1. My goal from running tests like this is to find things we can make better. From these tests the thing to make better is load performance for MongoDB. MySQL did better on the load and query tests, but the difference on the load tests was much larger. I suspect that much of the problem can be blamed on the need for 9 indexes in MongoDB versus 4 indexes in MySQL to support the Linkbench workload, as more indexes mean more index maintenance during the load. If we get support for clustered primary key indexes into a future MongoDB release then we can reduce this from 9 to 6 indexes. The extra two indexes might be here forever because there is no composite PK index in MongoDB. The results from last week are in the posts below.

MySQL schema

From the MySQL Setup section in the Linkbench README file there are 4 indexes created across 3 tables:
  • linktable has a primary key index on (link_type, id1, id2) and a secondary index on (id1, link_type, visibility, time, id2, version, data). The secondary index is covering for the most frequent query. While it almost doubles the size of linktable it also greatly reduces the worst-case random IO required for that query from a few thousand to a few. This can be a good tradeoff even when using PCIe flash.
  • counttable has a primary key index on (id, link_type)
  • nodetable has a primary key index on (id)

Why 5 extra indexes?

Why does MongoDB need 5 more indexes than MySQL? From the MongoDB driver in LinkbenchX there are 3 indexes created by the user and then 3 implicit primary key indexes on _id:
  • the link collection has a primary key index on _id, a unique secondary index on (link_type, id1, id2) and a non-unique secondary index on (id1, link_type, visibility, time, id2, version, data)
  • the count collection has a primary key index on _id and a unique secondary index on (id, link_type)
  • the node collection has a primary key index on _id
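A sketch of how the user-created indexes above might be declared via pymongo; the database name is an assumption and this is not the LinkbenchX driver code:

# Declare the user-created indexes described above (a sketch, not the
# LinkbenchX driver code). The _id indexes are created implicitly.
from pymongo import ASCENDING, MongoClient

db = MongoClient("localhost", 27017).linkdb

db.link.create_index([("link_type", ASCENDING), ("id1", ASCENDING),
                      ("id2", ASCENDING)], unique=True)
db.link.create_index([("id1", ASCENDING), ("link_type", ASCENDING),
                      ("visibility", ASCENDING), ("time", ASCENDING),
                      ("id2", ASCENDING), ("version", ASCENDING),
                      ("data", ASCENDING)])
db.count.create_index([("id", ASCENDING), ("link_type", ASCENDING)],
                      unique=True)
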
I have described 6 indexes but claimed that 9 were used. The other 3 are used for WiredTiger and RocksDB but not for mmapv1: they are indexes on DiskLoc (key is DiskLoc, value is the document), one per collection. While WiredTiger and RocksDB support a clustered primary key, that is not supported (yet) in MongoDB because the storage engine API is still evolving from what was needed for mmapv1. With mmapv1 the base table is heap organized and the value in a primary key index is a DiskLoc, which I assume is a filename and offset where the document resides. WiredTiger and RocksDB are copy-on-write and the location for a document will change one or more times after an update, so they can't use DiskLoc to point to the one place on disk where the document exists, thus they need the extra index. SERVER-14569 and SERVER-15354 are open for this.

If we get proper support for a clustered primary key index, then MongoDB would use 6 indexes for Linkbench versus 4 for MySQL. How do we reduce that from 6 to 4? The unique secondary indexes for the link and count collections are not needed in MySQL. The problem here is that MongoDB doesn't support a composite PK. One workaround is to concatenate the fields (link_type | id1 | id2 -> _id) for the link collection and (id | link_type -> _id) for the count collection. I don't recommend that because it must be done by the client, and done correctly, on every insert and update. It is even more work to make the concatenated value order preserving, which is easy in code, but makes it harder to construct the _id value when manually querying the database. This also means that some fields will be stored multiple times: once in the _id field and once separately.
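
For example, an order-preserving encoding can pack the fields as fixed-width big-endian integers so that byte order matches key order. A sketch; the field widths are assumptions:

# Order-preserving composite _id for the link collection. Pack the
# fields as fixed-width big-endian values so that lexicographic byte
# comparison preserves numeric order. Field widths are assumptions.
import struct

def link_id(link_type, id1, id2):
    # '>q' = big-endian signed 64-bit. Negative values would sort
    # after positive ones under byte comparison -- bias them if that
    # matters for your data.
    return struct.pack(">qqq", link_type, id1, id2)

# To query one link by hand you must rebuild the same bytes:
# db.link.find_one({"_id": link_id(1, 1234, 5678)})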

Thursday, July 9, 2015

Linkbench, MongoDB and a disk array

This has results for Linkbench with MongoDB and a server with a disk array. The Linkbench configurations are similar to the tests done for cached and uncached databases, but the server here uses a disk array with 6 10K RPM SAS disks and HW RAID 0. The previous tests used PCIe flash. The server with a disk array has 24 hyperthread cores and is at least one generation older than the PCIe flash server which has 40 hyperthread cores. Both have 144G RAM.

Results for cached database

The test client is LinkbenchX and the configuration is described in a previous post. The test used 10 threads for loading and 20 for queries. After loading there were 12 1-hour runs of the query test and results are reported for the 2nd and 12th hour.

The results below are for the test with maxid1=20M set in FBWorkload.properties for both servers (disk array, PCIe flash from a previous post). The load rate is similar between the servers with a disk array and PCIe flash. The query rate is better for the server with PCIe flash, but that might be due more to more cores and newer CPUs than to storage performance. The load and query rates are better for WiredTiger than for RocksDB on both servers.

The server names are explained in a previous post. The oplog was enabled for all tests.

--- results for disk array
load    load    load    2h      2h      12h     12h     24h     24h
time    rate    size    qps     size    qps     size    qps     size    server
5333    16519   14g     9649    17g     9689    22g                     mongo.rocks.log
3015    29223   16g     17715   30g     15654   40g                     mongo.wt.log
35253   2499    65g     4918    68g     4642    78g                     mongo.mmap.log

--- results for PCIe flash
5015    17565   14g     14442   17g     13925   24g     13506   29g     mongo.rocks.log
3601    28020   16g     25488   34g     22992   45g                     mongo.wt.log

Results for uncached database

The results below are for the test with maxid1=1B for both servers (disk array, PCIe flash from a previous post). The load rate is similar between disk and flash and is also similar to the rates above for the cached database. Multiple secondary indexes are maintained during the load but IO latency does not have a significant impact on the load rate, even for WiredTiger which does more random IO than RocksDB.

The query rates are significantly lower for the disk array than for PCIe flash. IO latency is significant for queries. However, RocksDB does better than WiredTiger on the disk array possibly because it uses less random IO for writes which leaves more random IO capacity to serve reads.

Update - I repeated tests for the disk-array server with different configurations and the results are better. For WiredTiger I set storage.syncPeriodSecs=600 which changes the default checkpoint interval from 60 to 600 seconds. The benefit should be fewer disk writes and QPS improved by more than 30% with that change. For RocksDB I used the default configuration and QPS improved by more than 20% compared to the non-default configuration I had been using (Igor did a good job choosing the defaults). For all engines I used a smaller block cache -- 32G rather than 70G -- to save more space for compressed blocks in the OS filesystem cache. Results for all engines improved with a smaller block cache.

--- results for disk array
load    load    load    2h      2h      12h     12h     24h     24h     48h     48h
time    rate    size    qps     size    qps     size    qps     size    qps     size    server
298901  14625   606g    580     585g    611     588g    604     590g    600     596g    mongo.tokumx.log, 70G
297159  14711   597g    731     585g    786     588g    782     592g    736     598g    mongo.tokumx.log, 32G
178923  24432   694g    343     704g    333     728g                                    mongo.wt.log, default, 70G
176432  24777   696g    449     709g    434     738g    423     749g    418     757g    mongo.wt.log, non-default, 32G
271569  16097   631g    448     631g    477     632g    452     633g    471     635g    mongo.rocks.log, non-default, 70G
274780  15909   628g    458     628g    592     629g    574     631g    569     633g    mongo.rocks.log, default, 32G

--- results for PCIe flash
251688  17368   630g    9670    633g    6762    644g    6768    656g    mongo.rocks.log
175740  24874   695g    8533    766g    8019    791g    7718    806g    mongo.wt.log

This has the mean response time in milliseconds for each of the query types in Linkbench. The data below is from the 12th 1-hour run for the disk array (the first pair of numbers) and then for PCIe flash (the second pair of numbers). The most frequent operation is GET_LINKS_LIST followed by MULTIGET_LINK. On the disk array the response time for these two operations is better for RocksDB, which explains why it gets more QPS than WiredTiger. For PCIe flash the response time for GET_LINKS_LIST is lower for WiredTiger, which explains the better QPS. The response time for all of the write operations is better for RocksDB than WiredTiger on disk and flash, but those operations are less frequent. WiredTiger does more reads from disk & flash during writes as b-tree leaf pages must be read before being written.

The QPS for RocksDB is higher than WiredTiger for the 2nd 1-hour run with the PCIe server but lower after the 12th 1-hour run. The mean response time for the GET_LINKS_LIST operation almost doubles and the cause might be the range read penalty from an LSM.

                12th 1-hour run         2nd 1-hour run  12th 1-hour run
                disk    disk    -       flash   flash   flash   flash   
                wired   rocks   -       wired   rocks   wired   rocks   
ADD_NODE        0.717   0.444           0.327   0.300   0.324   0.276
UPDATE_NODE     23.0    21.5            1.217   1.083   1.240   0.995
DELETE_NODE     22.9    21.6            1.260   0.675   1.285   1.018   
GET_NODE        22.7    20.9            0.913   2.355   0.941   0.622
ADD_LINK        47.6    23.5            2.988   1.610   3.142   2.255
DELETE_LINK     31.9    21.6            2.063   1.610   2.407   1.701
UPDATE_LINK     51.1    25.8            3.238   2.507   3.407   2.395   
COUNT_LINK      16.7    10.9            0.686   0.571   0.739   0.547
MULTIGET_LINK   22.6    18.9            1.599   1.195   1.603   1.136

GET_LINKS_LIST  35.6    27.6            1.910   2.181   2.056   3.945

And the data below is example output from the end of one test: RocksDB on flash for the 12th 1-hour run.

ADD_NODE count = 627235  p25 = [0.2,0.3]ms  p50 = [0.2,0.3]ms  p75 = [0.2,0.3]ms  p95 = [0.3,0.4]ms  p99 = [1,2]ms  max = 263.83ms  mean = 0.276ms
UPDATE_NODE count = 1793589  p25 = [0.7,0.8]ms  p50 = [0.8,0.9]ms  p75 = [0.9,1]ms  p95 = [1,2]ms  p99 = [4,5]ms  max = 301.453ms  mean = 0.995ms
DELETE_NODE count = 246225  p25 = [0.7,0.8]ms  p50 = [0.8,0.9]ms  p75 = [0.9,1]ms  p95 = [1,2]ms  p99 = [4,5]ms  max = 265.012ms  mean = 1.018ms
GET_NODE count = 3150740  p25 = [0.4,0.5]ms  p50 = [0.5,0.6]ms  p75 = [0.5,0.6]ms  p95 = [0.9,1]ms  p99 = [3,4]ms  max = 301.078ms  mean = 0.622ms
ADD_LINK count = 2189319  p25 = [1,2]ms  p50 = [2,3]ms  p75 = [2,3]ms  p95 = [3,4]ms  p99 = [7,8]ms  max = 317.292ms  mean = 2.255ms
DELETE_LINK count = 727942  p25 = [0.4,0.5]ms  p50 = [0.6,0.7]ms  p75 = [2,3]ms  p95 = [3,4]ms  p99 = [6,7]ms  max = 320.13ms  mean = 1.701ms
UPDATE_LINK count = 1949970  p25 = [1,2]ms  p50 = [2,3]ms  p75 = [2,3]ms  p95 = [3,4]ms  p99 = [7,8]ms  max = 393.483ms  mean = 2.395ms
COUNT_LINK count = 1190142  p25 = [0.3,0.4]ms  p50 = [0.4,0.5]ms  p75 = [0.5,0.6]ms  p95 = [0.8,0.9]ms  p99 = [2,3]ms  max = 296.65ms  mean = 0.547ms
MULTIGET_LINK count = 127871  p25 = [0.7,0.8]ms  p50 = [0.9,1]ms  p75 = [1,2]ms  p95 = [1,2]ms  p99 = [4,5]ms  max = 272.272ms  mean = 1.136ms
GET_LINKS_LIST count = 12353781  p25 = [0.5,0.6]ms  p50 = [1,2]ms  p75 = [1,2]ms  p95 = [3,4]ms  p99 = [38,39]ms  max = 2360.432ms  mean = 3.945ms
REQUEST PHASE COMPLETED. 24356814 requests done in 3601 seconds. Requests/second = 6762
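
The per-operation lines have a regular format, so pulling out the counts and means is easy. A sketch that parses output like the above:

# Extract per-operation count and mean from lines like
# "ADD_NODE count = 627235  p25 = [0.2,0.3]ms ... mean = 0.276ms"
import re

PATTERN = re.compile(r"^(\w+) count = (\d+) .* mean = ([\d.]+)ms")

def parse(lines):
    for line in lines:
        m = PATTERN.match(line)
        if m:
            yield m.group(1), int(m.group(2)), float(m.group(3))

# for op, count, mean in parse(open("linkbench.out")):
#     print(op, count, mean)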

Wednesday, July 8, 2015

Linkbench for MySQL and MongoDB with a large database

This has results for Linkbench where the database is much larger than RAM. I previously shared results for a database that can be cached in RAM.

I used Linkbench for MySQL and LinkbenchX for MongoDB as described in the Configuration section in a previous post. The value of maxid1 was set to 1,000,000,001 in the FBWorkload.properties file. For the cached database test that was set to 20,000,001. The test server has 144G of RAM, 40 hyperthread cores and PCIe flash. The tests were configured to use 10 threads for loading and 20 threads for queries. I disabled the binlog during the load for MySQL and enabled it during the query tests. The oplog was always enabled for MongoDB.

I don't report results for mmapv1 because the load would take too long and I might not have enough storage. I used WebScaleSQL for MySQL and the v3.0-fb MongoDB partners branch for MongoDB. The following configurations were tested:
  • mysql.rocks.log - MySQL with RocksDB and the binary log enabled
  • mysql.inno.log - MySQL with InnoDB and the binary log enabled
  • mongo.rocks.log - MongoDB with RocksDB and the oplog enabled
  • mongo.wiredtiger.log - MongoDB with WiredTiger and the oplog enabled
  • mongo.mmap.log - MongoDB with mmapv1 and the oplog enabled

Results

The test pattern is to load the data then do 24 1-hour runs of the query test. The results below include:
  • load time - the number of seconds for the load test
  • load rate - the average insert rate during the load test
  • load size - the database size in GB when the load ended
  • 2h qps - the average QPS during the 2nd 1-hour query test
  • 2h size - the database size in GB after the 2nd 1-hour query test
  • 12h qps - the average QPS during the 12th 1-hour query test
  • 12h size - the database size in GB after the 12th 1-hour query test
  • 24h qps - the average QPS during the 24th 1-hour query test
  • 24h size - the database size in GB after the 24th 1-hour query test
load    load    load    2h      2h      12h     12h     24h     24h
time    rate    size    qps     size    qps     size    qps     size    server
43820   99759   707g    22039   748g    25696   805g    25947   831g    mysql.inno.log
24261   180184  368g    32469   372g    30180   383g    29321   394g    mysql.rocks.log
251688  17368   630g    9670    633g    6762    644g    6768    656g    mongo.rocks.log
175740  24874   695g    8533    766g    8019    791g    7718    806g    mongo.wt.log

Update - I ran a test for mmapv1 with maxid1=200M rather than 1B, so it has about 20% of the data compared to the others. I didn't have enough time or disk space to use the larger value with mmap. The data is below:
load    load    load    2h      2h      12h     12h     24h     24h
time    rate    size    qps     size    qps     size    qps     size    server
276350  3165    553g    3222    555g    3084    568g    2969    573g    mongo.mmap.log

Some conclusions from the results:
  • the load rates are similar to the rates for the cached database, so IO when the database is larger than RAM doesn't have a big impact on load performance. But this might depend on fast storage as these results are for a server with PCIe flash.
  • relative to the query rates for the cached database, RocksDB suffers the least and WiredTiger suffers the most when the database is larger than RAM.
  • the size for RocksDB was similar between MySQL and MongoDB for the cached database. That is not the case here as MongoRocks uses almost 1.7X more space than MyRocks. I have yet to debug this. For MongoRocks all data is stored in one column family. For MyRocks I used one column family per index and if nothing else this makes it easy to determine how much space and IO is consumed per index. There are also more optimizations in MyRocks to remove tombstones earlier but the space difference shows up immediately during the load and there shouldn't be tombstones at that point. MongoDB also uses more indexes than MySQL (9 vs 4) for Linkbench and that might be the reason but this is still a mystery.
  • I restarted MySQL after the load test to enable the binlog prior to the query tests. InnoDB was configured to use direct IO so I lost the contents of the buffer pool during the restart. Performance suffered for the early query tests until the buffer pool warmed up and that includes the 2h test I reported above. This explains why QPS was better for the 12h and 24h results with InnoDB. Buffered IO is used for MySQL+RocksDB so less data is lost during a restart (block cache must be warmed up but OS filesystem cache is not wiped).

Comparison with the cached database results

The table below shows the ratio of the rates from this test with the rates for the cached database. The goal is to determine how much performance degrades when the database is larger than RAM. The ratio is expressed as a percentage and 50% means that the rate for the uncached database was 50% of the rate for the cached database. Note that the load rates are close to 100% for RocksDB while InnoDB degrades a bit and WiredTiger degrades more. This is expected from a write-optimized algorithm like RocksDB and b-trees like InnoDB and WiredTiger.

The results for 2h for InnoDB are less accurate because mysqld was restarted as described in the previous section. For the 24h results MySQL degrades less than MongoDB and RocksDB degrades less than b-trees (WiredTiger & InnoDB).

load    2h      24h
rate    qps     qps     server
97.6%   67.6%   63.0%   mysql.rocks.log
95.0%   42.9%   52.0%   mysql.inno.log
98.8%   66.9%   50.1%   mongo.rocks.log
88.7%   33.4%   34.8%   mongo.wiredtiger.log


And the same data in a chart...


WiredTiger improvements from MongoDB 3.0 to 3.1

There has been progress in the 3.1 branch on performance bugs I reported for the WiredTiger B-Tree so I repeated the insert benchmark for it and the improvements are great. The average insert and query rates for WiredTiger are almost 2X better in 3.1 versus 3.0. RocksDB still does better than WiredTiger for inserts but worse for queries.

Results

I tested 3 binaries: WiredTiger via MongoDB 3.0.x, RocksDB via MongoDB 3.0.x and WiredTiger via MongoDB 3.1.5. I use "3.0.x" because the build is from a special branch used for MongoRocks development.

The test is run with 10 loader threads and 1 query thread. The test HW has 40 hyperthread cores, PCIe flash storage and 144G of RAM. The database is much larger than RAM at test end. The insert rate is faster for RocksDB because it is write-optimized and benefits from not doing read-before-write during secondary index maintenance. It also benefits from a better cache hit rate because the database is smaller than for WiredTiger.

The queries are short, 10 document, range scans on a secondary index. WiredTiger does better than RocksDB because a write-optimized algorithm like RocksDB trades read performance for write performance and there is more work to be done on range scans. The read penalty is less for point queries.

The results below include:
  • irate - the average rate of inserts per second
  • Nslow - the number of write operations that take at least two seconds. This is computed from the slow operations reported in the log.
  • qrate - the average rate of queries per second
  • size - the database size in GB at test end
irate   Nslow   qrate   size    server
24187   430     136     312g    RocksDB 3.0.x
20273   2409    624     442g    WiredTiger 3.1.5
10823   3996    311     416g    WiredTiger 3.0.x
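
Nslow is computed from the slow operations reported in the log. A sketch of that count, assuming the usual mongod log format where a slow operation line ends with its duration, like "... 2345ms":

# Count operations that took at least 2 seconds, from a mongod log.
# Assumes slow-op lines end in "<N>ms"; filtering by operation type
# (insert/update/delete) is omitted from this sketch.
import re

SLOW_MS = re.compile(r"(\d+)ms\s*$")

def count_slow(path, threshold_ms=2000):
    n = 0
    with open(path) as f:
        for line in f:
            m = SLOW_MS.search(line)
            if m and int(m.group(1)) >= threshold_ms:
                n += 1
    return n

# print(count_slow("/home/mongo/log"))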

The charts display the insert and query rates for each binary.

Configuration

The hardware and configuration I used are described in a previous post. I used my fork of iibench and am glad that Tim wrote and shared the code. I used this command line to run iibench, loading 500M documents with 10 loader threads and 1 query thread:
java -cp mongo-java-driver-2.13.1.jar:src jmongoiibench iibench 10 500000000 10 -1 10 ib.tsv quicklz 16384 60000000000 4 100000 999999 SAFE localhost 27017 1 1000 3 50 Y 1 0 1

Tuesday, July 7, 2015

The impact of long-running transactions on Linkbench QPS: InnoDB vs RocksDB

I have been using Linkbench and LinkbenchX to compare performance for MySQL and MongoDB. Previous results were shared for a cached database and a small server. Here I describe the impact of a long-running transaction on QPS for InnoDB and RocksDB storage engines in MySQL. The summary is that a long-running transaction caused a big drop in QPS for InnoDB and RocksDB.

The long-running transaction was created by executing BEGIN and then a short running SELECT statement in one session. After the SELECT statement completed the session remained open while a 1-hour linkbench test was run. If you take logical backups from an active database server, then you care about the performance impact from a long-running transaction.
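
A sketch of how such a session can be created, assuming a DB-API client like PyMySQL rather than the exact client used for the test (nodetable is from the Linkbench schema; the connection details are assumptions):

# Open a transaction, run one short SELECT, then leave the session
# idle while the benchmark runs. A sketch using PyMySQL; not the
# exact client used for the test.
import time
import pymysql

conn = pymysql.connect(host="localhost", user="root", db="linkdb",
                       autocommit=False)
cur = conn.cursor()
cur.execute("BEGIN")
cur.execute("SELECT id FROM nodetable LIMIT 1")
cur.fetchall()
time.sleep(3600)   # hold the snapshot open for the 1-hour run
conn.commit()
conn.close()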

The InnoDB result surprised me as I didn't expect the drop to be that large. The test used repeatable read for InnoDB, so a long-running transaction means there is a long-open snapshot that blocks purge. The history list length grows during the test, which also means that the disk space for undo grows.

RocksDB also uses a snapshot per transaction so a long-running transaction should imply a long-open snapshot. Read the section on Gets, Iterators and Snapshots to understand the difference between snapshots and iterators in RocksDB. Both of them prevent some old data from being removed from the database, similar to what a long-open snapshot does to purge in InnoDB. But RocksDB is more clever than InnoDB about dropping old data so we don't have a pending feature request like 74919. But a long-running transaction still means that queries encounter more tombstones, which means more overhead.

Show me the graph

This shows linkbench QPS from 3 1-hour runs for MySQL+InnoDB and MySQL+RocksDB. The first and third 1-hour runs are done without a long-running transaction. The middle 1-hour run is done with a long-running transaction and QPS for InnoDB drops quickly and for RocksDB drops less quickly. Fortunately it recovers as soon as the long-running transaction is closed. The tests used the cached database configuration (maxid1=20M) as described in a previous post and were done after ~12 1-hour linkbench query tests had already been completed.

There are 8 types of statements used by linkbench. By far the most frequent is GET_LINKS_LIST which is handled by doing a short range scan on a secondary index for the MySQL implementation. That is also the statement type most impacted by the long-running transaction as the average response time increased by ~1.5X for RocksDB and ~7X for InnoDB.

LinkBenchX and MongoDB on a small server

I previously reported results for Linkbench and LinkbenchX running on high-end commodity servers.
Here I report results for much lower-end hardware. I have two Intel NUC systems with 8G of RAM, 1 disk and 1 SSD. They are small, quiet and were easy to set up. There is an active NUC community hosted by Intel with useful answers to many questions. I managed to get the systems running without asking for help from Domas. That is rare.

I ran LinkbenchX as described in the previous post with a few minor changes because this hardware is smaller. First, I used 4 load threads and 4 request threads compared to 10 load threads and 20 request threads on the larger hardware (set loaders=4 and requesters=4 in LinkConfigMongoDBV2.properties). Then I ran tests for smaller databases using maxid1=2M for the cached database and maxid1=20M for the uncached database (set in FBWorkload.properties). Finally I added one option to storage.rocksdb.configString to put SST index and bloom filter data in the RocksDB block cache, which makes it subject to the block cache limits. While this can hurt performance it also gets RocksDB to respect storage.rocksdb.cacheSizeGB. Without this option the SST index and filter data is always in memory when the SST file is open, and as the database grows this can use a lot of memory, especially on a server with 8GB of RAM. Without this option the memory consumed for index and filter data also looks like a memory leak in RocksDB until you realize what is going on (yes, I have wasted too much time on this problem). The extra option is:
block_based_table_factory={cache_index_and_filter_blocks=1}

Results

The tests here were run with the database on the single disk. The oplog was enabled for the test but sync-on-commit was disabled. Tests were done with maxid1=2M for the cached database and maxid1=20M for the uncached database. Unfortunately the database was never cached for mmapv1 because it uses so much more space for the same data compared to WiredTiger and RocksDB.

The results below include:
  • load time - the number of seconds for the load test
  • load rate - the average insert rate during the load test
  • load size - the database size in GB when the load ended
  • 2h qps - the average QPS during the 2nd 1-hour query test
  • 2h size - the database size in GB after the 2nd 1-hour query test
  • 12h qps - the average QPS during the 12th 1-hour query test
  • 12h size - the database size in GB after the 12th 1-hour query test
The QPS for RocksDB is a lot better than for WiredTiger in the cached database test. After looking at the iostat data from the test I see that WiredTiger didn't cache the database for the 12h result below. The WiredTiger database was 5G, the test server has 8G of RAM and the WiredTiger block cache gets 4G of RAM. Assuming the database compresses by 2X, the uncompressed database is 10G, the 4G block cache can store 40% of it and the OS filesystem cache gets at most 4G. From vmstat data I see that the memory.cache column grows to ~4G.
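
The cache arithmetic from that paragraph, as a sketch (the 2X compression ratio is an assumption):

# Cache coverage estimate for WiredTiger on the 8G NUC, using the
# numbers from the paragraph above. 2X compression is an assumption.
db_compressed_gb = 5.0
compression_ratio = 2.0
block_cache_gb = 4.0   # ~50% of the 8G of RAM by default

db_uncompressed_gb = db_compressed_gb * compression_ratio
pct_cached = 100.0 * block_cache_gb / db_uncompressed_gb
print("block cache holds ~%.0f%% of the uncompressed data" % pct_cached)
# ~40%. The OS filesystem cache gets at most the remaining ~4G for
# compressed blocks, so the database is not fully cached.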

Sizing a cache for InnoDB with direct IO is easy. Give it as much memory as possible and hope that the background tasks that share the HW don't use too much memory. But then InnoDB added compression and we had the problem of figuring out how to share the InnoDB buffer pool between compressed and uncompressed pages. There is some clever code in InnoDB that tries to figure this out based on whether a workload is CPU or IO bound. Well, we have the same problem with WiredTiger and RocksDB. Because they use buffered IO, the OS filesystem cache is the cache for compressed pages and the WiredTiger/RocksDB block cache is the cache for uncompressed pages. Neither WiredTiger nor RocksDB has code yet to dynamically adjust the amount of memory used for compressed versus uncompressed pages, but I am certain that it is easier to dynamically resize the block cache in them than in InnoDB.

For now RocksDB and WiredTiger default to using 50% of system RAM for the block cache. I suspect that in many cases, like when the database is larger than RAM, that it is better to use much less than 50% of system RAM for their block caches. I will save my hand waving math for another post and will leave myself a note to repeat the tests below with the cache set to use 20% of RAM.

uncached database, maxid1=20m, loaders=4, requesters=4

load    load    load    2h      2h      12h     12h
time    rate    size    qps     size    qps     size    server
16252   5421    14g     113     14g     129     14g     mongo.rocks.log
12037   7319    15g     105     16g     97      17g     mongo.wt.log
19062   4494    69g     50      68g     46      68g     mongo.mmap.log

cached database, maxid1=2m, loaders=4, requesters=4

load    load    load    2h      2h      12h     12h
time    rate    size    qps     size    qps     size    server
1629    5886    2.2g    3405    3.0g    3147    3.9g    mongo.rocks.log
12774   7530    2.5g    2966    4.1g    1996    5.0g    mongo.wt.log
2058    4659    14g     1491    14g     627     18g     mongo.mmap.log

Hardware

The Intel NUC systems have 8G of RAM, one disk and one SSD, as described above.

Monday, July 6, 2015

Linkbench for MySQL & MongoDB with a cached database

I used linkbench for MySQL and MongoDB to understand OLTP performance on an interesting workload. Here I post results for a cached database and in a few days I will have more results for a database larger than RAM. The summary is that MySQL is a lot faster than MongoDB even when using the same storage engine (RocksDB). I don't know all of the reasons for this and it is possible I made a mistake because this is the first time I have run LinkbenchX for MongoDB. There are overheads in MongoDB that don't exist in MySQL, but I have not quantified the contribution from each. The ones that are interesting for this test include:
  1. PK index isn't clustered - the PK index is clustered for RocksDB and InnoDB in MySQL. All of the columns for a row can be fetched from a PK index leaf page. The PK index on _id is not clustered in MongoDB and is from _id to DiskLoc and then WiredTiger & RocksDB use a hidden clustered index on DiskLoc to access the document. While the PK index isn't clustered for mmapv1, it doesn't have the hidden clustered index on DiskLoc as it uses DiskLoc as the address (filename & offset) for the document. I expect this to be fixed in a future release so that RocksDB and WiredTiger can use a clustered index on _id. The current behavior is a legacy from the storage engine interface used by mmapv1.
  2. PK can't be composite - the primary key for a MongoDB collection is on the _id field. If your data has a natural primary key using one field, then rename that field _id. If the natural primary key is composite then you might use _id as a surrogate key and declare a unique secondary index on the composite key. The cost for this is an extra index.
The costs above mean that linkbench uses more indexes in MongoDB than in MySQL. This makes updates and queries slower as there are more indexes to maintain and more indexes to search. For MySQL there are 3 tables and 4 indexes in linkbench: a PK and secondary index on the link table, a PK on the count table and a PK on the node table. For MongoDB there are 3 collections and 9 indexes. Each collection has a PK index on the _id field and the hidden index that maps DiskLoc to a document for RocksDB & WiredTiger. As there are 3 collections this explains the first 6 indexes. Then there are unique secondary indexes for the count and link collections where a composite key would have been used in a SQL DBMS. Finally, there is the secondary index on the link collection.

I am not sure that this workload is a great fit for MongoDB but it is a good benchmark for looking at OLTP performance and is less synthetic than the insert benchmark. The MongoDB version, linkbenchx, would benefit greatly from support for per-shard transactions. That is only available via TokuMX today. These tests were run without transactions and there might be races for transactions that need to atomically change the link and count collections.

From speaking to my local MongoDB internals guru (hi Igor), the secondary index in MongoDB contains the DiskLoc to reference the document. When the secondary index isn't covering, the hidden index on DiskLoc is searched to find the missing document. This overhead is similar to what occurs with a clustered PK index in MySQL: the PK in the secondary index is used to search the PK index to get the missing columns.

A search by _id in MongoDB is more expensive than a search on the PK in MySQL for RocksDB and WiredTiger. With the clustered PK index in MySQL all of the columns are in the leaf page of the index. For MongoDB there are two indexes to search -- first the index on _id to find a DiskLoc and then the hidden index on DiskLoc to find the document.

A search on the natural PK for the link and count tables is also more expensive in MongoDB. First the search of the unique secondary index is done to find a DiskLoc and then the hidden index on DiskLoc is searched to get the document. For MySQL there is only one search using the clustered PK index.

Configuration

I used linkbench for MySQL and linkbenchx for MongoDB. I changed bin/linkbench in linkbenchx to use the same driver as MySQL: CLASS='com.facebook.LinkBench.LinkBenchDriver'. I set maxid1=20000001 in config/FBWorkload.properties. Finally I set loaders=10, requesters=20, maxtime=3600, warmup_time=60 and requests=1000000000 in config/LinkConfigMongoDBV2.properties and config/LinkConfigMysql.properties.

With this setup the database is cached in RAM during the entire test on servers with 144G of RAM. The test pattern is to load the database and then run either 12 or 24 1-hour query runs. I report the average QPS sustained during the 2nd, 12th and optionally the 24th 1-hour runs. New data is added to the database over time so the database is larger during the 12th 1-hour run than during the 2nd. The growth rate is a function of the QPS.

I have yet to share quality-of-service metrics like p95 or p99 response time. Eventually I will do that.

I used WebScaleSQL for MySQL and the v3.0-fb MongoDB partners branch for MongoDB. The following configurations were tested:

  • mysql.rocks.nolog - MySQL with RocksDB and the binary log disabled
  • mysql.rocks.log - MySQL with RocksDB and the binary log enabled
  • mysql.inno.nolog - MySQL with InnoDB and the binary log disabled
  • mysql.inno.log - MySQL with InnoDB and the binary log enabled
  • mongo.rocks.nolog - MongoDB with RocksDB and the oplog disabled
  • mongo.rocks.log - MongoDB with RocksDB and the oplog enabled
  • mongo.wiredtiger.nolog - MongoDB with WiredTiger and the oplog disabled
  • mongo.wiredtiger.log - MongoDB with WiredTiger and the oplog enabled
  • mongo.mmap.nolog - MongoDB with mmapv1 and the oplog disabled
  • mongo.mmap.log - MongoDB with mmapv1 and the oplog enabled

I used this mongo.conf file:
processManagement:
  fork: true
systemLog:
  destination: file
  path: /data/mysql/mongo.30/log
  logAppend: true
storage:
  syncPeriodSecs: 60
  dbPath: /data/mysql/mongo.30/data
  journal:
    enabled: true
  mmapv1:
    journal:
      commitIntervalMs: 100
operationProfiling.slowOpThresholdMs: 2000
replication.oplogSizeMB: 2000

storage.wiredTiger.collectionConfig.blockCompressor: snappy
storage.wiredTiger.engineConfig.journalCompressor: none

storage.rocksdb.compression: snappy

storage.rocksdb.configString: "write_buffer_size=16m;max_write_buffer_number=4;max_background_compactions=6;max_background_flushes=3;target_file_size_base=16m;soft_rate_limit=2.9;hard_rate_limit=3;max_bytes_for_level_base=128m;stats_dump_period_sec=60;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=12;level0_stop_writes_trigger=20;max_grandparent_overlap_factor=8;max_bytes_for_level_multiplier=8"

And I used a my.cnf file that is too long to paste here. This is the schema for RocksDB and InnoDB. All InnoDB tables use 32 hash partitions to reduce contention on the per-index mutex. All tables are compressed except for the node table with InnoDB because the average compression rate for rows in the node table is less than 2X. Four column families are used for the RocksDB tables, one for each index. All collections are in one column family for MongoDB+RocksDB. The my.cnf settings for RocksDB are here.

Compression varied by configuration.
  • MongoDB+mmapv1 - no compression
  • MongoDB+WiredTiger - snappy
  • MySQL+InnoDB - zlib
  • MongoDB+RocksDB - no compression for levels 0, 1, snappy for level 2, zlib for others
  • MySQL+RocksDB - no compression for levels 0, 1 and zlib for others

Load performance

This lists the average insert rate sustained during the load. With maxid1=20,000,001 there are 20M rows/documents in the node table/collection, ~90M rows/documents in the link table/collection and ~15M rows/documents in the count table/collection. A few conclusions:
  • the insert rate is much better for MySQL than MongoDB even when the same engine, RocksDB, is used for both. MongoDB does more work during the load as it maintains 9 indexes versus 4 for MySQL.
  • enabling the binlog for MySQL+RocksDB hurts performance much more than for MySQL+InnoDB
  • enabling the oplog for MongoDB doesn't hurt performance as much as it does on the insert benchmark
  • for MongoDB the load rate was faster for WiredTiger than RocksDB because the database is cached. This changes when the database is larger than RAM.
 

Query performance

The average QPS was similar for MySQL+RocksDB and MySQL+InnoDB. They were both much faster than MongoDB. For MongoDB the best QPS is from WiredTiger. The MongoDB results match what I have gotten from other benchmarks and I have some faith these are accurate.

All results

This lists all results per configuration. The 24 1-hour query tests were only done for a few configurations. The results include:
  • load time - the number of seconds for the load test
  • load rate - the average insert rate during the load test
  • load size - the database size in GB when the load ended
  • 2h qps - the average QPS during the 2nd 1-hour query test
  • 2h size - the database size in GB after the 2nd 1-hour query test
  • 12h qps - the average QPS during the 12th 1-hour query test
  • 12h size - the database size in GB after the 12th 1-hour query test
  • 24h qps - the average QPS during the 24th 1-hour query test
  • 24h size - the database size in GB after the 24th 1-hour query test
A few conclusions:
  • The database size grows more for MySQL+InnoDB than for MySQL+RocksDB. Fragmentation from the B-Tree is larger than space-amplification of ~1.1X from leveled compaction. It is difficult to compare growth for configurations that don't sustain the same QPS because new data is added as a function of the QPS.
  • The database size grows more for Mongo+WiredTiger than for Mongo+RocksDB. While WiredTiger sustains a higher QPS I think part of the problem is that B-Tree fragmentation is worse than space amplification from leveled compaction.
  • I am interested in comparing database growth between MySQL+InnoDB and Mongo+WiredTiger. Both implement a B-Tree, but InnoDB is update-in-place and can suffer from internal fragmentation while WiredTiger is copy-on-write and can suffer from external fragmentation. I don't know whether there are counters to explain this in WiredTiger.


load    load    load    2h      2h      12h     12h     24h     24h
time    rate    size    qps     size    qps     size    qps     size    server
478     184496  14g     50153   17g     48671   25g     48258   34g     mysql.rocks.nolog
602     146462  14g     47981   16g     47250   23g     46537   34g     mysql.rocks.log
839     105008  15g     51574   22g     50229   37g     50058   56g     mysql.inno.nolog
851     103489  15g     51277   22g     49780   37g     49805   56g     mysql.inno.log
4156    21200   13g     15149   17g     15271   23g                     mongo.rocks.nolog
5015    17565   14g     14442   17g     13925   24g     13506   29g     mongo.rocks.log
3020    29174   14g     27940   32g     23043   45g     22141   53g     mongo.wt.nolog
3601    28020   16g     25488   34g     22992   45g                     mongo.wt.log
25563   3446    63g     9774    70g     8703    82g                     mongo.mmap.nolog
27000   3263    66g     9414    70g     8471    84g                     mongo.mmap.log

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...