Cached database
The first test is with a cached database. The test pattern is to load the database and then run twelve 1-hour query tests in a loop; the database always fit in RAM. At the end of the 12th hour the database size was 40 GB for WiredTiger, 22 GB for RocksDB, 30 GB for TokuMX and 78 GB for mmapv1. I used Snappy compression for WiredTiger and RocksDB, and QuickLZ for TokuMX. The graph below has the average QPS per 1-hour interval.
This is the data for the graph:
wiredtiger,18279,17715,16740,16585,16472,15924,15703,15632,15783,15401,15872,15654
rocksdb,10892,9649,9580,9639,9860,9981,9316,9535,9578,9682,9437,9689
tokumx,11078,6881,5832,5132,5864,5434,5495,5340,5168,5505,4763,4924
mmapv1,5066,4918,4821,4758,4629,4666,4589,4613,4663,4626,4563,4642
This table has the average response time per operation type, in milliseconds:
                add_link        update_link     get_links_list
wiredtiger      1.361           1.422           0.768
rocksdb         1.702           1.789           1.460
tokumx          1.538           1.674           3.929
mmapv1          4.788           5.230           2.657
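The per-hour rows above can be summarized with a short script. This is a sketch that computes the hour-1 to hour-12 QPS decline for each engine, with the numbers copied from the data rows above:

```python
# Compute the QPS decline from the 1st to the 12th hour for each engine,
# using the CSV rows from the cached-database test.
rows = """\
wiredtiger,18279,17715,16740,16585,16472,15924,15703,15632,15783,15401,15872,15654
rocksdb,10892,9649,9580,9639,9860,9981,9316,9535,9578,9682,9437,9689
tokumx,11078,6881,5832,5132,5864,5434,5495,5340,5168,5505,4763,4924
mmapv1,5066,4918,4821,4758,4629,4666,4589,4613,4663,4626,4563,4642
""".splitlines()

for row in rows:
    engine, *qps = row.split(",")
    qps = [int(x) for x in qps]
    decline = 100.0 * (qps[0] - qps[-1]) / qps[0]
    print(f"{engine:12s} hour1={qps[0]:6d} hour12={qps[-1]:6d} decline={decline:.1f}%")
```

The decline is largest for TokuMX (more than half of the hour-1 rate) and smallest for mmapv1, which starts from the lowest rate.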
Database larger than RAM
The test was repeated with a database that does not fit in RAM. The test was not run for mmapv1 because I didn't have enough disk space, or the patience to wait for the load to finish. At the end of the 12th hour the database size was 728 GB for WiredTiger, 632 GB for RocksDB and 588 GB for TokuMX. It is interesting that the TokuMX database was smaller than RocksDB here but larger than RocksDB in the cached test. The graph below has the average QPS per 1-hour interval.
This is the data for the graph:
tokumx,439,580,622,625,638,617,598,613,631,609,610,611
rocksdb,387,448,479,468,468,477,471,483,475,473,471,477
wiredtiger,297,343,345,333,333,331,320,335,326,339,324,333
This table has the average response time per operation type, in milliseconds:
                add_link        update_link     get_links_list
tokumx          23.499          25.903          22.987
rocksdb         21.704          23.883          25.835
wiredtiger      47.557          51.122          35.648
TokuMX is the most IO efficient for this workload based on the data below. Disk IO is the bottleneck here, which explains why TokuMX sustains the highest QPS. I used data from iostat (r/s, w/s, rKB/s and wKB/s) and divided those rates by the average QPS, with all data taken from the 12th 1-hour run. I assume that disk reads done by queries dominate reads done for compaction. TokuMX does less IO per query than RocksDB and WiredTiger, and both TokuMX and RocksDB write much less data per query than WiredTiger.
                read/query      read-KB/query   write-KB/query
tokumx          1.612           14.588          2.495
rocksdb         2.135           20.234          2.512
wiredtiger      2.087           26.675          12.110
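The derivation of the table above is simple enough to sketch in code: divide the iostat rates by the average QPS from the same interval. The iostat numbers in this sketch are hypothetical inputs, not the measured values behind the table:

```python
# Per-query IO efficiency: divide iostat rates by the average QPS
# measured over the same interval (here, the 12th 1-hour run).
def io_per_query(r_s, rkb_s, wkb_s, qps):
    """Return (reads/query, read-KB/query, write-KB/query)."""
    return (r_s / qps, rkb_s / qps, wkb_s / qps)

# Hypothetical iostat averages for one engine:
reads, rkb, wkb = io_per_query(r_s=1000.0, rkb_s=9000.0, wkb_s=1500.0, qps=600)
print(f"read/query={reads:.3f} read-KB/query={rkb:.3f} write-KB/query={wkb:.3f}")
```

Note that w/s is less interesting than wKB/s for write-optimized engines because they do large sequential writes, so the table reports write-KB/query rather than writes/query.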
Configuration
This section has a few more details on the MongoDB configuration I used. The oplog was enabled and the block cache was ~70 GB for all engines. This is the configuration file and startup script for TokuMX:
dbpath = /home/mongo/data
logpath = /home/mongo/log
logappend = true
fork = true
slowms = 2000
oplogSize = 2000
expireOplogHours = 2
numactl --interleave=all \
bin/mongod \
    --config $PWD/mongo.conf \
    --setParameter="defaultCompression=quicklz" \
    --setParameter="defaultFanout=128" \
    --setParameter="defaultReadPageSize=16384" \
    --setParameter="fastUpdates=true" \
    --cacheSize=$1 \
    --replSet foobar \
    --checkpointPeriod=900
And this is the configuration file for other engines:
processManagement:
  fork: true
systemLog:
  destination: file
  path: /home/mongo/log
  logAppend: true
operationProfiling:
  slowOpThresholdMs: 2000
replication:
  oplogSizeMB: 2000
storage:
  syncPeriodSecs: 60
  dbPath: /home/mongo/data
  journal:
    enabled: true
  mmapv1:
    journal:
      commitIntervalMs: 100
  wiredTiger:
    collectionConfig:
      blockCompressor: snappy
    engineConfig:
      journalCompressor: none
  rocksdb:
    compression: snappy
    configString: "write_buffer_size=16m;max_write_buffer_number=4;max_background_compactions=6;max_background_flushes=3;target_file_size_base=16m;soft_rate_limit=2.9;hard_rate_limit=3;max_bytes_for_level_base=128m;stats_dump_period_sec=60;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=12;level0_stop_writes_trigger=20;max_grandparent_overlap_factor=8;max_bytes_for_level_multiplier=8"
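For reference, this is a sketch of how mongod might be started for the other engines, assuming the configuration above is saved as mongo.conf. The --storageEngine value (wiredTiger, rocksdb or mmapv1 in MongoDB 3.0 builds) selects the engine; the exact flags here are an assumption, not a copy of the command I used:

```shell
numactl --interleave=all \
bin/mongod \
    --config $PWD/mongo.conf \
    --storageEngine wiredTiger \
    --replSet foobar
```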