- WiredTiger has the best range query performance followed closely by RocksDB. While the LSM used by RocksDB can suffer a penalty on range reads, that wasn't visible for this workload.
- mmapv1 did OK on range query performance when there were no concurrent inserts. Adding concurrent inserts reduces range query QPS by 4X. Holding exclusive locks while doing disk reads for index maintenance is a problem; see SERVER-13225.
- TokuMX 2.0.1 has a CPU bottleneck that hurts range query performance. I await advice from TokuMX experts to determine whether tuning can fix this; the configuration I have been using was provided by them.
Test details
More details on the mongod configuration are in the previous post. The tests here were run after the database was loaded with 2B documents that used a 256 byte pad field. After the initial load the database was ~620G, ~630G, ~1400G and ~500G for TokuMX, WiredTiger, mmapv1 and RocksDB. I used my fork of iibench for MongoDB (thanks Tim) to support more than 1 query thread. My fork has an important diff to avoid reusing the same RNG seed between tests. The range query fetches at most 4 documents via the QUERY_LIMIT=4 command line option (a sketch of the query is after the list below). Tests were run in this sequence:
- 10i.1q - 10 insert threads, 1 query thread. Run to insert 10M documents for mmapv1 and 100M documents for the other engines
- 1i.10q - 1 insert thread, 10 query threads. The insert thread was rate limited to 1000 documents/second. Run for at least 1 hour.
- 0i.10q - 0 insert threads, 10 query threads. Run for at least 1 hour.
- 10i.0q - 10 insert threads, 0 query threads. Run to insert at least 10M documents.
- 1i.10q - this is the same as the previous 1i.10q test except I increased the value of internalQueryCacheWriteOpsBetweenFlush to avoid the IO overhead from re-planning queries too frequently, as explained in a previous post. Alas, this cannot be done for TokuMX 2.0.1 because that option doesn't exist in the older version of MongoDB it uses.
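Below is a minimal sketch of the range query run by the query threads, using pymongo. The database, collection and field names are assumptions rather than the exact iibench schema; the point is the secondary-index range scan limited to 4 documents, matching QUERY_LIMIT=4.

```python
# Sketch of the iibench-style range query run by each query thread.
# Collection and field names are illustrative, not the exact iibench schema.
import random
from pymongo import MongoClient, ASCENDING

client = MongoClient("localhost", 27017)
coll = client["iibench"]["purchases_index"]  # assumed collection name

def range_query(limit=4):
    # Fetch at most `limit` documents starting from a random point in a
    # secondary index (QUERY_LIMIT=4 in the benchmark).
    start = random.randint(0, 2000000000)
    return list(coll.find({"price": {"$gte": start}})
                    .sort("price", ASCENDING)
                    .limit(limit))
```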
Results
This is the QPS from the 10 query threads with and without a concurrent insert thread. The result for 1i.10q is from the run where I increased the value of internalQueryCacheWriteOpsBetweenFlush to get better performance, although the overhead from frequent re-planning is much more significant with a disk array than with PCIe flash (a sketch of setting the parameter is after the list below). There are a few things to note here:
- WiredTiger and RocksDB have similar QPS
- QPS for mmapv1 is much worse with a concurrent insert thread (locks are held when disk reads are done for index maintenance, not yielded)
- QPS for TokuMX is low. There is a CPU bottleneck. More details at the end of this post.
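If the parameter is runtime-settable (an assumption about the MongoDB versions used here), it can be raised with the setParameter admin command. A minimal sketch with pymongo, using an arbitrary value:

```python
# Sketch: raise internalQueryCacheWriteOpsBetweenFlush so queries are
# re-planned less often. The value is arbitrary and this assumes the
# parameter is runtime-settable in the tested MongoDB version (it does
# not exist in the older MongoDB used by TokuMX 2.0.1).
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
client.admin.command("setParameter", 1,
                     internalQueryCacheWriteOpsBetweenFlush=100000)

# The same parameter can also be passed at mongod startup, e.g.:
#   mongod --setParameter internalQueryCacheWriteOpsBetweenFlush=100000 ...
```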
The tables that follow have results for each of the tests (a sketch of how the average and p95 rates are computed is after this list). The results include:
- ipsAvg - average rate for inserts/second
- ipsP95 - rate at 95th percentile for inserts/second using 10-second intervals
- qpsAvg - average rate for range queries/second
- qpsP95 - rate at 95th percentile for queries/second using 10-second intervals
- dbSize - database size in GB at test end
- rssSize - RSS for mongod at test end
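Roughly, the per-interval rates are reduced like this. This is a sketch, not the exact scripts used for the post; it assumes p95 is the rate that 95% of the 10-second intervals meet or exceed, which is why it is below the average in the tables.

```python
# Reduce per-10-second operation counts to the *Avg and *P95 columns.
# p95 here is the rate that 95% of the intervals meet or exceed.
def avg_and_p95(counts, interval_secs=10):
    rates = sorted(c / float(interval_secs) for c in counts)
    avg = sum(rates) / len(rates)
    p95 = rates[int(0.05 * (len(rates) - 1))]  # 5th percentile from the bottom
    return avg, p95
```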
10i.1q
These are results for the 10i.1q test. WiredTiger has significant variance for the insert and query rates. Bugs have been opened for this problem and progress has been made but I think some bugs are still open. The query rate for mmapv1 is very low as described earlier in this blog post.
engine | ipsAvg | ipsP95 | qpsAvg | qpsP95 | dbSize | rssSize
tokumx | 40775 | 36001 | 119.5 | 101.6 | 622gb | 36gb
wiredtiger | 6175 | 1488 | 50 | 14.8 | 632gb | 37gb
mmapv1 | 1257 | 1094 | 1.15 | 0.7 | 14XXgb | 70gb
rocksdb | 29424 | 25931 | 45.1 | 30 | 507gb | 54gb
1i.10q, first
These are results for the first 1i.10q test. The insert thread is rate limited to 1000 documents/second. WiredTiger and RocksDB sustain higher query rates. TokuMX (because of the CPU bottleneck) and mmapv1 (because of the larger database and RW lock) sustain lower query rates.
engine | ipsAvg | ipsP95 | qpsAvg | qpsP95 | dbSize | rssSize
tokumx | 981 | 900 | 726 | 696 | 553gb | 39gb
wiredtiger | 980 | 908 | 7414 | 6505 | 632gb | 32gb
mmapv1 | 951 | 874 | 766 | 650 | 14XXgb | 66gb
rocksdb | 981 | 969 | 6720 | 6189 | 501gb | 53gb
0i.10q
These are results for the 0i.10q test. WiredTiger and RocksDB get about 10% more QPS compared to the 1i.10q result. The QPS for mmapv1 improves by ~4X because it isn't slowed by the RW lock from the insert threads. TokuMX continues to suffer from the CPU bottleneck.
engine | ipsAvg | ipsP95 | qpsAvg | qpsP95 | dbSize | rssSize
tokumx | 0 | 0 | 850 | 819 | 552gb | 39gb
wiredtiger | 0 | 0 | 8750 | 8568 | 632gb | 31gb
mmapv1 | 0 | 0 | 3603 | 3362 | 14XXgb | 62gb
rocksdb | 0 | 0 | 7723 | 7667 | 501gb | 52gb
I was too lazy to copy this into a table. This has results from iostat and vmstat, both absolute and normalized by the operation (query or insert) rate; a sketch of the normalization is after the data. The values are:
- qps, ips - operation rate (queries or inserts)
- r/s - average rate for r/s (storage reads/second)
- rmb/s - average rate for read MB/second
- wmb/s - average rate for write MB/second
- r/o - storage reads per operation
- rkb/o - KB of storage reads per operation
- wkb/o - KB of storage writes per operation
- cs - average context switch rate
- us - average user CPU utilization
- sy - average system CPU utilization
- us+sy - sum of us and sy
- cs/o - context switches per operation
- (us+sy)/o - CPU utilization divided by operation rate
qps r/s rmb/s wmb/s r/o rkb/o wkb/o
toku 850 2122 32.0 1.4 2.50 38.6 1.7
wired 8750 4756 47.9 0.6 0.54 5.6 0.1
mmap 3603 38748 577.2 0.6 10.75 164.0 0.2
rocks 7723 5755 47.2 0.5 0.75 6.3 0.1
qps cs us sy us+sy cs/o (us+sy)/o
toku 850 18644 67.1 5.1 72.2 22 0.085000
wired 8750 92629 24.3 3.0 27.3 11 0.003121
mmap 3603 255160 6.8 4.9 11.8 71 0.003264
rocks 7723 75393 21.2 2.8 24.0 10 0.003104
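The normalized columns are just the iostat and vmstat rates divided by the operation rate. A small sketch, with the WiredTiger 0i.10q row plugged in as a check:

```python
# Divide iostat/vmstat rates by the operation rate to get per-operation
# costs. The example values are the WiredTiger 0i.10q row from above.
def normalize(ops, r_s, rmb_s, wmb_s, cs, us, sy):
    ops = float(ops)
    return {
        "r/o":       r_s / ops,
        "rkb/o":     rmb_s * 1024 / ops,
        "wkb/o":     wmb_s * 1024 / ops,
        "cs/o":      cs / ops,
        "(us+sy)/o": (us + sy) / ops,
    }

print(normalize(ops=8750, r_s=4756, rmb_s=47.9, wmb_s=0.6,
                cs=92629, us=24.3, sy=3.0))
# -> r/o ~0.54, rkb/o ~5.6, wkb/o ~0.07, cs/o ~10.6, (us+sy)/o ~0.0031
```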
10i.0q
These are results for the 10i.0q test. The insert rate improvement versus 10i.1q is larger for TokuMX than for RocksDB. I assume that TokuMX suffers more than RocksDB from concurrent readers. WiredTiger still has too much variance in the insert rate.
engine | ipsAvg | ipsP95 | qpsAvg | qpsP95 | dbSize | rssSize
tokumx | 43667 | 39152 | 0 | 0 | 653gb | 36gb
wiredtiger | 5357 | 584 | 0 | 0 | 658gb | 40gb
mmapv1 | 1254 | 1163 | 0 | 0 | 14XXgb | 62gb
rocksdb | 29842 | 25762 | 0 | 0 | 533gb | 54gb
The CPU utilization and storage IO per insert are lower for RocksDB and TokuMX. This is expected given they are write-optimized and benefit from not having to read secondary index pages during index maintenance. mmapv1 suffers from doing more disk reads and using more CPU.
ips r/s rmb/s wmb/s r/o rkb/o wkb/o
toku 43667 587 23.7 96.1 0.01 0.6 2.3
wt 5357 9794 109.9 187.2 1.83 21.0 35.8
mmap 1254 10550 235.7 31.7 8.41 192.5 25.9
rocks 29842 1308 31.1 281.9 0.04 1.1 9.7
ips cs us sy us+sy cs/i (us+sy)/i
toku 43667 189501 31.7 4.6 36.3 4 0.000830
wt 5357 105640 16.3 4.3 20.6 20 0.003840
mmap 1254 93985 1.3 2.0 3.3 75 0.002665
rocks 29842 189706 22.3 3.6 25.9 6 0.000867
1i.10q, second
These are results for the second 1i.10q test. The QPS rates for RocksDB, WiredTiger and mmapv1 are better than in the first 1i.10q test because mongod was changed to cache query plans for a longer time.
engine | ipsAvg | ipsP95 | qpsAvg | qpsP95 | dbSize | rssSize
tokumx | 987 | 907 | 858 | 829 | 527gb | 37gb
wiredtiger | 977 | 899 | 7830 | 6821 | 658gb | 32gb
mmapv1 | 949 | 864 | 869 | 727 | 14XXgb | 56gb
rocksdb | 991 | 937 | 7018 | 6669 | 524gb | 53gb
TokuMX CPU bottleneck, test 1
I want to understand why TokuMX uses more than 25X the CPU per query compared to the other engines. The high CPU load comes from eviction (flushing dirty data so block cache memory can be used for disk reads), but it isn't clear to me why eviction should be so much more expensive in TokuMX than in WiredTiger, which also does eviction. RocksDB doesn't show this overhead, but it does compaction instead of eviction.
I reloaded the TokuMX database with 2B documents, let it sit idle for 1 day and then started the 0i.10q test. The top-10 functions from a flat profile with Linux perf are listed below. I then ran PMP, and the stack traces show that eviction is in progress, which explains why QuickLZ compression uses so much CPU: data must be compressed during eviction. The stack traces also show that the query threads are stalled waiting for eviction to flush dirty pages before they can use memory for data read from storage. This is a common place for a database engine to stall (InnoDB and WiredTiger have stalled there, and maybe they still do), but the stalls are worse in TokuMX. A sketch of the PMP approach is after the profile.
I will let the 0i.10q test run until writes stop and see if the query rate increases at that point.
45.46% mongod libtokufractaltree.so [.] qlz_compress_core
19.10% mongod libtokufractaltree.so [.] toku_decompress
7.19% mongod libc-2.12.so [.] memcpy
1.71% mongod libtokufractaltree.so [.] toku_x1764_memory
1.43% mongod [kernel.kallsyms] [k] clear_page_c_e
1.28% mongod [kernel.kallsyms] [k] copy_user_enhanced_fast_string
1.26% mongod mongod [.] free
1.11% mongod libtokufractaltree.so [.] toku_ftnode_pe_callback
0.96% mongod libtokufractaltree.so [.] deserialize_ftnode_partition
0.83% mongod mongod [.] malloc
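For reference, the PMP stack traces mentioned above come from the poor man's profiler approach: grab all thread stacks with gdb a few times and count the most common ones. A rough approximation in Python (the real PMP is a small shell script; the sample count and frame parsing here are simplified):

```python
# Rough PMP-style sampler: collect gdb thread stacks a few times and count
# the most common call stacks. Frame parsing is simplified.
import subprocess
from collections import Counter

def sample_stacks(pid, samples=5):
    counts = Counter()
    for _ in range(samples):
        out = subprocess.run(
            ["gdb", "-batch", "-p", str(pid), "-ex", "thread apply all bt"],
            capture_output=True, text=True).stdout
        stack = []
        for line in out.splitlines():
            if line.startswith("Thread "):
                if stack:
                    counts[tuple(stack)] += 1
                stack = []
            elif line.startswith("#"):
                # keep only the function name from each frame
                name = line.split(" in ")[-1].split(" (")[0]
                stack.append(name)
        if stack:
            counts[tuple(stack)] += 1
    return counts

# Example: print the 10 most common stacks for a running mongod
# for stack, n in sample_stacks(mongod_pid).most_common(10):
#     print(n, " <- ".join(stack))
```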
Update 1
I ran the 0i.10q test overnight and after 20 hours QPS has increased from 1074 to 1343. The CPU bottleneck from the partial eviction code remains. Partial eviction and waits for it are still frequent per PMP stack traces. The top-10 functions by CPU used are:
36.41% mongod libtokufractaltree.so [.] toku_decompress
12.70% mongod libc-2.12.so [.] memcpy
5.58% mongod libtokufractaltree.so [.] qlz_compress_core
3.40% mongod [kernel.kallsyms] [k] clear_page_c_e
3.11% mongod libtokufractaltree.so [.] toku_x1764_memory
2.72% mongod [kernel.kallsyms] [k] copy_user_enhanced_fast_string
1.41% mongod libtokufractaltree.so [.] bn_data::deserialize_from_rbuf
1.40% mongod libtokufractaltree.so [.] toku_ftnode_pe_callback
1.28% mongod mongod [.] free
0.87% mongod mongod [.] arena_chunk_dirty_remove