The insert benchmark workload uses one or more threads to insert documents. There are optionally 1, 2 or 3 secondary indexes, and each insert must add a new entry to every secondary index. For a b-tree the disk page with the leaf node from each secondary index must be read before it can be written (call this read-before-write). I assume that non-leaf nodes of the index are in the database or OS cache. When the database is large, leaf nodes might not be cached and a disk read is done as part of read-before-write.
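For readers who haven't used the insert benchmark, here is a minimal sketch of what one client thread does, written with PyMongo. The collection name, field names, index definitions and batch size are illustrative choices, not the exact iibench settings.

# Minimal sketch of one insert-benchmark client thread. The collection name,
# field names, index definitions and batch size are illustrative, not the
# exact iibench settings.
import random
import pymongo

client = pymongo.MongoClient("localhost", 27017)
coll = client.test.purchases_index

# The benchmark is run with 0, 1, 2 or 3 of these secondary indexes.
coll.create_index([("price", pymongo.ASCENDING), ("customerid", pymongo.ASCENDING)])
coll.create_index([("cashregisterid", pymongo.ASCENDING), ("price", pymongo.ASCENDING)])
coll.create_index([("price", pymongo.ASCENDING), ("dateandtime", pymongo.ASCENDING)])

def insert_batch(start_id, batch_size=1000, pad_bytes=256):
    # Each inserted document also requires one new entry per secondary index.
    docs = []
    for i in range(batch_size):
        docs.append({
            "_id": start_id + i,                        # increasing primary key
            "dateandtime": random.random(),
            "cashregisterid": random.randint(0, 1000),
            "customerid": random.randint(0, 100000),
            "price": random.random() * 500.0,
            "pad": "x" * pad_bytes,                     # 256 byte, 10kb or 1mb pad field
        })
    coll.insert_many(docs)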
Non-unique secondary indexes with a write-optimized database don't, or shouldn't, require read-before-write. With RocksDB a Put is done to the memtable to add the new index entry. Were this an update there might be a delete of the old index key, which is implemented as a Put of a tombstone into the memtable, and again no read is required. Were this a unique secondary index then the database must do a read to confirm that the unique constraint is not violated.
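To make the difference concrete, here is a toy model of secondary index maintenance at the key-value level. The dicts stand in for the memtable and the b-tree leaf pages; this is a simplification for illustration, not actual MongoRocks or WiredTiger code.

# Toy model of secondary index maintenance, not actual engine code.
# "memtable" stands in for the LSM write buffer and "pages" for b-tree leaf pages.

def lsm_add_index_entry(memtable, index_key, primary_key):
    # Write-optimized engine, non-unique index: a blind Put, no read needed.
    memtable[(index_key, primary_key)] = b""

def lsm_delete_index_entry(memtable, index_key, primary_key):
    # An update that changes the indexed field removes the old entry by
    # writing a tombstone (None here), again without a read.
    memtable[(index_key, primary_key)] = None

def lsm_add_unique_index_entry(db, memtable, index_key, primary_key):
    # Unique secondary index: a read is required to check the constraint,
    # so read-before-write returns. The unique key maps to the primary key.
    if index_key in memtable or index_key in db:
        raise ValueError("duplicate key")
    memtable[index_key] = primary_key

def btree_add_index_entry(pages, index_key, primary_key):
    # B-tree: the leaf page that owns index_key must be read, possibly from
    # disk when it is not cached, before the new entry can be added to it.
    page_id = hash(index_key) % 1024          # toy mapping of key to leaf page
    leaf = pages.setdefault(page_id, [])      # read-before-write
    leaf.append((index_key, primary_key))     # modify the page in memory
    # The dirty page is written back to storage later by checkpoint/eviction.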
The key conclusions are:
- A write-optimized database can make secondary index maintenance much faster whether using slow (disk) or fast (PCIe flash) storage. Results for disk are in progress. Today I share results for PCIe flash.
- There isn't a penalty for RocksDB with large documents. While RocksDB and TokuMX were faster than WiredTiger and mmapv1 for small documents, they were not faster for 10kb and 1mb documents. There are two reasons for this. First, a b-tree might not suffer from higher write-amplification for large documents when the document is a multiple of the disk page size. Second, the secondary index data was small enough to remain in cache during the tests with 10kb and 1mb documents. While read-before-write was still done, the reads were done from cache.
Results
This is the performance data for all tests. The test server has PCIe flash, 144G of RAM and 40 hyperthread cores. The mongod process was configured to use a 32G block cache for RocksDB, WiredTiger and TokuMX. For RocksDB, WiredTiger and mmapv1 I used MongoDB 3.0.X from a special branch. For TokuMX I used 2.0.1 which is based on MongoDB 2.4. The oplog was enabled for all servers. The oplog was 2G for RocksDB, WiredTiger and mmapv1. Oplog retention is time-based for TokuMX and was set to 2 hours. More configuration options are listed at the end of this post. Snappy compression was used for RocksDB and WiredTiger while QuickLZ was used for TokuMX.

The test with a 256 byte pad field inserted 2B documents. The test with a 10kb pad field inserted 100M documents. The test with a 1MB pad field inserted 1M documents. All tests used 10 client threads.
This shows the insert rate by number of secondary indexes for 2B documents with a 256 byte pad field. WiredTiger did great with 0 secondary indexes, but it and mmapv1 suffer once reads from storage are done during secondary index maintenance. TokuMX and RocksDB lose much less performance with more indexes. The insert rate at 3 secondary indexes is 77% and 67% of the rate at 0 secondary indexes for TokuMX and RocksDB. The rate at 3 indexes is 8% and 7% of the rate at 0 indexes for mmapv1 and WiredTiger. The iostat section at the end of the blog has more details on the performance overhead of secondary index maintenance. With 3 secondary indexes and WiredTiger or mmapv1 the number of disk reads per insert is more than 100X the rate for TokuMX and RocksDB.
I repeated the test for RocksDB and TokuMX with the oplog disabled. The insert rate increased from 45651 to 49322 for TokuMX and from 29913 to 39953 for RocksDB. I suspect that TokuMX loses less performance from the oplog because they partitioned it.
This is the result for 100M documents with a 10kb pad field. While TokuMX and RocksDB lose the least performance as indexes are added, the results for all engines are closer than for the 256 byte documents. The secondary index data remains in cache for this test, so WiredTiger and mmapv1 suffer less from read-before-write.
This is the result for 1M documents with a 1mb pad field. Results were close for all engines.
Performance Data
The collection used in this test always has a primary key index and tests were done with 0, 1, 2 and 3 secondary indexes. For each test there are 4 numbers below: ipsAvg-ipsP95-dbSize-rssSize (a small helper to unpack one cell is shown after the list). For some database sizes I don't have resolution for all digits (see 13XX below). The four values are:
- ipsAvg - average rate of inserts/second during the test
- ipsP95 - 95th percentile rate for inserts/second from 10-second samples
- dbSize - database size in GB at test end
- rssSize - mongod process size in GB at test end
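For convenience, a minimal helper to unpack one result cell; the function name is mine and the "XX" digits that I could not resolve are kept as text.

# Unpack one result cell like "59499-56576-433g-25g" into its four values.
def parse_cell(cell):
    ips_avg, ips_p95, db_size, rss_size = cell.split("-")
    return {
        "ipsAvg": float(ips_avg),   # average inserts/second
        "ipsP95": float(ips_p95),   # 95th percentile inserts/second
        "dbSize": db_size,          # database size at test end, e.g. "433g" or "11XXg"
        "rssSize": rss_size,        # mongod RSS at test end, e.g. "25g"
    }

print(parse_cell("59499-56576-433g-25g"))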
2B documents with 256 byte pad field
        PK-only                   PK, 1 secondary           PK, 2 secondaries         PK, 3 secondaries
tokumx  59499-56576-433g-25g      54530-50836-483g-34g      49092-44445-529g-34g      45641-40115-608g-34g
rocks   44136-33342-363g-42g      37587-25271-394g-45g      33330-23267-435g-50g      29913-21191-478g-53g
wt      100484-85026-383g-6.5g    24448-59.0-461g-39g       11776-68.9-534g-36g       6604-997-607g-36g
mmap    30063-27652-1023g-130g    13684-11689-11XXg-133g    5709-2617-13XXg-79g       2393-1188-14XXg-72g
100M documents with 10kb pad field
        PK-only                   PK, 1 secondary           PK, 2 secondaries         PK, 3 secondaries
tokumx  6897-4509-986g-33g        6960-5338-14XXg-35g       7053-5536-14XXg-35g       7043-5488-14XXg-35g
rocks   8489-6933-515g-4.3g       7976-73-517g-5.3g         7813-6184-518g-5.4g       7475-5786-521g-5.6g
wt      9087-7444-640g-5.2g       5159-1137-654g-18g        5619-714-651g-27g         4204-546-663g-34g
mmap    7434-4307-16XXg-106g      6039-3466-16XXg-98g       5046-2853-16XXg-111g      4299-2711-16XXg-111g
1M documents with 1mb pad field
        PK-only                   PK, 1 secondary           PK, 2 secondaries         PK, 3 secondaries
tokumx  100.1-70.1-864g-34g       100.9-72.8-1022g-33g      100.5-69.8-1019g-33g      99.7-67.9-1019g-33g
rocks   100.0-72.2-514g-34g       94.1-68.0-517g-34g        95.8-69.9-517g-34g        94.9-69.1-518g-34g
wt      106.7-101.3-518g-2.7g     107.1-101.7-518g-4.2g     107.7-101.5-519g-2.9g     108.0-101.7-520g-3.0g
mmap    79.2-32.2-20XXg-117g      93.9-49.2-20XXg-117g      95.2-57.0-20XXg-118g      95.2-57.7-20XXg-117g
iostat
I examined iostat data to explain the differences in insert rates for the test with 2B documents and a 256 byte pad field. There are four lines for each engine with data for 0, 1, 2 and 3 secondary indexes. Note that with 3 secondary indexes the b-tree engines (WiredTiger, mmapv1) write more to storage per insert and read a lot more per insert. The per-insert values are the iostat averages divided by the average insert rate (a sketch of that arithmetic follows the list). The output is:
- r/s - average rate for storage reads/second
- rmb/s - average rate for read MB/second
- wmb/s - average rate for write MB/second
- r/i - storage reads per insert
- rkb/i - KB of storage reads per insert
- wkb/i - KB of storage writes per insert
- #indexes - number of secondary indexes
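Here is a minimal sketch of that per-insert arithmetic, assuming iostat MB is converted to KB with a factor of 1024; it reproduces the TokuMX row with 0 secondary indexes to within rounding.

# Derive the per-insert columns from iostat averages and the insert rate.
def per_insert_metrics(r_per_sec, rmb_per_sec, wmb_per_sec, inserts_per_sec):
    return {
        "r/i":   r_per_sec / inserts_per_sec,              # storage reads per insert
        "rkb/i": rmb_per_sec * 1024.0 / inserts_per_sec,   # KB read per insert
        "wkb/i": wmb_per_sec * 1024.0 / inserts_per_sec,   # KB written per insert
    }

# TokuMX with 0 secondary indexes averaged 59499 inserts/second.
print(per_insert_metrics(13.5, 0.3, 71.6, 59499))
# -> approximately {'r/i': 0.000227, 'rkb/i': 0.0052, 'wkb/i': 1.232}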
TokuMX
r/s rmb/s wmb/s r/i rkb/i wkb/i #indexes
13.5 0.3 71.6 0.000226 0.005 1.233 0
116.4 5.0 79.6 0.002135 0.093 1.495 1
182.4 9.7 86.9 0.003715 0.202 1.812 2
320.9 16.4 98.0 0.007031 0.367 2.199 3
RocksDB
r/s rmb/s wmb/s r/i rkb/i wkb/i #indexes
11.1 0.3 228.9 0.000251 0.008 5.310 0
18.9 0.4 230.4 0.000503 0.010 6.276 1
114.7 3.9 234.3 0.003441 0.119 7.198 2
185.0 8.3 240.1 0.006186 0.284 8.218 3
WiredTiger
r/s rmb/s wmb/s r/i rkb/i wkb/i #indexes
3.8 0.0 128.3 0.000038 0.000 1.308 0
37.2 0.2 129.2 0.001522 0.009 5.413 1
2681.9 24.4 172.9 0.227747 2.123 15.033 2
5913.2 58.2 174.5 0.895392 9.021 27.053 3
mmapv1
r/s rmb/s wmb/s r/i rkb/i wkb/i #indexes
7.5 0.1 41.3 0.000250 0.003 1.407 0
28.7 0.2 126.6 0.002101 0.013 9.477 1
5534.8 70.7 94.7 0.969479 12.679 16.987 2
11116.2 160.2 57.5 4.645290 68.554 24.607 3
Using TokuMX as the base case, the normalized r/i rates are listed below. When using 3 secondary indexes, mmapv1 does 660X more disk reads per insert and WiredTiger does 127X more disk reads per insert compared to TokuMX. That can be a problem even with fast storage.
            0 indexes   1 index   2 indexes   3 indexes
tokumx      1.000       1.000     1.000       1.000
rocksdb     1.111       0.236     0.926       0.880
wiredtiger  0.168       0.713     61.305      127.349
mmap        1.106       0.984     260.963     660.687
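The normalization is each engine's r/i divided by the TokuMX r/i at the same number of secondary indexes, for example for mmapv1:

# Normalized r/i = engine r/i divided by TokuMX r/i at the same index count.
tokumx_ri = [0.000226, 0.002135, 0.003715, 0.007031]
mmap_ri   = [0.000250, 0.002101, 0.969479, 4.645290]
print([round(m / t, 3) for m, t in zip(mmap_ri, tokumx_ri)])
# -> [1.106, 0.984, 260.963, 660.687]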
Configuration
This is the configuration file for RocksDB, WiredTiger and mmapv1:
processManagement:
  fork: true
systemLog:
  destination: file
  path: /mongo.30/log
  logAppend: true
storage:
  syncPeriodSecs: 60
  dbPath: /mongo.30/data
  journal:
    enabled: true
  mmapv1:
    journal:
      commitIntervalMs: 100
operationProfiling.slowOpThresholdMs: 2000
replication.oplogSizeMB: 2000
storage.wiredTiger.engineConfig.cacheSizeGB: 32
storage.wiredTiger.collectionConfig.blockCompressor: snappy
storage.wiredTiger.engineConfig.journalCompressor: none
storage.rocksdb.cacheSizeGB: 32
storage.rocksdb.compression: snappy
And this is the configuration file for TokuMX:
dbpath = /tokumx201/data
logpath = /tokumx201/log
logappend = true
fork = true
slowms = 2000
oplogSize = 2000
expireOplogHours = 2
The TokuMX mongod was then started using this shell script with "32G" as the argument:
numactl --interleave=all \
bin/mongod \
--config $PWD/mongo.conf \
--setParameter="defaultCompression=quicklz" \
--setParameter="defaultFanout=128" \
--setParameter="defaultReadPageSize=16384" \
--setParameter="fastUpdates=true" \
--cacheSize=$1 \
--replSet foobar \
--checkpointPeriod=900
sleep 5
echo "rs.initiate()" | bin/mongo
echo "show collections" | bin/mongo local