Tuesday, August 4, 2015

Reducing the overhead of secondary index maintenance with a write-optimized database engine

There was a question about the performance of RocksDB for large objects during a recent RocksDB AMA. I used the insert benchmark to get a better answer to that question and understand the impact of secondary index maintenance on the insert rate. I repeated the benchmark for documents with a padding field of size 256 bytes, 10kb and 1mb. The padding field compresses by 50%.

The insert benchmark workload uses one or more threads to insert documents. There are optionally 1, 2 or 3 secondary indexes and the indexes must be maintained by inserting new index entries. For a b-tree the leaf page of each secondary index must be read before it can be written (call this read-before-write). I assume that non-leaf nodes of the index are in the database or OS cache. When the database is large, leaf pages might not be cached and a disk read is done as part of read-before-write.

Non-unique secondary indexes with a write-optimized database don't, or shouldn't, require read-before-write. With RocksDB a Put is done to the memtable to add the new index entry. Were this an update there might be a delete of the old key, which is implemented as a Put of a tombstone into the memtable, and again no read is required. Were this a unique secondary index then the database must do a read to confirm that the unique constraint is not violated.
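
To make the read-before-write difference concrete, here is a toy Python model I put together for this post. It is not MongoDB, RocksDB or TokuMX code and the class names, page mapping and counters are illustrative assumptions; it only shows why an uncached b-tree leaf costs a read per new index entry while a write-optimized engine can buffer the entry without reading.

# Toy model of non-unique secondary index maintenance, not real engine code.
# A b-tree must read the leaf page before changing it (read-before-write),
# while an LSM-style engine blindly buffers the new entry in its memtable.

class BTreeSecondaryIndex:
    def __init__(self, num_pages=1024):
        self.pages = {}          # page_id -> list of (key, doc_id) entries
        self.num_pages = num_pages
        self.disk_reads = 0      # reads done as part of read-before-write

    def insert(self, key, doc_id, leaf_cached=False):
        page_id = hash(key) % self.num_pages   # toy mapping of key -> leaf page
        if not leaf_cached:
            self.disk_reads += 1               # must read the leaf before writing it
        self.pages.setdefault(page_id, []).append((key, doc_id))

class LSMSecondaryIndex:
    def __init__(self):
        self.memtable = []       # new entries accumulate here, merged later by compaction
        self.disk_reads = 0      # stays 0: no read is needed for a non-unique index

    def insert(self, key, doc_id):
        self.memtable.append((key, doc_id))    # blind write

btree, lsm = BTreeSecondaryIndex(), LSMSecondaryIndex()
for i in range(100000):
    btree.insert(i, i, leaf_cached=False)      # uncached leaves: 1 read per insert
    lsm.insert(i, i)
print(btree.disk_reads, lsm.disk_reads)        # 100000 vs 0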

The key conclusions are:
  • A write-optimized database can make secondary index maintenance much faster whether using slow (disk) or fast (PCIe flash) storage. Results for disk are in progress. Today I share results for PCIe flash.
  • There isn't a penalty for RocksDB with large documents. While RocksDB and TokuMX were faster than WiredTiger and mmapv1 for small documents, they were not faster for 10kb and 1mb documents. There are two reasons for this. First, a b-tree might not suffer from higher write-amplification for large documents when the document is a multiple of the disk page size. Second, the secondary index data was small enough to remain in cache during the tests with 10kb and 1mb documents. While read-before-write was still done, the reads were done from cache.

Results

This is the performance data for all tests. The test server has PCIe flash, 144G of RAM and 40 hyperthread cores. The mongod process was configured to use a 32G block cache for RocksDB, WiredTiger and TokuMX. For RocksDB, WiredTiger and mmapv1 I used MongoDB 3.0.X from a special branch. For TokuMX I used 2.0.1 which is based on MongoDB 2.4. The oplog was configured for all servers. The oplog was 2G for RocksDB, WiredTiger and mmapv1. Oplog retention is time-based for TokuMX and set to 2 hours. More configuration options are listed at the end of this post. Snappy compression was used for RocksDB and WiredTiger while QuickLZ was used for TokuMX.

The test with a 256 byte pad field inserted 2B documents. The test with a 10kb pad field inserted 100M documents. The test with a 1mb pad field inserted 1M documents. All tests used 10 client threads.
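
For reference, this is roughly what each client thread does. It is an illustrative pymongo sketch rather than the real insert benchmark client; the collection name, field names, key distribution and batch size are my assumptions, and the pad field mixes constant and random bytes so it compresses by about 50%.

# Illustrative sketch of one insert benchmark client thread (pymongo).
# The collection name, field names and batch size are assumptions.
import os
from pymongo import MongoClient, ASCENDING

PAD_SIZE = 256                      # 256 bytes, 10kb or 1mb depending on the test
NUM_DOCS = 2 * 1000 * 1000 * 1000   # 2B documents for the 256 byte test
BATCH = 1000

client = MongoClient()
coll = client.iibench.docs

# create 0, 1, 2 or 3 secondary indexes depending on the test
for field in ("k1", "k2", "k3"):
    coll.create_index([(field, ASCENDING)])

def make_doc(i):
    return {
        "_id": i,                                # primary key
        "k1": i % 100000,                        # fields covered by the secondary indexes
        "k2": (i * 7) % 100000,
        "k3": (i * 13) % 100000,
        # half constant, half random bytes so the pad field compresses by ~50%
        "pad": b"x" * (PAD_SIZE // 2) + os.urandom(PAD_SIZE // 2),
    }

i = 0
while i < NUM_DOCS:
    coll.insert_many([make_doc(i + j) for j in range(BATCH)], ordered=False)
    i += BATCH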

This shows the insert rate by number of secondary indexes for 2B documents with a 256 byte pad field. WiredTiger did great with 0 secondary indexes but it and mmapv1 suffer once reads from storage are done during secondary index maintenance. TokuMX and RocksDB lose much less performance with more indexes. The insert rate at 3 secondary indexes is 77% and 67% of the rate at 0 secondary indexes for TokuMX and RocksDB. The rate at 3 indexes is 8% and 7% of the rate at 0 indexes for mmapv1 and WiredTiger. The iostat section at the end of the post has more details on the performance overhead of secondary index maintenance. With 3 secondary indexes the number of disk reads per insert for WiredTiger and mmapv1 is more than 100X the rate for TokuMX and RocksDB.

I repeated the test for RocksDB and TokuMX with the oplog disabled. The insert rate increased from 45651 to 49322 for TokuMX and from 29913 to 39953 for RocksDB. I suspect that TokuMX loses less performance from the oplog because they partitioned it.

This is the database size for the tests with 2B documents and a 256 byte pad field. RocksDB uses the least disk space and mmapv1 uses the most.

This is the result for 100M documents with a 10kb pad field. While TokuMX and RocksDB lose the least performance as indexes are added, the results for all engines are closer than for the 256 byte documents. The secondary index data remains in cache for this test so WiredTiger and mmapv1 suffer less from read-before-write.

This is the result for 1M documents with a 1mb pad field. Results were close for all engines.

Performance Data

The collection used in this test always has a primary key index and tests were done with 0, 1, 2 and 3 secondary indexes. For each test there are 4 numbers below in the form ipsAvg-ipsP95-dbSize-rssSize (a small sketch for splitting these values apart follows the list). For some database sizes I don't have resolution for all digits (see 13XX below). The four values are:
  • ipsAvg - average rate of inserts/second during the test
  • ipsP95 - 95th percentile rate for inserts/second from 10-second samples
  • dbSize - database size in GB at test end
  • rssSize - mongod process size in GB at test end
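
The values can be split apart with a tiny Python helper like the one below. It is only for reading this post's format and is not part of the benchmark.

# Split an ipsAvg-ipsP95-dbSize-rssSize string into its four fields.
def parse_result(cell):
    ips_avg, ips_p95, db_size, rss_size = cell.split("-")
    return {
        "ipsAvg": float(ips_avg),   # average inserts/second during the test
        "ipsP95": float(ips_p95),   # p95 inserts/second from 10-second samples
        "dbSize": db_size,          # database size at test end, e.g. '433g'
        "rssSize": rss_size,        # mongod RSS at test end, e.g. '25g'
    }

print(parse_result("59499-56576-433g-25g"))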

2B documents with 256 byte pad field
        PK-only                 PK, 1 secondary         PK, 2 secondaries       PK, 3 secondaries
tokumx  59499-56576-433g-25g    54530-50836-483g-34g    49092-44445-529g-34g    45641-40115-608g-34g
rocks   44136-33342-363g-42g    37587-25271-394g-45g    33330-23267-435g-50g    29913-21191-478g-53g
wt      100484-85026-383g-6.5g  24448-59.0-461g-39g     11776-68.9-534g-36g     6604-997-607g-36g
mmap    30063-27652-1023g-130g  13684-11689-11XXg-133g  5709-2617-13XXg-79g     2393-1188-14XXg-72g

100M documents with 10kb pad field
        PK-only                 PK, 1 secondary         PK, 2 secondaries       PK, 3 secondaries
tokumx  6897-4509-986g-33g      6960-5338-14XXg-35g     7053-5536-14XXg-35g     7043-5488-14XXg-35g
rocks   8489-6933-515g-4.3g     7976-73-517g-5.3g       7813-6184-518g-5.4g     7475-5786-521g-5.6g
wt      9087-7444-640g-5.2g     5159-1137-654g-18g      5619-714-651g-27g       4204-546-663g-34g
mmap    7434-4307-16XXg-106g    6039-3466-16XXg-98g     5046-2853-16XXg-111g    4299-2711-16XXg-111g

1M documents with 1mb pad field
        PK-only                 PK, 1 secondary         PK, 2 secondaries       PK, 3 secondaries
tokumx  100.1-70.1-864g-34g     100.9-72.8-1022g-33g    100.5-69.8-1019g-33g    99.7-67.9-1019g-33g
rocks   100.0-72.2-514g-34g     94.1-68.0-517g-34g      95.8-69.9-517g-34g      94.9-69.1-518g-34g
wt      106.7-101.3-518g-2.7g   107.1-101.7-518g-4.2g   107.7-101.5-519g-2.9g   108.0-101.7-520g-3.0g
mmap    79.2-32.2-20XXg-117g    93.9-49.2-20XXg-117g    95.2-57.0-20XXg-118g    95.2-57.7-20XXg-117g

iostat

I examined iostat data to explain the differences in insert rates for the test with 2B documents and a 256 byte pad field. There are four lines for each engine with data for 0, 1, 2 and 3 secondary indexes. Note that with 3 secondary indexes the b-tree engines (WiredTiger, mmapv1) write more to storage per insert and read a lot more per insert. The arithmetic behind the per-insert columns is sketched after this list. The output is:
  • r/s - average rate of storage reads/second
  • rmb/s - average rate for read MB/second
  • wmb/s - average rate for write MB/second
  • r/i - storage reads per insert
  • rkb/i - KB of storage reads per insert
  • wkb/i - KB of storage writes per insert
  • #indexes - number of secondary indexes
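
The per-insert columns come from dividing the iostat per-second averages by the average insert rate for that test. The sketch below uses the TokuMX numbers for 0 secondary indexes and assumes 1 MB = 1024 KB; small rounding differences versus the table are expected.

# Derive the per-insert columns from iostat averages and the insert rate.
# Numbers are from the TokuMX, 0-secondary-index test (ipsAvg = 59499).
r_per_sec = 13.5          # iostat r/s
rmb_per_sec = 0.3         # iostat read MB/second
wmb_per_sec = 71.6        # iostat write MB/second
inserts_per_sec = 59499.0 # ipsAvg for this test

r_per_insert = r_per_sec / inserts_per_sec              # ~0.000227 storage reads per insert
rkb_per_insert = rmb_per_sec * 1024 / inserts_per_sec   # ~0.005 KB read per insert
wkb_per_insert = wmb_per_sec * 1024 / inserts_per_sec   # ~1.232 KB written per insert
print(r_per_insert, rkb_per_insert, wkb_per_insert)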

TokuMX
r/s     rmb/s   wmb/s   r/i             rkb/i   wkb/i   #indexes
13.5    0.3     71.6    0.000226        0.005   1.233   0
116.4   5.0     79.6    0.002135        0.093   1.495   1
182.4   9.7     86.9    0.003715        0.202   1.812   2
320.9   16.4    98.0    0.007031        0.367   2.199   3

RocksDB
r/s     rmb/s   wmb/s   r/i             rkb/i   wkb/i   #indexes
11.1    0.3     228.9   0.000251        0.008   5.310   0
18.9    0.4     230.4   0.000503        0.010   6.276   1
114.7   3.9     234.3   0.003441        0.119   7.198   2
185.0   8.3     240.1   0.006186        0.284   8.218   3

WiredTiger
r/s     rmb/s   wmb/s   r/i             rkb/i   wkb/i   #indexes
3.8     0.0     128.3   0.000038        0.000   1.308   0
37.2    0.2     129.2   0.001522        0.009   5.413   1
2681.9  24.4    172.9   0.227747        2.123   15.033  2
5913.2  58.2    174.5   0.895392        9.021   27.053  3

mmapv1
r/s     rmb/s   wmb/s   r/i             rkb/i   wkb/i   #indexes
7.5     0.1     41.3    0.000250        0.003   1.407   0
28.7    0.2     126.6   0.002101        0.013   9.477   1
5534.8  70.7    94.7    0.969479        12.679  16.987  2
11116.2 160.2   57.5    4.645290        68.554  24.607  3

Using TokuMX as the base case, the normalized r/i rates are listed below. With 3 secondary indexes, mmapv1 does 660X more disk reads per insert and WiredTiger does 127X more disk reads per insert compared to TokuMX. That can be a problem even with fast storage.

            0 indexes   1 index     2 indexes   3 indexes
tokumx      1.000       1.000       1.000       1.000
rocksdb     1.111       0.236       0.926       0.880
wiredtiger  0.168       0.713       61.305      127.349
mmap        1.106       0.984       260.963     660.687

Configuration

This is the configuration file for RocksDB, WiredTiger and mmapv1:
processManagement:
  fork: true
systemLog:
  destination: file
  path: /mongo.30/log
  logAppend: true
storage:
  syncPeriodSecs: 60
  dbPath: /mongo.30/data
  journal:
    enabled: true
  mmapv1:
    journal:
      commitIntervalMs: 100
operationProfiling.slowOpThresholdMs: 2000
replication.oplogSizeMB: 2000

storage.wiredTiger.engineConfig.cacheSizeGB: 32
storage.wiredTiger.collectionConfig.blockCompressor: snappy
storage.wiredTiger.engineConfig.journalCompressor: none

storage.rocksdb.cacheSizeGB: 32
storage.rocksdb.compression: snappy

And this is the configuration file for TokuMX:
dbpath = /tokumx201/data
logpath = /tokumx201/log
logappend = true
fork = true
slowms = 2000
oplogSize = 2000
expireOplogHours = 2

The TokuMX mongod was then started using this shell script with "32G" as the argument:
numactl --interleave=all \
bin/mongod \
    --config $PWD/mongo.conf \
    --setParameter="defaultCompression=quicklz" \
    --setParameter="defaultFanout=128" \
    --setParameter="defaultReadPageSize=16384" \
    --setParameter="fastUpdates=true" \
    --cacheSize=$1 \
    --replSet foobar \
    --checkpointPeriod=900

sleep 5
echo "rs.initiate()" | bin/mongo
echo "show collections" | bin/mongo local
