Thursday, August 13, 2015

Insert benchmark and queries on slow storage for MongoDB

This continues the use of the insert benchmark to understand performance for different MongoDB engines with slow storage. I previously explained load and query performance for a server with PCIe flash and documented the impact of too frequent invalidation in the plan cache. In this post I explain load and query performance for a server with a disk array -- SW RAID 0 across 15 SATA disks.

My summary from this and the previous posts includes:
  • write-optimized engines (TokuMX, RocksDB) do much better than b-trees (WiredTiger, mmapv1) for secondary index maintenance on large databases because they don't do read-before-write. InnoDB might be unique among the b-tree implementations -- it has an optimization to save on disk reads via the insert buffer.
  • MongoDB invalidates cached plans too quickly, but that has been fixed recently. The performance impact from this can be much worse on slow storage.
  • TokuMX 2.0.1 has a CPU bottleneck from the partial eviction code that also hurts range query performance. It also wasn't doing index-only access plans for queries that should be index only.
  • RocksDB is very competitive for this workload. It was usually the first or second best on each test.
  • TokuMX tends to do more reads and fewer writes than RocksDB.
  • mmapv1 can become single-threaded courtesy of the write lock not getting released when disk reads are done during index maintenance. See server-13325.
  • TokuMX and RocksDB on slow storage do loads almost as fast as on fast storage. With 3 secondary indexes TokuMX gets 98% of the rate it had on fast storage and RocksDB gets 83% of the rate it had on fast storage. RocksDB loses more because of compaction stalls. We are thinking of ways to make this better.
  • For write intensive workloads RocksDB loses more performance than TokuMX when the oplog is enabled. TokuMX loaded 500M docs at 44571/second with the oplog enabled versus 48515/second with it disabled. RocksDB did 24923/second with it enabled and 36759/second with it disabled. The ratio of rate-with-oplog-enabled to rate-with-oplog-disabled is 91% for TokuMX and 67% for RocksDB. TokuMX has more optimizations for the oplog.
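To make the last comparison concrete, the ratios come straight from the load rates listed above. A quick check in Python, using only the numbers reported in this post:

# Load rates (inserts/second) for the 500M document load, from the summary above.
rates = {
    "tokumx":  {"oplog_on": 44571, "oplog_off": 48515},
    "rocksdb": {"oplog_on": 24923, "oplog_off": 36759},
}

for engine, r in rates.items():
    ratio = r["oplog_on"] / r["oplog_off"]
    print(f"{engine}: {ratio:.1%} of the oplog-disabled rate")
# tokumx: 91.9% of the oplog-disabled rate
# rocksdb: 67.8% of the oplog-disabled rate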

The tests

I wasted a few weeks of machine time getting this data. My first attempt used the same setup as the fast storage test -- 2B documents with a 256 byte pad column. The load with 3 secondary indexes would have taken much too long for mmapv1 and WiredTiger, but I didn't realize this until more than one week into the load. I changed the load to 500M documents and even then I wasn't willing to wait for mmapv1 and WiredTiger to finish with 3 secondary indexes, so I stopped the load early for them. I was more interested in incremental load than bulk load performance; otherwise I would have created the indexes after loading the table.

I then lost another few days and repeated all query tests because I had a lousy setting for /sys/block/$device/queue/read_ahead_kb. This hurt mmapv1 the most on the query tests. RocksDB was mostly immune to it because it sets POSIX_FADV_RANDOM by default for user reads. I didn't notice the problem until late in the process. I did not repeat the load.
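For background on the readahead issue, the sketch below is illustrative and not part of the benchmark scripts: it shows how to check the kernel readahead setting for a block device and how a process can opt out of readahead for its own reads with posix_fadvise, the same POSIX_FADV_RANDOM hint that RocksDB sets by default for user reads. The device name and file path are placeholders.

import os

def read_ahead_kb(device="sda"):
    # Current kernel readahead (KB) for a block device.
    with open(f"/sys/block/{device}/queue/read_ahead_kb") as f:
        return int(f.read().strip())

def open_for_random_reads(path):
    # Hint that access will be random so the kernel skips readahead for this
    # file descriptor. RocksDB issues the same hint for user reads by default,
    # which is why it was mostly immune to my read_ahead_kb mistake.
    fd = os.open(path, os.O_RDONLY)
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
    return fd

print("read_ahead_kb for sda:", read_ahead_kb("sda"))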

I spent a few more days improving the RocksDB configuration for disk arrays and finally another day repeating tests with the oplog disabled. This is a typical experience for performance tests: tests will be repeated.

All tests use my fork of the insert benchmark for MongoDB. The test servers have a disk array (SW RAID 0, 15 SATA disks), 32 hyperthread cores and 64G of RAM. The servers with fast storage had more of everything: PCIe flash, 40 hyperthread cores and 144G of RAM. The CPU on the fast storage server was one generation newer than the CPU on the slow storage server. The mongod configuration was described in the two previous blog posts, but I limited the database cache to 16G versus 32G for the server with fast storage. Loads were done with 10 client threads. The query tests are described in the Test Details section of the previous post.
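For readers who have not seen the insert benchmark, here is a minimal sketch of the load phase in Python with pymongo. It is not the actual iibench client used for these tests; the field names, index definitions and batch size are illustrative, but it shows the shape of the workload: concurrent client threads inserting documents with a ~256 byte pad field into a collection that has 3 secondary indexes maintained during the load.

# Minimal sketch of an insert-benchmark style load (not the actual iibench client).
import os
import random
import threading
from pymongo import MongoClient, ASCENDING

DOCS_PER_THREAD = 100_000   # the real test loaded 500M documents in total
BATCH_SIZE = 1000
NUM_THREADS = 10

def make_doc(seq):
    return {
        "seq": seq,
        "price": random.random() * 500,
        "customerid": random.randint(0, 100_000),
        "cashregisterid": random.randint(0, 1000),
        "pad": os.urandom(128).hex(),   # ~256 byte pad field
    }

def load(thread_id, coll):
    base = thread_id * DOCS_PER_THREAD
    for start in range(0, DOCS_PER_THREAD, BATCH_SIZE):
        docs = [make_doc(base + start + i) for i in range(BATCH_SIZE)]
        coll.insert_many(docs, ordered=False)

def main():
    coll = MongoClient("localhost", 27017)["iibench"]["purchases"]
    # 3 secondary indexes, maintained during the load (incremental load).
    coll.create_index([("price", ASCENDING), ("customerid", ASCENDING)])
    coll.create_index([("cashregisterid", ASCENDING), ("price", ASCENDING)])
    coll.create_index([("price", ASCENDING), ("customerid", ASCENDING), ("cashregisterid", ASCENDING)])
    threads = [threading.Thread(target=load, args=(t, coll)) for t in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    main()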

Load performance with oplog enabled

I had lots of data and a nice graph to share for the load. Then I realized that a lousy setting for read_ahead_kb made my load results bogus. I repeated the load tests for TokuMX and RocksDB with better settings. I did not repeat the load for WiredTiger and mmapv1 because that would take too long. I also stopped the load early for WiredTiger and mmapv1 so the query tests were run with 352M documents for mmapv1 and 419M documents for WiredTiger versus 500M documents for RocksDB and TokuMX. The oplog was enabled for these tests, but fsync was not done on commit.

This has data from iostat and vmstat including absolute and relative (normalized by insert/second) rates. From the normalized rates TokuMX and RocksDB use a similar amount of CPU per insert, while TokuMX reads ~7X more data and RocksDB writes ~3.5X more data per insert. Note that I can greatly reduce the write rate for RocksDB by using universal compaction, but I have yet to explore that tradeoff between write, read and space amplification. I suspect that RocksDB uses one more index than TokuMX, to map DiskLoc to a document, as explained in a previous post. That is a limitation in MongoDB 3.0, so WiredTiger also suffers from it. This extra index might be the reason for some extra writes, but not for 3.5X more. The values in the table are:
  • r/s - average rate for iostat r/s
  • rmb/s - average rate for iostat read MB/second 
  • wmb/s - average rate for iostat write MB/second
  • r/o - iostat reads per insert
  • rkb/o - iostat KB read per insert
  • wkb/o - iostat KB written per insert
  • us+sy - average value of vmstat us + sy (user & system CPU)
  • cs/o - vmstat cs per insert (context switches per insert)
  • (us+sy)/o - CPU time per insert
  • ips - inserts/second
  • #indexes - number of secondary indexes
r/s     rmb/s   wmb/s   r/o        rkb/o    wkb/o    us+sy   cs/o   (us+sy)/o   ips     engine
728.4   9.9      87.4   0.016343   0.228    2.007    40.1    4      0.000899    44571   tokumx
 77.8   0.9     176.6   0.003123   0.036    7.254    22.3    7      0.000895    24923   rocksdb
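The normalized columns are just the average iostat and vmstat rates divided by the average insert rate. A small sketch of the arithmetic, using the TokuMX row above as input:

# Derive the normalized (per-insert) columns from average iostat/vmstat rates
# and the insert rate, as done for the tables in this post.
def normalize(r_s, rmb_s, wmb_s, us_sy, ips):
    return {
        "r/o": r_s / ips,                 # disk reads per insert
        "rkb/o": rmb_s * 1024 / ips,      # KB read per insert
        "wkb/o": wmb_s * 1024 / ips,      # KB written per insert
        "(us+sy)/o": us_sy / ips,         # CPU time per insert
    }

# TokuMX load with the oplog enabled, from the table above.
print(normalize(r_s=728.4, rmb_s=9.9, wmb_s=87.4, us_sy=40.1, ips=44571))
# -> r/o ~0.0163, rkb/o ~0.227, wkb/o ~2.01, (us+sy)/o ~0.0009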

Load performance with oplog disabled

I repeated the load test for RocksDB and TokuMX with the oplog disabled to determine the overhead of the oplog for RocksDB and hopefully provide motivation to make this more efficient. The data below has the load rate (ips) and statistics from iostat and vmstat (absolute and normalized by the insert rate) for TokuMX and RocksDB with the oplog enabled (*.log) and disabled (*.nolog). Things I noticed include:
  • RocksDB loses more from the oplog than TokuMX. RocksDB gets 67% of the insert rate with the oplog enabled compared to when it is disabled, while TokuMX gets 91% in that comparison. TokuMX doesn't have to delete old data from the oplog because it can drop/truncate old partitions, while RocksDB must delete old entries one at a time.
  • The wkb/o (rate for bytes written to storage per insert) value is reduced almost in half for RocksDB with the oplog disabled. For an insert only workload documents are written at least twice (once to the oplog, once to the database) ignoring writes from LSM compaction in RocksDB. So when the oplog is disabled half of the writes are removed.
  • The rkb/o rate drops by ~6X for RocksDB when the oplog is disabled.
r/s     rmb/s   wmb/s   r/o        rkb/o    wkb/o    us+sy   cs/o   (us+sy)/o   ips     engine
728.4   9.9      87.4   0.016343   0.228    2.007    40.1    4      0.000899    44571   tokumx.log
77.8    0.9     176.6   0.003123   0.036    7.254    22.3    7      0.000895    24923   rocksdb.log

706.8   8.3     65.0    0.014568   0.176    1.372    39.0    4      0.000804    48515   tokumx.nolog
25.3    0.2     141.5   0.000689   0.005    3.941    23.2    6      0.000632    36759   rocksdb.nolog

Impact of too-frequent plan invalidation

The i10.q1 test shows how bad the impact can be from too-frequent plan invalidation. The problem is that the 3.0.X version of MongoDB that I used invalidates query plans after 1000 documents have been changed. That happens quickly for the workloads I tested and QPS drops when plans are recomputed. The table below shows the rates for inserts and queries per second with the default setting ($engine.nofix) and a workaround ($engine.fix). I was unable to use the workaround for TokuMX 2.0.1 as it is based on an older version of MongoDB, but I don't think TokuMX suffers from this problem.

config        ipsAvg   qpsAvg
wired.nofix   668      2.1
wired.fix     616      38
mmap.nofix    925      0.22
mmap.fix      583      4.2
rocks.nofix   27783    1.3
rocks.fix     29628    78

Query performance

The query tests are described in the Test Details section of the previous post. The tests were run in sequence after the load with 3 secondary indexes, and here I share the results for i0.q10, i1.q10, i10.q0 and i10.q1. The tests were only run for the configuration with 500M documents and a 256 byte pad field, but the load was stopped early for WiredTiger and mmapv1 (see the section on Load Performance) so they had less data.

The queries used in this test are interesting. The secondary indexes were small enough to be cached for RocksDB, TokuMX and WiredTiger but only if I didn't use too much memory for the block caches. I used 16G which left almost 48G for the OS filesystem cache. To be precise, this was enough memory to almost cache the secondary indexes for TokuMX. The queries use covering secondary indexes so they should be index only, but were not for TokuMX for reasons I don't understand. The extra disk reads done by TokuMX were not good for QPS.
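As an aside, covered (index-only) access can be confirmed from the client by projecting only indexed fields and checking the plan. Below is a sketch with pymongo; the collection and field names are placeholders rather than the benchmark's actual schema.

# Sketch: a query that a covering secondary index should satisfy without
# fetching documents, and how to inspect the plan. Names are illustrative.
from pymongo import MongoClient, ASCENDING

coll = MongoClient()["iibench"]["purchases"]
coll.create_index([("price", ASCENDING), ("customerid", ASCENDING)])

# Project only indexed fields and exclude _id so the index can cover the query.
cursor = coll.find(
    {"price": {"$gte": 100, "$lt": 200}},
    {"_id": 0, "price": 1, "customerid": 1},
)
plan = cursor.explain()
# A covered plan is an IXSCAN (possibly under a projection stage) with no FETCH stage.
print(plan["queryPlanner"]["winningPlan"])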

The secondary indexes were not small enough to be fully cached by mmapv1, but a lot of the data from them was in cache. With mmapv1 I don't have to worry about splitting RAM between uncompressed pages in the block cache and compressed pages in the OS filesystem cache. Also with mmapv1, decompression on page read wasn't needed because it doesn't use compression. This might explain why it had the highest QPS. AFAIK, WiredTiger doesn't use block compression (snappy in this case) for index pages so it has less overhead after reading the data compared to RocksDB.

This describes the columns in the tables below:
  • ipsAvg, qpsAvg - average rates for inserts and queries per second
  • dbSize - database size in GB at test end
  • rssSize - size of mongod process per RSS column at test end

i0.q10

Perhaps because I have never used mmapv1 or MyISAM in production, I am always happy to get a result where they are the fastest. The queries were not index only for TokuMX and QPS suffers. They were index only for RocksDB, WiredTiger and mmapv1.

engine       qpsAvg   dbSize   rssSize
tokumx       529      140g     18g
wiredtiger   8783     129g     15g
mmapv1       10827    251g     62g
rocksdb      7711     120g     22g

This has data from iostat and vmstat including absolute and relative (normalized by IPS or QPS) rates. The columns are explained in the Load Performance section. TokuMX does ~1037X more disk reads per query than RocksDB. That is a problem. mmapv1 does ~8.4X more disk reads per query than RocksDB and that is much less of a problem in this case. RocksDB uses about 1.6X more CPU/query than mmapv1 while TokuMX uses about 4.8X more CPU/query than RocksDB.

r/s     rmb/s   wmb/s   r/o        rkb/o    wkb/o    us+sy   cs/o   (us+sy)/o   qps     engine
940.1   11.1    0.6     1.778132   21.445   1.204    10.3    13     0.019426    529     tokumx
  1.5    0.2    0.1     0.000176    0.019   0.017    36.7    10     0.004181    8783    wiredtiger
156.1    0.7    0.2     0.014419    0.065   0.015    28.2     7     0.002602    10827   mmapv1
 13.2    0.2    0.1     0.001715    0.029   0.019    31.3     7     0.004060    7711    rocksdb

i1.q10

The one insert thread was rate limited to 1000 inserts/second but WiredTiger and mmapv1 were unable to sustain that rate. This test was run with the fix to avoid too-frequent query planning. The QPS for TokuMX suffers because the query isn't index only for it. QPS for mmapv1 suffers because the database is much larger than others and because of the RW lock which is locked in W mode when disk reads are done for secondary index maintenance as explained in a previous post.

engine       ipsAvg   qpsAvg   dbSize   rssSize
tokumx       989      502      143g     17g
wiredtiger   897      5985     131g     16g
mmapv1       253      352      258g     21g
rocksdb      991      7129     122g     22g

i10.q0

TokuMX does much better than RocksDB here because RocksDB has more write amplification, TokuMX has more optimizations for the oplog and RocksDB must maintain one extra index. The insert rates for WiredTiger and mmapv1 are much lower because they do read-before-write during secondary index maintenance. The write efficiency can be seen in the wkb/o column where it is ~5X larger for RocksDB than TokuMX and ~23X larger for WiredTiger than TokuMX. The read efficiency can be seen in the r/o column where WiredTiger is ~40X larger than TokuMX, mmapv1 is ~18X larger than TokuMX and RocksDB is ~1.7X larger than TokuMX. WiredTiger and mmapv1 also use much more CPU per insert than TokuMX and RocksDB.

engine       ipsAvg   dbSize   rssSize
tokumx       48997    153g     19g
wiredtiger   764      139g     16g
mmapv1       649      259g     36g
rocksdb      27810    132g     22g

r/s     rmb/s   wmb/s   r/o        rkb/o    wkb/o    us+sy   cs/o   (us+sy)/o   ips     engine
536.9   7.1      71.2   0.010958   0.148     1.488   32.6    4      0.000666    48997   tokumx
332.5   2.2      25.5   0.435168   3.006    34.154    5.4    8      0.007050    764     wiredtiger
131.6   0.9      15.8   0.202600   1.367    24.962    1.1    5      0.001684    649     mmapv1
547.3   5.0     206.5   0.018976   0.177     7.331   27.1    6      0.000940    28839   rocksdb

i10.q1

I am not sure why the insert rate for RocksDB here is higher than i10.q0, perhaps that is due to variance from compaction and a longer running test would have fixed it. This test was run with the fix to avoid too-frequent query planning. The insert rates are similar to the rates for i10.q0.

engine       ipsAvg   qpsAvg   dbSize   rssSize
tokumx       46524    39       165g     19g
wiredtiger   668      38       146g     19g
mmapv1       583      4.2      272g     34g
rocksdb      29628    78       138g     22g


