Monday, June 29, 2015

Insert benchmark for MongoDB, memory allocators and the oplog

I used the insert benchmark to measure how insert performance scales with concurrency for WiredTiger with a cached database. My goals were to understand the impact of concurrency, the oplog, the memory allocator and transaction size. The test was run for 1 to 32 concurrent connections, with both tcmalloc and jemalloc, with the oplog enabled and disabled, and for several document sizes. My conclusions from this include:
  • WiredTiger gets almost 12X more QPS at 20 concurrent clients than at 1 client with the oplog off and almost 9X more QPS with the oplog on (a quick check against the results table is sketched after this list). I think this is excellent for a young engine on this workload.
  • Document size affects performance. The test has a configurable padding field and the insert rate at 32 connections was 247,651 documents/second with a 16-byte pad field and 140,142 documents/second with a 1024-byte pad field.
  • The tcmalloc bundled with MongoDB gets 1% to 6% more QPS than jemalloc 3.6.0.
  • The insert rate can drop by half or more when the oplog is enabled, especially for larger documents. I expect this to be less of an issue soon.
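
The scaling claim in the first bullet is easy to recheck from the results table below: divide the rate at N clients by the rate at 1 client. A quick sketch using the wt.sz16.jem.op0 row:

# Scaling check using the wt.sz16.jem.op0 row from the results table below.
# Values are inserts/second at 1, 2, 4, 8, 12, 16, 20, 24, 28 and 32 clients.
clients = [1, 2, 4, 8, 12, 16, 20, 24, 28, 32]
rates = [17793, 32143, 60150, 109853, 154377, 186886, 209831, 225879, 235803, 234158]

for n, rate in zip(clients, rates):
    # At 20 clients this prints ~11.8x, the "almost 12X" quoted above.
    print("%2d clients: %6d inserts/s, %4.1fx vs 1 client" % (n, rate, rate / float(rates[0])))
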
While I only share results for WiredTiger here, I know that on this workload the WiredTiger B-Tree does better than RocksDB when the database is cached, while RocksDB does much better when the database is not cached. WiredTiger does better because it uses clever lock-free algorithms to avoid mutex contention. RocksDB does better because it uses a write-optimized algorithm and non-unique secondary index maintenance doesn't require page reads. The short version is that WiredTiger can be more CPU efficient and RocksDB can be more IO efficient.

Configuration

Tests were repeated for 1, 2, 4, 8, 12, 16, 20, 24, 28 and 32 concurrent connections. The test server has 40 hyperthread cores and 144G of RAM. I used a special MongoDB 3.0 branch and compiled my own binaries. This test uses one collection with 3 secondary indexes and all clients insert into that collection, so there is the potential for extra contention (data & mutexes) because only one collection is used. The _id field is set by the driver. Values for the fields with secondary indexes are inserted in random order. I set two options for WiredTiger and otherwise the defaults were used:
storage.wiredTiger.collectionConfig.blockCompressor: snappy
storage.wiredTiger.engineConfig.journalCompressor: none
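
The real benchmark client isn't shown here. The sketch below only captures the shape of the workload described above: one collection, 3 secondary indexes whose values arrive in random order, _id set on the client side and a configurable pad field. The field names (k1, k2, k3), batch size and document counts are made up for illustration, and it assumes a local mongod and a recent pymongo.

# Not the real insert benchmark client, just a minimal sketch of the workload.
import random
import threading

from pymongo import MongoClient

N_CLIENTS = 32         # 1 to 32 concurrent connections in the tests above
PAD_SIZE = 16          # the configurable padding field: 16, 64, 256 or 1024 bytes
BATCH_SIZE = 100       # documents per insert call, a stand-in for "transaction size"
DOCS_PER_CLIENT = 100000

def insert_client(client_id):
    coll = MongoClient()["test"]["insertbench"]
    pad = "x" * PAD_SIZE
    batch = []
    for i in range(DOCS_PER_CLIENT):
        batch.append({
            # _id is set on the client side rather than by the server
            "_id": client_id * DOCS_PER_CLIENT + i,
            # secondary index fields get random values so index maintenance
            # happens in random order
            "k1": random.randint(0, 10**9),
            "k2": random.randint(0, 10**9),
            "k3": random.randint(0, 10**9),
            "pad": pad,
        })
        if len(batch) == BATCH_SIZE:
            coll.insert_many(batch)
            batch = []
    if batch:
        coll.insert_many(batch)

if __name__ == "__main__":
    # one collection with 3 secondary indexes, shared by all clients
    coll = MongoClient()["test"]["insertbench"]
    for field in ("k1", "k2", "k3"):
        coll.create_index(field)
    threads = [threading.Thread(target=insert_client, args=(n,)) for n in range(N_CLIENTS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
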

Results

This lists the insert rate for all of the configurations tested. The test names encode the engine name ("wt" is WiredTiger), padding size ("sz16" is 16 bytes), memory allocator ("jem" is jemalloc & "tcm" is tcmalloc) and whether the oplog is enabled ("op0" is off, "op1" is on). The result is missing for the wt.sz64.jem.op0 configuration at 24 concurrent clients.

1       2       4       8       12      16      20      24      28      32      concurrency
17793   32143   60150   109853  154377  186886  209831  225879  235803  234158  wt.sz16.jem.op0
14758   26908   48186   69910   91761   109643  120413  129153  134313  140633  wt.sz16.jem.op1
18575   34482   63169   114752  160454  196784  218009  230588  244001  247651  wt.sz16.tcm.op0
15756   28461   50069   72988   101615  109223  127287  135525  133446  137377  wt.sz16.tcm.op1
17651   31466   58192   107472  152426  184849  205226          227172  226825  wt.sz64.jem.op0
14481   26565   47385   71059   87135   100684  110569  110066  119478  120851  wt.sz64.jem.op1
19094   33401   61426   111606  153950  190107  214670  227425  238072  239836  wt.sz64.tcm.op0
15399   27946   49392   72892   85185   99140   106172  101163  112812  119032  wt.sz64.tcm.op1
15759   29196   55124   98027   135320  161829  181049  197465  208484  211100  wt.sz256.jem.op0
13163   24092   41017   62878   71153   84155   87487   90366   91495   87025   wt.sz256.jem.op1
17299   30631   55900   101538  137529  165326  187330  200574  216888  217589  wt.sz256.tcm.op0
13927   25822   43428   60078   72195   78141   76053   73169   74824   64537   wt.sz256.tcm.op1
12115   22366   40793   71936   93701   109068  120175  129645  133238  141108  wt.sz1024.jem.op0
9938    17268   24985   31944   34127   38119   39196   38747   36796   38167   wt.sz1024.jem.op1
12933   23547   42426   73295   94123   110412  116003  136287  139914  140142  wt.sz1024.tcm.op0
10422   17701   23747   30276   29959   32444   32610   30839   31569   30089   wt.sz1024.tcm.op1

Memory allocator

This shows the ratio of the insert rate with tcmalloc to the insert rate with jemalloc. When the ratio is greater than 1, tcmalloc is faster. The oplog was enabled for these tests. Results are displayed for 4 configurations -- 16, 64, 256 and 1024 byte padding. For all results but one the insert rate was better with tcmalloc, and the difference was more significant when the padding was smaller.
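
The ratio is easy to recompute from the results table above for any pair of rows. A small sketch for the 16-byte padding rows with the oplog enabled:

# tcmalloc / jemalloc insert-rate ratio, computed from the table above for
# wt.sz16.tcm.op1 over wt.sz16.jem.op1. A value above 1 means tcmalloc was faster.
clients = [1, 2, 4, 8, 12, 16, 20, 24, 28, 32]
tcm = [15756, 28461, 50069, 72988, 101615, 109223, 127287, 135525, 133446, 137377]
jem = [14758, 26908, 48186, 69910, 91761, 109643, 120413, 129153, 134313, 140633]

for n, t, j in zip(clients, tcm, jem):
    print("%2d clients: tcm/jem = %.3f" % (n, t / float(j)))
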

Oplog off

This shows the insert rate for 4 configurations with padding of 16, 64, 256 and 1024 bytes. All configurations used tcmalloc and the oplog was disabled. The insert rate increased with concurrency and was higher with a smaller padding size.

Oplog on

This shows the insert rate for 4 configurations with padding of 16, 64, 256 and 1024 bytes. All configurations used tcmalloc and the oplog was enabled. The insert rate usually increased with concurrency and was higher with a smaller padding size. For padding sizes of 256 and 1024 bytes the insert rate decreased at high concurrency.

Oplog impact

This displays the ratio of the insert rates from the previous two sections: the insert rate with the oplog on divided by the rate with it off. This ratio should be less than one, but not too much less. The overhead of the oplog increases with the padding size. In the worst case the oplog reduces the insert rate by about 5X (a ratio of about 0.20). I expect that this overhead will be greatly reduced for WiredTiger and RocksDB in future releases.
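
The same arithmetic reproduces the worst case from the table above, using the 1024-byte padding rows with tcmalloc:

# Oplog on / oplog off insert-rate ratio from the table above, computed for
# wt.sz1024.tcm.op1 over wt.sz1024.tcm.op0.
clients = [1, 2, 4, 8, 12, 16, 20, 24, 28, 32]
op_on  = [10422, 17701, 23747, 30276, 29959, 32444, 32610, 30839, 31569, 30089]
op_off = [12933, 23547, 42426, 73295, 94123, 110412, 116003, 136287, 139914, 140142]

for n, on, off in zip(clients, op_on, op_off):
    # At 32 clients this is ~0.21, the roughly 5X drop mentioned above.
    print("%2d clients: op1/op0 = %.2f" % (n, on / float(off)))
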


