Thursday, February 13, 2020

Short guide to MongoDB monitoring

This is short and incomplete. But it is a good start. This is written from the perspective of someone who spends all of their time trying to explain MongoDB performance for benchmarks. I have plenty of experience with databases in production. I have no experience with MongoDB in production.

Notes:
  • One day I hope for the equivalent of user/table statistics
  • For profiling use db.setProfilingLevel and db.getProfilingStatus.
  • To enable debug level messages in the diagnostic log use db.setLogLevel and db.getLogComponents. However I think it is best to start with non-debug messages so you can ignore these for a while.
  • I wish the set and get names above were symmetric. Just remember the set*Level method names and don't worry about the get* names. 
  • db.setProfilingLevel also determines what can get written to the diagnostic log. The important parameter is slowms. My advice is to set level=0, set slowms to something > 0 and use the diagnostic log (see the first example after this list).
  • Learn how to use mtools to process entries from the diagnostic log. The tools are great and the docs are awesome (a usage sketch is after this list).
  • The diagnostic log entries for COMMAND and TXN are the most likely entries you will examine. These lines can include many performance metrics but they are only printed when the values are > 0. Examples are below.
  • Also see db.stats, db.serverStatus, db.collStats, db.collection.stats, db.collection.latencyStats, connPoolStats and replSetGetStatus (examples are after this list).
  • Consider saving the FTDC files from the diagnostic.data directory if you have problems. There are some tools for accessing the data from an FTDC file.
  • Use explain to understand query plans and execution (an example is after this list).
  • Use mongostat and mongotop to see performance at a high level before drilling down (examples are after this list).
  • MongoDB has additional tools and services for monitoring, but I don't have much experience with them yet.
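Here is a minimal sketch of the profiling and log-level settings mentioned above, run from the mongo shell. The slowms value of 100 and the "command" component are assumptions, not recommendations from this post:

  // profiler off (level=0), but log any operation slower than 100ms (assumed value)
  db.setProfilingLevel(0, 100)
  // confirm the settings; returns something like { "was" : 0, "slowms" : 100, ... }
  db.getProfilingStatus()
  // only if you need debug messages: raise verbosity for one component ("command" is just an example)
  db.setLogLevel(1, "command")
  // list the current verbosity per component
  db.getLogComponents()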
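A sketch of mtools usage, assuming the diagnostic log is at /var/log/mongodb/mongod.log:

  # summarize slow operations by query pattern
  mloginfo --queries /var/log/mongodb/mongod.log
  # plot operation durations over time
  mplotqueries /var/log/mongodb/mongod.log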
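A sketch of the stats commands, using the linkdb0.linktable collection that shows up in the log examples below:

  use linkdb0
  db.stats()                       // database-level object counts and sizes
  db.serverStatus().wiredTiger     // server-wide counters, here just the storage engine section
  db.linktable.stats()             // per-collection storage stats (wraps the collStats command)
  db.linktable.aggregate([{ $collStats: { latencyStats: { histograms: false } } }])
  db.adminCommand({ connPoolStats: 1 })
  db.adminCommand({ replSetGetStatus: 1 })   // or rs.status()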
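A sketch of explain; the filter on id1 is a made-up example, not a field from this post:

  // "executionStats" also runs the query and reports totalKeysExamined,
  // totalDocsExamined and executionTimeMillis
  db.linktable.find({ id1: 1 }).explain("executionStats")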
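And mongostat/mongotop from a shell, polling every 10 seconds (the interval is arbitrary):

  mongostat 10    # inserts/queries/updates/deletes per interval, queues, cache dirty/used
  mongotop 10     # per-collection time spent in reads and writes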
Diagnostic log examples
These show some of the counters that can be included in diagnostic log entries. Remember that many of the counters are only printed when > 0.
Metrics on writes (bytesWritten, timeWritingMicros)
2020-02-04T18:05:24.787-0800 I COMMAND [conn41] command linkdb0.linktable command: insert { insert: "linktable", ordered: false, writeConcern: { w: 1, j: false }, txnNumber: 3139, $db: "linkdb0", $clusterTime: { clusterTime: Timestamp(1580868324, 23435), signature: { hash: BinData(0, 9E42A21BD71EA0BB8DFEAD52013CFFF325E60BB6), keyId: 6789772850304647170 } }, lsid: { id: UUID("5d900961-93cf-4f7a-9426-940a37a71eca") } } ninserted:1024 keysInserted:1024 numYields:0 reslen:230 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 2048 } }, ReplicationStateTransition: { acquireCount: { w: 2049 } }, Global: { acquireCount: { w: 2048 } }, Database: { acquireCount: { w: 2048 } }, Collection: { acquireCount: { w: 2048 } }, Mutex: { acquireCount: { r: 3072 } } } flowControl:{ acquireCount: 1024 } storage:{ data: { bytesWritten: 324324, timeWritingMicros: 656 } } protocol:op_msg 401ms
Metrics on reads (bytesRead, timeReadingMicros)
2020-02-04T18:08:52.442-0800 I COMMAND [conn33] command linkdb0.linktable command: insert { insert: "linktable", ordered: false, writeConcern: { w: 1, j: false }, txnNumber: 3719, $db: "linkdb0", $clusterTime: { clusterTime: Timestamp(1580868532, 2858), signature: { hash: BinData(0, D289506AFBD0E8C8E8184112608CFED62E9A1B7D), keyId: 6789772850304647170 } }, lsid: { id: UUID("7391b47c-fd95-491b-a640-20c953bcacc0") } } ninserted:1024 keysInserted:1024 numYields:0 reslen:230 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 2048 } }, ReplicationStateTransition: { acquireCount: { w: 2049 } }, Global: { acquireCount: { w: 2048 } }, Database: { acquireCount: { w: 2048 } }, Collection: { acquireCount: { w: 2048 } }, Mutex: { acquireCount: { r: 3072 } } } flowControl:{ acquireCount: 1024 } storage:{ data: { bytesRead: 1369, timeReadingMicros: 10 } } protocol:op_msg 389ms

timeWaitingMicros is less frequent. In this case it was ~2.2 seconds, which seems like a lot, but the operation wrote more than 4MB of data.
2020-02-04T18:22:05.668-0800 I COMMAND [conn46] command linkdb0.linktable command: insert { insert: "linktable", ordered: false, writeConcern: { w: 1, j: false }, txnNumber: 144, $db: "linkdb0", $clusterTime: { clusterTime: Timestamp(1580869323, 2352), signature: { hash: BinData(0, 2CF49EC657832ECB9D7DD0AC5C65E5C3F80CA10B), keyId: 6789781594858061826 } }, lsid: { id: UUID("d88d7d65-f293-4060-a894-17c58123d787") } } ninserted:1024 keysInserted:1024 numYields:0 reslen:230 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 2048 } }, ReplicationStateTransition: { acquireCount: { w: 2049 } }, Global: { acquireCount: { w: 2048 } }, Database: { acquireCount: { w: 2048 } }, Collection: { acquireCount: { w: 2048 } }, Mutex: { acquireCount: { r: 3072 } } } flowControl:{ acquireCount: 1024 } storage:{ data: { bytesRead: 1881, bytesWritten: 4458441, timeReadingMicros: 15, timeWritingMicros: 5071 }, timeWaitingMicros: { cache: 2215267 } } protocol:op_msg 2621ms
timeAcquiringMicros for locks is also less frequent. I have a few examples from insert stress tests. I have yet to learn whether mongod times all lock waits. It isn't easy to do that without an impact on performance.
2020-02-04T17:52:55.133-0800 I COMMAND [conn43] command linkdb0.linktable command: insert { insert: "linktable", ordered: false, writeConcern: { w: 1, j: false }, txnNumber: 1023, $db: "linkdb0", $clusterTime: { clusterTime: Timestamp(1580867574, 46329), signature: { hash: BinData(0, D158AF9490A8ABAEA2969E809D2C27FE57157B41), keyId: 6789772850304647170 } }, lsid: { id: UUID("fabaaa7b-c2c4-4712-ae16-f2c0f58a03a0") } } ninserted:1024 keysInserted:1024 numYields:0 reslen:230 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 2048 } }, ReplicationStateTransition: { acquireCount: { w: 2049 } }, Global: { acquireCount: { w: 2048 } }, Database: { acquireCount: { w: 2048 }, acquireWaitCount: { w: 1 }, timeAcquiringMicros: { w: 14042 } }, Collection: { acquireCount: { w: 2048 } }, Mutex: { acquireCount: { r: 3072 } } } flowControl:{ acquireCount: 1024 } storage:{} protocol:op_msg 376ms
Commit appears in the log via TXN

2020-02-05T17:31:54.792-0800 I TXN [conn81] transaction parameters:{ lsid: { id: UUID("cae8daad-5603-4a41-87db-3fda07e90bee"), uid: BinData(0, 6399AB0DAC62F20BFC466753B10FB58FB7E692BEC952C69B84D997021794D1F8) }, txnNumber: 255, autocommit: false, readConcern: { level: "snapshot", afterClusterTime: Timestamp(1580952714, 1391) } }, readTimestamp:Timestamp(0, 0), ninserted:10 keysInserted:20 terminationCause:committed timeActiveMicros:604 timeInactiveMicros:1016 numYields:0 locks:{ ReplicationStateTransition: { acquireCount: { w: 5 } }, Global: { acquireCount: { r: 3, w: 1 } }, Database: { acquireCount: { r: 2, w: 2 } }, Collection: { acquireCount: { w: 2 } }, Mutex: { acquireCount: { r: 23 } }, oplog: { acquireCount: { r: 2 } } } storage:{} wasPrepared:0, 1ms
