Find and kill slow running queries in MongoDB

In Mongo, or more generally in any data storage engine, queries or updates that take longer than expected to run can be caused by many reasons:
– Slow network
– Wrong schema design (we all have seen the famous all-in-one table…)
– Wrong database design (“let’s store 100To of data in a standalone mongod!”)
– Bad partitioning (Hbase table with 200 regions with 2MB of data)
– Lack of useful indexes
– No statistics
– Incorrect hardware sizing (in 99.999%, the memory…)
– Missing or wrong configuration (aka the default one)
– Expensive map-reduce jobs
– Index builds
– …

Before even thinking of killing your long running queries in MongoDB, you should think more about these reasons. So ask yourself the following question: for which of these reasons your platform may slow your queries? There are planty of usefull tools that you should consider in order to diagnose your platform. OpsManager (with Munin agent) works like a charm and will give you tons of infos and metrics, so if you are going to play with MongoDB in production, don’t think twice and deploy it.

Now, if after some investigations you find that your platform is ready-to-use, the most likely problem is that you are doing needless full or huge partial collection scans on large collections. In other words, the database is computing the query results by iterating through each document in your collection, instead of using one of the available indexes. In MongoDB with WiredTiger (the now default storage engine), the storage engine uses tickets to limit the number of read/write operations processed simultaneously. Each time a query is done, a ticket is consumed. You have a limited number of tickets in WiredTiger (most seen values are about 128 or 200), meaning that if you have some long-running queries, the other queries may be long too and all your tickets are consumed. Now you can not even query your Mongo instances, your cluster is a brick.

We can avoid these situations by killing these long-running queries. MongoDB offers two methods which will be useful here: db.currentOp() and db.killOp().

db.currentOp(<operations>)

This method find informations on in-progress operations for the database instance. In other words, it will return informations on all active operations running on your database. This allows you to easily identify long-running and/or blocking operations.

db.killOp(opid)

This other method terminates an operation as specified by the operation ID. If you pass in true to db.currentOp(), the method returns information on all operations, including operations on idle connections and system operations, so just don’t. Even the documentation says:

WARNING
Terminate running operations with extreme caution. Only use db.killOp() to terminate operations initiated by clients and do not terminate internal database operations.

And this can be understandable: it may be a critical system operation, like chunk migration in a sharded cluster as just one example, replication threads would be another.

Now with these two methods, the work to be done here is to write a script killing all queries running for more than XX seconds on all shards.

NOTE
If you are using this script in a secure cluster, you’ll have to run it with a user owning the inprog role

/usr/share/mongo_slow_query_killer/mongo_slow_query_killer.sh

#!/bin/sh

COMMON=".inprog.forEach(function(query){ print(['[', new Date().toISOString(), '] ', '[', query.ns, '] [', query.op, '] [', query.opid, '] ', 'Killing query which was running:', query.secs_running, 'sec.'].join(' ')); db.killOp(query.opid);} );"

while true
do
    # Source the file at each iteration in order to do some "hot reload"
    . /etc/sysconfig/mongo_slow_query_killer

    for DC in $DCS
    do
        $BIN $AUTH --quiet --eval "db.currentOp({active : true, ns: \"$DC\", op:\"query\", secs_running:{\$gt:$RUNNING}})$COMMON" >> $LOGPATH/$PATTERN-`date +"%Y-%m-%d"`.log
        $BIN $AUTH --quiet --eval "db.currentOp({active : true, ns: \"$DC\", op: { \$in : [ \"command\", \"count\", \"query\" ] }, secs_running:{\$gt:$RUNNING}, \"query.count\" : { \$exists: true}})$COMMON" >> $LOGPATH/$PATTERN-`date +"%Y-%m-%d"`.log
    done

    sleep 5s
done

This script uses the eval param with Javascript command. We only kill find query or count operations running more than $RUNNING. $DCS are the watched collections. All these vars are configured in a sysconfig file:

/etc/sysconfig/mongo_slow_query_killer

# Limit in seconds
RUNNING=30

# Logging
LOGPATH=/var/log/mongo_slow_queries
PATTERN=mongo_slow_queries

# Dbs and or collections
DCS=("
mydb.MyCollection
")

# Auth
AUTH="host:27017/test --authenticationDatabase admin --username userWithInprodRole --password ThePa55w0rD"

# Binary
BIN=/usr/bin/mongo
#BIN=/var/lib/mongodb-mms-automation/`ls -1 /var/lib/mongodb-mms-automation/|grep mongodb-linux-x86_64|sort -r|head -1`/bin/mongo

# User
USR=mongod
GROUP=mongod

Now let’s wrap this script in a service (upstart here, but you can do the same for systemv or systemd):

/etc/init/mongo_slow_query_killer.conf


author "Vincent Devillers" start on runlevel [2345] stop on runlevel [!2345] # Try to respawn with a maximum of 10 times in a 90 second window. respawn respawn limit 10 90 script . /etc/sysconfig/mongo_slow_query_killer export LOGPATH export USR NOW=$(date +"%Y-%m-%d %k:%M:%S") echo "$NOW Running slow queries killer" >> $LOGPATH/$PATTERN.out chown $USR:$GROUP $LOGPATH/$PATTERN* chown +x /usr/share/mongo_slow_query_killer/mongo_slow_query_killer.sh # Need to use "exec su -s" because this version of upstart doesn't support setuid and setgid exec su -s /bin/sh $USR -c "/usr/share/mongo_slow_query_killer/mongo_slow_query_killer.sh >> $LOGPATH/$PATTERN.out 2>&1 < /dev/null" end script

In order to deploy the service:

  • First reload the Upstart configuration
initctl reload-configuration
  • Then check the upstart job list:
$ initctl list | grep mongo
  • and start the job:
$ initctl start mongo_slow_query_killer

Sources:
– https://docs.mongodb.com/manual/reference/method/db.currentOp/
– https://docs.mongodb.com/manual/reference/method/db.killOp/#db.killOp

Credits:
“Don’t! Touch!” by The Cookiemonster is licensed under CC BY-NC 2.0

Related Posts

Leave a comment

About privacy:

This site uses Akismet to reduce spam. Learn how your comment data is processed.