Saturday, December 13, 2014

Getting instant notifications on your smartphone or tablet from Nagios


If you are a network administrator or manage a production system of some sort, you probably want to be notified as soon as possible when something goes wrong.

Nagios is a great tool for watching your hosts and services. Even though it's kind of 1990-ish, it still does a very good job and remains one of the most widely used tools for this task.

It can natively send e-mails. That's great, but I don't know about you: I get more than 40 e-mails a day, and even with SaneBox and other tricks it's easy to miss one. It can literally take days until I notice an e-mail that was more important than the others.

Pushover provides instant notifications on handheld devices such as Android, iPhone, and iPad using "push" technology (instead of polling).
It has many advantages:

  • It uses a REST API, so you can call it from pretty much any script or program.
  • They provide clients for smartphones and tablets, and they even support OS X Growl notifications.
  • It's inexpensive at only $5 USD.
  • You don't need to run or install any software on your servers.
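
To give an idea of what the REST API looks like, here is a minimal sketch of a shell function that posts a message with curl. The endpoint and the user, token, title and message fields are the ones Pushover documents; the function name pushover_send and the DRY_RUN switch are my own inventions for illustration and are not part of Pushover or of the script used in the tutorial.

```shell
#!/bin/sh
# Sketch only: send one Pushover message with curl.
# pushover_send and DRY_RUN are hypothetical names chosen for this example.
PUSHOVER_URL="${PUSHOVER_URL:-https://api.pushover.net/1/messages.json}"

pushover_send() {
    # $1 = personal (user) key, $2 = application key, $3 = title, $4 = message
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print what would be sent instead of contacting the API.
        printf 'POST %s user=%s token=%s title=%s message=%s\n' \
            "$PUSHOVER_URL" "$1" "$2" "$3" "$4"
        return 0
    fi
    # --form-string sends each field literally, so a message starting
    # with "@" or "<" is not interpreted as a file name by curl.
    curl -s \
        --form-string "user=$1" \
        --form-string "token=$2" \
        --form-string "title=$3" \
        --form-string "message=$4" \
        "$PUSHOVER_URL"
}
```

With real keys, something like pushover_send PERSONAL_KEY APP_KEY "Nagios" "test message" should make your device buzz. Setting DRY_RUN=1 first lets you check the fields without sending anything.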

Here is a tutorial on how to use it with Nagios, based on Debian 7.7 stable, as of December 2014:

  1. Create an account on Pushover. Write down your personal key.
  2. Create an application. Write down the application key.
    I named mine "Nagios" but you can choose anything you like.
  3. Set up your account on one of your devices.
  4. As root, download Jedda Wignall's script into your plugins directory:
    wget -O /usr/lib/nagios/plugins/notify_by_pushover.sh https://raw.githubusercontent.com/jedda/OSX-Monitoring-Tools/master/notify_by_pushover.sh
  5. Make it executable:
    chmod +x /usr/lib/nagios/plugins/notify_by_pushover.sh
  6. Create a Nagios command to run it. I like to keep mine in /etc/nagios3/conf.d/custom_handlers.cfg, but you can pick any file name ending in .cfg as long as the file is in that directory. Copy the following content into it:

    define command {
        command_name    handler_pushover
        command_line    $USER1$/notify_by_pushover.sh -u $ARG1$ -a $ARG2$ -t "$ARG3$" -m "$ARG4$"
    }


    $USER1$ stands for /usr/lib/nagios/plugins, and each $ARGn$ is replaced with the arguments that follow the command name, separated by "!", in the next step.
  7. If you want every host to report a failure (and supposing you used the template everywhere), modify /etc/nagios3/conf.d/generic-host_nagios2.cfg, and add the following line anywhere inside the host definition:

    event_handler                   handler_pushover!PERSONAL_KEY!APP_KEY!Nagios!$HOSTNAME$ : $HOSTSTATE$
    Replace PERSONAL_KEY and APP_KEY with the keys obtained in steps 1 and 2.

    "Nagios" here is the title that will appear in the notification. It doesn't need to match the application name, and it can include spaces.
  8. The last parameter of the command (everything after the last "!") is the message. Here I simply chose to show the hostname and the state. The variables enclosed in dollar signs can be found in the Nagios manual.
  9. If you want every service to report a failure (and supposing you used the template everywhere), modify /etc/nagios3/conf.d/generic-service_nagios2.cfg, and add the following line anywhere inside the service definition:

    event_handler                   handler_pushover!PERSONAL_KEY!APP_KEY!Nagios!$HOSTNAME$ / $SERVICEDESC$ ($SERVICESTATE$) : $SERVICEOUTPUT$

    This will print the status of the service as well as the line shown in the web interface.
    You can modify "Nagios" (the title of the notification) and everything after the last "!" using macros from the Nagios manual.
  10. Reload Nagios:
    service nagios3 reload
  11. Provoke a "warning" or "critical" message in Nagios. You should receive a notification as soon as the problem is discovered by Nagios.
Life is great!
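
If the "!" syntax from steps 7 and 9 looks cryptic, the following sketch may help. It is a rough imitation in plain shell of how the event_handler value is cut into a command name and $ARG1$ to $ARG4$. It is not Nagios's own parser, and the function name split_handler is made up for this example.

```shell
#!/bin/sh
# Sketch only: imitate how "command!arg1!arg2..." maps to $ARGn$.
# split_handler is a hypothetical helper, not something Nagios ships.
split_handler() {
    # $1 = the value written after "event_handler"
    old_ifs=$IFS
    IFS='!'          # split on "!" only, so spaces stay inside an argument
    set -f           # do not expand wildcards while splitting
    set -- $1
    set +f
    IFS=$old_ifs
    printf 'command_name=%s\n' "$1"
    shift
    n=1
    for arg in "$@"; do
        printf 'ARG%d=%s\n' "$n" "$arg"
        n=$((n + 1))
    done
}

split_handler 'handler_pushover!PERSONAL_KEY!APP_KEY!Nagios!$HOSTNAME$ : $HOSTSTATE$'
```

The last line prints the command name followed by four arguments, the fourth being the whole message including its spaces. That is why the command definition in step 6 wraps "$ARG3$" and "$ARG4$" in double quotes.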

Now, there may be better ways to do this. You should probably use the notification mechanism from Nagios instead: an event handler is supposed to "handle" the problem, not report it. As a result, I had to put retry_interval 120 in some of my service definitions to avoid getting the same message again and again.
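
For the curious, here is an untested sketch of what the notification-based variant could look like. It is a small shell function that writes a config fragment with two notification commands and a contact that uses them. The file name, the command names, the contact name "pushover" and the function name are all made up, and I am assuming the "admins" contact group and the "24x7" time period from the default Debian configuration exist on your system. PERSONAL_KEY and APP_KEY are placeholders, as in step 7.

```shell
#!/bin/sh
# Sketch only: write a Nagios fragment that reports problems through
# notifications rather than event handlers. All object names are hypothetical.
write_pushover_notifications() {
    # $1 = target directory, e.g. /etc/nagios3/conf.d
    # The quoted 'EOF' keeps the $MACRO$ names literal in the output file.
    cat > "$1/pushover_notifications.cfg" <<'EOF'
define command {
    command_name    notify-host-by-pushover
    command_line    $USER1$/notify_by_pushover.sh -u PERSONAL_KEY -a APP_KEY -t "Nagios" -m "$NOTIFICATIONTYPE$: $HOSTNAME$ is $HOSTSTATE$"
}

define command {
    command_name    notify-service-by-pushover
    command_line    $USER1$/notify_by_pushover.sh -u PERSONAL_KEY -a APP_KEY -t "Nagios" -m "$NOTIFICATIONTYPE$: $HOSTNAME$ / $SERVICEDESC$ ($SERVICESTATE$) : $SERVICEOUTPUT$"
}

define contact {
    contact_name                    pushover
    alias                           Pushover notifications
    contactgroups                   admins
    host_notification_period        24x7
    service_notification_period     24x7
    host_notification_options       d,r
    service_notification_options    w,u,c,r
    host_notification_commands      notify-host-by-pushover
    service_notification_commands   notify-service-by-pushover
}
EOF
}
```

You would call write_pushover_notifications /etc/nagios3/conf.d as root, remove the event_handler lines from the templates, check the result with nagios3 -v /etc/nagios3/nagios.cfg and reload Nagios. How often a problem is repeated is then governed by the notification_interval of each host or service.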

Tell me what you think and how it worked out for you!
