Sunday, May 1, 2016

Cool rules for Nagios

I really love Nagios.
For those who don't know it, Nagios is an open source monitoring solution whose architecture, configuration and customization are really simple once you grasp its main principles.
One of these principles is to define commands (which usually call Nagios plugins) that you can then reuse and parametrize as you wish.

Here are a couple of definitions I find useful.
(You might need to download some Nagios plugins from the online repository.)

Check that a domain name resolves to your public IP address

define command {
  command_name           check_dynamic_dns
  command_line           $USER1$/check_dns -H $ARG1$ -s resolver1.opendns.com  -a $(dig +short myip.opendns.com @resolver1.opendns.com)
}
define service {
  use                    generic-service
  host_name              andromeda
  service_description    DNS resolution andromeda.ddns.net
  check_command          check_dynamic_dns!andromeda.ddns.net
}

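This is mostly useful if you don't manage the DNS records yourself, as is the case here with a Dynamic DNS service. The command asks OpenDNS for the current public IP address (myip.opendns.com) and checks that the domain resolves to that same address. (Yes, my machines are named after planets, moons, stars, constellations, galaxies, asteroids, ... How do you name yours?)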
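
Because the domain is passed as $ARG1$, the same command can be reused for any other name that should point to the same public IP. A minimal sketch, assuming a second, hypothetical name andromeda.example.net for the same machine:

define service {
  use                    generic-service
  host_name              andromeda
  service_description    DNS resolution andromeda.example.net   ; hypothetical second name
  check_command          check_dynamic_dns!andromeda.example.net
}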

Remember that each service can override the default check interval and other monitoring parameters. For instance, I set the interval to 10 minutes for most checks, except for the antivirus scan (check_clamscan), which runs much less often than that.

Also, most checks are retried a 2nd, 3rd and 4th time when the result is not OK, as we could have been "unlucky" with the measurement. This is especially true for ping, temperatures and similar measurements.
An antivirus scan, however, cannot be "unlucky": its result is not going to magically get better on a second run, so there is no point in retrying it. That is why the antivirus service overrides max_check_attempts, check_interval and retry_interval.
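
As an illustration, such a per-service override could look roughly like the sketch below. The values are assumptions and not my actual settings: a daily interval (1440 minutes with the default interval_length of 60 seconds) and a single attempt. It also assumes a check_clamscan command that is defined elsewhere and takes no arguments.

define service {
  use                    generic-service
  host_name              andromeda
  service_description    Antivirus scan
  check_command          check_clamscan   ; assumed to be defined elsewhere
  check_interval         1440             ; hypothetical: once a day
  retry_interval         1440             ; hypothetical: same as check_interval
  max_check_attempts     1                ; no retries, the first result is final
}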

Check HDD temperatures

define command {
  command_name           check_hdd_temps
  command_line           sudo $USER1$/check_lm_sensors --sanitize --drives --high sdaTemp=41,51 --high sdbTemp=41,51 --high sdcTemp=41,51 --low sdaTemp=20,15 --low sdbTemp=20,15 --low sdcTemp=20,15
}
define service {
  use                    generic-service
  host_name              andromeda
  service_description    HDD Temperatures
  check_command          check_hdd_temps
}


This uses the built-in thermometer of the drives. Note that check_lm_sensors can also be used to monitor other temperatures (CPU, motherboard, chassis, etc.).
I chose to gather everything into one command because the disks are all part of the same RAID array, but you could have one service per drive.
Explanation of the command:
For /dev/sda, /dev/sdb and /dev/sdc, the status will be warning when the temperature is 41 degrees Celsius or above, and critical when it reaches 51 degrees. Likewise, when the temperature drops below 20 degrees, there will be a warning, and the situation is considered critical below 15 degrees.
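
If you prefer one service per drive, the command can take the sensor name as an argument. This is an untested sketch: the command name check_hdd_temp is made up for the example, and I have not checked how the plugin reports drives for which no threshold is given.

define command {
  command_name           check_hdd_temp   ; hypothetical per-drive variant
  command_line           sudo $USER1$/check_lm_sensors --sanitize --drives --high $ARG1$=41,51 --low $ARG1$=20,15
}
define service {
  use                    generic-service
  host_name              andromeda
  service_description    HDD Temperature sda
  check_command          check_hdd_temp!sdaTemp
}

You would then add one such service for sdbTemp and one for sdcTemp.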

Besides HDD temperature, I suggest you have a look at check_smart. I'll be happy to document this check if someone is interested.