Nagios offers various ways of monitoring computers and services. The previous chapter talked about passive checks and how they can be used to submit results to Nagios. It also discussed NRDP, which can be used to send check results from other machines to the Nagios server.
This chapter talks about another approach to check the service status. It uses Nagios active checks that run the actual check commands on different hosts. This approach is most useful in cases where resources local to a particular machine are to be checked, such as monitoring disk and memory usage as well as checking if your operating system is up to date. This type of data cannot be checked without running commands on the target computer.
Remote checks are usually performed with the Nagios plugins package, using either SSH or NRPE to run the plugins on the remote machine. This makes monitoring remote systems very similar to monitoring a local computer; the only difference is that the commands are actually run on the remote machine. In this chapter, we will cover the following topics:
Nagios is often used to monitor computer resources such as CPU utilization, memory, and disk space. One way in which this can be done is to connect over SSH and run a Nagios check plugin.
Automating the authentication process requires setting up SSH to authenticate using public keys. This works because the Nagios server has an SSH private key and the target machine is configured to allow users with that particular key to connect without prompting for a password.
Nagios offers a check_by_ssh plugin that takes the hostname and the actual command to run on the remote server. Internally, it runs the SSH client to connect to the server and executes the command, along with its arguments, on the target machine. After the check has been performed, the output and the exit code of the check command are returned to Nagios running on the local server.
This way, any Nagios plugin can be run on the same machine as the Nagios daemon as well as remotely over SSH, without any changes to the plugins. Using the SSH protocol also means that authentication can be automated with keys, so each check is done without any user activity and Nagios can log in to remote machines without using any passwords. The following is an illustration of how such a check is performed:
Once Nagios schedules an active check to be performed, the check_by_ssh plugin runs the ssh command to connect to the remote host's SSH server. It then runs the actual plugin, which has to be present on the remote host, and waits for the result. The SSH client passes the standard output as well as the exit code to the check_by_ssh plugin, which prints the same output and exits with the same code as the plugin.
Even though the scenario might seem a bit complicated, it works quite efficiently and requires very little setup to work properly. It also works with various flavors of Unix systems, as the SSH protocol, the clients, and the shell syntax for commands used by the check_by_ssh plugin are the same on all Unix-based systems.
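The pass-through described above can be pictured with a small shell sketch. It is not the plugin's implementation: run_wrapped and state_name are hypothetical helper names, and the only facts assumed are the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/sh
# Sketch only: mimics how a wrapper such as check_by_ssh hands a plugin's
# output and exit code back to its caller. run_wrapped is a hypothetical name.
run_wrapped() {
    # "$@" is the command to run; with real SSH this could be, for example:
    #   run_wrapped ssh 192.168.2.1 /opt/nagios/plugins/check_procs
    status=0
    output=$("$@") || status=$?
    printf '%s\n' "$output"
    return "$status"
}

# Translate a plugin exit code into the state name used by Nagios.
state_name() {
    case $1 in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo "INVALID" ;;   # outside the plugin convention
    esac
}
```

Because the wrapper neither alters the output nor the exit code, the caller cannot tell whether the plugin ran locally or at the far end of an SSH connection.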
SSH provides multiple ways for a user to authenticate. One of them is password-based authentication, which means that the user specifies a password; the SSH client sends it to the remote machine, and the remote machine checks if the password is correct.
Another form of verifying whether a user or program can access the remote machine is public key-based authentication. It uses asymmetric cryptography (visit http://en.wikipedia.org/wiki/Public-key_cryptography for more detail) to perform the authentication and provides a secure way to authenticate without typing in a password. It requires the user to generate an authentication key pair, which consists of a public and a private key. By default, the filename is ~/.ssh/id_rsa for the private key and ~/.ssh/id_rsa.pub for the public key. The public key is then put on the remote machines, which allows them to authenticate the user. The SSH protocol takes care of the authentication itself; it only requires the client machine to have the private key and the remote machine to accept it, which is done by adding the public key to the remote user's authorized keys file, located in ~/.ssh/authorized_keys in most cases.
Setting up remote checks over SSH requires a few steps. The first step is to create a dedicated user for performing checks on the machine on which the remote checks will be run. We will also need to set up directories for the user. The steps to create directory structure on the remote machine are very similar to the steps performed for the Nagios installation itself.
The first thing that needs to be performed on the Nagios server is the creation of a private and public key pair that will be used to log in to all the remote machines without using passwords. We will need to execute the ssh-keygen command to generate it. For example:
    root@nagiosserver:~# su -s /bin/bash nagios
    nagios@nagiosserver:~$ ssh-keygen
    Generating public/private rsa key pair.
    Enter file in which to save the key (/opt/nagios/.ssh/id_rsa): <enter>
    Created directory '/opt/nagios/.ssh'.
    Enter passphrase (empty for no passphrase): <enter>
    Enter same passphrase again: <enter>
    Your identification has been saved in /opt/nagios/.ssh/id_rsa.
    Your public key has been saved in /opt/nagios/.ssh/id_rsa.pub.
    The key fingerprint is:
    c9:68:47:bd:cd:6e:12:d3:9b:e8:0d:cf:93:bd:33:98 nagios@nagiosserver
    nagios@nagiosserver:/root$
We used the su command to switch users along with the -s flag to force the shell to be /bin/bash; this is because in most setups the nagios user does not have shell access. The <enter> text means that the question was answered with the default reply. The private key is saved as /opt/nagios/.ssh/id_rsa, and the public key has been saved in the /opt/nagios/.ssh/id_rsa.pub file.
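For unattended setups, the same key generation can be done without prompts. The following is a minimal sketch, assuming an OpenSSH ssh-keygen is installed; ensure_key is a hypothetical helper name, and the key is created with an empty passphrase, exactly as in the interactive session above.

```shell
#!/bin/sh
# ensure_key KEYFILE
# Generates an RSA key pair with an empty passphrase unless KEYFILE already
# exists. ensure_key is a hypothetical helper, not part of Nagios or OpenSSH.
ensure_key() {
    keyfile=$1
    [ -f "$keyfile" ] && return 0
    keydir=$(dirname "$keyfile")
    mkdir -p "$keydir" || return 1
    chmod 0700 "$keydir" || return 1
    # -q: quiet, -t: key type, -N: passphrase (empty here), -f: output file
    ssh-keygen -q -t rsa -N "" -f "$keyfile"
}

# Example (run as the nagios user on the Nagios server):
#   ensure_key /opt/nagios/.ssh/id_rsa
```

Skipping generation when the file already exists matters here, because regenerating the key would invalidate the public key already deployed to remote machines.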
At this point our Nagios server is set up.
Next, we need to set up the remote machines that we will monitor. All the following commands should be executed on the remote machine that is to be monitored, unless explicitly mentioned. First, let's create a user and group named nagios:

    root@remotehost:~# groupadd nagios
    root@remotehost:~# useradd -g nagios -d /opt/nagios nagios
We do not need the nagioscmd group, as we will only need the account to log in to the machine. A computer that only performs checks does not have a full Nagios installation with the external command pipe, which is what needs the separate group.
The next thing that needs to be done is compiling the Nagios plugins. You will probably also need to install the prerequisites that are needed for Nagios. Detailed instructions on how to do this can be found in Chapter 2, Installing Nagios 4. For the rest of the section, we will assume that the Nagios plugins are installed in the /opt/nagios/plugins directory, similar to how they were installed on the Nagios server.
It is best to install plugins in the same directory on all the machines they will be running on. In this case, we can use the $USER1$ macro when creating the actual check commands in the main Nagios configuration. The USER1 macro points to the location where Nagios plugins are installed in the default Nagios installations. This is described in more detail in Chapter 2, Installing Nagios 4.
Next, we will need to create the /opt/nagios directory and set its permissions:

    root@remotehost:~# mkdir /opt/nagios
    root@remotehost:~# chown nagios:nagios /opt/nagios
    root@remotehost:~# chmod 0700 /opt/nagios
You can make the /opt/nagios directory permissions less restrictive by setting the mode to 0755. However, it is recommended not to make users' home directories readable for all users.
We will now need to add the public key of the nagios user from the machine that is running the Nagios daemon to the remote machine, as shown in the following command snippet:

    root@remotehost:~# mkdir /opt/nagios/.ssh
    root@remotehost:~# echo 'ssh-rsa ... nagios@nagiosserver' >> /opt/nagios/.ssh/authorized_keys
    root@remotehost:~# chown nagios:nagios /opt/nagios/.ssh /opt/nagios/.ssh/authorized_keys
    root@remotehost:~# chmod 0700 /opt/nagios/.ssh /opt/nagios/.ssh/authorized_keys
You need to replace the text ssh-rsa ... nagios@nagiosserver with the actual contents of the /opt/nagios/.ssh/id_rsa.pub file on the server that is running Nagios.
If your machine is maintained by more than one person, you might replace the nagios@nagiosserver string with a more readable comment such as Nagios on nagiosserver SSH check public key.
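Because the trailing comment may differ between the two files, a quick way to confirm that the right key was installed is to compare only the key type and key data. The sketch below assumes that authorized_keys entries carry no leading options field; key_is_authorized is a hypothetical helper name.

```shell
#!/bin/sh
# key_is_authorized PUBKEY_FILE AUTHORIZED_KEYS_FILE
# Returns 0 if the key from PUBKEY_FILE is listed in AUTHORIZED_KEYS_FILE,
# 1 if it is not, and 2 if a file cannot be read. Only the first two fields
# (key type and base64 data) are compared; the comment is ignored.
# Assumption: authorized_keys lines have no options prefix.
key_is_authorized() {
    pubkey_file=$1
    authorized_file=$2
    [ -r "$pubkey_file" ] || return 2
    [ -r "$authorized_file" ] || return 2
    key=$(awk 'NF >= 2 { print $1 " " $2; exit }' "$pubkey_file")
    [ -n "$key" ] || return 2
    awk -v key="$key" '
        ($1 " " $2) == key { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$authorized_file"
}

# Example, with the public key copied over from the Nagios server:
#   key_is_authorized /tmp/id_rsa.pub /opt/nagios/.ssh/authorized_keys
```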
Make sure that you change the permissions for both the .ssh directory and the authorized_keys file, as many SSH server implementations ignore public key-based authentication if the files can be read or written to by other users on the system.
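A permission problem of this kind can be spotted with a small check. This is a sketch under one assumption: it only looks for group or world write access, which is what OpenSSH's StrictModes setting objects to; other SSH servers may be stricter. The function name not_writable_by_others is hypothetical.

```shell
#!/bin/sh
# not_writable_by_others PATH
# Returns 0 when neither the group nor other users may write to PATH,
# 1 when one of them may, and 2 when PATH does not exist.
not_writable_by_others() {
    [ -e "$1" ] || return 2
    # The first 10 characters of "ls -ld" are the type and permission bits,
    # for example drwx------ ; character 6 is group write, character 9 is
    # write access for others.
    perms=$(ls -ld -- "$1" | cut -c1-10)
    case $perms in
        ?????w*|????????w*) return 1 ;;
        *) return 0 ;;
    esac
}

# Example (on the remote machine):
#   for p in /opt/nagios /opt/nagios/.ssh /opt/nagios/.ssh/authorized_keys; do
#       not_writable_by_others "$p" || echo "fix permissions on $p"
#   done
```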
In order to configure multiple remote machines to be accessible over ssh without a password, you will need to perform all the steps mentioned earlier on each of them. The only exception is the key generation on the computer running the Nagios server, as a single private key will be used to access multiple machines.
Assuming everything was done successfully, we can now move on to testing whether the public key-based authentication actually works. To check that the connection can be established, we need to connect to the remote machine from the computer that has the Nagios daemon running. We will use the ssh client with the verbose flag to make sure that our connection works properly:

    nagios@nagiosserver:~$ ssh -v nagios@192.168.2.1
    OpenSSH_6.6.1, OpenSSL 1.0.1f 6 Jan 2014
    debug1: Reading configuration data /etc/ssh/ssh_config
    debug1: Applying options for *
    debug1: Connecting to 192.168.2.1 [192.168.2.1] port 22.
    debug1: Connection established.
    debug1: identity file /opt/nagios/.ssh/id_rsa type 1
    (...)
    debug1: SSH2_MSG_KEXINIT sent
    debug1: SSH2_MSG_KEXINIT received
    debug1: kex: server->client aes128-cbc hmac-md5 none
    debug1: kex: client->server aes128-cbc hmac-md5 none
    debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
    debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
    debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
    debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
    The authenticity of host '192.168.2.1 (192.168.2.1)' can't be established.
    RSA key fingerprint is cf:72:1e:40:03:a4:e0:9b:6c:84:4e:e1:2d:ea:56:fc.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added '192.168.2.1' (RSA) to the list of known hosts.
    debug1: ssh_rsa_verify: signature correct
    debug1: SSH2_MSG_NEWKEYS sent
    debug1: expecting SSH2_MSG_NEWKEYS
    debug1: SSH2_MSG_NEWKEYS received
    debug1: SSH2_MSG_SERVICE_REQUEST sent
    debug1: SSH2_MSG_SERVICE_ACCEPT received
    debug1: Authentications that can continue: publickey,password
    debug1: Next authentication method: publickey
    debug1: Offering public key: /opt/nagios/.ssh/id_rsa
    debug1: Server accepts key: pkalg ssh-rsa blen 277
    debug1: read PEM private key done: type RSA
    debug1: Authentication succeeded (publickey).
    debug1: channel 0: new [client-session]
    debug1: Entering interactive session.
    debug1: Sending environment.
    debug1: Sending env LANG = en_US.UTF-8
    $
As we were connecting to the remote machine for the first time, the ssh command asked whether to accept the connection so that SSH can continue and store the remote machine's key in the list of known hosts. This is only done once for each host.
Also, note that we need to test the connection from the Nagios account so that the keys that are used for authentication as well as the list of known hosts are the same ones that will be used by the Nagios daemon later.
Assuming that we have the Nagios plugins installed on the remote machine in the /opt/nagios/plugins directory, we can try the check_by_ssh plugin from the computer running Nagios against the remote machine by running the following command:

    nagios@nagiosserver:~$ /opt/nagios/plugins/check_by_ssh -H 192.168.2.1 -C "/opt/nagios/plugins/check_apt"
    APT OK: 0 packages available for upgrade (0 critical updates).
We are now sure that the checking itself works fine, and we can move on to how check_by_ssh can be used and what its syntax is.
As mentioned earlier, Nagios uses a separate check command that connects to a remote machine over SSH and runs the actual check command on it.
The command has multiple features and can be used to query a single service status by using active checks. It can also be used to perform and report multiple checks at once as passive checks.
The syntax of the plugin is as follows:
check_by_ssh -H <host> -C <command> [-fqv] [-1|-2] [-4|-6] [-S [lines]] [-E [lines]] [-t timeout] [-i identity] [-l user] [-n name] [-s servicelist] [-O outputfile] [-p port] [-o ssh-option]
The following table describes all the options accepted by the plugin. Items required are marked in bold:
Option | Description |
---|---|
**-H** | This provides the hostname or IP address of the machine to connect to; this option must be specified |
**-C** | This provides the full path of the command to be executed on the remote host along with any additional arguments; this option must be specified |
-l | This lets you log in as a specific user; if omitted, it defaults to the current user |
-i | This specifies the path to the SSH private key to be used for authentication; if omitted, the SSH client's default identity is used |
-o | This allows passing SSH-specific options on to the SSH client |
-q | This stops SSH from printing warning and information messages |
-w | This specifies the time in seconds after which the connection should be terminated and a warning should be issued to Nagios |
-c | This specifies the time in seconds after which the connection should be terminated and a critical should be issued to Nagios |
-t | This specifies the time in seconds after which the connection should be terminated and checks should be stopped; defaults to 10 seconds |
-p | This specifies the port to connect over SSH; defaults to 22 |
-1 | This will let you use the SSH protocol Version 1 |
-2 | This will let you use the SSH protocol Version 2; this is the default |
-4 | This will let you use IPv4 protocol for SSH connectivity |
-6 | This will let you use IPv6 protocol for SSH connectivity |
-S | This will let you ignore all or the provided number of lines from the standard output |
-E | This will let you ignore all or the provided number of lines from the standard error |
-f | This tells SSH to work in the background just after connecting, instead of using a terminal |
The only required flags are -H to specify the IP address or hostname to connect to and -C to specify the command to run. The remaining parameters are optional. If they are not passed, SSH defaults and the timeout of 10 seconds will be used.
The -S and -E options are used to skip messages that are written by the SSH client or the remote machine, regardless of the commands executed. For example, to properly check machines that print the MOTD even for non-interactive sessions, it has to be skipped using one of these options.
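The effect of skipping a fixed number of leading lines can be pictured with plain shell tools. This is only an illustration of the idea, not how the plugin implements -S; skip_lines is a hypothetical helper name.

```shell
#!/bin/sh
# skip_lines N
# Copies standard input to standard output, dropping the first N lines,
# much like a banner being removed before the plugin output is read.
skip_lines() {
    # tail -n +K starts output at line K, so skipping N lines means K = N + 1
    tail -n +"$(( $1 + 1 ))"
}

# Example: a one-line banner followed by the real plugin output
#   printf 'Welcome to remotehost\nDISK OK\n' | skip_lines 1
```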
When specifying commands, they usually need to be enclosed in single or double quotation marks. This is because the entire command that should be run needs to be passed to check_by_ssh as a single argument. If one or more arguments of that command contain spaces, they have to be quoted again inside the command, for example with single quote characters inside a double-quoted command.
For example, when checking disk usage remotely, we quote the entire command and, to be safe, also quote the path of the partition we are checking, as shown here:
    nagios@nagios1:~$ /opt/nagios/plugins/check_by_ssh -H 192.168.2.1 -C "/opt/nagios/plugins/check_disk -w 15% -c 10% -p '/'"
    DISK OK - free space: / 243 MB (17% inode=72%)
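Why the outer quotes matter can be seen by counting the arguments the shell hands to a program. The sketch below uses a hypothetical count_args function in place of check_by_ssh; nothing is executed remotely.

```shell
#!/bin/sh
# count_args prints how many arguments it received. It stands in for
# check_by_ssh purely to show how the shell splits the command line.
count_args() {
    echo "$#"
}

# Quoted: the remote command stays together as the single value of -C.
count_args -H 192.168.2.1 -C "/opt/nagios/plugins/check_disk -w 15% -c 10% -p '/'"
# prints 4

# Unquoted: the shell splits the remote command into separate arguments,
# so -w, -c, and -p would be read as options of the wrapper itself.
count_args -H 192.168.2.1 -C /opt/nagios/plugins/check_disk -w 15% -c 10% -p /
# prints 10
```

With the quotes in place, the inner single quotes are passed through unchanged and are interpreted by the shell on the remote machine.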
The example above is a typical usage of the check_by_ssh plugin as an active check. It performs a single check and returns the status directly using the standard output and exit code.
If you want to use check_by_ssh to run checks on the same machine as the one on which Nagios is running, you will need to add the SSH key from id_rsa.pub to the authorized_keys file on that machine as well. In order to verify that it works correctly, try logging in to the local machine over SSH.
Now that the plugin works when invoked manually, we need to configure Nagios to make use of it.
Usually, for commands that will be performed both locally and remotely, the approach is to create a duplicate entry for each command with a suffix, for example, _by_ssh.
For example, a command that checks swap usage locally may be defined as follows:

    define command {
        command_name  check_swap
        command_line  $USER1$/check_swap -w $ARG1$ -c $ARG2$
    }

Then, assuming that we will also check the swap usage on remote machines, we need to define the following remote counterpart:

    define command {
        command_name  check_swap_by_ssh
        command_line  $USER1$/check_by_ssh -H $HOSTADDRESS$ -C "$USER1$/check_swap -w $ARG1$ -c $ARG2$"
    }
Usually, services are defined for groups of hosts. For example, a service to check swap space usage may be defined to be performed on all the Linux servers. It is more convenient to always use the check_swap_by_ssh command in this case, both for the local Nagios machine and for all remote machines. The overhead of performing checks over SSH is relatively small and can be ignored in most cases.
However, this requires that the server running Nagios accepts SSH connections, which is not always the case. It is also possible to define two services, one that runs over SSH and one that runs locally, and to exclude localhost from the SSH-based check, for example:

    define service {
        use                  generic-service
        host_name            localhost
        service_description  SWAP
        check_command        check_swap
    }

    define service {
        use                  generic-service
        host_name            !localhost
        hostgroup_name       linux-servers
        service_description  SWAP
        check_command        check_swap_by_ssh
    }
This way, localhost will use the check_swap command and all the remaining machines that are part of the linux-servers host group will use the check_swap_by_ssh check command.
The check_by_ssh plugin can also run multiple plugins at once and report their results to Nagios using the external command pipe. The reason for this approach is that the SSH protocol negotiations introduce a lot of overhead. For hosts with heavy load or for machines with connectivity issues, it is more efficient to run all the checks using a single SSH session instead of performing every check individually.
As the results are reported as passive checks, using this functionality requires that those services allow receiving passive check results over the command pipe.
One of the main issues with doing multiple checks is that it is not trivial to schedule these directly from Nagios. A typical approach to passive checks is to schedule checks from an external application such as cron (http://linux.die.net/man/8/cron).
An alternative approach is to create a dummy service in Nagios that launches the passive checks in the background. The result of this service then reflects whether launching the tests was successful or not. An upside of this approach is that the checks will be performed even if the cron daemon is currently disabled, as Nagios itself takes care of scheduling them.
When using check_by_ssh to report multiple results as passive checks, the following options need to be specified:

Option | Description |
---|---|
-n | This provides the short name of the host that the tests refer to; this is the name of the host that will be used when sending the results over the external command pipe |
-s | These are the names of the services that the tests refer to, separated by a colon; these are the names of services that will be used when sending results over the external pipe |
-O | This is the path to the external command pipe to which the results of all the checks should be sent |
The options above are specific to performing multiple checks; they are not the only options that the plugin accepts in this mode. The remaining options described earlier must also be specified, especially the -H and -C options.
The -C option needs to be specified multiple times, once for each check. The number of -C options must match the number of entries in the -s parameter so that each result can be mapped to a service name.
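Since a mismatch between the two counts is easy to introduce when a wrapper script is edited, a wrapper could verify them before calling the plugin. The following is a sketch with hypothetical helper names (count_names, counts_match); it is not something check_by_ssh provides.

```shell
#!/bin/sh
# count_names LIST
# Prints the number of colon-separated names in LIST.
count_names() {
    printf '%s\n' "$1" | awk -F: '{ print NF }'
}

# counts_match SERVICELIST COMMAND...
# Succeeds when there is exactly one command per service name.
counts_match() {
    services=$1
    shift
    [ "$(count_names "$services")" -eq "$#" ]
}

# Example:
#   counts_match "DISK /:DISK /usr" \
#       "/opt/nagios/plugins/check_disk -w 15% -c 10% -p /" \
#       "/opt/nagios/plugins/check_disk -w 15% -c 10% -p /usr" \
#     || echo "number of commands and service names differ"
```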
The following example runs a disk space check for three partitions:

    /opt/nagios/plugins/check_by_ssh -H 192.168.2.1 -O /tmp/out1 -n ubuntu1 \
      -s "DISK /:DISK /usr:DISK /opt" \
      -C "/opt/nagios/plugins/check_disk -w 15% -c 10% -p /" \
      -C "/opt/nagios/plugins/check_disk -w 15% -c 10% -p /usr" \
      -C "/opt/nagios/plugins/check_disk -w 15% -c 10% -p /opt"
This command will put the output into /tmp/out1, similar to the following example:
    [1462485600] PROCESS_SERVICE_CHECK_RESULT;ubuntu1;DISK /;2;DISK CRITICAL...
    [1462485600] PROCESS_SERVICE_CHECK_RESULT;ubuntu1;DISK /usr;0;DISK OK ...
    [1462485600] PROCESS_SERVICE_CHECK_RESULT;ubuntu1;DISK /opt;0;DISK OK ...
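Each of these lines follows the Nagios external command syntax for passive service results: a timestamp in brackets, then PROCESS_SERVICE_CHECK_RESULT with the host name, service description, return code, and plugin output separated by semicolons. A minimal sketch that produces such a line is shown below; format_result is a hypothetical helper name.

```shell
#!/bin/sh
# format_result TIMESTAMP HOST SERVICE RETURN_CODE OUTPUT
# Prints one passive service check result in the format accepted by the
# Nagios external command pipe.
format_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Example: submit a result by hand (the pipe path depends on the installation)
#   format_result "$(date +%s)" ubuntu1 "DISK /usr" 0 "DISK OK" > /path/to/nagios.cmd
```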
As mentioned earlier in this section, it is very common to write a script that is run as an active check and will perform passive checks.
The following is a sample script that runs several tests and reports their results back to Nagios:
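    #!/bin/sh

    # Arguments are passed in from the Nagios command definition
    COMMANDFILE=$1
    HOSTNAME=$2
    HOSTADDRESS=$3
    PLUGINPATH=$4

    # Run all checks in a single SSH session; -O writes the results
    # to the external command pipe as passive check results
    "$PLUGINPATH/check_by_ssh" -H "$HOSTADDRESS" -t 30 \
      -O "$COMMANDFILE" -n "$HOSTNAME" \
      -s "SWAP:Root Partition:Processes:System Load" \
      -C "$PLUGINPATH/check_swap -w 20% -c 10%" \
      -C "$PLUGINPATH/check_disk -w 20% -c 10% -p /" \
      -C "$PLUGINPATH/check_procs -w 100 -c 200" \
      -C "$PLUGINPATH/check_load -w 5,3,2 -c 10,8,7" || {
        # Braces instead of a subshell, so that exit ends the script itself
        echo "BYSSH CRITICAL problem while running SSH"
        exit 2
    }

    echo "BYSSH OK checks launched"
    exit 0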
For the remaining part of the section, we'll assume that the script is in the /opt/nagios/plugins directory and is called check_linux_services_by_ssh.
The script launches several checks over a single SSH session. If launching them fails, the script returns a critical result; otherwise, it returns an OK status and the individual results are passed to Nagios as passive check results. We will also need to configure Nagios: both the services that receive their results as passive checks and the service that actually schedules the checks.
All the services that are checked through this single check_by_ssh call have very similar definitions: they accept passive checks and do not have any active checks scheduled.
The following is a sample definition for the SWAP service:
    define service {
        use                     generic-service
        host_name               !localhost
        hostgroup_name          linux-servers
        service_description     SWAP
        active_checks_enabled   0
        passive_checks_enabled  1
    }
All other services will also need to have a very similar definition.
We might also define a template for such services and only create services that use it. This will make the configuration more readable.
We'll need to define a command definition that will launch the passive check script written earlier:
    define command {
        command_name  check_linux_services_by_ssh
        command_line  $USER1$/check_linux_services_by_ssh "$COMMANDFILE$" "$HOSTNAME$" "$HOSTADDRESS$" "$USER1$"
    }
All the parameters that are used by the script are passed directly from the Nagios configuration. This makes reconfiguring paths to Nagios plugins or command pipe easier.
The next step is to define an actual service that will run these checks:
    define service {
        use                     generic-service
        host_name               !localhost
        hostgroup_name          linux-servers
        service_description     Check Services By SSH
        active_checks_enabled   1
        passive_checks_enabled  0
        check_command           check_linux_services_by_ssh
        check_interval          30
        check_period            24x7
        max_check_attempts      1
        notification_interval   30
        notification_period     24x7
        notification_options    c,u,r
        contact_groups          linux-admins
    }
This will cause the checks to be scheduled every 30 minutes. It will also notify the Linux administrators if any problem occurs with the scheduling of the checks.
An alternative approach is to use the cron daemon to schedule the launch of the previous script. In such a case, the Check Services By SSH service is not needed. Scheduling of the checks is then not done in Nagios, but we will still need to have the services for which the status will be reported.
We also need to make sure that cron is running in order to have up-to-date results for the checks. Such verification can be done by monitoring the daemon using Nagios and the check_procs plugin.
The first thing that needs to be done is to adapt the script to not print out the results in case everything worked fine and hardcode paths to the Nagios files:
    #!/bin/sh

    # Paths are hardcoded because cron, not Nagios, launches the script
    COMMANDFILE=/var/nagios/rw/nagios.cmd
    PLUGINPATH=/opt/nagios/plugins
    HOSTNAME=$1
    HOSTADDRESS=$2

    # -O writes the results to the external command pipe
    "$PLUGINPATH/check_by_ssh" -H "$HOSTADDRESS" -t 30 \
      -O "$COMMANDFILE" -n "$HOSTNAME" \
      -s "SWAP:Root Partition:Processes:System Load" \
      -C "$PLUGINPATH/check_swap -w 20% -c 10%" \
      -C "$PLUGINPATH/check_disk -w 20% -c 10% -p /" \
      -C "$PLUGINPATH/check_procs -w 100 -c 200" \
      -C "$PLUGINPATH/check_load -w 5,3,2 -c 10,8,7" || {
        # Braces instead of a subshell, so that exit ends the script itself
        echo "BYSSH CRITICAL problem while running SSH"
        exit 2
    }

    exit 0
The main changes are that the COMMANDFILE and PLUGINPATH variables are hardcoded, as they are not passed from Nagios anymore. Also, the script no longer prints anything to standard output when everything works. This is because cron sends an e-mail with the script output if any is written or if the exit code is not 0.
The next step is to add an entry to the crontab of the nagios user. This can be done by running the crontab -e command as the nagios user or the crontab -u nagios -e command as the administrator.
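Assuming that the check should be performed every 30 minutes, the crontab entry should be as follows, with the host name and address of the machine to check appended as the two arguments that the script expects: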
*/30 * * * * /opt/nagios/plugins/check_linux_services_by_ssh
For more details on how an entry in crontab should look, please consult its manual page available at http://linux.die.net/man/5/crontab.
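When several machines are checked from cron, one option is a small wrapper that reads the hosts from a file and calls the script once per host. This is a sketch under stated assumptions: run_for_hosts and the hosts file format (one "name address" pair per line) are hypothetical and not part of Nagios.

```shell
#!/bin/sh
# run_for_hosts HOSTS_FILE CHECK_SCRIPT
# Calls CHECK_SCRIPT with a host name and address for every line of
# HOSTS_FILE. Blank lines and lines starting with # are ignored.
# Succeeds only if every invocation succeeded.
run_for_hosts() {
    hosts_file=$1
    check_script=$2
    failures=0
    while read -r name address; do
        case $name in
            ''|'#'*) continue ;;
        esac
        # stdin is redirected so that ssh cannot consume the rest of the file
        "$check_script" "$name" "$address" </dev/null || failures=$((failures + 1))
    done < "$hosts_file"
    [ "$failures" -eq 0 ]
}

# Example crontab command (the hosts file path is an assumption):
#   run_for_hosts /opt/nagios/etc/ssh_hosts /opt/nagios/plugins/check_linux_services_by_ssh
```

Redirecting standard input away from the called script is a deliberate choice: a command run through ssh inside a read loop would otherwise swallow the remaining lines of the hosts file.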
If you have followed the steps from the previous sections carefully, then everything should be working properly. However, in some cases, performing checks over SSH might not be working properly and troubleshooting needs to be done to understand the root cause of the problem.
The first thing that you should start with is using the check_ssh plugin to make sure that SSH is accepting connections on the host that we are checking. For example, we can run the following command:

    root@ubuntu1:~# /opt/nagios/plugins/check_ssh -H 192.168.2.51
    SSH OK - OpenSSH_4.7p1 Debian-8ubuntu1.2 (protocol 2.0)

Here, 192.168.2.51 is the IP address of the remote machine we want to monitor. If no SSH server is set up on the remote host, the plugin will return a Connection refused status, and if it failed to connect, the result will state No route to host. In these cases, you need to make sure that the SSH server is working and that no routers or firewalls reject communication over SSH, which uses TCP port 22.
Assuming that the SSH server is accepting connections, the next thing that can be checked is whether the SSH key-based authentication works correctly. To do this, switch to the user the Nagios process is running as. Next, try to connect to the remote machine. The following are sample commands to perform this check:

    root@ubuntu1:~# su -s /bin/bash nagios
    $ ssh -v 192.168.2.51
This way, you check the connectivity as the same user that Nagios uses to run checks. You can also analyze the debug messages that the client prints, as described earlier in this chapter.
If the SSH client prompts you for a password, then your keys are not set up properly. It is a common mistake to set up keys on the root account instead of the nagios account. If this is the case, create a new set of keys as the correct user and verify that these keys work.
Assuming this step worked fine, the next thing to check is whether invoking an actual check command over SSH produces correct results. For example:

    root@ubuntu1:~# su -s /bin/bash nagios
    $ ssh 192.168.2.51 /opt/nagios/plugins/check_procs
    PROCS OK: 51 processes

This way, you verify that the plugin can be run on the remote machine by the same user that Nagios uses to run checks.
The last check is to make sure that the check_by_ssh plugin also returns correct information. For example:

    root@ubuntu1:~# su -s /bin/bash nagios
    $ /opt/nagios/plugins/check_by_ssh -H 192.168.2.1 -C /opt/nagios/plugins/check_procs
    PROCS OK: 52 processes

If the last step also worked correctly, it means that all check commands are working correctly.
If you still have issues with the running of the checks, then the next thing you should investigate is if Nagios has been properly configured and whether all commands, hosts, and services are set up in the correct way.
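The troubleshooting sequence from this section can be run in one go with a small script that stops at the first failing step. The following is a sketch: run_steps and the step_* functions are hypothetical names, the address and plugin path are the ones used in the examples above, and BatchMode=yes is an OpenSSH client option that makes ssh fail instead of prompting for a password.

```shell
#!/bin/sh
# run_steps STEP...
# Runs each step in order and reports the first one that fails.
run_steps() {
    for step in "$@"; do
        if ! "$step" >/dev/null 2>&1; then
            echo "FAILED: $step"
            return 1
        fi
    done
    echo "all steps passed"
}

# Values taken from the examples in this section; adjust as needed.
TARGET=${TARGET:-192.168.2.51}
PLUGINS=${PLUGINS:-/opt/nagios/plugins}

step_ssh_port()      { "$PLUGINS/check_ssh" -H "$TARGET"; }
step_key_login()     { ssh -o BatchMode=yes "$TARGET" true; }
step_remote_plugin() { ssh -o BatchMode=yes "$TARGET" "$PLUGINS/check_procs"; }
step_check_by_ssh()  { "$PLUGINS/check_by_ssh" -H "$TARGET" -C "$PLUGINS/check_procs"; }

# Example (run as the nagios user):
#   run_steps step_ssh_port step_key_login step_remote_plugin step_check_by_ssh
```

The order mirrors the text: each step depends on the previous one working, so the name of the first failing step points at the layer that needs attention.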