Opsview Slaves
In large organisations with many devices to monitor, or in a distributed datacenter that spans continents, countries, counties or cities, a single Opsview server can become overloaded processing all the necessary device checks and fall behind, leading to delayed checks and notifications.
Also, networks that make use of secured zones or firewall rules to segregate devices and services should not have that security compromised by having to open significant numbers of ports to enable monitoring.
Adding Opsview slaves to the monitoring hierarchy can help to spread the load and reduce latency, and can simplify the network administration needed to ensure all devices are adequately monitored.
Note: The data returned from a slave is limited. Opsview uses NSCA as the transport mechanism, and NSCA passes on only the first 511 bytes or the first line (whichever comes first) of the output of a plugin run on a slave.
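To see what this limit means for a particular plugin, the truncation can be approximated locally. The following is a minimal sketch only: nsca_preview is a hypothetical helper name, not part of Opsview or NSCA, and it simply keeps the first line of its input capped at 511 bytes.

```shell
# Hypothetical helper: approximate what reaches the master from a slave check.
# Keeps only the first line of the plugin output, capped at 511 bytes.
nsca_preview() {
    head -n 1 | head -c 511
}

# Example with a two-line plugin result; only the first line survives.
printf 'DISK OK - 40%% used\n/var: 40%% /home: 12%%\n' | nsca_preview
```

Piping the real output of a plugin through the helper gives a rough idea of what the master would display for that check.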
Note: Each slave instance is a separate instance of Nagios® Core. If you move a host from one slave to another (or from the master to a slave), then the downtimes and acknowledgements are not synchronised, since each Nagios Core instance will not know the states from different instances. This can be overcome by setting notifications from the master instance only.
Note: It is not possible to have notification suppression based on parent/child relationships for hosts outside of a slave, because slaves only know about their own hosts. If notifications are sent from the slave, there may be more notifications than if they were sent from the master.
Packages
NOTE: The Debian/Ubuntu repositories have an “opsview-slave” package that ensures all dependencies are installed correctly (note: the package does not install any files - it is just used for dependency checking).
For all other platforms you should not install Opsview packages on the slave server. Opsview code will be pushed from the master server to each slave during initial configuration and upgrades.
Restrictions
Slave servers currently must have the same architecture and OS build as the master server (including prerequisite software).
Slave servers must have their time in sync with the master.
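A quick way to check the time restriction is to compare epoch timestamps from both machines. This is a rough sketch under stated assumptions: clock_in_sync is a hypothetical helper, and the default tolerance of 5 seconds is an arbitrary choice, not an Opsview setting.

```shell
# Hypothetical helper: succeed when two epoch timestamps differ by at most
# max_drift seconds. The 5 second default is an assumption, not an Opsview value.
clock_in_sync() {
    master_epoch=$1
    slave_epoch=$2
    max_drift=${3:-5}
    drift=$((master_epoch - slave_epoch))
    if [ "$drift" -lt 0 ]; then
        drift=$((0 - drift))
    fi
    [ "$drift" -le "$max_drift" ]
}

# Possible usage from the master, once SSH access to the slave works:
#   clock_in_sync "$(date +%s)" "$(ssh {slave_hostname} date +%s)" 5 || echo "clocks differ"
```

The SSH round trip adds a little latency to the slave reading, so a tolerance of a few seconds is more realistic than an exact match.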
Pre-install Tasks
These steps are to be performed on the new slave server, unless otherwise stated
1. Create the nagios and nagcmd groups
groupadd nagios
groupadd nagcmd
2. Create the nagios user and set its password to a known and secure value
useradd -g nagios -G nagcmd -d /var/log/nagios -m nagios
passwd nagios
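When these steps are re-run on a partly prepared server, groupadd fails for groups that already exist. As a small sketch, the hypothetical helper below (missing_groups is not an Opsview tool) lists only the groups that still need creating. It assumes getent is available, as it is on most Linux systems.

```shell
# Hypothetical helper: print each named group that does not exist yet.
# Assumes getent is available on the system.
missing_groups() {
    for group in "$@"; do
        getent group "$group" >/dev/null || echo "$group"
    done
}

# Lists whichever of the two groups still need groupadd (possibly none).
missing_groups nagios nagcmd
```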
3. Ensure nagios user has root access for specific commands via sudo (note also the nagios user should have sudo on its PATH to use this correctly)
visudo
# add the following line
nagios ALL = NOPASSWD: /usr/local/nagios/bin/install_slave
4. On the master server, copy the nagios SSH public key from the master to the slave server
su - nagios
ssh-keygen -t dsa # Creates your SSH public/private keys if they do not currently exist
ssh-copy-id -i .ssh/id_dsa.pub {slave_hostname}
The .ssh directory should be mode 0700 and the id_dsa file should be 0600.
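The modes can be verified rather than assumed. The sketch below uses a hypothetical helper, has_mode, and assumes GNU stat (the -c option); BSD stat takes different flags.

```shell
# Hypothetical helper: succeed when a path has exactly the expected octal mode.
# Assumes GNU stat (stat -c); BSD stat uses different flags.
has_mode() {
    [ "$(stat -c '%a' "$1")" = "$2" ]
}

# Possible usage as the nagios user on the master:
#   has_mode ~/.ssh 700 && has_mode ~/.ssh/id_dsa 600 || echo "fix permissions"
```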
5. Set up the profile for the Nagios Core user on the slave server:
nagios@slave$ echo "test -f /usr/local/nagios/bin/profile && . /usr/local/nagios/bin/profile" >> ~nagios/.profile
nagios@slave$ chown nagios:nagios ~nagios/.profile
6. Copy the check_reqs and profile scripts from the master onto the slave as the nagios user. This should work without prompting for authentication:
nagios@master$ scp /usr/local/nagios/installer/check_reqs /usr/local/nagios/bin/profile {slave}:
On the slave as the nagios user, source the profile then run check_reqs:
nagios@slave$ . ./profile
nagios@slave$ ./check_reqs slave
Fix any dependency issues listed.
7. Create the temporary drop directory on the slave
On the slave, create the temporary directory to put the transfer files:
su - root
mkdir -p /usr/local/nagios/tmp
chown nagios:nagios /usr/local/nagios /usr/local/nagios/tmp
chmod 775 /usr/local/nagios /usr/local/nagios/tmp
8. Check SSH TCP Port Forwarding on the slave
In order to communicate with the master server, port forwarding must be enabled in /etc/ssh/sshd_config on the slave server. Ensure that the following option is set to yes (default is yes):
AllowTcpForwarding yes
Restart SSH server if this is changed.
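The setting can be read from the file without opening an editor. This is a simplified sketch: tcp_forwarding_setting is a hypothetical helper that reports the first uncommented AllowTcpForwarding value and falls back to the default of yes. It ignores Match blocks and included files, so treat the result as a hint rather than the effective configuration.

```shell
# Hypothetical helper: print the AllowTcpForwarding value from an sshd_config
# file, or "yes" (the default) when the option is not set.
# Simplification: Match blocks and Include directives are not interpreted.
tcp_forwarding_setting() {
    awk 'tolower($1) == "allowtcpforwarding" { print tolower($2); found = 1; exit }
         END { if (!found) print "yes" }' "$1"
}

# Possible usage on the slave:
#   tcp_forwarding_setting /etc/ssh/sshd_config
```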
Setup of the slave
1. Within the master web interface, ensure the slave host is set up on the master server
Configuration->Hosts->Create New Host
2. Within the master web interface, add the slave host as a monitoring server
Advanced->Monitoring Servers->Create New Monitoring Server
Note: The host used for the slaves must have at least 1 service associated with it, otherwise a reload will fail.
3. From the master server, send the necessary configuration files to the slave host
su - nagios
/usr/local/nagios/bin/send2slaves -t [opsview slave name] # Test connection
/usr/local/nagios/bin/send2slaves [opsview slave name]
The slave name is optional; use it to target a single slave when multiple slaves have been defined. For a new slave, the send will produce an error:
Errors requiring manual intervention: 1
and will list the commands to run in the next step.
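If send2slaves is run from a script, its saved output can be scanned for this message. The sketch below assumes only the message format quoted above; manual_intervention_count is a hypothetical helper and nothing else about the send2slaves output is assumed.

```shell
# Hypothetical helper: read saved send2slaves output on stdin and print the
# number reported after "Errors requiring manual intervention:", or 0 if the
# message does not appear.
manual_intervention_count() {
    count=$(sed -n 's/.*Errors requiring manual intervention: \([0-9][0-9]*\).*/\1/p' | head -n 1)
    echo "${count:-0}"
}

# Possible usage on the master as the nagios user:
#   /usr/local/nagios/bin/send2slaves 2>&1 | manual_intervention_count
```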
4. On the slave server run the setup program
su - root
cd /usr/local/nagios/tmp && ./install_slave
5. Within the master web interface, reload the Opsview configuration on the master server
Server->Status And Reload->Reload Configuration
Monitoring Slaves
Some services will be automatically created on the master to monitor each slave. These are called Slave-node: {hostname}. Each of these services runs a plugin called check_opsview_slave_node, which checks:
- if all slaves are contactable
- if their time is synchronised
- if NSCA has errored
- if nagios is running correctly
You should make sure that you will get alerts from this service as it will be the first warning of problems with a slave.
Note: In Opsview 2.14, this slave checking functionality was provided by check_opsview_slave assigned to the slaves. This is now redundant, so you can remove this service.
Slave failures
When a slave fails, this is the sequence of actions:
- The Slave-node: {hostname} service will go into a critical state - make sure you get notifications for this service
- After 30 minutes, all hosts and services on the failed slave will go into an UNKNOWN stale state with the text of UNKNOWN: Service results are stale. We think this is reflective of the situation as the service states are not up to date, and there is a single failure that needs resolving
Note: The hosts monitored by that slave will not change state. It is not possible to set a freshness value on the host state, as hosts are only checked on demand and thus will not have regular results sent to the Opsview master.
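The 30 minute staleness rule is plain timestamp arithmetic. The sketch below only illustrates that arithmetic; it is not how Opsview or Nagios Core implements freshness checking, and results_stale and STALE_AFTER are hypothetical names.

```shell
# Illustration only: 30 minutes expressed in seconds, per the text above.
STALE_AFTER=1800

# Hypothetical helper: succeed when the last result is at least
# STALE_AFTER seconds older than the current time.
results_stale() {
    last_result_epoch=$1
    now_epoch=$2
    [ $((now_epoch - last_result_epoch)) -ge "$STALE_AFTER" ]
}

# Possible usage with a known last-result timestamp:
#   results_stale "$last_result" "$(date +%s)" && echo "UNKNOWN: Service results are stale"
```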
Note: While a slave is down, a reload will fail. This is because the reload expects to be able to communicate with all slave nodes.
If the slave is expected to be offline for a long period, you can disable the slave by marking it as not activated. When the slave is restarted, Nagios Core on the slave will continue to run checks but results will not be received by the master. You may still get notifications from this slave server - remember to stop Nagios!
Slave clusters
If more than one server is selected during configuration of a monitoring server, a 'slave cluster' will be created. This will provide automatic load balancing between nodes within the cluster, and all checks will automatically fail over if one of the nodes in that cluster fails.
See slave clusters for more information on setting up and configuring slave clusters.
Using The Slave
To make use of the slave host, amend each host configuration and set the Monitored By value to the correct server
Configuration->Hosts->{Edit Host}->Monitored By->{slave_host}
Alternatively, you can drag and drop the host between servers on the Monitoring servers list page.
Slave Web Interface
Opsview slaves can have a standard Nagios web interface, but this is not enabled by default. To enable it:
1. Add the Apache web server user (i.e. the user that Apache runs as) to the nagcmd group. On Debian this is normally www-data - please check your system. The -a option appends the group instead of replacing the user's existing supplementary groups.
sudo usermod -a -G nagcmd www-data
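The group change can be confirmed before Apache is restarted. As a small sketch, the hypothetical helper has_group below checks a space-separated group list, such as the output of id -nG, for an exact group name.

```shell
# Hypothetical helper: succeed when a space-separated group list ($1)
# contains the exact group name ($2).
has_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible usage (www-data is the Debian default; check your system):
#   has_group "$(id -nG www-data)" nagcmd || echo "www-data is not in nagcmd"
```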
2. Create an Apache configuration file by:
sudo su -
cp /usr/local/nagios/installer/apache_opsview_slave.conf /etc/apache2/sites-available/opsview_slave
vi /etc/apache2/sites-available/opsview_slave
Tailor as necessary for your site (if required) and then enable the new configuration (the default configuration should not be needed)
sudo a2ensite opsview_slave
sudo a2dissite default
3. Stop and start Apache (not a restart, as the new group membership needs to be activated)
sudo /etc/init.d/apache2 stop
sudo /etc/init.d/apache2 start
You should now be able to access the web pages on the slave using the same login credentials as the Opsview master server.
Using Slaves with reversed SSH tunnels
Reverse SSH tunnels are useful if security policy only allows slaves to initiate conversations with the master. A tunnel is started from the slave to the master, and the master is then able to start new communications with the slave as required.
On master, edit opsview.conf and set:
$slave_initiated = 1;
$slave_base_port = 25800;
Restart opsview and opsview-web.
The master is now ready for reverse connections.
You will need some way to access the slave for the initial install. Create the nagios groups and user and exchange SSH keys: the slave needs the public key of the master and the master needs the public key of the slave.
Create the slave on the monitoring servers page. On the list page, hover over the slave node. This should then tell you the slave port number.
On the slave, first install the autossh program/package, then run the following as the nagios user:
ssh -N -R {slave_port_number}:127.0.0.1:22 opsviewmaster
This process will not exit on the slave. On the master, run the following as the nagios user:
/usr/local/nagios/bin/dosh -i -s {slavename} uname -a
This command checks the connectivity to the slave.
Install Opsview with send2slaves {slave_name}. If this is a new slave you will need to intervene manually as root - check the instructions onscreen. If this is a pre-existing slave, run send2slaves {slave_name} again.
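Because the slave port number is copied by hand from the web interface, it can help to validate it before opening the tunnel. The sketch below is a hypothetical helper, tunnel_command, that prints the ssh command shown earlier for the slave once the port has been checked. The port 25801 in the example is illustrative only; use the value shown on the monitoring servers list page.

```shell
# Hypothetical helper: print the reverse-tunnel command for a slave port.
# Rejects anything that is not a valid TCP port number.
tunnel_command() {
    port=$1
    master=$2
    case $port in
        ''|*[!0-9]*) echo "invalid port: $port" >&2; return 1 ;;
    esac
    if [ "$port" -lt 1 ] || [ "$port" -gt 65535 ]; then
        echo "port out of range: $port" >&2
        return 1
    fi
    echo "ssh -N -R ${port}:127.0.0.1:22 ${master}"
}

# 25801 is an illustrative port, not a value taken from a real installation.
tunnel_command 25801 opsviewmaster
```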
When installed, on the slave, create /usr/local/nagios/etc/opsview-slave.conf:
MASTER="master.opsview.org"
SLAVE_PORT={slave_port_number}
Test from the slave with:
/etc/init.d/opsview-slave test
/etc/init.d/opsview-slave start
/etc/init.d/opsview-slave status
On the master, use dosh uname -a to test connectivity to all slaves.
All communications with master should now work correctly and a reload can now be performed.
