Automated Monitoring of a cloudinit.d Application
This document explains how to use cloudinit.d as an automated monitoring tool together with a service such as Pingdom, Nagios, or crond. To keep the instructions simple we use an example based on crond, but the concepts should transfer easily to more sophisticated monitoring services.
cloudinit.d can launch sophisticated multi-node cloud applications, and it can also monitor and automatically repair them. This can be done manually with console commands issued by an operator, or it can be automated with tools such as crond.
First, let us look at how to use cloudinit.d manually for monitoring and repair. Take as an example a typical web application platform with a load balancer and web servers. An operator launches the infrastructure with 1 load balancer and 8 web servers. It is important that all the web server nodes remain up and functioning so that they can handle the expected load of this web application. If (when) something does go wrong, the operator would like to repair it as quickly and surgically as possible. cloudinit.d is well suited to this task.
To launch an application the user simply runs:
cloudinitd -v -v boot main.conf
The command output includes the 'runname'. The operator must keep this value in order to monitor the run later. When the operator wishes to check that the system is still running, she uses the following command:
cloudinitd -vv status <runname>
This displays user-friendly output indicating either that everything is working (in which case the exit code is 0) or that something is wrong (in which case the exit code is nonzero). If something is wrong, the operator can automatically repair the application with the command:
cloudinitd -vv repair <runname>
This will locate the problem VM, reboot it, and then check whether any services that depend on the newly rebooted service need a repair as well. If the repair is successful the exit code is 0; if something could not be repaired after a few retries the exit code is nonzero. Repair can also be run on a healthy system, in which case it does nothing and returns 0.
While there is a time and a place for manual monitoring, it is often more convenient to have automated monitoring that only interrupts the operator when a problem is detected. Because cloudinit.d is careful about its exit codes, it can easily be configured to work with automated tools like crond.
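Because only the exit code matters, the same check can be adapted to other monitoring services. The following is a minimal sketch, not part of cloudinit.d itself, of a Nagios-style check. The function name check_cloudinitd_run is hypothetical, and reporting every nonzero status as CRITICAL is an assumption. The numeric return values follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/bash
# Hypothetical Nagios-style check built on `cloudinitd status`.
# Nagios plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.

check_cloudinitd_run() {
    local runname=$1
    if [ -z "$runname" ]; then
        echo "UNKNOWN - no runname given"
        return 3
    fi
    # Only the exit code is used; the human-readable report is discarded.
    if cloudinitd status "$runname" > /dev/null 2>&1; then
        echo "OK - $runname is healthy"
        return 0
    fi
    # Assumption: any nonzero exit code is reported as CRITICAL.
    echo "CRITICAL - $runname failed its status check"
    return 2
}

# To use this as a plugin script, finish with:
# check_cloudinitd_run "$1"; exit $?
```

Unlike the crond approach, this sketch only reports. Whether a monitoring service should also trigger a repair is a separate decision for the operator.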
The following script can be run every hour from crond to test and repair an application launched by cloudinit.d:
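#!/bin/bash
# Check a cloudinit.d run and attempt a repair if the status check fails.
# Usage: pass the runname as the first argument.
runname=$1

# A healthy run produces no output, so crond sends no email.
cloudinitd status "$runname" >> /dev/null
if [ $? -eq 0 ]; then
    exit 0
fi

echo "WARNING $runname experienced an error, attempting to repair"
cloudinitd repair "$runname" >> /dev/null
if [ $? -ne 0 ]; then
    echo "We were unable to repair $runname. Please examine the logs"
    exit 1
fi
echo "The repair was successful"
exit 0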
When nothing goes wrong, this script generates no output and therefore crond does not send any email. However, if the initial status check indicates a failure, the script outputs a warning and then attempts to repair the application. It then outputs a message indicating whether the repair succeeded. Crond will email all of the output to the operator, notifying her that the application experienced some turbulence.
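To schedule the monitoring script hourly, a crontab entry is needed. The sketch below builds one. The helper name hourly_cron_entry, the script path, and the runname are placeholders chosen for illustration, not values defined by cloudinit.d.

```shell
#!/bin/bash
# Sketch: build an hourly crontab entry for the monitoring script.
# The script path and runname passed in are placeholders.

hourly_cron_entry() {
    # Fields: minute hour day-of-month month day-of-week command
    # "0 * * * *" means minute 0 of every hour.
    printf '0 * * * * %s %s\n' "$1" "$2"
}

# Example (appends the entry to the current user's crontab):
# ( crontab -l 2>/dev/null; hourly_cron_entry /path/to/monitor.sh myrun ) | crontab -
```

Cron normally mails any output to the owner of the crontab. Many cron implementations also honor a MAILTO line in the crontab for directing that mail to a different address.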