Monitoring the health of numerous servers can be a challenging and time-consuming task. Luckily, modern servers support a software suite that lets administrators monitor the health of the hardware itself. This includes temperature, power supply status, memory and ECC status, fan RPM, and many other attributes of the server hardware. The toolkit I’m talking about is OpenIPMI, and it’s available in just about every Linux distribution. For the purposes of this article I’m going to focus on RHEL5, but it should be straightforward to adapt these instructions to your distro.
Installing OpenIPMI
OpenIPMI is available as an RPM and can be installed with yum like so:
yum install OpenIPMI
Once it is installed, start the service, which in turn loads the necessary kernel modules.
/etc/init.d/ipmi start
We’ll also ensure that it starts on boot.
chkconfig ipmi on
This allows us to use the ipmitool command on the local machine, among other things (on RHEL5, ipmitool itself ships in the OpenIPMI-tools package). Let’s list the system event log (SEL) to be sure that it’s working.
ipmitool sel list
Hopefully your SEL is clear, but you may see hardware issues logged here that you weren’t aware of. On the bright side, now you can fix them before they crash your system!
Logging IPMI Events To Syslog
Ipmievd is a utility which can run as a daemon and will monitor your SEL for events, sending them to syslog when they occur. On RHEL5 it is available in the OpenIPMI-tools package.
Ensure that OpenIPMI-tools is installed.
yum install OpenIPMI-tools
Before starting the daemon I needed to set the mode to SEL, as the default of “open” did not work on my servers. YMMV.
#/etc/sysconfig/ipmievd
# ipmievd configuration scripts
# Command line options of ipmievd, see man ipmievd for details
IPMIEVD_OPTIONS="sel"
Now we start the service and ensure that it starts on boot. (Note: ipmievd requires that the ipmi service be running.)
/etc/init.d/ipmievd start
chkconfig ipmievd on
You should now see SEL events logged in syslog, by default with the local4 facility.
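Rather than waiting for a real hardware event, one way to confirm where local4 messages land is to inject a test message with logger, tagged so that it looks like ipmievd output. This is only a sketch: the function name `send_test_ipmi_event` and the `LOGGER_CMD` override used for dry runs are hypothetical and not part of OpenIPMI.

```shell
# Send a fake ipmievd-style message to syslog on the local4 facility.
# LOGGER_CMD is a hypothetical override so the call can be dry-run
# (for example LOGGER_CMD=echo); by default the real logger(1) is used.
send_test_ipmi_event() {
    ${LOGGER_CMD:-logger} -p local4.notice -t ipmievd "${1:-test event, please ignore}"
}

# Dry run in a subshell: print the arguments logger would receive
# instead of writing anything to syslog.
( LOGGER_CMD=echo; send_test_ipmi_event "test event, please ignore" )
```

Calling `send_test_ipmi_event` without the override sends the message for real; it should then show up in whichever file your syslog configuration routes local4 to.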
Generating Alerts When IPMI Events Happen
To generate an email alert when an IPMI event is logged I’m using swatch. I run the swatch process on my central log server so that I can monitor and alert on all my logs in one place, but it could just as well be run on individual servers.
Swatch RPMs are available for RHEL5 via Fedora’s Extra Packages for Enterprise Linux (EPEL) repository.
First, we install swatch.
yum install swatch
Then we define the regular expressions that will generate alerts when they are matched in the logs. In my case I’m using /etc/swatchrc, but you may use any file you wish. Swatch defaults to ~/.swatchrc.
Swatch swatchrc configuration example:
#/etc/swatchrc
# swatchrc - define regular expressions and generate alerts when matches are found in logs
# daemon is started from /etc/cron.d/swatch
#
### IPMI EVENTS ###
#
# Ignore common IPMI startup output
#
ignore /Reading\ sensors/
ignore /Waiting\ for\ events/
# Match ipmievd syslog entries like the following:
# Jul 12 09:36:39 server-01 ipmievd: foo bar baz
# (\ + also accepts the space-padded single-digit day, e.g. "Jul  2")
#
watchfor /(\S+)\ +([0-9]+)\ ([0-9]{2}:[0-9]{2}:[0-9]{2})\ (\S+)\ (ipmievd:)\ (.*)/
    exec=echo $1 $2 $3 $4 $5 $6 | nail -r "alert@example.com" -s "IPMI Event on $4" sysadmin@example.com
Note: I am using the nail command in order to specify a from and subject header in the email itself. Nail is available from the RPMForge yum repositories, or could be substituted with your favorite mail command.
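Before trusting the watchfor rule with real alerts, it can help to check the pattern against a sample line. The sketch below is a shell approximation, not swatch itself: swatch uses Perl regular expressions, so `\S` is rewritten as `[^ ]` for sed’s extended syntax, and the pattern is anchored at both ends. The names `ipmi_pattern` and `parse_ipmi_line` are hypothetical.

```shell
# Extended-regex approximation of the watchfor pattern.
# Groups: 1=month 2=day 3=time 4=host 5="ipmievd:" 6=message
ipmi_pattern='^([^ ]+) +([0-9]+) ([0-9]{2}:[0-9]{2}:[0-9]{2}) ([^ ]+) (ipmievd:) (.*)$'

# Print "host=... message=..." for an ipmievd syslog line;
# print nothing if the line does not match.
parse_ipmi_line() {
    printf '%s\n' "$1" | sed -n -E "s/${ipmi_pattern}/host=\4 message=\6/p"
}

parse_ipmi_line 'Jul 12 09:36:39 server-01 ipmievd: foo bar baz'
# prints: host=server-01 message=foo bar baz
```

A line from any other program, such as sshd, produces no output, which is a quick way to confirm that the rule will not fire on unrelated log entries.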
Now we’re ready to start swatch:
swatch -c /etc/swatchrc -p 'tail -f -n 0 /var/log/*log'
To keep the process running, I use swatch’s --pid-file and --daemon options together with a cron job that checks every minute whether the recorded PID is still alive and restarts swatch if it is not.
#/etc/cron.d/swatch
# check every minute that swatch is running, and start it if not
*/1 * * * * root pgrep -F /var/run/swatch.pid > /dev/null 2>&1 || swatch --pid-file=/var/run/swatch.pid --daemon -c /etc/swatchrc -p 'tail -f -n 0 /var/log/*log'
# restart swatch every hour to ensure that new log files are monitored
0 */1 * * * root kill `cat /var/run/swatch.pid` > /dev/null 2>&1
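The cron entry relies on pgrep -F to read the PID file. If your pgrep lacks that option (worth checking with man pgrep), the same check can be sketched in plain shell with kill -0, which tests whether a process exists without actually signalling it. The function name `pidfile_is_running` is hypothetical.

```shell
# Return 0 if the PID stored in the given file belongs to a live process.
pidfile_is_running() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1        # no readable PID file: not running
    pid=$(cat "$pidfile")
    case $pid in
        ''|*[!0-9]*) return 1 ;;         # empty or non-numeric contents
    esac
    kill -0 "$pid" 2>/dev/null           # signal 0 only checks for existence
}

# Example: the current shell's own PID is certainly alive.
tmp_pidfile=$(mktemp)
echo "$$" > "$tmp_pidfile"
if pidfile_is_running "$tmp_pidfile"; then
    echo "running"
else
    echo "not running"
fi
rm -f "$tmp_pidfile"
```

In a wrapper script called from cron, this would be used as `pidfile_is_running /var/run/swatch.pid || swatch ...`. Note that kill -0 also fails for processes owned by another user, so the check should run as root, as the cron job does.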
Once this is complete you should begin seeing emails that look like this when IPMI events happen:
From: alert@example.com
To: sysadmin@example.com
Subject: IPMI Event on server-01
Jul 12 11:54:38 server-01 ipmievd: SEL overflow is cleared