Monitoring the health of numerous servers can be a challenging and time-consuming task. Luckily, modern servers support a software suite that allows administrators to monitor the health of the hardware itself. This includes temperature monitoring, power supply status, memory and ECC status, fan RPM, and many other attributes of the server hardware. The toolkit that I'm talking about is OpenIPMI, and it's available in just about every Linux distribution. For the purposes of this article I'm going to focus on RHEL5, but it should be straightforward to adapt these instructions to your distro.
Installing OpenIPMI
OpenIPMI is available as an RPM and can be installed with yum.
Once installed you'll want to start the service, which in turn will load the necessary kernel modules.
And we'll also ensure that it starts up on boot.
This allows us to use the ipmitool command on the local machine, among other things. Let's list the system event log (SEL) to be sure that it's working.
Hopefully your SEL is clear, but you may see hardware issues logged here that you weren't aware of. On the bright side, now you can fix them before they crash your system!
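The steps in this section can be sketched as the short script below. It assumes that the package provides an init script and chkconfig service named ipmi (the service name that ipmievd depends on later in this article) and that ipmitool is on the PATH afterwards. The run helper and the RUN variable are hypothetical conveniences, not part of OpenIPMI: by default the script only prints each command, and it executes them only when invoked with RUN=1 as root.

```shell
#!/bin/sh
# Sketch of the OpenIPMI setup steps on a RHEL5-style host.
# Assumption: the init script and chkconfig service are both named "ipmi".

# Hypothetical helper: print the command unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

ipmi_setup() {
    run yum install -y OpenIPMI      # install the package
    run /etc/init.d/ipmi start       # start the service, loading the kernel modules
    run chkconfig ipmi on            # start the service on boot
    run ipmitool sel list            # list the system event log
}

ipmi_setup
```

Reading the printed commands first and then re-running with RUN=1 keeps the dry run and the real run identical.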
Logging IPMI Events To Syslog
Ipmievd is a utility which can run as a daemon and will monitor your SEL for events, sending them to syslog when they occur. On RHEL5 it is available in the OpenIPMI-tools package.
Ensure that OpenIPMI-tools is installed.
yum install OpenIPMI-tools
Before starting the daemon I needed to set the mode to SEL, as the default of "open" did not work on my servers. YMMV.
#/etc/sysconfig/ipmievd
# ipmievd configuration scripts

# Command line options of ipmievd, see man ipmievd for details
IPMIEVD_OPTIONS="sel"
Now we start the service, and ensure that it starts on boot. (note: ipmievd requires that the ipmi service be running)
/etc/init.d/ipmievd start

chkconfig ipmievd on
You should now see SEL events logged in syslog, by default with the local4 facility.
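A quick way to confirm this is to filter a syslog file for ipmievd entries. The sketch below assumes that your syslog configuration writes local4 messages to /var/log/messages, which may not hold on your system; ipmi_events is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print ipmievd entries from a syslog file.
# Assumption: local4 messages end up in /var/log/messages unless
# another file is passed as the first argument.
ipmi_events() {
    grep ' ipmievd: ' "${1:-/var/log/messages}"
}

# Example: show the five most recent ipmievd entries, if the log is readable.
if [ -r /var/log/messages ]; then
    ipmi_events | tail -n 5
fi
```

To exercise the path without waiting for a real hardware event, logger -p local4.notice -t ipmievd 'test entry' writes a line with the same facility and tag.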
Generating Alerts When IPMI Events Happen
To generate an email alert when an IPMI event is logged, I'm using swatch. I run the swatch process on my central log server so that I can monitor and alert on all my logs centrally; however, this could be run on individual servers as well.
Swatch RPMs are available for RHEL5 via Extra Packages for Enterprise Linux (EPEL), a repository maintained by the Fedora project.
First, we install swatch.
Then we define the regular expressions that should trigger alerts when they match a log line. In my case I'm using /etc/swatchrc, but you may use any file you wish; swatch defaults to ~/.swatchrc.
Swatch swatchrc configuration example:
#/etc/swatchrc

# swatchrc - define regular expressions and generate alerts when matches are found in logs
# daemon is started from /etc/cron.d/swatch
#

### IPMI EVENTS ###
#

# Ignore common IPMI startup output
#
ignore /Reading\ sensors/
ignore /Waiting\ for\ events/

# Match ipmievd syslog entries like the following:
# Jul 12 09:36:39 server-01 ipmievd: foo bar baz
#
watchfor /(\S*)\ ([0-9]*)\ ([0-9]{2}:[0-9]{2}:[0-9]{2})\ (\S*)\ (ipmievd:)\ (.*)/
exec=echo $1 $2 $3 $4 $5 $6 | nail -r "alert@example.com" -s "IPMI Event on $4" sysadmin@example.com
Note: I am using the nail command in order to specify a from and subject header in the email itself. Nail is available from the RPMForge yum repositories, or could be substituted with your favorite mail command.
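To see which part of a log line each group in the watchfor pattern matches, the pattern can be tried against the sample line from the swatchrc comment. This is a sketch that uses sed rather than swatch itself, so it only illustrates the regular expression, not how swatch substitutes variables in exec. \S is rewritten as [^ ] because POSIX extended regular expressions have no \S, the pattern is anchored for clarity, and the variable names are illustrative.

```shell
#!/bin/sh
# Sample line taken from the comment in the swatchrc example.
line='Jul 12 09:36:39 server-01 ipmievd: foo bar baz'

# Same six groups as the watchfor pattern, written as a POSIX ERE.
pattern='^([^ ]*) ([0-9]*) ([0-9]{2}:[0-9]{2}:[0-9]{2}) ([^ ]*) (ipmievd:) (.*)$'

# Group 4 is the hostname, group 6 is the event text.
# Note: older GNU sed releases spell the -E option as -r.
host=$(printf '%s\n' "$line" | sed -E -n "s/$pattern/\4/p")
message=$(printf '%s\n' "$line" | sed -E -n "s/$pattern/\6/p")

echo "host=$host"
echo "message=$message"
```

For the sample line this prints host=server-01 and message=foo bar baz.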
Now we're ready to start swatch:
swatch -c /etc/swatchrc -p 'tail -f -n 0 /var/log/*log'
To ensure that the process stays running, I made use of the --pid-file and --daemon options and wrote a cron job that checks whether the PID is still alive and restarts swatch if it is not.
#/etc/cron.d/swatch

# make sure that swatch is running every minute
*/1 * * * * root pgrep -F /var/run/swatch.pid > /dev/null 2>&1 || swatch -c /etc/swatchrc --pid-file=/var/run/swatch.pid --daemon -p 'tail -f -n 0 /var/log/*log'

# restart swatch every hour to ensure that new log files are monitored
0 */1 * * * root kill `cat /var/run/swatch.pid` > /dev/null 2>&1
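The liveness check in the first cron entry can also be written without pgrep, for systems where pgrep lacks the -F option. The following is a minimal alternative sketch: pid_alive is a hypothetical helper name, PIDFILE is an illustrative override, and the default path is the pid file used in the cron job above.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if the PID stored in the given pid file
# belongs to a live process, non-zero otherwise.
pid_alive() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    # Reject empty or non-numeric contents before calling kill.
    case $pid in
        ''|*[!0-9]*) return 1 ;;
    esac
    # kill -0 sends no signal; it only checks that the process exists
    # and that we are allowed to signal it (always true for root).
    kill -0 "$pid" 2>/dev/null
}

PIDFILE=${PIDFILE:-/var/run/swatch.pid}
if pid_alive "$PIDFILE"; then
    echo "swatch is running"
else
    echo "swatch is not running"
    # A watchdog would start swatch again here with the options shown above.
fi
```

Validating the file contents first means a stale or truncated pid file is treated as "not running" instead of passing garbage to kill.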
Once this is complete you should begin seeing emails that look like this when IPMI events happen:
From: alert@example.com
To: sysadmin@example.com
Subject: IPMI Event on server-01

Jul 12 11:54:38 server-01 ipmievd: SEL overflow is cleared