alerting with combo question
list Martin Flemming
Hi!
I've got a difficult problem with an alerting rule that my customers expect.
Unfortunately, I'm afraid that it can't work,
but maybe someone on the list has an idea :-)
I've got a Lustre quota check
which should run on 4 hosts for reliability reasons.
If everything is OK, the output on each host looks like this:
Tue Nov 30 11:13:35 CET 2010 - LUSTRE Mount(s) on tcx060 OK
filesystem summary: 78.2T 58.0T 16.3T 74% /scratch/hh/lustre/atlas
filesystem summary: 19.9T 18.7T 233.8G 93% /scratch/zn/lustre/atlas
When the quota is reached, the output contains additional information:
all user directories and their sizes. Up to this point it's all easy.
Now my customers want to get only one alert with that additional
information about the user directories and their sizes, not four alerts, one per host.
I've tested it with bbcombotest.cfg, e.g.:
AtlasLustre.lustre-atlas = (tcx040.lustre\-atlas + tcx060.lustre\-atlas + tcx080.lustre\-atlas + tcx120.lustre\-atlas ) >= 4
This alarm works of course, but I only get this alert message:
Red Mon Nov 29 14:37:07 2010
(tcx040.lustre\-atlas+tcx060.lustre\-atlas+tcx080.lustre\-atlas+tcx120.lustre\-atlas)>=4 = (1+1+1+1)>=4 = 0
&green tcx040.lustre-atlas
&green tcx060.lustre-atlas
&green tcx080.lustre-atlas
&red tcx120.lustre-atlas
It lacks the additional information about the user directories and their sizes.
That's logical of course, but it doesn't solve my problem :-(
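For readers unfamiliar with combo tests, the arithmetic behind the alert above can be sketched in plain shell. This is only an illustration, not Xymon code: `combo_ok` is a hypothetical helper, and it assumes that a green status counts as 1 and anything else as 0, which simplifies how Xymon may treat other colors.

```shell
#!/bin/sh
# Hypothetical helper, not part of Xymon: count the green statuses and
# compare the sum against a threshold, mimicking "(a + b + c + d) >= 4".
# Assumption: green counts as 1, every other color as 0 (a simplification).
combo_ok() {
    threshold=$1
    shift
    sum=0
    for color in "$@"; do
        if [ "$color" = "green" ]; then
            sum=$((sum + 1))
        fi
    done
    # Succeed only if enough of the member tests are green.
    [ "$sum" -ge "$threshold" ]
}

# Three green hosts and one red host, as in the alert message above.
if combo_ok 4 green green green red; then
    echo "combo is green"
else
    echo "combo is red"
fi
```

With a threshold of 4, a single red host is enough to turn the combined status red, and the combined message only lists the member colors, not the output of the individual checks.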
Any hints are welcome !
Thanks & cheers
Martin
list Henrik Størner
In <user-bd772782d5b7@xymon.invalid> Martin Flemming <user-f286aaa49a76@xymon.invalid> writes:
I've tested it with bbcombotest.cfg, e.g.:
AtlasLustre.lustre-atlas = (tcx040.lustre\-atlas + tcx060.lustre\-atlas + tcx080.lustre\-atlas + tcx120.lustre\-atlas ) >= 4
This alarm works of course, but I only get this alert message:
Red Mon Nov 29 14:37:07 2010
(tcx040.lustre\-atlas+tcx060.lustre\-atlas+tcx080.lustre\-atlas+tcx120.lustre\-atlas)>=4 = (1+1+1+1)>=4 = 0
&green tcx040.lustre-atlas
&green tcx060.lustre-atlas
&green tcx080.lustre-atlas
&red tcx120.lustre-atlas
It lacks the additional information about the user directories and their sizes.
That's logical of course, but it doesn't solve my problem :-(
I'd use a script to handle the alerting in that case. You can grab the
current status data from Xymon using the bb 'hobbitdlog' command, so you
can include that data in your alert message.
See http://www.xymon.com/xymon/help/xymon-alerts.html for details on
alert scripts.

Something like this - completely untested:

    #!/bin/sh
    # $BBALPHAMSG contains the alert message text. Save it to
    # a file, then scan it for lines beginning with "&red" or
    # "&yellow" to get the problem hosts. Then grab the log-status
    # for these hosts and append it to the alert message. Finally,
    # send the alert.

    echo "$BBALPHAMSG" >/tmp/alert.txt
    egrep '^&red|^&yellow' /tmp/alert.txt | while read L
    do
        LOGID=`echo $L | awk '{print $2}'`   # Get the host.status ID

        # Append the problem details to the alert text
        echo "$LOGID details" >>/tmp/alert.txt
        $BB $BBDISP "hobbitdlog $LOGID" >>/tmp/alert.txt
    done

    # Send out the alert
    mail -s "Lustre filesystem $BBCOLORLEVEL alert" $RCPT </tmp/alert.txt

    exit 0

In hobbit-alerts.cfg, use

    HOST=AtlasLustre TEST=lustre-atlas
        SCRIPT /usr/local/bin/lustrealert.sh user-ef86c43926b6@xymon.invalid

Regards,
Henrik
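
Since the script above is untested, it may help to try its parsing step on its own, without touching Xymon or sending mail. The sketch below does that with the member lines from the combo alert earlier in the thread. `problem_ids` is a hypothetical helper name, and it assumes each member line has the form "&color host.test".

```shell
#!/bin/sh
# Hypothetical helper: read alert text on stdin and print the host.test
# ID (second field) of every line that starts with "&red" or "&yellow".
problem_ids() {
    awk '/^&(red|yellow) / { print $2 }'
}

# Sample text copied from the combo alert shown earlier in the thread.
sample='&green tcx040.lustre-atlas
&green tcx060.lustre-atlas
&green tcx080.lustre-atlas
&red tcx120.lustre-atlas'

# Dry run: list the IDs whose status logs would be fetched.
printf '%s\n' "$sample" | problem_ids
```

For the sample this prints only tcx120.lustre-atlas. In a real alert script the input would presumably come from `echo "$BBALPHAMSG"`, with each printed ID passed on to the hobbitdlog query.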
list Martin Flemming
Thanks a lot, Henrik!
It works like a charm :-)
cheers,
martin
On Tue, 30 Nov 2010, Henrik Størner wrote:
[...]