Xymon Mailing List Archive

alerting with combo question

3 messages in this thread

list Martin Flemming · Tue, 30 Nov 2010 11:28:55 +0100 (CET) ·
Hi !

I've got a difficult problem with an alerting rule that my customers expect ..

.. unfortunately I'm afraid that it
can't work, but maybe someone on the list has an idea .. :-)


I've got a Lustre quota check
which should run on 4 hosts for reliability reasons ...

If everything is OK, the output on each host looks like

Tue Nov 30 11:13:35 CET 2010 - LUSTRE Mount(s) on tcx060 OK

filesystem summary:        78.2T       58.0T       16.3T  74% /scratch/hh/lustre/atlas
filesystem summary:        19.9T       18.7T      233.8G  93% /scratch/zn/lustre/atlas

If the quota is reached, the output contains additional information about
all user directories and their sizes ... up to this point it's all easy ..

Now my customers want to get only one alert, with the additional
information about all user directories and their sizes, and not four alerts, one for each host.

I've tested it with an entry in bbcombotest.cfg, e.g.

AtlasLustre.lustre-atlas = (tcx040.lustre\-atlas + tcx060.lustre\-atlas + tcx080.lustre\-atlas + tcx120.lustre\-atlas )  >= 4

This alarm works of course, but I only get this alert message:

Red Mon Nov 29 14:37:07 2010

(tcx040.lustre\-atlas+tcx060.lustre\-atlas+tcx080.lustre\-atlas+tcx120.lustre\-atlas)>=4 = (1+1+1+1)>=4 = 0
&green tcx040.lustre-atlas
&green tcx060.lustre-atlas
&green tcx080.lustre-atlas
&red tcx120.lustre-atlas


It lacks the additional information about all user directories and their sizes.
That's logical of course .. but it doesn't solve my problem :-(
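
To make the arithmetic of the combo expression concrete, here is a minimal sketch in shell. It assumes that a green status counts as 1 and that every other colour counts as 0, which is a simplification for illustration; `color_value` and `combo_all_green` are hypothetical helper names and not part of Xymon.

```shell
#!/bin/sh
# color_value (hypothetical helper): map a status colour to the 0/1
# value used in the sum. Assumption for this sketch: only green is 1.
color_value() {
    case "$1" in
        green) echo 1 ;;
        *)     echo 0 ;;
    esac
}

# combo_all_green (hypothetical helper): evaluate (a + b + c + d) >= 4
# for the colours given as arguments and print the resulting colour.
combo_all_green() {
    sum=0
    for color in "$@"; do
        sum=$((sum + $(color_value "$color")))
    done
    if [ "$sum" -ge 4 ]; then
        echo green
    else
        echo red
    fi
}

# Three green hosts and one red host: the sum is 3, so the combo is red.
combo_all_green green green green red
```

With `>= 4` and four hosts, a single non-green host is enough to turn the combined status red, and the combined status carries only the per-host colours, not the per-host status text.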

Any hints are welcome !

Thanks & cheers

        Martin
list Henrik Størner · Tue, 30 Nov 2010 12:16:49 +0000 (UTC) ·
quoted from Martin Flemming
In <user-bd772782d5b7@xymon.invalid> Martin Flemming <user-f286aaa49a76@xymon.invalid> writes:
I've tested it with an entry in bbcombotest.cfg, e.g.
AtlasLustre.lustre-atlas = (tcx040.lustre\-atlas + tcx060.lustre\-atlas + tcx080.lustre\-atlas + tcx120.lustre\-atlas )  >= 4
This alarm works of course, but I only get this alert message:
Red Mon Nov 29 14:37:07 2010
(tcx040.lustre\-atlas+tcx060.lustre\-atlas+tcx080.lustre\-atlas+tcx120.lustre\-atlas)>=4 = (1+1+1+1)>=4 = 0
&green tcx040.lustre-atlas
&green tcx060.lustre-atlas
&green tcx080.lustre-atlas
&red tcx120.lustre-atlas
It lacks the additional information about all user directories and their sizes.
That's logical of course .. but it doesn't solve my problem :-(

I'd use a script to handle the alerting in that case. You can grab the 
current status-data from Xymon using the bb 'hobbitdlog' command, so
you can include those data in your alert message. See 
http://www.xymon.com/xymon/help/xymon-alerts.html for details on alert
scripts.


Something like this - completely untested:


#!/bin/sh

# $BBALPHAMSG contains the alert message text. Save it to a
# temporary file, then scan the text for lines beginning with
# "&red" or "&yellow" to get the problem hosts. Then grab the
# status log for each of these and append it to the alert
# message. Finally, send the alert to $RCPT.

MSGFILE=$(mktemp /tmp/lustrealert.XXXXXX) || exit 1
trap 'rm -f "$MSGFILE"' EXIT

echo "$BBALPHAMSG" >"$MSGFILE"
# Read the problem lines from $BBALPHAMSG itself, so the loop
# never reads the file it is appending to.
echo "$BBALPHAMSG" | egrep '^&red|^&yellow' | while read -r COLOR LOGID REST
do
   # $LOGID is the host.status ID; append its details to the alert text
   echo "$LOGID details"           >>"$MSGFILE"
   $BB $BBDISP "hobbitdlog $LOGID" >>"$MSGFILE"
done
# Send out the alert
mail -s "Lustre filesystem $BBCOLORLEVEL alert" "$RCPT" <"$MSGFILE"
exit 0


In hobbit-alerts.cfg, use

   HOST=AtlasLustre TEST=lustre-atlas
      SCRIPT /usr/local/bin/lustrealert.sh user-ef86c43926b6@xymon.invalid
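
The parsing step can be checked without a running Xymon server by feeding it a sample message. This is only a sketch: `problem_ids` is a hypothetical helper name, and the sample text is copied from the combo alert quoted earlier in this thread.

```shell
#!/bin/sh
# problem_ids (hypothetical helper): print the host.test ID of every
# "&red" or "&yellow" line found on standard input.
problem_ids() {
    grep -E '^&(red|yellow) ' | while read -r color logid rest
    do
        echo "$logid"
    done
}

# Sample combo alert text, taken from the message quoted in this thread.
SAMPLE='Red Mon Nov 29 14:37:07 2010
&green tcx040.lustre-atlas
&green tcx060.lustre-atlas
&green tcx080.lustre-atlas
&red tcx120.lustre-atlas'

# Only the red line should be reported.
printf '%s\n' "$SAMPLE" | problem_ids
```

For the sample above this prints `tcx120.lustre-atlas`, which is the ID the alert script would then pass to hobbitdlog.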


Regards,
Henrik
list Martin Flemming · Wed, 1 Dec 2010 11:53:02 +0100 (CET) ·
Thanks a lot, Henrik !

It works like a charm :-)

cheers,
 	martin