Xymon Mailing List Archive

Alert Rules

11 messages in this thread

list Kevin Hanrahan · Wed, 2 Mar 2005 14:46:06 -0500 ·
I have an alert rule question. I have the following rule defined:

HOST=$AV
        MAIL $SYSADMIN COLOR=red EXSERVICE=msgs,cpu REPEAT=30m RECOVERED
        MAIL $SYSADMIN COLOR=red SERVICE=cpu DURATION>20 REPEAT=30m RECOVERED
        MAIL $SYSADMIN COLOR=purple REPEAT=1h RECOVERED

I did this because I was tired of getting those Windows event log messages.
I no longer get the event log alert or warning messages, but I still get
"recovered" messages for the "msgs" category. I thought the first line would
eliminate ALL "msgs" alerts, including when it recovers or goes green. Am I
thinking about this the wrong way?


Kevin 

list Henrik Størner · Wed, 2 Mar 2005 21:55:16 +0100 ·
quoted from Kevin Hanrahan
On Wed, Mar 02, 2005 at 02:46:06PM -0500, user-fd47fec4b039@xymon.invalid wrote:
> I have an alert rule question. I have the following rule defined:
>
> HOST=$AV
>         MAIL $SYSADMIN COLOR=red EXSERVICE=msgs,cpu REPEAT=30m RECOVERED
>         MAIL $SYSADMIN COLOR=red SERVICE=cpu DURATION>20 REPEAT=30m RECOVERED
>         MAIL $SYSADMIN COLOR=purple REPEAT=1h RECOVERED
>
> I did this because I was tired of getting those Windows event log messages.
> I no longer get the event log alert or warning messages but I get
> "recovered" messages for the "msgs" category. I thought the first line would
> eliminate ALL "msgs" alerts including when it recovers or goes green. Am I
> thinking about this the wrong way?

No, your thinking is absolutely right. It's a bug in RC4 (and earlier
versions) that "recovered" messages are sent even if no alert was sent
originally.

I have this fixed now, so relief is on its way.


Regards,
Henrik
list Kevin Hanrahan · Thu, 3 Mar 2005 00:16:18 -0500 ·
Excellent! Thank you. Will there be a new release candidate, or maybe a
patch to check out?
list Sue Bauer-Lee · Thu, 5 May 2005 11:12:07 -0400 ·
My expressions here must be really confusing:

$WINOPS=user-cdfb1314498f@xymon.invalid

# CCRT Windows
HOST="%(cctfep3*|cctapp3*|cctfep1[0-9]||cctfep0*|cctapp[0-9]|cctpdp0*|cctdbp0*)" SERVICE=conn
(164)     MAIL $WINOPS  REPEAT=10 RECOVERED

(172) HOST="%(tucwbs1*|ttucfes1*|tucaps1*|tucwbq1*|tucfeq1*|tucapq1*)" SERVICE=conn
      MAIL $UNIXOPS REPEAT=10 RECOVERED

00012290 2005-05-05 10:49:39 *** Match with 'HOST=%(cctfep3*|cctapp3*|cctfep1[0-9]||cctfep0*|cctapp[0-9]|cctpdp0*|cctdbp0*) SERVICE=conn' ***
00012290 2005-05-05 10:49:39 Matching host:service:page 'smtp2:conn:' against rule line 164
00012290 2005-05-05 10:49:39 *** Match with 'MAIL $WINOPS  REPEAT=10 RECOVERED' ***
00012290 2005-05-05 10:49:39 Mail alert with command 'mail -s "Hobbit [12345] smtp2:conn CRITICAL (RED)" user-cdfb1314498f@xymon.invalid'
00012290 2005-05-05 10:49:39 Matching host:service:page 'smtp2:conn:' against rule line 172


This host is also paging through a different rule. The bottom line is that
there is no real HOST entry that should match this hostname, and the paging
rule is not listed on the info page for this host.

That rule is 5 rules further down in the alerts file, but it is not the last
rule (it doesn't show on the info page):

HOST=Teletrack,GetEfunds,Equifax-Canada,TU-Canada,TU,Equifax,Experian SERVICE=conn
      MAIL user-bff61b36f2ac@xymon.invalid
      MAIL $UNIXADM3


Sue Bauer-Lee        |    KE4HNN, SSCP
Carrollton, GA 30112 |    Email: user-06773162a9bc@xymon.invalid
list Werner Michels · Thu, 05 May 2005 12:44:05 -0300 ·
On Thu, 5 May 2005 11:12:07 -0400
quoted from Sue Bauer-Lee
Sue Bauer-Lee <user-06773162a9bc@xymon.invalid> wrote:
> My expressions here must be really confusing:
>
> $WINOPS=user-cdfb1314498f@xymon.invalid
>
> # CCRT Windows
> HOST="%(cctfep3*|cctapp3*|cctfep1[0-9]||cctfep0*|cctapp[0-9]|cctpdp0*|cctdbp0*)" SERVICE=conn
> (164)     MAIL $WINOPS  REPEAT=10 RECOVERED

	Most regex engines treat an empty alternative in an "or" group as matching everything. In "cctfep1[0-9]||cctfep0*" the "||" sequence creates an empty alternative, which will possibly match every host. Try removing one of the "|".

	I didn't look at the code to be 100% sure on this.
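
The empty-alternative theory can be checked outside Hobbit. The sketch below uses Python's `re` module as a stand-in for the regex engine; it assumes the HOST expression is applied as an unanchored search, which has not been verified against the Hobbit source. `host_matches` is a hypothetical helper, not part of Hobbit.

```python
import re

# Sue's original expression, with the leading "%" (regex marker) stripped.
PATTERN_WITH_EMPTY_ALT = (
    r"(cctfep3*|cctapp3*|cctfep1[0-9]||cctfep0*|cctapp[0-9]|cctpdp0*|cctdbp0*)"
)
# The same expression with the doubled "|" reduced to a single "|".
PATTERN_FIXED = (
    r"(cctfep3*|cctapp3*|cctfep1[0-9]|cctfep0*|cctapp[0-9]|cctpdp0*|cctdbp0*)"
)


def host_matches(pattern: str, hostname: str) -> bool:
    """Unanchored search; an assumption about how the HOST regex is applied."""
    return re.search(pattern, hostname) is not None


for host in ("smtp2", "cctfep31", "Teletrack"):
    print(
        host,
        host_matches(PATTERN_WITH_EMPTY_ALT, host),
        host_matches(PATTERN_FIXED, host),
    )
```

With the empty alternative every hostname matches, including "smtp2"; with a single "|" only the cct* names do.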
	
	-wm
list Henrik Størner · Thu, 5 May 2005 23:05:08 +0200 ·
quoted from Sue Bauer-Lee
On Thu, May 05, 2005 at 11:12:07AM -0400, Sue Bauer-Lee wrote:
> My expressions here must be really confusing:
>
> $WINOPS=user-cdfb1314498f@xymon.invalid
>
> # CCRT Windows
> HOST="%(cctfep3*|cctapp3*|cctfep1[0-9]||cctfep0*|cctapp[0-9]|cctpdp0*|cctdbp0*)" SERVICE=conn
> (164)     MAIL $WINOPS  REPEAT=10 RECOVERED
>
> (172) HOST="%(tucwbs1*|ttucfes1*|tucaps1*|tucwbq1*|tucfeq1*|tucapq1*)" SERVICE=conn
>       MAIL $UNIXOPS REPEAT=10 RECOVERED

I think you've been bitten by a common pitfall in converting shell
"globbing" strings into regexp's: If you want to match "anything",
you must use '.*' in a regexp, not just '*' - because "*" just means
"0 or more occurrences of the token to the left of the *".

E.g. "abc*" means "ab followed by zero or more c's", and is matched by "ab",
"abc", "abcc", "abccc", "abccccc" ... whereas "abc.*" means "abc followed by
zero or more characters", and is matched by "abc", "abcdjweerp903485" etc.

So those two lines should probably be

 HOST="%(cctfep3.*|cctapp3.*|cctfep1[0-9]|cctfep0.*|cctapp[0-9]|cctpdp0.*|cctdbp0.*)" SERVICE=conn
      MAIL $WINOPS  REPEAT=10 RECOVERED
 
 HOST="%(tucwbs1.*|ttucfes1.*|tucaps1.*|tucwbq1.*|tucfeq1.*|tucapq1.*)" SERVICE=conn
       MAIL $UNIXOPS REPEAT=10 RECOVERED
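
The difference between the glob and regexp readings of "*" can be tried out in a few lines. This is only an illustration using Python's `re` and `fnmatch` modules, not Hobbit's own matcher; `full` is a hypothetical helper that requires the whole string to match.

```python
import fnmatch
import re


def full(pattern: str, text: str) -> bool:
    """True when the whole of `text` matches the regular expression."""
    return re.fullmatch(pattern, text) is not None


# As a regexp, "abc*" is "ab" followed by zero or more "c".
print([s for s in ("ab", "abc", "abccc", "abcd") if full(r"abc*", s)])
# -> ['ab', 'abc', 'abccc']

# As a regexp, "abc.*" is "abc" followed by zero or more characters.
print([s for s in ("ab", "abc", "abcdjweerp903485") if full(r"abc.*", s)])
# -> ['abc', 'abcdjweerp903485']

# As a shell glob, "abc*" already means "abc followed by anything".
print([s for s in ("ab", "abc", "abcdjweerp903485") if fnmatch.fnmatchcase(s, "abc*")])
# -> ['abc', 'abcdjweerp903485']
```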


Regards,
Henrik
list Alan Killenbeck · Wed, 29 Jun 2005 13:27:45 -0400 ·
Hi,

I've been asked to make alerts send at most 2 emails, and
still send a RECOVERED message when things recover.
As a quick test, I set DURATION<30 and REPEAT=15. After forcing a
service down for over an hour, I got the two email alerts, but after
bringing the service back up I was not sent a recovered message.

Is it possible to do this?

Alan
list Henrik Størner · Wed, 29 Jun 2005 20:55:31 +0200 ·
quoted from Alan Killenbeck
On Wed, Jun 29, 2005 at 01:27:45PM -0400, Killenbeck, Alan wrote:
> I've been asked to try to make alerts only send 2 emails, at most, and
> still send a RECOVERED message when things recover.
> As a quick test, I set DURATION<30, and REPEAT=15, and after forcing a
> service down, for over an hour, achieved the
> two email alerts - but after bringing the service back up, did not get
> sent a recovered message.

Hmm - hadn't thought about that. I'd say it ought to work, but looking 
at the way recovery messages are handled it seems you're right - if the
max. duration has been reached, the recovery message is never sent.

It's a bug. Will fix.


Regards,
Henrik
list Tom Georgoulias · Wed, 29 Jun 2005 15:04:26 -0400 ·
quoted from Henrik Størner
Killenbeck, Alan wrote:
> I've been asked to try to make alerts only send 2 emails, at most, and
> still send a RECOVERED message when things recover.
>
> As a quick test, I set DURATION<30, and REPEAT=15, and after forcing
> a service down, for over an hour, achieved the two email alerts - but
> after bringing the service back up, did not get sent a recovered
> message.
>
> Is it possible to do this?

From what you wrote, you're going to get more than 2 emails if the system
remains offline longer than an hour.  The duration means the service can be
out 30 mins before you get your first email, but after that threshold is
reached, you'll get another every 15 mins *until* it is fixed.  It won't stop
after the second one.  So, if you're down for 2 hours, you should see ~6
messages: one 30 mins after the crash, then the rest every 15 mins
thereafter.  I'm not aware of any way to restrict the emails to exactly 2.

As for a recovered message, did you add the RECOVERED option to your alert
rules?  Hobbit doesn't send recovery messages unless you explicitly ask it to.

Tom
list Tom Georgoulias · Wed, 29 Jun 2005 15:21:50 -0400 ·
quoted from Tom Georgoulias
Tom Georgoulias wrote:
> Killenbeck, Alan wrote:
>
>> I've been asked to try to make alerts only send 2 emails, at most, and
>> still send a RECOVERED message when things recover.
>>
>> As a quick test, I set DURATION<30, and REPEAT=15, and after forcing
>> a service down, for over an hour, achieved the two email alerts - but
>> after bringing the service back up, did not get sent a recovered
>> message.
>>
>> Is it possible to do this?

Oops, scratch my last message.  I thought the < in DURATION was pointing
the other way.  ;)

Tom
list Henrik Størner · Wed, 29 Jun 2005 22:20:02 +0200 ·
quoted from Tom Georgoulias
On Wed, Jun 29, 2005 at 08:55:31PM +0200, Henrik Stoerner wrote:
> On Wed, Jun 29, 2005 at 01:27:45PM -0400, Killenbeck, Alan wrote:
> > I've been asked to try to make alerts only send 2 emails, at most, and
> > still send a RECOVERED message when things recover.
>
> It's a bug. Will fix.

I think this patch should do it.

--- hobbitd/do_alert.c	2005/06/06 09:27:07	1.69
+++ hobbitd/do_alert.c	2005/06/29 18:58:54
@@ -960,20 +960,26 @@
 	/* At this point, we know the configuration may result in an alert. */
 	if (anymatch) (*anymatch)++;
 
-	duration = (time(NULL) - alert->eventstart);
-	if (crit && crit->minduration && (duration < crit->minduration)) {
-		traceprintf("Failed '%s' (min. duration %d<%d)\n", cfline, duration, crit->minduration);
-		if (!printmode) return 0;
-	}
+	/*
+	 * Time checks should be done on real paging messages only.
+	 * Not on recovery- or notify-messages.
+	 */
+	if (alert->state == A_PAGING) {
+		duration = (time(NULL) - alert->eventstart);
+		if (crit && crit->minduration && (duration < crit->minduration)) {
+			traceprintf("Failed '%s' (min. duration %d<%d)\n", cfline, duration, crit->minduration);
+			if (!printmode) return 0;
+		}
 
-	if (crit && crit->maxduration && (duration > crit->maxduration)) {
-		traceprintf("Failed '%s' (max. duration %d>%d)\n", cfline, duration, crit->maxduration);
-		if (!printmode) return 0;
-	}
+		if (crit && crit->maxduration && (duration > crit->maxduration)) {
+			traceprintf("Failed '%s' (max. duration %d>%d)\n", cfline, duration, crit->maxduration);
+			if (!printmode) return 0;
+		}
 
-	if (crit && crit->timespec && !timematch(crit->timespec)) {
-		traceprintf("Failed '%s' (time criteria)\n", cfline);
-		if (!printmode) return 0;
+		if (crit && crit->timespec && !timematch(crit->timespec)) {
+			traceprintf("Failed '%s' (time criteria)\n", cfline);
+			if (!printmode) return 0;
+		}
 	}
 
 	/* Check color. For RECOVERED messages, this holds the color of the alert, not the recovery state */
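
The effect of this hunk on Alan's DURATION<30 test can be modeled in a few lines. The following is a simplified sketch in Python, not the actual do_alert.c logic: it covers only the min/max duration checks (not the time-of-day criteria), assumes DURATION values are in minutes as discussed in the thread, and uses made-up names (`Criteria`, `passes_time_checks`, the state strings) for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-ins for the alert states; only A_PAGING appears in the patch.
A_PAGING = "paging"
A_RECOVERED = "recovered"


@dataclass
class Criteria:
    minduration: Optional[int] = None  # seconds, modeling DURATION>n
    maxduration: Optional[int] = None  # seconds, modeling DURATION<n


def passes_time_checks(state: str, duration: int, crit: Criteria, patched: bool) -> bool:
    """Return True when the duration criteria do not block the message."""
    if patched and state != A_PAGING:
        # After the patch, recovery and notify messages skip the time checks.
        return True
    if crit.minduration and duration < crit.minduration:
        return False
    if crit.maxduration and duration > crit.maxduration:
        return False
    return True


# Alan's test: DURATION<30 (minutes), service down for over an hour.
crit = Criteria(maxduration=30 * 60)
down_for = 65 * 60

print(passes_time_checks(A_RECOVERED, down_for, crit, patched=False))  # False
print(passes_time_checks(A_RECOVERED, down_for, crit, patched=True))   # True
print(passes_time_checks(A_PAGING, down_for, crit, patched=True))      # False
```

Before the change the recovery message is blocked by the max. duration check; afterwards only real paging messages are subject to it, so the recovery goes out while further pages stay suppressed.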