Xymon Mailing List Archive

TIME alert problems (still)

8 messages in this thread

list Mike Rowell · Mon, 19 Jun 2006 09:50:37 +0100 ·
Fellow Hobbitiers...

Some may remember an issue I raised a few weeks back: the TIME option in
hobbit-alerts.cfg was not being honoured, so we were getting alerted
during the blackout periods we had set. This still looks like an issue in
the snapshot I downloaded three days after the beta release.

Has anyone got any other ideas? Is TIME only honoured on built-in checks,
or something along those lines? It's the ext scripts that are causing the
alerting (they send their results to Hobbit, and Hobbit does the actual
alerting).

Some information...

 
notifications.log

Sun Jun 18 21:19:37 2006 xxxxx.aq (10.6.2.2) support  [139] 1150661977 0
Sun Jun 18 21:24:21 2006 xxxxx.aq (10.7.2.2) sysalert [137] 1150662261 0
Sun Jun 18 21:24:21 2006 xxxxx.aq (10.7.2.2) support  [139] 1150662261 0
Sun Jun 18 21:25:21 2006 xxxxx.aq (10.8.2.2) sysalert [137] 1150662321 0
Sun Jun 18 21:25:21 2006 xxxxx.aq (10.8.2.2) support  [139] 1150662321 0
Sun Jun 18 22:20:01 2006 xxxxx.aq (10.6.2.2) sysalert [137] 1150665601 0
Sun Jun 18 22:20:01 2006 xxxxx.aq (10.6.2.2) support  [139] 1150665601 0

hobbit-alerts.cfg

HOST=*
    MAIL=sysalert SERVICE=aq FORMAT=PLAIN REPEAT=1h COLOR=yellow
    MAIL=support SERVICE=aq COLOR=RED FORMAT=SMS DURATION>5 REPEAT=1h TIME=W:0900:1700 STOP

 
(These are lines 135 and 136, so it looks like it's ignoring them
totally, although bb-hostsvc.sh shows them laid out properly, with
the correct blackout times listed against the services.) As you can see
from the information above, even though the aq service is set to alert
only on W(eekdays) between 0900 and 1700, we were still getting alerts
over the weekend.

I also have the same problem with another service; this one was just
the easiest to get the information for.

Regards,

Mike Rowell


This email has been scanned for all viruses by the MessageLabs service. 
list Henrik Størner · Mon, 19 Jun 2006 11:32:48 +0200 ·
On Mon, Jun 19, 2006 at 09:50:37AM +0100, Mike Rowell wrote:
> Sun Jun 18 21:19:37 2006 xxxxx.aq (10.6.2.2) support  [139] 1150661977 0
> Sun Jun 18 21:24:21 2006 xxxxx.aq (10.7.2.2) sysalert [137] 1150662261 0
> Sun Jun 18 21:24:21 2006 xxxxx.aq (10.7.2.2) support  [139] 1150662261 0
> Sun Jun 18 21:25:21 2006 xxxxx.aq (10.8.2.2) sysalert [137] 1150662321 0
> Sun Jun 18 21:25:21 2006 xxxxx.aq (10.8.2.2) support  [139] 1150662321 0
> Sun Jun 18 22:20:01 2006 xxxxx.aq (10.6.2.2) sysalert [137] 1150665601 0
> Sun Jun 18 22:20:01 2006 xxxxx.aq (10.6.2.2) support  [139] 1150665601 0
> HOST=*
>     MAIL=sysalert SERVICE=aq FORMAT=PLAIN REPEAT=1h COLOR=yellow
>     MAIL=support SERVICE=aq COLOR=RED FORMAT=SMS DURATION>5 REPEAT=1h TIME=W:0900:1700 STOP
>
> (These are lines 135 and 136, so it looks like it's ignoring them
> totally, although bb-hostsvc.sh shows them laid out properly, with
> the correct blackout times listed against the services.)

What's on lines 137 and 139 of the hobbit-alerts.cfg file? Those are
the lines that trigger these alerts, as evidenced by the "[13x]"
in the log entries.
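
The "[13x]" field can also be pulled out of the log mechanically. The following is a minimal Python sketch, assuming the notifications.log layout seen in the lines quoted in this thread (timestamp, host.service, IP address, recipient, config line in brackets, then two numeric fields). The function name `parse_notification` and the dictionary keys are illustrative and not part of Hobbit; the first numeric field looks like a Unix timestamp, and the last one is left uninterpreted.

```python
import re
from datetime import datetime

# Layout inferred from the log lines in this thread; not an official format spec.
LINE_RE = re.compile(
    r"^(?P<when>\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}) "
    r"(?P<hostsvc>\S+) \((?P<ip>[\d.]+)\) "
    r"(?P<recipient>\S+)\s+\[(?P<cfgline>\d+)\] "
    r"(?P<epoch>\d+)\s+(?P<last>\d+)$"
)


def parse_notification(line):
    """Split one notifications.log line into fields, or return None if it does not fit."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # Split "host.service" at the last dot, since host names may contain dots.
    host, _, service = m.group("hostsvc").rpartition(".")
    return {
        "when": datetime.strptime(m.group("when"), "%a %b %d %H:%M:%S %Y"),
        "host": host,
        "service": service,
        "recipient": m.group("recipient"),
        "cfgline": int(m.group("cfgline")),  # line of hobbit-alerts.cfg that fired
        "epoch": int(m.group("epoch")),
    }


entry = parse_notification(
    "Sun Jun 18 21:19:37 2006 xxxxx.aq (10.6.2.2) support  [139] 1150661977 0"
)
print(entry["recipient"], entry["cfgline"], entry["when"].strftime("%a"))
```

For the first log line in the thread this reports recipient "support", config line 139 and a Sunday timestamp, which is why the rules on lines 137 and 139, rather than 135 and 136, are the ones to look at.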


Regards,
Henrik
list Mike Rowell · Mon, 19 Jun 2006 10:38:17 +0100 ·
Henrik,

On lines 137 and 139 we have the catch-alls for sysalert and support
(support is our red address, and sysalert is where we send both red and
yellow).

Mike


list Henrik Størner · Mon, 19 Jun 2006 12:42:58 +0200 ·
On Mon, Jun 19, 2006 at 10:38:17AM +0100, Mike Rowell wrote:
> Henrik,
>
> On lines 137 and 139 we have the catch-alls for sysalert and support
> (support is our red address, and sysalert is where we send both red and
> yellow).

Well, those catch-all rules are what trigger the alerts you don't want.
They probably have an "UNMATCHED" setting? But that will also cause
them to be applied when the rules above them are skipped due to time
constraints.

In other words, if you have a setup like

  HOST=myhost TEST=mytest
      MAIL user-c0b4a5e3f417@xymon.invalid TIME=W:0800:1700

  HOST=*
      MAIL user-9a4e95710e98@xymon.invalid UNMATCHED

then "user-9a4e95710e98@xymon.invalid" will get all myhost.mytest alerts
that happen outside the weekdays-0800-1700 time window.
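
A rough way to picture this is the simplified Python model below. It is not Hobbit's actual code: it assumes an UNMATCHED recipient is used only when no other rule produced a recipient for the event, and that a rule whose TIME window does not cover the event counts as not matching. The `Rule` tuple, `recipients_for`, and the recipient names "dayshift" and "catchall" are made up for illustration.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    recipient: str
    in_time_window: bool     # did the rule's TIME criterion (if any) pass for this event?
    unmatched: bool = False  # rule carries the UNMATCHED keyword


def recipients_for(rules):
    """Return the recipients alerted for one event under the assumptions above."""
    normal = [r.recipient for r in rules if r.in_time_window and not r.unmatched]
    if normal:
        return normal
    # Nobody else was alerted, so the UNMATCHED rules take over.
    return [r.recipient for r in rules if r.in_time_window and r.unmatched]


# Event at a weekend: the TIME=W:0800:1700 rule is skipped, the catch-all fires.
weekend = [Rule("dayshift", in_time_window=False),
           Rule("catchall", in_time_window=True, unmatched=True)]
# Event on a weekday at 10:00: the first rule matches, so UNMATCHED stays silent.
weekday = [Rule("dayshift", in_time_window=True),
           Rule("catchall", in_time_window=True, unmatched=True)]

print(recipients_for(weekend))
print(recipients_for(weekday))
```

In this model a rule skipped for time reasons is indistinguishable from a rule that never applied, which is exactly why the catch-all picks up the out-of-hours alerts.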


Regards,
Henrik
list Mike Rowell · Mon, 19 Jun 2006 11:55:53 +0100 ·
Henrik,

So what you're saying is that when you have a TIME blackout window for a
service, even if the last rule for that service has STOP after it, the
alerts continue until it finds a rule it can send with?

If that is what you are saying, it is not what I would expect.
Just so you can see, these are lines 137 and 139:

MAIL=user-d5da4a3e59bc@xymon.invalid COLOR=red,yellow REPEAT=1h FORMAT=PLAIN
MAIL=user-fca9e44cc8cf@xymon.invalid COLOR=RED FORMAT=SMS DURATION>5 REPEAT=1h

Regards,

Mike


list Henrik Størner · Mon, 19 Jun 2006 14:47:22 +0200 ·
Hi Mike,

On Mon, Jun 19, 2006 at 11:55:53AM +0100, Mike Rowell wrote:
> So what you're saying is that when you have a TIME blackout window for a
> service, even if the last rule for that service has STOP after it, the
> alerts continue until it finds a rule it can send with?

Yes.

> If that is what you are saying, it is not what I would expect.

OK, let me try and explain why that is. From your other email I gather
your alert configuration (lines 134-139) is like this:

HOST=*
    MAIL=sysalert SERVICE=aq FORMAT=PLAIN REPEAT=1h COLOR=yellow
    MAIL=support SERVICE=aq COLOR=RED FORMAT=SMS DURATION>5 REPEAT=1h TIME=W:0900:1700 STOP
    MAIL=user-d5da4a3e59bc@xymon.invalid COLOR=red,yellow REPEAT=1h FORMAT=PLAIN
    MAIL=user-fca9e44cc8cf@xymon.invalid COLOR=RED FORMAT=SMS DURATION>5 REPEAT=1h

The STOP keyword means (from the man-page):
       "STOP Stop looking for more recipients after this one matches."
So STOP only applies to rules that are positively matched (i.e. they did
result in an alert being sent).

If STOP meant "after seeing this rule, whether it matched or not, stop
looking for any more recipients", then your last two lines (the "catch-all"
rules) would never trigger, because there's a STOP rule in front of them.
And that is not what you would expect either.
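
The behaviour described above can be sketched in Python. This is a toy model, not the Hobbit implementation: it keeps only the SERVICE, COLOR, TIME and STOP criteria from the four rules quoted in this message, ignores DURATION, REPEAT and FORMAT, treats the end of the TIME window as exclusive, and identifies each rule by its hobbit-alerts.cfg line number (135, 136, 137 and 139, as given in the thread). `Rule`, `matches` and `triggered_lines` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class Rule:
    line: int                                         # line number in hobbit-alerts.cfg
    colors: frozenset                                 # COLOR= values the rule accepts
    service: Optional[str] = None                     # SERVICE= value, None = any service
    weekday_window: Optional[Tuple[int, int]] = None  # TIME=W:start:end as (start, end)
    stop: bool = False                                # rule carries the STOP keyword


RULES = [
    Rule(135, frozenset({"yellow"}), service="aq"),
    Rule(136, frozenset({"red"}), service="aq", weekday_window=(900, 1700), stop=True),
    Rule(137, frozenset({"red", "yellow"})),
    Rule(139, frozenset({"red"})),
]


def matches(rule, service, color, when):
    """True if the event satisfies every criterion modelled for this rule."""
    if rule.service is not None and rule.service != service:
        return False
    if color not in rule.colors:
        return False
    if rule.weekday_window is not None:
        start, end = rule.weekday_window
        hhmm = when.hour * 100 + when.minute
        # datetime.weekday(): Monday = 0 ... Sunday = 6
        if when.weekday() >= 5 or not (start <= hhmm < end):
            return False
    return True


def triggered_lines(rules, service, color, when):
    """Walk the rules in order; STOP only takes effect on a rule that matched."""
    hits = []
    for rule in rules:
        if matches(rule, service, color, when):
            hits.append(rule.line)
            if rule.stop:
                break
    return hits


print(triggered_lines(RULES, "aq", "red", datetime(2006, 6, 18, 21, 19)))  # Sunday evening
print(triggered_lines(RULES, "aq", "red", datetime(2006, 6, 19, 10, 0)))   # Monday morning
```

In this model the Sunday event skips line 136 because its TIME window does not cover it, so its STOP never applies and lines 137 and 139 fire, matching the "[137]" and "[139]" entries in the log. The Monday-morning event matches line 136 and stops there.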

I *think* that what you want is to have "sysalert" and "support" alerted
on weekdays between 0900 and 1700, and the "systems at ..." and
"support-rightmove at ..." addresses alerted outside this time window.
May I suggest:

TIME=W:0900:1700 SERVICE=aq
    MAIL=sysalert COLOR=yellow FORMAT=PLAIN REPEAT=1h
    MAIL=support  COLOR=red    FORMAT=SMS DURATION>5 REPEAT=1h

EXTIME=W:0900:1700
    MAIL=user-d5da4a3e59bc@xymon.invalid COLOR=red,yellow REPEAT=1h FORMAT=PLAIN
    MAIL=user-fca9e44cc8cf@xymon.invalid COLOR=red FORMAT=SMS DURATION>5 REPEAT=1h


Regards,
Henrik
list Mike Rowell · Mon, 19 Jun 2006 15:58:31 +0100 ·
Thanks for this information Henrik,

One small problem: I'm running the 4.2 beta snapshot from a few days
after release, and I'm getting this in the log files:

2006-06-19 14:56:58 Ignored unknown/unexpected token 'EXTIME=W:0900:1700' at line 131
2006-06-19 14:56:58 Ignored unknown/unexpected token 'EXTIME=*:0200:0700' at line 137

Can you let us know whether we need to run the current snapshot to use
this feature?

Regards,

Mike


list Henrik Størner · Mon, 19 Jun 2006 17:06:13 +0200 ·
On Mon, Jun 19, 2006 at 03:58:31PM +0100, Mike Rowell wrote:
> Thanks for this information Henrik,
>
> One small problem: I'm running the 4.2 beta snapshot from a few days
> after release, and I'm getting this in the log files:
>
> 2006-06-19 14:56:58 Ignored unknown/unexpected token 'EXTIME=W:0900:1700' at line 131
> 2006-06-19 14:56:58 Ignored unknown/unexpected token 'EXTIME=*:0200:0700' at line 137
>
> Can you let us know whether we need to run the current snapshot to use
> this feature?

Oops - sorry. Don't have an "EXTIME" keyword, since it's simple to do
with just TIME:
  EXTIME=W:0900:1700
should be
  TIME=W:1700:0900,06:0000:2359

Which tells me that EXTIME is more readable, so perhaps I should go and
create that one...
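
As a sanity check on that rewrite, the sketch below evaluates both specifications over a whole week and confirms that they never overlap and leave no gap at the sampled times. It is a simplified Python model under assumptions not confirmed in this thread: day digits run from 0 (Sunday) to 6 (Saturday), "W" means Monday to Friday, the day test uses the current day, a window whose start is later than its end wraps past midnight, and the end time is exclusive. `in_window` and `complementary` are hypothetical helpers, not Hobbit functions.

```python
def in_window(spec, weekday, hhmm):
    """Return True if (weekday, hhmm) falls inside a TIME-style spec.

    weekday: 0 = Sunday ... 6 = Saturday (assumed); hhmm: e.g. 930 for 09:30.
    """
    for part in spec.split(","):
        days, start, end = part.split(":")
        start, end = int(start), int(end)
        if days == "*":
            day_ok = True
        elif days == "W":
            day_ok = 1 <= weekday <= 5
        else:
            day_ok = str(weekday) in days  # e.g. "06" = Sunday and Saturday
        if start <= end:
            time_ok = start <= hhmm < end
        else:
            # Assumed wrap-around: 1700:0900 covers the evening and the early morning.
            time_ok = hhmm >= start or hhmm < end
        if day_ok and time_ok:
            return True
    return False


def complementary(spec_a, spec_b):
    """Sample every half-past-the-hour in a week: exactly one spec must match."""
    return all(
        in_window(spec_a, day, hour * 100 + 30) != in_window(spec_b, day, hour * 100 + 30)
        for day in range(7)
        for hour in range(24)
    )


print(complementary("W:0900:1700", "W:1700:0900,06:0000:2359"))
```

The samples are taken at half past each hour so that the result does not depend on how the real parser treats the boundary minutes (0900, 1700, 2359), which this thread does not specify.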


Regards,
Henrik