Xymon Mailing List Archive

hobbit-alerts.cfg: behaviour of TIME and DURATION together

4 messages in this thread

list Sebastian Auriol · Tue, 27 Jan 2009 14:34:46 -0000 ·
It seems the combination of TIME=W:0845:2355 and DURATION>15 in
hobbit-alerts.cfg means the earliest an alert can be sent out is 9 am.  Is
this what you would expect?  I would have expected these two rules to mean
the test should be in an alarm colour for more than 15 minutes and be
between the times of 08:45 and 23:55, weekdays.  Instead it seems to be
relating the DURATION with the time such that the DURATION only applies
_during_ the TIME.
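
To make the difference concrete, here is a minimal Python sketch of the two readings. It is not taken from the Hobbit source; the function names and the way the window is modelled are assumptions for illustration, and it only reproduces the behaviour reported above.

```python
from datetime import datetime, timedelta

WINDOW_START = (8, 45)                 # from TIME=W:0845:2355
WINDOW_END = (23, 55)
MIN_DURATION = timedelta(minutes=15)   # from DURATION>15


def in_window(t: datetime) -> bool:
    """True on weekdays (Mon-Fri) between 08:45 and 23:55."""
    return t.weekday() < 5 and WINDOW_START <= (t.hour, t.minute) <= WINDOW_END


def expected_alert(went_red: datetime, now: datetime) -> bool:
    """Expected reading: TIME and DURATION are checked independently."""
    return in_window(now) and (now - went_red) > MIN_DURATION


def observed_alert(went_red: datetime, now: datetime) -> bool:
    """Observed reading (assumed model): only time inside the window counts."""
    if not in_window(now):
        return False
    window_open = now.replace(hour=WINDOW_START[0], minute=WINDOW_START[1],
                              second=0, microsecond=0)
    counted_from = max(went_red, window_open)
    return (now - counted_from) > MIN_DURATION


went_red = datetime(2009, 1, 27, 3, 0)   # a Tuesday, red since 03:00
print(expected_alert(went_red, datetime(2009, 1, 27, 8, 50)))
print(observed_alert(went_red, datetime(2009, 1, 27, 8, 50)))
print(observed_alert(went_red, datetime(2009, 1, 27, 9, 1)))
```

Under the first reading a status that went red at 03:00 would alert as soon as the window opens at 08:45; under the second, nothing can be sent until just after 09:00.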

If the current behaviour is intended, then will using EXTIME instead of TIME
be what I want?  Oh!  There is no EXTIME?!  I assumed there was but I see no
documentation for it apart from Henrik's suggestion that he might add it:
http://www.hswn.dk/hobbiton/2006/06/msg00417.html

Kind regards,

SebA
list Sebastian Auriol · Wed, 28 Jan 2009 20:03:20 -0000 ·
Bizarrely and somewhat contradictory to the behaviour below is the behaviour
of DURATION well inside of the times specified with the TIME rule.  Is
DURATION not reset when the colour of the alert changes???  That seems to be
the only explanation for what I'm seeing (though it is early days to be
certain).  Or, to put it another way, is DURATION the non-green DURATION,
rather than the duration of being in a certain colour?
 
The config I currently have is:
 
$pg-sebsms=user-2955bfa8f3cb@xymon.invalid TIME=W:0845:2355
 
HOST=DbR1 SERVICE=Special
     MAIL user-772ac99e3df0@xymon.invalid COLOR=red DURATION>2 REPEAT=30 RECOVERED
     MAIL $pg-sebsms COLOR=red DURATION>15 REPEAT=300 RECOVERED
 
I was hoping (and expecting) the above rules to only alert after 2 minutes
and 15 minutes respectively of being red, given that COLOR=red is part of the
rule.  I do, however, acknowledge that there may be (rare) cases where you
would want to include the yellow time in the DURATION.  In which case, we
really need REDDURATION, YELLOWDURATION and PURPLEDURATION rules.  Or
perhaps just a way of specifying how you want the DURATION to be calculated
in that rule: DURATIONTYPE=<NONGREEN|LASTCHANGE> (that's either or).  Or
even more powerfully: DURATIONCALC=color[,color] (adds up the duration of
being in these colour states).  (However, this could become resource
intensive if you specify DURATIONCALC=red,yellow,purple,green or something!
On the other hand, one only needs to check back as far as DURATION, rather
than calculate the total time in these colour states.)
 
I am using Hobbit 4.3 (trunk) from Dec 9 2008.
 
Looking carefully at 'man hobbitd_alert' this appears to be most relevant
part:
'When a status first goes to one of the ALERTCOLORS, hobbitd_alert is
notified of this change. It notes that the status is now in an alert state,
and records the timestamp when this event started, and adds the alert to the
list statuses that may potentially trigger one or more alert messages.'
I do not, however, think that this timestamp should be what is used by the
DURATION rule (it being far too simplistic), but it looks like it may very
well be.  Maybe this explains the behaviour I have with Big Brother's rules
that I always considered a weird bug:  sometimes the 'initial page delay' is
not respected.  This actually happened twice today and I got SMSes
simultaneously from BB and Hobbit when they had 5 minute and 15 minute
initial page delays respectively, and I got the SMS immediately after the
red.  It had however been yellow for some time before, but on BB my
pagelevels is set to "red purple", so the yellow should have been ignored
and not come into the equation.  How frustrating!  One of the main reasons I
wanted to move to Hobbit was to eliminate this 'bug' in Big Brother!
 
Still awaiting a reply on my message below BTW.  Given my unfortunate
theory, above, on what is going here, I suspect the TIME rule is causing
this magic timestamp to never be recorded!  Somehow it appears to be taking
precedence over the DURATION rule when I wish the DURATION rule to take
precedence (and I think that is more logical: if I wanted to mark it as
downtime, I'd have put the TIME rule into bb-hosts not hobbit-alerts.cfg!).
;)
 
'man hobbitd_alert' could be clearer, e.g. on how rules interact with each
other!
 
Many thanks,
 
SebA
quoted from Sebastian Auriol


From: SebA [mailto:user-7b2156f36779@xymon.invalid] 
Sent: 27 January 2009 14:35
To: user-ae9b8668bcde@xymon.invalid
Subject: [hobbit] hobbit-alerts.cfg: behaviour of TIME and DURATION together


It seems the combination of TIME=W:0845:2355 and DURATION>15 in
hobbit-alerts.cfg means the earliest an alert can be sent out is 9 am.  Is
this what you would expect?  I would have expected these two rules to mean
the test should be in an alarm colour for more than 15 minutes and be
between the times of 08:45 and 23:55, weekdays.  Instead it seems to be
relating the DURATION with the time such that the DURATION only applies
_during_ the TIME.

If the current behaviour is intended, then will using EXTIME instead of TIME
be what I want?  Oh!  There is no EXTIME?!  I assumed there was but I see no
documentation for it apart from Henrik's suggestion that he might add it:

http://www.hswn.dk/hobbiton/2006/06/msg00417.html

Kind regards, 

SebA
list Henrik Størner · Tue, 10 Feb 2009 11:31:00 +0000 (UTC) ·
quoted from Sebastian Auriol
In <user-2cd87595a7a8@xymon.invalid> "SebA" <user-7b2156f36779@xymon.invalid> writes:
Bizarrely and somewhat contradictory to the behaviour below is the behaviour
of DURATION well inside of the times specified with the TIME rule.  Is
DURATION not reset when the colour of the alert changes???  That seems to be
the only explanation for what I'm seeing (though it is early days to be
certain).  Or, to put it another way, is DURATION the non-green DURATION,
rather than the duration of being in a certain colour?

You are correct - DURATION is the time the status has been in a
"potentially alerting state", i.e. yellow, red or purple.
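
As an illustration of the distinction confirmed here, the following Python sketch computes both durations from a list of colour changes. It is a simplified model, not Xymon code; the history format (minute, colour) and the function names are assumptions made for the example.

```python
ALERT_COLORS = {"yellow", "red", "purple"}


def alert_state_minutes(history, now):
    """Minutes since the status last entered an alert colour from a
    non-alert colour (the behaviour described in the reply above)."""
    start = None
    for minute, color in history:      # history is sorted by time
        if color in ALERT_COLORS:
            if start is None:
                start = minute
        else:
            start = None               # a green status clears the clock
    return 0 if start is None else now - start


def current_color_minutes(history, now):
    """Minutes since the most recent colour change (what the rule author
    expected DURATION to mean)."""
    if not history:
        return 0
    return now - history[-1][0]


# green at minute 0, yellow at minute 10, red at minute 40
history = [(0, "green"), (10, "yellow"), (40, "red")]
print(alert_state_minutes(history, now=41))    # 31
print(current_color_minutes(history, now=41))  # 1
```

With COLOR=red DURATION>15, the first figure already exceeds 15 one minute after the status turns red, which would explain an alert arriving almost immediately after the change to red.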
quoted from Sebastian Auriol

The config I currently have is:

$pg-sebsms=user-2955bfa8f3cb@xymon.invalid TIME=W:0845:2355

HOST=DbR1 SERVICE=Special
    MAIL user-772ac99e3df0@xymon.invalid COLOR=red DURATION>2 REPEAT=30 RECOVERED
    MAIL $pg-sebsms COLOR=red DURATION>15 REPEAT=300 RECOVERED

I was hoping (and expecting) the above rules to only alert after 2 minutes
and 15 minutes respectively of being red, given that COLOR=red is part of the
rule.  I do, however, acknowledge that there may be (rare) cases where you
would want to include the yellow time in the DURATION.  In which case, we
really need REDDURATION, YELLOWDURATION and PURPLEDURATION rules.  Or
perhaps just a way of specifying how you want the DURATION to be calculated
in that rule: DURATIONTYPE=<NONGREEN|LASTCHANGE> (that's either or).  Or
even more powerfully: DURATIONCALC=color[,color] (adds up the duration of
being in these colour states).  (However, this could become resource
intensive if you specify DURATIONCALC=red,yellow,purple,green or something!
On the other hand, one only needs to check back as far as DURATION, rather
than calculate the total time in these colour states.)

I agree that the way it works currently is not entirely what you would
expect from the rules you have. What would probably be best is for Xymon
to calculate the duration based on the COLOR settings defined for the
alert (so for your rules, the alerts would trigger 2 and 15 minutes
respectively after the status went red, and yellow time would be ignored).

The problem with that approach is that it breaks down when a status
wobbles between yellow and red, e.g. a disk that is filled to just around
the critical level: you could end up in a situation where you wouldn't
get any alerts because it didn't stay red long enough to exceed the
color-specific DURATION setting.
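
A small Python sketch of that wobbling case, using one sample per minute. This is a toy model under stated assumptions (the `first_alert` helper and the sample list are invented for illustration), not how xymond_alert is implemented.

```python
def first_alert(samples, threshold, counted):
    """Return the first minute at which the status is red and the unbroken
    run of `counted` colours is longer than `threshold` minutes, else None."""
    run = 0
    for minute, color in enumerate(samples):
        run = run + 1 if color in counted else 0
        if color == "red" and run > threshold:
            return minute
    return None


# A status that wobbles: 10 minutes yellow, 10 minutes red, for two hours.
samples = (["yellow"] * 10 + ["red"] * 10) * 6

# Duration counted over red only: the run never exceeds 10 minutes.
print(first_alert(samples, 15, {"red"}))                      # None

# Duration counted over any alert colour: fires during the first red spell.
print(first_alert(samples, 15, {"yellow", "red", "purple"}))  # 15
```

The colour-specific count stays silent for the whole two hours, while the count over all alert colours fires in the first red period.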


But it would probably make more sense than the current modus operandi. 
I'll see what I can do about that.


Regards,
Henrik
list Sebastian Auriol · Tue, 10 Feb 2009 15:33:57 -0000 ·
quoted from Henrik Størner
Henrik Størner <mailto:user-ce4a2c883f75@xymon.invalid> wrote:
In <user-2cd87595a7a8@xymon.invalid> "SebA" <user-7b2156f36779@xymon.invalid> writes:
<snip>
I agree that the way it works currently is not entirely what you would
expect from the rules you have. What would probably be best is for Xymon
to calculate the duration based on the COLOR settings defined for the
alert (so for your rules, the alerts would trigger 2 and 15 minutes
respectively after the status went red, and yellow time would be ignored).

The problem with that approach is that it breaks down when a status
wobbles between yellow and red, e.g. a disk that is filled to just around
the critical level: you could end up in a situation where you wouldn't
get any alerts because it didn't stay red long enough to exceed the
color-specific DURATION setting.

But it would probably make more sense than the current modus operandi.
I'll see what I can do about that.

If the alert timestamp is recorded as the first time the alert goes to one
of the colours in the COLOR rule instead of any of the ALERTCOLORS, but
recoveries are only on green, or whatever, then it would mean that this
alert for the flapping disk full message would still get sent but maybe the
2nd time it went red. So, it might be slightly better than what we have now.
However, this still wouldn't prevent lots of alerts coming to me that I
don't want, since this test can flap between yellow and red and I consider
yellow to be a sufficient degree of recovery that I don't want another alert
as soon as it goes red again.

If we look at disk in particular though, surely if it is flapping between
yellow and red the problem isn't too serious. If one does want an alert for
this, one can eliminate the DURATION rule. If one does not, the DURATION
rule should be a way of preventing alerts for the flapping behaviour. This
is what I've always considered the use of the DURATION rule (although I was
wrong given the way it is currently working).

Perhaps a more flexible and useful solution, while still remaining easy to
use, is to incorporate the change you suggest with a RECOVERY= rule in the
alerts, so each rule can specify what colour constitutes a recovery. This
means that some tests can have yellow while others have green, allowing for
different alerting behaviour for flapping depending on the test, and it also
allows those who get notified of recoveries to have this information when
they want. :)
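
To show how such a per-rule recovery colour might behave, here is a rough Python sketch. RECOVERY= is only a proposal in this thread and does not exist in Hobbit as far as this thread shows; the `alert_minutes` helper, its parameters and the one-sample-per-minute timeline are all hypothetical.

```python
def alert_minutes(samples, threshold, alert_colors, recovery_colors):
    """Return the minutes at which a first alert would be sent, one per
    incident. The incident clock starts when the status first enters one of
    `alert_colors` and is cleared only by one of `recovery_colors`."""
    started = None     # minute the current incident began
    alerted = False    # whether this incident has already alerted
    sent = []
    for minute, color in enumerate(samples):
        if color in recovery_colors:
            started, alerted = None, False
        elif color in alert_colors and started is None:
            started = minute
        if (started is not None and not alerted
                and color in alert_colors and minute - started > threshold):
            sent.append(minute)
            alerted = True
    return sent


# Flapping disk: 10 minutes yellow, 10 minutes red, repeated for an hour.
samples = (["yellow"] * 10 + ["red"] * 10) * 3

# Recovery only on green: the alert goes out the second time it is red.
print(alert_minutes(samples, 15, {"red"}, {"green"}))            # [30]

# Yellow also counts as recovery: the flapping never alerts.
print(alert_minutes(samples, 15, {"red"}, {"green", "yellow"}))  # []
```

In this model the choice of recovery colour alone decides whether a flapping test eventually alerts or stays quiet, which is the per-test flexibility suggested above.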

Did you look at the original message in this thread, which was a slightly
different scenario?

Kind regards,

SebA