Xymon Mailing List Archive

DURATION tag

6 messages in this thread

list Kevin Hanrahan · Mon, 14 Mar 2005 17:06:20 -0500 ·
I'm not sure the duration tag is working correctly in the hobbit-alerts.cfg
setup. I have tests like I/O and CPU that will spike for a short time and I
wanted to eliminate the email notifications for those spikes. I set the
DURATION tag for 10 or 20 minutes like this:


HOST=$SERVER1
        MAIL $SYSADMIN COLOR=red EXSERVICE=msgs,cpu,http,webContent REPEAT=30m RECOVERED
        MAIL $SYSADMIN COLOR=red SERVICE=cpu DURATION>20 REPEAT=30m RECOVERED
        MAIL $SYSADMIN COLOR=red SERVICE=http DURATION>10 REPEAT=1h RECOVERED
        MAIL $SYSADMIN COLOR=red SERVICE=webContent DURATION>10m REPEAT=1h RECOVERED
        MAIL $SYSADMIN COLOR=purple REPEAT=1h RECOVERED


But, it seems that I get the alerts immediately. Clicking on the history
button shows it was in alarm (RED) for only 5:00 minutes which would be the
default poll time so I am guessing that it was actually in the alarm state
for even less than that. Has anyone else had problems with the DURATION tag?


Kevin 

Note:  The information contained in this email and in any attachments is
intended only for the person or entity to which it is addressed and may
contain confidential and/or privileged material.  Any review,
retransmission, dissemination or other use of, or taking of any action in
reliance upon, this information by persons or entities other than the
intended recipient is prohibited.  The recipient should check this email and
any attachments for the presence of viruses.  Sender accepts no liability
for any damages caused by any virus transmitted by this email. If you have
received this email in error, please notify us immediately by replying to
the message and delete the email from your computer.  This e-mail is and any
response to it will be unencrypted and, therefore, potentially unsecure.
Thank you.  NOVA Information Systems, Inc.
list Henrik Størner · Mon, 14 Mar 2005 23:20:45 +0100 ·
quoted from Kevin Hanrahan
On Mon, Mar 14, 2005 at 05:06:20PM -0500, user-fd47fec4b039@xymon.invalid wrote:
I'm not sure the duration tag is working correctly in the hobbit-alerts.cfg
setup. I have tests like I/O and CPU that will spike for a short time and I
wanted to eliminate the email notifications for those spikes. I set the
DURATION tag for 10 or 20 minutes like this:


HOST=$SERVER1
        MAIL $SYSADMIN COLOR=red EXSERVICE=msgs,cpu,http,webContent REPEAT=30m RECOVERED
        MAIL $SYSADMIN COLOR=red SERVICE=cpu DURATION>20 REPEAT=30m RECOVERED
The messages you get - are they alert messages or recovery messages?

I suppose you're running RC5 plus the patch I sent you for the
duplicate recovery messages ?


Regards,
Henrik
list Kevin Hanrahan · Tue, 15 Mar 2005 00:32:26 -0500 ·
I typically get the alert message closely followed by the recovery message,
which doesn't make sense: with a 5-minute poll time I would expect some
separation between the alert and the recovery, but they seem to have the
same timestamp. I will verify that.

Yes, I am running RC5 plus the patch for the duplicate recovery messages


I need to qualify this: it seems that this is a problem, but I can't say for
certain yet since I haven't tried to manually cause high CPU load for less
than 20 minutes. I will try that. I just wanted to see if anyone else had
seen any similar symptoms.


Kevin


list Henrik Størner · Wed, 16 Mar 2005 10:08:09 +0100 ·
quoted from Kevin Hanrahan
On Tue, Mar 15, 2005 at 12:32:26AM -0500, Kevin Hanrahan wrote:
I typically get the alert message closely followed by the recovery message
which doesn't make sense since if the poll time is 5 min. I would expect the
alert and recovery to have some separation in them but they seem to have the
same timestamp....I will verify that.

Yes, I am running RC5 plus the patch for the duplicate recovery messages

I need to qualify this...It seems that this is a problem but I can't say for
certain yet since I haven't tried to manually cause high cpu load for <20
minutes but I will try that. I just wanted to see if anyone else had seen
any similar symptoms
I use the DURATION setting myself, and haven't seen any alerts where
it was not observed.

I'd like you to dig into the history logs for one of these
occurrences and get the timestamps for when it went red and then back
to green, and then correlate that with the notifications.log file of
when alert- and recovery-messages were sent.

If you'd rather not, then just send me the ~/data/hist/HOSTNAME.cpu
file and the output from "grep HOSTNAME ~/data/acks/notifications.log".


Regards,
Henrik
list Kevin Hanrahan · Sat, 26 Mar 2005 21:40:19 -0500 ·
I have been seeing what I think is a problem with the "DURATION" tag. I keep
getting alerted for very short outages on different tests when I have a
duration tag that I don't think is ever exceeded. For instance, I have the
following rule:


HOST=$UNIXPROD
        MAIL $SYSADMIN COLOR=red EXSERVICE=cpu,iostat,vmio,oracle,oracle9 REPEAT=30m RECOVERED
        MAIL $SYSADMIN COLOR=red SERVICE=oracle DURATION>10m REPEAT=30m RECOVERED
        MAIL $SYSADMIN COLOR=red SERVICE=oracle9 DURATION>10m REPEAT=30m RECOVERED
        MAIL $SYSADMIN COLOR=red SERVICE=cpu DURATION>1h REPEAT=1h RECOVERED TIME=W:0800:1700
        MAIL $SYSADMIN COLOR=purple REPEAT=1h RECOVERED


Then I got this alert:

red Sat Mar 26 21:23:09 EST 2005 Oracle test on "RM01": WARNING


And here is the data from the "hist" log


[root@sknxmon02 hist]# tail sfdomain2.oracle
Sat Mar 26 20:43:05 2005 yellow 1111887785 2404
Sat Mar 26 21:23:09 2005 red 1111890189


I get alerted immediately upon a red state! I have put in durations of up to
5 hours, just to be sure but when a test goes red, I get the alert right
away.
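
To make the timing concrete, here is a minimal Python sketch that compares the hist entry and the alert timestamp quoted above. It rests on several assumptions: that the fields after the date in a hist line are color, start time (epoch seconds) and how long the status lasted (seconds), that a bare DURATION number means minutes, and that EST is UTC-5. The helper names are made up for illustration and are not part of Hobbit.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a bare number in DURATION means minutes; m/h/d are unit suffixes.
UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}

def duration_seconds(spec):
    """Convert a DURATION value such as '20', '10m' or '1h' to seconds."""
    if spec[-1].isdigit():
        return int(spec) * 60
    return int(spec[:-1]) * UNIT_SECONDS[spec[-1]]

def parse_hist_line(line):
    """Return (color, start_epoch, lasted_seconds or None) for one hist line."""
    fields = line.split()
    # fields[0:5] hold the human-readable date; the rest is machine-readable.
    color, start = fields[5], int(fields[6])
    lasted = int(fields[7]) if len(fields) > 7 else None  # None: no duration recorded yet
    return color, start, lasted

hist = [
    "Sat Mar 26 20:43:05 2005 yellow 1111887785 2404",
    "Sat Mar 26 21:23:09 2005 red 1111890189",
]

# Alert timestamp "Sat Mar 26 21:23:09 EST 2005", with EST taken as UTC-5.
est = timezone(timedelta(hours=-5))
alert_epoch = int(datetime(2005, 3, 26, 21, 23, 9, tzinfo=est).timestamp())

color, red_start, _ = parse_hist_line(hist[-1])
red_for = alert_epoch - red_start          # seconds spent red when the alert fired
threshold = duration_seconds("10m")        # the DURATION>10m from the oracle rule

print(f"{color} for {red_for}s at alert time; DURATION threshold is {threshold}s")
if red_for > threshold:
    print("alert respected DURATION")
else:
    print("alert fired before DURATION elapsed")
```

Under those assumptions the alert timestamp coincides with the moment the status turned red, which is well short of the 10-minute threshold.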


Does anybody else have these problems?


I am running RC5 with all patches


Thanks

Kevin

list Henrik Størner · Sun, 27 Mar 2005 09:22:44 +0200 ·
quoted from Kevin Hanrahan
On Sat, Mar 26, 2005 at 09:40:19PM -0500, user-fd47fec4b039@xymon.invalid wrote:
I have been seeing what I think is a problem with the "DURATION" tag. I keep
getting alerted for very short outages on different tests when I have a
duration tag that I don't think is ever exceeded.
I'd like you to add the "--cfid" and "--trace=FILENAME" options to the
hobbitd_alert command in hobbitlaunch.cfg.

--cfid will make the alerts include the line-number in hobbit-alerts.cfg
that triggered the alert.

--trace will make it dump all of the rule-matching it performs to the
log-file.

I suspect you have an UNMATCHED rule somewhere in your configuration
that triggers these alerts ...
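
While waiting for trace output, a quick way to look for such a rule is to scan the configuration for the UNMATCHED keyword and note the line numbers, which can then be compared with what --cfid reports. The Python sketch below is only an illustration: the function name is made up, the sample text is hypothetical rather than anyone's real configuration, and it assumes "#" starts a comment.

```python
def find_unmatched_rules(config_text):
    """Return (line_number, text) for each non-comment line containing UNMATCHED."""
    hits = []
    for number, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if "UNMATCHED" in line.split():
            hits.append((number, line))
    return hits

# Hypothetical sample; the last two lines are invented for illustration only.
sample = """\
HOST=$UNIXPROD
        MAIL $SYSADMIN COLOR=red SERVICE=oracle DURATION>10m REPEAT=30m RECOVERED
HOST=*
        MAIL $SYSADMIN COLOR=red UNMATCHED
"""

for number, line in find_unmatched_rules(sample):
    print(f"line {number}: {line}")
```

To check a real file, read its contents into a string and pass that to find_unmatched_rules instead of the sample.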


Regards,
Henrik