DURATION tag
list Kevin Hanrahan
I'm not sure the duration tag is working correctly in the hobbit-alerts.cfg
setup. I have tests like I/O and CPU that will spike for a short time and I
wanted to eliminate the email notifications for those spikes. I set the
DURATION tag for 10 or 20 minutes like this:
HOST=$SERVER1
    MAIL $SYSADMIN COLOR=red EXSERVICE=msgs,cpu,http,webContent REPEAT=30m RECOVERED
    MAIL $SYSADMIN COLOR=red SERVICE=cpu DURATION>20 REPEAT=30m RECOVERED
    MAIL $SYSADMIN COLOR=red SERVICE=http DURATION>10 REPEAT=1h RECOVERED
    MAIL $SYSADMIN COLOR=red SERVICE=webContent DURATION>10m REPEAT=1h RECOVERED
    MAIL $SYSADMIN COLOR=purple REPEAT=1h RECOVERED
But it seems that I get the alerts immediately. Clicking on the history
button shows it was in alarm (RED) for only 5:00 minutes, which would be the
default poll time, so I am guessing that it was actually in the alarm state
for even less than that. Has anyone else had problems with the DURATION tag?
Kevin
Note: The information contained in this email and in any attachments is
intended only for the person or entity to which it is addressed and may
contain confidential and/or privileged material. Any review,
retransmission, dissemination or other use of, or taking of any action in
reliance upon, this information by persons or entities other than the
intended recipient is prohibited. The recipient should check this email and
any attachments for the presence of viruses. Sender accepts no liability
for any damages caused by any virus transmitted by this email. If you have
received this email in error, please notify us immediately by replying to
the message and delete the email from your computer. This e-mail is and any
response to it will be unencrypted and, therefore, potentially unsecure.
Thank you. NOVA Information Systems, Inc.
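The rules in the message above mix three spellings of the threshold (DURATION>20, DURATION>10 and DURATION>10m). The following is a minimal Python sketch of how such a condition could be evaluated. It assumes that a bare number means minutes and that the suffixes m, h and d stand for minutes, hours and days; the function names are hypothetical and this is a model for reasoning about the rules, not Hobbit's own parser.

```python
import re

# Assumed unit table: minutes per unit (a bare number is treated as minutes).
UNIT_MINUTES = {"m": 1, "h": 60, "d": 24 * 60}


def duration_minutes(spec):
    """Convert a spec such as '20', '10m', '1h' or '2d' to minutes."""
    match = re.fullmatch(r"(\d+)([mhd]?)", spec)
    if match is None:
        raise ValueError(f"unrecognized duration: {spec!r}")
    number, unit = match.groups()
    return int(number) * UNIT_MINUTES[unit or "m"]


def duration_rule_matches(spec, minutes_in_state):
    """Model of DURATION>spec: true only once the state has lasted longer."""
    return minutes_in_state > duration_minutes(spec)


# A status that has been red for 5 minutes (one poll interval).
for spec in ("20", "10m", "1h"):
    print(spec, duration_rule_matches(spec, 5))
```

Under these assumptions none of the three thresholds is exceeded after a single 5-minute poll interval, so a rule carrying one of these DURATION conditions should not be the one that sends the mail.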
list Henrik Størner
On Mon, Mar 14, 2005 at 05:06:20PM -0500, user-fd47fec4b039@xymon.invalid wrote:
I'm not sure the duration tag is working correctly in the hobbit-alerts.cfg
setup. I have tests like I/O and CPU that will spike for a short time and I
wanted to eliminate the email notifications for those spikes. I set the
DURATION tag for 10 or 20 minutes like this:
HOST=$SERVER1
    MAIL $SYSADMIN COLOR=red EXSERVICE=msgs,cpu,http,webContent REPEAT=30m RECOVERED
    MAIL $SYSADMIN COLOR=red SERVICE=cpu DURATION>20 REPEAT=30m RECOVERED

The messages you get - are they alert messages or recovery messages?
I suppose you're running RC5 plus the patch I sent you for the duplicate recovery messages?

Regards,
Henrik
list Kevin Hanrahan
I typically get the alert message closely followed by the recovery message, which doesn't make sense: with a poll time of 5 minutes I would expect some separation between the alert and the recovery, but they seem to have the same timestamp. I will verify that.

Yes, I am running RC5 plus the patch for the duplicate recovery messages.

I need to qualify this: it seems that this is a problem, but I can't say for certain yet, since I haven't tried to manually cause high CPU load for <20 minutes. I will try that. I just wanted to see if anyone else had seen any similar symptoms.

Kevin
list Henrik Størner
On Tue, Mar 15, 2005 at 12:32:26AM -0500, Kevin Hanrahan wrote:
I typically get the alert message closely followed by the recovery message, which doesn't make sense: with a poll time of 5 minutes I would expect some separation between the alert and the recovery, but they seem to have the same timestamp. I will verify that. Yes, I am running RC5 plus the patch for the duplicate recovery messages. I need to qualify this: it seems that this is a problem, but I can't say for certain yet, since I haven't tried to manually cause high CPU load for <20 minutes. I will try that. I just wanted to see if anyone else had seen any similar symptoms.

I use the DURATION setting myself, and haven't seen any alerts where it was not observed.

I'd like you to dig into the history logs for one of these occurrences and get the timestamps for when it went red and then back to green, and then correlate that with the notifications.log file to see when the alert and recovery messages were sent. If you'd rather not, then just send me the ~/data/hist/HOSTNAME.cpu file and the output from "grep HOSTNAME ~/data/acks/notifications.log".

Regards,
Henrik
list Kevin Hanrahan
I have been seeing what I think is a problem with the "DURATION" tag. I keep
getting alerted for very short outages on different tests when I have a
duration tag that I don't think is ever exceeded. For instance, I have the
following rule:
HOST=$UNIXPROD
    MAIL $SYSADMIN COLOR=red EXSERVICE=cpu,iostat,vmio,oracle,oracle9 REPEAT=30m RECOVERED
    MAIL $SYSADMIN COLOR=red SERVICE=oracle DURATION>10m REPEAT=30m RECOVERED
    MAIL $SYSADMIN COLOR=red SERVICE=oracle9 DURATION>10m REPEAT=30m RECOVERED
    MAIL $SYSADMIN COLOR=red SERVICE=cpu DURATION>1h REPEAT=1h RECOVERED TIME=W:0800:1700
    MAIL $SYSADMIN COLOR=purple REPEAT=1h RECOVERED
Then I got this alert:
red Sat Mar 26 21:23:09 EST 2005 Oracle test on "RM01": WARNING
And here is the data from the "hist" log:
[root@sknxmon02 hist]# tail sfdomain2.oracle
Sat Mar 26 20:43:05 2005 yellow 1111887785 2404
Sat Mar 26 21:23:09 2005 red 1111890189
I get alerted immediately upon a red state! I have put in durations of up to
5 hours, just to be sure, but when a test goes red, I get the alert right
away.
Does anybody else have these problems?
I am running RC5 with all patches.
Thanks
Kevin
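The timestamps in the message above can be cross-checked with a short Python sketch. It reads the two history lines and the timestamp from the alert text exactly as quoted, and assumes that the two trailing numbers on a history line are the state's start time in epoch seconds and its duration in seconds. That reading is an inference from the sample itself (1111887785 + 2404 = 1111890189), not something confirmed in this thread. The helper names are hypothetical.

```python
from datetime import datetime

# History lines and alert timestamp copied from the message above.
HIST_LINES = [
    "Sat Mar 26 20:43:05 2005 yellow 1111887785 2404",
    "Sat Mar 26 21:23:09 2005 red 1111890189",
]
ALERT_STAMP = "Sat Mar 26 21:23:09 EST 2005"

TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_hist_line(line):
    """Return (local_time, color, start_epoch, duration_seconds_or_None)."""
    fields = line.split()
    local_time = datetime.strptime(" ".join(fields[:5]), TIME_FORMAT)
    color = fields[5]
    start_epoch = int(fields[6])
    # The newest entry has no duration yet because the state is still open.
    duration = int(fields[7]) if len(fields) > 7 else None
    return local_time, color, start_epoch, duration


def parse_alert_stamp(stamp):
    """Parse the alert timestamp, dropping the time zone token (e.g. EST)."""
    parts = stamp.split()
    del parts[4]
    return datetime.strptime(" ".join(parts), TIME_FORMAT)


yellow, red = [parse_hist_line(line) for line in HIST_LINES]
alert_time = parse_alert_stamp(ALERT_STAMP)

# Seconds the status had been red at the time shown in the alert text.
seconds_red_before_alert = (alert_time - red[0]).total_seconds()

print("yellow lasted", yellow[3], "seconds")
print("yellow end equals red start:", yellow[2] + yellow[3] == red[2])
print("seconds red before alert:", seconds_red_before_alert)
```

With the sample lines, the timestamp in the alert text coincides with the moment the status turned red, which is consistent with the observation that the alert fires without waiting for the 10-minute threshold. The timestamp in the alert text may be the status timestamp rather than the time the mail was sent, so notifications.log remains the authoritative source for send times.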
list Henrik Størner
On Sat, Mar 26, 2005 at 09:40:19PM -0500, user-fd47fec4b039@xymon.invalid wrote:
I have been seeing what I think is a problem with the "DURATION" tag. I keep getting alerted for very short outages on different tests when I have a duration tag that I don't think is ever exceeded.

I'd like you to add the "--cfid" and "--trace=FILENAME" options to the hobbitd_alert command in hobbitlaunch.cfg. --cfid will make the alerts include the line number in hobbit-alerts.cfg that triggered the alert. --trace will make it dump all of the rule matching it performs to the log file.

I suspect you have an UNMATCHED rule somewhere in your configuration that triggers these alerts ...

Regards,
Henrik