Xymon Mailing List Archive

Alert Rules - DURATION not working

9 messages in this thread

list David Gore · Tue, 01 Feb 2005 01:02:58 +0000 ·
As you can see from the output below, a DURATION of '15m' translates to 653760.  Regardless, we cannot get DURATION to work under any circumstances.  I sent some earlier e-mails with logging output.  Either we have something configured wrong, or DURATION is broken.
All we would like is a rather simple rule: for any host that is red for longer than 15 minutes, send an e-mail/page and repeat it every 8 hours (shift change).  The last rule is just a hack to get around DURATION not working for us.  Perhaps I do not understand the config rules?

Rules:

HOST=% COLOR=yellow
         MAIL user-66f2c06d9d16@xymon.invalid REPEAT=8h DURATION>15
         MAIL user-bce0fa03bec0@xymon.invalid REPEAT=8h DURATION>15m

COLOR=red EXSERVICE=cpu,mem,tl1am
        SCRIPT /export/home/hobbit/server/bin/delay_page PAGE REPEAT=8h RECOVERED

Debug:

HOST=% COLOR=yellow
        MAIL user-66f2c06d9d16@xymon.invalid REPEAT=480 COLOR=yellow DURATION>15
        MAIL user-bce0fa03bec0@xymon.invalid REPEAT=480 COLOR=yellow DURATION>653760

EXSERVICE=cpu,mem,tl1am COLOR=red
        SCRIPT /export/home/hobbit/server/bin/delay_page PAGE FORMAT=SCRIPT REPEAT=480 COLOR=red RECOVERED
list Henrik Størner · Tue, 1 Feb 2005 07:54:25 +0100 ·
quoted from David Gore
On Tue, Feb 01, 2005 at 01:02:58AM +0000, David Gore wrote:
As you can see from the out put below a DURATION of '15m' translates to 
653760.
I'll look into that
quoted from David Gore
Either we have something configured wrong or DURATION is broken? 
HOST=% COLOR=yellow
        MAIL user-66f2c06d9d16@xymon.invalid REPEAT=8h DURATION>15
        MAIL user-bce0fa03bec0@xymon.invalid REPEAT=8h DURATION>15m
"HOST=%" is definitely wrong. "HOST=%.*" is what you want.


Henrik
list David Gore · Tue, 01 Feb 2005 15:52:17 +0000 ·
Henrik,

Thank you so much for replying.  I caused a yellow alarm for procs on host rsoimpm1, I am expecting the rule to fire after 15 minutes.  Here is what I see from the log file in more detail:

2005-02-01 15:17:29 hobbitd_alert: Got message 37 @@page#37|1107271049.602362|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107272849|yellow|green|1107271049|CAY/pmservers|947420
2005-02-01 15:17:29 Got page message from rsoimpm1:procs
2005-02-01 15:17:29 Alert status changed from 0 to 1
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs %.*:(NULL):(NULL)
2005-02-01 15:17:29 pcre_exec returned 1
2005-02-01 15:17:29 Checking explicit color setting 10000000020 against 4 gives 1
2005-02-01 15:17:29 Found a first matching rule
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs (NULL):(NULL):(NULL)
2005-02-01 15:17:29 event start: 1107271049, failed minduration 0<900
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs (NULL):(NULL):(NULL)
2005-02-01 15:17:29 event start: 1107271049, failed minduration 0<39225600
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs (NULL):(NULL):(NULL)
2005-02-01 15:17:29 Checking explicit color setting 10000000040 against 4 gives 0
2005-02-01 15:17:29 No more secondary matching rule
2005-02-01 15:17:29 1 alerts to go
2005-02-01 15:17:29 Compiling regex .*
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs %.*:(NULL):(NULL)
2005-02-01 15:17:29 pcre_exec returned 1
2005-02-01 15:17:29 Checking explicit color setting 10000000020 against 4 gives 1
2005-02-01 15:17:29 Found a first matching rule
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs (NULL):(NULL):(NULL)
2005-02-01 15:17:29 event start: 1107271049, failed minduration 0<900
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs (NULL):(NULL):(NULL)
2005-02-01 15:17:29 event start: 1107271049, failed minduration 0<39225600
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs (NULL):(NULL):(NULL)
2005-02-01 15:17:29 send_alert rsoimpm1:procs state 0
2005-02-01 15:17:29 Checking explicit color setting 10000000040 against 4 gives 0
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs %.*:(NULL):(NULL)
2005-02-01 15:17:29 No more secondary matching rule
2005-02-01 15:17:29 pcre_exec returned 1
2005-02-01 15:17:29 Checking explicit color setting 10000000020 against 4 gives 1
2005-02-01 15:17:29 Found a first matching rule
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs (NULL):(NULL):(NULL)
2005-02-01 15:17:29 event start: 1107271049, failed minduration 0<900
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs (NULL):(NULL):(NULL)
2005-02-01 15:17:29 event start: 1107271049, failed minduration 0<39225600
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs (NULL):(NULL):(NULL)
2005-02-01 15:17:29 Checking explicit color setting 10000000040 against 4 gives 0
2005-02-01 15:17:29 No more secondary matching rule

I caused a yellow alarm at 15:17, and so far everything looks OK: the alert status changed, the criteria, regex, and color all matched, and a rule was found.  The minduration check then fails, as it should, because the event has not yet lasted 15 minutes.  Note that I added to the debug print statement in the source code.
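The two thresholds in the log above can be checked with some quick arithmetic. The following is only a sketch using the numbers already shown in this thread; the helper name `minutes_to_seconds` is made up for illustration and is not part of Hobbit.

```python
# Values copied from the config dump and the page.log excerpt in this thread.
PLAIN_15 = 15          # "DURATION>15" as shown in the debug dump (minutes)
PARSED_15M = 653760    # what "DURATION>15m" became in the debug dump


def minutes_to_seconds(minutes: int) -> int:
    """Convert a duration in minutes to seconds (hypothetical helper)."""
    return minutes * 60


# "failed minduration 0<900" corresponds to the plain 15-minute rule.
print(minutes_to_seconds(PLAIN_15))

# "failed minduration 0<39225600" corresponds to the "15m" rule.
print(minutes_to_seconds(PARSED_15M))

# Expressed in whole days, the "15m" threshold is far beyond 15 minutes.
print(PARSED_15M // (60 * 24), "days, remainder", PARSED_15M % (60 * 24), "minutes")
```

So the first MAIL line is compared against 900 seconds, while the "15m" line is compared against a threshold of well over a year.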

2005-02-01 15:22:29 hobbitd_alert: Got message 58 @@page#58|1107271349.301483|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107273149|yellow|yellow|1107271049|CAY/pmservers|947420
2005-02-01 15:22:29 Got page message from rsoimpm1:procs
2005-02-01 15:22:29 0 alerts to go

2005-02-01 15:27:29 hobbitd_alert: Got message 79 @@page#79|1107271649.155212|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107273449|yellow|yellow|1107271049|CAY/pmservers|947420
2005-02-01 15:27:29 Got page message from rsoimpm1:procs
2005-02-01 15:27:29 0 alerts to go

2005-02-01 15:32:28 hobbitd_alert: Got message 101 @@page#101|1107271948.980583|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107273748|yellow|yellow|1107271049|CAY/pmservers|947420
2005-02-01 15:32:28 Got page message from rsoimpm1:procs
2005-02-01 15:32:28 0 alerts to go

2005-02-01 15:37:28 hobbitd_alert: Got message 123 @@page#123|1107272248.884069|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107274048|yellow|yellow|1107271049|CAY/pmservers|947420
2005-02-01 15:37:28 Got page message from rsoimpm1:procs
2005-02-01 15:37:28 0 alerts to go

So it's like nothing happens afterwards?  Hopefully, I got all the relevant parts of the log file. I didn't want the posting to be too long.  Any ideas?
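To see when the 15-minute rule could have been expected to pass, the @@page lines above can be split on "|" and the event age computed from them. This is a rough sketch, not Hobbit code: the field positions are an assumption read off the log lines (field 1 looks like the message timestamp, and field 9 matches the "event start" value printed by the debug statement), and the names `PageMessage`, `parse_page` and `event_age` are invented for the example.

```python
from typing import NamedTuple


class PageMessage(NamedTuple):
    timestamp: float   # assumed: time the message was generated
    host: str
    test: str
    color: str
    event_start: int   # matches "event start: 1107271049" in the log


def parse_page(line: str) -> PageMessage:
    """Split one @@page line into the fields used below (assumed layout)."""
    f = line.split("|")
    return PageMessage(float(f[1]), f[3], f[4], f[7], int(f[9]))


def event_age(msg: PageMessage) -> int:
    """Whole seconds between the event start and this message."""
    return int(msg.timestamp) - msg.event_start


# The five @@page messages quoted in this post, with wrapped lines rejoined.
MESSAGES = [
    "@@page#37|1107271049.602362|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107272849|yellow|green|1107271049|CAY/pmservers|947420",
    "@@page#58|1107271349.301483|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107273149|yellow|yellow|1107271049|CAY/pmservers|947420",
    "@@page#79|1107271649.155212|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107273449|yellow|yellow|1107271049|CAY/pmservers|947420",
    "@@page#101|1107271948.980583|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107273748|yellow|yellow|1107271049|CAY/pmservers|947420",
    "@@page#123|1107272248.884069|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107274048|yellow|yellow|1107271049|CAY/pmservers|947420",
]

MIN_DURATION = 15 * 60  # the "0<900" threshold from the log

for line in MESSAGES:
    msg = parse_page(line)
    age = event_age(msg)
    print(f"{msg.host}:{msg.test} age={age}s reached_minduration={age >= MIN_DURATION}")
```

Under these assumptions the message at 15:32:28 is one second short of the threshold, and the message at 15:37:28 is the first one where a 900-second minimum duration has been reached, yet the log only shows "0 alerts to go" for it.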


~David Gore
list Tom Georgoulias · Wed, 02 Feb 2005 08:56:22 -0500 ·
quoted from David Gore
David Gore wrote:
So it's like nothing happens afterwards?  Hopefully, I got all the
relevant parts of the log file. I didn't want the posting to long.  Any
ideas?
Have you made any progress on this?  I can't get the DURATION variable to work either, and this time around I'm sure a typo is not the reason for not getting an alert email.

Here's what I've done and what I see:

I added the --debug switch to hobbitd_alert in hobbitlaunch.cfg:

CMD hobbitd_channel --channel=page --log=$BBSERVERLOGS/page.log hobbitd_alert --debug

My rule from hobbit-alerts.cfg.

HOST=$FOUND_SYS
         MAIL user-20904209b1a6@xymon.invalid SERVICE=procs COLOR=red DURATION>5 REPEAT=5

After I add this rule, I restart hobbit.  I read on the list that restarting isn't necessary, but it has been my experience that changes made to hobbit-alerts.cfg do not always get put into effect unless hobbit is restarted.

Excerpts from page.log:

(note:  I replaced a valid IP address with 0s in the 3rd field of the @@page line of this excerpt)

2005-02-02 08:11:12 hobbitd_alert: Got message 4 @@page#4|1107349872.146928|0.0.0.0|foundry01.nandomedia.com|procs|0.0.0.0|1107351672|red|red|1107227163|web6|315344
2005-02-02 08:11:12 Got page message from foundry01.nandomedia.com:procs
2005-02-02 08:11:12 Alert status changed from 0 to 1
2005-02-02 08:11:12 criteriamatch foundry01.nandomedia.com:procs %(foundry.*).nandomedia.com:(NULL):(NULL)
2005-02-02 08:11:12 pcre_exec returned 2
2005-02-02 08:11:12 Checking default color setting 70 against 5 gives 1
2005-02-02 08:11:12 Found a first matching rule
2005-02-02 08:11:12 criteriamatch foundry01.nandomedia.com:procs (NULL):(NULL):procs
2005-02-02 08:11:12 failed minduration 0<300

So it looks like the duration variable was checked, which is good.  The next time I see this server in the page.log, the min duration isn't checked.

2005-02-02 08:16:12 hobbitd_alert: Got message 16 @@page#16|1107350172.517352|0.0.0.0|foundry01.nandomedia.com|procs|0.0.0.0|1107351972|red|red|1107227163|web6|315344
2005-02-02 08:16:12 Got page message from foundry01.nandomedia.com:procs
2005-02-02 08:16:12 0 alerts to go
2005-02-02 08:17:12 0 alerts to go

This message will repeat from now on, varying only in the message count #, but alerts are not sent out:

-bash-2.05b$ grep foundry data/acks/notifications.log
-bash-2.05b$

I dunno what else to investigate at this point.

Tom
list David Gore · Wed, 02 Feb 2005 14:54:58 +0000 ·
Tom,

No, I haven't found a solution.  I was hoping Henrik might find something.  Without a doubt we can launch rules every time WITHOUT a DURATION.

One of my co-workers has put in a script that launches every time we get an alert and just waits for 15 minutes before it sends out a page/e-mail, unless of course the alert has recovered in the meantime.
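The idea behind that workaround can be sketched roughly as follows. This is not the actual delay_page script, which is not shown in this thread; the function name `delayed_page` and its callback parameters are hypothetical and exist only to illustrate "wait, then page unless recovered".

```python
import time
from typing import Callable


def delayed_page(still_failing: Callable[[], bool],
                 send_page: Callable[[], None],
                 delay_seconds: float = 15 * 60,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Wait delay_seconds, then page only if the problem is still present.

    Returns True if a page was sent, False if the status had recovered.
    """
    sleep(delay_seconds)
    if still_failing():
        send_page()
        return True
    return False


if __name__ == "__main__":
    # Demonstration with a no-op sleep so the example finishes immediately.
    sent = delayed_page(still_failing=lambda: True,
                        send_page=lambda: print("paging on-call"),
                        sleep=lambda seconds: None)
    print("page sent:", sent)
```

The sleep function is passed in as a parameter so the waiting can be replaced in a test; a real script would also need some way to learn that the status recovered, which is left open here.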

Let me know, if you find out anything yourself Tom.  It is always possible we have something configured wrong.  I am running Hobbit on Solaris 9, if it matters.  I also chose not to monitor any 's' (secure) services like https during setup.

David Gore (v965-3670)
Enhanced Technology Support (ETS)
Network Management Systems (NMS)
IMPACT Transport Team Lead - SCSA, SCNA
Page: 1-800-PAG-eMCI pin 1406090
Vnet: 965-3676
list David Gore · Wed, 02 Feb 2005 15:26:25 +0000 ·
Henrik, Tom,

My 15 minute DURATION fired.  I don't think it is a coincidence that it fired at 1 day and 5 hours.  I suspect the problem lies in the possible bug mentioned earlier, where specifying 15m yields a very large number.
list Henrik Størner · Wed, 2 Feb 2005 17:45:58 +0100 ·
quoted from David Gore
On Wed, Feb 02, 2005 at 03:26:25PM +0000, David Gore wrote:
Henrik, Tom,

My 15 minute DURATION fired.  I don't think it is a coincidence that it fired at 1 day and 5 hours.  I think the earlier possible bug where when you specify 15m you get a particularly large number is probably where the problem is.
I tend to agree, but I've been too busy with "real" work these past
days, so I haven't had time to investigate it.

And the server that crashed Sunday kept me busy over the weekend. But
it definitely showed that the repeat thing works ... I got about 900
mails for different services that failed because my external gateway
was down.


Henrik
list Tom Georgoulias · Wed, 02 Feb 2005 14:43:55 -0500 ·
quoted from David Gore
On Wed, Feb 02, 2005 at 03:26:25PM +0000, David Gore wrote:
My 15 minute DURATION fired.  I don't think it is a coincidence that it
fired at 1 day and 5 hours.  I think the earlier possible bug where when
you specify 15m you get a particularly large number is probably where
the problem is.
I've been testing with DURATION>10 since I last posted to the list.
It showed up as 600s and was only tested once:

"failed minduration 0<600"

I would've expected to see something like this:

Start hobbit, it runs through all the alerts at time=0 when "duration=0"

page.log
<snip>
"failed minduration 0<600"

5 mins later, when it checks again with "duration=300"

<snip>
"failed minduration 300<600"

5 mins later, duration=minduration and it doesn't fail the test, so it's 
time to send an alert.

Or, quite possibly, I don't know what I am talking about.
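The sequence expected above can be written out as a small sketch. It only models the comparison seen in page.log ("failed minduration X<600"); the function name `duration_reached` is made up, and the 5-minute spacing is taken from the description above rather than from the Hobbit source.

```python
def duration_reached(elapsed: int, min_duration: int) -> bool:
    """True once the event has lasted at least min_duration seconds."""
    return elapsed >= min_duration


MIN_DURATION = 10 * 60  # DURATION>10, which shows up as 600 in page.log

# One check at startup, then one every 5 minutes.
for elapsed in (0, 300, 600):
    if duration_reached(elapsed, MIN_DURATION):
        print(f"duration {elapsed}s: time to send an alert")
    else:
        print(f"failed minduration {elapsed}<{MIN_DURATION}")
```

With these numbers the first two checks fail and the third one passes, which is the behaviour described above but not what page.log shows.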
quoted from Henrik Størner

Henrik Stoerner wrote:

I tend to agree, but I've been too busy with "real" work these past
days, so I haven't had time to investigate it.
If I can be of any assistance in helping debug this by testing patches 
or alert conditions, just ask.
quoted from Henrik Størner

But
if definitely showed that the repeat thing works ... I got about 900
mails for different services that failed because my external gateway
was down.
:)  Nothing better than a real-world event to stress test a monitoring
system...
list Henrik Størner · Wed, 2 Feb 2005 22:10:11 +0100 ·
quoted from Tom Georgoulias
On Wed, Feb 02, 2005 at 08:56:22AM -0500, Tom Georgoulias wrote:
HOST=$FOUND_SYS
        MAIL user-20904209b1a6@xymon.invalid SERVICE=procs COLOR=red DURATION>5 REPEAT=5

After I add this rule, I restart hobbit.  I read on the list that 
restarting isn't necessary, but it has been my experience that changes 
made to hobbit-alerts.cfg do not always get put into effect unless 
hobbit is restarted.
It shouldn't be needed, but it doesn't harm.
2005-02-02 08:11:12 criteriamatch foundry01.nandomedia.com:procs (NULL):(NULL):procs
2005-02-02 08:11:12 failed minduration 0<300
OK
quoted from David Gore
2005-02-02 08:16:12 Got page message from foundry01.nandomedia.com:procs
2005-02-02 08:16:12 0 alerts to go
And this looks suspicious.

What's supposed to happen is that after the alert is first reported to
the hobbitd_alert module, this module is supposed to keep track of
when the next alert is due (the REPEAT interval comes into play here),
and if no alerts are due then you get the "0 alerts to go" message.

So something messes up the timekeeping, and we never get around to
testing if the DURATION triggers after the first attempt.
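As a rough illustration of the intended bookkeeping described above, here is a toy model. It is not the hobbitd_alert code and says nothing about where the real bug is; the class `ActiveAlert`, the function `on_page_message`, and the choice to retry a failed DURATION check on the next message are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class ActiveAlert:
    event_start: int  # when the status went bad, in seconds
    next_due: int     # earliest time the next alert may be considered


def on_page_message(alert: ActiveAlert, now: int,
                    min_duration: int, repeat: int) -> str:
    """Decide what to do when a page message arrives at time `now`."""
    if now < alert.next_due:
        return "0 alerts to go"
    elapsed = now - alert.event_start
    if elapsed < min_duration:
        # Leave next_due untouched so the DURATION test is retried
        # when the next message for this status arrives.
        return f"failed minduration {elapsed}<{min_duration}"
    # Alert goes out; REPEAT decides when the next one is due.
    alert.next_due = now + repeat
    return "alert sent"


alert = ActiveAlert(event_start=0, next_due=0)
for now in range(0, 1500, 300):  # one page message every 5 minutes
    print(now, on_page_message(alert, now, min_duration=600, repeat=600))
```

In this model the DURATION test keeps being retried until it passes, and "0 alerts to go" only appears between repeats, after an alert has actually gone out.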

[after looking over the code for 10 minutes]

I think I've got it, but there have been quite a few changes to various
bits, so I don't want to send one-line fixes now. I'll come up with a
proper full package, which will also include fixes for many of the
other bugs that have been reported for beta6.


Henrik