RECOVERED alerts red->yellow

list Alan Sparks · Tue, 08 Jul 2008 20:44:16 -0600 ·

I have the following settings in my hobbitserver.cfg:
ALERTCOLORS="red,yellow,purple"
OKCOLORS="green,blue,clear"

I have the following rule in my hobbit-alerts.cfg:
SERVICE=disk COLOR=red
    MAIL user-94f515b60888@xymon.invalid RECOVERED

I get an alert when disk space goes critical (red).  However:
* when the status goes from red->green, I get a recovery alert;
* when the status goes red->yellow->green, I get NO recovery alert.

I've read the sections on ALERTCOLORS and OKCOLORS.  It does work better 
when I remove yellow from the list.  But I need to sometimes alert on 
yellow status, so this is not really an option.

What have I done wrong?  Thanks in advance.
-Alan

list Tim McCloskey · Tue, 08 Jul 2008 19:53:23 -0700 ·

Try the following in the specific line in hobbit-alerts.cfg:
COLOR=red,yellow

list Alan Sparks · Tue, 08 Jul 2008 23:53:04 -0600 ·

OK, I've tried:
SERVICE=disk COLOR=red
    MAIL user-57a2c775d2be@xymon.invalid COLOR=red,yellow RECOVERED

And even:
SERVICE=disk COLOR=red,yellow
    MAIL user-57a2c775d2be@xymon.invalid COLOR=red RECOVERED

And neither sends a recovery page when disk goes red->yellow->green.
If I use only the COLOR=red,yellow, I'm going to get alerted on a yellow
disk (undesired).
Did I misunderstand your suggestion?  Thanks.
-Alan

list Martin Ward · Wed, 9 Jul 2008 10:18:21 +0100 ·

Hi Alan,

I think there is some confusion because in your first email you said:

▸ quoted from Alan Sparks

I've read the sections on ALERTCOLORS and OKCOLORS.  It does work better when I remove yellow from the list.  But I need to sometimes alert on yellow status, so this is not really an option.

But here you state:

red->yellow->green. If I use only the COLOR=red,yellow, I'm going to get alerted on a yellow disk (undesired).

So we need to understand more clearly when you want to be alerted on red
and yellow and when you don't.

|\/|artin

list Alan Sparks · Wed, 09 Jul 2008 07:15:27 -0600 ·

Sorry, thought I was clear in the original email.  For the example test/alert scenario (the disk test), I need to be alerted when the disk goes red.  Only.  This is why I specified "COLOR=red" in the head line of the alert configuration.

There are other tests I would need to have an alert sent (a warning) if the test goes yellow.  Hence the ALERTCOLORS setting including yellow.

For the test I showed, I need an alert on red.  But I need a recovery page when it becomes non-red.  I can't get that when it goes from red to yellow or even finally green.  If it goes directly from red to green, I get the recovery page.

It's the recovery page problem I need advice about.
-Alan

list Mark Hinkle · Wed, 09 Jul 2008 16:25:51 -0700 ·

Yes, I see the same thing as Alan and maybe that is why his description makes sense to me.

The real questions are: what triggers a recovery message to be sent and who gets them? Is it when a test goes from any color to green? Or is it any "down-grade" in alert state (i.e. red->yellow, or yellow->green)? It appears to be the former - any color to green. And that makes sense - "recovery" means everything is ok, and that is what "green" means.

But that does leave an open question about that state change from red->yellow. In my environment, different notification methods are used for "red" than are used for "yellow", specifically sms text for red vs. emails for yellow.

*And that is where the problem comes in*: if a "red" failed test first goes to "yellow" before then going to "green", the recovery message (upon going green) is only sent to the notification destinations configured for the *yellow state*, not the red state.

I certainly understand how this logically occurs - red->yellow is not a recovery so nothing would be sent there at all. But hobbit does not seem to save a complete list of who has been notified for each "event", so it basically forgets about those folks sent notifications at the red level as soon as it transitions to yellow. When the test finally goes green, hobbit checks the alerts config for who would have been notified at *the state just before green* (in this case yellow) and sends recovery messages to those destinations. But it has lost the fact that it was actually at a red level previous to the yellow and should have sent recovery to those destinations as well.

I believe that BB keeps track of who has been notified for each event via the "user-cc7710b7adb8@xymon.invalid_host1.disk" type of entries in the tmp dir. This allows it to have a complete list of notification destinations that it could/can use for recoveries. I am not saying hobbit should use the same mechanism, but hobbit does *appear* to be losing some rather important state info.
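
To make the idea of remembering who was notified concrete, here is a minimal Python sketch. It is an illustration only: it is not how BB or Hobbit is implemented, the recipient names and the `handle_status` function are hypothetical, and it ignores repeat intervals and alert-state suppression, so every change into an alert color notifies that color's recipients.

```python
# Hypothetical sketch: remember everyone notified during one alert episode.
# This is not how BB or Hobbit is implemented; all names are illustrative.
from collections import defaultdict

OKCOLORS = {"green", "blue", "clear"}

# Illustrative routing, as in the setup described: SMS for red, email for yellow.
RECIPIENTS_BY_COLOR = {"red": {"sms-oncall"}, "yellow": {"email-team"}}

notified = defaultdict(set)  # (host, test) -> recipients alerted so far


def handle_status(host, test, color):
    """Return (kind, recipients) for one status change."""
    key = (host, test)
    if color in OKCOLORS:
        # Recovery goes to everyone alerted at any point in the episode.
        recipients = notified.pop(key, set())
        return ("RECOVERED", recipients)
    recipients = RECIPIENTS_BY_COLOR.get(color, set())
    notified[key] |= recipients
    return ("ALERT", recipients)


events = [handle_status("host1", "disk", c) for c in ["red", "yellow", "green"]]
print(events)
```

With this kind of per-event record, the final green status reaches the red-level recipients even though the last alert color was yellow.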

--
Mark L. Hinkle
user-9816e24cee8c@xymon.invalid

list Alan Sparks · Thu, 10 Jul 2008 07:33:37 -0600 ·

After a day of running in trace and debug modes on the alerts module, I think I understand how this is broken. But I'm unsure anything but hacking the code can fix the issue. It appears to be unfortunate interactions in some of the features, including the "flap detection" stuff.

So: If I have the rule:
MAIL user-022708f53597@xymon.invalid TEST=disk COLOR=RED RECOVERED
and ALERTCOLORS="red,yellow,purple"

The traces show Hobbit going through the following "thought process":
* Say the disk goes yellow. That's in Hobbit's alert color list, so it triggers alert processing. But, no rule matches that color, so no alert is sent.
* Say the disk now goes red. Now, Hobbit sees that as a transition from an alert state to another alert state. Normally, it would suppress this, but there is logic to special-case going red, and the alert processing is triggered. This time, a rule matches, and an alert is sent.
* Say now the disk goes yellow. This is seen by Hobbit as a transition from an alert state to another alert state (due to both colors in ALERTCOLORS). No alert processing is done -- it is suppressed since it is NOT a recovery (it's flapping between two alert states). BUT, Hobbit now remembers the current color (alert state) as yellow.
* Finally, the disk goes green. This is a recovery, since it is a transition from the ALERTCOLORS to the OKCOLORS. And, this triggers alert rule processing. HOWEVER, now, the alert code scans for a rule for the last state of the alert -- yellow. And, of course, no such rule exists, and the rule that would trigger the recovery page is not used, and no recovery page is sent.

The RECOVERED keyword is only a flag on the rule that says if you match this rule during recovery processing, this recip does want a recovery page. But, Hobbit keeps no memory about which rule triggered an alert, it seems. It has to go back through the ruleset during recovery processing to find a rule to use. And because the colors change, no such rule can exist.
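
The sequence above can be replayed with a small toy model in Python. This is a sketch of the behaviour the traces suggest, not Hobbit's actual code: the rule structure, the `simulate` function and the `pager` recipient are all hypothetical, and the only state kept per test is the last alert color.

```python
# Toy model of the behaviour described in the traces above.
# Not Hobbit's code: the names and structure here are hypothetical.

ALERTCOLORS = {"red", "yellow", "purple"}
OKCOLORS = {"green", "blue", "clear"}

# One rule, mirroring: MAIL <recipient> TEST=disk COLOR=red RECOVERED
RULES = [{"recipient": "pager", "test": "disk", "colors": {"red"}, "recovered": True}]


def matching_rules(test, color):
    """Return the rules whose TEST and COLOR criteria match."""
    return [r for r in RULES if r["test"] == test and color in r["colors"]]


def simulate(test, colors):
    """Replay a sequence of status colors and return the messages sent."""
    sent = []
    last_alert_color = None  # the only state this model keeps per test
    for color in colors:
        if color in ALERTCOLORS:
            # Alert-to-alert changes are suppressed, except when going red.
            if last_alert_color is None or color == "red":
                for rule in matching_rules(test, color):
                    sent.append(("ALERT", rule["recipient"], color))
            last_alert_color = color  # overwrites any memory of red
        elif color in OKCOLORS and last_alert_color is not None:
            # Recovery: rules are looked up by the LAST alert color only.
            for rule in matching_rules(test, last_alert_color):
                if rule["recovered"]:
                    sent.append(("RECOVERED", rule["recipient"], color))
            last_alert_color = None
    return sent


direct = simulate("disk", ["red", "green"])
via_yellow = simulate("disk", ["yellow", "red", "yellow", "green"])
print(direct)
print(via_yellow)
```

In this model the direct red->green case produces an alert and a recovery, while the yellow->red->yellow->green case produces only the alert, because the recovery lookup uses yellow and the only rule is for red.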

So I think you can call it a bug, or an unfortunate side effect of adding yellow to the ALERTCOLORS list. If you do, you'll compromise your recovery paging. If you don't, you can't send alerts on warning (yellow) conditions. Short of changing the code to eliminate the alert state suppression (i.e., flap detection), I'm not certain how this can be fixed or worked around.
-Alan

list Alan Sparks · Wed, 23 Jul 2008 15:39:38 -0600 ·

Anyone have any other ideas how to fix this bug?  Thanks...
-Alan

list Greg L Hubbard · Wed, 23 Jul 2008 17:27:28 -0500 ·

You might try having a separate rule for each color.  Then maybe the
rule would fire when the test transitions into that color.  It may not
fire when it transitions from one color to another in the same rule.
But I am just guessing!

GLH

list Alan Sparks · Wed, 23 Jul 2008 17:11:39 -0600 ·

Had already considered that, but it doesn't work.  But thanks for the suggestion!
-Alan
