xymon_4.3.0-RC1: possible lost alerts

8 messages in this thread

list Dominique Frise · Fri, 11 Feb 2011 18:04:20 +0100 ·

Hi,

I think I found a bug in xymond_alert.c.

Let's say there is a page message for hostA.serviceA, and this alert is not processed immediately because of this part of the code:

    /*
     * When a burst of alerts happen, we get lots of alert messages
     * coming in quickly. So lets handle them in bunches and only
     * do the full alert handling once every 10 secs - that lets us
     * combine a bunch of alerts into one transmission process.
     */
    if (nowtimer < (lastxmit+10)) continue;
    lastxmit = nowtimer;


The main loop then waits for a new message from xymond (Want msg <num>, startpos... etc).

Now, if the next message is a page recovery for the same hostA.serviceA,
the next pass over the active alerts (the for loop) will clean up the alert for hostA.serviceA without sending any alert.


Dominique
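
The scenario described above can be modelled with a small sketch. This is not the code from xymond_alert.c: every name below (sketch_alert, sketch_receive, sketch_pass, and so on) is hypothetical, and the rule that a recovery is announced only when the page actually went out is an assumption. Only the 10-second throttle is taken from the quoted snippet.

```c
#include <stdbool.h>

/* Hypothetical, simplified model; none of these names come from xymond_alert.c. */
enum sketch_msg   { SKETCH_PAGE, SKETCH_RECOVERY };
enum sketch_state { SKETCH_IDLE, SKETCH_PAGING, SKETCH_RECOVERED };

struct sketch_alert {
    enum sketch_state state;
    bool page_sent;          /* true once a page notification went out */
};

struct sketch_loop {
    long lastxmit;           /* time of the last full alert-handling pass */
};

/* Record an incoming message; this always happens, throttled or not. */
void sketch_receive(struct sketch_alert *a, enum sketch_msg msg)
{
    if (msg == SKETCH_PAGE) {
        a->state = SKETCH_PAGING;
    } else if (a->state == SKETCH_PAGING) {
        a->state = SKETCH_RECOVERED;
    }
}

/*
 * Run the full alert handling at time `now`.
 * Returns the number of notifications sent (0 when the pass is throttled).
 */
int sketch_pass(struct sketch_loop *loop, struct sketch_alert *a, long now)
{
    int sent = 0;

    /* Same throttle as the quoted code: at most one full pass per 10 seconds. */
    if (now < loop->lastxmit + 10) return 0;
    loop->lastxmit = now;

    if (a->state == SKETCH_PAGING && !a->page_sent) {
        a->page_sent = true;
        sent++;
    } else if (a->state == SKETCH_RECOVERED) {
        /* Assumption: a recovery is only announced if the page went out. */
        if (a->page_sent) sent++;
        a->state = SKETCH_IDLE;
        a->page_sent = false;
    }
    return sent;
}
```

In this model, a page received 3 seconds after the last pass followed by a recovery before the next pass produces no notification at all, while a page that arrives when the throttle has already expired is sent and later followed by its recovery.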

list Henrik Størner · Mon, 14 Feb 2011 10:00:38 +0000 (UTC) ·

I haven't tested your diagnosis, but it is probably correct
(from how I remember that this code works).

But is it a problem?

If you get an alert that clears a few seconds later (that is why there
is a recovery message), then what is the point of sending an alert?
The notification would be for data that is no longer valid, and
personally I would rather NOT be alerted at 3 AM if the problem no
longer exists.

So I am tempted to invoke the old "this is not a bug, it's a feature!"
meme :-)


Regards,
Henrik

list Dominique Frise · Mon, 14 Feb 2011 12:21:14 +0100 ·

I think the problem is rather that the behaviour is not deterministic.
Some alert/recovery transitions will get through (if the alert reaches
the alert-processing loop without waiting), while others get lost (if
the alert and its recovery are processed in the same pass of the loop).

Dominique

list Henrik Størner · Mon, 14 Feb 2011 12:46:30 +0000 (UTC) ·

But it is "deterministic enough" that you will either get both of
them (alert + recovery), or neither. You will not get an alert
and then lose the recovery-message, or get a recovery-message
without the alert having been sent.


Regards,
Henrik

list Dominique Frise · Mon, 14 Feb 2011 14:38:08 +0100 ·

This leads me to another question that never got answered:
what is supposed to happen if you remove the "clear" color from OKCOLORS
in xymonserver.cfg?
We would expect that no recovery message is sent when a status
goes from yellow/red to clear. Only the repeat interval should be reset.
Does this make sense?

Dominique

list Henrik Størner · Mon, 14 Feb 2011 13:51:56 +0000 (UTC) ·

▸ quoted from Dominique Frise

In <user-fea03e92c89e@xymon.invalid> Dominique Frise <user-78ab6673b600@xymon.invalid> writes:

what is supposed to happen if you remove the "clear" color from OKCOLORS in xymonserver.cfg?

Then a "clear" status would trigger alerts, i.e. the xymond_alert
module would begin to see alert-messages for a clear status (same
as for yellow, red, purple).

I don't think you would actually see any alerts being sent, unless
you also change ALERTCOLORS to include the "clear" status.

But that would be a bad idea, since "clear" is also used for e.g. "noping" hosts, or for client-side statuses (cpu, disk, ...)
when the server is down (when the "conn" status is red, client-side tests do not go purple - they go clear).

▸ quoted from Dominique Frise

We would expect that no recovery message is sent when a status goes from yellow/red to clear. Only the repeat interval should be reset.
Does this make sense?

Kind of, yes. I don't recall if it was actually tested.


Regards,
Henrik

list Dominique Frise · Mon, 14 Feb 2011 15:17:35 +0100 ·


Best regards,

Dominique
_______________UNIL - University of Lausanne_______________
Dominique Frise             E-mail: user-78ab6673b600@xymon.invalid
UNIL, Centre Informatique   Phone:         +XX XX XXX XX XX
Quartier Sorge / Amphimax   Fax:           +XX XX XXX XX XX
1015 Lausanne, Switzerland  URL:      http://www.unil.ch/ci

▸ quoted from Henrik Størner


Kind of, yes. I don't recall if it was actually tested.

I don't think it was ;-)
Below are the small changes we made to xymond_alert.c (the version before
your last changes) to achieve this:


[super at iris xymond]# diff -u xymond_alert.c.dist xymond_alert.c
--- xymond_alert.c.dist Sun Nov 14 18:21:19 2010
+++ xymond_alert.c      Mon Feb 14 15:02:24 2011
@@ -355,7 +355,7 @@
         char *msg;
         int seq;
         int argi;
-       int alertcolors, alertinterval;
+       int alertcolors, alertinterval, okcolors;
         char *configfn = NULL;
         char *checkfn = NULL;
         int checkpointinterval = 900;
@@ -377,6 +377,7 @@
         /* Load alert config */
         alertcolors = colorset(xgetenv("ALERTCOLORS"), ((1 << COL_GREEN) | (1 << COL_BLUE)));
         alertinterval = 60*atoi(xgetenv("ALERTREPEAT"));
+       okcolors = colorset(xgetenv("OKCOLORS"), (1 << COL_RED));

         /* Create our loookup-trees */
         hostnames = rbtNew(name_compare);
@@ -656,7 +657,7 @@
                                         awalk->maxcolor = newcolor;
                                 }
                         }
-                       else {
+                       else if ((okcolors & (1 << newcolor)) != 0) {
                                 /*
                                  * Send one "recovered" message out now, then go to A_DEAD.
                                  * Dont update the color here - we want recoveries to go out
@@ -663,6 +664,11 @@
                                  * only if the alert color triggered an alert
                                  */
                                 awalk->state = A_RECOVERED;
+                       } else {
+                               /*
+                                * This color should not trigger "recovered" messages.
+                                */
+                               awalk->state = A_NORECIP;
                         }


With this in place we can better support alerting for SNMP traps (see 
previous discussion with Buchan 
http://www.xymon.com/archive/2011/02/msg00062.html), but then we want 
all short transitions from an alert state to a clear status to be 
processed by Xymon (not ignored).

Dominique
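
The state decision that the patch above introduces can be illustrated in isolation. This is only a sketch: the enum and function names are hypothetical stand-ins for the COL_* and A_* identifiers in the diff, and the first branch (testing ALERTCOLORS) is an assumption, since the diff does not show the condition that precedes the patched `else`.

```c
/* Hypothetical color and state names, mirroring COL_* and A_* in the patch. */
enum sketch_color { SK_GREEN, SK_CLEAR, SK_BLUE, SK_PURPLE, SK_YELLOW, SK_RED };
enum sketch_next  { SK_STAY_PAGING, SK_SEND_RECOVERY, SK_NO_RECOVERY };

/* One bit per color, the same encoding the patch tests with (1 << newcolor). */
#define SK_BIT(c) (1 << (c))

enum sketch_next sketch_next_state(int alertcolors, int okcolors,
                                   enum sketch_color newcolor)
{
    /* Assumption: the branch before the patched `else` tests ALERTCOLORS. */
    if ((alertcolors & SK_BIT(newcolor)) != 0) return SK_STAY_PAGING;

    /* Patched branch: only colors listed in OKCOLORS produce a recovery. */
    if ((okcolors & SK_BIT(newcolor)) != 0) return SK_SEND_RECOVERY;

    /* Any other color (e.g. clear removed from OKCOLORS): no recovery message. */
    return SK_NO_RECOVERY;
}
```

With "clear" removed from the OK set, a transition from red to clear lands in the last branch, which corresponds to A_NORECIP in the patch; with "clear" kept in the OK set, the same transition still yields a recovery.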

list Dominique Frise · Mon, 14 Feb 2011 17:08:51 +0100 ·

▸ quoted from Henrik Størner

Kind of, yes. I don't recall if it was actually tested.

(Sorry, same reply was sent before with garbage as top post.)

Dominique
