
xymon_4.3.0-RC1: possible lost alerts

Henrik Størner
Mon, 14 Feb 2011 10:00:38 +0000 (UTC)
Message-Id: <ijaug6$9ov$user-e356fad9864f@xymon.invalid>

In <user-f44f191b2358@xymon.invalid> Dominique Frise <user-78ab6673b600@xymon.invalid> writes:
I think I found a bug in xymond_alert.c.
Let's say there is a page message for hostA.serviceA, and this alert is
not processed immediately because of this part of the code:
    /*
     * When a burst of alerts happen, we get lots of alert messages
     * coming in quickly. So lets handle them in bunches and only
     * do the full alert handling once every 10 secs - that lets us
     * combine a bunch of alerts into one transmission process.
     */
    if (nowtimer < (lastxmit+10)) continue;
    lastxmit = nowtimer;
The main loop will then wait for a new message from xymond (Want msg <num>,
startpos... etc).
Now if the next message is a page recovery for the same hostA.serviceA,
the next pass over the active alerts (the for loop) will clean up
the alert for hostA.serviceA without sending any alert.

I haven't tested your diagnosis, but it is probably correct
(from what I remember of how this code works).
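
To make the sequence concrete, here is a minimal sketch of the scenario in C. It is a simplified model written for illustration, not code from xymond_alert.c: the names count_notifications, struct msg, MSG_PAGE, MSG_RECOVERED and XMIT_INTERVAL are all hypothetical, and the only part taken from the real source is the 10-second throttle check quoted above. The model assumes that an incoming message updates the alert state before the throttle check runs, which is what the diagnosis describes.

```c
#include <assert.h>
#include <stddef.h>

#define XMIT_INTERVAL 10 /* seconds; mirrors the "+10" in the quoted check */

enum msg_type { MSG_PAGE, MSG_RECOVERED };

struct msg {
    long when;            /* arrival time in seconds */
    enum msg_type type;
};

/*
 * Hypothetical model of one host.service alert passing through a
 * throttled main loop. Returns how many problem notifications
 * would be sent for the given, time-ordered message sequence.
 */
static int count_notifications(const struct msg *msgs, size_t n, long lastxmit)
{
    int active = 0;   /* alert is currently in the paging state */
    int notified = 0; /* a notification was already sent for it */
    int sent = 0;

    for (size_t i = 0; i < n; i++) {
        long nowtimer = msgs[i].when;
        assert(i == 0 || msgs[i - 1].when <= nowtimer); /* input must be time-ordered */

        /* The incoming message updates the alert state first ... */
        if (msgs[i].type == MSG_PAGE) {
            active = 1;
        } else {
            active = 0;   /* recovery: the alert is cleaned up */
            notified = 0;
        }

        /* ... and only then is the throttle applied, as in the quoted code. */
        if (nowtimer < (lastxmit + XMIT_INTERVAL)) continue;
        lastxmit = nowtimer;

        /* Full alert handling: notify once per active alert. */
        if (active && !notified) {
            sent++;
            notified = 1;
        }
    }
    return sent;
}
```

With lastxmit at 100, a page at 103 followed by a recovery at 106 both fall inside the throttle window, so the function returns 0: the alert is cleaned up without ever being sent. A page at 111 followed by a recovery at 115 returns 1, because the page arrives after the window has expired.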

But is it a problem?

If you get an alert that clears a few seconds later (that is why there
is a recovery message), then what is the point of sending an alert?
The notification would be for data that is no longer valid, and
personally I would rather NOT be alerted at 3 AM if the problem no
longer exists.

So I am tempted to invoke the old "this is not a bug, it's a feature!"
meme :-)


Regards,
Henrik