A different RECOVERED message problem!
list Sebastian Auriol
Hi all,
This seems to be a bug (or at least a flaw) in the alerting system (on
hobbit trunk from Dec 2008): a recovery is sent for a MAIL alert that never
actually sent an alert in the first place. The last two lines in the
notifications.log are recoveries that were sent at the exact same time. But
only 1 alert (alarm) was actually sent out.
[root@IVRA1 log]# tail /var/log/hobbit/notifications.log -n 3
Tue Jun 16 14:46:49 2009 Db1.Special2 (192.168.4.xx) meATmyCompanyDOTcom[191] 1245160009 0
Tue Jun 16 14:48:43 2009 Db1.Special2 (192.168.4.xx) meATmyCompanyDOTcom[191] 1245160123 0 492
Tue Jun 16 14:48:43 2009 Db1.Special2 (192.168.4.xx) meATmyCompanyDOTcom[192] 1245160123 0 492
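For anyone wanting to check their own notifications.log for the same symptom, here is a rough Python sketch. It rests on two assumptions read off the three lines above, not on any documented log format: that a recovery entry carries one extra trailing number, and that the number in square brackets identifies the rule line that produced the entry. The function name `orphan_recoveries` is made up for this example.

```python
import re

# The three entries from the post, each rejoined onto a single line.
LOG = """\
Tue Jun 16 14:46:49 2009 Db1.Special2 (192.168.4.xx) meATmyCompanyDOTcom[191] 1245160009 0
Tue Jun 16 14:48:43 2009 Db1.Special2 (192.168.4.xx) meATmyCompanyDOTcom[191] 1245160123 0 492
Tue Jun 16 14:48:43 2009 Db1.Special2 (192.168.4.xx) meATmyCompanyDOTcom[192] 1245160123 0 492
"""

# Field layout inferred from the sample only (an assumption).
ENTRY = re.compile(
    r"^(?P<when>\w{3} \w{3} +\d+ [\d:]+ \d{4}) "
    r"(?P<hostsvc>\S+) \((?P<ip>[^)]*)\) "
    r"(?P<rcpt>\S+)\[(?P<tag>\d+)\] "
    r"(?P<epoch>\d+) (?P<code>\d+)(?: (?P<extra>\d+))?$"
)


def orphan_recoveries(text):
    """Return (host.service, recipient, tag) for recoveries with no prior alert."""
    alerted = set()
    orphans = []
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if not m:
            continue  # skip anything that does not look like a log entry
        key = (m["hostsvc"], m["rcpt"], m["tag"])
        if m["extra"] is None:
            # Assumed to be an alert: no trailing extra field.
            alerted.add(key)
        elif key in alerted:
            # Recovery that matches an earlier alert from the same rule.
            alerted.discard(key)
        else:
            # Recovery for a rule that never logged an alert.
            orphans.append(key)
    return orphans


for hostsvc, rcpt, tag in orphan_recoveries(LOG):
    print(f"recovery without alert: {hostsvc} -> {rcpt}[{tag}]")
```

On the sample this flags only the `[192]` entry, which is the recovery that had no alert before it.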
The reason why this happened may be seen from the hobbit-alerts.cfg file.
One of the alert lines triggers on persistent alarms (including yellow
alarms), while the other is immediate but for red only. Both have RECOVERED
alerts. The test was only red for 2 minutes, so only the first MAIL rule
fired initially. It seems that the RECOVERED part does not check that this
line actually triggered an alert before sending the recovered message!
HOST=Db1 SERVICE=Special2
MAIL meATmyCompanyDOTcom COLOR=red REPEAT=30 RECOVERED
MAIL meATmyCompanyDOTcom COLOR=yellow,red DURATION>10 REPEAT=30 RECOVERED
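The difference between what was observed and what was expected can be shown with a toy model in Python. This is not Hobbit's actual matching code, only a sketch of the two rules above; the names `Rule`, `red_immediate`, `persistent` and the helper functions are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    name: str
    colors: frozenset
    min_duration: Optional[int] = None  # minutes; models DURATION>N
    recovered: bool = True              # models the RECOVERED keyword


# Toy versions of the two MAIL lines in the post.
RULES = [
    Rule("red_immediate", frozenset({"red"})),
    Rule("persistent", frozenset({"yellow", "red"}), min_duration=10),
]


def rule_fires(rule, color, minutes_down):
    """True if this rule would send an alert for the given status."""
    if color not in rule.colors:
        return False
    return rule.min_duration is None or minutes_down > rule.min_duration


def recoveries_observed(rules):
    """Behaviour seen in the log: every RECOVERED rule gets a recovery."""
    return [r.name for r in rules if r.recovered]


def recoveries_expected(rules, fired):
    """Behaviour expected: recovery only for rules that actually alerted."""
    return [r.name for r in rules if r.recovered and r.name in fired]


# The test was red for about 2 minutes, so only the first rule fires.
fired = {r.name for r in RULES if rule_fires(r, "red", 2)}
print("alerted:  ", sorted(fired))
print("observed: ", recoveries_observed(RULES))
print("expected: ", recoveries_expected(RULES, fired))
```

The only change between the two recovery functions is the check against the set of rules that fired, which is the check that appears to be missing.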
There haven't been any changes to the alerting system since December, right?
Should I file this as a bug anywhere else?
Kind regards,
SebA
list Alan Sparks
SebA wrote:
> Hi all, This seems to be a bug (or at least a flaw) in the alerting system (on hobbit trunk from Dec 2008): a recovery is sent for a MAIL alert that never actually sent an alert in the first place. The last two lines in the notifications.log are recoveries that were sent at the exact same time. But only 1 alert (alarm) was actually sent out.
I have a feeling you're seeing an artifact of some oddities in how the alerts module matches up events and paging rules. You can see my analysis writeup here (however, I never solved the problem): http://www.hswn.dk/hobbiton/2008/07/msg00132.html
-Alan
list Sebastian Auriol
Alan Sparks <mailto:user-8f2174fd8b66@xymon.invalid> wrote:
> SebA wrote:
>> This seems to be a bug (or at least a flaw) in the alerting system (on hobbit trunk from Dec 2008): a recovery is sent for a MAIL alert that never actually sent an alert in the first place. The last two lines in the notifications.log are recoveries that were sent at the exact same time. But only 1 alert (alarm) was actually sent out.
> I have a feeling you're seeing an artifact of some oddities in how the alerts module matches up events and paging rules. You can see my analysis writeup here (however, I never solved the problem): http://www.hswn.dk/hobbiton/2008/07/msg00132.html
> -Alan
Yes, it looks like you're right Alan. I really think the alerts module needs a bit of a rewrite... I've already posted about several flaws in it over the past year... It just doesn't seem to be possible to get it doing what I want (and expect) for several different tests / scenarios (at the same time).
SebA
list Alan Sparks
SebA wrote:
> Alan Sparks <mailto:user-8f2174fd8b66@xymon.invalid> wrote:
>> I have a feeling you're seeing an artifact of some oddities in how the alerts module matches up events and paging rules. You can see my analysis writeup here (however, I never solved the problem): http://www.hswn.dk/hobbiton/2008/07/msg00132.html
>> -Alan
> Yes, it looks like you're right Alan. I really think the alerts module needs a bit of a rewrite... I've already posted about several flaws in it over the past year... It just doesn't seem to be possible to get it doing what I want (and expect) for several different tests / scenarios (at the same time).
> Seb
Would tend to agree, due to issues with RECOVERED and DURATION processing, and the seeming conflict of interest between the alerts and the flap suppression algorithms. I have on occasion considered outboarding the alert processing, and just having Hobbit send alerts for anything to a custom script, where I could exert finer control.
-Alan
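A minimal sketch of what such an outboard script might look like, in Python. It assumes the script is called through a SCRIPT recipient and that Hobbit passes the environment variables BBHOSTSVC, BBCOLORLEVEL, RCPT, RECOVERED and DOWNSECS; those names should be checked against the hobbit-alerts.cfg man page for the installed version. The state file path, the `ALERT_STATE_FILE` variable, the policy in `wants_alert` and the function names are all hypothetical.

```python
import json
import os

# Hypothetical location for the script's own memory of what it has sent.
STATE_FILE = os.environ.get("ALERT_STATE_FILE", "/var/tmp/alert-filter-state.json")


def wants_alert(color, down_secs):
    """Example policy: red at once, yellow only after more than 10 minutes."""
    if color == "red":
        return True
    if color == "yellow":
        return down_secs > 600
    return False


def decide(env, state):
    """Return 'alert', 'recovery' or 'suppress' and update state in place."""
    key = env.get("BBHOSTSVC", "") + "|" + env.get("RCPT", "")
    if env.get("RECOVERED") == "1":
        # Pass a recovery on only if this script really sent an alert before.
        return "recovery" if state.pop(key, False) else "suppress"
    try:
        down_secs = int(env.get("DOWNSECS", "0"))
    except ValueError:
        down_secs = 0
    if wants_alert(env.get("BBCOLORLEVEL", ""), down_secs):
        state[key] = True
        return "alert"
    return "suppress"


def main():
    try:
        with open(STATE_FILE) as fh:
            state = json.load(fh)
    except (FileNotFoundError, ValueError):
        state = {}  # first run, or unreadable state file
    action = decide(os.environ, state)
    with open(STATE_FILE, "w") as fh:
        json.dump(state, fh)
    if action != "suppress":
        # Placeholder for the real delivery step (mail, SMS, ...).
        print(action, os.environ.get("BBHOSTSVC", ""), os.environ.get("BBCOLORLEVEL", ""))


# Only act when called with the alert environment present.
if __name__ == "__main__" and "BBHOSTSVC" in os.environ:
    main()
```

Hobbit would still decide when the script gets called (for example through REPEAT), so a policy like this can only be as fine-grained as those calls.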