xymon-4.3.0-RC1: alerting question

list Dominique Frise
Tue, 08 Feb 2011 11:02:35 +0100
Message-Id: <user-232c8801ab35@xymon.invalid>

Hi Buchan,

On 02/ 7/11 10:31 PM, Buchan Milne wrote:

On Monday, 7 February 2011 16:37:14 Dominique Frise wrote:

Hi Henrik,

Thanks for replying.

On 02/ 7/11 01:10 PM, Henrik Størner wrote:

In <user-bcb3434215c6@xymon.invalid> Dominique Frise <user-78ab6673b600@xymon.invalid>
writes:

What is the minimum time an alert status must stay up for Xymon to
process it correctly?

I am not sure I understand the question - are you saying that
Xymon does not generate the notifications you expect it to?

Sort of...

We have SNMP trap handling configured (thanks Andy Farrior)

It is an ugly hack. We need a better solution. I didn't implement this one for
my own environments, as I was not willing to settle for it (one issue being
the multiple parts, snmptrapd->snmptt->sec->perl script), but I haven't
finished the work I wanted to do (a perl NetSNMP::TrapReceiver running in
snmptrapd that does all the tasks above) to have a better solution.

Well, Andy's work is advertised as "A very elegant method of feeding
traps into Xymon" ;-)
(http://www.xymon.com/xymon/help/xymon-tips.html#snmptraps)
This is also the kind of approach used for Nagios, but there
alerting is better supported by "volatile" services
(http://nagios.sourceforge.net/docs/2_0/volatileservices.html).

but are not
completely happy with how it handles the alerting.
When a bad trap from a given host is received, an alert status is
generated for Xymon (yellow or red). So far, so good.

Actually, IMHO, no. The BB model works on monitoring a status, and generating
an event when the status changes. The problem comes when you listen for events
(traps), and the only way to handle them is to create a status, so you can
generate an event.

I think event-based monitoring should not go via 'status' messages, but go
into a separate channel, which handles events as events, and possibly alerts
directly instead of via the status channel.

Agree

Then, before this status's validity expires (before it turns purple), a
periodically launched script resets its color to green, thus
generating a recovered message independently of the real status of the
service reported by the trap. Furthermore, while a <host>.trap status
is in an alert state, other bad traps from the same host and of the same
level will not generate any alerts (they are ignored).

This is a generic problem, and applies to some extent to other tests as well.
Even if different types of traps were reported to different tests, there is
still no component-level ack/alert/recover/disable etc. So, for
example, if a non-critical filesystem goes yellow and this is ack'ed or
disabled, and then a critical filesystem goes red, there will be no new
notification and it won't appear on the critical systems view. In the same
way, a trap for a non-critical router interface will be lumped together with
one for a critical interface.

Not trivial to solve

Here follows a description of what we are trying to implement in order to
improve this handling:

****
1. a bad <host> trap is detected.
2. generate a yellow/red <host>.trap status for Xymon.
3. after a short delay (ideally 1 sec.), generate a clear <host>.trap
status for Xymon.
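
The three steps above could be sketched roughly as follows. This is only an illustration in Python, not the script actually in use here: the helper names (build_status, send_to_xymon, report_trap) and the example host are made up, and it assumes the usual Xymon "status HOSTNAME.TESTNAME COLOR text" message sent over TCP port 1984, with dots in the hostname written as commas.

```python
import socket
import time

XYMON_PORT = 1984  # usual Xymon network port; adjust if your server differs


def build_status(host, test, color, text):
    """Build a Xymon 'status' message for host.test (hypothetical helper)."""
    # Dots in the hostname are written as commas in the message.
    host_key = host.replace(".", ",")
    return f"status {host_key}.{test} {color} {text}"


def send_to_xymon(message, server="127.0.0.1", port=XYMON_PORT):
    """Send one message to the Xymon server over TCP (hypothetical helper)."""
    with socket.create_connection((server, port), timeout=5) as conn:
        conn.sendall(message.encode("utf-8"))


def report_trap(host, color, trap_text, delay=1.0, send=send_to_xymon):
    """Steps 2 and 3: send the alert status, wait, then send 'clear'."""
    send(build_status(host, "trap", color, trap_text))
    time.sleep(delay)
    send(build_status(host, "trap", "clear", "trap status reset"))
```

Passing a different `send` callable makes it possible to try the logic without a Xymon server, for example `report_trap("router1.example.com", "red", "linkDown", delay=0, send=print)`.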

So the status page for the host is useless; the only thing you get is
alerting. It would be much better (IMHO) to go:

1) snmptrapd running NetSNMP::TrapReceiver, which does MIB parsing etc., prunes
duplicate traps itself, stores some trap details, and sends an 'event'
message to hobbitd.
2) A hobbit worker listening on the event channel and deciding when to send
page or ack messages to hobbitd for hobbitd_alert to act on. In some cases, it
might be desirable for it to do something besides alert (e.g. trigger a
configuration update for a network device on a device configuration save trap).
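
To make the idea concrete, here is a rough Python sketch of the duplicate pruning and the "event in, notification out" decision. It is not the NetSNMP::TrapReceiver code described above (which is unfinished) and does not talk to hobbitd at all; the class and function names, the 60-second window, and the notify callback are all assumptions for illustration.

```python
import time


class TrapDeduplicator:
    """Suppress repeats of the same (host, trap) within a time window."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self._last_accepted = {}  # (host, trap_oid) -> time it was last passed on

    def is_new(self, host, trap_oid, now=None):
        now = time.monotonic() if now is None else now
        key = (host, trap_oid)
        last = self._last_accepted.get(key)
        if last is not None and (now - last) <= self.window:
            return False  # duplicate inside the window: prune it
        # Only accepted traps restart the window, so a trap that keeps
        # repeating is passed on again once per window.
        self._last_accepted[key] = now
        return True


def handle_trap(dedup, host, trap_oid, severity, notify, now=None):
    """Treat each trap as an event: notify directly unless it is a duplicate."""
    if not dedup.is_new(host, trap_oid, now):
        return False
    notify(f"{severity.upper()} trap {trap_oid} from {host}")
    return True
```

Because the key is (host, trap), two different traps from the same host each produce their own notification, which is the point of handling events as events rather than folding them into one status.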

Solid concept indeed

All trap statuses except those in an alert state are periodically set to clear.
The red/yellow -> clear transition should not generate a recovered
message. This should be achieved by removing "clear" from "OKCOLORS" in
xymonserver.cfg, but this does not work without modifying xymond_alert.c.
A good <host>.trap should generate a green message and thus a recovered
message.
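
The behaviour we are after can be written down as a small decision function. This Python sketch models the desired outcome only; it is not how xymond_alert.c decides, and the function name and the color sets are illustrative assumptions.

```python
ALERT_COLORS = {"red", "yellow", "purple"}
OK_COLORS = {"green", "blue"}  # "clear" deliberately left out


def notification_for(previous, current,
                     alert_colors=ALERT_COLORS, ok_colors=OK_COLORS):
    """Return 'alert', 'recovered' or None for one color transition."""
    if current in alert_colors and current != previous:
        return "alert"
    if current in ok_colors and previous not in ok_colors:
        return "recovered"
    return None  # e.g. red -> clear: the silent reset


# A bad trap, the silent reset, a second bad trap, then a good trap.
colors = ["clear", "red", "clear", "red", "green"]
events = [notification_for(a, b) for a, b in zip(colors, colors[1:])]
```

With "clear" outside the OK colors, the reset produces no message, the second bad trap alerts again, and only the green status from a good trap counts as a recovery.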

This is mostly just going to result in disk churn that you don't even want to
look at, just to send some mails. If you didn't have Xymon in the picture,
snmptrapd and traptoemail would do most of what you get ...

The database history fed by snmptt is quite useful too

We know that 100% handling of traps in Xymon is not possible because
we are misusing a single status (trap) to report many others, but this
scenario would allow:

- better alerting for all bad traps from the same host and of the same level.

Well, it is slightly better, but I don't see how traps for different reasons
in different orders are going to be handled well.

Not covered at all :-(

- the recovered status is a real recover (the text of the trap explains
what recovered)

This is about the only advantage, and I think there is more that could be
improved with fewer disadvantages.

Eager to test your solution...
Don't forget to drop us a mail when it's ready for testing!

Regards,
Dominique

Regards,
Buchan