xymon-4.3.0-RC1: alerting question

5 messages in this thread

list Dominique Frise · Fri, 04 Feb 2011 15:38:59 +0100 ·

Hi,

We looked in the code but could not find the answer...

What is the minimum time for the same alert status to stay up to be processed correctly by Xymon ?

For example, in the following transitions, what would be the minimum time (in sec.) for the yellow statuses (same check) to be processed correctly by Xymon?


long t.    short t.    long t.    short t.    long t.    long t.
green  ->  yellow  ->  clear  ->  yellow  ->  clear  ->  green
           alert                  alert                  recovered


Thanks for any advice.

Dominique

list Henrik Størner · Mon, 7 Feb 2011 12:10:11 +0000 (UTC) ·

▸ quoted from Dominique Frise

In <user-bcb3434215c6@xymon.invalid> Dominique Frise <user-78ab6673b600@xymon.invalid> writes:

What is the minimum time for the same alert status to stay up to be 
processed correctly by Xymon ?

I am not sure I understand the question - are you saying that
Xymon does not generate the notifications you expect it to ?

▸ quoted from Dominique Frise

For example in following transitions, what would the minimum time (in 
sec.) for the yellow statuses (same check) to be processed correctly by 
Xymon ?

long t.    short t.     long t.    short t.    long t.    long t.
green  ->  yellow   ->  clear  ->  yellow  ->  clear  ->  green
           alert                   alert                  recovered

Provided you have alerts set up for a yellow status, and there is no
DURATION parameter that delays the alert, you should get an
alert on each of the transitions to yellow.

("clear" is not an alerting color - only yellow, red and purple are).

The only "minimum time" Xymon has in relation to alerts, is the
DURATION parameter that you specify in alerts.cfg (hobbit-alerts.cfg
in older versions).
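
As a rough illustration of the rule above, here is a simplified Python model. It is not Xymon's actual code: the function name is made up for this example, and it assumes no DURATION delay is configured. It only marks the points where a status enters an alerting color (yellow, red or purple) from a non-alerting one.

```python
# Alerting colors as listed in the message above.
ALERT_COLORS = {"yellow", "red", "purple"}

def alert_transitions(colors, alert_colors=ALERT_COLORS):
    """Return the indexes where the status enters an alerting color
    from a non-alerting one (simplified model, no DURATION delay)."""
    hits = []
    prev = None
    for i, color in enumerate(colors):
        if color in alert_colors and prev not in alert_colors:
            hits.append(i)
        prev = color
    return hits

# The sequence from the original question.
sequence = ["green", "yellow", "clear", "yellow", "clear", "green"]
print(alert_transitions(sequence))  # [1, 3]
```

Both yellow states are reported, however short they are, because the model has no notion of time at all.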


Regards,
Henrik

list Dominique Frise · Mon, 07 Feb 2011 15:37:14 +0100 ·

Hi Henrik,

Thanks for replying.

▸ quoted from Henrik Størner


On 02/ 7/11 01:10 PM, Henrik Størner wrote:

In <user-bcb3434215c6@xymon.invalid> Dominique Frise <user-78ab6673b600@xymon.invalid> writes:

What is the minimum time for the same alert status to stay up to be
processed correctly by Xymon ?

I am not sure I understand the question - are you saying that
Xymon does not generate the notifications you expect it to ?

Sort of...

We have SNMP trap handling configured (thanks Andy Farrior) but are not completely happy with how it handles the alerting.
When a bad trap from a given host is received, an alert status is generated for Xymon (yellow or red). So far, so good.
Then, before this status's validity expires (before it turns purple), a periodically run script resets its color to green, thus generating a recovered message independently of the real status of the service reported by the trap. Furthermore, while a <host>.trap status is in an alert state, other bad traps from the same host and of the same level do not generate any alerts (they are ignored).

Here follows a description of what we are trying to implement in order to improve this handling:

****
1. a bad <host>trap is detected.
2. generate a yellow/red <host>.trap status for Xymon.
3. after a short delay (ideally 1 sec.), generate a clear <host>.trap status for Xymon.

All trap statuses except those in an alert state are periodically set to clear.
The red/yellow -> clear transition should not generate a recovered message. In principle this should be achievable by removing "clear" from "OKCOLORS" in xymonserver.cfg, but in practice it does not work without modifying xymond_alert.c.
A good <host>.trap should generate a green message and thus a recovered message.

We know that handling traps 100% correctly in Xymon is not possible because we are misusing a single status (trap) to report many others, but this scenario would allow:

- better alerting on all bad traps from the same host and of the same level.
- the recovered status is a real recovery (the text of the trap explains what recovered)
****
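
A minimal Python sketch of steps 2 and 3 above. It assumes the usual Xymon status message format (`status HOSTNAME.TESTNAME COLOR text`, with dots in the host name replaced by commas) and the default Xymon port 1984. The helper names, the host name and the trap text are hypothetical, and the 1-second delay is only the value proposed above; whether such a short gap is handled reliably is the open question in this thread.

```python
import socket
import time

XYMON_PORT = 1984  # default Xymon port (assumption: not changed locally)

def build_status(host, test, color, text):
    """Build a Xymon 'status' message; dots in the host name become commas."""
    return "status {}.{} {} {}".format(host.replace(".", ","), test, color, text)

def send_to_xymon(message, server, port=XYMON_PORT):
    """Send one message to the Xymon server over TCP (hypothetical helper)."""
    with socket.create_connection((server, port), timeout=5) as conn:
        conn.sendall(message.encode("utf-8"))

def report_bad_trap(server, host, color, trap_text, delay=1.0):
    """Steps 2 and 3: raise the alert color, wait briefly, then set clear."""
    send_to_xymon(build_status(host, "trap", color, trap_text), server)
    time.sleep(delay)
    send_to_xymon(build_status(host, "trap", "clear", "waiting for next trap"), server)

# Only builds the message; nothing is sent here.
print(build_status("router1.example.com", "trap", "yellow", "linkDown on port 3"))
```

Only `build_status` is exercised here; `report_bad_trap` needs a reachable Xymon server.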

The issue we have now is that we are missing some alerts. We enabled debugging and tracing, but due to the number of alerts we get, it is extremely difficult to follow a single alert. We think this could be related to how xymond_alert handles messages in batches (10-second handling).

Can you please confirm?

Thanks for your time.

Dominique

list Buchan Milne · Mon, 7 Feb 2011 23:31:06 +0200 ·

▸ quoted from Dominique Frise

On Monday, 7 February 2011 16:37:14 Dominique Frise wrote:

Hi Henrik,

Thanks for replying.

On 02/ 7/11 01:10 PM, Henrik Størner wrote:

In <user-bcb3434215c6@xymon.invalid> Dominique Frise <user-78ab6673b600@xymon.invalid> writes:

What is the minimum time for the same alert status to stay up to be
processed correctly by Xymon ?

I am not sure I understand the question - are you saying that
Xymon does not generate the notifications you expect it to ?

Sort of...

We have SNMP trap handling configured (thanks Andy Farrior)

It is an ugly hack. We need a better solution. I didn't implement this one for my own environments, as I was not willing to settle for it (one issue being the multiple parts, snmptrapd->snmptt->sec->perl script), but I haven't finished the work I wanted to do (a perl NetSNMP::TrapReceiver running in snmptrapd that does all the tasks above) to have a better solution.

▸ quoted from Dominique Frise

but are not
completely happy with how it handles the alerting.
When a bad trap from a given host is received, an alert status is
generated for Xymon (yellow or red). So far, so good.

Actually, IMHO, no. The BB model works on monitoring a status, and generating an event when the status changes. The problem comes when you listen for events (traps), and the only way to handle them is to create a status, so you can generate an event.

I think event-based monitoring should not go via 'status' messages, but go into a separate channel, which handles events as events, and possibly alerts directly instead of via the status channel.

▸ quoted from Dominique Frise

Then, before this status's validity expires (before it turns purple), a
periodically run script resets its color to green, thus
generating a recovered message independently of the real status of the
service reported by the trap. Furthermore, while a <host>.trap status
is in an alert state, other bad traps from the same host and of the same level
do not generate any alerts (they are ignored).

This is a generic problem, and applies to some extent to other tests as well. Even if different types of traps were reported to different tests, there is still the issue of no component-level ack/alert/recover/disable etc. So, for example, if a non-critical filesystem goes yellow, and this is ack'ed or disabled, and then a critical filesystem goes red, there will be no new notification and it won't appear on the critical systems view, just as a trap for a non-critical router interface will be lumped together with a critical one.

▸ quoted from Dominique Frise

Here follows a description of what we are trying to implement in order to
improve this handling:

****
1. a bad <host>trap is detected.
2. generate a yellow/red <host>.trap status for Xymon.
3. after a short delay (ideally 1 sec.), generate a clear <host>.trap
status for Xymon.

So the status page for the host is useless; the only thing you get is alerting. It would be much better (IMHO) to go:

1) snmptrapd running NetSNMP::TrapReceiver which does MIB parsing etc., pruning of duplicate traps itself, storing some trap details, and sends an 'event' message to hobbitd.
2) A hobbit worker listening on the event channel and deciding when to send page or ack messages to hobbitd for hobbitd_alert to act on. In some cases, it might be desirable for it to do something besides alert (e.g. trigger a configuration update for a network device on a device configuration save trap)
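
To make the "pruning of duplicate traps" part of step 1 concrete, here is a small Python sketch. It is not the unfinished Perl NetSNMP::TrapReceiver work mentioned above; the class name, the 60-second window and the (host, trap) key are all assumptions chosen for illustration. A repeat of the same trap from the same host is dropped if it arrives within the window after the last one that was forwarded.

```python
class TrapDeduplicator:
    """Suppress repeats of the same (host, trap) pair within a time window."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self._last_forwarded = {}  # (host, trap_id) -> time of last forwarded trap

    def should_forward(self, host, trap_id, now):
        """Return True if this trap should be passed on as an event.

        'now' is a timestamp in seconds, supplied by the caller so the
        logic can be tested without a real clock.
        """
        key = (host, trap_id)
        last = self._last_forwarded.get(key)
        if last is not None and (now - last) < self.window:
            return False  # duplicate inside the window: prune it
        self._last_forwarded[key] = now
        return True

dedup = TrapDeduplicator(window_seconds=60.0)
print(dedup.should_forward("router1", "linkDown", now=0.0))   # True
print(dedup.should_forward("router1", "linkDown", now=10.0))  # False
print(dedup.should_forward("router1", "linkUp", now=10.0))    # True
print(dedup.should_forward("router1", "linkDown", now=61.0))  # True
```

The timestamp is recorded only when a trap is forwarded, so a continuous stream of identical traps still gets through once per window rather than being suppressed forever.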

▸ quoted from Dominique Frise

All traps status except those in alert state are periodically set to clear.
The red/yellow -> clear transition should not generate a recovered
message. This should be achieved by removing "clear" from "OKCOLORS" in
xymonserver.cfg but this does not work without modifying xymond_alert.c.
A good <host>.trap should generate a green message and thus a recovered
message.

This is mostly just going to result in disk churn that you don't even want to look at, just to send some mails. If you didn't have Xymon in the picture, snmptrapd and traptoemail would do most of what you get ...

▸ quoted from Dominique Frise

We know that handling traps 100% correctly in Xymon is not possible because
we are misusing a single status (trap) to report many others, but this
scenario would allow:

- a better alerting of all bad traps from the same host and of same level.

Well, it is slightly better, but I don't see how traps for different reasons in different orders are going to be handled well.

▸ quoted from Dominique Frise

- the recovered status is a real recover (the text of the trap explains
what recovered)

This is about the only advantage, and I think there is more that could be improved with fewer disadvantages.

Regards,
Buchan

list Dominique Frise · Tue, 08 Feb 2011 11:02:35 +0100 ·

Hi Buchan,

▸ quoted from Buchan Milne


On 02/ 7/11 10:31 PM, Buchan Milne wrote:

On Monday, 7 February 2011 16:37:14 Dominique Frise wrote:

Hi Henrik,

Thanks for replying.

On 02/ 7/11 01:10 PM, Henrik Størner wrote:

In <user-bcb3434215c6@xymon.invalid> Dominique Frise <user-78ab6673b600@xymon.invalid>
writes:

What is the minimum time for the same alert status to stay up to be
processed correctly by Xymon ?

I am not sure I understand the question - are you saying that
Xymon does not generate the notifications you expect it to ?

Sort of...

We have SNMP trap handling configured (thanks Andy Farrior)

It is an ugly hack. We need a better solution. I didn't implement this one for
my own environments, as I was not willing to settle for it (one issue being
the multiple parts, snmptrapd->snmptt->sec->perl script), but I haven't
finished the work I wanted to do (a perl NetSNMP::TrapReceiver running in
snmptrapd that does all the tasks above) to have a better solution.

Well, Andy's work is advertised as "A very elegant method of feeding
traps into Xymon" ;-)
(http://www.xymon.com/xymon/help/xymon-tips.html#snmptraps)
This is also the kind of approach used for Nagios, but there
alerting is better supported by "volatile" services
(http://nagios.sourceforge.net/docs/2_0/volatileservices.html).

▸ quoted from Buchan Milne

but are not
completely happy with how it handles the alerting.
When a bad trap from a given host is received, an alert status is
generated for Xymon (yellow or red). So far, so good.

Actually, IMHO, no. The BB model works on monitoring a status, and generating
an event when the status changes. The problem comes when you listen for events
(traps), and the only way to handle them is to create a status, so you can
generate an event.

I think event-based monitoring should not go via 'status' messages, but go
into a separate channel, which handles events as events, and possibly alerts
directly instead of via the status channel.

Agree

▸ quoted from Buchan Milne

Then, before this status's validity expires (before it turns purple), a
periodically run script resets its color to green, thus
generating a recovered message independently of the real status of the
service reported by the trap. Furthermore, while a <host>.trap status
is in an alert state, other bad traps from the same host and of the same level
do not generate any alerts (they are ignored).

This is a generic problem, and applies to some extent to other tests as well.
Even if different types of traps were reported to different tests, there is
the issue of no component-level ack/alert/recover/disable etc. So, for
example, if non-critical filesystem goes yellow, and this is ack'ed or
disabled, then a critical filesystem does red, there will be no new
notification, it won't appear on the critical systems view, just as a trap for
a non-critical router interface will be lumped together with a critical one.

Not trivial to solve

▸ quoted from Buchan Milne

Here follows a description of what we are trying to implement in order to
improve this handling:

****
1. a bad <host>trap is detected.
2. generate a yellow/red <host>.trap status for Xymon.
3. after a short delay (ideally 1 sec.), generate a clear <host>.trap
status for Xymon.

So, the status page for the host is useless, the only thing you get is
alerting, it would be much better (IMHO) to go:

1)snmptrapd running NetSNMP::TrapReceiver which does MIB parsing etc., pruning
of duplicate traps itself, storing some trap details, and sends an 'event'
message to hobbitd.
2)A hobbit worker listening on the event channel and deciding when to send
page or ack messages to hobbitd for hobbitd_alert to act on. In some cases, it
might be desirable for it to do something besides alert (e.g. trigger a
configuration update for a network device on a device configuration save trap)

Solid concept indeed

▸ quoted from Buchan Milne

All traps status except those in alert state are periodically set to clear.
The red/yellow -> clear transition should not generate a recovered
message. This should be achieved by removing "clear" from "OKCOLORS" in
xymonserver.cfg but this does not work without modifying xymond_alert.c.
A good <host>.trap should generate a green message and thus a recovered
message.

This is mostly just going to result in disk churn that you don't even want to
look at, just to send some mails. If you didn't have Xymon in the picture,
snmptrapd and traptoemail would do most of what you get ...

The database history fed by snmptt is quite useful too

▸ quoted from Buchan Milne

We know that handling traps 100% correctly in Xymon is not possible because
we are misusing a single status (trap) to report many others, but this
scenario would allow:

- a better alerting of all bad traps from the same host and of same level.

Well, it is slightly better, but I don't see how traps for different reasons
in different orders are going to be handled well.

Not covered at all :-(

▸ quoted from Buchan Milne

- the recovered status is a real recover (the text of the trap explains
what recovered)

This is about the only advantage, and I think there is more that could be
improved with fewer disadvantages.

Eager to test your solution...
Don't forget to drop us a mail when it's ready for testing!

Regards,
Dominique

Regards,
Buchan
