Status change and history snapshots - how to force one?

3 messages in this thread

list Vernon Everett · Wed, 4 Mar 2009 11:36:22 +0900 ·

Greetings all

Here is an interesting scenario, and I was wondering if there is a feature to cater for this.

We are monitoring a system, and the alert went red.
We acknowledged the alert, and logged a call with our vendor for a replacement component.
So far, all good.

However, while waiting for the replacement component, we had another failure, which also should have triggered a red alert.
(It's a disk array, with many many disks, and many hot spares, so it wasn't a tragic failure)

The problem, of course, is that since the first alert was acknowledged, nobody spotted the second one, until the first disk was replaced, and the system remained red.
(Somebody did the old "hmm, that's odd" and had a look.)

Also, in the history, there is only one "snapshot" - when the first device failed. We have no history snapshot to show the second device failed.

Is there a way to force a status update/change, even though there is no real colour change?
Is there a way for Xymon to detect that we are looking at a second failure?

If the answer to the above is no, then can we add this to the feature wish-list?

I gave it a little thought, and came up with what I think is a simple implementation.
We could add this as an option to bb, similar to the [+lifetime] option, which will force the server to do a colour change from any colour to the new colour, even if they are the same.
This should also clear any acks, and take a snapshot for the history.
Implementing it as an option to bb will probably require relatively minor changes at server level, and put the onus on the client scripts to decide what should force a change.
But then again, I have no idea what is really involved in this, so feel free to ignore this paragraph. :-)

Regards
Vernon

list Henrik Størner · Wed, 4 Mar 2009 13:30:16 +0100 ·

▸ quoted from Vernon Everett

On Wed, Mar 04, 2009 at 11:36:22AM +0900, Everett, Vernon wrote:

We are monitoring a system, and the alert went red.
We acknowledged the alert, and logged a call with our vendor for a replacement component.
So far, all good.

However, while waiting for the replacement component, we had another failure, which also should have triggered a red alert.
(It's a disk array, with many many disks, and many hot spares, so it wasn't a tragic failure)

The problem, of course, is that since the first alert was acknowledged, nobody spotted the second one, until the first disk was replaced, and the system remained red.


Xymon did what it was supposed to do. There's no way - short of
human intelligence - to determine that the two red statuses were
different.

Is there a way to force a status update/change, even though there is no real colour change?

No, not with the current logic.

Is there a way for Xymon to detect that we are looking at a second failure?

No.

If the answer to the above is no, then can we add this to the feature wish-list?

:-) I suppose so, but we would have to figure out just what it is that
you want Xymon to do.

▸ quoted from Vernon Everett

I gave it a little thought, and came up with what I think is a simple implementation.
We could add this as an option to bb, similar to the [+lifetime] option, which will force the server to do a colour change from any colour to the new colour, even if they are the same.
This should also clear any acks, and take a snapshot for the history.
Implementing it as an option to bb will probably require relatively minor changes at server level, and put the onus on the client scripts to decide what should force a change.

That's one possibility. I'm not terribly thrilled with it, because
for a lot of tests - all the standard ones, and particularly the
ones like "msgs" or "procs" that collect lots of different data -
it will be somewhat of a headache to provide a framework for rules
that determine when the situation has changed 'enough' to warrant 
such an override. And I think that your average admin would not be 
pleased when his ACK was auto-cleared at 3 AM.

Maybe we could do it based on the acks? When an ack expires or
is cleared, this triggers the "next status is different" situation.
(You cannot clear an ack right now, but that can be done and would
probably be meaningful anyway). If we do that, then you would clear
the ack after you had repaired the first disk; then the status would
remain red, but Xymon would know now that it was a "different" red 
from the first one because you had cleared the ack for the first one.
So you would immediately get a new alert, and the history log would
update with the new status snapshot.
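
The ack-based idea above can be sketched as a small state model. This is
only an illustration of the proposed logic, not Xymon's actual code: the
StatusTracker class and all of its field and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StatusTracker:
    """Toy model of one host/test status (hypothetical, not Xymon internals)."""
    color: str = "green"
    acked: bool = False
    next_is_new: bool = False  # set when an ack is cleared or expires
    history: list = field(default_factory=list)
    alerts: list = field(default_factory=list)

    def acknowledge(self):
        self.acked = True

    def clear_ack(self):
        # Clearing (or expiry) marks the next incoming status as "different",
        # even if its colour is the same as the current one.
        self.acked = False
        self.next_is_new = True

    def update(self, color, text):
        changed = color != self.color or self.next_is_new
        if changed:
            self.history.append((color, text))  # take a history snapshot
            self.acked = False
            if color != "green":
                self.alerts.append((color, text))  # send a fresh alert
        self.color = color
        self.next_is_new = False
        return changed


tracker = StatusTracker()
tracker.update("red", "disk 3 failed")          # first failure: snapshot + alert
tracker.acknowledge()
tracker.update("red", "disks 3 and 7 failed")   # same colour, acked: nothing new
tracker.clear_ack()                             # disk 3 repaired, ack cleared
tracker.update("red", "disk 7 failed")          # treated as a new red
```

In this model the second failure is still not noticed while the ack is
active; it only surfaces once the ack is cleared, which matches the
behaviour described above.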


Regards,
Henrik

list Kristian Nielsen · Wed, 04 Mar 2009 13:58:39 +0100 ·

▸ quoted from Henrik Størner

"Everett, Vernon" <user-9da1a1882f49@xymon.invalid> writes:

We are monitoring a system, and the alert went red.
We acknowledged the alert, and logged a call with our vendor for a replacement component.
So far, all good.

However, while waiting for the replacement component, we had another failure, which also should have triggered a red alert.
(It's a disk array, with many many disks, and many hot spares, so it wasn't a tragic failure)

The problem, of course, is that since the first alert was acknowledged, nobody spotted the second one, until the first disk was replaced, and the system remained red.
(Somebody did the old "hmm, that's odd" and had a look.)

I think the real problem here is that your test flagged a red alarm, even
though the new disk had been ordered and the original problem had therefore
been detected and handled. In my world, "acknowledged" means that the problem
has not been resolved, but someone is working on it (so I don't have to). Once
resolved, the status should go green. So you had a false alarm, and this did
hide your second alarm, as false alarms can do.

So I would suggest changing the test to flag only parts that are broken and
for which no replacement has been ordered yet.

It seems what you are asking for is some kind of count for red alarms. Like
the first red alarm is "1 component broken", and the next alarm would then be
"2 components broken", and the increase from count=1 to count=2 should trigger
a new alarm. However, adding something like this fundamentally changes the
monitoring model, and it seems to me this would just complicate matters both
for testing and reporting without much gain.
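
To make the "count" idea concrete, here is a minimal sketch of what such a
trigger might look like on the client side. It compares sets of failed
components rather than a bare count (a deliberate substitution, so that one
repair plus one new failure is not missed). The function names and the
component names are hypothetical.

```python
def new_failures(previous, current):
    """Return components that are failed now but were not failed before."""
    return sorted(set(current) - set(previous))


def should_realert(previous, current):
    # Re-alert only when something newly failed, not when a repair
    # merely shrinks the set of failed components.
    return bool(new_failures(previous, current))


before = {"disk3"}
after = {"disk3", "disk7"}
if should_realert(before, after):
    print("new failures:", ", ".join(new_failures(before, after)))
```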

Alternatively, you could just set up separate alarms for each separate
component; that is not much different from counting the number of failing
components, which you would have to do in any case.
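
A minimal sketch of the per-component approach, assuming the usual
"status HOSTNAME.TESTNAME COLOR text" message form with dots in the host
name written as commas. The host name, the disk names, and the
per_disk_status helper are made up for illustration; the sketch only
builds the message strings and does not send them.

```python
def per_disk_status(host, disks):
    """Build one status message per disk instead of one for the whole array.

    `disks` maps a disk name to True (healthy) or False (failed).
    """
    host = host.replace(".", ",")  # assumed convention for host names
    messages = []
    for name, healthy in sorted(disks.items()):
        color = "green" if healthy else "red"
        state = "OK" if healthy else "FAILED"
        messages.append(f"status {host}.{name} {color} {name} {state}")
    return messages


for line in per_disk_status("array1.example.com",
                            {"disk3": False, "disk7": True}):
    print(line)
```

With one status column per disk, acknowledging the alert for the first
failed disk leaves the other columns untouched, so a second failure turns
its own column red and gets its own history entry.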

Hope this helps,

 - Kristian.
