Use hobbit in operation center with critical systems view

8 messages in this thread

list Gräub Roland · Mon, 15 Oct 2007 09:29:53 +0200 ·

Hi all

We are planning a complete switch from our "production" systems management tool to Hobbit. At the moment most systems have both clients installed, and the plan is to use only Hobbit.

In our environment the Operation Center always makes a call when an alert shows up on their Event Console, and then acknowledges the alert. After this action the alert is no longer visible to the operators.
At the moment the Operation Center doesn't care about Hobbit, but of course lots of other people in our organization use Hobbit and are happy with this great tool.

With the critical systems view, I think Hobbit offers an ideal view for our Operation Center.

The following questions/thoughts came up when we looked closer;

Acknowledge;
If an alert is acknowledged by the operators in the critical systems view, the acknowledgement is fixed for the given time, even when there is a status change.
When a problem is fixed and then goes red/yellow again, it will not show up in the critical view until the ack time has expired.
There should be an option to ack an alert until a status change (like in "disable until OK").

The Host-ack option seems to be broken: on my system only one test is acknowledged, although the Host-ack checkbox is selected.

Log;
A log/report from the critical view is missing. A report with information about the alerts and the acknowledgements that were made in the critical systems view would be helpful.

Definition (Edit Critical Systems);
The easiest way for us: we made standard definitions and add hosts to these templates. Works fine.
But I miss a connection between alerts and the critical view definition. Something like an option in hobbit-alerts.cfg to define that this rule is also valid for the critical view.
Send an email when an alert shows up in the critical view, with all the possibilities from hobbit-alerts.cfg.

Special case: messages missed or noticed late by the Operation Center;
Currently some applications/scripts send alerts to the Console View, and the Operation Center makes an alert call for each event.
A problem in Hobbit/BB is that when something changes within a status that is already red, the Operation Center doesn't notice until the acknowledge time runs out and they make the alert call again.
This can happen for example in the disk status test (a second filesystem goes red) or with nested tests/logfiles. With the Event Console they get two messages (one for each filesystem).

Is this a topic at all? How is it handled in your organisation?

Regards,
Roland

list Henrik Størner · Fri, 9 Nov 2007 00:26:42 +0100 ·

▸ quoted from Gräub Roland

On Mon, Oct 15, 2007 at 09:29:53AM +0200, Gräub Roland wrote:

In our environment the Operation Center always makes a call when an alert
shows up on their Event Console, and then acknowledges the alert. After this
action the alert is no longer visible to the operators.

The following questions/thoughts came up when we looked closer;

Acknowledge;
If an alert is acknowledged by the operators in the critical systems view,
the acknowledgement is fixed for the given time, even when there is a
status change.
When a problem is fixed and then goes red/yellow again, it will not show
up in the critical view until the ack time has expired.
There should be an option to ack an alert until a status change (like in
"disable until OK").

I decided against the "ack-until-ok" method, because in my experience
systems often go briefly ok while being fixed, and then they crash
again. (E.g. you'd reboot a server and all the processes start up, but
one process that is being monitored dies after a few minutes). So the
monitoring reports OK for a few minutes, and then goes red - if you did
use an "ack-until-ok" it would show up on the critical systems view
again, triggering a new ticket.

What happens now is that when the status goes green, a timer kicks off
in Hobbit which lasts 12 minutes (i.e. 2 normal test cycles, plus a bit
for good measure). If the test has been OK throughout those 12 minutes
then the ack is cleared; if it goes non-green during that time the timer
is reset and the ack persists (at least until it eventually expires).
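
The clearing rule described above can be sketched roughly as follows. This is only an illustration of the logic as stated in this message, not the Hobbit daemon's actual code; the class name `Ack`, its fields and the `update` method are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# 12 minutes, as described above (two normal test cycles plus a margin).
CLEAR_DELAY = 12 * 60  # seconds


@dataclass
class Ack:
    """Hypothetical model of one acknowledgement on a host/test status."""
    expires_at: float                     # time at which the ack expires on its own
    green_since: Optional[float] = None   # start of the current unbroken green period

    def update(self, color: str, now: float) -> bool:
        """Feed one status update; return True while the ack is still in force."""
        if now >= self.expires_at:
            return False                  # the ack ran out, whatever the colour
        if color == "green":
            if self.green_since is None:
                self.green_since = now    # status just recovered: start the timer
            elif now - self.green_since >= CLEAR_DELAY:
                return False              # green for the whole delay: ack is cleared
        else:
            self.green_since = None       # relapse: timer is reset, ack persists
        return True
```

With this model, a status that goes green for five minutes and then red again keeps its ack, which is the "reboot, then the process dies" case mentioned above.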

▸ quoted from Gräub Roland

The Host-ack option seems to be broken: on my system only one test is
acknowledged, although the Host-ack checkbox is selected.

A quick test says you're right. Will have to look into that.

▸ quoted from Gräub Roland

Log;
A log/report from the critical view is missing. A report with information
about the alerts and the acknowledgements that were made in the critical
systems view would be helpful.

Right now it isn't even being logged, except inside the Hobbit daemon. A
reporting tool is needed, I agree.

▸ quoted from Gräub Roland

Definition (Edit Critical Systems);
The easiest way for us: we made standard definitions and add hosts to these templates. Works fine.
But I miss a connection between alerts and the critical view definition. Something like an option in hobbit-alerts.cfg to define that this rule is also valid for the critical view.
Send an email when an alert shows up in the critical view, with all the possibilities from hobbit-alerts.cfg.

Wouldn't these two do the same thing ? 
Using the alert definitions to control the critical view is an interesting 
idea, I hadn't thought of that.

▸ quoted from Gräub Roland

Special case: messages missed or noticed late by the Operation Center;
Currently some applications/scripts send alerts to the Console View, and the Operation Center makes an alert call for each event.
A problem in Hobbit/BB is that when something changes within a status that is already red, the Operation Center doesn't notice until the acknowledge time runs out and they make the alert call again.
This can happen for example in the disk status test (a second filesystem goes red) or with nested tests/logfiles. With the Event Console they get two messages (one for each filesystem).

This is a problem with all of the tests that have multiple ways of going
red: disk, procs, msgs and http are the common ones. I don't have a
solution to that right now. The way Hobbit works right now assumes that
when you get an alert about the "disk" status, you keep on fixing it
until the status goes green - and then the Operations Center won't need
to raise a ticket for the second event.


Regards,
Henrik

list Eduard Michels · Fri, 9 Nov 2007 08:38:09 -0200 ·

▸ quoted from Henrik Størner

-----Original Message-----
From: Henrik Stoerner [mailto:user-ce4a2c883f75@xymon.invalid]
Sent: Thursday, November 8, 2007 21:27
To: user-ae9b8668bcde@xymon.invalid
Subject: Re: [hobbit] Use hobbit in operation center with critcal systems view

On Mon, Oct 15, 2007 at 09:29:53AM +0200, Gräub Roland wrote:

In our environment the Operation Center always makes a call when an alert shows up on their Event Console, and then acknowledges the alert. After this action the alert is no longer visible to the operators.

The following questions/thoughts came up when we looked closer;

Acknowledge;
If an alert is acknowledged by the operators in the critical systems view, the acknowledgement is fixed for the given time, even when there is a status change.
When a problem is fixed and then goes red/yellow again, it will not show up in the critical view until the ack time has expired.
There should be an option to ack an alert until a status change (like in "disable until OK").

I decided against the "ack-until-ok" method, because in my experience systems often go briefly ok while being fixed, and then they crash again. (E.g. you'd reboot a server and all the processes start up, but one process that is being monitored dies after a few minutes). So the monitoring reports OK for a few minutes, and then goes red - if you did use an "ack-until-ok" it would show up on the critical systems view again, triggering a new ticket.

What happens now is that when the status goes green, a timer kicks off in Hobbit which lasts 12 minutes (i.e. 2 normal test cycles, plus a bit for good measure). If the test has been OK throughout those 12 minutes then the ack is cleared; if it goes non-green during that time the timer is reset and the ack persists (at least until it eventually expires).

The Host-ack option seems to be broken: on my system only one test is acknowledged, although the Host-ack checkbox is selected.

A quick test says you're right. Will have to look into that.

Log;
A log/report from the critical view is missing. A report with information about the alerts and the acknowledgements that were made in the critical systems view would be helpful.

Right now it isn't even being logged, except inside the Hobbit daemon. A reporting tool is needed, I agree.

Definition (Edit Critical Systems);
The easiest way for us: we made standard definitions and add hosts to these templates. Works fine.
But I miss a connection between alerts and the critical view definition. Something like an option in hobbit-alerts.cfg to define that this rule is also valid for the critical view.
Send an email when an alert shows up in the critical view, with all the possibilities from hobbit-alerts.cfg.

Wouldn't these two do the same thing ?
Using the alert definitions to control the critical view is an interesting idea, I hadn't thought of that.

Special case: messages missed or noticed late by the Operation Center;
Currently some applications/scripts send alerts to the Console View, and the Operation Center makes an alert call for each event.
A problem in Hobbit/BB is that when something changes within a status that is already red, the Operation Center doesn't notice until the acknowledge time runs out and they make the alert call again.
This can happen for example in the disk status test (a second filesystem goes red) or with nested tests/logfiles. With the Event Console they get two messages (one for each filesystem).

This is a problem with all of the tests that have multiple ways of going red: disk, procs, msgs and http are the common ones. I don't have a solution to that right now. The way Hobbit works right now assumes that when you get an alert about the "disk" status, you keep on fixing it until the status goes green - and then the Operations Center won't need to raise a ticket for the second event.

As a solution to this problem, I count the alerts within each test; if the number of alerts has changed, a new alert is generated with the status of the test.

Regards,
Henrik

list Gary Baluha · Fri, 9 Nov 2007 09:21:06 -0500 ·

▸ quoted from Eduard Michels

Special case: messages missed or noticed late by the Operation Center;
Currently some applications/scripts send alerts to the Console View, and the Operation Center makes an alert call for each event.
A problem in Hobbit/BB is that when something changes within a status that is already red, the Operation Center doesn't notice until the acknowledge time runs out and they make the alert call again.
This can happen for example in the disk status test (a second filesystem goes red) or with nested tests/logfiles. With the Event Console they get two messages (one for each filesystem).

This is a problem with all of the tests that have multiple ways of going
red: disk, procs, msgs and http are the common ones. I don't have a
solution to that right now. The way Hobbit works right now assumes that
when you get an alert about the "disk" status, you keep on fixing it
until the status goes green - and then the Operations Center won't need
to raise a ticket for the second event.

As has been mentioned before, it seems the "Info" column doesn't
properly display GROUP alert definitions...

Anyway, what about doing something with the way GROUP alerts are
defined to take care of such tests with multiple ways of going red.
For starters, I wouldn't think it would be too hard to modify the
Critical Systems page to handle group-based alerts.  You could then
expand on that idea to take care of each individual triggering event.
Migrating this functionality to the non-green page/etc might take a
little more work, but I know at least where I work, getting this taken
care of so our Operations Center doesn't needlessly call people is the
first thing I would want to get working.

list Gräub Roland · Mon, 12 Nov 2007 10:01:08 +0100 ·

▸ quoted from Eduard Michels

Acknowledge;
If an alert is acknowledged by the operators in the critical systems view, the acknowledgement is fixed for the given time, even when there is a status change.
When a problem is fixed and then goes red/yellow again, it will not show up in the critical view until the ack time has expired.
There should be an option to ack an alert until a status change (like in "disable until OK").

I decided against the "ack-until-ok" method, because in my experience
systems often go briefly ok while being fixed, and then they crash
again. (E.g. you'd reboot a server and all the processes start up, but
one process that is being monitored dies after a few minutes). So the
monitoring reports OK for a few minutes, and then goes red - if you did
use an "ack-until-ok" it would show up on the critical systems view
again, triggering a new ticket.

What happens now is that when the status goes green, a timer kicks off
in Hobbit which lasts 12 minutes (i.e. 2 normal test cycles, plus a bit
for good measure). If the test has been OK throughout those 12 minutes
then the ack is cleared; if it goes non-green during that time the timer
is reset and the ack persists (at least until it eventually expires).

I agree with you that ack-until-ok could result in a lot more unneeded alerts, so it is unnecessary.
The clear time of 12 minutes is a good choice; it might be made an option in hobbitserver.cfg.

▸ quoted from Eduard Michels

Definition (Edit Critical Systems);
The easiest way for us: we made standard definitions and add hosts to these templates. Works fine.
But I miss a connection between alerts and the critical view definition. Something like an option in hobbit-alerts.cfg to define that this rule is also valid for the critical view.
Send an email when an alert shows up in the critical view, with all the possibilities from hobbit-alerts.cfg.

Wouldn't these two do the same thing ?

Currently, during the daytime, the recovery group gets alerts on the in-house pager.
These are the same systems as in the operator view, but the definition is in hobbit-alerts.cfg.

By the way, in page.log I get this message from my custom pager script:
2007-11-09 09:05:00 hobbitd_alert: Got message 52634, expected 52615
Maybe the reason is the long runtime of the script, which sends the message through a slow analog modem connection on another server; this takes 30 seconds to finish.
But I don't know what this message really means; everything seems to work as expected.

▸ quoted from Gary Baluha

Using the alert definitions to control the critical view is an interesting idea, I hadn't thought of that.

Special case: messages missed or noticed late by the Operation Center;
Currently some applications/scripts send alerts to the Console View, and the Operation Center makes an alert call for each event.
A problem in Hobbit/BB is that when something changes within a status that is already red, the Operation Center doesn't notice until the acknowledge time runs out and they make the alert call again.
This can happen for example in the disk status test (a second filesystem goes red) or with nested tests/logfiles. With the Event Console they get two messages (one for each filesystem).

This is a problem with all of the tests that have multiple ways of going
red: disk, procs, msgs and http are the common ones. I don't have a
solution to that right now. The way Hobbit works right now assumes that
when you get an alert about the "disk" status, you keep on fixing it
until the status goes green - and then the Operations Center won't need
to raise a ticket for the second event.

It's as you say: when it's red, I have to keep fixing it until the test is green again. Maybe we will split up some tests (for example, make a separate test for important procs, or split custom tests).

Roland

list Gräub Roland · Mon, 12 Nov 2007 10:12:39 +0100 ·

▸ quoted from Gräub Roland

Special case: messages missed or noticed late by the Operation Center;
Currently some applications/scripts send alerts to the Console View, and the Operation Center makes an alert call for each event.
A problem in Hobbit/BB is that when something changes within a status that is already red, the Operation Center doesn't notice until the acknowledge time runs out and they make the alert call again.
This can happen for example in the disk status test (a second filesystem goes red) or with nested tests/logfiles. With the Event Console they get two messages (one for each filesystem).

This is a problem with all of the tests that have multiple ways of going red: disk, procs, msgs and http are the common ones. I don't have a solution to that right now. The way Hobbit works right now assumes that when you get an alert about the "disk" status, you keep on fixing it until the status goes green - and then the Operations Center won't need to raise a ticket for the second event.

As a solution to this problem, I count the alerts within each test; if the number of alerts has changed, a new alert is generated with the status of the test.

Sounds promising; how exactly does it work?

If the alert is already red, how can you send a new alert?

list Eduard Michels · Mon, 12 Nov 2007 08:42:16 -0200 ·

Hi

▸ quoted from Gräub Roland

-----Original Message-----
From: Gräub Roland [mailto:user-3de8f19e45fe@xymon.invalid]
Sent: Monday, November 12, 2007 07:13
To: user-ae9b8668bcde@xymon.invalid
Subject: AW: [hobbit] Use hobbit in operation center with critcal systems view

Special case: messages missed or noticed late by the Operation Center;
Currently some applications/scripts send alerts to the Console View, and the Operation Center makes an alert call for each event.
A problem in Hobbit/BB is that when something changes within a status that is already red, the Operation Center doesn't notice until the acknowledge time runs out and they make the alert call again.
This can happen for example in the disk status test (a second filesystem goes red) or with nested tests/logfiles. With the Event Console they get two messages (one for each filesystem).

This is a problem with all of the tests that have multiple ways of going red: disk, procs, msgs and http are the common ones. I don't have a solution to that right now. The way Hobbit works right now assumes that when you get an alert about the "disk" status, you keep on fixing it until the status goes green - and then the Operations Center won't need to raise a ticket for the second event.

As a solution to this problem, I count the alerts within each test; if the number of alerts has changed, a new alert is generated with the status of the test.

Sounds promising; how exactly does it work?

If the alert is already red, how can you send a new alert?

I create a new red-to-red event, so I get a new event every time the number of alerts changes.
I have this procedure in a piece of software that was a modification of BB.
I am now working on adapting it to Hobbit.
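
The counting idea can be sketched along these lines. This is not the BB modification mentioned above, only a guess at its logic: it assumes the status text marks each failing item with a line starting with `&red` or `&yellow`, and the names `count_problems` and `RepeatDetector` are hypothetical.

```python
import re
from typing import Dict, Optional, Tuple

# Assumption: each failing item in a status message is a line that starts
# with a colour marker such as "&red" or "&yellow".
MARKER = re.compile(r"^&(red|yellow)\b", re.MULTILINE)


def count_problems(status_text: str) -> int:
    """Count the individual failing items inside one status message."""
    return len(MARKER.findall(status_text))


class RepeatDetector:
    """Remember (colour, problem count) per host/test and flag any change."""

    def __init__(self) -> None:
        self._last: Dict[Tuple[str, str], Tuple[str, int]] = {}

    def should_alert(self, host: str, test: str, color: str, status_text: str) -> bool:
        key = (host, test)
        current = (color, count_problems(status_text))
        previous: Optional[Tuple[str, int]] = self._last.get(key)
        self._last[key] = current
        if color == "green":
            return False          # recoveries are not handled in this sketch
        # First sighting, a colour change, or a "red to red" change in the count.
        return previous is None or current != previous
```

In the disk example from this thread, the first red filesystem raises an event, an unchanged repeat does not, and a second filesystem going red raises a new event even though the overall colour is still red.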

list Buchan Milne · Thu, 15 Nov 2007 18:43:24 +0200 ·

▸ quoted from Gräub Roland

On Fri, 2007-11-09 at 00:26 +0100, Henrik Stoerner wrote:

On Mon, Oct 15, 2007 at 09:29:53AM +0200, Gräub Roland wrote:

Definition (Edit Critical Systems);
The easiest way for us: we made standard definitions and add hosts to these templates. Works fine.
But I miss a connection between alerts and the critical view definition. Something like an option in hobbit-alerts.cfg to define that this rule is also valid for the critical view.
Send an email when an alert shows up in the critical view, with all the possibilities from hobbit-alerts.cfg.

Wouldn't these two do the same thing ? 
Using the alert definitions to control the critical view is an interesting 
idea, I hadn't thought of that.

I remind you that I previously asked for a method to filter in
hobbit-alerts.cfg based on whether the test is a critical test (in its
time frame etc.).

The other problem I have with the critical view is the fact that (in
4.2.0 + the patch set from about one year ago) the "Config Report
(critical)" does not work, it displays nothing, even though "Config
Report" for the same page/host lists the details in the NK column.

We have just enabled passing events via a SCRIPT rule in
hobbit-alerts.cfg through to CA Unicenter (unfortunately with a
proprietary middleware).
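
For anyone wanting to do something similar, a script called from a SCRIPT rule could start out like the sketch below. It is not our actual integration (the middleware side is proprietary and not shown). It assumes the alert module passes details to the script in environment variables such as BBHOSTNAME, BBSVCNAME, BBCOLORLEVEL, BBALPHAMSG, ACKCODE and RECOVERED, as I recall from the hobbit-alerts.cfg man page; check the names against your version. The function `build_event` and the event field names are hypothetical.

```python
import os
from typing import Dict, Mapping


def build_event(env: Mapping[str, str]) -> Dict[str, str]:
    """Map the alert environment onto a flat event record for an external console."""
    # Assumption: RECOVERED is "1" when the script is called for a recovery.
    recovered = env.get("RECOVERED", "0") == "1"
    return {
        "host": env.get("BBHOSTNAME", ""),
        "test": env.get("BBSVCNAME", ""),
        "severity": "ok" if recovered else env.get("BBCOLORLEVEL", "unknown"),
        "ack_code": env.get("ACKCODE", ""),
        "text": env.get("BBALPHAMSG", ""),
    }


if __name__ == "__main__":
    # A real script would hand this record to the event console's own
    # interface; printing it is only a placeholder.
    print(build_event(os.environ))
```

Keeping the mapping in a plain function makes it easy to test without a running Hobbit server.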

Regards,
Buchan
