Windows Cluster Monitor

list Al Jeffcoat · Thu, 24 Feb 2005 18:27:35 -0500 ·

Hello All,

Our new director would like to monitor EVERYTHING from BB/Hobbit. We
have been monitoring our UNIX and Storage Devices for a few years now.
Now that I have windows servers to monitor, I'd like to know if anyone
has a decent way to monitor Windows Clusters? I had a thought to
monitor by pinging each node in the cluster, and the cluster name, i.e.:

Nodea - Application Offline
Nodeb - Application Online
Clustername - Application Responding @ This address

How would you set up resource (process) monitoring for an Active /
Passive cluster? Or an Active / Active cluster?

This is in response to a problem that has been occurring on a new 24x7
Windows server blue screening daily, in spite of all the "fixes" that
have occurred to solve the problem (more hardware, patches, reload os,
etc, etc).

We'll soon be moving the application to an AIX server, but I'll have the
same questions on an HACMP cluster at that point :)

TIA

Al Jeffcoat
IBM Certified Support Specialist, AIX
Enterprise Storage Administrator
System Programmer II
(321)843-1051
user-b34a8ad6e24c@xymon.invalid

This e-mail message and any attached files are confidential and are intended solely for the use of the addressee(s) named above. If you are not the intended recipient, any review, use, or distribution of this e-mail message and any attached files is strictly prohibited. This communication may contain material protected by Federal privacy regulations, attorney-client work product, or other privileges. If you have received this confidential communication in error, please notify the sender immediately by reply e-mail message and permanently delete the original message. To reply to our email administrator directly, send an email to: user-ecde3bbc361d@xymon.invalid . If this e-mail message concerns a contract matter, be advised that no employee or agent is authorized to conclude any binding agreement on behalf of Orlando Regional Healthcare by e-mail without express written confirmation by an officer of the corporation. Any views or opinions presented in this e-mail are solely those of the author and do not necessarily represent those of Orlando Regional Healthcare.

list Oliver Bassett · Fri, 25 Feb 2005 13:43:37 +1300 ·

My current solution for this is pretty messy, but it works for me.

In the below I am assuming that you are going to have the windows bb client
installed on all nodes of the cluster.

Active/Passive:
Since I want to know when the services fail over, I check that the
processes/services are running on the active node, and that the cluster
service is running on both nodes. This lets me know when it
fails over so I can investigate immediately.

I also run a specific application check against the clustered application on
the cluster address to ensure the application itself is up and running. If
it isn't then there is a really serious problem.
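
The active/passive checks described above could be folded into a single color decision roughly as follows. This is only a sketch: the input dictionary, the function name and the choice of yellow for a failover are illustrative assumptions, and `ClusSvc` is assumed to be the name of the cluster service on the nodes.

```python
from typing import Dict, Set

CLUSTER_SERVICE = "ClusSvc"  # assumed service name of the cluster service


def active_passive_color(running: Dict[str, Set[str]],
                         app_services: Set[str],
                         expected_active: str) -> str:
    """Return a color for an active/passive cluster.

    running maps node name -> set of services currently running there.
    """
    # The cluster service itself must be up on every node.
    if any(CLUSTER_SERVICE not in svcs for svcs in running.values()):
        return "red"
    # Nodes on which every application service is running.
    owners = [node for node, svcs in running.items() if app_services <= svcs]
    if not owners:
        return "red"      # the application is not fully up anywhere
    if expected_active not in owners:
        return "yellow"   # it has failed over: investigate
    return "green"
```

The application check against the cluster address stays a separate test, since it is the only one that shows the application is actually answering.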

Active/Active:
Since I don't do this I can't be entirely sure, but I would think this
is easier to monitor: just ensure that all the required services are
running on all nodes, including the cluster service itself.

I hope this is of some help.

Regards
Oliver Bassett

list Kevin Grady · Thu, 24 Feb 2005 19:46:06 -0500 ·

Use WMI to query the MSCluster_Resource groups and you can grab the
status of each resource and then report back to hobbit.

Here's a link to some examples from MS.

http://www.microsoft.com/technet/scriptcenter/scripts/network/cluster/default.mspx
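
As a rough illustration of the "grab the status and report back" step, the sketch below turns a set of resource states into a color and a report body. It assumes the states have already been fetched, for example with a WMI query along the lines of `SELECT Name, State FROM MSCluster_Resource` in the `root\MSCluster` namespace; the severity policy, the function name and the sample resource names are assumptions for illustration only.

```python
from typing import Dict, Tuple

# State codes of the MSCluster_Resource WMI class.
STATE_NAMES = {
    -1: "Unknown", 0: "Inherited", 1: "Initializing", 2: "Online",
    3: "Offline", 4: "Failed", 128: "Pending",
    129: "Online Pending", 130: "Offline Pending",
}

# Assumed severity policy; adjust to taste.
GREEN_STATES = {2}
RED_STATES = {3, 4}
RANK = {"green": 0, "yellow": 1, "red": 2}


def summarise(resources: Dict[str, int]) -> Tuple[str, str]:
    """Return (color, report body) for a mapping resource name -> State."""
    worst = "green"
    lines = []
    for name in sorted(resources):
        state = resources[name]
        if state in GREEN_STATES:
            color = "green"
        elif state in RED_STATES:
            color = "red"
        else:
            color = "yellow"   # pending, initializing, unknown, ...
        if RANK[color] > RANK[worst]:
            worst = color
        # "&color" at the start of a line is shown as a colored icon.
        lines.append(f"&{color} {name}: {STATE_NAMES.get(state, state)}")
    return worst, "\n".join(lines)
```

The returned color and body can then be placed in an ordinary status message for a test column of your choosing.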

list Kevin Grady · Thu, 24 Feb 2005 20:36:17 -0500 ·

I'll post something this weekend as I have been working on this for a
SQL cluster we have running.

list Al Jeffcoat · Thu, 24 Feb 2005 21:14:00 -0500 ·

That'd be great.  We are demoing this for the new director on Monday, so
that would really be nice.  Thanks very much.

If not, we'll cobble something together based on the responses we've
gotten so far.

The UNIX stuff is easy; it's been in place forever.  The windows stuff
is new to me as far as a production environment goes (I'm a home dabbler
in windows).  Also, our windows admins have resisted this every time we
bring it up, so I'm fighting that resistance too :).

In any case, thanks for the reply...

list Kevin Grady · Fri, 25 Feb 2005 12:19:42 -0500 ·

Check out bb-mscs on deadcat. It does most of what I am looking for
and maybe enough for your demo. You'd need to adjust the script if you
want red when a resource goes offline. Right now it will turn yellow.

list Tom Georgoulias · Fri, 25 Feb 2005 13:13:50 -0500 ·

Couple of things:

1.  This may be a dumb question, but I'm gonna ask it anyway.  Should I expect to be able to take the ACK code in purple pages and use it with the acknowledge alert feature?  I've been experimenting with turning off big brother on a client and causing purple pages, but using the ACK code in the emails does not prevent purple pages from continuing, nor does my explanation get recorded into the acklog.  ACK works for red & yellow, though.

2.  I could've sworn that I had read about the ability to merge all the purple alerts into a single email or behavior that did that automatically, but I can't seem to find it in the docs.  Is that possible?  Where can I read up on how to use it?  I'd love to get a single alert if a client goes purple  that can use a single ACK code to disable pages.

Tom

list Henrik Størner · Fri, 25 Feb 2005 19:39:20 +0100 ·

▸ quoted from Tom Georgoulias

On Fri, Feb 25, 2005 at 01:13:50PM -0500, Tom Georgoulias wrote:

Couple of things:

1.  This may be a dumb question, but I'm gonna ask it anyway.  Should I expect to be able to take the ACK code in purple pages and use it with the acknowledge alert feature?

Yes, that's the idea.

▸ quoted from Tom Georgoulias

 I've been experimenting with turning off big brother on a client
and causing purple pages, but using the ACK code in the emails does
not prevent purple pages from continuing, nor does my explanation
get recorded into the acklog.  ACK works for red & yellow, though.

Hmm - odd. I'll try it out later tonight.

2.  I could've sworn that I had read about the ability to merge all the purple alerts into a single email or behavior that did that automatically, but I can't seem to find it in the docs.  Is that possible?

No. I'd like to do some more general merging of alerts - not just
purple ones - but that'll be later.

▸ quoted from Tom Georgoulias

 Where can I read up on how to use it?  I'd love to get a single
alert if a client goes purple that can use a single ACK code to
disable pages.

OK, I'll let you in on a secret: If you send an acknowledge with
minus-ACKCODE, it will work as an ack for all current alerts on that
host.
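
For anyone scripting this rather than replying to the alert mail, the trick might look like the sketch below. It assumes the acknowledge message takes the form `hobbitdack COOKIE DURATION TEXT`; that form, the function name and the parameters are assumptions to check against the man pages of your Hobbit version. Only the minus-sign behaviour comes from the post above.

```python
def ack_message(ack_code: int, minutes: int, text: str,
                whole_host: bool = False) -> str:
    """Build an acknowledge message for one alert or for a whole host.

    Assumes the "hobbitdack COOKIE DURATION TEXT" message form.
    """
    if ack_code <= 0:
        raise ValueError("ack code must be a positive number")
    # A minus in front of the code acks all current alerts on that host.
    cookie = -ack_code if whole_host else ack_code
    return f"hobbitdack {cookie} {minutes} {text}"
```

With `whole_host=True` a single ACK code from any of the alert mails is enough to silence every current alert for that host.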


Henrik

list Tom Georgoulias · Fri, 25 Feb 2005 14:03:50 -0500 ·

▸ quoted from Henrik Størner

Henrik Stoerner wrote:

1.  This may be a dumb question, but I'm gonna ask it anyway.  Should I
expect to be able to take the ACK code in purple pages and use it with
the acknowledge alert feature?

Yes, that's the idea.

OK.

It seems that when I acknowledge a red/yellow alert, the trend chart is not updated during the acknowledgment time period, but resumes after the time period is over (without using any of the data that would've been collected during that time).  Is that also expected?

Also, how can I unacknowledge a host, if I fix a problem before the time that I estimated it would take?

▸ quoted from Henrik Størner

No. I'd like to do some more general merging of alerts - not just
purple ones - but that'll be later.

OK, that explains why I couldn't find anything about it in the docs.

▸ quoted from Henrik Størner

OK, I'll let you in on a secret: If you send an acknowledge with
minus-ACKCODE, it will work as an ack for all current alerts on that
host.

:)  Sounds good.


Tom

list Tom Georgoulias · Fri, 25 Feb 2005 15:04:34 -0500 ·

▸ quoted from Henrik Størner

Henrik Stoerner wrote:

 I've been experimenting with turning off big brother on a client
and causing purple pages, but using the ACK code in the emails does
not prevent purple pages from continuing, nor does my explanation
get recorded into the acklog.  ACK works for red & yellow, though.

Hmm - odd. I'll try it out later tonight.

A follow up:  I restarted hobbit and repeated the experiment, turning off bbc on my client and waiting for 30 mins until it was put into a purple status.  Then I took one of the ACK codes from my purple alert emails and it worked just as expected, disabling paging for the time duration I entered and displaying the text message that I entered. I also tested putting a "-" in front of the ACK code, and it acknowledged all the purples for the host, so that's a neat little trick.

I cannot explain why this didn't work the first times that I tried it, but I swear it didn't.

Tom

list Henrik Størner · Sun, 27 Feb 2005 17:27:38 +0100 ·

▸ quoted from Tom Georgoulias

On Fri, Feb 25, 2005 at 02:03:50PM -0500, Tom Georgoulias wrote:

Henrik Stoerner wrote:

1.  This may be a dumb question, but I'm gonna ask it anyway.  Should I
expect to be able to take the ACK code in purple pages and use it with
the acknowledge alert feature?

Yes, that's the idea.

I tried it now, and ack'ing a purple status seems to work ok. I'll see
if it stops sending me alerts.

▸ quoted from Tom Georgoulias

It seems that when I acknowledge a red/yellow alert, the trend chart is 
not updated during the acknowledgment time period, but resumes after the 
time period is over (without using any of the data that would've been 
collected during that time).  Is that also expected?

Ack'ing should not have any influence on whether data is collected or
not. What matters is if there are any updates - if the host is down,
you obviously won't be getting any new reports, and then the graphs
won't update.

▸ quoted from Tom Georgoulias

Also, how can I unacknowledge a host, if I fix a problem before the time 
that I estimated it would take?

You cannot, but the acknowledge should clear automatically as soon as
an OK status arrives.


Regards,
Henrik

list Tom Georgoulias · Mon, 28 Feb 2005 13:28:18 -0500 ·

▸ quoted from Henrik Størner

Henrik Stoerner wrote:

I tried it now, and ack'ing a purple status seems to work ok. I'll see
if it stops sending me alerts.

I am able to ack as well, so that works.

While we're on the topic of purple status messages... Hobbit is configured to turn a host purple if it hasn't heard from it in 30 mins.  I want mine to go purple after 15, so I changed PURPLEDELAY from "30" to "15" in hobbitserver.cfg, but that doesn't seem to make a difference. What else needs to be changed?

▸ quoted from Henrik Størner

Ack'ing should not have any influence on whether data is collected or
not. What matters is if there are any updates - if the host is down,
you obviously won't be getting any new reports, and then the graphs
won't update.

In the cases where I was testing and observed the behavior above (a 97% full disk partition), the client was online and sending data but the graphs had stalled.

This doesn't seem to be happening on RC4, so something was either fixed or the fresh install on my end helped.

▸ quoted from Henrik Størner

Also, how can I unacknowledge a host, if I fix a problem before the time
that I estimated it would take?

You cannot, but the acknowledge should clear automatically as soon as
an OK status arrives.

I think I found a loophole that may cause problems in certain circumstances:  Say I get a red alert for something, give an estimate of 120 mins to fix it, and the host goes purple 45 mins later (i.e. it crashes), before the ack clears.  The ack stays in place from the red state and I won't get a page for the red -> purple transition until the 120 mins have passed and paging resumes (presumably because the ack wasn't cleared, since the status never went green before going purple).  This could be bad news if a system crashes while the support tech is busy with other things, or if a system is brought back online after a purple status and returns to something non-green (e.g. disk is the only thing that is monitored on the system, and it immediately goes to red after boot up and stays that way for a while).

Tom

list Henrik Størner · Mon, 28 Feb 2005 23:06:25 +0100 ·

▸ quoted from Tom Georgoulias

On Mon, Feb 28, 2005 at 01:28:18PM -0500, Tom Georgoulias wrote:

While we're on the topic of purple status messages... Hobbit is configured to turn a host purple if it hasn't heard from it in 30 mins.  I want mine to go purple after 15, so I changed PURPLEDELAY from "30" to "15" in hobbitserver.cfg, but that doesn't seem to make a difference. What else needs to be changed?

It's the program that generates the status message, that also
determines how long it is valid. So this is something you set on each
BB client or extension script. You actually cannot set it anywhere for
the network tests performed by bbtest-net (I just checked and was a
bit surprised that I had not provided some way of changing this).

I think I found a loophole that may cause problems in certain circumstances:  Say I get a red alert for something, give an estimate of 120 mins to fix it, and the host goes purple 45 mins later (i.e. it crashes), before the ack clears.  The ack stays in place from the red state and I won't get a page for the red -> purple transition until the 120 mins have passed and paging resumes (presumably because the ack wasn't cleared, since the status never went green before going purple).  This could be bad news if a system crashes while the support tech is busy with other things, or if a system is brought back online after a purple status and returns to something non-green (e.g. disk is the only thing that is monitored on the system, and it immediately goes to red after boot up and stays that way for a while).

There are lots of ways you can outsmart the system. And you needn't
have a purple status in-between:

1) Disk fills up and goes red
2) Clueless admin ack's the disk alert for 60 minutes, then reboots
   the server because that "usually fixes things"
3) Disk stays red and no alerts go out until an hour has passed

In such cases there is little Hobbit can do. When you ack an alert,
you take over the responsibility for that status for the time the ack
is valid. If you "fix" something without checking that it actually did
solve the problem, you're asking for trouble.
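
The behaviour discussed here can be made concrete with a toy model. It is not Hobbit's alerting code, only a sketch of the rule as described in this thread: an ack suppresses pages until it expires or until a green status arrives. The function name and the event format are illustrative assumptions, and repeat pages for an unchanged status are ignored.

```python
from typing import List, Tuple

Event = Tuple[int, str]  # (minute, new color)


def pages_sent(events: List[Event], ack_at: int,
               ack_minutes: int) -> List[Event]:
    """Return the status changes that would page someone.

    events must be in time order. The ack starts at minute ack_at and
    lasts ack_minutes, unless a green status clears it earlier.
    """
    paged = []
    ack_cleared = False
    for minute, color in events:
        if color == "green":
            if minute >= ack_at:
                ack_cleared = True   # an OK status clears the ack
            continue                 # green itself never pages
        acked = (not ack_cleared) and ack_at <= minute < ack_at + ack_minutes
        if not acked:
            paged.append((minute, color))
    return paged
```

With a red status at minute 0, an ack at minute 5 for 120 minutes and a crash to purple at minute 45, `pages_sent([(0, "red"), (45, "purple")], 5, 120)` contains only the initial red event: the purple transition falls inside the ack window and is silent.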

If you really want it, it's not a big problem to implement a
"de-acknowledge" function. It might even be worthwhile for reporting
purposes, to keep track of how much time your admins are using on
troubleshooting. I'm open to suggestions.


Regards,
Henrik

list Tom Georgoulias · Tue, 01 Mar 2005 16:24:55 -0500 ·

▸ quoted from Henrik Størner

Henrik Stoerner wrote:

It's the program that generates the status message, that also
determines how long it is valid. So this is something you set on each
BB client or extension script.

OK, that is different than BB, which only needed to have the PURPLEDELAY set on the server side, in bbdef-server.sh.

▸ quoted from Henrik Størner

In such cases there is little Hobbit can do. When you ack an alert,
you take over the responsibility for that status for the time the ack
is valid. If you "fix" something without checking that it actually did
solve the problem, you're asking for trouble.

I've been thinking about this a bit and I cannot see a clean, easy way to solve it either.  Having an ack clear each time the status changes could be rather annoying, and a complicated set of if/then conditions is bad too.  So I'm voting for leaving it as is for now.  I trust our team to do the right thing and we generally strive to keep things in the green anyway.  :)

▸ quoted from Henrik Størner

If you really want it, it's not a big problem to implement a
"de-acknowledge" function. It might even be worthwhile for reporting
purposes, to keep track of how much time your admins are using on
troubleshooting. I'm open to suggestions.

I can see this being helpful in cases where I'd like to wipe out all the various acks for whatever reason and return a system to its normal, paging self, but those situations are quite uncommon.  If it's easy to implement, I wouldn't mind having it.

Tom

list Henrik Størner · Tue, 1 Mar 2005 22:51:11 +0100 ·

▸ quoted from Tom Georgoulias

On Tue, Mar 01, 2005 at 04:24:55PM -0500, Tom Georgoulias wrote:

Henrik Stoerner wrote:

It's the program that generates the status message, that also
determines how long it is valid. So this is something you set on each
BB client or extension script.

OK, that is different than BB, which only needed to have the PURPLEDELAY 
set on the server side, in bbdef-server.sh.

No, this actually works exactly like in BB. PURPLEDELAY in BB only
determines the interval between updates of a purple status *after*
it has gone purple; it doesn't determine how long to wait before a
normal status changes to purple.

That's why when you have scripts that run once an hour, you need to send
in the status beginning with "status+65 ..." or it will go purple
before the next planned update.
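
A script that runs once an hour could build its message along these lines. This is a minimal sketch: the helper names, the example host and the test name are made up for illustration, and port 1984 is assumed to be where the server listens.

```python
import socket


def status_message(host: str, test: str, color: str, text: str,
                   lifetime_minutes: int = 0) -> str:
    """Build a status message, optionally with a lifetime in minutes."""
    prefix = f"status+{lifetime_minutes}" if lifetime_minutes else "status"
    # BB convention: dots in the hostname are sent as commas.
    return f"{prefix} {host.replace('.', ',')}.{test} {color} {text}"


def send(message: str, server: str, port: int = 1984) -> None:
    """Send one message to the display server over TCP."""
    with socket.create_connection((server, port), timeout=10) as conn:
        conn.sendall(message.encode())
```

An hourly job would then call something like `send(status_message("db1.example.com", "backup", "green", "backup ok", 65), "hobbit.example.com")`, giving the status five minutes of slack before it can go purple.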

▸ quoted from Tom Georgoulias

In such cases there is little Hobbit can do. When you ack an alert,
you take over the responsibility for that status for the time the ack
is valid. If you "fix" something without checking that it actually did
solve the problem, you're asking for trouble.

I've been thinking about this a bit and I cannot see a clean, easy way 
to solve it either.

Well, we agree then :-)

▸ quoted from Tom Georgoulias

If you really want it, it's not a big problem to implement a
"de-acknowledge" function. It might even be worthwhile for reporting
purposes, to keep track of how much time your admins are using on
troubleshooting. I'm open to suggestions.

I can see this being helpful in cases where I'd like to wipe out all the 
various acks for whatever reason and return a system to its normal, 
paging self, but those situations are quite uncommon.  If it's easy to 
implement, I wouldn't mind having it.

I knew you wouldn't :-))

Henrik
