Windows Cluster Monitor
list Al Jeffcoat
Hello All,

Our new director would like to monitor EVERYTHING from BB/Hobbit. We have been monitoring our UNIX and storage devices for a few years now. Now that I have Windows servers to monitor, I'd like to know if anyone has a decent way to monitor Windows clusters. I had a thought to monitor by pinging each node in the cluster, and the cluster name, i.e.:

Nodea - Application Offline
Nodeb - Application Online
Clustername - Application Responding @ This address

How would you set up resource (process) monitoring for an Active / Passive cluster? Or an Active / Active cluster?

This is in response to a problem that has been occurring on a new 24x7 Windows server blue screening daily, in spite of all the "fixes" that have occurred to solve the problem (more hardware, patches, reload OS, etc.).

We'll soon be moving the application to an AIX server, but I'll have the same questions on an HACMP cluster at that point :)

TIA

Al Jeffcoat
IBM Certified Support Specialist, AIX
Enterprise Storage Administrator
System Programmer II
(321)843-1051
user-b34a8ad6e24c@xymon.invalid

This e-mail message and any attached files are confidential and are intended solely for the use of the addressee(s) named above. If you are not the intended recipient, any review, use, or distribution of this e-mail message and any attached files is strictly prohibited. This communication may contain material protected by Federal privacy regulations, attorney-client work product, or other privileges. If you have received this confidential communication in error, please notify the sender immediately by reply e-mail message and permanently delete the original message. To reply to our email administrator directly, send an email to: user-ecde3bbc361d@xymon.invalid .
If this e-mail message concerns a contract matter, be advised that no employee or agent is authorized to conclude any binding agreement on behalf of Orlando Regional Healthcare by e-mail without express written confirmation by an officer of the corporation. Any views or opinions presented in this e-mail are solely those of the author and do not necessarily represent those of Orlando Regional Healthcare.
list Oliver Bassett
My current solution for this is pretty messy, but it works for me. In the below I am assuming that you are going to have the Windows BB client installed on all nodes of the cluster.

Active/Passive: Since I want to know when the services fail over, I check that the processes/services are running on the active node, and that the cluster service is running on both nodes. This lets me know when it fails over so I can investigate immediately. I also run a specific application check against the clustered application on the cluster address to ensure the application itself is up and running. If it isn't, then there is a really serious problem.

Active/Active: Since I don't do this I can't be entirely sure, but I would think this is easier to monitor: just ensure that all the required services are running on all nodes, including the cluster service itself.

I hope this is of some help.

Regards
Oliver Bassett
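The active/passive checks described above could be reduced to a single status color along the following lines. This is only a sketch: the `NodeState` structure and the color policy (red when no node runs the application, yellow when the cluster service is down somewhere or more than one node claims the application) are assumptions for illustration, not something taken from the BB client.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeState:
    """Observed state of one cluster node (hypothetical structure)."""
    name: str
    cluster_service_up: bool
    app_running: bool


def active_passive_color(nodes: List[NodeState]) -> str:
    """Reduce per-node observations to one status color.

    Assumed policy: red if no node runs the application, yellow if the
    cluster service is down on any node or more than one node claims the
    application, green otherwise.
    """
    active = [n for n in nodes if n.app_running]
    if not active:
        return "red"
    if len(active) > 1 or not all(n.cluster_service_up for n in nodes):
        return "yellow"
    return "green"
```

The application check against the cluster address would still be a separate test, since it answers a different question (is the service reachable at all) from the per-node checks.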
list Kevin Grady
Use WMI to query the MSCluster_Resource groups and you can grab the status of each resource and then report back to hobbit. Here's a link to some examples from MS. http://www.microsoft.com/technet/scriptcenter/scripts/network/cluster/default.mspx
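As a rough sketch of the "report back to hobbit" half of this idea, the function below turns a list of resource states into a BB-style status message. It assumes the states have already been fetched through WMI by some other means. The numeric state codes are my reading of the `MSCluster_Resource.State` property and should be checked against Microsoft's documentation; the test name `cluster` and the choice of red for anything not Online are assumptions.

```python
# Assumed MSCluster_Resource.State codes; verify against Microsoft's docs.
ONLINE = 2
STATE_NAMES = {
    -1: "Unknown",
    0: "Inherited",
    1: "Initializing",
    2: "Online",
    3: "Offline",
    4: "Failed",
    128: "Pending",
    129: "Online Pending",
    130: "Offline Pending",
}


def cluster_status(host, resources, test="cluster"):
    """Build a status message from (resource_name, state_code) pairs.

    The message follows the BB convention "status HOST.TEST COLOR text",
    with dots in the hostname replaced by commas.
    """
    color = "green"
    lines = []
    for name, state in resources:
        if state != ONLINE:
            color = "red"  # assumed policy: any non-Online resource is red
        label = STATE_NAMES.get(state, "state {}".format(state))
        lines.append("{}: {}".format(name, label))
    header = "status {}.{} {} cluster resources".format(
        host.replace(".", ","), test, color
    )
    return "\n".join([header] + lines)
```

The resulting string is what a client-side script would hand to the `bb` command for delivery to the server.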
list Kevin Grady
I'll post something this weekend as I have been working on this for a SQL cluster we have running.
list Al Jeffcoat
That'd be great. We are demoing this for the new director on Monday, so that would really be nice. Thanks very much. If not, we'll cobble something together based on the responses we've gotten so far. The UNIX stuff is easy; it's been in place forever. The Windows stuff is new to me as far as a production environment goes (I'm a home dabbler in Windows). And our Windows admins have resisted this every time we bring it up, so I'm also fighting that resistance :). In any case, thanks for the reply...
list Kevin Grady
Check out bb-mscs on deadcat. It does most of what I am looking for and may be enough for your demo. You'd need to adjust the script if you want red when a resource goes offline. Right now it will turn yellow.
list Tom Georgoulias
Couple of things:

1. This may be a dumb question, but I'm gonna ask it anyway. Should I expect to be able to take the ACK code in purple pages and use it with the acknowledge alert feature? I've been experimenting with turning off Big Brother on a client and causing purple pages, but using the ACK code in the emails does not prevent purple pages from continuing, nor does my explanation get recorded into the acklog. ACK works for red & yellow, though.

2. I could've sworn that I had read about the ability to merge all the purple alerts into a single email, or behavior that did that automatically, but I can't seem to find it in the docs. Is that possible? Where can I read up on how to use it? I'd love to get a single alert if a client goes purple that can use a single ACK code to disable pages.

Tom
list Henrik Størner
▸
On Fri, Feb 25, 2005 at 01:13:50PM -0500, Tom Georgoulias wrote:
Couple of things: 1. This may be a dumb question, but I'm gonna ask it anyway. Should I expect to be able to take the ACK code in purple pages and use it with the acknowledge alert feature?
Yes, that's the idea.
▸
I've been experimenting with turning off Big Brother on a client and causing purple pages, but using the ACK code in the emails does not prevent purple pages from continuing, nor does my explanation get recorded into the acklog. ACK works for red & yellow, though.
Hmm - odd. I'll try it out later tonight.
2. I could've sworn that I had read about the ability to merge all the purple alerts into a single email or behavior that did that automatically, but I can't seem to find it in the docs. Is that possible?
No. I'd like to do some more general merging of alerts - not just purple ones - but that'll be later.
▸
Where can I read up on how to use it? I'd love to get a single alert if a client goes purple that can use a single ACK code to disable pages.
OK, I'll let you in on a secret: If you send an acknowledge with minus-ACKCODE, it will work as an ack for all current alerts on that host. Henrik
list Tom Georgoulias
▸
Henrik Stoerner wrote:
1. This may be a dumb question, but I'm gonna ask it anyway. Should I expect to be able to take the ACK code in purple pages and use it with the acknowledge alert feature?
Yes, that's the idea.
OK. It seems that when I acknowledge a red/yellow alert, the trend chart is not updated during the acknowledgment time period, but resumes after the time period is over (without using any of the data that would've been collected during that time). Is that also expected? Also, how can I unacknowledge a host, if I fix a problem before the time that I estimated it would take?
▸
No. I'd like to do some more general merging of alerts - not just purple ones - but that'll be later.
OK, that explains why I couldn't find anything about it in the docs.
▸
OK, I'll let you in on a secret: If you send an acknowledge with minus-ACKCODE, it will work as an ack for all current alerts on that host.
:) Sounds good. Tom
list Tom Georgoulias
▸
Henrik Stoerner wrote:
I've been experimenting with turning off Big Brother on a client and causing purple pages, but using the ACK code in the emails does not prevent purple pages from continuing, nor does my explanation get recorded into the acklog. ACK works for red & yellow, though.
Hmm - odd. I'll try it out later tonight.
A follow-up: I restarted Hobbit and repeated the experiment, turning off bbc on my client and waiting for 30 mins until it was put into a purple status. Then I took one of the ACK codes from my purple alert emails and it worked just as expected, disabling paging for the duration I entered and displaying the text message that I entered. I also tested putting a "-" in front of the ACK code and it acknowledged all the purples for the host, so that's a neat little trick. I cannot explain why this didn't work the first times that I tried it, but I swear it didn't. Tom
list Henrik Størner
▸
On Fri, Feb 25, 2005 at 02:03:50PM -0500, Tom Georgoulias wrote:
Henrik Stoerner wrote:
1. This may be a dumb question, but I'm gonna ask it anyway. Should I expect to be able to take the ACK code in purple pages and use it with the acknowledge alert feature?
Yes, that's the idea.
I tried it now, and ack'ing a purple status seems to work ok. I'll see if it stops sending me alerts.
▸
It seems that when I acknowledge a red/yellow alert, the trend chart is not updated during the acknowledgment time period, but resumes after the time period is over (without using any of the data that would've been collected during that time). Is that also expected?
Ack'ing should not have any influence on whether data is collected or not. What matters is if there are any updates - if the host is down, you obviously won't be getting any new reports, and then the graphs won't update.
▸
Also, how can I unacknowledge a host, if I fix a problem before the time that I estimated it would take?
You cannot, but the acknowledge should clear automatically as soon as an OK status arrives. Regards, Henrik
list Tom Georgoulias
▸
Henrik Stoerner wrote:
I tried it now, and ack'ing a purple status seems to work ok. I'll see if it stops sending me alerts.
I am able to ack as well, so that works. While we're on the topic of purple status messages... Hobbit is configured to turn a host purple if it hasn't heard from it in 30 mins. I want mine to go purple after 15, so I changed the PURPLEDELAY from "30" to "15" in hobbitserver.cfg, but that doesn't seem to make a difference. What else needs to be changed?
▸
Ack'ing should not have any influence on whether data is collected or not. What matters is if there are any updates - if the host is down, you obviously won't be getting any new reports, and then the graphs won't update.
In the cases where I was testing and observed the behavior above (a 97% full disk partition), the client was online and sending data but the graphs had stalled. This doesn't seem to be happening on RC4, so something was either fixed or the fresh install on my end helped.
▸
Also, how can I unacknowledge a host, if I fix a problem before the time that I estimated it would take?
You cannot, but the acknowledge should clear automatically as soon as an OK status arrives.
I think I found a loophole that may cause problems in certain circumstances: Say I get a red alert for something, give an estimate of 120 mins to fix it, and the host goes purple 45 mins later (i.e. it crashes), before the ack clears. That ack stays in place from the red state and I won't get a page for the red -> purple transition until after the 120 mins have passed and paging resumes (presumably because the ack wasn't cleared, since the status never went green before going purple). This could be bad news if I have a system that crashes when the support tech is busy with other things, or if a system is brought back online after a purple status and returns to something non-green (e.g. disk is the only thing that is monitored on the system, and it immediately goes to red after boot up and stays that way for a while). Tom
list Henrik Størner
▸
On Mon, Feb 28, 2005 at 01:28:18PM -0500, Tom Georgoulias wrote:
While we're on the topic of purple status messages... Hobbit is configured to turn a host purple if it hasn't heard from it in 30 mins. I want mine to go purple after 15, so I changed the PURPLEDELAY from "30" to "15" in hobbitserver.cfg, but that doesn't seem to make a difference. What else needs to be changed?
It's the program that generates the status message, that also determines how long it is valid. So this is something you set on each BB client or extension script. You actually cannot set it anywhere for the network tests performed by bbtest-net (I just checked and was a bit surprised that I had not provided some way of changing this).
I think I found a loophole that may cause problems in certain circumstances: Say I get a red alert for something, give an estimate of 120 mins to fix it, and the host goes purple 45 mins later (i.e. it crashes), before the ack clears. That ack stays in place from the red state and I won't get a page for the red -> purple transition until after the 120 mins have passed and paging resumes (presumably because the ack wasn't cleared, since the status never went green before going purple). This could be bad news if I have a system that crashes when the support tech is busy with other things, or if a system is brought back online after a purple status and returns to something non-green (e.g. disk is the only thing that is monitored on the system, and it immediately goes to red after boot up and stays that way for a while).
There are lots of ways you can outsmart the system. And you needn't have a purple status in-between:

1) Disk fills up and goes red
2) Clueless admin ack's the disk alert for 60 minutes, then reboots the server because that "usually fixes things"
3) Disk stays red and no alerts go out until an hour has passed

In such cases there is little Hobbit can do. When you ack an alert, you take over the responsibility for that status for the time the ack is valid. If you "fix" something without checking that it actually did solve the problem, you're asking for trouble.

If you really want it, it's not a big problem to implement a "de-acknowledge" function. It might even be worthwhile for reporting purposes, to keep track of how much time your admins are using on troubleshooting. I'm open to suggestions.

Regards, Henrik
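The scenarios discussed here can be replayed with a toy model of the behaviour as described in this thread: an ack suppresses pages until it expires, and a green status clears it early. This is an illustrative sketch only, not Hobbit's actual alert logic; the function name and the timeline format are made up for the example, and repeat pages for an unchanged status are not modelled.

```python
def paged_minutes(timeline):
    """Return the minutes at which a page would go out in this toy model.

    timeline: list of (minute, action, value) tuples sorted by minute.
      action "status": value is the new color
      action "ack":    value is the ack duration in minutes
    """
    ack_until = -1  # no ack active
    paged = []
    for minute, action, value in timeline:
        if action == "ack":
            ack_until = minute + value
        elif value == "green":
            ack_until = -1  # an OK status clears the ack
        elif minute >= ack_until:
            paged.append(minute)  # non-green change with no active ack
    return paged


# Red alert acked for 120 minutes, host goes purple 45 minutes in:
crash_during_ack = [(0, "status", "red"), (5, "ack", 120), (45, "status", "purple")]

# Same, but the status went green before the crash:
green_before_crash = [(0, "status", "red"), (5, "ack", 120),
                      (30, "status", "green"), (45, "status", "purple")]
```

In the first timeline only the initial red is paged; in the second, the green status clears the ack and the purple transition is paged as well.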
list Tom Georgoulias
▸
Henrik Stoerner wrote:
It's the program that generates the status message, that also determines how long it is valid. So this is something you set on each BB client or extension script.
OK, that is different than BB, which only needed to have the PURPLEDELAY set on the server side, in bbdef-server.sh.
▸
In such cases there is little Hobbit can do. When you ack an alert, you take over the responsibility for that status for the time the ack is valid. If you "fix" something without checking that it actually did solve the problem, you're asking for trouble.
I've been thinking about this a bit and I cannot see a clean, easy way to solve it either. Having an ack clear each time the status changes could be rather annoying, and a complicated set of if/then conditions is bad too. So I'm voting for leaving it as is for now. I trust our team to do the right thing and we generally strive to keep things in the green anyway. :)
▸
If you really want it, it's not a big problem to implement a "de-acknowledge" function. It might even be worthwhile for reporting purposes, to keep track of how much time your admins are using on troubleshooting. I'm open to suggestions.
I can see this being helpful in cases where I'd like to wipe out all the various acks for whatever reason and return a system to its normal, paging self, but those situations are quite uncommon. If it's easy to implement, I wouldn't mind having it. Tom
list Henrik Størner
▸
On Tue, Mar 01, 2005 at 04:24:55PM -0500, Tom Georgoulias wrote:
Henrik Stoerner wrote:
It's the program that generates the status message, that also determines how long it is valid. So this is something you set on each BB client or extension script.
OK, that is different than BB, which only needed to have the PURPLEDELAY set on the server side, in bbdef-server.sh.
No, this actually works exactly like in BB. PURPLEDELAY in BB only determines the interval between updates of a purple status *after* it has gone purple; it doesn't determine how long to wait before a normal status changes to purple. That's why when you have scripts that run once an hour, you need to send in the status beginning with "status+65 ..." or it will go purple before the next planned update.
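For an extension script, the "status+65" form could be generated like this. It is a minimal sketch: the function names, the five-minute margin, and the example host and test names are assumptions; only the `status+LIFETIME` prefix and the comma-for-dot hostname convention come from BB/Hobbit.

```python
def lifetime_for_interval(interval_minutes, margin_minutes=5):
    """Validity period for a test that runs every interval_minutes.

    The margin (an assumed default) keeps the status from going purple
    if the next run is slightly late.
    """
    return interval_minutes + margin_minutes


def status_message(host, test, color, text, lifetime_minutes=None):
    """Build "status[+LIFETIME] HOST.TEST COLOR text".

    Dots in the hostname are replaced by commas, as BB expects.
    """
    prefix = "status"
    if lifetime_minutes is not None:
        prefix = "status+{}".format(lifetime_minutes)
    return "{} {}.{} {} {}".format(
        prefix, host.replace(".", ","), test, color, text
    )


# An hourly script would send something like:
hourly = status_message("db1.example.com", "backup", "green",
                        "Backup OK", lifetime_for_interval(60))
```

Without the lifetime argument the message falls back to the plain `status` form, which leaves the default validity period in effect.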
▸
In such cases there is little Hobbit can do. When you ack an alert, you take over the responsibility for that status for the time the ack is valid. If you "fix" something without checking that it actually did solve the problem, you're asking for trouble.
I've been thinking about this a bit and I cannot see a clean, easy way to solve it either.
Well, we agree then :-)
▸
If you really want it, it's not a big problem to implement a "de-acknowledge" function. It might even be worthwhile for reporting purposes, to keep track of how much time your admins are using on troubleshooting. I'm open to suggestions.
I can see this being helpful in cases where I'd like to wipe out all the various acks for whatever reason and return a system to its normal, paging self, but those situations are quite uncommon. If it's easy to implement, I wouldn't mind having it.
I knew you wouldn't :-)) Henrik