only alert if X number of hosts are already in error
list Bruce Lysik
Hi,

So while discussing monitoring with my coworkers, the most common statement is: I don't really care if one server in a certain pool goes into error. I only want to get paged if 5 or 10 in that same pool do.

I've been trying to figure out how to make this work in hobbit. I suppose a paging script which keeps track of hosts in error and only pages if above that number would be possible. But would this be something more easily done within hobbit itself? Does anyone think this would be useful?

--
Bruce Z. Lysik <user-4e63a10f8934@xymon.invalid>
Operations Engineer

The information contained in this message (including any attachments) may be confidential. This message (including any attachments) is intended to be read only by the recipient(s) to whom it is addressed. If the reader of this message is not the intended recipient, you are on notice that any distribution of this message, in any form, is strictly prohibited. If you have received this message in error, please immediately notify the sender and/or Shutterfly by telephone at (XXX) XXX-XXXX and delete or destroy any copy of this message (including any attachments).
list Henrik Størner
On Thu, Jun 16, 2005 at 01:31:32PM -0700, Bruce Lysik wrote:
> So while discussing monitoring with my coworkers, the most common statement is: I don't really care if one server in a certain pool goes into error. I only want to get paged if 5 or 10 in that same pool do. I've been trying to figure out how to make this work in hobbit. I suppose a paging script which keeps track of hosts in error and only pages if above that number would be possible. But would this be something more easily done within hobbit itself? Does anyone think this would be useful?

My best suggestion would be to use the bbcombotest tool to define a pseudo "host" with the combined status of your host "pool".

E.g. if you're monitoring http on 5 hosts, you could define a combination test like this:

    Pool1.http=(hostA.http+hostB.http+hostC.http+hostD.http+hostE.http)>3

That would give you a red alert if 3 or fewer hosts in the pool were green. And you could then trigger an alert based on that test result.

Regards,
Henrik
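To make the threshold arithmetic concrete, here is a minimal Python sketch of how an expression like the one above would evaluate. It is only an illustration, not bbcombotest's own code: it assumes a green test counts as 1 and any other color as 0 (the real tool may map other colors differently), and the function name `pool_status` and the sample colors are made up for the example.

```python
def pool_status(colors, must_exceed=3):
    """Return "green" when more than `must_exceed` tests are green, else "red"."""
    # Assumption: green counts as 1, every other color counts as 0.
    green = sum(1 for color in colors.values() if color == "green")
    return "green" if green > must_exceed else "red"


# Hypothetical snapshot of the five http tests in the pool.
pool = {
    "hostA.http": "green",
    "hostB.http": "green",
    "hostC.http": "red",
    "hostD.http": "green",
    "hostE.http": "red",
}
print(pool_status(pool))  # 3 green is not more than 3, so this prints "red"
```

With four or five green hosts the same function returns "green", which matches the "red if 3 or fewer are green" reading of the `>3` comparison.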
list Bruce Lysik
> My best suggestion would be to use the bbcombotest tool to define a pseudo "host" with the combined status of your host "pool". E.g. if you're monitoring http on 5 hosts, you could define a combination test like this:
>
>     Pool1.http=(hostA.http+hostB.http+hostC.http+hostD.http+hostE.http)>3
>
> That would give you a red alert if 3 or fewer hosts in the pool were green. And you could then trigger an alert based on that test result.

Pretty unwieldy when you have large pools of servers, however.

I just started writing a smart paging script which will keep track of downed hosts and decide whether or not to page. One question I have so far is: Does hobbit wait for an alerting script to return before continuing to evaluate other rules?
--
Bruce Z. Lysik <user-4e63a10f8934@xymon.invalid>
Operations Engineer
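The paging-script idea described above could look roughly like the following Python sketch. It is not the script from the thread. It assumes the alert script is handed the host name and recovery flag through the environment variables `BBHOSTNAME` and `RECOVERED`; the state-file location, the threshold of 5, and the function names are all hypothetical choices for illustration.

```python
import json
import os
import tempfile

THRESHOLD = 5  # hypothetical: page once this many hosts in the pool are down


def load_state(path):
    """Read the set of hosts currently recorded as down (empty if no file yet)."""
    try:
        with open(path) as fh:
            return set(json.load(fh))
    except (FileNotFoundError, ValueError):
        return set()


def save_state(path, down):
    """Persist the set of down hosts between invocations of the script."""
    with open(path, "w") as fh:
        json.dump(sorted(down), fh)


def handle_alert(env, state_path, threshold=THRESHOLD):
    """Update the down-host set; return True when a page should be sent."""
    host = env["BBHOSTNAME"]                        # assumed variable name
    recovered = env.get("RECOVERED", "0") == "1"    # assumed variable name
    down = load_state(state_path)
    if recovered:
        down.discard(host)
    else:
        down.add(host)
    save_state(state_path, down)
    # Page only for failures, and only once enough hosts are down.
    return (not recovered) and len(down) >= threshold


if __name__ == "__main__":
    state_file = os.path.join(tempfile.gettempdir(), "pool_down_hosts.json")
    if "BBHOSTNAME" in os.environ and handle_alert(os.environ, state_file):
        print("PAGE: pool threshold reached")  # replace with the real pager call
```

Note that this version pages again for every additional failure once the threshold is reached; paging only on the first crossing would be an equally reasonable design.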
list Henrik Størner
On Thu, Jun 16, 2005 at 02:28:53PM -0700, Bruce Lysik wrote:
>> My best suggestion would be to use the bbcombotest tool to define a pseudo "host" with the combined status of your host "pool". E.g. if you're monitoring http on 5 hosts, you could define a combination test like this:
>>
>>     Pool1.http=(hostA.http+hostB.http+hostC.http+hostD.http+hostE.http)>3
>>
>> That would give you a red alert if 3 or fewer hosts in the pool were green. And you could then trigger an alert based on that test result.
>
> Pretty unwieldy when you have large pools of servers, however.

Could be, yes.

> I just started writing a smart paging script which will keep track of downed hosts and decide whether or not to page.
I'm interested to know if this kind of alerting is generally useful.
I suspect it might be ... if so, then we should devise a way of defining
such alerts directly in Hobbit instead of forcing you to come up with
scripts that work around this.
Perhaps one solution could be to implement a new kind of rule for the
hobbit-alerts file. Currently all of the rules are matched against a
specific host+test combination; we could define a type of rule that
could be matched against all of the host+test statuses that are in an
alerting stage, and then have the rule trigger based on some criteria
for how many matches we get.
Something like
HOST=%(www.*).foo.com TEST=http COLOR=red COUNT>=5
MAIL user-3aaf2ac8399f@xymon.invalid
The "COUNT>=5" would then cause this rule to trigger only if there
were 5 or more hosts named www.*.foo.com, whose http tests are red.
You could even combine this with other criteria, say have a threshold of
5 during the daytime, and 10 during off-hours.
I can foresee a problem in handling recovery notifications for this kind
of alert, but that's something I'll have to think about.
Would that be useful?
> One question I have so far is: Does hobbit wait for an alerting script to return before continuing to evaluate other rules?

Paging scripts are serialized, yes - Hobbit will wait for a paging script to complete before continuing down the list of alert rules.

Regards,
Henrik
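The proposed count-based rule can be sketched in Python as follows. This is only an illustration of the idea in the message, not Hobbit code, and `COUNT>=5` is proposed syntax rather than an existing option. The function names, the shape of the `statuses` dictionary, and the 08:00-18:00 definition of "daytime" are assumptions made for the example; the host regex is copied from the proposal and anchored with `fullmatch` as a choice made here.

```python
import re
from datetime import time


def count_matches(statuses, host_pattern, test, color):
    """Count statuses whose host matches the regex and whose test and color match."""
    rx = re.compile(host_pattern)
    return sum(
        1
        for (host, tst), col in statuses.items()
        if tst == test and col == color and rx.fullmatch(host)
    )


def threshold_for(now, day=5, night=10, day_start=time(8, 0), day_end=time(18, 0)):
    """Pick the daytime or off-hours threshold (the hours are hypothetical)."""
    return day if day_start <= now < day_end else night


def rule_triggers(statuses, now):
    """Evaluate the proposed rule: red http tests on www*.foo.com, counted."""
    n = count_matches(statuses, r"(www.*).foo.com", "http", "red")
    return n >= threshold_for(now)


# Hypothetical snapshot: six red web servers, plus two statuses that must not count.
statuses = {("www%d.foo.com" % i, "http"): "red" for i in range(6)}
statuses[("db1.foo.com", "http")] = "red"      # host name does not match
statuses[("www9.foo.com", "http")] = "green"   # color does not match

print(rule_triggers(statuses, time(12, 0)))  # daytime threshold of 5
print(rule_triggers(statuses, time(23, 0)))  # off-hours threshold of 10
```

With six matching red hosts, the rule fires against the daytime threshold of 5 but stays quiet against the off-hours threshold of 10.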
list Bruce Lysik
> Something like
>
>     HOST=%(www.*).foo.com TEST=http COLOR=red COUNT>=5
>     MAIL user-3aaf2ac8399f@xymon.invalid
>
> The "COUNT>=5" would then cause this rule to trigger only if there were 5 or more hosts named www.*.foo.com, whose http tests are red. You could even combine this with other criteria, say have a threshold of 5 during the daytime, and 10 during off-hours.
> I can foresee a problem in handling recovery-notifications for this kind of alerts, but that's something I'll have to think about.
> Would that be useful ?
That would seem extremely useful. I was thinking about notifications as well, and my first thought was just to notify on every recovery (if you've selected RECOVERED). That way you would know if a single host kept doing a down/up/down/up/down/up type of thing. This would work alright in my environment since hosts that go into error for any length of time tend not to fix themselves anyways. :)

--
Bruce Z. Lysik <user-4e63a10f8934@xymon.invalid>
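The "notify on every recovery" idea can be traced with a small Python sketch. It is a hedged illustration of the policy described above, not an existing Hobbit feature; the event format, the function name `notifications`, and the threshold value are invented for the example.

```python
def notifications(events, threshold):
    """Replay (host, color) events and list the notices that would be sent."""
    down = set()
    sent = []
    for host, color in events:
        if color == "red":
            down.add(host)
            # Failure pages are held back until enough hosts are down.
            if len(down) >= threshold:
                sent.append(("alert", host))
        elif host in down:
            down.discard(host)
            # Every recovery is reported, regardless of the count.
            sent.append(("recovered", host))
    return sent


# Hypothetical flapping host: the pool never reaches the threshold of 5.
flapping = [("www1", "red"), ("www1", "green"), ("www1", "red"), ("www1", "green")]
print(notifications(flapping, threshold=5))
# prints [('recovered', 'www1'), ('recovered', 'www1')]
```

The flapping host never triggers a page, yet each of its recoveries is still reported, which is the visibility the message is after.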
list Daniel J McDonald
On Fri, 2005-06-17 at 08:01 +0200, Henrik Stoerner wrote:
> Something like
>
>     HOST=%(www.*).foo.com TEST=http COLOR=red COUNT>=5
>     MAIL user-3aaf2ac8399f@xymon.invalid
>
> The "COUNT>=5" would then cause this rule to trigger only if there
> were 5 or more hosts named www.*.foo.com, whose http tests are red.
> You could even combine this with other criteria, say have a threshold of
> 5 during the daytime, and 10 during off-hours.
> I can foresee a problem in handling recovery-notifications for this kind
> of alerts, but that's something I'll have to think about.
> Would that be useful ?

The main place I would use it would be NTP alerts. If one router loses NTP, I'm not terribly worried. If 10-20 of them all fail at once then I know there is something really bad happening... Maybe both GPS clocks lost sync and all 4 cesium backups failed, or ntp locked up on a core router and I need to make fewer down-stream nodes dependent on that one.

I would also consider using it for purple alerts. I don't want individual purples for most of my stuff, but if there are a lot of them (>100) then I know I killed mrtg and I should page on that.

--
Daniel J McDonald, CCIE # 2495, CNX
Austin Energy
user-290ce4e24e19@xymon.invalid
list Winn Beutler
I would like to try the combotest but am not sure what needs to be done besides defining a combotest in combotest.cfg. I have looked through the docs but can't find a complete example.

Once there is a test named 'Pool1.http' defined in combotest.cfg, what needs to be present in bb-hosts and hobbit-alerts.cfg to receive an email?

Many thanks for the help. Everyone here is impressed with HOBBIT!

Winn

-----Original Message-----
From: Henrik Stoerner [mailto:user-ce4a2c883f75@xymon.invalid]
Sent: Thursday, June 16, 2005 3:16 PM
To: user-ae9b8668bcde@xymon.invalid
Subject: Re: [hobbit] only alert if X number of hosts are already in error
E.g. if you're monitoring http on 5 hosts, you could define a
combination test like this:
Pool1.http=(hostA.http+hostB.http+hostC.http+hostD.http+hostE.http)>3
That would give you a red alert if 3 or fewer hosts in the pool were
green. And you could then trigger an alert based on that test result.
Regards,
Henrik
SPECIAL NOTICE
All information transmitted hereby is intended only for the use of the
addressee(s) named above and may contain confidential and privileged
information. Any unauthorized review, use, disclosure or distribution
of confidential and privileged information is prohibited. If the reader
of this message is not the intended recipient(s) or the employee or agent
responsible for delivering the message to the intended recipient, you are
hereby notified that you must not read this transmission and that disclosure,
copying, printing, distribution or use of any of the information contained
in or attached to this transmission is STRICTLY PROHIBITED.
Anyone who receives confidential and privileged information in error should
notify us immediately by telephone and mail the original message to us at
the above address and destroy all copies. To the extent any portion of this
communication contains public information, no such restrictions apply to that
information. (gate01)