Xymon Mailing List Archive

only alert if X number of hosts are already in error

7 messages in this thread

list Bruce Lysik · Thu, 16 Jun 2005 13:31:32 -0700 ·
Hi,

So while discussing monitoring with my coworkers, the most common statement is:

I don't really care if one server in a certain pool goes into error.  I only want to get paged if 5 or 10 in that same pool do.

I've been trying to figure out how to make this work in hobbit.  I suppose a paging script which keeps track of hosts in error and only pages if above that number would be possible. 

But would this be something more easily done within hobbit itself?  Does anyone think this would be useful?

--
Bruce Z. Lysik  <user-4e63a10f8934@xymon.invalid>
Operations Engineer


The information contained in this message (including any attachments) may be confidential. This message (including any attachments) is intended to be read only by the recipient(s) to whom it is addressed. If the reader of this message is not the intended recipient, you are on notice that any distribution of this message, in any form, is strictly prohibited. If you have received this message in error, please immediately notify the sender and/or Shutterfly by telephone at (XXX) XXX-XXXX and delete or destroy any copy of this message (including any attachments).
list Henrik Størner · Thu, 16 Jun 2005 23:15:56 +0200 ·
quoted from Bruce Lysik
On Thu, Jun 16, 2005 at 01:31:32PM -0700, Bruce Lysik wrote:
So while discussing monitoring with my coworkers, the most common statement is:

I don't really care if one server in a certain pool goes into error.  I only want to get paged if 5 or 10 in that same pool do.

I've been trying to figure out how to make this work in hobbit.  I suppose a paging script which keeps track of hosts in error and only pages if above that number would be possible. 

But would this be something more easily done within hobbit itself?  Does anyone think this would be useful?
My best suggestion would be to use the bbcombotest tool to define
a pseudo "host" with the combined status of your host "pool".

E.g. if you're monitoring http on 5 hosts, you could define a
combination test like this:

Pool1.http=(hostA.http+hostB.http+hostC.http+hostD.http+hostE.http)>3

That would give you a red alert if 3 or fewer hosts in the pool were
green. And you could then trigger an alert based on that test result.
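
The logic of that combination test can be sketched in Python. This is only an illustration of the arithmetic, not how bbcombotest is implemented; it assumes each test contributes 1 when it is green and 0 otherwise, and the function name `pool_color` and the sample statuses are made up for the example.

```python
# Illustrative sketch only: mimics the "(a+b+c+d+e)>3" combination test.
# Assumption: a green test counts as 1, any other color counts as 0.

def pool_color(colors, min_ok_exclusive=3):
    """Return 'green' if more than `min_ok_exclusive` tests are green, else 'red'."""
    ok = sum(1 for color in colors.values() if color == "green")
    return "green" if ok > min_ok_exclusive else "red"


# Hypothetical pool: three of five hosts are green.
pool = {
    "hostA": "green",
    "hostB": "green",
    "hostC": "red",
    "hostD": "green",
    "hostE": "red",
}
print(pool_color(pool))  # prints "red", since 3 is not greater than 3
```

With one failed host the pool would stay green (4 > 3), so the threshold in the expression decides how many failures the pool tolerates before it alerts.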


Regards,
Henrik
list Bruce Lysik · Thu, 16 Jun 2005 14:28:53 -0700 ·
quoted from Henrik Størner
My best suggestion would be to use the bbcombotest tool to define
a pseudo "host" with the combined status of your host "pool".

E.g. if you're monitoring http on 5 hosts, you could define a
combination test like this:

Pool1.http=(hostA.http+hostB.http+hostC.http+hostD.http+hostE.http)>3

That would give you a red alert if 3 or fewer hosts in the pool were
green. And you could then trigger an alert based on that test result.
Pretty unwieldy when you have large pools of servers, however.  
I just started writing a smart paging script which will keep track of downed hosts and decide whether or not to page.  
One question I have so far is: Does hobbit wait for an alerting script to return before continuing to evaluate other rules?  
--
Bruce Z. Lysik  <user-4e63a10f8934@xymon.invalid>
Operations Engineer 

list Henrik Størner · Fri, 17 Jun 2005 08:01:36 +0200 ·
quoted from Bruce Lysik
On Thu, Jun 16, 2005 at 02:28:53PM -0700, Bruce Lysik wrote:
My best suggestion would be to use the bbcombotest tool to define
a pseudo "host" with the combined status of your host "pool".

E.g. if you're monitoring http on 5 hosts, you could define a
combination test like this:

Pool1.http=(hostA.http+hostB.http+hostC.http+hostD.http+hostE.http)>3

That would give you a red alert if 3 or fewer hosts in the pool were
green. And you could then trigger an alert based on that test result.
Pretty unwieldy when you have large pools of servers, however.  
Could be, yes.
quoted from Bruce Lysik
I just started writing a smart paging script which will keep track of 
downed hosts and decide whether or not to page.  
I'm interested to know if this kind of alerting is generally useful.
I suspect it might be ... if so, then we should devise a way of defining
such alerts directly in Hobbit instead of forcing you to come up with
scripts that work around this.

Perhaps one solution could be to implement a new kind of rule for the
hobbit-alerts file. Currently all of the rules are matched against a
specific host+test combination; we could define a type of rule that
could be matched against all of the host+test statuses that are in an 
alerting stage, and then have the rule trigger based on some criteria
for how many matches we get.

Something like

   HOST=%(www.*).foo.com TEST=http COLOR=red COUNT>=5
      MAIL user-3aaf2ac8399f@xymon.invalid

The "COUNT>=5" would then cause this rule to trigger only if there
were 5 or more hosts named www.*.foo.com, whose http tests are red.
You could even combine this with other criteria, say have a threshold of
5 during the daytime, and 10 during off-hours.
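
The proposed counting rule could behave roughly like the Python sketch below. It is a model of the idea only: COUNT is a proposal in this thread, not existing syntax, and the function names, the status dictionary layout, the host pattern, and the daytime hours are all assumptions chosen for the example.

```python
import re
from datetime import time

# Illustrative model of the proposed "COUNT>=N" rule, not Hobbit code.
# `statuses` maps (hostname, testname) to the current color of that test.


def count_rule_triggers(statuses, host_pattern, test, color, threshold):
    """Return True if at least `threshold` matching hosts have `test` in `color`."""
    rx = re.compile(host_pattern)
    matching = [
        host
        for (host, testname), current in statuses.items()
        if testname == test and current == color and rx.fullmatch(host)
    ]
    return len(matching) >= threshold


def threshold_for(now, day=5, night=10, start=time(8, 0), end=time(18, 0)):
    """Pick the daytime or off-hours threshold; the hours are hypothetical."""
    return day if start <= now < end else night


# Hypothetical data: six web servers with a red http test.
statuses = {(f"www{i}.foo.com", "http"): "red" for i in range(6)}
statuses[("db1.foo.com", "http")] = "red"  # does not match the host pattern

pattern = r"www.*\.foo\.com"
print(count_rule_triggers(statuses, pattern, "http", "red", threshold_for(time(12, 0))))
print(count_rule_triggers(statuses, pattern, "http", "red", threshold_for(time(23, 0))))
```

In this sample, six red hosts are enough to trigger during the day (threshold 5) but not during off-hours (threshold 10).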

I can foresee a problem in handling recovery-notifications for this kind
of alerts, but that's something I'll have to think about.

Would that be useful ?
quoted from Bruce Lysik

One question I have so far is: Does hobbit wait for an alerting script 
to return before continuing to evaluate other rules?  
Paging scripts are serialized, yes - Hobbit will wait for a paging
script to complete before continuing down the list of alert rules.
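
Since the scripts run one at a time, a paging script of the kind Bruce describes should do its bookkeeping quickly and exit. A minimal sketch of that bookkeeping in Python follows; the JSON state file, the function names, and the way the host name and recovery flag reach the script are assumptions for illustration, not part of Hobbit.

```python
import json
import os

# Illustrative sketch of a "smart" paging script's bookkeeping.
# Assumption: the caller supplies the host name and whether this call is a
# recovery; how Hobbit passes those to a script is not shown here.


def load_state(path):
    """Load the set of hosts currently in error; empty if no state file exists."""
    if not os.path.exists(path):
        return set()
    with open(path) as fh:
        return set(json.load(fh))


def save_state(path, down_hosts):
    """Persist the set of hosts in error between script invocations."""
    with open(path, "w") as fh:
        json.dump(sorted(down_hosts), fh)


def should_page(down_hosts, host, recovered, threshold):
    """Update `down_hosts` in place and return True if a page should go out."""
    if recovered:
        down_hosts.discard(host)
        return False
    down_hosts.add(host)
    return len(down_hosts) >= threshold


# Hypothetical run: page only once three hosts are down.
down = set()
for name in ["web1", "web2", "web3"]:
    print(name, should_page(down, name, recovered=False, threshold=3))
```

This version pages on every alert once the threshold is reached; suppressing repeat pages would need an extra flag in the saved state.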


Regards,
Henrik
list Bruce Lysik · Fri, 17 Jun 2005 07:52:14 -0700 ·
quoted from Henrik Størner
Something like
   HOST=%(www.*).foo.com TEST=http COLOR=red COUNT>=5
       MAIL user-3aaf2ac8399f@xymon.invalid
The "COUNT>=5" would then cause this rule to trigger only if there
were 5 or more hosts named www.*.foo.com, whose http tests are red.
You could even combine this with other criteria, say have a threshold of
5 during the daytime, and 10 during off-hours.
I can foresee a problem in handling recovery-notifications for this kind
of alerts, but that's something I'll have to think about.
Would that be useful ?
That would seem extremely useful.  I was thinking about notifications as well, and my first thought was just to notify on every recovery (if you've selected RECOVERED).  That way you would know if a single host kept doing a down/up/down/up/down/up type of thing.

This would work alright in my environment since hosts that go into error for any length of time tend not to fix themselves anyways. :)

--
Bruce Z. Lysik  <user-4e63a10f8934@xymon.invalid>


list Daniel J McDonald · Mon, 20 Jun 2005 08:14:59 -0500 ·
quoted from Bruce Lysik
On Fri, 2005-06-17 at 08:01 +0200, Henrik Stoerner wrote:
Something like

   HOST=%(www.*).foo.com TEST=http COLOR=red COUNT>=5
      MAIL user-3aaf2ac8399f@xymon.invalid

The "COUNT>=5" would then cause this rule to trigger only if there
were 5 or more hosts named www.*.foo.com, whose http tests are red.
You could even combine this with other criteria, say have a threshold of
5 during the daytime, and 10 during off-hours.

I can foresee a problem in handling recovery-notifications for this kind
of alerts, but that's something I'll have to think about.

Would that be useful ?
The main place I would use it would be NTP alerts.  If one router loses
NTP, I'm not terribly worried.  If 10-20 of them all fail at once then I
know there is something really bad happening... Maybe both GPS clocks
lost sync and all 4 cesium backups failed, or ntp locked up on a core
router and I need to make fewer down-stream nodes dependent on that one.


I would also consider using it for purple alerts.  I don't want
individual purples for most of my stuff, but if there are a lot of them
(>100) then I know I killed mrtg and I should page on that.
-- 
Daniel J McDonald, CCIE # 2495, CNX
Austin Energy

user-290ce4e24e19@xymon.invalid
list Winn Beutler · Thu, 23 Jun 2005 14:05:52 -0600 ·
I would like to try the combotest but am not sure what needs to be done besides defining a combotest in combotest.cfg.  I have looked through the docs but can't find a complete example.

Once there is a test named 'Pool1.http' defined in combotest.cfg, what needs to be present in bb-hosts and hobbit-alerts.cfg to receive an email?
Many thanks for the help.  Everyone here is impressed with HOBBIT!
Winn

-----Original Message-----
From: Henrik Stoerner [mailto:user-ce4a2c883f75@xymon.invalid]
Sent: Thursday, June 16, 2005 3:16 PM
To: user-ae9b8668bcde@xymon.invalid
Subject: Re: [hobbit] only alert if X number of hosts are already in
error
quoted from Henrik Størner

E.g. if you're monitoring http on 5 hosts, you could define a
combination test like this:

Pool1.http=(hostA.http+hostB.http+hostC.http+hostD.http+hostE.http)>3

That would give you a red alert if 3 or fewer hosts in the pool were
green. And you could then trigger an alert based on that test result.

Regards,
Henrik

SPECIAL NOTICE

All information transmitted hereby is intended only for the use of the
addressee(s) named above and may contain confidential and privileged
information. Any unauthorized review, use, disclosure or distribution
of confidential and privileged information is prohibited. If the reader
of this message is not the intended recipient(s) or the employee or agent
responsible for delivering the message to the intended recipient, you are
hereby notified that you must not read this transmission and that disclosure,
copying, printing, distribution or use of any of the information contained
in or attached to this transmission is STRICTLY PROHIBITED.

Anyone who receives confidential and privileged information in error should
notify us immediately by telephone and mail the original message to us at
the above address and destroy all copies.  To the extent any portion of this
communication contains public information, no such restrictions apply to that
information. (gate01)