Xymon Mailing List Archive

EXHOST usage

9 messages in this thread

Charles Jones · Wed, 26 Oct 2005 04:12:00 -0700
Perhaps it's because I'm working on this at 4am, but I'm having a problem with the EXHOST option: according to hobbitd_alert --test, it isn't working. I am also not sure how to do a particular host/service exclusion.

Here's basically what my alert config below is meant to accomplish.
1. For any alerts on any servers, send alerts to an alert email address.
2. For 2 particular web servers (web5.mydomain.com and web6.mydomain.com), send an alert to one person, but *not* the alert alias.
3. For a set of oracle servers, send an extra alert message to an alternate email address/cellphone.
4. After hours (from 5pm until 8am), only send alerts to an alternate email address (but still need the separate alert for the web5 and web6 hosts described in #2).
5. After hours (from 5pm until 8am), send an alert to my cellphone for any hosts and services being red for more than 30 mins.
6. Do not alert for high load average on a particular server from 6-10am.

My first problem is that I am not sure how to implement #2 (exclude alerts for msgs on web5 and web6 from being sent to the main alert email, and instead send them to the alternate address). I'm thinking one solution would be to define it as the very first rule and use the "STOP" option, like:

HOST=$WEB_SERVERS SERVICE=msgs COLOR=red MAIL user-ea4a73bd57c1@xymon.invalid STOP

Also, according to the tests I did with hobbitd_alert --test, the last rule is not working. Is there a more logical way to simply specify that a single host/service combo be ignored during a certain timeframe? My hobbit-alerts.cfg is below; if you see anything wrong or have suggestions on a better way to accomplish my list above, I would appreciate it.

# hobbitd-alerts.cfg
# oradb1-9.mydomain.com
$ORACLE_SERVERS=%oradb(.).mydomain.com
# web1-9.mydomain.com
$WEB_SERVERS=%web(.).mydomain.com
# All hosts
$ALL_HOSTS=*

# Send me an email alert if any service on any host goes red.
# Note: This rule will probably be removed once the alert rules are deemed fully working.
HOST=$ALL_HOSTS SERVICE=* COLOR=red MAIL user-87499726aafb@xymon.invalid

# Notify webdev about (only) web errors on web5 and web6
# FIXME: Need to make it so that user-87499726aafb@xymon.invalid does NOT get these!
HOST=$WEB_SERVERS SERVICE=msgs COLOR=red MAIL user-ea4a73bd57c1@xymon.invalid

# Send an alert to dba phone on oracle-specific problems
HOST=$ORACLE_SERVERS SERVICE=msgs,oradb,orasys COLOR=red FORMAT=sms
MAIL user-b2813ebf6fe7@xymon.invalid

# Send me a page if any hosts go red for more than 30 minutes during offhours
HOST=$ALL_HOSTS SERVICE=* COLOR=red DURATION>30  TIME=1700-0800 FORMAT=sms
     MAIL user-f3d2a9d4064f@xymon.invalid

# Ignore high load average warnings for dataproc1 in the mornings
EXHOST=dataproc1.mydomain.com SERVICE=cpu COLOR=red TIME=0600-1000
     MAIL user-b75305ea6ec0@xymon.invalid
Henrik Størner · Wed, 26 Oct 2005 16:33:51 +0200
quoted from Charles Jones
On Wed, Oct 26, 2005 at 04:12:00AM -0700, Charles Jones wrote:
Perhaps it's because I'm working on this at 4am, but I'm having a 
problem with the EXHOST option, that according to hobbitd_alert --test 
isn't working, I also am not sure how to do a particular host/service 
exclusion.

Here's basically what my alert config below is meant to accomplish.
1. For any alerts on any servers, send alerts to an alert email address.
2. For 2 particular web servers (web5.mydomain.com and 
web6.mydomain.com), send an alert to one person, but *not* the alert alias.
3. For a set of oracle servers, send an extra alert message to an 
alternate email address/cellphone.
One way of doing these would be:

# 2 special webservers, that ONLY get this alert (2)
HOST=$WEB_SERVERS SERVICE=msgs COLOR=red
	MAIL user-ea4a73bd57c1@xymon.invalid STOP

# Oracle alerts (3)
HOST=$ORACLE_SERVERS SERVICE=msgs,oradb,orasys COLOR=red FORMAT=sms
	MAIL user-b2813ebf6fe7@xymon.invalid

# Default rule (1)
HOST=$ALL_HOSTS SERVICE=* COLOR=red 
	MAIL user-87499726aafb@xymon.invalid
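
Rule order is what makes this work: the web rule comes first, and its STOP ends the search before the default rule is reached. The following Python sketch is a simplified model of that behaviour, not Hobbit's actual matching code. The `Rule` class, the `recipients_for` function and the short recipient labels are hypothetical stand-ins for the rules above, and host patterns are assumed to be unanchored regular expressions.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    host_pattern: str   # regular expression matched against the hostname
    services: set       # service names the rule covers; empty set means "any"
    recipient: str      # label standing in for the MAIL address
    stop: bool = False  # models the STOP flag on the recipient


def recipients_for(rules, host, service):
    """Walk the rules in file order and collect recipients until a STOP is hit."""
    found = []
    for rule in rules:
        if not re.search(rule.host_pattern, host):
            continue
        if rule.services and service not in rule.services:
            continue
        found.append(rule.recipient)
        if rule.stop:
            break  # STOP: later rules are never consulted
    return found


# Same order as the rules above: web (2), Oracle (3), default (1).
RULES = [
    Rule(r"web(.).mydomain.com", {"msgs"}, "webdev", stop=True),
    Rule(r"oradb(.).mydomain.com", {"msgs", "oradb", "orasys"}, "dba-sms"),
    Rule(r".*", set(), "default"),
]

print(recipients_for(RULES, "web5.mydomain.com", "msgs"))
print(recipients_for(RULES, "web5.mydomain.com", "cpu"))
print(recipients_for(RULES, "oradb1.mydomain.com", "oradb"))
```

In this model a msgs alert on web5 reaches only the webdev label, while an oradb alert on oradb1 reaches both the DBA label and the default one. Note that the pattern web(.).mydomain.com also matches web1 through web9 here, so limiting the rule to web5 and web6 would probably need a narrower pattern such as web[56].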
quoted from Charles Jones
4. After hours (from 5pm until 8am), only send alerts to an alternate 
email address (but still need the separate alert for the web5 and web6 
hosts described in #2).
5. After hours (from 5pm until 8am), send an alert to my cellphone for 
any hosts and services being red for more than 30 mins.
For these, modify the default rule marked (1) to use different alerts
based on time. E.g.

# Default rule (1)
HOST=$ALL_HOSTS SERVICE=* COLOR=red 
	MAIL user-87499726aafb@xymon.invalid TIME=*:0800:1700
	# Outside office hours, mail alerts to a different address (4)
	MAIL user-51f025a3ef62@xymon.invalid TIME=*:1700:0800
	# Outside office hours, send to my cell phone (5)
	MAIL user-f3d2a9d4064f@xymon.invalid FORMAT=sms DURATION>30 TIME=*:1700:0800
quoted from Charles Jones
6. Do not alert for high load average on a particular server from 6-10am.
There's no really elegant way of doing that ... it makes me think that
perhaps there should be some way of defining a "no-action" rule: "For
these conditions, do NOT send any alerts, and stop looking for more
alert recipients". But for now, you'll have to modify the default rule
to exclude that host, then set up specific rules for that host. So your
default rule becomes

# Default rule (1)
HOST=$ALL_HOSTS SERVICE=* COLOR=red EXHOST=dataproc1.mydomain.com
	MAIL user-87499726aafb@xymon.invalid TIME=*:0800:1700
	# Outside office hours, mail alerts to a different address (4)
	MAIL user-51f025a3ef62@xymon.invalid TIME=*:1700:0800
	# Outside office hours, send to my cell phone (5)
	MAIL user-f3d2a9d4064f@xymon.invalid FORMAT=sms DURATION>30 TIME=*:1700:0800

and the specific rules for that host:

# Load avg alerts only from 10am -> 6am
HOST=dataproc1.mydomain.com SERVICE=la TIME=*:1000:0600
	MAIL user-87499726aafb@xymon.invalid TIME=*:0800:1700
	MAIL user-51f025a3ef62@xymon.invalid TIME=*:1700:0800
	MAIL user-f3d2a9d4064f@xymon.invalid FORMAT=sms DURATION>30 TIME=*:1700:0800
# All other services alert like the normal default rule.
HOST=dataproc1.mydomain.com EXSERVICE=la
	MAIL user-87499726aafb@xymon.invalid TIME=*:0800:1700
	MAIL user-51f025a3ef62@xymon.invalid TIME=*:1700:0800
	MAIL user-f3d2a9d4064f@xymon.invalid FORMAT=sms DURATION>30 TIME=*:1700:0800


Regards,
Henrik
Charles Jones · Wed, 26 Oct 2005 09:52:40 -0700
quoted from Henrik Størner
Henrik Stoerner wrote:
On Wed, Oct 26, 2005 at 04:12:00AM -0700, Charles Jones wrote:
 
Perhaps it's because I'm working on this at 4am, but I'm having a 
problem with the EXHOST option, that according to hobbitd_alert --test 
isn't working, I also am not sure how to do a particular host/service 
exclusion.

Here's basically what my alert config below is meant to accomplish.
1. For any alerts on any servers, send alerts to an alert email address.
2. For 2 particular web servers (web5.mydomain.com and 
web6.mydomain.com), send an alert to one person, but *not* the alert alias.
3. For a set of oracle servers, send an extra alert message to an 
alternate email address/cellphone.
   
One way of doing these would be:

# 2 special webservers, that ONLY get this alert (2)
HOST=$WEB_SERVERS SERVICE=msgs COLOR=red
MAIL user-ea4a73bd57c1@xymon.invalid STOP

# Oracle alerts (3)
HOST=$ORACLE_SERVERS SERVICE=msgs,oradb,orasys COLOR=red FORMAT=sms
MAIL user-b2813ebf6fe7@xymon.invalid

# Default rule (1)
HOST=$ALL_HOSTS SERVICE=* COLOR=red 
MAIL user-87499726aafb@xymon.invalid

 
4. After hours (from 5pm until 8am), only send alerts to an alternate 
email address (but still need the separate alert for the web5 and web6 
hosts described in #2).
5. After hours (from 5pm until 8am), send an alert to my cellphone for 
any hosts and services being red for more than 30 mins.
   
For these, modify the default rule marked (1) to use different alerts
based on time. E.g.

# Default rule (1)
HOST=$ALL_HOSTS SERVICE=* COLOR=red 
MAIL user-87499726aafb@xymon.invalid TIME=*:0800:1700
# Outside office hours, mail alerts to a different address (4)
MAIL user-51f025a3ef62@xymon.invalid TIME=*:1700:0800
# Outside office hours, send to my cell phone (5)
MAIL user-f3d2a9d4064f@xymon.invalid FORMAT=sms DURATION>30 TIME=*:1700:0800

 
Ahh! I didn't realize you could make multiple TIME 
specifications...that's the main thing I was missing.
quoted from Henrik Størner
6. Do not alert for high load average on a particular server from 6-10am.
   
There's no really elegant way of doing that ... it makes me think that
perhaps there should be some way of defining a "no-action" rule: "For
these conditions, do NOT send any alerts, and stop looking for more
alert recipients".
That would be nice. I hereby dub it the BLACKHOLE option ;-)
quoted from Henrik Størner
But for now, you'll have to modify the default rule
to exclude that host, then set up specific rules for that host. So your
default rule becomes

# Default rule (1)
HOST=$ALL_HOSTS SERVICE=* COLOR=red EXHOST=dataproc1.mydomain.com
MAIL user-87499726aafb@xymon.invalid TIME=*:0800:1700
# Outside office hours, mail alerts to a different address (4)
MAIL user-51f025a3ef62@xymon.invalid TIME=*:1700:0800
# Outside office hours, send to my cell phone (5)
MAIL user-f3d2a9d4064f@xymon.invalid FORMAT=sms DURATION>30 TIME=*:1700:0800

and the specific rules for that host:

# Load avg alerts only from 10am -> 6am
HOST=dataproc1.mydomain.com SERVICE=la TIME=*:1000:0600
MAIL user-87499726aafb@xymon.invalid TIME=*:0800:1700
MAIL user-51f025a3ef62@xymon.invalid TIME=*:1700:0800
MAIL user-f3d2a9d4064f@xymon.invalid FORMAT=sms DURATION>30 TIME=*:1700:0800
# All other services alert like the normal default rule.
HOST=dataproc1.mydomain.com EXSERVICE=la
MAIL user-87499726aafb@xymon.invalid TIME=*:0800:1700
MAIL user-51f025a3ef62@xymon.invalid TIME=*:1700:0800
MAIL user-f3d2a9d4064f@xymon.invalid FORMAT=sms DURATION>30 TIME=*:1700:0800

This has me a bit confused. The default rule I understand, as it's the 
normal rule except that it excludes the dataproc1 host. In the specific 
rules, though, the first one has a TIME specification in the HOST= line, 
indicating from 6am-10am, but then the MAIL lines following it specify 
times outside that window... is that basically a way to trick Hobbit into 
not sending a mail at all?

Note: the way I handle this in BigBrother is via an exclude rule, 
basically when you define a rule with a ! in front of it, it removes 
that host/service from the FINAL match list.  Hopefully you can 
implement something in Hobbit for a similar effect.

# Don't wake the on-call person every morning about dataproc1 cpu/load being high
!dataproc1.mydomain.com;;cpu;;*;0600-1000;user-87499726aafb@xymon.invalid

I also use the same technique on BigBrother to remove alerts during 
certain hours:
# Don't send alerts about web errors during non-working hours.
!web*.mydomain.com;;msgs;;*;0000-0800;user-08c77f58d2e1@xymon.invalid
!web*.mydomain.com;;msgs;;*;1700-0000;user-87499726aafb@xymon.invalid

-Charles
Adam Scheblein · Wed, 26 Oct 2005 12:01:31 -0500
Greetings,

 
Is there a way to make the apache graphs show up in the http page?

 
Thanks,

Adam
Henrik Størner · Wed, 26 Oct 2005 19:13:17 +0200
quoted from Charles Jones
On Wed, Oct 26, 2005 at 09:52:40AM -0700, Charles Jones wrote:
Henrik Stoerner wrote:
# Default rule (1)
HOST=$ALL_HOSTS SERVICE=* COLOR=red EXHOST=dataproc1.mydomain.com
MAIL user-87499726aafb@xymon.invalid TIME=*:0800:1700
# Outside office hours, mail alerts to a different address (4)
MAIL user-51f025a3ef62@xymon.invalid TIME=*:1700:0800
# Outside office hours, send to my cell phone (5)
MAIL user-f3d2a9d4064f@xymon.invalid FORMAT=sms DURATION>30 TIME=*:1700:0800

and the specific rules for that host:

# Load avg alerts only from 10am -> 6am
HOST=dataproc1.mydomain.com SERVICE=la TIME=*:1000:0600
MAIL user-87499726aafb@xymon.invalid TIME=*:0800:1700
MAIL user-51f025a3ef62@xymon.invalid TIME=*:1700:0800
MAIL user-f3d2a9d4064f@xymon.invalid FORMAT=sms DURATION>30 TIME=*:1700:0800
# All other services alert like the normal default rule.
HOST=dataproc1.mydomain.com EXSERVICE=la
MAIL user-87499726aafb@xymon.invalid TIME=*:0800:1700
MAIL user-51f025a3ef62@xymon.invalid TIME=*:1700:0800
MAIL user-f3d2a9d4064f@xymon.invalid FORMAT=sms DURATION>30 TIME=*:1700:0800
This has me a bit confused. The default rule I understand, as it's the 
normal rule except it's excluding the dataproc1 host.  The specific rules 
though, the first one, has a TIME specification in the HOST= line, 
indicating from 6am-10am
No, the TIME=*:1000:0600 makes the rule apply from 10am until 6am *the
next day*. If you wanted a rule that works from 6am-10am, it would be
TIME=*:0600:1000, with the two time-specs reversed.
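
The wrap-around can be made concrete with a small Python sketch. It only illustrates the semantics described here and is not Hobbit's implementation: the `in_window` helper is a hypothetical name, the leading day-of-week field (`*`) is left out, and treating the start as inclusive and the end as exclusive is an assumption.

```python
def in_window(now, start, end):
    """Return True if `now` falls inside the window start..end.

    All three arguments are zero-padded HHMM strings such as "0600".
    When start is later than end, the window wraps past midnight.
    """
    if start <= end:
        return start <= now < end
    return now >= start or now < end


# TIME=*:1000:0600 covers 10am until 6am the next day ...
print(in_window("0700", "1000", "0600"))  # False: 7am is outside the window
print(in_window("2300", "1000", "0600"))  # True: 11pm is inside the window

# ... while TIME=*:0600:1000 covers 6am until 10am on the same day.
print(in_window("0700", "0600", "1000"))  # True
```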
quoted from Charles Jones
Note: the way I handle this in BigBrother is via an exclude rule, 
basically when you define a rule with a ! in front of it, it removes 
that host/service from the FINAL match list.  Hopefully you can 
implement something in Hobbit for a similar effect.
What I've done now (you can grab it from the snapshot that is generated
later tonight - 5 hours from now) would allow you to set up those rules
like this:

# Default rule (1)
HOST=$ALL_HOSTS SERVICE=* COLOR=red
	# Ignore "cpu" alerts from dataproc1 in the morning
	IGNORE HOST=dataproc1.mydomain.com SERVICE=cpu TIME=*:0600:1000
	# During office hours, alert to the mailbox.
	MAIL user-87499726aafb@xymon.invalid TIME=*:0800:1700
	# Outside office hours, mail alerts to a different address (4)
	MAIL user-51f025a3ef62@xymon.invalid TIME=*:1700:0800
	# Outside office hours, send to my cell phone (5)
	MAIL user-f3d2a9d4064f@xymon.invalid FORMAT=sms DURATION>30 TIME=*:1700:0800

"IGNORE" is a special recipient definition - like MAIL and SCRIPT - so
you can apply all of the host-, service- and time-filters etc to it.
If the IGNORE recipient triggers, it doesn't trigger an alert - and it
stops Hobbit looking for more recipients (like the STOP flag).
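
As a rough model of how an IGNORE recipient interacts with the MAIL recipients listed after it, consider the Python sketch below. It is a simplification under stated assumptions, not Hobbit's code: `pick_recipients` and the recipient labels are hypothetical, the DURATION-limited SMS recipient is omitted, and any recipient that matched before the IGNORE is assumed to keep its alert.

```python
DATAPROC = "dataproc1.mydomain.com"


def office_hours(now):
    """True between 08:00 and 17:00; `now` is a zero-padded HHMM string."""
    return "0800" <= now < "1700"


# (action, label, filter) triples in the order they appear under the rule.
RECIPIENTS = [
    ("IGNORE", None,
     lambda host, svc, now: host == DATAPROC and svc == "cpu"
     and "0600" <= now < "1000"),
    ("MAIL", "office-mailbox", lambda host, svc, now: office_hours(now)),
    ("MAIL", "offhours-mailbox", lambda host, svc, now: not office_hours(now)),
]


def pick_recipients(recipients, host, svc, now):
    """Collect MAIL labels in order; a triggered IGNORE ends the search."""
    alerted = []
    for action, label, applies in recipients:
        if not applies(host, svc, now):
            continue
        if action == "IGNORE":
            break  # no alert from this entry, and nothing after it is considered
        alerted.append(label)
    return alerted


print(pick_recipients(RECIPIENTS, DATAPROC, "cpu", "0700"))
print(pick_recipients(RECIPIENTS, DATAPROC, "cpu", "1100"))
print(pick_recipients(RECIPIENTS, DATAPROC, "disk", "0700"))
```

Under this model a cpu alert on dataproc1 at 7am reaches nobody, the same alert at 11am goes to the office-hours mailbox, and other services on that host are unaffected by the IGNORE entry.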


Regards,
Henrik
Henrik Størner · Wed, 26 Oct 2005 19:14:00 +0200
quoted from Adam Scheblein
On Wed, Oct 26, 2005 at 12:01:31PM -0500, Scheblein, Adam wrote:
Is there a way to make the apache graphs show up in the http page?
No.


Regards,
Henrik
Charles Jones · Thu, 27 Oct 2005 12:59:45 -0700
Henrik,

This will work nicely, but I also realized something else...while the IGNORE option will indeed ignore the alert, the main display will still show red for that service.  So while no pages may go out, anyone who monitors the page visually will still think there is a problem.

The optimal solution, I think, would be a "MAINT" option, where you could specify time periods during which a host and/or service automatically goes into maintenance mode.  This way no pages go out, and the display has the proper visual indication that the host has been administratively disabled. Something like:
MAINT HOST=dataproc1.mydomain.com SERVICE=cpu TIME=*:0600:1000

It might even be nice to be able to tack on a REASON="Some reason here", that gets put on the maint page, but if that is too difficult could just use a generic reason like "administratively disabled".

-Charles
Henrik Størner · Fri, 28 Oct 2005 08:48:37 +0200
quoted from Charles Jones
On Thu, Oct 27, 2005 at 12:59:45PM -0700, Charles Jones wrote:
The optimal solution, I think, would be a "MAINT" option, where you could specify time periods during which a host and/or service automatically goes into maintenance mode.
Hobbit has had that from day 1. It's called DOWNTIME, and goes in
the bb-hosts file.


Henrik
Charles Jones · Fri, 28 Oct 2005 04:49:40 -0700
Oh crap...open mouth, insert foot. Sorry Henrik, I should have thoroughly read the bb-hosts man page.  I will be sure to RTFM before I make stupid suggestions from now on :)

-Charles