Xymon Mailing List Archive

No pages when going from yellow to red

14 messages in this thread

list Pat Vaughan · Tue, 1 Nov 2005 15:00:01 -0500 (EST) ·
On my AIX box running the Hobbit client I found that disk alarms aren't
generated if the condition goes from yellow to red.

I have rules to send an Email if it's yellow, and I always get that.  I
also have rules to send a page if the state is red, and I get those if it
jumps from green to red.  But, if the state goes to yellow and then to
red, the paging rule never fires.  There isn't an entry added to the
notification log either.

I've mentioned this before on the list, but never got a definite response.
list Andy France · Wed, 2 Nov 2005 09:22:36 +1300 ·

"Pat Vaughan" wrote on 02/11/2005 09:00:01 a.m.:
quoted from Pat Vaughan
On my AIX box running the Hobbit client I found that disk alarms aren't
generated if the condition goes from yellow to red.
I have rules to send an Email if it's yellow, and I always get that.  I
also have rules to send a page if the state is red, and I get those if it
jumps from green to red.  But, if the state goes to yellow and then to
red, the paging rule never fires.  There isn't an entry added to the
notification log either.
I've mentioned this before on the list, but never got a definite
response.
I'm pretty sure I've seen the same issue, running 4.1.2 on Solaris 9 x86
for the server.  I can't absolutely confirm but will keep an eye out for
it next time.

I've also had an incident over the weekend where the status went:

  green --> red --> yellow --> green

I got the page for the red alert, but did not get a recovery message.  If
the status goes:

  green --> red --> green

I get both the page and the recovery.

It's my rostered week on call, so if any fixes need to be tested get them
in quick!

Cheers,
Andy.

list Henrik Størner · Tue, 1 Nov 2005 22:25:12 +0100 ·
quoted from Pat Vaughan
On Tue, Nov 01, 2005 at 03:00:01PM -0500, Pat Vaughan wrote:
On my AIX box running the Hobbit client I found that disk alarms aren't
generated if the condition goes from yellow to red.

I have rules to send an Email if it's yellow, and I always get that.  I
also have rules to send a page if the state is red, and I get those if it
jumps from green to red.  But, if the state goes to yellow and then to
red, the paging rule never fires.  There isn't an entry added to the
notification log either.

I've mentioned this before on the list, but never got a definite response.
I'll look into this problem.


Henrik
list Henrik Størner · Tue, 1 Nov 2005 23:15:25 +0100 ·
quoted from Henrik Størner
On Tue, Nov 01, 2005 at 10:25:12PM +0100, Henrik Stoerner wrote:
On Tue, Nov 01, 2005 at 03:00:01PM -0500, Pat Vaughan wrote:
On my AIX box running the Hobbit client I found that disk alarms aren't
generated if the condition goes from yellow to red.

I have rules to send an Email if it's yellow, and I always get that.  I
also have rules to send a page if the state is red, and I get those if it
jumps from green to red.  But, if the state goes to yellow and then to
red, the paging rule never fires.  There isn't an entry added to the
notification log either.

I've mentioned this before on the list, but never got a definite response.
I'll look into this problem.
I've been trying to re-create this problem, but I do get the alerts I 
expect to get.

One thing that might confuse some people: The REPEAT setting counts
"across colors". E.g. if you have REPEAT=30 (the default, 30 minutes
between alerts), and the sequence of events goes like

  22:05 Test goes yellow - alert (yellow) is sent
  22:35 Test still yellow - repeat alert (yellow) is sent
  22:45 Test goes red. No alert is sent because it is only 10 minutes
        since the last alert went out.
  23:05 Test still red - now an alert (red) is sent.
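
The timeline above can be reproduced with a small simulation. This is only a sketch of the behaviour as described in this message, not Hobbit's actual code: the `simulate` and `at` helpers are hypothetical names, and the single shared "last alert" timestamp is an assumption that stands in for "REPEAT counts across colors".

```python
from datetime import datetime, timedelta

REPEAT = timedelta(minutes=30)  # repeat interval used in the example above


def at(hhmm):
    """Build a timestamp on an arbitrary day from an HH:MM string."""
    return datetime.strptime("2005-11-01 " + hhmm, "%Y-%m-%d %H:%M")


def simulate(events, repeat=REPEAT):
    """Return the (time, color) alerts that would be sent.

    One "last alert" timestamp is kept regardless of color, so a
    color change does not bypass the repeat interval.
    """
    last_sent = None
    sent = []
    for when, color in events:
        if last_sent is None or when - last_sent >= repeat:
            sent.append((when.strftime("%H:%M"), color))
            last_sent = when
    return sent


events = [
    (at("22:05"), "yellow"),
    (at("22:35"), "yellow"),
    (at("22:45"), "red"),     # only 10 minutes after the last alert
    (at("23:05"), "red"),
]
alerts = simulate(events)
for when, color in alerts:
    print(when, color)
```

In this sketch the 22:45 change to red produces no alert, and the first red alert goes out at 23:05.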


Henrik
list Larry Barber · Tue, 1 Nov 2005 17:25:33 -0500 (EST) ·
If you use separate rules for yellow and red alerts, say a
hobbit-alerts.cfg that looked something like:

HOST a_host:
	MAIL user-7f377ba79fba@xymon.invalid COLOR=red DELAY=0 REPEAT=30m RECOVERED
	MAIL user-7f377ba79fba@xymon.invalid COLOR=yellow DELAY=0 REPEAT=30m RECOVERED

would you then get a separate email when the condition turned red at
22:45 below?

Thanks,
Larry Barber
quoted from Henrik Størner


On Tue, 2005-11-01 at 16:15 -0600, user-ce4a2c883f75@xymon.invalid wrote:
On Tue, Nov 01, 2005 at 10:25:12PM +0100, Henrik Stoerner wrote:
On Tue, Nov 01, 2005 at 03:00:01PM -0500, Pat Vaughan wrote:
On my AIX box running the Hobbit client I found that disk alarms aren't
generated if the condition goes from yellow to red.

I have rules to send an Email if it's yellow, and I always get that.  I
also have rules to send a page if the state is red, and I get those if it
jumps from green to red.  But, if the state goes to yellow and then to
red, the paging rule never fires.  There isn't an entry added to the
notification log either.

I've mentioned this before on the list, but never got a definite response.
I'll look into this problem.

I've been trying to re-create this problem, but I do get the alerts I
expect to get.

One thing that might confuse some people: The REPEAT setting counts
"across colors". E.g. if you have REPEAT=30 (the default, 30 minutes
between alerts), and the sequence of events goes like

  22:05 Test goes yellow - alert (yellow) is sent
  22:35 Test still yellow - repeat alert (yellow) is sent
  22:45 Test goes red. No alert is sent because it is only 10 minutes
        since the last alert went out.
  23:05 Test still red - now an alert (red) is sent.


Henrik

list Henrik Størner · Tue, 1 Nov 2005 23:36:45 +0100 ·
quoted from Larry Barber
On Tue, Nov 01, 2005 at 05:25:33PM -0500, user-7a6c75d6cc10@xymon.invalid wrote:
If you use separate rules for yellow and red alerts, say a
hobbit-alerts.cfg that looked something like:

HOST a_host:
	MAIL user-7f377ba79fba@xymon.invalid COLOR=red DELAY=0 REPEAT=30m RECOVERED
	MAIL user-7f377ba79fba@xymon.invalid COLOR=yellow DELAY=0 REPEAT=30m RECOVERED

would you then get a separate email when the condition turned red at
22:45 below?
No, and that might be something that could change. The repeat-
checking code currently identifies an alert by the combination
of hostname, servicename and recipient; I could easily change
that so a separate line in the config-file would result in a new
set of repeat-checks. 
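
To illustrate why two separate config lines for the same recipient share one repeat timer, here is a minimal sketch of a repeat key built from hostname, servicename and recipient, as described above. The `Rule` type, the `repeat_key` function and the example addresses are hypothetical and only serve the illustration; they are not taken from the Hobbit source.

```python
from collections import namedtuple

# Hypothetical stand-in for one recipient line in the alert configuration.
Rule = namedtuple("Rule", "recipient color")


def repeat_key(host, service, rule):
    """Key used to look up the repeat timer.

    Neither the color nor the position of the line in the config
    file is part of the key, so rules that differ only in COLOR
    end up sharing one timer.
    """
    return (host, service, rule.recipient)


red_rule = Rule("admin@example.com", "red")
yellow_rule = Rule("admin@example.com", "yellow")
pager_rule = Rule("pager@example.com", "red")

same = repeat_key("a_host", "disk", red_rule) == repeat_key("a_host", "disk", yellow_rule)
different = repeat_key("a_host", "disk", red_rule) != repeat_key("a_host", "disk", pager_rule)
print(same, different)
```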
  22:05 Test goes yellow - alert (yellow) is sent
  22:35 Test still yellow - repeat alert (yellow) is sent
  22:45 Test goes red. No alert is sent because it is only 10 minutes
        since the last alert went out.
  23:05 Test still red - now an alert (red) is sent.

Regards,
Henrik
list Pat Vaughan · Wed, 2 Nov 2005 14:04:21 -0500 (EST) ·
ACK!  So, what do we do if we want to get Emails for yellow alerts, and
pages for red alerts and not get repeat pages every x minutes?  It seems
that a red alert is usually pretty important, and we want to know about it
immediately, instead of waiting until the repeat time expires (which we
set to 30d per a previous recommendation).

I would expect that a change in the state of a test would reset the REPEAT
counter.
quoted from Larry Barber
On Tue, Nov 01, 2005 at 10:25:12PM +0100, Henrik Stoerner wrote:
I've been trying to re-create this problem, but I do get the alerts I
expect to get.

One thing that might confuse some people: The REPEAT setting counts
"across colors". E.g. if you have REPEAT=30 (the default, 30 minutes
between alerts), and the sequence of events goes like

  22:05 Test goes yellow - alert (yellow) is sent
  22:35 Test still yellow - repeat alert (yellow) is sent
  22:45 Test goes red. No alert is sent because it is only 10 minutes
        since the last alert went out.
  23:05 Test still red - now an alert (red) is sent.


Henrik

list Pat Vaughan · Wed, 2 Nov 2005 14:07:59 -0500 (EST) ·
Okay, my recipients are different, but I'm using scripts instead of MAIL
recipients:

SCRIPT /usr/local/bin/scripts/hobbit-mail UNIX_ADMIN SERVICE=%(cpu|disk|entstat|procs|ssh|telnet|vmio) COLOR=yellow,purple REPEAT=30d RECOVERED

SCRIPT /usr/local/bin/scripts/hobbit-mailpage $PATVAUGHAN_PAGERMAIL SERVICE=%(cpu|disk|entstat|memory|procs|ssh|telnet|vmio) COLOR=red TIME=12345:0800:1700 REPEAT=30d
quoted from Henrik Størner
On Tue, Nov 01, 2005 at 05:25:33PM -0500, user-7a6c75d6cc10@xymon.invalid wrote:
If you use separate rules for yellow and red alerts, say a
hobbit-alerts.cfg that looked something like:

HOST a_host:
	MAIL user-7f377ba79fba@xymon.invalid COLOR=red DELAY=0 REPEAT=30m RECOVERED
	MAIL user-7f377ba79fba@xymon.invalid COLOR=yellow DELAY=0 REPEAT=30m RECOVERED

would you then get a separate email when the condition turned red at
22:45 below?
No, and that might be something that could change. The repeat-
checking code currently identifies an alert by the combination
of hostname, servicename and recipient; I could easily change
that so a separate line in the config-file would result in a new
set of repeat-checks.
list Etienne Roulland · Fri, 04 Nov 2005 12:32:20 +0100 ·
Hi,

I had a similar problem last night.

A CPU alert was sent, but we never got the recovery message. We traced
our sending script: it was never called to send the recovered message.
It looks like:
red -> green = recovered
red -> yellow -> green = no recovered

My conf :

HOST=*
	SCRIPT /usr2/hobbit/server/sendit.sh  SMS COLOR=red,purple REPEAT=15m DURATION>6m RECOVERED FORMAT=sms

Regards
quoted from Pat Vaughan


On mer, 2005-11-02 at 14:04 -0500, Pat Vaughan wrote:
ACK!  So, what do we do if we want to get Emails for yellow alerts, and
pages for red alerts and not get repeat pages every x minutes?  It seems
that a red alert is usually pretty important, and we want to know about it
immediately, instead of waiting until the repeat time expires (which we
set to 30d per a previous recommendation).

I would expect that a change in the state of a test would reset the REPEAT
counter.
On Tue, Nov 01, 2005 at 10:25:12PM +0100, Henrik Stoerner wrote:
I've been trying to re-create this problem, but I do get the alerts I
expect to get.

One thing that might confuse some people: The REPEAT setting counts
"across colors". E.g. if you have REPEAT=30 (the default, 30 minutes
between alerts), and the sequence of events goes like

  22:05 Test goes yellow - alert (yellow) is sent
  22:35 Test still yellow - repeat alert (yellow) is sent
  22:45 Test goes red. No alert is sent because it is only 10 minutes
        since the last alert went out.
  23:05 Test still red - now an alert (red) is sent.


Henrik

-- 

Etienne Roulland -- CVF Bordeaux
list Pat Vaughan · Mon, 7 Nov 2005 15:56:37 -0500 (EST) ·
quoted from Pat Vaughan
No, and that might be something that could change. The repeat-
checking code currently identifies an alert by the combination
of hostname, servicename and recipient; I could easily change
that so a separate line in the config-file would result in a new
set of repeat-checks.
Is this something that might make it into the next version?  I'm almost
ready to take a snapshot if I have to.  This bit me again today.  It seems
to me that the most intelligent change would be to generate a new
repeat-check for every line in the hobbit-alerts file or, and I haven't
looked at the code at all, to reset the repeat timer every time a test
changes color (possibly using a different keyword to keep current setups
working as anticipated).
list Henrik Størner · Mon, 7 Nov 2005 22:27:21 +0100 ·
quoted from Pat Vaughan
On Mon, Nov 07, 2005 at 03:56:37PM -0500, Pat Vaughan wrote:
No, and that might be something that could change. The repeat-
checking code currently identifies an alert by the combination
of hostname, servicename and recipient; I could easily change
that so a separate line in the config-file would result in a new
set of repeat-checks.
Is this something that might make it into the next version?  I'm almost
ready to take a snapshot if I have to.  This bit me again today.
I did some work on this yesterday - while working on it, I found
out that there is something buggy in the current version. From my
Changes file (http://www.hswn.dk/beta/Changes):

* The handling of alerts was counting the duration of an event
  based on when the color last changed. This meant that each
  time the color changed, any DURATION counters were reset.
  This would cause alerts to not go out if a status was changing
  between yellow and red faster than any DURATION setting.
  Changed this to count the event start as the *first* time the
  status went into an alert state (yellow or red, usually).
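
The difference between the two ways of counting can be sketched as follows. This is a hedged illustration of the Changes entry above, not the real implementation: `event_start`, `ALERT_COLORS` and the sample history are assumptions made for the example.

```python
from datetime import datetime, timedelta

ALERT_COLORS = {"yellow", "red"}  # assumed set of alert states


def event_start(history, fixed=True):
    """Return the time from which DURATION is measured.

    history is a list of (time, color) status changes in order.
    fixed=False mimics the old behaviour (restart on every color
    change); fixed=True keeps the first entry into an alert state.
    """
    start = None
    prev = None
    for when, color in history:
        if color in ALERT_COLORS:
            if start is None or (not fixed and color != prev):
                start = when
        else:
            start = None  # back to a non-alert state: event is over
        prev = color
    return start


t0 = datetime(2005, 11, 6, 12, 0)
minutes = lambda n: timedelta(minutes=n)
history = [
    (t0, "green"),
    (t0 + minutes(2), "yellow"),
    (t0 + minutes(6), "red"),
    (t0 + minutes(10), "yellow"),
    (t0 + minutes(14), "red"),
]
now = t0 + minutes(15)
duration = minutes(6)  # e.g. a rule with DURATION>6m

old_fires = now - event_start(history, fixed=False) > duration
new_fires = now - event_start(history, fixed=True) > duration
print(old_fires, new_fires)
```

With a status flapping every four minutes, the old counting never reaches the six-minute threshold in this sketch, while counting from the first alert state does.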

I then also implemented the following change:

* When a status goes yellow->red, the repeat-interval is
  now cleared for any alerts. This makes sure you get an
  alert immediately for the most severe state seen. This
  only affects the first such transition; if the status
  later changes between yellow/red, this normal REPEAT
  interval applies.

So you'll now get an alert when it goes yellow, and another
when it goes red (if your configuration includes alerts for 
these colors, obviously).
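
A rough sketch of the changed behaviour, under the assumption that a single flag remembers whether red has already been seen during the current event. The names `simulate_new` and `seen_red` are hypothetical, and the sketch covers one event only (it does not model the return to green).

```python
from datetime import datetime, timedelta


def at(hhmm):
    """Build a timestamp on an arbitrary day from an HH:MM string."""
    return datetime.strptime("2005-11-07 " + hhmm, "%Y-%m-%d %H:%M")


def simulate_new(events, repeat=timedelta(minutes=30)):
    """Alerts sent when the first change to red bypasses REPEAT."""
    last_sent = None
    seen_red = False
    sent = []
    for when, color in events:
        # Only the first time red is seen counts as an escalation.
        first_escalation = color == "red" and not seen_red
        if color == "red":
            seen_red = True
        if first_escalation or last_sent is None or when - last_sent >= repeat:
            sent.append((when.strftime("%H:%M"), color))
            last_sent = when
    return sent


events = [
    (at("22:05"), "yellow"),
    (at("22:35"), "yellow"),
    (at("22:45"), "red"),      # first red: alert goes out immediately
    (at("22:50"), "yellow"),
    (at("22:55"), "red"),      # later flapping obeys REPEAT again
    (at("23:15"), "red"),
]
new_alerts = simulate_new(events)
for when, color in new_alerts:
    print(when, color)
```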

This is in the current snapshot, and will also be in the next
release. I am tempted to do a 4.1.3 release fairly soon - this
problem is fairly serious. And the disk graph problem that is
also fixed in the current snapshot annoys quite a few people.
quoted from Pat Vaughan

It seems
to me that the most intelligent change would be to generate a new
repeat-check for every line in the hobbit-alerts file or, and I haven't
looked at the code at all, to reset the repeat timer every time a test
changes color (possibly using a different keyword to keep current setups
working as anticipated).
I'd rather not have the REPEAT handling tied to the physical layout
of the configuration file - it makes it a lot harder to handle when
the file is changed while alerts are active. I know I wrote something
different in the message you've quoted, but after looking some more
at the problem I've changed my mind.

I think the new code strikes a sensible balance between getting
the necessary alerts and not being flooded with them. The current
version works the way it does because I did not want to be
flooded with alerts by a state that kept on changing between
yellow and red - e.g. a disk that is filled to just about the
limit between the warning and panic levels. The new code will
give you that one extra alert telling you that the situation
is critical, but once it has done that it will obey the
REPEAT setting and only send you an alert every 30 minutes
(or whatever your REPEAT interval is).


Regards,
Henrik
list Charles Jones · Mon, 07 Nov 2005 14:48:34 -0700 ·
I upgraded a Hobbit instance from 4.0beta4 (yes, terribly old, but it has been working fine) to 4.1.2 (actually the latest snapshot), and after the upgrade the main page still said it was version 4.0beta4, even after a new host that I added showed up on the display.

I wasn't sure if it was a caching problem or what, so what I ended up doing was saving a copy of bb-hosts.cfg, hobbitalerts.cfg, and  to /tmp, then I totally deleted the "server" directory and reinstalled, put the config files back, fired up hobbit and all seemed well.

My question is, how does hobbit handle doing an upgrade? It didn't replace my bb-hosts file, so apparently it is aware of the existence of one different from the default and doesn't overwrite it...does it do the same checks for the various files in the other subdirectories?

Maybe there should be a makefile option ("make upgrade-install") to upgrade, that saves your config files but totally replaces all preexisting hobbit components, to make sure nothing old is hanging around, like I ended up having both a hobbit.sh and a starthobbit.sh (the old one) in my server directory.

-Charles
list Henrik Størner · Mon, 7 Nov 2005 23:08:32 +0100 ·
quoted from Charles Jones
On Mon, Nov 07, 2005 at 02:48:34PM -0700, Charles Jones wrote:
I upgraded a Hobbit instance from 4.0beta4 (yes, terribly old, but it has been working fine) to 4.1.2 (actually the latest snapshot), and after the upgrade the main page still said it was version 4.0beta4, even after a new host that I added showed up on the display.

I wasn't sure if it was a caching problem or what, so what I ended up doing was saving a copy of bb-hosts.cfg, hobbitalerts.cfg, and  to /tmp, then I totally deleted the "server" directory and reinstalled, put the config files back, fired up hobbit and all seemed well.
Not sure why that happened .. 4.0-beta4 is pretty old (in fact, it is
the oldest release I have lying around - older than that, I'd have to
fetch it from my RCS archive).
My question is, how does hobbit handle doing an upgrade? It didn't replace my bb-hosts file, so apparently it is aware of the existence of one different from the default and doesn't overwrite it...does it do the same checks for the various files in the other subdirectories?
With betas and snapshots, all bets are off - but when upgrading from
one official release to another, "make install" tries fairly hard to
handle configuration files the right way:

* It won't touch an existing bb-hosts file.

* It will try to add new entries to the hobbitlaunch.cfg,
  hobbitserver.cfg, hobbitgraph.cfg, hobbitcgi.cfg and columndoc.csv
  files. It doesn't delete anything, since it could be entries that
  you've added yourself for custom scripts and extensions.

* For the GIF's, webpage header/footer templates and the like,
  Hobbit uses an MD5 checksum to see if the version on your box is
  one of the default ones shipped with an older version of Hobbit.
  If it is, then it will replace it with the current version.

* Files that have been renamed - currently, it's only hobbitd_larrd
  which was renamed to hobbitd_rrd - it will delete the old file,
  but set up a symlink so any references to the old filename still
  work, but hit the new file.
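
The MD5 check for templates and images described in the list above could look roughly like this. It is only a sketch of the idea, written in Python for illustration; the real "make install" logic is not shown in this thread, and `md5_of`, `should_replace` and the sample file contents are made up for the example.

```python
import hashlib
import os
import tempfile


def md5_of(path):
    """Return the hex MD5 digest of a file's contents."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()


def should_replace(installed_path, known_default_md5s):
    """True if the installed file is identical to a default shipped
    with an older version, i.e. it was never customised locally."""
    return md5_of(installed_path) in known_default_md5s


# Hypothetical contents of a default file from an older release.
old_default = b"<html>default header v1</html>\n"
known = {hashlib.md5(old_default).hexdigest()}

with tempfile.TemporaryDirectory() as tmp:
    untouched = os.path.join(tmp, "header_untouched")
    customised = os.path.join(tmp, "header_custom")
    with open(untouched, "wb") as fh:
        fh.write(old_default)
    with open(customised, "wb") as fh:
        fh.write(b"<html>my own header</html>\n")
    replace_untouched = should_replace(untouched, known)
    replace_customised = should_replace(customised, known)

print(replace_untouched, replace_customised)
```

A file that still matches an old default is safe to overwrite; anything else is treated as a local customisation and left alone.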

Maybe there should be a makefile option ("make upgrade-install") to upgrade, that saves your config files but totally replaces all preexisting hobbit components, to make sure nothing old is hanging around, like I ended up having both a hobbit.sh and a starthobbit.sh (the old one) in my server directory.
I think "make install" should just do what it does now. "starthobbit.sh"
went away between RC4 and RC5.


Regards,
Henrik
list Pat Vaughan · Mon, 7 Nov 2005 17:10:20 -0500 (EST) ·
quoted from Henrik Størner
So you'll now get an alert when it goes yellow, and another
when it goes red (if your configuration includes alerts for
these colors, obviously).
That sounds like it should fix my problem perfectly.
quoted from Henrik Størner
I think the new code strikes a sensible balance between getting
the necessary alerts and not being flooded with them. The current
version works the way it does because I did not want to be
flooded with alerts by a state that kept on changing between
yellow and red - e.g. a disk that is filled to just about the
limit between the warning and panic levels. The new code will
give you that one extra alert telling you that the situation
is critical, but once it has done that it will obey the
REPEAT setting and only send you an alert every 30 minutes
(or whatever your REPEAT interval is).
I like it, if we get a page and don't fix whatever is wrong so it goes
back to a "green" state we should be flogged anyway.  I'll be sure to grab
the next version as soon as it's available.