Xymon Mailing List Archive

possible alerting bug in RC2?

10 messages in this thread

list Bruce Lysik · Mon, 14 Feb 2005 13:28:28 -0800 ·
Hi,

So I installed RC2 this morning.  Later on, I noticed an alert email for a monitor going into yellow.  I had disabled this previously with --alertcolors=red,purple in hobbitlaunch.cfg.  Here's the snippet:

[hobbitd]
        HEARTBEAT
        ENVFILE /opt/bb/server/etc/hobbitserver.cfg
        CMD hobbitd --restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk --checkpoint-interval=600 --purple-conn=conn --log=$BBSERVERLOGS/hobbitd.log --admin-senders=127.0.0.1,$BBSERVERIP --alertcolors=red,purple

And here's the alert I just received:

im68:cpu yellow [-1]
yellow Mon Feb 14 13:13:56 PST 2005 up: 208 day(s), 1 users, 115 procs, load=529


LOAD AVG on im68 is 529

Any ideas?

--
Bruce Z. Lysik  <user-4e63a10f8934@xymon.invalid>
Operations Engineer
list Asif Iqbal · Mon, 14 Feb 2005 18:44:37 -0500 ·
quoted from Bruce Lysik
On Mon, Feb 14, 2005 at 01:28:28PM, Bruce Lysik wrote:
Hi,

So I installed RC2 this morning.  Later on, I noticed an alert email for a monitor going into yellow.  I had disabled this previously with --alertcolors=red,purple in hobbitlaunch.cfg.  Here's the snippet:

[hobbitd]
        HEARTBEAT
        ENVFILE /opt/bb/server/etc/hobbitserver.cfg
        CMD hobbitd --restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk --checkpoint-interval=600 --purple-conn=conn --log=$BBSERVERLOGS/hobbitd.log --admin-senders=127.0.0.1,$BBSERVERIP --alertcolors=red,purple

And here's the alert I just received:

im68:cpu yellow [-1]
yellow Mon Feb 14 13:13:56 PST 2005 up: 208 day(s), 1 users, 115 procs, load=529
What do you get when you run the following test as hobbit user?

cd ~hobbit/server
./bin/bbcmd --test <FQDN of im68> cpu
quoted from Bruce Lysik

LOAD AVG on im68 is 529

Any ideas?

--
Bruce Z. Lysik  <user-4e63a10f8934@xymon.invalid>
Operations Engineer

-- 

Asif Iqbal
PGP Key: 0xE62693C5 KeyServer: pgp.mit.edu
"...it said: Install Windows XP or better...so I installed Solaris..."
list Asif Iqbal · Mon, 14 Feb 2005 18:47:09 -0500 ·
quoted from Asif Iqbal
On Mon, Feb 14, 2005 at 06:44:37PM, Asif Iqbal wrote:
On Mon, Feb 14, 2005 at 01:28:28PM, Bruce Lysik wrote:
Hi,
So I installed RC2 this morning.  Later on, I noticed an alert email for a monitor going into yellow.  I had disabled this previously with --alertcolors=red,purple in hobbitlaunch.cfg.  Here's the snippet:
[hobbitd]
        HEARTBEAT
        ENVFILE /opt/bb/server/etc/hobbitserver.cfg
        CMD hobbitd --restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk --checkpoint-interval=600 --purple-conn=conn --log=$BBSERVERLOGS/hobbitd.log --admin-senders=127.0.0.1,$BBSERVERIP --alertcolors=red,purple
And here's the alert I just received:
im68:cpu yellow [-1]
yellow Mon Feb 14 13:13:56 PST 2005 up: 208 day(s), 1 users, 115 procs, load=529
What do you get when you run the following test as hobbit user?

cd ~hobbit/server
./bin/bbcmd --test <FQDN of im68> cpu
oops I meant

./bin/bbcmd hobbitd_alert --test FQDN cpu
quoted from Asif Iqbal
LOAD AVG on im68 is 529
Any ideas?
--
Bruce Z. Lysik  <user-4e63a10f8934@xymon.invalid>
Operations Engineer
-- 
Asif Iqbal
PGP Key: 0xE62693C5 KeyServer: pgp.mit.edu
"...it said: Install Windows XP or better...so I installed Solaris..."

-- 
Asif Iqbal
PGP Key: 0xE62693C5 KeyServer: pgp.mit.edu
"...it said: Install Windows XP or better...so I installed Solaris..."
list Bruce Lysik · Mon, 14 Feb 2005 15:50:50 -0800 ·
quoted from Asif Iqbal
What do you get when you run the following test as hobbit user?

cd ~hobbit/server
./bin/bbcmd --test <FQDN of im68> cpu
-bash-2.05b$ ./bin/bbcmd --test im68.internal.shutterfly.com cpu
2005-02-14 15:48:51 Using default environment file /opt/bb/server/etc/hobbitserver.cfg
2005-02-14 15:48:51 execvp() failed: No such file or directory

Same with non-FQDN:

-bash-2.05b$ ./bin/bbcmd --test im68 cpu
2005-02-14 15:49:12 Using default environment file /opt/bb/server/etc/hobbitserver.cfg
2005-02-14 15:49:12 execvp() failed: No such file or directory

--
Bruce Z. Lysik  <user-4e63a10f8934@xymon.invalid>
Operations Engineer
list Bruce Lysik · Mon, 14 Feb 2005 15:55:46 -0800 ·
oops I meant

./bin/bbcmd hobbitd_alert --test FQDN cpu
Ah, that actually works.  Prepare for spam:

-bash-2.05b$ ./bin/bbcmd hobbitd_alert --test im68 cpu
2005-02-14 15:53:56 Using default environment file /opt/bb/server/etc/hobbitserver.cfg
Matching host:service:page 'im68:cpu:' against rule line 68:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 76:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 84:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 92:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 100:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 110:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 117:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 124:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 131:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 139:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 147:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 155:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 163:Matched
    *** Match with 'HOST=$HG-IMAGE' ***
Matching host:service:page 'im68:cpu:' against rule line 169:Failed (min. duration)
Matching host:service:page 'im68:cpu:' against rule line 171:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 176:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 184:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 192:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 200:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 208:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 213:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 218:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 223:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 231:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 239:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 247:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 255:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 263:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 271:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 279:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 284:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 289:Failed (hostname not in include list)
Matching host:service:page 'im68:cpu:' against rule line 297:Failed (hostname not in include list)


--
Bruce Z. Lysik  <user-4e63a10f8934@xymon.invalid>
Operations Engineer
list Asif Iqbal · Mon, 14 Feb 2005 22:57:39 -0500 ·
quoted from Bruce Lysik
On Mon, Feb 14, 2005 at 03:55:46PM, Bruce Lysik wrote:
oops I meant

./bin/bbcmd hobbitd_alert --test FQDN cpu
Ah, that actually works.  Prepare for spam:

-bash-2.05b$ ./bin/bbcmd hobbitd_alert --test im68 cpu
2005-02-14 15:53:56 Using default environment file /opt/bb/server/etc/hobbitserver.cfg
Matching host:service:page 'im68:cpu:' against rule line 163:Matched
    *** Match with 'HOST=$HG-IMAGE' ***
Matching host:service:page 'im68:cpu:' against rule line 169:Failed (min. duration)
I don't see any rule with a RED alert. Try removing the DURATION parameter
from this rule and show me the output again.

You can also just post the relevant portion of the following command's
output to get a better diagnosis:

./bin/bbcmd hobbitd_alert --dump-config 

Thanks
quoted from Asif Iqbal
-- 
Asif Iqbal
PGP Key: 0xE62693C5 KeyServer: pgp.mit.edu
"...it said: Install Windows XP or better...so I installed Solaris..."
list Bruce Lysik · Tue, 15 Feb 2005 18:48:43 -0800 ·
quoted from Asif Iqbal
You can also just post the relevant portion of the following command's
output to get a better diagnosis:

./bin/bbcmd hobbitd_alert --dump-config 
Sure.  It's very simple, really:

HOST=<snip list of about 100 hosts>
	SCRIPT /opt/bb/server/ext/email bruce_mail FORMAT=SCRIPT REPEAT=30 DURATION>6 RECOVERED

But the main issue is, I was under the impression a yellow alert would never trigger anything, because of how I've defined --alertcolors.

I've just received another yellow alert.  So something is up.

--
Bruce Z. Lysik  <user-4e63a10f8934@xymon.invalid>
Operations Engineer
list Henrik Størner · Wed, 16 Feb 2005 13:15:31 +0100 ·
quoted from Bruce Lysik
On Mon, Feb 14, 2005 at 01:28:28PM -0800, Bruce Lysik wrote:
So I installed RC2 this morning.  Later on, I noticed an alert email
 for a monitor going into yellow.  I had disabled this previously
 with --alertcolors=red,purple in hobbitlaunch.cfg.
[config from another mail]
quoted from Asif Iqbal
HOST=<snip list of about 100 hosts>
       SCRIPT /opt/bb/server/ext/email bruce_mail FORMAT=SCRIPT REPEAT=30 DURATION>6 RECOVERED
And here's the alert I just received:

im68:cpu yellow [-1]
yellow Mon Feb 14 13:13:56 PST 2005 up: 208 day(s), 1 users, 115 procs, load=529
The alert you show here looks like a recovery-notice (the "-1" I
assume is the acknowledgment cookie, and this value indicates that
there is no active alert).

If you look in the ~/data/ack/notifications.log file for these
notifications, you can tell if it's an alert message or a recovery
message by the number of columns in the file. E.g. in my log I have

Wed Feb 16 13:08:43 2005 www.sslug.dk.smtp (130.228.2.150) user-ce4a2c883f75@xymon.invalid 1108555723 725
Wed Feb 16 13:09:44 2005 www.sslug.dk.smtp (130.228.2.150) user-ce4a2c883f75@xymon.invalid 1108555784 725 61

The first one is the alert message, the second is the recovery
message. The recovery has an extra field "61", which is the duration
of the event (in seconds).
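
The column-count check described above can be sketched in a few lines of Python. This is only an illustration: the field layout (five timestamp tokens, host.service, IP in parentheses, recipient, two numeric fields, plus an optional trailing duration) is inferred from the two sample lines and nothing else, and the function name `classify_notification` is made up for this example.

```python
def classify_notification(line):
    """Classify one notifications.log line by its number of columns.

    Assumed layout (inferred from the two sample lines only):
      5 timestamp tokens, host.service, (ip), recipient, 2 numeric fields
      -> 10 columns for an alert
      -> 11 columns for a recovery (extra trailing duration in seconds)
    Returns a (kind, duration_seconds) tuple; duration is None for alerts.
    """
    fields = line.split()
    if len(fields) == 10:
        return ("alert", None)
    if len(fields) == 11 and fields[-1].isdigit():
        return ("recovery", int(fields[-1]))
    return ("unknown", None)


if __name__ == "__main__":
    samples = [
        "Wed Feb 16 13:08:43 2005 www.sslug.dk.smtp (130.228.2.150) "
        "user-ce4a2c883f75@xymon.invalid 1108555723 725",
        "Wed Feb 16 13:09:44 2005 www.sslug.dk.smtp (130.228.2.150) "
        "user-ce4a2c883f75@xymon.invalid 1108555784 725 61",
    ]
    for sample in samples:
        print(classify_notification(sample))
```

Lines with any other column count are reported as "unknown" rather than guessed at, since the real log may contain variations not shown here.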


Could you check the following in hobbitlaunch.cfg:

* The "hobbitd" command has "--alertcolors=red,purple --okcolors=green"
* The "hobbitd_alert" command has "--alertcolors=red,purple"

This setup should give you alerts when a status is red (or purple),
and recovery notices only when they go green (after being red or
purple).
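
As a rough mental model of that behaviour (not hobbitd's actual code), the decision can be written as a small Python function. The function name, its arguments, and the exact rules are assumptions made for illustration; the only facts taken from this thread are that --alertcolors selects the colors that raise alerts and --okcolors selects the colors that produce a recovery notice.

```python
def notification_for(was_alerting, new_color, alertcolors, okcolors):
    """Simplified model: which notification (if any) a status change produces.

    was_alerting: True if the status was previously in an alert color.
    Returns "alert", "recovery", or None.
    """
    if new_color in alertcolors:
        return "alert"
    if was_alerting and new_color in okcolors:
        return "recovery"
    # Neither an alert color nor an ok color: stay quiet.
    return None


if __name__ == "__main__":
    alertcolors = {"red", "purple"}
    okcolors = {"green"}
    # red -> yellow: no notification under this model
    print(notification_for(True, "yellow", alertcolors, okcolors))
    # red -> green: recovery notice
    print(notification_for(True, "green", alertcolors, okcolors))
    # Hypothetical setup where yellow counts as an ok color:
    # red -> yellow now produces a recovery notice that mentions "yellow"
    print(notification_for(True, "yellow", alertcolors, {"green", "yellow"}))
```

Under this model, a message mentioning "yellow" can only be a recovery notice, and only when yellow is treated as an ok color.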


Regards,
Henrik
list Bruce Lysik · Wed, 16 Feb 2005 10:01:40 -0800 ·
quoted from Henrik Størner
And here's the alert I just received:

im68:cpu yellow [-1]
yellow Mon Feb 14 13:13:56 PST 2005 up: 208 day(s), 1 users, 115 procs, load=529
The alert you show here looks like a recovery-notice (the "-1" I
assume is the acknowledgment cookie, and this value indicates that
there is no active alert).
Argh. That's bitten me before.  I have to figure out how to get the actual word 'recovered' in there to make it easier to understand at 4am.
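
One possible way to do that, sketched in Python: a small helper that a custom alert script could call on the subject line before sending. It relies on Henrik's assumption that the bracketed number is the acknowledgment cookie and that -1 means there is no active alert; the helper name `label_subject` and the "RECOVERED" prefix are invented for this example.

```python
import re

# Matches a trailing bracketed integer such as "[-1]" or "[725]".
_COOKIE = re.compile(r"\[(-?\d+)\]\s*$")


def label_subject(subject):
    """Prefix the subject with RECOVERED when the trailing cookie is -1.

    Assumption (from the thread, not verified): a cookie of -1 means the
    message is a recovery notice rather than an alert.
    """
    match = _COOKIE.search(subject)
    if match and int(match.group(1)) == -1:
        return "RECOVERED " + subject
    return subject


if __name__ == "__main__":
    print(label_subject("im68:cpu yellow [-1]"))
    print(label_subject("im68:cpu red [725]"))
```

Subjects without a trailing cookie, or with any other cookie value, are passed through unchanged.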
quoted from Henrik Størner
Could you check the following in hobbitlaunch.cfg:

* The "hobbitd" command has "--alertcolors=red,purple --okcolors=green"
* The "hobbitd_alert" command has "--alertcolors=red,purple"

This setup should give you alerts when a status is red (or purple),
and recovery notices only when they go green (after being red or
purple).
Hmm.  I didn't have --okcolors=green, so I added that to the hobbitd command. 

Just to clarify about the hobbitd_alert command, that's in the bbpage module, correct?  I didn't have --alertcolors there, but I've just added it:

[bbpage]
        ENVFILE /opt/bb/server/etc/hobbitserver.cfg
        NEEDS hobbitd
        CMD hobbitd_channel --channel=page   --log=$BBSERVERLOGS/page.log hobbitd_alert --alertcolors=red,purple

Thanks for your assistance.

--
Bruce Z. Lysik  <user-4e63a10f8934@xymon.invalid>
Operations Engineer
list Henrik Størner · Wed, 16 Feb 2005 19:34:21 +0100 ·
quoted from Bruce Lysik
On Wed, Feb 16, 2005 at 10:01:40AM -0800, Bruce Lysik wrote:

Could you check the following in hobbitlaunch.cfg:
* The "hobbitd" command has "--alertcolors=red,purple --okcolors=green"
* The "hobbitd_alert" command has "--alertcolors=red,purple"
This setup should give you alerts when a status is red (or purple),
and recovery notices only when they go green (after being red or
purple).
Hmm.  I didn't have --okcolors=green, so I added that to the hobbitd command. 
Just to clarify about the hobbitd_alert command, that's in the
bbpage module, correct?  
Yep.
quoted from Bruce Lysik
I didn't have --alertcolors there, but I've just added it:

[bbpage]
        ENVFILE /opt/bb/server/etc/hobbitserver.cfg
        NEEDS hobbitd
        CMD hobbitd_channel --channel=page   --log=$BBSERVERLOGS/page.log hobbitd_alert --alertcolors=red,purple
Looks ok now.


Henrik