Xymon Mailing List Archive

Paging & Notification Not Working

16 messages in this thread

James Wade · Thu, 12 Jul 2007 09:23:38 -0500
Hi,

I never received any answers on this one.
I had another problem today and no one was paged.

How can I manually test paging?

Thanks...James

Hello,

I've noticed a random problem in Hobbit that
I can't seem to track down. I thought perhaps
other folks had seen the same problem.

Occasionally, paging and notification seem to
stop for no reason, or not all of the groups in my file
get notified. The only way we know that
it has happened is that we stop receiving notifications,
i.e., something went wrong and nobody was notified.

As an example, a database server last night filled 
up a /u? partition 100%. We received no email notification
of the event.

The notifications.log indicates that messages stopped
around 9:21 p.m.; this was the last message
at the end of the notifications log:

Mon Jul  9 21:18:39 2007 server1.disk (192.168.10.2) user-0a82b7f6ef66@xymon.invalid
[129] 1184033919 100

The one above was the last message received. The DB group
was notified of the problem, but the unix group was not.

Also, we had other, minor problems with a couple of other systems
that we should have gotten notices for, but all paging seemed
to halt after the last message above.

Has anyone seen this? Any suggestions for tracking it down?
I've fixed it in the past by restarting Hobbit, after which
notification and paging start working again.

Thanks...James
Jason Altrincham Jones · Thu, 12 Jul 2007 15:30:00 +0100
Are you 100% sure that your hobbit-alerts file is configured correctly?

You can test your config with:

$BBHOME/bin/bbcmd hobbitd_alert --test <host> <test>

This should give you a list of everyone who would be notified.

Jason.
James Wade · Thu, 12 Jul 2007 09:58:08 -0500
Thanks Jason, I did a test of my config, and it looks fine.

This morning at about 7:45 an Oracle partition filled
up. That should have paged people via both pager and email.

The DB group received an email but not a page; the Unix group
received neither an email nor a page.

The notification log confirms that only the DB group was sent
a notification.

Here's a partial of the output:

00009958 2007-07-12 09:35:47 send_alert db101:disk state Paging
00009958 2007-07-12 09:35:47 Matching host:service:page 'db101:disk:Chicago' against rule line 121
00009958 2007-07-12 09:35:47 *** Match with 'HOST=*' ***
00009958 2007-07-12 09:35:47 Matching host:service:page 'db101:disk:Chicago' against rule line 122
00009958 2007-07-12 09:35:47 *** Match with 'MAIL user-8c39f9246549@xymon.invalid REPEAT=15 COLOR=RED DURATION<30 RECOVERED' ***
00009958 2007-07-12 09:35:47 Mail alert with command 'mailx -s "BB [12345] db101:disk CRITICAL (RED)" user-8c39f9246549@xymon.invalid'
00009958 2007-07-12 09:35:47 Matching host:service:page 'db101:disk:Chicago' against rule line 132
00009958 2007-07-12 09:35:47 *** Match with 'MAIL user-5c6e60637778@xymon.invalid REPEAT=15 COLOR=RED DURATION<30 RECOVERED' ***

I never received a page today.

Notification log:

Wed Jul 11 20:54:24 2007 system005.cpu (192.168.10.3) user-8c39f9246549@xymon.invalid [125] 1184205263 200
Thu Jul 12 07:29:08 2007 db101.disk (192.168.10.10) user-644d591309b5@xymon.invalid[129] 1184243348 100
Thu Jul 12 07:31:12 2007 db101.disk (192.168.10.10) user-644d591309b5@xymon.invalid[129] 1184243472 100

The notification log shows that only the DB group was notified;
nothing was sent to me, for example. Yet as the test output above shows,
I should have matched and been sent both a page and an email.
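
The comparison described above (who matched in the test output versus who
appears in the notification log) can be sketched as a small script. This is
only an illustration, not part of Hobbit: the function names and the
example.com addresses are made up, and the regular expressions assume the
line layouts seen in the excerpts in this message.

```python
import re

def matched_recipients(test_output):
    """Recipients named in "*** Match with 'MAIL ...' ***" lines of the test output."""
    return set(re.findall(r"Match with 'MAIL\s+(\S+)", test_output))

def notified_recipients(notification_log):
    """Recipients in notification log lines: the address just before '[number]'."""
    return set(re.findall(r"\)\s+(\S+?)\s*\[\d+\]", notification_log))

# Hypothetical sample data in the same layout as the excerpts above.
test_output = """\
*** Match with 'MAIL unix-admin@example.com REPEAT=15 COLOR=RED DURATION<30 RECOVERED' ***
*** Match with 'MAIL dba@example.com REPEAT=15 COLOR=RED DURATION<30 RECOVERED' ***
"""
notification_log = """\
Thu Jul 12 07:29:08 2007 db101.disk (192.168.10.10) dba@example.com[129] 1184243348 100
"""

# Recipients that matched a rule but never show up in the log.
missing = matched_recipients(test_output) - notified_recipients(notification_log)
print(sorted(missing))  # ['unix-admin@example.com']
```

A recipient that matches in the test output can still be suppressed at run
time by time-based conditions such as DURATION, which is what turned out to
matter later in this thread.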

James
Dirk Kastens · Wed, 18 Jul 2007 09:05:23 +0200
Hi,

Is there a fix for this problem in the meantime?
We have the same situation here. A filesystem on one of our mail servers ran full, but the admins didn't get an alert from Hobbit. The server is listed in our bb-hosts file on the "mail" page with an IP of "0.0.0.0". I found that only hosts with a real IP address get an alert. I found a workaround: if I add the "prefer" option to the host entry with the "0.0.0.0" address, the host gets an alert. But if there are several entries for the same host on different pages, only the first one with the "prefer" option is recognized by the alert function.

So, if we have the following bb-hosts file:

page linux
    123.456.78.9 my.mail.host # ...
page mail
    0.0.0.0 my.mail.host # noconn
page redhat
    0.0.0.0 my.mail.host # noconn prefer

and I define alerts for all three pages, the alert only works for page "linux" and page "redhat". The host on page "mail" is being ignored.

Please let me know if there is a solution for the problem.

-- 
Regards,

Dirk Kastens
Universitaet Osnabrueck, Rechenzentrum (Computer Center)
Albrechtstr. 28, 49069 Osnabrueck, Germany
Tel.: +XX-XXX-XXX-XXXX, FAX: -2470
Henrik Størner · Wed, 18 Jul 2007 09:26:47 +0200
On Wed, Jul 18, 2007 at 09:05:23AM +0200, Dirk Kastens wrote:
> is there a fix for the problem, meanwhile?
> We have the same situation, here. A filesystem on one of our mail
> servers ran full, but the admins didn't get an alert from hobbit.

James and I managed to track down the cause of his problems, and 
it turned out to be a configuration problem - specifically, the
way the DURATION parameter in hobbit-alerts.cfg works.

James had used the DURATION setting to limit the number of alerts
sent, by using "DURATION<30" to only send alerts for 30 minutes. 
Also, there was one group of people receiving yellow alerts, and
another group receiving red alerts.  His setup was like this:

    HOST=foo SERVICE=disk
        MAIL user-ac82b13207c8@xymon.invalid COLOR=yellow DURATION<30
        MAIL user-8e0a50b26a06@xymon.invalid COLOR=red DURATION<30

If the "disk" status went yellow at 6 PM and red at 7 PM, then
user-8e0a50b26a06@xymon.invalid didn't receive any notification.

That's because the DURATION value is counted from the start of the
event, which begins when the status goes yellow. So by 7 PM the 
event has a duration of 60 minutes, which is above the 30-minute
threshold - so the red alert was suppressed.
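
The timing described above can be modelled in a few lines. This is a toy
sketch of the rule semantics as explained in this message, not Hobbit's
actual code; the Rule class, the function name and the recipient labels are
invented for the illustration.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    recipient: str      # hypothetical label standing in for a MAIL address
    color: str          # COLOR=... on the rule
    max_duration: int   # N in "DURATION<N", in minutes

def recipients_to_alert(rules, color, event_start_min, now_min):
    """Return the recipients whose rule matches the color and the event age."""
    # The age is counted from the start of the event (first non-green status),
    # not from the moment the status changed to the current color.
    age = now_min - event_start_min
    return [r.recipient for r in rules if r.color == color and age < r.max_duration]

rules = [
    Rule("yellow-group", "yellow", 30),
    Rule("red-group", "red", 30),
]

event_start = 18 * 60  # status goes yellow at 6 PM, which starts the event

# 6:05 PM, status yellow: the event is 5 minutes old, so the yellow rule fires.
print(recipients_to_alert(rules, "yellow", event_start, event_start + 5))  # ['yellow-group']

# 7:00 PM, status red: the event is 60 minutes old, so DURATION<30 suppresses it.
print(recipients_to_alert(rules, "red", event_start, event_start + 60))    # []
```

In this model the red rule only fires if the status turns red within the
first 30 minutes of the event, so a DURATION limit on a rule for a color
that is usually reached late in an event needs to be chosen with that in
mind.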

> The server is listed in our bb-hosts file on the "mail" page with an IP of
> "0.0.0.0". I found out that only the hosts with a real IP address will
> get an alert.
The IP in bb-hosts has nothing to do with alerts.

> page linux
>    123.456.78.9 my.mail.host # ...
> page mail
>    0.0.0.0 my.mail.host # noconn
> page redhat
>    0.0.0.0 my.mail.host # noconn prefer
>
> and I define alerts for all three pages, the alert only works for page
> "linux" and page "redhat". The host on page "mail" is being ignored.
This is a different problem. Each host has a "primary" page - only one!
It's the first page that defines it (that would be "linux"), except if
you use the "prefer" keyword then it is of course the page that has the
preferred definition of the host ("redhat", in your example). If you're
unsure of what page Hobbit uses as the primary page, then check it on
the "info" status page.
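
The primary-page rule described above can be sketched as follows. This is a
simplified illustration and not Hobbit's parser: it only understands "page"
lines and host lines, the function name is made up, and the "first prefer
entry wins" behaviour is an assumption taken from Dirk's observation.

```python
def primary_pages(bb_hosts_text):
    """Map each hostname to its primary page in a simplified bb-hosts file."""
    primary = {}       # hostname -> primary page
    preferred = set()  # hostnames whose primary page was fixed by "prefer"
    page = None
    for raw in bb_hosts_text.splitlines():
        line = raw.strip()
        if line.startswith("page "):
            page = line.split()[1]
            continue
        entry, _, tags = line.partition("#")
        fields = entry.split()
        # Host lines look like "<ip> <hostname>"; skip anything else.
        if len(fields) < 2 or not fields[0][0].isdigit():
            continue
        host = fields[1]
        if "prefer" in tags.split():
            if host not in preferred:  # assumption: first "prefer" entry wins
                primary[host] = page
                preferred.add(host)
        elif host not in primary:      # otherwise the first definition wins
            primary[host] = page
    return primary

# The example from this thread.
sample = """\
page linux
    123.456.78.9 my.mail.host # ...
page mail
    0.0.0.0 my.mail.host # noconn
page redhat
    0.0.0.0 my.mail.host # noconn prefer
"""
print(primary_pages(sample))  # {'my.mail.host': 'redhat'}
```

Without the "prefer" tag the same input gives "linux". In 4.2.0 a PAGE= rule
is matched against that single page only, so it is worth confirming the
result against the "info" status page.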


Regards,
Henrik
Dirk Kastens · Wed, 18 Jul 2007 09:51:45 +0200
Henrik Stoerner wrote:
>> page linux
>>    123.456.78.9 my.mail.host # ...
>> page mail
>>    0.0.0.0 my.mail.host # noconn
>> page redhat
>>    0.0.0.0 my.mail.host # noconn prefer
>>
>> and I define alerts for all three pages, the alert only works for page
>> "linux" and page "redhat". The host on page "mail" is being ignored.
>
> This is a different problem. Each host has a "primary" page - only one!
> It's the first page that defines it (that would be "linux"), except if
> you use the "prefer" keyword then it is of course the page that has the
> preferred definition of the host ("redhat", in your example). If you're
> unsure of what page Hobbit uses as the primary page, then check it on
> the "info" status page.

Yes, I know. As I wrote, this is only a workaround for the missing
alert. When I leave out the "prefer" statement in the above example, the
alert only works for the host on the "linux" page.
I tested my configuration with the (patched) hobbitd_alert program:

$ bbcmd hobbitd_alert --test my.mail.host --color=red disk

Matching host:service:page 'my.mail.host:--color=red:linux'
*** Match with 'PAGE=linux' ***
Matching host:service:page 'my.mail.host:--color=red:linux'
Failed 'PAGE=mail' (pagename not in include list)

When I add the "prefer" option to the host on the mail page, I get

Matching host:service:page 'my.mail.host:--color=red:mail'
Failed 'PAGE=linux' (pagename not in include list)
Matching host:service:page 'my.mail.host:--color=red:mail'
*** Match with 'PAGE=mail' ***

So, what can I do to get the alert function working for BOTH pages?

-- 
Regards,

Dirk Kastens
Henrik Størner · Wed, 18 Jul 2007 13:04:38 +0200
On Wed, Jul 18, 2007 at 09:51:45AM +0200, Dirk Kastens wrote:
> So, what can I do to get the alert function working for BOTH pages?

This functionality is in the current snapshot, but not in the
4.2.0 version.

Snapshots are available at http://www.hswn.dk/beta/


Regards,
Henrik
Dirk Kastens · Wed, 18 Jul 2007 13:35:24 +0200
Henrik Stoerner wrote:
> This functionality is in the current snapshot, but not in the
> 4.2.0 version.

Aah, thanks. That's what I wanted to know :-)

-- 
Kind regards,

Dirk Kastens
Dirk Kastens · Wed, 18 Jul 2007 15:20:30 +0200
Dirk Kastens wrote:
>> This functionality is in the current snapshot, but not in the
>> 4.2.0 version.
> Aah, thanks. That's what I wanted to know :-)

I can confirm that the alerting now works as expected.
But the information on the info pages is wrong. The "Alerting:" section only shows the correct info for the preferred host entries. When I define an alert for a page with a secondary host entry (IP 0.0.0.0) the info page of the host says "No alerts defined".

-- 
Regards,

Dirk Kastens
James Wade · Wed, 18 Jul 2007 08:40:40 -0500
I'm not that familiar with RRD. I've loaded
it and set it up and it works just fine.

However, some of the developers here have asked me
if they could get a dump of the data instead
of the graphs.

I've used the rrdtool to do a dump as well as
using xml format, but the output isn't in a
format the developers want.

I'm wondering if anyone has written a script,
or even an HTML-based CGI script, to output RRD
data in a more readable format, perhaps labeled
with what the data represents: cpu, memory, etc.

Thanks....James
Henrik Størner · Wed, 18 Jul 2007 15:56:09 +0200
On Wed, Jul 18, 2007 at 03:20:30PM +0200, Dirk Kastens wrote:
> Dirk Kastens wrote:
>>> This functionality is in the current snapshot, but not in the
>>> 4.2.0 version.
>> Aah, thanks. That's what I wanted to know :-)
> I can confirm that the alerting now works as expected.
> But the information on the info pages is wrong.

One small oversight on my part. Apply this patch on top of the snapshot.

Henrik
Attachments (1)
Dirk Kastens · Wed, 18 Jul 2007 17:16:58 +0200
Henrik Stoerner wrote:
> One small oversight on my part. Apply this patch on top of the snapshot.

Great! It works!
Thanks for your support.

-- 
Regards,

Dirk Kastens
Mario Andre · Thu, 19 Jul 2007 16:06:15 -0300
James,

I've modified the graph view from Stefan's adminscripts tools V0.2.0
(http://www.fh-augsburg.de/~henk/hobbit/) to dump RRD data. The code of the
two files is attached.

 Put rrdtocsv.sh and adminscripts_functions2.sh in your cgi-bin directory.
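
The attached scripts are not reproduced in this archive. As a rough
illustration of the general idea only (this is not rrdtocsv.sh), the text
printed by "rrdtool fetch" can be turned into CSV with a short script. The
function name and the data source name "la" in the sample are made up, and
the input layout (a header line of data source names followed by
"timestamp: value ..." rows) is assumed.

```python
import csv
import io

def fetch_output_to_csv(fetch_text):
    """Convert the text printed by `rrdtool fetch` into CSV text."""
    lines = [ln for ln in fetch_text.splitlines() if ln.strip()]
    names = lines[0].split()  # header line: data source names
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["timestamp"] + names)
    for ln in lines[1:]:
        # Each data row looks like "1184767200: 1.2000000000e+01 ..."
        stamp, _, values = ln.partition(":")
        writer.writerow([stamp.strip()] + values.split())
    return out.getvalue()

# Hypothetical sample in the assumed layout; "nan" marks an unknown value.
sample = """\
                      la

1184767200: 1.2000000000e+01
1184767500: nan
"""
print(fetch_output_to_csv(sample), end="")
```

The input would typically come from a command such as
"rrdtool fetch FILE.rrd AVERAGE", where FILE.rrd is a placeholder for one of
the RRD files Hobbit maintains.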

Regards,

Mario.
Attachments (1)
Galen Johnson · Thu, 19 Jul 2007 15:11:51 -0400
There is actually a V0.3.0b that has quite a few updates (primarily path changes); you may want to make sure your changes work with it as well. Stefan has been pretty receptive to updates.

 
=G=
Mario Andre · Thu, 19 Jul 2007 16:14:18 -0300
James,

I've modified the graph view from Stefan's adminscripts tools V0.2.0
(http://www.fh-augsburg.de/~henk/hobbit/) to dump RRD data.

Let me know if you want these files and I can send them to your email.

Regards,

Mario.


S Aiello · Fri, 20 Jul 2007 12:10:44 -0400
On Wednesday 18 July 2007 07:04, Henrik Stoerner wrote:
> On Wed, Jul 18, 2007 at 09:51:45AM +0200, Dirk Kastens wrote:
>> So, what can I do to get the alert function working for BOTH pages?
>
> This functionality is in the current snapshot, but not in the
> 4.2.0 version.
>
> Snapshots are available at http://www.hswn.dk/beta/

This sounds awesome. This was something that I was really looking forward to in Hobbit, but I was despondent when I found out that hobbitd_alert only saw a server on one page, regardless of aliases. So thank you!!

Right now I am on 4.2.0 with the all-in-one patch. I am just curious whether all these patches will be included in a new all-in-one patch or a new release. I am not trying to push for either, just looking for a loose ETA. I am starting my initial migration, and this feature would be a lifesaver.

Thanks,
 ~Steve