Hobbit 4.0.4 released

19 messages in this thread

list Henrik Størner · Sun, 29 May 2005 13:19:02 +0200 ·
Version 4.0.4 is now available on SourceForge:
https://sourceforge.net/project/showfiles.php?group_id=128058&release_id=330776

This fixes the bugs reported since Hobbit 4.0.3 was released.

The only new stuff in this release is a new "beastat" utility to
collect performance statistics from a BEA Weblogic server via
SNMP. This replaces the old "bea-snmpstats" script, and should work
for sites with a large number of BEA instances per server. This is
still somewhat experimental, but I am interested to hear reports of
it working or not. (It's been tested with Weblogic ver. 8).

The Changelog is appended.


Regards,
Henrik


Changes from 4.0.3 -> 4.0.4

Bugfixes:
* "nodisp" tag re-implemented for hosts that should not
  appear on the Hobbit webpages.
* Enabling the "apache" data collection could crash bbtest-net
  if the /server-status page returned was larger than 32 KB.
* The "bbcmd" tool would not pass a --debug option to the command
  it was running.
* Nested macros in hobbit-alerts.cfg were not working.
* Using TABs in hobbit-alerts.cfg could confuse the alert module.
* hobbitd_alert's --test option now determines the page-name for
  a host automatically. It will also now accept a time-parameter
  to simulate how alerts are processed at specific time-of-day.
* Status messages from "dialup" hosts should not go purple.
* The "mailq" RRD handler would pick up the first number from 
  the "requests" line, which might not be the right number.
* Scheduling a "disable" did not work.
* The startup-script was modified to correctly handle stale PID
  files left over from an unclean shutdown of Hobbit. These are
  now removed, and startup will proceed normally.

Improvements:
* CGI tools now log error-output to dedicated logs in
  /var/log/hobbit/
* The "bea-snmpstats.sh" script has been removed, and replaced
  with an enhanced tool "beastat". This collects statistics via
  SNMP from BEA Weblogic servers, and reports these via "data"
  messages to Hobbit. The data collected are run-time data for
  the JRockIT JVM and thread/memory utilization data from each
  Weblogic server instance.
* The "ntpstat" RRD handler now accepts the raw output from
  "ntpq -c rv"
* The heartbeat-timeout for hobbitd has been increased to 60 
  seconds.
list Andy France · Tue, 31 May 2005 09:43:40 +1200 ·

Henrik Stoerner wrote on 29/05/2005 23:19:02:
Version 4.0.4 is now available on SourceForge:
https://sourceforge.net/project/showfiles.php?group_id=128058&release_id=330776
This fixes the bugs reported since Hobbit 4.0.3 was released.
Changes from 4.0.3 -> 4.0.4
Bugfixes:
quoted from Henrik Størner
* Nested macros in hobbit-alerts.cfg were not working.
* Using TABs in hobbit-alerts.cfg could confuse the alert module.
* hobbitd_alert's --test option now determines the page-name for
a host automatically. It will also now accept a time-parameter
to simulate how alerts are processed at specific time-of-day.
Since updating to 4.0.4, I have had a couple of "reds" which have not
generated all of my alerts.

I have converted my hobbit-alerts.cfg file to use macros, and
hobbitd_alert --test correctly expands all of the results, but only the
emails were sent; the scripts were not executed.

Relevant pieces from hobbit-alerts.cfg are:

  $MAIL_ANDY_STD=MAIL andy REPEAT=60m RECOVERED NOTICE
  $MAIL_GAYNA_STD=MAIL gayna REPEAT=60m RECOVERED
  $MAIL_INFORMIX_INF=MAIL informix REPEAT=60m RECOVERED SERVICE=drstatus,informix
  $PAGE_SUPPORT_STD=SCRIPT /export/home/hobbit/bin/qpage support-pg REPEAT=60m RECOVERED FORMAT=SMS COLOR=red
  $ETXT_ANDY_STD=SCRIPT /export/home/hobbit/bin/etxt andy-ph RECOVERED FORMAT=SMS COLOR=red

  HOST=km-akl5,unakl5
        $MAIL_ANDY_STD
        $MAIL_GAYNA_STD
        $MAIL_INFORMIX_INF
        $PAGE_SUPPORT_STD
        $ETXT_ANDY_STD

And the test output shows:

  bash-3.00$ ~/server/bin/bbcmd hobbitd_alert --test unakl5 cpu | grep
alert
  2005-05-31 09:30:15 Using default environment file
/export/home/hobbit/server/etc/hobbitserver.cfg
  00024460 2005-05-31 09:30:15 send_alert unakl5:cpu state Paging
  00024460 2005-05-31 09:30:15 Mail alert with command '/opt/csw/bin/nail
-s "Hobbit [12345] unakl5:cpu CRITICAL (RED)" andy'
  00024460 2005-05-31 09:30:15 Mail alert with command '/opt/csw/bin/nail
-s "Hobbit [12345] unakl5:cpu CRITICAL (RED)" gayna'
  00024460 2005-05-31 09:30:15 Script alert with command
'/export/home/hobbit/bin/qpage' and recipient support-pg
  00024460 2005-05-31 09:30:15 Script alert with command
'/export/home/hobbit/bin/etxt' and recipient andy-ph

But, as mentioned, the qpage and etxt scripts did not get executed.

I can try rolling back to my previous patched 4.0.3 version if required -
but would obviously prefer to see if there is an issue with 4.0.4.

Thanks and regards,
Andy.
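
To cross-check what the macro lines in the message above expand to, independently of hobbitd_alert --test, a small stand-alone expander can be sketched in Python. This is only an illustration of the $NAME=value convention visible in the quoted hobbit-alerts.cfg excerpt, not Hobbit's actual parser; the function names expand_macros and _substitute are hypothetical, and it assumes each macro definition sits on a single line.

```python
import re

# Assumed syntax, inferred from the excerpt: "$NAME=value" defines a macro,
# and "$NAME" anywhere else in a line is replaced by its value.
MACRO_DEF = re.compile(r"^\s*\$([A-Za-z0-9_]+)=(.*)$")
MACRO_USE = re.compile(r"\$([A-Za-z0-9_]+)")


def _substitute(text, macros):
    """Replace every known $NAME in text; unknown names are left untouched."""
    return MACRO_USE.sub(lambda m: macros.get(m.group(1), m.group(0)), text)


def expand_macros(lines):
    """Return the non-definition lines with all macros expanded.

    Definitions are expanded at the moment they are read, so a macro that
    refers to an earlier macro (a nested macro) resolves as well.
    """
    macros = {}
    expanded = []
    for line in lines:
        definition = MACRO_DEF.match(line)
        if definition:
            name, value = definition.groups()
            macros[name] = _substitute(value.strip(), macros)
            continue
        expanded.append(_substitute(line, macros))
    return expanded
```

Feeding it the definition of $MAIL_ANDY_STD followed by the HOST block yields the HOST line unchanged and the recipient line as "MAIL andy REPEAT=60m RECOVERED NOTICE", which can then be compared by eye with the --test output.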

list Henrik Størner · Tue, 31 May 2005 14:16:41 +0200 ·
quoted from Andy France
On Tue, May 31, 2005 at 09:43:40AM +1200, Andy France wrote:
Since updating to 4.0.4, I have had a couple of "reds" which have not
generated all of my alerts.
Could you check the notifications.log file for any mention of these
alerts being sent out by Hobbit, and the page.log file for any errors
from your scripts ? Both should be in the /var/log/hobbit/ directory.


Regards,
Henrik
list Peter Welter · Tue, 16 Aug 2005 16:03:06 +0200 ·
Hello Henrik,

Although it *has* worked before, our Hobbit 4.0.4 server now has a
problem: it no longer runs the external alert script. When running
hobbitd_alert it says it would run the script, but in fact it no
longer does.

Hobbit has been installed from the sources, running "rpmbuild
--rebuild hobbit-4.0.4-1.src.rpm". It did run the external scripts, so
what has changed? We applied the SLES9 patches during our maintenance
window. Come to think of it, it has not reported since then. So, is
there a library missing? Any suggestions?

Regards,

Peter

orwell # ldd /usr/lib/hobbit/server/bin/hobbitd_alert
        linux-gate.so.1 =>  (0xffffe000)
        libpcre.so.0 => /usr/lib/libpcre.so.0 (0x4001d000)
        libc.so.6 => /lib/tls/libc.so.6 (0x40029000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)


hobbit at orwell:~> server/bin/hobbitd_alert --test orwell disk 30 red
00012428 2005-08-16 15:46:11 send_alert orwell:disk state Paging
00012428 2005-08-16 15:46:11 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 183
00012428 2005-08-16 15:46:11 Failed 'HOST=%(nagger) EXSERVICE=http'
(hostname not in include list)
00012428 2005-08-16 15:46:11 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 189
00012428 2005-08-16 15:46:11 *** Match with 'HOST=%(orwell)' ***
00012428 2005-08-16 15:46:11 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 190
00012428 2005-08-16 15:46:11 *** Match with '$UNIXDAG' ***
00012428 2005-08-16 15:46:11 Mail alert with command 'mail -s "Hobbit
[12345] orwell:disk CRITICAL (RED)" email at adress_removed.nl'
00012428 2005-08-16 15:46:11 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 192
00012428 2005-08-16 15:46:11 *** Match with '$UNIXNACHT' ***
00012428 2005-08-16 15:46:11 Mail alert with command 'mail -s "Hobbit
[12345] orwell:disk CRITICAL (RED)" email at iaddress_removed.nl'
00012428 2005-08-16 15:46:11 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 193
00012428 2005-08-16 15:46:11 *** Match with '$UNIXSEMAFOON_BEHEER' ***
00012428 2005-08-16 15:46:11 Script alert with command
'/usr/local/bb/consigne.ksh' and recipient 006XXXXXXXX


list Henrik Størner · Tue, 16 Aug 2005 16:24:13 +0200 ·
quoted from Peter Welter
On Tue, Aug 16, 2005 at 04:03:06PM +0200, Peter Welter wrote:
Although it *has* worked before, our Hobbit server 4.0.4 has a problem
now whereas it is not running external alert-script anymore?! When
running hobbitd_alert it says it would run the script where in fact,
it does not; again anymore?!
Anything unusual in the /var/log/hobbit/page.log file ?
quoted from Peter Welter
Hobbit has been installed using the sources, running "rpmbuild
--rebuild hobbit-4.0.4-1.src.rpm". It did run the external scripts, so
what has changed? We applied during our maintenance window the SLES9
patches. It has not reported since then, when I think of it. So, is
there a library missing? Any suggestions?

'/usr/local/bb/consigne.ksh' and recipient 006XXXXXXXX
What happens if you login as the hobbit user and run the script like
this:

BBCOLORLEVEL=red \
BBALPHAMSG="Just a test" \
ACKCODE="12345" \
RCPT="006xxxxxx" \
BBHOSTNAME="some.host.name" \
MACHIP="10.0.0.1" \
BBSVCNAME="conn" \
BBSVCNUM="300" \
BBHOSTSVC="some.host.name.conn" \
BBHOSTSVCCOMMAS="some,host,name.conn" \
BBNUMERIC="30001000000000112345" \
/usr/local/bb/consigne.ksh 

That's basically what Hobbit does when running the script.


Henrik
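
The manual test in the message above can also be scripted, which makes it easier to repeat after each configuration change. The following Python sketch runs a command with the same environment variables and sample values that Henrik lists; the helper run_alert_script and the SAMPLE_ALERT_ENV dictionary are hypothetical names, and the sketch only imitates the environment described in the message, not how hobbitd_alert launches scripts internally.

```python
import os
import subprocess
import sys

# Variable names and sample values are copied from the message above.
# The helper below is an illustrative assumption, not part of Hobbit.
SAMPLE_ALERT_ENV = {
    "BBCOLORLEVEL": "red",
    "BBALPHAMSG": "Just a test",
    "ACKCODE": "12345",
    "RCPT": "006xxxxxx",
    "BBHOSTNAME": "some.host.name",
    "MACHIP": "10.0.0.1",
    "BBSVCNAME": "conn",
    "BBSVCNUM": "300",
    "BBHOSTSVC": "some.host.name.conn",
    "BBHOSTSVCCOMMAS": "some,host,name.conn",
    "BBNUMERIC": "30001000000000112345",
}


def run_alert_script(command, overrides=None):
    """Run command (a list of arguments) with the sample alert environment.

    Returns the CompletedProcess so the caller can inspect the exit code
    and anything the script wrote to stdout or stderr.
    """
    env = dict(os.environ)        # keep PATH etc. of the current user
    env.update(SAMPLE_ALERT_ENV)
    env.update(overrides or {})   # e.g. a real recipient in RCPT
    return subprocess.run(command, env=env, capture_output=True, text=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_file.py /path/to/alert-script
    result = run_alert_script(sys.argv[1:])
    print("exit code:", result.returncode)
    print(result.stdout + result.stderr)
```

Run as the hobbit user, a non-zero exit code or output on stderr would point at the script or its environment rather than at the alert rules.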
list Peter Welter · Tue, 16 Aug 2005 16:44:47 +0200 ·
Anything unusual in the /var/log/hobbit/page.log file ?
hobbit at orwell:~> more /var/log/hobbit/page.log
2005-08-16 14:34:43 Tried to down BOARDBUSY: Invalid argument
quoted from Henrik Størner
What happens if you login as the hobbit user and run the script like
this:

That's basically what Hobbit does when running the script.
I ran this script, and, yes the buzzer rang.

Thanks,
Peter
list Peter Welter · Wed, 17 Aug 2005 04:56:59 +0200 ·
Hello Henrik,

Since I'm totally flabbergasted of Hobbit not running an external
script anymore, there must be a simple explanation for it and I'm sure
I'll have a few laughs afterwards :-/

Since Hobbit is very important to us, and I don't want to rush into
things (updating the entire server from 4.0.4 to the newest version),
I will first try to make a hobbit-alerts.cfg that is as small and
simple as possible, containing only the stuff needed: a test alert
config with one external script that will be run in a yellow condition
and that will send an email. This should be a simple dummy test
that will check the entire Hobbit alert setup just to be sure.

Will let you know the results asap!

Regards,

Peter
list Henrik Størner · Wed, 17 Aug 2005 07:36:52 +0200 ·
quoted from Peter Welter
On Wed, Aug 17, 2005 at 04:56:59AM +0200, Peter Welter wrote:
Hello Henrik,

Since I'm totally flabbergasted of Hobbit not running an external
script anymore, there must be a simple explanation for it and I'm sure
I'll have a few laughs afterwards :-/

Since Hobbit is very important to us, and I don't wanna rush into
things (updating the entire server from 4.0.4 to the newest version),
I will first try to make a hobbit-alert.cfg as small and simple as
possible containing only the stuff needed. A test-alert-config, which
contains one external script that will be run in a yellow condition
and this script will sent an email. This should be a simple dummy-test
that will check the entire Hobbit-alert-setup just to be sure.
You might want to add "--trace=/tmp/alerttrace.log" to the hobbitd_alert
command in hobbitlaunch.cfg. That will give you a closer watch on how
each alert is handled by the alert module.

Do the missing alerts show up in the notifications.log file ?


Regards,
Henrik
list Peter Welter · Wed, 17 Aug 2005 08:01:13 +0200 ·
2005/8/17, Henrik Stoerner <user-ce4a2c883f75@xymon.invalid>:
quoted from Peter Welter
On Wed, Aug 17, 2005 at 04:56:59AM +0200, Peter Welter wrote:
You might want to add "--trace=/tmp/alerttrace.log" to the hobbitd_alert
command in hobbitlaunch.cfg. That will give you a closer watch on how
each alert is handled by the alert module.
Thanks, I will do so now.
Do the missing alerts show up in the notifications.log file ?
No, unfortunately.

I'll keep you posted.
list Peter Welter · Wed, 17 Aug 2005 09:54:27 +0200 ·
Status update:

After reducing the hobbit-alerts.cfg to a minimum and enabling the trace
facility, it has become clear to me that after restarting Hobbit, the
downtime for a service is completely recalculated. It finds a match
for a service which has been down for 1 hour and 17 minutes and it says:

00003590 2005-08-17 09:46:33 Matching host:service:page
'burad12:raid:DNO/SAPEPROC' against rule line 196
00003590 2005-08-17 09:46:33 Failed '$UNIXDAG' (min. duration 0<360)
00003590 2005-08-17 09:46:33 Matching host:service:page
'burad12:raid:DNO/SAPEPROC' against rule line 197
00003590 2005-08-17 09:46:33 Failed '$UNIXTEST' (min. duration 0<1800)
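
The numbers in the trace lines above (0<360 and 0<1800) look like an elapsed time in seconds compared against DURATION thresholds of 6 and 30 minutes. A minimal Python sketch of that comparison, under stated assumptions, shows why a duration that restarts from zero can never satisfy the rule. The function names are hypothetical, the "h" and "d" suffixes are an assumption (only a bare number of minutes and the "m" suffix appear in this thread), and this is not Hobbit's own code.

```python
import re

# Seconds per unit. A bare number is treated as minutes, as in "DURATION>10".
# The "h" and "d" suffixes are an assumption for illustration.
UNIT_SECONDS = {"": 60, "m": 60, "h": 3600, "d": 86400}


def duration_to_seconds(spec):
    """Convert a duration such as '6m', '30m' or '10' into seconds."""
    match = re.fullmatch(r"(\d+)([mhd]?)", spec)
    if match is None:
        raise ValueError("unsupported duration: %r" % spec)
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def duration_rule_matches(event_start, now, spec):
    """True if the event has lasted longer than the DURATION threshold.

    event_start and now are timestamps in seconds.
    """
    return (now - event_start) > duration_to_seconds(spec)
```

With event_start set to the time of the real status change, a service that has been down for 77 minutes passes a 6 minute threshold. With event_start set to the moment of the restart, the elapsed time is 0 and the rule fails, matching the "min. duration 0<360" lines in the trace.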

Hmmm... I restart Hobbit now and then, e.g. because 'hobbit.sh
rotate' does not work on my installation and the rotatelogs for Linux
moves the notifications.log to notifications.log.1, which keeps being
used until Hobbit is restarted.

So, monitoring these logs seems to clarify things for me... Now let's
narrow down the point where a script is being called. To be continued...

list Henrik Størner · Wed, 17 Aug 2005 10:07:37 +0200 ·
quoted from Peter Welter
On Wed, Aug 17, 2005 at 09:54:27AM +0200, Peter Welter wrote:
Status update:

After adapting the hobbit-alert.cfg to a minimum, enabling the trace
facility, it becomes clear to me that after restarting Hobbit, the
downtime for a service is completely recalculated. It finds a match
for a service which is down for 1 hour and 17 minutes and it says:
It should pick up the old duration from the checkpoint file. What's
your hobbitd and hobbitd_alert commands in hobbitlaunch.cfg ?


Henrik
list Peter Welter · Wed, 17 Aug 2005 10:37:14 +0200 ·
quoted from Henrik Størner
It should pick up the old duration from the checkpoint file. What's
your hobbitd and hobbitd_alert commands in hobbitlaunch.cfg ?
# This is the main Hobbit daemon. You cannot live without this one.
[hobbitd]
        HEARTBEAT
        ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
        CMD hobbitd --pidfile=$BBSERVERLOGS/hobbitd.pid --restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk --checkpoint-interval=600 --log=$BBSERVERLOGS/hobbitd.log --admin-senders=127.0.0.1,$BBSERVERIP

[bbpage]
        ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
        NEEDS hobbitd
        CMD hobbitd_channel --channel=page --log=$BBSERVERLOGS/page.log hobbitd_alert --trace=/tmp/alerttrace.log

The directory /usr/lib/hobbit/server/tmp/ contains:

-rw-r--r--  1 hobbit hobbit 318947 2005-08-17 10:32 hobbitd.chk

Regards,
Peter
list Peter Welter · Wed, 17 Aug 2005 14:09:55 +0200 ·
Henrik,

Status update:

I found out that using the same settings for sending an email and
for executing the consigne script worked out fine; see alerttrace.txt:

00007902 2005-08-17 11:02:50 *** Match with '$UNIXDAG' ***
00007902 2005-08-17 11:02:50 Mail alert with command 'mail -s "Hobbit
[329697] burad12:raid CRITICAL (RED)"
user-23de35574831@xymon.invalid'
00007902 2005-08-17 11:02:50 Matching host:service:page
'burad12:raid:DNO/SAPEPROC' against rule line 197
00007902 2005-08-17 11:02:50 *** Match with '$UNIXTEST' ***
00007902 2005-08-17 11:02:50 Script alert with command
'/usr/local/bb/consigne.ksh' and recipient 00665022245
00007902 2005-08-17 11:03:13 send_alert lucifer:disk state Paging
00007902 2005-08-17 11:03:13 Matching host:service:page
'lucifer:disk:DNO/UB' against rule line 195
00007902 2005-08-17 11:03:13 Failed
'HOST=%(burad12|burad14|burad11|burad15)' (hostname not in include
list)
00007597 2005-08-17 11:03:21 @@page igrsxc002:msgs:WINDOWS=red
00007597 2005-08-17 11:03:21 state 1->1
00007597 2005-08-17 11:04:14 @@page igrsdm001:disk:WINDOWS=yellow
00007597 2005-08-17 11:04:14 state 1->1

However, since I had bumped up the duration to 30m before the script is
executed and (probably) restarted Hobbit several times (at intervals
shorter than 30 minutes), the script did not seem to execute yesterday
:-(

However it worked fine today, twice. For the moment I'll leave the
debug-file  /tmp/alerttrace.txt in hobbit-alerts.cfg; it sure comes in
handy!

Regards, Peter
list Peter Welter · Thu, 18 Aug 2005 15:23:15 +0200 ·
Hi Henrik,

Today with the alerttrace still on and, yes, yesterday the script was
executed correctly in a tiny test-config. The original config still
gives me problems. I checked for control characters in the
hobbit-alerts.cfg-file (vi -> set list), and nothing weird found.

Part of the hobbit-alerts.cfg

-some macro's:

### Enabled now and then for testing purposes.
###$UNIXTEST=MAIL user-de15be7e9d2b@xymon.invalid DURATION>6m TIME=W:0800:1730 REPEAT=1d RECOVERED COLOR=yellow,red,purple

$UNIXDAG=MAIL user-7e295ea068b5@xymon.invalid DURATION>6m TIME=W:0800:1730 REPEAT=1d RECOVERED

$UNIXNACHT=MAIL user-7e295ea068b5@xymon.invalid TIME=*:0000:2359 DURATION>30m REPEAT=1d SERVICE=!cpu,!msgs RECOVERED COLOR=!yellow

$UNIXSEMAFOON_BEHEER=SCRIPT /usr/local/bb/consigne.ksh 00765327285 FORMAT=SMS TIME=*:0000:2359 DURATION>30m REPEAT=60m SERVICE=!cpu,!msgs,!smtp,!bbgen,!bbtest,!hobbitd COLOR=!yellow

-A host not responding for $UNIXSEMAFOON_BEHEER while the yellow mail
$UNIXDAG has been sent:

HOST=%(orwell)
        $UNIXDAG
        $UNIXTEST
        $UNIXNACHT
        $UNIXSEMAFOON_BEHEER

The host does give me an email for a threshold exceeded (disk>95%) and
that can be seen in the trace (I only grepped the host specific
entries):

00013241 2005-08-18 10:04:45 *** Match with 'HOST=%(orwell)' ***
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 191
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 193
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 194
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 196
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 203
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 209
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 216
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 223
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 229
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 236
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 242
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 254
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 261
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 268
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 275
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 282
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 287
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 294
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 300
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 304
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 311
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 322
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 332
00013241 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 340
00013241 2005-08-18 10:04:45 Failed 'HOST=%(orwell)' (hostname not in
include list)
00015024 2005-08-18 10:04:45 send_alert orwell:disk state Paging
00015024 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 184
00015024 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 190
00015024 2005-08-18 10:04:45 *** Match with 'HOST=%(orwell)' ***
00015024 2005-08-18 10:04:45 Matching host:service:page
'orwell:disk:DNO/SBEHEER' against rule line 191
00015024 2005-08-18 10:04:45 Mail alert with command 'mail -s "Hobbit
[25437] orwell:disk CRITICAL (RED)" user-35877f7bd688@xymon.invalid'

But the next (expected) step can not be seen in the trace and it does not occur.

All this could be just a configuration issue, so I restored another
tiny config and restarted Hobbit, and that worked fine. So no problems
with the mail or script etc  :-]

So, now I did the following:
-I restored the hobbit-alert.cfg we must use.
-I uncommented my $UNIXTEST-macro to prevent empty lines in
HOST-sections in the hobbit-alert.cfg knowing that Hobbit can have
problems with 2 or more spaces (perhaps newlines too?)
-moved the $UNIXTEST-macro to the end of each HOST-section for times I
comment out the previous line ;-)
-Restarted Hobbit.
-Now the first alert is being sent as it should, but the alert
that should page after 30 minutes fails, and nothing about it
shows up in the logfile.

Regards,

Peter
list Peter Welter · Fri, 19 Aug 2005 11:42:15 +0200 ·
I've been digging through the Hobbit mailing list and found something
that might be applicable to this problem. First, the email
correspondence between you and Peter Murray:

[snip]
"On Tue, Jul 26, 2005 at 08:16:56AM -0400, Peter Murray wrote:
HOST=testhost.syr.edu RECOVERED
     MAIL user-b4e299ec46f5@xymon.invalid FORMAT=TEXT DURATION>10 REPEAT=20
     MAIL user-513d0343e7bf@xymon.invalid FORMAT=SMS DURATION>20 REPEAT=20

What happens is the first alert (FORMAT=TEXT) goes out at 10 minutes,
nothing at 20 minutes, both at 30 minutes, nothing at 40 minutes, both
at 50 minutes, and so on.
Confirmed that this is a bug in all current Hobbit versions. It will be
fixed in 4.1.2 - you can pick up the latest snapshot for a working
version."
[snip]

Second, from the Changes-file from 4.1.1 -> 4.1.2 (I run 4.0.4):

[snip]
"* When multiple recipients of an alert had different minimum
  duration and/or repeat-settings, they would mostly use only
  the settings for the first recipient."
[snip]


Can you confirm this?

Regards,
Peter
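
The alert pattern Peter Murray reports in the message above (first recipient at 10 minutes, nothing at 20, both at 30, nothing at 40, both at 50) can be reproduced with a toy model in which all recipients of a rule share one repeat timer taken from the first recipient. The Python sketch below is only such a model: the shared-timer mechanism is a guess that happens to be consistent with the changelog wording, the 10 minute evaluation ticks are an assumption, and none of this is taken from Hobbit's source.

```python
def simulate(recipients, ticks, shared_timer):
    """Return {tick: [names alerted at that tick]}.

    recipients   -- list of (name, min_duration, repeat), all in minutes
    ticks        -- elapsed minutes at which the alert is re-evaluated
    shared_timer -- True: one repeat timer for the whole rule, driven by
                    the first recipient (toy model of the reported bug);
                    False: every recipient keeps its own timer
    """
    next_allowed = {name: 0 for name, _, _ in recipients}
    shared_next = 0
    timeline = {}
    for t in ticks:
        fired = []
        for name, min_duration, repeat in recipients:
            gate = shared_next if shared_timer else next_allowed[name]
            # The thread reports the first mail at the 10 minute mark, so
            # the model treats the threshold as reached at that tick.
            if t >= min_duration and t >= gate:
                fired.append(name)
                next_allowed[name] = t + repeat
        if shared_timer and fired:
            # Toy assumption: the repeat interval of the first recipient
            # is applied to everyone.
            shared_next = t + recipients[0][2]
        timeline[t] = fired
    return timeline
```

For the two recipients in the quoted configuration, ("text", 10, 20) and ("sms", 20, 20), evaluated at 10, 20, ..., 60 minutes, the shared-timer mode gives the pattern described in the thread, while the per-recipient mode alternates: text at 10, 30, 50 and sms at 20, 40, 60.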
list Terry Rossi · 24 Aug 2005 23:57:16 GMT ·
On Wed, Aug 17, 2005 at 04:56:59AM +0200, Peter Welter wrote:
Hello Henrik,

Since I'm totally flabbergasted of Hobbit not running an external
script anymore, there must be a simple explanation for it and I'm sure
I'll have a few laughs afterwards :-/

Since Hobbit is very important to us, and I don't wanna rush into
things (updating the entire server from 4.0.4 to the newest version),
I will first try to make a hobbit-alert.cfg as small and simple as
possible containing only the stuff needed. A test-alert-config, which
contains one external script that will be run in a yellow condition
and this script will sent an email. This should be a simple dummy-test
that will check the entire Hobbit-alert-setup just to be sure.
You might want to add "--trace=/tmp/alerttrace.log" to the hobbitd_alert
command in hobbitlaunch.cfg. That will give you a closer watch on how
each alert is handled by the alert module.

Do the missing alerts show up in the notifications.log file ?


Regards,
Henrik
list Peter Welter · Thu, 25 Aug 2005 15:06:23 +0200 ·
Hello Henrik,

Glad to see you are back. Can you confirm my email from Fri, Aug 19,
2005 at 11:42 AM?

Regards,

Peter
list Henrik Størner · Sat, 27 Aug 2005 11:20:19 +0200 ·
quoted from Peter Welter
On Wed, Aug 17, 2005 at 10:37:14AM +0200, Peter Welter wrote:
It should pick up the old duration from the checkpoint file. What's
your hobbitd and hobbitd_alert commands in hobbitlaunch.cfg ?
[bbpage]
        ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
        NEEDS hobbitd
        CMD hobbitd_channel --channel=page --log=$BBSERVERLOGS/page.log hobbitd_alert --trace=/tmp/alerttrace.log

The directory /usr/lib/hobbit/server/tmp/ contains:

-rw-r--r--  1 hobbit hobbit 318947 2005-08-17 10:32 hobbitd.chk
OK, You're running without the alert-module checkpoint file. There are
two things we can do:

1) Add "--checkpoint-file=$BBTMP/alert.chk --checkpoint-interval=600" to
   the hobbitd_alert command in hobbitlaunch.cfg. That way it will
   remember all active alerts when you restart Hobbit.

2) When a new alert was first seen (also after a restart of Hobbit), the
   duration was reset to 0 - instead of using the information Hobbit
   already had about when the status change occurred. I've changed this
   in the code, so that it picks up the duration of the alert from the 
   timestamp we keep for when the last status change happened.


Regards,
Henrik
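
Applied to the [bbpage] section quoted in the message above, option 1 would presumably give a hobbitlaunch.cfg entry like the following. It simply combines the existing command with the two options Henrik names; where exactly the new options go on the hobbitd_alert command line is an assumption.

```
[bbpage]
        ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
        NEEDS hobbitd
        # Assumption: the checkpoint options may be placed before --trace
        CMD hobbitd_channel --channel=page --log=$BBSERVERLOGS/page.log hobbitd_alert --checkpoint-file=$BBTMP/alert.chk --checkpoint-interval=600 --trace=/tmp/alerttrace.log
```

Hobbit has to be restarted for the changed command to take effect, and the alert.chk file should then appear next to hobbitd.chk in the $BBTMP directory.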
list Peter Welter · Sat, 27 Aug 2005 12:10:08 +0200 ·
Hello Henrik,
quoted from Henrik Størner
two things we can do:

1) Add "--checkpoint-file=$BBTMP/alert.chk --checkpoint-interval=600" to
   the hobbitd_alert command in hobbitlaunch.cfg. That way it will
   remember all active alerts when you restart Hobbit.
I'll do that asap (coming monday). That will certainly resolve this issue.
quoted from Henrik Størner
2) When a new alert was first seen (also after a restart of Hobbit), the
   duration was reset to 0 - instead of using the information Hobbit
   already had about when the status change occurred. I've changed this
   in the code, so that it picks up the duration of the alert from the
   timestamp we keep for when the last status change happened.
OK, but that useful addition is for upcoming releases.

However, I think I found out why the entire problem showed up in the
first place. I had an alert-config that first mailed on an occurring
event and, if that was not dealt with properly, ran a pager script 20
minutes later. After an evening of applying (OS-)patches, a reboot
etc. it did not work anymore. Eventually I thought that it had to do
with an alert-config modification, resulting in this
email-conversation.

As suggested, I checked the alerttrace.log, but could not find a
reason why this problem happened (I changed the pager script to mail,
but with no result). It *did* work fine when *all* the alerts were
processed at the same time!

Exploring the mailinglist and Changes-file for each version, I think
it can be brought down to a known bug in Hobbit that is to be fixed in
4.1.2; see my mail from August 19th, 11:42.

Since we are running 4.0.4, I'm wondering what the wise thing to do is.
The workaround works fine for now (we are a 24*7 University), so I'm
thinking of waiting until 4.1.2 reaches stable status, since 4.1.1
does not fix this particular bug.

Regards, Peter