Xymon Mailing List Archive

Tricky bug in Purple status determination

3 messages in this thread

list Samuel Cai · Mon, 15 Sep 2008 19:43:37 -0700 ·
Hi,

 
Recently we found a weird problem in the history of one of our monitors:
there were a lot of purple statuses, each with a duration of "none" or
1 second. The thing we were monitoring was running fine, and this problem
has been there since we started using Hobbit (more than half a year ago),
so that rules out an error in the monitored thing itself.

This monitor is a script defined in hobbitlaunch.cfg on the Hobbit
server; it runs every 30m.

I checked the log: the purple status was set by hobbitd. I then checked
the hobbitd source code and found that it checks for purple status every
30m (correct me if I'm wrong, since I only know a little C). So my guess
is that, for some reason in the program, there is a difference of a few
milliseconds between hobbitd's check and the script's update, which
results in the very short purple durations.

 
After I changed the interval to 25m, the weird problem went away.

 
Thanks,

Samuel Cai
list Ralph Mitchell · Mon, 15 Sep 2008 23:43:45 -0500 ·
When a report comes in to Hobbit, the default "time to live" for the report
is 30 mins.  As long as another report comes in within that time, the timer
is reset.  If there's no report, that column goes purple.

If your test is reporting every 30 mins, there's a good chance it'll exhibit
the behaviour you describe.
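
To see why an interval equal to the lifetime tends to flap, here is a rough Python sketch. It is a simplified model and not hobbitd's actual logic: it assumes a status expires exactly `lifetime_s` seconds after the previous report arrived, and that each report arrives slightly late. The function name `purple_gaps` and the delay values are made up for illustration.

```python
def purple_gaps(interval_s, lifetime_s, delays_s):
    """Return the purple gap (in seconds) before each report after the first.

    Report k is scheduled at k * interval_s and arrives delays_s[k] seconds
    late. In this simplified model a status expires lifetime_s seconds after
    the previous report arrived.
    """
    arrivals = [k * interval_s + d for k, d in enumerate(delays_s)]
    gaps = []
    for prev, cur in zip(arrivals, arrivals[1:]):
        expires = prev + lifetime_s
        # A positive difference means the old status expired before the
        # next report arrived, so the column would be purple for that long.
        gaps.append(max(0.0, cur - expires))
    return gaps


# Hypothetical arrival delays (seconds) for four consecutive runs.
delays = [0.0, 1.0, 0.5, 2.0]

print(purple_gaps(1800, 1800, delays))  # 30m interval, 30m lifetime: [1.0, 0.0, 1.5]
print(purple_gaps(1500, 1800, delays))  # 25m interval, 30m lifetime: [0.0, 0.0, 0.0]
print(purple_gaps(1800, 2100, delays))  # 30m interval, 35m lifetime: [0.0, 0.0, 0.0]
```

With the interval equal to the lifetime, any run that arrives later than the previous one leaves a purple gap of a second or so. Either shortening the interval or lengthening the lifetime removes the gaps in this model.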

What you should do is alter the test script to use the "status+LIFETIME"
format, where LIFETIME is the life span of the report, as described in the
bb man page, and make the lifetime a bit longer than the test interval.
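
As an illustration of the "status+LIFETIME" format, here is a minimal Python sketch that only builds the message text. The host name, test name and the 35-minute lifetime are hypothetical; the sketch assumes the lifetime is given in minutes and that dots in the host name are written as commas, as the bb man page describes. Check the man page for your version before relying on it.

```python
def build_status_message(host, test, color, text, lifetime_minutes=None):
    """Build a Hobbit status message, optionally with a +LIFETIME suffix.

    Assumption: lifetime_minutes is a whole number of minutes, which is the
    default unit described in the bb man page.
    """
    if lifetime_minutes is None:
        command = "status"  # server default lifetime (30 minutes) applies
    else:
        command = f"status+{lifetime_minutes}"
    # Assumption: dots in the host name are replaced by commas so that the
    # last dot separates the host name from the test name.
    host_field = host.replace(".", ",")
    return f"{command} {host_field}.{test} {color} {text}"


# A test that runs every 30 minutes, reported with a 35-minute lifetime.
message = build_status_message(
    "www.example.com", "mytest", "green", "all OK", lifetime_minutes=35
)
print(message)  # status+35 www,example,com.mytest green all OK
```

The resulting string is what the test script would hand to the bb client in place of its current plain "status ..." message.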

Ralph Mitchell
list Samuel Cai · Mon, 15 Sep 2008 22:52:33 -0700 ·
Thanks! Your information is really helpful. I now understand this is not
a bug; it's a built-in feature of Hobbit and well documented in the man
page.

 
We haven't defined a LIFETIME, so we would rather use the INTERVAL to
control the status. I'll shorten the interval from 30m to 25m to avoid
this problem.

 
Samuel Cai