Xymon Mailing List Archive

Test goes purple randomly

7 messages in this thread

list Wayne Gemmell · Thu, 23 Oct 2008 09:34:26 +0200 ·
Hiya

I have a custom script that randomly goes purple for less than 5 seconds. Could this be because Hobbit is not getting a response within the interval in which it expects one (in this case 30 min)? I have a few time-consuming custom scripts, and my suspicion is that they don't all complete in the allotted time, so Hobbit assumes there is no response. Any input on this?

This is what the html log says.

Date 					Status 	Duration
Wed Oct 22 21:28:37 2008 	green 	11:57:35
Wed Oct 22 21:28:37 2008 	purple 	none


-- 
Regards
Wayne
list Samuel Cai · Thu, 23 Oct 2008 18:44:22 -0700 ·
This is due to the interval. If you change it to less than 30 minutes,
the purple status goes away.

Samuel Cai
list Samuel Cai · Thu, 23 Oct 2008 18:47:51 -0700 ·
I found an earlier email on this:

From: Ralph Mitchell [mailto:user-00a5e44c48c0@xymon.invalid] 
Sent: Tuesday, September 16, 2008 12:44 PM
To: user-ae9b8668bcde@xymon.invalid
Subject: Re: [hobbit] Tricky bug in Purple status determination

 
When a report comes in to Hobbit, the default "time to live" for the
report is 30 mins.  As long as another report comes in within that time,
the timer is reset.  If there's no report, that column goes purple.

If your test is reporting every 30 mins, there's a good chance it'll
exhibit the behaviour you describe.

What you should do is alter the test script to use the "status+LIFETIME"
format, where LIFETIME is the life span of the report, as described in
the bb man page, and make the lifetime a bit longer than the test
interval.

Ralph Mitchell
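
As an illustration of the "status+LIFETIME" advice above, here is a minimal
sketch that builds such a message string. The function name, the 5-minute
margin, and the example host and test names are hypothetical. It assumes the
usual convention that LIFETIME is given in minutes and that dots in the
hostname are written as commas; check the bb man page for your version.

```python
def build_status_message(host, test, color, text,
                         interval_minutes, margin_minutes=5):
    """Build a status message whose lifetime is a bit longer than the interval.

    Hypothetical helper: the margin is an assumption, not a Hobbit default.
    """
    # Lifetime in minutes, slightly longer than the test interval.
    lifetime = interval_minutes + margin_minutes
    # Assumed convention: dots in the hostname are replaced by commas.
    hostname = host.replace(".", ",")
    return f"status+{lifetime} {hostname}.{test} {color} {text}"


# A test that runs every 30 minutes reports with a 35-minute lifetime.
print(build_status_message("www.example.com", "mytest", "green", "all ok", 30))
```

The resulting string is what the test script would hand to the bb client
command in place of a plain "status ..." message.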


On Mon, Sep 15, 2008 at 9:43 PM, Samuel Cai
<user-ba507acc1d03@xymon.invalid> wrote:

Hi,

 
Recently we found a weird problem in the history of one of our tests:
there were a lot of purple statuses, with a duration of "none" or 1
second. The thing we were monitoring was running fine, and the problem
has been there since we started using Hobbit (more than half a year
ago), so that rules out an error in the monitored service itself.

This test is a script defined in hobbitlaunch.cfg on the Hobbit server,
and it runs every 30m.

I checked the log and saw that the purple status was set by hobbitd. I
then checked the hobbitd source code and found that it checks for purple
status every 30m (correct me if I'm wrong, since I only know a little
C). So my guess is that, because of timing, there were a few
milliseconds' difference between hobbitd's check and the script's
update, which resulted in a very short purple status.

 
After I changed the interval to 25m, the problem went away.
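
The timing race described above can be sketched with a small simulation. This
is a simplified model, not how hobbitd is implemented: it assumes a status
expires exactly `lifetime_s` seconds after it is received, and that each run
of the script reports a few seconds after its scheduled start. All names and
the delay values are made up for illustration.

```python
def purple_gaps(interval_s, lifetime_s, delays_s):
    """Return, for each pair of consecutive reports, how many seconds the
    status had already been expired when the next report arrived."""
    # Report i arrives at its scheduled start plus the script's runtime.
    arrivals = [i * interval_s + d for i, d in enumerate(delays_s)]
    gaps = []
    for prev, nxt in zip(arrivals, arrivals[1:]):
        expiry = prev + lifetime_s          # when the previous status times out
        gaps.append(max(0, nxt - expiry))   # > 0 means a purple window
    return gaps


delays = [2, 3, 1, 4]  # hypothetical script runtimes in seconds

# Interval equal to the 30-minute lifetime: short purple windows appear
# whenever a run is slower than the one before it.
print(purple_gaps(1800, 1800, delays))

# Interval shortened to 25 minutes: every report arrives before expiry.
print(purple_gaps(1500, 1800, delays))
```

In this model, any run that takes longer than the previous one leaves a gap
of a few seconds when the interval equals the lifetime, which matches the
"none" or 1 second purple entries in the history.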

 
Thanks,
Samuel Cai


list Wayne Gemmell · Fri, 24 Oct 2008 09:33:31 +0200 ·
Excellent, thanks. This also explains why the test was purple for an hour,
then green for an hour, when the test was set to run every hour.
-- 

Regards
Wayne
list Richard Finegold · Fri, 24 Oct 2008 19:05:16 -0700 ·
If one uses the default ratio of 5:30 (5 minute poll, 30 minute expire),
then LIFETIME should be more than "a bit longer than" the test interval,
assuming the philosophy that no fewer than 5 missed intervals lead to
purple is applied consistently. Hmm...

Ah, the bb manpage says "sligtly more than" (sic) "is a good idea".
This is contradicted by the LIFETIME default of 30. Two ways come to
mind to resolve this contradiction:
  * Change the default LIFETIME to 6 or 7 (hobbitd.c, handle_status,
validity).
  * Change the manpage's text from "sligtly more than" to "a multiple
of" (or something similar).
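
To make the two readings concrete, here is a small sketch that compares a
lifetime scaled by the default 30:5 ratio with one that is only slightly
longer than the interval. The helper names and the 5-minute margin are
hypothetical; only the 5 and 30 minute defaults come from this thread.

```python
DEFAULT_POLL_MIN = 5        # default poll interval quoted above
DEFAULT_LIFETIME_MIN = 30   # default status lifetime quoted above


def lifetime_by_ratio(interval_min):
    """Lifetime as a multiple of the interval, keeping the default 30:5 ratio."""
    ratio = DEFAULT_LIFETIME_MIN // DEFAULT_POLL_MIN
    return interval_min * ratio


def lifetime_by_margin(interval_min, margin_min=5):
    """Lifetime only slightly longer than the interval (margin is an assumption)."""
    return interval_min + margin_min


for interval in (5, 30, 60):
    print(interval, lifetime_by_ratio(interval), lifetime_by_margin(interval))
```

With the ratio reading, a 30-minute test would get a 180-minute lifetime;
with the "slightly more than" reading it would get something like 35 minutes.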

list Sanu Mathew · Sat, 25 Oct 2008 12:30:35 +0530 ·
Folks,

My Hobbit server has suddenly started showing the conn and ssh entries as
purple. Every time I restart the hobbit service on my Hobbit server,
everything looks green and correct for a few minutes, and then they go back
to purple. I have read the earlier emails in this thread, but can someone
tell me where to change the LIFETIME, and whether changing it would fix my
problem?

Any help at the earliest would be greatly appreciated...

Thanks,
Sanu
list Michael A. Price · Mon, 27 Oct 2008 05:57:36 -0400 ·
Is this also a fix for the issue of devmon tests turning purple in Hobbit?

Thanks, michael