Xymon Mailing List Archive

How many times does xymonnet retry?

7 messages in this thread

list Glauber Ribeiro · Fri, 8 Jan 2016 16:13:21 +0000 ·
Hello, all,

I didn’t find this information in the man pages for xymonnet and xymonnet-again. How many times does xymonnet retry a failed test? Does it go red on the first failure (assuming default configuration), or only after all the retries failed?

Thanks,

glauber
list John Thurston · Fri, 08 Jan 2016 08:34:44 -0900 ·
On 1/8/2016 7:13 AM, Ribeiro, Glauber wrote:
Hello, all,

I didn’t find this information in the man pages for xymonnet and
xymonnet-again. How many times does xymonnet retry a failed test? Does
it go red on the first failure (assuming default configuration), or only
after all the retries failed?
It goes red when xymonnet detects the failure. The test is then assigned 
to xymonnet-again which executes it more frequently for a total of 30 
minutes. If it is still red, xymonnet-again quits hammering it.

From the man page for xymonnet-again:
Only tests whose first failure occurred within 30 minutes are included in the tests that are run by xymonnet-again.sh. The 30 minute limit is there to avoid hosts that are down for longer periods of time to bog down xymonnet-again.sh. You can change this limit with the "--frequenttestlimit=SECONDS" when you run xymonnet.
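
The time-window rule in that excerpt can be sketched in a few lines of Python. This is only an illustration of the idea, not xymonnet's actual code: the function name and the strict "less than" boundary are assumptions, and only the 1800-second default comes from the 30 minutes quoted above.

```python
from datetime import datetime, timedelta

# 30 minutes in seconds, per the man-page excerpt; adjustable in real life
# through --frequenttestlimit=SECONDS.
FREQUENT_TEST_LIMIT = 1800


def in_frequent_queue(first_failure, now, limit_seconds=FREQUENT_TEST_LIMIT):
    """Return True while a failing test is still young enough to be retried.

    Hypothetical helper: whether the real boundary is inclusive is not
    stated in the man page, so the strict comparison here is an assumption.
    """
    return (now - first_failure) < timedelta(seconds=limit_seconds)


now = datetime(2016, 1, 8, 12, 0, 0)
print(in_frequent_queue(now - timedelta(minutes=5), now))   # recent failure
print(in_frequent_queue(now - timedelta(minutes=45), now))  # long-dead host
```

A host that has been down for 45 minutes drops out of the frequent retries, which is the "quits hammering it" behaviour described above.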
-- 
    Do things because you should, not just because you can.

John Thurston    XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska
list Glauber Ribeiro · Fri, 8 Jan 2016 18:04:47 +0000 ·
Thanks, I got that, so there is no set number of repetitions? I.e. it will keep trying for 30 minutes?
list John Thurston · Fri, 08 Jan 2016 09:41:11 -0900 ·
On 1/8/2016 9:04 AM, Ribeiro, Glauber wrote:
Thanks, I got that, so there is no set number of repetitions? I.e. it will keep trying for 30 minutes?
I see no reference to the _number_ of retries, only to the _duration_ of the effort.

The number of retries will depend on how frequently the attempt is made and how long each attempt takes to fail. The first is probably controlled in code (and may be configurable at run time). The second is dependent on the protocol being tested, the behavior of the network, and the form of the failure.

An ICMP test, for example, may reliably fail and time out in 4 seconds.

An SSH test (also handled by xymonnet) may fail in 4 seconds when it can't initiate a TCP connection. It may also be able to linger on for several minutes if a TCP connection can be established but not maintained.

Is the number of retries significant in your business case?
list Glauber Ribeiro · Fri, 8 Jan 2016 19:59:45 +0000 ·
Q: Is the number of retries significant in your business case?

A: Not really, I was just trying to understand how this works to see if it would provide precedent for one of our custom tests, which we are adding retries to.


I think I have a good idea how the retries work now. When a test fails, xymonnet writes information to a text file.

Xymonnet-again is a simple script, which is kicked off once a minute, to look for that text file - if it's present, it feeds it into xymonnet. The file (frequenttests) is simply the command line options for the xymonnet run, including the names of the hosts that had failed tests (but not which tests failed).

So theoretically, things could be retried up to 30 times. 
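
The mechanism described above (a once-a-minute job that looks for the frequenttests file and, if it exists, passes its contents to xymonnet as command-line arguments) could be sketched like this. The real xymonnet-again.sh is a shell script; this Python version is only an illustration, and the function name, the file location and the sample file contents are all hypothetical.

```python
import shlex
import tempfile
from pathlib import Path


def build_retry_command(frequent_file, xymonnet="xymonnet"):
    """Return the argv for a retry run, or None when nothing is queued.

    Assumes the file holds xymonnet command-line arguments, including the
    names of the hosts with failed tests, as described in the thread.
    """
    path = Path(frequent_file)
    if not path.is_file():
        return None
    args = shlex.split(path.read_text())
    if not args:
        return None
    return [xymonnet, *args]


with tempfile.TemporaryDirectory() as tmp:
    queue = Path(tmp) / "frequenttests"
    print(build_retry_command(queue))   # no file yet, so nothing to retry
    # Made-up contents, for illustration only.
    queue.write_text("--timeout=5 www.example.com db.example.com\n")
    print(build_retry_command(queue))
```

A scheduler entry that fires once a minute would call something like this and run the resulting command whenever it is not None.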
list Japheth Cleaver · Fri, 8 Jan 2016 12:13:33 -0800 ·
This is correct. In some other monitoring systems (Nagios/Icinga come to
mind), there's a notion of a Hard Fail vs Soft Fail and the scheduling
system can run checks several times before a "Hard Fail" is recorded.

Because there's no discrete scheduling system (or dispatcher) within
xymon, it doesn't really have that same model, and the built-in tools like
xymonnet don't conceptualize it.

Fundamentally, you have any number of things testing at whatever
frequency, and with whatever decision process, they independently choose,
and xymond simply accepts the reports (and displays/handles them) as needed.

As xymonnet runs at intervals, each run is distinct. If it's
down/slow/hung/whatever, it's marked as such and is not tested again
during that execution.


If you add that together, though, it provides other options for
administrator-defined recurrence, such as the "xymonnet-again.sh" script,
as you've seen.


When we were migrating from a system that had been configured to retry 3
times before alerting, we realized that we had gained so much efficiency
by moving to xymon (shameless plug ;) ) that we could lower our
xymonnet interval greatly and just make sure that 3 entire runs would
complete before the "red" alert was sent (using the DURATION value in
alerts.cfg(5)).
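
As a rough worked example of that approach, the sketch below turns a run interval and a number of runs into a DURATION value. The helper name and the 60-second interval are assumptions for illustration; the sketch also assumes DURATION is given in minutes, so check alerts.cfg(5) for the exact rule syntax before using the result.

```python
import math


def duration_minutes(run_interval_seconds, runs_before_alert=3):
    """Minutes a status must stay red so that N full runs fit in the window."""
    total_seconds = run_interval_seconds * runs_before_alert
    return math.ceil(total_seconds / 60)


# Hypothetical setup: xymonnet runs every 60 seconds, alert after 3 runs.
minutes = duration_minutes(60, 3)
print(f"DURATION>{minutes}")
```

With a 90-second interval the same call gives 5 minutes, because 270 seconds is rounded up to the next whole minute.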


xymonnet-again.sh itself is somewhat basic, but you can script up any
number of additional ways of dispatching with the same concept. I have a
script on another server that queries xymond for any non-green 'dns' tests
every 10s and re-scans just those hosts with lower --timeout values.
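
A minimal sketch of that kind of dispatcher, assuming the board query returns one "hostname|color" line per 'dns' status: the sample lines, the function names and the 3-second timeout are all hypothetical, and the real script is not shown in this thread. Only --timeout and passing hostnames to xymonnet are taken from the description above.

```python
def hosts_to_rescan(board_lines):
    """Pick hostnames whose reported color is anything other than green."""
    hosts = []
    for line in board_lines:
        line = line.strip()
        if not line:
            continue
        hostname, color = line.split("|", 1)
        if color != "green" and hostname not in hosts:
            hosts.append(hostname)
    return hosts


def rescan_command(hosts, timeout_seconds=3):
    """Build an xymonnet argv limited to the given hosts, with a short timeout."""
    return ["xymonnet", f"--timeout={timeout_seconds}", *hosts]


# Made-up board output, for illustration only.
sample = ["ns1.example.com|green", "ns2.example.com|red", "ns3.example.com|yellow"]
failing = hosts_to_rescan(sample)
if failing:
    print(rescan_command(failing))
```

A loop or scheduler would repeat this every 10 seconds and run the command only when the list of failing hosts is non-empty.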

As above, I've found interval scanning and adjusting your duration to be
simpler conceptually and to handle most of the cases that are needed. It
also sidesteps the problem of an overloaded scheduler during a crisis,
leaving just the extra time needed for failing TCP tests in the first
place.


HTH,
-jc
list Jeremy Laidman · Wed, 13 Jan 2016 01:39:33 +0000 ·
On Sat, Jan 9, 2016 at 5:41 AM John Thurston <user-ce4d79d99bab@xymon.invalid> wrote:
On 1/8/2016 9:04 AM, Ribeiro, Glauber wrote:
Thanks, I got that, so there is no set number of repetitions? I.e. it
will keep trying for 30 minutes?
The number of retries will depend on how frequently the attempt is made
and how long each attempt takes to fail.

I don't think it is dependent on how long it takes to fail.  The
xymonnet-again.sh script is run every minute, and so it will probably retry
either 29 or 30 times, depending on the exact timing of its launch.
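
The "29 or 30" figure can be checked with a small simulation. This is a simplified model, not xymonnet's logic: it assumes the retry job fires exactly every 60 seconds, that a retry counts only while the first failure is less than 1800 seconds old, and that the only variable is how long after the failure the first run happens.

```python
def count_retries(first_tick_offset, interval=60, limit=1800):
    """Count scheduler ticks that land while the failure is younger than limit.

    first_tick_offset: seconds between the first failure and the next tick.
    """
    count = 0
    tick = first_tick_offset
    while tick < limit:
        count += 1
        tick += interval
    return count


print(count_retries(1))    # failure just before a tick: 30
print(count_retries(60))   # failure just after a tick: 29
```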