Custom check interval for different status of custom tests

6 messages in this thread

list Andrey Chervonets · Thu, 2 May 2013 10:07:31 +0300 ·

Is it possible to define a different check interval for custom tests?

For example, we check tablespace usage in a database.
For some tests we do not want to check every 5 minutes, simply because of
the performance impact of some complex requests.

When the status is green, we check every 30 minutes.
But if the status is yellow or red, the DBA can deliver a fix in less than 30
minutes (or the situation can change even faster),
and it would be nice to reflect the correct status sooner than 30 minutes later.


Is it possible now, or could it be implemented in the next release?


Best regards,

Andrey Chervonets
SIA CoMinder
http://www.cominder.eu/
mobile: +XXX XXXXXXXX

list Michael Beatty · Thu, 02 May 2013 08:37:57 -0400 ·

While I don't believe this is built into Xymon (it would be nice to have), I have written scripts that do this by putting a loop with retry logic in the script. Set Xymon to run the script every 30 minutes; if yellow or red is detected, sleep 1 minute and check again, repeating up to 29 times or until green.
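
A minimal sketch of such a retry loop, not taken from any script in this thread. The names run_check, report, MAX_RETRIES and RETRY_SLEEP are hypothetical, and run_check is a placeholder for the real (expensive) test:

```shell
#!/bin/sh
# Sketch only: one check, then cheap re-checks while the result is not green.
# All function and variable names here are assumptions made for this example.

MAX_RETRIES=${MAX_RETRIES:-29}   # up to 29 re-checks, as described above
RETRY_SLEEP=${RETRY_SLEEP:-60}   # seconds to wait between re-checks

# Placeholder for the real test; it must print green, yellow or red.
run_check() {
    echo green
}

# Placeholder for reporting. A real script would send the result with
# something like: $XYMON $XYMSRV "status $MACHINE.mytest $1 some text"
report() {
    echo "status: $1"
}

check_with_retries() {
    tries=0
    color=$(run_check)
    report "$color"
    # Keep re-checking until the test is green or the retries run out.
    while [ "$color" != green ] && [ "$tries" -lt "$MAX_RETRIES" ]; do
        sleep "$RETRY_SLEEP"
        color=$(run_check)
        report "$color"
        tries=$((tries + 1))
    done
}
```

The script body would end with a call to check_with_retries. With the defaults, a failing test is checked at most 30 times per run, which roughly fills the 30-minute interval.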


Michael Beatty


list Japheth Cleaver · Thu, 2 May 2013 14:48:20 -0000 (UTC) ·

▸ quoted from Michael Beatty

While I don't believe this to be built into Xymon (would be nice to
have), I have written scripts that do this by putting a loop in the
script with retry logic. Set Xymon to run the script every 30 minutes,
if yellow or red is detected, sleep 1 minute check again... repeat 29
times or until green.


Michael Beatty

There's a xymonnet-again script that actually kind of does just this.

# "xymonnetagain" picks up the tests that the normal network test consider "failed", and re-does those
# tests more often. This enables Xymon to pick up a recovered network service faster than
# if it were tested only by the "xymonnet" task (which only runs every 5 minutes). So if you have
# servers with very high availability guarantees, running this task will make your availability
# reports look much better.
[xymonnetagain]
        ENVFILE /etc/xymon/xymonserver.cfg
        NEEDS xymond
        CMD /etc/xymon/ext/xymonnet-again.sh
        LOGFILE $XYMONSERVERLOGS/xymonnetagain.log
        INTERVAL 30s


Of course, that only covers xymonnet-run tests. If you have a custom
script, the same logic is possible: one task that runs every so often,
and one that runs much more frequently, taking a list of known-to-be-bad
hosts.

Another way to do this is with a live query, but that depends on where you
allow querying from. Something like:
 xymon $XYMSRV "xymondboard test=thetest color=yellow,red fields=hostname" | xargs -r /usr/bin/yourtestscript.sh
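
The same idea as a small script, as a sketch under assumptions: it asks xymond for the hosts where one test is currently yellow or red and re-runs a check only for those. The function names and the placeholder recheck_host are hypothetical; the defaults for XYMON and XYMSRV are only there so the sketch is self-contained, since a Xymon task normally gets them from its environment.

```shell
#!/bin/sh
# Sketch only: re-test just the hosts whose test is currently non-green.
# Function names and default values are assumptions made for this example.

XYMON=${XYMON:-xymon}            # path to the xymon client binary
XYMSRV=${XYMSRV:-127.0.0.1}      # address of the Xymon server

# Print one hostname per line for every host where test $1 is yellow or red.
list_bad_hosts() {
    "$XYMON" "$XYMSRV" "xymondboard test=$1 color=yellow,red fields=hostname"
}

# Placeholder for the real per-host check.
recheck_host() {
    echo "rechecking $1"
}

recheck_bad_hosts() {
    list_bad_hosts "$1" | while read -r host; do
        # Skip empty lines in the server response.
        if [ -n "$host" ]; then
            recheck_host "$host"
        fi
    done
}
```

A frequently scheduled task could call recheck_bad_hosts thetest, while the full check of all hosts stays on the slow interval.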


Part of the issue here is that xymond (the central daemon) is not really
in charge of scheduling. That's by design, as the core needs to
first-and-foremost handle message passing traffic and current status
records (and noting message expiration times). It's really up to the
external programs (like xymonnet) to take their config, query the status
(if needed), and schedule or perform their checks accordingly.


Regards,

-jc

list Jeremy Laidman · Fri, 3 May 2013 11:35:58 +1000 ·

▸ quoted from Andrey Chervonets

On 2 May 2013 17:07, Andrey Chervonets <user-e7fb5c02322c@xymon.invalid> wrote:

For some tests we do not want to check every 5 minutes, simply because of
the performance impact of some complex requests.


If you re-run the checks more frequently when they fail, won't there be a
performance impact?  If the failure is due to load, then you might end up
making things worse.  Even if load isn't impacted, the people who are
troubleshooting the problem might think your monitoring is the /cause/ of
the problem, rather than a symptom.

I thought about trying to solve this in a generic way - having a script
that looks for failures and does a re-test, perhaps for tests that are
tagged for re-testing in hosts.cfg.  However, I realised that very few of
my tests would benefit from this and not be at risk of causing increased
load during a time of trouble.  Of those, I really would need to handle
each one on a case-by-case basis, to determine an optimal balance of
detecting resolution quickly vs limiting load caused by the tests.  As it's
a case-by-case assessment, I thought a generic solution wouldn't be
appropriate.

J

list Henrik Størner · Fri, 03 May 2013 11:18:31 +0200 ·

▸ quoted from Japheth Cleaver

On 02-05-2013 16:48, user-87556346d4af@xymon.invalid wrote:

Another way to do this is with a live query, but that depends on where you
allow querying from. Something like:
  xymon $XYMSRV "xymondboard test=thetest color=yellow,red fields=hostname" | xargs -r /usr/bin/yourtestscript.sh

Or just use the "query" command

   $XYMON $XYMSRV "query HOSTNAME.TEST"
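
A sketch of how a script might use that command to decide whether a re-check is needed. It assumes, as I understand the xymon(1) man page, that the response is the first line of the last status report with the color as its first word. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: read the current color of one status via the "query" command.
# Function names and default values are assumptions made for this example.

XYMON=${XYMON:-xymon}
XYMSRV=${XYMSRV:-127.0.0.1}

# query_status HOSTNAME.TEST: print the server's response for that status.
query_status() {
    "$XYMON" "$XYMSRV" "query $1"
}

# current_color HOSTNAME.TEST: print only the first word of the first line.
current_color() {
    query_status "$1" | awk 'NR == 1 { print $1 }'
}

# needs_recheck HOSTNAME.TEST: succeed if the status is yellow or red.
needs_recheck() {
    case $(current_color "$1") in
        yellow|red) return 0 ;;
        *)          return 1 ;;
    esac
}
```

As with xymondboard, this only works from hosts that are allowed to query the server.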

Or have the script keep track of the last status it sent: run the script every 5 minutes, and re-run the test only if the last status was "red" OR more than 30 minutes have elapsed.
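
A sketch of that bookkeeping, assuming a small state file that holds the last color and the time it was sent. STATE_FILE, FULL_INTERVAL, should_run and save_state are hypothetical names. The sketch treats any non-green color as a reason to re-test, which is slightly broader than "red" only.

```shell
#!/bin/sh
# Sketch only: decide on each 5-minute run whether the real test should run.
# File location, names and defaults are assumptions made for this example.

STATE_FILE=${STATE_FILE:-/tmp/mytest.state}   # holds "COLOR EPOCH_SECONDS"
FULL_INTERVAL=${FULL_INTERVAL:-1800}          # 30 minutes, in seconds

# should_run NOW: succeed if the test should run at time NOW (epoch seconds).
should_run() {
    now=$1
    # No readable state yet: run the test.
    [ -r "$STATE_FILE" ] || return 0
    read -r last_color last_time < "$STATE_FILE"
    # Unusable timestamp: play safe and run the test.
    case $last_time in
        ''|*[!0-9]*) return 0 ;;
    esac
    # Last status was not green: re-test on every run.
    if [ "$last_color" != green ]; then
        return 0
    fi
    # Green: re-test only once the full interval has elapsed.
    if [ $((now - last_time)) -ge "$FULL_INTERVAL" ]; then
        return 0
    fi
    return 1
}

# save_state COLOR NOW: remember what was sent and when.
save_state() {
    echo "$1 $2" > "$STATE_FILE"
}
```

A script scheduled every 5 minutes would call should_run "$(date +%s)", run the test only on success, and call save_state after sending the result.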

Watch out for the status going purple if you don't update it before it expires. Use "status+LIFETIME" to send a status that lasts longer than the default 30 minutes.
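
A sketch of building such a message. status_message is a hypothetical helper, and the example assumes that a bare number after "status+" is read as minutes, which is my understanding of the xymon(1) man page.

```shell
#!/bin/sh
# Sketch only: build a status message with an explicit lifetime.
# The helper name and argument order are assumptions made for this example.

# status_message LIFETIME HOSTNAME.TEST COLOR TEXT
status_message() {
    printf 'status+%s %s %s %s\n' "$1" "$2" "$3" "$4"
}

# Example of sending it (commented out, as it needs a Xymon server);
# $MACHINE is the hostname as provided in the Xymon environment:
# $XYMON $XYMSRV "$(status_message 45 "$MACHINE.tblspace" green "Tablespaces OK")"
```

With a 30-minute test interval, a lifetime somewhat longer than 30 minutes (45 here) leaves room for a slow run before the status expires.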


Regards,
Henrik

list Andrey Chervonets · Fri, 3 May 2013 15:40:16 +0300 ·

Really, our situation is not so dramatic. Running the tests every 5 minutes will affect performance, but will not block anything.
We can allow that for a short time, but we have to avoid unnecessary load throughout the day.
In most cases the fix is delivered quite quickly, and it would be nice to have more accurate resolution-time statistics.



Best regards,

Andrey Chervonets
SIA CoMinder
http://www.cominder.eu/


