DURATION rules for specific host alerts

Gary Baluha · Fri, 22 Jun 2007 10:49:43 -0400

Is there a [non-messy] way to set a DURATION rule for a specific host
alert?  Basically, what I'm thinking of is something like this:

In hobbit-clients.cfg
HOST=myhost
    LOAD 20 30 DURATION>5m

The effect being, the status of the "myhost" cpu alert will only change to
yellow/red if the load is above the appropriate threshold for more than 5
minutes.

There are a few hosts that occasionally spike above the cpu load
thresholds, but only for a few minutes (usually around 5 min at most), and
then recover on their own.  However, I don't want to raise the thresholds,
because a sustained load (more than 10 minutes) at this level _is_ actually
a critical event.  A momentary spike just isn't critical.

My specific example is with cpu load, but it could be for other things too,
such as process counts, memory, or even in some situations, disk space.

Daniel Bourque · Fri, 22 Jun 2007 10:12:32 -0500

Why would you not want the status to change? Such a history log is great for troubleshooting.

If you don't want to be notified about it, just use this in hobbit-alerts.cfg:

Page=x
    IGNORE HOST=foo SERVICE=cpu COLOR=red DURATION<5m

If you don't want it to change the status color on the parent pages, then use NOPROPYELLOW:cpu in the bb-hosts file.

If you REALLY don't want it to change status, increase the LOAD numbers in the hobbit-clients.cfg file.

-Dan

Gary Baluha · Fri, 22 Jun 2007 13:36:47 -0400

On 6/22/07, Daniel Bourque <user-a141068964db@xymon.invalid> wrote:
> Why would you not want the status to change? Such a history log is great
> for troubleshooting.

I wouldn't want the status to change, because I'm essentially making it a
two-part threshold: one part based on the hard numeric value, and another
based on the length of time.

> If you don't want to be notified about it, just use this in
> hobbit-alerts.cfg:
>
> Page=x
>     IGNORE HOST=foo SERVICE=cpu COLOR=red DURATION<5m

Ahh, that's the sort of hobbit-alerts rule that would work for me, at least
until (if?) there is a way to do what I'm looking for in
hobbit-clients.cfg.

> If you don't want it to change the status color on the parent pages, then
> use NOPROPYELLOW:cpu in the bb-hosts file.
>
> If you REALLY don't want it to change status, increase the LOAD numbers in
> the hobbit-clients.cfg file.

The catch is that the load is only a problem if it is _sustained_ for more
than 10 minutes or so.
If I set the red threshold to Y and the load momentarily spikes to Y+1, that
isn't a problem.  But if I raise the threshold to Y+2 and then get a
sustained load of Y+1, that would be a problem, because I wouldn't get alerted.

Essentially, I'm looking for a sort of time-based hysteretic monitoring.
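
The two-part threshold described above (a value limit plus a time limit) can be sketched as a small state machine. This is a minimal Python illustration under stated assumptions, not Hobbit's own logic: the class name `DurationThreshold`, its parameters and the once-a-minute samples are all hypothetical. `warn` and `panic` play the role of the two LOAD numbers, and `min_duration` plays the role of the proposed `DURATION>5m`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DurationThreshold:
    """Hypothetical duration-gated threshold (not a Hobbit feature).

    A color is reported only after the value has stayed above that
    color's threshold for longer than min_duration seconds without
    dropping back below it.
    """
    warn: float          # yellow level, like the first LOAD number
    panic: float         # red level, like the second LOAD number
    min_duration: float  # seconds, e.g. 300 for "DURATION>5m"
    _warn_since: Optional[float] = field(default=None, init=False)
    _panic_since: Optional[float] = field(default=None, init=False)

    def update(self, now: float, value: float) -> str:
        """Feed one sample (timestamp in seconds, measured value); return a color."""
        # Remember when each threshold was first exceeded; forget it on recovery.
        if value > self.warn:
            if self._warn_since is None:
                self._warn_since = now
        else:
            self._warn_since = None
        if value > self.panic:
            if self._panic_since is None:
                self._panic_since = now
        else:
            self._panic_since = None

        # Escalate only when the excursion has lasted longer than min_duration.
        if self._sustained(self._panic_since, now):
            return "red"
        if self._sustained(self._warn_since, now):
            return "yellow"
        return "green"

    def _sustained(self, since: Optional[float], now: float) -> bool:
        return since is not None and now - since > self.min_duration


if __name__ == "__main__":
    # A 4-minute spike to 32, then recovery: the status never leaves green.
    spike = DurationThreshold(warn=20, panic=30, min_duration=300)
    print([spike.update(m * 60, load) for m, load in enumerate([5, 32, 32, 32, 32, 8])])
    # A steady load of 25 sampled once a minute: yellow once it exceeds 5 minutes.
    steady = DurationThreshold(warn=20, panic=30, min_duration=300)
    print([steady.update(m * 60, 25) for m in range(7)])
```

Because each threshold's start time resets as soon as the value drops back below it, a spike that recovers within the window never changes the color, while a sustained load at the same level eventually does. With coarser sampling, the effective duration can only be resolved to the sampling interval.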
