Xymon Mailing List Archive

Comment on Flapping

7 messages in this thread

list Elizabeth Schwartz · Tue, 29 Mar 2011 22:54:25 -0400 ·
First, let me say that this is very nifty.
Flap detection makes folks look at things that they might have missed.

It's driving the NOC folks **nuts** though. Acking the reds should
stop them from paging, but the main page then stays red for a full
half hour, even though the problem is completely fixed. IMHO it would
be very useful to have a "release" or "ALL CLEAR"  button of some sort
for flapping situations that have been dealt with. The NOC folks  hate
red screens...
(it would be even MORE useful to have a release button that only some
folks could push...is there any seekrit workaround?)

In our case we have a bit of a cascade situation, where one server can
trigger a lot of secondary reds, so we end up looking at a whole lot
of red...

thanks Betsy
list Ralph Mitchell · Tue, 29 Mar 2011 23:43:18 -0400 ·
On Tue, Mar 29, 2011 at 10:54 PM, Elizabeth Schwartz
<user-c61747246f66@xymon.invalid> wrote:
First, let me say that this is very nifty.
Flap detection makes folks look at things that they might have missed.

It's driving the NOC folks **nuts** though. Acking the reds should
stop them from paging, but the main page then stays red for a full
half hour, even though the problem is completely fixed. IMHO it would
be very useful to have a "release" or "ALL CLEAR"  button of some sort
for flapping situations that have been dealt with. The NOC folks  hate
red screens...
(it would be even MORE useful to have a release button that only some
folks could push...is there any seekrit workaround?)
I don't know much about flapping, but what happens if you manually send a
'green' status for the flapping service?

On the Xymon server:

     bb localhost "status server,domain,com.column green `date`
Flapped so hard we took off..."

If that does the trick it could be turned into your "seekrit" webpage for
certain select folks to be able to clear the status.
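As a rough sketch of what such a wrapper might look like: the helper below only builds the status message, writing the hostname with commas in place of dots as in the command above. The function name, the example host and column, and the optional timestamp argument are made up for illustration, and whether a manual green actually overrides a flapping status is still untested.

```shell
#!/bin/sh
# Sketch only: build a manual "green" status message for one host/column.
# build_status_msg HOST COLUMN NOTE [TIMESTAMP]   (hypothetical helper)
build_status_msg() {
    # Status messages write the hostname with commas instead of dots.
    commahost=$(printf '%s' "$1" | tr '.' ',')
    # Use the supplied timestamp if given, otherwise the current date.
    stamp=${4:-$(date)}
    printf 'status %s.%s green %s %s\n' "$commahost" "$2" "$stamp" "$3"
}

# Dry run with a hypothetical host and column: print the message, do not send it.
build_status_msg www.example.com http "Cleared by NOC after fix"

# To actually send it from the Xymon server, something like:
#   bb localhost "$(build_status_msg www.example.com http 'Cleared by NOC after fix')"
```

Keeping the message-building separate from the sending makes it easy to check the text before wiring it to a restricted web page.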

Did you try going to the Enable/Disable page and disabling the red things
with "Until OK" selected??  That would make the red dot go away until after
the next green report.

Ralph Mitchell
list Elizabeth Schwartz · Wed, 30 Mar 2011 12:52:53 -0400 ·
On Tue, Mar 29, 2011 at 11:43 PM, Ralph Mitchell
<user-00a5e44c48c0@xymon.invalid> wrote:
On Tue, Mar 29, 2011 at 10:54 PM, Elizabeth Schwartz
I don't know much about flapping, but what happens if you manually send a
'green' status for the flapping service?
On the Xymon server:
     bb localhost "status server,domain,com.column green `date`
I will try this the next time we get a flap, thanks! That would be a
good seekrit
Did you try going to the Enable/Disable page and disabling the red things
with "Until OK" selected??  That would make the red dot go away until after
We don't want to disable them, because what if they go down again? Not
unlikely with a previously flapping service.
Can't have a half-hour window with no monitoring of vital services

thanks Betsy
list Ryan Novosielski · Wed, 30 Mar 2011 13:21:58 -0400 ·
Excuse top-posting, it's my hardware.

Disabling "Until OK" would disable only until the next green state.

-- Sent from my Palm Pre
On Mar 30, 2011 12:53, Elizabeth Schwartz <user-c61747246f66@xymon.invalid> wrote:

On Tue, Mar 29, 2011 at 11:43 PM, Ralph Mitchell
<user-00a5e44c48c0@xymon.invalid> wrote:
> On Tue, Mar 29, 2011 at 10:54 PM, Elizabeth Schwartz
> I don't know much about flapping, but what happens if you manually send a
> 'green' status for the flapping service?
> On the Xymon server:
>      bb localhost "status server,domain,com.column green `date`

I will try this the next time we get a flap, thanks! That would be a
good seekrit

> Did you try going to the Enable/Disable page and disabling the red things
> with "Until OK" selected??  That would make the red dot go away until after

We don't want to disable them, because what if they go down again? Not
unlikely with a previously flapping service.
Can't have a half-hour window with no monitoring of vital services

thanks Betsy
list Tom Georgoulias · Wed, 30 Mar 2011 13:52:44 -0400 ·
On 03/29/2011 10:54 PM, Elizabeth Schwartz wrote:
Flap detection makes folks look at things that they might have missed.

It's driving the NOC folks **nuts** though. Acking the reds should
stop them from paging, but the main page then stays red for a full
half hour, even though the problem is completely fixed.
(it would be even MORE useful to have a release button that only some
folks could push...is there any seekrit workaround?)

In our case we have a bit of a cascade situation, where one server can
trigger a lot of secondary reds, so we end up looking at a whole lot
of red...
Have you explored the depends tag at all?  It might help reduce or 
eliminate the cascade effect.  You can read about it in the hosts.cfg 
manpage.

Tom
list Henrik Størner · Thu, 31 Mar 2011 15:29:14 +0200 ·
On Tue, 29 Mar 2011 22:54:25 -0400, Elizabeth Schwartz
<user-c61747246f66@xymon.invalid> wrote:
First, let me say that this is very nifty.
Flap detection makes folks look at things that they might have missed.
Glad you like it:-)
It's driving the NOC folks **nuts** though. Acking the reds should
stop them from paging, but the main page then stays red for a full
half hour, even though the problem is completely fixed. IMHO it would
be very useful to have a "release" or "ALL CLEAR"  button of some sort
for flapping situations that have been dealt with. The NOC folks  hate
red screens...
Well ... yes, I see your point but I am not sure I agree with it.

If your NOC folks are using the "critical view", then they can ack the
alert, and it's gone from their view. That is how I think it/they should
work :-)

I know a lot of sites use the "All non-green" view or even the full
overview pages for monitoring, and the ack won't change the color there. If
you must have a green display in that case, then you can disable the status
(make it "blue") for 30 minutes, and then it will return to the real status
after that half hour has passed. But of course, any errors during that
period will not show up until the disable-period expires.

There may be a third possibility that does what you're asking for. I think
(haven't tested it) that the new "modify" command would override a flapping
status. If you have a "disk" status on the "server1" host, then a command
like this

   xymon 127.0.0.1 "modify server1.disk green manual Disk cleanup completed"

will override the normal status-color and force the status green with the
comment "Disk cleanup completed". The "manual" keyword is just a token to
identify this modification. However, a modification is only valid for 2
status-updates, so it won't handle the full 30-minute period. It wouldn't
be terribly difficult to modify xymond to allow modifiers to be valid for a
longer period of time.

This could easily be wrapped into the status display when a flapping
status is shown.
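Until modifiers can be given a longer lifetime, one conceivable stopgap is to re-send the same modify message at intervals during the flap window. The sketch below is untested and assumes that each re-send renews the modifier, which may not be how xymond behaves. The function name, the SEND variable, and the count and pause arguments are made up for illustration; by default it is a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch only: re-send a "modify" message several times so the override
# might outlive the two-status-update limit.
# ASSUMPTION (untested): each re-send renews the modifier.
SEND=${SEND:-echo}   # default: dry run, print the command instead of running it

# resend_modify HOST.TEST CAUSE COUNT PAUSE_SECONDS   (hypothetical helper)
resend_modify() {
    target=$1 cause=$2 count=$3 pause=$4
    i=0
    while [ "$i" -lt "$count" ]; do
        # With SEND empty this runs the xymon client; with SEND=echo it prints.
        $SEND xymon 127.0.0.1 "modify $target green manual $cause"
        i=$((i + 1))
        # Pause between sends, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$pause"
        fi
    done
}

# Dry run: two sends with no pause, using the example status from above.
resend_modify server1.disk "Disk cleanup completed" 2 0
```

Running it with SEND set to an empty string would call the xymon client for real; the count and pause would have to be chosen to fit the status-update interval and the 30-minute window.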


Regards,
Henrik
list Elizabeth Schwartz · Thu, 31 Mar 2011 11:56:11 -0400 ·
Interesting, thanks!

We haven't explored the critical systems view because there's a
perception that *all* our monitored systems are critical. With Big
Brother (which I'm hoping to turn off next week!) we've been going on
the model of trying to make all the alerts that have to wake someone
up be red, and making ones that can wait not go over yellow. But it's
true that as we get bigger having all those ack'ed yellows around
muddies up the display. And now with the flapping feature, I see what
you're saying.


I'm finding the critical hosts setup to be rather intimidating,
though. We've got roughly 250 hosts, 71 distinct *types* of host.
Some of them can be cloned as generic unix or generic linux or
whatever, but most have at least one test specific to their business
function. There's an average of maybe ten tests per host, and some
hosts have tests that run on only one or two servers in a cluster. Am
I understanding correctly that when you edit the critical systems
view, you're editing a group that applies to only one particular test?
That is, I have to create "production databases-disk" and "production
databases-ntp" as separate entities? (Or maybe it should be "hosts
with sev1 disk" and "hosts with sev1 ntp"?) Or is there a way to set
the rules for all tests on a production database?

Are people with hundreds of hosts using this feature? If so, any tips?
I suspect I'm misunderstanding how to set it up.

thanks
Newbie