Xymon Mailing List Archive

the all or nothing nature of hobbit

6 messages in this thread

list Dan Simoes · Thu, 7 Dec 2006 11:59:30 -0800 ·
I love hobbit and have been using it (and BB) for many years, so take this
as constructive criticism.

One of my biggest headaches with BB (and now hobbit) has been the
all-or-nothing nature of alerts.
By this I mean that if your main network link is down, everything goes red
for network status.

Something happened on my monitoring box (probably DNS) that caused a cascade
of http errors.  http was not truly down on all these N hosts on various
networks, it was the network test that was failing on the monitoring box.

I'm unaware of a solution to this issue, and I'm considering moving to
another product because of it.
Are there any solutions, either existing or planned?

Lastly, who is maintaining the debian package for hobbit?  Both the server
and client packages still have the same bugs I reported months ago.

Thanks.

Dan
list Dan Vande More · Thu, 7 Dec 2006 14:05:44 -0600 ·
Try leveraging the "depends" functionality given in bb-hosts. Correctly
implemented, it should account for most cases of multiple errors with the
tests:

http://www.hswn.dk/hobbit/help/manpages/man5/bb-hosts.5.html
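
A rough Python sketch of what dependency-based suppression amounts to. It models the idea only; it is not Hobbit's code and not the bb-hosts syntax (the man page above is the reference for that). The host names, the DEPENDS table, the effective_color function, and the choice of "clear" as the suppressed color are all assumptions made for illustration.

```python
# Illustrative model only, not Hobbit's implementation.
# DEPENDS and effective_color are hypothetical names.

# Map (host, test) -> list of (host, test) pairs it depends on.
DEPENDS = {
    ("www1", "http"): [("router1", "conn")],
    ("www2", "http"): [("router1", "conn")],
}

FAILED = {"red", "yellow"}


def effective_color(host, test, raw, depends=DEPENDS):
    """Return the color to display for host/test.

    raw maps (host, test) -> color as measured. A failed test whose
    dependency has also failed is shown as "clear" (assumed here to mean
    "do not alert"), so only the upstream failure stays red.
    """
    color = raw[(host, test)]
    if color not in FAILED:
        return color
    for dep in depends.get((host, test), []):
        if raw.get(dep) in FAILED:
            return "clear"
    return color


raw = {
    ("router1", "conn"): "red",
    ("www1", "http"): "red",
    ("www2", "http"): "green",
}
print(effective_color("www1", "http", raw))     # clear
print(effective_color("router1", "conn", raw))  # red
```

With the router down, only the router's own test stays red; the http failure behind it is downgraded. A test with no declared dependency is never suppressed, which errs on the side of alerting.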

I am unsure as to who maintains the debian pkg. I wish it could make it
mainline though...

Dan
list Henrik Størner · Thu, 7 Dec 2006 22:05:47 +0100 ·
quoted from Dan Simoes
On Thu, Dec 07, 2006 at 11:59:30AM -0800, Dan Simoes wrote:
I love hobbit and have been using it (and BB) for many years, so take this
as constructive criticism.

One of my biggest headaches with BB (and now hobbit) has been the
all-or-nothing nature of alerts.
By this I mean that if your main network link is down, everything goes red
for network status.

Something happened on my monitoring box (probably DNS) that caused a cascade
of http errors.  http was not truly down on all these N hosts on various
networks, it was the network test that was failing on the monitoring box.
It's a valid point - but it is also very, very difficult to handle. Not
so much because it is difficult to suppress alerts; the $1bn question is
how to decide when to suppress an alert, and which issue is the root
cause of all the problems we're seeing.

Heck, sometimes it can be difficult even for intelligent humans to
figure out what is really going on ...

I think what this really boils down to is some form of event correlation
mechanism, on top of which you then apply some heuristics (that's a
fancy word for "guessing") to decide what is the core issue. E.g. if we
have 200 tests reporting a failure because of a DNS lookup that timed
out, then we probably have an issue with the DNS server we used. But it
could also be a firewall mis-configuration that blocks our outbound DNS
queries, or an IP address conflict that causes our DNS lookups to go to 
a server which doesn't handle DNS - it is really hard for any machine to
figure that out by itself.
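
The "200 tests failing on a DNS timeout" example can be sketched as a simple shared-symptom check in Python. This is a guess at one possible heuristic, not anything Hobbit does; the function name, the reason strings, and the thresholds are hypothetical. Note that it only reports the symptom most failures have in common, which is exactly the limitation described above: it cannot tell a dead DNS server from a firewall change or an IP conflict.

```python
from collections import Counter


def probable_common_cause(failures, min_share=0.8, min_count=10):
    """Return the failure reason shared by most failing tests, or None.

    failures is a list of (host, test, reason) tuples for tests that are
    currently failing. A reason is returned only if there are at least
    min_count failures and at least min_share of them carry that reason.
    Both thresholds are arbitrary illustration values.
    """
    if len(failures) < min_count:
        return None
    reason, count = Counter(r for _, _, r in failures).most_common(1)[0]
    if count / len(failures) >= min_share:
        return reason
    return None


# 200 http tests that failed on a DNS timeout, plus 5 unrelated failures.
failures = [("host%d" % i, "http", "dns-timeout") for i in range(200)]
failures += [("db%d" % i, "disk", "disk-full") for i in range(5)]

print(probable_common_cause(failures))  # dns-timeout
```

A cautious use of the result would be to add one summary alert naming the shared symptom while still sending the individual alerts, in keeping with "one alert too many rather than one too few".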

The current implementation is not ideal, I'll be the first to admit
that. Any ideas for improving it are welcome, but please consider the
possibilities for the system making wrong decisions. I'd rather send out
one alert too many than one too few.
quoted from Dan Simoes

I'm unaware of a solution to this issue, and I'm considering moving to
another product because of it.
If you know of any products that are really good at handling this, I'd
be interested to hear about them.
quoted from Dan Simoes
Lastly, who is maintaining the debian package for hobbit?  Both the server
and client packages still have the same bugs I reported months ago.
Since there haven't been any Hobbit releases since August, that really
shouldn't come as a surprise.


Regards,
Henrik
list Buchan Milne · Fri, 8 Dec 2006 09:10:08 +0200 ·
quoted from Henrik Størner
On Thursday 07 December 2006 23:05, Henrik Stoerner wrote:
On Thu, Dec 07, 2006 at 11:59:30AM -0800, Dan Simoes wrote:
I love hobbit and have been using it (and BB) for many years, so take
this as constructive criticism.

One of my biggest headaches with BB (and now hobbit) has been the
all-or-nothing nature of alerts.
By this I mean that if your main network link is down, everything goes
red for network status.

Something happened on my monitoring box (probably DNS) that caused a
cascade of http errors.  http was not truly down on all these N hosts on
various networks, it was the network test that was failing on the
monitoring box.
It's a valid point - but it is also very, very difficult to handle. Not
so much because it is difficult to suppress alerts; the $1bn question is
how to decide when to suppress an alert, and which issue is the root
cause of all the problems we're seeing.

Heck, sometimes it can be difficult even for intelligent humans to
figure out what is really going on ...

I think what this really boils down to is some form of event correlation
mechanism,
Event correlation seems to be the current buzzword from all the monitoring 
tool vendors whose presentations I have seen recently ...
quoted from Henrik Størner
on top of which you then apply some heuristics (that's a 
fancy word for "guessing") to decide what is the core issue. E.g. if we
have 200 tests reporting a failure because of a DNS lookup that timed
out, then we probably have an issue with the DNS server we used. But it
could also be a firewall mis-configuration that blocks our outbound DNS
queries, or an IP address conflict that causes our DNS lookups to go to
a server which doesn't handle DNS - it is really hard for any machine to
figure that out by itself.

The current implementation is not ideal, I'll be the first to admit
that. Any ideas for improving it are welcome, but please consider the
possibilities for the system making wrong decisions. I'd rather send out
one alert too many than one too few.
I'm unaware of a solution to this issue, and I'm considering moving to
another product because of it.

If you know of any products that are really good at handling this, I'd
be interested to hear about them.

I can list some (proprietary ones) that are punting this, but I've never seen
them in action.

Regards,
Buchan

-- 
Buchan Milne
ISP Systems Specialist - Monitoring/Authentication Team Leader
B.Eng,RHCE(803004789010797),LPIC-2(LPI000074592)
list Dan Simoes · Fri, 8 Dec 2006 18:11:52 -0800 ·
quoted from Henrik Størner
On 12/7/06, Henrik Stoerner <user-ce4a2c883f75@xymon.invalid> wrote:

I think what this really boils down to is some form of event correlation
mechanism, on top of which you then apply some heuristics (that's a
fancy word for "guessing") to decide what is the core issue. E.g. if we
have 200 tests reporting a failure because of a DNS lookup that timed
out, then we probably have an issue with the DNS server we used. But it
could also be a firewall mis-configuration that blocks our outbound DNS
queries, or an IP address conflict that causes our DNS lookups to go to
a server which doesn't handle DNS - it is really hard for any machine to
figure that out by itself.

What I had in mind was more of a baseline check, before proceeding to the
other tests.
Can't resolve DNS?  All other tests which depend on DNS are skipped.
Can't ping your default router?  Don't bother with any extra network tests.
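
A minimal Python sketch of that baseline idea, assuming each check is just a function returning True or False. Nothing here is an existing Hobbit feature; run_with_baseline, the baseline names, and the stubbed checks are hypothetical. The real checks (a DNS lookup, a ping of the default router) are replaced by lambdas so the example runs anywhere.

```python
def run_with_baseline(baselines, tests):
    """Run baseline checks first, then only the tests whose baselines passed.

    baselines: dict mapping a baseline name to a callable returning bool.
    tests: list of (test name, list of required baseline names, callable).
    Returns (baseline results, test results). A test whose baseline failed
    is recorded as skipped instead of being run and going red.
    """
    ok = {name: check() for name, check in baselines.items()}
    results = {}
    for name, requires, check in tests:
        missing = [b for b in requires if not ok.get(b, False)]
        if missing:
            results[name] = "skipped (baseline failed: " + ", ".join(missing) + ")"
        else:
            results[name] = "green" if check() else "red"
    return ok, results


# Stubbed checks: DNS resolution is broken, the default router answers.
baselines = {"dns": lambda: False, "gateway": lambda: True}
tests = [
    ("www1 http", ["dns", "gateway"], lambda: True),
    ("10.0.0.5 conn", ["gateway"], lambda: True),
]

ok, results = run_with_baseline(baselines, tests)
print(results["www1 http"])      # skipped (baseline failed: dns)
print(results["10.0.0.5 conn"])  # green
```

The failed baseline itself would still need to raise its own alert, otherwise a broken monitoring box would go quiet instead of red.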
quoted from Buchan Milne
I'm unaware of a solution to this issue, and I'm considering moving to
another product because of it.
If you know of any products that are really good at handling this, I'd
be interested to hear about them.

I can't think of any in particular.  I've used Unicenter way back when and
don't recall this issue, but it's been a while.  And I've only taken a
cursory look at Nagios.
quoted from Henrik Størner
Lastly, who is maintaining the debian package for hobbit?  Both the server
and client packages still have the same bugs I reported months ago.
Since there haven't been any Hobbit releases since August, that really
shouldn't come as a surprise.

True, but they are bugs I pointed out since before the release candidate.
In particular, the client postconf script munges the
/etc/default/hobbit-client file and needs to be edited by hand before hobbit
will run.
I'd be happy to provide feedback to whoever is maintaining the package (is
that you, Henrik?)

Thanks.
list Scott Walters · Fri, 8 Dec 2006 22:25:05 -0500 ·
quoted from Dan Simoes
I think what this really boils down to is some form of event correlation
mechanism, on top of which you then apply some heuristics (that's a
fancy word for "guessing") to decide what is the core issue.

| If you know of any products that are really good at handling this, I'd
| be interested to hear about them.

Heuristics is poppycock in the datacenter.  Humans are so ridiculously good
at correlating events that trying to train a computer to guess is wasted
effort.  Now, from an intellectual or research point of view that
may not be the case, but I am pragmatic in the datacenter:  Useful, not
interesting.

My thought to "solve" this problem is the idea of "scenario
fingerprinting."  As I mentioned, trying to teach a computer to learn is
futile, but instructing a computer to look for *known* conditions works
perfectly.  Criminals and problems have a tendency to repeat themselves.

So, rather than deal with "event correlation", I think a better approach
would be an engine that could do state analysis with many rules for a single
scenario. Perhaps it's semantics, but "event correlation" to me implies
events over time, and I don't think you need the time parameter, only the
view of the environment at an instant, the fingerprint.   If the scenario is
recognized, then "react" by disabling and alerting appropriately.

For example, say you lose a router in Europe and all the pings die across the
pond (I am in North America).  Generate a "scenario alert" that describes
the scenario and disable all the routers/hosts over there.  Odds are if that
router went down once, it will go down again.
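
One way the fingerprint idea could look in Python, as a sketch under stated assumptions: a scenario is a hand-written set of expected test colors, matched against a single snapshot of current state with no time dimension. The SCENARIOS table, the host names, and the react function are hypothetical; "disable" here just returns a list of hosts and does not call any real Hobbit command.

```python
# Each scenario is a known condition written down by a human.
# "match" lists the (host, test) colors that identify it; "disable" lists
# the hosts whose alerts should be silenced when it is recognized.
SCENARIOS = [
    {
        "name": "European router down",
        "match": {
            ("eu-router1", "conn"): "red",
            ("eu-web1", "conn"): "red",
            ("eu-web2", "conn"): "red",
        },
        "disable": ["eu-router1", "eu-web1", "eu-web2"],
    },
]


def react(snapshot, scenarios=SCENARIOS):
    """Compare one snapshot of state against every known scenario.

    snapshot maps (host, test) -> current color. Returns a list of
    scenario alert messages and a sorted list of hosts to disable.
    """
    alerts, disabled = [], set()
    for scenario in scenarios:
        if all(snapshot.get(k) == v for k, v in scenario["match"].items()):
            alerts.append("scenario alert: " + scenario["name"])
            disabled.update(scenario["disable"])
    return alerts, sorted(disabled)


snapshot = {
    ("eu-router1", "conn"): "red",
    ("eu-web1", "conn"): "red",
    ("eu-web2", "conn"): "red",
    ("us-web1", "conn"): "green",
}
alerts, disabled = react(snapshot)
print(alerts)    # ['scenario alert: European router down']
print(disabled)  # ['eu-router1', 'eu-web1', 'eu-web2']
```

An unrecognized state matches nothing and every alert goes out as usual, so the rules only ever reduce noise for conditions someone has already seen and written down.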

You leverage the ability of a human to correlate with the computer's ability
to "keep on the look-out" for "known offenders."  I think this methodology
could also be applied to the RRD system stats.

Let the machines do what they are good at, following instructions, and let
the humans do what they are good at, thinking.

Scott