the all or nothing nature of hobbit
list Dan Simoes
I love hobbit and have been using it (and BB) for many years, so take this as constructive criticism.
One of my biggest headaches with BB (and now hobbit) has been the all-or-nothing nature of alerts. By this I mean that if your main network link is down, everything goes red for network status. Something happened on my monitoring box (probably DNS) that caused a cascade of http errors. http was not truly down on all these N hosts on various networks; it was the network test that was failing on the monitoring box.
I'm unaware of a solution to this issue, and I'm considering moving to another product because of it. Are there any solutions, either existing or planned?
Lastly, who is maintaining the Debian package for hobbit? Both the server and client packages still have the same bugs I reported months ago.
Thanks.
Dan
list Dan Vande More
Try leveraging the "depends" functionality given in bb-hosts. Correctly implemented, it should account for most cases of multiple errors with the tests:
http://www.hswn.dk/hobbit/help/manpages/man5/bb-hosts.5.html
I am unsure as to who maintains the Debian pkg. I wish it could make it mainline though...
Dan
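A minimal Python sketch of the dependency idea follows. It is not Hobbit's code, and the man page linked above remains the reference for the actual bb-hosts tag syntax. The function name `effective_color`, the data layout, and the rule that a failing test is reported as "clear" when one of its dependencies is red are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (host, test), e.g. ("www", "http")


def effective_color(
    host: str,
    test: str,
    status: Dict[Key, str],
    depends: Dict[Key, List[Key]],
) -> str:
    """Return the color to report for host/test.

    Assumption: a red or yellow test is downgraded to "clear"
    when any test it depends on is currently red.
    """
    color = status[(host, test)]
    if color not in ("red", "yellow"):
        return color  # nothing to suppress
    for dep in depends.get((host, test), []):
        if status.get(dep) == "red":
            return "clear"  # the dependency explains the failure
    return color


# Hypothetical hosts: "www" sits behind gateway "gw", "mail" does not.
status = {("gw", "conn"): "red", ("www", "http"): "red", ("mail", "smtp"): "red"}
depends = {("www", "http"): [("gw", "conn")]}
print(effective_color("www", "http", status, depends))
print(effective_color("mail", "smtp", status, depends))
```

Only the test with a declared dependency is downgraded, so an unrelated failure still alerts as usual.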
list Henrik Størner
On Thu, Dec 07, 2006 at 11:59:30AM -0800, Dan Simoes wrote:
> I love hobbit and have been using it (and BB) for many years, so take this as constructive criticism. One of my biggest headaches with BB (and now hobbit) has been the all-or-nothing nature of alerts. By this I mean that if your main network link is down, everything goes red for network status. Something happened on my monitoring box (probably DNS) that caused a cascade of http errors. http was not truly down on all these N hosts on various networks; it was the network test that was failing on the monitoring box.
It's a valid point - but it is also very, very difficult to handle. Not so much because it is difficult to suppress alerts; the $1bn question is how to decide when to suppress an alert, and which issue is the root cause of all the problems we're seeing. Heck, sometimes it can be difficult even for intelligent humans to figure out what is really going on ...
I think what this really boils down to is some form of event correlation mechanism, on top of which you then apply some heuristics (that's a fancy word for "guessing") to decide what is the core issue. E.g. if we have 200 tests reporting a failure because of a DNS lookup that timed out, then we probably have an issue with the DNS server we used. But it could also be a firewall mis-configuration that blocks our outbound DNS queries, or an IP address conflict that causes our DNS lookups to go to a server which doesn't handle DNS - it is really hard for any machine to figure that out by itself.
The current implementation is not ideal, I'll be the first to admit that. Any ideas for improving it are welcome, but please consider the possibilities for the system making wrong decisions. I'd rather send out one alert too many than one too few.
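The "200 tests failing with the same DNS timeout" case can be sketched as a simple grouping heuristic. This is an illustrative Python sketch, not anything Hobbit does: the function name `probable_common_cause`, the `(host, test, reason)` tuple layout, and both thresholds are assumptions. As noted above, it can only report the shared symptom; it cannot tell a dead DNS server from a firewall rule or an IP conflict.

```python
from collections import Counter
from typing import List, Optional, Tuple

Failure = Tuple[str, str, str]  # (host, test, failure reason)


def probable_common_cause(
    failures: List[Failure],
    min_count: int = 10,
    min_share: float = 0.8,
) -> Optional[str]:
    """Return the dominant failure reason, or None if there is none.

    A reason is "dominant" when there are at least min_count failures
    and at least min_share of them report that same reason.
    """
    if len(failures) < min_count:
        return None
    reason, count = Counter(r for _, _, r in failures).most_common(1)[0]
    if count / len(failures) >= min_share:
        return reason
    return None


# Hypothetical data: 200 http tests failing on DNS, one unrelated failure.
failures = [("host%d" % i, "http", "DNS lookup timed out") for i in range(200)]
failures.append(("mail1", "smtp", "connection refused"))
print(probable_common_cause(failures))
```

A cautious use of the result would be to add one summary alert naming the shared symptom rather than to drop the individual alerts, which keeps the "one alert too many rather than one too few" bias.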
> I'm unaware of a solution to this issue, and I'm considering moving to another product because of it.
If you know of any products that are really good at handling this, I'd be interested to hear about them.
> Lastly, who is maintaining the debian package for hobbit? Both the server and client packages still have the same bugs I reported months ago.
Since there haven't been any Hobbit releases since August, that really shouldn't come as a surprise.
Regards,
Henrik
list Buchan Milne
On Thursday 07 December 2006 23:05, Henrik Stoerner wrote:
> On Thu, Dec 07, 2006 at 11:59:30AM -0800, Dan Simoes wrote:
>> I love hobbit and have been using it (and BB) for many years, so take this as constructive criticism. One of my biggest headaches with BB (and now hobbit) has been the all-or-nothing nature of alerts. By this I mean that if your main network link is down, everything goes red for network status. Something happened on my monitoring box (probably DNS) that caused a cascade of http errors. http was not truly down on all these N hosts on various networks; it was the network test that was failing on the monitoring box.
> It's a valid point - but it is also very, very difficult to handle. Not so much because it is difficult to suppress alerts; the $1bn question is how to decide when to suppress an alert, and which issue is the root cause of all the problems we're seeing. Heck, sometimes it can be difficult even for intelligent humans to figure out what is really going on ... I think what this really boils down to is some form of event correlation mechanism,
Event correlation seems to be the current buzzword from all the monitoring tool vendors whose presentations I have seen recently ...
> on top of which you then apply some heuristics (that's a fancy word for "guessing") to decide what is the core issue. E.g. if we have 200 tests reporting a failure because of a DNS lookup that timed out, then we probably have an issue with the DNS server we used. But it could also be a firewall mis-configuration that blocks our outbound DNS queries, or an IP address conflict that causes our DNS lookups to go to a server which doesn't handle DNS - it is really hard for any machine to figure that out by itself. The current implementation is not ideal, I'll be the first to admit that. Any ideas for improving it are welcome, but please consider the possibilities for the system making wrong decisions. I'd rather send out one alert too many than one too few.
>> I'm unaware of a solution to this issue, and I'm considering moving to another product because of it.
> If you know of any products that are really good at handling this, I'd be interested to hear about them.
I can list some (proprietary ones) that are punting this, but I've never seen them in action.
Regards,
Buchan
--
Buchan Milne
ISP Systems Specialist - Monitoring/Authentication Team Leader
B.Eng,RHCE(803004789010797),LPIC-2(LPI000074592)
list Dan Simoes
On 12/7/06, Henrik Stoerner <user-ce4a2c883f75@xymon.invalid> wrote:
> I think what this really boils down to is some form of event correlation mechanism, on top of which you then apply some heuristics (that's a fancy word for "guessing") to decide what is the core issue. E.g. if we have 200 tests reporting a failure because of a DNS lookup that timed out, then we probably have an issue with the DNS server we used. But it could also be a firewall mis-configuration that blocks our outbound DNS queries, or an IP address conflict that causes our DNS lookups to go to a server which doesn't handle DNS - it is really hard for any machine to figure that out by itself.
What I had in mind was more of a baseline check, before proceeding to the
other tests.
Can't resolve DNS? All other tests which depend on DNS are skipped.
Can't ping your default router? Don't bother with any extra network tests.
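The baseline-check idea above can be sketched in a few lines of Python. This is only an illustration of the gating logic, not a proposal for how Hobbit should implement it: `run_with_baseline`, `dns_ok`, the result strings, and the default hostname are all hypothetical names chosen for the example.

```python
import socket
from typing import Callable, Dict, List, Tuple

Check = Callable[[], bool]
Test = Tuple[str, List[str], Check]  # (name, baseline checks it needs, check)


def dns_ok(name: str = "example.com") -> bool:
    """One possible baseline check: can the monitoring box resolve a name?"""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


def run_with_baseline(
    baseline: Dict[str, Check], tests: List[Test]
) -> Tuple[Dict[str, bool], Dict[str, str]]:
    """Run the baseline checks once, then skip tests whose baseline failed."""
    base = {name: check() for name, check in baseline.items()}
    results: Dict[str, str] = {}
    for name, needs, check in tests:
        broken = [b for b in needs if not base[b]]
        if broken:
            results[name] = "skipped: " + ", ".join(broken) + " failed"
        else:
            results[name] = "ok" if check() else "failed"
    return base, results


# Simulated run: DNS is broken on the monitoring box, the router answers.
baseline = {"dns": lambda: False, "router": lambda: True}
tests = [
    ("www/http", ["dns", "router"], lambda: False),
    ("10.0.0.5/conn", ["router"], lambda: True),
]
base, results = run_with_baseline(baseline, tests)
for name, outcome in results.items():
    print(name, "->", outcome)
```

The failed baseline check itself would still need to raise its own alert, so that skipping the dependent tests does not hide the outage.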
>> I'm unaware of a solution to this issue, and I'm considering moving to another product because of it.
> If you know of any products that are really good at handling this, I'd be interested to hear about them.
I can't think of any in particular. I used Unicenter way back when and
don't recall this issue, but it's been a while. And I've only taken a
cursory look at Nagios.
>> Lastly, who is maintaining the debian package for hobbit? Both the server and client packages still have the same bugs I reported months ago.
> Since there haven't been any Hobbit releases since August, that really shouldn't come as a surprise.
True, but they are bugs I pointed out since before the release candidate.
In particular, the client postconf script munges the
/etc/default/hobbit-client file, which then needs to be edited by hand
before hobbit will run.
I'd be happy to provide feedback to whoever is maintaining the package (is
that you, Henrik?)
Thanks.
list Scott Walters
| I think what this really boils down to is some form of event correlation mechanism, on top of which you then apply some heuristics (that's a fancy word for "guessing") to decide what is the core issue.
| If you know of any products that are really good at handling this, I'd
| be interested to hear about them.
Heuristics is poppycock in the datacenter. Humans are so ridiculously good
at correlating events that trying to train a computer to guess is wasted
effort. From an intellectual or research point of view that may not be the
case, but I am pragmatic in the datacenter: useful, not interesting.
My thought to "solve" this problem is the idea of "scenario
fingerprinting." As I mentioned, trying to teach a computer to learn is
futile, but instructing a computer to look for *known* conditions works
perfectly. Criminals and problems have a tendency to repeat themselves.
So, rather than deal with "event correlation", I think a better approach
would be an engine that could do state analysis with many rules for a single
scenario. Perhaps it's semantics, but "event correlation" to me implies
events over time, and I don't think you need the time parameter, only the
view of the environment at an instant, the fingerprint. If the scenario is
recognized, then "react" by disabling and alerting appropriately.
For example, say you lose a router in Europe and all the pings die across the
pond (I am in North America). Generate a "scenario alert" that describes
the scenario and disable all the routers/hosts over there. Odds are if that
router went down once, it will go down again.
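A rough Python sketch of what such a scenario-fingerprint engine could look like is shown here. It is only one reading of the idea: the `Scenario` fields, the `react` function, the alert strings, and the host names are hypothetical, and the snapshot is a plain dictionary of current colors with no time dimension.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

Key = Tuple[str, str]  # (host, test)


@dataclass(frozen=True)
class Scenario:
    name: str
    fingerprint: FrozenSet[Key]  # all of these must be red to match
    suppress: FrozenSet[Key]     # individual alerts to silence on a match


def react(snapshot: Dict[Key, str], scenarios: List[Scenario]) -> List[str]:
    """Match known scenarios against one snapshot and build the alert list."""
    alerts: List[str] = []
    suppressed = set()
    for s in scenarios:
        if all(snapshot.get(k) == "red" for k in s.fingerprint):
            alerts.append("SCENARIO: " + s.name)
            suppressed |= s.suppress
    # Anything red that no recognized scenario explains still alerts.
    for key, color in sorted(snapshot.items()):
        if color == "red" and key not in suppressed:
            alerts.append("ALERT: %s/%s" % key)
    return alerts


eu = frozenset({("eu-router", "conn"), ("eu-web1", "conn"), ("eu-web2", "conn")})
scenarios = [
    Scenario(
        name="European router down",
        fingerprint=frozenset({("eu-router", "conn"), ("eu-web1", "conn")}),
        suppress=eu,
    )
]
snapshot = {
    ("eu-router", "conn"): "red",
    ("eu-web1", "conn"): "red",
    ("eu-web2", "conn"): "red",
    ("us-db", "conn"): "red",
    ("us-web", "conn"): "green",
}
alerts = react(snapshot, scenarios)
for line in alerts:
    print(line)
```

The human supplies the fingerprint after seeing the outage once; the engine only checks whether the current state matches it, and unrecognized failures pass through untouched.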
You leverage the ability of a human to correlate with the computer's ability
to "keep on the look-out" for "known offenders." I think this methodology
could also be applied to the RRD system stats.
Let the machines do what they are good at, following instructions, and let
the humans do what they are good at, thinking.
Scott