grouping methods

19 messages in this thread

list Tim McCloskey · Thu, 12 Jun 2008 20:30:30 -0700 ·

Looking for some thoughts and experiences on how folks have configured their systems. Mainly in regard to classification/grouping of servers for alerting purposes. I'll try to keep this short.

Currently I'm running a total of 3 Hobbit servers, each in a different data center. Each server monitors the clients local to its network, in addition to each partner server's SMTP box, etc. This all works fine. However, our alerting system, which also works fine, is overly complex and contains too many opportunities for bugs.

In a nutshell, we have 3 groups of sysadmins that rotate on call every nn interval. Each group may be involved with a number of systems in each location and some of the admins will work on multiple Operating Systems.

I'm looking for a way to avoid having specific alert rules for each server (lots of text, even with regex macros/vars). More to the point, I want to categorize the servers by sysadmin group so that the rules can be considerably less complex.
Dividing the alerting by OS category does not work well, as some of the admins are cross-platform folks.
Dividing the alerting by page does not work well, as the same 'page' may contain servers belonging to one or more sysadmin groups. The 'Class' statement for bb-hosts seems like a possibility; however, I think its intended purpose is more related to whatever logs are defined in client-local, so I don't think that will work beyond log files.

Ideally I'd like to define the sysadmin group in the bb-hosts file but I don't think this is possible.

In summary, if I maintain immense configuration files with somewhat repetitive data, Hobbit works quite well. I'd like to reduce the complexity but maintain the functionality. Maybe it's not in the cards, or maybe - and I am hoping this is the case - I missed some cool flag or config setting.

Thoughts?

list Vernon Everett · Fri, 13 Jun 2008 14:30:07 +0800 ·

Separate team, separate page(s). 
Look up PAGE= in the hobbit-alerts man page. 
Saved me a lot of pain.

Cheers
    Vernon


list Tim McCloskey · Fri, 13 Jun 2008 22:03:52 -0700 ·

Thanks Vernon.  I was really trying to avoid that route, even though it seems to be the cleanest approach available at this time.  Seems to be a common thread here so I'll stop beating the bush on this one.....


Everett, Vernon wrote:

Separate team, separate page(s).

Tim McCloskey wrote:

Dividing the alerting by page does not work well as the same 'page' may
contain servers belonging to one or more sysadmin groups.

list Doug Linder · Mon, 16 Jun 2008 13:07:22 -0400 ·

▸ quoted from Vernon Everett

Currently I'm running a total if 3 hobbit servers, each in a 
different data center.

Why 3 servers?

We use one server to monitor hundreds of systems in data centers all
over the world.  Having one centralized configuration sure makes life a
lot easier than trying to maintain three of them.

Doug Linder

list Josh Luthman · Mon, 16 Jun 2008 13:24:14 -0400 ·

Not sure what the real reasoning is behind this, but if you have 1000
servers monitored behind each of 3 Hobbit servers, then when one Hobbit
server goes down you lose monitoring for 1000 of the 3000.  If you have
3000 servers being monitored behind 1 Hobbit server, that single point of
failure leaves you blind to all 3000 servers.

Those are my thoughts, at least =)


--


Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Joe Sloan · Mon, 16 Jun 2008 10:45:40 -0700 ·

▸ quoted from Josh Luthman

Josh Luthman wrote:

Not sure what the real reasoning is behind this but if you have 1000
servers monitored behind 3 hobbit servers each, figure one Hobbit
server goes down you lost 1000/3000 being monitored.  If you have 3000
servers being monitored behind 1 hobbit server, that one point of
failure leaves you blind of all 3000 servers.

We do it with redundancy. Each server in our various data centers is monitored by two bb servers, with one of the two set up to send notifications, but in all other aspects the monitoring is active/active, and we get only one notification for alerts, rather than a pair of redundant notifications.

We've not had a bb server go down in all the years we've been using it, but sometimes wan connectivity goes away due to circumstances beyond our control, and a bb server in Arizona can't talk to the corresponding bb server in California, so the normally passive monitoring server goes into failover mode, and begins sending notification for alerts, since it can't verify that the other bb server is alive.

Thus, we always receive notifications for all alerts, and in the worst case we may get redundant notifications in the case of a split brain situation, which is the lesser of the evils.

Once this notification failover capability makes it into hobbit, we can finally switch from bb to hobbit.

Joe
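
The failover rule Joe describes can be sketched roughly as below. This is a minimal illustration in Python, not BB or Hobbit code: the function names, the TCP liveness check, and the host and port in the usage note are all assumptions made for the example.

```python
import socket


def peer_is_alive(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the peer monitoring server succeeds.

    Using a plain TCP connect as the liveness check is an assumption.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def should_notify(is_primary: bool, peer_alive: bool) -> bool:
    """Decide whether this monitoring server sends notifications.

    The primary always notifies. The passive server notifies only when it
    cannot verify that its peer is alive, so a split brain yields duplicate
    notifications rather than none.
    """
    return is_primary or not peer_alive
```

On the passive server this might be called as `should_notify(False, peer_is_alive("bb-primary.example.com", 1984))`, where the host name and port are placeholders.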

list Josh Luthman · Mon, 16 Jun 2008 13:57:39 -0400 ·

This is quite obviously a well-known problem and a sought-after feature
- getting redundant Hobbit servers.

Please help us, code monkeys =)

Josh


list Doug Linder · Mon, 16 Jun 2008 14:07:37 -0400 ·

▸ quoted from Josh Luthman

Sloan [mailto:user-b1d2c84d244b@xymon.invalid] wrote:

We've not had a bb server go down in all the years we've been 
using it, but sometimes wan connectivity goes away due to 
circumstances beyond our control

This is by far the biggest annoyance we have with all system monitoring
- when networks go down.  It's a problem with every monitoring tool
there is and I can't think of any way to solve it: the monitoring system
has no way of knowing whether a system is down because it crashed or if
it's down because the network went down.  All it knows is that it can't
talk to the system anymore and something is wrong, so it generates an
alert.  When a whole network goes down, it can become hundreds of
simultaneous alerts.  And that's annoying enough when it's just email
alerts.  When you use Hobbit to generate cases in your trouble ticket
system, that can be hundreds of new, useless cases to manually close.

We don't want to raise the amount of time a system has to be down before
Hobbit generates an alert, because we want to know as soon as possible.
But if we keep that number too low, then when the network has a brief
hiccup, we get hundreds of redundant cases.  This is especially a
problem with overseas networks on the WAN.

I think the only possible solution would be for Hobbit to have some kind
of flood-detection routine built in, where it could tell how rapidly it
was sending alerts about connection problems for machines all on the
same network, and was smart enough to think "Whoa, I'm about to send 100
connection alarms about systems on the same network.  Instead of
sending 100 of them, maybe I'll just send ONE alert saying 'You got a
big problem here.'"

Doug Linder
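
The flood-detection idea could be sketched along these lines. This is a hypothetical illustration in Python, not an existing Hobbit feature: `consolidate`, the /24 grouping, and the threshold of 5 are all assumptions chosen for the example.

```python
from collections import defaultdict
import ipaddress


def consolidate(alerts, prefix_len=24, threshold=5):
    """Collapse connectivity alerts that share a network into one summary.

    alerts: list of (hostname, ip_address) pairs for hosts that failed
    their connectivity test in the current polling cycle.
    Returns the list of messages that would actually be sent.
    """
    by_net = defaultdict(list)
    for host, ip in alerts:
        # strict=False lets us derive the network from a host address
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        by_net[net].append(host)

    messages = []
    for net, hosts in by_net.items():
        if len(hosts) >= threshold:
            # Many hosts on one network failing at once: send one summary
            messages.append(
                f"{len(hosts)} hosts unreachable on {net}: possible network outage"
            )
        else:
            messages.extend(f"{h} is unreachable" for h in hosts)
    return messages
```

Grouping by address prefix is only a stand-in for "same network"; a real deployment would more likely group by site or by upstream router.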

list Josh Luthman · Mon, 16 Jun 2008 14:15:36 -0400 ·

That is one thing I have thought about bringing up a few times - a
summary alert.

When the power goes out or the WAN has issues, I get text messages about
very important servers.  The problem is that when they go up
and down, it is very irritating to battle through even several messages
on my phone.  I have a BB8800 which allows me to go through them
pretty quickly, but for an admin with a RAZR a dozen text messages would
take several minutes to go through.

Maybe we could get some sort of toggle-able proxy for all alerts and
the proxy sends out a summary every 60s?  Just tossing ideas out here
at this point.

Josh


list Rich Smrcina · Mon, 16 Jun 2008 13:17:31 -0500 ·

If this is a situation of routed networks, Hobbit can know about that 
with directives in the bb-hosts file.  If it knows that a router is 
down, it will only notify for the router, not for the hosts behind 
that router.
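
The suppression behaviour described here can be pictured with a small sketch. It is written in Python purely for illustration and is not how Hobbit implements it internally; `hosts_to_alert` and the shape of the `routes` mapping are assumptions.

```python
def hosts_to_alert(routes, down):
    """Return the down hosts that should actually generate a notification.

    routes: maps each host to the list of routers it sits behind
            (hypothetical structure, standing in for bb-hosts route data).
    down:   set of host names currently failing their connectivity test.
    A host is suppressed when any router on its path is itself down.
    """
    return sorted(
        h for h in down
        if not any(router in down for router in routes.get(h, []))
    )
```

With `routes = {"web1": ["rtr-nm"], "web2": ["rtr-nm"]}` and all three names down, only `rtr-nm` is returned; if just `web1` is down, `web1` is returned.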


--


Rich Smrcina
VM Assist, Inc.
Phone: XXX-XXX-XXXX
Ans Service:  XXX-XXX-XXXX
user-61add9955ef9@xymon.invalid
http://www.linkedin.com/in/richsmrcina

Catch the WAVV!  http://www.wavv.org
WAVV 2009 - Orlando, FL - May 15-19, 2009

list Doug Linder · Mon, 16 Jun 2008 14:20:08 -0400 ·

It would be nice to have an elegant solution, but I don't worry about it
that much because 1) linux servers go down so infrequently, and 2) it
would be pretty trivial to set up your own redundancy between hobbit
servers.  I can think of half a dozen ways to do it off the top of my
head.  For example:

Main Hobbit Server (MHS) does its thing normally.  Backup Hobbit Server
(BHS) syncs/mirrors the drive of the MHS server via rsync or whatever,
and runs a copy of hobbit which monitors only one other system: the MHS.
If the BHS detects that the MHS is down, the alert triggers a script
that brings up its mirror copy of the server.

Doug

▸ quoted from Josh Luthman

-----Original Message-----
From: Josh Luthman [mailto:user-4c45a83f15cb@xymon.invalid] Sent: Monday, June 16, 2008 1:58 PM
To: user-ae9b8668bcde@xymon.invalid
Subject: Re: [hobbit] grouping methods

This is quite obviously a well found problem and sought after feature
- getting redundant Hobbit servers.

Please help us, code monkeys =)

Josh


list Josh Luthman · Mon, 16 Jun 2008 14:27:25 -0400 ·

Yes - I have that set up with customers' routers and CPEs.

The real problem is when, for example, 3 servers in one data center in
New Mexico lose connectivity with us in Ohio.  Then I get 3 SMS
messages on my phone, followed by 3 more when it comes back up.

It would be very convenient to have 1 message saying this, that and
another thing went down in the last 60s.

▸ quoted from Rich Smrcina


On Mon, Jun 16, 2008 at 2:17 PM, Rich Smrcina <user-cf452ff334e0@xymon.invalid> wrote:

If this is a situation of routed networks, Hobbit can know about that with
directives in the bb-hosts file.  If it knows a host behind a router is
down, it will only notify for the router, not the hosts behind the router.


list Greg L Hubbard · Mon, 16 Jun 2008 13:36:46 -0500 ·

I have used this method with great success, but it is a pain in the
you-know-what to maintain.  It would be nice if this "router" tagging
could be made recursive so you only have to specify one upstream host
for each host, assuming that the upstream host is also in Hobbit.  As it
is today you have to specify the full path to each "leaf" and this can
get long.

GLH
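
Until something like this exists in Hobbit, the recursion Greg asks for could be done outside it, by generating the full path from a table that lists only each host's immediate upstream. The Python below is a sketch under stated assumptions: `route_path`, `route_tag`, and the `upstream` mapping are hypothetical, and the ordering of routers in the emitted `route:` tag (closest to the monitoring server first) is an assumption that should be checked against the bb-hosts man page.

```python
def route_path(host, upstream):
    """Expand single upstream pointers into the full router path for a host.

    upstream: maps a host to its one immediate upstream host; hosts at the
              top of the tree are simply absent from the mapping.
    Returns the routers ordered from the top of the tree down to the
    host's immediate upstream.
    """
    path = []
    seen = {host}
    current = upstream.get(host)
    while current is not None:
        if current in seen:
            raise ValueError(f"routing loop at {current}")
        seen.add(current)
        path.append(current)
        current = upstream.get(current)
    path.reverse()
    return path


def route_tag(host, upstream):
    """Build the text of a route tag for one host, or '' if it has no upstream."""
    path = route_path(host, upstream)
    return "route:" + ",".join(path) if path else ""
```

For `upstream = {"leaf": "edge", "edge": "core"}`, `route_tag("leaf", upstream)` gives `route:core,edge`, so only the one-line-per-host upstream table needs maintaining by hand.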

▸ quoted from Rich Smrcina

-----Original Message-----
From: Rich Smrcina [mailto:user-cf452ff334e0@xymon.invalid] Sent: Monday, June 16, 2008 1:18 PM
To: user-ae9b8668bcde@xymon.invalid
Subject: Re: [hobbit] grouping methods

If this is a situation of routed networks, Hobbit can know about that
with directives in the bb-hosts file.  If it knows a host behind a
router is down, it will only notify for the router, not the hosts behind
the router.


list Rich Smrcina · Mon, 16 Jun 2008 13:41:41 -0500 ·

Oh, I think I get it.... you want to be able to consolidate 
notifications.  Somehow, if Hobbit knows that the same person is going 
to get notified of multiple events, it should send only one notification.

Yes, nice....

▸ quoted from Josh Luthman


Josh Luthman wrote:

Yes - I have that setup with customers' routers and CPEs.

The real problem is when, for example, 3 servers in one data center in
New Mexico lose connectivity with us in Ohio.  Then I get 3 SMS
messages on my phone, followed by 3 more when it comes back up.

It would be very convenient to have 1 messages saying this, that and
another thing went down in the last 60s.


list Josh Luthman · Mon, 16 Jun 2008 14:44:04 -0400 ·

I would rather sever my you-know-what than have to go through all of that =)

That is a LOT of work to do.  I'll put up with the annoying
multi-message nights instead of doing that!

▸ quoted from Greg L Hubbard


On Mon, Jun 16, 2008 at 2:36 PM, Hubbard, Greg L <user-d970b5e56ec9@xymon.invalid> wrote:

I have used this method with great success, but it is a pain in the
you-know-what to maintain.  It would be nice if this "router" tagging
could be made recursive so you only have to specify one upstream host
for each host, assuming that the upstream host is also in Hobbit.  As it
is today you have to specify the full path to each "leaf" and this can
get long.

GLH


list Ralph Mitchell · Mon, 16 Jun 2008 14:00:53 -0500 ·

▸ quoted from Rich Smrcina

On Mon, Jun 16, 2008 at 1:41 PM, Rich Smrcina <user-cf452ff334e0@xymon.invalid> wrote:

Oh, I think I get it.... you want to be able to consolidate notifications.
 Somehow, if Hobbit knows that the same person is going to get notified of
multiple events, that it should only send one.

Yes, nice....


It might not be perfect, but perhaps that could be managed via a couple of
scripts.  Configure Hobbit to alert using the SCRIPT option, and have that
script append the message to a file named for the recipient.  Have a second
script fired by cron that would do the delivery via email, SMS, etc, then
delete the file.

Ralph Mitchell
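
Ralph's two-script arrangement might look something like the following. It is a minimal Python sketch, not a tested Hobbit integration: the function names and the `deliver` callback are hypothetical, and the recipient and message are taken as plain arguments rather than from whatever Hobbit hands to a SCRIPT recipient.

```python
from pathlib import Path


def spool_alert(spool_dir, recipient, message):
    """First script: append one alert to the file named for the recipient."""
    path = Path(spool_dir) / recipient
    with path.open("a", encoding="utf-8") as f:
        f.write(message.rstrip("\n") + "\n")


def flush_spool(spool_dir, deliver):
    """Second script (run from cron): send each recipient one combined message.

    deliver(recipient, body) is a stand-in for the real email or SMS step.
    Each spool file is deleted once its contents have been handed over.
    """
    for path in sorted(Path(spool_dir).iterdir()):
        if not path.is_file():
            continue
        body = path.read_text(encoding="utf-8")
        if body:
            deliver(path.name, body)
        path.unlink()
```

A real version would need some locking, since an alert appended while the flush is reading a file could be lost when that file is deleted.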

list Rich Smrcina · Mon, 16 Jun 2008 14:08:05 -0500 ·

Might it be helpful if router paths could be 'macro-ed'?  Something like 
the macros for notifications... so that only the macro definitions had to be maintained?

Granted, I like Greg's idea much better... :)

▸ quoted from Josh Luthman


Josh Luthman wrote:

I would rather sever my you-know-what then have to go through all of that =)

That is a LOT of work to do.  I'll put up with the annoying
multi-message nights instead of doing that!

On Mon, Jun 16, 2008 at 2:36 PM, Hubbard, Greg L <user-d970b5e56ec9@xymon.invalid> wrote:

I have used this method with great success, but it is a pain in the
you-know-what to maintain.  It would be nice if this "router" tagging
could be made recursive so you only have to specify one upstream host
for each host, assuming that the upstream host is also in Hobbit.  As it
is today you have to specify the full path to each "leaf" and this can
get long.

GLH


list Josh Luthman · Mon, 16 Jun 2008 15:22:09 -0400 ·

Exactly right! :)


On 6/16/08, Rich Smrcina <user-cf452ff334e0@xymon.invalid> wrote:

Oh, I think I get it.... you want to be able to consolidate
notifications.  Somehow, if Hobbit knows that the same person is going
to get notified of multiple events, that it should only send one.

Yes, nice....


list Dave Haertig · Mon, 16 Jun 2008 17:16:37 -0600 ·

I wrote a custom alert script to handle this.  The first alert is sent
immediately, then the rest are spooled up and sent later, as a batch.

(1) The alert script first checks if its spool file exists.  If so, the
current alert is appended to that file and the custom alert script
exits.  There is one spool file per recipient address.

(2) If the spool file does not exist, the custom alert script sends out
the current alert as normal and then creates a zero length spool file.
It also creates an "at" job.  The "at" job will mail the spool file to
its normal recipient after one hour's wait, and then delete it.  This
spoolfile deletion resets the spooling.  You can vary the one-hour
setting to suit your needs.

(3) When the "at" job fires it mails the spoolfile if it is non-zero
length.  Then it deletes the spoolfile (reset).
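
Steps (1) to (3) could be sketched roughly as follows in Python. This
is not the script described above, which was not posted; the function
names, the spool file naming, and the send callback are assumptions
made for illustration, and the scheduling of the delayed "at" job is
left to the caller.

```python
import os


def spool_path(spool_dir, recipient):
    # One spool file per recipient address (file naming is hypothetical).
    return os.path.join(spool_dir, recipient.replace("/", "_") + ".spool")


def handle_alert(spool_dir, recipient, message, send):
    """Spool the alert if a spool file exists, otherwise send it now."""
    path = spool_path(spool_dir, recipient)
    if os.path.exists(path):
        # Step (1): spooling is active, so append and stop.
        with open(path, "a") as fh:
            fh.write(message + "\n")
        return "spooled"
    # Step (2): send as normal, then create a zero-length spool file.
    send(recipient, message)
    open(path, "w").close()
    # The described script also queues an "at" job here (not shown).
    return "sent"


def flush_spool(spool_dir, recipient, send):
    """Step (3): mail the spool file if it is non-empty, then delete it."""
    path = spool_path(spool_dir, recipient)
    if not os.path.exists(path):
        return False
    with open(path) as fh:
        body = fh.read()
    if body:
        send(recipient, body)
    os.remove(path)  # deleting the file resets the spooling
    return bool(body)
```

Passing the delivery function in as send keeps the spooling logic
separate from whether the alert goes out by mail or SMS.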

Enhancements:

I found that if the server reboots while the spool file is spooling, the
"at" job gets killed and you end up endlessly spooling forever and ever.
To work around this:

(1) The custom alert script was modified to check the age of the
spoolfile as its first step.  If it's "too old" (in my example, over 1
hour 15 minutes old), the alert script mails it immediately, deletes it,
and then starts from the beginning with the current alert.

(2) Additionally, a cronjob was added to check for stale spoolfiles.
The job runs every 15 minutes and looks for spoolfiles over 1 hour 15
minutes old. If any are found, the cronjob does the mailing and
deleting.
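
The stale-spoolfile check might look something like this in Python.
The message above does not say how the age is measured. Because a
file's modification time is refreshed by every append, this sketch
assumes, hypothetically, that the spool file name carries the time
spooling began, as in "rcpt.1213650000.spool". That naming scheme and
the function names are illustration only.

```python
import os
import time

MAX_AGE_SECONDS = 75 * 60  # 1 hour 15 minutes, as in the description


def spool_started(filename):
    """Return the start time embedded in the name, or None if absent."""
    parts = filename.rsplit(".", 2)
    if len(parts) != 3 or parts[2] != "spool" or not parts[1].isdigit():
        return None
    return int(parts[1])


def stale_spoolfiles(spool_dir, max_age=MAX_AGE_SECONDS, now=None):
    """List spool files that began spooling more than max_age seconds ago."""
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(spool_dir)):
        started = spool_started(name)
        if started is not None and now - started > max_age:
            stale.append(os.path.join(spool_dir, name))
    return stale
```

Both the alert script and the 15-minute cronjob could call the same
check, then mail and delete whatever it returns.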

Those are the basics.  I enhanced it further so that different alert
types could be grouped together into different spoolfiles and spooling
could be for different lengths of time.  I did this by symlinking the
alert script to different names.  The name of the symlink was structured
and the script looked at how it was invoked and parsed out the spooling
group and length of time from its invocation name.  The specific
spoolfile was then named based on recipient, spool duration, and group.
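
As a rough illustration of the symlink trick, here is a Python sketch
that reads the spooling group and duration from the name the script
was invoked under. The actual naming structure is not given above, so
the "alert-<group>-<minutes>m" pattern, the fallback values, and the
spool file name format are all assumptions.

```python
import os
import re

# Hypothetical symlink naming scheme, e.g. "alert-db-30m".
NAME_PATTERN = re.compile(r"^alert-(?P<group>[a-z0-9]+)-(?P<minutes>\d+)m$")


def parse_invocation(argv0):
    """Split the invocation name into (group, minutes)."""
    match = NAME_PATTERN.match(os.path.basename(argv0))
    if match is None:
        # Unstructured name: fall back to one shared group, one-hour spool.
        return ("default", 60)
    return (match.group("group"), int(match.group("minutes")))


def spoolfile_name(recipient, argv0):
    """Name the spool file from recipient, spool duration, and group."""
    group, minutes = parse_invocation(argv0)
    return "{}.{}m.{}.spool".format(recipient, minutes, group)
```

A script would pass sys.argv[0] as argv0, so each symlink gets its own
spool files without any extra configuration.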

It is more complex to describe what I did than to actually code it!
Unfortunately I cannot post the script.  It does a bunch more than just
this spooling function, some of that being company proprietary.  It
would take quite a bit of work for me to strip out the proprietary stuff
to create a generic demonstration script for posting.

This script also has a function similar to spooling, but not quite,
implemented as a symlink to a different name.  I call it a "consolidate"
function.  It works pretty much the same as spooling, but instead of
sending the spoolfile after an hour, it only waits 5 minutes, deletes
the spoolfile without mailing it, and then basically does a "screen
scrape" of the bb2.html page and lists all the non-green lights it finds
there.  This works well for pagers.  Rather than getting a whole bunch
of pages, you get one page that lists all the current light statuses.
As part of the consolidation during the screen scrape (actually I open
the actual html file, so I'm dependent on consistent file structure,
unfortunately) I heavily abbreviate things so they will fit in the tight
SMS message length limits.  A consolidate message might look like this
cryptic example, but I know what it means!  "!BB! R:testa:srv1
R:testc:srv7 Y:testf:srv2 P:testq:srv3"  I list things in order of
importance (reds before yellows, etc) so if the message does get
truncated, the most important parts make it through.
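
The abbreviation and ordering step could be sketched like this in
Python. The screen scrape itself is left out; the input is assumed to
be a list of (color, test, host) tuples already extracted from the
page. The field order inside each entry, the 160-character default
limit, and the function name are assumptions, and dropping whole
entries at the limit is a choice made here, not necessarily what the
script above does.

```python
# Lower number = more important; order follows the example message.
SEVERITY = {"red": 0, "yellow": 1, "purple": 2}


def build_page(statuses, limit=160, prefix="!BB!"):
    """Build one abbreviated line, most severe entries first."""
    ordered = sorted(statuses, key=lambda s: SEVERITY.get(s[0], len(SEVERITY)))
    message = prefix
    for color, test, host in ordered:
        entry = " {}:{}:{}".format(color[0].upper(), test, host)
        if len(message) + len(entry) > limit:
            break  # drop whole entries rather than cut one in half
        message += entry
    return message


statuses = [
    ("yellow", "testf", "srv2"),
    ("red", "testa", "srv1"),
    ("purple", "testq", "srv3"),
    ("red", "testc", "srv7"),
]
print(build_page(statuses))
# prints: !BB! R:testa:srv1 R:testc:srv7 Y:testf:srv2 P:testq:srv3
```

Because the sort puts reds first, anything lost at the length limit is
the least important part of the list.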

 
-----Original Message-----
From: Linder, Doug (SABIC Innovative Plastics, consultant)
[mailto:user-c834f078a0a6@xymon.invalid] 
Sent: Monday, June 16, 2008 12:08 PM

To: user-ae9b8668bcde@xymon.invalid
Subject: RE: [hobbit] grouping methods


Sloan [mailto:user-b1d2c84d244b@xymon.invalid] wrote:

We've not had a bb server go down in all the years we've been using 
it, but sometimes wan connectivity goes away due to circumstances 
beyond our control

This is by far the biggest annoyance we have with all system monitoring
- when networks go down.  It's a problem with every monitoring tool
there is and I can't think of any way to solve it: the monitoring system
has no way of knowing whether a system is down because it crashed or if
it's down because the network went down.  All it knows is that it can't
talk to the system anymore and something is wrong, so it generates an
alert.  When a whole network goes down, it can become hundreds of
simultaneous alerts.  And that's annoying enough when it's just email
alerts.  When you use Hobbit to generate cases in your trouble ticket
system, that can be hundreds of new, useless cases to manually close.

We don't want to raise the amount of time a system has to be down before
Hobbit generates an alert, because we want to know as soon as possible.
But if we keep that number too low, then when the network has a brief
hiccup, we get hundreds of redundant cases.  This is especially a
problem with overseas networks on the WAN.

I think the only possible solution would be for Hobbit to have some kind
of flood-detection routine built in, where it could tell how rapidly it
was sending alerts about connection problems for machines all on the
same network, and was smart enough to think, "Whoa, I'm about to send 100
connection alarms about systems on the same network.  Instead of
sending 100 of them, maybe I'll just send ONE alert saying 'You got a
big problem here.'"
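
To make the flood-detection idea concrete, here is a small Python
sketch. It is not something Hobbit provides; the function name, the
threshold, and the assumption that connection failures arrive already
tagged with a network name are all hypothetical. It takes the failures
seen in one time window and collapses any network with many failures
into a single alert.

```python
from collections import defaultdict


def collapse_alerts(alerts, threshold=10):
    """alerts: list of (network, host) connection failures in one window."""
    by_network = defaultdict(list)
    for network, host in alerts:
        by_network[network].append(host)
    messages = []
    for network, hosts in by_network.items():
        if len(hosts) >= threshold:
            # Many failures on one network: report the network once.
            messages.append(
                "{}: {} hosts unreachable, possible network outage".format(
                    network, len(hosts)
                )
            )
        else:
            messages.extend("{}: {} unreachable".format(network, h) for h in hosts)
    return messages
```

The trade-off is the one described above: alerts have to be held for
the length of the window before the count is known.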

Doug Linder
