Highlights of the 4.3.0 version

list Scott Walters
Fri, 3 Aug 2007 13:15:27 -0400
Message-Id: <user-8e8d75a3efb1@xymon.invalid>

I am definitely in the "monitor only" camp.  As appealing as
"self-healing" may seem, I've seen attempts go horribly wrong too many
times.  For example, Oracle being shut down for an upgrade and then
automatically restarted in the middle of it.  Not good.

I also agree that "self-healing" lends itself to band-aids that avoid
root-cause determination.  I don't think this requires "baby-sitting,"
but a commitment to fixing things once.  I have also had the
displeasure of making permanent band-aids, but I cannot condone it.

All of those "operational" aspects aside, I've convinced myself that,
from a security point of view, corrective action from monitoring is
bad: a clear violation of the separation of duties.  You don't want
your auditors "cleaning up" the numbers as they go over your books.

You know what's better than your webserver being automatically
restarted when it crashes?  Your webserver not crashing.

I completely support the absence of corrective actions from monitor
triggers.  The question I have yet to answer satisfactorily is, "Should
the monitoring system perform additional data collection after
specific errors?"  For example, running a particular "find" command
when disk usage increases, to try to identify which files are causing
the partition to fill.
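
As a minimal sketch of that kind of read-only follow-up collection
(not something Hobbit is known to ship), the script below walks a
directory tree and reports the largest files.  The function name
largest_files and its parameters are hypothetical, and unlike
"find -xdev" it does not stop at filesystem boundaries.

```python
import heapq
import os


def largest_files(root, limit=10):
    """Return up to `limit` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat measures a symlink itself instead of following it
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                # The file vanished or is unreadable; skip it rather than fail
                continue
    # nlargest returns the entries sorted from largest to smallest
    return heapq.nlargest(limit, sizes)


if __name__ == "__main__":
    for size, path in largest_files(".", limit=5):
        print(f"{size:>12}  {path}")
```

It only gathers data and changes nothing on the host, so it stays on
the "monitor only" side of the line.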


Scott Walters
-PacketPusher

On 8/3/07, Hubbard, Greg L <user-d970b5e56ec9@xymon.invalid> wrote:

Well, I use Netcool which has the opposite philosophy -- there is a
"process automation" system that watches processes and restarts them if
they fail, while also logging restarts.  You can configure a "restart"
parameter to be anything from 0 (forever) to any number of times.  I
like to set a reasonable number so persistent errors eventually kill the
process, but occasional errors do not.  Log files are not overwritten,
but are appended and rotated.
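
The restart limit described above can be sketched roughly as follows.
This is an illustration only, not Netcool's actual API or
configuration: the class RestartBudget and its method names are
hypothetical, and only the "0 means forever" convention is taken from
the description.  Whether the real counter ever resets is not stated,
so this sketch never resets it.

```python
class RestartBudget:
    """Hypothetical restart limiter: 0 means restart forever."""

    def __init__(self, max_restarts):
        self.max_restarts = max_restarts
        self.restarts = 0

    def allow_restart(self):
        """Return True and count the restart if the budget permits it."""
        if self.max_restarts != 0 and self.restarts >= self.max_restarts:
            # Budget exhausted: leave the process down for a human to look at
            return False
        self.restarts += 1
        return True


if __name__ == "__main__":
    budget = RestartBudget(3)
    for attempt in range(1, 6):
        print(attempt, "restart" if budget.allow_restart() else "give up")
```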

But whatever.  My view seems to be in the minority -- guess the rest of
you don't mind 24x7x365 babysitting.

GLH

-----Original Message-----
From: Galen Johnson [mailto:user-87f955643e3d@xymon.invalid]
Sent: Friday, August 03, 2007 10:18 AM
To: user-ae9b8668bcde@xymon.invalid
Subject: RE: [hobbit] Highlights of the 4.3.0 version

Don't forget... this is the model that Tivoli, HP OpenView, and many
other commercial monitoring solutions provide and sell as a feature.
From my experience as a sys admin, I've always found automatically
restarting a service when it goes down to be "a bad thing"(TM).

In many solutions, a restart overwrites logs that would be integral to
the real resolution and prevention.

=G=

-----Original Message-----
From: Tod Hansmann [mailto:user-b6e28cb93fa4@xymon.invalid]
Sent: Friday, August 03, 2007 10:40 AM
To: user-ae9b8668bcde@xymon.invalid
Subject: RE: [hobbit] Highlights of the 4.3.0 version

In my experience, I have to agree.  Hobbit is for monitoring, so that
the information that X is down gets to people who can properly diagnose
what is going on, not for taking generic actions.  If generic actions
were required for X to function properly, they should be a feature of
that software.

Hobbit CAN do some scripting based on alerts, but even that might be a
bit more than a systems administrator wants to hinder himself with.

Tod Hansmann
Network Engineer


-----Original Message-----
From: Buchan Milne [mailto:user-9b139aff4dec@xymon.invalid]
Sent: Friday, August 03, 2007 12:31 AM
To: user-ae9b8668bcde@xymon.invalid
Cc: Hubbard, Greg L
Subject: Re: [hobbit] Highlights of the 4.3.0 version

On Tuesday 24 July 2007 22:55:02 Hubbard, Greg L wrote:

Wonder if there is any way to tell a client what its status is so it
can be autonomous?  What I mean is this:  suppose there was a way for
the Hobbit client to tell the server that service X was now in state Y,
and a client-side module could then activate response Z on its own?

I don't like band-aids like this.

"restart because it's down" prevents the real impact of problems being
seen, and provides less motivation for fixing things properly. Instead,
you sit with frequent short outages (which may avoid the attention of
managers, production managers) which have end-user impact.

I like even less using a monitoring system to do this ...

Regards,
Buchan