Xymon Mailing List Archive

Highlights of the 4.3.0 version

Gary Baluha
Mon, 6 Aug 2007 09:28:31 -0400
Message-Id: <user-41a024334610@xymon.invalid>

On 8/3/07, Haertig, David F (Dave) <user-68874b735d77@xymon.invalid> wrote:
Most everything I do in Hobbit is a custom script.  Restarting crashed
processes is one of the least of my worries.  Although in some rare
cases I do just that (short term), with appropriate logging and email to
the app development team.  The corporate expense of having the app down
is too great to let Utopian ideas prevail.

Agreed, though sometimes an extra few minutes of downtime is worth it to do
*some* analysis.

Most of the automated Hobbit stuff I do is not restarting dead apps
(luckily, that is very infrequent around here).  It's more mundane.  One
example is disk space.  A full filesystem would shut many things down.
Apps should not fill a filesystem, but sometimes they do.  So my custom
Hobbit scripts first scream and scream about low disk space, even
analysing things down to specific subdirectories and fast growing files
and doing trend analysis.  But if their call is not answered, they start
freeing up space from a "private reserve" I have set aside to deal with
emergencies.  So if we experience a sudden unexpected blowup in a
filesystem at 3am, Hobbit keeps things running in production until the
appropriate people can look into and diagnose the problem.  This may not
be Utopian behavior, but it sure is practical at 3am in the morning!
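
The "private reserve" idea above can be sketched roughly as follows. This is
only an illustration, not Dave's actual script: it assumes the reserve is a
directory of pre-allocated ballast files, and the names release_reserve,
maybe_release, min_free and unanswered are all made up for the example.

```python
import os
import shutil


def free_fraction(path):
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def release_reserve(reserve_dir, keep=0):
    """Delete ballast files from reserve_dir, largest first.

    Returns the number of bytes released. `keep` of the smallest ballast
    files are left in place so a later emergency still has something to free.
    """
    files = [os.path.join(reserve_dir, n) for n in os.listdir(reserve_dir)]
    files = [f for f in files if os.path.isfile(f)]
    files.sort(key=os.path.getsize, reverse=True)
    released = 0
    for f in files[: max(len(files) - keep, 0)]:
        released += os.path.getsize(f)
        os.remove(f)
    return released


def maybe_release(path, reserve_dir, min_free=0.05, unanswered=False):
    """Release the reserve only if space is low AND nobody has responded."""
    if unanswered and free_fraction(path) < min_free:
        return release_reserve(reserve_dir)
    return 0
```

The "unanswered" flag stands in for whatever escalation logic decides that
the earlier alerts were ignored; the sketch deliberately refuses to free
anything while someone may still be looking at the problem.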

What sort of trend analysis do your scripts perform?  We have a few boxes
that are notorious for filling up their disk space, and I haven't yet come
up with an idea of how to neatly track exactly what it is that keeps filling
up the disk.
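
One possible starting point, offered as a sketch rather than a description of
anyone's existing scripts: take a snapshot of per-directory sizes at each run,
keep the previous snapshot, and rank directories by how much they grew in
between. The function names dir_sizes and fastest_growing are invented for
the example, and sizes are apparent file sizes, not allocated disk blocks.

```python
import os


def dir_sizes(root):
    """Map each immediate subdirectory of root to its total size in bytes."""
    sizes = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirs, files in os.walk(entry.path):
            for name in files:
                try:
                    total += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    pass  # file vanished between listing and stat
        sizes[entry.name] = total
    return sizes


def fastest_growing(old, new, top=5):
    """Rank directories by bytes gained between two snapshots."""
    growth = {name: size - old.get(name, 0) for name, size in new.items()}
    ranked = sorted(growth.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, delta) for name, delta in ranked[:top] if delta > 0]
```

Storing each snapshot (as JSON, say) between runs would give the history
needed to see which directory keeps filling the disk.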

But my vote would be for Hobbit out-of-the-box to NOT attempt automated
repair actions.  That should be left to the Hobbit administrator.  We
can write custom monitor scripts or custom alert scripts to add this
functionality if it's appropriate for our environments.  It's trivial to
integrate your own scripting into Hobbit.
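
To give a feel for how little is involved, here is a minimal sketch of a
custom test reporting its result. It assumes the usual client-side setup in
which the script is launched with $BB and $BBDISP set in its environment,
and it assumes hostnames are sent with dots changed to commas; check both
against your own installation. The function names are made up.

```python
import os
import subprocess


def build_status(host, test, color, text):
    """Format a status message of the form 'status HOST.TEST COLOR text'."""
    if color not in ("green", "yellow", "red", "clear"):
        raise ValueError("unexpected color: %s" % color)
    # Dots in the hostname are changed to commas so that the last dot
    # separates the host from the test name.
    return "status %s.%s %s %s" % (host.replace(".", ","), test, color, text)


def send_status(message):
    """Hand the message to the sender program named by $BB and $BBDISP.

    Assumes the script runs from the Hobbit client environment, where
    those variables are set.
    """
    subprocess.run([os.environ["BB"], os.environ["BBDISP"], message], check=True)
```

A script would call build_status() with its own test name and findings, then
pass the result to send_status().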

Due to the demands of some of the other admins, I have implemented a script
that does some rudimentary restarting, and even looks at the status of the
specific Hobbit alert in question, so that it doesn't try to restart
something, if the alert has been disabled (such as for a planned downtime).
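
The check described above can be sketched like this. It is a simplified
illustration, not the script itself: it assumes the current status can be
fetched as a line of text that begins with the color, and that a disabled
test shows up as "blue". The query and restart callables are placeholders
supplied by the caller.

```python
def should_restart(status_line):
    """Decide from the first line of a status whether a restart is allowed.

    The line is expected to begin with the color. Only "red" permits a
    restart; "blue" (taken here to mean disabled) and anything unknown do not.
    """
    words = status_line.split()
    color = words[0].lower() if words else ""
    return color == "red"


def guarded_restart(query, restart, host, test):
    """Restart only when the queried status allows it.

    `query(host, test)` returns the status line and `restart()` does the
    work; both are supplied by the caller.
    """
    if should_restart(query(host, test)):
        restart()
        return True
    return False
```

Treating every color other than red as "leave it alone" keeps the script from
acting during planned downtime or when the status cannot be determined.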

It wasn't all that hard to write, and I also would prefer Hobbit NOT have
auto-restart logic out of the box.

I sure wish I worked in Utopia though.  The job would be a helluva lot
less stressful!  :-)

Working in the real world isn't so bad compared to working in the real world
where management _thinks_ you actually work in Utopia, and yet still can't
spare an extra second of downtime for real-time root cause analysis. ;-)