autofixing

list Alan Sparks
Fri, 06 Apr 2012 15:44:31 -0600
Message-Id: <user-c86a37b2f371@xymon.invalid>

I'd generally agree that fixing the root cause whenever possible, so the
problem doesn't occur, is preferable.  In a past life, we did do some of
this - of course, we did whatever we could to prevent the problem in the
first place... but web server instances crash, and sometimes traffic
irregularities cause logs to fill faster than usual.

I had a hack going that involved cfengine, with cfrun callable from a
paging script.  The premise was to have cfengine invoked on the remote
node before pages actually went out (e.g., a DURATION delay on real
pages), to see if cfengine could fix the simpler problems (like a
process dying or whatnot).  If it could, we could sleep.  If not, the
second-level page went out for human intervention.
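
The flow described above (try an automated fix first, page a human only
if it does not work) could be sketched roughly as follows. This is an
illustration under assumptions, not the original script: the names
run_remote_fix and handle_alert and the callables passed in are
hypothetical, and the command line used to invoke cfengine on the remote
node is left to the caller rather than assumed.

```python
import subprocess
from typing import Callable, Sequence


def run_remote_fix(command: Sequence[str], timeout: float = 300.0) -> bool:
    """Run a remediation command; True only if it exits 0 within the timeout.

    The command itself is an assumption supplied by the caller, for example
    whatever invokes cfengine on the affected node.
    """
    try:
        completed = subprocess.run(list(command), timeout=timeout, check=False)
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary, permission problem, or a hung fix: treat as failure.
        return False
    return completed.returncode == 0


def handle_alert(
    host: str,
    attempt_fix: Callable[[str], bool],
    is_healthy: Callable[[str], bool],
    send_page: Callable[[str], None],
) -> str:
    """Try the automated fix, re-check the host, and page only if still broken."""
    if attempt_fix(host) and is_healthy(host):
        return "fixed"
    send_page(host)  # second-level page for human intervention
    return "paged"
```

The health re-check after the fix matters: a remediation run that exits
cleanly does not prove the service is back, so the page is suppressed only
when both the fix and the follow-up check succeed.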

We didn't do much autofixing... there wasn't a lot in the environment
that lent itself to such.  Either we engineered an HA environment
(clustered) where a dead machine didn't affect the service... or the
problem was probably not simple to fix, and we needed real eyes/brains
on it.
-Alan

On 4/6/2012 3:31 PM, Larry Barber wrote:

Resending to the list, Gmail seems to be hiding the "reply to all".

Thanks,
Larry Barber

On Fri, Apr 6, 2012 at 4:28 PM, Larry Barber <user-6ef9c2864140@xymon.invalid> wrote:

    The kind of things that you can automate should be handled
    routinely, not be triggered by an alert from your monitoring tool.
    If you have logs growing so fast that they are filling up your file
    system, you should find out what is filling them up and why, and then
    fix that. Automatic log rotation and compression should be done by a
    tool like logrotate, not Xymon or any other monitoring tool. You
    shouldn't be using a monitoring tool to trigger routine maintenance;
    it simply causes unnecessary alerts that cause problems in other areas.

    Thanks,
    Larry Barber


    On Fri, Apr 6, 2012 at 4:06 PM, KING, KEVIN <user-ca972c0c43a8@xymon.invalid> wrote:

        Larry,

        Some auto-correcting is not bad.  Back in the Big Brother days I
        had a datacenter and a team of folks. We managed to the “yellow”
        alerts. I had folks correct and build scripts to address the
        things that brought on the yellow so we never saw the red.  This
        made it so very little red was ever seen.

        Now the things you can automate are the disk-full kind of
        things. If that happens you can run a script to clean logs,
        compress them, and that sort of thing.  This was usually handled
        by managing the yellow. There would be a script in place to keep
        the space below the yellow trigger. So if you got a red it was
        usually a big temp file or something that would get cleaned
        shortly. So, say, on the red alert you could have it run the
        cleanup script rather than waiting for your cron to do the normal
        cleanup.
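
A minimal sketch of the "run the cleanup on red instead of waiting for
cron" idea, under stated assumptions: the 90/95 percent thresholds, the
function names, and the cleanup callable are all hypothetical placeholders,
and a real setup would take its thresholds from the monitoring
configuration.

```python
import shutil
from typing import Callable


def used_percent(path: str) -> float:
    """Percentage of the filesystem holding `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify(percent: float, yellow: float = 90.0, red: float = 95.0) -> str:
    """Map a usage percentage to a status color (thresholds are assumptions)."""
    if percent >= red:
        return "red"
    if percent >= yellow:
        return "yellow"
    return "green"


def on_disk_alert(color: str, cleanup: Callable[[], None]) -> bool:
    """Run the cleanup immediately on red; return True if it ran.

    Yellow is left to the routine scheduled job, which is expected to keep
    usage below the yellow trigger in the first place.
    """
    if color == "red":
        cleanup()
        return True
    return False
```

For example, on_disk_alert(classify(used_percent("/var")), cleanup) would
run the cleanup callable only when /var has crossed the red threshold.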


        Now on other issues it really depends on what the alert is
        about. You cannot automate everything economically. At some
        point it is cheaper and faster to put a human in the loop. I did
        have a script that would take the e-mail response to the alert,
        parse the message, and do the work. This was back in the day
        with the RIM pagers. So you got an alert and you replied to it
        with “run clean script on host”. The reply e-mail was parsed by
        the same script we were using to acknowledge the alert. It would
        parse the reply and run a clean script. This let my admins work
        issues while away from a PC or network connection.
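
A rough sketch of how a reply such as “run clean script on host” might be
parsed, assuming a fixed phrase format and an allow-list of actions. The
pattern, the ALLOWED_ACTIONS table, and the script path in it are
hypothetical; the original script's actual format is not known.

```python
import re
from typing import Optional, Tuple

# Hypothetical allow-list: action keyword -> script to run on the target host.
ALLOWED_ACTIONS = {"clean": "/usr/local/bin/clean-logs.sh"}

# One command per line, e.g. "run clean script on web01".
REPLY_PATTERN = re.compile(
    r"^\s*run\s+(?P<action>[a-z]+)\s+script\s+on\s+(?P<host>[A-Za-z0-9.-]+)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_reply(body: str) -> Optional[Tuple[str, str]]:
    """Return (action, host) for a recognized command line, else None."""
    match = REPLY_PATTERN.search(body)
    if match is None:
        return None
    action = match.group("action").lower()
    if action not in ALLOWED_ACTIONS:
        # Unknown actions are ignored rather than executed.
        return None
    return action, match.group("host").lower()
```

Matching against a strict pattern and an allow-list, rather than handing
reply text to a shell, keeps a mistyped or malicious reply from running
anything unexpected.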


        I do hear and agree with your concerns. A blanket statement from
        managers that do not have a full understanding of all the
        elements is a rough thing to swallow. But their heart is in the
        right spot :)

        I guess in a rather long rambling way I am saying that you learn
        and tune your systems. Address recurring issues so they do
        not recur. Then watch for the next thing to be addressed.

        -Kevin

        *From:* xymon-bounces at xymon.com *On Behalf Of* Larry Barber
        *Sent:* Friday, April 06, 2012 1:43 PM
        *To:* xymon at xymon.com
        *Subject:* [Xymon] autofixing

        My management has gotten the idea that we should be automating
        the repair processes on our servers. They want things set up so
        that when a fault is detected a script is run that attempts to
        repair it. I've tried to convince them that this is a profoundly
        wrong-headed idea, but I'm not having much luck. Do any of you
        know of any articles or resources that might help convince them?

        Thanks,
        Larry Barber