Things that you can automate should be handled routinely, not triggered by
an alert from your monitoring tool. If you have logs growing so fast that
they are filling up your file system, you should find out what is filling
them up and why, and then fix that. Automatic log rotation and compression
should be done by a tool like logrotate, not Xymon or any other monitoring
tool. You shouldn't be using a monitoring tool to trigger routine
maintenance; it simply causes unnecessary alerts that cause problems in
other areas.
Thanks,
Larry Barber
On Fri, Apr 6, 2012 at 4:06 PM, KING, KEVIN <user-ca972c0c43a8@xymon.invalid> wrote:
Larry,
Some auto-correcting is not bad. Back in the Big Brother days I had a
datacenter and a team of folks. We managed to the “yellow” alerts. I had
folks correct the underlying problems and build scripts to address the
things that brought on the yellow, so very little red was ever seen.
Now, the things you can automate are the disk-full kind of things. If that
happens you can run a script to clean up and compress logs and so on. This
was usually handled by managing the yellow: there would be a script in place
to keep the space used below the yellow trigger. So if you got a red it was
usually a big temp file or something that would get cleaned up shortly. On
the red alert you could, say, have it run the cleanup script right away
rather than waiting for cron to do the normal cleanup.
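As a rough illustration of the "manage the yellow" idea, here is a minimal Python sketch. It is not the script we ran back then. The threshold values, the function names, and the choice to gzip *.log files older than a day are all assumptions made for the example.

```python
import gzip
import os
import shutil
import time

# Hypothetical thresholds (percent used); not taken from any real Xymon config.
YELLOW_PCT = 90.0
RED_PCT = 95.0


def classify(pct_used, yellow=YELLOW_PCT, red=RED_PCT):
    """Map a disk usage percentage to a monitor-style color."""
    if pct_used >= red:
        return "red"
    if pct_used >= yellow:
        return "yellow"
    return "green"


def percent_used(path):
    """Return the percentage of the file system holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def compress_old_logs(log_dir, min_age_seconds=86400):
    """Gzip every *.log file in log_dir not modified within min_age_seconds.

    Returns the list of compressed file paths that were created.
    """
    created = []
    cutoff = time.time() - min_age_seconds
    for name in sorted(os.listdir(log_dir)):
        src = os.path.join(log_dir, name)
        if not name.endswith(".log") or not os.path.isfile(src):
            continue
        if os.path.getmtime(src) > cutoff:
            continue  # still recent, probably being written to
        dst = src + ".gz"
        with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        os.remove(src)
        created.append(dst)
    return created


def cleanup_if_needed(log_dir):
    """Run the cleanup once usage reaches yellow, before it can turn red."""
    color = classify(percent_used(log_dir))
    if color != "green":
        return color, compress_old_logs(log_dir)
    return color, []
```

Run something like cleanup_if_needed() from cron on each host and the monitor should rarely get past yellow for this class of problem. Anything that still goes red is then worth a human look.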
Now on other issues it really depends on what the alert is about. You
cannot automate everything economically. At some point it is cheaper and
faster to put a human in the loop. I did have a script that would take the
e-mail response to an alert, parse the message, and do the work. This was
back in the day of the RIM pagers. When you got an alert, you replied to it
with “run clean script on host”. The reply e-mail was parsed by the same
script we were using to acknowledge the alert, which would then run a clean
script. This let my admins work issues while away from a PC or network
connection.
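To give a feel for the reply-parsing idea, here is a minimal Python sketch. It is not the original script. The command grammar ("run <action> script on <host>"), the whitelist, and the script path are hypothetical. It only decides what would be run and leaves execution and sender authentication to the caller.

```python
import email
import re

# Hypothetical whitelist: action name -> script path. The path is a placeholder.
ALLOWED_ACTIONS = {"clean": "/usr/local/bin/clean_logs.sh"}

# Matches a line such as "run clean script on web01".
COMMAND_RE = re.compile(
    r"^\s*run\s+(?P<action>[a-z]+)\s+script\s+on\s+(?P<host>[A-Za-z0-9.-]+)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_reply(raw_message):
    """Return (script_path, host) for the first whitelisted command, else None."""
    msg = email.message_from_string(raw_message)
    if msg.is_multipart():
        return None  # keep the sketch simple: plain-text replies only
    body = msg.get_payload()
    for match in COMMAND_RE.finditer(body):
        action = match.group("action").lower()
        if action in ALLOWED_ACTIONS:
            return ALLOWED_ACTIONS[action], match.group("host").lower()
    return None
```

The whitelist is the important part: a reply can only select from scripts you have already vetted, never supply an arbitrary command.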
I do hear and agree with your concerns. A blanket statement from managers
who do not have a full understanding of all the elements is a rough thing to
swallow. But their heart is in the right spot :)
I guess in a rather long, rambling way I am saying that you learn and tune
your systems. Address recurring issues so they do not recur. Then watch for
the next thing to be addressed.
-Kevin
From: xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com] On Behalf
Of Larry Barber
Sent: Friday, April 06, 2012 1:43 PM
To: xymon at xymon.com
Subject: [Xymon] autofixing
My management has gotten the idea that we should be automating the repair
processes on our servers. They want things set up so that when a fault is
detected, a script is run that attempts to repair it. I've tried to convince
them that this is a profoundly wrong-headed idea, but I'm not having much
luck. Do any of you know of any articles or resources that might help
convince them?
Thanks,
Larry Barber