On Friday 03 August 2007 19:15:27 Scott Walters wrote:
I am definitely in the "monitor only" camp. As appealing as
"self-healing" may seem, I've seen attempts go horribly wrong too many
times. For example, Oracle being shut down for an upgrade and then
automatically restarted in the middle of it. Not good.
How about the easy example of a web server not responding? Do you restart it? In the case I am thinking of, no, because the reason it is not responding is that the database server it (and another 4 web servers) is waiting on is having problems. Restarting the web server would drop the >1000 existing (working) sessions, causing a full-blown outage, and migrate the problem to the other 4 web servers that sit behind the same load balancer.
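
A minimal sketch of that reasoning in Python, assuming the monitor can probe both the web server and its database dependency. The names `tcp_reachable` and `classify` are hypothetical, and a plain TCP connect is only a stand-in for whatever health check you really use. The point is that the check reports and never restarts:

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify(web_ok, db_ok):
    """Turn two probe results into a report for a human; take no action."""
    if web_ok:
        return "ok"
    if not db_ok:
        # The web server is waiting on the database, so restarting it would
        # only drop working sessions and push the load onto its peers.
        return "web degraded: database dependency failing, restart would not help"
    return "web down: dependency healthy, investigate the web server"
```

Usage would be something like `classify(tcp_reachable(web_host, 80), tcp_reachable(db_host, db_port))`, where the hosts and ports are whatever applies to your setup.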
I also agree that "self-healing" lends itself to band-aids that avoid
root-cause determination.
Or *prevent* the root-cause determination. For example, I had a problem on an LDAP server that appeared once every 2 or 3 weeks. I started it under a debugger, and when the problem next occurred, some online debugging (after taking the server out of the pool) with a developer found and fixed the bug within one hour (and allowed me to understand the cause so I could work around it). A restart here would have meant more waiting and another few outages.
I don't think this requires "baby-sitting," but a commitment to fixing things once. I have also had the
displeasure of making permanent band-aids, but I cannot condone it.
We do have some applications that require supervision, but for them we use daemontools or supervise-scripts (a re-implementation of daemontools), as these are *much* better at supervision than a monitoring system. If you really need a baby-sitter, the monitoring system isn't the best one.
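
To illustrate why a supervisor is better placed for this than a poller, here is a rough Python sketch of the idea (not how daemontools is implemented). The supervisor is the parent of the service, so it learns of an exit the moment it happens instead of at the next polling interval. The function name and the `max_starts` cap are assumptions added so the example terminates; a real supervisor loops indefinitely:

```python
import subprocess
import time


def supervise(argv, max_starts=3, pause=1.0):
    """Run argv, starting it again each time it exits.

    Returns the list of exit codes seen. max_starts is a hypothetical
    cap for demonstration purposes only.
    """
    exit_codes = []
    for _ in range(max_starts):
        child = subprocess.Popen(argv)
        # wait() blocks until the child exits, so there is no polling delay.
        exit_codes.append(child.wait())
        # Short pause so a service that dies instantly cannot spin the CPU.
        time.sleep(pause)
    return exit_codes
```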
All of those "operational" aspects aside, I've convinced myself from a
security point of view, corrective action from monitoring is bad: a
clear violation of the separation of duties. You don't want your
auditors "cleaning up" the numbers as they go over your books.
You know what's better than your webserver being automatically
restarted when it crashes? Your webserver not crashing.
I completely support the absence of corrective actions from monitor
triggers. The question I have yet to answer satisfactorily is, "Should
the monitoring system perform additional data collection after
specific errors?" For example, running a particular "find" command
when disk usage increases to try and identify which files are causing
the partition to fill.
Or attach a debugger to the hung process and get a backtrace?
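
Both kinds of collection can be kept strictly read-only. A rough sketch in Python, under some assumptions: the function names are made up for illustration, the directory walk stands in for the "find" command mentioned above, and gdb is assumed to be installed on the host. Note that attaching gdb pauses the target process while the backtrace is taken.

```python
import heapq
import os
import shutil
import subprocess


def usage_fraction(path):
    """Fraction (0.0 to 1.0) of the filesystem holding path that is in use."""
    total, used, _free = shutil.disk_usage(path)
    return used / total


def _same_device(path, dev):
    try:
        return os.lstat(path).st_dev == dev
    except OSError:
        return False


def largest_files(root, count=10):
    """Return up to count (size_bytes, path) pairs under root, largest first."""
    root_dev = os.stat(root).st_dev
    sizes = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Like "find -xdev": do not descend into other mounted filesystems.
        dirnames[:] = [d for d in dirnames
                       if _same_device(os.path.join(dirpath, d), root_dev)]
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                sizes.append((os.lstat(full).st_size, full))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return heapq.nlargest(count, sizes)


def backtrace_command(pid):
    """gdb invocation that attaches to pid, dumps all threads, then detaches."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def collect_backtrace(pid, timeout=60):
    """Run gdb against a hung process and return its output as text."""
    result = subprocess.run(backtrace_command(pid), capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout
```

The monitor would call `largest_files` only after `usage_fraction` crosses its alert threshold, and attach the output to the alert rather than acting on it.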
Regards,
Buchan