Restarting failed processes on the client
list Thomas Kaehn
Hi,
it is possible to monitor processes using PROC statements
in hobbit-clients.cfg.
But is there also a proper way in Hobbit to take action on failed
processes? For example, calling "sudo /etc/init.d/ssh start" when no
sshd processes are found? Ideally this would be configurable on the
server (e.g. PROC sshd ACTION=/etc/init.d/ssh start), so that the
processes to monitor and the processes to restart do not need to be
specified twice.
Ciao,
Thomas
--
Thomas Kähn WESTEND GmbH | Internet-Business-Provider
Technik CISCO Systems Partner - Authorized Reseller
Im Süsterfeld 6 Tel 0241/701333-18
user-02a72cb3f725@xymon.invalid D-52072 Aachen Fax 0241/911879
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
The company is registered in the Aachen commercial register under HRB 7608
Managing directors: Thomas Neugebauer, Thomas Heller, Michael Kolb
list Stewart Larsen
Not sure if this is the right place... I have a particular error in my logs. It's not a real issue unless I see 10 of them in a 30 minute period, so I set up a rule in the msgs section...
<msgs>
<setting name="summary" value="true" />
<match logfile="Application" eventid="3317" count="10" delay="30m" />
</msgs>
Is this syntax correct?
list Henrik Størner
On Wed, Jul 11, 2007 at 02:01:13PM +0200, Thomas Kaehn wrote:
But is there also a proper way in Hobbit to take action on failed processes?
No. Hobbit only monitors things, it doesn't act to recover from
any failures.
If you really want this, then the easiest way is probably to
have a script on the Hobbit server that handles the service
restart, and trigger it from an alerting script. Here's how:
First, set up monitoring of the "sshd" process in hobbit-clients.cfg
with
PROC sshd GROUP=ssh
You need the "GROUP" setting to be able to distinguish between
different types of "procs" alerts.
Next, create /usr/local/bin/sshRecover.sh with the commands needed
to restart ssh - you can use $BBHOSTNAME to get the name of the host
that has the problem.
Finally, in hobbit-alerts.cfg you should have
HOST=hostA,hostB,hostC SERVICE=procs GROUP=ssh
SCRIPT /usr/local/bin/sshRecover.sh 0
to trigger the sshRecover.sh script when the "procs" column
goes red due to the "sshd" process missing. The "0" at the end
is a mandatory parameter in hobbit-alerts.cfg (the "recipient"
if you read the man-page) but here it's just a dummy parameter.
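A minimal sketch of what sshRecover.sh might look like. Only $BBHOSTNAME comes from the description above. The use of rsh as the remote transport, the RESTART_CMD and REMOTE_SHELL variables, and the APPLY dry-run switch are assumptions made for illustration and are not part of Hobbit:

```shell
#!/bin/sh
# Sketch of /usr/local/bin/sshRecover.sh (server-side alert script).
# BBHOSTNAME is set by Hobbit; REMOTE_SHELL, RESTART_CMD and APPLY are
# hypothetical settings introduced for this example.

REMOTE_SHELL="${REMOTE_SHELL:-rsh}"
RESTART_CMD="${RESTART_CMD:-/etc/init.d/ssh start}"

# Print the command line that would restart sshd on host $1.
build_restart_command() {
    echo "$REMOTE_SHELL $1 $RESTART_CMD"
}

# Restart sshd on host $1; only prints the command unless APPLY=1 is set.
recover_host() {
    if [ -z "$1" ]; then
        echo "no host name given" >&2
        return 1
    fi
    cmd=`build_restart_command "$1"`
    if [ "${APPLY:-0}" = "1" ]; then
        $cmd
    else
        echo "would run: $cmd"
    fi
}

# Hobbit sets BBHOSTNAME when it calls the script; do nothing without it.
if [ -n "${BBHOSTNAME:-}" ]; then
    recover_host "$BBHOSTNAME"
fi
```

Running the script by hand with BBHOSTNAME=hostA and without APPLY=1 prints the command instead of executing it, which makes it possible to check the setup before letting alerts trigger real restarts.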
Regards,
Henrik
list Henrik Størner
On Wed, Jul 11, 2007 at 04:13:56PM +0200, Henrik Stoerner wrote:
If You really want this, then the easiest way is probably to have a script on the Hobbit server that handles the service restart, and trigger it from an alerting script. Here's how:
[snipped]
Particularly for ssh, running the recovery script from the Hobbit
server might not be easy, since ssh is usually the only way you
can log in remotely to the server and get things (re-)started.
So to implement the same functionality on the client-side, you can
write a client-side extension script that does:
#!/bin/sh
PROCSTATUS=`$BB $BBDISP "query $MACHINE.procs" | awk '{print $1}'`
if test "$PROCSTATUS" = "red"
then
/etc/init.d/sshd restart
fi
exit 0
This triggers the "sshd restart" whenever the "procs" status goes red.
Note that it cannot tell whether it is the sshd process that caused the
red status if you're monitoring multiple processes on each host. As an
alternative, you could add network-monitoring of "ssh", and then query
the "ssh" column instead of the "procs" column.
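The alternative of querying the "ssh" column could be sketched as follows. BB, BBDISP and MACHINE are taken from the script above and are expected to come from the Hobbit client environment. The COLUMN and RESTART_CMD variables and the function names are assumptions introduced for illustration:

```shell
#!/bin/sh
# Sketch of a client-side extension that restarts sshd when the network
# test column "ssh" is red. COLUMN and RESTART_CMD are hypothetical
# settings; BB, BBDISP and MACHINE are provided by the Hobbit client.

COLUMN="${COLUMN:-ssh}"
RESTART_CMD="${RESTART_CMD:-/etc/init.d/sshd restart}"

# Ask the Hobbit server for the status of $MACHINE.$COLUMN and print
# the first word of the reply, which is the color.
query_color() {
    $BB $BBDISP "query $MACHINE.$COLUMN" | awk '{print $1}'
}

# Succeed only when the color passed as $1 is "red".
needs_restart() {
    test "$1" = "red"
}

# Run the check only inside the Hobbit client environment, where BB is set.
if [ -n "${BB:-}" ]; then
    if needs_restart "`query_color`"; then
        $RESTART_CMD
    fi
fi
```

Setting COLUMN=procs gives the same behaviour as the original script, so one extension can serve both variants.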
Regards,
Henrik
list Thomas Kaehn
Hi Henrik,
On Wed, Jul 11, 2007 at 04:13:56PM +0200, Henrik Stoerner wrote:
On Wed, Jul 11, 2007 at 02:01:13PM +0200, Thomas Kaehn wrote:
But is there also a proper way in Hobbit to take action on failed processes?
No. Hobbit only monitors things, it doesn't act to recover from any failures.
Thanks for your suggestions on how to solve the problem. However, if Hobbit is aimed at monitoring, it's probably better not to misuse the alert functionality for restarting processes. Your second solution avoids this and can also be used to act on other problems, not only "procs". So I think this would be the best solution.
Ciao,
Thomas
list Daniel Bourque
As a last resort, if you also have rsh running, you could
- set hosts.equiv to allow the hobbit user coming in from the hobbit
server to login as user x without a password,
- then give user x sudo ( with NOPASSWD ) rights to restart sshd.
I have a bunch of automated fixes set up (restarting ntpd, killing
processes, etc.) using the SCRIPT alert and ssh keys.
In your case you could do this to restart the local or remote ssh service:
< from hobbit-alerts.cfg>
...
PAGE=bla COLOR=red
SCRIPT /opt/hobbit/server/bin/autofix_ssh autofix_ssh
SERVICE=ssh DURATION<10m
MAIL user-0a951403e24f@xymon.invalid DURATION>10m REPEAT=30m
<autofix_ssh>
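#!/bin/bash
# Compare host names as strings; -eq only works for integers.
if [ "$BBHOSTNAME" = "`hostname`" ] ; then
    # The alert is for the Hobbit server itself: restart locally.
    sudo /etc/init.d/sshd restart
else
    # Remote host: log in as userx via rsh and restart through sudo.
    rsh "$BBHOSTNAME" -l userx sudo /etc/init.d/sshd restart
fi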
hope this helps
Daniel Bourque
Systems/Network Administrator
WeatherData Service Inc
An Accuweather Company
Office (XXX) XXX-XXXX
Office (XXX) XXX-XXXX ext. XXXX
Mobile (XXX) XXX-XXXX
list Etienne Grignon
Hello Stewart,
You can set a counter so that you only receive an alert if several events match the same rule. The default is 0, which means that the first matched event generates an alert. count must be a positive number, and it has no effect on ignore rules.
2007/7/11, Stewart Larsen <user-4bb0ef2a7550@xymon.invalid>:
Not sure if this is the right place... I have a particular error in my logs. It's not a real issue unless I see 10 of them in a 30 minute period, so I set up a rule in the msgs section...
<msgs>
<setting name="summary" value="true" />
<match logfile="Application" eventid="3317" count="10" delay="30m" />
</msgs>
Is this syntax correct?
The syntax is good, but the count option only helps to trigger on events that appear often within the last 30 minutes (your delay setting). Once count is reached, the msgs agent will still report all of the events because, depending on the rules, the events can differ from each other. If you really don't want the event to be reported, maybe you should ignore it permanently.
Regards,
-- Etienne GRIGNON
list Stewart Larsen
Thanks. I've read the manual, but the syntax below does not seem to behave the way I expect. The first error I get with that EventID triggers an alert. I thought with the syntax given, I would need to see 10 log entries within a 30 minute period before I get an alert. Is this a bug in BBWin, or am I doing something incorrect here? Stewart
--
Stewart Larsen
list Darin D [eit] Dugan
In case you didn't show us your whole <msgs> section, make sure your match rule is before any other more general match rule (such as the default red/error and yellow/warning rules). I believe the first match wins. Cheers. D
-----Original Message-----
From: Stewart Larsen [mailto:user-4bb0ef2a7550@xymon.invalid]
Sent: Thursday, July 12, 2007 11:42 AM
To: user-ae9b8668bcde@xymon.invalid
Cc: user-ae9b8668bcde@xymon.invalid
Subject: Re: [hobbit] BBWin and Message problems
[snipped]