Xymon Mailing List Archive

Restarting failed processes on the client

9 messages in this thread

list Thomas Kaehn · Wed, 11 Jul 2007 14:01:13 +0200 ·
Hi,

it is possible to monitor processes using PROC statements
in hobbit-clients.cfg.

But is there also a proper way in Hobbit to take action on failed
processes? Let's say calling "sudo /etc/init.d/ssh start" in case no
sshd processes are found? Ideally configurable on the server (e.g. PROC
sshd ACTION=/etc/init.d/ssh start), so that the configuration which
processes to monitor and which process to restart does not need to be
specified twice.

Ciao,
Thomas
-- 
Thomas Kähn                   WESTEND GmbH  |  Internet-Business-Provider
Technik                       CISCO Systems Partner - Authorized Reseller
                              Im Süsterfeld 6          Tel 0241/701333-18
user-02a72cb3f725@xymon.invalid                D-52072 Aachen              Fax 0241/911879
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
The company is registered in the Aachen commercial register under HRB 7608
Managing directors:        Thomas Neugebauer, Thomas Heller, Michael Kolb
list Stewart Larsen · Wed, 11 Jul 2007 10:13:42 -0400 (EDT) ·
Not sure if this is the right place...

I have a particular error in my logs.  It's not a real issue unless I see
10 of them in a 30 minute period, so I set up a rule in the msgs
section...

<msgs>
<setting name="summary" value="true" />
<match  logfile="Application" eventid="3317" count="10" delay="30m" />
</msgs>

Is this syntax correct?
list Henrik Størner · Wed, 11 Jul 2007 16:13:56 +0200 ·
On Wed, Jul 11, 2007 at 02:01:13PM +0200, Thomas Kaehn wrote:
> But is there also a proper way in Hobbit to take action on failed
> processes?
No. Hobbit only monitors things, it doesn't act to recover from
any failures.

If you really want this, then the easiest way is probably to
have a script on the Hobbit server that handles the service
restart, and trigger it from an alerting script. Here's how:

First, set up monitoring of the "sshd" process in hobbit-clients.cfg
with
    PROC sshd GROUP=ssh
You need the "GROUP" setting to be able to distinguish between
different types of "procs" alerts.

Next, create /usr/local/bin/sshRecover.sh with the commands needed 
to restart ssh - you can use $BBHOSTNAME to get the name of the host 
that has the problem. 

Finally, in hobbit-alerts.cfg you should have
    HOST=hostA,hostB,hostC SERVICE=procs GROUP=ssh
        SCRIPT /usr/local/bin/sshRecover.sh 0
to trigger the sshRecover.sh script when the "procs" column
goes red due to the "sshd" process missing. The "0" at the end
is a mandatory parameter in hobbit-alerts.cfg (the "recipient"
if you read the man-page) but here it's just a dummy parameter.
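
A minimal sketch of what /usr/local/bin/sshRecover.sh might look like
follows. It is not taken from Hobbit itself: the RESTART_CMD variable and
the recover_ssh function are hypothetical names, and the default command
only prints what it would do, because how you actually reach the failed
host depends on your setup. The sketch also assumes that the alert module
sets RECOVERED=1 when it calls the script for a recovery message; check
the hobbit-alerts.cfg man-page for your version.

```shell
#!/bin/sh
# Sketch of /usr/local/bin/sshRecover.sh (hypothetical, adapt before use).
# Hobbit passes the name of the failing host in $BBHOSTNAME.

# RESTART_CMD is an assumed, site-specific command that restarts sshd on
# the host given as its last argument. The default is a harmless dry run.
RESTART_CMD="${RESTART_CMD:-echo would-restart-sshd-on}"

recover_ssh() {
    host="$1"
    if [ -z "$host" ]; then
        echo "sshRecover: no hostname given" >&2
        return 1
    fi
    # Assumption: RECOVERED=1 means this call is a recovery notice,
    # so there is nothing to restart.
    if [ "${RECOVERED:-0}" = "1" ]; then
        return 0
    fi
    $RESTART_CMD "$host"
}

# Only act when called by the alert module, i.e. when BBHOSTNAME is set.
if [ -n "${BBHOSTNAME:-}" ]; then
    recover_ssh "$BBHOSTNAME"
fi
```

With the default dry-run command, an alert for hostA just prints one line
naming the host. Set RESTART_CMD to whatever actually restarts sshd on a
remote host in your environment.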


Regards,
Henrik
list Henrik Størner · Wed, 11 Jul 2007 16:20:41 +0200 ·
On Wed, Jul 11, 2007 at 04:13:56PM +0200, Henrik Stoerner wrote:
> If you really want this, then the easiest way is probably to
> have a script on the Hobbit server that handles the service
> restart, and trigger it from an alerting script. Here's how:
> [snipped]

Particularly for ssh, running the recovery script from the Hobbit
server might not be easy, since ssh is usually the only way you
can log in remotely to the server and get things (re-)started.

So to implement the same functionality on the client-side, you can
write a client-side extension script that does:

   #!/bin/sh

   PROCSTATUS=`$BB $BBDISP "query $MACHINE.procs" | awk '{print $1}'`
   if test "$PROCSTATUS" = "red"
   then
      /etc/init.d/sshd restart
   fi

   exit 0

This triggers the "sshd restart" whenever the "procs" status goes red.
However, if you're monitoring multiple processes on each host, it cannot
tell whether it is the sshd process that caused the red status. As an
alternative, you could add network-monitoring of "ssh", and then query
the "ssh" column instead of the "procs" column.
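
As a rough sketch of that alternative, the extension script below queries
the "ssh" column rather than "procs". It assumes network monitoring of ssh
is already configured for the host on the Hobbit server. The helper names
status_color and needs_restart are made up for this example; the query
command and the init script path are simply carried over from the script
above.

```shell
#!/bin/sh
# Client-side extension sketch: restart sshd when the "ssh" column is red.

# Print the first word of a status reply, which is the color.
status_color() {
    printf '%s\n' "$1" | awk 'NR == 1 { print $1 }'
}

# Succeed only when the reply reports a red status.
needs_restart() {
    [ "$(status_color "$1")" = "red" ]
}

# $BB, $BBDISP and $MACHINE are provided by the Hobbit client environment;
# do nothing when the script is run outside of it.
if [ -n "${BB:-}" ] && [ -n "${BBDISP:-}" ] && [ -n "${MACHINE:-}" ]; then
    SSHSTATUS=`$BB $BBDISP "query $MACHINE.ssh"`
    if needs_restart "$SSHSTATUS"; then
        /etc/init.d/sshd restart
    fi
fi
```

Splitting the color check into small functions makes it easy to try the
logic by hand with a sample reply before installing the script. When you
install it as a real extension you would probably keep the final "exit 0"
from the script above.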


Regards,
Henrik
list Thomas Kaehn · Thu, 12 Jul 2007 09:58:52 +0200 ·
Hi Henrik,

On Wed, Jul 11, 2007 at 04:13:56PM +0200, Henrik Stoerner wrote:
> On Wed, Jul 11, 2007 at 02:01:13PM +0200, Thomas Kaehn wrote:
> > But is there also a proper way in Hobbit to take action on failed
> > processes?
> No. Hobbit only monitors things, it doesn't act to recover from
> any failures.

Thanks for your suggestions on how to solve the problem. However,
if Hobbit is aimed at monitoring, it's probably better not
to misuse the alert functionality for restarting processes.

Your second solution solves this problem and may also be used to act on
further problems - not only "procs". So I think this would be the best
solution.

Ciao,
Thomas
list Daniel Bourque · Thu, 12 Jul 2007 09:50:11 -0500 ·
As a last resort, if you also have rsh running, you could
- set hosts.equiv to allow the hobbit user coming in from the hobbit
server to log in as user x without a password,
- then give user x sudo (with NOPASSWD) rights to restart sshd.

I have a bunch of automated fixes that I set up (restart ntpd, kill
processes, etc.) using the SCRIPT alert and ssh keys.

In your case you could do this to restart the local or remote ssh service:


<from hobbit-alerts.cfg>
...
PAGE=bla COLOR=red
        SCRIPT /opt/hobbit/server/bin/autofix_ssh autofix_ssh SERVICE=ssh DURATION<10m
        MAIL user-0a951403e24f@xymon.invalid DURATION>10m REPEAT=30m

<autofix_ssh>

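#!/bin/bash

# BBHOSTNAME is set by Hobbit to the host that triggered the alert.
# Use a string comparison (=); -eq only compares integers.
if [ "$BBHOSTNAME" = "$(hostname)" ] ; then
    sudo /etc/init.d/sshd restart
else
    # Restart sshd remotely via rsh as the passwordless user.
    rsh "$BBHOSTNAME" -l userx sudo /etc/init.d/sshd restart
fi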


hope this helps

Daniel Bourque
Systems/Network Administrator
WeatherData Service Inc
An Accuweather Company

Office (XXX) XXX-XXXX
Office (XXX) XXX-XXXX ext. XXXX
Mobile (XXX) XXX-XXXX
list Etienne Grignon · Thu, 12 Jul 2007 17:58:55 +0200 ·
Hello Stewart,


You can set a counter so that you only receive an alert if several events
on the same rule are matched. Default is 0, which means that the first
event matched will generate an alert. count must be a positive number.
count has no effect on ignore rules.


2007/7/11, Stewart Larsen <user-4bb0ef2a7550@xymon.invalid>:
> Not sure if this is the right place...
>
> I have a particular error in my logs.  It's not a real issue unless I see
> 10 of them in a 30 minute period, so I set up a rule in the msgs
> section...
>
> <msgs>
> <setting name="summary" value="true" />
> <match  logfile="Application" eventid="3317" count="10" delay="30m" />
> </msgs>
>
> Is this syntax correct?
The syntax is good, but the count option only helps to trigger on
events that appear often within the last 30 minutes (your delay setting).
Once count is reached, the msgs agent will still report all of the events
because, depending on the rules, the events can differ from each other.

If you really don't want the event to be reported, maybe you should
ignore it entirely.

Regards,


-- 
Etienne GRIGNON
list Stewart Larsen · Thu, 12 Jul 2007 12:41:52 -0400 (EDT) ·
Thanks. I've read the manual, but the syntax below does not seem to behave
the way I expect.

The first error I get  with that EventID triggers an alert.  I thought
with the syntax given, I would need to see 10 log entries within a 30
minute period before I get an alert.

Is this a bug in BBWin, or am I doing something incorrect here?

Stewart

-- 

Stewart Larsen
list Darin D [eit] Dugan · Thu, 12 Jul 2007 13:22:23 -0500 ·
In case you didn't show us your whole <msgs> section, make sure your
match rule is before any other more general match rule (such as the
default red/error and yellow/warning rules). I believe the first match
wins.

Cheers.
D