Xymon Mailing List Archive

Checking process longevity

7 messages in this thread

list Colin Coe · Mon, 4 Feb 2008 08:20:42 +0900 ·
Hi all

I'm trying to work out how I can get hobbit to alert me when a process
has existed for more than n seconds.  This is important as we sometimes
have NFS problems that cause processes such as df to hang due to stale
mounts and I'd like to pick these up sooner rather than later.

Any ideas on how to implement this?

Thanks

CC


NOTICE: This email and any attachments are confidential. They may contain legally privileged information or copyright material. You must not read, copy, use or disclose them without authorisation. If you are not an intended recipient, please contact us at once by return email and then delete both messages and all attachments.
list Henrik Størner · Mon, 4 Feb 2008 10:17:40 +0100 ·
quoted from Colin Coe
On Mon, Feb 04, 2008 at 08:20:42AM +0900, Coe, Colin C. (Unix Engineer) wrote:
I'm trying to work out how I can get hobbit to alert me when a process
has existed for more than n seconds.  This is important as we sometimes
have NFS problems that cause processes such as df to hang due to stale
mounts and I'd like to pick these up sooner rather than later.
Wouldn't it be easier to just scan the logfile for NFS timeout errors?

Hobbit doesn't track the lifetime of a process, and I would think this
would be very bothersome to set up because you'd have to exclude
long-living daemon processes.


Regards,
Henrik
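
As a rough illustration of the log-scanning approach suggested above, the
following Python sketch looks for NFS timeout messages in log lines. The
message pattern (the "nfs: server ... not responding" wording written by the
Linux kernel) and the function name find_nfs_timeouts are assumptions made
for illustration; they are not part of Hobbit, and the wording should be
checked against the real logs on the systems in question.

```python
import re

# Assumed message wording, e.g. "nfs: server filer01 not responding, still trying".
# Adjust the pattern to whatever your kernel or syslog actually writes.
NFS_TIMEOUT = re.compile(r"nfs:? server (\S+) not responding", re.IGNORECASE)


def find_nfs_timeouts(lines):
    """Return the server names mentioned in NFS timeout messages, in order."""
    servers = []
    for line in lines:
        match = NFS_TIMEOUT.search(line)
        if match:
            servers.append(match.group(1))
    return servers


if __name__ == "__main__":
    # Hypothetical sample lines; host and server names are made up.
    sample = [
        "Feb  4 08:01:12 foo kernel: nfs: server filer01 not responding, still trying",
        "Feb  4 08:03:40 foo kernel: nfs: server filer01 OK",
    ]
    print(find_nfs_timeouts(sample))
```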
list Colin Coe · Tue, 5 Feb 2008 13:47:07 +0900 ·
quoted from Henrik Størner
-----Original Message-----
From: Henrik Stoerner [mailto:user-ce4a2c883f75@xymon.invalid] Sent: Monday, 4 February 2008 6:18 PM
To: user-ae9b8668bcde@xymon.invalid
Subject: Re: [hobbit] Checking process longevity

On Mon, Feb 04, 2008 at 08:20:42AM +0900, Coe, Colin C. (Unix Engineer) wrote:
I'm trying to work out how I can get hobbit to alert me when a process
has existed for more than n seconds.  This is important as we sometimes
have NFS problems that cause processes such as df to hang due to stale
mounts and I'd like to pick these up sooner rather than later.
Wouldn't it be easier to just scan the logfile for NFS timeout errors?

Hobbit doesn't track the lifetime of a process, and I would think this
would be very bothersome to setup because you'd have to exclude long-
living daemon processes.


Regards,
Henrik
By default, under RHEL (most of) the files under /var/log are owned by,
and only readable by, root.  I'm still deciding whether or not to allow
hobbit to read the log files.  I do think that there are other cases
where monitoring how long a process exists is useful.

I was thinking that this could be done by adding a new flag to 'PROC' in
hobbit-clients.cfg.  Something like:

PROC processname minimumcount maximumcount color [TRACK=id] [TEXT=text]
[RUNTIME=seconds]

Example: alert if a 'df' process has existed for more than 60 seconds:

HOST foo
	PROC df RUNTIME=60

I started hacking but my C fu is weak.

CC


list Henrik Størner · Tue, 5 Feb 2008 12:58:06 +0100 ·
quoted from Colin Coe
On Tue, Feb 05, 2008 at 01:47:07PM +0900, Coe, Colin C. (Unix Engineer) wrote:
I do think that there are other cases
where monitoring how long a process exists is useful.

I was thinking that this could be done by adding a new flag to 'PROC' in
hobbit-clients.cfg.  Something like:

PROC processname minimumcount maximumcount color [TRACK=id] [TEXT=text]
[RUNTIME=seconds]

Example: alert if a 'df' process has existed for more than 60 seconds:

HOST foo
	PROC df RUNTIME=60
Sure. The only problem is: how do you determine how long a process has
existed?

Some systems report the start time of a process in a separate column
(START in Linux, STIME in Solaris, ...). That is not very accurate, since
for a process started more than 24 hours ago it shows only the date. I
guess we could use that.


Regards,
Henrik
list Massimo Morsiani · Tue, 5 Feb 2008 13:57:26 +0100 ·
Hi all,

what is the right way to use CLASS tag in hobbit-alerts.cfg?
I didn't find anything in Hobbit man-pages.


Regards.

Massimo Morsiani
Information Technology Dept.
Gilbarco S.p.a.
via de' Cattani, 220/G
50145 Firenze, Italy
tel:	+XX-XXX-XXXXX
fax:	+XX-XXX-XXXXXX
email:	user-32025d8bd22e@xymon.invalid
web:	http://www.gilbarco.it

This message (including any attachments) contains confidential and/or proprietary information intended only for the addressee.  Any unauthorized disclosure, copying, distribution or reliance on the contents of this information is strictly prohibited and may constitute a violation of law.  If you are not the intended recipient, please notify the sender immediately by responding to this e-mail, and delete the message from your system.  If you have any questions about this e-mail please notify the sender immediately.
list Colin Coe · Fri, 8 Feb 2008 11:36:19 +0900 ·
quoted from Henrik Størner
 
-----Original Message-----
From: Henrik Stoerner [mailto:user-ce4a2c883f75@xymon.invalid] Sent: Tuesday, 5 February 2008 8:58 PM
To: user-ae9b8668bcde@xymon.invalid
Subject: Re: [hobbit] Checking process longevity

On Tue, Feb 05, 2008 at 01:47:07PM +0900, Coe, Colin C. (Unix Engineer) wrote:
I do think that there are other cases
where monitoring how long a process exists is useful.
I was thinking that this could be done by adding a new flag to 'PROC' in
hobbit-clients.cfg.  Something like:
PROC processname minimumcount maximumcount color [TRACK=id] [TEXT=text]
[RUNTIME=seconds]
Example: alert if a 'df' process has existed for more than 60 seconds:
HOST foo
	PROC df RUNTIME=60
Sure. Only problem is: How do you determine how long a process has
existed ?

Some systems report the start-time of a process in a separate column
(START in Linux, STIME in Solaris, ...) Not very accurate, since if they
were started more than 24 hours ago it shows only the date. I guess we
could use that.


Regards,
Henrik
That sounds great.  Typically, I'm looking for processes that have
existed for 5 minutes.

Thanks

CC

list John Glowacki · Thu, 20 Mar 2008 17:07:26 -0400 ·
quoted from Colin Coe
On Tue, Feb 5, 2008 at 7:58 AM, Henrik Stoerner <user-ce4a2c883f75@xymon.invalid> wrote:
On Tue, Feb 05, 2008 at 01:47:07PM +0900, Coe, Colin C. (Unix Engineer) wrote:
I do think that there are other cases
where monitoring how long a process exists is useful.

I was thinking that this could be done by adding a new flag to 'PROC' in
hobbit-clients.cfg.  Something like:

PROC processname minimumcount maximumcount color [TRACK=id] [TEXT=text]
[RUNTIME=seconds]

Example: alert if a 'df' process has existed for more than 60 seconds:

HOST foo
      PROC df RUNTIME=60
 Sure. Only problem is: How do you determine how long a process has
 existed ?

 Some systems report the start-time of a process in a separate column
 (START in Linux, STIME in Solaris, ...) Not very accurate, since if they
 were started more than 24 hours ago it shows only the date. I guess we
 could use that.


 Regards,
 Henrik
If the etime column were added to the ps command, this could be
supported on both Solaris and Linux. stime, by contrast, seems to report
a time of day, a month and day, or a year, depending on the OS and on
how much time has passed.

Solaris man page for ps:
     etime In the POSIX locale, the elapsed time since the process
           was started, in the form:
           [[dd-]hh:]mm:ss
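
To compare such a value against a threshold in seconds, the
[[dd-]hh:]mm:ss form first has to be converted to a number. A minimal
Python sketch of that conversion follows; the function name
etime_to_seconds is made up for illustration and is not part of Hobbit
or ps.

```python
def etime_to_seconds(etime: str) -> int:
    """Convert a ps etime string of the form [[dd-]hh:]mm:ss to seconds."""
    days = 0
    rest = etime.strip()
    if "-" in rest:
        # Optional leading day count, e.g. "2-03:45:15".
        day_part, rest = rest.split("-", 1)
        days = int(day_part)
    fields = [int(f) for f in rest.split(":")]
    if not 2 <= len(fields) <= 3:
        raise ValueError(f"unexpected etime format: {etime!r}")
    # Left-pad with zero hours when only mm:ss is present.
    while len(fields) < 3:
        fields.insert(0, 0)
    hours, minutes, seconds = fields
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


if __name__ == "__main__":
    for sample in ("2-03:45:15", "02:08:04", "28:27", "00:29"):
        print(sample, etime_to_seconds(sample))
```

For instance, "28:27" converts to 1707 seconds and "2-03:45:15" to 186315
seconds.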

Example output for different times:
   STIME     ELAPSED
   Mar18  2-03:45:15
   Mar19  1-06:56:27
   14:45    02:08:04
   15:33    01:20:22
   16:25       28:27
   16:53       00:29
   16:53       00:00

SunOS 5.7
   STIME     ELAPSED
  May_04 686-03:00:28

SunOS 5.9
   STIME     ELAPSED
  Mar_08 377-21:30:03

SunOS 5.10
   STIME     ELAPSED
  Jun_12 282-01:43:14

Red Hat Enterprise Linux AS release 3
Linux 2.4.21-47.ELsmp
STIME     ELAPSED
Mar15  5-02:51:34

Red Hat Enterprise Linux AS release 4
Linux 2.6.9-34.ELsmp
STIME     ELAPSED
 2007 203-00:13:39

I found etime today because I had to prove that processes on a
Solaris system were started on Feb_20 of 2007 and not Feb_20 of 2008.

Hope this is helpful.

John
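
Putting the thread's idea together, a check for long-lived processes could
be sketched outside Hobbit roughly as below. This is only an illustration
under stated assumptions: it is not the proposed RUNTIME option, the
function names are hypothetical, and it assumes that
"ps -e -o pid= -o etime= -o comm=" is available on the target system.

```python
import os
import re
import subprocess

# Matches the [[dd-]hh:]mm:ss form documented for the ps etime column.
ETIME_RE = re.compile(r"^(?:(?:(\d+)-)?(\d+):)?(\d+):(\d+)$")


def elapsed_seconds(etime):
    """Convert an etime string to seconds."""
    match = ETIME_RE.match(etime)
    if match is None:
        raise ValueError(f"unexpected etime format: {etime!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def find_long_running(ps_output, name, max_seconds):
    """Return (pid, seconds) for processes called `name` older than max_seconds.

    `ps_output` is expected to hold "pid etime comm" lines.
    """
    offenders = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # skip blank lines and any header line
        pid, etime, comm = parts
        # comm may be a full path on some systems, so compare the basename.
        if os.path.basename(comm.strip()) != name:
            continue
        seconds = elapsed_seconds(etime)
        if seconds > max_seconds:
            offenders.append((int(pid), seconds))
    return offenders


def run_ps():
    """Run ps and return its output (assumes the -o options are supported)."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etime=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Flag any 'df' that has existed for more than 5 minutes.
    for pid, seconds in find_long_running(run_ps(), "df", 300):
        print(f"df pid {pid} has existed for {seconds} seconds")
```

Matching on the process name sidesteps the problem of long-living daemons
raised earlier in the thread, since only the named command (here df) is
ever compared against the threshold.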