Xymon Mailing List Archive

PROC matching failure due to column bloat

3 messages in this thread

Jeremy Laidman · Tue, 24 Apr 2012 13:04:14 +1000
Peeps

I have both Solaris and Linux servers where a large or long-running process
causes PROC matching to fail.  Here are some examples:


 7701     1 root       Feb 28 S  24  0.0 00:00:00  0.0   572   2692 /sbin/agetty -L 9600 ttyS0 vt102
 7702     1 root       Feb 28 S  23  0.0 00:00:00  0.0   576   2692 /sbin/agetty -L 9600 ttyS1 vt102
 7704     1 named      Feb 28 S  18  2.4 1-08:59:39  4.4 270500 412784 /usr/sbin/named -u named -f
26498  3293 root     12:47:46 S  14  0.0 00:00:00  0.0   468   2676 sleep 180

This is on Linux.  Note the longer-than-a-day TIME column that pushes the
columns after it to the right.

The following is on Solaris 9:

11201 11199  n101649 12:38:54 S  59  0.0        0:00  0.0 1000 1144 vmstat 300 2
11202     1  n101649 12:38:54 S  59  0.0        0:00  0.0  968 1104 sh -c iostat -dxsrP 300 2 1>/tmp/xymon_iostatdisk.redacted
 3244  2965     root   Feb_16 S  59  0.0        5:20  0.1 7104 18736 /opt/OV/lbin/eaagt/opcle -std
 3245  2965     root   Feb_16 S  59  0.0        1:18  0.1 6376 20960 /opt/OV/lbin/eaagt/opcmona
 3253     1     root   Feb_16 S  59  0.9  1-10:46:45  0.8 58168 59632 /usr/local/sbin/named -f

Solaris "ps" output allows more characters for TIME than Linux.  However in
this case the memory columns (RSS and VSZ) are larger than expected,
pushing a couple of digits over into the process name area.

It seems that Xymon is parsing these based on fixed column sizes, defined
for each OS.  The result of these particular examples is that Xymon fails
to match on the process name.  Instead, I need to use match strings like so:

        PROC "%^(\d* |^)/usr/local/sbin/named(\s*$|\s)" 1 1 "TEXT=/usr/local/sbin/named"

or

        PROC "%^(\d* |^)/usr/sbin/named(\s*$|\s)" 1 1 "TEXT=/usr/sbin/named"
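
To see what the workaround pattern above accepts and rejects, here is a minimal sketch using Python's re module as a stand-in for the PCRE matching that Xymon performs. The sample command strings are illustrative assumptions, not captured client data; the second one imitates a stray digit from the VSZ column spilling into the command text.

```python
import re

# The expression from the PROC rule, without the leading "%" that marks
# it as a regular expression in the Xymon configuration.
PATTERN = re.compile(r"^(\d* |^)/usr/local/sbin/named(\s*$|\s)")

def matches(command_text):
    """Return True if the workaround pattern matches the command text."""
    return PATTERN.search(command_text) is not None

# Illustrative inputs (assumed, not taken from real client data).
samples = [
    "/usr/local/sbin/named -f",         # clean command text
    "2 /usr/local/sbin/named -f",       # stray digit spilled from VSZ
    "/usr/local/sbin/named-checkconf",  # different program, same prefix
]

for text in samples:
    print(repr(text), matches(text))
```

The optional "digits then a space" prefix absorbs the spilled digits, while the trailing group keeps the rule from matching longer program names that merely start with the same path.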

I guess this email is part "am I doing something wrong", part "does anyone
have a better idea", and part feature request (for more awk-like positional
matching).
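
As a rough illustration of the awk-like positional matching requested above, the sketch below extracts the command by counting whitespace-separated fields instead of character offsets. It is not how Xymon works; the function name and the field count of 11 are assumptions based on a ps field list of pid, ppid, user, stime, s, pri, pcpu, time, pmem, rss, vsz followed by args. It would miscount on the Linux output shown earlier, where STIME values such as "Feb 28" contain a space.

```python
# Hypothetical helper: number of columns that precede the command when ps
# prints pid,ppid,user,stime,s,pri,pcpu,time,pmem,rss,vsz,args.
N_LEADING_FIELDS = 11

def command_by_fields(row, n_leading=N_LEADING_FIELDS):
    """Return everything after the first n_leading whitespace-separated fields."""
    parts = row.split(None, n_leading)
    # A short or malformed row yields an empty command rather than an error.
    return parts[n_leading] if len(parts) > n_leading else ""

rows = [
    "  160     1     root   Mar_28 S  59  0.2  1-04:06:09  0.1 51552 78848 /usr/sbin/nscd",
    "  693   666     root   Mar_28 S  59  0.1  1-12:12:45  0.0 2720 3424 dovecot-auth -w",
]

for row in rows:
    print(command_by_fields(row))
```

Because the split stops after the eleventh field, spaces inside the command arguments are preserved, and wide RSS or VSZ values no longer matter.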

Cheers
Jeremy
David W David Gore · Tue, 24 Apr 2012 14:39:04 -0400
Jeremy,

I would guess it's not matching because the text you show begins with a space rather than a digit.  Could you try simplifying the line to:

PROC "%named" 1 1 "TEXT=/usr/local/sbin/named"

See if that works and then add complexity if there are other processes with the string 'named' in the ps listing.

You may also want to look at the 'Client data' link on the procs alert for this host and then the [ps] section to see how your ps listing is being presented to Xymon as it may not match what you see at the command line depending on what command line ps options you are using.

~David
Japheth Cleaver · Tue, 24 Apr 2012 14:03:21 -0700 (PDT)
quoted from David W David Gore
It seems that Xymon is parsing these based on fixed column sizes, defined
for each OS.  The result of these particular examples is that Xymon fails
to match on the process name.

Close... AFAIK, it's actually looking for the proper column name in a
given listing (from the first line), and then keying off of that. When the
columns don't line up with the given header, xymond_client examines the
wrong substring. So it's dynamic and static :)
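
A minimal sketch of that failure mode, assuming the behaviour is simply "find the column name in the header line, then take each row from that character offset onward". The function below is a hypothetical illustration, not the actual xymond_client code; the header and rows are taken from the SunOS listing in this message.

```python
HEADER = "  PID  PPID     USER    STIME S PRI %CPU        TIME %MEM  RSS  VSZ COMMAND"

ROWS = [
    # Columns line up with the header.
    "    3     0     root   Mar_28 S  60  0.1    15:09:32  0.0    0    0 fsflush",
    # Five-digit RSS and VSZ values push the command to the right.
    "  160     1     root   Mar_28 S  59  0.2  1-04:06:09  0.1 51552 78848 /usr/sbin/nscd",
]

def command_by_offset(header, row, names=("CMD", "COMMAND")):
    """Slice the row at the offset where the command column name starts."""
    for name in names:
        pos = header.find(name)
        if pos != -1:
            return row[pos:]
    return row  # no known column name found; fall back to the whole row

for row in ROWS:
    print(repr(command_by_offset(HEADER, row)))
```

For the aligned row the slice is just the command. For the shifted row the slice begins with the tail of the VSZ value, which is why a rule anchored on the process name stops matching.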


see: xymon-4.3.7/xymond/client/solaris.c:66
unix_procs_report(hostname, clienttype, os, hinfo, fromline, timestr, "CMD", "COMMAND", psstr);

and xymond_client.c:958 onward


On the SunOS box I've got, the ps command (xymonclient-sunos.sh) is
providing the following field list, and below that is some of the output
it gets wrong.

I suppose one quick fix if you're getting this a lot might be to manually
change the order of the fields in the client script to
"ps -A -o args,pid,ppid,user,stime,s,pri,pcpu,time,pmem,rss,vsz"


I'm not sure if other processing is going on, but the only drawback might
be slightly odd-looking ps output in your client logs.


HTH,
-jc


-bash-3.2$ ps -A -o pid,ppid,user,stime,s,pri,pcpu,time,pmem,rss,vsz,args | head -1
  PID  PPID     USER    STIME S PRI %CPU        TIME %MEM  RSS  VSZ COMMAND
-bash-3.2$ ps -A -o pid,ppid,user,stime,s,pri,pcpu,time,pmem,rss,vsz,args | sort -k8 -r | head
  693   666     root   Mar_28 S  59  0.1  1-12:12:45  0.0 2720 3424 dovecot-auth -w
  160     1     root   Mar_28 S  59  0.2  1-04:06:09  0.1 51552 78848 /usr/sbin/nscd
    3     0     root   Mar_28 S  60  0.1    15:09:32  0.0    0    0 fsflush
  482     1     root   Mar_28 S  59  0.0    10:10:31  0.1 94008 101432 /opt/local/bin/python /usr/local/sbin/denyhosts.py --daemon --config=/usr/share
    6     0     root   Mar_28 S   0  0.1    08:18:46  0.0    0    0 vmtasks
  461     1     root   Mar_28 S  60  0.0    04:19:34  0.0 1744 3048 /usr/lib/nfs/nfsd
   95     0     root   Mar_28 S  99  0.0    04:02:11  0.0    0    0 zpool-pool
  746   666  dovecot   Mar_28 S  59  0.0    02:33:29  0.0 11376 12584 pop3-login
 1183   666     root   Mar_28 S  59  0.0    02:25:22  0.0 2736 3424 dovecot-auth -w
  666     1     root   Mar_28 S  59  0.0    02:05:38  0.0 2304 3464 /usr/local/sbin/dovecot