False Process Down Alerts

11 messages in this thread

list Chris Naude · Fri, 15 Jan 2010 20:59:20 -0700 ·

I've run into a strange problem with my Xymon server. I noticed today that
I'm receiving random false alerts for processes being down. When I look at
the process list output in the alert, it looks as if the data coming from the
clients isn't correct. Here is an example. Has anyone seen anything like
this?

 9613  1944 root      Jan 11  S 154  0.00 00:00:00    6128 cmclconfd -c
10389  1944 root      Jan 11  S 154  0.00 00:00:00    6128 cmclconfd -c
 9794     1 oracle   10:55:57 S 154  0.00 00:00:0
  217600]oracleTEST (LOCAL=NO)
 1592     1 oracle    Jan 11  S 154  0.00 00:00:11  217136 ora_mman_TEST
12751  1944 root      Jan 11  S 154  0.00 00:00:00    6128 cmclconfd -c
 8965  1944 root      Jan 11  S 154  0.00 00:00:00    6128 cmclconfd -c


11819     1 oracle    Jan 12  S 154  0.00 00:00:07  217280 ora_j015_TEST
 2711     1 roo
      ]ec  4  S 120  0.04 00:02:16     868 /usr/sbin/xntpd
 3547     1 xymon     Dec  4  S 168  0.00 00:00:43     268 /opt/xymon/client/bin/hobbitlaunch --config=/opt/xymon/client/etc/clientlaunch.cfg --log=/opt/xymon/client/logs/clientlaunch.log --pidfile=/opt/xymon/client/logs/clientlaunch.101.example.com.pid
 3728     1 root      Dec  4  R 152  0.00 00:00:37    4208 /usr/sbin/stm/uut/bin/tools/monitor/WbemWrapperMonitor


Xymon version: 4.3.0-0.beta2
Xymon server: CentOS 5.4 32 bit

Client: HP-UX 11.31 Itanium

-- 
Chris Naude

list Lars Ebeling · Sat, 16 Jan 2010 15:56:04 +0100 ·

It looks like two instances of the client are writing to the file at the same time, or almost ;)

Lars
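
One way to test this theory is to count how many client launcher processes show up in the client's own process list. The following is a minimal sketch, not part of Xymon: the function name is made up for illustration, and the second `hobbitlaunch` row in the sample (PID 4120) is a hypothetical duplicate added to show what a double start would look like.

```python
def count_client_launchers(ps_output: str, marker: str = "hobbitlaunch") -> int:
    """Count process-list lines whose command mentions the client launcher."""
    count = 0
    for line in ps_output.splitlines():
        # Ignore a grep process that matches only because of its own argument.
        if marker in line and "grep" not in line:
            count += 1
    return count


# First and third rows are taken from the thread; the PID 4120 row is hypothetical.
sample = """\
 3547     1 xymon     Dec  4  S 168  0.00 00:00:43     268 /opt/xymon/client/bin/hobbitlaunch --config=/opt/xymon/client/etc/clientlaunch.cfg
 4120     1 xymon     Jan 15  S 168  0.00 00:00:02     268 /opt/xymon/client/bin/hobbitlaunch --config=/opt/xymon/client/etc/clientlaunch.cfg
 3728     1 root      Dec  4  R 152  0.00 00:00:37    4208 /usr/sbin/stm/uut/bin/tools/monitor/WbemWrapperMonitor
"""

launchers = count_client_launchers(sample)
if launchers > 1:
    print(f"{launchers} launcher processes found; the client may have been started twice")
```

A count above one on a single host would support the two-instances explanation; a count of exactly one points elsewhere.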


list Chris Naude · Sat, 16 Jan 2010 10:44:31 -0700 ·

That makes a lot of sense. I did have some issues with the startup scripts
on HP-UX. I'll check it out later tonight. Hopefully I can get it fixed
before it goes live tonight. Thanks!


list Chris Naude · Sun, 17 Jan 2010 16:11:44 -0700 ·

The problem has suddenly become much, much worse. I verified with tcpdump
that the data coming from the client is 100% correct. It seems something on
the Xymon server side is not handling the client data correctly. Does anyone
have any other ideas?

[image: red] 89%     /testdb3 (37771472% used) has reached the PANIC level (95%)

Filesystem            1024-blocks  Used  Available Capacity Mounted on
/dev/vgtestdb1/lvol1    107844344 70901816 36942528    66%     /testdb1
/dev/vgtestdb2/lvol1    35962064 25453128 10508936    71%     /testdb2
/dev/vgtestdb4/lvol1    970909400 825006344 145903056    85%     /testdb4
/dev/vgtestdb3/lv
l1 ]  338788224 301016752 37771472    89%     /testdb3
/dev/vgtestdb5/lvol1    179789048 150553912 29235136    84%     /testdb5
/dev/vg00/lvol8       24580711    74501 24506210     1%     /home
/dev/vg00/lvol4       10226680  6339283  3887397    62%     /opt
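
The broken row can be picked out mechanically. The sketch below assumes a six-column `df` layout like the one above and flags any row that does not fit it; the function name is invented for illustration and this is not how Xymon itself parses the data. Note that in the damaged `/testdb3` row the fifth whitespace-separated field is `37771472`, which is consistent with the "37771472% used" figure in the alert.

```python
from typing import List


def malformed_df_rows(df_output: str) -> List[str]:
    """Return df rows that do not have the expected six well-formed columns."""
    bad = []
    for line in df_output.splitlines()[1:]:  # skip the header row
        if not line.strip():
            continue
        fields = line.split()
        ok = (
            len(fields) == 6
            and fields[0].startswith("/")
            and all(f.isdigit() for f in fields[1:4])
            and fields[4].endswith("%")
            and fields[4][:-1].isdigit()
            and fields[5].startswith("/")
        )
        if not ok:
            bad.append(line)
    return bad


# Rows copied from the alert above, including the split /testdb3 entry.
sample_df = """\
Filesystem            1024-blocks  Used  Available Capacity Mounted on
/dev/vgtestdb1/lvol1    107844344 70901816 36942528    66%     /testdb1
/dev/vgtestdb3/lv
l1 ]  338788224 301016752 37771472    89%     /testdb3
/dev/vg00/lvol4       10226680  6339283  3887397    62%     /opt
"""

for row in malformed_df_rows(sample_df):
    print("suspect row:", repr(row))
```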


list Josh Luthman · Sun, 17 Jan 2010 18:21:15 -0500 ·

Is there only one client sending data as this name?  I don't think you
answered Lars' email.

What does the alert read and what does the data say?  Missing process?  Too
high of a load?

Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

"The secret to creativity is knowing how to hide your sources."
--- Albert Einstein


list Chris Naude · Sun, 17 Jan 2010 17:08:28 -0700 ·

I have 7 clients running. Each client has a different name. They are all
sending data to the primary Xymon server. The alerts report missing
processes, full file systems, and msgs errors. Here is another sample of an
unusual error. You can see the process list has a funky break in it.

 Sun Jan 17 15:40:18 MST 2010 - Processes NOT ok

[image: yellow] Expected string COMMAND not found in ps output header

  PID  PPID USER
  STIM] S PRI  %CPU     TIME     VSZ COMMAND
    0     0 root      Dec 14  S 127  0.16 00:40:00       0 swapper
    1     0 root      Dec 14  R 152  0.09 00:01:21    2064 init
   48     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
   45     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
   42     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
   31     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
   30     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
   29     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
   28     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
   26     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
    5     0 root      Dec 14  R 152  0.00 00:00:02       0 signald
    6     0 root      Dec 14  R 152  0.00 00:00:03       0 kmemdaemon
   17     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
   16     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
   15     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
   14     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
   13     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
   12     0 root      Dec 14  S 152  0.00 00:00:00       0 usbhubd
   11     0 root      Dec 14  R 152  0.00 00:01:11       0 escsid
   10     0 root      Dec 14  S -32  0.00 00:00:00       0 ttisr
    9     0 root      Dec 14  R 152  0.00 00:01:27       0 ksyncer_daemon

7     0]root      Dec 14  R 152
 0.00 00:]0:00       0 kai_daemon
   50     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
   47     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
   44     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
   41     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached



list Chris Naude · Mon, 18 Jan 2010 12:20:43 -0700 ·

I've managed to stop the flood of false alerts. I removed all of my non-prod
clients from bb-hosts and shut off their client processes. The problem
seems to be somehow related to the amount of data the Xymon server is trying
to process.


list Doug Williams · Mon, 18 Jan 2010 12:41:23 -0700 ·

Seems to me your clients' data is being truncated.  Try modifying these
settings in your hobbitserver.cfg.  You may want to set them to a size
appropriate for your Xymon server.  I have Xymon running on pretty beefy
servers, so I set these incredibly high; they may even exceed what Xymon
actually allows, but it is not hurting me.  Restart the hobbit server
after making the change to hobbitserver.cfg.


MAXMSG_STATUS=30000000
MAXMSG_CLIENT=30000000
MAXMSG_DATA=30000000
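
Before and after editing, it can help to confirm which limits are actually set. The sketch below is only an illustration: the function name is hypothetical, and it reads configuration text passed in as a string rather than assuming any particular file path. As far as I know the Xymon documentation gives these values in kB, but treat that as an assumption and check the man page for your version.

```python
import re


def read_maxmsg_settings(cfg_text: str) -> dict:
    """Collect MAXMSG_* settings from hobbitserver.cfg-style text."""
    settings = {}
    for line in cfg_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        match = re.match(r'^(MAXMSG_[A-Z]+)\s*=\s*"?(\d+)"?$', line)
        if match:
            settings[match.group(1)] = int(match.group(2))
    return settings


# The three lines suggested above.
sample_cfg = """\
MAXMSG_STATUS=30000000
MAXMSG_CLIENT=30000000
MAXMSG_DATA=30000000
"""

for name, value in sorted(read_maxmsg_settings(sample_cfg).items()):
    print(f"{name} = {value}")
```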


list Odinn · Mon, 18 Jan 2010 12:03:37 -0800 (PST) ·

My Xymon server monitors over 1500 clients with no issues.  When I see false alerts, it has always been a configuration error on my part, where I have 2 servers in my bb-hosts file using the same name on different IPs.
 --


Jim Sloan


Just remember, today is the day you thought tomorrow was going to be yesterday.
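
A duplicate-name mistake of this kind can be checked with a few lines of code. This is a minimal sketch that assumes host entries in bb-hosts start with an IP address followed by the hostname; the function name and the sample addresses and hostnames are made up for illustration.

```python
from collections import defaultdict


def duplicate_hostnames(bb_hosts_text: str) -> dict:
    """Map each hostname that appears with more than one IP to its sorted IPs."""
    seen = defaultdict(set)
    for line in bb_hosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Host entries begin with an IP address; directives such as
        # "page" or "group" do not begin with a digit and are skipped.
        if len(fields) >= 2 and fields[0][0].isdigit():
            seen[fields[1]].add(fields[0])
    return {host: sorted(ips) for host, ips in seen.items() if len(ips) > 1}


# Hypothetical bb-hosts content with one hostname listed under two IPs.
sample_hosts = """\
page prod Production
group Databases
10.0.0.11   db01.example.com   # ssh
10.0.0.12   db02.example.com   # ssh
10.0.0.99   db01.example.com   #
"""

for host, ips in duplicate_hostnames(sample_hosts).items():
    print(f"{host} is listed under {', '.join(ips)}")
```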

  12     0 root      Dec 14  S 152  0.00 00:00:00       0 usbhubd
  11     0 root      Dec 14  R 152  0.00 00:01:11       0 escsid
  10     0 root      Dec 14  S -32  0.00 00:00:00       0 ttisr
   9     0 root      Dec 14  R 152  0.00 00:01:27       0 ksyncer_daemon
  
7     0]root      Dec 14  R 152
0.00 00:]0:00       0 kai_daemon
  50     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
  47     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
  44     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
  41     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached


On Sun, Jan 17, 2010 at 4:21 PM, Josh Luthman <user-4c45a83f15cb@xymon.invalid> wrote:

Is there only one client sending data as this name?  I don't think you answered Lars' email.

What does the alert read and what does the data say?  Missing process?  Too high of a load?

Josh Luthman

Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

"The secret to creativity is knowing how to hide your sources."
--- Albert Einstein


On Sun, Jan 17, 2010 at 6:11 PM, Chris Naude <user-aaac7867ee41@xymon.invalid> wrote:

The problem has suddenly become much much worse. I verified with tcpdump that the data coming from the client is 100% correct. It seems something on the Xymon server side is not handling the client data correctly. Anyone have any other ideas?

89%     /testdb3 (37771472% used) has reached the PANIC level (95%)

Filesystem            1024-blocks  Used  Available Capacity Mounted on
/dev/vgtestdb1/lvol1    107844344 70901816 36942528    66%     /testdb1
/dev/vgtestdb2/lvol1    35962064 25453128 10508936    71%     /testdb2
/dev/vgtestdb4/lvol1    970909400 825006344 145903056    85%     /testdb4
/dev/vgtestdb3/lv
l1 ]  338788224 301016752 37771472    89%     /testdb3
/dev/vgtestdb5/lvol1    179789048 150553912 29235136    84%     /testdb5
/dev/vg00/lvol8       24580711    74501 24506210     1%     /home
/dev/vg00/lvol4       10226680  6339283  3887397    62%     /opt


On Sat, Jan 16, 2010 at 10:44 AM, Chris Naude <user-aaac7867ee41@xymon.invalid> wrote:

That makes a lot of sense. I did have some issues with the startup scripts on HP-UX. I'll check it out later tonight. Hopefully I can get it fixed before it goes live tonight. Thanks!


On Sat, Jan 16, 2010 at 7:56 AM, Lars Ebeling <user-1fecd3eafd52@xymon.invalid> wrote:

It looks like two instances of the client are writing to the file at the same time or almost ;)
Lars
----- Original Message -----

From: Chris Naude
To: user-ae9b8668bcde@xymon.invalid
Sent: Saturday, January 16, 2010 4:59 AM
Subject: [hobbit] False Process Down Alerts

I've run into a strange problem with my Xymon server. I noticed today that I'm receiving random false alerts for processes being down. When I look at the process list output in the alert it looks as if the data coming from the clients isn't correct. Here is an example. Has anyone seen anything like this?


9613  1944 root      Jan 11  S 154  0.00 00:00:00    6128 cmclconfd -c
10389  1944 root      Jan 11  S 154  0.00 00:00:00    6128 cmclconfd -c
9794     1 oracle   10:55:57 S 154  0.00 00:00:0
 217600]oracleTEST (LOCAL=NO)
1592     1 oracle    Jan 11  S 154  0.00 00:00:11  217136 ora_mman_TEST
12751  1944 root      Jan 11  S 154  0.00 00:00:00    6128 cmclconfd -c
8965  1944 root      Jan 11  S 154  0.00 00:00:00    6128 cmclconfd -c


11819     1 oracle    Jan 12  S 154  0.00 00:00:07  217280 ora_j015_TEST
2711     1 roo
     ]ec  4  S 120  0.04 00:02:16     868 /usr/sbin/xntpd
3547     1 xymon     Dec  4  S 168  0.00 00:00:43     268 /opt/xymon/client/bin/hobbitlaunch --config=/opt/xymon/client/etc/clientlaunch.cfg --log=/opt/xymon/client/logs/clientlaunch.log --pidfile=/opt/xymon/client/logs/clientlaunch.101.example.com.pid
3728     1 root      Dec  4  R 152  0.00 00:00:37    4208 /usr/sbin/stm/uut/bin/tools/monitor/WbemWrapperMonitor


Xymon version: 4.3.0-0.beta2
Xymon server: CentOS 5.4 32 bit


Client: HP-UX 11.31 Itanium

-- 
Chris Naude


list Chris Naude · Mon, 18 Jan 2010 17:46:54 -0700 ·

Before this, I had never received any alerts about messages being
truncated. After disabling the non-prod clients I started receiving alerts
about the messages being truncated. I adjusted these values as specified
below and they are good now. Tomorrow I'll enable the non-prod servers
again and see if this was the original culprit. Thanks!


▸ quoted from Odinn

On Mon, Jan 18, 2010 at 12:41 PM, Williams, Doug (Consultant-RIC) <user-63162c140807@xymon.invalid> wrote:

It seems to me your clients' data is being truncated.  Try modifying
these settings in your hobbitserver.cfg.  You may want to set them to a
size appropriate for your Xymon server.  I have Xymon running on pretty
beefy servers, so I set these incredibly high; they may even exceed what
Xymon actually allows, but that is not hurting me.  Restart the hobbit
server after making the change to hobbitserver.cfg.


MAXMSG_STATUS=30000000
MAXMSG_CLIENT=30000000
MAXMSG_DATA=30000000
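
To get a feel for whether a client report would fit under a given limit, something like the sketch below may help. It is only an illustration: it assumes the MAXMSG_* values are given in kB (worth confirming against the server configuration documentation for your version), and the helper names, the 700 kB message size and the 512 kB comparison limit are hypothetical.

```python
import re


def parse_maxmsg_limits(cfg_text):
    """Collect MAXMSG_* settings from hobbitserver.cfg-style text as integers."""
    limits = {}
    for line in cfg_text.splitlines():
        # Matches lines such as MAXMSG_CLIENT=30000000 or MAXMSG_CLIENT="30000000".
        m = re.match(r'\s*(MAXMSG_[A-Z]+)\s*=\s*"?(\d+)"?', line)
        if m:
            limits[m.group(1)] = int(m.group(2))
    return limits


def fits(message_bytes, limit_kb):
    """True if a message of this size fits, ASSUMING the limit is expressed in kB."""
    return message_bytes <= limit_kb * 1024


cfg = """\
MAXMSG_STATUS=30000000
MAXMSG_CLIENT=30000000
MAXMSG_DATA=30000000
"""
limits = parse_maxmsg_limits(cfg)
client_message_bytes = 700 * 1024  # hypothetical 700 kB client report

print(fits(client_message_bytes, 512))                      # hypothetical smaller limit
print(fits(client_message_bytes, limits["MAXMSG_CLIENT"]))  # value quoted above
```

If the kB assumption holds, values this large are far beyond what any client report needs, so a more modest setting sized to the largest report actually seen would probably behave the same.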



-- 
Chris Naude

list Tom L. Stewart · Mon, 18 Jan 2010 22:27:25 -0600 ·

I had this problem and made the same adjustment. Since then, I get a
5-minute hole in the load average and a couple of other trends, even
though on the Solaris systems I have no problems using the multi-CPU and
zone process. Most of the time when the hole shows up, I get other
missing 5-minute stats exactly one hour after the first one, and this
then repeats two or three times. I have tried disabling the caching, but
it made no difference. The 4.3.0-2 beta seems to be very broken and no
one knows why. Right now, I am trying to determine whether I am better
off with another product, since these issues do not seem to be a
priority for anyone.

 
Tom

