xymon hostdata module going rogue

21 messages in this thread

list Scot Kreienkamp · Wed, 10 Jun 2015 17:01:21 +0000 ·

Hi everyone,

I have a xymon server running 4.3.21 that seems to be accumulating processes like these:

hobbit   28430  0.0  0.0      0     0 ?        Z    12:50   0:00 [xymond_hostdata] <defunct>
hobbit   28435  0.0  0.0      0     0 ?        Z    12:50   0:00 [xymond_hostdata] <defunct>
hobbit   28440  0.0  0.0      0     0 ?        Z    12:50   0:00 [xymond_hostdata] <defunct>
hobbit   28444  0.0  0.0      0     0 ?        Z    12:50   0:00 [xymond_hostdata] <defunct>
hobbit   28449  0.0  0.0      0     0 ?        Z    12:50   0:00 [xymond_hostdata] <defunct>
hobbit   28452  0.0  0.0      0     0 ?        Z    12:50   0:00 [xymond_hostdata] <defunct>

It seemed related to drop messages, so I did a test.


[hobbit at retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
161
[hobbit at retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
162
[hobbit at retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
163
[hobbit at retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
164
[hobbit at retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
165
[hobbit at retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
166
[hobbit at retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
167

So every time I send a drop message I get a defunct process hanging out.  Bug in Xymon?

This is on RHEL5, xymon 4.3.21.

Thanks!

Scot Kreienkamp | Senior Systems Engineer | La-Z-Boy Corporate
One La-Z-Boy Drive | Monroe, Michigan 48162 | Office: XXX-XXX-XXXX | Mobile: XXXXXXXXXX | Email: user-9678697f1438@xymon.invalid
www.la-z-boy.com | facebook.com/lazboy | twitter.com/lazboy | youtube.com/lazboy



This message is intended only for the individual or entity to which it is addressed.  It may contain privileged, confidential information which is exempt from disclosure under applicable laws.  If you are not the intended recipient, you are strictly prohibited from disseminating or distributing this information (other than to the intended recipient) or copying this information.  If you have received this communication in error, please notify us immediately by e-mail or by telephone at the above number. Thank you.

Attachments (1)

attachment.jpg image/jpeg · 6.4 KB

list Japheth Cleaver · Wed, 10 Jun 2015 10:20:45 -0700 ·

▸ quoted from Scot Kreienkamp

Scot,


Some background: When doing a full drop on a host, xymond_hostdata (and
xymond_history, IIRC) forks to perform the recursive directory removal of
history files and whatnot in the background, then exits out. That's why it
corresponds to those events.


Looks like xymond_hostdata.c is missing a SIGCHLD registration, which is
causing the defunct processes to stack up. Strangely, I haven't observed
this behavior on RHEL6 at all though, even though we're dropping hosts all
the time. Odd.


The following patch should fix the issue for you, I believe.


Regards,

-jc

Attachments (1)

attachment.obj application/octet-stream · 354 B

list Scot Kreienkamp · Wed, 10 Jun 2015 17:44:52 +0000 ·


Hi JC,

Thanks, but no such luck.  I deleted the entire 4.3.21 source tree and expanded it again to make sure I get a pristine source, put the patch in the xymond directory, applied the patch with patch -p1.  It applied cleanly so I did a configure, make, make install.  I am still getting the defunct processes though.  I am not seeing anything in the logs.

▸ quoted from Scot Kreienkamp



list Japheth Cleaver · Wed, 10 Jun 2015 14:57:15 -0700 ·

▸ quoted from Scot Kreienkamp


Oy, my apologies.

That's what I get for typing faster than thinking.
Can you try this patch instead?

-jc

Attachments (1)

attachment.obj application/octet-stream · 349 B

list Scot Kreienkamp · Thu, 11 Jun 2015 13:45:55 +0000 ·

Yep, that one worked.

Thanks!

▸ quoted from Japheth Cleaver



list John Thurston · Fri, 28 Aug 2015 12:45:20 -0800 ·

▸ quoted from Scot Kreienkamp

Hey, I think I'm seeing the same thing on Solaris with 4.3.21

I've ended up here after a customer let me know that email alerts were 
not working as expected. After a few hours of digging around, I decided 
that the alert daemon was failing to retrieve hostnames and failing 
miserably.

Have other people seen this behavior?
-- 
    Do things because you should, not just because you can.

John Thurston    XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska

list John Thurston · Fri, 28 Aug 2015 14:16:00 -0800 ·

▸ quoted from John Thurston

I have duplicated this behavior on another xymon server on Solaris. It 
certainly looks like this behavior breaks the alert daemon. Fortunately, 
I "drop" hosts in batches so can restart Xymon at that time, but this is 
still pretty icky.

J.C., do you know if your patch made it into the code-base?

Has anyone else tested this patch? If so, on what operating systems?

▸ quoted from John Thurston


-- 
    Do things because you should, not just because you can.

John Thurston    XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska

list Japheth Cleaver · Fri, 28 Aug 2015 16:12:14 -0700 ·

▸ quoted from John Thurston

I thought this had sounded familiar.

The patch from
http://lists.xymon.com/pipermail/xymon/2015-June/041833.html was checked
in at https://sourceforge.net/p/xymon/code/7669/ ; however, it's not in the
most recent Terabithia RPM.

If you could test the direct patch (for hostdata, at
http://lists.xymon.com/pipermail/xymon/attachments/20150610/8b425efb/attachment.obj
) on your OS, that would be very helpful. Signal handling is always a bit
tricky to get right across the board.


Regards,

-jc

list Andy Smith · Sun, 30 Aug 2015 10:56:26 +0100 ·

▸ quoted from Japheth Cleaver

Problem reproduced here on Solaris 10, and solved by the suggested patch.
-- 
Andy

list John Thurston · Mon, 31 Aug 2015 09:19:13 -0800 ·

▸ quoted from John Thurston

I have patched one of my servers and it behaves much better under my 
contrived tests :) This is under Solaris 10 (Update 11) on SPARC. The 
original report was under Red Hat Enterprise Linux 5.

If my understanding of this is correct, it is a pretty nasty defect :(

My failure scenario was non-delivery of some email alerts for hosts in 
dire straits. I have several customers who do not monitor the web 
interface, but rely on email notifications to warn them of impending 
problems. These folks had been without any alerting capability since
early in July when I "dropped" a host and unknowingly clobbered the
child of xymond_hostdata.

▸ quoted from John Thurston


-- 
    Do things because you should, not just because you can.

John Thurston    XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska

list Japheth Cleaver · Mon, 31 Aug 2015 14:24:05 -0700 ·

▸ quoted from John Thurston

Thanks for the confirmation... Yes, I believe it's probably time to start
another release cycle, for this and a few other recent bug fixes
still pending.


Regards,

-jc

list Mark Felder · Fri, 04 Sep 2015 11:08:20 -0500 ·

▸ quoted from Japheth Cleaver


For the record, I can't reproduce this on FreeBSD either.

list Mark Felder · Fri, 04 Sep 2015 11:27:30 -0500 ·


On Fri, Sep 4, 2015, at 11:08, Mark Felder wrote:

For the record, I can't reproduce this on FreeBSD either.

This specifically was for the extra "xymond_hostdata" child processes... 

Now that I think of it, I do recall being unable to identify why some
alerts were not sent on a large xymon installation... perhaps this was
the culprit? Do we know roughly how long this problem may have existed?

list John Thurston · Tue, 01 Dec 2015 08:14:20 -0900 ·

How embarrassing. I was composing a note to mention a problem with the 
list archives not capturing all messages . . . when I discovered that 
the message for which I was searching was never sent to the list.

I composed the following message back in early October and then sent it 
only to myself :p  No wonder it didn't generate any chatter.

▸ quoted from Mark Felder


J.C.'s patch took care of the defunct/zombie processes on "drop" events,
but I've just discovered that it does not solve the underlying problem.
It still appears that xymond_hostdata does not behave correctly
following a "drop" command. The effect is that alerts fail to be
delivered for _some_ messages because hostnames can no longer be retrieved.

Example:

My xymon server is humming along. I have the alert module debug-logging 
to alerts.log.  Immediately after issuing a "drop" command of the sort:

#xymon localhost "drop foo.bar.com sslcert"

entries of the following sort appear in alerts.log. After this, some messages 
may result in alert emails being sent, but most quietly disappear.
Currently, my resolution is to "xymon.sh restart" but that is much too 
heavy-handed for long-term use.

21178 2015-10-05 16:39:43.257559 get_xymond_message: Interrupted
21178 2015-10-05 16:39:43.257624 No files modified, skipping reload of /opt/xymon/server/etc/alerts.cfg
21178 2015-10-05 16:39:43.257680 No files modified, skipping reload of /opt/xymon/server/etc/holidays.cfg
21178 2015-10-05 16:39:43.257718 Checking criteria for host 'doadrbjnu-sp.bar.com', which is not defined
21178 2015-10-05 16:39:43.257773 Found a first matching rule
21178 2015-10-05 16:39:43.257802 Checking criteria for host 'doadrbjnu-sp.bar.com', which is not defined
21178 2015-10-05 16:39:43.257830 Checking criteria for host 'doadrbjnu-sp.bar.com', which is not defined
21178 2015-10-05 16:39:43.257854 Found a first matching rule
21178 2015-10-05 16:39:43.257879 Checking criteria for host 'doadrbjnu-sp.bar.com', which is not defined
21178 2015-10-05 16:39:43.257910 Checking criteria for host 'steam.bar.com', which is not defined
21178 2015-10-05 16:39:43.257935 Found a first matching rule
21178 2015-10-05 16:39:43.257960 Checking criteria for host 'steam.bar.com', which is not defined
21178 2015-10-05 16:39:43.257986 Checking criteria for host 'steam.bar.com', which is not defined
21178 2015-10-05 16:39:43.258010 Found a first matching rule
21178 2015-10-05 16:39:43.258035 Checking criteria for host 'steam.bar.com', which is not defined
21178 2015-10-05 16:39:43.258061 Checking criteria for host 'upsjdc.bar.com', which is not defined
21178 2015-10-05 16:39:43.258088 Found a first matching rule
21178 2015-10-05 16:39:43.258113 Checking criteria for host 'upsjdc.bar.com', which is not defined
21178 2015-10-05 16:39:43.258140 Checking criteria for host 'upsjdc.bar.com', which is not defined
21178 2015-10-05 16:39:43.258164 Found a first matching rule
21178 2015-10-05 16:39:43.258188 Checking criteria for host 'upsjdc.bar.com', which is not defined
21178 2015-10-05 16:39:43.258211 0 alerts to go
21178 2015-10-05 16:39:43.258270 Want msg 5039, startpos 134769, fillpos 134769, endpos -1, usedbytes=0, bufleft=131470
21178 2015-10-05 16:39:47.962032 Got 2831 bytes
21178 2015-10-05 16:39:47.962143 xymond_alert: Got message 5039 @@page#5039/soajnuexhs1.bar.com|1444091987.961845|10.2.3.40|soajnuexhs1.bar.com|msgs|0.0.0.0|1444093787|red|red|1444088306|ETS/MsgDir|540754||||
21178 2015-10-05 16:39:47.962171 startpos 137600, fillpos 137600, endpos -1
21178 2015-10-05 16:39:47.962204 Got page message from soajnuexhs1.bar.com:msgs
21178 2015-10-05 16:39:47.962252 Want msg 5040, startpos 137600, fillpos 137600, endpos -1, usedbytes=0, bufleft=128639
21178 2015-10-05 16:39:58.022397 Got 297 bytes
21178 2015-10-05 16:39:58.022526 xymond_alert: Got message 5040 @@page#5040/doadofjdc-ea05p.bar.com|1444091998.022274|10.2.167.44|doadofjdc-ea05p.bar.com|msgs|0.0.0.0|1444093798|green|red|1444091998|DOA/IRIS|||||
21178 2015-10-05 16:39:58.022558 startpos 137897, fillpos 137897, endpos -1
21178 2015-10-05 16:39:58.022593 Got page message from doadofjdc-ea05p.bar.com:msgs
21178 2015-10-05 16:39:58.022630 Alert status changed from 1 to 0
21178 2015-10-05 16:39:58.022666 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022706 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022739 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022776 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022808 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022841 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022873 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022904 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022935 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022967 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022998 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023028 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023059 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023089 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023120 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023151 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023187 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023221 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023252 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023282 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023313 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023342 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023369 Found no first matching rule
21178 2015-10-05 16:39:58.023402 Want msg 5041, startpos 137897, fillpos 137897, endpos -1, usedbytes=0, bufleft=128342
21178 2015-10-05 16:40:10.109262 get_xymond_message: Returning NULL due to EOF
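For anyone trying to read the debug output above, the "@@page" lines can be split on "|" to see what xymond_alert actually received. This is only a minimal sketch: the field names are assumptions inferred from the order of the values in the sample line (nothing in this thread states them), and parse_page_message is a hypothetical helper, not part of Xymon.

```python
from datetime import datetime, timezone

# Field names are ASSUMPTIONS inferred from the sample line quoted above;
# they are not taken from this thread or verified against the Xymon source.
ASSUMED_FIELDS = [
    "header", "timestamp", "sender", "hostname", "testname", "hostip",
    "expiretime", "color", "prevcolor", "changetime", "location", "cookie",
]

def parse_page_message(line):
    """Split one '@@page' channel message into a dict keyed by assumed names."""
    parts = line.strip().split("|")
    fields = dict(zip(ASSUMED_FIELDS, parts))
    # The second field is epoch seconds; convert it to UTC so it can be
    # compared with the local time printed at the start of the log line.
    fields["received_utc"] = datetime.fromtimestamp(
        float(fields["timestamp"]), tz=timezone.utc
    )
    return fields

# Message 5039, copied from the log above.
sample = ("@@page#5039/soajnuexhs1.bar.com|1444091987.961845|10.2.3.40|"
          "soajnuexhs1.bar.com|msgs|0.0.0.0|1444093787|red|red|1444088306|"
          "ETS/MsgDir|540754||||")

msg = parse_page_message(sample)
print(msg["hostname"], msg["testname"], msg["prevcolor"], "->", msg["color"])
print(msg["received_utc"].isoformat())
```

The epoch value 1444091987 is 2015-10-06 00:39:47 UTC, which lines up with the 16:39:47 local timestamp on the log line if the server clock is eight hours behind UTC. The hostname in the message is perfectly well formed, so the "which is not defined" entries do not look like a parsing problem on the message itself.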

-- 
    Do things because you should, not just because you can.

John Thurston    XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska

list John Thurston · Tue, 01 Dec 2015 08:32:58 -0900 ·

I was bitten by this in the middle of November, and didn't notice it until a customer alerted me today to a shortage of email messages.

To recap:

Some alerts get sent correctly, but in other cases the alert daemon aborts message processing and no alert is sent. In the cases where the daemon stops processing, my debug log begins to accumulate messages of the sort:

1730 2015-12-01 07:58:39.501785 Checking criteria for host 'upsjdc.state.ak.us', which is not defined

There is sometimes a <defunct> process left hanging around. At other times there is not.

Performing a "xymon.sh restart" makes it all work again.

Today, I had a process tree something like:

29118 /opt/xymon/server/bin/xymonlaunch --config=/opt/xymon/server/etc/tasks.cfg --en
  29119 xymond --pidfile=/var/log/xymon/xymond.pid --restart=/opt/xymon/server/tmp/xymo
  29120 /opt/xymon/server/bin/xymonfetch --id=1 --interval=79 --no-daemon --pidfile=/va
  29144 xymond_channel --channel=stachg --log=/var/log/xymon/history.log xymond_history
    29201 xymond_history --pidfile=/var/log/xymon/xymond_history.pid
  29145 xymond_channel --channel=page --log=/var/log/xymon/alert.log xymond_alert --deb
    29307 xymond_alert --debug --checkpoint-file=/opt/xymon/server/tmp/alert.chk --checkp
      1588  <defunct>

I killed off PID 29145, it was recreated, and the alerts began flowing again.

In this occurrence, it does not appear to be related to a "drop" message. My last recorded "drop" was at 20151103-0846 and the alert process didn't start logging "which is not defined" until 20151120-0007

The only thing I can think to do now is make my xymon client monitor the alert.log and warn me when "which is not defined" starts appearing, so I can manually kill/restart the process.
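A check along those lines could be sketched roughly as follows. This is an illustration only, assuming the debug line format shown in this thread stays the same; undefined_hosts and needs_attention are hypothetical names, and how the result gets reported back to Xymon is left out.

```python
import re
from collections import Counter

# Pattern copied from the debug lines quoted in this thread.
UNDEFINED_HOST = re.compile(
    r"Checking criteria for host '([^']+)', which is not defined"
)

def undefined_hosts(log_lines):
    """Count 'which is not defined' entries per hostname."""
    counts = Counter()
    for line in log_lines:
        match = UNDEFINED_HOST.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

def needs_attention(log_lines, threshold=1):
    """Return True once at least `threshold` such entries have appeared."""
    return sum(undefined_hosts(log_lines).values()) >= threshold

# Two lines taken from the logs earlier in this thread.
sample = [
    "1730 2015-12-01 07:58:39.501785 Checking criteria for host "
    "'upsjdc.state.ak.us', which is not defined",
    "21178 2015-10-05 16:39:43.258211 0 alerts to go",
]

print(undefined_hosts(sample))
print(needs_attention(sample))
```

Counting per host rather than just grepping for the string makes it easier to see whether every host is affected or only some of them.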


list Japheth Cleaver · Tue, 1 Dec 2015 12:48:14 -0800 ·

▸ quoted from John Thurston


On Tue, December 1, 2015 9:14 am, John Thurston wrote:

*snip*


Hmm. This seems to be fundamentally a different issue than the "hostdata
module going rogue" thing, which was about zombies never being picked up.
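For readers unfamiliar with the zombie side of this: a child that has exited stays in the process table as <defunct> until its parent collects the exit status. The sketch below is a generic POSIX illustration of that mechanism, not Xymon's code or the patch discussed earlier; both function names are made up for the example.

```python
import os

def spawn_short_lived_child(exit_code=0):
    """Fork a child that exits immediately; return its PID to the parent."""
    pid = os.fork()
    if pid == 0:
        # Child: exit right away without running the parent's cleanup handlers.
        os._exit(exit_code)
    return pid

def reap(pid):
    """Collect the child's exit status so it no longer lingers as <defunct>."""
    reaped_pid, status = os.waitpid(pid, 0)
    return reaped_pid, os.WEXITSTATUS(status)

child = spawn_short_lived_child(exit_code=3)
# Between the child's exit and the waitpid() call below, ps would list the
# child as <defunct>. A parent that never waits accumulates such entries.
print(reap(child) == (child, 3))
```

A parent that forks a worker per event and never waits on it would grow one <defunct> entry per event, which matches the pattern originally reported for "drop" messages.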

AFAICT, somehow the hosts tree structure is getting clobbered as a result
of the drop (assuming all of those hosts are expected to be existing).
There were a few patches for things in xymond.c at one point, and more
error checking when going to POSIX btrees generally, but I hadn't
encountered this in other intermittent hostlist readers.

1) Which version of Solaris is this?
2) Have you experienced this in other workers for xymon? (IE,
xymond_client not being able to look up hostnames after a drop -- would
probably lead to random purples)
3) Does issuing a "reload" command or -HUP to xymond_alert re-sync things?


-jc

list Japheth Cleaver · Tue, 1 Dec 2015 12:51:12 -0800 ·


On Tue, December 1, 2015 9:32 am, John Thurston wrote:
*snip*

▸ quoted from John Thurston

In this occurrence, it does not appear to be related to a "drop"
message. My last recorded "drop" was at 20151103-0846 and the alert
process didn't start logging "which is not defined" until 20151120-0007

Hmm. Okay, that does change things slightly. Fortunately, that means it's
probably specifically caused by drops per se. Were there any other errors
that occurred with other components around this time? Perhaps the system
being low enough on memory that some re-allocations might have failed?

Regards,
-jc

list John Thurston · Tue, 01 Dec 2015 12:03:03 -0900 ·

On 12/1/2015 11:48 AM, J.C. Cleaver wrote:
- snip -

▸ quoted from Japheth Cleaver

Hmm. This seems to be fundamentally a different issue than the "hostdata
module going rogue" thing, which was about zombies never being picked up.

AFAICT, somehow the hosts tree structure is getting clobbered as a result
of the drop (assuming all of those hosts are expected to be existing).

See my later message for its relation to 'drop' activity.

▸ quoted from Japheth Cleaver

There were a few patches for things in xymond.c at one point, and more
error checking when going to POSIX btrees generally, but I hadn't
encountered this in other intermittent hostlist readers.

1) Which version of Solaris is this?

Solaris 10, most recent update, SPARC

▸ quoted from Japheth Cleaver

2) Have you experienced this in other workers for xymon? (IE,
xymond_client not being able to look up hostnames after a drop -- would
probably lead to random purples)

I haven't seen behavior like that with other worker processes.
Is there a way to interactively run a worker process and have it hit the 
daemon process for the hostnames?
Aside from making the process dump core, is there a way to get the 
daemon to spill its current list of hostnames?

3) Does issuing a "reload" command or -HUP to xymond_alert re-sync things?

I didn't do a 'reload', but I killed the "xymond_channel --channel=page 
--log=/var/log/xymon/alert.log xymond_alert" process and alerts started 
working again.

I haven't yet found a way to induce this failure, so I haven't yet 
identified the minimal recovery steps. I'm working on it, though.


list John Thurston · Tue, 01 Dec 2015 12:41:26 -0900 ·

▸ quoted from Japheth Cleaver

On 12/1/2015 11:51 AM, J.C. Cleaver wrote:

On Tue, December 1, 2015 9:32 am, John Thurston wrote:
*snip*

In this occurrence, it does not appear to be related to a "drop"
message. My last recorded "drop" was at 20151103-0846 and the alert
process didn't start logging "which is not defined" until 20151120-0007

Hmm. Okay, that does change things slightly. Fortunately, that means it's
probably specifically caused by drops per se. Were there any other errors
that occurred with other components around this time?

I have several instances of "Oversize status msg from " in the xymond.log, but those appeared six hours before the bad behavior showed up in xymond_alert. I have difficulty believing they are related.

▸ quoted from Japheth Cleaver

Perhaps the system
being low enough on memory that some re-allocations might have failed?

I think this is unlikely. The system has 256GB of RAM, and there are no memory caps placed on the non-global zone in which xymon is running. I don't have information on its size on Nov 20, but today it is using about 400MB of RAM. All of the zones on the system together are consuming less than 10GB of the 256GB, and it wouldn't have been significantly different a few weeks ago.

I've been doing some 'drops' today to try to break it, but haven't succeeded. I'll continue to beat on it and see if I can find a repeatable failure scenario.

fwiw, this is under 4.3.22


list Japheth Cleaver · Tue, 1 Dec 2015 13:53:48 -0800 ·

▸ quoted from John Thurston

On Tue, December 1, 2015 1:41 pm, John Thurston wrote:

On 12/1/2015 11:51 AM, J.C. Cleaver wrote:

On Tue, December 1, 2015 9:32 am, John Thurston wrote:
*snip*

In this occurrence, it does not appear to be related to a "drop"
message. My last recorded "drop" was at 20151103-0846 and the alert
process didn't start logging "which is not defined" until 20151120-0007

Hmm. Okay, that does change things slightly. Fortunately, that means it's
probably specifically caused by drops per se. Were there any other errors
that occurred with other components around this time?

I have several instances of "Oversize status msg from " in the
xymond.log, but those are appearing six hours before the bad behavior
appeared in xymond_alert. I have difficulty believing they are related.

Ack. Yeah, that should have been 'NOT specifically' :)

▸ quoted from John Thurston

Perhaps the system
being low enough on memory that some re-allocations might have failed?

I think this is unlikely. The system has 256GB of RAM, and there are no
memory caps placed on the non-global zone in which xymon is running. I
don't have information on its size on Nov 20, but today it is using about
400MB of RAM. All of the zones on the system are consuming less than
10GB of the 256GB and it wouldn't have been significantly different a
few weeks ago.

I've been doing some 'drops' today to try to break it, but haven't
succeeded. I'll continue to beat on it and see if I can find a
repeatable failure scenario.

fwiw, this is under 4.3.22


Hmm.
This is an area where it's possible that glibc/NULL issues might be
causing subtle things too. I could easily see the btree getting hosed by
tree re-insertion of a key we weren't really expecting.


-jc

list John Thurston · Mon, 14 Dec 2015 11:27:05 -0900 ·

▸ quoted from John Thurston

On 12/1/2015 12:03 PM, John Thurston wrote:

On 12/1/2015 11:48 AM, J.C. Cleaver wrote:
- snip -

Hmm. This seems to be fundamentally a different issue than the "hostdata
module going rogue" thing, which was about zombies never being picked up.

AFAICT, somehow the hosts tree structure is getting clobbered as a result
of the drop (assuming all of those hosts are expected to be existing).

- snip -

▸ quoted from John Thurston

I haven't yet found a way to induce this failure, so I haven't yet
identified the minimal recovery steps. I'm working on it, though.

I think I might be able to reproduce the failure :)  Start with the 
following, stable server arrangement:

+ x.bar.com is running xymon 4.3.22 on Solaris 10 SPARC
+ The following is defined in tasks.cfg:
   CMD xymond_channel --channel=page  --log=$XYMONSERVERLOGS/alert.log \
   xymond_alert --debug --checkpoint-file=$XYMONTMP/alert.chk \
   --checkpoint-interval=600
+ Host foo.bar.com is defined in DNS and does not permit ICMP traffic 
and does not have a xymon client installed on it

Throw a spanner in the works by the following actions:

+ Add host foo.bar.com to an existing page and group in hosts.cfg
+ ~/server/bin/xymoncmd ~/server/bin/xymonnet foo.bar.com

And see the trouble commence in alert.log:

6690 2015-12-14 10:52:06.859998 Got 415 bytes
6690 2015-12-14 10:52:06.860110 xymond_alert: Got message 95 @@page#95/foo.bar.com|1450122726.859873|10.10.10.55|foo.bar.com|conn|0.0.0.0|1450124526|red|none|1450122726|Page/Subpage|65234||||
6690 2015-12-14 10:52:06.860140 startpos 5659, fillpos 5659, endpos -1
6690 2015-12-14 10:52:06.860172 Got page message from foo.bar.com:conn
6690 2015-12-14 10:52:06.860249 Alert status changed from 0 to 1
6690 2015-12-14 10:52:06.860285 Checking criteria for host 'foo.bar.com', which is not defined
6690 2015-12-14 10:52:06.861674 Checking criteria for host 'foo.bar.com', which is not defined
6690 2015-12-14 10:52:06.861728 Checking criteria for host 'foo.bar.com', which is not defined
6690 2015-12-14 10:52:06.861761 Found no first matching rule
6690 2015-12-14 10:52:06.861813 No files modified, skipping reload of /opt/xymon/server/etc/alerts.cfg
6690 2015-12-14 10:52:06.861861 No files modified, skipping reload of /opt/xymon/server/etc/holidays.cfg
6690 2015-12-14 10:52:06.861891 Checking criteria for host 'zebra.bar.com', which is not defined

After killing the "xymond_channel --channel=page" process, a new one is 
created as a child of xymonlaunch and everything behaves normally again.

I currently have a tail on my alert.log to warn me of the appearance of 
the string, "which is not defined". When that appears, I know it is time 
to HUP the "page" channel. This is a rather crude hammer to leave lying 
on the table next to my production server, but it keeps us running :)
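Finding the right process to restart can be scripted from a process listing. The sketch below is hypothetical: it assumes a "PID command-line" listing like the process tree posted earlier in this thread, find_page_channel_pids is a made-up name, and it only reports the PID rather than signalling anything.

```python
def find_page_channel_pids(ps_lines):
    """Return PIDs of xymond_channel processes serving the 'page' channel.

    Assumes each line looks like '<pid> <command> <args...>'.
    """
    pids = []
    for line in ps_lines:
        fields = line.split()
        # Skip header lines or anything that does not start with a PID.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        args = fields[1:]
        # endswith() also matches a full path such as /opt/.../xymond_channel.
        if args[0].endswith("xymond_channel") and "--channel=page" in args:
            pids.append(int(fields[0]))
    return pids

# Lines copied from the process tree posted earlier in this thread.
listing = [
    "29144 xymond_channel --channel=stachg --log=/var/log/xymon/history.log xymond_history",
    "29145 xymond_channel --channel=page --log=/var/log/xymon/alert.log xymond_alert --deb",
    "29307 xymond_alert --debug --checkpoint-file=/opt/xymon/server/tmp/alert.chk --checkp",
]

print(find_page_channel_pids(listing))  # [29145]
```

Sending the signal is deliberately left out; on a production server it is probably safer to have the script print the PID and let a human decide.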

I have a core file from the xymond_channel process, but its stack 
contains only:

 feee041c _syscall6 (1, 1, 0, 1, 7d0, 3a0f4) + 20
 00013c90 _start   (0, 0, 0, 0, 0, 0) + 5c

I have a core file from the xymond_alert process, but its stack contains 
only:

 feede7d8 __pollsys (ffbfcd50, 1, ffbfcdc0, 0, 0, 0) + 8
 fee79b8c pselect  (ffbfcd50, fef56790, fef56790, 40, ffbfcdc0, 0) + 1c8
 fee79f04 select   (1, ffbfce58, 0, 0, ffbfce48, ffbfced8) + a0
 00015fa4 get_xymond_message (4b400, 4b14c, 4b148, ffbfcf88, 4b16c, 35d50) + 270
 0003293c main     (1, 566f245d, 0, 33b00, 4b000, 33bb8) + 378
 00014a34 _start   (0, 0, 0, 0, 0, 0) + 5c

which is whatever it was happily processing when I killed it, not the 
stack at the time it ended up at line 815 of loadalerts.c.

What can I do and what information can I gather which will help narrow 
the fault domain?

