On Mon, Aug 31, 2015, at 16:24, J.C. Cleaver wrote:
On Mon, August 31, 2015 10:19 am, John Thurston wrote:
On Fri, August 28, 2015 3:16 pm, John Thurston wrote:
On 8/28/2015 12:45 PM, John Thurston wrote:
On 6/10/2015 9:01 AM, Scot Kreienkamp wrote:
. . .
hobbit 28452 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
It seemed related to drop messages . . .
Hey, I think I'm seeing the same thing on Solaris with 4.3.21.
I've ended up here after a customer let me know that email alerts were
not working as expected. After a few hours of digging around, I decided
that the alert daemon was failing to retrieve hostnames and failing
miserably.
Have other people seen this behavior?
I have duplicated this behavior on another xymon server on Solaris. It
certainly looks like this behavior breaks the alert daemon. Fortunately,
I "drop" hosts in batches so can restart Xymon at that time, but this is
still pretty icky.
On 8/28/2015 3:12 PM, J.C. Cleaver wrote:
The patch from
http://lists.xymon.com/pipermail/xymon/2015-June/041833.html was checked
in in https://sourceforge.net/p/xymon/code/7669/ , however it's not in the
most recent Terabithia RPM.
If you could test the direct patch (for hostdata, at
http://lists.xymon.com/pipermail/xymon/attachments/20150610/8b425efb/attachment.obj
) on your OS, that would be very helpful. Signal handling is always a bit
tricky to ensure is correct across the board.
I have patched one of my servers and it behaves much better under my
contrived tests :) This is under Solaris 10 (Update 11) on SPARC. The
original report was under Red Hat Enterprise Linux 5.
If my understanding of this is correct, it is a pretty nasty defect :(
My failure scenario was non-delivery of some email alerts for hosts in
dire straits. I have several customers who do not monitor the web
interface, but rely on email notifications to warn them of impending
problems. These folks had been without any alerting capability since
early in July when I "dropped" a host and unknowingly clobbered the
child of xymond_hostdata.
Thanks for the confirmation... Yes, I believe it's probably time to start
another release cycle, for this and a few of the other recent bug fixes
still pending.
For the record, I can't reproduce this on FreeBSD either.