
xymond crashing! -- Please help!

list Matt Vander Werf
Sat, 30 Jan 2016 18:32:07 -0500
Message-Id: <user-990cbddd141b@xymon.invalid>

Oops... somehow that got sent too soon.

No, I haven't made any recent changes to client-local.cfg. I don't actually
use that config for anything.

It seems to work just fine when you're starting off with no xymond.chk file
(like when the file is moved out of the way), but once the service gets
restarted (or stopped and started), then the crashes start again and it
becomes basically unusable. So maybe it has to do with reading the current
state from the xymond.chk file? Or loading all the statuses?

It seems to load all the statuses and then tries to set up a network
listener and then crashes.

No, I'm not seeing any other error messages from xymond's startup that
would seem related. Just that "Cannot bind to listen socket (Address
already in use)" you saw earlier when it crashes.
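
One way to check whether something is still holding the listen port when that message appears is to try binding to it yourself. This is only a minimal sketch: it assumes xymond is on its default port 1984 and listening on IPv4, and `port_in_use` is a made-up helper name, not part of Xymon.

```python
import errno
import socket

XYMON_PORT = 1984  # assumption: xymond's default listen port


def port_in_use(port, host="0.0.0.0"):
    """Return True if binding to (host, port) fails with EADDRINUSE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # some other problem (permissions, bad address, ...)
    return False


# Example: run this right after xymond is stopped.
# print(port_in_use(XYMON_PORT))
```

If this returns True while xymond is supposedly stopped, another process (possibly a leftover xymond) may still own the socket.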

Are you saying I could pull data from the old xymond.chk file and manually
put it in the current xymond.chk file while xymond is stopped? Or something
else?

Any other ideas? I'm sort of in a rut here...  :/ Not entirely sure what I
can do to get my Xymon instance working again..

Any other details I can provide that might shine a light on this issue?

Thanks!!

--
Matt Vander Werf

On Sat, Jan 30, 2016 at 6:05 PM, Matt Vander Werf <user-dfc3cf2ca434@xymon.invalid>
wrote:
Hi J.C.,

No,


--
Matt Vander Werf

On Sat, Jan 30, 2016 at 5:46 PM, J.C. Cleaver <user-87556346d4af@xymon.invalid>
wrote:
On Sat, January 30, 2016 10:45 am, Matt Vander Werf wrote:
Hi J.C.,

So it appears that only fixed it temporarily.

If I stop the service and start it back up again, it crashes again.

I think I figured out how to read the core file and get a backtrace for
you.

Here's what I got from the most recent crash (with some host names
obfuscated):

[New LWP 13283]
Reading symbols from /usr/sbin/xymond...Reading symbols from
/usr/lib/debug/usr/sbin/xymond.debug...done.
done.
Missing separate debuginfo for
Try: yum --enablerepo='*debug*' install
/usr/lib/debug/.build-id/33/97b0d696701dbd7c09eb4bf023f7f4eebec9ed
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `xymond --restart=/var/lib/xymon/tmp/xymond.chk
--checkpoint-file=/var/lib/xymon'.
Program terminated with signal 6, Aborted.
#0  0x00007f570e29a5f7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install
glibc-2.17-106.el7_2.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64
krb5-libs-1.13.2-10.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64
libselinux-2.2.2-6.el7.x86_64 lz4-r131-1.el7.x86_64
openssl-libs-1.0.1e-51.el7_2.2.x86_64 pcre-8.32-15.el7.x86_64
xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) backtrace
#0  0x00007f570e29a5f7 in raise () from /lib64/libc.so.6
#1  0x00007f570e29bce8 in abort () from /lib64/libc.so.6
#2  0x00007f570f53cdf5 in sigsegv_handler (signum=<optimized out>) at sig.c:57
#3  <signal handler called>
#4  0x00007f570f5403b4 in xtree_i_compare (pa=0x7ffead8cb9a0, pb=0x2020202020202020) at tree.c:47
#5  0x00007f570e3574c0 in tfind () from /lib64/libc.so.6
#6  0x00007f570f5405d4 in xtreeFind (treehandle=<optimized out>, key=key@entry=0x7f57142cb320 "*<client hostname>*") at tree.c:140
#7  0x00007f570f5386bd in get_clientconfig (hostname=hostname@entry=0x7f57142cb320 "*<client hostname>*", hostclass=hostclass@entry=0x7f57208e4612 "linux", hostos=hostos@entry=0x7f57208e460c "linux") at clientlocal.c:192
#8  0x00007f570f535dec in do_message (msg=msg@entry=0x7f572064c300, origin=origin@entry=0x7f570f550e97 "", can_respond=can_respond@entry=1) at xymond.c:4955
#9  0x00007f570f5282c7 in main (argc=<optimized out>, argv=<optimized out>) at xymond.c:6288


Is this what you wanted? Do you want me to install the debug package for
glibc or other packages?

Let me know what I can do.

Thanks!!
This works. It's strange in that it points to a problem with the
client-local configs, but I'm not sure how the tree would get into a
corrupt state.

Were any changes made recently to the client-local file? Any other errors
seen during xymond's startup that might seem related?

It's probably *not* an issue with a status message, if they're all
crashing at the same spot. This was an incoming client message that was
either garbled or accessing garbled data somehow.
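
In frame #4 of the backtrace, pb is 0x2020202020202020. Every byte of that value is 0x20, the ASCII space character, so it reads more like text than like a heap address. That would be consistent with a tree node having been overwritten by space-padded string data, though the trace alone does not show where the overwrite came from. A small sketch for decoding such a value (the helper name is made up for illustration):

```python
def pointer_as_ascii(value, width=8):
    """Interpret a pointer value as raw bytes and decode them as ASCII."""
    raw = value.to_bytes(width, byteorder="little")
    return raw.decode("ascii", errors="replace")


pb = 0x2020202020202020  # second argument to xtree_i_compare in frame #4
print(repr(pointer_as_ascii(pb)))  # prints '        ' (eight spaces)
```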

--
Matt Vander Werf

On Sat, Jan 30, 2016 at 1:10 PM, Matt Vander Werf <user-dfc3cf2ca434@xymon.invalid>
wrote:
Hi J.C.,

Moving the xymond.chk checkpoint file out of the way after xymond was
stopped seemed to fix this (at least so far).

I see that I lost all record of disabled tests (I'm getting alerts for
things that were disabled).

What data exactly did I lose by moving that checkpoint file out of the
way?

Is there any way to get the data back? Or maybe figure out the corruption
in the checkpoint file and then move the file back in place?
There are several different bits in there, including scheduled tasks,
disable states, and the current status messages. You can manually copy the
file back at this point while xymond is off and it will load state back
from it (along with the old status messages, but they'll get overwritten
as soon as the next cycle comes through).
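
The copy-back step could be sketched roughly as follows. This assumes xymond has already been stopped by whatever means the system uses; the function name, the name of the set-aside file, and the ".before-restore" suffix are illustrative choices, not Xymon conventions.

```python
import shutil
from pathlib import Path


def restore_checkpoint(saved, live):
    """Copy a set-aside checkpoint file back into place.

    Call this only while xymond is stopped; the function does not
    check for a running daemon.
    """
    saved, live = Path(saved), Path(live)
    if not saved.is_file():
        raise FileNotFoundError(saved)
    if live.exists():
        # Keep whatever xymond wrote in the meantime instead of losing it.
        shutil.copy2(live, live.with_name(live.name + ".before-restore"))
    shutil.copy2(saved, live)
    return live


# Example (the set-aside file name here is hypothetical):
# restore_checkpoint("/var/lib/xymon/tmp/xymond.chk.saved",
#                    "/var/lib/xymon/tmp/xymond.chk")
```

Keeping a copy of the current file means the fresh state can be put back if the restored checkpoint causes trouble.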

Also, see my most recent e-mail with the xymonlaunch log (if you haven't
already). Looks like this has happened in the past but resolved itself....

Regarding the backtrace....

I put those lines in /etc/sysconfig/xymonlaunch and I see the core files
being generated now.
I feel embarrassed to admit this, but how exactly do I get the backtrace
out of the binary core files, besides trying to read the files with an
editor? Any way to know which core file had the backtrace?
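
For reference, a common way to get a backtrace out of a core file is to hand gdb both the binary and the core and ask for `bt`. The sketch below wraps that in a small script; it assumes gdb is installed, and the core file name in the example is a placeholder rather than a path from this thread. Every core file from a crash should contain a stack to trace, so the most recent one is usually the one to look at.

```python
import subprocess


def gdb_backtrace_command(binary, core):
    """Build a gdb command line that prints a backtrace and exits."""
    return ["gdb", "--batch", "-ex", "bt", binary, core]


def backtrace(binary, core):
    """Run gdb against the core file and return its text output."""
    result = subprocess.run(
        gdb_backtrace_command(binary, core),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


# Example (core file name is a placeholder):
# print(backtrace("/usr/sbin/xymond", "core.12345"))
```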

Also, I see this in journalctl:

Ignoring invalid environment assignment 'export
DAEMON_COREFILE_LIMIT=unlimited': /etc/sysconfig/xymonlaunch
Ugh. systemd :( I forgot that that's not a real shell file any more. Looks
like you found a way though!
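
The journal message above comes from the `export` keyword: systemd reads an environment file as plain VAR=VALUE lines, not as shell. A quick sketch for spotting such lines in a sysconfig file before systemd complains (the function name is made up, and this checks only for a leading `export`, not every rule systemd applies):

```python
def find_export_lines(text):
    """Return (line_number, line) pairs that begin with the shell keyword export."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("export "):
            flagged.append((number, stripped))
    return flagged


sample = "export DAEMON_COREFILE_LIMIT=unlimited\nFOO=bar\n"
print(find_export_lines(sample))
# prints [(1, 'export DAEMON_COREFILE_LIMIT=unlimited')]
```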


-jc