Xymon Mailing List Archive

xymond crashing! -- Please help!

11 messages in this thread

list Matt Vander Werf · Sat, 30 Jan 2016 11:21:53 -0500 ·
Hello,

I'm having a major issue with xymond crashing shortly after the service
starts.

I'm using the latest Terabithia RPM for RHEL 7
(4.3.24-3.el7.terabithia).

When I check the status of the xymon service, it shows it as up but with
only the xymonlaunch parent process and vmstat processes. Upon restarting
the service, I see it start normally (all the normal channel processes,
etc.) and then after a while they all go away, leaving the following
process behind:

           ├─2760 xymon-signal 0.0.0.0 status+1d/group:signal <server hostname>.xymond red (Check time of report) - xymond program crashed Fatal signal caught!

along with the xymonlaunch process and some vmstat processes. After a while
that process goes away. Sometimes a single xymond_rrd will show up
alongside the xymonlaunch and vmstat processes as well after a little while.

I'm already running xymond in --debug mode.

This is what I see in the xymond log around the time of the crash:

2773 2016-01-30 11:02:32.515505 Status: Host=<host>, test=ntp
2773 2016-01-30 11:02:32.515507  -- create_hostlist_t for <host> (<client IP address>)
2773 2016-01-30 11:02:32.515513 Status: Host=<host>, test=conn
2773 2016-01-30 11:02:32.515520 Status: Host=<host>, test=raid
2773 2016-01-30 11:02:32.515529 Status: Host=<host>, test=memory
2773 2016-01-30 11:02:32.515534 Status: Host=<host>, test=files
2773 2016-01-30 11:02:32.515670 Status: Host=<host>, test=procs
2773 2016-01-30 11:02:32.515879 Status: Host=<host>, test=inode
2773 2016-01-30 11:02:32.515891 Status: Host=<host>, test=disk
2773 2016-01-30 11:02:32.516004 Status: Host=<host>, test=cpu
2773 2016-01-30 11:02:32.516605 Loaded 14419 status logs
2016-01-30 11:02:32 Setting up network listener on 0.0.0.0:1984
2016-01-30 11:02:32.516677 Cannot bind to listen socket (Address already in use)
2016-01-30 11:02:59.538906 Whoops ! Failed to send message (Timeout)
2016-01-30 11:02:59.539020 ->
2016-01-30 11:02:59.539023 ->  Recipient '<server IP address>', timeout 50
2016-01-30 11:02:59.539024 ->  1st line: 'status+1d/group:signal <server hostname>.xymond red (Check time of report) - xymond program crashed'

It seems to finish loading all the hosts and then crashes
(the last host before the crash is the last client I have alphabetically).

I've tried stopping the service, killing off any remaining xymon-owned
processes, and starting the service again, with the same results. I've also
tried restarting the xymon server machine itself, and the same crash happens
when the service starts the first time.

This just started happening out of the blue a couple of hours ago...

Looking in netstat, there are no active connections using port 1984 on the
local side, just a bunch of clients trying to connect to the server with
1984 in the foreign address.

ANY help would be much appreciated as currently our Xymon server is not
working!!

Thanks!!

--
Matt Vander Werf
list Matt Vander Werf · Sat, 30 Jan 2016 11:28:06 -0500 ·
As a followup, xymond seems to try and start itself up again after a while
(probably because xymonlaunch is still running) and goes for a short while
working just fine and then just crashes again with the same messages and
results.

--
Matt Vander Werf

list Matt Vander Werf · Sat, 30 Jan 2016 12:33:18 -0500 ·
See below for a snippet from the xymonlaunch.log log file from starting up
to crashing to trying to restart itself to crashing some more.

Looking through the xymonlaunch log, it looks like this kind of pattern
shows up quite a few other times, dating back to November last year. But it
always seemed to resolve itself after a good while.
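
One way to quantify how often this pattern recurs is to tally task terminations from xymonlaunch.log. The following is a minimal sketch, assuming the "Task <name> terminated by signal N" and "Task <name> terminated, status N" line formats that appear in this log; the function name and the sample text are illustrative only. On Linux, signal 6 is SIGABRT and signal 15 is SIGTERM.

```python
import re
from collections import Counter

# Line formats assumed from the xymonlaunch.log excerpt in this thread.
SIGNAL_RE = re.compile(r"Task (\S+) terminated by signal (\d+)")
STATUS_RE = re.compile(r"Task (\S+) terminated, status (\d+)")


def summarize_terminations(log_text):
    """Count task terminations, keyed by (task, kind, code)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SIGNAL_RE.search(line)
        if match:
            counts[(match.group(1), "signal", int(match.group(2)))] += 1
            continue
        match = STATUS_RE.search(line)
        if match:
            counts[(match.group(1), "status", int(match.group(2)))] += 1
    return counts


# A few lines in the same format as the log, used as sample input.
sample = """\
2016-01-30 12:08:05.802277 Task xymond terminated by signal 6
2016-01-30 12:08:05.803642 Task xymonnet terminated by signal 15
2016-01-30 12:08:06.348868 Task xymond terminated, status 1
2016-01-30 12:08:11.896124 Task xymond terminated, status 1
"""

for key, count in sorted(summarize_terminations(sample).items()):
    print(key, count)
```

Running this over the full log file (read with `open(path).read()`) would show whether the signal 6 aborts cluster around particular dates.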

Not sure how long it will take to resolve itself this time (if it resolves
itself at all).

Any guidance is appreciated and let me know if there's anything I can do or
provide to help figure out this issue!

Thanks!!


2016-01-30 12:07:40 xymonlaunch starting
2016-01-30 12:07:40.975237 Loading tasklist configuration from /etc/xymon/tasks.cfg
2016-01-30 12:07:40 xymonlaunch: starting task [xymond]
2016-01-30 12:07:47 xymonlaunch: starting task [history]
2016-01-30 12:07:47 xymonlaunch: starting task [alert]
2016-01-30 12:07:47 xymonlaunch: starting task [clientdata]
2016-01-30 12:07:47 xymonlaunch: starting task [rrdstatus]
2016-01-30 12:07:47 xymonlaunch: starting task [rrddata]
2016-01-30 12:07:47 xymonlaunch: starting task [hostdata]
2016-01-30 12:07:47 xymonlaunch: starting task [storestatus]
2016-01-30 12:08:05.802277 Task xymond terminated by signal 6
2016-01-30 12:08:05 xymonlaunch: starting task [xymond]
2016-01-30 12:08:05.803642 Task xymonnet terminated by signal 15
2016-01-30 12:08:06.348868 Task xymond terminated, status 1
2016-01-30 12:08:11 xymonlaunch: starting task [xymond]
2016-01-30 12:08:11.896124 Task xymond terminated, status 1
2016-01-30 12:08:16 xymonlaunch: starting task [xymond]
2016-01-30 12:08:17.438144 Task xymond terminated, status 1
2016-01-30 12:08:22 xymonlaunch: starting task [xymond]
2016-01-30 12:08:22.981957 Task xymond terminated, status 1
2016-01-30 12:08:27 xymonlaunch: starting task [xymond]
2016-01-30 12:08:28.530953 Task xymond terminated, status 1
2016-01-30 12:08:28.531006 Postponing restart of [xymond] for 600 seconds from last start due to multiple failures
2016-01-30 12:18:31.888403 Releasing [xymond] from failure hold
2016-01-30 12:18:31 xymonlaunch: starting task [xymond]
2016-01-30 12:18:36 xymonlaunch: starting task [history]
2016-01-30 12:18:36 xymonlaunch: starting task [alert]
2016-01-30 12:18:36 xymonlaunch: starting task [clientdata]
2016-01-30 12:18:36 xymonlaunch: starting task [rrdstatus]
2016-01-30 12:18:36 xymonlaunch: starting task [rrddata]
2016-01-30 12:18:36 xymonlaunch: starting task [hostdata]
2016-01-30 12:18:36.888969 Releasing [xymonnet] from failure hold
2016-01-30 12:18:36 xymonlaunch: starting task [storestatus]
2016-01-30 12:19:57.503293 Task xymond terminated by signal 6
2016-01-30 12:19:57 xymonlaunch: starting task [xymond]
2016-01-30 12:19:58.062318 Task xymond terminated, status 1
2016-01-30 12:20:03 xymonlaunch: starting task [xymond]
2016-01-30 12:20:03.607835 Task xymond terminated, status 1
2016-01-30 12:20:08 xymonlaunch: starting task [xymond]
2016-01-30 12:20:09.154910 Task xymond terminated, status 1
2016-01-30 12:20:14 xymonlaunch: starting task [xymond]
2016-01-30 12:20:14.702550 Task xymond terminated, status 1
2016-01-30 12:20:19 xymonlaunch: starting task [xymond]
2016-01-30 12:20:20.247913 Task xymond terminated, status 1
2016-01-30 12:20:20.247966 Postponing restart of [xymond] for 600 seconds from last start due to multiple failures

--
Matt Vander Werf

list Japheth Cleaver · Sat, 30 Jan 2016 09:39:14 -0800 ·
Hi Matt,

The log lines you're seeing are actually from the new xymond process
trying to start up, then failing because the port is already in use. I
think the timeout right below it is from the previous process's signal
handler giving up, based on the timestamps.
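
The "port already in use" condition can be checked directly by attempting a bind. This is a minimal sketch under stated assumptions: the helper name `port_is_free` is made up for illustration, and 1984 is the listener port shown in the log above. It only reports whether a bind would succeed at that instant; it does not say which process holds the port.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # SO_REUSEADDR keeps lingering TIME_WAIT connections from being
        # reported as "in use"; an active listener still makes bind() fail.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    # 1984 is the Xymon listener port seen in the log output.
    print("port 1984 free:", port_is_free(1984))
```

If this reports the port as busy while the service is supposedly stopped, some xymond process is still holding the listener.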

Can you get a backtrace from xymond's core file? It should be left in
/var/lib/xymon/tmp/, or in the (*shudder*) systemd journal somewhere...

If your system is set not to keep them by default, add
''
export DAEMON_COREFILE_LIMIT="unlimited"
ulimit -c unlimited
''
to /etc/sysconfig/xymonlaunch
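
Once a core file has been written, a backtrace can usually be extracted with gdb in batch mode. The following is a minimal sketch, not a Xymon-provided tool: the helper names are hypothetical, the core file name is a placeholder, and /usr/sbin/xymond is assumed to be the binary path of the RPM install. Installing the matching debuginfo packages gives more readable frames.

```python
import shlex
import subprocess


def gdb_backtrace_command(binary, core):
    """Build a gdb command line that prints a backtrace and exits."""
    # --batch exits after running the -ex commands; "bt" prints the stack.
    return ["gdb", "--batch", "-ex", "bt", binary, core]


def read_backtrace(binary, core):
    """Run gdb and return its standard output as text (requires gdb)."""
    result = subprocess.run(
        gdb_backtrace_command(binary, core),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # "core.12345" is a placeholder; use the newest core file in the
    # directory where xymond leaves them.
    command = gdb_backtrace_command("/usr/sbin/xymond", "core.12345")
    print(shlex.join(command))
```

The newest core file by modification time is normally the one from the most recent crash.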

I suspect there might be something corrupted in the xymond checkpoint file.
First, do a 'service xymon stop' and make sure all xymon processes are
completely gone, including any xymond's still pending, then start xymon
back up. If it crashes again, do the same, but move the
/var/lib/xymon/xymond.chk checkpoint file out of the way after it's off,
and let it come back up.
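
When moving the checkpoint out of the way, renaming it rather than deleting it keeps the file available for later inspection. A minimal sketch, assuming the service is already fully stopped; the helper name `set_aside` and the ".bad" suffix are illustrative, and the checkpoint path should be taken from your own xymond options, since installations may differ.

```python
import shutil
import time
from pathlib import Path


def set_aside(path, suffix=None):
    """Rename a file to <name>.<suffix>.bad and return the new path.

    Returns None if the file does not exist. Run this only while the
    service is stopped, so nothing rewrites the file in the meantime.
    """
    source = Path(path)
    if not source.exists():
        return None
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    destination = source.with_name(f"{source.name}.{suffix}.bad")
    shutil.move(str(source), str(destination))
    return destination


# Example call (path taken from the message above, adjust as needed):
# set_aside("/var/lib/xymon/xymond.chk")
```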

If it *still* doesn't come up, there's something else going on. Either
way, a full backtrace will help let us see where exactly it's dying.


HTH,
-jc
list Matt Vander Werf · Sat, 30 Jan 2016 13:10:44 -0500 ·
Hi J.C.,

Moving the xymond.chk checkpoint file out of the way after it was stopped
seemed to fix this (at least so far).

I see that I lost all record of disabled tests (getting alerts for things
that were disabled).

Exactly what data did I lose by moving that checkpoint file out of
the way?

Is there any way to get the data back? Or maybe figure out what is corrupt
in the checkpoint file and then move the file back in place?

Also, see my most recent e-mail with the xymonlaunch log (if you haven't
already). Looks like this has happened in the past but resolved itself....

Regarding the backtrace....

I put those lines in /etc/sysconfig/xymonlaunch and I see the core files
being generated now.
I feel embarrassed to admit this, but how exactly do I get the backtrace
out of the binary core files, besides trying to read the files with an
editor? Any way to know which core file had the backtrace?

Also, I see this in journalctl:

Ignoring invalid environment assignment 'export
DAEMON_COREFILE_LIMIT=unlimited': /etc/sysconfig/xymonlaunch
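
That journal message suggests the file is being read by systemd as an environment file rather than sourced by a shell, in which case only plain KEY=VALUE lines are accepted and shell syntax such as `export` or `ulimit` is ignored. The sketch below is a rough approximation of that rule, not systemd's actual parser; the function name is hypothetical.

```python
import re

# A plain assignment: a variable name immediately followed by "=".
ASSIGNMENT_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*=")


def invalid_env_lines(text):
    """Return (line_number, line) pairs that are not KEY=VALUE assignments."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Blank lines and comment lines are skipped.
        if not stripped or stripped.startswith(("#", ";")):
            continue
        if not ASSIGNMENT_RE.match(stripped):
            problems.append((number, stripped))
    return problems


sample = (
    'export DAEMON_COREFILE_LIMIT="unlimited"\n'
    "ulimit -c unlimited\n"
    "DAEMON_COREFILE_LIMIT=unlimited\n"
)

for number, line in invalid_env_lines(sample):
    print(f"line {number}: {line}")
```

Under this reading, only the third sample line would be picked up as an environment assignment.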


Thanks for your help!!

--
Matt Vander Werf

list Matt Vander Werf · Sat, 30 Jan 2016 13:45:44 -0500 ·
Hi J.C.,

So it appears that only fixed it temporarily.

If I stop the service and start it back up again, it crashes again.

I think I figured out how to read the core file and get a backtrace for you.

Here's what I got from the most recent crash (with some host names
obfuscated):

[New LWP 13283]
Reading symbols from /usr/sbin/xymond...Reading symbols from
/usr/lib/debug/usr/sbin/xymond.debug...done.
done.
Missing separate debuginfo for
Try: yum --enablerepo='*debug*' install
/usr/lib/debug/.build-id/33/97b0d696701dbd7c09eb4bf023f7f4eebec9ed
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `xymond --restart=/var/lib/xymon/tmp/xymond.chk
--checkpoint-file=/var/lib/xymon'.
Program terminated with signal 6, Aborted.
#0  0x00007f570e29a5f7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install
glibc-2.17-106.el7_2.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64
krb5-libs-1.13.2-10.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64
libselinux-2.2.2-6.el7.x86_64 lz4-r131-1.el7.x86_64
openssl-libs-1.0.1e-51.el7_2.2.x86_64 pcre-8.32-15.el7.x86_64
xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) backtrace
#0  0x00007f570e29a5f7 in raise () from /lib64/libc.so.6
#1  0x00007f570e29bce8 in abort () from /lib64/libc.so.6
#2  0x00007f570f53cdf5 in sigsegv_handler (signum=<optimized out>) at
sig.c:57
#3  <signal handler called>
#4  0x00007f570f5403b4 in xtree_i_compare (pa=0x7ffead8cb9a0,
pb=0x2020202020202020) at tree.c:47
#5  0x00007f570e3574c0 in tfind () from /lib64/libc.so.6
#6  0x00007f570f5405d4 in xtreeFind (treehandle=<optimized out>,
key=key@entry=0x7f57142cb320 "*<client hostname>*") at tree.c:140
#7  0x00007f570f5386bd in get_clientconfig
(hostname=hostname@entry=0x7f57142cb320
"*<client hostname>*", hostclass=hostclass@entry=0x7f57208e4612 "linux",
    hostos=hostos@entry=0x7f57208e460c "linux") at clientlocal.c:192
#8  0x00007f570f535dec in do_message (msg=msg@entry=0x7f572064c300,
origin=origin@entry=0x7f570f550e97 "", can_respond=can_respond@entry=1) at
xymond.c:4955
#9  0x00007f570f5282c7 in main (argc=<optimized out>, argv=<optimized out>)
at xymond.c:6288
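
One detail in frame #4 stands out: `pb=0x2020202020202020` does not look like a heap address. Every byte is 0x20, the ASCII space character, which would be consistent with a tree node pointer having been overwritten by text. That is an observation from the trace, not a confirmed diagnosis. A small sketch for decoding such values (the helper name is illustrative):

```python
def pointer_as_ascii(value, width=8):
    """Show the bytes a pointer-sized integer would occupy in memory."""
    raw = value.to_bytes(width, "little")
    # Bytes outside the ASCII range are shown as replacement characters.
    return raw.decode("ascii", errors="replace")


# The pb argument from frame #4 of the backtrace.
print(repr(pointer_as_ascii(0x2020202020202020)))
```

The same helper can be applied to any other suspicious pointer value in a trace to see whether it is really string data.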


Is this what you wanted? Do you want me to install the debug package for
glibc or other packages?

Let me know what I can do.

Thanks!!

--
Matt Vander Werf

list Japheth Cleaver · Sat, 30 Jan 2016 14:46:44 -0800 ·
On Sat, January 30, 2016 10:45 am, Matt Vander Werf wrote:
Hi J.C.,

So it appears that only fixed it temporarily.

If I stop the service and start it back up again, it crashes again.

I think I figured out how to read the core file and get a backtrace for
you
(I think).

Here's what I got from the most recent crash (with some host names
obfuscated):

[New LWP 13283]
Reading symbols from /usr/sbin/xymond...Reading symbols from /usr/lib/debug/usr/sbin/xymond.debug...done.
done.
Missing separate debuginfo for
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/33/97b0d696701dbd7c09eb4bf023f7f4eebec9ed
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `xymond --restart=/var/lib/xymon/tmp/xymond.chk --checkpoint-file=/var/lib/xymon'.
Program terminated with signal 6, Aborted.
#0  0x00007f570e29a5f7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.13.2-10.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64 libselinux-2.2.2-6.el7.x86_64 lz4-r131-1.el7.x86_64 openssl-libs-1.0.1e-51.el7_2.2.x86_64 pcre-8.32-15.el7.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) backtrace
#0  0x00007f570e29a5f7 in raise () from /lib64/libc.so.6
#1  0x00007f570e29bce8 in abort () from /lib64/libc.so.6
#2  0x00007f570f53cdf5 in sigsegv_handler (signum=<optimized out>) at sig.c:57
#3  <signal handler called>
#4  0x00007f570f5403b4 in xtree_i_compare (pa=0x7ffead8cb9a0, pb=0x2020202020202020) at tree.c:47
#5  0x00007f570e3574c0 in tfind () from /lib64/libc.so.6
#6  0x00007f570f5405d4 in xtreeFind (treehandle=<optimized out>, key=key@entry=0x7f57142cb320 "*<client hostname>*") at tree.c:140
#7  0x00007f570f5386bd in get_clientconfig (hostname=hostname@entry=0x7f57142cb320 "*<client hostname>*", hostclass=hostclass@entry=0x7f57208e4612 "linux", hostos=hostos@entry=0x7f57208e460c "linux") at clientlocal.c:192
#8  0x00007f570f535dec in do_message (msg=msg@entry=0x7f572064c300, origin=origin@entry=0x7f570f550e97 "", can_respond=can_respond@entry=1) at xymond.c:4955
#9  0x00007f570f5282c7 in main (argc=<optimized out>, argv=<optimized out>) at xymond.c:6288


Is this what you wanted? Do you want me to install the debug package for
glibc or other packages?

Let me know what I can do.

Thanks!!
This works. It's strange in that it points to a problem with the
client-local configs, but I'm not sure how the tree would get into a
corrupt state.

Were any changes made recently to the client-local file? Any other errors
seen during xymond's startup that might seem related?

It's probably *not* an issue with a status message, if they're all
crashing at the same spot. This was an incoming client message that was
either garbled or accessing garbled data somehow.
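
One detail in frame #4 fits the "garbled data" reading: the second argument to xtree_i_compare is pb=0x2020202020202020, which does not look like a heap or stack address. A minimal Python sketch (the helper name is hypothetical) that decodes such a value as raw bytes shows it is eight ASCII spaces. That would be consistent with text having been written over a tree node pointer, although the backtrace alone does not show where that text came from.

```python
def pointer_as_text(value: int, width: int = 8) -> bytes:
    """Return the raw bytes of a pointer-sized integer (little-endian, as on x86_64)."""
    return value.to_bytes(width, byteorder="little")


pb = 0x2020202020202020  # value of pb in frame #4 of the backtrace
raw = pointer_as_text(pb)
print(raw)            # every byte is 0x20, the ASCII space character
print(raw.isspace())  # whether the whole value is whitespace
```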
quoted from Matt Vander Werf

--
Matt Vander Werf

On Sat, Jan 30, 2016 at 1:10 PM, Matt Vander Werf <user-dfc3cf2ca434@xymon.invalid>
wrote:
Hi J.C.,

Moving the xymond.chk checkpoint file out of the way after the service
was stopped seemed to fix this (at least so far).

I see that I lost all record of disabled tests (getting alerts for
things that were disabled).

What data exactly did I lose by moving that checkpoint file out of the
way?

Is there any way to get the data back? Or maybe figure out the
corruption in the checkpoint file and then move the file back in place?
There are several different bits in there, including scheduled tasks,
disable states, and the current status messages. You can manually copy the
file back at this point while xymond is off and it will load state back
from it (along with the old status messages, but they'll get overwritten
as soon as the next cycle comes through).
quoted from Matt Vander Werf

Also, see my most recent e-mail with the xymonlaunch log (if you haven't
already). Looks like this has happened in the past but resolved
itself....

Regarding the backtrace....

I put those lines in /etc/sysconfig/xymonlaunch and I see the core files
being generated now.
I feel embarrassed to admit this, but how exactly do I get the backtrace
out of the binary core files, besides trying to read the files with an
editor? Any way to know which core file had the backtrace?

Also, I see this in journalctl:

Ignoring invalid environment assignment 'export DAEMON_COREFILE_LIMIT=unlimited': /etc/sysconfig/xymonlaunch
Ugh. systemd :( I forgot that that's not a real shell file any more. Looks
like you found a way though!
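
For anyone else landing here with the same question about core files: each core file holds the state of one crash, so every file yields its own backtrace. A minimal sketch, assuming gdb is installed and the cores sit in a single directory (the directory, the glob pattern, and both function names are placeholders, not anything Xymon provides), that builds one non-interactive gdb command per core:

```python
import shlex
from pathlib import Path
from typing import List


def gdb_backtrace_cmd(binary: str, core: str) -> List[str]:
    """Build a gdb command that prints a backtrace and exits without prompting."""
    return ["gdb", "--batch", "-ex", "backtrace", binary, core]


def commands_for_cores(binary: str, core_dir: str, pattern: str = "core*") -> List[List[str]]:
    """Return one gdb command per core file found in core_dir, oldest first."""
    cores = sorted(Path(core_dir).glob(pattern), key=lambda p: p.stat().st_mtime)
    return [gdb_backtrace_cmd(binary, str(c)) for c in cores]


if __name__ == "__main__":
    # /usr/sbin/xymond is the binary named in the gdb output earlier in the
    # thread; "." is a placeholder for wherever the core files are written.
    for cmd in commands_for_cores("/usr/sbin/xymond", "."):
        print(shlex.join(cmd))
```

The printed commands can be pasted into a shell; installing the matching debuginfo packages first makes the frames more readable.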


-jc
list Matt Vander Werf · Sat, 30 Jan 2016 18:05:13 -0500 ·
Hi J.C.,

No,


--
Matt Vander Werf

list Matt Vander Werf · Sat, 30 Jan 2016 18:32:07 -0500 ·
Oops... somehow sent too soon there...

No, I haven't made any recent changes to client-local.cfg. I don't
actually use that config for anything.

It seems to work just fine when you're starting off with no xymond.chk file
(like when the file is moved out of the way), but once the service gets
restarted (or stopped and started), then the crashes start again and it
becomes basically unusable. So maybe it has to do with reading the current
state from the xymond.chk file? Or loading all the statuses?

It seems to load all the statuses and then tries to set up a network
listener and then crashes.

No, I'm not seeing any other error messages from xymond's startup that
would seem related. Just that "Cannot bind to listen socket (Address
already in use)" you saw earlier when it crashes.

Are you saying I could pull data from the old xymond.chk file and manually
put it in the current xymond.chk file when xymond is stopped? Or?

Any other ideas? I'm sort of in a rut here...  :/ Not entirely sure what I
can do to get my Xymon instance working again..

Any other details I can provide that might shine a light on this issue?

Thanks!!

--
Matt Vander Werf

list Japheth Cleaver · Sun, 31 Jan 2016 09:16:16 -0800 ·
quoted from Matt Vander Werf
On Sat, January 30, 2016 3:32 pm, Matt Vander Werf wrote:
Oops... somehow sent too soon there...

No, I haven't made any recent changes to client-local.cfg. I don't
actually use that config for anything.

It seems to work just fine when you're starting off with no xymond.chk
file
(like when the file is moved out of the way), but once the service gets
restarted (or stopped and started), then the crashes start again and it
becomes basically unusable. So maybe it has to do with reading the current
state from the xymond.chk file? Or loading all the statuses?

It seems to load all the statuses and then tries to set up a network
listener and then crashes.

This is most likely the previous xymond instance still taking the network
port. After startup, xymond never re-reads the checkpoint file. If crashes
are eventually occurring even after it's started up without a checkpoint
file in place then whatever it is is occurring "live" and it's not the
checkpoint itself that's the problem.
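
A quick way to test the "previous instance still holds the port" theory before starting the service is to try binding the port yourself. This is only a sketch: the function name is made up, and because it does not set SO_REUSEADDR it is deliberately conservative, so lingering TIME_WAIT sockets can also make it report the port as busy.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a plain TCP socket can bind host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: something else is still bound to the port.
            return False
        return True


if __name__ == "__main__":
    # 1984 is the listener port shown in the xymond log above.
    print("1984 is free" if port_is_free(1984) else "1984 is still held")
```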
quoted from Matt Vander Werf

No, I'm not seeing any other error messages from xymond's startup that
would seem related. Just that "Cannot bind to listen socket (Address
already in use)" you saw earlier when it crashes.

Are you saying I could pull data from the old xymond.chk file and manually
put it in the current xymond.chk file when xymond is stopped? Or?
This is correct. The checkpoint file is a simple text file written out.
Actually, it might be worth a quick scan with grep or just eyeballing a
'cat' to see if there's any obviously corrupt data in there. Initial raw
messages are not binary safe, and are decompressed by xymond if needed
before internal processing, so everything there should be plain. If you
see binary garbage, something unusual has happened.
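
As an alternative to eyeballing 'cat' output, a rough sketch of that scan: it reads the file as bytes and reports lines containing anything outside printable ASCII. The function name and the limit parameter are illustrative only, and legitimate non-ASCII text (UTF-8 hostnames or messages, for instance) would be flagged too, so treat hits as places to look rather than proof of corruption.

```python
import string

# Byte values considered "plain": ASCII letters, digits, punctuation, whitespace.
PRINTABLE = set(string.printable.encode("ascii"))


def suspicious_lines(path: str, limit: int = 20):
    """Yield (line_number, bad_byte_count) for lines holding non-printable bytes."""
    found = 0
    with open(path, "rb") as fh:
        for lineno, line in enumerate(fh, start=1):
            bad = sum(1 for b in line if b not in PRINTABLE)
            if bad:
                yield lineno, bad
                found += 1
                if found >= limit:  # stop early on badly damaged files
                    return
```

Run it against a copy of the checkpoint file (the gdb output above shows it at /var/lib/xymon/tmp/xymond.chk on this install), for example `for n, bad in suspicious_lines("xymond.chk"): print(n, bad)`.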
quoted from Matt Vander Werf
Any other ideas? I'm sort of in a rut here...  :/ Not entirely sure what I
can do to get my Xymon instance working again..

Any other details I can provide that might shine a light on this issue?
- Can you send a copy of your client-local.cfg? Or, if not using it much,
revert it to the standard one?

- When did the issue first start?

- Based on the single backtrace, there's something strange about a client
record being pulled in, or an underlying issue with posix btrees, and/or
memory management.

Are all the crashes occurring at the same area? If so, for the same client
host message/report?

Is the main xymond server under any sort of memory pressure, or has there
been a recent glibc update or change in libraries that might require a
reboot to fully take effect?


-jc
list Matt Vander Werf · Sun, 31 Jan 2016 15:00:19 -0500 ·
Hi J.C.,

First of all, thanks for your continued assistance on this! It's greatly
appreciated!! :)

While looking to see whether any clients were common to the different
crash backtraces, I think I may have figured out the issue here.

I've been trying to confirm that this is in fact the cause of the crashing
(which is why you haven't heard back from me for so long), and I think it's
safe to say that this was the cause of the crashing.

I noticed one client in particular that was showing up in several of the
core dump backtraces (let's call this Client A). This client stood out to
me because I've been seeing it often in error messages from xymond
regarding oversized client messages and truncated status messages (see PS
below).
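
For reference, the comparison across backtraces can be done mechanically. A small sketch, assuming each backtrace has been saved as text and contains a get_clientconfig frame like the one posted earlier (the function name is made up for illustration):

```python
import re
from collections import Counter

# Matches frame arguments such as: hostname=hostname@entry=0x7f57142cb320 "somehost"
# Some mail archives render '@' as ' at ', so both spellings are accepted.
HOST_ARG = re.compile(r'hostname=hostname(?:@| at )entry=0x[0-9a-f]+\s+"([^"]+)"')


def hosts_in_backtraces(backtraces):
    """Count how many backtraces mention each hostname as a frame argument."""
    counts = Counter()
    for text in backtraces:
        # set(): count a host once per backtrace, even if several frames show it
        counts.update(set(HOST_ARG.findall(text)))
    return counts
```

Calling `hosts_in_backtraces(texts).most_common(5)` then lists the hosts that appear in the most crashes.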

So, I went to Client A and stopped the xymon-client service. Then I went
and took out all entries for Client A from the checkpoint file (after
stopping the xymon service, of course).

Started up the xymon service again and voila...no crashing! I confirmed
this by starting up the xymon-client service on Client A again and then
restarting the xymon server service and after a short while, it crashed (as
expected).
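
In case it helps someone repeat the checkpoint cleanup, here is a rough sketch of the removal step. It rests on an assumption about the file layout that should be verified against your own xymond.chk first: one record per line, fields separated by '|', with the hostname stored as a complete field. The function name is hypothetical, and it should only be run while the xymon service is stopped.

```python
import os
import shutil


def drop_host_records(chk_path: str, hostname: str) -> int:
    """Rewrite chk_path without records for hostname; return how many were removed.

    ASSUMPTION: one record per line, '|' as the field separator, and the
    hostname present as a whole field. Check this against the real file.
    """
    backup = chk_path + ".bak"
    shutil.copy2(chk_path, backup)  # keep the original for rollback
    needle = hostname.encode()
    removed = 0
    tmp = chk_path + ".new"
    with open(backup, "rb") as src, open(tmp, "wb") as dst:
        for line in src:
            if needle in line.rstrip(b"\r\n").split(b"|"):
                removed += 1  # exact field match only, so "hosta" skips "hosta.example"
            else:
                dst.write(line)
    os.replace(tmp, chk_path)  # swap the filtered file into place
    return removed
```

Matching on whole fields rather than substrings avoids dropping records for hosts whose names merely start with the same characters.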

Now, while the client messages were very strange and large and unusual,
there definitely wasn't any binary data. I will be sending you the
problematic entries from the checkpoint file in a separate e-mail (as I'd
rather not send the contents to the entire list...). Hopefully you can make
sense of them...

I'm not sure what in the client entries made it crash here, or if it was
supposed to crash at all. Could it really have been caused by overly
large client/status messages?


Do you know if it's possible to address this so it doesn't crash in the
future? For now, I'm just going to have to keep the xymon-client service
turned off, at least until the client in question calms down.


(PS: I can't really control what is done on my client machines very much,
so I was more or less ignoring the oversized client and status messages
from the client in question (knowing they eventually would go away).
I already tried increasing the MAXMSG size values, but I didn't want to
have to increase them as high as I would have needed to to satisfy the
client in question. I never thought that they would actually ever cause
Xymon to crash...)


Anyways, thanks as always for your excellent help J.C.!!

--
Matt Vander Werf
