Xymon Mailing List Archive

xymonnet - fatal signal caught

13 messages in this thread

list Wallace Barrow · Tue, 14 Oct 2014 11:21:14 -0500 ·
Hello,

This is my first post to the mailing list and I was hoping to get some
help with an issue I am having. 

My xymon server has been running great with no issues for months,
version xymon-server-4.3.17_3 on FreeBSD 10.

This morning xymonnet started crashing. Here is the output from the
xymonlaunch.log file:

06:45:07 Task xymonnet terminated by signal 6
06:50:11 Task xymonnet terminated by signal 6
06:55:11 Task xymonnet terminated by signal 6
07:00:12 Task xymonnet terminated by signal 6
07:05:13 Task xymonnet terminated by signal 6
07:10:14 Task xymonnet terminated by signal 6
07:20:16 Task xymonnet terminated by signal 6

Server was rebooted:

07:26:48 xymonlaunch starting
07:26:48 Loading tasklist configuration from (location of file here)
07:26:48 Loading hostnames
07:26:49 Loading saved state
07:26:49 Setting up network listener on 0.0.0.0:1984
07:26:49 Setting up signal handlers
07:26:49 Setting up xymond channels
07:26:49 Setting up logfiles
07:31:02 Task xymonnet terminated by signal 6
07:36:04 Task xymonnet terminated by signal 6
07:41:09 Task xymonnet terminated by signal 6
07:46:11 Task xymonnet terminated by signal 6

Before it started crashing there were no changes (in xymon config files)
or updates made. We add new hosts and checks every few days. 

I checked the other logs for xymon and things look OK. Other areas of
xymon work as well.

Maybe I am overlooking a log file and missing something obvious about why
it keeps crashing. Any thoughts on where to look next?

Thank you,

--
list Mark Felder · Tue, 14 Oct 2014 11:51:33 -0500 ·
On Tue, Oct 14, 2014, at 11:21, Wallace Barrow wrote:
> Maybe I am overlooking a log file and missing something obvious about why
> it keeps crashing. Any thoughts on where to look next?

When it crashes does it create core.* files in Xymon's tmp directory? I
think that's /usr/local/www/xymon/server/tmp

Perhaps it's related to this post:

http://lists.xymon.com/archive/2014-February/039058.html

If there's a core file could you paste the backtrace info?

# cd /usr/local/www/xymon/server
# gdb bin/xymonnet tmp/core.64739   (example core file name)

-- GDB informational output will be here --

(gdb) bt    <-- type "bt" at the prompt to get the backtrace output and
then copy/paste it back to the list
list Wallace Barrow · Tue, 14 Oct 2014 12:04:59 -0500 ·
Core file used: Oct 14 11:49 xymonnet.core

A new core file is generated every 10 minutes (overwriting the previous one).

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you
are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for
details.
This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `xymonnet'.
Program terminated with signal 6, Aborted.
Reading symbols from /usr/local/lib/libcares.so.2...done.
Loaded symbols for /usr/local/lib/libcares.so.2
Reading symbols from /usr/lib/libssl.so.7...done.
Loaded symbols for /usr/lib/libssl.so.7
Reading symbols from /lib/libcrypto.so.7...done.
Loaded symbols for /lib/libcrypto.so.7
Reading symbols from /usr/local/lib/libpcre.so.3...done.
Loaded symbols for /usr/local/lib/libpcre.so.3
Reading symbols from /lib/libc.so.7...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /lib/libthr.so.3...done.
Loaded symbols for /lib/libthr.so.3
Reading symbols from /usr/local/lib/nss_ldap.so.1...done.
Loaded symbols for /usr/local/lib/nss_ldap.so.1
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1
#0  0x0000000801457e1a in kill () from /lib/libc.so.7
[New Thread 802006400 (LWP 100281/xymonnet)]

(gdb) bt
#0  0x0000000801457e1a in kill () from /lib/libc.so.7
#1  0x0000000801456ac9 in abort () from /lib/libc.so.7
#2  0x000000000041c051 in sigsegv_handler (signum=<value optimized out>)
at sig.c:57
#3  <signal handler called>
#4  0x000000000041028a in dns_simple_callback (arg=0x80259adc0,
status=<value optimized out>, timeout=0, hent=0x8021221a0) at dns.c:120
#5  0x0000000800851fe0 in ares_gethostbyname_file () from
/usr/local/lib/libcares.so.2
#6  0x0000000800851ec1 in ares_gethostbyname_file () from
/usr/local/lib/libcares.so.2
#7  0x000000080085f4f7 in ares_search () from
/usr/local/lib/libcares.so.2
#8  0x000000080085f0e6 in ares_search () from
/usr/local/lib/libcares.so.2
#9  0x000000080085e858 in ares_query () from
/usr/local/lib/libcares.so.2
#10 0x000000080085c5b2 in ares_process_fd () from
/usr/local/lib/libcares.so.2
#11 0x000000080085ded6 in ares_process_fd () from
/usr/local/lib/libcares.so.2
#12 0x000000080085d5ca in ares_process_fd () from
/usr/local/lib/libcares.so.2
#13 0x000000080085bb54 in ares_process () from
/usr/local/lib/libcares.so.2
#14 0x000000080085badf in ares_process () from
/usr/local/lib/libcares.so.2
#15 0x000000000041010b in dns_ares_queue_run (channel=0x802176000) at
dns.c:172
#16 0x0000000000409d15 in main (argc=7, argv=0x7fffffffb7a0) at
xymonnet.c:2305
(gdb) 
list Wallace Barrow · Tue, 14 Oct 2014 17:53:24 -0500 ·
So far, adding --no-ares to tasks.cfg has let the program start, and
things seem to be working.
list Mark Felder · Tue, 14 Oct 2014 18:27:25 -0500 ·
On Tue, Oct 14, 2014, at 17:53, Wallace Barrow wrote:
> So far, adding --no-ares to tasks.cfg has let the program start, and
> things seem to be working.

I see that this has come up several times before. In a 2007 thread,
Henrik mentioned that this can happen randomly and that he was unsure why.
Do we have any better ideas on how to debug this problem? It would be nice
to come up with a permanent solution for everyone so this doesn't catch
people off guard.
list Jeremy Laidman · Wed, 15 Oct 2014 13:00:17 +1100 ·
On 15 October 2014 10:27, Mark Felder <user-db141d317836@xymon.invalid> wrote:
> I see that this has come up several times before. In a 2007 thread,
> Henrik mentioned that this can happen randomly and that he was unsure why.
> Do we have any better ideas on how to debug this problem? It would be nice
> to come up with a permanent solution for everyone so this doesn't catch
> people off guard.

Being a random problem makes it difficult to track down the fault, but if
we have a willing participant that can reproduce the fault, we might be
able to make progress.  From reading the code, it looks like the ARES
library is resolving with success, but when xymonnet is copying the
resolved address into its own data structure, it fails to copy.  The
troublesome line 120 is:

memcpy(&dnsc->addr, *(hent->h_addr_list), sizeof(dnsc->addr));

(From my poor knowledge of C) some problems that can arise here are:
a) dnsc->addr or dnsc is null
b) hent->h_addr_list or hent is null
c) dnsc->addr is larger than the address that hent->h_addr_list points to
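
The three cases above can each be tested before the copy. A minimal sketch, assuming a hypothetical struct dns_result and helper copy_first_addr() that are not part of xymonnet; only the addr member and the memcpy call are taken from line 120:

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <string.h>

/* Hypothetical stand-in for xymonnet's per-lookup record. */
struct dns_result {
    struct in_addr addr;
};

/* Copies the first resolved address into dnsc.
 * Returns 0 on success, -1 if a precondition for the copy fails. */
int copy_first_addr(struct dns_result *dnsc, const struct hostent *hent)
{
    /* Cases (a) and (b): a null destination or a null result. */
    if (dnsc == NULL || hent == NULL)
        return -1;

    /* Case (b): a missing or empty address list. */
    if (hent->h_addr_list == NULL || hent->h_addr_list[0] == NULL)
        return -1;

    /* Case (c): the source address is shorter than the destination. */
    if (hent->h_length < 0 || (size_t)hent->h_length < sizeof(dnsc->addr))
        return -1;

    memcpy(&dnsc->addr, hent->h_addr_list[0], sizeof(dnsc->addr));
    return 0;
}
```

A caller would log and skip the lookup when the helper returns -1 instead of letting memcpy fault.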

Perhaps we need to see the values of these.  Wallace, can you recompile
after inserting these lines immediately before line 120:

dbgprintf("ARES host=%s\n", hent->h_name);
dbgprintf("ARES status=%d name=%s\n", status, dnsc->name);
dbgprintf("ARES addr size=%d\n", sizeof(dnsc->addr));
dbgprintf("ARES addr hex=%#lx\n", dnsc->addr);
dbgprintf("ARES addr ascii=%s\n", inet_ntoa(dnsc->addr));

Assuming this compiles correctly for you (it did for me), back up the old
xymonnet, and copy the newly compiled one into place.  Then wait for a core
dump, and see what's in the logs.

Warning: This might break your monitoring, so you might not want to use
this on a production system, depending on your stability requirements.

Alternatively, you might see if you can reproduce the problem by running
the xymonnet binary manually, something like this:

xymonnet --debug --no-update name.of.server

If this dumps core, then you should be able to manually run the new binary
in the same way, and check the log output for our debug statements.

J
list Mark Felder · Wed, 15 Oct 2014 07:38:54 -0500 ·
On Tue, Oct 14, 2014, at 21:00, Jeremy Laidman wrote:
> Perhaps we need to see the values of these.  Wallace, can you recompile
> after inserting these lines immediately before line 120:
>
> dbgprintf("ARES host=%s\n", hent->h_name);
> dbgprintf("ARES status=%d name=%s\n", status, dnsc->name);
> dbgprintf("ARES addr size=%d\n", sizeof(dnsc->addr));
> dbgprintf("ARES addr hex=%#lx\n", dnsc->addr);
> dbgprintf("ARES addr ascii=%s\n", inet_ntoa(dnsc->addr));

Are you talking about line 120 in xymonnet/dns.c? It's not obvious
which file you're talking about in the xymonnet source code. It would be
clearer if you provided a diff in the future.
list Mark Felder · Wed, 15 Oct 2014 07:46:02 -0500 ·
It does appear Jeremy meant dns.c. I'm providing a diff -- it does
compile as Jeremy said.
Attachments (1)
list Mark Felder · Tue, 23 Dec 2014 09:19:48 -0600 ·
I am now hitting this after upgrading to 4.3.18. I've tried building
against the supplied c-ares instead of the system c-ares (1.10.0 on
FreeBSD), to no effect.

vm# gdb ../bin/xymonnet xymonnet.core
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you
are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for
details.
This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `xymonnet'.
Program terminated with signal 6, Aborted.
Reading symbols from /usr/lib/libssl.so.7...done.
Loaded symbols for /usr/lib/libssl.so.7
Reading symbols from /lib/libcrypto.so.7...done.
Loaded symbols for /lib/libcrypto.so.7

Reading symbols from /usr/local/lib/libpcre.so.1...done.
Loaded symbols for /usr/local/lib/libpcre.so.1
Reading symbols from /lib/libc.so.7...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /lib/libthr.so.3...done.
Loaded symbols for /lib/libthr.so.3
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1

#0  0x0000000801266a1a in kill () from /lib/libc.so.7
[New Thread 801c06400 (LWP 100117/xymonnet)]
(gdb) bt
#0  0x0000000801266a1a in kill () from /lib/libc.so.7
#1  0x0000000801265149 in abort () from /lib/libc.so.7
#2  0x0000000000423f01 in sigsegv_handler (signum=<value optimized out>)
at sig.c:57
#3  <signal handler called>
#4  0x0000000801262f8b in strlen () from /lib/libc.so.7
#5  0x0000000000425218 in addtobuffer_many (buf=0x801c19830) at
strfunc.c:140
#6  0x000000000041199c in display_rr (aptr=<value optimized out>,
abuf=<value optimized out>, alen=<value optimized out>,
response=0x801cd9100) at dns2.c:475
#7  0x00000000004111f3 in dns_detail_callback (arg=0x801cd9100,
status=<value optimized out>, timeouts=<value optimized out>,
abuf=0x7fffffffb2f0 "▒▒\205", alen=130) at dns2.c:268
#8  0x000000000041814d in search_callback ()
#9  0x0000000000417b7e in qcallback ()
#10 0x0000000000417113 in end_query ()
#11 0x000000000041788e in process_answer ()
#12 0x00000000004168c7 in processfds ()
#13 0x000000000041067b in dns_ares_queue_run (channel=0x801d07000) at
dns.c:172
#14 0x0000000000410b23 in dns_test_server (serverip=<value optimized
out>, hostname=0x801c19744 "feld.me", banner=0x801c19750) at dns.c:341
#15 0x000000000040a694 in main (argc=4, argv=0x7fffffffc990) at
xymonnet.c:1049


Even with xymonnet --no-ares it still crashes. That option does not seem
to turn off the c-ares code path entirely, as c-ares is still mentioned in
the backtrace.

Does anyone have any clue what is causing this? Frame #7 of the backtrace
suggests it may be choking on some Unicode character, but I don't have any
Unicode characters in DNS... If this is really an intermittent problem
with c-ares, it would be wise to consider an alternative that is under
heavy development, such as libasr:
https://www.opensmtpd.org/announces/libasr-1.0.0.txt
list Mark Felder · Mon, 05 Jan 2015 08:17:14 -0600 ·
Another FreeBSD user hit this and helped track this down. This crash is
specific to 4.3.18 but affects all platforms.

http://sourceforge.net/p/xymon/code/7484/tree//branches/4.3.18/xymonnet/dns2.c?diff=516c17fd34309d2eb14bcb64:7483

addtobuffer_many() is a variadic function whose argument list must end
with a NULL. The call in case T_AAAA was missing that terminator.

This patch fixes the crash caused by any DNS checks which return AAAA
records.

https://svnweb.freebsd.org/ports/head/net-mgmt/xymon-server/files/patch-xymonnet_dns2.c?view=markup&pathrev=376300
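
To illustrate why the missing NULL crashes in strlen(), here is a minimal sketch of a NULL-terminated variadic append. The function append_many() is a hypothetical stand-in written for this example; it is not Xymon's addtobuffer_many() and does not use Xymon's buffer type.

```c
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a NULL-terminated variadic append.
 * Appends each string argument to dst until the NULL sentinel is seen
 * or the next string would not fit. Returns the resulting length. */
size_t append_many(char *dst, size_t dstsz, ...)
{
    va_list ap;
    char *s;
    size_t used = strlen(dst);

    va_start(ap, dstsz);
    while ((s = va_arg(ap, char *)) != NULL) {
        /* Without a sentinel, va_arg returns whatever follows the last
         * real argument, and this strlen() reads through a garbage pointer. */
        size_t n = strlen(s);
        if (used + n >= dstsz)
            break;                      /* keep room for the trailing NUL */
        memcpy(dst + used, s, n + 1);   /* copy including the NUL */
        used += n;
    }
    va_end(ap);
    return used;
}
```

A correct call ends with the sentinel, for example append_many(buf, sizeof buf, "AAAA ", addr, "\n", (char *)NULL). Dropping the final argument still compiles, because the compiler cannot check the length of a variadic argument list, and the function then walks past the real arguments. That is consistent with the strlen() frame at the top of the 4.3.18 backtrace.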
list Wallace Barrow · Mon, 05 Jan 2015 16:03:20 -0600 ·
No issues for me so far after updating. Thanks!
list Laurent Royer · Sun, 25 Jan 2015 02:36:26 +0000 ·
Well,

The "Task xymonnet terminated by signal 6" message appears every five minutes.

You should have a "delayred=dns:5" in the hosts.cfg, perhaps combined with a downtime.

I had the same problem, and that resolved it.

Regards

Laurent Royer
list Jeremy Laidman · Sat, 7 Feb 2015 08:34:53 +1100 ·
On 25 January 2015 at 13:36, Laurent ROYER <user-15b052bbd14c@xymon.invalid> wrote:
> The "Task xymonnet terminated by signal 6" message appears every five minutes.

What version of Xymon are you running?  There is a known bug in 4.3.18 that
was fixed in January.

> You should have a "delayred=dns:5" in the hosts.cfg, perhaps combined with
> a downtime.

Interesting work-around.

In the past, this problem has been attributed to the use of the ARES DNS
library.  Some have had success adding "--no-ares" to the xymonnet command
line in tasks.cfg.

Cheers
Jeremy