Scheduled disable causes crash?

7 messages in this thread

list Johan Sjöberg · Tue, 5 Oct 2010 11:09:32 +0200 ·
Hi!

During the last month, we have had some problems with Xymon when using scheduled disables (added from the web interface).

The first problem we had was on September 17th, when hobbitd crashed while/after running the scheduled disable. We got the following error in hobbit.log:
2010-09-17 05:00:00 Fatal error in select: Bad file descriptor
2010-09-17 05:00:00 Setup complete


After that incident, I enabled verbose logging for hobbitd, and turned up the logging for some other processes as well. This morning at 06:00 we had a new scheduled disable. This time hobbitd just stopped logging after running the disable. The web interface did not work correctly: when clicking a test, an "Internal server error" message was displayed. Also, the bbgen test went purple (last update at 05:59:26).

The only errors I have been able to find in the logs are in bb-display.log:
2010-10-05 06:00:26 xstrdup: Cannot dup NULL string
2010-10-05 06:01:26 xstrdup: Cannot dup NULL string
2010-10-05 06:02:27 xstrdup: Cannot dup NULL string
2010-10-05 06:12:30 xstrdup: Cannot dup NULL string
2010-10-05 06:13:35 xstrdup: Cannot dup NULL string
2010-10-05 06:14:39 xstrdup: Cannot dup NULL string
2010-10-05 06:24:40 xstrdup: Cannot dup NULL string
2010-10-05 06:25:41 xstrdup: Cannot dup NULL string
2010-10-05 06:26:42 xstrdup: Cannot dup NULL string
2010-10-05 06:36:47 xstrdup: Cannot dup NULL string
2010-10-05 06:37:48 xstrdup: Cannot dup NULL string
2010-10-05 06:39:54 xstrdup: Cannot dup NULL string
2010-10-05 06:40:54 xstrdup: Cannot dup NULL string
2010-10-05 06:41:54 xstrdup: Cannot dup NULL string

Xymon was restarted at 06:43.
Here is the hobbit.log from that period; it was too large to paste into the mail:
http://pastebin.com/JuU0BHje

We are running Xymon 4.2.3 on CentOS 5.

Best regards,
Johan Sjöberg
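
The "Fatal error in select: Bad file descriptor" message above is what select() reports (errno EBADF) when one of the descriptors in its sets is not open. The following is a minimal sketch of that behaviour only, not hobbitd's code; the function names `select_errno_for_fd` and `select_on_closed_fd` are made up for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

/* Put `fd` in a read set and poll it with a zero timeout (never blocks).
 * Returns 0 if select() succeeded, otherwise the errno it left behind. */
int select_errno_for_fd(int fd)
{
    fd_set readfds;
    struct timeval tv = { 0, 0 };

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    if (select(fd + 1, &readfds, NULL, NULL, &tv) < 0)
        return errno;
    return 0;
}

/* Open a pipe, close its read end, then ask select() to watch the
 * stale descriptor number. Returns the errno from select(),
 * or -1 if the pipe could not be created. */
int select_on_closed_fd(void)
{
    int fds[2];
    int err;

    if (pipe(fds) != 0)
        return -1;

    close(fds[0]);                      /* fds[0] is now a stale number */
    err = select_errno_for_fd(fds[0]);  /* select() rejects it */
    close(fds[1]);
    return err;
}
```

Comparing the result of `select_on_closed_fd()` with `EBADF` shows the failure mode: a daemon that keeps a descriptor in its select() sets after something else has closed it will hit this error on its next loop iteration.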
list Johan Sjöberg · Wed, 20 Oct 2010 12:32:28 +0200 ·
Hi.

This happened again yesterday morning, and this time we found that core dumps had been created. Here is what gdb tells us about them. Does anyone have an idea of what might be causing this problem? The Xymon server is running CentOS 5.5 32-bit.

[root@mon01 acks]# file core.28581
core.28581: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'bbgen'

[root@mon01 acks]# gdb /usr/local/xymon/server/bin/bbgen core.28581
Reading symbols from /usr/local/xymon/server/bin/bbgen...done.

warning: .dynamic section for "/lib/libc.so.6" is not at the expected address

warning: difference appears to be caused by prelink, adjusting expectations
Reading symbols from /lib/libpcre.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpcre.so.0
Reading symbols from /lib/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/libpthread.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Core was generated by `bbgen --recentgifs --subpagecolumns=2 --report'.
Program terminated with signal 6, Aborted.
#0 0x00b2f402 in __kernel_vsyscall ()

/Johan
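
Signal 6 is SIGABRT, which is what a process receives when it calls abort(). One plausible reading, not confirmed anywhere in this thread, is that the "xstrdup: Cannot dup NULL string" lines in bb-display.log and these bbgen cores come from the same place: a checked string-duplication helper that logs and aborts when handed a NULL pointer. A minimal sketch of such a helper follows; `checked_strdup` is a hypothetical name and this is not Xymon's source.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical checked strdup: a NULL argument is treated as a
 * programming error. The helper logs it and calls abort(), which
 * raises SIGABRT (signal 6) and leaves a core file when core dumps
 * are enabled. */
char *checked_strdup(const char *s)
{
    size_t len;
    char *copy;

    if (s == NULL) {
        fprintf(stderr, "checked_strdup: Cannot dup NULL string\n");
        abort();
    }

    len = strlen(s) + 1;            /* include the terminating NUL */
    copy = malloc(len);
    if (copy == NULL) {
        fprintf(stderr, "checked_strdup: Out of memory\n");
        abort();
    }
    memcpy(copy, s, len);
    return copy;
}
```

If that reading is right, a backtrace from the core (the `bt` command in gdb) would show which caller passed the NULL string.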
list Johan Sjöberg · Tue, 23 Nov 2010 08:11:48 +0100 ·
Hi.

This happened again this morning at 03:00 when we had a scheduled disable. When Xymon stops working, it generates a core file every 2 or 3 minutes in /usr/local/xymon/data/acks.

They all look like this (but with a different address in the "#0  0x0076d402 in __kernel_vsyscall ()" line):

[root@mon01 acks]# gdb /usr/local/xymon/server/bin/bbgen core.9464
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.2)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/local/xymon/server/bin/bbgen...done.
Reading symbols from /lib/libpcre.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpcre.so.0
Reading symbols from /lib/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/libpthread.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Core was generated by `bbgen --recentgifs --subpagecolumns=2 --report'.
Program terminated with signal 6, Aborted.

#0  0x0076d402 in __kernel_vsyscall ()

Do you have any idea what might be causing this, or how we can proceed to try to find out more about the problem?

/Johan
list Henrik Størner · Mon, 6 Dec 2010 11:41:05 +0000 (UTC) ·
In <user-bf372f430ba3@xymon.invalid> Johan Sjöberg <user-74c177c1220d@xymon.invalid> writes:
During the last month, we have had some problems with Xymon when using
scheduled disables (added from the web interface).
The first problem we had was on September 17th, when hobbitd crashed
while/after running the scheduled disable. We got the following error in
hobbit.log
2010-09-17 05:00:00 Fatal error in select: Bad file descriptor
2010-09-17 05:00:00 Setup complete

There is a bug lurking in the scheduled-task code, but I haven't been
able to quite nail down where it is. I've seen the same problem that
you have a couple of times, where a scheduled "disable" results in
xymond (hobbitd) crashing immediately afterwards.

One potential bug I did catch is fixed with the following patch:

Index: xymond/xymond.c
===================================================================
--- xymond/xymond.c	(revision 6604)
+++ xymond/xymond.c	(working copy)
@@ -3971,7 +3971,7 @@
 	if (msg->doingwhat == RESPONDING) {
 		shutdown(msg->sock, SHUT_RD);
 	}
-	else {
+	else if (msg->sock >= 0) {
 		shutdown(msg->sock, SHUT_RDWR);
 		close(msg->sock);
 		msg->sock = -1;
@@ -5040,6 +5040,8 @@
 					swalk = swalk->next;
 
 					memset(&task, 0, sizeof(task));
+					task.sock = -1;
+					task.doingwhat = NOTALK;
 					inet_aton(runtask->sender, (struct in_addr *) &task.addr.sin_addr.s_addr);
 					task.buf = task.bufp = runtask->command;
 					task.buflen = strlen(runtask->command); task.bufsz = task.buflen+1;


So it would be interesting to see if this helps in your setup. This patch
is against the current beta-3 code, but it applies to version 4.2.3 as
well if you run patch and explicitly tell it which file to patch:

   patch hobbit-4.2.3/hobbitd/hobbitd.c < task.patch


I am not sure if this fixes the problem, though. Because if this is
what causes the crash, then it ought to happen before the log message
that the task ran is written. Unless the bug doesn't crash the system
right away, but only triggers some memory corruption that results in
a later crash ...


Regards,
Henrik
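
The patch above marks the scheduled-task record as having no socket and makes the cleanup path skip records without one. Why that can matter is sketched below: after a plain memset(), an integer descriptor field holds 0, which is a valid descriptor number, so an unguarded cleanup would shut down and close descriptor 0. Whether that is what actually happened in hobbitd is not established in this thread. `struct conn` and the three functions are hypothetical stand-ins, not the real xymond types.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for the connection record; only `sock` matters. */
struct conn {
    int sock;
};

/* Pre-patch style: zero the record and nothing else. This leaves
 * sock == 0, which looks like an open descriptor to cleanup code. */
void init_zeroed(struct conn *c)
{
    memset(c, 0, sizeof(*c));
}

/* Post-patch style: explicitly mark the record as having no socket. */
void init_no_socket(struct conn *c)
{
    memset(c, 0, sizeof(*c));
    c->sock = -1;
}

/* Guarded cleanup in the spirit of the patch: only touch real
 * descriptors. Returns 1 if a descriptor was closed, 0 otherwise. */
int conn_close(struct conn *c)
{
    if (c->sock < 0)
        return 0;                  /* nothing to close */

    shutdown(c->sock, SHUT_RDWR);  /* just fails if it is not a socket */
    close(c->sock);
    c->sock = -1;
    return 1;
}
```

With `init_zeroed()` the record ends up with `sock == 0`, so `conn_close()` would close descriptor 0; with `init_no_socket()` the same call is a no-op. Closing a descriptor that another part of the program still passes to select() is one way to end up with "Bad file descriptor" some time after the task itself has run.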
list Johan Sjöberg · Mon, 6 Dec 2010 12:48:59 +0100 ·
Hi. 

Thanks for your reply. I will test this patch on our system.

/Johan

list Johan Sjöberg · Tue, 18 Jan 2011 08:21:54 +0100 ·
Hi.

I have not seen this problem since applying the patch. I can't be sure that it's fixed since it didn't happen every time, but it is looking good.

/Johan

list Henrik Størner · Tue, 18 Jan 2011 09:09:10 +0000 (UTC) ·
Hi Johan,

In <user-1b0573d8080c@xymon.invalid> Johan Sjöberg <user-74c177c1220d@xymon.invalid> writes:
I have not seen this problem since applying the patch. I can't be sure that
it's fixed since it didn't happen every time, but it is looking good.

Thanks for updating me with this! As I wrote I was not sure if this
was indeed the problem, so it is very nice to know that it seems to
have helped. It was one of the "should-really-be-fixed-before-release"
bugs that was bothering me.


Regards,
Henrik