Scheduled disable causes crash?
list Johan Sjöberg
Hi!
During the last month, we have had some problems with Xymon when using scheduled disables (added from the web interface).
The first problem we had was on September 17th, when hobbitd crashed while or after running the scheduled disable. We got the following error in hobbit.log:
2010-09-17 05:00:00 Fatal error in select: Bad file descriptor
2010-09-17 05:00:00 Setup complete
After that incident, I enabled verbose logging for hobbitd, and turned up the logging for some other processes as well.
This morning at 06:00 we had a new scheduled disable. This time hobbitd just stopped logging after running the disable. The web interface did not work correctly: when clicking a test, an "Internal server error" message was displayed. Also, the bbgen test went purple (last update at 05:59:26). The only errors I have been able to find in the logs are in bb-display.log:
2010-10-05 06:00:26 xstrdup: Cannot dup NULL string
2010-10-05 06:01:26 xstrdup: Cannot dup NULL string
2010-10-05 06:02:27 xstrdup: Cannot dup NULL string
2010-10-05 06:12:30 xstrdup: Cannot dup NULL string
2010-10-05 06:13:35 xstrdup: Cannot dup NULL string
2010-10-05 06:14:39 xstrdup: Cannot dup NULL string
2010-10-05 06:24:40 xstrdup: Cannot dup NULL string
2010-10-05 06:25:41 xstrdup: Cannot dup NULL string
2010-10-05 06:26:42 xstrdup: Cannot dup NULL string
2010-10-05 06:36:47 xstrdup: Cannot dup NULL string
2010-10-05 06:37:48 xstrdup: Cannot dup NULL string
2010-10-05 06:39:54 xstrdup: Cannot dup NULL string
2010-10-05 06:40:54 xstrdup: Cannot dup NULL string
2010-10-05 06:41:54 xstrdup: Cannot dup NULL string
Xymon was restarted at 06:43. Here is the hobbit.log from that time. It was too large to paste in the mail.
http://pastebin.com/JuU0BHje
We are running Xymon 4.2.3 on CentOS 5.
Best regards,
Johan Sjöberg
list Johan Sjöberg
Hi.
This happened again yesterday morning. We found that core dumps had been created. Here is what gdb tells us about the core dumps. Does anyone have a clue what might be causing this problem? The Xymon server is running CentOS 5.5 32-bit.
[root@mon01 acks]# file core.28581
core.28581: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'bbgen'
[root@mon01 acks]# gdb /usr/local/xymon/server/bin/bbgen core.28581
Reading symbols from /usr/local/xymon/server/bin/bbgen...done.
warning: .dynamic section for "/lib/libc.so.6" is not at the expected address
warning: difference appears to be caused by prelink, adjusting expectations
Reading symbols from /lib/libpcre.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpcre.so.0
Reading symbols from /lib/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/libpthread.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Core was generated by `bbgen --recentgifs --subpagecolumns=2 --report'.
Program terminated with signal 6, Aborted.
#0 0x00b2f402 in __kernel_vsyscall ()
/Johan
list Johan Sjöberg
Hi.
This happened again this morning at 03:00 when we had a scheduled disable. When Xymon stops working, it generates a core file every 2 or 3 minutes in /usr/local/xymon/data/acks. They all look like this (but with a different value in the "#0 0x0076d402 in __kernel_vsyscall ()" line):
[root@mon01 acks]# gdb /usr/local/xymon/server/bin/bbgen core.9464
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.2)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/local/xymon/server/bin/bbgen...done.
Reading symbols from /lib/libpcre.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpcre.so.0
Reading symbols from /lib/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/libpthread.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Core was generated by `bbgen --recentgifs --subpagecolumns=2 --report'.
Program terminated with signal 6, Aborted.
#0 0x0076d402 in __kernel_vsyscall ()
Do you have any idea what might be causing this, or how we can proceed to try to find out more about the problem?
/Johan
list Henrik Størner
In <user-bf372f430ba3@xymon.invalid> Johan Sjöberg <user-74c177c1220d@xymon.invalid> writes:
During the last month, we have had some problems with Xymon when using scheduled disables (added from the web interface).
The first problem we had was on September 17th, when hobbitd crashed while/after running the scheduled disable. We got the following error in hobbit.log
2010-09-17 05:00:00 Fatal error in select: Bad file descriptor
2010-09-17 05:00:00 Setup complete
There is a bug lurking in the scheduled-task code, but I haven't been
able to quite nail down where it is. I've seen the same problem that
you have a couple of times, where a scheduled "disable" results in
xymond (hobbitd) crashing immediately afterwards.
One potential bug I did catch is fixed with the following patch:
Index: xymond/xymond.c
===================================================================
--- xymond/xymond.c (revision 6604)
+++ xymond/xymond.c (working copy)
@@ -3971,7 +3971,7 @@
if (msg->doingwhat == RESPONDING) {
shutdown(msg->sock, SHUT_RD);
}
- else {
+ else if (msg->sock >= 0) {
shutdown(msg->sock, SHUT_RDWR);
close(msg->sock);
msg->sock = -1;
@@ -5040,6 +5040,8 @@
swalk = swalk->next;
memset(&task, 0, sizeof(task));
+ task.sock = -1;
+ task.doingwhat = NOTALK;
inet_aton(runtask->sender, (struct in_addr *) &task.addr.sin_addr.s_addr);
task.buf = task.bufp = runtask->command;
task.buflen = strlen(runtask->command); task.bufsz = task.buflen+1;
So it would be interesting to see if this helps in your setup. This patch
is against the current beta-3 code, but it applies to version 4.2.3 as
well if you run patch and explicitly tell it which file to patch:
patch hobbit-4.2.3/hobbitd/hobbitd.c < task.patch
I am not sure if this fixes the problem, though. Because if this is
what causes the crash, then it ought to happen before the log message
that the task ran is written. Unless the bug doesn't crash the system
right away, but only triggers some memory corruption that results in
a later crash ...
Regards,
Henrik
list Johan Sjöberg
Hi. Thanks for your reply. I will test this patch on our system. /Johan
list Johan Sjöberg
Hi. I have not seen this problem since applying the patch. I can't be sure that it's fixed since it didn't happen every time, but it is looking good.
/Johan
list Henrik Størner
Hi Johan,
In <user-1b0573d8080c@xymon.invalid> Johan Sjöberg <user-74c177c1220d@xymon.invalid> writes:
I have not seen this problem since applying the patch. I can't be sure that it's fixed since it didn't happen every time, but it is looking good.
Thanks for updating me on this! As I wrote, I was not sure if this was indeed the problem, so it is very nice to know that it seems to have helped. It was one of the "should-really-be-fixed-before-release" bugs that was bothering me.
Regards,
Henrik