Xymon Mailing List Archive

hobbitd_alert crashes

8 messages in this thread

list Dominique Frise · Fri, 02 Jun 2006 07:38:25 +0200 ·
Hi,

This is the snapshot of 01 June, running on Solaris 9.

[bb@iris tmp]$ gdb ../bin/hobbitd_alert core
GNU gdb 6.0
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "sparc-sun-solaris2.9"...
Core was generated by `hobbitd_alert --checkpoint-file=/soft/pub/BB/hobbit/server/tmp/alert.chk --chec'.
Program terminated with signal 6, Aborted.
Reading symbols from /usr/local/lib/libpcre.so.0...done.
Loaded symbols for /usr/local/lib/libpcre.so.0
Reading symbols from /usr/lib/libresolv.so.2...done.
Loaded symbols for /usr/lib/libresolv.so.2
Reading symbols from /usr/lib/libsocket.so.1...done.
Loaded symbols for /usr/lib/libsocket.so.1
Reading symbols from /usr/lib/libnsl.so.1...done.
Loaded symbols for /usr/lib/libnsl.so.1
Reading symbols from /usr/lib/libc.so.1...done.
Loaded symbols for /usr/lib/libc.so.1
Reading symbols from /usr/lib/libdl.so.1...done.
Loaded symbols for /usr/lib/libdl.so.1
Reading symbols from /usr/lib/libmp.so.2...done.
Loaded symbols for /usr/lib/libmp.so.2
Reading symbols from /usr/platform/SUNW,Sun-Fire-480R/lib/libc_psr.so.1...done.
Loaded symbols for /usr/platform/SUNW,Sun-Fire-480R/lib/libc_psr.so.1
#0  0xff1a05c8 in _libc_kill () from /usr/lib/libc.so.1
(gdb)


Dominique
UNIL - University of Lausanne
list Henrik Størner · Fri, 2 Jun 2006 07:40:13 +0200 ·
Could you do the "bt" command also, please ... ?

Henrik
list Dominique Frise · Fri, 02 Jun 2006 07:43:13 +0200 ·
quoted from Henrik Størner
Henrik Stoerner wrote:
Could you do the "bt" command also, please ... ?


(gdb) bt
#0  0xff1a05c8 in _libc_kill () from /usr/lib/libc.so.1
#1  0xff136d58 in abort () from /usr/lib/libc.so.1
#2  0x0002134c in sigsegv_handler (signum=0) at sig.c:57
#3  <signal handler called>
(gdb)


Dominique
UNIL - University of Lausanne
list Henrik Størner · Fri, 2 Jun 2006 07:55:08 +0200 ·
quoted from Dominique Frise
On Fri, Jun 02, 2006 at 07:43:13AM +0200, Dominique Frise wrote:
(gdb) bt
#0  0xff1a05c8 in _libc_kill () from /usr/lib/libc.so.1
#1  0xff136d58 in abort () from /usr/lib/libc.so.1
#2  0x0002134c in sigsegv_handler (signum=0) at sig.c:57
#3  <signal handler called>
Hrm, that isn't much to go on. Does it crash right away when 
you start Hobbit, or only after some time has passed ?

If it crashes right away, I'd like a copy of your bb-hosts,
hobbitserver.cfg and hobbit-alerts.cfg files. If it crashes
after some time, could you add the "--debug" option to
the hobbitd_alert command in hobbitlaunch.cfg, and then mail
me the ~hobbit/server/logs/page.log file after it has crashed?


Regards,
Henrik
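
For readers wondering why the core reports "signal 6, Aborted" when frame #2 is a SIGSEGV handler: one plausible explanation is that the handler calls abort() to force a core dump. The sketch below shows that general pattern only. sig.c is not shown in this thread, so the names sigsegv_handler_sketch and crash_child_and_get_signal are hypothetical, and the core file is deliberately suppressed in this demo.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the handler seen in frame #2 of the backtrace:
 * on SIGSEGV it calls abort(), so the process dies from SIGABRT (signal 6). */
static void sigsegv_handler_sketch(int signum)
{
	(void)signum;
	abort();
}

/* Fork a child that installs the handler and then receives a SIGSEGV.
 * Returns the signal that terminated the child, or -1 on error. */
int crash_child_and_get_signal(void)
{
	pid_t pid = fork();
	if (pid < 0) return -1;

	if (pid == 0) {
		/* Demo only: do not leave a core file behind. */
		struct rlimit rl = { 0, 0 };
		setrlimit(RLIMIT_CORE, &rl);

		signal(SIGSEGV, sigsegv_handler_sketch);
		raise(SIGSEGV);
		_exit(0);	/* Not reached */
	}

	int status = 0;
	if (waitpid(pid, &status, 0) < 0) return -1;
	return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Calling crash_child_and_get_signal() from a test program should report SIGABRT rather than SIGSEGV, which matches the "Program terminated with signal 6" line in the gdb output above.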
list Dominique Frise · Fri, 02 Jun 2006 08:04:20 +0200 ·
quoted from Henrik Størner
Henrik Stoerner wrote:
Hrm, that isn't much to go on. Does it crash right away when 
you start Hobbit, or only after some time has passed ?
It crashed three times last night. Hobbit was last restarted yesterday at 05:10 PM.
quoted from Henrik Størner
If it crashes right away, I'd like a copy of your bb-hosts,
hobbitserver.cfg and hobbit-alerts.cfg files. If it crashes
after some time, could you add the "--debug" option to
the hobbitd_alert command in hobbitlaunch.cfg, and then mail
me the ~hobbit/server/logs/page.log file after it has crashed?
Done.
I'll mail you the log asap.

Thank you.

Dominique
UNIL - University of Lausanne
list Dominique Frise · Fri, 02 Jun 2006 08:19:10 +0200 ·
Looking at the event log, I noticed that all three times hobbitd_alert 
crashed, it was trying to send to an IGNORE recipient (not always the same one).
Here are our IGNORE rules, which follow the macro definitions at the top of 
hobbit-alerts.cfg. Maybe there is something wrong with this configuration?

...
...
#---------------------------------------
# Hosts groups
#
$SAP_HOSTS=quartz,topaze,onyx,its,tulp,zircon
$ADMIN_HOSTS=bilbo,falco,furio

#------------------------------------------------------------------------------
# Rules to exclude alerting during a period of time
#------------------------------------------------------------------------------

HOST=* SERVICE=bckp TIME=*:2000:0700
    IGNORE
HOST=uldns1,uldns2 SERVICE=ldap TIME=*:0500:0530
    IGNORE
HOST=kawa,kawa2 SERVICE=http TIME=*:2210:2235
    IGNORE
HOST=gaia SERVICE=http TIME=*:0012:0015
    IGNORE
HOST=balrog,godzilla,smaug SERVICE=cpu TIME=*:2000:2359
    IGNORE
HOST=acsls,balrog,godzilla,smaug SERVICE=memory TIME=*:0600:0800
    IGNORE
HOST=unimedia,unimediad SERVICE=orcl,http TIME=*:0655:1115
    IGNORE
HOST=virtuavd SERVICE=orcl TIME=*:0001:0400
    IGNORE
HOST=tstvirtua SERVICE=orcl TIME=*:2159:0200
    IGNORE
HOST=ged SERVICE=http TIME=*:0305:0315
    IGNORE
HOST=$SAP_HOSTS SERVICE=conn,cpu,http,ftp TIME=*:1900:0700
    IGNORE
HOST=$SAP_HOSTS SERVICE=orcl,procs TIME=*:1900:2359
    IGNORE
HOST=$ADMIN_HOSTS SERVICE=http,sslcert TIME=*:0030:0630
    IGNORE
HOST=$ADMIN_HOSTS SERVICE=conn,cpu,http,sslcert TIME=*:1800:0700
    IGNORE
HOST=esope SERVICE=http,orcl TIME=*:0355:0600
    IGNORE
HOST=pcsan SERVICE=msgs,svcs,procs TIME=*:1955:2300
    IGNORE
HOST=iris SERVICE=hobbitd TIME=*:0310:0320
    IGNORE
HOST=lanfeust,winup TIME=*:1945:2200
    IGNORE
...
...


Dominique
UNIL - University of Lausanne
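
Several of the rules above use a window whose end is earlier than its start (for example TIME=*:2000:0700). The sketch below shows one way such a window could be evaluated, assuming that it is meant to wrap past midnight and that both ends are inclusive. This is an illustration of the intent only, not a description of how hobbitd_alert parses TIME; parse_window and in_window are hypothetical names, and the day part before the first colon is skipped rather than interpreted.

```c
#include <stdio.h>

/* Parse a spec such as "*:2000:0700" into start/end HHMM values.
 * The day part before the first colon is skipped, not interpreted.
 * Returns 1 on success, 0 on a malformed spec. */
int parse_window(const char *spec, int *start, int *end)
{
	return sscanf(spec, "%*[^:]:%4d:%4d", start, end) == 2;
}

/* Return 1 if now (HHMM) lies inside [start, end], both ends inclusive.
 * ASSUMPTION: end < start means the window wraps past midnight. */
int in_window(int now, int start, int end)
{
	if (start <= end) return (now >= start && now <= end);
	return (now >= start || now <= end);
}
```

Under these assumptions, 21:30 and 03:00 both fall inside "*:2000:0700", while 12:00 does not.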
list Henrik Størner · Fri, 2 Jun 2006 13:22:48 +0200 ·
quoted from Dominique Frise
On Fri, Jun 02, 2006 at 08:19:10AM +0200, Dominique Frise wrote:
Looking at the event log, I noticed that the 3 times that hobbitd_alert 
crashed, it was trying to send to an IGNORE recipient (not always the same).
Thanks, it was easy to reproduce the problem once I tried some IGNORE
rules. I believe this patch should solve the problem.


Regards,
Henrik

-------------- next part --------------
--- hobbitd/do_alert.c	2006/05/28 15:16:51	1.91
+++ hobbitd/do_alert.c	2006/06/02 11:12:17
@@ -88,6 +88,8 @@
 	char *id, *method = "unknown";
 	repeat_t *walk;
 
+	if (recip->method == M_IGNORE) return NULL;
+
 	switch (recip->method) {
 	  case M_MAIL: method = "mail"; break;
 	  case M_SCRIPT: method = "script"; break;
@@ -325,6 +327,8 @@
 			 * might create here is NOT used later on.
 			 */
 			rpt = find_repeatinfo(alert, recip, 1);
+			if (!rpt) continue;	/* Happens for e.g. M_IGNORE recipients */
+
 			dprintf("  repeat %s at %d\n", rpt->recipid, rpt->nextalert);
 			if (rpt->nextalert > now) {
 				traceprintf("Recipient '%s' dropped, next alert due at %d > %d\n",
--- lib/loadalerts.c	2006/05/31 08:50:03	1.13
+++ lib/loadalerts.c	2006/06/02 11:19:36
@@ -1092,7 +1092,9 @@
 		     (recip->criteria && (recip->criteria->sendnotice == SR_WANTED)) ) notice = 1;
 
 		*codes = '\0';
-		if (recip->method == M_IGNORE) strcat(codes, "I");
+		if (recip->method == M_IGNORE) {
+			recip->recipient = "-- ignored --";
+		}
 		if (recip->noalerts) { if (strlen(codes)) strcat(codes, ",A"); else strcat(codes, "-A"); }
 		if (recovered && !recip->noalerts) { if (strlen(codes)) strcat(codes, ",R"); else strcat(codes, "R"); }
 		if (notice) { if (strlen(codes)) strcat(codes, ",N"); else strcat(codes, "N"); }
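
The first two hunks of the patch follow a common pattern: the lookup returns NULL for recipients that have no repeat record, and the caller checks for NULL before dereferencing. A minimal, self-contained sketch of that pattern is shown below. The types and the functions find_repeatinfo_sketch and count_due_recipients are simplified, hypothetical stand-ins; only the names M_MAIL, M_SCRIPT, M_IGNORE, recipid and nextalert are taken from the patch.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real Hobbit types. */
typedef enum { M_MAIL, M_SCRIPT, M_IGNORE } method_t;

typedef struct recip_t {
	method_t method;
	const char *recipient;
	struct recip_t *next;
} recip_t;

typedef struct repeat_t {
	const char *recipid;
	int nextalert;
} repeat_t;

/* Mirrors the first hunk: no repeat record exists for M_IGNORE. */
static repeat_t *find_repeatinfo_sketch(const recip_t *recip, repeat_t *slot)
{
	if (recip->method == M_IGNORE) return NULL;
	slot->recipid = recip->recipient;
	slot->nextalert = 0;
	return slot;
}

/* Mirrors the second hunk: skip recipients without a repeat record
 * before dereferencing rpt. Returns how many recipients are due. */
int count_due_recipients(const recip_t *head, int now)
{
	int due = 0;
	for (const recip_t *walk = head; walk; walk = walk->next) {
		repeat_t slot;
		repeat_t *rpt = find_repeatinfo_sketch(walk, &slot);
		if (!rpt) continue;	/* e.g. M_IGNORE recipients */
		if (rpt->nextalert <= now) due++;
	}
	return due;
}
```

In this sketch, an M_IGNORE entry in the middle of a recipient list is simply skipped, and the remaining recipients are still processed.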
list Dominique Frise · Sat, 03 Jun 2006 11:03:26 +0200 ·
We have not had any new crashes since we applied the patches :-)

Thank you.

Dominique
UNIL - University of Lausanne