hobbitd_alert crashes
list Dominique Frise
Hi,

This is the snapshot of 01 June, running on Solaris 9.

[bb at iris tmp]$ gdb ../bin/hobbitd_alert core
GNU gdb 6.0
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "sparc-sun-solaris2.9"...
Core was generated by `hobbitd_alert --checkpoint-file=/soft/pub/BB/hobbit/server/tmp/alert.chk --chec'.
Program terminated with signal 6, Aborted.
Reading symbols from /usr/local/lib/libpcre.so.0...done.
Loaded symbols for /usr/local/lib/libpcre.so.0
Reading symbols from /usr/lib/libresolv.so.2...done.
Loaded symbols for /usr/lib/libresolv.so.2
Reading symbols from /usr/lib/libsocket.so.1...done.
Loaded symbols for /usr/lib/libsocket.so.1
Reading symbols from /usr/lib/libnsl.so.1...done.
Loaded symbols for /usr/lib/libnsl.so.1
Reading symbols from /usr/lib/libc.so.1...done.
Loaded symbols for /usr/lib/libc.so.1
Reading symbols from /usr/lib/libdl.so.1...done.
Loaded symbols for /usr/lib/libdl.so.1
Reading symbols from /usr/lib/libmp.so.2...done.
Loaded symbols for /usr/lib/libmp.so.2
Reading symbols from /usr/platform/SUNW,Sun-Fire-480R/lib/libc_psr.so.1...done.
Loaded symbols for /usr/platform/SUNW,Sun-Fire-480R/lib/libc_psr.so.1
#0 0xff1a05c8 in _libc_kill () from /usr/lib/libc.so.1
(gdb)

Dominique
UNIL - University of Lausanne
list Henrik Størner
Could you do the "bt" command also, please?

Henrik
list Dominique Frise
...
(gdb) bt
#0 0xff1a05c8 in _libc_kill () from /usr/lib/libc.so.1
#1 0xff136d58 in abort () from /usr/lib/libc.so.1
#2 0x0002134c in sigsegv_handler (signum=0) at sig.c:57
#3 <signal handler called>
(gdb)
Dominique
UNIL - University of Lausanne
list Henrik Størner
Hrm, that isn't much to go on. Does it crash right away when you start Hobbit, or only after some time has passed?

If it crashes right away, I'd like a copy of your bb-hosts, hobbitserver.cfg and hobbit-alerts.cfg files. If it crashes after some time, could you add the "--debug" option to the hobbitd_alert command in hobbitlaunch.cfg, and then mail me the ~hobbit/server/logs/page.log file after it has crashed?

Regards,
Henrik
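
For readers wondering why gdb reports "signal 6, Aborted" and why the backtrace stops at the handler: frames #1 and #2 show abort() being called from sigsegv_handler, so the original fault was presumably a SIGSEGV that the handler converted into an abort. The sketch below reproduces that effect in a child process. The names segv_to_abort and terminating_signal_of_faulting_child are hypothetical and are not taken from Hobbit's sig.c.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a handler like the one at sig.c:57:
 * it turns the segmentation fault into an abort(). */
static void segv_to_abort(int signum)
{
    (void)signum;
    abort();    /* async-signal-safe; raises SIGABRT */
}

/* Forks a child that faults with SIGSEGV while the handler is installed.
 * Returns the signal that terminated the child, or -1 on error. */
int terminating_signal_of_faulting_child(void)
{
    pid_t pid = fork();
    if (pid < 0) return -1;

    if (pid == 0) {
        struct rlimit no_core = { 0, 0 };
        setrlimit(RLIMIT_CORE, &no_core);  /* avoid leaving a core file behind */
        signal(SIGSEGV, segv_to_abort);
        raise(SIGSEGV);                    /* simulated crash */
        _exit(0);                          /* not reached */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

The parent sees SIGABRT rather than SIGSEGV as the terminating signal, which matches the gdb banner above. The frames of the code that actually faulted lie below "<signal handler called>", and gdb did not unwind past that point in this core.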
list Dominique Frise
Henrik Stoerner wrote:
Hrm, that isn't much to go on. Does it crash right away when you start Hobbit, or only after some time has passed ?
It crashed 3 times last night. Hobbit was last restarted yesterday at 05:10 PM.
Henrik Stoerner wrote:
If it crashes right away, I'd like a copy of your bb-hosts, hobbitserver.cfg and hobbit-alerts.cfg files. If it crashes after some time, could you add the "--debug" option to the hobbitd_alert command in hobbitlaunch.cfg, and then mail me the ~hobbit/server/logs/page.log file after it has crashed?
Done. I'll mail you the log asap.

Thank you.

Dominique
UNIL - University of Lausanne
list Dominique Frise
Looking at the event log, I noticed that each of the 3 times hobbitd_alert
crashed, it was trying to send to an IGNORE recipient (not always the same one).
Here are our IGNORE rules, which follow the macro definitions at the top of
hobbit-alerts.cfg. Maybe there is something wrong with this configuration?
...
...
#---------------------------------------
# Hosts groups
#
$SAP_HOSTS=quartz,topaze,onyx,its,tulp,zircon
$ADMIN_HOSTS=bilbo,falco,furio
#------------------------------------------------------------------------------
# Rules to exclude alerting during a period of time
#------------------------------------------------------------------------------
HOST=* SERVICE=bckp TIME=*:2000:0700
IGNORE
HOST=uldns1,uldns2 SERVICE=ldap TIME=*:0500:0530
IGNORE
HOST=kawa,kawa2 SERVICE=http TIME=*:2210:2235
IGNORE
HOST=gaia SERVICE=http TIME=*:0012:0015
IGNORE
HOST=balrog,godzilla,smaug SERVICE=cpu TIME=*:2000:2359
IGNORE
HOST=acsls,balrog,godzilla,smaug SERVICE=memory TIME=*:0600:0800
IGNORE
HOST=unimedia,unimediad SERVICE=orcl,http TIME=*:0655:1115
IGNORE
HOST=virtuavd SERVICE=orcl TIME=*:0001:0400
IGNORE
HOST=tstvirtua SERVICE=orcl TIME=*:2159:0200
IGNORE
HOST=ged SERVICE=http TIME=*:0305:0315
IGNORE
HOST=$SAP_HOSTS SERVICE=conn,cpu,http,ftp TIME=*:1900:0700
IGNORE
HOST=$SAP_HOSTS SERVICE=orcl,procs TIME=*:1900:2359
IGNORE
HOST=$ADMIN_HOSTS SERVICE=http,sslcert TIME=*:0030:0630
IGNORE
HOST=$ADMIN_HOSTS SERVICE=conn,cpu,http,sslcert TIME=*:1800:0700
IGNORE
HOST=esope SERVICE=http,orcl TIME=*:0355:0600
IGNORE
HOST=pcsan SERVICE=msgs,svcs,procs TIME=*:1955:2300
IGNORE
HOST=iris SERVICE=hobbitd TIME=*:0310:0320
IGNORE
HOST=lanfeust,winup TIME=*:1945:2200
IGNORE
...
...
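
As an aside for anyone auditing rules like these: several of the TIME windows above end earlier than they start (for example *:2000:0700), so they are presumably meant to run past midnight. The sketch below only splits a single day:start:end value and flags that case. It assumes the three-field form seen in this excerpt, the names time_window_t, parse_time_window and crosses_midnight are hypothetical, and it says nothing about how Hobbit itself evaluates such a window.

```c
#include <stdio.h>

/* Hypothetical holder for one "days:HHMM:HHMM" value as seen in the rules above. */
typedef struct time_window_t {
    char days[16];  /* day selector, e.g. "*" */
    int start;      /* HHMM read as a decimal number, e.g. 2000 */
    int end;        /* HHMM read as a decimal number, e.g. 700 for "0700" */
} time_window_t;

/* Returns 1 if v is a plausible HHMM value, 0 otherwise. */
static int valid_hhmm(int v)
{
    return v >= 0 && v <= 2359 && (v % 100) < 60;
}

/* Parses a single "days:start:end" value. Returns 0 on success, -1 on error. */
int parse_time_window(const char *spec, time_window_t *w)
{
    if (sscanf(spec, "%15[^:]:%4d:%4d", w->days, &w->start, &w->end) != 3) return -1;
    if (!valid_hhmm(w->start) || !valid_hhmm(w->end)) return -1;
    return 0;
}

/* Returns 1 if the window ends earlier in the day than it starts. */
int crosses_midnight(const time_window_t *w)
{
    return w->end < w->start;
}
```

For instance, "*:2000:0700" parses to start 2000 and end 700 and is flagged as crossing midnight, while "*:0500:0530" is not.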
Dominique
UNIL - University of Lausanne
list Henrik Størner
Thanks, it was easy to reproduce the problem once I tried some IGNORE
rules. I believe this patch should solve the problem.
Regards,
Henrik
-------------- next part --------------
--- hobbitd/do_alert.c 2006/05/28 15:16:51 1.91
+++ hobbitd/do_alert.c 2006/06/02 11:12:17
@@ -88,6 +88,8 @@
char *id, *method = "unknown";
repeat_t *walk;
+ if (recip->method == M_IGNORE) return NULL;
+
switch (recip->method) {
case M_MAIL: method = "mail"; break;
case M_SCRIPT: method = "script"; break;
@@ -325,6 +327,8 @@
* might create here is NOT used later on.
*/
rpt = find_repeatinfo(alert, recip, 1);
+ if (!rpt) continue; /* Happens for e.g. M_IGNORE recipients */
+
dprintf(" repeat %s at %d\n", rpt->recipid, rpt->nextalert);
if (rpt->nextalert > now) {
traceprintf("Recipient '%s' dropped, next alert due at %d > %d\n",
--- lib/loadalerts.c 2006/05/31 08:50:03 1.13
+++ lib/loadalerts.c 2006/06/02 11:19:36
@@ -1092,7 +1092,9 @@
(recip->criteria && (recip->criteria->sendnotice == SR_WANTED)) ) notice = 1;
*codes = '\0';
- if (recip->method == M_IGNORE) strcat(codes, "I");
+ if (recip->method == M_IGNORE) {
+ recip->recipient = "-- ignored --";
+ }
if (recip->noalerts) { if (strlen(codes)) strcat(codes, ",A"); else strcat(codes, "-A"); }
if (recovered && !recip->noalerts) { if (strlen(codes)) strcat(codes, ",R"); else strcat(codes, "R"); }
if (notice) { if (strlen(codes)) strcat(codes, ",N"); else strcat(codes, "N"); }
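
The first two hunks follow a common pattern: the lookup returns NULL for recipients that should never be alerted, and the calling loop skips a NULL result instead of dereferencing it. A minimal, self-contained sketch of that pattern follows. The types and the functions lookup_repeat and count_due_recipients are simplified, hypothetical stand-ins and not the actual Hobbit code.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real Hobbit types. */
enum method_t { M_MAIL, M_SCRIPT, M_IGNORE };

typedef struct recip_t {
    enum method_t method;
    const char *recipient;
} recip_t;

typedef struct repeat_t {
    const char *recipid;
    int nextalert;      /* time at which the next alert is due */
} repeat_t;

/* Fills in repeat info for a recipient.
 * Returns NULL for IGNORE recipients, which have no repeat info. */
static repeat_t *lookup_repeat(const recip_t *recip, repeat_t *slot)
{
    if (recip->method == M_IGNORE) return NULL;
    slot->recipid = recip->recipient;
    slot->nextalert = 0;
    return slot;
}

/* Counts the recipients that are due for an alert at time `now`. */
int count_due_recipients(const recip_t *recips, size_t n, int now)
{
    int due = 0;
    for (size_t i = 0; i < n; i++) {
        repeat_t slot;
        repeat_t *rpt = lookup_repeat(&recips[i], &slot);
        if (!rpt) continue;                  /* e.g. M_IGNORE recipients */
        if (rpt->nextalert > now) continue;  /* not due yet */
        due++;
    }
    return due;
}
```

With one mail, one ignore and one script recipient, only the two non-ignored recipients are counted, and rpt is never dereferenced for the ignored one.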
list Dominique Frise
We have not had any new crashes since we applied the patches :-)

Thank you.

Dominique
UNIL - University of Lausanne