Xymon Mailing List Archive

A few hobbit problems

9 messages in this thread

list Eric E *hs Schwimmer · Sun, 3 Apr 2005 16:53:28 -0400 ·
Hi all,

We're in the process of migrating from BB to hobbit, and, for the most part,
the process has been quick and painless (hobbit seems to be handling our
800+ hosts much better than BB did).

We're seeing a few oddities, though; my apologies if someone else has already
pointed these out. I've only been on the list since earlier today and don't
know how to search the archives.

1. We see the occasional yellow message from bbtest:
  Error output:
  dns_queue_run deadlock - loops=260

During this problem, ~70 seconds get added to our "Test setup" phase as
reported by bb-test.  This only happens once or twice an hour so far, so it's
not a showstopper.  We are using the --dns=ip option when calling the
bbtest-net binary from hobbitlaunch.

2. The bb-eventlog.sh script dumps core.  I've run it successfully in the      
  past, but at some point, as we added more hosts to our bb-hosts file, it 
  began to fail.  Calling it from the command line:
  
  % setenv QUERY_STRING 'MAXTIME=140&MAXCOUNT=&Send=View+log'
  % /usr/local/hobbit/bb-eventlog.sh
  Segmentation fault (core dumped)

TIA,
-Eric Schwimmer
Network Engineer
University of Virginia HSCS
list Henrik Størner · Sun, 3 Apr 2005 23:17:23 +0200 ·
Hi Eric,

On Sun, Apr 03, 2005 at 04:53:28PM -0400, Schwimmer, Eric E *HS wrote:
> We're in the process of migrating from BB to hobbit, and, for the most part,
> the process has been quick and painless (hobbit seems to be handling our
> 800+ hosts much better than BB did).

Glad to hear that.

> 1. We see the occasional yellow message from bbtest:
>   Error output:
>   dns_queue_run deadlock - loops=260
>
> During this problem, ~70 seconds get added to our "Test setup" phase
> as reported by bb-test.  This only happens once or twice an hour so
> far, so it's not a showstopper.  We are using the --dns=ip option
> when calling the bbtest-net binary from hobbitlaunch.

Since you're seeing this, some DNS lookups must be happening. Most
likely, they are from http tests, or hosts that have a "0.0.0.0" IP.

I don't have a solution for this problem - it would mean digging into
the C-ARES library which handles the DNS lookups, and I haven't done
that yet. If it becomes more urgent, I'll see what I can do about it.
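
To narrow down which bb-hosts entries could still be causing lookups, a small checker along the following lines might help. This is only a sketch and not part of Hobbit: the function name line_may_need_dns() is made up for illustration, it assumes each line starts with the IP field, and it only looks at the two cases mentioned above (a "0.0.0.0" IP, or an http test whose URL host is a name rather than an IPv4 literal and has no "=IP" override).

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <arpa/inet.h>

/* Hypothetical checker, not Hobbit code: returns 1 when a bb-hosts line
 * looks like it may still need a DNS lookup, 0 otherwise.
 * Assumes the line begins with the IP address field. */
static int line_may_need_dns(const char *line)
{
    char ip[64], urlhost[256];
    struct in_addr addr;
    const char *url;
    size_t len;

    /* First field is the IP address; "0.0.0.0" means "resolve by name". */
    len = strcspn(line, " \t");
    if (len == 0 || len >= sizeof(ip)) return 0;
    memcpy(ip, line, len);
    ip[len] = '\0';
    if (strcmp(ip, "0.0.0.0") == 0) return 1;

    /* Look for an http test and isolate the host part of its URL. */
    url = strstr(line, "http://");
    if (url == NULL) return 0;
    url += strlen("http://");
    len = strcspn(url, "/:= \t\r\n");
    if (len == 0 || len >= sizeof(urlhost)) return 0;

    /* The "http://name=IP/" form carries its own address. */
    if (url[len] == '=') return 0;

    memcpy(urlhost, url, len);
    urlhost[len] = '\0';

    /* A URL host that is not an IPv4 literal has to be resolved. */
    return inet_pton(AF_INET, urlhost, &addr) != 1;
}
```

Feeding each line of bb-hosts through this function and printing the ones that return 1 gives a list of candidates to check first; it does not address the deadlock itself.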

> 2. The bb-eventlog.sh script dumps core.  I've run it successfully
> in the past, but at some point, as we added more hosts to our
> bb-hosts file, it began to fail.  Calling it from the command line:
>
>   % setenv QUERY_STRING 'MAXTIME=140&MAXCOUNT=&Send=View+log'
>   % /usr/local/hobbit/bb-eventlog.sh
>   Segmentation fault (core dumped)

Could you try getting the call trace from the core file? Assuming
the core file is in the current directory, you should do this:

    $ gdb /usr/local/hobbit/server/bin/bb-eventlog.cgi core
    [messages from gdb]
    gdb> bt

The output from the "bt" command would be very helpful in narrowing
down the problem.


Thanks,
Henrik
list Eric E *hs Schwimmer · Sun, 3 Apr 2005 18:20:53 -0400 ·
> Could you try getting the call trace from the core file? Assuming
> the core file is in the current directory, you should do this:
>
>    $ gdb /usr/local/hobbit/server/bin/bb-eventlog.cgi core
>    [messages from gdb]
>    gdb> bt
>
> The output from the "bt" command would be very helpful in narrowing
> down the problem.

Below is the output from gdb. I apologize for the formatting; I'm using a
rather awkward web client.  Interestingly, I found that it runs fine as user
'hobbit', but users root and apache get a segfault.

Core was generated by `/usr/local/hobbit/server/bin/bb-eventlog.cgi'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
#0  0x08049502 in do_eventlog (output=0x9145c0, maxcount=100, maxminutes=140, allowallhosts=1)
    at eventlog.c:170
170                                     fprintf(output, "<TD ALIGN=CENTER BGCOLOR=%s><FONT COLOR=black>%s</FONT></TD>\n",
(gdb) bt
#0  0x08049502 in do_eventlog (output=0x9145c0, maxcount=100, maxminutes=140, allowallhosts=1)
    at eventlog.c:170
#1  0x08049c5b in main (argc=160256136, argv=0x98d50a3) at eventlog.c:338
list Eric E *hs Schwimmer · Sun, 3 Apr 2005 18:32:22 -0400 ·
I might have found another, unrelated, problem:

When I include any one of the following three lines in my bb-hosts file:
137.54.102.2   healthsystem.virginia.edu # http://healthsystem.virginia.edu/
137.54.102.2   healthsystem.virginia.edu # http://healthsystem.virginia.edu=137.54.102.2/
137.54.102.2   healthsystem.virginia.edu # http://137.54.102.2/

The bbgen process seems to hang.  The 'healthsystem.virginia.edu' page fails to appear in the appropriate menu.  None of the menu pages update, although individual test pages (such as a conn test for a switch) update appropriately.

Furthermore, the bbgen test for my hobbit server sends this message:
- Program crashed
Fatal signal caught!

(though you can only see it by viewing the bbgen test; the parent page doesn't update).

-Eric
list Henrik Størner · Mon, 4 Apr 2005 07:49:25 +0200 ·
On Sun, Apr 03, 2005 at 06:20:53PM -0400, Schwimmer, Eric E *HS wrote:
>> Could you try getting the call trace from the core file? Assuming
>> the core file is in the current directory, you should do this:
>>
>>    $ gdb /usr/local/hobbit/server/bin/bb-eventlog.cgi core
>>    [messages from gdb]
>>    gdb> bt
>>
>> The output from the "bt" command would be very helpful in narrowing
>> down the problem.
> Below is the output from gdb

Thanks, that pin-pointed the problem nicely. Your eventlog has an
entry from a host that is not in the bb-hosts file; these are ignored
by the normal eventlog shown on the bb2 page, but the CGI script tried
to include them with fatal consequences.
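
The failure mode can be sketched in isolation as follows. This is a simplified illustration, not the real eventlog.c: host_t and the two accept_event_* functions are hypothetical names, and only the filter condition is modelled. The point is that the old condition let an event through even when the host lookup had returned NULL, so any later use of the host record (presumably what happens around eventlog.c:170) would dereference a NULL pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the host record; only the field that the
 * filter looks at is modelled here. */
typedef struct {
    const char *displayname;
    int nobb2;              /* non-zero: host is excluded from the bb2 page */
} host_t;

/* Old-style filter: with allowallhosts set, an event whose host lookup
 * failed (eventhost == NULL) is still accepted. */
static int accept_event_old(const host_t *eventhost, int allowallhosts)
{
    return allowallhosts || (eventhost && !eventhost->nobb2);
}

/* New-style filter: an event is only accepted when the host is known,
 * so code further down can safely use the host record. */
static int accept_event_new(const host_t *eventhost)
{
    return eventhost && !eventhost->nobb2;
}
```

With the old condition, accept_event_old(NULL, 1) is true, which matches the CGI case where the last argument was 1; with the new condition an unknown host is always skipped.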

I've attached a patch to fix this. To apply, save the patch to
/tmp/eventlog-crash.patch, then

  cd hobbit-4.0.1
  patch -p0 </tmp/eventlog-crash.patch
  make
  make install  # as root


Regards,
Henrik
-------------- next part --------------
--- bbdisplay/pagegen.c	2005/03/22 09:03:37	1.139
+++ bbdisplay/pagegen.c	2005/04/04 05:43:53
@@ -1023,7 +1023,7 @@
 	while (p) {
 		/* Dont redo the eventlog or acklog things */
 		if (strcmp(p, "eventlog.sh") == 0) {
-			if (bb2eventlog && !havedoneeventlog) do_eventlog(output, bb2eventlogmaxcount, bb2eventlogmaxtime, 0);
+			if (bb2eventlog && !havedoneeventlog) do_eventlog(output, bb2eventlogmaxcount, bb2eventlogmaxtime);
 		}
 		else if (strcmp(p, "acklog.sh") == 0) {
 			if (bb2acklog && !havedoneacklog) do_acklog(output, 25, 240);
@@ -1202,7 +1202,7 @@
 		do_bb2ext(output, "BBMKBB2EXT", "mkbb");
 
 		/* Dont redo the eventlog or acklog things */
-		if (bb2eventlog && !havedoneeventlog) do_eventlog(output, 0, 240, 0);
+		if (bb2eventlog && !havedoneeventlog) do_eventlog(output, 0, 240);
 		if (bb2acklog && !havedoneacklog) do_acklog(output, 25, 240);
 	}
 
--- bbdisplay/eventlog.c	2005/03/22 09:03:37	1.17
+++ bbdisplay/eventlog.c	2005/04/04 05:42:57
@@ -48,7 +48,7 @@
 	return result;
 }
 
-void do_eventlog(FILE *output, int maxcount, int maxminutes, int allowallhosts)
+void do_eventlog(FILE *output, int maxcount, int maxminutes)
 {
 	FILE *eventlog;
 	char eventlogfilename[PATH_MAX];
@@ -117,7 +117,7 @@
 
 		if ( (itemsfound == 8) && 
 		     (eventtime > cutoff) && 
-		     (allowallhosts || (eventhost && !eventhost->nobb2)) && 
+		     (eventhost && !eventhost->nobb2) && 
 		     (wanted_eventcolumn(svcname)) ) {
 
 			newevent = (event_t *) malloc(sizeof(event_t));
@@ -335,7 +335,7 @@
 
 	headfoot(stdout, "event", "", "header", COL_GREEN);
 	fprintf(stdout, "<center>\n");
-	do_eventlog(stdout, maxcount, maxminutes, 1);
+	do_eventlog(stdout, maxcount, maxminutes);
 	fprintf(stdout, "</center>\n");
 	headfoot(stdout, "event", "", "footer", COL_GREEN);
 
--- bbdisplay/eventlog.h	2005/03/22 09:03:37	1.3
+++ bbdisplay/eventlog.h	2005/04/04 05:43:14
@@ -14,6 +14,6 @@
 extern char *eventignorecolumns;
 extern int havedoneeventlog;
 
-extern void do_eventlog(FILE *output, int maxcount, int maxminutes, int allowallhosts);
+extern void do_eventlog(FILE *output, int maxcount, int maxminutes);
 
 #endif
list Henrik Størner · Mon, 4 Apr 2005 07:55:29 +0200 ·
On Sun, Apr 03, 2005 at 06:32:22PM -0400, Schwimmer, Eric E *HS wrote:
> When I include any one of the following three lines in my bb-hosts file:
> 137.54.102.2   healthsystem.virginia.edu # http://healthsystem.virginia.edu/
> 137.54.102.2   healthsystem.virginia.edu # http://healthsystem.virginia.edu=137.54.102.2/
> 137.54.102.2   healthsystem.virginia.edu # http://137.54.102.2/
>
> The bbgen process seems to hang.  The 'healthsystem.virginia.edu'
> page fails to appear in the appropriate menu.  None of the menu
> pages update, although individual test pages (such as a conn test
> for a switch) update appropriately.
> Furthermore, the bbgen test for my hobbit server sends this message:
> - Program crashed
> Fatal signal caught!

This is a sure sign of the "bbgen" task crashing while generating the
new webpages. You should find a core file from it in the
~hobbit/server/tmp/ directory (or occasionally in ~hobbit/data/logs/),
so do the same thing that you did with the eventlog problem:

    cd ~hobbit/server
    gdb bin/bbgen tmp/core
    gdb> bt

and send me the output.


Thanks,

Henrik
list Eric E *hs Schwimmer · Mon, 4 Apr 2005 09:11:17 -0400 ·
Here's the gdb output:

Core was generated by `bbgen --hobbitd --recentgifs --subpagecolumns=2 --report'.
Program terminated with signal 6, Aborted.
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2

#0  0x007d57a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) bt
#0  0x007d57a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00815955 in raise () from /lib/tls/libc.so.6
#2  0x00817319 in abort () from /lib/tls/libc.so.6
#3  0x0805df0e in sigsegv_handler (signum=11) at sig.c:57
#4  <signal handler called>
#5  0x00856490 in strcpy () from /lib/tls/libc.so.6
#6  0x0804d1ed in load_bbhosts (pgset=0x80647e3 "") at loadbbhosts.c:603
#7  0x08049bdb in main (argc=5, argv=0xbff67c44) at bbgen.c:550

-Eric

list Eric E *hs Schwimmer · Mon, 4 Apr 2005 09:21:00 -0400 ·
Works like a champ!  Thanks!

-Eric

-----Original Message-----
From:	Henrik Stoerner [mailto:user-ce4a2c883f75@xymon.invalid]
Sent:	Mon 4/4/2005 1:49 AM
To:	user-ae9b8668bcde@xymon.invalid
Cc:	
Subject:	Re: [hobbit] A few hobbit problems

> Thanks, that pin-pointed the problem nicely. Your eventlog has an
> entry from a host that is not in the bb-hosts file; these are ignored
> by the normal eventlog shown on the bb2 page, but the CGI script tried
> to include them with fatal consequences.
>
> I've attached a patch to fix this. To apply, save the patch to
> /tmp/eventlog-crash.patch, then
>
>   cd hobbit-4.0.1
>   patch -p0 </tmp/eventlog-crash.patch
>   make
>   make install  # as root
>
> Regards,
> Henrik
list Henrik Størner · Mon, 4 Apr 2005 18:05:26 +0200 ·
On Sun, Apr 03, 2005 at 06:32:22PM -0400, Schwimmer, Eric E *HS wrote:
> I might have found another, unrelated, problem:
>
> When I include any one of the following three lines in my bb-hosts file:
> 137.54.102.2   healthsystem.virginia.edu # http://healthsystem.virginia.edu/
> 137.54.102.2   healthsystem.virginia.edu # http://healthsystem.virginia.edu=137.54.102.2/
> 137.54.102.2   healthsystem.virginia.edu # http://137.54.102.2/
>
> The bbgen process seems to hang.

I investigated this together with Eric, and found out that the culprit
was setting FQDN=FALSE in hobbitserver.cfg - this was not handled
correctly by bbgen after it was adapted for Hobbit.
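
The actual fix is not shown in this thread. As a generic illustration only, here is one way to shorten a hostname when FQDN is disabled without relying on an unchecked strcpy(). The helper short_hostname() is hypothetical and is not taken from loadbbhosts.c; it assumes that FQDN=FALSE means "keep only the part of the name before the first dot".

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not Hobbit code: copy `host` into `out`, dropping
 * the domain part when `fqdn` is zero. The copy is bounded by `outlen`,
 * so an over-long name is truncated instead of overrunning the buffer.
 * Returns 0 on success, -1 on bad arguments. */
static int short_hostname(const char *host, int fqdn, char *out, size_t outlen)
{
    size_t len;

    if (host == NULL || out == NULL || outlen == 0) return -1;

    /* With FQDN disabled, keep only the text before the first dot. */
    len = fqdn ? strlen(host) : strcspn(host, ".");
    if (len >= outlen) len = outlen - 1;

    memcpy(out, host, len);
    out[len] = '\0';
    return 0;
}
```

The NULL check and the explicit length bound are the two things an unguarded strcpy() does not give you; whether either of them was the real cause of the crash in load_bbhosts() is not stated here.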

Since it hasn't shown up in the beta-tests, I guess most of you use
the default setup where FQDN=TRUE :-)

I'll probably release a 4.0.2 version in a few days with the
collection of patches that have been done after the 4.0 release.


Regards,
Henrik