Xymon Mailing List Archive

hobbitd_larrd is crashing

16 messages in this thread

list Larry Barber · Fri, 9 Jun 2006 13:30:39 -0500 ·
For some reason hobbitd_larrd has started crashing on my main production
server. Larrd-status.log has messages like this in it:

2006-06-09 12:42:39 Worker process died with exit code 139, terminating
2006-06-09 12:54:42 2006-06-09 12:54:42 Worker process died with exit code
139, terminating
2006-06-09 12:54:42 Our child has failed and will not talk to us: Channel
status, PID 18254
2006-06-09 12:54:42 Worker process died with exit code 139, terminating
2006-06-09 12:56:40 2006-06-09 12:56:40 Worker process died with exit code
139, terminating
2006-06-09 12:56:40 Our child has failed and will not talk to us: Channel
status, PID 18803
2006-06-09 12:56:40 Worker process died with exit code 139, terminating
2006-06-09 12:56:48 Host 'enormous' reports vmstat for an unknown OS
2006-06-09 12:58:43 2006-06-09 12:58:43 Worker process died with exit code
139, terminating
2006-06-09 12:58:43 2006-06-09 12:58:43 Worker process died with exit code
139, terminating
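
An exit code of 139 is consistent with a segmentation fault: under the common shell convention a process killed by signal N is reported as 128+N, and SIGSEGV is signal 11 on Linux. A minimal sketch of that arithmetic follows. The helper name is made up for illustration, and it is an assumption that hobbitd_larrd reports the worker status using this convention.

```c
#include <signal.h>

/* Hypothetical helper (not part of Hobbit): under the common shell
 * convention, a status above 128 means "killed by signal (status - 128)".
 * Returns 0 when the code does not imply a signal. */
static int signal_from_exit_code(int code)
{
    return (code > 128 && code < 256) ? code - 128 : 0;
}
```

Calling signal_from_exit_code(139) yields 11, which can be compared against SIGSEGV from <signal.h>.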


Loading the core file into gdb and executing "backtrace" yields "No stack".
Any ideas what's going on? I'm running Hobbit 4.1.2.rc1 on a RedHat ES3 box.

Thanks,
Larry Barber
list Henrik Størner · Fri, 9 Jun 2006 22:53:55 +0200 ·
quoted from Larry Barber
On Fri, Jun 09, 2006 at 01:30:39PM -0500, Larry Barber wrote:
For some reason hobbitd_larrd has started crashing on my main production
server. Larrd-status.log has messages like this in it:

2006-06-09 12:42:39 Worker process died with exit code 139, terminating

Loading the core file into gdb and executing "backtrace" yields "No stack".
Any ideas what's going on? I'm running Hobbit 4.1.2.rc1 on a RedHat ES3 box.
4.1.2-rc1 is pretty old (almost one year).

I can think of several problems that might cause this, but my first
suggestion would be to at least upgrade to the 4.1.2p1 release that is
the current production-release.

From a testing perspective I'd like you to try out the 4.2 beta release
that went out early this week, but I fully understand if you would
rather not run the beta-version on a production system.


Regards,
Henrik
list Larry Barber · Fri, 9 Jun 2006 16:21:56 -0500 ·
I loaded p1, and hobbitd_rrd is still dumping core; the stack trace looks like:

#0  0x00dfe60a in do_lookup_versioned () from /lib/ld-linux.so.2
#1  0x00dfd776 in _dl_lookup_versioned_symbol_internal () from /lib/ld-linux.so.2
#2  0x00e01473 in fixup () from /lib/ld-linux.so.2
#3  0x00e01330 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#4  0x08054c6d in sigsegv_handler (signum=11) at sig.c:51
#5  <signal handler called>
#6  0x00dfe3da in do_lookup () from /lib/ld-linux.so.2
#7  0x00dfd103 in _dl_lookup_symbol_internal () from /lib/ld-linux.so.2
#8  0x00e0140f in fixup () from /lib/ld-linux.so.2
#9  0x00e01330 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#10 0x0804a91f in create_and_update_rrd (hostname=0xb755d037
"stellent_pre-prod_v-ip",
    fn=0x805f6e0
"tcp.http.https:,,pws.tc.sc.egov.usda.gov,siteminderagent,dmsforms,login_banner.fcc?TYPE=33554433&REALMOID=06-d38f4375-a8bd-4190-b6f9-3c77f0901647&GUID=&SMAUTHREASON=0&METHOD=GET&SMAGENTNAME=$SM$hIspF3"...,
creparams=0x805e5c0, template=0x9cf6b20 "sec") at do_rrd.c:143
#11 0x0804f294 in do_net_rrd (hostname=0xb755d037 "stellent_pre-prod_v-ip",
testname=0xb755d04e "http",
    msg=0xb755d07c "status stellent_pre-prod_v-ip.http green Fri Jun  9
16:16:31 2006: OK ; OK\n\n&green
https://pws.tc.sc.egov.usda.gov/siteminderagent/dmsforms/login_banner.fcc?TYPE=33554433&REALMOID=06-d38f4375-a8bd-419"...,
tstamp=1149887818) at rrd/do_net.c:48
#12 0x0805024a in update_rrd (hostname=0xb755d037 "stellent_pre-prod_v-ip",
testname=0xb755d04e "http",
    msg=0xb755d07c "status stellent_pre-prod_v-ip.http green Fri Jun  9
16:16:31 2006: OK ; OK\n\n&green
https://pws.tc.sc.egov.usda.gov/siteminderagent/dmsforms/login_banner.fcc?TYPE=33554433&REALMOID=06-d38f4375-a8bd-419"...,
tstamp=1149887818, sender=0x1ca3f <Address 0x1ca3f out of bounds>,
ldef=0x1ca3f) at do_rrd.c:291
#13 0x08049cf0 in main (argc=117311, argv=0xbfff8324) at hobbitd_rrd.c:199

larrd-status.log looks like:

...

2006-06-09 15:45:24 Our child has failed and will not talk to us: Channel
status, PID 22591
2006-06-09 15:45:24 Worker process died with exit code 139, terminating
2006-06-09 15:56:03 2006-06-09 15:56:03 Worker process died with exit code
139, terminating
2006-06-09 15:57:03 2006-06-09 15:57:03 Worker process died with exit code
139, terminating
2006-06-09 15:57:24 Worker process died with exit code 139, terminating
2006-06-09 15:57:24 Our child has failed and will not talk to us: Channel
status, PID 25060
2006-06-09 15:57:24 Worker process died with exit code 139, terminating
2006-06-09 15:59:24 2006-06-09 15:59:24 Worker process died with exit code
139, terminating
2006-06-09 15:59:24 Worker process died with exit code 139, terminating
2006-06-09 16:09:26 Worker process died with exit code 139, terminating
2006-06-09 16:13:01 2006-06-09 16:13:01 Worker process died with exit code
139, terminating
2006-06-09 16:13:02 Worker process died with exit code 139, terminating
2006-06-09 16:14:56 2006-06-09 16:14:56 Worker process died with exit code
139, terminating
2006-06-09 16:14:57 Worker process died with exit code 139, terminating
2006-06-09 16:16:58 2006-06-09 16:16:58 Worker process died with exit code
139, terminating
2006-06-09 16:16:58 Worker process died with exit code 139, terminating


It just started doing this today, I can't think of anything that I have done
that could cause it.

Thanks,
Larry Barber

list Henrik Størner · Fri, 9 Jun 2006 23:40:48 +0200 ·
quoted from Larry Barber
On Fri, Jun 09, 2006 at 04:21:56PM -0500, Larry Barber wrote:
I loaded p1, and hobbitd_rrd is still dumping, the stack trace looks like:

#5  <signal handler called>
#6  0x00dfe3da in do_lookup () from /lib/ld-linux.so.2
#7  0x00dfd103 in _dl_lookup_symbol_internal () from /lib/ld-linux.so.2
#8  0x00e0140f in fixup () from /lib/ld-linux.so.2
#9  0x00e01330 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#10 0x0804a91f in create_and_update_rrd (hostname=0xb755d037
"stellent_pre-prod_v-ip",
   fn=0x805f6e0
"tcp.http.https:,,pws.tc.sc.egov.usda.gov,siteminderagent,dmsforms,login_banner.fcc?TYPE=33554433&REALMOID=06-d38f4375-a8bd-4190-b6f9-3c77f0901647&GUID=&SMAUTHREASON=0&METHOD=GET&SMAGENTNAME=$SM$hIspF3"...,
creparams=0x805e5c0, template=0x9cf6b20 "sec") at do_rrd.c:143
OK, the call trace looks sane so I think we can rule out simple memory
corruption here.

The crash happens while trying to print an error message from the RRDtool
library, when trying to create a new RRD file for tracking an http test
response time (it has just called the rrd_create() function, which returned
an error, and hobbit is trying to print out the error message when it crashes).

The filename looks somewhat suspicious. It is generated from the URL
that is tested, and it is a very long filename beginning with
"tcp.http.https:,,pws.tc.sc.egov.usda.gov,siteminderagent,dmsforms,login_banner.fcc?TYPE="
It's an http test for the host "stellent_pre-prod_v-ip"

My guess is that this filename is just too long. It *could* overflow the
buffer set aside for the RRD filename - in that case, the attached patch 
against 4.1.2p1 should help.
quoted from Larry Barber

It just started doing this today, I can't think of anything that I have done
that could cause it.
I think you just added this http test for "stellent_pre-prod_v-ip".


Regards,
Henrik

-------------- next part --------------
--- hobbitd/do_rrd.c.orig	2005-08-02 14:59:18.000000000 +0200
+++ hobbitd/do_rrd.c	2006-06-09 23:38:05.307993923 +0200
@@ -118,7 +118,8 @@
 			return -1;
 		}
 	}
-	strcat(filedir, "/"); strcat(filedir, fn);
+	snprintf(filedir, sizeof(filedir)-1, "%s/%s/%s", rrddir, hostname, fn);
+	filedir[sizeof(filedir)-1] = '\0';
 	creparams[1] = filedir;	/* Icky */
 
 	if (stat(filedir, &st) == -1) {
list Henrik Størner · Fri, 9 Jun 2006 23:52:05 +0200 ·
quoted from Henrik Størner
On Fri, Jun 09, 2006 at 11:40:48PM +0200, Henrik Stoerner wrote:
My guess is that this filename is just too long. It *could* overflow the
buffer set aside for the RRD filename - in that case, the attached patch 
against 4.1.2p1 should help.
Correction, there is one more place that is sensitive to the filename
length. Please use this corrected patch instead of the one I sent
earlier.


Regards,
Henrik

-------------- next part --------------
--- hobbitd/do_rrd.c.orig	2005-08-02 14:59:18.000000000 +0200
+++ hobbitd/do_rrd.c	2006-06-09 23:38:05.307993923 +0200
@@ -118,7 +118,8 @@
 			return -1;
 		}
 	}
-	strcat(filedir, "/"); strcat(filedir, fn);
+	snprintf(filedir, sizeof(filedir)-1, "%s/%s/%s", rrddir, hostname, fn);
+	filedir[sizeof(filedir)-1] = '\0';
 	creparams[1] = filedir;	/* Icky */
 
 	if (stat(filedir, &st) == -1) {

--- hobbitd/rrd/do_net.c.orig	2005-09-28 23:20:56.000000000 +0200
+++ hobbitd/rrd/do_net.c	2006-06-09 23:50:40.404367975 +0200
@@ -43,7 +43,8 @@
 
 				if (strncmp(urlfn, "http://", 7) == 0) urlfn += 7;
 				p = urlfn; while ((p = strchr(p, '/')) != NULL) *p = ',';
-				sprintf(rrdfn, "tcp.http.%s.rrd", urlfn);
+				snprintf(rrdfn, sizeof(rrdfn)-1, "tcp.http.%s.rrd", urlfn);
+				rrdfn[sizeof(rrdfn)-1] = '\0';
 				sprintf(rrdvalues, "%d:%.2f", (int)tstamp, seconds);
 				create_and_update_rrd(hostname, rrdfn, bbnet_params, bbnet_tpl);
 				xfree(url); url = NULL;
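
The filename handling touched by the do_net.c hunk can be sketched in isolation as follows. This is not the Hobbit source: url_to_rrdfn is an illustrative name, and the sketch rewrites '/' in the output buffer rather than in a copy of the URL. It shows how a tested URL becomes a name such as "tcp.http.https:,,host,path.rrd" without writing past the buffer.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative helper (hypothetical name): build "tcp.http.<url>.rrd" in out,
 * dropping a leading "http://" and replacing '/' with ',' as the patch does.
 * Returns 0 on success, -1 if the name did not fit; in that case out holds a
 * truncated but NUL-terminated string. */
static int url_to_rrdfn(const char *url, char *out, size_t outsz)
{
    int n;
    char *p;

    if (outsz == 0) return -1;
    if (strncmp(url, "http://", 7) == 0) url += 7;

    /* snprintf never writes more than outsz bytes, terminator included */
    n = snprintf(out, outsz, "tcp.http.%s.rrd", url);

    /* The fixed prefix and suffix contain no '/', so only URL slashes change */
    for (p = out; (p = strchr(p, '/')) != NULL; ) *p = ',';

    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

An "https://" URL keeps its scheme because only "http://" is stripped, which matches the "tcp.http.https:,," names seen in the backtraces.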
list Larry Barber · Fri, 9 Jun 2006 17:01:55 -0500 ·
No joy, it is still crashing, stack trace:

(gdb)
#0  0x0046260a in do_lookup_versioned () from /lib/ld-linux.so.2
#1  0x00461776 in _dl_lookup_versioned_symbol_internal () from /lib/ld-linux.so.2
#2  0x00465473 in fixup () from /lib/ld-linux.so.2
#3  0x00465330 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#4  0x08054c79 in sigsegv_handler (signum=11) at sig.c:51
#5  <signal handler called>
#6  0x004623da in do_lookup () from /lib/ld-linux.so.2
#7  0x00461103 in _dl_lookup_symbol_internal () from /lib/ld-linux.so.2
#8  0x0046540f in fixup () from /lib/ld-linux.so.2
#9  0x00465330 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#10 0x0804a92b in create_and_update_rrd (hostname=0x7 <Address 0x7 out of
bounds>,
    fn=0x805f6e0
"tcp.http.https:,,pws.tc.sc.egov.usda.gov,siteminderagent,dmsforms,login_banner.fcc?TYPE=33554433&REALMOID=06-d38f4375-a8bd-4190-b6f9-3c77f0901647&GUID=&SMAUTHREASON=0&METHOD=GET&SMAGENTNAME=$SM$hIspF3"...,
creparams=0x805e5c0, template=0x93f7b20 "sec") at do_rrd.c:145
#11 0x0804f2a0 in do_net_rrd (hostname=0xb755f036 "stellent_pre-prod_v-ip",
testname=0xb755f04d "http",
    msg=0xb755f07b "status stellent_pre-prod_v-ip.http green Fri Jun  9
16:53:40 2006: OK ; OK\n\n&green
https://pws.tc.sc.egov.usda.gov/siteminderagent/dmsforms/login_banner.fcc?TYPE=33554433&REALMOID=06-d38f4375-a8bd-419"...,
tstamp=1149890052) at rrd/do_net.c:48
#12 0x08050256 in update_rrd (hostname=0xb755f036 "stellent_pre-prod_v-ip",
testname=0xb755f04d "http",
    msg=0xb755f07b "status stellent_pre-prod_v-ip.http green Fri Jun  9
16:53:40 2006: OK ; OK\n\n&green
https://pws.tc.sc.egov.usda.gov/siteminderagent/dmsforms/login_banner.fcc?TYPE=33554433&REALMOID=06-d38f4375-a8bd-419"...,
tstamp=1149890052, sender=0x1ca3f <Address 0x1ca3f out of bounds>,
ldef=0x1ca3f) at do_rrd.c:293
#13 0x08049cf0 in main (argc=117311, argv=0xbfffab14) at hobbitd_rrd.c:199


I was looking at your patch, and it doesn't look to me like the new lines
are doing the same thing as the old ones:
quoted from Henrik Størner

-	strcat(filedir, "/"); strcat(filedir, fn);
+	snprintf(filedir, sizeof(filedir)-1, "%s/%s/%s", rrddir, hostname, fn);
+	filedir[sizeof(filedir)-1] = '\0';
 	creparams[1] = filedir;	/* Icky */

It looks like the original line creates something like "filedir/fn"
while the new lines create something like "filedir/hostname/fn". Is
this right?

Thanks,
Larry Barber


list Larry Barber · Fri, 9 Jun 2006 17:14:10 -0500 ·
After applying the second patch, it's still crashing, stacktrace:

(gdb) backtrace
#0  0x00d8960a in do_lookup_versioned () from /lib/ld-linux.so.2
#1  0x00d88776 in _dl_lookup_versioned_symbol_internal () from /lib/ld-linux.so.2
#2  0x00d8c473 in fixup () from /lib/ld-linux.so.2
#3  0x00d8c330 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#4  0x08054c89 in sigsegv_handler (signum=11) at sig.c:51
#5  <signal handler called>
#6  0x0039078b in strlen () from /lib/tls/libc.so.6
#7  0x0035e621 in vfprintf () from /lib/tls/libc.so.6
#8  0x0037fd24 in vsnprintf () from /lib/tls/libc.so.6
#9  0x08050fb3 in errprintf (fmt=0x8057cd8 "RRD error creating %s: %s\n") at
errormsg.c:51
#10 0x0804a93a in create_and_update_rrd (hostname=0x7 <Address 0x7 out of
bounds>,
    fn=0x805f6e0
"tcp.http.https:,,pws.sc.egov.usda.gov,siteminderagent,dmsforms,login_banner.fcc?TYPE=33554433&REALMOID=06-d3b2e2ae-78ac-495d-a153-09f36b6aa237&GUID=&SMAUTHREASON=0&METHOD=GET&SMAGENTNAME=$SM$2z10ILc8e"...,
creparams=0x805e5c0, template=0x9098e68 "sec") at do_rrd.c:145
#11 0x0804f2af in do_net_rrd (hostname=0xb7560037 "FS_PVHOST",
testname=0xb7560041 "http",
    msg=0xb756006f "status FS_PVHOST.http green Fri Jun  9 17:11:00 2006: OK
; OK ; OK\n\n&green http://poc.fs.usda.gov/wps/portal - OK\n\nHTTP/1.1 200
OK\r\nDate: Fri, 09 Jun 2006 22:11:57 GMT\r\nServer:
IBM_HTTP_Server/2.0.47."..., tstamp=1149891084) at rrd/do_net.c:50
#12 0x08050266 in update_rrd (hostname=0xb7560037 "FS_PVHOST",
testname=0xb7560041 "http",
    msg=0xb756006f "status FS_PVHOST.http green Fri Jun  9 17:11:00 2006: OK
; OK ; OK\n\n&green http://poc.fs.usda.gov/wps/portal - OK\n\nHTTP/1.1 200
OK\r\nDate: Fri, 09 Jun 2006 22:11:57 GMT\r\nServer:
IBM_HTTP_Server/2.0.47."..., tstamp=1149891084, sender=0x706a4266 <Address
0x706a4266 out of bounds>, ldef=0x706a4266) at do_rrd.c:293
#13 0x08049cf0 in main (argc=1886012006, argv=0xbfffbab4) at
hobbitd_rrd.c:199


BTW, those ultra-long URLs have been in there for quite a while, several
months anyway.

Thanks,
Larry Barber

list Henrik Størner · Sat, 10 Jun 2006 00:16:40 +0200 ·
quoted from Larry Barber
On Fri, Jun 09, 2006 at 05:01:55PM -0500, Larry Barber wrote:
No joy, it is still crashing, stack trace:
Does rrdtool work for you? Try running
   rrdtool create /foo.rrd DS:sec:GAUGE:600:0:U RRA:AVERAGE:0.5:1:576
Assuming you're not root, this should print out the message
   ERROR: creating '/foo.rrd': Permission denied
quoted from Larry Barber
I was looking at your patch, and it doesn't look to me like the new lines
are doing the same thing as the old ones:

-	strcat(filedir, "/"); strcat(filedir, fn);
+	snprintf(filedir, sizeof(filedir)-1, "%s/%s/%s", rrddir, hostname, fn);
+	filedir[sizeof(filedir)-1] = '\0';
	creparams[1] = filedir;	/* Icky */

It looks like the original line creates something like "filedir/fn"
while the new lines create something like "filedir/hostname/fn". Is
this right?
It is. In the old version, "filedir" contained the rrd top-level
directory + the hostname, e.g. "/hobbit/rrd/myhost", and then it
added an extra "/" and the rrd filename.

The new version just uses snprintf() to output the top-level
directory + the hostname-directory + the rrd filename in one go.

	sprintf(filedir, "%s/%s", rrddir, hostname);
	if (stat(filedir, &st) == -1) {
		...
	}
	snprintf(filedir, sizeof(filedir)-1, "%s/%s/%s", rrddir, hostname, fn);
	filedir[sizeof(filedir)-1] = '\0';
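
As a side note on the snprintf() call: a C99-conforming snprintf() NUL-terminates the buffer whenever the size is non-zero and returns the length the full string would have had, so truncation can be detected from the return value. The sketch below assumes a C99 library. build_rrd_path is an illustrative name, not a Hobbit function.

```c
#include <stdio.h>

/* Illustrative helper (hypothetical name): format "<rrddir>/<hostname>/<fn>"
 * into path. Returns 0 if the whole path fit, -1 if it was truncated or
 * snprintf() reported an error. */
static int build_rrd_path(char *path, size_t pathsz,
                          const char *rrddir, const char *hostname,
                          const char *fn)
{
    int n = snprintf(path, pathsz, "%s/%s/%s", rrddir, hostname, fn);

    /* n is the length the complete string needs, excluding the terminator */
    return (n < 0 || (size_t)n >= pathsz) ? -1 : 0;
}
```

Checking the result lets the caller skip the RRD update for an over-long name instead of creating a file under a silently shortened path.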


Regards,
Henrik
list Larry Barber · Fri, 9 Jun 2006 17:23:21 -0500 ·
rrdtool performs as expected:

-bash-2.05b$ /usr/local/rrdtool-1.2.10/bin/rrdtool create /foo.rrd
DS:sec:GAUGE:600:0:U RRA:AVERAGE:0.5:1:576
ERROR: creating '/foo.rrd': Permission denied

Thanks,
Larry Barber



list Henrik Størner · Sat, 10 Jun 2006 00:31:20 +0200 ·
quoted from Larry Barber
On Fri, Jun 09, 2006 at 05:23:21PM -0500, Larry Barber wrote:
rrdtool performs as expected:

-bash-2.05b$ /usr/local/rrdtool-1.2.10/bin/rrdtool create /foo.rrd
DS:sec:GAUGE:600:0:U RRA:AVERAGE:0.5:1:576
ERROR: creating '/foo.rrd': Permission denied
OK, but I still think it is odd that it crashes while printing an
RRDtool error message.

What happens if you use this patch on top of the one you already
installed ?


Henrik

-------------- next part --------------
--- hobbitd/do_rrd.c.p1	2006-06-10 00:26:34.449750393 +0200
+++ hobbitd/do_rrd.c	2006-06-10 00:31:02.065972642 +0200
@@ -141,7 +141,14 @@
 		optind = opterr = 0; rrd_clear_error();
 		result = rrd_create(pcount, creparams);
 		if (result != 0) {
-			errprintf("RRD error creating %s: %s\n", filedir, rrd_get_error());
+			char *errmsg = rrd_get_error();
+			char errcopy[100];
+			if (errmsg == NULL) errmsg = "Unknown rrd error";
+			strncpy(errcopy, errmsg, sizeof(errcopy)-1);
+			errcopy[sizeof(errcopy)-1] = '\0';
+			errprintf("RRD error creating %s: %s\n", filedir, errcopy);
 			MEMUNDEFINE(filedir);
 			MEMUNDEFINE(rrdvalues); MEMUNDEFINE(rrdfn);
 			return 1;
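
The defensive copy in this patch can be written as a small stand-alone helper, sketched below. safe_copy is an illustrative name, not part of Hobbit or RRDtool. It guards against a NULL pointer and an over-long string, but it cannot protect against a pointer that is non-NULL yet invalid, since strncpy() still has to read from it.

```c
#include <string.h>

/* Illustrative helper (hypothetical name): copy src into dst, which holds
 * dstsz bytes, substituting fallback when src is NULL and always leaving
 * dst NUL-terminated. */
static void safe_copy(char *dst, size_t dstsz, const char *src,
                      const char *fallback)
{
    if (dstsz == 0) return;
    if (src == NULL) src = fallback;

    strncpy(dst, src, dstsz - 1);   /* copies at most dstsz-1 bytes */
    dst[dstsz - 1] = '\0';          /* strncpy does not terminate on overflow */
}
```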
list Larry Barber · Fri, 9 Jun 2006 17:37:45 -0500 ·
Still crashing, stack trace:
(gdb)
#0  0x00f1260a in do_lookup_versioned () from /lib/ld-linux.so.2
#1  0x00f11776 in _dl_lookup_versioned_symbol_internal () from /lib/ld-linux.so.2
#2  0x00f15473 in fixup () from /lib/ld-linux.so.2
#3  0x00f15330 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#4  0x08054cad in sigsegv_handler (signum=11) at sig.c:51
#5  <signal handler called>
#6  0x00b8f657 in strlen () from /lib/csa/sse2/sse2_boost.so.1
#7  0x00a8ac19 in OK_BOD_strncpy () from /lib/csa/libcsa.so.6
#8  0x00a883c0 in strncpy () from /lib/csa/libcsa.so.6
#9  0x0804a93c in create_and_update_rrd (hostname=0x7 <Address 0x7 out of
bounds>, fn=0x63 <Address 0x63 out of bounds>, creparams=0x805e5c0,
template=0x8fa8340 "sec")
    at do_rrd.c:150
#10 0x0804f2d3 in do_net_rrd (hostname=0xb7560036 "FS_PVHOST",
testname=0xb7560040 "http",
    msg=0xb756006e "status FS_PVHOST.http green Fri Jun  9 17:34:51 2006: OK
; OK ; OK\n\n&green http://poc.fs.usda.gov/wps/portal - OK\n\nHTTP/1.1 200
OK\r\nDate: Fri, 09 Jun 2006 22:35:15 GMT\r\nServer:
IBM_HTTP_Server/2.0.47."..., tstamp=1149892521) at rrd/do_net.c:50
#11 0x0805028a in update_rrd (hostname=0xb7560036 "FS_PVHOST",
testname=0xb7560040 "http",
    msg=0xb756006e "status FS_PVHOST.http green Fri Jun  9 17:34:51 2006: OK
; OK ; OK\n\n&green http://poc.fs.usda.gov/wps/portal - OK\n\nHTTP/1.1 200
OK\r\nDate: Fri, 09 Jun 2006 22:35:15 GMT\r\nServer:
IBM_HTTP_Server/2.0.47."..., tstamp=1149892521, sender=0xffffffc0 <Address
0xffffffc0 out of bounds>, ldef=0xffffffc0) at do_rrd.c:301
#12 0x08049cf0 in main (argc=-64, argv=0xbfffb634) at hobbitd_rrd.c:199


Notice entry #9, it appears that something is munging up the hostname
variable.

Thanks,
Larry Barber



list Henrik Størner · Sat, 10 Jun 2006 00:47:57 +0200 ·
On Fri, Jun 09, 2006 at 05:37:45PM -0500, Larry Barber wrote:
Still crashing, stack trace:
quoted from Larry Barber
#8  0x00a883c0 in strncpy () from /lib/csa/libcsa.so.6
#9  0x0804a93c in create_and_update_rrd (hostname=0x7 <Address 0x7 out of
bounds>, fn=0x63 <Address 0x63 out of bounds>, creparams=0x805e5c0,
template=0x8fa8340 "sec")
   at do_rrd.c:150
It still crashes while handling the data we get from the rrd_get_error()
routine.

I had a look at the rrdtool sources, and this crash doesn't make sense.
rrd_get_error() returns a static buffer, so it should not be able to crash.
quoted from Larry Barber
Notice entry #9, it appears that something is munging up the hostname
variable.
Most likely, it is just a memory scribble that hits part of the stack
as a result of the real error.

It's too late for me to do more about it now, but I would like to take
a closer look at this. If you could tar up the 4.1.2p1 build directory,
including the hobbitd_rrd binary and the core file, and mail it to me,
I'll have a look at it in the morning when I'm a bit more awake.


Regards,
Henrik
list Henrik Størner · Sat, 10 Jun 2006 00:51:29 +0200 ·
quoted from Larry Barber
On Fri, Jun 09, 2006 at 05:37:45PM -0500, Larry Barber wrote:
Still crashing, stack trace:
As a final (desperate) fix, change the code to avoid printing the
error message. I.e. around line 143 in hobbitd/do_rrd.c, add a
line after
    result = rrd_create(pcount, creparams);
with
    if (result != 0) return 1;


If that doesn't crash, then I'm really suspicious of your rrdtool
library. What version is that, by the way ?


Regards,
Henrik
list Henrik Størner · Sat, 10 Jun 2006 00:57:02 +0200 ·
quoted from Larry Barber
On Fri, Jun 09, 2006 at 05:37:45PM -0500, Larry Barber wrote:
Still crashing, stack trace:
I just remembered: Check the size of your Hobbit logfiles, especially
the "rrd-status.log" file.

The current Hobbit versions do not have large file support, so if the
log gets around 2 GB, printing anything to the logfile will cause an I/O
error, and this has been seen to crash programs. That would explain why
this happens when it tries to print an error message.
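
A check along these lines could be scripted; the sketch below is only an illustration, and the helper names and the idea of a safety margin are assumptions rather than anything in Hobbit. The limit for a file opened without large file support is 2^31 - 1 bytes.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Largest file size reachable without large file support: 2^31 - 1 bytes. */
#define LFS_LIMIT 2147483647LL

/* Illustrative helper (hypothetical name): returns 1 if size is within
 * margin bytes of the limit, 0 otherwise. */
static int near_2gb_limit(long long size, long long margin)
{
    return size >= LFS_LIMIT - margin;
}

/* Returns 1 if the file at path is close to the limit, 0 if it is not, and
 * -1 if stat() fails. On a build without large file support, stat() may
 * itself fail (EOVERFLOW) for a file that is already too large. */
static int logfile_near_limit(const char *path, long long margin)
{
    struct stat st;

    if (stat(path, &st) == -1) return -1;
    return near_2gb_limit((long long)st.st_size, margin);
}
```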


Regards,
Henrik
list Larry Barber · Fri, 9 Jun 2006 18:26:11 -0500 ·
The larrd-status.log file wasn't unduly large; the logs get rotated daily. I am
using version 1.2.10 of rrdtool, although 1.0.48 is also installed on the
machine.

Even with the (desperate) fix, it is still coring.

Where should I mail those files to? I assume you don't want them on the
mailing list.

Thanks,
Larry Barber


list Henrik Størner · Sat, 10 Jun 2006 08:16:58 +0200 ·
quoted from Larry Barber
On Fri, Jun 09, 2006 at 06:26:11PM -0500, Larry Barber wrote:
The larrd-status.log file wasn't unduly large, they get rotated daily. I am
using version 1.2.10 of rrdtool, although 1.0.48 is also installed on the
machine.

Even with the (desperate) fix, it is still coring.

Where should I mail those files to? I assume you don't want them on the
mailing list.
My direct mail address, user-ce4a2c883f75@xymon.invalid


Regards,
Henrik