
Devmon causing core dumps

Buchan Milne
Fri, 31 Oct 2008 12:15:27 +0200
Message-Id: <user-4916d60c0049@xymon.invalid>

On Friday 31 October 2008 05:51:42 Everett, Vernon wrote:
Hi all

Devmon was causing the hobbitd_rrd module to crash and burn.
Now this could be a bug, but it could also be a PEBKAC. I am hoping
somebody can assist either way.

I added a Cisco 2851 to Hobbit, using devmon.
Now here is the possible PEBKAC
Since Devmon doesn't have templates for the 2851, I used the template for
the Cisco 2811. (Network guru told me they are pretty much the same, except
for a few extra bells and whistles on the 2851.)

The data for the device started appearing in Hobbit, and all looked good.
Devmon even created the rrd files for the new Cisco device.

However, the hobbitd_rrd module started core dumping, and the Hobbit server
page started displaying red for hobbitd_rrd with the crash detected
message. See core data below.
Took the new Cisco device out of Hobbit, and cores stopped, and life was
good again.

Is there a significant enough difference between the 2851 and the 2811 to
cause this, or are we looking at a genuine bug?

Real bug. I see it on the temperature tests on a new IOS.

I am leaning towards a bug,
because even if the collected data was complete rubbish, should it cause
the module to core?

Regards
     Vernon

My Linux guy reckons this is the important stuff from the core.
uname -a
Linux las006 2.6.18-92.1.1.el5 #1 SMP Thu May 22 09:01:47 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
cat /etc/redhat-release
Red Hat Enterprise Linux Client release 5.2 (Tikanga)

gdb -c core.8550 /usr/lib/hobbit/server/bin/hobbitd_rrd
GNU gdb Red Hat Linux (6.5-37.el5_2.1rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/libthread_db.so.1".

Reading symbols from /usr/lib64/librrd.so.2...done.
Loaded symbols for /usr/lib64/librrd.so.2
Reading symbols from /usr/lib64/libpng12.so.0...done.
Loaded symbols for /usr/lib64/libpng12.so.0
Reading symbols from /lib64/libpcre.so.0...done.
Loaded symbols for /lib64/libpcre.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /usr/lib64/libfreetype.so.6...done.
Loaded symbols for /usr/lib64/libfreetype.so.6
Reading symbols from /usr/lib64/libz.so.1...done.
Loaded symbols for /usr/lib64/libz.so.1
Reading symbols from /usr/lib64/libart_lgpl_2.so.2...done.
Loaded symbols for /usr/lib64/libart_lgpl_2.so.2
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `hobbitd_rrd --rrddir=/var/lib/hobbit/rrd --debug'.
Program terminated with signal 6, Aborted.
#0  0x0000003db7a30155 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003db7a30155 in raise () from /lib64/libc.so.6
#1  0x0000003db7a31bf0 in abort () from /lib64/libc.so.6
#2  0x00000000004119f3 in sigsegv_handler (signum=<value optimized out>) at sig.c:57
#3  <signal handler called>
#4  0x0000003db7a77ac0 in strcat () from /lib64/libc.so.6
#5  0x000000000040462a in do_devmon_rrd (hostname=0x2ada311e2806 "PERIR205", testname=0x2ada311e280f "if_load", msg=<value optimized out>, tstamp=<value optimized out>) at rrd/do_devmon.c:87
#6  0x000000000040b656 in update_rrd (hostname=0x2ada311e2806 "PERIR205", testname=0x2ada311e280f "if_load", msg=0x2ada311e2842 "status PERIR205.if_load green Fri Oct 31 10:31:39 2008", tstamp=1225416699, sender=<value optimized out>, ldef=0xfeffffffffffff00) at do_rrd.c:372
#7  0x000000000040261d in main (argc=<value optimized out>, argv=0x7fff7a088318) at hobbitd_rrd.c:153
(gdb)

Could you show the Devmon RRD section of the message for the if_load test on 
the PERIR205 host? I can confirm the cause, and maybe offer a workaround.

I can reproduce the issue consistently on my workstation against the new IOS
that triggers it. I have a workaround in place in production, and was hoping
to get around to fixing this next week.
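
For illustration only: frame #4 of the backtrace shows the fault inside
strcat(), called from do_devmon_rrd at rrd/do_devmon.c:87. That source line is
not reproduced in this thread, so what follows is a minimal sketch, assuming
the crash comes from appending to a fixed-size buffer without a length check.
The helper name append_bounded and its return convention are hypothetical and
are not part of Hobbit or Devmon.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (an assumption, not code from do_devmon.c):
 * append src to the NUL-terminated string in dst without ever writing
 * past dstsize bytes. Returns 0 on success, or -1 if src does not fit,
 * in which case dst is left unchanged. */
static int append_bounded(char *dst, size_t dstsize, const char *src)
{
    size_t used = strlen(dst);   /* bytes already in dst, excluding NUL */
    size_t need = strlen(src);   /* bytes to append, excluding NUL */

    /* Room is needed for 'need' bytes plus the terminating NUL. */
    if (used >= dstsize || need >= dstsize - used) {
        return -1;               /* refuse rather than overflow */
    }
    memcpy(dst + used, src, need + 1);  /* +1 copies the terminating NUL */
    return 0;
}
```

With a check like this, a caller can skip or log an oversized input and carry
on, rather than letting one malformed message abort the whole module.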

Regards,
Buchan