On Friday, 21 August 2009 00:42:59 David Baldwin wrote:
user-c15424b7e83a@xymon.invalid wrote:
Hi Buchan,
We get a core dump, running a pstack gives the following info:
core 'core' of 11142: hobbitd_rrd --rrddir=/export/home/hobbit/data/rrd
fed28a17 _lwp_kill (1, 6) + 7
fecd1d63 raise (6) + 1f
fecb1bad abort (806fe88, fecd55f6, 8768eb0, 806a6ca, fed901c0, 0) + cd
08060291 xstrdup (0, 806a6ca, 87d9d1c, 8081cc0, 84ed451, 0) + 31
0805bf7c do_netapp_extratest_rrd (84ec4ff, 806af10, 84ec8fa, 4a8b1bbf, 8081a00, 8081cc0) + 200
0805c1c9 do_netapp_extrastats_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 4a8b1bbf) + e1
0805e0ea update_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 0) + 7d6
08054044 main (2, 804613c, 8046148) + 4dc
080539fc _start (2, 8046484, 8046490, 0, 80464b6, 80464f6) + 80
OK, so it crashed in do_netapp_extratest_rrd from hobbitd/rrd/do_netapp.c.
I'm not familiar with pstack, but this looks like it may be from a stripped
binary (otherwise, you may be able to get more information out of pstack).
If pstack can't show the values, then you may want to consider running
hobbitd_rrd with the --debug flag, which should result in some logging of what
it has received just before it crashes.
That looks like you are running extratest for a netapp, which, from a
cursory glance at hobbitd/do_rrd.c, is what handles the xtstats column
reported by netapp.pl. I don't use it myself, so you really need to look
at the C code to check that it's doing the right thing. You have two
choices: the quick fix is to disable just that test in netapp.pl; the
other option is to work out what format the data should be in and fix
the test.
In 4.2.3 for example, the do_devmon.c RRD code doesn't actually
implement what is documented
What is not implemented?
Where do you see this documented?
There is one fix that I have committed in svn (Xymon 4.2 branch, Xymon 4.3
branch, devmon svn). I am not aware of any other requests or bugs filed on the
devmon rrd collector.
and I use a perl script with --extra-script
instead
Is this the one shipped with devmon, or would you like to contribute a better
one?
Various RRD handlers are in hobbitd/rrd/do_*.c
Looking at the code for xstrdup in lib/memory.c (below), you should
check your logs: it's probably being called with a NULL pointer (it's
unlikely that you're out of memory), and the logs should tell you which.
char *xstrdup(const char *s)
{
        char *result;

        if (s == NULL) {
                errprintf("xstrdup: Cannot dup NULL string\n");
                abort();
        }

        result = strdup(s);
        if (result == NULL) {
                errprintf("xstrdup: Out of memory\n");
                abort();
        }

#ifdef MEMORY_DEBUG
        add_to_memlist(result, strlen(result)+1);
#endif

        return result;
}
xstrdup is called twice in do_netapp_extratest_rrd, but seeing the string that
it's aborting on would help narrow it down. If you can provide the status
message that made hobbitd_rrd crash (retrieve it using: bb localhost
'hobbitdlog hostname.testname') it can be used to reproduce this by someone
trying to fix the bug.
Note that as of 5.30pm today rrd-status.log is 127MB, full of errors
spanning 607625 lines (this is just for today; we roll the logs each
night). This seems abnormally large to me, and I think this is what
eventually crashes the server.
It is still unlikely that this has anything to do with hobbitd_rrd crashing.
Regards,
Buchan