Xymon Mailing List Archive

RRD crashing high availability hobbit

list Buchan Milne
Fri, 21 Aug 2009 14:27:06 +0100
Message-Id: <user-125228fb241f@xymon.invalid>

On Friday, 21 August 2009 00:42:59 David Baldwin wrote:
user-c15424b7e83a@xymon.invalid wrote:
Hi Buchan,

We get a core dump, running a pstack gives the following info:

core 'core' of 11142:   hobbitd_rrd --rrddir=/export/home/hobbit/data/rrd
 fed28a17 _lwp_kill (1, 6) + 7
 fecd1d63 raise    (6) + 1f
 fecb1bad abort    (806fe88, fecd55f6, 8768eb0, 806a6ca, fed901c0, 0) + cd
 08060291 xstrdup  (0, 806a6ca, 87d9d1c, 8081cc0, 84ed451, 0) + 31
 0805bf7c do_netapp_extratest_rrd (84ec4ff, 806af10, 84ec8fa, 4a8b1bbf, 8081a00, 8081cc0) + 200
 0805c1c9 do_netapp_extrastats_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 4a8b1bbf) + e1
 0805e0ea update_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 0) + 7d6
 08054044 main     (2, 804613c, 8046148) + 4dc
 080539fc _start   (2, 8046484, 8046490, 0, 80464b6, 80464f6) + 80

OK, so it crashed in do_netapp_extratest_rrd, from hobbitd/rrd/do_netapp.c.
I'm not familiar with pstack, but it looks like this may be from a stripped
binary (or you may be able to get more information out of pstack).

If pstack can't show the values, then you may want to consider running 
hobbitd_rrd with the --debug flag, which should result in some logging of what 
it has received just before it crashes.

That looks like you are running the extratest for a netapp which, from what
I can see in hobbitd/do_rrd.c, is what handles the xtstats column reported
by netapp.pl. That is just from a cursory glance at the code; I don't use it
myself. You really need to look at the C code to check that it's doing the
right thing. You have two choices: the quick fix is to disable just that
test in netapp.pl; the other option is to work out what format the data
should be in and fix the test.

In 4.2.3, for example, the do_devmon.c RRD code doesn't actually
implement what is documented

What is not implemented?

Where do you see this documented?

There is one fix that I have committed in svn (Xymon 4.2 branch, Xymon 4.3 
branch, devmon svn). I am not aware of any other requests or bugs filed on the 
devmon rrd collector.

and I use a perl script with --extra-script
instead

Is this the one shipped with devmon, or would you like to contribute a better
one?

Various RRD handlers are in hobbitd/rrd/do_*.c
Looking at the code for xstrdup in lib/memory.c (below), you should
check your logs: it's probably being called with a NULL pointer
(it's unlikely that you're out of memory), and the logs should tell you which.

char *xstrdup(const char *s)
{
        char *result;

        if (s == NULL) {
                errprintf("xstrdup: Cannot dup NULL string\n");
                abort();
        }

        result = strdup(s);
        if (result == NULL) {
                errprintf("xstrdup: Out of memory\n");
                abort();
        }

#ifdef MEMORY_DEBUG
        add_to_memlist(result, strlen(result)+1);
#endif

        return result;
}

xstrdup is called twice in do_netapp_extratest_rrd, but seeing the string that
it's aborting on would help narrow it down. If you can provide the status
message that made hobbitd_rrd crash (retrieve it using: bb localhost
'hobbitdlog hostname.testname'), someone trying to fix the bug can use it to
reproduce the crash.
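
The xstrdup frame in the pstack output above lists 0 as its first argument, which appears consistent with the NULL-pointer branch of xstrdup rather than the out-of-memory branch. As an illustration only, here is a minimal sketch of how a caller could check a parsed field before duplicating it. The helper name dup_value_or_null and the "name: value" line format are assumptions made for this example; they are not taken from do_netapp.c.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of Xymon: copy the text that follows the
 * first ':' in a "name: value" line. Returns NULL (rather than aborting)
 * when the line is NULL or has no ':' separator, or when malloc fails.
 * The caller owns the returned string and must free() it. */
char *dup_value_or_null(const char *line)
{
    const char *sep;
    char *copy;

    if (line == NULL) return NULL;

    sep = strchr(line, ':');
    if (sep == NULL) return NULL;   /* malformed line: nothing to duplicate */

    sep++;                          /* step past the ':' */
    while (*sep == ' ' || *sep == '\t') sep++;  /* skip leading blanks */

    copy = malloc(strlen(sep) + 1);
    if (copy != NULL) strcpy(copy, sep);
    return copy;
}
```

A caller using this pattern would skip (and ideally log) any line for which the helper returns NULL, so that one unexpected status message does not take down the whole RRD worker.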

Note that as of 5.30pm today, rrd-status.log is 127MB, full of errors
spanning 607625 lines (this is just for today; we roll the logs each
night). This seems abnormally large to me, and I think this is what
eventually crashes the server.

It is still unlikely that this has anything to do with hobbitd_rrd crashing.

Regards,
Buchan