Xymon Mailing List Archive

Weird disk alert with bad data

5 messages in this thread

list Martha McConaghy · Tue, 28 Oct 08 12:54:58 EDT ·
We recently got the AIX client working with our Hobbit server.  I then
had to apply a patch to rrd/do_vmstat.c to fix a problem with rrd crashing
due to an uninitialized variable coming from the AIX client.  Despite that,
I'm still seeing a weird problem.  One of the other, non-AIX clients will have
its disk check go to red alert.  When I take a look at it, the disks are
fine.  However, the data being processed by rrd is off by a few characters,
which seems to be what is causing the red alert to be generated.  It will last
for an hour or so, then go green again, and the problem will move to a
different non-AIX client.  When I remove the three AIX clients from bb-hosts,
the problem disappears.  So it seems to be pretty clearly related to
the AIX client, though it is affecting other clients' alerts.

Any thoughts on what to do?  Have we stumbled onto another bug?

Martha
list Stef Coene · Tue, 28 Oct 2008 21:10:17 +0100 ·
What patch did you apply for the rrd?

I have lots of AIX clients talking to lots of hobbit servers and I never had a
problem with the rrds.  The only patch I applied regarding vmstat adds
cpu_pc and cpu_ec and strips . and , from the numbers.

My vmstat patch:

--- ./hobbit-4.2.0/hobbitd/rrd/do_vmstat.c   2006-08-09 22:10:06.000000000 +0200
+++ ./hobbit-4.2.0-OK/hobbitd/rrd/do_vmstat.c   2007-03-13 11:40:39.000000000 +0100
@@ -76,6 +76,8 @@
   { 14, "cpu_sys" },
   { 15, "cpu_idl" },
   { 16, "cpu_wait" },
+  { 17, "cpu_pc" },
+  { 18, "cpu_ec" },
   { -1, NULL }
 };

@@ -322,6 +324,17 @@
   p = strchr(datapart, '\n'); if (p) *p = '\0';
   p = strtok(datapart, " "); datacount = 0;
   while (p && (datacount < MAX_VMSTAT_VALUES)) {
+      /* Removing . and , from the numbers */
+      char *p1;
+      while ( (p1 = strchr(p,'.')) != NULL ) {
+         strcpy (p1, p1+1) ;
+      }
+      char *p2;
+      while ( (p2 = strchr(p,',')) != NULL ) {
+         strcpy (p2, p2+1) ;
+      }
      values[datacount++] = atoi(p);
      p = strtok(NULL, " ");
   }
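
A note on the stripping step in the patch: it shifts the string left with strcpy() on overlapping source and destination, which the C standard leaves undefined, even though it often works in practice. Below is a minimal sketch of the same idea done in a single read/write pass. The names strip_separators and parse_vmstat_token are illustrative only and are not part of Hobbit.

```c
#include <stdlib.h>
#include <string.h>

/* Remove every '.' and ',' from s in place and return s.
 * One pass with separate read and write positions, so no
 * overlapping copy is ever requested from the library. */
static char *strip_separators(char *s)
{
    char *src = s;
    char *dst = s;

    for (; *src != '\0'; src++) {
        if (*src != '.' && *src != ',') {
            *dst++ = *src;
        }
    }
    *dst = '\0';
    return s;
}

/* Hypothetical helper: convert one vmstat token to an int after
 * stripping, mirroring what the patched loop does with atoi(). */
static int parse_vmstat_token(char *token)
{
    return atoi(strip_separators(token));
}
```

As in the patch, a token such as "0.52" ends up as the integer 52, so fractional columns are stored scaled rather than truncated to 0.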


Stef
list Martha McConaghy · Wed, 29 Oct 08 13:05:52 EDT ·
Thanks, Stef.  That patch seems to have resolved the problem.

Martha
list Craig Cook · Thu, 30 Oct 2008 16:27:32 -0400 ·
Speaking of weird disk alerts...

I have been getting alarms like this lately, on various Solaris hosts at random times...


Thu

Oct 30]16:06:19 EDT 2008 - Filesystems NOT ok
red 6676762 3552136    66%    / (10332220% used) has reached the PANIC level (95%)
red 1622387 2474913    40%    /var (4138686% used) has reached the PANIC level (95%)
red 60080034 32598889    65%    /export (93615073% used) has reached the PANIC level (95%)
red 1320 58296888     1%    /etc/svc/volatile (58298208% used) has reached the PANIC level (95%)
red 16 58296888     1%    /tmp (58296904% used) has reached the PANIC level (95%)

Filesystem            kbytes    use
   ]vail capacity  Mounted on
/dev/md/dsk/d0       10332220 6676762 3552136    66%    /
/dev/md/dsk/d4       4138686 1622387 2474913    40%    /var
/dev/md/dsk/d7       93615073 60080034 32598889    65%    /export
swap                 58298208    1320 58296888     1%    /etc/svc/volatile
swap                 58296904      16 58296888     1%    /tmp


I suspected a corrupt "df" command because of the "]" showing up.  However, the first line, where the hostname should be, is corrupted as well.

This usually only happens once; the next disk check reports green as usual, with correct data.
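
In the alert above, the figure reported as "% used" for each filesystem equals the kbytes column of the same df line (for example 10332220 for /), which suggests the columns were misaligned when the report was parsed. That is an inference from the output, not a confirmed diagnosis. As a rough sketch of a sanity check on such data, assuming the Solaris "df -k" layout shown above and treating anything outside 0 to 100 as implausible (an assumption made for this sketch), a line could be validated like this. The function df_capacity_percent is hypothetical and is not Hobbit code.

```c
#include <stdio.h>

/* Parse one Solaris-style "df -k" data line:
 *   filesystem  kbytes  used  avail  capacity%  mountpoint
 * Returns the capacity percentage (0..100), or -1 if the line
 * does not match that layout or the value is implausible. */
static int df_capacity_percent(const char *line)
{
    char fs[256];
    char mount[256];
    unsigned long long kbytes, used, avail;
    int pct;
    char percent_sign;

    /* All seven fields must convert, in this order. */
    if (sscanf(line, "%255s %llu %llu %llu %d%c %255s",
               fs, &kbytes, &used, &avail,
               &pct, &percent_sign, mount) != 7) {
        return -1;
    }
    if (percent_sign != '%') {
        return -1;
    }
    /* Assumption for this sketch: reject anything outside 0..100. */
    if (pct < 0 || pct > 100) {
        return -1;
    }
    return pct;
}
```

A caller could skip alerting on any line for which this returns -1 and wait for the next report instead of raising a red status on data that does not parse.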

Using Hobbit Monitor 4.3.0-0.20080403

Craig
list Martha McConaghy · Thu, 30 Oct 08 16:54:24 EDT ·
Craig,

That is exactly the same error I'm seeing.

Martha