Weird disk alert with bad data
list Martha McConaghy
We recently got the AIX client working with our Hobbit server. I then had to apply a patch to rrd/do_vmstat.c to fix a problem with rrd crashing due to an uninitialized variable coming from the AIX client.
Despite that, I'm still seeing a weird problem. One of the other non-AIX clients will have its disk check go to red alert. When I take a look at it, the disks are fine. However, the data being processed by rrd is off by a few characters, which seems to be what is causing the red alert to be generated. It will last for an hour or so, then will go green again and the problem will move to a different non-AIX client.
When I remove the three AIX clients from bb-hosts, the problem disappears. So it seems to be pretty clearly related to the AIX client, though it is affecting other alerts.
Any thoughts on what to do? Have we stumbled onto another bug?
Martha
list Stef Coene
On Tuesday 28 October 2008, Martha McConaghy wrote:
We recently got the AIX client working with our Hobbit server. I then had to apply a patch to rrd/do_vmstat.c to fix a problem with rrd crashing due to an uninitialized variable coming from the AIX client. Despite that, I'm still seeing a weird problem. One of the other non-AIX clients will have its disk check go to red alert. When I take a look at it, the disks are fine. However, the data being processed by rrd is off by a few characters, which seems to be what is causing the red alert to be generated. It will last for an hour or so, then will go green again and the problem will move to a different non-AIX client. When I remove the three AIX clients from bb-hosts, the problem disappears. So it seems to be pretty clearly related to the AIX client, though it is affecting other alerts. Any thoughts on what to do? Have we stumbled onto another bug?
What patch did you apply for the rrd?
I have lots of AIX clients talking to lots of hobbit servers and I never had a
problem with the rrds. The only patch I applied regarding vmstat is adding
cpu_pc and cpu_ec and stripping the . and , from the numbers.
My vmstat patch:
--- ./hobbit-4.2.0/hobbitd/rrd/do_vmstat.c 2006-08-09 22:10:06.000000000 +0200
+++ ./hobbit-4.2.0-OK/hobbitd/rrd/do_vmstat.c 2007-03-13 11:40:39.000000000 +0100
@@ -76,6 +76,8 @@
{ 14, "cpu_sys" },
{ 15, "cpu_idl" },
{ 16, "cpu_wait" },
+ { 17, "cpu_pc" },
+ { 18, "cpu_ec" },
{ -1, NULL }
};
@@ -322,6 +324,17 @@
p = strchr(datapart, '\n'); if (p) *p = '\0';
p = strtok(datapart, " "); datacount = 0;
while (p && (datacount < MAX_VMSTAT_VALUES)) {
+
+ /* Removing . and , from the numbers */
+ char *p1;
+ while ( (p1 = strchr(p,'.')) != NULL ) {
+ strcpy (p1, p1+1) ;
+ }
+ char *p2;
+ while ( (p2 = strchr(p,',')) != NULL ) {
+ strcpy (p2, p2+1) ;
+ }
+
 values[datacount++] = atoi(p);
p = strtok(NULL, " ");
}
Stef
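A note on the separator stripping in the patch above: it calls strcpy with source and destination inside the same buffer, and the C standard leaves strcpy undefined when the two objects overlap. It often works in practice, but a single-pass in-place copy avoids relying on that. The following is only a minimal sketch of that idea; strip_separators is a hypothetical helper name and is not part of Hobbit or of Stef's patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of Hobbit): remove every '.' and ','
 * from s in place. Each kept character is copied at most once, from a
 * read position to a write position that never runs ahead of it, so no
 * overlapping strcpy is needed. */
static void strip_separators(char *s)
{
    char *src;
    char *dst;

    assert(s != NULL);
    src = s;
    dst = s;
    while (*src != '\0') {
        if (*src != '.' && *src != ',') {
            *dst++ = *src;
        }
        src++;
    }
    *dst = '\0';
}

/* Hypothetical wrapper: strip the separators from one token, then
 * convert it the same way the patched loop does (with atoi). */
static int parse_vmstat_token(char *token)
{
    strip_separators(token);
    return atoi(token);
}
```

With a writable token such as "1,234.56", parse_vmstat_token would leave "123456" in the buffer and return 123456, which matches what the patch intends for vmstat values printed with separators.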
list Martha McConaghy
Thanks, Stef. That patch seems to have resolved the problem. Martha
list Craig Cook
Speaking of weird disk alerts...
I have been getting alarms like this lately, on various Solaris hosts at random times...

Thu Oct 30]16:06:19 EDT 2008 - Filesystems NOT ok
red 6676762 3552136 66% / (10332220% used) has reached the PANIC level (95%)
red 1622387 2474913 40% /var (4138686% used) has reached the PANIC level (95%)
red 60080034 32598889 65% /export (93615073% used) has reached the PANIC level (95%)
red 1320 58296888 1% /etc/svc/volatile (58298208% used) has reached the PANIC level (95%)
red 16 58296888 1% /tmp (58296904% used) has reached the PANIC level (95%)

Filesystem kbytes use ]vail capacity Mounted on
/dev/md/dsk/d0 10332220 6676762 3552136 66% /
/dev/md/dsk/d4 4138686 1622387 2474913 40% /var
/dev/md/dsk/d7 93615073 60080034 32598889 65% /export
swap 58298208 1320 58296888 1% /etc/svc/volatile
swap 58296904 16 58296888 1% /tmp

I suspected it was a corrupt "df" command, due to the "]" showing up. The first line, where the hostname should be, is corrupted as well though.
This usually only happens once; the next disk check reports green as usual with correct data.
Using Hobbit Monitor 4.3.0-0.20080403
Craig
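One detail in the alert above: each reported percentage (10332220%, 4138686%, and so on) equals the kbytes column of the matching df row, so the value seems to have been read one column off. One plausible cause, offered only as a guess, is that the damaged header line shifted the column lookup. The sketch below is not Hobbit's parser. It only illustrates a defensive check that accepts a usage figure when the field ends in '%' and lies between 0 and 100; df_capacity_percent is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not Hobbit code): return the usage percentage
 * found on one df data line, or -1 if the line contains no field of
 * the form "NN%" with NN in the range 0..100. */
static int df_capacity_percent(const char *line)
{
    char buf[512];
    char *tok;

    assert(line != NULL);

    /* strtok modifies its input, so work on a bounded local copy. */
    strncpy(buf, line, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (tok = strtok(buf, " \t"); tok != NULL; tok = strtok(NULL, " \t")) {
        size_t len = strlen(tok);

        if (len >= 2 && tok[len - 1] == '%') {
            char *end;
            long value = strtol(tok, &end, 10);

            /* The digits must run right up to the trailing '%'. */
            if (end == tok + len - 1 && value >= 0 && value <= 100) {
                return (int)value;
            }
        }
    }
    return -1;
}
```

Given the first df row from the report, this would return 66, while a shifted value such as "10332220%" would be rejected and yield -1 instead of raising a red alert.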
list Martha McConaghy
Craig, That is exactly the same error I'm seeing. Martha