Here I am with some new data, because the problem still exists. I know
that the rrd-data-daemon crashes with the "xstrdup: Cannot dup NULL
string" error. I have set up netapp.pl with $Hobbit_fd_lib::debug = 2;
and found that the sysstat output differs between filers; I don't know
whether that is the real cause of the crash.
orwell:/usr/lib/hobbit/server/ext # cat /var/lib/hobbit/tmp/netapp.sysstat.DEBUG.camelot
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s
Cache Cache CP CP Disk FCP iSCSI FCP kB/s iSCSI kB/s
in out read write read write
age hit time ty util in out in out
29% 0 7976 0 7976 3147 5098 3872 3104 0 0
3 96% 12% T 8% 0 0 0 0 0 0
orwell:/usr/lib/hobbit/server/ext # cat /var/lib/hobbit/tmp/netapp.sysstat.DEBUG.noah
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s
Cache Cache CP CP Disk FCP iSCSI FCP kB/s
in out read write read write
age hit time ty util in out
8% 0 0 0 0 1 6 986 1988 0 0
60 100% 13% T 9% 0 0 0 0
The other files (e.g. /var/lib/hobbit/tmp/netapp.xtstats.DEBUG.camelot)
also show changed output: what is now the beginning of the file was
previously the end. So now it begins with:
system:system:nfs_ops:3190/s
system:system:cifs_ops:0/s
system:system:http_ops:0/s
system:system:fcp_ops:0/s
system:system:iscsi_ops:0/s
system:system:read_ops:619/s
system:system:write_ops:144/s
system:system:net_data_recv:4187KB/s
system:system:net_data_sent:23328KB/s
system:system:disk_data_read:5493KB/s
system:system:disk_data_written:6156KB/s
system:system:cpu_busy:10%
system:system:avg_processor_busy:10%
system:system:total_processor_busy:20%
system:system:num_processors:2
system:system:time:1244021254s
system:system:uptime:1048085s
disk:2000001D:38B5ED6F:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:total_transfers:8/s
disk:2000001D:38B5ED6F:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:user_read_chain:3.60
Whereas on our pre-7.3.1.1 filers, the output is:
.....
disk:6BE7CF95:56AFA883:F30CAEF5:83103FAC:00000000:00000000:00000000:00000000:00000000:00000000:guarenteed_read_latency:0us
disk:6BE7CF95:56AFA883:F30CAEF5:83103FAC:00000000:00000000:00000000:00000000:00000000:00000000:guarenteed_read_blocks:0/s
disk:6BE7CF95:56AFA883:F30CAEF5:83103FAC:00000000:00000000:00000000:00000000:00000000:00000000:guarenteed_write_latency:0us
disk:6BE7CF95:56AFA883:F30CAEF5:83103FAC:00000000:00000000:00000000:00000000:00000000:00000000:guarenteed_write_blocks:0/s
disk:6BE7CF95:56AFA883:F30CAEF5:83103FAC:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:0%
system:system:nfs_ops:0/s
system:system:cifs_ops:0/s
system:system:http_ops:0/s
system:system:dafs_ops:0/s
system:system:fcp_ops:0/s
system:system:iscsi_ops:0/s
system:system:net_data_recv:13KB/s
system:system:net_data_sent:47KB/s
system:system:disk_data_read:986KB/s
system:system:disk_data_written:1988KB/s
system:system:cpu_busy:8%
system:system:avg_processor_busy:5%
system:system:total_processor_busy:10%
system:system:num_processors:2
system:system:time:1244021255s
system:system:uptime:7436873s
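
To make the difference between the two outputs easier to see, here is a
minimal Python sketch that compares the counter names in two xtstats dumps.
It is not part of netapp.pl or Xymon; the function names parse_xtstats and
diff_counter_names are made up for this mail, and the two samples contain
only a few of the "system" lines quoted above.

```python
def parse_xtstats(text):
    """Map (object, counter) -> value for 'object:instance:counter:value' lines.

    The instance part may itself contain colons (the disk IDs do), so the
    counter and value are taken from the right-hand end of each line.
    """
    counters = {}
    for line in text.splitlines():
        parts = line.strip().split(":")
        if len(parts) < 4:
            continue  # skip blank, elided ("....") or truncated lines
        counters[(parts[0], parts[-2])] = parts[-1]
    return counters


def diff_counter_names(old_text, new_text):
    """Return (removed, added) counter names between two dumps."""
    old = set(parse_xtstats(old_text))
    new = set(parse_xtstats(new_text))
    return sorted(old - new), sorted(new - old)


# A few "system" lines copied from the two outputs in this mail.
old_sample = """\
system:system:nfs_ops:0/s
system:system:dafs_ops:0/s
system:system:fcp_ops:0/s
"""
new_sample = """\
system:system:nfs_ops:3190/s
system:system:fcp_ops:0/s
system:system:read_ops:619/s
system:system:write_ops:144/s
"""

removed, added = diff_counter_names(old_sample, new_sample)
print("removed:", removed)
print("added:  ", added)
```

For the system counters quoted above, this reports dafs_ops as gone and
read_ops and write_ops as new. Whether one of those is what leads to the
NULL string in the rrd module is only a guess at this point.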
2009/5/30 Peter Welter <user-f55666bd0d1e@xymon.invalid>:
Addendum:
After turning off 'netapp.pl' for all filers and selectively turning it on
again, it appears that there are no problems with Data ONTAP 7.2.3 and
7.2.4. The error does not show up, and all trending (including other
data-dependent trending) shows no holes anymore.
But these 7.3.1.1 filers are very important, so I have to turn the
monitoring on again for this NetApp cluster. I will see whether debugging
the Perl script gives more relevant data.
2009/5/29 Peter Welter <user-f55666bd0d1e@xymon.invalid>:
Hello all,
Last Friday, May 22, at 8:20 we finished upgrading our NetApp filers
(from version 7.2.3 to 7.3.1.1). These filers were (and are)
monitored by Xymon in combination with the perl-netapp-client.
Together, a great combo!
However, since the upgrade I keep getting this error in
/var/log/hobbit/rrd-data.log:
...
2009-05-22 08:22:00 xstrdup: Cannot dup NULL string
2009-05-22 08:22:00 Worker process died with exit code 6, terminating
....
This error appears every 5 minutes.
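
One way to check the interval is to pull the timestamps of the "Worker
process died" lines out of the log. The following is only a rough Python
sketch; the function name crash_intervals is made up, and in the sample
only the 08:22:00 lines come from the real log, while the 08:27:00 lines
are invented for illustration.

```python
from datetime import datetime


def crash_intervals(log_text, marker="Worker process died"):
    """Return the seconds between consecutive log lines containing marker."""
    stamps = []
    for line in log_text.splitlines():
        line = line.strip()
        if marker not in line:
            continue
        try:
            # Timestamps look like "2009-05-22 08:22:00" (19 characters).
            stamps.append(datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S"))
        except ValueError:
            continue  # line does not start with a timestamp
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


# The 08:27:00 entries below are hypothetical, added only for the example.
sample_log = """\
2009-05-22 08:22:00 xstrdup: Cannot dup NULL string
2009-05-22 08:22:00 Worker process died with exit code 6, terminating
2009-05-22 08:27:00 xstrdup: Cannot dup NULL string
2009-05-22 08:27:00 Worker process died with exit code 6, terminating
"""

intervals = crash_intervals(sample_log)
print(intervals)
```

Run against the real /var/log/hobbit/rrd-data.log contents, a list of values
close to 300 would confirm that the crash follows the 5-minute cycle.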
Only one graph type is no longer trended since the upgrade: the
xtstats column, which delivers all statistics about each drive in the
filer (about 20 graphs). Sometimes it does trend some data, but
only for a very short time, say 5 or 15 minutes. Then for
hours, nothing.
One filer has not been upgraded, but shows the same lack of trending.
That may be because I have set it up with MultiThreading
(something that can be set using a parameter).
I will now change this to 1 (a separate process for each filer) to see
whether the problem can be narrowed down, and I will post an update
later this weekend.
Regards,
Peter
PS I do not know whether this has to do with Xymon or with netapp.pl, but
since it is integrated into the Xymon source (hobbitd/rrd/do_netapp.c)
I think it should be posted here.