RHEL5 and status-board not available bug?

list Buchan Milne
Mon, 16 Feb 2009 15:55:26 +0200
Message-Id: <user-f348eb847559@xymon.invalid>

On Monday 16 February 2009 13:35:51 Flyzone Micky wrote:

On Thu, Feb 12, 2009 at 06:06:48PM +0000, Flyzone Micky wrote:

"really low" as in ... how much ?

Output of iostat command:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.22    0.00    0.91    3.62    0.00   93.26

This is the output of iostat about nfs:
Device:              rBlk_nor/s   wBlk_nor/s   rBlk_dir/s
vnetapp:/vol/hobbit     1631.11       373.97         0.00

wBlk_dir/s   rBlk_svr/s   wBlk_svr/s    rops/s    wops/s
      0.00      1170.83       825.22    840.76    840.76

Unfortunately, this doesn't show anything about how the underlying IO system 
is performing. The load average for this host would be relevant, as well as 
iostat-type data for the NFS server, and any stats available for the actual 
disks.

E.g., 1 NFS bulk operation could translate to 16 IOPS on the "spindle", so you 
could be doing 25000 IOPS, which is quite serious IO (you probably need at 
least 160 fast spindles to manage that). Or, it could translate to less. So, 
you need to check your storage system.
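
A rough sketch of that arithmetic, using the rops/s and wops/s figures from the iostat output above. The 16x amplification factor is only the example figure mentioned here, and the 160 IOPS per spindle value is an assumption chosen for illustration, not a measurement of any real disk:

```python
import math

def backend_iops(rops, wops, amplification):
    """Estimate back-end disk IOPS from NFS ops/s and an assumed amplification factor."""
    return (rops + wops) * amplification

def spindles_needed(iops, iops_per_spindle):
    """Number of disks required if each one sustains iops_per_spindle (assumed)."""
    return math.ceil(iops / iops_per_spindle)

# rops/s and wops/s as reported by iostat for the NFS mount
nfs_rops = 840.76
nfs_wops = 840.76

total = backend_iops(nfs_rops, nfs_wops, amplification=16)  # 16 is the example factor
print(f"estimated back-end IOPS: {total:.0f}")
print(f"spindles at 160 IOPS each (assumed): {spindles_needed(total, 160)}")
```

With a smaller amplification factor the estimate drops proportionally, which is why the real numbers have to come from the storage system itself.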

This last iostat output also includes rsync traffic, because I was
keeping an rsync running to the local disk of the hobbit server.

Unlucky nfsstat doesn't sho

of all the RRD files - takes about 8 minutes. No chance at all
then of keeping up with 5-minute update cycles.

But in that case, wouldn't a warning like this appear (which I don't get)?
WARNING: Runtime 110 longer than BBSLEEP

I really think you should try shutting off the hobbitd_rrd tasks,
just to see what happens.

Maybe I left it out of my last post, but I have already done that, and it
didn't solve the problem.

For hosts to go purple they have to go more than 30 minutes without
an update - they don't go purple just because they miss a single
update.
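
The rule described here could be sketched as follows. The 30-minute threshold and the 5-minute update cycle are the figures quoted in this thread; the function names are hypothetical and this is not Hobbit's actual code:

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(minutes=5)  # update cycle mentioned in this thread
PURPLE_AFTER = timedelta(minutes=30)    # threshold stated above

def missed_updates(last_update, now):
    """Number of full update cycles that have elapsed since the last update."""
    return (now - last_update) // UPDATE_INTERVAL

def is_purple(last_update, now):
    """True only once more than 30 minutes have passed without an update."""
    return now - last_update > PURPLE_AFTER

now = datetime(2009, 2, 16, 14, 0)
late = now - timedelta(minutes=8)   # one cycle missed: not purple
stale = now - timedelta(minutes=31) # past the threshold: purple
print(missed_updates(late, now), is_purple(late, now))
print(missed_updates(stale, now), is_purple(stale, now))
```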

Right... but it doesn't always appear. I also remember an old patch
in the all-in-one about dirty data, but it was already applied.

I suppose you have checked the kernel logs ('dmesg' output) for
anything odd ?

Done, along with all the other system and hobbit logs. There are no
messages that could help.

I'm wondering if maybe you're running out of ports (there's only
64K of them, only about half can be used by normal apps). How
many ports do you have in TIME_WAIT state ?

Ruled out: there are 235-300 ports at most, and I also tried setting
this kernel parameter (as is done for Oracle):
net.ipv4.ip_local_port_range = 1024 65000
but nothing changes with or without it.
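
For anyone who wants to repeat the TIME_WAIT check, here is a minimal sketch that counts such sockets from the text of /proc/net/tcp on Linux. It assumes the usual layout of that file (a header line, then one socket per line with the hex state code in the fourth column, where 06 means TIME_WAIT); it covers IPv4 only, and the helper names are made up for illustration:

```python
TCP_TIME_WAIT = "06"  # hex state code for TIME_WAIT in /proc/net/tcp

def count_time_wait(lines):
    """Count sockets in TIME_WAIT given the lines of /proc/net/tcp."""
    count = 0
    for line in lines[1:]:  # first line is the column header
        fields = line.split()
        # fields: sl, local_address, rem_address, st, ...
        if len(fields) > 3 and fields[3] == TCP_TIME_WAIT:
            count += 1
    return count

def read_proc_net_tcp(path="/proc/net/tcp"):
    """Read the kernel's IPv4 TCP socket table as a list of lines."""
    with open(path) as f:
        return f.read().splitlines()

# Usage on a Linux host:
#   print(count_time_wait(read_proc_net_tcp()))
```

A count in the low hundreds, as reported here, is far below the size of the local port range.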

Another thing is the size of the ARP cache, if your hosts are
all on the same IP network or your router/firewall is doing
proxy-arp.

There are about 4 different networks.
In any case, remember my test with just 20 clients.

Is this server also running the network tests ?
...
    sysctl net.ipv4.tcp_tw_reuse=1
which enables the kernel to re-use ports that are in a TIME_WAIT

Yes, but as before... it also appears with just 20 clients,
so I would rule out a problem related to the number of clients.
However, I also tried:
net.ipv4.tcp_fin_timeout = 30
instead of the default 120 seconds that RHEL5 leaves a port
in TIME_WAIT state.

One (I) would expect the 64-bit systems to have a bit more "oomph"
so they should be the ones that worked best.

Ahm... what is an oomph? :-S

A datapoint here. I'm also running Hobbit on a 64-bit Linux
platform, but it is using SPARC (Sun) hardware.

We are trying to shut down all our SPARC systems and move to Linux. :)

So you're saying that on a RHEL 5.3 64-bit Intel server, setting
up Hobbit and feeding it with data from ~20 clients will make
the system break?

Yes, that is the point: RHEL > 5.0 and 64-bit (AMD)...
I still need to try Fedora 10 64-bit.

My workstation is running RHEL 5.2 on a Sun Ultra 40, and Hobbit (well, 
devmon) is polling about 10 network devices, and getting client reports from 
about 4 VMs (hobbitd gets 1.7 messages/sec), updating 2300 RRD files, and I've 
never seen this.

In the production environment, my hobbit on RHEL5 x86_64 is only doing 
polling/testing/proxying (the display is on a RHEL4 i386).

Regards,
Buchan