Xymon Mailing List Archive

RHEL5 and status-board not available bug?

Henrik Størner
Tue, 10 Feb 2009 16:39:35 +0100
Message-Id: <user-5c51f2d12169@xymon.invalid>

I'm not completely sure if you believe there is a bug in Xymon,
or in the Linux kernel of your RHEL system ... But I have a few
comments.

On Tue, Feb 10, 2009 at 07:35:24AM +0000, Flyzone Micky wrote:
> Well... We think it's a big bug, where 'we' is me and Red Hat support.
> Of course I'm speaking of Linux and not about the Solaris bug,
> and my kernel parameters are OK.
>
> I moved from RHEL 4.5 with kernel 2.6.9-55 to RHEL 5.3 with
> kernel 2.6.18-128 with bonding (active-passive) gigabit ethernet,
> and NFS file storage holding the Xymon data in a Veritas cluster.
> The Xymon server gets 3000 hosts and about 17093 status messages.
> The problem is... the timeout: the Hobbit status page goes green,
> the pages are sometimes slow to load or give a "Status not
> available".

3000 hosts is a fairly large setup. I assume you're doing data
collection for graphs for all of these servers, and that you're
running version 4.2.x of Xymon.

I would guess that your problems - at least in part - stem from 
the amount of I/O you're doing for updating all of the RRD-files.
I know from personal experience that heavy disk I/O can cause
network connections in Xymon to time out. Having your data on a
network-filesystem is different from what I've tried, but it
could make this problem worse - because the I/O is now entirely
handled by the Linux kernel, whereas with a local disk for storage
at least some of the I/O is handled by the disk controller.

What you could try - at least for a short period - would be to
stop the [rrdstatus] and [rrddata] tasks in hobbitlaunch.cfg.
This stops data from being collected into the graphs, but it
will also reduce your disk I/O to practically nil. If your system
then starts behaving properly, then we need to look at reducing
the load from your RRD updates (I have a couple of suggestions).
If the problem persists, then some other explanation must be found.
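
As a rough sketch of what I mean, the fragment below marks both tasks
as disabled in hobbitlaunch.cfg. It assumes your hobbitlaunch version
supports the DISABLED keyword in a task section; the comment lines are
only placeholders for the ENVFILE, NEEDS and CMD settings already in
your own file, which should be left untouched.

```
[rrdstatus]
	DISABLED
	# existing ENVFILE, NEEDS and CMD lines stay as they are

[rrddata]
	DISABLED
	# existing ENVFILE, NEEDS and CMD lines stay as they are
```

Removing the two DISABLED lines again resumes the graph updates.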

> Speaking with Red Hat premium support, I sent them a trace of the
> error (about 40MB gzip...) and for them the cause is a bug in the
> thread management, because in RHEL5 it is no longer possible to use
> the old POSIX implementation of threading; one has to use just
> the Linux Threading "version". Of course I have lost some of the
> sentences... sorry, but I'm not a programmer.

I don't know how the change in "POSIX threading" plays into this.
Hobbit is not a threaded application; it is a plain and simple
single-task application all the way through. It may have some
meaning in relation to NFS.


Regards,
Henrik