RHEL5 and status-board not available bug?

12 messages in this thread

list Flyzone Micky · Tue, 10 Feb 2009 07:35:24 +0000 (GMT) ·

Well... we think it's a big bug, where 'we' means me and Red Hat support.
Of course I'm speaking of Linux and not of the Solaris bug,
and my kernel parameters are OK.

I moved from RHEL 4.5 with kernel 2.6.9-55 to RHEL 5.3 with
kernel 2.6.18-128, with bonded (active-passive) gigabit Ethernet
and the Xymon data stored on NFS in a Veritas cluster.
The Xymon server handles 3000 hosts and about 17093 status messages.
The problem is the timeouts: the Hobbit status page goes green, and
the pages are sometimes slow to load or show "Status not
available".

Speaking with Red Hat premium support, I sent them a trace of the
error (about 40MB gzipped...), and for them the cause is a bug in
thread management: in RHEL5 it is no longer possible to use
the old POSIX implementation of threading, only
the Linux threading "version". Of course I have lost some of the
details... sorry, but I'm not a programmer.
They completely rule out a problem with the NFS share: the throughput of
Xymon is a stable 30KB/s or so, while network tests indicate a
capacity of 50-78MB/s. However, I had to modify the mount options
to avoid many setattr calls.

As a workaround I have modified the sendmessage call in the lib folder
so that it repeats the send of the message after a timeout:
        if (res == BB_ETIMEOUT) {
                usleep(5);
                res = sendtomany((recipient ? recipient : bbdisp), xgetenv("BBDISPLAYS"), msg, respfd, respstr, fullresponse, timeout);
        }
This of course increases the busy time, but the "all systems green"
problem doesn't come back.
I'm running Xymon 4.2.0 with the all-in-one patch; Xymon 4.2.2
doesn't seem to have any changes related to this problem, but I'll
try it in the next few days.
Another issue: when shutting down Xymon I always need to clean up
with ipcrm because the segments are still present.
Nothing more in the logs, just the "status-board not available".
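
The retry-on-timeout workaround described above can be sketched outside the Xymon code base. This is only an illustration in Python, not the actual sendmessage implementation: send_with_retry, SendTimeout and the send callable are hypothetical names. The default pause mirrors the usleep(5) in the patch, which is 5 microseconds, so the retry is effectively immediate.

```python
import time


class SendTimeout(Exception):
    """Hypothetical stand-in for the BB_ETIMEOUT result code."""


def send_with_retry(send, message, retries=1, pause_seconds=5e-6):
    """Call send(message); on a timeout, pause briefly and try again.

    `send` is any callable that raises SendTimeout when the server does
    not answer in time. `retries` is the number of extra attempts made
    after the first one fails.
    """
    attempt = 0
    while True:
        try:
            return send(message)
        except SendTimeout:
            if attempt >= retries:
                raise  # give up and let the caller see the timeout
            attempt += 1
            time.sleep(pause_seconds)
```

A retry like this hides a timeout from the caller but does not explain it; each retried message still occupies the sender for up to one more full timeout period.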

If someone has already run into this issue (it doesn't seem to be in past posts),
please give me a tip...
Ah, here are my kernel parameters:

------ Shared Memory Limits --------
max number of segments = 8192
max seg size (kbytes) = 67108864
max total shared memory (kbytes) = 17179869184
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 128
max semaphores per array = 250
max semaphores system wide = 32000
max ops per semop call = 100
semaphore max value = 32767

------ Messages: Limits --------
max queues system wide = 16
max size of message (bytes) = 65536
default max size of queue (bytes) = 65536

Thanks in advance.

-- 
Be Yourself @ mail.com!
Choose From 200+ Email Addresses
Get a Free user-9cccf6680cef@xymon.invalid

list Henrik Størner · Tue, 10 Feb 2009 16:39:35 +0100 ·

I'm not completely sure if you believe there is a bug in Xymon,
or in the Linux kernel of your RHEL system ... But I have a few
comments.

▸ quoted from Flyzone Micky


On Tue, Feb 10, 2009 at 07:35:24AM +0000, Flyzone Micky wrote:

Well...We think it's a big bug, where 'we' is me and RedHat support.
Of course I'm speaking of Linux and not about the Solaris bug,
and my kernel parameter are ok.

I moved from a rhel4.5 with kernel 2.6.9-55 to a rhel5.3 with 
kernel 2.6.18-128 with bonding (active-passive) gigabit ethernet, 
and nfs files storing the xymon data in a Veritas cluster.
The xymon server get 3000 hosts and about 17093 status messages.
The problem is...the timeout, the hobbit status page go in green,
the pages sometimes are slow to be read or give a "Status not
available"

3000 hosts is a fairly large setup. I assume you're doing data
collection for graphs for all of these servers, and that you're
running version 4.2.x of Xymon.

I would guess that your problems - at least in part - stem from 
the amount of I/O you're doing for updating all of the RRD-files.
I know from personal experience that heavy disk I/O can cause
network connections in Xymon to time out. Having your data on a
network-filesystem is different from what I've tried, but it
could make this problem worse - because the I/O is now entirely
handled by the Linux kernel, whereas with a local disk for storage
at least some of the I/O is handled by the disk controller.

What you could try - at least for a short period - would be to
stop the [rrdstatus] and [rrddata] tasks in hobbitlaunch.cfg.
This stops data from being collected into the graphs, but it
will also reduce your disk I/O to practically nil. If your system
then starts behaving properly, then we need to look at reducing
the load from your RRD updates (I have a couple of suggestions).
If the problem persists, then some other explanation must be found.

▸ quoted from Flyzone Micky

Speaking with Redhat premium support, I sent them a trace of the
error (about 40MB gzip...) and for them the cause is a bug in the
thread management cause in the RHEL5 is not more possible to use
the old POSIX implementation of threading, but needs to use just
the Linux Threading "version". Of course I have lost some of the
sentences....sorry but I'm not a programmer.

I don't know how the change in "POSIX threading" plays into this.
Hobbit is not a threaded application, it is a plain and simple
single-task application all the way through. It may have some
meaning in relation to NFS.


Regards,
Henrik

list Flyzone Micky · Thu, 12 Feb 2009 18:06:48 +0000 (GMT) ·

▸ quoted from Henrik Størner

On Tue, 10 Feb 2009 16:39:35 +0100, Henrik wrote:

I'm not completely sure if you believe there is a bug in Xymon,
or in the Linux kernel of your RHEL system ...

I think it is in Hobbit. And I have news about it; I'll write more below.

▸ quoted from Henrik Størner

3000 hosts is a fairly large setup. I assume you're doing data
collection for graphs for all of these servers, and that you're
running version 4.2.x of Xymon.

Correct. I'll try 4.3 in the lab next week, now that I know how
the "bug" works.

▸ quoted from Henrik Størner

I would guess that your problems - at least in part - stem from 
the amount of I/O you're doing for updating all of the RRD-files.

No, that is completely ruled out; I already tried disabling all the ext tests.
I also tried moving the data to local SCSI disks,
and iostat indicates a really low I/O wait.

If the problem persists, then some other explanation must be found.

It must, for sure... it's big trouble to see 3000 hosts becoming purple,
then green, then purple :)

▸ quoted from Henrik Størner

I don't know how the change in "POSIX threading" plays into this.
Hobbit is not a threaded application, it is plain and simple 
single-task application all the way through. It may have some
meaning in relation to NFS.

Oops... it isn't multithreaded? I'm not a programmer but... how can it
follow 3000 hosts sending data without multithreading?

However, here is the news: the problem persists only with RHEL5 on
the x86_64 architecture, with all kinds of 2.6 kernels.
With RHEL5 on x86 (32-bit) the bug doesn't appear.
I would like to try Fedora on my notebook... I'll let you know.
For us the best solution is to reinstall everything in 32-bit; I'm
already working on it (the first server is already up, and Hobbit
is now working correctly with just this "little" change).

However, the problem also exists in our Hobbit lab (also 64-bit)
when stressing Hobbit with more than 20 "virtual hosts".
Be sure of one thing: it is not a hardware or bottleneck-related problem.
The bottleneck was there before, on an old machine with really high
I/O wait; now, with these two new servers, it is not.

Is there anyone with an x86_64 architecture who has similar problems?
And if someone has a Red Hat developer support license, the RH
support team has already told me that they can work on it.

Have a nice evening.


list Henrik Størner · Thu, 12 Feb 2009 23:31:50 +0100 ·

▸ quoted from Flyzone Micky

On Thu, Feb 12, 2009 at 06:06:48PM +0000, Flyzone Micky wrote:

On Tue, 10 Feb 2009 16:39:35 +0100, Henrik wrote:

I'm not completely sure if you believe there is a bug in Xymon,
or in the Linux kernel of your RHEL system ...

I think is in Hobbit. And I have news about it, I'll write more down.

I would guess that your problems - at least in part - stem from 
the amount of I/O you're doing for updating all of the RRD-files.

No, excluded at all, already tried to disable all the ext tests.
However I tried also switching the data in local SCSI disks
and iostat indicate a really low I/O wait.

"really low" as in ... how much ? If you're looking at the vmstat 
output, check the "vmstat1" graph and see how much I/O wait
takes up of your cpu time. AND - remember that disk I/O in
Linux is a single-processor task, so on a dual-CPU box
your I/O system is saturated when your vmstat1 graph shows
50% of the time is spent in I/O wait.

On a quad-CPU box the limit is 25%, obviously.
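
The rule of thumb above (I/O wait is effectively bound to one CPU, so the saturation point is 100% divided by the number of CPUs) can be written down as a small calculation. This is a sketch of that rule only; the function names are made up for illustration.

```python
def iowait_saturation_percent(cpu_count):
    """I/O wait percentage at which, per the rule of thumb above,
    the I/O system counts as saturated."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return 100.0 / cpu_count


def iowait_utilisation(iowait_percent, cpu_count):
    """Fraction of the saturation point that a measured I/O wait represents."""
    return iowait_percent / iowait_saturation_percent(cpu_count)
```

For example, iowait_saturation_percent(2) gives 50.0 and iowait_saturation_percent(4) gives 25.0, matching the dual-CPU and quad-CPU figures above.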

I also have my RRD files on 10k RPM SCSI disks, hardware raid
controller etc. Without the caching in Xymon 4.3, it couldn't
keep up with the amount of RRD updates I was feeding it.
Which also shows in the fact that flushing the cache - which
essentially does the same amount of disk I/O as a full update
of all the RRD files - takes about 8 minutes. No chance at all
then of keeping up with 5-minute update cycles.

I really think you should try shutting off the hobbitd_rrd tasks,
just to see what happens.

▸ quoted from Flyzone Micky

If the problem persists, then some other explanation must be found.

Must for sure....it's a big trouble saw 3000 hosts becaming purple
then green then purple :)

For hosts to go purple they have to go more than 30 minutes without
an update - they don't go purple just because they miss a single
update.

I suppose you have checked the kernel logs ('dmesg' output) for
anything odd ?

I'm wondering if maybe you're running out of ports (there's only
64K of them, only about half can be used by normal apps). How
many ports do you have in TIME_WAIT state ? 
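
One way to answer the TIME_WAIT question on Linux is to count the sockets whose state column in /proc/net/tcp is 06, the hex code the kernel uses for TIME_WAIT. The sketch below takes the file contents as a string so it can be tried on saved output; the function name is made up for illustration, and it only looks at IPv4 sockets.

```python
TIME_WAIT_STATE = "06"  # hex state code for TIME_WAIT in /proc/net/tcp


def count_time_wait(proc_net_tcp_text):
    """Count TIME_WAIT sockets in text read from /proc/net/tcp."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        # fields: slot, local address, remote address, state, ...
        if len(fields) > 3 and fields[3] == TIME_WAIT_STATE:
            count += 1
    return count
```

On the server itself, pass in the contents of /proc/net/tcp, for example open("/proc/net/tcp").read().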

Another thing is the size of the ARP cache, if your hosts are
all on the same IP network or your router/firewall is doing
proxy-arp. This could be a problem - I've seen Hobbit break
on a system with ~1200 hosts, because the network test would
ping all of them, overflowing the ARP cache. This is tunable
with
     sysctl net.ipv4.neigh.default.gc_thresh1=3072
     sysctl net.ipv4.neigh.default.gc_thresh2=4096
(see the arp(7) man-page for what these do).
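
To see whether the ARP cache is anywhere near these thresholds, the number of entries in /proc/net/arp can be compared with gc_thresh1 and gc_thresh2. The sketch below assumes the usual layout of that file (one header line, then one line per entry); the function names are made up for illustration, and the wording of the three cases paraphrases arp(7).

```python
def count_arp_entries(proc_net_arp_text):
    """Count neighbour entries in text read from /proc/net/arp."""
    lines = [line for line in proc_net_arp_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)  # first non-empty line is the column header


def describe_arp_usage(entries, gc_thresh1, gc_thresh2):
    """Place an entry count relative to the two thresholds tuned above."""
    if entries < gc_thresh1:
        return "below gc_thresh1: the garbage collector does not run"
    if entries <= gc_thresh2:
        return "between gc_thresh1 and gc_thresh2"
    return "above gc_thresh2: over the soft maximum"
```

With roughly 1200 entries, default-sized thresholds such as 128 and 512 fall in the last case, while the values 3072 and 4096 suggested above fall in the first.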

Is this server also running the network tests ?

Network-wise, it makes sense to tune a busy Hobbit server in the
same manner that you would a very busy webserver (which also 
has to handle lots of short-lived connections). Another possible
tuning parameter would be 
     sysctl net.ipv4.tcp_tw_reuse=1
which enables the kernel to re-use ports that are in a TIME_WAIT
state for new connections. It goes against the recommended way
of doing TCP, but unless you're running Hobbit over high-latency
networks it should not cause any problems.

▸ quoted from Flyzone Micky

I don't know how the change in "POSIX threading" plays into this.
Hobbit is not a threaded application, it is plain and simple 
single-task application all the way through. It may have some
meaning in relation to NFS.

Ups...is not a multithread? I'm not a programmer but....how it can
follow 3000 hosts sending data without multithread?

By avoiding all the overhead of using threads :-)

Seriously, 3000 hosts on a 5-minute cycle is only 10 hosts/second.
Each host triggers perhaps 5-10 connections (e.g. an old client
reporting cpu,disk,memory,msgs,procs,conn), and since the core
daemon isn't doing any disk I/O handling 50-100 connections per
second isn't that big a deal.
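
The arithmetic behind those figures, as a short sketch (the function name is made up for illustration):

```python
def connections_per_second(hosts, cycle_seconds, connections_per_host):
    """Average connection rate when every host reports once per cycle."""
    return hosts / cycle_seconds * connections_per_host


low = connections_per_second(3000, 300, 5)    # 5 connections per host
high = connections_per_second(3000, 300, 10)  # 10 connections per host
```

Here low is 50.0 and high is 100.0 connections per second, with 3000 / 300 = 10 hosts reporting per second. These are averages; clients that report in bursts would produce higher short-term peaks.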

▸ quoted from Flyzone Micky

However here the news: the problem persist just with RHEL5 with
architecture x86_64 with all kind of 2.6 kernels.
With RHEL5 and x86 (32bit) there isn't the bug.

It's quite odd that there is a problem on x86-64, but not on x86-32.
One (I) would expect the 64-bit systems to have a bit more "oomph"
so they should be the ones that worked best.

A datapoint here. I'm also running Hobbit on a 64-bit Linux 
platform, but it is using SPARC (Sun) hardware. Kernel is
2.6.18-6-sparc64. This hardware is *ancient* (about 10 years old),
but handles twice the number of hosts and statuses that you have.
I do have the RRD's on a different server, though.

▸ quoted from Flyzone Micky

However, the problem exist also in our hobbit lab (always 64bit)
stressing the Hobbit with more than 20 "virtual host"

So you're saying that on a RHEL 5.3 64-bit Intel server, setting
up Hobbit and feeding it with data from ~20 clients will make
the system break?

I think I would have heard about it before if this was a general
problem.


Regards,
Henrik

list Flyzone Micky · Mon, 16 Feb 2009 11:35:51 +0000 (GMT) ·

▸ quoted from Flyzone Micky

On Thu, Feb 12, 2009 at 06:06:48PM +0000, Flyzone Micky wrote:

"really low" as in ... how much ?

Output of iostat command:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.22    0.00    0.91    3.62    0.00   93.26

This is the iostat output for NFS:
Device:              rBlk_nor/s  wBlk_nor/s  rBlk_dir/s  wBlk_dir/s  rBlk_svr/s  wBlk_svr/s  rops/s  wops/s
vnetapp:/vol/hobbit     1631.11      373.97        0.00        0.00     1170.83      825.22  840.76  840.76

This last iostat output also includes rsync traffic, because I was
keeping an rsync copy of the Hobbit data on local disk.

Unlucky nfsstat doesn't sho

▸ quoted from Henrik Størner

of all the RRD files - takes about 8 minutes. No chance at all
then of keeping up with 5-minute update cycles.

But in that case, wouldn't a warning like this appear (I don't get one)?
WARNING: Runtime 110 longer than BBSLEEP

▸ quoted from Henrik Størner

I really think you should try shutting off the hobbitd_rrd tasks,
just to see what happens.

Maybe I left it out of my last post, but I have already done that, and it
didn't solve the problem.

▸ quoted from Henrik Størner

For hosts to go purple they have to go more than 30 minutes without
an update - they don't go purple just because they miss a single
update.

Right... but it doesn't always happen. I also remember an old patch
in the all-in-one about dirty data, but that was already applied.

▸ quoted from Henrik Størner

I suppose you have check the kernel logs ('dmesg' output) for
anything odd ?

Done, along with all the other system and Hobbit logs. No further
messages that could help.

▸ quoted from Henrik Størner

I'm wondering if maybe you're running out of ports (there's only
64K of them, only about half can be used by normal apps). How
many ports do you have in TIME_WAIT state ?

Ruled out: the count is 235-300 at most, and in the kernel parameters
I also tried (as for Oracle):
net.ipv4.ip_local_port_range = 1024 65000
but nothing changes with or without it.

▸ quoted from Henrik Størner

Another thing is the size of the ARP cache, if your hosts are
all on the same IP network or your router/firewall is doing
proxy-arp.

There are about 4 different networks.
And in any case, remember my test with just 20 clients.

▸ quoted from Henrik Størner

Is this server also running the network tests ?
...
    sysctl net.ipv4.tcp_tw_reuse=1
which enables the kernel to re-use ports that are in a TIME_WAIT

Yes, but as before... it also appears with just 20 clients,
so I would rule out a problem related to the number of clients.
However, I also tried:
net.ipv4.tcp_fin_timeout = 30
instead of the default 120 seconds in RHEL5 for leaving a port
in TIME_WAIT state.

▸ quoted from Henrik Størner

One (I) would expect the 64-bit systems to have a bit more "oomph"
so they should be the ones that worked best.

Ahm... what is an oomph? :-S

▸ quoted from Henrik Størner

A datapoint here. I'm also running Hobbit on a 64-bit Linux 
platform, but it is using SPARC (Sun) hardware.

We are trying to shut down all our SPARC machines and move to Linux.. :)

▸ quoted from Henrik Størner

So you're saying that on a RHEL 5.3 64-bit Intel server, setting
up Hobbit and feeding it with data from ~20 clients will make
the system break?

Yes, this is the point: RHEL > 5.0 and 64-bit (AMD)...
I still need to try Fedora 10 64-bit.

▸ quoted from Henrik Størner

I think I would have heard about it before if this was a general
problem.

Eh... I would also have liked to hear about it before :)))

However, after shutting down Hobbit, the ipcs command still shows the
shared memory segments in use with no Hobbit process active; maybe
something hangs in Hobbit?

Have a nice day

P.S.: how can I reply using a normal email client without creating a
new thread on the ML?


list Buchan Milne · Mon, 16 Feb 2009 15:55:26 +0200 ·

▸ quoted from Flyzone Micky

On Monday 16 February 2009 13:35:51 Flyzone Micky wrote:

On Thu, Feb 12, 2009 at 06:06:48PM +0000, Flyzone Micky wrote:

"really low" as in ... how much ?

Output of iostat command:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.22    0.00    0.91    3.62    0.00   93.26

This is the output of iostat about nfs:
Device:              rBlk_nor/s   wBlk_nor/s   rBlk_dir/s
vnetapp:/vol/hobbit     1631.11       373.97         0.00

wBlk_dir/s   rBlk_svr/s   wBlk_svr/s    rops/s    wops/s
      0.00      1170.83       825.22    840.76    840.76

Unfortunately, this doesn't show anything about how the underlying IO system 
is performing. The load average for this host would be relevant, as well as 
iostat-type data for the NFS server, and any stats available in the actual 
disks.

E.g., 1 NFS bulk operation could translate to 16 IOPS on the "spindle", so you 
could be doing 25000 IOPS, which is quite serious IO (you probably need at 
least 160 fast spindles to manage that). Or, it could translate to less. So, 
you need to check your storage system.
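
The estimate above can be reproduced from the rops/s and wops/s columns of the quoted iostat output. The sketch below is only the multiplication described in the text: the factor of 16 back-end operations per NFS operation is the illustrative figure used above, not a measured value, and the function names are made up.

```python
def backend_iops(nfs_read_ops, nfs_write_ops, amplification):
    """Estimated disk-level IOPS for a given NFS operation rate."""
    return (nfs_read_ops + nfs_write_ops) * amplification


def iops_per_spindle(total_iops, spindles):
    """Average load each disk has to carry."""
    return total_iops / spindles


total = backend_iops(840.76, 840.76, 16)
per_disk = iops_per_spindle(total, 160)
```

With 840.76 read and 840.76 write operations per second this comes to roughly 26,900 back-end IOPS, the same order as the 25000 mentioned above, or about 168 per disk over 160 spindles. A smaller amplification factor scales the result down proportionally.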

▸ quoted from Flyzone Micky

In this last iostat have also a rsync statistic in it cause I was
mantening a rsync on local disk of hobbit.

Unlucky nfsstat doesn't sho

of all the RRD files - takes about 8 minutes. No chance at all
then of keeping up with 5-minute update cycles.

But in this case will not appear a warning like this (that I don't have)?
WARNING: Runtime 110 longer than BBSLEEP

I really think you should try shutting off the hobbitd_rrd tasks,
just to see what happens.

Maybe I missed in the last post, but I have already done, and didn't
solve the problem.

For hosts to go purple they have to go more than 30 minutes without
an update - they don't go purple just because they miss a single
update.

Right...but doesn't appear always, I remember also an old patch
that was in all-in-one about dirty-datas, but was already applied.

I suppose you have check the kernel logs ('dmesg' output) for
anything odd ?

Done, like all the logs in the system and hobbit. Nothing more
message that could help.

I'm wondering if maybe you're running out of ports (there's only
64K of them, only about half can be used by normal apps). How
many ports do you have in TIME_WAIT state ?

Excluded, the port is 235-300 at maximun, and in the kernel parameter
I also tried to use (like in Oracle):
net.ipv4.ip_local_port_range = 1024 65000
but with or without nothing change.

Another thing is the size of the ARP cache, if your hosts are
all on the same IP network or your router/firewall is doing
proxy-arp.

The networks are about 4 differents.
And however, remember about my test on a just 20 clients.

Is this server also running the network tests ?
...
    sysctl net.ipv4.tcp_tw_reuse=1
which enables the kernel to re-use ports that are in a TIME_WAIT

Yes, but like before...appear also with just a 20 clients,
so I would exclude a problem related at the numbers of clients.
However I tried also with:
net.ipv4.tcp_fin_timeout = 30
instead of the default 120 seconds in RHEL5 to leave a port
in TIME_WAIT state.

One (I) would expect the 64-bit systems to have a bit more "oomph"
so they should be the ones that worked best.

Ahm...what is a oomph? :-S

A datapoint here. I'm also running Hobbit on a 64-bit Linux
platform, but it is using SPARC (Sun) hardware.

we are trying to shutdown all our sparc and pass to linux.. :)

So you're saying that on a RHEL 5.3 64-bit Intel server, setting
up Hobbit and feeding it with data from ~20 clients will make
the system break?

Yes, this is the point RHEL > 5.0 and 64bit (AMD)...
I need yet to try on Fedora 10 64bit

My workstation is running RHEL 5.2 on a Sun Ultra 40, and Hobbit (well, 
devmon) is polling about 10 network devices, and getting client reports from 
about 4 VMs (hobbitd gets 1.7 messages/sec), updating 2300 RRD files, and I've 
never seen this.

In the production environment, my hobbit on RHEL5 x86_64 is only doing 
polling/testing/proxying (the display is on a RHEL4 i386).

Regards,
Buchan

list Flyzone Micky · Mon, 16 Feb 2009 15:15:34 +0000 (GMT) ·

----- Original Message -----
From: "Buchan Milne" <user-9b139aff4dec@xymon.invalid>

News about my problem.
I moved to RHEL 5.3 32-bit with the PAE kernel, with the same setup as
before: data on NFS, bonding and Veritas cluster.
The problem has disappeared.

So, I repeat, the bug seems to appear only on 64-bit RHEL >= 5.0.

So, you need to check your storage system.

Well, as I said, the problem also persists on local disk with 64-bit.

My workstation is running RHEL 5.2 on a Sun Ultra 40, and Hobbit

...cut...

and I've never seen this.

5.2 on Sun: the kernel is different from the AMD/Intel one.
Which kernel version are you using?

▸ quoted from Buchan Milne

In the production environment, my hobbit on RHEL5 x86_64 is only doing
polling/testing/proxying (the display is on a RHEL4 i386).

Hmm... maybe the problem is related to bbgen?
Mine was a bbdisplay and bbnet on the same server.


list Buchan Milne · Mon, 16 Feb 2009 18:20:55 +0200 ·

▸ quoted from Flyzone Micky

On Monday 16 February 2009 17:15:34 Flyzone Micky wrote:

----- Original Message -----
From: "Buchan Milne" <user-9b139aff4dec@xymon.invalid>

News about my problem.
Moved to RHEL5.3 32bit with PAE kernel, with the same architecture,
data on NFS, bonding and veritas cluster like before.
The problem is dissappear.

So, I repeat, just on 64bit RHEL => 5.0 seems appear the bug.

So, you need to check your storage system.

Well, like I told, the problem persist also on local disk with 64bit.

My workstation is running RHEL 5.2 on a Sun Ultra 40, and Hobbit

...cut...

and I've never seen this.

5.2 on sun, the kernel is different of amd/intel
Which version of kernel are you using?

No, Intel/AMD makes no difference, x86_64 (EM64T for Intel, amd64 for AMD) vs i386 does, but:


Linux seaknight.telkomsa.net 2.6.18-92.1.10.el5xen #1 SMP Wed Jul 23 04:11:52 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

# cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 2210
stepping        : 2
cpu MHz         : 1800.000
cache size      : 1024 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm cr8_legacy
bogomips        : 4501.91
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 2210
stepping        : 2
cpu MHz         : 1800.000
cache size      : 1024 KB
physical id     : 1
siblings        : 1
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm cr8_legacy
bogomips        : 4501.91
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

processor       : 2
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 2210
stepping        : 2
cpu MHz         : 1800.000
cache size      : 1024 KB
physical id     : 2
siblings        : 1
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm cr8_legacy
bogomips        : 4501.91
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

processor       : 3
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 2210
stepping        : 2
cpu MHz         : 1800.000
cache size      : 1024 KB
physical id     : 3
siblings        : 1
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm cr8_legacy
bogomips        : 4501.91
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

▸ quoted from Flyzone Micky

In the production environment, my hobbit on RHEL5 x86_64 is only doing
polling/testing/proxying (the display is on a RHEL4 i386).

hum...maybe problem related to bbgen?

Not bbgen itself, as you said you get "Status not available", which is not a static html page from bbgen, but usually from bb-hostsvc.sh (calling hobbitsvc.cgi).

You may first want to determine whether the cgi is timing out speaking to hobbitd, or if there is some other problem.
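
One way to separate the two cases is to talk to hobbitd directly, bypassing the CGI, and time the exchange. The sketch below makes assumptions that should be checked against the bb(1) man page for the installed version: that the daemon listens on TCP port 1984, that a client writes its message, half-closes the connection and then reads the reply, and that a "ping" message is answered with the server version. The function name probe_xymon is made up for illustration.

```python
import socket
import time


def probe_xymon(host, port=1984, message="ping", timeout=10.0):
    """Send one message to the daemon and return (elapsed_seconds, reply).

    Raises socket.timeout (or another OSError) if the daemon cannot be
    reached or does not answer within `timeout` seconds.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(message.encode("ascii"))
        sock.shutdown(socket.SHUT_WR)  # tell the server the message is complete
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection: reply is complete
                break
            chunks.append(data)
    elapsed = time.monotonic() - start
    return elapsed, b"".join(chunks).decode("ascii", errors="replace")
```

If a direct probe like this also times out while the CGI shows "Status not available", the problem is on the daemon side rather than in the CGI or the web server.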

Mine was a bbdisplay and bbnet on the same server.

Yes, in my production environment, problems with hobbitd on the RHEL5 x86_64 box would not result in any noticeable problems, but again, my workstation doesn't exhibit this problem (and I usually have about 10 tabs open on different hobbit pages, including static and cgi ones, looking at my workstation's hobbit install).

Regards,
Buchan

list Brian O'Mahony · Mon, 16 Feb 2009 16:35:00 +0000 ·

I looked through the archives but couldn't find an answer here - basically with HPUX v3 I get a permanent red flag:

/dev/deviceFileSystem (100% used) has reached the PANIC level (95%)

From the start:
Filesystem            1024-blocks  Used  Available Capacity Mounted on
DevFS                           0     0          0     100% /dev/deviceFileSystem

How do I ignore JUST that filesystem?

B


The information in this email is confidential and may be legally privileged.
It is intended solely for the addressee. Access to this email by anyone else
is unauthorized. If you are not the intended recipient, any disclosure,
copying, distribution or any action taken or omitted to be taken in reliance
on it, is prohibited and may be unlawful. If you are not the intended
addressee please contact the sender and dispose of this e-mail. Thank you.

list Flyzone Micky · Tue, 17 Feb 2009 10:13:12 +0000 (GMT) ·

▸ quoted from Buchan Milne

On Monday 16 February 2009 18:20:55 Buchan Milne wrote:

You may first want to deteremine whether the cgi is timing out speaking to hobbitd, or if there is some other problem.

It is not just the CGI: all connections to port 1984 time out.
The result for bbdisplay is the green page; for the hobbit clients it is
a timeout error in the logs and data not sent.

Yes, in my production environment, problems with hobbitd on the RHEL5 x86_64 box would not result in any noticeable problems, but again, my workstation doesn't exhibit this problem

OK, so I would like to find some difference in our software if it is not hardware related; for me, however, the bug appears only on x86_64.
I'm using Hobbit 4.2.0 + all-in-one patch at minimum, and I also tried 4.2.2 and 4.2.3RC1, on Red Hat 5.0, 5.2 and 5.3 with the latest kernel distributed with each release.
Local or remote data doesn't change the result, and for me even a fresh
installation of 5.3 on x86_64 with Hobbit gives the same problem: with more than
20 clients, timeouts appear about 30 minutes after Hobbit is started; a virtual
or physical machine makes no difference.


list Bruce White · Tue, 17 Feb 2009 13:53:02 -0600 ·

I put:

DISK "/dev/deviceFileSystem" IGNORE

in my hobbit-client.cfg for my rx6600 machines running HPUX 11v3, and
that works fine.  It is not an HPUX 11v3 thing; it is specific to the
hardware.  My rx7640 also runs HPUX 11v3 but does not need the entry,
because that hardware platform does not need this file system for its
devices.

       .....Bruce


Disclaimer: The information contained in this message may be privileged and confidential and protected from disclosure. If the reader of this message is not the intended recipient, or an employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please notify us immediately by replying to the message and deleting it from your computer. Thank you. Fellowes, Inc.

▸ quoted from Brian O'Mahony

-----Original Message-----
From: Brian O'Mahony [mailto:user-9ed4e9656005@xymon.invalid] 
Sent: Monday, February 16, 2009 10:35 AM
To: user-ae9b8668bcde@xymon.invalid
Subject: [hobbit] HPUX v11.31 - disk status red

I looked through the archives but couldn't find an answer here -
basically with HPUX v3 I get a permanent red flag:

/dev/deviceFileSystem (100% used) has reached the PANIC level (95%)

From the start:
Filesystem            1024-blocks  Used  Available Capacity Mounted on
DevFS                 0        0        0   100%
/dev/deviceFileSystem

How do I ignore JUST that filesytem?

B


list Henrik Størner · Thu, 26 Feb 2009 10:26:06 +0100 ·

▸ quoted from Flyzone Micky

On Mon, Feb 16, 2009 at 03:15:34PM +0000, Flyzone Micky wrote:

----- Original Message -----
From: "Buchan Milne" <user-9b139aff4dec@xymon.invalid>

News about my problem.
Moved to RHEL5.3 32bit with PAE kernel, with the same architecture,
data on NFS, bonding and veritas cluster like before.
The problem is dissappear.

So, I repeat, just on 64bit RHEL => 5.0 seems appear the bug.

Without any more data to go on, I cannot see that there is
an indication of this being a Xymon bug.

I'll consider it a bug in the RHEL x86-64 kernel until there is
data to prove me wrong.


Henrik
