RHEL5 and status-board not available bug?
list Flyzone Micky
Well... we think it's a big bug, where 'we' means me and Red Hat support.
Of course I'm speaking of Linux and not about the Solaris bug,
and my kernel parameters are OK.
I moved from RHEL 4.5 with kernel 2.6.9-55 to RHEL 5.3 with
kernel 2.6.18-128, with bonded (active-passive) gigabit Ethernet
and the Xymon data stored on NFS in a Veritas cluster.
The Xymon server handles 3000 hosts and about 17093 status messages.
The problem is the timeout: the Hobbit status page goes all green,
and the pages are sometimes slow to load or show "Status not
available".
Speaking with Red Hat premium support, I sent them a trace of the
error (about 40MB gzipped...), and for them the cause is a bug in
thread management, because in RHEL5 it is no longer possible to use
the old POSIX threading implementation; only the Linux threading
"version" can be used. Of course I have lost some of their
explanation... sorry, but I'm not a programmer.
They completely rule out a problem with the NFS share: the Xymon
throughput is a stable 30KB/s or so, while network tests indicate a
possible 50-78MB/s. However, I had to modify the mount options
to avoid many setattr calls.
As a workaround I have modified the sendmessage call in the lib folder,
adding a retry of the send:
if (res == BB_ETIMEOUT) {
    /* Timed out: pause briefly (5 microseconds), then resend once */
    usleep(5);
    res = sendtomany((recipient ? recipient : bbdisp), xgetenv("BBDISPLAYS"), msg, respfd, respstr, fullresponse, timeout);
}
This of course increases the busy time, but the "all systems green"
problem has not come back.
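The retry-once pattern used in the workaround can be sketched outside the Xymon source. This is only a minimal Python illustration, not the actual Xymon code: `send_with_retry`, `OK`, `TIMEOUT` and the `send` callable are hypothetical names standing in for `sendtomany()` and `BB_ETIMEOUT`, and the sample message is made up.

```python
import time

# Hypothetical result codes standing in for Xymon's success / BB_ETIMEOUT.
OK = 0
TIMEOUT = -1


def send_with_retry(send, message, pause_seconds=0.000005):
    """Call send(message); if it reports a timeout, pause and try once more.

    `send` is any callable returning OK or TIMEOUT. The default pause
    mirrors the usleep(5) in the C workaround (5 microseconds).
    """
    result = send(message)
    if result == TIMEOUT:
        time.sleep(pause_seconds)
        result = send(message)
    return result


# Usage: a fake sender that times out on the first attempt only.
attempts = []


def flaky_send(message):
    attempts.append(message)
    return TIMEOUT if len(attempts) == 1 else OK


final = send_with_retry(flaky_send, "status host.cpu green")
```

A single retry can double the worst-case time spent in one send, which fits the increased busy time described above.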
I'm running Xymon 4.2.0 with the all-in-one patch; Xymon 4.2.2
doesn't seem to contain any changes related to this problem, but I'll
try it in the next few days.
Another issue: when shutting down Xymon I always need to clean
everything up with ipcrm, because the segments are still present.
Nothing more in the logs, just the "status-board not available".
If someone has already hit this issue (it doesn't seem to appear in
past posts), please give me a tip...
Ah... here are my kernel parameters:
------ Shared Memory Limits --------
max number of segments = 8192
max seg size (kbytes) = 67108864
max total shared memory (kbytes) = 17179869184
min seg size (bytes) = 1
------ Semaphore Limits --------
max number of arrays = 128
max semaphores per array = 250
max semaphores system wide = 32000
max ops per semop call = 100
semaphore max value = 32767
------ Messages: Limits --------
max queues system wide = 16
max size of message (bytes) = 65536
default max size of queue (bytes) = 65536
Thanks in advance.
--
Be Yourself @ mail.com!
Choose From 200+ Email Addresses
Get a Free user-9cccf6680cef@xymon.invalid
list Henrik Størner
I'm not completely sure if you believe there is a bug in Xymon, or in the Linux kernel of your RHEL system ... But I have a few comments.
▸
On Tue, Feb 10, 2009 at 07:35:24AM +0000, Flyzone Micky wrote:Well...We think it's a big bug, where 'we' is me and RedHat support. Of course I'm speaking of Linux and not about the Solaris bug, and my kernel parameter are ok. I moved from a rhel4.5 with kernel 2.6.9-55 to a rhel5.3 with kernel 2.6.18-128 with bonding (active-passive) gigabit ethernet, and nfs files storing the xymon data in a Veritas cluster. The xymon server get 3000 hosts and about 17093 status messages. The problem is...the timeout, the hobbit status page go in green, the pages sometimes are slow to be read or give a "Status not available"
3000 hosts is a fairly large setup. I assume you're doing data collection for graphs for all of these servers, and that you're running version 4.2.x of Xymon.
I would guess that your problems - at least in part - stem from the amount of I/O you're doing for updating all of the RRD files. I know from personal experience that heavy disk I/O can cause network connections in Xymon to time out. Having your data on a network filesystem is different from what I've tried, but it could make this problem worse, because the I/O is now entirely handled by the Linux kernel, whereas with a local disk for storage at least some of the I/O is handled by the disk controller.
What you could try - at least for a short period - would be to stop the [rrdstatus] and [rrddata] tasks in hobbitlaunch.cfg. This stops data from being collected into the graphs, but it will also reduce your disk I/O to practically nil. If your system then starts behaving properly, we need to look at reducing the load from your RRD updates (I have a couple of suggestions). If the problem persists, then some other explanation must be found.
▸
Speaking with Redhat premium support, I sent them a trace of the error (about 40MB gzip...) and for them the cause is a bug in the thread management cause in the RHEL5 is not more possible to use the old POSIX implementation of threading, but needs to use just the Linux Threading "version". Of course I have lost some of the sentences....sorry but I'm not a programmer.
I don't know how the change in "POSIX threading" plays into this. Hobbit is not a threaded application, it is plain and simple single-task application all the way through. It may have some meaning in relation to NFS. Regards, Henrik
list Flyzone Micky
▸
On Tue, 10 Feb 2009 16:39:35 +0100, Henrik wrote:
I'm not completely sure if you believe there is a bug in Xymon, or in the Linux kernel of your RHEL system ...
I think it is in Hobbit. And I have news about it; I'll write more below.
▸
3000 hosts is a fairly large setup. I assume you're doing data collection for graphs for all of these servers, and that you're running version 4.2.x of Xymon.
Correct. I'll try 4.3 in the lab next week, now that I know how the "bug" works.
▸
I would guess that your problems - at least in part - stem from the amount of I/O you're doing for updating all of the RRD-files.
No, that is completely ruled out; I already tried disabling all the ext tests. I also tried moving the data to local SCSI disks, and iostat indicates a really low I/O wait.
If the problem persists, then some other explanation must be found.
It must, for sure... it's big trouble seeing 3000 hosts becoming purple, then green, then purple :)
▸
I don't know how the change in "POSIX threading" plays into this. Hobbit is not a threaded application, it is plain and simple single-task application all the way through. It may have some meaning in relation to NFS.
Oops... it's not multithreaded? I'm not a programmer, but... how can it follow 3000 hosts sending data without multithreading?
However, here is the news: the problem persists only with RHEL5 on the x86_64 architecture, with all kinds of 2.6 kernels. With RHEL5 on x86 (32-bit) the bug does not appear. I would like to try Fedora on my notebook... I'll let you know.
For us the best resolution is to reinstall everything in 32-bit, and I'm already working on it (the first server is already up, and Hobbit is now working correctly with just this "little" change).
However, the problem also exists in our Hobbit lab (also 64-bit) when stressing Hobbit with more than 20 "virtual hosts".
Be sure of one thing: it is not a hardware or bottleneck-related problem. The bottleneck was there before, on an old machine with really high I/O wait; with these two new servers it is not.
So, is there anyone with an x86_64 architecture and similar problems? And if someone has a Red Hat developer support license, the RH support team has already told me that they can work on it.
Have a nice evening.
list Henrik Størner
▸
On Thu, Feb 12, 2009 at 06:06:48PM +0000, Flyzone Micky wrote:
On Tue, 10 Feb 2009 16:39:35 +0100, Henrik wrote:
I'm not completely sure if you believe there is a bug in Xymon, or in the Linux kernel of your RHEL system ...
I think is in Hobbit. And I have news about it, I'll write more down.
I would guess that your problems - at least in part - stem from the amount of I/O you're doing for updating all of the RRD-files.
No, excluded at all, already tried to disable all the ext tests. However I tried also switching the data in local SCSI disks and iostat indicate a really low I/O wait.
"Really low" as in ... how much? If you're looking at the vmstat output, check the "vmstat1" graph and see how much of your CPU time I/O wait takes up. And remember that disk I/O in Linux is a single-processor task, so on a dual-CPU box your I/O system is saturated when your vmstat1 graph shows 50% of the time spent in I/O wait. On a quad-CPU box the limit is 25%, obviously.
I also have my RRD files on 10k RPM SCSI disks, hardware RAID controller etc. Without the caching in Xymon 4.3, it couldn't keep up with the amount of RRD updates I was feeding it. This also shows in the fact that flushing the cache - which essentially does the same amount of disk I/O as a full update of all the RRD files - takes about 8 minutes. No chance at all, then, of keeping up with 5-minute update cycles.
I really think you should try shutting off the hobbitd_rrd tasks, just to see what happens.
▸
If the problem persists, then some other explanation must be found.Must for sure....it's a big trouble saw 3000 hosts becaming purple then green then purple :)
For hosts to go purple they have to go more than 30 minutes without
an update - they don't go purple just because they miss a single
update.
I suppose you have checked the kernel logs ('dmesg' output) for
anything odd?
I'm wondering if maybe you're running out of ports (there's only
64K of them, only about half can be used by normal apps). How
many ports do you have in TIME_WAIT state ?
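One way to answer the TIME_WAIT question on Linux is to count the rows in `/proc/net/tcp` whose state column is `06`, the code the kernel uses for TIME_WAIT. The sketch below is an illustration only: `count_time_wait` is a hypothetical helper, and the sample rows are made up and trimmed to the first few columns.

```python
TIME_WAIT_STATE = "06"  # hex state code for TIME_WAIT in /proc/net/tcp


def count_time_wait(proc_net_tcp_text):
    """Count sockets in TIME_WAIT in the text of /proc/net/tcp."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # The fourth field is the connection state ("st" column), in hex.
        if len(fields) > 3 and fields[3] == TIME_WAIT_STATE:
            count += 1
    return count


# Made-up sample in the /proc/net/tcp layout (trailing columns trimmed).
# Local port 07C0 hex is 1984, the port hobbitd listens on.
sample = """\
  sl  local_address rem_address   st tx_queue rx_queue
   0: 0100007F:07C0 00000000:0000 0A 00000000:00000000
   1: 0100007F:07C0 0100007F:D431 06 00000000:00000000
   2: 0100007F:07C0 0100007F:D432 06 00000000:00000000
   3: 0100007F:07C0 0100007F:D433 01 00000000:00000000
"""
time_wait_sockets = count_time_wait(sample)
```

On a live server the same function could be fed the real file contents, for example `count_time_wait(open("/proc/net/tcp").read())`.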
Another thing is the size of the ARP cache, if your hosts are
all on the same IP network or your router/firewall is doing
proxy-arp. This could be a problem - I've seen Hobbit break
on a system with ~1200 hosts, because the network test would
ping all of them, overflowing the ARP cache. This is tunable
with
sysctl net.ipv4.neigh.default.gc_thresh1=3072
sysctl net.ipv4.neigh.default.gc_thresh2=4096
(see the arp(7) man-page for what these do).
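To judge whether the ARP cache is close to these thresholds, the number of entries in `/proc/net/arp` can be compared with the configured values. This is a rough sketch under stated assumptions: `count_arp_entries` and `arp_cache_pressure` are hypothetical helpers, the sample table is made up, and the labels only paraphrase the arp(7) description (below `gc_thresh1` the garbage collector does not run; `gc_thresh2` is the soft maximum).

```python
def count_arp_entries(proc_net_arp_text):
    """Count neighbour entries in the text of /proc/net/arp."""
    rows = proc_net_arp_text.splitlines()[1:]  # skip the header row
    return sum(1 for row in rows if row.strip())


def arp_cache_pressure(entries, gc_thresh1, gc_thresh2):
    """Classify the entry count against the two thresholds."""
    if entries < gc_thresh1:
        return "below gc_thresh1"
    if entries < gc_thresh2:
        return "between gc_thresh1 and gc_thresh2"
    return "at or above gc_thresh2"


# Made-up sample in the /proc/net/arp layout.
sample = """\
IP address       HW type     Flags       HW address            Mask     Device
192.0.2.10       0x1         0x2         00:16:3e:00:00:01     *        eth0
192.0.2.11       0x1         0x2         00:16:3e:00:00:02     *        eth0
"""
entries = count_arp_entries(sample)
status = arp_cache_pressure(entries, gc_thresh1=3072, gc_thresh2=4096)
```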
Is this server also running the network tests ?
Network-wise, it makes sense to tune a busy Hobbit server in the
same manner that you would a very busy webserver (which also
has to handle lots of short-lived connections). Another possible
tuning parameter would be
sysctl net.ipv4.tcp_tw_reuse=1
which enables the kernel to re-use ports that are in a TIME_WAIT
state for new connections. It goes against the recommended way
of doing TCP, but unless you're running Hobbit over high-latency
networks it should not cause any problems.
▸
I don't know how the change in "POSIX threading" plays into this. Hobbit is not a threaded application, it is plain and simple single-task application all the way through. It may have some meaning in relation to NFS.Ups...is not a multithread? I'm not a programmer but....how it can follow 3000 hosts sending data without multithread?
By avoiding all the overhead of using threads :-)
Seriously, 3000 hosts on a 5-minute cycle is only 10 hosts/second. Each host triggers perhaps 5-10 connections (e.g. an old client reporting cpu,disk,memory,msgs,procs,conn), and since the core daemon isn't doing any disk I/O, handling 50-100 connections per second isn't that big a deal.
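The load estimate above is simple arithmetic, and it can be reproduced for other setups. The function name `connection_rate` is made up for this sketch; the input figures are the ones quoted in the message.

```python
def connection_rate(hosts, cycle_seconds, connections_per_host):
    """Average connections per second the core daemon has to accept."""
    hosts_per_second = hosts / cycle_seconds
    return hosts_per_second * connections_per_host


# 3000 hosts reporting once per 5-minute (300-second) cycle.
hosts_per_second = 3000 / 300
low = connection_rate(3000, 300, 5)    # 5 connections per host
high = connection_rate(3000, 300, 10)  # 10 connections per host
```

These are averages; clients that report in bursts would push the short-term rate higher.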
▸
However here the news: the problem persist just with RHEL5 with architecture x86_64 with all kind of 2.6 kernels. With RHEL5 and x86 (32bit) there isn't the bug.
It's quite odd that there is a problem on x86-64, but not on x86-32. One (I) would expect the 64-bit systems to have a bit more "oomph" so they should be the ones that worked best. A datapoint here. I'm also running Hobbit on a 64-bit Linux platform, but it is using SPARC (Sun) hardware. Kernel is 2.6.18-6-sparc64. This hardware is *ancient* (about 10 years old), but handles twice the number of hosts and statuses that you have. I do have the RRD's on a different server, though.
▸
However, the problem exist also in our hobbit lab (always 64bit) stressing the Hobbit with more than 20 "virtual host"
So you're saying that on a RHEL 5.3 64-bit Intel server, setting up Hobbit and feeding it with data from ~20 clients will make the system break? I think I would have heard about it before if this was a general problem. Regards, Henrik
list Flyzone Micky
▸
On Thu, Feb 12, 2009 at 06:06:48PM +0000, Flyzone Micky wrote:
"really low" as in ... how much ?
Output of iostat command:
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           2.22   0.00     0.91     3.62    0.00  93.26
This is the output of iostat about nfs:
Device:              rBlk_nor/s  wBlk_nor/s  rBlk_dir/s  wBlk_dir/s  rBlk_svr/s  wBlk_svr/s  rops/s  wops/s
vnetapp:/vol/hobbit     1631.11      373.97        0.00        0.00     1170.83      825.22  840.76  840.76
This last iostat output also includes rsync traffic, because I was
keeping an rsync copy of the Hobbit data on local disk.
Unlucky nfsstat doesn't sho
▸
of all the RRD files - takes about 8 minutes. No chance at all then of keeping up with 5-minute update cycles.
But in that case, wouldn't a warning like this appear (which I don't get)? WARNING: Runtime 110 longer than BBSLEEP
▸
I really think you should try shutting off the hobbitd_rrd tasks, just to see what happens.
Maybe I left it out of my last post, but I have already done that, and it didn't solve the problem.
▸
For hosts to go purple they have to go more than 30 minutes without an update - they don't go purple just because they miss a single update.
Right... but it doesn't always appear. I also remember an old patch in the all-in-one about dirty data, but that was already applied.
▸
I suppose you have check the kernel logs ('dmesg' output) for anything odd ?
Done, along with all the other system and Hobbit logs. No further messages that could help.
▸
I'm wondering if maybe you're running out of ports (there's only 64K of them, only about half can be used by normal apps). How many ports do you have in TIME_WAIT state ?
Ruled out: there are 235-300 ports at maximum, and in the kernel parameters I also tried (as for Oracle): net.ipv4.ip_local_port_range = 1024 65000, but with or without it nothing changes.
▸
Another thing is the size of the ARP cache, if your hosts are all on the same IP network or your router/firewall is doing proxy-arp.
There are about 4 different networks. And anyway, remember my test with just 20 clients.
▸
Is this server also running the network tests ?
...
sysctl net.ipv4.tcp_tw_reuse=1
which enables the kernel to re-use ports that are in a TIME_WAIT
Yes, but like before... it also appears with just 20 clients, so I would rule out a problem related to the number of clients. However, I also tried net.ipv4.tcp_fin_timeout = 30 instead of the default 120 seconds in RHEL5 for leaving a port in TIME_WAIT state.
▸
One (I) would expect the 64-bit systems to have a bit more "oomph" so they should be the ones that worked best.
Ahm...what is a oomph? :-S
▸
A datapoint here. I'm also running Hobbit on a 64-bit Linux platform, but it is using SPARC (Sun) hardware.
We are trying to shut down all our SPARC machines and move to Linux.. :)
▸
So you're saying that on a RHEL 5.3 64-bit Intel server, setting up Hobbit and feeding it with data from ~20 clients will make the system break?
Yes, that is the point: RHEL >= 5.0 and 64-bit (AMD)... I still need to try Fedora 10 64-bit.
▸
I think I would have heard about it before if this was a general problem.
Eh... I would also have liked to hear about it before :)))
However, after shutting down Hobbit, the ipcs command still shows the shared memory segments in use with no Hobbit process active; maybe something hangs in Hobbit?
Have a nice day
P.S.: how can I reply from a normal email client without creating a new thread on the ML?
list Buchan Milne
▸
On Monday 16 February 2009 13:35:51 Flyzone Micky wrote:
On Thu, Feb 12, 2009 at 06:06:48PM +0000, Flyzone Micky wrote:
"really low" as in ... how much ?
Output of iostat command:
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           2.22   0.00     0.91     3.62    0.00  93.26
This is the output of iostat about nfs:
Device:              rBlk_nor/s  wBlk_nor/s  rBlk_dir/s  wBlk_dir/s  rBlk_svr/s  wBlk_svr/s  rops/s  wops/s
vnetapp:/vol/hobbit     1631.11      373.97        0.00        0.00     1170.83      825.22  840.76  840.76
Unfortunately, this doesn't show anything about how the underlying IO system is performing. The load average for this host would be relevant, as well as iostat-type data for the NFS server, and any stats available in the actual disks. E.g., 1 NFS bulk operation could translate to 16 IOPS on the "spindle", so you could be doing 25000 IOPS, which is quite serious IO (you probably need at least 160 fast spindles to manage that). Or, it could translate to less. So, you need to check your storage system.
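The back-of-envelope estimate above can be written out as follows. The message does not say which figures were used, so this sketch assumes the NFS rate is rops/s plus wops/s from the iostat output, takes the 16x amplification factor as given, and derives the per-spindle rate from the quoted 25000 IOPS over 160 spindles. The function names are made up for illustration.

```python
import math


def backend_iops(nfs_ops_per_second, disk_ops_per_nfs_op):
    """Disk operations per second implied by a given NFS operation rate."""
    return nfs_ops_per_second * disk_ops_per_nfs_op


def spindles_needed(iops, iops_per_spindle):
    """Number of disks needed to sustain the given IOPS."""
    return math.ceil(iops / iops_per_spindle)


# Assumption: NFS operation rate = rops/s + wops/s from the iostat output.
nfs_ops = 840.76 + 840.76
estimate = backend_iops(nfs_ops, 16)  # roughly 26900, near the quoted 25000

# Per-spindle rate implied by "25000 IOPS needs at least 160 spindles".
per_spindle = 25000 / 160
spindles = spindles_needed(25000, per_spindle)
```

As the message says, the real amplification could be lower, so only measurements on the storage system itself can settle it.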
▸
However I tried also with: net.ipv4.tcp_fin_timeout = 30 instead of the default 120 seconds in RHEL5 to leave a port in TIME_WAIT state.
One (I) would expect the 64-bit systems to have a bit more "oomph" so they should be the ones that worked best.
Ahm...what is a oomph? :-S
A datapoint here. I'm also running Hobbit on a 64-bit Linux platform, but it is using SPARC (Sun) hardware.
we are trying to shutdown all our sparc and pass to linux.. :)
So you're saying that on a RHEL 5.3 64-bit Intel server, setting up Hobbit and feeding it with data from ~20 clients will make the system break?
Yes, this is the point RHEL > 5.0 and 64bit (AMD)... I need yet to try on Fedora 10 64bit
My workstation is running RHEL 5.2 on a Sun Ultra 40, and Hobbit (well, devmon) is polling about 10 network devices, and getting client reports from about 4 VMs (hobbitd gets 1.7 messages/sec), updating 2300 RRD files, and I've never seen this. In the production environment, my hobbit on RHEL5 x86_64 is only doing polling/testing/proxying (the display is on a RHEL4 i386). Regards, Buchan
list Flyzone Micky
----- Original Message ----- From: "Buchan Milne" <user-9b139aff4dec@xymon.invalid>
News about my problem: I moved to RHEL 5.3 32-bit with the PAE kernel, with the same setup as before (data on NFS, bonding and Veritas cluster). The problem has disappeared. So, I repeat, the bug seems to appear only on 64-bit RHEL >= 5.0.
So, you need to check your storage system.
Well, as I said, the problem also persists on local disk with 64-bit.
My workstation is running RHEL 5.2 on a Sun Ultra 40, and Hobbit
...cut...
and I've never seen this.
5.2 on Sun; the kernel is different from the AMD/Intel one. Which kernel version are you using?
▸
In the production environment, my hobbit on RHEL5 x86_64 is only doing polling/testing/proxying (the display is on a RHEL4 i386).
Hmm... maybe the problem is related to bbgen? Mine was a bbdisplay and bbnet on the same server.
list Buchan Milne
▸
On Monday 16 February 2009 17:15:34 Flyzone Micky wrote:
----- Original Message ----- From: "Buchan Milne" <user-9b139aff4dec@xymon.invalid>
News about my problem. Moved to RHEL5.3 32bit with PAE kernel, with the same architecture, data on NFS, bonding and veritas cluster like before. The problem is dissappear. So, I repeat, just on 64bit RHEL => 5.0 seems appear the bug.
So, you need to check your storage system.
Well, like I told, the problem persist also on local disk with 64bit.
My workstation is running RHEL 5.2 on a Sun Ultra 40, and Hobbit
...cut...
and I've never seen this.
5.2 on sun, the kernel is different of amd/intel Which version of kernel are you using?
No, Intel/AMD makes no difference; x86_64 (EM64T for Intel, amd64 for AMD) vs i386 does, but:
Linux seaknight.telkomsa.net 2.6.18-92.1.10.el5xen #1 SMP Wed Jul 23 04:11:52 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 65
model name : Dual-Core AMD Opteron(tm) Processor 2210
stepping : 2
cpu MHz : 1800.000
cache size : 1024 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm cr8_legacy
bogomips : 4501.91
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc
[processors 1, 2 and 3: identical to processor 0, except physical id : 1, 2 and 3 respectively]
▸
In the production environment, my hobbit on RHEL5 x86_64 is only doing polling/testing/proxying (the display is on a RHEL4 i386).hum...maybe problem related to bbgen?
Not bbgen itself, as you said you get "Status not available", which is not a static html page from bbgen, but usually from bb-hostsvc.sh (calling hobbitsvc.cgi). You may first want to determine whether the cgi is timing out speaking to hobbitd, or if there is some other problem.
Mine was a bbdisplay and bbnet on the same server.
Yes, in my production environment, problems with hobbitd on the RHEL5 x86_64 box would not result in any noticeable problems, but again, my workstation doesn't exhibit this problem (and I usually have about 10 tabs open on different hobbit pages, including static and cgi ones, looking at my workstation's hobbit install). Regards, Buchan
list Brian O'Mahony
I looked through the archives but couldn't find an answer here - basically with HPUX v3 I get a permanent red flag:
/dev/deviceFileSystem (100% used) has reached the PANIC level (95%)
From the start:
Filesystem 1024-blocks Used Available Capacity Mounted on
DevFS 0 0 0 100% /dev/deviceFileSystem
How do I ignore JUST that filesystem?
B
list Flyzone Micky
▸
On Monday 16 February 2009 18:20:55 Buchan Milne wrote:
You may first want to deteremine whether the cgi is timing out speaking to hobbitd, or if there is some other problem.
It is not just the CGI: all connections to port 1984 time out. For bbdisplay the result is the green page; for the hobbit clients it is a timeout error in the logs and data not sent.
Yes, in my production environment, problems with hobbitd on the RHEL5 x86_64 box would not result in any noticeable problems, but again, my workstation doesn't exhibit this problem
OK, so I would like to find some difference in our software, if it is not hardware related; however, for me the bug appears only on x86_64. I'm using Hobbit 4.2.0 + all-in-one patch at minimum, and I also tried 4.2.2 and 4.2.3RC1, on Red Hat 5.0, 5.2 and 5.3, each with the latest kernel distributed with the release. Local or remote data doesn't change the result, and for me even a fresh 5.3 installation on x86_64 with Hobbit gives the same problem: with more than 20 clients, the timeouts appear after ~30 minutes of Hobbit being up. Virtual or physical machine makes no difference; the result is the same.
list Bruce White
I put:
DISK "/dev/deviceFileSystem" IGNORE
in my hobbit-client.cfg for my rx6600 machines running HPUX 11v3, and
that works fine. It is not an HPUX 11v3 thing; it is specific to the
hardware. My rx7640 also runs HPUX 11v3 but does not need the entry,
because that hardware platform does not need this file system for its
devices.
.....Bruce
▸
-----Original Message-----
From: Brian O'Mahony [mailto:user-9ed4e9656005@xymon.invalid]
Sent: Monday, February 16, 2009 10:35 AM
To: user-ae9b8668bcde@xymon.invalid
Subject: [hobbit] HPUX v11.31 - disk status red
I looked through the archives but couldn't find an answer here -
basically with HPUX v3 I get a permanent red flag:
/dev/deviceFileSystem (100% used) has reached the PANIC level (95%)
From the start:
Filesystem 1024-blocks Used Available Capacity Mounted on
DevFS 0 0 0 100% /dev/deviceFileSystem
How do I ignore JUST that filesytem?
B
list Henrik Størner
▸
On Mon, Feb 16, 2009 at 03:15:34PM +0000, Flyzone Micky wrote:
----- Original Message ----- From: "Buchan Milne" <user-9b139aff4dec@xymon.invalid>News about my problem. Moved to RHEL5.3 32bit with PAE kernel, with the same architecture, data on NFS, bonding and veritas cluster like before. The problem is dissappear. So, I repeat, just on 64bit RHEL => 5.0 seems appear the bug.
Without any more data to go on, I cannot see that there is an indication of this being a Xymon bug. I'll consider it a bug in the RHEL x86-64 kernel until there is data to prove me wrong. Henrik