Channel processing problem with 4.11
list Mark Deiss
Testing in two separate hobbit environments. Both environments are able to
process BB-PE traffic. Using Hobbit 4.11 server release and a simple client
test file to transmit status to the servers (modified hobbitclient-linux.sh
and hobbitclient-hpux.sh).
First environment using Mandrake 9.0 server and Fedora Core 2 client -
client transmissions received and processed properly on server.
Second environment using Fedora Core 3 server and HP-UX 11i client, have
problems with hobbitd_channel on server.
Initial error was: Worker process died with exit code 6, terminating
I reduced the size of the HP-UX client message (no longer sending
ps/top/vmstat output); it is still blowing up.
To drill into the problem, I commented out some of the sigaction() calls in
hobbitd_channel.c and set those signals to be ignored instead:
/* sigaction(SIGPIPE, &sa, NULL); */
signal(SIGPIPE, SIG_IGN);
/* sigaction(SIGINT, &sa, NULL); */
signal(SIGINT, SIG_IGN);
sigaction(SIGTERM, &sa, NULL);
/* sigaction(SIGCHLD, &sa, NULL); */
signal(SIGCHLD, SIG_IGN);
On rerun, the error message is now: Our child has failed and will not talk to
us....
Guessing that there may be a blocking problem (even though only the server
and the one client are using the channel), I increased the sleep value:
else if (errno == EAGAIN) {
    /*
     * Write would block ... stop for now.
     * Wait just a little while before continuing, so we
     * dont do busy-waiting when the worker child is not
     * accepting more data.
     */
    canwrite = 0;
    /* usleep(2500); */
    usleep(25000);
}
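For readers following along, the retry-on-EAGAIN idea in that excerpt can be sketched in isolation. This is a minimal illustration under stated assumptions, not the hobbitd_channel code: the function name write_with_backoff and its max_retries and delay_us parameters are hypothetical, and the descriptor is assumed to have been put into non-blocking mode already.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Hobbit): write len bytes to a
 * non-blocking descriptor. When write() reports EAGAIN, sleep delay_us
 * microseconds and retry; give up after max_retries consecutive stalls.
 * Returns the number of bytes actually written, or -1 on any other
 * failure (including a zero-byte write).
 */
ssize_t write_with_backoff(int fd, const char *buf, size_t len,
                           int max_retries, long delay_us)
{
    size_t done = 0;
    int stalls = 0;

    assert(buf != NULL);

    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);

        if (n > 0) {
            done += (size_t)n;   /* progress: reset the stall counter */
            stalls = 0;
        } else if (n < 0 && errno == EINTR) {
            continue;            /* interrupted by a signal: just retry */
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            struct timespec ts;

            if (++stalls > max_retries) break;   /* reader is not draining */
            ts.tv_sec = delay_us / 1000000L;
            ts.tv_nsec = (delay_us % 1000000L) * 1000L;
            nanosleep(&ts, NULL);                /* back off, no busy-wait */
        } else {
            return -1;
        }
    }
    return (ssize_t)done;
}
```

A longer delay only changes how long the writer waits for the reader to drain the pipe. If the reading process has already died, no delay value will help, which is consistent with the sleep change making no difference here.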
Same error. Any ideas what to look into next?
Also, has anyone configured a Hobbit server to work with a number of HP-UX
clients? I am looking to have around 50 HP-UX servers and 50 Windows servers
report into a single Hobbit server.
list Henrik Størner
In <FB13116A8C464943B4A5436A616C95F80530AB1F at rocexu01> "Deiss, Mark" <user-5cd8675c2346@xymon.invalid> writes:
> First environment using Mandrake 9.0 server and Fedora Core 2 client - client transmissions received and processed properly on server.
> Second environment using Fedora Core 3 server and HP-UX 11i client, have problems with hobbitd_channel on server.
> Initial error was: Worker process died with exit code 6, terminating
This means that the hobbitd_client program has crashed. There ought to be a core-dump in the ~hobbit/server/tmp/ directory; if you could run it through the procedure described in http://www.hswn.dk/hobbit/help/known-issues.html#bugreport it would make it simpler to find.
> I reduced the size of the HP-UX client message (no longer sending ps/top/vmstat output); it is still blowing up.
Please send me a copy of the client message. You'll find it in ~hobbit/client/tmp/msg.txt on the HP-UX server. I've had one other report of HP-UX clients causing the hobbitd_client module to crash, so there is probably something special about the client messages from HP-UX based systems that triggers this.
> Also, has anyone configured a Hobbit server to work with a number of HP-UX clients? I am looking to have around 50 HP-UX servers and 50 Windows servers report into a single Hobbit server.
Shouldn't be a problem. I have about 1500 clients reporting into one Hobbit server (HP-UX, Solaris, Windows, AIX).
Henrik
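A note on the "exit code 6" wording: 6 is the number of SIGABRT on Linux, the signal raised by abort() and by a failed assert(). The sketch below shows, under stated assumptions, how a parent process can tell a death by signal from a normal exit using the waitpid() status macros. It is not the hobbitd_channel code; the names spawn_aborting_child and died_from_abort are hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that immediately calls abort()
 * and return the raw wait status collected by the parent, or -1 if
 * waitpid() fails.
 */
int spawn_aborting_child(void)
{
    int status = 0;
    pid_t pid = fork();

    assert(pid >= 0);       /* sketch only: treat fork failure as fatal */
    if (pid == 0) {
        abort();            /* child: raises SIGABRT, never returns */
    }
    if (waitpid(pid, &status, 0) != pid) return -1;
    return status;
}

/* Non-zero when the wait status says the child was killed by SIGABRT. */
int died_from_abort(int status)
{
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT;
}
```

Calling died_from_abort(spawn_aborting_child()) returns non-zero, while WIFEXITED() is false for the same status. A worker that dies this way did not choose an exit code; it was terminated, which is why a core dump is the thing to look for.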
list Mark Deiss
Banging on it and no core dumps (yet). After each signal 6 (SIGABRT) fault,
the hobbitd client channel worker appears to be restarted by the parent
channel process. I can probably get a core dump faster if I disable the
signal handlers again and pound the channel some more.
I have narrowed down the fault in the client message to the first line of
the df output (how strange). Below is the client-side script, cut down to
report only the initial client line, the [df] block header, and the df
output. If the HP-UX df header line is removed before the message is sent,
there is no error on the channel.
I am using the hobbit bb binary for client-side transmission. For the tests
below, I restored the original hobbitd_channel binary with the standard
signal handlers. I have been playing around with the content of the df header
line to try to figure out what is so special about it. Very confusing. I
guess I will go back in, disable the signal handlers, and try to force out a
core dump.
#!/bin/sh
#
#----------------------------------------------------------------------------#
# HP-UX client for Hobbit                                                    #
#                                                                            #
# Copyright (C) 2005 Henrik Storner <user-ce4a2c883f75@xymon.invalid>        #
#                                                                            #
# This program is released under the GNU General Public License (GPL),       #
# version 2. See the file "COPYING" for details.                             #
#                                                                            #
#----------------------------------------------------------------------------#
#
# $Id: hobbitclient-hpux.sh,v 1.4 2005/07/24 11:32:51 henrik Exp $
MACHINE=`/usr/bin/hostname`
# no good - uname reports HP-UX, hobbit code uses hpux, need to ditch the minus sign
#BBOSTYPE="`/usr/bin/uname -s | /usr/bin/tr '[A-Z]' '[a-z]'`"
BBOSTYPE="hpux"
{
echo "client $MACHINE.$BBOSTYPE"
##echo "[date]"
##/usr/bin/date
echo "[df]"
# the default does not work - header and metrics
#/usr/bin/df -Pk
# the next line works - report filesystem metrics without a header line
#/usr/bin/df -Pk | sed -ne '2,$p'
# the next line causes error on channel - just printing header line
/usr/bin/df -Pk | sed -ne '1p'
##echo "[memory]"
##echo "Total:`./bb-hp-memsz -p`"
##echo "Free:`./bb-hp-memsz -f`"
} | ./bb --debug aaa.bbb.ccc.ddd "@"
exit
The message body that results in a channel error is shown below. I grabbed
the snapshot on the client, because the server's client/tmp/msg.txt keeps
being overwritten with the content of the server's own local check:
client csdaj401.hpux
[df]
Filesystem 1024-blocks Used Available Capacity Mounted on
list Henrik Størner
On Fri, Sep 30, 2005 at 10:47:39AM -0500, Deiss, Mark wrote:
> I have narrowed down the fault in the client message to the first line of the df output (how strange). Below is the client-side script, cut down to report only the initial client line, the [df] block header, and the df output. If the HP-UX df header line is removed before the message is sent, there is no error on the channel.
I just had another report about hobbitd_client crashing, also with the "df"
reports. In that case I was able to track it down, and the attached patch
should fix it (the patch is on top of the current snapshot). Could you
check whether this fixes it? It might be the same problem.
Regards,
Henrik
-------------- next part --------------
--- hobbitd/hobbitd_client.c 2005/09/28 21:21:56 1.34
+++ hobbitd/hobbitd_client.c 2005/09/30 15:53:55
@@ -314,7 +314,7 @@
if (usestr && isdigit((int)*usestr)) usage = atoi(usestr); else usage = -1;
strcpy(p, bol); fsname = getcolumn(p, mntcol);
- add_disk_count(fsname);
+ if (fsname) add_disk_count(fsname);
if (fsname && (usage != -1)) {
get_disk_thresholds(hinfo, fsname, &warnlevel, &paniclevel);
--- hobbitd/client_config.c 2005/09/21 11:37:05 1.9
+++ hobbitd/client_config.c 2005/09/30 15:54:57
@@ -544,6 +544,8 @@
int ovector[10];
int result;
+ if (!pname) return;
for (pwalk = head; (pwalk); pwalk = pwalk->next) {
switch (pwalk->rule->ruletype) {
case C_PROC:
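Both hunks add the same kind of guard: do not pass on a name that could not be extracted. The sketch below illustrates that defensive pattern under stated assumptions. It is not Hobbit's getcolumn() or add_disk_count(); the functions nth_column and count_filesystems are hypothetical stand-ins, and the sample df line used to exercise them is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a column extractor: copy the
 * whitespace-separated column number col (0-based) of line into out.
 * Returns out, or NULL when the line has too few columns.
 */
char *nth_column(const char *line, int col, char *out, size_t outsz)
{
    const char *p = line;
    int i;

    assert(line != NULL && out != NULL);
    if (outsz == 0) return NULL;

    for (i = 0; ; i++) {
        size_t n;

        p += strspn(p, " \t");          /* skip leading blanks */
        if (*p == '\0') return NULL;    /* ran out of columns */
        n = strcspn(p, " \t");          /* length of this column */
        if (i == col) {
            if (n >= outsz) n = outsz - 1;   /* truncate to fit */
            memcpy(out, p, n);
            out[n] = '\0';
            return out;
        }
        p += n;
    }
}

/*
 * Count only the lines for which a mount point could be extracted,
 * instead of handing a NULL name to later code.
 */
int count_filesystems(const char *const lines[], int nlines, int mntcol)
{
    char name[256];
    int count = 0;
    int i;

    for (i = 0; i < nlines; i++) {
        if (nth_column(lines[i], mntcol, name, sizeof name)) count++;
    }
    return count;
}
```

With a line such as "/dev/vg00/lvol3 204800 54321 150479 27% /", nth_column() for column 5 yields "/". For a line with fewer than six columns it returns NULL, and count_filesystems() skips that line rather than dereferencing a missing name.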
list Mark Deiss
Even more confused now. Back on SourceForge, as far as I can tell I am using
the current (?) release of hobbitserver, 4.1.1. I am not seeing anything
under any of the SourceForge headings (CVS or patch groupings) about a more
current snapshot.
Your supplied patch failed to apply. Inspecting the patch, I see the
following references to old and patched lines in hobbitd_client.c:
- add_disk_count(fsname);
+ if (fsname) add_disk_count(fsname);
I did a global search of the entire hobbit code tree and there is no
add_disk_count function defined. I noticed that the patch context has:
if (fsname && (usage != -1)) {
whereas in the 4.1.1 code base it is:
if (usage != -1) {
I figured that a check for a valid fsname before proceeding is what you are
after, so I added the fsname test to that branch, which does the disk
threshold checking. Recompiled, retested, same error. Oh well.
For the second patch, in client_config.c, the line counts in the patch do not
add up. The closest code appears to be in the add_process_count function, but
I would not expect add_process_count to be part of processing df messages. I
guess add_disk_count is defined in this module in another version.
The RCS strings for the hobbitd_client.c and client_config.c modules in the
4.1.1 release are:
static char rcsid[] = "$Id: hobbitd_client.c,v 1.21 2005/07/25 13:57:56 henrik Exp $";
static char rcsid[] = "$Id: client_config.c,v 1.5 2005/07/24 07:50:13 henrik Exp $";
In the patch file, noticed string IDs for the patch targets of:
--- hobbitd/hobbitd_client.c 2005/09/28 21:21:56 1.34
+++ hobbitd/hobbitd_client.c 2005/09/30 15:53:55
--- hobbitd/client_config.c 2005/09/21 11:37:05 1.9
+++ hobbitd/client_config.c 2005/09/30 15:54:57
So it looks like I am working with an older release set. Using the CVS web
browser on SourceForge, I am not turning up any more recent code trees for
anything other than the hobbit client package.
list Dan Vande More
Find the latest snapshot here: http://hswn.dk/beta/hobbit-snapshot.tar.gz
(In the future, you can browse the http://hswn.dk/beta/ directory)
Then I usually just copy the Makefile over from the directory I compiled the
last one out of. Then what I do is edit <snapshot dir>/build/Makefile.rules
and take "demo-build" off the end of line 18. It looks like this:
BUILDTARGETS = lib-build common-build bbdisplay-build bbnet-build bbproxy-build docs-build build-build hobbitd-build client demo-build
After that, I'd try applying the patches he sent you.
Good luck
Dan
list Mark Deiss
Thanks, was not aware of the beta release site.