Xymon Mailing List Archive

Dumb hobbit network test question

8 messages in this thread

list Tom Kauffman · Tue, 29 Nov 2005 11:27:50 -0500 ·
I've just done my tri-annual hardware shuffle, swapping out all my
on-lease RS-6000 for brand spanking new systems. Part of this included
upgrading from AIX 5.1 to AIX 5.3.

Now, on half the systems (the last set replaced) I get errors on the
smtp and ftp tests -- typically one test every two hours. Interestingly,
all these systems barf on the same test cycle. This is obviously
something not quite right in the AIX config, but I'm at a loss on what.
I just found out today that we also have production rsh scripts that
time out on the same cycle (yeah, I know -- but they've been rsh since
year dot, and getting them to ssh is real low on the list . . )

Here's a sample of the network test error:
Service ftp on hudson is not OK : Unexpected service response
Service smtp on hudson is not OK : Unexpected service response

How can I log the actual response? I'm currently running hobbit 4.03rc1;
that's scheduled to change sometime next week.

Other suggestions?

TIA --

Tom Kauffman
NIBCO, Inc
list Tom Kauffman · Wed, 30 Nov 2005 15:38:56 -0500 ·
OK -- I used the --debug option; it wasn't as bad as I thought it would
be, the resulting log was just over 11 MB when my problem occurred and I
could turn it off.

Henrik, can you clarify what this really means?

Address=10.8.224.9:21, open=1, res=0, err=1, connecttime=0.003110,
totaltime=10.063026,
Address=10.8.224.38:21, open=1, res=0, err=0, connecttime=0.003060,
totaltime=0.028471, banner='220 wabash FTP server (Version 4.2 Sat
 Feb 5 10:12:55 CST 2005) ready.
221 Goodbye.
' (86 bytes) (good response)

2005-11-30 14:03:59 tcp_got_expected: No data in banner
2005-11-30 14:03:59 Adding to combo msg: status volga.ftp yellow <!--
[flags:OrdastILe] --> Wed Nov 30 14:03:00 2005 ftp NOT ok

This system is showing a load of 0.1 (max 2.0) on a 2-way 1.6 GHz
machine; the FTP connect time is 17.4 microseconds (avg) and peaked in
the last 48 hours at 5.2 milliseconds.

TIA

Tom Kauffman
NIBCO, Inc
list Frederic Mangeant · Wed, 30 Nov 2005 22:16:04 +0100 ·
Hi Tom
quoted from Tom Kauffman
[snip] 
Here's a sample of the network test error:
Service ftp on hudson is not OK : Unexpected service response
Service smtp on hudson is not OK : Unexpected service response
Are you using the "--checkresponse" option? I had the same "Unexpected service response" warnings until I removed it.

http://www.hswn.dk/hobbit/help/manpages/man1/bbtest-net.1.html

--checkresponse[=COLOR]
    When testing well-known services (e.g. FTP, SSH, SMTP, POP-2, POP-3, IMAP, NNTP and rsync), bbtest-net will look for a valid service-specific "OK" response. If another response is seen, this will cause the test to report a warning (yellow) status. Without this option, the response from the service is ignored.
    The optional color-name is used to select a color other than yellow for the status message when the response is wrong. E.g. "--checkresponse=red" will cause a "red" status message to be sent when the service does not respond as expected.
list Henrik Størner · Wed, 30 Nov 2005 22:58:52 +0100 ·
quoted from Tom Kauffman
On Wed, Nov 30, 2005 at 03:38:56PM -0500, Kauffman, Tom wrote:
Henrik, can you clarify what this really means?

Address=10.8.224.9:21, open=1, res=0, err=1, connecttime=0.003110, totaltime=10.063026,
"open=1" means that the connection to the server succeeded. The
interesting thing here is that it took only 0.003 seconds to get a
connection, but then Hobbit spent more than 10 seconds waiting for a
banner to appear. It never did - at least not within those 10 secs;
the "err=1" means it gave up waiting for the data and signals a timeout.
quoted from Tom Kauffman
Address=10.8.224.38:21, open=1, res=0, err=0, connecttime=0.003060, totaltime=0.028471, 
  banner='220 wabash FTP server (Version 4.2 Sat Feb 5 10:12:55 CST 2005) ready. 221 Goodbye.' (86 bytes)
This is a different server. Again, connecting takes about 0.003 secs,
but the banner appears almost immediately - the entire exchange happens
in 28 milliseconds.
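
The fields described above can be pulled out of a --debug log mechanically. The following is a minimal Python sketch, assuming the log lines look like the two samples quoted in this thread; the regular expression and the name parse_probe are illustrative and are not part of Hobbit.

```python
import re

# Pattern modelled on the two sample lines in this thread; the real
# bbtest-net debug format may contain fields not handled here.
PROBE_RE = re.compile(
    r"Address=(?P<addr>[\d.]+):(?P<port>\d+),\s*"
    r"open=(?P<open>\d),\s*res=(?P<res>\d+),\s*err=(?P<err>\d),\s*"
    r"connecttime=(?P<connect>[\d.]+),\s*totaltime=(?P<total>[\d.]+)"
)

def parse_probe(text):
    """Return a dict of probe fields, or None if the text does not match."""
    m = PROBE_RE.search(text)
    if m is None:
        return None
    connect = float(m.group("connect"))
    total = float(m.group("total"))
    return {
        "address": m.group("addr"),
        "port": int(m.group("port")),
        "connected": m.group("open") == "1",   # open=1: connection succeeded
        "timed_out": m.group("err") == "1",    # err=1: gave up waiting for data
        "connect_s": connect,
        # Time spent after the connection was made, i.e. waiting for the banner.
        "banner_wait_s": total - connect,
    }

slow = parse_probe("Address=10.8.224.9:21, open=1, res=0, err=1, "
                   "connecttime=0.003110, totaltime=10.063026,")
fast = parse_probe("Address=10.8.224.38:21, open=1, res=0, err=0, "
                   "connecttime=0.003060, totaltime=0.028471,")
for probe in (slow, fast):
    print(probe["address"], probe["timed_out"], round(probe["banner_wait_s"], 3))
```

A host that shows connected=True together with timed_out=True is the pattern to look for: the TCP handshake was fast, and the delay was entirely in the service producing its banner.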


It might be that the FTP server performs a reverse DNS lookup of the
Hobbit server's IP address when Hobbit connects to check the FTP
service. Sometimes DNS lookups take a while - maybe long enough for
Hobbit to reach the 10-second timeout. Maybe your ftp server has
a local DNS cache, and the timeout only happens when the cached DNS
entry expires and has to be refreshed.
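
One way to test the reverse-lookup theory is to time the lookup directly on the FTP host. This is a rough Python sketch rather than anything Hobbit provides; time_reverse_lookup is a made-up helper name, and 127.0.0.1 is only a placeholder for the Hobbit server's address.

```python
import socket
import time

def time_reverse_lookup(ip):
    """Return (hostname or None, elapsed seconds) for a reverse lookup of ip."""
    start = time.monotonic()
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        hostname = None
    return hostname, time.monotonic() - start

# Placeholder address; substitute the Hobbit server's IP and run this
# on the host whose FTP or SMTP test is failing.
name, elapsed = time_reverse_lookup("127.0.0.1")
print(f"{name!r} resolved in {elapsed:.3f}s")
```

If the elapsed time occasionally approaches the network test timeout, the resolver configuration on that host is the likely place to look.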

One thing you can try is to add a "--timeout=30" option to the
bbtest-net command in hobbitlaunch.cfg; that makes it wait up to 30
seconds before flagging a timeout.
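
To see what a longer timeout would buy, the connect time and the banner wait can be measured separately outside of Hobbit. The sketch below is an independent Python probe, not bbtest-net's implementation; probe_banner and its parameters are hypothetical names chosen for illustration.

```python
import socket
import time

def probe_banner(host, port, timeout=10.0):
    """Connect, then wait up to `timeout` seconds for the first bytes.

    Returns (connect_seconds, total_seconds, banner_bytes_or_None).
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        connect_s = time.monotonic() - start
        sock.settimeout(timeout)
        try:
            banner = sock.recv(1024)
        except socket.timeout:
            banner = None  # connected, but no banner within the timeout
    return connect_s, time.monotonic() - start, banner
```

Calling probe_banner("hudson", 21, timeout=30) during a failing cycle would show whether the banner eventually arrives, and how long after the connection it does.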


Regards,
Henrik
list Tom Kauffman · Wed, 30 Nov 2005 17:51:57 -0500 ·
Oh, how it helps to have additional minds on these things.

Reverse lookup looks to be the culprit. I cloned all these systems in a
bit of a hurry -- and the cloning changed the dns resolution config to
point at (in order) my D/R hotsite Win2003 domain controller (active),
my D/R hotsite D/R test domain controller (non-existent), my D/R hotsite
hobbit system (also not there), and THEN my local DNS server -- so if
the D/R DC didn't answer (wonder why IT goes away every two hours?) I
would go through multiple retries to non-existent systems.

One more item added to the system clone checklist.

Thanks!

Tom
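
A check along these lines could go on a clone checklist. It is a minimal Python sketch, assuming the resolver order is given by resolv.conf-style "nameserver" lines; the helper names, the sample file contents and the 192.0.2.x addresses are all hypothetical.

```python
import socket
import struct

def nameservers(resolv_conf_text):
    """Return nameserver addresses in the order they are listed."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers

def answers_dns(server, timeout=2.0, port=53):
    """Send one minimal DNS query (NS records for the root) over IPv4 UDP
    and report whether any reply arrives within `timeout` seconds."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    # Question: root name (single zero byte), QTYPE=NS (2), QCLASS=IN (1).
    question = b"\x00" + struct.pack("!HH", 2, 1)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(header + question, (server, port))
            sock.recvfrom(512)
            return True
        except OSError:
            # Covers timeouts as well as unreachable or invalid addresses.
            return False

# Hypothetical file contents; the addresses are documentation placeholders.
sample = """\
nameserver 192.0.2.10   # D/R site DC
nameserver 192.0.2.11
nameserver 192.0.2.12
"""
print(nameservers(sample))
```

Running answers_dns over each entry in order shows how many dead servers a lookup has to time out on before it reaches one that responds.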
list Vernon Everett · Thu, 1 Dec 2005 13:40:22 +0800 ·
Hi

Does anybody know what the Hobbit status numbers mean?
I am getting this in my hobbitlaunch.log
2005-12-01 13:27:41 Task hobbitclient terminated, status 208

Regards
    Vernon

No trees were killed in the creation of this message. However, many
electrons were terribly inconvenienced.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

NOTICE: This message and any attachments are confidential and may contain copyright material of Australian Finance Group Limited or a third party. It is intended solely for the purpose of the addressee and any other named recipient. If you are not the intended recipient, any use, distribution, disclosure or copying of this message is strictly prohibited. The confidentiality attached to this message is not waived or lost by reason of the mistaken transmission or delivery to any unintended party. If you have received this message in error, please notify the author immediately or contact Australian Finance Group on +61 8 9420 7888.
list Henrik Størner · Thu, 1 Dec 2005 07:39:02 +0100 ·
quoted from Vernon Everett
On Thu, Dec 01, 2005 at 01:40:22PM +0800, Vernon Everett wrote:
Does anybody know what the Hobbit status numbers mean?
I am getting this in my hobbitlaunch.log
2005-12-01 13:27:41 Task hobbitclient terminated, status 208
It's the exit code returned by the command you run. "208" doesn't
sound right; the hobbitclient.sh script normally returns a 0.

Henrik
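
For anyone wanting to reproduce how a launcher sees a child's exit code, here is a small Python sketch. It does not show how hobbitlaunch itself computes the number it logs; describe_exit is a hypothetical helper, and the child process is a stand-in that simply exits with 208.

```python
import subprocess
import sys

def describe_exit(returncode):
    """Describe a subprocess return code the way Python reports it."""
    if returncode < 0:
        # Python reports death by signal N as -N.
        return f"killed by signal {-returncode}"
    if returncode == 0:
        return "exited normally (0)"
    return f"exited with code {returncode}"

# Stand-in child that exits with 208, mirroring the log line above.
child = subprocess.run([sys.executable, "-c", "import sys; sys.exit(208)"])
print(describe_exit(child.returncode))
```

Running the real client script by hand and checking its exit status the same way is a quick check on whether the non-zero value comes from the script or from something it calls.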
list Vernon Everett · Thu, 1 Dec 2005 14:46:20 +0800 ·
I would have to agree it doesn't sound right. :-)

It's also core dumping, and not showing the status page.

Any ideas?

Regards
    Vernon 
