Xymon Mailing List Archive

Debugging help: bbtest-net gets http test timing wrong

20 messages in this thread

list Alan Sparks · Fri, 13 Jun 2008 18:20:57 -0600 ·
Have a new install of Hobbit (4.2, tried 4.3 snap as well) on a fresh install of CentOS 4.6 x86_64, up to date on patches.  I have a problem with HTTP tests on "random" web servers that I just can't figure out.

I have about 64 of my hosts in the bb-hosts on this server, and have http tests defined for these servers.  On most of these servers, Hobbit is reporting the "Seconds:" for the response at 3 seconds.  It seems that it is inconsistent -- one cycle to the next, the 3-second response may move to a different set of servers.

The http: tests are defined using the IP address of the server - no server name (so no DNS lookup).

I've run a loop of tests on the same URL using wget and with curl, and used my browser and Telnet to connect to the same URL.  I consistently get a response time of about 0.2 seconds maximum from the servers.
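
For reference, a sequential probe loop of this kind could be sketched in Python roughly as follows. This is an illustration only, not the wget/curl loop that was actually run; the names time_http_get and probe_loop are hypothetical, and it assumes the server answers a plain HTTP/1.0 GET and then closes the connection. It reports connect time and total time separately, so the numbers can be compared with the connecttime and totaltime values Hobbit logs.

```python
import socket
import time


def time_http_get(host, port=80, path="/", timeout=10.0):
    """Return (connect_seconds, total_seconds) for one HTTP/1.0 GET."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        connected = time.monotonic()
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        # HTTP/1.0 without keep-alive: the server closes the connection
        # when the response is complete, so read until EOF.
        while sock.recv(65536):
            pass
    end = time.monotonic()
    return connected - start, end - start


def probe_loop(host, count=20, **kwargs):
    """Run `count` probes one after another and return their timings."""
    return [time_http_get(host, **kwargs) for _ in range(count)]
```

Calling probe_loop("10.1.5.17", count=50) and looking at the largest total time would show whether a strictly sequential client ever sees anything near 3 seconds.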

The bbnet entry in hobbitlaunch.cfg looks like:
CMD bbtest-net --report --ping --checkresponse --debug

With the debugging turned on, I see the following entries periodically in the network test log:
Address=10.1.5.17:80, open=1, res=0, err=0, connecttime=0.002965, totaltime=3.006810,
Address=10.1.5.18:80, open=1, res=0, err=0, connecttime=0.002956, totaltime=3.007413,
Address=10.1.24.67:80, open=1, res=0, err=0, connecttime=0.002860, totaltime=3.007120,
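
A small script can pull the slow entries out of a log like this. The sketch below is hypothetical: the regular expression is derived only from the three sample lines above (the field layout may differ in other versions), and the 1-second threshold is an arbitrary assumption.

```python
import re

# Field layout assumed from the three debug lines quoted above.
LINE_RE = re.compile(
    r"Address=(?P<addr>[\d.]+):(?P<port>\d+),\s*open=(?P<open>\d+),\s*"
    r"res=(?P<res>-?\d+),\s*err=(?P<err>-?\d+),\s*"
    r"connecttime=(?P<connect>[\d.]+),\s*totaltime=(?P<total>[\d.]+)"
)


def slow_entries(lines, threshold=1.0):
    """Return (address, connecttime, totaltime) for lines slower than threshold."""
    hits = []
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # not a per-connection debug line
        connect = float(m.group("connect"))
        total = float(m.group("total"))
        if total > threshold:
            hits.append((m.group("addr"), connect, total))
    return hits


if __name__ == "__main__":
    sample = [
        "Address=10.1.5.17:80, open=1, res=0, err=0, connecttime=0.002965, totaltime=3.006810,",
        "Address=10.1.5.18:80, open=1, res=0, err=0, connecttime=0.002956, totaltime=3.007413,",
        "Address=10.1.24.67:80, open=1, res=0, err=0, connecttime=0.002860, totaltime=3.007120,",
    ]
    for addr, connect, total in slow_entries(sample):
        # The gap between the two values is the time spent after the connect.
        print(f"{addr}: connect={connect:.6f}s total={total:.6f}s "
              f"after-connect={total - connect:.6f}s")
```

In all three samples the connect completes in about 3 ms, so nearly the whole 3 seconds falls between the connect and the end of the response.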

The problem does not affect the same hosts each time.  The number of affected hosts usually changes from cycle to cycle; sometimes the same servers are hit, but often different ones.

I've tried the following to see if anything will help:
* Reducing the number of hosts.  If I only have a couple or three in the bb-hosts, the problem doesn't manifest.
* Recompiling.  Doesn't help.
* Changing the test URL.  Doesn't help.
* Adding a --concurrency= option to the launch.  If I use a concurrency of 1, the problem does not manifest.

Setting the concurrency to 1 to fix the problem isn't an option, but makes me think something is getting really mixed up in the select() processing in bbnet.
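
One way to check whether concurrency alone triggers the delay is an independent concurrent probe. The sketch below is not bbtest-net's code and makes no claim about how bbtest-net works internally. It is a minimal, hypothetical cross-check (the name concurrent_probe is invented) that opens all connections at once with non-blocking sockets, assuming a Unix-like host, IPv4 targets, and servers that close the connection after an HTTP/1.0 response.

```python
import selectors
import socket
import time


def concurrent_probe(targets, path="/", timeout=10.0):
    """Probe all (host, port) targets at once.

    Returns {target: (connect_seconds, total_seconds)}; targets that fail
    or do not finish before the timeout are left out.
    """
    sel = selectors.DefaultSelector()
    state = {}
    for host, port in targets:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setblocking(False)
        started = time.monotonic()
        sock.connect_ex((host, port))  # returns at once; completion shows as writable
        state[sock] = {"target": (host, port), "start": started, "connected": None}
        sel.register(sock, selectors.EVENT_WRITE)

    results = {}
    deadline = time.monotonic() + timeout
    while state and time.monotonic() < deadline:
        for key, events in sel.select(timeout=0.1):
            sock = key.fileobj
            info = state[sock]
            now = time.monotonic()
            done = False
            try:
                if events & selectors.EVENT_WRITE:
                    if sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR):
                        done = True  # connect failed
                    else:
                        info["connected"] = now
                        host = info["target"][0]
                        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
                        sock.send(request.encode("ascii"))
                        sel.modify(sock, selectors.EVENT_READ)
                elif events & selectors.EVENT_READ:
                    if not sock.recv(65536):  # EOF: response complete
                        results[info["target"]] = (
                            info["connected"] - info["start"],
                            now - info["start"],
                        )
                        done = True
            except BlockingIOError:
                continue  # spurious wakeup; wait for the next event
            except OSError:
                done = True
            if done:
                sel.unregister(sock)
                sock.close()
                del state[sock]

    for sock in state:  # anything still pending ran into the deadline
        sel.unregister(sock)
        sock.close()
    sel.close()
    return results
```

If a probe like this, run from the Hobbit server against the same 64 addresses, also shows a cluster of roughly 3-second totals, the delay is unlikely to be specific to Hobbit's timing code.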

Does anyone have any ideas how to diagnose where Hobbit is coming up with a 3-second latency, when none of my test tools running off the same server can duplicate the same timing?

Thanks for any ideas, this is really baffling me.
-Alan
list Alan Sparks · Sun, 15 Jun 2008 18:57:53 -0600 ·
Continuing to try to debug this problem, have tried about everything I can to resolve the issues with http probes.  Including:
* Complete rebuild of the server with CentOS 4.6 and recompile of Hobbit.  Same issues.
* Removing everything from /etc/sysctl.conf, rebooting.  Same issues.
* Manipulating the httpd.conf configs on remote servers, forcing HTTP/1.0, removing ETags, creating a very simple index page to test against.  Same issues.
* Upgrading Apache on sample remote server to Apache 2.2.8 (most are 2.2.4).  Same issue.
* Recompiling Hobbit with debugging flags, to make sure the optimizer is not applied.  Same issue.

The only two servers I have that seem to work consistently well are a pair of Apache 2.0.52 servers.  The 2.2.4+ servers all seem to give Hobbit issues.  Although, again, repeated curl or wget probe cycles against the servers from the Hobbit server never show more than a 0.2s response time.

But, Hobbit continues to report things like:

http://10.1.17.251/ - OK

HTTP/1.1 200 OK
Date: Mon, 16 Jun 2008 00:54:03 GMT
Server: Apache/2.2.8 (EL)
Last-Modified: Sun, 15 Jun 2008 00:37:00 GMT
ETag: "7c809f-9b-44fa9b5806300"
Accept-Ranges: bytes
Content-Length: 155
Connection: close
Content-Type: text/html; charset=UTF-8


Seconds:     3.00

I can't come up with anything other than Hobbit as a cause. But is there anything I can do to trace what is happening internally to get past this problem?  Any ideas at all would really be appreciated.  Thanks in advance.
-Alan
list Tim McCloskey · Sun, 15 Jun 2008 21:13:43 -0700 ·
I'll take a stab at this but you may have already looked at the things I wonder about or they may not help in any way. I've only thought about this for a couple of minutes so I could be way off.

You have stated there are a couple of servers that seem to respond differently.  Can you, 100% consistently, recreate the proper|improper response from the web boxes?  If so, look at the changelog between Apache 2.0 and 2.2 (assuming that those servers, 2.0 and 2.2, are on the same network and that all of the media settings match, i.e. 100Fdx 100Fdx, etc.).

Are you _sure_ you are not using DNS in some fashion, perhaps reverse lookups?  Or perhaps the newer Apache config file contains some lookup setting.  (Can you get away with using the same httpd.conf on the 2.0 and 2.2 boxes?)

What does the network traffic and connections look like on each of the servers?   Have you tried running tcpdump on one of the web boxes to see if there are any clues there?

I don't recall if hobbit uses wget for its HTTP GETs.  Are all the servers using the same settings in resolv.conf and nsswitch.conf?

There are probably other things to check but start with making sure DNS is not involved, even if you think it is not.
list Tim McCloskey · Sun, 15 Jun 2008 21:40:56 -0700 ·
What do you have for
UseCanonicalName
in the apache 2.0 boxes?
list Alan Sparks · Sun, 15 Jun 2008 22:42:46 -0600 ·
On the Apache servers, they are configured with HostnameLookups Off.  My own measurements using "conventional" tools (like a loop of curls and wgets) consistently show responses <0.2s.  The Hobbit problem is that a) the problem randomly affects the servers in the list (I have about 64 servers on the test server), and b) no server that Hobbit suddenly reports as "slow" ever appears to be slow from other external tests.

Configurations are matched as closely as possible, accounting for module and other differences between Apache 2.0 and 2.2.  I'm very sure DNS lookups on the Web server end cannot account for this, as it /should/ show in logs, and affect non-Hobbit probes as well.

It appears that Hobbit implements HTTP testing itself, in the bbtest-net codebase.  No external tools are used.  Yes, resolver and NSS configs are the same.  And the HTTP tests are specifically targeted at IP addresses, not host names, so there /should/ not be a DNS lookup involved in the test connection, as far as I can tell from the code...

And yeah, the tcpdump on both ends is planned for Monday.  I want to somehow prove the response is actually showing up sooner than the test says...
-Alan
list Alan Sparks · Sun, 15 Jun 2008 22:43:45 -0600 ·
UseCanonicalName is off, and HostnameLookups is off, on every server, regardless of version.
-Alan
list Tim McCloskey · Sun, 15 Jun 2008 22:18:47 -0700 ·
I get that wget/curl always work.  Not sure what resolver settings may be implemented differently for hobbit.

Still thinking this may be unrelated to hobbit (even though wget/curl work fine for you).  We have many Apache boxes spanning multiple networks running httpd versions 1.3, 2.0 and 2.2 that hobbit (4.2 with the allinone patch) likes just fine and reports accurate times for (Seconds: 0.nn).  We also have fairly proper forward and reverse DNS records for the systems involved.

I can't imagine hobbit parsing the wrong response times, but if that is the case I wonder what external libraries are used (not hobbit provided libs, as ours parse fine and are likely the same as yours).

Anyway, good luck with the tcpdump.

Regards,

Tim
list Paul Krash · Mon, 16 Jun 2008 05:40:53 -0500 ·
Since latency appears to be a problem: are all devices configured to use NTP?
I had an Apache server in the same boat last week; it was drifting often, then would be corrected on the next round of NTP syncs.
Adjusting the sync time to be more frequent, and replacing the motherboard's CMOS battery, did the trick.

I also ended up discovering a new enhancement to the server that was topping out resources and made it slow to respond to requests.

Your mileage may vary.

Best,


Paul Krash, system administrator, Exegy, Inc.; XXX-XXX-XXXX x 666
 This e-mail and any documents accompanying it may contain legally privileged and/or confidential information belonging to Exegy, Inc. Such information may be protected from disclosure by law. The information is intended for use by only the addressee. If you are not the intended recipient, you are hereby notified that any disclosure or use of the information is strictly prohibited. If you have received this e-mail in error, please immediately contact the sender by e-mail or phone regarding instructions for return or destruction and do not use or disclose the content to others.
list Alan Sparks · Mon, 16 Jun 2008 19:49:13 -0600 ·
tcpdumps show a couple of interesting points.

1) There are definitely no DNS lookups occurring as a consequence of the Hobbit probes.  No port 53 traffic out.

2) The packets from the Hobbit server, and the incoming packets to the Apache server, sometimes look like:

15:20:01.160095 IP (tos 0x0, ttl  62, id 31129, offset 0, flags [DF], proto 6, length: 60) hobbit.45116 > target.http: S [tcp sum ok] 265769416:265769416(0) win 17520 <mss 8760,sackOK,timestamp 143665233 0,nop,wscale 2>

15:20:04.159715 IP (tos 0x0, ttl  62, id 31131, offset 0, flags [DF], proto 6, length: 60) hobbit.45116 > target.http: S [tcp sum ok] 265769416:265769416(0) win 17520 <mss 8760,sackOK,timestamp 143668233 0,nop,wscale 2>

15:20:04.160223 IP (tos 0x0, ttl  62, id 31133, offset 0, flags [DF], proto 6, length: 40) hobbit.45116 > target.http: . [tcp sum ok] 265769417:265769417(0) ack 1051782089 win 17520

So that accounts for three seconds... it appears there are 2 SYN packets, but the first isn't getting processed and there's a 3-second delay to the next SYN (which gets ACKed).  I don't know why this happens only with the Hobbit connections... and I don't know why the first SYN seems to be getting ignored.  Server is not at all busy.

-Alan
Tim McCloskey wrote:
I get that wget/curl always work.  Not sure what resolver settings may be implemented differently for hobbit.

Still thinking this may be unrelated to hobbit (even though wget/curl work fine for you).  We have many apache boxes spanning multiple networks running httpd versions 1.3, 2.0 and 2.2 that hobbit(4.2 with allinone patch) likes just fine and reports accurate times (Seconds: 0.nn).  We also have fairly proper forward and reverse DNS records for the systems involved.

I can't imagine hobbit parsing the wrong response times, but if that is the case I wonder what external libraries are used (not hobbit provided libs, as ours parse fine and are likely the same as yours).

Anyway, good luck with the tcpdump.

Regards,

Tim


Alan Sparks wrote:
UseCanonicalName is off, and HostNameLookup is off, on every server, regardless of version.
-Alan

Tim McCloskey wrote:
What do you have for
UseCanonicalName
in the apache 2.0 boxes?
list Tim McCloskey · Mon, 16 Jun 2008 19:42:21 -0700 ·
So, the hobbit server initiates a SYN to the web box.  The first SYN is lost in space, and hobbit hits the web box again 3 seconds later and gets an ACK.  Correct?

What is seen on the web side for the first SYN from hobbit and the subsequent response of the web box?  Does the web side even see the initial SYN request?  If so perhaps you could toss something together to strace the httpd process during this time.

Regards,

Tim

Ps. Some other off the wall questions just to sidetrack you even more:
Are you running iptables or arptables, and have you explicitly shut off IPv6 in modules.conf?  How about SELinux: disabled or enforcing?  Yeah, I know, they don't make sense, but it helps me to understand the environment.
list Alan Sparks · Tue, 17 Jun 2008 17:00:04 -0600 ·
After some Googling, I have added "AcceptFilter http none" directives to the Apache 2.2 servers, which hasn't really helped anything...

Perhaps I should ask:  Can anyone verify Hobbit works correctly on a 64-bit system?  Not should, but does, on a Centos 4 or RHEL 4 x86_64 install?

I see a lot of debugging trace stuff (dbgprint calls) in the contest and httptest code.  Can anyone tell me how to enable it to trace what Hobbit is doing?

Am really at a loss.  This can't be rocket science to get it to probe HTTP correctly.  But a week later, I still cannot get it to match any other monitoring tool's results.
-Alan
list Shane Skoglund · Wed, 18 Jun 2008 09:02:35 -0500 ·
Did you rebuild the hobbit binaries on a 64-bit machine?  Or did you install
all the 32-bit compat libs?


On Tue, Jun 17, 2008 at 6:00 PM, Alan Sparks <user-8f2174fd8b66@xymon.invalid> wrote:
After some Googling, I have added "AcceptFilter http none" directives to
the Apache 2.2 servers, which hasn't really helped anything...

Perhaps I should ask:  Can anyone verify Hobbit works correctly on a 64-bit
system?  Not should, but does, on a Centos 4 or RHEL 4 x86_64 install?

I see a lot of debugging trace stuff (dbgprint calls) in the contest and
httptest code.  Can anyone tell me how to enable it to trace what Hobbit is
doing?

Am really at a loss.  This can't be rocket science to get it to probe HTTP
correctly.  But a week later, I still cannot get it to match any other
monitoring tool's results.
-Alan

list Tom Kauffman · Wed, 18 Jun 2008 10:34:52 -0400 ·
I can't speak for Red Hat or Centos, but I'm running hobbit on 3 x86_64 SUSE systems with no issues. Http tests to both Apache and IIS servers, with Apache running on Win2003, linux, and AIX all run properly.

I did explicitly build hobbit on all three systems from the same source copy, but that was the extent of my 'customization' by server.

Tom

-----Original Message-----
From: Alan Sparks [mailto:user-8f2174fd8b66@xymon.invalid]
Sent: Tuesday, June 17, 2008 7:00 PM
To: user-ae9b8668bcde@xymon.invalid
Subject: Re: [hobbit] Debugging help: bbtest-net gets http test timing wrong

After some Googling, I have added "AcceptFilter http none" directives to
the Apache 2.2 servers, which hasn't really helped anything...

Perhaps I should ask:  Can anyone verify Hobbit works correctly on a
64-bit system?  Not should, but does, on a Centos 4 or RHEL 4 x86_64
install?

I see a lot of debugging trace stuff (dbgprint calls) in the contest and
httptest code.  Can anyone tell me how to enable it to trace what Hobbit
is doing?

Am really at a loss.  This can't be rocket science to get it to probe
HTTP correctly.  But a week later, I still cannot get it to match any
other monitoring tool's results.
-Alan

CONFIDENTIALITY NOTICE:  This email and any attachments are for the 
exclusive and confidential use of the intended recipient.  If you are not
the intended recipient, please do not read, distribute or take action in 
reliance upon this message. If you have received this in error, please 
notify us immediately by return email and promptly delete this message 
and its attachments from your computer system. We do not waive  
attorney-client or work product privilege by the transmission of this
message.
list Buchan Milne · Wed, 18 Jun 2008 22:12:45 +0200 ·
On Wednesday 18 June 2008 01:00:04 Alan Sparks wrote:
After some Googling, I have added "AcceptFilter http none" directives to
the Apache 2.2 servers, which hasn't really helped anything...

Perhaps I should ask:  Can anyone verify Hobbit works correctly on a
64-bit system?  Not should, but does, on a Centos 4 or RHEL 4 x86_64
install?
RHEL5 x86_64 and Mandriva x86_64, no issues with http (but hobbitfetch is 
certainly dying a *lot* more often than it did on RHEL4 i386).

Regards,
Buchan
list Shane Skoglund · Thu, 19 Jun 2008 07:33:49 -0500 ·
I use it on FC9 x86_64, built from source, without any problems.  I
forgot the machine was x86_64 and tried to use an i386-compiled version; it
didn't work very well.



list Alan Sparks · Thu, 19 Jun 2008 20:15:01 -0600 ·
I see where the problem seems to be occurring.  But for my life I can't understand why.

Packet traces from the Hobbit server and the Web servers during the 3-second delays show that Hobbit connects and gets an immediate answer from the server (within milliseconds).  But the servers show that Hobbit does not close the connection (no FIN packet is sent or ACKed) for 3 seconds.

Looking at the bb-network debugging logging, I see that the select() call sleeps for 3 seconds before returning in these cases.  So the only conclusion I can arrive at is that select() doesn't return with the active file descriptors on schedule for some bizarre reason.

For a desperation test, I forced the receive buffer on the sockets to a small number (1024 bytes):
                        if (sockok) {
                                int size = 1024;
                                res = setsockopt(nextinqueue->fd,
                                        SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));

This sort of works.  The select() no longer hangs, and the HTTP tests start returning "normal"-ish results, i.e. numbers that match curl and wget statistics.

But, it messes with numbers for other Web servers, the ones that return a page significantly larger than 1024 bytes.

Like I said, I just can't get it.  Hobbit or CentOS?  There's nothing odd about this build: a generic CentOS 4.6 x86_64 install, Hobbit 4.2 with allinone patch, built for x86_64.

Any suggestions at all?  If this isn't the right place to ask, where would be?  I can't get my hands around why the only thing that I can't get to work here is Hobbit...

Thanks for your indulgence.  I really wish I could fix this.
-Alan
list Vernon Everett · Fri, 20 Jun 2008 12:15:58 +0800 ·
Hi Henrik

A feature request for you.

Most unix-like operating systems now support the -h switch on the df
command.
I am sure I do not need to tell you, -h displays the output in human
readable format.

Example
# df -h
Filesystem                      size   used  avail capacity  Mounted on
/dev/vx/dsk/bootdg/rootvol      7.9G   3.3G   4.5G    43%    /
swap                            7.9G   1.4M   7.9G     1%    /etc/svc/volatile
/dev/vx/dsk/bootdg/var          7.9G   4.5G   3.3G    58%    /var
/dev/vx/dsk/bootdg/opt          7.9G   1.7G   6.1G    22%    /opt
/dev/vx/dsk/int/vol0            190G   147G    40G    79%    /local/dsk/vol0
/dev/vx/dsk/6540b/dsu3          7.3T   4.6T   2.6T    64%    /local/dsk/dsu3
/dev/vx/dsk/6540b/dsu4          7.3T   5.2T   2.1T    72%    /local/dsk/dsu4

Not so good if you want to do calculations, but far more readable
compared to  

# df -k
Filesystem                      kbytes       used      avail capacity  Mounted on
/dev/vx/dsk/bootdg/rootvol     8262869    3448657    4731584      43%    /
swap                           8091848       1464    8090384       1%    /etc/svc/volatile
/dev/vx/dsk/bootdg/var         8262869    4713464    3466777      58%    /var
/dev/vx/dsk/bootdg/opt         8262869    1777713    6402528      22%    /opt
/dev/vx/dsk/int/vol0         199229440  153301977   43058152      79%    /local/dsk/vol0
/dev/vx/dsk/6540b/dsu3      7803789312 4968049472 2813588504      64%    /local/dsk/dsu3
/dev/vx/dsk/6540b/dsu4      7803789312 5579034264 2207423576      72%    /local/dsk/dsu4

Is it possible to have the calculations and graphs based on the standard
df -k output, but have the display use df -h?

It might require 2 iterations of the df command at client level, one for
display, and another for data.
I was thinking, maybe even add another definition in the client configs.
DF_DISP="df -h", which would then allow any switches to be added,
depending on preference.

I have tried changing the DF= command definition to df -h in the config,
but that seems to break some of the calculations at server side.

Regards
    Vernon


NOTICE: This email and any attachments are confidential. 
They may contain legally privileged information or 
copyright material. You must not read, copy, use or 
disclose them without authorisation. If you are not an 
intended recipient, please contact us at once by return 
email and then delete both messages and all attachments.
list Alan Sparks · Fri, 20 Jun 2008 16:37:35 -0600 ·
Does exactly the same thing on a fresh install of CentOS 5, x86_64. All built by hand.
-Alan
list Joshua Krause · Mon, 23 Jun 2008 07:55:49 -0400 ·
All I did was modify the hobbitclient-linux.sh file on my Linux boxes.
You will see an entry for df -Pl there; I changed it to df -Ph, and my graphs
and page now show human-readable values.

-Josh
list Alan Sparks · Wed, 23 Jul 2008 15:52:50 -0600 ·
So I've been trying for a month to get this working, to no avail, and have pretty much exhausted everything to figure out why Hobbit randomly gets 3-second return times on HTTP tests.

I'm even willing to call it a kernel problem, a problem with select() -- but it happens in multiple reasonably-contemporary kernels.

I've tried:
* CentOS 4.6 my standard build, CentOS 4.6 out-of-box, CentOS 5.1 out of box -- all have same problem.
* Disabled ARES.  Checked my DNS servers, they are answering fast.  Besides, the tests are against IP addresses, not host names.
* Removed all sysctl settings, let them default.  No change.
* Experimented with concurrency settings on [bbnet].  Doesn't help.
* Ran tcpdumps between the web servers and Hobbit server.  The tcpdumps indicate the web server is always answering immediately and sending the response... but the FIN packet (when Hobbit completes the test) is delayed 3 seconds.

This issue tends to move around from host to host.  It seems to affect Web servers that are sending static HTML pages, and all of them less than 4000 bytes (many about 155 bytes).

As before, testing with other tools on the box show no issues with network connectivity or responses against the same servers.

The /only/ thing that has come close to helping is adding a setsockopt() to bbtest-net, to set the receive buffer to 1024 bytes.  This seems to override something that helps the select() call return better somehow.  It's not a reliable or even sensible solution.  It also does not work on the 2.6.18 kernel on CentOS 5.1...

I'm really stumped and at the end of the rope on this.  Has anyone had anything that looks like this?
-Alan

Alan Sparks wrote:
Does exactly the same thing on a fresh install of CentOS 5, x86_64. All built by hand.
-Alan

Alan Sparks wrote:
I see where the problem seems to be occurring.  But for my life I can't understand why.

Packet traces from the Hobbit server and the Web servers showing the 3-second delays show that Hobbit connects, and gets an immediate answer from the server (milliseconds).  But the servers show that Hobbit does not close the connection (FIN packet sent and ACKed) for 3 seconds.

Looking at the bb-network debugging logging, I see that the select() call sleeps for 3 seconds before returning in these cases.  So the only conclusion I can arrive at is that select() doesn't return with the active file descriptors on schedule for some bizarre reason.
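
One way to separate a select() problem from a network problem is to time select() on a descriptor that is already known to be readable. The sketch below is not taken from bbtest-net. The names timed_wait_readable and select_selfcheck are hypothetical, and the pipe only stands in for a socket with data waiting.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Wait until fd is readable or timeout_sec seconds pass. Returns the
 * result of select() and stores the seconds spent waiting in *elapsed. */
int timed_wait_readable(int fd, int timeout_sec, double *elapsed)
{
    fd_set rfds;
    struct timeval tmo, t0, t1;
    int n;

    assert(elapsed != NULL && fd >= 0 && fd < FD_SETSIZE);

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tmo.tv_sec = timeout_sec;
    tmo.tv_usec = 0;

    gettimeofday(&t0, NULL);
    n = select(fd + 1, &rfds, NULL, NULL, &tmo);
    gettimeofday(&t1, NULL);

    *elapsed = (double)(t1.tv_sec - t0.tv_sec)
             + (double)(t1.tv_usec - t0.tv_usec) / 1e6;
    return n;
}

/* Self-check: a pipe that already holds one byte should be reported
 * readable at once. Returns the seconds waited, or -1.0 on failure. */
double select_selfcheck(void)
{
    int p[2];
    double elapsed = -1.0;

    if (pipe(p) != 0)
        return -1.0;
    if (write(p[1], "x", 1) == 1 &&
        timed_wait_readable(p[0], 5, &elapsed) != 1)
        elapsed = -1.0;
    close(p[0]);
    close(p[1]);
    return elapsed;
}
```

If select_selfcheck() returns a value close to zero on the Hobbit server, select() itself wakes up promptly for ready descriptors, and a 3-second wait in the test loop would more likely mean that nothing had arrived on the socket yet.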

For a desperation test, I forced the receive buffer on the sockets to a small number (1024 bytes):
    if (sockok) {
            int size = 1024;
            res = setsockopt(nextinqueue->fd,
                    SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));

This sort of works.  The select() no longer hangs, and the HTTP tests start returning "normal"-ish results, i.e. numbers that match curl and wget statistics.

But, it messes with numbers for other Web servers, the ones that return a page significantly larger than 1024 bytes.
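
For anyone repeating this experiment, it may help to read the option back after setting it, because the kernel does not necessarily apply the requested size as given. On Linux, socket(7) documents that the kernel doubles the value passed to setsockopt() and enforces a minimum. A minimal standalone sketch, with set_rcvbuf as a hypothetical helper name that does not exist in bbtest-net:

```c
#include <assert.h>
#include <sys/socket.h>

/* Hypothetical helper: request a receive buffer of `size` bytes on fd
 * and return the size the kernel reports afterwards, or -1 on error. */
int set_rcvbuf(int fd, int size)
{
    int effective = 0;
    socklen_t len = sizeof(effective);

    assert(size > 0);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) != 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &effective, &len) != 0)
        return -1;
    return effective;
}
```

Logging the returned value next to the test result would show which buffer size was really in effect when the timing changed.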

Like I said, I just can't get it.  Hobbit or CentOS?  There's nothing odd about this build, a generic CentOS 4.6 x86_64 build, Hobbit 4.2 with allinone patch, built for x86_64.

Any suggestions at all?  If this isn't the right place to ask, where would be?  I can't get my hands around why the only thing that I can't get to work here is Hobbit...

Thanks for your indulgence.  I really wish I could fix this.
-Alan


Alan Sparks wrote:
After some Googling, I have added "AcceptFilter http none" directives to the Apache 2.2 servers, which hasn't really helped anything...

Perhaps I should ask:  Can anyone verify Hobbit works correctly on a 64-bit system?  Not should, but does, on a CentOS 4 or RHEL 4 x86_64 install?

I see a lot of debugging trace stuff (dbgprint calls) in the contest and httptest code.  Can anyone tell me how to enable it to trace what Hobbit is doing?

Am really at a loss.  This can't be rocket science to get it to probe HTTP correctly.  But a week later, I still cannot get it to match any other monitoring tool's results.
-Alan

Alan Sparks wrote:
tcpdumps show a couple of interesting points.

1) There are definitely no DNS lookups occurring as a consequence of the Hobbit probes.  No port 53 traffic out.

2) The packets from the Hobbit server, and the incoming packets to the Apache server, sometimes look like:

15:20:01.160095 IP (tos 0x0, ttl  62, id 31129, offset 0, flags [DF], proto 6, length: 60) hobbit.45116 > target.http: S [tcp sum ok] 265769416:265769416(0) win 17520 <mss 8760,sackOK,timestamp 143665233 0,nop,wscale 2>

15:20:04.159715 IP (tos 0x0, ttl  62, id 31131, offset 0, flags [DF], proto 6, length: 60) hobbit.45116 > target.http: S [tcp sum ok] 265769416:265769416(0) win 17520 <mss 8760,sackOK,timestamp 143668233 0,nop,wscale 2>

15:20:04.160223 IP (tos 0x0, ttl  62, id 31133, offset 0, flags [DF], proto 6, length: 40) hobbit.45116 > target.http: . [tcp sum ok] 265769417:265769417(0) ack 1051782089 win 17520

So that accounts for three seconds... it appears there are 2 SYN packets, but the first isn't getting processed and there's a 3-second delay to the next SYN (which gets ACKed).  I don't know why this happens only with the Hobbit connections... and I don't know why the first SYN seems to be getting ignored.  Server is not at all busy.
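
The gap between the two SYN packets can be checked from the tcpdump timestamps. A gap of almost exactly three seconds would be consistent with the client retransmitting an unanswered SYN after the 3-second initial retransmission timeout recommended by RFC 1122, rather than with a slow web server. The sketch below only does the arithmetic. ts_to_usec is a hypothetical helper name, and it assumes tcpdump's default HH:MM:SS.uuuuuu timestamp format.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: convert a tcpdump "HH:MM:SS.uuuuuu" timestamp
 * to microseconds since midnight. Returns -1 if the text does not
 * parse. Assumes exactly six fractional digits. */
long long ts_to_usec(const char *ts)
{
    int h, m, s;
    long us;

    assert(ts != NULL);

    if (sscanf(ts, "%d:%d:%d.%6ld", &h, &m, &s, &us) != 4)
        return -1;
    return (((long long)h * 60 + m) * 60 + s) * 1000000LL + us;
}
```

For the trace above, ts_to_usec("15:20:04.159715") - ts_to_usec("15:20:01.160095") gives 2999620 microseconds, and the ACK at 15:20:04.160223 follows the second SYN by 508 microseconds.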

-Alan
Tim McCloskey wrote:
I get that wget/curl always work.  Not sure what resolver settings may be implemented differently for hobbit.

Still thinking this may be unrelated to hobbit (even though wget/curl work fine for you).  We have many apache boxes spanning multiple networks running httpd versions 1.3, 2.0 and 2.2 that hobbit (4.2 with allinone patch) likes just fine and reports accurate times (Seconds: 0.nn).  We also have fairly proper forward and reverse DNS records for the systems involved.

I can't imagine hobbit parsing the wrong response times, but if that is the case I wonder what external libraries are used (not hobbit provided libs, as ours parse fine and are likely the same as yours).

Anyway, good luck with the tcpdump.

Regards,

Tim


Alan Sparks wrote:
UseCanonicalName is off, and HostnameLookups is off, on every server, regardless of version.
-Alan

Tim McCloskey wrote:
What do you have for
UseCanonicalName
in the apache 2.0 boxes?