Xymon Mailing List Archive

http checks red with timeout, xymonnet fast

6 messages in this thread

list Rolf Schrittenlocher · Tue, 27 Apr 2021 06:20:02 +0000 ·
Dear all,


From one moment to the next, all our https tests went red because of timeouts. As far as we can tell, there is no firewall or network issue. When we use another machine, everything is fine. We suspect a hardware problem with the main board (Sun T3, Solaris), but support says the machine is OK. Running the request directly with xymonnet gives a quick reply. Any ideas how to debug this, or what I am doing wrong? What is the difference between the check made by the Xymon server and the manual check? We are using Xymon 4.3.17.


example entry in hosts.cfg:

ip   column_name            # noconn HIDEHTTP cont=xxx;https://address?id=354400185;keyword

Result: Server timeout Seconds: 18.94


console:

xymonnet --timing  --ssl=ssl --content=keyword https://address?id=354400185

Result: TIME TOTAL          0.011761


Any help appreciated,

regards

Rolf


Rolf Schrittenlocher

LBS-IT Systembetreuung (system support): user-64314bfd1eb5@xymon.invalid

LBS-IT main line: 069 798-28830

Personal: user-4b3b4051a09b@xymon.invalid

Direct: 069 798-28908
list Jeremy Laidman · Wed, 28 Apr 2021 10:44:50 +1000 ·
Hi Rolf

Can you show the red alert page for a failed https test?

Is there anything showing in the xymonnet.log file?

What is the status of the xymonnet test page for your Xymon server? (This
will probably show the most recent messages from the xymonnet.log file,
among other info.)

When running xymonnet, are you switching to xymon user and running xymoncmd
to setup the environment?

Cheers
Jeremy

list Rolf Schrittenlocher · Wed, 28 Apr 2021 07:37:13 +0000 ·
Hi Jeremy and others,

Thanks for the help.

> Can you show the red alert page for a failed https test?

Wed Apr 28 09:02:48 2021: Server timeout[red] https://xxx - Server timeout


Seconds:    18.92

> Is there anything showing in the xymonnet.log file?

No, it is empty.

> What is the status of the xymonnet test page for your Xymon server? (This will probably show the most recent messages from the xymonnet.log file, among other info.)

I am not quite sure what that test page is. I added an http check, http://our_xymon_server/xymon-cgi/svcstatus.sh?HOST=our_xymon_server&SERVICE=http, for our Xymon server (no certificate, no https), and the result is interesting: it alternates between less than 1 second and a timeout.

HTTP/1.1 200 OK
Date: Wed, 28 Apr 2021 07:13:15 GMT
Server: Apache/2.4.26 (Unix) OpenSSL/1.0.2u
Last-Modified: Wed, 28 Apr 2021 07:12:16 GMT
ETag: "1606f-5c103193e9130"
Accept-Ranges: bytes
Content-Length: 90223
Connection: close
Content-Type: text/html

Seconds:     0.81


HTTP/1.1 200 OK
Date: Wed, 28 Apr 2021 07:17:55 GMT
Server: Apache/2.4.26 (Unix) OpenSSL/1.0.2u
Last-Modified: Wed, 28 Apr 2021 07:17:19 GMT
ETag: "14eaa-5c1032b4f2e80"
Accept-Ranges: bytes
Content-Length: 85674
Connection: close
Content-Type: text/html

Seconds:    18.91

> When running xymonnet, are you switching to xymon user and running xymoncmd to setup the environment?

I ran it as the xymon user but without xymoncmd. I ran it again with xymoncmd, and there was very little difference.

Once again: the error first occurred at a time when no one was making any changes. When we switch the Xymon server to another machine in the same subnet, everything is fine. The installation of Xymon, Apache, etc. is identical on both machines; only the hardware differs.

Any ideas?

Greetings

Rolf
list Jeremy Laidman · Thu, 29 Apr 2021 02:20:34 +1000 ·
On Wed, 28 Apr 2021 at 17:37, Schrittenlocher, Rolf <user-c8b69be9a15a@xymon.invalid> wrote:
>> What is the status of the xymonnet test page for your Xymon server? (This
>> will probably show the most recent messages from the xymonnet.log file,
>> among other info.)
>
> not quite sure what that test page is.

Here's an example:

[image: image.png]

Click on the green dot to see the xymonnet test page.

> I added a http check
> http://our_xymon_server/xymon-cgi/svcstatus.sh?HOST=our_xymon_server&SERVICE=http
> for our xymon server (no certificate, no https) and the result is
> interesting. It alternates between less than 1 second and timeout

This seems to show a failure and then 6 seconds later a success. It's like
there are two xymonnet processes, one reporting a failure and the other
reporting success.

Can you please take a look at two status messages (click on the red/green
dot) that are a few seconds apart, and for each, check the IP address in
the message near the bottom that says, "Status message received from <IP>"
and see if they're both the same, and confirm that they are of the Xymon
server?

I'm not sure what's going on here. But I would try running a packet trace
(e.g. tcpdump) and inspecting the traffic for the two connections, one that
works and one that doesn't, and compare.

Also perhaps try running xymonnet manually again, but many times, to see if
it's an intermittent fault.
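
One way to carry out that repeated run is a small wrapper that executes the same command many times and records how long each run takes. The Python sketch below is only an illustration: the names time_command and slow_runs, the 20-run default and the 5-second threshold are assumptions made here, not part of Xymon.

```python
import subprocess
import sys
import time


def time_command(cmd, runs=20, timeout=30.0):
    """Run cmd `runs` times and return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = time.monotonic()
        try:
            subprocess.run(cmd, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL, timeout=timeout)
        except subprocess.TimeoutExpired:
            pass  # A run that hits the limit is still recorded, at roughly `timeout` seconds.
        durations.append(time.monotonic() - start)
    return durations


def slow_runs(durations, threshold=5.0):
    """Return (run_index, seconds) for every run slower than `threshold` seconds."""
    return [(i, d) for i, d in enumerate(durations) if d > threshold]


# Stand-in command so the sketch runs anywhere; replace it with the real check.
demo = time_command([sys.executable, "-c", "pass"], runs=3)
print(slow_runs(demo))
```

Passing the console command from the original post as a list, for example ["xymonnet", "--timing", "--ssl=ssl", "--content=keyword", "https://address?id=354400185"], should show whether the roughly 18-second delay ever appears outside the scheduled run.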

Check the execution parameters in tasks.cfg for the [xymonnet] section, and
make sure you're running xymonnet with the same parameters.
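
To compare the scheduled command with the manual one, it can help to pull the CMD line out of the [xymonnet] section programmatically. The following Python sketch assumes the usual tasks.cfg layout of a [name] header followed by indented KEYWORD value lines; the function name task_command is hypothetical, the sample text is made up for illustration, and line continuations are not handled. The real file on a given installation may look different.

```python
def task_command(tasks_cfg_text, section="xymonnet"):
    """Return the CMD value of the given [section] in tasks.cfg text, or None if absent."""
    in_section = False
    for raw in tasks_cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            in_section = line[1:-1].strip() == section
            continue
        if in_section:
            parts = line.split(None, 1)  # keyword, then the rest of the line
            if len(parts) == 2 and parts[0] == "CMD":
                return parts[1].strip()
    return None


# Illustrative sample only; paths and options are assumptions, not taken from the thread.
SAMPLE = """
[xymonnet]
    ENVFILE /path/to/xymonserver.cfg
    NEEDS xymond
    CMD xymonnet --report --ping --checkresponse
    LOGFILE /path/to/xymonnet.log
    INTERVAL 5m
"""

print(task_command(SAMPLE))
```

Any option that appears in the scheduled CMD line but not in the console test is a candidate explanation for the different timings.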

The problem might also be a DNS lookup issue. Try setting "testip" for the
web server in your hosts.cfg file and see if the delay goes away.
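
A rough way to see whether name resolution itself is intermittently slow is to time a series of lookups. The Python sketch below is an assumption-laden illustration: time_lookups is a hypothetical helper, and it goes through the operating system's resolver, which may not be the same path xymonnet takes, so a clean result here does not fully rule out DNS.

```python
import socket
import time


def time_lookups(hostname, attempts=10, resolver=socket.getaddrinfo):
    """Resolve `hostname` repeatedly; return a list of (seconds, succeeded) tuples."""
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            resolver(hostname, 443)  # port 443 because the failing checks are https
            ok = True
        except OSError:  # socket.gaierror is a subclass of OSError
            ok = False
        results.append((time.monotonic() - start, ok))
    return results


def summarize(results):
    """Return (slowest_seconds, number_of_failures) for a list from time_lookups."""
    slowest = max(seconds for seconds, _ in results)
    failures = sum(1 for _, ok in results if not ok)
    return slowest, failures
```

Calling summarize(time_lookups("address")) with the monitored host name gives the slowest lookup and the failure count; a slowest value of many seconds would point toward the resolver rather than the web server.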
list Rolf Schrittenlocher · Thu, 29 Apr 2021 08:54:07 +0000 ·
Dear Jeremy, dear all,


> What is the status of the xymonnet test page for your Xymon server? (This will probably show the most recent messages from the xymonnet.log file, among other info.)

xymonnet is green, nothing peculiar.

>> I added a http check http://our_xymon_server/xymon-cgi/svcstatus.sh?HOST=our_xymon_server&SERVICE=http for our xymon server (no certificate, no https) and the result is interesting. It alternates between less than 1 second and timeout
>
> This seems to show a failure and then 6 seconds later a success. It's like there are two xymonnet processes, one reporting a failure and the other reporting success.
>
> Can you please take a look at two status messages (click on the red/green dot) that are a few seconds apart, and for each, check the IP address in the message near the bottom that says, "Status message received from <IP>" and see if they're both the same, and confirm that they are of the Xymon server?

Yes, both came from the Xymon server's IP.

> I'm not sure what's going on here. But I would try running a packet trace (eg tcpdump) and inspecting the traffic for the two connections - one that works and one that doesn't - and compare.

We ran a (general) tcpdump but could not detect anything. That said, no one here is a specialist in this. I'll try again more precisely.

I would rule out DNS, as Solaris looks in /etc/hosts first before asking a nameserver, and some of the addresses are listed in /etc/hosts. Ping works fine as well.

Thank you for your help, Jeremy. I think this is really a very special and local problem. I'll try to get our network specialists involved. That is difficult, as they are more than busy in times when everything happens online :-)

We'll use another machine for Xymon, and I am afraid we have to live with the fact that this machine isn't totally reliable any more.

cheers

Rolf
list Jeremy Laidman · Thu, 29 Apr 2021 19:05:37 +1000 ·
On Thu, 29 Apr 2021 at 18:54, Schrittenlocher, Rolf <user-c8b69be9a15a@xymon.invalid> wrote:
> I would exclude DNS as Solaris looks first in /etc/hosts before asking a
> nameserver and some of the addresses are included in /etc/hosts. As well,
> ping works fine.

Xymon (typically) uses its own DNS resolver library, so testing that DNS
works for Solaris commands might not give you the same results.

One option you might look into is to run truss (strace on Linux) and attach
it to the xymonnet process. Then when it performs its check, truss will
show you all of the system calls that it makes. You might see it pause on a
particular system call for 15 seconds before continuing on its way, and
knowing that system call might lead you to the cause of the problem.
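
Spotting such a pause by eye in a long trace can be tedious, so a small script can flag it. The Python sketch below assumes the trace was captured with per-line timestamps in seconds at the start of each line (for example truss -d, or strace -ttt on Linux), optionally preceded by a prefix ending in a colon. The helper name find_gaps, the 5-second threshold and the sample lines are all made up for illustration. Because tools differ on whether the timestamp marks the start or the end of a call, the sketch reports the lines on both sides of each gap.

```python
import re

# Assumed line shape: [prefix:] <seconds.fraction> <rest of the trace line>
STAMP = re.compile(r"^(?:\S+:\s+)?\s*(\d+\.\d+)\s+(\S.*)$")


def find_gaps(lines, threshold=5.0):
    """Return (gap_seconds, line_before, line_after) for each pause longer than threshold."""
    gaps = []
    previous = None  # (timestamp, text) of the last line that carried a timestamp
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # ignore lines without a leading timestamp
        stamp, text = float(match.group(1)), match.group(2).rstrip()
        if previous is not None and stamp - previous[0] > threshold:
            gaps.append((round(stamp - previous[0], 6), previous[1], text))
        previous = (stamp, text)
    return gaps


# Invented sample with placeholder call names, not real trace output.
sample = [
    " 0.0100 call_a() = 0",
    " 0.0200 call_b() = 0",
    "15.0300 call_c() = 0",
    "15.0400 call_d() = 0",
]
for gap, before, after in find_gaps(sample):
    print("%.2fs pause between %r and %r" % (gap, before, after))
```

The two lines around the reported gap name the calls to look at first, for example a connect, a poll or a read on a socket.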

(It's been many years since I've used truss. I think that dtrace might have
replaced it?)

Good luck with it.

Cheers
Jeremy