Xymon Mailing List Archive

"purple storm" example

6 messages in this thread

list Jon Dustin · Sat, 12 May 2012 22:41:09 -0400 ·
Greetings -

This evening I had the luck to experience a "purple storm" with Xymon v4.3.7, *and* I had enough time to try a few items to combat the problem. Here is what I tried:

- applied patch from Henrik dated May 2 http://lists.xymon.com/pipermail/xymon/2012-May/034525.html 
patch -p0 <dns.patch ; make ; make install ; service xymon stop ; service xymon start

- checked my /etc/hosts file

I do have a few non-standard Xymon configuration settings:

- am NOT using FQDN

- tasks.cfg: CMD xymonnet --report --ping --ping-tasks=4 --dns-timeout=5 --checkresponse --source-ip=130.111.135.60 --dns=ip --shuffle

- hosts.cfg: 130.111.135.60   shepherd        # testip ssh bbd smtp https://shepherd.uct.usm.maine.edu/

The xymonnet column for shepherd remained purple until the DNS servers came back. It almost seemed like Xymon could not "find itself"?

Any thoughts on ways to combat this issue?

Thanks (as always) for a great product, and for any assistance you can provide.

-- 
 
Jon Dustin - Network Specialist
University of Southern Maine
Portland, ME  XXX-XXX-XXXX
list Henrik Størner · Sun, 13 May 2012 16:56:46 +0200 ·
quoted from Jon Dustin
On 13-05-2012 04:41, Jon Dustin wrote:
This evening I had the luck to experience a "purple storm" with Xymon v4.3.7
[snip]
The xymonnet column for shepherd remained purple until the DNS servers came back. It almost seemed like Xymon could not "find itself"?

Any thoughts on ways to combat this issue?
What's logged in your xymonnet.log file ?

Are your network tests and the Xymon website on the same server, or 
different servers ?

Do you have a local caching DNS server, or does your resolv.conf point 
to remote DNS servers ?


Regards,
Henrik
list Jon Dustin · Sun, 13 May 2012 22:04:44 -0400 ·
On 5/13/2012 at 10:56 AM, in message <user-122793e2102b@xymon.invalid>, Henrik Størner <user-ce4a2c883f75@xymon.invalid> wrote:
On 13-05-2012 04:41, Jon Dustin wrote:
This evening I had the luck to experience a "purple storm" with Xymon v4.3.7
[snip]
The xymonnet column for shepherd remained purple until the DNS servers came back. It almost seemed like Xymon could not "find itself"?

Any thoughts on ways to combat this issue?
What's logged in your xymonnet.log file ?
All I found were the following two entries:

2012-05-12 20:58:14 WARNING: Runtime 481 longer than time limit (300)
2012-05-12 22:07:20 WARNING: Runtime 767 longer than time limit (300)
quoted from Henrik Størner
Are your network tests and the Xymon website on the same server, or 
different servers ?
Same server, physical SLES11SP1x64, 16 GiB RAM, not overloaded.
quoted from Henrik Størner
Do you have a local caching DNS server, or does your resolv.conf point
to remote DNS servers ?
resolv.conf uses two Active Directory name servers, but the majority of
tests go against domains provided by another entity in my University
system. When their DNS servers go south, my Xymon server starts having
troubles.

Also, the TTL for our DNS records is very low (5 minutes I believe).
I'm going to see if we can increase this for our server names.

Thanks for reading.

-- 
 
Jon Dustin - Network Specialist
University of Southern Maine
Portland, ME  XXX-XXX-XXXX
list Henrik Størner · Mon, 14 May 2012 07:32:11 +0200 ·
quoted from Jon Dustin
On 14-05-2012 04:04, Jon Dustin wrote:
What's logged in your xymonnet.log file ?
All I found were the following two entries:

2012-05-12 20:58:14 WARNING: Runtime 481 longer than time limit (300)
2012-05-12 22:07:20 WARNING: Runtime 767 longer than time limit (300)
OK, if you look at the history of "xymonnet" status column, do you have a yellow status from around that time ? If you do, then check what line takes the longest time to complete.

How many systems are you testing, btw ?

There is one thing that I know of which can trigger this: xymonnet relies on two external tools (ntpdate and rpcinfo) for checking NTP-servers and RPC services. I know from personal experience that a failed NTP server can cause ntpdate to hang for a very long time, and this can block xymonnet from completing the test cycle.
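
As an illustration of the hang described above (this is not Xymon code, and not how xymonnet calls ntpdate), an external tool can be run under a wall-clock limit so that one stuck process cannot stall the whole cycle. The helper name run_with_deadline and the limit values are assumptions made for this sketch, and a Python one-liner that sleeps stands in for a hung ntpdate.

```python
import subprocess
import sys


def run_with_deadline(argv, limit_seconds):
    """Run an external command, giving up after limit_seconds.

    Returns (finished, exit_code). When the limit is hit, subprocess.run
    kills the child and this function returns (False, None).
    """
    try:
        done = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=limit_seconds,
        )
    except subprocess.TimeoutExpired:
        return (False, None)
    return (True, done.returncode)


if __name__ == "__main__":
    # Stand-in for a hung external tool: a child process that sleeps 30 s.
    hung_tool = [sys.executable, "-c", "import time; time.sleep(30)"]
    finished, code = run_with_deadline(hung_tool, limit_seconds=2)
    print("finished:", finished, "exit code:", code)
```

With a limit like this, a hung tool costs at most limit_seconds per host instead of blocking until the tool gives up on its own.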


Regards,
Henrik
list Jon Dustin · Mon, 14 May 2012 06:35:08 -0400 ·
On 5/14/2012 at 1:32 AM, in message <user-dd2ccbe0e83c@xymon.invalid>, Henrik Størner <user-ce4a2c883f75@xymon.invalid> wrote:
On 14-05-2012 04:04, Jon Dustin wrote:
What's logged in your xymonnet.log file ?
All I found were the following two entries:

2012-05-12 20:58:14 WARNING: Runtime 481 longer than time limit (300)
2012-05-12 22:07:20 WARNING: Runtime 767 longer than time limit (300)
OK, if you look at the history of "xymonnet" status column, do you have a yellow status from around that time ? If you do, then check what line takes the longest time to complete.
Yes, I DO have a yellow test result (481 seconds), and it looks like
LDAP was the culprit! 

DNS lookups completed                       4791966.865311        17.502010
Test engine setup completed                 4791966.870284         0.004972
TCP tests completed                         4791978.812050        11.941766
PING test completed (604 hosts)             4791979.652874         0.840824
PING test results sent                      4791979.656317         0.003442
Test result collection completed            4791979.656625         0.000307
LDAP test engine setup completed            4791979.656705         0.000080
LDAP tests executed                         4792364.927759       385.271054
LDAP tests result collection completed      4792364.927760         0.000000
DNS tests executed                          4792429.956221        65.028460

These test times were *before* I added your DNS patch to Xymon.
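
A small sketch of the "check what line takes the longest" step, using the timing output posted above. It assumes each line has the layout shown there (description, timestamp, step duration in seconds); the function names are made up for this example and are not part of Xymon.

```python
import re

# Timing lines as posted in this thread: description, timestamp, duration (s).
TIMING_REPORT = """\
DNS lookups completed                       4791966.865311        17.502010
Test engine setup completed                 4791966.870284         0.004972
TCP tests completed                         4791978.812050        11.941766
PING test completed (604 hosts)             4791979.652874         0.840824
PING test results sent                      4791979.656317         0.003442
Test result collection completed            4791979.656625         0.000307
LDAP test engine setup completed            4791979.656705         0.000080
LDAP tests executed                         4792364.927759       385.271054
LDAP tests result collection completed      4792364.927760         0.000000
DNS tests executed                          4792429.956221        65.028460
"""

# Description, then two decimal numbers at the end of the line.
LINE_RE = re.compile(r"^(?P<step>.+?)\s+(?P<stamp>\d+\.\d+)\s+(?P<secs>\d+\.\d+)\s*$")


def parse_timings(report):
    """Return a list of (step description, duration in seconds)."""
    steps = []
    for line in report.splitlines():
        match = LINE_RE.match(line)
        if match:
            steps.append((match.group("step"), float(match.group("secs"))))
    return steps


def slowest_steps(report, count=3):
    """Return the `count` steps with the largest duration, slowest first."""
    return sorted(parse_timings(report), key=lambda s: s[1], reverse=True)[:count]


if __name__ == "__main__":
    for step, secs in slowest_steps(TIMING_REPORT):
        print(f"{secs:10.3f}s  {step}")
```

On this data the slowest step is "LDAP tests executed" at about 385 seconds, followed by "DNS tests executed" at about 65 seconds.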
How many systems are you testing, btw ?
726 hosts in the configuration report
quoted from Henrik Størner
There is one thing that I know of which can trigger this: xymonnet relies on two external tools (ntpdate and rpcinfo) for checking NTP-servers and RPC services. I know from personal experience that a
failed NTP server can cause ntpdate to hang for a very long time, and
this can block xymonnet from completing the test cycle.
I DO have a few NTP servers (and a couple of them were the failed DNS
servers). No RPC tests however.

Thanks for reading.

-- 
 Jon Dustin - Network Specialist
University of Southern Maine
Portland, ME  XXX-XXX-XXXX
list Henrik Størner · Mon, 14 May 2012 14:25:37 +0200 ·
On Mon, 14 May 2012 06:35:08 -0400, "Jon Dustin" <user-d8c63a8259c1@xymon.invalid> wrote:
Yes, I DO have a yellow test result (481 seconds), and it looks like
LDAP was the culprit! 

LDAP tests executed ... 385.271054 
Makes sense, really. LDAP tests use whatever LDAP library your system has,
and Xymon currently has very little control over timeout handling once it
hands over control to the LDAP library. Some libraries implement a timeout
setting - but only for queries, not when connecting to the server.

Newer OpenLDAP libraries have real timeout handling, but Xymon 4.x hasn't
been modified to use it. Will do in 5.x.
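
One possible workaround for the connect-phase gap described above, sketched under assumptions: before handing a host to an LDAP library, try a plain TCP connect with an explicit timeout and skip the LDAP test if that fails. This is not what Xymon 4.x does, the function name tcp_reachable is made up for the example, and it only bounds the initial connect; a server that accepts the connection and then stalls would still need a timeout inside the LDAP library.

```python
import socket


def tcp_reachable(host, port, limit_seconds=5.0):
    """Return True if a TCP connection to host:port opens within the limit.

    Connection refusals, timeouts and name-resolution failures are all
    reported as False (they are all subclasses of OSError).
    """
    try:
        with socket.create_connection((host, port), timeout=limit_seconds):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 389 is the standard LDAP port; "localhost" is only a placeholder host.
    if tcp_reachable("localhost", 389, limit_seconds=3.0):
        print("port open, safe to start the LDAP query")
    else:
        print("no TCP connection within 3 s, skipping the LDAP test")
```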


Regards,
Henrik