Xymon Mailing List Archive

Purple storm

list Henrik Størner
Tue, 20 Mar 2012 08:10:45 +0100
Message-Id: <user-eed86368893e@xymon.invalid>

On 19-03-2012 19:15, Poppy, Ben wrote:
I have an interesting problem that happened last night. We are working
on a DR test. Part of that test includes shutting down some DC’s in our
DR datacenter. When that happened, most tests that are initiated from
the xymon servers (http, dns, ssh, ftp, etc) to the monitored server
went purple. The servers that went purple were not all in our DR
datacenter, it was at all of our sites, and even included some tests to
the xymon server itself (we monitor the HTTP web page of xymon itself as
well).

Both of our xymon servers point to 2 windows DC’s in our production
datacenter in /etc/resolv.conf for DNS lookups.

Check the "xymonnet" status history. I expect this status will show some yellow events during the outage, caused by the network tests taking too long to run.

The status will tell you more about which part of the network tests is taking too long.

This should also show up in the xymonnet.log file.

One likely culprit would be if you are doing "ntp" tests or custom DNS queries from Xymon against the DC's that are down. "ntp" tests use an external program (ntpdate) to perform the query, and it has a very long timeout when servers are not responding. DNS queries use the C-ARES library, and because I misunderstood how timeout handling works in this library, it can take several minutes *per test* to time out.
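
To illustrate why one slow external probe can stall a whole test run, here is a minimal Python sketch of putting a hard deadline around an external command. It is not how xymonnet is implemented; the function name run_with_deadline and the stand-in "slow probe" are hypothetical and used only for illustration.

```python
import subprocess
import sys
import time


def run_with_deadline(cmd, deadline_seconds):
    """Run cmd and kill it if it exceeds deadline_seconds.

    Returns (finished, elapsed_seconds); finished is False when the
    command had to be killed because it hit the deadline.
    """
    started = time.monotonic()
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # when the timeout elapses.
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=deadline_seconds,
            check=False,
        )
        finished = True
    except subprocess.TimeoutExpired:
        finished = False
    return finished, time.monotonic() - started


if __name__ == "__main__":
    # Hypothetical stand-in for a probe that hangs on an unreachable server.
    slow_probe = [sys.executable, "-c", "import time; time.sleep(30)"]
    finished, elapsed = run_with_deadline(slow_probe, deadline_seconds=1)
    print(f"finished={finished} elapsed={elapsed:.1f}s")
```

Without such a deadline, the caller waits for however long the external program chooses to wait, and that delay is multiplied by the number of tests aimed at unreachable servers.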

Fixes for both of these issues are "in the pipeline" for the next major Xymon version.


Regards,
Henrik