Xymon Mailing List Archive

Purple storm

25 messages in this thread

list Ben Poppy · Mon, 19 Mar 2012 18:15:15 +0000 ·
I have an interesting problem that happened last night. We are working on a DR test. Part of that test includes shutting down some DC's in our DR datacenter. When that happened, most tests that are initiated from the xymon servers (http, dns, ssh, ftp, etc) to the monitored servers went purple. The servers that went purple were not only in our DR datacenter; they were at all of our sites, and even included some tests against the xymon server itself (we also monitor xymon's own HTTP web page).

Both of our xymon servers point to 2 windows DC's in our production datacenter in /etc/resolv.conf for DNS lookups.

Has anyone run into this before? Any ideas how it could be related? Or how to fix/prevent it?

We are running 4.3.4.

Thanks,
-Ben

The contents of this message may contain private, protected and/or privileged information.  If you received this message in error, you should destroy the e-mail message and any attachments or copies, and you are prohibited from retaining, distributing, disclosing or using any information contained within.  Please contact the sender and advise of the erroneous delivery by return e-mail or telephone.  Thank you for your cooperation.
list Jeremy Laidman · Tue, 20 Mar 2012 11:46:08 +1100 ·
On Tue, Mar 20, 2012 at 5:15 AM, Poppy, Ben <user-1ce99a2a9ef8@xymon.invalid> wrote:
> I have an interesting problem that happened last night. We are working on a
> DR test. Part of that test includes shutting down some DC’s in our DR
> datacenter. When that happened, most tests that are initiated from the xymon
> servers (http, dns, ssh, ftp, etc) to the monitored server went purple.

For network tests, Xymon resolves the IP address from the servername
(typically using DNS), and then uses that IP address to perform the
test.  The IP address in the hosts.cfg file is not normally used for
network tests.  So if your DNS fails, Xymon's network tests fail also.

You can prevent this, and use the IP address supplied in hosts.cfg, by
adding "testip" to each hosts.cfg entry that requires it.  You can add
it to a ".default." entry so that it applies to all hosts.

J
list Ben Poppy · Tue, 20 Mar 2012 01:20:02 +0000 ·
The xymon servers point to 2 DC's that stay up the entire time; we'll call them DC1 and DC2. We then shut down DR-DC3 and DR-DC4, and while those servers are down we begin to have issues.
list Phil Crooker · Tue, 20 Mar 2012 16:40:40 +1100 ·
So, can you do DNS queries from the xymon server when DC3 & 4 are down?
list Henrik Størner · Tue, 20 Mar 2012 08:10:45 +0100 ·
On 19-03-2012 19:15, Poppy, Ben wrote:
> I have an interesting problem that happened last night. We are working
> on a DR test. Part of that test includes shutting down some DC’s in our
> DR datacenter. When that happened, most tests that are initiated from
> the xymon servers (http, dns, ssh, ftp, etc) to the monitored server
> went purple. The servers that went purple were not all in our DR
> datacenter, it was at all of our sites, and even included some tests to
> the xymon server itself (we monitor the HTTP web page of xymon itself as
> well).
>
> Both of our xymon servers point to 2 windows DC’s in our production
> datacenter in /etc/resolv.conf for DNS lookups.

Check the "xymonnet" status history. I suppose this status will show some yellow events during this, caused by the network tests taking too long to run.

The status will tell you more about what part of the network tests are taking too long.

This should also show up in the xymonnet.log file.

One likely culprit would be if you are doing "ntp" tests or custom DNS queries from Xymon against the DC's that are down. "ntp" tests use an external program (ntpdate) to perform the query, and it has a very long timeout when servers are not responding. DNS queries use the C-ARES library, and because I misunderstood how the timeout handling works in this library, it can take several minutes *per test* to time out.

Fixes for both of these issues are "in the pipeline" for the next major Xymon version.


Regards,
Henrik
list Ben Poppy · Tue, 20 Mar 2012 18:51:42 +0000 ·
Yes, that's the strange part, we can still manually do digs and nslookups from the xymon server to other DNS servers.
list Ben Poppy · Tue, 20 Mar 2012 18:54:50 +0000 ·
The "DNS tests executed" figure jumps from the normal ~1 to 1500-2400 when those 4 DC's are down (we run DNS tests against them, but they are not the DNS servers set up in /etc/resolv.conf on the xymon servers). We are not doing any NTP tests against any hosts, nor do we do any special DNS tests, just the standard test.
list Don Kuhlman · Tue, 20 Mar 2012 19:05:06 +0000 ·
Would you be able to run a tcpdump or use a network sniffer to see what
the server is doing when you're getting the long response times?

Maybe that will help you see what it is trying to reach when that is
happening.
list Jamison Maxwell · Tue, 20 Mar 2012 22:11:17 +0000 ·
I think the interesting sniffer trace would be on the DC's that remain up.  Just to make sure I've got this straight: you've got two DC's on the LAN with Xymon and two DC's in your DR site.  You shut down the DC's in the DR site, and now queries to the DC's on the LAN are timing out (or something).

If that's the case, I would first look at DNS on the DC's.  A packet capture on the DC's, filtered to UDP port 53 and to packets to or from your Xymon server, would show whether the queries are making it to your remaining DC's and whether there is any delay in the response.  It wouldn't surprise me if Windows was so worried about the missing domain controllers that it forgot to respond to DNS queries.  If there's no delay there, then refer to the tcpdump you have running on your Xymon server to make sure the packets are making it back with acceptable latency.
I'm assuming that your DNS configuration has the zone in question as a primary, Active Directory integrated zone.  Also, are you running a caching DNS server on your Xymon system?  I've seen some odd results with DNS caching.  What order are the name servers in /etc/resolv.conf?  I was playing with Debian this weekend and, no matter what I did, it would not use any name server other than the first one in the list for regular queries; only dig and nslookup would.

Of course you could always just install your favorite DNS server on your Xymon system and transfer the zones as secondary zones from your Windows boxes.  That'll definitely solve the problem and remove the prerequisite of your DC's staying up for Xymon to work.  ...matter of fact, I think I'll do that myself....


Jamison Maxwell
list Jeremy Laidman · Thu, 22 Mar 2012 10:42:39 +1100 ·
On Wed, Mar 21, 2012 at 9:11 AM, Jamison Maxwell <user-87d336c3dce6@xymon.invalid> wrote:
> What order are the name servers in /etc/resolv.conf?  I was playing with Debian this weekend and for some reason, no matter what I did it would not use any other name server than the first one in the list except for dig and nslookups, but not for regular queries.

That's all normal behaviour.  The standard resolver library on Linux
will only use the second and third entries in resolv.conf when the first
one has not answered within the timeout, which is 5 seconds by default
(see "man resolv.conf").

The dig, nslookup and host programs don't use the resolver library,
and instead use their own in-built resolver that behaves differently
to the standard resolver - but still uses the "nameserver" entries
from resolv.conf.  For this reason, when diagnosing DNS problems with
applications, using dig/nslookup can give you results different to the
application you're testing; it's better to use things like "ping" or
"telnet" to do a lookup, as they use the same resolver library as most
other applications, including Xymon.
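
The same check can be scripted. What follows is only a minimal sketch, assuming Python is available on the Xymon server: socket.getaddrinfo() goes through the system resolver (the one ping and telnet use), so timing it shows how long a lookup takes with the current resolv.conf. The helper name time_lookup and the idea of passing host names on the command line are illustrative, not part of Xymon.

```python
import socket
import sys
import time


def time_lookup(hostname, resolver=socket.getaddrinfo):
    """Resolve hostname through the system resolver and time the call.

    Returns (addresses, seconds, error). On failure, addresses is an
    empty list and error holds the resolver's message.
    """
    start = time.monotonic()
    try:
        results = resolver(hostname, None)
        # Each result is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the IP address as a string.
        addresses = sorted({entry[4][0] for entry in results})
        error = None
    except socket.gaierror as exc:
        addresses, error = [], str(exc)
    elapsed = time.monotonic() - start
    return addresses, elapsed, error


if __name__ == "__main__":
    # Pass short names from your own hosts.cfg as arguments;
    # "localhost" is only a harmless default.
    for name in sys.argv[1:] or ["localhost"]:
        addrs, secs, err = time_lookup(name)
        print(f"{name}: {addrs or err} in {secs:.2f}s")
```

Running it once while the DR site is up and once while it is down, against a handful of the short names that went purple, should show whether the system resolver itself is slow or failing. A lookup that takes several seconds per name would point at the resolver configuration rather than at Xymon.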
list Ben Poppy · Wed, 11 Apr 2012 22:23:12 +0000 ·
And a fiber cut to our DR datacenter caused another 5 hour purple storm.

The traces on our DCs showed the DC's at our production datacenter receiving and responding to all DNS lookups. None were getting forwarded down to our other datacenter.

While this was happening, we changed our secondary xymon server to point to our linux bind dns servers (so that xymon1 was pointing to dc1/2, and xymon2 was pointing to binddns1/2), and that still had the purple storms on both xymon servers.

At this point, I'm not sure what to do. Upgrade to the latest xymon in the hope that some bug causing this has been fixed? Downgrade back to xymon 4.2.3 or even hobbit 4.2?

Another idea being suggested is changing all the shortname entries (in hosts.cfg) to FQDN entries. The problem is that there are over 1700 entries, and I'd essentially have to find out what domain each one is in and then do a rename command as well. Not to mention that we don't have this issue unless our DR-DC goes down.

Any other ideas from the list by chance?
list Jeremy Laidman · Thu, 12 Apr 2012 11:22:14 +1000 ·
On Thu, Apr 12, 2012 at 8:23 AM, Poppy, Ben <user-1ce99a2a9ef8@xymon.invalid> wrote:
> Another idea that they are suggesting is changing all the shortname
> entries (in hosts.cfg) to FQDN entries. The problem is there are over 1700
> entries and I'd have to essentially find out what domain they are in, and
> then do a rename command as well.. Not to mention that we don't have this
> issue unless our DR-DC goes down..
>
> Any other ideas from the list by chance?

Add to hosts.cfg:
   0.0.0.0         .default.       # testip

This turns off DNS lookups for all servers in hosts.cfg and uses the IP
address instead.

Cheers
Jeremy
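
With 1700-odd entries it may be worth checking the file before relying on testip. The sketch below is a hedged illustration, not a Xymon tool: it assumes each host entry has the form "IP hostname # tags", and it assumes that an entry listed as 0.0.0.0 gives Xymon no usable address, so it would still need a name lookup. The function names and the sample data are made up for the example.

```python
import re

# Assumed entry layout: "IP  hostname  # tag tag ..."
ENTRY = re.compile(r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+(\S+)\s*(?:#\s*(.*))?$")


def parse_hosts_cfg(text):
    """Yield (ip, hostname, tags) for each host entry line in hosts.cfg text."""
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:  # page/group/include lines and comments do not match
            ip, hostname, tags = match.groups()
            yield ip, hostname, (tags or "").split()


def has_default_testip(text):
    """Return True if a ".default." entry carries the testip tag."""
    return any(host == ".default." and "testip" in tags
               for _, host, tags in parse_hosts_cfg(text))


def hosts_without_usable_ip(text):
    """Return hosts listed as 0.0.0.0 (assumed to still need DNS)."""
    return [host for ip, host, _ in parse_hosts_cfg(text)
            if ip == "0.0.0.0" and host != ".default."]


# Made-up sample; read your real hosts.cfg into a string instead.
SAMPLE = """\
page dr DR site
0.0.0.0     .default.   # testip
10.1.2.3    web01       # http://web01/ ssh
0.0.0.0     app07       # ftp
"""

print(has_default_testip(SAMPLE))        # True
print(hosts_without_usable_ip(SAMPLE))   # ['app07']
```

Any host the second function reports would need a real IP address filled in before testip can help it.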
list Jamison Maxwell · Thu, 12 Apr 2012 03:21:31 +0000 ·
I'm not convinced that this is a bug in Xymon.

I don't understand how secondary name servers in a DR site that are configured to be used as backups would cause nothing to resolve when they are unavailable.  I've run a similar configuration to what I believe you are describing without a problem....  To make sure I'm understanding what you're saying, when the DR DNS servers are unavailable, then Xymon fails to accept the DNS query results?  Another, possibly clearer, way of saying that is that the DNS queries from Xymon to your production DNS servers fail despite it absolutely receiving correct replies from your production DNS servers just because your DR site is unavailable?


Jamison Maxwell
user-87d336c3dce6@xymon.invalid
quoted from Ben Poppy

-----Original Message-----
From: Poppy, Ben [mailto:user-1ce99a2a9ef8@xymon.invalid]
Sent: Wednesday, April 11, 2012 6:23 PM
To: Jamison Maxwell; xymon at xymon.com
Subject: RE: [Xymon] Purple storm

And a fiber cut to our DR datacenter caused another 5 hour purple storm.

The traces on our DCs showed that the DCs in our production datacenter were receiving and responding to all DNS lookups. None were getting forwarded down to our other datacenter.

While this was happening, we changed our secondary xymon server to point to our linux bind dns servers (so that xymon1 was pointing to dc1/2, and xymon2 was pointing to binddns1/2), and we still had the purple storms on both xymon servers.

At this point, I'm not sure what to do. Upgrade to the latest xymon in the hopes that some bug that's causing this was fixed? Downgrade back to xymon 4.2.3 or even hobbit 4.2?

Another idea that they are suggesting is changing all the shortname entries (in hosts.cfg) to FQDN entries. The problem is there are over 1700 entries and I'd have to essentially find out what domain they are in, and then do a rename command as well.. Not to mention that we don't have this issue unless our DR-DC goes down..

Any other ideas from the list by chance?

-----Original Message-----
From: xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com] On Behalf Of Jamison Maxwell
Sent: Tuesday, March 20, 2012 5:11 PM
To: xymon at xymon.com
Subject: Re: [Xymon] Purple storm

I think the interesting sniffer would be on the DC's that remain up.  Just to make sure I got this straight, you've got two DC's on the LAN with Xymon and two DC's in your DR site.  You shut down the DC's in the DR site and now queries are timing out (or something) to the DC's on the LAN.

If that's the case, I would first look at DNS on the DC's.  If you do a packet capture on the DC's filtered to UDP 53 and only the packets from or to your Xymon server, then this would show whether the queries are making it to your remaining DC's and whether there is any delay in the response.  It wouldn't surprise me if Windows was so worried about the missing domain controllers that it forgot to respond to DNS queries.  If there's no delay, then refer to the tcpdump you have running on your Xymon server to make sure the packets are making it back with acceptable latency.

I'm making the assumption that your DNS configuration has the zone in question as a primary, Active Directory integrated zone.  Also, are you running a caching DNS server on your Xymon system?  I've seen some odd results with DNS caching.  What order are the name servers in /etc/resolv.conf?  I was playing with Debian this weekend and for some reason, no matter what I did, it would not use any name server other than the first one in the list for regular queries; only dig and nslookup would use the others.

Of course you could always just install your favorite DNS server on your Xymon system and transfer the zones as secondary zones from your Windows boxes, that'll definitely solve the problem and remove the prerequisite of your DC's staying up so Xymon works.  ...matter of fact, I think I'll do that myself....


Jamison Maxwell

-----Original Message-----
From: xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com] On Behalf Of Don Kuhlman
Sent: Tuesday, March 20, 2012 3:05 PM
To: xymon at xymon.com
Subject: Re: [Xymon] Purple storm

Would you be able to run a tcpdump or use a network sniffer to see what the server is doing when you're getting the long response times?

Maybe that will help you see what it is trying to reach when that is happening.


On 3/20/12 1:51 PM, "Poppy, Ben" <user-1ce99a2a9ef8@xymon.invalid> wrote:
Yes, that's the strange part, we can still manually do digs and nslookups from the xymon server to other DNS servers.

-----Original Message-----
From: Phil Crooker [mailto:user-e8e31cd73303@xymon.invalid]
Sent: Tuesday, March 20, 2012 12:41 AM
To: Poppy, Ben
Cc: xymon at xymon.com
Subject: Re: [Xymon] Purple storm

So, can you do DNS queries from the xymon server when DC3 & 4 are down?

"Poppy, Ben"  03/20/12 11:50 AM >>>
So they are pointing to 2 DC's that stay up this entire time, we'll call them DC1 and DC2. Then we shutdown DR-DC3 and DR-DC4. When those servers are down, we begin to have issues.

-----Original Message-----
From: Jeremy Laidman [mailto:user-71895fb2e44c@xymon.invalid]
Sent: Monday, March 19, 2012 7:46 PM
To: Poppy, Ben
Cc: xymon at xymon.com
Subject: Re: [Xymon] Purple storm

On Tue, Mar 20, 2012 at 5:15 AM, Poppy, Ben  wrote:
I have an interesting problem that happened last night. We are working
on a DR test. Part of that test includes shutting down some DC's in our DR datacenter. When that happened, most tests that are initiated from the xymon servers (http, dns, ssh, ftp, etc) to the monitored
server went purple.
For network tests, Xymon resolves the IP address from the servername (typically using DNS), and then uses that IP address to perform the test.
The IP address in the hosts.cfg file is not normally used for network tests.  So if your DNS fails, Xymon's network tests fail also.

You can prevent this, and use the IP address supplied in hosts.cfg, by adding "testip" to each hosts.cfg entry that requires it.  You can add it to a ".default." entry so that it applies to all hosts.

J



--

This message from ORIX Australia might contain confidential and/or privileged information. If you are not the intended recipient, any use, disclosure or copying of this message (or of any attachments to it) is not authorised.

If you have received this message in error, please notify the sender immediately and delete the message and any attachments from your system.
Please inform the sender if you do not wish to receive future communications by email.

ORIX handles personal information according to a Privacy Policy that is consistent with the National Privacy Principles. Please let us know if you would like a copy. It is also available at http://www.orix.com.au .


list Ben Poppy · Thu, 12 Apr 2012 03:28:25 +0000 ·
To be honest, I'm not sure what the cause is 100%.

The setup we have should not have any dependencies on our DR site. Our xymon servers at our primary site use DNS servers in our primary site. They monitor a bunch of servers at our DR site, but the dependency ends there (and all that should mean is that the servers show RED when the DR site is down).

Another bit of information: during this 5 hour outage, both of our xymon servers went from showing properly (where DR servers were showing RED conn as they weren't reachable, but the servers we monitor in our primary site were up) to everything going purple in conn (and other tests). It would alternate back and forth over the course of the outage (I didn't detect a regular timeframe for when it switched from RED to PURPLE).
list Henrik Størner · Thu, 12 Apr 2012 07:47:16 +0200 ·
On 12-04-2012 05:28, Poppy, Ben wrote:
To be honest, I'm not sure what the cause is 100%.
Me neither.
quoted from Ben Poppy
The setup we have, should not have any dependencies on our DR site.
Our xymon servers at our primary site use DNS servers in our primary
site. They monitor a bunch of servers at our DR site, but the
dependency ends there (and all that should mean is the servers show
RED when DR site is down).
Aha - but you DO have tests in each setup that check systems on the other site? Would that happen to include any DNS or NTP checks?

I suspect that you have each of your Xymon servers set up to test availability of the DNS servers on both the primary and the DR site. That could be a problem.
quoted from Ben Poppy
Another bit of information, during this 5 hour outage, both of our
xymon servers went from showing properly (where DR servers were
showing RED conn as they weren't reachable, but the servers we
monitor in our primary site were up), to everything going purple in
conn (and other tests).. It would alternate back and forth over the
course of the outage (I didn't detect a regular timeframe of when it
switched from RED to PURPLE)..
The interesting thing is that they switch to purple, indicating that something is stalled.

I have seen something like this happen when we had a number of DNS checks in the Xymon servers, and network access to these failed (broken switch to a customer network). This caused xymon to stall on these DNS checks, and all of the network tests went purple.

I know that this is difficult to test, because obviously you cannot just cut the connection between the two sites to try it out. But you could try applying this patch, which changes the DNS lookup code to use the same kind of timeout settings as the development version. The 4.3.x versions suffer from a common misunderstanding about how the C-ARES library handles timeouts, which makes DNS timeouts take much too long.
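
To get a feel for how long a single stalled lookup can take, here is a rough model of the retry arithmetic. It assumes that the configured timeout applies to each attempt rather than to the whole lookup, that every configured server is tried in each round, and that the wait doubles after each full round. That is an assumption about C-ARES releases of that era, not something taken from the patch, and the function name and numbers are purely illustrative.

```python
def worst_case_seconds(timeout_s, tries, nservers):
    """Total wait for one lookup when no server ever answers.

    Assumed model: tries * nservers attempts in total, and the
    per-attempt wait doubles after each full pass over the servers.
    """
    total = 0
    for attempt in range(tries * nservers):
        rounds_completed = attempt // nservers
        total += timeout_s * 2 ** rounds_completed
    return total


# Illustrative values only: 5 second timeout, 4 tries.
print(worst_case_seconds(5, 4, 1))  # one unreachable server
print(worst_case_seconds(5, 4, 2))  # two unreachable servers
```

Under these assumptions a 5 second timeout with 4 tries against 2 servers adds up to 150 seconds for one unanswered lookup, so a handful of them in a row could hold up a test run for several minutes.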

One possible way of testing it would be if you can firewall access from e.g. your DR site Xymon server to the primary site's DNS server. If you are running Xymon on a Linux server, then "iptables" can do that for you. If your primary site DNS server is 10.1.2.3, then

   iptables -I OUTPUT 1 -d 10.1.2.3 -j DROP
   iptables -I INPUT 1 -s 10.1.2.3 -j DROP

will cause all traffic to/from this server to be dropped.


Regards,
Henrik
Attachments (1)
list Henrik Størner · Thu, 12 Apr 2012 07:50:12 +0200 ·
quoted from Henrik Størner
On 12-04-2012 07:47, Henrik Størner wrote:
I know that this is difficult to test, because obviously you cannot just
cut the connection between the two sites to try it out. But you could
try applying this patch
The first part of the patch - the one for xymonnet/contest.c - is completely unrelated. You can remove this before applying if you like, but unless you have sites explicitly tested with https and SSLv2 it is harmless.


Regards,
Henrik
list Ben Poppy · Thu, 12 Apr 2012 06:27:01 +0000 ·
I may have missed this in a past post, how do I apply this patch?

I do test DNS for sure on servers at our DR site (many of them). The test you suggest below, is that to simulate the purple storm? Should it essentially turn purple if I begin dropping all packets to a few DNS servers I'm testing?

Would I be able to run this same iptables on my backup xymon server in our primary site to a few servers it checks DNS against in our DR site? Should that effectively cause the purple storm?

Thanks for your help. 
list Henrik Størner · Thu, 12 Apr 2012 10:43:25 +0200 ·
On Thu, 12 Apr 2012 06:27:01 +0000, "Poppy, Ben"
quoted from Ben Poppy
<user-1ce99a2a9ef8@xymon.invalid> wrote:
I may have missed this in a past post, how do I apply this patch?
Ah - ok, my developer-mind assumed everyone knows how to do that :-)

Save the attachment to /tmp/dnstimeout.patch, then:

  cd xymon-4.3.7
  patch -p0 </tmp/dnstimeout.patch
  make clean
  make

You can run "make install" afterwards, but a safer option would be to just
copy the "xymon-4.3.7/xymonnet/xymonnet" binary into your Xymon "bin"
directory, replacing the one that is already there.
quoted from Ben Poppy

I do test DNS for sure on servers at our DR site (many of them). The test
you suggest below, is that to simulate the purple storm?
It is to simulate that your Xymon server loses connectivity to the DNS
server on the primary site.
quoted from Ben Poppy
Should it
essentially turn purple if I begin dropping all packets to a few DNS
servers I'm testing?
That is what I suspect, yes.
quoted from Ben Poppy
 
Would I be able to run this same iptables on my backup xymon server in our
primary site to a few servers it checks DNS against in our DR site? Should
that effectively cause the purple storm?
What I'm trying to do is to simulate the situation you had which caused
the purple storm, without actually pulling the plug and disrupting the
network between the two sites. If I understand you correctly, then the
purple storm happened when you lost the connection between your two
datacenters. Since I suspect that this is related to DNS lookups taking a
very long time with the stock 4.3.7 Xymon version, you can use iptables to
just block traffic from Xymon to the DNS server(s) in the other datacenter.


Regards,
Henrik
list Mark Deiss · Thu, 12 Apr 2012 08:25:44 -0500 ·
I am not sure about how BBWIN is reporting its Windows client metrics;
MrBig for Windows reports file system usage in the following format:


Filesystem    1k-blocks        Used       Avail    Capacity  Mounted
C              15727603     8740006     6987597       55.6%  /FIXED/C
D             277225672     6502600   270723072        2.3%  /FIXED/D


Limits:
Drive      Yellow     Red       
D          70.0       80.0      
C          70.0       80.0      
Default    90.0       95.0  

We have some clustered Windows 2008 servers that are using shared
resources that are only referenced by their UNC paths - they do not have
logical drive assignments. Was wondering whether this would cause any
issues with the processing in do_disk.c section:


		/*
		 * Some Unix filesystem reports contain the word "Filesystem".
		 * So check if there's a slash in the NT filesystem letter - if yes,
		 * then it's really a Unix system after all.
		 */
		if ( (dsystype == DT_NT) && (*(columns[5])) && (strchr(columns[0], '/')) )
			dsystype = DT_UNIX;


Where the columns[5] would be referring to the "Mounted" column in the
MrBig output, and columns[0] would be referring to the Filesystem
column.  The above logic is not being performed as a block step against
the overall client's disk output but rather on a line by line basis of
the disk output. The first occurrence matched will flip the subsequent
data processing flow from Windows based to Unix based even for
subsequent disk lines from the same client. 

My question is whether the presence of UNC only mounts would be
represented with backslashes (\) with the BBWIN/MrBig/whatever Windows
reporting modules or whether they may end up converted to forward
slashes (/) and thus falsely trigger the dsystype switch. The only
effect on the subsequent processing is whether the Filesystem column
(Windows OS) is used for storing the diskname (and generation of the rrd
file name) or the "Mounted on" column is used for the diskname (Unix
OS).
	If the UNC representation can cause this false flip, then
initial Windows disk lines that are using logical drive assignments
would be stored and referenced using the one column reference and once a
UNC line is encountered, then it and all subsequent lines (including
logical drive assignment lines) would be using the other column
reference convention.
	Things could get busy if the line order varies from time to time
(i.e. addition/removal of a UNC resource); then a logical drive time
stamped data line may go in at times as a "Filesystem" rrd file, other
times as a "Mounted on" rrd file. 

setupfn2("%s%s.rrd", testname, diskname);

As I don't yet have any clients installed in these clustered Windows
boxes or UNC only mounts on other boxes, I don't know what the "Mounted
on" column output would look like for the UNC resource. If
BBWIN/MrBig/whatever are not reporting on UNC resources then that's
going to be a (separate) issue for us too.

Possible fix if this is an issue:

Was wondering if all the Windows clients (BBWIN/MrBig/whatever)
reliably report the last column with the "Mounted" header tag and
whether all the Unix/Linux/BSD variants report their last column with
the "Mounted on" header tag. If so then maybe a better way to handle
this is to run a check against the overall client disk msg block for the
string pattern of "Mounted on" instead. Change do_disk.c section:

else if (strstr(msg, "Filesystem")) dsystype = DT_NT;
else dsystype = DT_UNIX;

to

else if (strstr(msg, "Filesystem")) dsystype = DT_NT;    /* This will trigger for Windows and Unix/Linux flavors */
else if (strstr(msg, "Mounted on")) dsystype = DT_UNIX;  /* Assuming all unix/linux/BSD clients report with "Mounted on"
                                                            and Windows clients only report with "Mounted" in their header line */
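
One thing to watch with the order of the two checks above: Unix headers also contain "Filesystem", so as written the first branch would match them before the "Mounted on" test is ever reached. Below is a minimal sketch of the idea with the "Mounted on" test first. It is a small Python model of the logic, not a patch to do_disk.c; detect_dsystype is a made-up name, the df-style header is an assumed typical one, and the UNC-style line is only a guess at what a client might send.

```python
DT_NT = "DT_NT"
DT_UNIX = "DT_UNIX"


def detect_dsystype(msg):
    """Classify a whole disk report from its header text (hypothetical helper)."""
    # Test the longer Unix marker first: Unix headers contain "Filesystem" too.
    if "Mounted on" in msg:
        return DT_UNIX
    if "Filesystem" in msg:
        return DT_NT
    # Same fallback as the existing code: anything else is treated as Unix.
    return DT_UNIX


# Header as in the MrBig output above, plus an invented UNC-style line.
windows_report = (
    "Filesystem    1k-blocks        Used       Avail    Capacity  Mounted\n"
    "C              15727603     8740006     6987597       55.6%  /FIXED/C\n"
    "//filer/share 277225672     6502600   270723072        2.3%  /UNC/share\n"
)
# Assumed df-style header, not taken from this thread.
unix_report = (
    "Filesystem     1K-blocks    Used Available Use% Mounted on\n"
    "/dev/sda1       15727603 8740006   6987597  56% /\n"
)

print(detect_dsystype(windows_report))
print(detect_dsystype(unix_report))
```

Because the decision is made once per report from the header, a slash in a single data line cannot flip the type part-way through the report.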
list Malcolm Hunter · Thu, 12 Apr 2012 15:51:34 +0200 ·
Hi Mark,
See this update in Subversion trunk: http://xymon.svn.sourceforge.net/viewvc/xymon?view=revision&revision=6708

Malcolm
--
BBWin Development - The Windows client for Big Brother and Xymon

http://bbwin.sourceforge.net
http://xymon.sourceforge.net
list Josh Luthman · Thu, 12 Apr 2012 12:53:16 -0400 ·
Can you make the default to testip but specify a host to use DNS?

Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX
quoted from Henrik Størner


On Thu, Apr 12, 2012 at 4:43 AM,  <user-ce4a2c883f75@xymon.invalid> wrote:
On Thu, 12 Apr 2012 06:27:01 +0000, "Poppy, Ben"
<user-1ce99a2a9ef8@xymon.invalid> wrote:
I may have missed this in a past post, how do I apply this patch?
Ah - ok, my developer-mind assumed everyone knows how to do that :-)

Save the attachment to /tmp/dnstimeout.patch, then:

 cd xymon-4.3.7
 patch -p0 </tmp/dnstimeout.patch
 make clean
 make

You can run "make install" afterwards, but a safer option would be to just
copy the "xymon-4.3.7/xymonnet/xymonnet" binary into your Xymon "bin"
directory, replacing the one that is already there.

I do test DNS for sure on servers at our DR site (many of them). The test
you suggest below, is that to simulate the purple storm?
It is to simulate that your Xymon server loses connectivity to the DNS
server on the primary site.
Should it essentially turn purple if I begin dropping all packets to a few
DNS servers I'm testing?
That is what I suspect, yes.
Would I be able to run this same iptables on my backup xymon server in our
primary site to a few servers it checks DNS against in our DR site? Should
that effectively cause the purple storm?
What I'm trying to do is to simulate the situation you had which caused
the purple storm, without actually pulling the plug and disrupting the
network between the two sites. If I understand you correctly, then the
purple storm happened when you lost the connection between your two
datacenters. Since I suspect that this is related to DNS lookups taking a
very long time with the stock 4.3.7 Xymon version, you can use iptables to
just block traffic from Xymon to the DNS server(s) in the other datacenter.


Regards,
Henrik

list Jeremy Laidman · Fri, 13 Apr 2012 13:46:03 +1000 ·
On Fri, Apr 13, 2012 at 2:53 AM, Josh Luthman
<user-4c45a83f15cb@xymon.invalid> wrote:
Can you make the default to testip but specify a host to use DNS?
No, but you can define .default. more than once, and the defaults will
change for subsequent hosts until the next .default. (if any).  So you
could do:

0.0.0.0 .default. # testip dialup

1.1.1.1 server1 # ssh smtp
1.1.1.2 server2 # ssh telnet

0.0.0.0 .default. # dialup
1.1.1.3 server3 # ssh http

0.0.0.0 .default. # testip dialup
1.1.1.4 server4 # ssh rdp
1.1.1.5 server5 # ssh telnet

All hosts except server3 will get the "testip" setting.

Please note that this is just how I think it should work, and I haven't
tested it.

J
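
The behaviour described above can be modelled with a small shell/awk sketch. This only models the description in the message (each .default. line replaces the previous defaults for the hosts that follow it). It is not Xymon's own parser, and like the example itself it has not been checked against a real Xymon server. The function name show_testip and the "testip"/"dns" output labels are made up for illustration. Lines that do not start with an IPv4 address (pages, groups, comments) are simply skipped.

```shell
#!/bin/sh
# Hypothetical helper: prints, for each host line in a hosts.cfg-style file,
# whether it would end up with "testip" under the semantics described above.
show_testip() {
    awk '
        # Only look at lines that start with an IPv4 address (host or .default. entries).
        $1 !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { next }
        {
            # Everything after the first "#" is the tag list for this entry.
            pos = index($0, "#")
            tags = (pos > 0) ? substr($0, pos + 1) : ""
            has_testip = (tags ~ /(^|[ \t])testip([ \t]|$)/)

            if ($2 == ".default.") {
                # A new .default. entry replaces the earlier defaults.
                default_testip = has_testip
                next
            }
            printf "%s %s\n", $2, ((has_testip || default_testip) ? "testip" : "dns")
        }
    ' "$1"
}
```

Run as "show_testip hosts.cfg" against the example above, it lists server3 as "dns" and the other four hosts as "testip".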
list Josh Luthman · Fri, 13 Apr 2012 09:03:40 -0400 ·
I put it at the top and included a host below pages and groups.

Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

list Ben Poppy · Thu, 19 Apr 2012 03:38:59 +0000 ·
Hmm, interestingly enough, I have not been able to reproduce the purple storm by cutting off communication to all 10 DC/DNS servers at our DR location.

I think I'm still going to move forward with updating to the latest stable release, as well as with the patch you provided. And if worst comes to worst, I'll kill monitoring of DR servers if/when we lose connectivity again.

Thanks for your help, it is so greatly appreciated!

-----Original Message-----
From: xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com] On Behalf Of Henrik Størner
Sent: Thursday, April 12, 2012 12:47 AM
To: xymon at xymon.com
Subject: Re: [Xymon] Purple storm

On 12-04-2012 05:28, Poppy, Ben wrote:
To be honest, I'm not sure what the cause is 100%.
Me neither.
The setup we have, should not have any dependencies on our DR site.
Our xymon servers at our primary site use DNS servers in our primary site. They monitor a bunch of servers at our DR site, but the dependency ends there (and all that should mean is the servers show RED when DR site is down).
Aha - but you DO have tests in each setup that check systems on the other site? Would that happen to include any DNS or NTP checks?

I suspect that you have each of your Xymon servers set up to test availability of the DNS servers on both the primary and the DR site. That could be a problem.
Another bit of information: during this 5 hour outage, both of our xymon servers went from showing properly (where DR servers were showing RED conn as they weren't reachable, but the servers we monitor in our primary site were up), to everything going purple in conn (and other tests). It would alternate back and forth over the course of the outage (I didn't detect a regular timeframe of when it switched from RED to PURPLE).
The interesting thing is that they switch to purple, indicating that something is stalled.

I have seen something like this happen when we had a number of DNS checks in the Xymon servers, and network access to these failed (broken switch to a customer network). This caused xymon to stall on these DNS checks, and all of the network tests went purple.

I know that this is difficult to test, because obviously you cannot just cut the connection between the two sites to try it out. But you could try applying this patch, which changes the DNS lookup code to use the same kind of timeout settings as the development version - the 4.3.x versions suffer from a common misunderstanding about how the C-ARES library handles timeouts, which makes DNS timeouts take much too long.

One possible way of testing it would be if you can firewall access from e.g. your DR site Xymon server to the primary site's DNS server. If you are running Xymon on a Linux server, then "iptables" can do that for you. If your primary site DNS server is 10.1.2.3, then

   iptables -I OUTPUT 1 -d 10.1.2.3 -j DROP
   iptables -I INPUT 1 -s 10.1.2.3 -j DROP

will cause all traffic to/from this server to be dropped.
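
Since those rules stay in place until they are deleted, it may help to wrap the test in a pair of functions so the block is easy to undo. This is only a sketch: the names block_dns, unblock_dns, IPTABLES and DNS_SERVER are made up for illustration, and 10.1.2.3 is the example address from above. The -D form deletes the first rule in the chain that matches the given specification, which should be the rule inserted by block_dns as long as no other identical rule already exists.

```shell
#!/bin/sh
# Hypothetical wrapper around the two iptables commands above. Needs root.
# IPTABLES can be overridden (e.g. IPTABLES=echo) to preview the commands.
IPTABLES="${IPTABLES:-iptables}"
DNS_SERVER="${DNS_SERVER:-10.1.2.3}"   # example address from the message

# Drop all traffic to and from the DNS server (start of the test).
block_dns() {
    $IPTABLES -I OUTPUT 1 -d "$DNS_SERVER" -j DROP
    $IPTABLES -I INPUT 1 -s "$DNS_SERVER" -j DROP
}

# Delete the two rules again (end of the test).
unblock_dns() {
    $IPTABLES -D OUTPUT -d "$DNS_SERVER" -j DROP
    $IPTABLES -D INPUT -s "$DNS_SERVER" -j DROP
}
```

Typical use would be to call block_dns, watch the Xymon display for a few test cycles, then call unblock_dns. "iptables -L OUTPUT -n --line-numbers" shows whether the rule is currently present.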


Regards,
Henrik

list Ben Poppy · Fri, 20 Apr 2012 21:13:42 +0000 ·
So far so good, I have been unable to reproduce the purple storm. We'll find out in a couple weeks when we do our next DR isolation test. Thanks so much for your help!
