Purple storm
list Ben Poppy
I have an interesting problem that happened last night. We are working on a DR test. Part of that test includes shutting down some DCs in our DR datacenter. When that happened, most tests that are initiated from the xymon servers (http, dns, ssh, ftp, etc.) to the monitored servers went purple. The servers that went purple were not all in our DR datacenter; they were at all of our sites, and even included some tests to the xymon server itself (we monitor the HTTP web page of xymon itself as well). Both of our xymon servers point to 2 Windows DCs in our production datacenter in /etc/resolv.conf for DNS lookups.
Has anyone run into this before? Any ideas how it could be related? Or how to fix/prevent it? We are running 4.3.4.
Thanks,
-Ben
The contents of this message may contain private, protected and/or privileged information. If you received this message in error, you should destroy the e-mail message and any attachments or copies, and you are prohibited from retaining, distributing, disclosing or using any information contained within. Please contact the sender and advise of the erroneous delivery by return e-mail or telephone. Thank you for your cooperation.
list Jeremy Laidman
On Tue, Mar 20, 2012 at 5:15 AM, Poppy, Ben <user-1ce99a2a9ef8@xymon.invalid> wrote:
I have an interesting problem that happened last night. We are working on a DR test. Part of that test includes shutting down some DC’s in our DR datacenter. When that happened, most tests that are initiated from the xymon servers (http, dns, ssh, ftp, etc) to the monitored server went purple.
For network tests, Xymon resolves the IP address from the servername (typically using DNS), and then uses that IP address to perform the test. The IP address in the hosts.cfg file is not normally used for network tests. So if your DNS fails, Xymon's network tests fail also. You can prevent this, and use the IP address supplied in hosts.cfg, by adding "testip" to each hosts.cfg entry that requires it. You can add it to a ".default." entry so that it applies to all hosts. J
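Before turning on "testip" for everything, it may be worth checking that every host actually has a usable address in hosts.cfg. The following is only a minimal sketch: it assumes the usual "IP hostname # tags" line layout, and it assumes that an entry written with the 0.0.0.0 placeholder has no address for "testip" to fall back on. The function name and the sample lines are made up for illustration.

```python
import re

# Matches a host line in the "IP hostname # tags" layout (assumed layout).
HOST_LINE = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+(\S+)(?:\s*#\s*(.*))?\s*$"
)

def hosts_without_real_ip(lines):
    """Return hostnames whose address field is the 0.0.0.0 placeholder."""
    missing = []
    for line in lines:
        m = HOST_LINE.match(line)
        if not m:
            continue  # comments, page/group directives, blank lines
        ip, name, _tags = m.groups()
        if name == ".default.":
            continue  # pseudo-host that only carries default tags
        if ip == "0.0.0.0":
            missing.append(name)
    return missing

# Hypothetical sample; in practice read the lines from your hosts.cfg.
sample = [
    "0.0.0.0      .default.  # testip",
    "10.1.2.3     web01      # http://web01/ ssh",
    "0.0.0.0      db01       # ssh",
    "page dr DR site",
]
print(hosts_without_real_ip(sample))
```

Any host this reports would need a real address filled in before relying on the address in hosts.cfg instead of DNS.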
list Ben Poppy
So they are pointing to 2 DCs that stay up this entire time; we'll call them DC1 and DC2. Then we shut down DR-DC3 and DR-DC4. When those servers are down, we begin to have issues.
list Phil Crooker
So, can you do DNS queries from the xymon server when DC3 & 4 are down?
list Henrik Størner
On 19-03-2012 19:15, Poppy, Ben wrote:
I have an interesting problem that happened last night. We are working on a DR test. Part of that test includes shutting down some DC’s in our DR datacenter. When that happened, most tests that are initiated from the xymon servers (http, dns, ssh, ftp, etc) to the monitored server went purple. The servers that went purple were not all in our DR datacenter, it was at all of our sites, and even included some tests to the xymon server itself (we monitor the HTTP web page of xymon itself as well). Both of our xymon servers point to 2 windows DC’s in our production datacenter in /etc/resolv.conf for DNS lookups.
Check the "xymonnet" status history. I suppose this status will show some yellow events during this, caused by the network tests taking too long to run. The status will tell you more about which part of the network tests is taking too long. This should also show up in the xymonnet.log file.
One likely culprit would be if you are doing "ntp" tests or custom DNS queries from Xymon against the DCs that are down. "ntp" tests use an external program (ntpdate) to perform the query, and it has a very long timeout when servers are not responding. DNS queries use the C-ARES library, and because I misunderstood how the timeout handling works in this library, it can take several minutes *per test* to time out.
Fixes for both of these issues are "in the pipeline" for the next major Xymon version.
Regards,
Henrik
list Ben Poppy
Yes, that's the strange part, we can still manually do digs and nslookups from the xymon server to other DNS servers.
list Ben Poppy
The number of DNS tests executed jumps from the normal ~1 to somewhere between 1500 and 2400 when those 4 DCs are down (we run DNS tests against them, but they are not the DNS servers set up in /etc/resolv.conf on the xymon servers). We are not doing any NTP tests against any hosts, nor do we do any special DNS test, just the standard test.
list Don Kuhlman
Would you be able to run a tcpdump or use a network sniffer to see what the server is doing when you're getting the long response times? Maybe that will help you see what it is trying to reach when that is happening.
list Jamison Maxwell
I think the interesting sniffer would be on the DCs that remain up. Just to make sure I got this straight, you've got two DCs on the LAN with Xymon and two DCs in your DR site. You shut down the DCs in the DR site and now queries are timing out (or something) to the DCs on the LAN.
If that's the case, I would first look at DNS on the DCs. If you do a packet capture on the DCs, filtered to UDP port 53 and to only the packets from or to your Xymon server, this would show whether the queries are making it to your remaining DCs and whether there is any delay in the response. It wouldn't surprise me if Windows was so worried about the missing domain controllers that it forgot to respond to DNS queries. If there's no delay, then refer to the tcpdump you have running on your Xymon server to make sure the packets are making it back with acceptable latency.
I'm making the assumption that your DNS configuration has the zone in question as a primary, Active Directory integrated zone. Also, are you running a caching DNS server on your Xymon system? I've seen some odd results with DNS caching. What order are the name servers in /etc/resolv.conf? I was playing with Debian this weekend and for some reason, no matter what I did, it would not use any name server other than the first one in the list, except for dig and nslookup, but not for regular queries.
Of course you could always just install your favorite DNS server on your Xymon system and transfer the zones as secondary zones from your Windows boxes. That'll definitely solve the problem and remove the prerequisite of your DCs staying up so Xymon works. ...matter of fact, I think I'll do that myself....
Jamison Maxwell
list Jeremy Laidman
On Wed, Mar 21, 2012 at 9:11 AM, Jamison Maxwell <user-87d336c3dce6@xymon.invalid> wrote:
What order are the name servers in /etc/resolv.conf? I was playing with Debian this weekend and for some reason, no matter what I did it would not use any other name server than the first one in the list except for dig and nslookups, but not for regular queries.
That's all normal behaviour. The standard resolver library on Linux will only use the second and third entries in resolv.conf when the first one is unavailable for 5 seconds (see "man resolv.conf"). The dig, nslookup and host programs don't use the resolver library, and instead use their own in-built resolver that behaves differently to the standard resolver, but still uses the "nameserver" entries from resolv.conf.
For this reason, when diagnosing DNS problems with applications, using dig/nslookup can give you results different to the application you're testing; it's better to use things like "ping" or "telnet" to do a lookup, as they use the same resolver library as most other applications, including Xymon.
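To time lookups through the system resolver rather than through dig or nslookup, something like the sketch below could be run on the Xymon server during an outage. It is only an illustration: Python's socket.getaddrinfo goes through the C library resolver, much like ping and telnet do, but the function name and the host list are placeholders, and nothing here claims to reproduce exactly what Xymon does internally.

```python
import socket
import time

def timed_lookup(hostname, resolve=socket.getaddrinfo):
    """Resolve hostname; return (addresses, seconds taken, error or None)."""
    start = time.monotonic()
    try:
        infos = resolve(hostname, None)
        # Each result's sockaddr starts with the address string.
        addrs = sorted({info[4][0] for info in infos})
        err = None
    except socket.gaierror as exc:
        addrs, err = [], str(exc)
    elapsed = time.monotonic() - start
    return addrs, elapsed, err

if __name__ == "__main__":
    # Placeholder list; substitute short names taken from hosts.cfg.
    for name in ["localhost"]:
        addrs, secs, err = timed_lookup(name)
        print(f"{name}: {addrs or err} in {secs:.2f}s")
```

Lookups that take around 5 seconds or a multiple of it would be consistent with the resolver timing out on one name server before trying the next.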
list Ben Poppy
And a fiber cut to our DR datacenter caused another 5 hour purple storm.
The traces on our DCs showed the DCs at our production datacenter receiving and responding to all DNS lookups. None were getting forwarded down to our other datacenter.
While this was happening, we changed our secondary xymon server to point to our Linux BIND DNS servers (so that xymon1 was pointing to dc1/2, and xymon2 was pointing to binddns1/2), and we still had the purple storms on both xymon servers.
At this point, I'm not sure what to do. Upgrade to the latest xymon in the hopes that some bug causing this has been fixed? Downgrade back to xymon 4.2.3 or even hobbit 4.2?
Another idea that they are suggesting is changing all the shortname entries (in hosts.cfg) to FQDN entries. The problem is there are over 1700 entries and I'd have to essentially find out what domain they are in, and then do a rename command as well. Not to mention that we don't have this issue unless our DR-DC goes down.
Any other ideas from the list by chance?
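If the FQDN route were taken, finding the domain for each of the 1700 short names could be scripted while DNS is healthy. This is a rough sketch, not a tested migration tool: it only builds a short-name-to-FQDN table for review using Python's socket.getfqdn, and it does not touch hosts.cfg or rename anything in Xymon. The function name and example hosts are hypothetical.

```python
import socket

def fqdn_map(short_names, lookup=socket.getfqdn):
    """Map each short name to the FQDN the resolver reports for it."""
    mapping = {}
    for name in short_names:
        fqdn = lookup(name)
        # Keep only results that extend the short name with a domain;
        # getfqdn hands back its input unchanged when it finds nothing.
        if fqdn.lower().startswith(name.lower() + "."):
            mapping[name] = fqdn
    return mapping

if __name__ == "__main__":
    # Placeholder names; substitute the hostnames from hosts.cfg.
    for short, full in fqdn_map(["localhost"]).items():
        print(short, "->", full)
```

Names that do not appear in the result would still need their domain worked out by hand.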
list Jeremy Laidman
On Thu, Apr 12, 2012 at 8:23 AM, Poppy, Ben <user-1ce99a2a9ef8@xymon.invalid> wrote:
Another idea that they are suggesting is changing all the shortname entries (in hosts.cfg) to FQDN entries. The problem is there are over 1700 entries and I'd have to essentially find out what domain they are in, and then do a rename command as well.. Not to mention that we don't have this issue unless our DR-DC goes down.. Any other ideas from the list by chance?
Add to hosts.cfg:
0.0.0.0 .default. # testip
This turns off DNS lookups for all servers in hosts.cfg and uses the IP
address instead.
Cheers
Jeremy
list Jamison Maxwell
I'm not convinced that this is a bug in Xymon. I don't understand how secondary name servers in a DR site that are configured to be used as backups would cause nothing to resolve when they are unavailable. I've run a similar configuration to what I believe you are describing without a problem....
To make sure I'm understanding what you're saying: when the DR DNS servers are unavailable, Xymon fails to accept the DNS query results? Another, possibly clearer, way of saying that is that the DNS queries from Xymon to your production DNS servers fail, despite Xymon absolutely receiving correct replies from your production DNS servers, just because your DR site is unavailable?
Jamison Maxwell
user-87d336c3dce6@xymon.invalid
▸
-----Original Message-----
From: Poppy, Ben [mailto:user-1ce99a2a9ef8@xymon.invalid] Sent: Wednesday, April 11, 2012 6:23 PM
To: Jamison Maxwell; xymon at xymon.com
Subject: RE: [Xymon] Purple storm
And a fiber cut to our DR datacenter caused another 5 hour purple storm.
The traces on our DCs showed our DC's are our production datacenter getting and responding to all DNS lookups. None were getting forwarded down to our other datacenter.
While this was happening, we changed our secondary xymon server to point to our linux bind dns servers (so that xymon1 was pointing to dc1/2, and xymon2 was pointing to binddns1/2), and that still had the purple storms on both xymon servers.
At this point, I'm not sure what to do. Upgrade to latest xymon in the hopes that somehow some bug was fixed that's causing this? downgrade back to xymon 4.2.3 or even hobbit 4.2?
Another idea that they are suggesting is changing all the shortname entries (in hosts.cfg) to FQDN entries. The problem is there are over 1700 entries and I'd have to essentially find out what domain they are in, and then do a rename command as well.. Not to mention that we don't have this issue unless our DR-DC goes down..
Any other ideas from the list by chance?
-----Original Message-----
From: xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com] On Behalf Of Jamison Maxwell
Sent: Tuesday, March 20, 2012 5:11 PM
To: xymon at xymon.com
Subject: Re: [Xymon] Purple storm
I think the interesting sniffer would be on the DC's that remain up. Just to make sure I got this straight, you've got two DC's on the LAN with Xymon and two DC's in your DR site. You shut down the DC's in the DR site and now queries are timing out (or something) to the DC's on the LAN.
If that's the case, I would first look at DNS on the DC's. If you do a packet capture on the DC's, filtered to UDP 53 and only the packets from or to your Xymon server, this would show whether the queries are making it to your remaining DC's and whether there is any delay in the response. It wouldn't surprise me if Windows was so worried about the missing domain controllers that it forgot to respond to DNS queries. If there's no delay, then refer to the tcpdump you have going on your Xymon server to make sure the packets are making it back with acceptable latency.
I'm making the assumption that your DNS configuration has the zone in question as a primary, Active Directory integrated zone. Also, are you running a caching DNS server on your Xymon system? I've seen some odd results with DNS caching. What order are the name servers in /etc/resolv.conf? I was playing with Debian this weekend and for some reason, no matter what I did, it would not use any name server other than the first one in the list for regular queries; only dig and nslookup used the others.
Of course you could always just install your favorite DNS server on your Xymon system and transfer the zones as secondary zones from your Windows boxes, that'll definitely solve the problem and remove the prerequisite of your DC's staying up so Xymon works. ...matter of fact, I think I'll do that myself....
Jamison Maxwell
-----Original Message-----
From: xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com] On Behalf Of Don Kuhlman
Sent: Tuesday, March 20, 2012 3:05 PM
To: xymon at xymon.com
Subject: Re: [Xymon] Purple storm
Would you be able to run a tcpdump or use a network sniffer to see what the server is doing when you're getting the long response times?
Maybe that will help you see what it is trying to reach when that is happening.
On 3/20/12 1:51 PM, "Poppy, Ben" <user-1ce99a2a9ef8@xymon.invalid> wrote:
Yes, that's the strange part, we can still manually do digs and nslookups from the xymon server to other DNS servers.
-----Original Message-----
From: Phil Crooker [mailto:user-e8e31cd73303@xymon.invalid]
Sent: Tuesday, March 20, 2012 12:41 AM
To: Poppy, Ben
Cc: xymon at xymon.com
Subject: Re: [Xymon] Purple storm
So, can you do DNS queries from the xymon server when DC3 & 4 are down?
"Poppy, Ben" 03/20/12 11:50 AM >>>
So they are pointing to 2 DC's that stay up this entire time, we'll call them DC1 and DC2. Then we shutdown DR-DC3 and DR-DC4. When those servers are down, we begin to have issues.
-----Original Message-----
From: Jeremy Laidman [mailto:user-71895fb2e44c@xymon.invalid]
Sent: Monday, March 19, 2012 7:46 PM
To: Poppy, Ben
Cc: xymon at xymon.com
Subject: Re: [Xymon] Purple storm
On Tue, Mar 20, 2012 at 5:15 AM, Poppy, Ben wrote:
I have an interesting problem that happened last night. We are working on a DR test. Part of that test includes shutting down some DC's in our DR datacenter. When that happened, most tests that are initiated from the xymon servers (http, dns, ssh, ftp, etc) to the monitored server went purple.
For network tests, Xymon resolves the IP address from the servername (typically using DNS), and then uses that IP address to perform the test. The IP address in the hosts.cfg file is not normally used for network tests. So if your DNS fails, Xymon's network tests fail also. You can prevent this, and use the IP address supplied in hosts.cfg, by adding "testip" to each hosts.cfg entry that requires it. You can add it to a ".default." entry so that it applies to all hosts.
J
--
This message from ORIX Australia might contain confidential and/or privileged information. If you are not the intended recipient, any use, disclosure or copying of this message (or of any attachments to it) is not authorised. If you have received this message in error, please notify the sender immediately and delete the message and any attachments from your system. Please inform the sender if you do not wish to receive future communications by email. ORIX handles personal information according to a Privacy Policy that is consistent with the National Privacy Principles. Please let us know if you would like a copy. It is also available at http://www.orix.com.au .
list Ben Poppy
To be honest, I'm not sure what the cause is 100%. The setup we have, should not have any dependencies on our DR site. Our xymon servers at our primary site use DNS servers in our primary site. They monitor a bunch of servers at our DR site, but the dependency ends there (and all that should mean is the servers show RED when DR site is down). Another bit of information, during this 5 hour outage, both of our xymon servers went from showing properly (where DR servers were showing RED conn as they weren't reachable, but the servers we monitor in our primary site were up), to everything going purple in conn (and other tests).. It would alternate back and forth over the course of the outage (I didn't detect a regular timeframe of when it switched from RED to PURPLE)..
list Henrik Størner
On 12-04-2012 05:28, Poppy, Ben wrote:
To be honest, I'm not sure what the cause is 100%.
Me neither.
The setup we have, should not have any dependencies on our DR site. Our xymon servers at our primary site use DNS servers in our primary site. They monitor a bunch of servers at our DR site, but the dependency ends there (and all that should mean is the servers show RED when DR site is down).
Aha - but you DO have tests in each setup that checks systems on the other site ? Would that happen to include any DNS or NTP checks ? I suspect that you have each of your Xymon's setup to test availability of the DNS servers on both the primary and the DR site. That could be a problem.
Another bit of information, during this 5 hour outage, both of our xymon servers went from showing properly (where DR servers were showing RED conn as they weren't reachable, but the servers we monitor in our primary site were up), to everything going purple in conn (and other tests).. It would alternate back and forth over the course of the outage (I didn't detect a regular timeframe of when it switched from RED to PURPLE)..
The interesting thing is that they switch to purple, indicating that something is stalled. I have seen something like this happen when we had a number of DNS checks in the Xymon servers, and network access to these failed (broken switch to a customer network). This caused Xymon to stall on these DNS checks, and all of the network tests went purple.
I know that this is difficult to test, because obviously you cannot just cut the connection between the two sites to try it out. But you could try applying this patch, which changes the DNS lookup code to use the same kind of timeout settings as the development version. The 4.3.x versions suffer from a common misunderstanding about how the C-ARES library handles timeouts, which makes DNS timeouts take much too long.
One possible way of testing it would be to firewall access from e.g. your DR site Xymon server to the primary site's DNS server. If you are running Xymon on a Linux server, then "iptables" can do that for you. If your primary site DNS server is 10.1.2.3, then
    iptables -I OUTPUT 1 -d 10.1.2.3 -j DROP
    iptables -I INPUT 1 -s 10.1.2.3 -j DROP
will cause all traffic to/from this server to be dropped.
Regards,
Henrik
Attachments (1)
list Henrik Størner
On 12-04-2012 07:47, Henrik Størner wrote:
I know that this is difficult to test, because obviously you cannot just cut the connection between the two sites to try it out. But you could try applying this patch
The first part of the patch - the one for xymonnet/contest.c - is completely unrelated. You can remove this before applying if you like, but unless you have sites explicitly tested with https and SSLv2 it is harmless. Regards, Henrik
list Ben Poppy
I may have missed this in a past post, how do I apply this patch? I do test DNS for sure on servers at our DR site (many of them). The test you suggest below, is that to simulate the purple storm? Should it essentially turn purple if I begin dropping all packets to a few DNS servers I'm testing? Would I be able to run this same iptables on my backup xymon server in our primary site to a few servers it checks DNS against in our DR site? Should that effectively cause the purple storm? Thanks for your help.
list Henrik Størner
On Thu, 12 Apr 2012 06:27:01 +0000, "Poppy, Ben" <user-1ce99a2a9ef8@xymon.invalid> wrote:
I may have missed this in a past post, how do I apply this patch?
Ah - ok, my developer-mind assumed everyone knows how to do that :-)
Save the attachment to /tmp/dnstimeout.patch, then:
    cd xymon-4.3.7
    patch -p0 </tmp/dnstimeout.patch
    make clean
    make
You can run "make install" afterwards, but a safer option would be to just copy the "xymon-4.3.7/xymonnet/xymonnet" binary into your Xymon "bin" directory, replacing the one that is already there.
I do test DNS for sure on servers at our DR site (many of them). The test you suggest below, is that to simulate the purple storm?
It is to simulate that your Xymon server loses connectivity to the DNS server on the primary site.
Should it essentially turn purple if I begin dropping all packets to a few DNS servers I'm testing?
That is what I suspect, yes.
Would I be able to run this same iptables on my backup xymon server in our primary site to a few servers it checks DNS against in our DR site? Should that effectively cause the purple storm?
What I'm trying to do is to simulate the situation you had which caused the purple storm, without actually pulling the plug and disrupting the network between the two sites. If I understand you correctly, then the purple storm happened when you lost the connection between your two datacenters. Since I suspect that this is related to DNS lookups taking a very long time with the stock 4.3.7 Xymon version, you can use iptables to just block traffic from Xymon to the DNS server(s) in the other datacenter. Regards, Henrik
list Mark Deiss
I am not sure about how BBWIN is reporting its Windows client metrics;
MrBig for Windows reports file system usage in the following format:
Filesystem   1k-blocks      Used      Avail  Capacity  Mounted
C             15727603   8740006    6987597     55.6%  /FIXED/C
D            277225672   6502600  270723072      2.3%  /FIXED/D
Limits:
Drive    Yellow  Red
D        70.0    80.0
C        70.0    80.0
Default  90.0    95.0
We have some clustered Windows 2008 servers that are using shared
resources that are only referenced by their UNC paths - they do not have
logical drive assignments. Was wondering whether this would cause any
issues with the processing in do_disk.c section:
/*
 * Some Unix filesystem reports contain the word "Filesystem".
 * So check if there's a slash in the NT filesystem letter - if yes,
 * then it's really a Unix system after all.
 */
if ( (dsystype == DT_NT) && (*(columns[5])) && (strchr(columns[0], '/')) )
        dsystype = DT_UNIX;
Where the columns[5] would be referring to the "Mounted" column in the
MrBig output, and columns[0] would be referring to the Filesystem
column. The above logic is not being performed as a block step against
the overall client's disk output but rather on a line by line basis of
the disk output. The first occurrence matched will flip the subsequent
data processing flow from Windows based to Unix based even for
subsequent disk lines from the same client.
My question is whether UNC-only mounts would be represented with
backslashes (\) by the BBWIN/MrBig/whatever Windows reporting modules,
or whether they may end up converted to forward slashes (/) and thus
falsely trigger the dsystype switch. The only effect on the subsequent
processing is whether the Filesystem column (Windows OS) is used for
storing the diskname (and generation of the rrd file name) or the
"Mounted on" column is used for the diskname (Unix OS).
If the UNC representation can cause this false flip, then
initial Windows disk lines that are using logical drive assignments
would be stored and referenced using the one column reference and once a
UNC line is encountered, then it and all subsequent lines (including
logical drive assignment lines) would be using the other column
reference convention.
Things could get busy if the line order varies from time to time
(i.e. addition/removal of a UNC resource); then a logical drive time
stamped data line may go in at times as a "Filesystem" rrd file, other
times as a "Mounted on" rrd file.
setupfn2("%s%s.rrd", testname, diskname);
As I don't yet have any clients installed in these clustered Windows
boxes or UNC only mounts on other boxes, I don't know what the "Mounted
on" column output would look like for the UNC resource. If
BBWIN/MrBig/whatever are not reporting on UNC resources then that's
going to be a (separate) issue for us too.
Possible fix if this is an issue:
Was wondering if the Windows clients (BBWIN/MrBig/whatever) all
reliably report the last column with the "Mounted" header tag and
whether all the Unix/Linux/BSD variants report their last column with
the "Mounted on" header tag. If so then maybe a better way to handle
this is to run a check against the overall client disk msg block for the
string pattern of "Mounted on" instead. Change do_disk.c section:
else if (strstr(msg, "Filesystem")) dsystype = DT_NT;
else dsystype = DT_UNIX;
to
else if (strstr(msg, "Filesystem")) dsystype = DT_NT;   /* This will trigger for Windows and Unix/Linux flavors */
else if (strstr(msg, "Mounted on")) dsystype = DT_UNIX; /* Assuming all unix/linux/BSD clients report with "Mounted on"
                                                           and Windows clients only report with "Mounted" in their header line */
list Malcolm Hunter
Hi Mark,
See this update in Subversion trunk:
http://xymon.svn.sourceforge.net/viewvc/xymon?view=revision&revision=6708
Malcolm
--
BBWin Development - The Windows client for Big Brother and Xymon
http://bbwin.sourceforge.net
http://xymon.sourceforge.net
list Josh Luthman
Can you make the default to testip but specify a host to use DNS?
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX
list Jeremy Laidman
On Fri, Apr 13, 2012 at 2:53 AM, Josh Luthman <user-4c45a83f15cb@xymon.invalid> wrote:
Can you make the default to testip but specify a host to use DNS?
No, but you can define .default. more than once, and the defaults will change for subsequent hosts until the next .default. (if any). So you could do:
0.0.0.0 .default. # testip dialup
1.1.1.1 server1 # ssh smtp
1.1.1.2 server2 # ssh telnet
0.0.0.0 .default. # dialup
1.1.1.3 server3 # ssh http
0.0.0.0 .default. # testip dialup
1.1.1.4 server4 # ssh rdp
1.1.1.5 server5 # ssh telnet
All hosts except server3 will get the "testip" setting. Please note that this is just how I think it should work, and I haven't tested it.
J
list Josh Luthman
I put it at the top and included a host below pages and groups.
list Ben Poppy
Hmm, interestingly enough, I have not been able to reproduce the purple storm by cutting off communication to all 10 DC/DNS servers at our DR location. I think I'm still going to move forward with updating to the latest stable release, as well as with the patch you provided. And if worst comes to worst, I'll kill monitoring of DR servers if/when we lose connectivity again. Thanks for your help, it is so greatly appreciated!
-----Original Message-----
From: xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com] On Behalf Of Henrik Størner
Sent: Thursday, April 12, 2012 12:47 AM
To: xymon at xymon.com
Subject: Re: [Xymon] Purple storm
On 12-04-2012 05:28, Poppy, Ben wrote: To be honest, I'm not sure what the cause is 100%.
Me neither.
The setup we have, should not have any dependencies on our DR site. Our xymon servers at our primary site use DNS servers in our primary site. They monitor a bunch of servers at our DR site, but the dependency ends there (and all that should mean is the servers show RED when DR site is down).
Aha - but you DO have tests in each setup that check systems on the other site? Would that happen to include any DNS or NTP checks? I suspect that you have each of your Xymon servers set up to test availability of the DNS servers on both the primary and the DR site. That could be a problem.
Another bit of information: during this 5 hour outage, both of our xymon servers went from showing properly (where DR servers were showing RED conn as they weren't reachable, but the servers we monitor in our primary site were up), to everything going purple in conn (and other tests). It would alternate back and forth over the course of the outage (I didn't detect a regular timeframe of when it switched from RED to PURPLE).
The interesting thing is that they switch to purple, indicating that something is stalled. I have seen something like this happen when we had a number of DNS checks in the Xymon servers, and network access to these failed (broken switch to a customer network). This caused xymon to stall on these DNS checks, and all of the network tests went purple.
I know that this is difficult to test, because obviously you cannot just cut the connection between the two sites to try it out. But you could try applying this patch, which changes the DNS lookup code to use the same kind of timeout settings as the development version - the 4.3.x versions suffer from a common misunderstanding about how the C-ARES library handles timeouts, which makes DNS timeouts take much too long.
One possible way of testing it would be if you can firewall access from e.g. your DR site Xymon server to the primary site's DNS server. If you are running Xymon on a Linux server, then "iptables" can do that for you. If your primary site DNS server is 10.1.2.3, then
    iptables -I OUTPUT 1 -d 10.1.2.3 -j DROP
    iptables -I INPUT 1 -s 10.1.2.3 -j DROP
will cause all traffic to/from this server to be dropped.
Regards, Henrik
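Since the two rules above stay in place until they are removed, a test like this needs a matching cleanup step. A minimal sketch, assuming the same example address 10.1.2.3: `block_dns` inserts the two DROP rules from the message, and `unblock_dns` deletes them again with `iptables -D` and the same rule specification. The helper names and the `DRY_RUN` switch are illustrative, not part of Xymon or of the original message; real use requires root.

```shell
#!/bin/sh
# Sketch: temporarily drop traffic to/from one DNS server, then undo it.
# 10.1.2.3 is the example address used in the message above.
DNS_SERVER="${DNS_SERVER:-10.1.2.3}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Insert the DROP rules at the top of the OUTPUT and INPUT chains.
block_dns() {
    run iptables -I OUTPUT 1 -d "$DNS_SERVER" -j DROP &&
    run iptables -I INPUT 1 -s "$DNS_SERVER" -j DROP
}

# Delete the same rules by specification; try both even if one is missing.
unblock_dns() {
    run iptables -D OUTPUT -d "$DNS_SERVER" -j DROP
    run iptables -D INPUT -s "$DNS_SERVER" -j DROP
}
```

Sourcing the file and calling `block_dns` with the default `DRY_RUN=1` only prints the commands; setting `DRY_RUN=0` runs them, and `unblock_dns` restores normal traffic once the test is over.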
list Ben Poppy
So far so good, I have been unable to reproduce the purple storm. We'll find out in a couple weeks when we do our next DR isolation test. Thanks so much for your help!