On Tue, February 16, 2016 1:44 am, L-M-J wrote:
Hi,
I'm still running into trouble every night between ~0h30 and ~2h40 :-(
1) I checked the backup on my physical Xymon server: it starts around 9pm
and runs for 4:45 min.
2) We cross-monitored the DNS server from another monitoring tool : no
DNS outage detected.
3) I monitored the Xymon server network link state with "mii-tool" every
second: no trouble detected.
4) I pinged my Xymon servers from 2 different network locations all night
long: no trouble detected.
5) No firewalls between my Xymon server and the monitored hosts
6) Out of 500 hosts, only ~30 are in trouble every night, and mostly the
same ones.
7) The hosts are VMs, physical servers, and public internet websites.
Here is what I've found in the xymond.log today :
2016-02-16 02:02:57 Flapping detected for www.foo1.com:http - 5 changes in 1708 seconds
2016-02-16 02:02:57 Flapping detected for www.foo2.com:http - 5 changes in 1708 seconds
2016-02-16 02:02:57 Flapping detected for www.microsoft.com:http - 5 changes in 1708 seconds
2016-02-16 02:06:14 Flapping detected for server01:http - 5 changes in 1678 seconds
2016-02-16 02:06:14 Flapping detected for server02:http - 5 changes in 1678 seconds
2016-02-16 02:06:29 Flapping detected for server03:conn - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server04:ldap - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server06:ssh - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server05:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server07:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server08:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server09:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for foo.bar1.com:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for foo.bar2.com:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for foo.bar3.fr:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server10:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server11-t:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server12:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server13:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server14:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server15:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server16:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server17:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server18:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server19:http - 5 changes in 1745 seconds
Here is part of the configuration, plus the errors displayed in the Xymon
web interface:
hosts.cfg : 0.0.0.0 server03 # conn NAME:"server03" DESCR:"VM FOO BAR"
Error : conn NOT ok : DNS lookup failed / Unable to resolve hostname server03
System unreachable for 2 poll periods (86 seconds)
Everything looks as if DNS resolution failed.
hosts.cfg : 10.X.Y.188 server05 # conn tse NAME:"Server 05" DESCR:"My comment" http://server05/
Error : DNS error red http://server05/ - DNS error
- Why do I have a "DNS error" here? I set the IP for this host yesterday
to solve the issue. The "conn" error has been gone since yesterday evening,
but the http one still remains.
All signs do point to an issue with DNS resolution here.
Was this a custom compile or are you using a package? If custom, what
version of c-ares is on your system? That's the underlying resolution
library that xymonnet uses by default to handle DNS lookups. The fact
that the 'conn' test recovered after you added the IP to the host's entry,
while 'http' did not, is consistent with that: HTTP tests perform their own
separate DNS lookup (to deal with vhosts, etc.) unless the IP is specified
for the HTTP test as well.
Xymon otherwise does not cache DNS records or anything else when it comes
to network polling like this, since each xymonnet run is a fresh
process.
Try adding the '--dnslog=' option to xymonnet during this period to get a
log of exactly what's happening with DNS resolution, and --debug as well
(but just once or twice). You can also try testing using '--no-ares',
however the system resolver is much slower and less predictable than
c-ares (normally).
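If you want an independent view of resolution during that window, here is a minimal sketch of a probe you could run on the Xymon server alongside xymonnet. It goes through the system resolver (Python's socket.getaddrinfo), not c-ares, so it is closer to what '--no-ares' would see. The function names and the default interval are my own assumptions for illustration; nothing here is part of Xymon.

```python
import socket
import time
from datetime import datetime


def probe(hostnames, resolver=socket.getaddrinfo):
    """Resolve each hostname once; return (hostname, error) for each failure."""
    failures = []
    for name in hostnames:
        try:
            resolver(name, None)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failures.append((name, str(exc)))
    return failures


def run(hostnames, interval=60, rounds=1, resolver=socket.getaddrinfo, out=print):
    """Probe `rounds` times, emitting one timestamped line per failed lookup.

    Returns the total number of failed lookups seen.
    """
    total = 0
    for i in range(rounds):
        for name, err in probe(hostnames, resolver):
            out(f"{datetime.now():%Y-%m-%d %H:%M:%S} FAIL {name}: {err}")
            total += 1
        if i + 1 < rounds:
            time.sleep(interval)  # wait before the next round
    return total
```

For example, run(["server03", "server05"], interval=60, rounds=180) would cover about three hours; substitute names from your own hosts.cfg. If this stays clean while xymonnet still reports DNS errors, that would point more toward c-ares or the polling concurrency than toward the DNS server.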
It might also help to lower your --concurrency=N setting to something
below the system default (which will typically be 256).
There's clearly *something* going on that's specific to that period, but
the signs point to something on the Xymon host itself. This is especially
true if you add a local DNS cache and you're still seeing the problem.
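To check how tightly the flapping is confined to that period, a rough sketch like the following could tally the "Flapping detected" entries from xymond.log by test and by minute. It assumes each entry sits on a single line in the file, in the format you pasted; the regular expression and the function name are mine, not anything shipped with Xymon.

```python
import re
from collections import Counter

# Matches lines such as:
# 2016-02-16 02:07:21 Flapping detected for server04:ldap - 5 changes in 1745 seconds
FLAP_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) Flapping detected for "
    r"(\S+):(\S+) - (\d+) changes in (\d+) seconds$"
)


def summarize(lines):
    """Count flapping entries per test name and per minute of the day."""
    by_test = Counter()
    by_minute = Counter()
    for line in lines:
        m = FLAP_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a flapping entry
        stamp, _host, test, _changes, _seconds = m.groups()
        by_test[test] += 1
        by_minute[stamp[:16]] += 1  # "YYYY-MM-DD HH:MM"
    return by_test, by_minute
```

Feeding it the open log file (for example, summarize(open(path)) with your own path) over a few nights should show whether the same minutes and the same tests keep coming up.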
HTH,
-jc