Xymon Mailing List Archive search

DNS failures causing "runtime longer than time limit"

list Jeremy Laidman
Mon, 11 Sep 2017 13:52:46 +1000
Message-Id: <CACO=ejzNx8h95yL6PozO4vDW508t=user-68ec9e7c6bea@xymon.invalid>

Hi

I'm reviving an old thread, because this is biting me again, so I wanted to
know if anyone had any fresh ideas on this problem.

Many of the servers I monitor are DNS servers, so the C-ARES library has a
lot of queries to perform every 5 minutes. In some cases, I want to ensure
that a DNS service is down (and alert when not) so most of the time I can
expect a timeout, leading to a long poll cycle. I'd really like to be able
to drop the timeout to significantly less than the 23 seconds it's taking
now per server.

Cheers
Jeremy


On 3 June 2015 at 13:49, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
OK, I'm a bit puzzled by this, and definitely pushing the envelope of my
debugging and C coding skills.  The relevant code from xymonnet/dns.c is:

    168                 tv.tv_sec = dnstimeout; tv.tv_usec = 0;
    169                 tvp = ares_timeout(channel, &tv, &tv);

I ran this through gdb, with "--dns-timeout=3" specified, setting a
breakpoint at line 168.  I confirmed that dnstimeout is set to 3.  When I
step one line, I should see tv.tv_sec set to 3 also, but it's set to 0.

If I don't specify --dns-timeout at all, printing dnstimeout shows "30".
Again, after stepping to the next line, tv.tv_sec is still zero.

Breakpoint 1, dns_ares_queue_run (channel=0x58b1c0) at dns.c:168
168                     tv.tv_sec = dnstimeout; tv.tv_usec = 0;
(gdb) p dnstimeout
$14 = 30
(gdb) n
169                     tvp = ares_timeout(channel, &tv, &tv);
(gdb) p tv
$15 = {tv_sec = 0, tv_usec = 0}
(gdb)

So what gives here?

J


On 3 June 2015 at 13:08, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
Hi

I'm running Xymon v4.3.10 on Linux, and I'm quite sure it's compiled with
c-ares support.

I have 12 new DNS servers that were added to Xymon about one month ago.
All of my server entries in hosts.cfg have "testip".  The tasks.cfg runs
xymonet with "--dns-timeout=3".  The hosts entries look like so:

10.10.10.1 dnshost1.example.com    # testip dns=NS:example.com,SOA:
example.com

About a week ago, connectivity to all of these servers failed, and at the
same time, the xymonnet run time jumped from less than 15 seconds to about
330 seconds, so about 315 seconds extra.  The xymonnet page says 295
seconds is taken up by DNS tests.

If the increase in time taken is about 315 and is entirely due to the 12
servers failing, then each failed server is adding about 26 seconds to the
total run time.

I don't think this should be happening like this.  With two DNS checks
per server, the DNS checks should be taking 6 seconds each to time-out, not
26.  If I run xymonnet with "--timing --no-update" and specify only one
hostname, I can view the results and the timing.  This shows that the ping
check gets reported after about 3 seconds, and then the DNS tests are
executed and take 26 seconds total.

My naiive assumption was that when a server failed a ping (and didn't
have "noclear" defined in hosts.cfg) then the network checks would be
skipped.  On re-reading the man page for hosts.cfg, it dawned on me that a
failed ping simply suppresses failed test /results/, but doesn't stop the
tests from being run.

So the real problem is that the "--dns-timeout=3" is not being taken into
consideration by xymonnet.  If I run xymonnet with "--debug" it tells me:

1900 2015-06-03 12:02:20 ares_search: tlookup='example.com', class=1,
type=2
1900 2015-06-03 12:02:20 ares_search: tlookup='example.com', class=1,
type=6
1900 2015-06-03 12:02:20 Processing 0 DNS lookups with ARES
1900 2015-06-03 12:02:46 Finished ARES queue after loop 423

This is peculiar.  Why would it say "processing 0 DNS lookups" when there
are two lookups to test?  Could this be because xymonnet hasn't actually
been built with ARES support and I didn't know it?  Is there a good way to
tell?  If I add "--no-ares" I get the same results perhaps suggesting a
lack of ARES support.  On the other hand, if I add "timeout:3" and
"attempts:1" into resolv.conf, I also get the same results.  If I run "nm
/path/to/xymonnet | grep gethostby" it returns "ares_gethostbyname".

Just for fun, I compiled Xymon v4.3.21 and ran the xymonnet binary from
there, with no change in behaviour.  I also tried removing the
"--dns-timeout" option so that it defaults to 30 seconds, but still no
change - 26 seconds for two DNS tests.

So, I'm not really sure what the problem is, but xymonnet certainly isn't
behaving as I would expect.

Cheers
Jeremy