Xymon Mailing List Archive search

DNS failures causing "runtime longer than time limit"

list Henrik Størner
Mon, 11 Sep 2017 07:11:56 +0200
Message-Id: <user-4a7905cf32fe@xymon.invalid>

Hi,

I remember looking into this a long time ago, and the --dnstimeout 
setting does not quite work as expected - because C-ARES does not quite 
work as expected.

C-ARES has some timeout settings for queries, but it performs an 
exponential back-off between queries, so it is impossible to really hit 
the exact timeout you specify in --dnstimeout.

In fact, current 4.3.x versions have a hard-coded setting for the C-ARES 
timeouts - it starts with a 2 second timeout and performs 4 attempts, 
which ends up with approximately 23 second timeout for all DNS queries. 
This is in xymonnet/dns.c (look for "ARES timeout"). If you need those 
really short timeouts, then that is probably what you should change.


Regards,
Henrik


On 11-09-2017 05:52, Jeremy Laidman wrote:
Hi

I'm reviving an old thread, because this is biting me again, so I 
wanted to know if anyone had any fresh ideas on this problem.

Many of the servers I monitor are DNS servers, so the C-ARES library 
has a lot of queries to perform every 5 minutes. In some cases, I want 
to ensure that a DNS service is down (and alert when not) so most of 
the time I can expect a timeout, leading to a long poll cycle. I'd 
really like to be able to drop the timeout to significantly less than 
the 23 seconds it's taking now per server.

Cheers
Jeremy


On 3 June 2015 at 13:49, Jeremy Laidman <user-71895fb2e44c@xymon.invalid 
<mailto:user-71895fb2e44c@xymon.invalid>> wrote:

    OK, I'm a bit puzzled by this, and definitely pushing the envelope
    of my debugging and C coding skills.  The relevant code from
    xymonnet/dns.c is:

        168                 tv.tv_sec = dnstimeout; tv.tv_usec = 0;
        169                 tvp = ares_timeout(channel, &tv, &tv);

    I ran this through gdb, with "--dns-timeout=3" specified, setting
    a breakpoint at line 168.  I confirmed that dnstimeout is set to
    3.  When I step one line, I should see tv.tv_sec set to 3 also,
    but it's set to 0.

    If I don't specify --dns-timeout at all, printing dnstimeout shows
    "30".  Again, after stepping to the next line, tv.tv_sec is still
    zero.

    Breakpoint 1, dns_ares_queue_run (channel=0x58b1c0) at dns.c:168
    168                     tv.tv_sec = dnstimeout; tv.tv_usec = 0;
    (gdb) p dnstimeout
    $14 = 30
    (gdb) n
    169                     tvp = ares_timeout(channel, &tv, &tv);
    (gdb) p tv
    $15 = {tv_sec = 0, tv_usec = 0}
    (gdb)

    So what gives here?

    J


    On 3 June 2015 at 13:08, Jeremy Laidman <user-71895fb2e44c@xymon.invalid
    <mailto:user-71895fb2e44c@xymon.invalid>> wrote:

        Hi

        I'm running Xymon v4.3.10 on Linux, and I'm quite sure it's
        compiled with c-ares support.

        I have 12 new DNS servers that were added to Xymon about one
        month ago.  All of my server entries in hosts.cfg have
        "testip".  The tasks.cfg runs xymonet with "--dns-timeout=3".
        The hosts entries look like so:

        10.10.10.1 dnshost1.example.com <http://dnshost1.example.com>;
        # testip dns=NS:example.com
        <http://example.com>,SOA:example.com <http://example.com>;

        About a week ago, connectivity to all of these servers failed,
        and at the same time, the xymonnet run time jumped from less
        than 15 seconds to about 330 seconds, so about 315 seconds
        extra.  The xymonnet page says 295 seconds is taken up by DNS
        tests.

        If the increase in time taken is about 315 and is entirely due
        to the 12 servers failing, then each failed server is adding
        about 26 seconds to the total run time.

        I don't think this should be happening like this.  With two
        DNS checks per server, the DNS checks should be taking 6
        seconds each to time-out, not 26.  If I run xymonnet with
        "--timing --no-update" and specify only one hostname, I can
        view the results and the timing. This shows that the ping
        check gets reported after about 3 seconds, and then the DNS
        tests are executed and take 26 seconds total.

        My naiive assumption was that when a server failed a ping (and
        didn't have "noclear" defined in hosts.cfg) then the network
        checks would be skipped. On re-reading the man page for
        hosts.cfg, it dawned on me that a failed ping simply
        suppresses failed test /results/, but doesn't stop the tests
        from being run.

        So the real problem is that the "--dns-timeout=3" is not being
        taken into consideration by xymonnet.  If I run xymonnet with
        "--debug" it tells me:

        1900 2015-06-03 12:02:20 ares_search: tlookup='example.com
        <http://example.com>';, class=1, type=2
        1900 2015-06-03 12:02:20 ares_search: tlookup='example.com
        <http://example.com>';, class=1, type=6
        1900 2015-06-03 12:02:20 Processing 0 DNS lookups with ARES
        1900 2015-06-03 12:02:46 Finished ARES queue after loop 423

        This is peculiar.  Why would it say "processing 0 DNS lookups"
        when there are two lookups to test?  Could this be because
        xymonnet hasn't actually been built with ARES support and I
        didn't know it?  Is there a good way to tell?  If I add
        "--no-ares" I get the same results perhaps suggesting a lack
        of ARES support.  On the other hand, if I add "timeout:3" and
        "attempts:1" into resolv.conf, I also get the same results. If
        I run "nm /path/to/xymonnet | grep gethostby" it returns
        "ares_gethostbyname".

        Just for fun, I compiled Xymon v4.3.21 and ran the xymonnet
        binary from there, with no change in behaviour.  I also tried
        removing the "--dns-timeout" option so that it defaults to 30
        seconds, but still no change - 26 seconds for two DNS tests.

        So, I'm not really sure what the problem is, but xymonnet
        certainly isn't behaving as I would expect.

        Cheers
        Jeremy