Xymon Mailing List Archive search

DNS failures causing "runtime longer than time limit"

4 messages in this thread

list Jeremy Laidman · Wed, 3 Jun 2015 13:08:29 +1000 ·
Hi

I'm running Xymon v4.3.10 on Linux, and I'm quite sure it's compiled with
c-ares support.

I have 12 new DNS servers that were added to Xymon about one month ago.
All of my server entries in hosts.cfg have "testip".  The tasks.cfg runs
xymonet with "--dns-timeout=3".  The hosts entries look like so:

10.10.10.1 dnshost1.example.com    # testip dns=NS:example.com,SOA:
example.com

About a week ago, connectivity to all of these servers failed, and at the
same time, the xymonnet run time jumped from less than 15 seconds to about
330 seconds, so about 315 seconds extra.  The xymonnet page says 295
seconds is taken up by DNS tests.

If the increase in time taken is about 315 and is entirely due to the 12
servers failing, then each failed server is adding about 26 seconds to the
total run time.

I don't think this should be happening like this.  With two DNS checks per
server, the DNS checks should be taking 6 seconds each to time-out, not
26.  If I run xymonnet with "--timing --no-update" and specify only one
hostname, I can view the results and the timing.  This shows that the ping
check gets reported after about 3 seconds, and then the DNS tests are
executed and take 26 seconds total.

My naiive assumption was that when a server failed a ping (and didn't have
"noclear" defined in hosts.cfg) then the network checks would be skipped.
On re-reading the man page for hosts.cfg, it dawned on me that a failed
ping simply suppresses failed test /results/, but doesn't stop the tests
from being run.

So the real problem is that the "--dns-timeout=3" is not being taken into
consideration by xymonnet.  If I run xymonnet with "--debug" it tells me:

1900 2015-06-03 12:02:20 ares_search: tlookup='example.com', class=1, type=2
1900 2015-06-03 12:02:20 ares_search: tlookup='example.com', class=1, type=6
1900 2015-06-03 12:02:20 Processing 0 DNS lookups with ARES
1900 2015-06-03 12:02:46 Finished ARES queue after loop 423

This is peculiar.  Why would it say "processing 0 DNS lookups" when there
are two lookups to test?  Could this be because xymonnet hasn't actually
been built with ARES support and I didn't know it?  Is there a good way to
tell?  If I add "--no-ares" I get the same results perhaps suggesting a
lack of ARES support.  On the other hand, if I add "timeout:3" and
"attempts:1" into resolv.conf, I also get the same results.  If I run "nm
/path/to/xymonnet | grep gethostby" it returns "ares_gethostbyname".

Just for fun, I compiled Xymon v4.3.21 and ran the xymonnet binary from
there, with no change in behaviour.  I also tried removing the
"--dns-timeout" option so that it defaults to 30 seconds, but still no
change - 26 seconds for two DNS tests.

So, I'm not really sure what the problem is, but xymonnet certainly isn't
behaving as I would expect.

Cheers
Jeremy
list Jeremy Laidman · Wed, 3 Jun 2015 13:49:21 +1000 ·
OK, I'm a bit puzzled by this, and definitely pushing the envelope of my
debugging and C coding skills.  The relevant code from xymonnet/dns.c is:

    168                 tv.tv_sec = dnstimeout; tv.tv_usec = 0;
    169                 tvp = ares_timeout(channel, &tv, &tv);

I ran this through gdb, with "--dns-timeout=3" specified, setting a
breakpoint at line 168.  I confirmed that dnstimeout is set to 3.  When I
step one line, I should see tv.tv_sec set to 3 also, but it's set to 0.

If I don't specify --dns-timeout at all, printing dnstimeout shows "30".
Again, after stepping to the next line, tv.tv_sec is still zero.

Breakpoint 1, dns_ares_queue_run (channel=0x58b1c0) at dns.c:168
168                     tv.tv_sec = dnstimeout; tv.tv_usec = 0;
(gdb) p dnstimeout
$14 = 30
(gdb) n
169                     tvp = ares_timeout(channel, &tv, &tv);
(gdb) p tv
$15 = {tv_sec = 0, tv_usec = 0}
(gdb)

So what gives here?

J
quoted from Jeremy Laidman


On 3 June 2015 at 13:08, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
Hi

I'm running Xymon v4.3.10 on Linux, and I'm quite sure it's compiled with
c-ares support.

I have 12 new DNS servers that were added to Xymon about one month ago.
All of my server entries in hosts.cfg have "testip".  The tasks.cfg runs
xymonet with "--dns-timeout=3".  The hosts entries look like so:

10.10.10.1 dnshost1.example.com    # testip dns=NS:example.com,SOA:
example.com

About a week ago, connectivity to all of these servers failed, and at the
same time, the xymonnet run time jumped from less than 15 seconds to about
330 seconds, so about 315 seconds extra.  The xymonnet page says 295
seconds is taken up by DNS tests.

If the increase in time taken is about 315 and is entirely due to the 12
servers failing, then each failed server is adding about 26 seconds to the
total run time.

I don't think this should be happening like this.  With two DNS checks per
server, the DNS checks should be taking 6 seconds each to time-out, not
26.  If I run xymonnet with "--timing --no-update" and specify only one
hostname, I can view the results and the timing.  This shows that the ping
check gets reported after about 3 seconds, and then the DNS tests are
executed and take 26 seconds total.

My naiive assumption was that when a server failed a ping (and didn't have
"noclear" defined in hosts.cfg) then the network checks would be skipped.
On re-reading the man page for hosts.cfg, it dawned on me that a failed
ping simply suppresses failed test /results/, but doesn't stop the tests
from being run.

So the real problem is that the "--dns-timeout=3" is not being taken into
consideration by xymonnet.  If I run xymonnet with "--debug" it tells me:

1900 2015-06-03 12:02:20 ares_search: tlookup='example.com', class=1,
type=2
1900 2015-06-03 12:02:20 ares_search: tlookup='example.com', class=1,
type=6
1900 2015-06-03 12:02:20 Processing 0 DNS lookups with ARES
1900 2015-06-03 12:02:46 Finished ARES queue after loop 423

This is peculiar.  Why would it say "processing 0 DNS lookups" when there
are two lookups to test?  Could this be because xymonnet hasn't actually
been built with ARES support and I didn't know it?  Is there a good way to
tell?  If I add "--no-ares" I get the same results perhaps suggesting a
lack of ARES support.  On the other hand, if I add "timeout:3" and
"attempts:1" into resolv.conf, I also get the same results.  If I run "nm
/path/to/xymonnet | grep gethostby" it returns "ares_gethostbyname".

Just for fun, I compiled Xymon v4.3.21 and ran the xymonnet binary from
there, with no change in behaviour.  I also tried removing the
"--dns-timeout" option so that it defaults to 30 seconds, but still no
change - 26 seconds for two DNS tests.

So, I'm not really sure what the problem is, but xymonnet certainly isn't
behaving as I would expect.

Cheers
Jeremy

list Jeremy Laidman · Mon, 11 Sep 2017 13:52:46 +1000 ·
Hi

I'm reviving an old thread, because this is biting me again, so I wanted to
know if anyone had any fresh ideas on this problem.

Many of the servers I monitor are DNS servers, so the C-ARES library has a
lot of queries to perform every 5 minutes. In some cases, I want to ensure
that a DNS service is down (and alert when not) so most of the time I can
expect a timeout, leading to a long poll cycle. I'd really like to be able
to drop the timeout to significantly less than the 23 seconds it's taking
now per server.

Cheers
Jeremy
quoted from Jeremy Laidman


On 3 June 2015 at 13:49, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
OK, I'm a bit puzzled by this, and definitely pushing the envelope of my
debugging and C coding skills.  The relevant code from xymonnet/dns.c is:

    168                 tv.tv_sec = dnstimeout; tv.tv_usec = 0;
    169                 tvp = ares_timeout(channel, &tv, &tv);

I ran this through gdb, with "--dns-timeout=3" specified, setting a
breakpoint at line 168.  I confirmed that dnstimeout is set to 3.  When I
step one line, I should see tv.tv_sec set to 3 also, but it's set to 0.

If I don't specify --dns-timeout at all, printing dnstimeout shows "30".
Again, after stepping to the next line, tv.tv_sec is still zero.

Breakpoint 1, dns_ares_queue_run (channel=0x58b1c0) at dns.c:168
168                     tv.tv_sec = dnstimeout; tv.tv_usec = 0;
(gdb) p dnstimeout
$14 = 30
(gdb) n
169                     tvp = ares_timeout(channel, &tv, &tv);
(gdb) p tv
$15 = {tv_sec = 0, tv_usec = 0}
(gdb)

So what gives here?

J


On 3 June 2015 at 13:08, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
Hi

I'm running Xymon v4.3.10 on Linux, and I'm quite sure it's compiled with
c-ares support.

I have 12 new DNS servers that were added to Xymon about one month ago.
All of my server entries in hosts.cfg have "testip".  The tasks.cfg runs
xymonet with "--dns-timeout=3".  The hosts entries look like so:

10.10.10.1 dnshost1.example.com    # testip dns=NS:example.com,SOA:
example.com

About a week ago, connectivity to all of these servers failed, and at the
same time, the xymonnet run time jumped from less than 15 seconds to about
330 seconds, so about 315 seconds extra.  The xymonnet page says 295
seconds is taken up by DNS tests.

If the increase in time taken is about 315 and is entirely due to the 12
servers failing, then each failed server is adding about 26 seconds to the
total run time.

I don't think this should be happening like this.  With two DNS checks
per server, the DNS checks should be taking 6 seconds each to time-out, not
26.  If I run xymonnet with "--timing --no-update" and specify only one
hostname, I can view the results and the timing.  This shows that the ping
check gets reported after about 3 seconds, and then the DNS tests are
executed and take 26 seconds total.

My naiive assumption was that when a server failed a ping (and didn't
have "noclear" defined in hosts.cfg) then the network checks would be
skipped.  On re-reading the man page for hosts.cfg, it dawned on me that a
failed ping simply suppresses failed test /results/, but doesn't stop the
tests from being run.

So the real problem is that the "--dns-timeout=3" is not being taken into
consideration by xymonnet.  If I run xymonnet with "--debug" it tells me:

1900 2015-06-03 12:02:20 ares_search: tlookup='example.com', class=1,
type=2
1900 2015-06-03 12:02:20 ares_search: tlookup='example.com', class=1,
type=6
1900 2015-06-03 12:02:20 Processing 0 DNS lookups with ARES
1900 2015-06-03 12:02:46 Finished ARES queue after loop 423

This is peculiar.  Why would it say "processing 0 DNS lookups" when there
are two lookups to test?  Could this be because xymonnet hasn't actually
been built with ARES support and I didn't know it?  Is there a good way to
tell?  If I add "--no-ares" I get the same results perhaps suggesting a
lack of ARES support.  On the other hand, if I add "timeout:3" and
"attempts:1" into resolv.conf, I also get the same results.  If I run "nm
/path/to/xymonnet | grep gethostby" it returns "ares_gethostbyname".

Just for fun, I compiled Xymon v4.3.21 and ran the xymonnet binary from
there, with no change in behaviour.  I also tried removing the
"--dns-timeout" option so that it defaults to 30 seconds, but still no
change - 26 seconds for two DNS tests.

So, I'm not really sure what the problem is, but xymonnet certainly isn't
behaving as I would expect.

Cheers
Jeremy

list Henrik Størner · Mon, 11 Sep 2017 07:11:56 +0200 ·
Hi,

I remember looking into this a long time ago, and the --dnstimeout 
setting does not quite work as expected - because C-ARES does not quite 
work as expected.

C-ARES has some timeout settings for queries, but it performs an 
exponential back-off between queries, so it is impossible to really hit 
the exact timeout you specify in --dnstimeout.

In fact, current 4.3.x versions have a hard-coded setting for the C-ARES 
timeouts - it starts with a 2 second timeout and performs 4 attempts, 
which ends up with approximately 23 second timeout for all DNS queries. 
This is in xymonnet/dns.c (look for "ARES timeout"). If you need those 
really short timeouts, then that is probably what you should change.


Regards,
Henrik
quoted from Jeremy Laidman


On 11-09-2017 05:52, Jeremy Laidman wrote:
Hi

I'm reviving an old thread, because this is biting me again, so I 
wanted to know if anyone had any fresh ideas on this problem.

Many of the servers I monitor are DNS servers, so the C-ARES library 
has a lot of queries to perform every 5 minutes. In some cases, I want 
to ensure that a DNS service is down (and alert when not) so most of 
the time I can expect a timeout, leading to a long poll cycle. I'd 
really like to be able to drop the timeout to significantly less than 
the 23 seconds it's taking now per server.

Cheers
Jeremy


On 3 June 2015 at 13:49, Jeremy Laidman <user-71895fb2e44c@xymon.invalid 
<mailto:user-71895fb2e44c@xymon.invalid>> wrote:

    OK, I'm a bit puzzled by this, and definitely pushing the envelope
    of my debugging and C coding skills.  The relevant code from
    xymonnet/dns.c is:

        168                 tv.tv_sec = dnstimeout; tv.tv_usec = 0;
        169                 tvp = ares_timeout(channel, &tv, &tv);

    I ran this through gdb, with "--dns-timeout=3" specified, setting
    a breakpoint at line 168.  I confirmed that dnstimeout is set to
    3.  When I step one line, I should see tv.tv_sec set to 3 also,
    but it's set to 0.

    If I don't specify --dns-timeout at all, printing dnstimeout shows
    "30".  Again, after stepping to the next line, tv.tv_sec is still
    zero.

    Breakpoint 1, dns_ares_queue_run (channel=0x58b1c0) at dns.c:168
    168                     tv.tv_sec = dnstimeout; tv.tv_usec = 0;
    (gdb) p dnstimeout
    $14 = 30
    (gdb) n
    169                     tvp = ares_timeout(channel, &tv, &tv);
    (gdb) p tv
    $15 = {tv_sec = 0, tv_usec = 0}
    (gdb)

    So what gives here?

    J


    On 3 June 2015 at 13:08, Jeremy Laidman <user-71895fb2e44c@xymon.invalid
    <mailto:user-71895fb2e44c@xymon.invalid>> wrote:

        Hi

        I'm running Xymon v4.3.10 on Linux, and I'm quite sure it's
        compiled with c-ares support.

        I have 12 new DNS servers that were added to Xymon about one
        month ago.  All of my server entries in hosts.cfg have
        "testip".  The tasks.cfg runs xymonet with "--dns-timeout=3".
        The hosts entries look like so:

        10.10.10.1 dnshost1.example.com <http://dnshost1.example.com>;
        # testip dns=NS:example.com
        <http://example.com>,SOA:example.com <http://example.com>;
quoted from Jeremy Laidman

        About a week ago, connectivity to all of these servers failed,
        and at the same time, the xymonnet run time jumped from less
        than 15 seconds to about 330 seconds, so about 315 seconds
        extra.  The xymonnet page says 295 seconds is taken up by DNS
        tests.

        If the increase in time taken is about 315 and is entirely due
        to the 12 servers failing, then each failed server is adding
        about 26 seconds to the total run time.

        I don't think this should be happening like this.  With two
        DNS checks per server, the DNS checks should be taking 6
        seconds each to time-out, not 26.  If I run xymonnet with
        "--timing --no-update" and specify only one hostname, I can
        view the results and the timing. This shows that the ping
        check gets reported after about 3 seconds, and then the DNS
        tests are executed and take 26 seconds total.

        My naiive assumption was that when a server failed a ping (and
        didn't have "noclear" defined in hosts.cfg) then the network
        checks would be skipped. On re-reading the man page for
        hosts.cfg, it dawned on me that a failed ping simply
        suppresses failed test /results/, but doesn't stop the tests
        from being run.

        So the real problem is that the "--dns-timeout=3" is not being
        taken into consideration by xymonnet.  If I run xymonnet with
        "--debug" it tells me:

        1900 2015-06-03 12:02:20 ares_search: tlookup='example.com

        <http://example.com>';, class=1, type=2
        1900 2015-06-03 12:02:20 ares_search: tlookup='example.com
        <http://example.com>';, class=1, type=6
quoted from Jeremy Laidman
        1900 2015-06-03 12:02:20 Processing 0 DNS lookups with ARES
        1900 2015-06-03 12:02:46 Finished ARES queue after loop 423

        This is peculiar.  Why would it say "processing 0 DNS lookups"
        when there are two lookups to test?  Could this be because
        xymonnet hasn't actually been built with ARES support and I
        didn't know it?  Is there a good way to tell?  If I add
        "--no-ares" I get the same results perhaps suggesting a lack
        of ARES support.  On the other hand, if I add "timeout:3" and
        "attempts:1" into resolv.conf, I also get the same results. If
        I run "nm /path/to/xymonnet | grep gethostby" it returns
        "ares_gethostbyname".

        Just for fun, I compiled Xymon v4.3.21 and ran the xymonnet
        binary from there, with no change in behaviour.  I also tried
        removing the "--dns-timeout" option so that it defaults to 30
        seconds, but still no change - 26 seconds for two DNS tests.

        So, I'm not really sure what the problem is, but xymonnet
        certainly isn't behaving as I would expect.

        Cheers
        Jeremy