Xymon Mailing List Archive

Xymon flapping: network slowness reality or delusion?

4 messages in this thread

list David Mills · Fri, 15 Mar 2013 17:57:07 +0000 ·
Hi, All ...

The other day, our Xymon (4.3.3) started sending out notifications due to flapping on various hosts and various network-based tests, which lasted for a rather sharply defined period. It caused a fair bit of angst and I was on the hot-seat to prove Xymon was functioning properly.

Here are some of the summary facts:

-       The flapping is pretty well documented in Xymon as occurring due to connection times exceeding our 10-second threshold - most of the , as configured in tasks.cfg

           CMD xymonnet --report --ping --checkresponse \
                        --timeout=10 --dns-timeout=2 \
                        --dnslog=/var/log/xymon-4.3.3/dns.log \
                        --concurrency=5
   INTERVAL 3m

-       Output from the "xymonnet" report (currently - not captured during the "storm") shows:

            xymonnet version 4.3.3
            SSL library : OpenSSL 0.9.8l 5 Nov 2009
            LDAP library: OpenLDAP 20416

            Statistics:
             Hosts total           :     2081
             Hosts with no tests   :        2
             Total test count      :     2864
             Status messages       :     2856
             Alert status msgs     :        0
             Transmissions         :       30

            DNS statistics:
             # hostnames resolved  :     3337
             # succesful           :      921
             # failed              :     1266
             # calls to dnsresolve :     2850

            TCP test statistics:
             # TCP tests total     :     1769
             # HTTP tests          :     1244
             # Simple TCP tests    :      525
             # Connection attempts :     1767
             # bytes written       :   235845
             # bytes read          :  2514747


            TIME SPENT
            Event                                           Start time          Duration
            xymonnet startup                            1040654.310651                 -
            Service definitions loaded                  1040654.319152          0.008501
            Tests loaded                                1040655.696733          1.377581
            DNS lookups completed                       1040656.213268          0.516534
            Test engine setup completed                 1040657.416739          1.203470
            TCP tests completed                         1040675.444183         18.027443
            PING test completed (923 hosts)             1040699.991467         24.547283
            PING test results sent                      1040700.080247          0.088780
            Test result collection completed            1040700.144033          0.063785
            LDAP test engine setup completed            1040700.152852          0.008819
            LDAP tests executed                         1040700.360821          0.207968
            LDAP tests result collection completed      1040700.360829          0.000007
            DNS tests executed                          1040700.441820          0.080991
            NTP tests executed                          1040722.413523         21.971702
            Test results transmitted                    1040723.295458          0.881935
            xymonnet completed                          1040723.313935          0.018476
            TIME TOTAL                                                         69.003284

-       Rather sharply defined start-up / cut-off for the "storm": I can point to the 5-minute segment when it started / stopped
-       The Xymon server OS/NIC hardware check out diagnostically
-       According to our network team's records, the network connection bandwidth utilization coming in / out of the Xymon server was < 1% capacity (i.e. we have lots of bandwidth)
-       According to our network team, there was no significant packet loss or congestion at the switch level (there's only one hop between the Xymon server and the rest of the hosts)
-       The types of services affected seemed pretty random: mostly HTTP tests, but LOTS of SSH/ping/NTP/LDAP, etc. as well.
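
As a rough aid for reading the "TIME SPENT" table above, here is a minimal Python sketch that parses those lines, ranks the phases by duration, and compares the run time with the 3-minute INTERVAL from tasks.cfg. The regular expression and the names `parse_phases`, `REPORT`, and `INTERVAL_SECONDS` are assumptions made for illustration; they are not part of Xymon, and the layout is simply what this one report shows.

```python
import re

# "TIME SPENT" lines copied from the report above (leading indentation removed).
REPORT = """\
xymonnet startup                            1040654.310651                 -
Service definitions loaded                  1040654.319152          0.008501
Tests loaded                                1040655.696733          1.377581
DNS lookups completed                       1040656.213268          0.516534
Test engine setup completed                 1040657.416739          1.203470
TCP tests completed                         1040675.444183         18.027443
PING test completed (923 hosts)             1040699.991467         24.547283
PING test results sent                      1040700.080247          0.088780
Test result collection completed            1040700.144033          0.063785
LDAP test engine setup completed            1040700.152852          0.008819
LDAP tests executed                         1040700.360821          0.207968
LDAP tests result collection completed      1040700.360829          0.000007
DNS tests executed                          1040700.441820          0.080991
NTP tests executed                          1040722.413523         21.971702
Test results transmitted                    1040723.295458          0.881935
xymonnet completed                          1040723.313935          0.018476
TIME TOTAL                                                         69.003284
"""

# Assumed layout: event name, two or more spaces, start time, duration (or "-").
LINE = re.compile(r"^(?P<event>.+?)\s{2,}(?P<start>\d+\.\d+)\s+(?P<dur>\d+\.\d+|-)$")

INTERVAL_SECONDS = 3 * 60  # "INTERVAL 3m" in the tasks.cfg entry


def parse_phases(text):
    """Return (event, duration) pairs for every line that carries a duration."""
    phases = []
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        # The startup line has "-" as duration and "TIME TOTAL" has no start
        # time, so both are skipped here.
        if match and match.group("dur") != "-":
            phases.append((match.group("event"), float(match.group("dur"))))
    return phases


phases = parse_phases(REPORT)
total = sum(duration for _, duration in phases)
top3 = sorted(phases, key=lambda item: item[1], reverse=True)[:3]
top3_share = sum(duration for _, duration in top3) / total

for event, duration in top3:
    print(f"{event:<40} {duration:10.3f} s")
print(f"top 3 phases: {top3_share:.1%} of {total:.3f} s "
      f"(interval is {INTERVAL_SECONDS} s)")
```

On this particular (non-storm) report, the three slowest phases are the PING, NTP, and TCP tests, and the whole run fits well inside the 3-minute interval. Running the same parse over reports captured during a storm would show which phase grows.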

Any initial thoughts?

Thanks!

david

~~~~~~~~~~~~~~~~~~~
David Mills
Systems Administrator
Northrop Grumman
XXX-XXX-XXXX
user-eb64c112f0e9@xymon.invalid
list Japheth Cleaver · Fri, 15 Mar 2013 18:30:57 -0000 (UTC) ·
Assuming you're saving status results in history (the default), can you
look at the status messages from the down periods? Were they DNS timeouts
or timeout timeouts? I'd start with the ping checks, since that's pretty
cut-and-dried...

- Has anything like this occurred before?
- Even if no threshold was crossed on the Xymon server itself, take a look
at the 'trends' page for the polling host for that period and see whether
anything unusual happened around the same time.


HTH,
-jc
list David Mills · Fri, 15 Mar 2013 22:17:36 +0000 ·
Thanks! After poking around on the Xymonnet history dumps, I found some very interesting stuff I don't know what to make of:

- For the top 20 worst times in a 24 hour period, the three categories of networking that had significantly elevated levels were "TCP tests completed", "DNS tests executed" and "NTP tests executed".
- Oddly, after graphing the respective times for these categories in a spreadsheet, it became obvious that the DNS and TCP tests were roughly inversions of each other: when one was super-high, the other would go low.
- Even weirder, the PING tests were ... NORMAL!! While the rest of the Xymon network tests were jumping off a cliff, good old 'ping' was chugging along without (mostly) mishap. This last datum seems to blow a hole in the theory that this is truly a network problem (vs. a Xymon server/host problem).
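
One way to put a number on the "inversions of each other" impression from the spreadsheet graph is a Pearson correlation coefficient between the two duration series: a value near -1 means one series goes up when the other goes down. The sketch below is only an illustration; the function name `pearson` is hypothetical, and the two sample lists are made-up placeholder values, not measurements from this incident.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# PLACEHOLDER values (seconds per run), invented purely to exercise the
# function. Replace them with the per-run "TCP tests completed" and
# "DNS tests executed" durations taken from the real xymonnet reports.
tcp_durations = [18.0, 60.0, 20.0, 75.0, 19.0]
dns_durations = [40.0, 2.0, 35.0, 1.0, 38.0]

r = pearson(tcp_durations, dns_durations)
print(f"correlation between TCP and DNS durations: {r:.3f}")
```

A strongly negative result on the real data would support the observation above; a value near zero would suggest the inversion is less regular than it looks on the graph.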

Any other thoughts?

david
list David Mills · Thu, 21 Mar 2013 15:00:44 +0000 ·
Resolution: unfortunately, in this case it turned out that after restarting the Xymon server ("xymon.sh restart"), everything became docile again. The response time graph for "xymonnet" on the server looks like a really bad hair day followed by a nearly straight blue line after the restart.

We're running 4.3.3 (and, yes, we're trying to migrate to 4.3.10). Has anyone heard of related bugs in this version of the code, or other theories?

Thanks!

david