Hi, All ...
The other day, our Xymon (4.3.3) started sending out notifications because
various network-based tests on various hosts were flapping. The flapping
lasted for a rather sharply defined period. It caused a fair bit of angst,
and I was on the hot seat to prove Xymon was functioning properly.
Here are some of the summary facts:
- The flapping is pretty well documented in Xymon as occurring due
to connection times exceeding our 10-second threshold, for the most part,
as configured in tasks.cfg:
CMD xymonnet --report --ping --checkresponse \
--timeout=10 --dns-timeout=2 \
--dnslog=/var/log/xymon-4.3.3/dns.log \
--concurrency=5
INTERVAL 3m
- Output from the "xymonnet" report (captured now, not during
the "storm") shows:
xymonnet version 4.3.3
SSL library : OpenSSL 0.9.8l 5 Nov 2009
LDAP library: OpenLDAP 20416
Statistics:
Hosts total : 2081
Hosts with no tests : 2
Total test count : 2864
Status messages : 2856
Alert status msgs : 0
Transmissions : 30
DNS statistics:
# hostnames resolved : 3337
# succesful : 921
# failed : 1266
# calls to dnsresolve : 2850
TCP test statistics:
# TCP tests total : 1769
# HTTP tests : 1244
# Simple TCP tests : 525
# Connection attempts : 1767
# bytes written : 235845
# bytes read : 2514747
TIME SPENT
Event                                   Start time       Duration
xymonnet startup                        1040654.310651
Service definitions loaded              1040654.319152   0.008501
Tests loaded                            1040655.696733   1.377581
DNS lookups completed                   1040656.213268   0.516534
Test engine setup completed             1040657.416739   1.203470
TCP tests completed                     1040675.444183  18.027443
PING test completed (923 hosts)         1040699.991467  24.547283
PING test results sent                  1040700.080247   0.088780
Test result collection completed        1040700.144033   0.063785
LDAP test engine setup completed        1040700.152852   0.008819
LDAP tests executed                     1040700.360821   0.207968
LDAP tests result collection completed  1040700.360829   0.000007
DNS tests executed                      1040700.441820   0.080991
NTP tests executed                      1040722.413523  21.971702
Test results transmitted                1040723.295458   0.881935
xymonnet completed                      1040723.313935   0.018476
TIME TOTAL                                              69.003284
- Rather sharply defined start-up / cut-off for the "storm": I can
point to the 5-minute segment when it started / stopped
- The Xymon server OS/NIC hardware check out diagnostically
- According to our network team's records, the network connection
bandwidth utilization coming in / out of the Xymon server was < 1%
capacity (i.e. we have lots of bandwidth)
- According to our network team, there was no significant packet
loss or congestion at the switch level (there's only one hop between
the Xymon server and the rest of the hosts)
- The types of services affected seemed pretty random: mostly HTTP
tests, but lots of SSH/ping/NTP/LDAP, etc. as well.
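For what it's worth, the TIME SPENT figures above can be ranked to see
which phases dominate a run. This is only a rough Python sketch: the
durations are copied from the report, while the names rank_phases and
INTERVAL_SECONDS are illustrative and not anything Xymon provides.

```python
# Durations in seconds, copied from the TIME SPENT report above.
durations = {
    "Service definitions loaded": 0.008501,
    "Tests loaded": 1.377581,
    "DNS lookups completed": 0.516534,
    "Test engine setup completed": 1.203470,
    "TCP tests completed": 18.027443,
    "PING test completed (923 hosts)": 24.547283,
    "PING test results sent": 0.088780,
    "Test result collection completed": 0.063785,
    "LDAP test engine setup completed": 0.008819,
    "LDAP tests executed": 0.207968,
    "LDAP tests result collection completed": 0.000007,
    "DNS tests executed": 0.080991,
    "NTP tests executed": 21.971702,
    "Test results transmitted": 0.881935,
    "xymonnet completed": 0.018476,
}

# INTERVAL 3m from tasks.cfg, expressed in seconds (illustrative constant).
INTERVAL_SECONDS = 3 * 60


def rank_phases(durations, top=3):
    """Return the summed run time and the `top` slowest phases.

    Each ranked entry is (phase name, seconds, percent of total).
    """
    total = sum(durations.values())
    ranked = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)
    return total, [(name, secs, 100 * secs / total) for name, secs in ranked[:top]]


total, slowest = rank_phases(durations)
print(f"run time {total:.2f}s of a {INTERVAL_SECONDS}s interval")
for name, secs, pct in slowest:
    print(f"{secs:8.3f}s {pct:5.1f}%  {name}")
```

On these numbers the ping, NTP and TCP phases account for over 90% of the
roughly 69-second run, which fits well inside the 3-minute interval. Keep
in mind this run was captured after the "storm", so the same comparison on
a run from the affected period would be more telling.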
Any initial thoughts?
Thanks!
david
~~~~~~~~~~~~~~~~~~~
David Mills
Systems Administrator
Northrop Grumman
XXX-XXX-XXXX
user-eb64c112f0e9@xymon.invalid
Assuming you're saving status results in history (the default), can you
look at the status messages from the down periods? Were they DNS timeouts
or actual connection timeouts? I'd start with the ping checks, since
that's pretty cut-and-dried...
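If there are a lot of saved messages to wade through, a quick tally can
show which kind of timeout dominates. A minimal Python sketch, assuming you
have already pulled the message text out of history into a list of strings;
the keyword lists are placeholders I made up, not Xymon's actual wording, so
adjust them to whatever your status messages really say.

```python
from collections import Counter

# Hypothetical keyword lists, NOT taken from Xymon itself. Replace them
# with the wording that actually appears in your saved status messages.
DNS_HINTS = ("dns", "resolve", "lookup")
TIMEOUT_HINTS = ("timeout", "timed out")


def classify(message):
    """Bucket one status message as 'dns', 'connect-timeout' or 'other'."""
    text = message.lower()
    if any(hint in text for hint in DNS_HINTS):
        return "dns"
    if any(hint in text for hint in TIMEOUT_HINTS):
        return "connect-timeout"
    return "other"


def tally(messages):
    """Count how many messages fall into each bucket."""
    return Counter(classify(m) for m in messages)
```

Calling tally() on the messages from the down period gives a Counter keyed
by bucket. DNS hints are checked first so that a DNS lookup that timed out
is counted as a DNS problem rather than a connection timeout.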
- Has anything like this occurred before?
- Even if no threshold was crossed on the Xymon server itself, take a look
at the 'trends' page for the polling host for that period and see if
anything unusual happened around the same time.
HTH,
-jc