Xymon Mailing List Archive

Brief red alarms

22 messages in this thread

list Jaime Kikpole · Thu, 18 Nov 2010 11:59:18 -0500 ·
We recently had a major change in our network's design.  The entire
topology changed, including the IP addresses of servers and the
"routes" through switches between things.  Ever since then, Xymon has
been reporting very brief (10-20 seconds) outages of one server or
another every 30-60 minutes.

I tried this suggestion:
http://xymon.sourceforge.net/docs/known-issues.html#netfail

No luck.  A few minutes later, the server Xymon is running on
allegedly failed the "conn" test for about 1 second.

Any other ideas?  If you want to see the symptoms, look at my Xymon
instance at http://cns.cairodurham.org/hobbit/bb2.html.  This will
show the brief outages that I'm talking about.

Thanks in advance,
Jaime Kikpole

-- 
Network Administrator
Cairo-Durham Central School District
http://cns.cairodurham.org
list Josh Luthman · Thu, 18 Nov 2010 12:29:47 -0500 ·
I had that same issue (with conn tests).  The server was on a mini itx
board with a VIA CPU.  I moved it to our ESXi box and haven't seen it
since.

Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX


list Jason Chambers · Thu, 18 Nov 2010 17:34:21 +0000 ·
I would do ping tests from your Xymon servers to your other servers first to make sure there aren't packets being dropped. You might have some hardware issues you don't know about after the IP change.

Jason Chambers
IT Help Desk Associate

GEOSOFT INC.
freedom to explore
T +X XXX.XXX.XXXX #344
F +X XXX.XXX.XXXX

list Tim McCloskey · Thu, 18 Nov 2010 09:39:15 -0800 ·
Hobbit may not be the issue at all here.  You could test this by writing a small script to do a constant ping of host $foo and record the results to a file.  You might see similar patterns outside of hobbit.  
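
A minimal sketch of the kind of logging ping script suggested here, written in Python purely for illustration. The function names, the log line format and the one-second default interval are assumptions, not anything from the thread; it relies only on the system ping accepting "-c 1". Left running for longer than the 30-60 minute gap between alarms, it should show whether probes also fail outside of Xymon.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host):
    """Send one ICMP echo with the system ping; True if it was answered."""
    # "-c 1" (send a single packet) is accepted by the Linux, BSD and macOS ping.
    result = subprocess.run(
        ["ping", "-c", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def log_pings(host, logfile, count, interval=1.0, probe=ping_once):
    """Probe `host` `count` times, appending one timestamped line per probe.

    Returns the number of failed probes.
    """
    failures = 0
    with open(logfile, "a") as log:
        for i in range(count):
            ok = probe(host)
            if not ok:
                failures += 1
            stamp = datetime.now().isoformat(timespec="seconds")
            log.write(f"{stamp} {host} {'ok' if ok else 'FAIL'}\n")
            log.flush()  # keep the file current if the script is interrupted
            if i < count - 1:
                time.sleep(interval)
    return failures
```

For example, log_pings("10.1.0.73", "ping-10.1.0.73.log", count=7200) would record at least two hours of probes; any FAIL lines can then be compared with the times of the red alarms.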
 
Looking at the current summary page, it appears that the hosts listed below are the more problematic ones.  Yet, looking at a report (short version attached) for Nov 1 to Nov 17, there are clearly quite a few more.
  
 2 10.1.0.32
 2 10.1.0.40
 2 10.1.0.73
 2 10.1.0.92
 2 10.3.23.242
 4 163.153.65.139
 4 atlas.cairodurham.org

From the report (which covers the conn test only), we see that the following hosts have 100% uptime.  Until a short while ago, 10.1.0.73 was also at 100% availability.

10.1.0.71	100
10.1.0.72	100
10.1.0.90	100

What makes these three hosts different?

Regards, 

Tim
Attachments (1)
list Jaime Kikpole · Thu, 18 Nov 2010 12:53:01 -0500 ·
On Thu, Nov 18, 2010 at 12:39 PM, Tim McCloskey <user-440820cc07d6@xymon.invalid> wrote:
> What makes these three hosts different?

Well, for one, they didn't exist until recently.  :)

We just had a major topology change.  Every server and switch is on a
new IP address and nearly every switch was replaced with new hardware.
New subnets exist, too.  So this really has to be seen from a
perspective of setting up Xymon for the first time in the last 2 days
and ignoring older data altogether.

I'm running an extended ping test to 10.1.0.73 now to see if we have
intermittent issues with network traffic.

Thanks,
Jaime

-- 
Network Administrator
Cairo-Durham Central School District
http://cns.cairodurham.org
list Jaime Kikpole · Thu, 18 Nov 2010 13:00:59 -0500 ·
On Thu, Nov 18, 2010 at 12:53 PM, Jaime Kikpole <user-c575ba5bb612@xymon.invalid> wrote:
> I'm running an extended ping test to 10.1.0.73 now to see if we have
> intermittent issues with network traffic.

For what it's worth:

^C
--- 10.1.0.73 ping statistics ---
528 packets transmitted, 528 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.621/2.458/52.160/5.032 ms


Any thoughts?
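
A small sketch for pulling the numbers out of a summary like the one above, so that several runs can be compared. The regular expressions are written against this BSD-style output and are an assumption for any other ping. Note that, at ping's usual default of one packet per second, 528 packets cover a little under nine minutes, which is shorter than the 30-60 minute spacing of the reported outages, so a clean result over that window does not rule much out.

```python
import re

SUMMARY_RE = re.compile(
    r"(?P<sent>\d+) packets transmitted, (?P<received>\d+) (?:packets )?received, "
    r"(?P<loss>[\d.]+)% packet loss"
)
RTT_RE = re.compile(
    r"= (?P<min>[\d.]+)/(?P<avg>[\d.]+)/(?P<max>[\d.]+)/(?P<dev>[\d.]+) ms"
)


def parse_ping_summary(text):
    """Extract packet counts and round-trip times (ms) from ping's final summary."""
    counts = SUMMARY_RE.search(text)
    rtt = RTT_RE.search(text)
    if counts is None or rtt is None:
        raise ValueError("no ping summary found")
    return {
        "sent": int(counts["sent"]),
        "received": int(counts["received"]),
        "loss_pct": float(counts["loss"]),
        "min_ms": float(rtt["min"]),
        "avg_ms": float(rtt["avg"]),
        "max_ms": float(rtt["max"]),
        "stddev_ms": float(rtt["dev"]),
    }
```

Applied to the output above, this gives no loss but a maximum round trip of about 52 ms against an average of about 2.5 ms.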

Jaime


-- 
Network Administrator
Cairo-Durham Central School District
http://cns.cairodurham.org
list Tom Kauffman · Thu, 18 Nov 2010 14:17:36 -0400 ·
On Thursday 18 November 2010 01:00:59 pm Jaime Kikpole wrote:
> --- 10.1.0.73 ping statistics ---
> 528 packets transmitted, 528 packets received, 0% packet loss
> round-trip min/avg/max/stddev = 0.621/2.458/52.160/5.032 ms
>
> Any thoughts?

Check your DNS server(s). Chances are that one or more systems has an entry
for the old IP addresses.  A continuous ping won't show this, as address
resolution occurs just once.

Tom
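
One way to look for stale records along the lines described above is to compare what each name resolves to right now with the address it is supposed to have. The sketch below assumes a hand-maintained hostname-to-IP mapping (hypothetical) and only sees what the local resolver returns, so it would need to be run on each machine whose view of DNS matters, starting with the Xymon server.

```python
import socket


def resolve_all(hostname, resolver=socket.getaddrinfo):
    """Return the sorted list of distinct addresses a hostname resolves to."""
    infos = resolver(hostname, None)
    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


def find_dns_mismatches(expected, resolver=socket.getaddrinfo):
    """Compare resolved addresses with the ones you expect.

    `expected` maps hostname -> expected IP address (a hypothetical inventory).
    Returns {hostname: resolved_addresses_or_error} for every host that differs.
    """
    mismatches = {}
    for hostname, address in expected.items():
        try:
            resolved = resolve_all(hostname, resolver)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            mismatches[hostname] = f"lookup failed: {exc}"
            continue
        if resolved != [address]:
            mismatches[hostname] = resolved
    return mismatches
```

Running it in a loop every few minutes, rather than once, is what would catch an entry that only intermittently comes back with an old address.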
list Tim McCloskey · Thu, 18 Nov 2010 10:21:01 -0800 ·
You've probably already done the obvious....  First, I'd make sure that all the hosts/switch ports have proper media settings (speed/duplex/autoneg or not).  While it may sound odd (for your new IP network), I would also make sure that all of the ARP caches in the environment get flushed.

What interval do you have in bbtest-net? 
server/etc/hobbitlaunch.cfg
[bbnet]
        ENVFILE server/etc/hobbitserver.cfg
        NEEDS hobbitd
        CMD bbtest-net --report --ping --checkresponse
        LOGFILE $BBSERVERLOGS/bb-network.log
        INTERVAL 5m


Tim


list Jaime Kikpole · Thu, 18 Nov 2010 13:26:47 -0500 ·
On Thu, Nov 18, 2010 at 1:17 PM, Tom Kauffman <user-eb86974acd2c@xymon.invalid> wrote:
> Check your DNS server(s). Chances are that one or more systems has an entry
> for the old IP addresses.  A continuous ping won't show this, as address
> resolution occurs just once.

Just checked.  Nothing out of place in DNS (forward or reverse).

Jaime

-- 
Network Administrator
Cairo-Durham Central School District
http://cns.cairodurham.org
list Jaime Kikpole · Thu, 18 Nov 2010 13:29:11 -0500 ·
On Thu, Nov 18, 2010 at 1:21 PM, Tim McCloskey <user-440820cc07d6@xymon.invalid> wrote:
> What interval do you have in bbtest-net?

I had the defaults.  I changed it to --concurrency=100, but that didn't
help.  Neither did --concurrency=50.

Jaime

-- 
Network Administrator
Cairo-Durham Central School District
http://cns.cairodurham.org
list Tim McCloskey · Thu, 18 Nov 2010 10:45:05 -0800 ·
> [bbnet]
>         ENVFILE server/etc/hobbitserver.cfg
>         NEEDS hobbitd
>         CMD bbtest-net --report --ping --checkresponse
>         LOGFILE $BBSERVERLOGS/bb-network.log
>         INTERVAL 5m

I'm using 4.2.0, so maybe the setting above is different in 4.2.3...? Too short a test interval can result in something like what you are seeing.  The server might not be able to ping all of the hosts in under nn time.  If the network is fine, then the next question is: is this a Xymon install that worked before the new IPs etc....?  Or is this a brand new install?  If it is new, and the network is fine, try reducing the INTERVAL to something like 5 minutes and let it cook for a day or so.

Regards,

Tim

list Jaime Kikpole · Tue, 23 Nov 2010 11:25:03 -0500 ·
On Thu, Nov 18, 2010 at 1:45 PM, Tim McCloskey <user-440820cc07d6@xymon.invalid> wrote:
> If the network is fine then the next question is this a xymon install that worked before the new IP's etc....?
> Or is this a brand new install?  If it is new, and the network is fine, try reducing the INTERVAL to
> something like 5 minutes and let it cook for a day or so.

It's an existing install.  We changed the IPs and all the switches.  In
theory, the bandwidth for all links is the same or higher than before:
100Mbps to 10Gbps, depending on the link.  This behavior started
after the changes, though.

For what it's worth, we have been using "INTERVAL 5m" in the bbnet
section of hobbitlaunch.cfg since I first installed Xymon.

Any thoughts?

Thanks,
Jaime

-- 
Network Administrator
Cairo-Durham Central School District
http://cns.cairodurham.org
list Bruce White · Wed, 24 Nov 2010 10:01:43 -0600 ·
I have seen this kind of behavior when a single bad route keeps getting reintroduced to the network routers (generally from a single router).  When the bad route hits the router that the ping attempts its return route on, the ping fails.

Just a thought on something you might check into.

     ....Bruce


 
 Bruce White
 Senior Enterprise Systems Engineer | Phone: XXX-XXX-XXXX | Fax: XXX-XXX-XXXX | user-58f975e8bf9d@xymon.invalid | http://www.fellowes.com/
 
 
 
Disclaimer: The information contained in this message may be privileged and confidential and protected from disclosure. If the reader of this message is not the intended recipient or an employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please notify us immediately by replying to the message and deleting it from your computer. Thank you. Fellowes, Inc.
list Bruce White · Wed, 24 Nov 2010 10:06:43 -0600 ·
Also, are you running fping?  If you are running the default "hobbitping", all bets are off on the pings actually being accurate.

    .....Bruce


 
 Bruce White
 Senior Enterprise Systems Engineer | Phone: XXX-XXX-XXXX | Fax: XXX-XXX-XXXX | user-58f975e8bf9d@xymon.invalid | http://www.fellowes.com/
 
 
 
list Jaime Kikpole · Thu, 25 Nov 2010 22:12:31 -0500 ·
On Wed, Nov 24, 2010 at 11:06 AM, White, Bruce <user-58f975e8bf9d@xymon.invalid> wrote:
> Also, are you running fping?  If you are running the default "hobbitping",
> all bets are off on the pings actually being accurate.

Taking a quick look:

atlas:etc>grep fping *.cfg
hobbitserver.cfg:# Make sure the path includes the directories where
you have fping, mail and (optionally) ntpdate installed,

atlas:etc>grep hobbitping *.cfg
hobbitserver.cfg:FPING="hobbitping"					# Path and options for the ping program.

If I'm reading this right, I'm using hobbitping and not fping.  The
odd thing is that this symptom did not exist before the equipment and
topology changes.  So I'll be checking on my "route:..." statements
and any routing protocols (like RIP) in the new equipment.  Either of
these ideas might help.
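
The same check as the grep above, sketched as a small function that pulls the effective FPING value out of the config text. It assumes the file uses shell-style NAME="value" lines with optional trailing # comments, as the grep output suggests; read_setting is a hypothetical helper name, not part of Xymon.

```python
import shlex


def read_setting(cfg_text, name):
    """Return the value of a NAME="value" line from hobbitserver.cfg-style text.

    Returns None when the setting is absent. If it is defined more than
    once, the last definition wins.
    """
    value = None
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if not stripped.startswith(name + "="):
            continue  # also skips comment lines that merely mention the name
        # shlex removes the quotes and drops a trailing "# comment".
        tokens = shlex.split(stripped[len(name) + 1:], comments=True)
        value = " ".join(tokens)
    return value
```

With the contents of hobbitserver.cfg as shown in the grep output, read_setting(text, "FPING") returns "hobbitping".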

Thanks!

Jaime

-- 
Network Administrator
Cairo-Durham Central School District
http://cns.cairodurham.org
list Jaime Kikpole · Thu, 25 Nov 2010 22:17:32 -0500 ·
A coincidence just led me to a log file with lines like these:
+pid 44909 (hobbitping), uid 280: exited on signal 11
+pid 45111 (hobbitping), uid 280: exited on signal 11
+pid 46426 (hobbitping), uid 280: exited on signal 11

I wonder if there is something else at play here.
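
Signal 11 is SIGSEGV on common Unix systems, so each of these lines records a hobbitping crash. Below is a short sketch for tallying such lines so that the counts can be set against the brief red alarms. The regular expression is inferred from the three sample lines only and may need adjusting for other log formats.

```python
import re
from collections import Counter

EXIT_RE = re.compile(
    r"pid (?P<pid>\d+) \((?P<name>[^)]+)\), uid (?P<uid>\d+): "
    r"exited on signal (?P<signal>\d+)"
)


def count_signal_exits(lines):
    """Tally (process name, signal number) pairs from 'exited on signal' lines."""
    tally = Counter()
    for line in lines:
        match = EXIT_RE.search(line)
        if match:
            tally[(match["name"], int(match["signal"]))] += 1
    return tally
```

Passing an open log file to count_signal_exits works as well as a list, since it only iterates over lines.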

The Unix box that I installed Xymon onto had its IP change at the same
time as the network changes.  Is it possible that there was a side
effect from the IP change?

Thanks,
Jaime

-- 
Network Administrator
Cairo-Durham Central School District
http://cns.cairodurham.org
list Jaime Kikpole · Thu, 25 Nov 2010 22:54:56 -0500 ·
Would changing out hobbitping for fping be as simple as changing the line:
FPING="hobbitping"
...to the line...
FPING="fping"
...in the file hobbitserver.cfg?  I already have fping at
/usr/local/sbin and that path is already in the PATH variable in
hobbitserver.cfg.
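
Before switching, it may be worth confirming that the program named in FPING can actually be found through the PATH that hobbitserver.cfg sets. Below is a sketch using Python's shutil.which. locate_ping_program is a hypothetical helper, and the PATH string in the usage note is only an example built around the directory mentioned above.

```python
import shutil


def locate_ping_program(fping_value, path_value):
    """Return the full path of the program named in an FPING value, or None.

    `fping_value` may carry options after the program name; only the first
    word is looked up. `path_value` is a PATH-style, colon-separated string.
    """
    words = fping_value.split()
    if not words:
        raise ValueError("empty FPING value")
    return shutil.which(words[0], path=path_value)
```

For instance, locate_ping_program("fping", "/usr/local/sbin:/usr/bin:/bin") should return the path under /usr/local/sbin if an executable fping is installed there, and None if it is not found in any of the listed directories.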

Thanks,
Jaime

-- 
Network Administrator
Cairo-Durham Central School District
http://cns.cairodurham.org
list Henrik Størner · Mon, 29 Nov 2010 16:55:56 +0000 (UTC) ·
In <AANLkTik=user-0b838d402eab@xymon.invalid> Jaime Kikpole <user-c575ba5bb612@xymon.invalid> writes:
> A coincidence just led me to a log file with lines like these:
> +pid 44909 (hobbitping), uid 280: exited on signal 11
> +pid 45111 (hobbitping), uid 280: exited on signal 11
> +pid 46426 (hobbitping), uid 280: exited on signal 11

Signal 11 is SEGV, so this obviously should not be happening.
Which version is this (sorry if this is in a previous message)?
There were some fixes for hobbitping in beta-3.

It would be nice if you could try fping - then at least we could
narrow down the problem.

> The Unix box that I installed Xymon onto had its IP change at the same
> time as the network changes.  Is it possible that there was a side
> effect from the IP change?

It shouldn't matter, unless you have another box with the same
IP on your network. And then you would have much bigger problems,
I think.


Regards,
Henrik
list Jaime Kikpole · Mon, 29 Nov 2010 12:04:26 -0500 ·
On Mon, Nov 29, 2010 at 11:55 AM, Henrik Størner <user-ce4a2c883f75@xymon.invalid> wrote:
> Signal 11 is SEGV, so this obviously should not be happening.
> Which version is this (sorry if this is in a previous message)?
> There were some fixes for hobbitping in beta-3.

I'm running 4.2.3.  Sorry.

> It would be nice if you could try fping - then at least we could
> narrow down the problem.

Is that just a matter of putting "fping" in the FPING variable of
hobbitserver.cfg?

>> The Unix box that I installed Xymon onto had its IP change at the same
>> time as the network changes.  Is it possible that there was a side
>> effect from the IP change?
>
> It shouldn't matter, unless you have another box with the same
> IP on your network. And then you would have much bigger problems,
> I think.

Thanks.

Jaime

-- 
Network Administrator
Cairo-Durham Central School District
http://cns.cairodurham.org
list Ryan Novosielski · Mon, 29 Nov 2010 16:00:51 -0500 ·
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 11/25/2010 10:54 PM, Jaime Kikpole wrote:
> Would changing out hobbitping for fping be as simple as changing the line:
> FPING="hobbitping"
> ...to the line...
> FPING="fping"
> ...in the file hobbitserver.cfg?  I already have fping at
> /usr/local/sbin and that path is already in the PATH variable in
> hobbitserver.cfg.

This sounds reasonable to me, though I'd check the documentation to be sure.

- -- 
- ---- _  _ _  _ ___  _  _  _
|Y#| |  | |\/| |  \ |\ |  | |Ryan Novosielski - Sr. Systems Programmer
|$&| |__| |  | |__/ | \| _| |user-ae4522577e16@xymon.invalid - 973/972.0922 (2-0922)
\__/ Univ. of Med. and Dent.|IST/CST-Academic Svcs. - ADMC 450, Newark
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkz0FIIACgkQmb+gadEcsb4KUwCglrljjMSEpscnzFRVyafGGbgl
+l0AoKoGthb6PqJZSRCfFJBPeTn+wKT6
=lumV
-----END PGP SIGNATURE-----
list Henrik Størner · Mon, 29 Nov 2010 21:38:49 +0000 (UTC) ·
On Mon, 29 Nov 2010 12:04:26 -0500, Jaime Kikpole wrote:
> On Mon, Nov 29, 2010 at 11:55 AM, Henrik Størner <user-ce4a2c883f75@xymon.invalid> wrote:
>> Signal 11 is SEGV, so this obviously should not be happening. Which
>> version is this (sorry if this is in a previous message)? There were
>> some fixes for hobbitping in beta-3.
>
> I'm running 4.2.3.  Sorry.

Nothing to be sorry for. Like I said, there were some fixes to hobbitping
done in the 4.3.0 beta-3 release, so hopefully it shouldn't segfault now.

>> It would be nice if you could try fping - then at least we could narrow
>> down the problem.
>
> Is that just a matter of putting "fping" in the FPING variable of
> hobbitserver.cfg?

Yes.


Regards,
Henrik
list Jaime Kikpole · Tue, 30 Nov 2010 11:01:17 -0500 ·
On Mon, Nov 29, 2010 at 4:38 PM, Henrik Størner <user-ce4a2c883f75@xymon.invalid> wrote:
> Nothing to be sorry for. Like I said, there were some fixes to hobbitping
> done in the 4.3.0 beta-3 release, so hopefully it shouldn't segfault now.

Good to know.  When I can manage to schedule the next upgrade, this
will be something to look forward to.  Thanks.

>> Is that just a matter of putting "fping" in the FPING variable of
>> hobbitserver.cfg?
>
> Yes.

I did this and there have been no more false alarms.  Just as a test,
I unplugged one server's network cable and xymon reported it down
within 1-2 minutes.  I plugged it back in and it reported it back up
about 1-2 minutes later.

I'd like to thank everyone for the help.  Xymon has been so good for
our needs that my coworker said that he feels naked whenever xymon is
offline.  (A major switch failed at one point, so he couldn't see the
web server xymon is on.  He had to diagnose things the old fashioned
way, i.e. lots of pings.)  I recommend it to sysadmins all over the
place now.  :)

Jaime

-- 
Network Administrator
Cairo-Durham Central School District
http://cns.cairodurham.org