Xymon Mailing List Archive

sorry for the constant revision (was: re: purple haze)

Rob Munsch
Tue, 01 Nov 2005 14:40:48 -0500
Message-Id: <user-947627412cb6@xymon.invalid>

Last email for a while, I promise; I'm chainsmoking packets at this point.  But I found this:

---
2005-11-01 14:14:20 TCP tests completed normally
2005-11-01 14:14:20 Execution of 'fping -Ae' failed with error-code 99
2005-11-01 14:14:20 Sending results for service conn
---

Okay, it can't find fping.  But...
---
hobbit@randomaccess ~/server/bin $ more ../etc/hobbitserver.cfg |grep fping
# Make sure the path includes the directories where you have fping, mail and (optionally) ntpdate installed,
FPING="/usr/sbin/fping"                                 # Path and options for the 'fping' program.
hobbit@randomaccess ~/server/bin $ /usr/sbin/fping -Ae brassai
10.10.10.15 is alive (0.15 ms)
hobbit@randomaccess ~/server/bin $
---

So it should be finding fping just fine, and fping is working.
The path is in hobbitserver.cfg:
---
# Make sure the path includes the directories where you have fping, mail and (optionally) ntpdate installed,
# as well as the BBHOME/bin directory where all of the Hobbit programs reside.
PATH="/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/home/hobbit/server/bin"
...
# For bbtest-net
...
FPING="/usr/sbin/fping"                                                 # Path and options for the 'fping' program.
---

and

[bbnet]
        ENVFILE /home/hobbit/server/etc/hobbitserver.cfg


So, by all the above:  fping is functional, it is accessible by the 'hobbit' user, it can reach the clients, it is in the PATH, it is defined in the ENVFILE bbnet is using.

So what's gone wrong??
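
One thing the checks above may not cover: the log line reads 'fping -Ae', not '/usr/sbin/fping -Ae', so the FPING setting may not have reached a bbtest-net started by hand from a shell. That is only a guess. A minimal sketch for checking it, assuming nothing beyond a POSIX shell; resolve_cmd is a made-up helper name, not a Hobbit tool:

```shell
#!/bin/sh
# Hypothetical helper (not part of Hobbit): report how a command string
# resolves in the *current* environment, i.e. the one a hand-started
# program would inherit.
resolve_cmd() {
    # The argument may carry options (e.g. "fping -Ae"); keep the first word.
    set -- $1
    if resolved=$(command -v "$1" 2>/dev/null); then
        printf '%s\n' "$resolved"
    else
        printf 'not found: %s\n' "$1"
        return 1
    fi
}
```

Running `resolve_cmd "fping -Ae"` and `resolve_cmd "$FPING"` in the same shell that launches bbtest-net would show whether the bare name resolves there and whether FPING is set in that environment at all.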


Rob Munsch wrote:
Since ssh, ldap, and dns are tests run from the server side (cpu etc. staying green indicates the clients are running and communicating OK, right?), I ran

./bbtest-net --concurrency=50 --checkresponse --no-update --timing --debug

Now, I can ping and ssh to all clients from the server just fine.  But I see this:

---
2005-11-01 14:14:20 Adding to combo msg: status brassai.conn red <!-- [flags:ordAstILe] --> Tue Nov  1 14:14:20 2005 conn NOT ok
status brassai.conn red <!-- [flags:ordAstILe] --> Tue Nov  1 14:14:20 2005 conn NOT ok

Service conn on brassai is not OK : Host does not respond to ping

System unreachable for 3 poll periods (56 seconds)
---

Aha.  Since the ping test fails, why test other net services?  So now it makes sense; the net tests are not being run, hence the purple.

A'course, I don't know why the net test is suddenly unable to ping anything.  It is getting the right IPs internally:

---
2005-11-01 14:14:20 Got DNS result for host doisneau : 10.x.x.x
2005-11-01 14:14:20 Got DNS result for host brassai : 10.x.x.x
2005-11-01 14:14:20 Got DNS result for host moadib : 10.x.x.x
---

I thought cranking the concurrency way down might help, but apparently it doesn't.

So, I'm glad I found the cause... now I just need to find out the cause's cause.  o_O
-- 
Rob Munsch
Systems Analyst, Solutions for Progress
http://www.solutionsforprogress.com