Xymon Mailing List Archive

Inexplicable purple on running services

10 messages in this thread

list Rob Munsch · Mon, 31 Oct 2005 17:32:44 -0500 ·
Consider the below.  Approx. 25 minutes ago, across all monitored systems, all net monitored services - ssh, ldaps and dns - went to purple.  They are still up, running, and just fine in every respect.  The status message is even the same as when it was showing green.  But now every ssh, ldaps and dns light is purple.

The last thing i was messing with when this happened was the alerts config file; i hadn't touched bb-hosts.  cpu, disk, memory etc. all remain green.

I cannot find anything in the logs that indicates what changed 25 minutes ago.  I have restarted hobbitd.  Something like this seemed to happen yesterday; after a number of monitored services were green and unchanging for a while, they went purple, yet report as "OK" across the board.  While tweaking other settings, everything went back to normal.  I don't understand what could cause this, or why it's displaying the purple light when it "knows" it's fine.

Any ideas?


      Mon Oct 31 16:30:06 2005 ssh ok

Service ssh on <machinename> is OK (up)

SSH-<ver>-OpenSSH_<ver>

Seconds: 0.00

Status unchanged in 0 hours, 25 minutes

Status message received from hobbitd

-- 
Rob Munsch
Systems Analyst, Solutions for Progress
http://www.solutionsforprogress.com
list Dan Vande More · Mon, 31 Oct 2005 20:50:05 -0600 ·
The hobbit server considers any service it hasn't heard from in a
configurable period of time  as purple. Mine is set at 30 minutes, I
think the default is somewhere around that area. Assuming the default
is 30 minutes, this means that if it all went purple 25 minutes ago,
the test hasn't reported to the server in 55 minutes.
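
The arithmetic behind that estimate can be sketched as below. This is only an illustration: the 30-minute validity period is the value assumed in the paragraph above, not a confirmed default, and `last_report_age` is a hypothetical helper name.

```python
from datetime import timedelta

def last_report_age(validity: timedelta, purple_for: timedelta) -> timedelta:
    """Time since the last status report, given how long the status has been purple.

    A status turns purple once `validity` has passed without an update, so the
    last report arrived `validity + purple_for` ago.
    """
    return validity + purple_for

# Assumed 30-minute validity period; the status page shows 25 minutes unchanged.
age = last_report_age(timedelta(minutes=30), timedelta(minutes=25))
print(age)  # 0:55:00
```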

I would suggest ensuring clients have connectivity to the server, ie:

telnet bbserver 1984

Which tests for tcp connectivity to port 1984, the port hobbit expects
client updates on. Of the millions of things that could cause this,
I'd say server connectivity (default gateway, IP address), firewall
ACLs or the hobbit server no longer running are the most likely.
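
Where telnet is not installed, the same TCP reachability check can be sketched in a few lines of Python. `can_connect` is a hypothetical helper name; port 1984 is the port mentioned above, and the timeout is an arbitrary choice.

```python
import socket

def can_connect(host: str, port: int = 1984, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Calling `can_connect("bbserver")` from a client mirrors `telnet bbserver 1984`. A True result only shows that something accepts connections on that port, not that it is the hobbit daemon.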

Good luck

Dan
list Henrik Størner · Tue, 1 Nov 2005 07:24:09 +0100 ·
On Mon, Oct 31, 2005 at 05:32:44PM -0500, Rob Munsch wrote:
> Consider the below.  Approx. 25 minutes ago, across all monitored systems, all net monitored services - ssh, ldaps and dns - went to purple.  They are still up, running, and just fine in every respect.  The status message is even the same as when it was showing green.  But now every ssh, ldaps and dns light is purple.

Purple is an indication that some part of your monitoring system
has stopped.

All of the purple ones are network services ? Then it sounds as if
your network tests have stopped running. Check the
~hobbit/server/logs/bb-network.log file for any errors.
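
One way to narrow that log down to recent entries is sketched below. It assumes each entry starts with a `YYYY-MM-DD HH:MM:SS` timestamp, which is the format of the bbtest-net output quoted later in this thread; `entries_since` is a hypothetical helper, and the first sample line is made up for illustration.

```python
from datetime import datetime

def entries_since(lines, cutoff):
    """Return the log lines whose leading timestamp is at or after `cutoff`."""
    recent = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # blank line or continuation without a timestamp
        if stamp >= cutoff:
            recent.append(line.rstrip("\n"))
    return recent

sample = [
    "2005-10-28 09:00:00 an older entry\n",  # hypothetical line
    "2005-11-01 14:14:20 Execution of 'fping -Ae' failed with error-code 99\n",
]
print(entries_since(sample, datetime(2005, 10, 31)))
```

For a real file, pass `open(path)` as `lines`. An empty result only means nothing was logged in that window; it does not prove that the tests stopped running.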


Regards,
Henrik
list Rob Munsch · Tue, 01 Nov 2005 13:48:04 -0500 ·
There are no entries in the network log since 10/28.  Hobbit is running on the server, and the client software is running on the various clients.

CPU, Memory, Disk and Procs all remain green!
SSH, ldaps, and dns on the clients are purple.

On the hobbit server itself, bbd is purple.  Everything else is green.
Network connectivity between all clients > server is functional.

I don't get it...
list Rob Munsch · Tue, 01 Nov 2005 14:21:22 -0500 ·
Since ssh, ldap, and dns are tests run from the serverside (cpu etc remaining green indicates the clients are running and communicating OK, right?), i ran

./bbtest-net --concurrency=50 --checkresponse --no-update --timing --debug

Now, i can ping and ssh to all clients from server just fine.  But i see this:

---
2005-11-01 14:14:20 Adding to combo msg: status brassai.conn red <!-- [flags:ordAstILe] --> Tue Nov  1 14:14:20 2005 conn NOT ok
status brassai.conn red <!-- [flags:ordAstILe] --> Tue Nov  1 14:14:20 2005 conn NOT ok

Service conn on brassai is not OK : Host does not respond to ping

System unreachable for 3 poll periods (56 seconds)
---

Aha.  Since the ping test fails, why test other net services?  So now it makes sense; the net tests are not being run, hence the purple.

a'course, i don't know why the nettest is suddenly unable to ping anything.  It is getting the right IPs internally:

---
2005-11-01 14:14:20 Got DNS result for host doisneau : 10.x.x.x
2005-11-01 14:14:20 Got DNS result for host brassai : 10.x.x.x
2005-11-01 14:14:20 Got DNS result for host moadib : 10.x.x.x
---

and i thought cranking the concurrency way down might help, but apparently it doesn't.

So, i'm glad i found the cause... now i just need to find out the cause's cause.  o_O
list Rob Munsch · Tue, 01 Nov 2005 14:40:48 -0500 ·
Last email for a while, i promise; i'm chainsmoking packets at this point.  but i found this-

---
2005-11-01 14:14:20 TCP tests completed normally
2005-11-01 14:14:20 Execution of 'fping -Ae' failed with error-code 99
2005-11-01 14:14:20 Sending results for service conn
---

Okay, it can't find fping.  But...
---
hobbit@randomaccess ~/server/bin $ more ../etc/hobbitserver.cfg |grep fping
# Make sure the path includes the directories where you have fping, mail and (optionally) ntpdate installed,
FPING="/usr/sbin/fping"                                 # Path and options for the 'fping' program.
hobbit@randomaccess ~/server/bin $ /usr/sbin/fping -Ae brassai
10.10.10.15 is alive (0.15 ms)
hobbit@randomaccess ~/server/bin $
---

So it should be finding fping just fine, and fping is working.
The path is in hobbitserver.cfg:
---
# Make sure the path includes the directories where you have fping, mail and (optionally) ntpdate installed,
# as well as the BBHOME/bin directory where all of the Hobbit programs reside.
PATH="/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/home/hobbit/server/bin"
...
# For bbtest-net
...
FPING="/usr/sbin/fping"                                                 # Path and options for the 'fping' program.
---

and

[bbnet]
        ENVFILE /home/hobbit/server/etc/hobbitserver.cfg


So, by all the above:  fping is functional, it is accessible by the 'hobbit' user, it can reach the clients, it is in the PATH, it is defined in the ENVFILE bbnet is using.
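
The manual checks above can be scripted roughly as follows. This is a sketch under assumptions: it treats the environment file as simple shell-style `NAME="value"` lines, and `read_setting` and `is_executable` are hypothetical helper names, not part of Hobbit.

```python
import os
import shlex

def read_setting(cfg_text: str, name: str):
    """Return the value assigned to `name` in shell-style NAME="value" lines, or None."""
    value = None
    for line in cfg_text.splitlines():
        try:
            tokens = shlex.split(line, comments=True)  # drops '#' comments
        except ValueError:
            continue  # unbalanced quotes; skip the line
        for token in tokens:
            key, sep, val = token.partition("=")
            if sep and key == name:
                value = val  # a later assignment overrides an earlier one
    return value

def is_executable(command: str) -> bool:
    """True if the program named first in `command` is an executable file."""
    parts = shlex.split(command)
    return bool(parts) and os.path.isfile(parts[0]) and os.access(parts[0], os.X_OK)

cfg = '''
# Make sure the path includes the directories where you have fping, mail and (optionally) ntpdate installed,
FPING="/usr/sbin/fping"                                 # Path and options for the 'fping' program.
'''
fping = read_setting(cfg, "FPING")
print(fping)  # /usr/sbin/fping
```

`is_executable(fping)` then reports whether the current user can run that binary. Note that this only inspects the file: it says nothing about which environment a running bbtest-net process was actually started with.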

So what's gone wrong??
list Henrik Størner · Tue, 1 Nov 2005 22:23:12 +0100 ·
On Tue, Nov 01, 2005 at 02:40:48PM -0500, Rob Munsch wrote:
> Last email for a while, i promise; i'm chainsmoking packets at this point.  but i found this-
>
> ---
> 2005-11-01 14:14:20 TCP tests completed normally
> 2005-11-01 14:14:20 Execution of 'fping -Ae' failed with error-code 99
> 2005-11-01 14:14:20 Sending results for service conn
> ---
>
> Okay, it can't find fping.  But...
> ---
> hobbit@randomaccess ~/server/bin $ more ../etc/hobbitserver.cfg |grep fping
> # Make sure the path includes the directories where you have fping, mail and (optionally) ntpdate installed,
> FPING="/usr/sbin/fping"                                 # Path and options for the 'fping' program.

This is pretty odd, because with that FPING setting you should also
see /usr/sbin/fping in the logfile entry - it should read
2005-11-01 14:14:20 Execution of '/usr/sbin/fping -Ae' failed with error-code 99

Could you check your hobbitlaunch.cfg file ? [bbnet] section should be

[bbnet]
	ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
	NEEDS hobbitd
	CMD bbtest-net --report --ping --checkresponse
	LOGFILE $BBSERVERLOGS/bb-network.log
	INTERVAL 5m

I suspect that maybe the ENVFILE setting is missing or points to the
wrong file ...

> [bbnet]
>        ENVFILE /home/hobbit/server/etc/hobbitserver.cfg

Hrm, so you did that.


What happens if you run

   bbcmd --env=/home/hobbit/server/etc/hobbitserver.cfg bbtest-net --ping 

?


Henrik
list Rob Munsch · Tue, 01 Nov 2005 16:38:00 -0500 ·
hobbit@randomaccess ~/server/etc $ ../bin/bbcmd --env=/home/hobbit/server/etc/hobbitserver.cfg bbtest-net --ping --debug
---
It seems to have worked, pingwise:

2005-11-01 16:27:24 Sending results for service conn
2005-11-01 16:27:24 Adding to combo msg: status randomaccess.conn green <!-- [flags:OrdAstILe] --> Tue Nov  1 16:27:24 2005 conn ok
2005-11-01 16:27:24 Adding to combo msg: status orizo.conn green <!-- [flags:OrdAstILe] --> Tue Nov  1 16:27:24 2005 conn ok
2005-11-01 16:27:24 Adding to combo msg: status moadib.conn green <!-- [flags:OrdAstILe] --> Tue Nov  1 16:27:24 2005 conn ok
2005-11-01 16:27:24 Adding to combo msg: status brassai.conn green <!-- [flags:OrdAstILe] --> Tue Nov  1 16:27:24 2005 conn ok
2005-11-01 16:27:24 Adding to combo msg: status doisneau.conn green <!-- [flags:OrdAstILe] --> Tue Nov  1 16:27:24 2005 conn ok

but still doesn't test anything else... no checks for ssh, ldaps, dns... nothing it was checking (and showing as green) two days ago seems to be getting tested now.  It completes the conn test, reports the results, and that's that.

I've checked the permissions on hobbitserver.cfg and they're correct.  There's no reason the hobbit user shouldn't be able to read it.
I then ran bbcmd again without specifying the env, and got identical results.

I don't understand why ssh et al. are yielding nothing at all.  If a test failed to connect in some way it'd be red, wouldn't it?  They remain purple, so for some reason the tests aren't being run at all.  I haven't altered the service definitions in any way.
list Henrik Størner · Tue, 1 Nov 2005 22:53:06 +0100 ·
On Tue, Nov 01, 2005 at 04:38:00PM -0500, Rob Munsch wrote:
> hobbit@randomaccess ~/server/etc $ ../bin/bbcmd --env=/home/hobbit/server/etc/hobbitserver.cfg bbtest-net --ping --debug
> ---
> It seems to have worked, pingwise:
>
> 2005-11-01 16:27:24 Sending results for service conn
> 2005-11-01 16:27:24 Adding to combo msg: status randomaccess.conn green <!-- [flags:OrdAstILe] --> Tue Nov  1 16:27:24 2005 conn ok
>
> but still doesn't test anything else... no checks for ssh, ldaps, dns... nothing it was checking (and showing as green) two days ago seems to be getting tested now.  It completes the conn test, reports the results, and that's that.

Run "bbcmd bbtest-net --ping --report --debug 2>&1 >debug.log" and
send the full debug logfile directly to me (user-ce4a2c883f75@xymon.invalid).
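
A note on the redirection: POSIX shells apply redirections left to right, so `2>&1 >debug.log` writes only stdout to the file and leaves stderr on the terminal, while `>debug.log 2>&1` captures both. If both streams are wanted in one file, a minimal sketch in Python looks like this (`run_and_log` is a hypothetical helper name):

```python
import subprocess
import sys

def run_and_log(argv, log_path):
    """Run `argv`, writing its stdout and stderr, interleaved, to `log_path`."""
    with open(log_path, "wb") as log:
        # stderr=subprocess.STDOUT points stderr at the same file as stdout.
        result = subprocess.run(argv, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode
```

With the command from this message, that would be `run_and_log(["bbcmd", "bbtest-net", "--ping", "--report", "--debug"], "debug.log")`, assuming bbcmd is on the PATH.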


Henrik
list Rob Munsch · Tue, 01 Nov 2005 17:44:13 -0500 ·
Right then.  Straightened out thanks to Henrik's generosity with his time.
As a warning to my fellow knobs, here's the postmortem:

In bb-hosts, a group-only definition controls only which test columns are displayed.  Listing a test in group-only does NOT cause that test to be run; network tests must still be specified after the hashmark on the host's line, as usual.  (Client tests are reported by the client and don't get specified after the hash.)  So, this fails to provide the expected info:

---
group-only      ldaps|ssh|cpu|memory|disk|procs hobbitses
10.10.10.15     brassai
---

and this succeeds:

---
group-only      ldaps|ssh|cpu|memory|disk|procs hobbitses
10.10.10.15     brassai # ldaps ssh
---

In the former, ldaps and ssh will have no data.  cpu through procs will display correctly, but the only net test run against brassai is ping, since --ping is specified by default in the [bbnet] section of hobbitlaunch.cfg.
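
A rough check for this mistake is sketched below. It is a simplification, not a full bb-hosts parser: it assumes host lines start with an IP address, matches test names exactly (no modifiers or port suffixes), and takes the set of client-reported tests from this thread. `missing_net_tests` and `CLIENT_TESTS` are hypothetical names.

```python
CLIENT_TESTS = {"cpu", "memory", "disk", "procs"}  # reported by the client, per this thread

def missing_net_tests(bb_hosts: str, client_tests=CLIENT_TESTS):
    """Map hostname -> group-only columns that nothing on the host line will fill."""
    columns = []
    missing = {}
    for raw in bb_hosts.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if fields[0] == "group-only":
            columns = fields[1].split("|") if len(fields) > 1 else []
            continue
        if fields[0] in {"group", "group-compress", "page", "subpage"}:
            columns = []  # a new page or plain group ends the group-only filter
            continue
        if not fields[0][0].isdigit():
            continue  # not a host line
        definition, _, tags = line.partition("#")
        parts = definition.split()
        if len(parts) < 2:
            continue  # no hostname after the address
        declared = set(tags.split())
        gaps = [c for c in columns if c not in client_tests and c not in declared]
        if gaps:
            missing[parts[1]] = gaps
    return missing

bad = "group-only      ldaps|ssh|cpu|memory|disk|procs hobbitses\n10.10.10.15     brassai\n"
good = "group-only      ldaps|ssh|cpu|memory|disk|procs hobbitses\n10.10.10.15     brassai # ldaps ssh\n"
print(missing_net_tests(bad))   # {'brassai': ['ldaps', 'ssh']}
print(missing_net_tests(good))  # {}
```

Any host that shows up in the result has columns on display that no configured network test will ever update, which is the kind of gap described in this postmortem.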

I hope this helps anyone else new and blundering their way around the config files as i was.