Inexplicable purple on running services
list Rob Munsch
Consider the below. Approx. 25 minutes ago, across all monitored systems, all net monitored services - ssh, ldaps and dns - went to purple. They are still up, running, and just fine in every respect. The status message is even the same as when it was showing green. But now every ssh, ldaps and dns light is purple.
The last thing i was messing with when this happened was the alerts config file; i hadn't touched bb-hosts. cpu, disk, memory etc. all remain green.
I cannot find anything in the logs that indicated what changed 25 minutes ago. I have restarted the hobbitd. Something like this seemed to happen yesterday; after a number of monitored services were green and unchanging for a while, they went purple, yet report as "OK" across the board. While tweaking other settings, everything went back to normal. I don't understand what could cause this, or why it's displaying the purple light when it "knows" it's fine.
Any ideas?
Mon Oct 31 16:30:06 2005 ssh ok
Service ssh on <machinename> is OK (up)
SSH-<ver>-OpenSSH_<ver>
Seconds: 0.00
Status unchanged in 0 hours, 25 minutes
Status message received from hobbitd
--
Rob Munsch
Systems Analyst, Solutions for Progress
http://www.solutionsforprogress.com
list Dan Vande More
The hobbit server considers any service it hasn't heard from in a configurable period of time as purple. Mine is set at 30 minutes; I think the default is somewhere around that. Assuming the default is 30 minutes, this means that if it all went purple 25 minutes ago, the test hasn't reported to the server in 55 minutes.
I would suggest ensuring clients have connectivity to the server, i.e.:
telnet bbserver 1984
This tests for TCP connectivity to port 1984, the port hobbit expects client updates on.
Of the millions of things that could cause this, I'd say server connectivity (default gateway, IP address), firewall ACLs or the hobbit server no longer running are the most likely.
Good luck
Dan
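The arithmetic in that reply can be sketched as a small shell function. This is only an illustration of the reasoning, not how hobbitd decides on purple internally; the function name `is_stale` is hypothetical, and the 30-minute validity period is the guess made in the reply, not a confirmed default.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Hobbit: decide whether a status would
# count as stale (purple) given when it was last reported.
#   $1 = epoch seconds of the last status update
#   $2 = current time in epoch seconds
#   $3 = validity period in minutes (assumed: 30)
is_stale() {
    local last_update=$1 now=$2 validity_min=${3:-30}
    local age_min=$(( (now - last_update) / 60 ))
    if (( age_min > validity_min )); then
        echo "purple (no update for ${age_min} minutes)"
    else
        echo "fresh (last update ${age_min} minutes ago)"
    fi
}

# A status that turned purple 25 minutes ago, with a 30-minute validity
# period, was last updated 55 minutes ago.
now=$(date +%s)
is_stale $(( now - 55 * 60 )) "$now" 30
```

Read the other way round, the age of the last update is the validity period plus the time the light has already been purple.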
list Henrik Størner
Purple is an indication that some part of your monitoring system has stopped.
All of the purple ones are network services? Then it sounds as if your network tests have stopped running. Check the ~hobbit/server/logs/bb-network.log file for any errors.
Regards,
Henrik
list Rob Munsch
There are no entries in the network log since 10/28.
Hobbit is running on the server, and the clients are running on the various clients. CPU, Memory, Disk and Procs all remain green! SSH, ldaps, and dns on the clients are purple.
On the hobbit server itself, bbd is purple. Everything else is green.
Network connectivity between all clients > server is functional. I don't get it...
list Rob Munsch
Since ssh, ldap, and dns are tests run from the serverside (cpu etc remaining green indicates the clients are running and communicating OK, right?), i ran
./bbtest-net --concurrency=50 --checkresponse --no-update --timing --debug
Now, i can ping and ssh to all clients from server just fine. But i see this:
---
2005-11-01 14:14:20 Adding to combo msg: status brassai.conn red <!-- [flags:ordAstILe] --> Tue Nov 1 14:14:20 2005 conn NOT ok
status brassai.conn red <!-- [flags:ordAstILe] --> Tue Nov 1 14:14:20 2005 conn NOT ok
Service conn on brassai is not OK : Host does not respond to ping
System unreachable for 3 poll periods (56 seconds)
---
Aha. Since the ping test fails, why test other net services? So now it makes sense; the net tests are not being run, hence the purple.
a'course, i don't know why the nettest is suddenly unable to ping anything. It is getting the right IPs internally:
---
2005-11-01 14:14:20 Got DNS result for host doisneau : 10.x.x.x
2005-11-01 14:14:20 Got DNS result for host brassai : 10.x.x.x
2005-11-01 14:14:20 Got DNS result for host moadib : 10.x.x.x
---
and i thought cranking the concurrency way down might help, but apparently it doesn't.
So, i'm glad i found the cause... now i just need to find out the cause's cause. o_O
list Rob Munsch
Last email for a while, i promise; i'm chainsmoking packets at this point. but i found this-
---
2005-11-01 14:14:20 TCP tests completed normally
2005-11-01 14:14:20 Execution of 'fping -Ae' failed with error-code 99
2005-11-01 14:14:20 Sending results for service conn
---
Okay, it can't find fping. But...
---
hobbit at randomaccess ~/server/bin $ more ../etc/hobbitserver.cfg |grep fping
# Make sure the path includes the directories where you have fping, mail and (optionally) ntpdate installed,
FPING="/usr/sbin/fping" # Path and options for the 'fping' program.
hobbit at randomaccess ~/server/bin $ /usr/sbin/fping -Ae brassai
10.10.10.15 is alive (0.15 ms)
hobbit at randomaccess ~/server/bin $
---
So it should be finding fping just fine, and fping is working.
The path is in hobbitserver.cfg:
---
# Make sure the path includes the directories where you have fping, mail and (optionally) ntpdate installed,
# as well as the BBHOME/bin directory where all of the Hobbit programs reside.
PATH="/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/home/hobbit/server/bin"
...
# For bbtest-net
...
FPING="/usr/sbin/fping" # Path and options for the 'fping' program.
---
and
[bbnet]
ENVFILE /home/hobbit/server/etc/hobbitserver.cfg
So, by all the above: fping is functional, it is accessible by the 'hobbit' user, it can reach the clients, it is in the PATH, it is defined in the ENVFILE bbnet is using.
So what's gone wrong??
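The manual checks above (the setting is present in the environment file and the binary is executable) can be bundled into one step. The following is a minimal sketch, assuming the setting is written as a double-quoted FPING="..." line as in the excerpt; the function name `check_fping_setting` is hypothetical and not part of Hobbit. It only inspects the file, so it says nothing about what environment bbtest-net actually receives when it is launched.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Hobbit: print the FPING setting from a
# hobbitserver.cfg-style file and check that its first word is executable.
#   $1 = path to the environment file
check_fping_setting() {
    local envfile=$1 line value binary
    # Use the last uncommented FPING= line, since a later one would override.
    line=$(grep -E '^[[:space:]]*FPING=' "$envfile" | tail -n 1)
    if [ -z "$line" ]; then
        echo "FPING is not set in $envfile"
        return 1
    fi
    # Strip the variable name and the double quotes; anything after the
    # closing quote (such as a trailing comment) is dropped as well.
    value=${line#*=}
    value=${value#\"}
    value=${value%%\"*}
    # The first word is the program, the rest are its options.
    binary=${value%% *}
    if [ -x "$binary" ]; then
        echo "FPING=$value (executable)"
    else
        echo "FPING=$value (not executable: $binary)"
        return 1
    fi
}

# Example (run as the hobbit user):
#   check_fping_setting /home/hobbit/server/etc/hobbitserver.cfg
```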
list Henrik Størner
This is pretty odd, because with that FPING setting you should also see /usr/sbin/fping in the logfile entry - it should read
2005-11-01 14:14:20 Execution of '/usr/sbin/fping -Ae' failed with error-code 99
Could you check your hobbitlaunch.cfg file? The [bbnet] section should be
[bbnet]
ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
NEEDS hobbitd
CMD bbtest-net --report --ping --checkresponse
LOGFILE $BBSERVERLOGS/bb-network.log
INTERVAL 5m
I suspect that maybe the ENVFILE setting is missing or points to the wrong file ...
[bbnet]
ENVFILE /home/hobbit/server/etc/hobbitserver.cfg
Hrm, so you did that. What happens if you run
bbcmd --env=/home/hobbit/server/etc/hobbitserver.cfg bbtest-net --ping
?
Henrik
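Comparing a [bbnet] section against the expected one is easier when the section is printed on its own. The following is a sketch using awk, assuming section headers start at the beginning of a line as in the example above; the function name `show_section` is hypothetical and not a Hobbit tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Hobbit: print the settings of one
# [section] of a hobbitlaunch.cfg-style file, so that ENVFILE and CMD can
# be checked at a glance.
#   $1 = path to the launch configuration file
#   $2 = section name without brackets, e.g. bbnet
show_section() {
    awk -v want="[$2]" '
        # A line starting with "[" opens a new section.
        /^\[/ { in_section = ($1 == want); next }
        # Inside the wanted section, print non-empty, non-comment lines
        # with leading whitespace trimmed.
        in_section && NF && $1 !~ /^#/ { $1 = $1; print }
    ' "$1"
}

# Example:
#   show_section /home/hobbit/server/etc/hobbitlaunch.cfg bbnet
```

The path in the example is inferred from the ENVFILE location quoted in this thread and may differ on other installations.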
list Rob Munsch
hobbit at randomaccess ~/server/etc $ ../bin/bbcmd --env=/home/hobbit/server/etc/hobbitserver.cfg bbtest-net --ping --debug
---
It seems to have worked, pingwise:
2005-11-01 16:27:24 Sending results for service conn
2005-11-01 16:27:24 Adding to combo msg: status randomaccess.conn green <!-- [flags:OrdAstILe] --> Tue Nov 1 16:27:24 2005 conn ok
2005-11-01 16:27:24 Adding to combo msg: status orizo.conn green <!-- [flags:OrdAstILe] --> Tue Nov 1 16:27:24 2005 conn ok
2005-11-01 16:27:24 Adding to combo msg: status moadib.conn green <!-- [flags:OrdAstILe] --> Tue Nov 1 16:27:24 2005 conn ok
2005-11-01 16:27:24 Adding to combo msg: status brassai.conn green <!-- [flags:OrdAstILe] --> Tue Nov 1 16:27:24 2005 conn ok
2005-11-01 16:27:24 Adding to combo msg: status doisneau.conn green <!-- [flags:OrdAstILe] --> Tue Nov 1 16:27:24 2005 conn ok
but still doesn't test anything else... no checks for ssh, ldaps, dns... nothing it was checking (and showing as green) two days ago seems to be getting tested now. It completes the conn test, reports the results, and that's that.
I've checked the permissions on hobbitserver.cfg and they're correct. There's no reason the hobbit user shouldn't be able to read it. I then ran bbcmd again without specifying the env, and got identical results.
I don't understand why ssh et al. are yielding nothing at all..? If it failed to connect in some way it'd be red, wouldn't it? They remain purple... for some reason the tests aren't being done at all. I haven't altered the services definitions in any way.
list Henrik Størner
Run "bbcmd bbtest-net --ping --report --debug 2>&1 >debug.log" and send the full debug logfile directly to me (user-ce4a2c883f75@xymon.invalid).
Henrik
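A note on the redirection in that command: shells process redirections from left to right, so "2>&1 >debug.log" sends only standard output to the file, while standard error goes wherever standard output pointed before (normally the terminal). Whether that matters here depends on which stream bbtest-net writes its debug output to, which this thread does not say. The sketch below shows the difference with a stand-in function; `emit` is hypothetical and is not bbtest-net.

```shell
#!/usr/bin/env bash
# Stand-in for a program that writes to both streams (hypothetical; it is
# not bbtest-net).
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

workdir=$(mktemp -d)

# Order as written in the message: stderr is first duplicated onto the
# current stdout, then stdout alone is redirected to the file.
emit 2>&1 >"$workdir/first.log"

# File first, then stderr onto it: both streams end up in the file.
emit >"$workdir/second.log" 2>&1

# Compare how many lines each file captured.
wc -l "$workdir/first.log" "$workdir/second.log"
```

If the log is meant to hold everything the program prints, the second form is the one that captures both streams.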
list Rob Munsch
Right then. Straightened out thanks to Henrik's generosity with his time. As a warning to my fellow knobs, here's the postmortem:
In bb-hosts, the group-only definitions control the display of the tests. Group-only arguments do NOT call for the actual tests themselves; the network tests must be specified after the hashmark on the client line as usual. (Client tests are reported by the client and don't get specified after the hash.)
So, this fails to provide the expected info:
---
group-only ldaps|ssh|cpu|memory|disk|procs hobbitses
10.10.10.15 brassai
---
and this succeeds:
---
group-only ldaps|ssh|cpu|memory|disk|procs hobbitses
10.10.10.15 brassai # ldaps ssh
---
In the former, ldaps and ssh will have null info. cpu -> procs will display correctly, but no net tests will be run on brassai other than ping, which runs by default because --ping is specified in hobbitlaunch.cfg under the [bbnet] section.
I hope this helps anyone else new and blundering their way around the config files as i was.
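A quick way to spot the same mistake is to list the host lines that have nothing after the hash mark. The following is a rough sketch using awk, assuming host lines begin with a dotted IPv4 address followed by the hostname, as in the examples above; the function name `hosts_without_net_tests` is hypothetical. It only checks whether anything follows the hash, so a line carrying other tags but no test names would not be flagged.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Hobbit: list hosts in a bb-hosts-style
# file whose line has nothing after the hash mark, i.e. no network tests
# requested beyond the default ping.
#   $1 = path to the bb-hosts file
hosts_without_net_tests() {
    awk '
        # Host lines start with an IPv4 address; group-only and other
        # directives are skipped.
        $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
            tests = ""
            hash = index($0, "#")
            if (hash > 0) tests = substr($0, hash + 1)
            gsub(/[ \t]/, "", tests)
            if (tests == "") print $2
        }
    ' "$1"
}

# Example:
#   hosts_without_net_tests /home/hobbit/server/etc/bb-hosts
```

The bb-hosts path in the example is an assumption based on the etc directory used elsewhere in this thread.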