out of sockets for hobbitfetch
list Craig Cook
The server is Solaris 10. Yesterday I shut down hobbit, moved to another disk, created a soft link, started up again.

Now this shows up in hobbitfetch.log:

2008-12-31 08:49:52 Caught TERM signal, terminating
2008-12-31 08:53:49 Connection lost during connect/write to 10.1.1.184:1984 (req 132): Broken pipe
2008-12-31 08:53:56 Connection lost during connect/write to 10.1.1.183:1984 (req 246): Broken pipe
2008-12-31 08:53:56 Out of sockets (req 280)
2008-12-31 08:53:56 Out of sockets (req 281)
2008-12-31 08:53:56 Out of sockets (req 282)
2008-12-31 08:53:56 Out of sockets (req 283)
2008-12-31 08:53:56 Out of sockets (req 284)
2008-12-31 08:53:56 Out of sockets (req 285)

All my hobbitfetch clients are purple.

How do I identify what sockets are required? It sounds like an OS resource issue.

Thanks
Craig
list Asif Iqbal
On Wed, Dec 31, 2008 at 9:45 AM, Craig Cook <user-bd346ac7bd4a@xymon.invalid> wrote:
How do I identify what sockets are required? It sounds like an OS resource issue.
Have you tried running `execsnoop' and `opensnoop' (from the
DTraceToolkit)? They should tell you who is trying to connect or open
a new socket, if I am not mistaken.
-- Asif Iqbal PGP Key: 0xE62693C5 KeyServer: pgp.mit.edu
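One way to follow up on the OS-resource theory is to see how many "Out of sockets" messages arrive per second and compare that with the open-descriptor limit of the account that runs Hobbit. The sketch below is only an illustration: the function name `count_out_of_sockets` is made up here, and it assumes the log lines start with a date and a time as in the excerpt above.

```shell
#!/bin/sh
# Count "Out of sockets" messages per timestamp in a hobbitfetch log.
# Assumes each line starts with "YYYY-MM-DD HH:MM:SS", as in the excerpt.
# count_out_of_sockets is a hypothetical helper, not part of Hobbit.
count_out_of_sockets() {
    awk '/Out of sockets/ { n[$1 " " $2]++ }
         END { for (k in n) print k, n[k] }' "$1" | sort
}

# Usage (the log path is site-specific):
#   count_out_of_sockets /path/to/hobbitfetch.log
#   ulimit -n    # open-descriptor limit of the current shell
```

Sockets count against the per-process limit on open file descriptors, so running `ulimit -n` as the Hobbit user, in the same way the server is started, shows the ceiling hobbitfetch inherits.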
list Tim Grzechowski
Something on the network blew up, a router or switch. Of course hobbit went
purple. It was a Thursday afternoon and it was in the Network Gods' hands at
this point. They fixed it that night, but I still had the purple condition
in the morning. I checked both the server and clients (~100) and everything
was running. I restarted both anyway and waited. Went home for the
weekend. Came in Monday and still had the purple plague.
Did some reading, found this script on one of the hobbit mailing lists and ran
it.
/usr/lib/hobbit/server
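#!/usr/bin/ksh
# Drop every status that is currently purple on the local Hobbit server.
# Note: "drop" deletes the status from the server; it does not refresh it.
HBBB="/usr/lib/hobbit/server/bin/bb --debug"
# hobbitdboard prints one "hostname|testname" line per purple status.
${HBBB} 127.0.0.1 "hobbitdboard color=purple fields=hostname,testname" |
while IFS='|' read -r HOST TEST; do
    ${HBBB} 127.0.0.1 "drop $HOST $TEST"
done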
The purple condition is gone. But so are many of my monitoring tasks such as
CPU, processes, file system status, and more. None of them are working except
'conn', 'info', and 'trends', even after several days.
What do I need to do to hobbit and/or the client to have these monitored
again?
THANK YOU!
/tg
list Stef Coene
On Wednesday 31 December 2008, Tim Grzechowski wrote:
What do I need to do to hobbit and/or the client to have these monitored again?
What your script did was delete all checks that are purple. A purple check is a check that has not been updated for a while. So a purple check can be removed in two ways: by deleting the check, like you did, or by sending a new status.

In your case, you need to find out why the status is not being sent to the hobbit server. Check the client logs. You can also try a telnet to the hobbit port (default 1984) from the client to the hobbit server to see if there is a network problem. Also, check the ghost client list (it can be found in the menu on the hobbit server).

Happy new year, and I wish you all a good monitoring time,
Stef
list Tim Grzechowski
Stef, For some reason your email makes Outlook burp (go figure). It comes up blank and I get a Windows error.
list Tim Grzechowski
Stef, For some reason your email comes up blank and I get a Windows alert. Can you, or somebody else, please repost? Thanks! /tg
list Tim Grzechowski
The Ghost Client list is blank and blue (disabled?).

Checked eight clients out of the ~100 and all of them are able to telnet to the server on port 1984 without issue.

Shut down the client on a client-only machine. Copied /dev/null to hobbitclient.log and clientlaunch.log. Ran 'runclient.sh start': hobbitclient.log is empty, and clientlaunch.log has two lines that show it has started.

No change.

I shut down the client (on the server) and the hobbit server itself. Checked and cleared the logs. Restarted the server and, after a couple of minutes, the client on the server as well.

No change. Still not showing any of the pertinent info.

On the server I checked client-local.cfg and bb-hosts in hobbit's /etc directory. They are fine, and they were last accessed weeks before this issue popped up.

Any other ideas?

/tg

P.S. All the file systems have ample available space.
list Stef Coene
On Friday 02 January 2009, Tim Grzechowski wrote:
Any other ideas?
Make sure BBDISP is correct in hobbitclient.cfg.
Check the files in the tmp directory in the hobbit client directory. There should be some files created by the client:
logfetch.client_name.cfg
msg.client_name.txt
hobbit_vmstat.client_name.25687
logfetch.client_name.status
Stef
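The file check can be scripted. The sketch below lists client message files that have not been rewritten recently, which would suggest the client is not collecting data at all. The function name `stale_client_files` and the 10-minute default are assumptions made for illustration, and `-mmin` is a GNU/BSD `find` extension that an older Solaris `find` may not have.

```shell
#!/bin/sh
# List msg.*.txt files in a client tmp directory that were last modified
# more than N minutes ago (default 10).
# stale_client_files is a hypothetical helper, not part of Hobbit.
# -mmin is a GNU/BSD find extension; it may be missing on older Solaris.
stale_client_files() {
    dir=$1
    minutes=${2:-10}
    find "$dir" -type f -name 'msg.*.txt' -mmin +"$minutes"
}

# Usage (the path is an example; use your client's tmp directory):
#   stale_client_files ~/client/tmp 10
```

No output means every message file found was written within the window. A missing file prints nothing either, so a plain `ls -l` of the directory is still worth a look.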
list David W Gore
Try running the main client script by hand to make sure none of the commands hang (from the client ~/client/bin directory):
./bbcmd sh -x hobbitclient-<your OS>.sh # replace <your OS> with the proper OS
If ~/client/tmp/msg.<hostname>.txt is being created, then there is probably no need to execute the line above.
You checked for ghosts as others suggested, right?
~David
list Henrik Størner
In <user-b35e9ae0491b@xymon.invalid> Tim Grzechowski <user-ad307ef791f0@xymon.invalid> writes:
I restarted both anyway and waited. Went home for the weekend. Came in Monday and still had the purple plague.

I assume your Hobbit webpages are being updated? (Check the timestamp in the upper-right corner.) Is hobbitd_client running on the server?

On one of the clients, log in as the user running the Hobbit client and run the "bbcmd" tool. You'll get a new shell prompt. Do an

   echo $BB
   echo $BBDISP

and check that these point to the "bb" utility on the client and to the IP address of your Hobbit server. Then run

   $BB $BBDISP "ping"

You should get a response back from the Hobbit server. Next, run

   $BB $BBDISP "status $MACHINE.purpletest green Checking status"

This sends a "purpletest" status message to the Hobbit server. If everything works OK, you should get a "purpletest" status column (in color green) for this client host the next time the Hobbit webpages are updated.

Let us know what you find out.

Regards,
Henrik
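The test message Henrik describes can be wrapped in a small script to repeat it on several clients. This is a sketch only: `build_status_msg` and `send_purpletest` are hypothetical names, and the script assumes it runs inside the environment that "bbcmd" sets up, so that $BB, $BBDISP and $MACHINE are defined. When $BB or $BBDISP is missing, it prints the message instead of sending it.

```shell
#!/bin/sh
# Build a status message of the form used above:
#   status HOST.TEST COLOR TEXT
# build_status_msg and send_purpletest are hypothetical helper names.
build_status_msg() {
    printf 'status %s.%s %s %s\n' "$1" "$2" "$3" "$4"
}

# Send a throwaway green "purpletest" status, or print it as a dry run
# when the bbcmd environment ($BB, $BBDISP) is not loaded.
send_purpletest() {
    msg=$(build_status_msg "${MACHINE:-unknown}" purpletest green "Checking status")
    if [ -n "${BB:-}" ] && [ -n "${BBDISP:-}" ]; then
        "$BB" "$BBDISP" "$msg"
    else
        echo "dry run: $msg"
    fi
}

# Usage, from the shell that bbcmd starts:
#   send_purpletest
```

The dry-run branch is a quick way to notice that the environment is not loaded before concluding that the server is ignoring the message.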
list Tim Grzechowski
Thanks Henrik! Unbeknownst to me there was some sort of turf war... somebody changed the BBDISP via an "include" line in the hobbitclient.cfg file, which changed where it was pointing. We added our server to the BBDISPLAYS and we are happily co-existing now. Thanks again! /tg
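An override of this kind is easy to miss when reading the config by eye. As a rough sketch, the helper below (the name `show_display_settings` is invented here) prints every line in a client config that sets BBDISP or BBDISPLAYS, plus every "include" line, with line numbers. It assumes settings are written as `NAME=value` at the start of a line.

```shell
#!/bin/sh
# Print, with line numbers, each BBDISP/BBDISPLAYS assignment and each
# "include" line in a client configuration file.
# show_display_settings is a hypothetical helper, not part of Hobbit.
show_display_settings() {
    grep -n -E '^[[:space:]]*(BBDISP|BBDISPLAYS)=|^[[:space:]]*include[[:space:]]' "$1"
}

# Usage (the path is site-specific):
#   show_display_settings /path/to/hobbitclient.cfg
```

Running it again on each file named by an "include" line shows where the final value comes from.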
list Walter Rutherford
Hey all,

This is probably an old issue but I didn't see a way to search the archives.

Our xymon server is showing purple indicators for two of our custom scripts, but only on a handful of systems. I've found differences in file location, file ownership, UID, GID, etc., but so far none of that seems to be the problem.

The custom script checks raids. Strangely, all of the stagnant hosts show the same three disk entries from mid-July no matter how many disks they really have. Unfortunately I don't know what may have happened in July; that was before I started working here. I suspect the xymon-client software was copied from a live system, including the old status reports, and something wasn't reconfigured correctly for the new systems.

Even stranger, at my urging the Lead SA re-enabled the purple notifications. I was expecting the page to go purple but it remains green even though the page isn't updating. At the bottom of the incorrect raid report page there is a link to "client data". If I follow the link I get a full report including the correct, current raid information!

I think this means that the client is capturing the correct data and sending it to the server, and the server is actually receiving the report, but after that the raid report isn't being handled correctly. Other systems display as expected. So far I haven't found anywhere on the server that the purple systems are configured or handled differently.

I doubt we're the first to experience this problem. Does this sound familiar? Thanks in advance for any hints you can provide for where to look next.

WLR
list Phil Crooker
Is the hostname wrong somewhere? I'm thinking maybe the script is sending the wrong hostname, somehow....
list Jeremy Laidman
On 30 August 2015 at 14:22, Walter Rutherford <user-6b85327bb8bb@xymon.invalid> wrote:
This is probably an old issue but I didn't see a way to search the archives.
https://www.google.com/?q=site:lists.xymon.com+purple+raid
Our xymon server is showing purple indicators for two of our custom scripts but only on a handful of systems.
Are the scripts running client-side and/or server-side? Can you describe how the scripts work? Are they locally-written scripts or did you get them from somewhere online? RAID checks are not standard for most Xymon clients. I've never used or seen RAID checks. A quick look at the source code indicates built-in support for only Linux, where "md" devices are identified in /proc/mdstat.
At the bottom of the incorrect raid report page there is a link to "client data". If I follow the link I get a full report including the correct, current raid information!
How is the RAID information getting into the client data? This might not be used by your custom scripts, and so might be a red herring. More detail is required about the raid scripts, or about whether you're using the built-in support for Linux RAID meta-devices, reported in the [mdstat] section of the client data. If the latter, perhaps you could show the [mdstat] section of client data?
Cheers
Jeremy
list Walter Rutherford
All good questions. Hunting for the answers helped me to see some patterns I'd missed before. The xymon server hostname and IP seem to be consistent, but that's about all that is consistent.

There is a separate column for 'disks' on the main webpage and it correctly shows the output from a 'df' command. The script running on the client side is called "raid.sh"; the comments at the top of the script indicate it is over a decade old: bb-mdstat.h based on bb-raid.sh. There's a link from /home/xymon-client/ext to /usr/share/xymon-client/ext on most systems. The directory and the scripts in it are owned by either root or xymon. Changing location, ownership, and perms to match one of the working systems hasn't helped.

The broken raid reports are all from Linux boxes. The working reports look like this:

Mon Aug 31 09:38:49 AKDT 2015 RAID ALL devices OK
green md0 Status OK
green md1 Status OK
green md2 Status OK
============================ /proc/mdstat ===========================
Personalities : [raid1]
md0 : active raid1 sdc1[1] sda1[0]
      511988 blocks super 1.0 [2/2] [UU]
md2 : active raid1 sdd[3] sdb[2]
      536869888 blocks super 1.2 [2/2] [UU]
md1 : active raid1 sdc2[1] sda2[2]
      41428924 blocks super 1.1 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk
unused devices:
Run /sbin/mdadm -D /dev/md* for more info

The non-working systems either show nothing at all (that's better than purple) OR show the same three green md[0-2] devices (whether the host has three raid devices or not) on a blue disabled background. So, I'm almost positive someone copied a working system incorrectly to other clients without cleaning up the foreign logs. The working systems overwrote or just aged out the incorrect information while the non-working ones just keep reporting it. I have found logs but none for this raid information. Perhaps the logs are compressed or otherwise rendered humanly unreadable.

So, I copied the /usr/share/xymon-client/ext scripts from a working system to several that were reporting nothing and restarted xymon-client. Most did nothing; one is showing a "no data" indicator. The raid output looks normal except the device is md127, so perhaps the high number is confusing the script. But the wbinfo.sh script I copied at the same time to/from the same directory is now showing green. Argh!

I don't even know where the xymon-client scripts running here came from, so I'm reluctant (but motivated) to just rip them all out by the roots and start over from a known baseline.

WLR
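To test the md127 theory without touching the decade-old raid.sh, the member-state flags can be read straight from /proc/mdstat. The following is a minimal sketch, not the logic of the raid.sh in question: `mdstat_status` is a made-up name, and it assumes the layout shown in the report above, where each "mdN :" line is followed by a line ending in a flag group such as [UU].

```shell
#!/bin/sh
# Print "green mdN [UU]" for healthy arrays and "red mdN [U_]" for arrays
# with a missing member, based on the flag group in /proc/mdstat.
# mdstat_status is a hypothetical helper; it is not the thread's raid.sh.
# Reads the named file, or standard input when no file is given.
mdstat_status() {
    awk '
        /^md[0-9]+ :/ { dev = $1; next }           # remember the array name
        dev != "" && match($0, /\[[U_]+\]/) {      # flag group such as [UU]
            state = substr($0, RSTART, RLENGTH)
            color = (index(state, "_") > 0) ? "red" : "green"
            print color, dev, state
            dev = ""
        }
    ' "$@"
}

# Usage:
#   mdstat_status /proc/mdstat
```

Because the array name is matched as "md" followed by any number of digits, md127 is handled the same way as md0. If this prints a sensible line for md127 while the page still shows "no data", the device number is less likely to be the cause.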
list Walter Rutherford
Found it! Besides the "raid.sh" script in ext/ I needed a raid configuration in etc/client.d/. I thought that was defined in another file but apparently not.
list Walter Rutherford
Spoke too soon. Some of the systems actually have client.d/raid and they still aren't reporting. At least one didn't even have the directories. I guess that's one of the hazards of inheriting systems that were installed and/or modified by multiple people over time.
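With hosts set up by different people, it may help to audit each one the same way. The sketch below reports, for every script in the ext/ directory, whether any file under the client.d directory mentions it by name. The function name `check_ext_launchers` is hypothetical, and the directory layout is assumed from the paths mentioned in this thread. The match is a plain substring search, so "raid.sh" would also match a line naming "bb-raid.sh".

```shell
#!/bin/sh
# For each script in EXT_DIR, report whether some file under CONF_DIR
# mentions it by name. A missing CONF_DIR counts as "missing" for all.
# check_ext_launchers is a hypothetical helper; paths are assumptions.
check_ext_launchers() {
    ext_dir=$1
    conf_dir=$2
    for script in "$ext_dir"/*; do
        [ -f "$script" ] || continue
        name=$(basename "$script")
        if [ -d "$conf_dir" ] && grep -rqF "$name" "$conf_dir"; then
            echo "configured $name"
        else
            echo "missing $name"
        fi
    done
}

# Usage (adjust both paths to the host being checked):
#   check_ext_launchers /usr/share/xymon-client/ext /path/to/etc/client.d
```

A host that prints "configured raid.sh" and still does not report points away from the launcher entry and toward the script itself or the hostname it sends.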
▸
wrote:
Found it! Besides the "raid.sh" script in ext/ I needed a raid configuration in etc/client.d/. I thought that was defined in another file but apparently not.

On Mon, Aug 31, 2015 at 10:53 AM, Walter Rutherford <user-6b85327bb8bb@xymon.invalid> wrote:

All good questions. Hunting for the answers helped me to see some patterns I'd missed before.

The xymon server hostname and IP seem to be consistent, but that's about all that is consistent. There is a separate column for 'disks' on the main webpage and it correctly shows the output from a 'df' command. The script running on the client side is called "raid.sh"; the comments at the top of the script indicate it is over a decade old: bb-mdstat.h based on bb-raid.sh. There's a link from /home/xymon-client/ext to /usr/share/xymon-client/ext on most systems. The directory and the scripts in it are owned by either root or xymon. Changing location, ownership, and perms to match one of the working systems hasn't helped.

The broken raid reports are all from Linux boxes. The working reports look like this:

    Mon Aug 31 09:38:49 AKDT 2015 RAID ALL devices OK
    green md0 Status OK
    green md1 Status OK
    green md2 Status OK
    ============================ /proc/mdstat ===========================
    Personalities : [raid1]
    md0 : active raid1 sdc1[1] sda1[0]
          511988 blocks super 1.0 [2/2] [UU]
    md2 : active raid1 sdd[3] sdb[2]
          536869888 blocks super 1.2 [2/2] [UU]
    md1 : active raid1 sdc2[1] sda2[2]
          41428924 blocks super 1.1 [2/2] [UU]
          bitmap: 1/1 pages [4KB], 65536KB chunk
    unused devices:
    Run /sbin/mdadm -D /dev/md* for more info

The non-working systems either show nothing at all (that's better than purple) OR show the same three green md[0-2] devices (whether the host has three raid devices or not) on a blue disabled background. So, I'm almost positive someone copied a working system incorrectly to other clients without cleaning up the foreign logs. The working systems overwrote or just aged out the incorrect information while the non-working ones just keep reporting it. I have found logs but none for this raid information. Perhaps the logs are compressed or otherwise rendered humanly unreadable.

So, I copied the /usr/share/xymon-client/ext scripts from a working system to several that were reporting nothing and restarted xymon-client. Most did nothing; one is showing a "no data" indicator. The raid output looks normal except the device is md127. Perhaps the high number is confusing the script. But the wbinfo.sh script I copied at the same time to/from the same directory is now showing green. Argh!

I don't even know where the xymon-client scripts running here came from so I'm reluctant (but motivated) to just rip them all out by the roots and start over from a known baseline.

WLR

==================================================================================
Phil Crooker <user-e8e31cd73303@xymon.invalid> 3:57 PM (17 hours ago)

Is the hostname wrong somewhere? I'm thinking maybe the script is sending the wrong hostname, somehow....

==================================================================================
Jeremy Laidman <user-71895fb2e44c@xymon.invalid> 7:07 PM (14 hours ago)

On 30 August 2015 at 14:22, Walter Rutherford <user-6b85327bb8bb@xymon.invalid> wrote:

> This is probably an old issue but I didn't see a way to search the archives.

https://www.google.com/?q=site:lists.xymon.com+purple+raid

> Our xymon server is showing purple indicators for two of our custom scripts but only on a handful of systems.

The scripts are running client-side and/or server-side? Can you describe how the scripts work? Are they locally-written scripts or did you get them from somewhere online?

RAID checks are not standard for most Xymon clients. I've never used or seen RAID checks. A quick look at the source code indicates built-in support for only Linux, where "md" devices are identified in /proc/mdstat.

> At the bottom of the incorrect raid report page there is a link to "client data". If I follow the link I get a full report including the correct, current raid information!

How is the RAID information getting into the client data? This might not be used by your custom scripts, and so might be a red herring.

More detail is required about the raid scripts, or about whether you're using the built-in support for Linux RAID meta-device reporting with client data in the [mdstat] section. If the latter, perhaps you could show the [mdstat] section of client data?

Cheers

====================================================================================
---------- Forwarded message ----------
From: Walter Rutherford <user-6b85327bb8bb@xymon.invalid>
Date: Sat, Aug 29, 2015 at 8:22 PM
Subject: purple problems
To: Xymon at xymon.com

Hey all,

This is probably an old issue but I didn't see a way to search the archives.

Our xymon server is showing purple indicators for two of our custom scripts but only on a handful of systems. I've found differences in file location, file ownership, UID, GID, etc. but so far none of that seems to be the problem.

The custom script checks raids. Strangely, all of the stagnant hosts show the same three disk entries from mid-July no matter how many disks they really have. Unfortunately I don't know what may've happened in July; that was before I started working here. I suspect the xymon-client software was copied from a live system, including the old status reports, but in so doing something wasn't re-configured correctly for the new systems.

Even stranger, at my urging the Lead SA undisabled the purple notifications. I was expecting the page to go purple but it remains green even though the page isn't updating.

At the bottom of the incorrect raid report page there is a link to "client data". If I follow the link I get a full report including the correct, current raid information! I think this means that the client is capturing the correct data and sending it to the server, the server is actually receiving the report, but after that the raid report isn't being handled correctly. Other systems display as expected. So far I haven't found anywhere on the server that the purple systems are configured or handled differently.

I doubt we're the first to experience this problem. Does this sound familiar? Thanks in advance for any hints you can provide for where to look next.

WLR
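One way to test the md127 guess from the message above is to run the /proc/mdstat text through a parser that accepts any device number. The following is only a rough Python sketch, not the raid.sh script from this thread. The function names, the "yellow"/"red" colors and the DEGRADED/UNKNOWN labels are assumptions made for illustration; only the "green mdN Status OK" line format comes from the report quoted above.

```python
import re

# Array header, e.g. "md127 : active raid1 sdc1[1] sda1[0]".
# \d+ accepts any device number, so md127 is treated like md0.
ARRAY_RE = re.compile(r"^(md\d+)\s*:\s*(\S+)")
# Member-state marker, e.g. "[UU]" (all up) or "[U_]" (one member missing).
STATE_RE = re.compile(r"\[([U_]+)\]")


def parse_mdstat(text):
    """Return one dict per md array found in /proc/mdstat-style text."""
    arrays = []
    for raw in text.splitlines():
        line = raw.strip()
        header = ARRAY_RE.match(line)
        if header:
            arrays.append({"device": header.group(1),
                           "active": header.group(2) == "active",
                           "state": None})
            continue
        state = STATE_RE.search(line)
        # Attach the first state marker seen after a header to that array.
        if state and arrays and arrays[-1]["state"] is None:
            arrays[-1]["state"] = state.group(1)
    return arrays


def report_lines(text):
    """Build "<color> <device> Status <label>" lines (labels are assumed)."""
    lines = []
    for array in parse_mdstat(text):
        if not array["active"] or array["state"] is None:
            color, label = "red", "UNKNOWN"
        elif "_" in array["state"]:
            color, label = "yellow", "DEGRADED"
        else:
            color, label = "green", "OK"
        lines.append(f"{color} {array['device']} Status {label}")
    return lines


# Hypothetical sample input; the md127 entry is invented for the test.
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdc1[1] sda1[0]
      511988 blocks super 1.0 [2/2] [UU]
md127 : active raid1 sdb1[1] sda1[0]
      1000 blocks super 1.2 [2/1] [U_]
"""

for line in report_lines(SAMPLE):
    print(line)
```

If a parser like this handles md127 but the old script does not, a hard-coded device pattern in the script would be worth checking.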
list Martin Lenko
Hi Walter,

the purple color means that the server hasn't received a status update for longer than the LIFETIME interval (depending on the configuration, usually 30 minutes). There are a number of reasons why you might get that for external tests:

- The external script is not executed. I would check whether the config file for the test contains the right paths to the script and log file, and whether the script is executable by the user under which xymon is running.
- The external script runs but fails, so it doesn't send the status message to the xymon server. Check the log file for any errors. If it is a shell script, you can print something to STDOUT at the beginning of the script just to make sure that it runs. If it doesn't write anything, check the permissions of the log file. Configuring a separate log file per external test helps to separate its messages from those of other scripts and of the xymon client itself.
- The external script doesn't contain the right path to the xymon executable (or bb if it is an older version from the times of hobbit), so it fails to send the status message.

If this doesn't help you find the issue, could you send the test config and script?

Regards,
Martin

On 31 August 2015 at 22:09, Walter Rutherford <user-6b85327bb8bb@xymon.invalid>
wrote:
Spoke too soon. Some of the systems actually have client.d/raid and they still aren't reporting. At least one didn't even have the directories. I guess that's one of the hazards of inheriting systems that were installed and/or modified by multiple people over time.
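Martin's checklist can be turned into a quick pre-flight check to run on each affected client. This is a minimal Python sketch under stated assumptions: the function names and the example paths are hypothetical, and the 30-minute default only mirrors the "usually 30 minutes" figure from his reply, not a value read from any Xymon configuration.

```python
import os
from datetime import datetime, timedelta


def is_purple(last_update, now, lifetime=timedelta(minutes=30)):
    """True when no status update arrived within the lifetime window."""
    return now - last_update > lifetime


def preflight(script_path, log_path):
    """Return a list of obvious problems; an empty list means none found."""
    problems = []
    if not os.path.isfile(script_path):
        problems.append(f"script not found: {script_path}")
    elif not os.access(script_path, os.X_OK):
        problems.append(f"script not executable by this user: {script_path}")

    if os.path.exists(log_path):
        if not os.access(log_path, os.W_OK):
            problems.append(f"log file not writable: {log_path}")
    else:
        # The log does not exist yet, so its directory must be writable.
        log_dir = os.path.dirname(log_path) or "."
        if not os.access(log_dir, os.W_OK):
            problems.append(f"log directory not writable: {log_dir}")
    return problems


# Hypothetical paths; substitute the ones from the test's config file.
for problem in preflight("/nonexistent/ext/raid.sh", "/nonexistent/logs/raid.log"):
    print(problem)
```

Run it as the same user the xymon client runs as, since the permission checks depend on who executes them.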