Xymon Mailing List Archive

Configuring Devmon for the first time

Buchan Milne
Wed, 1 Jun 2011 16:39:10 +0200
Message-Id: <user-c6fc2cdf7109@xymon.invalid>

On Tuesday, 31 May 2011 03:24:05 user-7cb0f5662626@xymon.invalid wrote:
> I've had issues with devmon not updating the bb-display and everything
> going purple.

Firstly, I don't think this is Josh's problem, as he didn't have a devmon process at all, whereas this behaviour is typically devmon hanging (with the process still running).

If you see behaviour different from the one I discuss below, please log a new tracker item.

The 'hang' issue is covered in this tracker item:

http://sourceforge.net/tracker/?func=detail&aid=2897345&group_id=160720&atid=816977

(Unfortunately, it was logged anonymously, and I have had no feedback on improvements in devmon svn for this issue, either via the tracker, or the mails on the mailing list)

Discussion of the issue also occurred on the devmon-support mailing list:

http://sourceforge.net/mailarchive/forum.php?thread_name=user-13d284bbdc54@xymon.invalid&forum_name=devmon-support

The status has not changed; my failure logs still end at:

[11-05-05 at 15:54:02] DEBUG: Printing single combo message size 13390
[11-05-05 at 15:54:02] DEBUG: Finished printing single combo message
[11-05-05 at 15:55:42] Fork 3 timed out waiting for data from parent: Timeout at /usr/share/devmon/modules/dm_snmp.pm line 516, <$__ANONIO__> line 30203.

The printing code is wrapped in an eval'd alarm subroutine which should return within 10 seconds, and log either that the printing had completed or that it had timed out. Neither message appears. Instead, the next entry comes 1 minute 40 seconds later (at 15:55:42), when a fork notices that it hasn't seen anything from the 'master' process within the poll period.
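
For anyone not familiar with the pattern, here is a rough shell analogue of "run something with a time limit and log which way it went". This is only a sketch and is not devmon's Perl code: run_with_timeout is a made-up name, and it assumes the GNU coreutils 'timeout' command (which exits with status 124 on a timeout) is installed, which may not be the case on older systems.

```shell
#!/bin/bash
# Hypothetical helper: run a command with a time limit and log the outcome.
# Assumes GNU coreutils 'timeout' is available (exit status 124 = timed out).
run_with_timeout() {
        local limit="$1"
        shift
        local rc=0
        timeout "$limit" "$@" || rc=$?
        if [ "$rc" -eq 124 ]
        then
                echo "Timed out after ${limit}s: $*"
        elif [ "$rc" -eq 0 ]
        then
                echo "Finished: $*"
        else
                echo "Failed with status $rc: $*"
        fi
        return "$rc"
}

# One of the two messages should always be logged within the limit.
run_with_timeout 10 true
run_with_timeout 1 sleep 3 || echo "Caller sees status $?"
```

The point is that one of the messages should always show up within the limit, which is exactly what does not happen in the failure log above.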

The question is, what should be done in this case? Should the forks attempt to kill the master devmon process?

Anyway, I would be grateful if someone could reproduce this on a different platform. I currently see this on RHEL5 x86_64 with perl-5.8.8-27.el5. Other environments have been green since 25 Jan (when they were upgraded to rev 214: http://devmon.svn.sourceforge.net/viewvc/devmon?view=revision&revision=214).

> I created a "devmon watchdog" script that runs every 5 min using lynx
> (text-based HTML browser), which checks the status of devmon (shows as dm
> test) on bb-monitor. If it's purple then I kill the devmon process and
> start it up again....band-aid solution, but it does the trick.

> I'm no script expert, but can share the bash script if you want/need.

Here is mine, but I am *not* going to add it to svn and the next release unless I have had some feedback on the changes intended to prevent this from occurring at all, preferably with the failure logs the script keeps.

I run mine from hobbitlaunch.cfg (the problematic box is still running 4.2.2 for now):

[devmon]
        ENVFILE /usr/lib64/hobbit/server/etc/hobbitserver.cfg
        CMD /usr/local/bin/restart-devmon-if-purple
        INTERVAL 1m
        LOGFILE /var/log/hobbit/devmon-restart.log

I have a sudo rule in place to allow the hobbit user to call 'sudo /etc/init.d/devmon stop'.


#!/bin/bash
if [ "$BB" == "" ]
then
        echo "This script must be run under a Hobbit or Xymon environment" >&2
        echo "e.g. by: bbcmd $0" >&2
        exit 1
fi
if [ "$BBDISPLAYS" != "" ]
then
        # Query only the first server in a comma-separated BBDISPLAYS list
        BBDISP=${BBDISPLAYS%%,*}
fi
COLOR=$($BB $BBDISP "hobbitdboard host=$HOSTNAME test=dm" | cut -d'|' -f3)

if [ "`id -u`" -eq 0 ]
then
        DEVMON="/etc/init.d/devmon"
        PKILL="pkill"
else
        DEVMON="sudo /etc/init.d/devmon"
        PKILL="sudo pkill"
fi

if [ "$COLOR" == "purple" ]
then
        LOGSAVE=/var/log/devmon/failures/devmon-failure-`date +%Y-%m-%d-%H:%M:%S`.log
        echo "Devmon is purple, saving last 200 lines of log to $LOGSAVE"
        tail -n200 /var/log/devmon/devmon.log > $LOGSAVE
        $DEVMON stop
        NUM=$(pgrep -u devmon|wc -l)
        if [ "$NUM" -ne 0 ]
        then
                echo "Devmon failed to stop cleanly, terminating manually"
                $PKILL -u devmon
                sleep 5
        fi
        NUM=$(pgrep -u devmon|wc -l)
        if [ "$NUM" -ne 0 ]
        then
                echo "Devmon failed to terminate cleanly, killing manually"
                $PKILL -9 -u devmon
        fi
        $DEVMON start
else
        [ "$DEBUG" == 1 ] && echo "Devmon isn't purple, it is $COLOR"
fi
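
To illustrate the colour check on its own, here is a small sketch of what the 'cut' call in the script does with a status line. status_color is a made-up helper name and the sample line is invented for illustration; it only assumes, as the script above does, that the colour is the third '|'-separated field.

```shell
#!/bin/bash
# Hypothetical helper: extract the colour (third '|'-separated field) from a
# status line, the same way the 'cut' call in the script above does.
status_color() {
        printf '%s\n' "$1" | cut -d'|' -f3
}

# Invented sample line for illustration only.
SAMPLE="myhost.example.com|dm|purple|rest-of-line"
COLOR=$(status_color "$SAMPLE")
echo "Colour is: $COLOR"

# An empty reply gives an empty colour, which the comparison against
# "purple" treats the same as any other non-purple colour.
EMPTY=$(status_color "")
[ -z "$EMPTY" ] && echo "No colour returned"
```

Note the second case: if the query returns nothing at all, the script takes the "isn't purple" branch and does not restart devmon.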


Regards,
Buchan