Xymon Mailing List Archive

Need help determining why alerts didn't come

Tom Callahan
Fri, 07 Nov 2008 10:26:36 -0500
Message-Id: <user-00509fe46c7a@xymon.invalid>

I've noticed an inability to correctly parse "df" output if you have long
device names (think device-mapper).

My solution was to change DF="df -k" in bbsys.local to DF="df -k -P" for
POSIX mode, which keeps each filesystem's entry on a single line.

Try that and see if it helps?
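
To illustrate why long device names can matter, here is a minimal Python
sketch. It is not Xymon's actual parser; parse_df is a hypothetical
one-line-per-filesystem parser, and both sample outputs (device name, sizes,
mount point) are made up for illustration. It only shows how a wrapped "df -k"
entry can be missed while the "df -k -P" form parses cleanly.

```python
# Hypothetical "df -k" output: the long device-mapper name pushes the
# numeric columns onto a second line.
WRAPPED = """\
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      10157368   9649500         0 100% /wls_domains
"""

# Hypothetical "df -k -P" output: one line per filesystem.
POSIX = """\
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/mapper/VolGroup00-LogVol00 10157368 9649500 0 100% /wls_domains
"""


def parse_df(text):
    """Naive parser (an assumption, not Xymon code).

    Expects six whitespace-separated fields per line and returns a dict
    mapping mount point to percent used.
    """
    usage = {}
    for line in text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 6:
            # Either half of a wrapped entry has too few fields, so the
            # whole filesystem is silently dropped.
            continue
        usage[fields[5]] = int(fields[4].rstrip("%"))
    return usage


if __name__ == "__main__":
    print(parse_df(WRAPPED))  # {}
    print(parse_df(POSIX))    # {'/wls_domains': 100}
```

If a parser behaves like this, the full filesystem never shows up in the disk
status at all, so no DISK threshold can fire for it.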


On 11/7/08 9:52 AM, "Bouchard, Brian" <user-4c1afba0ca37@xymon.invalid> wrote:
Hello Hobbit Gurus,
 
I am seeking help determining why we recently received only some alerts that
were configured on a given server.
 
 
In my hobbit-clients.cfg file I have multiple sections of relevance:
 
#######################################################
# generic checks for all WebLogic Servers
#######################################################
HOST= applesauce,gravy,enchilada,chips
        DISK    *       95 97
        PROC dsmcad 1 -1 yellow
        FILE "%/wls_domains/.*/jrockit..*.dump" NOEXIST red
#######################################################
# specific checks for applesauce
#######################################################
HOST=applesauce
       LOG  /var/log/messages "%(?-i)SERIOUS_CRITICAL" COLOR=yellow
       PROC "weblogic.Name=" 3 3 red TEXT=TOTAL_WEBLOGIC_PROCESSES
       PROC "weblogic.Name=prod_alsb_01" 1 1 red TEXT=PROD_ALSB_01
       PROC "weblogic.Name=prod_ccs_wli_01" 1 1 red TEXT=PROD_CCS_WLI_01
       PROC "weblogic.Name=prod_ccs_aldsp_01" 1 1 red TEXT=PROD_CCS_ALDSP_01
 
 
So, a couple of questions:
 
1)       Is it valid to have different alerts for the same HOST in
hobbit-clients.cfg like this?  It seemed to work in some instances, but I
should ask before moving forward...


2)       Yesterday, I received the alerts with TEXT=
"TOTAL_WEBLOGIC_PROCESSES" and "PROD_ALSB_01".  When I logged onto the server,
I found that the filesystem this process was running on was 100% used, which
caused the process to die.  I cleaned up a bunch of log files and restarted
the process, and all was good...  BUT... why didn't I receive the alert that
the DISK was more than 97% full?  I checked the history for the disk usage,
and it had been over 95% for at least 6 hours before the process went down.
Also, the check for the "jrockit" file did not fire when that file was created
(after the filesystem was at 100%).  I need to determine why we weren't warned
about the disk space issue before our production application came down.


3)       One other thing I noticed was that the IP address for this server was
incorrect in the bb-hosts file.  I assume that's an issue, but I'm not sure
why we got some expected alerts and not others.  Also, I updated this entry in
the bb-hosts file to the correct IP, and cycled the hobbit server, but I am
still not receiving the alert on the jrockit file, which is still out there.
 
Any help is appreciated.  I'm relatively new to Hobbit, so it's completely
within the realm of possibility that I don't have any of this set up
correctly. Please feel free to correct me on anything that looks out of whack.
 
- Brian