Xymon Mailing List Archive

alert/hostname loading

John Thurston
Mon, 14 Dec 2015 11:27:05 -0900
Message-Id: <user-d63fd8e65f76@xymon.invalid>

On 12/1/2015 12:03 PM, John Thurston wrote:
> On 12/1/2015 11:48 AM, J.C. Cleaver wrote:
> - snip -
>> Hmm. This seems to be fundamentally a different issue than the "hostdata
>> module going rogue" thing, which was about zombies never being picked up.
>>
>> AFAICT, somehow the hosts tree structure is getting clobbered as a result
>> of the drop (assuming all of those hosts are expected to be existing).
> - snip -
> I haven't yet found a way to induce this failure, so I haven't yet
> identified the minimal recovery steps. I'm working on it, though.

I think I might be able to reproduce the failure :) Start with the
following stable server arrangement:

+ x.bar.com is running xymon 4.3.22 on Solaris 10 SPARC
+ The following is defined in tasks.cfg:
   CMD xymond_channel --channel=page  --log=$XYMONSERVERLOGS/alert.log \
   xymond_alert --debug --checkpoint-file=$XYMONTMP/alert.chk \
   --checkpoint-interval=600
+ Host foo.bar.com is defined in DNS, does not permit ICMP traffic,
and has no xymon client installed on it

Throw a spanner in the works with the following actions:

+ Add host foo.bar.com to an existing page and group in hosts.cfg
+ ~/server/bin/xymoncmd ~/server/bin/xymonnet foo.bar.com

And see the trouble commence in alert.log:
6690 2015-12-14 10:52:06.859998 Got 415 bytes
6690 2015-12-14 10:52:06.860110 xymond_alert: Got message 95 @@page#95/foo.bar.com|1450122726.859873|10.10.10.55|foo.bar.com|conn|0.0.0.0|1450124526|red|none|1450122726|Page/Subpage|65234||||
6690 2015-12-14 10:52:06.860140 startpos 5659, fillpos 5659, endpos -1
6690 2015-12-14 10:52:06.860172 Got page message from foo.bar.com:conn
6690 2015-12-14 10:52:06.860249 Alert status changed from 0 to 1
6690 2015-12-14 10:52:06.860285 Checking criteria for host 'foo.bar.com', which is not defined
6690 2015-12-14 10:52:06.861674 Checking criteria for host 'foo.bar.com', which is not defined
6690 2015-12-14 10:52:06.861728 Checking criteria for host 'foo.bar.com', which is not defined
6690 2015-12-14 10:52:06.861761 Found no first matching rule
6690 2015-12-14 10:52:06.861813 No files modified, skipping reload of /opt/xymon/server/etc/alerts.cfg
6690 2015-12-14 10:52:06.861861 No files modified, skipping reload of /opt/xymon/server/etc/holidays.cfg
6690 2015-12-14 10:52:06.861891 Checking criteria for host 'zebra.bar.com', which is not defined

After killing the "xymond_channel --channel=page" process, a new one is
created as a child of xymonlaunch and everything behaves normally again.
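
When reading the debug output above, it can help to pull the host, test, and color out of the pipe-delimited page message. The following is a minimal Python sketch, not anything shipped with xymon. The function name is made up for illustration, and the field positions are inferred only from the log lines shown here (the "Got page message from foo.bar.com:conn" line confirms the host and test). The remaining fields are left unnamed because their meaning is not stated in this message.

```python
def parse_page_message(line):
    """Extract the fields identifiable from the alert.log excerpt above.

    Returns None if the line carries no '@@page' message or has too few fields.
    """
    marker = line.find("@@page")
    if marker < 0:
        return None
    fields = line[marker:].split("|")
    if len(fields) < 8:
        return None
    # fields[0] looks like '@@page#95/foo.bar.com'; the number is the sequence.
    header = fields[0]
    seq = None
    if "#" in header:
        seq = int(header.split("#", 1)[1].split("/", 1)[0])
    # Positions 3, 4 and 7 are inferred from the sample log line (assumption).
    return {"seq": seq, "host": fields[3], "test": fields[4], "color": fields[7]}
```

Fed the "Got message 95" line from the log, this yields the sequence number 95, host foo.bar.com, test conn, and color red.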

I currently have a tail on my alert.log to warn me of the appearance of
the string "which is not defined". When that appears, I know it is time
to HUP the "page" channel. This is a rather crude hammer to leave lying
on the table next to my production server, but it keeps us running :)
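
The tail-and-watch approach described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names and the `on_fault` hook are hypothetical, and the sketch deliberately does not send any signal itself. How the page channel gets restarted is left to whatever the caller supplies.

```python
import re

# Pattern taken from the alert.log lines quoted earlier in this message.
UNDEFINED_RE = re.compile(
    r"Checking criteria for host '([^']+)', which is not defined"
)


def undefined_hosts(lines):
    """Return hosts reported as 'not defined', in first-seen order, no repeats."""
    seen = []
    for line in lines:
        match = UNDEFINED_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def check_log(lines, on_fault):
    """Call on_fault(hosts) once if any undefined host appears.

    on_fault is a hypothetical hook; restarting the page channel would go
    there. Returns True when the fault string was seen.
    """
    hosts = undefined_hosts(lines)
    if hosts:
        on_fault(hosts)
    return bool(hosts)
```

For example, `check_log(open("alert.log"), print)` would print the list of affected hosts once for the lines currently in the file. The path here is a placeholder.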

I have a core file from the xymond_channel process, but its stack
contains only:

 feee041c _syscall6 (1, 1, 0, 1, 7d0, 3a0f4) + 20
 00013c90 _start   (0, 0, 0, 0, 0, 0) + 5c

I have a core file from the xymond_alert process, but its stack contains
only:

 feede7d8 __pollsys (ffbfcd50, 1, ffbfcdc0, 0, 0, 0) + 8
 fee79b8c pselect  (ffbfcd50, fef56790, fef56790, 40, ffbfcdc0, 0) + 1c8
 fee79f04 select   (1, ffbfce58, 0, 0, ffbfce48, ffbfced8) + a0
 00015fa4 get_xymond_message (4b400, 4b14c, 4b148, ffbfcf88, 4b16c, 35d50) + 270
 0003293c main     (1, 566f245d, 0, 33b00, 4b000, 33bb8) + 378
 00014a34 _start   (0, 0, 0, 0, 0, 0) + 5c

which is whatever it was happily processing when I killed it, not the
stack at the time it ended up at line 815 of loadalerts.c.

What can I do and what information can I gather which will help narrow
the fault domain?

-- 
    Do things because you should, not just because you can.

John Thurston    XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska