Xymon Mailing List Archive

xymon hostdata module going rogue

list John Thurston
Tue, 01 Dec 2015 08:14:20 -0900
Message-Id: <user-8864744db8f1@xymon.invalid>

How embarrassing. I was composing a note to mention a problem with the 
list archives not capturing all messages . . . when I discovered that 
the message for which I was searching was never sent to the list.

I composed the following message back in early October and then sent it 
only to myself :p  No wonder it didn't generate any chatter.

On 8/28/2015 3:12 PM, J.C. Cleaver wrote:
On Fri, August 28, 2015 3:16 pm, John Thurston wrote:
On 8/28/2015 12:45 PM, John Thurston wrote:
On 6/10/2015 9:01 AM, Scot Kreienkamp wrote:
I have a xymon server running 4.3.21 that seems to be accumulating
processes like these:

hobbit   28430  0.0  0.0      0     0 ?        Z    12:50   0:00 [xymond_hostdata] <defunct>
  . . .
It seemed related to drop messages . . .
Hey, I think I'm seeing the same thing on Solaris with 4.3.21

I've ended up here after a customer let me know that email alerts were
not working as expected. After a few hours of digging around, I decided
that the alert daemon was failing to retrieve hostnames and failing
miserably.

Have other people seen this behavior?
I have duplicated this behavior on another xymon server on Solaris. It
certainly looks like this behavior breaks the alert daemon. Fortunately,
I "drop" hosts in batches so can restart Xymon at that time, but this is
still pretty icky.

J.C., do you know if your patch made it into the code-base?

Has anyone else tested this patch? If so, on what operating systems?
This patch took care of the defunct/zombie processes on "drop" events, 
but I've just discovered that it does not solve the underlying problem. 
It still appears that xymond_hostdata does not behave correctly 
following a "drop" command. The effect is that alerts fail to be 
delivered for _some_ messages because hostnames can no longer be retrieved.
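
For anyone who wants to check whether defunct xymond_hostdata processes are 
piling up on their own server, here is a minimal sketch that picks them out 
of captured "ps" output. The function name defunct_pids is made up for 
illustration, and it assumes the PID is the second column, as in the ps line 
quoted above; adjust if your ps prints a different layout.

```python
def defunct_pids(ps_output, name="xymond_hostdata"):
    """Return PIDs of defunct processes whose command mentions `name`.

    Hypothetical helper: assumes the PID is the second whitespace-separated
    column, which matches the ps line quoted earlier in this thread.
    """
    pids = []
    for line in ps_output.splitlines():
        # Only lines flagged <defunct> and naming the module are of interest.
        if "<defunct>" not in line or name not in line:
            continue
        fields = line.split()
        if len(fields) > 1 and fields[1].isdigit():
            pids.append(int(fields[1]))
    return pids


# Sample taken from the ps line reported earlier in the thread.
sample = (
    "hobbit   28430  0.0  0.0      0     0 ?        Z    12:50   0:00 "
    "[xymond_hostdata] <defunct>"
)
print(defunct_pids(sample))
```

An empty list after a "drop" would suggest the zombie part of the problem is 
handled, even if the alert behavior described below is not.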

Example:

My xymon server is humming along. I have the alert module debug-logging 
to alerts.log.  Immediately after issuing a "drop" command of the sort:

#xymon localhost "drop foo.bar.com sslcert"

entries of the following sort appear in alerts.log. After this, some messages 
may result in alert emails being sent, but most quietly disappear.
Currently, my resolution is to "xymon.sh restart", but that is much too 
heavy-handed for long-term use.

21178 2015-10-05 16:39:43.257559 get_xymond_message: Interrupted
21178 2015-10-05 16:39:43.257624 No files modified, skipping reload of /opt/xymon/server/etc/alerts.cfg
21178 2015-10-05 16:39:43.257680 No files modified, skipping reload of /opt/xymon/server/etc/holidays.cfg
21178 2015-10-05 16:39:43.257718 Checking criteria for host 'doadrbjnu-sp.bar.com', which is not defined
21178 2015-10-05 16:39:43.257773 Found a first matching rule
21178 2015-10-05 16:39:43.257802 Checking criteria for host 'doadrbjnu-sp.bar.com', which is not defined
21178 2015-10-05 16:39:43.257830 Checking criteria for host 'doadrbjnu-sp.bar.com', which is not defined
21178 2015-10-05 16:39:43.257854 Found a first matching rule
21178 2015-10-05 16:39:43.257879 Checking criteria for host 'doadrbjnu-sp.bar.com', which is not defined
21178 2015-10-05 16:39:43.257910 Checking criteria for host 'steam.bar.com', which is not defined
21178 2015-10-05 16:39:43.257935 Found a first matching rule
21178 2015-10-05 16:39:43.257960 Checking criteria for host 'steam.bar.com', which is not defined
21178 2015-10-05 16:39:43.257986 Checking criteria for host 'steam.bar.com', which is not defined
21178 2015-10-05 16:39:43.258010 Found a first matching rule
21178 2015-10-05 16:39:43.258035 Checking criteria for host 'steam.bar.com', which is not defined
21178 2015-10-05 16:39:43.258061 Checking criteria for host 'upsjdc.bar.com', which is not defined
21178 2015-10-05 16:39:43.258088 Found a first matching rule
21178 2015-10-05 16:39:43.258113 Checking criteria for host 'upsjdc.bar.com', which is not defined
21178 2015-10-05 16:39:43.258140 Checking criteria for host 'upsjdc.bar.com', which is not defined
21178 2015-10-05 16:39:43.258164 Found a first matching rule
21178 2015-10-05 16:39:43.258188 Checking criteria for host 'upsjdc.bar.com', which is not defined
21178 2015-10-05 16:39:43.258211 0 alerts to go
21178 2015-10-05 16:39:43.258270 Want msg 5039, startpos 134769, fillpos 134769, endpos -1, usedbytes=0, bufleft=131470
21178 2015-10-05 16:39:47.962032 Got 2831 bytes
21178 2015-10-05 16:39:47.962143 xymond_alert: Got message 5039 @@page#5039/soajnuexhs1.bar.com|1444091987.961845|10.2.3.40|soajnuexhs1.bar.com|msgs|0.0.0.0|1444093787|red|red|1444088306|ETS/MsgDir|540754||||
21178 2015-10-05 16:39:47.962171 startpos 137600, fillpos 137600, endpos -1
21178 2015-10-05 16:39:47.962204 Got page message from soajnuexhs1.bar.com:msgs
21178 2015-10-05 16:39:47.962252 Want msg 5040, startpos 137600, fillpos 137600, endpos -1, usedbytes=0, bufleft=128639
21178 2015-10-05 16:39:58.022397 Got 297 bytes
21178 2015-10-05 16:39:58.022526 xymond_alert: Got message 5040 @@page#5040/doadofjdc-ea05p.bar.com|1444091998.022274|10.2.167.44|doadofjdc-ea05p.bar.com|msgs|0.0.0.0|1444093798|green|red|1444091998|DOA/IRIS|||||
21178 2015-10-05 16:39:58.022558 startpos 137897, fillpos 137897, endpos -1
21178 2015-10-05 16:39:58.022593 Got page message from doadofjdc-ea05p.bar.com:msgs
21178 2015-10-05 16:39:58.022630 Alert status changed from 1 to 0
21178 2015-10-05 16:39:58.022666 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022706 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022739 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022776 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022808 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022841 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022873 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022904 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022935 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022967 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022998 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023028 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023059 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023089 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023120 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023151 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023187 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023221 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023252 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023282 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023313 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023342 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023369 Found no first matching rule
21178 2015-10-05 16:39:58.023402 Want msg 5041, startpos 137897, fillpos 137897, endpos -1, usedbytes=0, bufleft=128342
21178 2015-10-05 16:40:10.109262 get_xymond_message: Returning NULL due to EOF
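
To get a quick picture of which hosts the alert module has lost track of, 
a rough sketch like the following can tally the "which is not defined" 
lines per host. The pattern is taken from the debug output above; the 
function name undefined_host_counts is only illustrative, and the sample 
lines are copied from this log rather than read from a real file.

```python
import re
from collections import Counter

# Pattern copied from the debug lines in the alerts.log excerpt above.
UNDEFINED_HOST = re.compile(
    r"Checking criteria for host '([^']+)', which is not defined"
)


def undefined_host_counts(lines):
    """Count 'which is not defined' debug lines per host name."""
    counts = Counter()
    for line in lines:
        match = UNDEFINED_HOST.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# A few lines lifted from the excerpt; in practice, pass an open file
# object for alerts.log instead of this list.
sample = [
    "21178 2015-10-05 16:39:43.257718 Checking criteria for host 'doadrbjnu-sp.bar.com', which is not defined",
    "21178 2015-10-05 16:39:43.257773 Found a first matching rule",
    "21178 2015-10-05 16:39:43.257802 Checking criteria for host 'doadrbjnu-sp.bar.com', which is not defined",
    "21178 2015-10-05 16:39:43.257910 Checking criteria for host 'steam.bar.com', which is not defined",
]

for host, count in undefined_host_counts(sample).most_common():
    print(host, count)
```

If hosts that are still present in hosts.cfg show up in this tally right 
after a "drop", that would be consistent with the behavior described above.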

-- 
    Do things because you should, not just because you can.

John Thurston    XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska