alert/hostname loading (was Re: xymon hostdata module going rogue)

list Japheth Cleaver
Tue, 1 Dec 2015 12:48:14 -0800
Message-Id: <user-d30b5302b317@xymon.invalid>


On Tue, December 1, 2015 9:14 am, John Thurston wrote:
How embarrassing. I was composing a note to mention a problem with the
list archives not capturing all messages . . . when I discovered that
the message for which I was searching was never sent to the list.

I composed the following message back in early October and then sent it
only to myself :p  No wonder it didn't generate any chatter.

On 8/28/2015 3:12 PM, J.C. Cleaver wrote:
On Fri, August 28, 2015 3:16 pm, John Thurston wrote:
On 8/28/2015 12:45 PM, John Thurston wrote:
On 6/10/2015 9:01 AM, Scot Kreienkamp wrote:
I have a xymon server running 4.3.21 that seems to be accumulating
processes like these:

hobbit   28430  0.0  0.0      0     0 ?        Z    12:50   0:00
[xymond_hostdata] <defunct>
  . . .
It seemed related to drop messages . . .
Hey, I think I'm seeing the same thing on Solaris with 4.3.21

I've ended up here after a customer let me know that email alerts were
not working as expected. After a few hours of digging around, I decided
that the alert daemon was failing to retrieve hostnames and failing
miserably.

Have other people seen this behavior?
I have duplicated this behavior on another xymon server on Solaris. It
certainly looks like this behavior breaks the alert daemon. Fortunately,
I "drop" hosts in batches so I can restart Xymon at that time, but this
is still pretty icky.

J.C., do you know if your patch made it into the code-base?

Has anyone else tested this patch? If so, on what operating systems?
This patch took care of the defunct/zombie processes on "drop" events,
but I've just discovered that it does not solve the underlying problem.
It still appears that xymond_hostdata does not behave correctly
following a "drop" command. The effect is that alerts fail to be
delivered for _some_ messages because hostnames can no longer be
retrieved.

Example:

My xymon server is humming along. I have the alert module debug-logging
to alerts.log.  Immediately after issuing a "drop" command of the sort:

#xymon localhost "drop foo.bar.com sslcert"

entries like the following appear in alerts.log. After this, some messages
may result in alert emails being sent, but most quietly disappear.
Currently, my resolution is to "xymon.sh restart" but that is much too
heavy-handed for long-term use.

21178 2015-10-05 16:39:43.257559 get_xymond_message: Interrupted
21178 2015-10-05 16:39:43.257624 No files modified, skipping reload of
/opt/xymon/server/etc/alerts.cfg
21178 2015-10-05 16:39:43.257680 No files modified, skipping reload of
/opt/xymon/server/etc/holidays.cfg
21178 2015-10-05 16:39:43.257718 Checking criteria for host
'doadrbjnu-sp.bar.com', which is not defined
21178 2015-10-05 16:39:43.257773 Found a first matching rule
21178 2015-10-05 16:39:43.257802 Checking criteria for host
'doadrbjnu-sp.bar.com', which is not defined
21178 2015-10-05 16:39:43.257830 Checking criteria for host
'doadrbjnu-sp.bar.com', which is not defined
21178 2015-10-05 16:39:43.257854 Found a first matching rule
21178 2015-10-05 16:39:43.257879 Checking criteria for host
'doadrbjnu-sp.bar.com', which is not defined
21178 2015-10-05 16:39:43.257910 Checking criteria for host
'steam.bar.com', which is not defined
21178 2015-10-05 16:39:43.257935 Found a first matching rule
21178 2015-10-05 16:39:43.257960 Checking criteria for host
'steam.bar.com', which is not defined
21178 2015-10-05 16:39:43.257986 Checking criteria for host
'steam.bar.com', which is not defined
21178 2015-10-05 16:39:43.258010 Found a first matching rule
21178 2015-10-05 16:39:43.258035 Checking criteria for host
'steam.bar.com', which is not defined
21178 2015-10-05 16:39:43.258061 Checking criteria for host
'upsjdc.bar.com', which is not defined
21178 2015-10-05 16:39:43.258088 Found a first matching rule
21178 2015-10-05 16:39:43.258113 Checking criteria for host
'upsjdc.bar.com', which is not defined
21178 2015-10-05 16:39:43.258140 Checking criteria for host
'upsjdc.bar.com', which is not defined
21178 2015-10-05 16:39:43.258164 Found a first matching rule
21178 2015-10-05 16:39:43.258188 Checking criteria for host
'upsjdc.bar.com', which is not defined
21178 2015-10-05 16:39:43.258211 0 alerts to go
21178 2015-10-05 16:39:43.258270 Want msg 5039, startpos 134769, fillpos
134769, endpos -1, usedbytes=0, bufleft=131470
21178 2015-10-05 16:39:47.962032 Got 2831 bytes
21178 2015-10-05 16:39:47.962143 xymond_alert: Got message 5039
@@page#5039/soajnuexhs1.bar.com|1444091987.961845|10.2.3.40|soajnuexhs1.bar.com|msgs|0.0.0.0|1444093787|red|red|1444088306|ETS/MsgDir|540754||||
21178 2015-10-05 16:39:47.962171 startpos 137600, fillpos 137600, endpos
-1
21178 2015-10-05 16:39:47.962204 Got page message from
soajnuexhs1.bar.com:msgs
21178 2015-10-05 16:39:47.962252 Want msg 5040, startpos 137600, fillpos
137600, endpos -1, usedbytes=0, bufleft=128639
21178 2015-10-05 16:39:58.022397 Got 297 bytes
21178 2015-10-05 16:39:58.022526 xymond_alert: Got message 5040
@@page#5040/doadofjdc-ea05p.bar.com|1444091998.022274|10.2.167.44|doadofjdc-ea05p.bar.com|msgs|0.0.0.0|1444093798|green|red|1444091998|DOA/IRIS|||||
21178 2015-10-05 16:39:58.022558 startpos 137897, fillpos 137897, endpos
-1
21178 2015-10-05 16:39:58.022593 Got page message from
doadofjdc-ea05p.bar.com:msgs
21178 2015-10-05 16:39:58.022630 Alert status changed from 1 to 0
21178 2015-10-05 16:39:58.022666 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022706 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022739 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022776 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022808 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022841 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022873 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022904 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022935 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022967 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022998 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023028 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023059 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023089 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023120 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023151 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023187 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023221 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023252 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023282 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023313 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023342 Checking criteria for host
'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023369 Found no first matching rule
21178 2015-10-05 16:39:58.023402 Want msg 5041, startpos 137897, fillpos
137897, endpos -1, usedbytes=0, bufleft=128342
21178 2015-10-05 16:40:10.109262 get_xymond_message: Returning NULL due
to EOF
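
To see which hosts xymond_alert can no longer resolve, the "which is not
defined" lines in a debug log like the one above can be tallied per
hostname. The following Python is a minimal sketch under assumptions: the
function name tally_undefined_hosts is hypothetical, the line format is
taken only from the excerpt above, and wrapped lines are rejoined on the
assumption that every real log line starts with a PID and a date.

```python
import re
from collections import Counter

# Pattern copied from the debug output above; other Xymon versions may word it differently.
UNDEFINED_RE = re.compile(r"Checking criteria for host '([^']+)', which is not defined")

# Assumption: a genuine log line starts with "<pid> <YYYY-MM-DD> ".
LINE_START_RE = re.compile(r"^\d+ \d{4}-\d{2}-\d{2} ")


def tally_undefined_hosts(log_text):
    """Return a Counter mapping hostname -> number of 'not defined' lines."""
    joined = []
    for line in log_text.splitlines():
        if LINE_START_RE.match(line) or not joined:
            joined.append(line)
        else:
            # Continuation of a line that was wrapped (e.g. by a mail client).
            joined[-1] += " " + line
    counts = Counter()
    for line in joined:
        for match in UNDEFINED_RE.finditer(line):
            counts[match.group(1)] += 1
    return counts
```

Passing the contents of alerts.log to tally_undefined_hosts() and printing
the result of most_common() shows whether every host is affected after a
"drop" or only a subset.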

Hmm. This seems to be fundamentally a different issue than the "hostdata
module going rogue" thing, which was about zombies never being picked up.

AFAICT, somehow the hosts tree structure is getting clobbered as a result
of the drop (assuming all of those hosts are expected to exist).
There were a few patches for things in xymond.c at one point, and more
error checking when going to POSIX btrees generally, but I hadn't
encountered this in other intermittent hostlist readers.

1) Which version of Solaris is this?
2) Have you experienced this in other workers for xymon? (i.e.,
xymond_client not being able to look up hostnames after a drop, which
would probably lead to random purples)
3) Does issuing a "reload" command or -HUP to xymond_alert re-sync things?


-jc