Xymon Mailing List Archive

hostname retrieval is broken after adding a host

7 messages in this thread

John Thurston · Mon, 01 Feb 2016 10:10:58 -0900
This defect is getting to be a serious problem in my production Xymon.

For specifics, see my notes of 20151201 and 20151214 :
http://lists.xymon.com/pipermail/xymon/2015-December/042712.html
http://lists.xymon.com/pipermail/xymon/2015-December/042787.html

In general, adding a host to hosts.cfg corrupts the in-memory list of valid hosts. This causes other worker processes (specifically "alert") to fail. It doesn't fail _completely_. Some alerts continue to be sent, but there are footprints in the logs. I have a script watching for these footprints. When seen, I kill the "xymond_channel --channel=page" process, a new one is started, and business continues.
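
A watchdog of the kind described above might look roughly like the sketch below. It is an illustration only, not the script actually in use: the footprint pattern is a placeholder (the exact log text is not quoted here), and restarting via `pkill -f` is an assumption about how the process gets killed.

```python
import re
import subprocess

# Hypothetical footprint: the exact log text is not given in the thread,
# so this pattern is a placeholder to be replaced with the real one.
FOOTPRINT = re.compile(r"host .* not defined", re.IGNORECASE)


def has_footprint(lines, pattern=FOOTPRINT):
    """Return True if any log line matches the footprint pattern."""
    return any(pattern.search(line) for line in lines)


def restart_page_channel(run=subprocess.run):
    """Kill the page channel worker so that a new one gets started.

    Assumes a pkill that supports -f (match against the full command line).
    """
    return run(["pkill", "-f", "xymond_channel --channel=page"], check=False)


def check_log(path, run=subprocess.run):
    """Scan one log file and restart the page channel if a footprint is seen."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        if has_footprint(handle):
            restart_page_channel(run)
            return True
    return False
```

As written, `check_log` rescans the whole file on every call, so a real script would also need to remember how far it has already read to avoid restarting repeatedly on the same footprint.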

I need to squash this bug.

Is there a way to interactively run a worker process and have it hit the in-memory table of hostnames?

If not, is there a way to spill the in-memory table of hostnames without using a debugger?

Can anyone tell me which worker processes use the in-memory host list?
-- 
    Do things because you should, not just because you can.

John Thurston    XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska
Japheth Cleaver · Mon, 1 Feb 2016 15:41:33 -0800
Hi,

Actually, I think I must have missed your final response on this at
http://lists.xymon.com/pipermail/xymon/2015-December/042787.html ; my
apologies.

On what's happening, I think this might be a side-effect of
https://sourceforge.net/p/xymon/code/7651/ , which added a dummy record
for the purposes of command-line --test functionality when the host
doesn't exist. For an incoming unknown host (from xymond_alert's
perspective), the same path is being executed.

The problem is that localhostinfo re-initializes the hostlist, which would
almost certainly cause problems somewhat similar to what you're describing
here. The attached patch should fix that (by only doing it if we're in
test mode). The only other place this is used is in xymond_client when
it's itself running in --local mode, in which case it doesn't have a
pre-existing tree to get corrupted and then exits immediately anyway.
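
The failure mode and the guard described above can be modelled in a few lines. This is a toy model under stated assumptions, not Xymon's C code: a module-level dict stands in for the shared host list, and the function names (`load_hosts`, `dummy_record_buggy`, `dummy_record_guarded`, `lookup`) are hypothetical.

```python
# Toy model only: a module-level dict stands in for the shared host list.
# The function names follow the discussion but are not Xymon's real C API.
HOSTS = {}


def load_hosts(names):
    """Replace the in-memory host list with the given names."""
    HOSTS.clear()
    HOSTS.update({name: {"name": name} for name in names})


def dummy_record_buggy(name):
    """Create a placeholder record by re-initializing the whole list (the bug)."""
    load_hosts([name])          # every previously loaded host is lost here
    return HOSTS[name]


def dummy_record_guarded(name, test_mode):
    """Only build the placeholder when running in --test mode."""
    if not test_mode:
        return None             # leave the live host list untouched
    return dummy_record_buggy(name)


def lookup(name, test_mode=False, guarded=True):
    """Look a host up, falling back to a dummy record for unknown hosts."""
    if name in HOSTS:
        return HOSTS[name]
    if guarded:
        return dummy_record_guarded(name, test_mode)
    return dummy_record_buggy(name)
```

With `guarded=False`, a single lookup of an unknown host wipes every other entry, which is the kind of corruption a long-running worker would then trip over on later messages.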


This really calls for a re-factoring around host loading, but I'm leery of
too much direct modification in 4.3, this probably being caused by that
recent code.

Can you give it a test and let us know the result?


Regards,
-jc
John Thurston · Mon, 01 Feb 2016 15:59:25 -0900
quoted from Japheth Cleaver
On 2/1/2016 2:41 PM, J.C. Cleaver wrote:
Hi,

Actually, I think I must have missed your final response on this at
http://lists.xymon.com/pipermail/xymon/2015-December/042787.html ; my
apologies.

On what's happening, I think this might be a side-effect of
https://sourceforge.net/p/xymon/code/7651/ , which added a dummy record
for the purposes of command-line --test functionality when the host
doesn't exist. For an incoming unknown host (from xymond_alert's
perspective), the same path is being executed.
I've applied the patch to my non-production server and performed my failure-reproduction steps. The behavior is certainly better. The alert process is no longer tanking for every message received :)

What I do get, for a newly added host, is "Checking criteria for host 'foo.bar.com', which is not defined. Will not alert until hostlist reload."  This happens following all subsequent runs of xymonnet.

Is there anything which will trigger a hostlist reload?

Is there a tidy way to manually reload the list?

It doesn't seem to happen until I kill the "xymond_channel --channel=page" process. This seems like a hamfisted thing to do after every edit of hosts.cfg :(

Related question:

If this is in main code, and not some odd-ball null/EOF/posix problem (as has often tripped up my Solaris systems in the recent past), why am I the only one seeing this failure? Why aren't the folks running linux having their alerts fail?

-- 
    Do things because you should, not just because you can.

John Thurston    XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska
Japheth Cleaver · Tue, 2 Feb 2016 06:42:37 -0800
quoted from John Thurston
On Mon, February 1, 2016 4:59 pm, John Thurston wrote:
On 2/1/2016 2:41 PM, J.C. Cleaver wrote:
Hi,

Actually, I think I must have missed your final response on this at
http://lists.xymon.com/pipermail/xymon/2015-December/042787.html ; my
apologies.

On what's happening, I think this might be a side-effect of
https://sourceforge.net/p/xymon/code/7651/ , which added a dummy record
for the purposes of command-line --test functionality when the host
doesn't exist. For an incoming unknown host (from xymond_alert's
perspective), the same path is being executed.
I've applied the patch to my non-production server and performed my
failure-reproduction steps. The behavior is certainly better. The alert
process is no longer tanking for every message received :)

What I do get, for a newly added host, is "Checking criteria for host
'foo.bar.com', which is not defined. Will not alert until hostlist
reload."  This happens following all subsequent runs of xymonnet.

Is there anything which will trigger a hostlist reload?

Is there a tidy way to manually reload the list?

It doesn't seem to happen until I kill the "xymond_channel
--channel=page" process. This seems like a hamfisted thing to do after
every edit of hosts.cfg :(

Related question:

If this is in main code, and not some odd-ball null/EOF/posix problem
(as has often tripped up my Solaris systems in the recent past), why am
I the only one seeing this failure? Why aren't the folks running linux
having their alerts fail?
This one took me quite a while to figure out, mainly because I was looking
at the wrong code base for a while.

It turns out the host info record here is *only* used for display groups
and holiday lookups (probably rarely used), within the context of
alerting. In all other cases, its absence from the hostlist doesn't affect
the application of alert rules, since all the needed info is coming in via
the '@@page' message itself. The patch should be updated to let those come
straight through instead of exiting out if it doesn't see the host.
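
The pass-through behaviour described above can be sketched as follows. This is a simplified model rather than the actual patch: rules and page messages are plain dicts, the key names are hypothetical, holiday lookups are left out, and skipping a display-group rule when no host record exists is an assumption about the desired behaviour.

```python
def rule_matches(rule, page, host_record):
    """Decide whether an alert rule applies to an incoming page message.

    rule and page are plain dicts in this sketch; host_record is None when
    the host is not (yet) in the in-memory host list.
    """
    # Host and test name arrive with the page message itself.
    if rule.get("host") not in (None, page["host"]):
        return False
    if rule.get("test") not in (None, page["test"]):
        return False
    # Only the display-group criterion needs the host record.
    wanted_group = rule.get("display_group")
    if wanted_group is not None:
        if host_record is None:
            return False        # cannot evaluate; skip just this rule
        return host_record.get("display_group") == wanted_group
    return True
```

Rules that rely only on data carried in the page message keep working for a host missing from the list; only rules that need the host record are affected.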


My confusion came from a different issue: xymond_alert actually never
reloads the hosts config at all! I found/fixed this back in Sept '14 in
the RPMs but it wasn't applied into 4.3 back then.

I'd been living with that code for so long I forgot that that reload
wasn't needed here -- and, obviously, alerts have been working *in
general*... (We only noticed the lack of reload because we were dependent
on a dynamic value in the hosts.cfg line coming through to the alert
script via XMH_RAW in updated form.)

xymond_alert reloading was put into 4.4 at
https://sourceforge.net/p/xymon/code/7776/ among the patch bursts, but the
live host add issue has probably been in since this release. There are a
few takeaways from this... but this needs to be fixed in 4.3 (among
several other incoming issues that are pending confirmation).


Can you please check the included two patches? One is an update for the
previous one, which passes the alert check through (only adding the dummy
record in --test mode to begin with), the other adds hosts.cfg reloading
on intervals or on demand. It's based on the 4.4 version, but with only a
small change. I'd like to add both, as I can't see any drawback to
reloading hosts.cfg from xymond_alert's perspective, but the first may be
sufficient to get back to the status quo.


Regards,

-jc
Ryan Novosielski · Tue, 2 Feb 2016 10:36:51 -0500
Possibly worth chiming in here that I use the holidays list, in case anyone is thinking of "simplifying" the code. :-)

--
____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
|| \\UTGERS      |---------------------*O*---------------------
||_// Biomedical | Ryan Novosielski - Senior Technologist
|| \\ and Health | user-46c89e614701@xymon.invalid - 973/972.0922 (2x0922)
||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
    `'

On Feb 2, 2016, at 09:42, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:

On Mon, February 1, 2016 4:59 pm, John Thurston wrote:
On 2/1/2016 2:41 PM, J.C. Cleaver wrote:
Hi,

Actually, I think I must have missed your final response on this at
http://lists.xymon.com/pipermail/xymon/2015-December/042787.html ; my
apologies.

On what's happening, I think this might be a side-effect of
https://sourceforge.net/p/xymon/code/7651/ , which added a dummy record
for the purposes of command-line --test functionality when the host
doesn't exist. For an incoming unknown host (from xymond_alert's
perspective), the same path is being executed.

I've applied the patch to my non-production server and performed my
failure-reproduction steps. The behavior is certainly better. The alert
process is no longer tanking for every message received :)

What I do get, for a newly added host, is "Checking criteria for host
'foo.bar.com', which is not defined. Will not alert until hostlist
reload."  This happens following all subsequent runs of xymonnet.

Is there anything which will trigger a hostlist reload?

Is there a tidy way to manually reload the list?

It doesn't seem to happen until I kill the "xymond_channel
--channel=page" process. This seems like a hamfisted thing to do after
every edit of hosts.cfg :(

Related question:

If this is in main code, and not some odd-ball null/EOF/posix problem
(as has often tripped up my Solaris systems in the recent past), why am
I the only one seeing this failure? Why aren't the folks running linux
having their alerts fail?


This one took me quite a while to figure out, mainly because I was looking
at the wrong code base for a while.

It turns out the host info record here is *only* used for display groups
and holiday lookups (probably rarely used), within the context of
alerting. In all other cases, its absence from the hostlist doesn't affect
the application of alert rules, since all the needed info is coming in via
the '@@page' message itself. The patch should be updated to let those come
straight through instead of exiting out if it doesn't see the host.


My confusion came from a different issue: xymond_alert actually never
reloads the hosts config at all! I found/fixed this back in Sept '14 in
the RPMs but it wasn't applied into 4.3 back then.

I'd been living with that code for so long I forgot that that reload
wasn't needed here -- and, obviously, alerts have been working *in
general*... (We only noticed the lack of reload because we were dependent
on a dynamic value in the hosts.cfg line coming through to the alert
script via XMH_RAW in updated form.)

xymond_alert reloading was put into 4.4 at
https://sourceforge.net/p/xymon/code/7776/ among the patch bursts, but the
live host add issue has probably been in since this release. There are a
few takeaways from this... but this needs to be fixed in 4.3 (among
several other incoming issues that are pending confirmation).


Can you please check the included two patches? One is an update for the
previous one, which passes the alert check through (only adding the dummy
record in --test mode to begin with), the other adds hosts.cfg reloading
on intervals or on demand. It's based on the 4.4 version, but with only a
small change. I'd like to add both, as I can't see any drawback to
reloading hosts.cfg from xymond_alert's perspective, but the first may be
sufficient to get back to the status quo.


Regards,

-jc

<localalertmode-2.patch>
<reloadalert.patch>
John Thurston · Tue, 02 Feb 2016 11:24:59 -0900
On 2/2/2016 5:42 AM, J.C. Cleaver wrote:
On Mon, February 1, 2016 4:59 pm, John Thurston wrote:
- snip -
. . . why am
I the only one seeing this failure? Why aren't the folks running linux
having their alerts fail?
  - snip -
It turns out the host info record here is *only* used for display groups
and holiday lookups (probably rarely used), within the context of
alerting.
And I suspect I am one of the few people using display groups to drive my alerting. I resisted defining alert groups back in the BB days because it seemed like too much work. When I moved to Xymon and I could leverage my existing display groups, I jumped on board.

- snip -
Can you please check the included two patches? One is an update for the
previous one, which passes the alert check through (only adding the dummy
record in --test mode to begin with), the other adds hosts.cfg reloading
on intervals or on demand.
With these patches, my non-production server running 4.3.22 on Solaris 10 is running much better. This is very encouraging :)

Looking at the patch files and reading the new source, am I correct it adds a couple of startup options to xymond_alert?
   --reload-interval=number-of-seconds
   --loadhostsfromxymond
where the first specifies the number of seconds after which the contents of hosts.cfg should be reloaded, and the second says hosts.cfg could be retrieved from xymond rather than the filesystem (similar to the existing option for xymongen).

-- 
    Do things because you should, not just because you can.

John Thurston    XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska
Japheth Cleaver · Tue, 2 Feb 2016 13:13:11 -0800
quoted from John Thurston
On Tue, February 2, 2016 12:24 pm, John Thurston wrote:
On 2/2/2016 5:42 AM, J.C. Cleaver wrote:
On Mon, February 1, 2016 4:59 pm, John Thurston wrote:
- snip -
. . . why am
I the only one seeing this failure? Why aren't the folks running linux
having their alerts fail?
  - snip -
It turns out the host info record here is *only* used for display groups
and holiday lookups (probably rarely used), within the context of
alerting.
And I suspect I am one of the few people using display groups to drive
my alerting. I resisted defining alert groups back in the BB days
because it seemed like too much work. When I moved to Xymon and I could
leverage my existing display groups, I jumped on board.
Ahh, yes, this would definitely have affected this then...

Can you please check the included two patches? One is an update for the
previous one, which passes the alert check through (only adding the dummy
record in --test mode to begin with), the other adds hosts.cfg reloading
on intervals or on demand.
With these patches, my non-production server running 4.3.22 on Solaris
10 is running much better. This is very encouraging :)
Indeed! This is slated for RC2 now.

Looking at the patch files and reading the new source, am I correct it
adds a couple of startup options to xymond_alert?
   --reload-interval=number-of-seconds
   --loadhostsfromxymond
where the first specifies the number of seconds after which the contents
of hosts.cfg should be reloaded, and the second says hosts.cfg could be
retrieved from xymond rather than the filesystem (similar to the
existing option for xymongen).

Correct. Easier to grok in both cases. Actually, all long-running
processes that manipulate hosts in something other than a textual way
should be periodically reloading, from whichever source they're using.
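
The periodic-reload idea can be illustrated with a small sketch. It is not taken from the patch: the class name `ReloadingHostList` and the `loader` callable are hypothetical, with `loader` standing in for whichever source is in use (the hosts.cfg file or xymond) and `reload_interval` playing the role of the `--reload-interval` value.

```python
import time


class ReloadingHostList:
    """Keep a host list fresh by reloading it when it is older than a limit."""

    def __init__(self, loader, reload_interval, clock=time.monotonic):
        self._loader = loader              # callable returning host names
        self._interval = reload_interval   # seconds between reloads
        self._clock = clock
        self._hosts = set(loader())
        self._loaded_at = clock()

    def reload(self):
        """Reload unconditionally (the 'on demand' case)."""
        self._hosts = set(self._loader())
        self._loaded_at = self._clock()

    def knows(self, name):
        """Check membership, reloading first if the interval has elapsed."""
        if self._clock() - self._loaded_at >= self._interval:
            self.reload()
        return name in self._hosts
```

Checking the age of the list on each lookup keeps the sketch free of timers and threads; a host added to the source becomes visible no later than one interval after the previous load.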


Regards,

-jc