On 12/1/2015 11:48 AM, J.C. Cleaver wrote:
- snip -
> Hmm. This seems to be fundamentally a different issue than the "hostdata
> module going rogue" thing, which was about zombies never being picked up.
> AFAICT, somehow the hosts tree structure is getting clobbered as a result
> of the drop (assuming all of those hosts are expected to exist).
See my later message for its relation to 'drop' activity.
> There were a few patches for things in xymond.c at one point, and more
> error checking when going to POSIX btrees generally, but I hadn't
> encountered this in other intermittent hostlist readers.
> 1) Which version of Solaris is this?
Solaris 10, most recent update, SPARC
> 2) Have you experienced this in other workers for xymon? (i.e.,
> xymond_client not being able to look up hostnames after a drop -- would
> probably lead to random purples)
I haven't seen behavior like that with other worker processes.
Is there a way to interactively run a worker process and have it hit the
daemon process for the hostnames?
Aside from making the process dump core, is there a way to get the
daemon to spill its current list of hostnames?
> 3) Does issuing a "reload" command or -HUP to xymond_alert re-sync things?
I didn't do a 'reload', but I killed the "xymond_channel --channel=page
--log=/var/log/xymon/alert.log xymond_alert" process and alerts started
working again.
I haven't yet found a way to induce this failure, so I haven't yet
identified the minimal recovery steps. I'm working on it, though.
--
Do things because you should, not just because you can.
John Thurston XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska