On Tue, December 1, 2015 1:41 pm, John Thurston wrote:
On 12/1/2015 11:51 AM, J.C. Cleaver wrote:
On Tue, December 1, 2015 9:32 am, John Thurston wrote:
*snip*
In this occurrence, it does not appear to be related to a "drop"
message. My last recorded "drop" was at 20151103-0846 and the alert
process didn't start logging "which is not defined" until 20151120-0007
Hmm. Okay, that does change things slightly. Fortunately, that means it's
probably specifically caused by drops per se. Were there any other errors
that occurred with other components around this time?
I have several instances of "Oversize status msg from " in the
xymond.log, but those are appearing six hours before the bad behavior
appeared in xymon_alert. I have difficulty believing they are related.
Ack. Yeah, that should have been 'NOT specifically' :)
Perhaps the system
being low enough on memory that some re-allocations might have failed?
I think this is unlikely. The system has 256GB of RAM, and there are no
memory caps placed on the non-global zone in which xymon is running. I
don't have information on its size on Nov 20, but today it is using about
400MB of RAM. All of the zones on the system are consuming less than
10GB of the 256GB and it wouldn't have been significantly different a
few weeks ago.
I've been doing some 'drops' today to try to break it, but haven't
succeeded. I'll continue to beat on it and see if I can find a
repeatable failure scenario.
fwiw, this is under 4.3.22
Hmm.
This is an area where it's possible that glibc/NULL issues might be
causing subtle things too. I could easily see the btree getting hosed by
tree re-insertion of a key we weren't really expecting.
-jc