On Wed, June 15, 2016 9:30 am, Christoph Berg wrote:
Re: Axel Beckert 2016-06-15 <user-0f042641758d@xymon.invalid>
In the past few months I found more and more indications of a strange bug
in (at least) Xymon 4.3.27 which occasionally mixes up hosts when
handling reports:
* Machines with a single disk (e.g. VMs) occasionally report the status of
a "raid" test which is not deployed to them -- and then (for obvious
reasons) go purple on it. On that server, there's only one machine
having a RAID, but its "raid" reports have been misassigned to at
least three other hosts, all hosts which have rather many tests
(compared to a bunch of sensors which send in only very few tests
per host).
[...]
Fwiw, I've seen instances of such behavior ever since I've started
taking care of a hobbit installation at a customer site in late 2007.
Symptoms are randomly mixed up hosts. I can't say if there are tests
that are hit more than others; the problem is mostly visible through
disk tests, by finding rrd files on disk for partitions that do not
exist on this host.
It doesn't seem to happen constantly, but rather in bursts, though I
don't have hard data on that. My impression was that it only happens
during busy periods, but that could be totally wrong.
We were on 4.3.0 for a long time before finally upgrading about two
years ago, and I thought the problem was gone then, but what Axel is
describing is exactly what we were (are?) seeing there.
Christoph
In some cases, I've seen this and tracked it down to malformed messages
resulting from incomplete client reports. Unfortunately, I wasn't able to
track down all of them that way, but many correlated with periods of
intense load.
The client message (well, all messages, really, but client messages might
be more noticeable since they're the largest on a plain system) doesn't
have an EOM indicator, so it's impossible to see if something's gotten
truncated.
This will be solved in V5 style messages (which have a size indicator) or
when combining into an extcombo.
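
To illustrate why a size indicator helps, here is a minimal Python sketch of
length-prefixed framing. It is not the actual V5 wire format; the header
layout (decimal byte count plus newline) and the names frame/unframe are
made up for illustration only.

```python
def frame(payload: bytes) -> bytes:
    """Prefix the payload with its byte count on a header line.

    Hypothetical format for illustration, not Xymon's V5 message layout.
    """
    return str(len(payload)).encode("ascii") + b"\n" + payload


def unframe(data: bytes) -> bytes:
    """Return the payload, or raise ValueError if it does not match the header."""
    header, sep, body = data.partition(b"\n")
    if not sep or not header.isdigit():
        raise ValueError("missing or malformed size header")
    expected = int(header)
    if len(body) != expected:
        # With a declared size, the receiver can tell truncation from a short message.
        raise ValueError(f"size mismatch: expected {expected} bytes, got {len(body)}")
    return body
```

Without the declared size, a receiver has no way to tell a message that was
cut short from one that was simply short to begin with.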
One work-around is to add --filter=\[clock\] to the xymond_channel
invocation for the client channel:
xymond_channel --channel=client --filter=\[clock\] xymond_client (etc)
This will block partial client messages from getting further into xymond
when they happen, at the expense of some increased CPU load on
xymond_channel, with potential back-pressure into xymond if the message
load is high enough.
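
The idea behind the filter can be sketched in a few lines of Python. This is
only an illustration, not how xymond_channel implements it; the function name
passes_filter and the sample messages are made up, and it assumes the [clock]
section sits near the end of a client report, so a truncated report usually
lacks it.

```python
import re

# Same pattern as in --filter=\[clock\]: a literal "[clock]" section header.
CLOCK_SECTION = re.compile(r"\[clock\]")


def passes_filter(message: str) -> bool:
    """Return True if the message contains a [clock] section.

    Assumption: [clock] comes late in a client report, so its absence
    suggests the report was cut off before it arrived.
    """
    return CLOCK_SECTION.search(message) is not None


# Made-up sample messages, for illustration only.
complete = "client host1.linux\n[df]\n/ 10%\n[clock]\nepoch: 1465975800\n"
truncated = "client host1.linux\n[df]\n/ 1"

for msg in (complete, truncated):
    print("forward" if passes_filter(msg) else "drop")
```

Note that this only catches reports cut off before the [clock] section; one
truncated after it would still get through.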
Of course, not having truncated messages in the first place would be nice :)
HTH,
-jc