Xymon server (4.3.27) occasionally assigns reports to the wrong host

4 messages in this thread

list Axel Beckert · Wed, 15 Jun 2016 17:58:16 +0200 ·

Hi,

in the past few months I found more and more indices for a strange bug
in (at least) Xymon 4.3.27 which occasionally mixes up hosts when
handling reports:

One of our two active Xymon servers occasionally assigns test reports
to the wrong hosts, also causing false positives in alerting.
(Actually I had the notion that this is happening for quite a while
now but we today found a false positve caused by that and could nail
down the real origin of the report to be clearly from a different
host.)

Not sure where I should start with describing the details, so here's
just a bunch of facts:

* Happens at least for the tests "disk", "ports", "cpu" and our
  self-written test "raid", but maybe more.

* Some of the wrong status pages show "Status message received from
  127.0.0.1" despite they should have come from some client IP address
  ("raid" test), others show the expected IP of the host for which the
  test is said to belong as sender ("disk" test).

* Machines with a single disk (e.g. VMs) occassional report status of
  a "raid" test which is not deployed to them -- and then (for obvious
  reasons) went purple on it. On that server, there's only one machine
  in having a RAID, but its "raid" reports have been misassigned to at
  least three other hosts, all host which have rather many tests
  (compared to a bunch of sensors which send in only very few tests
  per host).

* All the so far noticed wrongly assigned raid reports were
  originating from the same machine and all the so far noticed wrongly
  assigned disk reports were originating from a different, but always
  the same machine.

* For about a month one machine's service status page (as well as
  non-green.html) occassionally (approx. 1-3 times a day) show the
  disk report for another machine. At least all the cases we looked at
  due to different "yellow" thresholds for disk usage levels were
  coming from the same other machine.

  Examples take from the history of the "disk" test of the same host:

  + An expected, "green" report:

    Wed Jun 15 17:39:49 CEST 2016 - Filesystems ok

    Filesystem     1024-blocks    Used Available Capacity Mounted on
    udev               2013012       0   2013012       0% /dev
    tmpfs               405736   46412    359324      12% /run
    /dev/sda1         75620456 2599176  69942852       4% /
    tmpfs              2028668       0   2028668       0% /dev/shm
    tmpfs                 5120       4      5116       1% /run/lock
    tmpfs              2028668       0   2028668       0% /sys/fs/cgroup
    cgmfs                  100       0       100       0% /run/cgmanager/fs

  + One of the misassigned false positive reports:

    Tue Jun 14 10:35:13 CEST 2016 - Filesystems NOT ok

    yellow / (94% used) has reached the WARNING level (90%)

    Filesystem     1024-blocks      Used Available Capacity Mounted on
    udev                505668         0    505668       0% /dev
    tmpfs               102980       884    102096       1% /run
    /dev/sda1        155570664 137658232  10003236      94% /
    tmpfs                 5120         4      5116       1% /run/lock
    none               1132776       516   1132260       1% /tmp
    cgroup                  12         0        12       0% /sys/fs/cgroup

  + Excerpt from the "Client data" link from the status page of the
    item directly above this one:

    [df]
    Filesystem     1024-blocks    Used Available Capacity Mounted on
    udev               2013012       0   2013012       0% /dev
    tmpfs               405736   46464    359272      12% /run
    /dev/sda1         75620456 2591688  69950340       4% /
    tmpfs              2028668       0   2028668       0% /dev/shm
    tmpfs                 5120       4      5116       1% /run/lock
    tmpfs              2028668       0   2028668       0% /sys/fs/cgroup
    cgmfs                  100       0       100       0% /run/cgmanager/fs
    tmpfs               405736       0    405736       0% /run/user/0

  Please notice
  + the different set of file systems due to the reports are in
    reality coming from systems with different init systems.
  + the different overall size of /dev/sda1. The system hasn't
    rebooted since like 10 days and only has an 80 GB disk while the
    yellow report comes from a system with a 160 GB disk.

* The "Client data" link always have the correct "df" and "netstat"
  output even in cases where the service status page shows the "df" or
  "netstat" output from the wrong host. So I assume it's a bug in
  report handling and neither on the client side nor on the network
  side of the of the daemon.

* The misassigned reporting of the "raid" test started on 22nd of
  April 2016. That fits very well to the fact that I upgraded that
  server on 20th of April 2016 from 4.3.26 (plus
  CGI-regression fixes) to 4.3.27.

  The misassigned reporting of the "disk" test was first seen in our
  (IRC) logs on 15th of May 2016, i.e. weeks after the upgrade of
  Xymon to 4.3.27. But the misassigned "ports" test seems to have
  started already on 13th of April 2016 and that test was green for
  more than a year beforehands), i.e. while still 4.3.26 (plus
  CGI-regression fixes) was running.

  Can't say when the misassigned cpu tests started since I didn't find
  them via obvious false positives but because an actual report
  reported xen daemons on a webserver.

* There are hosts whose reports gets assigned to other hosts while
  being assigned reports from yet another host, e.g. Host 1 (web
  server with nearly full disk)'s procs test is (occassionally)
  reported as being for host 2 (a Xen server, easily spottable by
  several daemons containing "xen" in their name), host 2's cpu test is
  (occassionally) reported for host 3 (another web server, but with
  very low disk usage) and host's 1 disk test is (occassionally)
  reported as belonging to host 3, too.

So far we've noticed this only on one of our two xymon servers, let's
call it "Server B". And even explicitly grepping in the log history of
our Xymon IRC bot for potential cases (like ports on servers where
they never were nor shouldn't be) on Server A didn't bring up anything
similar.

Both run on Debian 8 Jessie ("Stable") on the amd64 architecture with
backported xymon packages from Debian Unstable/Testing (i.e. always
the newest packages -- as I test the packages there under live
conditions before uploading them to Debian). Both use Apache 2.4 as
webserver.

Server A:
* real hardware
* about 850 hosts
* configured "critical view" setup
* only reports to itself.
* permanently monitored by our group, i.e. I expect that we would
  notice such misassignements rather quickly, especially if they
  happen in the "critical view".
* Also runs some other monitoring-like tools like e.g. Smokeping.

Server B:
* virtual machine
* about 35 hosts
* Some hosts (actually network-enabled environment sensors) report
  once per minute. Those reports get passed through to a separate task
  ("CMD xymond_channel --channel=status ...") which logs _all_ their
  values and archives them month-wise for later reference.
* no "critical view"
* reports to itself and to Server A.
* Also tries to report to Server C (which was a test installation of
  the Xymon 5 branch, now offline)
* mostly monitors customer machines (call it "Xymon as a Service" ;-),
  i.e. not that much looked at by our group.
* Writes "Flushed 1 stale messages for 0.0.0.0:0" every 5 minutes into
  /var/log/xymon/clientdata.log, probably related to Server C being
  offline.

So far we only noticed that strange behaviour on Server B -- which is
the less busier server (despite that bunch of minute-wise reporting
hosts).

Actually the first notion has been seen when the VM "Server B"
reported to Server A that its (non-existing) "raid" test went
purple.

And then today one of our customers argued about his disk test going
yellow several times a day despite the accused partition is at 4% most
of the time according to Xymon -- except when it's at 94% for 5
minutes and then back at 4% again... (And I can't imagine an old 80 GB
disk being stuffed with 75 GB of data within 5 minutes) One single
_other_ host reporting to Server B is permanently on 94% disk usage and
its (and only its) yellow thresholds is set to 95%, hence its reports
are green for the same set of data.

So does anyone have an idea what goes wrong on our Server B so that it
mixes up reports from multiple hosts while Server A doesn't show these
symptoms?

		Kind regards, Axel Beckert
-- 
Axel Beckert <user-96d9963fe797@xymon.invalid>       support: +41 44 633 26 68
IT Services Group, HPT H 6                  voice: +41 44 633 41 89
Departement of Physics, ETH Zurich
CH-8093 Zurich, Switzerland		   http://nic.phys.ethz.ch/

list Christoph Berg · Wed, 15 Jun 2016 18:30:56 +0200 ·

Re: Axel Beckert 2016-06-15 <user-0f042641758d@xymon.invalid>

▸ quoted from Axel Beckert

in the past few months I found more and more indices for a strange bug
in (at least) Xymon 4.3.27 which occasionally mixes up hosts when
handling reports:

* Machines with a single disk (e.g. VMs) occassional report status of
  a "raid" test which is not deployed to them -- and then (for obvious
  reasons) went purple on it. On that server, there's only one machine
  in having a RAID, but its "raid" reports have been misassigned to at
  least three other hosts, all host which have rather many tests
  (compared to a bunch of sensors which send in only very few tests
  per host).

[...]


Fwiw, I've seen instances of such behavior ever since I've started
taking care of a hobbit installation at a customer site in late 2007.
Symptoms are randomly mixed up hosts. I can say if there are tests
that are hit more than others, the problem is mostly visible through
disk tests by finding rrd files on disk for partitions that do not
exist on this host.

It doesn't seem to happen constantly, but rather in bursts, but I
don't have hard data on that. My impression was that it only happens
during busy periods, but that could be totally wrong.

We've been on 4.3.0 for a long time until finally upgrading about two
years ago, and I thought the problem was gone then, but what Axel is
describing is exactly what we were (are?) seeing there.

Christoph

list Japheth Cleaver · Wed, 15 Jun 2016 14:36:14 -0700 ·

▸ quoted from Christoph Berg

On Wed, June 15, 2016 9:30 am, Christoph Berg wrote:

Re: Axel Beckert 2016-06-15 <user-0f042641758d@xymon.invalid>

in the past few months I found more and more indices for a strange bug
in (at least) Xymon 4.3.27 which occasionally mixes up hosts when
handling reports:

* Machines with a single disk (e.g. VMs) occassional report status of
  a "raid" test which is not deployed to them -- and then (for obvious
  reasons) went purple on it. On that server, there's only one machine
  in having a RAID, but its "raid" reports have been misassigned to at
  least three other hosts, all host which have rather many tests
  (compared to a bunch of sensors which send in only very few tests
  per host).

[...]

Fwiw, I've seen instances of such behavior ever since I've started
taking care of a hobbit installation at a customer site in late 2007.
Symptoms are randomly mixed up hosts. I can say if there are tests
that are hit more than others, the problem is mostly visible through
disk tests by finding rrd files on disk for partitions that do not
exist on this host.

It doesn't seem to happen constantly, but rather in bursts, but I
don't have hard data on that. My impression was that it only happens
during busy periods, but that could be totally wrong.

We've been on 4.3.0 for a long time until finally upgrading about two
years ago, and I thought the problem was gone then, but what Axel is
describing is exactly what we were (are?) seeing there.

Christoph

In some cases, I've seen this and tracked it down to malformed messages
resulting from incomplete client reports. Unfortunately, I wasn't able to
track down all of them from that, but many correllated with periods of
intense load.

The client message (well, all messages, really, but client messages might
be more noticable since they're the largest on a plain system) doesn't
have an EOM indicator, so it's impossible to see if something's gotten
truncated.

This will be solved in V5 style messages (which have a size indicator) or
when combining into an extcombo.

One work-around is to add --filter=\[clock\] to:
 xymond_channel --channel=client --filter=\[clock\] xymond_client (etc)

This will block partial client messages from getting further into xymond
when they happen, at the expense of some increased CPU load on
xymond_channel, with potential back-pressure into xymond if the message
load is high enough.


Of course, not having truncated messages in the first place would be nice :)


HTH,
-jc

list Axel Beckert · Thu, 16 Jun 2016 17:02:43 +0200 ·

Hi,

▸ quoted from Japheth Cleaver


On Wed, Jun 15, 2016 at 02:36:14PM -0700, J.C. Cleaver wrote:

On Wed, June 15, 2016 9:30 am, Christoph Berg wrote:

Fwiw, I've seen instances of such behavior ever since I've started
taking care of a hobbit installation at a customer site in late
2007.

Oops. Never ran into that knowingly before -- and I run Hobbit/Xymon
servers since about 2007 or so, too.

▸ quoted from Japheth Cleaver

Symptoms are randomly mixed up hosts. I can say if there are tests
that are hit more than others, the problem is mostly visible through
disk tests by finding rrd files on disk for partitions that do not
exist on this host.

I remember having seen misnamed rrd files in the past, that was more
like bitflips in the test names or so. (We get some data from some
embedded devices which were buggy in the beginning and occassionally
sent garbled messages to our Xymon server.) But I never noticed device
or path names which don't fit to the machine. So probably not the same
thing.

It doesn't seem to happen constantly, but rather in bursts,

That explains why it seemed to coincident with my upgrade to 4.3.27
and then I found cases from a few days earlier, too.

▸ quoted from Japheth Cleaver

but I don't have hard data on that. My impression was that it only
happens during busy periods, but that could be totally wrong.

Hrm, according to xymon, that xymon server has an average load of 0.2
and only a few peaks which go over 1.0 (highest load in the RRD is
1.5).

So I wouldn't say that this server is often "busy".

I also checked the past two days: The times where I have load peaks up
to 1.4 are other times than the ones where disk tests got
misassigned. :-(

▸ quoted from Japheth Cleaver

In some cases, I've seen this and tracked it down to malformed messages
resulting from incomplete client reports.

I can imagine that. Where the incomplete client reports from the host
where they were assigned to or from the one where they should have
been assigned to?

▸ quoted from Japheth Cleaver

Unfortunately, I wasn't able to track down all of them from that,
but many correllated with periods of intense load.

Fits with what Christoph experienced, but I doubt that it's related to
load on my server. Load on the clients might be possible, though.

▸ quoted from Japheth Cleaver

The client message (well, all messages, really, but client messages might
be more noticable since they're the largest on a plain system) doesn't
have an EOM indicator, so it's impossible to see if something's gotten
truncated.

This will be solved in V5 style messages (which have a size
indicator)

Nice!

▸ quoted from Japheth Cleaver

One work-around is to add --filter=\[clock\] to:
 xymond_channel --channel=client --filter=\[clock\] xymond_client (etc)

This will block partial client messages from getting further into xymond
when they happen, at the expense of some increased CPU load on
xymond_channel, with potential back-pressure into xymond if the message
load is high enough.

Hrm, I'm a bit reluctant to add this since the man page says:

       --filter=EXPRESSION
              EXPRESSION   is   a   Perl-compatible   regular  expression.
              xymond_channel will match the first  line  of  each  message
              against this expression, and silently drops any message that
              does not match the expression.

If I download the client data of an arbitray host, the first line is
always empty and the second line reads "[collector:]". "[clock]" only
shows up at the very end:

---8<---

[collector:]
client <hostname>.linux linux
[date]
Thu Jun 16 16:31:51 CEST 2016
[uname]
Linux <hostname> 3.16.0-4-amd64 x86_64
[osversion]
Debian 8.5
Distributor ID: Debian
Description:    Debian GNU/Linux 8.5 (jessie)
Release:        8.5
Codename:       jessie
[uptime]
 16:31:51 up 365 days, 28 min,  0 users,  load average: 8.43, 8.46, 8.23
[who]
[df]
Filesystem                1024-blocks        Used  Available Capacity Mounted on
[...]
[clientversion]
Xymon version 4.3.17
[clock]
epoch: 1466087516.425434
local: 2016-06-16 16:31:56 CEST
UTC: 2016-06-16 14:31:56 GMT
--->8---

So filtering for messages containing "[clock]" seems to make sense as
the message needs to be nearly complete to contain that string.

OTOH the xymond_channel(8) man page says it only matches the first
line of the message. What's considered to be a "message"? Each block
starting with "[something]" (but then the man page would claim that it
drops all other blocks) or the whole set of data linked as "Client
data" on service status pages?

Seems to me that either way something's wrong in the man page.

Or are those data block shown in reverse order on the web?

Can you confirm that adding "--filter=\[clock\]" won't drop nearly all
of the valid messages?

▸ quoted from Japheth Cleaver

Of course, not having truncated messages in the first place would be
nice :)

:-)

▸ quoted from Axel Beckert


		Kind regards, Axel Beckert
-- 
Axel Beckert <user-96d9963fe797@xymon.invalid>       support: +41 44 633 26 68
IT Services Group, HPT H 6                  voice: +41 44 633 41 89
Departement of Physics, ETH Zurich
CH-8093 Zurich, Switzerland		   http://nic.phys.ethz.ch/

Xymon server (4.3.27) occasionally assigns reports to the wrong host 🔗 link

Xymon server (4.3.27) occasionally assigns reports to the wrong host