Thoughts on Usefulness/Reliability of Purple Alerts

Japheth Cleaver
Tue, 25 Aug 2015 23:38:24 -0700
Message-Id: <user-548650d81dec@xymon.invalid>

On Mon, August 24, 2015 10:15 am, Sean MacGuire wrote:

OK, I'll chime in... I wrote Big Brother so know something about
purple alerts.

The purple alert was something no other monitoring system did, and
it took care of the problem of a BB (and Xymon) client dropping dead
and/or machines being in a zombie state (i.e. responding to pings but
otherwise hung).

They're useful as an indication of something being wrong, vs a red or
yellow alert, which provides a clear and actionable problem status.

So purple alerts are awesome, unless your server has lost contact
with the clients reporting in and everyone goes purple at the same
time resulting in the "Massive Purple Explosion".

Background explanation - the idea was to timestamp reports into the
future, and if a client doesn't report in by then, the validity of
the last report is in question - it works the exact same way as
the expiration date on a carton of milk - the milk might not have
gone bad, yet, but you might want to check it before drinking it.


Matt Vander Werf wrote:

This is primarily for Henrik and J.C., but anyone else is free to chime
in with their thoughts on this as well!

*Background:
We have a Xymon server (the latest Terabithia RPM on RHEL 7) in
production that monitors around 1950 hosts (and consistently growing).
About a week ago, we experienced some pretty bad purple alert storms in
the middle of the night that were all false-positive alerts (over 300
alerts one night). For most of the tests that went purple, they went
back to green at the next update interval. At this point, we've been
unable to figure out a root cause behind this issue, but it hasn't
happened again since early last week (all the easy, understandable
possible causes have been ruled out: network load/bandwidth, CPU load of
Xymon server and of Xymon clients affected, etc.).

We have been using purple alerts for some time now, find them fairly
reliable for the most part, and think they are useful, as machines hang
or something similar (causing the Xymon client on the machine to stop
being able to report to the Xymon server) and we don't get any red or
yellow alerts for any other tests (sometimes a machine can hang but
still have a network connection that can be successfully pinged by
Xymon, we have found). We haven't had any major issues with
false-positive purple alerts (for the most part), or any purple alert
storms, since we started using them consistently for all our machines a
couple years ago.

I understand that when Xymon was first forked from Big Brother a long
while back, it may have been noted that one big change from Big Brother
was that you didn't need to do purple alerts (or something like that)
and that it was discouraged to use purple alerts, as they were seen as
widely unreliable. (I'm hearing this from a coworker of mine, who set up
our original Xymon server some 5 years ago, but I have been unable to find
what he's referring to.) But from what I can see from the current
documentation and the mailing list archives, I'm not seeing any place
where the use of purple alerts is discouraged due to them being
unreliable.

*Question(s):
So, I wanted to see what the current thinking/view regarding purple
alerts and the use of purple alerts was by both the original main
maintainer, Henrik, and the more current main maintainer, J.C. (at least
of the current release). Are purple alerts still considered wholly
unreliable, or even somewhat unreliable (or were they ever)? Are they
discouraged in any way or fashion from being used? Have they caused
issues for any of you on this list? Or vice versa: Have they worked well
for you? I'm fully aware that this purple alert storm issue we had may
just be a one-off occurrence and that we could have no further issues
in the future with purple alerts.

I understand that purple alerts are different than other alerts, like
red and yellow alerts, in that it is an indication that the Xymon client
has stopped working/reporting (on a per-test basis) to the Xymon server
for some reason, rather than an issue from a specific test (e.g. with
the CPU load, memory, etc.).

*(Possible) Feature Request:
In addition, I'd be interested if there was a way that you could only
get one alert for a machine if say all the tests for that machine go
purple, instead of an alert for each purple test. I don't believe this
is possible currently, correct? Is this something that could possibly be
implemented in the future? I understand if it's not or if it wouldn't be
very easy.


I appreciate your time in answering my questions and look forward to
your input! (And apologies for the long-winded e-mail!)


Thanks very much in advance!!

--
Matt Vander Werf


Matt,


Generally speaking, a purple alert should be seen first and foremost as an
indication of a failure in the monitoring *system*... where "system"
includes the client pushing data up from the various servers you're paying
attention to.

By having a calculation made on each message's receipt of how long that
message is good for (receipt time + [default or specified lifetime]), we have a
"fail safe" for an unknown issue occurring that requires attention. The
proximate cause of the purple is the failure to receive a message. Whether
that's caused by a hang or death of the usual originator, a bug in a
xymonproxy, a cut network cable, or xymond being unable to handle all of
the traffic sent to it before it times out, is left somewhat as an
exercise for the administrator.

Because purple alerts are generated from xymond's own view of its internal
state (calculated once a minute) and are never sent IN to xymond, purple
alerts should be a reliable indicator that... some other type of
unreliability is going on :)
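
The expiry logic described above can be sketched roughly as follows. This is
only an illustration in Python, not code from xymond: the names Status,
receive and sweep are hypothetical, and the 30-minute default lifetime is an
assumption made for the example.

```python
from dataclasses import dataclass

DEFAULT_LIFETIME = 30 * 60  # seconds; assumed default validity window


@dataclass
class Status:
    host: str
    test: str
    color: str
    valid_until: float  # receipt time + lifetime


def receive(table, host, test, color, now, lifetime=DEFAULT_LIFETIME):
    """Store a report and stamp it with an expiry time in the future."""
    table[(host, test)] = Status(host, test, color, now + lifetime)


def sweep(table, now):
    """Periodic pass over the server's own state: mark stale entries purple.

    Purple is never received from a client; it is derived here.
    """
    expired = []
    for key, status in table.items():
        if status.color != "purple" and now > status.valid_until:
            status.color = "purple"
            expired.append(key)
    return expired


table = {}
receive(table, "web01", "cpu", "green", now=0)
receive(table, "web01", "memory", "green", now=0, lifetime=300)
print(sweep(table, now=600))   # only the shorter-lived memory report has expired
print(sweep(table, now=1801))  # the cpu report has now expired as well
```

Note that the sweep has no idea *why* a report is missing; a dead client, a
broken proxy and an overloaded server all look the same from here.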


Because of the wide possibility of different configurations, it's a little
dangerous to create a one-size-fits-all strategy for purples. In a typical
xymon installation with xymonnet and xymond_client running locally on the
same machine, with no proxies or network segments in the middle, and with
clients reporting directly in as well, you really shouldn't see any purple
alerts outside of clients dying... And if the client is dying because the
box is dying, by default you'll only get the 'conn' test red alert instead
of the various xymond_client and xymonnet-generated ones (unless you're
using the 'noclear' line in hosts.cfg).


Your suggestion to have only a single 'purple' come through would
*typically* work, but you'd have to ask yourself which test would be the
representative one. In our case, we found it easiest to nominate a
specific xymond_client test -- "memory" -- and only send purple
notifications for that out to our alert team. That takes care of
xymond_client, while leaving esoteric situations caused by the failure of
different sharded xymonnet or xymonproxy instances, or custom independent
tests, free to fail in their own way.
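
As a rough illustration of that filtering idea (again in Python, and not
actual Xymon alert configuration syntax), the rule amounts to something like
the sketch below. The function name should_notify and the sample events are
hypothetical; in a real installation this would be expressed in the alert
rules rather than in code.

```python
REPRESENTATIVE_TEST = "memory"  # the test nominated in the message above


def should_notify(test, color, representative=REPRESENTATIVE_TEST):
    """Pass red/yellow through; pass purple only for the representative test."""
    if color == "purple":
        return test == representative
    return color in ("red", "yellow")


# Hypothetical sample events: (host, test, color)
events = [
    ("web01", "cpu", "purple"),
    ("web01", "disk", "purple"),
    ("web01", "memory", "purple"),
    ("db01", "disk", "red"),
]
for host, test, color in events:
    if should_notify(test, color):
        print(f"ALERT {host}.{test} {color}")
```

With these events, web01 produces one purple notification instead of three,
and the red on db01 still goes out as usual.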


Again, it's the non-typical cases where it gets tricky. What about custom
tests that aren't being generated by xymond_client that are still
functioning? Perhaps you have xymonnet running on a different machine
that's reporting back to xymond (or to a xymonproxy that's reporting back
to xymond!) that has failed in some way. And of course, it could be that
xymond is under heavy load and is unable to keep up with incoming messages
generally (something we experienced in both the TCP and BFQ configs as we
were scaling out).


Sort of along these lines, however, I'd been considering having a more
"host-wide" way of defining certain failure states directly within xymond,
which would allow some of this override logic to happen centrally (and
more reliably). Imagine a 'conn' being red optionally causing *all* tests
to fail-to-clear, removing the need for this calculation from the
remainder of xymonnet tests. Or a true host-wide "disable" that gets
applied to all tests, even new ones, as a xymond flag. A host-wide
"purple-state" could be conceptualized as well.

That's just a thought, though, and it kind of depends on whether people
would find such a feature useful.


Anyway, I hope that's answered some of your questions!


Regards,

-jc