Xymon Mailing List Archive

Thoughts on Usefulness/Reliability of Purple Alerts

Matt Vander Werf
Wed, 26 Aug 2015 10:01:38 -0400
Message-Id: <user-99b69892a3a8@xymon.invalid>

Hi Phil,

Thanks for your input!

Just curious, but what would be your definition of "a huge number of hosts
and tests"? This might be kind of subjective, as people (including myself)
might interpret this differently. Would around 1950 hosts count, most of
them with just the standard tests (conn, cpu, disk, memory) set up for
alerts, plus maybe 50 or so additional alerts set up for various "custom"
tests?

Just curious where you come down on a "huge number of hosts and tests".

Thanks.

--
Matt Vander Werf
HPC System Administrator
University of Notre Dame
Center for Research Computing - Union Station
XXX W. South Street
South Bend, IN XXXXX
Phone: (XXX) XXX-XXXX

On Tue, Aug 25, 2015 at 9:12 PM, Phil Crooker <user-e8e31cd73303@xymon.invalid>
wrote:
My two bits:


Big Brother had problems with network tests: if you ran too many tests
(e.g. ssh, smtp, etc.) or if they took too long, they wouldn't complete
before the next round of tests began. This caused everything to go purple
and was certainly one of the reasons purple alerts weren't considered
'reliable'. In my experience (with not a huge number of hosts and
tests) this doesn't occur with Xymon, presumably because the tests are run
in parallel rather than sequentially as was the case with BB.
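
To make the timing argument concrete, here is a rough sketch of the
arithmetic. It is a deliberately simplified model and not how Big Brother or
Xymon actually schedules network tests: the function names, the 300-second
interval and the per-test durations are all assumptions for illustration.

```python
def round_duration(durations, parallel):
    """Estimate how long one round of network tests takes.

    Simplified model: sequential tests take the sum of their durations,
    fully concurrent tests take as long as the slowest one.
    """
    if not durations:
        return 0.0
    return max(durations) if parallel else sum(durations)


def round_overruns(durations, interval_s, parallel):
    """True if the round would still be running when the next one is due."""
    return round_duration(durations, parallel) > interval_s


# Hypothetical workload: 500 tests of about 1 second each, 300 s interval.
durations = [1.0] * 500
sequential_late = round_overruns(durations, 300, parallel=False)
parallel_late = round_overruns(durations, 300, parallel=True)
```

In this model the sequential round needs 500 seconds and misses the
300-second interval, while the concurrent round does not, which matches the
behaviour described above.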


About your feature request - have you tried using the 'depends=' parameter
in hosts.cfg?


*From:* Xymon <xymon-bounces at xymon.com> on behalf of Matt Vander Werf <
user-07704c41c3ad@xymon.invalid>
*Sent:* Tuesday, 25 August 2015 2:31 AM
*To:* user-87556346d4af@xymon.invalid; user-ce4a2c883f75@xymon.invalid
*Cc:* xymon at xymon.com; Rich Sudlow
*Subject:* [Xymon] Thoughts on Usefulness/Reliability of Purple Alerts

This is primarily for Henrik and J.C., but anyone else is free to chime in
with their thoughts on this as well!

*Background:*
We have a Xymon server (the latest Terabithia RPM on RHEL 7) in production
that monitors around 1950 hosts (and consistently growing).
About a week ago, we experienced some pretty bad purple alert storms in
the middle of the night that were all false-positive alerts (over 300
alerts one night). For most of the tests that went purple, they went back
to green at the next update interval. At this point, we've been unable to
figure out a root cause behind this issue, but it hasn't happened again
since early last week (all the easy, understandable possible causes have
been ruled out: network load/bandwidth, CPU load of Xymon server and of
Xymon clients affected, etc.).

We have been using purple alerts for some time now, find them fairly
reliable for the most part, and think they are useful: when a machine hangs
(or something similar stops the Xymon client on it from reporting to the
Xymon server), we don't get any red or yellow alerts for any other tests.
(We have found that a machine can sometimes hang but still have a network
connection that Xymon can ping successfully.) We haven't had any major
issues with false-positive purple alerts (for the most part), or any purple
alert storms, since we started using them consistently for all our machines
a couple of years ago.

I understand that when Xymon was first forked from Big Brother a long
while back, it may have been noted that one big change from Big Brother was
that you didn't need to do purple alerts (or something like that) and that
it was discouraged to use purple alerts, as they were seen as widely
unreliable. (I'm hearing this from a coworker of mine, who set up our
original Xymon server some 5 years ago, but I have been unable to find what
he's referring to.) But from what I can see from the current documentation
and the mailing list archives, I'm not seeing any place where the use of
purple alerts is discouraged due to them being unreliable.

*Question(s):*
So, I wanted to see what the current thinking on purple alerts and their
use is from both the original main maintainer, Henrik, and the more current
main maintainer, J.C. (at least of the current release). Are purple alerts
still considered wholly unreliable, or even
somewhat unreliable (or were they ever)? Are they discouraged in any way or
fashion from being used? Have they caused issues for any of you on this
list? Or vice versa: Have they worked well for you? I'm fully aware that
the purple alert storm we had may be just a one-off occurrence and that we
might not have any further issues with purple alerts.

I understand that purple alerts differ from other alerts, like red and
yellow alerts, in that they indicate that the Xymon client has stopped
working/reporting (on a per-test basis) to the Xymon server for some
reason, rather than a problem found by a specific test (e.g. with the CPU
load, memory, etc.).
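
Put as a sketch, the "stale report" idea looks roughly like this. It only
illustrates the concept and is not Xymon's actual code: the function name,
the timestamps and the 30-minute validity period are assumptions.

```python
def effective_color(last_color, last_update_s, now_s, lifetime_s=1800):
    """Return the color to display for one host/test status.

    A status keeps the color the client last reported until it is older
    than lifetime_s seconds; after that it is shown as purple, whatever
    the last reported color was.
    """
    if now_s - last_update_s > lifetime_s:
        return "purple"
    return last_color


# Hypothetical example: a cpu status last reported green at t=0.
fresh = effective_color("green", last_update_s=0, now_s=600)
stale = effective_color("green", last_update_s=0, now_s=2400)
```

Here `fresh` stays green because the report is 10 minutes old, while `stale`
turns purple because no update arrived within the assumed 30 minutes.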

*(Possible) Feature Request:*
In addition, I'd be interested if there was a way that you could only get
one alert for a machine if say all the tests for that machine go purple,
instead of an alert for each purple test. I don't believe this is possible
currently, correct? Is this something that could possibly be implemented in
the future? I understand if it's not or if it wouldn't be very easy.
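
To illustrate the behaviour I have in mind, here is a minimal sketch. It is
not an existing Xymon feature or API, just hypothetical post-processing of a
list of (host, test, color) entries; the function name, the host names and
the alert wording are made up for the example.

```python
from collections import defaultdict


def collapse_purple_alerts(statuses):
    """Build purple alert lines from (host, test, color) tuples.

    If every test of a host is purple, emit a single alert for the host;
    otherwise emit one alert per purple test, as happens today.
    """
    by_host = defaultdict(list)
    for host, test, color in statuses:
        by_host[host].append((test, color))

    alerts = []
    for host in sorted(by_host):
        tests = by_host[host]
        purple = [test for test, color in tests if color == "purple"]
        if purple and len(purple) == len(tests):
            alerts.append(f"{host}: all {len(tests)} tests purple")
        else:
            alerts.extend(f"{host}.{test}: purple" for test in purple)
    return alerts


# Hypothetical input: node1 is entirely purple, node2 only partly.
statuses = [
    ("node1", "cpu", "purple"),
    ("node1", "disk", "purple"),
    ("node2", "cpu", "purple"),
    ("node2", "disk", "green"),
]
alerts = collapse_purple_alerts(statuses)
```

With this input, node1 would produce one combined alert and node2 would
still produce an alert for its cpu test only.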


I appreciate your time in answering my questions and look forward to your
input! (And apologies for the long-winded e-mail!)


Thanks very much in advance!!

--
Matt Vander Werf