Xymon Mailing List Archive

Spurious purple messages

Colin Coe
Mon, 14 Sep 2015 13:17:19 +0800
Message-Id: <CANvHAxQUQgFnbg7JPE=user-078a1e093242@xymon.invalid>

OK, looking at this again.  The main view looks fine, but the 'conn'
test on every host is a yellow circle with a question mark (unknown)
in the snapshot report view since September 4, 2015 at 13:32:42.

September 4, 2015 at 13:32:41 and earlier look fine.

Thanks

On Sat, Sep 12, 2015 at 5:48 PM, Vernon Everett
<user-b3f8dacb72c8@xymon.invalid> wrote:
Good to know it's not just me that fights with SELinux. :-)

Now that it works, what does the snapshot report reveal at the time the
purple alerts go out?

Purples require a "no report" for 30 minutes to trigger.
You might want to check all your logs at around 30-35 minutes before the
emails.
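
As a rough sketch of that arithmetic, assuming the alert emails go out at
13:45 (the time reported elsewhere in this thread) and the 30 minute purple
threshold mentioned above, the log window to inspect can be computed like
this. The function name and the 5 minute slack parameter are only
illustrative, not part of Xymon.

```python
from datetime import datetime, timedelta

def purple_log_window(alert_time, threshold_min=30, slack_min=5):
    """Return (start, end) of the period worth inspecting in the logs.

    threshold_min: minutes without a report before a status goes purple
    slack_min:     extra minutes to look back before that point (assumed)
    """
    end = alert_time - timedelta(minutes=threshold_min)
    start = end - timedelta(minutes=slack_min)
    return start, end

# Alert time taken from the thread: 13:45 on September 4, 2015.
alert = datetime(2015, 9, 4, 13, 45)
start, end = purple_log_window(alert)
print(start.strftime("%H:%M"), "to", end.strftime("%H:%M"))  # 13:10 to 13:15
```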


On 11 September 2015 at 18:13, Colin Coe <user-5b250cd7a540@xymon.invalid> wrote:
Almost...

Turned out to be SELinux, my old nemesis.  :)


On Tue, Sep 8, 2015 at 5:37 PM, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
That might be a permissions thing.


On 8 September 2015 at 19:15, Colin Coe <user-5b250cd7a540@xymon.invalid> wrote:
Hi Vernon

Thanks for the really good info.  The message serial numbers are
different every day but the messages are sent at the same time (13:45)
daily for all tests on all hosts.

The network is not congested nor is the SAN under any kind of pressure.

Interestingly, trying to do the snapshot report gave me "Cannot create
output directory".

Thanks again

CC

On Tue, Sep 8, 2015 at 3:56 PM, Vernon Everett
<user-b3f8dacb72c8@xymon.invalid> wrote:
Hi Colin

What do the client hosts have in common?
In the past I have seen a client overload their storage system,
overflowing buffers and exceeding the storage array's ability to
process IO requests. This caused general disk latency, which
slowed things down to the point of a purple flood.
There was no simple solution to that one except to buy more storage,
which they did.

Also, check the "serial numbers" on the messages. Are they repeats of an
older message, in which case Xymon might have something fishy going on,
or are they new messages every day, meaning it really thinks there is a
problem?

Xymon only updates pages every 2 and 5 minutes, depending on the page you
are looking at, meaning you could wait up to 7 minutes for the real status
to appear.
A purple takes 30 minutes to trigger.
With some unfortunate and highly improbable timing on whatever is
triggering these events, it's possible you might not see the purple.
Have you pulled up a "snapshot report" for the exact time of the
messages?

Something else unlikely, but possible, is the network.
The conn test uses ping, which is ICMP.
The Xymon agent sends using TCP.
Is there anything interesting happening on the network at the time?
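
To help rule the network in or out, here is a minimal sketch of a TCP
reachability check that could be run from a client at the time of the
alerts. It assumes the Xymon server listens on its default port, 1984; the
function name and the hostname in the comment are hypothetical.

```python
import socket

XYMON_PORT = 1984  # default Xymon port; adjust if your server differs

def can_connect(host, port=XYMON_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and completes the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures
        return False

# Example call (hypothetical hostname):
# print(can_connect("xymon.example.com"))
```

Running this from cron on a client for a few minutes either side of 13:45
would show whether TCP connections to the server fail while ping still
succeeds.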

Regards
Vernon


On 8 September 2015 at 11:39, Colin Coe <user-5b250cd7a540@xymon.invalid> wrote:
Hi all

Since Friday September 4, I've started receiving "stopped reporting
(PURPLE)" messages for all tests on all hosts from one of our Xymon
servers.

The host status, as shown in the Main View, is green for all hosts and
tests.  No purple at all.

The "stopped reporting (PURPLE)" messages are being sent at the same
time every day, 1:45PM.

Any advice on how I should track this down?

Thanks
--
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
