Xymon Mailing List Archive

Spurious purple messages

16 messages in this thread

list Colin Coe · Tue, 8 Sep 2015 09:39:53 +0800 ·
Hi all

Since Friday September 4, I've started receiving "stopped reporting
(PURPLE)" messages for all tests on all hosts from one of our Xymon
servers.

The host status, as shown in the Main View, is green for all hosts and
tests.  No purple at all.

The "stopped reporting (PURPLE)" messages are being sent at the same
time every day, 1:45PM.

Any advice on how I should track this down?

Thanks
list Clark Tony · Tue, 8 Sep 2015 06:36:02 +0000 ·
Sounds like network interface overload, perhaps caused by a backup or a massive file transfer.
list Vernon Everett · Tue, 8 Sep 2015 17:56:20 +1000 ·
Hi Colin

What do the client hosts share in common?
In the past I have seen a client overloading their storage system: they
were overflowing buffers and exceeding the storage array's ability to
process IO requests. This caused general disk latency, which slowed
things down to the point of a purple flood.
There was no simple solution to that one except buying more storage, which
they did.

Also, check the "serial numbers" on the messages. Is this a repeat of an
older message, in which case Xymon might have something fishy going on, or
are they new messages every day, as in it really thinks there is a problem?

Xymon only updates pages every 2 and 5 minutes, depending on the page you
are looking at. Meaning you could wait up to 7 minutes for the real status
to appear.
A purple takes 30 minutes to trigger.
With some unfortunate, and highly improbable timing on whatever is
triggering these events, it's possible you might not see the purple.
Have you pulled up a "snapshot report" for the exact time of the messages?

Something else unlikely, but possible, is the network.
The conn test uses ping, which is ICMP.
The Xymon agent sends its reports using TCP.
Is there anything interesting happening on the network at the time?
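
A successful ping does not prove that client reports can reach the server, since the two travel over different protocols. The following is a minimal sketch of a TCP check to run alongside ping, assuming the Xymon server listens on its default port 1984. The function name tcp_reachable is hypothetical and not part of Xymon.

```python
import socket

XYMON_PORT = 1984  # Xymon's default listener port; adjust if your server differs


def tcp_reachable(host, port=XYMON_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and name-resolution failures
        return False
```

Calling tcp_reachable("xymon.example.com") (a placeholder hostname) from a client at the time of the alerts would show whether the TCP path is open even when ping looks healthy.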

Regards
Vernon
-- 

"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Colin Coe · Tue, 8 Sep 2015 17:15:36 +0800 ·
Hi Vernon

Thanks for the really good info.  The message serial numbers are
different every day but the messages are sent at the same time (13:45)
daily for all tests on all hosts.

The network is not congested nor is the SAN under any kind of pressure.

Interestingly, trying to do the snapshot report gave me "Cannot create
output directory".

Thanks again

CC
list Vernon Everett · Tue, 8 Sep 2015 19:37:49 +1000 ·
That might be a permissions thing.
list Colin Coe · Fri, 11 Sep 2015 16:13:29 +0800 ·
Almost...

Turned out to be SELinux, my old nemesis.  :)
list Vernon Everett · Sat, 12 Sep 2015 19:48:01 +1000 ·
Good to know it's not just me that fights with SELinux. :-)

Now that it works, what does the snapshot report reveal at the time the
purple alerts go out?

Purples require a "no report" for 30 minutes to trigger.
You might want to check all your logs at around 30-35 minutes before the
emails.
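
The arithmetic behind that suggestion can be sketched as follows, assuming the 30-minute purple delay mentioned above and the 13:45 alert time reported earlier in the thread. The helper name log_window and the 5-minute slack are illustrative assumptions.

```python
from datetime import datetime, timedelta


def log_window(alert_time, validity_min=30, slack_min=5):
    """Return (start, end) of the period in which reports probably stopped."""
    # The last good report arrived roughly validity_min before the alert ...
    end = alert_time - timedelta(minutes=validity_min)
    # ... allow a few extra minutes for page updates and alert processing.
    start = end - timedelta(minutes=slack_min)
    return start, end


alert = datetime(2015, 9, 8, 13, 45)  # time the purple emails were sent
start, end = log_window(alert)
print(f"Check logs between {start:%H:%M} and {end:%H:%M}")
# prints "Check logs between 13:10 and 13:15"
```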
list Colin Coe · Mon, 14 Sep 2015 13:17:19 +0800 ·
OK, looking at this again.  The main view looks fine, but the 'conn'
test on every host is a yellow circle with a question mark (unknown)
in the snapshot report view since September 4, 2015 at 13:32:42.

September 4, 2015 at 13:32:41 and earlier look fine.

Thanks

list Vernon Everett · Tue, 15 Sep 2015 12:06:06 +1000 ·
That's interesting.
No idea what it means, or where to go from here, but it's certainly
interesting.

Does it happen the exact same time every day?
Have you tried a ping from the Xymon host to the client at or around the
time of the issue, to see if there are any oddities?

Is there anything in the logs?
list Colin Coe · Tue, 15 Sep 2015 11:29:20 +0800 ·
Hi Vernon,

Yep, very interesting.  The purple messages come through every day at
about the same time, give or take a minute or so.

Yep, pings work and the normal "main view" and "all non-green view" work fine.

The logs look fine.  I'd really like to get to the bottom of this...

Thanks

CC

list Glauber Ribeiro · Tue, 15 Sep 2015 15:44:03 +0000 ·
Could it be something with the clock on the xymon server? Maybe some cron process to synchronize to a time server?
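
One way to test that idea is to compare the wall clock against a monotonic clock, which is not moved by time steps. This is a rough sketch rather than a Xymon feature; the names clock_step and watch, and the thresholds, are assumptions for illustration.

```python
import time


def clock_step(wall_before, mono_before, wall_after, mono_after):
    """Seconds the wall clock moved beyond what the monotonic clock measured."""
    return (wall_after - wall_before) - (mono_after - mono_before)


def watch(interval=60.0, threshold=2.0, rounds=3):
    """Sample both clocks every `interval` seconds and report large steps."""
    wall, mono = time.time(), time.monotonic()
    for _ in range(rounds):
        time.sleep(interval)
        new_wall, new_mono = time.time(), time.monotonic()
        step = clock_step(wall, mono, new_wall, new_mono)
        if abs(step) > threshold:
            print(f"{time.ctime(new_wall)}: wall clock stepped by {step:+.1f}s")
        wall, mono = new_wall, new_mono
```

Running watch() with enough rounds to span the time of the daily alerts would show whether the server clock jumps at that point; a jump near 30 minutes would be consistent with every status appearing stale at once.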
list Colin Coe · Wed, 16 Sep 2015 14:26:07 +0800 ·
Hi all

The date/time is set correctly:
---
timedatectl
      Local time: Wed 2015-09-16 14:23:45 AWST
  Universal time: Wed 2015-09-16 06:23:45 UTC
        RTC time: Wed 2015-09-16 06:23:42
        Timezone: Australia/Perth (AWST, +0800)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: n/a
---

fping responds with "host is alive", ping responds with "normal" ping
successful output.


Does anyone else have any ideas on this? I really don't want to have to
blow this server away and start again...

Thanks

On Tue, Sep 15, 2015 at 11:44 PM, Ribeiro, Glauber
<user-59d088777028@xymon.invalid> wrote:
Could it be something with the clock on the xymon server? Maybe some cron process to synchronize to a time server?

-----Original Message-----
From: Xymon [mailto:xymon-bounces at xymon.com] On Behalf Of Colin Coe
Sent: Monday, September 14, 2015 22:29
To: Vernon Everett
Cc: xymon at xymon.com
Subject: Re: [Xymon] Spurious purple messages

Hi Vernon,

Yep, very interesting.  The purple messages come through every day at
about the same time, give or take a minute or so.

Yep, pings work and the normal "main view" and "all non-green view" work fine.

The logs look fine.  I'd really like to get to the bottom of this...

Thanks

CC

On Tue, Sep 15, 2015 at 10:06 AM, Vernon Everett
<user-b3f8dacb72c8@xymon.invalid> wrote:
That's interesting.
No idea what it means, or where to go from here, but it's certainly
interesting.

Does it happen at the exact same time every day?
Have you tried a ping from the Xymon host to the client at or around the
time of the issue, to see if there are any oddities?

Is there anything in the logs?


On 14 September 2015 at 15:17, Colin Coe <user-5b250cd7a540@xymon.invalid> wrote:
OK, looking at this again.  The main view looks fine, but the 'conn'
test on every host is a yellow circle with a question mark (unknown)
in the snapshot report view since September 4, 2015 at 13:32:42.

September 4, 2015 at 13:32:41 and earlier look fine.

Thanks

On Sat, Sep 12, 2015 at 5:48 PM, Vernon Everett
<user-b3f8dacb72c8@xymon.invalid> wrote:
Good to know it's not just me that fights with SELinux. :-)

Now that it works, what does the snapshot report reveal at the time the
purple alerts go out?

Purples require a "no report" for 30 minutes to trigger.
You might want to check all your logs at around 30-35 minutes before the
emails.
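
The arithmetic behind that advice can be sketched in a few lines of Python. This is only an illustration: the 30-minute delay and the extra 5 minutes come from the message above, the 13:45 alert time comes from the original report, and the function name is made up for this example.

```python
from datetime import datetime, timedelta
from typing import Tuple

# From the thread: a status goes purple after 30 minutes without a report.
PURPLE_DELAY = timedelta(minutes=30)
# From the thread: look "30-35 minutes before the emails", so 5 minutes of margin.
MARGIN = timedelta(minutes=5)


def log_window(alert_time: datetime) -> Tuple[datetime, datetime]:
    """Return (start, end) of the period in which the last report probably arrived."""
    end = alert_time - PURPLE_DELAY
    start = end - MARGIN
    return start, end


# Example: alerts arriving at 13:45 point at logs between 13:10 and 13:15.
start, end = log_window(datetime(2015, 9, 8, 13, 45))
print(f"check logs from {start:%H:%M} to {end:%H:%M}")
```

For the 13:45 alerts this gives a window of 13:10 to 13:15 on the same day.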


On 11 September 2015 at 18:13, Colin Coe <user-5b250cd7a540@xymon.invalid> wrote:
Almost...

Turned out to be SELinux, my old nemesis.  :)


On Tue, Sep 8, 2015 at 5:37 PM, Vernon Everett
<user-b3f8dacb72c8@xymon.invalid>
wrote:
That might be a permissions thing.


On 8 September 2015 at 19:15, Colin Coe <user-5b250cd7a540@xymon.invalid> wrote:
Hi Vernon

Thanks for the really good info.  The message serial numbers are
different every day but the messages are sent at the same time (13:45)
daily for all tests on all hosts.

The network is not congested nor is the SAN under any kind of pressure.

Interestingly, trying to do the snapshot report gave me "Cannot create
output directory".

Thanks again

CC

list Phil Crooker · Wed, 16 Sep 2015 07:36:59 +0000 ·
So within the 30 minutes prior to the purple state, is the conn test time incrementing with normal test times, or is it stuck at 1:15PM and not updating? In other words, is it an actual purple event or a false positive? If it is a false positive, I have no idea...

If it is an actual purple event:

Have you run the pings during the 'purple time' to see that comms actually work then? Try TCP-type connections, e.g. script a wget or whatever to run every few seconds. How about a tcpdump through that period? How about running a ps listing every 15 seconds, or vmstat, through the period to see if anything is amiss?

Have you tried eliminating tests/hosts? For example, does it happen with one host and just the conn test? All hosts with only the conn test?

Are there any tests that take a long time (i.e. look at the xymongen and xymonnet stats for the Xymon server) or tests that are blocking, e.g. NFS hard mounts?
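
As a rough sketch of the "script a TCP connection every few seconds" idea, something like the following Python could be left running across the purple window. The function names, the host name in the usage comment, and the interval are placeholders for this example; 1984 is the default Xymon port, so adjust it if the installation uses another one.

```python
import socket
import time
from datetime import datetime


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def probe_loop(host: str, port: int, interval: float, count: int) -> list:
    """Probe count times, interval seconds apart; print and return (timestamp, ok) pairs."""
    results = []
    for i in range(count):
        ok = tcp_probe(host, port)
        stamp = datetime.now().isoformat(timespec="seconds")
        print(f"{stamp} {host}:{port} {'ok' if ok else 'FAILED'}")
        results.append((stamp, ok))
        if i < count - 1:
            time.sleep(interval)
    return results


# Usage (hypothetical host name; run it from a client around the alert time):
#   probe_loop("xymon.example.com", 1984, interval=5.0, count=720)
```

Any FAILED lines around the time the status stops updating would point towards a real communication problem rather than a false positive.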
list Glauber Ribeiro · Wed, 16 Sep 2015 14:01:02 +0000 ·
Sorry, I wasn't clear. I was wondering if there could be some process set up in cron to adjust the time, which could be causing this (bumping the server time once a day). Just hypothetical, unlikely.

g
list Colin Coe · Thu, 17 Sep 2015 06:28:19 +0800 ·
Glauber, I can confirm there are no cron jobs or similar that alter the time.

Phil, I can confirm that it is a false positive.

I figure there must be some stale data somewhere but I've not found
it.   What process sends the notifications?  Where does this process
get its data?

Thanks all

list Colin Coe · Sat, 19 Sep 2015 14:47:53 +0800 ·
Hi all

I ended up resolving this by stopping the Xymon service, removing all
files in $XYMONTMP and then starting Xymon again.

Thanks all for the suggestions

CC
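
Before clearing $XYMONTMP as described above, it may be worth looking at what is in there. The following Python is a read-only sketch that lists files older than a chosen age; the one-hour threshold and the function names are assumptions made for this example, and nothing in the thread identifies which file was actually stale.

```python
import os
import time
from pathlib import Path


def files_older_than(directory, max_age_seconds, now=None):
    """Return regular files directly inside directory whose mtime is older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for entry in Path(directory).iterdir():
        if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
            stale.append(entry)
    return sorted(stale)


def report(directory, max_age_seconds=3600):
    """Print each old file with its age in minutes. Nothing is deleted."""
    now = time.time()
    for path in files_older_than(directory, max_age_seconds, now=now):
        age_min = (now - path.stat().st_mtime) / 60
        print(f"{path}  last modified {age_min:.0f} min ago")


# Usage, run as the Xymon user so that XYMONTMP is set in the environment:
#   report(os.environ["XYMONTMP"])
```

The script only reads file timestamps. The actual cleanup remains the manual step from the message above: stop Xymon, remove the files, start it again.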