duration of MSG red status

11 messages in this thread

list Nicole Beck · Fri, 24 Oct 2014 15:13:46 +0000 ·

Hello,
I think this has been asked before, but it was a long time ago and I wondered if something changed since then.

How long will the status of MSGS stay red?  We just set up monitoring recently and it seems like it stays red for about 30 minutes. Is that normal?  Is this configurable?

We are running Xymon server 4.2.3.

Thanks,
Nicole

list Japheth Cleaver · Fri, 24 Oct 2014 11:50:07 -0700 ·

Nicole,

The duration of the 'msgs' test is actually a function of how many cycles
back logfetch will scan for content to include in the log data going
forward (the actual calculation of the color is done by the regexes applied
in xymond_client).
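The color decision itself happens on the server side, by matching regexes against the log lines the client sent. As a rough illustration only, the shell sketch below classifies a batch of log lines the same general way; the function name `msgs_color` and both patterns are hypothetical and are not Xymon's defaults or its analysis.cfg syntax.

```shell
#!/bin/sh
# Hypothetical sketch: pick a status color for a batch of log lines by
# regex matching, in the spirit of xymond_client's LOG rules.
# The patterns are illustrative assumptions, not Xymon's built-in defaults.
RED_PATTERN='(error|fatal|panic)'
YELLOW_PATTERN='(warn|warning)'

msgs_color() {
    # $1: file holding the log lines the client sent this cycle
    if grep -Eiq "$RED_PATTERN" "$1"; then
        echo red
    elif grep -Eiq "$YELLOW_PATTERN" "$1"; then
        echo yellow
    else
        echo green
    fi
}
```

Red is checked before yellow, so a batch containing both kinds of lines reports the more severe color.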

logfetch will look back 6 runtime-positions which, combined with the
default xymonclient run interval of 5m, ends up causing the 30m figure.

The former value is compiled in; however, the run frequency is
configurable. (We run our clients on 100s cycles, which means our msgs
tests last for 10-12m.)
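The arithmetic behind the 30-minute figure is simply lookback positions times client interval. A minimal sketch, assuming the compiled-in lookback of 6 positions described above (the function name is hypothetical); the result is a lower bound, and JC reports 10-12 minutes in practice on 100s cycles.

```shell
#!/bin/sh
# Estimate how long a 'msgs' alert stays non-green:
# (number of runtime positions logfetch looks back) x (client run interval).
# LOOKBACK_POSITIONS=6 reflects the compiled-in value described in the thread.
LOOKBACK_POSITIONS=6

msgs_duration_seconds() {
    # $1: client run interval in seconds
    echo $((LOOKBACK_POSITIONS * $1))
}

echo "300s interval: $(($(msgs_duration_seconds 300) / 60)) minutes"  # prints "300s interval: 30 minutes"
echo "100s interval: $(($(msgs_duration_seconds 100) / 60)) minutes"  # prints "100s interval: 10 minutes"
```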


I'm not sure how easy it would be to make the 6x positions dynamic or a
runtime option, but that would be nice.


Regards,

-jc

list Ryan Novosielski · Fri, 24 Oct 2014 17:41:32 -0400 ·

Could have sworn the number of lines to look at was configurable too. Maybe I'm thinking of BB?

list Bill Arlofski · Sun, 26 Oct 2014 10:26:43 -0400 ·

Hi Ryan, I was thinking the same thing, but I think we may be thinking of the
max bytes to send. From the client-local.cfg docs:

log:/var/log/messages:10240 - The log:FILENAME:SIZE line defines the filename
of the log, and the maximum amount of data (in bytes) to send to the Xymon server.


This thread caused me to start thinking about a similar problem I have not had
time to look into for a long time, and I think Xymon has an option that might
fix both of our problems.

My situation:
I have a custom script on a server that checks licenses for Zimbra email
archiving accounts.  If all the available "archiving account" licenses have
been used and someone attempts to create another archiving account, the
script will log:

"error: ArchivingAccountsLimit exceeded: 163/125"


When I set up the script and the Xymon logfile test, I tested it and Xymon
properly reported yellow, and I thought I was set.  I didn't realize that it
was only staying yellow for 30 minutes.  So once my testing was done, I set
the script to run at 2:00am daily and thought I was done.

Unfortunately, this just means that every morning at 2am this test goes yellow
for 30 minutes and is green by the time the IT people come in. (They do not
get/want alerts for anything other than some temperatures currently)

So... while re-investigating this, I see that the client-local.cfg has an
optional trigger:PATTERN option for logfiles which states:


"The trigger PATTERN line (optional) is used only when there is more data in
the log than the maximum size set in the "log:FILENAME:SIZE" line. The
"trigger" pattern is then used to find particularly interesting lines in the
logfile - these will always be sent to the Xymon server. After picking out the
"trigger" lines, any remaining space up to the maximum size is filled in with
the most recent entries from the logfile. "PATTERN" is a regular expression."
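The selection rule quoted above (trigger lines always sent, remaining space filled with the newest data) can be sketched in shell. This is a hypothetical approximation for illustration, not logfetch's actual code; the function name `select_log_data` and its arguments are assumptions.

```shell
#!/bin/sh
# Hypothetical approximation of the documented trigger behaviour:
# if the log exceeds MAXBYTES, lines matching TRIGGER are always kept and
# the remaining byte budget is filled with the most recent log data.
select_log_data() {
    logfile=$1 maxbytes=$2 trigger=$3
    size=$(($(wc -c < "$logfile")))          # arithmetic strips wc padding
    if [ "$size" -le "$maxbytes" ]; then
        cat "$logfile"                       # everything fits, no selection needed
        return 0
    fi
    used=0
    triggered=$(grep -E -- "$trigger" "$logfile")
    if [ -n "$triggered" ]; then
        printf '%s\n' "$triggered"           # trigger lines are always sent
        used=$(($(printf '%s\n' "$triggered" | wc -c)))
    fi
    remaining=$((maxbytes - used))
    if [ "$remaining" -gt 0 ]; then
        tail -c "$remaining" "$logfile"      # newest data; may begin mid-line
    fi
}
```

Note that this only decides what is sent in one run; it says nothing about which part of the log is considered new, which is the question the rest of the thread turns on.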


I have not tested this, but it would seem to indicate that it would cause the
client to send the Xymon server all the lines that match the trigger pattern
(regardless of how far back in time they go in the logfile) which should cause
the test to stay non-green until the logfile is rotated and no more lines with
the trigger pattern exist.

Can anyone confirm or deny this functionality?


Bill


-- 
Bill Arlofski
Reverse Polarity, LLC
http://www.revpol.com/
-- Not responsible for any advertising below this line --

list Jeremy Laidman · Tue, 28 Oct 2014 07:45:08 +1100 ·

▸ quoted from Bill Arlofski

On 27 October 2014 01:26, Bill Arlofski <user-0b8af203a56e@xymon.invalid> wrote:

I have not tested this, but it would seem to indicate that it would cause the
client to send the Xymon server all the lines that match the trigger pattern
(regardless of how far back in time they go in the logfile) which should cause
the test to stay non-green until the logfile is rotated and no more lines with
the trigger pattern exist.

I haven't verified this, but my understanding of how the "logfetch" process
works is that it keeps state of where it got up to in each logfile, and for
the next (5 minute) round, it starts looking for matches only from that
point onwards.  This means, if there's a trigger match in the log file, the
client will send it to the server in that round only.
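The position tracking described above can be modelled with a byte offset saved between runs. This is a simplified, hypothetical sketch (logfetch's real state file format and logic differ); it also shows two effects discussed later in the thread: a file that shrinks is treated as rotated and reread from the start, and deleting the state file causes old lines to be reported again.

```shell
#!/bin/sh
# Simplified model of logfetch-style position tracking (illustrative only).
# fetch_new_data LOGFILE STATEFILE prints data appended since the last run.
fetch_new_data() {
    logfile=$1 statefile=$2
    size=$(($(wc -c < "$logfile")))
    offset=$(cat "$statefile" 2>/dev/null)
    offset=${offset:-0}                 # no state yet: start at the beginning
    if [ "$size" -lt "$offset" ]; then
        offset=0                        # file shrank: assume it was rotated
    fi
    tail -c +"$((offset + 1))" "$logfile"   # tail -c +N starts at byte N (1-based)
    echo "$size" > "$statefile"
}
```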

J

list Bill Arlofski · Mon, 27 Oct 2014 18:58:28 -0400 ·

Yes, my testing over the weekend seemed to indicate that as well. JC Cleaver
described the process pretty clearly too.

My problem is that the log file in my example gets appended once a night, and
there are plenty of lines with the "trigger" I need to alert on. In
other words, the log is pretty static, and when the problem exists, it will
exist until the next run 24 hours later. I would want to keep that Xymon
msgs test yellow until it actually cleared up, not just for an arbitrary six
5-minute client cycles.

Since the msgs test works as you and JC have described, I guess my only option
would be to write a short client-side "ZimbraLicense" test which would check
the log for the trigger text, and set test color accordingly.

Other ideas?   Can I somehow hammer this square peg into a round hole?

:)


Thanks!

Bill


--
Bill Arlofski
Reverse Polarity, LLC
http://www.revpol.com/
-- Not responsible for anything below this line --

list Jeremy Laidman · Tue, 28 Oct 2014 11:05:03 +1100 ·

▸ quoted from Bill Arlofski

On 28 October 2014 09:58, Bill Arlofski <user-0b8af203a56e@xymon.invalid> wrote:

Other ideas?   Can I somehow hammer this square peg into a round hole?

You can create a dynamic file based on the logfile, and alert on that.  For
example, in client-local.cfg, something like this:

log:`LOG=/tmp/zlic.status; M=$(date +%M); [ $(expr $M % 10) -ge 5 ] && rm -f $LOG; grep "ArchivingAccountsLimit exceeded" /var/log/messages >> $LOG; [ -s $LOG ] && echo "$LOG"`:4096

I'm assuming that /var/log/messages is rotated daily.  What happens here is
that zlic.status will get the log entries from your current messages file
(updated every 5 minutes) appended to it.  If there are no log entries,
then the filename is not echoed and Xymon will ignore it (and no alerts
possible).
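For trying the scriptlet out interactively, here is the same logic expanded and commented. The function name `zlic_logfile` and its three parameters are hypothetical; in client-local.cfg the logic still has to stay on one line and avoid colons. Because the matching lines are re-appended on every run, logfetch always sees "new" matching data, and on every second run the file is emptied first, so logfetch treats it as rotated and rereads it from the start.

```shell
#!/bin/sh
# Expanded version of the one-line zlic scriptlet (illustrative sketch).
zlic_logfile() {
    srclog=$1    # e.g. /var/log/messages
    statlog=$2   # e.g. /tmp/zlic.status
    minute=$3    # normally $(date +%M)
    # In minutes x5..x9 (every second 5-minute run) empty the status file,
    # so it shrinks and logfetch restarts from the beginning.
    if [ "$(expr "$minute" % 10)" -ge 5 ]; then
        rm -f "$statlog"
    fi
    # Re-append every matching line from the source log.
    grep "ArchivingAccountsLimit exceeded" "$srclog" >> "$statlog"
    # Only name the file (so logfetch reads it) while it has content.
    if [ -s "$statlog" ]; then
        echo "$statlog"
    fi
}
```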

The trick here is that the zlic.status file is emptied only every second
run (every 10 minutes) prior to appending the log entries. By shrinking the
file size, logfetch thinks the file has been rotated, zeroes its status,
and starts looking at the file from the beginning.

Note that if you get a log entry in your messages file just prior to
rotation, then you'll only get an alert between the time the message is
detected and the messages file is rotated, which could be only a few
minutes, or even not at all if the timing isn't favourable.  In other
words, this will generate an alert that persists until the next rotation of
messages, i.e. it alerts on log entries from the last 0-24 hours.  If you
want to go for longer than that, you could perhaps grep from the current and
previous messages file, so you're alerting on any messages in the last 24-48 hours.

Another way to do this is to use a "file:" definition, similarly creating a
status file and then alarming on the file's size (non-zero indicating an
alertable log entry).  For example:

file:`LOG=/tmp/zlic.status; grep "ArchivingAccountsLimit exceeded" /var/log/messages > $LOG; echo $LOG`

Then in analysis.cfg, create a matching entry and alert on size>0.  A
down-side to this approach is that you get a particularly unhelpful message
along the lines of "FILE /tmp/zlic.status red size >0".

A third and similar way to do this is to create a file that exists only if
the licencing log is not detected.  Like so:

file:`LOG=/tmp/zlic.OK; grep "ArchivingAccountsLimit exceeded" /var/log/messages >/dev/null && rm -f $LOG || touch $LOG; echo $LOG`

Then in analysis.cfg, create a matching entry and alert on "noexist".

Yet another way to do this is to use a pseudo-file to generate a status
message.  For example:

file:`COL=green; MSG="licencing OK"; LOGS=$(grep "ArchivingAccountsLimit exceeded" /var/log/messages); [ "$LOGS" ] && { COL=red; MSG="licencing error"; }; echo "status ${MACHINE}.zlic $COL $(date) $MSG" | $XYMON $XYMSRV @`

There is no output from this pseudo-file, so Xymon will not treat it as a
file at all and will simply ignore it, apart from the side-effects of the
$XYMON command that's also run here.  This is tantamount to having
a client-side ext script, and you may simply prefer to do that.  But this
approach can be deployed centrally.
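If you prefer the ext-script route, a minimal sketch follows. It assumes the usual Xymon client environment ($XYMON, $XYMSRV, $MACHINE) is present when the client launches the script; the test name "zlic" and the log path come from the examples above, and the function name `zlic_status_message` is hypothetical. Registering the script with the client is not shown.

```shell
#!/bin/sh
# Hypothetical client-side ext script equivalent of the pseudo-file above.
LOGFILE=${LOGFILE:-/var/log/messages}
TESTNAME=zlic

zlic_status_message() {
    # Red while the licensing error is present in the log, green otherwise.
    if grep -q "ArchivingAccountsLimit exceeded" "$LOGFILE" 2>/dev/null; then
        COL=red; MSG="licencing error"
    else
        COL=green; MSG="licencing OK"
    fi
    echo "status ${MACHINE}.${TESTNAME} $COL $(date) $MSG"
}

# Send to the Xymon server only when running under the client environment.
if [ -n "$XYMON" ] && [ -n "$XYMSRV" ]; then
    "$XYMON" "$XYMSRV" "$(zlic_status_message)"
fi
```

Unlike the msgs test, this status stays red for as long as the line is in the log, which is the behaviour Bill was after.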

A few notes:
1) None of these specific examples have been tested, and may contain syntax
errors, but scriptlets like these have been used on production systems.
2) I deliberately avoided using colons and backticks, because they are
interpreted by the logfetch binary, and break the scriptlets.
3) These scriptlets take up to 15 minutes to start reporting after being
added to client-local.cfg.  When I'm testing these sort of things, I like
to bring up a xymoncmd shell, and paste in the bits between the backticks,
and look for errors or unexpected output.

J

list Nicole Beck · Tue, 28 Oct 2014 18:16:48 +0000 ·

What I’m seeing is that I get an alert for my trigger string (which has a timestamp on it), and then I keep getting alerts for the same trigger string (with the same timestamp) for the next 30 minutes. I’m not sure if anything else was appended to the log file in those 30 minutes. I stop getting the alerts after 30 minutes and don’t have to wait until the log is rotated for the alert to clear.

Nicole

list Jeremy Laidman · Wed, 29 Oct 2014 12:37:03 +1100 ·

Nicole

▸ quoted from Nicole Beck


On 29 October 2014 05:16, Nicole Beck <user-80034b0579c6@xymon.invalid> wrote:

What I’m seeing is that I get an alert for my trigger string (which has a
timestamp on it), and then I keep getting alerts for the same trigger
string (with the same timestamp) for the next 30 minutes.

How often do you get the repeated alerts?  Or how many in that 30 minutes?

▸ quoted from Nicole Beck

I’m not sure if anything else was appended to the log file in those 30
minutes. I stop getting the alerts after 30 minutes and don’t have to wait
until the log is rotated for the alert to clear.

Do you have ALERTREPEAT defined in xymonserver.cfg?  The default is 30
minutes, but you may have it less than that.

Similarly, do you have "REPEAT" defined in alerts.cfg for the rule matching
these alerts?  (The "REPEAT" value in alerts.cfg defaults to the setting of
ALERTREPEAT.)

Is your message status (red?) staying non-green for the 30 minutes, or
non-green for only a short time, or flapping like red/green/red/green?

The way messages get to Xymon is via the client data.  So during an
"event" you can click on the "Client data available" link at the bottom of
your "msgs" page for the host, and it should show you all of the client
data, and you can search for the logfilename to see what log lines the
client sent to the server.  Or you can click on the logfile name on the
"msgs" page for a modified client data report showing just the log lines
for that logfile.

What I'm trying to understand is whether you are getting the same messages
sent multiple times from the client causing multiple events, or whether the
one event is generating multiple alerts.

From what I can tell, a red "msgs" status will stay red for only one
5-minute client cycle.  The next time the client sends its client data
report, if the logfile in question has no new matching lines, it will
actively generate a green status.

J

list Nicole Beck · Mon, 3 Nov 2014 20:16:48 +0000 ·

Hi Jeremy,

I got 7, one every 5 minutes.

ALERTREPEAT is set to 30 in hobbitserver.cfg.

Our hobbit-alerts.cfg file has “DURATION>1m REPEAT=5m” for the msgs test for that machine.

As far as I could tell, the messages status is yellow and it is staying yellow, not flapping.  When I click on history in the GUI, it shows that it was yellow for 35 minutes.

It looks like it’s the same message that we keep getting an alert for.  We had an incident on Friday, where we got 7 email alerts.  Below are examples of the portion of the email that showed the yellow alert.  The timestamp in the log is 21:00:16 for all of the alerts, so it’s the same message.

Email alert 1:

yellow System logs at Fri Oct 31 21:01:10 EDT 2014 


&yellow Warnings in <a href="/xymon-cgi/bb-hostsvc.sh?CLIENT=bbgroupa-web4.syr.edu&amp;SECTION=msgs:/usr/local/blackboard/logs/tomcat/activemq.txt">/usr/local/blackboard/logs/tomcat/activemq.txt</a>

<pre>

&yellow WARN 2014-10-31 21:00:16,480 ActiveMQ NIO Worker 30057 org.apache.activemq.broker.TransportConnection.Transport - Transport Connection to: tcp://128.230.126.194:49464 failed: java.io.EOFException </pre>

Email alert 2:

yellow System logs at Fri Oct 31 21:06:10 EDT 2014 


&yellow Warnings in <a href="/xymon-cgi/bb-hostsvc.sh?CLIENT=bbgroupa-web4.syr.edu&amp;SECTION=msgs:/usr/local/blackboard/logs/tomcat/activemq.txt">/usr/local/blackboard/logs/tomcat/activemq.txt</a>

<pre>

&yellow WARN 2014-10-31 21:00:16,480 ActiveMQ NIO Worker 30057 org.apache.activemq.broker.TransportConnection.Transport - Transport Connection to: tcp://128.230.126.194:49464 failed: java.io.EOFException </pre>


Email alert 7:

yellow System logs at Fri Oct 31 21:31:11 EDT 2014 


&yellow Warnings in <a href="/xymon-cgi/bb-hostsvc.sh?CLIENT=bbgroupa-web4.syr.edu&amp;SECTION=msgs:/usr/local/blackboard/logs/tomcat/activemq.txt">/usr/local/blackboard/logs/tomcat/activemq.txt</a>

<pre>

&yellow WARN 2014-10-31 21:00:16,480 ActiveMQ NIO Worker 30057 org.apache.activemq.broker.TransportConnection.Transport - Transport Connection to: tcp://128.230.126.194:49464 failed: java.io.EOFException </pre>


The Hobbit acknowledge code that appears in the subject of the emails is all the same code.  Maybe we are getting multiple email messages because we did not acknowledge the alert. But, if the string does not appear again in the file in the next cycle, shouldn’t it turn back to green?

When it happens again, I will try to look at the “client data available” link .

I hope this helps.

Nicole

list Jeremy Laidman · Fri, 7 Nov 2014 13:54:31 +1100 ·

▸ quoted from Nicole Beck

On 4 November 2014 07:16, Nicole Beck <user-80034b0579c6@xymon.invalid> wrote:

Our hobbit-alerts.cfg file has “DURATION>1m REPEAT=5m” for the msgs test
for that machine.

You've configured REPEAT=5m meaning you want Xymon to resend alerts every 5
minutes until green.  Is this what you want?
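A quick consistency check on Nicole's numbers, assuming one mail when the alert first fires and one more per REPEAT interval while the status stays non-green (the function name `alerts_sent` is hypothetical). Her first mail was at 21:01 and the last at 21:31, 30 minutes apart, which with REPEAT=5m gives exactly the 7 mails she received.

```shell
#!/bin/sh
# Number of alert mails sent over a span, given the REPEAT interval.
alerts_sent() {
    # $1: minutes between the first and last alert, $2: REPEAT in minutes
    echo $(( $1 / $2 + 1 ))
}

alerts_sent 30 5    # prints 7
```

So the repeated mails by themselves are expected from REPEAT=5m; the open question is why the status stayed yellow for 30+ minutes.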

This is a different issue to "msgs" staying yellow for more than 5
minutes.  Nearly all of my "msgs" events last for 5 minutes.

Your symptoms are consistent with 6 or more client data messages containing
the same (or new) log messages.  So I think you should look at the client
data when it next occurs and see if it's being updated from one client data
message to the next.

It's interesting that the alert emails have the same log entries,
suggesting that the state mechanism is not working on the Xymon client.
This would happen if something was erasing the logfetch state file on the
Xymon client, named $XYMONTMP/logfetch.$MACHINEDOTS.status.  If logfetch
doesn't know where it got up to in a logfile, it has to start from the
beginning each time, and it will report the same messages in the client
data, each time it runs.

Another possibility, though unlikely, is that the logfile is being shortened
each time.  When logfetch detects that a logfile is shorter than the last
time it ran, it assumes that the logfile rotated, and so it resets its
state and goes back to the start of the logfile.  How is the logfile being
generated?

Cheers
Jeremy
