duration of MSG red status
list Nicole Beck
Hello, I think this has been asked before, but it was a long time ago and I wondered if something has changed since then. How long will the status of MSGS stay red? We just set up monitoring recently and it seems like it stays red for about 30 minutes. Is that normal? Is this configurable? We are running Xymon server 4.2.3. Thanks, Nicole
list Japheth Cleaver
On Fri, October 24, 2014 8:13 am, Nicole Beck wrote:
Hello, I think this has been asked before, but it was a long time ago and I wondered if something changed since then. How long will the status of MSGS stay red? We just setup monitoring recently and it seems like it stays red for about 30 minutes. Is that normal? Is this configurable? We are running Xymon server 4.2.3.
Nicole,
The duration of the 'msgs' test is actually a function of how many cycles back logfetch will scan for content to include in the log data going forward (the actual calculation of the color is done via the regexes applied by xymond_client).
logfetch will look back 6 runtime-positions which, combined with the default xymonclient run interval of 5m, ends up causing the 30m figure. The former value is compiled in; however, the run frequency is configurable. (We run our clients on 100s cycles, which means our msgs tests last for 10-12m.)
I'm not sure how easy it would be to make the 6x positions dynamic or a runtime option, but that would be nice.
Regards,
-jc
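The arithmetic in this reply can be checked with a few lines of shell. This is only a sketch: the look-back count of 6 and the 300 s and 100 s intervals are the figures quoted above, and the function name msgs_duration is made up for illustration.

```shell
#!/bin/sh
# Rough estimate of how long a 'msgs' status lingers:
# look-back positions (6, compiled into logfetch according to this thread)
# multiplied by the client run interval.
LOOKBACK=6

msgs_duration() {
    # $1 = client run interval in seconds; prints the duration in whole minutes
    echo $(( LOOKBACK * $1 / 60 ))
}

echo "default 300s interval: $(msgs_duration 300) minutes"
echo "100s interval: $(msgs_duration 100) minutes"
```

With these inputs the sketch prints 30 and 10 minutes, which matches the 30m figure and the lower end of the 10-12m range mentioned above.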
list Ryan Novosielski
On Oct 24, 2014, at 14:50, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:
logfetch will look back 6 runtime-positions which, combined with the default xymonclient run interval of 5m, ends up causing the 30m figure. The former value is compiled in, however the run frequency is configurable.
Could have sworn the number of lines to look at was configurable too. Maybe I'm thinking of BB?
list Bill Arlofski
On 10/24/2014 05:41 PM, Novosielski, Ryan wrote:
Could have sworn the number of lines to look at was configurable too. Maybe I'm thinking of BB?
Hi Ryan,
I was thinking the same thing, but I think we may be thinking of the max bytes to send. From the client-local.cfg docs:
log:/var/log/messages:10240 - The log:FILENAME:SIZE line defines the filename of the log, and the maximum amount of data (in bytes) to send to the Xymon server.
This thread caused me to start thinking about a similar problem I have not had time to look into for a long time, and I think Xymon has an option that might fix both of our problems.
My situation: I have a custom script on a server that checks licenses for Zimbra email archiving accounts. If all the available "archiving account" licenses have been used, and an attempt is made to create an archiving account, the script will log:
"error: ArchivingAccountsLimit exceeded: 163/125"
When I set up this script and the Xymon logfile test, I tested it, Xymon properly reported yellow, and I thought I was set. I didn't realize that it was only staying yellow for 30 minutes. So once my testing was done, I set the script to run at 2:00am daily and thought I was done.
Unfortunately, this just means that every morning at 2am this test goes yellow for 30 minutes and is green by the time the IT people come in. (They do not get/want alerts for anything other than some temperatures currently.)
So... while re-investigating this, I see that client-local.cfg has an optional "trigger PATTERN" line for logfiles, which the docs describe as follows:
"The trigger PATTERN line (optional) is used only when there is more data in the log than the maximum size set in the "log:FILENAME:SIZE" line. The "trigger" pattern is then used to find particularly interesting lines in the logfile - these will always be sent to the Xymon server. After picking out the "trigger" lines, any remaining space up to the maximum size is filled in with the most recent entries from the logfile. "PATTERN" is a regular expression."
I have not tested this, but it would seem to indicate that it would cause the client to send the Xymon server all the lines that match the trigger pattern (regardless of how far back in time they go in the logfile), which should cause the test to stay non-green until the logfile is rotated and no more lines with the trigger pattern exist. Can anyone confirm or deny this functionality?
Bill
--
Bill Arlofski
Reverse Polarity, LLC
http://www.revpol.com/
-- Not responsible for any advertising below this line --
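For reference, the log and trigger lines discussed in this message would sit together in client-local.cfg roughly as follows. This is only a sketch: the [zimbra.example.com] section name and the choice of pattern are hypothetical, and the 10240-byte limit is simply the one from the documentation excerpt.

```text
[zimbra.example.com]
log:/var/log/messages:10240
trigger ArchivingAccountsLimit
```

As the replies in this thread go on to explain, the trigger line only influences which lines are picked when the log data exceeds the size limit; it does not by itself keep the status non-green.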
list Jeremy Laidman
On 27 October 2014 01:26, Bill Arlofski <user-0b8af203a56e@xymon.invalid> wrote:
I have not tested this, but it would seem to indicate that it would cause the client to send the Xymon server all the lines that match the trigger pattern (regardless of how far back in time they go in the logfile) which should cause the test to stay non-green until the logfile is rotated and no more lines with the trigger pattern exist.
I haven't verified this, but my understanding of how the "logfetch" process works is that it keeps state of where it got up to in each logfile, and for the next (5 minute) round, it starts looking for matches only from that point onwards. This means, if there's a trigger match in the log file, the client will send it to the server in that round only. J
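The state-keeping described here can be modelled in a few lines of shell. This is a simplified sketch and not the real logfetch code: it remembers a single position per file (earlier in the thread logfetch is said to look back over 6 positions), and the names fetch_new, app.log and state are invented for the demonstration.

```shell
#!/bin/sh
# Simplified model of offset tracking; NOT the real logfetch implementation.
# fetch_new LOGFILE STATEFILE prints only the bytes appended since the last call.
fetch_new() {
    log=$1
    state=$2
    size=$(wc -c < "$log" | tr -d ' ')
    last=0
    if [ -f "$state" ]; then
        last=$(cat "$state")
    fi
    # A file that is shorter than last time is treated as rotated: start over.
    if [ "$size" -lt "$last" ]; then
        last=0
    fi
    # tail -c +N starts output at byte N (1-based), skipping what was already seen.
    tail -c +"$((last + 1))" "$log"
    echo "$size" > "$state"
}

tmp=$(mktemp -d)
printf 'line one\n' > "$tmp/app.log"
first=$(fetch_new "$tmp/app.log" "$tmp/state")    # whole file on the first run
printf 'error: trigger text\n' >> "$tmp/app.log"
second=$(fetch_new "$tmp/app.log" "$tmp/state")   # only the appended line
third=$(fetch_new "$tmp/app.log" "$tmp/state")    # nothing new
printf 'fresh\n' > "$tmp/app.log"                 # file shrinks
fourth=$(fetch_new "$tmp/app.log" "$tmp/state")   # read again from the start
printf '%s|%s|%s|%s\n' "$first" "$second" "$third" "$fourth"
rm -rf "$tmp"
```

In this model a matching line is reported exactly once, in the round after it was written, which is the behaviour described above for a client that remembers where it got up to.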
list Bill Arlofski
On 10/27/2014 04:45 PM, Jeremy Laidman wrote:
I haven't verified this, but my understanding of how the "logfetch" process works is that it keeps state of where it got up to in each logfile, and for the next (5 minute) round, it starts looking for matches only from that point onwards. This means, if there's a trigger match in the log file, the client will send it to the server in that round only.
Yes, my testing over the weekend seemed to indicate that as well. JC Cleaver described the process pretty clearly too.
My problem is that the log file in my example gets appended to once per night, and there are plenty of lines with the "trigger" I need to alert on. In other words, the log is pretty static: when the problem exists, it will exist until the next run 24 hours later, and I would want to keep that Xymon msgs test yellow until it has actually cleared up, not for an arbitrary 6 x 5 minute client reports.
Since the msgs test works as you and JC have described, I guess my only option would be to write a short client-side "ZimbraLicense" test which would check the log for the trigger text and set the test color accordingly.
Other ideas? Can I somehow hammer this square peg into a round hole? :)
Thanks!
Bill
--
Bill Arlofski
Reverse Polarity, LLC
http://www.revpol.com/
-- Not responsible for anything below this line --
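A client-side test of the kind proposed here might look like the sketch below. Several things are assumptions made for illustration: the test name zlic, the default log path /var/log/messages, and the fallback to printing the message when the Xymon client variables XYMON and XYMSRV are not set. How the script gets scheduled on the client is not covered.

```shell
#!/bin/sh
# Hypothetical client-side test "zlic"; paths and the test name are assumptions.
LOGFILE=${LOGFILE:-/var/log/messages}
PATTERN="ArchivingAccountsLimit exceeded"

# Print "yellow" if the pattern is present in the given file, otherwise "green".
# Note: an unreadable or missing file also yields "green" in this sketch.
zlic_color() {
    if grep -q "$PATTERN" "$1" 2>/dev/null; then
        echo yellow
    else
        echo green
    fi
}

COLOR=$(zlic_color "$LOGFILE")
STATUS="status ${MACHINE:-localhost}.zlic $COLOR $(date) archiving licence check"

if [ -n "${XYMON:-}" ] && [ -n "${XYMSRV:-}" ]; then
    # Inside the Xymon client environment: send the status to the server.
    "$XYMON" "$XYMSRV" "$STATUS"
else
    # Outside it (for example when testing by hand): just show the message.
    echo "$STATUS"
fi
```

Because the colour is recomputed from the log on every run, the status stays yellow for as long as the matching line is still in the file, rather than for a fixed number of client cycles.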
list Jeremy Laidman
On 28 October 2014 09:58, Bill Arlofski <user-0b8af203a56e@xymon.invalid> wrote:
Other ideas? Can I somehow hammer this square peg into a round hole?
You can create a dynamic file based on the logfile, and alert on that. For
example, in client-local.cfg, something like this:
log:`LOG=/tmp/zlic.status; M=$(date +%M); [ $(expr $M % 10) -ge 5 ] && rm -f $LOG; grep "ArchivingAccountsLimit exceeded" /var/log/messages >> $LOG; [ -s $LOG ] && echo "$LOG"`:4096
I'm assuming that /var/log/messages is rotated daily. What happens here is
that zlic.status will get the log entries from your current messages file
(updated every 5 minutes) appended to it. If there are no log entries,
then the filename is not echoed and Xymon will ignore it (and no alerts
possible).
The trick here is that the zlic.status file is emptied only every second
run (every 10 minutes) prior to appending the log entries. By shrinking the
file size, logfetch thinks the file has been rotated, zeroes its status,
and starts looking at the file from the beginning.
Note that if you get a log entry in your messages file just prior to
rotation, then you'll only get an alert between the time the message is
detected and the messages file is rotated, which could be only a few
minutes, or even not at all if the timing isn't favourable. So in other
words, this will generate an alert that persists until the next rotation of
messages, or messages in the last 0-24 hours. If you want to go for longer
than that, you could perhaps grep from the current and previous messages
file, so you're alerting on any messages in the last 24-48 hours.
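The one-line log scriptlet above is dense, so here is the same logic unrolled into a function for study. It is a sketch only: the function name refresh_status and the throwaway demo files are invented, while the pattern and the delete-on-alternate-runs rule are taken from the scriptlet.

```shell
#!/bin/sh
# Readable version of the one-line log scriptlet. A sketch for study only.
# refresh_status MINUTE SOURCE STATUS
#   MINUTE: two-digit minute of the hour, as printed by `date +%M`
#   SOURCE: log to search (the message assumes /var/log/messages)
#   STATUS: derived file handed to logfetch (the message uses /tmp/zlic.status)
refresh_status() {
    minute=$1
    src=$2
    status=$3
    # When the last digit of the minute is 5-9 (every second 5-minute run),
    # delete the derived file so that it shrinks and looks rotated.
    if [ "$(expr "$minute" % 10)" -ge 5 ]; then
        rm -f "$status"
    fi
    # Append the current matches; grep exits 1 when there are none, which is fine.
    grep "ArchivingAccountsLimit exceeded" "$src" >> "$status" 2>/dev/null || true
    # Print the file name only if the file is non-empty.
    if [ -s "$status" ]; then
        echo "$status"
    fi
}

# Demo against throwaway files instead of the real log.
demo=$(mktemp -d)
echo "error: ArchivingAccountsLimit exceeded: 163/125" > "$demo/messages"
refresh_status 02 "$demo/messages" "$demo/zlic.status" >/dev/null
refresh_status 02 "$demo/messages" "$demo/zlic.status" >/dev/null
grown=$(wc -l < "$demo/zlic.status" | tr -d ' ')
refresh_status 07 "$demo/messages" "$demo/zlic.status" >/dev/null
shrunk=$(wc -l < "$demo/zlic.status" | tr -d ' ')
echo "lines after two append runs: $grown, after a reset run: $shrunk"
rm -rf "$demo"
```

The demo shows the derived file growing on append runs and dropping back on a reset run, which is the size change the scriptlet relies on to make logfetch reread the file.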
Another way to do this is to use a "file:" definition, similarly creating a
status file and then alarming on the file's size (non-zero indicating an
alertable log entry). For example:
file:`LOG=/tmp/zlic.status; grep "ArchivingAccountsLimit exceeded" /var/log/messages > $LOG; echo $LOG`
Then in analysis.cfg, create a matching entry and alert on size>0. A
down-side to this approach is that you get a particularly unhelpful message
along the lines of "FILE /tmp/zlic.status red size >0".
A third and similar way to do this is to create a file that exists only if
the licencing log is not detected. Like so:
file:`LOG=/tmp/zlic.OK; grep "ArchivingAccountsLimit exceeded" /var/log/messages >/dev/null && rm -f $LOG || touch $LOG; echo $LOG`
Then in analysis.cfg, create a matching entry and alert on "noexist".
Yet another way to do this is to use a pseudo-file to generate a status
message. For example:
file:`COL=green; MSG="licencing OK"; LOGS=$(grep "ArchivingAccountsLimit exceeded" /var/log/messages); [ "$LOGS" ] && { COL=red; MSG="licencing error"; }; echo "status ${MACHINE}.zlic $COL $(date) $MSG" | $XYMON $XYMSRV @`
There is no output from this pseudo-file, so Xymon will not take any "file"
connotations from it and will simply ignore it, except for the side-effects
from the $XYMON command that's also run here. This is tantamount to having
a client-side ext script, and you may simply prefer to do that. But this
can be deployed centrally.
A few notes:
1) None of these specific examples have been tested, and may contain syntax
errors, but scriptlets like these have been used on production systems.
2) I deliberately avoided using colons and backticks, because they are
interpreted by the logfetch binary, and break the scriptlets.
3) These scriptlets take up to 15 minutes to start reporting after being
added to client-local.cfg. When I'm testing this sort of thing, I like
to bring up a xymoncmd shell, and paste in the bits between the backticks,
and look for errors or unexpected output.
J
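Note 3 suggests pasting the text between the backticks into a shell and looking for errors or unexpected output. A small helper can make that check repeatable. This is a sketch under the assumption that the scriptlet text is ultimately run by a shell; the helper name try_scriptlet is invented here.

```shell
#!/bin/sh
# try_scriptlet 'SHELL TEXT'
#   Run the text with sh -c, then report its exit status and everything it
#   printed (stdout and stderr together), so errors are easy to spot.
try_scriptlet() {
    out=$(sh -c "$1" 2>&1) && rc=0 || rc=$?
    echo "exit=$rc output=[$out]"
}

# A scriptlet that prints a file name, as the log and file examples do.
try_scriptlet 'echo /tmp/zlic.status'
# A scriptlet whose test fails and which therefore prints nothing.
try_scriptlet '[ -s /nonexistent/file ] && echo found'
```

Running it from a shell started through xymoncmd, as the note suggests, means variables such as $XYMON and $XYMSRV are set the same way they would be for the client.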
list Nicole Beck
What I’m seeing is that I get an alert for my trigger string (which has a timestamp on it), and then I keep getting alerts for the same trigger string (with the same timestamp) for the next 30 minutes. I’m not sure if anything else was appended to the log file in that 30 minutes. I stop getting the alerts after 30 minutes and don’t have to wait until the log is rotated for the alert to clear. Nicole
list Jeremy Laidman
Nicole
On 29 October 2014 05:16, Nicole Beck <user-80034b0579c6@xymon.invalid> wrote:
What I’m seeing is that I get an alert for my trigger string (which has a timestamp on it), and then I keep getting alerts for the same trigger string (with the same timestamp) for the next 30 minutes.
How often do you get the repeated alerts? Or how many in that 30 minutes?
I’m not sure if anything else was appended to the log file in that 30 minutes. I stop getting the alerts after 30 minutes and don’t have to wait until the log is rotated for the alert to clear.
Do you have ALERTREPEAT defined in xymonserver.cfg? The default is 30 minutes, but you may have it less than that. Similarly, do you have "REPEAT" defined in alerts.cfg for the rule matching these alerts? (The "REPEAT" value in alerts.cfg defaults to the setting of ALERTREPEAT.)
Is your message status (red?) staying non-green for the 30 minutes, or non-green for only a short time, or flapping like red/green/red/green?
The way messages get to Xymon is via the client data. So during an "event" you can click on the "Client data available" link at the bottom of your "msgs" page for the host, and it should show you all of the client data, and you can search for the logfilename to see what log lines the client sent to the server. Or you can click on the logfile name on the "msgs" page for a modified client data report showing just the log lines for that logfile.
What I'm trying to understand is whether you are getting the same messages sent multiple times from the client causing multiple events, or whether the one event is generating multiple alerts.
From what I can tell, a red "msgs" status will stay red for only one 5-minute client cycle. The next time the client sends its client data report, if the logfile in question has no new matching lines, it will actively generate a green status.
J
list Nicole Beck
Hi Jeremy,
I got 7, one every 5 minutes. ALERTREPEAT is set to 30 in hobbitserver.cfg. Our hobbit-alerts.cfg file has “DURATION>1m REPEAT=5m” for the msgs test for that machine.
As far as I could tell, the messages status is yellow and it is staying yellow, not flapping. When I click on history in the GUI, it shows that it was yellow for 35 minutes. It looks like it’s the same message that we keep getting an alert for.
We had an incident on Friday, where we got 7 email alerts. Below are examples of the portion of the email that showed the yellow alert. The timestamp in the log is 21:00:16 for all of the alerts, so it’s the same message.
Email alert 1:
yellow System logs at Fri Oct 31 21:01:10 EDT 2014
&yellow Warnings in <a href="/xymon-cgi/bb-hostsvc.sh?CLIENT=bbgroupa-web4.syr.edu&SECTION=msgs:/usr/local/blackboard/logs/tomcat/activemq.txt">/usr/local/blackboard/logs/tomcat/activemq.txt</a>
<pre>
&yellow WARN 2014-10-31 21:00:16,480 ActiveMQ NIO Worker 30057 org.apache.activemq.broker.TransportConnection.Transport - Transport Connection to: tcp://128.230.126.194:49464 failed: java.io.EOFException
</pre>
Email alert 2:
yellow System logs at Fri Oct 31 21:06:10 EDT 2014
&yellow Warnings in <a href="/xymon-cgi/bb-hostsvc.sh?CLIENT=bbgroupa-web4.syr.edu&SECTION=msgs:/usr/local/blackboard/logs/tomcat/activemq.txt">/usr/local/blackboard/logs/tomcat/activemq.txt</a>
<pre>
&yellow WARN 2014-10-31 21:00:16,480 ActiveMQ NIO Worker 30057 org.apache.activemq.broker.TransportConnection.Transport - Transport Connection to: tcp://128.230.126.194:49464 failed: java.io.EOFException
</pre>
Email alert 7:
yellow System logs at Fri Oct 31 21:31:11 EDT 2014
&yellow Warnings in <a href="/xymon-cgi/bb-hostsvc.sh?CLIENT=bbgroupa-web4.syr.edu&SECTION=msgs:/usr/local/blackboard/logs/tomcat/activemq.txt">/usr/local/blackboard/logs/tomcat/activemq.txt</a>
<pre>
&yellow WARN 2014-10-31 21:00:16,480 ActiveMQ NIO Worker 30057 org.apache.activemq.broker.TransportConnection.Transport - Transport Connection to: tcp://128.230.126.194:49464 failed: java.io.EOFException
</pre>
The Hobbit acknowledge code that appears in the subject of the emails is the same in all of them. Maybe we are getting multiple email messages because we did not acknowledge the alert. But, if the string does not appear again in the file in the next cycle, shouldn’t it turn back to green?
When it happens again, I will try to look at the “client data available” link. I hope this helps.
Nicole
list Jeremy Laidman
On 4 November 2014 07:16, Nicole Beck <user-80034b0579c6@xymon.invalid> wrote:
Our hobbit-alerts.cfg file has “DURATION>1m REPEAT=5m” for the msgs test for that machine.
You've configured REPEAT=5m, meaning you want Xymon to resend alerts every 5 minutes until green. Is this what you want?
This is a different issue to "msgs" staying yellow for more than 5 minutes. Nearly all of my "msgs" events last for 5 minutes. Your symptoms are consistent with 6 or more client data messages containing the same (or new) log messages. So I think you should look at the client data when it next occurs and see if it's being updated from one client data message to the next.
It's interesting that the alert emails have the same log entries, suggesting that the state mechanism is not working on the Xymon client. This would happen if something was erasing the logfetch state file on the Xymon client, named $XYMONTMP/logfetch.$MACHINEDOTS.status. If logfetch doesn't know where it got up to in a logfile, it has to start from the beginning each time, and it will report the same messages in the client data each time it runs.
Another possibility, though unlikely, is that the logfile is being shortened each time. When logfetch detects that a logfile is shorter than the last time it ran, it assumes that the logfile was rotated, so it resets its state and goes back to the start of the logfile. How is the logfile being generated?
Cheers
Jeremy
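Both possibilities raised here (a state file that keeps disappearing, or a logfile that keeps getting shorter) can be checked by recording a file's size between runs. The following is a sketch only: the helper name check_file and the record-file idea are invented for illustration and are not part of Xymon.

```shell
#!/bin/sh
# check_file FILE RECORD
#   Compare FILE's current size with the size noted in RECORD on the previous
#   call, then print one of: missing, first, shrunk, same, grown.
check_file() {
    if [ ! -e "$1" ]; then
        echo missing
        return 0
    fi
    now=$(wc -c < "$1" | tr -d ' ')
    if [ -f "$2" ]; then
        before=$(cat "$2")
    else
        before=
    fi
    echo "$now" > "$2"
    if [ -z "$before" ]; then
        echo first
    elif [ "$now" -lt "$before" ]; then
        echo shrunk
    elif [ "$now" -gt "$before" ]; then
        echo grown
    else
        echo same
    fi
}

# Demo on throwaway files.
demo=$(mktemp -d)
printf 'one\ntwo\n' > "$demo/app.log"
r1=$(check_file "$demo/app.log" "$demo/size")
printf 'three\n' >> "$demo/app.log"
r2=$(check_file "$demo/app.log" "$demo/size")
printf 'one\n' > "$demo/app.log"
r3=$(check_file "$demo/app.log" "$demo/size")
rm -f "$demo/app.log"
r4=$(check_file "$demo/app.log" "$demo/size")
echo "$r1 $r2 $r3 $r4"
rm -rf "$demo"
```

Run every few minutes against the monitored logfile and against the logfetch state file named above, a result of "shrunk" or "missing" between client cycles would point to the causes described in this message.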