Xymon Mailing List Archive

Stale alerts not being purged

2 messages in this thread

John Thurston · Thu, 17 Nov 2022 09:04:33 -0900
My Xymon (Xymon 4.3.30-1.el7.terabithia) no longer notices when it is time to stop sending email alerts.

A customer will ping me, saying "I'm still getting emails for a problem I fixed 10 days ago!"

I find the messages in question in /notifications.log/. Yep, there are a lot of them. I can see the test recovered ages ago, so there should be no more notifications.

If I look in /alert.chk/, I can see the host:test in question.

If I restart xymon, /alert.log/ gets a bunch of "Stale alert found" lines, but the corresponding entries remain in /alert.chk/.

The only way I have found to clean this up is to grep the 'Stale' host:test pairs out of /alert.log/, stop xymon, feed those pairs through sed to delete the matching lines from /alert.chk/, and restart xymon.
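Before deleting anything, a read-only check can confirm the situation described above. The sketch below lists the host:test pairs that /alert.log/ flagged as stale on a given day and reports which of them still have a line in /alert.chk/. The function names are hypothetical, and the log-field position and the host|test prefix of /alert.chk/ lines are assumptions borrowed from the cleanup script in this thread, not from Xymon documentation.

```shell
#!/bin/bash
# Read-only check: which host:test pairs did alert.log flag as
# "Stale alert" on a given day, and are they still in alert.chk?
# ASSUMPTIONS (borrowed from the cleanup script in this thread, not
# from Xymon documentation): the host:test pair is the 6th
# space-separated field of the log line, and each alert.chk line
# begins with "host|test".

# Print the unique stale host:test pairs logged on DATE.
# Usage: list_stale_pairs LOGFILE DATE   (DATE as YYYY-MM-DD)
list_stale_pairs() {
    grep -E "^$2 ............... Stale alert " "$1" | cut -d " " -f 6 | sort -u
}

# Print each stale pair that still has a matching alert.chk line.
# Usage: stale_still_checkpointed LOGFILE CHECKPOINTFILE [DATE]
stale_still_checkpointed() {
    local logfile=$1 checkpointfile=$2 day=${3:-$(date +%Y-%m-%d)}
    local pair prefix
    list_stale_pairs "$logfile" "$day" | while IFS= read -r pair; do
        prefix=${pair//:/|}    # host:test -> host|test
        # Literal prefix match (no regex), so dots in hostnames are safe
        if awk -v p="$prefix" 'index($0, p) == 1 { found = 1; exit }
                               END { exit !found }' "$checkpointfile"; then
            echo "$pair"
        fi
    done
}
```

Run after a restart, e.g. stale_still_checkpointed /var/log/xymon/alert.log /var/lib/xymon/tmp/alert.chk; any pair it prints was reported as stale but is still sitting in the checkpoint file.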


Anyone have any ideas what's wrong here?

-- 
--
Do things because you should, not just because you can.

John Thurston    XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Department of Administration
State of Alaska
John Thurston · Tue, 27 Dec 2022 08:46:40 -0900
I've been forced to implement the following on a daily Christmas tree timer :/

#!/bin/bash
TODAY=$( date +%Y-%m-%d )
logfile=/var/log/xymon/alert.log
checkpointfile=/var/lib/xymon/tmp/alert.chk

# Use only the script's base name so mktemp gets a valid template
z=$( mktemp -p /tmp "$(basename "$0").XXXX" )
trap 'rm -f "$z"' EXIT

# Restarting the daemon is the only way I have found
# to generate the 'Stale' lines in the log file
systemctl restart xymon
sleep 120

# Find the stale host:test pairs reported today in the log file and
# build a file of sed 'delete' commands such as /^host|test/d
# (dots are escaped so hostnames match literally)
grep -E "^$TODAY ............... Stale alert " "$logfile" | cut -d " " -f 6 \
  | sed 's/\./\\./g' | tr ':' '|' | sed 's#^#/^# ; s#$#/d#' > "$z"

# Delete the stale entries from the checkpoint file while xymond is stopped,
# then start xymon again even if the edit failed
systemctl stop xymon && sed -i -f "$z" "$checkpointfile"
systemctl start xymon
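The command-building step of the script above can also be exercised on sample data without touching xymon. The sketch below wraps it in two hypothetical functions and additionally escapes dots, so that a hostname such as www.example.com is matched literally rather than as a regex wildcard. Like the script, it assumes /alert.chk/ lines begin with host|test, and it relies on GNU sed's -i.

```shell
#!/bin/bash
# Hypothetical helper: turn host:test pairs (one per line on stdin) into
# sed delete commands of the form /^host|test/d, escaping dots so a
# hostname such as www.example.com only matches itself.
# In a basic regex an unescaped | is a literal character.
pairs_to_sed() {
    sed 's/\./\\./g' | tr ':' '|' | sed 's#^#/^# ; s#$#/d#'
}

# Hypothetical helper: delete the given pairs from a checkpoint file in place.
# ASSUMPTION: checkpoint lines start with "host|test" (as in the script above).
# Usage: purge_pairs CHECKPOINTFILE < pairs.txt
purge_pairs() {
    local cmds rc
    cmds=$(mktemp) || return 1
    pairs_to_sed > "$cmds"
    sed -i -f "$cmds" "$1"
    rc=$?
    rm -f "$cmds"
    return $rc
}
```

For example, printf 'www.example.com:cpu\n' | pairs_to_sed prints /^www\.example\.com|cpu/d. Note that, as in the original pattern, /^host|test/ also matches a longer test name that starts with the same text (cpu would also match cpuload).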


If anyone can offer any insight into /why/ my xymon isn't purging dead alerts, I'd love to hear it.
