Hello Henrik,
two things we can do:
1) Add "--checkpoint-file=$BBTMP/alert.chk --checkpoint-interval=600" to
the hobbitd_alert command in hobbitlaunch.cfg. That way it will
remember all active alerts when you restart Hobbit.
I'll do that ASAP (this coming Monday). That will certainly resolve this issue.
2) When a new alert was first seen (also after a restart of Hobbit), the
duration was reset to 0 - instead of using the information Hobbit
already had about when the status change occurred. I've changed this
in the code, so that it picks up the duration of the alert from the
timestamp we keep for when the last status change happened.
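
A minimal sketch of the idea in 2), not the actual Hobbit code: the duration is derived from the stored time of the last status change rather than from when the alert module first saw the alert. The function names, the epoch-seconds timestamps, and the 20-minute default are assumptions for illustration only.

```python
import time


def alert_duration(last_change, now=None):
    """Seconds since the last status change (hypothetical helper).

    last_change and now are epoch seconds. Using the stored change time
    means a restart does not reset the duration to 0.
    """
    if now is None:
        now = time.time()
    # Clamp to 0 in case of clock skew between the two timestamps.
    return max(0, int(now - last_change))


def delayed_step_due(last_change, now=None, delay_minutes=20):
    """True once the alert has lasted at least delay_minutes."""
    return alert_duration(last_change, now) >= delay_minutes * 60
```

With this approach, a delayed step (such as a pager script 20 minutes after the first mail) would still trigger on schedule after a restart, because the elapsed time is measured from the status change itself.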
OK, but that useful addition is for upcoming releases.
However, I think I found out why the entire problem showed up in the
first place. I had an alert-config that first mailed on an occurring
event and, if that was not dealt with properly, ran a pager script 20
minutes later. After an evening of applying (OS) patches, a reboot,
etc., it did not work anymore. Eventually I thought that it had to do
with an alert-config modification, resulting in this
email conversation.
As suggested, I checked the alerttrace.log, but could not find a
reason why this problem happened (I changed the pager script to mail, but
with no result). It *does* work fine when *all* the alerts are processed
at the same time!
After going through the mailing list and the Changes file for each version, I think
it can be traced back to a known bug in Hobbit that is to be fixed in
4.1.2; see my mail from August 19th, 11:42.
Since we are running 4.0.4, I'm wondering what the wise thing to do is.
The workaround does work fine now (we are a 24x7 university), so I'm
thinking of waiting until 4.1.2 reaches stable status, since 4.1.1
does not fix this particular bug.
Regards, Peter