Xymon Mailing List Archive

acknowledgment for a yellow alert doesn't seem to work

10 messages in this thread

list Tom Georgoulias · Wed, 06 Apr 2005 15:32:29 -0400 ·
Figures, just as I'm about to host a session to show off Hobbit to the support team and roll it out into production, I hit a situation that has me scratching my head.

I've got an add-on script, bb-memory, that is monitoring the memory on my Linux clients, and on one system it is in the yellow state at the moment.  A Hobbit alert was generated, it was sent to the appropriate email address, yada yada yada.

When I try to use the ack code from the subject line to acknowledge the alert, it doesn't work.  It doesn't show up in ~/data/acks/acklog or on the web page, and using "!xxxxxx" doesn't have any effect either.  Within the same hour, I had a red alert from a different host (disk usage) and I was able to acknowledge it successfully.  I thought I'd be clever and clean up just enough to make it turn yellow, which it did, and I was able to ack that yellow alert as well.  So it didn't have anything to do with color states.

Using hobbit 4.0, with the following patches from the list:

bbnet-iponly.patch
hobbit-4.0.1-diskhistlog.patch
eventlog-crash.patch
post-4.0-includes.patch

Any ideas on what to check?

Tom
list Tom Georgoulias · Thu, 07 Apr 2005 10:09:09 -0400 ·
quoted from Tom Georgoulias
Tom Georgoulias wrote:
When I try to use the ack code from the subject line to acknowledge the alert, it doesn't work.  Doesn't show up in ~/data/acks/acklog or on the web page, and using "!xxxxxx" doesn't have any effect either.

Any ideas on what to check?
I've been doing more troubleshooting on this, but I still haven't resolved it.

I've created a large test file and filled up the disk partition to 95%, which generates a yellow alert.

Then I used this command to acknowledge the alert:

~/hobbit/server/bin/bb 127.0.0.1 "hobbitdack 158136 10 command line acknowledgment"

Got this entry in my data/acks/acklog file:

1112878495      158136  10      158136  np_filename_not_used radm200p.nandomedia.com.disk    yellow  command line acknowledgment

So that worked.
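
For checking acks in bulk, an acklog entry like the one above can be split into fields. The following is a minimal Python sketch, not part of Hobbit. The field names are assumptions inferred from this single sample line (timestamp, ack code, duration, ack code again, a placeholder, host.test, color, message), and `parse_acklog_line` is a hypothetical helper.

```python
from typing import NamedTuple


class AckLogEntry(NamedTuple):
    timestamp: int   # assumed: epoch seconds when the ack was recorded
    cookie: int      # ack code that was accepted
    duration: int    # assumed: ack duration as given on the command line
    status: str      # "host.test" as it appears in the log
    color: str
    message: str


def parse_acklog_line(line: str) -> AckLogEntry:
    # Fields are whitespace-separated; the message is everything after the color.
    parts = line.split(None, 7)
    if len(parts) < 8:
        raise ValueError(f"unexpected acklog line: {line!r}")
    ts, cookie, duration, _cookie_again, _placeholder, status, color, message = parts
    return AckLogEntry(int(ts), int(cookie), int(duration),
                       status, color, message.strip())


entry = parse_acklog_line(
    "1112878495      158136  10      158136  np_filename_not_used "
    "radm200p.nandomedia.com.disk    yellow  command line acknowledgment"
)
print(entry.cookie, entry.status, entry.color)
```
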

Then I waited 10 mins, got my next page and tried to acknowledge it via the web, with maint.pl.  Didn't work.

Do the cookies have a lifespan or a one-time use policy?

Tom
list Daniel Deighton · Thu, 07 Apr 2005 11:06:03 -0400 ·
I'm seeing a similar problem, however my issue is with a red alert.  In
my case, acks are hit or miss.  Occasionally, it will work, but it often
fails.  The apache logs show that the ack was received, but nothing
shows up in the acklog.  
So far I don't see a pattern.  Any ideas to test this further?

  -Dan

-- 

Daniel Deighton <user-fdcc03e0c730@xymon.invalid>
list Tom Georgoulias · Thu, 07 Apr 2005 11:40:35 -0400 ·
quoted from Daniel Deighton
Daniel Deighton wrote:
I'm seeing a similar problem, however my issue is with a red alert.  In
my case, acks are hit or miss.  Occasionally, it will work, but it often
fails.  The apache logs show that the ack was received, but nothing
shows up in the acklog.  
So far I don't see a pattern.  Any ideas to test this further?
I'm still trying to narrow it down myself.  I know that you have to use the most recent cookie/ack code, so if you get multiple pages, use the last one.

Try using this command if it isn't working using the CGI form from the hobbit webpage:

~/hobbit/server/bin/bb 127.0.0.1 "hobbitdack ACKCODE TIME EXPLANATION MSG"
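
The same command can be assembled from a script. This is a hedged Python sketch that only builds the argument list in the form shown above ("hobbitdack ACKCODE TIME EXPLANATION MSG"). `build_ack_command` is a hypothetical helper, and the default `bb` path and server address are simply the ones used in this thread.

```python
import os
from typing import List


def build_ack_command(cookie: int, minutes: int, message: str,
                      bb_path: str = "~/hobbit/server/bin/bb",
                      server: str = "127.0.0.1") -> List[str]:
    """Return the argv for acknowledging an alert through the bb client."""
    if cookie <= 0 or minutes <= 0:
        raise ValueError("cookie and minutes must be positive")
    # The whole ack message is passed to bb as a single argument.
    payload = f"hobbitdack {cookie} {minutes} {message}"
    return [os.path.expanduser(bb_path), server, payload]


cmd = build_ack_command(158136, 10, "command line acknowledgment")
print(cmd[1:])
# To actually send it (requires a Hobbit server):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with the explanation text.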
list Tom Georgoulias · Thu, 07 Apr 2005 11:55:18 -0400 ·
quoted from Tom Georgoulias
Tom Georgoulias wrote:
Daniel Deighton wrote:
I'm seeing a similar problem, however my issue is with a red alert.  In
my case, acks are hit or miss.  Occasionally, it will work, but it often
fails.  The apache logs show that the ack was received, but nothing
shows up in the acklog.  
So far I don't see a pattern.  Any ideas to test this further?

I'm still trying to narrow it down myself.  I know that you have to use the most recent cookie/ack code, so if you get multiple pages, use the last one.
Seems strange, but it appears that once a previously acknowledged alert's ack expires, an email is sent out again that has the old ack code in the subject.


If you wait until the next alert email, it'll have a new ack code.

If I use it via the webpage, it works again.

Just an observation.

Tom
list Henrik Størner · Thu, 7 Apr 2005 18:05:26 +0200 ·
On Thu, Apr 07, 2005 at 10:09:09AM -0400, Tom Georgoulias wrote:
Do the cookies have a lifespan or a one-time use policy?
Yes, they are only valid for 30 minutes after they've been generated.


Could you try the attached patch? It causes hobbitd to log if it
receives an ack-message that is discarded because the cookie was not
valid.

Also, if you want to check what the current cookie value is, you can
run

   bb 127.0.0.1 "hobbitdboard host=HOSTNAME test=TESTNAME fields=hostname,testname,cookie"

It will respond with

      HOSTNAME|TESTNAME|1029348

The cookie is the third ('|'-separated) field.
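
For scripting, the '|'-separated response can be split into named fields. This is a small Python sketch that works on the response text only. `parse_board_line` is a hypothetical helper, and the field list is assumed to match the `fields=` argument of the query above.

```python
from typing import Dict, List


def parse_board_line(line: str, fields: List[str]) -> Dict[str, str]:
    """Map one hobbitdboard response line onto the requested field names."""
    values = line.strip().split("|")
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    return dict(zip(fields, values))


row = parse_board_line("HOSTNAME|TESTNAME|1029348",
                       ["hostname", "testname", "cookie"])
current_cookie = int(row["cookie"])
print(current_cookie)
```
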


Regards,
Henrik
-------------- next part --------------
--- hobbitd/hobbitd.c	2005/04/03 15:44:07	1.136
+++ hobbitd/hobbitd.c	2005/04/07 15:58:19
@@ -2014,6 +2014,12 @@
 					}
 				}
 			}
+			else {
+				errprintf("Cookie %d not found, dropping ack\n", cookie);
+			}
+		}
+		else {
+			errprintf("Bogus ack message from %s: '%s'\n", sender, msg->buf);
 		}
 
 		MEMUNDEFINE(durstr);
list Tom Georgoulias · Thu, 07 Apr 2005 13:29:00 -0400 ·
quoted from Henrik Størner
Henrik Stoerner wrote:
On Thu, Apr 07, 2005 at 10:09:09AM -0400, Tom Georgoulias wrote:
Do the cookies have a lifespan or a one-time use policy?

Yes, they are only valid for 30 minutes after they've been generated.
Thanks for clarifying that.  I was under the impression that a cookie was valid as long as the alert remained in that state or the ack period was still valid, no matter how long that was.

This seems to explain why I couldn't ack the yellow alert I mentioned at the beginning of this thread.  I'm sure I didn't try to acknowledge the alert until a couple of hours after it first came through, and the alerts aren't sent unless the condition persists for more than 45mins, and resends are every hour.  So the cookie must've gone stale by then.
quoted from Henrik Størner
Could you try the attached patch? It causes hobbitd to log if it
receives an ack-message that is discarded because the cookie was not
valid.
Done.  I'll report back with my findings.
quoted from Henrik Størner
Also, if you want to check what the current cookie value is, you can
run

   bb 127.0.0.1 "hobbitdboard host=HOSTNAME test=TESTNAME fields=hostname,testname,cookie"

Very useful command.  I've added it to my notes.  ;)

Tom
list Tom Georgoulias · Thu, 07 Apr 2005 13:57:32 -0400 ·
quoted from Tom Georgoulias
Tom Georgoulias wrote:
Henrik Stoerner wrote:
Could you try the attached patch? It causes hobbitd to log if it
receives an ack-message that is discarded because the cookie was not
valid.
Patch seems to work.

I have a yellow alert on a system.

Check the cookie:
-bash-2.05b$ ~/hobbit/server/bin/bb 127.0.0.1 "hobbitdboard host=radm200p.nandomedia.com test=disk fields=hostname,testname,cookie"
radm200p.nandomedia.com|disk|406429

Wait a while, then check again:
-bash-2.05b$ ~/hobbit/server/bin/bb 127.0.0.1 "hobbitdboard host=radm200p.nandomedia.com test=disk fields=hostname,testname,cookie"
radm200p.nandomedia.com|disk|712535

Use the old cookie to try and ack the alert, then check hobbitd.log:

bash-2.05b$ tail hobbitd.log
2005-04-06 14:55:48 Setup complete
2005-04-06 15:01:33 Setup complete
2005-04-07 13:21:55 Setup complete
2005-04-07 13:36:53 Cookie 406429 not found, dropping ack

Stale cookie didn't work, event was logged.
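
To find such events without reading the log by hand, the message added by the patch can be matched with a regular expression. This is a minimal Python sketch, assuming the timestamp prefix seen in the `tail` output above. `dropped_acks` is a hypothetical helper, not a Hobbit tool.

```python
import re
from typing import List, Tuple

# Matches lines such as:
#   2005-04-07 13:36:53 Cookie 406429 not found, dropping ack
DROPPED_ACK = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) Cookie (\d+) not found, dropping ack$"
)


def dropped_acks(log_text: str) -> List[Tuple[str, int]]:
    """Return (timestamp, cookie) pairs for every dropped ack in the log text."""
    hits = []
    for line in log_text.splitlines():
        match = DROPPED_ACK.match(line.strip())
        if match:
            hits.append((match.group(1), int(match.group(2))))
    return hits


sample = """\
2005-04-07 13:21:55 Setup complete
2005-04-07 13:36:53 Cookie 406429 not found, dropping ack
"""
print(dropped_acks(sample))
```
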

So now the real issue for me is how to use this piece of info about cookie lifespans when I put Hobbit into production.  I don't want the support folks to have to log into my hobbit server and check for the latest cookie value before acknowledging an alert. I've also got a range of time & repeat delays for my alerts, depending on what system parameter is being measured, and I'd hate to have to use <30 mins across the board.
list Daniel Deighton · Thu, 07 Apr 2005 17:22:26 -0400 ·
Something strange happened on my server.  It seems that a cookie expired
after only 9 minutes (or less).  I've included the pertinent info below.
What would cause this behavior?

 -Dan


Email Notifications Headers
Subject: Hobbit [753059] sundeigh.deightime.net:meta CRITICAL (RED)
Date: Thu,  7 Apr 2005 16:29:54 -0400 (EDT)

notifications.log
Thu Apr  7 15:59:54 2005 sundeigh.deightime.net.meta (1.1.1.1) dan-user-e831702c8a73@xymon.invalid 1112903993 999
Thu Apr  7 16:29:54 2005 sundeigh.deightime.net.meta (1.1.1.1) dan-user-e831702c8a73@xymon.invalid 1112905794 999

hobbitd.log
2005-04-07 16:38:06 Cookie 753059 not found, dropping ack

After the ack failed, I ran the following (thanks for the patch,
Henrik):
./bb 127.0.0.1 "hobbitdboard host=sundeigh.deightime.net test=meta fields=hostname,testname,cookie"
sundeigh.deightime.net|meta|615614

date (run right after the above bb command)
Thu Apr  7 16:42:05 EDT 2005
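
The intervals in the data above can be worked out explicitly. This Python sketch assumes that the email Date header and the hobbitd.log timestamp are in the same time zone, and that the number after the recipient in notifications.log is an epoch timestamp. Both are assumptions, not something stated in the logs.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Alert email carrying cookie 753059, and the moment hobbitd dropped the ack.
alert_sent = datetime.strptime("2005-04-07 16:29:54", FMT)
ack_dropped = datetime.strptime("2005-04-07 16:38:06", FMT)
age_at_ack = (ack_dropped - alert_sent).total_seconds()

# Gap between the two notifications.log entries (assumed epoch seconds).
repeat_gap = 1112905794 - 1112903993

print(f"ack attempted {age_at_ack / 60:.1f} min after the alert email")
print(f"gap between notifications: {repeat_gap} s")
```

Under those assumptions the ack was attempted a little over 8 minutes after the email, and the two notifications are about 30 minutes apart.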

-- 

Daniel Deighton <user-fdcc03e0c730@xymon.invalid>
list Henrik Størner · Fri, 8 Apr 2005 07:40:37 +0200 ·
quoted from Daniel Deighton
On Thu, Apr 07, 2005 at 05:22:26PM -0400, Daniel Deighton wrote:
Something strange happened on my server.  It seems that a cookie expired
after only 9 minutes (or less).  I've included the pertinent info below.
What would cause this behavior?
OK, I think I've found the root cause of this issue, and it is
fundamentally a design flaw in how the cookies are generated.

Currently, a cookie is generated the moment a status changes from
green to yellow/red/purple, and gets a lifetime of 30 minutes. But the
cookie may not be delivered in an alert until some time after,
depending on any DURATION>x settings in the alert config - and by then
the cookie may be close to expiring. Combined with alerts only being
repeated every 30 minutes (by default), you can end up in a situation
where the cookie you get in the alert message will only be valid for a
minute or so.
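
The effect can be illustrated with a deliberately simplified model, sketched in Python: the cookie is created when the status changes, lives for a fixed time, and the alert carrying it is delivered some minutes later. `remaining_validity` is a hypothetical helper. It ignores repeats and any cookie regeneration, so it is only an illustration of the timing problem, not a description of hobbitd's actual logic.

```python
def remaining_validity(lifetime_min: float, delivery_delay_min: float) -> float:
    """Minutes of validity left when the alert carrying the cookie is delivered."""
    return max(0.0, lifetime_min - delivery_delay_min)


# 30-minute lifetime, with the alert held back by various DURATION-style delays.
for delay in (0, 15, 29, 45):
    left = remaining_validity(30, delay)
    print(f"delivered after {delay:2d} min -> {left:4.1f} min left to ack")
```

With a 29-minute delay only one minute remains, and with a 45-minute delay the cookie has already expired by the time the alert arrives.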

The *real* solution is to change the cookie-generation so it happens
when the alert is sent out. That requires some serious changes to the
code - so I'll postpone that a bit and make that together with the
escalation-alert handling that is planned for 4.1.

So for now, the attached patch just changes the lifetime of a cookie
to 24 hours. That should make it work.


Regards,
Henrik
-------------- next part --------------
--- hobbitd/hobbitd.c	2005/04/03 15:44:07	1.136
+++ hobbitd/hobbitd.c	2005/04/08 05:36:29
@@ -909,7 +909,15 @@
 			} while (find_cookie(newcookie));
 
 			log->cookie = newcookie;
-			log->cookieexpires = log->validtime;
+			/*
+			 * This is fundamentally flawed. The cookie should be generated by
+			 * the alert module, because it may not be sent to the user for
+			 * a long time, depending on the alert configuration.
+			 * That's for 4.1 - for now, we'll just give it a long enough 
+			 * lifetime so that cookies will be valid.
+			 */
+			log->cookieexpires = 86400; /* Valid for 1 day */
 		}
 	}
 	else {
@@ -2014,6 +2022,12 @@
 					}
 				}
 			}
+			else {
+				errprintf("Cookie %d not found, dropping ack\n", cookie);
+			}
+		}
+		else {
+			errprintf("Bogus ack message from %s: '%s'\n", sender, msg->buf);
 		}
 
 		MEMUNDEFINE(durstr);