Acknowledge issue continues with xymon 4.3.2

9 messages in this thread

list Sean Clark · Tue, 5 Apr 2011 09:00:53 -0400 ·

I have xymon 4.3.2 installed now

Every 4 days, almost exactly, I start losing the ability to acknowledge some alerts.
As time progresses, it gets worse. At first it's random: some alerts can be acknowledged, some can't.
Then more and more cannot be acknowledged.

It doesn't matter whether they are new alerts or existing alerts that were already acknowledged.

This has a significant impact, and others on the list have said they have the same problem.


All I have so far is that find_cookie in lib/rbt.c is not finding the cookie, despite it being visible in the hobbitdboard output.


2011-04-05 05:23:09 Cookie 115771 not found, dropping ack
2011-04-05 05:23:09 Cookie 54483 not found, dropping ack
2011-04-05 05:23:09 Cookie 47469 not found, dropping ack
2011-04-05 06:38:55 Cookie 86204 not found, dropping ack
2011-04-05 06:41:37 Cookie 86204 not found, dropping ack


This is what my logs start filling up with.
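
The log lines above lend themselves to a quick tally, which can help show whether the same cookies are rejected repeatedly or whether the set keeps growing. The following is a minimal sketch, assuming the log format is exactly as quoted in this message; the function name `dropped_acks` is hypothetical and not part of xymon.

```python
import re
from collections import Counter

# Matches lines such as:
#   2011-04-05 05:23:09 Cookie 115771 not found, dropping ack
DROPPED_ACK = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) Cookie (\d+) not found, dropping ack$"
)


def dropped_acks(lines):
    """Return a Counter mapping cookie number -> number of dropped acks."""
    counts = Counter()
    for line in lines:
        match = DROPPED_ACK.match(line.strip())
        if match:
            counts[int(match.group(2))] += 1
    return counts


# The five log lines quoted in the message above.
sample = """\
2011-04-05 05:23:09 Cookie 115771 not found, dropping ack
2011-04-05 05:23:09 Cookie 54483 not found, dropping ack
2011-04-05 05:23:09 Cookie 47469 not found, dropping ack
2011-04-05 06:38:55 Cookie 86204 not found, dropping ack
2011-04-05 06:41:37 Cookie 86204 not found, dropping ack
""".splitlines()

counts = dropped_acks(sample)
for cookie, n in counts.most_common():
    print(cookie, n)
```

In practice the `lines` argument would be an open log file rather than the inline sample.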


Can anyone on this list point me to at least some starting point to try and solve this? It's seriously impacting my xymon implementation

--


This E-mail and any of its attachments may contain Time Warner Cable proprietary information, which is privileged, confidential, or subject to copyright belonging to Time Warner Cable. This E-mail is intended solely for the use of the individual or entity to which it is addressed. If you are not the intended recipient of this E-mail, you are hereby notified that any dissemination, distribution, copying, or action taken in relation to the contents of and attachments to this E-mail is strictly prohibited and may be unlawful. If you have received this E-mail in error, please notify the sender immediately and permanently delete the original and any copy of this E-mail and any printout.
list Sean Clark · Tue, 5 Apr 2011 10:39:20 -0400 ·

It's definitely some sort of repeatable in-memory data corruption. I've
noticed that if I restart when the problem first occurs, I get this
message while the saved chk file is loaded:

2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17


This matches up with the number of cookies it couldn't find. I am
guessing it's missing the cookies in those records.

The longer I wait to restart (i.e. as the acknowledge problem gets worse),
the more of those messages I get.


If I restart when I am not showing signs of it not finding cookies, I do
not get that message in the xymonlaunch.log - it just works fine and
exactly as I expect


Is there some sort of memory limit that I am hitting? My xymond process
takes up 524 MB of memory right now.

Just looking for any steps to take next
list Sean Clark · Thu, 7 Apr 2011 09:16:01 -0400 ·
I'm not expecting some sort of magic patch to fix this tomorrow, I am just
looking for some direction to take

So far, I haven't even had an acknowledgement that anyone's read this,
other than people who have the same problem as me, whose prescribed
options are "fix the problem you can't ack faster."


I'll list things that I have changed from the stock xymon settings in the
hopes that Henrik or someone else can say "if you change that, you need to
change this or you will most likely have your shm and chk files corrupted"

In xymonserver.cfg

MAXMSG_STATUS="1036118"
MAXMSG_CLIENT="1036118"
MAXMSG_DATA="1036118"
MAXMSG_NOTES="1036118"

MAXLINE="1036118"


In tasks.cfg:

History disabled
Xymongen disabled

[all others are in their 'default' state, i.e. proxy disabled, xymond
enabled]


I have 78 rules in alerts.cfg spread across 8,565 hosts.
I've added 14 graphing items in graphs.cfg


It's compiled for i386 Linux

Previously the binaries were stripped because I installed them via the
spec file from the developer's list.

I built the binaries and did a make install instead so they are no longer
stripped


I do not get a core file for failing to acknowledge. Eventually no events
can be acknowledged at all, and if it gets to that point, the only way to
restart xymon is to remove the .chk files [it seems to tolerate 6-20
corrupted items, but with hundreds it fails to start]


I am just looking for guidance, or something to try - please let me know
list Henrik Størner · Thu, 07 Apr 2011 15:36:42 +0200 ·
On Thu, 7 Apr 2011 09:16:01 -0400, "Clark, Sean" <user-2db5fbcae9a7@xymon.invalid> wrote:
> I'm not expecting some sort of magic patch to fix this tomorrow, I am just
> looking for some direction to take
>
> So far, I haven't even had an acknowledgement that anyone's read this,
> other than people who have the same problem as me, whose prescribed
> options are "fix the problem you can't ack faster."

I've seen your messages, but haven't had a chance to dig into the code to
see where the problem is.

> I'll list things that I have changed from the stock xymon settings in the
> hopes that Henrik or someone else can say "if you change that, you need to
> change this or you will most likely have your shm and chk files corrupted"

I don't see any config changes you've done that would explain this.

> Every 4 days, almost exactly, I start losing the ability to acknowledge
> some alerts.
> As time progresses, it gets worse and worse - at first it's random, some
> can be acknowledged, some can't
> Then, more and more can not be acknowledged
> 2011-04-05 05:23:09 Cookie 115771 not found, dropping ack
> 2011-04-05 05:23:09 Cookie 54483 not found, dropping ack

How do you ack an event? Are you using the "Acknowledge alert" webpage
with the "--no-pin" option (default), or is it ack via email or ... ?


Regards,
Henrik
list Sean Clark · Thu, 7 Apr 2011 09:43:08 -0400 ·
$BB $BBPAGE \"xymondack $NUMBER $DELAY $MESSAGE\"";


From a script


Where $BB is

/sw/libexec/hobbit/client/bin]./bb --version
Hobbit version 4.2.0


$BBPAGE is my xymond display server running 4.2.3

$NUMBER is the cookie, obtained by using that same hobbit client to run
"hobbitdboard fields=hostname,testname,color,acktime,disabletime,cookie,ackmsg,dismsg,lastchange"

$DELAY is typically 120, but settable

$MESSAGE is just text
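
The flow described above (query the board for the cookie, then send `xymondack $NUMBER $DELAY $MESSAGE`) could be sketched roughly as follows. This is only an illustration under assumptions: it assumes the board output is one line per status with the requested fields separated by `|`, and that a status without an active cookie reports a non-positive or empty cookie value. The helper names `parse_board` and `build_ack`, and the sample host line, are hypothetical.

```python
# Field order as requested in the hobbitdboard query quoted above.
FIELDS = [
    "hostname", "testname", "color", "acktime", "disabletime",
    "cookie", "ackmsg", "dismsg", "lastchange",
]


def parse_board(text):
    """Split board output into dicts, skipping lines with the wrong field count.

    Assumes '|' never appears inside a field value.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split("|")
        if len(parts) != len(FIELDS):
            continue
        rows.append(dict(zip(FIELDS, parts)))
    return rows


def build_ack(rows, hostname, testname, delay, message):
    """Return the ack message for one status, or None if it has no usable cookie."""
    for row in rows:
        if row["hostname"] == hostname and row["testname"] == testname:
            cookie = row["cookie"]
            # Assumption: a missing cookie shows up as empty, zero, or negative.
            if not cookie.isdigit() or int(cookie) <= 0:
                return None
            return f"xymondack {cookie} {delay} {message}"
    return None


# Hypothetical board output; real output comes from the hobbitdboard query.
board = (
    "web01.example.com|http|red|0|0|86204|||1301990000\n"
    "web02.example.com|conn|green|0|0|-1|||1301990001\n"
)

rows = parse_board(board)
print(build_ack(rows, "web01.example.com", "http", 120, "Investigating"))
```

The resulting string is what would be passed to the `bb` client as the message argument, as in the script fragment at the top of this message.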


--

Sean Clark
Sr. Engineer, Software
ATG Network Operations & Planning Integrated Regional OSS
<http://www.twcable.com/DepartmentOverview/AdvancedTechnologyGroup/ATG/NOP/OSS/Network.aspx>
user-2db5fbcae9a7@xymon.invalid
devaudio <aim://devaudio>
Office: (XXX) XXX-XXXX cell: (XXX) XXX-XXXX
list Sean Clark · Thu, 7 Apr 2011 09:45:21 -0400 ·
But I can say that using the default webpage method produces the same
"Cookie not found" error messages, so I don't think it is my method of
acknowledging.
list Henrik Størner · Thu, 07 Apr 2011 23:02:51 +0200 ·
On 07-04-2011 15:45, Clark, Sean wrote:
> But I can say, using the webpage default method produces the same error
> messages "Cookie not found" -- so I didn't think it would be my method of
> acknowledging

Ok, that would have been my next question :-)

It is quite possible that it's a bug in the xymond code. I don't know why it hits you so much, but that is kind of irrelevant.

Inside xymond, the cookies are stored in a data structure called a "red-black tree" ("rbtree" for short). This uses some code that I picked up from someone else; it is used in lots of places, e.g. all of the hosts.cfg configuration is also stored in a similar data structure.

However, the cookie handling is special because cookies are frequently deleted (hosts being removed happens much less frequently). I have had some crashes that I could never really explain when hosts were removed, and I really do suspect the particular bit of code that deletes an entry from the rbtree to be buggy. Therefore, it could very well be that there is a real problem here.

I've come up with a version of xymond.c that eliminates the rbtree code for the cookies. It uses a much less efficient way of looking up the cookies - basically, it will scan through all of the status-log entries that xymond has in memory - but since this only happens when a cookie needs to be renewed, or when xymond receives an ack, it should not put too much extra load on your system. It would be very interesting to hear if this patch on top of 4.3.2 solves the issue; if it does, then I surely know that there is a bug in the rbtree "delete node" code.

Regards,
Henrik
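
The linear-scan lookup Henrik describes above can be illustrated with a short sketch. This is not the actual patch to xymond.c (which is C and is not reproduced in this thread); it is a Python illustration of the idea, and the `StatusLog` record, its field names, and the "-1 means no cookie" convention are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StatusLog:
    """Hypothetical stand-in for one in-memory status-log entry."""
    hostname: str
    testname: str
    cookie: int = -1             # assumption: -1 means "no active cookie"
    cookie_expires: float = 0.0  # epoch seconds after which the cookie is stale


def find_cookie(logs: Iterable[StatusLog], cookie: int, now: float) -> Optional[StatusLog]:
    """Scan every status-log entry for a live cookie: O(n), but no separate index.

    Because the cookie lives only on the status entry itself, there is no
    second structure (such as a tree keyed by cookie) that can drift out of
    sync with the status data when cookies are deleted.
    """
    if cookie <= 0:
        return None
    for log in logs:
        if log.cookie == cookie and log.cookie_expires > now:
            return log
    return None


logs = [
    StatusLog("web01.example.com", "http", cookie=86204, cookie_expires=2000.0),
    StatusLog("web02.example.com", "conn"),
    StatusLog("db01.example.com", "disk", cookie=47469, cookie_expires=500.0),
]

hit = find_cookie(logs, 86204, now=1000.0)
print(hit.hostname if hit else "not found")
```

The trade-off is the one stated in the message: each lookup touches every entry, which is acceptable only because lookups happen when a cookie is renewed or an ack arrives, not on every status update.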
list Sean Clark · Fri, 8 Apr 2011 09:14:27 -0400 ·
Thank you, I will install this posthaste.

Hope your surgery goes well, try not to look at bright lights for a while
:-D
list Sean Clark · Mon, 11 Apr 2011 09:32:44 -0400 ·
Just so anyone else following this thread is aware: the diff is for the
trunk version, not the 4.3.2 release, although you could probably adapt it
to 4.3.2 if you were so inclined.