Acknowledge issue continues with xymon 4.3.2
list Sean Clark
I have xymon 4.3.2 installed now.

Every 4 days, almost exactly, I start losing the ability to acknowledge some alerts. As time progresses, it gets worse and worse. At first it's random: some can be acknowledged, some can't. Then more and more cannot be acknowledged. New alerts, existing alerts that were already acknowledged, it doesn't matter.

This is a fairly impacting issue, and others on the list have said they have this same problem.

All I have is that find_cookie in lib/rbt.c is not finding the cookie, despite it being visible in the hobbitdboard output:

2011-04-05 05:23:09 Cookie 115771 not found, dropping ack
2011-04-05 05:23:09 Cookie 54483 not found, dropping ack
2011-04-05 05:23:09 Cookie 47469 not found, dropping ack
2011-04-05 06:38:55 Cookie 86204 not found, dropping ack
2011-04-05 06:41:37 Cookie 86204 not found, dropping ack

This is what my logs start filling up with.

Can anyone on this list point me to at least some starting point to try and solve this? It's seriously impacting my xymon implementation.

--
This E-mail and any of its attachments may contain Time Warner Cable proprietary information, which is privileged, confidential, or subject to copyright belonging to Time Warner Cable. This E-mail is intended solely for the use of the individual or entity to which it is addressed. If you are not the intended recipient of this E-mail, you are hereby notified that any dissemination, distribution, copying, or action taken in relation to the contents of and attachments to this E-mail is strictly prohibited and may be unlawful. If you have received this E-mail in error, please notify the sender immediately and permanently delete the original and any copy of this E-mail and any printout.
list Sean Clark
It's definitely some sort of repeatable "data in memory" corruption. I've noticed that when I restart at the point where the problem first occurs, loading the chk file that was saved produces this message:

2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17

This matches up with the number of cookies it couldn't find. I am guessing it's missing the cookies in those records. There are more and more of those messages depending on how long I wait to restart (i.e. as the acknowledge problem gets worse and worse).

If I restart when I am not showing signs of it not finding cookies, I do not get that message in the xymonlaunch.log; it just works fine and exactly as I expect.

Is there some sort of memory limit that I am hitting? My xymond process takes up 524 MB of memory right now.

Just looking for any steps to take next.
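For readers trying to interpret the "Too few fields in record - found 6, expected 17" message, the check behind it can be pictured roughly as follows. This is only a minimal sketch, not xymond's actual loader: the function names `count_fields` and `record_is_complete` are hypothetical, and the `|` delimiter used in the example call is an assumption about the checkpoint format.

```c
#include <stddef.h>

/* Hypothetical helper (not taken from xymond): count the
 * delimiter-separated fields in one record. An empty string
 * counts as a single empty field. */
size_t count_fields(const char *record, char delim)
{
    size_t n = 1;
    for (const char *p = record; *p != '\0'; p++) {
        if (*p == delim) n++;
    }
    return n;
}

/* Hypothetical helper: returns 1 when the record carries at least
 * `expected` fields, 0 when it is too short to be loaded. */
int record_is_complete(const char *record, char delim, size_t expected)
{
    return count_fields(record, delim) >= expected;
}
```

Under these assumptions, a record such as "a|b|c|d|e|f" has 6 fields, so `record_is_complete(record, '|', 17)` returns 0 and the record would be rejected in the same way as the log lines above describe.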
list Sean Clark
I'm not expecting some sort of magic patch to fix this tomorrow; I am just looking for some direction to take. So far, I haven't even had an acknowledgement that anyone's read this, other than people who have the same problem as me, whose prescribed options are "fix the problem you can't ack faster."

I'll list the things that I have changed from the stock xymon settings in the hopes that Henrik or someone else can say "if you change that, you need to change this or you will most likely have your shm and chk files corrupted".

In xymonserver.cfg:
MAXMSG_STATUS="1036118"
MAXMSG_CLIENT="1036118"
MAXMSG_DATA="1036118"
MAXMSG_NOTES="1036118"
MAXLINE="1036118"

In tasks.cfg:
History disabled
Xymongen disabled
[all others are in their 'default' state, i.e. proxy disabled, xymond enabled]

I have 78 rules in alerts.cfg spread across 8,565 hosts. I've added 14 graphing items in graphs.cfg.

It's compiled for i386 Linux. Previously the binaries were stripped because I installed them via the spec file from the developer's list. I built the binaries and did a make install instead, so they are no longer stripped.

I do not get a core file for failing to acknowledge. Eventually no events can be acknowledged at all, and if it gets to that point, the only way to restart xymon is to remove the .chk files [it seems to tolerate 6-20 corrupted items, but with hundreds it will fail to start].

I am just looking for guidance, or something to try. Please let me know.
list Henrik Størner
On Thu, 7 Apr 2011 09:16:01 -0400, "Clark, Sean" <user-2db5fbcae9a7@xymon.invalid> wrote:
> I'm not expecting some sort of magic patch to fix this tomorrow, I am just looking for some direction to take. So far, I haven't even had an acknowledgement that anyone's read this, other than people who have the same problem as me, whose prescribed options are "fix the problem you can't ack faster."
I've seen your messages, but haven't had a chance to dig into the code to see where the problem is.
> I'll list things that I have changed from the stock xymon settings in the hopes that Henrik or someone else can say "if you change that, you need to change this or you will most likely have your shm and chk files corrupted"
I don't see any config changes you've done that would explain this.
> Every 4 days, almost exactly, I start losing the ability to acknowledge some alerts. As time progresses, it gets worse and worse. At first it's random: some can be acknowledged, some can't. Then more and more cannot be acknowledged.
> 2011-04-05 05:23:09 Cookie 115771 not found, dropping ack
> 2011-04-05 05:23:09 Cookie 54483 not found, dropping ack
How do you ack an event? Are you using the "Acknowledge alert" webpage with the "--no-pin" option (default), or is it ack via email or ...?

Regards,
Henrik
list Sean Clark
$BB $BBPAGE \"xymondack $NUMBER $DELAY $MESSAGE\"";

This is run from a script, where:

$BB is the bb client in /sw/libexec/hobbit/client/bin (./bb --version reports "Hobbit version 4.2.0")
$BBPAGE is my xymond display running 4.2.3
$NUMBER is the cookie, obtained by using that same hobbit client to run "hobbitdboard fields=hostname,testname,color,acktime,disabletime,cookie,ackmsg,dismsg,lastchange"
$DELAY is typically 120, but settable
$MESSAGE is just text

--
Sean Clark
Sr. Engineer, Software
ATG Network Operations & Planning
Integrated Regional OSS <http://www.twcable.com/DepartmentOverview/AdvancedTechnologyGroup/ATG/NOP/OSS/Network.aspx>
user-2db5fbcae9a7@xymon.invalid <mailto:user-2db5fbcae9a7@xymon.invalid>
devaudio <aim://devaudio> <mailto:user-2db5fbcae9a7@xymon.invalid>
Office: (XXX) XXX-XXXX
cell: (XXX) XXX-XXXX
list Sean Clark
But I can say that using the default webpage method produces the same "Cookie not found" error messages, so I didn't think it would be my method of acknowledging.
list Henrik Størner
On 07-04-2011 15:45, Clark, Sean wrote:
> But I can say, using the webpage default method produces the same error messages "Cookie not found" -- so I didn't think it would be my method of acknowledging
Ok, that would have been my next question :-)
It is quite possible that it's a bug in the xymond code. I don't know why it hits you so much, but that is kind of irrelevant.
Inside xymond, the cookies are stored in a data structure called a "red-black tree" ("rbtree" for short). This uses some code that I picked up from someone else. It is used in lots of places, e.g. all of the hosts.cfg configuration is also stored in a similar data structure.
However, the cookie handling is special because cookies are frequently deleted (hosts being removed happens much less frequently). I have had some crashes that I could never really explain when hosts were removed, and I really do suspect that the particular bit of code that deletes an entry from the rbtree is buggy. Therefore, it could very well be that there is a real problem here.
I've come up with a version of xymond.c that eliminates the rbtree code for the cookies. It uses a much less efficient way of looking up the cookies (basically, it will scan through all of the status-log entries that xymond has in memory), but since this only happens when a cookie needs to be renewed, or when xymond receives an ack, it should not put too much extra load on your system. It would be very interesting to hear if this patch on top of 4.3.2 solves the issue; if it does, then I will know for sure that there is a bug in the rbtree "delete node" code.
Regards,
Henrik
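The lookup strategy Henrik describes (scan every in-memory status-log entry instead of consulting the rbtree) could look roughly like the sketch below. This is an illustration only, not the actual patch: the `status_entry` struct, its field names, and the `find_cookie_linear` function are hypothetical stand-ins, and the real xymond structures carry far more state.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for one in-memory status-log
 * entry; the real xymond structure is not reproduced here. */
typedef struct status_entry {
    const char *hostname;
    const char *testname;
    int cookie;                   /* <= 0 means no cookie assigned */
    struct status_entry *next;    /* next entry in the list */
} status_entry;

/* Walk every entry and return the one holding `cookie`, or NULL
 * if no entry has it. Each lookup costs O(n) in the number of
 * status entries, with no separate index to keep in sync. */
status_entry *find_cookie_linear(status_entry *head, int cookie)
{
    if (cookie <= 0) return NULL;   /* never match "no cookie" */
    for (status_entry *walk = head; walk != NULL; walk = walk->next) {
        if (walk->cookie == cookie) return walk;
    }
    return NULL;
}
```

The trade-off in such a design is that there is no secondary index that can drift out of step with the status entries when cookies are deleted; the cost is a full scan per lookup, which the message above argues is acceptable because lookups only happen on cookie renewal and on incoming acks.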
list Sean Clark
Thank you, I will install this posthaste. Hope your surgery goes well; try not to look at bright lights for a while :-D
list Sean Clark
Just so anyone else following this thread is aware, the diff is for the trunk version, not the 4.3.2 release, although you could probably adapt it to 4.3.2 if you were so inclined.