Flushing Stale messages?

9 messages in this thread

list Sean Clark · Fri, 15 Mar 2013 11:21:20 -0400 ·


I have a channel parser that looks at items in the 'stachg' channel.

It looks like it's working for me (it parses and does stuff properly)

However – my log is filling up with this:


2013-03-15 11:08:29 Flushed 4 stale messages for 0.0.0.0:0
2013-03-15 11:08:30 Flushed 4 stale messages for 0.0.0.0:0
2013-03-15 11:08:31 Flushed 3 stale messages for 0.0.0.0:0
2013-03-15 11:08:32 Flushed 6 stale messages for 0.0.0.0:0
2013-03-15 11:08:33 Flushed 2 stale messages for 0.0.0.0:0
2013-03-15 11:08:34 Flushed 2 stale messages for 0.0.0.0:0
2013-03-15 11:08:35 Flushed 3 stale messages for 0.0.0.0:0
2013-03-15 11:08:36 Flushed 3 stale messages for 0.0.0.0:0
2013-03-15 11:08:37 Flushed 4 stale messages for 0.0.0.0:0
2013-03-15 11:08:38 Flushed 4 stale messages for 0.0.0.0:0


Is this telling me my parser cannot handle the channel in a timely manner, so messages are growing "stale" and I am dropping things?


-Sean


This E-mail and any of its attachments may contain Time Warner Cable proprietary information, which is privileged, confidential, or subject to copyright belonging to Time Warner Cable. This E-mail is intended solely for the use of the individual or entity to which it is addressed. If you are not the intended recipient of this E-mail, you are hereby notified that any dissemination, distribution, copying, or action taken in relation to the contents of and attachments to this E-mail is strictly prohibited and may be unlawful. If you have received this E-mail in error, please notify the sender immediately and permanently delete the original and any copy of this E-mail and any printout.

list Sean Clark · Fri, 15 Mar 2013 12:34:38 -0400 ·

I'll answer that myself – yes that means whatever is there can't process the channel fast enough


So, I'll have to go back to my older parser – which is getting this:


Core was generated by `xymond_mysql --pidfile=/var/log/xymon/xymond_history.pid'.
Program terminated with signal 11, Segmentation fault.
#0  0x08049de1 in addnetpeer (peername=0x4f8ca0 "") at xymond_channel.c:140
140 xymond_channel.c: No such file or directory.
in xymond_channel.c
(gdb) where
#0  0x08049de1 in addnetpeer (peername=0x4f8ca0 "") at xymond_channel.c:140
#1  0x00511e9c in ?? ()
#2  0x004f8ca0 in ?? () from /lib/ld-linux.so.2
#3  0x08057190 in stackfgets (buffer=0x80497b0, extraincl=0x2 <Address 0x2 out of bounds>) at stackio.c:434
#4  0x080496c1 in _start ()


Which is getting a null timestamp for some items on stachg channel :/


list Japheth Cleaver · Fri, 15 Mar 2013 17:19:27 -0000 (UTC) ·

Yeah, that generally means your pipe has backed up too much.

"Rate of messages" is a good metric to keep track of (visible at 5m
intervals from the xymond status report). If you're getting 3000 messages
every 300 seconds, that's 0.1s you've got to process each message coming
in on average, but subject to expected spikes and the buffers running
over.
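
To make that arithmetic concrete, here is a minimal shell sketch of the per-message time budget. The function name `per_message_budget` is made up for illustration; its inputs are the message count and the length of the interval in seconds, as read from the xymond status report.

```shell
# per_message_budget MESSAGES SECONDS
# Prints the average number of seconds available to handle one message.
# (Hypothetical helper, not part of Xymon.)
per_message_budget() {
    awk -v msgs="$1" -v secs="$2" 'BEGIN {
        if (msgs <= 0) { print "n/a"; exit }
        printf "%.3f\n", secs / msgs
    }'
}

per_message_budget 3000 300    # prints 0.100
```

This is only an average; bursts of messages need extra headroom on top of it.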


Depending on what you're doing, smoothing out how often you're getting
messages to reduce spikes will help, as will filtering at xymond_channel
if you're only interested in a subset, along with (obviously) trying to
make the message processor more efficient.

Eventually, it could lead to forking off the handling (if you can do it
efficiently and have cores to spare), or using an async queue somewhere.


On the second part, that's interesting... Can you provide a sample msg
with a null?


Regards,

-jc


list Sean Clark · Fri, 15 Mar 2013 13:36:48 -0400 ·

Heh, I'd have to look at the whole stachg channel to find the needle in
the haystack for that.

I get a core dump here once every 2-3 days:

Program terminated with signal 11, Segmentation fault.
#0  main (argc=2, argv=0xbfd1a444) at xymond_mysql.c:371


xymond_mysql.c line 371:
   mysql_escape_string(timestamp,metadata[1],timestampbytes);
timestampbytes is the strlen of timestamp


I am not strong in C, however, so to find that needle I wrote a Perl
version that pipes hist to mysql (that way, it logs exceptions etc.).
However, the Perl version can't handle the rate of messages (between
300-500/sec).

Bleh
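
One way to hunt for that needle without touching the C worker might be to scan a capture of the channel for header lines with an empty timestamp field. The sketch below is only that, a sketch: it assumes each message starts with a header line beginning `@@stachg` whose fields are separated by `|`, and that the timestamp is the second field (which matches `metadata[1]` in the C code above). The function name and the capture file are hypothetical.

```shell
# find_null_timestamps: read captured stachg channel data on stdin and print
# every header line whose second '|'-separated field is empty or missing.
# Assumption: headers start with "@@stachg" and field 2 is the timestamp.
find_null_timestamps() {
    awk -F'|' '/^@@stachg/ && $2 == "" { print "line " NR ": " $0 }'
}

# Example (stachg.capture is a hypothetical file holding raw channel output):
#   find_null_timestamps < stachg.capture
```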


What I STRONGLY need help with is my xymond.chk getting corrupted. Henrik
looked at one a while back and gave me something to look at/fix,
which I did, but it's still getting corrupted (and then any time it
crashes, I lose all states).

Do you know of a good way to parse/manage the chk file to see what it
doesn't like?



list Japheth Cleaver · Fri, 15 Mar 2013 18:41:41 -0000 (UTC) ·

That's odd. If you're on a box with a lot of memory, writing out to a
tmpfs might help. For your worker, I'd suggest just adding a debug line or
two in front of that section.

WRT the checkpoint file, the only real corruption I've seen myself has
occurred when malformed utf-8 packets came in -- I'd accidentally included
gzip output in a script I'd put in my /local directory :/.

You could try modifying the init startup/shutdown script to copy over the
checkpoint file every once in a while, and then point a copy of xymond
over to it in --debug mode and see if it chokes... and if so, how far in.
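
As a rough sketch of the "copy it every once in a while" idea, the function below takes a timestamped copy of a checkpoint file so that a later test instance can be pointed at the copy. The function name, the snapshot directory, and the paths in the example are assumptions, not anything Xymon ships.

```shell
# snapshot_checkpoint SRC DESTDIR
# Copies SRC into DESTDIR under a timestamped name and prints the new path.
# Returns non-zero if SRC does not exist or the copy fails.
# (Hypothetical helper; call it from cron or from the init script.)
snapshot_checkpoint() {
    src=$1
    destdir=$2
    [ -f "$src" ] || { echo "no checkpoint file at $src" >&2; return 1; }
    mkdir -p "$destdir" || return 1
    dest="$destdir/$(basename "$src").$(date +%Y%m%d-%H%M%S)"
    cp -p "$src" "$dest" && echo "$dest"
}

# Example (both paths are placeholders):
#   snapshot_checkpoint /path/to/xymond.chk /path/to/chk-snapshots
```

Keeping several snapshots makes it possible to compare the last copy that loads cleanly with the first one that does not.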

Thinking about it, a --validate flag to xymond might not be too hard to
whip up.


Regards,

-jc


list Sean Clark · Fri, 15 Mar 2013 15:31:14 -0400 ·

Just as a note on Perl vs straight C code:


Using mysql libs & C to insert the stachg channel -- handles about 1200 msgs/5
minutes before it starts flushing, on a dual core machine with 8 GB RAM.
Same hardware using Perl and DBD::mysql -- tops out at about 300.


As to the debug loading of the chk file:

/sw/xymon/server/bin/xymond --listen=127.0.0.1:1985 --debug --checkpoint-file=./xymond.chk.crashed


31911 2013-03-15 15:23:17 Opening file /sw/xymon/server/etc/hosts.cfg
31911 2013-03-15 15:23:19 Opening file /sw/xymon/server/etc/client-local.cfg
2013-03-15 15:23:19 Setting up network listener on 127.0.0.1:1985
2013-03-15 15:23:19 Setting up signal handlers
2013-03-15 15:23:19 Setting up xymond channels
31911 2013-03-15 15:23:19 Setting up status channel (id=1)
31911 2013-03-15 15:23:19 calling ftok('/sw/xymon/server',1)
31911 2013-03-15 15:23:19 ftok() returns: 0x1000047
31911 2013-03-15 15:23:19 shmget() returns: 0xD6800C
2013-03-15 15:23:19 FATAL: xymond sees clientcount 1, should be 0
Check for hanging xymond_channel processes or stale semaphores
2013-03-15 15:23:19 Cannot setup status channel


That is telling me



list Sean Clark · Fri, 15 Mar 2013 15:45:06 -0400 ·

Whoops, sent this before I finished typing.


Is this telling me some xymond_channel isn't exiting properly and it can't
load? It's not telling me much about invalid data for hosts (which is
where Henrik pointed me back in the day).


list Henrik Størner · Tue, 19 Mar 2013 11:02:57 +0100 ·

▸ quoted from Sean Clark

On 15-03-2013 20:31, Clark, Sean wrote:

As to the debug loading of chk file:


31911 2013-03-15 15:23:17 Opening file /sw/xymon/server/etc/hosts.cfg
31911 2013-03-15 15:23:19 Opening file /sw/xymon/server/etc/client-local.cfg
2013-03-15 15:23:19 Setting up network listener on 127.0.0.1:1985
2013-03-15 15:23:19 Setting up signal handlers
2013-03-15 15:23:19 Setting up xymond channels
31911 2013-03-15 15:23:19 Setting up status channel (id=1)
31911 2013-03-15 15:23:19 calling ftok('/sw/xymon/server',1)
31911 2013-03-15 15:23:19 ftok() returns: 0x1000047
31911 2013-03-15 15:23:19 shmget() returns: 0xD6800C
2013-03-15 15:23:19 FATAL: xymond sees clientcount 1, should be 0
Check for hanging xymond_channel processes or stale semaphores
2013-03-15 15:23:19 Cannot setup status channel

This happens when xymond has crashed and is restarting, but either some of the old xymond_channel processes are still running (hanging on to a shared memory segment or a semaphore), or the shared memory segments were not cleaned up after the crash.

You can check with ipcs (as the xymon user) if there are any shared memory segments lying around after all of the xymon tasks have exited.
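
A small sketch of that check, assuming the Linux `ipcs` column layout (key, id, owner, ...). The helper name `ipc_ids_for_owner` is made up for this example; it reads `ipcs` output on stdin, so it can be tried on saved output first.

```shell
# ipc_ids_for_owner OWNER
# Reads "ipcs -m" or "ipcs -s" output on stdin and prints the id (column 2)
# of every entry whose key (column 1) starts with 0x and whose owner
# (column 3) matches OWNER. Assumes the Linux column layout.
ipc_ids_for_owner() {
    awk -v owner="$1" '$1 ~ /^0x/ && $3 == owner { print $2 }'
}

# Example, run after all Xymon tasks have exited:
#   ipcs -m | ipc_ids_for_owner xymon    # leftover shared memory segments
#   ipcs -s | ipc_ids_for_owner xymon    # leftover semaphore arrays
```

If either command prints anything once Xymon is fully stopped, those ids are candidates for `ipcrm`.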

I have a script to clean up everything and restart Xymon - writing new code may on rare occasions mean that xymond crashes :-) - feel free to try this. If you're not on a Linux box, make sure the "ipcs -m" and "ipcs -s" output has the shmid / semid in column 2. If not, adjust the 'awk' command to grab the correct column.


#!/bin/sh
# Stop Xymon, remove leftover SysV IPC objects and the socket file, then restart.

if [ "`id -u`" != "`id -u xymon`" ]
then
        echo "You must be the 'xymon' user to run this."
        exit 1
fi

echo "Stopping Xymon"
~xymon/server/xymon.sh stop
sleep 2
# If the PID file is still there, xymond did not shut down cleanly.
if [ -f /var/run/xymon/xymond.pid ]
then
        echo "Forcing kill of xymon process, PID `cat /var/run/xymon/xymond.pid`"
        kill -9 `cat /var/run/xymon/xymond.pid`
fi

# "ipcs -s" lists semaphores and "ipcs -m" lists shared memory segments.
# Column 2 of the output holds the semid / shmid on Linux.
echo "Cleaning up semaphores"
ipcs -s|grep "^0"|awk '{print $2}'|while read ID; do ipcrm -s "$ID"; done
echo "Cleaning up shared memory segments"
ipcs -m|grep "^0"|awk '{print $2}'|while read ID; do ipcrm -m "$ID"; done
echo "Cleaning up socket files"
rm -f ~xymon/server/tmp/xymond_if

echo "Starting Xymon"
~xymon/server/xymon.sh start

echo "Done"
exit 0


Regards,
Henrik

list Sean Clark · Tue, 19 Mar 2013 18:51:26 -0400 ·

Thanks, I will try it out.


Why does removing the chk file seem to fix it? Does it try to establish
old connections?

