Thanks for your assistance, Jeremy.
Quite a bit to digest.
Everything in the xymon.out file I am collecting looks exactly like I would
expect it to look.
There is a line, prefixed by @@data#<sequence> (or at least I think it's a
sequence number), followed by pipe-separated data that looks like the host
name, time stamp, IP address, an empty field, host name again, the rrd-file
prefix, another blank field and the last field is other.
I then get a blank line, and the data looks like what I am trying to send.
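For checking the collected file in bulk, here is a minimal Python sketch that splits such a header line into the fields described above. The field labels are my own placeholders based on that description, not names from the Xymon documentation, and the pattern tolerates either a "/" or a "|" straight after the sequence number since I am not certain which one appears.

```python
import re

# Hypothetical labels for the eight fields described above; they are
# illustrative only and not taken from the Xymon documentation.
FIELD_LABELS = (
    "host", "timestamp", "sender_ip", "empty_1",
    "host_again", "rrd_prefix", "empty_2", "last",
)

# Assumption: the marker is "@@data#<digits>", optionally followed by
# a "/" or "|" before the pipe-separated fields start.
HEADER_RE = re.compile(r"^@@data#(\d+)[/|]?(.*)$")


def parse_data_header(line):
    """Return a dict of labelled fields, or None if line is not a header."""
    match = HEADER_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    values = match.group(2).split("|")
    record = {"sequence": int(match.group(1))}
    for label, value in zip(FIELD_LABELS, values):
        record[label] = value
    # Anything beyond the eight described fields is kept, not dropped.
    record["extra"] = values[len(FIELD_LABELS):]
    return record


def unexpected_prefixes(lines, expected):
    """Collect rrd-file prefixes seen in headers that are not in expected."""
    seen = set()
    for line in lines:
        record = parse_data_header(line)
        if record is not None and record.get("rrd_prefix") not in expected:
            seen.add(record.get("rrd_prefix"))
    return seen
```

Running unexpected_prefixes() over the lines of /var/tmp/xymon.out with the set of data set names we expect would list any rogue names that reached this reader.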
Looks like we might need to check with JC for more on that GOCLIENT thing.
I just find it odd that it happened about the same time as the corruption.
I haven't seen it again today, and haven't seen any other corruption either.
At this site, we are running Xymon 4.3.12, but I have seen similar
behaviour (although not to such an extent) elsewhere with 4.3.17, and I
think I also saw it with 4.3.18, but I no longer have access to that site.
I am not seeing any lost data points in the other graphs. But that could be
difficult to spot.
Will run a few rrdtool dumps, and look for gaps at that timestamp. Let you
know what I find.
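A minimal sketch of that gap check in Python, assuming the usual "rrdtool dump" layout where each row is preceded by a comment ending in "/ <epoch>" and holds its values in <v> elements. The function name and the time window arguments are mine, purely for illustration.

```python
import re

# Assumption: each data row in the dump looks like
#   <!-- 2015-03-04 08:45:00 UTC / 1425458700 --> <row><v>NaN</v></row>
ROW_RE = re.compile(r"<!--.*?/\s*(\d+)\s*-->\s*<row>(.*?)</row>")
VALUE_RE = re.compile(r"<v>\s*([^<]*?)\s*</v>")


def nan_rows(dump_text, start, end):
    """Return epoch timestamps in [start, end] whose row holds only NaN."""
    gaps = []
    for match in ROW_RE.finditer(dump_text):
        stamp = int(match.group(1))
        values = VALUE_RE.findall(match.group(2))
        in_window = start <= stamp <= end
        if in_window and values and all(v.lower() == "nan" for v in values):
            gaps.append(stamp)
    return gaps
```

The same timestamp can show up more than once, because the dump lists every archive in the file, so a repeated entry is not in itself a sign of trouble.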
As for the --debug option, it caused xymond_rrd to crash and burn, dumping
cores as we go.
It gets ugly.
Earlier in this thread, John Thurston mentioned this behaviour too.
It also creates a red xymond_rrd button on the xymon server, but the
results are not overly helpful.
- Program crashed
  Fatal signal caught!
Don't think it started after an upgrade.
Something I did notice: the problem appears to be limited to data-only
messages, the ones used to display graphs in trends.
I am not seeing this for data when there is a status and data component.
Or at least I haven't seen it yet.
What are the implications of running with "--no-cache"?
I have implemented this by adding "--no-cache" but if it's going to have a
long-term impact, I don't want to leave it that way indefinitely.
Regards
Vernon
On 4 March 2015 at 14:03, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
On 4 March 2015 at 12:40, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
Here's what I ran, with error output.
./xymoncmd xymond_channel --channel=data --filter=e-series cat > /var/tmp/xymon.out
2015-03-04 08:45:22 Using default environment file /opt/local/xymon/server/etc/xymonserver.cfg
2015-03-04 08:45:58 Peer not up, flushing message queue
2015-03-04 09:05:21 Gave up waiting for GOCLIENT to go low.
What is that GOCLIENT thing?
From what I can understand, it's a semaphore shared between xymond and all
of the xymond_channel instances. When there are several channel readers,
they all get sent the message address, and as each one accepts the message,
it decrements GOCLIENT. When GOCLIENT is zero, it means all readers have
received (and probably copied) the message, and the memory can be freed.
Each reader waits until GOCLIENT goes back to zero before waiting for the
next message.
There's a timeout of 1 second that xymond_channel waits for GOCLIENT to go
back to zero. If the time is exceeded in a channel reader, it means
another reader is taking too long to handle a message, and so the first
reader gives up, logs the error you saw, and carries on with the next
message loop. I'm not sure if this is a sign of trouble. Or it might be
normal when you're running your own instance of xymond_channel. Or it
might be a side-effect of the "cat" command blocking when writing to your
output file due to a high message rate and contention on whatever
filesystem has /var/tmp/.
There's a description of how GOCLIENT works in the file new-daemon.txt, in
the source code.
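To make the countdown concrete, here is a toy model in Python of the behaviour as I understand it. It is only a sketch: it uses an ordinary threading.Condition rather than the real shared semaphore, and the class and method names are invented for illustration, not taken from the Xymon source.

```python
import threading


class GoClientModel:
    """Toy stand-in for the GOCLIENT counter (not Xymon's real code)."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def post(self, readers):
        """Sender side: a new message is offered to `readers` readers."""
        with self._cond:
            self._count = readers

    def accept(self):
        """Reader side: take the message and decrement the counter."""
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait_for_zero(self, timeout=1.0):
        """True once all readers have accepted, False if we gave up."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self._count == 0, timeout=timeout
            )


def demo(slow_delay, timeout):
    """One fast reader and one that accepts only after slow_delay seconds."""
    model = GoClientModel()
    model.post(readers=2)
    slow = threading.Timer(slow_delay, model.accept)  # the lagging reader
    slow.start()
    model.accept()  # the fast reader accepts straight away
    went_low = model.wait_for_zero(timeout)
    slow.join()
    return went_low
```

When demo() returns False, the fast reader is in the position of the instance that logged "Gave up waiting for GOCLIENT to go low": it did its own part, but the other reader did not accept within the timeout.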
In the output file, /var/tmp/xymon.out from
./xymoncmd xymond_channel --channel=data --filter=e-series cat > /var/tmp/xymon.out
there is no mention of the subversion or energise stuff either.
Does it have mention of the correct data set names? We can't draw any
conclusions if it's not collecting the data we expect.
Did any of the RRD files skip an update at the time the new rogue files
were created? Do these files match up with entries in xymon.out? Did
anything else interesting happen at the time the rogue entries were created?
If you're seeing correct entries in xymon.out, but not the bogus entries,
then I'm inclined to agree that xymond_rrd is at fault, and is possibly
using memory it's not supposed to. I wonder if running xymond_rrd with
"--no-cache" might have an effect. Obviously, it's better if you can cache
updates to the RRD files, but it might narrow down the region of code
that's responsible.
This is not conclusive. It's possible that when you have two instances of
xymond_channel, only one is corrupting data names, and it just so happened
that it was the one being used by xymond_rrd. Could be that another time
you would see your extra reader getting the bogus entries. That's the
problem with using a second instance for analysis, rather than somehow
getting the analysis happening on the one that writes to the RRD files.
On the other hand, if you ran two instances of xymond_rrd, both on the
same data channel, and if both instances create the bogus RRD files, then
you know that you can probably use the second instance to narrow down the
fault, without impacting the creation of RRD files for real work.
Are you still running xymond_rrd with "--debug"? Did this show anything
interesting when the bogus RRD files were created?
What version of Xymon are you running? Did this start happening after an
upgrade? I wonder if it's a bug with some versions but not others.
J
--
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton