Xymon Mailing List Archive

rrd logs and graphs

Vernon Everett
Wed, 25 Feb 2015 11:40:58 +0800
Message-Id: <CAGo4kcZq4P+YFYLFK5O=user-fe3dcc75d21f@xymon.invalid>

Hi Jeremy

Added some debug code to my script. Here's an extract.
      DATA=$(cat $TMPFILE.drvperf | awk '{ print $1" : "$2 }') # Current IO latency
      $XYMON $XYMSRV "data $ENAME.e-series-dcuriolat $(echo; echo; echo "$DATA"; echo)"
echo      $XYMON $XYMSRV "data $ENAME.e-series-dcuriolat $(echo; echo; echo "$DATA"; echo)"
      DATA=$(cat $TMPFILE.drvperf | awk '{ print $1" : "$3 }') # Max IO latency
      $XYMON $XYMSRV "data $ENAME.e-series-dmaxiolat $(echo; echo; echo "$DATA"; echo)"
echo      $XYMON $XYMSRV "data $ENAME.e-series-dmaxiolat $(echo; echo; echo "$DATA"; echo)"
      DATA=$(cat $TMPFILE.drvperf | awk '{ print $1" : "$3 }') # Avg IO latency
      $XYMON $XYMSRV "data $ENAME.e-series-davgiolat $(echo; echo; echo "$DATA"; echo)"
echo      $XYMON $XYMSRV "data $ENAME.e-series-davgiolat $(echo; echo; echo "$DATA"; echo)"

And I managed to get a couple of bizarre data files.
e-series-dcuriolat,icmpOutParmProbs.rrd
e-series-dcuriolat,icmpOutRedirects.rrd
e-series-dcuriolat,ipv6InTruncatedPkts.rrd
e-series-dcuriolat,ipv6OutFragFails.rrd
e-series-dcuriolat,UDP_udpInDatagrams.rrd
e-series-dcuriolat,udpInCksumErrs.rrd

And if I grep in my log file for icmp or any of those terms, I come up with
nothing.
So I am guessing it's not coming from the client.
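
One way to spot stray files like the ones listed above is to compare the RRD
file names against the names the script actually reported. This is only a
sketch under assumptions of mine, not something confirmed in this thread: the
function name find_stray_rrds is made up, and it assumes each RRD is named
TEST,DATASET.rrd where DATASET is the first column of the drvperf file,
unchanged.

```shell
# find_stray_rrds RRD_DIR TEST_NAME EXPECTED_FILE   (hypothetical helper)
# Prints every RRD_DIR/TEST_NAME,<dataset>.rrd whose <dataset> does not
# appear as the first column of any line in EXPECTED_FILE.
find_stray_rrds() {
    rrd_dir=$1
    test_name=$2
    expected=$3
    for f in "$rrd_dir/$test_name,"*.rrd; do
        [ -e "$f" ] || continue          # glob matched nothing
        ds=${f##*/}                      # strip the directory
        ds=${ds#"$test_name",}           # strip the "TEST," prefix
        ds=${ds%.rrd}                    # strip the ".rrd" suffix
        # awk exits 0 only if some line has $1 equal to the dataset name
        if ! awk -v ds="$ds" '$1 == ds { found = 1 } END { exit !found }' "$expected"; then
            printf '%s\n' "$f"
        fi
    done
}
```

It could be called as "find_stray_rrds RRD_DIR e-series-dcuriolat
$TMPFILE.drvperf", where RRD_DIR stands for wherever the host's RRD files
live on the server.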

I want to try the snoop, but this client script is running on the server,
as a client script.
It collects data from a bunch of NetApp E-series devices, and sends it to
the server in the normal way.
So you can imagine what the snoop data is going to look like.
But I will give it a go, and see if there is something in it.

As for debugging the rrd tasks, John was right.
Adding --debug to the rrd config causes it to crash.
Then I just get heaps of this.
2015-02-25 11:31:07 Peer not up, flushing message queue
2015-02-25 11:31:07 Peer not up, flushing message queue
2015-02-25 11:31:07 Peer not up, flushing message queue
And the occasional
19073 2015-02-25 11:31:14 2015-02-25 11:31:15 Child process 19073 died: Signal 6

But I think I am reasonably happy that the strange data isn't coming from
the client script.
Martin Flemming is a list member in Germany (I think) who is helping me test
this script.
I will ask him if he's seeing the same issues. If not, I think we can rule
out the script.

Regards
Vernon


On 24 February 2015 at 14:26, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
wrote:
I'm assuming you've checked your debug output from your script to see if
the $TMPFILE.* file contents look OK.

Perhaps run your own instance of "xymond_channel --channel=data" to
capture the messages as they come from xymond to xymond_rrd.  This will
generate a lot of output, so you'll want to use "--filter" and perhaps
"grep" to trim it down.
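
Plain grep only shows the matching lines, which loses the header that says
which host and test a message belongs to. As a rough sketch, the filter below
prints whole messages instead. It assumes that each message in the channel
output starts with a header line and ends with a line containing only "@@";
the function name grep_messages is made up for illustration.

```shell
# grep_messages PATTERN   (hypothetical helper)
# Reads channel output on stdin and prints every complete message, from
# its header line through the terminating "@@", in which at least one
# line matches the awk regular expression PATTERN.
grep_messages() {
    awk -v pat="$1" '
        /^@@$/ {                         # end-of-message marker
            if (hit) printf "%s@@\n", buf
            buf = ""; hit = 0
            next
        }
        {
            buf = buf $0 "\n"            # collect the current message
            if ($0 ~ pat) hit = 1
        }
    '
}
```

Assuming "cat" is acceptable as the worker program, something like
"xymond_channel --channel=data cat | grep_messages icmpOutRedirects" should
then show the full message that carried the unexpected name, header included.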

You could also run snoop/tcpdump at the same time and try to capture the
data message as it arrives at your Xymon server.  If you have lots of Xymon
traffic it might be better to do so on the client side.

The trick is to get a snapshot at the time that the RRD file is created,
without collecting so much data that you run out of disk!  So doing things
like this:

while true; do
  tcpdump -w dump.out -n -c 10000 dst port 1984 and host blabla
  gzip dump.out
  mv dump.out.gz dump.out-`date +%s`
done

This will capture 10,000 packets at a time, then compress and rotate.
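
To keep the rotated captures from filling the disk, old ones can be pruned
inside the same loop. The helper below is only a sketch: the name prune_dumps
is made up, and it assumes the rotated files are named dump.out-<timestamp>
as in the loop above, with timestamps that all have the same number of digits
so that sorting by name also sorts by age.

```shell
# prune_dumps DIR KEEP   (hypothetical helper)
# Deletes all but the KEEP newest files named DIR/dump.out-<timestamp>.
prune_dumps() {
    dir=$1
    keep=$2
    # List the rotated captures by bare file name.
    for f in "$dir"/dump.out-*; do
        if [ -e "$f" ]; then printf '%s\n' "${f##*/}"; fi
    done |
    sort -r |                            # newest timestamp first
    tail -n +"$((keep + 1))" |           # skip the KEEP newest
    while IFS= read -r name; do
        rm -f "$dir/$name"
    done
}
```

Calling "prune_dumps . 20" after the mv step would keep roughly the last 20
captures around, which should be enough to go back and find the one taken
when the odd RRD file appeared.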

You can also run xymond in a host-specific debug mode, by appending
"--dbghost=HOSTNAME".  That will spit out all the traffic into
/tmp/xymond.dbg for analysis.  Again, you might need to periodically rotate
that file and signal xymond to re-open output files (I'm guessing a HUP
signal might do this, or just kill the process and have xymonlaunch restart
it).

The path the data take would be:

[script] -> [xymon client] -> [TCP/1984] -> [xymond] -> [xymond_channel] -> [xymond_rrd] -> [rrd file]

What we want to do is to watch the traffic/messages to determine which of
these components is causing the problem.  My first step would be to try to
isolate whether it's a client or server problem, hence watching the traffic
with tcpdump/snoop.  If the traffic is transmitted over the wire in the
correct form, then I'd look at what xymond gives to xymond_channel.  And so
on.  Once we can identify the process that creates the phantom entity, we
can look for the root cause and then work-arounds/solutions.

J


On 24 February 2015 at 16:46, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
I am getting those sporadic .rrd files in spades. :-(
Sometimes, only a single data point in the file. But enough files, and
your graphs start to look like crap.

Tomorrow I am off to a client where it's happening all the time.
What can I send you to assist with investigating?

I am trying to figure out if it's a bug in Xymon, or a bug in my script.
So far I have no evidence to support it being either.

Regards
Vernon


On 24 February 2015 at 13:14, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
wrote:
On 14 November 2014 at 14:43, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
Am busy trying to investigate a curious problem with rrd graphs, and I
stumbled on something else I don't understand, and was hoping somebody out
there could help.

As part of my investigation, I added --debug to the [rrdstatus] and
[rrddata] entries on the server tasks.cfg
And the logs started showing heaps of the message
2014-11-14 10:41:36 Peer not up, flushing message queue
What is that?
It doesn't look right to me.
It's usually normal.  See Henrik's response to a similar question:

http://lists.xymon.com/archive/2014-April/039461.html

Except every now and then, I get something like
zmem,c2t0d1.rrd
Has anybody seen anything like this?
Yes.  It's puzzling, but rare enough that I haven't had time to
investigate.

J

--
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton