Xymon Mailing List Archive

strange graph behavior - random machines & graphs

list Gary Baluha
Fri, 30 Nov 2007 15:07:23 -0500
Message-Id: <user-12f7d107dc8f@xymon.invalid>

On Nov 30, 2007 2:14 PM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:
Hmm, now this is interesting.  I have the Hobbit server (Hobbit A, from a
previous post) monitoring my work laptop (mostly so I can test out
client-side external scripts).  I have been taking my laptop home with me
this week, and I noticed that while I'm *at* work, the graphs plot valid
data.  However, from the time I turn my laptop off to bring it home until
I power it on at work the next day, the graphs show the same bogus data
that the other bad graphs are showing.

In other words, the rrd graphs are getting bogus data for a machine that
isn't even reporting to the Hobbit server!  Interesting, isn't it?

I'm definitely on to something with this.  I intentionally stopped the
Hobbit client process on one of the machines that has the bad RRD graphs for
about 20 minutes, and then started it back up.  Once the client reported the
latest data back, the RRD graph had another spike in it!

The other interesting thing is that the hobbitd-rrd --debug logging
(rrd-status.log) does *not* show any abnormal data.  It appears that Hobbit
is handing valid data to "rrdupdate", so the bogus data seems to be
introduced downstream of that.

So it seems these data spikes *do* correspond to something: they correspond
to a lack of data reported back from the clients.  Furthermore, when I do an
rrd dump, I can see the bogus data in the "secondary_value" field:

-----Start of RRD dump-----
<!-- Round Robin Archives -->
        <rra>
                <cf> AVERAGE </cf>
                <pdp_per_row> 1 </pdp_per_row> <!-- 300 seconds -->

                <params>
                <xff> 5.0000000000e-01 </xff>
                </params>
                <cdp_prep>
                        <ds>
                        <primary_value> 2.6110000000e+01 </primary_value>
                        <secondary_value> 5.1776682516e+170</secondary_value>
                        <value> 5.1776682516e+170 </value>
                        <unknown_datapoints> 0 </unknown_datapoints>
                        </ds>
                </cdp_prep>
                <database>
                        <!-- 2007-11-28 15:05:00 EST / 1196280300 -->
<row><v> 5.1776682516e+170 </v></row>
-----SNIP-----
-----End of RRD dump-----

The number 5.1776682516e+170 corresponds to the "517768..." large number
that the GPRINT portion of the rrd graphs is displaying.
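
For anyone who wants to check their own RRD files for the same symptom, here
is a rough sketch (not anything that ships with Hobbit or rrdtool) that scans
the text of an rrd dump for implausibly large values.  The function name
find_suspect_values and the 1e15 cutoff are my own assumptions; pick a limit
that makes sense for the data source in question.  It uses a regular
expression rather than an XML parser so that it also works on a snipped,
incomplete dump like the one above.

```python
import math
import re

# Tags that carry numeric samples in the dump shown above.
# A back-reference (\1) makes sure the closing tag matches the opening one.
_VALUE_TAG = re.compile(
    r"<(v|value|primary_value|secondary_value)>\s*([^<\s]+)\s*</\1>"
)


def find_suspect_values(dump_text, limit=1e15):
    """Return (tag, value) pairs whose magnitude exceeds `limit`.

    `limit` is an arbitrary, hypothetical threshold, not an rrdtool setting.
    """
    suspects = []
    for tag, raw in _VALUE_TAG.findall(dump_text):
        try:
            value = float(raw)
        except ValueError:
            continue  # not parseable as a number, ignore it
        if math.isnan(value):
            continue  # NaN is how the dump marks an unknown sample
        if math.isinf(value) or abs(value) > limit:
            suspects.append((tag, value))
    return suspects
```

To try it, save the output of "rrdtool dump" for one of the affected files
to a text file (the file name is up to you), read that file into a string,
and pass the string to find_suspect_values().  Each returned pair names the
tag the value came from, which should show whether the bad numbers sit only
in the cdp_prep area or also in the archived rows.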

Anyone have any ideas of what else to turn logging up on?